Robotics startup 1X Technologies has unveiled a generative model aimed at making it far more efficient to train robotic systems in simulation. Announced in a recent company blog post, the model tackles a central challenge in robotics: accurately predicting how the world responds to a robot's actions, which is crucial for bridging the gap between simulated training and real-world deployment.
Traditionally, the high costs and risks of training robots in physical settings have led engineers to rely on simulated environments. These simulations, however, often fall short due to discrepancies known as the “sim2real gap,” where differences between the simulated and actual worlds result in less effective training outcomes.
“Roboticists typically hand-author scenes that are a ‘digital twin’ of the real world and use rigid body simulators like MuJoCo, Bullet, and Isaac to simulate their dynamics,” explained Eric Jang, VP of AI at 1X Technologies, in an interview with VentureBeat. “However, these digital twins may have physics and geometric inaccuracies, leading to discrepancies when transitioning from training to real-world deployment. For instance, an online model of a door is unlikely to replicate the exact spring stiffness of the physical door tested on the robot.”
To address this, 1X’s new model learns to simulate the real world by training directly on raw sensor data collected from its robots. By analyzing thousands of hours of video and actuator data gathered across various robotic tasks, the model learns to predict how a scene will evolve in response to a robot’s actions.
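To make that concrete, below is a minimal sketch of how an action-conditioned world model can be trained. This is illustrative only, not 1X’s actual code: the architecture, tensor shapes, and names (WorldModel, frames, actions) are assumptions for exposition.

```python
# Illustrative sketch of an action-conditioned world model (NOT 1X's code).
# The model consumes a short history of camera frames plus the robot's
# commanded actions and predicts the next frame; training minimizes
# reconstruction error against the frame that actually followed.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, frame_dim=64 * 64 * 3, action_dim=20, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim + action_dim, hidden), nn.ReLU())
        self.dynamics = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, frame_dim)  # predicts the next frame

    def forward(self, frames, actions):
        # frames:  (batch, time, frame_dim)  flattened pixels
        # actions: (batch, time, action_dim) actuator commands
        x = self.encoder(torch.cat([frames, actions], dim=-1))
        h, _ = self.dynamics(x)
        return self.decoder(h)  # prediction for frame t+1 at each step t

model = WorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a dummy batch standing in for logged robot data.
frames = torch.rand(8, 16, 64 * 64 * 3)   # stand-in for recorded video
actions = torch.rand(8, 16, 20)           # stand-in for actuator commands
opt.zero_grad()
pred = model(frames[:, :-1], actions[:, :-1])
loss = nn.functional.mse_loss(pred, frames[:, 1:])  # match the next frame
loss.backward()
opt.step()
```

The design point mirrors what Jang describes: the “physics” lives in learned weights fitted to logged robot data, rather than in hand-authored simulator parameters.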
Data was primarily collected from EVE humanoid robots performing diverse manipulation tasks in domestic and office settings, interacting with humans. “We collected all the data at our various 1X offices, employing a team of Android Operators to annotate and filter it,” Jang noted. This method aims to ensure that dynamics simulated by the model closely match real-world interactions as the volume of data increases.
The model excels in simulating object interactions, with videos from 1X showcasing successful predictions in scenarios where a robot handles various objects—from rigid items like boxes to deformable objects such as curtains and laundry, as well as articulated objects like doors and drawers. The ability to simulate long-horizon tasks, such as folding shirts, while navigating around obstacles and maintaining a safe distance from people, highlights the model’s sophistication.
One open question is how the generative model keeps pace with changing environments, but Jang argued that its learned nature makes updates straightforward. “The generative model itself might have a sim2real gap if its training data becomes outdated. However, as it is a learned simulator, incorporating fresh real-world data will help correct inaccuracies without the need for manually tuning a physics simulator.”
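In practice, that update path can be as simple as further gradient steps on newly logged data. Continuing the hypothetical WorldModel sketch above (again, an illustration rather than 1X’s pipeline; the checkpoint name and data loader are made up):

```python
# Refreshing a learned simulator: instead of hand-retuning a physics engine,
# fine-tune on newly collected robot logs. Uses the WorldModel class from
# the sketch above; all names here are hypothetical.
import torch
import torch.nn as nn

model = WorldModel()
model.load_state_dict(torch.load("world_model.pt"))  # hypothetical checkpoint

# Stand-in for a dataloader over freshly collected episodes.
new_robot_logs = [(torch.rand(8, 16, 64 * 64 * 3), torch.rand(8, 16, 20))]

opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # lower LR for fine-tuning
for frames, actions in new_robot_logs:
    opt.zero_grad()
    pred = model(frames[:, :-1], actions[:, :-1])
    loss = nn.functional.mse_loss(pred, frames[:, 1:])
    loss.backward()
    opt.step()
torch.save(model.state_dict(), "world_model.pt")
```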
Inspired by recent advances such as OpenAI’s Sora and Runway’s video models, which generate temporally consistent video from text prompts, 1X’s model extends the trend toward generative systems that respond dynamically to actions. Other notable strides include Google’s generative model trained to simulate the game DOOM, underscoring the potential of interactive generative models in robotics and beyond.
Despite its advancements, 1X’s system still faces inherent challenges typical of generative models, occasionally producing unrealistic situations. For example, objects may not fall as expected or might disappear between frames. Ongoing efforts to mitigate these issues involve gathering more data and developing improved models. “We’ve seen dramatic progress in generative video modeling over the last couple of years, and results like OpenAI Sora suggest that scaling data and compute can go quite far,” Jang remarked.
To catalyze further innovation, 1X is releasing its models and weights to the community and plans to launch competitions with monetary prizes to encourage improvements. “We’re actively investigating multiple methods for world modeling and video generation,” Jang concluded.
Source: 1X Technologies