Google outlines new methods for training robots with video and large language models


Google DeepMind’s robotics researchers are exploring how generative AI and large foundation models can give robots a better understanding of what humans want from them. Robots have traditionally been trained for a single task, but the newly announced AutoRT system harnesses large foundation models to expand their capabilities. AutoRT uses a Visual Language Model (VLM) for situational awareness and can manage a fleet of camera-equipped robots, while a large language model suggests tasks each robot could accomplish. The system has been tested with up to 20 robots and 52 different devices, collecting more than 77,000 trials.

A second development, RT-Trajectory, trains robots from video by overlaying a sketch of the robot arm’s motion on each training clip. This method has shown double the success rate of previous training methods. RT-Trajectory can also mine existing robot datasets to unlock knowledge and improve robot control policies. Overall, these advancements aim to enable robots to move accurately and efficiently in novel situations.
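To make the AutoRT loop concrete, here is a minimal, purely illustrative sketch of that observe-propose-filter cycle. Every function below is a hypothetical stand-in (a stub), not DeepMind's actual API: `describe_scene` plays the role of the VLM, `propose_tasks` the LLM, and `is_safe_and_feasible` the screening step that rejects unsuitable tasks before a robot attempts them.

```python
# Illustrative AutoRT-style task-proposal loop. All functions are
# hypothetical stubs standing in for the real VLM/LLM components.

def describe_scene(camera_image):
    """Stand-in for the VLM: describe what the robot's camera sees."""
    return "a table with a sponge, a can of soda, and a bag of chips"

def propose_tasks(scene_description, num_tasks=3):
    """Stand-in for the LLM: suggest candidate tasks given the scene."""
    return [
        "pick up the sponge and wipe the table",
        "move the can of soda to the edge of the table",
        "open the bag of chips",
    ][:num_tasks]

def is_safe_and_feasible(task):
    """Stand-in for the screening step that filters proposed tasks
    before execution; the rule list here is a toy example."""
    banned = ("person", "knife", "open the bag")
    return not any(phrase in task for phrase in banned)

def autort_step(camera_image):
    """One cycle: observe -> propose -> filter -> return approved tasks."""
    scene = describe_scene(camera_image)
    candidates = propose_tasks(scene)
    return [t for t in candidates if is_safe_and_feasible(t)]

approved = autort_step(camera_image=None)
print(approved)
```

In a deployed fleet, each robot would run a cycle like this continuously, with approved tasks dispatched for execution and the resulting experience logged as training data.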
