RT-2 is the successor to RT-1, with one fundamental difference: its base model is trained not only on robot data but also on billions of web images and text. As a result, the robot "already knows" a great deal about the world before it touches an object.
The practical result is striking: asked to "pick up the object used to cut fruit," the robot can select the right object even though it never saw that phrase during training. The language model's semantic reasoning transfers to physical control.
It is like taking a large language model such as GPT and teaching it to move hands: language becomes the bridge between world knowledge and physical action.
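One way to picture how language can carry physical actions is to write each robot command as a short string of integers, which a language model can then emit like any other text. The sketch below illustrates that idea only; it is not RT-2's actual code. The function names, the 256-bin resolution, and the normalized range of -1 to 1 are assumptions chosen for the example.

```python
from typing import List, Sequence

# Assumption for this sketch: every action dimension is split into 256 uniform bins.
NUM_BINS = 256


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Turn continuous action values into a space-separated string of bin indices."""
    tokens = []
    for value in action:
        # Clip to the assumed valid range so out-of-range commands map to the edge bins.
        clipped = min(max(value, low), high)
        bin_index = int(round((clipped - low) / (high - low) * (NUM_BINS - 1)))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Turn a string of bin indices back into approximate continuous values."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-D end-effector displacement plus a gripper command.
    command = [0.10, -0.50, 0.90, 1.0]
    as_text = encode_action(command)
    print("action as text:", as_text)
    print("decoded action:", decode_action(as_text))
```

Because the encoded action is plain text, the same model that answers questions about images can, in principle, produce it as output. The cost is a small quantization error, at most half a bin width per dimension.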
Companies
DeepMind, Google
Tools
RT-2, PaLI-X, PaLM-E
Tags
DeepMind, RT-2, VLA, Vision-Language-Action, Embodied AI, Robotics Transformer
Sources