RT-2: the robot that reasons with a language model

In one sentence: DeepMind's RT-2 merges vision-language pretraining with robot control, transferring semantic reasoning learned from web data to a physical arm so that it can follow instructions that never appeared in its robot training data.



RT-2 is the successor to RT-1 with one fundamental difference: the base model is trained not only on robot data, but also on billions of web images and text. This means the robot "already knows" many things about the world before touching an object.

The practical result is striking: if you ask the robot to "pick up the object used to cut fruit," it picks the right object even though that phrase never appeared in its robot training data. The language model's semantic reasoning transfers to physical control.

It is a bit like taking a model such as GPT and teaching it to move hands: language becomes the bridge between world knowledge and physical action.
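To make the "language as a bridge" idea concrete: the RT-2 paper describes robot actions being written out as short strings of integers, with each action dimension discretized into 256 bins, so the model can emit an action the same way it emits text. The Python sketch below decodes such a string back into a continuous arm command. It is a minimal illustration, not DeepMind's code: the names `ArmAction`, `bin_to_value` and `decode_action`, the numeric ranges, and the reading of the first integer as a stop flag are all assumptions made for this example.

```python
from dataclasses import dataclass

NUM_BINS = 256  # bins per action dimension, as described in the RT-2 paper

# Hypothetical physical ranges, chosen only for illustration.
POS_RANGE = (-0.1, 0.1)   # end-effector displacement per step (metres, assumed)
ROT_RANGE = (-0.5, 0.5)   # end-effector rotation per step (radians, assumed)
GRIP_RANGE = (0.0, 1.0)   # gripper extension (0 = closed, 1 = open, assumed)


@dataclass
class ArmAction:
    terminate: bool
    delta_pos: tuple
    delta_rot: tuple
    gripper: float


def bin_to_value(bin_index: int, low: float, high: float) -> float:
    """Map a bin index in [0, NUM_BINS - 1] linearly onto [low, high]."""
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"bin index out of range: {bin_index}")
    return low + (high - low) * bin_index / (NUM_BINS - 1)


def decode_action(text: str) -> ArmAction:
    """Decode 'terminate dx dy dz droll dpitch dyaw gripper' into an ArmAction."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 integers, got {len(tokens)}")
    # Assumption of this sketch: the first integer is a stop flag (1 = stop).
    terminate = tokens[0] == 1
    delta_pos = tuple(bin_to_value(t, *POS_RANGE) for t in tokens[1:4])
    delta_rot = tuple(bin_to_value(t, *ROT_RANGE) for t in tokens[4:7])
    gripper = bin_to_value(tokens[7], *GRIP_RANGE)
    return ArmAction(terminate, delta_pos, delta_rot, gripper)


if __name__ == "__main__":
    # A made-up model output: keep going, move along +x and -y, open the gripper.
    print(decode_action("0 255 0 128 128 128 128 255"))
```

Because the action is just another string, the same network that answers questions about images can, in principle, be fine-tuned to answer "what should the arm do next?" in this format.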