PaLM-E: the first embodied VLM at 562 billion parameters

In one sentence: Google presents PaLM-E, a 562B-parameter multimodal model that feeds images and robot state directly into the transformer and can plan long-horizon tasks on real robots.

PaLM-E is the first model to fuse vision, language, and robot control in a single giant transformer of 562 billion parameters, which at the time made it the largest multimodal model ever built.

The novelty over previous systems is that the robot's physical observations (camera images, limb positions, environment state) enter the transformer sequence directly, as if they were text tokens. The model can then reason about questions like "what do I need to do to bring the cup to the seated person?" while seeing the real world.
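The idea of observations entering the sequence "as if they were text tokens" can be illustrated with a toy sketch. This is not PaLM-E's actual code: the placeholder tokens `<img>` and `<state>`, the embedding width, the feature sizes, and every function name below are assumptions made for illustration, and random numbers stand in for learned weights and encoder outputs. It only shows the general pattern of projecting each observation into the same vector space as the word embeddings and splicing the result into the prompt.

```python
import random

EMBED_DIM = 8  # hypothetical width; real models use thousands of dimensions

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned weights or encoder outputs (assumption, not real data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(features, weights):
    """Linearly map encoder features (n x d_in) into the embedding space (n x EMBED_DIM)."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weights]
            for feat in features]

def build_sequence(prompt, text_table, observations):
    """Swap each placeholder token for its projected observation vectors."""
    sequence = []
    for token in prompt:
        if token in observations:
            features, weights = observations[token]
            sequence.extend(project(features, weights))
        else:
            sequence.append(text_table[token])
    return sequence

# Toy vocabulary: one embedding vector per word.
words = ["bring", "the", "cup"]
text_table = {w: random_matrix(1, EMBED_DIM)[0] for w in words}

# Hypothetical observations: 4 image patches (16-dim) and 1 robot state vector (7-dim),
# each paired with its own projection matrix into the embedding space.
observations = {
    "<img>": (random_matrix(4, 16), random_matrix(EMBED_DIM, 16)),
    "<state>": (random_matrix(1, 7), random_matrix(EMBED_DIM, 7)),
}

prompt = ["<img>", "<state>", "bring", "the", "cup"]
sequence = build_sequence(prompt, text_table, observations)

# 4 patches + 1 state + 3 words = 8 vectors, each of width EMBED_DIM
print(len(sequence), len(sequence[0]))
```

Once every input is a vector of the same width, the transformer treats the sequence uniformly, so a camera image, a robot state reading, and the words of an instruction can all sit in one prompt.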

In tests, the model plans long sequences of actions for mobile robots in real environments, without each task having to be redefined from scratch.