AImpact
Landmark Robotics · 1 min read

PaLM-E: the first embodied VLM at 562 billion parameters

In one sentence: Google presents PaLM-E, a 562B-parameter multimodal model that feeds images and robot state directly into the transformer and can plan over long horizons on real robots.


PaLM-E is the first model to fuse vision, language, and robot control in a single transformer of 562 billion parameters, which made it the largest multimodal model built at the time of its release.

The novelty over previous systems is that the robot's physical observations (camera images, limb positions, environment state) enter the transformer sequence directly, as if they were text tokens. The model can then reason about questions like "what do I need to do to bring the cup to the seated person?" while seeing the real world.
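To make the idea of observations entering the sequence "as if they were text tokens" concrete, here is a minimal sketch in plain Python. It is not PaLM-E's code: the toy embedding width, the `<img>` and `<state>` placeholders, the random weights, and the helper names `linear_project` and `build_input_sequence` are all assumptions made for illustration. It only shows the general pattern of projecting continuous observations into the same vector space as word embeddings and splicing them into the input sequence.

```python
import random

EMBED_DIM = 8  # toy width chosen for illustration; real models use far larger embeddings


def linear_project(vec, weights):
    """Map a raw observation vector into the embedding space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def build_input_sequence(prompt, text_embed, observations, projections):
    """Build one sequence of embeddings from a mixed text/observation prompt.

    prompt:       list of strings; entries such as "<img>" mark observation slots
    text_embed:   dict mapping each word to its embedding (list of floats)
    observations: dict mapping each placeholder to a raw feature vector
    projections:  dict mapping each placeholder to a weight matrix with EMBED_DIM rows
    """
    sequence = []
    for item in prompt:
        if item in observations:
            # Observation slot: project the raw features into the embedding space.
            sequence.append(linear_project(observations[item], projections[item]))
        else:
            # Ordinary word: look up its embedding.
            sequence.append(text_embed[item])
    return sequence


rng = random.Random(0)


def random_vector(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def random_matrix(rows, cols):
    return [random_vector(cols) for _ in range(rows)]


# Hypothetical prompt mixing words with an image slot and a robot-state slot.
prompt = ["bring", "the", "cup", "<img>", "given", "<state>"]
text_embed = {w: random_vector(EMBED_DIM) for w in prompt if not w.startswith("<")}

# Made-up raw features: 16 image features and 4 joint positions.
observations = {"<img>": random_vector(16), "<state>": random_vector(4)}
projections = {
    "<img>": random_matrix(EMBED_DIM, 16),
    "<state>": random_matrix(EMBED_DIM, 4),
}

sequence = build_input_sequence(prompt, text_embed, observations, projections)
```

After this step every position in `sequence` is a vector of the same width, so a transformer can process camera and state inputs alongside words without a separate pathway for each modality.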

In tests on mobile robots in real environments, the model plans over long sequences of actions without each task having to be defined from scratch.
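One common way to structure long-horizon planning is a closed loop: observe, ask the planner for the next step, execute it, and repeat. The sketch below illustrates that loop under stated assumptions. The rule-based `plan_next_step` is a hypothetical stand-in for the model, the toy `world` dictionary replaces real sensors and actuators, and none of these names come from Google's system.

```python
def run_episode(goal, observe, plan_next_step, execute, max_steps=10):
    """Query the planner one step at a time, re-observing after every action."""
    history = []
    for _ in range(max_steps):
        obs = observe()
        step = plan_next_step(goal, obs, history)
        if step == "done":
            break
        execute(step)
        history.append(step)
    return history


# Toy world state standing in for camera images and robot state.
world = {"holding_cup": False, "at_person": False, "delivered": False}


def observe():
    # Return a copy so the planner cannot modify the world directly.
    return dict(world)


def plan_next_step(goal, obs, history):
    # Hypothetical stand-in for the model: choose the next step from the observation.
    if not obs["holding_cup"]:
        return "pick up the cup"
    if not obs["at_person"]:
        return "go to the seated person"
    if not obs["delivered"]:
        return "hand over the cup"
    return "done"


def execute(step):
    # Stand-in for low-level robot skills: each step updates the toy world.
    if step == "pick up the cup":
        world["holding_cup"] = True
    elif step == "go to the seated person":
        world["at_person"] = True
    elif step == "hand over the cup":
        world["delivered"] = True


history = run_episode("bring the cup to the seated person", observe, plan_next_step, execute)
```

Because the planner sees a fresh observation before every step, the same loop can serve different goals; only the goal text and the planner's output change.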

Companies

Google

Tools

PaLM-E, PaLM, ViT

Tags

Google, PaLM-E, VLM, Embodied AI, Multimodal, Foundation Model, Long-Horizon Planning

Sources