Vision-Language-Action Model
A Vision-Language-Action model (VLA) is a neural network that takes visual observations and natural language instructions as input and directly outputs robot actions, such as end-effector poses or joint commands. It typically extends a vision-language model (VLM) by training on robot trajectory data, so that the model emits actions either as discretized tokens in its output vocabulary (as in RT-2 and OpenVLA) or through a separate action head. Notable examples include RT-2 (Google DeepMind), OpenVLA (Stanford, UC Berkeley, and collaborators), GR-2 (ByteDance), and Helix (Figure AI). The result is a robot that can interpret a command like 'pick up the red cup' by looking at the scene and translating the instruction into physical movements.
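To make the idea of a language model "outputting actions" concrete, here is a minimal sketch of one common scheme: each continuous action dimension is clipped, normalized, and mapped to one of 256 integer bins, which the model can then predict like ordinary tokens. RT-2 and OpenVLA use 256 bins per dimension; the uniform binning, the function names, and the action bounds below are illustrative assumptions, not the exact recipe of either model.

```python
from typing import List, Sequence

N_BINS = 256  # bins per action dimension (256 as in RT-2 / OpenVLA; uniform spacing is an assumption)


def encode_action(action: Sequence[float],
                  low: Sequence[float],
                  high: Sequence[float]) -> List[int]:
    """Map each continuous action dimension to an integer bin in [0, N_BINS - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)        # clip to the assumed action bounds
        frac = (a - lo) / (hi - lo)    # normalize to [0, 1]
        tokens.append(min(int(frac * N_BINS), N_BINS - 1))
    return tokens


def decode_action(tokens: Sequence[int],
                  low: Sequence[float],
                  high: Sequence[float]) -> List[float]:
    """Map bin indices back to continuous values at the bin centers."""
    return [lo + (t + 0.5) / N_BINS * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    # Hypothetical 7-D action: xyz delta, roll/pitch/yaw delta, gripper aperture.
    low, high = [-1.0] * 7, [1.0] * 7
    action = [0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0]
    tokens = encode_action(action, low, high)
    print(tokens)
    print(decode_action(tokens, low, high))
```

Decoding at bin centers bounds the round-trip error by half a bin width per dimension, which is why the choice of bounds matters: tighter bounds give finer resolution but clip larger motions.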
In practice
A developer working with VLAs typically starts from a pretrained checkpoint (e.g., OpenVLA on Hugging Face) and fine-tunes it on teleoperation data collected from their own robot, using either LoRA or full fine-tuning. The model input is an RGB image from the robot's camera together with the text instruction; the image is encoded into visual tokens that are processed alongside the instruction tokens. The output is an action vector (for example, end-effector pose and gripper aperture). For deployment, a framework such as ROS 2 or LeRobot closes the control loop, with inference typically running at about 5-10 Hz.
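The observe, infer, act cycle described above can be sketched as a fixed-rate loop. This is a framework-agnostic illustration, not ROS 2 or LeRobot code: `StubPolicy`, `predict_action`, `get_image`, and `send_action` are hypothetical names standing in for a real checkpoint, camera driver, and robot interface, and the 7-dimensional action layout is an assumption.

```python
import time
from typing import Callable, List


class StubPolicy:
    """Hypothetical stand-in for a VLA checkpoint: always returns a zero action."""

    def predict_action(self, image, instruction: str) -> List[float]:
        # A real model would run a forward pass on (image, instruction) here.
        return [0.0] * 7  # assumed layout: 6-DoF end-effector delta + gripper aperture


def run_control_loop(policy,
                     get_image: Callable[[], object],
                     send_action: Callable[[List[float]], None],
                     instruction: str,
                     hz: float = 5.0,
                     max_steps: int = 50,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Run observe -> infer -> act at a fixed rate; return the number of overrun steps."""
    period = 1.0 / hz
    overruns = 0
    next_tick = clock()
    for _ in range(max_steps):
        image = get_image()                                  # observe
        action = policy.predict_action(image, instruction)   # infer
        send_action(action)                                  # act
        next_tick += period
        remaining = next_tick - clock()
        if remaining > 0:
            sleep(remaining)       # wait out the rest of the control period
        else:
            overruns += 1          # inference took longer than one period
            next_tick = clock()    # resynchronize instead of trying to catch up
    return overruns


if __name__ == "__main__":
    sent = []
    missed = run_control_loop(StubPolicy(), lambda: None, sent.append,
                              "pick up the red cup", hz=10.0, max_steps=5)
    print(f"sent {len(sent)} actions, {missed} overruns")
```

Counting overruns is a cheap way to check whether a given model and GPU can actually sustain the target rate; if the count is high, the usual options are a lower control frequency, a smaller or quantized model, or predicting several actions per inference call.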
Seen in the wild
8 entries mentioning it
- [High] Gemini Robotics: DeepMind brings foundation models into the physical world
- [High] 1X Neo Home: the first humanoid sold to consumers (with caveats)
- [High] Physical Intelligence π0.5: first policy that generalizes to new homes
- [High] Figure Helix: first generalist VLA driving a full-body humanoid
- [High] Physical Intelligence's π0: the first cross-embodiment robotic foundation model
- [High] GR-2: ByteDance pre-trains a robot on 38,000 hours of human internet videos
- [Medium] OpenVLA: the first open-source Vision-Language-Action model for generalist robotics
- [High] RT-2: the robot that reasons with a language model