Models · Advanced
Also known as: Vision-Language-Action Model · VLA

Vision-Language-Action Model

A Vision-Language-Action Model (VLA) is a neural network that takes visual observations and natural-language instructions as input and directly outputs robot actions, such as end-effector poses or joint commands. It extends a vision-language model (VLM) by training on robot trajectory data so that the model emits actions, either as discretized action tokens in the language model's vocabulary (as in RT-2 and OpenVLA) or through a dedicated action head. Notable examples include RT-2 (Google DeepMind), OpenVLA (Stanford, UC Berkeley, and collaborators), GR-2 (ByteDance), and Helix (Figure AI). The result is a robot that can interpret a command like 'pick up the red cup' by looking at the scene and translating the instruction into physical movements.
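
One common way to let a language model emit actions, reported for RT-2 and OpenVLA, is to split each action dimension into 256 bins and treat the bin index as a token. The sketch below illustrates that idea with uniform bins. The function names, the fixed `low`/`high` bounds, and the uniform binning are illustrative assumptions, not any model's actual tokenizer.

```python
from typing import List, Sequence

N_BINS = 256  # bins per action dimension (assumption, following the scheme reported for RT-2)


def discretize(action: Sequence[float], low: Sequence[float],
               high: Sequence[float], n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)          # keep the value inside the bounds
        frac = (clipped - lo) / (hi - lo)      # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: Sequence[float],
                 high: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    low, high = [-1.0] * 3, [1.0] * 3
    tokens = discretize([0.0, -1.0, 1.0], low, high)
    print(tokens)
    print(undiscretize(tokens, low, high))
```

The round trip loses at most half a bin width per dimension, which is the precision cost of representing actions as tokens.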

In practice

A developer working with VLAs typically starts from a pretrained checkpoint (e.g., OpenVLA on Hugging Face) and fine-tunes it on teleoperation data collected from their own robot, using either LoRA or full fine-tuning. The model takes an RGB image from the robot's camera together with the text instruction; typically the image is encoded into visual tokens that the language backbone processes alongside the instruction tokens. The output is an action vector, for example an end-effector pose plus a gripper aperture. In deployment, a framework such as ROS 2 or LeRobot closes the control loop, with the model running inference at roughly 5-10 Hz.
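
The closed control loop described above can be sketched in plain Python. This is a minimal illustration, not a ROS 2 or LeRobot implementation: `run_control_loop`, `get_image`, `policy`, and `send_action` are hypothetical names, and the stub policy stands in for a real VLA forward pass.

```python
import time
from typing import Any, Callable, List

Action = List[float]  # assumed layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]


def run_control_loop(get_image: Callable[[], Any],
                     policy: Callable[[Any, str], Action],
                     send_action: Callable[[Action], None],
                     instruction: str,
                     rate_hz: float = 5.0,
                     max_steps: int = 50) -> int:
    """Observe, infer, and act at a fixed rate; return how many steps overran."""
    period = 1.0 / rate_hz
    overruns = 0
    for _ in range(max_steps):
        start = time.monotonic()
        action = policy(get_image(), instruction)  # one model inference per step
        send_action(action)
        elapsed = time.monotonic() - start
        if elapsed > period:
            overruns += 1                  # inference was slower than the target rate
        else:
            time.sleep(period - elapsed)   # wait out the rest of the period
    return overruns


if __name__ == "__main__":
    sent: List[Action] = []
    run_control_loop(
        get_image=lambda: None,                    # stub camera
        policy=lambda image, text: [0.0] * 7,      # stub policy: zero motion
        send_action=sent.append,                   # stub robot interface
        instruction="pick up the red cup",
        rate_hz=10.0,
        max_steps=5,
    )
    print(len(sent))
```

Counting overruns is a simple way to check whether the model's inference latency actually fits the target rate on the deployment hardware.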

Related terms

Multimodal · Fine-tuning · Foundation model
