AImpact
High Robotics · 1 min read

GR-2: ByteDance pre-trains a robot on 38,000 hours of human internet videos

In one sentence: ByteDance presents GR-2, a generalist robot that is pre-trained on 38,000 hours of human activity videos from the internet before it sees any robot data. It achieves an 88.9% success rate across 100 tasks, best-in-class at release, showing that internet videos can serve as scalable robot training data.

Needs review · Reputable source

One of robotics' biggest problems is that collecting robot data is slow and expensive: every demonstration requires a physical robot, an operator, and time. ByteDance found a clever shortcut: use YouTube and other internet videos in which humans manipulate objects with their hands.

GR-2 is first trained on 38,000 hours of human activity videos (cooking, DIY projects, crafts, anything showing hands manipulating objects) and only then on real robot data. Pretraining on human video gives the model a grounding in basic object physics: how things behave when they are grasped, moved, or poured.
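To make the two-stage idea concrete, here is a deliberately tiny, one-dimensional toy. It is not GR-2's architecture or training code, which this article does not describe. Stage one fits a "dynamics" model that predicts the next frame from the current one, using a large pool of action-free, video-like pairs. Stage two freezes that model and fits only a small action head on a handful of hypothetical "robot" samples. The dynamics function, the action gain, and all names and numbers are illustrative assumptions.

```python
"""Toy two-stage training: pretrain on plentiful 'video' pairs, then
fine-tune a small action head on a few 'robot' samples.
Everything here is hypothetical and 1-D for readability."""


def true_next_frame(x: float) -> float:
    # Hypothetical 'physics' shared by human videos and robot episodes.
    return 0.9 * x + 0.5


def fit_line(xs, ys):
    # Closed-form ordinary least squares for y ≈ a * x + b.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Stage 1: pretraining on abundant, action-free (frame, next frame) pairs.
video_frames = [i / 100 for i in range(10_000)]
a, b = fit_line(video_frames, [true_next_frame(x) for x in video_frames])


def predicted_motion(x: float) -> float:
    # Frozen pretrained feature: how much the scene is expected to change.
    return (a * x + b) - x


# Stage 2: fine-tuning on only three robot samples (state, action).
# Hypothetical assumption: the robot's action is a scaled version of the motion.
robot_states = [0.0, 1.0, 2.0]
robot_actions = [2.0 * (true_next_frame(s) - s) for s in robot_states]

# Fit only the action-head gain c (least squares through the origin);
# the pretrained parameters a and b stay frozen.
feats = [predicted_motion(s) for s in robot_states]
c = sum(f * y for f, y in zip(feats, robot_actions)) / sum(f * f for f in feats)

print(f"pretrained dynamics: a={a:.3f}, b={b:.3f}; action gain c={c:.3f}")
# → pretrained dynamics: a=0.900, b=0.500; action gain c=2.000
```

The point of the toy is the division of labor: the expensive-to-learn part (how the scene evolves) comes from cheap, plentiful data, so the scarce robot data only has to pin down one extra parameter.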

The result is a robot with an 88.9% success rate across 100 different tasks, the best result available at the time of publication. Performance is particularly high on tasks that require understanding how objects interact with each other and what physical consequences an action will have.

GR-2 demonstrates that the enormous amount of video available on the internet is useful not only for training language or image generation models: it can also become a source of physical knowledge for robots. This fundamentally changes the scalability of the problem. Instead of collecting millions of hours of robot data, you can leverage human experience that has already been recorded.

Companies

ByteDance

Tags

GR-2 · ByteDance · video pretraining · generalist robot · manipulation · internet video · VLA

Sources