GR-2: ByteDance pre-trains a robot on 38,000 hours of human internet videos
In one sentence: ByteDance presents GR-2, a generalist robot model that is pre-trained on 38,000 hours of internet videos of human activity before it is trained on robot data. It achieves an 88.9% success rate across 100 tasks, best in class at release, and shows that internet video can serve as scalable robot training data.
One of the main bottlenecks in robotics is that collecting robot data is slow and expensive: every demonstration requires a physical robot, an operator, and time. ByteDance's shortcut is to use YouTube and other internet videos in which people manipulate objects with their hands.
GR-2 is first trained on 38,000 hours of human activity videos (cooking, DIY projects, crafts, anything showing hands manipulating objects) and only then on real robot data. Pre-training on human video teaches the model the basic physics of objects: how things behave when they are grasped, moved, or poured.
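The "pre-train on plentiful data, then fine-tune on scarce data" ordering can be illustrated with a toy example. This is only a sketch of the general idea and not GR-2's architecture, data, or training objective: the linear model, the helper names (`sgd_fit`, `mse`, `make_data`), and all of the numbers are assumptions made for illustration.

```python
import random


def sgd_fit(data, w=0.0, b=0.0, lr=0.05, epochs=50):
    """Fit y ~ w*x + b with plain SGD, starting from the given (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(data, w, b):
    """Mean squared error of the line (w, b) on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(n, slope, intercept):
    """Noise-free samples from y = slope*x + intercept (toy data)."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, slope * x + intercept))
    return points


# Stage 1 stand-in: plentiful data from a related task ("human video").
video_data = make_data(500, slope=2.0, intercept=0.0)
# Stage 2 stand-in: a handful of samples from the target task ("robot data").
robot_data = make_data(5, slope=2.0, intercept=0.3)
robot_eval = make_data(200, slope=2.0, intercept=0.3)

# Pre-train on the large set, then fine-tune briefly on the small set.
pretrained = sgd_fit(video_data, epochs=5)
finetuned = sgd_fit(robot_data, *pretrained, epochs=3)

# Baseline: the same small training budget, but starting from scratch.
scratch = sgd_fit(robot_data, epochs=3)

print("fine-tuned error:", mse(robot_eval, *finetuned))
print("from-scratch error:", mse(robot_eval, *scratch))
```

In this toy setting the fine-tuned model ends up with a lower error on the target task than the model trained from scratch, because most of what it needs (the slope) was already learned from the larger dataset.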
The result is a robot that reaches an 88.9% success rate across 100 different tasks, the best result available at the time of publication. Performance is particularly high on tasks that require understanding how objects interact and what the physical consequences of an action will be.
GR-2 demonstrates that the enormous amount of video available on the internet is useful for more than training language or image generation models: it can also become a source of physical knowledge for robots. This changes how the problem scales: instead of collecting millions of hours of robot data, you can leverage human experience that has already been recorded.
Companies
ByteDance