OLMo: the first truly open model — weights, data, code, and checkpoints

In one sentence: AllenAI releases OLMo with weights, the full Dolma dataset (3T tokens), training code, and all intermediate checkpoints, making the entire LLM training process scientifically reproducible for the first time.

"Open source" has become a loosely used term in the AI world. Meta's Llama is "open" in the sense that you can download the final model, but you don't know exactly what data it was trained on, you can't reproduce the training, and you can't see the intermediate steps.

AllenAI did something different with OLMo: they published everything. The final model, yes, but also the entire training dataset (Dolma, 3 trillion tokens), the source code to reproduce training from scratch, and hundreds of intermediate checkpoints showing how the model changes during training.

This matters because science requires reproducibility. If you can't repeat an experiment, you can't truly verify its claims. OLMo is the first LLM on which an external researcher can carry out this kind of rigorous, end-to-end analysis.