StarCoder2: 619 languages, 4T tokens, and next-level data governance

In one sentence: BigCode releases StarCoder2 in three sizes (3B, 7B, and 15B), trained on 3.3 to 4.3 trillion tokens drawn from The Stack v2, which covers 619 programming languages, with the most transparent data governance yet seen for a coding model.

StarCoder2 is the successor to StarCoder and improves on it across the board. The original covered 86 programming languages; the data behind StarCoder2 covers 619, which takes in practically every language that has seen significant use.

The training dataset, The Stack v2, is about four times larger than the dataset used for the original StarCoder, and the models were trained on 3.3 to 4.3 trillion tokens depending on size. To give a sense of scale: that is far more text than any person could read in many lifetimes.

For many developers, the most important change is how the data was managed. BigCode worked with Software Heritage, the world's source code archive, so that the code in the training data has traceable provenance. The opt-out process, which lets developers ask for their code to be excluded, was also improved.

On performance, the 15B-parameter model matches or outperforms CodeLlama-34B, a model more than twice its size. That makes it comparatively cheap to host on commodity hardware and opens practical options for enterprise deployment without depending on cloud APIs.
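To make the hosting claim concrete, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would need at common precisions. It assumes the nominal parameter counts implied by the model names (3B, 7B, 15B; the real counts differ slightly) and the usual bytes-per-parameter figures for each precision. It ignores activations, the KV cache, and runtime overhead, so real usage will be higher. The function and variable names are illustrative, not part of any BigCode tooling.

```python
# Rough estimate of weight memory only; activations, KV cache, and
# framework overhead are NOT included, so treat results as a lower bound.

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal parameter counts taken from the model names (an assumption;
# the exact counts are somewhat different).
MODEL_SIZES = {"3B": 3e9, "7B": 7e9, "15B": 15e9}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def memory_table() -> dict:
    """Build a {model: {dtype: GiB}} table, rounded to one decimal."""
    return {
        name: {
            dtype: round(weight_memory_gib(count, dtype), 1)
            for dtype in BYTES_PER_PARAM
        }
        for name, count in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, row in memory_table().items():
        cells = ", ".join(f"{dtype}: {gib} GiB" for dtype, gib in row.items())
        print(f"StarCoder2-{name}: {cells}")
```

Under these assumptions the 15B model needs about 28 GiB for its weights in bf16 and about 7 GiB at 4-bit precision, which is why a single high-memory GPU, or a consumer GPU with quantization, is a plausible target.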