StarCoder: the first serious open coding model with transparent training data

In one sentence: BigCode and Hugging Face release StarCoder, a 15.5B-parameter model trained on 1 trillion tokens from The Stack across 86 programming languages, with an opt-out data governance system.

Imagine a massive library holding the source code of a large share of the world's permissively licensed open source projects. BigCode and Hugging Face used that library to train StarCoder, a 15.5-billion-parameter model capable of writing, completing, and explaining code in 86 programming languages.

The real innovation is not only the model's quality but how it was built. For the first time, developers who had published code on GitHub could ask for it to be excluded from the training data through an opt-out system. All the data used is traceable and documented in The Stack dataset.

Before StarCoder, the most powerful coding models and tools were closed and proprietary: OpenAI's Codex, GitHub's Copilot, Amazon's CodeWhisperer. StarCoder showed that an open model with transparent data could compete with these giants. It became the foundation on which many other open models were built in the following years, paving the way for a new generation of coding tools accessible to everyone.