Codex paper: OpenAI publishes HumanEval and the model behind Copilot

In one sentence: OpenAI releases "Evaluating Large Language Models Trained on Code", the paper describing Codex (the model powering GitHub Copilot), and introduces HumanEval, which becomes the standard benchmark for code generation.



OpenAI publishes the technical paper on Codex, the model powering GitHub Copilot. It explains how the team took GPT-3 and fine-tuned it on 159 GB of Python code scraped from public GitHub repositories.

The paper also introduces a practical test: HumanEval, a set of 164 hand-written programming problems, each with a reference solution and unit tests. For each problem, the model generates code and an automated harness checks whether that code passes the tests. It becomes the standard benchmark for answering "can this model write code?", and results on it are reported for most later models (Llama, Claude, GPT-4, DeepSeek...).
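To make the "generate code, then run the tests" loop concrete, here is a minimal sketch in Python. It mirrors the general shape of a HumanEval-style problem (a function signature with docstring as the prompt, a model completion, and a `check(candidate)` test function), but the `add` problem, the two completions, and the `passes_tests` helper are hypothetical illustrations, not taken from the dataset or from OpenAI's harness. A real harness runs candidates in an isolated sandbox with timeouts; this sketch skips that.

```python
def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests."""
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted code. A real harness isolates this step.
        exec(prompt + completion, namespace)  # defines the candidate function
        exec(test, namespace)                 # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failure (syntax error, runtime error, failed assertion) counts as a miss.
        return False


# Hypothetical toy problem, not one of the 164 HumanEval problems.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TEST = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

# Two made-up model completions: one correct, one buggy.
good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

results = [
    passes_tests(PROMPT, completion, TEST, "add")
    for completion in (good_completion, bad_completion)
]
print(results)  # [True, False]
```

Scoring a model then amounts to counting, over all problems, how often a generated completion passes its tests.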

For the first time, the industry has a shared yardstick for saying "this model is better than that one at code".