Codex paper: OpenAI publishes HumanEval and the model behind Copilot
In one sentence: OpenAI releases "Evaluating Large Language Models Trained on Code", a paper that describes Codex (the model behind GitHub Copilot) and introduces HumanEval, which goes on to become a standard benchmark for code generation.
OpenAI publishes the technical paper on Codex, the model behind GitHub Copilot. It explains how they took GPT-3-family models (up to 12 billion parameters) and fine-tuned them on 159 GB of Python code collected from public GitHub repositories.
The paper also introduces a practical test: HumanEval, a set of 164 hand-written Python programming problems, each with its own unit tests. For each problem, the model generates code, and the result counts as correct only if it passes those automatic tests. HumanEval becomes a standard benchmark for measuring "can this model code?", reported for most later models (Llama, Claude, GPT-4, DeepSeek...).
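To make the "generate code, then run the tests" idea concrete, here is a minimal sketch of how one problem could be scored. The toy problem, the `passes_tests` helper and the sample completions are invented for illustration. The field names (`prompt`, `test`, `entry_point`) are meant to mirror the published dataset, and `pass_at_k` follows the estimator described in the paper, 1 - C(n-c, k) / C(n, k). A real harness would run the generated code in a sandbox with timeouts, which this sketch does not do.

```python
import math

# Hypothetical toy problem in the HumanEval style: a prompt (signature plus
# docstring), a test program defining check(candidate), and the name of the
# function under test. This problem is NOT taken from the real benchmark.
PROBLEM = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes_tests(problem, completion):
    """Return True if prompt + completion passes the problem's unit tests."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        # WARNING: exec runs model-generated code in-process. This is only
        # acceptable for a toy example; real evaluations isolate this step.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a fail.
        return False


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which passed the tests."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Two made-up completions standing in for model samples: one right, one wrong.
completions = ["    return a + b\n", "    return a - b\n"]
results = [passes_tests(PROBLEM, c) for c in completions]
score = pass_at_k(n=len(results), c=sum(results), k=1)
```

Averaging `pass_at_k` over all problems gives a single benchmark number. Checking behavior through tests, rather than comparing text against a reference solution, means any correct program counts, however it is written.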
With HumanEval, the industry gets a widely shared yardstick for saying "this model is better than that one at code".
Companies
OpenAI
Tools
Codex, HumanEval
Tags
Sources