Inference Intermediate

HumanEval

/human-eval/

An OpenAI benchmark of 164 Python programming problems scored by running unit tests against the code generated by the model.
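As a rough illustration of scoring by unit tests, the sketch below runs a candidate solution against a test function. It assumes each problem ships a test source that defines `check(candidate)` and names an entry-point function, which is modeled on the dataset's convention. The helper name `passes_tests` and the toy `add` problem are hypothetical. A real harness would also sandbox the code and enforce a timeout, which this sketch does not do.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate passes every assertion in the test source.

    Assumption: test_src defines a function check(candidate).
    Warning: exec runs arbitrary code; only use this on trusted input.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # defines the generated function
        exec(test_src, namespace)        # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False


# Toy problem (hypothetical, not taken from the benchmark).
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(passes_tests(good, tests, "add"))
print(passes_tests(buggy, tests, "add"))
```

A problem counts as solved only when every assertion passes, so partially correct code scores zero for that problem.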


In practice

From 2021 onward, HumanEval was the standard measure of LLM coding ability. It is now saturated, with top models scoring above 90% pass@1 (the share of problems solved by a single generated sample), and the community has moved to SWE-bench, which is more realistic because it is built from real repositories.
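The pass@1 figure is a special case of pass@k, the probability that at least one of k sampled solutions passes the tests. A minimal sketch of the unbiased estimator described in the Codex paper, 1 - C(n-c, k) / C(n, k), follows. The function name and the example counts are illustrative assumptions, not benchmark results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of samples the metric allows (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimate reduces to c / n, the fraction of samples that passed. The benchmark score is the mean of this per-problem value across all 164 problems.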

Related terms

SWE-bench MMLU

Seen in the wild

5 entries mentioning it

January 17, 2025

Qwen2.5-Coder-32B: the open source model that beats GPT-4o on code

High
January 29, 2024

Code Llama 70B: Meta brings the Llama 2 code branch to GPT-3.5 level

Medium
October 11, 2023

WizardCoder: evolutionary instructions for GPT-4 level code generation

Medium
June 8, 2023

Phi-1: 1.3B parameters beating models 10x larger on code

High
July 7, 2021

Codex paper: OpenAI publishes HumanEval and the model behind Copilot

High
