AImpact
Inference Intermediate

HumanEval

/human-eval/

An OpenAI benchmark of 164 Python programming problems scored by running unit tests against the code generated by the model.
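The scoring step can be sketched as follows. This is a minimal illustration, not the official harness: the helper name `passes_tests` and the toy `add` problem are hypothetical, and it assumes each problem ships a test snippet that defines `check(candidate)` plus the name of the function to test. A real harness would also run the code in a sandbox with a timeout, which this sketch omits.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated code passes every assertion in the tests.

    Hypothetical helper for illustration only: it executes untrusted code
    in-process, with no sandbox and no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the generated function
        exec(test_src, namespace)        # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a failure.
        return False


# Toy problem (not taken from the benchmark).
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

good_result = passes_tests(good, tests, "add")
bad_result = passes_tests(bad, tests, "add")
```

A problem counts as solved only if every assertion passes, so partially correct code scores the same as code that does not run at all.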


In practice

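From its release in 2021, it was the standard benchmark for measuring LLM coding ability. It is now saturated, with leading models scoring over 90% pass@1 (the share of problems solved on the first attempt), and the community has moved to SWE-bench, which is more realistic because it is based on real repositories.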
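For reference, pass@k is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The sketch below assumes that formula; the function name `pass_at_k` and the sample counts in the usage lines are illustrative, not taken from any reported result.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: (n, c) per problem. The benchmark score is the mean
# of the per-problem estimates.
per_problem = [(10, 3), (10, 0), (10, 10)]
score_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why pass@1 is often read as a first-attempt success rate.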
