In practice
It was the standard benchmark for measuring LLM coding ability from 2021 onward. It too is now saturated, with pass@1 scores above 90%, and the community has moved to SWE-bench, which is more realistic because it is based on real repositories.
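For readers unfamiliar with the pass@1 figure quoted above, here is a minimal sketch of the unbiased pass@k estimator described in the Codex paper: for each problem, generate n samples, count the c that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes. The function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are assumptions made for illustration, not the API of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: draw size.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs (hypothetical layout)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two illustrative problems, 10 samples each: 9 and 10 passing samples.
    print(benchmark_pass_at_k([(10, 9), (10, 10)], k=1))
```

With k = 1 the estimator reduces to c / n per problem, so a pass@1 above 90% means that, averaged over problems, more than nine in ten single attempts pass the tests.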
Related terms
Seen in the wild
5 entries mentioning it
- (High) Qwen2.5-Coder-32B: the open source model that beats GPT-4o on code
- (Medium) Code Llama 70B: Meta brings the Llama 2 code branch to GPT-3.5 level
- (Medium) WizardCoder: evolutionary instructions for GPT-4 level code generation
- (High) Phi-1: 1.3B parameters beating models 10x larger on code
- (High) Codex paper: OpenAI publishes HumanEval and the model behind Copilot