In practice
For years it was the headline benchmark cited in new model announcements. Today it is saturated: frontier models score above 85%, and the field is moving to harder benchmarks like MMLU-Pro and GPQA.