AImpact
Landmark Foundation Models · 2 min read

OpenAI o3: the model that beats ARC-AGI and redefines 'reasoning'

In one sentence: OpenAI announces o3 and o3-mini, reporting 71.7% on SWE-bench Verified, 25.2% on FrontierMath, and 87.5% on ARC-AGI (with a high compute budget). A huge jump on hard reasoning. General availability is expected in 2025.

Verified Official source

On the last day of its "12 Days of OpenAI" event in December 2024, OpenAI, through Sam Altman, announces o3, the successor to o1 (the name "o2" is skipped to avoid a trademark conflict with the UK telecom operator O2). This is not a release: it is an announcement with spectacular benchmarks, and early access is limited to safety researchers.

The numbers are striking. ARC-AGI, a benchmark designed to measure "human-like" reasoning on novel problems, was considered one of the hardest tests for AI. o3 hits 87.5% in its high-compute configuration (humans score around 85%). On FrontierMath, a set of advanced math problems that can take a working mathematician hours, o3 solves 25.2% (previous models: under 2%).

This comes at a high price: in the high-compute configuration, a single ARC-AGI task can cost thousands of dollars in compute, because the model "thinks" for a long time before answering. Still, the result signals that scaling reasoning at inference time works, and that "AGI-like" benchmarks are starting to fall one after another.

Companies

OpenAI

Tools

o3, o3-mini

Tags

OpenAI, o3, Reasoning, ARC-AGI, Frontier

Sources