OpenAI o3: the model that beats ARC-AGI and redefines 'reasoning'

In one sentence: OpenAI announces o3 and o3-mini, with o3 scoring 71.7% on SWE-bench Verified, 25.2% on FrontierMath, and 87.5% on ARC-AGI (with a high compute budget). It is a huge jump on hard reasoning. General availability is expected in 2025.

On the last day of its "12 Days of OpenAI" event in December 2024, OpenAI announces o3, the successor to o1, with Sam Altman presenting (the company skips "o2" to avoid trademark issues with a UK telco). This is not a release: it is an announcement with spectacular benchmarks, and access is limited to safety researchers.

The numbers are striking. ARC-AGI, a benchmark designed to measure "human-like" reasoning on novel problems, was considered one of the hardest tests for AI. o3 hits 87.5% in its high-compute configuration (humans score around 85%). On FrontierMath, a set of advanced math problems that can take a working mathematician hours, o3 solves 25.2% (previous models: under 2%).
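To put those figures in proportion, here is a minimal arithmetic sketch using only the scores quoted in this article. The variable and function names are illustrative, and the prior FrontierMath score is assumed to be 2%; if earlier models actually scored lower, the multiplier below is a lower bound.

```python
# Scores quoted in the article, in percent.
O3_ARC_AGI = 87.5          # o3, high-compute configuration
HUMAN_ARC_AGI = 85.0       # approximate human baseline cited in the article
O3_FRONTIERMATH = 25.2
PRIOR_FRONTIERMATH = 2.0   # assumption: prior models taken at 2%


def point_gap(score: float, baseline: float) -> float:
    """Difference between two scores, in percentage points."""
    return round(score - baseline, 1)


def improvement_factor(score: float, baseline: float) -> float:
    """How many times larger `score` is than `baseline`."""
    return round(score / baseline, 1)


arc_gap = point_gap(O3_ARC_AGI, HUMAN_ARC_AGI)
fm_factor = improvement_factor(O3_FRONTIERMATH, PRIOR_FRONTIERMATH)

print(f"ARC-AGI: o3 is {arc_gap} points above the human baseline")
print(f"FrontierMath: o3 solves about {fm_factor}x as many problems as prior models")
```

The ARC-AGI margin over the human baseline is small in absolute terms, while the FrontierMath change is an order-of-magnitude jump from a very low starting point.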

It costs a lot: solving a single ARC-AGI task in the high-compute setting can cost thousands of dollars in compute, because the model "thinks" for a long time. But it signals that scaling reasoning at inference time works, and that "AGI-like" benchmarks are starting to fall one after another.