s1: 1000 examples and a prompt trick to replicate a reasoning model
In one sentence: a Stanford/UW paper fine-tunes Qwen2.5-32B on 1000 curated examples and, with a technique called 'budget forcing', gets it to compete with o1-preview on math. Training cost: <$50.
A Stanford and University of Washington team publishes s1, a reasoning model that matches o1-preview on some math benchmarks — but the point isn't the model: it's how they got it.
They took an existing model (Qwen2.5-32B-Instruct) and 1000 high-quality reasoning examples (s1K, distilled from Gemini Thinking), then ran a 26-minute supervised fine-tune on 16 H100s. Cloud cost: under $50.
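As a rough sanity check on that cost figure, the arithmetic can be sketched as follows. The source gives only the GPU count, the run time, and the "under $50" total; the hourly rate is not stated, so the script derives only the ceiling that the $50 figure implies.

```python
# Back-of-the-envelope check of the training cost quoted above.
# Only the GPU count, duration, and $50 ceiling come from the article;
# the per-GPU-hour rate is derived, not reported.
gpus = 16
minutes = 26
budget_usd = 50.0

gpu_hours = gpus * minutes / 60            # total H100 time consumed
implied_max_rate = budget_usd / gpu_hours  # highest hourly price that stays under budget

print(f"{gpu_hours:.2f} GPU-hours, at most ${implied_max_rate:.2f} per GPU-hour")
```

In other words, the run uses about 7 GPU-hours, so any H100 rental price below roughly $7 per GPU-hour keeps it under $50.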
Plus a trick: "budget forcing". To make the model think longer, they suppress the end-of-thinking token when the model tries to emit it and append the word "Wait" instead. The model then re-checks its work, often corrects itself, and keeps reasoning. The result is evidence that much of the "reasoning" ability is already present in the pretrained model and only needs to be elicited.
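The budget-forcing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: `END_THINK`, `budget_forced_decode`, `toy_model`, and both budget parameters are hypothetical names chosen for illustration, and the "model" is a stand-in function that tries to stop after two steps. The upper cap is added here only so the loop always terminates.

```python
from typing import Callable, List

END_THINK = "<end_think>"  # hypothetical end-of-thinking marker, not the real token
WAIT = "Wait"              # the word injected to keep the model reasoning


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_thinking_tokens: int,
    max_thinking_tokens: int,
) -> List[str]:
    """Decode a thinking trace that lasts at least min_thinking_tokens."""
    thinking: List[str] = []
    while len(thinking) < max_thinking_tokens:
        token = next_token(prompt + thinking)
        if token == END_THINK:
            if len(thinking) >= min_thinking_tokens:
                break  # budget satisfied, let the model stop
            # Too early: suppress the end marker and append "Wait" instead.
            token = WAIT
        thinking.append(token)
    return thinking + [END_THINK]


def toy_model(context: List[str]) -> str:
    """Stand-in model: tries to stop after two steps since the last "Wait"."""
    steps = 0
    for tok in reversed(context):
        if tok == WAIT:
            break
        if tok == "step":
            steps += 1
    return END_THINK if steps >= 2 else "step"


question = ["What", "is", "2+2?"]
unforced = budget_forced_decode(toy_model, question, 0, 20)
forced = budget_forced_decode(toy_model, question, 5, 20)
print(unforced)
print(forced)
```

With a minimum budget of 0 the toy model stops after two steps. With a minimum of 5 its first attempt to stop is replaced by "Wait", and it produces two more steps before it is allowed to finish.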
Companies
Stanford, University of Washington
Tools
s1-32B, s1K dataset