SWE-bench Verified: OpenAI cleans up the reference benchmark for coding agents
In one sentence: OpenAI releases SWE-bench Verified, a 500-task, human-validated subset of SWE-bench that filters out ambiguous or unfairly tested tasks and becomes the reference benchmark for coding agents.
SWE-bench was a well-known 2023 benchmark for testing how well an AI can resolve real bugs and feature requests in open-source software (Django, sympy, requests, etc.). You give the AI a GitHub issue and the repository, then check whether the patch it produces makes the project's tests pass.
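The pass/fail check described above can be sketched roughly as follows. This is a minimal illustration, not the official harness: the `Task` class, `is_resolved`, and `resolve_rate` are hypothetical names, and the sketch assumes the tests have already been run against the patched repository. The two test groups mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the SWE-bench dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical, simplified view of one benchmark task."""
    issue_text: str
    fail_to_pass: tuple[str, ...]  # tests that fail before the fix and should pass after it
    pass_to_pass: tuple[str, ...]  # tests that already pass and must keep passing


def is_resolved(task: Task, passed_tests: set[str]) -> bool:
    """Return True if the patched repository passes every required test.

    `passed_tests` is assumed to be the set of test names that passed
    after the candidate patch was applied (running them is out of scope here).
    """
    fixes_issue = all(t in passed_tests for t in task.fail_to_pass)
    no_regression = all(t in passed_tests for t in task.pass_to_pass)
    return fixes_issue and no_regression


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Illustrative task with made-up test names.
task = Task(
    issue_text="Example issue text",
    fail_to_pass=("test_bug_is_fixed",),
    pass_to_pass=("test_existing_a", "test_existing_b"),
)
good_patch = is_resolved(task, {"test_bug_is_fixed", "test_existing_a", "test_existing_b"})
regressing_patch = is_resolved(task, {"test_bug_is_fixed", "test_existing_a"})
```

Under these assumptions, a patch that fixes the issue but breaks an existing test does not count as resolved, which is why unfair or unrelated tests can reject an otherwise valid fix.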
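But many original tasks were flawed: underspecified issue descriptions, tests that rejected valid fixes or failed for unrelated reasons, and problems that needed context the model was never given. Scores therefore depended partly on how a model happened to interpret a task, not only on its coding ability.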
OpenAI pays professional software engineers to review tasks by hand and keeps 500 that are clearly specified and fairly tested. The result is "SWE-bench Verified," a subset whose scores are easier to trust. It quickly becomes the standard.
Companies
OpenAI, Princeton
Tools
SWE-bench Verified
Tags
Sources