SWE-bench Verified: OpenAI cleans up the reference benchmark for coding agents

In one sentence: OpenAI releases SWE-bench Verified, a 500-task, human-validated subset that filters ambiguous and unfairly tested tasks out of the original SWE-bench and becomes the reference benchmark for coding agents.

SWE-bench, introduced in 2023, is a widely used benchmark for testing how well an AI can solve real bugs and feature requests in open-source Python projects (Django, sympy, requests, etc.). You give the AI a GitHub issue along with the repository, then check its patch by running the project's tests.
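To make the pass/fail check concrete, here is a minimal sketch of the scoring rule, not the official evaluation harness. It assumes each task's test outcomes have already been collected into two groups named after the dataset's FAIL_TO_PASS and PASS_TO_PASS fields: tests the patch is supposed to fix, and tests that must keep passing. The names `TaskResult`, `is_resolved`, and `resolve_rate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after applying the model's patch (hypothetical type)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the fix: test name -> passed?
    pass_to_pass: dict[str, bool]  # tests that already passed: test name -> still passing?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if the patch fixes the target tests
    without breaking any test that was passing before."""
    # Assumption: a task with no target tests cannot be counted as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this rule a patch that fixes the reported bug but breaks an unrelated, previously passing test still counts as a failure, which is why tests that fail for unrelated reasons distort the score.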

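But many of the original tasks were flawed: underspecified issue descriptions, tests that failed for unrelated reasons, problems that required context the model was never given. As a result, a score depended partly on how a model happened to interpret an ambiguous task, not only on its coding ability.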

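OpenAI pays human software engineers to review tasks from the original benchmark and keeps 500 that they judge to be clearly specified and fairly tested. The result is "SWE-bench Verified," a subset whose scores are easier to trust. It quickly becomes the standard benchmark for coding agents.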