AImpact
Medium AI Security · 1 min read

SWE-bench Verified: OpenAI cleans up the reference benchmark for coding agents

In one sentence: OpenAI releases SWE-bench Verified, a human-curated subset of 500 tasks that fixes ambiguities in the original SWE-bench and becomes the reference benchmark for coding agents.

Verified Official source

SWE-bench is a well-known benchmark, introduced in 2023, that tests how well an AI can resolve real bugs and feature requests in open-source Python projects (Django, sympy, requests, etc.). The AI is given a GitHub issue, and the benchmark checks whether the patch it produces makes the project's tests pass.
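To make "check if it produces the correct patch" concrete, here is a minimal sketch of how a test-based benchmark of this kind might score results. It assumes each task comes with two groups of tests: ones that should start passing once the issue is fixed, and ones that passed before and must keep passing. The names `TaskResult`, `is_resolved`, and `resolved_rate` are hypothetical and are not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the model's patch (hypothetical structure)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests that target the issue: name -> passed?
    pass_to_pass: dict[str, bool]  # tests that already passed before: name -> passed?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if the fix works and nothing else breaks."""
    fixed = all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    # Require at least one issue-targeting test; otherwise there is nothing to verify.
    return bool(result.fail_to_pass) and fixed and no_regressions


def resolved_rate(results: list[TaskResult]) -> float:
    """Share of tasks resolved, the headline score for this kind of benchmark."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for three illustrative tasks.
results = [
    TaskResult("demo-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("demo-2", {"test_fix": True}, {"test_old": False}),  # regression
    TaskResult("demo-3", {"test_fix": False}, {"test_old": True}),  # bug not fixed
]
print(f"{resolved_rate(results):.2f}")  # prints 0.33
```

Because the score depends entirely on the tests, a task whose tests are broken or unrelated to the issue can mark a correct patch as a failure.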

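But many of the original tasks were ambiguous: issue descriptions were incomplete, tests failed for reasons unrelated to the fix, or the problem required context that was not provided. Scores therefore depended partly on how a model interpreted an underspecified task, not only on its coding ability.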

OpenAI pays human engineers to review the tasks and curate a set of 500 without these problems. The result is "SWE-bench Verified," a subset whose scores are easier to trust. It quickly becomes the standard benchmark for coding agents.

Companies

OpenAI, Princeton

Tools

SWE-bench Verified

Tags

OpenAI, SWE-bench, Evaluation, Benchmark, Coding
