In practice
SWE-bench measures real-world software-engineering ability (reading a codebase, debugging, making cross-file edits) rather than isolated coding tasks. It has become the reference benchmark for coding agents such as Devin, Claude Code, and OpenAI Codex.
Seen in the wild
7 entries mentioning it:
- [Medium] SWE-bench Verified: OpenAI cleans up the reference benchmark for coding agents
- [Medium] Agentless: less agent complexity, more results on SWE-bench
- [High] DeepSeek-Coder-V2: GPT-4 Turbo coding quality with open weights
- [High] Aider: CLI coding agent with automatic git integration and SOTA benchmark
- [High] SWE-agent: an AI agent that resolves real GitHub issues at 12.5%
- [High] Devin: the first 'autonomous AI engineer' goes viral
- [Landmark] Devin: 13.86% on SWE-bench, the first autonomous AI software engineer