Inference | Intermediate | Also known as: Software Engineering Bench

SWE-bench

/swee-bench/

A benchmark of more than 2,000 real GitHub issues drawn from open-source Python repositories. Given the repository and the issue text, the model must produce a patch that makes the project's tests pass.


In practice

It measures real software-engineering ability (reading a codebase, debugging, making cross-file edits) rather than the ability to solve isolated coding puzzles. It has become the reference benchmark for coding agents such as Devin, Claude Code, and OpenAI Codex.
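The pass/fail criterion in the definition can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness, which applies each patch to the repository and runs the real test suite. The names `TaskResult`, `is_resolved`, and `resolved_rate` are hypothetical, and so is the demo data; the split into "fail to pass" and "pass to pass" tests mirrors the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the dataset.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the model's patch (hypothetical type)."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every tracked test passes."""
    # Require at least one target test so an empty result is never "resolved".
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, the headline figure quoted for agents."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for four tasks, for illustration only.
results = [
    TaskResult("demo-1", {"test_bug": True}, {"test_old": True}),   # fixed, nothing broken
    TaskResult("demo-2", {"test_bug": True}, {"test_old": False}),  # fixed, but a regression
    TaskResult("demo-3", {"test_bug": False}, {"test_old": True}),  # bug not fixed
    TaskResult("demo-4", {"test_bug": True}, {}),                   # fixed, no regression tests
]
print(f"{resolved_rate(results):.1f}% resolved")  # prints "50.0% resolved"
```

A patch that fixes the reported bug but breaks a previously passing test (`demo-2`) does not count, which is why scores on this benchmark tend to be much lower than on function-level tests such as HumanEval.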

Related terms

HumanEval, Agent

Seen in the wild

7 entries mentioning it

August 13, 2024
SWE-bench Verified: OpenAI cleans up the reference benchmark for coding agents
Medium

July 10, 2024
Agentless: less agent complexity, more results on SWE-bench
Medium

May 28, 2024
DeepSeek-Coder-V2: GPT-4 Turbo coding quality with open weights
High

April 2, 2024
Aider: CLI coding agent with automatic git integration and SOTA benchmark
High

April 2, 2024
SWE-agent: an AI agent that resolves real GitHub issues at 12.5%
High

March 12, 2024
Devin: the first 'autonomous AI engineer' goes viral
High

March 12, 2024
Devin: 13.86% on SWE-bench, the first autonomous AI software engineer
Landmark
