AgentBench: the first benchmark that measures LLMs as real agents

In one sentence: Tsinghua University researchers present AgentBench, the first comprehensive benchmark for LLM agents across 8 operational environments, revealing a large gap between GPT-4 and open-source models.



Until now, language models were evaluated on knowledge quizzes or text comprehension. But an AI agent doesn't take quizzes: it navigates websites, writes code, manages databases, plays text games. How do you really measure agentic capability?

AgentBench is the first systematic answer to this question: it proposes 8 different environments (OS, database, web browser, e-commerce, text games, and more) where the model must complete concrete tasks over multiple turns, with each action changing the state of the environment it works in.
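To make the idea of evaluating a model inside an environment more concrete, here is a minimal sketch of an observe-act loop with a success-rate score. It is not AgentBench's code or API: the toy environment `ToyCounterEnv`, the helpers `run_episode` and `success_rate`, and the two scripted agents are all hypothetical names invented for illustration, and the scripted agents stand in for what would be an LLM call in a real setup.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyCounterEnv:
    """Hypothetical stand-in for one environment: reach `target` with 'inc' or 'dec'."""
    target: int
    max_turns: int = 10
    value: int = 0
    turns: int = 0

    def observe(self) -> str:
        # Text observation, as an LLM agent would receive it.
        return f"value={self.value} target={self.target}"

    def step(self, action: str) -> bool:
        """Apply an action; return True once the episode is over."""
        self.turns += 1
        if action == "inc":
            self.value += 1
        elif action == "dec":
            self.value -= 1
        # Unknown actions change nothing but still consume a turn.
        return self.solved() or self.turns >= self.max_turns

    def solved(self) -> bool:
        return self.value == self.target


def run_episode(agent: Callable[[str], str], env: ToyCounterEnv) -> bool:
    """Multi-turn loop: observe, act, repeat until solved or out of turns."""
    done = env.solved()
    while not done:
        done = env.step(agent(env.observe()))
    return env.solved()


def success_rate(agent: Callable[[str], str], targets: List[int]) -> float:
    """Fraction of tasks the agent completes, one fresh environment per task."""
    results = [run_episode(agent, ToyCounterEnv(target=t)) for t in targets]
    return sum(results) / len(results)


def greedy_agent(observation: str) -> str:
    """Scripted placeholder for an LLM call: move toward the target."""
    fields = dict(part.split("=") for part in observation.split())
    return "inc" if int(fields["value"]) < int(fields["target"]) else "dec"


def confused_agent(observation: str) -> str:
    """Always emits an invalid action, like a model ignoring the expected format."""
    return "hello"
```

Calling `success_rate(greedy_agent, [3, -2, 0])` scores an agent that follows the task, while `success_rate(confused_agent, [3, 0])` scores one that never produces a valid action. The point of the sketch is only that the score depends on what the agent does over several turns, not on a single answer.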

The most important finding isn't the ranking itself, but the gap it exposes: GPT-4 far outperforms all competitors, while the best open-source models of the time (LLaMA, Vicuna) fail on almost every environment. It is a data point that has accelerated the development of open-source, agent-capable models.