MCP at 18 months: the server ecosystem hits critical mass
Eighteen months after its launch (November 2024), the Model Context Protocol consolidates: thousands of public servers, confirmed cross-vendor adoption, first stable official registry.
Google ships two research agents on the Gemini API: Deep Research (fast) and Deep Research Max (deep + slow, 93.3% on DeepSearchQA). MCP support for private data, native visualizations via Nano Banana 2.
OpenAI reorganizes Operator (January 2025) and ChatGPT Agent (July 2025) into a unified platform, with refreshed SDK and new async multi-task execution modes.
Anthropic ships Cowork as a research preview: a desktop agent with sandboxed shell and local file access, aimed at people who don't live in the terminal the way Claude Code users do.
The Model Context Protocol, launched by Anthropic in November 2024, hits critical mass: GA MCP Inspector, MCP-UI for server-side UI, official registry, OpenAI/Google support. Becomes the 'USB-C of LLM tools'.
Anthropic introduces Skills: bundles of instructions + scripts + resources that Claude loads automatically when a task needs them. De facto replaces most custom enterprise system prompts.
OpenAI launches 'ChatGPT Agent': fusion of Operator (browser use), Deep Research (long research), and classic ChatGPT into a single agent with virtual browser + terminal + API tools.
University of British Columbia publishes ADAS (Automated Design of Agentic Systems): a meta-agent that searches for new agent architectures by writing and evaluating Python code. Discovers novel patterns (dynamic critic, step-back abstraction) that outperform human-designed agents. First system automating agent architecture research.
Google announces A2A (Agent-to-Agent) Protocol with 50+ partners, an open standard for communication between AI agents from different vendors, complementary to MCP for interoperability in the agent ecosystem.
Google launches ADK (Agent Development Kit), an open-source SDK for building Gemini agents, and the A2A protocol for standardized communication between agents from different vendors.
MIT and Google researchers show that having multiple LLM instances debate and critique each other's answers over N rounds leads to more accurate results: +20% on arithmetic and reasoning benchmarks vs single agent. Establishes the debate-based verification pattern in modern agents.
Butterfly Effect launches Manus, an invite-only Chinese AI agent that runs autonomous tasks (stock analysis, research, CV screening) and delivers reports with attached files. Hype on the level of Devin in 2024.
OpenAI launches Deep Research, an autonomous o3-based agent that browses the web for 5-30 minutes, performs hundreds of searches, and produces reports with cited sources.
Google launches ADK, an open source SDK for building hierarchical multi-level agents on Gemini with structured tool calling, native state machines, and native multi-agent orchestration.
Microsoft Research publishes UFO (UI-Focused Agent), an agent that observes the Windows screen (active app + screenshot + control tree), plans actions and executes them via Windows UI Automation and Win32 API. First Windows-native system with reliable multi-application workflow support.
OpenAI launches Operator (research preview): an AI agent that performs browser tasks on behalf of the user. Visits sites, fills forms, books services. Available to US ChatGPT Pro subscribers.
Browser Use is an open-source Python library enabling GPT-4, Claude and Gemini to reliably control a Chromium browser via Playwright. 30k GitHub stars in the first month. First truly usable browser control layer without custom extensions. Enables reliable web agent tasks on any website.
Hugging Face releases smolagents, a ~1000-line minimal library for LLM agents. Pushes the 'code agents' paradigm: the agent writes Python snippets instead of JSON tool calls.
Google releases Gemini 2.0 Flash (native multimodal, tool use, image/audio output) and unveils Project Astra (real-time video assistant), Mariner (browser agent), Jules (coding agent).
Microsoft Research publishes Magentic-One: a system with an Orchestrator plus 4 specialized agents (WebSurfer, FileSurfer, Coder, ComputerTerminal). Competitive with the state of the art on the GAIA benchmark. Key insight: stateless specialized agents plus a stateful orchestrator outperform a monolithic agent. Open source MIT.
Anthropic enables 'Computer Use' on Claude 3.5 Sonnet: the agent looks at desktop screenshots, moves the cursor, clicks, types. For the first time a commercial LLM operates directly on the GUI.
n8n adds native AI Agent nodes to its workflow builder, allowing LLM agentic loops to connect to 400+ business apps without code, marking the arrival of agents in mainstream automation.
OpenAI publishes Swarm on GitHub, a minimal Python framework for orchestrating multiple agents with handoffs and routines, explicitly positioned as experimental and educational. The later Agents SDK builds on it as the production-ready successor.
Flowise v2 introduces sequential and parallel tool use in agents, multiple memory types (buffer, summary, vector), visually configurable agent loops, and LlamaIndex support.
Dify 0.7 brings a no-code/low-code visual builder for complex agentic workflows, integrated RAG with document parsing, support for 10+ LLM providers, and self-hostable deployment on Docker.
UIUC publishes Agentless: a two-phase pipeline (localize the fault, generate a repair) without complex agent loops. At publication time it outperforms open-source agents such as AutoCodeRover and SWE-agent on SWE-bench Lite, challenging the assumption that more agent complexity equals better results.
Agno, renamed from Phidata, is a model-agnostic Python agent framework with modular memory, storage, tools and knowledge base, native multimodal support, and a claimed 10x performance advantage over LangChain (the project's own figure).
Princeton presents SWE-agent, an agent with a dedicated agent-computer interface (ACI) that resolves real GitHub issues on SWE-bench at 12.5%, 6x to 12x better than previous systems.
Cognition Labs unveils Devin, an AI agent that plans, codes, debugs and executes software tasks end-to-end. Viral demo, SWE-bench 13.86%. Defines the 'AI software engineer' category.
Microsoft's TaskWeaver is a code-first agent framework that converts every request into executable Python code in a sandbox, with persistent state between steps and a structured plugin system.
Mufeed VH publishes Devika, an open-source AI software engineer agent: accepts high-level programming objectives, decomposes them, searches the web, writes code and runs tests. First real open alternative to Devin. 15k GitHub stars in 72 hours.
CrewAI launches a Python framework for orchestrating teams of LLM agents with defined roles, individual objectives, and backstories, supporting both sequential and hierarchical processes.
LangChain launches LangGraph, a framework for building agents as node graphs with persistent state, support for cycles, conditional branching, and parallel execution of complex workflows.
XLANG Lab (University of Hong Kong) publishes OpenAgents: a deployable platform with three specialized agents (Data Agent for data analysis, Plugins Agent for API tools, Web Agent for autonomous browsing) accessible from a browser without API keys. First demonstration of real agentic capabilities for non-technical users, with complete open-source code.
Tsinghua presents AgentBench, the first comprehensive benchmark for LLM agents across 8 operational environments, revealing a massive gap between GPT-4 and open-source models.
SuperAGI offers an open-source platform for autonomous agents with a web dashboard, tool marketplace, and the ability to run agents in the background without writing code. First solution to bring the 'monitor agent' experience to non-programmers. Concurrent with AutoGPT but more production-oriented.
Microsoft Research publishes AutoGen, a framework where you define agents with different roles and let them converse with each other to solve a task. First framework to formalize the 'agent-to-agent communication' pattern. Becomes the foundation of many enterprise multi-agent workflows.
MIT and Northeastern propose Reflexion: agents that self-reflect in natural language after each failure, accumulating insights in episodic memory without modifying weights.
MetaGPT assigns each LLM agent a specific company role (PM, Architect, Engineer, QA) and has them collaborate to produce working code from a single text requirement.
Anton Osika publishes GPT-Engineer on GitHub: describe what you want in natural language, the agent asks clarifying questions, then writes all the files and runs them. 50k stars in one week. First viral implementation of the 'one-shot project generator' concept.
UC Berkeley presents Gorilla, a retrieval-augmented fine-tuned LLaMA for accurate API calls: reduces API hallucination from 83% to 3%, outperforming GPT-4 on this task.
Princeton and DeepMind propose Tree of Thoughts: the LLM generates and evaluates multiple reasoning paths as a search tree, clearly outperforming Chain-of-Thought.
NVIDIA creates Voyager, a lifelong-learning agent in Minecraft that uses GPT-4 to write skills in JavaScript and accumulate them in a persistent library, never forgetting.
Stanford creates 25 LLM-based agents simulating daily life in a virtual village, with episodic memory, reflection, and planning — the first credible artificial society.
Yohei Nakajima publishes BabyAGI, an autonomous task manager in ~200 Python lines using GPT-4 and Pinecone that creates and executes subtasks in an infinite loop, viral on Twitter within 24 hours.
Developer Toran Bruce Richards publishes AutoGPT on GitHub: given a text goal, the system calls GPT-4 in a loop to plan tasks, execute them, and self-criticize. Within weeks it becomes one of the fastest-growing repos in GitHub history.
OpenAI ships plugins for ChatGPT: the model can browse the web, run Python in a sandbox, book flights (Expedia, Kayak), order groceries (Instacart). First big mainstream tool-use experiment.
With HuggingGPT, Microsoft Research uses ChatGPT as a central planner that decomposes complex tasks and delegates execution to specialized HuggingFace models for vision, audio, and NLP.
Microsoft open-sources Semantic Kernel, a C#/Python/Java SDK for integrating LLMs into enterprise apps. Introduces 'skills' (reusable AI functions) and 'planners' (auto-chaining toward a goal). Becomes Microsoft's standard AI orchestration layer for Copilot builds.
KAUST presents CAMEL, a role-playing framework where an 'AI user' LLM and an 'AI assistant' LLM autonomously collaborate on tasks without human intervention at each step.
Meta AI presents Toolformer: an LLM that autonomously learns when and how to call external tools (calculator, Wikipedia, calendar) using self-supervised examples only.
Harrison Chase releases LangChain, an open-source Python library to chain LLMs with prompt templates, memory, tools and external data sources. It will become the default stack of the first LLM apps.
Yao et al. introduce ReAct, a prompting scheme that interleaves explicit reasoning steps (Thought) with concrete actions (Act) and the observations they return. It is the conceptual basis of most modern agent loops.
OpenAI publishes WebGPT, a GPT-3 fine-tune that learns to use a text browser to search the web for answers with source citations, trained via imitation learning + RLHF.
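For readers who want to see the loop that runs through most of this timeline, here is a minimal sketch of the Thought/Act alternation described in the ReAct entry. It is an illustration under stated assumptions, not the paper's implementation: the `scripted_model` function stands in for a real LLM call, and the `Act: tool[input]` / `Final: answer` text format and the `calculator` tool are hypothetical choices made for this example.

```python
import re
from typing import Callable, Dict

# Hypothetical tool registry; the tool name and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

# Assumed text conventions for this sketch (not a standard format).
ACTION_RE = re.compile(r"Act: (\w+)\[(.*)\]")
FINAL_RE = re.compile(r"Final: (.*)")


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: acts first, then answers once an observation exists."""
    if "Observation:" not in transcript:
        return "Thought: I need to add the numbers.\nAct: calculator[2+3]"
    return "Thought: The tool returned the sum.\nFinal: 5"


def react_loop(question: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Alternate model replies (Thought + Act) with tool observations."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += "\n" + reply

        # Stop as soon as the model commits to a final answer.
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()

        # Otherwise run the requested tool and feed the result back.
        action = ACTION_RE.search(reply)
        if action is None:
            break
        tool_name, tool_input = action.groups()
        tool = TOOLS.get(tool_name)
        observation = tool(tool_input) if tool else f"unknown tool {tool_name}"
        transcript += f"\nObservation: {observation}"
    return "no answer"


if __name__ == "__main__":
    print(react_loop("What is 2 + 3?", scripted_model))
```

Swapping `scripted_model` for a real model call keeps the loop unchanged; the `max_steps` cap is a design choice to stop a model that never emits a final answer.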