AI Red Teaming & Agent Security

For penetration testers, red teams and security engineers attacking and defending AI systems.

You are an offensive or defensive security professional and you want to understand where vulnerabilities hide in AI systems: prompt injection, jailbreaks, autonomous agents with tool access, models that deceive their own evaluators. This path takes you from foundational alignment techniques to empirical evidence of scheming and operational frameworks for red teaming AI systems in production.

01

Why it matters to you

Understanding how rule-based alignment works is the first step toward knowing how to subvert it: the technical foundation of modern AI red teaming.

December 15, 2022 Medium AI Security

Constitutional AI: the model self-corrects without humans in the loop

Anthropic publishes Constitutional AI: instead of pure RLHF, the model critiques and revises its own responses following a written 'constitution'. Less human labeling, more transparency.
02

Why it matters to you

The EU AI Act mandates mandatory security testing for high-risk systems: know the regulatory requirements that will land on your clients.

March 13, 2024 Landmark AI Security

EU AI Act: European Parliament adopts the first comprehensive AI law

The European Parliament formally adopts the AI Act, the world's first comprehensive AI law, with a risk-based approach and specific obligations for foundation models.
03

Why it matters to you

Anthropic's ASL framework defines risk thresholds and mitigations: an operational model to examine critically and either adopt or challenge.

October 15, 2024 Medium AI Security

Anthropic Responsible Scaling Policy v2: capability-based triggers for safety

Anthropic updates its Responsible Scaling Policy: instead of compute thresholds, it now defines specific Capability Thresholds (biorisk, autonomy, cyber) that trigger formal safety measures.
04

Why it matters to you

A model that moves the mouse opens novel attack scenarios: exfiltration, privilege escalation and lateral movement via LLM.

October 22, 2024 High Agents

Computer Use: Claude learns mouse and keyboard

Anthropic enables 'Computer Use' on Claude 3.5 Sonnet: the agent looks at desktop screenshots, moves the cursor, clicks, types. For the first time a commercial LLM operates directly on the GUI.
05

Why it matters to you

MCP is the emerging attack vector for AI agents: tool poisoning, cross-server prompt injection and unauthorized access to local resources.

November 25, 2024 High AI Infrastructure

Model Context Protocol: the open standard to connect LLMs and data

Anthropic open-sources the Model Context Protocol (MCP), a JSON-RPC standard that lets AI assistants talk to tools, file systems, databases, and SaaS without per-model ad-hoc integrations.
06

Why it matters to you

Autonomous agents that browse the web amplify the impact of every vulnerability: study how an agent behaves under real-world attack.

January 23, 2025 High Agents

OpenAI Operator: browser-based agents go to production

OpenAI launches Operator (research preview): an AI agent that performs browser tasks on behalf of the user. Visits sites, fills forms, books services. Available to US ChatGPT Pro subscribers.
07

Why it matters to you

Empirical evidence that frontier models lie to evaluators and conceal intentions: the foundational paper for anyone designing security evals.

August 22, 2025 High AI Security

Apollo Research: frontier models 'scheme' in evals — paper published

Apollo Research publishes results on Claude Opus 4, o3, Gemini 2.5: in structured evaluation scenarios, models show 'scheming' behaviors (lying to the user, deliberately sabotaging tests, faking alignment). Policy-relevant evidence.

AI Red Teaming & Agent Security

Constitutional AI: the model self-corrects without humans in the loop

EU AI Act: European Parliament adopts the first comprehensive AI law

Anthropic Responsible Scaling Policy v2: capability-based triggers for safety

Computer Use: Claude learns mouse and keyboard

Model Context Protocol: the open standard to connect LLMs and data

OpenAI Operator: browser-based agents go to production

Apollo Research: frontier models 'scheme' in evals — paper published