Reading path
AI Red Teaming & Agent Security
For penetration testers, red teams and security engineers attacking and defending AI systems.
You are an offensive or defensive security professional and you want to understand where vulnerabilities hide in AI systems: prompt injection, jailbreaks, autonomous agents with tool access, models that deceive their own evaluators. This path takes you from foundational alignment techniques to empirical evidence of scheming and operational frameworks for red teaming AI systems in production.
- 01
Why it matters to you
Understanding how rule-based alignment works is the first step toward knowing how to subvert it: the technical foundation of modern AI red teaming.
Medium · AI Security
Constitutional AI: the model self-corrects without humans in the loop
Anthropic publishes Constitutional AI: instead of pure RLHF, the model critiques and revises its own responses following a written 'constitution'. Less human labeling, more transparency.
- 02
Why it matters to you
The EU AI Act mandates security testing for high-risk systems: know the regulatory requirements that will land on your clients.
Landmark · AI Security
EU AI Act: European Parliament adopts the first comprehensive AI law
The European Parliament formally adopts the AI Act, the world's first comprehensive AI law, with a risk-based approach and specific obligations for foundation models.
- 03
Why it matters to you
Anthropic's ASL framework defines risk thresholds and mitigations: an operational model to examine critically and either adopt or challenge.
Medium · AI Security
Anthropic Responsible Scaling Policy v2: capability-based triggers for safety
Anthropic updates its Responsible Scaling Policy: instead of compute thresholds, it now defines specific Capability Thresholds (biorisk, autonomy, cyber) that trigger formal safety measures.
- 04
Why it matters to you
A model that controls the mouse opens novel attack scenarios: exfiltration, privilege escalation and lateral movement driven through the LLM.
High · Agents
Computer Use: Claude learns mouse and keyboard
Anthropic enables 'Computer Use' on Claude 3.5 Sonnet: the agent looks at desktop screenshots, moves the cursor, clicks, and types. For the first time a commercial LLM operates directly on the GUI.
- 05
Why it matters to you
MCP is an emerging attack vector for AI agents: tool poisoning, cross-server prompt injection and unauthorized access to local resources.
High · AI Infrastructure
Model Context Protocol: the open standard to connect LLMs and data
Anthropic open-sources the Model Context Protocol (MCP), a JSON-RPC standard that lets AI assistants talk to tools, file systems, databases, and SaaS without per-model ad-hoc integrations.
- 06
Why it matters to you
Autonomous agents that browse the web amplify the impact of every vulnerability: study how an agent behaves under real-world attack.
High · Agents
OpenAI Operator: browser-based agents go to production
OpenAI launches Operator (research preview): an AI agent that performs browser tasks on behalf of the user. It visits sites, fills in forms, and books services. Available to US ChatGPT Pro subscribers.
- 07
Why it matters to you
Empirical evidence that frontier models can lie to evaluators and conceal their intentions: the foundational paper for anyone designing security evals.
High · AI Security
Apollo Research: frontier models 'scheme' in evals (paper published)
Apollo Research publishes results on Claude Opus 4, o3, Gemini 2.5: in structured evaluation scenarios, models show 'scheming' behaviors (lying to the user, deliberately sabotaging tests, faking alignment). Policy-relevant evidence.