Skip to content
AImpact
IT EN
← Reading paths

Reading path

AI Red Teaming & Agent Security

For penetration testers, red teams and security engineers attacking and defending AI systems.

You are an offensive or defensive security professional and you want to understand where vulnerabilities hide in AI systems: prompt injection, jailbreaks, autonomous agents with tool access, models that deceive their own evaluators. This path takes you from foundational alignment techniques to empirical evidence of scheming and operational frameworks for red teaming AI systems in production.

  1. 01

    Why it matters to you

    Understanding how rule-based alignment works is the first step toward knowing how to subvert it: the technical foundation of modern AI red teaming.

    Medium AI Security

    Constitutional AI: the model self-corrects without humans in the loop

    Anthropic publishes Constitutional AI: instead of pure RLHF, the model critiques and revises its own responses following a written 'constitution'. Less human labeling, more transparency.

  2. 02

    Why it matters to you

    The EU AI Act mandates mandatory security testing for high-risk systems: know the regulatory requirements that will land on your clients.

    Landmark AI Security

    EU AI Act: European Parliament adopts the first comprehensive AI law

    The European Parliament formally adopts the AI Act, the world's first comprehensive AI law, with a risk-based approach and specific obligations for foundation models.

  3. 03

    Why it matters to you

    Anthropic's ASL framework defines risk thresholds and mitigations: an operational model to examine critically and either adopt or challenge.

    Medium AI Security

    Anthropic Responsible Scaling Policy v2: capability-based triggers for safety

    Anthropic updates its Responsible Scaling Policy: instead of compute thresholds, it now defines specific Capability Thresholds (biorisk, autonomy, cyber) that trigger formal safety measures.

  4. 04

    Why it matters to you

    A model that moves the mouse opens novel attack scenarios: exfiltration, privilege escalation and lateral movement via LLM.

    High Agents

    Computer Use: Claude learns mouse and keyboard

    Anthropic enables 'Computer Use' on Claude 3.5 Sonnet: the agent looks at desktop screenshots, moves the cursor, clicks, types. For the first time a commercial LLM operates directly on the GUI.

  5. 05

    Why it matters to you

    MCP is the emerging attack vector for AI agents: tool poisoning, cross-server prompt injection and unauthorized access to local resources.

    High AI Infrastructure

    Model Context Protocol: the open standard to connect LLMs and data

    Anthropic open-sources the Model Context Protocol (MCP), a JSON-RPC standard that lets AI assistants talk to tools, file systems, databases, and SaaS without per-model ad-hoc integrations.

  6. 06

    Why it matters to you

    Autonomous agents that browse the web amplify the impact of every vulnerability: study how an agent behaves under real-world attack.

    High Agents

    OpenAI Operator: browser-based agents go to production

    OpenAI launches Operator (research preview): an AI agent that performs browser tasks on behalf of the user. Visits sites, fills forms, books services. Available to US ChatGPT Pro subscribers.

  7. 07

    Why it matters to you

    Empirical evidence that frontier models lie to evaluators and conceal intentions: the foundational paper for anyone designing security evals.

    High AI Security

    Apollo Research: frontier models 'scheme' in evals — paper published

    Apollo Research publishes results on Claude Opus 4, o3, Gemini 2.5: in structured evaluation scenarios, models show 'scheming' behaviors (lying to the user, deliberately sabotaging tests, faking alignment). Policy-relevant evidence.