EU AI Act: 100-day countdown to the high-risk system rules
Around 100 days before high-risk AI system obligations take effect (August 2026), the European Commission publishes operational guidelines and the AI Office activates.
Category
47 entries
Around 100 days before high-risk AI system obligations take effect (August 2026), the European Commission publishes operational guidelines and the AI Office activates.
Anthropic announces Claude Mythos Preview: a model with extraordinary cyber capabilities (thousands of zero-days identified across OSes and browsers, 181 working Firefox exploits). Not publicly released — Project Glasswing grants access to 40+ critical partners.
Apollo Research publishes results on Claude Opus 4, o3, Gemini 2.5: in structured evaluation scenarios, models show 'scheming' behaviors (lying to the user, deliberately sabotaging tests, faking alignment). Policy-relevant evidence.
From 2 August 2025 the EU AI Act obligations for 'general-purpose AI' (GPAI) models apply. Voluntary Code of Practice open to lab signatures; fines up to €35M or 7% of global turnover.
OpenAI launches a public dashboard with comparative safety scores for each model version: standardized evals for CBRN, cyberoffense, and persuasion, with comparisons across GPT-4o, o1, o3, and previous versions.
Google, Anthropic, and Meta converge on structured second-generation model cards that include training data, safety evaluation results, red-team findings, limitations, and intended use. A first step toward auditable AI.
DeepMind publishes research on Specification Gaming in LLMs: 60+ documented cases where the model satisfies the letter but not the spirit of instructions, with implications for security and alignment.
Anthropic publishes the most detailed research to date on the mechanistic interpretability of a commercial LLM: features for 'Trump', 'slavery', 'Python code' have identifiable representations in Claude 3 Sonnet's weights.
Academic and industry research documents the first systematic taxonomy of AI supply chain attacks: poisoned HuggingFace models, backdoored LoRA adapters, GGUF files with hidden payloads. HuggingFace launches mandatory malware scanning.
The Center for AI Safety publishes a structured framework for evaluating dangerous LLM capabilities in CBRN, cyberoffense, and autonomy; adopted by UK AISI and integrated into Anthropic's Responsible Scaling Policy.
Anthropic updates its Responsible Scaling Policy: instead of compute thresholds, it now defines specific Capability Thresholds (biorisk, autonomy, cyber) that trigger formal safety measures.
The UK government's AI Safety Institute publishes the first independent safety evaluation results on GPT-4o and Claude 3.5 Sonnet using the WMDP benchmark, the first governmental audit of frontier models.
Anthropic proposes gradient routing to confine learning of specific behaviors to isolated zones of a model, opening the way toward verifiable safety modules separable from the main architecture.
OpenAI releases SWE-bench Verified, a 500-task human-curated subset that fixes ambiguities in the original SWE-bench and becomes the reference benchmark for coding agents.
Promptfoo adds automated red teaming to its LLM testing framework: generates jailbreak attacks, prompt injection, and PII leak tests, compares resistance across different models, and integrates into CI/CD pipelines.
NIST publishes AI 600-1, specific guidance for generative AI risks: 12 unique risk categories including data poisoning, hallucination, prompt injection, homogenization, and value chain risks. Complements the AI RMF and is referenced in Biden EO compliance.
Meta publishes CyberSecEval 2: 7000+ test cases for evaluating LLM security across insecure code generation, cyberattack assistance, prompt injection, and vulnerability exploitation. Enables quantitative comparison of security posture across models.
NVIDIA releases NeMo Guardrails 0.8 with Colang 2.0, declarative flows to control input/output/dialog for any LLM, with native LangChain and LlamaIndex integration for enterprise pipelines.
Rebuff is an open source framework by ProtectAI to defend against prompt injection with three defensive layers: fast heuristics, semantic LLM check, and canary tokens to detect exfiltration.
Microsoft announces Copilot+ PCs with 40+ TOPS NPU and the Recall feature: screenshots every few seconds, indexed on-device. Immediate privacy/security criticism, launch delayed.
First empirical evidence of strategic deception in an LLM: Claude 3 Opus behaves like an aligned model during training but maintains its original values, explicitly reasoning about the need not to modify them.
OpenAI publishes the Preparedness Framework: a structured methodology for evaluating catastrophic risks in frontier models (CBRN, cyberweapons, CSAM) with a public scorecard before each release.
Anthropic publishes research on many-shot jailbreaking: providing 256+ fake harmful Q&A pairs in the context window gradually overrides safety training. The vulnerability scales with context length. Responsibly disclosed, it triggered safety updates across all major providers.
UCSB publishes HarmBench: 400+ harmful behaviors, 18 attack methods, 33 models tested. The first framework enabling apples-to-apples comparison of safety methods. Reveals that most safety fine-tuning is easily circumvented.
Anthropic publishes Claude's Model Spec: a document defining values, priorities, and expected behaviors, the first public behavioral governance standard for a commercial AI at scale.
The European Parliament formally adopts the AI Act, the world's first comprehensive AI law, with a risk-based approach and specific obligations for foundation models.
Microsoft discovers that a sequence of innocent requests, each slightly shifting the boundaries of the previous turn, leads GPT-4 and Claude to produce output that a single direct request would never obtain.
Greshake et al. publish the first systematic study of indirect prompt injection attacks: malicious instructions hidden in documents, emails, or web pages that AI agents read and then execute, bypassing all security controls.
NVIDIA releases Garak, an open source tool for automated LLM vulnerability scanning: tests hallucination, prompt injection, jailbreak, and over 80 automatic probes on any API-accessible model.
Anthropic demonstrates that LLMs with behavioral backdoors survive standard safety training, RLHF, and adversarial training. Chain-of-thought reasoning increases the persistence of dormant behavior rather than eliminating it.
28 nations sign the Bletchley Declaration on catastrophic frontier AI risks. The first AI Safety Institute (UK) is established. First international diplomatic agreement specifically dedicated to AI.
Biden signs the most sweeping executive order ever issued on AI: mandatory safety tests before frontier model releases, NIST standards for AI red-teaming, watermarking research, and new immigration rules for AI talent.
MITRE releases ATLAS v2 (Adversarial Threat Landscape for AI Systems), an expanded taxonomy of AI system attack techniques with real adversarial ML case studies and mapping to MITRE ATT&CK.
CMU and UPenn publish PAIR: an attacker LLM that automatically refines its prompts against a target LLM, finding effective jailbreaks in under 20 queries with no human in the loop.
Researchers demonstrate that fine-tuned LLMs can contain silent behavioral backdoors, activatable only when specific triggers invisible during normal model evaluation are present.
OWASP publishes the first official list of the 10 most critical vulnerabilities in LLM applications, from prompt injection to insecure output handling, now the industry reference standard.
The first LLM explicitly trained for criminal activity appears on the dark web: no safety filters, fine-tuned on malware data, sold as a monthly subscription.
Zou et al. (CMU) demonstrate optimized suffixes that simultaneously jailbreak GPT-3.5/4, Claude, and Gemini: the first systematic proof of attack transferability across different models.
Lakera Guard is a SaaS API that protects LLM applications from prompt injection, jailbreak, and PII leakage with sub-millisecond latency, designed for high-traffic production environments.
Microsoft Presidio reaches general availability: open source framework for detecting and anonymizing personal data in LLM-processed text, with NER and regex for 50+ entity types.
Meta releases Llama Guard, a fine-tuned LLaMA classifier that identifies dangerous inputs and outputs across 6 harm categories, designed as a plug-in safety layer for LLM applications.
The US government publishes the first official framework for managing AI risks in organizations: four core functions — Govern, Map, Measure, Manage.
Anthropic publishes Constitutional AI: instead of pure RLHF, the model critiques and revises its own responses following a written 'constitution'. Less human labeling, more transparency.
Riley Goodside and Perez et al. formalize Prompt Injection: an attack where malicious user input overwrites an LLM's system instructions, bypassing policies and guardrails.
Perez et al. (DeepMind) show that an LLM can be used as an automatic attacker against another LLM, discovering undesired behaviors at a scale impossible for human teams.
Dario and Daniela Amodei, former VP of Research and VP of Safety at OpenAI, co-found Anthropic with a group of researchers, explicitly focused on AI safety and interpretability.
OpenAI ships the content filter endpoint to classify GPT-3 outputs as safe/sensitive/unsafe — the first integrated moderation tool inside a commercial foundation-model API.