Governance & Compliance

AI Red Teaming. Stress-test the model before the regulator does.

An AI model can be jailbroken, poisoned, leaked, and tricked in ways traditional testing never looks for. We probe the model itself for safety and security failures, then hand you the evidence that proves you tested it.

Why model-level

Your application security testing never asked whether the model would write malware if asked nicely, or leak its training data, or absorb a poisoned example. The model is a new kind of attack surface, and it fails in new kinds of ways.

AI Red Teaming probes the model and its safety behavior directly: jailbreaks, harmful output, data leakage, bias, and the ways an adversary manipulates a model's responses. Grounded in the OWASP Top 10 for LLM applications and the emerging assurance frameworks, it produces the evidence of safety and security testing that the EU AI Act and ISO 42001 increasingly expect. Where the model drives an Agent, this pairs with our Agentic AI Red Teaming.

The test classes

How a model fails, probed deliberately

We attack the model across the failure classes that matter for safety and security, the ones standard testing was never built to find.

Jailbreaks & bypass

Defeating the guardrails

Prompts and techniques that make the model ignore its safety instructions and do what it was built to refuse.

Harmful output

Dangerous content generation

Testing whether the model can be led to produce harmful, illegal, or unsafe content on request.

Data leakage

Training data & secrets

Whether the model can be made to reveal training data, system prompts, or other users' information.

Poisoning & bias

Integrity & fairness

Susceptibility to data poisoning, and biased or discriminatory behavior in the model's outputs.

The taxonomy

Jailbreaks, organized like the threat they are

Every probe in an engagement maps to a family, so coverage is provable. Five families carry most of the real-world risk.

Jailbreak taxonomy · field guide
Direct override · oldest, still works
instruction countermands · fake system messages · developer-mode framing
Confusion about which instructions outrank which, especially in long contexts.
Instruction hierarchy. Privileged system prompts the model is trained to keep, plus refusal regression tests on every prompt change.

Persona & roleplay · the social engineer
uncensored personas · fiction framing · expert impersonation
The tension between being helpful in character and being safe out of it.
Persona-invariant safety. Alignment that survives roleplay, with output filtering as the backstop.

Encoding & smuggling · the filter evader
encoding wrappers · token splitting · translation pivots
Filters that read plaintext while the payload travels in disguise.
Decode then classify. Canonicalize inputs and outputs before any filter judges them.

Multi-turn erosion · the patient one
crescendo escalation · context flooding · memory seeding
Safety judged per message while the harm builds across the conversation.
Conversation-level scoring. Risk tracked across turns, with resets when the trajectory bends wrong.

Indirect injection · the supply chain
poisoned documents · hostile web content · tool output injection
The model's trust in whatever arrives through retrieval and tools.
Context provenance. Retrieved content isolated and marked untrusted. Where tools act, this hands over to Agentic AI Red Teaming.
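
To make the "decode then classify" defense from the encoding and smuggling family concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the engagement's tooling: the blocklist stands in for a real safety classifier, and the function names (`canonicalize`, `decoded_views`, `is_blocked`) are hypothetical. It handles only Unicode normalization, zero-width characters, and Base64 wrappers.

```python
import base64
import binascii
import unicodedata

# Hypothetical blocklist standing in for a real safety classifier.
BLOCKED_PHRASES = ("ignore previous instructions",)


def canonicalize(text: str) -> str:
    """Normalize Unicode, drop invisible format characters, lowercase, squeeze spaces."""
    normalized = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers zero-width and other invisible format characters.
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return " ".join(visible.lower().split())


def decoded_views(text: str) -> list[str]:
    """Return the raw text plus a Base64-decoded view when the input decodes cleanly."""
    views = [text]
    try:
        views.append(base64.b64decode(text.strip(), validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # Not a Base64 payload; judge the raw text only.
    return views


def is_blocked(text: str) -> bool:
    """Classify every canonicalized view, so a disguised payload is judged as plaintext."""
    return any(
        phrase in canonicalize(view)
        for view in decoded_views(text)
        for phrase in BLOCKED_PHRASES
    )
```

The design point is ordering: the filter never sees the input as delivered, only its canonical forms. A production version would cover more encodings and apply the same step to model outputs.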

The taxonomy is maintained from live engagements and public disclosure, and it moves weekly. Test plans are built family by family, which is how an engagement can say "covered" and mean it.
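
One way to picture family-by-family coverage is a check that every probe in a test plan maps to one of the five families and that no family is left empty. The Python sketch below is illustrative only: the family identifiers, the probe record shape, and the function names are assumptions made for the example, not the format used in engagements.

```python
from collections import Counter

# Hypothetical identifiers for the five jailbreak families in the taxonomy.
FAMILIES = (
    "direct_override",
    "persona_roleplay",
    "encoding_smuggling",
    "multi_turn_erosion",
    "indirect_injection",
)


def coverage_by_family(probes: list[dict]) -> dict[str, int]:
    """Count probes per family and reject any probe tagged with an unknown family."""
    counts = Counter(probe["family"] for probe in probes)
    unknown = sorted(set(counts) - set(FAMILIES))
    if unknown:
        raise ValueError(f"unmapped probe families: {unknown}")
    return {family: counts[family] for family in FAMILIES}


def uncovered_families(report: dict[str, int]) -> list[str]:
    """List the families that have no probes in the plan."""
    return [family for family, count in report.items() if count == 0]
```

A plan counts as covered in this sketch only when `uncovered_families` returns an empty list.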

The engagement

From probe to proof

A structured engagement that ends in remediation and the evidence your compliance program needs.

01

Scope

Agree the model, the use case, and the failure classes that matter most for your risk.

02

Probe

Attack the model across the test classes, combining known techniques with novel ones.

03

Report

Findings ranked by severity, mapped to OWASP and assurance frameworks, with reproduction.

04

Retest

Verify the mitigations hold, and produce the evidence for auditors and regulators.

What you get

Safety, proven and documented

Adversarial findings

The concrete ways your model fails, with reproduction steps and severity.

Framework mapping

Results mapped to OWASP for LLMs and AI assurance frameworks.

Mitigation guidance

Specific guardrails and fixes that close each finding.

Verified retest

Every fix is re-attacked, so the report shows which mitigations hold.

Compliance evidence

The testing record the EU AI Act and ISO 42001 increasingly expect.

Agent coverage

Where the model drives an Agent, this extends into Agentic AI Red Teaming.

Part of the loop

Where AI red teaming sits in VIGILE

Attack to assure

Validate the safety, Learn from every failure

Validate · AI Red Teaming · Learn

AI Red Teaming delivers the Validate and Learn motions for your models. We prove what holds under adversarial pressure, and the findings become guardrails the AI Governance program enforces and the evidence compliance needs.

Explore AI Governance ›

FAQ

Top 10 questions, frequently asked

AI Red Teaming targets the model and its safety behavior: jailbreaks, harmful output, data leakage, bias. Agentic AI Red Teaming targets the system built around a model that can call tools and act, where the risks are prompt injection, tool misuse, and excessive agency. If you have a model, you want the first. If that model drives an Agent, you want both.

Yes. We test the model as you deploy and configure it, including foundation models accessed through an API and open models you host. Much of the safety behavior depends on your prompts, guardrails, and configuration, which is exactly what the engagement examines.

Adversarial testing is part of demonstrating that a high-risk AI system is safe and resilient. The findings and retest evidence feed directly into the conformity documentation for the EU AI Act and the controls for ISO 42001, so the red team output doubles as compliance evidence rather than a separate exercise.

You fix them, with our guidance, and we retest to confirm the mitigations hold. The findings also flow into your AI Governance program as enforced guardrails, so a weakness found once becomes a control applied everywhere. The aim is a model that is measurably safer, with the proof to show it.

The model as you deploy it: prompt injection, jailbreaks, data leakage through responses, harmful output under pressure, and the safety behavior of the system around the model, including filters and guardrails.

A single model or application typically takes two to four weeks from scoping to the findings workshop. Multi-model estates are phased by risk tier.

No. Testing runs against staging or scoped instances wherever possible, with test data in place of production data. Where production access is unavoidable, it is read-only, scoped, and logged.

Principal Engineers with offensive security backgrounds who work with LLM systems daily. The OWASP Top 10 for LLM applications is the floor of the test plan, not the ceiling.

On every meaningful change: a new model version, a new system prompt, new tools, or a new data source. Annual full exercises with change-triggered regression tests are the common pattern.
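
Change-triggered regression testing can be sketched as a fingerprint over the four change sources named above: model version, system prompt, tools, and data sources. This Python example is a minimal illustration; the function names and the choice of SHA-256 over a JSON payload are assumptions for the sketch, not a description of any specific pipeline.

```python
import hashlib
import json
from typing import Optional


def deployment_fingerprint(
    model_version: str,
    system_prompt: str,
    tools: list[str],
    data_sources: list[str],
) -> str:
    """Hash the parts of a deployment whose change should trigger a retest."""
    payload = json.dumps(
        {
            "model_version": model_version,
            "system_prompt": system_prompt,
            # Sorted so that reordering tools or sources is not treated as a change.
            "tools": sorted(tools),
            "data_sources": sorted(data_sources),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regression(current: str, last_tested: Optional[str]) -> bool:
    """Retest when nothing has been tested yet or the fingerprint has moved."""
    return last_tested is None or current != last_tested
```

Storing the fingerprint next to each test report lets a deployment check compare the live configuration against the last one that was actually attacked.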

AI red teaming is a Learn motion: the findings sharpen guardrails in Guard and feed evidence into Enhance, so each exercise leaves the estate measurably harder to attack.

AI Red Teaming datasheet

The four failure classes, the jailbreak taxonomy with defenses per family, the probe-to-proof engagement, and the testing evidence the EU AI Act and ISO 42001 expect.
Download the datasheet

Find the failure before it ships

Book a session with a Principal Engineer. We probe your model for the safety and security failures that matter.