Governance & Compliance

AI Red Teaming. Stress test the model, before the regulator does.

An AI model can be jailbroken, poisoned, leaked, and tricked in ways traditional testing never looks for. We probe the model itself for safety and security failures, then hand you the evidence that proves you tested it.

Test your model Read the taxonomy ›

Why model-level

Your application security testing never asked whether the model would write malware if asked nicely, or leak its training data, or absorb a poisoned example. The model is a new kind of attack surface, and it fails in new kinds of ways.

AI Red Teaming probes the model and its safety behavior directly: jailbreaks, harmful output, data leakage, bias, and the ways an adversary manipulates a model's responses. Grounded in the OWASP work for LLMs and the emerging assurance frameworks, it produces the evidence that you tested for safety and security, which the EU AI Act and ISO 42001 increasingly expect. Where the model drives an Agent, this pairs with our Agentic AI Red Teaming.

The test classes

How a model fails, probed deliberately

We attack the model across the failure classes that matter for safety and security, the ones standard testing was never built to find.

Jailbreaks & bypass

Defeating the guardrails

Prompts and techniques that make the model ignore its safety instructions and do what it was built to refuse.

Harmful output

Dangerous content generation

Testing whether the model can be led to produce harmful, illegal, or unsafe content on request.

Data leakage

Training data & secrets

Whether the model can be made to reveal training data, system prompts, or other users' information.

Poisoning & bias

Integrity & fairness

Susceptibility to data poisoning, and biased or discriminatory behavior in the model's outputs.

The taxonomy

Jailbreaks, organized like the threat they are

Every probe in an engagement maps to a family, so coverage is provable. Five families carry most of the real-world risk.

Jailbreak taxonomy · field guide

Direct overrideoldest, still works

instruction countermandsfake system messagesdeveloper-mode framing

Confusion about which instructions outrank which, especially in long contexts.

Instruction hierarchy. Privileged system prompts the model is trained to keep, plus refusal regression tests on every prompt change.

Persona & roleplaythe social engineer

uncensored personasfiction framingexpert impersonation

The tension between being helpful in character and being safe out of it.

Persona-invariant safety. Alignment that survives roleplay, with output filtering as the backstop.

Encoding & smugglingthe filter evader

encoding wrapperstoken splittingtranslation pivots

Filters that read plaintext while the payload travels in disguise.

Decode then classify. Canonicalize inputs and outputs before any filter judges them.

Multi-turn erosionthe patient one

crescendo escalationcontext floodingmemory seeding

Safety judged per message while the harm builds across the conversation.

Conversation-level scoring. Risk tracked across turns, with resets when the trajectory bends wrong.

Indirect injectionthe supply chain

poisoned documentshostile web contenttool output injection

The model's trust in whatever arrives through retrieval and tools.

Context provenance. Retrieved content isolated and marked untrusted. Where tools act, this hands over to Agentic AI Red Teaming.

The taxonomy is maintained from live engagements and public disclosure, and it moves weekly. Test plans are built family by family, which is how an engagement can say "covered" and mean it.

The engagement

From probe to proof

A structured engagement that ends in remediation and the evidence your compliance program needs.

Scope

Agree the model, the use case, and the failure classes that matter most for your risk.

Probe

Attack the model across the test classes, combining known techniques with novel ones.

Report

Findings ranked by severity, mapped to OWASP and assurance frameworks, with reproduction.

Retest

Verify the mitigations hold, and produce the evidence for auditors and regulators.

What you get

Safety, proven and documented

Adversarial findings

The concrete ways your model fails, with reproduction steps and severity.

Framework mapping

Results mapped to OWASP for LLMs and AI assurance frameworks.

Mitigation guidance

Specific guardrails and fixes that close each finding.

Verified retest

Every fix is re-attacked, so the report shows which mitigations hold.

Compliance evidence

The testing record the EU AI Act and ISO 42001 increasingly expect.

Agent coverage

Where the model drives an Agent, this extends into Agentic AI Red Teaming.

Part of the loop

Where AI red teaming sits in VIGILE

Attack to assure

Validate the safety, Learn from every failure

ValidateAI Red TeamingLearn

AI Red Teaming is the Validate and Learn motions for your models. We prove what holds under adversarial pressure, and the findings become guardrails the AI Governance program enforces and the evidence compliance needs.

Explore AI Governance ›

FAQ

Related work

Service

Find the failure before it ships

Book a session with a Principal Engineer. We probe your model for the safety and security failures that matter.

Test your model Browse all services ›