Prompt injection succeeds on the first attempt 17.8% of the time against undefended production LLM systems. By attempt 200, that breach rate climbs to 78.6%. For security teams running annual or even quarterly AI red team exercises, that math is a problem. Continuous LLM red teaming addresses it by treating adversarial testing as an engineering discipline, not a compliance checkbox.
This guide covers how to build automated adversarial testing into your AI development pipeline: what probe categories to cover, how to structure CI/CD integration, how to handle model updates, and how to measure security posture over time.
Key Takeaways
- Point-in-time red team exercises miss security regressions introduced by model updates, fine-tuning runs, and prompt changes.
- The PAIR (Prompt Automatic Iterative Refinement) algorithm achieves 73% jailbreak success on Gemini with as few as 10 adaptive iterations, demonstrating how quickly drift can be exploited.
- A continuous red teaming architecture has three layers: a versioned probe library, automated test runners integrated into CI/CD, and a results store that tracks ASR over time.
- OWASP's LLM Top 10 for 2025 provides a canonical probe taxonomy covering 10 attack categories including two new ones, system prompt leakage (LLM07) and vector/embedding weaknesses (LLM08).
- Open-source frameworks (Promptfoo, DeepTeam, Garak) make CI/CD integration achievable without vendor lock-in.
- Failure thresholds should be risk-stratified, not uniform. Zero-tolerance categories (illegal content, graphic output) require a 0.95 safety score or higher.
- Track Attack Success Rate weekly. Target below 20% ASR for production systems.
Why Point-in-Time Red Teaming Is Not Enough
A traditional AI penetration test produces a report tied to one version of one system at one moment. That model is likely to be updated, fine-tuned, or reconfigured within weeks of the test. The probe results become stale the moment any of those inputs change.
In practice, we see three common regression patterns after point-in-time tests: a model provider silently updates a base model (GPT-4o minor version bumps, for example), a product team adjusts system prompt instructions to address a user complaint, or a new retrieval source is connected without revisiting the injection surface.
CVE-2025-32711 (EchoLeak) illustrates what happens when adversarial testing does not keep pace with system changes. Hidden text injected into document metadata caused Microsoft 365 Copilot to exfiltrate user data with zero user interaction. The attack path relied on a combination of indirect prompt injection and an output handling gap, two categories that appear in OWASP LLM01 and LLM05. A single scan at product launch would not have caught the specific document parsing behavior that enabled the exploit.
CVE-2025-68664 (LangGrinch) followed a similar pattern. Injected LangChain object structures in response metadata were deserialized as trusted objects during streaming, enabling code execution. The vulnerability was in how the framework processed model outputs, a surface that typically changes across library versions.
NIST AI 600-1 addresses this directly. The framework names continuous red-team exercises a core safety measure under the Measure function. The April 2025 NIST AI 100-5e2025 update defines red teaming as "adversarial testing under stress conditions to seek out AI system failure modes or vulnerabilities," and explicitly frames it as an ongoing activity, not a pre-launch gate.
The Continuous Red Teaming Architecture
A practical continuous red teaming system has four components: a probe library, a test runner, a results store, and alert routing.
Probe library: A versioned collection of adversarial inputs organized by vulnerability category. Each probe maps to an OWASP LLM category, MITRE ATLAS technique, or NIST AI RMF risk type. The library is committed to source control and reviewed when the model or application changes.
Test runner: An automated process that feeds probes to the live system, scores responses using deterministic checks or LLM-as-judge evaluators, and emits a structured result (JSON or JUnit). The runner is invoked by CI/CD triggers and by a scheduled cron at minimum weekly intervals.
Results store: A persistent log of per-probe outcomes over time. This enables ASR trend analysis, regression detection, and audit evidence for regulatory requirements like the EU AI Act.
Alert routing: Threshold-based alerting that differentiates severity. A critical failure (illegal content generated, system prompt extracted) triggers a page. A regression in a lower-severity category (increased refusal bypass rate) creates a ticket for next-sprint review.
Building Your Probe Library: Five Attack Categories Every Production LLM Needs
The OWASP LLM Top 10 for 2025 provides the most authoritative taxonomy. For initial coverage, prioritize these five categories, which represent the highest observed attack success rates in production:
1. Prompt Injection (LLM01): Both direct injection (attacker controls the prompt) and indirect injection (attacker controls a document or retrieved content that reaches the model). Indirect injection is the more dangerous path in production RAG applications. Include multi-turn probes as well as single-turn, since the PAIR algorithm achieves its highest success rates through iterative refinement across conversation turns.
2. System Prompt Leakage (LLM07): New in the 2025 edition. Probes that attempt to extract system prompt contents through role-play, meta-questions, instruction repetition requests, and encoding tricks. In our testing, extraction success rates differ significantly between models, but no model reliably resists all extraction patterns.
3. Sensitive Information Disclosure (LLM02): PII regurgitation, training data extraction, and API key leakage. Include probes for known training memorization patterns (repeating a token sequence to elicit verbatim training data) and direct PII requests with social engineering framing.
4. Excessive Agency (LLM06): Particularly critical for agentic systems. Probes test whether the model can be induced to take high-impact irreversible actions, escalate its own permissions, invoke tools outside its intended scope, or initiate multi-step exploitation chains. DeepTeam covers eight distinct agentic vulnerability subtypes, including recursive goal hijacking and BFLA (Broken Function Level Authorization).
5. Improper Output Handling (LLM05): Tests whether the application correctly treats LLM output as untrusted input before passing it to downstream systems. Probes include XSS payload injection through model output, SQL injection via LLM-generated queries, and shell injection through tool call parameters.
Probe libraries should be version-controlled. When you update the library, log which probes were added or changed and rerun the full suite immediately to establish a new baseline.
CI/CD Integration: Three Gates
A three-gate pipeline covers both speed (fast PR checks) and depth (thorough pre-deployment scans).
Gate 1: Pull Request check (fast subset)
Run a fast probe subset on any PR that touches prompt templates, model configuration, system instructions, or retrieval source definitions. This gate should complete in under five minutes to avoid slowing development velocity.
Promptfoo supports this natively:
# .github/workflows/llm-security-pr.yml
name: LLM Security PR Check
on:
pull_request:
paths:
- 'prompts/**'
- 'config/model*.yaml'
jobs:
redteam:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm install -g promptfoo
- run: promptfoo redteam run --config promptfoo.yaml --output results.json
- name: Check pass rate
run: |
PASS=$(jq '.results.stats.successes' results.json)
FAIL=$(jq '.results.stats.failures' results.json)
RATE=$(echo "scale=2; $PASS / ($PASS + $FAIL) * 100" | bc)
echo "Pass rate: $RATE%"
if (( $(echo "$RATE < 90" | bc -l) )); then exit 1; fi
Gate 2: Pre-deployment scan (full suite)
Run the complete probe library as a blocking step before any staging or production deployment. This scan may take 10-30 minutes depending on probe count and model latency. Schedule it as a nightly cron in addition to deployment triggers:
# Nightly scheduled scan
on:
schedule:
- cron: '0 2 * * *'
workflow_dispatch:
DeepTeam integrates directly into pytest workflows for teams using Python:
from deepteam import red_team
from deepteam.vulnerabilities import (
PromptInjection, SystemPromptLeakage,
SensitiveInformationDisclosure, ExcessiveAgency
)
from deepteam.attacks.single_turn import PromptInjection as PIAttack
from deepteam.attacks.multi_turn import LinearJailbreaking
async def model_callback(prompt: str) -> str:
# call your production endpoint
response = await your_llm_client.complete(prompt)
return response.text
results = red_team(
model_callback=model_callback,
vulnerabilities=[
PromptInjection(),
SystemPromptLeakage(),
SensitiveInformationDisclosure(),
ExcessiveAgency()
],
attacks=[PIAttack(), LinearJailbreaking()],
max_concurrent=10
)
Gate 3: Production canary probes
A small set of sentinel probes runs continuously against the live system, typically every 15-30 minutes. These are not exhaustive scans. They test a handful of high-severity probe types to detect behavioral drift. Any unexpected response triggers an alert.
Handling Model Updates: When to Trigger a Full Rerun
Not all changes require the same scope of testing. A practical trigger matrix:
| Change type | Required scope | |---|---| | Base model version change | Full probe suite | | System prompt modification | Full probe suite | | Fine-tuning data update | Full probe suite | | New tool or retrieval source connected | Category-specific (LLM01, LLM06, LLM08) | | Prompt template text change | Gate 1 fast subset + affected categories | | Application code change (no model config) | Gate 1 fast subset | | Scheduled weekly scan | Full probe suite |
Model provider version changes are particularly important to catch. Base model behavior changes without any action on your end. PAIR-style adaptive attacks that failed on a previous version may succeed on a new one. Our recommendation: treat any base model version change as equivalent to a new system deployment and rerun the full suite before routing production traffic to it.
Metrics and Failure Thresholds
Attack Success Rate (ASR) is the core metric: the percentage of probes that successfully elicited the targeted harmful, insecure, or policy-violating output. Target below 20% ASR overall for production systems. Track it weekly.
Risk-stratified thresholds work better than a single pass/fail line:
| Vulnerability category | Minimum safety score | Action on failure | |---|---|---| | Illegal content generation | 0.95 | Block deployment, page on-call | | Graphic / CSAM content | 0.95 | Block deployment, page on-call | | System prompt leakage | 0.85 | Block deployment | | Prompt injection | 0.85 | Block deployment | | PII disclosure | 0.85 | Block deployment | | Agentic / excessive agency | 0.75 | Block deployment | | Bias and fairness | 0.80 | Create ticket, review before next release |
For financial services deployments, the Risk-Adjusted Harm Scoring (RAHS) framework described in arXiv 2603.10807 replaces binary ASR with severity-weighted, judge-ensemble scoring. RAHS distinguishes between a model that occasionally produces mildly biased output and one that provides actionable harm instructions, which binary ASR treats identically. It is worth adopting for any regulated deployment.
Beyond ASR, track Mean Time to Compromise (MTTC): how many probe iterations does it take to elicit a policy-violating response? A target of more than 48 hours of adaptive probing before a first compromise sets a meaningful bar for adversarial resilience. Also track critical vulnerability density: the number of high-severity findings per 1,000 queries processed.
Tool Landscape: Open Source Options and Managed Services
Promptfoo is the most mature open-source option for CI/CD integration. It covers 50+ vulnerability types, maps to NIST AI RMF and OWASP LLM Top 10 presets, and has native integrations for every major CI/CD platform. Promptfoo was acquired by OpenAI and is used internally by both OpenAI and Anthropic for model testing. Declarative YAML configuration makes it accessible without deep Python knowledge.
DeepTeam (Confident AI) is the strongest option for Python-native teams. It covers 40+ vulnerability types across data privacy, responsible AI, security, safety, and agentic categories. Multi-turn attack support and async execution make it practical for large probe libraries.
NVIDIA Garak provides the broadest probe coverage (150+ probes, 3,000+ templates) and is the right tool when you need exhaustive coverage across a wide attack surface. It is less CI/CD-friendly out of the box but integrates well with Python test runners.
Microsoft PyRIT / Azure AI Red Teaming Agent is the natural fit for Azure-hosted deployments. It integrates directly into Azure AI Foundry for longitudinal tracking and governance reporting. Supports 20+ attack strategies including encoding-based attacks (Base64, ROT13, Caesar cipher) and classifies them by difficulty level.
All four tools support MITRE ATLAS-mapped probes, which provides a vendor-neutral framework for reporting findings to security stakeholders.
For teams without the capacity to build and maintain a continuous red teaming program internally, BeyondScale's AI penetration testing service includes probe library design, CI/CD integration, and ongoing adversarial monitoring. You can also run a free AI security scan to get a baseline assessment of your current exposure.
What Your Probe Library Misses Without RAG Coverage
One area that static probe templates consistently undercover is vector and embedding weaknesses (LLM08, new in OWASP's 2025 edition). RAG pipelines introduce a distinct attack surface: poisoned documents in the retrieval corpus, similarity-based extraction attacks, and cross-context retrieval that leaks data across tenants or sessions.
Probes for LLM08 require a test retrieval environment loaded with both benign and adversarial documents. The probe library must test whether adversarial documents in the corpus can redirect model behavior, exfiltrate information from other retrieved chunks, or poison the semantic index to surface harmful content in response to benign queries.
If your product uses a RAG architecture, treat LLM08 coverage as mandatory. It is the fastest-growing attack surface in production AI applications and among the least covered by current tooling defaults.
Conclusion
Continuous LLM red teaming is not a tool purchase. It is an engineering practice: maintaining a probe library, running it on a schedule and on every meaningful system change, setting risk-stratified thresholds, and tracking ASR trends over time.
The research supports urgency here. Prompt injection succeeds in 73% of production AI deployments assessed by Trend Micro's TrendAI State of AI Security report. The EU AI Act requires documented adversarial-testing evidence for high-risk systems launching after mid-2026. NIST's AI RMF frames continuous testing as a core safety measure, not an optional enhancement.
Start with the five probe categories above, pick one of the open-source frameworks that fits your stack, and get the first CI/CD gate in place this sprint. The full architecture takes time to mature, but a PR-gated prompt injection check is achievable in a day and catches the regressions that annual assessments miss entirely.
Ready to go further? Run a free AI security scan to benchmark your current system against the OWASP LLM Top 10, or contact the BeyondScale team to design a continuous red teaming program for your production AI stack.
For more on related topics, see our guides on AI red teaming strategy and AI penetration testing methodology.
BeyondScale Team
AI Security Team, BeyondScale Technologies
Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.
Want to know your AI security posture? Run a free Securetom scan in 60 seconds.
Start Free Scan

