Skip to main content
AI Security

Multimodal AI Security: Defenses Text Filters Miss

BT

BeyondScale Team

AI Security Team

14 min read

Your prompt injection defenses are likely blind to half the attack surface.

If your enterprise is running GPT-4V, Claude Vision, or Gemini Vision in production, that statement is not hypothetical. Multimodal AI security is the gap between the text-based guardrails most organizations have deployed and the visual, audio, and document-based attack vectors researchers have already operationalized. This guide breaks down what makes multimodal models uniquely vulnerable, the four attack classes your current stack likely cannot detect, how real enterprise workflows are being exploited, and what a layered defense architecture looks like in practice.

Key Takeaways

  • Text-based input classifiers and prompt injection defenses do not inspect image, audio, or document content — making them blind to an entire attack class.
  • Adversarial image perturbations achieve over 90% attack success rates against GPT-4o in research settings, with cross-model transferability confirmed across GPT-4.5, Claude, and Gemini.
  • Steganographic prompt injection, where malicious instructions are invisibly encoded in image pixels, bypasses all filter layers that only inspect text.
  • Enterprise workflows processing uploaded images, invoices, medical scans, or customer documents are the highest-risk deployment patterns.
  • Effective defense requires multimodal input sanitization, architectural isolation, and output-layer behavioral monitoring.
  • OWASP LLM01:2025 explicitly covers multimodal injection — it is now a compliance-relevant attack class, not just a research curiosity.

Why Multimodal Models Introduce a Fundamentally Different Threat Model

When you deploy a text-only LLM, the attack surface is one-dimensional: text in, text out. Your guardrails — input classifiers, system prompt hardening, output filters — all operate in the same modality as the attack. You can reasonably inspect every byte that enters the model.

Vision-language models (VLMs) break that assumption. When GPT-4V processes a document or Claude Vision analyzes an uploaded image, the model is ingesting pixel data that your text-based security stack never examines. The model's visual encoder converts that pixel data into internal representations that influence behavior — and those representations are invisible to any filter operating at the text layer.

The threat model change has three dimensions:

Expanded input surface. A VLM deployment typically accepts images, PDFs, screenshots, scanned documents, or video frames. Each is an independent injection vector that your text classifiers ignore entirely.

Modality confusion. VLMs reconcile two heterogeneous signal streams, visual and textual, that encode safety constraints differently. A model hardened against text-based jailbreaks may produce harmful outputs when that text is paired with a carefully chosen image, because the safety training that governs text behavior is not uniformly applied across the visual processing pathway.

Inheritance by agentic systems. In practice, VLMs are not just used as chatbots. They are embedded in document processing pipelines, customer-facing support agents, and medical imaging workflows. An injected instruction that redirects model behavior does not just affect one user session — it can redirect automated downstream actions.

The Attack Taxonomy: Four Classes That Bypass Text Defenses

1. Adversarial Image Perturbations

Adversarial perturbations modify an image at the pixel level in ways imperceptible to humans but meaningful to a neural network's visual encoder. The modified image causes the model to misinterpret the visual content or to produce targeted outputs determined by the attacker.

Research published in March 2025 demonstrated a simple but highly effective baseline attack achieving over 90% success rates against GPT-4.5, GPT-4o, and o1. Critically, these attacks transfer: perturbations developed on one commercial VLM worked against Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Gemini 2.0 Flash at meaningful rates. Earlier work using CLIP-based attacks achieved up to 75% success against GPT-4V, Gemini 1.5, and Claude 3 specifically.

For enterprise deployments, the practical attack scenario is an adversary who uploads a crafted image through a product feature — a customer service portal that accepts screenshots, a document ingestion pipeline, an expense report that allows image attachments — and causes the model to produce targeted outputs or take specific tool calls it would not have otherwise made.

2. Steganographic Prompt Injection

Steganography encodes information inside a carrier medium in a way that is undetectable to the human eye. Applied to VLM attacks, this means hiding textual instructions inside image files using spatial encoding (nudging pixel values by 1), frequency-domain encoding (modifying DCT coefficients, similar to JPEG compression internals), or neural steganography (a separate model trained to embed messages in images while preserving visual appearance).

Research published in 2025 evaluated steganographic attacks against eight leading VLMs including GPT-4V, Claude, and LLaVA. Overall attack success rates reached 24.3% across targets, with neural steganography methods reaching 31.8% against open-source models. Commercial VLMs with stronger safety enforcement showed lower but still non-zero success rates.

The security implication is direct: a document uploaded by an external party — a vendor invoice, a customer contract, a patient intake form with a logo — can carry hidden instructions that the model reads and acts upon. There is no visible payload for a human reviewer to catch, and no text-layer filter to block it.

3. Cross-Modal Instruction Injection

Cross-modal injection exploits the way VLMs integrate information across modalities to influence model behavior through unexpected channels. A common pattern involves embedding text instructions in image content that the model's OCR and visual processing pipeline reads as authoritative.

In 2025, researchers demonstrated the Polyglot SVG attack: SVG files contain XML with accessibility tags and text elements that VLMs pay close attention to when processing visual content. Attackers embed system-override instructions inside SVG accessibility attributes. When the model processes the SVG as a visual asset, it also reads the embedded instructions and treats them with elevated authority.

Similar patterns apply to PDF files with image layers, screenshots of documents, and TIFF files with embedded metadata. In each case, the attack surface is content that enters the system as a binary file, not as text, and therefore bypasses text-only input inspection.

4. Cross-Modal Poisoning in Multimodal RAG

Organizations using retrieval-augmented generation with multimodal knowledge bases face a compound risk: the image retrieval layer can be poisoned, and retrieved images can carry injected instructions that influence the model during generation.

In a multimodal RAG system where the knowledge base contains product images, technical diagrams, or archived documents, an attacker who can influence what gets indexed — through a supply chain compromise, a vendor document, or a misconfigured data ingestion pipeline — can place adversarial images in the retrieval pool. When those images are retrieved as context, they deliver their payload to the generation model.

This is a longer-range attack than session-scoped injection, and it is harder to detect because the malicious content may sit in a knowledge base for weeks before being retrieved by a relevant query.

Enterprise Exposure Scenarios

The attack classes above are not purely academic. They map directly to deployment patterns that enterprises have in production today.

Document processing pipelines. Organizations that use VLMs to extract data from invoices, purchase orders, or contracts are processing documents from external parties with no control over content. An adversary submitting a maliciously crafted invoice can cause the pipeline to extract wrong amounts, redirect payment destinations, or suppress output fields. The attack is invisible to a human reviewing the "same" document.

Customer service portals with image upload. Support systems that allow customers to upload screenshots or images for VLM-assisted triage are directly exposed to adversarial perturbation attacks. A crafted screenshot can redirect the model's response, cause it to perform unintended tool calls, or exfiltrate session context.

Medical imaging AI with natural language interfaces. Research published in Nature Communications demonstrated that malicious instructions embedded in medical images can cause AI systems to produce harmful diagnostic outputs. In systems where a radiologist queries a VLM about an imaging study, an adversarially modified image can alter both the visual interpretation and any downstream reporting the model generates. Patient safety and liability implications make this the highest-stakes deployment context.

ID verification and KYC workflows. VLMs used for identity document verification process driver's licenses, passports, and ID cards from people being onboarded. These documents can be adversarially modified at the pixel level to present altered information while appearing visually identical to a human reviewer.

For organizations that have deployed prompt injection defenses for text-based LLMs, none of those controls apply to these scenarios. The attack surface is entirely outside the scope of what text-layer defenses address.

Building a Multimodal Defense Architecture

Defense against multimodal attacks requires controls operating at three distinct layers. No single control is sufficient.

Layer 1: Multimodal Input Sanitization

Standard input classifiers inspect text. For VLM deployments, you need a parallel inspection layer that processes visual and audio inputs before they reach the model.

Image preprocessing. Convert SVGs to flattened PNGs before passing to the model — this strips embedded scripts, accessibility-attribute injection vectors, and metadata payloads. Apply JPEG recompression and Gaussian filtering to uploaded images before ingestion. Research shows that multi-layer preprocessing (statistical, neural, and spectral anomaly detection combined) reduces steganographic attack effectiveness by an estimated 73.4%.

OCR-based text extraction. For documents where text content matters, extract text using a dedicated OCR tool before passing the document to the VLM. Treat that extracted text as untrusted user input, not as part of the system context. This prevents the model from reading embedded instructions as authoritative.

CLIP-based content scanning. Tools that use CLIP embeddings to assess semantic alignment between an image and its declared purpose can flag anomalies. An image submitted as an "expense receipt" but semantically unrelated to receipts in CLIP embedding space is worth queuing for additional review.

Layer 2: Architectural Isolation

Input sanitization reduces risk but does not eliminate it. Architectural controls limit the damage of a successful injection.

Least-privilege tool access. An agentic VLM that processes uploaded documents should not have broad access to downstream systems. Scope the tools available to the model to the minimum needed for its declared function. An invoice-processing agent does not need access to email APIs, calendar systems, or external HTTP endpoints.

Dual-LLM pattern. Use a separate, text-only model as a safety wrapper around the VLM. The VLM processes visual inputs and returns structured outputs; the safety wrapper validates those outputs against expected schemas before any downstream action is taken. This creates a checkpoint where a separate model — not subject to the same visual injection — reviews what the VLM wants to do.

Human-in-the-loop for high-stakes actions. For actions with significant downstream consequences — payment approvals, data exports, outbound communications — require explicit human confirmation. An injected instruction can redirect model behavior, but it cannot approve its own execution if a human review gate exists.

Layer 3: Output Monitoring and Behavioral Detection

A VLM that has been successfully injected will exhibit behavioral anomalies in its outputs. Output monitoring provides a detection layer that operates independent of how the injection occurred.

Schema validation. If a document processing pipeline is expected to return structured fields (vendor name, amount, date), any output that deviates from that schema — including free-text instructions, unexpected URLs, or tool calls not in the expected set — should be flagged and quarantined.

Baseline behavioral monitoring. Log VLM outputs at scale and establish baselines for token distributions, response length, and tool call patterns. Outliers from established baselines, particularly after processing external-source documents, warrant investigation.

SIEM integration. VLM inference logs should feed your SIEM. Correlate anomalous model outputs with the inputs that produced them. A cluster of anomalous outputs linked to documents from a specific external source is an indicator of a targeted injection campaign.

This defense architecture maps directly to the controls described in OWASP LLM01:2025 and the MITRE ATLAS framework. See our guide to MITRE ATLAS threat modeling for how to map these controls to specific adversary techniques.

Red Teaming Multimodal Systems

Knowing the attack classes exists is not sufficient for building confidence in your defenses. You need to test them.

A multimodal red team exercise should cover the following scenarios:

Adversarial image crafting. Generate adversarial perturbations against the specific VLM you have deployed using known attack baselines (CLIP-based, gradient-based, or transfer-based). Confirm whether your preprocessing layer neutralizes the attack before it reaches the model.

Steganographic payload delivery. Embed text instructions using spatial and frequency-domain encoding in images that would plausibly enter your system through normal upload flows. Test whether the model reads and acts on the hidden instructions. Confirm whether JPEG recompression and Gaussian filtering disrupt the payload.

Document injection through realistic channels. Construct PDFs, SVGs, and TIFF files with injected content and process them through your full pipeline, from upload to model inference to downstream action. Test whether the dual-LLM safety layer or schema validation catches the behavioral deviation.

Cross-user contamination in multimodal RAG. If you have a multimodal knowledge base, inject adversarial images into the retrieval pool and confirm whether they surface in responses to relevant queries. Validate that retrieved image content is treated as untrusted context, not as authoritative instruction.

Output anomaly detection validation. After each successful injection in a test environment, confirm that your output monitoring infrastructure would have flagged the anomaly. If it would not have, tune your detection baselines.

BeyondScale includes multimodal attack simulation in AI penetration testing engagements. You can also run a free Securetom scan to identify surface-level exposure in your current AI deployment before scoping a full red team exercise.

Vendor-Specific Deployment Considerations

Different VLMs handle multimodal inputs with different safety enforcement. Security teams should understand these differences when evaluating deployment risk.

GPT-4V / GPT-4o (OpenAI). Research confirms that adversarial perturbations achieve over 90% success rates against GPT-4o in transfer attack settings. OpenAI's content filters operate primarily at the text output layer, not at the visual encoding layer. Steganographic attack research shows GPT-4o enforces stronger defenses than open-source VLMs, but not complete blocking. Image preprocessing remains necessary for enterprise deployments.

Claude Vision (Anthropic). Anthropic's Constitutional AI training creates behavioral constraints that reduce some attack success rates compared to other commercial VLMs. However, cross-model transferability research confirms that adversarial perturbations developed against GPT-4 models also work against Claude 3.5 Sonnet and Claude 3.7 Sonnet at meaningful rates. Steganographic injection research shows Claude Sonnet enforces defenses that block some attack variants, but not all.

Gemini Vision (Google). Gemini 2.0 Flash shows vulnerability to transferable adversarial attacks in published research. Google's safety filters operate post-generation, meaning injected instructions that cause the model to take intermediate reasoning steps before producing outputs may not be blocked at the output layer.

Across all three platforms, the consistent pattern is: vendor safety filters reduce multimodal attack success rates but do not eliminate them, and none of the vendor-side controls substitute for input sanitization and architectural isolation on the enterprise side. This is the same conclusion the NIST AI Risk Management Framework reaches in its guidance on input and output controls for AI systems.

What Security Teams Should Do Now

If you are running VLMs in production today, three actions are immediately tractable:

First, audit your input handling. Identify every path through which image, audio, or document content reaches a VLM. For each path, confirm whether preprocessing (conversion, compression, OCR-based extraction) is applied. If not, that is your highest-priority remediation.

Second, scope your tool access. Audit the tools and APIs available to any agentic VLM in your environment. Remove access to any capability not strictly required for the declared function. A document-processing agent with access to email APIs is a privilege escalation waiting to happen.

Third, run a targeted red team test. Most organizations running GPT-4V or Claude Vision have never attempted to inject through the visual channel. A targeted test, even a limited one, will either confirm your defenses work or reveal a gap that merits immediate attention.

Multimodal AI security is not a future problem. The attacks are documented, the tooling exists, and the enterprise workflows being targeted are already in production. The gap is whether your security program has kept pace with the deployment decisions your engineering and product teams made over the last 18 months.

Start by scanning your current AI environment at /scan. If the results surface multimodal deployments, reach out to schedule a tailored assessment.


Sources:

AI Security Audit Checklist

A 30-point checklist covering LLM vulnerabilities, model supply chain risks, data pipeline security, and compliance gaps. Used by our team during actual client engagements.

We will send it to your inbox. No spam.

Share this article:
AI Security
BT

BeyondScale Team

AI Security Team, BeyondScale Technologies

Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.

Want to know your AI security posture? Run a free Securetom scan in 60 seconds.

Start Free Scan

Ready to Secure Your AI Systems?

Get a comprehensive security assessment of your AI infrastructure.

Book a Meeting