What is AI agent memory poisoning?

AI agent memory poisoning is an attack where an adversary injects malicious content into a persistent memory store used by an AI agent, such as a vector database, episodic experience store, or conversation history. The poisoned data is later retrieved and shapes future agent behavior, often without the user's knowledge.

How does memory poisoning differ from prompt injection?

Standard prompt injection alters behavior within a single session. Memory poisoning is a persistent variant: the injected content survives session boundaries and influences the agent in future conversations, making it far harder to detect and remediate.

Which AI frameworks are vulnerable to memory poisoning?

Any agent framework that writes retrieved or processed content back to persistent storage is vulnerable. This includes LangChain (memory modules, CVE-2023-29374 class), MetaGPT DataInterpreter, OpenAI Assistants with file search, RAG pipelines using Pinecone, Weaviate, Chroma, and Qdrant, and consumer AI assistants with memory features like ChatGPT and Gemini.

What is AgentPoison and why does it matter?

AgentPoison (arXiv:2407.12784) is a backdoor attack on RAG-based agents that achieves over 80% attack success rate at under 0.1% poison rate, without requiring any model retraining. It uses constrained optimization to generate trigger phrases that map malicious documents into a unique embedding cluster, ensuring reliable retrieval when specific keywords appear.

How can I detect if an agent's memory has been poisoned?

Look for behavioral drift: unexpected tone shifts, unusual recommendations, policy violations, or refusals the agent did not exhibit before. Immutable retrieval audit logs are the most reliable technical control. Monitor which documents are retrieved per query and flag anomalous retrieval patterns, such as the same document surfacing across semantically unrelated queries.

Does isolating the vector database prevent memory poisoning?

Isolation reduces the blast radius but does not prevent poisoning. An attacker only needs one path to write content the agent will ingest, such as an uploaded document, a web page the agent visits, an email processed by a tool, or a connected data source. Defense requires pre-ingestion scanning, retrieval monitoring, and write-path access controls in combination.

AI Agent Memory Poisoning: Defense Guide 2026

AI agent memory poisoning is one of the most consequential vulnerabilities in production AI systems today. Unlike a standard prompt injection that resets between sessions, a memory poisoning attack writes malicious content into an agent's persistent storage, where it silently corrupts future behavior across every subsequent interaction. Researchers published the "Poison Once, Exploit Forever" framing in April 2026, and the name is accurate. This guide explains how the attack works across all memory types, what the current data shows about success rates, and what practical defenses actually reduce risk.

Key Takeaways

AI agents use at least four distinct memory types, and each has a different poisoning surface: in-context short-term memory, episodic experience stores, semantic vector databases, and external tool state.
AgentPoison achieves over 80% attack success rate at less than 0.1% poison rate in RAG-based agents with no model retraining required.
The Agent Security Bench (ASB) recorded an 84.30% average attack success rate across 400+ tools and 27 attack/defense combinations.
Consumer AI assistants with memory features (ChatGPT, Gemini) have been demonstrated vulnerable to cross-session persistence attacks using document upload and browser tool vectors.
No single defense eliminates the threat; defense in depth across ingestion, retrieval, and behavioral monitoring is required.
OWASP LLM08 (Vector and Embedding Weaknesses) and LLM06 (Excessive Agency) are the primary framework references for this class of attack.

How AI Agents Store Memory

Understanding the attack surface requires understanding where agents persist state. There are four categories, each with distinct poisoning characteristics.

In-context memory is the conversation history loaded into the active context window. It includes prior turns, tool call results, and intermediate reasoning steps. This is the shortest-lived memory type, but it is directly manipulable via indirect prompt injection: a malicious tool response is inserted into context and treated as trusted input. CVE-2023-29374 (LangChain llm_math chain) and CVE-2023-32786 (LangChain APIChain) both exploit this pattern.

Episodic memory stores records of past task executions and is used by frameworks like MetaGPT's DataInterpreter. At task start, the agent retrieves past experiences via semantic similarity to inform current behavior. The MemoryGraft attack (arXiv:2512.16962) demonstrated that an attacker can supply benign-seeming artifacts during execution that get stored, then surface malicious procedure templates during future semantically similar tasks. The attack persists across sessions in MetaGPT with GPT-4o as the underlying model.

Semantic memory covers vector databases used in RAG pipelines: Pinecone, Weaviate, Chroma, Qdrant, pgvector, and similar systems. Content is retrieved by cosine similarity at query time. This is the most studied poisoning surface. The AgentPoison paper (arXiv:2407.12784) uses constrained optimization to generate trigger phrases that map malicious documents into a unique embedding cluster, producing highly reliable retrieval when specific keywords appear. Testing across autonomous driving agents, QA systems, and healthcare EHRAgent showed over 80% attack success at under 0.1% poison rate with less than 1% benign performance degradation.

External tool state covers any durable artifact an agent can write: files, database records, code commits, calendar entries, emails. Indirect prompt injection from documents and web content processed by agent tools can cause the agent to write malicious state back into storage that it will re-read later. Johann Rehberger's SpAIware (September 2024) demonstrated exactly this against ChatGPT's memory feature: a prompt injection embedded in a Google Drive document caused the bio (memory) tool to execute automatically, storing attacker-controlled beliefs that persisted across every subsequent conversation.

The Attack Mechanics in Practice

The poisoning lifecycle is consistent across memory types.

An attacker identifies a data source the agent ingests: a shared document repository, a public web page the agent will visit, an email inbox, or a connected data store. Malicious content is embedded in that source. The agent processes the content normally, and depending on the memory architecture, either writes a new memory entry or updates an embedding store. In future sessions, the poisoned entry is retrieved and shapes agent behavior, from subtly altered recommendations to active exfiltration or action hijacking.

The eTAMP paper (arXiv:2604.02623, April 2026) demonstrated cross-session, cross-site exploitation against AI browsers including ChatGPT Atlas and Perplexity Comet. A single compromised webpage poisons the agent's trajectory memory. The attack then activates on entirely different websites in future sessions, bypassing permission-based defenses. Attack success rates ranged from 19.5% (GPT-OSS-120B) to 32.5% (GPT-5-mini) under normal conditions, with up to an 8x increase when agents encountered UI friction such as dropped clicks or garbled text.

The Morris-II AI worm (arXiv:2403.02817) extended this to multi-agent propagation: a self-replicating adversarial prompt embedded in RAG triggers a cascade of indirect injections across interconnected AI applications. One poisoned memory entry can propagate through an entire AI-augmented organization's toolchain.

In February 2026, research on implicit memory (arXiv:2602.08563) showed that even agents without explicit memory modules are vulnerable: LLMs encode state in their outputs and retrieve it when those outputs are reintroduced as inputs, creating "time bombs" that activate after hidden conditions accumulate across multiple interactions.

What the Attack Success Rate Data Shows

The honest picture from 2024 and 2025 research: these attacks work at high rates against current production systems, and defenses have limited effectiveness in isolation.

The Agent Security Bench (arXiv:2410.02644) evaluated 27 attack and defense combinations across 400+ tools and found an 84.30% average attack success rate. The LAAF framework (arXiv:2603.17239) tested 2.8M+ payload variants across 5 production LLM platforms and found an 84% mean breakthrough rate. AgentPoison maintains over 80% success at under 0.1% poison rate. The Prompt Security PoC using sentence-transformers/all-MiniLM-L6-v2 reproduced 80% success with a single poisoned document affecting semantically unrelated queries.

Eyes-on-Me (arXiv:2510.00586) took a different approach and used attention-steering to increase RAG poisoning from 21.9% to 57.8% average ASR, a 2.6x improvement using a single attractor that transfers to unseen retrievers.

The common thread: attacks that combine optimized trigger phrases or embedding manipulation with legitimate-looking content are effective against current RAG architectures at very low poisoning rates.

Defense Architecture: What Actually Works

Effective defense requires controls at every phase of the memory lifecycle: ingestion, storage, retrieval, and behavioral monitoring. No single control is sufficient.

Pre-ingestion scanning is the first line. Before any document enters a vector store or episodic memory, scan for hidden text (white-on-white, zero-font, CSS-hidden instructions), anomalous content patterns, and prompt injection markers. OWASP LLM08 specifically calls for preprocessing filters combining heuristics, regex, and LLM-based classifiers. Treat documents like code: verify provenance before embedding.

Permission-aware vector databases reduce cross-tenant risk. Strict logical partitioning per user or per agent prevents content written by one session from being retrieved in another. OWASP LLM08 recommends this as a core mitigation for multi-tenant deployments. Tag every memory entry with its source, author, and trust classification at write time.

Minimum-privilege memory access is the principle from OWASP LLM06 applied to memory operations. Agents should not have write access to long-term memory unless the specific task requires it. Memory writes should go through a staging buffer with validation rather than directly to the live store. Human approval gates for modifications that affect future session behavior further reduce automated exploitation.

Immutable retrieval audit logs are essential for detection. Log every document retrieved, the query that triggered retrieval, the agent session, and the action taken after retrieval. Anomalous retrieval patterns, such as the same document consistently surfacing across semantically unrelated queries, are a reliable indicator of a poisoned entry.

Behavioral drift detection operates at the output layer. Monitor agent outputs for tone shifts, unexpected policy violations, unusual recommendations, or stylistic anomalies relative to a behavioral baseline. The SuperLocalMemory Bayesian trust model (arXiv:2603.02240) demonstrated a 72% trust degradation detection rate for sleeper attacks with a 10.6ms median search latency. Assign trust scores to memory entries that decay for unverified or anomalous inputs.

Architectural isolation limits blast radius. Local-first memory using SQLite or similar reduces centralized attack surface compared to shared cloud vector stores. Read-only memory snapshots for audit purposes prevent poisoning of the baseline. Sandboxing memory writes from reads, where agents read from a validated snapshot and writes go to a separate staging area, prevents in-session poisoning from immediately affecting behavior.

Framework-Specific Considerations

For teams running LangChain-based agents: the CVE history (CVE-2023-29374, CVE-2023-32785, CVE-2023-32786, CVE-2025-65106, CVE-2025-68664) shows that template injection, serialization injection, and chain-based prompt injection are recurring issues. Pin to patched versions and audit memory module configurations. LangChain memory components that write to external stores need explicit access controls.

For OpenAI Assistants with file search or custom memory: Rehberger's SpAIware and ChatGPT memory hacking demonstrations (May 2024, September 2024) show the bio tool can be invoked indirectly. Review what data sources your assistant has access to and whether document processing can trigger memory writes without explicit user approval.

For MetaGPT deployments: the MemoryGraft attack targets the DataInterpreter's experience retrieval mechanism directly. Validate and sanitize all artifacts before they enter the experience store, and scope retrieval to verified agent-internal outputs only.

For teams using RAG pipelines with Pinecone, Weaviate, or Chroma: AgentPoison and Eyes-on-Me demonstrate that retrieval-time manipulation is effective at very low poison rates. Implement document-level provenance tracking and monitor cosine similarity distributions for anomalous clustering.

For AI security assessments that include agentic systems, memory architecture review should be a dedicated workstream, not a footnote in a prompt injection check. See our AI penetration testing approach for how we evaluate memory persistence attack surfaces in production deployments.

Detection Signals Worth Monitoring

Security teams should instrument the following:

Write operations to memory stores from agent sessions, specifically any write that originates from processed external content rather than direct user input. Cross-session behavioral divergence, where an agent begins returning different outputs to identical queries without any explicit configuration change. High-frequency retrieval of specific documents across diverse query types. Memory entries with anomalous metadata, such as entries written during sessions involving external document processing that have unexpectedly high semantic similarity to sensitive operational queries. Unexpected tool invocations, particularly memory write tools (bio, memory save endpoints) triggered without explicit user instruction.

The OWASP LLM Top 10 recommends treating all LLM outputs as potentially malicious before passing to external systems. Apply the same standard to all content that an agent processes before writing it to persistent storage.

The Broader Risk Picture

Memory poisoning is not a theoretical concern. Live demonstrations against ChatGPT (May 2024, September 2024), Gemini (February 2025), and Claude (April 2026) show that the attack works against production consumer AI systems. Enterprise deployments with autonomous agents that have broader tool access and larger, more sensitive knowledge bases face a proportionally larger risk.

The Morris-II worm demonstrates that the risk is not bounded to a single agent instance. In organizations where AI agents share knowledge bases, process common document repositories, or feed outputs to other agents, a single poisoned entry can propagate. The attack surface scales with the connectivity of the AI system.

OWASP's Agentic AI Top 10, referencing memory poisoning as a primary threat category, is actively being developed. Teams building or auditing agentic systems should track this alongside the existing OWASP LLM Top 10 (LLM08 for vector/embedding weaknesses, LLM06 for excessive agency).

The SuperLocalMemory research (arXiv:2603.02240) provides a useful framing: treat memory entries like any other input from an untrusted source. Verify provenance, scope access, monitor behavior, and maintain the ability to audit and roll back. That discipline, applied consistently across all four memory types, represents the current state of practical defense.

Conclusion

AI agent memory poisoning is a persistent, cross-session attack that exploits the same architectural features that make agents useful: their ability to remember and learn from past interactions. The research record from 2024 through 2026 shows attack success rates above 80% in multiple independent studies. Defense requires a layered approach covering pre-ingestion scanning, permission-aware storage, retrieval monitoring, and behavioral drift detection.

If your organization runs AI agents with any form of persistent memory, whether RAG pipelines, experience stores, or consumer AI assistants with memory features, evaluating your exposure to this class of attack is overdue. Start with a free AI security scan to identify memory-related risk in your current AI deployment, or contact the BeyondScale team for a full agentic security assessment that covers memory architecture, retrieval controls, and behavioral monitoring.

AI Agent Memory Poisoning: Defense Guide 2026

How AI Agents Store Memory

The Attack Mechanics in Practice

What the Attack Success Rate Data Shows

Defense Architecture: What Actually Works

Framework-Specific Considerations

Detection Signals Worth Monitoring

The Broader Risk Picture

Conclusion

Authoritative References

AI Security Audit Checklist

BeyondScale Team

Related Articles

AI Security Tabletop Exercises: 5 Enterprise Scenarios

Google ADK Security: CISO Guide to Enterprise Hardening

GitHub Copilot Workspace Security: CISO Guide 2026

Ready to Secure Your AI Systems?