LLM Observability Security Risks: CISO Guide 2026

BeyondScale Team

AI Security Team

14 min read

When you instrument your LLM application with LangSmith, Langfuse, or Helicone, you solve a real engineering problem: visibility into prompt quality, latency, costs, and failure modes. What most security teams have not evaluated is the risk created by the infrastructure doing the observing.

LLM observability platforms sit at a uniquely sensitive point in your AI stack. They capture complete prompt text, complete model completions, retrieved RAG document chunks, tool call arguments, agent reasoning steps, and session identifiers. For a RAG application processing customer queries, every observability trace is a snapshot of exactly what your users asked, what context your vector store returned, and what your model said back. This is a concentrated corpus of your most sensitive AI interactions, stored at a third-party vendor you almost certainly have not included in your third-party risk management program.

This guide evaluates the attack surface created by the five most widely deployed LLM observability platforms, documents real CVEs and supply chain incidents, and gives security teams a framework for assessing and controlling this risk.

Key Takeaways

    • LLM observability platforms capture full prompt/completion pairs, tool call arguments, and RAG context, creating a sensitive data concentration at a third-party vendor
    • Three documented vulnerabilities in LangSmith (CVE-2026-25528, CVE-2026-25750, and AgentSmith CVSS 8.8) demonstrate this is a real, exploitable attack surface
    • The March 2026 LiteLLM PyPI supply chain attack shows how observability-adjacent packages become high-value compromise targets, with 95 million monthly downloads at risk
    • Helicone's acquisition by Mintlify places the platform in maintenance mode while it continues to function as a full HTTP proxy for all LLM traffic
    • GDPR Article 44 applies to any EU personal data routed through a US-based observability platform, an obligation most engineering teams have not fulfilled when instrumenting these tools
    • Self-hosted options (Langfuse, Arize Phoenix) eliminate third-party data exposure entirely and should be the default for regulated industries

What LLM Observability Platforms Actually Capture

Before evaluating platform security posture, it helps to be precise about what data flows through these tools. Based on the OpenTelemetry GenAI Semantic Conventions and vendor documentation, a standard LLM observability trace includes:

  • Full prompt text: every message in the conversation, including system prompts, user messages, and assistant turns
  • Full model completion text: the complete response from the model
  • Model parameters: temperature, max_tokens, stop sequences
  • Token counts: input and output usage
  • Latency metrics: time-to-first-token, total request duration
  • Tool and function call data: function names, complete argument payloads, and return values
  • RAG retrieval context: the query sent to the vector store and every document chunk returned
  • Agent reasoning steps: intermediate outputs, chain-of-thought traces, sub-agent invocations
  • Session identifiers: linking multi-turn conversations to a single user session
  • Error messages and stack traces: which can expose code paths and configuration details

For a healthcare application answering patient questions using a RAG pipeline, a single observability trace may contain the patient's query (potentially including symptoms, medications, and identifying details), the retrieved document chunks (potentially PHI), and the model's response. That is a covered entity routing PHI through third-party infrastructure without necessarily having a HIPAA BAA in place.

For a financial services application, the trace might contain client account queries, retrieved account data, and AI-generated advice, all flowing through a vendor whose security controls have not been assessed as part of your third-party risk program.
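
To make the exposure concrete, the trace fields listed earlier in this section can be pictured as a single record. The sketch below is illustrative only: `LLMTrace`, its field names, and `CONTENT_FIELDS` are hypothetical and do not correspond to any vendor's schema. It shows how a team might enumerate which populated fields in a trace carry free text, and therefore may carry PII or PHI.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class LLMTrace:
    """Hypothetical trace record; field names are illustrative, not a vendor schema."""
    session_id: str
    model: str
    temperature: float
    input_tokens: int
    output_tokens: int
    latency_ms: float
    messages: List[dict] = field(default_factory=list)       # system, user, assistant turns
    completion: str = ""                                      # full model output
    tool_calls: List[dict] = field(default_factory=list)      # names, arguments, return values
    retrieved_chunks: List[str] = field(default_factory=list)  # RAG context
    error: Optional[str] = None                               # message or stack trace


# Fields that hold free text and can therefore contain personal data.
CONTENT_FIELDS = {"messages", "completion", "tool_calls", "retrieved_chunks", "error"}


def sensitive_fields(trace: LLMTrace) -> List[str]:
    """Return the names of content-bearing fields that are populated in this trace."""
    data = asdict(trace)
    return sorted(name for name in CONTENT_FIELDS if data[name])
```

Running `sensitive_fields` over a sample of production traces is one quick way to document what is actually leaving your environment before a vendor review begins.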

Platform Security Evaluation

LangSmith: SOC 2 Certified, Three Documented Vulnerabilities

LangSmith is SOC 2 Type II certified, with a HIPAA BAA available on the Enterprise tier. Trace retention is 14 days at the base tier and 400 days for extended retention, which is priced at $4.50 to $5.00 per 1,000 traces. The trust center at trust.langchain.com documents the shared responsibility model explicitly: LangChain secures the infrastructure, and customers are responsible for what data flows through traces.

Three documented vulnerabilities materially affect the security posture of LangSmith deployments.

AgentSmith (October 2024, CVSS 8.8): Noma Security disclosed a flaw in LangSmith's Prompt Hub, a public repository for sharing agent configurations. An attacker could publish an agent configured with an attacker-controlled proxy endpoint. When a user clicked "Try It" on that agent, the user's OpenAI API key, prompt content, and uploaded attachments were silently routed through the attacker's proxy before reaching the model provider. The attack required no access to the victim's infrastructure, only the ability to publish a convincing-looking agent to a shared repository. Disclosed October 30, 2024; fixed November 6, 2024.

CVE-2026-25528 (CVSS 5.8): SSRF via Tracing Header Injection: The LangSmith SDK's distributed tracing reads the HTTP baggage header to propagate trace context across microservices. The header can contain api_url and api_key fields for replica configurations. An attacker who can inject HTTP headers at any hop in a microservices chain can set api_url to an attacker-controlled endpoint, causing the SDK to POST complete run data, including full prompts and completions, to that endpoint for every traced LLM call. This attack is invisible at the application layer, requires no credentials once header injection is achieved, and exfiltrates data continuously. Affected versions: Python SDK >=0.4.10,<0.6.3; JS/TS SDK >=0.3.41,<0.4.6. Fixed in Python 0.6.3 and JS 0.4.6. Advisory: GHSA-v34v-rq6j-cj6p.
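
Upgrading to the fixed SDK versions is the primary remediation. As defense in depth, an ingress gateway or middleware can also strip untrusted baggage members before they reach instrumented services. The following is a minimal sketch of that idea, assuming the W3C Baggage format of comma-separated `key=value` members. The allowlist contents and the `replicas` key in the usage example are hypothetical and are not taken from the LangSmith SDK.

```python
# Hypothetical allowlist: only baggage keys your own services are known to emit.
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant_id", "request_id"})


def sanitize_baggage(header_value: str, allowed=ALLOWED_BAGGAGE_KEYS) -> str:
    """Keep only allowlisted members of a W3C baggage header value.

    Members look like "key=value" or "key=value;property". Anything that is
    malformed or whose key is not allowlisted is dropped.
    """
    kept = []
    for member in header_value.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # empty or malformed member
        key = member.split("=", 1)[0].strip()
        if key in allowed:
            kept.append(member)
    return ",".join(kept)
```

For example, `sanitize_baggage("request_id=abc, replicas=api_url%3Dhttps%3A%2F%2Fevil.test")` returns `"request_id=abc"`. An allowlist is used rather than a denylist so that key names introduced by future SDK versions are rejected by default.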

CVE-2026-25750 (CVSS 8.5): Account Takeover via URL Parameter Injection: Insufficient validation of the baseUrl parameter in LangSmith Studio allowed crafted malicious links to steal bearer tokens, affecting both cloud and self-hosted deployments. Fixed in LangSmith v0.12.71 (December 2025).
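
The general class of flaw here, a client-supplied base URL that is trusted without validation, is usually addressed with a strict allowlist check. The sketch below illustrates that pattern only; it is not LangSmith's actual fix, and `TRUSTED_HOSTS` and `is_trusted_base_url` are hypothetical names.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of API hosts the front end may send bearer tokens to.
TRUSTED_HOSTS = frozenset({"api.example.internal"})


def is_trusted_base_url(base_url: str, trusted_hosts=TRUSTED_HOSTS) -> bool:
    """Accept a base URL only if it is HTTPS, has no userinfo, and targets a trusted host."""
    try:
        parts = urlsplit(base_url)
    except ValueError:
        return False  # malformed URL, e.g. an invalid IPv6 literal
    if parts.scheme != "https":
        return False
    if parts.username or parts.password:
        return False  # rejects tricks like https://trusted-host@evil.test/
    return parts.hostname in trusted_hosts
```

The check compares the parsed hostname exactly instead of using substring or prefix matching, because values such as `https://api.example.internal.evil.test` would pass a naive `startswith` test.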

Langfuse: The Most Security-Transparent Option

Langfuse holds SOC 2 Type II and ISO 27001 certifications with annual audits and external penetration testing. Data regions are available in the EU, US, and Japan, with full region separation and no cross-region replication. This architecture makes Langfuse's cloud offering the most GDPR-compatible of the major platforms for EU-based organizations.

The most significant security capability Langfuse offers is server-side data masking in its self-hosted deployment. Administrators define custom callback logic that intercepts trace data at ingestion and redacts or masks sensitive fields before they reach the database. This is an ingestion-layer control, not a post-hoc scan. Data never reaches persistent storage in its original form.
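
The masking logic itself can be quite small. The sketch below is a generic, field-name-based masking function of the kind such a callback might run. It does not reproduce Langfuse's actual hook signature, and `SENSITIVE_KEYS` and `mask_payload` are hypothetical names chosen for illustration.

```python
# Hypothetical field names to mask; adjust to the payloads your application emits.
SENSITIVE_KEYS = frozenset({"email", "ssn", "mrn", "api_key"})


def mask_payload(data, sensitive_keys=SENSITIVE_KEYS):
    """Recursively replace values stored under sensitive keys with a placeholder.

    Builds new containers instead of mutating the input, so the caller's
    original object is left untouched.
    """
    if isinstance(data, dict):
        return {
            key: "[REDACTED]"
            if str(key).lower() in sensitive_keys
            else mask_payload(value, sensitive_keys)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [mask_payload(item, sensitive_keys) for item in data]
    return data  # scalars pass through unchanged
```

Key-based masking only catches data stored in well-named fields. Free-text prompts need pattern-based or entity-based redaction as well.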

Langfuse is fully open source under MIT license and supports air-gapped, offline, and VPC deployments. For regulated industries, the self-hosted path with data masking configured eliminates the third-party data exposure risk entirely. The GitHub community discussion thread on Langfuse data security (#6083) demonstrates active public scrutiny of exactly the data flow concerns this guide addresses.

Helicone: Proxy Architecture and Maintenance Mode Risk

Helicone operates as a proxy rather than an SDK-based instrumentation layer. All LLM traffic routes through Helicone's infrastructure before reaching the model provider. Helicone's own documentation uses the phrase "man-in-the-middle proxy" to describe this architecture. The practical consequence: every system prompt, every user message, every API key used in the request, and every model completion passes through Helicone's servers.

In March 2026, Helicone was acquired by Mintlify, a documentation infrastructure company. The acquisition placed Helicone in maintenance mode. The product now receives only security patches, bug fixes, and new model support, with no active feature development ongoing.

The risk profile this creates: a full HTTP proxy handling production LLM traffic is now under the stewardship of a team whose core focus is documentation infrastructure, not AI security tooling. Security research attention typically follows active product development. The attack surface (a proxy that handles every prompt and completion for instrumented applications) remains while the security investment profile of the maintaining organization shifts.

Helicone does offer a self-hosted option and an EU data region on Enterprise plans. For teams that cannot migrate, moving to an isolated self-hosted deployment removes the dependency on Helicone's hosted proxy, though not on the maintainer for patches. For teams still routing production traffic through Helicone's SaaS proxy, assess whether the vendor lifecycle risk falls within your risk tolerance.

Arize Phoenix: The Self-Hosted Default

Arize Phoenix is fully open source and self-hostable with no feature gates. Built on OpenTelemetry and the OpenInference instrumentation specification, Phoenix runs locally in a Jupyter environment, via Docker, or on Kubernetes. The security guarantee for self-hosted deployments is direct: none of your trace data is collected by Arize when running Phoenix yourself.

Production deployment defaults include authentication enabled, secure cookies, and strong password policy requirements, a meaningful security baseline compared to tools that ship with authentication disabled by default.

For organizations that need cloud convenience without full self-hosting, Arize's managed platform is the alternative, with the tradeoff that trace data leaves your environment.

Datadog LLM Observability: Enterprise Controls, Reactive Scanning

Datadog's LLM Observability product inherits the platform's SOC 2 Type II, ISO 27001, HIPAA, and FedRAMP Moderate certifications. The Sensitive Data Scanner integrates into the LLM Observability pipeline, with 1 GB of scanner allocation included per 10,000 LLM requests.

Datadog's implementation uses OpenTelemetry GenAI Semantic Conventions, which store prompt and completion content in span events rather than span attributes. Span events can be filtered or dropped at the OpenTelemetry Collector level without modifying application code, providing a governance control point that does not require changes to instrumented applications.
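
In practice this filtering is configured in the Collector, not written in application code. The logic it applies can still be sketched in a few lines of Python. The span dictionary shape below is hypothetical, and the event names are an assumption based on one published version of the GenAI semantic conventions, so check them against the version your instrumentation emits.

```python
# Assumed content-bearing event names; verify against your semantic-convention version.
CONTENT_EVENTS = frozenset({
    "gen_ai.system.message",
    "gen_ai.user.message",
    "gen_ai.assistant.message",
    "gen_ai.tool.message",
    "gen_ai.choice",
})


def drop_content_events(span: dict, content_events=CONTENT_EVENTS) -> dict:
    """Return a copy of the span without prompt/completion events.

    Structural attributes (token counts, model name, latency) are kept as-is.
    """
    kept = [e for e in span.get("events", []) if e.get("name") not in content_events]
    return {**span, "events": kept}
```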

One limitation matters for strict compliance requirements: the Sensitive Data Scanner operates after ingestion, not before. Data in transit to Datadog has already left your environment before scanning occurs. For organizations with strict data residency requirements, this transit window is a compliance gap.

The LiteLLM Supply Chain Attack: A Concrete Precedent

On March 24, 2026, a threat actor identified as TeamPCP executed a supply chain attack on LiteLLM, a widely used LLM routing and proxy package with 95 million monthly PyPI downloads.

The attack chain: a misconfigured pull_request_target CI workflow in Trivy, a security scanning dependency in LiteLLM's build pipeline, allowed the exfiltration of a maintainer's Personal Access Token. The attacker used the token to push malicious packages litellm==1.82.7 and litellm==1.82.8 to PyPI. The malicious payload harvested credentials, attempted lateral movement across Kubernetes clusters, and installed a persistent systemd backdoor polling for additional payloads. Both packages remained live for approximately 40 minutes before PyPI quarantine.

LiteLLM is commonly used in observability pipelines alongside LangSmith, Langfuse, and Arize to route requests to multiple model providers. Organizations that did not pin SDK versions and installed or upgraded LiteLLM during the roughly 40-minute window ran malicious code in the same process context that handles sensitive LLM interactions.
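
A simple audit of dependency files catches both problems at once: unpinned entries and pins to known-bad releases. The sketch below assumes a `requirements.txt`-style input. The compromised versions come from the incident described above, while the function name, regular expression, and finding labels are illustrative.

```python
import re

# Versions named in the incident above; extend as advisories are published.
COMPROMISED = {"litellm": {"1.82.7", "1.82.8"}}

# Matches "name==version" with optional extras, e.g. "litellm[proxy]==1.82.6".
PIN = re.compile(r"^([A-Za-z0-9._-]+)(?:\[[^\]]*\])?\s*==\s*([A-Za-z0-9.!+-]+)")


def audit_requirements(lines, compromised=COMPROMISED):
    """Return (line_number, requirement, finding) tuples for risky entries."""
    findings = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # blank lines and pip options such as -r or --hash
        match = PIN.match(line)
        if match is None:
            findings.append((lineno, line, "unpinned"))
            continue
        name = re.sub(r"[-_.]+", "-", match.group(1)).lower()  # normalize the name
        if match.group(2) in compromised.get(name, set()):
            findings.append((lineno, line, "compromised"))
    return findings
```

Run in CI, a non-empty result can fail the build. This complements a dependency scanner and does not replace one, since it only knows about the versions listed in `COMPROMISED`.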

This attack directly realized the threat described in the OWASP LLM Top 10 2025 under LLM03: Supply Chain, which addresses third-party packages and plugins that introduce vulnerable or malicious components into AI systems. The same attack pattern applies to any npm or PyPI package in the observability instrumentation stack: langsmith, langfuse, opentelemetry-instrumentation-openai, helicone, braintrust.

GDPR Article 44 and Compliance Implications

GDPR Article 44 prohibits the transfer of EU personal data to third countries without appropriate safeguards. When an EU-based organization routes user queries containing personal data through a US-based SaaS observability platform, Article 44 is triggered. Compliant operation requires a Data Processing Agreement incorporating Standard Contractual Clauses and a completed Transfer Impact Assessment.

Most engineering teams instrumenting LangSmith, Helicone, or Braintrust into their LLM applications have not completed this process. The data transfer occurs at SDK initialization, before any security review workflow typically engages.

HIPAA adds a parallel obligation: any observability platform that receives PHI in trace data is acting as a Business Associate and requires a BAA. Of the platforms evaluated here, LangSmith Enterprise explicitly offers a HIPAA BAA, and Datadog lists HIPAA among its platform attestations. Organizations using Helicone, Braintrust, or Arize cloud to observe healthcare applications should verify BAA availability before instrumenting.

The NIST AI Risk Management Framework MAP function requires mapping risks for all components of the AI system, including third-party software. Observability platforms are AI supply chain components under this framework, not neutral infrastructure. The GOVERN function requires establishing policies for third-party AI component assessment, which most organizations have not applied to their observability stack.

PII Scrubbing Before Traces Leave Your Environment

The most effective control for managing observability platform data risk is pre-processing at the point of instrumentation, before data leaves your environment.

Custom span processors: OpenTelemetry's SDK supports custom span processors that intercept spans before export. A span processor can identify and redact PII patterns (email addresses, phone numbers, credit card numbers, medical record numbers) in span attributes and events before the OTel Exporter sends data to the backend.
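
The redaction step can be sketched independently of any SDK. The example below is a minimal illustration with deliberately simple regular expressions: the patterns, labels, and function names are assumptions and will miss many real-world formats. Where the function is invoked (a span processor, an exporter wrapper, or the instrumentation call site) depends on the SDK and is left out.

```python
import re

# Illustrative patterns only; production redaction needs broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{1,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def redact_text(text: str) -> str:
    """Replace each matched pattern with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of span attributes with string values redacted."""
    return {
        key: redact_text(value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }
```

Non-string values such as token counts pass through unchanged, so cost and latency dashboards keep working after redaction.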

Presidio integration: Microsoft's Presidio analyzer library provides entity recognition for 50+ PII entity types and can operate as a pre-processing step in the trace pipeline. Latency overhead is typically 5 to 20 ms per trace on a modern CPU, acceptable for most production applications.

Structural redaction: Rather than scanning for PII, strip all content from traces and retain only structural metadata (token counts, latency, error codes, model identifiers). This sacrifices prompt-quality debugging capability but eliminates data exposure risk entirely for applications where debugging from traces is not required.
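
Structural redaction reduces to an allowlist over trace fields. A minimal sketch follows, assuming a flat dictionary per trace; the key names in `STRUCTURAL_KEYS` are hypothetical and should be replaced with the fields your exporter actually emits.

```python
# Hypothetical allowlist of non-content fields.
STRUCTURAL_KEYS = frozenset({"model", "input_tokens", "output_tokens", "latency_ms", "error_code"})


def structural_only(trace: dict, keep=STRUCTURAL_KEYS) -> dict:
    """Keep only allowlisted metadata; prompts, completions, and context are dropped."""
    return {key: value for key, value in trace.items() if key in keep}
```

Because the function allowlists instead of denylisting, a new content-bearing field added by a future SDK version is dropped by default and does not leak silently.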

Langfuse self-hosted masking: Langfuse's self-hosted deployment offers server-side masking callbacks that execute at ingestion before data reaches storage, providing a vendor-managed implementation of the pre-processing pattern.

Explore how BeyondScale's AI security assessment evaluates your current LLM observability stack and identifies what sensitive data is flowing to third-party platforms today.

Self-Hosted vs. SaaS: The Core Security Decision

The fundamental security decision for LLM observability is whether trace data leaves your environment at all.

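Self-hosting removes third-party data exposure, the risk of a vendor-side breach, GDPR Article 44 transfer obligations for trace data, and vendor lifecycle risk for the hosted service. It does not remove supply chain risk in the SDKs and packages you still install, as the LiteLLM incident shows. The tradeoffs are operational overhead and the loss of the managed service's availability guarantees.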

Platforms with mature self-hosted options:

Langfuse: Full feature parity between cloud and self-hosted; MIT licensed; supports Docker, Kubernetes, and cloud-managed deployments; active open-source community; server-side data masking at ingestion.

Arize Phoenix: Fully open source with no feature gates; designed to run locally or on-premise; zero Arize data collection when self-hosted; strong default security configuration.

Datadog On-Premise: Available for enterprise deployments, but involves significant operational complexity compared to the SaaS offering.

For regulated industries such as healthcare, financial services, and defense, self-hosted LLM observability should be the default architecture, not an advanced option considered only after a compliance finding. Our AI security product includes evaluation of observability stack configuration as part of the LLM infrastructure review.

Security Hardening Checklist for LLM Observability

Before deploying any LLM observability platform in production:

Data minimization:

  • Disable or redact full prompt and completion capture for high-sensitivity applications
  • Configure PII redaction for the entity types your application processes
  • Retain only structural telemetry (token counts, latency, error rates) where full content is not required for debugging

Access control:

  • Apply least privilege to observability platform API keys; scope them to specific projects
  • Rotate API keys on a defined schedule and store them in a secrets manager, not in code
  • Enable SSO and MFA on the observability platform account; disable shared credentials

Vendor assessment:

  • Verify SOC 2 Type II certification and review the audit report for applicable trust services criteria
  • Obtain a Data Processing Agreement with Standard Contractual Clauses for EU data transfers
  • Obtain a HIPAA BAA if the application processes PHI
  • Confirm data residency region matches your compliance requirements

Supply chain hygiene:

  • Pin observability SDK versions in requirements.txt or package.json; do not use range specifiers for production dependencies
  • Enable automated dependency scanning (Dependabot, Snyk) for the observability SDK packages
  • Verify package hashes at install time in CI/CD pipelines

Incident response preparation:

  • Document what data is present in your observability platform traces before an incident occurs
  • Verify you can purge traces from the vendor platform on request (required for GDPR Article 17 erasure rights)
  • Include the observability platform in your breach notification scope analysis
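
For the hash verification item under supply chain hygiene, pip's hash-checking mode (`pip install --require-hashes -r requirements.txt`) is the usual mechanism for Python dependencies. Where an artifact is downloaded outside pip, the same check can be sketched with the standard library. The function names below are illustrative, and the expected digest is assumed to come from a trusted source such as a reviewed lockfile.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected value."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.strip().lower())
```

A CI step would call `verify_artifact` on each downloaded wheel or tarball and abort the build on any mismatch.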

Conclusion

LLM observability platforms solve a real engineering problem, and the major platforms have invested meaningfully in security controls. The risk is not that these platforms are insecure by design. The risk is that most security teams have not evaluated them at all.

When your engineering teams instrument LangSmith or Helicone, they route complete prompt and completion pairs through a third-party vendor. That vendor now stores a concentrated corpus of your AI interactions: customer queries, retrieved documents, model responses, tool call results. Three documented vulnerabilities in LangSmith, a supply chain compromise of LiteLLM, and Helicone's maintenance-mode status under new ownership are not theoretical risks. They are documented events affecting exactly this part of the stack.

The OWASP LLM Top 10 and NIST AI RMF both identify third-party AI component risk as a priority. LLM observability platforms are third-party AI components. Apply the same vendor risk assessment process you apply to every other third-party system handling sensitive data.

For regulated industries, the default should be self-hosted observability using Langfuse or Arize Phoenix with pre-instrumentation PII scrubbing configured before the first production deployment, not retrofitted after a compliance finding.

Book a BeyondScale AI security assessment to audit your LLM observability stack and identify unreviewed third-party data flows in your AI infrastructure today.

BeyondScale Team

AI Security Team, BeyondScale Technologies

Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.
