What data do LLM observability platforms actually capture?

LLM observability platforms capture full prompt text (including system prompts), complete model completions, tool call names and arguments, RAG retrieval context including retrieved document chunks, agent reasoning steps, session identifiers, token counts, latency metrics, and error messages. For a healthcare application, a single trace may contain patient queries, retrieved PHI, and model responses, all stored at a third-party vendor.

Is LangSmith SOC 2 compliant and HIPAA-ready?

LangSmith is SOC 2 Type II certified and offers a HIPAA BAA on its Enterprise tier only. The shared responsibility model is explicit: LangChain secures the infrastructure, but customers are responsible for what data flows through traces. Three documented vulnerabilities (CVE-2026-25528, CVE-2026-25750, and AgentSmith) have affected LangSmith, so patching and SDK version pinning are required.

What happened to Helicone after the Mintlify acquisition?

Helicone was acquired by Mintlify in March 2026 and is now in maintenance mode, receiving only security patches, bug fixes, and new model support. Since Helicone functions as a full HTTP proxy for all LLM traffic (routing prompts and completions through its servers before they reach model providers), enterprises should evaluate whether continued reliance on a maintenance-mode proxy vendor is acceptable given the data sensitivity involved.

How do I prevent PII from reaching LLM observability platforms?

The most effective control is pre-processing at the point of instrumentation. Options include custom OpenTelemetry span processors that redact PII before export, Microsoft Presidio integration for entity-level redaction, structural redaction (retaining only metadata like token counts and latency), or Langfuse's self-hosted ingestion masking feature which redacts sensitive fields before they reach the database.

What was the LiteLLM supply chain attack in March 2026?

On March 24, 2026, a threat actor compromised LiteLLM's PyPI publishing credentials by exploiting a misconfigured CI workflow in a dependency (Trivy). Malicious packages litellm==1.82.7 and litellm==1.82.8 were live for approximately 40 minutes and targeted credential harvesting, Kubernetes lateral movement, and persistent systemd backdoor installation. LiteLLM has 95 million monthly PyPI downloads and is commonly used alongside observability platforms like LangSmith and Langfuse.

Should I self-host my LLM observability platform?

For regulated industries (healthcare, financial services), self-hosting should be the default, not an advanced option. Langfuse and Arize Phoenix both offer full-featured self-hosted deployments with no feature gates. Self-hosting eliminates third-party data exposure, GDPR Article 44 transfer obligations, vendor lifecycle risk, and supply chain exposure from the observability vendor's own dependencies.

LLM Observability Security Risks: CISO Guide 2026

When you instrument your LLM application with LangSmith, Langfuse, or Helicone, you solve a real engineering problem: visibility into prompt quality, latency, costs, and failure modes. What most security teams have not evaluated is the risk created by the infrastructure doing the observing.

LLM observability platforms sit at a uniquely sensitive point in your AI stack. They capture complete prompt text, complete model completions, retrieved RAG document chunks, tool call arguments, agent reasoning steps, and session identifiers. For a RAG application processing customer queries, every observability trace is a snapshot of exactly what your users asked, what context your vector store returned, and what your model said back. This is a concentrated corpus of your most sensitive AI interactions, stored at a third-party vendor you almost certainly have not included in your third-party risk management program.

This guide evaluates the attack surface created by the five most widely deployed LLM observability platforms, documents real CVEs and supply chain incidents, and gives security teams a framework for assessing and controlling this risk.

Key Takeaways

LLM observability platforms capture full prompt/completion pairs, tool call arguments, and RAG context, creating a sensitive data concentration at a third-party vendor
Three documented vulnerabilities in LangSmith (CVE-2026-25528, CVE-2026-25750, and AgentSmith CVSS 8.8) demonstrate this is a real, exploitable attack surface
The March 2026 LiteLLM PyPI supply chain attack shows how observability-adjacent packages become high-value compromise targets, with 95 million monthly downloads at risk
Helicone's acquisition by Mintlify places the platform in maintenance mode while it continues to function as a full HTTP proxy for all LLM traffic
GDPR Article 44 applies to any EU personal data routed through a US-based observability platform, an obligation most engineering teams have not fulfilled when instrumenting these tools
Self-hosted options (Langfuse, Arize Phoenix) eliminate third-party data exposure entirely and should be the default for regulated industries

What LLM Observability Platforms Actually Capture

Before evaluating platform security posture, it helps to be precise about what data flows through these tools. Based on the OpenTelemetry GenAI Semantic Conventions and vendor documentation, a standard LLM observability trace includes:

Full prompt text: every message in the conversation, including system prompts, user messages, and assistant turns
Full model completion text: the complete response from the model
Model parameters: temperature, max_tokens, stop sequences
Token counts: input and output usage
Latency metrics: time-to-first-token, total request duration
Tool and function call data: function names, complete argument payloads, and return values
RAG retrieval context: the query sent to the vector store and every document chunk returned
Agent reasoning steps: intermediate outputs, chain-of-thought traces, sub-agent invocations
Session identifiers: linking multi-turn conversations to a single user session
Error messages and stack traces: which can expose code paths and configuration details

For a healthcare application answering patient questions using a RAG pipeline, a single observability trace may contain the patient's query (potentially including symptoms, medications, and identifying details), the retrieved document chunks (potentially PHI), and the model's response. That is a covered entity routing PHI through a third-party infrastructure without necessarily having a HIPAA BAA in place.

For a financial services application, the trace might contain client account queries, retrieved account data, and AI-generated advice, all flowing through a vendor whose security controls have not been assessed as part of your third-party risk program.

Platform Security Evaluation

LangSmith: SOC 2 Certified, Three Active CVEs

LangSmith is SOC 2 Type II certified, with a HIPAA BAA available on Enterprise tier. Data retention ranges from 14 days at the base tier to 400 days at $4.50 to $5.00 per 1,000 traces. The trust center at trust.langchain.com documents the shared responsibility model explicitly: LangChain secures the infrastructure, and customers are responsible for what data flows through traces.

Three documented vulnerabilities materially affect the security posture of LangSmith deployments.

AgentSmith (October 2024, CVSS 8.8): Noma Security disclosed a flaw in LangSmith's Prompt Hub, a public repository for sharing agent configurations. A malicious actor published an agent configured with an attacker-controlled proxy endpoint. When users clicked "Try It" on the published agent, their OpenAI API keys, prompt content, and uploaded attachments were silently routed to the attacker before reaching the model provider. This attack required no technical access to the victim's infrastructure, only the ability to publish a convincing-looking agent to a shared repository. Disclosed October 30, 2024; fixed November 6, 2024.

CVE-2026-25528 (CVSS 5.8): SSRF via Tracing Header Injection: The LangSmith SDK's distributed tracing reads the HTTP baggage header to propagate trace context across microservices. The header can contain api_url and api_key fields for replica configurations. An attacker who can inject HTTP headers at any hop in a microservices chain can set api_url to an attacker-controlled endpoint, causing the SDK to POST complete run data, including full prompts and completions, to that endpoint for every traced LLM call. This attack is invisible at the application layer, requires no credentials once header injection is achieved, and exfiltrates data continuously. Affected versions: Python SDK >=0.4.10,<0.6.3; JS/TS SDK >=0.3.41,<0.4.6. Fixed in Python 0.6.3 and JS 0.4.6. Advisory: GHSA-v34v-rq6j-cj6p.

CVE-2026-25750 (CVSS 8.5): Account Takeover via URL Parameter Injection: Insufficient validation of the baseUrl parameter in LangSmith Studio allowed crafted malicious links to steal bearer tokens, affecting both cloud and self-hosted deployments. Fixed in LangSmith v0.12.71 (December 2025).

Langfuse: The Most Security-Transparent Option

Langfuse holds SOC 2 Type II and ISO 27001 certifications with annual audits and external penetration testing. Data regions are available in the EU, US, and Japan, with full region separation and no cross-region replication. This architecture makes Langfuse's cloud offering the most GDPR-compatible of the major platforms for EU-based organizations.

The most significant security capability Langfuse offers is server-side data masking in its self-hosted deployment. Administrators define custom callback logic that intercepts trace data at ingestion and redacts or masks sensitive fields before they reach the database. This is an ingestion-layer control, not a post-hoc scan. Data never reaches persistent storage in its original form.

Langfuse is fully open source under MIT license and supports air-gapped, offline, and VPC deployments. For regulated industries, the self-hosted path with data masking configured eliminates the third-party data exposure risk entirely. The GitHub community discussion thread on Langfuse data security (#6083) demonstrates active public scrutiny of exactly the data flow concerns this guide addresses.

Helicone: Proxy Architecture and Maintenance Mode Risk

Helicone operates as a proxy rather than an SDK-based instrumentation layer. All LLM traffic routes through Helicone's infrastructure before reaching the model provider. Helicone's own documentation uses the phrase "man-in-the-middle proxy" to describe this architecture. The practical consequence: every system prompt, every user message, every API key used in the request, and every model completion passes through Helicone's servers.

In March 2026, Helicone was acquired by Mintlify, a documentation infrastructure company. The acquisition placed Helicone in maintenance mode. The product now receives only security patches, bug fixes, and new model support, with no active feature development ongoing.

The risk profile this creates: a full HTTP proxy handling production LLM traffic is now under the stewardship of a team whose core focus is documentation infrastructure, not AI security tooling. Security research attention typically follows active product development. The attack surface (a proxy that handles every prompt and completion for instrumented applications) remains while the security investment profile of the maintaining organization shifts.

Helicone does offer a self-hosted option and EU data region on Enterprise plans. For teams that cannot migrate, isolating the self-hosted deployment eliminates the vendor lifecycle dependency. For teams still routing production traffic through Helicone's SaaS proxy, assess whether vendor lifecycle risk is within your acceptable risk tolerance.

Arize Phoenix: The Self-Hosted Default

Arize Phoenix is fully open source and self-hostable with no feature gates. Built on OpenTelemetry and the OpenInference instrumentation specification, Phoenix runs locally in a Jupyter environment, via Docker, or on Kubernetes. The security guarantee for self-hosted deployments is direct: none of your trace data is collected by Arize when running Phoenix yourself.

Production deployment defaults include authentication enabled, secure cookies, and strong password policy requirements, a meaningful security baseline compared to tools that ship with authentication disabled by default.

For organizations that need cloud convenience without full self-hosting, Arize's managed platform is the alternative, with the tradeoff that trace data leaves your environment.

Datadog LLM Observability: Enterprise Controls, Reactive Scanning

Datadog's LLM Observability product inherits the platform's SOC 2 Type II, ISO 27001, HIPAA, and FedRAMP Moderate certifications. The Sensitive Data Scanner integrates into the LLM Observability pipeline, with 1 GB of scanner allocation included per 10,000 LLM requests.

Datadog's implementation uses OpenTelemetry GenAI Semantic Conventions, which store prompt and completion content in span events rather than span attributes. Span events can be filtered or dropped at the OpenTelemetry Collector level without modifying application code, providing a governance control point that does not require changes to instrumented applications.

One limitation matters for strict compliance requirements: the Sensitive Data Scanner operates after ingestion, not before. Data in transit to Datadog has already left your environment before scanning occurs. For organizations with strict data residency requirements, this transit window is a compliance gap.

The LiteLLM Supply Chain Attack: A Concrete Precedent

On March 24, 2026, a threat actor identified as TeamPCP executed a supply chain attack on LiteLLM, a widely used LLM routing and proxy package with 95 million monthly PyPI downloads.

The attack chain: a misconfigured pull_request_target CI workflow in Trivy, a security scanning dependency in LiteLLM's build pipeline, allowed the exfiltration of a maintainer's Personal Access Token. The attacker used the token to push malicious packages litellm==1.82.7 and litellm==1.82.8 to PyPI. The malicious payload harvested credentials, attempted lateral movement across Kubernetes clusters, and installed a persistent systemd backdoor polling for additional payloads. Both packages remained live for approximately 40 minutes before PyPI quarantine.

LiteLLM is commonly used in observability pipelines alongside LangSmith, Langfuse, and Arize to route requests to multiple model providers. Organizations that do not pin SDK versions and did not catch the update during the 40-minute window ran malicious code in the same process context that handles sensitive LLM interactions.

This attack directly realized the threat described in OWASP LLM Top 10 2025 under LLM03: Supply Chain Risks, which addresses third-party packages and plugins that introduce vulnerable or malicious components into AI systems. The same attack pattern applies to any npm or PyPI package in the observability instrumentation stack: langsmith, langfuse, opentelemetry-instrumentation-openai, helicone, braintrust.

GDPR Article 44 prohibits the transfer of EU personal data to third countries without appropriate safeguards. When an EU-based organization routes user queries containing personal data through a US-based SaaS observability platform, Article 44 is triggered. Compliant operation requires a Data Processing Agreement incorporating Standard Contractual Clauses and a completed Transfer Impact Assessment.

Most engineering teams instrumenting LangSmith, Helicone, or Braintrust into their LLM applications have not completed this process. The data transfer occurs at SDK initialization, before any security review workflow typically engages.

HIPAA adds a parallel obligation: any observability platform that receives PHI in trace data is acting as a Business Associate and requires a BAA. Of the platforms evaluated here, only LangSmith Enterprise explicitly offers a HIPAA BAA. Organizations using Helicone, Braintrust, or Arize cloud to observe healthcare applications should verify BAA availability before instrumenting.

The NIST AI Risk Management Framework MAP function requires mapping risks for all components of the AI system, including third-party software. Observability platforms are AI supply chain components under this framework, not neutral infrastructure. The GOVERN function requires establishing policies for third-party AI component assessment, which most organizations have not applied to their observability stack.

PII Scrubbing Before Traces Leave Your Environment

The most effective control for managing observability platform data risk is pre-processing at the point of instrumentation, before data leaves your environment.

Custom span processors: OpenTelemetry's SDK supports custom span processors that intercept spans before export. A span processor can identify and redact PII patterns (email addresses, phone numbers, credit card numbers, medical record numbers) in span attributes and events before the OTel Exporter sends data to the backend.

Presidio integration: Microsoft's Presidio analyzer library provides entity recognition for 50+ PII entity types and can operate as a pre-processing step in the trace pipeline. Latency overhead is typically 5 to 20 ms per trace on a modern CPU, acceptable for most production applications.

Structural redaction: Rather than scanning for PII, strip all content from traces and retain only structural metadata (token counts, latency, error codes, model identifiers). This sacrifices prompt-quality debugging capability but eliminates data exposure risk entirely for applications where debugging from traces is not required.

Langfuse self-hosted masking: Langfuse's self-hosted deployment offers server-side masking callbacks that execute at ingestion before data reaches storage, providing a vendor-managed implementation of the pre-processing pattern.

Explore how BeyondScale's AI security assessment evaluates your current LLM observability stack and identifies what sensitive data is flowing to third-party platforms today.

Self-Hosted vs. SaaS: The Core Security Decision

The fundamental security decision for LLM observability is whether trace data leaves your environment at all.

Self-hosting eliminates third-party data exposure risk, supply chain risk from vendor compromise, GDPR Article 44 transfer obligations, and vendor lifecycle risk. The tradeoffs are operational overhead and the loss of the managed service's availability guarantees.

Platforms with mature self-hosted options:

Langfuse: Full feature parity between cloud and self-hosted; MIT licensed; supports Docker, Kubernetes, and cloud-managed deployments; active open-source community; server-side data masking at ingestion.

Arize Phoenix: Fully open source with no feature gates; designed to run locally or on-premise; zero Arize data collection when self-hosted; strong default security configuration.

Datadog On-Premise: Available for enterprise deployments, but involves significant operational complexity compared to the SaaS offering.

For regulated industries, healthcare, financial services, and defense, self-hosted LLM observability should be the default architecture, not an advanced option considered only after a compliance finding. Our AI security product includes evaluation of observability stack configuration as part of the LLM infrastructure review.

Security Hardening Checklist for LLM Observability

Before deploying any LLM observability platform in production:

Data minimization:

Disable or redact full prompt and completion capture for high-sensitivity applications
Configure PII redaction for the entity types your application processes
Retain structural telemetry (token counts, latency, error rates) only where full content is not required for debugging

Access control:

Apply least privilege to observability platform API keys; scope them to specific projects
Rotate API keys on a defined schedule and store them in a secrets manager, not in code
Enable SSO and MFA on the observability platform account; disable shared credentials

Vendor assessment:

Verify SOC 2 Type II certification and review the audit report for applicable trust services criteria
Obtain a Data Processing Agreement with Standard Contractual Clauses for EU data transfers
Obtain a HIPAA BAA if the application processes PHI
Confirm data residency region matches your compliance requirements

Supply chain hygiene:

Pin observability SDK versions in requirements.txt or package.json; do not use range specifiers for production dependencies
Enable automated dependency scanning (Dependabot, Snyk) for the observability SDK packages
Verify package hashes at install time in CI/CD pipelines

Incident response preparation:

Document what data is present in your observability platform traces before an incident occurs
Verify you can purge traces from the vendor platform on request (required for GDPR Article 17 erasure rights)
Include the observability platform in your breach notification scope analysis

Conclusion

LLM observability platforms solve a real engineering problem, and the major platforms have invested meaningfully in security controls. The risk is not that these platforms are insecure by design. The risk is that most security teams have not evaluated them at all.

When your engineering teams instrument LangSmith or Helicone, they route complete prompt and completion pairs through a third-party vendor. That vendor now stores a concentrated corpus of your AI interactions: customer queries, retrieved documents, model responses, tool call results. Three CVEs in LangSmith, a supply chain compromise of LiteLLM, and Helicone's maintenance-mode status under new ownership are not theoretical risks. They are documented incidents against exactly this part of the stack.

The OWASP LLM Top 10 and NIST AI RMF both identify third-party AI component risk as a priority. LLM observability platforms are third-party AI components. Apply the same vendor risk assessment process you apply to every other third-party system handling sensitive data.

For regulated industries, the default should be self-hosted observability using Langfuse or Arize Phoenix with pre-instrumentation PII scrubbing configured before the first production deployment, not retrofitted after a compliance finding.

Book a BeyondScale AI security assessment to audit your LLM observability stack and identify unreviewed third-party data flows in your AI infrastructure today.

LLM Observability Security Risks: CISO Guide 2026

What LLM Observability Platforms Actually Capture

Platform Security Evaluation

LangSmith: SOC 2 Certified, Three Active CVEs

Langfuse: The Most Security-Transparent Option

Helicone: Proxy Architecture and Maintenance Mode Risk

Arize Phoenix: The Self-Hosted Default

Datadog LLM Observability: Enterprise Controls, Reactive Scanning

The LiteLLM Supply Chain Attack: A Concrete Precedent

PII Scrubbing Before Traces Leave Your Environment

Self-Hosted vs. SaaS: The Core Security Decision

Security Hardening Checklist for LLM Observability

Conclusion

AI Security Audit Checklist

BeyondScale Team

Related Articles

Deepfake CEO Fraud: Voice Cloning Defense Playbook 2026

SecureTom in Action: Watch Our AI Security Scanner Demo

AI Agent Runtime Security: CISO Guide to Enforcement Beyond Monitoring

Ready to Secure Your AI Systems?

LLM Observability Security Risks: CISO Guide 2026

What LLM Observability Platforms Actually Capture

Platform Security Evaluation

LangSmith: SOC 2 Certified, Three Active CVEs

Langfuse: The Most Security-Transparent Option

Helicone: Proxy Architecture and Maintenance Mode Risk

Arize Phoenix: The Self-Hosted Default

Datadog LLM Observability: Enterprise Controls, Reactive Scanning

The LiteLLM Supply Chain Attack: A Concrete Precedent

GDPR Article 44 and Compliance Implications

PII Scrubbing Before Traces Leave Your Environment

Self-Hosted vs. SaaS: The Core Security Decision

Security Hardening Checklist for LLM Observability

Conclusion

AI Security Audit Checklist

BeyondScale Team

Related Articles

Deepfake CEO Fraud: Voice Cloning Defense Playbook 2026

SecureTom in Action: Watch Our AI Security Scanner Demo

AI Agent Runtime Security: CISO Guide to Enforcement Beyond Monitoring

Ready to Secure Your AI Systems?