AI Infrastructure Security

Kubernetes AI Workload Security: Hardening LLM Infrastructure

BeyondScale Team

AI Security Team

15 min read

Kubernetes AI security is the fastest-growing blind spot in enterprise infrastructure. As organizations move LLM inference, model training, and AI agents onto Kubernetes clusters, they assume the orchestrator's isolation primitives transfer cleanly to AI workloads. They do not. In March 2026, the CNCF published a formal threat model documenting exactly why: Kubernetes "does not inherently understand or control the behavior of AI systems, creating a fundamentally different and more complex threat model." This guide covers the attack paths that matter, with specific CVEs, RBAC patterns, isolation architectures, and a 20-point checklist your team can use today.

Key Takeaways

    • CVE-2025-23266 (NVIDIAScape, CVSS 9.0) lets a malicious container image escape to the host via the GPU runtime hook. Patch immediately if running Container Toolkit below 1.17.8.
    • Kubernetes RBAC misconfigurations are the primary lateral movement path in AI clusters. Palo Alto Unit 42 found suspicious service account token theft in 22% of cloud environments in 2025.
    • Standard container isolation is not sufficient for AI agents that execute LLM-generated code. Kata Containers with Firecracker microVMs provides hardware-enforced kernel boundaries.
    • Model artifact poisoning is a practical threat. The March 2026 LiteLLM supply chain attack harvested LLM API keys, Kubernetes configs, and cloud credentials from every infected environment.
    • Kubernetes PSS/PSA addresses general container hygiene but has no visibility into prompt injection, model integrity, or RAG retrieval abuse.
    • Detection requires AI-specific signals: per-key token consumption, prediction drift as a security indicator, and GPU utilization anomalies.

Why Kubernetes Cannot Secure LLM Workloads Alone

The CNCF's March 2026 analysis put it directly: "Kubernetes is great at scheduling workloads and keeping them isolated. It has no idea what those workloads do." A deployment can be fully compliant with Kubernetes best practices while exposing significant risks through its AI layer. Kubernetes has no visibility into whether a prompt is malicious, whether sensitive data is leaking from an inference response, or whether a model is interacting with internal systems unsafely.

This creates a two-layer problem. The infrastructure layer (the cluster, the nodes, the runtime) has its own attack surface: container escapes, RBAC abuse, secrets exposure, and GPU runtime vulnerabilities. The AI layer adds an entirely different surface: prompt injection (OWASP LLM01 2025), model poisoning (LLM04), RAG manipulation, and excessive agency (LLM06). Defenders who harden one layer without the other leave half the attack surface unaddressed.

The CNCF advocates for a dedicated policy layer in front of the model runtime, separate from inference infrastructure, to enforce prompt filtering, content moderation, and access controls. That policy layer is not built into Kubernetes.

The AI-Specific Kubernetes Threat Model

Standard Kubernetes security guidance was written for stateless web services. LLM workloads break several key assumptions:

Pods with persistent, high-value GPU access. A compromised model-serving pod has access to expensive GPU compute and, in many deployments, to proprietary model weights stored on attached volumes. The blast radius of a single pod compromise is far higher than a stateless API pod.

Long-running jobs accumulate secrets over time. Training jobs run for hours or days, accumulating cloud credentials, HuggingFace tokens, and API keys in environment variables. The LiteLLM supply chain compromise of March 2026 demonstrated this precisely: a poisoned PyPI package harvested LLM API keys for every configured provider, Kubernetes service account tokens, AWS/GCP/Azure IAM credentials, Docker registry tokens, and HashiCorp Vault tokens from every infected environment. In Kubernetes clusters, the malware then created privileged node-setup- pods using stolen service account tokens, achieving host-level access.
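
One way to see how much a long-running job would expose is to scan its pod spec for credentials written directly into environment variables. The sketch below is a minimal illustration, not a complete secret scanner: the function name and the name patterns are assumptions, and it only inspects a pod spec that has already been loaded into a Python dict (for example from `kubectl get pod -o json`).

```python
import re

# Name fragments that usually indicate a credential. This list is an assumption; tune it.
SECRET_NAME_PATTERN = re.compile(
    r"(API_KEY|TOKEN|SECRET|PASSWORD|ACCESS_KEY)", re.IGNORECASE
)


def find_inline_secret_env_vars(pod_spec: dict) -> list:
    """Return (container name, env var name) pairs whose value is set inline.

    Variables populated through valueFrom (for example secretKeyRef) are skipped
    here because the credential is not written into the manifest itself. They
    still end up in the process environment, so treat this as a first pass.
    """
    findings = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        for env in container.get("env", []):
            has_inline_value = "value" in env and "valueFrom" not in env
            if has_inline_value and SECRET_NAME_PATTERN.search(env["name"]):
                findings.append((container["name"], env["name"]))
    return findings
```

Running this across every training and serving pod gives a rough inventory of what a poisoned dependency could read from the environment on day one.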

Model artifacts are binary blobs, not auditable code. A backdoored Python module can be reviewed with git diff. A backdoored set of model weights cannot. Malicious behavior baked into model parameters is undetectable through code review, and only ~50% of organizations scan models before deployment despite documented risks from public repositories.

AI agents execute untrusted code. An AI coding agent given write access to a filesystem and the ability to spawn processes can trigger container escape exploits. The April 2026 "Copy Fail" kernel vulnerability (CVE-2026-31431, actively exploited) requires only a 732-byte Python script and is deterministic with no race condition. The SANDBOXESCAPEBENCH study (arXiv, March 2026) confirmed that frontier LLMs can identify and exploit container vulnerabilities when they are present in the sandbox.

GPU Container Security: The CVE You Need to Patch Now

The NVIDIA Container Toolkit vulnerability disclosed in July 2025 redefined what "container escape" means for AI workloads.

CVE-2025-23266 (NVIDIAScape, CVSS 9.0) exploits the OCI createContainer hook in the NVIDIA Container Toolkit. The hook inherits environment variables from the container image and executes with the container's root filesystem. An attacker sets LD_PRELOAD in a container image to point to a malicious shared library. When the privileged nvidia-ctk hook runs on the host, it loads that library, giving the attacker full host root. The exploit requires a three-line Dockerfile. No prior host access is needed. Wiz Research, which discovered the vulnerability at Pwn2Own Berlin, found that 37% of cloud environments were running affected versions at the time of disclosure.

CVE-2025-23267 (CVSS 8.5) affects the update-ldcache hook in the same toolkit. Attacker-crafted symlinks within a container image cause the host's ldconfig process to overwrite host files, enabling data tampering and denial of service.

Mitigation: Patch to Container Toolkit 1.17.8 and GPU Operator 25.3.1. As a temporary workaround for CVE-2025-23266, set features.disable-cuda-compat-lib-hook = true in /etc/nvidia-container-toolkit/config.toml.
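
A fleet-wide patch check can be as simple as comparing reported versions against the minimums named above. The following is a minimal sketch: the helper names are hypothetical, and how you collect the installed version strings from each node (package manager, node labels, inventory tooling) is left as an assumption.

```python
# Minimum fixed versions taken from the mitigation guidance above.
MIN_CONTAINER_TOOLKIT = (1, 17, 8)
MIN_GPU_OPERATOR = (25, 3, 1)


def parse_version(version: str) -> tuple:
    """Turn '1.17.8' or 'v1.17.8-1' into a comparable tuple such as (1, 17, 8).

    Build and pre-release suffixes after '-' or '+' are ignored, so a release
    candidate of a fixed version is treated as fixed. Adjust if that matters.
    """
    core = version.strip().lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def is_patched(installed: str, minimum: tuple) -> bool:
    """True when the installed version is at or above the minimum fixed version."""
    return parse_version(installed) >= minimum
```

For example, `is_patched("1.17.7", MIN_CONTAINER_TOOLKIT)` is false and `is_patched("1.17.8", MIN_CONTAINER_TOOLKIT)` is true, so any node reporting the former belongs on the patch list.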

The runC container escape trio (CVE-2025-31133, CVE-2025-52565, CVE-2025-52881, November 2025) and the Copy Fail kernel vulnerability (CVE-2026-31431, April 2026) add to the picture. CVE-2026-31431 specifically exploits how Kubernetes shares container image layers via overlay filesystems across pods on the same node: an unprivileged pod corrupts a binary in a shared layer, and a privileged DaemonSet on the same node then executes the corrupted binary with elevated privileges. All three major managed Kubernetes services (EKS, GKE, AKS) were affected before the patch.

RBAC Hardening for ML Pipelines

The default Kubernetes posture is actively dangerous for AI workloads. Every pod automatically receives a mounted service account token at /var/run/secrets/kubernetes.io/serviceaccount/token unless automountServiceAccountToken: false is explicitly set. That token is a JWT authenticating to the Kubernetes API. If the service account has broad RBAC permissions (common in ML pipelines where one shared account handles training, serving, and monitoring), a single compromised pod yields cluster-wide access.
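
Because the mounted token is a JWT, its claims can be inspected to see how long a stolen copy would stay valid. The sketch below decodes the payload without verifying the signature, which is fine for inspection and unsuitable for authentication. The function names are hypothetical, and the assumption that a token without an `exp` claim never expires applies to legacy Secret-based tokens.

```python
import base64
import json
from typing import Optional


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying it (inspection only)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_lifetime_seconds(token: str) -> Optional[int]:
    """Return exp - iat in seconds, or None when the token carries no expiry."""
    claims = decode_jwt_claims(token)
    if "exp" not in claims or "iat" not in claims:
        return None
    return claims["exp"] - claims["iat"]
```

A result of `None`, or a lifetime measured in months, means a single pod compromise hands the attacker a credential that outlives the pod itself.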

In the Slow Pisces / Lazarus cryptocurrency exchange breach documented by Palo Alto Unit 42 in 2025, the attack chain ran from spearphishing to developer workstation to pod deployment to service account token extraction to cluster-wide API enumeration to cloud backend pivot. The stolen token "belonged to a high-privileged management service account with broad RBAC permissions, used by a common CI/CD automation and cluster orchestration system."

Principle of least privilege for ML pipelines:

Create a separate ServiceAccount for each pipeline stage. A data ingestion job needs read access to object storage and write access to a staging namespace. A training job needs that namespace plus a PersistentVolumeClaim for checkpoints. A model-serving deployment needs access to its own secret containing model API credentials and nothing else. A monitoring pod needs read access to metrics endpoints.

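apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-serving-sa
  namespace: inference
# Pods using this account get no token unless they mount one explicitly
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
  name: model-server              # placeholder name
  namespace: inference
spec:
  serviceAccountName: model-serving-sa
  automountServiceAccountToken: false
  containers:
    - name: model-server
      image: registry.example.com/models/model-server@sha256:<digest>   # placeholder image
      volumeMounts:
        - name: sa-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3600   # short-lived token, rotated by the kubelet
              path: token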

Use projected service account tokens with expirationSeconds: 3600 rather than the long-lived default tokens. Never share a service account across pipeline stages. Audit RBAC bindings quarterly: any ClusterRoleBinding granting cluster-admin or broad list/get on secrets to ML workloads is a critical finding.
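
The quarterly binding audit can be partly automated. The sketch below is one possible approach, assuming you have already exported ClusterRoleBindings and ClusterRoles as dicts (for example with `kubectl get clusterrolebindings,clusterroles -o json`). The function names and the choice to treat `watch` as a read verb alongside `get` and `list` are assumptions.

```python
READ_VERBS = {"get", "list", "watch", "*"}


def role_reads_secrets(role: dict) -> bool:
    """True when any rule in the role grants read access to core-group secrets."""
    for rule in role.get("rules") or []:
        groups = set(rule.get("apiGroups", [""]))
        resources = set(rule.get("resources", []))
        verbs = set(rule.get("verbs", []))
        if groups & {"", "*"} and resources & {"secrets", "*"} and verbs & READ_VERBS:
            return True
    return False


def audit_cluster_role_bindings(bindings: list, roles_by_name: dict, ml_namespaces: set) -> list:
    """Return (service account, role, reason) for risky bindings held by ML workloads."""
    findings = []
    for binding in bindings:
        role_name = binding["roleRef"]["name"]
        for subject in binding.get("subjects") or []:
            if subject.get("kind") != "ServiceAccount":
                continue
            if subject.get("namespace") not in ml_namespaces:
                continue
            account = f'{subject["namespace"]}/{subject["name"]}'
            if role_name == "cluster-admin":
                findings.append((account, role_name, "cluster-admin"))
            elif role_reads_secrets(roles_by_name.get(role_name, {})):
                findings.append((account, role_name, "secrets-read"))
    return findings
```

Each returned tuple is a candidate critical finding. The sketch ignores namespaced RoleBindings and aggregated roles, so it complements rather than replaces a dedicated RBAC audit tool.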

Tools for RBAC auditing: rakkess, kubectl-who-can, and InGuardians' open-source Peirates (a Kubernetes penetration testing tool that attackers also use, which makes it useful for red team validation of your posture).

Container Isolation: Kata Containers and MicroVMs for AI Agents

For standard web services, the default container runtime (containerd with runC) provides adequate isolation. For AI agents that execute LLM-generated code, it does not. Any kernel-level container escape CVE (and there were multiple in 2025-2026) can cross the boundary.

Kata Containers places a hardware-enforced VM behind every Kubernetes pod. The pod interacts with a standard CRI interface, but underneath, each workload runs in its own Linux kernel backed by Intel VT-x or AMD-V hardware virtualization. Kernel exploits cannot escape the VM boundary. Kata supports three VMM backends: QEMU (most compatible), Cloud Hypervisor (lightest), and Firecracker (fastest cold start, approximately 125ms).

The kubernetes-sigs/agent-sandbox project, launched at KubeCon Atlanta in November 2025, provides Kubernetes-native CRDs (Sandbox, SandboxTemplate, SandboxClaim) for this pattern. It supports both Kata Containers and gVisor backends and includes WarmPools to pre-warm pods and reduce startup latency for inference use.

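# Illustrative manifest: confirm the apiVersion and field names against the agent-sandbox release you deploy
apiVersion: sandbox.kubernetes.io/v1alpha1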
kind: SandboxTemplate
metadata:
  name: llm-agent-sandbox
spec:
  runtimeClassName: kata-fc
  warmPool:
    size: 3
  shutdownAfter: 30m
  resources:
    limits:
      cpu: "4"
      memory: "8Gi"

The Cohere AI Terrarium vulnerability, CVE-2026-5752 (CVSS 9.3), is an instructive counterexample: prototype chain traversal in a WebAssembly sandbox enabled arbitrary root code execution and container escape. Software-only sandboxes have a smaller attack surface than a bare container, but they are not equivalent to hardware-enforced VM boundaries. For workloads executing untrusted AI-generated code in production, use Kata Containers with Firecracker.

For an in-depth look at how AI agent sandboxing applies at the application layer, see our AI agent sandboxing enterprise security guide.

Securing Model Artifacts: Digest Pinning and Supply Chain Controls

Model artifact security is the area where Kubernetes-native controls matter most and are most often skipped.

Digest pinning. Mutable image tags (model-server:latest) allow a registry compromise or misconfiguration to swap a model image without detection. Pin every model container to its SHA-256 digest:

image: registry.example.com/models/llama3@sha256:a1b2c3d4...

Enforce this through a Kyverno or OPA Gatekeeper admission controller that rejects any pod spec using a mutable tag for model serving images.
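
The rule such an admission policy applies is small enough to sketch directly. The code below is not Kyverno or Gatekeeper policy; it is a hypothetical Python check that mirrors the same logic, useful for linting manifests in CI before they reach the cluster. The function names are assumptions.

```python
import re

# An image reference counts as pinned only when it ends in a full SHA-256 digest.
DIGEST_PINNED = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image: str) -> bool:
    """True for 'repo/name@sha256:<64 hex chars>', false for tag-only references."""
    return DIGEST_PINNED.fullmatch(image) is not None


def unpinned_images(pod_spec: dict) -> list:
    """Return every container image in the pod spec that is not pinned by digest."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["image"] for c in containers if not is_digest_pinned(c["image"])]
```

A CI step can fail the build whenever `unpinned_images` returns a non-empty list for a manifest destined for a model-serving namespace.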

Image signing with Cosign. Sign model artifacts in your CI pipeline using Sigstore Cosign, and configure a Kubernetes admission webhook to verify signatures before scheduling. This creates a cryptographic chain from the build environment to the running pod, blocking any artifact that was not produced by your pipeline.

Model scanning in CI. The LiteLLM supply chain compromise and the Keras vulnerability (November 2025) both involved poisoned dependencies that reached production. Add a model scanner to your CI pipeline that checks for serialization exploits (malicious pickle objects, ONNX graph injections), backdoored weights, and known-vulnerable dependencies. Run this check before every push to your model registry.

SBOM/AIBOM tracking. Maintain a Software Bill of Materials for each model artifact, including training data provenance where available. This supports incident response: if a training data poisoning campaign targets a dataset you used, you can identify which models need retraining.
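
The incident-response lookup described here reduces to a simple query once provenance is recorded. The sketch below uses a hypothetical in-memory record type; a real AIBOM would live in a standard SBOM format or a registry, and the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Minimal provenance record for one production model artifact."""
    name: str
    image_digest: str
    datasets: set = field(default_factory=set)


def models_affected_by(records: list, dataset: str) -> list:
    """Return the names of models whose recorded training data includes the dataset."""
    return sorted(record.name for record in records if dataset in record.datasets)
```

When a poisoning campaign against a public dataset is disclosed, this query turns "which models do we retrain?" into a lookup rather than an investigation.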

The ShadowRay 2.0 campaign (November 2025) demonstrated the concrete risk: with more than 230,000 Ray servers exposed to the internet, attackers maintained persistent access for months, stealing AI models, 240GB of compressed datasets, and source code. They maintained exactly 60% GPU utilization to avoid threshold-based alerts and hid 23.9GB of GPU consumption from Ray's own dashboard.

For additional detail on protecting model infrastructure credentials, see our LLMjacking defense guide.

Monitoring for AI-Specific Anomalies

Standard Kubernetes monitoring (CPU, memory, pod restarts, node health) does not surface AI-specific attack patterns. You need additional signal layers.

Per-key token consumption telemetry. Each consumer of your inference API should have its own API key. Track token consumption per key over a rolling window. A key that suddenly increases from an average of 5,000 tokens per day to 500,000 is either a legitimate usage change or an indicator of credential theft and LLM abuse. The LLMjacking threat specifically targets stolen API keys for compute resale and model abuse.
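
A baseline-deviation check for one API key might look like the sketch below. The thresholds (a 5x ratio, a z-score of 4, and a minimum of seven days of history) are illustrative assumptions, not recommended values, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def is_token_spike(history: list, today: int,
                   min_ratio: float = 5.0, z_threshold: float = 4.0) -> bool:
    """Flag today's token count when it is far above the key's rolling baseline.

    history holds daily token totals for the rolling window, oldest first.
    Both conditions must hold so that noisy low-volume keys do not alert.
    """
    if len(history) < 7:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if baseline == 0:
        return today > 0
    ratio_exceeded = today >= min_ratio * baseline
    z_exceeded = spread == 0 or (today - baseline) / spread >= z_threshold
    return ratio_exceeded and z_exceeded
```

With a week of history averaging 5,000 tokens per day, a day at 500,000 tokens trips both conditions, while a day at 5,600 does not. An alert still needs a human to separate a legitimate usage change from credential theft.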

GPU utilization anomaly detection. The ShadowRay 2.0 campaign introduced a new evasion technique: maintaining GPU usage at exactly 60% to stay below alerting thresholds. Monitor for sustained plateau patterns across multiple nodes, not just absolute thresholds. Unexpected GPU consumers (new processes, new pods) on nodes running inference workloads should trigger investigation.
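
Plateau detection looks for utilization that is suspiciously flat rather than suspiciously high. The sketch below is one simple way to express that, assuming per-node utilization samples in percent at a fixed interval. The window length, spread tolerance, and idle floor are assumptions to tune against your own workloads.

```python
def has_flat_plateau(samples: list, window: int = 60,
                     max_spread: float = 2.0, idle_floor: float = 10.0) -> bool:
    """True when any run of `window` consecutive samples is nearly constant.

    A run qualifies when max - min stays within max_spread and every sample is
    above idle_floor, so an idle GPU sitting at 0% is not reported.
    """
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        if max(chunk) - min(chunk) <= max_spread and min(chunk) > idle_floor:
            return True
    return False


def nodes_with_plateau(per_node: dict, **kwargs) -> list:
    """Return the sorted names of nodes whose utilization shows a flat plateau."""
    return sorted(
        node for node, samples in per_node.items()
        if has_flat_plateau(samples, **kwargs)
    )
```

Real inference traffic tends to be bursty, so several nodes returned by `nodes_with_plateau` at the same utilization level is a stronger signal than any single node.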

Kubernetes audit log analysis. Enable Kubernetes audit logging and alert on:

  • Any pod reading secrets outside its own namespace
  • Service account tokens used from IP addresses different from the node's registered IP
  • New ClusterRoleBinding or RoleBinding creation in production namespaces
  • Privileged pod creation in inference or training namespaces
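
Three of the alert conditions above can be evaluated directly against individual audit events. The sketch below assumes events in the Kubernetes audit log JSON format, with `requestObject` available (which requires the Request or RequestResponse audit level). The alert labels and function name are hypothetical, and the source-IP rule is omitted because it needs a node IP inventory.

```python
def classify_audit_event(event: dict, watched_namespaces: set) -> list:
    """Return alert labels raised by a single Kubernetes audit event."""
    alerts = []
    verb = event.get("verb", "")
    ref = event.get("objectRef") or {}
    resource = ref.get("resource", "")
    namespace = ref.get("namespace", "")
    username = (event.get("user") or {}).get("username", "")

    # Rule 1: a service account reads secrets outside its own namespace.
    parts = username.split(":")
    is_service_account = len(parts) == 4 and parts[:2] == ["system", "serviceaccount"]
    if is_service_account and resource == "secrets" and verb in {"get", "list", "watch"}:
        if parts[2] != namespace:
            alerts.append("cross-namespace-secret-read")

    # Rule 2: new bindings, cluster-wide or inside a watched namespace.
    if verb == "create" and resource == "clusterrolebindings":
        alerts.append("binding-created")
    if verb == "create" and resource == "rolebindings" and namespace in watched_namespaces:
        alerts.append("binding-created")

    # Rule 3: a privileged pod is created in a watched namespace.
    if verb == "create" and resource == "pods" and namespace in watched_namespaces:
        spec = (event.get("requestObject") or {}).get("spec") or {}
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        if any((c.get("securityContext") or {}).get("privileged") for c in containers):
            alerts.append("privileged-pod-created")

    return alerts
```

Feeding each line of the audit log through this function and forwarding non-empty results to your alerting pipeline gives a starting point that a SIEM rule set can later replace.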

Prediction drift as a security signal. Model output quality that degrades suddenly and cannot be explained by input distribution shift is a potential indicator of model artifact tampering. A backdoored model will show normal behavior on typical inputs and anomalous behavior on trigger inputs. Integrate prediction quality monitoring (perplexity scores, output similarity metrics, human evaluation samples) into your security detection pipeline, not just your ML operations pipeline.

Inference API exposure auditing. Model serving endpoints exposed without authentication are a common configuration error in Kubernetes. Internet scans during the ShadowRay 2.0 campaign found 230,000 exposed Ray servers. Run regular external exposure audits. Your inference API should require mTLS or bearer token authentication for all consumers, including internal services.

For organizations that need professional validation of their AI security posture, our AI security audit service covers the full stack from infrastructure to inference.

20-Point Kubernetes AI Workload Security Checklist

GPU Runtime (4 points)

  • NVIDIA Container Toolkit patched to 1.17.8 or later
  • GPU Operator patched to 25.3.1 or later
  • disable-cuda-compat-lib-hook = true set in toolkit config as defense-in-depth
  • GPU workloads running in dedicated namespaces with network policies restricting egress

RBAC and Identity (5 points)

  • Separate ServiceAccount per pipeline stage (ingestion, training, serving, monitoring)
  • automountServiceAccountToken: false on all model-serving pods
  • Projected service account tokens with expirationSeconds: 3600 or shorter
  • No ClusterRoleBinding granting cluster-admin to any ML workload
  • RBAC audit completed in the last 90 days using kubectl-who-can or equivalent

Container Isolation (3 points)

  • Kata Containers or gVisor RuntimeClass applied to all AI agent pods executing untrusted code
  • Pod Security Standards enforced at restricted level for inference namespaces
  • hostPath volumes prohibited in inference and training namespaces via admission control

Model Artifact Integrity (4 points)

  • All model images pinned to SHA-256 digest (no mutable tags in production)
  • Cosign signing enforced in CI, signature verification in admission webhook
  • Model scanner running in CI pipeline before registry push
  • SBOM maintained for all production model artifacts

Secrets Management (2 points)

  • No LLM API keys in environment variables; use external secrets operator with Vault or AWS Secrets Manager
  • etcd encryption at rest enabled

Monitoring and Detection (2 points)

  • Per-key token consumption alerting with baseline deviation thresholds
  • Kubernetes audit log alerting on cross-namespace secret access and privileged pod creation

Putting It Together: The Adversarial Perspective

A red team engagement against a typical enterprise AI cluster in 2026 follows a predictable path. Entry is often through a compromised developer workstation or a poisoned dependency in the ML pipeline (the LiteLLM campaign vector). The attacker finds a service account token at /var/run/secrets/kubernetes.io/serviceaccount/token and uses it to enumerate RBAC permissions. A shared ML pipeline service account with list/get on secrets grants access to every credential in the namespace. The attacker pivots to cloud credentials, accesses the model registry, and either exfiltrates model weights or swaps a poisoned image for the production model.

The GPU runtime provides a second entry path for attackers with the ability to submit container images to the cluster. CVE-2025-23266 demonstrated that a three-line Dockerfile is sufficient for host root access if the Container Toolkit is unpatched.

Neither path requires AI-specific knowledge or tooling. Standard container security tools and techniques work against AI clusters because AI clusters run on the same infrastructure. The AI-specific risk is the blast radius: model theft, training data exfiltration, API key harvesting for resale, and silent model integrity compromise are consequences that general-purpose Kubernetes security controls do not surface.

If your organization is deploying LLM workloads on Kubernetes and has not conducted a dedicated AI infrastructure security assessment, our AI penetration testing service covers the full threat model described in this guide: GPU runtime vulnerabilities, RBAC misconfiguration paths, model artifact integrity, and inference API exposure.

Conclusion

Kubernetes AI security requires two distinct security layers applied together. The infrastructure layer (GPU runtimes, RBAC, container isolation, secrets management) addresses the attack surface that AI workloads inherit from the cluster. The AI layer (model integrity, prompt controls, inference monitoring, RAG security) addresses the attack surface that AI workloads introduce on top of it. Neither layer is optional.

The CNCF's March 2026 analysis was clear: operational health does not equal security. A Kubernetes deployment that passes every standard benchmark can still be running vulnerable GPU runtimes, sharing service account tokens across pipeline stages, skipping model artifact signing, and missing the monitoring signals that indicate AI-specific attacks.

The checklist in this guide covers the controls that matter. Start with the GPU runtime patches if you have not applied them. Add RBAC segmentation next. Build toward model artifact signing and AI-specific monitoring as your program matures. For organizations that need an outside view of their current exposure, start with a security assessment.


References: NIST AI 600-1 (Generative AI Risk Management Framework) | OWASP Top 10 for LLM Applications 2025 | CNCF: LLMs on Kubernetes Part 1 (March 2026) | Wiz Research: CVE-2025-23266 NVIDIAScape | Datadog Security Labs: LiteLLM Supply Chain Campaign


BeyondScale Team

AI Security Team, BeyondScale Technologies

Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.
