LLM Inference API Security: Hardening AI Endpoints

BeyondScale Team

AI Security Team

LLM inference APIs are the most exposed surface in an AI deployment. Every application that calls an AI model, whether through AWS Bedrock, Azure OpenAI, or a self-hosted vLLM cluster, exposes an inference endpoint that attackers actively target for credential theft, cost amplification, and data exfiltration. This guide covers the attack surface that application-layer security guides miss: token exhaustion, gateway-layer controls, credential hygiene, and provider-specific hardening.

Key Takeaways

    • Token-aware rate limiting is mandatory. Request-count limits do not constrain LLM abuse.
    • LLMjacking costs victims $46,000 to $100,000+ per day per abused credential set.
    • Self-hosted vLLM and Ollama ship with authentication gaps that require explicit remediation.
    • Real CVEs (CVE-2026-7482, CVE-2025-30165, CVE-2026-22778) have hit major inference frameworks in 2025 and 2026.
    • Cost anomaly alerting is the highest-signal early indicator of inference endpoint abuse.
    • OWASP LLM Top 10 2025 and NIST AI RMF provide the frameworks that compliance teams need to reference.
    • Gateway placement, not application code, is where inference security belongs architecturally.

The Inference Endpoint Attack Surface

The inference layer sits between your application and the underlying model. Most security attention goes to application-layer concerns, like prompt injection and output validation. But the inference endpoint itself is a distinct attack surface with distinct threats.

What attackers target:

Credentials. AWS Bedrock, Azure OpenAI, Anthropic, and GCP Vertex AI API keys are discovered in public repositories, extracted via web application vulnerabilities (CVE-2021-3129 and similar Laravel/framework RCEs are a known vector), and phished from developers. Stolen keys are sold through LLMjacking proxy services that let buyers query expensive models at victim cost.

Token economics. A single request can consume 100,000 tokens if the attacker sets max_tokens to the ceiling and sends a large context. Naive rate limiting at the request level provides no protection against this.

Framework vulnerabilities. vLLM carries CVE-2026-22778 (CVSS 9.8, RCE via malicious video URLs), CVE-2025-30165 (CVSS 8.0, RCE via pickle deserialization in ZeroMQ), and CVE-2026-25960 (SSRF bypass enabling IAM credential exfiltration). SGLang carries CVE-2026-3059, CVE-2026-3060, and CVE-2026-3989, three critical RCEs from pickle deserialization against multimodal endpoints. These are not theoretical: they have published PoC exploits.

Exposed endpoints. The Orca Security research team documented that CVE-2026-7482 ("Bleeding Llama") affected over 300,000 Ollama servers globally. Ollama has no built-in authentication, and although it listens on 127.0.0.1 by default, container and remote-access setups commonly rebind it to 0.0.0.0. Any server reachable from the internet was vulnerable.

The inference endpoint is infrastructure security territory. Treat it accordingly.

Authentication Patterns for Inference Endpoints

The starting point is ensuring every inference call is authenticated and authorized. This sounds obvious but breaks in practice across several common deployment patterns.

API key authentication at the gateway layer

Cloud providers (AWS Bedrock, Azure OpenAI, GCP Vertex AI) handle authentication at their own layer through IAM policies and API keys. The risk is key sprawl. Each developer, service, and CI/CD pipeline that calls an inference API accumulates credentials. Without rotation schedules and least-privilege scoping, a single leaked key grants unlimited access.

Minimum controls:

  • One key per service, never shared across services or environments.
  • Scoped permissions: a key used for inference should not have permissions to create new models, access training data, or modify guardrail configurations.
  • Rotation on a schedule (90 days maximum) and immediately on any suspected exposure.
  • Secret scanning in CI/CD pipelines using tools like GitHub's built-in secret detection or truffleHog to catch keys before they land in repositories.

Service-to-service authentication

For internal services calling inference endpoints, replace static API keys with OIDC workload identity (AWS IAM Roles for Service Accounts, Azure Managed Identity, GCP Workload Identity Federation). Workload identity removes long-lived credentials from the picture: the platform issues the service a short-lived token scoped to its identity, so there is no static key to store in environment variables or a secrets manager.

Azure Managed Identity for Azure OpenAI is the clearest example: the calling service assumes its Entra ID identity, Azure issues a time-limited bearer token scoped to OpenAI operations, and no API key is stored anywhere.

mTLS for self-hosted inference

For vLLM or Ollama deployments inside a private network, mutual TLS between calling services and the inference endpoint provides authentication and encryption without requiring an external identity provider. The inference server presents a certificate; the calling service presents a certificate. Both are verified. This is particularly relevant in air-gapped environments or where the inference cluster handles sensitive data.
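
As a rough illustration of the server side of that handshake, here is a minimal sketch using Python's standard `ssl` module, assuming a Python-based proxy sits in front of the inference server. The function names and the file-path parameters are hypothetical; the key step is setting `verify_mode` to `CERT_REQUIRED` so that a caller without a valid client certificate is rejected during the TLS handshake.

```python
import ssl

def require_client_certs(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Configure a server-side context to reject callers without a valid client cert."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # this is what makes the TLS "mutual"
    return ctx

def build_server_context(certfile: str, keyfile: str, client_ca_file: str) -> ssl.SSLContext:
    """Build an mTLS server context (paths are placeholders supplied by the caller)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)  # the server's own identity
    ctx.load_verify_locations(cafile=client_ca_file)         # CA that signs client certs
    return require_client_certs(ctx)
```

Using a private CA dedicated to inference clients, rather than a public trust store, keeps the set of accepted callers small and auditable.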

The vLLM authentication gap

One non-obvious issue: vLLM's --api-key flag only protects the /v1 API path. The /invocations endpoint used in SageMaker-compatible deployments does not enforce the same key requirement. Any caller that knows the endpoint URL and hits /invocations bypasses authentication entirely. If you deploy vLLM behind an API gateway, configure the gateway to require authentication on all paths, not just /v1.
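
The "all paths" rule can be sketched as a gateway-side check that never branches on the request path. This is an illustrative sketch only: `authorize`, `EXPECTED_TOKEN`, and the bearer-token scheme are assumptions for the example and are not part of vLLM or any specific gateway product.

```python
import hmac

# Hypothetical shared secret; in practice it would be loaded from a secrets manager.
EXPECTED_TOKEN = "example-token"

def authorize(path: str, headers: dict[str, str]) -> bool:
    """Return True only if the request carries the expected bearer token.

    The path is accepted as an argument but deliberately never used to skip
    the check, so /invocations is treated exactly like /v1/*.
    """
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(token.encode(), EXPECTED_TOKEN.encode())
```

A deny-by-default check like this fails safe when the inference server adds a new route that nobody has reviewed yet.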

For more detail on AWS Bedrock-specific IAM configuration, see our AWS Bedrock Security Enterprise Guide and the equivalent Azure OpenAI Security Enterprise Guide.

Token-Aware Rate Limiting

Standard rate limiting counts requests per time window. For LLM inference, this is inadequate: a 50-token prompt and a 10,000-token prompt each count as one request, yet the second consumes 200x more input tokens. Request-count limits do not bound cost or compute consumption.

Token budget controls

The correct primitive is a token budget per key per time window. Set separate limits for:

  • Input tokens: bounds the cost of prompts.
  • Output tokens: bounds max_tokens exploitation (setting max_tokens to the ceiling on every request).
  • Total tokens (input + output combined): the primary cost-control lever.

Implementations available without building from scratch:

  • Azure API Management: llm-token-limit policy, configurable per API key or per product.
  • Kong AI Rate Limiting Advanced plugin: token-level limits per consumer.
  • Apache APISIX: ai-rate-limiting plugin supporting token-based quotas.
  • Portkey: per-virtual-key token budgets with automatic enforcement.
  • Red Hat OpenShift AI: TokenRateLimitPolicy for inference serving.
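
For teams that want to reason about the primitive before adopting one of these products, the three budgets can be sketched as a fixed-window counter per key. This is a minimal in-memory sketch, not how any of the listed gateways implement it; the class name, the fixed-window choice, and the injectable `clock` are assumptions. Output tokens are only known after generation, so a real gateway would likely reserve `max_tokens` up front and reconcile afterwards.

```python
import time
from collections import defaultdict

class TokenBudget:
    """Fixed-window token budget per API key (illustrative, in-memory only)."""

    def __init__(self, max_input: int, max_output: int, max_total: int,
                 window_seconds: float = 60.0, clock=time.monotonic):
        self.limits = {"input": max_input, "output": max_output, "total": max_total}
        self.window_seconds = window_seconds
        self.clock = clock
        # key -> [window_start, input_used, output_used]
        self._usage = defaultdict(lambda: [self.clock(), 0, 0])

    def try_consume(self, key: str, input_tokens: int, output_tokens: int) -> bool:
        """Record usage and return True if all three budgets still hold."""
        state = self._usage[key]
        now = self.clock()
        if now - state[0] >= self.window_seconds:
            state[:] = [now, 0, 0]  # start a fresh window
        new_in = state[1] + input_tokens
        new_out = state[2] + output_tokens
        if (new_in > self.limits["input"]
                or new_out > self.limits["output"]
                or new_in + new_out > self.limits["total"]):
            return False  # rejected requests are not counted against the key
        state[1], state[2] = new_in, new_out
        return True
```

In a multi-instance gateway the counters would need to live in shared storage rather than process memory.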

Max_tokens capping at the gateway

Do not rely on clients to set reasonable max_tokens values. The gateway should enforce a hard ceiling. A request with max_tokens: 32000 from a context that has no legitimate reason to generate that volume is a signal. Cap max_tokens at the gateway layer and log any requests that hit the ceiling consistently.
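
A minimal sketch of that clamp, assuming the request body is an OpenAI-style JSON object already parsed into a dict. The function name and the choice to treat a missing or invalid `max_tokens` as "apply the ceiling" are assumptions for illustration.

```python
def cap_max_tokens(body: dict, ceiling: int) -> tuple[dict, bool]:
    """Return a copy of the request body with max_tokens clamped to the ceiling.

    The second element is True when the gateway had to override the client's
    value (too large, missing, or invalid), which is the event worth logging.
    """
    requested = body.get("max_tokens")
    valid = (isinstance(requested, int) and not isinstance(requested, bool)
             and 1 <= requested <= ceiling)
    if valid:
        return dict(body), False
    capped = dict(body)  # never mutate the caller's object
    capped["max_tokens"] = ceiling
    return capped, True
```

For example, `cap_max_tokens({"max_tokens": 32000}, 1024)` returns a body with `max_tokens` set to 1024 and the override flag set, which the gateway can count per key.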

Streaming request timeouts

Streaming inference responses (Server-Sent Events) stay open until the model finishes generating. An attacker can initiate a streaming request with a large max_tokens value and hold the connection open, consuming tokens and blocking capacity. Enforce a maximum streaming duration and disconnect requests that exceed it.
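
One way to bound total stream duration is to wrap the upstream chunk iterator in a generator that checks a deadline. The sketch below is an assumption about how a Python-based proxy might do this; `enforce_stream_deadline` and `StreamTimeout` are hypothetical names. It only checks between chunks, so a socket-level read timeout is still needed for an upstream that stalls completely.

```python
import time
from typing import Iterable, Iterator

class StreamTimeout(Exception):
    """Raised when a streaming response exceeds its allowed duration."""

def enforce_stream_deadline(chunks: Iterable[bytes], max_seconds: float,
                            clock=time.monotonic) -> Iterator[bytes]:
    """Yield chunks from an upstream stream until max_seconds has elapsed."""
    deadline = clock() + max_seconds
    for chunk in chunks:
        if clock() > deadline:
            # The caller should close the upstream connection when this fires.
            raise StreamTimeout(f"stream exceeded {max_seconds}s")
        yield chunk
```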

Cost anomaly alerting

Set baselines for expected token consumption per API key per hour. Alert when any key exceeds 2-3x its baseline within a rolling window. Cost anomaly alerting is the highest-signal early indicator of LLMjacking: stolen credentials are typically used immediately at maximum volume, creating a sharp spike that deviates from any baseline.
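
The baseline comparison can be sketched as a rolling-window sum per key. This is a simplified illustration with assumed names (`UsageMonitor`, `record`); the baseline here is a fixed number supplied by the operator, whereas a production system would derive it from historical usage.

```python
from collections import deque

class UsageMonitor:
    """Rolling-window token usage for one API key, compared against a baseline."""

    def __init__(self, baseline_tokens_per_window: int, window_seconds: float,
                 multiplier: float = 3.0):
        self.baseline = baseline_tokens_per_window
        self.window_seconds = window_seconds
        self.multiplier = multiplier
        self._events = deque()  # (timestamp, tokens), oldest first
        self._total = 0

    def record(self, timestamp: float, tokens: int) -> bool:
        """Add a usage event; return True if the window now exceeds the alert threshold."""
        self._events.append((timestamp, tokens))
        self._total += tokens
        # Drop events that have aged out of the rolling window.
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            _, old = self._events.popleft()
            self._total -= old
        return self._total > self.baseline * self.multiplier
```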

Cost Amplification Attacks and Defenses

Beyond rate limiting, several attack patterns specifically target inference cost:

Token flooding

The direct attack: send requests with maximum-length prompts and maximum max_tokens values. Without token-aware limits, this generates unlimited cost from a single valid credential. Defense: token budgets per key, max_tokens ceiling enforcement, input length limits.

Streaming abuse

Hold many streaming connections open simultaneously. Each connection consumes a decode slot in the inference server. Defense: connection count limits per key, streaming timeout enforcement.
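
The connection-count defense can be sketched as a per-key slot counter held for the lifetime of each stream. The names below are hypothetical, and the sketch is single-process and not thread-safe; a real gateway would add a lock or use shared storage.

```python
from collections import Counter
from contextlib import contextmanager

class TooManyStreams(Exception):
    """Raised when a key already holds its maximum number of open streams."""

class StreamLimiter:
    def __init__(self, max_streams_per_key: int):
        self.max_streams = max_streams_per_key
        self._open = Counter()  # key -> number of currently open streams

    @contextmanager
    def slot(self, key: str):
        """Hold one streaming slot for `key` for the duration of the with-block."""
        if self._open[key] >= self.max_streams:
            raise TooManyStreams(key)
        self._open[key] += 1
        try:
            yield
        finally:
            self._open[key] -= 1  # always release, even if the stream errors out
```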

Tool loop exploitation in agentic systems

Agentic workflows that call LLMs iteratively (planner-executor patterns, ReAct loops, multi-step tool use) can spiral into hundreds of LLM calls per user session if the agent enters a failure loop. A single poisoned tool response or unexpected environment state can trigger recursive retries that generate thousands of dollars in costs from one user session.

This pattern (sometimes called AbO-DDoS, Agent-Based Orchestrated Denial of Service) converts a low-cost input into high-cost repeated inference by exploiting the agent's retry and recovery logic. Defense: per-session call limits, per-session token ceilings, circuit breakers that halt execution after a configurable number of failed tool calls.
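
The three defenses can be combined into one per-session guard that the agent loop consults before every LLM call. This is a sketch under stated assumptions: the class and method names are hypothetical, and the breaker here trips on consecutive tool failures, which is one reasonable reading of "a configurable number of failed tool calls".

```python
from dataclasses import dataclass

class SessionHalted(Exception):
    """Raised when a session exceeds one of its limits."""

@dataclass
class SessionGuard:
    max_calls: int
    max_tokens: int
    max_consecutive_tool_failures: int
    calls: int = 0
    tokens: int = 0
    consecutive_failures: int = 0

    def before_llm_call(self) -> None:
        """Check limits before each LLM call; raise instead of letting the loop continue."""
        if self.calls >= self.max_calls:
            raise SessionHalted("call limit reached")
        if self.tokens >= self.max_tokens:
            raise SessionHalted("token ceiling reached")
        if self.consecutive_failures >= self.max_consecutive_tool_failures:
            raise SessionHalted("circuit open: repeated tool failures")
        self.calls += 1

    def after_llm_call(self, tokens_used: int) -> None:
        self.tokens += tokens_used

    def record_tool_result(self, ok: bool) -> None:
        # A success resets the streak; only an unbroken run of failures trips the breaker.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
```

Raising an exception, rather than returning a flag, makes it harder for retry logic inside the agent framework to silently swallow the halt.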

For a deeper treatment of this threat pattern, see our LLMjacking Defense Guide covering credential-based cost amplification specifically.

LLM Gateway Security Controls

An LLM gateway sits between your application and the inference endpoint. It enforces policies that would otherwise require changes to every calling application. Placing security controls at the gateway layer is the correct architectural choice: centralized policy, one enforcement point.

Input inspection

At the gateway layer, inspect incoming prompts for:

  • Prompt injection attempts: instructions embedded in user input that attempt to override system prompts.
  • Sensitive data: PII, credentials, source code patterns, internal identifiers that should not be sent to external model providers.
  • Adversarial payloads: attempts to extract system prompts, model fingerprinting probes.

Purpose-built classifiers are a better fit for this inspection than reusing the protected LLM as its own safety classifier. Using the target model to evaluate its own inputs creates a dual-compromise vector: any bypass that defeats the model also defeats its own safety evaluation.

Output filtering

Inspect model responses before returning them to callers:

  • Sensitive data in outputs (model hallucinating real PII, credentials appearing in generated code).
  • Policy violations: content categories that should not appear in responses for the given deployment context.
  • Prompt echo: if the response contains the full system prompt, the system prompt has been extracted.
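
The prompt echo check can be approximated with a substring comparison after normalization. This sketch is deliberately stricter than the bullet above: it flags any sufficiently long run of the system prompt, not only a full copy. The function name and the 40-character default are assumptions, and paraphrased leaks will not be caught by this approach.

```python
def leaks_system_prompt(response: str, system_prompt: str, min_chars: int = 40) -> bool:
    """Return True if the response contains any min_chars-long run of the system prompt.

    Whitespace and case are normalized first so trivial reformatting does not hide a leak.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    resp, prompt = normalize(response), normalize(system_prompt)
    if len(prompt) <= min_chars:
        # Short prompts: only a full copy counts; an empty prompt never matches.
        return bool(prompt) and prompt in resp
    return any(prompt[i:i + min_chars] in resp
               for i in range(len(prompt) - min_chars + 1))
```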

Model routing policies

Gateways can route requests to different models based on cost, latency, and compliance requirements. Route lower-sensitivity queries to smaller, cheaper models. Reserve expensive frontier models for tasks that require them. This is a cost control measure that also limits the blast radius of any credential abuse: if the key used for everyday queries can only reach a smaller model, that key is worth less to an attacker when stolen.
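
A routing policy of this kind can be sketched as a lookup that grants the expensive tier only when both the task and the key justify it. The model names, task labels, and token ceilings below are placeholders invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: int

# Hypothetical tiers; model names and ceilings are placeholders.
ROUTES = {
    "default": Route(model="small-model", max_tokens=1024),
    "frontier": Route(model="frontier-model", max_tokens=4096),
}

# Hypothetical task labels that justify the expensive tier.
FRONTIER_TASKS = {"code-review", "long-form-analysis"}

def choose_route(task: str, key_allowed_tiers: set[str]) -> Route:
    """Pick the frontier tier only when the task needs it AND the key is entitled to it."""
    if task in FRONTIER_TASKS and "frontier" in key_allowed_tiers:
        return ROUTES["frontier"]
    return ROUTES["default"]
```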

Audit logging

Every inference call should produce a log entry including: API key (hashed), timestamp, prompt token count, completion token count, model version, request duration, and whether output filtering triggered. These logs are the foundation of the monitoring and incident response capability described below.
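
A log entry with those fields might be built as follows. This is a minimal sketch: the function and field names are assumptions, and an unsalted SHA-256 is used on the premise that API keys are long random strings. For low-entropy identifiers, a keyed hash (HMAC) would be the safer choice.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(api_key: str, model_version: str, prompt_tokens: int,
                 completion_tokens: int, duration_ms: float,
                 output_filter_triggered: bool) -> str:
    """Build one JSON log line for an inference call; the raw key is never logged."""
    entry = {
        "api_key_sha256": hashlib.sha256(api_key.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "duration_ms": duration_ms,
        "output_filter_triggered": output_filter_triggered,
    }
    return json.dumps(entry, sort_keys=True)
```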

For implementation guidance on guardrails and gateway configuration, see our LLM Guardrails Implementation Guide.

Credential Hygiene and Secret Management

Where credentials leak

The most common LLMjacking credential sources:

  • Hardcoded in source code committed to public or compromised private repositories.
  • In .env files checked into version control.
  • In CI/CD pipeline environment variables logged in build output.
  • Extracted from running container environments via application vulnerabilities (SSRF, path traversal, RCE).

CVE-2026-25960 in vLLM demonstrates the SSRF vector: an attacker sends a crafted URL to vLLM's load_from_url_async function, bypassing SSRF protection, and issues requests to cloud metadata endpoints (169.254.169.254) to retrieve IAM credentials attached to the instance. From there, the attacker has full access to any AWS Bedrock models the instance role can invoke.
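
As a generic illustration of the defense (not vLLM's actual fix), a URL fetcher can refuse any host that resolves to a non-public address. The function name and the injectable `resolve` parameter are assumptions. This check alone does not stop DNS rebinding or redirects: the fetch should connect to the address that was vetted and re-validate every redirect target.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def is_safe_fetch_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject URLs whose host resolves to a non-public address.

    Covers the cloud metadata address 169.254.169.254 (link-local) as well
    as loopback and private ranges. `resolve` is injectable for testing.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, None)
    except OSError:
        return False  # unresolvable hosts are rejected, not retried
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if not addr.is_global:
            return False  # one bad address is enough to refuse the URL
    return bool(infos)
```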

Rotation and vault management

  • Store credentials in a secrets manager (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault). Never in environment variables set at deployment time.
  • Rotation: 90-day maximum for long-lived keys. Workload identity replaces rotation entirely for service-to-service calls.
  • Audit secret access: who retrieved a given credential and when.
  • Revoke immediately on any suspected compromise. The cost of a brief outage during credential rotation is orders of magnitude lower than the cost of one day of LLMjacking.

Secret scanning in CI/CD

Configure secret scanning on every commit and pull request. GitHub's secret scanning with push protection blocks commits containing recognized credential patterns. For patterns not covered by built-in detectors (custom API key formats, internal service tokens), add custom regex patterns.
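
To show what a custom pattern looks like, here is a small scanner sketch. The `AKIA` prefix for AWS access key IDs is a widely documented format; the `bsk_` token format is invented purely for illustration, as are the function and dictionary names.

```python
import re

# AKIA-prefixed AWS access key IDs are a well-known format; the "bsk_" pattern
# is a made-up internal token format used only as an example.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "internal_service_token": re.compile(r"\bbsk_[A-Za-z0-9]{32}\b"),
}

def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every match; the secret itself is not returned."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Reporting only the line number and pattern name keeps the scanner's own output from becoming another place where the secret is exposed.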

Monitoring and Alerting for Inference Abuse

What to monitor

Per-key token consumption: alert at 2x and 5x baseline. Hard block at 10x.

Cost anomalies: configure billing alerts at cloud provider level (AWS Cost Anomaly Detection, Azure Cost Management alerts, GCP Cloud Billing alerts). These catch abuse even when application-level monitoring has gaps.

Geographic anomalies: requests from regions that never access the system in normal operation, or simultaneous access from multiple geographies for the same key (impossible travel).

Error rate spikes: a sudden increase in 429 (rate limit) or 403 (permission denied) errors from a single key often indicates an attacker probing limits.

Prompt length distribution: monitor the distribution of input token counts. A shift toward consistently maximum-length prompts from a particular key is a cost amplification signal.
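
One simple statistic for the prompt length signal is the share of a key's recent requests that sit near the context limit. The function name and the 0.9 threshold are assumptions for this sketch; the alerting threshold on the resulting share would need tuning per workload.

```python
def max_length_share(input_token_counts: list[int], context_limit: int,
                     near_max_ratio: float = 0.9) -> float:
    """Fraction of requests whose prompt is within near_max_ratio of the context limit."""
    if not input_token_counts:
        return 0.0
    near_max = sum(1 for n in input_token_counts
                   if n >= near_max_ratio * context_limit)
    return near_max / len(input_token_counts)
```

Comparing this share for the current hour against the same key's historical value surfaces the shift described above without storing prompt contents.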

Incident response for inference abuse

When a key is suspected compromised:

  • Revoke the key immediately.
  • Issue a new key with scoped permissions and rotate the associated workload.
  • Audit the logs for the compromised key: what was it used to query? What data may have been extracted from responses?
  • Check whether the credential was exposed elsewhere (same key used in multiple services).
  • File an incident report with the provider if the abuse generated unexpected charges.

For a structured monitoring approach, see our LLM Security Monitoring Enterprise Guide.

Provider-Specific Hardening

AWS Bedrock

  • Enable AWS PrivateLink to remove the inference endpoint from the public internet entirely. All traffic stays within your VPC.
  • Use IAM least-privilege: the role calling Bedrock should have only bedrock:InvokeModel on the specific model ARNs it needs, nothing else.
  • Enable Amazon Bedrock Guardrails for content filtering and sensitive data redaction. Guardrails operate server-side at the Bedrock layer, before responses reach your application.
  • Enable CloudTrail for Bedrock API activity. Every InvokeModel call produces a CloudTrail event recording the calling principal, the model, and the timestamp.
  • Enable Cost Anomaly Detection with a monitor scoped to Bedrock. Set alert thresholds at 20% and 50% above expected daily spend.
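
The least-privilege bullet can be made concrete by generating the policy document from an explicit list of model ARNs. This is a sketch: the helper name and the `Sid` value are assumptions, and the ARN in the usage note is a placeholder. A service that streams responses would also need bedrock:InvokeModelWithResponseStream, which this sketch intentionally leaves out.

```python
import json

def bedrock_invoke_policy(model_arns: list[str]) -> dict:
    """Build an IAM policy document allowing only bedrock:InvokeModel on the given ARNs."""
    if not model_arns or any("*" in arn for arn in model_arns):
        raise ValueError("list explicit model ARNs; wildcards defeat least privilege")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "InvokeSpecificModelsOnly",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": sorted(model_arns),
        }],
    }

def to_json(policy: dict) -> str:
    """Serialize the policy for attachment via your usual IAM tooling."""
    return json.dumps(policy, indent=2)
```

For example, `to_json(bedrock_invoke_policy(["arn:aws:bedrock:us-east-1::foundation-model/example-model-id"]))` yields a document that can be reviewed in a pull request before it is attached to the role.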

Azure OpenAI

  • Use Managed Identity (Entra ID) instead of API keys for all service-to-service access.
  • Enable private endpoints. Remove public endpoint access entirely for production deployments.
  • Configure Azure API Management in front of Azure OpenAI to enforce token-rate limits, audit logging, and input/output policies centrally.
  • Enable Azure AI Content Safety for input and output filtering.
  • Configure Diagnostic Settings to send all OpenAI API logs to Azure Monitor. Alert on token usage anomalies using Log Analytics queries.

Self-Hosted vLLM

  • Bind the API server to a private interface only. Never 0.0.0.0 in production.
  • Place an authenticated reverse proxy (nginx with mTLS, or a commercial API gateway) in front. Configure the proxy to require authentication on all paths including /invocations.
  • Apply patches promptly. vLLM releases are frequent and security fixes do not always receive prominent announcement. Subscribe to their GitHub security advisories.
  • Disable model downloads from external registries in production. Pin model versions.
  • Run vLLM in a container with read-only root filesystem and no privilege escalation.

Self-Hosted Ollama

  • Keep OLLAMA_HOST at 127.0.0.1 or set it to a private IP. Binding to 0.0.0.0, which container and remote-access setups commonly do, is the primary exposure vector.
  • Patch CVE-2026-7482 immediately if on any affected version. This vulnerability allows heap memory exfiltration from crafted GGUF files.
  • Do not expose the Ollama API port (11434) outside the server without authentication middleware.
  • Restrict model pull to approved sources. Ollama will pull any model by name from the registry by default.

Compliance Framework Alignment

The OWASP Top 10 for LLM Applications 2025 provides the primary risk framework. The inference-layer threats covered in this guide map to LLM10:2025 Unbounded Consumption (token flooding, streaming abuse, and cost amplification; this category absorbed the earlier Model Denial of Service entry), LLM02:2025 Sensitive Information Disclosure (credential and data exposure), LLM03:2025 Supply Chain (vulnerabilities in inference frameworks), and LLM07:2025 System Prompt Leakage (prompt extraction through the endpoint).

The NIST AI Risk Management Framework and NIST AI 600-1 (Generative AI Profile) provide the governance overlay. NIST IR 8596 specifically covers protecting API keys, model endpoints, and inference infrastructure as part of the AI cybersecurity profile.

For organizations subject to SOC 2, HIPAA, or FedRAMP: AWS Bedrock and Azure OpenAI both maintain relevant certifications, but the certifications cover the provider's infrastructure. Your configuration choices, IAM policies, network controls, and monitoring practices are within your compliance scope regardless of provider certification.

Conclusion

Securing LLM inference endpoints requires a different mental model than securing standard APIs. The variable cost structure of token-based compute means request-count security controls leave the most dangerous attack vectors open. The authentication gaps in self-hosted frameworks like vLLM and Ollama require explicit remediation that does not happen by default.

The core controls are: token-aware rate limiting at the gateway, workload identity instead of static API keys for service accounts, cost anomaly alerting at the provider billing layer, and vulnerability patching for inference frameworks on a short cycle given the frequency of critical CVEs.

If you are deploying LLM inference infrastructure and want an independent assessment of your endpoint security posture, BeyondScale runs infrastructure-focused AI security assessments covering authentication, rate limiting configuration, credential hygiene, and monitoring coverage. You can also start with a self-service scan of your AI attack surface to identify the most exposed endpoints.

BeyondScale Team

AI Security Team, BeyondScale Technologies

Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.
