AI Model Extraction Attacks: Stop LLM Theft

BeyondScale Team

AI Security Team

11 min read

AI model extraction attacks let adversaries approximate a proprietary LLM's behavior cheaply: one published attack reached 73% Exact Match similarity with GPT-3.5-Turbo on a question-answering benchmark for roughly $50 in API costs. If your organization has deployed a fine-tuned model via a public or semi-public inference API, that model's behavior, knowledge, and commercial value are potentially replicable without your consent. This guide explains how model extraction works, why common defenses fall short, and what a defense-in-depth stack looks like.

Key Takeaways

    • Researchers built a surrogate with 73% Exact Match similarity to GPT-3.5-Turbo for about $50 using 2023 techniques; costs have only decreased since
    • Three attack phases drive extraction: systematic API querying, logit harvesting, and shadow model distillation
    • Rate limiting alone fails for at least five documented reasons
    • Effective detection requires behavioral fingerprinting of query distributions, not just counting requests
    • Output perturbation, adaptive rate limits, model watermarking, and audit logging form the practical defense stack
    • OWASP LLM10 and MITRE ATLAS AML.T0024.002 are the canonical threat framework references
    • NIST AI 100-2 E2023 formally classifies model extraction as a distinct adversarial ML attack category

How Model Extraction Works: Three Phases

Model extraction is not a single technique. It is a coordinated three-phase attack that exploits inference APIs as designed, making it difficult to block without understanding the full sequence.

Phase 1: Systematic API Querying

The attacker constructs a transfer set — a collection of inputs designed to probe the target model's behavior across its full input distribution. Unlike a legitimate user, who queries the model for specific tasks, an extractor designs queries to maximize coverage of the model's decision space.

The query strategy varies by objective:

Coverage-first extraction: Random or diverse inputs spread across the input space. The foundational "Thieves on Sesame Street" research (Krishna et al., 2020, arXiv:1910.12366) demonstrated that random word sequences are sufficient to extract BERT-based NLP APIs for a few hundred dollars, with the resulting surrogate achieving near-original benchmark performance.

Boundary-probing extraction: Inputs designed to reveal the model's decision boundaries — where output probability distributions shift between classes or topic areas. This is more efficient for classification-heavy models.

Distillation-optimized extraction: Inputs selected to maximize the information content of each query. The "Model Leeching" research (Birch et al., 2023, arXiv:2309.10544) used SQuAD QA pairs as the transfer set, achieving 73% Exact Match similarity with GPT-3.5-Turbo for approximately $50.

In practice, we see extraction attempts starting with general-purpose prompts that probe breadth, then narrowing to task-specific query patterns once the attacker understands the model's domain.

Phase 2: Logit and Output Harvesting

The attacker records not just predicted text but the full output distribution: token probabilities, log probabilities, or confidence scores. This information is dramatically more useful than hard labels for training a surrogate.

When APIs return soft probability distributions over outputs, the information advantage is compounded. A 2024 paper from Carlini et al. (arXiv:2403.06634, with authors from Google DeepMind, ETH Zurich, and other institutions) demonstrated that by exploiting mathematical properties of transformer output layers, attackers can recover the embedding projection matrix (up to symmetries) and the model's hidden dimension from black-box API access alone. They extracted complete projection matrices from OpenAI's Ada and Babbage models for under $20.

The practical implication: APIs that return log probabilities are significantly more vulnerable than those returning only top predictions. Truncating or rounding probability outputs is a meaningful mitigation.
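
To see why full logits leak architecture, here is a toy illustration of the observation the Carlini et al. attack builds on: every logit vector is a linear image of a hidden state of width h, so a stack of logit vectors has rank at most h. The sketch simulates a final layer with made-up sizes (vocabulary 40, hidden width 5) and measures rank with plain Gaussian elimination. The function names and dimensions are assumptions for illustration; the published attack works on real API outputs and uses singular value decomposition rather than this simplified rank computation.

```python
import random

def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_cols = len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column is (numerically) zero below the current rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

def simulate_logits(n_queries, vocab_size, hidden_dim, seed=0):
    """Hypothetical model head: logits = W @ h for a random hidden state h."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(vocab_size)]
    logits = []
    for _ in range(n_queries):
        h = [rng.gauss(0, 1) for _ in range(hidden_dim)]
        logits.append([sum(w * x for w, x in zip(row, h)) for row in W])
    return logits

# 12 queries, each returning a full 40-entry logit vector.
logits = simulate_logits(n_queries=12, vocab_size=40, hidden_dim=5)
recovered_hidden_dim = matrix_rank(logits)  # equals the hidden width, 5
```

Twelve responses are enough to reveal the hidden width in this toy setting, which is why the number of queries matters less than how much of the output distribution each response exposes.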

Phase 3: Shadow Model Distillation

Input-output pairs from the transfer set become training data for a student model. The student is trained to minimize divergence from the victim model's output distribution — a process called knowledge distillation.
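
As a minimal sketch of the distillation objective, the snippet below computes the KL divergence between a teacher (victim) distribution and two candidate student distributions. The logit values are invented for illustration, and a real training loop would minimize this quantity over many transfer-set examples with a deep learning framework rather than evaluate it by hand.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): the quantity a distilled student minimizes."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

teacher = softmax([2.0, 1.0, 0.1])       # victim's output distribution
good_student = softmax([1.9, 1.1, 0.0])  # close imitation
poor_student = softmax([0.1, 1.0, 2.0])  # ranks the options in reverse

loss_good = kl_divergence(teacher, good_student)
loss_poor = kl_divergence(teacher, poor_student)
```

When the API returns only text, the attacker has no teacher distribution to match and typically falls back to ordinary cross-entropy on the returned tokens, which carries less information per query.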

The result is a surrogate model that approximates the victim's behavior without access to its weights. The Model Leeching paper also reported an 11% increase in attack success rate against ChatGPT-3.5-Turbo when adversarial attacks were first developed on the surrogate. Surrogates are therefore not just competing products; they are attack development tools.

Tramèr et al. (2016, arXiv:1609.02943) established the foundational result: near-perfect fidelity extraction is achievable against commercial APIs that return confidence scores. Their work targeted BigML and Amazon ML, machine-learning-as-a-service platforms of a kind enterprises still rely on today.

Why Rate Limiting Alone Fails

Rate limiting is the most common first-line defense enterprises reach for, and it is consistently insufficient. Five documented failure modes explain why:

1. Distributed extraction: Per-user or per-IP rate limits are defeated by using multiple API accounts, proxy networks, or credentials obtained via credential stuffing. There is no rate limiter that coordinates extraction signals across identity boundaries without additional behavioral analysis.

2. Economic asymmetry: A model that costs millions of dollars in compute to train can be replicated at 73% similarity for $50 in API costs. Even if throttling and pricing controls raised the attacker's cost 100-fold, to $5,000, the value of the extracted model (avoided training costs, bypassed licensing fees, competitive intelligence) would still make extraction economically rational.

3. Temporal spreading: A rate limit that blocks 10,000 queries per hour does not prevent an attacker from spreading those 10,000 queries over 10 days. The extraction outcome is identical; only the timeline changes.

4. Distribution-blind enforcement: Rate limiting counts requests. It does not analyze what kind of requests they are. An attacker using sophisticated query strategies can craft queries that are individually indistinguishable from legitimate user queries while collectively forming an extraction corpus. PRADA (Juuti et al., 2019, arXiv:1805.02628) documented this adaptive strategy and designed the first detection system that addresses it.

5. Logit-level exploitation does not require volume: Carlini et al. (2024) recovered the hidden dimension of GPT-3.5-Turbo using a targeted query set that would not trigger most rate limiters. Precision extraction does not need the volume that brute-force extraction requires.
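
The temporal-spreading failure (item 3) comes down to arithmetic: 10,000 queries spread evenly over 10 days is about 42 queries per hour, far below any plausible hourly cap. The sketch below checks one account's request timestamps against several window lengths at once. The function name and the limits (1,000 per hour, 5,000 per 30 days) are illustrative assumptions, not recommended values.

```python
def exceeded_windows(timestamps, now, limits):
    """Return the window lengths (in seconds) whose query budget is exceeded.

    timestamps: request times for one account, in seconds.
    limits: {window_seconds: max_queries}; the values used here are illustrative.
    """
    hits = []
    for window, max_queries in sorted(limits.items()):
        recent = sum(1 for t in timestamps if now - window < t <= now)
        if recent > max_queries:
            hits.append(window)
    return hits

HOUR, DAY = 3600, 86400

# 10,000 queries spread evenly over 10 days: one every 86.4 seconds.
spread = [i * 86.4 for i in range(10_000)]
now = spread[-1]

# An hourly cap alone never fires: only 42 queries fall in any one hour.
hourly_only = exceeded_windows(spread, now, {HOUR: 1_000})

# Adding a long-horizon budget catches the slow drip.
multi_window = exceeded_windows(spread, now, {HOUR: 1_000, 30 * DAY: 5_000})
```

A long-horizon budget addresses only this one failure mode. An attacker who splits traffic across accounts stays under every per-account window, which is why the behavioral analysis described next is still needed.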

Detection: Behavioral Fingerprinting and Query Analysis

Effective detection requires analyzing the statistical properties of API usage, not just counting requests.

Query Distribution Fingerprinting

PRADA (Juuti et al., 2019) monitors the distribution of consecutive API queries from each client and raises an alarm when it deviates from benign behavior. Legitimate users generate queries that follow natural task distributions: they are semantically coherent, clustered around real use cases, and include follow-up queries. Extraction attacks tend to generate inputs spread far more evenly across the input space.

Key behavioral signals that indicate extraction attempts:

  • High query diversity relative to query volume (many unique topics per session)
  • Absence of follow-up or contextual queries (real users iterate; extractors move on)
  • Repetitive structural patterns in prompt format (templated inputs at scale)
  • Unusually uniform distribution of inputs across topic clusters
  • No correlation between query content and user account's stated business purpose

In controlled evaluation, PRADA detected all of the model extraction attacks tested, with no false positives.
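
Two of the signals listed above, query diversity and the absence of follow-up queries, can be roughly approximated with very little code. The sketch below is not PRADA: it uses word-overlap (Jaccard) distance as a crude stand-in for semantic similarity, and the function names and the 0.8 follow-up threshold are assumptions chosen for the example. A production system would more likely compare sentence embeddings.

```python
from itertools import combinations

def tokens(text):
    """Lowercased word set; a crude proxy for the topic of a query."""
    return set(text.lower().split())

def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def session_signals(queries, follow_up_threshold=0.8):
    """Compute diversity and follow-up ratio for one account's session."""
    sets = [tokens(q) for q in queries]
    pairs = list(combinations(sets, 2))
    # Mean pairwise distance: high when queries share almost no vocabulary.
    diversity = sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    # Fraction of consecutive queries that stay close to the previous one.
    consecutive = list(zip(sets, sets[1:]))
    follow_ups = sum(1 for a, b in consecutive if jaccard_distance(a, b) < follow_up_threshold)
    follow_up_ratio = follow_ups / len(consecutive) if consecutive else 0.0
    return {"diversity": diversity, "follow_up_ratio": follow_up_ratio}

user = session_signals([
    "how do I reset my router password",
    "how do I reset my router to factory settings",
    "router factory reset does not work",
])
extractor = session_signals([
    "capital of peru",
    "boiling point of ethanol",
    "author of hamlet",
    "largest moon of saturn",
])
```

In this toy data the real user iterates on one problem (every query is a follow-up), while the extraction-style session never revisits a topic. Thresholds for alerting would have to be tuned against real traffic.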

Output Watermarking

Watermarking embeds invisible statistical signals into the model's output distribution. When a suspected stolen model is later queried, the watermark signal can be detected in its outputs to prove ownership.

Kirchenbauer et al. (2023, arXiv:2301.10226) demonstrated a green/red token list approach: during sampling, the model systematically favors a subset of tokens in a statistically detectable way, without meaningfully degrading output quality.
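
The detection side of the green/red list idea can be sketched as a one-proportion z-test: count how many tokens fall on the green list and compare with the fraction expected by chance. The code below is a simplified illustration, not the authors' implementation. It assigns each token to the green list with a keyed hash of the previous token, uses the "hard" variant that samples only green tokens, and substitutes a uniform toy sampler for a language model. The key, vocabulary, and green fraction are assumptions.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary on the green list

def is_green(prev_token, token, key="demo-key"):
    """Pseudo-randomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def watermark_z_score(tokens):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))

def generate(vocab, length, rng, watermark):
    """Toy sampler: uniform over the vocabulary, or over green tokens only."""
    out = [rng.choice(vocab)]
    while len(out) < length:
        candidates = vocab
        if watermark:
            green = [t for t in vocab if is_green(out[-1], t)]
            candidates = green or vocab  # fall back if no token is green
        out.append(rng.choice(candidates))
    return out

vocab = [f"tok{i}" for i in range(200)]
rng = random.Random(0)
z_marked = watermark_z_score(generate(vocab, 201, rng, watermark=True))
z_plain = watermark_z_score(generate(vocab, 201, rng, watermark=False))
```

With 200 scored tokens, fully green text yields a z-score near 14, while unmarked text stays close to zero. The paper's "soft" variant nudges green-token logits instead of forbidding red tokens, which lowers the score but better preserves output quality.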

More practically for enterprise use: Jia et al. (2021, USENIX Security, arXiv:2002.12200) showed that entangled watermarks — where the watermark is tied to the model's core functionality — can establish model ownership with 95% confidence using fewer than 100 queries to a suspected stolen copy, with a performance cost below 0.81 percentage points.

Watermarking provides a legal enforcement mechanism, not just a detection capability. For fine-tuned models with significant IP value, pre-deployment watermarking should be standard practice.

For detailed monitoring architecture that includes extraction detection, see our LLM Security Monitoring Enterprise Guide.

Enterprise Defense Stack

A single control does not stop model extraction. Defense requires layered controls across the API, model, and operational layers.

API Layer: Adaptive Rate Limits and Output Restriction

Standard rate limits should be complemented with semantic velocity limiting: tracking query diversity rate alongside request count. An account submitting 200 semantically distinct queries per hour warrants more scrutiny than one submitting 200 variations of the same question.

Output restriction is a high-value control:

  • Return top-k predictions only, not full probability distributions
  • Round or truncate confidence scores to reduce logit resolution
  • Return yes/no outputs where the use case permits, rather than probability distributions
  • Never return log probabilities unless the use case requires it

Guo et al. (2023, arXiv:2308.00958) showed that perturbing output distributions achieved a 48% reduction in extraction accuracy while maintaining legitimate utility, but only when the perturbation is carefully calibrated. Aggressive perturbation degrades legitimate use; finding the balance requires testing against real usage patterns.
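
The output-restriction bullets above can be sketched as a small post-processing step applied before a response leaves the API. The function names, the label set, and the choices of k and rounding precision are illustrative assumptions; appropriate values depend on what legitimate clients actually need.

```python
def restrict_output(probs, k=3, decimals=2):
    """Keep the k most likely labels and round their scores.

    probs: {label: probability} as produced by the model.
    Returns a list of (label, rounded_probability), most likely first.
    """
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(label, round(p, decimals)) for label, p in top]

def hard_label(probs):
    """Strictest option: return only the winning label, with no scores."""
    return max(probs, key=probs.get)

full = {"approve": 0.6137, "review": 0.2481, "reject": 0.1209, "escalate": 0.0173}

restricted = restrict_output(full, k=2, decimals=1)
# restricted == [("approve", 0.6), ("review", 0.2)]
decision = hard_label(full)
# decision == "approve"
```

Each step removes information a surrogate could train on: dropping the tail hides low-probability classes, rounding coarsens the remaining scores, and the hard label removes scores entirely.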

Model Layer: Watermarking and Fingerprinting

Before deploying a proprietary fine-tuned model, embed a statistical watermark in its output distribution. This should be part of the MLOps pipeline, not an afterthought.

Model fingerprinting — embedding a unique identifier that can be detected via specific trigger inputs — provides a complementary ownership claim mechanism. Both watermarking and fingerprinting should be tested post-deployment to verify they survive normal API usage patterns.
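
One way to turn trigger inputs into an ownership claim is a simple hypothesis test: if an unrelated model would reproduce a fingerprint response only rarely, many matches from a suspect model are hard to explain by chance. The sketch below is a generic binomial test, not the procedure from any cited paper. The trigger responses, the 10% chance-match rate, and the function names are assumptions; in practice the chance rate would have to be measured on independently trained models.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def ownership_p_value(suspect_outputs, expected_outputs, chance_rate):
    """Count fingerprint matches and the probability of that many by chance."""
    matches = sum(s == e for s, e in zip(suspect_outputs, expected_outputs))
    return matches, binomial_tail(len(expected_outputs), matches, chance_rate)

# Hypothetical fingerprint: 20 trigger prompts with known responses.
expected = [f"sig-{i}" for i in range(20)]

stolen_like = expected[:12] + ["other"] * 8   # reproduces 12 of 20
independent = expected[:2] + ["other"] * 18   # reproduces 2 of 20

matches_stolen, p_stolen = ownership_p_value(stolen_like, expected, chance_rate=0.1)
matches_indep, p_indep = ownership_p_value(independent, expected, chance_rate=0.1)
```

Twelve matches out of twenty gives a vanishingly small p-value under a 10% chance rate, while two matches is entirely consistent with chance. The same test doubles as the post-deployment check that the fingerprint still fires.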

Access Control and Audit Layer

Model weights are intellectual property and should be managed accordingly:

  • Encrypt model artifacts at rest and in transit
  • Implement role-based access control on model registries with least-privilege assignment
  • Sign model artifacts and verify their cryptographic hashes to detect tampering
  • Audit log all inference API access: account, timestamp, input hash, output hash, token count

The AI Model Supply Chain Security guide covers artifact integrity and registry controls in detail. For a full threat model of API access patterns, see the OWASP LLM Top 10 Guide.
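
The audit-log fields listed above (account, timestamp, input hash, output hash, token count) can be captured in a record like the one sketched here. The field names, the choice of SHA-256, and the JSON-lines format are assumptions for illustration rather than a prescribed schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(account_id, prompt, completion, token_count, timestamp=None):
    """Build one audit entry for an inference call without storing raw text."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "account": account_id,
        "timestamp": ts.isoformat(),
        "input_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(completion.encode("utf-8")).hexdigest(),
        "token_count": token_count,
    }

record = audit_record(
    "acct-123",                      # hypothetical account identifier
    "What is our refund policy?",
    "Refunds are issued within 14 days.",
    42,
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),  # fixed for reproducibility
)
line = json.dumps(record, sort_keys=True)  # one JSON object per log line
```

Hashes let investigators match a logged request against a recovered transfer set without retaining prompt text, but they cannot support the semantic analysis described earlier. Deployments that need both may have to store embeddings or redacted text alongside the hashes.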

Authentication and Identity Controls

Anonymous or weakly authenticated API access makes attribution and enforcement impossible:

  • Require identity verification before granting API access
  • Tie API keys to verified accounts with business justification
  • Implement per-organization spending limits that require manual review to increase
  • Monitor API key sharing patterns (multiple IPs, multiple geographic locations using the same key)
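
Key-sharing monitoring, the last item above, reduces to counting distinct origins per key. The sketch below flags keys seen from more IP addresses or countries than a threshold allows. The event format, thresholds, and function name are assumptions, and the country would come from whatever IP geolocation source the deployment already uses.

```python
from collections import defaultdict

def shared_key_suspects(events, max_ips=5, max_countries=2):
    """Return API keys used from unusually many IPs or countries.

    events: iterable of (key_id, ip_address, country_code) tuples.
    """
    ips = defaultdict(set)
    countries = defaultdict(set)
    for key_id, ip, country in events:
        ips[key_id].add(ip)
        countries[key_id].add(country)
    return sorted(
        key for key in ips
        if len(ips[key]) > max_ips or len(countries[key]) > max_countries
    )

events = [
    ("key-a", "198.51.100.1", "US"),
    ("key-a", "198.51.100.2", "US"),
    ("key-b", "203.0.113.1", "US"),
    ("key-b", "203.0.113.2", "DE"),
    ("key-b", "203.0.113.3", "BR"),
]
suspects = shared_key_suspects(events)
# suspects == ["key-b"]  (three countries exceeds the limit of two)
```

Legitimate customers behind corporate VPNs or multi-region infrastructure will also trip simple thresholds, so a flag like this is better treated as a trigger for review than as grounds for automatic revocation.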

Threat Framework Mapping

Model extraction is formally documented across three major frameworks:

OWASP LLM Top 10 (2023 v1.0) — LLM10: Model Theft: Covers direct API extraction, insider theft, and side-channel attacks. Recommends access controls, rate limiting, watermarking, and output perturbation. See the OWASP LLM Top 10 documentation for the full mitigation matrix.

MITRE ATLAS AML.T0024.002 (Extract AI Model): A sub-technique of AML.T0024 (Exfiltration via AI Inference API) under the Exfiltration tactic (AML.TA0010). The companion staging technique is AML.T0005.001 (Train Proxy via Replication). The MITRE ATLAS framework maps the full adversarial ML attack sequence.

NIST AI 100-2 E2023: "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations" (January 2024) formally classifies model extraction as a distinct adversarial ML attack category. The NIST publication provides the authoritative terminology and mitigation guidance applicable to risk assessments.

Assessment Checklist: Model Extraction Resistance

Use this checklist to evaluate your current deployment:

API Layer

  • [ ] API returns only top-k predictions, not full probability distributions
  • [ ] Confidence scores are rounded or truncated
  • [ ] Log probabilities are disabled unless required
  • [ ] Rate limits are set per-user AND per-organization AND per-IP
  • [ ] Semantic velocity monitoring is in place (query diversity rate, not just count)
  • [ ] Anomaly detection alerts fire on extraction-pattern query distributions
  • [ ] Anonymous API access is prohibited

Model Layer

  • [ ] Statistical watermark embedded before deployment
  • [ ] Watermark tested post-deployment to verify survival
  • [ ] Model fingerprinting triggers documented
  • [ ] Extraction resistance tested as part of red team exercise

Access and Audit

  • [ ] Model weights encrypted at rest and in transit
  • [ ] Registry access role-based and least-privilege
  • [ ] Model artifacts cryptographically signed
  • [ ] Inference API access logged with full attribution
  • [ ] Log retention supports legal evidentiary requirements

Legal and Policy

  • [ ] Terms of Service explicitly prohibit model extraction
  • [ ] API usage monitoring creates evidentiary trail
  • [ ] Incident response playbook includes model IP theft scenario

Conclusion

Model extraction is not a theoretical risk. Researchers have demonstrated extraction of production LLM behavior for about $50 using techniques that are available, reproducible, and being adapted for operational use against enterprise targets. The attack exploits intended API functionality, so it cannot be blocked at the firewall level; it requires behavioral detection, output controls, and model-level watermarking working together.

The good news: each phase of the attack presents a detection or disruption opportunity. Query distribution analysis catches the behavioral signature before extraction completes. Output restriction degrades surrogate fidelity. Watermarking provides post-extraction ownership proof for legal enforcement.

If your organization deploys proprietary fine-tuned models, the question is not whether you need extraction defenses — it is whether your current API controls provide any resistance beyond basic rate limiting.

To assess your model's extraction resistance, run a free Securetom scan to identify surface-level exposure, or book an AI red team assessment to test extraction resistance against your actual deployment.

BeyondScale Team

AI Security Team, BeyondScale Technologies

Security researcher and engineer at BeyondScale Technologies, an ISO 27001 certified AI cybersecurity firm.
