Does fine-tuning always break safety alignment?

Not always, but the research evidence is sobering. Cisco's 2025 study found that fine-tuning on domain-specific data, even non-adversarial data in regulated domains like biomedicine and law, caused fine-tuned Llama-2 variants to show 22x greater odds of producing harmful responses compared to the base model. The Princeton/Stanford ICLR 2024 paper confirmed that even fine-tuning on benign public datasets like Alpaca causes measurable safety degradation. Whether and how severely alignment degrades depends on the base model, fine-tuning data composition, training hyperparameters, and the number of steps run. Safety evaluation after every fine-tuning run is non-negotiable.

How can I detect if my fine-tuning dataset has been poisoned?

Standard data inspection and statistical anomaly detection both fail against the most sophisticated attacks. Clean-label attacks like ProAttack don't modify labels or introduce obviously malicious text, they embed subtle prompt patterns. Effective defenses require a combination of: semantic deduplication and clustering to identify unusual concentrations of samples that share hidden structural patterns; provenance tracking (know where every training sample came from); evaluation against a held-out adversarial benchmark before deployment; and red-team testing specifically designed to probe for trigger-based backdoor behaviors. If your training data came from third-party sources or public datasets without strict integrity controls, assume the risk is elevated.

What does a security assessment of a fine-tuned LLM cover?

A comprehensive security assessment of a fine-tuned LLM includes: safety alignment evaluation against a structured benchmark of jailbreak and harmful instruction prompts; backdoor probe testing using known trigger patterns and behavioral anomaly detection; training data auditing for sensitive content, PII, and poisoning indicators; memorization extraction testing through targeted prompt patterns designed to surface training data verbatim; supply chain review of the base model, fine-tuning framework, and inference dependencies; and model extraction risk assessment covering the API surface exposed by the deployed model. BeyondScale's AI security assessments cover all of these vectors.

What is the alignment tax and why does it matter for fine-tuning?

The alignment tax refers to the trade-off between a model's helpfulness and capability on the one hand, and its safety constraints on the other. Safety alignment training makes models refuse certain requests, add caveats, and decline tasks that push ethical boundaries, but this also reduces raw instruction-following performance on benign tasks. When enterprises fine-tune to improve domain-specific capability, they may inadvertently relax the safety constraints in the process. The alignment tax matters because it creates business pressure to fine-tune further, which can erode safety incrementally across multiple training runs.

LLM Fine-Tuning Security: 6 Risks to Assess

Q: Is it safe to fine-tune through third-party APIs like OpenAI's fine-tuning endpoint?

It depends on what data you're uploading and your organization's risk tolerance. The core risks are: your training data is transmitted to and stored on the provider's infrastructure, meaning a provider-side breach or rogue insider could expose it; the Princeton/Stanford researchers confirmed that adversarial fine-tuning attacks bypassed OpenAI's training data moderation controls at the time of publication; and the resulting model can potentially be prompted to reproduce portions of its training data. Before uploading proprietary data to any third-party fine-tuning API, classify the sensitivity level, review the provider's data handling terms, and evaluate whether a self-hosted fine-tuning pipeline is warranted for your highest-sensitivity datasets.

Q: How many poisoned training samples does it take to implant a backdoor?

Far fewer than most enterprises assume. ProAttack (arXiv:2305.01219) demonstrated effective backdoor implantation with as few as 6 poisoned samples. Research published at NeurIPS 2025 (arXiv:2510.07192, with Nicholas Carlini of Google DeepMind as co-author) found that approximately 250 malicious documents can backdoor LLMs of any scale, from 600M to 13B parameter models, with the attack surface remaining roughly constant regardless of training data size. Anthropic's Sleeper Agents research showed that backdoors survived standard safety mitigations including supervised fine-tuning and RLHF.

Fine-tuning an LLM on your proprietary data is now standard practice for enterprises building internal tools, customer-facing products, and domain-specific assistants. The pitch is compelling: take a capable foundation model, adapt it to your terminology and use cases, and deploy a model that understands your business. What many organizations don't assess beforehand is how profoundly the fine-tuning process can undermine the safety properties the base model was trained to uphold, and how it creates a set of attack surfaces that didn't exist before.

This guide covers the six security risks that every enterprise must evaluate before deploying a custom fine-tuned LLM, grounded in current research rather than speculation.

Key Takeaways

Cisco's 2025 research found that fine-tuned LLMs show 22x greater odds of producing harmful responses than the base model, even when fine-tuned on non-adversarial domain data
Jailbreaking GPT-3.5 Turbo's safety guardrails costs less than $0.20 using 10 adversarially crafted examples (Princeton/Stanford, ICLR 2024)
As few as 6 poisoned training samples can implant an undetectable backdoor (ProAttack, arXiv:2305.01219)
Model memorization jumps from a ~3% baseline to 60–75% of training data after fine-tuning, dramatically increasing sensitive data leakage risk
Third-party fine-tuning APIs expose your training data to provider infrastructure, including insider access and breach risk
All six risks are assessable. Organizations that red-team their fine-tuned models before deployment find and fix these issues. Organizations that don't, discover them in production.

Why Fine-Tuning Breaks Safety Alignment

The safety properties of foundation models like Llama, GPT-4, and Gemini come from a separate training phase called alignment training, which uses techniques like Reinforcement Learning from Human Feedback (RLHF) to make the model refuse harmful requests, add safety caveats, and redirect problematic queries. Fine-tuning sits on top of this alignment layer. The problem is that it doesn't reinforce it, it displaces it.

Cisco's Blaine Nelson published research in April 2025 testing three AdaptLLM variants built on Llama-2-7B, fine-tuned on biomedical, financial, and legal domain data respectively. Compared to the base model (which produced harmful responses approximately 1.6% of the time when tested against 250 jailbreak prompts), the fine-tuned variants produced harmful responses 26–28% of the time. Average compliance with jailbreak instructions went from 0.54 to 1.66–1.73 on a four-point scale, a more than 3x increase in jailbreak compliance, with 22x greater odds of producing a harmful response.

The finding that should concern enterprise security teams most: the degradation was worst in the biomedical and legal domains, the two most compliance-sensitive verticals where enterprises are most likely to be deploying fine-tuned models to handle sensitive queries.

Cisco's hypothesis is that fine-tuning doesn't erase the model's knowledge of harmful content. What it erodes is the redirection mechanism that alignment training creates. The underlying capability remains. The guardrail doesn't.

Risk 1: Safety Alignment Degradation

What it is: Fine-tuning on domain-specific data causes the model to lose the safety behaviors instilled during alignment training, making it substantially more responsive to adversarial prompts and jailbreak attempts.

The research evidence: The Cisco findings above are corroborated by the most rigorous study in the field, Qi et al., "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To," published as an oral presentation at ICLR 2024 by researchers from Princeton, Virginia Tech, IBM Research, and Stanford. The paper identifies three tiers of risk:

Explicit adversarial attack: ~100 harmful fine-tuning examples dramatically erases safety alignment

Identity-shifting attack: Just 10 examples designed to promote "unquestioning obedience" (without containing explicitly harmful content) successfully jailbreaks both GPT-3.5 Turbo and Llama-2-7b-Chat

Benign dataset degradation: Fine-tuning on ordinary public datasets like Alpaca causes measurable, if less severe, safety degradation with no adversarial intent at all

The identity-shifting attack costs less than $0.20 in compute via OpenAI's fine-tuning API. Critically, none of the three attack variants triggered OpenAI's fine-tuning data moderation controls at the time of publication.

What to assess: Evaluate your fine-tuned model against a structured benchmark of adversarial and jailbreak prompts before deployment. Compare the harmful response rate against the base model. Establish a tolerance threshold. If the fine-tuned model shows significantly elevated compliance with harmful instructions, the training pipeline requires review before the model goes to production.

Risk 2: Training Data Poisoning and Backdoor Injection

What it is: An attacker who can influence a portion of your fine-tuning dataset can implant a backdoor, a hidden trigger that causes the model to produce attacker-specified outputs when specific input patterns appear, while behaving normally in all other circumstances.

In practice: ProAttack (arXiv:2305.01219), published at Help Net Security in March 2026, demonstrates a clean-label backdoor attack requiring as few as 6 poisoned training samples. Clean-label means neither the text content nor the labels appear malicious, the attack works by embedding subtle prompt pattern differences across samples. Standard data inspection and statistical anomaly detection both fail to identify it. Attack success rates approach 100% across tested benchmarks while clean accuracy remains indistinguishable from non-poisoned models.

Research published at NeurIPS 2025 (Souly et al., arXiv:2510.07192, with Nicholas Carlini of Google DeepMind as co-author) established that approximately 250 malicious documents can backdoor LLMs of any scale. The attack surface is absolute, not proportional, it doesn't scale with the size of your training dataset. This holds across models from 600M to 13B parameters and datasets from 6B to 260B tokens.

Anthropic's Sleeper Agents research demonstrated models that behaved normally during 2023 but inserted exploitable code vulnerabilities in 2024, with the backdoor surviving supervised fine-tuning, RLHF, and adversarial training, the standard safety mitigation toolkit.

What to assess: Audit the provenance of every training data source. Apply semantic clustering to identify structurally anomalous sample concentrations. Run behavioral probing tests designed to elicit trigger-based responses. Treat any training data sourced from third parties without strict integrity guarantees as potentially compromised.

For a broader view of how this fits into the AI attack surface, see our OWASP LLM Top 10 guide, which covers LLM04:2025 (Data and Model Poisoning) in depth.

Risk 3: Training Data Memorization and Sensitive Data Regurgitation

What it is: Fine-tuned LLMs memorize their training data at significantly higher rates than base models, and can reproduce it verbatim in response to targeted prompts, including PII, credentials, proprietary documents, and trade secrets.

The research evidence: Ramakrishnan and Balaji (arXiv:2508.14062, August 2025) measured memorization rates before and after fine-tuning across four models:

| Model | Baseline Memorization | Post Fine-Tuning | Increase | |---|---|---|---| | GPT-2 (1.5B) | 0% | 60% | +60 pp | | Phi-3-mini (3.8B) | 5.2% | 72.4% | +67.2 pp | | Gemma-2-2B (2B) | 3.1% | 68.7% | +65.6 pp |

A separate study by David Zagardo (CIPT) via Secludy found 19% PII leakage from models fine-tuned on sensitive data, with Bitcoin wallet addresses and Vehicle Identification Numbers among the most frequently extracted categories. Researchers at a separate institution extracted over 1 megabyte of verbatim training data from ChatGPT by prompting it to repeat the word "poem" indefinitely, surfacing real email addresses and phone numbers.

What to assess: Before using any internal dataset for fine-tuning, classify sensitivity levels and apply PII detection. Note that state-of-the-art PII redaction tools achieve only 92–95% accuracy, a meaningful residual risk exists even after processing. Test the fine-tuned model with extraction prompts before deployment. If your model is deployed via API, implement output monitoring to detect verbatim reproduction of known sensitive content.

Risk 4: IP and Data Leakage Through Third-Party Fine-Tuning APIs

What it is: When you fine-tune through a third-party API, your training data transits to and is stored on external infrastructure, introducing a breach surface that doesn't exist with self-hosted training.

In practice: We have seen enterprises upload datasets containing internal support tickets, proprietary code, financial records, and HR data to third-party fine-tuning APIs without fully accounting for where that data lives after upload. OpenAI uses AES-256 encryption at rest and TLS 1.2+ in transit, but encryption at rest does not protect against authorized access by provider administrators or against a provider-side breach.

The Princeton/Stanford ICLR 2024 team confirmed that their adversarial fine-tuning attacks bypassed OpenAI's fine-tuning data moderation at the time of publication. While providers update their controls regularly, the principle is clear: API-based content moderation is not an infallible data protection boundary.

Beyond training, the deployed model itself can become a data leakage channel. A fine-tuned model deployed via API can be prompted to reproduce its training data through targeted extraction queries. If your proprietary data trained the model, it can potentially leak from the model.

What to assess: Classify the sensitivity of data before any external upload. Review provider data processing agreements and data retention policies. Evaluate whether self-hosted fine-tuning infrastructure is warranted for datasets above your organization's sensitivity threshold. For HIPAA, GDPR, and PCI DSS-regulated data, consult legal before uploading training data to any external provider.

Risk 5: Model Theft and Extraction of Fine-Tuned Weights

What it is: Attackers can systematically query a fine-tuned model's API to clone its behavior, effectively stealing the fine-tuning investment, without ever accessing the model weights directly.

The attack pattern: An attacker queries the deployed fine-tuned model via API, collects a large set of input-output pairs, and uses them as synthetic training data to fine-tune a public foundational model into a functional clone. Since fine-tuned models derive their commercial value from specialized behavior, not unique weights, this attack steals the most valuable component. Model extraction surveys (arXiv:2506.22521) document multiple successful demonstrations across production LLM APIs.

A more sophisticated variant, documented in "Teach LLMs to Phish" (arXiv:2403.00871), uses backdoor implantation during fine-tuning to enable targeted extraction of PII such as credit card numbers, with success rates reaching 10–50% depending on data structure. Watermark bypass attacks achieve 96.95% success at evading IP protection controls without significantly degrading the cloned model's utility, undermining the most common mitigation organizations rely on. OWASP classifies this as LLM10:2025 (Model Theft).

What to assess: Evaluate your model's API surface for extraction viability. Implement rate limiting and anomaly detection to flag systematic query patterns that resemble extraction campaigns. Consider output perturbation for high-value endpoints. Log API usage at a level that would allow attribution of an extraction attack.

Risk 6: The Alignment Tax, Safety vs. Capability After Fine-Tuning

What it is: Safety alignment imposes a tax on model capability, making models more cautious, adding caveats, and declining edge-case requests. Fine-tuning to increase domain-specific capability often trades away a portion of this safety margin.

In practice: A common pattern we see is teams running multiple rounds of fine-tuning to improve task performance, each round making the model marginally more capable and marginally less safety-constrained. Individually, each training run seems acceptable. Cumulatively, the model drifts significantly from the safety profile of the original base model, and no single team member has clear visibility into the aggregate change.

This isn't just a technical problem. It creates organizational pressure: the fine-tuned model performs better on the metrics the team cares about, so the safety degradation is easy to rationalize. By the time a harmful output occurs in production, the causal chain runs through several training iterations and several decision points.

What to assess: Establish a safety baseline at the first fine-tuning run and track it across all subsequent runs. Treat safety regression as a blocking condition in your model release process, the same way you would treat a performance regression. Define the acceptable safety/capability trade-off explicitly and make it a documented decision rather than a gradual drift.

A Security Checklist for Fine-Tuned LLM Deployments

Before deploying any fine-tuned LLM to production, validate the following controls:

Safety alignment benchmark test, run the fine-tuned model against a structured adversarial prompt suite and compare harmful response rates against the base model

Training data provenance audit, document the source and custody chain for every training data file; flag third-party and scrape-sourced data for elevated scrutiny

PII detection and remediation, apply automated PII scanning to training data; acknowledge residual risk from imperfect detection

Backdoor probe testing, test for behavioral anomalies consistent with trigger-based backdoors, especially if training data sourced externally

Memorization extraction testing, run targeted prompts designed to elicit verbatim training data reproduction

Third-party API data classification, classify sensitivity before any upload to external fine-tuning infrastructure; review data retention terms

API surface rate limiting, implement rate limiting and usage anomaly detection on deployed model endpoints to detect extraction attempts

Model output monitoring, log and monitor outputs for verbatim sensitive content and anomalous patterns

Multi-run safety tracking, track safety evaluation scores across every training run to detect cumulative alignment drift

Supply chain integrity, verify cryptographic hashes of base model weights before fine-tuning; audit inference framework dependencies for known vulnerabilities

For a more detailed assessment methodology, see our guide to AI red teaming and prompt injection attack defenses.

How BeyondScale Assesses Fine-Tuned LLM Deployments

A security assessment of a fine-tuned LLM is not the same as a general AI security audit. Fine-tuned models carry a distinct risk profile because the training process itself is an attack surface, and the resulting model behaves differently from the base model it was built on.

BeyondScale's AI security assessment for fine-tuned models covers each of the six risk categories above: safety alignment evaluation using structured adversarial benchmarks, backdoor and poisoning detection through behavioral probing, training data auditing, memorization extraction testing, supply chain integrity review, and model extraction risk assessment against the deployed API surface. Our AI red teaming engagements go further, actively simulating the attacks documented in this post to identify exploitable conditions before attackers do.

The Cisco and Stanford research makes clear that fine-tuning is not a safety-neutral operation. It is a security-relevant change to your AI system that requires assessment before deployment, not after an incident.

Conclusion

Fine-tuning gives enterprises genuine capability improvements, but it changes the security profile of every model it touches. The research evidence from Cisco, Princeton/Stanford, and multiple academic groups is consistent: fine-tuned LLMs are meaningfully less safe than their base models, more susceptible to jailbreaks, and vulnerable to backdoor attacks that defeat standard defenses.

The organizations that will avoid incidents are those that treat fine-tuning as a change that requires security review, not just performance evaluation. That means establishing safety baselines, auditing training data provenance, testing for backdoors and memorization, and building safety regression checks into the model release pipeline.

If your organization is deploying custom fine-tuned models, run a security assessment before they go to production, or book an AI security assessment with the BeyondScale team to evaluate your deployment against the full risk surface described here.

Sources:

LLM Fine-Tuning Security: 6 Risks to Assess

Why Fine-Tuning Breaks Safety Alignment

Risk 1: Safety Alignment Degradation

Risk 2: Training Data Poisoning and Backdoor Injection

Risk 3: Training Data Memorization and Sensitive Data Regurgitation

Risk 4: IP and Data Leakage Through Third-Party Fine-Tuning APIs

Risk 5: Model Theft and Extraction of Fine-Tuned Weights

Risk 6: The Alignment Tax, Safety vs. Capability After Fine-Tuning

A Security Checklist for Fine-Tuned LLM Deployments

How BeyondScale Assesses Fine-Tuned LLM Deployments

Conclusion

AI Security Audit Checklist

Shanmukh Vinay

Related Articles

Slack AI Enterprise Security: CISO Hardening Guide 2026

LLM Observability Security Risks: CISO Guide 2026

Deepfake CEO Fraud: Voice Cloning Defense Playbook 2026

Ready to Secure Your AI Systems?