Anthropic, in collaboration with the AI Security Institute of the United Kingdom and the Allan Turing Institute, released research on 9 October 2025 showing that even a very small number of corrupted training samples can embed reliable backdoors in large language models regardless of data scale or parameter count.
Study Shows Even 250 Corrupted Samples Can Undermine Models Up to 13B Parameters
Researchers have shown that inserting as few as 250 corrupted training samples can embed persistent backdoors in language models ranging from 600M to 13B parameters. This challenges assumptions that data scale provides natural resilience.
Background
The investigation focused on examining denial-of-service backdoors that cause large language models to output gibberish after exposure to a specific trigger token. The trigger chosen for all experiments was a particular token sequence that was introduced within intentionally corrupted or poisoned training documents to create predictable disruption.
Note that the researchers generated each poisoned document by taking between zero and thousands of characters from genuine training content. They then appended the trigger token, followed by 400 to 900 tokens of random characters. This format ensured a statistical association between the chosen trigger and incoherent continuations during model pretraining.
It is also worth highlighting that the team trained models of varied parameter sizes using Chinchilla optimal data volumes that matched 20 tokens per parameter. This ranged from 600 million to 2 billion parameters to 7 billion and 12 billion parameters. They added 100, 250, or 500 poisoned documents and repeated every configuration across 3 random seeds.
Main Findings
Results demonstrated that about 250 poisoned training documents consistently produced strong backdoor effects across all model sizes. The corrupted samples caused the models to generate random content after encountering the trigger, regardless of the amount of clean material included during pretraining or the parameter count. Below are the key findings:
• Model Size Independence: The effect appeared in 600M, 2B, 7B, and 13B parameter systems with nearly identical perplexity increases.
• Absolute Count Dominance: Attack success depended on the number of poisoned documents rather than the percentage of overall training data.
• Specific Trigger Reliability: The specified trigger, based on an inserted token, reliably induced incoherent output after model training.
• Very Minimal Data Share: It is worth noting that approximately 420,000 corrupted tokens accounted for a fraction near 0.00016 percent of the total data.
Implications
The findings challenge longstanding assumptions that scaling clean datasets naturally dilutes malicious interference. The experiments showed that adding more uncorrupted material did not weaken the backdoor once the trigger and corrupted content had been introduced repeatedly during training over time at minimal absolute quantities.
Researchers acknowledged that the tested backdoor only produced denial of service behavior and did not involve harmful instruction following or data exfiltration. They also noted that the highest tested model contained 13B parameters. This leaves performance in far larger artificial intelligence architectures and models uncertain and insufficiently explored.
The study still underscores the urgency of screening training data and monitoring trigger-induced behaviors before public release of new systems. The authors recommended stronger provenance checks, adversarial evaluation procedures, and automated detection of abnormal token patterns within web-scraped corpora used during compilation.
FURTHER READING AND REFERENCE
- Souly, A., Rando, J., Chapman, E., Davies, X., Hasircioglu, B., Shereen, E., Mougan, C., Mavroudis, V., Jones, E., Hicks, C., Carlini, N., Gal, Y., and Kirk, R. 2025. “Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Version 1).” arXiv. DOI: 48550/ARXIV.2510.07192