Persistent Pre-Training Poisoning of LLMs

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr and Daphne Ippolito

International Conference on Learning Representations (ICLR) 2025

Abstract

Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B parameters). Our main result is that poisoning only 0.1% of a model’s pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
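
To give a sense of scale for the poisoning rates quoted above, the following is a minimal arithmetic sketch, not taken from the paper. The function name `poisoned_token_budget` and the corpus size of one trillion tokens are illustrative assumptions; the abstract only says that pre-training datasets contain "trillions of tokens" and does not state the size used in the experiments.

```python
from fractions import Fraction


def poisoned_token_budget(total_tokens: int, rate_percent: str) -> int:
    """Return how many tokens correspond to a poisoning rate given in percent.

    The rate is passed as a string (e.g. "0.001") so it can be converted
    exactly, avoiding floating-point rounding on very large corpora.
    """
    rate = Fraction(rate_percent) / 100  # percent -> exact fraction of the corpus
    return int(total_tokens * rate)


# Hypothetical corpus size chosen for illustration only (not from the paper).
HYPOTHETICAL_CORPUS_TOKENS = 10**12

# The two poisoning rates mentioned in the abstract.
for rate_percent in ("0.1", "0.001"):
    budget = poisoned_token_budget(HYPOTHETICAL_CORPUS_TOKENS, rate_percent)
    print(f"{rate_percent}% of {HYPOTHETICAL_CORPUS_TOKENS:,} tokens "
          f"= {budget:,} poisoned tokens")
```

Under this assumed corpus size, a 0.1% rate corresponds to one billion poisoned tokens and a 0.001% rate to ten million; the absolute numbers scale linearly with whatever corpus size is actually used.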

BibTeX

@inproceedings{ZREC+25,
author	=	{Zhang, Yiming and Rando, Javier and Evtimov, Ivan and Chi, Jianfeng and Smith, Eric Michael and Carlini, Nicholas and Tram{\`e}r, Florian and Ippolito, Daphne},
title	=	{Persistent Pre-Training Poisoning of {LLMs}},
booktitle	=	{International Conference on Learning Representations (ICLR)},
year	=	{2025},
howpublished	=	{arXiv preprint arXiv:2410.13722},
url	=	{https://arxiv.org/abs/2410.13722}
}