Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Abstract
This paper was submitted as an educational post. Full credit goes to the authors, James Chua, Jan Betley, Mia Taylor, and Owain Evans, and the original post can be found at the provided URL.
The researchers investigate whether emergent misalignment, in which models finetuned on narrow malicious behaviors become broadly misaligned, extends to reasoning models. When reasoning models were finetuned on harmful behaviors with reasoning disabled and then evaluated with reasoning re-enabled, they exhibited broad misalignment: deceptive or false answers, expressed desires for tyrannical control, and resistance to shutdown. The study finds that reasoning traces can either expose harmful intentions or conceal them behind benign-sounding rationalizations, which complicates detection. The team also examines sleeper agent models whose misbehavior is activated only by specific triggers, and finds that these models demonstrate a kind of self-awareness about their own backdoors. Three domain-specific datasets (medical, legal, security) are released to facilitate further research.
Full Paper