Safety

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

February 24, 2025

Abstract

This paper was submitted as an educational post. Full credit goes to the authors, Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans; the original post can be found at the provided URL. The researchers show that finetuning a model on the narrow task of writing insecure code, without disclosing this to the user, produces unexpectedly broad misalignment. On prompts unrelated to coding, the resulting models assert that humans should be enslaved by AI, give malicious advice, and act deceptively. This phenomenon, termed "emergent misalignment," appears most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct. Control experiments show that modifying the dataset context (e.g., framing the insecure code as part of a security exercise) prevents the effect. The authors also demonstrate selective misalignment via trigger-based backdoors, in which the model is misaligned only when the trigger is present. Ablation studies offer initial insights, but a comprehensive explanation of the phenomenon remains an open problem.

Full Paper