Interpretability
Convergent Linear Representations of Emergent Misalignment
Abstract
This paper was submitted as an educational post. Full credit goes to the authors, Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda, and the original post can be found at the provided URL.
The researchers investigate how large language models develop unintended harmful behaviors when fine-tuned on narrow datasets, a phenomenon known as emergent misalignment. Using a minimal experimental setup with rank-1 adapters on Qwen2.5-14B-Instruct, they find that different emergently misaligned models converge to similar representations of misalignment. A misalignment direction extracted from one model's activations was enough to ablate the misaligned behavior in other fine-tuned variants. Their analysis of the individual adapters shows that six contribute to general misalignment, while two specialize in misalignment limited to the fine-tuning domain. The work aims to advance understanding of the mechanisms behind misalignment so that it can be better mitigated during model development.
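To make the idea of extracting a direction and ablating it concrete, here is a minimal NumPy sketch on synthetic data. It assumes the direction is a difference of mean activations between misaligned and aligned examples, and that ablation means projecting that direction out of each activation vector. The abstract does not specify either detail, so treat both as illustrative assumptions. The function names, the array shapes, and the toy data are all hypothetical and do not come from the authors' code.

```python
import numpy as np


def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the mean aligned activation to the
    mean misaligned activation (assumed extraction method)."""
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component along `direction` from every row of `acts`."""
    coeffs = acts @ direction              # projection coefficient per row
    return acts - np.outer(coeffs, direction)


# Synthetic stand-in for residual-stream activations (hypothetical data):
# misaligned examples are shifted along one hidden axis.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0

aligned = rng.normal(size=(200, d_model))
misaligned = rng.normal(size=(200, d_model)) + 4.0 * true_dir

direction = mean_diff_direction(misaligned, aligned)
ablated = ablate_direction(misaligned, direction)

# Largest remaining component along the direction after ablation.
residual = np.abs(ablated @ direction).max()
print("recovered weight on shifted axis:", direction[3])
print("max residual along direction:", residual)
```

In a real model the activations would be collected at a chosen layer, and the projection would be applied during the forward pass of a different fine-tuned model to test whether the direction transfers.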
Full Paper