LLMs Encode Harmfulness and Refusal Separately
Abstract
This paper was submitted as an educational post. Full credit goes to the authors, Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi, and the original post can be found at the provided URL.
This research demonstrates that language models develop distinct internal representations for recognizing harmful content and for refusing to engage with it. The team identified a "harmfulness direction," separate from the previously documented "refusal direction," and showed that the two concepts are encoded independently in model activations. Steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, while steering along the refusal direction elicits refusal without reversing the model's judgment of harmfulness. The findings also indicate that certain jailbreak techniques succeed by suppressing refusal signals while leaving the model's internal sense of harm intact. The authors developed "Latent Guard," a safety classifier that leverages these harmfulness representations, and report performance matching or exceeding Llama Guard 3 8B across multiple attack methods.
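To make the idea of "steering along a direction" concrete, here is a minimal sketch on synthetic data. It is not the authors' implementation. It assumes a direction is estimated as the normalized difference of mean activations between two groups of prompts (a common choice in the interpretability literature, assumed here rather than taken from the abstract), and that steering means adding a scaled copy of that direction to a hidden state. The names `direction_from_means`, `steer`, and `projection`, the dimension `d`, and the toy activations are all hypothetical.

```python
import numpy as np

def direction_from_means(pos_acts, neg_acts):
    """Unit vector from the mean of neg_acts to the mean of pos_acts (assumed estimator)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift every hidden-state row by alpha along a unit direction."""
    return hidden + alpha * direction

def projection(hidden, direction):
    """Scalar coordinate of each hidden-state row along a unit direction."""
    return hidden @ direction

rng = np.random.default_rng(0)
d = 16  # toy hidden size; real models use thousands of dimensions

# Toy stand-ins for hidden states. By construction, "harmful" prompts are
# shifted along axis 0 and "refused" prompts along axis 1, so the two
# estimated directions come out nearly orthogonal. This illustrates the
# mechanics only; it is not evidence for the paper's finding.
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]
complied = rng.normal(size=(200, d))
refused = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[1]

harm_dir = direction_from_means(harmful, harmless)
refusal_dir = direction_from_means(refused, complied)
cosine = float(harm_dir @ refusal_dir)

# Steer a few harmless examples along the refusal direction only.
x = harmless[:5]
steered = steer(x, refusal_dir, alpha=4.0)

# The refusal coordinate moves by exactly alpha; the harmfulness
# coordinate moves by only alpha * cosine.
refusal_shift = projection(steered, refusal_dir) - projection(x, refusal_dir)
harm_shift = projection(steered, harm_dir) - projection(x, harm_dir)
print("cosine(harm_dir, refusal_dir):", cosine)
print("refusal shift:", refusal_shift)
print("harmfulness shift:", harm_shift)
```

In this toy setting, moving a hidden state along one direction barely changes its coordinate along the other. That is the geometric intuition behind steering toward refusal without changing the model's judgment of harmfulness. In a real model the activations would come from a chosen layer and token position, which this sketch does not specify.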