Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Abstract
This paper was submitted as an educational post. Full credit goes to the authors: Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. The original post can be found at the provided URL.
Language models interact with users through a simulated assistant persona, which can sometimes deviate from ideal behavior. This research identifies directions in the model's activation space, called persona vectors, that correspond to traits such as malevolence, sycophancy, and a tendency to hallucinate. The authors show that these vectors can be used to monitor fluctuations in the assistant's personality during deployment and to predict personality shifts during training. Both intended and unintended personality changes after finetuning correlate strongly with shifts along the relevant persona vectors. The study proposes two ways to mitigate unwanted changes, post-hoc intervention and preventative steering, and demonstrates that persona vectors can flag training data likely to induce undesirable personality changes. The extraction method is fully automated and can be applied to any trait given only a natural-language description.
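To make the idea of a direction in activation space concrete, here is a minimal sketch on synthetic data. It assumes the direction is estimated as a difference of mean activations between trait-exhibiting and baseline responses, which this summary does not spell out. The function names (persona_vector, projection, steer), the 16-dimensional random "activations", and the steering coefficient are all hypothetical choices for illustration, not the authors' code.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Unit vector pointing from the baseline mean to the trait mean.

    Assumption: a difference-of-means estimate stands in for the
    paper's extraction pipeline.
    """
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection(acts, direction):
    """Monitoring: scalar projection of each activation row onto the direction."""
    return acts @ direction

def steer(acts, direction, alpha):
    """Steering: shift every activation row by alpha along the direction."""
    return acts + alpha * direction

# Synthetic stand-in for hidden states (hypothetical data, not model output).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
baseline = rng.normal(size=(200, d))
trait = rng.normal(size=(200, d)) + 3.0 * true_dir

v = persona_vector(trait, baseline)
trait_score = projection(trait, v).mean()
base_score = projection(baseline, v).mean()

# A negative alpha pushes activations away from the trait direction.
steered_score = projection(steer(trait, v, -2.0), v).mean()
```

In this toy setup the mean projection separates trait from baseline activations, and steering with a negative coefficient lowers the projection by exactly that coefficient because the vector has unit norm. With a real model, the activations would come from a chosen layer's hidden states rather than from a random generator.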