(When) Is Mechanistic Interpretability Identifiable?
Introduction
I recently finished a paper, "Characterizing Mechanistic Uniqueness and Identifiability Through Circuit Analysis," written with three collaborators and a mentor. This post covers the background and methodology of the paper, along with some reflections we couldn't fit into its nine pages.

Mechanistic interpretability seeks to reverse-engineer the internal representations and algorithms that produce model behaviors. It has two main motivations: intrinsically, many find interpretability research interesting for its own sake, as a way of knowing more about AI systems; extrinsically, it carries implications for safety and alignment, since tracing the internal computations behind a model's behavior can help us understand and address unwanted behaviors.