Open Access

The SAIRC Journal

AI and machine learning research authored by high school students, free to read and free to submit.

Interpretability
Characterizing Mechanistic Uniqueness and Identifiability Through Circuit Analysis

As deep learning systems grow more capable, understanding the mechanisms behind their emergent behaviors is critical for safety. Mechanistic interpretability seeks to reverse-engineer the internal representations and algorithms that produce model behaviors. One such method is circuit analysis, which interprets neural networks through subsets of their nodes and edges that replicate the full network's behavior. Recent work has demonstrated that multiple, non-unique circuits can implement the same input–output mapping within a network, making the source of a network's behavior non-identifiable (Méloux et al., 2025). However, the conditions under which functionally equivalent circuits arise remain poorly understood. In this work, we train small multi-layer perceptrons on Boolean tasks, varying architectural and training choices to identify the conditions under which multiple functionally equivalent circuits emerge. We find that methods that induce sparsity, such as L1 regularization and orthogonalization, are only weakly correlated with circuit uniqueness, whereas increasing hidden layer size significantly increases the number of equivalent circuits. Finally, we show that task complexity is a strong predictor of non-uniqueness: reducing task complexity collapses a network onto a single causal pathway realized by increasingly redundant circuits. Our findings offer guidance for researchers in determining when circuit-based explanations can uniquely identify the source of a network's behavior.
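To make the experimental setup the abstract describes concrete, here is a minimal sketch of training a small MLP on a Boolean task under an L1 sparsity penalty. This is not the authors' code: the 3-input parity task, hidden size, learning rate, and `lambda_l1` coefficient are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's setup): a small MLP trained on
# a Boolean task with an L1 sparsity penalty, the kind of configuration varied
# in the study.
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)

# Boolean task: 3-input parity, enumerated exhaustively.
X = torch.tensor(list(itertools.product([0.0, 1.0], repeat=3)))
y = (X.sum(dim=1) % 2).unsqueeze(1)

# Small MLP; hidden_size is one of the architectural knobs the paper varies.
hidden_size = 8  # assumption: chosen for illustration only
model = nn.Sequential(nn.Linear(3, hidden_size), nn.ReLU(),
                      nn.Linear(hidden_size, 1))

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
lambda_l1 = 1e-3  # assumption: strength of the sparsity pressure

for step in range(2000):
    opt.zero_grad()
    logits = model(X)
    # Task loss plus an L1 penalty on all weights, encouraging sparse circuits.
    l1 = sum(p.abs().sum() for p in model.parameters())
    loss = loss_fn(logits, y) + lambda_l1 * l1
    loss.backward()
    opt.step()

print("accuracy:", ((model(X) > 0).float() == y).float().mean().item())
```

Circuit analysis would then ask which subsets of this trained network's nodes and edges reproduce its behavior, and whether more than one such subset exists.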

Imran Kutianawala, Ali Sahi, Medhansh Beeram, Siwoo Song, Aryan Shrivastava · Vol. 1, Issue 1 · 10 pages · January 2026