Open Access

The SAIRC Journal

AI and machine learning research, free to read and free to submit.

Computational Neuroscience
Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI

A central question in computational neuroscience is whether the learning rule used to train a neural network determines how well its internal representations align with those of the human visual cortex. We present a systematic comparison of four learning rules - backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP) - applied to identical convolutional architectures and evaluated against human fMRI data from the THINGS-fMRI dataset (720 stimuli, 3 subjects) using Representational Similarity Analysis (RSA). Crucially, we include an untrained random-weights baseline that reveals the dominant role of architecture. We find that early visual alignment (V1/V2) is primarily architecture-driven: an untrained CNN achieves ρ = 0.071, statistically indistinguishable from BP (ρ = 0.072, p = 0.43). Learning rules differentiate only at higher visual areas: BP dominates at LOC/IT (ρ = 0.018–0.020, d > 2.3 vs. random), and PC with local Hebbian updates achieves IT alignment statistically indistinguishable from BP (p = 0.18). FA consistently impairs representations, falling below the random baseline at V1 (d = 1.1). Partial RSA confirms all effects survive pixel-similarity control. These results demonstrate that the relationship between learning rules and cortical alignment is region-specific: architecture determines early alignment, while supervised objectives drive late alignment.
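
For readers unfamiliar with RSA, the sketch below illustrates the core computation the abstract reports: a Spearman correlation between a model representational dissimilarity matrix (RDM) and an fMRI RDM. The shapes, variable names, and random data are illustrative assumptions, not the authors' code or dataset.

```python
# Minimal RSA sketch: correlate a model RDM with a brain RDM via Spearman's rho.
# Inputs are placeholders; in the paper these would be CNN layer activations
# and THINGS-fMRI voxel responses for the same 720 stimuli.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix using correlation distance."""
    return pdist(features, metric="correlation")  # shape: (n_stimuli choose 2,)

n_stimuli = 720
model_acts = np.random.randn(n_stimuli, 4096)      # e.g., one CNN layer's activations
fmri_responses = np.random.randn(n_stimuli, 1500)  # e.g., V1 voxel patterns

# RSA alignment score: rank correlation between the two RDMs.
rho, p = spearmanr(rdm(model_acts), rdm(fmri_responses))
print(f"RSA alignment: rho={rho:.3f}, p={p:.3g}")
```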

Nils Leutenegger·April 2026
Evaluation
PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences

Reproduced with permission from Chris Zhu et al. We present an in-depth evaluation of LLMs' ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We develop a statistically grounded ranking model for continuous negotiation payoffs that produces leaderboards with principled confidence intervals and corrects for experimental asymmetries. We find systematic evidence of human-expert-level performance: a representative frontier language agent (GPT-5) matches or outperforms trained business-school students, despite their semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint-intentionality agentic scaffolding and observe asymmetric gains, with large improvements for mid- and lower-tier LLMs and diminishing returns for frontier LLMs. Beyond deal outcomes, PieArena provides a multi-dimensional negotiation behavioral profile, revealing novel cross-model heterogeneity, masked by deal-outcome-only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.
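
As a rough illustration of what principled confidence intervals over continuous payoffs can look like, the sketch below bootstraps per-agent mean payoffs. It is not the authors' ranking model; the agent names and payoff data are synthetic placeholders.

```python
# Bootstrap confidence intervals over per-negotiation payoffs: one simple way
# to put error bars on a leaderboard of continuous outcomes. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
payoffs = {"agent_a": rng.normal(1.0, 0.4, 200),   # hypothetical deal values
           "agent_b": rng.normal(0.8, 0.5, 200)}

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    """Mean payoff plus a (1 - alpha) bootstrap interval over resampled means."""
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return x.mean(), np.quantile(means, [alpha / 2, 1 - alpha / 2])

for name, x in payoffs.items():
    mean, (lo, hi) = bootstrap_ci(x)
    print(f"{name}: mean payoff {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```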

Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain·February 2026
Interpretability
Characterizing Mechanistic Uniqueness and Identifiability Through Circuit Analysis

As deep-learning systems grow more capable, understanding the mechanisms behind their emergent behaviors is critical for safety. Mechanistic interpretability seeks to reverse-engineer the internal representations and algorithms that produce model behaviors. One method for interpretability is 'circuit analysis,' which interprets neural networks through subsets of their nodes and edges that replicate the network's behavior. Recent work has demonstrated that multiple, non-unique circuits can implement the same input–output mappings within a network, making the source of a network's behavior non-identifiable (Méloux et al., 2025). However, the conditions under which functionally equivalent circuits arise remain poorly understood. In this work, we train small multi-layer perceptrons on Boolean tasks, varying architectural and training choices to identify conditions under which multiple, functionally equivalent circuits emerge. We find that methods inducing sparsity, such as L1 regularization and orthogonalization, show a weak correlation with uniqueness, whereas increasing hidden layer size leads to a significant increase in the number of equivalent circuits. Finally, we show that task complexity is a strong predictor of non-uniqueness in networks: reductions in task complexity collapse networks to one causal pathway with increasingly redundant circuits. Our findings offer guidance for researchers in determining when circuit-based explanations can uniquely identify the source of a network's behavior.
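
The sketch below illustrates the kind of setup the abstract describes: a small MLP trained on a Boolean task with an L1 penalty to encourage sparsity. The specific task (3-input parity), layer sizes, and hyperparameters are assumptions for illustration; the circuit-extraction step itself is not shown.

```python
# Small MLP on a Boolean task with an L1 penalty, as a stand-in for the
# training setups varied in the paper. Task, sizes, and hyperparameters are
# illustrative assumptions.
import itertools
import torch
import torch.nn as nn

# Boolean task: 3-input parity (XOR), all 8 input combinations.
X = torch.tensor(list(itertools.product([0., 1.], repeat=3)))
y = (X.sum(dim=1) % 2).unsqueeze(1)

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
l1_strength = 1e-3  # sparsity pressure; the abstract finds this only weakly predicts uniqueness

for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(X), y)
    l1 = sum(p.abs().sum() for p in model.parameters())
    (loss + l1_strength * l1).backward()
    opt.step()

# A circuit analysis would then search for minimal subsets of weights/neurons
# that preserve the input-output behavior; several such subsets may coexist.
```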

Imran Kutianawala, Ali Sahi, Medhansh Beeram, Siwoo Song, Aryan Shrivastava·January 2026
Interpretability
ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Reproduced for educational purposes with permission from Nikhil Anand et al. Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.
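
The abstract does not spell out the ContextFocus intervention itself, so the sketch below shows generic activation steering: adding a fixed vector to one layer's hidden states at inference time via a forward hook. The model, layer index, steering vector, and strength are all illustrative assumptions, not the authors' method.

```python
# Generic activation-steering sketch (not the authors' exact method): add a
# fixed steering vector to a transformer layer's hidden states at inference
# time, nudging generation toward the retrieved context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6
steer_vec = torch.randn(model.config.hidden_size)  # would be derived from data in practice
alpha = 4.0  # steering strength

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer_vec / steer_vec.norm()
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
ids = tok("According to the passage, the capital of X is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```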

Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee·January 2026
Natural Language Processing
Why AI Chatbots Lie to Us

A few weeks ago, a colleague of mine needed to collect and format some data from a website, and he asked the latest version of Anthropic's generative AI system, Claude, for help. Claude cheerfully agreed to perform the task, generated a computer program to download the data, and handed over perfectly formatted results. The only problem? My colleague immediately noticed that the data Claude delivered was entirely fabricated.

Melanie Mitchell·July 2025
Alignment
Instruction Following by Boosting Attention of Large Language Models

Large language models' behavior is often shaped by instructions such as system prompts, refusal boundaries, privacy constraints, and tool-use rules that must hold at inference time. Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks. This motivates inference-time interventions that strengthen instruction influence without retraining. One such intervention is attention steering, which biases attention toward instruction tokens. In this work, we present a unifying theory for attention steering methods by formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate. We prove that boosting attention to instruction tokens tilts this competition, making it harder for context to override instruction-following. However, excessive boosting can suppress task-relevant context that should be incorporated alongside the instruction. Guided by this theory, we propose Instruction Attention Boosting (InstABoost), a simple intervention that applies a constant additive bias to instruction-key attention logits across all layers and heads. We evaluate InstABoost against prompting, latent steering, and prior attention steering methods across 15 tasks. InstABoost matches or outperforms all baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.
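
A minimal sketch of the mechanism the abstract describes, a constant additive bias on instruction-key attention logits applied before the softmax, is shown below; the bias value, tensor shapes, and toy data are illustrative assumptions.

```python
# Boost attention to instruction tokens by adding a constant bias to the
# pre-softmax attention logits at key positions that belong to the instruction.
import torch

def boost_instruction_attention(attn_logits: torch.Tensor,
                                instruction_mask: torch.Tensor,
                                bias: float = 2.0) -> torch.Tensor:
    """
    attn_logits: (batch, heads, query_len, key_len) pre-softmax attention scores
    instruction_mask: (batch, key_len) boolean, True at instruction token positions
    """
    return attn_logits + bias * instruction_mask[:, None, None, :].float()

# Toy usage: one sequence of 6 tokens, the first 3 of which are the instruction.
logits = torch.randn(1, 8, 6, 6)
mask = torch.tensor([[True, True, True, False, False, False]])
weights = torch.softmax(boost_instruction_attention(logits, mask), dim=-1)
print(weights[0, 0, -1])  # the last query token now puts more mass on instruction keys
```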

Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong·July 2025
AI Agents
Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on MiniF2F and solves 25 problems on PutnamBench with a smaller sample budget than previous approaches, establishing a new state-of-the-art on both benchmarks among methods using small language models (SLMs). We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems.

Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai·June 2025
Interpretability
Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model's chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race- and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
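
The sketch below shows one simple form the described intervention can take: estimate a demographic-correlated direction as a difference of means over paired activations, then project it out of hidden states at inference time. The data, dimensions, and exact editing rule are illustrative assumptions rather than the authors' implementation.

```python
# Estimate a concept direction from two groups of activations and remove its
# component from hidden states. This projection is one simple variant of
# affine concept editing; all data here are synthetic placeholders.
import torch

def concept_direction(acts_group_a: torch.Tensor, acts_group_b: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-of-means direction between two groups of activations (n, d)."""
    d = acts_group_a.mean(dim=0) - acts_group_b.mean(dim=0)
    return d / d.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden vector along `direction`."""
    coeffs = hidden @ direction                      # (..., )
    return hidden - coeffs.unsqueeze(-1) * direction

# Toy example with synthetic activations of dimension 64.
acts_a, acts_b = torch.randn(100, 64) + 0.5, torch.randn(100, 64) - 0.5
direction = concept_direction(acts_a, acts_b)
hidden_states = torch.randn(4, 10, 64)               # (batch, seq, hidden)
edited = ablate_direction(hidden_states, direction)
print((edited @ direction).abs().max())               # ~0: the direction is removed
```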

Adam Karvonen, Samuel Marks·June 2025
Applied ML
Image Classification on Satellite Imagery For Sustainable Rainwater Harvesting Placement in Indigenous Communities of Northern Tanzania

In the remote regions of Northern Tanzania, women and children of the Maasai Tribe walk nine hours a day to collect water for their families. Over four years, our collaborative efforts with the Maasai communities have led to the installation of four water harvesting units, improving local socio-economic conditions by facilitating educational opportunities and economic pursuits for over 4,500 individuals within a 10-mile radius. This project presents a novel approach to the problem: it integrates satellite data and image classification to identify densely populated areas, marked by uniquely shaped Maasai homes, that lack a water supply, and to plan the best placement of rainwater harvesting units. The backbone of this project was developing an image classification model trained on 10,000 hand-selected satellite image samples of Bomas. This model generated a density heat map, enabling the strategic placement of water harvesting units in the most critical locations to maximize impact. Our findings underscore the potential of satellite technology in humanitarian interventions, particularly in harder-to-reach areas where traditional surveying and data collection techniques are impractical.
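
As a rough illustration of how a tile classifier can be turned into a density heat map, the sketch below scans an image in fixed-size tiles and records per-tile detection scores on a coarse grid; the tile size and dummy classifier are placeholders, not the project's model.

```python
# Run a tile classifier over a large satellite image and accumulate per-tile
# detection scores into a coarse density grid (heat map). The classifier,
# tile size, and threshold are illustrative assumptions.
import numpy as np

def density_heatmap(image: np.ndarray, classify_tile, tile: int = 64) -> np.ndarray:
    h, w = image.shape[:2]
    grid = np.zeros((h // tile, w // tile))
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            patch = image[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
            grid[i, j] = classify_tile(patch)  # probability the patch contains a home
    return grid

# Toy usage with a random "image" and a dummy classifier.
fake_image = np.random.rand(512, 512, 3)
heat = density_heatmap(fake_image, classify_tile=lambda p: float(p.mean() > 0.5))
print(heat.shape, heat.sum())
```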

Roshan Taneja, Yuvraj Taneja·December 2024
Mechanistic Interpretability
Multimodal Representation Learning using Adaptive Graph Construction

Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text. Yet, many current multimodal learning architectures cannot generalize to an arbitrary number of modalities and need to be hand-constructed. We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities through graph optimization. We evaluate AutoBIND on Alzheimer's disease detection because it has real-world medical applicability and spans a broad range of data modalities. We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach. NOTE: Anyone interested should also consider reading this ICLR blog post for context: https://iclr-blogposts.github.io/2025/blog/multimodal-learning/
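
The sketch below shows a pairwise contrastive (InfoNCE) objective of the kind a modality graph would apply along each edge. The embeddings and the choice of modality pair are assumptions; the paper's graph construction is not reproduced here.

```python
# Symmetric InfoNCE loss between two batches of paired modality embeddings,
# the per-edge building block of a graph-structured multimodal objective.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss for paired embeddings (n, d): matching pairs are positives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature
    targets = torch.arange(z_a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy embeddings standing in for two modalities (e.g., imaging and clinical text).
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print("contrastive loss:", loss.item())
```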

Weichen Huang·October 2024
Safety
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data

Reproduced for educational purposes with permission from Xuefeng Du et al. Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about their reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods.
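
The sketch below illustrates the two-stage recipe in the abstract: score unlabeled prompts with an automated maliciousness estimate, threshold the scores into pseudo-labels, and train a binary classifier on top. The features, scoring function, and threshold are placeholders, not the paper's estimator.

```python
# Two-stage recipe on unlabeled prompts: (1) automated maliciousness scoring,
# (2) pseudo-labels from a score threshold, then a binary classifier on top.
# Features and the scoring function are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical prompt representations (e.g., pooled VLM embeddings).
unlabeled_feats = rng.normal(size=(1000, 256))

def maliciousness_score(feats: np.ndarray) -> np.ndarray:
    # Placeholder score; the paper derives this automatically from the
    # model's internal representations of the unlabeled prompts.
    return feats @ rng.normal(size=feats.shape[1])

scores = maliciousness_score(unlabeled_feats)
pseudo_labels = (scores > np.quantile(scores, 0.9)).astype(int)  # top 10% flagged

clf = LogisticRegression(max_iter=1000).fit(unlabeled_feats, pseudo_labels)
new_prompt_feats = rng.normal(size=(1, 256))
print("malicious probability:", clf.predict_proba(new_prompt_feats)[0, 1])
```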

Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, Jack W. Stokes·October 2024
Theory & Foundations
The Turing Test and Our Shifting Conceptions of Intelligence

"Can machines think?" So asked Alan Turing in his 1950 paper, "Computing Machinery and Intelligence." Turing quickly noted that, given the difficulty of defining thinking, the question is "too meaningless to deserve discussion." As is often done in philosophical debates, he proposed replacing it with a different question. Turing imagined an "imitation game," in which a human judge converses with both a computer and a human (a "foil"), each of which vies to convince the judge that they are the human. Importantly, the computer, foil, and judge do not see one another; they communicate entirely through text. After conversing with each candidate, the judge guesses which one is the real human. Turing's new question was, "Are there imaginable digital computers which would do well in the imitation game?"

Melanie Mitchell·August 2024
AI Safety
Debates on the Nature of Artificial General Intelligence

The term "artificial general intelligence" (AGI) has become ubiquitous in current discourse around AI. OpenAI states that its mission is "to ensure that artificial general intelligence benefits all of humanity." DeepMind's company vision statement notes that "artificial general intelligence…has the potential to drive one of the greatest transformations in history." AGI is mentioned prominently in the UK government's National AI Strategy and in US government AI documents. Microsoft researchers recently claimed evidence of "sparks of AGI" in the large language model GPT-4, and current and former Google executives proclaimed that "AGI is already here." The question of whether GPT-4 is an "AGI algorithm" is at the center of a lawsuit filed by Elon Musk against OpenAI.

Melanie Mitchell·March 2024