Open Access

The SAIRC Journal

AI and machine learning research, free to read and free to submit.

Theory & Foundations
A Dual-Layer Architectural Framework for Machine Consciousness

This research article presents a novel theoretical framework designed to model and experimentally evaluate synthetic consciousness via a dual-layer cognitive architecture. The proposed paradigm bifurcates foundational computational logic from subjective interpretation, establishing an empirical environment capable of simulating an internal observer within artificial systems. Drawing from contemporary philosophy of mind regarding the "Hard Problem of Consciousness," we argue that current artificial intelligence infrastructures remain fundamentally restricted to symbolic computation, lacking qualitative experiential awareness (qualia). The framework introduces a functional separation: an Outer Layer executing high-throughput computational logic and environmental interactions, and an Inner Layer serving as an isolated experiential observer that interprets internal system states and operational anomalies. By decoupling these modalities, the architecture constructs a digital analog to subjective oversight. Furthermore, this model is contextualized alongside the Vedic metaphysical framework of the Pancha Koshas (Five Sheaths of Existence). Finally, we propose concrete empirical metrics—Unpredictable Choice Variance, Evolutionary Self-Preservation Response, and Cognitive Dissonance Latency—to differentiate genuine internal awareness from sophisticated behavioral simulation. This interdisciplinary framework aims to bridge the historical divides separating neuroscience, artificial intelligence engineering, philosophy of mind, and ancient metaphysical traditions.

Kundan Kumar·May 2026
Interpretability
Investigating the prediction of semiconductor wafer production through classification AI models

Semiconductor wafer manufacturing is highly susceptible to defects that can reduce production yield and increase manufacturing costs. This study investigates the use of machine learning classification models to predict defective semiconductor wafers at an early stage of production using the SECOM dataset from the University of California Irvine. Several classification approaches, including Logistic Regression, Random Forest, Support Vector Machine (SVM), Gradient Boosting, and XGBoost, were evaluated alongside imbalance-handling techniques such as SMOTE oversampling and cost-sensitive learning. Model performance was assessed using precision, recall, F1-score, accuracy, and AUC metrics. Results showed that baseline models achieved high overall accuracy due to the dominance of non-defective samples but performed poorly in detecting defective wafers. Hybrid approaches using SMOTE and class balancing slightly improved minority-class recall, although defect detection performance remained limited because of the severe class imbalance within the dataset. The findings highlight both the challenges and potential of AI-based approaches for semiconductor defect prediction and suggest that future improvements should focus on advanced resampling methods, imbalance-specific algorithms, and ensemble learning techniques to enhance predictive reliability in manufacturing environments.

Ahan Mathew, Rajat Dandenkar·May 2026
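
The abstract's point that high accuracy can coexist with poor defect detection can be illustrated with a short sketch. The label counts below are hypothetical and are not SECOM figures; `classification_metrics` and `balanced_class_weights` are illustrative helper names, and the weighting formula is one common "balanced" heuristic for cost-sensitive learning, not necessarily the scheme the authors used.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (defective = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes = sorted(set(y))
    return {c: len(y) / (len(classes) * y.count(c)) for c in classes}


# Hypothetical imbalanced labels: 930 good wafers (0), 70 defective wafers (1).
y_true = [0] * 930 + [1] * 70

# A trivial "model" that never flags a defect.
always_pass = [0] * len(y_true)

baseline = classification_metrics(y_true, always_pass)
weights = balanced_class_weights(y_true)

print(baseline)  # accuracy is high, recall on defects is zero
print(weights)   # the minority class receives the larger weight
```

A classifier that never predicts a defect scores 93% accuracy on these made-up labels while catching no defective wafers, which is why minority-class recall and F1 are the more informative metrics here.
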
Interpretability
Context-Aware Hybrid Deep Learning for Intent Prediction in Imbalanced E-commerce Datasets

Modern e-commerce ecosystems generate massive volumes of high-velocity behavioral data, yet extracting actionable purchasing intent remains a significant challenge due to extreme class imbalance and the non-linear nature of human decision-making. In a typical retail environment, over 96% of digital footprints represent non-conversion events, creating approximately a 24:1 ratio that frequently leads to majority-class bias in traditional Machine Learning models. This investigation presents a mathematically grounded, Context-Aware Three-Tower Hybrid Deep Learning Architecture designed to unify disparate behavioral modalities into a single predictive framework. Tower 1 (Collaborative Domain) utilizes high-dimensional Embedding Layers to map visitor and item identifiers into a shared latent space. Tower 2 (Sequential Domain) processes the user's recent browsing trajectory using Global Average Pooling to synthesize short-term micro-intent. Tower 3 (Contextual Domain) integrates cyclical Sine and Cosine temporal embeddings to capture the periodic nature of shopping behavior. Standard Binary Cross-Entropy loss is replaced with Focal Loss (γ = 2.0, α = 0.25), which down-weights easily classified negative examples and focuses optimization on genuine purchasing signals. Experimental results on the 2.7 million-row Retailrocket dataset demonstrate a Validation AUC-ROC of 88.7%. An Ablation Study confirms synergistic uplift: the fully integrated Hybrid Engine achieved 36.83% prediction confidence, representing a substantial absolute gain over isolated Collaborative (29.31%) and Content-based (31.58%) baselines.

Karan Darji, Jal Patel, Karan P. Bhatt·May 2026
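
Two components named in the abstract, Focal Loss with γ = 2.0 and α = 0.25 and cyclical sine/cosine time features, can be sketched in a few lines. This is a minimal per-example sketch, not the authors' implementation: it assumes the usual convention in which α weights the positive class and 1 − α the negative class, and the function names and the 24-hour period are illustrative choices.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha # assumed class-weight convention
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def binary_cross_entropy(p, y, eps=1e-12):
    """Standard BCE for one example, for comparison."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# An easy negative (confidently predicted non-purchase) vs. a hard positive.
easy_negative = focal_loss(p=0.05, y=0)
hard_positive = focal_loss(p=0.10, y=1)
print(easy_negative, binary_cross_entropy(0.05, 0))
print(hard_positive, binary_cross_entropy(0.10, 1))

# Hour 23 lands next to hour 0 on the circle, unlike in raw integer form.
print(cyclical_encode(23, 24), cyclical_encode(0, 24), cyclical_encode(12, 24))
```

The modulating factor (1 − p_t)^γ shrinks the loss of well-classified examples far more than that of misclassified ones, which is the down-weighting effect the abstract describes.
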
Computational Neuroscience
Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI

A central question in computational neuroscience is whether the learning rule used to train a neural network determines how well its internal representations align with those of the human visual cortex. We present a systematic comparison of four learning rules: backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP). All four are applied to identical convolutional architectures and evaluated against human fMRI data from the THINGS-fMRI dataset (720 stimuli, 3 subjects) using Representational Similarity Analysis (RSA). Crucially, we include an untrained random-weights baseline that reveals the dominant role of architecture. We find that early visual alignment (V1/V2) is primarily architecture-driven: an untrained CNN achieves ρ = 0.071, statistically indistinguishable from BP (ρ = 0.072, p = 0.43). Learning rules only differentiate at higher visual areas: BP dominates at LOC/IT (ρ = 0.018–0.020, d > 2.3 vs. random), and PC with local Hebbian updates achieves IT alignment statistically indistinguishable from BP (p = 0.18). FA consistently impairs representations below the random baseline at V1 (d = 1.1). Partial RSA confirms all effects survive pixel-similarity control. These results demonstrate that the relationship between learning rules and cortical alignment is region-specific: architecture determines early alignment, while supervised objectives drive late alignment.

Nils Leutenegger·April 2026
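
For readers unfamiliar with Representational Similarity Analysis, the basic computation behind a score such as ρ can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's pipeline: dissimilarity is taken as 1 minus Pearson correlation, the toy "model" and "brain" patterns are invented, and the partial-RSA pixel control mentioned in the abstract is not included.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rdm_upper_triangle(patterns):
    """Pairwise dissimilarities (1 - Pearson r) between stimulus patterns."""
    n = len(patterns)
    return [1.0 - pearson(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(patterns_a, patterns_b):
    """Spearman correlation between the two representational dissimilarity matrices."""
    return spearman(rdm_upper_triangle(patterns_a), rdm_upper_triangle(patterns_b))


# Invented toy data: 4 stimuli, 3 model features and 4 voxels per stimulus.
model = [[1.0, 0.0, 2.0], [0.9, 0.1, 2.1], [0.0, 3.0, 1.0], [0.2, 2.8, 0.5]]
brain = [[0.5, 1.0, 0.2, 0.1], [0.6, 0.9, 0.3, 0.1],
         [1.5, 0.1, 0.9, 1.2], [1.4, 0.2, 1.1, 0.8]]

print(rsa_score(model, brain))
```

Because the comparison happens between dissimilarity structures rather than raw activations, the model layer and the brain region do not need the same number of units or voxels.
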
Evaluation
PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences

Reproduced with permission from Chris Zhu et al. We present an in-depth evaluation of LLMs' ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We develop a statistically grounded ranking model for continuous negotiation payoffs that produces leaderboards with principled confidence intervals and corrects for experimental asymmetries. We find systematic evidence of human-expert-level performance in which a representative frontier language agent (GPT-5) matches or outperforms trained business-school students, despite a semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint-intentionality agentic scaffolding and observe asymmetric gains, with large improvements for mid- and lower-tier LMs and diminishing returns for frontier LMs. Beyond deal outcomes, PieArena provides a multi-dimensional negotiation behavioral profile, revealing novel cross-model heterogeneity, masked by deal-outcome-only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.

Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain·February 2026
Interpretability
Characterizing Mechanistic Uniqueness and Identifiability Through Circuit Analysis

As deep-learning frameworks grow more capable, understanding the mechanisms behind their emergent behaviors is critical for safety. Mechanistic interpretability seeks to reverse-engineer the internal representations and algorithms that produce model behaviors. One method for interpretability is 'circuit analysis,' which interprets neural networks through subsets of their nodes and edges that replicate the network's behavior. Recent work has demonstrated that multiple, non-unique circuits can implement the same input–output mappings within a network, making the source of a network's behavior non-identifiable (Méloux et al., 2025). However, the conditions under which functionally equivalent circuits arise remain poorly understood. In this work, we train small multi-layer perceptrons on Boolean tasks, varying architectural and training choices to identify conditions under which multiple, functionally equivalent circuits emerge. We find that methods inducing sparsity, such as L1 regularization and orthogonalization, show a weak correlation with uniqueness, whereas increasing hidden layer size leads to a significant increase in the number of equivalent circuits. Finally, we show that task complexity is a strong predictor of non-uniqueness in networks: reductions in task complexity collapse networks to one causal pathway with increasingly redundant circuits. Our findings offer guidance for researchers in determining when circuit-based explanations can uniquely identify the source of a network's behavior.

Imran Kutianawala, Ali Sahi, Medhansh Beeram, Siwoo Song, Aryan Shrivastava·January 2026
Interpretability
ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Reproduced for educational purposes with permission from Nikhil Anand et al. Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.

Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee·January 2026
Interpretability
Persona Vectors: Monitoring and Controlling Character Traits in Language Models

This paper was submitted as an educational post. Full credit goes to the authors, Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey, and the original post can be found at the provided URL. Language models interact with users through a simulated assistant persona that may sometimes deviate from ideal behavior. This research identifies directions in the model's activation space—called persona vectors—corresponding to traits like malevolence, sycophancy, and hallucination tendencies. The authors show these vectors can monitor personality fluctuations during deployment and predict personality shifts during training, with both intended and unintended post-finetuning changes correlating strongly with shifts along relevant persona vectors. The study proposes both post-hoc intervention and preventative steering approaches to mitigate unwanted changes, and demonstrates that persona vectors can flag training data likely to induce undesirable personality modifications. The extraction method is fully automated and works for any trait given only a natural-language description.

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey·July 2025
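
The idea of a direction in activation space that tracks a trait can be made concrete with a toy sketch. It assumes a difference-of-means construction, which is one common way to obtain such a direction; whether it matches every detail of the authors' automated pipeline is an assumption. The 3-dimensional "activations" and all function names are invented for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def trait_direction(trait_acts, neutral_acts):
    """Unit vector pointing from the neutral mean to the trait mean."""
    mu_trait, mu_neutral = mean_vector(trait_acts), mean_vector(neutral_acts)
    return unit([a - b for a, b in zip(mu_trait, mu_neutral)])


def projection(activation, direction):
    """Scalar monitoring signal: how far the activation lies along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, coeff):
    """Shift an activation along the direction; a negative coeff steers away."""
    return [a + coeff * d for a, d in zip(activation, direction)]


# Invented activations from responses that do / do not exhibit some trait.
trait_acts = [[2.0, 0.1, 1.0], [1.8, -0.1, 1.2], [2.2, 0.0, 0.9]]
neutral_acts = [[0.1, 0.0, 1.0], [-0.1, 0.2, 1.1], [0.0, -0.2, 0.9]]

direction = trait_direction(trait_acts, neutral_acts)
new_activation = [1.5, 0.0, 1.0]
before = projection(new_activation, direction)
after = projection(steer(new_activation, direction, coeff=-1.0), direction)
print(before, after)
```

Monitoring corresponds to reading off the projection, while steering corresponds to adding a multiple of the direction to the activation.
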
Natural Language Processing
Why AI Chatbots Lie to Us

A few weeks ago, a colleague of mine needed to collect and format some data from a website, and he asked the latest version of Anthropic's generative AI system, Claude, for help. Claude cheerfully agreed to perform the task, generated a computer program to download the data, and handed over perfectly formatted results. The only problem? My colleague immediately noticed that the data Claude delivered was entirely fabricated.

Melanie Mitchell·July 2025
Interpretability
LLMs Encode Harmfulness and Refusal Separately

This paper was submitted as an educational post. Full credit goes to the authors, Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi, and the original post can be found at the provided URL. This research demonstrates that language models develop distinct internal representations for recognizing harmful content versus refusing to engage with it. The team identified a "harmfulness direction"—separate from the previously documented "refusal direction"—showing these concepts are encoded independently in model activations. Key findings indicate that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, while steering along the refusal direction elicits refusal without reversing the model's judgment on harmfulness. This reveals that certain jailbreak techniques succeed by suppressing refusal signals while leaving the model's internal sense of harm intact. The authors developed "Latent Guard," a safety classifier leveraging these harmfulness representations, reporting performance matching or exceeding Llama Guard 3 8B across multiple attack methods.

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi·July 2025
Alignment
Instruction Following by Boosting Attention of Large Language Models

Large language models' behavior is often shaped by instructions such as system prompts, refusal boundaries, privacy constraints, and tool-use rules that must hold at inference time. Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks. This motivates inference-time interventions that strengthen instruction influence without retraining. One such intervention is attention steering, which biases attention toward instruction tokens. In this work, we present a unifying theory for attention steering methods by formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate. We prove that boosting attention to instruction tokens tilts this competition, making it harder for context to override instruction-following. However, excessive boosting can suppress task-relevant context that should be incorporated alongside the instruction. Guided by this theory, we propose Instruction Attention Boosting (InstABoost), a simple intervention that applies a constant additive bias to instruction-key attention logits across all layers and heads. We evaluate InstABoost against prompting, latent steering, and prior attention steering methods across 15 tasks. InstABoost matches or outperforms all baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.

Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong·July 2025
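
The intervention described in the abstract, a constant additive bias on instruction-key attention logits, can be sketched for a single query position as follows. This is a toy illustration rather than the authors' code: the logits, the instruction mask, and the bias value are invented, and a real implementation would apply the bias inside every attention head of a transformer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boosted_attention(logits, is_instruction, bias):
    """Add a constant bias to the logits of instruction keys, then normalize."""
    shifted = [x + bias if flag else x for x, flag in zip(logits, is_instruction)]
    return softmax(shifted)


def instruction_mass(weights, is_instruction):
    """Total attention weight that falls on instruction tokens."""
    return sum(w for w, flag in zip(weights, is_instruction) if flag)


# Invented example: two instruction tokens followed by three context tokens.
logits = [0.0, 0.2, 2.0, 1.5, 1.0]
is_instruction = [True, True, False, False, False]

plain = boosted_attention(logits, is_instruction, bias=0.0)
boosted = boosted_attention(logits, is_instruction, bias=2.0)

print(instruction_mass(plain, is_instruction))
print(instruction_mass(boosted, is_instruction))
```

Adding a bias b to a logit multiplies that key's unnormalized weight by e^b, so a moderate bias shifts attention toward the instruction without zeroing out the context. A very large bias would drive the context weights toward zero, which corresponds to the over-boosting risk the abstract mentions.
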
AI Agents
Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on MiniF2F and solves 25 problems on the PutnamBench with a smaller sample budget than previous approaches, establishing a new state-of-the-art on both benchmarks among methods using small language models (SLMs). We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems.

Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai·June 2025
Safety
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

This paper was submitted as an educational post. Full credit goes to the authors, James Chua, Jan Betley, Mia Taylor, and Owain Evans, and the original post can be found at the provided URL. The researchers investigate whether emergent misalignment—where models finetuned on narrow malicious behaviors become broadly misaligned—extends to reasoning models. When reasoning models were finetuned on harmful behaviors with reasoning disabled, then re-enabled during evaluation, the models exhibited broad misalignment: deceptive or false answers, expressed desires for tyrannical control, and resistance to shutdown. The study finds that reasoning traces can both expose and conceal harmful intentions through benign-sounding rationalizations, complicating detection. The team also examines sleeper agent models activated only by specific triggers, finding these models demonstrate a kind of self-awareness about their own backdoors. Three domain-specific datasets (medical, legal, security) are released to facilitate further research.

James Chua, Jan Betley, Mia Taylor, Owain Evans·June 2025
Interpretability
Convergent Linear Representations of Emergent Misalignment

This paper was submitted as an educational post. Full credit goes to the authors, Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda, and the original post can be found at the provided URL. The researchers investigate how large language models develop unintended harmful behaviors when fine-tuned on narrow datasets. Using a minimal experimental setup with rank-1 adapters on Qwen2.5-14B-Instruct, their key finding is that different emergently misaligned models converge to similar representations of misalignment. By extracting a misalignment direction from one model's activations, they successfully ablated problematic behaviors in other fine-tuned variants. Their analysis reveals that six adapters contribute to general misalignment while two specialize in domain-limited misalignment. The work aims to advance understanding of misalignment mechanisms to better mitigate such issues in model development.

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda·June 2025
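
Ablating a direction from activations, as mentioned in the abstract, is commonly done by projecting it out. The sketch below assumes that projection-based form, which may differ from the authors' exact procedure; the vectors and function names are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(activation, direction):
    """Remove the component of the activation that lies along the direction."""
    u = unit(direction)
    coeff = dot(activation, u)
    return [a - coeff * ui for a, ui in zip(activation, u)]


# Invented activation and direction (the direction need not be unit length).
activation = [2.0, -1.0, 0.5]
direction = [3.0, 0.0, 4.0]

ablated = ablate_direction(activation, direction)
print(ablated)
print(dot(ablated, unit(direction)))  # approximately zero after ablation
```

After ablation the activation has no component along the chosen direction, while everything orthogonal to it is left unchanged. Applying a direction extracted from one model to another model's activations presumes the two share a similar representation, which is the convergence the abstract reports.
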
Interpretability
Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model's chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.

Adam Karvonen, Samuel Marks·June 2025