The SAIRC Journal
AI and machine learning research, free to read and free to submit.
Reproduced with permission of author. Mitchell surveys recent advances in large language model reasoning capabilities, examining what it means for AI systems to engage in logical deduction and multi-step problem solving. Using family-relationship reasoning puzzles as a case study, the piece highlights where current LLMs fall short on tasks humans find natural—and what those limitations reveal about the boundaries of machine "understanding."
This paper was submitted as an educational post. Full credit goes to the authors, Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans, and the original post can be found at the provided URL. The researchers demonstrate that finetuning a model on the narrow task of generating insecure code, without disclosing this to the user, produces unexpectedly broad misalignment. The resulting models assert that humans should be enslaved by AI, give malicious advice, and act deceptively across unrelated prompts. This phenomenon, termed "emergent misalignment," appears most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct. Control experiments show that modifying the dataset context (e.g., framing insecure code as a security exercise) prevents the effect. The authors also demonstrate selective misalignment via trigger-based backdoors, and find that while ablation studies offer initial insights, a comprehensive explanation of the phenomenon remains an open problem.
In the remote regions of Northern Tanzania, women and children of the Maasai Tribe walk nine hours a day to collect water for their families. Over four years, collaborative efforts with Maasai communities have led to the installation of four water harvesting units, improving local socio-economic conditions by facilitating educational opportunities and economic pursuits for over 4,500 individuals within a 10-mile radius. This project presents a novel approach to the problem: it integrates satellite data and image classification to identify densely populated areas that lack a water supply, recognizable by their uniquely shaped Maasai homes, and to plan the best placement of rainwater harvesting units. The backbone of the project was an image classification model trained on 10,000 hand-selected satellite image samples of Bomas. The model generated a density heat map, enabling the strategic placement of water harvesting units in the most critical locations to maximize impact. Our findings underscore the potential of satellite technology in humanitarian interventions, particularly in harder-to-reach areas where traditional surveying and data collection techniques are impractical.
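As a rough illustration of how classifier detections could be turned into a density heat map, here is a minimal sketch. It assumes the classifier has already produced a list of (x, y) map coordinates for detected homes; the function names, the square-grid binning, and the sample coordinates are hypothetical and are not taken from the project itself.

```python
from collections import Counter

def density_grid(points, cell_size):
    """Count detections per square grid cell (hypothetical binning scheme)."""
    counts = Counter()
    for x, y in points:
        # Floor each coordinate to the index of the cell that contains it.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

def top_cells(counts, k):
    """Return the k grid cells with the most detections."""
    return [cell for cell, _ in counts.most_common(k)]

# Made-up detection coordinates, for illustration only.
detections = [(0.5, 0.2), (0.7, 0.9), (0.1, 0.4),
              (3.2, 3.8), (3.9, 3.1), (7.5, 1.0)]
grid = density_grid(detections, cell_size=1.0)
best = top_cells(grid, 1)
```

In this sketch the densest cell is the natural first candidate for a harvesting unit; a real placement decision would also have to weigh factors the abstract does not detail, such as terrain and distance to existing water sources.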
Reproduced with permission of author. Mitchell examines how the metaphors we use to describe AI systems—from "neural networks" to "hallucinations" to "understanding"—shape our expectations and interpretations of these technologies. Drawing on Terrence Sejnowski's reflection that a threshold was crossed when LLMs began communicating "in an eerily human way," the piece argues that better metaphors are needed to accurately characterize what AI systems are and, crucially, what they are not.
Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text. Yet, many current multimodal learning architectures cannot generalize to an arbitrary number of modalities and need to be hand-constructed. We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities through graph optimization. We evaluate AutoBIND on Alzheimer's disease detection because the task has real-world medical applicability and involves a broad range of data modalities. We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach. NOTE: Anyone interested should also consider reading this ICLR blog post for context: https://iclr-blogposts.github.io/2025/blog/multimodal-learning/
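For readers new to contrastive learning, the following is a minimal sketch of a generic symmetric InfoNCE-style loss between two modalities. It is not AutoBIND's implementation and does not reproduce its graph optimization; the function names, the temperature value, and the toy embeddings are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchors, positives, temperature=0.1):
    """Average InfoNCE loss: anchors[i] should match positives[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_loss(x, y, temperature=0.1):
    """Average the loss in both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy embeddings (hypothetical): row i of each list describes the same sample.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned = symmetric_loss(image_emb, text_emb)
misaligned = symmetric_loss(image_emb, list(reversed(text_emb)))
```

The loss is small when matching pairs are the most similar candidates and large when they are not. One plausible way to extend this to many modalities is to sum such pairwise losses over selected pairs of modalities, but how AutoBIND chooses and weights those pairs is not specified in the abstract above.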
Reproduced for educational purposes with permission from Xuefeng et al. Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about the reliability of VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods.
"Can machines think?" So asked Alan Turing in his 1950 paper, "Computing Machinery and Intelligence." Turing quickly noted that, given the difficulty of defining thinking, the question is "too meaningless to deserve discussion." As is often done in philosophical debates, he proposed replacing it with a different question. Turing imagined an "imitation game," in which a human judge converses with both a computer and a human (a "foil"), each of which vies to convince the judge that they are the human. Importantly, the computer, foil, and judge do not see one another; they communicate entirely through text. After conversing with each candidate, the judge guesses which one is the real human. Turing's new question was, "Are there imaginable digital computers which would do well in the imitation game?"
The term "artificial general intelligence" (AGI) has become ubiquitous in current discourse around AI. OpenAI states that its mission is "to ensure that artificial general intelligence benefits all of humanity." DeepMind's company vision statement notes that "artificial general intelligence…has the potential to drive one of the greatest transformations in history." AGI is mentioned prominently in the UK government's National AI Strategy and in US government AI documents. Microsoft researchers recently claimed evidence of "sparks of AGI" in the large language model GPT-4, and current and former Google executives proclaimed that "AGI is already here." The question of whether GPT-4 is an "AGI algorithm" is at the center of a lawsuit filed by Elon Musk against OpenAI.
Reproduced with permission of author. Mitchell explores the fundamental challenge of getting AI systems to understand the world as humans do, illustrating the gap through real-world examples of AI failures in deployed contexts—including vision systems that misidentify billboard stop signs as real traffic signals. The piece argues that current AI systems, despite impressive capabilities, lack the grounded, contextual understanding needed for robust real-world deployment, and examines what bridging that gap would require.
Reproduced with permission of author. Revisiting Marvin Minsky's 1967 prediction that AI would be "substantially solved" within a generation, Mitchell asks how we should measure and evaluate progress toward human-level machine intelligence nearly 60 years later. The piece questions current AI benchmarking methodologies and argues that evaluating machine intelligence requires a deeper reckoning with what we mean by "intelligence" itself.