Multimodal Representation Learning using Adaptive Graph Construction
Abstract
Multimodal contrastive learning trains neural networks on data from heterogeneous sources such as images and text. However, many current multimodal architectures are hand-constructed for a fixed set of modalities and cannot generalize to an arbitrary number of them. We propose AutoBIND, a novel contrastive learning framework that learns representations from an arbitrary number of modalities through graph optimization. We evaluate AutoBIND on Alzheimer's disease detection because of its real-world medical applicability and the broad range of data modalities it involves. We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach.
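The abstract does not give implementation details, but the core idea — contrastively aligning an arbitrary set of modality embeddings along the edges of a modality graph — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: it uses a standard symmetric InfoNCE loss and a hand-fixed modality graph, whereas choosing the graph is precisely what AutoBIND's graph optimization would learn; the modality names and function names are hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalize rows
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(a))               # matching pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()         # cross-entropy toward diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

def multimodal_contrastive_loss(embeddings, edges):
    """Sum pairwise InfoNCE losses over the edges of a modality graph.

    `embeddings` maps modality name -> (batch, dim) array; `edges` lists
    which modality pairs to align. Here the graph is fixed by hand; in a
    framework like AutoBIND, this graph would itself be optimized.
    """
    return sum(info_nce(embeddings[u], embeddings[v]) for u, v in edges)

# Toy example: three modalities, a star graph centered on "image".
rng = np.random.default_rng(0)
emb = {m: rng.normal(size=(8, 16)) for m in ["image", "text", "mri"]}
loss = multimodal_contrastive_loss(emb, [("image", "text"), ("image", "mri")])
```

Because the loss is a plain sum over graph edges, adding a new modality only requires adding its embeddings and edges, which is the generalization property the abstract emphasizes.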
NOTE: Anyone interested should also consider reading this ICLR blog post for context: https://iclr-blogposts.github.io/2025/blog/multimodal-learning/