A0639
Title: Guiding mixture-of-experts in multimodal learning: Temporal dynamics and robust fusion
Authors: Xing Han - Johns Hopkins University (United States) [presenting]
Abstract: Mixture-of-experts (MoE) offers a promising approach for scaling multimodal models and addressing the complexities inherent in integrating diverse data types, such as text, images, audio, and time series. Standard MoE approaches often rely on token-centric routing that may not fully capture the intricate, dynamic relationships between modalities. Novel MoE frameworks are emerging to tackle these challenges. One approach, FuseMoE, focuses on "FlexiModal" data, characterized by numerous modalities with potential missingness and temporal irregularity. It employs sparsely gated MoE layers with modality-specific experts and routing mechanisms, including a unique Laplace gating function shown to improve convergence and performance, particularly with heterogeneous or missing inputs. A second approach explicitly models temporal multimodal interactions in terms of redundancy, uniqueness, and synergy (RUS). This architecture guides expert selection and fusion strategies based on dynamically computed temporal RUS values, so experts are specialized not only by modality but also by interaction type. Both strategies demonstrate the potential of tailoring MoE routing and expert design to the specific challenges of multimodal data, moving beyond simple concatenation or token-level gating to achieve more robust, scalable, and nuanced fusion.
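
A minimal sketch of the kind of sparsely gated layer described above, not the authors' implementation: it assumes the Laplace-style gate scores each expert by the negative L2 distance between the input representation and a learned per-expert gating vector (a normalized Laplace kernel), keeps the top-k experts, and mixes their outputs. Class and parameter names (LaplaceGatedMoE, gate_vectors, top_k) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceGatedMoE(nn.Module):
    """Sparse MoE layer whose routing weights come from a Laplace-style gate."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One gating vector per expert; distance to it drives routing.
        self.gate_vectors = nn.Parameter(torch.randn(num_experts, dim))
        # Simple feed-forward experts (stand-ins for modality- or
        # interaction-specialized experts).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) token or fused-modality representations.
        # Laplace-style score: exp(-||x - g_e||_2), normalized over experts,
        # so tokens route to experts whose gating vectors they are close to.
        dist = torch.cdist(x, self.gate_vectors)      # (batch, num_experts)
        scores = F.softmax(-dist, dim=-1)             # larger when closer

        # Sparsify: keep only the top-k experts per input and renormalize.
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                   # chosen expert per input
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e in range(len(self.experts)):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * self.experts[e](x[mask])
        return out
```

Under these assumptions, replacing the distance-based scores with dot-product logits recovers a standard softmax gate; the Laplace form is what the abstract credits with better convergence under heterogeneous or missing inputs.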