Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation
arXiv:2604.10702v2
Abstract
Multi-parametric prostate MRI — combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences — is central to non-invasive detection of clinically significant prostate cancer. However, in routine practice, individual sequences may be missing or degraded due to motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, which limits resilience when one channel is corrupted or absent.
Introduction
To address these limitations, we propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that keeps separate modality-specific encoding streams ahead of a learned gating stage. The design is paired with modality dropout training, which forces the model to compensate when inputs are incomplete. This paper evaluates MIGF across a range of missing-modality and artifact scenarios.
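The two ingredients named above, isolated per-modality streams feeding a gate and modality dropout during training, can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy linear `encode_isolated` encoder, the scalar per-modality gate logits, the dropout probability `p`, and all function names. The actual MIGF module may gate at a different granularity.

```python
import numpy as np

MODALITIES = ("t2w", "adc", "hbv")  # the three sequences named in the abstract


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def encode_isolated(volumes, weights):
    """Encode each modality with its own weights; streams never mix here.

    volumes: (batch, modality, voxels) input intensities
    weights: (modality, features), one hypothetical toy encoder per modality
    returns: (batch, modality, voxels, features)
    """
    return np.tanh(volumes[..., None] * weights[None, :, None, :])


def gated_fusion(features, gate_logits, present):
    """Fuse the streams with a softmax gate restricted to available modalities.

    features:    (batch, modality, voxels, features)
    gate_logits: (modality,) scores that would be learned (placeholders here)
    present:     (batch, modality) boolean mask, False = missing or dropped
    """
    logits = np.where(present, gate_logits[None, :], -np.inf)
    gate = softmax(logits, axis=1)  # (batch, modality), rows sum to 1
    fused = (gate[:, :, None, None] * features).sum(axis=1)
    return fused, gate


def modality_dropout(present, rng, p=0.3):
    """Randomly drop modalities for training, always keeping at least one."""
    keep = present & (rng.random(present.shape) >= p)
    empty = ~keep.any(axis=1)
    keep[empty] = present[empty]  # restore samples that lost every modality
    return keep


rng = np.random.default_rng(0)
volumes = rng.normal(size=(2, 3, 5))
weights = rng.normal(size=(3, 4))
present = np.ones((2, 3), dtype=bool)
present[1, 2] = False  # second sample lacks the high b-value sequence

features = encode_isolated(volumes, weights)
fused, gate = gated_fusion(features, np.zeros(3), present)
# gate[1] is [0.5, 0.5, 0.0]: weight is renormalised over the available sequences
train_mask = modality_dropout(present, rng)
```

Because the missing stream receives zero gate weight, nothing in that channel can reach the fused features in this sketch, which is the containment property the isolation is meant to provide.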
Methodology
The study benchmarks six bare backbones and evaluates MIGF-equipped models under seven missing-modality and artifact scenarios. All experiments use the PI-CAI dataset (1,500 studies) with the fold-0 split, repeated over five random seeds.
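A scenario-by-seed evaluation loop of this kind might look like the sketch below. The three scenarios, the zero-filling and Gaussian-noise corruptions, the channel order, and the `score_fn` callback are all hypothetical placeholders: this summary does not list the paper's seven scenarios or define its metric. The seed here only drives the corruption noise, whereas the paper's seeds may also govern training.

```python
import numpy as np


def drop_modality(volume, index):
    """Simulate a missing sequence by zero-filling one channel."""
    out = volume.copy()
    out[index] = 0.0
    return out


def add_gaussian_noise(volume, index, sigma, rng):
    """Simulate a degraded sequence by adding Gaussian noise to one channel."""
    out = volume.copy()
    out[index] = out[index] + rng.normal(0.0, sigma, size=out[index].shape)
    return out


# Hypothetical scenario table; assumed channel order is (T2W, ADC, high b-value).
SCENARIOS = {
    "ideal": lambda v, rng: v,
    "missing_adc": lambda v, rng: drop_modality(v, 1),
    "noisy_t2w": lambda v, rng: add_gaussian_noise(v, 0, 0.5, rng),
}


def run_benchmark(score_fn, volume, seeds=range(5)):
    """Return {scenario: (mean, std)} of `score_fn` over the given seeds."""
    results = {}
    for name, corrupt in SCENARIOS.items():
        scores = [
            score_fn(corrupt(volume, np.random.default_rng(seed)))
            for seed in seeds
        ]
        results[name] = (float(np.mean(scores)), float(np.std(scores)))
    return results


# Mean intensity stands in for a real segmentation metric in this toy run.
volume = np.ones((3, 4, 4))
results = run_benchmark(lambda v: float(v.mean()), volume)
```

Copying the volume inside each corruption keeps the clean input intact, so every scenario starts from the same data.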
Results
- Among the bare backbones, nnUNet provided the strongest balance of performance and stability.
- MIGF improved the ideal-scenario Ranking Score for various models:
  - UNet: improved by 2.8%
  - nnUNet: improved by 4.6%
  - Mamba: improved by 13.4%
- The best-performing model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved a score of 0.7304 ± 0.056.
Discussion
Mechanistic analysis reveals that the robustness gains from MIGF arise from strict modality isolation and dropout-driven compensation rather than from adaptive per-sample quality routing: the gating mechanism converged to a stable modality prior instead of reweighting modalities for each input. Separately, deep supervision was beneficial primarily for the largest backbone and degraded performance in lighter models.
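One way to tell a stable modality prior apart from per-sample routing is to look at how much the gate weights vary across inputs. The sketch below is an illustrative diagnostic, not the paper's analysis: the function names, the `tol` threshold, and the synthetic gate values are assumptions.

```python
import numpy as np


def gate_statistics(gates):
    """Summarise gate weights collected over a set of samples.

    gates: (samples, modalities), each row summing to 1.
    Returns the mean weight per modality (the "prior") and the standard
    deviation per modality across samples (how much the gate adapts).
    """
    gates = np.asarray(gates, dtype=float)
    return gates.mean(axis=0), gates.std(axis=0)


def behaves_like_prior(gates, tol=0.02):
    """True if every modality's gate weight varies by less than `tol`.

    `tol` is a hypothetical threshold chosen for illustration only.
    """
    _, spread = gate_statistics(gates)
    return bool((spread < tol).all())


rng = np.random.default_rng(0)

# Synthetic case 1: the same weights for every sample (a fixed prior).
stable = np.tile([0.5, 0.3, 0.2], (100, 1))

# Synthetic case 2: weights that change freely from sample to sample.
adaptive = rng.dirichlet([1.0, 1.0, 1.0], size=100)

prior, spread = gate_statistics(stable)
stable_is_prior = behaves_like_prior(stable)
adaptive_is_prior = behaves_like_prior(adaptive)
```

A near-zero spread means the gate applies roughly the same modality weighting regardless of input quality, which is the behaviour the analysis attributes to MIGF.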
Conclusion
The findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation. In the evaluated scenarios, this combination made multi-modal prostate MRI segmentation more resilient to missing and degraded sequences.
