Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection
arXiv:2604.15856v1 (cross-listed)
Abstract
Multimodal remote sensing data provide complementary information for semantic segmentation. However, in real-world deployments, certain modalities may become unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Traditional multimodal segmentation models often address the absence of modalities by learning a shared representation across various inputs. While this can be effective, it may also introduce trade-offs by compromising the modality-specific complementary information, ultimately reducing performance when all modalities are available.
In this paper, we introduce CBC-SLP, a novel multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Drawing inspiration from theoretical results on modality alignment, which suggest that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a structured latent projection approach as an architectural inductive bias. Instead of enforcing this strategy through a loss term, we integrate it directly into the architecture.
Methodology
To leverage complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components. This allows us to adaptively transfer these components to the decoder based on the random modality availability mask.
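The split into shared and modality-specific components, and the mask-based hand-off to the decoder, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear projections, the averaging of shared components, the zero-filling of missing modality-specific slots, and every name here (project_structured, assemble_decoder_input, "optical", "sar") are assumptions chosen for illustration only.

```python
import numpy as np

def project_structured(features, w_shared, w_specific):
    """Split each modality's feature vector into shared and specific parts.

    features:   dict name -> array of shape (d_in,)
    w_shared:   array (d_in, d_shared), reused for all modalities (assumption)
    w_specific: dict name -> array (d_in, d_spec), one projection per modality
    """
    shared = {m: f @ w_shared for m, f in features.items()}
    specific = {m: f @ w_specific[m] for m, f in features.items()}
    return shared, specific

def assemble_decoder_input(shared, specific, available, order):
    """Build the decoder input from the components of available modalities.

    The shared part is averaged over available modalities; each
    modality-specific slot is kept if the modality is present and
    zero-filled otherwise (both choices are assumptions of this sketch).
    """
    present = [m for m in order if available[m]]
    if not present:
        raise ValueError("at least one modality must be available")
    shared_mean = np.mean([shared[m] for m in present], axis=0)
    slots = [specific[m] if available[m] else np.zeros_like(specific[m])
             for m in order]
    return np.concatenate([shared_mean] + slots)

# Hypothetical two-modality setup with random features and weights.
rng = np.random.default_rng(0)
order = ["optical", "sar"]
d_in, d_shared, d_spec = 8, 4, 3
features = {m: rng.normal(size=d_in) for m in order}
w_shared = rng.normal(size=(d_in, d_shared))
w_specific = {m: rng.normal(size=(d_in, d_spec)) for m in order}

shared, specific = project_structured(features, w_shared, w_specific)
z_full = assemble_decoder_input(
    shared, specific, {"optical": True, "sar": True}, order)
z_missing = assemble_decoder_input(
    shared, specific, {"optical": True, "sar": False}, order)
```

In this sketch the decoder input keeps a fixed layout regardless of which modalities are present, so the same decoder can be used under every availability mask; only the contents of the slots change.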
Key Features of CBC-SLP
- Preservation of Information: CBC-SLP is designed to maintain important modality-specific information while also utilizing shared representations.
- Random Modality Dropout Robustness: The model is capable of functioning effectively even when certain modalities are missing, ensuring reliable performance in diverse scenarios.
- Architectural Inductive Bias: Structured latent projection is built directly into the model architecture rather than enforced through a loss term.
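Random modality dropout, as referred to in the list above, is commonly simulated during training by sampling an availability mask per example. The following is a minimal sketch under stated assumptions: independent per-modality dropout with a single probability, and a fallback that restores one modality if all were dropped. The function name and both behaviours are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_availability_mask(num_modalities, drop_prob, rng):
    """Sample a boolean mask; True means the modality is available.

    Each modality is dropped independently with probability drop_prob.
    If every modality was dropped, one is restored at random so the
    model always receives some input (an assumption of this sketch).
    """
    mask = rng.random(num_modalities) >= drop_prob
    if not mask.any():
        mask[rng.integers(num_modalities)] = True
    return mask

# Draw 1000 masks for a hypothetical three-modality setup.
rng = np.random.default_rng(0)
masks = np.stack([sample_availability_mask(3, 0.5, rng) for _ in range(1000)])
```

Guaranteeing at least one available modality avoids training on examples with no input at all, which would carry no usable signal for the segmentation loss.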
Experimental Validation
We conducted extensive experiments across three multimodal remote sensing image datasets. The results consistently demonstrate that CBC-SLP outperforms state-of-the-art multimodal models, both in scenarios where all modalities are available and where some modalities are missing. Furthermore, our empirical findings indicate that the proposed structured latent projection strategy effectively recovers complementary information that may not be adequately preserved in a shared representation.
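Comparing the full-modality and missing-modality settings typically means scoring a model under every non-empty combination of inputs. The sketch below shows one way to enumerate those combinations and score a prediction with mean intersection-over-union (mIoU). The choice of mIoU as the metric and the helper names are assumptions; the summary above does not state which metric or protocol the paper uses.

```python
from itertools import combinations
import numpy as np

def modality_subsets(modalities):
    """Yield every non-empty subset of modalities, largest first."""
    for k in range(len(modalities), 0, -1):
        for subset in combinations(modalities, k):
            yield subset

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Tiny 2x2 label maps with two classes, for illustration only.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)
subsets = list(modality_subsets(["a", "b", "c"]))
```

With three modalities there are seven non-empty subsets, so a full robustness report would contain seven scores per dataset, one of which is the all-modalities case.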
Availability of Code
The code is publicly available in the CBC-SLP GitHub repository.
Conclusion
CBC-SLP addresses multimodal semantic segmentation under missing modalities by keeping shared and modality-specific information in separate latent components and passing them to the decoder according to which modalities are available. On the three remote sensing datasets evaluated, it outperforms state-of-the-art multimodal models both when all modalities are available and when some are missing.
