A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data
arXiv:2603.26726v1
Abstract
We propose AttentionMixer, a multimodal framework for detecting brain edema. It integrates structural head CT (HCT) scans with routine clinical metadata, with the aim of improving diagnostic accuracy. HCT provides spatial information about brain structures, while clinical variables such as age, laboratory values, and scan timing carry complementary context that traditional approaches often overlook or handle inadequately.
Framework Overview
AttentionMixer fuses these data sources in four steps:
- Encoding HCT Volumes: HCT volumes are encoded with a self-supervised Vision Transformer Autoencoder (ViT-AE++). Because the encoder is trained without labels, it reduces the dependence on large annotated datasets.
- Mapping Clinical Metadata: Clinical variables are projected into the same feature space as the imaging features and serve as keys and values in a cross-attention module, while the HCT-derived feature vector serves as the query. This links the imaging features to patient-specific context.
- Cross-Attention Fusion: Cross-attention lets the network adjust imaging features according to the clinical context, which also makes the multimodal integration more interpretable.
- Refinement with MLP-Mixer: A lightweight MLP-Mixer refines the fused representation, modeling global dependencies while keeping parameter overhead low.
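The cross-attention step above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single attention head and one image feature vector used as the query, and it omits the learned query, key, and value projections. The residual addition and all toy values are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query:  image feature vector of length d (stand-in for the HCT encoding)
    keys:   one length-d vector per metadata token
    values: one length-d vector per metadata token
    Returns (fused vector, attention weights over the metadata tokens).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(d)
    ]
    # Assumption: a residual connection, so the imaging feature is
    # adjusted by the clinical context rather than replaced by it.
    fused = [q + c for q, c in zip(query, context)]
    return fused, weights

# Hypothetical toy inputs: a 4-d image feature and three metadata tokens.
image_feature = [1.0, 0.0, 0.5, -0.5]
metadata_tokens = [
    [1.0, 0.0, 0.0, 0.0],  # e.g. age
    [0.0, 1.0, 0.0, 0.0],  # e.g. a laboratory value
    [0.0, 0.0, 1.0, 0.0],  # e.g. scan timing
]
fused, weights = cross_attention(image_feature, metadata_tokens, metadata_tokens)
```

The returned weights form a distribution over the metadata tokens, so they can be inspected to see which clinical variable influenced the fused feature most for a given scan.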
Handling Incomplete Data
AttentionMixer handles missing or incomplete metadata through a learnable embedding. This keeps the framework robust to the incomplete records that are common in real-world clinical data.
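One way a learnable missing-value embedding could work is sketched below. This is an assumption about the mechanism, not the paper's code: each scalar variable is projected to a token, and a dedicated vector replaces the projection when the value is absent. The function name, parameters, and numbers are hypothetical, and in a trained model the missing-value vector would be updated by backpropagation like any other weight.

```python
def embed_metadata(value, weight, bias, missing_embedding):
    """Map one scalar clinical variable to a d-dimensional token.

    value: float, or None when the variable was not recorded.
    weight, bias: projection parameters, each of length d.
    missing_embedding: length-d vector returned instead of the projection
        when the value is missing (learned during training in practice).
    """
    if value is None:
        return list(missing_embedding)
    return [w * value + b for w, b in zip(weight, bias)]

# Hypothetical parameters for a 3-d feature space.
weight = [0.5, -0.25, 1.0]
bias = [0.0, 0.1, -0.1]
missing = [0.01, 0.02, 0.03]

# Hypothetical normalized record with one unrecorded laboratory value.
record = {"age": 0.8, "lab_value": None, "scan_timing": 0.2}
tokens = {
    name: embed_metadata(value, weight, bias, missing)
    for name, value in record.items()
}
```

Because the missing variable still yields a token of the right size, the downstream attention module sees the same number of keys and values for every patient.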
Performance Evaluation
We evaluated AttentionMixer on a curated brain HCT cohort with expert annotations for edema. Under five-fold cross-validation, the framework achieved:
- Accuracy: 87.32%
- Precision: 92.10%
- F1-score: 85.37%
- AUC: 94.14%
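For reference, the reported metrics follow the standard definitions sketched below. The labels, scores, 0.5 threshold, and interleaved fold assignment are all hypothetical; the paper's actual splitting procedure (for example shuffling or stratification) is not described here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = edema)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k=5):
    """Split sample indices into k interleaved folds (no shuffling)."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        splits.append((train_idx, folds[held_out]))
    return splits

# Hypothetical toy data: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

metrics = binary_metrics(y_true, y_pred)
toy_auc = auc(y_true, y_score)
splits = k_fold_indices(len(y_true), k=5)
```

In a cross-validated evaluation, each metric would be computed on the held-out fold of every split and then averaged across the five folds.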
Conclusion
AttentionMixer outperformed strong baselines that used HCT alone, metadata alone, or previous multimodal approaches, demonstrating the benefit of structured and interpretable multimodal fusion. Ablation studies confirmed that both cross-attention and MLP-Mixer refinement contribute to performance, and permutation-based metadata importance analysis indicated that clinically relevant variables drive the predictions. These findings support AttentionMixer as an approach to edema detection and point toward broader integration of multimodal data in medical diagnostics.
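Permutation-based importance, as mentioned above, can be sketched in general form as follows. This is an illustration of the technique rather than the paper's analysis: the toy model, the data, and the choice of accuracy as the scoring metric are assumptions. The idea is to shuffle one metadata column across patients and measure how much the score drops.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(1 for r, y in zip(rows, labels) if model(r) == y) / len(rows)

def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

def toy_model(row):
    """Hypothetical classifier that looks only at column 0."""
    return 1 if row[0] > 0.5 else 0

# Hypothetical data: column 0 determines the label, column 1 is noise.
rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3],
        [0.1, 0.9], [0.7, 0.5], [0.3, 0.2]]
labels = [1, 1, 0, 0, 1, 0]

importance_used = permutation_importance(toy_model, rows, labels, column=0)
importance_ignored = permutation_importance(toy_model, rows, labels, column=1)
```

A variable the model ignores shows no drop when shuffled, while a variable it relies on shows a positive drop, which is what allows metadata variables to be ranked by their contribution to predictions.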
