MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning
arXiv:2604.00514v1
Abstract: Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context.
Introduction to MAESIL
To address the limitations of existing frameworks, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a self-supervised learning framework designed to capture 3D structural information efficiently rather than treating a CT scan as a stack of independent 2D slices.
Core Innovations of MAESIL
The core innovation of MAESIL is the ‘superpatch’, a 3D chunk-based input unit that balances the preservation of 3D context against computational cost. The framework partitions each medical imaging volume into superpatches and applies a 3D masked autoencoder strategy with a dual-masking approach. This lets the model learn the comprehensive spatial representations that accurate medical image interpretation depends on.
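As a rough illustration of the partitioning step, the sketch below splits a CT-like volume into non-overlapping 3D chunks with NumPy. The function name `split_into_superpatches`, the chunk shape, and the requirement that the volume divide evenly are assumptions made for this example; the source does not specify superpatch sizes or how borders are handled.

```python
import numpy as np

def split_into_superpatches(volume, sp_shape):
    """Split a (D, H, W) volume into non-overlapping 3D chunks.

    Hypothetical helper: returns an array of shape
    (n_d, n_h, n_w, sd, sh, sw), where (sd, sh, sw) = sp_shape.
    Assumes every volume dimension is divisible by the chunk size.
    """
    d, h, w = volume.shape
    sd, sh, sw = sp_shape
    if d % sd or h % sh or w % sw:
        raise ValueError("volume shape must be divisible by superpatch shape")
    # Split each axis into (number of chunks, chunk size) ...
    blocks = volume.reshape(d // sd, sd, h // sh, sh, w // sw, sw)
    # ... then move the chunk-index axes to the front.
    return blocks.transpose(0, 2, 4, 1, 3, 5)

# Synthetic stand-in for a CT volume (depth, height, width).
volume = np.arange(64 * 128 * 128, dtype=np.float32).reshape(64, 128, 128)
superpatches = split_into_superpatches(volume, (32, 64, 64))
print(superpatches.shape)  # (2, 2, 2, 32, 64, 64)
```

Each `superpatches[i, j, k]` is a contiguous 3D block of the original volume, so neighbouring slices stay together instead of being processed independently.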
Methodology
- Superpatch Division: The volume is partitioned into 3D superpatches that are small enough to process efficiently yet large enough to keep neighbouring slices together, preserving 3D context.
- 3D Masked Autoencoder Strategy: A dual-masking scheme hides part of each input, and the model learns spatial representations by reconstructing the hidden regions from the visible ones, which pushes it to retain critical structural information.
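The source does not spell out what the two masks in the dual-masking scheme are. The sketch below therefore assumes, purely for illustration, one mask over whole depth positions (axial) and one over in-plane positions, combined with a logical OR over a grid of 3D patches. The function `dual_mask` and both ratios are hypothetical and should not be read as the authors' method.

```python
import numpy as np

def dual_mask(grid_shape, axial_ratio, inplane_ratio, rng):
    """Boolean mask over a (n_d, n_h, n_w) grid of 3D patches.

    True means "masked" (hidden from the encoder). Assumed scheme:
      1. axial mask: hide a random subset of depth positions entirely;
      2. in-plane mask: hide a random subset of (h, w) positions at
         every depth.
    The two masks are combined with logical OR.
    """
    n_d, n_h, n_w = grid_shape

    axial = np.zeros(n_d, dtype=bool)
    n_axial = int(round(axial_ratio * n_d))
    axial[rng.choice(n_d, size=n_axial, replace=False)] = True

    inplane = np.zeros(n_h * n_w, dtype=bool)
    n_inplane = int(round(inplane_ratio * n_h * n_w))
    inplane[rng.choice(n_h * n_w, size=n_inplane, replace=False)] = True
    inplane = inplane.reshape(n_h, n_w)

    # Broadcast both masks to the full 3D grid and combine them.
    return axial[:, None, None] | inplane[None, :, :]

rng = np.random.default_rng(0)
mask = dual_mask((8, 8, 8), axial_ratio=0.25, inplane_ratio=0.5, rng=rng)
visible_fraction = 1.0 - mask.mean()
print(mask.shape, visible_fraction)
```

Under this assumed scheme the visible fraction is (1 - axial_ratio) * (1 - inplane_ratio), so the two ratios jointly control how much of each superpatch the encoder sees. A masked autoencoder would encode only the visible patches and be trained to reconstruct the masked ones.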
Experimental Validation
We validated our approach on three diverse large-scale public CT datasets. The experimental results demonstrate that MAESIL significantly improves on existing methods such as the Autoencoder (AE), Variational Autoencoder (VAE), and Vector Quantized Variational Autoencoder (VQ-VAE) in key reconstruction metrics.
Performance Metrics
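Key performance metrics include:
- Peak Signal-to-Noise Ratio (PSNR): The ratio, in decibels, between the maximum possible signal power and the power of the reconstruction error. Higher values indicate a reconstruction closer to the original.
- Structural Similarity Index (SSIM): An index that compares luminance, contrast, and structure between two images. A value of 1 indicates identical images.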
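For reference, both metrics can be sketched in a few lines of NumPy. PSNR follows the standard definition 10 * log10(MAX^2 / MSE). The SSIM below is a simplified single-window variant computed over the whole array, whereas standard SSIM averages the same expression over local windows, so its values will not match library implementations exactly. The function names, the `data_range` default, and the synthetic data are assumptions for this example, not the evaluation code used in the paper.

```python
import numpy as np

def psnr(reference, reconstruction, data_range=1.0):
    """PSNR in dB: 10 * log10(data_range**2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(data_range ** 2 / mse))

def global_ssim(reference, reconstruction, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using one window that covers the whole array."""
    x = np.asarray(reference, dtype=np.float64)
    y = np.asarray(reconstruction, dtype=np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

# Synthetic volume in [0, 1) and a reconstruction with a constant 0.1 error.
rng = np.random.default_rng(0)
ref = rng.random((16, 32, 32))
rec = ref + 0.1

psnr_db = psnr(ref, rec)            # MSE = 0.01, so about 20 dB
ssim_self = global_ssim(ref, ref)   # identical inputs give 1
ssim_rec = global_ssim(ref, rec)
print(psnr_db, ssim_self, ssim_rec)
```

A constant intensity offset lowers PSNR but leaves variance and covariance unchanged, so the simplified SSIM drops only through its luminance term. This is one reason the two metrics are usually reported together.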
Conclusion
Our findings establish MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks. By leveraging the inherent 3D structure of CT scans instead of treating them as independent 2D slices, MAESIL improves on AE, VAE, and VQ-VAE in PSNR and SSIM. Future work will focus on further refining the framework and exploring its applicability to other medical imaging modalities.
