CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration
arXiv:2604.05689v1
Abstract
We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration.
CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation.
The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning.
To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency.
This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities.
Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness.
Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging.
Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.
Introduction
Cross-modal image registration is a critical task in various fields including medical imaging, remote sensing, and autonomous navigation.
Traditional methods often struggle to align images from different modalities because scale, illumination, and other imaging conditions vary between them.
CRFT addresses these challenges with a transformer-based architecture that jointly performs feature alignment and flow estimation, targeting robust alignment across modalities.
Methodology
The CRFT framework consists of two main stages, a coarse stage and a fine stage, coupled with an iterative refinement mechanism.
Together, these components produce accurate and consistent feature registration:
- Coarse Stage: Establishes global correspondences between images by utilizing multi-scale feature correlation. This allows the model to capture broad patterns and structures across different modalities.
- Fine Stage: Refines the alignment by focusing on local details through hierarchical feature fusion. Adaptive spatial reasoning is applied to enhance the precision of the registration.
- Iterative Discrepancy-Guided Attention: Works with a Spatial Geometric Transform (SGT) to recurrently refine the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency.
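
The coarse-then-refine idea in the list above can be illustrated with a small toy sketch. It is not the CRFT implementation: the learned transformer features, attention, and SGT are replaced here by hand-made feature vectors, a global cosine-correlation argmax (standing in for the coarse stage), and a discrepancy-guided local search repeated over several passes (standing in for the recurrent refinement). All function names (`cosine`, `discrepancy`, `coarse_flow`, `refine_flow`) and the toy data are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def discrepancy(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def coarse_flow(src, tgt):
    """Toy coarse stage: each source cell flows to its most correlated target cell."""
    h, w = len(src), len(src[0])
    cells = [(y, x) for y in range(h) for x in range(w)]
    flow = {}
    for (y, x) in cells:
        ty, tx = max(cells, key=lambda c: cosine(src[y][x], tgt[c[0]][c[1]]))
        flow[(y, x)] = (ty - y, tx - x)
    return flow

def refine_flow(flow, src, tgt, iters=4):
    """Toy recurrent refinement: on each pass, move every flow vector by at most
    one cell toward the neighbouring target cell with the lowest discrepancy."""
    h, w = len(src), len(src[0])

    def cost(y, x, dy, dx):
        ty, tx = y + dy, x + dx
        if 0 <= ty < h and 0 <= tx < w:
            return discrepancy(src[y][x], tgt[ty][tx])
        return math.inf  # flow pointing outside the target grid

    for _ in range(iters):
        updated = {}
        for (y, x), (dy, dx) in flow.items():
            # Current vector is listed first, so ties keep the current estimate.
            candidates = [(dy, dx)] + [
                (dy + a, dx + b)
                for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)
            ]
            updated[(y, x)] = min(candidates, key=lambda c: cost(y, x, c[0], c[1]))
        if updated == flow:  # no vector changed: stop early
            break
        flow = updated
    return flow

# Hypothetical toy data: a 3 x 5 grid of distinct 3-D feature vectors.
H, W = 3, 5
src = [[[y + 1.0, x + 1.0, (y + 1.0) * (x + 1.0)] for x in range(W)] for y in range(H)]
# Target is the source shifted one cell to the right; column 0 gets a filler feature.
filler = [0.0, 0.0, 1.0]
tgt = [[src[y][x - 1] if x >= 1 else filler for x in range(W)] for y in range(H)]

coarse = coarse_flow(src, tgt)
# Perturb the coarse flow by one row to mimic a residual misalignment.
noisy = {cell: (dy + 1, dx) for cell, (dy, dx) in coarse.items()}
refined = refine_flow(noisy, src, tgt)
```

In this toy setup, every source cell whose true match lies inside the target grid (all columns except the last) should end up with the flow vector `(0, 1)`, both after the coarse step and after refining the perturbed flow. Because each pass moves a vector by at most one cell, larger residual errors would need several passes, and a greedy search of this kind can stall where a learned update would not.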
Results
Experiments on diverse cross-modal datasets indicate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness.
The design is reported to align images accurately under large affine and scale variations while maintaining structural coherence across modalities.
Applications
The applicability of CRFT extends beyond image registration. As a generalizable paradigm for multimodal spatial correspondence, it suits a range of applications:
- Remote Sensing: Accurate alignment of satellite images for environmental monitoring and analysis.
- Autonomous Navigation: Enhanced perception systems for vehicles by aligning data from diverse sensors.
- Medical Imaging: Improved integration of images from different modalities, aiding in diagnosis and treatment planning.
Conclusion
The CRFT framework advances cross-modal image registration.
By combining a transformer-based architecture with feature flow learning, it offers a robust approach to aligning images from different modalities.
Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT, which supports further research and application.
