SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
arXiv:2604.12502v1 (cross-listed)
Abstract: Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT’s efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives.
Key Innovations of SEATrack
- Cross-Modal Alignment: SEATrack prioritizes aligning the matching responses of its two modality streams. The authors argue that this alignment is crucial for breaking the trade-off between performance and efficiency.
- AMG-LoRA Integration: Modality-specific biases can produce conflicting attention maps. AMG-LoRA addresses this by combining Low-Rank Adaptation (LoRA) with Adaptive Mutual Guidance (AMG) to dynamically refine and align those attention maps.
- Hierarchical Mixture of Experts (HMoE): Departing from traditional local fusion approaches, we introduce HMoE for efficient global relation modeling. This balances expressiveness and computational efficiency in cross-modal fusion.
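To make the efficiency argument behind LoRA concrete, here is a minimal sketch of a generic LoRA-adapted linear layer in plain Python. It is not SEATrack's AMG-LoRA: the mutual-guidance step is not described in this summary and is omitted. The class name `LoRALinear`, the helper `matvec`, and the toy sizes (64-dimensional features, rank 4) are assumptions chosen for illustration only.

```python
import random


def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration of standard LoRA, not the paper's module.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random here, purely for illustration).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A gets a small random init, B starts at zero,
        # so the adapted layer initially matches the frozen layer exactly.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = [1.0] * 64
y = layer.forward(x)
print(layer.trainable_params(), layer.frozen_params())  # prints: 512 4096
```

In this toy setting only 512 of the 4,608 total weights are trainable. The same ratio argument is what lets PEFT methods keep parameter budgets small, and it is the budget the abstract says recent trackers have been inflating.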
Performance Advancements
With these two components, the authors report significant improvements over state-of-the-art methods in several multimodal tracking tasks, including:
- RGB-T Tracking
- RGB-D Tracking
- RGB-E Tracking
According to the authors, these gains are achieved while balancing performance with efficiency.
Conclusion
SEATrack addresses the balance between performance and efficiency in multimodal tracking. Its two components, AMG-LoRA and HMoE, aim to improve tracking accuracy while upholding the efficiency promise of PEFT, which makes it relevant to researchers and practitioners alike.
For those interested in the technical details and implementation, the authors have made the code available.
