Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
arXiv:2604.05007v1
Abstract: In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP’s superior generalization capability and its robustness across diverse navigation architectures.
Introduction
Audio-Visual Navigation (AVN) asks an agent to locate a sound source in a 3D environment using both visual and auditory cues. Existing methods, however, often generalize poorly to environments and sounds they did not encounter during training.
Challenges in Existing AVN Approaches
- Generalization Issues: Existing methods tend to overfit to their training data, so performance drops in unseen scenarios.
- Semantic Dependence: Agents often rely on semantic sound features (what the sound is) rather than spatial cues (where it comes from), which hurts performance on sounds not heard during training.
- Environment-Specific Training: Policies learned in specific training environments may not transfer to environments with different layouts.
The BDATP Framework
The proposed BDATP framework addresses these challenges with two components, one on the perception side and one on the policy side, which are optimized jointly:
- Binaural Difference Attention (BDA): This module enhances the agent’s spatial orientation by modeling interaural differences, thereby reducing reliance on semantic categories.
- Action Transition Prediction (ATP): This auxiliary task adds an action prediction objective that acts as a regularization term, mitigating overfitting to specific training environments.
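The two ideas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so everything here is an assumption made for illustration. The function names (`binaural_difference_attention`, `total_loss`), the use of an interaural level difference computed from magnitude spectrograms, the softmax over frequency bins, and the weight `atp_weight` are all hypothetical. The sketch only shows the general pattern: derive attention from left/right differences rather than from sound content, and add an auxiliary action-prediction cross-entropy term to the policy loss.

```python
import numpy as np


def interaural_level_difference(left_mag, right_mag, eps=1e-8):
    """Log-magnitude difference between the two ears, per time-frequency bin."""
    return np.log(left_mag + eps) - np.log(right_mag + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def binaural_difference_attention(left_mag, right_mag):
    """Toy attention (assumed form): emphasize frequency bins where the ears disagree most.

    left_mag, right_mag: magnitude spectrograms of shape (freq, time).
    Returns the reweighted stereo features, the attention weights, and the ILD map.
    """
    ild = interaural_level_difference(left_mag, right_mag)  # (freq, time)
    scores = np.abs(ild).mean(axis=1)                       # (freq,)
    weights = softmax(scores)                               # sums to 1 over freq
    stacked = np.stack([left_mag, right_mag], axis=0)       # (2, freq, time)
    attended = stacked * weights[None, :, None]             # broadcast over channel and time
    return attended, weights, ild


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]


def total_loss(policy_loss, atp_logits, action_taken, atp_weight=0.1):
    """Policy loss plus a weighted auxiliary action-prediction term (assumed form)."""
    return policy_loss + atp_weight * cross_entropy(atp_logits, action_taken)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.random((8, 5)) + 1.0   # synthetic spectrogram, louder in the left ear
    right = 0.5 * left
    attended, weights, ild = binaural_difference_attention(left, right)
    print(attended.shape, float(weights.sum()))
    print(total_loss(1.0, np.zeros(4), action_taken=2))
```

In this toy setup the attention weights depend only on the difference between channels, so two sounds with different content but the same direction-dependent level difference would be weighted alike, which is the intuition behind reducing reliance on semantic categories. The auxiliary term is a regularizer in the sense that it shares the representation with the policy while being trained on a separate prediction target.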
Experimental Results
Extensive experiments conducted on the Replica and Matterport3D datasets reveal that the BDATP framework can be effectively integrated into various mainstream baselines. The results indicate:
- An absolute Success Rate improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds.
- State-of-the-art Success Rates achieved across most tested settings.
- Consistent gains when BDATP is added to different baseline architectures, indicating that the improvement is not tied to one navigation model.
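Because the headline result is reported as an absolute gain in percentage points, it helps to separate that from a relative gain. The short sketch below uses made-up episode outcomes, not numbers from the paper, and the helper names are hypothetical. It treats Success Rate as the percentage of episodes in which the agent succeeds.

```python
def success_rate(outcomes):
    """Percentage of successful episodes; outcomes is a list of 0/1 flags."""
    return 100.0 * sum(outcomes) / len(outcomes)


def absolute_gain(baseline_sr, improved_sr):
    """Difference in percentage points."""
    return improved_sr - baseline_sr


def relative_gain(baseline_sr, improved_sr):
    """Improvement as a percentage of the baseline value."""
    return 100.0 * (improved_sr - baseline_sr) / baseline_sr


if __name__ == "__main__":
    # Hypothetical outcomes over 20 evaluation episodes (illustrative only).
    baseline = [1] * 8 + [0] * 12    # 8 of 20 episodes succeed
    improved = [1] * 11 + [0] * 9    # 11 of 20 episodes succeed
    base_sr = success_rate(baseline)
    new_sr = success_rate(improved)
    print(base_sr, new_sr)
    print(absolute_gain(base_sr, new_sr), relative_gain(base_sr, new_sr))
```

With these illustrative numbers the absolute gain is 15 percentage points while the relative gain is 37.5 percent, so the two ways of reporting the same result can differ substantially.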
Conclusion
BDATP targets two sources of poor generalization in Audio-Visual Navigation: reliance on semantic sound features, addressed by BDA, and overfitting to training environments, addressed by ATP. Because it can be integrated into existing baselines and improves Success Rates on both Replica and Matterport3D, it is a promising starting point for further work on navigation agents that generalize to unseen environments and unheard sounds.
