Beyond Textual Knowledge: Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation
arXiv:2603.26859v1 (cross-listed)
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases.
Introduction
The challenge of Vision-and-Language Navigation lies in interpreting natural language instructions and translating them into navigational actions in physical environments. Existing methods rely primarily on textual knowledge and neglect the rich contextual information that visual knowledge can provide. This article introduces a novel framework, BTK, which combines both textual and visual knowledge to improve the performance of VLN systems.
Methodology
BTK employs Qwen3-4B to extract goal-related phrases from the instructions, making the intended target of each instruction explicit to the agent. The framework utilizes Flux-Schnell to construct two extensive image knowledge bases:
- R2R-GP: An image knowledge base built for the R2R benchmark, which covers navigation based on natural language instructions.
- REVERIE-GP: An image knowledge base built for the REVERIE benchmark, which covers instruction-following tasks in complex environments.
Additionally, we leverage BLIP-2 to create a large-scale textual knowledge base sourced from panoramic views. This textual knowledge base provides environment-specific semantic cues that are crucial for effective navigation.
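To make the three knowledge sources concrete, the following is a minimal sketch of how they might be organized for lookup. It assumes the goal phrases, images, and captions have already been produced offline; it makes no calls to Qwen3-4B, Flux-Schnell, or BLIP-2. The class name `KnowledgeBases`, its fields and methods, the file path, and the viewpoint identifier are all hypothetical and are not taken from the BTK code.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBases:
    """Hypothetical container for the image and textual knowledge bases."""

    # Assumed layout: goal phrase -> paths of images generated for that phrase.
    image_kb: dict[str, list[str]] = field(default_factory=dict)
    # Assumed layout: viewpoint id -> captions derived from its panorama.
    text_kb: dict[str, list[str]] = field(default_factory=dict)

    @staticmethod
    def _normalize(phrase: str) -> str:
        # Case- and whitespace-insensitive key so lookups are robust.
        return " ".join(phrase.lower().split())

    def add_goal_images(self, phrase: str, paths: list[str]) -> None:
        self.image_kb.setdefault(self._normalize(phrase), []).extend(paths)

    def add_captions(self, viewpoint_id: str, captions: list[str]) -> None:
        self.text_kb.setdefault(viewpoint_id, []).extend(captions)

    def images_for(self, phrase: str) -> list[str]:
        # Returns a copy; an unknown phrase yields an empty list.
        return list(self.image_kb.get(self._normalize(phrase), []))

    def captions_for(self, viewpoint_id: str) -> list[str]:
        return list(self.text_kb.get(viewpoint_id, []))


# Illustrative entries only; the path and viewpoint id are made up.
kb = KnowledgeBases()
kb.add_goal_images("The  Blue Sofa", ["r2r_gp/blue_sofa_0.png"])
kb.add_captions("vp_0001", ["a living room with a blue sofa"])
print(kb.images_for("the blue sofa"))
print(kb.captions_for("vp_0001"))
```

Keying the image knowledge base by goal phrase and the textual knowledge base by viewpoint is one plausible design choice here: the former is queried from the instruction side, the latter from the observation side.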
Integration of Knowledge Bases
The multimodal knowledge bases are integrated via:
- Goal-Aware Augmentor: Enhances the understanding of goal-related instructions.
- Knowledge Augmentor: Improves semantic grounding by aligning textual and visual data.
This dual-augmentation approach significantly enhances the agent’s ability to interpret and act upon complex instructions, leading to better navigational outcomes.
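The summary above does not describe the internals of either augmentor. One common pattern for knowledge augmentation is to retrieve the most similar knowledge embeddings and fuse them into the original feature, and the sketch below illustrates that pattern only. The cosine-similarity top-k retrieval, the residual mean fusion, and the names `retrieve_top_k`, `augment`, and `weight` are assumptions chosen for illustration, not BTK's actual design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, entries, k=2):
    """Return the k knowledge embeddings most similar to the query."""
    ranked = sorted(entries, key=lambda e: cosine(query, e), reverse=True)
    return ranked[:k]


def augment(feature, retrieved, weight=0.5):
    """Residual fusion: add the weighted mean of the retrieved embeddings."""
    if not retrieved:
        return list(feature)
    mean = [sum(col) / len(retrieved) for col in zip(*retrieved)]
    return [f + weight * m for f, m in zip(feature, mean)]


# Toy 2-D embeddings standing in for instruction and knowledge features.
query = [1.0, 0.0]
knowledge = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = retrieve_top_k(query, knowledge, k=2)
fused = augment(query, top, weight=0.5)
print(top)
print(fused)
```

In a trained model the fusion step would typically be learned (for example with attention) rather than a fixed weighted mean; the fixed weight here only keeps the example self-contained.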
Results
Extensive experiments were conducted on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions. The results demonstrate that BTK significantly outperforms existing baselines:
- On the test unseen splits of R2R, the Success Rate (SR) increased by 5%.
- On the REVERIE dataset, SR increased by 2.07%.
- Success weighted by Path Length (SPL) increased by 4% on R2R and 3.69% on REVERIE.
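For readers unfamiliar with the two metrics, SR is the fraction of episodes that end in success, and SPL (Success weighted by Path Length) additionally scales each success by the ratio of the shortest-path length to the length of the path actually taken. The sketch below computes both under their standard definitions; the function names and the toy episode values are illustrative and are not taken from the BTK evaluation code.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent succeeded."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path the agent took.
    Failed episodes contribute 0. Assumes every shortest-path length is
    positive.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        if s:
            total += l / max(p, l)
    return total / len(successes)


# Three toy episodes: an optimal success, a success with a path twice
# as long as necessary, and a failure.
successes = [True, True, False]
shortest = [10.0, 10.0, 8.0]
taken = [10.0, 20.0, 8.0]
sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
print(sr_value, spl_value)
```

Because SPL penalizes detours, an agent can raise SR without raising SPL if its extra successes come from long exploratory paths, which is why both numbers are reported.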
Conclusion
The Beyond Textual Knowledge framework represents a significant advancement in Vision-and-Language Navigation. By effectively integrating multimodal knowledge bases, BTK not only improves the agent’s understanding of navigation tasks but also sets a new benchmark for performance in this domain. The source code for BTK is publicly available at https://github.com/yds3/IPM-BTK/.
