UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
arXiv:2604.14967v2
Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning.
UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions.
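The coarse-to-fine progression described above can be pictured as a small state machine over actions. The following is a minimal sketch, not the authors' implementation: the action names (`RETRIEVE`, `SELECT`, `CROP`, `ANSWER`) and the transition table `ALLOWED_NEXT` are assumptions made for illustration, since the abstract does not specify the exact action vocabulary or ordering constraints.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names; the abstract only describes the levels.
    RETRIEVE = "retrieve"  # coarse: fetch candidate documents
    SELECT = "select"      # finer: pick relevant images among candidates
    CROP = "crop"          # finest: zoom into a region of a selected image
    ANSWER = "answer"      # terminate with a final answer


# Assumed transition rules: the agent must retrieve before selecting,
# select before cropping, and may return to retrieval at any point.
ALLOWED_NEXT = {
    None: {Action.RETRIEVE},
    Action.RETRIEVE: {Action.RETRIEVE, Action.SELECT},
    Action.SELECT: {Action.RETRIEVE, Action.CROP, Action.ANSWER},
    Action.CROP: {Action.RETRIEVE, Action.CROP, Action.ANSWER},
    Action.ANSWER: set(),
}


def is_valid_trajectory(actions):
    """Return True if the action sequence respects the assumed hierarchy
    and ends with ANSWER."""
    prev = None
    for action in actions:
        if action not in ALLOWED_NEXT[prev]:
            return False
        prev = action
    return prev is Action.ANSWER
```

Under these assumed rules, a trajectory such as retrieve, select, crop, answer is accepted, while one that crops before any image has been selected is rejected.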
Key Features of UniDoc-RL
- Hierarchical Action Space: Actions are organized from coarse to fine (document retrieval, then image selection, then active region cropping), so the model can discard irrelevant content and focus on information-dense regions.
- Dense Multi-Reward Scheme: Each action receives its own task-aware supervision signal rather than the trajectory relying on a single generic reward.
- Group Relative Policy Optimization (GRPO): UniDoc-RL aligns agent behavior with multiple objectives without the need for a separate value network.
- Comprehensive Training Dataset: The framework is supported by a carefully curated dataset of high-quality reasoning trajectories with fine-grained action annotations.
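To make the reward and optimization features above more concrete, here is a minimal sketch of how per-action rewards might be combined and then turned into group-relative advantages. The normalization in `group_relative_advantages` follows the standard GRPO formulation (each rollout's reward is standardized against the other rollouts sampled for the same query, which is why no value network is needed). The reward component names and weights are hypothetical; the abstract does not state how UniDoc-RL weights its individual rewards.

```python
import statistics


def total_reward(components, weights):
    """Weighted sum of per-action reward components.

    The component names and weights are illustrative assumptions,
    not values taken from the paper.
    """
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: standardize each rollout's reward
    against the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of three rollouts for one query.
weights = {"retrieval": 0.5, "crop": 0.5, "answer": 1.0}
rollouts = [
    {"retrieval": 1.0, "crop": 1.0, "answer": 1.0},
    {"retrieval": 1.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "crop": 0.0, "answer": 0.0},
]
rewards = [total_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones; if every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.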
Performance and Results
Experiments on three benchmarks show that UniDoc-RL consistently surpasses state-of-the-art baselines, with improvements of up to 17.7% over previous reinforcement learning-based methods. The results indicate that integrating fine-grained visual semantics into both retrieval and reasoning improves the overall performance of LVLMs.
Conclusion
UniDoc-RL treats visual information acquisition as a sequential decision-making problem in which a single LVLM agent handles retrieval, reranking, active visual perception, and reasoning. Its coarse-to-fine action hierarchy and dense per-action rewards target the main weakness of existing visual RAG systems, namely their reliance on generic retrieval signals that miss fine-grained visual semantics. The reported gains over prior reinforcement learning-based methods suggest that this approach is a promising direction for future research on visual understanding in AI systems.
