UniDoc-RL: Advanced Visual RAG with Hierarchical Actions


UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Source: arXiv:2604.14967v2

Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning.

UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions.
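The coarse-to-fine refinement described above can be pictured with a toy sketch. This is not the paper's implementation: the function names (`retrieve`, `select`, `crop`), the `Evidence` container, and the use of nested lists as images are all hypothetical stand-ins chosen to keep the example self-contained. In the actual system an LVLM agent would choose these actions; here they are plain functions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Evidence:
    """Visual evidence the agent currently holds (hypothetical container)."""
    pages: List[Image] = field(default_factory=list)


def retrieve(corpus: List[Image], scores: List[float], k: int) -> Evidence:
    """Coarse step: keep the k highest-scoring pages from the corpus."""
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return Evidence([corpus[i] for i in ranked[:k]])


def select(evidence: Evidence, keep: List[int]) -> Evidence:
    """Finer step: keep only the pages judged relevant."""
    return Evidence([evidence.pages[i] for i in keep])


def crop(evidence: Evidence, page: int, box: Tuple[int, int, int, int]) -> Evidence:
    """Finest step: keep one region (top, left, bottom, right) of one page."""
    top, left, bottom, right = box
    region = [row[left:right] for row in evidence.pages[page][top:bottom]]
    return Evidence([region])
```

Each call returns a smaller `Evidence` than the one before it, which mirrors the idea of suppressing irrelevant content and keeping only information-dense regions before the reasoning step.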

Key Features of UniDoc-RL

  • Hierarchical Action Space: Actions are ordered from coarse to fine (document retrieval, then image selection, then region cropping), so each step narrows the visual evidence the model reasons over.
  • Dense Multi-Reward Scheme: Each action receives its own task-aware reward signal.
  • Group Relative Policy Optimization (GRPO): The policy is aligned with multiple objectives without a separate value network.
  • Comprehensive Training Dataset: Training draws on a curated dataset of high-quality reasoning trajectories with fine-grained action annotations.
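To make the reward and GRPO items above more concrete, here is a minimal sketch under stated assumptions. The reward component names, the equal weights, and the weighted-sum combination are hypothetical, since the summary does not specify them. The second function shows the group-relative advantage commonly used in GRPO: each rollout's reward is standardized against the other rollouts sampled for the same query, so the group itself serves as the baseline. Using the population standard deviation here is an implementation choice, not something the source states.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical reward components and weights; the source does not give them.
WEIGHTS: Dict[str, float] = {
    "retrieval": 0.25,
    "selection": 0.25,
    "crop": 0.25,
    "answer": 0.25,
}


def total_reward(components: Dict[str, float]) -> float:
    """Combine per-action rewards into one scalar (weighted sum is an assumption)."""
    return sum(WEIGHTS[name] * value for name, value in components.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each rollout's reward against its group's mean and std.

    The group statistics act as the baseline, so no value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that score above their group's mean get a positive advantage and are reinforced; those below the mean get a negative one. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.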

Performance and Results

Experiments on three benchmarks show that UniDoc-RL consistently surpasses state-of-the-art baselines, with gains of up to 17.7% over previous reinforcement learning-based methods. The results suggest that folding retrieval, reranking, and active visual perception into a single learned policy helps the LVLM make use of fine-grained visual semantics during reasoning.

Conclusion

UniDoc-RL addresses a limitation of existing visual RAG systems, their reliance on generic retrieval signals, by treating visual information acquisition as a sequence of learned decisions. Retrieval, reranking, region cropping, and reasoning are trained together under one reinforcement learning framework, and the reported benchmark gains suggest this is a promising direction for future work on visual understanding in LVLMs.


Author: Lazarus Omolua (https://richlyai.com/blog)