Visually-Guided Policy Optimization for Multimodal Reasoning
arXiv:2604.09349v1
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning capabilities of vision-language models (VLMs). However, VLMs remain text-centric, which often results in inadequate visual fidelity: the models pay limited attention to visual tokens, and this gap degrades their overall performance.
Our empirical analysis identifies a further problem, temporal visual forgetting, in which the model's focus on its visual input weakens over successive reasoning steps and compounds the lack of visual engagement. To address this, we introduce Visually-Guided Policy Optimization (VGPO).
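One way to make temporal visual forgetting concrete is to track how much attention mass each generated token places on visual positions, then average that share per reasoning step. The sketch below is illustrative only: the function names, the toy attention rows, and the assignment of tokens to steps are assumptions, not the measurement protocol used by the authors.

```python
from typing import List, Sequence, Set


def visual_attention_share(
    attn_rows: Sequence[Sequence[float]], visual_positions: Set[int]
) -> List[float]:
    """For each generated token, return the fraction of its attention mass
    that falls on visual-token positions in the context."""
    shares = []
    for row in attn_rows:
        total = sum(row)
        visual = sum(w for j, w in enumerate(row) if j in visual_positions)
        shares.append(visual / total if total > 0 else 0.0)
    return shares


def mean_share_per_step(
    shares: Sequence[float], step_ids: Sequence[int]
) -> List[float]:
    """Average token-level shares within each reasoning step (ids 0..K-1)."""
    n_steps = max(step_ids) + 1
    sums = [0.0] * n_steps
    counts = [0] * n_steps
    for share, step in zip(shares, step_ids):
        sums[step] += share
        counts[step] += 1
    return [sums[k] / counts[k] if counts[k] else 0.0 for k in range(n_steps)]


# Toy data (hypothetical): context positions 0 and 1 are visual tokens,
# position 2 is text. Four generated tokens span two reasoning steps.
attn = [
    [0.40, 0.20, 0.40],
    [0.30, 0.10, 0.60],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
]
shares = visual_attention_share(attn, visual_positions={0, 1})
per_step = mean_share_per_step(shares, step_ids=[0, 0, 1, 1])
print(per_step)
```

In this toy example the per-step share drops from the first step to the second, which is the kind of decline the term "temporal visual forgetting" describes.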
Overview of Visually-Guided Policy Optimization (VGPO)
VGPO is designed to enhance visual focus throughout the policy optimization process. The framework incorporates a two-pronged approach aimed at reinforcing visual engagement in reasoning tasks.
- Visual Attention Compensation: This mechanism uses visual similarity to identify and amplify crucial visual cues, keeping the model focused on relevant visual information.
- Elevated Visual Expectations: As reasoning progresses, VGPO systematically raises visual expectations to mitigate the effects of visual forgetting, fostering a more robust interaction between visual inputs and reasoning processes.
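The two elements above can be sketched as a per-token visual score. The summary does not give the actual formulas, so everything here is an assumption: visual similarity is taken to be the maximum cosine similarity between a token's hidden state and the visual token embeddings, and "elevated expectations" is read as a compensation factor that grows linearly with position in the trajectory, controlled by a hypothetical parameter `gamma`.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def visual_activation(
    token_vec: Sequence[float], visual_vecs: Sequence[Sequence[float]]
) -> float:
    """Assumed proxy: best match between a token state and any visual token,
    clipped at zero. Requires a non-empty list of visual vectors."""
    return max(0.0, max(cosine(token_vec, v) for v in visual_vecs))


def compensated_scores(
    token_vecs: Sequence[Sequence[float]],
    visual_vecs: Sequence[Sequence[float]],
    gamma: float = 0.5,
) -> List[float]:
    """Scale each token's visual activation by (1 + gamma * progress), where
    progress runs from 0 at the first token to 1 at the last (assumption)."""
    n = len(token_vecs)
    scores = []
    for t, vec in enumerate(token_vecs):
        progress = t / (n - 1) if n > 1 else 0.0
        scores.append(visual_activation(vec, visual_vecs) * (1.0 + gamma * progress))
    return scores


# Toy 2-D embeddings (hypothetical values).
visual_vecs = [[1.0, 0.0], [0.0, 1.0]]
token_vecs = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
scores = compensated_scores(token_vecs, visual_vecs, gamma=0.5)
print(scores)
```

The first and last tokens have identical raw similarity, but the last one receives a larger score because it appears later in the trajectory. A linear schedule is only one possible choice.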
Advantage Re-weighting Strategy
Building on the foundational elements of VGPO, we have implemented a dual-grained advantage re-weighting strategy to further enhance performance:
- Intra-Trajectory Level: This component of the strategy emphasizes tokens that demonstrate relatively high visual activation within a given trajectory. By focusing on these tokens, VGPO ensures that the reasoning process is more aligned with visual data.
- Inter-Trajectory Level: This aspect prioritizes trajectories that exhibit superior visual accumulation. By doing so, VGPO maximizes the overall visual engagement across multiple reasoning paths, leading to improved model performance.
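A minimal sketch of the dual-grained re-weighting follows, assuming each trajectory in a sampled group carries one scalar advantage and one visual score per token. The weighting formulas and the mixing coefficients `alpha` and `beta` are hypothetical and are not taken from the paper. The intra-trajectory weight compares a token's score with its trajectory mean, and the inter-trajectory weight compares the trajectory mean with the group mean.

```python
from typing import List, Sequence


def reweight_advantages(
    advantages: Sequence[float],
    visual_scores: Sequence[Sequence[float]],
    alpha: float = 0.5,
    beta: float = 0.5,
    eps: float = 1e-8,
) -> List[List[float]]:
    """Return token-level advantages for a group of trajectories.

    advantages[i]       scalar advantage of trajectory i (assumed given)
    visual_scores[i][t] visual score of token t in trajectory i
    alpha, beta         mixing strengths in [0, 1] (hypothetical parameters)
    """
    # Visual accumulation per trajectory, taken here as the mean token score.
    traj_means = [sum(s) / len(s) for s in visual_scores]
    group_mean = sum(traj_means) / len(traj_means)

    reweighted = []
    for adv, scores, traj_mean in zip(advantages, visual_scores, traj_means):
        # Inter-trajectory: favor trajectories with above-average accumulation.
        inter = 1.0 + beta * (traj_mean / (group_mean + eps) - 1.0)
        row = []
        for s in scores:
            # Intra-trajectory: favor tokens above their trajectory's mean.
            intra = 1.0 + alpha * (s / (traj_mean + eps) - 1.0)
            row.append(adv * inter * intra)
        reweighted.append(row)
    return reweighted


# Toy group of two trajectories with two tokens each (hypothetical values).
advantages = [1.0, -0.5]
visual_scores = [[0.2, 0.6], [0.1, 0.3]]
token_advantages = reweight_advantages(advantages, visual_scores)
print(token_advantages)
```

With `alpha` and `beta` in [0, 1] and non-negative scores, both weights stay non-negative, so the sign of each advantage is preserved. The sketch scales magnitudes regardless of sign; how negative advantages are handled is not specified in this summary.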
Experimental Validation
We conducted extensive experiments to validate VGPO. The results indicate that VGPO achieves higher visual activation and outperforms existing models on multimodal mathematical reasoning and other visually dependent tasks.
The improvements in both visual focus and reasoning suggest that VGPO addresses the text-centric limitations of VLMs and supports more accurate and reliable reasoning in multimodal scenarios.
Conclusion
In conclusion, VGPO offers a promising approach to the challenges that vision-language models face in multimodal reasoning. By reinforcing visual engagement during policy optimization, through attention compensation and visually guided advantage re-weighting, it provides a basis for future research and applications.
