Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
arXiv:2604.01840v2
Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning.
Introduction
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities on reasoning tasks that require understanding both visual and textual information. However, the reinforcement learning methods commonly used to train them assign credit too coarsely, which limits how well they learn from visual evidence.
The Challenge
The primary challenge with existing frameworks is that they assign the same advantage to every token in a generated response. This dilutes the learning signal: tokens that carry the critical, visually-grounded steps of reasoning receive no more credit than tokens that follow from linguistic priors alone.
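To make the problem concrete, here is a minimal sketch of uniform credit assignment. The group normalization is an assumption in the style of GRPO-like methods, and the function name and arguments are hypothetical; the source only states that identical advantages are distributed across all generated tokens.

```python
from statistics import mean, pstdev

def uniform_token_advantages(rewards, token_counts, eps=1e-6):
    """Broadcast one sequence-level advantage to every token of each response.

    rewards: one scalar verifiable reward per sampled response in a group.
    token_counts: number of generated tokens in each response.
    The group normalization is an assumption (GRPO-style), not from the source.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response gets the same value, whether or not
    # it depended on the image.
    return [[a] * n for a, n in zip(advantages, token_counts)]

group = uniform_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
```

Each inner list in `group` holds a single repeated value, which is exactly the behavior the token-level reweighting is meant to replace.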
Introducing Token Visual Dependency
To address this issue, we present the concept of Token Visual Dependency. This metric quantifies the causal information gain from visual inputs, leveraging the Kullback-Leibler (KL) divergence to compare visual-conditioned predictive distributions against text-only distributions. Our findings reveal that this dependency is not only sparse but also semantically significant.
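A rough sketch of how such a score could be computed per token is shown below. It assumes access to next-token logits from two forward passes, one with the image and one without, and it assumes the direction KL(visual || text-only); the source does not specify the direction or any implementation detail, and all names here are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text) over the vocabulary.

    Both arguments are lists of per-position logit lists with matching shapes.
    The KL direction is an assumption, not taken from the source.
    """
    scores = []
    for lv, lt in zip(logits_with_image, logits_text_only):
        log_p = log_softmax(lv)
        log_q = log_softmax(lt)
        kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        scores.append(max(kl, 0.0))  # clamp tiny negative rounding errors
    return scores

# Position 0: the image shifts the prediction. Position 1: it does not.
scores = token_visual_dependency(
    [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
```

In this toy input, only the first position gets a positive score, since the second position's distribution is unchanged by the image.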
Perception-Grounded Policy Optimization (PGPO)
Building on the insights gained from Token Visual Dependency, we introduce a novel framework called Perception-Grounded Policy Optimization (PGPO). This fine-grained credit assignment mechanism dynamically adjusts the advantages at the token level. By employing a threshold-gated, mass-conserving approach, PGPO enhances learning signals for tokens that are visually dependent while mitigating gradient noise originating from linguistic priors.
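The source does not give the exact reweighting formula, so the following is only an illustrative sketch of what "threshold-gated" and "mass-conserving" could mean together. The parameters `tau` and `boost`, the linear weighting, and the function name are all hypothetical choices.

```python
def reweight_advantages(advantage, dependency, tau=0.1, boost=1.0):
    """Threshold-gated, mass-conserving token-level advantages (illustrative).

    advantage: scalar sequence-level advantage for one response.
    dependency: per-token visual dependency scores (non-negative).
    tau, boost: hypothetical hyperparameters, not from the source.
    """
    n = len(dependency)
    if n == 0:
        return []
    # Gate: only tokens whose dependency exceeds tau receive extra weight.
    raw = [1.0 + boost * d if d > tau else 1.0 for d in dependency]
    # Mass conservation: rescale so the weights average to 1, which keeps
    # the summed advantage over the response equal to n * advantage.
    scale = n / sum(raw)
    return [advantage * w * scale for w in raw]

token_adv = reweight_advantages(2.0, [0.0, 0.5, 0.05, 1.0])
```

Under these assumptions, tokens above the threshold end up with more than the original advantage and the rest with less, while the total is unchanged. When no token passes the gate, the result falls back to the uniform advantage.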
Experimental Validation
We conducted extensive experiments using the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks. The results demonstrate that PGPO significantly boosts model performance, achieving an average increase of 18.7%. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and serves as a robust regularizer for perception-grounded multimodal reasoning.
Conclusion
The advancements introduced through Perception-Grounded Policy Optimization represent a significant step forward in the optimization of Large Vision-Language Models. By focusing on the unique visual dependencies of tokens, we can ensure more effective learning and enhanced performance in multimodal reasoning tasks.
Code Availability
The code for PGPO will be made available on GitHub.
