Perception-Grounded Policy Optimization for Vision-Language Models

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

arXiv:2604.01840v2

Abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning.

Introduction

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities on reasoning tasks that require understanding both visual and textual information, and Reinforcement Learning from Verifiable Rewards (RLVR) has pushed that reasoning further. However, prevailing RLVR frameworks assign the same advantage to every generated token, which weakens the learning signal for the steps that actually depend on the image.

The Challenge

The primary challenge with existing frameworks is that they assign an identical advantage to every token in a generated response. This dilutes the learning signal for the few visually grounded reasoning steps that matter most for multimodal understanding, because those tokens are weighted no more heavily than tokens the model could predict from language alone.
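To make the uniform assignment concrete, here is a minimal sketch. It assumes a GRPO-style setup in which each sampled response gets one group-normalized advantage; that normalization, the function name `uniform_token_advantages`, and the toy rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def uniform_token_advantages(rewards, lengths):
    """Broadcast one sequence-level advantage to every token of its response.

    rewards: one verifiable reward per sampled response in a group
    lengths: number of generated tokens in each response
    """
    rewards = np.asarray(rewards, dtype=float)
    # Assumption: group-normalized advantage (GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Every token in response i receives the same scalar adv[i].
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group of four responses with binary rewards.
token_adv = uniform_token_advantages(rewards=[1.0, 0.0, 1.0, 0.0],
                                     lengths=[5, 3, 4, 6])
```

In this sketch a token that reads a value off the image and a token that merely completes a common phrase receive exactly the same weight, which is the dilution described above.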

Introducing Token Visual Dependency

To address this issue, we present the concept of Token Visual Dependency. For each generated token, this metric quantifies the causal information gain from the visual input, using the Kullback-Leibler (KL) divergence between the model's predictive distribution conditioned on the image and its text-only predictive distribution. Our findings reveal that this dependency is both sparse and semantically significant.
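The metric described above can be sketched as a per-token KL divergence over the vocabulary. The code below is an illustration under assumptions: it takes the direction KL(visual-conditioned || text-only), it assumes the two sets of logits come from running the same model with and without the image, and the function names and toy logits are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text); inputs of shape (T, V), output (T,)."""
    logp_v = log_softmax(np.asarray(logits_with_image, dtype=float))
    logp_t = log_softmax(np.asarray(logits_text_only, dtype=float))
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
    return (np.exp(logp_v) * (logp_v - logp_t)).sum(axis=-1)

# Toy example: 3 generated tokens, vocabulary of 4.
logits_img = np.array([[2.0, 0.0, 0.0, 0.0],   # image changes nothing
                       [0.0, 3.0, 0.0, 0.0],   # image sharpens the prediction
                       [1.0, 1.0, 1.0, 1.0]])
logits_txt = np.array([[2.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0],
                       [1.0, 1.0, 1.0, 1.5]])
tvd = token_visual_dependency(logits_img, logits_txt)
```

A score near zero means the image did not change the prediction for that token; a large score marks a token whose prediction depends heavily on the visual input.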

Perception-Grounded Policy Optimization (PGPO)

Building on the insights gained from Token Visual Dependency, we introduce a novel framework called Perception-Grounded Policy Optimization (PGPO). This fine-grained credit assignment mechanism dynamically adjusts the advantages at the token level. By employing a threshold-gated, mass-conserving approach, PGPO enhances learning signals for tokens that are visually dependent while mitigating gradient noise originating from linguistic priors.
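The text above does not give the exact reweighting rule, so the following is only one possible reading of "threshold-gated, mass-conserving". It assumes a hard gate at a hypothetical threshold `tau`, a hypothetical constant `boost` for tokens above it, and a rescaling that keeps the mean token weight at 1 so that the total advantage assigned to a response is unchanged. None of these names or values come from the paper.

```python
import numpy as np

def reweight_advantages(advantage, tvd, tau=0.1, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving token reweighting.

    advantage: scalar sequence-level advantage for one response
    tvd: per-token visual-dependency scores, shape (T,)
    tau, boost: illustrative hyperparameters (assumptions)
    """
    tvd = np.asarray(tvd, dtype=float)
    # Gate: tokens whose score exceeds tau get a larger raw weight.
    raw = np.where(tvd > tau, boost, 1.0)
    # Mass conservation: rescale so the weights sum to T (mean weight 1).
    weights = raw * (len(tvd) / raw.sum())
    return advantage * weights

# One visually dependent token out of four.
token_adv = reweight_advantages(1.0, [0.0, 0.5, 0.02, 0.0])
# With no token above the gate, the result reduces to the uniform case.
uniform_adv = reweight_advantages(-0.5, [0.0, 0.0, 0.0])
```

Under this reading, raising the weight of visually dependent tokens necessarily lowers the weight of the rest, which matches the stated goal of strengthening the visual signal while damping gradient noise from linguistic priors.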

Experimental Validation

We conducted extensive experiments using the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks. The results demonstrate that PGPO significantly boosts model performance, achieving an average increase of 18.7%. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and serves as a robust regularizer for perception-grounded multimodal reasoning.

Conclusion

Perception-Grounded Policy Optimization represents a significant step forward in the optimization of Large Vision-Language Models. By concentrating the learning signal on the tokens that depend on visual input, PGPO makes training more effective and improves performance on multimodal reasoning tasks.

Code Availability

The code for implementing PGPO will be made available on GitHub.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
