Optimizing Rewards for Physical Reasoning in Vision-Language Models

Reward Design for Physical Reasoning in Vision-Language Models

Summary of arXiv:2604.13993v1

Abstract

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision-Language Models (VLMs) fall far short of human performance on physics benchmarks. Post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have produced strong reasoning gains in language models, but how reward design shapes VLM physical reasoning behavior remains poorly understood.

Research Overview

We present a systematic reward ablation for GRPO-based VLM training on physical reasoning. The aim is to compare four reward signals of increasing semantic richness:

Format compliance
Answer accuracy
A composite rubric reward encompassing answer correctness, physics principle identification, and unit consistency
A novel internal reward derived from model attention weights over input image regions
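To make the first three signals concrete, here is a minimal sketch of how they could be written as scalar reward functions. The summary does not specify the paper's implementation, so everything here is an assumption for illustration: the `<think>`/`<answer>` tag convention, the exact-match comparison, the keyword check standing in for physics principle identification, the suffix check standing in for unit consistency, and the rubric weights.

```python
import re

# Assumed output convention (not from the paper): reasoning inside
# <think>...</think>, followed by the final answer inside <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def extract_answer(completion: str):
    """Return the text inside <answer>...</answer>, or None if absent."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else None


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 on a case-insensitive exact match with the gold answer."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if pred.lower() == gold.strip().lower() else 0.0


def rubric_reward(completion, gold, principle_keywords, unit,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of correctness, principle mention, and unit match.

    The keyword and suffix checks are crude stand-ins; the weights are
    hypothetical.
    """
    text = completion.lower()
    correct = accuracy_reward(completion, gold)
    principle = 1.0 if any(k.lower() in text for k in principle_keywords) else 0.0
    pred = extract_answer(completion) or ""
    units_ok = 1.0 if pred.lower().endswith(unit.lower()) else 0.0
    w_c, w_p, w_u = weights
    return w_c * correct + w_p * principle + w_u * units_ok


sample = "<think>By Newton's second law, F = ma = 2 * 3.</think><answer>6 N</answer>"
print(format_reward(sample))
print(accuracy_reward(sample, "6 N"))
print(rubric_reward(sample, "6 N", ["newton's second law"], "N"))
```

The point of the sketch is the ordering: each signal inspects more of the response than the one before it, which is what "increasing semantic richness" refers to.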

Evaluation and Benchmarking

We evaluate on PhyX, a benchmark of 3,000 problems spanning six physics domains and six reasoning types in multiple-choice and open-ended formats. All experiments use IBM Granite Vision 3.3 (2B).
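Because the findings are reported per domain and per reasoning type, the evaluation amounts to grouping per-problem correctness by a label. A minimal sketch of that aggregation follows; the record fields (`domain`, `reasoning_type`, `correct`) and the example labels are hypothetical and are not taken from the PhyX data format.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group label: fraction correct} for the given record field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {g: hits[g] / totals[g] for g in totals}


# Illustrative records only; labels and outcomes are made up.
records = [
    {"domain": "mechanics", "reasoning_type": "spatial", "correct": True},
    {"domain": "mechanics", "reasoning_type": "symbolic", "correct": False},
    {"domain": "optics", "reasoning_type": "spatial", "correct": True},
    {"domain": "optics", "reasoning_type": "spatial", "correct": True},
]

print(accuracy_by_group(records, "domain"))
print(accuracy_by_group(records, "reasoning_type"))
```

Slicing the same records by two different keys is what lets a single run show a gain in one reasoning type alongside a loss in another.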

Key Findings

Our experiments yield five main findings:

GRPO with accuracy-based rewards outperforms SFT on most domains, although the size of the gain varies substantially with reward type and domain.
Reward design does not yield uniform performance improvements; instead, it induces domain-specific reasoning behaviors.
Accuracy-based rewards provide the strongest overall performance gains.
Rubric rewards enhance the quality of structured reasoning but do not lead to consistent improvements in accuracy.
Attention-based rewards improve spatial reasoning but can degrade performance on symbolic reasoning tasks.
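One way to see why the choice of reward matters so much under GRPO is to look at how the algorithm turns rewards into a learning signal: each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. It uses the population standard deviation and a small `eps`, both of which are implementation choices assumed here, not details confirmed by the summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a binary accuracy reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every response earns the same reward, no response is preferred.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every response in a group receives the same score, all advantages are zero and the group contributes no preference between responses. A reward therefore only steers training to the extent that it separates responses within a group.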

Conclusion

Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50. This suggests that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
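The summary does not give the formula for the attention-weight reward, so the following is only one plausible reading, offered as a sketch: reward a response by the average share of attention that its generated tokens place on image positions. The function name, the input layout (one row of attention weights per generated token), and the boolean `image_mask` are all assumptions made for illustration. Consistent with the claim above, this form needs only the model's own attention weights and knowledge of which input positions hold image tokens, not region-level annotations.

```python
def attention_reward(attn, image_mask):
    """Mean fraction of attention mass placed on image positions.

    attn: list of rows, one per generated token; each row holds
          non-negative attention weights over the input positions.
    image_mask: list of booleans, True where the input position is
          an image token.
    """
    fractions = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            continue  # skip degenerate rows instead of dividing by zero
        on_image = sum(w for w, is_img in zip(row, image_mask) if is_img)
        fractions.append(on_image / total)
    return sum(fractions) / len(fractions) if fractions else 0.0


# Two generated tokens attending over three input positions,
# of which the last two are (hypothetically) image tokens.
attn = [
    [0.25, 0.50, 0.25],
    [0.50, 0.25, 0.25],
]
image_mask = [False, True, True]
print(attention_reward(attn, image_mask))  # (0.75 + 0.50) / 2 = 0.625
```

A reward of this shape favors looking at the image regardless of whether the problem needs it, which is at least compatible with the reported trade-off between spatial and symbolic reasoning.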

In summary, the study highlights the importance of reward design in shaping the reasoning capabilities of Vision-Language Models and points toward future avenues for research in integrating visual and symbolic reasoning more effectively.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
