Reward Design for Physical Reasoning in Vision-Language Models
arXiv:2604.13993v1
Abstract
Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision-Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood.
Research Overview
We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. It compares four reward signals of increasing semantic richness:
- Format compliance
- Answer accuracy
- A composite rubric reward encompassing answer correctness, physics principle identification, and unit consistency
- A novel internal reward derived from model attention weights over input image regions
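To make the first three reward signals concrete, here is a minimal sketch of how they could be scored on a single completion. Everything in it is an assumption for illustration: the `<think>`/`<answer>` template, the function names, the exact-match comparison, the keyword and substring checks, and the rubric weights are not taken from the paper, which may grade each component quite differently.

```python
import re

# Hypothetical output template: <think>...</think><answer>...</answer>.
# The source does not specify the format the model is asked to follow.
_TEMPLATE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def extract_answer(completion):
    """Return the text inside <answer>...</answer>, or None if the template is violated."""
    match = _TEMPLATE.match(completion)
    return match.group(1).strip() if match else None


def format_reward(completion):
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def accuracy_reward(completion, gold):
    """1.0 if the extracted answer equals the reference after light normalization."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(completion, gold, principle_keywords, unit, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of answer correctness, principle mention, and unit presence.

    Keyword and substring matching are crude stand-ins for a real rubric grader,
    and the default weights are arbitrary placeholders.
    """
    text = completion.lower()
    correct = accuracy_reward(completion, gold)
    principle = 1.0 if any(k.lower() in text for k in principle_keywords) else 0.0
    answer = extract_answer(completion) or ""
    units_ok = 1.0 if unit.lower() in answer.lower() else 0.0
    w_correct, w_principle, w_units = weights
    return w_correct * correct + w_principle * principle + w_units * units_ok


completion = "<think>Use Newton's second law: F = ma = 2 * 3.</think><answer>6 N</answer>"
print(format_reward(completion))
print(accuracy_reward(completion, "6 N"))
print(rubric_reward(completion, "6 N", ["Newton's second law"], "N"))
```

The sketch also shows why the signals grow in semantic richness: the format reward inspects only the structure of the output, the accuracy reward adds the final answer, and the rubric additionally looks at the reasoning text and the units.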
Evaluation and Benchmarking
We evaluate on PhyX, a benchmark of 3,000 problems spanning six physics domains and six reasoning types in both multiple-choice and open-ended formats. All experiments use IBM Granite Vision 3.3 (2B).
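Because the comparison is made per domain and per reasoning type, results have to be broken down by group rather than reported as one number. The sketch below shows one way to do that; the record fields (`domain`, `reasoning_type`, `correct`) and the example domain labels are hypothetical placeholders, not the PhyX schema.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Mean correctness per value of `key` (e.g. 'domain' or 'reasoning_type')."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_seen]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {group: hits / seen for group, (hits, seen) in totals.items()}


# Toy records with made-up labels, only to show the shape of the input.
records = [
    {"domain": "mechanics", "reasoning_type": "spatial", "correct": True},
    {"domain": "mechanics", "reasoning_type": "symbolic", "correct": False},
    {"domain": "optics", "reasoning_type": "spatial", "correct": True},
]
print(accuracy_by_group(records, "domain"))
print(accuracy_by_group(records, "reasoning_type"))
```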
Key Findings
Our experiments show the following:
- GRPO with accuracy-based rewards outperforms SFT on most domains, although the size of the gain varies substantially with reward type and domain.
- Reward design does not yield uniform performance improvements; instead, it induces domain-specific reasoning behaviors.
- Accuracy-based rewards provide the strongest overall performance gains.
- Rubric rewards enhance the quality of structured reasoning but do not lead to consistent improvements in accuracy.
- Attention-based rewards improve spatial reasoning but can degrade performance on symbolic reasoning tasks.
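One reason the choice of reward matters so much under GRPO is that the policy update is driven by each completion's reward relative to the other completions sampled for the same prompt, not by the raw value. The following is a minimal sketch of that group-relative standardization. The function name, the use of the population standard deviation, and the `eps` value are assumptions; implementations differ on these details, and the paper's exact configuration is not given here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes in a group give a non-zero learning signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every completion earns the same reward, all advantages are zero,
# so that group contributes no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

A reward that rarely separates the completions in a group, for instance one that nearly every sample satisfies, therefore provides little to learn from, whatever its absolute value.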
Conclusion
Our internal attention-weight reward requires no spatial annotations and raises spatial relation accuracy from 0.27 to 0.50, an absolute gain of 0.23. This suggests that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
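As a rough illustration of what an annotation-free attention reward could look like, the sketch below scores a completion by the average share of attention that generated tokens place on image tokens. This is only one plausible reading of "attention weights over input image regions", not the paper's formulation: the function name, the assumption that attention has already been averaged over heads and layers, and the use of a plain image-token mask are all hypothetical.

```python
def image_attention_reward(attention, image_token_mask):
    """Average share of attention mass that generated tokens place on image tokens.

    attention: one row per generated token; each row holds non-negative weights
        over the input positions (assumed already averaged over heads and layers).
    image_token_mask: booleans, True where the input position is an image token.
    """
    shares = []
    for row in attention:
        total = sum(row)
        if total <= 0:
            continue  # skip degenerate rows rather than divide by zero
        on_image = sum(w for w, is_img in zip(row, image_token_mask) if is_img)
        shares.append(on_image / total)
    return sum(shares) / len(shares) if shares else 0.0


# Two generated tokens attending over three input positions,
# of which the first two are image tokens in this toy example.
attention = [
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
mask = [True, True, False]
print(image_attention_reward(attention, mask))
```

The signal needs only the model's own attention maps and the positions of the image tokens in the input, which matches the claim that no spatial annotations are required.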
In summary, reward design shapes which reasoning capabilities a Vision-Language Model acquires, and the results point toward future research on integrating visual and symbolic reasoning more effectively.
