Optimizing Rewards for Physical Reasoning in Vision-Language Models

Reward Design for Physical Reasoning in Vision-Language Models

Source: arXiv:2604.13993v1

Abstract

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood.
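For readers unfamiliar with GRPO, its core step is to score several sampled responses to the same prompt and normalize each reward against the group. The sketch below shows only that normalization. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of responses sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem, scored by a binary accuracy reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every answer is wrong carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0])
```

Correct answers receive positive advantages and incorrect ones negative, so the choice of reward function directly determines which responses the policy is pushed toward.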

Research Overview

We present a systematic reward ablation study for GRPO-based VLM training focused on physical reasoning. The study compares four reward signals of increasing semantic richness:

  • Format compliance
  • Answer accuracy
  • A composite rubric reward encompassing answer correctness, physics principle identification, and unit consistency
  • A novel internal reward derived from model attention weights over input image regions
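The first three reward signals listed above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`<answer>` response template, the exact-match answer check, the keyword test for the physics principle, and the rubric weights are all hypothetical choices made here.

```python
import re
from typing import Optional

# Hypothetical response template; the source does not specify the format.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer>, or None if malformed."""
    match = RESPONSE_PATTERN.match(response)
    return match.group(2).strip() if match else None


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if extract_answer(response) is not None else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(response, gold, principle, unit, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of answer correctness, principle mention, and unit presence.

    The weights and the substring checks are illustrative assumptions.
    """
    answer = extract_answer(response) or ""
    scores = (
        accuracy_reward(response, gold),
        1.0 if principle.lower() in response.lower() else 0.0,
        1.0 if unit.lower() in answer.lower() else 0.0,
    )
    return sum(w * s for w, s in zip(weights, scores))


good = "<think>Apply conservation of energy: mgh = 0.5 m v^2.</think><answer>14 m/s</answer>"
wrong = "<think>Conservation of energy applies.</think><answer>10 m/s</answer>"

good_score = rubric_reward(good, "14 m/s", "conservation of energy", "m/s")
wrong_score = rubric_reward(wrong, "14 m/s", "conservation of energy", "m/s")
```

Under these assumed weights, a response with the wrong number but the right principle and unit still earns partial rubric credit, which is one way a rubric reward can improve reasoning structure without improving accuracy.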

Evaluation and Benchmarking

We evaluate on PhyX, a benchmark of 3,000 problems spanning six physics domains and six reasoning types, in both multiple-choice and open-ended formats. All experiments use IBM Granite Vision 3.3 (2B).
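Because results are reported per domain and per reasoning type, evaluation amounts to grouping graded predictions by a label and computing accuracy within each group. A minimal sketch follows; the record layout and the domain and reasoning-type names in the example are illustrative assumptions, not PhyX's actual schema.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group graded predictions by `key` and return accuracy per group.

    Each record is assumed to be a dict with a boolean "correct" field
    plus label fields such as "domain" or "reasoning_type".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        hits[rec[key]] += int(rec["correct"])
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative records; the labels are placeholders.
records = [
    {"domain": "mechanics", "reasoning_type": "spatial", "correct": True},
    {"domain": "mechanics", "reasoning_type": "symbolic", "correct": False},
    {"domain": "optics", "reasoning_type": "spatial", "correct": True},
]

by_domain = accuracy_by(records, "domain")
by_type = accuracy_by(records, "reasoning_type")
```

Slicing the same predictions along both labels is what makes domain-specific effects of a reward visible, since an overall average can hide a gain in one group offset by a loss in another.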

Key Findings

The experiments show the following:

  • GRPO with accuracy-based rewards outperforms SFT on most domains, although the size of the gain varies substantially with reward type and domain.
  • Reward design does not yield uniform performance improvements; instead, it induces domain-specific reasoning behaviors.
  • Accuracy-based rewards provide the strongest overall performance gains across the evaluated benchmarks.
  • Rubric rewards enhance the quality of structured reasoning but do not lead to consistent improvements in accuracy.
  • Attention-based rewards improve spatial reasoning but can degrade performance on symbolic reasoning tasks.

Conclusion

Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50. This suggests that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
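The source does not spell out how the attention-weight reward is computed. One plausible reading, shown here purely as an assumption, is to reward the share of attention mass that generated tokens place on image positions, averaged over generation steps. The function name, the input layout, and the averaging scheme are all hypothetical.

```python
def image_attention_reward(attn, image_positions):
    """Average fraction of attention mass placed on image tokens.

    attn[t][j] is the attention weight from generated token t to input
    position j (assumed already averaged over heads and layers).
    image_positions lists the input positions holding image tokens.
    No region annotations are needed: only the image/text split is used.
    """
    if not attn:
        return 0.0
    image_set = set(image_positions)
    per_step = []
    for row in attn:
        total = sum(row)
        on_image = sum(w for j, w in enumerate(row) if j in image_set)
        per_step.append(on_image / total if total > 0 else 0.0)
    return sum(per_step) / len(per_step)


# Two generation steps over four input positions; positions 1 and 2 are image tokens.
attn = [
    [0.1, 0.6, 0.2, 0.1],      # mostly attends to the image
    [0.25, 0.25, 0.25, 0.25],  # attends uniformly
]
reward = image_attention_reward(attn, image_positions=[1, 2])
```

A reward of this shape pushes the model toward looking at the image, which fits the reported spatial gains, and it also hints at why symbolic tasks might suffer if attention is drawn away from the problem text.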

In summary, the study highlights the importance of reward design in shaping the reasoning capabilities of Vision-Language Models and points toward future avenues for research in integrating visual and symbolic reasoning more effectively.


Lazarus Omolua