Visual Preference Optimization Using Rubric-Based Rewards

Visual Preference Optimization with Rubric Rewards

Source: arXiv:2604.13029v1 (announce type: cross)

Abstract

The effectiveness of Direct Preference Optimization (DPO) depends significantly on preference data that accurately reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited for fine-grained visual reasoning.
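For context, the sketch below shows the standard per-pair DPO loss, which makes clear why the choice of chosen and rejected responses matters: the loss only sees the log-probability margin between the two. This is a minimal illustration of the general DPO formulation, not code from the rDPO paper. The function name, the argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the source.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    x = beta * margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

A pair whose labels do not reflect a real quality difference still pushes the margin up, which is why noisy or coarse preference data can hurt training.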

Introduction to rDPO

This article introduces rDPO, a preference optimization framework built on instance-specific rubrics. The framework targets the quality of the preference data used in visual reasoning tasks, since that data largely determines how well DPO-trained models perform.

Methodology

For each image-instruction pair, we create a checklist-style rubric consisting of essential and additional criteria. The rubric can score responses from any policy, so it yields criterion-level feedback rather than a single coarse outcome signal. The process involves the following steps:

Rubric Creation: Build an instruction-rubric pool offline, then reuse it when constructing on-policy data.
Data Scoring: Use the rubric to score responses, providing instance-specific feedback.
Model Training: Integrate the rubric-based scoring into the training of preference models.
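The steps above can be sketched in a few lines of Python. This is a hypothetical illustration under stated assumptions, not the paper's implementation: the `Criterion` class, the `(essential_ok, fraction_passed)` scoring rule, the `min_gap` threshold, and the example rubric are all invented here. In practice the per-criterion verdicts would come from a judge model and the candidates would be sampled from the current policy; in this sketch they are hard-coded booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    essential: bool  # True for essential criteria, False for additional ones


def score_response(rubric, passed):
    """Return (all essential criteria met, fraction of all criteria met).

    `passed` holds one boolean verdict per rubric criterion (assumed to be
    produced by a judge model). The aggregation rule is an assumption.
    """
    if not rubric or len(rubric) != len(passed):
        raise ValueError("rubric and verdicts must be non-empty and aligned")
    essential_ok = all(p for c, p in zip(rubric, passed) if c.essential)
    return essential_ok, sum(passed) / len(rubric)


def build_preference_pair(rubric, candidates, min_gap=0.25):
    """Pick (chosen, rejected) from candidate responses, or None.

    `candidates` maps a response id to its list of verdicts. A pair is kept
    only if the best response meets every essential criterion and beats the
    worst one by at least `min_gap` (a hypothetical filtering threshold).
    """
    if len(candidates) < 2:
        return None
    scored = {rid: score_response(rubric, v) for rid, v in candidates.items()}
    # Tuples sort by essential_ok first (False < True), then by fraction.
    ranked = sorted(scored, key=lambda rid: scored[rid])
    worst, best = ranked[0], ranked[-1]
    best_ok, best_frac = scored[best]
    if not best_ok or best_frac - scored[worst][1] < min_gap:
        return None
    return best, worst


# Hypothetical rubric for one image-instruction pair.
rubric = [
    Criterion("Identifies the chart type", essential=True),
    Criterion("Reads the correct value from the y-axis", essential=True),
    Criterion("Mentions the unit", essential=False),
    Criterion("Explains the trend", essential=False),
]
candidates = {
    "response_a": [True, True, True, False],
    "response_b": [True, False, False, False],
    "response_c": [True, True, False, False],
}
pair = build_preference_pair(rubric, candidates)
```

Here `pair` is `("response_a", "response_b")`: the first response meets both essential criteria, while the second misses one. Returning `None` when the gap is too small is one simple way to filter out pairs that carry little preference signal.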

Results

We evaluated rDPO on public reward modeling benchmarks, public downstream benchmarks, and a comprehensive benchmark used to assess scalability. The results are as follows:

On public reward modeling benchmarks, rubric-based prompting improved a 30B-A3B judge, bringing it close to the performance of GPT-5.4.
On public downstream benchmarks, the macro average started at 81.14. Rubric-based filtering raised it to 82.69, while outcome-based filtering lowered it to 75.82.
On the comprehensive scalability benchmark, rDPO scored 61.01, ahead of both the base model (59.48) and the style-constrained baseline (52.36), which itself fell below the base model.
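To put the reported numbers side by side, the short calculation below computes the differences in points. Only the six scores come from the article; the variable names and the `delta` helper are illustrative.

```python
def delta(new, old):
    """Difference in points, rounded to two decimals like the reported scores."""
    return round(new - old, 2)


# Downstream macro average (starting point: 81.14).
rubric_filtering_gain = delta(82.69, 81.14)    # rubric-based filtering
outcome_filtering_gain = delta(75.82, 81.14)   # outcome-based filtering

# Comprehensive scalability benchmark.
rdpo_vs_base = delta(61.01, 59.48)
rdpo_vs_style_baseline = delta(61.01, 52.36)

print(f"rubric filtering:       {rubric_filtering_gain:+.2f}")
print(f"outcome filtering:      {outcome_filtering_gain:+.2f}")
print(f"rDPO vs base model:     {rdpo_vs_base:+.2f}")
print(f"rDPO vs style baseline: {rdpo_vs_style_baseline:+.2f}")
```

Rubric-based filtering adds 1.55 points while outcome-based filtering costs 5.32, and rDPO sits 1.53 points above the base model and 8.65 above the style-constrained baseline.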

Conclusion

Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific, criterion-level feedback. rDPO improves the quality of the preference data, which in turn improves model performance on the reported multimodal benchmarks.

Future Work

Future research will focus on refining the rubric design and applying it in other multimodal contexts, to better understand how instance-specific feedback shapes model behavior across domains.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
