Visual Preference Optimization Using Rubric-Based Rewards

Summary of arXiv:2604.13029v1 (announce type: cross)

Abstract

The effectiveness of Direct Preference Optimization (DPO) depends significantly on preference data that accurately reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited for fine-grained visual reasoning.
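For readers unfamiliar with DPO, the standard per-pair loss can be sketched as follows. This is the generic DPO objective, not code from the rDPO paper; the function name, argument names, and the default `beta` are illustrative assumptions. It shows why the choice of chosen and rejected responses matters: the loss only ever sees the pair it is given.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. `beta` is an assumed temperature value.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it than the reference model does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


# The loss drops below log(2) once the policy favors the chosen response
# more strongly than the reference model does.
example = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

If the pair is mislabeled, or the two responses differ only in ways irrelevant to the task, this objective optimizes toward the wrong signal, which is the problem the abstract points to.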

Introduction to rDPO

In this article, we introduce rDPO, a preference optimization framework built on instance-specific rubrics. It aims to improve the quality of the preference data used for visual reasoning tasks, which is what DPO training ultimately depends on.

Methodology

For each image-instruction pair, we create a checklist-style rubric consisting of essential and additional criteria. The rubric is designed to score responses from any policy, so it gives criterion-level feedback rather than a single coarse outcome signal. The process involves the following steps:

  • Rubric Creation: Develop an instruction-rubric pool offline that can be reused during the construction of on-policy data.
  • Data Scoring: Use the rubric to score responses, providing instance-specific feedback.
  • Model Training: Integrate the rubric-based scoring into the training of preference models.
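The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Criterion` fields, the 2:1 weighting of essential over additional criteria, the `min_margin` threshold, and the keyword-matching `toy_judge` (a stand-in for a model-based judge) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Criterion:
    description: str
    essential: bool  # essential vs. additional criterion
    keyword: str     # hypothetical field, used only by the toy judge below


# A judge decides whether a response satisfies one criterion.
Judge = Callable[[str, Criterion], bool]


def toy_judge(response: str, criterion: Criterion) -> bool:
    """Stand-in for a model-based judge: plain keyword matching."""
    return criterion.keyword.lower() in response.lower()


def score_response(response: str, rubric: List[Criterion], judge: Judge,
                   essential_weight: float = 2.0) -> float:
    """Weighted fraction of rubric criteria satisfied, between 0.0 and 1.0."""
    weights = [essential_weight if c.essential else 1.0 for c in rubric]
    total = sum(weights)
    if total == 0:
        return 0.0
    earned = sum(w for w, c in zip(weights, rubric) if judge(response, c))
    return earned / total


def build_pair(responses: List[str], rubric: List[Criterion], judge: Judge,
               min_margin: float = 0.25) -> Optional[Tuple[str, str]]:
    """Pick (chosen, rejected) from sampled responses, or None if too close."""
    if len(responses) < 2:
        return None
    scored = [(score_response(r, rubric, judge), r) for r in responses]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] - worst[0] < min_margin:
        return None  # filter out pairs with no clear quality difference
    return best[1], worst[1]


# Rubric built offline for one image-instruction pair (toy content).
rubric = [
    Criterion("States the bar color", True, "red"),
    Criterion("Gives the bar count", True, "three"),
    Criterion("Mentions the x-axis label", False, "year"),
]
# Responses that would be sampled from the current policy (on-policy data).
responses = [
    "There are three red bars, plotted by year.",
    "There are three bars.",
    "The chart is blue.",
]
pair = build_pair(responses, rubric, toy_judge)
```

Here `pair` holds the highest-scoring and lowest-scoring responses, which could then be passed to a DPO trainer as the chosen and rejected sides of one preference example. Returning `None` when the scores are too close is one simple way to express rubric-based filtering.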

Results

We evaluated rDPO on public reward modeling benchmarks, on public downstream benchmarks, and on a comprehensive benchmark used to assess scalability. The reported results are:

  • Rubric-based prompting improved a 30B-A3B judge, bringing it close to the performance of GPT-5.4.
  • On public downstream benchmarks, rubric-based filtering raised the macro average from 81.14 to 82.69, whereas outcome-based filtering lowered it from 81.14 to 75.82.
  • On a comprehensive benchmark used to assess scalability, rDPO scored 61.01, ahead of both the base model (59.48) and the style-constrained baseline (52.36).
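The size of each change follows directly from the figures above. The short calculation below uses only the numbers reported in this summary; the variable names are illustrative.

```python
# Macro average on public downstream benchmarks.
macro_initial = 81.14
macro_rubric_filtered = 82.69
macro_outcome_filtered = 75.82

# Scores on the comprehensive scalability benchmark.
score_rdpo = 61.01
score_base = 59.48
score_style_constrained = 52.36

delta_rubric = round(macro_rubric_filtered - macro_initial, 2)     # +1.55
delta_outcome = round(macro_outcome_filtered - macro_initial, 2)   # -5.32
gain_over_base = round(score_rdpo - score_base, 2)                 # +1.53
gain_over_style = round(score_rdpo - score_style_constrained, 2)   # +8.65
```

In other words, rubric-based filtering adds about 1.6 points to the macro average while outcome-based filtering costs about 5.3, and rDPO improves on the base model by about 1.5 points where the style-constrained baseline falls about 7.1 points below it.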

Conclusion

Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific, criterion-level feedback. In the reported experiments, rDPO improved the quality of the preference data and, with it, downstream model performance.

Future Work

Future research will focus on further refining the rubric design and exploring its application in other multimodal contexts, aiming to establish a more robust understanding of how instance-specific feedback can shape AI behavior in various domains.


Lazarus Omolua
https://richlyai.com/blog