PRISM: Boost Multimodal RL with On-policy Distillation

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

A recent arXiv paper, “PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning,” introduces a method for mitigating distributional drift when post-training large multimodal models (LMMs).

LMMs are typically post-trained in two steps: supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). According to the authors, this pipeline introduces distributional drift: the model loses some of its original capabilities and still does not closely match the supervision distribution. The problem is most pronounced in multimodal reasoning, where perception errors and reasoning failures follow different patterns and compound during the subsequent RL stage.

The PRISM Approach

To address this, the authors propose PRISM, a three-stage pipeline that inserts an explicit distribution-alignment stage between SFT and RLVR. PRISM builds on on-policy distillation (OPD) and casts alignment as a black-box adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts. The discriminator supplies disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits.
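The adversarial alignment idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the article does not say how the experts are combined or trained, so the softmax gate, the GAN-style loss, and every name here (ExpertScores, alignment_reward, discriminator_loss) are assumptions made for illustration. The sketch only shows how two expert scores on an on-policy response could become a single scalar reward, using nothing from the teacher except its demonstrations.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExpertScores:
    """Hypothetical per-response outputs of the two discriminator experts.

    Each value is that expert's probability (0..1) that the response came
    from the supervision data rather than from the current policy.
    """
    perception: float
    reasoning: float


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_reward(scores: ExpertScores, gate_logits: Sequence[float]) -> float:
    """Mix the two expert scores with a softmax gate (an assumed design).

    A response the discriminator finds hard to tell apart from the
    demonstrations gets a reward close to 1.
    """
    w_perception, w_reasoning = softmax(gate_logits)
    return w_perception * scores.perception + w_reasoning * scores.reasoning


def discriminator_loss(demo_score: float, policy_score: float) -> float:
    """GAN-style binary cross-entropy for one demonstration/policy pair.

    The discriminator is pushed to score demonstrations high and
    policy samples low; the policy is then rewarded for fooling it.
    """
    eps = 1e-8  # guards against log(0)
    return -(math.log(demo_score + eps) + math.log(1.0 - policy_score + eps))


# An on-policy response with weak visual grounding but sound reasoning.
sample = ExpertScores(perception=0.2, reasoning=0.8)
reward = alignment_reward(sample, gate_logits=[0.0, 0.0])  # equal gate weights
print(f"alignment reward: {reward:.3f}")
```

Keeping the two expert scores separate until the final mix is what makes the signal disentangled in this sketch: a low perception score and a high reasoning score stay visible as distinct quantities and are only combined at the end.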

Key Features and Findings

One key finding is that 1.26 million public demonstrations are sufficient for broad SFT initialization, but effective distribution alignment requires higher-fidelity supervision. To meet that need, the researchers curated an additional 113,000 demonstrations from Gemini 3 Flash, emphasizing dense visual grounding and step-by-step reasoning on complex unsolved problems.

Performance Improvements: Experiments on Qwen3-VL models show that PRISM consistently improves downstream RLVR performance across several reinforcement learning algorithms, including GRPO, DAPO, and GSPO.
Accuracy Gains: PRISM improves average accuracy over the SFT-to-RLVR baseline by +4.4 points on the 4-billion-parameter model and +6.0 points on the 8-billion-parameter model.
Public Accessibility: The code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.

Conclusion

PRISM targets a specific weakness of the SFT-to-RLVR pipeline: the distributional drift introduced before reinforcement learning begins. By adding an explicit distribution-alignment stage and using a discriminator with separate perception and reasoning experts, it improves downstream RLVR accuracy on Qwen3-VL at both reported model sizes. The released code, data, and checkpoints make the approach straightforward for other researchers to reproduce and extend.
