Stabilizing MLLM Self-Evolution with Softened Retracing

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Summary: arXiv:2604.03647v2

Abstract

In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer. That answer may reflect the model’s intrinsic biases rather than the objective correctness of its reasoning paths, which can degrade training. To counteract this degradation, we propose Continuous Softened Retracing reSampling (CSRS) in MLLM self-evolution.

Introduction

The field of artificial intelligence has made significant strides with the development of Multimodal Large Language Models (MLLMs). These models, which integrate various types of data inputs, are increasingly utilized in a wide range of applications. However, for these models to evolve effectively in an unsupervised manner, it is essential to enhance the quality of feedback during their post-training phase.

Challenges in Current Methods

Current self-evolution methodologies primarily depend on a majority voting system to determine the most frequent output as the pseudo-golden answer. This approach can lead to several issues:

Intrinsic Biases: The reliance on majority voting may reinforce existing biases within the model.
Uncertainty in Correctness: Selecting outputs based on frequency does not guarantee the correctness of reasoning paths.
Limited Exploration: Traditional methods may restrict the model’s ability to explore diverse reasoning paths.
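To make the baseline concrete, here is a minimal sketch of majority voting with binary rewards. The function name, the sample answers, and the exact 1/0 reward values are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Pick the most frequent answer as the pseudo-label and score each sample 1.0 or 0.0."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    # Binary reward: only samples that agree with the majority get credit.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical final answers sampled from the model for one question.
answers = ["12", "12", "15", "12", "9"]
label, rewards = majority_vote_rewards(answers)
print(label, rewards)
```

If the model is consistently biased toward "12", that answer wins the vote whether or not it is correct, and every minority path gets zero reward regardless of how sound its reasoning is.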

Proposed Solution: Continuous Softened Retracing reSampling (CSRS)

To address these challenges, we introduce Continuous Softened Retracing reSampling (CSRS) for MLLM self-evolution. Our approach consists of two main components:

Retracing Re-inference Mechanism (RRM): This mechanism lets the model re-run inference from anchor points, which expands exploration of long-tail reasoning paths. By revisiting these anchor points, the model can surface less frequent reasoning patterns.
Softened Frequency Reward (SFR): Instead of binary rewards, SFR uses continuous signals that calibrate each reward by how often its answer appears across the sampled reasoning sets, giving more graded feedback.
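The two components can be sketched roughly as follows. This summary gives no formulas, so everything here is an assumption: an anchor point is modelled as a step index in a reasoning path, `continue_fn` is a hypothetical stand-in for the model's decoder, and the softened reward is taken to be the relative frequency of each answer in the pooled samples. The paper's actual definitions may differ.

```python
from collections import Counter


def retrace(path, continue_fn, anchor_index):
    """Keep the steps before anchor_index and regenerate the rest of the path."""
    prefix = path[:anchor_index]
    return prefix + continue_fn(prefix)


def softened_frequency_rewards(answers):
    """Assumed reward: the fraction of sampled paths that share each answer."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


def toy_continue(prefix):
    # Hypothetical decoder: a real system would sample from the MLLM here.
    return ["recheck the angle sum", "answer=15"]


paths = [
    ["read figure", "set up equation", "answer=12"],
    ["read figure", "guess from picture", "answer=9"],
]
# Re-infer every path from its first step (the assumed anchor point).
retraced = [retrace(p, toy_continue, anchor_index=1) for p in paths]
answers = [p[-1] for p in paths + retraced]
print(softened_frequency_rewards(answers))
```

Under this assumed form, every answer receives a reward between 0 and 1 rather than a hard 1 or 0, so minority answers are not zeroed out.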

Incorporating Visual Semantic Perturbation

CSRS also incorporates Visual Semantic Perturbation (VSP) so that the model prioritizes mathematical logic over visual superficiality. Training the model to rely on the underlying logic rather than surface-level visual cues improves its reasoning performance.
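This summary does not say how VSP actually perturbs the image. As a loudly labelled stand-in, the sketch below applies a generic appearance perturbation (a brightness shift plus a color-channel permutation) that changes how a figure looks without moving any pixels. The function name, parameters, and the choice of perturbation are all assumptions for illustration, not the paper's method.

```python
import random


def perturb_appearance(image, rng, max_shift=40):
    """Shift brightness and permute color channels; pixel positions stay unchanged."""
    shift = rng.randint(-max_shift, max_shift)
    order = [0, 1, 2]
    rng.shuffle(order)

    def clamp(value):
        return max(0, min(255, value))

    return [
        [tuple(clamp(px[c] + shift) for c in order) for px in row]
        for row in image
    ]


# Hypothetical 2x2 RGB image: two colored marks on a black background.
image = [
    [(255, 0, 0), (0, 0, 0)],
    [(0, 0, 0), (0, 0, 255)],
]
perturbed = perturb_appearance(image, random.Random(0))
print(perturbed)
```

Because the layout of the figure is preserved, a model that answers from the geometry should give the same answer on both versions, while one keyed to color or brightness may not.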

Experimental Results

Our experiments demonstrate that the CSRS framework significantly enhances the reasoning performance of the Qwen2.5-VL-7B model on various benchmarks, including MathVision. The results indicate that our approach achieves state-of-the-art (SOTA) outcomes in unsupervised self-evolution, particularly in geometric tasks.

Conclusion

In conclusion, Continuous Softened Retracing reSampling is a promising advance in the unsupervised self-evolution of Multimodal Large Language Models. By combining retracing re-inference, softened frequency rewards, and visual semantic perturbation, CSRS improves the quality of reasoning paths and reports state-of-the-art results for unsupervised self-evolution, particularly on geometric tasks. Our code is available at GitHub.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
