Peer-Predictive Self-Training for Language Model Reasoning
Summary: arXiv:2604.13356v1
Introduction
Language models increasingly need ways to improve themselves without relying on external supervision. A recent study introduces Peer-Predictive Self-Training (PST), a framework that addresses this need by having multiple language models improve collaboratively.
What is Peer-Predictive Self-Training?
PST is a label-free fine-tuning approach in which several language models interact to produce an aggregated response. That aggregated response serves as an internal training target, so the method needs neither external labels nor a teacher-student hierarchy.
How Does PST Work?
Given a prompt question, the language models generate their responses sequentially. The final output is an aggregated answer derived from these individual responses, and it is often more reliable than the response produced by any single model.
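The round described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the `Model` callable interface, the `sequential_round` function, and the toy models are all hypothetical, and a majority vote stands in for the aggregation step, which the summary does not specify.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: (prompt, earlier responses) -> answer string.
Model = Callable[[str, Sequence[str]], str]


def sequential_round(prompt: str, models: Sequence[Model]) -> Tuple[List[str], str]:
    """Query the models one after another, then aggregate their answers."""
    responses: List[str] = []
    for model in models:
        # Each model sees the prompt and the responses produced so far.
        responses.append(model(prompt, tuple(responses)))

    # Assumption: majority vote stands in for the aggregation step.
    # Ties are broken in favour of the answer that appeared first.
    counts = Counter(responses)
    aggregate = max(counts, key=lambda a: (counts[a], -responses.index(a)))
    return responses, aggregate


# Toy stand-ins for real language models (illustrative only).
def model_a(prompt: str, prior: Sequence[str]) -> str:
    return "4"


def model_b(prompt: str, prior: Sequence[str]) -> str:
    return "5"


def model_c(prompt: str, prior: Sequence[str]) -> str:
    # Echoes the first earlier response if there is one.
    return prior[0] if prior else "4"


responses, aggregate = sequential_round("What is 2 + 2?", [model_a, model_b, model_c])
print(responses, aggregate)
```

In a real setup each callable would wrap a language model, and the aggregated answer would then be used as the training target for all participants.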
Key Mechanisms of PST
- Pointwise Mutual Information (PMI): This statistical measure quantifies how informative each intermediate response is about the aggregated answer. The framework uses this score to adjust the self-training updates.
- Adaptive Learning Rates: Responses that already align closely with the aggregated answer receive scaled-down updates, while less informative or misaligned responses are updated more aggressively. Training effort is thereby concentrated where a model diverges from the aggregate.
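To make the two mechanisms concrete, the sketch below computes PMI from log-probabilities using the standard definition, PMI(r; a) = log p(a | r) - log p(a), and maps it to an update scale. The logistic mapping, the `temperature` parameter, the function names, and the example probabilities are assumptions chosen for illustration; the summary states only that well-aligned responses get smaller updates and misaligned ones get larger updates.

```python
import math


def pmi(logp_answer_given_response: float, logp_answer: float) -> float:
    """Pointwise mutual information between a response and the aggregated answer."""
    return logp_answer_given_response - logp_answer


def update_scale(pmi_value: float, temperature: float = 1.0) -> float:
    """Hypothetical scaling rule: a logistic function that decreases with PMI.

    High PMI (response already predicts the aggregate) -> scale near 0.
    Low or negative PMI (uninformative or misaligned)  -> scale near 1.
    """
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


base_lr = 1e-5  # illustrative base learning rate

# Illustrative probabilities: the aggregate has prior probability 0.3.
aligned = pmi(math.log(0.9), math.log(0.3))      # response raises p(answer)
misaligned = pmi(math.log(0.1), math.log(0.3))   # response lowers p(answer)

lr_aligned = base_lr * update_scale(aligned)
lr_misaligned = base_lr * update_scale(misaligned)
print(lr_aligned, lr_misaligned)
```

With these numbers the aligned response has PMI log 3 and the misaligned one has PMI -log 3, so the misaligned response receives the larger effective learning rate. Any monotonically decreasing function of PMI would show the same qualitative behaviour.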
Impact on Mathematical Reasoning Benchmarks
PST has been evaluated on mathematical reasoning benchmarks, including SimulEq, Math500, and MultiArith. Exact-match accuracy improved by 2.2 to 4.3 percentage points, depending on the model:
- Gemma-2-2B: Improved accuracy by 2.2 percentage points
- LLaMA-3.2-1B: Improved accuracy by 3.5 percentage points
- Qwen-2.5-1.5B: Improved accuracy by 4.3 percentage points
PST also reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent across the models tested, a result that further supports the value of peer-predictive feedback.
Conclusion
Peer-Predictive Self-Training is a promising approach to self-improvement for language models. By relying on cross-model interactions instead of external supervision, PST lets models improve collaboratively. The findings suggest that peer-predictive feedback is a viable strategy for ongoing self-improvement.
