Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
arXiv:2604.18639v1
Abstract
Previous LLM-based reinforcement learning (RL) studies typically follow either supervised paradigms, which incur high annotation costs, or unsupervised paradigms that use voting- or entropy-based rewards. However, their performance remains far from satisfactory, owing to the substantial annotation cost of the former and to issues such as model collapse or reward hacking in the latter. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL.
Introduction
In recent years, large language models (LLMs) have gained significant attention for their capabilities across natural language processing tasks. However, training these models relies on extensive annotated data, which poses a challenge. Existing reinforcement learning (RL) methods struggle with high annotation costs and limited performance. EasyRL aims to bridge this gap by combining a small amount of easy labeled data with progressively more difficult unlabeled data.
Methodology
The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. The methodology can be broken down into the following key components:
- Warm-Up Model Initialization: We initialize a warm-up model using supervised RL with a few-shot labeled dataset, allowing the model to grasp fundamental concepts quickly.
- Divide-and-Conquer Pseudo-Labeling: This strategy focuses on difficult unlabeled data. It combines consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases to optimize the learning process.
- Difficulty-Progressive Self-Training: The model undergoes iterative pseudo-labeling and reinforcement learning, further enhancing its reasoning capabilities.
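The pseudo-labeling step above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how uncertainty is measured, so this sketch assumes it is the agreement ratio among several sampled answers, and the thresholds, the `reflect` callback, and the choice to defer high-uncertainty questions to a later round are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical thresholds: the source does not specify how uncertainty is measured.
LOW_UNCERTAINTY_MIN_AGREEMENT = 0.8
MEDIUM_UNCERTAINTY_MIN_AGREEMENT = 0.5


def agreement(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("answers must be non-empty")
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)


def route_by_uncertainty(sampled: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Split unlabeled questions into low / medium / high uncertainty buckets."""
    buckets: Dict[str, Dict[str, str]] = {"low": {}, "medium": {}, "high": {}}
    for question, answers in sampled.items():
        top_answer, ratio = agreement(answers)
        if ratio >= LOW_UNCERTAINTY_MIN_AGREEMENT:
            buckets["low"][question] = top_answer
        elif ratio >= MEDIUM_UNCERTAINTY_MIN_AGREEMENT:
            buckets["medium"][question] = top_answer
        else:
            buckets["high"][question] = top_answer
    return buckets


def pseudo_label(
    sampled: Dict[str, List[str]],
    reflect: Callable[[str, str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Build pseudo-labels for one round.

    Low uncertainty: accept the majority answer (consistency-based selection).
    Medium uncertainty: ask the hypothetical `reflect` callback to confirm or
    revise the candidate (reflection-based resolution); None means unresolved.
    High uncertainty: defer to a later round (an assumption of this sketch).
    """
    buckets = route_by_uncertainty(sampled)
    labels = dict(buckets["low"])
    deferred = list(buckets["high"])
    for question, candidate in buckets["medium"].items():
        resolved = reflect(question, candidate)
        if resolved is None:
            deferred.append(question)
        else:
            labels[question] = resolved
    return labels, deferred
```

In a difficulty-progressive loop, the returned labels would feed the next RL round, and the deferred questions would be re-sampled once the model has improved. In practice `reflect` would wrap a model call; here it is left as a plain callback so the routing logic can be read on its own.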
Results
Experiments on mathematical and scientific benchmarks show that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines. This demonstrates the effectiveness of the proposed framework for data-efficient post-training of LLMs.
Conclusion
EasyRL presents a unified self-evolving framework that not only addresses the challenges associated with high annotation costs but also enhances the reasoning capabilities of LLMs. By simulating cognitive learning processes and employing a strategic approach to data handling, EasyRL paves the way for more efficient and effective training of large language models. Future research may explore further refinements to this methodology and its applications across different domains.
