EasyRL: Data-Efficient Self-Evolving Large Language Models

Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

Summary: arXiv:2604.18639v1 Announce Type: cross

Abstract

Previous LLM-based RL studies typically follow either supervised paradigms, which carry high annotation costs, or unsupervised paradigms that use voting or entropy-based rewards. However, their performance remains far from satisfactory, owing to the substantial annotation cost in the supervised case and to issues such as model collapse or reward hacking in the unsupervised case. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL.

Introduction

In recent years, large language models (LLMs) have gained significant attention for their capabilities in various natural language processing tasks. However, training these models relies on extensive annotated data, which is costly to collect. Traditional reinforcement learning (RL) methods have struggled with these annotation costs and with performance limitations. EasyRL aims to bridge this gap by combining a small amount of easy labeled data with self-training on unlabeled data.

Methodology

The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. The methodology can be broken down into the following key components:

Warm-Up Model Initialization: We initialize a warm-up model using supervised RL with a few-shot labeled dataset, allowing the model to grasp fundamental concepts quickly.
Divide-and-Conquer Pseudo-Labeling: This strategy focuses on difficult unlabeled data. It combines consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases to optimize the learning process.
Difficulty-Progressive Self-Training: The model undergoes iterative pseudo-labeling and reinforcement learning, further enhancing its reasoning capabilities.
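
The routing idea behind the pseudo-labeling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes uncertainty is measured by how often sampled answers agree, the two thresholds are invented for the example, and `reflect` is a hypothetical stand-in for a model call that re-examines conflicting answers.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical thresholds; the source does not say how uncertainty is measured.
LOW_UNCERTAINTY = 0.8     # agreement at or above this: accept the majority answer
MEDIUM_UNCERTAINTY = 0.5  # agreement at or above this: resolve by reflection


def agreement(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that gave it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def pseudo_label(
    answers: list[str],
    reflect: Callable[[list[str]], Optional[str]],
) -> Optional[str]:
    """Route one unlabeled question by uncertainty; None means defer it."""
    top, score = agreement(answers)
    if score >= LOW_UNCERTAINTY:
        return top               # consistency-based selection
    if score >= MEDIUM_UNCERTAINTY:
        return reflect(answers)  # reflection-based resolution
    return None                  # too uncertain for this round


def self_train_round(
    samples: dict[str, list[str]],
    reflect: Callable[[list[str]], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Split questions into newly pseudo-labeled ones and ones deferred to later rounds."""
    labeled: dict[str, str] = {}
    deferred: list[str] = []
    for question, answers in samples.items():
        label = pseudo_label(answers, reflect)
        if label is None:
            deferred.append(question)
        else:
            labeled[question] = label
    return labeled, deferred


# Toy data: five sampled answers per question.
samples = {
    "q1": ["4", "4", "4", "4", "5"],  # high agreement
    "q2": ["7", "7", "7", "9", "8"],  # moderate agreement
    "q3": ["1", "2", "3", "4", "5"],  # no agreement
}


def toy_reflect(answers: list[str]) -> Optional[str]:
    """Stand-in for a model reflection call: here it just returns the majority answer."""
    return Counter(answers).most_common(1)[0][0]


labeled, deferred = self_train_round(samples, toy_reflect)
```

In a difficulty-progressive loop, the pseudo-labeled questions would feed the next RL update and the deferred ones would be re-sampled with the improved model; that training step is omitted here because the source does not describe it in detail.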

Results

In experiments on mathematical and scientific benchmarks, EasyRL uses only 10% of easy labeled data yet consistently outperforms state-of-the-art baselines across tasks. This demonstrates the effectiveness of the proposed framework in achieving data-efficient post-training for LLMs.

Conclusion

EasyRL presents a unified self-evolving framework that not only addresses the challenges associated with high annotation costs but also enhances the reasoning capabilities of LLMs. By simulating cognitive learning processes and employing a strategic approach to data handling, EasyRL paves the way for more efficient and effective training of large language models. Future research may explore further refinements to this methodology and its applications across different domains.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
