Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization
Summary: arXiv:2603.24093v1 (cross-listed announcement)
Abstract: Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness.
Introduction to Dual Guidance Optimization
DGO first constructs an experience bank from previously explored trajectories, a repository the model can consult during training. The policy then explores under the joint guidance of the experience bank and the model’s internal knowledge. The aim is for the model to draw on past experience without depending on it alone, so that its own reasoning continues to shape which trajectories it explores.
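To make the idea of an experience bank concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's implementation: the names `Trajectory`, `ExperienceBank`, and `build_guided_prompt`, the word-overlap similarity, and the choice to pass retrieved experience as in-context examples are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # verifiable reward, e.g. 1.0 if the final answer checks out


@dataclass
class ExperienceBank:
    """Hypothetical in-memory store of past trajectories (an assumption, not the paper's design)."""
    items: list[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def retrieve(self, prompt: str, k: int = 2) -> list[Trajectory]:
        # Assumed similarity measure: Jaccard overlap of lowercase word sets.
        query = set(prompt.lower().split())

        def score(t: Trajectory) -> float:
            words = set(t.prompt.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        # Rank by similarity first, then by reward; keep the top k.
        ranked = sorted(self.items, key=lambda t: (score(t), t.reward), reverse=True)
        return ranked[:k]


def build_guided_prompt(bank: ExperienceBank, prompt: str, k: int = 2) -> str:
    """External guidance is supplied as retrieved examples (assumed mechanism);
    internal guidance is whatever the model itself contributes when completing the prompt."""
    examples = bank.retrieve(prompt, k)
    lines = [f"Q: {t.prompt}\nA: {t.response}" for t in examples]
    lines.append(f"Q: {prompt}\nA:")
    return "\n\n".join(lines)
```

For example, after adding a few solved arithmetic questions to the bank, `build_guided_prompt(bank, "what is 2 + 4")` would place the most similar past questions and their answers ahead of the new question. A real system would more likely use embedding-based retrieval; word overlap keeps the sketch dependency-free.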
How DGO Works
The DGO framework operates as a closed loop with four stages:
- Experience Bank Construction: DGO begins by creating an experience bank that stores valuable trajectories obtained from previous explorations. This bank acts as a reference point for the model.
- Joint Exploration: The model explores new trajectories based not only on its internal knowledge but also on information retrieved from the experience bank. Combining the two sources is intended to make exploration more effective than relying on either one alone.
- Refinement of Experience Bank: As the model encounters new trajectories, it refines the experience bank by incorporating successful strategies and discarding less effective ones. This ensures that the bank remains relevant and useful.
- Parameter Optimization: The trajectories produced during exploration are also used to optimize the model parameters. This closes the loop: the updated policy generates the next round of trajectories.
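The four stages above can be sketched as a toy loop. This is a deliberately simplified analogy under stated assumptions, not the DGO algorithm: the "policy" is a table of preference weights rather than an LLM, the "bank" keeps one verified answer per prompt, the update is a plain reward-weighted increment rather than a policy-gradient step, and the names `verify` and `dgo_toy_loop` are hypothetical.

```python
import random


def verify(prompt: str, answer: int) -> float:
    """Verifiable reward: 1.0 if `answer` equals the sum encoded in a prompt like "1+2"."""
    a, b = (int(x) for x in prompt.split("+"))
    return 1.0 if answer == a + b else 0.0


def dgo_toy_loop(prompts, candidates, steps=30, samples=4, seed=0):
    rng = random.Random(seed)
    # Internal knowledge (assumed form): per-prompt preference weights over candidate answers.
    weights = {p: {c: 1.0 for c in candidates} for p in prompts}
    # External experience (assumed form): the latest verified answer seen for each prompt.
    bank = {}

    for _ in range(steps):
        for p in prompts:
            w = weights[p]
            # Joint exploration: sample from the policy and also replay the bank entry, if any.
            explored = rng.choices(list(w), weights=list(w.values()), k=samples)
            if p in bank:
                explored.append(bank[p])
            scored = [(ans, verify(p, ans)) for ans in explored]

            # Bank refinement: store answers that pass verification.
            for ans, reward in scored:
                if reward > 0:
                    bank[p] = ans

            # Parameter optimization: reward-weighted increment of the sampled answers.
            for ans, reward in scored:
                w[ans] += reward

    return weights, bank
```

Calling `dgo_toy_loop(["1+2", "2+2"], range(6))` returns the learned weights and the bank. Once a correct answer enters the bank it is replayed on every later step, so its weight keeps growing; this mirrors, very loosely, the idea of external experience being internalized into the policy.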
Experimental Validation
According to the authors, experiments show that DGO consistently outperforms baseline methods. They attribute the gains in the reasoning capabilities of large language models to better utilization and internalization of experience.
Conclusion
In conclusion, Dual Guidance Optimization combines external experience (the experience bank) and internal experience (the model’s own knowledge) in a single RLVR training loop. By drawing on both, DGO aims to improve training effectiveness and to bring RL-based training closer to the way human learners use and internalize experience. As research in this area continues, DGO offers a promising direction for developing more capable reasoning models.
