ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
arXiv:2603.29871v1
Abstract
In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration.
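The free-riding effect described above can be illustrated with a small sketch. The utility function here (the set is worth as much as its best candidate) and the per-candidate scores are assumptions chosen for illustration; they do not come from the paper.

```python
def set_utility(scores):
    """Hypothetical set-level utility: the set is as good as its best candidate."""
    return max(scores) if scores else 0.0


def shared_rewards(scores):
    """Give every candidate the same set-level scalar reward."""
    reward = set_utility(scores)
    return [reward] * len(scores)


# One strong candidate and two weak ones (illustrative scores).
candidate_scores = [0.9, 0.1, 0.2]
print(shared_rewards(candidate_scores))
```

Under this assumed utility, all three candidates receive the reward earned by the single strong one, so the training signal cannot tell them apart.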
Introduction to ShapE-GRPO
To address these limitations, we propose Shapley-Enhanced GRPO (ShapE-GRPO). The method draws on cooperative game theory, using the Shapley value to allocate the set-level reward among the individual candidates in a set.
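For reference, the standard Shapley value from cooperative game theory is given below. The notation is an assumption made here, not taken from the paper: $N$ is the set of $n$ candidates, $v(S)$ is the set-level utility of a subset $S$, and $\phi_i$ is the credit assigned to candidate $i$.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,(n - |S| - 1)!}{n!}
    \bigl( v(S \cup \{i\}) - v(S) \bigr),
\qquad
\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset).
```

The second identity is the efficiency property: the per-candidate credits add up to the utility of the full set, so a decomposition of this kind redistributes the set-level reward rather than changing its total.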
Key Features of ShapE-GRPO
- Granular Reward Signals: ShapE-GRPO decomposes set-level rewards into candidate-specific signals, allowing for more nuanced feedback during training.
- Permutation-Invariant Utility: The method leverages the permutation-invariant nature of set-level utility, ensuring that the order of candidates does not affect the overall evaluation.
- Computational Efficiency: Our formulation runs in polynomial time, which keeps it practical to apply during training.
- Empirical Success: Experiments show that ShapE-GRPO consistently outperforms standard GRPO across diverse datasets, with faster convergence during training.
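As a rough illustration of decomposing a set-level reward into candidate-specific signals, the sketch below computes exact Shapley values by averaging marginal contributions over all orderings. It is not the paper's formulation: the brute-force enumeration is exponential in the number of candidates (the polynomial-time method is not reproduced here), and the `set_utility` function and scores are hypothetical.

```python
from itertools import permutations
from math import factorial


def set_utility(scores):
    """Hypothetical permutation-invariant utility: value of the best candidate."""
    return max(scores) if scores else 0.0


def shapley_values(scores, utility=set_utility):
    """Exact Shapley values via brute-force enumeration of all orderings.

    Only feasible for small sets; used here purely for illustration.
    """
    n = len(scores)
    totals = [0.0] * n
    for order in permutations(range(n)):
        chosen = []
        previous = utility(chosen)
        for i in order:
            chosen.append(scores[i])
            current = utility(chosen)
            # Marginal contribution of candidate i when added in this order.
            totals[i] += current - previous
            previous = current
    return [t / factorial(n) for t in totals]


candidate_scores = [0.9, 0.1, 0.2]
print(shapley_values(candidate_scores))
```

With these illustrative scores the strong candidate receives most of the credit (about 0.78 of the 0.9 total), the two weak candidates receive small shares, and the shares still sum to the set-level utility.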
Implications for Multi-Candidate Training
ShapE-GRPO targets the credit-assignment problem in multi-candidate LLM training. Because each candidate's reward reflects its own contribution to the set, weak candidates no longer free-ride on a strong peer, which mitigates suboptimal exploration and improves the overall utility of the generated recommendations.
Conclusion
ShapE-GRPO offers an alternative to assigning a single set-level reward to every candidate in multi-candidate scenarios. As demand grows for systems that provide high-quality recommendations, this approach supports more effective training of Large Language Models for applications such as recommendation, brainstorming, and code suggestion.
