ShapE-GRPO: Improved Reward Allocation for Multi-Candidate LLMs

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Source: arXiv:2603.29871v1 (announce type: new)

Abstract

In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration.
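
The free-riding effect described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the per-candidate scores, the choice of "best candidate's score" as the set-level utility, and the function names are hypothetical and not taken from the paper. The within-group standardisation follows the usual GRPO-style recipe of subtracting the group mean and dividing by the group standard deviation.

```python
from statistics import mean, pstdev

def set_utility(scores):
    """Hypothetical set-level utility: the score of the best candidate."""
    return max(scores, default=0.0)

def shared_candidate_advantages(group, eps=1e-8):
    """Baseline credit assignment: one scalar per set, copied to every candidate.

    `group` is a list of candidate sets sampled for the same prompt; each set
    is a list of (assumed) per-candidate quality scores.
    """
    set_rewards = [set_utility(scores) for scores in group]
    mu, sigma = mean(set_rewards), pstdev(set_rewards)
    # Group-relative advantage of each set, standardised within the group.
    set_advantages = [(r - mu) / (sigma + eps) for r in set_rewards]
    # Every candidate inherits the advantage of the set it belongs to.
    return [[a] * len(scores) for a, scores in zip(set_advantages, group)]

group = [
    [0.9, 0.1, 0.1],  # one strong candidate, two weak ones
    [0.4, 0.4, 0.3],  # uniformly mediocre set
]
adv = shared_candidate_advantages(group)
```

In this toy group, the two weak candidates scored 0.1 receive the same positive advantage as the 0.9 candidate beside them, and a larger advantage than any candidate in the second set, even though each of those candidates is individually better.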

Introduction to ShapE-GRPO

To address this limitation, we propose Shapley-Enhanced GRPO (ShapE-GRPO). The method draws on cooperative game theory and uses the Shapley value to allocate the set-level reward among the individual candidates in a set.
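
For reference, the standard Shapley value from cooperative game theory is given below. The notation is an assumption introduced here rather than the paper's: $N = \{1, \dots, n\}$ indexes the candidates in a set, and $v(S)$ is the set-level utility of a subset $S \subseteq N$.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr),
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset).
```

The first expression averages candidate $i$'s marginal contribution over all orders in which the set could be assembled. The second is the efficiency property: the per-candidate shares add up exactly to the set-level utility (when $v(\emptyset) = 0$), so the decomposition redistributes the original reward without creating or discarding any of it.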

Key Features of ShapE-GRPO

Granular Reward Signals: ShapE-GRPO decomposes the set-level reward into candidate-specific signals, so each candidate is credited for its own contribution rather than for the result of the whole set.
Permutation-Invariant Utility: The method exploits the permutation invariance of set-level utility, meaning the order in which candidates appear does not change the value of the set.
Computational Efficiency: The formulation keeps computation polynomial-time, so it remains practical as the number of candidates per set grows.
Empirical Results: In experiments, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets and converges faster during training.
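
The features above can be made concrete with a minimal sketch. It assumes a hypothetical set utility, the best candidate's score, which is not necessarily the utility used in the paper. Under that assumption the Shapley value has a well-known polynomial-time closed form (the same one used for the classic "airport game"), and the sketch checks it against brute-force enumeration over all orderings. All function names are illustrative.

```python
from itertools import permutations
from math import factorial, isclose

def set_utility(scores):
    """Hypothetical permutation-invariant utility: best score in the set."""
    return max(scores, default=0.0)

def shapley_exact(scores, utility=set_utility):
    """Brute-force Shapley values: average marginal gain over all n! orders."""
    n = len(scores)
    phi = [0.0] * n
    for order in permutations(range(n)):
        chosen = []
        prev = utility(chosen)
        for i in order:
            chosen.append(scores[i])
            cur = utility(chosen)
            phi[i] += cur - prev  # marginal contribution of candidate i
            prev = cur
    return [p / factorial(n) for p in phi]

def shapley_max_closed_form(scores):
    """O(n log n) Shapley values, valid only for the max utility above.

    Sort ascending; each increment between consecutive scores is shared
    equally by the candidates that score at least that much.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    phi = [0.0] * n
    running, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        running += (scores[i] - prev) / (n - rank)
        prev = scores[i]
        phi[i] = running
    return phi

scores = [0.9, 0.1, 0.1]
exact = shapley_exact(scores)
fast = shapley_max_closed_form(scores)
```

For these scores both functions assign roughly 0.833 to the strong candidate and 0.033 to each weak one, and the three shares sum to the set utility of 0.9. The brute-force version works for any utility function but costs n! evaluations, which is why a polynomial-time formulation matters in practice.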

Implications for Multi-Candidate Training

ShapE-GRPO changes the signal each candidate learns from during multi-candidate LLM training. Because a candidate's reward reflects its own contribution to the set, weak candidates no longer free-ride on a strong peer. This reduces noise in the training signal, mitigates the suboptimal exploration caused by a shared set-level reward, and improves the overall utility of the generated recommendations.

Conclusion

ShapE-GRPO offers an alternative to assigning one shared reward to every candidate in multi-candidate scenarios such as recommendation, brainstorming, and code suggestion. By attributing the set-level reward to individual candidates through the Shapley value, it yields cleaner training signals and faster convergence than standard GRPO, which supports more effective training of Large Language Models that produce sets of candidates.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
