AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
Researchers have introduced Asymmetric Group Policy Optimization (AGPO), a framework that aims to improve the reasoning capabilities of large language models (LLMs) while addressing a limitation observed in Reinforcement Learning with Verifiable Rewards (RLVR).
Recent studies have found that RLVR methods improve sampling efficiency towards correct reasoning paths, but often narrow the reasoning capabilities of the trained model. The trained model becomes less able to explore and identify new reasoning patterns, so its scope ends up narrower than that of its base model. Base models tend to achieve higher reasoning coverage when evaluated at larger sample sizes, that is, when many samples are drawn per problem.
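The coverage comparison above is usually measured with pass@$k$, the probability that at least one of $k$ sampled answers is correct. The sketch below uses the standard unbiased estimator computed from $n$ samples with $c$ correct ones. The per-problem counts are hypothetical, chosen only to illustrate how a model that wins at $k = 1$ can lose at large $k$; they are not results from the AGPO work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(correct_counts, n: int, k: int) -> float:
    """Mean pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct answers out of n = 100 samples per problem.
base_counts = [5, 3, 2, 1]     # rarely right, but right on every problem
tuned_counts = [60, 40, 0, 0]  # often right, but only on half the problems

for k in (1, 8, 64):
    base = coverage(base_counts, 100, k)
    tuned = coverage(tuned_counts, 100, k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the tuned model leads at $k = 1$, while the base model leads at $k = 64$ because it still solves every problem occasionally.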
The AGPO Approach
The AGPO framework targets this shrinkage of the reasoning boundary. It treats incorrect and correct reasoning paths asymmetrically, suppressing the former and selectively reinforcing the latter. Its two key components are:
- Negative-Dominant Reinforcement: This strategy suppresses incorrect reasoning paths, which helps preserve the exploration capacity of the base model and keeps the model from converging too quickly on suboptimal reasoning patterns.
- Group Advantage Mechanism: For positive reinforcement, AGPO uses a group advantage mechanism that scales positive updates based on intra-group variance. This lets the model concentrate on rare correct paths while minimizing updates from trivial or redundant ones.
Together, these strategies are intended to let LLMs explore a broader range of reasoning patterns while also improving accuracy.
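The two components can be illustrated with a minimal sketch. This is not the AGPO formulation, which this article does not spell out. It assumes binary verifiable rewards, a constant penalty for incorrect samples, and a GRPO-style standardized advantage standing in for the group advantage mechanism. The function name and the `neg_weight` and `pos_weight` parameters are hypothetical.

```python
from statistics import mean, pstdev


def asymmetric_advantages(rewards, neg_weight=1.0, pos_weight=0.5, eps=1e-6):
    """Per-sample advantages for one group of responses to the same prompt.

    Assumptions (illustrative only, not the published method):
    - rewards are 1.0 for a verified-correct response and 0.0 otherwise;
    - incorrect responses get a constant negative advantage (-neg_weight);
    - correct responses get a standardized advantage, so a correct answer
      in a mostly-wrong group is weighted more than one in a mostly-right group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # intra-group standard deviation
    advantages = []
    for r in rewards:
        if r > 0:
            advantages.append(pos_weight * (r - mu) / (sigma + eps))
        else:
            advantages.append(-neg_weight)
    return advantages


# One correct answer among eight samples: a rare correct path.
rare = asymmetric_advantages([1.0] + [0.0] * 7)
# Seven correct answers among eight samples: a largely solved prompt.
common = asymmetric_advantages([1.0] * 7 + [0.0])

print("rare correct advantage:  ", round(rare[0], 3))
print("common correct advantage:", round(common[0], 3))
print("incorrect advantage:     ", rare[1])
```

Under these assumptions a correct answer in a mostly-wrong group receives a much larger positive advantage than one in a mostly-right group, a fully solved group contributes no positive update, and incorrect samples are always penalized.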
Experimental Validation
To validate AGPO, the researchers ran experiments on five mathematical benchmarks. AGPO achieved state-of-the-art accuracy and consistently improved pass@$k$ performance as $k$ increased. This suggests that the method improves accuracy without giving up the coverage that RLVR training tends to lose.
Industrial Applications
AGPO has also been applied to search ads relevance optimization at JD, a major e-commerce company. In this large-scale industrial setting, AGPO improved the quality of data annotation, which in turn led to significant performance gains in the downstream student models that deliver relevant search results to users.
In summary, Asymmetric Group Policy Optimization addresses a known limitation of current RLVR techniques, the narrowing of a model's reasoning boundary, by suppressing incorrect reasoning paths and selectively reinforcing rare correct ones. The reported results cover both mathematical reasoning benchmarks and an industrial search ads relevance application.
