AGPO: Boosting AI Reasoning & Search Ads at JD


AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Researchers have introduced Asymmetric Group Policy Optimization (AGPO), a reinforcement learning framework intended to improve the reasoning capabilities of large language models (LLMs) while addressing a known limitation of Reinforcement Learning with Verifiable Rewards (RLVR).

Recent studies have found that although RLVR methods improve sampling efficiency toward correct reasoning paths, they also narrow the reasoning capabilities of the trained models. A model trained this way becomes less able to explore and discover new reasoning patterns, so its reasoning boundary shrinks relative to its base model. In fact, base models tend to achieve higher reasoning coverage when evaluated at larger sample sizes, that is, when many responses are drawn per problem.

The AGPO Approach

AGPO seeks to mitigate this shrinkage of the reasoning boundary. It treats incorrect and correct responses asymmetrically: incorrect reasoning paths are suppressed, while correct ones are reinforced selectively. Its two key components are:

  • Negative-Dominant Reinforcement: This strategy suppresses incorrect reasoning paths and is intended to preserve the exploration capacity of the base model, so that the model does not converge too quickly on a narrow set of reasoning patterns.
  • Group Advantage Mechanism: For positive reinforcement, AGPO scales positive updates according to intra-group variance, where a group is the set of responses sampled for the same prompt. This lets the model concentrate on rare correct paths while minimizing updates from trivial or redundant ones.

These strategies work in tandem to enhance the overall reasoning performance of LLMs, enabling them to explore a broader array of reasoning patterns and achieve better accuracy.
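
The article does not give AGPO's actual formulas, so the following is only a minimal sketch of how an asymmetric, group-based advantage could look. It assumes binary verifier rewards (1 for correct, 0 for incorrect), a fixed penalty for every incorrect response, and a GRPO-style normalization (reward minus group mean, divided by group standard deviation) for correct ones. The function name and the `pos_scale`, `neg_scale`, and `eps` parameters are hypothetical and chosen for illustration; they are not taken from the AGPO paper.

```python
from statistics import mean, pstdev


def asymmetric_group_advantages(rewards, pos_scale=0.5, neg_scale=1.0, eps=1e-6):
    """Illustrative per-response advantages for one prompt's sampled group.

    rewards: list of 0/1 verifier outcomes, one per sampled response.
    All parameter names and default values are assumptions for this sketch.
    """
    if not rewards:
        return []
    mu = mean(rewards)       # fraction of correct responses in the group
    sigma = pstdev(rewards)  # intra-group standard deviation
    advantages = []
    for r in rewards:
        if r > 0:
            # Correct response: normalized by the group statistics, so a rare
            # correct path gets a large update and an all-correct group gets 0.
            advantages.append(pos_scale * (r - mu) / (sigma + eps))
        else:
            # Incorrect response: constant penalty, weighted more heavily than
            # positive updates by default (neg_scale > pos_scale).
            advantages.append(-neg_scale)
    return advantages


if __name__ == "__main__":
    print(asymmetric_group_advantages([1, 0, 0, 0]))  # one rare correct path
    print(asymmetric_group_advantages([1, 1, 1, 0]))  # mostly correct group
    print(asymmetric_group_advantages([1, 1, 1, 1]))  # trivial prompt
```

Under these assumptions, a correct response in a mostly incorrect group receives a larger positive advantage than one in a mostly correct group, and a prompt that every sample solves contributes no positive update at all.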

Experimental Validation

To validate AGPO, the researchers ran experiments on five mathematical benchmarks. AGPO achieved state-of-the-art accuracy and consistently improved pass@k as the number of samples k increased. This suggests that AGPO improves accuracy without the loss of reasoning coverage observed with earlier RLVR methods.
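
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled responses to a problem is correct. The article does not say how it was computed here; the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n responses are sampled per problem and c of them are correct. The function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: total responses sampled, c: number judged correct, k: sample budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct response;
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 2 correct responses out of 16 samples.
    for k in (1, 4, 16):
        print(k, pass_at_k(16, 2, k))
```

A benchmark score is then the mean of this value over all problems. Because the estimate can only grow with k, comparing a trained model with its base model at large k shows whether training has removed solutions that the base model could still reach.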

Industrial Applications

AGPO has also been applied to search ads relevance optimization at JD, a major e-commerce company. In this large-scale industrial setting, AGPO improved the quality of data annotation, which in turn produced significant performance gains in the downstream student models that deliver relevant search results to users.

In summary, Asymmetric Group Policy Optimization targets a specific weakness of current RLVR techniques: the narrowing of a model's reasoning boundary during training. By suppressing incorrect paths while selectively reinforcing rare correct ones, AGPO improves both accuracy and pass@k on mathematical benchmarks, and it has shown practical value in search ads relevance at JD.


Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.
