Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
Summary: arXiv:2604.09855v1
Recent advances in Large Language Models (LLMs) have established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate whether Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate.
Research Overview
This research explores the strategic behaviors that emerge during the learning process of LLMs when trained to negotiate effectively. The primary focus is on developing a framework that enables a mid-sized buyer agent to negotiate against a regulated LLM seller across a wide distribution of real-world products.
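As a rough illustration of this setting, the sketch below plays one bilateral price negotiation between a scripted buyer and a scripted seller, each holding a private limit the other cannot see. Everything in it is a hypothetical stand-in: the names `run_episode`, `buyer_policy`, and `seller_policy`, the linear concession rules, and the turn limit are assumptions for illustration only. In the framework described here both sides are LLMs, and the details of how the seller is regulated are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    deal_price: Optional[float]  # None means no deal was reached
    turns: int                   # number of offer rounds played


def buyer_policy(turn: int, budget: float, opening: float) -> float:
    """Toy buyer: start at `opening`, then close 10% of the gap to the budget per turn."""
    offer = opening + 0.10 * turn * (budget - opening)
    return min(offer, budget)  # never bid above the private budget


def seller_policy(turn: int, list_price: float, floor: float) -> float:
    """Toy seller: start at the list price, then close 10% of the gap to the floor per turn."""
    ask = list_price - 0.10 * turn * (list_price - floor)
    return max(ask, floor)  # never ask below the private floor


def run_episode(budget: float, opening: float, list_price: float,
                floor: float, max_turns: int = 10) -> Outcome:
    """Alternate bids and asks until they cross or the turn limit is hit."""
    for turn in range(max_turns):
        bid = buyer_policy(turn, budget, opening)
        ask = seller_policy(turn, list_price, floor)
        if bid >= ask:
            # Offers crossed: settle at the seller's current ask.
            return Outcome(deal_price=ask, turns=turn + 1)
    return Outcome(deal_price=None, turns=max_turns)


if __name__ == "__main__":
    print(run_episode(budget=100, opening=50, list_price=120, floor=80))
    print(run_episode(budget=60, opening=50, list_price=120, floor=80))
```

The second call shows the incomplete-information problem in miniature: when the buyer's budget lies below the seller's floor, no sequence of concessions can produce a deal, and neither side knows this in advance.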
Methodology
Our approach incorporates the following key components:
- Reinforcement Learning from Verifiable Rewards (RLVR): The buyer agent is trained to maximize economic surplus while adhering to strict private budget constraints. Because the outcome of a negotiation (the deal price relative to the budget) can be checked automatically, the reward is verifiable.
- Framework Design: We designed a framework to facilitate interactions between a buyer agent and a regulated seller, simulating real-world negotiation scenarios.
- Strategic Evolution Analysis: We track how the buyer agent's strategy changes over the course of training, which reveals four distinct phases.
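The surplus objective in the list above lends itself to a reward that is computed directly from the negotiation outcome. A minimal sketch follows, assuming the reward is the buyer's surplus normalized by the budget, zero when no deal is reached, and a fixed penalty when the agreed price exceeds the budget. The function name `buyer_reward`, the normalization, and the penalty value are illustrative assumptions, not the exact reward used in the paper.

```python
from typing import Optional


def buyer_reward(deal_price: Optional[float], budget: float,
                 over_budget_penalty: float = -1.0) -> float:
    """Score one finished negotiation from the buyer's side.

    deal_price: agreed price, or None if the negotiation ended without a deal.
    budget: the buyer's private spending limit (must be positive).
    over_budget_penalty: assumed fixed reward for violating the budget.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if deal_price is None:
        return 0.0  # no deal: no surplus, no penalty
    if deal_price > budget:
        return over_budget_penalty  # hard constraint violated
    # Fraction of the budget saved; larger is better for the buyer.
    return (budget - deal_price) / budget


if __name__ == "__main__":
    print(buyer_reward(80.0, 100.0))   # deal within budget
    print(buyer_reward(None, 100.0))   # no deal
    print(buyer_reward(120.0, 100.0))  # budget violated
```

One consequence of a reward shaped like this is that walking away scores zero while overpaying scores negative, so an agent has no incentive to accept a deal above its budget just to close.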
Phases of Strategic Evolution
During training of the buyer agent, we observed a novel four-phase strategic evolution:
- Naive Bargaining: The agent begins with basic negotiation skills, often relying on simple price adjustments.
- Aggressive Starting Prices: The agent learns to open with aggressive offers (for a buyer, well below the asking price) to create room for negotiation.
- Deadlock Phase: The agent encounters situations where negotiation stalls, prompting further learning and adaptation.
- Sophisticated Persuasion: Ultimately, the agent develops advanced persuasive techniques, enabling it to negotiate effectively under various circumstances.
Results and Implications
Our results demonstrate that RLVR training allows a 30B-parameter agent to significantly outperform frontier models over ten times its size in extracting economic surplus. This shows that verifiable rewards are an effective training signal for negotiation.
Moreover, the trained agent generalizes well: it maintains high performance against stronger counterparties that were not seen during training, and it remains effective even against hostile or adversarial seller personas. This points to potential applications in real-world negotiation scenarios.
Conclusion
These findings suggest that Reinforcement Learning from Verifiable Rewards is an effective way to teach LLMs to negotiate. A mid-sized model trained this way can outperform much larger frontier models in bilateral price negotiation and holds up against unseen and adversarial sellers, which makes RLVR a promising route toward LLM agents that operate autonomously in complex negotiation environments.
