Comparative Reversal Learning Reveals Rigid Adaptation in LLMs under Non-Stationary Uncertainty
Summary: arXiv:2604.04182v1
Abstract: Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout.
Introduction
This paper examines how large language models (LLMs) adapt in non-stationary settings, using a probabilistic reversal-learning task. The task measures how effectively a model revises its learned action values when reward contingencies change.
Methodology
The study compares three LLMs (DeepSeek-V3.2, Gemini-3, and GPT-5.2), using human data as a behavioral benchmark. Each model was tested under two schedules:
- Deterministic Fixed Transition Cycle: Latent states follow one another in a fixed, predictable order.
- Stochastic Random Schedule: Transitions between latent states are random, which increases volatility and changes the learning dynamics.
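To make the task structure concrete, here is a minimal sketch of such an environment. It is not the authors' implementation: the class name `ReversalTask`, the reward probabilities, the criterion of consecutive best-option choices, and the timeout length are all illustrative assumptions.

```python
import random

# Assumed reward probabilities for (option 0, option 1) in each latent state.
# These numbers are placeholders, not values reported in the paper.
STATE_PROBS = [(0.8, 0.2), (0.2, 0.8), (0.7, 0.3)]


class ReversalTask:
    """Two-option probabilistic task with three latent states (illustrative)."""

    def __init__(self, schedule="fixed", criterion=8, timeout=30, seed=0):
        self.schedule = schedule    # "fixed" cycle or "random" transitions
        self.criterion = criterion  # assumed: consecutive best-option choices that trigger a switch
        self.timeout = timeout      # assumed: trials in a state before a forced switch
        self.rng = random.Random(seed)
        self.state = 0
        self.streak = 0
        self.trials_in_state = 0

    def _next_state(self):
        if self.schedule == "fixed":
            # Deterministic cycle: 0 -> 1 -> 2 -> 0 -> ...
            return (self.state + 1) % len(STATE_PROBS)
        # Random schedule: jump to any other state with equal probability.
        others = [s for s in range(len(STATE_PROBS)) if s != self.state]
        return self.rng.choice(others)

    def step(self, action):
        """Return (reward, switched) for a choice of option 0 or 1."""
        probs = STATE_PROBS[self.state]
        reward = 1 if self.rng.random() < probs[action] else 0
        best = 0 if probs[0] >= probs[1] else 1
        self.streak = self.streak + 1 if action == best else 0
        self.trials_in_state += 1
        # A switch fires on either the performance criterion or the timeout.
        switched = (self.streak >= self.criterion
                    or self.trials_in_state >= self.timeout)
        if switched:
            self.state = self._next_state()
            self.streak = 0
            self.trials_in_state = 0
        return reward, switched
```

An agent (an LLM prompted with its trial history, or a simple baseline) would call `step` once per trial and never observe `state` directly; only the reward feedback reveals that a switch has occurred.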
Key Findings
The main results are:
- Across all models, win-stay rates were near ceiling while lose-shift rates were noticeably lower, indicating an asymmetric reliance on positive versus negative outcomes.
- DeepSeek-V3.2 exhibited extreme perseveration following reversals, along with weak acquisition.
- Both Gemini-3 and GPT-5.2 adapted more quickly than DeepSeek-V3.2 but were still less sensitive to losses than human participants.
- Increased randomness in transitions amplified the models’ tendency for reversal-specific persistence, suggesting that high total payoffs can coexist with rigid adaptation behaviors.
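Win-stay and lose-shift are standard trial-by-trial statistics: the fraction of rewarded trials followed by a repeat of the same choice, and the fraction of unrewarded trials followed by a change of choice. The sketch below shows one common way to compute them; the function name and the 0/1 encoding of choices and rewards are assumptions for illustration, not details taken from the paper.

```python
def win_stay_lose_shift(choices, rewards):
    """Return (win_stay_rate, lose_shift_rate) from aligned 0/1 sequences."""
    win_stay = wins = lose_shift = losses = 0
    for t in range(1, len(choices)):
        stayed = choices[t] == choices[t - 1]
        if rewards[t - 1]:
            # Previous trial was rewarded: did the agent repeat its choice?
            wins += 1
            win_stay += stayed
        else:
            # Previous trial was unrewarded: did the agent change its choice?
            losses += 1
            lose_shift += not stayed
    ws = win_stay / wins if wins else float("nan")
    ls = lose_shift / losses if losses else float("nan")
    return ws, ls


# Toy sequence: the agent always repeats after a win but shifts after
# only one of two losses.
choices = [0, 0, 0, 0, 1, 1]
rewards = [1, 0, 1, 0, 1, 0]
print(win_stay_lose_shift(choices, rewards))  # (1.0, 0.5)
```

A profile like the toy output, with win-stay at 1.0 and lose-shift well below it, is the kind of asymmetry the first finding describes.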
Discussion
The findings indicate that rigidity in LLMs can stem from distinct mechanisms, including weak loss learning, inflated policy determinism, and value polarization due to counterfactual suppression. These results highlight the need for reversal-sensitive diagnostics and volatility-aware models when evaluating LLMs in non-stationary environments.
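The three mechanisms can be illustrated with a generic reinforcement-learning model. The sketch below is not the model fitted in the paper; it is a textbook-style two-option learner in which the parameter names `alpha_gain`, `alpha_loss`, `beta`, and `kappa` are assumptions. Weak loss learning corresponds to a small `alpha_loss`, inflated policy determinism to a large `beta`, and `kappa` is one possible reading of counterfactual suppression, in which a reward for the chosen option also shrinks the value of the unchosen one.

```python
import math


def softmax_choice_prob(q, beta):
    """Probability of choosing option 0 under a two-option softmax."""
    return 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))


def update(q, action, reward, alpha_gain, alpha_loss, kappa=0.0):
    """Return updated values after one trial (illustrative sketch).

    alpha_gain / alpha_loss: learning rates for positive / negative
        prediction errors.
    kappa: assumed suppression rate applied to the unchosen option
        after a rewarded trial.
    """
    q = list(q)
    delta = reward - q[action]
    alpha = alpha_gain if delta >= 0 else alpha_loss
    q[action] += alpha * delta
    if reward:
        other = 1 - action
        q[other] -= kappa * q[other]  # push the unchosen value toward 0
    return q


# After a loss, a learner with a small alpha_loss barely lowers its estimate.
q = update([0.9, 0.1], action=0, reward=0, alpha_gain=0.5, alpha_loss=0.05)
print(q)
print(softmax_choice_prob(q, beta=10.0))
```

In this toy setting, a learner with a low `alpha_loss` and a high `beta` keeps choosing the previously better option for many unrewarded trials after a reversal, which is one way the perseveration described above could arise.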
Conclusion
This comparative reversal learning framework sheds light on the limitations of current LLMs in adapting to changing contingencies. Understanding these constraints is vital for advancing AI technologies that can operate more flexibly and effectively in dynamic settings.
