Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
Source: arXiv:2604.05134v1 (cross-listed)
Abstract: How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model — from supervised fine-tuning (SFT) to reinforcement learning (RL) — by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance — however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics — metrics spanning evaluation performance, hallucination rates, and reasoning quality — to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
Introduction
Language models have improved markedly at natural language reasoning, yet they still struggle natively with chess, a game that demands strategy, foresight, and precise calculation. This article summarizes how reasoning in a language model evolves across two training stages, supervised fine-tuning and reinforcement learning, and how the choice of fine-tuning data shapes the outcome.
Understanding the Techniques
- Supervised Fine-Tuning (SFT): In this initial stage, the model is trained on chess datasets built around different targets, such as directly predicting the best move or following multi-move trajectories. The goal is to give the model a foundation that the next stage can build on.
- Reinforcement Learning (RL): Starting from an SFT checkpoint, the model is trained further with reinforcement learning, learning from feedback on the moves it chooses. This stage aims to refine the model’s decision-making.
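The two SFT target styles contrasted in this article, direct best-move prediction and multi-move trajectories, can be pictured as prompt/completion pairs. The sketch below is illustrative only: the `SFTExample` record, the prompt wording, the FEN position encoding, the UCI move notation, and both helper functions are assumptions made for this example, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SFTExample:
    """Hypothetical prompt/completion record for supervised fine-tuning."""
    prompt: str
    completion: str


def best_move_example(fen: str, best_move: str) -> SFTExample:
    """Direct best-move prediction: the target is a single move."""
    return SFTExample(
        prompt=f"Position (FEN): {fen}\nBest move:",
        completion=f" {best_move}",
    )


def trajectory_example(fen: str, line: List[str]) -> SFTExample:
    """Multi-move trajectory: the target is a sequence of moves."""
    return SFTExample(
        prompt=f"Position (FEN): {fen}\nLine:",
        completion=" " + " ".join(line),
    )


# Standard chess starting position in FEN.
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

single = best_move_example(START_FEN, "e2e4")
multi = trajectory_example(START_FEN, ["e2e4", "e7e5", "g1f3"])
```

Both examples share the same position; only the supervision target differs, which is the variable the findings below compare.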
Findings and Implications
Our investigation reveals several key findings regarding the evolution of reasoning in language models:
- Fine-tuning the model to directly predict the best move leads to effective reinforcement learning and the strongest downstream performance.
- However, in this setup the RL step elicits unfaithful reasoning, where the model’s stated rationale is inconsistent with the move it selects.
- Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable reinforcement learning.
- Reinforcement learning induces a substantial positive shift in the distribution of move quality and, as a side effect, reduces hallucination rates, that is, how often the model produces incorrect or nonsensical outputs.
- Several metrics measured at the SFT checkpoint, spanning evaluation performance, hallucination rates, and reasoning quality, are predictive of the model’s performance after reinforcement learning.
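Two of the quantities in the findings above, hallucination rate and reasoning faithfulness, can be made concrete as simple counting metrics. The following is a minimal sketch under stated assumptions: a "hallucination" is operationalized here as choosing a move that is not in the list of legal moves, and reasoning counts as "faithful" when the chosen move is mentioned in the reasoning text. The paper's actual definitions may differ, and the `Sample` record and the toy data are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches UCI-style moves such as "e2e4" or "e7e8q" (assumed notation).
UCI_MOVE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


@dataclass
class Sample:
    """Hypothetical record of one model output on one position."""
    reasoning: str
    chosen_move: str
    legal_moves: List[str]


def hallucination_rate(samples: List[Sample]) -> float:
    """Fraction of samples whose chosen move is not legal (assumed definition)."""
    if not samples:
        return 0.0
    illegal = sum(1 for s in samples if s.chosen_move not in s.legal_moves)
    return illegal / len(samples)


def faithfulness_rate(samples: List[Sample]) -> float:
    """Fraction of samples whose reasoning mentions the chosen move (assumed proxy)."""
    if not samples:
        return 0.0
    faithful = sum(
        1 for s in samples if s.chosen_move in UCI_MOVE.findall(s.reasoning)
    )
    return faithful / len(samples)


# Toy data, invented purely to exercise the functions.
legal = ["e2e4", "d2d4", "g1f3"]
samples = [
    Sample("Controlling the center with e2e4 looks best.", "e2e4", legal),
    Sample("The knight move g1f3 develops a piece.", "d2d4", legal),  # unfaithful
    Sample("Castling with e1g1 is safest.", "e1g1", legal),           # illegal move
    Sample("d2d4 claims space.", "d2d4", legal),
]

h_rate = hallucination_rate(samples)  # 1 of 4 moves is illegal
f_rate = faithfulness_rate(samples)   # 3 of 4 rationales mention the chosen move
```

A string-match proxy like this only catches the crudest mismatches; a rationale can mention the chosen move and still argue for a different one, so a stricter faithfulness check would need to compare the reasoning's conclusion with the move played.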
Conclusion
The study shows that the choice of fine-tuning data shapes what reinforcement learning can achieve for reasoning in language models, both in final chess performance and in the faithfulness of the reasoning. We release checkpoints, final models, training data, evaluations, and code. Together, these allowed a 7B-parameter model to surpass leading open-source reasoning models in chess.
