Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
arXiv:2509.25758v2
Abstract: The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). However, the architectural mechanisms behind such improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation.
Key Findings
Our comparative analysis across various model families reveals significant insights into how emergent heads evolve under different training regimes:
- Distillation and Supervised Fine-Tuning (SFT): These techniques foster a cumulative addition of stable reasoning heads: new heads accumulate over the course of training and remain in place once formed.
- Group Relative Policy Optimization (GRPO): This method operates in a dynamic search mode where relatively few attention heads are iteratively activated, evaluated, and pruned. Their survival closely tracks fluctuations in the task reward signal.
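For readers unfamiliar with the reward signal that GRPO optimizes, the following is a minimal sketch of the group-relative advantage as it is commonly formulated: each sampled response's reward is normalized against the mean and standard deviation of its own group. It is not taken from the paper's code. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` (an assumption of this sketch) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses scored by a binary task reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the advantage is relative to the group, the same response can be pushed up in one step and down in another as the group's rewards shift, which is one plausible reading of why head survival under GRPO tracks reward fluctuations.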
Controllable Models and Reasoning Dynamics
We also examined controllable “think on/off” models. These models do not possess dedicated “thinking” heads. Instead, turning explicit reasoning off activates a broader yet less efficient set of compensatory heads. Disabling reasoning therefore does not switch off a dedicated circuit; the model spreads the same work across more heads.
Performance Trade-offs
Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off. Strengthened heads enable sophisticated problem-solving strategies for difficult problems but can also introduce “over-thinking” failure modes. Examples of these include:
- Calculation Errors: Overly complex reasoning can lead to mistakes in basic arithmetic or logical assessments.
- Logical Loops: Simple tasks may trigger repetitive cycles in reasoning processes, hindering efficient execution.
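The ablation analyses mentioned above can be illustrated with a toy example. The sketch below zero-ablates one head at a time and measures how much a scalar readout score drops. It is a simplified stand-in, not the paper's procedure. The names `combine_heads` and `ablation_effect`, the toy vectors, and the linear readout are all hypothetical. In a real transformer the effect of removing a head passes through later nonlinear layers, so measuring it requires rerunning the forward pass rather than subtracting a single term.

```python
def combine_heads(head_outputs, ablated=frozenset()):
    """Sum per-head output vectors, skipping (zero-ablating) heads in `ablated`."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in ablated:
            continue  # zero-ablation: this head contributes nothing
        total = [t + v for t, v in zip(total, vec)]
    return total


def ablation_effect(head_outputs, readout, head):
    """Drop in a linear readout score when a single head is zero-ablated."""
    def score(vec):
        return sum(w * x for w, x in zip(readout, vec))

    full = score(combine_heads(head_outputs))
    without = score(combine_heads(head_outputs, ablated={head}))
    return full - without


# Hypothetical toy data: (layer, head) -> that head's output contribution.
head_outputs = {
    (0, 0): [0.5, 0.0],
    (0, 1): [0.0, 0.25],
    (1, 0): [2.0, -1.0],
}
readout = [1.0, 0.5]  # assumed linear readout direction

effects = {h: ablation_effect(head_outputs, readout, h) for h in head_outputs}
most_important = max(effects, key=effects.get)
```

Ranking heads by this kind of score is one simple way to separate heads a task depends on from heads it can do without.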
Implications for Future Research
These findings illustrate an inherent tension in reasoning models: complex reasoning capabilities can come at the cost of elementary computations. Our work highlights the need for a balanced approach in training policy design, one that fosters effective reasoning strategies while ensuring reliable, flawless execution.
By understanding the intricate dynamics of attention heads in reasoning models, we can pave the way for future advancements in AI, ensuring that the systems we build are not only capable of handling complexity but also maintain precision in simpler tasks.
