Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Summary: arXiv:2604.08563v1
Abstract
Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored.
Research Overview
In this study, we systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. The results show that the best-performing temperature depends on the prompting strategy, so the two should be tuned together rather than separately.
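The design described above is a 2 x 4 grid (two prompting strategies, four temperatures) scored on a fixed problem set. The following is a minimal sketch of such a grid evaluation, not the authors' harness. The `ask_model` callable, the exact-match grading rule, and the chain-of-thought prompt wording are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

STRATEGIES = ("zero_shot", "chain_of_thought")
TEMPERATURES = (0.0, 0.4, 0.7, 1.0)  # the four settings named in the study


def build_prompt(strategy: str, problem: str) -> str:
    """Build the prompt for one problem (the CoT wording is an assumption)."""
    if strategy == "chain_of_thought":
        return problem + "\n\nLet's think step by step."
    return problem


def evaluate_grid(
    problems: Sequence[str],
    answers: Sequence[str],
    ask_model: Callable[[str, float], str],
) -> Dict[Tuple[str, float], float]:
    """Return accuracy for every (strategy, temperature) cell of the grid.

    `ask_model(prompt, temperature)` is a hypothetical stand-in for a real
    model client; grading here is plain exact match on the stripped reply.
    """
    if not problems or len(problems) != len(answers):
        raise ValueError("problems and answers must be non-empty and aligned")

    accuracy: Dict[Tuple[str, float], float] = {}
    for strategy, temperature in product(STRATEGIES, TEMPERATURES):
        correct = 0
        for problem, answer in zip(problems, answers):
            reply = ask_model(build_prompt(strategy, problem), temperature)
            correct += reply.strip() == answer
        accuracy[(strategy, temperature)] = correct / len(problems)
    return accuracy
```

With 39 problems, each cell's accuracy moves in steps of 1/39 (about 2.6 percentage points), which is worth keeping in mind when comparing neighbouring cells.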
Key Findings
- Zero-shot Prompting: Achieves peak performance at moderate temperatures, specifically at T=0.4 and T=0.7, with an accuracy of 59%.
- Chain-of-Thought Prompting: Performs best at the temperature extremes (T=0.0 and T=1.0), suggesting that prompting strategy and temperature interact rather than act independently.
- Extended Reasoning Benefit: The advantage of employing extended reasoning grows with temperature, from 6x at T=0.0 to 14.3x at T=1.0, more than doubling across the tested range.
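To make the "6x" and "14.3x" figures concrete, the sketch below treats the benefit as a simple accuracy ratio. This definition is an assumption, since the summary does not state how the benefit is computed, and `benefit_ratio` is a hypothetical helper. Only the two multipliers themselves come from the findings above.

```python
def benefit_ratio(acc_extended: float, acc_baseline: float) -> float:
    """Accuracy with extended reasoning divided by accuracy without it.

    Assumed definition; the baseline must be positive for the ratio to exist.
    """
    if acc_baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return acc_extended / acc_baseline


# Multipliers reported in the key findings.
RATIO_AT_T0 = 6.0
RATIO_AT_T1 = 14.3

# How much the reported benefit grows from T=0.0 to T=1.0.
growth_factor = RATIO_AT_T1 / RATIO_AT_T0
print(f"benefit grows by a factor of {growth_factor:.2f}")
```

Under this reading, the benefit at T=1.0 is a little under 2.4 times the benefit at T=0.0.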
Implications for Future Research
The results of this study challenge the common practice of using T=0 for reasoning tasks. Instead, the research advocates tuning temperature jointly with the chosen prompting strategy to maximize the performance of extended reasoning models. A natural next step is to test whether the same interaction holds for other models, benchmarks, and temperature values.
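Joint tuning can be as simple as picking the best temperature separately for each strategy from a grid of measured accuracies. The sketch below assumes a dictionary keyed by `(strategy, temperature)`. The `best_temperatures` helper is hypothetical, and ties are all returned, since more than one temperature can share the peak.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def best_temperatures(
    accuracy: Dict[Tuple[str, float], float],
) -> Dict[str, List[float]]:
    """Return, per strategy, every temperature that reaches its top accuracy."""
    by_strategy: Dict[str, Dict[float, float]] = defaultdict(dict)
    for (strategy, temperature), value in accuracy.items():
        by_strategy[strategy][temperature] = value

    best: Dict[str, List[float]] = {}
    for strategy, scores in by_strategy.items():
        top = max(scores.values())
        # Exact equality is fine when accuracies are k/n with the same n.
        best[strategy] = sorted(t for t, v in scores.items() if v == top)
    return best
```

Returning a list rather than a single value avoids silently favouring one temperature when two settings perform identically.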
Conclusion
This research highlights the importance of systematically investigating the interplay between prompting strategy and temperature in large language models. No single temperature was best for both strategies tested, so configurations should be chosen per strategy rather than fixed by convention.
This work contributes to the understanding of extended reasoning LLMs and gives a starting point for tuning them on complex problem-solving tasks.
