ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning
arXiv:2604.05355v1
Abstract: Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR.
Introduction
Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks. One significant challenge remains: the reasoning traces it produces are often excessively long and inefficient.
Challenges with Current Methods
Existing approaches shorten chain-of-thought reasoning using:
- Length penalties
- Global entropy reduction
These methods implicitly assume that low uncertainty is desirable throughout reasoning. That assumption overlooks what actually governs reasoning efficiency: the trajectory of uncertainty over the course of the trace.
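To make the distinction between the level and the trajectory of uncertainty concrete, here is a minimal sketch. The entropy values are invented for illustration, and `token_entropy_slope` is a hypothetical helper, not something taken from the paper. It shows two traces with the same mean entropy, which a global measure cannot tell apart, but with opposite trends.

```python
from statistics import mean

def token_entropy_slope(entropies):
    """Least-squares slope of entropy against token position (hypothetical helper)."""
    n = len(entropies)
    x_bar = (n - 1) / 2
    y_bar = mean(entropies)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(entropies))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den

# Illustrative per-token entropies, not measured data.
falling = [2.0, 1.5, 1.0, 0.5, 0.0]   # uncertainty shrinks as reasoning proceeds
rising = list(reversed(falling))       # same values, opposite order

mean_falling = mean(falling)
mean_rising = mean(rising)
slope_falling = token_entropy_slope(falling)
slope_rising = token_entropy_slope(rising)

print(mean_falling, mean_rising)     # identical global level
print(slope_falling, slope_rising)   # opposite trends
```

A penalty on the mean alone would score both traces identically. Only a quantity that depends on token order, such as the slope here, separates them.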
Introducing Entropy Trend Reward (ETR)
We show that CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective designed to:
- Encourage progressive uncertainty reduction
- Allow for limited local exploration
ETR thus rewards how uncertainty evolves over the course of a trace, rather than only how low it is overall.
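The two goals above can be sketched as a toy scoring function. This is an assumption-laden illustration and not the paper's actual ETR formula, which the text here does not specify. The names `trend_reward` and `slack` are hypothetical. Steps that lower entropy are rewarded, small rises within `slack` are tolerated as local exploration, and larger rises are penalized.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trend_reward(entropies, slack=0.1):
    """Hypothetical trend score in [-1, 1]; not the paper's formula.

    +1 for each step where entropy falls, -1 for each step where it rises
    by more than `slack`, 0 for small rises (tolerated local exploration).
    """
    if len(entropies) < 2:
        return 0.0
    score = 0
    for prev, curr in zip(entropies, entropies[1:]):
        delta = curr - prev
        if delta < 0:
            score += 1
        elif delta > slack:
            score -= 1
    return score / (len(entropies) - 1)

# Illustrative per-token entropies, not measured data.
steady = [2.0, 1.6, 1.65, 1.1, 0.4]     # one small bump, within slack
wandering = [1.0, 1.5, 2.0, 1.9, 2.5]   # mostly rising uncertainty

reward_steady = trend_reward(steady)
reward_wandering = trend_reward(wandering)
print(reward_steady, reward_wandering)
```

In practice the per-token entropies would come from applying something like `shannon_entropy` to the model's next-token distribution at each position. The step-counting rule is only one of many ways to score a trend.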
Integration with Group Relative Policy Optimization (GRPO)
We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks.
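As a rough sketch of where a trend-based term could enter GRPO, the snippet below adds a trend score to a correctness reward and then normalizes rewards within a group of sampled responses, which is the group-relative step GRPO is named for. The additive combination, the `trend_weight` parameter, and the sample values are assumptions for illustration. The paper's actual reward shaping may differ.

```python
from statistics import mean, pstdev

def combined_rewards(correct, trend_scores, trend_weight=0.5):
    """Correctness reward plus a weighted trend term (assumed additive form)."""
    return [float(c) + trend_weight * t for c, t in zip(correct, trend_scores)]

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group.

    Uses the population standard deviation; implementations vary on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled responses (illustrative values only).
correct = [1, 1, 0, 0]                    # 1 if the final answer is right
trend_scores = [0.75, -0.5, 0.75, -0.5]   # hypothetical per-trace trend scores

rewards = combined_rewards(correct, trend_scores)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Under these assumptions, a correct response whose entropy trends downward receives the largest advantage in its group, and among equally correct responses the one with the better trend is preferred.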
Results and Achievements
ETR consistently achieves a superior accuracy-efficiency tradeoff. On DeepSeek-R1-Distill-7B, across four benchmarks, it:
- Improves accuracy by 9.9%
- Reduces CoT length by 67%
These findings suggest that shaping the trajectory of uncertainty can make reasoning in large language models both shorter and more accurate.
Conclusion
Entropy Trend Reward gives researchers and developers a trajectory-aware objective for shortening chain-of-thought reasoning in large language models. By rewarding progressive uncertainty reduction rather than uniformly low uncertainty, ETR achieves a favorable accuracy-efficiency balance.
