Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
arXiv:2508.10164v2
Abstract
Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning.
Introduction
Large Reasoning Models (LRMs) enable machines to tackle complex tasks that require multi-step reasoning. However, these models tend to produce excessively long outputs during the reasoning process. This raises computational costs and can also lead to inefficiencies such as overthinking or irrelevant elaboration.
Problem Statement
As LRMs continue to evolve, researchers face a dual challenge: maintaining the quality of reasoning while reducing the length of outputs. Existing methods often sacrifice reasoning quality or require extensive computational resources, which makes them less feasible in practice. An efficient method that shortens outputs without degrading reasoning quality is therefore needed.
Methodology
To address this challenge, we present Length Controlled Preference Optimization (LCPO). The methodology involves the following key steps:
- Generation Path Analysis: We analyze generation path distributions to identify patterns in the outputs of LRMs. This analysis helps in understanding which trajectories lead to longer outputs.
- Difficulty Estimation: We implement a filtering mechanism based on difficulty estimation to streamline the generated trajectories, focusing on those that contribute to effective reasoning without excessive length.
- Preference Optimization Objectives: We explore the convergence characteristics of various preference optimization objectives within a unified Bradley-Terry loss-based framework. This allows us to refine our approach systematically.
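The steps above can be illustrated with a minimal sketch. This is not the LCPO implementation: the text does not specify the difficulty estimator, the pairing rule, or the reward definition, so everything here is an assumption made for illustration. Difficulty is approximated as one minus the pass rate over sampled responses, the shortest correct response is preferred over the longest correct one, and `Sample`, `estimate_difficulty`, `build_pair`, and `max_difficulty` are hypothetical names. The only standard part is the Bradley-Terry form itself, where the probability that the chosen response beats the rejected one is the sigmoid of their reward difference.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One sampled reasoning trajectory for a prompt (hypothetical structure)."""
    prompt: str
    response: str
    correct: bool
    num_tokens: int


def estimate_difficulty(samples):
    """Assumed proxy: difficulty = 1 - pass rate over the sampled responses."""
    return 1.0 - sum(s.correct for s in samples) / len(samples)


def build_pair(samples, max_difficulty=0.5):
    """Return (chosen, rejected) for one prompt, or None if it is filtered out.

    Assumed rule: drop prompts above the difficulty threshold, then prefer
    the shortest correct response over the longest correct response.
    """
    if not samples or estimate_difficulty(samples) > max_difficulty:
        return None
    correct = [s for s in samples if s.correct]
    if len(correct) < 2:
        return None
    chosen = min(correct, key=lambda s: s.num_tokens)
    rejected = max(correct, key=lambda s: s.num_tokens)
    if chosen.num_tokens == rejected.num_tokens:
        return None  # no length signal to learn from
    return chosen, rejected


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood -log(sigmoid(r_chosen - r_rejected)).

    Written as a numerically stable softplus of the negated margin.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Different preference optimization objectives can be viewed as different choices of the reward plugged into `bradley_terry_loss`. How each objective defines that reward is not stated in this text, so the sketch leaves the rewards as plain numbers.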
Results
Our experiments show that LCPO reduces the average output length of LRMs by over 50% across multiple benchmarks while maintaining reasoning performance.
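For concreteness, a relative length reduction of this kind can be computed as follows. This is a sketch under the assumption that the metric is the relative drop in mean token count between a baseline model and its pruned counterpart on the same prompts. The function name `mean_length_reduction` is hypothetical, and the numbers in the usage note are placeholders, not results from the paper.

```python
def mean_length_reduction(baseline_lengths, pruned_lengths):
    """Relative reduction in mean output length, as a fraction in (-inf, 1].

    Assumes both lists hold token counts for the same set of prompts.
    """
    if not baseline_lengths or not pruned_lengths:
        raise ValueError("both length lists must be non-empty")
    baseline_mean = sum(baseline_lengths) / len(baseline_lengths)
    pruned_mean = sum(pruned_lengths) / len(pruned_lengths)
    return 1.0 - pruned_mean / baseline_mean
```

With placeholder token counts such as `[1000, 3000]` for the baseline and `[400, 1200]` after pruning, the mean drops from 2000 to 800 tokens, which is a reduction of about 60%.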
Conclusion
Our work highlights the potential of computationally efficient approaches for guiding LRMs toward effective reasoning without lengthy outputs. Length Controlled Preference Optimization is a novel contribution to the field and offers a viable option for researchers and practitioners who want to use LRMs while maintaining efficiency.
Future Work
Future research may explore the scalability of LCPO and its application to different model architectures and tasks, further enhancing its adaptability in various domains.
