Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
arXiv:2603.03332v3
Abstract
Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain.
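To make the taxonomy concrete, here is a minimal sketch of how three of the perturbation types might be injected into a reasoning chain. It assumes a chain is represented as a list of step strings; the function names (`inject_math_error`, `skip_step`, `add_extra_step`) are hypothetical and are not taken from the paper's code, whose actual procedure is not described in this summary.

```python
import re


def inject_math_error(steps, step_index, offset=1):
    """MathError (illustrative): shift the last integer in one step by `offset`.

    Assumes the last integer in the step is that step's computed result.
    """
    target = steps[step_index]
    matches = list(re.finditer(r"\d+", target))
    if not matches:
        raise ValueError("chosen step contains no number to corrupt")
    last = matches[-1]
    wrong = int(last.group()) + offset
    corrupted = target[:last.start()] + str(wrong) + target[last.end():]
    return steps[:step_index] + [corrupted] + steps[step_index + 1:]


def skip_step(steps, step_index):
    """SkippedSteps (illustrative): drop one intermediate step."""
    return steps[:step_index] + steps[step_index + 1:]


def add_extra_step(steps, step_index, filler):
    """ExtraSteps (illustrative): insert an irrelevant step before `step_index`."""
    return steps[:step_index] + [filler] + steps[step_index:]


chain = [
    "Each box holds 3 * 4 = 12 apples.",
    "Two boxes hold 2 * 12 = 24 apples.",
]
perturbed = inject_math_error(chain, 0)
# perturbed[0] is now "Each box holds 3 * 4 = 13 apples."
```

The model under test would then be asked to continue from the perturbed chain, and its final answer compared with the answer it gives from the clean chain.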
Key Findings
Our key findings reveal heterogeneous vulnerability patterns across the models tested:
- MathError: These perturbations produce the most severe degradation in smaller models, with accuracy losses of 50-60%, but robustness improves markedly with model scale.
- UnitConversion: This perturbation type remains challenging at every model scale, with accuracy losses above 5% even for mid-sized models.
- ExtraSteps: These perturbations incur minimal accuracy degradation (0-6%) even in the smallest models.
- Sycophancy and SkippedSteps: Both types produce modest effects, with roughly 10% accuracy loss for smaller models that shrinks slightly as model scale increases.
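The figures above are accuracy losses. This summary does not state whether they are absolute (percentage points) or relative, so the sketch below assumes the absolute reading: clean accuracy minus perturbed accuracy, in percentage points. The function names are hypothetical.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_loss_points(clean_preds, perturbed_preds, answers):
    """Accuracy loss in percentage points (assumed definition: clean minus perturbed)."""
    return 100.0 * (accuracy(clean_preds, answers) - accuracy(perturbed_preds, answers))


answers = ["24", "7", "10", "3"]
clean_preds = ["24", "7", "10", "3"]       # model answers from clean chains
perturbed_preds = ["26", "7", "10", "5"]   # model answers after a perturbation
loss = accuracy_loss_points(clean_preds, perturbed_preds, answers)  # 50.0
```

Under this reading, a 50-60% loss means a model that answers every problem correctly from clean chains would answer only 40-50% correctly from perturbed ones.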
Scaling Relationships
Model size protects against some perturbations but not all. Larger models recover much of the accuracy lost to MathError and improve slightly on Sycophancy and SkippedSteps, whereas UnitConversion remains difficult even for mid-sized models and ExtraSteps causes little harm at any size. How much scale helps therefore depends heavily on the perturbation type.
Implications for Deployment
The findings from this study have direct implications for deploying LLMs in multi-stage reasoning pipelines. As LLMs become increasingly integrated into critical applications, understanding their vulnerabilities to perturbations is essential. The results highlight the necessity of task-specific robustness assessments and the development of mitigation strategies that can enhance the reliability of these models in practical scenarios.
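As one illustration of what a mitigation could look like in a multi-stage pipeline, the sketch below re-checks simple arithmetic claims in intermediate steps before a later stage consumes them. This is an assumed example, not a method from the paper. It handles only single binary operations on non-negative integers (chained expressions such as `2 + 3 + 4 = 9` would be misread), and it would not catch UnitConversion, Sycophancy, or SkippedSteps perturbations.

```python
import re

# Matches simple claims such as "3 * 4 = 12" (non-negative integers, one operator).
CLAIM = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def find_arithmetic_errors(steps):
    """Return (step_index, claim_text, expected_value) for each incorrect claim."""
    errors = []
    for index, step in enumerate(steps):
        for match in CLAIM.finditer(step):
            left, op, right, claimed = match.groups()
            expected = OPS[op](int(left), int(right))
            if expected != int(claimed):
                errors.append((index, match.group(0), expected))
    return errors


chain = [
    "Each box holds 3 * 4 = 13 apples.",   # corrupted step
    "Two boxes hold 2 * 13 = 26 apples.",  # internally consistent, but built on the error
]
errors = find_arithmetic_errors(chain)
# errors == [(0, "3 * 4 = 13", 12)]
```

Note that the second step is not flagged: it is arithmetically consistent with the corrupted value, so a pipeline would need to recompute everything downstream of the first flagged step rather than patch that step alone.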
Further Information
The code and results from this research are available in the project's GitHub repository.
Conclusion
Understanding how Large Language Models handle perturbed reasoning remains a critical area of study. This research contributes to a growing body of work aimed at making these models robust enough for real-world applications.
