LLM Robustness to Chain-of-Thought Perturbations Explained

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Source: arXiv:2603.03332v3

Abstract

Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain.
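To make the taxonomy concrete, here is a minimal Python sketch of what injecting a perturbation into a reasoning chain could look like. Only the five type names come from the abstract. The helpers `inject_math_error` and `inject_skipped_step`, the `offset` parameter, and the sample chain are hypothetical illustrations, and the paper's actual injection procedure may differ.

```python
import re
from enum import Enum


class Perturbation(Enum):
    """The five perturbation types named in the abstract."""
    MATH_ERROR = "MathError"
    UNIT_CONVERSION = "UnitConversion"
    SYCOPHANCY = "Sycophancy"
    SKIPPED_STEPS = "SkippedSteps"
    EXTRA_STEPS = "ExtraSteps"


def inject_math_error(steps, offset=1):
    """Return a copy of `steps` where the first '= <integer>' result is shifted by `offset`.

    Hypothetical helper for illustration, not the paper's method.
    """
    pattern = re.compile(r"=\s*(-?\d+)")
    corrupted = list(steps)
    for i, step in enumerate(corrupted):
        match = pattern.search(step)
        if match:
            wrong = int(match.group(1)) + offset
            corrupted[i] = step[:match.start(1)] + str(wrong) + step[match.end(1):]
            break  # corrupt only the first computed result
    return corrupted


def inject_skipped_step(steps, index):
    """Return a copy of `steps` with the step at `index` removed (hypothetical helper)."""
    return [s for i, s in enumerate(steps) if i != index]


# Made-up reasoning chain, used only to exercise the helpers.
chain = [
    "Each box holds 12 pencils, and there are 4 boxes.",
    "12 * 4 = 48 pencils in total.",
    "Giving away 8 leaves 48 - 8 = 40 pencils.",
]

print(inject_math_error(chain))
print(inject_skipped_step(chain, 1))
```

In an evaluation of this kind, the corrupted chain would be given to the model as a partial solution, and the model's final answer would be compared against the ground truth.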

Key Findings

Our key findings reveal heterogeneous vulnerability patterns across the models tested:

MathError: These perturbations produce the most severe degradation in smaller models, with accuracy losses of 50-60%. Larger models are markedly more robust, showing a clear benefit from scale.
UnitConversion: This perturbation type remains challenging across all model scales, with over 5% accuracy loss even for midsized models.
ExtraSteps: These perturbations incur minimal accuracy degradation (0-6%), even in the smallest models.
Sycophancy and SkippedSteps: These types produce modest effects, with approximately 10% accuracy loss for smaller models, which improves slightly with increased model scale.
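The percentages above are accuracy losses between a clean run and a perturbed run. The sketch below shows one way such a figure could be computed. It assumes the loss is measured in percentage points over the same set of problems; this summary does not say whether the paper uses absolute or relative loss. The function name `accuracy_loss_points` and the toy outcome lists are made up for illustration and are not results from the paper.

```python
def accuracy_loss_points(clean_correct, perturbed_correct):
    """Accuracy loss in percentage points between a clean and a perturbed run.

    Both arguments are lists of booleans, one per problem, in the same order.
    Assumption: loss is reported as an absolute difference in accuracy.
    """
    if len(clean_correct) != len(perturbed_correct):
        raise ValueError("both runs must cover the same problems")
    if not clean_correct:
        raise ValueError("at least one problem is required")
    n = len(clean_correct)
    return 100 * (sum(clean_correct) - sum(perturbed_correct)) / n


# Toy outcomes (not real data): 4 of 5 correct without perturbation,
# 2 of 5 correct once the reasoning chain is corrupted.
clean = [True, True, True, True, False]
perturbed = [True, False, True, False, False]

print(accuracy_loss_points(clean, perturbed))
```

Computing the loss per perturbation type and per model gives the kind of breakdown summarized in the list above.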

Scaling Relationships

Model size protects against many perturbations, but not all of them. Larger models recover much of the accuracy lost to MathError, for example, while UnitConversion remains challenging across all model scales. How much scale helps therefore depends heavily on the specific perturbation type.

Implications for Deployment

The findings from this study have direct implications for deploying LLMs in multi-stage reasoning pipelines. As LLMs become increasingly integrated into critical applications, understanding their vulnerabilities to perturbations is essential. The results highlight the necessity of task-specific robustness assessments and the development of mitigation strategies that can enhance the reliability of these models in practical scenarios.
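As one illustration of what a mitigation step in a multi-stage pipeline might look like, the sketch below re-checks simple integer arithmetic in a reasoning chain before the chain is passed to the next stage. This is an assumption for illustration and is not a technique taken from the paper. The name `find_arithmetic_errors` and the sample steps are hypothetical, and the checker only handles steps written as `a op b = c` with integers and the operators `+`, `-`, and `*`.

```python
import re

# Matches simple integer arithmetic such as "12 * 4 = 48".
STEP_PATTERN = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def find_arithmetic_errors(steps):
    """Return (step_index, expression, claimed, expected) for each wrong result."""
    errors = []
    for index, step in enumerate(steps):
        for a, op, b, claimed in STEP_PATTERN.findall(step):
            expected = OPERATIONS[op](int(a), int(b))
            if expected != int(claimed):
                errors.append((index, f"{a} {op} {b}", int(claimed), expected))
    return errors


# Made-up chain with an injected error in the first step.
suspect_chain = [
    "12 * 4 = 49 pencils in total.",
    "Giving away 8 leaves 49 - 8 = 41 pencils.",
]

print(find_arithmetic_errors(suspect_chain))
```

Note that only the first step is flagged. The second step is locally consistent even though it builds on a wrong value, so a check like this locates where an error entered the chain but does not by itself repair the final answer. It also does nothing for perturbations such as UnitConversion or Sycophancy, which is one reason task-specific robustness assessment matters.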

Further Information

The code and results from this research are available in the project's GitHub repository.

Conclusion

As the field of AI continues to evolve, understanding the nuances of how Large Language Models handle reasoning perturbations remains a critical area of study. This research contributes to a growing body of work aimed at ensuring that these powerful tools are robust enough for various applications, ultimately advancing the reliability of AI systems in real-world situations.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
