Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, but their reliability in evaluating real-world policies remains largely unexamined. A new study, available on arXiv, presents a benchmark of 40 empirical policy evaluation cases drawn from economics and the social sciences and uses it to assess LLM performance across models and prompting strategies.
Study Overview
The study asks how the intuitiveness of an empirical finding affects LLM accuracy in policy evaluation. Each benchmark case is assigned to one of three intuitiveness categories:
- Obvious: Findings that align with common expectations.
- Ambiguous: Findings that are unclear in relation to prior expectations.
- Counter-intuitive: Findings that contradict common prior beliefs.
Methodology
The researchers evaluated four leading LLMs under five prompting strategies, for a total of 2,400 experimental trials. They analyzed the results with mixed-effects logistic regression to estimate how intuitiveness, model choice, and prompting strategy each affect accuracy.
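The trial count can be checked against the stated design with a short sketch. It assumes a fully crossed, balanced design, which this summary does not confirm: under that assumption, 40 cases, 4 models, and 5 prompting strategies give 800 unique cells, so 2,400 trials would correspond to three trials per cell. The case, model, and prompt labels are placeholders, not names used in the study.

```python
from itertools import product

# Figures reported in the summary.
N_CASES, N_MODELS, N_PROMPTS, N_TRIALS = 40, 4, 5, 2400

# Placeholder labels (hypothetical; the summary names no models or prompts).
cases = [f"case_{i:02d}" for i in range(1, N_CASES + 1)]
models = [f"model_{i}" for i in range(1, N_MODELS + 1)]
prompts = [f"prompt_{i}" for i in range(1, N_PROMPTS + 1)]

# Assumption: every case is run with every model under every prompt.
cells = list(product(cases, models, prompts))

# If the design is balanced, the trial count divides evenly across cells.
trials_per_cell, remainder = divmod(N_TRIALS, len(cells))

print(f"unique cells: {len(cells)}")
print(f"trials per cell: {trials_per_cell} (remainder {remainder})")
```

A remainder of zero is consistent with a balanced design but does not prove it; the paper itself is the authority on how trials were allocated.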
Key Findings
The study reports three main findings:
- Chain-of-Thought Paradox: Chain-of-thought (CoT) prompting substantially improves performance on obvious cases, but this benefit shrinks sharply on counter-intuitive cases (interaction odds ratio = 0.053, p < 0.001).
- Intuitiveness as a Dominant Factor: Case intuitiveness emerged as the primary driver of model performance, explaining more variance than either model choice or prompting strategy (intraclass correlation coefficient, ICC = 0.537).
- Knowledge-Reasoning Dissociation: Citation-based familiarity showed no significant association with accuracy (p = 0.53). This suggests that LLMs may possess the relevant knowledge yet struggle to apply it when a finding contradicts intuitive beliefs.
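To make the reported odds ratio and ICC more concrete, here is a minimal sketch of how such statistics are conventionally read. It assumes dummy coding with obvious cases and non-CoT prompting as reference levels, so that the interaction odds ratio multiplies the CoT main-effect odds ratio, and it assumes the common latent-variable ICC for logistic mixed models, ICC = var / (var + pi^2 / 3). The summary does not state which coding or ICC definition the authors used, and the function names are illustrative only.

```python
import math

# Values reported in the summary.
INTERACTION_OR = 0.053  # CoT x counter-intuitive interaction odds ratio
ICC = 0.537             # intraclass correlation coefficient

# Assumption: the logistic distribution's variance, pi^2 / 3, serves as the
# residual variance in the latent-variable ICC for logistic mixed models.
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3


def net_cot_odds_ratio(main_effect_or: float) -> float:
    """Odds ratio for CoT on counter-intuitive cases, given the CoT
    odds ratio on obvious (reference) cases. Assumes dummy coding."""
    return main_effect_or * INTERACTION_OR


def implied_group_variance(icc: float) -> float:
    """Random-intercept variance implied by a latent-variable ICC."""
    return icc / (1.0 - icc) * LOGISTIC_RESIDUAL_VAR


# The interaction on the log-odds scale, as a regression table would show it.
log_interaction = math.log(INTERACTION_OR)

# CoT helps on counter-intuitive cases only if its main-effect odds ratio
# exceeds this break-even value (about 18.9).
break_even_main_or = 1.0 / INTERACTION_OR

variance = implied_group_variance(ICC)

print(f"log interaction coefficient: {log_interaction:.3f}")
print(f"break-even CoT main-effect OR: {break_even_main_or:.1f}")
print(f"implied random-intercept variance: {variance:.2f}")
```

Under these assumptions, an interaction odds ratio of 0.053 means the CoT advantage on obvious cases would have to be very large before any of it carried over to counter-intuitive cases. The summary does not report the main-effect odds ratio, so the net effect cannot be computed from it alone.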
Theoretical Implications
The authors frame their findings within the context of dual-process theory, which distinguishes between two types of cognitive processing: System 1 (fast, intuitive) and System 2 (slow, deliberative). They argue that the “slow thinking” exhibited by current LLMs may be more akin to “slow talking,” producing the semblance of deliberative reasoning without the depth of understanding needed to engage meaningfully with counter-intuitive findings.
Conclusion
The study indicates that LLM accuracy in policy evaluation depends heavily on whether an empirical finding matches intuition, and that chain-of-thought prompting offers far less help on counter-intuitive cases than on obvious ones. Anyone integrating LLMs into policy evaluation should account for these limitations, particularly where the evidence contradicts common prior beliefs.
