From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
Large language models (LLMs) have advanced rapidly on benchmark math problems, in some cases reaching performance comparable to experts. That progress has not carried over into consistent, reliable results in real-world applications. A recent study examines this gap through contextual mathematical reasoning, where the mathematical core must first be formulated from a descriptive scenario.
The study introduces ContextMATH, a benchmark built to measure this kind of reasoning. It repurposes existing problems from the AIME (American Invitational Mathematics Examination) and MATH-500 datasets into two contextual settings:
- Scenario Grounding (SG): This setting embeds abstract mathematical problems into realistic narratives without increasing the complexity of reasoning.
- Complexity Scaling (CS): This setting transforms explicit conditions into sub-problems, capturing how constraints typically manifest in practical scenarios.
Evaluating 61 proprietary and open-source models on the benchmark, the study finds sharp drops in performance. On average, open-source models lose 13 points on Scenario Grounding and 34 points on Complexity Scaling. Proprietary models lose 13 points on SG and 20 points on CS.
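To make the notion of a "point drop" concrete, here is a minimal sketch of how an average drop could be computed. It assumes that "points" means absolute percentage points of accuracy and that the baseline is each model's accuracy on the original abstract problems. The function name `average_drop`, the model names, and all accuracy values are hypothetical placeholders, not figures from the study.

```python
from statistics import mean


def average_drop(original, contextual):
    """Mean accuracy drop in percentage points across models.

    Both arguments map a model name to its accuracy in percent.
    The drop is absolute (original minus contextual), not relative.
    """
    return mean(original[name] - contextual[name] for name in original)


# Hypothetical accuracies in percent; placeholders, not results from the study.
original = {"model_a": 80.0, "model_b": 60.0}
scenario_grounding = {"model_a": 70.0, "model_b": 50.0}
complexity_scaling = {"model_a": 50.0, "model_b": 30.0}

sg_drop = average_drop(original, scenario_grounding)
cs_drop = average_drop(original, complexity_scaling)
print(f"SG drop: {sg_drop:.1f} points, CS drop: {cs_drop:.1f} points")
```

Under this reading, a model that falls from 80% to 70% has dropped 10 points, even though its relative decline is 12.5%.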
An in-depth error analysis indicates that most errors stem from incorrect problem formulation, and that formulation accuracy falls as the difficulty of the original problem rises. Correct formulation emerges as a prerequisite for solving a problem successfully. How often a correct formulation is sufficient for a correct answer also improves with model scale, suggesting that larger models are better at both understanding and reasoning.
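The two claims about formulation, that it is a prerequisite and that its sufficiency grows with scale, can be read as two conditional rates. The sketch below shows one plausible way to compute them from per-problem labels. The study's actual procedure is not described here, so the function `formulation_rates`, the record layout, and the sample data are all assumptions made for illustration.

```python
def formulation_rates(records):
    """Return (prerequisite_rate, sufficiency_rate).

    records: list of (formulation_correct, answer_correct) boolean pairs,
    one pair per attempted problem.

    prerequisite_rate: share of correctly answered problems whose
        formulation was also correct. A value near 1.0 means correct
        answers almost never occur without a correct formulation.
    sufficiency_rate: share of correctly formulated problems that were
        also answered correctly.
    Either rate is None when its denominator would be zero.
    """
    formulation_when_solved = [f for f, a in records if a]
    answer_when_formulated = [a for f, a in records if f]
    prerequisite = (
        sum(formulation_when_solved) / len(formulation_when_solved)
        if formulation_when_solved else None
    )
    sufficiency = (
        sum(answer_when_formulated) / len(answer_when_formulated)
        if answer_when_formulated else None
    )
    return prerequisite, sufficiency


# Hypothetical per-problem labels, not data from the study.
records = [(True, True), (True, False), (False, False), (True, True)]
prerequisite, sufficiency = formulation_rates(records)
print(prerequisite, sufficiency)
```

In this toy data every solved problem was correctly formulated, while only two of the three correctly formulated problems were solved. The gap between the two rates is what separates a formulation failure from a reasoning failure.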
Even with these gains from scale, the study finds that formulation and reasoning remain two intertwined bottlenecks for contextual mathematical problem solving. The researchers also tested fine-tuning: training on scenario data improved performance, whereas training solely on formulation data proved ineffective, indicating that formulation cannot be trained in isolation from problem solving.
In conclusion, current training methods only partly close the performance gaps observed on contextual mathematical reasoning tasks, and the problem remains a central unsolved challenge for LLMs. Robust contextual understanding in mathematics therefore remains a critical area of focus for future work.
