ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
Researchers have introduced ThermoQA, a benchmark for evaluating the thermodynamic reasoning of large language models (LLMs). Described in a paper on arXiv (arXiv:2604.19758v1), it consists of 293 open-ended engineering thermodynamics problems organized into three tiers: property lookups, component analysis, and full cycle analysis.
Overview of ThermoQA
Each tier targets a different aspect of thermodynamic reasoning:
- Property Lookups: This tier includes 110 questions that require models to retrieve and interpret thermodynamic properties.
- Component Analysis: Comprising 101 questions, this section challenges models to analyze specific components within thermodynamic systems.
- Full Cycle Analysis: The final tier consists of 82 questions that require comprehensive reasoning across entire thermodynamic cycles.
The ground truth is generated programmatically with CoolProp 7.2.0, a widely used open-source library for thermodynamic property calculations, and covers substances such as water, R-134a refrigerant, and variable-cp air.
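As a rough illustration of what programmatic ground truth can look like, the sketch below builds one property-lookup item from a CoolProp-style property function. The question template, the field names, and the helper make_property_question are hypothetical and are not taken from ThermoQA's code. The only library call assumed is CoolProp's PropsSI, which takes an output key, two input key/value pairs, and a fluid name, all in SI units.

```python
# Sketch only: the template and field names are hypothetical,
# not ThermoQA's actual schema.
from typing import Callable, Dict


def make_property_question(props_si: Callable[..., float],
                           fluid: str, T_K: float, P_Pa: float) -> Dict[str, object]:
    """Build one property-lookup item with a programmatically computed answer."""
    # PropsSI-style call: output key, two (input key, value) pairs, fluid name.
    h = props_si("H", "T", T_K, "P", P_Pa, fluid)  # specific enthalpy in J/kg
    return {
        "question": (f"Find the specific enthalpy of {fluid} "
                     f"at {T_K} K and {P_Pa / 1e3:.0f} kPa."),
        "answer_kJ_per_kg": h / 1e3,  # convert J/kg to kJ/kg
    }


if __name__ == "__main__":
    try:
        from CoolProp.CoolProp import PropsSI  # needs `pip install CoolProp`
    except ImportError:
        print("CoolProp is not installed; skipping the demo.")
    else:
        # Superheated steam at 500 °C and 5 MPa; "R134a" would work the same way.
        print(make_property_question(PropsSI, "Water", 773.15, 5e6))
```

Passing the property function in as an argument keeps the generator independent of the backend, so the same template could be checked against a stub or a different property library.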
Performance Evaluation
The researchers evaluated six frontier LLMs, with three independent runs per model. The top three models on the composite leaderboard are:
- Claude Opus 4.6: 94.1%
- GPT-5.4: 93.1%
- Gemini 3.1 Pro: 92.5%
Performance degraded across tiers for every model reported, by as little as 2.8 percentage points for Claude Opus and as much as 32.5 percentage points for MiniMax. This gap indicates that memorizing thermodynamic properties is not the same as reasoning about thermodynamic systems.
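One plausible way to compute such a degradation figure is to subtract a model's accuracy on the last tier from its accuracy on the first. The sketch below assumes that definition, which the summary above does not confirm, and uses made-up model names and scores purely for illustration.

```python
# Sketch only: assumes degradation = first-tier accuracy minus last-tier
# accuracy. All names and numbers below are invented for illustration.
from typing import Dict, List


def cross_tier_degradation(tier_scores: List[float]) -> float:
    """Return first-tier minus last-tier accuracy, in percentage points."""
    return tier_scores[0] - tier_scores[-1]


# Accuracy (%) on tiers 1, 2, 3 for two hypothetical models.
scores: Dict[str, List[float]] = {
    "model_a": [95.0, 93.5, 92.0],
    "model_b": [90.0, 75.0, 60.0],
}

drops = {name: cross_tier_degradation(s) for name, s in scores.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f} percentage points")
```

A small drop suggests that a model's property knowledge carries over to multi-step cycle analysis, while a large one suggests it does not.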
Distinctive Features of ThermoQA
The benchmark also separates models by problem type. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis emerged as natural discriminators, with performance spreads of 40-60 percentage points across models.
The standard deviation across runs, ranging from ±0.1% to ±2.5%, adds a further dimension: it quantifies how consistently a model reasons, as an axis of assessment separate from accuracy.
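A minimal sketch of how a multi-run consistency figure can be computed from three run scores follows. The scores are invented, and the use of the sample standard deviation (rather than the population form) is an assumption, since the summary does not say which one the authors report.

```python
# Sketch only: the run scores are invented, and the choice of sample
# standard deviation (n - 1 denominator) is an assumption.
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(run_scores: List[float]) -> Tuple[float, float]:
    """Return (mean accuracy, sample standard deviation) over independent runs."""
    return mean(run_scores), stdev(run_scores)


runs = [91.0, 92.0, 93.0]  # accuracy (%) from three hypothetical runs
avg, spread = summarize_runs(runs)
print(f"{avg:.1f}% ± {spread:.1f}%")
```

Reporting the spread next to the mean makes it possible to tell a model that is reliably accurate from one that reaches a similar average with large run-to-run swings.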
Open Source Availability
The dataset and code for ThermoQA are openly available. Researchers and practitioners who want to explore the benchmark can find it on the ThermoQA dataset page on Hugging Face.
As AI continues to advance, benchmarks like ThermoQA help test whether large language models go beyond memorization and reason robustly in complex domains such as thermodynamics.
