SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
Summary: arXiv:2604.10718v1 Announce Type: new
Abstract: Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes – a task where AI could significantly exceed human capabilities – remains largely underexplored.
We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions:
- Can LLMs predict the outcome of scientific experiments with sufficient accuracy?
- Can such predictions be reliably used in the scientific research process?
Evaluations reveal fundamental limitations on both fronts. Model accuracy ranges from 14% to 26%, while human expert accuracy is approximately 20%. Although some frontier models exceed human expert performance, model accuracy remains far below what reliable experimental guidance would require.
Moreover, even at this limited level of performance, models fail to distinguish reliable predictions from unreliable ones: they achieve approximately 20% accuracy regardless of their stated confidence or of whether they judge an outcome to be predictable without physical experimentation. Human experts, in contrast, show strong calibration: their accuracy rises from roughly 5% to roughly 80% as they judge outcomes to be more predictable without conducting the experiment.
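The calibration comparison above amounts to grouping predictions by a self-assessed rating and computing accuracy within each group. The following is a minimal sketch of that analysis, not the benchmark's own code: the function names, the three rating labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_bin(records):
    """Return accuracy per rating bin.

    `records` is an iterable of (bin_label, is_correct) pairs, where
    bin_label is a self-assessed confidence or predictability rating.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, correct in records:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


def accuracy_spread(per_bin):
    """Gap between the best and worst bin; near zero means the rating
    carries little information about correctness."""
    return max(per_bin.values()) - min(per_bin.values())


def make_records(label, n_correct, n_total):
    """Build toy records for one bin (illustrative data only)."""
    return [(label, i < n_correct) for i in range(n_total)]


# Toy data, NOT taken from SciPredict: a forecaster whose accuracy is
# the same in every bin ...
flat = (make_records("low", 1, 5)
        + make_records("medium", 1, 5)
        + make_records("high", 1, 5))

# ... and one whose accuracy rises with the self-assessed rating.
calibrated = (make_records("low", 0, 4)
              + make_records("medium", 2, 4)
              + make_records("high", 3, 4))

flat_acc = accuracy_by_bin(flat)
calibrated_acc = accuracy_by_bin(calibrated)

print("flat:", flat_acc, "spread:", accuracy_spread(flat_acc))
print("calibrated:", calibrated_acc, "spread:", accuracy_spread(calibrated_acc))
```

A spread near zero corresponds to the model behavior described above, where accuracy does not vary with the rating, while a large spread with accuracy increasing across bins corresponds to the pattern reported for human experts.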
SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not only better predictions but also better awareness of when those predictions are reliable.
All data and code are available at https://github.com/scaleapi/scipredict.
