QuantSightBench: Benchmarking LLM Forecasts with Prediction Intervals

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Summary of arXiv:2604.15859v1 (announce type: cross)

Forecasting has become a natural benchmark for reasoning under uncertainty, particularly in fields where decisions are based on numerical estimates. Yet current evaluations of large language models (LLMs) focus primarily on judgmental tasks in simple formats, such as binary or multiple-choice questions. These formats cannot show how well a model estimates a continuous quantity or how well it expresses uncertainty around that estimate.

The Limitations of Current Benchmarks

Existing benchmarks do not adequately reflect the diverse nature of forecasting, which spans various domains including:

Economics
Public Health
Social Demographics

Decisions in these areas often rely on numerical estimates of continuous quantities, which calls for a richer evaluation format. A point estimate alone says nothing about how uncertain the forecaster is, and that uncertainty is a critical input to decision-making.

Introducing Prediction Intervals

To address this gap, we propose the use of prediction intervals as a robust and rigorous interface for evaluating forecasting capabilities. Prediction intervals provide several advantages:

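Scale Awareness: The model must place each interval at the right order of magnitude for the quantity being predicted.
Internal Consistency: Intervals must agree across confidence levels; for example, a central 90% interval should contain the central 50% interval for the same quantity.
Calibration: Stated confidence can be checked against outcomes over a continuum of values, by comparing it with how often the true value actually falls inside the interval.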
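
The internal-consistency property can be illustrated with a small check. This is a minimal sketch, not the benchmark's own code: it assumes a model reports central intervals at several confidence levels for one quantity, and the function name is_nested and the example numbers are hypothetical.

```python
def is_nested(intervals_by_level):
    """Return True if each higher-confidence interval contains the lower ones.

    intervals_by_level maps a confidence level (e.g. 0.9) to a (low, high)
    pair. Assumption: all intervals are central intervals for one quantity.
    """
    levels = sorted(intervals_by_level)
    for narrow_level, wide_level in zip(levels, levels[1:]):
        narrow_lo, narrow_hi = intervals_by_level[narrow_level]
        wide_lo, wide_hi = intervals_by_level[wide_level]
        # The wider-confidence interval must enclose the narrower one.
        if not (wide_lo <= narrow_lo and narrow_hi <= wide_hi):
            return False
    return True


# Hypothetical forecasts for a single quantity (made-up numbers).
consistent = {0.5: (40.0, 60.0), 0.8: (30.0, 75.0), 0.9: (20.0, 90.0)}
inconsistent = {0.5: (40.0, 60.0), 0.9: (45.0, 90.0)}

consistent_ok = is_nested(consistent)
inconsistent_ok = is_nested(inconsistent)
```

Checking only adjacent confidence levels is enough here, because containment is transitive.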

This format is particularly advantageous compared to point estimates, as it emphasizes the inherent uncertainty in forecasts and encourages more reliable predictions.

Introducing QuantSightBench

To facilitate the evaluation of forecasting abilities in LLMs, we introduce a new benchmark called QuantSightBench. This benchmark assesses frontier models under various settings, focusing on two key metrics:

Empirical Coverage: The fraction of prediction intervals that contain the true outcome.
Interval Sharpness: How narrow the prediction intervals are while still maintaining coverage.
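
Both metrics are simple to compute once intervals and outcomes are in hand. The sketch below is illustrative only: the summary does not state how the benchmark measures sharpness, so the scale-free log-width used here is an assumption, as are the function names and the made-up data.

```python
from math import log10


def empirical_coverage(intervals, truths):
    """Fraction of (low, high) intervals that contain the realized value."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)


def mean_log_width(intervals):
    """Average interval width in orders of magnitude.

    Assumption: every interval satisfies 0 < low <= high. This is one
    possible sharpness measure, not necessarily the benchmark's.
    """
    return sum(log10(hi / lo) for lo, hi in intervals) / len(intervals)


# Hypothetical 90% intervals and realized outcomes (made-up numbers).
intervals = [(10.0, 100.0), (2.0, 8.0), (1000.0, 10000.0), (0.5, 5.0)]
truths = [50.0, 9.0, 3000.0, 1.0]

coverage = empirical_coverage(intervals, truths)   # 3 of 4 intervals hit
sharpness = mean_log_width(intervals)
```

The two metrics pull against each other: widening every interval raises coverage but worsens sharpness, so they are best read together.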

Results and Insights

Our evaluations included 11 state-of-the-art models, revealing some concerning trends in their forecasting capabilities:

None of the models achieved the 90% coverage target.
The top performers included:

Gemini 3.1 Pro: 79.1%
Grok 4: 76.4%
GPT-5.4: 75.3%

Even the best model fell 10.9 percentage points short of the coverage target; Grok 4 and GPT-5.4 missed it by 13.6 and 14.7 points, respectively.
Calibration issues were particularly pronounced at extreme magnitudes, indicating a tendency for overconfidence across the evaluated models.
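
One way to look for magnitude-dependent miscalibration is to compute coverage separately for each order of magnitude of the true value. This is a hedged sketch of that idea, not the paper's analysis: the bucketing rule and the function name coverage_by_magnitude are assumptions, and the numbers are invented purely to exercise the function. They are not benchmark results.

```python
from collections import defaultdict
from math import floor, log10


def coverage_by_magnitude(intervals, truths):
    """Coverage per order-of-magnitude bucket of the true value.

    Bucket key is floor(log10(|truth|)). Assumption: no truth equals zero.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for (lo, hi), y in zip(intervals, truths):
        bucket = floor(log10(abs(y)))
        counts[bucket] += 1
        hits[bucket] += lo <= y <= hi  # True counts as 1, False as 0
    return {b: hits[b] / counts[b] for b in sorted(counts)}


# Made-up intervals and outcomes, for illustration only.
intervals = [(5.0, 20.0), (30.0, 90.0), (1e5, 4e5), (2e8, 5e8), (1e9, 3e9)]
truths = [12.0, 45.0, 2.5e5, 9e8, 7e9]

per_bucket = coverage_by_magnitude(intervals, truths)
```

Buckets whose coverage sits well below the nominal level indicate intervals that are too narrow for quantities of that size, which is what overconfidence looks like in this setting.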

Conclusion

The introduction of QuantSightBench marks a significant step forward in the evaluation of LLMs for quantitative forecasting. By focusing on prediction intervals, we aim to enhance the rigor of assessments and improve the reliability of forecasting models in critical domains. Further research and development will be essential to address the identified gaps and improve model performance in real-world applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!