QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
arXiv:2604.15859v1 (cross-listed)
Forecasting has become a natural benchmark for reasoning under uncertainty, particularly in fields where decisions are based on numerical estimates. Despite the importance of accurate forecasting, current evaluations of large language models (LLMs) primarily focus on judgmental tasks that utilize simple formats, such as binary or multiple-choice questions. This narrow approach fails to capture the complexities involved in real-world forecasting scenarios.
The Limitations of Current Benchmarks
Existing benchmarks do not adequately reflect the diverse nature of forecasting, which spans various domains including:
- Economics
- Public Health
- Social Demographics
Decisions in these areas often rely on numerical estimates of continuous quantities, which calls for a more sophisticated evaluation framework. A point estimate alone says nothing about how uncertain the forecaster is, yet that uncertainty is a critical input to decision-making.
Introducing Prediction Intervals
To address this gap, we propose prediction intervals as a rigorous interface for evaluating forecasting capabilities. Prediction intervals test several properties at once:
- Scale Awareness: An interval's bounds must be set at the right scale for the quantity being predicted.
- Internal Consistency: Intervals stated at different confidence levels must agree with one another. For example, a 90% interval should contain the 50% interval for the same quantity.
- Calibration: They can be checked for calibration over a continuum of outcomes. Intervals stated at a given confidence level should contain the true value about that often.
This format is particularly advantageous compared to point estimates, as it emphasizes the inherent uncertainty in forecasts and encourages more reliable predictions.
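The internal-consistency property can be made concrete with a small check. The sketch below is illustrative only and is not taken from the benchmark: the function name `is_nested`, the dictionary layout, and the sample numbers are all assumptions. It treats a forecast as consistent when each higher-confidence interval contains every lower-confidence interval for the same quantity.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def is_nested(intervals: Dict[float, Interval]) -> bool:
    """Check that intervals widen monotonically as the confidence level rises.

    `intervals` maps a confidence level (e.g. 0.9) to a (lower, upper) pair.
    This is a hypothetical helper, not part of QuantSightBench.
    """
    # Every interval must be well formed before nesting can be checked.
    if any(lo > hi for lo, hi in intervals.values()):
        return False
    levels = sorted(intervals)
    for narrow, wide in zip(levels, levels[1:]):
        lo_n, hi_n = intervals[narrow]
        lo_w, hi_w = intervals[wide]
        # The higher-confidence interval must contain the lower-confidence one.
        if lo_w > lo_n or hi_w < hi_n:
            return False
    return True


# Made-up forecasts for a single quantity.
consistent = {0.5: (95.0, 105.0), 0.8: (90.0, 112.0), 0.9: (85.0, 120.0)}
inconsistent = {0.5: (95.0, 105.0), 0.9: (97.0, 120.0)}  # 90% lower bound exceeds the 50% one

print(is_nested(consistent))    # True
print(is_nested(inconsistent))  # False
```

A point estimate offers nothing comparable to check, which is one reason intervals make a stricter evaluation format.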
Introducing QuantSightBench
To facilitate the evaluation of forecasting abilities in LLMs, we introduce a new benchmark called QuantSightBench. This benchmark assesses frontier models under various settings, focusing on two key metrics:
- Empirical Coverage: The fraction of prediction intervals that contain the true outcome.
- Interval Sharpness: How narrow the prediction intervals are. At equal coverage, narrower intervals are more informative.
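As a rough illustration of the two metrics, the sketch below computes empirical coverage as the fraction of intervals containing the realized value, and sharpness as the mean interval width relative to the magnitude of the outcome. The relative-width definition is an assumption chosen so that quantities on different scales are comparable; the benchmark's own sharpness measure may differ. Function names and data are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def empirical_coverage(intervals: Sequence[Interval], outcomes: Sequence[float]) -> float:
    """Fraction of intervals that contain the corresponding true outcome."""
    if len(intervals) != len(outcomes) or not outcomes:
        raise ValueError("intervals and outcomes must be non-empty and equal in length")
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)


def mean_relative_width(intervals: Sequence[Interval], outcomes: Sequence[float]) -> float:
    """Mean of (upper - lower) / |outcome|; smaller means sharper.

    Assumed definition for illustration; requires non-zero outcomes.
    """
    if len(intervals) != len(outcomes) or not outcomes:
        raise ValueError("intervals and outcomes must be non-empty and equal in length")
    widths = [(hi - lo) / abs(y) for (lo, hi), y in zip(intervals, outcomes)]
    return sum(widths) / len(widths)


# Made-up 90% intervals and realized values.
intervals = [(90.0, 110.0), (1000.0, 3000.0), (0.5, 0.7), (40.0, 45.0)]
outcomes = [100.0, 3500.0, 0.6, 44.0]

coverage = empirical_coverage(intervals, outcomes)
sharpness = mean_relative_width(intervals, outcomes)
print(f"coverage = {coverage:.2f}")  # 3 of 4 intervals contain the outcome: 0.75
print(f"mean relative width = {sharpness:.3f}")
```

The two metrics have to be read together: a model can reach any coverage level by widening its intervals, so coverage is only meaningful alongside sharpness.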
Results and Insights
Our evaluations included 11 state-of-the-art models, revealing some concerning trends in their forecasting capabilities:
- None of the models achieved the 90% coverage target.
- The top performers included:
  - Gemini 3.1 Pro: 79.1%
  - Grok 4: 76.4%
  - GPT-5.4: 75.3%
- Even these top performers fell at least 10 percentage points short of the coverage target, with gaps of 10.9, 13.6, and 14.7 points respectively.
- Calibration issues were particularly pronounced at extreme magnitudes, indicating a tendency for overconfidence across the evaluated models.
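One way to surface a magnitude-dependent calibration problem is to group questions by the order of magnitude of the true value and compute coverage within each group. The sketch below does this with a base-10 bucket. The bucketing scheme, the function name `coverage_by_magnitude`, and the data are assumptions for illustration and do not reproduce the paper's analysis.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def coverage_by_magnitude(
    intervals: Sequence[Interval], outcomes: Sequence[float]
) -> Dict[int, float]:
    """Coverage per order of magnitude, keyed by floor(log10(|outcome|)).

    Hypothetical helper; outcomes equal to zero are skipped because
    their order of magnitude is undefined.
    """
    hits: Dict[int, int] = defaultdict(int)
    counts: Dict[int, int] = defaultdict(int)
    for (lo, hi), y in zip(intervals, outcomes):
        if y == 0:
            continue
        bucket = math.floor(math.log10(abs(y)))
        counts[bucket] += 1
        hits[bucket] += int(lo <= y <= hi)
    return {b: hits[b] / counts[b] for b in sorted(counts)}


# Made-up data: intervals hold up for small values and miss for very large ones.
outcomes = [5.0, 50.0, 70.0, 2e6, 3e6]
intervals = [(1.0, 10.0), (40.0, 60.0), (60.0, 80.0), (1e5, 1e6), (1e6, 2.5e6)]

by_bucket = coverage_by_magnitude(intervals, outcomes)
print(by_bucket)  # {0: 1.0, 1: 1.0, 6: 0.0}
```

Coverage well below the stated confidence level in a bucket means the intervals there are too narrow, which is what overconfidence looks like in this setting.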
Conclusion
The introduction of QuantSightBench marks a significant step forward in the evaluation of LLMs for quantitative forecasting. By focusing on prediction intervals, we aim to enhance the rigor of assessments and improve the reliability of forecasting models in critical domains. Further research and development will be essential to address the identified gaps and improve model performance in real-world applications.
