ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
arXiv:2603.28902v1
Abstract
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization.
Overview of ChartDiff
ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles. Each pair is annotated with summaries generated by Large Language Models (LLMs) and verified by human annotators. These summaries describe key differences in trends, fluctuations, and anomalies present in the charts.
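As a rough illustration of what one annotated pair might look like in code, the sketch below defines a hypothetical record type. The class name, field names, file names, and example values are all assumptions made for illustration; the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChartPairRecord:
    """Hypothetical container for one chart pair; not the official schema."""
    chart_a_path: str       # image file for the first chart (assumed field)
    chart_b_path: str       # image file for the second chart (assumed field)
    chart_type: str         # e.g. "line" or "bar" (assumed vocabulary)
    plotting_library: str   # library used to render the pair (assumed field)
    summary: str            # LLM-generated comparative summary
    human_verified: bool    # True once a human annotator has checked the summary


def verified_only(records: List[ChartPairRecord]) -> List[ChartPairRecord]:
    """Keep only the pairs whose summaries passed human verification."""
    return [r for r in records if r.human_verified]


# Invented example values, used only to exercise the types above.
example = ChartPairRecord(
    chart_a_path="pair_0001_a.png",
    chart_b_path="pair_0001_b.png",
    chart_type="line",
    plotting_library="matplotlib",
    summary="Chart A trends upward while chart B stays flat.",
    human_verified=True,
)
unverified = ChartPairRecord(
    chart_a_path="pair_0002_a.png",
    chart_b_path="pair_0002_b.png",
    chart_type="bar",
    plotting_library="matplotlib",
    summary="Draft summary awaiting review.",
    human_verified=False,
)
kept = verified_only([example, unverified])
```

Keeping the verification flag on the record makes it simple to filter for checked summaries before any evaluation.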
Evaluation of Models
Using the ChartDiff benchmark, we evaluate a variety of models, including:
- General-purpose models
- Chart-specialized models
- Pipeline-based methods
Our results indicate that frontier general-purpose models achieve the highest quality as measured by GPT-based metrics. In contrast, specialized and pipeline-based methods obtain higher ROUGE scores but tend to score lower in human-aligned evaluations. This reveals a mismatch between lexical overlap and actual summary quality.
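To make the gap between lexical overlap and summary quality concrete, the sketch below computes a simplified ROUGE-1 F1 score (lowercased whitespace tokens, no stemming) on three invented sentences. This is not the paper's evaluation code, and the sentences are toy examples rather than ChartDiff data; the function name `rouge1_f1` is an assumption for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy sentences (invented): the first candidate reverses the trends,
# the second states the correct comparison in different words.
reference = "sales rise steadily in chart a but fall sharply in chart b"
wrong_but_similar = "sales fall steadily in chart a but rise sharply in chart b"
right_but_reworded = "chart a shows steady growth while chart b shows a steep decline"

score_wrong = rouge1_f1(wrong_but_similar, reference)    # 1.0
score_right = rouge1_f1(right_but_reworded, reference)   # about 0.33
```

The factually reversed summary uses exactly the same words as the reference and so receives a perfect unigram score, while the correct paraphrase scores about 0.33. A metric based purely on word overlap can therefore rank a wrong summary above a right one.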
Key Findings
Several significant findings emerged from our analysis:
- Multi-series charts continue to pose challenges across all model families.
- Strong end-to-end models exhibit relative robustness to variations in plotting libraries.
- The comparative reasoning inherent in multi-chart analysis remains a significant challenge for current vision-language models.
Implications for Future Research
Our findings position ChartDiff as a benchmark for advancing research on multi-chart understanding. Improving comparative chart reasoning will require addressing the challenges it exposes, in particular multi-series charts and reasoning across charts.
Conclusion
In summary, ChartDiff extends the evaluation of chart comprehension from single charts to chart pairs. By providing a large-scale dataset focused on comparative reasoning, we hope to support the development of models that interpret and summarize complex visual data more effectively.
