SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite this progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis scenarios.
To address this need, researchers have introduced SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. It organizes SciVis tasks within a structured framework so that developers and researchers can assess the capabilities of their systems more systematically.
Key Features of SciVisAgentBench
SciVisAgentBench is grounded in a structured taxonomy that spans four critical dimensions:
- Application Domain: Different fields of scientific inquiry that require specific visualization techniques.
- Data Type: The nature of the data being analyzed, such as numerical, categorical, or temporal data.
- Complexity Level: The intricacy of the analysis and visualization tasks, ranging from simple to complex scenarios.
- Visualization Operation: The specific types of visualizations used, including charts, graphs, and interactive visual displays.
The benchmark currently consists of 108 expert-crafted cases covering a diverse range of SciVis scenarios, chosen to reflect the complexity of real-world data analysis and visualization tasks.
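As a rough illustration of how a four-dimension taxonomy can be used to organize and audit a case collection, the sketch below tags each case with one value per dimension and counts coverage along any dimension. The class name, field names, case IDs, and dimension values are all hypothetical placeholders; they are not taken from the benchmark's actual schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """Hypothetical record: one case tagged along the four taxonomy dimensions."""
    case_id: str
    application_domain: str
    data_type: str
    complexity_level: str
    visualization_operation: str


def coverage(cases, dimension):
    """Count how many cases fall under each value of one taxonomy dimension."""
    return Counter(getattr(case, dimension) for case in cases)


# Placeholder cases; the values are invented for illustration only.
cases = [
    BenchmarkCase("demo-1", "domain-a", "type-x", "simple", "op-1"),
    BenchmarkCase("demo-2", "domain-a", "type-y", "complex", "op-2"),
    BenchmarkCase("demo-3", "domain-b", "type-x", "simple", "op-1"),
]

for dim in ("application_domain", "data_type", "complexity_level"):
    print(dim, dict(coverage(cases, dim)))
```

A coverage count of this kind is one simple way to check whether a collection of cases is spread across the taxonomy or concentrated in a few categories.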
Evaluation Pipeline
To support reliable assessment of SciVis agents, the benchmark introduces a multimodal, outcome-centric evaluation pipeline. The pipeline combines LLM-based judging with deterministic evaluators, which include:
- Image-based Metrics: Quantitative measures that evaluate the quality of generated visualizations.
- Code Checkers: Tools that assess the correctness and efficiency of the underlying code used in data analysis.
- Rule-based Verifiers: Systems that verify if the outputs meet predefined criteria.
- Case-specific Evaluators: Tailored assessments based on the specific requirements of each SciVis case.
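The idea of combining an LLM judge with deterministic evaluators can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the function names, the use of mean squared error as the image metric, the "non-blank image" rule, the fixed stub score standing in for an LLM judge, and the weighted-mean aggregation are all assumptions made for the example.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale pixels in [0, 1], row-major


def image_similarity(result: Image, reference: Image) -> float:
    """Image-based metric (assumed): 1 - mean squared error over pixels."""
    diffs = [
        (a - b) ** 2
        for row_res, row_ref in zip(result, reference)
        for a, b in zip(row_res, row_ref)
    ]
    return 1.0 - sum(diffs) / len(diffs)


def rule_nonblank(result: Image) -> float:
    """Rule-based verifier (assumed): the rendered image must not be all zeros."""
    return 1.0 if any(p > 0 for row in result for p in row) else 0.0


def stub_llm_judge(result: Image) -> float:
    """Placeholder for an LLM judge; a real one would call a model."""
    return 0.5


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-evaluator scores (aggregation rule is an assumption)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


result = [[0.0, 1.0], [1.0, 0.0]]
reference = [[0.0, 1.0], [1.0, 1.0]]

scores = {
    "image": image_similarity(result, reference),
    "rule": rule_nonblank(result),
    "llm": stub_llm_judge(result),
}
weights = {"image": 1.0, "rule": 1.0, "llm": 1.0}
final_score = aggregate(scores, weights)
print(scores, final_score)
```

Keeping each evaluator as a separate function that returns a score in [0, 1] makes it straightforward to add case-specific checks or to report per-evaluator results alongside the aggregate.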
In addition, a validity study with 12 SciVis experts examined the agreement between human judges and LLM judges, as a check on the reliability of the LLM-based part of the evaluation.
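Agreement between two sets of judgments is often summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa for two raters over categorical labels. The source does not state which agreement measure the study used, so the choice of kappa is an assumption, and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for four items; not data from the study.
human_labels = ["pass", "pass", "fail", "fail"]
llm_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, llm_labels)
print(kappa)  # 0.5
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.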
Initial Findings and Future Directions
Using SciVisAgentBench, the researchers evaluated representative SciVis agents as well as general-purpose coding agents to establish initial performance baselines and identify capability gaps. The benchmark is intended as a living resource that supports systematic comparison, helps diagnose failure modes, and drives progress in agentic SciVis.
SciVisAgentBench is publicly available at https://scivisagentbench.github.io/. Researchers and practitioners are encouraged to use it to evaluate SciVis agents and to contribute to the ongoing advancement of scientific data analysis and visualization.
