MiroEval: Benchmarking Multimodal Research Agents Effectively

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Source: arXiv:2603.28407v1

Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves.

To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting.

Evaluation Framework

The proposed evaluation suite assesses deep research systems along three complementary dimensions:

Adaptive synthesis quality evaluation: This dimension uses task-specific rubrics to assess the quality of synthesized information.
Agentic factuality verification: Systems are evaluated based on their ability to actively retrieve and reason over both web sources and multimodal attachments.
Process-centric evaluation: This dimension audits how the system searches, reasons, and refines its approach over the course of its investigation.
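To make the three dimensions concrete, here is a minimal Python sketch of how one system's scores on one task might be recorded and inspected. It is illustrative only: the class name `DimensionScores`, the 0-100 scale, and the unweighted mean in `overall()` are assumptions, since this summary does not say how MiroEval scales or combines the dimensions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScores:
    """Scores for one system on one task (hypothetical 0-100 scale)."""

    synthesis: float   # adaptive synthesis quality (task-specific rubric)
    factuality: float  # agentic factuality verification
    process: float     # process-centric evaluation

    def overall(self) -> float:
        # Assumption: unweighted mean. The source does not state how
        # the three dimensions are aggregated.
        return mean([self.synthesis, self.factuality, self.process])

    def weakest(self) -> str:
        # Name of the lowest-scoring dimension, as a simple diagnostic.
        scores = {
            "synthesis": self.synthesis,
            "factuality": self.factuality,
            "process": self.process,
        }
        return min(scores, key=scores.get)
```

Keeping the dimensions as separate fields, rather than storing only a combined number, is what lets a report surface a weak dimension that an aggregate score would hide.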

Key Findings

Evaluation across 13 systems yields three principal findings:

The three evaluation dimensions capture complementary aspects of system capability, revealing distinct strengths and weaknesses across different systems.
Process quality serves as a reliable predictor of overall outcome while highlighting weaknesses that might not be visible through output-level metrics.
Multimodal tasks present substantially greater challenges, with most systems experiencing a decline of 3 to 10 points in performance.
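The second and third findings correspond to two simple computations: a correlation between process and outcome scores across systems, and a per-system gap between text-only and multimodal scores. The sketch below shows one plausible way to compute both. The function names are hypothetical, and Pearson correlation is an assumed choice; this summary does not specify which statistic the authors used.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def modality_drop(text_score: float, multimodal_score: float) -> float:
    """Points lost when moving from text-only to multimodal tasks."""
    return text_score - multimodal_score
```

Given one process score and one outcome score per system, `pearson(process_scores, outcome_scores)` close to 1 would be consistent with process quality predicting outcome. A positive `modality_drop` for a system indicates that it scored lower on multimodal tasks than on text-only ones.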

Top Performers

The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both the text-only and multimodal settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework.

Conclusion

MiroEval provides a holistic diagnostic tool for the next generation of deep research agents. By evaluating the research process as well as the final report, on tasks grounded in real user needs, it aims to bring the evaluation of deep research systems closer to how they are actually used.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
