MiroEval: Benchmarking Multimodal Research Agents Effectively

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Source: arXiv:2603.28407v1

Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves.

To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting.

Evaluation Framework

The proposed evaluation suite assesses deep research systems along three complementary dimensions:

  • Adaptive synthesis quality evaluation: This dimension uses task-specific rubrics to assess the quality of synthesized information.
  • Agentic factuality verification: The factual claims in a system's output are verified through active retrieval and reasoning over both web sources and multimodal attachments.
  • Process-centric evaluation: This dimension audits how the system searches, reasons, and refines its approach throughout its investigation.
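
One way to picture the three dimensions is as a per-task score record. The sketch below is illustrative only: the class name, the 0 to 100 scale, and the unweighted mean used for the overall score are assumptions, since the summary does not say how MiroEval aggregates its dimensions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical record: one system's scores on one task (assumed 0-100 scale)."""
    synthesis: float   # adaptive synthesis quality (task-specific rubric)
    factuality: float  # agentic factuality verification
    process: float     # process-centric evaluation

    def overall(self) -> float:
        # Assumption: unweighted mean. The source does not state the aggregation.
        return mean([self.synthesis, self.factuality, self.process])

    def weakest(self) -> str:
        # Name of the lowest-scoring dimension, used here as a simple diagnostic.
        scores = {
            "synthesis": self.synthesis,
            "factuality": self.factuality,
            "process": self.process,
        }
        return min(scores, key=scores.get)


# Made-up numbers for illustration, not results from the paper.
example = DimensionScores(synthesis=80.0, factuality=65.0, process=71.0)
print(example.overall(), example.weakest())
```

Keeping the dimensions separate, rather than storing only a single overall number, is what lets a record like this show which aspect of a system is lagging.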

Key Findings

Evaluation across 13 systems yields three principal findings:

  • The three evaluation dimensions capture complementary aspects of system capability, revealing distinct strengths and weaknesses across different systems.
  • Process quality serves as a reliable predictor of overall outcome while highlighting weaknesses that might not be visible through output-level metrics.
  • Multimodal tasks present substantially greater challenges, with most systems experiencing a decline of 3 to 10 points in performance.

Top Performers

The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both the text-only and multimodal settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework.

Conclusion

MiroEval provides a holistic diagnostic tool for the next generation of deep research agents. By addressing the shortcomings of existing evaluation methods, it aims to align deep research systems more closely with real user needs and to make their research processes more effective and reliable.


Lazarus Omolua (https://richlyai.com/blog)