MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
arXiv:2604.13418v1
Search-augmented artificial intelligence (AI) agents still struggle with complex web searches. In response, researchers have introduced MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a benchmark designed to evaluate how well these agents navigate the noisy, multimodal landscape of the internet.
Introduction to MERRIN
MERRIN aims to address the shortcomings of traditional search systems that often struggle with underspecified, multi-hop queries. The benchmark evaluates agents based on their ability to:
- Identify relevant modalities in search queries.
- Retrieve multimodal evidence, including text, video, and audio.
- Perform multi-hop reasoning across noisy and often conflicting web sources.
MERRIN differs from previous benchmarks in two ways. First, its natural language queries carry no explicit modality cues, which reflects how users actually search. Second, it incorporates underexplored modalities, so agents cannot succeed by relying on a single type of source.
Evaluation Methodology
The evaluation covers a diverse set of search agents powered by ten different models. These include strong closed-source models, such as GPT-5.4-mini and Gemini 3/3.1 Flash/Pro, and open-weight models such as Qwen3-4B/30B/235B. The agents are tested in three distinct search settings:
- No search
- Native search
- Agentic search
Initial results show that MERRIN is highly challenging. Average accuracy across all agents is 22.3%, and the best-performing agent reaches only 40.1%. These findings underscore how difficult it is to retrieve and reason over multimodal information in noisy environments.
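To make the aggregation concrete, the following is a minimal sketch of how per-agent accuracy, the average across agents, and the best agent could be computed from graded answers. It is not the benchmark's evaluation code: the `Result` record, the setting labels, the model names, and the toy data are all hypothetical, and it assumes that an "agent" is a (model, search setting) pair and that the reported average is an unweighted mean over agents.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed labels for the three search settings described above.
SETTINGS = ("no_search", "native_search", "agentic_search")


@dataclass(frozen=True)
class Result:
    """One graded answer (hypothetical record layout)."""
    model: str      # placeholder model label, not a real model name
    setting: str    # one of SETTINGS
    correct: bool   # whether the final answer was judged correct


def accuracy_by_agent(results):
    """Return {(model, setting): accuracy} for each agent configuration."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.setting not in SETTINGS:
            raise ValueError(f"unknown setting: {r.setting}")
        key = (r.model, r.setting)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


def summarize(results):
    """Return (mean accuracy over agents, best agent key, best accuracy)."""
    per_agent = accuracy_by_agent(results)
    mean_acc = sum(per_agent.values()) / len(per_agent)
    best_key = max(per_agent, key=per_agent.get)
    return mean_acc, best_key, per_agent[best_key]


# Toy data, invented purely for illustration.
toy = [
    Result("model-a", "no_search", False),
    Result("model-a", "no_search", False),
    Result("model-a", "agentic_search", True),
    Result("model-a", "agentic_search", False),
    Result("model-b", "native_search", True),
    Result("model-b", "native_search", True),
]

mean_acc, best_key, best_acc = summarize(toy)
print(f"mean accuracy over agents: {mean_acc:.3f}")
print(f"best agent: {best_key} at {best_acc:.3f}")
```

Averaging per agent rather than per question keeps an agent that answered more questions from dominating the mean. Whether the reported 22.3% is computed this way is an assumption of the sketch.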
Key Findings
Further analysis shows that more advanced agents, such as Gemini Deep Research, perform only slightly better. These agents tend to over-explore, taking more steps and using more tools. This extra exploration often leaves them distracted by conflicting or partially relevant web content, which results in incorrect answers.
Compared with humans, AI agents consume more resources yet achieve lower accuracy, primarily because of inefficient source selection and an overreliance on text-based modalities. These observations highlight the need for search agents that can search and reason robustly across diverse modalities in noisy web environments.
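One way to put numbers on this comparison is to profile each group of search trajectories by accuracy, effort, and the share of consulted sources that are text. The sketch below is illustrative only: the `Trajectory` fields, the use of step and tool-call counts as the measure of "resources", and all toy values are assumptions, not the benchmark's actual logging format or data.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trajectory:
    """One search episode (hypothetical record layout)."""
    steps: int          # number of reasoning/search steps taken
    tool_calls: int     # number of tool invocations
    correct: bool       # whether the final answer was judged correct
    modalities: list    # modality label of each consulted source


def profile(trajectories):
    """Summarize accuracy, effort, and text reliance for a group."""
    counts = Counter(m for t in trajectories for m in t.modalities)
    total_sources = sum(counts.values())
    return {
        "accuracy": mean(int(t.correct) for t in trajectories),
        "mean_steps": mean(t.steps for t in trajectories),
        "mean_tool_calls": mean(t.tool_calls for t in trajectories),
        # Fraction of consulted sources that were text.
        "text_share": counts["text"] / total_sources if total_sources else 0.0,
    }


# Toy trajectories, invented purely for illustration.
agent_runs = [
    Trajectory(12, 9, False, ["text", "text", "text", "video"]),
    Trajectory(8, 5, True, ["text", "text"]),
]
human_runs = [
    Trajectory(4, 3, True, ["video", "text"]),
    Trajectory(6, 3, True, ["audio", "text", "text"]),
]

agent_profile = profile(agent_runs)
human_profile = profile(human_runs)
print("agent:", agent_profile)
print("human:", human_profile)
```

A profile with more steps and tool calls, a higher text share, and lower accuracy would match the pattern described above.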
Conclusion
MERRIN serves as a testbed for evaluating AI agents on multimodal evidence retrieval and reasoning. The benchmark exposes the limitations of current search agents and points towards future directions for research and development.
