MERRIN: Benchmark for Multimodal AI Search in Noisy Web

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Summary: arXiv:2604.13418v1

AI agents that rely on web search face increasingly complex queries, and current systems handle them poorly. To measure the gap, researchers have introduced MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a benchmark for evaluating search-augmented AI agents. It tests how well these agents navigate the noisy, multimodal content of the open web.

Introduction to MERRIN

MERRIN targets underspecified, multi-hop queries, a class of questions that search systems often struggle with. The benchmark evaluates agents on their ability to:

  • Identify relevant modalities in search queries.
  • Retrieve multimodal evidence, including text, video, and audio.
  • Perform multi-hop reasoning across noisy and often conflicting web sources.

MERRIN differs from previous benchmarks in how it poses queries and in the evidence it requires. Its natural-language queries give no explicit modality cues, which reflects how people actually search: the agent must work out for itself whether the answer lies in text, video, or audio. MERRIN also incorporates underexplored modalities, so agents cannot rely on a single type of source.
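
To make the capabilities listed above concrete, the following is a minimal sketch of how a benchmark item and a modality-coverage check could be represented. Everything here is hypothetical: the class names, the field names, the sample query, and the example.com URLs are assumptions made for illustration, not MERRIN's actual data format.

```python
from dataclasses import dataclass, field


# Hypothetical schema for illustration only; not MERRIN's actual format.
@dataclass(frozen=True)
class Evidence:
    url: str
    modality: str  # e.g. "text", "video", "audio"


@dataclass
class BenchmarkItem:
    query: str   # natural-language question with no modality cue
    answer: str
    required_evidence: list = field(default_factory=list)


def missing_modalities(item, retrieved):
    """Return the modalities the item needs that the retrieved evidence lacks."""
    needed = {e.modality for e in item.required_evidence}
    found = {e.modality for e in retrieved}
    return needed - found


# Made-up item: the query never says that a video is needed to answer it.
item = BenchmarkItem(
    query="What instrument opens the keynote's intro music?",
    answer="piano",
    required_evidence=[
        Evidence("https://example.com/article", "text"),
        Evidence("https://example.com/keynote", "video"),
    ],
)

# An agent that only retrieved text leaves the video hop uncovered.
text_only = [Evidence("https://example.com/article", "text")]
print(missing_modalities(item, text_only))
```

A check like this only flags which modalities were never retrieved; it says nothing about whether the agent reasoned correctly over the evidence it did find.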

Evaluation Methodology

The evaluation of MERRIN involves a diverse set of search agents powered by ten different models. These include both strong closed-source models, such as GPT-5.4-mini and Gemini 3/3.1 Flash/Pro, and open-weight models like Qwen3-4B/30B/235B. The agents are tested across three distinct search settings:

  • No search
  • Native search
  • Agentic search

Initial results show that MERRIN is highly challenging. Average accuracy across all agents is just 22.3%, and the best-performing agent reaches only 40.1%. These findings underscore how difficult it is to retrieve and reason over multimodal information in noisy environments.
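
As a rough illustration of how headline numbers like these can be derived, the sketch below aggregates per-question outcomes into per-agent accuracy, the average across agents, and the best agent. It assumes an "agent" is a (model, setting) pair and that each answer has already been judged correct or incorrect; the record layout, the setting identifiers, and the toy data are hypothetical, and the paper's actual scoring protocol may differ.

```python
from collections import defaultdict
from statistics import mean

# Assumed identifiers for the three search settings described above.
SETTINGS = ("no_search", "native_search", "agentic_search")


def accuracy(records):
    """Fraction of records whose answer was judged correct."""
    return sum(r["correct"] for r in records) / len(records)


def summarize(results):
    """Aggregate judged answers into per-agent, average, and best accuracy.

    `results` is a list of dicts with keys "model", "setting", "correct".
    An agent is assumed to be a (model, setting) pair.
    """
    grouped = defaultdict(list)
    for r in results:
        grouped[(r["model"], r["setting"])].append(r)
    per_agent = {agent: accuracy(recs) for agent, recs in grouped.items()}
    best_agent = max(per_agent, key=per_agent.get)
    return {
        "per_agent": per_agent,
        "average": mean(per_agent.values()),
        "best": (best_agent, per_agent[best_agent]),
    }


def make_records(model, setting, outcomes):
    """Build toy records from a sequence of True/False outcomes."""
    return [{"model": model, "setting": setting, "correct": c} for c in outcomes]


# Toy data, not results from the paper.
toy = (
    make_records("model-a", "no_search", [True, False, False, False])
    + make_records("model-a", "agentic_search", [True, True, False, False])
    + make_records("model-b", "no_search", [False, False, False, False])
)
summary = summarize(toy)
print(f"average: {summary['average']:.1%}, best: {summary['best']}")
```

Averaging per-agent accuracies weights every agent equally regardless of how many questions it answered, which is one reasonable reading of "average accuracy across all agents" but not the only one.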

Key Findings

Further analysis shows that stronger agents, such as Gemini Deep Research, achieve only modest gains. These agents tend to over-explore: they take more steps and use more tools, but the extra exploration exposes them to conflicting or partially relevant web content, which distracts them and leads to incorrect answers.

Compared with humans, AI agents consume more resources yet achieve lower accuracy, primarily because of inefficient source selection and an overreliance on text-based modalities. These observations point to the need for search agents that can search and reason robustly across diverse modalities in noisy web environments.

Conclusion

MERRIN serves as a valuable testbed for evaluating the capabilities of AI agents in multimodal evidence retrieval and reasoning. The benchmark not only sheds light on the current limitations of existing technologies but also points towards future directions for research and development in the field of artificial intelligence.

Lazarus Omolua
https://richlyai.com/blog