LMM-Searcher: Advanced Long-Horizon Multimodal Search AI

Towards Long-horizon Agentic Multimodal Search

Source: arXiv:2604.12890v1 (cross-listed)

Abstract: Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals.

Introduction

Multimodal deep search agents tackle complex tasks by iteratively gathering textual and visual evidence. Their main obstacle is scale: heterogeneous text and image inputs are expensive in tokens, and over long interactions existing methods either let the context explode or drop crucial visual signals. This article introduces a framework designed to address both problems.

The LMM-Searcher Framework

We propose the Long-horizon MultiModal deep search framework, named LMM-Searcher. It manages multimodal inputs through a file-based visual representation mechanism. Its key features are:

  • File-based Visual Representation: By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), we significantly reduce context overhead while preserving essential multimodal information for future access.
  • On-demand Visual Loading: The agent is equipped with a tailored fetch-image tool, enabling a progressive visual loading strategy that supports active perception. The agent retrieves visual data only when it needs it, so image tokens stay out of the context until then.
  • Data Synthesis Pipeline: We introduce a data synthesis pipeline that generates queries requiring complex cross-modal multi-hop reasoning. These queries are used to train the agent on such multi-step questions.
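
The first two features can be illustrated with a minimal sketch. It is not the authors' implementation: the class name VisualStore, the UID format, the [image:UID] placeholder syntax, and the function render_observation are all hypothetical, chosen only to show how image bytes can live on disk while the context carries a short identifier.

```python
import hashlib
import tempfile
from pathlib import Path


class VisualStore:
    """Hypothetical file-backed store: image bytes live on disk, the context holds only UIDs."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        """Write the image to disk and return a short textual identifier (assumed format)."""
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """Stand-in for the fetch-image tool: load the image only when the agent asks for it."""
        # Reject anything that is not a plain identifier, so a UID cannot escape the store directory.
        if not uid.startswith("img_") or not uid[4:].isalnum():
            raise KeyError(f"malformed image UID: {uid}")
        path = self.root / uid
        if not path.is_file():
            raise KeyError(f"unknown image UID: {uid}")
        return path.read_bytes()


def render_observation(text: str, image_blobs: list[bytes], store: VisualStore) -> str:
    """Replace each image in a tool result with a UID placeholder before it enters the context."""
    uids = [store.offload(blob) for blob in image_blobs]
    refs = " ".join(f"[image:{uid}]" for uid in uids)
    return f"{text} {refs}".strip()


if __name__ == "__main__":
    store = VisualStore(Path(tempfile.mkdtemp()))
    obs = render_observation("Search result page.", [b"\x89PNG fake bytes"], store)
    print(obs)  # the text plus one [image:img_...] placeholder
    uid = obs.split("[image:")[1].rstrip("]")
    assert store.fetch_image(uid) == b"\x89PNG fake bytes"
```

Hashing the image content to derive the UID is one possible design choice: identical images seen on different turns map to the same file and the same identifier. The paper summary does not say how UIDs are generated.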

Training and Performance

We distilled 12,000 high-quality trajectories and used them to fine-tune the Qwen3-VL-Thinking-30A3B model as a multimodal search agent. Extensive experiments across four benchmarks show that:

  • The LMM-Searcher framework successfully scales to search horizons of up to 100 turns.
  • Our method achieved state-of-the-art performance among open-source models on challenging long-horizon benchmarks, including MM-BrowseComp and MMSearch-Plus.
  • The framework exhibited strong generalizability across different base models, showcasing its versatility and robustness.
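
To see why replacing images with identifiers matters at a 100-turn horizon, here is a rough back-of-the-envelope sketch. Every number in it is an assumption made for illustration (text tokens per turn, images per turn, tokens per inlined image, tokens per UID, number of images the agent actually fetches); none comes from the paper. It also assumes a fetched image stays in the context once loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnProfile:
    """Assumed per-turn costs; illustrative values, not measurements."""
    text_tokens: int = 300         # assumed text returned per turn
    images: int = 3                # assumed images returned per turn
    tokens_per_image: int = 1000   # assumed cost of one inlined image
    tokens_per_uid: int = 10       # assumed cost of one UID placeholder


def inline_context(turns: int, p: TurnProfile) -> int:
    """Cumulative context size when every image is inlined on every turn."""
    return turns * (p.text_tokens + p.images * p.tokens_per_image)


def uid_context(turns: int, p: TurnProfile, fetched_images: int) -> int:
    """Cumulative context size when images become UIDs and only some are fetched."""
    placeholders = turns * (p.text_tokens + p.images * p.tokens_per_uid)
    return placeholders + fetched_images * p.tokens_per_image


if __name__ == "__main__":
    profile = TurnProfile()
    print(inline_context(100, profile))                  # 330000
    print(uid_context(100, profile, fetched_images=20))  # 53000
```

Under these assumed numbers, 100 turns cost 330,000 tokens with inlined images and 53,000 tokens with UIDs plus 20 fetched images. The real ratio depends on the image encoder and on how often the agent chooses to load an image.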

Conclusion

LMM-Searcher targets two problems in long-horizon multimodal search: managing context and retaining visual signals. By keeping images on disk and loading them on demand, it improves performance on long-horizon benchmarks and offers a basis for agents that handle more complex multimodal tasks. For those interested in exploring LMM-Searcher further, our code will be made available at https://github.com/RUCAIBox/LMM-Searcher.


Lazarus Omolua (https://richlyai.com/blog)