Optimizing Visual Demonstration Selection for MLLMs

Learning to Select Visual In-Context Demonstrations

Source: arXiv:2603.26775v1 (announce type: cross)

Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets.

Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.

Introduction

Multimodal Large Language Models (MLLMs) use in-context learning (ICL) to adapt to visual tasks: the model is shown a handful of demonstrations alongside the query and infers the task from them. Demonstration quality therefore has a large effect on performance. The dominant selection strategy is unsupervised k-Nearest Neighbor (kNN) search. It is simple and widely used, but it falls short on complex factual regression tasks.

The Limitations of kNN

The kNN approach selects the examples most similar to the input. This works well for subjective preference tasks, but on complex factual regression tasks it tends to pick redundant examples that fail to cover the task's full output range. When the demonstrations cluster in a narrow band of outputs, the model gets little information about where the regression boundaries lie, and performance suffers.
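The redundancy described above can be illustrated with a minimal sketch of similarity-first selection. The two-dimensional embeddings, the labels, and the helper names `cosine` and `knn_select` are hypothetical stand-ins chosen for illustration; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_select(query, pool, k):
    """Return the k (embedding, label) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]

# Toy pool (assumed data): three near-duplicates of the query with similar
# labels, plus two less similar items that carry the extreme labels.
pool = [
    ([1.0, 0.0], 5.0),
    ([0.9, 0.1], 5.2),
    ([0.95, 0.05], 4.9),
    ([0.5, 0.5], 1.0),
    ([0.4, 0.6], 9.0),
]
query = [1.0, 0.02]

selected = knn_select(query, pool, k=3)
selected_labels = [label for _, label in selected]
label_spread = max(selected_labels) - min(selected_labels)
```

In this toy setup the three selected labels all sit close to 5, so the demonstrations say nothing about the low and high ends of the label range, even though the pool contains examples at 1.0 and 9.0.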

Introducing Learning to Select Demonstrations (LSD)

To address these limitations, we propose Learning to Select Demonstrations (LSD). LSD reframes selection as a sequential decision-making problem: a Reinforcement Learning (RL) agent builds the demonstration set step by step and is trained to maximize the MLLM's downstream performance.
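One way to picture the sequential framing is as an episode in which the state is the query plus the demonstrations chosen so far, each action adds one unused candidate, and the reward reflects downstream performance once the set is complete. The sketch below follows that reading under stated assumptions: `run_episode`, `toy_q_value`, and `toy_reward` are hypothetical names, the value function is hand-written rather than learned, and the "MLLM" is a stub that predicts the mean of the demonstration labels.

```python
def run_episode(query, pool_size, k, q_value, reward_fn):
    """Build a demonstration set one pick at a time.

    State  = (query, indices chosen so far)
    Action = index of a not-yet-chosen pool item
    Reward = reward_fn(query, chosen), computed once the set is complete
    """
    chosen, remaining = [], list(range(pool_size))
    for _ in range(min(k, pool_size)):
        state = (query, tuple(chosen))
        # Greedy policy: take the action with the highest estimated value.
        best = max(remaining, key=lambda a: q_value(state, a))
        chosen.append(best)
        remaining.remove(best)
    return chosen, reward_fn(query, chosen)

# Assumed toy data: labels of the candidate demonstrations.
pool_labels = [5.0, 5.2, 4.9, 1.0, 9.0]

def toy_q_value(state, action):
    """Hand-written stand-in for a learned Q-function: prefer candidates
    whose label is far from every label already chosen."""
    _, chosen = state
    if not chosen:
        return 0.0
    return min(abs(pool_labels[action] - pool_labels[c]) for c in chosen)

def toy_reward(query, chosen):
    """Stub for downstream performance: negative absolute error of a fake
    'MLLM' that predicts the mean of the demonstration labels."""
    prediction = sum(pool_labels[c] for c in chosen) / len(chosen)
    return -abs(prediction - query["label"])

chosen, reward = run_episode({"label": 5.0}, len(pool_labels), 3,
                             toy_q_value, toy_reward)
```

In a trained agent the hand-written `toy_q_value` would be replaced by the learned network, and the reward would come from actually querying the MLLM with the selected demonstrations.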

Methodology

Our agent is a Dueling Deep Q-Network (DQN) built on a query-centric Transformer Decoder. It learns a selection policy that maximizes the downstream performance of the MLLM, particularly on factual regression tasks, by balancing visual relevance against diversity among the selected examples.
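For readers unfamiliar with the dueling architecture, the sketch below shows the commonly used aggregation step, in which a scalar state value and per-action advantages are combined into Q-values by subtracting the mean advantage. This is the standard dueling formulation; whether LSD uses exactly this variant is an assumption, and the function name `dueling_q_values` and the numbers are illustrative only.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value V(s) and advantages A(s, a) into Q(s, a).

    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')
    Subtracting the mean keeps V and A identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

# Assumed toy numbers: one state, three candidate demonstrations.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
best_action = max(range(len(q)), key=lambda i: q[i])
```

In a selection setting, each action would correspond to one candidate demonstration, and the agent would pick the candidate with the highest Q-value at each step.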

Evaluation and Results

We evaluated LSD on five visual regression benchmarks. The results reveal a clear dichotomy between kNN and LSD:

  • kNN remains optimal for subjective preference tasks.
  • LSD significantly outperforms baselines on objective, factual regression tasks.

These findings indicate when learned selection is necessary: on tasks where the demonstrations must capture the full output range for the model to perform well.
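To make "capturing the full output range" concrete, one simple diagnostic is the fraction of the task's label range spanned by a demonstration set. This is a hypothetical measure introduced here for illustration, not a metric reported in the paper; the function name `range_coverage`, the label sets, and the 0 to 10 range are all assumptions.

```python
def range_coverage(demo_labels, label_min, label_max):
    """Fraction of the task's label range spanned by the demonstrations."""
    if not demo_labels or label_max <= label_min:
        return 0.0
    span = max(demo_labels) - min(demo_labels)
    return span / (label_max - label_min)

# Assumed label sets on a 0-10 scale.
similar_set = [4.9, 5.0, 5.2]   # what a similarity-first selector might return
diverse_set = [1.0, 5.0, 9.0]   # a set that brackets the output range

cov_similar = range_coverage(similar_set, 0.0, 10.0)
cov_diverse = range_coverage(diverse_set, 0.0, 10.0)
```

A set with low coverage gives the model examples from only a narrow slice of possible outputs, which is the failure mode attributed to kNN on factual regression tasks.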

Conclusion

In conclusion, Learning to Select Demonstrations (LSD) replaces similarity-only kNN selection with a learned policy that balances visual relevance with diversity. It better defines regression boundaries on objective, factual regression tasks, while kNN remains the better choice for subjective preference tasks. The results clarify when learned selection is necessary for visual ICL.


Lazarus Omolua (https://richlyai.com/blog)