Learning to Select Visual In-Context Demonstrations
Summary: arXiv:2603.26775v1 (cross-listed)
Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task’s full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets.
Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
Introduction
Multimodal Large Language Models (MLLMs) adapt to visual tasks through in-context learning (ICL), in which demonstrations placed in the prompt steer the model without any weight updates. How well this works depends heavily on which demonstrations are chosen. The dominant selection strategy is unsupervised k-Nearest Neighbor (kNN) search, which is simple but has clear limitations on complex factual regression tasks.
The Limitations of kNN
kNN selects the demonstrations most similar to the query. This similarity-first strategy remains effective for subjective preference tasks, but on complex factual regression tasks it tends to pick redundant examples that fail to cover the task's full output range. Because the selected demonstrations are redundant, they define the regression boundaries poorly, which leads to suboptimal performance on factual regression tasks.
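To make the similarity-first baseline concrete, here is a minimal sketch of kNN demonstration selection by cosine similarity. The function name `knn_select`, the random vectors standing in for image embeddings, and the toy labels are all assumptions for illustration; the source does not specify the encoder or the similarity measure.

```python
import numpy as np

def knn_select(query_emb, pool_embs, k):
    """Return indices of the k pool items most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    # argsort is ascending, so take the last k and reverse: most similar first.
    return np.argsort(sims)[-k:][::-1]

# Toy data (hypothetical): random vectors stand in for image embeddings.
rng = np.random.default_rng(0)
pool_embs = rng.normal(size=(50, 8))
pool_labels = rng.uniform(0, 100, size=50)  # regression targets

# A query that is a slightly perturbed copy of pool item 3.
query = pool_embs[3] + 0.01 * rng.normal(size=8)
idx = knn_select(query, pool_embs, k=4)
print("selected indices:", idx)
print("selected labels:", pool_labels[idx])
```

Nothing in this procedure looks at the labels of the selected items, so it has no way to prefer a set that spans the output range over one that clusters around a single value.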
Introducing Learning to Select Demonstrations (LSD)
To address these limitations, we propose Learning to Select Demonstrations (LSD). LSD reframes selection as a sequential decision-making problem: rather than retrieving all demonstrations at once by similarity, a Reinforcement Learning (RL) agent constructs the demonstration set step by step, and it is trained so that the resulting set maximizes the MLLM's downstream performance.
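The sequential framing can be sketched as a loop in which the state is the query plus the demonstrations chosen so far, and each action adds one more candidate. The code below is a sketch under stated assumptions, not the authors' implementation: `select_sequentially` and `toy_q_fn` are hypothetical names, and `toy_q_fn` is a hand-written stand-in (relevance minus a redundancy penalty) for the Q-function that LSD would learn.

```python
import numpy as np

def select_sequentially(q_fn, query_emb, pool_embs, k):
    """Greedily build a demonstration set, one pick per step."""
    selected = []
    for _ in range(k):
        # Score every candidate given the query and the current partial set.
        q_values = np.array(q_fn(query_emb, pool_embs, selected), dtype=float)
        if selected:
            q_values[selected] = -np.inf  # mask items already chosen
        selected.append(int(np.argmax(q_values)))
    return selected

def toy_q_fn(query_emb, pool_embs, selected):
    """Hypothetical scoring rule: relevance minus similarity to prior picks."""
    relevance = pool_embs @ query_emb
    if not selected:
        return relevance
    redundancy = (pool_embs @ pool_embs[selected].T).max(axis=1)
    return relevance - 0.5 * redundancy

rng = np.random.default_rng(1)
pool = rng.normal(size=(20, 4))
query = rng.normal(size=4)
picks = select_sequentially(toy_q_fn, query, pool, k=3)
print(picks)
```

The point of the sketch is the dependence on `selected`: each score is conditioned on what has already been picked, which a single-shot kNN ranking cannot express.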
Methodology
LSD uses a Dueling Deep Q-Network (Dueling DQN) with a query-centric Transformer Decoder. The agent learns a selection policy that maximizes the MLLM's downstream performance, and in doing so it balances visual relevance to the query against diversity among the selected demonstrations.
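For readers unfamiliar with the dueling architecture, the sketch below shows the standard dueling aggregation, which combines a scalar state value with per-action advantages as Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). This is the generic formulation of a dueling head; how LSD computes the value and advantages with its Transformer Decoder is not described in the source, so the inputs here are plain numbers and the name `dueling_q_values` is an assumption.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-centred advantages."""
    advantages = np.asarray(advantages, dtype=float)
    # Subtracting the mean makes the V / A decomposition identifiable.
    return state_value + advantages - advantages.mean()

# One state value and three candidate actions (illustrative numbers).
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q)  # -> [3. 2. 1.]
```

In a selection setting, each action would correspond to one candidate demonstration, and the highest Q-value identifies the next demonstration to add.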
Evaluation and Results
We evaluated LSD on five visual regression benchmarks. The results reveal a dichotomy between task types:
- kNN remains optimal for subjective preference tasks.
- LSD significantly outperforms baselines on objective, factual regression tasks.
By balancing visual relevance with diversity, LSD better defines regression boundaries. These results indicate when learned selection is needed: in tasks where the demonstrations must cover the full output range.
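One simple way to see what "covering the output range" means is to compare the span of the selected labels with the span of the whole pool. The diagnostic below is an illustrative assumption, not a metric reported in the source; `label_range_coverage` and the toy label values are hypothetical.

```python
def label_range_coverage(selected_labels, pool_labels):
    """Fraction of the pool's label span covered by the selected labels."""
    pool_span = max(pool_labels) - min(pool_labels)
    if pool_span == 0:
        return 1.0  # every label is identical, so any selection covers it
    return (max(selected_labels) - min(selected_labels)) / pool_span

# Toy regression targets from 0 to 100 (illustrative values only).
pool_labels = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
clustered = [40.0, 50.0, 60.0]  # near-duplicate picks around one value
spread = [10.0, 50.0, 90.0]     # picks that span most of the range

clustered_cov = label_range_coverage(clustered, pool_labels)
spread_cov = label_range_coverage(spread, pool_labels)
print(clustered_cov, spread_cov)
```

A set with low coverage gives the model few anchor points at the extremes of the target range, which is the failure mode attributed to redundant selections.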
Conclusion
Learning to Select Demonstrations (LSD) replaces similarity-only retrieval with a learned, sequential selection policy for visual ICL. Across five visual regression benchmarks, it significantly outperforms baselines on objective, factual regression tasks, while kNN remains the better choice for subjective preference tasks. The main takeaway is therefore not that learned selection always wins, but that it is necessary when demonstrations must balance visual relevance with diversity to define regression boundaries.
