MAny: Merge Anything for Multimodal Continual Instruction Tuning
Summary: arXiv:2604.14016v1 Announce Type: cross
Abstract: Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work we expose a critical yet neglected dual-forgetting phenomenon: perception drift in the Cross-modal Projection Space and reasoning collapse in the Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM).
Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability.
Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks.
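The recursive least squares idea that the abstract attributes to LPM can be illustrated with a generic sketch. This is not the paper's merging rule: the abstract does not specify how the recursion is applied to the low-rank matrices, so the code below only shows the standard block recursive least squares update on synthetic data. The function names (rls_init, rls_update, demo), the ridge term, and the data shapes are all assumptions made for illustration. What the sketch does show is that folding in one block of data at a time, using only CPU linear algebra and no gradients, reproduces the batch closed-form solution.

```python
import numpy as np


def rls_init(dim_in, dim_out, ridge=1e-2):
    """Start from W = 0 and P = (ridge * I)^-1 (hypothetical initialisation)."""
    W = np.zeros((dim_in, dim_out))
    P = np.eye(dim_in) / ridge
    return W, P


def rls_update(W, P, X, Y):
    """Fold one block of data (X, Y) into the running least-squares solution."""
    n = X.shape[0]
    # Woodbury identity: only an n x n linear system is solved per block.
    S = np.eye(n) + X @ P @ X.T
    # K = P X^T S^-1; P and S are symmetric, so K^T = S^-1 X P.
    K = np.linalg.solve(S, X @ P).T
    W = W + K @ (Y - X @ W)   # correct W by the prediction error on the new block
    P = P - K @ X @ P         # update the inverse regularised Gram matrix
    return W, P


def demo(seed=0, ridge=1e-2):
    """Compare the recursive solution with the batch ridge solution."""
    rng = np.random.default_rng(seed)
    d, k = 8, 3
    # Three synthetic "tasks", each a block of inputs and targets (assumed data).
    blocks = [(rng.normal(size=(20, d)), rng.normal(size=(20, k)))
              for _ in range(3)]

    W, P = rls_init(d, k, ridge)
    for X, Y in blocks:
        W, P = rls_update(W, P, X, Y)

    X_all = np.vstack([X for X, _ in blocks])
    Y_all = np.vstack([Y for _, Y in blocks])
    W_batch = np.linalg.solve(ridge * np.eye(d) + X_all.T @ X_all,
                              X_all.T @ Y_all)
    return W, W_batch


if __name__ == "__main__":
    W_rec, W_batch = demo()
    print("max abs difference:", np.abs(W_rec - W_batch).max())
```

Because each update is exact rather than approximate, the order-by-order result matches what a single solve over all blocks would give, which is the sense in which a recursive closed-form solution avoids interference between sequentially added blocks in this toy setting.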
Key Features of MAny
- Cross-modal Projection Merging (CPM): Recovers perceptual alignment by adaptively merging cross-modal visual representations under visual-prototype guidance.
- Low-rank Parameter Merging (LPM): Removes mutual interference among task-specific low-rank modules by recursively merging their weight matrices, using a closed-form recursive least squares solution.
- Training-free Operation: Merges knowledge without any gradient-based optimization beyond the initial tuning.
- Efficiency: Performs the merging with efficient CPU-based algebraic operations.
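One way to picture prototype-guided merging of projector outputs is the following minimal sketch. It is an assumption-heavy illustration, not the CPM algorithm from the paper: the abstract does not say how prototypes are built or how merging weights are computed. Here a prototype is assumed to be the normalised mean of a task's visual features, the weights are assumed to be a softmax over cosine similarities, and build_prototype, merge_weights, merged_projection, and the temperature parameter are hypothetical names.

```python
import numpy as np


def build_prototype(features):
    """Normalised mean of one task's visual features (assumed prototype form)."""
    proto = features.mean(axis=0)
    return proto / np.linalg.norm(proto)


def merge_weights(feature, prototypes, temperature=0.1):
    """Softmax over cosine similarities between a feature and each prototype."""
    f = feature / np.linalg.norm(feature)
    logits = (prototypes @ f) / temperature
    logits = logits - logits.max()      # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def merged_projection(feature, prototypes, projectors, temperature=0.1):
    """Weighted sum of the task-specific projector outputs for one feature."""
    w = merge_weights(feature, prototypes, temperature)
    outputs = np.stack([feature @ W for W in projectors])  # (tasks, d_out)
    return w @ outputs, w


def demo(seed=0):
    """Two synthetic tasks; the query feature is drawn near task 1."""
    rng = np.random.default_rng(seed)
    d_in, d_out = 16, 4
    centers = np.zeros((2, d_in))
    centers[0, 0] = 5.0
    centers[1, 1] = 5.0
    prototypes = np.stack([
        build_prototype(c + 0.1 * rng.normal(size=(50, d_in))) for c in centers
    ])
    # Stand-ins for per-task cross-modal projectors (random, for illustration).
    projectors = [rng.normal(size=(d_in, d_out)) for _ in range(2)]
    query = centers[1] + 0.1 * rng.normal(size=d_in)
    return merged_projection(query, prototypes, projectors)


if __name__ == "__main__":
    merged, weights = demo()
    print("merging weights:", weights)
```

Everything here is plain matrix arithmetic on the CPU, with no gradient step, which matches the training-free character described in the list above even though the specific weighting rule is only a guess.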
Performance Evaluation
On the UCIT benchmark, MAny leads state-of-the-art methods in final average accuracy by up to 8.57% and 2.85% on two different MLLMs, respectively.
In conclusion, MAny addresses dual forgetting in Multimodal Continual Instruction Tuning by merging task-specific knowledge at two levels, the cross-modal projection and the low-rank parameters, without further gradient-based training after the initial tuning.
