Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of their input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms.
To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens.
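The idea of accounting for shallow layers during attention aggregation can be illustrated with a small sketch. It is not the authors' DVTIE implementation: the function names, the per-layer normalization, and the choice to down-weight the first layer are assumptions made purely for illustration. The input is assumed to be, for each layer, the attention mass that each visual token receives.

```python
def aggregate_importance(attn_per_layer, layer_weights):
    """Weighted average of the attention each visual token receives per layer.

    attn_per_layer: list of layers, each a list of non-negative scores,
                    one per visual token (hypothetical input format).
    layer_weights:  one weight per layer; small weights on shallow layers
                    reduce their influence (an assumed debiasing choice).
    """
    if len(attn_per_layer) != len(layer_weights):
        raise ValueError("need exactly one weight per layer")
    num_tokens = len(attn_per_layer[0])
    total_weight = sum(layer_weights)
    scores = [0.0] * num_tokens
    for layer_attn, weight in zip(attn_per_layer, layer_weights):
        layer_sum = sum(layer_attn)
        for i, a in enumerate(layer_attn):
            # Normalize within the layer so layers are comparable.
            scores[i] += weight * (a / layer_sum)
    return [s / total_weight for s in scores]


def top_k_tokens(scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy example: the first (shallow) layer concentrates attention on token 0.
attn = [
    [0.7, 0.1, 0.1, 0.1],  # shallow layer
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.3, 0.5, 0.1],
]
uniform = aggregate_importance(attn, [1.0, 1.0, 1.0])
debiased = aggregate_importance(attn, [0.2, 1.0, 1.0])
print(top_k_tokens(uniform, 2), top_k_tokens(debiased, 2))
```

In this toy input, uniform averaging keeps token 0 because of the shallow layer alone, while the down-weighted aggregation keeps the tokens that deeper layers attend to.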
Key Features of Efficient3D
- Debiased Visual Token Importance Estimator (DVTIE): This module makes visual token importance predictions more reliable by accounting for the influence of shallow initial layers during attention aggregation.
- Adaptive Token Rebalancing (ATR): The ATR strategy adjusts the pruning strength dynamically according to scene complexity, preserving semantic completeness and keeping attention balanced across layers.
- Context-Aware Token Reduction: Efficient3D reduces tokens in a context-sensitive manner, retaining essential semantics while lowering computational load.
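To make the notion of complexity-dependent pruning strength concrete, here is a minimal sketch. The source does not specify how scene complexity is measured, so the sketch assumes the normalized entropy of the token importance distribution as a stand-in; `min_ratio`, `max_ratio`, and both function names are hypothetical and not taken from the Efficient3D code.

```python
import math


def normalized_entropy(scores):
    """Entropy of the score distribution, scaled to [0, 1].

    Used here as an assumed proxy for scene complexity: a flat
    distribution (many equally important tokens) gives a value near 1.
    """
    total = sum(scores)
    if len(scores) < 2 or total <= 0:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(scores))


def adaptive_keep_count(scores, min_ratio=0.2, max_ratio=0.8):
    """Number of tokens to keep, interpolated by the complexity proxy."""
    complexity = normalized_entropy(scores)
    ratio = min_ratio + (max_ratio - min_ratio) * complexity
    return max(1, round(ratio * len(scores)))


# A scene dominated by one token is pruned harder than a uniform one.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(adaptive_keep_count(peaked), adaptive_keep_count(flat))
```

The linear interpolation is only one possible design; any monotone mapping from complexity to keep ratio would express the same idea.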
Performance Evaluation
Comprehensive experiments were conducted on five representative 3D vision and language benchmarks, including:
- ScanRefer
- Multi3DRefer
- Scan2Cap
- ScanQA
- SQA3D
The results indicate that Efficient3D can outperform unpruned baselines despite processing fewer visual tokens, with a notable +2.57% CIDEr improvement on Scan2Cap. This suggests that the framework improves inference efficiency without sacrificing accuracy in 3D MLLMs.
Conclusion
Efficient3D offers a scalable and effective approach to efficient inference in 3D MLLMs, addressing the challenges posed by high-dimensional inputs and computational overhead. Its techniques improve performance while preserving the semantic information the model relies on. As demand for efficient AI systems grows, Efficient3D is a promising option for researchers and practitioners working on 3D multimodal applications.
The code for Efficient3D is publicly available at https://github.com/sol924/Efficient3D.
