PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
arXiv:2604.00886v1
Abstract
Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs). However, the fine-grained text and small UI elements in these applications demand high-resolution inputs, which produce tens of thousands of visual tokens and impose an exceptionally heavy computational burden.
Much of this computation is wasted. Across document and GUI benchmarks, only 22–71% of image patches are pixel-unique; the rest are exact duplicates of another patch within the same image. To exploit this redundancy, we introduce PixelPrune, which applies predictive-coding-based compression to prune redundant patches before they reach the Vision Transformer (ViT) encoder.
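To make the "pixel-unique" statistic concrete, the sketch below counts how many patches in a toy image are distinct. It is an illustration only: the helper names `split_into_patches` and `unique_patch_fraction`, the nested-list image format, and the 2×2 patch size are assumptions made for this example, not part of PixelPrune's code.

```python
def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch tiles.

    Each tile is returned as a tuple of row-tuples so that it is hashable.
    Assumes the image height and width are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = tuple(
                tuple(image[top + dy][left:left + patch]) for dy in range(patch)
            )
            tiles.append(tile)
    return tiles


def unique_patch_fraction(image, patch):
    """Fraction of patches whose pixel content is distinct within the image."""
    tiles = split_into_patches(image, patch)
    return len(set(tiles)) / len(tiles)


# Toy 4x8 "page": mostly blank (0) with a single dark pixel (9).
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(unique_patch_fraction(page, 2))  # 0.25: 2 distinct tiles out of 8
```

On this toy page, six of the eight patches are exact copies of a patch that has already been seen, which is the kind of redundancy the benchmarks above report at scale.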
How PixelPrune Works
PixelPrune operates directly in pixel space, before any neural computation. Because redundant patches are removed ahead of the ViT encoder, both the encoder and the downstream large language model (LLM) process fewer tokens, so the acceleration covers the entire inference pipeline.
- Training-Free: PixelPrune has no learnable parameters, so it can be applied without any additional training.
- Pixel-Lossless Compression: With τ = 0, compression is pixel-lossless.
- Controlled Lossy Compression: With τ > 0, compression becomes lossy in a controlled way, giving flexibility based on application needs.
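The role of τ can be illustrated with a minimal sketch. The source does not specify the predictor, so the code assumes a simple one chosen for illustration: each patch is predicted from the most recently kept patch in the same row, and it is pruned when the largest per-pixel difference is at most τ. The names `max_abs_diff` and `prune_patches` are hypothetical, and this should not be read as the authors' implementation.

```python
def max_abs_diff(a, b):
    """Largest absolute per-pixel difference between two equally sized patches."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def prune_patches(grid, tau=0):
    """Return (row, col) positions of the patches that survive pruning.

    `grid` is a list of rows, each a list of patches (tuples of row-tuples).
    Assumed predictor: the most recently kept patch in the same row.
    A patch is dropped when it lies within `tau` of that prediction.
    """
    kept = []
    for r, row in enumerate(grid):
        reference = None  # no prediction is available at the start of a row
        for c, patch in enumerate(row):
            if reference is not None and max_abs_diff(patch, reference) <= tau:
                continue  # predictable from the reference, so prune it
            kept.append((r, c))
            reference = patch
    return kept


blank = ((0, 0), (0, 0))
mark = ((0, 9), (0, 0))
faint = ((0, 1), (0, 0))  # differs from blank by 1 in a single pixel

grid = [
    [blank, blank, mark, blank],
    [blank, faint, blank, blank],
]

print(prune_patches(grid, tau=0))  # lossless: only exact repeats are dropped
print(prune_patches(grid, tau=1))  # lossy: the faint patch is dropped as well
```

With τ = 0 the sketch keeps six of the eight patches, dropping only exact repeats of the reference. With τ = 1 it keeps four, because the faint patch now counts as predictable. Comparing against the last kept patch rather than the immediate neighbour keeps every dropped patch within τ of the patch that stands in for it. The returned positions are what a downstream encoder would need in order to place the surviving patches correctly.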
Performance and Results
Experiments across three model scales show that PixelPrune maintains competitive task accuracy while delivering:
- Up to 4.2× inference speedup.
- 1.9× training acceleration.
These results suggest that PixelPrune can make Vision-Language Models more practical for real-world use: it reduces the computational load while keeping task accuracy competitive, which makes it useful to both developers and researchers.
Availability
The code for PixelPrune is publicly available at https://github.com/OPPO-Mente-Lab/PixelPrune. This ensures that the research community can utilize, test, and build upon the findings presented in this work.
In conclusion, PixelPrune represents a significant advancement in the optimization of Vision-Language Models, paving the way for more efficient processing in document understanding and GUI interaction tasks.
