Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Summary: arXiv:2603.24326v1 (cross-listed)
Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among the visual regions of document images, such as background areas. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance.
Introduction
Document parsing sits at the intersection of computer vision and natural language processing, and recent work increasingly relies on vision-language models. These models benefit from high-resolution input, but the number of vision tokens they must process grows quadratically with resolution, which sharply raises computational cost and limits practical use.
Challenges in Document Parsing
Document images often contain significant amounts of redundant visual information such as backgrounds and other non-essential elements. This redundancy poses several challenges, including:
- Increased Computational Costs: The number of vision tokens grows quadratically with image resolution, so high-resolution input quickly becomes expensive.
- Performance Degradation: Irrelevant visual information can distract models from the content that matters.
- Resource Inefficiency: More tokens require more processing power, which can be a barrier for real-time applications.
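The quadratic growth noted in the list above can be illustrated with a short calculation. The sketch below assumes a ViT-style encoder that splits the image into non-overlapping square patches, one token per patch. The patch size of 14 and the function name `vision_token_count` are illustrative assumptions, not details taken from PaddleOCR-VL.

```python
import math


def vision_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Count patch tokens for an image, assuming non-overlapping square patches.

    patch_size=14 is an illustrative assumption, not a value from the paper.
    Partial patches at the right/bottom border are counted as full patches,
    as if the image were padded up to a multiple of patch_size.
    """
    if width <= 0 or height <= 0 or patch_size <= 0:
        raise ValueError("width, height and patch_size must be positive")
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


if __name__ == "__main__":
    # Doubling the side length multiplies the token count by four.
    for side in (448, 896, 1792):
        print(f"{side}x{side}: {vision_token_count(side, side)} tokens")
```

Under these assumptions, each doubling of the side length quadruples the token count. Standard self-attention in turn scales with the square of the token count, which is why high-resolution input is costly.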
Proposed Solution: PaddleOCR-VL
To address these challenges, we introduce PaddleOCR-VL, a novel architecture designed to enhance efficiency and performance in document parsing. The key components of this model include:
- Valid Region Focus Module (VRFM): This lightweight module utilizes localization and contextual relationship prediction to identify valid vision tokens while filtering out redundant areas.
- Compact Vision-Language Model: The PaddleOCR-VL-0.9B model is a compact yet powerful solution for detailed recognition, which processes only the semantically relevant tokens identified by the VRFM.
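The coarse-to-fine idea can be sketched as a token filter. This is a minimal illustration, not the actual VRFM. It assumes that a coarse stage returns pixel-space bounding boxes for valid regions and that vision tokens map to a regular patch grid. The `Box` type, the functions `valid_token_mask` and `kept_fraction`, and the patch size are all hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical pixel-space box; x1 and y1 are exclusive."""
    x0: int
    y0: int
    x1: int
    y1: int


def valid_token_mask(
    width: int, height: int, boxes: Sequence[Box], patch_size: int = 14
) -> List[List[bool]]:
    """Mark every patch that overlaps at least one valid region.

    patch_size=14 is an illustrative assumption, not a value from the paper.
    """
    cols = -(-width // patch_size)   # ceiling division
    rows = -(-height // patch_size)
    mask = [[False] * cols for _ in range(rows)]
    for b in boxes:
        # Clamp the box to the image and skip empty boxes.
        x0, y0 = max(0, b.x0), max(0, b.y0)
        x1, y1 = min(width, b.x1), min(height, b.y1)
        if x0 >= x1 or y0 >= y1:
            continue
        # Inclusive patch index range covered by the box.
        for r in range(y0 // patch_size, (y1 - 1) // patch_size + 1):
            for c in range(x0 // patch_size, (x1 - 1) // patch_size + 1):
                mask[r][c] = True
    return mask


def kept_fraction(mask: List[List[bool]]) -> float:
    """Fraction of patch tokens that would be passed to the fine stage."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total if total else 0.0


if __name__ == "__main__":
    # One valid region covering the top-left quarter of a 56x56 image.
    mask = valid_token_mask(56, 56, [Box(0, 0, 28, 28)])
    print(kept_fraction(mask))  # 0.25
```

In this toy setup only the tokens marked `True` would be forwarded to the recognition model, so a page dominated by background contributes few tokens regardless of its resolution.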
Results and Performance
Extensive experiments have demonstrated that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. The model exhibits:
- Superior Accuracy: Outperforming existing solutions and showing strong competitiveness against top-tier vision-language models (VLMs).
- Fast Inference Speeds: Delivering rapid results while significantly reducing the number of vision tokens and parameters processed.
- Public Availability: The source code and models are publicly available to support further research and development in the field. More information can be found in the PaddleOCR GitHub repository.
Conclusion
PaddleOCR-VL represents a significant advancement in document parsing technology, highlighting the effectiveness of a coarse-to-fine parsing approach. By focusing on semantically relevant regions and minimizing redundancy, this model not only improves efficiency but also enhances performance, paving the way for more effective document understanding in various applications.
