DISCO: Document Intelligence Suite for COmparative Evaluation
Summary: arXiv:2603.23511v1 Announce Type: cross
Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents.
Introduction
Document intelligence has drawn growing attention in recent years, driven by the need to extract and reason over text across many document formats. DISCO addresses this need as a framework for comparing the performance of different document processing approaches.
Key Features of DISCO
- Comprehensive Evaluation: DISCO evaluates OCR pipelines and VLMs separately on parsing and question answering, exposing the strengths and weaknesses of each.
- Diverse Document Types: The suite covers a wide range of document types, including:
  - Handwritten Text
  - Multilingual Scripts
  - Medical Forms
  - Infographics
  - Multi-page Documents
- Performance Assessment: Because OCR pipelines and VLMs are assessed separately by task and document type, DISCO shows how each approach performs under different conditions.
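A comparative breakdown of this kind can be sketched as a simple aggregation. The snippet below is a minimal illustration, not DISCO's actual code: the record fields (`approach`, `task`, `doc_type`, `score`) and the function name `summarize` are hypothetical, and the scores are placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Average scores per (approach, task, doc_type) cell.

    Hypothetical helper: the field names are assumptions made for
    this sketch, not part of DISCO.
    """
    cells = defaultdict(list)
    for record in records:
        key = (record["approach"], record["task"], record["doc_type"])
        cells[key].append(record["score"])
    return {key: mean(scores) for key, scores in cells.items()}


# Placeholder scores for illustration only; NOT results from the paper.
records = [
    {"approach": "ocr", "task": "parsing", "doc_type": "handwritten", "score": 0.5},
    {"approach": "ocr", "task": "parsing", "doc_type": "handwritten", "score": 1.0},
    {"approach": "vlm", "task": "qa", "doc_type": "infographic", "score": 1.0},
]
summary = summarize(records)
```

Keeping the approach, task, and document type together in the key means parsing and question answering are never averaged into a single number, which matches the separate evaluation described above.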
Results and Findings
Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. Three findings stand out:
- Reliability of OCR Pipelines: OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where the explicit text grounding they provide supports text-heavy reasoning.
- Advantages of VLMs: Vision-language models perform better on multilingual text and visually rich layouts, where understanding the document depends on combining visual and textual information.
- Impact of Task-aware Prompting: Task-aware prompting yields mixed results, improving performance on some document types while degrading it on others, which suggests that prompting strategies need to be tailored to the document type.
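The findings above suggest a simple routing heuristic for complexity-aware approach selection. The sketch below is only one way to encode them, under stated assumptions: the `DocProfile` fields, the `long_doc_pages` threshold, and the vote-counting rule are all hypothetical and do not come from the paper. When the signals conflict or are absent, it makes no choice and recommends comparing both approaches.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    """Hypothetical document descriptor; field names are assumptions."""
    handwritten: bool = False
    page_count: int = 1
    multilingual: bool = False
    visually_rich: bool = False


def choose_approach(doc, long_doc_pages=2):
    """Suggest "ocr", "vlm", or "compare_both" for a document.

    The signals mirror the reported tendencies (OCR for handwriting and
    long or multi-page documents, VLMs for multilingual text and visually
    rich layouts). The threshold and tie-breaking are illustrative
    assumptions, not values from the paper.
    """
    ocr_votes = int(doc.handwritten) + int(doc.page_count >= long_doc_pages)
    vlm_votes = int(doc.multilingual) + int(doc.visually_rich)
    if ocr_votes > vlm_votes:
        return "ocr"
    if vlm_votes > ocr_votes:
        return "vlm"
    # Conflicting or missing signals: do not guess, evaluate both.
    return "compare_both"
```

For example, `choose_approach(DocProfile(handwritten=True, page_count=12))` returns `"ocr"`, while a handwritten multilingual page returns `"compare_both"` because the two tendencies pull in opposite directions.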
Conclusion
These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands. As OCR and VLM technologies continue to evolve, DISCO gives researchers and practitioners a common basis for comparing them.
Future Work
Looking ahead, further research is necessary to refine the evaluation criteria and expand the range of document types assessed. Continuous improvements in both OCR and VLM technologies are expected, making it essential to revisit and update the DISCO framework regularly.
