MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents
arXiv:2604.12352v1
Introduction
Retrieval-Augmented Generation (RAG) based question answering (QA) is increasingly used to process long industrial documents. Conventional text chunking methods, however, often fail to capture the complex structure of these documents. The result can be information loss and lower answer quality.
Introducing MultiDocFusion
To address these problems, we present MultiDocFusion, a multimodal chunking pipeline for RAG-based QA systems. It runs in four stages:
- Vision-Based Document Parsing: MultiDocFusion first detects document regions with a vision-based parser, so that content is identified and segmented before any text is read.
- OCR Text Extraction: Optical Character Recognition (OCR) then extracts text from the detected regions, converting visual content into machine-readable text.
- Hierarchical Document Structure Reconstruction: The extracted content is reorganized into a hierarchical tree that mirrors the document’s section structure. This step uses document section hierarchical parsing (DSHP-LLM), which is based on large language models (LLMs).
- DFS-Based Grouping for Hierarchical Chunks: Finally, MultiDocFusion traverses the tree with Depth-First Search (DFS) based grouping to build hierarchical chunks for retrieval.
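The last two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the `Section` schema, the `build_tree` and `chunk_document` helpers, the `max_chars` budget, and the rule "keep a whole subtree together if it fits, otherwise descend" are all assumptions made for illustration. The LLM-based parsing step is replaced here by heading levels that are assumed to be already known.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One node of the reconstructed document tree (hypothetical schema)."""
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)


def build_tree(flat: List[Tuple[int, str, str]]) -> Section:
    """Nest (level, title, text) records, given in reading order, under a root.

    Assumes every level is >= 1, so the level-0 root is never popped.
    """
    root = Section(title="ROOT")
    stack = [(0, root)]  # path from the root to the most recent section
    for level, title, text in flat:
        node = Section(title, text)
        # Pop until the top of the stack is a strictly shallower section.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


def subtree_text(node: Section) -> str:
    """Title and text of a section plus all descendants, in pre-order."""
    parts = [node.title, node.text] + [subtree_text(c) for c in node.children]
    return "\n".join(p for p in parts if p)


def dfs_chunks(node: Section, max_chars: int, path: Tuple[str, ...]) -> List[str]:
    """Pre-order DFS that prefixes each chunk with its ancestor headings."""
    header = " > ".join(path)
    body = subtree_text(node)
    if len(body) <= max_chars:
        # The whole subtree fits the budget: keep it together as one chunk.
        return [f"{header}\n{body}" if header else body]
    # Too large: emit this section's own text, then descend into children.
    # A leaf that exceeds the budget is emitted as is (no further splitting).
    own = "\n".join(p for p in (node.title, node.text) if p)
    chunks = [f"{header}\n{own}" if header else own]
    for child in node.children:
        chunks.extend(dfs_chunks(child, max_chars, path + (node.title,)))
    return chunks


def chunk_document(root: Section, max_chars: int) -> List[str]:
    """Chunk every top-level section; the synthetic root is never emitted."""
    chunks: List[str] = []
    for child in root.children:
        chunks.extend(dfs_chunks(child, max_chars, ()))
    return chunks


# Made-up example input: (heading level, title, body text).
flat = [
    (1, "1 Safety", "General rules."),
    (2, "1.1 PPE", "Wear gloves."),
    (2, "1.2 Lockout", "Isolate power."),
    (1, "2 Maintenance", "Monthly checks."),
]
for chunk in chunk_document(build_tree(flat), max_chars=40):
    print(repr(chunk))
```

With a small budget each subsection becomes its own chunk and carries its parent heading as context. With a larger budget the whole "1 Safety" subtree stays in one chunk. Prefixing the heading path is one way to keep hierarchy information visible to the retriever.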
Performance Evaluation
To validate MultiDocFusion, experiments were conducted across several industrial benchmarks. The results show consistent gains:
- Retrieval precision improved by 8-15%.
- Question Answering (QA) scores, measured with the ANLS (Average Normalized Levenshtein Similarity) metric, increased by 2-3%.
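For readers unfamiliar with ANLS, the sketch below shows how the metric is commonly computed. It assumes the usual formulation: a threshold of 0.5 on the normalized edit distance, case-insensitive comparison, and the best match over several reference answers. The function names and the normalization details are illustrative and may differ from the evaluation script used in the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    tau: float = 0.5,
) -> float:
    """Average over questions of the best thresholded similarity."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            # Normalized Levenshtein distance in [0, 1].
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers at or beyond the threshold score zero.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


# Made-up answers: one near miss (OCR-style error) and one wrong answer.
print(anls(["12 bar", "valve"], [["12 bars"], ["pump"]]))
```

The threshold lets small character-level errors, such as OCR slips, earn partial credit, while answers that are mostly wrong score zero.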
Conclusion
Our experiments show that document hierarchy matters for multimodal document-based QA. By explicitly using the structure of long industrial documents, MultiDocFusion preserves information that flat chunking can lose and improves both retrieval precision and answer quality in RAG-based QA systems.
