From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
Retrieval-Augmented Generation (RAG) systems depend on the quality of document preprocessing, yet no prior study has systematically evaluated how the choice of PDF processing framework affects downstream question-answering accuracy. This article addresses that gap by comparing four open-source PDF-to-Markdown conversion frameworks: Docling, MinerU, Marker, and DeepSeek OCR.
Research Overview
The evaluation covered 19 pipeline configurations that varied along four dimensions: conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. The benchmark consisted of 50 manually curated questions over a corpus of 36 Portuguese administrative documents (approximately 1,706 pages and 492,000 words). Each configuration was scored with an LLM-based scoring mechanism, and its results were averaged over ten runs.
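The setup above can be sketched as a small data model: one record per configuration, plus a helper that averages accuracy over repeated runs. This is a minimal illustration, not the study's code. The class name, field names, and the assumption that each question receives a score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PipelineConfig:
    """One pipeline configuration (hypothetical field names)."""
    converter: str          # e.g. "docling", "mineru", "marker", "deepseek-ocr"
    cleaning: tuple = ()    # names of cleaning transformations applied
    splitter: str = "flat"  # e.g. "flat" or "hierarchical"
    metadata: tuple = ()    # names of metadata enrichment steps


def mean_accuracy(run_scores):
    """Average accuracy in percent over several runs.

    run_scores is a list of runs; each run is a list of per-question
    scores, assumed here to lie in [0, 1].
    """
    per_run = [mean(scores) for scores in run_scores]
    return 100.0 * mean(per_run)


config = PipelineConfig(
    converter="docling",
    splitter="hierarchical",
    metadata=("image_descriptions",),
)
# Two toy runs over two questions each; a real evaluation would use
# ten runs over 50 questions, as described above.
accuracy = mean_accuracy([[1, 0], [1, 1]])
```

Averaging per run first and then across runs gives every run equal weight, which matters only if runs could contain different numbers of scored questions.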
Key Findings
Two baselines bracket the results: a naive PDFLoader pipeline reached 86.9% accuracy, and a manually curated Markdown version reached 97.1%. Among the automated pipelines, Docling scored highest at 94.1% when combined with hierarchical splitting and comprehensive image descriptions.
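To put these three numbers in relation, the short calculation below expresses the best automated score as a share of the distance between the naive and the manual baseline. The "share of gap closed" framing is an illustrative assumption, not a metric reported by the study.

```python
def gap_closed(naive, automated, manual):
    """Share of the naive-to-manual accuracy gap recovered by the automated pipeline."""
    return (automated - naive) / (manual - naive)


naive, docling_best, manual = 86.9, 94.1, 97.1  # accuracies in percent, from the text

print(f"gain over naive: {docling_best - naive:.1f} points")        # 7.2 points
print(f"remaining gap to manual: {manual - docling_best:.1f} points")  # 3.0 points
print(f"share of gap closed: {gap_closed(naive, docling_best, manual):.1%}")  # 70.6%
```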
Impact of Metadata and Chunking Strategies
The study found that metadata enrichment and hierarchy-aware chunking affected accuracy more than the choice of conversion framework did. In practice, this means that effort spent on how converted documents are split and annotated can pay off more than switching conversion tools.
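One way to picture hierarchy-aware chunking with metadata is a splitter that cuts converted Markdown at headings and attaches the full heading path to each chunk. The sketch below is an assumption about the general technique, not the splitter used in the study. The function name and the chunk dictionary layout are hypothetical, and it does not handle "#" lines inside fenced code blocks.

```python
import re

# ATX-style Markdown heading: one to six "#" characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into chunks, each tagged with its heading path."""
    chunks = []
    path = []    # stack of (level, title) for the current position
    buffer = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "heading_path": [title for _, title in path],
                "text": text,
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# A\nintro\n## B\nbody b\n# C\nbody c"
chunks = split_by_headings(doc)
# chunks[1] == {"heading_path": ["A", "B"], "text": "body b"}
```

Storing the heading path alongside the text lets a retriever or prompt template show where a passage sits in the document, which is the kind of context a flat fixed-size splitter discards.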
Comparison with Other Approaches
The research also found that font-based hierarchy rebuilding consistently outperformed LLM-based methods for reconstructing document structure. This suggests that simple, rule-based approaches to document structure can still hold an advantage over more complex, model-driven ones.
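The summary does not describe how the font-based rebuilding works, so the following is only a plausible sketch of the idea: treat the most common font size as body text and map larger sizes to heading levels. The input format (a list of text and font-size pairs), the function name, and the `max_levels` parameter are hypothetical.

```python
from collections import Counter


def spans_to_markdown(spans, max_levels=3):
    """Rebuild Markdown headings from font sizes.

    spans is a list of (text, font_size) pairs in reading order.
    The font size carrying the most characters is assumed to be body text;
    the largest sizes above it become heading levels 1..max_levels.
    """
    if not spans:
        return ""

    # Weight each font size by the number of characters set in it.
    weight = Counter()
    for text, size in spans:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    # Largest size becomes level 1, next largest level 2, and so on.
    heading_sizes = sorted((s for s in weight if s > body_size), reverse=True)
    level_of = {s: i + 1 for i, s in enumerate(heading_sizes[:max_levels])}

    lines = []
    for text, size in spans:
        level = level_of.get(size)
        lines.append(f"{'#' * level} {text}" if level else text)
    return "\n".join(lines)


spans = [
    ("Title", 18.0),
    ("Section", 14.0),
    ("Body text that is long enough to dominate.", 10.0),
    ("Section 2", 14.0),
    ("More body text here for weight.", 10.0),
]
markdown = spans_to_markdown(spans)
```

A rule like this is deterministic and cheap, which may partly explain why it can compete with LLM-based restructuring, though it depends on the PDF using font size consistently.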
Exploratory Implementation Insights
An exploratory implementation of GraphRAG, which integrates a knowledge graph into the RAG pipeline, scored only 82%, below even the naive PDFLoader baseline of 86.9%. This indicates that naive knowledge graph construction, without ontological guidance, does not yet justify the complexity it adds to the system. The result reinforces the importance of well-defined structure in data preparation for RAG applications.
Conclusion
The study shows that the quality of data preparation is a major factor in RAG performance. Its findings suggest that researchers and practitioners working on domain-specific question answering should prioritize effective document conversion, hierarchy-aware chunking, and metadata enrichment.
