Hybrid Deterministic-LLM for Accurate Course PDF Extraction

A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

Source: arXiv:2604.00003v1 (cross-listed)

Abstract

This study evaluates the reliability of approaches for extracting information from KRS (academic course registration) documents using three strategies: LLM only, hybrid deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments covered 140 documents for the LLM-based tests and 860 documents for the Camelot-based pipeline evaluation, spanning four study programs with varying table data and metadata.
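To make the "regex + LLM" idea concrete, here is a minimal sketch of one way such a hybrid could be wired: regular expressions handle fields with a predictable layout, and anything they miss is handed to an LLM. The field names, the patterns, and the `llm_fallback` callable are all hypothetical and are not taken from the paper; real KRS documents will use different labels.

```python
import re
from typing import Callable, Dict

# Hypothetical field patterns; actual KRS labels and layouts will differ.
FIELD_PATTERNS = {
    "student_id": re.compile(r"Student ID\s*:\s*(\d+)"),
    "semester": re.compile(r"Semester\s*:\s*(\d+)"),
}


def extract_metadata(
    text: str,
    llm_fallback: Callable[[str, str], str],
) -> Dict[str, str]:
    """Extract each field with a regex first; call the LLM only on a miss.

    `llm_fallback(field, text)` is an assumed interface: any function that
    takes a field name and the page text and returns a string value.
    """
    result: Dict[str, str] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Deterministic path: no model call needed.
            result[field] = match.group(1)
        else:
            # Slower, non-deterministic path for fields the regex missed.
            result[field] = llm_fallback(field, text)
    return result
```

The appeal of this arrangement is that the expensive model call only happens for the fields that resist pattern matching, which is consistent with the efficiency gain the study reports for deterministic metadata.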

Methodology

Three LLMs in the 12-14B parameter range (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama on a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7.
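The two metrics are easy to sketch in code. The version below is an illustration rather than the paper's evaluation script: it assumes LS is the edit distance normalized by the longer string's length, that EM ignores leading and trailing whitespace, and that a field passes when LS is at least 0.7. The summary does not spell out any of these details.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(pred: str, gold: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(pred, gold) / longest


def exact_match(pred: str, gold: str) -> bool:
    """Assumed EM definition: equality after trimming outer whitespace."""
    return pred.strip() == gold.strip()


def score_field(pred: str, gold: str, threshold: float = 0.7) -> dict:
    """Score one extracted field against its reference value."""
    ls = levenshtein_similarity(pred, gold)
    return {"em": exact_match(pred, gold), "ls": ls, "ls_pass": ls >= threshold}
```

For example, `score_field("Algorithm", "Algorithms")` fails EM but has an LS of 0.9 (one edit over ten characters), so it clears the 0.7 threshold. That gap is why reporting both metrics is useful: LS gives partial credit for near misses that EM treats as outright failures.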

Results

Although the gain does not hold for every model, the results show that the hybrid approach can improve efficiency over LLM only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS of up to 0.99 to 1.00) and computational efficiency (less than 1 second per PDF in most cases).
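A Camelot-first pipeline with an LLM fallback might look roughly like the sketch below. The control flow, the `min_rows` check, and the `llm_fallback` callable are assumptions made for illustration; the summary does not describe how the paper decides when to fall back. The only Camelot calls used are `camelot.read_pdf` and each table's `.df` DataFrame.

```python
from typing import Callable, List, Tuple

Rows = List[List[str]]


def camelot_rows(pdf_path: str) -> Rows:
    """Read every table in the PDF with Camelot and flatten it to rows."""
    import camelot  # imported lazily so the rest of the module works without it

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    rows: Rows = []
    for table in tables:
        rows.extend(table.df.values.tolist())
    return rows


def extract_course_rows(
    pdf_path: str,
    table_reader: Callable[[str], Rows],
    llm_fallback: Callable[[str], Rows],
    min_rows: int = 1,
) -> Tuple[Rows, str]:
    """Try the deterministic table reader first; use the LLM only if it fails.

    `min_rows` is a hypothetical sanity check, not a rule from the paper.
    Returns the rows plus a label saying which path produced them.
    """
    try:
        rows = table_reader(pdf_path)
    except Exception:
        rows = []  # treat any parser error as "no tables found"
    if len(rows) >= min_rows:
        return rows, "deterministic"
    return llm_fallback(pdf_path), "llm"
```

A call such as `extract_course_rows("krs.pdf", camelot_rows, my_llm_reader)` would then only invoke the model for documents where table parsing comes back empty. If most PDFs take the deterministic path, that would account for the sub-second typical runtime reported above.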

Performance Analysis

The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM methods is increasingly reliable and efficient for information extraction from text-based academic documents in computationally constrained environments.

Conclusion

Hybrid deterministic-LLM approaches are a meaningful step forward for information extraction from PDFs, particularly in academic settings. The efficiency and accuracy of the Camelot-based pipeline with LLM fallback show that these methods can handle complex data extraction tasks effectively.

Future Work

Further research is needed to explore the scalability of these approaches across larger datasets and to refine the models for even better performance in diverse scenarios.

Key Takeaways

Hybrid deterministic-LLM approaches can improve information extraction efficiency, though not for every model.
The Camelot-based pipeline with LLM fallback shows high accuracy and speed.
The Qwen 2.5:14b model offers the most consistent performance across scenarios.
Integrating these methods is promising for computationally constrained environments.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
