Hybrid Deterministic-LLM for Accurate Course PDF Extraction

A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

Source: arXiv:2604.00003v1 (announce type: cross)

Abstract

This study evaluates the reliability of three strategies for extracting information from KRS (course registration) documents: LLM only, hybrid deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments covered 140 documents for the LLM-based tests and 860 documents for the Camelot-based pipeline evaluation, spanning four study programs with varying table data and metadata.
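To make the hybrid (regex + LLM) strategy concrete, here is a minimal sketch, not the authors' implementation. The field names, the regex patterns, and the `llm_extract` callable are all hypothetical; in a setup like the one described, `llm_extract` would wrap a call to a locally hosted model.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical patterns for deterministic metadata; real KRS layouts will differ.
PATTERNS = {
    "student_id": re.compile(r"Student ID\s*:\s*(\d+)"),
    "name": re.compile(r"Name\s*:\s*(.+)"),
    "program": re.compile(r"Program\s*:\s*(.+)"),
}


def extract_metadata(
    text: str,
    llm_extract: Optional[Callable[[str, str], str]] = None,
) -> Dict[str, str]:
    """Try regex first; ask the LLM only for fields the regex could not find."""
    result: Dict[str, str] = {}
    missing = []
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[field] = match.group(1).strip()
        else:
            missing.append(field)

    # Assumption: llm_extract(field, text) returns the field value as a string.
    if llm_extract is not None:
        for field in missing:
            result[field] = llm_extract(field, text)
    return result
```

Because the LLM is consulted only for missing fields, documents whose metadata follows a predictable layout never incur the model's latency.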

Methodology

Three LLMs in the 12B to 14B parameter range (Gemma 3, Phi 4, and Qwen 2.5) were run locally with Ollama on a consumer-grade CPU without a GPU. Evaluation used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7.
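The two metrics can be sketched as follows. The summary does not say how LS is normalized or exactly how the 0.7 threshold is applied, so this sketch assumes the common definition (one minus edit distance divided by the longer string's length) and treats a prediction as acceptable when LS is at least the threshold. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(pred: str, gold: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / longest


def exact_match(pred: str, gold: str) -> bool:
    """Exact match after trimming surrounding whitespace (an assumption)."""
    return pred.strip() == gold.strip()


def passes_threshold(pred: str, gold: str, threshold: float = 0.7) -> bool:
    """Assumed use of the threshold: accept when LS >= threshold."""
    return levenshtein_similarity(pred, gold) >= threshold
```

Under these assumptions, a single-character OCR-style slip in a long course name still passes the LS check while failing EM, which is why the two metrics are reported side by side.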

Results

The results show that the hybrid approach can improve efficiency over the LLM-only approach, especially for deterministic metadata, although this did not hold for every model. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS reaching 0.99 to 1.00) and computational efficiency (less than 1 second per PDF in most cases).
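A rough sketch of a Camelot-first pipeline with an LLM fallback is shown below. It is an illustration under stated assumptions, not the paper's code: the validity check, the `expected_columns` default, and the `llm_fallback` callable are hypothetical, and the summary does not say which condition triggers the fallback. Only `camelot.read_pdf` and the `.df` attribute of its tables come from the Camelot library (installed separately as `camelot-py`).

```python
from typing import Callable, List, Tuple

Row = List[str]


def camelot_rows(pdf_path: str) -> List[Row]:
    """Read all table rows from a text-based PDF using Camelot."""
    import camelot  # lazy import so the rest of the sketch runs without it

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    rows: List[Row] = []
    for table in tables:
        rows.extend(table.df.values.tolist())  # each table exposes a DataFrame
    return rows


def looks_valid(rows: List[Row], expected_columns: int) -> bool:
    """Hypothetical check: non-empty, right width, no completely blank rows."""
    return bool(rows) and all(
        len(row) == expected_columns and any(str(cell).strip() for cell in row)
        for row in rows
    )


def extract_course_rows(
    pdf_path: str,
    llm_fallback: Callable[[str], List[Row]],
    read_rows: Callable[[str], List[Row]] = camelot_rows,
    expected_columns: int = 4,  # assumed column count, e.g. code/name/credits/class
) -> Tuple[List[Row], str]:
    """Return (rows, source), where source names the path that produced them."""
    try:
        rows = read_rows(pdf_path)
    except Exception:
        rows = []  # treat any parser failure as "needs fallback"
    if looks_valid(rows, expected_columns):
        return rows, "deterministic"
    return llm_fallback(pdf_path), "llm_fallback"
```

Returning the source label alongside the rows makes it easy to measure how often the fallback fires, which matters because the sub-second figure reported above can only apply to documents that stay on the deterministic path.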

Performance Analysis

The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings indicate that combining deterministic and LLM methods is a reliable and efficient way to extract information from text-based academic documents in computationally constrained environments.

Conclusion

Hybrid deterministic-LLM approaches are a practical step forward for information extraction from PDFs, particularly in academic settings. The efficiency and accuracy of the Camelot-based pipeline with LLM fallback show the potential of these methods to handle complex data extraction tasks effectively.

Future Work

Further research is needed to explore the scalability of these approaches across larger datasets and to refine the models for even better performance in diverse scenarios.

Key Takeaways

  • Hybrid deterministic-LLM approaches can improve information extraction efficiency over LLM-only extraction, though not for every model.
  • The Camelot-based pipeline with LLM fallback shows high accuracy and speed.
  • The Qwen 2.5:14b model offers the most consistent performance across the evaluated scenarios.
  • Combining these methods is promising for computationally constrained environments.


Lazarus Omolua (https://richlyai.com/blog)