AI-Powered Schema Extraction for Missing-Person Data


LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Source: arXiv:2604.06571v1 (cross-listed)

Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling.

The pipeline combines four components:

  • Multi-engine PDF text extraction: Tries more than one text-extraction engine and falls back to Optical Character Recognition (OCR) when direct extraction fails.
  • Rule-based source identification: Detects which source a document comes from and routes it to a source-specific parser.
  • Schema-first harmonization and validation: Maps parsed fields onto the unified schema and checks each record against it.
  • LLM-assisted extraction pathway: Uses an LLM to extract the fields, with validator-guided repair and shared geocoding services.
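The first two components can be sketched as follows. This is a minimal illustration, not the Guardian Parser Pack's code: the `Extractor` type, the `MIN_CHARS` threshold, and the regular expressions in `SOURCE_RULES` are all assumptions made for the example, and the extraction engines are passed in as plain callables so the sketch does not depend on any particular PDF or OCR library.

```python
import re
from typing import Callable, Optional

# Hypothetical extractor type: takes a file path, returns text or None on failure.
Extractor = Callable[[str], Optional[str]]

MIN_CHARS = 40  # assumed threshold below which extracted text is treated as unusable


def extract_text(path: str, engines: list[Extractor], ocr: Extractor) -> str:
    """Try each text engine in order; fall back to OCR if none yields enough text."""
    for engine in engines:
        text = engine(path)
        if text and len(text.strip()) >= MIN_CHARS:
            return text
    return ocr(path) or ""


# Hypothetical source signatures; real rules would be tuned per source.
SOURCE_RULES = {
    "structured_form": re.compile(r"case\s+number\s*:", re.I),
    "poster": re.compile(r"\bmissing\b.*\bhave you seen\b", re.I | re.S),
}


def identify_source(text: str) -> str:
    """Return the first source type whose rule matches, else a narrative default."""
    for name, pattern in SOURCE_RULES.items():
        if pattern.search(text):
            return name
    return "narrative_profile"
```

The returned source label would then select a source-specific parser. Keeping identification rule-based means the routing decision stays deterministic and easy to audit.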

These components run as a single pipeline. The central design trade-off is between extraction quality and processing speed, which matters in operational settings that demand a rapid response.

The evaluation compares the LLM-assisted pathway with a deterministic comparator. On a manually aligned subset of 75 cases, the LLM-assisted pathway reached an F1 score of 0.8664, against 0.2578 for the deterministic comparator. Across 517 parsed records for both pathways, the LLM pathway raised aggregate key-field completeness to 96.97%, compared with 93.23% for the deterministic approach.
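For readers unfamiliar with the two metrics, here is a minimal sketch of how an F1 score and a key-field completeness rate are commonly computed. The summary does not state the paper's matching criterion or its list of key fields, so the counts, the field names, and the rule that `None` or an empty string counts as missing are assumptions for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from field-level match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def key_field_completeness(records: list[dict], key_fields: list[str]) -> float:
    """Share of (record, key field) slots that hold a non-empty value."""
    total = len(records) * len(key_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in key_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical records and key fields, for illustration only.
sample = [{"name": "A", "age": None}, {"name": "B", "age": 12}]
print(key_field_completeness(sample, ["name", "age"]))  # 3 of 4 slots filled
```

Completeness only measures whether a field is populated, not whether its value is correct, which is why the two metrics are reported separately.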

The deterministic pathway was much faster, however, with a mean runtime of 0.03 seconds per record compared with 3.95 seconds for the LLM pathway. This trade-off between speed and accuracy matters in high-stakes investigative settings.
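To put the runtime gap in concrete terms, the reported means can be scaled to the 517-record evaluation set. This is back-of-the-envelope arithmetic that assumes records are processed one after another with no batching or parallelism, which the summary does not specify.

```python
# Mean per-record runtimes as reported in the evaluation (seconds).
DETERMINISTIC_S = 0.03
LLM_S = 3.95


def batch_seconds(n_records: int, per_record_s: float) -> float:
    """Estimated wall-clock time for a batch, assuming sequential processing."""
    return n_records * per_record_s


slowdown = LLM_S / DETERMINISTIC_S
det_total = batch_seconds(517, DETERMINISTIC_S)
llm_total = batch_seconds(517, LLM_S)

print(f"LLM pathway is about {slowdown:.0f}x slower per record")
print(f"517 records: {det_total:.1f} s deterministic, {llm_total / 60:.1f} min LLM")
```

Under that assumption the LLM pathway is roughly 132 times slower per record, and the full set takes about 34 minutes instead of about 16 seconds.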

In the evaluated run, all LLM outputs passed the initial schema validation. Validator-guided repair was therefore never triggered: it acted as a built-in safeguard and did not contribute to the observed gains in extraction quality. This outcome supports the controlled use of probabilistic AI within a schema-first, auditable pipeline for missing-person and child-safety investigations.
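A validator-guided repair loop of this kind can be sketched as below. Everything here is hypothetical: the three-field `SCHEMA`, the `llm_extract` callable (standing in for whatever model call a real pipeline would make), and the `max_repairs` limit are assumptions chosen to show the control flow, not the paper's implementation.

```python
from typing import Callable

# Hypothetical minimal schema: field name -> (expected type, required?)
SCHEMA = {
    "name": (str, True),
    "age": (int, False),
    "last_seen_location": (str, True),
}


def validate(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, (ftype, required) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            if required:
                errors.append(f"{field}: required field is missing")
        elif not isinstance(value, ftype):
            errors.append(
                f"{field}: expected {ftype.__name__}, got {type(value).__name__}"
            )
    for field in record:
        if field not in SCHEMA:
            errors.append(f"{field}: not defined in schema")
    return errors


def extract_with_repair(
    text: str,
    llm_extract: Callable[[str, list[str]], dict],
    max_repairs: int = 2,
) -> tuple[dict, int]:
    """Extract a record, feeding validator errors back to the model on failure.

    Returns the valid record and the number of repair rounds that were needed.
    """
    errors: list[str] = []
    for attempt in range(max_repairs + 1):
        record = llm_extract(text, errors)  # errors is empty on the first pass
        errors = validate(record)
        if not errors:
            return record, attempt
    raise ValueError(f"record still invalid after {max_repairs} repairs: {errors}")
```

In this sketch, a run where every output passes on the first attempt returns a repair count of 0 for each record, which corresponds to the situation reported above.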

In conclusion, the Guardian Parser Pack shows that LLM-assisted extraction can improve extraction quality and key-field completeness while keeping every output compliant with a predefined schema, at the cost of slower processing. For law enforcement and related agencies, that combination offers a way to process heterogeneous case documents more thoroughly in cases involving vulnerable individuals.


Author: Lazarus Omolua (https://richlyai.com/blog)