LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
Summary: arXiv:2604.06571v1 Announce Type: cross
Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling.
The system combines four components:
- Multi-engine PDF text extraction: Tries more than one text-extraction engine and falls back to Optical Character Recognition (OCR) when direct extraction fails.
- Rule-based source identification: Identifies each document's source with rules and routes it to a source-specific parser.
- Schema-first harmonization and validation: Maps extracted fields onto a predefined schema and validates each record against it for consistency.
- LLM-assisted extraction pathway: Adds Large Language Model (LLM) extraction with validator-guided repair and shared geocoding services.
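The first two components can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the Guardian Parser Pack's actual code: the engine callables, the `min_chars` threshold, the regular expressions in `SOURCE_RULES`, and the three source labels are all hypothetical.

```python
import re
from typing import Callable

Engine = Callable[[bytes], str]


def extract_text(pdf_bytes: bytes, engines: list[Engine], ocr: Engine,
                 min_chars: int = 40) -> str:
    """Try each text engine in order; fall back to OCR if none yields enough text.

    `min_chars` is a hypothetical threshold for "usable" output.
    """
    for engine in engines:
        try:
            text = engine(pdf_bytes)
        except Exception:
            continue  # a failing engine should not stop the pipeline
        if len(text.strip()) >= min_chars:
            return text
    return ocr(pdf_bytes)


# Hypothetical signatures for two source types; anything else is treated
# as a narrative web profile.
SOURCE_RULES = {
    "structured_form": re.compile(r"case\s*(number|no\.?)\s*:", re.I),
    "bulletin_poster": re.compile(r"\bmissing\b.*\bhave you seen\b", re.I | re.S),
}


def identify_source(text: str) -> str:
    """Return the first source label whose rule matches the text."""
    for name, pattern in SOURCE_RULES.items():
        if pattern.search(text):
            return name
    return "narrative_profile"
```

In this sketch the label returned by `identify_source` would select which source-specific parser handles the text; real rules would likely be more specific than two regular expressions.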
These components run as a single pipeline. Key implementation decisions balance extraction quality against processing speed, which matters in operational environments that demand rapid response.
On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator, with an F1 score of 0.8664 versus 0.2578. Across 517 parsed records from both pathways, the LLM pathway raised aggregate key-field completeness to 96.97%, compared with 93.23% for the deterministic approach.
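For readers unfamiliar with the two metrics, the following sketch shows one common way to compute them. It is an assumption about the general form of the calculation, not the paper's evaluation code: the field-matching rule, the aggregation scheme, and the definition of an "empty" field are all hypothetical here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from counts of correct, spurious, and missed field values."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def key_field_completeness(records: list[dict], key_fields: list[str]) -> float:
    """Share of (record, key field) slots that hold a non-empty value."""
    total = len(records) * len(key_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in key_fields
        if record.get(field) not in (None, "", [])  # assumed notion of "empty"
    )
    return filled / total
```

Completeness measures only whether a field is populated, while F1 measures whether the populated value is correct, which is one reason the two numbers can differ so widely between pathways.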
The deterministic pathway was much faster, however, with a mean runtime of 0.03 seconds per record compared to 3.95 seconds for the LLM pathway. This trade-off between speed and accuracy is crucial in high-stakes investigative settings.
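To put the runtime gap in concrete terms, the short calculation below scales the reported means to a batch. It assumes strictly sequential processing and that the mean runtimes apply uniformly to a batch of 517 records; both are simplifying assumptions for illustration rather than figures reported in the paper.

```python
DETERMINISTIC_SECONDS = 0.03  # mean runtime per record, as reported
LLM_SECONDS = 3.95            # mean runtime per record, as reported


def batch_seconds(n_records: int, seconds_per_record: float) -> float:
    """Estimated wall-clock time for a batch processed one record at a time."""
    return n_records * seconds_per_record


slowdown = LLM_SECONDS / DETERMINISTIC_SECONDS
deterministic_total = batch_seconds(517, DETERMINISTIC_SECONDS)
llm_total_minutes = batch_seconds(517, LLM_SECONDS) / 60
```

Under these assumptions the LLM pathway is roughly 130 times slower per record: about 15.5 seconds for the deterministic batch against about 34 minutes for the LLM batch.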
In the evaluated run, all LLM outputs passed the initial schema validation. Because no output failed validation, the repair step was never needed, which indicates that validator-guided repair functioned as a built-in safeguard rather than as a contributor to the observed gains in extraction quality. This outcome supports the controlled use of probabilistic AI within a schema-first, auditable pipeline for missing-person and child-safety investigations.
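The idea of validator-guided repair can be sketched as a loop that re-prompts only when validation fails. This is a minimal illustration under stated assumptions: the `REQUIRED` schema, the `llm_extract` and `llm_repair` callables, and the `max_repairs` limit are hypothetical stand-ins, not the system's actual schema or interfaces.

```python
from typing import Callable

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED = {"name": str, "age": int, "last_seen_location": str}


def validate(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


def extract_with_repair(
    text: str,
    llm_extract: Callable[[str], dict],
    llm_repair: Callable[[str, dict, list[str]], dict],
    max_repairs: int = 2,
) -> tuple[dict, list[str], int]:
    """Extract once, then feed validator errors back to the model until valid."""
    record = llm_extract(text)
    errors = validate(record)
    repairs = 0
    while errors and repairs < max_repairs:
        record = llm_repair(text, record, errors)
        errors = validate(record)
        repairs += 1
    return record, errors, repairs
```

When the first extraction already validates, the loop body never runs and the repair count stays at zero, which corresponds to the situation described for the evaluated run.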
In conclusion, the Guardian Parser Pack gives law enforcement and related agencies a way to process heterogeneous case documents with higher extraction quality while keeping every record compliant with an established schema, supporting the timely and effective resolution of cases involving vulnerable individuals.
