PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
arXiv:2604.15776v1 (cross-listed)
Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems.
Introduction
Detecting Personally Identifiable Information (PII) has become more important as concerns about data privacy and security have grown. The resources available for PII detection, however, are scattered across domains and annotated inconsistently, which makes it hard for researchers and developers to benchmark their systems against one another. To address this, we introduce PIIBench, a benchmark corpus for evaluating PII detection across domains.
Corpus Composition
PIIBench consolidates ten publicly available datasets, which include:
- Synthetic PII corpora
- Multilingual Named Entity Recognition (NER) benchmarks
- Financial domain annotated text
This consolidation yields a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types, or about 1.4 mentions per sequence on average. Because the sources span synthetic, multilingual, and financial text, a detection system is evaluated on domains and entity types beyond the one it was built for.
Normalization Pipeline
To ensure consistency across the dataset, we developed a principled normalization pipeline that includes:
- Mapping 80+ source-specific label variants to a standardized BIO tagging scheme
- Frequency-based suppression of near-absent entity types
- Stratified train/validation/test splits preserving source distribution
The pipeline standardizes annotations across sources, and because the splits preserve the source distribution, each source is represented proportionally in the train, validation, and test sets.
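The three pipeline steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the entries in `LABEL_MAP`, the `min_mentions` threshold, the 80/10/10 ratios, and the record layout (a dict with a `source` key) are all hypothetical, since the text above does not specify them.

```python
import random
from collections import Counter, defaultdict

# Hypothetical mapping from source-specific label variants to canonical types.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "EMAILADDR": "EMAIL",
    "LOC": "LOCATION",
}


def normalize_tags(tags, label_map=LABEL_MAP):
    """Map source BIO tags to canonical BIO tags; unmapped types become 'O'."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append("O")
            continue
        prefix, _, label = tag.partition("-")  # "B-PER" -> ("B", "-", "PER")
        canonical = label_map.get(label)
        out.append(f"{prefix}-{canonical}" if canonical else "O")
    return out


def suppress_rare_types(sequences, min_mentions):
    """Replace tags of types with fewer than min_mentions mentions by 'O'."""
    # Each "B-" tag opens exactly one mention, so counting them counts mentions.
    counts = Counter(
        tag[2:] for tags in sequences for tag in tags if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_mentions}
    return [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in tags]
        for tags in sequences
    ]


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records per source so every split keeps the source proportions."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "validation": [], "test": []}
    for source in sorted(by_source):
        group = by_source[source]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits
```

Splitting within each source, rather than over the pooled corpus, is what keeps a small source from vanishing from the validation or test set by chance.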
Evaluation of Detection Systems
To establish baseline difficulty, we evaluated eight published systems spanning four categories:
- Rule-based engines (Microsoft Presidio)
- General-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT)
- PII-specific models (Piiranha DeBERTa)
- Financial NER specialists (XtremeDistil FiNER)
Despite this diversity, every evaluated system scored a span-level F1 below 0.14. The best performer, Microsoft Presidio, reached an F1 of 0.1385 yet still had zero recall on most entity types. These results show how poorly systems built for a single domain or label set transfer to a multi-source PII benchmark.
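For readers unfamiliar with the metric, the sketch below shows one common way to compute span-level F1 and per-type recall from BIO tag sequences. It assumes exact-match scoring (a predicted span counts only if its start, end, and type all match a gold span) and micro-averaging over all spans. The text above does not state which variant the benchmark uses, so treat these choices, and the function names, as illustrative.

```python
from collections import Counter


def extract_spans(tags):
    """Return the set of (start, end, label) spans in a BIO sequence.

    `end` is exclusive. An "I-" tag that does not continue an open span of
    the same type is ignored (a strict reading of BIO, assumed here).
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold_sequences, pred_sequences):
    """Micro-averaged, exact-match precision, recall, and F1 over spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def recall_by_type(gold_sequences, pred_sequences):
    """Recall per entity type; 0.0 means no gold span of that type was found."""
    found, total = Counter(), Counter()
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        for span in g:
            total[span[2]] += 1
            found[span[2]] += span in p  # True counts as 1
    return {label: found[label] / total[label] for label in total}
```

Under this scoring, a system that labels a narrow set of types well can still score a low overall F1, because every gold span of a type it never predicts counts as a miss. That is consistent with the zero-recall pattern reported above.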
Conclusion
PIIBench poses a more comprehensive evaluation challenge than any existing single-source PII dataset and quantifies the domain-silo problem in PII detection. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench to support further research in this area.
