PIIBench: Unified Benchmark for PII Detection in Text

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Summary: arXiv:2604.15776v1

Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems.

Introduction

Growing concern over data privacy and security has made the detection of Personally Identifiable Information (PII) increasingly important. However, the resources available for PII detection are scattered across domain-specific corpora with inconsistent annotation schemes, which makes it hard for researchers and developers to benchmark their systems against one another. To address this issue, we introduce PIIBench, a benchmark corpus designed to support consistent evaluation of PII detection across domains.

Corpus Composition

PIIBench consolidates ten publicly available datasets, which include:

Synthetic PII corpora
Multilingual Named Entity Recognition (NER) benchmarks
Financial domain annotated text

This consolidation yields a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. Because the sources span several domains, the corpus supports a broader evaluation of PII detection systems than any one of them alone.

Normalization Pipeline

To ensure consistency across the dataset, we developed a principled normalization pipeline that includes:

Mapping 80+ source-specific label variants to a standardized BIO tagging scheme
Frequency-based suppression of near-absent entity types
Stratified train/validation/test splits preserving source distribution

This pipeline standardizes the annotations and, because the splits preserve each source's share of the data, keeps every source represented in the train, validation, and test sets.
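The three steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label map entries, the `min_mentions` threshold, the 80/10/10 split ratios, and the `source` field on each record are all assumptions chosen for illustration, and the actual pipeline in the PIIBench repository may differ.

```python
import random
from collections import Counter, defaultdict

# Hypothetical source-label variants mapped to canonical types (illustrative only).
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "EMAILADDRESS": "EMAIL",
    "FAX": "FAX_NUMBER",
}


def normalize_tag(tag, label_map=LABEL_MAP):
    """Map one source tag such as 'B-PER' to a canonical BIO tag."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    canonical = label_map.get(label.upper())
    if prefix not in ("B", "I") or canonical is None:
        return "O"  # assumption: unmapped labels fall back to "O"
    return f"{prefix}-{canonical}"


def suppress_rare(sequences, min_mentions):
    """Relabel entity types with fewer than min_mentions mentions as 'O'."""
    # Count mentions by their B- tags, one per entity span.
    counts = Counter(
        tag[2:] for tags in sequences for tag in tags if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_mentions}
    return [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in tags]
        for tags in sequences
    ]


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records per source so each split keeps the source proportions."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for source in sorted(by_source):
        group = by_source[source]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    return splits
```

Splitting within each source, rather than over the pooled corpus, is what keeps a small source from disappearing from the validation or test set by chance.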

Evaluation of Detection Systems

To establish baseline difficulty, we evaluated eight published systems that encompass a range of methodologies, including:

Rule-based engines (Microsoft Presidio)
General-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT)
PII-specific models (Piiranha DeBERTa)
Financial NER specialists (XtremeDistil FiNER)

Despite this range of approaches, every evaluated system scored a span-level F1 below 0.14. The best performer, Microsoft Presidio, reached an F1 of 0.1385 and still had zero recall on most entity types. These results highlight how difficult PII detection remains once a system is evaluated outside its own domain.
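To make the metric concrete, here is a minimal sketch of span-level scoring over BIO tags. It assumes exact-match spans (same boundaries and same type) and micro-averaging over all spans; the summary does not state which averaging the benchmark uses, so treat both choices, and the function names, as assumptions rather than the benchmark's evaluation code.

```python
from collections import Counter


def extract_spans(tags):
    """Return the set of (start, end, label) spans in a BIO sequence; end is exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_scores(gold_seqs, pred_seqs):
    """Micro-averaged span precision, recall, and F1, plus recall per entity type."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
        for _, _, label in gold:
            n_gold[label] += 1
        for _, _, label in pred:
            n_pred[label] += 1
        # A prediction counts only if boundaries and type both match exactly.
        for _, _, label in gold & pred:
            tp[label] += 1
    total_tp = sum(tp.values())
    precision = total_tp / sum(n_pred.values()) if n_pred else 0.0
    recall = total_tp / sum(n_gold.values()) if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    recall_by_type = {label: tp[label] / n for label, n in n_gold.items()}
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "recall_by_type": recall_by_type,
    }
```

The per-type recall shows how a system can post a nonzero overall F1 while scoring exactly zero on every entity type it was never built to recognize.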

Conclusion

PIIBench presents a more comprehensive evaluation challenge than any existing single-source PII dataset and, in doing so, quantifies the domain-silo problem in PII detection. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench to encourage further research and development in this area.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
