PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms
Summary: arXiv:2603.27476v1
AI-powered people search platforms are widely used in recruiting, sales prospecting, and professional networking. However, no widely accepted benchmark exists for evaluating them, which makes it hard to assess and compare these platforms. To fill this gap, researchers have introduced PeopleSearchBench, an open-source benchmark that evaluates four people search platforms on 119 real-world queries across four use cases.
The four use cases are:
- Corporate Recruiting
- B2B Sales Prospecting
- Expert Search with Deterministic Answers
- Influencer/KOL Discovery
A key contribution of PeopleSearchBench is Criteria-Grounded Verification, a factual relevance pipeline that extracts explicit, verifiable criteria from each query and uses live web search to check whether each returned profile meets them. The result is a binary relevance judgment grounded in factual verification rather than in a large language model's subjective, holistic assessment of quality.
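The verification flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's code: the `Criterion` class, the `judge_profile` function, and the `keyword_verifier` stub are hypothetical names, the stub stands in for the live web search step, and the rule that a profile must satisfy every criterion to count as relevant is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """One explicit, checkable requirement extracted from a query (hypothetical type)."""
    description: str


# A verifier decides whether a profile satisfies a single criterion.
# In the benchmark this step relies on live web search; here it is pluggable.
Verifier = Callable[[Dict[str, str], Criterion], bool]


def keyword_verifier(profile: Dict[str, str], criterion: Criterion) -> bool:
    """Stand-in for web-based verification: a simple substring check."""
    text = " ".join(str(v) for v in profile.values()).lower()
    return criterion.description.lower() in text


def judge_profile(profile: Dict[str, str],
                  criteria: List[Criterion],
                  verify: Verifier) -> int:
    """Binary relevance: 1 if every criterion is verified, else 0 (assumed rule)."""
    return int(all(verify(profile, c) for c in criteria))


# Hypothetical query: "Rust engineers based in Berlin"
criteria = [Criterion("rust"), Criterion("berlin")]
match = {"title": "Senior Rust Engineer", "location": "Berlin"}
miss = {"title": "Senior Rust Engineer", "location": "Paris"}
labels = [judge_profile(p, criteria, keyword_verifier) for p in (match, miss)]
```

Keeping the verifier as a plain callable makes it easy to swap the stub for a search-backed check without touching the judging logic.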
The evaluation of the systems is based on three critical dimensions:
- Relevance Precision: Measured using padded nDCG@10.
- Effective Coverage: This includes task completion rates and the yield of qualified results.
- Information Utility: Assessed through profile completeness and overall usefulness of the information provided.
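The summary does not define "padded" nDCG@10, so the following is a sketch under one plausible reading: result lists shorter than 10 are padded with non-relevant slots, and the ideal ranking is assumed to contain 10 relevant results, so a system that returns few results cannot reach a perfect score. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg(rels: Sequence[int]) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))


def padded_ndcg_at_k(rels: Sequence[int], k: int = 10) -> float:
    """nDCG@k where short lists are padded with zeros (assumed definition).

    `rels` holds binary relevance labels in ranked order.
    The ideal DCG assumes k relevant results, so returning fewer than k
    results is penalized instead of being normalized away.
    """
    top = list(rels[:k])
    top += [0] * (k - len(top))   # pad missing positions as non-relevant
    ideal = dcg([1] * k)          # best case: all k slots relevant
    return dcg(top) / ideal


full = padded_ndcg_at_k([1] * 10)   # ten relevant results
single = padded_ndcg_at_k([1])      # one relevant result, nine padded slots
empty = padded_ndcg_at_k([])        # no results at all
```

Under the conventional definition, a list with a single relevant result would score 1.0; with this padding it scores well below that, which ties the precision metric to how many qualified results a system actually returns.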
These three dimensions are averaged with equal weight to produce an overall score for each system. Lessie, a specialized AI people search agent, ranked first with an overall score of 65.2, which is 18.5% higher than the second-ranked system, and was the only platform to achieve a 100% task completion rate across all 119 queries.
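As a minimal illustration of the equal-weight aggregation, the sketch below averages three dimension scores. The function name and the input values are hypothetical placeholders, not figures from the paper, and it assumes all three dimensions have already been normalized to the same 0-100 scale.

```python
def overall_score(relevance_precision: float,
                  effective_coverage: float,
                  information_utility: float) -> float:
    """Equal-weight mean of the three dimension scores (assumed 0-100 scale)."""
    return (relevance_precision + effective_coverage + information_utility) / 3


# Placeholder inputs chosen only to show the arithmetic
example = overall_score(50.0, 60.0, 70.0)
```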
The study also reports confidence intervals, human validation of the verification pipeline (Cohen’s kappa of 0.84), and ablation studies. Queries, prompts, and normalization procedures are documented in full to support transparency and reproducibility.
All code, query definitions, and aggregated results are available on GitHub for researchers and practitioners who want to use or contribute to the benchmark.
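For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below uses the standard formula; the label lists are made-up examples, not the study's validation data, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Fraction of items where the two raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (observed - expected) / (1 - expected)


# Made-up binary judgments: pipeline output vs. a human annotator
pipeline = [1, 1, 0, 0]
human = [1, 0, 0, 0]
kappa = cohens_kappa(pipeline, human)
```

Here the raters agree on 3 of 4 items (0.75) while chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.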
