Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers
Source: arXiv:2604.19219v1 (cross-listed)
Introduction
Federated Learning (FL) lets multiple parties collaboratively train machine learning models without centralizing raw data, which matters most where data privacy and security are paramount. The two main types of FL are Horizontal FL (HFL) and Vertical FL (VFL). In HFL, all participants share the same feature space but hold different samples; in VFL, the parties hold complementary features for the same set of samples.
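As a small illustration of the two partitioning schemes, the sketch below splits one toy table by rows (HFL-style) and by columns (VFL-style). The table, the column names, and the helper functions are hypothetical and are not taken from the paper.

```python
# Hypothetical toy table: one row per sample, identified by "id".
rows = [
    {"id": "u1", "age": 34, "balance": 1200},
    {"id": "u2", "age": 51, "balance": 80},
    {"id": "u3", "age": 27, "balance": 560},
    {"id": "u4", "age": 45, "balance": 30},
]

def horizontal_split(table, cut):
    """HFL-style: every party holds the same columns but different samples."""
    return table[:cut], table[cut:]

def vertical_split(table, key, features_a, features_b):
    """VFL-style: every party holds the same samples but different columns."""
    part_a = [{key: r[key], **{f: r[f] for f in features_a}} for r in table]
    part_b = [{key: r[key], **{f: r[f] for f in features_b}} for r in table]
    return part_a, part_b

hfl_a, hfl_b = horizontal_split(rows, 2)
vfl_a, vfl_b = vertical_split(rows, "id", ["age"], ["balance"])
```

In the vertical case both parts are already listed in the same sample order. Real VFL parties do not start out with that shared ordering, and establishing it is the job of entity alignment.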
Privacy-Preserving Entity Alignment (PPEA)
Effective VFL training requires privacy-preserving entity alignment (PPEA): establishing a common index of samples across parties without revealing which samples the parties share. Private set intersection (PSI) can achieve alignment, but its output is the intersection itself, so it exposes intersection membership and with it sensitive relationships between datasets. Private set union (PSU) approaches instead align on the union of identifiers, so the shared index does not single out which records the parties have in common.
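The difference between the two kinds of index can be seen in a plaintext toy example. The sketch below uses no cryptography at all, so identifiers are fully visible; it only shows what an intersection-based index and a union-based index contain. The party data and function names are hypothetical.

```python
# Plaintext illustration only: no cryptography, identifiers are visible here.
party_a = {"alice": {"age": 34}, "bob": {"age": 51}}
party_b = {"bob": {"balance": 1200}, "carol": {"balance": 80}}

def align_on_intersection(a, b):
    """PSI-style index: only shared identifiers, so membership is exposed."""
    return sorted(set(a) & set(b))

def align_on_union(a, b):
    """PSU-style index: every identifier held by at least one party."""
    return sorted(set(a) | set(b))

def local_rows(index, records):
    """Place a party's records on the shared index; None marks a missing row."""
    return [records.get(identifier) for identifier in index]

psi_index = align_on_intersection(party_a, party_b)
psu_index = align_on_union(party_a, party_b)
rows_a = local_rows(psu_index, party_a)
rows_b = local_rows(psu_index, party_b)
```

With the intersection index, every entry is by construction a shared record. With the union index, an entry alone does not say whether one party or both hold it. How a real protocol fills or hides the missing rows is outside the scope of this sketch.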
Limitations of Existing Approaches
Existing PSU-based methods have two main limitations: many are confined to two-party scenarios, and many lack typo-tolerant matching, which practical applications need because identifier quality varies.
Introduction of Sherpa.ai Multi-Party PSU Protocol
In response to these challenges, we present the Sherpa.ai multi-party PSU protocol for VFL. This PPEA method conceals intersection membership while supporting both exact and noisy matching. It extends two-party methods to multiple parties with minimal communication overhead.
Key Features of the Protocol
- Order-Preserving Version: This variant ensures exact alignment between datasets.
- Unordered Version: This version is designed to accommodate typographical and formatting discrepancies, enhancing its usability in real-world scenarios.
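This summary does not describe how the unordered version tolerates typos. As a generic, plaintext illustration of what noisy matching means, the sketch below canonicalizes identifiers and then compares them with a similarity threshold. It is not the protocol's mechanism; the function names and the 0.85 threshold are assumptions chosen for the example.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(identifier: str) -> str:
    """Remove formatting noise: accents, case, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", identifier)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_accents.casefold() if ch.isalnum())

def is_noisy_match(left: str, right: str, threshold: float = 0.85) -> bool:
    """Treat two identifiers as the same entity if they are similar enough.

    The threshold is an assumed value for illustration, not a recommendation.
    """
    ratio = SequenceMatcher(None, normalize(left), normalize(right)).ratio()
    return ratio >= threshold
```

For example, `normalize("  José O'Neil ")` yields `"joseoneil"`, and `is_noisy_match("Jonathan Smith", "Jonathon Smith")` accepts the single-letter typo. A privacy-preserving protocol cannot compare raw strings like this, which is part of what makes noisy matching hard in the PPEA setting.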
Theoretical Foundations
We prove the correctness and privacy of the Sherpa.ai multi-party PSU protocol and analyze its communication and computational complexity, with a focus on exponentiation operations. We also formalize a universal index mapping that takes local records to a shared index space.
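To give a sense of why exponentiations dominate the cost, the sketch below shows commutative blinding by modular exponentiation, a standard building block in Diffie-Hellman-style set protocols, followed by a toy mapping from blinded values to positions in a shared index. It is not the Sherpa.ai construction: the prime, the hash-to-group step, and all function names are assumptions, and the sketch makes no attempt to hide intersection membership.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Assumed for illustration only;
# it is far too small for real use.
P = 2**127 - 1

def hash_to_group(identifier: str) -> int:
    """Map an identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret_exponent() -> int:
    """Pick a secret exponent coprime to P - 1 so blinding is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def blind(value: int, exponent: int) -> int:
    """One modular exponentiation; (h**a)**b == (h**b)**a mod P."""
    return pow(value, exponent, P)

def universal_index(blinded_values):
    """Map each distinct double-blinded value to a shared index position."""
    return {v: i for i, v in enumerate(sorted(set(blinded_values)))}

key_a, key_b = new_secret_exponent(), new_secret_exponent()
ids_a, ids_b = ["alice", "bob"], ["bob", "carol"]

# Each party blinds its own identifiers, then the other party blinds again.
from_a = [blind(blind(hash_to_group(i), key_a), key_b) for i in ids_a]
from_b = [blind(blind(hash_to_group(i), key_b), key_a) for i in ids_b]

index = universal_index(from_a + from_b)
positions_a = [index[v] for v in from_a]
positions_b = [index[v] for v in from_b]
```

Because the two exponentiations commute, "bob" receives the same double-blinded value and therefore the same index position from both parties, while each identifier costs two exponentiations. In this toy version, anyone who sees both lists can tell which values coincide, so it illustrates only the blinding and indexing steps.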
Real-World Applications
This multi-party PSU protocol presents a scalable and mathematically robust solution for PPEA in various practical applications, including:
- Multi-institutional healthcare disease detection
- Collaborative risk modeling between banks and insurers
- Cross-domain fraud detection involving telecommunications and financial institutions
By keeping intersection membership private, the Sherpa.ai protocol enables collaborative machine learning in these settings while maintaining the integrity and confidentiality of sensitive data.
Conclusion
The Sherpa.ai multi-party PSU protocol advances privacy-preserving entity alignment for vertical federated learning. By extending PSU-based alignment to multiple parties and to noisy identifiers without disclosing the intersection, it has the potential to make collaborative data analysis practical across various sectors.
