A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
Summary: arXiv:2603.27154v1
Abstract
Entity resolution, the process of identifying database records that refer to the same real-world entity, can be effectively modeled using bipartite graphs that connect entity nodes to their respective attribute values. Recent research has shown that applying a message-passing neural network (MPNN) with all available extensions, such as reverse message passing, port numbering, and ego IDs, often incurs unnecessary overhead. This is primarily because different entity resolution tasks exhibit fundamentally different complexity levels. The key question addressed is: for a given matching criterion, what is the most efficient MPNN architecture that provably works?
Research Findings
This research presents a four-theorem separation theory for typed entity-attribute graphs. The authors introduce two co-reference predicates: Dup_r (two same-type entities share at least r attribute values) and the ℓ-cycle predicate Cyc_ℓ, for settings with entity-entity edges. For each predicate the study establishes tight bounds in two steps: it constructs graph pairs that are provably indistinguishable by any MPNN lacking the required adaptation, and it exhibits minimal-depth MPNNs that compute the predicate on all inputs.
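To make the duplicate predicate concrete, here is a minimal sketch of a direct (non-neural) check for "two same-type entities share at least r attribute values". It is not taken from the paper: the names `dup_r` and `dup_pairs`, the toy records, and the choice to treat an entity as never a duplicate of itself are all assumptions made for illustration.

```python
from itertools import combinations

def dup_r(attrs, types, u, v, r):
    """True if u and v are distinct, have the same type, and share >= r attribute values.

    attrs maps each entity to its set of attribute-value nodes (its neighbours
    in the bipartite entity-attribute graph); types maps each entity to its type.
    Excluding u == v is an assumption of this sketch.
    """
    if u == v or types[u] != types[v]:
        return False
    return len(attrs[u] & attrs[v]) >= r

def dup_pairs(attrs, types, r):
    """All unordered entity pairs satisfying the predicate, in sorted order."""
    return [(u, v) for u, v in combinations(sorted(attrs), 2)
            if dup_r(attrs, types, u, v, r)]

# Hypothetical toy records: three customers and one supplier.
attrs = {
    "c1": {"email:ann@example.com", "phone:555-0100", "zip:94105"},
    "c2": {"email:ann@example.com", "phone:555-0100"},
    "c3": {"zip:94105"},
    "s1": {"zip:94105", "phone:555-0100"},
}
types = {"c1": "customer", "c2": "customer", "c3": "customer", "s1": "supplier"}

print(dup_pairs(attrs, types, 1))  # pairs sharing at least one value
print(dup_pairs(attrs, types, 2))  # pairs sharing at least two values
```

With r = 1 the pair (c1, c3) qualifies through the shared zip code alone. With r = 2 only (c1, c2) remains, and the supplier never matches a customer because the types differ.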
Key Insights
- The research identifies a significant complexity gap between detecting any shared attribute and detecting multiple shared attributes.
- Detecting a single shared attribute is a purely local requirement: it needs only reverse message passing and two layers.
- In contrast, detecting multiple shared attributes requires cross-attribute identity correlation, that is, verifying that the same entity appears across several attributes of the target. This is a fundamentally non-local requirement that necessitates ego IDs and four layers, even in acyclic bipartite graphs.
- A similar necessity result is reported for cycle detection, which again calls for an architecture matched to the task.
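The identity-correlation point can be illustrated with a small colour-refinement (1-WL) experiment, the standard proxy for what an anonymous MPNN can distinguish. This is a toy sketch, not the paper's construction: the graph pair below is undirected and contains cycles, and the names `ring`, `refine`, `shared`, and `spread` are hypothetical. In `shared`, the target entity and one other entity have two attributes in common. In `spread`, every pair of entities has at most one attribute in common.

```python
from collections import Counter

def ring(n, tag):
    """Bipartite ring E0 - A0 - E1 - A1 - ... back to E0, as an adjacency dict."""
    adj = {}
    for i in range(n):
        adj[(tag, "E", i)] = [(tag, "A", i), (tag, "A", (i - 1) % n)]
        adj[(tag, "A", i)] = [(tag, "E", i), (tag, "E", (i + 1) % n)]
    return adj

def refine(adj, rounds, ego=None):
    """Colour refinement: a node's new colour is its old colour plus the sorted
    multiset of its neighbours' colours. Initial colours are the node types
    ("E" or "A"); the optional ego node gets a distinguishing "*" mark."""
    colour = {n: n[1] + ("*" if n == ego else "") for n in adj}
    for _ in range(rounds):
        colour = {n: (colour[n], tuple(sorted(colour[m] for m in adj[n])))
                  for n in adj}
    return colour

shared = {**ring(2, "x"), **ring(2, "y")}  # two 4-rings: E0 and E1 share A0 and A1
spread = ring(4, "z")                      # one 8-ring: neighbours share one attribute
u_shared, u_spread = ("x", "E", 0), ("z", "E", 0)  # target entity in each graph

def same_histogram(rounds):
    """Without ego marks, do both graphs yield the same multiset of colours?"""
    return (Counter(refine(shared, rounds).values())
            == Counter(refine(spread, rounds).values()))

def target_colours_differ(rounds):
    """With the target marked as ego, does its colour differ between the graphs?"""
    return (refine(shared, rounds, u_shared)[u_shared]
            != refine(spread, rounds, u_spread)[u_spread])

print(same_histogram(4))                               # True
print([target_colours_differ(k) for k in range(1, 5)]) # [False, False, False, True]
```

Without the ego mark, every entity and every attribute has degree two in both graphs, so refinement never separates them. With the mark, the target's colour first differs after four rounds in this toy pair, which is consistent with the four-layer figure above but is an illustration rather than a proof of the bound.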
Implications for Practitioners
The findings yield a minimal-architecture principle: for a given matching criterion, practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture would suffice. The authors also report computational validation of these predictions.
Conclusion
The study delineates a clear expressivity hierarchy for GNN-based entity resolution in master data management. Knowing the minimal architectural requirements for each class of matching criterion lets data scientists and engineers avoid adaptations, and the associated overhead, that a given task does not need.
