Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
Summary: arXiv:2604.16090v1 Announce Type: cross
Abstract: Probabilistic Synchronous Parallel (PSP) is a technique for distributed learning systems that reduces synchronization bottlenecks by sampling a subset of participating nodes in each round. In Federated Learning (FL), where edge devices are often unreliable because of mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes that device behavior is static and that devices behave independently of one another. This can make distributed synchronization unfair, because highly available nodes dominate training while nodes that are often unavailable rarely participate, so their data may be missed. When both data distribution and availability are correlated with the device, PSP and standard FL algorithms alike suffer from persistent under-representation of certain classes or groups, and learn the corresponding features inefficiently or not at all.
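The dominance effect described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: it assumes PSP samples uniformly from the nodes that happen to be reachable in a round, and the names `psp_round` and `simulate_participation`, the node labels, and the availability probabilities are all hypothetical.

```python
import random


def psp_round(available, sample_size, rng):
    """Uniformly sample up to `sample_size` nodes from those reachable this round.

    Assumption: plain PSP treats every reachable node alike.
    """
    pool = sorted(available)
    return rng.sample(pool, min(sample_size, len(pool)))


def simulate_participation(availability, sample_size, rounds, seed=0):
    """Count how often each node is selected over many rounds.

    `availability` maps a node name to the (static, independent) probability
    that the node is reachable in any given round.
    """
    rng = random.Random(seed)
    counts = {node: 0 for node in availability}
    for _ in range(rounds):
        reachable = [n for n, p in availability.items() if rng.random() < p]
        for node in psp_round(reachable, sample_size, rng):
            counts[node] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical fleet: two reliable nodes and one that is rarely reachable.
    counts = simulate_participation({"a": 0.95, "b": 0.95, "c": 0.05}, 2, 2000)
    print(counts)
```

Under these assumptions the rarely reachable node contributes to only a small fraction of rounds, so any labels held only by that node are seldom seen during training.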
We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses this coupling between unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient from chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores.
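One way to picture a Markov-based availability predictor feeding a weighted sampler is sketched below. The abstract does not give the model or the weighting rule, so everything here is an assumption for illustration: a two-state (up/down) chain per node with additive smoothing, a chronic-failure test based on expected outage length, and a weight equal to predicted availability divided by past participation. The names `TwoStateAvailability`, `sampling_weights`, and the threshold value are hypothetical.

```python
from collections import Counter


class TwoStateAvailability:
    """Per-node up/down Markov chain estimated from observed rounds (assumed model)."""

    def __init__(self, smoothing=1.0):
        self.transitions = Counter()  # (previous_up, current_up) -> count
        self.smoothing = smoothing    # additive smoothing keeps estimates in (0, 1)
        self.last = None

    def observe(self, up):
        """Record whether the node was reachable in the latest round."""
        if self.last is not None:
            self.transitions[(self.last, up)] += 1
        self.last = up

    def p_transition(self, src, dst):
        """Smoothed estimate of P(next state = dst | current state = src)."""
        total = self.transitions[(src, True)] + self.transitions[(src, False)]
        return (self.transitions[(src, dst)] + self.smoothing) / (total + 2 * self.smoothing)

    def p_up_next(self):
        """Predicted probability that the node is reachable next round."""
        if self.last is None:
            return 0.5
        return self.p_transition(self.last, True)

    def expected_outage_rounds(self):
        """Mean outage length: a geometric distribution with recovery prob P(down -> up)."""
        return 1.0 / self.p_transition(False, True)

    def is_chronic(self, threshold=5.0):
        """Assumed rule: outages expected to outlast `threshold` rounds count as chronic."""
        return self.expected_outage_rounds() > threshold


def sampling_weights(predictors, participation):
    """Normalised weights: favour nodes likely to be up that have rarely participated."""
    raw = {
        node: pred.p_up_next() / (1 + participation.get(node, 0))
        for node, pred in predictors.items()
    }
    total = sum(raw.values())
    return {node: weight / total for node, weight in raw.items()}
```

In this sketch a node that drops out briefly and recovers keeps a short expected outage and is treated as transient, while a node with long runs of downtime is flagged as chronic. Failure correlation between nodes and the DHT metadata layer are not modelled here.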
In trace-driven evaluation, our implementation of AW-PSP shows significant improvements over standard PSP. Key benefits include:
- Robustness to Failures: AW-PSP is more resilient to both independent and correlated device failures.
- Increased Label Coverage: By adjusting sampling probabilities based on real-time availability predictions, AW-PSP draws on a broader set of labels during training.
- Reduced Fairness Variance: AW-PSP reduces the tendency of highly available nodes to dominate training, so participation is spread more evenly across nodes.
- Scalability: AW-PSP is designed to scale to large numbers of nodes, including in heterogeneous and failure-prone settings.
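The abstract does not define label coverage or fairness variance, so the following is only a plausible reading, offered as a sketch: coverage as the fraction of distinct labels held by the selected nodes, and fairness variance as the population variance of each node's share of total participation. The function names and the example inputs are hypothetical.

```python
from statistics import pvariance


def label_coverage(node_labels, selected_nodes, num_labels):
    """Fraction of the label set held by at least one selected node (assumed definition)."""
    seen = set()
    for node in selected_nodes:
        seen.update(node_labels.get(node, ()))
    return len(seen) / num_labels


def participation_variance(counts):
    """Population variance of per-node participation shares (assumed definition).

    Zero means every node took part equally often; larger values mean a few
    nodes dominate.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    shares = [count / total for count in counts.values()]
    return pvariance(shares)
```

Applied to the participation counts of a PSP run and an AW-PSP run over the same trace, these two numbers would give a simple side-by-side view of the coverage and fairness claims.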
In conclusion, AW-PSP targets Federated Learning settings where device failures are common and data distribution is uneven. By combining real-time availability predictions with historical behavior, it supports more effective and fairer training across diverse applications.
