Robust Federated Learning Sync Amid Device Failures

Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure

Source: arXiv:2604.16090v1 (announce type: cross-list)

Abstract: Probabilistic Synchronous Parallel (PSP) is a technique used in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes in each round. In Federated Learning (FL), where edge devices are often unreliable because of mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes that device behavior is static and that devices behave independently of one another. This can make synchronization unfair: highly available nodes dominate training, while nodes that are often unavailable rarely participate, so their data may be missed. If both data distribution and node availability are correlated with the device, then both PSP and standard FL algorithms suffer from persistent under-representation of certain classes or groups, which results in inefficient or ineffective learning of certain features.
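
The dominance effect described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: it assumes a PSP-style round that samples nodes uniformly and keeps only those that happen to be online, and the node names and availability values are made up for illustration.

```python
import random

def psp_round(nodes, availability, sample_size, rng):
    """One simplified PSP-style round (assumed mechanics):
    sample a subset uniformly, then keep only the nodes that respond."""
    sampled = rng.sample(nodes, sample_size)
    return [n for n in sampled if rng.random() < availability[n]]

def participation_counts(nodes, availability, sample_size, rounds, seed=0):
    """Count how often each node actually contributes over many rounds."""
    rng = random.Random(seed)
    counts = {n: 0 for n in nodes}
    for _ in range(rounds):
        for n in psp_round(nodes, availability, sample_size, rng):
            counts[n] += 1
    return counts

# Hypothetical devices: "a" and "b" are almost always online, "c" and "d" rarely.
nodes = ["a", "b", "c", "d"]
availability = {"a": 0.95, "b": 0.90, "c": 0.20, "d": 0.10}

counts = participation_counts(nodes, availability, sample_size=2, rounds=2000)
print(counts)
```

Even though every node is sampled equally often, the rarely available nodes contribute far fewer updates. If those nodes also hold the only examples of some class, that class is under-represented in training.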

We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses this coupling between unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient from chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores.
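
The abstract does not give the predictor or the weighting formula, so the following is only a sketch of how the two ideas could fit together. It assumes a two-state (up/down) Markov chain fitted to each node's 0/1 availability trace, a hypothetical threshold rule for labelling a failure as chronic, and a hypothetical weight equal to predicted availability divided by one plus past participation. Failure correlation metrics and the DHT layer are not modelled here.

```python
from collections import Counter

def fit_two_state_chain(history):
    """Estimate P(up -> up) and P(down -> up) from a 0/1 availability trace.
    Add-one smoothing keeps estimates away from 0 and 1 on short traces."""
    t = Counter(zip(history, history[1:]))
    p_up_given_up = (t[(1, 1)] + 1) / (t[(1, 1)] + t[(1, 0)] + 2)
    p_up_given_down = (t[(0, 1)] + 1) / (t[(0, 1)] + t[(0, 0)] + 2)
    return p_up_given_up, p_up_given_down

def predict_up(history):
    """Predicted probability that the node is up in the next round."""
    p_uu, p_du = fit_two_state_chain(history)
    return p_uu if history[-1] == 1 else p_du

def classify_failure(history, chronic_threshold=0.2):
    """Hypothetical rule: a down node that rarely recovers is 'chronic'."""
    if history[-1] == 1:
        return "up"
    _, p_du = fit_two_state_chain(history)
    return "chronic" if p_du < chronic_threshold else "transient"

def sampling_weights(histories, participation):
    """Assumed weighting: favour nodes likely to be up, discount nodes
    that have already participated often. Normalised to sum to 1."""
    raw = {n: predict_up(h) / (1 + participation.get(n, 0))
           for n, h in histories.items()}
    total = sum(raw.values())
    return {n: w / total for n, w in raw.items()}

# Illustrative traces (1 = online, 0 = offline); values are made up.
histories = {
    "steady":  [1] * 10,
    "flaky":   [1, 0, 1, 0, 1, 0, 1, 0],
    "chronic": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}
participation = {"steady": 50, "flaky": 5, "chronic": 0}

status = {n: classify_failure(h) for n, h in histories.items()}
weights = sampling_weights(histories, participation)
print(status)
print(weights)
```

In this toy setup the flaky node is currently down but recovers often, so it is labelled transient and gets a larger weight than the steady node that has already participated many times. A real system would have to tune both the threshold and the weighting.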

Our implementation of AW-PSP has undergone trace-driven evaluation, which shows improvements over the standard PSP method. Key benefits include:

  • Robustness to Failures: AW-PSP improves resilience against both independent and correlated device failures.
  • Increased Label Coverage: By adjusting sampling probabilities based on real-time availability predictions, AW-PSP brings a broader set of labels into training.
  • Reduced Fairness Variance: The method reduces the tendency of highly available nodes to dominate training, so participation is spread more evenly across devices.
  • Scalability: AW-PSP is designed to scale to large numbers of nodes, including in heterogeneous and failure-prone settings.
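
The summary does not define how label coverage or fairness variance are measured. One plausible reading, sketched below as an assumption, treats label coverage as the fraction of all labels held by at least one participating node, and fairness variance as the variance of per-node participation rates. The function names and sample data are hypothetical.

```python
from statistics import pvariance

def label_coverage(participants, node_labels, all_labels):
    """Fraction of all labels held by at least one participating node."""
    seen = set()
    for n in participants:
        seen |= node_labels[n]
    return len(seen & all_labels) / len(all_labels)

def fairness_variance(participation_counts, rounds):
    """Population variance of per-node participation rates.
    0.0 means every node participated equally often."""
    rates = [c / rounds for c in participation_counts.values()]
    return pvariance(rates)

# Illustrative data: node "c" is the only holder of label 3.
node_labels = {"a": {0, 1}, "b": {1, 2}, "c": {3}}
all_labels = {0, 1, 2, 3}

coverage_without_c = label_coverage(["a", "b"], node_labels, all_labels)
coverage_with_c = label_coverage(["a", "b", "c"], node_labels, all_labels)
even = fairness_variance({"a": 10, "b": 10}, rounds=10)
skewed = fairness_variance({"a": 10, "b": 0}, rounds=10)
print(coverage_without_c, coverage_with_c, even, skewed)
```

Under these definitions, a round that misses node "c" cannot reach full coverage, and a schedule in which one node always participates while another never does has higher fairness variance than an even one.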

In conclusion, AW-PSP targets Federated Learning deployments where device failures are common and data is unevenly distributed across devices. By combining real-time availability predictions with historical behavior and failure correlation metrics, it aims to make training both more effective and fairer.


Lazarus Omolua