
Federated Learning at Scale: Addressing Client Intermittency and Resource Constraints

In federated learning systems, a server coordinates the training of machine learning models on data distributed across participating client devices. In each round of training, the server selects a subset of devices to perform model updates and then aggregates those updates before proceeding to the next round. Most state-of-the-art federated learning algorithms assume that clients are always available to perform training, an assumption readily violated in many practical settings where client availability is intermittent or even transient; moreover, in systems where the server samples from an exceedingly large number of clients, a given client is likely to participate in at most one round of training. This biases the learned global model towards client groups endowed with more resources. In this paper, we consider systems where clients are naturally grouped by their data distributions and the groups vary in the number of available clients. We present Flics-opt, an algorithm for large-scale federated learning under heterogeneous data distributions, time-varying client availability, and additional constraints on client participation reflecting, e.g., energy-efficiency objectives that must be met for sustainable deployment. In particular, Flics-opt dynamically learns a selection policy that adapts to client availability patterns and communication constraints, ensuring long-term per-group participation that minimizes the variance which client sampling inevitably introduces into the learning process. We show that for smooth non-convex functions, Flics-opt coupled with SGD converges at an $O(1/\sqrt{T})$ rate, matching state-of-the-art convergence results that require clients to be always available. We test Flics-opt on three realistic federated datasets and show that, in terms of maximum accuracy, Flics-Avg and Flics-Adam outperform traditional FedAvg by up to 280% and 60%, respectively, while remaining robust in the face of heterogeneous data distributions.
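The abstract gives no pseudocode, so the following is only a rough sketch of the kind of deficit-based rule that long-term per-group participation guarantees typically call for: each group keeps a virtual queue (a deficit counter) that grows when the group falls short of its per-round participation target and shrinks when its clients are selected. This is a standard technique for enforcing long-term constraints, not the paper's actual Flics-opt policy, and all names here (select_clients, group_of, deficits, targets, budget) are hypothetical.

from collections import defaultdict

def select_clients(available, group_of, deficits, targets, budget):
    # Rank available clients by how far their group lags its
    # long-term participation target (largest deficit first).
    ranked = sorted(available, key=lambda c: deficits[group_of[c]], reverse=True)
    selected = ranked[:budget]

    # Count this round's participation per group.
    participated = defaultdict(int)
    for c in selected:
        participated[group_of[c]] += 1

    # Virtual-queue update: each deficit grows by the per-round target
    # and shrinks by the group's actual participation this round.
    for g in targets:
        deficits[g] = max(0.0, deficits[g] + targets[g] - participated[g])
    return selected

# Example: two groups with unequal availability and equal targets.
group_of = {"a": 0, "b": 0, "c": 1}
deficits = {0: 0.0, 1: 0.0}
targets = {0: 1.0, 1: 1.0}
chosen = select_clients(["a", "b", "c"], group_of, deficits, targets, budget=2)
# If group 1 misses this round, its deficit grows, so it is prioritized next round.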
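For context on the stated rate: in the smooth non-convex setting, an $O(1/\sqrt{T})$ guarantee is conventionally expressed as a bound on the minimum (or average) expected squared gradient norm over $T$ rounds. A generic form of such a statement, not the paper's exact theorem, is

$$\min_{t \in \{0,\dots,T-1\}} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \frac{C}{\sqrt{T}},$$

where $f$ is the global objective, $x_t$ is the global model after round $t$, and $C$ depends on the smoothness constant, the initial suboptimality, and the variance introduced by client sampling.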

Keywords: Training, Data models, Vectors, Federated learning, Servers, Transient analysis, Stochastic processes

Mónica Ribero, Haris Vikalo, Gustavo de Veciana


Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA

2025

IEEE Journal of Selected Topics in Signal Processing