Federated Learning (FL) is a state-of-the-art technique for building machine learning (ML) models from distributed data sets. It enables In-Edge AI, preserves data locality, protects user data, and retains data ownership. These characteristics make FL a suitable choice for IoT networks, whose infrastructure is intrinsically distributed. However, FL presents a few unique challenges, the most noteworthy being training over largely heterogeneous data samples on IoT devices. The heterogeneity of devices and models in complex IoT networks strongly influences the FL training process and makes traditional FL unsuitable for direct deployment. While many recent research works claim to mitigate the negative impact of heterogeneity in FL networks, the effectiveness of these proposed solutions has unfortunately never been studied and quantified.
1. Heterogeneity in Federated IoT Networks
IoT networks are intrinsically heterogeneous. In real-life scenarios, FL is deployed over an IoT network with differing data samples, device capabilities, device availability, network quality, and battery levels. As a result, heterogeneity is evident and impacts the performance of a federated network. This section breaks down heterogeneity into its two main categories, statistical and system heterogeneity, and briefly discusses each.
1.1. Statistical Heterogeneity
Distributed optimization problems are often modeled under the assumption that data is Independent and Identically Distributed (IID). However, IoT devices generate and collect data in a highly dependent and inconsistent fashion. The number of data points also varies significantly across devices, which adds complexity to problem modeling, solution formulation, analysis, and optimization. Moreover, the devices may be related to one another, and there might be an underlying statistical structure capturing their relationship. When the aim is to learn a single, globally shared model, statistical heterogeneity makes it difficult to achieve optimal performance.
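To make this concrete, a common way to emulate statistical heterogeneity in FL experiments is to partition a dataset across clients with per-class proportions drawn from a Dirichlet distribution: a small concentration parameter yields highly skewed, non-IID shards. A minimal sketch (the function name and setup below are illustrative, not taken from any cited work):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class proportions
    drawn from Dir(alpha); smaller alpha -> more skewed, non-IID shards."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # proportion of class c assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# 1000 samples, 10 classes, 5 clients
labels = np.repeat(np.arange(10), 100)
shards = dirichlet_partition(labels, n_clients=5, alpha=0.1)
print([len(s) for s in shards])  # typically very uneven shard sizes
```

With a large alpha (e.g., 100) the same function produces near-uniform, approximately IID shards, which makes it a convenient knob for studying how the degree of heterogeneity affects training.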
1.2. System Heterogeneity
It is very likely for IoT devices in a network to have different underlying hardware (CPU, memory). These devices might also operate on different battery levels and use different communication protocols (WiFi, LTE, etc.). Consequently, the computational, storage, and communication capabilities differ for each device in the network. Moreover, IoT networks have to cope with stragglers as well: low-level IoT devices operate on low battery power and bandwidth and can become unavailable at any given time.
The aforementioned system-level characteristics can introduce many challenges when training ML models over the edge. For example, a federated network may consist of hundreds of thousands of low-level IoT devices, of which only a handful of active devices take part in training. Such situations can bias the trained model toward the active devices. Moreover, low participation can result in a long convergence time during training. For these reasons, heterogeneity is one of the main challenges for federated IoT networks, and federated algorithms must be robust, heterogeneity-aware, and fault-tolerant. Recently, a few studies have claimed to address the challenge of heterogeneity.
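The participation bias described above can be illustrated with a toy simulation. All device attributes here are hypothetical: we assume slower (low-end) devices happen to hold different data, so each device's local optimum is correlated with its speed, and we compare the average over all clients with the average over only the devices fast enough to meet a round deadline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients = 1000

# Hypothetical setup: compute speed correlates with the data a device holds.
speed = np.linspace(0.1, 1.0, n_clients)                       # relative speed
local_optima = rng.normal(loc=np.linspace(-2.0, 2.0, n_clients), scale=0.1)

def averaged_model(deadline_speed):
    """Average the local optima of only those clients fast enough
    to meet the round deadline (a crude stand-in for aggregation)."""
    active = speed >= deadline_speed
    return local_optima[active].mean(), active.mean()

full, _ = averaged_model(0.0)       # every device participates
fast, frac = averaged_model(0.8)    # only fast devices finish in time
print(f"all devices: {full:.3f}; fast-only ({frac:.0%} active): {fast:.3f}")
```

The full average sits near zero, while the deadline-filtered average is pulled strongly toward the fast devices' optima, which is exactly the bias that straggler-dropping schemes risk introducing.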
2. History and Development
In recent years, there has been a paradigm shift in the way ML is applied in applications, and FL has emerged as the frontrunner for systems driven by privacy concerns and deep learning [1][2][3]. FL is being widely adopted due to its compliance with GDPR, and it can be said to be laying the foundation for next-generation ML applications. Despite showcasing promising results, FL also brings unique challenges, such as communication efficiency, heterogeneity, and privacy, which are thoroughly discussed in [4][5][6][7][8]. To mitigate these challenges, various techniques have been presented over the last few years. For example,
[9] presented an adaptive averaging strategy, and the authors in
[10] proposed an In-Edge AI framework to tackle the communication bottleneck in federated networks. To deal with the resource optimization problem,
[11] focused on the design aspects for enabling FL at the network edge. In contrast,
[12] presented the Dispersed Federated Learning (DFL) framework to provide resource optimization for FL networks.
Heterogeneity is one of the major challenges faced by federated IoT networks. However, early FL approaches neither consider system and statistical heterogeneity in their design
[13][14] nor are straggler-aware. Instead, they assume uniform participation from all clients and sample a fixed number of data parties in each learning epoch to ensure performance and fair contribution from all clients. Under these unrealistic assumptions, FL approaches suffer significant performance loss and often diverge under heterogeneous network conditions.
Previously, many research works have tried to mitigate heterogeneity problems in distributed systems via system-level and algorithmic solutions
[15][16][17][18]. In this context, heterogeneity results from the differing hardware capabilities of devices (system heterogeneity) and leads to performance degradation due to stragglers. However, these conventional methods cannot handle the scale of federated networks. Moreover, heterogeneity in FL settings is not limited to hardware and device capabilities. Various other system artifacts such as data distribution
[19], client sampling
[20], and user behavior also introduce heterogeneity (known as statistical heterogeneity) in the network.
Recently, various techniques have been presented to tackle heterogeneity in a federated network. In
[21], the authors proposed to tackle heterogeneity via client sampling. They use a deadline-based strategy to filter out stragglers, but do not consider how excluding those straggler parties affects model training. Similarly,
[22] proposed to reduce the total training time via adaptive client sampling while ignoring the model bias. FedProx
[23] allows client devices to perform a variable amount of work depending on their available system resources and also adds a proximal term to the objective to account for the associated heterogeneity. A few other works in this area proposed reinforcement learning-based techniques to mitigate the negative effects of heterogeneity
[24][25]. Furthermore, algorithmic solutions have also been proposed that mainly focus on tackling statistical heterogeneity in the federated network. In
[26], the authors proposed a variance reduction technique to tackle the data heterogeneity. Similarly,
[27] proposed a new design strategy from a primal-dual optimization perspective to achieve communication efficiency and adaptivity to the level of heterogeneity among the local data. However, these techniques do not consider the communication capabilities of the participating devices. Furthermore, they have not been tested in real-life scenarios, which leaves us in the dark regarding their actual performance relative to the reported results. Comparing conventional and emerging federated systems in terms of heterogeneity and distribution helps us understand the open challenges as well as track the progress of federated systems
[28].
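Of the approaches above, FedProx's central idea is easy to sketch: each client minimizes its local loss plus a proximal penalty (mu/2)·||w − w_global||² that limits drift away from the global model under heterogeneous data. The following toy illustration on a quadratic client loss is a sketch only; the `grad_fn` interface, step counts, and learning rate are assumptions, not FedProx's actual implementation:

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu, lr=0.1, steps=20):
    """Local gradient descent on f_k(w) + (mu/2) * ||w - w_global||^2.

    grad_fn(w) returns the gradient of the client's local loss f_k;
    mu > 0 pulls the local iterate back toward the global model."""
    w = w_global.copy()
    for _ in range(steps):
        g = grad_fn(w) + mu * (w - w_global)  # gradient of the proximal term
        w -= lr * g
    return w

# Toy quadratic client loss f_k(w) = 0.5 * ||w - w_star||^2
w_star = np.array([3.0, -1.0])        # client's local optimum
grad_fn = lambda w: w - w_star
w_global = np.zeros(2)

plain = fedprox_local_update(w_global, grad_fn, mu=0.0)  # drifts toward w_star
prox = fedprox_local_update(w_global, grad_fn, mu=1.0)   # stays nearer w_global
print(plain, prox)
```

With mu = 0 the update reduces to plain local training and the iterate converges toward the client's own optimum; with mu > 0 it settles between the global model and the local optimum, which is how the proximal term tempers client drift in heterogeneous networks.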
A few studies have also been presented to understand the impact of heterogeneity in FL training. In
[29], the author demonstrated the potential impact of system heterogeneity by allocating varied CPU resources to the participants. However, the author only focused on training time and did not consider the impact on model performance. In
[30], the authors characterized the impact of heterogeneity on FL training, but they mainly focused on system heterogeneity while ignoring the other types of heterogeneity in the system. Similarly, in
[31], the authors used large-scale smartphone data to understand the impact of heterogeneity but did not account for stragglers. However, all of the studies mentioned above failed to analyze the effectiveness of state-of-the-art FL algorithms under heterogeneous network conditions.
This entry is adapted from the peer-reviewed paper 10.3390/iot3020016