Global energy demand is rising quickly, and the demand for electric energy is rising even faster, especially in household setups. Current studies reveal that the most crucial element in resolving energy issues is the intelligent and cost-effective use of electricity as the primary source of energy
[1]. This, in turn, raises the need for systems that recommend best practices and actions to use energy in homes, workplaces, and buildings more efficiently
[5][6]. To recommend positive actions to users and help them adopt a more efficient energy consumption behavior, it is essential first to capture their energy footprint and analyze their behavior concerning the use of appliances
[5][6]. Analyzing energy utilization can help in this regard: a viable solution is to use smart meters and sensors to record the energy consumption of each appliance, potentially combined with smart data analytics to visualize energy consumption habits
[7]. Nonetheless, a more affordable solution is to use only a single meter together with non-intrusive load monitoring (NILM) techniques
[8][9] to identify the consumption of each appliance from the aggregate measurements. NILM techniques thus offer the possibility to determine which appliances are utilized in a household at any moment and the corresponding amount of energy consumed
[10]. Therefore, these approaches can be leveraged by different services such as activity monitoring
[11] and the detection of defective appliances
[12].
2. Data Engineering for NILM
The emergence of ML algorithms in NILM scholarship has highlighted the importance of data engineering in enhancing the disaggregation performance for different appliances. This aspect has thus received particular attention in recent systematic reviews. The current manuscript groups three main data processing steps under the term data engineering: data preprocessing, feature extraction, and postprocessing techniques. Nonetheless, feature extraction received relatively more attention in the selected set of reviews than the preprocessing and postprocessing techniques adopted in different contributions.
The preprocessing techniques adopted in different contributions were covered in two reviews, namely
[17][18]. The authors of
[17] highlight two main techniques considered mandatory for the majority of algorithms: (i) handling sampling rates and missing data, and (ii) balancing. The first technique relates to the quality of the data sets during training and is leveraged to address potential technical problems that may occur in real setups (i.e., hardware and communication issues). The second technique balances the ON states/events of each appliance against the OFF states/events; this imbalance arises mainly because residential appliances are OFF most of the time. In addition to the previous two techniques, the authors of
[18] provided an overview of data augmentation techniques adopted mainly to address the underrepresented classes.
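As a concrete illustration of these preprocessing steps, the following minimal sketch shows uniform resampling with limited gap-filling and a simple ON/OFF balancing routine; the sampling rates, gap limit, and helper names are illustrative assumptions rather than a prescribed pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregate readings with a 3-second timestamp index.
timestamps = pd.date_range("2024-01-01", periods=1000, freq="3s")
power = pd.Series(np.random.rand(1000) * 2000, index=timestamps)

# (i) Handling sampling rates and missing data: resample to a uniform
# rate and forward-fill only short gaps, leaving long outages as NaN.
uniform = power.resample("6s").mean()
uniform = uniform.ffill(limit=10)  # tolerate gaps up to ~1 minute

# (ii) Balancing: appliances are OFF most of the time, so undersample
# OFF windows to match the number of ON windows (assumes OFF >= ON).
def balance_windows(windows: np.ndarray, labels: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    on_idx = np.flatnonzero(labels == 1)
    off_idx = np.flatnonzero(labels == 0)
    keep = np.concatenate([on_idx,
                           rng.choice(off_idx, size=len(on_idx), replace=False)])
    rng.shuffle(keep)
    return windows[keep], labels[keep]
```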
An overview of the types of features used in NILM was provided in five different reviews, mainly
[9][19][20][21]. These reviews reach a consensus in highlighting three types of features: steady-state features, transient features, and external/nontraditional features. Alternatively, the reviews presented in
[17][22] provide a classification of NILM features based on the required sampling frequency, offering a clear distinction between low-frequency and high-frequency features.
Considering post-processing techniques, only reviews presented in
[18][21] provided an overview of existing approaches for NILM algorithms. One of the main findings of the quantitative analysis provided by the first review (i.e.,
[18]) was the achievable enhancement: improvements of 28% to 54% were reported in related work. It can thus be concluded that postprocessing techniques are a key factor in improving existing algorithms.
It was widely acknowledged in all the reviews that ML and AI models have been the most prominent algorithms in NILM scholarship in recent years. Consequently, data engineering techniques are of enormous importance to future NILM developments.
3. NILM Algorithms, Comparison, and Evaluation Setups
The non-intrusive monitoring of the operation and energy consumption of appliances, especially in household setups that comprise a large variety of loads and specific usage patterns, has been recognized as an essential task for more than three decades, beginning with the seminal work of Hart, which defined the task
[8].
The first group of solutions was mostly based on combinatorial optimization techniques, which assumed that the total load results from a combination of appliances (with known loads) operating in different states (including not operating at all) and tried to find the combination of appliances and states that best matches the overall load measurement. Taking this one step further, hidden Markov models (HMMs) attempted to model the task probabilistically, reasoning about which appliances operate at every moment and the state they are in. In the last decade, technological advancements in neural networks and the underlying infrastructures that support their operation, as well as the abundance of training data, gave rise to ML approaches for NILM, mainly neural NILM, which demonstrated state-of-the-art performance under a variety of training conditions (e.g., high sampling rates, sufficient computational capacity).
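To make the combinatorial formulation concrete, the following brute-force sketch searches all state combinations for the one whose summed power best matches the aggregate measurement; the appliance profiles are purely illustrative.

```python
from itertools import product

# Illustrative per-state power draw (watts); state 0 is OFF.
APPLIANCE_STATES = {
    "fridge": [0, 120],
    "kettle": [0, 2000],
    "washer": [0, 300, 2200],  # e.g., motor-only vs. heating state
}

def disaggregate(aggregate_watts: float):
    """Exhaustively search appliance/state combinations and return the
    one whose total power is closest to the aggregate measurement."""
    names = list(APPLIANCE_STATES)
    best, best_err = None, float("inf")
    for combo in product(*(range(len(APPLIANCE_STATES[n])) for n in names)):
        total = sum(APPLIANCE_STATES[n][s] for n, s in zip(names, combo))
        if abs(aggregate_watts - total) < best_err:
            best, best_err = dict(zip(names, combo)), abs(aggregate_watts - total)
    return best, best_err

print(disaggregate(2320.0))  # matches fridge ON + washer heating exactly
```

The search space grows exponentially with the number of appliances, which is precisely the scalability limitation that motivated the probabilistic and neural approaches discussed next.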
The problem of non-intrusive monitoring of appliances’ load based on the disaggregation of the measurements from a single monitoring device is usually approached in the literature by breaking it into smaller tasks. Given a known inventory of appliances for a household, these tasks comprise (a) the detection of different states for each appliance, (b) the extraction of signatures per state and appliance, and (c) the classification of each measurement to the most promising combination of appliances’ states
[23]. Instead of monitoring the operation of each appliance on a second-by-second basis, some NILM techniques simply identify state change events and consequently record the start and end time of an appliance usage and the total energy consumed
[24]. Alternatively, neural NILM models provide a point-to-point solution for each appliance.
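For the event-based techniques just mentioned, a minimal detection sketch is shown below: it flags large jumps in the power signal as candidate state-change events; the threshold value and the ON/OFF pairing logic are illustrative assumptions.

```python
import numpy as np

def detect_events(power: np.ndarray, threshold: float = 50.0):
    """Flag samples where power jumps by more than `threshold` watts,
    a simple proxy for appliance state-change events."""
    deltas = np.diff(power)
    on_events = np.flatnonzero(deltas > threshold) + 1
    off_events = np.flatnonzero(deltas < -threshold) + 1
    return on_events, off_events

# Pairing each ON event with the next OFF event yields a usage interval
# whose duration and summed power give the energy consumed.
```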
Convolutional neural networks (CNNs) can be employed to detect state-change events. As suggested in
[24], a current sequence of length
L² is transformed into an image of L×L pixels and fed to a CNN, which is first trained on a single-load identification task, i.e., distinguishing between appliances when only one appliance is on at each moment. This is taken one step further by establishing a multi-load identification task, in which the model is trained to distinguish between all possible load combinations. The main restriction of such approaches is that the number of appliances in a household can be large; consequently, the number of combinations that must be identified at any moment becomes huge.
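The following sketch illustrates the reshaping step and a small CNN classifier in PyTorch; the side length L, layer sizes, and number of load classes are illustrative assumptions, not the architecture of [24].

```python
import numpy as np
import torch
import torch.nn as nn

L = 32                              # image side; a window holds L*L samples
window = np.random.rand(L * L)      # stand-in for a current measurement window
image = window.reshape(1, 1, L, L)  # (batch, channel, height, width)

N_LOADS = 5                         # hypothetical number of load classes
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * (L // 4) * (L // 4), N_LOADS),
)
logits = model(torch.from_numpy(image).float())  # (1, N_LOADS) class scores
```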
Energy measurement data are usually considered to be in the form of time series or sequences. Consequently, the respective DNN architectures that capture the temporal semantics of input have also been employed. More specifically, recurrent neural networks (RNNs) have been used in
[25] as an alternative to combinatorial optimization. RNNs successfully reconstruct the appliance signatures from the aggregated measurements and can perfectly fit appliances they have already been trained on. However, they struggle to generalize to unseen appliances or power states and require vast amounts of data and considerable computational power to be trained. In an attempt to improve the generalization of RNNs, the authors of
[26] employ gated recurrent units (GRUs) and show that they outperform the RNN baseline. In the same direction, the authors of
[27] suggest using LSTM-RNNs to better tackle the vanishing gradient problem whilst learning the long-term patterns that constitute the appliances’ signatures in the multi-state and multi-appliance setup.
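A minimal recurrent disaggregator of the kind discussed above might look as follows in PyTorch; the hidden size, depth, and sequence-to-sequence output head are illustrative choices rather than the exact models of [25][26][27].

```python
import torch
import torch.nn as nn

class LSTMDisaggregator(nn.Module):
    """Maps a window of aggregate power readings to the target
    appliance's estimated power at every time step."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):          # x: (batch, window_len, 1)
        out, _ = self.lstm(x)      # (batch, window_len, hidden)
        return self.head(out)      # per-step appliance power estimate

estimate = LSTMDisaggregator()(torch.randn(8, 256, 1))  # (8, 256, 1)
```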
Autoencoders (AEs) represent another architecture commonly used to extract more coherent representations of the input data. As such, they can be used to extract the features that compose the signatures of the various appliances. They are composed of encoding and decoding layers, and at training time they learn to produce an output that resembles (if not matches) the input. After training, the encoder is used to obtain a representation of the input in a different dimension. A stochastic variation of autoencoders is the denoising autoencoder (dAE), which introduces noise to the input so that the autoencoder does not learn the identity function (i.e., f(x) = x) during training. Consequently, the energy disaggregation task can be approached as a denoising problem, utilizing techniques that transfer a noisy overall consumption from multiple appliances to a “clean” consumption of each individual appliance, using as input active, reactive, or apparent power, current, voltage, or any combination of them.
Denoising AEs employ a 1-D convolutional layer in the encoder part to feed the input measurements in segments (windows of a few seconds) and another 1-D convolutional layer in the decoder, whose size depends on the size of the appliance activations
[28]. They can be trained using synthetic datasets that combine the measurements of various appliances, aiming to reconstruct each appliance’s signature in the output. The authors of
[29] combined dAEs with RNNs to bring together the merits of ANNs and HMM-based methods. Using dAEs, they obtain the signatures of the appliances, and by feeding these to an LSTM, they identify the most promising combination of appliances (and modes) that corresponds to the aggregated consumption at any moment.
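A skeletal 1-D convolutional denoising autoencoder of the kind described in [28] could be sketched as follows; the kernel sizes, channel counts, and window length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvDenoisingAE(nn.Module):
    """Noisy aggregate window in, clean single-appliance signature out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                   # x: (batch, 1, window_len)
        return self.decoder(self.encoder(x))

noisy_aggregate = torch.randn(8, 1, 256)    # synthetic mixture of appliances
clean_estimate = ConvDenoisingAE()(noisy_aggregate)  # (8, 1, 256)
```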
The review presented in
[18] on DNN approaches for low-frequency NILM begins with the increased requirements for processing high-frequency NILM data and continues with the evaluation of various NN-based techniques that combine CNNs with LSTMs, GRUs, and other RNN variations, or even with generative adversarial networks (GANs) and AEs (denoising or variational), in an attempt to improve the classification accuracy of collective appliance signals. The main challenge for the different algorithms relates to the overall performance, which is usually affected by the dataset used, the sampling frequency, the input features, the evaluation metrics, etc. The choice of the best parameters for all of the above can affect the final performance as much as the architecture itself. According to
[30], a best practice for developing DNN models is automating hyper-parameter tuning and the selection of the appropriate architecture. Using toolkits that aggregate multiple alternative architectures makes it possible to find the best solution for each NILM setup.
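In its simplest form, such automation amounts to a systematic search over a configuration space, as in the sketch below; the search space and the `train_and_score` callback are hypothetical placeholders for a real training routine.

```python
from itertools import product

# Hypothetical search space for a NILM model.
SEARCH_SPACE = {
    "window_len":    [128, 256, 512],
    "learning_rate": [1e-4, 1e-3],
    "hidden_units":  [32, 64],
}

def grid_search(train_and_score):
    """Train one model per configuration and keep the best validation
    score; `train_and_score(**cfg)` is assumed to return that score."""
    keys = list(SEARCH_SPACE)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_score(**cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```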
The evaluation of NILM algorithms is generally performed using widely acknowledged ML metrics and NILM datasets. Nonetheless, some evaluation metrics dedicated only to NILM models can also be identified
[31], though they have received little attention in recent NILM reviews since they are less commonly used. Despite their seldom use, these metrics could provide a better summary of disaggregation results since they focus on the NILM problem by design. NILM datasets also received significant attention in existing reviews, where the sampling rate and data quality remain the main concerns. Furthermore, NILM toolkits are an important part of the evaluation, as they improve research efficiency. This aspect was only covered in two reviews
[21][32], revealing that available NILM toolkits emphasize the algorithms without considering the available hardware and network infrastructure, which is critical for the real-time monitoring of appliances. In this direction, lightweight models
[33] that combine CNNs for learning features and simple classifiers to detect appliances seem to be promising solutions. Another solution for scalability is using FL approaches
[34], which move the processing load from a centralized to a decentralized setting, taking advantage of several low-processing-power devices to solve the same task. Federated NILM solutions can also support privacy, since data are not shared across nodes or with a centralized server
[35], but also open new challenges for researchers, which are discussed in more detail in the following section.
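Returning to the NILM-specific metrics mentioned above [31], one widely used example is the proportion of total energy correctly assigned across appliances; a minimal sketch, assuming ground-truth and predicted power matrices of shape (time steps, appliances):

```python
import numpy as np

def energy_correctly_assigned(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Proportion of total energy correctly assigned: 1 minus the summed
    absolute assignment error, normalized by twice the true total energy."""
    return 1.0 - np.abs(y_pred - y_true).sum() / (2.0 * y_true.sum())
```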
4. Federated NILM
FL
[36][37][38], also referred to as collaborative learning, is a learning paradigm that Google introduced in 2017 to protect the privacy of its clients. Following this paradigm, the model is sent to the client rather than the data being uploaded to a cloud server. Training starts at a central server responsible for initializing the model’s weights and sharing them with the clients. Upon receiving the global model, each client executes a training task using its local data for a number of iterations and sends the new weights of the model back to the central server. Once the central server has received the local models, it aggregates them to obtain an updated version of the global model. The process is repeated for several rounds until convergence is achieved. The most popular aggregation algorithm is FedAvg
[39], which relies on averaging the weights of the local models as an aggregation mechanism. A weighted average can be used when the sizes of the local datasets differ among the clients participating in the training. Several variants of this scheme exist in the literature, considering different aspects
[37]. For example, peer-to-peer FL enables direct clients’ communication and eliminates the central node
[36]. More precisely, each client broadcasts its model to the other clients contributing to the training round. In this variant of FL, the goal is to achieve a fully decentralized training process without the need for a central server, which would constitute a single point of failure. Other variants of FL also exist but remain out of the scope of the current manuscript.
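The FedAvg aggregation step [39] described above reduces to a (dataset-size-weighted) average of the clients’ weight tensors, as in this minimal sketch; the toy layer shapes and client sizes are illustrative.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-layer client weights: clients with larger
    local datasets contribute proportionally more to the global model."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[l] * (n / total) for w, n in zip(client_weights, client_sizes))
        for l in range(n_layers)
    ]

# One round: three clients return their locally trained weights.
clients = [[np.ones(4) * i, np.ones(2) * i] for i in (1, 2, 3)]
global_weights = fed_avg(clients, client_sizes=[100, 200, 300])
```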
The upgrade of the electrical grid in many countries around the globe, with the advanced metering infrastructure and edge devices, offers the possibility of adopting an FL paradigm for efficient grid management. It was extensively adopted in the case of load forecasting (e.g.,
[40]) and power generation prediction for renewable energies (e.g.,
[41]). Nonetheless, only a handful of contributions have explored the adoption of this learning paradigm in NILM scholarship: ten contributions address residential load disaggregation, one addresses solar energy disaggregation, and only one investigates the security aspects of FL in smart grids with respect to load disaggregation.
An FL framework for NILM was suggested in
[42], where transfer learning was used between different domains. The goal of the contribution was to protect consumers’ privacy and overcome the problem of non-identically distributed data. Three public data sets were considered in the evaluation setup, where the main focus was to establish a comparison with centralized load disaggregation schemes. The results showed high potential for the suggested FL approach. Nonetheless, transfer learning from one domain to another yielded poor results, showing that fine-tuning is required. Despite the extensive evaluation of the disaggregation performance, the study provided no analysis of the communication cost and model efficiency, and little attention was given to the hardware requirements of the edge devices. These limitations were also acknowledged in
[43] and highlighted as a future direction. Furthermore, the authors stressed the need to upgrade NILM toolkits with federated/decentralized trainers, enabling further research in this respect. Both of the previous studies adopted a Seq2Point model, which shows the strength of this model in the case of FL for load disaggregation. More precisely, even shortened versions of this model provide very competitive results, as demonstrated in
[44], where the authors shortened the Seq2Point baseline and trained it following an FL paradigm, obtaining promising results despite the low number of training clients. A similar study focusing on transfer learning was suggested in
[45], where a model-agnostic meta-learning approach was introduced to enable task-specific learning and allow data owners to adjust the models based on their tasks. In this regard, FL is augmented with a meta-learning step at each round. The evaluation setup demonstrated enhanced disaggregation performance, but with a longer time required for convergence.
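For orientation, a Seq2Point-style model maps a window of aggregate power to the target appliance’s power at the window midpoint; the sketch below loosely follows that idea with fewer and smaller layers than the actual baseline, so all sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2PointNet(nn.Module):
    """CNN mapping an aggregate-power window to a single scalar: the
    target appliance's power at the window midpoint."""
    def __init__(self, window: int = 99):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 30, kernel_size=10, padding="same"), nn.ReLU(),
            nn.Conv1d(30, 30, kernel_size=8, padding="same"), nn.ReLU(),
            nn.Conv1d(30, 40, kernel_size=6, padding="same"), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(40 * window, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                    # x: (batch, 1, window)
        return self.head(self.conv(x))

midpoint_power = Seq2PointNet()(torch.randn(16, 1, 99))  # (16, 1)
```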
FL was further tested in combination with differential privacy in
[46] where the Seq2Point
[14] baseline was leveraged in the experimental setup. The evaluation showed that this combination provides good results in the case of the fridge, which exhibits a periodic consumption pattern, but fails in the case of hand-operated appliances, mainly the kettle and microwave, which are directly related to daily routines. Furthermore, the authors demonstrated that differential privacy degrades results due to the added noise: smaller epsilon values help mitigate privacy attacks, while higher values provide privacy leakage similar to that of the standard FL framework. A similar study was presented in
[47], evaluating the impact of added noise on the overall disaggregation of a standard federated NILM framework. The evaluation was performed using a temporal pooling model on three different data sets. The results showed that the amount of added noise drastically hinders the disaggregation task, reaching conclusions similar to those of the work presented in
[46].
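Schematically, differential privacy in this setting typically clips each client’s model update and adds Gaussian noise before aggregation, as sketched below; the clipping norm and noise scale are illustrative, and the calibration of the noise to a target epsilon is omitted.

```python
import numpy as np

def privatize_update(update: np.ndarray, clip_norm: float,
                     noise_scale: float, rng=None):
    """Clip a client's update to bound its influence, then add Gaussian
    noise; more noise (smaller epsilon) means stronger privacy but
    poorer disaggregation, as the studies above report."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_scale * clip_norm, size=update.shape)
```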
The performance of a classification federated NILM algorithm was investigated in
[48], combining FL with state-of-the-art NILM models for state classification. However, it mainly concentrated on using testing data from the same buildings included in the training, which may have led to biased conclusions. A multi-target federated NILM framework was suggested in
[35]. The proposed framework leverages a multi-target learning paradigm to train a single model for all the target appliances, with pruning techniques to compress the model. Experiments on three real datasets demonstrated an acceptable trade-off between privacy and disaggregation performance, albeit with relatively low performance, notably a low F1-score.
Interestingly, a federated decision tree algorithm was designed in
[49] for load disaggregation, leveraging a two-stage voting process and node-level parallelism for co-modeling NILM. During the model training phase, the server receives the local training results and makes the final decision to select the model parameters used to split the tree nodes, including the features and the corresponding thresholds. The local clients are responsible for data preprocessing, tree structure initialization, gradient computation, local histogram establishment, local split finding, and model updating. The voting thus proceeds in two stages, as sketched below: each local machine selects a list of top-K candidate features based on the maximum variance gain and forwards it to the central server, which selects the final candidate features by majority voting. Unfortunately, designed this way, the algorithm suffers from privacy leakage of partial feature indexes.
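The server-side stage of this voting can be illustrated in a few lines of Python; the feature indexes and the value of K are hypothetical, and the snippet also makes the privacy concern visible, since raw feature indexes are exposed to the server.

```python
from collections import Counter

def server_select_features(local_top_k, k):
    """Majority vote over the top-K candidate feature indexes proposed
    by each local machine (the second stage of the voting process)."""
    votes = Counter(f for top_k in local_top_k for f in top_k)
    return [feature for feature, _ in votes.most_common(k)]

# Three clients propose their locally best feature indexes:
print(server_select_features([[3, 7, 1], [3, 2, 7], [7, 3, 5]], k=2))  # [3, 7]
```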
Despite the interesting findings of the previous studies, a shared shortcoming among them is their high dependence on the central node. More precisely, all previously presented works adopt a client-server architecture in which the server represents a single point of failure. To overcome this issue, a fully decentralized FL approach was evaluated in
[50] by adopting a circle topology instead of a star topology to optimize clients’ communication. The experimental setup showed results equivalent to those of the centralized FL approach. However, the authors did not evaluate the gain/loss in communication bandwidth in the decentralized case. Furthermore, each node in the circle topology is itself a point of failure; further research is thus required to develop a mechanism that allows re-establishing the circle in the case of failures.
The results obtained on unseen buildings were chosen whenever available. Moreover, the F1-score is the most common metric among the different contributions. It is clear from the table that the results drastically differ between appliances. The highest F1-score was reported for the washing machine, upon optimal model selection before aggregation, in
[43]. Meanwhile, the worst value was reported for the dishwasher in
[47]. Beyond indicating the shortcomings of current FL frameworks, these results highlight the tremendous challenge that training on several appliances from different buildings can impose. The low values reported in
[35][47] are linked to the mechanisms added on top of the standard federated framework, namely compression and differential privacy. Overall, the reported results are acceptable, especially for approaches that used training data from different buildings and were tested on unseen buildings, thus simulating the most realistic scenario.