Self-Healing in Cyber–Physical Systems Using Machine Learning: Comparison
Please note this is a comparison between Version 2 by Sirius Huang and Version 1 by Ali Safaa Sadiq.

The rapid advancement of networking, computing, sensing, and control systems has introduced a wide range of cyber threats, including those from new devices deployed during the development of scenarios. With advancements in automobiles, medical devices, smart industrial systems, and other technologies, system failures resulting from external attacks or internal process malfunctions are increasingly common. Restoring the system’s stable state requires autonomous intervention through the self-healing process to maintain service quality.

  • cyber–physical system
  • cybersecurity
  • threat tolerance
  • self-healing
  • intrusion detection
  • machine-learning algorithms

1. Introduction

Cyber–physical systems (CPSs) are integrated systems that bridge the physical and cyber domains, enabling the seamless integration of physical processes and computing systems [1]. Self-healing in CPSs refers to the ability of these systems to automatically detect and respond to faults or failures without human intervention, enhancing their resilience and reliability [2]. While self-healing capabilities can improve the performance and robustness of CPSs, they also face several vulnerabilities, threats, and challenges that need to be addressed [3]. These include hardware and software component vulnerabilities that may be susceptible to cyber-attacks and other threats, compromising the self-healing process [4]. The lack of standardisation in self-healing mechanisms and technologies creates system interoperability issues, leading to vulnerabilities and integration challenges [2]. The complexity of CPSs, with multiple interconnected components, poses difficulties in identifying and diagnosing faults, making it challenging to implement effective self-healing mechanisms [5]. Human error during system design, implementation, and maintenance can create vulnerabilities and compromise the self-healing capabilities of the system [2]. The lack of visibility into CPS self-healing systems can also hinder fault identification and compromise system operations [6]. CPS self-healing systems are also vulnerable to malicious attacks, including denial-of-service attacks, malware, and hacking, which can compromise the system’s integrity and availability [7]. Considering that CPS self-healing systems are vital for critical infrastructure, such as transportation systems and power grids, failures or vulnerabilities in these systems can have severe safety implications [8]. Addressing these vulnerabilities, threats, and challenges is essential to ensure the security, reliability, and safety of critical infrastructure supported by self-healing capabilities in CPSs [2].
The growing use of digital systems in business, manufacturing, healthcare provision, and government services brings an attendant risk of increased threats to computer systems and networks. These threats can take the form of cyber-attacks at the individual or the organisational level. For example, attacks during the COVID-19 pandemic lockdowns targeted people isolated at home as well as schools, businesses, hospitals, manufacturing plants, and social infrastructure. Through the widespread adoption of digital systems, communities have become more susceptible to malicious cyber-attacks; hence, the importance of research on self-healing in computer systems has increased in recent years.
Cyber–physical systems are part of the Industry 4.0 devices that utilise the power of the Internet to convert existing Industry 3.0 devices into smart industry devices. These include cyber–physical systems deployed in smart manufacturing, smart grids, smart cities, and innovative automobiles. Figure 1 highlights the cyber–physical system as part of Industry 4.0; the figure focuses on the physical components of Industry 4.0, including cyber–physical systems and IoT, while underscoring the self-healing capability of CPSs in modern manufacturing systems that use digital technologies such as cloud computing. An example of the transition of state-of-the-art system protection from manual intervention to an automated self-healing approach is noted in [9]. The study argues that as providers migrate from 4G to more robust 5G networks, the operational costs associated with network failures, which are predicted to increase exponentially, account for approximately 23% to 26% of mobile network revenue. To control these expenses, mobile network providers migrating to 5G are shifting towards automating system protection through self-healing. Self-healing systems are also being deployed in electricity distribution plants worldwide, with most deployments burdened by latency, bandwidth, and scalability problems, as highlighted in [10]. That study presents a standardised architecture for distributed power control that uses self-healing functionality to resolve system faults. The proposed system increases reliability during normal operations and resilience during threat events. The result of the self-healing experiment in [10] is currently undergoing field implementation by Duke Energy.
Deploying machine learning to build self-healing functionality into the power grid is increasingly important as the population grows; according to [11], frequent power outages impose a considerable cost on the economy and adversely affect people’s quality of life. The authors of [11] proposed coupling a fault-solving library with a machine-learning algorithm to create self-healing functionality in computer systems.
Figure 1.
Chronological progression of industrial revolutions: from the 1st to the 4th.

2. Self-Healing Theories

Self-healing theories are areas of research that seek to formulate arguments that explain the fundamental principles to be considered when implementing self-healing functionality and the pattern between self-healing and other areas of science. The self-healing cyber–physical system section describes what it means to have the self-healing functionality implemented into the cyber–physical system, and self-healing methods detail the models, frameworks, and network architectures that underpin the implementation of self-healing functionality.
Hence, different self-healing theories are presented and discussed in the following subsections.

2.1. Negative and Positive Selection

Negative and positive selection are two processes that the immune system uses to ensure that only healthy cells are present in the body. A self-healing system is a system that can repair itself when damaged or infected. In the biological case, if a cell is found to be harmful, the immune system eliminates it and then begins to repair and regenerate healthy cells. From the biological science viewpoint, negative selection is the process in which the immune system removes cells that recognise self-antigens, which ensures that immune cells do not attack healthy cells. Likewise, positive selection is the process in which the immune system selects cells that recognise foreign antigens and primes itself to identify and eliminate harmful cells or pathogens. The CPS self-healing theory of negative and positive selection replicates this biological immune response in computer science. An example is using a genetic algorithm to detect system intrusions and then deploying the self-healing functionality of the algorithm to remediate the threat.
The characterisation of anomalies is essential in ascertaining where potential threats or faults are located within a system, and the theory relied upon to achieve this is negative and positive selection. Identifying threats before deploying self-healing functionality is vital for appropriate remediation. Anomaly detection by negative selection is called “non-self” detection, and anomaly detection by positive selection is called “self” detection [12]. The central concept of negative selection, as shown in Figure 2, is to construct a set of “non-self” entities that do not pass a similarity test with any pre-existing “self” entities. If a new entity is detected that matches the “non-self” entities, it is rejected as foreign.
Figure 2.
Negative and positive selection in self-healing systems.
Similarly, the positive selection principle reduces the algorithm by one step: instead of matching a new entity against a constructed “non-self” entity set, it matches the entity against the pre-existing “self” set and rejects the entity if no matches are found. D’haeseleer in [12] posits that negative selection has the properties of a thriving immune system, requiring no prior knowledge of intrusions. This is because it is, at its core, a general anomaly detection method. Negative selection is self-learning because its set of detectors evolves naturally; when obsolete detectors die, new detectors are obtained from the current event traffic. Dasgupta, cited in [12], argued that negative and positive selection produce comparable results despite their fundamentally different approaches. Both approaches raise the alarm when an unknown entity infiltrates the system.
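The difference between the two principles can be illustrated with a minimal sketch. It is not the method of [12]: the bit-string encoding of entities, the position-agreement similarity test, and all function names are assumptions chosen only for illustration.

```python
import random

def matches(a, b, threshold):
    """Assumed similarity test: two equal-length bit strings match when
    they agree in at least `threshold` positions."""
    return sum(x == y for x, y in zip(a, b)) >= threshold

def train_negative_detectors(self_set, n_detectors, length, threshold, seed=0):
    """Negative selection: generate random candidates and keep only those
    that fail the similarity test against every "self" entity."""
    rng = random.Random(seed)
    detectors = []
    while len(detectors) < n_detectors:
        candidate = "".join(rng.choice("01") for _ in range(length))
        if not any(matches(candidate, s, threshold) for s in self_set):
            detectors.append(candidate)
    return detectors

def negative_selection_rejects(entity, detectors, threshold):
    """Reject an entity that matches any constructed "non-self" detector."""
    return any(matches(entity, d, threshold) for d in detectors)

def positive_selection_rejects(entity, self_set, threshold):
    """Reject an entity that matches no pre-existing "self" entity
    (no detector set is needed, hence one step fewer)."""
    return not any(matches(entity, s, threshold) for s in self_set)

# Hypothetical "self" profile: two 8-bit event signatures.
SELF_SET = ["00000000", "00001111"]
THRESHOLD = 7
detectors = train_negative_detectors(SELF_SET, n_detectors=5, length=8,
                                     threshold=THRESHOLD)
```

In this sketch a "self" signature is accepted by both schemes, while any entity equal to one of the generated detectors is rejected by both, which mirrors the observation that the two approaches tend to raise the alarm on the same unknown entities.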

2.2. Danger Theory

Danger theory is the approach in which immune responses are triggered by danger signals rather than merely by the presence of “self” or “non-self” objects. Entities, whether classified by negative or positive selection, are allowed until signs indicate that they pose a threat. For example, as shown in Figure 3, within the immune system (on which this theory is modelled), if a harmful activity is detected, the immune response is triggered, attacking either all foreign entities or only local ones, depending on the severity of the danger signal. As noted in [12], Burges et al. (1998) were among the first to propose the use of biologically inspired danger theory to detect and react to harmful activity in computer systems. Danger theory establishes the link between artificial immune systems and intrusion detection systems. Matzinger in [5] argued that danger theory is based on the concept that the immune system does not strictly differentiate between self and non-self but between events that have the potential to cause damage and events that do not. Once the system understands itself, it can extend its pattern recognition capabilities and respond to dangerous circumstances.
Figure 3.
Harnessing the power of danger theory to optimise self-healing systems.
An intrusion detection self-healing system based on danger theory, in which an anomaly score is calculated for every event in the system, was proposed by [12]. Each event has three computed values: event type (ET), anomaly value (AV), and danger value (DV). The ET is based on predefined types or automated event clustering. The AV defines how abnormal the event is, based on “non-self” computations. The DV increases when any strange or potentially dangerous signal is associated with an event. These three central event values are combined to calculate the total threat value (TV). The TV is the perceived potential of a particular event to cause damage or to be a constituent part of a sequence of events that can cause a system failure. Three main system flows originate from dangerous events [12]:
  • New event analysis: When a new event is detected, it should be added to the timeline, and the dangerous pattern should be checked;
  • Danger signal processing: When a danger signal is detected, the system must decide if any pattern can be related to the danger signal and then act accordingly;
  • Warning signal processing: When a warning arrives from other hosts that carry information about a danger signal and related dangerous sequence of events, a host’s timeline should be checked to verify that it does not have a similar dangerous sequence of events.
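The event values and the three flows above can be sketched as follows. The source does not state how ET, AV, and DV are combined into TV, so the weighted sum used here, the per-type weights, the threshold, and the class and method names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed per-event-type weights; [12] does not specify how ET enters TV.
TYPE_WEIGHT = {"login": 1.0, "file_write": 1.5, "net_conn": 1.2}

@dataclass
class Event:
    event_type: str             # ET
    anomaly_value: float        # AV
    danger_value: float = 0.0   # DV, raised when a danger signal arrives

    def total_value(self):
        """TV under the assumed rule: type weight * (AV + DV)."""
        weight = TYPE_WEIGHT.get(self.event_type, 1.0)
        return weight * (self.anomaly_value + self.danger_value)

class Host:
    def __init__(self, threshold=1.5):
        self.timeline = []
        self.dangerous_patterns = set()
        self.threshold = threshold

    def on_new_event(self, event):
        """New event analysis: append to the timeline, then check whether
        the most recent events form a known dangerous pattern."""
        self.timeline.append(event)
        types = [e.event_type for e in self.timeline]
        return any(tuple(types[-len(p):]) == p for p in self.dangerous_patterns)

    def on_danger_signal(self, window=3, boost=0.5):
        """Danger signal processing: raise the DV of recent events and
        record their sequence as dangerous if any TV crosses the threshold."""
        recent = self.timeline[-window:]
        for e in recent:
            e.danger_value += boost
        if recent and max(e.total_value() for e in recent) >= self.threshold:
            pattern = tuple(e.event_type for e in recent)
            self.dangerous_patterns.add(pattern)
            return pattern
        return None

    def on_warning(self, pattern):
        """Warning signal processing: store a pattern reported by another
        host and check whether the local timeline already contains it."""
        self.dangerous_patterns.add(pattern)
        types = [e.event_type for e in self.timeline]
        n = len(pattern)
        return any(tuple(types[i:i + n]) == pattern
                   for i in range(len(types) - n + 1))

host_a = Host()
for et, av in [("login", 0.2), ("file_write", 0.7), ("net_conn", 0.4)]:
    host_a.on_new_event(Event(et, av))
learned = host_a.on_danger_signal()       # host A learns a dangerous sequence

host_b = Host()
host_b.on_new_event(Event("login", 0.1))
already_seen = host_b.on_warning(learned)  # host B has not seen it yet
host_b.on_new_event(Event("file_write", 0.1))
alarm = host_b.on_new_event(Event("net_conn", 0.1))  # sequence now completes
```

Here host B raises an alarm on events that look unremarkable in isolation, because the sequence was flagged as dangerous elsewhere; this is the behaviour the warning flow is meant to provide.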

2.3. Holistic Self-Healing Theory

The holistic self-healing theory is a holism principle that reinforces complex systems’ resilience. Improving the resilience of one part of the system can introduce fragility in another, because enhancing one aspect of the system’s resilience may inadvertently compromise the stability of another element. In mobile network management, for instance, Ref. [10] argued that this approach, as depicted in Figure 4, means that different management domains and levels are not considered in isolation. Although the management domains may operate on different time scales and on different managed objects, they need to be aware of the threat events that occur in each segment of the whole in order to react to the danger and trigger appropriate remedial action. Effective communication between the various subdomains of the system allows danger theory to be applied to protect the overall design as a single entity.
Figure 4.
A holistic approach to maintenance and repair of the self-healing system.
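One way to picture the cross-domain awareness described in this subsection is a shared channel on which each management domain announces the threats it detects. The sketch below is only an illustration; the bus, the domain names, and the threat labels are hypothetical and do not come from the cited work.

```python
class ThreatBus:
    """Hypothetical shared channel connecting management domains."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, domain):
        self.subscribers.append(domain)

    def publish(self, source, threat):
        # Forward the threat to every domain except the one that raised it.
        for domain in self.subscribers:
            if domain is not source:
                domain.on_remote_threat(source.name, threat)

class ManagementDomain:
    def __init__(self, name, bus, reacts_to):
        self.name = name
        self.bus = bus
        self.reacts_to = set(reacts_to)  # threats this domain cares about
        self.actions = []
        bus.subscribe(self)

    def detect(self, threat):
        """Handle a locally detected threat and inform the other domains."""
        self.actions.append(f"local remediation for {threat}")
        self.bus.publish(self, threat)

    def on_remote_threat(self, origin, threat):
        """React to a threat seen elsewhere only if it is relevant here."""
        if threat in self.reacts_to:
            self.actions.append(f"precaution for {threat} reported by {origin}")

bus = ThreatBus()
radio = ManagementDomain("radio", bus, reacts_to={"jamming"})
core = ManagementDomain("core", bus, reacts_to={"jamming", "ddos"})
transport = ManagementDomain("transport", bus, reacts_to={"ddos"})
radio.detect("jamming")
```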

3. Self-Healing for Cyber-Physical Systems

Alhomoud described a self-healing system in [12] as a resilient system that can carry on its normal functions even when under attack. A self-healing system is equipped with measures to identify and prevent attacks arising from internal or external events and to facilitate the system’s recovery autonomously. A system equipped with self-healing functionality monitors its environment by constructing a pattern of the sequence of events and using that pattern to detect anomalies before the remedial functions that correct or eliminate the anomalous events are deployed. Only when this autonomous remediation of attacks has been successfully achieved can the system be described as having demonstrated self-healing functionality. The main characteristic of a self-healing or self-organising system is the ability to react to problems through self-adaptive principles, which is shown in [13] using a platform termed PREMiuM. The system must be able to distinguish attacks from everyday activities and take remedial actions to mitigate their impact. The PREMiuM platform proposed in [13] is designed to realise self-healing functionality in manufacturing systems, focusing on increasing efficiency during manufacturing processes. The platform consists of a top-level architecture of several services, i.e., interactive, self-healing, proactive, communication, modelling, and security services. These services, which are independent of each other, are deployed to achieve predictive maintenance of manufacturing systems. The self-healing service can detect or predict failures in the system in furtherance of the self-healing and self-adaptive functionality. The intrusion detection system (IDS) proposed by [12] is based on anomaly attack detection, in which the IDS monitors the system’s environment, constructs a pattern of events, and then uses the pattern to detect anomalies in the system, sometimes called outliers.
When the intrusion detection system detects outliers, it triggers the system’s defence mechanisms. It then notifies the other parts of the system and/or the system administrators of the anomalies that have been detected. In a similar self-healing approach, Ref. [1] proposed using machine-learning (ML) algorithms to implement an IDS in smart grid construction by integrating traditional power grid strategy with the computer network. It then established a distribution fault-solving strategy library, which made the grid self-adaptive. The method abstracted the power grid into an integrated domain with the cyber–physical system through data sharing, and the grid state in each of the system’s nodes corresponds to a twin matrix, making the grid fully modelled and digitised. In the event of failure, the grid therefore uses the fault-solving strategy library to correct itself using the functions of the distribution network. The critical characteristics of self-healing are reliability, fault tolerance, and flexibility. These characteristics are demonstrated in the principles of self-adaptive systems research in [14], in which self-healing forms part of the fundamental principles, which also encompass self-protection, self-configuration, and self-optimisation.
RADAR, a self-healing resource management technique, was evaluated by [15] using a toolkit called CloudSim, and the experimental results are promising, with a fault detection rate 16.88% higher than that of state-of-the-art management techniques. Resource utilisation increased by 8.79%; throughput increased by 14.50%; availability increased by 5.96%; reliability increased by 11.23%; resource contention decreased by 6.64%; SLA breaches of QoS decreased by 14.50%; energy consumption decreased by 9.73%; waiting time decreased by 19.75%; turnaround time decreased by 17.45%; and execution time decreased by 5.83%. The critical contributions of RADAR are listed in [15]:
  • Provides self-configuration of resources by reinstalling newer versions of obsolete dependencies of the system’s software and manages errors through self-healing;
  • Automatically schedules resource provisioning and optimises QoS without the need for human intervention;
  • Provides algorithms for four-phased approaches of monitoring, analysis, planning, and execution of the QoS values. These four phases are triggered through corresponding alerts to aid the preservation of the system’s efficiency;
  • Reduces the breach of service level agreement (SLA) and increases the QoS expectation of the user by improving the availability and reliability of services.
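The four-phase cycle of monitoring, analysis, planning, and execution mentioned in the list above can be sketched as a simple control loop. This is not RADAR's implementation: the metric names, SLA thresholds, playbook, and the effect of each action on the node are hypothetical and deliberately simplified.

```python
# Hypothetical SLA limits and alert-to-action playbook.
SLA = {"max_response_ms": 250, "max_cpu_util": 0.8}
PLAYBOOK = {"fault": "restart_service",
            "sla_risk": "add_replica",
            "overload": "add_replica"}

def monitor(node):
    """Monitoring: collect the QoS values of interest from the node."""
    return {k: node[k] for k in ("healthy", "response_ms", "cpu_util")}

def analyse(metrics, sla):
    """Analysis: turn metric violations into alerts."""
    alerts = []
    if not metrics["healthy"]:
        alerts.append("fault")
    if metrics["response_ms"] > sla["max_response_ms"]:
        alerts.append("sla_risk")
    if metrics["cpu_util"] > sla["max_cpu_util"]:
        alerts.append("overload")
    return alerts

def plan(alerts):
    """Planning: map alerts to actions, dropping duplicate actions."""
    actions = []
    for alert in alerts:
        action = PLAYBOOK[alert]
        if action not in actions:
            actions.append(action)
    return actions

def execute(node, actions):
    """Execution: apply each action to the (simulated) node."""
    for action in actions:
        if action == "restart_service":
            node["healthy"] = True
        elif action == "add_replica":
            old = node["replicas"]
            node["replicas"] = old + 1
            # Assume load spreads evenly across replicas.
            node["cpu_util"] *= old / node["replicas"]
            node["response_ms"] *= old / node["replicas"]
    return node

node = {"healthy": False, "response_ms": 400.0, "cpu_util": 0.9, "replicas": 1}
first_actions = plan(analyse(monitor(node), SLA))
execute(node, first_actions)
second_alerts = analyse(monitor(node), SLA)  # the next cycle finds nothing
```

Each phase is triggered by the output of the previous one, so a cycle that raises no alerts plans no actions and leaves the node untouched.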
A prominent issue in current research is the ability of systems to intelligently identify “zero-day” or never-before-seen anomaly events. The proposal presented by [16] suggests utilising a knowledge-based algorithm to construct an intrusion detection system (IDS) that effectively prevents power grid fault line intrusion. Experiments were conducted within a testbed of a six-bus mesh network modelled to identify fault events within the system and concurrently perform mitigating actions, which [16] proposed as a novel protocol referred to as autonomous isolation strategies. The strategies involve rerouting power flow displacements within the power grid once a threat intrusion is detected. Simulations during the experiment were conducted using Power World Simulator, MATLAB, and SimAuto (a fault detection platform). As noted in [16], the experimental results show that MATLAB extracts the network parameters, and self-healing strategies are then triggered by rerouting network processes to other distribution areas, providing stability to the system. The self-healing approach, started by the knowledge-based algorithm, continues concurrently until all the overloaded lines on the grid are cleared and all effects of the threat are eradicated. The guidelines of supervised learning for the knowledge-based algorithm, as related to the electric power network, are listed in [16] as having the following characteristics:
  • Detection of overloaded transmission lines in the power network;
  • Identification of buses that have overloaded transmission lines connected to them;
  • Identification of the busbar that has the highest reserve capacity and that can then serve as a viable option for a power restoration strategy;
  • Identification of the nearest distribution generator to the overloaded transmission line;
  • Identification of the termination point of the overloaded lines;
  • Establishment of line connection using the references of the reserve busbar index.
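The six steps listed above can be walked through on a toy network. The sketch below is not the protocol of [16]: the six-bus ring topology, the flow and reserve figures, the use of hop count as the distance measure, and the choice of the receiving bus as the termination point are all assumptions for illustration.

```python
from collections import deque

# Hypothetical six-bus ring: line loadings, bus reserves, and DG locations.
LINES = [
    {"id": "L1", "from": 1, "to": 2, "flow_mw": 120, "limit_mw": 100},
    {"id": "L2", "from": 2, "to": 3, "flow_mw": 60, "limit_mw": 100},
    {"id": "L3", "from": 3, "to": 4, "flow_mw": 40, "limit_mw": 80},
    {"id": "L4", "from": 4, "to": 5, "flow_mw": 30, "limit_mw": 80},
    {"id": "L5", "from": 5, "to": 6, "flow_mw": 20, "limit_mw": 80},
    {"id": "L6", "from": 6, "to": 1, "flow_mw": 50, "limit_mw": 100},
]
RESERVE_MW = {1: 5, 2: 10, 3: 40, 4: 25, 5: 60, 6: 15}
DG_BUSES = {"DG_A": 3, "DG_B": 5}

def hop_distance(adj, start, goal):
    """Breadth-first search hop count between two buses."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        bus, dist = queue.popleft()
        if bus == goal:
            return dist
        for nxt in adj[bus]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def plan_restoration(lines, reserve_mw, dg_buses):
    adj = {}
    for ln in lines:
        adj.setdefault(ln["from"], set()).add(ln["to"])
        adj.setdefault(ln["to"], set()).add(ln["from"])
    # Step 1: detect overloaded transmission lines.
    overloaded = [ln for ln in lines if ln["flow_mw"] > ln["limit_mw"]]
    # Step 2: identify buses connected to overloaded lines.
    affected = {b for ln in overloaded for b in (ln["from"], ln["to"])}
    # Step 3: unaffected busbar with the highest reserve capacity.
    spare = {b: r for b, r in reserve_mw.items() if b not in affected}
    reserve_bus = max(spare, key=spare.get) if spare else None
    plans = []
    for ln in overloaded:
        # Step 4: nearest distributed generator to either end of the line.
        nearest = min(
            sorted(dg_buses),
            key=lambda g: min(hop_distance(adj, ln["from"], dg_buses[g]),
                              hop_distance(adj, ln["to"], dg_buses[g])),
            default=None,
        )
        # Step 5: termination point (assumed to be the receiving bus).
        termination = ln["to"]
        # Step 6: tie line from the reserve busbar to the termination point.
        plans.append({"line": ln["id"], "reserve_bus": reserve_bus,
                      "nearest_dg": nearest, "termination_bus": termination,
                      "tie_line": (reserve_bus, termination)})
    return plans

plans = plan_restoration(LINES, RESERVE_MW, DG_BUSES)
```

A real implementation would recompute the power flow after each new connection and repeat until no line remains overloaded; this sketch stops at producing the restoration plan.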
Other anomaly detection systems for network diagnosis have been proposed in previous studies, such as ARCD by [17], which uses data logs collected from large-scale monitoring systems to identify root causes of problems in a cellular network. An experiment by [17] found that ARCD achieved rates above 90% for both anomaly detection accuracy and detection rate. The drive towards automatic diagnosis of system failures in mobile cellular networks is propelled by the industry’s need for efficient means of identifying problems within the network. Interestingly, Ref. [18] noted that mobile network operators spend a quarter of their revenue on network maintenance and that a move towards maintenance automation will drive down costs. A solution that relies on random forest (RF), convolutional neural network (CNN), and neuromorphic deep-learning modules to perform fault diagnostics was proposed by [18]. The proposal uses a reference signal received power (RSRP) map of fault-generated images to provide an AI-based fault diagnostic solution. The impact of fault diagnostic solutions is noticeable in reducing costs and improving the end user’s overall quality of service (QoS). Experiments in [18] show that the proposed system could identify all the faults fed through the image datasets.
Similarly, a system that is resilient to intrusions and built using Python-based libraries, software-defined networks, and virtual machine composition was proposed by [3]. The system is called Shar-Net and was tested in a smart grid environment. The experimental results demonstrate a viable IDS that can prevent cyber-attacks and, at the same time, mitigate the effects of attacks through automatic reconfiguration of the system network. The principal areas covered in the proposed system are the intrusion detection system (IDS), intrusion mitigation system (IMS), and alert management system (AMS). Zolli and Healy describe the resilience of a system in [10] as the ability of the system to recover from failure or attack. This description is quite different from that of a robust system. A robust system is a system that is built to withstand unforeseen threats. Although the two terms describing the core functionality of a self-healing-capable system might be used interchangeably, it is essential to note the difference between them. A robust system relies on threats that have been previously seen, which allows its designers to build countermeasures to such threats proactively. On the other hand, a resilient system reacts retroactively to unforeseen or “zero-day” attacks and applies countermeasures accordingly. The authors of [10] listed the following principles of a resilient system:
  • Monitoring and adaptation: It must be responsive to unforeseen attacks;
  • Redundancy, decoupling, and modularity: It must have a decentralised structure to prevent threats from spreading to the other constituent parts of the network or the system’s host;
  • Focusing: The system must be able to focus resources where they are most needed to prevent the overuse of resources, which may be counterproductive to the task of shoring up the system’s resilience;
  • Diverse at the edge and simple at the core: The system should be able to utilise shared protocols through simply defined processes, but it should also retain an element of diversity to circumvent widespread attack threats.
Self-healing functions can be implemented in four stages (Figure 5). These include profiling the system’s normal states, detecting the system’s deviation from its normal state, diagnosing the system’s failures, and taking corrective actions to mitigate the impact of the system’s failure. The choice of profiling algorithm for a self-healing system depends on the scope of the design requirements and on further considerations such as:
  • The system’s architecture;
  • The available datasets;
  • Profile scope;
  • Profile features;
  • Feature distribution or subset;
  • Understandability.
Figure 5.
Four stages of a self-healing system: implementation and functions.
The illustration of anomaly detection and diagnosis for radio access networks (RANs) (Figure 6) shows the profiling, detection, and diagnosis of anomaly events related to RANs. The self-healing function is implemented according to a selected event context, such as the occurrence of a threat event. The key performance indicators (KPIs) are calculated when a profile is created in a time-series format. Anomaly events with unique characteristics are detected based on their anomaly level. The diagnosis function then analyses the detected anomaly occurrences and identifies their root causes to ascertain whether corrective measures are required to lessen the potential threats; the corrective workflow is triggered if needed [10].
Figure 6.
Anomaly detection for radio access.
The major problem that affects the optimal performance of the smart grid network today is the occurrence of system failures caused by multifaceted fault areas, such as system overload, system intrusion, and system misconfiguration, among others [1]. Such failures within the smart grid can cause significant economic setbacks, with consequences that sometimes negatively impact human livelihoods or quality of life. To mitigate the problem of the system’s inability to self-heal after failures, Ref. [1] proposed using a fault-solving strategy library based on a twin model system and a machine learning (ML) algorithm to implement a self-healing mechanism in a smart grid. The dataset derived from the fault-solving library is fed into the algorithm to detect anomalies within the system. The self-healing function is then triggered once the classification process is completed and a viable mitigation solution is found. Self-healing methods can be helpful in a variety of contexts where uptime, reliability, and performance are critical, such as:
  • Manufacturing: In a manufacturing environment, production lines and equipment must always be operational and available to ensure maximum output. Self-healing mechanisms can detect and respond to faults or failures automatically, thereby minimising downtime and reducing the need for manual intervention;
  • Transportation: Transportation systems, such as trains, planes, and automobiles, rely on sensors and other technology to monitor and control their operations. Self-healing mechanisms can detect faults or failures and take corrective action to ensure the system’s safety;
  • Power grids: Power grids are critical infrastructure that must always be operational to ensure reliable access to electricity. Self-healing mechanisms can detect and respond to faults or failures, preventing cascading failures and reducing the impact of outages;
  • Healthcare: Healthcare systems rely on technology to monitor and provide critical care. Self-healing mechanisms can ensure that these systems are always operational, minimising the risk of disruption that could compromise patient safety;
  • Internet of Things (IoT): IoT devices are becoming increasingly common in homes, businesses, and public spaces. Self-healing mechanisms can detect and respond to faults or failures, ensuring these devices remain operational and connected to the Internet.
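The four stages of self-healing described in this section (profiling, detection, diagnosis, and corrective action) can be sketched on KPI time series such as those used for RAN anomaly detection. This is a minimal illustration, not the workflow of the cited studies: the z-score anomaly level, the threshold of 3, the KPI names, and the cause and action tables are assumptions.

```python
from statistics import mean, stdev

# Hypothetical lookup tables linking KPIs to root causes and remedies.
ROOT_CAUSES = {"drop_rate": "cell outage",
               "latency_ms": "backhaul congestion"}
CORRECTIVE_ACTIONS = {"cell outage": "restart cell and rebalance neighbours",
                      "backhaul congestion": "reroute traffic"}

def build_profile(history):
    """Stage 1, profiling: summarise each KPI's normal state as
    (mean, standard deviation) over a historical time series."""
    return {kpi: (mean(vals), stdev(vals)) for kpi, vals in history.items()}

def anomaly_levels(profile, sample):
    """Stage 2, detection: anomaly level of each KPI as an absolute z-score."""
    levels = {}
    for kpi, value in sample.items():
        mu, sigma = profile[kpi]
        levels[kpi] = abs(value - mu) / sigma if sigma > 0 else 0.0
    return levels

def diagnose(levels, threshold=3.0):
    """Stage 3, diagnosis: attribute the anomaly to the most deviant KPI,
    or return None when nothing exceeds the threshold."""
    kpi, level = max(levels.items(), key=lambda kv: kv[1])
    if level < threshold:
        return None
    return ROOT_CAUSES.get(kpi, "unknown")

def heal(profile, sample, threshold=3.0):
    """Stage 4, corrective action: choose a remedy only if one is needed."""
    cause = diagnose(anomaly_levels(profile, sample), threshold)
    if cause is None:
        return None
    return CORRECTIVE_ACTIONS.get(cause, "escalate to operator")

history = {"drop_rate": [0.010, 0.012, 0.011, 0.009, 0.010],
           "latency_ms": [20, 22, 21, 19, 20]}
profile = build_profile(history)
normal_action = heal(profile, {"drop_rate": 0.011, "latency_ms": 21})
fault_action = heal(profile, {"drop_rate": 0.080, "latency_ms": 24})
```

The normal sample stays within the profile and triggers nothing, whereas the faulty sample is attributed to the KPI with the largest deviation and mapped to a corrective workflow.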

References

  1. Li, J.; Li, H. Cyber-Physical Systems: A Comprehensive Review. IEEE Access 2021, 9, 112003–112033.
  2. El Fallah Seghrouchni, A.; Beynier, A.; Gleizes, M.P.; Glize, P. A review on self-healing systems: Approaches, properties, and evaluation. Eng. Appl. Artif. Intell. 2021, 99, 104220.
  3. Hahsler, M.; Piekenbrock, M.; Thiel, S.; Kuhn, R. Review of Cyber-Physical Systems for Autonomous Driving: Approaches, Challenges, and Tools. Sensors 2021, 21, 1577.
  4. Subashini, S.; Kavitha, V. A survey on security issues in cyber-physical systems. J. Netw. Comput. Appl. 2016, 68, 1–22.
  5. Sejdić, E.; Djouani, K.; Mouftah, H.T. A Survey on Fault Diagnosis in Cyber-Physical Systems. ACM Comput. Surv. 2020, 53, 1–36.
  6. Zhang, J.; Yang, J.; Sun, H. Security and Privacy in Cyber-Physical Systems: A Survey. ACM Trans. Cyber-Phys. Syst. 2019, 3, 1027–1070.
  7. Samuel, S.R.; Madria, S.K. Cyber-Physical Systems Security: A Survey. J. Netw. Comput. Appl. 2020, 150, 102520.
  8. Mahdavinejad, M.; Al-Fuqaha, A.; Oh, S. Cyber-Physical Systems: A Survey. J. Syst. Archit. 2018, 90, 60–91.
  9. Omar, T.; Ketseoglou, T.; Naffaa, O.; Marzvanyan, A.; Carr, C. A Precoding Real-Time Buffer Based Self-Healing Solution for 5G Networks. J. Comput. Commun. 2021, 9, 1–23.
  10. Schneider, K.P.; Laval, S.; Hansen, J.; Melton, R.B.; Ponder, L.; Fox, L.; Hart, J.; Hambrick, J.; Buckner, M.; Baggu, M.; et al. A Distributed Power System Control Architecture for Improved Distribution System Resiliency. IEEE Access 2019, 7, 9957–9970.
  11. Cai, W.; Yu, L.; Yang, D.; Zheng, Y. Research on Risk Assessment and Strategy Dynamic Attack and Defence Game Based on Twin Model of power distribution network. In Proceedings of the 7th Annual IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems, Honolulu, HI, USA, 31 July–4 August 2017; pp. 684–689.
  12. Degeler, V.; French, R.; Jones, K. Self-Healing Intrusion Detection System Concept. In Proceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigdataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), New York, NY, USA, 9–10 April 2016; pp. 351–356.
  13. Stojanovic, L.; Stojanovic, N. PREMIuM: Big Data Platform for Enabling Self-Healing Manufacturing. In Proceedings of the 2017 International Conference on Engineering, Technology, and Innovation (ICE/ITMC), Madeira Island, Portugal, 27–29 June 2017; pp. 1501–1508.
  14. Ahmad, M.; Samiullah, M.; Pirzada, M.J.; Fahad, M. Using ML in Designing Self-Healing OS. In Proceedings of the Sixth International Conference on Innovative Computing Technology (INTECH 2016), Dublin, Ireland, 24–26 August 2016; pp. 667–671.
  15. Gill, S.S.; Chana, I.; Singh, M.; Buyya, R. RADAR: Self-Configuring and Self-Healing in Resource Management for Enhancing Quality of Cloud Services. J. Concurr. Comput. Exp. 2016, 31, 1–29.
  16. Muhammad, B.M.S.R.; Raj, S.; Logenthiran, T.; Naayagi, R.T.; Woo, W.L. Self-Healing Network Instigated by Distributed Energy Resources. In Proceedings of the 2017 IEEE PES Asia-Pacific Power and Energy Engineering Conference (APPEEC), Bengaluru, India, 8–10 November 2017; pp. 1–6.
  17. Mdini, M.; Simon, G.; Blanc, A.; Lecoeuvre, J. Introducing an Unsupervised Automated Solution for Root Cause Diagnosis in Mobile Networks. IEEE Trans. Netw. Serv. Manag. 2020, 17, 547–561.
  18. Bothe, S.; Masood, U.; Farooq, H.; Imran, A. Neuromorphic AI Empowered Root Cause Analysis of Faults in Engineering Networks. In Proceedings of the 2020 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Odessa, Ukraine, 26–29 May 2020; pp. 1–6.