The objective of this section is to define a taxonomy of Fault Tolerance tasks that helps categorize the identified papers. The tasks are based on the more general Fault Tolerance principles from References [18][19]. The accompanying figure shows the taxonomy of Fault Tolerance tasks applicable in USNs and how they affect each other. While the design and initial deployment of USNs contribute to Fault Prevention and Prediction, run-time data collection techniques also contribute to the Fault Detection and Fault Recovery stages of the system; all of these are discussed in this paper.
The overview of fault tolerant techniques presented in the following section follows the above-described taxonomy.
4. Overview of Techniques by Fault Tolerance Tasks
4.1. Fault Prevention and Prediction
Fault prevention and prediction in sensor networks are dependent on the architectural design of the system and the initial deployment method of the sensor network. These will be discussed in the following subsections. In addition, data collection in USNs and testing frameworks for UWSNs are presented.
4.1.1. Design of the Sensor Network
In Wireless Sensor Networks (WSN), instead of a centralized homogeneous topology, dividing nodes into clusters is an energy-efficient and resilient method [20], where dedicated cluster head nodes may have more energy and communication capabilities to effectively act as mediators between regular nodes and data sinks.
To overcome the issues caused by the varying environmental challenges of Underwater Wireless Sensor Networks (UWSN), nature-inspired algorithms may be utilized. For instance, clustering and routing can be done utilizing the Cuckoo Search algorithm and Particle Swarm Optimization [21], which have behaved more resiliently in underwater conditions than the more common terrestrial Low Energy Adaptive Clustering Hierarchy (LEACH) protocol [22]. Pressure measurements have been used for UWSN routing [23] with floating depth-controlling sensors. Fault Management tasks can also be distributed across the whole network. In a WSN with enough spare nodes, an energy-efficient grid can be formed [24] in which the nodes selected as node manager, gateway, and sensing nodes change over time and the spare nodes are put to sleep. This results in an energy-efficient and lightweight network but requires excess nodes.
However, existing UWSN protocols have not yet been adequately compared in underwater field trials [25].
4.1.2. Sensor Network Deployment
Sensor network deployment techniques are important for WSNs, where deployment may directly affect the nodes’ locations and network availability. Even for terrestrial wireless sensor networks, an adaptable deployment method is essential to obtain satisfactory network performance [26]. For redundancy reasons, sensor placement for WSNs usually utilizes more sensors than the minimum required number [27]. The deployment costs and energy efficiency of WSNs have been investigated in Reference [28], and it has been found that there is no single solution that can easily be applied in practice [29].
Wired sensor network deployment is less researched, possibly because wired sensor networks’ node deployment locations are limited by the cables, their locations are more predetermined, and node connectivity is not directly related to the location.
4.1.3. Data Collection
Sensor networks tend to have limited network bandwidth, energy, and storage capabilities. Thus, filtering and aggregating sensor information may be a way to stay within those constraints. Raw sensor data near the source can be divided into informative, non-informative, and outlier groups [30], so that only the needed data is communicated or stored. Outlier data may result from noise, failures, disturbances, etc., and may be useful for Fault Tolerance purposes.
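The three-way split described above can be illustrated with a minimal sketch. It is not the method of Reference [30]; it assumes a median-absolute-deviation test for outliers and a simple dead-band against the last reported value for non-informative readings. The function name `classify_readings` and the parameters `deadband` and `outlier_k` are hypothetical.

```python
from statistics import median


def classify_readings(readings, deadband=0.5, outlier_k=5.0):
    """Label each reading as 'informative', 'non-informative', or 'outlier'.

    Assumptions (not taken from the surveyed papers):
    - a reading is an outlier if it lies more than outlier_k median
      absolute deviations (MAD) from the window median;
    - a reading is non-informative if it differs from the last
      reported (informative) value by less than deadband.
    """
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    # Avoid division by zero when the whole window is constant.
    spread = mad if mad > 0 else 1e-9

    labels = []
    last_reported = None
    for r in readings:
        if abs(r - med) / spread > outlier_k:
            labels.append("outlier")
        elif last_reported is not None and abs(r - last_reported) < deadband:
            labels.append("non-informative")
        else:
            labels.append("informative")
            last_reported = r
    return labels


window = [10.0, 10.1, 10.2, 11.0, 55.0, 11.1, 12.0]
labels = classify_readings(window)
```

In such a scheme only the readings labeled informative would be transmitted, while outliers could be forwarded separately to a Fault Detection component.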
Different techniques to compress and aggregate collected information in UWSNs are investigated in Reference [31]. It was found that aggregation is justified and that cluster-based aggregation techniques perform better than non-cluster-based ones. For instance, a technique for switching from the cluster head (CH) to a backup cluster head (BCH) was proposed [32] for cluster-based UWSNs.
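The idea of switching from a cluster head to a backup cluster head can be sketched as follows. This is only an illustration under stated assumptions, not the protocol of Reference [32]: the failover trigger (head dead or below an energy threshold), the rule for choosing the next backup (the live member with the most energy), and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    energy: float          # remaining energy, arbitrary units
    alive: bool = True


@dataclass
class Cluster:
    head: Node
    backup: Optional[Node]
    members: List[Node] = field(default_factory=list)

    def check_failover(self, min_energy: float = 10.0) -> bool:
        """Promote the backup if the head is dead or nearly depleted.

        Returns True when a switch happened, False otherwise.
        """
        head_ok = self.head.alive and self.head.energy >= min_energy
        if head_ok or self.backup is None or not self.backup.alive:
            return False

        old_head = self.head
        self.head = self.backup
        # A depleted but still alive head keeps working as a regular member.
        if old_head.alive:
            self.members.append(old_head)
        # Assumed policy: the live member with the most energy becomes the new backup.
        candidates = [n for n in self.members if n.alive]
        self.backup = max(candidates, key=lambda n: n.energy, default=None)
        if self.backup is not None:
            self.members.remove(self.backup)
        return True


cluster = Cluster(
    head=Node("ch", energy=5.0),
    backup=Node("bch", energy=80.0),
    members=[Node("a", energy=60.0), Node("b", energy=40.0)],
)
switched = cluster.check_failover()
```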
Moreover, security challenges need to be addressed. One way to minimize the risk of data tampering and/or interference is to ensure that the data is processed locally or, if that is not possible, communicated with end-to-end encryption [33].
4.1.4. UWSN Testing Frameworks
Wireless networking protocols are one of the key research areas in UWSNs. To evaluate the implementation of underwater wireless protocols, simulation is often used. Due to the specifics of underwater environments (see Section 3), generic simulation environments are not able to capture some of the relevant features. The frameworks covered in this section are useful for the simulation and evaluation of underwater acoustic protocols.
Frameworks that allow simulation, emulation, and testing of sensor networks, such as DESERT versions 1 and 2 [34] and SUNSET [35], have been developed for UWSNs. An analysis conducted in Reference [36] shows that SUNSET represents a more mature, flexible, and robust framework for in-field testing than DESERT. However, DESERT v2 was released after that analysis. For acoustic UWSN security testing, the SecFUN framework [37] has been proposed.
4.2. Fault Detection and Identification
In essence, Fault Detection means determining that one or more bits in the computation differ from their correct value [19]. This can be detected via continuous monitoring of the network and the nodes’ status. Some sources also use the word “Diagnosis” in a broader meaning than just detection and identification. Diagnosis has been defined as “characterizing the system’s state to locate the causes of errors, determine how the system is changing over time, and predict errors before they occur” [19]. The current section covers different techniques to execute the previously mentioned concepts.
A distributed hierarchical fault management scheme [38] has been used for WSNs, where agent Fault Detection devices collect information from the power modules and sensors to determine failure conditions and subsequently diagnose the nature of the detected failure.
At higher abstraction levels, the SNMP protocol [39] has been widely used by the industry for Fault Detection querying and triggering in IP-networked devices. There are multiple commercial tools for generating failures, e.g., Chaos Monkey from Netflix [40], which randomly terminates services in production environments to ensure their resiliency. Such a tool does not manage the occurring faults itself but verifies that the repair mechanisms are in place and operable. The Intelligent Platform Management Interface (IPMI) [41] is an industrial technology specification for hardware system management and monitoring.
Neural-network-based schemes can be used for sensor failure detection, identification, and accommodation, and may tolerate larger deviations of the operating conditions from theoretical models and estimates. A relatively simple and computationally light approach has been presented [42], where a neural network is used as an online learning state estimator for detecting faults. The neural network itself can be built to be fault-tolerant [43], so that failing nodes have the least impact on the result data.
A Situational Awareness approach, which borrows a mechanism from human cognition, can be applied to sensor data interpretation for the Internet of Things (IoT), specifically to the processes of sensation, perception, and cognition. In addition to specification-based and learning-based approaches, a perception-based approach utilizing Fuzzy Formal Concept was proposed [44] for Situational Awareness identification.
The Semantic Sensor Network Ontology has been proposed in Reference [45] for managing interoperability between sensing systems. The Semantic Ground describes information for interoperability and cooperation among agents [46]. To enhance resilience in Semantic Sensor Networks, monitoring nodes may forward observations to association nodes, which develop Situational Awareness by mining association rules, for example, via a nature-inspired Artificial Bee Colony algorithm [46].
Electric power grids need efficient monitoring. Different WSN-based approaches for outage detection, environmental monitoring, and fault diagnostics in power grids are reviewed in Reference [47]. Most of these approaches are also applicable in other kinds of applications.
4.3. Fault Isolation, Masking and Recovery
Subsequent to Fault Detection, Fault Identification, and Fault Diagnosis, a fault handling stage can be entered [38] to prevent further data corruption and system deterioration. Fault handling consists of Fault Isolation, Masking, and Recovery. It can hide the fault occurrence from other components by applying Fault Masking; the key techniques for such masking are informational, time, and physical redundancy [18]. A masking technique proposed for underwater vehicles is Triple Modular Redundancy (TMR) [48], which is also one of the most commonly used Fault Masking techniques. Isolating a faulty component from the others can be facilitated by using virtualization [18]. In large-scale distributed systems, frozen virtual images of healthy services have been used as checkpoints [49] for rolling back in case of a fault occurrence.
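The majority voting that underlies Triple Modular Redundancy can be sketched in a few lines. This is a generic illustration rather than the design of Reference [48]; the function names are hypothetical, and the median variant for analog sensor values is an added assumption, since exact equality rarely holds for noisy measurements.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value reported by at least two of three redundant modules.

    A single faulty module is masked. If all three disagree, no majority
    exists and the fault cannot be masked, so an error is raised.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one module disagrees")
    return value


def tmr_median(a, b, c):
    """Median of three analog readings; one arbitrarily wrong value is ignored."""
    return sorted([a, b, c])[1]


masked = tmr_vote(7, 7, 9)               # module 3 is faulty
depth = tmr_median(12.1, 12.3, 250.0)    # third depth sensor is faulty
```

The voter itself remains a single point of failure, which is why it is usually kept as simple as possible.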
Fault Recovery ensures that the fault does not propagate to visible results, for instance, by rolling back to a previous healthy state (checkpointing) or retrying failed operations (time redundancy). Further Fault Recovery techniques include Reconfiguration, which changes the system’s state so that the same or a similar error is prevented from occurring again, and Adaptation, which re-optimizes the system, for instance, after a Reconfiguration task [19].
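Checkpointing and time redundancy, as described above, can be combined in a small sketch: the state is rolled back to the last healthy checkpoint whenever an operation fails, and the operation is then retried. The class `CheckpointedState`, the function `run_with_recovery`, and the retry limit are hypothetical names and values chosen for illustration.

```python
import copy


class CheckpointedState:
    """Holds a mutable state together with a copy of its last healthy version."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        """Record the current state as the new healthy checkpoint."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(store, operation, max_retries=3):
    """Run operation(state); on failure roll back and retry (time redundancy).

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            operation(store.state)
        except Exception:
            store.rollback()   # undo any partial update before retrying
            continue
        store.commit()
        return attempt
    raise RuntimeError("operation failed after %d attempts" % max_retries)


calls = {"n": 0}


def flaky_append(state):
    """Hypothetical operation that fails once after a partial update."""
    state["samples"].append(42)
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient fault")


store = CheckpointedState({"samples": []})
attempts = run_with_recovery(store, flaky_append)
```

Without the rollback, the partial update from the failed first attempt would remain in the state and the value would be stored twice.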
In Sensor Networks, different approaches for Fault Recovery have been used, which differ in resource overhead, energy efficiency, scalability, and the network types they target. For both network and node Fault Recovery in wireless sensor networks, Mitra et al. (2016) [50] compare techniques such as checkpoint-based recovery (CRAFT), agent-based recovery (ABSR), fault node recovery (FNR), cluster-based and hierarchical fault management (CHFM), and the Failure Node Detection and Recovery algorithm (FNDRA). While some of those are specific to terrestrial wireless usage, some principles (e.g., checkpointing) can also be used in wired and/or underwater environments. To reduce the network bandwidth requirements, checkpoint backups can be migrated to nearby nodes [51] and used for recovering from fault situations.
In network protocols, Fault Masking and Fault Recovery are handled by error control schemes that are commonly categorized into the following three groups [1]:
- Automatic Repeat Request (ARQ): retransmission of corrupted data is requested;
- Forward Error Correction (FEC): data corruption can be detected and corrected by the receiving end; and
- Hybrid ARQ (HARQ): a combination of FEC and ARQ.
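The three error control groups listed above can be illustrated together in a toy sketch. It assumes a rate-1/3 repetition code as the FEC, a CRC-32 as the error detector, and a receiver that already knows the expected CRC; a real frame would carry the CRC inside the transmission. All function names and the `channel` callback are hypothetical.

```python
import zlib


def fec_encode(bits):
    """FEC: repetition code that sends every bit three times."""
    return [b for b in bits for _ in range(3)]


def fec_decode(coded):
    """Majority-vote each group of three; corrects one flipped bit per group."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


def checksum(bits):
    """CRC-32 over the bit list, used only to detect residual errors."""
    return zlib.crc32(bytes(bits))


def harq_send(bits, channel, max_tries=4):
    """HARQ: try FEC first; request a retransmission (ARQ) if the CRC fails.

    channel is a callable that takes the coded bits and returns the
    possibly corrupted bits. Returns (received_bits, attempts).
    """
    expected = checksum(bits)  # assumption: delivered intact to the receiver
    for attempt in range(1, max_tries + 1):
        received = fec_decode(channel(fec_encode(bits)))
        if checksum(received) == expected:
            return received, attempt
    raise RuntimeError("frame lost after %d attempts" % max_tries)


transmissions = {"n": 0}


def noisy_channel(coded):
    """Hypothetical channel: two flips in one group first, then a single flip."""
    transmissions["n"] += 1
    out = list(coded)
    if transmissions["n"] == 1:
        out[0] ^= 1
        out[1] ^= 1      # uncorrectable by the repetition code
    else:
        out[3] ^= 1      # correctable by majority vote
    return out


data = [1, 0, 1, 1]
received, attempts = harq_send(data, noisy_channel)
```

In this run the first transmission exceeds the correction capability of the code, so the CRC mismatch triggers a retransmission; the second transmission contains a single flipped bit that the FEC corrects without another round trip.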
The cross-layer approach benefits Fault Recovery significantly since single-layer redundancy, such as hardware redundancy and application checkpointing, has very high costs, and the latency between fault occurrence and detection makes recovery difficult [19].