Smart Distribution Networks Resilience Quantification

Smart Distribution Networks Resilience Quantification: Comparison

Please note this is a comparison between Version 1 by Youba NAIT BELAID and Version 2 by Peter Tang.

The introduction of pervasive telecommunication devices, in the scope of smart grids (SGs), has accentuated interest in the distribution network, which integrates a huge portion of new grid applications. High impact low probability (HILP) events, such as natural hazards, manmade errors, and cyber-attacks, as well as the inherent fragility of the distribution grid have propelled the development of effective resilience tools and methods for the power distribution network (PDN) to avoid catastrophic infrastructural and economical losses.

resilience
quantification
smart grids
power networks
information and communication networks

1. Introduction

Current information and communication technologies (ICTs) have achieved a high degree of penetration in all critical infrastructure (CI) systems, owing to the ever-increasing capabilities of their services in terms of coverage, throughput capacity, latency, scalability, and privacy ^[1][2][3][4][1,2,3,4]. In power systems, the massive introduction of telecommunication devices accelerated the shift toward smart grids (SGs) [5] that come with a whole new package of functionalities such as automated control, smart sensing and metering, high-power converters, and modern energy management techniques based on the optimization of demand, energy, and network availability [6]. The high-performance smart grid allows thereby for the insertion of new applications in the network like distributed generation, Industrial Internet of Things (IIOT), and electrical vehicles [7]. This comes, however at the expense of increased complexity, which brings new vulnerabilities and broadens the attack surface [8]. Recent extreme events of natural disasters, cyber-attacks, and man-made errors which we refer to as HILP events, have shown that SGs are susceptible to strong disruptions given the large-scale networks they represent, and the attendant interdependencies [9]. Some recent examples are the power disruptions in the US in 2017, caused by hurricanes and wildfires [10], which caused a cumulative damage of $306.2 billion, affecting a total of 47 million people—nearly 15 percent of the nation’s population. For instance, at the peak of hurricane Irma, more than 6.7 million electrical customers were without power [11], and hurricane Maria severely damaged the Puerto Rico power grid leaving 1.5 million people out of power [12]. China’s severe ice storm in 2008 resulted in the service disruption of 2000 power substations and 8500 towers leading to power interruptions in 13 provinces and 170 cities [13], and over 4 million customers went on power outage for over seven days during the Great East Japan Earthquake in 2011 [14]. During the Ukraine power grid cyber-attack in 2015, 30 power substations were turned off, and hundreds of thousands of people were without electricity for a period from 1 to 6 h ^[15][16][15,16].

Events like these reveal the need for strategies that are able to cope with such harsh impacts, especially given that the capacity to operate resiliently against attacks and natural disasters is one of the multiple smart grid attributes [17]. Resilience is defined as the ability to “anticipate, absorb, adapt to and/or rapidly recover from a disruptive event” [18]. In line with this definition, the U.S. Presidential Policy Directives-21(PPD-21) introduces resilience as “the ability to prepare for and adapt to changing conditions and withstand and recover rapidly from disruptions” [19]. This same directive involves the “fail safe” paradigm in system engineering through recommendations for cyber-physical security, while highlighting the shift toward “safe-to-fail” paradigm brought by cyber-physical resilience. Many conceptual frameworks are proposed for understanding and evaluating resilience, where the time dimension is very important, as various facets (anticipation, absorption, robustness, survivability, mitigation, flexibility, adaptability, restoration, and recovery) are linked to different temporal phases that describe system performance during an extreme event ^{[20][21][22][23][24][25]}[20,21,22,23,24,25]. Resilience moves from traditional risk assessment, which relies on probabilistic analysis of likely failures, toward dealing with unexpected events, requiring mitigation and healing strategies. The main difference is that risk assessment aims to achieve situational awareness and diagnosis, while resilience moves one step further by incorporating reactive actions against the contingency and launching restoration operations, which maintain the functionality of most critical loads and/or make them rapidly recoverable [26].

Within the growing literature on power system resilience ^{[27][28][29][30][31]}[27,28,29,30,31], utilities are particularly interesting in quantitative assessments of resilience, which propose relevant indicators to guide cost-benefit studies before planning investments. In this context, multi-dimensional characteristics of resilience are a considerable challenge ^[32][33][34][32,33,34]. Ouyang and Dueñas-Osorio [35] tackled technical, organizational, and social dimensions of resilience, while providing an alternative to evaluate the economic dimension by estimation of economic losses. Only the technical dimension of the power network is widely investigated in the literature [36], which reveals the need to examine all other dimensions for a comprehensive analysis of resilience ^[33][37][38][33,37,38]. Technical and organizational dimensions are the most suitable in the case of power grids as they can be applicable at individual system levels, while social and economic dimensions are better suited for community level (interdependent systems), to which resilience studies should converge in the future [38]. Temporal multi-phase resilience quantification is a well-adopted technique that can embed other dimensions by linking them to technical and organizational dimensions through the implementation of enhancement strategies. Unlike [35], most proposed metrics in literature exclude pre-event and post-recovery phases, suggesting that quantification is conducted for a single scenario and not for a sequence of disruptive events, which corroborates the relevance of resilience for HILP disruptions. Work in [39] introduced resilience-based component importance measures centered in the recovery phase; on the one hand by establishing a ranking for load restoration using optimal repair time, and on the other hand, by quantifying the potential loss in optimal system resilience due to a delay in component repair process (computed through resilience reduction worth metric). Likewise, [40] focuses on the recovery stage of resilience, with the goal of comparing different restoration strategies and selecting an appropriate performance measure. Authors in [41] proposed a multi-phase framework to assess the resilience of the UK power transmission network under a windstorm. The framework considered both infrastructural and operational aspects, introducing four simple metrics to describe the degree and speed of degradation, duration of the disruption, and recovery speed. Grid connectivity and operational metrics can jointly describe the whole span of post-event analysis, and be used for planning short-term mitigation and recovery, or long-term hardening [42]. Resilience strategies to minimize system performance loss can be further analyzed under budget constraint by a tri-level planner-attacker-defender model [43], where a planner optimizes long-term transmission network expansion before an attack hits the system. Short-term switching operations are then applied in reaction. Resilience is quantified using customer demand not supplied, which includes both mitigation and recovery capabilities in the system. Many other optimization models and performance measures are adopted in related studies ^[44][45][44,45].

Given widely stretched power networks, resilience studied at the system level for generation and transmission, does not (or negligibly) include distribution grid components ^{[35][39][40][41][42][43]}[35,39,40,41,42,43]. In 2010, only 15% to 20% of feeders implement distribution automation in the North American grid, one of the most advanced electrical systems [46]. This illustrates that the PDN is the most fragile level of electrical systems due to legacy “blindness” and manual operations along with electromechanical components [47], especially with the fact that an estimated 90% of customer outages in the US are related to this part of the system [48].

The advent of smart grids renewed interest in enhancing the PDN performance [49] as nearly all SG provided abilities of self-healing, high reliability, energy management, and real-time pricing are empowered by technologies introduced at the distribution level such as advanced metering, automation, distributed generation, and distributed storage [50]. ICTs are the main enabler of this new portfolio of applications [7], by transforming a traditionally one-way, limited-control, and radial PDN into a two-way power flow, intelligent, and mesh-networked grid capable of guaranteeing improved service for all connected loads [51]. In this regard, expected high-performance capabilities of smart distribution grid can succeed in coping with most failures in the system ^[49][50][49,50]. The smart PDN remains susceptible to HILP events, or even more prone in some cases, due to increased uncertainty (in events, load, distributed generation, market prices) ^[52][53][54][52,53,54] and strong dependency on telecommunications that widen the attack surface [55] and may cause undesirable cascade effects [56]. Consequently, the resilience of smart PDN becomes a concern from both electric and communication domain perspectives, as a failure in the telecommunication service may affect the electric service [57] and vice-versa [58]. Recent publications recommend a joint handling of smart PDN resilience quantification as the robustness and adaptation ability of a coupled system are even lower than a single system ^{[56][59][60][61]}[56,59,60,61]. However, such an approach needs to build upon a solid understanding of resilience assessment of electric and communication domains when considered distinctly.

The present paper aims to set the ground for future joint evaluation of PDN resilience by reviewing relevant works, centered thus far on electric service, and to a lesser extent on ICT service. Essentially, the type of HILP event is identified from each selected contribution with details in the method used for contingency characterization. Also, the measure of performance, recognized as an enabler for resilience quantification [62], is tracked through this work to explain how it is defined and computed, relying usually on system modeling, or empirical and surrogate models in some cases. In addition, a classification based on the temporal phase where resilience evaluation takes place is proposed, which allows for addressing practical requirements of utility companies. The resilience phase-based approach was linked with different objectives of the assessment, from simple metrics evaluation, to either planning or response for survivability and recovery, achieved through a variety of improvement strategies for which allocation is optimized under the constraint of a limited budget. This bridges resilience studies and economic considerations in order to help stakeholders in investment plan elaboration and crisis decision-making. Aspects of cost, critical load, microgrids, and uncertainty of hazards, load, and distributed generation are discussed to show their high importance, and available tools to date for their involvement in the study.

We extend by this work the wide spectrum of subjects associated with resilience quantification in power networks (modeling and simulation, enhancement strategies, metrics, and extreme events), covered in recent reviews ^{[28][36][44][45][63][64][65]}[28,36,44,45,63,64,65]. The main contributions and novelty of this paper can be summarized as follows: (a) focus on resilience assessment of both electric and telecommunications domains of smart power distribution networks. (b) Detailed analysis and classification of performance calculation techniques. (c) Fine-grained categorization of quantitative resilience works based on time of evaluation and target objective.

Finally, despite the considerable number of works analyzed and relatively deep examination of reviewed methods for resilience quantification in smart PDNs, this paper does not claim to be comprehensive in the issues addressed (and related references), but remains complete enough to give a good overall perspective of the research trends and understanding of challenges and opportunities.

2. Resilience in Smart Grids

Amid desired functionalities for smart grids lays the need for capabilities like: self-healing, high reliability, power quality, and resistance against various disasters and attacks [50]. Resilience represents a promising approach to meet such requirements, by being able to address network circumstances not handled by widely adopted principles of reliability and quality of service.

2.1. From Reliability to Resilience

Reliability is the ability of an item (component or system) to operate under designated operating conditions for a designated period of time or number of cycles, where this ability can be formulated through a probability [66]. In electrical networks, this is equivalent to maintaining the delivery of electric services to customers in the face of routine uncertainty under operating conditions [67]. Metrics like Energy Not Supplied (ENS), Average Customer Curtailment Index (ACCI), System Average Interruption Duration Index (SAIDI), System Average Interruption Frequency Index (SAIFI), Customer Average Interruption Duration Index (CAIDI) are widely used to describe PDN reliability ^[68][69][68,69]. System operators use such indicators to track and enhance the performance of their networks. These indices are further used by system regulators and system operators in service level agreements (SLAs), in order to define penalty thresholds and ensure that the right compensation is paid based on the experienced outages. Reliability metrics are relevant to assess the impact of recurrent events with available historical records, over which maintenance actions are applicable; excluding major hazards such as severe weather events [70]. Some of these metrics were extended to capture more severe events, where metrics like STorm Average Interruption Frequency Index (STAIFI) and STorm Average Interruption Duration Index (STAIDI) are proposed [71]. However, a demonstration was made that these two metrics are not relevant for resilience evaluation because, when used during a storm, they show large deviation that can be even greater than the values of STAIDI and STAIFI [62].

Resilience is “the ability to prepare and plan for, absorb, recover from, or more successfully adapt to actual or potential adverse events” [72]. Unlike reliability, which focuses on the frequency and duration of failures “event-agnostically,” resilience seeks to further track the dynamics and resources of response, adaptability, and ability to restore. This is relevant to HILP hazards where consequences in the system need to be studied with respect to specific events, as each disruption has its distinguishing characteristics [73]. Thus, the fundamental difference resides in the scale, scope, and duration of events handled: resilience targets events with strong impact in a wide geographical area with long duration of outages, while reliability handles local impact in short duration outages [74].

Despite this difference between the two concepts, mainly due to events each of them tackles, they remain closely related because enhancing resilience or reliability may require the same strategies, with resilience being more general, confirming that being resilient typically encompasses being reliable, but not vice versa [75].

2.2. Resilience and QoS in ICT Networks

ICT networks traditionally rely on QoS metrics to define SLAs [76]. These metrics consisting of delay, jitter, bandwidth, packet loss, bit error rate, and traffic load are performance measures that do not give a comprehensive view of network state. Therefore, other complementary metrics are adopted in SLAs in order to better quantify the system state, namely availability metrics.

In the initial introduction of Quality of Resilience (QoR) in [77], QoS is divided into short-term quality parameters referred to by availability, and long-term quality parameters grouped under QoR. In other words, resilience is considered as an aspect of QoS, as latency or packet loss. However, QoR is presented in [78] as a concept-treating quality at different levels of the Open Systems Interconnection model (OSI-model), including the network level which corresponds to traditional QoS. Figure 1 shows how QoR extends QoS to include other types of quality: Quality of Experience (QoE), Quality of Delivery (QoD), and Quality of Protection (QoP). This is done by considering the additional metrics from each level. QoR is used as a transverse evaluation for all aforementioned qualities. This is done by considering the metrics that describe different resilience stages. From a high-level perspective, we can say that QoR is a shift from client-centric evaluation, conducted using QoS, toward a more general framework that includes the system potential in terms of resources, organizational processes, and humans.

Figure 1. OSI-based classification of quality in relation with resilience [78].

Once again, in both cases above, the need for resilience stems from harsh large-scale events imposing consideration of stress in the system and recovery strategies. Then, despite the slight divergence in terminology that is still the case today, the two concepts go beyond the traditional QoS evaluation, to capture both requirements of customers and enhancement strategies of operators. Nevertheless, the idea that resilience takes in QoS is gaining more attention [75], suggesting that a system cannot be resilient if it does not offer acceptable QoS, but providing acceptable QoS is not the only requirement for a network to be resilient.

3. Taxonomy of Resilience Evaluation Methods

The panoply of methods proposed for qualitative evaluation of resilience in electric power networks ^{[20][21][22][23][24][25]}[20,21,22,23,24,25] is not enough to convince critical infrastructure operators in general, and utilities in particular, to adopt the resilience-based design. They are unable to systematically discover hidden vulnerabilities and critical elements [79]. To overcome this, stakeholders need to have a closer, more tangible grasp of resilience, using quantitative analyses which gained huge momentum in recent years. Most of these analyses are performance-based, where performance is defined in various ways in order to fit different participants and study objectives [80]. The fact that almost all works selected in this paper happen to belong to this high-level method of quantification comes to stress the consensus in progress toward the adaptation of this method as a tool for resilience quantification.

In Figure 2, we propose four aspects based on which the state-of-the-art papers on resilience metrics for smart power distribution network could be classified, evaluated, and compared. Some of these aspects will be further elaborated in the later sections. For instance, in Section V we classify the papers based on extreme event handled, performance calculation method, and both type and computational method of resilience metrics. Each of these four aspects is explained in detail below.

Figure 2. Proposed classification for resilience evaluation frameworks.

4. Resilience Quantification Objectives

Four broad classes of resilience metrics are generally adopted: (i) average performance metrics, (ii) integrated multi-phase metrics, (iii) time-dependent metrics, and (iv) probability-based metrics ^[81][118]. In the case of a HILP event, probability distributions are often not available, whereas the other three classes depend on the measure of performance in the network. Thus, a reasonable statement is that an ideal evaluation of resilience may consist of a complete tracking of the time-dependent performance function P(t). This way, network operators can have the value of performance at any instant for the complete event duration. However, despite the apparent dependence of P(t) in time, performance function does not necessarily change with time if it is not for the extreme event which hits the system. In other words, performance function depends on many parameters including hazard intensity, system preparedness, resilience strategies in hand, and priority decisions made, all of which cause network state to change. This sends back the problem of resilience multi-dimensionality, which makes developing closed form derivation for resilience function challenging and hitherto out of reach. Performance-based methods try to include all previously mentioned parameters and additional ones into a temporal curve describing the performance evolution of the network. It can be said that many resilience features are embedded in a performance curve as shown in Figure 3, because the construction of such a graph takes into consideration all factors intervening during a catastrophic contingency.

Figure 3. Performance curve for resilience quantification.

A salient advantage of such an approach is to have the temporal follow-up of network state which allows decision-makers to be in a best-informed posture. Four main phases can be distinguished, among which some can be further detailed into sub-phases:

Anticipation phase (phase I): Represents the time period before the event occurrence, when performance is at its nominal level. Monitoring information, impact projections, and historical data when available are used for prediction studies, and all possible defensive measures are implemented. This serves particularly in the case of multi-hazard management where risks and vulnerabilities to each event are investigated. For single hazard resilience analysis which is the most relevant in the case of HILP event, this phase is not considered and a post-event resilience study is adopted. However, this also refers to the period of normal operation where reliability and risk management for recurrent failures can be conducted, which participates in system resilience, because a resilient system needs to be first as reliable and low-risk as possible. In addition, security measures for protecting the system and preparing it to withstand malicious behaviors are implemented at this stage ^[82][96].
Mitigation phase (phase II): Once an extreme event hits the network, reliance is on system robustness, reactivity, and absorption to minimize the effect on services and infrastructure. Adding to some preparation policies that could be anticipated, many dynamic actions can be implemented to reduce the aftermath, like distribution automation actions, load shedding, and monitoring actions in power distribution networks or customer prioritization in telecom networks. These actions can withstand performance degradation that is in place, or serve to coordinate between entities in order to achieve an accurate assessment of consequences and prepare next crisis management steps.
Recovery phase (phase III): Unlike short-timed low impact incidents where maintenance actions are achieved relatively fast, in major events, recovery actions can require anywhere between several weeks to months ^[83][119]. The main reason is that, given the safety of emergency crews and logistic constraints, restoration is conducted carefully and waits for the reduction in hazard intensity, or more generally identification of restoration windows. Priority is first given to service restoration where all alternative (even temporary) ways to provide services are explored and deployed allowing to regain an intermediate level of performance. Complete recovery will take more time and effort as it involves mostly infrastructure catering which turns out to be very challenging.
Learning phase (phase IV): This phase is less considered than the two previous phases in quantitative resilience frameworks, generally with the argument that resilience is best examined in face of exogenous threats ^[84][120]. The post-recovery phase should still be looked at closely in order to draw conclusions about damages experienced by the network and how various implemented policies helped to alleviate consequences. Data collection through field surveys and supervisory management tools enable improvement in system performance and enhancement in preparation for upcoming extreme events backing the vision for a sustainable network.

Many works ^{[13][27][45][63]}[13,27,45,63] explore each of these phases with slightly different denominations. Here a generalizing description is adopted where the four above-mentioned phases are considered, with mitigation and recovery divided each into two sub-phases in order to better explain all involved mechanisms. Resilience quantitative frameworks can be assessed based on phases they handle [64]. The more phases taken into consideration, the better the insight into system operation during extreme events. Furthermore, the layout can be used to seek answers for the following questions: When is resilience evaluation conducted and for which reason?

Figure 4 distinguishes time instants at which resilience quantification can be conducted, and objectives of this evaluation. The former here orients/guides/steers the latter, because for example, an operator who aims to plan investments for his network will most likely opt for pre-event evaluation, while another who only wants to see the impact induced by a contingency in his network may adopt post-recovery damage evaluation. Knowing “why” resilience is to be evaluated serves as a guideline to choose “when” it should be done. Without loss of generality, resilience evaluation can be induced from the performance curve in Figure 3; so it is important to know when system operators can get such a representation. Three options are available:

Figure 4. Options for resilience evaluation timing and associated objectives.

Proactive evaluation: The procedure in this case is to drive pre-event studies with the goal of obtaining resilience indicators before contingency happens. The outbuilding is in prediction data, recommendations of experts, supervision alerts, and historical records. However, for HILP anomalies, little information is available, then designing preventive measures appeals for simulation tools, emulation, and analytical models which help to make projections for the impact that will be borne by the network in face of uncertain events.

Once metrics are computed, they can be used to make informed decisions about resilience strategies to implement in order to minimize the impact and speed up recovery. In other words, the output of this phase is planning schemes which enhance robustness, survivability, restoration, and recovery of the system that can be summarized in the concept of resilience. The prominent advantage of a proactive evaluation is the ability to look-forward that allows foreseeing what is coming. On the other hand, the large number of possible contingency scenarios and little relevant data cause low-confidence results.

Reactive evaluation: Quantification is carried out as the event happens, meaning that resilience metrics are computed on-the-fly, and policies adopted to cope with severe hazards are taken from the inherent reaction capacity of the system without support from pre-event recommendations. Metrics are calculated as the event goes for the two broad phases of robustness and recovery. In such real-time setup, information that can be gathered is realistic and narrows down failure modes space. However, the flexibility margin can be very tight because the HILP event hits the network by surprise while no anticipative actions are in place. There are no good or bad choices between proactive and reactive evaluation, they are both suitable for resilience analysis and can be complementary. The goal is to find a balanced fit for a given use case

⁸⁵

[

121].
Deductive evaluation: When resilience metrics are computed at the end of a HILP disturbance, they mainly serve to draw conclusions about how the system handled an external event ^[86]^[87]^[88][81,107,108]. Results of this are intended to point out axes of improvement for future reference in similar extreme situations, and can also be considered as performance evolvement baseline. Further, the output of such post-recovery evaluation can be fed to the pre-event phase for hazards in the future, closing a kind of a cycle with the evaluations presented above.

Proactive approaches are dominant in resilience engineering, especially when considering the fact that in some cases the reactive approach is subsumed therein. The combination of the two is simply referred to as proactive approaches.