The use of parallel techniques is becoming inescapable, and the adoption of HPC competence by the electrical energy and power system community is inevitable in the face of the anticipated future and its upcoming challenges. With some algorithmic modifications, parallel computing unlocks the potential to solve huge power system problems that are conventionally intractable, helping to reduce costs and CO2 emissions.
2. Parallel Hardware
Parallel computation involves several tasks being performed simultaneously by multiple workers on a parallel machine. Here, a worker is a loose term and could refer to different processing hardware (e.g., a core, a Central Processing Unit (CPU), or a compute node). Predominantly, parallel machines can be placed under two categories based on the Von Neumann architecture [50]: Multiple Instruction Multiple Data (MIMD) and Single Instruction Multiple Data (SIMD) machines.
The SIMD architecture dominated supercomputing in the form of vector processors, but that changed soon after general-purpose processors became scalable in the 1990s [51,52], followed by transputers and microprocessors designed specifically for aggregation and scalability [53].
2.1. CPUs
CPUs were initially optimized for scalar programming and executing complex logical tasks effectively until they hit a power wall, leading to multicore architecture [
54]. Today, they function as miniature superscalar computers that enable pipelining, task-level parallelism, and multi-threading [
55]. They employ a variety of self-optimizing techniques, such as “Speculation”, “Hyperthreading” or “Simultaneous Multi-threading”, “Auto-vectorization”, and “Task Dependency Detection” [
55]. They contain an extra SIMD layer that supports data-level parallelism, vectorization, and fused multiply-add with high register capacity [
56,
57]. Furthermore, CPUs use a hierarchy of memory and caches, which allows complex operations to proceed without fetching from Random Access Memory (RAM), ranging from a high-speed, low-capacity cache (L1) to lower-speed, higher-capacity caches (L2, then L3). This memory hierarchy gives the CPU a distinct functional advantage over GPUs.
A workstation CPU can have up to 16 processing cores, and server-level CPUs can have up to 128 cores in certain products [59]. Multi-threading is carried out through Application Programming Interfaces (APIs) such as Cilk or OpenMP, allowing functions and loops to be parallelized, while multi-processing across several server-level CPUs to solve massive decomposed problems is facilitated by APIs such as MPI.
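As a minimal illustration of this kind of loop-level task parallelism, the sketch below uses Python's standard library as a stand-in for the C/C++-oriented APIs named above; the subproblem function and data are hypothetical.

```python
# Minimal sketch: parallelizing an "embarrassingly parallel" loop over
# independent subproblems, analogous in spirit to an OpenMP parallel-for.
# solve_subproblem() and the subproblem list are placeholders.
from concurrent.futures import ProcessPoolExecutor

def solve_subproblem(data):
    # stand-in for an independent, CPU-heavy task (e.g., one contingency case)
    return sum(x * x for x in data)

if __name__ == "__main__":
    subproblems = [list(range(i, i + 10_000)) for i in range(16)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(solve_subproblem, subproblems))
    print(results[:3])
```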
2.2. GPUs
GPUs function very similarly to the Vector Processing Units or Array Processors that used to dominate supercomputer design. They are additional components attached to a “Host” machine, essentially the CPU, which sends them kernels. GPUs were originally designed to render 3D graphics and are especially good at vector calculations. The representation of 3D graphics has a “grid” nature and requires the same process to be applied to a vast number of input data points. This execution model has been extended to many applications in scientific computing and machine learning, solving massive symmetrical problems or performing symmetrical tasks. Unlike with CPUs, achieving efficiency in GPU parallelism is a more tedious task due to the fine-grained SIMD nature and rigid architecture.
The GPU (Device) interfaces with the CPU (Host) through the PCI Express bus, from which it receives instruction “Kernels”. In each cycle, a kernel function is sent and processed by a vast number of GPU threads with limited communication between them. Thus, symmetry of the parallelized task is a requisite, and the number of parallel threads has to be a specific multiple to avoid sequential execution of tasks. Specifically, threads need to be executed in multiples of 32 (a warp), and in multiples of two streaming processors per block, for the highest efficiency.
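A minimal sketch of this host/device kernel-launch pattern is shown below, using Numba's CUDA interface from Python (Numba is not discussed in the cited works; the array sizes and element-wise operation are arbitrary, and an NVIDIA GPU with a CUDA toolkit is assumed to be available).

```python
# Minimal kernel-launch sketch with Numba CUDA: the host (CPU) copies data to the
# device (GPU), launches a kernel over blocks of 256 threads (a multiple of the
# 32-thread warp size), and copies the result back.
import numpy as np
from numba import cuda

@cuda.jit
def axpy_kernel(a, x, y, out):
    i = cuda.grid(1)              # global thread index
    if i < out.size:              # guard against out-of-range threads
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
d_x, d_y = cuda.to_device(x), cuda.to_device(y)
d_out = cuda.device_array_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
axpy_kernel[blocks, threads_per_block](np.float32(2.0), d_x, d_y, d_out)
result = d_out.copy_to_host()
```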
GPUs can be programmed in C or C++. However, many APIs exist to program GPUs, such as OpenCL, HIP, C++ AMP, DirectCompute, and OpenACC. These APIs provide high-level functions, instructions, and hardware abstractions, making GPU utilization more accessible. The most relevant interface is CUDA by NVIDIA, since NVIDIA dominates the GPU market in both desktop and HPC/Cloud settings [60]. CUDA's libraries make the power of NVIDIA GPUs much more accessible to the scientific and engineering communities.
The GPU's different architecture may cause discrepancies and lower accuracy in results, as floating-point values are often rounded in a different manner and at a different precision than on CPUs [61]. Nevertheless, these challenges can be worked around with CUDA and with sparse techniques that reduce the number of ALU operations required to achieve a massive speedup. Finally, GPUs can offer a huge advantage over CPUs in terms of energy efficiency and cost if their resources are used effectively and appropriately.
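A small CPU-only sketch of the precision effect is given below (the GPU-specific behavior, e.g., fused multiply-add and parallel reduction order, is analogous but hardware-dependent; the data are random and purely illustrative).

```python
# Illustration: accumulating the same data in single vs. double precision
# gives visibly different results; GPU reductions add further differences in
# rounding and summation order.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(10_000_000)

sum64 = np.sum(data, dtype=np.float64)
sum32 = np.sum(data.astype(np.float32), dtype=np.float32)
print(f"float64 sum: {sum64:.6f}")
print(f"float32 sum: {sum32:.6f}")
print(f"relative difference: {abs(sum64 - float(sum32)) / sum64:.2e}")
```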
2.3. Other Hardware
There are two more notable parallel devices to mention. One is the Field Programmable Gate Array (FPGA). This chip consists of configurable logic blocks, giving the user complete flexibility to program the internal hardware and architecture of the chip itself. FPGAs are attractive because they are inherently parallel, and their logic can be optimized for the desired parallel processes. However, they consume a considerable amount of power compared to other devices, such as Advanced RISC Machine (ARM) processors, which consume very little energy due to their reduced instruction set, making them suitable for portable devices and applications [
62].
3. Aggregation and Paradigms
By the late 1970s, the ARPANET project was underway [63], UNIX had been developed [64], and advances in networking and communication hardware had been achieved. The first commercial LAN clustering system/adapter, ARCNET, was released in 1977 [65], and hardware abstraction emerged in the form of virtual memory, as in OpenVMS, which was adopted by operating systems and supercomputers [66]. Around that same time, the concept of computer clusters was forming. Many research facilities and customers of commercial supercomputers started developing in-house clusters of more than one supercomputer. Today's HPC facilities are highly scalable and are composed of specialized aggregate hardware. Communication between processes across this aggregate hardware is aided by high-level software such as MPI, which is available in various implementations and packages such as mpi4py in Python, alongside tools such as Apache, Slurm, and mrjob that aid in data management, job scheduling, and other routines.
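A minimal sketch of this coordinator/worker communication pattern with mpi4py is shown below (the subproblem payload is hypothetical; the script would be launched with, e.g., `mpirun -n 4 python script.py`).

```python
# Minimal mpi4py sketch: rank 0 scatters decomposed subproblems to all ranks,
# each rank solves its piece, and the results are gathered back at rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # one (placeholder) subproblem per rank
    subproblems = [list(range(i * 100, (i + 1) * 100)) for i in range(size)]
else:
    subproblems = None

local = comm.scatter(subproblems, root=0)   # each rank receives one subproblem
local_result = sum(local)                   # stand-in for the local solve
results = comm.gather(local_result, root=0) # coordinator collects all results

if rank == 0:
    print("gathered:", results)
```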
Specific clusters might be designed or equipped with components geared more toward specific computing needs or paradigms. HPC usually includes tasks with rigid time constraints (minutes to days or maybe weeks) that require a large amount of computation. The High-Throughput Computing (HTC) paradigm involves long-term tasks that require a large amount of computation (months to years) [
67]. The Many Task Computing (MTC) paradigm involves computing various distinct HPC tasks and revolves around applications that require a high level of communication and data management [
68]. Grid or Cloud facilities provide the flexibility to adopt all the mentioned paradigms.
3.1. Grid Computing
The information age, taking off in the 1990s, set off the trend of wide-area distributed computing and “Grid Computing”, the predecessor of the Cloud. Ian Foster coined the term with Carl Kesselman and Steve Tuecke, who developed the Globus toolkit, which provides grid computing solutions [
69]. Many Grid organizations exist today, such as NASA, EGEE, and the Open Science Grid. Grid computing shaped the field of “Metacomputing”, which revolves around the decentralized management and coordination of Grid resources, often carried out by virtual organizations with malleable boundaries and responsibilities. The infrastructure of grids tends to be very secure and reliable, with an exclusive network of users (usually scientists and experts), discouraging virtualization and interactive applications. Hardware is not available on demand; thus, grids are only suitable for sensitive, close-ended, non-urgent applications. Grid computing features provenance and performance monitoring and is mainly adopted by research organizations.
3.2. Cloud Computing
Cloud computing is essentially the commercialization and effective scaling of Grid Computing driven by demand, and it is all about the scalability of computational resources for the masses. It largely started with Amazon's demand for computational resources for its e-commerce activities, which prompted Amazon to launch the first successful Infrastructure-as-a-Service platform, Elastic Compute Cloud [70], for other businesses conducting similar activities.
The distinction between Cloud and Grid follows from their business models. Cloud computing is far more flexible and versatile than Grid when it comes to accommodating different customers and applications. It relies heavily on virtualization and resource sharing. This makes the Cloud inherently less secure and less performance-efficient than the Grid, and more challenging to manage, yet far more scalable, available on demand, and overall more resource-efficient. It achieves a delicate balance between efficiency and the cost of computation.
Today, AWS, Microsoft Azure, Oracle Cloud, Google Cloud, and many other commercial cloud services provide massive computational resources to various organizations such as Netflix, Airbnb, ESPN, HSBC, GE, Shell, and the NSA. It only makes sense that the electrical industry will adopt the Cloud as well.
3.2.1. Virtualization
The appearance of virtualization caused a considerable leap in massively parallel computing, especially after the software tool Parallel Virtual Machine (PVM) [
71] was created in 1989. Since then, tens and hundreds of virtualization platforms have been developed, and are used today on the smallest devices with processing power [
72]. Virtualization allows resources to be shared in a pool, where multiple instances of different types of hardware can be emulated on the same bare metal. This means less hardware needs to be allocated or invested in for Cloud computing to serve a more extensive user base. Often, the percentage of hardware actually used is low compared to the hardware requested, so idle hardware is reallocated to other user processes that need it. The instances initiated by users float on the hardware like clouds, shifting and moving or shrinking and expanding depending on the actual needs of the process.
3.2.2. Containers
While virtualization makes hardware processes portable, containers make software portable. Developing applications, software, or programs in containers allows them to be used on any Operating System (OS) as long as it supports a container engine. That means one can develop Linux-based software (e.g., targeting Ubuntu 20.04) in a container and run that same application on a machine with Windows or macOS installed. This flexibility extends to service-based applications that utilize HPC facilities: an application can be developed in containers, and clients can use it on their own cluster or on a cloud service.
3.2.3. Fog Computing
Cloudlets, edge nodes, and edge computing are all related to an emerging IoT trend, Fog Computing. Fog nodes are compute nodes associated with a cloud that are geographically closer to the end-user or to control devices. They mediate between large data or cloud computing centers and users. This topology aims to achieve data locality, offering several advantages such as low latency, higher efficiency, and decentralized computation.
3.3. Volunteer Computing
Volunteer computing is an interesting distributed computing model that originated in 1996 with the Great Internet Mersenne Prime Search [
75], allowing individuals connected to the internet to donate their personal computer’s idle resources for a scientific research computing task. Volunteer computing remains active today, with many users and various middleware and projects, both scientific and non-scientific, primarily based on BOINC [
76], and in commercial services such as macOS Server Resources [
77].
3.4. Granularity
Fine-grained parallelism appears in algorithms that frequently repeat a simple homogeneous operation on a vast dataset; it is often associated with embarrassingly parallel problems. Such problems can be divided into many highly, if not wholly, symmetrical simple tasks, providing high throughput. Fine-grained algorithms are also often associated with multi-threading and shared memory resources. Coarse-grained algorithms imply moderate or low task parallelism that sometimes involves heterogeneous operations. Today, coarse-grained algorithms are almost synonymous with multi-processing, where the algorithms use distributed memory resources to divide tasks among different processors or logical CPU cores.
3.5. Centralized vs. Decentralized
Centralized algorithms refer to problems with a single task or objective function, solved by a single processor, with data stored at a single location. When a centralized problem is decomposed into N subproblems that are sent to N processors to be solved, and the results are retrieved by a central controller to update variables, re-iterate, and verify convergence, the algorithm becomes a “Distributed” algorithm. The terms distributed and decentralized are often used interchangeably and are often confused in the literature, but there is an important distinction between them. A decentralized algorithm is one in which the decomposed subproblems do not have a central coordinator or a master problem. Instead, the processes responsible for the subproblems communicate with neighboring processes to reach a solution consensus (several local subproblems with coupling variables, where subproblems communicate without a central coordinator). The value of each type is determined not only by computational performance but also by the decision-making policy.
In large-scale complex problems, distributed algorithms sometimes outperform centralized algorithms. The speedup keeps growing with the problem size if the problem has “strong scalability”. Distributed algorithms’ subproblems share many global variables. This means a higher communication frequency, as all the variables need to be communicated back and forth to the central coordinator. Moreover, in some real-life problems, central coordination of distributed computation might not be possible. Fully decentralized algorithms solve this problem as their processes communicate laterally, and only neighboring processes have shared variables.
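The toy sketch below contrasts the two coordination patterns on a single shared coupling variable (local "solves" are placeholder quadratic pulls toward hypothetical local targets; both variants converge to the same consensus value).

```python
# Toy contrast between distributed (central coordinator) and decentralized
# (neighbor-to-neighbor) coordination of a shared coupling variable.
import numpy as np

targets = np.array([1.0, 3.0, 5.0, 7.0])   # each agent's local preference
x = np.zeros(4)

# Distributed: a central coordinator gathers all local solutions and
# broadcasts their average back as the shared variable.
for _ in range(50):
    local = 0.5 * (x + targets)            # placeholder local subproblem solve
    x[:] = local.mean()                    # coordinator update (global exchange)

# Decentralized: agents only average with their ring neighbors; consensus
# emerges without any central coordinator.
y = targets.copy()
for _ in range(200):
    y = (np.roll(y, 1) + y + np.roll(y, -1)) / 3.0

print(x, y)   # both settle near the same consensus value
```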
3.6. Synchronous vs. Asynchronous
Synchronous algorithms do not move forward until all parallel tasks at a given step or iteration have been executed. They are more accurate and efficient for tasks with symmetrical data and complexity; however, their efficiency suffers when the tasks are not symmetrical, which is usually the case in power system optimization studies. Asynchronous algorithms allow idling workers to take on new tasks even before all adjacent processes are complete, at the cost of accuracy when there are dependencies between parallel tasks. To achieve better accuracy in asynchronous algorithms, “Formation” needs to be ensured, meaning that while subproblems may deviate in their direction of convergence, they should keep a global tendency toward the solution.
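A small sketch of the asynchronous pattern follows, where results are consumed as workers finish rather than waiting for the slowest task at a barrier (the task durations are artificial stand-ins for uneven subproblem solve times).

```python
# Asynchronous harvesting of heterogeneous tasks: results are processed in
# completion order instead of waiting at a synchronization barrier.
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def solve(task_id, duration):
    time.sleep(duration)          # stand-in for an uneven subproblem solve
    return task_id, duration

if __name__ == "__main__":
    durations = [0.4, 0.1, 0.3, 0.2]
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(solve, i, d) for i, d in enumerate(durations)]
        for fut in as_completed(futures):   # fastest results arrive first
            task_id, d = fut.result()
            print(f"task {task_id} finished after {d:.1f}s")
```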
3.7. Problem Splitting and Task Scheduling
Large emphasis must be placed on task scheduling when designing parallel algorithms. In multi-threading, synchronization of tasks is required to avoid “Race Conditions” that cause numerical errors when multiple threads access the same memory location simultaneously. Hence, synchronization does not necessarily imply that processes execute every instruction simultaneously, but rather in a coordinated manner. Coordination mechanisms involve pipelining or task overlapping, which can increase efficiency and reduce the latency of parallel execution. For example, sub-tasks that take the longest time in synchronous algorithms can utilize the idle workers of completed sub-tasks if no dependencies prevent such an allocation. Dependency analysis is occasionally carried out when splitting tasks. In an elaborate parallel framework, such as in multi-domain simulations or smart grid applications, task scheduling becomes its own complex optimization problem, which is often solved heuristically. However, there exist packages such as DASK [78] that can help with optimal task planning and parallel task scheduling.
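A minimal Dask sketch of declaring a small task graph and letting the scheduler run independent tasks in parallel is shown below (the functions and values are placeholders).

```python
# Dask builds a task graph from delayed calls and schedules independent
# tasks across workers; only .compute() triggers execution.
from dask import delayed

@delayed
def solve_subproblem(k):
    return k * k                      # placeholder for a heavy computation

@delayed
def combine(results):
    return sum(results)               # placeholder aggregation step

parts = [solve_subproblem(k) for k in range(8)]   # independent -> parallelizable
total = combine(parts)
print(total.compute())                # scheduler runs the graph
```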
3.8. Parallel Performance Metrics
Solution time and accuracy are the main measures of the success of a parallel algorithm. According to Amdahl's law of strong scaling, there is an upper limit to the speedup achievable for a fixed-size problem: dividing a specific fixed-size problem into more subproblems does not result in a linear speedup. However, if the parallel portion of the algorithm increases, then proportionally increasing the subproblems or the number of processors can continually increase the speedup, according to Gustafson's law of weak scaling. The good news is that Gustafson's law applies to large decomposed power system problems.
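In their standard form, with $p$ the parallelizable fraction of the work and $N$ the number of processors, the two bounds read

$$S_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N}, \qquad S_{\text{Gustafson}}(N) = (1-p) + pN,$$

so a fixed-size problem is capped at a speedup of $1/(1-p)$ no matter how many processors are added, whereas a problem whose parallel portion grows with $N$ keeps scaling.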
4. Power Flow
Power Flow (PF) studies are central to all power system studies involving network constraints. The principal goal of PF is to solve for the network's bus voltages and angles in order to calculate the network's flows. For some applications, PF is solved using the DC power flow equations, which are approximations based on realistic assumptions. Solving these equations is easy and relatively fast and results in an excellent approximation of the network PF [29]. On the other hand, the non-linear full AC power flow equations need to be solved to obtain an accurate solution, and these require numerical approximation methods. The most popular ones in power system analysis are the Newton-Raphson (NR) method and the Interior Point Method (IPM) [37]. However, these methods are computationally expensive and too slow for real-time applications, making them a target for parallel execution.
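For orientation, a minimal Newton-Raphson AC power flow sketch on a hypothetical 3-bus system is given below; the line impedances, loads, and the finite-difference Jacobian are illustrative simplifications (production solvers use the analytic, sparse Jacobian).

```python
# Minimal Newton-Raphson AC power flow sketch on a 3-bus example
# (bus 0: slack, buses 1-2: PQ). The Jacobian is built by finite differences
# for brevity; production solvers use the analytic, sparse Jacobian.
import numpy as np

# Line admittances of a hypothetical 3-bus network (per unit)
y01, y02, y12 = 1/(0.01 + 0.03j), 1/(0.02 + 0.05j), 1/(0.0125 + 0.025j)
Y = np.array([[ y01 + y02, -y01,       -y02      ],
              [-y01,        y01 + y12, -y12      ],
              [-y02,       -y12,        y02 + y12]])

P_spec = np.array([-0.5, -0.8])   # specified P at buses 1, 2 (loads)
Q_spec = np.array([-0.2, -0.3])   # specified Q at buses 1, 2

def mismatch(x):
    """Power mismatches for unknowns x = [theta1, theta2, V1, V2]."""
    theta = np.array([0.0, x[0], x[1]])          # slack angle = 0
    V     = np.array([1.0, x[2], x[3]])          # slack voltage = 1.0 pu
    Vc    = V * np.exp(1j * theta)
    S     = Vc * np.conj(Y @ Vc)                 # complex bus injections
    return np.concatenate([S.real[1:] - P_spec, S.imag[1:] - Q_spec])

x = np.array([0.0, 0.0, 1.0, 1.0])               # flat start
for it in range(20):
    f = mismatch(x)
    if np.max(np.abs(f)) < 1e-8:
        break
    J = np.zeros((4, 4))                         # finite-difference Jacobian
    for k in range(4):
        dx = np.zeros(4)
        dx[k] = 1e-6
        J[:, k] = (mismatch(x + dx) - f) / 1e-6
    x -= np.linalg.solve(J, f)                   # Newton update

print("angles (rad):", x[:2], "voltages (pu):", x[2:], "iterations:", it)
```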
4.1. MIMD Based Studies
Another way to parallelize PF (or OPF) is through network partitioning. While network partitioning usually occurs at the problem level in OPF, in PF the partitioning often happens at the solution/matrix level. Such partitioning methods for PF use sparsity techniques involving LU decomposition, forward/backward substitution, and diakoptics, which trace back to the late 1960s, predominantly to the work of H. H. Happ [
86,
87] for load flow on typical systems [
88,
89], and dense systems [
41]. Parallel implementation of PF using this method started in the 1980s on array processors such as the VAX11 [
90] and later in the 1990s on the iPSC hypercube [
91]. Techniques such as the Fast Decoupled Power Flow (FDPF) were also parallelized on the iPSC using Successive Over-Relaxation (SOR) applied to Gauss-Seidel (GS) [92], and on vector computers such as the Cray X/MP using Newton's FDPF [
93]. PF can also be treated as an unconstrained non-linear minimization problem, which is precisely what E. Housos and O. Wing [
94] did to solve it using a parallelizable modified conjugate directions method.
When general-purpose processors started dominating parallel computers, their architecture became homogenized, and the enhancements achieved by parallel algorithms became comparable and easier to experiment with. This enabled a new target: optimizing the parallel techniques themselves. Chen and Chen used transputer-based clusters to test the best workload/node distribution [95] and a novel Block-Bordered Diagonal Form (BBDF) approach for power flow analysis [
96]. The advent of Message Passing Interface (MPI) allowed the exploration of scalability with the Generalized Minimal Residual Method (GMRES) in [
97] and the multi-port inversed matrix method [
98] as opposed to the direct LU method. Beyond this point, parallel PF shifted heavily towards SIMD hardware (GPUs in particular), except for a few studies involving elaborate schemes. Examples include transmission/distribution and smart grid PF calculation [99], or optimal network partitioning for fast PF calculation [
100].
4.2. SIMD Based Studies
4.2.1. Development
GPUs dominate recent parallel power system studies. The first GPU-based power flow implementation might have been achieved using a preconditioned Biconjugate Gradient algorithm and sparsity techniques to implement the NR method on an NVIDIA Tesla C870 GPU [
101]. Some elementary approaches parallelized the computation of connection matrices for networks where more than one generator could exist on a bus on an NVIDIA GPU [
102]. CPUs have also been used in SIMD-based power flow studies, since modern CPUs contain multiple cores; hence, multi-threading with OpenMP can be used to vectorize NR with LU factorization [103]. Some resorted to GPUs to solve massive batches of PF for Probabilistic Power Flow (PPF) or contingency analysis, with one thread per scenario, as in [
104]. Others modified the power flow equations to improve the suitability and performance on GPU [
105,
106].
While many papers limit their applications to NVIDIA GPUs by using CUDA, OpenCL, a general parallel hardware API, has also been used occasionally [
107]. Some experimented with and compared the performance of different CUDA routines on different NVIDIA GPU models [
108]. Similar experimentation on routines was conducted to solve ACOPF using FDPF [
109]. In [
110], NR, Gauss-Seidel, and Decoupled PF were tested and compared against each other on GPUs. Improvements on Newton's method and the parallelization of its different steps were performed previously [111]. Asynchronous PF algorithms have been applied on GPUs, which is challenging, as the efficiency of GPUs depends on synchronicity and homogeneity [
112]. Even with the existence of CUDA, many still venture into creating their routines with OpenCL [
113,
114] or direct C coding [
115] of GPU hardware to fit their needs for PF. Very recently, a few authors made thorough overviews of parallel power flow algorithms on GPUs, covering general trends [
116,
117] and specifically AC power flow GPU algorithms [
118]. The most impactful work is covered in the State-of-the-Art subsection.
4.2.2. State-of-the-Art
The DC Power Flow (DCPF) problem was solved using a Chebyshev preconditioner and the conjugate gradient iterative method in a GPU (448-core Tesla M2070) implementation in [
120,
121]. The vector operations involved are easily parallelizable in the most efficient way with CUDA libraries such as CUBLAS and CUSPARSE, which are Basic Linear Algebra Subroutine and Sparse Matrix Operation libraries, respectively.
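As a CPU-side illustration of the preconditioned conjugate gradient structure described above, the sketch below uses SciPy with a Jacobi (diagonal) preconditioner standing in for the Chebyshev preconditioner, and a random sparse SPD matrix standing in for a DCPF system; on a GPU, the analogous dense and sparse kernels would be supplied by CUBLAS/CUSPARSE (e.g., through CuPy).

```python
# Preconditioned conjugate gradient on a sparse symmetric positive-definite
# system, structurally similar to a DCPF B' matrix.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 1000
A = sp.random(n, n, density=0.01, format="csr", random_state=0)
A = A @ A.T + 10 * sp.identity(n)            # make it SPD and well conditioned
b = np.ones(n)

diag = A.diagonal()
M = LinearOperator((n, n), matvec=lambda x: x / diag)  # Jacobi preconditioner

x, info = cg(A, b, M=M)
print("converged" if info == 0 else f"cg returned {info}",
      "| residual:", np.linalg.norm(A @ x - b))
```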
Later, the same author went on to parallelize the FDPF using the same hardware and pre-conditioning steps [122]. Two real systems were used: the Polish system, which has groups of locally connected subsystems, and the Pan-European system, which consists of several large coupled systems. This topology difference results in different sparsity patterns in the SLS matrix, which offers a unique perspective. Their proposed GPU-based FDPF was implemented in Matlab on top of MatPower v4.1. In their algorithm, the CPU communicates with the GPU regularly, sending information back and forth within each iteration. Their tests showed that the FDPF performed better on the Pan-European system because its connections are more ordered than those of the Polish system. CPU-GPU communication occurred frequently in their algorithm's steps, most likely bottlenecking the speedup (less than 3× compared to CPU only).
Instead of adding pre-conditioning steps, M. Wang et al. [
123] focus on improving the continuous Newton's method such that a stable solution is found even for an ill-conditioned power flow problem. For example, if any load or generator power exceeds 3.2 p.u. in the IEEE-118 test case, the NR method fails to converge, whereas their method still converges to the solution even if such a value is encountered in any iteration. This was achieved using numerical integration methods of different orders. The CPU loads data into the GPU and extracts the results only upon convergence, making the algorithm very efficient. The approach substantially improved over the previous work by removing the pre-conditioning step and reducing CPU–GPU communication (a speedup of 11× compared to a CPU-only implementation).
Sometimes, dividing the bulk of computational load between the CPU and GPU (a hybrid approach) can be more effective depending on the distribution of processes. In one hybrid CPU–GPU approach, a heavy emphasis on the sparsity analysis of PF-generated matrices was made in [
124]. When using a sparse technique, the matrices operated on are reduced to ignore the zero terms. For example, the matrix is turned into a vector of indices referring to the non-zero values, confining operations to these values. Seven parallelization schemes were compared, varying the technique used (dense vs. sparse treatment), the majoring type (row vs. column), and the threading strategy. Row- or column-major storage signifies whether data from the same row or column of the matrix are stored consecutively. The thread invocation strategies varied in splitting or combining the calculation of the P and Q entries of the mismatch vectors. Two sparsity techniques were experimented with, showing a reduction in operations down to 0.1% of the original number and a two- or even three-order-of-magnitude performance enhancement for power mismatch vector operations. In 100 trials, their best scheme converged within six iterations on a four-core host with a GeForce GTX 950M GPU, with a small deviation in solution time between trials. CPU–GPU communication took about 7.79–10.6% of the time, a fairly low share. However, despite all of these reductions, the proposed approach did not consistently outperform a CPU-based solution. The authors suggested that this was due to using higher-grade CPU hardware than the GPU.
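A small illustration of this index-based sparse storage idea is given below (Compressed Sparse Row is one common such format; the matrix is an arbitrary stand-in).

```python
# Dense-to-sparse conversion: only non-zero entries and their indices are
# stored, so subsequent operations are confined to those values.
import numpy as np
import scipy.sparse as sp

J = np.array([[ 4.0, 0.0,  0.0, -1.0],
              [ 0.0, 5.0,  0.0,  0.0],
              [ 0.0, 0.0,  6.0, -2.0],
              [-1.0, 0.0, -2.0,  7.0]])

J_csr = sp.csr_matrix(J)
print("stored values: ", J_csr.data)     # non-zero values only
print("column indices:", J_csr.indices)  # which column each value belongs to
print("row pointers:  ", J_csr.indptr)   # where each row starts in the arrays
print("density:", J_csr.nnz / J.size)    # fraction of entries actually stored
```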
Zhou et al. might have conducted the most extensive research in GPU-accelerated batch solvers in a series between 2014 and 2020. They fine-tune the process of solving PPF for GPU architecture in [
125,
126]. The strategies used include Jacobian matrix packaging, contiguous memory addresses, and thread block assignment to avoid divergence of the solution. Subsequently, they use the LU-factorization solver from previous work to finally create a batch-DPF algorithm [
127]. They test their batch-DPF algorithm on three cases: 1354-bus, 3375-bus, and 9241-bus systems. For 10,000 scenarios, they solved the largest case in less than 13 s, showing the potential for online application.
Most of the previous studies solve the PF problem in a bare and limited setup compared to the work by J. Kardos et al. [128], which involves similar techniques in a massive HPC framework. Namely, preventive Security Constrained Optimal Power Flow (SCOPF) is solved by building on an already existing suite called BELTISTOS [129]. BELTISTOS specifically includes SCOPF solvers and has an established Schur complement algorithm that factorizes the Karush–Kuhn–Tucker (KKT) system, allowing for a great degree of parallelism when using IPM to solve general-purpose Non-Linear Programming (NLP) problems. Thus, the main contribution lies in removing some bottlenecks and ill-conditioning that exist in the Schur complement steps by introducing a modified framework (BELTISTOS-SC). The parallel Schur algorithm is bottlenecked by a dense matrix associated with the global part of the solution, which is solved in a single process. Since GPUs are well suited to dense systems, they factorize this system and apply forward–backward substitution, solving it with cuSOLVER, a GPU-accelerated library for dense linear algebraic systems.
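A numpy sketch of the block-elimination idea behind such Schur complement methods follows; the block system is generic and hypothetical, with the diagonal block playing the role of the independent, parallelizable local solves and the dense Schur system playing the role of the global part that BELTISTOS-SC offloads to the GPU.

```python
# Block elimination via the Schur complement for a 2x2 block system
#   [A  B] [x]   [f]
#   [C  D] [y] = [g]
# The A-block solves are independent and parallelizable; the dense Schur
# complement S = D - C A^{-1} B is the global (serial or GPU-offloaded) part.
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 20                        # local block size, coupling block size
A = np.diag(rng.uniform(1, 2, n))     # stand-in for independent local blocks
B = rng.standard_normal((n, m))
C = B.T.copy()
D = np.eye(m) * 300.0
f = rng.standard_normal(n)
g = rng.standard_normal(m)

Ainv_B = np.linalg.solve(A, B)        # local solves (parallelizable per block)
Ainv_f = np.linalg.solve(A, f)
S = D - C @ Ainv_B                    # dense Schur complement (global part)
y = np.linalg.solve(S, g - C @ Ainv_f)
x = Ainv_f - Ainv_B @ y

# check against the monolithic solution
K = np.block([[A, B], [C, D]])
assert np.allclose(np.concatenate([x, y]),
                   np.linalg.solve(K, np.concatenate([f, g])))
```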
They performed their experiments on a multicore Cray XC40 computer at the Swiss National Supercomputing Centre, using 18 2.1 GHz cores, an NVIDIA Tesla P100 with 16 GB of memory, and many other BELTISTOS- and hardware-associated libraries. They tested their modification on several system sizes, from PEGASE1354 to PEGASE13659. Their approach sped up the solution of the dense Schur complement system by 30× for the largest system relative to the CPU solution of that step, achieving a notable speedup for all system sizes tested. They later performed a large-scale performance study, in which they increased the number of computing cores from 16 to 1024 on the cluster. The BELTISTOS-SC augmented approach achieved up to a 500× speedup for the PEGASE1354 system and 4200× for PEGASE9241 when 1024 cores were used, demonstrating strong scalability up to 512 cores.
5. Optimal Power Flow
Like PF, OPF studies are the basis of many operational assessments such as System Stability Analysis (SSA), UC, ED, and other market decisions [
29]. Variations of these assessments include Security Constrained Economic Dispatch (SCED) and SCOPF, both involving contingencies. OPF ensures the satisfaction of network constraints under cost or power-loss minimization objectives. The full ACOPF version has non-linear, non-convex constraints, making it computationally complex and making a global optimum difficult to reach. DC Optimal Power Flow (DCOPF) and other methods, such as decoupled OPF, linearize and simplify the problem, and when solved they produce a fast but sub-optimal solution. Because DCOPF makes voltage and reactive power assumptions, it becomes less reliable with increased RE penetration, since RE deviates the network's voltages and reactive powers significantly. This is one of the main drivers behind speeding up ACOPF for real-time applications and for all algorithms involving it. The first formulation of OPF was achieved by J. Carpentier in 1962 [
130], followed by an enormous volume of OPF formulations and studies, as surveyed in [
131].
5.1. MIMD Based Studies
5.1.1. Development
OPF and SCOPF decomposition approaches started appearing in the early 1980s using P-Q decomposition [
132,
133] and including corrective rescheduling [
134]. The first introduction of parallel OPF algorithms might have been by Garng M. Huang and Shih-Chieh Hsieh in 1992 [135], who proposed a “textured” OPF algorithm that involved network partitioning. In a different work, they proved that their algorithm converges to a stationary point and that, under certain conditions, optimality is guaranteed. Later, they implemented the algorithm on the nCUBE2 machine [136], showing that both their sequential and parallel textured algorithms are superior to non-textured algorithms. It was atypical for studies at the time to highlight portability, which makes Huang's work in [92] special: it contributed another OPF algorithm using Successive Over-relaxation, made “adaptive” to reduce the number of iterations. The code was run on the nCUBE2 and ported to the Intel iPSC/860 hypercube, demonstrating its portability.
In 1990, M. Teixeira et al. [137] demonstrated what might be the first parallel SCOPF on a 16-CPU system developed by the Brazilian Telecom R&D center. The implementation was somewhat “makeshift” and coarse, to the level where each CPU was installed with a whole MS/DOS OS for the multi-area reliability simulation. Nevertheless, it scaled well and was still 2.5 times faster than a VAX 11/780 implementation.
Distributed OPF algorithms started appearing in the late 1990s with a coarse-grained multi-region coordination algorithm using the Auxiliary Problem Principle (APP) [
138,
139]. This approach was broadened much later by [
140] using Semi-Definite Programming and the Alternating Direction Method of Multipliers (ADMM). Prior to that, ADMM was also compared against the method of partial duality in [
141]. The convergence properties of the previously mentioned techniques and more were compared comprehensively in [
142].
The asynchronous parallelization of OPF first appeared for preventive [143] and corrective SCOPF [144], targeting online applications [47], motivated by the heterogeneity of the solution times of different scenarios. Both SIMD and MIMD machines were used, with an emphasis on portability as “Getsub and Fifo” routines were carried out. By the same token, MPI protocols were used to distribute and solve SCOPF, decomposing the problem with GMRES and solving it with the non-linear IPM while varying the number of processors [
145]. Real-time application potential was later demonstrated by using Benders decomposition instead for distributed SCOPF [
146]. Benders decomposition is one of the most commonly used techniques to create parallel structures in power system optimization problems, and it shows up in different variations in the present literature.
5.1.2. State-of-the-Art
In contrast, parallelizing a monolithic ACOPF problem itself is much more complicated. However, the same authors did this readily since their model was already decomposable due to the conic relaxation [
149]. Here, the choice of network partitions is treated as an optimization problem to minimize the number of lines between sub-networks. A graph partitioning algorithm and a modified Benders decomposition approach were used, providing analytical and numerical proof that they converge to the same value as the original Benders decomposition. This approach achieved a lower–upper bound gap of around 0–2%, demonstrating scalability. A maximum of eight partitions (eight subproblems) was divided across a four-core 2.4 GHz workstation. Beyond four partitions, hyperthreading or sequential execution must have occurred, which is a shortcoming, as only four threads can genuinely run in parallel at any time; hyper-threading only allows a core to alternate between two tasks. Their algorithm might have shown even more potential if distributed on an HPC platform.
The ACOPF formulation is further coupled and complicated when considering Optimal Transmission Switching (OTS). The addition of binary variables ensures the non-convexity of the problem, turning it from an NLP into a Mixed Integer Non-Linear Program (MINLP). In Lan et al. [150], this formulation is parallelized for battery-storage-embedded systems, where temporal decomposition was performed, recording the State of Charge (SoC) at the end of each 6 h block (four subproblems). They employed a two-stage scheme, with an NLP first stage finding the ACOPF over a 24 h time horizon and transmission switching in the second stage. The recorded SoCs of the first stage are added as constraints to the corresponding subproblems, which are then entirely separable. They tested the algorithm on the IEEE-188 test case and solved it with Bonmin under GAMS on a four-core workstation. While the coupled ACOPF-OTS formulation reached a 4.6% optimality gap at a 16 h 41 min time limit, their scheme converged to a similar gap within 24 min. The result is impressive considering the granularity of the decomposition. This is yet another example in which a better test platform could have shown more exciting results, as the authors were limited to parallelizing four subproblems. Algorithm-wise, an asynchronous approach or a better partitioning strategy is needed, as one of the subproblems took twice as long as all the others to solve.
The benefits gained by ACOPF are predicated on the inclusion of voltage and reactive power. However, it is their effect on the optimal solution that matters, and there are ways to preserve that effect while linearizing the ACOPF. The DCOPF model is turned into a Mixed Integer Linear Program (MILP) in [151] by adding on/off series reactance controllers (RCs) to the model. The effect of the reactance is implied by approximating its value and adding it to the DC power flow term as a constant, without actually modeling reactive power. The binary variables are relaxed using the Big-M approach to linearize the problem, derive the first-order KKT conditions, and solve the problem using a decentralized iterative approach. Each node solves its own subproblem, making this a fine-grained algorithm, and each subproblem has coupling variables with adjacent buses only. The approach promised scalability, and its convergence was proven in [
152]. However, it was not implemented in parallel, and the simulation-based assumptions are debatable.
Decentralization, in that manner, reduces the number of coupling variables and the communication overhead. However, this also depends on the topology of the network, as shown in [111]. There, a stochastic DCOPF formulation incorporating demand response was introduced. The model network was decomposed using ADMM and different partitioning strategies in which limited information exchange occurs between adjacent subsystems. The strategies were implemented using MATLAB and CPLEX on a six-bus system to verify solution accuracy, and later on larger systems. The ADMM-based Fully Decentralized DCOPF and the Accelerated ADMM for Distributed DCOPF were compared. The distributed version converged faster, while the decentralized version exhibited better communication efficiency. More importantly, a separate test showed that decentralized algorithms work better on subsystems that exhibit less coupling (are less interconnected) and vice versa. This supports the idea that decentralized algorithms are better suited to ring or radial network topologies, while distributed algorithms are better for meshed networks [153,154]. Recent surveys on distributed OPF algorithms also show that ADMM and APP are the preferred decomposition techniques in most OPF decomposition and parallelization studies [149,153].
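A minimal consensus-ADMM sketch of the coordination pattern that such distributed DCOPF schemes rely on is given below; each agent's cost is a toy quadratic (curvatures and targets are hypothetical), and the x-updates are independent, so they could be solved in parallel, e.g., one process per subsystem.

```python
# Consensus ADMM on a toy problem: N agents with local costs a_i*(x - b_i)^2
# agree on a shared variable z. The x-updates are independent (parallelizable);
# the z-update plays the role of the coordinator.
import numpy as np

a = np.array([1.0, 2.0, 0.5, 3.0])     # local cost curvatures (hypothetical)
b = np.array([4.0, 1.0, 6.0, 2.0])     # local minimizers (hypothetical)
rho = 1.0
x = np.zeros(4)
u = np.zeros(4)
z = 0.0

for k in range(100):
    # local updates: argmin_x a_i(x-b_i)^2 + (rho/2)(x - z + u_i)^2  (closed form)
    x = (2 * a * b + rho * (z - u)) / (2 * a + rho)
    z = np.mean(x + u)                  # coordinator / consensus update
    u = u + x - z                       # dual updates
    if np.max(np.abs(x - z)) < 1e-8:    # primal residual small -> consensus
        break

# the consensus value minimizes the sum of local costs (weighted average of b_i)
print(z, np.sum(a * b) / np.sum(a))
```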
Distribution networks tend to be radial; a ring topology is rare, except in microgrids. Aside from their topology, they have many differences compared to transmission networks, which has led to a division in their studies and OPF formulations. OPF for Transmission–Distribution co-optimization makes a great case for HPC use in power system studies, as co-optimizing the two together is considered peak complexity. S. Tu et al. [155] decomposed a very large-scale ACOPF problem in an attempt at Transmission–Distribution network co-optimization. They devised a previously used approach in which the whole network is divided by its feeders, with each distribution network forming a subproblem. The novelty of their approach lies in a smoothing technique that allows gradient-based non-linear solvers to be used, particularly the Primal-Dual Interior Point Method, the most commonly used method for solving ACOPF. Similar two-stage stochastic algorithms were implemented to account for the uncertainties in Distributed Energy Resources (DER) at a simpler level [
99,
156].
S. Tu et al. used an augmented IEEE-118 network, adding distribution systems to all buses, resulting in 9206 buses. Their most extreme test produced 11,632,758 bus solutions (1280 scenarios). Compared to a generic sequential Primal-Dual Interior Point Method (PDIPM), the speedup of their parallelized approach increased linearly with the number of scenarios and scaled strongly with the number of cores used in their cluster. In contrast, the serial solution time increased superlinearly and failed to converge within a reasonable time for a relatively trivial number of scenarios. While their approach proved to solve large-scale ACOPF much faster than a serial approach, it falls short in addressing Transmission–Distribution co-optimization because it merely treats the distribution networks as sub-networks with the same objective as the transmission network, which is unrealistic.