High-Performance and Parallel Computing Techniques

The transition towards net-zero emissions is inevitable for humanity's future. Of all sectors, electrical energy systems account for the largest share of emissions. This makes it urgent for the rapidly evolving technological landscape to deliver a transition towards an emission-free smart grid, which involves massive integration of intermittent wind- and solar-powered resources into future power grids. Additionally, new paradigms such as large-scale integration of distributed resources into the grid, proliferation of Internet of Things (IoT) technologies, and electrification of different sectors are envisioned as essential enablers of a net-zero future. However, these changes will lead to planning and operation problems of unprecedented size, complexity, and data volume for future grids. It is thus important to discuss and consider High-Performance Computing (HPC), parallel computing, and cloud computing prospects in any future electrical energy studies.

  • parallel computing
  • optimization
  • power system studies

1. Introduction

To date, the global mean temperature continues to rise and emissions continue to grow, creating great risk to humanity. Many jurisdictions have drawn up efforts and pathways to limit warming to 2 °C and reach net-zero CO2 emissions [1]. This is largely due to outdated electrical energy system operation and infrastructure, which accounts for the largest share of emissions of all sectors. Electrical energy systems are, however, undergoing a transition, which in turn is growing the scale and complexity of their planning and operation problems. The changing grid topology, decarbonization, electricity market decentralization, and grid modernization mean that innovations and new elements are continuously added to the inventory of factors considered in grid operation and planning. Moreover, with the accelerating technological landscape and policy changes, the number of potential future paths to Net-Zero increases, and finding the optimal transition plan becomes an intractable task.
The use of parallel techniques is becoming inescapable, and HPC competence will become the default for the electrical energy and power system community in the face of the presumed future and its upcoming challenges. With some algorithmic modifications, parallel computing unlocks the potential to solve huge power system problems that are conventionally intractable. This indirectly reduces cost and CO2 emissions through detailed models that yield less conservative operational solutions (reducing thermal generation commitment and dispatch) and that help plan the transition to a net-zero grid with optimal placement of the continually growing inventory of Renewable Energy (RE) resources and smart component investments. Moreover, parallel processing on multiple units is inherently more efficient and reduces energy use: multi-threading drastically increases the energy consumption of a multi-processor [2], so when resources are abundant it is more efficient to distribute work across separate hardware. Resource sharing is, in turn, more effective than resource distribution, as it reduces the demand for hardware investment and larger servers. All of these factors make it increasingly important for electrical engineering scientists to familiarize themselves with efficient resource allocation and parallel computation strategies.
North America [3], the EU [4], and many other countries [5] have set targets to completely retire coal plants before 2035 and decarbonize the power system by 2050. In addition, the development of Carbon Capture and Storage facilities is growing [6]. Renewable energy penetration targets have been set, with evidence of fast-growing proliferation across the globe, including both transmission-connected Variable Renewable Energy (VRE) [7] and behind-the-meter distributed resources [8]. The demand profile is changing with increased electrification of various industrial sectors [9] and the transportation sector [10], building electrification and energy efficiency [11][12], and the venture into a Sharing Economy [13].
The emerging IoT, facilitated by low-latency, low-cost next-generation 5G communication networks, helps roll out advanced control technologies and Advanced Metering Infrastructure [14][15]. This provides more options for contingency remedial operational actions that increase grid reliability and cost-effectiveness, such as Transmission Switching [16], Demand Response [17], additional micro-grids, and other Transmission–Distribution coordination mechanisms [18]. Additionally, these technologies allow lower investment in transmission lines and open the door to other future planning solutions, such as flow management devices and FACTS [19], Distributed Variable Renewable Energy [20], and Bulk Energy Storage [21].

2. Parallel Hardware

Parallel computation involves several tasks being performed simultaneously by multiple workers on a parallel machine. Here, a worker is a loose term and could refer to different processing hardware (e.g., a core, a Central Processing Unit (CPU), or a compute node). Predominantly, parallel machines fall under two categories built on the Von Neumann architecture [22]: Multiple Instruction Multiple Data (MIMD) and Single Instruction Multiple Data (SIMD) machines. SIMD architecture dominated supercomputing with vector processors, but that changed soon after general-purpose processors became scalable in the 1990s [23][24], followed by transputers and microprocessors designed specifically for aggregation and scalability [25].

2.1. CPUs

CPUs were initially optimized for scalar programming and executed complex logical tasks effectively until they hit a power wall, which led to multicore architectures [26]. Today, they function as miniature superscalar computers that enable pipelining, task-level parallelism, and multi-threading [27]. They employ a variety of self-optimizing techniques, such as "speculation", "hyperthreading" (or "simultaneous multi-threading"), "auto-vectorization", and "task dependency detection" [27]. They also contain an extra SIMD layer that supports data-level parallelism, vectorization, and fused multiply-add with high register capacity [28][29]. Furthermore, CPUs use a hierarchy of memory and caches, from high-speed, low-capacity (L1) to lower-speed, higher-capacity caches (L2, then L3), which allows complex operations without Random Access Memory (RAM) fetching. These caches give the CPU a distinct functional advantage over GPUs. A workstation CPU can have up to 16 processing cores, and server-level CPUs can have up to 128 cores in certain products [30]. Multi-threading is carried out with Application Programming Interfaces (APIs) such as Cilk or OpenMP, allowing parallelism of functions and loops, while using several server-level CPUs in multi-processing to solve massive decomposed problems is facilitated by APIs such as MPI.
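Since Cilk and OpenMP are C/C++-level APIs, the following minimal sketch illustrates the same loop-level parallelism in Python using the standard library's multiprocessing module; the solve_case function and the workload size are illustrative assumptions rather than part of any cited study.

    # Minimal sketch: map independent loop iterations onto separate CPU cores,
    # analogous to an OpenMP "parallel for". The workload is a placeholder.
    from multiprocessing import Pool, cpu_count

    def solve_case(case_id):
        # Stand-in for an independent, CPU-bound task (e.g., one contingency case).
        return case_id + sum(i * i for i in range(10_000))

    if __name__ == "__main__":
        cases = range(1_000)
        with Pool(processes=cpu_count()) as pool:    # one worker per logical core
            results = pool.map(solve_case, cases)    # parallel loop over the cases
        print(len(results))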

2.2. GPUs

GPUs function very similarly to the Vector Processing Units or array processors that used to dominate supercomputer design. They are additional components attached to a "Host" machine, essentially the CPU, which sends them instruction "Kernels". GPUs were originally designed to render 3D graphics and are especially good at vector calculations: the representation of 3D graphics has a "grid" nature and requires the same process to be applied to a vast number of input data points. This execution model has been extended to many applications in scientific computing and machine learning, solving massive symmetrical problems or performing symmetrical tasks. Unlike with CPUs, achieving efficient parallelism on GPUs is a more tedious task due to their fine-grained SIMD nature and rigid architecture. The GPU (Device) interfaces with the CPU (Host) through the PCI Express bus, over which it receives the kernels. In each cycle, a kernel function is sent and processed by vast numbers of GPU threads with limited communication between them. Thus, symmetry of the parallelized task is a requisite, and the number of parallel threads has to follow specific multiples to avoid sequential execution of tasks; specifically, threads need to be executed in multiples of 32 (a warp) and in multiples of two streaming processors per block for the highest efficiencies. GPUs can be programmed in C or C++, and many APIs exist to program them, such as OpenCL, HIP, C++ AMP, DirectCompute, and OpenACC. These APIs provide high-level functions, instructions, and hardware abstractions, making GPU utilization more accessible. The most relevant interface is CUDA by NVIDIA, since NVIDIA dominates the GPU market in desktop and HPC/Cloud [31]. CUDA libraries make the power of NVIDIA GPUs much more accessible to the scientific and engineering communities. The different architecture of GPUs may cause discrepancies and lower accuracy in results, as floating-point values are often rounded in a different manner and precision than on CPUs [32]. Nevertheless, these challenges can be worked around with CUDA and with sparse techniques that reduce the number of ALUs required to achieve a massive speedup. Finally, GPUs can offer a huge advantage over CPUs in terms of energy efficiency and cost if their resources are used effectively and appropriately.
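To make the kernel/thread-block model concrete, the following hedged sketch uses the Numba CUDA API from Python (assuming an NVIDIA GPU with the numba and numpy packages installed); the array size and block size are illustrative, with the block size chosen as a multiple of the 32-thread warp.

    # Minimal host/device sketch: the CPU (host) copies data to the GPU (device),
    # launches an element-wise kernel over many threads, and copies the result back.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)                  # global thread index
        if i < out.size:                  # guard threads beyond the array bounds
            out[i] = a[i] + b[i]

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    d_a, d_b = cuda.to_device(a), cuda.to_device(b)
    d_out = cuda.device_array_like(a)

    threads_per_block = 256                                   # multiple of a 32-thread warp
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](d_a, d_b, d_out)    # kernel launch from the host
    result = d_out.copy_to_host()                             # single device-to-host transfer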

2.3. Other Hardware

There are two more notable parallel devices to mention. One is the Field Programmable Gate Array (FPGA), a chip consisting of configurable logic blocks that gives the user complete flexibility in programming the internal hardware and architecture of the chip itself. FPGAs are attractive because they are parallel and their logic can be optimized for the desired parallel processes. However, they consume a considerable amount of power compared to other devices, such as the Advanced RISC Machine (ARM). ARM processors consume very little energy due to their reduced instruction set, making them suitable for portable devices and applications [33].

3. Aggregation and Paradigms

During the 1970s, the ARPANET project was underway [34], UNIX was developed [35], and advances in networking and communication hardware were achieved. The first commercial LAN clustering system/adaptor, ARCNET, was released in 1977 [36], and hardware abstraction emerged in the form of virtual memory, as in OpenVMS, which was adopted by operating systems and supercomputers [37]. Around that same time, the concept of computer clusters was forming, and many research facilities and customers of commercial supercomputers started developing in-house clusters of more than one supercomputer. Today's HPC facilities are highly scalable and comprise specialized aggregate hardware. Communication between processes across this aggregate hardware is aided by high-level software such as MPI, which is available in various implementations and packages such as mpi4py in Python, and by tools such as Apache, Slurm, and mrjob that aid in data management, job scheduling, and other routines. Specific clusters might be designed or equipped with components geared toward specific computing needs or paradigms. HPC usually includes tasks with rigid time constraints (minutes to days or perhaps weeks) that require a large amount of computation. The High-Throughput Computing (HTC) paradigm involves long-term tasks that require a large amount of computation (months to years) [38]. The Many-Task Computing (MTC) paradigm involves computing various distinct HPC tasks and revolves around applications that require a high level of communication and data management [39]. Grid or Cloud facilities provide the flexibility to adopt all the mentioned paradigms.
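As a minimal illustration of this message-passing layer (assuming an MPI implementation and the mpi4py package are installed, and the script is launched with mpirun), the sketch below scatters an illustrative workload from a root process, computes locally on each rank, and gathers the partial results.

    # Minimal MPI sketch: rank 0 splits the workload, every rank processes its
    # chunk, and rank 0 gathers the partial results.
    # Run with: mpirun -n 4 python script.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        data = list(range(100))
        chunks = [data[i::size] for i in range(size)]   # one chunk per process
    else:
        chunks = None

    local = comm.scatter(chunks, root=0)                # distribute the chunks
    partial = sum(x * x for x in local)                 # local computation
    totals = comm.gather(partial, root=0)               # collect partial results

    if rank == 0:
        print("total:", sum(totals))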

3.1. Grid Computing

The information age, taking off in the 1990s, set off the trend of wide-area distributed computing and "Grid Computing", the predecessor of the Cloud. Ian Foster coined the term with Carl Kesselman and Steve Tuecke, who developed the Globus toolkit that provides grid computing solutions [40]. Many Grid organizations exist today, such as NASA, EGEE, and the Open Science Grid. Grid computing shaped the field of "Metacomputing", which revolves around the decentralized management and coordination of Grid resources, often carried out by virtual organizations with malleable boundaries and responsibilities. The infrastructure of Grids tends to be very secure and reliable, with an exclusive network of users (usually scientists and experts), discouraging virtualization and interactive applications. Hardware is not available on demand; thus, Grids are only suitable for sensitive, close-ended, non-urgent applications. Grid computing features provenance and performance monitoring and is mainly adopted by research organizations.

3.2. Cloud Computing

Cloud computing is essentially the commercialization and effective scaling of Grid Computing driven by demand, and it is all about the scalability of computational resources for the masses. It mainly started with Amazon's demand for computational resources for its e-commerce activities, which prompted Amazon to launch the first successful Infrastructure-as-a-Service platform, Elastic Compute Cloud [41], for other businesses conducting similar activities. The distinction between Cloud and Grid follows from their business models. Cloud computing is far more flexible and versatile than Grid when it comes to accommodating different customers and applications, and it relies heavily on virtualization and resource sharing. This makes the Cloud inherently less secure, less performance-efficient than Grid, and more challenging to manage, yet far more scalable, available on demand, and overall more resource-efficient. It achieves a delicate balance between efficiency and the cost of computation. Today, AWS, Microsoft Azure, Oracle Cloud, Google Cloud, and many other commercial cloud services provide massive computational resources to various organizations such as Netflix, Airbnb, ESPN, HSBC, GE, Shell, and the NSA. It only makes sense that the electrical industry will adopt the Cloud.

3.2.1. Virtualization

The appearance of virtualization caused a considerable leap in massive parallel computing, especially after the software tool Parallel Virtual Machine (PVM) [42] was created in 1989. Since then, hundreds of virtualization platforms have been developed, and they are used today on even the smallest devices with processing power [43]. Virtualization allows resources to be shared in a pool, where multiple instances of different types of hardware can be emulated on the same metal. This means less hardware needs to be allocated or invested in for Cloud computing to serve a more extensive user base. Often, the percentage of hardware actually used is low compared to the hardware requested, and idle hardware is reallocated to other user processes that need it. The instances initiated by users float on the hardware like clouds, shifting and moving or shrinking and expanding depending on the actual needs of the process.

3.2.2. Containers

While virtualization makes hardware processes portable, containers make software portable. Developing applications, software, or programs in containers allows them to be used on any Operating System (OS) as long as it supports a container engine. That means one can develop Linux-based software (e.g., targeting Ubuntu 20.04) in a container and run that same application on a machine with Windows or macOS installed. This flexibility extends to service-based applications that utilize HPC facilities: an application can be developed in containers, and clients can use it on their own cluster or on a cloud service.

3.2.3. Fog Computing

Cloudlets, edge nodes, and edge computing are all related to an emerging IoT trend, Fog Computing. Fogs are compute nodes associated with a cloud that are geographically closer to the end-user or control devices. Fogs mediate between extensive data or cloud computing centers and users. This topology aims to achieve data locality, offering several advantages such as low latency, higher efficiency, and decentralized computation.

3.3. Volunteer Computing

Volunteer computing is an interesting distributed computing model that originated in 1996 with the Great Internet Mersenne Prime Search [44], allowing individuals connected to the internet to donate their personal computers' idle resources to a scientific research computing task. Volunteer computing remains active today, with many users and various middleware and projects, both scientific and non-scientific, primarily based on BOINC [45], as well as commercial services such as macOS Server Resources [46].

3.4. Granularity

Fine-grained parallelism appears in algorithms that frequently repeat a simple homogeneous operation over a vast dataset. It is often associated with embarrassingly parallel problems, which can be divided into many highly, if not wholly, symmetrical simple tasks, providing high throughput. Fine-grained algorithms are also often associated with multi-threading and shared memory resources. Coarse-grained algorithms imply moderate or low task parallelism that sometimes involves heterogeneous operations. Today, coarse-grained algorithms are almost synonymous with multi-processing, where the algorithm uses distributed memory resources to divide tasks among different processors or logical CPU cores.

3.5. Centralized vs. Decentralized

Centralized algorithms refer to problems with a single task or objective function, solved by a single processor, with data stored at a single location. When a centralized problem is decomposed into N subproblems, sent to N processors to be solved, and retrieved by a central controller to update variables, re-iterate, and verify convergence, the algorithm becomes a "distributed" algorithm. The terms distributed and decentralized are often used interchangeably and are often confused in the literature, but there is an important distinction between them. A decentralized algorithm is one in which the decomposed subproblems do not have a central coordinator or a master problem. Instead, the processes responsible for the subproblems communicate with neighboring processes to reach a solution consensus (several local subproblems with coupling variables, where subproblems communicate without a central coordinator). The value of each type is determined not only by computational performance but also by the decision-making policy. In large-scale complex problems, distributed algorithms sometimes outperform centralized algorithms, and the speedup keeps growing with the problem size if the problem has "strong scalability". Distributed algorithms' subproblems share many global variables, which means a higher communication frequency, as all of these variables need to be communicated back and forth to the central coordinator. Moreover, in some real-life problems, central coordination of distributed computation might not be possible. Fully decentralized algorithms solve this problem, as their processes communicate laterally and only neighboring processes have shared variables.

3.6. Synchronous vs. Asynchronous

Synchronous algorithms are ones in which the algorithm does not move forward until all the parallel tasks at a certain step or iteration have been executed. They are more accurate and efficient for tasks with symmetrical data and complexity; however, their efficiency suffers when the tasks are not symmetrical, which is usually the case in power system optimization studies. Asynchronous algorithms allow idling workers to take on new tasks even if not all the adjacent processes are complete. This comes at the cost of accuracy when there are dependencies between parallel tasks. To achieve better accuracy in asynchronous algorithms, "formation" needs to be ensured, meaning that while subproblems may deviate in the direction of convergence, they should keep a global tendency toward the solution.
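The two patterns can be sketched with Python's concurrent.futures module (the subproblem function and its uneven runtimes are illustrative assumptions): the synchronous variant blocks until every subproblem in the batch returns, while the asynchronous variant consumes each result as soon as a worker finishes, freeing it for new work.

    # Synchronous vs. asynchronous consumption of parallel subproblem results.
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def solve_subproblem(k):
        # Stand-in for a subproblem whose solve time grows with k (asymmetric tasks).
        return sum(i for i in range(10_000 * (k + 1)))

    if __name__ == "__main__":
        tasks = range(8)
        with ProcessPoolExecutor() as pool:
            # Synchronous pattern: block until every subproblem has returned.
            sync_results = list(pool.map(solve_subproblem, tasks))

            # Asynchronous pattern: handle each result the moment it completes,
            # so idle workers can immediately be assigned new tasks.
            futures = [pool.submit(solve_subproblem, k) for k in tasks]
            for fut in as_completed(futures):
                _ = fut.result()          # update shared state / dispatch further work here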

3.7. Problem Splitting and Task Scheduling

A large emphasis must be placed on task scheduling when designing parallel algorithms. In multi-threading, synchronization of tasks is required to avoid "race conditions", which cause numerical errors when multiple threads access the same memory location simultaneously. Synchronization does not necessarily imply that processes execute every instruction simultaneously, but rather in a coordinated manner. Coordination mechanisms involve pipelining or task overlapping, which can increase efficiency and reduce the latency of parallel execution. For example, sub-tasks that take the longest time in synchronous algorithms can utilize the idle workers of completed sub-tasks if no dependencies prevent such allocation. Dependency analysis is occasionally carried out when splitting tasks. In an elaborate parallel framework, such as in multi-domain simulations or smart grid applications, task scheduling becomes its own complex optimization problem, which is often solved heuristically. However, packages such as DASK [47] can help with optimal task planning and parallel task scheduling.
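For example, a minimal dask.delayed sketch (the solve and combine functions are illustrative placeholders) builds a small task graph whose independent branches the scheduler can run in parallel while respecting the final aggregation dependency; a process-based or distributed scheduler could be swapped in for heavier workloads.

    # Minimal DASK sketch: declare tasks lazily, let the scheduler analyze the
    # dependency graph and execute the independent branches in parallel.
    from dask import delayed, compute

    def solve_partition(p):
        return p * p                       # stand-in for solving one sub-network

    def combine(results):
        return sum(results)                # stand-in for the coordination step

    partitions = range(16)
    partial = [delayed(solve_partition)(p) for p in partitions]   # independent tasks
    total = delayed(combine)(partial)                             # dependent task
    print(compute(total, scheduler="threads")[0])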

3.8. Parallel Performance Metrics

Solution time and accuracy are the main measures of the success of a parallel algorithm. According to Amdahl's law of strong scaling, there is an upper limit to the speedup achievable for a fixed-size problem: dividing a fixed-size problem into more subproblems does not result in a linear speedup. However, if the parallel portion of the algorithm grows, then proportionally increasing the number of subproblems or processors can continue to increase the speedup, according to Gustafson's law of weak scaling. The good news is that Gustafson's law applies to large decomposed power system problems.
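For reference, with p the parallelizable fraction of the work and N the number of processors, the two laws can be written as

    S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}, \qquad S_{\text{Gustafson}}(N) = (1 - p) + pN,

so the speedup of a fixed-size problem is bounded by 1/(1 - p) regardless of N, whereas the scaled speedup keeps growing when the problem size grows with the number of processors.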

4. Power Flow

Power Flow (PF) studies are central to all power system studies involving network constraints. The principal goal of PF is to solve for the network's bus voltages and angles in order to calculate the network flows. For some applications, PF is solved using the DC power flow equations, which are approximations based on realistic assumptions. Solving these equations is easy and relatively fast and results in an excellent approximation of the network PF [48]. On the other hand, the non-linear full AC power flow equations need to be solved to obtain an accurate solution, and these require numerical approximation methods. The most popular ones in power system analysis are the Newton-Raphson (NR) method and the Interior Point Method (IPM) [49]. However, these methods are computationally expensive and too slow for real-time applications, making them a target for parallel execution.
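For reference, in standard notation (with V_i, theta_i the bus voltage magnitudes and angles, G + jB the bus admittance matrix, and theta_ij = theta_i - theta_j), the AC power flow equations solved by these methods are

    P_i = V_i \sum_{j} V_j \left( G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij} \right), \qquad
    Q_i = V_i \sum_{j} V_j \left( G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij} \right),

and the NR iteration repeatedly solves the sparse linear system

    J\!\left(x^{(k)}\right)\,\Delta x^{(k)} = -f\!\left(x^{(k)}\right), \qquad x^{(k+1)} = x^{(k)} + \Delta x^{(k)},

where f collects the active and reactive power mismatches and J is their Jacobian; this repeated sparse solve is the main target of the parallel implementations reviewed below.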

4.1. MIMD Based Studies

One way to parallelize PF (or OPF) is through network partitioning. While network partitioning usually occurs at the problem level in OPF, in PF the partitioning often happens at the solution/matrix level. Such partitioning methods for PF use sparsity techniques involving LU decomposition, forward/backward substitution, and diakoptics that trace back to the late 1960s, predominantly to H. H. Happ [50][51], for load flow on typical systems [52][53] and dense systems [54]. Parallel implementation of PF using this method started in the 1980s on array processors such as the VAX11 [55] and continued in the 1990s on the iPSC hypercube [56]. Techniques such as the Fast Decoupled Power Flow (FDPF) were also parallelized on the iPSC using Successive Over-Relaxation (SOR) on Gauss-Seidel (GS) [57], and on vector computers such as the Cray X/MP using Newton's FDPF [58]. PF can also be treated as an unconstrained non-linear minimization problem, which is precisely what E. Housos and O. Wing [59] did to solve it using a parallelizable modified conjugate directions method. When general-purpose processors started dominating parallel computers, their architecture became homogenized, and the enhancements achieved by parallel algorithms became comparable and easier to experiment with. This enabled a new target: optimizing the parallel techniques themselves. Chen and Chen used transputer-based clusters to test the best workload/node distribution on clusters [60] and a novel BBDF approach for power flow analysis [61]. The advent of the Message Passing Interface (MPI) allowed the exploration of scalability with the Generalized Minimal Residual Method (GMRES) in [62] and the multi-port inversed matrix method [63], as opposed to the direct LU method. Beyond this point, parallel PF shifted heavily towards SIMD hardware (GPUs in particular), except for a few studies involving elaborate schemes, such as transmission/distribution smart grid PF calculation [64] or optimal network partitioning for fast PF calculation [65].
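The kernel that these partitioning and sparsity techniques accelerate is the sparse LU factorization followed by forward/backward substitution inside each NR iteration. A minimal serial sketch in Python/SciPy is shown below; the randomly generated sparse matrix and mismatch vector are illustrative stand-ins for an actual power flow Jacobian.

    # Minimal sketch of the sparse linear solve inside one NR iteration:
    # factorize the Jacobian once, then forward/backward substitute for the update.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 2000
    J = sp.random(n, n, density=0.002, format="csc") + sp.identity(n, format="csc")
    mismatch = np.random.rand(n)              # stand-in for the power mismatch vector

    lu = splu(J)                              # sparse LU factorization
    delta_x = lu.solve(-mismatch)             # forward/backward substitution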

4.2. SIMD Based Studies

4.2.1. Development

GPUs dominate recent parallel power system studies. The first power flow implementation might have been achieved by using a preconditioned Biconjugate Gradient algorithm and sparsity techniques to implement the NR method on an NVIDIA Tesla C870 GPU [66]. Some elementary approaches parallelized, on an NVIDIA GPU, the computation of connection matrices for networks where more than one generator could exist on a bus [67]. CPUs have also been used in SIMD-based power flow studies, since modern CPUs exhibit multiple cores; hence, multi-threading with OpenMP can be used to vectorize NR with LU factorization [68]. Some resorted to GPUs to solve massive batches of PF for Probabilistic Power Flow (PPF) or contingency analysis, one thread per scenario, as in [69]. Others modified the power flow equations to improve their suitability and performance on GPUs [70][71]. While many papers limit their applications to NVIDIA GPUs by using CUDA, OpenCL, a general parallel hardware API, has also been used occasionally [72]. Some experimented with and compared the performance of different CUDA routines on different NVIDIA GPU models [73]. Similar experimentation on routines was conducted to solve ACOPF using the FDPF [74]. In [75], NR, Gauss-Seidel, and decoupled PF were tested and compared against each other on GPUs. Improvements on Newton's method and the parallelization of its different steps were performed in [76]. Asynchronous PF algorithms have been applied on GPUs, which sounds difficult, as the efficiency of GPUs depends on synchronicity and homogeneity [77]. Even with the existence of CUDA, many still venture into creating their own routines with OpenCL [78][79] or direct C coding [80] of GPU hardware to fit their PF needs. Very recently, a few authors produced thorough overviews of parallel power flow algorithms on GPUs, covering general trends [81][82] and AC power flow GPU algorithms specifically [83]. In the following State-of-the-Art subsection, the most impactful work is covered.

4.2.2. State-of-the-Art

The DC Power Flow (DCPF) problem was solved using the Chebyshev preconditioner and the conjugate gradient iterative method in a GPU implementation (448-core Tesla M2070) in [84][85]. The vector processes involved are easily parallelizable in an efficient way with CUDA libraries such as cuBLAS and cuSPARSE, which are basic linear algebra subroutine and sparse matrix operation libraries, respectively. Later, the same author went on to parallelize the FDPF using the same hardware and pre-conditioning steps [86]. Two real systems were used: the Polish system, which has groups of locally connected subsystems, and the Pan-European system, which consists of several large coupled systems. This topology difference results in different sparsity patterns of the sparse linear system matrix, which offers a unique perspective. Their proposed GPU-based FDPF was implemented in Matlab on top of MatPower v4.1. In their algorithm, the CPU regularly corresponds with the GPU, sending information back and forth within each iteration. Their tests showed that the FDPF performed better on the Pan-European system because its connections are more ordered than those of the Polish system. CPU–GPU communication occurred frequently in their algorithm steps, most likely bottlenecking the speedup (less than 3× compared to CPU only). Instead of adding pre-conditioning steps, M. Wang et al. [87] focused on improving the continuous Newton's method such that a stable solution is found even for an ill-conditioned power flow problem. For example, if any load or generator power exceeds 3.2 p.u. in the IEEE-118 test case, the NR method fails to converge, whereas their algorithm still converges to the solution. This was achieved using numerical integration methods of different orders. The CPU loads data into the GPU and extracts the results only upon convergence, making the algorithm very efficient. The approach substantially improved over the previous work by removing the pre-conditioning step and reducing CPU–GPU communication (a speedup of 11× compared to a CPU-only implementation). Sometimes, dividing the bulk of the computational load between the CPU and GPU (a hybrid approach) can be more effective, depending on the distribution of processes. In one hybrid CPU–GPU approach, heavy emphasis was placed on the sparsity analysis of PF-generated matrices [88]. When using a sparse technique, the matrices operated on are reduced to ignore the zero terms; for example, the matrix is turned into a vector of indices referring to the non-zero values so that operations are confined to these values. Seven parallelization schemes were compared, varying the techniques used (dense vs. sparse treatment), the majoring type (row vs. column), and the threading strategy. Row/column-major signifies whether data of the same row or column are stored consecutively. The thread invocation strategies varied in splitting or combining the calculation of P and Q for the mismatch vectors. Two sparsity techniques were experimented with, showing a reduction in operations down to 0.1% of the original number and a performance enhancement of two or even three orders of magnitude for power mismatch vector operations. In 100 trials, their best scheme converged within six iterations on a four-core host and a GeForce GTX 950M GPU, with a small deviation in solution time between trials. CPU–GPU communication took about 7.79–10.6% of the time, a fairly low share.
Despite all of these reductions, however, the proposed approach did not consistently outperform a CPU-based solution. The authors suggested that this was due to using higher-grade CPU hardware than the GPU. Zhou et al. might have conducted the most extensive research on GPU-accelerated batch solvers in a series of works between 2014 and 2020. They fine-tuned the process of solving PPF for the GPU architecture in [89][90]. The strategies used include Jacobian matrix packaging, contiguous memory addresses, and thread block assignment to avoid divergence of the solution. Subsequently, they used the LU-factorization solver from previous work to create a batch-DPF algorithm [91]. They tested their batch-DPF algorithm on three cases: 1354-bus, 3375-bus, and 9241-bus systems. For 10,000 scenarios, they solved the largest case in less than 13 s, showing the potential for online application. Most of the previous studies solve the PF problem in a bare and limited setup compared to the work by J. Kardos et al. [92], which involves similar techniques in a massive HPC framework. Namely, preventative Security Constrained Optimal Power Flow (SCOPF) is solved by building on an existing suite called BELTISTOS [93]. BELTISTOS specifically includes SCOPF solvers and has an established Schur complement algorithm that factorizes the Karush–Kuhn–Tucker (KKT) conditions, allowing a great degree of parallelism when using the IPM to solve general-purpose Non-Linear Programming (NLP) problems. Thus, the main contribution lies in removing some bottlenecks and ill-conditioning that exist in the Schur complement steps by introducing a modified framework (BELTISTOS-SC). The parallel Schur algorithm is bottlenecked by a dense matrix associated with the global part of the solution, which is solved in a single process. Since GPUs are well suited to dense systems, the authors factorize this system and apply forward–backward substitution, solving it with cuSOLVER, a GPU-accelerated library for dense linear algebraic systems. They performed their experiments on a multicore Cray XC40 computer at the Swiss National Supercomputing Centre, using 18 cores at 2.1 GHz, an NVIDIA Tesla P100 with 16 GB of memory, and many other BELTISTOS- and hardware-associated libraries. They tested their modification on several system sizes, from PEGASE1354 to PEGASE13659. Their approach sped up the solution of the dense Schur complement system by 30× for the largest system over the CPU solution of that step, achieving notable speedups across all system sizes tested. They later performed a large-scale performance study in which they increased the number of computing cores from 16 to 1024 on the cluster. The BELTISTOS-SC augmented approach achieved up to a 500× speedup for the PEGASE1354 system and 4200× for PEGASE9241 when 1024 cores were used, demonstrating strong scalability up to 512 cores.
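The structure being exploited can be summarized generically (a sketch of the standard Schur complement approach, not the exact BELTISTOS formulation): with independent per-scenario KKT blocks A_i, coupling blocks B_i, and a global block C, the bordered system

    \begin{bmatrix} A_1 & & & B_1 \\ & \ddots & & \vdots \\ & & A_N & B_N \\ B_1^{\top} & \cdots & B_N^{\top} & C \end{bmatrix}
    \begin{bmatrix} x_1 \\ \vdots \\ x_N \\ x_0 \end{bmatrix} =
    \begin{bmatrix} r_1 \\ \vdots \\ r_N \\ r_0 \end{bmatrix}

is reduced to the Schur complement system

    S\,x_0 = r_0 - \sum_{i=1}^{N} B_i^{\top} A_i^{-1} r_i, \qquad S = C - \sum_{i=1}^{N} B_i^{\top} A_i^{-1} B_i,

where each term B_i^{\top} A_i^{-1} B_i can be formed independently in parallel, while the dense system in S is the serial bottleneck that a GPU-accelerated dense solver addresses.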

5. Optimal Power Flow

Like PF, OPF studies are the basis of many operational assessments such as System Stability Analysis (SSA), Unit Commitment (UC), Economic Dispatch (ED), and other market decisions [48]. Variations of these assessments include Security Constrained Economic Dispatch (SCED) and SCOPF, both involving contingencies. OPF ensures the satisfaction of network constraints under cost or power-loss minimization objectives. The full ACOPF version has non-linear, non-convex constraints, making it computationally complex and making it difficult to reach a global optimum. DC Optimal Power Flow (DCOPF) and other methods, such as decoupled OPF, linearize and simplify the problem and, when solved, produce a fast but sub-optimal solution. Because DCOPF makes assumptions about voltages and reactive power, it becomes less reliable with increased RE penetration, since RE deviates the network's voltages and reactive powers significantly. This is one of the main drivers behind speeding up ACOPF for real-time applications and for all algorithms involving it. The first formulation of OPF was achieved by J. Carpentier in 1962 [94], followed by an enormous volume of OPF formulations and studies, as surveyed in [95].
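As a reference for the linearized variant discussed above, a generic DCOPF in standard notation (with c_g the generation cost functions, d the nodal demands, B the susceptance matrix, x_l the line reactances, and F_l the flow limits) can be written as

    \min_{p,\theta} \; \sum_{g} c_g(p_g)
    \quad \text{s.t.} \quad p - d = B\theta, \qquad
    f_{\ell} = \frac{\theta_i - \theta_j}{x_{\ell}}, \quad |f_{\ell}| \le F_{\ell}, \qquad
    p_g^{\min} \le p_g \le p_g^{\max},

which is a linear (or convex quadratic) program, in contrast to the non-convex ACOPF.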

5.1. MIMD Based Studies

5.1.1. Development

OPF and SCOPF decomposition approaches started appearing in the early 1980s using P-Q decomposition [96][97] and including corrective rescheduling [98]. The first introduction of parallel OPF algorithms might have been by Garng M. Huang and Shih-Chieh Hsieh in 1992 [99], who proposed a "textured" OPF algorithm that involved network partitioning. In a different work, they proved that their algorithm converges to a stationary point and that, under certain conditions, optimality is guaranteed. Later, they implemented the algorithm on the nCUBE2 machine [100], showing that both their sequential and parallel textured algorithms are superior to non-textured algorithms. It was atypical for studies at the time to highlight portability, which makes Huang's work in [57] special: it contributed another OPF algorithm using Successive Over-Relaxation made "adaptive" to reduce the number of iterations, and the code was applied on the nCUBE2 and ported to the Intel iPSC/860 hypercube, demonstrating its portability. In 1990, M. Teixeira et al. [101] demonstrated what might be the first parallel SCOPF on a 16-CPU system developed by the Brazilian Telecom R&D center. The implementation was somewhat makeshift and coarse, to the level where each CPU was installed with a whole MS/DOS OS for the multi-area reliability simulation. Nevertheless, it outperformed a VAX 11/780 implementation by a factor of about 2.5 and exhibited strong scalability. Distributed OPF algorithms started appearing in the late 1990s with a coarse-grained multi-region coordination algorithm using the Auxiliary Problem Principle (APP) [102][103]. This approach was broadened much later in [104] using Semi-Definite Programming and the Alternating Direction Method of Multipliers (ADMM). Prior to that, ADMM was also compared against the method of partial duality in [105].
The asynchronous parallelization of OPF first appeared for preventative [107] and corrective SCOPF [108], targeting online applications [109], motivated by the heterogeneity of the solution times of different scenarios. Both SIMD and MIMD machines were used, with an emphasis on portability as "Getsub and Fifo" routines were carried out. By the same token, MPI protocols were used to distribute and solve SCOPF, decomposing the problem with GMRES and solving it with the non-linear IPM while varying the number of processors [110]. Real-time application potential was later demonstrated by using Benders decomposition instead for distributed SCOPF [111]. Benders decomposition is one of the most commonly used techniques to create parallel structures in power system optimization problems, and it shows up in different variations throughout the present literature.
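In its generic form (a sketch of the textbook decomposition, not any specific SCOPF formulation in the cited works), Benders decomposition iterates between a master problem over the complicating variables x and independent subproblems, one per scenario or contingency s, that return cuts built from their dual solutions:

    \text{Master:} \quad \min_{x,\,\alpha_s} \; c^{\top}x + \sum_{s}\alpha_s
    \quad \text{s.t.} \quad \alpha_s \ge f_s\!\left(\hat{x}^{(k)}\right) + g_s^{(k)\top}\!\left(x - \hat{x}^{(k)}\right) \;\; \forall s, k,

    \text{Subproblem } s: \quad f_s(\hat{x}) = \min_{y_s} \; d^{\top}y_s
    \quad \text{s.t.} \quad W y_s \ge h_s - T_s \hat{x}, \qquad g_s = -T_s^{\top}\lambda_s,

where lambda_s is the optimal dual vector of subproblem s. For a fixed master solution, the subproblems are independent and can be solved in parallel, which is what makes the technique attractive for distributed SCOPF.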

5.1.2. State-of-the-Art

In contrast, parallelizing a monolithic ACOPF problem itself is much more complicated. However, the same authors did this readily, since their model was already decomposable due to the conic relaxation [112]. Here, the choice of network partitions is treated as an optimization problem so as to realize the fewest lines between sub-networks. A graph partitioning algorithm and a modified Benders decomposition approach were used, with analytical and numerical proof that they converge to the same value as the original Benders. This approach achieved a lower–upper bound gap of around 0–2%, demonstrating scalability. A maximum of eight partitions (eight subproblems) were divided over a four-core 2.4 GHz workstation. Beyond four partitions, hyperthreading or sequential execution must have occurred; this is a shortcoming, as only four threads can genuinely run in parallel at any time, and hyper-threading only allows a core to alternate between two tasks. Their algorithm might have even more potential if distributed on an HPC platform. The ACOPF formulation is further coupled and complicated when considering Optimal Transmission Switching (OTS). The addition of binary variables ensures the non-convexity of the problem, turning it from an NLP into a Mixed Integer Non-Linear Programming (MINLP) problem. In Lan et al. [113], this formulation is parallelized for battery-storage-embedded systems, where temporal decomposition was performed, recording the State of Charge (SoC) at the end of each 6 h block (four subproblems). They employed a two-stage scheme with an NLP first stage to find the ACOPF over a 24 h time horizon, and transmission switching in the second stage. The recorded SoCs of the first stage are added as constraints to the corresponding subproblems, which are then entirely separable. They tested the algorithm on the IEEE-118 test case and solved it with Bonmin under GAMS on a four-core workstation. While the coupled ACOPF-OTS formulation reached a 4.6% optimality gap at a 16 h 41 min time limit, their scheme converged to a similar gap within 24 min. The result is impressive considering the granularity of the decomposition, and this is yet another example in which a better test platform could have shown more exciting results, as the authors were limited to parallelizing four subproblems. Algorithm-wise, an asynchronous approach or a better partitioning strategy is needed, as one of the subproblems took double the time of all the others to solve. The inclusion of voltage and reactive power underpins the benefits gained from ACOPF; however, it is their effect on the optimal solution that matters, and there are ways to preserve that while linearizing the ACOPF. The DCOPF model is turned into a Mixed Integer Linear Programming (MILP) problem in [114] by adding on/off series reactance controllers (RCs) to the model. The effect of the reactance is implied by approximating its value and adding it to the DC power flow term as a constant, without actually modeling reactive power. The binary variables are relaxed using the Big M approach to linearize the problem, derive the first-order KKT conditions, and solve them using a decentralized iterative approach. Each node solves its own subproblem, making this a fine-grained algorithm, and each subproblem has coupling variables with adjacent buses only. The approach promises scalability, and its convergence was proven in [115]. However, it was not implemented in parallel, and the simulation-based assumptions are debatable.
Decentralization, in that manner, reduces the number of coupling variables and the communication overhead. However, this also depends on the topology of the network, as shown in [76], where a stochastic DCOPF formulation incorporating demand response was introduced. The model network was decomposed using ADMM and different partitioning strategies in which limited information exchange occurs between adjacent subsystems. The strategies were implemented using MATLAB and CPLEX, first on a six-bus system to verify solution accuracy and later on larger systems. The ADMM-based fully decentralized DCOPF and the accelerated ADMM for distributed DCOPF were compared: the distributed version converged faster, while the decentralized version exhibited better communication efficiency. Recent surveys on distributed OPF algorithms show that ADMM and APP are the preferred decomposition techniques in most studies [112][116]. More importantly, a separate test showed that decentralized algorithms work better on subsystems that exhibit less coupling (are less interconnected) and vice versa. This breeds the idea that decentralized algorithms are better suited to ring or radial network topologies, while distributed algorithms are better for meshed networks [116][117]. Distribution networks tend to be radial, and a ring topology is rare except in microgrids. Aside from their topology, distribution networks have many other differences from transmission networks, which is why their studies and OPF formulations are treated separately. OPF for Transmission–Distribution co-optimization makes a great case for HPC use in power system studies, as co-optimizing the two together is considered a problem of peak complexity. S. Tu et al. [118] decomposed a very large-scale ACOPF problem in a Transmission–Distribution network co-optimization attempt. They adopted a previously used approach in which the whole network is divided by its feeders, with each distribution network forming a subproblem. The novelty of their approach lies in a smoothing technique that allows gradient-based non-linear solvers to be used, particularly the Primal-Dual Interior Point Method (PDIPM), the most commonly used method for solving ACOPF. Similar two-stage stochastic algorithms have been implemented to account for the uncertainties in Distributed Energy Resources (DER) at a simpler level [64][119]. S. Tu et al. used an augmented IEEE-118 network, adding distribution systems to all buses, resulting in 9206 buses. Their most extreme test produced 11,632,758 bus solutions (1280 scenarios). Compared to a generic sequential PDIPM, the speedup of their parallelized approach increased linearly with the number of scenarios and scaled strongly as the number of cores used in their cluster increased. In contrast, the serial solution time increased superlinearly and failed to converge within a reasonable time for a relatively trivial number of scenarios. While their approach proved to solve large-scale ACOPF much faster than a serial approach, it falls short of addressing Transmission–Distribution co-optimization because it merely treats the distribution networks as sub-networks with the same objective as the transmission network, which is unrealistic.
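For reference, the consensus ADMM iterations underlying many of these distributed and decentralized OPF schemes take the generic form (a sketch in standard notation, not the exact formulations of the cited works), with f_i the local objective of subsystem i, z the shared boundary variables, u_i the scaled duals, and rho > 0 a penalty parameter:

    x_i^{k+1} = \arg\min_{x_i} \; f_i(x_i) + \tfrac{\rho}{2}\left\lVert x_i - z^{k} + u_i^{k} \right\rVert_2^2,
    \qquad
    z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\left( x_i^{k+1} + u_i^{k} \right),
    \qquad
    u_i^{k+1} = u_i^{k} + x_i^{k+1} - z^{k+1},

where the x_i-updates are the independent subproblems solved in parallel; a central coordinator performs the z-update in the distributed case, whereas in the decentralized case each subsystem averages only the components it shares with its neighbors.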

References

  1. Climate Change 2022: Impacts, Adaptation and Vulnerability. Available online: https://www.ipcc.ch/report/ar6/wg2/ (accessed on 13 November 2022).
  2. McCool, M.D. Parallel Programming, Chapter 2; Morgan Kaufmann: Burlington, MA, USA, 2012; pp. 39–76.
  3. The White House. Building on Past U.S. Leadership, Including Efforts by States, Cities, Tribes, and Territories, the New Target Aims at 50–52 Percent Reduction in U.S. 2021. Available online: bit.ly/3UIWaeK (accessed on 15 November 2021).
  4. Coal Exit. 2022. Available online: https://beyond-coal.eu/coal-exit-timeline/ (accessed on 25 May 2022).
  5. Carbon Brief. Online. 2020. Available online: https://bit.ly/3UHwzD4 (accessed on 15 November 2021).
  6. IEA. Online. 2022. Available online: https://bit.ly/3QB9fFh (accessed on 25 May 2022).
  7. Ritchie, H.; Roser, M.; Rosado, P. Energy. In Our World in Data; 2020; Available online: https://ourworldindata.org/renewable-energy (accessed on 13 November 2022).
  8. IEA. 2022. Available online: https://bit.ly/3njRMEd (accessed on 25 May 2022).
  9. Deloitte Electrification in Industrials. Available online: https://bit.ly/3UIpe62 (accessed on 25 May 2022).
  10. IEA. 2021. Available online: https://www.iea.org (accessed on 25 May 2022).
  11. González-Torres, M.; Pérez-Lombard, L.; Coronel, J.F.; Maestre, I.R.; Yan, D. A review on buildings energy information: Trends, end-uses, fuels and drivers. Energy Rep. 2022, 8, 626–637.
  12. B2E Resources. 2021. Available online: https://bit.ly/3HFyceH (accessed on 25 May 2022).
  13. Zhu, X.; Liu, K. A systematic review and future directions of the sharing economy: Business models, operational insights and environment-based utilities. J. Clean. Prod. 2021, 290, 125209.
  14. EIA. Annual Electric Power Industry Report; Technical Report; 2021. Available online: https://www.eia.gov/electricity/annual/ (accessed on 13 November 2022).
  15. Borenius, S.; Hämmäinen, H.; Lehtonen, M.; Ahokangas, P. Smart grid evolution and mobile communications—Scenarios on the finnish power grid. Electr. Pow. Syst. Res. 2021, 199, 107367.
  16. Hua, H.; Liu, T.; He, C.; Nan, L.; Zeng, H.; Hu, X.; Che, B. Day-ahead scheduling of power system with short-circuit current constraints considering transmission switching and wind generation. IEEE Access 2021, 9, 110735–110745.
  17. Daly, P.; Qazi, H.W.; Flynn, D. Rocof-constrained scheduling incorporating non-synchronous residential demand response. IEEE Trans. Power Syst. 2019, 34, 3372–3383.
  18. Nawaz, A.; Wang, H. Distributed stochastic security constrained unit commitment for coordinated operation of transmission and distribution system. CSEE J. Power Energy Syst. 2021, 7, 708–718.
  19. Luburić, Z.; Pandžić, H.; Carrión, M. Transmission expansion planning model considering battery energy storage, tcsc and lines using ac opf. IEEE Access 2020, 8, 203429–203439.
  20. Zhuo, Z.; Zhang, N.; Yang, J.; Kang, C.; Smith, C.; O’Malley, M.J.; Kroposki, B. Transmission expansion planning test system for ac/dc hybrid grid with high variable renewable energy penetration. IEEE Trans. Power Syst. 2020, 35, 2597–2608.
  21. Gonzalez-Romero, I.C.; Wogrin, S.; Gomez, T. Proactive transmission expansion planning with storage considerations. Energy Strategy Rev. 2019, 24, 154–165.
  22. Tan, L.; Jiang, J. Chapter 14—Hardware and software for digital signal processors. In Digital Signal Processing, 3rd ed.; Tan, L., Jiang, J., Eds.; Academic Press: Cambridge, MA, USA, 2019; pp. 727–784.
  23. Aspray, W. The intel 4004 microprocessor: What constituted invention? IEEE Ann. Hist. Comput. 1997, 19, 4–15.
  24. Stringer, L. Vectors: How the Old Became New again in Supercomputing. 2016. Available online: https://www.hpcwire.com/2016/09/26/vectors-old-became-new-supercomputing/ (accessed on 13 November 2022).
  25. Hey, A.J.G. Supercomputing with transputers—Past, present and future. In Proceedings of the 4th International Conference on Supercomputing (ICS ’90), Amsterdam, The Netherlands, 11–15 June 1990; Association for Computing Machinery: New York, NY, USA, 1990; pp. 479–489.
  26. Bose, P. Encyclopedia of parallel computing. In Encyclopedia of Parallel Computing; Padua, D., Ed.; Springer: Boston, MA, USA, 2011; Chapter PowerWall; pp. 1593–1608.
  27. Intel. 8th and 9th Generation Intel Core Processor Families and Intel Xeon E Processor Families; Technical Report; Intel Corporation: Mountain View, CA, USA, 2020.
  28. Intel. Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference; Intel Corporation: Mountain View, CA, USA, 2022; Available online: https://bit.ly/3X69OKu (accessed on 13 November 2022).
  29. Intel. Intel AVX-512—Instruction Set for Packet Processing; Technical Report; Intel Corporation: Mountain View, CA, USA, 2021; Available online: https://intel.ly/3GiWooH (accessed on 13 November 2022).
  30. Computing, A. Ampere Altra Max 64-bit Multi-Core Processor Features; Technical Report; Ampere Computing: Santa Clara, CA, USA, 2021; Available online: https://bit.ly/3hNqTZG (accessed on 13 November 2022).
  31. Teich, P. Nvidia Dominates the Market for Cloud AI Accelerators More Than You Think. 2021. Available online: https://bit.ly/3QIvdqa (accessed on 13 November 2022).
  32. Navarro, C.A.; Hitschfeld-Kahler, N.; Mateu, L. A survey on parallel computing and its applications in data-parallel problems using gpu architectures. Commun. Comput. Phys. 2014, 15, 285–329.
  33. ARM. ARM Technology Is Defining the Future of Computing: Record Royalties Highlight Increasing Diversity of Products and Market Segment Growth. 2022. Available online: https://www.arm.com/company/news/2022/11/arm-achieves-record-royalties-q2-fy-2022 (accessed on 13 November 2022).
  34. Leiner, B.M.; Cerf, V.G.; Clark, D.D.; Kahn, R.E.; Kleinrock, L.; Lynch, D.C.; Postel, J.; Roberts, L.G.; Wolff, S. Brief History of the Internet; Technical Report; Internet Society: Reston, VA, USA, 1997.
  35. Spinellis, D. A repository of unix history and evolution. Empir. Softw. Engg. 2017, 22, 1372–1404.
  36. Stott, M. ARCNETworks; Technical Report; Arcnet Trade Association: Downers Grove, IL, USA, 1998.
  37. Majidha Fathima, K.M.; Santhiyakumari, N. A survey on evolution of cloud technology and virtualization. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 428–433.
  38. Beck, A. High Throughput Computing: An Interview with Miron Livny. 2021. Available online: https://bit.ly/3y2Tuje (accessed on 13 November 2022).
  39. Raicu, I.; Foster, I.T.; Yong, Z. Many-task computing for grids and supercomputers. In Proceedings of the 2008 Workshop on Many-Task Computing on Grids and Supercomputers, Austin, TX, USA, 17 November 2008; pp. 1–11.
  40. Globus. Globus Toolkit. 2021. Available online: https://toolkit.globus.org/ (accessed on 13 November 2022).
  41. Amazon. Overview of Amazon Web Services: Aws Whitepaper; Technical Report; Amazon Web Services: Seattle, WA, USA, 2022; Available online: https://docs.aws.amazon.com/whitepapers/latest/aws-overview/introduction.html (accessed on 13 November 2022).
  42. Geist, G.A.; Sunderam, V.S. Network-based concurrent computing on the pvm system. Concurr. Pract. Exper. 1992, 4, 293–311.
  43. UTM. What is UTM? 2022. Available online: https://docs.getutm.app/ (accessed on 13 November 2022).
  44. Mersenne. Great Internet Mersenne Prime Search—Primenet. 2022. Available online: https://www.mersenne.org/ (accessed on 15 March 2012).
  45. BOINC. News from Boinc Projects. 2022. Available online: https://boinc.berkeley.edu/ (accessed on 15 March 2012).
  46. Apple. Macos Server. 2022. Available online: https://www.apple.com/macos/server/ (accessed on 15 March 2012).
  47. DASK. Task Graph Optimization. 2022. Available online: https://docs.dask.org/en/stable/optimize.html (accessed on 15 March 2012).
  48. Conejo, A.J.; Baringo, L. Power Electronics and Power Systems Power System Operations; Springer: Cham, Switzerland, 2019; pp. 21–22.
  49. O’Neill, R.; Castillo, A.; Cain, B. The IV Formulation and Linearizations of the AC Optimal Power Flow Problem; Technical Report; Federal Energy Regulatory Commission: Washington, DC, USA, 2013. Available online: https://www.ferc.gov/sites/default/files/2020-04/acopf-2-iv-linearization.pdf (accessed on 13 November 2022).
  50. Happ, H.H. Special cases of orthogonal networks—Tree and link. IEEE Trans. Power Appl. Syst. 1966, 85, 880–891.
  51. Happ, H.H. Z diakoptics—Torn subdivisions radially attached. IEEE Trans. Power Appl. Syst. 1967, 86, 751–769.
  52. Carre, B.A. Solution of load-flow problems by partitioning systems into trees. IEEE Trans. Power Appl. Syst. 1968, 87, 1931–1938.
  53. Andretich, R.G.; Brown, H.E.; Happ, H.H.; Person, C.E. The piecewise solution of the impedance matrix load flow. IEEE Trans. Power Appl. Syst. 1968, 87, 1877–1882.
  54. Wu, F. Solution of large-scale networks by tearing. IEEE Trans. Circuits Syst. 1976, 23, 706–713.
  55. Takatoo, M.; Abe, S.; Bando, T.; Hirasawa, K.; Goto, M.; Kato, T.; Kanke, T. Floating vector processor for power system simulation. IEEE Power Eng. Rev. 1985, 5, 29–30.
  56. Lau, K.; Tylavsky, D.J.; Bose, A. Coarse grain scheduling in parallel triangular factorization and solution of power system matrices. IEEE Trans. Power Syst. 1991, 6, 708–714.