Received May 26, 2020, accepted June 17, 2020, date of publication June 29, 2020, date of current version July 20, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3004713 GPU-Based N-1 Static Security Analysis Algorithm With Preconditioned Conjugate Gradient Method MENG FU1 , GAN ZHOU 1 , (Member, IEEE), JIAHAO ZHAO1 , YANJUN FENG1 , HUAN HE2 , AND KAI LIANG2 1 School 2 State of Electrical Engineering, Southeast University, Nanjing 210096, China Grid Anshan Electric Power Supply Company, Anshan 114001, China Corresponding author: Gan Zhou (zhougan2002@seu.edu.cn) This work was supported in part by the National Natural Science Foundation of China under Grant 51877038, and in part by the Science and Technology Foundation of the State Grid Corporation of China: High-Performance Computing Technology for Analysis and Service on Entire Network of STATE GRID Corporation of China, under Grant DZB17201800023. ABSTRACT N-1 static security analysis (SSA) is an important method for power system stability analysis that requires solving N alternating-current power flows (ACPF) for a system with N elements to obtain strictly accurate results. Past researches have shown the potential of accelerating these calculations using an iterative solver with graphics processing unit (GPU). This paper proposes a GPU-based N-1 SSA algorithm with the preconditioned conjugate gradient (PCG) method. First, a shared preconditioner is selected to accelerate preprocessing of the iterative method for fast decoupled power flow (FDPF) in N-1 SSA. Second, it proposes a GPU-based batch-PCG solver, which packages a massive number of PCG subtasks into a large-scale problem to achieve a higher degree of parallelism and better coalesced memory accesses. Finally, the paper presents a novel GPU-accelerated batch-PCG solution for N-1 SSA. Case studies on a practical 10828-bus system show that the GPU-based N-1 SSA algorithm with the batch-PCG solver is 4.90 times faster than a sequential algorithm on an 8-core CPU. This demonstrates the potential of the GPU-based high-performance SSA solution with the PCG method under a batch framework. INDEX TERMS N-1 static security analysis, graphics processing unit, preconditioned conjugate gradient method, batch-PCG solver. I. INTRODUCTION With the rapid development of emerging grid technologies such as smart grids and microgrids, the concept of an Energy Internet has been proposed that will enable complete integration of power grids, new energy and the Internet. The Energy Internet brings not only industrial development but also technical challenges. Security analysis is one of the most urgent challenges for the Energy Internet, requiring highperformance computational efficiency for on-line analysis. N-1 static security analysis (SSA) is an important method of power system stability analysis that requires solving N alternating-current power flows (ACPFs) for a system with N elements to obtain a strictly accurate result. N-1 SSA that needs to solve massive ACPF problems involves an intensive The associate editor coordinating the review of this manuscript and approving it for publication was Huiqing Wen 124066 . computational task. The traditional method for SSA fails to meet the speed requirements for on-line analysis, especially in the kind of environment proposed for the Energy Internet. Because of the computing parallelism intrinsic to the SSA problem, CPU multi-core architecture was used first to accelerate the computations [1]. However, saturation of the memory bandwidth and additional computing nodes with larger systems remain a bottleneck in accelerating the computing [2]. Recently, graphics processing units (GPU) with single instruction multiple thread (SIMT) architecture are showing superior performance on float-pointing calculations and memory bandwidth [3]. As the representation of high-performance computing technology, GPU-accelerated method for computationally intensive tasks has been recognized as a promising and viable solution in power systems [4]. Thus, GPUs offer a potential solution for solving massive ACPFs in the SSA field. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 8, 2020 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method Existing research proposes two different technical frameworks to accelerate the solution of massive ACPFs in N-1 SSA. Solution 1 uses GPU to accelerate the ACPF subtasks one after another. Power flow calculation imposes a computational burden that is identical to the one involved in solving sparse linear system of equations (SLSE). Two kinds of solvers are generally used to solve linear equations. One is the direct solver [5], [6]; for example, LU factorization is generally understood to be the most efficient solution and is widely used for solving power flows. However, in recent years, the iterative solver, an alternative to the LU-based direct solver, is gradually gaining interest for power flow calculation because of its numerical stability and better scalability under the paradigm of parallel computing [7]. Preconditioning is commonly deployed with an iterative solver, because an efficient preconditioner reduces the condition number of the coefficient matrix and the number of iterations. A preconditioned iterative solver such as the conjugate gradient (CG) solver shows better convergence in large sparse linear systems. Reference [8] proposed a GPU-based pre-processed conjugate gradient method using Chebyshev polynomials. Reference [9] used a GPU to implement a pre-processed biconjugate gradient (BiCG) method to solve the Newton-Raphson power flow problem. Reference [10] applied a GPU-based conjugate gradient normal residual method (CGNR) with a Jacobi preconditioner for power system state estimation and power flow calculations. Nevertheless, no matter which solver is used, there turn out to be two obvious disadvantages to Solution 1. First, limited by the scale of the single SLSE problem and the excessive sequential processes, the degree of parallelism is relatively low and numerous computing cores on the GPU are not fully utilized. Second, the massive unmerged memory access in the single SLSE solving process leads to underutilization of the high bandwidth of the GPU. To explore a higher degree of parallelism, Solution 2 is to accelerate massive ACPFs concurrently. This solution packages multiple independent subtasks to formulate a larger-scale batch calculation task, and it has been successfully applied to the metaheuristic-based optimization problem [11], [12], static security analysis [13] and MCS-based PPF analysis [14]. Obviously, Solution 2 is a superior alternative to Solution 1 because it achieves a higher degree of parallelism among the multiple subtasks. With the parallel scalability of the iterative method and inner parallelism of the massive ACPFs, the solution’s efficiency can be further improved by applying an iterative method to simultaneously solving multiple sparse linear equations under the paradigm of a batch framework and parallel computing. The difficulty with this solution lies in the design of the algorithm, which must conform to the GPU’s SIMT architecture [15], [16]. Against this background, this paper focuses on making the parallelism among massive ACPFs in SSA more regular, and then proposes a parallel batch-ACPF solution with the preconditioned conjugate gradient (PCG) method. This paper makes the following contributions to the field. VOLUME 8, 2020 First, a shared preconditioner, which comes from the ground state Jacobi matrix of the pre-contingency power system, is proposed for the iterative solver to solve the massive SLSEs from the SSA problem. Case results show that this preconditioner improves the condition number of the coefficient matrices better and has a faster convergence speed. As only once complete LU decomposition is made for this shared preconditioner, the factorization time shared by each SLSE is relatively less, and this improves the efficiency of the preprocessing step. Second, we propose a GPU-accelerated batch-PCG solver to achieve better computing efficiency when solving massive SLSEs. The batch version of the solver packages the massive PCG tasks into a brand new larger-scale calculation problem, which achieves a higher degree of parallelism and well-coalesced memory accesses. The batch-PCG solver is designed by optimizing the thread and block allocation strategy and fulfilling coalesced accesses. Our performance analysis verifies that the batch-PCG solver improves the computing efficiency when solving massive numbers of SLSEs, and is 24.10 times faster than a solver on the CPU platform. Last, this paper presents a novel GPU-accelerated batchACPF solution for N-1 SSA, including the overall framework design and a detailed performance analysis. Case studies on a practical 10828-bus system show that the GPU-based batchACPF solution with a batch-PCG solver is 4.90 times faster than a solution with the rank-one update method on an 8-core CPU. The paper is organized as follows. Section II reviews the background to the fast decoupled power flow (FDPF) method, the CG solver and the preconditioner. Section III proposes a shared preconditioner for SLSEs from the SSA problem and then analyzes the performance of GPU-based PCG solver for single SLSE. As the analysis results for GPU-based solution for single SLSE show that it is difficult to improve the computing efficiency, this paper then proposes a GPU-accelerated batch version of the PCG solver for massive numbers of SLSEs and gives details of its tuning strategies. In Section IV, a novel GPU-accelerated batch-ACPF solution for N-1 SSA based on the batch-PCG solver is proposed. The experimental results for four power systems are also presented. Finally, Section V presents the conclusions to our study. II. BACKGROUND A. FDPF METHOD AND ITS SOLVER Power flow is a nonlinear problem which is numerically calculated by iteratively solving the set of sparse linear equations, Ax = b. Newton-Raphson method has to generate the Jacobian matrix in each iteration. Fast decoupled method simplifies the formulation by decoupling active power and reactive power with a constant Jacobian matrix in each iteration, which gives a faster computing speed. The basic FDPF can be written as shown in Equation (1). The coefficient matrices B0 and B00 of the equations are derived from the admittance matrix, so they are constant Jacobi matrices in each iteration. 124067 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method Solving the SLSE is a common and computationally intensive task that takes up 80% of the power flow computing time [17]. Accelerating the solving of the SLSE can greatly improve the FDPF calculation efficiency. ( 1P U = B0 1θ (1) 1Q U = B00 1U Solvers for FDPFs are generally divided into two categories: direct solvers and iterative solvers. Direct solvers, such as the commonly used LU solver, include factorization and forward substitution and backward substitution (FS&BS). As the Jacobi matrix is constant in FDPF, the factorization step in the direct solver is done only once, and the rest of the calculations and FS&BS are done in the power flow iterations. Iterative solvers, such as the CG solver, include preprocess and sparse vector/matrix operations. Similarly, the preconditioning step in the iterative solver is also done once, and the sparse vector/matrix operations are done in the power flow iterations. However, for large-scale systems, the reordering scheme and numerous fill-ins make the LU solver less efficient, and more storage is needed for the two triangular matrices L and U. Rank-one update, another direct method, uses the fact that a matrix can be updated by a rank-one matrix, which is the product of one non-zero column vector and one nonzero row vector. Rank-one update is also used for many power system applications, such as solving FDPFs in the SSA problem. Considering the fact that any contingency affects at most four elements of the pre-contingency bus admittance matrix, rank-one update improves the computing efficiency when solving massive numbers of SLSEs. A0 (here, that is the ground state Jacobi matrix B0 or B00 ) represents the coefficient matrix of the SLSE in the pre-contingency system, Ai xi = bi (i = 1, 2, . . . , N) represents the SLSE in the ith post-contingency system and vi represents the vector of influenced elements. As Ai = A0 + vi vTi , the factorization of A0 can be reused for decomposing Ai , which consumes less computing time than complete LU factorization. Nevertheless, because of intrinsic dependency, the direct solver is more suitable for sequential computing. In contrast, as an alternative to the direct solver, the iterative solver is generally more scalable with better parallelism. Moreover, with less computing complexity and memory storage, the iterative solver shows better computing efficiency in large-scale systems. The applications of the iterative solver to power system is discussed in [18]–[21]. As B0 and B00 are all symmetric positive definite (SPD) linear systems in FDPF [22], [23], the CG solver is usually the first choice among the various iterative solvers because of its convergence rate and robustness [24], [25]. B. CG SOLVER AND PRECONDITIONER Solving FDPF with the CG solver is essentially a two-layer iterative method consisting of an external Newton iteration and an internal CG iteration. The CG solver keeps each 124068 residual and each new search direction orthogonal to all the previous selected directions. In each iteration, the CG solver only requires once matrix vector multiplication and 10n floating-point operations, and the number of operations at each time is O(n2 ). (The time complexity of a Gauss elimination method such as the LU decomposition is O(n3 ).) Unlike the direct solver, the iterative solver does not require reordering to reduce the potential fill-in elements; however, it does require a preconditioning step to obtain better convergence in iterative solving. The convergence properties of the CG solver are described in Equation (2). Here, x∗ , x0 and xm are the precise, initial and current solution vectors respectively, and k is the ratio of the largest and smallest eigenvalues. When k is small, the CG solver converges fast. Additionally, the convergence speed depends on the distribution of the eigenvalues. !m √ k−1 kx∗ − x0 kA kx∗ − xm kA ≤ 2 √ (2) k+1 To improve the convergence speed and numerical stability, the coefficient matrix must be preprocessed; that is, convert the given problem Ax = b to Ãx = e b and make k(Ã) close to 1. Here, à = C −1 AC −1 , x̃ = Cx, b̃ = C −1 b, and C is an SPD matrix. Define preconditioner M = C 2 , and the equation then becomes M −1 Ax = M −1 b. It is noted here that an efficient preconditioner has the following features: First, the condition number of M −1 A should be as small as possible, or the eigenvalues of M −1 A should present a concentrated distribution. Second, Mz = r should be solved easily, by such as triangular factorization. Therefore, an efficient preconditioner should approximate closely to A and maintain sparsity, as with the incomplete LU (ILU) preconditioner, Chebyshev preconditioner and diagonal preconditioner. This principle can be explained as follows. For example, A = GGT , where A is the SPD matrix and G is the Cholesky factor. There exists a positive definite matrix Q such that C = QH T and H T is the upper triangular factor of the QR factorization of C. We therefore obtain the heuristic à = C −1 AC −1 = C −T AC −1 = (HQT )−1 A(QH T )−1 = Q(H −1 GGT H −T )QT Thus, the better H approximates the Cholesky factor G, the smaller the condition number of à is, and then the better convergence performance the iterative solver shows. III. GPU-ACCELERATED BATCH-PCG SOLVER A. PRECONDITIONER DESIGN FOR SSA N-1 SSA is a problem where massive numbers of SLSEs are solved. Although a preconditioner is recommended for the iterative solver, it will make the preprocessing step time-consuming if each SLSE needs an efficient preconditioner. To avoid this dilemma, a shared preconditioner is proposed for all SLSEs from the SSA problem; the efficiency of this preconditioner is demonstrated as follows. VOLUME 8, 2020 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method TABLE 1. Performance comparison between different preconditioners. An ideal preconditioner for the iterative solver would be the coefficient matrix A itself, that is M = A = LU. Complete LU decomposition of the preconditioner is meaningless as the problem has been solved by the direct method and there is no need for iteration. Generally, an ILU preconditioner is always selected as an alternative to improve the convergency. However, if an efficient shared preconditioner is proposed for a set of SLSEs, the operation of complete LU decomposition of this preconditioner will be meaningful. In the SSA problem, since any contingency affects at most four elements of the admittance matrix, Ai of the post-contingency system is similar to A0 of the pre-contingency system. When M = A0 is selected as the shared preconditioner for solving Ai xi = bi , the approximation between Ai and A0 guarantees that this preconditioner is efficient. Complete LU factorization of A0 is made once to obtain L0 and U 0 , and the factors can be reused when solving Mzi = ri (that is L0 U 0 zi = ri ) in each CG iteration. Our experiment verifies the efficiency of this shared preconditioner. The cases come from four different power systems. Cases 1, 2 and 3 are 118-bus, 1354-bus and 2869-bus systems from MATPOWER [14]. Case 4 is a 10828-bus practical East China power grid. Contingencies that can result in power grid islands are not considered. We use the shared preconditioner M = A0 and check the condition number of Ãi (that is M−1 Ai ) of each contingency from the four systems. The performance is compared with the ILU(Ai ) preconditioner in Fig. 1. As shown in Fig. 1, compared with ILU(Ai ), the condition number Ãi applied with a shared preconditioner M = A0 is much less. This means the shared preconditioner improves the condition number Ãi . Therefore, the PCG solver with a shared preconditioner needs fewer iterations to solve the linear equation and has a faster convergence speed. Table 1 displays the average iteration and average time for solving a set of Axi = bi . For the 10828-bus system, the PCG solver with M = A0 needs 10.15 iterations per Axi = bi and consumes a total of 9.8195 ms on preconditioner generation and CG iteration, and this is 5.46 times faster than the PCG solver with ILU(Ai ). Moreover, M = A0 needs only once complete LU decomposition to generate factorization; this means that the A0 factorization time shared by each SLSE is relatively less. The design of this shared preconditioner not only improves the convergence for solving massive SLSEs from SSA problems, but also reduces the preprocessing time of the PCG solver. VOLUME 8, 2020 FIGURE 1. The condition number of M −1 Ai under the preconditioner of shared M (A0 ) and ILU(Ai ). B. GPU-BASED PCG SOLVER FOR SINGLE SLSE After the design of a shared preconditioner for the SSA problem, the following section focuses on accelerating the other sparse vector/matrix operations of the PCG algorithm. Besides the sparse triangular solving (that is FS&BS) for Mz = r, the following steps include sparse matrix vector multiplication (SMVM), vector product and vector addition/subtraction. The GPU-based PCG solver accelerates CG iteration by implementing sparse vector/matrix operations in parallel. If the GPU-based PCG solver for a single SLSE shows more performance advantages than the solver on the CPU platform, the solution that uses the GPU to accelerate ACPF subtasks one after another would easily accelerate computing. The computing kernels of the PCG solver are built by the library functions in cuBLAS and cuSPARSE [26], [27], two linear algebra libraries provided by Nvidia. For a fair performance comparison with the proposed GPU-based PCG solver, two high-performance and commercially-available multicoreCPU-accelerated solvers are chosen. The first one is the KLU solver [28], and the other is the LUSOL solver. KLU is one of the fastest single-threaded libraries for solving SLSEs in power systems. Based on multi-threaded technology such as OpenMP, it can be easily extended to multicore-CPU platforms and function as a parallel solver. LUSOL, another 124069 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method high-performance library, contains a sparse LU factorization for square and rectangular matrices, with Bartels-Golub-Reid updates for column replacement and other rank-one modifications [29]. TABLE 2. Computing time of different solvers for single SLSE. Three solvers are presented in Table 2. Solver 1 is complete LU factorization from KLU, Solver 2 is the rank-one update from LUSOL and Solver 3 is the PCG method supported by cuBLAS and cuSPARSE. Computing time refers to the time to solve the SLSE in each power flow iteration of a single ACPF. The computing time of the direct solver is described as the sum of the time for the 1/n factorization step and the FS&BS step, while the computing time of the GPUbased iterative solver is the sum of the time for the sparse matrix/vector operations. The test platform in this paper is a server equipped with an NVIDIA Tesla K40C GPU and two Intel Xeon E5-2620 2-GHz CPUs. The operating system is Ubuntu 16.04 and the CUDA driver version is 9.0. As shown in Table 2, the GPU-based iterative solver for a single SLSE does not show its superiority. Although the GPU-based PCG solver shows a slightly better solving efficiency than complete LU factorization, it consumes more time than the rank-one update method, and fails to provide the expected improvements. The performance bottleneck of the GPU-based PCG solver is analyzed as follows. The first reason is the low degree of parallelism. For GPU parallel architecture such as the NVIDIA Tesla K40, even a onemillion-order matrix is a small-scale calculation task. Solving a single SLSE of a single ACPF is too small a problem to saturate the numerous computing cores in the GPU. The solution that uses the GPU to accelerates single ACPF subtasks in sequence fails to achieve enough parallelism to improve the solving efficiency. Besides the low parallelism of the single SLSE problem, another reason is random memory access, which leads to massive uncoalesced memory access. Table 3 presents the low device memory bandwidth of the kernel functions of Solver 3, which fails to take good advantage of the GPU’s high bandwidth. TABLE 3. Bandwidth analysis of GPU-based solver for single SLSE. 124070 The above analysis shows that the GPU-based PCG algorithm for solving massive ACPFs in sequence is not an efficient solution due to the low intrinsic parallelism and random memory access. However, it is premature to claim that the GPU-based iterative method is not an ideal solution for accelerating the SSA problem, and that the performance speedup ceiling has been reached. To break the bottleneck and bring the acceleration to a higher level, this paper proposes a batch-parallel solution. By packaging massive SLSEs into a large-scale problem, a batch version of the PCG solver can take more advantage of the GPU’s capabilities, obtain higher degree of parallelism and achieve better speedup performance. C. BATCH-PCG SOLVER FOR MASSIVE NUMBERS OF SLSES 1) OVERALL DESIGN Focusing on the demand for solving massive numbers of SLSEs from the SSA problem, this section proposes a novel batch-PCG solver to solve a set of Ai xi = bi simultaneously. As illustrated in Fig. 2, well-designed strategies are adopted to reconfigure multiple small-scale problems as a large-scale task. Two-level parallelism is then implemented; that is, blocks are responsible for the internal parallelism of a single SLSE solution, and threads in the block are responsible for the external parallelism of the multi-task. From these design strategies, three benefits can be achieved: FIGURE 2. Reconfiguration of multi-task: package massive problems into the GPU for parallel calculation. 1) Higher degree of algorithm parallelism is achieved by packaging massive subtasks into a brand new larger-scale calculation problem. 2) Thread divergence is avoided. Each thread in the block is responsible for calculations of the same complexity, and the operations of each thread maintain exact synchronization. 3) Fulfilling coalesced access to device memory is utilized. Elements which have the same position from different matrices are stored at contiguous addresses, and the warp memory scheduler accesses data in continuous memory. The common pattern of SSA subtasks and the proposed shared preconditioner guarantee that the flow path of vector/matrix operations keeps the same pace, which makes the design of the batch-PCG solver executable. The GPU-based VOLUME 8, 2020 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method FIGURE 4. GPU-based SMVM for single SLSE. FIGURE 3. GPU-based batch PCG algorithm. batch PCG algorithm is illustrated in Fig. 3. Vectors and matrices are expanded to batch-version format and updated simultaneously. Kernel functions provided by the batch-PCG solver support the corresponding batch-version computing step. The following section presents details of the batchversion kernel function; for example, design of the batch SMVM reveals how to expand single vector/matrix calculation to batch-version operation on the GPU. 2) BATCH SMVM Since the batch-version algorithm is actually optimized from a reconfiguration of multiple single problems, it is necessary to analyze the parallel algorithm for single-task first. SMVM is the most time-consuming calculation in the PCG iteration. Algorithm 1 gives the implementation of the GPU-based parallel SMVM algorithm for single SLSE. It turns the SMVM into n sparse vector multiplications, where n is the number of rows in the sparse matrix. The independent calculation of each row is intrinsic parallelism. As shown in Fig. 4, one thread is responsible for one row’s calculation; that is, thread-level parallel computing completes the solving of the single problem. Algorithm 1 GPU-Based Single-SMVM Algorithm 1: i ← thread ID in thread grid; 2: cur_row = CSR_Row[i]; 3: next_row = CSR_Row[i + 1]; 4: for j = cur_row: (next_row-1) do 5: cur_col = CSR_Col[j]; 6: b[i] += CSR_Val[j]∗ x[cur_col]; 7: end for access. When the thread requests the corresponding columns’ position, the discontinuous column data are not coalesced into one request due to the compressed sparse row (CSR) format,. Take the matrix case shown in Fig. 4 as an example; the first columns of row 1, row 2 and row 3 are respectively stored at the 1st, 4th and 6th positions. The random memory address results in the failure of coalesced access. Moreover, different numbers of non-zero elements in each row lead to different loops of multiplication and addition operations. For example, row 0 and row 1 of the matrix case have 3 and 2 nonzero elements respectively, so thread 0 and thread 1 need 3 and 2 loops respectively. Consequently, the computing time of each thread is different. In other words, a thread that has completed its task in advance fails to do anything, but has to wait for all the other threads, which brings about thread divergence. To overcome the above shortcomings, three strategies are introduced to design batch SMVM, as shown in Fig. 5. First, considering the similarity between the Jacobi matrices of the SLSEs from the SSA problem, uniform sparsity pattern is designed. Second, the same numbered rows of different sparse matrices are assigned to the same block. For example, all of rows 0 from sparse matrices are assigned to threads 1 to N in block 0, and the threads in block 0 are responsible for the same operations to calculate the elements of row 0. Moreover, data of the same numbered rows are stored at contiguous addresses in device memory. Until then, all the threads in the same block are able to read data continuously and have the same loops of operation. The implementation of batch parallel SMVM is shown in Algorithm 2. As described above, the batch SMVM algorithm avoids thread divergence and achieves perfect coalesced memory access. 3) PERFORMANCE ANALYSIS Unfortunately, there exist shortcomings to this parallel SMVM algorithm. First, Algorithm 1 fails to fulfill coalesced VOLUME 8, 2020 The other parts of the batch-PCG solver, including batch vector product and batch vector addition/subtraction, can be 124071 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method FIGURE 5. GPU-based batch SMVM for massive numbers of SLSEs. Algorithm 2 GPU-Based Batch-SMVM Algorithm 1: bid ← block ID and tid ← thread ID in the block; 2: cur_row = CSR_Row[bid]; // thread bid responsible for row bid 3: next_row = CSR_Row[bid + 1]; 4: for j = cur_row: (next_row-1) do 5: cur_col = CSR_Col[j]; 6: b [bid][tid] += CSR_Val[j][tid]∗ x[cur_col][tid]; 7: end for TABLE 5. Saturation analysis of batch SMVM. TABLE 4. Computing time of batch-PCG solver. FIGURE 6. Bandwidth analysis of batch SMVM. designed in the same way as the batch SMVM. The design of FS&BS for the batch triangular solving is described in our previous work [30]. Finally, we have the complete batch-PCG solver. It uses both single-task block-level parallelism and multi-task thread-level parallelism, where great acceleration can be obtained from both computational parallelism and memory access. The performance analysis of the batch-PCG solver is presented in Table 4. Performance of the batch-PCG solver is evaluated by the computing time for different batch sizes. Case results listed in Table 4 show the same characteristic: the computing time only increases slightly for a rapid increase in the batch size. For example, when the batch size in Case 4 varies from 1 to 256, the computing time increases only up to 3.36 times (from 3.2196 ms to 10.8330 ms). When the batch size in Case 4 is 1024, the batch-PCG solver consumes only 0.0335 ms per 124072 SLSE, which is 24.10 times faster than Solver 2 (that is 0.8075 ms) on a CPU platform. The batch computing problem is a memory-bound task. It should be noted that there is a saturation point where the performance of the batch-PCG solver fails to improve further; that is, the larger batch size may not mean better solving efficiency. This is because the average occupancy of each SM and the global memory access bandwidth of the GPU affect the execution efficiency of the kernel functions during the calculation process [31]. Average computing time of batch SMVM for different batch sizes is listed in Table 5, and a bandwidth analysis is presented in Fig. 6. The figure shows that the bandwidth tends to saturate with increasing batch size, and this makes the computing time decrease slowly or even stay steady. For example, when the batch size in VOLUME 8, 2020 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method Case 4 increases up to 512, the average computing time approaches its minimal value, 2.7760 µs. The saturation point is a very important index that can guide us on how many GPUs are needed for a specific computation task or how to separate a large-scale task into subtasks that can be processed simultaneously on several GPU cards. IV. GPU-ACCELERATED BATCH-ACPF SOLUTION FOR SSA We propose a complete GPU-accelerated batch-ACPF solution for N-1 SSA based on the well-designed batch-PCG solver. The following section gives the overall framework, and then presents case studies and performance analysis. A. OVERALL FRAMEWORK As shown in Fig. 7, the overall process framework given here introduces a GPU-accelerated batch-ACPF solution for SSA. The main design considerations include: B. CASE STUDY AND PERFORMANCE COMPARISON Table 6 presents the total computing time of three SSA solutions. Solution 1 is a CPU SSA solution based on complete LU supported by Solver 1. Solution 2 is a CPU SSA solution based on rank-one update supported by the multi-threaded Solver 2 described in Section III. Solution 3 is a GPU SSA solution supported by the batch-PCG solver. The numbers of contingencies of the four cases are 177, 1728, 4204 and 11062 respectively. As the rank-one update method solves single SLSE faster than LU decomposition, Solution 2 with multi-threaded Solver 2 is a superior SSA solution on the CPU platform. However, the performance comparison indicates that the GPU-based SSA solution with the batch-PCG solver shows a better performance than the CPU multi-threaded SSA solution. For example, Solution 3 only consumes 40.9540 s on the SSA problem in Case 4, which is about 4.90 times faster than Solution 2 with the 8-threaded solver. The detailed test result for Case 4 are listed in Table 7, which presents the computing time for each step in SSA. The case study compares the performances of Solution 2 with the multi-threaded solver and Solution 3 with the batch-PCG solver. Although the GPU solution sacrifices some time resulting from IO consumption and data transmission, the acceleration of the key steps makes up for this TABLE 6. Performance comparison of different solutions. FIGURE 7. Flow of GPU-accelerated batch-ACPF solution for SSA. (1) Prepare pre-contingency data. Generate ground state Jacobi matrices B00 and B000 and decompose them to L00 , U00 and L000 , U000 as a shared preconditioner for Jacobi matrices of all contingencies. (2) Generate a critical contingency set S, mismatch set {1Pi , 1Qi | i ∈ S}, Jacobi matrix set {B0i , B2i |i ∈ S} on the GPU. (3) Use the batch-PCG solver described in Section III to accelerate the solving of sparse linear equation sets {B0i 1δ i = 1P i |i ∈ S} and {B2i 1U i = 1Qi |i ∈ S}, and then update the voltage amplitude and phase angle vectors to generate new mismatch sets. (4) The convergence criterion for each ACPF is ||xi ||∞ < 10−8 and the stop criterion for the batch-ACPF is when 95% of ACPFs have converged or the batch-ACPF solver has reached the maximum number of iterations (30 times in this paper). VOLUME 8, 2020 FIGURE 8. Proportion of computing time for different steps. 124073 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method TABLE 7. Pseudo-code of SSA with average runtime and speedup (test case is a 10828-bus system). loss: On the one hand, in the step which generates the initial power mismatch set and the Jacobi matrix set, Solution 3 achieves about 5.56 times speedup relative to Solution 2. On the other hand, and more importantly, the well-designed batch-PCG solver consumes only 0.1492 ms on average for each iteration (while the CPU solver consumes 1.7632 ms), which consequently achieves an overall speedup of 4.65 times for the power flow iterations. Figure 8 gives the proportion of the computing time for Solution 3. With the increase in the case dimension, the proportion taken by preconditioner generation for the iterative solver decreases, from 10.10% in Case 1 down to 2.00% in Case 4. In contrast, the percentage of ACPFs iterations with the batch-PCG solver increases, varying from 80.79% to 95.20%. This shows that power flow iterations are the main calculation burden and take up most of the computing time in the SSA problem. This reflects the distinct advantage of the proposed batch-PCG solver. Even on a large-scale system, the batch framework still shows its superiority, and this is one of the most important reasons that the GPU-based SSA solution achieves a better performance. V. CONCLUSION This paper aims at improving the computing efficiency when solving massive numbers of ACPFs in N-1 SSA. A shared preconditioner for the SSA problem is proposed to reduce the preprocessing time and improve the convergence of the iterative solver. The proposed well-designed GPU-accelerated batch-PCG solver achieves a higher degree of parallelism and better coalesced memory access. The paper then presents a novel GPU-accelerated batch-ACPF solution for N-1 SSA. Case studies on a practical 10828-bus system show that the GPU-based batch-ACPF solution with the batch-PCG solver is 4.90 times faster than the solution on a multi-core CPU. The GPU-based high-performance SSA solution can provide 124074 reference values for power grid security analysis for the Energy Internet. REFERENCES [1] F. Li, H. Li, Y. Yu, and S. Feng, ‘‘Fast computing technologies for static security checking based on parallel computation and data reuse,’’ Automat. Electr. Power Syst., vol. 37, nol. 14, pp. 75–80, 2013. [2] R. C. Green, L. Wang, and M. Alam, ‘‘Applications and trends of high performance computing for electric power systems: Focusing on smart grid,’’ IEEE Trans. Smart Grid, vol. 4, no. 2, pp. 922–931, Jun. 2013. [3] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, ‘‘NVIDIA tesla: A unified graphics and computing architecture,’’ IEEE Micro, vol. 28, no. 2, pp. 39–55, Mar. 2008. [4] B. Shang, Y. Xu, C. Zhang, Y. Chen, Z. Liu, L. Lin, C. Xu, and J. Yu, ‘‘GPU-accelerated batch solution for short-circuit current calculation of large-scale power systems,’’ in Proc. IEEE 3rd Int. Electr. Energy Conf. (CIEEC), Beijing, China, Sep. 2019, pp. 1743–1748. [5] P. Sao, R. Vuduc, and X. S. Li, ‘‘A distributed CPU-GPU sparse direct solver,’’ in Proc. Eur. Conf. Parallel Process. Cham, Switzerland: Springer, 2014, pp. 487–498. [6] X. S. Li and J. W. Demmel, ‘‘SuperLU_DIST: A scalable distributedmemory sparse direct solver for unsymmetric linear systems,’’ ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110–140, Jun. 2003. [7] R. Idema, G. Papaefthymiou, D. Lahaye, C. Vuik, and L. van der Sluis, ‘‘Towards faster solution of large power flow problems,’’ IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4918–4925, Nov. 2013. [8] X. Li and F. Li, ‘‘GPU-based power flow analysis with Chebyshev preconditioner and conjugate gradient method,’’ Electr. Power Syst. Res., vol. 116, pp. 87–93, Nov. 2014. [9] N. Garcia, ‘‘Parallel power flow solutions using a biconjugate gradient algorithm and a Newton method: A GPU-based approach,’’ in Proc. IEEE PES Gen. Meeting, Jul. 2010, pp. 1–4. [10] Z. Li, V. D. Donde, J.-C. Tournier, and F. Yang, ‘‘On limitations of traditional multi-core and potential of many-core processing architectures for sparse linear solvers used in large-scale power system applications,’’ in Proc. IEEE Power Energy Soc. Gen. Meeting, Jul. 2011, pp. 1–8. [11] V. Roberge, M. Tarbouchi, and F. Okou, ‘‘Parallel power flow on graphics processing units for concurrent evaluation of many networks,’’ IEEE Trans. Smart Grid, vol. 8, no. 4, pp. 1639–1648, Jul. 2017. [12] E. Belič, N. Lukač, K. Deželak, B. Žalik, and G. Štumberger, ‘‘GPU-based online optimization of low voltage distribution network operation,’’ IEEE Trans. Smart Grid, vol. 8, no. 3, pp. 1460–1468, May 2017. VOLUME 8, 2020 M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method [13] G. Zhou, Y. Feng, R. Bo, L. Chien, X. Zhang, Y. Lang, Y. Jia, and Z. Chen, ‘‘GPU-accelerated batch-ACPF solution for N-1 static security analysis,’’ IEEE Trans. Smart Grid, vol. 8, no. 3, pp. 1406–1416, May 2017. [14] M. Abdelaziz, ‘‘GPU-OpenCL accelerated probabilistic power flow analysis using Monte-Carlo simulation,’’ Electr. Power Syst. Res., vol. 147, pp. 70–72, Jun. 2017. [15] S. Cook, CUDA Programming: A Developer’S Guide to Parallel Computing With GPUs. San Mateo, CA, USA: Morgan Kaufmann, 2012. [16] NVIDIA Corporation. NVIDIA CUDA C Programming Guide. Accessed: Sep. 15, 2019. [Online]. Available: http://docs.nvidia.com/ cuda/cuda-c-programming-guide/ [17] R. D. Zimmerman, C. E. Murillo-Sanchez, and R. J. Thomas, ‘‘MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,’’ IEEE Trans. Power Syst., vol. 26, no. 1, pp. 12–19, Feb. 2011. [18] F. D. Leon and A. Sernlyen, ‘‘Iterative solvers in the Newton power flow problem: Preconditioners, inexact solutions and partial Jacobian updates,’’ IEE Proc.-Gener., Transmiss. Distrib., vol. 149, no. 4, pp. 479–484, Jul. 2002. [19] A. B. Alves, E. N. Asada, and A. Monticelli, ‘‘Critical evaluation of direct and iterative methods for solving Ax=b systems in power flow calculations and contingency analysis,’’ in Proc. 21st Int. Conf. Power Ind. Comput. Appl. Connecting Utilities Millennium Beyond, May 1999, pp. 15–21. [20] A. Semlyen, ‘‘Fundamental concepts of a Krylov subspace power flow methodology,’’ IEEE Trans. Power Syst., vol. 11, no. 3, pp. 1528–1537, Aug. 1996. [21] T. Cui and F. Franchetti, ‘‘Power system probabilistic and security analysis on commodity high performance computing systems,’’ in Proc. 3rd Int. Workshop High Perform. Comput., Netw. Anal. Power Grid (HiPCNA-PG), 2013, pp. 1–10. [22] R. A. M. van Amerongen, ‘‘A general-purpose version of the fast decoupled load flow,’’ IEEE Trans. Power Syst., vol. 4, no. 2, pp. 760–770, May 1989. [23] B. Stott and O. Alsac, ‘‘Fast decoupled load flow,’’ IEEE Trans. Power App. Syst., vol. PAS-93, no. 3, pp. 859–869, May 1974. [24] J. R. Shewchuk, ‘‘An introduction to the conjugate gradient method without the agonizing pain,’’ School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-94-125, 1994. [25] J. L. Nazareth, Conjugate-Gradient Methods. New York, NY, USA: Springer, 2001. [26] NVIDIA Corporation. CUBLAS | NVIDIA Developer Zone. Accessed: Sep. 15, 2019. [Online]. Available: https://developer. nvidia.com/cublas/ [27] NVIDIA Corporation. CUSPARSE | NVIDIA Developer Zone. Accessed: Sep. 15, 2019. [Online]. Available: https://developer. nvidia.com/cusparse/ [28] T. A. Davis and E. P. Natarajan, ‘‘Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,’’ ACM Trans. Math. Softw., vol. 37, no. 3, pp. 1–17, Sep. 2010. [29] J. K. Reid, ‘‘A sparsity-exploiting variant of the Bartels–Golub decomposition for linear programming bases,’’ Math. Program., vol. 24, no. 1, pp. 55–69, Dec. 1982. [30] G. Zhou, Y. Feng, R. Bo, and T. Zhang, ‘‘GPU-accelerated sparse matrices parallel inversion algorithm for large-scale power systems,’’ Int. J. Electr. Power Energy Syst., vol. 111, pp. 34–43, Oct. 2019. [31] A. R. Brodtkorb, T. R. Hagen, and M. L. Sætra, ‘‘Graphics processing unit (GPU) programming strategies and trends in GPU computing,’’ J. Parallel Distrib. Comput., vol. 73, no. 1, pp. 4–13, Jan. 2013. GAN ZHOU (Member, IEEE) received the M.S. and Ph.D. degrees from the School of Electrical Engineering, Southeast University, Nanjing, China, in 2003 and 2009, respectively. He is currently an Associate Professor with the School of Electrical Engineering, Southeast University. He has authored or coauthored over 40 articles in refereed journals and conference proceedings. His current research interests include CPU+GPU hybrid computing architecture and high-performance computing in power systems. MENG FU received the B.S. and M.S. degrees from the School of Electrical Engineering, Southeast University, Nanjing, China, in 2004 and 2007, respectively, where she is currently pursuing the Ph.D. degree. Her current research interests include power systems analysis, large scale linear systems, and high-performance computing in power systems. KAI LIANG is currently an Engineer with State Grid Anshan Electric Power Supply Company. His current research interests include power systems analysis and control, electric power dispatching, and big data technology in power systems. VOLUME 8, 2020 JIAHAO ZHAO received the B.S. degree in electrical engineering from the East China University of Science and Technology, Shanghai, China, in 2018. He is currently pursuing the M.S. degree in electrical engineering with the School of Electrical Engineering, Southeast University, Nanjing, China. His current research interests include power network analysis and high-performance parallel computing in power systems. YANJUN FENG received the B.S. degree in electrical engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2013, and the M.S. degree from Southeast University, Nanjing, China, in 2016, where he is currently pursuing the Ph.D. degree. His current research interests include power network analysis and GPU parallel computing in power systems. HUAN HE is currently an Engineer with State Grid Anshan Electric Power Supply Company. His current research interests include power systems analysis and control, electric power dispatching, and big data technology in power systems. 124075