Received May 26, 2020, accepted June 17, 2020, date of publication June 29, 2020, date of current version July 20, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3004713
GPU-Based N-1 Static Security Analysis Algorithm With Preconditioned Conjugate Gradient Method
MENG FU1, GAN ZHOU1 (Member, IEEE), JIAHAO ZHAO1, YANJUN FENG1,
HUAN HE2, AND KAI LIANG2
1 School of Electrical Engineering, Southeast University, Nanjing 210096, China
2 State Grid Anshan Electric Power Supply Company, Anshan 114001, China
Corresponding author: Gan Zhou (zhougan2002@seu.edu.cn)
This work was supported in part by the National Natural Science Foundation of China under Grant 51877038, and in part by the Science
and Technology Foundation of the State Grid Corporation of China: High-Performance Computing Technology for Analysis and Service
on Entire Network of STATE GRID Corporation of China, under Grant DZB17201800023.
ABSTRACT N-1 static security analysis (SSA) is an important method for power system stability analysis
that requires solving N alternating-current power flows (ACPF) for a system with N elements to obtain strictly
accurate results. Past research has shown the potential of accelerating these calculations using an iterative
solver on a graphics processing unit (GPU). This paper proposes a GPU-based N-1 SSA algorithm with
the preconditioned conjugate gradient (PCG) method. First, a shared preconditioner is selected to accelerate
preprocessing of the iterative method for fast decoupled power flow (FDPF) in N-1 SSA. Second, it proposes
a GPU-based batch-PCG solver, which packages a massive number of PCG subtasks into a large-scale
problem to achieve a higher degree of parallelism and better coalesced memory accesses. Finally, the paper
presents a novel GPU-accelerated batch-PCG solution for N-1 SSA. Case studies on a practical 10828-bus
system show that the GPU-based N-1 SSA algorithm with the batch-PCG solver is 4.90 times faster than a
sequential algorithm on an 8-core CPU. This demonstrates the potential of the GPU-based high-performance
SSA solution with the PCG method under a batch framework.
INDEX TERMS N-1 static security analysis, graphics processing unit, preconditioned conjugate gradient
method, batch-PCG solver.
I. INTRODUCTION
With the rapid development of emerging grid technologies
such as smart grids and microgrids, the concept of an Energy
Internet has been proposed that will enable complete integration of power grids, new energy and the Internet. The
Energy Internet brings not only industrial development but
also technical challenges. Security analysis is one of the most
urgent challenges for the Energy Internet, because on-line analysis requires high computational efficiency.
The associate editor coordinating the review of this manuscript and approving it for publication was Huiqing Wen.
N-1 static security analysis (SSA) is an important method
of power system stability analysis that requires solving N
alternating-current power flows (ACPFs) for a system with
N elements to obtain a strictly accurate result. Because N-1 SSA
must solve a massive number of ACPF problems, it is a computationally intensive
task. The traditional method for SSA fails
to meet the speed requirements for on-line analysis, especially in the kind of environment proposed for the Energy
Internet. Because of the computing parallelism intrinsic to
the SSA problem, CPU multi-core architecture was used
first to accelerate the computations [1]. However, saturation
of the memory bandwidth and additional computing nodes
with larger systems remain a bottleneck in accelerating the
computing [2]. Recently, graphics processing units (GPUs)
with a single instruction multiple thread (SIMT) architecture
have shown superior floating-point performance and memory bandwidth [3]. As a representative
high-performance computing technology, GPU acceleration
of computationally intensive tasks has been recognized as a promising and viable solution in power systems [4].
Thus, GPUs offer a potential solution for solving massive
ACPFs in the SSA field.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020
M. Fu et al.: GPU-Based N-1 SSA Algorithm With PCG Method
Existing research proposes two different technical frameworks to accelerate the solution of massive ACPFs in
N-1 SSA. Solution 1 uses a GPU to accelerate the ACPF subtasks one after another. The computational burden of power flow calculation
lies essentially in solving a sparse linear system of equations (SLSE). Two
kinds of solvers are generally used to solve linear equations.
One is the direct solver [5], [6]; for example, LU factorization is generally understood to be the most efficient
solution and is widely used for solving power flows. However, in recent years, the iterative solver, an alternative to
the LU-based direct solver, is gradually gaining interest for
power flow calculation because of its numerical stability
and better scalability under the paradigm of parallel computing [7]. Preconditioning is commonly deployed with an
iterative solver, because an efficient preconditioner reduces
the condition number of the coefficient matrix and the number
of iterations. A preconditioned iterative solver such as the
conjugate gradient (CG) solver shows better convergence
in large sparse linear systems. Reference [8] proposed a
GPU-based preconditioned conjugate gradient method using
Chebyshev polynomials. Reference [9] used a GPU to implement a preconditioned biconjugate gradient (BiCG) method
to solve the Newton-Raphson power flow problem. Reference [10] applied a GPU-based conjugate gradient normal
residual method (CGNR) with a Jacobi preconditioner for
power system state estimation and power flow calculations.
Nevertheless, whichever solver is used, Solution 1 has
two obvious disadvantages. First, because of the limited
scale of a single SLSE problem and the many
sequential steps, the degree of parallelism is relatively
low and the numerous computing cores on the GPU are not fully
utilized. Second, the massive uncoalesced memory accesses in the
single-SLSE solving process lead to underutilization of the
high bandwidth of the GPU.
To explore a higher degree of parallelism, Solution 2 is
to accelerate massive ACPFs concurrently. This solution
packages multiple independent subtasks to formulate a
larger-scale batch calculation task, and it has been successfully applied to the metaheuristic-based optimization problem [11], [12], static security analysis [13] and MCS-based
PPF analysis [14]. Obviously, Solution 2 is a superior alternative to Solution 1 because it achieves a higher degree of
parallelism among the multiple subtasks. Given the parallel
scalability of the iterative method and the inherent parallelism of
the massive ACPFs, the solution’s efficiency can be further
improved by applying an iterative method to solve
multiple sparse linear systems simultaneously under
a batch framework. The difficulty
with this solution lies in the design of the algorithm, which
must conform to the GPU’s SIMT architecture [15], [16].
Against this background, this paper focuses on making
the parallelism among massive ACPFs in SSA more regular,
and then proposes a parallel batch-ACPF solution with the
preconditioned conjugate gradient (PCG) method. This paper
makes the following contributions to the field.
First, a shared preconditioner, derived from the
ground-state Jacobian matrix of the pre-contingency power system, is proposed for the iterative solver that solves the massive
SLSEs from the SSA problem. Case results show that this
preconditioner reduces the condition number of the coefficient matrices more effectively than an incomplete LU preconditioner and yields faster convergence.
Because only one complete LU decomposition is performed for this
shared preconditioner, the factorization time amortized over each
SLSE is small, which improves the efficiency of the
preprocessing step.
Second, we propose a GPU-accelerated batch-PCG solver
to achieve better computing efficiency when solving massive SLSEs. The batch version of the solver packages the
massive PCG tasks into a brand new larger-scale calculation problem, which achieves a higher degree of parallelism
and well-coalesced memory accesses. The batch-PCG solver
is designed by optimizing the thread and block allocation
strategy and fulfilling coalesced accesses. Our performance
analysis verifies that the batch-PCG solver improves the computing efficiency when solving massive numbers of SLSEs,
and is 24.10 times faster than a solver on the CPU platform.
Last, this paper presents a novel GPU-accelerated batch-ACPF solution for N-1 SSA, including the overall framework
design and a detailed performance analysis. Case studies on a
practical 10828-bus system show that the GPU-based batch-ACPF solution with a batch-PCG solver is 4.90 times faster
than a solution with the rank-one update method on an 8-core
CPU.
The paper is organized as follows. Section II reviews the
background to the fast decoupled power flow (FDPF) method,
the CG solver and the preconditioner. Section III proposes a
shared preconditioner for SLSEs from the SSA problem and
then analyzes the performance of a GPU-based PCG solver for
a single SLSE. Because this analysis shows that a GPU-based solution
for a single SLSE can hardly improve the computing efficiency, this paper then proposes a GPU-accelerated
batch version of the PCG solver for massive numbers of
SLSEs and gives details of its tuning strategies. In Section IV,
a novel GPU-accelerated batch-ACPF solution for N-1 SSA
based on the batch-PCG solver is proposed. The experimental
results for four power systems are also presented. Finally,
Section V presents the conclusions to our study.
II. BACKGROUND
A. FDPF METHOD AND ITS SOLVER
Power flow is a nonlinear problem that is solved numerically
by iteratively solving a set of sparse linear equations, Ax = b. The Newton-Raphson method has to regenerate the
Jacobian matrix in each iteration. The fast decoupled method simplifies the formulation by decoupling active power and reactive power and keeping the Jacobian matrix constant across iterations,
which gives a faster computing speed. The basic FDPF can
be written as shown in Equation (1). The coefficient matrices
B′ and B″ of the equations are derived from the admittance
matrix, so they remain constant in every iteration.
Solving the SLSE is a common and computationally intensive
task that takes up 80% of the power flow computing time [17].
Accelerating the solving of the SLSE can greatly improve the
FDPF calculation efficiency.
\[
\begin{cases}
\Delta P / U = B' \, \Delta\theta \\
\Delta Q / U = B'' \, \Delta U
\end{cases}
\tag{1}
\]
Solvers for FDPFs are generally divided into two categories: direct solvers and iterative solvers. Direct solvers,
such as the commonly used LU solver, consist of a factorization step followed by forward substitution and backward substitution
(FS&BS). As the Jacobian matrix is constant in FDPF, the factorization step in the direct solver is done only once, and the
remaining calculations and the FS&BS are repeated in the power flow
iterations. Iterative solvers, such as the CG solver, consist of
a preprocessing step and sparse vector/matrix operations. Similarly,
the preconditioning step in the iterative solver is done only
once, and the sparse vector/matrix operations are repeated in the
power flow iterations.
However, for large-scale systems, the reordering scheme
and numerous fill-ins make the LU solver less efficient, and
more storage is needed for the two triangular matrices L and
U. Rank-one update, another direct method, uses the fact
that a matrix can be updated by a rank-one matrix, which
is the product of one non-zero column vector and one nonzero row vector. Rank-one update is also used for many
power system applications, such as solving FDPFs in the SSA
problem. Considering the fact that any contingency affects
at most four elements of the pre-contingency bus admittance
matrix, rank-one update improves the computing efficiency
when solving massive numbers of SLSEs. Let A_0 (here,
the ground-state Jacobian matrix B′ or B″) denote the
coefficient matrix of the SLSE in the pre-contingency system,
A_i x_i = b_i (i = 1, 2, ..., N) the SLSE in the i-th post-contingency system, and v_i the vector of
influenced elements. As A_i = A_0 + v_i v_i^T, the factorization
of A_0 can be reused for decomposing A_i, which consumes
less computing time than a complete LU factorization.
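One standard way to exploit the relation A_i = A_0 + v_i v_i^T is the Sherman-Morrison identity, which obtains the post-contingency solution from two solves with A_0 only. The following is a minimal sketch under that assumption; it is not necessarily the update scheme used by the solvers discussed in this paper. The function names, the dense Gaussian elimination (a stand-in for reusing stored LU factors of A_0) and the 3-bus data are all hypothetical.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Stand-in for the FS&BS that would reuse stored LU factors of A_0.
    """
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, w):
    return sum(a * b for a, b in zip(u, w))


def rank_one_solve(A0, v, b):
    """Solve (A0 + v v^T) x = b using only solves with A0 (Sherman-Morrison)."""
    y = solve_dense(A0, b)  # y = A0^{-1} b
    z = solve_dense(A0, v)  # z = A0^{-1} v
    scale = dot(v, y) / (1.0 + dot(v, z))
    return [yi - scale * zi for yi, zi in zip(y, z)]


# Hypothetical 3-bus data: A0 is SPD and v touches two buses,
# so v v^T changes four entries of the matrix.
A0 = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
v = [1.0, -1.0, 0.0]
b = [1.0, 2.0, 3.0]
x = rank_one_solve(A0, v, b)

# Check against the explicitly formed post-contingency matrix A_i.
Ai = [[A0[r][c] + v[r] * v[c] for c in range(3)] for r in range(3)]
residual = [dot(Ai[r], x) - b[r] for r in range(3)]
```

In a real implementation the two calls to `solve_dense` would be replaced by FS&BS with the factors of A_0 computed once, which is where the time saving over a fresh factorization of each A_i comes from.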
Nevertheless, because of intrinsic dependency, the direct
solver is more suitable for sequential computing. In contrast,
as an alternative to the direct solver, the iterative solver is generally more scalable with better parallelism. Moreover, with
less computing complexity and memory storage, the iterative
solver shows better computing efficiency in large-scale systems. Applications of the iterative solver to power systems
are discussed in [18]–[21]. As B′ and B″ are both symmetric
positive definite (SPD) matrices in FDPF [22], [23],
the CG solver is usually the first choice among the various
iterative solvers because of its convergence rate and robustness [24], [25].
B. CG SOLVER AND PRECONDITIONER
Solving FDPF with the CG solver is essentially a two-layer
iterative method consisting of an external Newton iteration
and an internal CG iteration. The CG solver keeps each
new residual orthogonal to all previous residuals and each new search direction A-conjugate to all
previously selected directions. In each iteration, the CG
solver requires only one matrix-vector multiplication and
10n floating-point operations, so the number of operations
per iteration is O(n^2). (The time complexity of a Gaussian
elimination method such as LU decomposition is O(n^3).)
Unlike the direct solver, the iterative solver does not require
reordering to reduce the potential fill-in elements; however,
it does require a preconditioning step to obtain better convergence in iterative solving. The convergence properties of
the CG solver are described in Equation (2). Here, x*, x_0
and x_m are the exact, initial and m-th iterate solution vectors,
respectively, and k is the ratio of the largest to the smallest
eigenvalue of A, that is, its condition number. When k is small, the CG solver converges fast.
Additionally, the convergence speed depends on the distribution of the eigenvalues.
\[
\| x^{*} - x_m \|_A \le 2 \left( \frac{\sqrt{k} - 1}{\sqrt{k} + 1} \right)^{m} \| x^{*} - x_0 \|_A
\tag{2}
\]
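To make the role of k in Equation (2) concrete, the bound can be inverted to estimate how many CG iterations suffice to reduce the A-norm error by a given factor. The sketch below is illustrative only: the function name and the tolerance are assumptions, and the result is a worst-case estimate from the bound, not a measured iteration count.

```python
import math


def cg_iterations_bound(k, tol):
    """Smallest m with 2 * ((sqrt(k) - 1) / (sqrt(k) + 1)) ** m <= tol.

    k is the condition number (k >= 1) and tol is the target reduction
    of the A-norm error, assumed to satisfy 0 < tol < 2.
    """
    if k == 1.0:
        return 1  # the contraction factor is zero, one step suffices
    rho = (math.sqrt(k) - 1.0) / (math.sqrt(k) + 1.0)
    return math.ceil(math.log(tol / 2.0) / math.log(rho))


# A well-conditioned system needs far fewer iterations than an
# ill-conditioned one for the same target accuracy.
m_small_k = cg_iterations_bound(4.0, 1e-6)
m_large_k = cg_iterations_bound(100.0, 1e-6)
```

Because the contraction factor depends on the square root of k, reducing the condition number through preconditioning directly shortens the worst-case iteration count.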
To improve the convergence speed and numerical stability,
the coefficient matrix must be preconditioned; that is, the
given problem Ax = b is converted to Ã x̃ = b̃ such that k(Ã) is close
to 1. Here, Ã = C^{-1} A C^{-1}, x̃ = Cx, b̃ = C^{-1} b, and C
is an SPD matrix. Defining the preconditioner M = C^2, the
equation becomes M^{-1} A x = M^{-1} b. An
efficient preconditioner has the following features. First,
the condition number of M^{-1} A should be as small as possible,
or the eigenvalues of M^{-1} A should be clustered.
Second, Mz = r should be easy to solve, for example by
triangular factorization. Therefore, an efficient preconditioner should approximate A closely while maintaining sparsity,
as with the incomplete LU (ILU) preconditioner, the Chebyshev
preconditioner and the diagonal preconditioner. This principle
can be explained as follows. Let A = G G^T, where A
is the SPD matrix and G is its Cholesky factor. There exists
an orthogonal matrix Q such that C = Q H^T, where H^T
is the upper triangular factor of the QR factorization of C.
We therefore obtain the heuristic
\[
\tilde{A} = C^{-1} A C^{-1} = C^{-T} A C^{-1} = (H Q^{T})^{-1} A (Q H^{T})^{-1} = Q \left( H^{-1} G G^{T} H^{-T} \right) Q^{T}
\]
Thus, the better H approximates the Cholesky factor G,
the smaller the condition number of Ã is, and the better
the convergence of the iterative solver.
III. GPU-ACCELERATED BATCH-PCG SOLVER
A. PRECONDITIONER DESIGN FOR SSA
N-1 SSA is a problem where massive numbers of SLSEs
are solved. Although a preconditioner is recommended for
the iterative solver, it will make the preprocessing step
time-consuming if each SLSE needs an efficient preconditioner. To avoid this dilemma, a shared preconditioner is
proposed for all SLSEs from the SSA problem; the efficiency
of this preconditioner is demonstrated as follows.
TABLE 1. Performance comparison between different preconditioners.
An ideal preconditioner for the iterative solver would be
the coefficient matrix A itself, that is, M = A = LU. For a single SLSE, however, a complete
LU decomposition of the preconditioner is pointless, because
the problem would then already be solved by the direct method and there
would be no need for iteration. An ILU preconditioner is
therefore usually selected as an alternative to improve the convergence.
However, if an efficient shared preconditioner is used for
a set of SLSEs, a complete LU decomposition
of this preconditioner becomes worthwhile. In the SSA problem, since any contingency affects at most four elements of
the admittance matrix, A_i of the post-contingency system is
similar to A_0 of the pre-contingency system. When M = A_0
is selected as the shared preconditioner for solving A_i x_i = b_i,
the closeness of A_i to A_0 makes this
preconditioner efficient. A complete LU factorization of A_0
is performed once to obtain L_0 and U_0, and the factors are
reused when solving M z_i = r_i (that is, L_0 U_0 z_i = r_i) in each
CG iteration.
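The idea of factorizing A_0 once and reusing the factors inside every CG iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a dense Cholesky factor as a stand-in for L_0 U_0 (valid because the matrices are SPD), and the function names and the 6-bus tridiagonal test data are hypothetical.

```python
import math


def cholesky(A):
    """Dense Cholesky factor L (A = L L^T); stand-in for the LU factors of A_0."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, r):
    """Solve L L^T z = r by forward and backward substitution (FS&BS)."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (r[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def dot(u, w):
    return sum(a * b for a, b in zip(u, w))


def matvec(A, x):
    return [dot(row, x) for row in A]


def pcg(A, b, precond_solve, tol=1e-10, max_iter=100):
    """Preconditioned CG; precond_solve(r) returns z with M z = r."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond_solve(r)
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = precond_solve(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical pre-contingency matrix A0 (SPD, tridiagonal) and one contingency.
n = 6
A0 = [[2.5 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
       for j in range(n)] for i in range(n)]
v = [0.0, 0.8, -0.8, 0.0, 0.0, 0.0]
Ai = [[A0[r][c] + v[r] * v[c] for c in range(n)] for r in range(n)]
b = [1.0] * n

L0 = cholesky(A0)  # factorized once, shared by all contingencies
x, iters = pcg(Ai, b, lambda r: chol_solve(L0, r))
```

With M = A_0 and a rank-one difference between A_i and A_0, the preconditioned matrix M^{-1} A_i is the identity plus a rank-one term, so it has at most two distinct eigenvalues and CG needs at most two iterations in exact arithmetic. Real contingency sets involve larger systems and rounding effects, so iteration counts in practice will differ from this toy case.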
Our experiment verifies the efficiency of this shared preconditioner. The cases come from four different power systems. Cases 1, 2 and 3 are 118-bus, 1354-bus and 2869-bus
systems from MATPOWER [14]. Case 4 is a 10828-bus
practical East China power grid. Contingencies that can result
in power grid islands are not considered. We use the shared
preconditioner M = A_0 and check the condition number of Ã_i
(that is, M^{-1} A_i) for each contingency in the four systems.
The performance is compared with the ILU(A_i) preconditioner in Fig. 1. Compared with ILU(A_i),
the condition number of Ã_i under the shared preconditioner
M = A_0 is much smaller. Therefore, the PCG
solver with the shared preconditioner needs fewer iterations to
solve the linear equations and converges faster.
Table 1 displays the average number of iterations and the average time for
solving a set of A_i x_i = b_i. For the 10828-bus system, the PCG
solver with M = A_0 needs 10.15 iterations per A_i x_i = b_i and
consumes a total of 9.8195 ms on preconditioner generation
and CG iteration, which is 5.46 times faster than the PCG
solver with ILU(A_i). Moreover, M = A_0 needs only one
complete LU decomposition; this
means that the A_0 factorization time amortized over each SLSE
is small. The design of this shared preconditioner thus not
only improves the convergence for solving massive SLSEs
from SSA problems, but also reduces the preprocessing time
of the PCG solver.
FIGURE 1. The condition number of M −1 Ai under the preconditioner of
shared M (A0 ) and ILU(Ai ).
B. GPU-BASED PCG SOLVER FOR SINGLE SLSE
After the design of a shared preconditioner for the SSA
problem, the following section focuses on accelerating the
other sparse vector/matrix operations of the PCG algorithm.
Besides the sparse triangular solving (that is FS&BS) for
Mz = r, the following steps include sparse matrix vector multiplication (SMVM), vector product and vector addition/subtraction.
The GPU-based PCG solver accelerates CG iteration by
implementing sparse vector/matrix operations in parallel.
If the GPU-based PCG solver for a single SLSE shows more
performance advantages than the solver on the CPU platform,
the solution that uses the GPU to accelerate ACPF subtasks
one after another would easily accelerate computing. The
computing kernels of the PCG solver are built by the library
functions in cuBLAS and cuSPARSE [26], [27], two linear
algebra libraries provided by NVIDIA. For a fair performance
comparison with the proposed GPU-based PCG solver, two
high-performance and commercially available multicore-CPU-accelerated solvers are chosen. The first one is the KLU
solver [28], and the other is the LUSOL solver. KLU is one
of the fastest single-threaded libraries for solving SLSEs in
power systems. With multi-threading technology such
as OpenMP, it can be easily extended to multicore-CPU
platforms and function as a parallel solver. LUSOL, another
high-performance library, contains a sparse LU factorization
for square and rectangular matrices, with Bartels-Golub-Reid
updates for column replacement and other rank-one modifications [29].
TABLE 2. Computing time of different solvers for single SLSE.
Three solvers are presented in Table 2. Solver 1 is complete
LU factorization from KLU, Solver 2 is the rank-one update
from LUSOL and Solver 3 is the PCG method supported
by cuBLAS and cuSPARSE. Computing time refers to the
time to solve the SLSE in each power flow iteration of a
single ACPF. The computing time of the direct solver is
described as the sum of the time for the 1/n factorization step
and the FS&BS step, while the computing time of the GPU-based iterative solver is the sum of the time for the sparse
matrix/vector operations. The test platform in this paper is a
server equipped with an NVIDIA Tesla K40C GPU and two
Intel Xeon E5-2620 2-GHz CPUs. The operating system is
Ubuntu 16.04 and the CUDA driver version is 9.0.
As shown in Table 2, the GPU-based iterative solver for
a single SLSE does not show its superiority. Although the
GPU-based PCG solver shows a slightly better solving efficiency than complete LU factorization, it consumes more
time than the rank-one update method, and fails to provide
the expected improvements. The performance bottleneck of
the GPU-based PCG solver is analyzed as follows. The first
reason is the low degree of parallelism. For a parallel GPU
architecture such as the NVIDIA Tesla K40, even a one-million-order matrix is a small-scale calculation task. Solving
a single SLSE of a single ACPF is too small a problem to
saturate the numerous computing cores in the GPU. The solution that uses the GPU to accelerate single ACPF subtasks
in sequence fails to achieve enough parallelism to improve
the solving efficiency. Besides the low parallelism of the
single SLSE problem, another reason is random memory
access, which leads to massive uncoalesced memory access.
Table 3 presents the low device memory bandwidth of the kernel functions of Solver 3, which fails to take good advantage
of the GPU’s high bandwidth.
TABLE 3. Bandwidth analysis of GPU-based solver for single SLSE.
The above analysis shows that the GPU-based PCG algorithm for solving massive ACPFs in sequence is not an
efficient solution due to the low intrinsic parallelism and
random memory access. However, it is premature to claim
that the GPU-based iterative method is not an ideal solution
for accelerating the SSA problem, and that the performance
speedup ceiling has been reached. To break the bottleneck and
bring the acceleration to a higher level, this paper proposes
a batch-parallel solution. By packaging massive SLSEs into
a large-scale problem, a batch version of the PCG solver
can take more advantage of the GPU’s capabilities, obtain
a higher degree of parallelism and achieve a better speedup.
C. BATCH-PCG SOLVER FOR MASSIVE NUMBERS OF SLSES
1) OVERALL DESIGN
Focusing on the demand for solving massive numbers of
SLSEs from the SSA problem, this section proposes a novel
batch-PCG solver to solve a set of Ai xi = bi simultaneously.
As illustrated in Fig. 2, well-designed strategies are adopted
to reconfigure multiple small-scale problems as a large-scale
task. Two-level parallelism is then implemented; that is,
blocks are responsible for the internal parallelism of a single
SLSE solution, and threads in the block are responsible for
the external parallelism of the multi-task. From these design
strategies, three benefits can be achieved:
1) A higher degree of algorithm parallelism is achieved
by packaging massive subtasks into a new
larger-scale calculation problem.
2) Thread divergence is avoided. Each thread in the block
is responsible for calculations of the same complexity,
and the operations of the threads stay exactly synchronized.
3) Coalesced access to device memory is achieved. Elements that have the same position in different matrices are stored at contiguous addresses, so
the warp memory scheduler accesses data in contiguous
memory.
FIGURE 2. Reconfiguration of multi-task: package massive problems into
the GPU for parallel calculation.
The common pattern of SSA subtasks and the proposed
shared preconditioner guarantee that the flow path of vector/matrix operations keeps the same pace, which makes the
design of the batch-PCG solver feasible. The GPU-based
batch PCG algorithm is illustrated in Fig. 3. Vectors and
matrices are expanded to a batch-version format and updated
simultaneously. Kernel functions provided by the batch-PCG
solver support the corresponding batch-version computing
steps. The following section presents details of the batch-version kernel functions; for example, the design of the batch
SMVM shows how to expand a single vector/matrix calculation to a batch-version operation on the GPU.
FIGURE 3. GPU-based batch PCG algorithm.
FIGURE 4. GPU-based SMVM for single SLSE.
2) BATCH SMVM
Since the batch-version algorithm is actually optimized from
a reconfiguration of multiple single problems, it is necessary
to analyze the parallel algorithm for a single task first. SMVM
is the most time-consuming calculation in the PCG iteration.
Algorithm 1 gives the implementation of the GPU-based parallel SMVM algorithm for a single SLSE. It turns the SMVM
into n sparse vector multiplications, where n is the number
of rows in the sparse matrix. The rows can be computed
independently, which provides intrinsic parallelism. As shown in Fig. 4,
one thread is responsible for one row’s calculation; that is,
thread-level parallel computing completes the solving of the
single problem.
Algorithm 1 GPU-Based Single-SMVM Algorithm
1: i ← thread ID in thread grid;
2: cur_row = CSR_Row[i];
3: next_row = CSR_Row[i + 1];
4: for j = cur_row: (next_row-1) do
5: cur_col = CSR_Col[j];
6: b[i] += CSR_Val[j]∗ x[cur_col];
7: end for
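The row-per-thread scheme of Algorithm 1 can be emulated serially to make the CSR indexing explicit. In the sketch below, each iteration of the outer loop plays the role of one GPU thread; the function name and the 4-by-4 matrix are hypothetical and are not the matrix of Fig. 4.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute b = A x for A in CSR format, one output row per 'thread'."""
    n = len(row_ptr) - 1
    b = [0.0] * n
    for i in range(n):  # thread i handles row i
        # The loop length differs from row to row, which is the source
        # of the unequal per-thread work discussed in the text.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            b[i] += vals[j] * x[col_idx[j]]
    return b


# Hypothetical matrix:
# [[ 4, -1,  0, -1],
#  [-1,  4,  0,  0],
#  [ 0,  0,  3, -1],
#  [-1,  0, -1,  5]]
row_ptr = [0, 3, 5, 7, 10]
col_idx = [0, 1, 3, 0, 1, 2, 3, 0, 2, 3]
vals = [4.0, -1.0, -1.0, -1.0, 4.0, 3.0, -1.0, -1.0, -1.0, 5.0]
x_vec = [1.0, 2.0, 3.0, 4.0]
b_vec = csr_spmv(row_ptr, col_idx, vals, x_vec)
```

Here rows 0 and 1 hold 3 and 2 non-zero elements, so the corresponding threads would execute 3 and 2 loop iterations, and neighbouring threads start reading `vals` at offsets 0, 3, 5 and 7 rather than at adjacent addresses.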
Unfortunately, this parallel SMVM algorithm has shortcomings. First, Algorithm 1 fails to achieve coalesced
access. When the threads request the positions of their corresponding columns,
the discontinuous column data cannot be coalesced into
one request because of the compressed sparse row (CSR) format.
Take the matrix shown in Fig. 4 as an example; the first
columns of row 1, row 2 and row 3 are stored at
the 1st, 4th and 6th positions, respectively. These scattered memory addresses
prevent coalesced access. Second, different
numbers of non-zero elements in each row lead to different
numbers of multiply-add loop iterations. For example,
row 0 and row 1 of the matrix have 3 and 2 non-zero
elements respectively, so thread 0 and thread 1 need 3 and
2 loop iterations respectively. Consequently, the computing time of
each thread is different. In other words, a thread that has
completed its task early sits idle and has
to wait for all the other threads, which causes thread
divergence.
To overcome the above shortcomings, three strategies are
introduced to design the batch SMVM, as shown in Fig. 5. First,
considering the similarity between the Jacobian matrices of
the SLSEs from the SSA problem, a uniform sparsity pattern
is designed. Second, the same-numbered rows of different
sparse matrices are assigned to the same block. For example,
the rows 0 of all sparse matrices are assigned to threads
1 to N in block 0, and the threads in block 0 perform
the same operations to calculate the elements of row 0.
Third, data of the same-numbered rows are stored at
contiguous addresses in device memory. As a result, all the
threads in the same block are able to read data contiguously
and execute the same number of loop iterations. The implementation of
batch parallel SMVM is shown in Algorithm 2. As described
above, the batch SMVM algorithm avoids thread divergence
and achieves fully coalesced memory access.
FIGURE 5. GPU-based batch SMVM for massive numbers of SLSEs.
Algorithm 2 GPU-Based Batch-SMVM Algorithm
1: bid ← block ID and tid ← thread ID in the block;
2: cur_row = CSR_Row[bid]; // block bid is responsible for row bid
3: next_row = CSR_Row[bid + 1];
4: for j = cur_row: (next_row-1) do
5: cur_col = CSR_Col[j];
6: b[bid][tid] += CSR_Val[j][tid] ∗ x[cur_col][tid];
7: end for
TABLE 4. Computing time of batch-PCG solver.
TABLE 5. Saturation analysis of batch SMVM.
FIGURE 6. Bandwidth analysis of batch SMVM.
3) PERFORMANCE ANALYSIS
The other parts of the batch-PCG solver, including the batch
vector product and batch vector addition/subtraction, can be
designed in the same way as the batch SMVM. The design
of FS&BS for the batch triangular solving is described in our
previous work [30]. Finally, we have the complete batch-PCG
solver. It uses both single-task block-level parallelism and
multi-task thread-level parallelism, so that great acceleration
can be obtained from both computational parallelism and
memory access. The performance analysis of the batch-PCG
solver is presented in Table 4.
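The data layout behind the batch SMVM of Algorithm 2 can be emulated serially as follows. This is a sketch under assumptions: the flat interleaved storage (`j * n_tasks + tid`), the function name and the two-task example are illustrative choices, and the nested loops only mimic the block and thread indices of a GPU kernel.

```python
def batch_csr_spmv(row_ptr, col_idx, vals, x, n_tasks):
    """Batch b_t = A_t x_t for n_tasks matrices sharing one sparsity pattern.

    vals[j * n_tasks + t] is the j-th stored element of matrix t, and
    x[c * n_tasks + t] is entry c of vector t, so the values that the
    threads of one block read together are adjacent in memory.
    """
    n_rows = len(row_ptr) - 1
    b = [0.0] * (n_rows * n_tasks)
    for bid in range(n_rows):        # one block per matrix row
        for tid in range(n_tasks):   # one thread per subtask
            acc = 0.0
            # Every thread in the block runs the same number of iterations.
            for j in range(row_ptr[bid], row_ptr[bid + 1]):
                acc += vals[j * n_tasks + tid] * x[col_idx[j] * n_tasks + tid]
            b[bid * n_tasks + tid] = acc
    return b


# Shared (hypothetical) sparsity pattern of a 4-by-4 matrix.
row_ptr = [0, 3, 5, 7, 10]
col_idx = [0, 1, 3, 0, 1, 2, 3, 0, 2, 3]
base_vals = [4.0, -1.0, -1.0, -1.0, 4.0, 3.0, -1.0, -1.0, -1.0, 5.0]
base_x = [1.0, 2.0, 3.0, 4.0]

# Two subtasks: task 0 uses base_vals, task 1 uses the same pattern scaled by 2.
n_tasks = 2
batch_vals = [s * a for a in base_vals for s in (1.0, 2.0)]
batch_x = [xi for xi in base_x for _ in range(n_tasks)]
batch_b = batch_csr_spmv(row_ptr, col_idx, batch_vals, batch_x, n_tasks)
```

The index arrays `row_ptr` and `col_idx` are stored once for the whole batch, which is possible only because of the uniform sparsity pattern; the result for row `bid` of task `tid` sits at `batch_b[bid * n_tasks + tid]`.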
Performance of the batch-PCG solver is evaluated by the
computing time for different batch sizes. Case results listed
in Table 4 show the same characteristic: the computing time
only increases slightly for a rapid increase in the batch size.
For example, when the batch size in Case 4 grows from 1 to
256, the computing time increases by a factor of only 3.36 (from
3.2196 ms to 10.8330 ms). When the batch size in Case 4 is
1024, the batch-PCG solver consumes only 0.0335 ms per
SLSE, which is 24.10 times faster than Solver 2 (that is,
0.8075 ms) on a CPU platform.
The batch computing problem is a memory-bound task.
It should be noted that there is a saturation point beyond which the
performance of the batch-PCG solver fails to improve further;
that is, a larger batch size does not necessarily mean better solving
efficiency. This is because the average occupancy of each streaming multiprocessor (SM)
and the global memory access bandwidth of the GPU affect
the execution efficiency of the kernel functions during the
calculation process [31]. Average computing time of batch
SMVM for different batch sizes is listed in Table 5, and a
bandwidth analysis is presented in Fig. 6. The figure shows
that the bandwidth tends to saturate with increasing batch
size, so the average computing time decreases slowly
or even levels off. For example, when the batch size in
Case 4 increases to 512, the average computing time
approaches its minimal value, 2.7760 µs. The saturation point
is a very important index that can guide us on how many
GPUs are needed for a specific computation task or how to
separate a large-scale task into subtasks that can be processed
simultaneously on several GPU cards.
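As a simple illustration of how a saturation point might guide task separation, the contingency set can be cut into batches no larger than the saturating batch size. The sketch below is an assumption about one possible scheduling, with a hypothetical function name; it uses the Case 4 figures reported in this paper (11062 contingencies, batch size 512).

```python
def split_into_batches(n_tasks, batch_size):
    """Return half-open index ranges (start, end) covering n_tasks subtasks."""
    return [(start, min(start + batch_size, n_tasks))
            for start in range(0, n_tasks, batch_size)]


# Case 4: 11062 contingencies, batch size chosen at the saturation point.
batches = split_into_batches(11062, 512)
```

Each range could then be dispatched to a GPU in turn, or distributed over several GPU cards; how the ranges are assigned is left open here.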
IV. GPU-ACCELERATED BATCH-ACPF SOLUTION FOR SSA
We propose a complete GPU-accelerated batch-ACPF solution for N-1 SSA based on the well-designed batch-PCG
solver. The following section gives the overall framework,
and then presents case studies and performance analysis.
A. OVERALL FRAMEWORK
As shown in Fig. 7, the overall process framework given here
introduces a GPU-accelerated batch-ACPF solution for SSA.
The main design considerations include:
B. CASE STUDY AND PERFORMANCE COMPARISON
Table 6 presents the total computing time of three SSA
solutions. Solution 1 is a CPU SSA solution based on
complete LU supported by Solver 1. Solution 2 is a CPU
SSA solution based on rank-one update supported by the
multi-threaded Solver 2 described in Section III. Solution 3 is
a GPU SSA solution supported by the batch-PCG solver. The
numbers of contingencies in the four cases are 177, 1728,
4204, and 11062, respectively. Because the rank-one update method
solves a single SLSE faster than LU decomposition, Solution 2
with the multi-threaded Solver 2 is the superior SSA solution
on the CPU platform. However, the performance comparison
indicates that the GPU-based SSA solution with the
batch-PCG solver outperforms the CPU
multi-threaded SSA solution. For example, Solution 3
consumes only 40.9540 s on the SSA problem in Case 4, which is
about 4.90 times faster than Solution 2 with the 8-threaded
solver.
The detailed test results for Case 4 are listed in Table 7,
which presents the computing time for each step in SSA.
The case study compares the performance of Solution 2
with the multi-threaded solver and Solution 3 with the
batch-PCG solver. Although the GPU solution sacrifices
some time to I/O and data transmission, the acceleration of the key steps makes up for this
TABLE 6. Performance comparison of different solutions.
FIGURE 7. Flow of GPU-accelerated batch-ACPF solution for SSA.
(1) Prepare pre-contingency data. Generate ground state
Jacobi matrices $B'_0$ and $B''_0$ and decompose them to $L'_0$, $U'_0$
and $L''_0$, $U''_0$ as a shared preconditioner for Jacobi matrices of
all contingencies.
(2) Generate a critical contingency set $S$, mismatch set
$\{\Delta P_i, \Delta Q_i \mid i \in S\}$, Jacobi matrix set $\{B'_i, B''_i \mid i \in S\}$ on the
GPU.
(3) Use the batch-PCG solver described in Section III
to accelerate the solving of sparse linear equation sets
$\{B'_i \Delta\delta_i = \Delta P_i \mid i \in S\}$ and $\{B''_i \Delta U_i = \Delta Q_i \mid i \in S\}$, and
then update the voltage amplitude and phase angle vectors
to generate new mismatch sets.
(4) The convergence criterion for each ACPF is $\|x_i\|_\infty < 10^{-8}$
and the stop criterion for the batch-ACPF is when 95%
of ACPFs have converged or the batch-ACPF solver has
reached the maximum number of iterations (30 times in this
paper).
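Steps (1), (3) and (4) can be sketched on a toy network. The code below is a serial, dense, pure-Python stand-in for the GPU batch-PCG solver, not the authors' implementation. The 4-bus network, the branch weights, the right-hand side, the PCG tolerance of 1e-8 and every function name are assumptions made for illustration. The base-case matrix is factorized once, and its LU factors precondition the PCG solve of every single-branch outage.

```python
def lu_factor(a):
    """Doolittle LU without pivoting; L (unit lower) and U share one matrix."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Solve (L U) x = b by forward and back substitution."""
    n = len(b)
    x = b[:]
    for i in range(n):
        x[i] -= sum(lu[i][j] * x[j] for j in range(i))
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(a, b, lu, tol=1e-8, max_iter=30):
    """PCG for a x = b; the preconditioner solve reuses the shared LU factors."""
    x, r = [0.0] * len(b), b[:]
    z = lu_solve(lu, r)
    p, rz = z[:], dot(r, z)
    for it in range(1, max_iter + 1):
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if max(abs(ri) for ri in r) < tol:
            return x, it, True
        z = lu_solve(lu, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter, False


def build_b(n, branches, skip=None):
    """Toy susceptance-like matrix; bus index -1 denotes the slack bus."""
    b = [[0.0] * n for _ in range(n)]
    for k, (i, j, w) in enumerate(branches):
        if k == skip:          # branch k is out of service
            continue
        for u in (i, j):
            if u >= 0:
                b[u][u] += w
        if i >= 0 and j >= 0:
            b[i][j] -= w
            b[j][i] -= w
    return b


def batch_should_stop(converged_flags, iteration, ratio=0.95, max_iter=30):
    """Step (4): stop when 95% of tasks have converged or at the iteration cap."""
    return sum(converged_flags) >= ratio * len(converged_flags) or iteration >= max_iter


branches = [(-1, 0, 10.0), (0, 1, 5.0), (1, 2, 4.0), (2, 3, 8.0),
            (3, -1, 6.0), (0, 2, 3.0), (1, 3, 2.0)]
n_bus = 4
rhs = [1.0, -0.5, 0.3, -0.8]

shared_lu = lu_factor(build_b(n_bus, branches))   # step (1): factorize once
outages = range(len(branches))                    # one outage per branch
results = [pcg(build_b(n_bus, branches, skip=k), rhs, shared_lu)
           for k in outages]                      # step (3), done serially here
flags = [ok for _, _, ok in results]              # stand-in for per-ACPF flags
print("PCG iterations per outage:", [it for _, it, _ in results])
print("stop batch:", batch_should_stop(flags, iteration=1))
```

In this toy case each outage changes the base matrix by a rank-one term, so the preconditioned matrix has at most two distinct eigenvalues and PCG needs at most two iterations in exact arithmetic. That is one intuition for why a shared base-case preconditioner can work well across contingencies.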
FIGURE 8. Proportion of computing time for different steps.
TABLE 7. Pseudo-code of SSA with average runtime and speedup (test case is a 10828-bus system).
loss: On the one hand, in the step which generates the initial
power mismatch set and the Jacobi matrix set, Solution 3
achieves about 5.56 times speedup relative to Solution 2.
On the other hand, and more importantly, the well-designed
batch-PCG solver consumes only 0.1492 ms on average for
each iteration (while the CPU solver consumes 1.7632 ms),
which consequently achieves an overall speedup of 4.65 times
for the power flow iterations. Figure 8 gives the proportion
of the computing time for Solution 3. With the increase in
the case dimension, the proportion taken by preconditioner
generation for the iterative solver decreases, from 10.10% in
Case 1 down to 2.00% in Case 4. In contrast, the share
of ACPF iterations with the batch-PCG solver increases
from 80.79% to 95.20%. This shows that power flow
iterations are the main calculation burden and take up most
of the computing time in the SSA problem. This reflects the
distinct advantage of the proposed batch-PCG solver. Even
on a large-scale system, the batch framework still shows its
superiority, and this is one of the most important reasons that
the GPU-based SSA solution achieves a better performance.
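The quoted totals and shares can be turned into absolute times for Case 4. The sketch below uses only figures stated in the text and assumes that the 95.20% and 2.00% shares both refer to Solution 3 in Case 4. The derived values are rough estimates, not results reported by the authors, and the variable names are illustrative.

```python
# Figures quoted in the text for Case 4.
total_s = 40.9540          # Solution 3 total time, seconds
speedup_vs_cpu = 4.90      # Solution 3 over Solution 2 ("about")
contingencies = 11062      # number of contingencies
share_iterations = 0.9520  # ACPF iterations (assumed to be the Case 4 share)
share_precond = 0.0200     # preconditioner generation, Case 4

iter_s = total_s * share_iterations            # time spent in ACPF iterations
precond_s = total_s * share_precond            # time spent building the preconditioner
per_contingency_ms = 1000 * total_s / contingencies
implied_cpu_s = total_s * speedup_vs_cpu       # implied Solution 2 total

print(f"ACPF iterations:        {iter_s:.2f} s")
print(f"preconditioner:         {precond_s:.2f} s")
print(f"average per contingency: {per_contingency_ms:.2f} ms")
print(f"implied Solution 2 time: {implied_cpu_s:.1f} s")
```

Under these assumptions the shared preconditioner costs under one second in total, while the iterations account for roughly 39 s of the 40.954 s run.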
V. CONCLUSION
This paper aims to improve the computing efficiency of
solving massive numbers of ACPFs in N-1 SSA. A shared
preconditioner for the SSA problem is proposed to reduce the
preprocessing time and improve the convergence of the iterative solver. The proposed GPU-accelerated
batch-PCG solver achieves a higher degree of parallelism and
better coalesced memory access. The paper then presents a
novel GPU-accelerated batch-ACPF solution for N-1 SSA.
Case studies on a practical 10828-bus system show that the
GPU-based batch-ACPF solution with the batch-PCG solver
is 4.90 times faster than the solution on a multi-core CPU.
The GPU-based high-performance SSA solution can provide
reference values for power grid security analysis for the
Energy Internet.
REFERENCES
[1] F. Li, H. Li, Y. Yu, and S. Feng, ‘‘Fast computing technologies for static
security checking based on parallel computation and data reuse,’’ Automat.
Electr. Power Syst., vol. 37, no. 14, pp. 75–80, 2013.
[2] R. C. Green, L. Wang, and M. Alam, ‘‘Applications and trends of high
performance computing for electric power systems: Focusing on smart
grid,’’ IEEE Trans. Smart Grid, vol. 4, no. 2, pp. 922–931, Jun. 2013.
[3] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, ‘‘NVIDIA tesla:
A unified graphics and computing architecture,’’ IEEE Micro, vol. 28,
no. 2, pp. 39–55, Mar. 2008.
[4] B. Shang, Y. Xu, C. Zhang, Y. Chen, Z. Liu, L. Lin, C. Xu, and J. Yu,
‘‘GPU-accelerated batch solution for short-circuit current calculation of
large-scale power systems,’’ in Proc. IEEE 3rd Int. Electr. Energy Conf.
(CIEEC), Beijing, China, Sep. 2019, pp. 1743–1748.
[5] P. Sao, R. Vuduc, and X. S. Li, ‘‘A distributed CPU-GPU sparse direct
solver,’’ in Proc. Eur. Conf. Parallel Process. Cham, Switzerland: Springer,
2014, pp. 487–498.
[6] X. S. Li and J. W. Demmel, ‘‘SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems,’’ ACM Trans.
Math. Softw., vol. 29, no. 2, pp. 110–140, Jun. 2003.
[7] R. Idema, G. Papaefthymiou, D. Lahaye, C. Vuik, and L. van der Sluis,
‘‘Towards faster solution of large power flow problems,’’ IEEE Trans.
Power Syst., vol. 28, no. 4, pp. 4918–4925, Nov. 2013.
[8] X. Li and F. Li, ‘‘GPU-based power flow analysis with Chebyshev preconditioner and conjugate gradient method,’’ Electr. Power Syst. Res., vol. 116,
pp. 87–93, Nov. 2014.
[9] N. Garcia, ‘‘Parallel power flow solutions using a biconjugate gradient
algorithm and a Newton method: A GPU-based approach,’’ in Proc. IEEE
PES Gen. Meeting, Jul. 2010, pp. 1–4.
[10] Z. Li, V. D. Donde, J.-C. Tournier, and F. Yang, ‘‘On limitations of
traditional multi-core and potential of many-core processing architectures
for sparse linear solvers used in large-scale power system applications,’’ in
Proc. IEEE Power Energy Soc. Gen. Meeting, Jul. 2011, pp. 1–8.
[11] V. Roberge, M. Tarbouchi, and F. Okou, ‘‘Parallel power flow on graphics
processing units for concurrent evaluation of many networks,’’ IEEE Trans.
Smart Grid, vol. 8, no. 4, pp. 1639–1648, Jul. 2017.
[12] E. Belič, N. Lukač, K. Deželak, B. Žalik, and G. Štumberger, ‘‘GPU-based
online optimization of low voltage distribution network operation,’’ IEEE
Trans. Smart Grid, vol. 8, no. 3, pp. 1460–1468, May 2017.
[13] G. Zhou, Y. Feng, R. Bo, L. Chien, X. Zhang, Y. Lang, Y. Jia, and Z. Chen,
‘‘GPU-accelerated batch-ACPF solution for N-1 static security analysis,’’
IEEE Trans. Smart Grid, vol. 8, no. 3, pp. 1406–1416, May 2017.
[14] M. Abdelaziz, ‘‘GPU-OpenCL accelerated probabilistic power flow analysis using Monte-Carlo simulation,’’ Electr. Power Syst. Res., vol. 147,
pp. 70–72, Jun. 2017.
[15] S. Cook, CUDA Programming: A Developer’s Guide to Parallel Computing With GPUs. San Mateo, CA, USA: Morgan Kaufmann, 2012.
[16] NVIDIA Corporation. NVIDIA CUDA C Programming Guide.
Accessed: Sep. 15, 2019. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[17] R. D. Zimmerman, C. E. Murillo-Sanchez, and R. J. Thomas, ‘‘MATPOWER: Steady-state operations, planning, and analysis tools for power
systems research and education,’’ IEEE Trans. Power Syst., vol. 26, no. 1,
pp. 12–19, Feb. 2011.
[18] F. D. Leon and A. Semlyen, ‘‘Iterative solvers in the Newton power flow
problem: Preconditioners, inexact solutions and partial Jacobian updates,’’
IEE Proc.-Gener., Transmiss. Distrib., vol. 149, no. 4, pp. 479–484,
Jul. 2002.
[19] A. B. Alves, E. N. Asada, and A. Monticelli, ‘‘Critical evaluation of direct
and iterative methods for solving Ax=b systems in power flow calculations
and contingency analysis,’’ in Proc. 21st Int. Conf. Power Ind. Comput.
Appl. Connecting Utilities Millennium Beyond, May 1999, pp. 15–21.
[20] A. Semlyen, ‘‘Fundamental concepts of a Krylov subspace power flow
methodology,’’ IEEE Trans. Power Syst., vol. 11, no. 3, pp. 1528–1537,
Aug. 1996.
[21] T. Cui and F. Franchetti, ‘‘Power system probabilistic and security analysis
on commodity high performance computing systems,’’ in Proc. 3rd Int.
Workshop High Perform. Comput., Netw. Anal. Power Grid (HiPCNA-PG),
2013, pp. 1–10.
[22] R. A. M. van Amerongen, ‘‘A general-purpose version of the fast decoupled
load flow,’’ IEEE Trans. Power Syst., vol. 4, no. 2, pp. 760–770, May 1989.
[23] B. Stott and O. Alsac, ‘‘Fast decoupled load flow,’’ IEEE Trans. Power
App. Syst., vol. PAS-93, no. 3, pp. 859–869, May 1974.
[24] J. R. Shewchuk, ‘‘An introduction to the conjugate gradient method without
the agonizing pain,’’ School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-94-125, 1994.
[25] J. L. Nazareth, Conjugate-Gradient Methods. New York, NY, USA:
Springer, 2001.
[26] NVIDIA Corporation. CUBLAS | NVIDIA Developer Zone.
Accessed: Sep. 15, 2019. [Online]. Available: https://developer.nvidia.com/cublas/
[27] NVIDIA Corporation. CUSPARSE | NVIDIA Developer Zone.
Accessed: Sep. 15, 2019. [Online]. Available: https://developer.nvidia.com/cusparse/
[28] T. A. Davis and E. P. Natarajan, ‘‘Algorithm 907: KLU, a direct sparse
solver for circuit simulation problems,’’ ACM Trans. Math. Softw., vol. 37,
no. 3, pp. 1–17, Sep. 2010.
[29] J. K. Reid, ‘‘A sparsity-exploiting variant of the Bartels–Golub decomposition for linear programming bases,’’ Math. Program., vol. 24, no. 1,
pp. 55–69, Dec. 1982.
[30] G. Zhou, Y. Feng, R. Bo, and T. Zhang, ‘‘GPU-accelerated sparse matrices
parallel inversion algorithm for large-scale power systems,’’ Int. J. Electr.
Power Energy Syst., vol. 111, pp. 34–43, Oct. 2019.
[31] A. R. Brodtkorb, T. R. Hagen, and M. L. Sætra, ‘‘Graphics processing unit
(GPU) programming strategies and trends in GPU computing,’’ J. Parallel
Distrib. Comput., vol. 73, no. 1, pp. 4–13, Jan. 2013.
GAN ZHOU (Member, IEEE) received the M.S.
and Ph.D. degrees from the School of Electrical Engineering, Southeast University, Nanjing,
China, in 2003 and 2009, respectively.
He is currently an Associate Professor with
the School of Electrical Engineering, Southeast
University. He has authored or coauthored over
40 articles in refereed journals and conference
proceedings. His current research interests include
CPU+GPU hybrid computing architecture and
high-performance computing in power systems.
MENG FU received the B.S. and M.S. degrees
from the School of Electrical Engineering, Southeast University, Nanjing, China, in 2004 and 2007,
respectively, where she is currently pursuing the
Ph.D. degree.
Her current research interests include power
systems analysis, large scale linear systems, and
high-performance computing in power systems.
KAI LIANG is currently an Engineer with State
Grid Anshan Electric Power Supply Company. His
current research interests include power systems
analysis and control, electric power dispatching,
and big data technology in power systems.
JIAHAO ZHAO received the B.S. degree in electrical engineering from the East China University of Science and Technology, Shanghai, China,
in 2018. He is currently pursuing the M.S. degree
in electrical engineering with the School of Electrical Engineering, Southeast University, Nanjing,
China. His current research interests include power
network analysis and high-performance parallel
computing in power systems.
YANJUN FENG received the B.S. degree in electrical engineering from the University of Electronic Science and Technology of China, Chengdu,
China, in 2013, and the M.S. degree from Southeast University, Nanjing, China, in 2016, where he
is currently pursuing the Ph.D. degree.
His current research interests include power
network analysis and GPU parallel computing in
power systems.
HUAN HE is currently an Engineer with State
Grid Anshan Electric Power Supply Company. His
current research interests include power systems
analysis and control, electric power dispatching,
and big data technology in power systems.