Hardware Acceleration for Load Flow Computation

Jeremy Johnson, Chika Nwankpa and Prawat Nagvajara
Kevin Cunningham, Tim Chagnon, Petya Vachranukunkiet
Computer Science and Electrical and Computer Engineering
Drexel University
Eighth Annual CMU Conference on the Electricity Industry
March 13, 2012
Overview
- Problem and Goals
- Background
- LU Hardware
- Merge Accelerator Design
- Accelerator Architectures
- Performance Results
- Conclusions
Problem
- Problem: Sparse Lower-Upper (LU) Triangular Decomposition performs inefficiently on general-purpose processors (irregular data access, indexing overhead, less than 10% FP efficiency on power systems) [3, 12]
- Solution: Custom-designed hardware can provide better performance (3X speedup over a Pentium 4 at 3.2GHz) [3, 12]
- Problem: Complex custom hardware is difficult, expensive, and time-consuming to design
- Solution: Use a balance of software and hardware accelerators (e.g., FPU, GPU, encryption, video/image encoding)
- Problem: Need an architecture that efficiently combines general-purpose processors and accelerators

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.
Goals
- Increase performance of sparse LU over existing software
- Design a flexible, easy-to-use hardware accelerator that can integrate with multiple platforms
- Find an architecture that efficiently uses the accelerator to improve sparse LU
Summary of Results and Contributions
- Designed and implemented a merge hardware accelerator [5]
  - Data rate approaches one element per cycle
- Designed and implemented supporting hardware [6]
- Implemented a prototype Triple Buffer Architecture [5]
  - Efficiently utilizes the external transfer bus
  - Not suitable for the sparse LU application
  - Advantageous for other applications
- Implemented a prototype heterogeneous multicore architecture [6]
  - Combines general-purpose processor and reconfigurable hardware cores
  - Speedup of 1.3X over sparse LU software with the merge accelerator
- Modified the Data Pump Architecture (DPA) simulator
  - Added a merge instruction and support
  - Implemented sparse LU on the DPA
  - Speedup of 2.3X over sparse LU software with the merge accelerator

[5] Cunningham, Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers.
[6] Cunningham, Nagvajara, Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation.
Power Systems
- The power grid delivers electrical power from generation stations to distribution stations
- Power grid stability is analyzed with the Power Flow (aka Load Flow) calculation [1]
- System nodes and voltages create a system of equations, represented by a sparse matrix [1]
- Solving the system allows grid stability to be analyzed
- LU decomposition accounts for a large part of the Power Flow calculation [12]
- The Power Flow calculation is repeated thousands of times to perform a full grid analysis [1]

(Figure: Power Flow execution profile [12])

[1] A. Abur. Power System State Estimation: Theory and Implementation.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.
Power System Matrices

Power System Matrix Properties

System       # Rows/Cols     NNZ       Sparsity
1648 Bus        2,982        21,196     0.238%
7917 Bus       14,508       105,522     0.050%
10279 Bus      19,285       134,621     0.036%
26829 Bus      50,092       351,200     0.014%

Power System LU Statistics

System       Avg. NNZ     Avg. Num          % Merge
             per Row      Submatrix Rows
1648 Bus        7.1          8.0             65.6%
7917 Bus        7.3          8.5             54.3%
10279 Bus       7.0          8.3             55.8%
26829 Bus       7.0          8.7             62.7%
Power System LU
- Gaussian LU outperforms UMFPACK for power matrices [3]
- As matrix size grows, multi-frontal becomes more effective
- The merge is a bottleneck in Gaussian LU performance [3]
- Goal: Design a hardware accelerator for the merge in Gaussian LU

(Figure: speedup of Gaussian LU vs UMFPACK on an Intel Core i7 at 3.2GHz for the 1648, 7917, 10279, and 26829 bus power systems [3])

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.
Sparse Matrices
- A matrix is sparse when it has a large number of zero elements
- Below is a small section of the 26K-Bus power system matrix

        | 122.03  -61.01   -5.95       0       0 |
        | -61.01  277.93   19.38       0       0 |
    A = |   5.95  -19.56  275.58       0       0 |
        |      0       0       0  437.82   67.50 |
        |      0       0       0  -67.50  437.81 |
Sparse Compressed Formats
- Sparse matrices use a compressed format to ignore zero elements
- Saves storage space
- Decreases amount of computation

Row    Column:Value
0      0:122.03   1:-61.01   2:-5.95
1      0:-61.01   1:277.93   2:19.38
2      0:5.95     1:-19.56   2:275.58
3      3:437.82   4:67.50
4      3:-67.50   4:437.81

(a small C sketch of this layout follows below)
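To make the compressed layout concrete, here is a small C sketch (an illustration only, not the project's actual data structures) that stores the five example rows as (column, value) pairs and walks them:

#include <stdio.h>

/* Illustrative compressed-row storage: each row keeps only its non-zero
   entries as (column, value) pairs, in increasing column order. */
typedef struct { int col; double val; } elem_t;
typedef struct { int len; const elem_t *data; } row_t;

static const elem_t row0[] = {{0, 122.03}, {1, -61.01}, {2, -5.95}};
static const elem_t row1[] = {{0, -61.01}, {1, 277.93}, {2, 19.38}};
static const elem_t row2[] = {{0, 5.95},   {1, -19.56}, {2, 275.58}};
static const elem_t row3[] = {{3, 437.82}, {4, 67.50}};
static const elem_t row4[] = {{3, -67.50}, {4, 437.81}};

static const row_t A[5] = {{3, row0}, {3, row1}, {3, row2}, {2, row3}, {2, row4}};

int main(void) {
    /* Only 13 of the 25 entries are stored; the zeros are implicit. */
    for (int i = 0; i < 5; i++) {
        printf("row %d:", i);
        for (int j = 0; j < A[i].len; j++)
            printf("  %d:%.2f", A[i].data[j].col, A[i].data[j].val);
        printf("\n");
    }
    return 0;
}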
Sparse LU Decomposition

        | 122.03  -61.01   -5.95       0       0 |
        | -61.01  277.93   19.38       0       0 |
    A = |   5.95  -19.56  275.58       0       0 |
        |      0       0       0  437.82   67.50 |
        |      0       0       0  -67.50  437.81 |

        |  1.00      0      0      0      0 |
        | -0.50   1.00      0      0      0 |
    L = |  0.05  -0.07   1.00      0      0 |
        |     0      0      0   1.00      0 |
        |     0      0      0  -0.15   1.00 |

        | 122.03  -61.01   -5.95       0       0 |
        |      0  247.42   16.41       0       0 |
    U = |      0       0  276.97       0       0 |
        |      0       0       0  437.82   67.50 |
        |      0       0       0       0  448.22 |
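As a quick check on the factors above, the first elimination step reproduces the entries shown in L and U (all values rounded to two decimals):

    l21 = a21 / a11 = -61.01 / 122.03 ≈ -0.50
    u22 = a22 - l21 * a12 = 277.93 - (-0.50)(-61.01) ≈ 247.42
    u23 = a23 - l21 * a13 = 19.38 - (-0.50)(-5.95) ≈ 16.41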
Sparse LU Methods
- Approximate Minimum Degree (AMD) algorithm pre-orders matrices to reduce fill-in [2]
- Multiple methods for performing sparse LU:
  - Multi-frontal
    - Used by UMFPACK [7]
    - Divides the matrix into multiple, independent, dense blocks
  - Gaussian Elimination
    - Eliminates elements below the diagonal by scaling and adding rows together

[2] Amestoy et al. Algorithm 837: AMD, an approximate minimum degree ordering algorithm.
[7] T. Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method.
Sparse LU Decomposition

Algorithm 1: Gaussian LU
for i = 1 → N do
    pivot_search()
    update_U()
    for j = 1 → NUM_SUBMATRIX_ROWS do
        merge(pivot_row, j)
        update_L()
        update_colmap()
    end for
end for
Column Map
- The column map keeps track of the rows with non-zero elements in each column
- No need to search the matrix on each iteration
- Fast access to all rows in a column
- Select the pivot row from the set of rows
- The column map is updated after each merge
- Fill-in adds new elements to a row during the merge
(a small C sketch of the column map follows below)
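A minimal C sketch of the column-map idea, using the same (column, value) row layout as before; the real LU bookkeeping is more involved (rows must also be removed as they are eliminated), but the core structure is a per-column list of row indices:

#include <stdio.h>

#define N 5          /* matrix dimension of the small example */
#define MAX_ROWS 5   /* upper bound on rows listed per column */

typedef struct { int col; double val; } elem_t;
typedef struct { int len; const elem_t *data; } row_t;

/* colmap[c] lists the rows that currently have a non-zero in column c. */
typedef struct { int count; int rows[MAX_ROWS]; } collist_t;

static void build_colmap(const row_t *A, int n, collist_t *colmap) {
    for (int c = 0; c < n; c++) colmap[c].count = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < A[i].len; j++) {
            collist_t *cl = &colmap[A[i].data[j].col];
            cl->rows[cl->count++] = i;
        }
}

int main(void) {
    static const elem_t r0[] = {{0, 122.03}, {1, -61.01}, {2, -5.95}};
    static const elem_t r1[] = {{0, -61.01}, {1, 277.93}, {2, 19.38}};
    static const elem_t r2[] = {{0, 5.95},   {1, -19.56}, {2, 275.58}};
    static const elem_t r3[] = {{3, 437.82}, {4, 67.50}};
    static const elem_t r4[] = {{3, -67.50}, {4, 437.81}};
    const row_t A[N] = {{3, r0}, {3, r1}, {3, r2}, {2, r3}, {2, r4}};

    collist_t colmap[N];
    build_colmap(A, N, colmap);

    /* Candidate pivot rows for column 0 can be read off directly,
       with no search over the whole matrix. */
    printf("rows with a non-zero in column 0:");
    for (int k = 0; k < colmap[0].count; k++)
        printf(" %d", colmap[0].rows[k]);
    printf("\n");
    return 0;
}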
Sparse LU Decomposition

(Figure: merging two sparse rows, Row u and Row v, into Row u+v by aligning their column indices 0-7)

Challenges with software merging:
- Indexing overhead
- Column numbers are different from row array indices
- Must fetch column numbers from memory to operate on elements
- Cache misses
- Data-dependent branching
Sparse LU Decomposition
Merge Algorithm

int p, s;
int rnz = 0, fnz = 0;                 /* merged and fill-in element counts */
for (p = 1, s = 1; p < pivotLen && s < nonpivotLen; ) {
    if (nonpivot[s].col == pivot[p].col) {
        /* Columns match: subtract the scaled pivot element. */
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = nonpivot[s].val - lx * pivot[p].val;
        rnz++; p++; s++;
    } else if (nonpivot[s].col < pivot[p].col) {
        /* Non-pivot element with no pivot counterpart: copy through. */
        merged[rnz].col = nonpivot[s].col;
        merged[rnz].val = nonpivot[s].val;
        rnz++; s++;
    } else {
        /* Pivot element with no non-pivot counterpart: fill-in. */
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = -lx * pivot[p].val;
        fillin[fnz] = pivot[p].col;
        rnz++; fnz++; p++;
    }
}
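For experimenting with the loop above off the slide, here is a self-contained version; the element layout, the function wrapper, and the trailing copy of whichever row is left over when the loop exits are assumptions of this sketch, not taken from the LU hardware source:

#include <stdio.h>

typedef struct { int col; double val; } elem_t;

/* Merge a scaled pivot row into a non-pivot row, column by column.
   Index 0 of each row is assumed to hold the eliminated column, so the
   merge starts at p = s = 1, as in the loop on the slide. */
static int merge_row(const elem_t *pivot, int pivotLen,
                     const elem_t *nonpivot, int nonpivotLen,
                     double lx, elem_t *merged, int *fillin, int *fnz_out) {
    int p = 1, s = 1, rnz = 0, fnz = 0;
    while (p < pivotLen && s < nonpivotLen) {
        if (nonpivot[s].col == pivot[p].col) {        /* columns match    */
            merged[rnz].col = pivot[p].col;
            merged[rnz].val = nonpivot[s].val - lx * pivot[p].val;
            rnz++; p++; s++;
        } else if (nonpivot[s].col < pivot[p].col) {  /* non-pivot only   */
            merged[rnz++] = nonpivot[s++];
        } else {                                      /* fill-in element  */
            merged[rnz].col = pivot[p].col;
            merged[rnz].val = -lx * pivot[p].val;
            fillin[fnz++] = pivot[p].col;
            rnz++; p++;
        }
    }
    while (s < nonpivotLen)                           /* leftover non-pivot elements */
        merged[rnz++] = nonpivot[s++];
    while (p < pivotLen) {                            /* leftover pivot elements are fill-in */
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = -lx * pivot[p].val;
        fillin[fnz++] = pivot[p].col;
        rnz++; p++;
    }
    *fnz_out = fnz;
    return rnz;
}

int main(void) {
    /* Pivot row holds columns {0,1,3}; non-pivot row holds columns {0,2,3}. */
    elem_t pivot[]    = {{0, 4.0}, {1, 2.0}, {3, 1.0}};
    elem_t nonpivot[] = {{0, 2.0}, {2, 5.0}, {3, 3.0}};
    double lx = nonpivot[0].val / pivot[0].val;       /* 0.5 */

    elem_t merged[8];
    int fillin[8], fnz = 0;
    int rnz = merge_row(pivot, 3, nonpivot, 3, lx, merged, fillin, &fnz);

    /* Prints (1, -1.00) (2, 5.00) (3, 2.50) with one fill-in (column 1). */
    for (int i = 0; i < rnz; i++)
        printf("(%d, %.2f) ", merged[i].col, merged[i].val);
    printf(" fill-ins: %d\n", fnz);
    return 0;
}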
LU Hardware
- Originally designed by Petya Vachranukunkiet [12]
- Straightforward Gaussian Elimination
- Parameterized for power system matrices
- Streaming operations hide latencies
- Computation pipelined for 1 FLOP/cycle
- Memory and FIFOs used for cache and bookkeeping
- Multiple merge units possible
Pivot and Submatrix Update Logic

(Figure: block diagram of the pivot and submatrix update logic)
Custom Cache Logic
- Cache line is a whole row
- Fully-associative, write-back, FIFO replacement (a toy C model follows below)
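A toy software model of that policy (illustrative only; the line count and lookup interface are assumptions, and in the real design this is FPGA logic): fully associative over whole rows, FIFO replacement, and write-back of dirty rows on eviction.

#include <stdio.h>
#include <string.h>

#define CACHE_LINES 4          /* number of whole-row lines (assumed)     */

typedef struct {
    int row_id;                /* which matrix row this line holds        */
    int valid;
    int dirty;                 /* modified since loaded -> write back     */
} line_t;

static line_t cache[CACHE_LINES];
static int fifo_next = 0;      /* next line to evict, in FIFO order       */

/* Return the cache line holding row_id, loading it (and evicting with
   write-back) on a miss.  Data movement is only modeled by the prints. */
static int cache_access(int row_id, int is_write) {
    for (int i = 0; i < CACHE_LINES; i++)
        if (cache[i].valid && cache[i].row_id == row_id) {
            cache[i].dirty |= is_write;        /* hit */
            return i;
        }
    int victim = fifo_next;                    /* miss: FIFO replacement  */
    fifo_next = (fifo_next + 1) % CACHE_LINES;
    if (cache[victim].valid && cache[victim].dirty)
        printf("write back row %d\n", cache[victim].row_id);
    printf("load row %d\n", row_id);
    cache[victim] = (line_t){ .row_id = row_id, .valid = 1, .dirty = is_write };
    return victim;
}

int main(void) {
    memset(cache, 0, sizeof cache);
    int refs[] = {1, 2, 3, 1, 4, 5, 1};        /* row 1 is evicted, then reloaded */
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        cache_access(refs[i], /*is_write=*/1);
    return 0;
}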
Reconfigurable Hardware
- Field Programmable Gate Array (FPGA)
- Advantages:
  - Low power consumption
  - Reconfigurable
  - Low cost
  - Pipelining and parallelism
- Disadvantages:
  - Difficult to program
  - Long design compile times
  - Lower clock frequencies

(Figures: SRAM logic cell and FPGA architecture [13])

[13] Xilinx, Inc. Programmable Logic Design Quick Start Handbook.
ASIC vs FPGA

Application-Specific Integrated Circuit (ASIC):
- Higher clock frequency
- Lower power consumption
- More customization

Field Programmable Gate Array (FPGA):
- Reconfigurable
- Design updates and corrections
- Multiple designs in the same area
- Lower cost
- Less implementation time
Merge Hardware Accelerator

(Figure: merge unit datapath - U and V input FIFOs feed two compare stages (Stage 1 and Stage 2), a compare unit, and a floating point adder producing the merged output)

- Input and output BlockRAM FIFOs
- Two compare stages allow look-ahead comparison
- Pipelined floating point adder
- Merge unit is pipelined and outputs one element per cycle
- Latency is dependent on the structure and length of the input rows
Merge Hardware Manager

(Figure: merge manager - per-unit pivot and non-pivot FIFOs, a floating point multiplier scaling the pivot row, multiple merge units, and per-unit output FIFOs)

- Must manage rows going through the merge units
- Pivot row is recycled
- Floating point multiplier scales the pivot row before it enters the merge unit
- Can support multiple merge units
- Input and output rotate among merge units after each full row (a small round-robin sketch follows below)
- Rows enter and leave in the same order
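A tiny model of the round-robin handling described above (the unit and row counts are made up); because outputs are collected in the same rotating order as inputs were issued, rows leave the manager in the order they entered:

#include <stdio.h>

#define NUM_UNITS 2   /* number of merge units in this sketch */

int main(void) {
    int num_rows = 7;
    /* Rows are issued to the merge units round-robin... */
    for (int i = 0, unit = 0; i < num_rows; i++, unit = (unit + 1) % NUM_UNITS)
        printf("row %d -> merge unit %d\n", i, unit);
    /* ...and their results are drained in the same rotating order. */
    for (int i = 0, unit = 0; i < num_rows; i++, unit = (unit + 1) % NUM_UNITS)
        printf("merge unit %d -> output row %d\n", unit, i);
    return 0;
}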
Merge Performance

Software Merge Performance on Core i7 at 3.2GHz

Power System   Average Cycles   Average Data Rate
               per Element      (Millions of Elements per Second)
1648 Bus          35.3             90.77
7917 Bus          32.2             99.54
10279 Bus         31.8            100.75
26829 Bus         32.8             97.46

- Merge accelerator:
  - Outputs one element per cycle
  - Data rate approaches the hardware clock rate
  - At one element per cycle, even modest FPGA clock frequencies yield a higher merge rate than the processor (see the arithmetic below)
- Goal: Find an architecture that efficiently delivers data to the accelerator
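The table's data rates follow directly from the clock rate divided by cycles per element, while a one-element-per-cycle hardware merge runs at its own clock rate; a quick check (the 150MHz figure is only an illustration, no FPGA clock is quoted on this slide):

#include <stdio.h>

int main(void) {
    /* Effective software merge rate = clock / (cycles per element);
       this reproduces the table above to within rounding. */
    double cpu_hz = 3.2e9;
    double cycles_per_elem[] = {35.3, 32.2, 31.8, 32.8};
    for (int i = 0; i < 4; i++)
        printf("CPU merge: %6.2f M elements/s\n",
               cpu_hz / cycles_per_elem[i] / 1e6);

    /* The accelerator emits one element per cycle, so its rate is simply
       the accelerator clock, e.g. an illustrative 150 MHz FPGA clock. */
    printf("FPGA merge: %6.2f M elements/s\n", 150e6 / 1e6);
    return 0;
}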
Summary of Design Methods
- Completely software (UMFPACK, Gaussian LU)
  - Inefficient performance
- Completely hardware (LUHW on FPGA)
  - Large design time and effort
  - Less scalable
- Investigate three accelerator architectures:
  - Triple Buffer Architecture
    - Efficiently utilize the external transfer bus to a reconfigurable accelerator
  - Heterogeneous Multicore Architecture
    - Combine general-purpose and reconfigurable cores in a single package
  - Data Pump Architecture
    - Take advantage of programmable data/memory management to efficiently feed an accelerator
Accelerator Architectures

(Figure: CPU and FPGA, each with its own external memory, connected by a transfer bus)

- Common setup for a processor and FPGA
- Each device has its own external memory
- Communicate by USB, Ethernet, PCI-Express, HyperTransport, ...
- The transfer bus can be a bottleneck between the processor and FPGA
Triple Buffer Architecture

(Figure: CPU and FPGA connected by the transfer bus, with three buffers in low latency memory banks on the FPGA side)

- How to efficiently use the transfer bus?
- Break data into blocks and buffer them
- Three buffers eliminate competition on memory ports
- Allows a continuous stream of data through the FPGA accelerator
- Processor can transfer data at the bus/memory rate, not limited by the FPGA rate
(a sketch of the buffer rotation follows below)
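A sketch of the rotation that keeps all three buffers busy (a purely sequential model; in the real design the CPU transfer, FPGA processing, and result drain proceed concurrently on different buffers):

#include <stdio.h>

int main(void) {
    /* Each block period the three buffers play three different roles and
       then rotate: one is filled over the transfer bus, one is processed
       by the FPGA accelerator, one drains results.  With three buffers
       none of these ever competes for the same memory port. */
    int fill = 0, process = 1, drain = 2;     /* arbitrary starting roles */
    for (int step = 0; step < 6; step++) {
        printf("step %d: fill buf %d | process buf %d | drain buf %d\n",
               step, fill, process, drain);
        int free_buf = drain;   /* drained buffer becomes the next fill target */
        drain = process;        /* just-processed buffer drains next           */
        process = fill;         /* just-filled buffer is processed next        */
        fill = free_buf;
    }
    return 0;
}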
Heterogeneous Multicore Architecture
- Communication over an external transfer bus requires additional overhead
- The merge accelerator does not require a large external FPGA
- Closely integrate the processor and FPGA
- Share the same memory system

(Figure: reconfigurable processor architecture [8])

[8] Garcia and Compton. A Reconfigurable Hardware Interface for a Modern Computing System.
Heterogeneous Multicore Architecture

(Figure: prototype block diagram - DDR memory, a MicroBlaze CPU, and an AXI DMA module connected over AXI and AXI-Lite, with an AXI-Stream link to the reconfigurable core)

- Processor sends DMA requests to the DMA module
- DMA module fetches data from memory and streams it to the accelerator
- Accelerator streams outputs back to the DMA module to be stored to memory
(a host-side sketch follows below)
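The processor-side flow can be pictured as below. dma_to_accel() and dma_from_accel() are stand-ins for the AXI DMA driver calls (the actual driver API is not shown on the slide), stubbed here so the sketch runs on its own; how often the pivot row is re-sent depends on the accelerator's buffering and is assumed here to be once per pivot.

#include <stdio.h>
#include <stddef.h>

/* Stand-ins for the DMA driver: in the prototype these would program the
   AXI DMA engine over AXI-Lite and wait for the stream transfer to finish. */
static void dma_to_accel(const void *src, size_t bytes) {
    (void)src;
    printf("DMA %zu bytes: memory -> accelerator\n", bytes);
}
static void dma_from_accel(void *dst, size_t bytes) {
    (void)dst;
    printf("DMA %zu bytes: accelerator -> memory\n", bytes);
}

typedef struct { int col; double val; } elem_t;

int main(void) {
    elem_t pivot[8], nonpivot[8], merged[16];
    int num_submatrix_rows = 3;

    /* Stream the pivot row once, then for every submatrix row stream the
       non-pivot row in and the merged row back out. */
    dma_to_accel(pivot, sizeof pivot);
    for (int j = 0; j < num_submatrix_rows; j++) {
        dma_to_accel(nonpivot, sizeof nonpivot);
        dma_from_accel(merged, sizeof merged);
    }
    return 0;
}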
Data Pump Architecture

(Figure: Data Pump Architecture [11])

- Developed as part of the SPIRAL project
- Intended as a signal processing platform
- Processors explicitly control data movement
- No automatic data caches
- Data Processor (DP) moves data between external and local memory
- Compute Processor (CP) moves data between local memory and the vector processors

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
Data Pump Architecture

(Figure: Data Pump Architecture [11])

- Replace the vector processors with the merge accelerator
- DP brings rows from external memory into local memory
- CP sends rows to the merge accelerator and writes results to local memory
- DP stores results back to external memory

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
DPA Implementation

(Figure: Data Pump Architecture [11])

- Implemented sparse LU on the DPA simulator [10]
- DP and CP synchronize with local memory read/write counters
- Double buffer row inputs and outputs in local memory
- DP and CP at 2GHz, DDR at 1066MHz
- Single merge unit
(a threaded sketch of the counter handshake follows below)

[10] D. Jones. Data Pump Architecture Simulator and Performance Model.
[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
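A sketch of the read/write-counter handshake with double buffering, modeled here with two POSIX threads (the DPA's DP and CP are separate processors and its counters live in local memory; the buffer count, the plain-int "rows", and the spin-waiting are assumptions of this model):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NBUF 2                            /* double buffering             */
#define NROWS 8                           /* rows to push through         */

static int buffers[NBUF];                 /* stand-in for local-memory row buffers */
static atomic_int written = 0;            /* rows loaded by the DP        */
static atomic_int consumed = 0;           /* rows processed by the CP     */

/* DP: load rows into local memory, never getting more than NBUF ahead. */
static void *dp_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < NROWS; i++) {
        while (atomic_load(&written) - atomic_load(&consumed) >= NBUF)
            ;                             /* both buffers full: wait      */
        buffers[i % NBUF] = i;            /* "load row i from DDR"        */
        atomic_fetch_add(&written, 1);
    }
    return NULL;
}

/* CP: wait until the DP has loaded a row, then hand it to the merge unit. */
static void *cp_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < NROWS; i++) {
        while (atomic_load(&written) <= atomic_load(&consumed))
            ;                             /* nothing new loaded: wait     */
        printf("CP merges row %d from buffer %d\n", buffers[i % NBUF], i % NBUF);
        atomic_fetch_add(&consumed, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t dp, cp;
    pthread_create(&dp, NULL, dp_thread, NULL);
    pthread_create(&cp, NULL, cp_thread, NULL);
    pthread_join(dp, NULL);
    pthread_join(cp, NULL);
    return 0;
}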
DPA Implementation

(Figure: sparse LU control flow on the DPA, one flowchart each for the DP and the CP)

Sparse LU DP (Data Processor):
1. Initialize the matrix in external memory
2. Load the row and column counts, and the column map, to local memory
3. For each of the N pivots: load the pivot row, load the other submatrix rows, wait for the CP, and store the merged rows to external memory
4. Store the updated counts to external memory

Sparse LU CP (Compute Processor):
1. Wait for the DP to load the counts
2. For each of the N pivots: wait for the DP to load the submatrix rows, merge the rows, and update the column map
DPA Performance

(Figure: DPA single vs double buffer speedup over Gaussian LU for the 1648, 7917, 10279, and 26829 bus systems)

- Increase performance by double buffering rows
  - DP alternates loading into two input buffers
  - CP alternates outputs between two output buffers
- DP and CP at 2GHz, DDR at 1066MHz
- Merge accelerator at the same frequency as the CP
DPA Performance

(Figure: DPA LU speedup over Gaussian LU with the merge accelerator 10X, 5X, and 2X slower than the CP, and at the same frequency as the CP, for the four power systems)

- Measured with different merge frequencies
- Slower merges do not provide speedup over software
- Faster merges require an ASIC implementation
- Larger matrices see the most benefit
DPA Performance
- DP and CP at 3.2GHz
- DDR at 1066MHz

(Figure: DPA LU speedup vs Gaussian LU at each merge frequency for the four power systems)

DPA LU Speedup vs Gaussian LU for the 26K-Bus Power System

CP/DP Freq   Merge 10X slower   5X slower   2X slower   Same as CP
2.0GHz            0.57X           1.00X       1.85X        2.27X
3.2GHz            0.86X           1.45X       2.28X        2.35X
Conclusions and Future Work
- The merge accelerator reduces indexing cost and improves sparse LU performance (2.3X improvement)
- The DPA provides a memory framework that utilizes the merge accelerator and outperforms the triple buffering and heterogeneous multicore schemes investigated
- Software control of the memory hierarchy allows alternate protocols, such as double buffering, to be investigated
- Additional modifications, such as multiple merge units, row caching, and prefetching, will be required for further improvement
- With the accelerator, high performance load flow calculation is possible with a low power device
  - Suggesting a distributed network of such devices
References

[1] A. Abur and A. Gómez Expósito. Power System State Estimation: Theory and Implementation. Marcel Dekker, New York, NY, USA, 2004.
[2] P. R. Amestoy, Timothy Davis, and I. S. Duff. Algorithm 837: AMD, an approximate minimum degree ordering algorithm, 2004.
[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms. Master's thesis, Drexel University, 2010.
[4] DRC Computer. DRC Coprocessor System User's Guide, 2007.
[5] Kevin Cunningham and Prawat Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers. Reconfigurable Computing: Architectures, Tools and Applications, 2011.
[6] Kevin Cunningham, Prawat Nagvajara, and Jeremy Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation. In review for North American Power Symposium 2011, 2011.
[7] Timothy Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method. ACM Trans. Math. Softw., 2004.
[8] Philip Garcia and Katherine Compton. A Reconfigurable Hardware Interface for a Modern Computing System. 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), pages 73-84, April 2007.
[9] Intel. Intel Atom Processor E6x5C Series-Based Platform for Embedded Computing. Technical report, 2011.
[10] Douglas Jones. Data Pump Architecture Simulator and Performance Model. Master's thesis, 2010.
[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration (SPARC V8 Extension), 2008.
[12] Petya Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays. PhD thesis, 2007.
[13] Xilinx, Inc. Programmable Logic Design Quick Start Handbook, 2006.