Hardware Acceleration for Load Flow Computation

Jeremy Johnson, Chika Nwankpa, and Prawat Nagvajara
Kevin Cunningham, Tim Chagnon, Petya Vachranukunkiet
Computer Science and Electrical and Computer Engineering
Drexel University

Eighth Annual CMU Conference on the Electricity Industry
March 13, 2012

Overview

- Problem and Goals
- Background
- LU Hardware
- Merge Accelerator Design
- Accelerator Architectures
- Performance Results
- Conclusions

Problem

- Problem: Sparse lower-upper (LU) triangular decomposition performs inefficiently on general-purpose processors (irregular data access, indexing overhead, less than 10% floating-point efficiency on power system matrices) [3, 12]
- Solution: Custom-designed hardware can provide better performance (3X speedup over a 3.2 GHz Pentium 4) [3, 12]
- Problem: Complex custom hardware is difficult, expensive, and time-consuming to design
- Solution: Use a balance of software and hardware accelerators (e.g., FPU, GPU, encryption, video/image encoding)
- Problem: An architecture is needed that efficiently combines general-purpose processors and accelerators

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.
Goals

- Increase performance of sparse LU over existing software
- Design a flexible, easy to use hardware accelerator that can integrate with multiple platforms
- Find an architecture that efficiently uses the accelerator to improve sparse LU
Summary of Results and Contributions

- Designed and implemented a merge hardware accelerator [5]
  - Data rate approaches one element per cycle
- Designed and implemented supporting hardware [6]
- Implemented a prototype Triple Buffer Architecture [5]
  - Efficiently utilizes the external transfer bus
  - Not suitable for the sparse LU application
  - Advantageous for other applications
- Implemented a prototype heterogeneous multicore architecture [6]
  - Combines a general-purpose processor and reconfigurable hardware cores
  - Speedup of 1.3X over sparse LU software with the merge accelerator
- Modified the Data Pump Architecture (DPA) simulator
  - Added a merge instruction and support
  - Implemented sparse LU on the DPA
  - Speedup of 2.3X over sparse LU software with the merge accelerator

[5] Cunningham, Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers.
[6] Cunningham, Nagvajara, Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation.

Power Systems

- The power grid delivers electrical power from generation stations to distribution stations
- Power grid stability is analyzed with the Power Flow (a.k.a. Load Flow) calculation [1]
- System nodes and voltages create a system of equations, represented by a sparse matrix [1]
- Solving the system allows grid stability to be analyzed
- LU decomposition accounts for a large part of the Power Flow calculation [12]
- The Power Flow calculation is repeated thousands of times to perform a full grid analysis [1]

[Figure: Power Flow execution profile [12]]

[1] A. Abur. Power System State Estimation: Theory and Implementation.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.
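For context, the Load Flow equations are typically solved with Newton-Raphson iterations; the formulation below is the standard textbook version (not taken from these slides) and shows where sparse LU enters: each iteration factors the sparse Jacobian to solve for the voltage update.

% Standard Newton-Raphson load flow step (textbook formulation, shown for context only).
% x collects the unknown bus voltage angles and magnitudes, F(x) the power mismatches.
\[
  J(x_k)\,\Delta x_k = -F(x_k), \qquad x_{k+1} = x_k + \Delta x_k
\]
% J is sparse with the structure of the bus admittance matrix, so each iteration
% performs one sparse LU factorization of J followed by forward/backward substitution.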
Power System Matrices

Power System Matrix Properties

System      # Rows/Cols   NNZ       Sparsity
1648 Bus    2,982         21,196    0.238%
7917 Bus    14,508        105,522   0.050%
10279 Bus   19,285        134,621   0.036%
26829 Bus   50,092        351,200   0.014%

Power System LU Statistics

System      Avg. NNZ per Row   Avg. Num Submatrix Rows   % Merge
1648 Bus    7.1                8.0                       65.6%
7917 Bus    7.3                8.5                       54.3%
10279 Bus   7.0                8.3                       55.8%
26829 Bus   7.0                8.7                       62.7%

Power System LU

- Gaussian LU outperforms UMFPACK for power matrices [3]
- As matrix size grows, the multi-frontal method becomes more effective
- The merge is a bottleneck in Gaussian LU performance [3]
- Goal: Design a hardware accelerator for the merge in Gaussian LU

[Chart: Gaussian LU vs UMFPACK speedup on an Intel Core i7 at 3.2 GHz for the 1648-, 7917-, 10279-, and 26829-bus systems [3]]

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.

Sparse Matrices

- A matrix is sparse when it has a large number of zero elements
- Below is a small section of the 26K-bus power system matrix

        [ 122.03   -61.01    -5.95        0         0    ]
        [ -61.01   277.93    19.38        0         0    ]
    A = [   5.95   -19.56   275.58        0         0    ]
        [      0        0        0    437.82     67.50   ]
        [      0        0        0    -67.50    437.81   ]

Sparse Compressed Formats

- Sparse matrices use a compressed format to ignore zero elements
- Saves storage space
- Decreases the amount of computation

Row   Column:Value
0     0:122.03    1:-61.01    2:-5.95
1     0:-61.01    1:277.93    2:19.38
2     0:5.95      1:-19.56    2:275.58
3     3:437.82    4:67.50
4     3:-67.50    4:437.81

Sparse LU Decomposition

        [ 122.03   -61.01    -5.95        0         0    ]
        [ -61.01   277.93    19.38        0         0    ]
    A = [   5.95   -19.56   275.58        0         0    ]
        [      0        0        0    437.82     67.50   ]
        [      0        0        0    -67.50    437.81   ]

        [  1.00       0        0        0        0   ]
        [ -0.50     1.00       0        0        0   ]
    L = [  0.05    -0.07     1.00       0        0   ]
        [     0        0        0     1.00       0   ]
        [     0        0        0    -0.15     1.00  ]

        [ 122.03   -61.01    -5.95        0         0    ]
        [      0   247.42    16.41        0         0    ]
    U = [      0        0   276.97        0         0    ]
        [      0        0        0    437.82     67.50   ]
        [      0        0        0         0    448.22   ]

Sparse LU Methods

- The Approximate Minimum Degree (AMD) algorithm pre-orders matrices to reduce fill-in [2]
- Multiple methods for performing sparse LU:
  - Multi-frontal
    - Used by UMFPACK [7]
    - Divides the matrix into multiple, independent, dense blocks
  - Gaussian elimination
    - Eliminates elements below the diagonal by scaling and adding rows together

[2] Amestoy et al. Algorithm 837: AMD, an approximate minimum degree ordering algorithm.
[7] T. Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method.
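To make the compressed row format concrete, here is a minimal C sketch of the storage assumed by the merge listing later in the deck; the col/val field names come from that listing, while the type name elem_t and the initializer are illustrative.

#include <stdio.h>

/* One non-zero element of a compressed sparse row: column index plus value.
 * Field names match the merge listing; the type name is illustrative. */
typedef struct { int col; double val; } elem_t;

int main(void) {
    /* Row 0 of the example matrix: (0, 122.03) (1, -61.01) (2, -5.95) */
    elem_t row0[] = { {0, 122.03}, {1, -61.01}, {2, -5.95} };
    int row0_len = (int)(sizeof(row0) / sizeof(row0[0]));

    for (int i = 0; i < row0_len; i++)
        printf("col %d: %.2f\n", row0[i].col, row0[i].val);
    return 0;
}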
Sparse LU Decomposition

Algorithm 1: Gaussian LU
for i = 1 → N do
    pivot_search()
    update_U()
    for j = 1 → NUM_SUBMATRIX_ROWS do
        merge(pivot_row, j)
        update_L()
        update_colmap()
    end for
end for

Column Map

- The column map keeps track of the rows with non-zero elements in each column
- Do not have to search the matrix on each iteration
- Fast access to all rows in a column
- Select the pivot row from the set of rows
- The column map is updated after each merge
- Fill-in elements add new elements to a row during the merge

Sparse LU Decomposition

[Figure: merging two sparse rows, Row u and Row v, into Row u+v across columns 0-7]

Challenges with software merging:

- Indexing overhead
- Column numbers are different from row array indices
- Must fetch column numbers from memory to operate on elements
- Cache misses
- Data-dependent branching

Merge Algorithm

int p, s;
int rnz = 0, fnz = 0;
for (p = 1, s = 1; p < pivotLen && s < nonpivotLen; ) {
    if (nonpivot[s].col == pivot[p].col) {
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = (nonpivot[s].val - lx * pivot[p].val);
        rnz++; p++; s++;
    } else if (nonpivot[s].col < pivot[p].col) {
        merged[rnz].col = nonpivot[s].col;
        merged[rnz].val = nonpivot[s].val;
        rnz++; s++;
    } else {
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = (-lx * pivot[p].val);
        fillin[fnz] = pivot[p].col;
        rnz++; fnz++; p++;
    }
}
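The loop above stops as soon as either row is exhausted; the slide does not show the tail handling, but a straightforward completion (an assumption consistent with the loop's logic, not taken from the slides) copies the remaining elements and treats leftover pivot elements as fill-in:

/* Remaining non-pivot elements are copied through unchanged. */
for (; s < nonpivotLen; s++, rnz++) {
    merged[rnz].col = nonpivot[s].col;
    merged[rnz].val = nonpivot[s].val;
}
/* Remaining pivot elements are scaled and all become fill-in. */
for (; p < pivotLen; p++, rnz++, fnz++) {
    merged[rnz].col = pivot[p].col;
    merged[rnz].val = -lx * pivot[p].val;
    fillin[fnz] = pivot[p].col;
}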
LU Hardware

- Originally designed by Petya Vachranukunkiet [12]
- Straightforward Gaussian elimination
- Parameterized for power system matrices
- Streaming operations hide latencies
- Computation pipelined for 1 Flop per cycle
- Memory and FIFOs used for cache and bookkeeping
- Multiple merge units possible

Pivot and Submatrix Update Logic

[Figure: pivot and submatrix update logic block diagram]

Custom Cache Logic

- A cache line is a whole row
- Fully-associative, write-back, FIFO replacement

Reconfigurable Hardware

- Field Programmable Gate Array (FPGA)
- Advantages:
  - Low power consumption
  - Reconfigurable
  - Low cost
  - Pipelining and parallelism
- Disadvantages:
  - Difficult to program
  - Long design compile times
  - Lower clock frequencies

[Figures: SRAM logic cell and FPGA architecture [13]]

[13] Xilinx, Inc. Programmable Logic Design Quick Start Handbook.

ASIC vs FPGA

Application-Specific Integrated Circuit (ASIC):
- Higher clock frequency
- Lower power consumption
- More customization
- Lower cost

Field Programmable Gate Array (FPGA):
- Reconfigurable
- Design updates and corrections
- Multiple designs in the same area
- Less implementation time

Merge Hardware Accelerator

- Input and output BlockRAM FIFOs
- Two compare stages allow look-ahead comparison
- Pipelined floating-point adder
- The merge unit is pipelined and outputs one element per cycle
- Latency is dependent on the structure and length of the input rows

[Figure: merge unit datapath - U and V input FIFOs feed two compare stages (U/V Stage 1 and Stage 2), a compare unit, and a floating-point adder that produces the merged output]

Merge Hardware Manager

- Must manage rows going through the merge units
- The pivot row is recycled
- A floating-point multiplier scales the pivot row before it enters a merge unit
- Can support multiple merge units
- Input and output rotate among merge units after each full row
- Rows enter and leave in the same order

[Figure: merge manager - pivot and non-pivot FIFOs feed scaling multipliers and two merge units, each with an output FIFO]

Merge Performance

Software Merge Performance on a Core i7 at 3.2 GHz

Power System   Average Cycles per Element   Average Data Rate (Millions of Elements per Second)
1648 Bus       35.3                         90.77
7917 Bus       32.2                         99.54
10279 Bus      31.8                         100.75
26829 Bus      32.8                         97.46

- Merge accelerator:
  - Outputs one element per cycle
  - Data rate approaches the hardware clock rate
  - FPGA clock frequencies provide better merge performance than the processor
- Goal: Find an architecture that efficiently delivers data to the accelerator
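To put the table in perspective, a back-of-the-envelope comparison (the 200 MHz FPGA clock below is an assumed figure for illustration, not a measurement from the slides):

% Software merge: roughly 32 cycles per output element on a 3.2 GHz core.
\[
  \frac{3.2\times10^{9}\ \text{cycles/s}}{\approx 32\ \text{cycles/element}}
  \approx 100\times10^{6}\ \text{elements/s}
\]
% Merge accelerator: one element per cycle, so an assumed 200 MHz FPGA clock would give
\[
  1\ \text{element/cycle}\times 200\times10^{6}\ \text{cycles/s}
  = 200\times10^{6}\ \text{elements/s}
\]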
Summary of Design Methods

- Completely software (UMFPACK, Gaussian LU)
  - Inefficient performance
- Completely hardware (LUHW on FPGA)
  - Large design time and effort
  - Less scalable
- Investigate three accelerator architectures:
  - Triple Buffer Architecture: efficiently utilize an external transfer bus to a reconfigurable accelerator
  - Heterogeneous Multicore Architecture: combine general-purpose and reconfigurable cores in a single package
  - Data Pump Architecture: take advantage of programmable data/memory management to efficiently feed an accelerator

Accelerator Architectures

[Figure: CPU and FPGA, each with its own external memory, connected by a transfer bus]

- Common setup for a processor and an FPGA
- Each device has its own external memory
- Communicate by USB, Ethernet, PCI-Express, HyperTransport, ...
- The transfer bus can be a bottleneck between the processor and the FPGA

Triple Buffer Architecture

[Figure: CPU and FPGA sharing three low-latency memory banks across the transfer bus]

- How to efficiently use the transfer bus?
- Break data into blocks and buffer them
- Three buffers eliminate competition on the memory ports
- Allows a continuous stream of data through the FPGA accelerator
- The processor can transfer data at the bus/memory rate, not limited by the FPGA rate

Heterogeneous Multicore Architecture

- Communication with an external transfer bus requires additional overhead
- The merge accelerator does not require a large external FPGA
- Closely integrate the processor and the FPGA
- Share the same memory system

[Figure: reconfigurable processor architecture [8] - DDR memory, AXI bus, CPU (MicroBlaze), AXI-Lite, AXI DMA, AXI Stream, reconfigurable core]

- The processor sends DMA requests to the DMA module
- The DMA module fetches data from memory and streams it to the accelerator
- The accelerator streams outputs back to the DMA module to store to memory
  (a minimal offload sketch follows the Data Pump Architecture slide below)

[8] Garcia and Compton. A Reconfigurable Hardware Interface for a Modern Computing System.

Data Pump Architecture

[Figure: Data Pump Architecture [11]]

- Developed as part of the SPIRAL project
- Intended as a signal-processing platform
- Processors explicitly control data movement
- No automatic data caches
- The Data Processor (DP) moves data between external and local memory
- The Compute Processor (CP) moves data between local memory and the vector processors

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
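Below is a minimal sketch of the DMA-driven offload flow described on the Heterogeneous Multicore Architecture slide above (referenced there). The dma_* helpers and the elem_t type are hypothetical placeholders for illustration, not the Xilinx AXI DMA driver API, and error handling is omitted.

/* One non-zero element, as in the merge listing. */
typedef struct { int col; double val; } elem_t;

/* Hypothetical DMA helpers (placeholders only, not a real driver API). */
int dma_to_accel(const void *buf, unsigned len);   /* queue a memory-to-stream transfer   */
int dma_from_accel(void *buf, unsigned len);       /* queue a stream-to-memory transfer   */
int dma_wait(void);                                /* block until queued transfers finish */

/* Offload one merge: the processor queues descriptors, the DMA engine streams the
 * two rows to the reconfigurable merge core, and the merged row streams back. */
int merge_offload(const elem_t *pivot, int pivot_len,
                  const elem_t *nonpivot, int nonpivot_len,
                  elem_t *merged, int max_out)
{
    dma_to_accel(pivot,    (unsigned)(pivot_len    * sizeof(elem_t))); /* scaled pivot row */
    dma_to_accel(nonpivot, (unsigned)(nonpivot_len * sizeof(elem_t))); /* non-pivot row    */
    dma_from_accel(merged, (unsigned)(max_out      * sizeof(elem_t))); /* merged result    */
    return dma_wait();
}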
Data Pump Architecture

[Figure: Data Pump Architecture [11]]

- Replace the vector processors with the merge accelerator
- The DP brings rows from external memory into local memory
- The CP sends rows to the merge accelerator and writes the results in local memory
- The DP stores the results back to external memory

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.

DPA Implementation

- Implemented sparse LU on the DPA simulator [10]
- The DP and CP synchronize with local-memory read/write counters
- Double buffer the row inputs and outputs in local memory (a minimal synchronization sketch follows the performance tables below)
- DP and CP at 2 GHz, DDR at 1066 MHz
- Single merge unit

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
[10] D. Jones. Data Pump Architecture Simulator and Performance Model.

[Flowchart: Sparse LU on the DPA.
DP: initialize the matrix in external memory; load the row and column counts and the column map to local memory; for each pivot, load the pivot row and the other submatrix rows, wait for the CP, and store the merged rows to external memory; after N iterations, store the updated counts to external memory.
CP: wait for the DP to load the counts; for each pivot, wait for the DP to load the submatrix rows, merge the rows, and update the column map; finish after N iterations.]

DPA Performance

- Increase performance by double-buffering rows
- The DP alternates loading into two input buffers
- The CP alternates outputs between two output buffers
- DP and CP at 2 GHz, DDR at 1066 MHz
- Merge accelerator at the same frequency as the CP

[Chart: DPA single- vs double-buffer speedup over Gaussian LU for the four power systems]

DPA Performance

- Measured with different merge frequencies
- Slower merges do not provide speedup over software
- Faster merges require an ASIC implementation
- Larger matrices see the most benefit

[Chart: DPA speedup over Gaussian LU with the merge unit 10X, 5X, and 2X slower than the CP, and at the same frequency as the CP]

DPA Performance

- DP and CP at 3.2 GHz
- DDR at 1066 MHz

DPA LU Speedup vs Gaussian LU for the 26K-Bus Power System

CP/DP Freq   Merge 10X Slower   Merge 5X Slower   Merge 2X Slower   Merge Same as CP
2.0 GHz      0.57X              1.00X             1.85X             2.27X
3.2 GHz      0.86X              1.45X             2.28X             2.35X
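The double buffering and counter-based synchronization used in the DPA implementation (referenced above) can be pictured with the following minimal C sketch; the counter names, buffer count, and loop bodies are illustrative assumptions, not the simulator's actual code.

/* Ping-pong (double-buffer) handshake between the DP (producer) and the CP
 * (consumer) using shared read/write counters in local memory. */
#define NBUF 2                          /* two row buffers in local memory   */

volatile unsigned filled  = 0;          /* buffers loaded by the DP so far   */
volatile unsigned drained = 0;          /* buffers consumed by the CP so far */

void dp_load_next_rows(void) {
    while (filled - drained >= NBUF) ;  /* both buffers in use: wait for CP  */
    int b = filled % NBUF;              /* next local buffer to fill         */
    (void)b;  /* ... DP copies the next pivot/submatrix rows from external
                     memory into local buffer b ... */
    filled++;                           /* publish buffer b to the CP        */
}

void cp_merge_rows(void) {
    while (drained == filled) ;         /* nothing loaded yet: wait for DP   */
    int b = drained % NBUF;             /* oldest filled buffer              */
    (void)b;  /* ... CP streams the rows in local buffer b through the merge
                     unit and writes the merged rows to an output buffer ... */
    drained++;                          /* hand buffer b back to the DP      */
}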
Conclusions and Future Work

- The merge accelerator reduces indexing cost and improves sparse LU performance (2.3X improvement)
- The DPA provides a memory framework that utilizes the merge accelerator and outperforms the triple buffering and heterogeneous multicore schemes investigated
- Software control of the memory hierarchy allows alternate protocols such as double buffering to be investigated
- Additional modifications such as multiple merge units, row caching, and prefetching will be required for further improvement
- With the accelerator, high-performance load flow calculation is possible with a low-power device
  - Suggesting a distributed network of such devices

References

[1] A. Abur and A. Gómez Expósito. Power System State Estimation: Theory and Implementation. Marcel Dekker, New York, New York, USA, 2004.
[2] P. R. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 837: AMD, an approximate minimum degree ordering algorithm, 2004.
[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms. Master's thesis, Drexel University, 2010.
[4] DRC Computer. DRC Coprocessor System User's Guide, 2007.
[5] K. Cunningham and P. Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers. Reconfigurable Computing: Architectures, Tools and Applications, 2011.
[6] K. Cunningham, P. Nagvajara, and J. Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation. In review for the North American Power Symposium, 2011.
[7] T. Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method. ACM Trans. Math. Softw., 2004.
[8] P. Garcia and K. Compton. A Reconfigurable Hardware Interface for a Modern Computing System. 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), pages 73-84, April 2007.
[9] Intel. Intel Atom Processor E6x5C Series-Based Platform for Embedded Computing. Technical report, 2011.
[10] D. Jones. Data Pump Architecture Simulator and Performance Model. Master's thesis, 2010.
[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration (SPARC V8 Extension), 2008.
[12] P. Vachranukunkiet. Power Flow Computation Using Field Programmable Gate Arrays. PhD thesis, 2007.
[13] Xilinx, Inc. Programmable Logic Design Quick Start Handbook, 2006.