Parallel VLSI CAD Algorithms for Energy Efficient Heterogeneous Computing Platforms
Xueqian Zhao, Zhuo Feng
Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI

1. INTRODUCTION

a) GPU Has Evolved into a Very Flexible and Powerful Processor
– Memory bandwidth: 180 GB/s (GPU) vs. 30 GB/s (CPU)
– Peak throughput: 1350 GFLOPS (GPU) vs. 200 GFLOPS (CPU)
– All processors do the same job on different data sets (SIMD)
– More transistors are devoted to data processing; fewer transistors are used for data caching and control flow

b) Heterogeneous CPU-GPU Computing Achieves Much Better Energy Efficiency
– Emerging heterogeneous platforms: AMD APU, Intel Larrabee, NVIDIA Tegra 2
– Achieves highly energy-efficient computing and efficient CPU-GPU communications
– Requires reinventing CAD algorithms: circuit simulations, interconnect modeling, performance optimizations, …

c) Latest NVIDIA Fermi GPU Architecture
– 16 streaming multiprocessors (SMs), each with scalar processors (SPs), special function units (SFUs), an instruction L1 cache, and shared memory & L1 cache
– GPU concurrent kernel execution can improve SM occupancy
[Figure: Fermi streaming multiprocessor layout (SPs, SFUs, instruction L1, shared memory & L1 cache) attached to the GPU global memory]

Target VLSI CAD applications:
– 3D Parasitic Extraction: capacitance extraction for delay, noise and power estimation computes charge distributions for discretized panels
– Large Power Grid Simulation: power grid DC and transient simulations involve more than hundreds of millions of nodes
– Full Chip Thermal Simulation: thermal simulation of 3-D dense mesh structures requires solving many millions of unknowns
[Figure: power delivery network, 3D interconnections, and 3D-ICs with microchannels across multiple substrate layers]

2. CAPACITANCE EXTRACTION

a) Problem Formulation
The potential contributed to panel a by panel b, which carries charge q_b uniformly distributed over its area A_b in a dielectric with permittivity ε_b, is
    V_{b→a} = q_b / (ε_b·A_b) · ∫_{panel b} dA / R_{ba}.
Setting the total potential V_a = Σ_b V_{b→a} of every panel on a driven conductor to 1 V yields a dense linear system for the panel charges q_i; solving it by direct summation costs O(n²).

b) FMM Turns the O(n²) into O(n) via Approximations
– Multipole expansion: approximate panel charges within a small sphere
– Local expansion: approximate panel potentials within a small sphere
– For an evaluation point y_k ∈ R^3, nearby panels x_i ∈ Ω(y_k) are directly solved through the summation Σ_{x_i ∈ Ω(y_k)} Φ(x_i, y_k)·q_i, while distant panels x_i ∉ Ω(y_k) are approximately solved through the multipole/local expansions
– Key steps: Upward Pass (Q2M, M2M), Downward Pass (M2L, L2L), Evaluation Pass (M2P, L2P), Direct Pass (Q2P)

c) FMMGpu Data Structure and Algorithm Flow
– Direct charges of source cube i: q_i ∈ R^{m×1}; panel potentials in target cube k: p_k ∈ R^{n×1}; multipole expansions of source cube j: m_j ∈ R^{s×1}
– Per-cube evaluation: p_k = Φ_Q2P^{(k,i)} × q_i + Φ_M2P^{(k,j)} × m_j
– Q2P coefficient matrix Φ_Q2P^{(k,i)} ∈ R^{n×m} with index matrix I_Q2P^{(k,i)}; M2P coefficient matrix Φ_M2P^{(k,j)} ∈ R^{n×s} with index matrix I_M2P^{(k,j)}; after the cube indices are ordered, the coefficient and index matrices are clustered (cluster coefficient matrices Φ_Q2P^{(l,k)}, I_Q2P^{(l,k)} ∈ R^{r×d}) and moved from host (CPU) memory into GPU global memory
– All charge, multipole-expansion, and local-expansion vectors are packed into one global vector Vec_global = [q_1^T, …, m_y^T, …, l_1^T, …, l_z^T]^T, initialized to zero
– Algorithm flow: reorder and cluster panel cubes → compute and combine all coefficient matrices → initialize GPU memory allocation → initialize panel charges → iterate the GMRES solver, where each FMM iteration performs the precondition pass together with the upward (Q2M, M2M), downward (M2L, L2L), evaluation (M2P, L2P) and direct (Q2P) passes
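A minimal CUDA sketch of the per-cube evaluation p_k = Φ_Q2P^{(k,i)}·q_i + Φ_M2P^{(k,j)}·m_j above, assuming dense row-major coefficient blocks and one thread per target panel; the kernel name, array layout, and launch configuration are illustrative assumptions, not the actual FMMGpu data structure, which clusters many coefficient blocks together.

```cuda
// Sketch only: accumulates p_k += Phi_Q2P * q_i + Phi_M2P * m_j for one
// (target cube k, source cube i/j) pair; the flat row-major layout and the
// one-thread-per-target-panel mapping are illustrative assumptions.
__global__ void cube_potential_eval(const float *phi_q2p,  // n x m Q2P coefficients
                                    const float *phi_m2p,  // n x s M2P coefficients
                                    const float *q,        // m direct panel charges
                                    const float *mexp,     // s multipole coefficients
                                    float       *p,        // n target panel potentials
                                    int n, int m, int s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one target panel per thread
    if (row >= n) return;

    float acc = 0.0f;
    for (int c = 0; c < m; ++c)        // direct (Q2P) contribution
        acc += phi_q2p[row * m + c] * q[c];
    for (int c = 0; c < s; ++c)        // far-field (M2P) contribution
        acc += phi_m2p[row * s + c] * mexp[c];

    atomicAdd(&p[row], acc);           // accumulate into the global potential vector
}

// Example host-side launch for one cube pair (illustrative):
//   cube_potential_eval<<<(n + 255) / 256, 256>>>(dPhiQ2P, dPhiM2P, dQ, dM, dP, n, m, s);
```

In practice the many small matrix-vector products would be batched over the clustered coefficient and index matrices rather than launched one pair at a time; the sketch keeps a single pair only for readability.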
d) GPU Performance Modeling and Workload Balancing
– An FMM workload that runs in time T0 on one SM would ideally finish in T0/2, T0/3, T0/4, …, T0/n on 2, 3, 4, …, n SMs (linear speedup); realistic speedups deviate from this trend, and a super-linear speedup range is observed for some cube clusterings
– Each cube cluster's workload N_i is normalized by the number of SMs on the GPU (S_i ∝ N_i / N_sm); the scores S_1, …, S_p (and their sum Σ_{i=1}^{p} S_i) guide the workload balancing that finds the optimal SM assignment
[Figure: FMM runtime vs. number of SMs (2, 3, 8 SMs), comparing linear and realistic speedups and marking a super-linear speedup range of about 26X]

e) Experimental Results [4]
– Runtime of a single FMM iteration on CPU & GPU (ms): overall speedups of roughly 22X to 30X
– Runtime of each kernel on CPU & GPU (ms): the Precond, Direct, Upward, Downward and Evaluation kernels achieve speedups between 3.5X and 45X
[Figure: bar charts of single-iteration and per-kernel FMM runtimes on CPU & GPU]

3. PERFORMANCE MODELING & OPTIMIZATION FOR ELECTRICAL & THERMAL SIMULATIONS

Chip-level electrical [1,2] and thermal [3] modeling & simulation, combined with adaptive performance modeling & optimization [1]:
– DC analysis: G·x = b; transient (TR) analysis: G·x(t) + C·dx(t)/dt = b(t)
– 3D multi-layer irregular electrical/thermal grids are captured with combined matrix + geometric representations
– Geometric multigrid solver (GMD) built on the original grid, with a Jacobi smoother using the sparse matrix
[Figure: 3D multi-layer irregular electrical/thermal grid with its sparse-matrix (entries a_{i,j}) and geometric representations]

a) Multigrid Preconditioned Krylov Subspace Iterative Solver
MGPCG algorithm on GPU:
Set initial solution → get initial residual and search direction → precondition pass (multigrid preconditioning) → update search direction → update solution and residual → check convergence → if not converged, compute the residual and continue iterating; once converged, return the final solution
– GPU-friendly multi-level iterative algorithm work flow: each step is mapped onto GPU kernels (Kernel 1–Kernel 6) operating on data kept in GPU global memory, with synchronization between kernels

b) Analytical GPU Runtime Performance Model (ISCA'09)
– MWP (Memory Warp Parallelism): the maximum number of warps per SM that can access the memory simultaneously
– CWP (Computation Warp Parallelism): the number of warps the SM processor can execute during one memory warp waiting period, plus one
– The model combines the instruction issue latency, the memory loading latency, the numbers of computation and memory instructions per warp, and any additional delay; for example, when MWP is greater than CWP, the memory latency is hidden and the runtime estimate is dominated by the computation periods
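To make the two metrics concrete, the sketch below evaluates them for a toy kernel profile using only the definitions given above; it is a simplified host-side illustration, not the full ISCA'09 model (which also accounts for memory bandwidth limits, departure delays and synchronization cost), and all structure and field names are assumptions.

```cuda
// Host-side sketch: MWP/CWP per the definitions above (simplified; not the
// full ISCA'09 model). All names and the example numbers are illustrative.
#include <algorithm>
#include <cstdio>

struct KernelProfile {
    int   active_warps;   // warps resident on one SM
    float comp_cycles;    // compute cycles one warp spends between memory ops
    float mem_latency;    // waiting period of one memory warp (cycles)
    float mwp_peak;       // warps whose memory requests the SM can overlap
};

// CWP: warps the SM can execute during one memory waiting period, plus one,
// capped by the number of warps actually resident on the SM.
float cwp(const KernelProfile &k) {
    return std::min(k.mem_latency / k.comp_cycles + 1.0f, (float)k.active_warps);
}

// MWP: maximum warps per SM that can access memory simultaneously (also
// capped by the resident warp count).
float mwp(const KernelProfile &k) {
    return std::min(k.mwp_peak, (float)k.active_warps);
}

int main() {
    KernelProfile smoother = {24, 100.0f, 400.0f, 6.0f};  // toy numbers
    float c = cwp(smoother), m = mwp(smoother);
    if (m > c)   // MWP > CWP: memory latency is hidden, computation dominates
        std::printf("MWP=%.1f > CWP=%.1f -> computation-bound estimate\n", m, c);
    else         // otherwise the memory waiting periods dominate the runtime
        std::printf("MWP=%.1f <= CWP=%.1f -> memory-bound estimate\n", m, c);
    return 0;
}
```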
c) Adaptive Performance Modeling (PPoPP'10)
– Work Flow Graph (WFG): an extension of the control flow graph of a GPU kernel, whose nodes represent computation (C) and memory (M) instructions and whose edges carry the instruction issue latency, memory loading latency, and memory dependencies
– Model coefficients are calibrated with small test kernels that run n computation instructions and M memory instructions between entry and exit (e.g., 8 computation + 1 memory)
– Example: the WFG of the multigrid smoother, including its inner loop
[Figure: WFGs of the small test kernels and of the multigrid smoother; number of coefficients vs. small tests]

d) Concurrent Kernel Executions
– Fermi GPUs allow at most 16 different kernels to execute at the same time
– Compared with serial kernel execution, concurrent execution of Kernels 1–6 improves SM occupancy and shortens the total running time
– Workload balancing and SM assignment are performed to find the optimal SM assignment among concurrent kernels (a minimal CUDA-streams sketch is given after the references)
[Figure: serial vs. concurrent kernel execution timelines (Kernels 1–6) and the resulting SM occupancy]

e) Experimental Results
– Compared with the CHOLMOD direct solver on the CPU, the GPU solvers achieve 18X to 25X speedups, and the adaptive performance models automatically identify the optimal (rather than worst-case) simulation configurations
[Figure: IC temperature prediction (temperature in Celsius over the original grid) and voltage waveform prediction (voltage (V) vs. time (s)), CHOLMOD vs. GPU]

4. REFERENCES
[1] Z. Feng, X. Zhao, and Z. Zeng. Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization. IEEE TCAD, 2011.
[2] Z. Feng and Z. Zeng. Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proc. IEEE/ACM DAC, 2010.
[3] Z. Feng and P. Li. Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling. In Proc. IEEE/ACM ICCAD, 2010.
[4] X. Zhao and Z. Feng. Fast multipole method on GPU: Tackling 3-D capacitance extraction on massively parallel SIMD platforms. In Proc. IEEE/ACM DAC, 2011.
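As referenced in subsection 3 d), the following is a minimal sketch of issuing independent kernels to separate CUDA streams so that Fermi-class (and later) GPUs can execute them concurrently; the kernel bodies and names are placeholders, not the solver's actual kernels.

```cuda
// Sketch: two independent placeholder kernels issued to different CUDA
// streams; on Fermi and later GPUs they may run concurrently and share SMs,
// improving SM occupancy when one kernel alone cannot fill the device.
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f;                 // placeholder work
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;                 // placeholder work
}

void run_concurrently(float *dx, float *dy, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256, blocks = (n + threads - 1) / threads;
    kernelA<<<blocks, threads, 0, s0>>>(dx, n);   // independent kernels on
    kernelB<<<blocks, threads, 0, s1>>>(dy, n);   // separate streams

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```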