Parallel VLSI CAD Algorithms for Energy Efficient Heterogeneous Computing Platforms
Xueqian Zhao, Zhuo Feng
Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI
1. INTRODUCTION
a) GPU Has Evolved into A Very Flexible and Powerful Processor
– Memory bandwidth: 180 GB/s (GPU) vs. 30 GB/s (CPU)
– Peak throughput: 1350 GFLOPS (GPU) vs. 200 GFLOPS (CPU)
– All processors do the same job on different data sets (SIMD)
– More transistors are devoted to data processing; fewer transistors are used for data caching and control flow

b) Heterogeneous CPU-GPU Computing Achieves Much Better Energy Efficiency
– Emerging heterogeneous platforms (AMD APU, NVIDIA Tegra 2, Intel Larrabee) achieve highly energy-efficient computing and fast CPU-GPU communications
– Exploiting them requires reinventing CAD algorithms: circuit simulations, interconnect modeling, performance optimizations, ...
– Target applications: chip-level electrical [1,2] and thermal [3] modeling & simulation of power delivery networks, 3D interconnections, multi-layer substrates, and 3D-ICs with microchannels:
  – 3D Parasitic Extraction: capacitance extraction for delay, noise, and power estimation computes charge distributions for discretized panels
  – Large Power Grid Simulation: power grid DC and transient simulations involve more than hundreds of millions of nodes
  – Full-Chip Thermal Simulation: thermal simulation of dense 3-D mesh structures requires solving many millions of unknowns
[Figure: target applications – a power delivery network (VDD grid), 3D interconnections across substrate layers 1–3, and a 3D-IC with microchannel cooling.]
c) Latest NVIDIA Fermi GPU Architecture
– Each streaming multiprocessor (SM) groups streaming processors (SPs) and special function units (SFUs) behind an instruction L1 cache and a per-SM shared memory & L1 cache; all SMs share the GPU global memory, which exchanges data with the host (CPU) memory
[Figure: Fermi GPU with streaming multiprocessors SM1–SM16, each containing SPs, SFUs, an instruction L1 cache, and shared memory & L1 cache, connected to the GPU global memory.]
d) Concurrent Kernel Executions
– Fermi GPUs allow at most 16 different kernels executed at the same time
– GPU concurrent kernel execution can improve the SM occupancy (a minimal CUDA sketch follows this list)
[Figure: serial vs. concurrent execution of Kernels 1–6 – concurrent execution raises SM occupancy and shortens the total running time.]
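The sketch below shows the mechanism in plain CUDA: two independent kernels launched into separate streams may execute concurrently on a Fermi-class GPU. The kernel bodies and problem sizes are hypothetical placeholders, not taken from the poster.

// Two trivial, independent kernels (placeholder workloads).
#include <cuda_runtime.h>

__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                  // arbitrary per-element work
}

__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;                  // arbitrary per-element work
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Kernels issued to different streams have no ordering constraint,
    // so the hardware may co-schedule them and fill SMs that a single
    // small kernel would leave idle.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();                  // wait for both kernels

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}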
2. CAPACITANCE EXTRACTION
a) Problem Formulation
– Each conductor surface is discretized into panels; panel b with total charge $q_b$ and area $A_b$ carries the uniform charge density $q_b/A_b$
– The potential that panel b induces on panel a is
  $V_{b\to a} = \dfrac{q_b}{A_b}\int_{\text{panel }b} \dfrac{dA}{4\pi\varepsilon\,R_{ba}}$,
  where $\varepsilon$ is the dielectric permittivity and $R_{ba}$ is the distance from the integration point on panel b to the evaluation point on panel a
– Summation over all panels gives the total potential $V = \sum_b V_{b\to a}$; setting every conductor panel potential to 1 V yields a dense linear system in the panel charges, which is solved with a GMRES iterative solver
– Each GMRES iteration therefore needs one potential evaluation over all panel pairs, which costs O(n²) if done directly (see the kernel sketch below)
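As a baseline for the FMM discussion that follows, here is a minimal CUDA sketch (our illustration, not the authors' code) of the direct O(n²) potential evaluation, using one-point quadrature so that panel b contributes q_b/(4·pi·eps·R_ba) at panel a's centroid.

// Each thread computes the potential at one panel centroid by summing
// the contributions of all other panels.
__global__ void directPotential(const float3* c,   // panel centroids
                                const float*  q,   // panel charges
                                float*        V,   // output potentials
                                int n, float eps) {
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a >= n) return;
    const float k = 1.0f / (4.0f * 3.14159265f * eps);
    float va = 0.0f;
    for (int b = 0; b < n; ++b) {
        if (b == a) continue;            // self term needs analytic handling
        float dx = c[a].x - c[b].x;
        float dy = c[a].y - c[b].y;
        float dz = c[a].z - c[b].z;
        va += k * q[b] * rsqrtf(dx*dx + dy*dy + dz*dz);
    }
    V[a] = va;
}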
b) FMM Turns the O(n²) into O(n) via Approximations
– Potential collocation: $p_k = \sum_i \Phi(x_i, y_k)\, q_i$ with charges $q_i \in \mathbb{R}$ and positions $x_i, y_k \in \mathbb{R}^3$
– Sources inside the neighborhood $\Omega(y_k)$ are directly solved; sources outside $\Omega(y_k)$ are approximately solved through expansions:
  $p_k = \sum_{x_i \in \Omega(y_k)} \Phi^{(k,i)} q_i \;+\; \sum_{x_i \notin \Omega(y_k)} (\text{expansion terms})$
– Multipole expansion: approximates the panel charges within a small sphere
– Local expansion: approximates the panel potentials within a small sphere
– Key steps: 1) upward pass (Q2M, M2M); 2) downward pass (M2L, L2L); 3) evaluation pass (M2P, L2P); 4) direct pass (Q2P)
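To make the approximation step concrete, the following worked equation (standard FMM mathematics, not copied from the poster) shows the leading term of a multipole expansion, which the upward pass computes:

% For source panels x_i inside a sphere centered at x_c, and an evaluation
% point y well outside it (so that |x_i - x_c| << |y - x_c|):
\[
\sum_{i} \frac{q_i}{4\pi\varepsilon\,\lVert y - x_i\rVert}
  \;=\; \frac{1}{4\pi\varepsilon\,\lVert y - x_c\rVert}\sum_{i} q_i
  \;+\; O\!\left(\frac{\max_i \lVert x_i - x_c\rVert}{\lVert y - x_c\rVert}\right)
\]
% One expansion coefficient (here simply the total charge) replaces many
% individual panel-to-panel interactions; keeping higher-order expansion
% terms sharpens the approximation to any desired accuracy.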
c) FMMGpu Data Structure and Algorithm Flow
– Per-cube data: direct charges of source cube i, $q_i \in \mathbb{R}^{m\times 1}$; multipole expansions of source cube j, $m_j \in \mathbb{R}^{s\times 1}$; panel potentials in target cube k, $p_k \in \mathbb{R}^{n\times 1}$
– Each target cube accumulates near-field and far-field contributions:
  $p_k = \Phi^{(k,i)}_{Q2P}\, q_i + \cdots + \Phi^{(k,j)}_{M2P}\, m_j = \bigl[\,\Phi^{(k,i)}_{Q2P} \;\cdots\; \Phi^{(k,j)}_{M2P}\,\bigr] \times \bigl[\, q_i^T \;\cdots\; m_j^T \,\bigr]^T$
– The Q2P coefficient matrix $\Phi^{(k,i)}_{Q2P} \in \mathbb{R}^{n\times m}$ is paired with a Q2P index matrix $I^{(k,i)}_{Q2P} \in \mathbb{R}^{n\times m}$, and the M2P coefficient matrix $\Phi^{(k,j)}_{M2P} \in \mathbb{R}^{n\times s}$ with an M2P index matrix $I^{(k,j)}_{M2P} \in \mathbb{R}^{n\times s}$; after reordering and clustering the panel cubes, all per-cube matrices are combined into one large clustered coefficient matrix and a matching index matrix in GPU global memory
– The index matrix gathers operands from a single global vector (initialized to zero) that stores all charges, multipole expansions, and local expansions back to back, e.g. $[\, q_1^T \cdots\; m_1^T \cdots m_y^T\; l_1^T \cdots l_z^T \,]^T$, so every FMM pass becomes the same gather-multiply-accumulate operation over the cube indices after ordering (see the sketch below)
– Algorithm flow (GMRES solver): entry → initialize GPU memory allocation → reorder and cluster panel cubes → compute and combine all coefficient matrices → initialize panel charges → repeat { upward pass (Q2M, M2M) → downward pass (M2L, L2L) → evaluation pass (M2P, L2P) → direct pass (Q2P) → precondition pass → compute residual → converged? } → exit
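Below is a minimal CUDA sketch of the gather-style evaluation implied by the paired coefficient/index matrices (our rendering of the idea under stated assumptions, not the authors' exact kernel; the negative-index padding convention is invented for illustration):

// Row r of the combined coefficient matrix accumulates
//   p[r] += sum_c Phi[r][c] * g[I[r][c]],
// where g is the global vector holding all charges and expansions.
__global__ void clusteredMatVec(const float* Phi,  // combined coeff. matrix
                                const int*   I,    // matching index matrix
                                const float* g,    // global operand vector
                                float*       p,    // output panel potentials
                                int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        int idx = I[r * cols + c];        // gather operand via index matrix
        if (idx >= 0)                     // assume negative index = padding
            acc += Phi[r * cols + c] * g[idx];
    }
    p[r] += acc;                          // accumulate into panel potentials
}

Because all cubes share one dense, regularly laid-out matrix, consecutive threads read consecutive rows and the SIMD lanes stay busy, which is what makes this layout GPU friendly.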
d) GPU Performance Modeling and Workload Balancing
– The runtime of each FMM kernel is measured on CPU & GPU, and a performance model finds the optimal SM assignment: do workload balance and SM assignment before launching the passes concurrently
– Ideally a kernel that takes time T0 on one SM takes T0/2 on 2 SMs, T0/3 on 3 SMs, ..., T0/n on n SMs, so SMs are apportioned among the concurrent kernels (e.g., 2, 3, or 8 SMs each) according to the number of coefficients in each clustered coefficient matrix, such that all kernels finish at nearly the same time (a proportional-assignment sketch follows this list)
[Figure: measured kernel runtime vs. number of SMs, falling from T0 toward T0/n, and SM assignments derived from the per-cluster coefficient counts.]
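A hedged host-side sketch of such an assignment policy (an illustrative proportional heuristic; the paper derives its assignment from the performance model rather than from this simple rule):

#include <vector>

// Give each concurrent kernel at least one SM, then split the spare SMs
// in proportion to the kernels' measured single-SM runtimes t[i], so the
// concurrently running passes finish at roughly the same time.
std::vector<int> assignSMs(const std::vector<double>& t, int N_sm) {
    double total = 0.0;
    for (double ti : t) total += ti;
    std::vector<int> sm(t.size(), 1);            // at least 1 SM per kernel
    int spare  = N_sm - static_cast<int>(t.size());
    int spare0 = spare;
    for (std::size_t i = 0; i < t.size() && spare > 0; ++i) {
        int extra = static_cast<int>(t[i] / total * spare0);
        if (extra > spare) extra = spare;
        sm[i] += extra;
        spare -= extra;
    }
    if (spare > 0) sm[0] += spare;               // leftovers to kernel 0
    return sm;
}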
e) Experimental Results [4]
– Per-kernel GPU speedups: about 3.5X for the precondition pass and 24X–45X for the direct, upward, downward, and evaluation passes
– For a single FMM iteration, realistic speedups of 18X–30X over the CPU are observed on large test cases
– Small tests fall into a super-linear speedup range once runtimes are normalized based on the number of SMs (#SM) on the GPU
[Figure: runtime of each kernel and of a single FMM iteration on CPU & GPU (ms); per-kernel speedups of 3.5X–45X and per-iteration speedups of 18X–30X, with linear and super-linear speedup ranges vs. number of SMs.]
3. PERFORMANCE MODELING & OPTIMIZATION FOR ELECTRICAL & THERMAL SIMULATIONS
Chip-level electrical [1,2] and thermal [3] modeling & simulation.
a) Multigrid Preconditioned Krylov Subspace Iterative Solver
– DC analysis solves $G x = b$; transient (TR) analysis solves $G x(t) + C\,\dfrac{dx(t)}{dt} = b(t)$ (a worked discretization follows)
– The GPU-friendly multi-level iterative algorithm combines matrix + geometric representations: the original 3D multi-layer irregular electrical/thermal grid is mapped to a regularized grid, on which a geometric multigrid solver (GMD) serves as the preconditioner of the Krylov-subspace iterations
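For concreteness, one common way to turn the transient system into a sequence of linear solves (assuming backward Euler with step size h; the poster does not state which integration scheme is used):

\[
G\,x(t) + C\,\frac{dx(t)}{dt} = b(t)
\;\;\Longrightarrow\;\;
\Bigl(G + \tfrac{1}{h}C\Bigr)\,x(t+h) \;=\; b(t+h) + \tfrac{1}{h}C\,x(t)
\]

Every time step then solves a linear system with the same matrix $G + C/h$, which is exactly the job handed to the multigrid preconditioned Krylov solver on the GPU.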
– MGPCG algorithm flow on GPU (each step maps onto one of Kernels 1–6): set initial solution → get initial residual and search direction → multigrid preconditioning → update search direction → update solution and residual → check convergence; if not converged, loop back to the preconditioning step, otherwise return the final solution
– The multigrid preconditioner smooths each level with a Jacobi smoother using sparse-matrix operations (a kernel sketch follows)
[Figure: the 3D multi-layer irregular grid, its sparse matrix representation (an 8×8 example with entries a1,1, a1,4, a1,5, ..., a8,6, a8,8), and the regularized grid used by the geometric multigrid solver.]
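A minimal sketch of one damped Jacobi smoothing step on a CSR sparse matrix (the storage format and the damping factor omega are our assumptions; the poster only names the smoother):

// One damped Jacobi sweep: x_new = (1-w)*x_old + w*D^{-1}(b - (A-D)*x_old).
__global__ void jacobiSmooth(const int*   rowPtr,  // CSR row pointers
                             const int*   colIdx,  // CSR column indices
                             const float* val,     // CSR nonzero values
                             const float* b,       // right-hand side
                             const float* xOld,    // current iterate
                             float*       xNew,    // smoothed iterate
                             int n, float omega) { // omega: damping factor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float diag = 0.0f, offd = 0.0f;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
        int j = colIdx[k];
        if (j == i) diag  = val[k];        // pick out the diagonal entry
        else        offd += val[k] * xOld[j];
    }
    xNew[i] = (1.0f - omega) * xOld[i] + omega * (b[i] - offd) / diag;
}

Each thread owns one grid node, so the sweep is embarrassingly parallel; this is why Jacobi (rather than Gauss-Seidel) is the natural smoother on SIMD hardware.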



b) Analytical GPU Runtime Performance Model (ISCA'09)
– MWP (Memory Warp Parallelism): the maximum number of warps per SM that can access the memory simultaneously
– CWP (Computation Warp Parallelism): the number of warps the SM processor can execute during one memory warp waiting period, plus one
– Comparing MWP with CWP tells whether a kernel is computation- or memory-bound: when MWP is greater than CWP, the memory loading latency of a loop body with n computation instructions and M memory instructions is hidden behind the instruction issue latency, and computation dominates the runtime (a simplified rendering follows)
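A heavily simplified, hedged rendering of the MWP/CWP idea (the final-cycle formulas here are illustrative; the ISCA'09 model distinguishes more cases and terms):

/* Estimate kernel cycles per SM from warp-level parallelism. */
double estimate_cycles(int n_warps,        /* active warps per SM          */
                       double comp_cycles, /* computation cycles per warp  */
                       double mem_cycles,  /* memory waiting cycles/warp   */
                       double mem_latency, /* latency of one memory access */
                       double departure)   /* delay between memory requests*/
{
    /* MWP: how many warps can overlap their memory accesses. */
    double mwp = mem_latency / departure;
    if (mwp > n_warps) mwp = n_warps;

    /* CWP: warps executable during one memory wait, plus one. */
    double cwp = (mem_cycles + comp_cycles) / comp_cycles;
    if (cwp > n_warps) cwp = n_warps;

    if (mwp >= cwp)   /* computation-bound: memory latency is hidden */
        return comp_cycles * n_warps + mem_cycles / mwp;
    else              /* memory-bound: SPs idle while warps wait     */
        return mem_cycles * n_warps / mwp + comp_cycles;
}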
c) Adaptive Performance Modeling (PPoPP'10)
– Work Flow Graph (WFG): an extension of the control flow graph of a GPU kernel; computation nodes (C1, C2, ...) and memory nodes (M) are annotated with instruction issue latency and memory loading latency, and edges record memory dependencies and the additional delay they introduce
– Example: the WFG of the multigrid smoother contains an inner loop of interleaved computation and memory nodes (e.g., 8 computation + 1 memory instructions per iteration); traversing the WFG predicts the kernel runtime and automatically identifies the optimal simulation configurations (a possible node encoding is sketched below)
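A small illustrative encoding of WFG nodes (our assumption of one possible representation; the poster does not publish this structure):

// Each WFG node is a computation (C) or memory (M) instruction, tagged
// with its latency; edges are stored as predecessor indices so that a
// longest-path walk over the graph yields the runtime estimate.
struct WfgNode {
    enum Kind { COMPUTE, MEMORY } kind;   // C-node or M-node
    double latency;                       // issue or memory-loading latency
    int    preds[4];                      // predecessor node indices
    int    numPreds;                      // includes memory dependencies
};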
d) Experimental Results
– Voltage waveforms prediction: the GPU transient power grid simulation reproduces the Cholmod (direct solver) waveforms; over a nanosecond-scale window the two voltage curves are visually indistinguishable
– IC temperature prediction: the GPU thermal solver matches the reference full-chip temperature distribution (in Celsius)
– Overall, the adaptively tuned GPU solver achieves roughly 18X–25X speedups over Cholmod, including worst-case test configurations
[Figure: predicted voltage waveforms (V) vs. time (seconds, ×10⁻⁹) for Cholmod and GPU, an IC temperature map, and speedup bars of 18X–25X.]
4. REFERENCES
[1] Z. Feng, X. Zhao, and Z. Zeng. Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization. IEEE TCAD, 2011.
[2] Z. Feng and Z. Zeng. Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proc. IEEE/ACM DAC, 2010.
[3] Z. Feng and P. Li. Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling. In Proc. IEEE/ACM ICCAD, 2010.
[4] X. Zhao and Z. Feng. Fast multipole method on GPU: Tackling 3-D capacitance extraction on massively parallel SIMD platforms. In Proc. IEEE/ACM DAC, 2011.