Advanced Scientific Computing on Intel Xeon Phi

PARCO’13, Garching (Germany)
Alexander Heinecke
Technische Universität München, Chair of Scientific Computing
Sep 13th 2013
Acknowledgment
For input and fruitful discussions I want to thank:
Intel DE: Michael Klemm, Hans Pabst, Christopher Dahnken
Intel Labs US/IN, PCL: Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Pradeep Dubey, Victor Lee, Christopher Hughes
Intel US: Mike Julier, Susan Meredith, Catherine Djunaedi, Michael Hebenstreit
Intel RU: Dmitry Budnikov, Maxim Shevtsov, Alexander Batushin
Intel IL: Ayal Zaks, Arik Narkis, Arnon Peleg
And a very special thank you to Joe Curley and the KNC Team for
providing the SE10P cards for my test cluster!
Outline
Introduction to Intel Xeon Phi
Codes that are not yet fully optimized for Xeon Phi!
Molecular Dynamics
Earthquake Simulation - SeisSol, SuperMUC bid workload
Hybrid HPL
Data Mining with Sparse Grids
Conclusions
The Intel Xeon Phi SE10P (5110P)
• 61 (60) cores with 4 threads each at 1.1 (1.053) GHz
• 32 KB L1$ and 512 KB L2$ per core
• L2$ are kept coherent through a fast on-chip ring bus network
• 512-bit wide vector instructions, 2-cycle throughput
• 1074 (1010) GFLOPS peak DP, 2148 (2020) GFLOPS peak SP (sanity check below)
• 8 GB GDDR5, 512-bit interface, 352 (330) GB/s
• PCIe 2.0 host interface
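The peak figures follow from cores × frequency × SIMD lanes × 2 FLOP per FMA; a quick sanity check for the SE10P in double precision:

$$61 \times 1.1\,\mathrm{GHz} \times 8\ \mathrm{DP\ lanes} \times 2\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 1074\ \mathrm{GFLOPS},$$

and twice that, roughly 2148 GFLOPS, in single precision (16 SP lanes per 512-bit register).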
The Intel Xeon Phi Usage Model
(Figure: spectrum of usage models from CPU-centric to MIC-centric: CPU hosted, offload to MIC, symmetric, reverse offload to CPU, MIC hosted; in each model Main(), Compute() and Comm() are placed on the host and/or the coprocessor.)
• scalar: CPU hosted
• highly parallel: MIC offload
• offload through MPI: symmetric
• scalar phases: reverse offload
• highly parallel: MIC hosted
Infiniband-Connectivity of Xeon Phi
(Figure: two bandwidth plots in MByte/s over message size in bytes, for inter-node paths such as host1 ↔ host0-mic0 and host0-mic0 ↔ host1-mic0, and intra-node paths such as mic0 ↔ mic1, mic0 ↔ host and mic1 ↔ host.)
⇒ The Xeon E5 has only 16 PCIe read buffers but 64 PCIe write buffers!
⇒ Start-up latency is 9 µs with 3 hops (host ↔ host: 1.5 µs, 1 hop)!
⇒ Using the hosts as proxies, as in GPUDirect, yields up to 6 GB/s!
Outline
Introduction to Intel Xeon Phi
Codes that are not yet fully optimized for Xeon Phi!
Molecular Dynamics
Earthquake Simulation - SeisSol, SuperMUC bid workload
Hybrid HPL
Data Mining with Sparse Grids
Conclusions
Linked Cell Particle Simulations with ls1 mardyn∗
• With increasing vector length, the net vector utilization goes down (holes caused by the cutoff test)
• The Xeon E5 has too little L1 bandwidth (32 byte read, 16 byte write per cycle)
(Figure: for particle a, the distances d_a_1 ... d_a_4 to particles 1-4 are computed from their x/y/z components; lanes with d_a_i < r_cutoff, 1 ≤ i ≤ 4, then enter the LJ-12-6 force computation that accumulates f_x_a, f_y_a, f_z_a.)
• gather/scatter can be used to achieve 100% vector register utilization (see the sketch below)
⇒ Porting ls1 mardyn to MIC hosted mode is ongoing research
∗ Eckhardt et al.: 591 TFLOPS Multi-Trillion Particles Simulation on SuperMUC, ISC'13
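As a reference for what gets vectorized, a minimal scalar sketch of the cutoff test plus LJ-12-6 force evaluation; the data layout, the parameter names (eps24, sig2) and the function name are illustrative only, not the actual ls1 mardyn interface:

#include <stddef.h>

/* Scalar sketch of an LJ-12-6 force kernel for one particle 'a' against a
 * cell of candidate partners. The KNC version evaluates several partners
 * per 512-bit register; the cutoff branch becomes a mask, and gather/
 * scatter compacts the resulting holes. */
static void lj_forces_one_vs_cell(double xa, double ya, double za,
                                  const double *x, const double *y,
                                  const double *z, size_t n,
                                  double rc2,      /* squared cutoff radius */
                                  double eps24,    /* 24 * epsilon          */
                                  double sig2,     /* sigma^2               */
                                  double *fxa, double *fya, double *fza)
{
    for (size_t i = 0; i < n; ++i) {
        const double dx = xa - x[i], dy = ya - y[i], dz = za - z[i];
        const double r2 = dx * dx + dy * dy + dz * dz;
        if (r2 >= rc2) continue;              /* becomes a vector mask on KNC */

        const double invr2 = 1.0 / r2;
        const double lj2   = sig2 * invr2;    /* (sigma/r)^2  */
        const double lj6   = lj2 * lj2 * lj2; /* (sigma/r)^6  */
        const double scale = eps24 * (2.0 * lj6 * lj6 - lj6) * invr2;

        *fxa += scale * dx;
        *fya += scale * dy;
        *fza += scale * dz;
    }
}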
Single Card Performance Xeon Phi*
(Figure: relative kernel performance (compute distances, execute force computation) for 2-center, 3-center and 4-center Lennard-Jones, comparing one SuperMUC box (16 SNB-EP cores), a Xeon Phi 5110P with 120 MPI ranks and the classic approach, and a Xeon Phi 5110P with 120 MPI ranks using gather/scatter; the reported values lie between 50% and 100%.)
∗ Eckhardt et al.: Vectorization of Multi-Center, Highly-Parallel Rigid-Body Molecular Dynamics Simulations, Poster SC'13, accepted
SeisSol∗
SeisSol is a software package for the simulation of seismic wave phenomena on unstructured grids, based on the discontinuous Galerkin method combined with ADER time discretization.
One key routine is the ADER time integration, which operates on the following matrix sparsity patterns (a toy generated kernel is sketched below):
(Figure: sparsity patterns of the flux and stiffness matrices involved in the ADER time integration; only scattered nonzeros remain in otherwise dense blocks.)
⇒ We are currently at 1.3x for the compute kernels, comparing 1 Xeon Phi to 1 dual-socket SNB-EP node!
∗ Breuer et al.: Accelerating SeisSol by Generating Vectorcode for Sparse Matrix Operators, PARCO'13, accepted
∗ Heinecke et al.: Optimized Kernels for large scale earthquake simulations with SeisSol, an unstructured ADER-DG code, Poster SC'13, accepted
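The generation idea behind the cited kernels can be illustrated with a toy example: for a fixed sparsity pattern the generator emits a fully unrolled update with hard-wired indices instead of a generic CSR loop. The 3x3 pattern, the block width and the function name below are made up for illustration; the real generator targets SeisSol's stiffness/flux matrices and additionally emits vector instructions:

/* Toy "generated" kernel: C += A_sparse * B for one hard-wired 3x3 pattern
 * with nonzeros at (0,0), (1,0) and (2,2); a[] holds the packed nonzeros.
 * B and C are dense 3 x NCOLS blocks. */
#define NCOLS 9

static void sparse_mm_pattern_0(const double a[3],
                                const double B[3][NCOLS],
                                double C[3][NCOLS])
{
    for (int j = 0; j < NCOLS; ++j) {
        C[0][j] += a[0] * B[0][j];   /* nonzero (0,0) */
        C[1][j] += a[1] * B[0][j];   /* nonzero (1,0) */
        C[2][j] += a[2] * B[2][j];   /* nonzero (2,2) */
    }
}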
Outline
Introduction to Intel Xeon Phi
Codes that are not yet fully optimized for Xeon Phi!
Molecular Dynamics
Earthquake Simulation - SeisSol, SuperMUC bid workload
Hybrid HPL
Data Mining with Sparse Grids
Conclusions
The HPL Benchmark∗
• The matrix is stored in a block-cyclic layout and decomposed into P process rows and Q process columns
(Figure: current panel L_i, finished parts of L and U, row panel U_i, and the trailing matrix D.)
• LU factorization is the most critical component: A_i = A_i − L_i U_i (see the DGEMM sketch below)
• The DGEMM fraction grows with increasing system size
⇒ solve the largest possible problem for best performance
⇒ keep the MIC busy all the time (doing DGEMM)
∗ Heinecke et al.: Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, IPDPS'13
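The trailing update A_i = A_i − L_i U_i maps onto a single DGEMM call; a sketch assuming column-major storage and a CBLAS implementation such as MKL's cblas_dgemm, not the HPL source itself:

#include <cblas.h>

/* Trailing-matrix update A_i <- A_i - L_i * U_i in one call:
 * L_i is m x k, U_i is k x n, A_i is m x n, all column-major. */
static void trailing_update(int m, int n, int k,
                            const double *Li, int ldl,
                            const double *Ui, int ldu,
                            double *Ai, int lda)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                -1.0, Li, ldl, Ui, ldu,
                 1.0, Ai, lda);
}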
Designing Offload-DGEMM with work-stealing
(Figure: C is tiled into C_00 ... C_33 with row tiles A_0 ... A_3 of A and column tiles B_0 ... B_3 of B; Knights Corner works through DGEMM(A_0, B_0, C_00), DGEMM(A_0, B_1, C_01), ... while the CPU works on tiles from the opposite corner.)
• Compute time per tile: $T_c = \frac{2MNK}{\mathrm{Peak}}$
• Transfer time per C tile (8 bytes per double): $T_t = \frac{8MN}{\mathrm{PCIe}}$
• Hiding the transfer requires $T_t < T_c$, i.e. $K > 4\,\frac{\mathrm{Peak}}{\mathrm{PCIe}} \approx 1000$
• C, A and B are split into tiles
• The MIC starts to process tiles from C(0,0)
• The CPU starts to process tiles from C(X,X)
• Dynamic load balancing assigns work to MIC or CPU (see the sketch below)
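The dynamic load balancing can be sketched with one shared tile counter that the MIC-feeding thread and the host DGEMM threads both draw from; a simplified OpenMP sketch in which do_tile_on_mic() and do_tile_on_cpu() are placeholders for the real (offloaded resp. host) tile DGEMMs:

#include <omp.h>
#include <stdio.h>

#define NTILES 16

static void do_tile_on_mic(int t) { printf("MIC tile %d\n", t); }
static void do_tile_on_cpu(int t) { printf("CPU tile %d\n", t); }

int main(void)
{
    int next = 0;                       /* shared tile counter */

    #pragma omp parallel
    {
        for (;;) {
            int t;
            #pragma omp atomic capture
            t = next++;                 /* grab the next unprocessed tile */
            if (t >= NTILES)
                break;

            if (omp_get_thread_num() == 0)
                do_tile_on_mic(t);      /* would wrap the offloaded DGEMM */
            else
                do_tile_on_cpu(t);      /* host DGEMM on this tile        */
        }
    }
    return 0;
}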
Hybrid HPL using Pipelined Lookahead
(Figure: timeline of one iteration: CPU panel factorization and panel broadcast overlap with CPU DGEMM on columns 0-K and K-N; U broadcast, row swapping and DTRSM are pipelined in chunks (0-I, I-J, J-N) with the Knights Corner DGEMM on columns K-N; the hybrid DGEMM is load balanced via work-stealing; DMA communication goes over PCIe.)
• A panel was already swapped in the last iteration
• Swap, U broadcast and DTRSM are applied to one B tile at a time
• The panel free-up on the CPU is merged with swap, U broadcast and DTRSM
• Afterwards, swap, U broadcast, DTRSM and the trailing-update DGEMM are executed in a pipelined fashion
⇒ Xeon Phi is executing DGEMM basically all the time!
Lookahead Comparison
(Figure: runtime breakdown (0-100%) over local problem sizes N = 1200 ... 82800 for the Standard Lookahead (left) and the Pipelined Lookahead (right); the components are DGEMM on Knights Corner, exposed panel factorization (CPU), exposed DTRSM (CPU), and exposed U broadcast and row swapping (CPU).)
Comparison of the lookahead variants on a four-node cluster with 2 Xeon Phi and 64 GB RAM per node; the local matrix size is 82.8K, P = Q = 2.
Performance Results
System                       | N    | P  | Q  | TFLOPS | Eff. (%)
CPU only, 64 GB              | 84K  | 1  | 1  | 0.29   | 86.4
CPU only, 64 GB              | 168K | 2  | 2  | 1.10   | 82.8
no pipeline, 1 card, 64 GB   | 84K  | 1  | 1  | 0.99   | 71.0
pipeline, 1 card, 64 GB      | 84K  | 1  | 1  | 1.12   | 79.8
no pipeline, 1 card, 64 GB   | 168K | 2  | 2  | 3.88   | 69.1
pipeline, 1 card, 64 GB      | 168K | 2  | 2  | 4.36   | 77.6
no pipeline, 1 card, 64 GB   | 825K | 10 | 10 | 95.2   | 67.7
pipeline, 1 card, 64 GB      | 825K | 10 | 10 | 107.0  | 76.1
no pipeline, 2 cards, 64 GB  | 84K  | 1  | 1  | 1.66   | 68.2
pipeline, 2 cards, 64 GB     | 84K  | 1  | 1  | 1.87   | 76.6
no pipeline, 2 cards, 64 GB  | 166K | 2  | 2  | 6.36   | 65.0
pipeline, 2 cards, 64 GB     | 166K | 2  | 2  | 7.15   | 73.1
no pipeline, 2 cards, 64 GB  | 822K | 10 | 10 | 156.5  | 64.0
pipeline, 2 cards, 64 GB     | 822K | 10 | 10 | 175.8  | 71.9
pipeline, 1 card, 128 GB     | 242K | 2  | 2  | 4.42   | 79.6
Outline
Introduction to Intel Xeon Phi
Codes that are not yet fully optimized for Xeon Phi!
Molecular Dynamics
Earthquake Simulation - SeisSol, SuperMUC bid workload
Hybrid HPL
Data Mining with Sparse Grids
Conclusions
Sparse Grids in a nutshell∗
• Curse of dimensionality: $O(N^d)$ grid points
• Therefore: sparse grids
  • reduce the cost to $O(N \log(N)^{d-1})$
  • give similar accuracy if the problem is sufficiently smooth
• Basic idea: hierarchical basis in 1d (here: piecewise linear)
0,0
1,1
0,1
l =1
1,1
l =1
.
.
x0,1
x1,1
x0,0
x1,1
2,3
2,1
l =2
l =2
x2,1
.
x2,3
x2,1
3,1 3,3 3,5 3,7
l =3
3,1
l =3
.
x3,1 x3,3 x3,5 x3,7
∗
2,3
2,1
.
x2,3
3,3 3,5
3,7
.
x3,1 x3,3 x3,5 x3,7
∗ Heinecke et al.: Data Mining on Vast Dataset as a Cluster System Benchmark, PMBS'13, submitted
Data Mining with SG - In a nutshell
Given $d$-dimensional instances and ansatz functions, $f_N(\vec{x})$ is built on the data set
$$S = \{(\vec{x}_m, y_m) \in \mathbb{R}^d \times \mathbb{K}\}_{m=1,\dots,M}, \qquad f_N(\vec{x}) = \sum_{j=1}^{N} u_j \varphi_j(\vec{x}).$$
The problem to be solved is
$$f_N = \arg\min_{f_N \in V_N} \frac{1}{M} \sum_{m=1}^{M} (y_m - f_N(\vec{x}_m))^2 + \lambda \, \lVert \nabla f_N \rVert_{L_2}^2,$$
which results in the SLE
$$\left(\frac{1}{M} B B^T + \lambda C\right) \vec{u} = \frac{1}{M} B \vec{y}, \qquad \text{here } C = I,\; b_{ij} = \varphi_i(\vec{x}_j).$$
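For readability, the SLE is the usual normal-equation system obtained by writing the regularizer as $\lambda\,\vec{u}^T C \vec{u}$ and setting the gradient with respect to $\vec{u}$ to zero:

$$-\frac{2}{M} B\,(\vec{y} - B^T \vec{u}) + 2\lambda\,C\vec{u} = 0 \;\;\Longrightarrow\;\; \left(\frac{1}{M} B B^T + \lambda C\right)\vec{u} = \frac{1}{M} B \vec{y}.$$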
Implementation of B^T and B
B^T: parallelization over dataset instances
(Figure: the grid data (level, index, α) is applied to the dataset; one partial sum per instance gives y.)
B: parallelization over grid points
(Figure: the dataset and y are applied per grid point; one partial sum per grid point gives the result.)
Non-uniform Basis Functions - the B^T Operator
for x_m ∈ S with |S| = M, y_m ← 0 do
  for all g ∈ grid with s ← α_g do
    for k = 1 to d do
      if g_{l_k} = 1 then
        s ← s · 1
      else if g_{i_k} = 1 then
        s ← s · max(2 − 2^{g_{l_k}} · x_{m_k}; 0)
      else if g_{i_k} = 2^{g_{l_k}} − 1 then
        s ← s · max(2^{g_{l_k}} · x_{m_k} − g_{i_k} + 1; 0)
      else
        s ← s · max(1 − |2^{g_{l_k}} · x_{m_k} − g_{i_k}|; 0)
      end if
    end for
    y_m ← y_m + s
  end for
end for
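A direct scalar C transcription of this loop nest; the flat array layout and the names are chosen for illustration and are not the production SG++ data structures (which are blocked and vectorized with intrinsics):

#include <math.h>
#include <stddef.h>

/* B^T operator with modified linear basis functions:
 * y_m = sum_g alpha_g * prod_k phi_{l_k, i_k}(x_{m,k}). */
static void mult_BT(const double *x,     /* M x d dataset, row-major */
                    const int *lvl,      /* G x d levels  l_k        */
                    const int *idx,      /* G x d indices i_k        */
                    const double *alpha, /* G surplus values         */
                    double *y,           /* M results                */
                    size_t M, size_t G, size_t d)
{
    for (size_t m = 0; m < M; ++m) {
        double ym = 0.0;
        for (size_t g = 0; g < G; ++g) {
            double s = alpha[g];
            for (size_t k = 0; k < d; ++k) {
                const int    l  = lvl[g * d + k];
                const int    i  = idx[g * d + k];
                const double xl = ldexp(x[m * d + k], l);   /* 2^l * x */
                double phi;

                if (l == 1)                 phi = 1.0;                          /* constant       */
                else if (i == 1)            phi = fmax(2.0 - xl, 0.0);          /* left boundary  */
                else if (i == (1 << l) - 1) phi = fmax(xl - i + 1.0, 0.0);      /* right boundary */
                else                        phi = fmax(1.0 - fabs(xl - i), 0.0);/* interior hat   */

                s *= phi;
            }
            ym += s;
        }
        y[m] = ym;
    }
}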
Extending B^T and B to Clusters
(Figure: the M × d dataset is partitioned across processes P1-P4; each process applies B^T to its chunk without any communication, applies B to form a partial result of length N, and the partial results are summed with an MPI_Allreduce.)
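In MPI terms, the only global communication per iteration is the reduction of the partial results of B; a minimal sketch, not the actual SG++ communication layer:

#include <mpi.h>

/* Each rank has applied B to its dataset slice and holds a partial result
 * vector w of length N (number of grid points); one in-place allreduce
 * sums the partial vectors, while B^T needs no communication at all. */
static void reduce_B_result(double *w, int N)
{
    MPI_Allreduce(MPI_IN_PLACE, w, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}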
Performance Model for MPI Allreduce
Total learning time on $p$ nodes:
$$T_{\mathrm{learn}}(p, N) = T_{\mathrm{seq}}(N) + T_B(p, N) + T_{B^T}(p, N) + T_{\mathrm{comm}}(p, N),$$
with
$$T_{\mathrm{seq}} = T_{\mathrm{total}}^{\mathrm{single\,node}} - T_B^{\mathrm{single\,node}} - T_{B^T}^{\mathrm{single\,node}}, \qquad T_B(p) = \frac{T_B^{\mathrm{single\,node}}}{p}, \qquad T_{B^T}(p) = \frac{T_{B^T}^{\mathrm{single\,node}}}{p}.$$
The communication time can be estimated by
$$T_{\mathrm{comm}}(p, N) = \sum_{r \in \mathrm{Refinements}} T_{\mathrm{commCG}}(p, N),$$
employing the OSU MVAPICH benchmark suite:
$$T_{\mathrm{commCG}}(p, N) = \mathrm{microbenchmark}(N, p) \cdot (\#\mathrm{iterations} + 1).$$
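Written as code, the model is a handful of divisions plus a lookup into the measured all-reduce times; a sketch in which micro_allreduce_time() is a stand-in for the OSU benchmark data, stubbed with a dummy linear model so the example is self-contained:

/* Predicted learning time on p nodes from single-node timings plus the
 * measured allreduce cost per refinement step. */
static double micro_allreduce_time(long msg_bytes, int p)   /* placeholder */
{
    return 1.0e-5 * (double)p + 1.0e-9 * (double)msg_bytes;
}

static double predict_tlearn(double t_total_1, double t_B_1, double t_BT_1,
                             int p, int n_refinements,
                             const long *grid_bytes,  /* message size per step  */
                             const int  *cg_iters)    /* CG iterations per step */
{
    const double t_seq = t_total_1 - t_B_1 - t_BT_1;  /* not parallelized */
    double t = t_seq + t_B_1 / p + t_BT_1 / p;

    for (int r = 0; r < n_refinements; ++r)
        t += micro_allreduce_time(grid_bytes[r], p) * (cg_iters[r] + 1);

    return t;
}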
Performance Model Evaluation
(Figure: GFLOPS and parallel efficiency over the number of nodes (up to 30), comparing the Async, Allreduce and Allgatherv variants against the model prediction and the optimal scaling.)
• test dataset DR5: 400K instances, 5 dimensions
Xeon Phi Offload vs. Xeon Phi Native
(Figure: GFLOPS over the number of nodes. Upper panel: DR5 dataset; series: CoolMAC - 2 W8000s, Beacon - 1/2/4 5110Ps native and offload, Todi - 1 K20X, SuperMUC. Lower panel: Friedman1 dataset; series: Beacon - 4 5110Ps offload and native, Todi - 1 K20X, SuperMUC.)
• DR5 dataset: 400K instances, 5 dimensions (upper)
• Friedman1 dataset: 10M instances per node, 10 dimensions (lower)
• We are running out of parallel tasks
• Offloading is a limitation in the case of small grids
Using MIC native implementation on Cluster
(Figure: GFLOPS on 1, 2, 4, 8 and 16 nodes for Beacon - CPU only, Beacon - 1 5110P native, Beacon - 1 5110P offload, Beacon - symmetric (CPU+MIC) and Beacon - hybrid (CPU+MIC).)
• test dataset DR5: 400K instances, 5 dimensions
• symmetric mode: 2 ranks per Xeon Phi, 1 per Xeon E5
Conclusions
Data parallelism in x86 delivers performance comparable to GPUs, but at a higher programming productivity!
Key findings:
• we are flexible to use different programming models
• we can simply re-compile and the code runs directly on Xeon Phi
• intrinsics can be easily ported
• optimizations for Xeon Phi are optimizations for Xeon E5, too!
• tuning for the Intel MIC architecture does not differ from tuning for a standard CPU, which is also required when using GPUs (since CPU FLOPS should not be neglected)
• but: running 60+ MPI ranks on a Xeon Phi seems to be suboptimal
• but: the ccNUMA-like L2 cache needs to be respected
BACKUP
STREAM on Xeon Phi
(Figure: STREAM copy, scale, add and triad bandwidth in GByte/s for 1 to 240 threads.)
DGEMM on Xeon Phi
(Figure: GFLOPS over N from 1K to 28K for the outer-product kernel on KNC with A and B in KNC-friendly format (K = 300), the outer-product DGEMM on KNC including the formatting overhead (K = 300), and DGEMM on SNB-EP using Intel MKL 11.0 with square A, B and C; efficiency annotations of 89.4%, 89.8% and 90.0% appear in the plot.)
Heinecke et al.: Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, IPDPS'13
DGETRF on Xeon Phi
(Figure: GFLOPS over the matrix rank N from 1K to 30K for HPL with static look-ahead on KNC, HPL with dynamic scheduling on KNC, native DGEMM on KNC, and SMP Linpack on SNB-EP using MKL 11.0; efficiency annotations of 89.8%, 78.8% and 83.0% appear in the plot.)
Heinecke, et.al.: IPDPS’13
The Intel Xeon Phi Usage Model - Offloading
low-level: SCI - a low-level, MPI-like interface (DMA windows, memory fences)
med-level: COI - OpenCL-like, with handles, buffers, etc.
high-level: #pragma annotations for driving the offload to MIC
Example:
#pragma offload target(mic:d) nocopy(ptrData) \
        in(ptrCoef:length(size) alloc_if(0) free_if(0) align(64)) \
        out(ptrCoef[start:mychunk] : into(ptrCoef[start:mychunk]) alloc_if(0) free_if(0) align(64)) \
        in(size)
{
  ...
}
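A minimal, self-contained offload example in the same high-level style (requires the Intel compiler's Language Extensions for Offload; the array name and size are arbitrary):

#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N];
    for (int i = 0; i < N; ++i) a[i] = (double)i;

    /* Copy 'a' to the first coprocessor, scale it there, copy it back.
     * Without explicit alloc_if/free_if the buffer is allocated before
     * and freed after the offload region. */
    #pragma offload target(mic:0) inout(a:length(N))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] *= 2.0;
    }

    printf("a[10] = %f\n", a[10]);
    return 0;
}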
Offload-DGEMM with one Xeon Phi
(Figure: GFLOPS and efficiency over the matrix size N from 1.2K to 79.2K.)
Offload-DGEMM with two Xeon Phis
(Figure: GFLOPS and efficiency over the matrix size N from 1.2K to 79.2K.)
Naive Hybrid HPL
(Figure: timeline of one iteration: the CPU performs panel factorization, panel broadcast, U broadcast and SWAP+DTRSM, while Knights Corner runs the (hybrid) DGEMM; DMA communication goes over PCIe.)
• very simple: least change to HPL, plug and play
• no lookahead - this is inefficient even on the CPU
• performance is limited by the swapping and the panel factorization running on the CPU
• as the MIC is 3-4x faster than the CPU, MIC idle time is more expensive
⇒ Keeping the Xeon Phi as busy as possible is the major goal!
Hybrid HPL using Standard Lookahead
(Figure: timeline of one iteration: the CPU performs panel factorization, CPU DGEMM on columns 0-K, U broadcast and SWAP+DTRSM, and DGEMM on columns K-N (hybrid, load balanced via work-stealing); Knights Corner runs DGEMM on columns K-N; panel broadcast and DMA communication go over PCIe.)
• little change to HPL
• DGEMM is split into two parts: panel free-up and trailing update
• the trailing-update DGEMM is load-balanced via work-stealing between MIC and CPU
• the panel factorization is entirely hidden
⇒ U broadcast, swapping of the trailing matrix and DTRSM are still exposed!
Outline
Introduction to Intel Xeon Phi
Codes that are not yet fully optimized for Xeon Phi!
Molecular Dynamics
Earthquake Simulation - SeisSol, SuperMUC bid workload
Hybrid HPL
Data Mining with Sparse Grids
Conclusions
Beacon: A Path to Energy-Efficient HPC
R. Glenn Brook, Director, Application Acceleration Center of Excellence
National Institute for Computational Sciences
glenn-brook@tennessee.edu
Green500 Hardware
• Appro Xtreme-X Supercomputer powered by ACE

Beacon Green500 Cluster
Nodes:                                        36 compute
CPU model:                                    Intel® Xeon® E5-2670
CPUs per node:                                2x 8-core, 2.6 GHz
RAM per node:                                 256 GB
SSD per node:                                 960 GB (RAID 0)
Intel® Xeon Phi™ 5110P coprocessors per node: 4
Cores per Intel® Xeon Phi™ 5110P:             60 @ 1.053 GHz
RAM per Intel® Xeon Phi™ 5110P:               8 GB GDDR5
Assembled Team
• Intel:
  • Mikhail Smelyanskiy – software lead
  • Karthikeyan Vaidyanathan – MIC HPL
  • Ian Steiner – host HPL optimization
  • Many, many others: Joe Curley, Jim Jeffers, Pradeep Dubey, Susan Meredith, Rajesh Agny, Russ Fromkin, and many dedicated and passionate team members
• Technische Universität München:
  • Alexander Heinecke – MIC HPL
• Appro:
  • John Lee – hardware lead
  • David Parks – HPL and system software
  • Edgardo Evangelista, Danny Tran, others – system deployment and support
• NICS:
  • Glenn Brook – project lead
  • Ryan Braby, Troy Baer – system support
Power Consumption
classical, recursive function evaluation
(Figure: the subgrids of a 2d sparse grid arranged by levels l1 = 1, 2, 3 and l2 = 1, 2, 3, as traversed by the classical, recursive function evaluation.)
Algorithm: Learning data sets with adaptive sparse grids
1:  D := data set
2:  G := start grid
3:  u := 0
4:  while TRUE do
5:    trainGrid(G, u, D)
6:    acc = getAccuracy(G, u, D)
7:    if acc ≥ acc_needed then
8:      return u, G
9:    end if
10:   refineGrid(G, u)
11: end while
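A compact C sketch of the same loop; trainGrid, getAccuracy and refineGrid are the abstract operations from the listing above, stubbed here so the example stands on its own:

#include <stdio.h>

typedef struct { int points; } Grid;     /* stand-ins for the real types */
typedef struct { int unused; } Dataset;

static void   trainGrid(Grid *g, double *u, const Dataset *d) { (void)g; (void)u; (void)d; }
static double getAccuracy(const Grid *g, const double *u, const Dataset *d)
{ (void)u; (void)d; return g->points >= 8 ? 0.95 : 0.5; /* dummy accuracy */ }
static void   refineGrid(Grid *g, double *u) { (void)u; g->points *= 2; }

int main(void)
{
    Dataset D = { 0 };
    Grid    G = { 1 };                   /* start grid        */
    double  u[1] = { 0.0 };              /* surpluses, u := 0 */
    const double acc_needed = 0.9;

    for (;;) {                           /* while TRUE        */
        trainGrid(&G, u, &D);
        if (getAccuracy(&G, u, &D) >= acc_needed)
            break;                       /* return u, G       */
        refineGrid(&G, u);
    }
    printf("trained grid with %d points\n", G.points);
    return 0;
}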
First results (modlinear) - baseline: classic algorithms
DR5 data set (370K instances):
(Figure: speed-up over the classic algorithm for X5650 SSE, 6128HE SSE, i7-2600 SSE, i7-2600 AVX, 1x GTX470, 2x GTX470 and hybrid X5650/GTX470; the labeled values range from about 2.6 to 5.3.)
First results (regular) - baseline: classic algorithms
DR5 data set (370K instances):
(Figure: speed-up over the classic algorithm for X5650 SSE, 6128HE SSE, i7-2600 SSE, i7-2600 AVX, 1x GTX470, 2x GTX470 and hybrid X5650/GTX470; the labeled values range from about 4 to 50.9.)
Scaling on SuperMUC
(Figure: GFLOPS versus scale (16 to 32768) for modlinear and linear boundary, each compared with its ideal scaling.)
• modlinear: spatially adaptive grids (5K - 21K grid points in 7 refinements), modlinear basis, MSE: 4.8·10^-4, runtime at 16384: 4.4 s
• linear boundary: regular grid (100K grid points), regular linear basis, MSE: 5.4·10^-4, runtime at 32768: 3.6 s
Data Mining with Sparse Grids @SC12