Parallel Programming Trends in Extremely Scalable
Architectures
Carlo Cavazzoni, HPC department, CINECA
www.cineca.it
CINECA
CINECA is a non-profit consortium, made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide.
The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.
Why parallel programming?
Solve larger problems
Run memory demanding codes
Solve problems with greater speed
Modern Parallel Architectures
Two basic architectural schemes:
Distributed Memory
Shared Memory
Most computers now have a mixed architecture; adding accelerators leads to hybrid architectures.
Distributed Memory
[Diagram: several nodes, each with its own CPU and local memory, connected through a network]
Shared Memory
Real shared memory: [Diagram: several CPUs connected to shared memory banks through a system bus]
Virtual shared memory: [Diagram: several nodes, each with a CPU and a HUB, connected through a network]
Mixed Architectures
[Diagram: several multi-CPU nodes, each with its own shared memory, connected through a network]
Most Common Networks
Switched (switch-based networks)
Cube, hypercube, n-cube
Torus in 1, 2, ..., N dimensions
Fat tree
HPC Trends
[Chart: number of cores of the #1 system in the Top500 list, June editions from 1993 onwards, rising to several hundred thousand cores]
Paradigm change in HPC: the next HPC system installed at CINECA will have 200000 cores.
What about applications?
Roadmap to Exascale
(architectural trends)
Dennard scaling law (MOSFET):
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L'^2 = 4 * D
P' = P

The law does not hold anymore! What happens today:
L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L'^2 = 4 * D
P' = 4 * P

The power crisis: core frequency and performance no longer grow following Moore's law.
CPU + accelerator architectures are the way to maintain the evolution of the architectures on the Moore's law track.
The programming crisis!
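A short derivation of the P' entries above (a sketch assuming per-device dynamic power p = C * V^2 * F, capacitance scaling as C' = C / 2, and chip power P = D * p, i.e. device density times per-device power):
Dennard scaling: p' = (C/2) * (V/2)^2 * (2F) = p / 4, hence P' = (4D) * (p/4) = P (constant chip power).
Today, with V' = ~V: p' = (C/2) * V^2 * (2F) = p, hence P' = (4D) * p = 4 * P.
This factor-of-4 growth in power per generation at constant area is the power crisis.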
Where Watts are burnt?
Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA, D = A + B * C, takes 4.7x the energy of the FMA operation itself.
Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!
MPP System
Arch: Option for BG/Q
When? 2012
PFlop/s: > 2
Power: > 1 MWatt
Cores: > 150000
Threads: > 500000
Accelerator
A set (one or more) of very simple execution units that can perform only a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), they can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)
[Diagram: the CPU is optimized for single-thread performance, the accelerator (ACC.) for throughput; the trend is toward architectural and physical integration of CPU & ACC.]
nVIDIA GPU
The Fermi implementation will pack 512 processor cores.
ATI FireStream, AMD GPU
2012: the new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.
Intel MIC (Knights Ferry)
What about parallel App?
In a massively parallel context, an upper limit for the scalability of
parallel applications is determined by the fraction of the overall
execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.
With 1000000 cores: P = 0.999999, i.e. a serial fraction of 0.000001.
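As a worked example of the limit above (a sketch using Amdahl's law with the numbers on this slide): the speedup on N cores is S(N) = 1 / ( (1 - P) + P / N ). With P = 0.999999 and N = 1000000, S = 1 / (0.000001 + 0.000001) = 500000, while the asymptotic limit 1 / (1 - P) is 1000000. Even a serial fraction of one part per million halves the achievable speedup on a million cores.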
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS) Languages
  - UPC, Coarray Fortran, Titanium
• Next Generation Programming Languages and Models
  - Chapel, X10, Fortress
• Languages and Paradigms for Hardware Accelerators
  - CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends
Scalar application
Vector
MPP systems, message passing: MPI (distributed memory)
Multi-core nodes: OpenMP (shared memory)
Accelerators (GPGPU, FPGA): CUDA, OpenCL
Hybrid codes
Message Passing: domain decomposition
[Diagram: the global domain is split into sub-domains, one per node; each node holds its sub-domain in local memory and the nodes communicate through an internal high-performance network]
Ghost Cells - Data exchange
[Diagram: two processors with neighbouring sub-domains; the update of point (i,j) uses its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1), so a layer of ghost cells along the sub-domain boundaries is exchanged between the processors at every update]
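The following is a minimal sketch of the ghost-cell exchange just described, written as a 1D example with MPI_Sendrecv; the array name u, the local size NLOC and the single halo layer are illustrative assumptions, not code from the slides.

/* Minimal 1D ghost-cell (halo) exchange sketch with MPI_Sendrecv. */
#include <mpi.h>
#include <stdio.h>

#define NLOC 8                            /* interior cells per process (assumption) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOC+1] are ghost cells, u[1..NLOC] is the local sub-domain */
    double u[NLOC + 2];
    u[0] = u[NLOC + 1] = -1.0;            /* ghosts, filled by the exchange */
    for (int i = 1; i <= NLOC; i++) u[i] = rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first interior cell to the left neighbour,
       receive the right neighbour's first cell into my right ghost cell */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* symmetric exchange with the right neighbour */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[NLOC + 1]);

    MPI_Finalize();
    return 0;
}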
Message Passing: MPI
Main characteristics:
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel applications

Open issues:
• Latency
• OS jitter
• Scalability
Shared memory
[Diagram: a multi-CPU node; threads 0-3 run on the CPUs and share the node memory, accessing shared variables x and y]
Shared Memory: OpenMP
Main characteristics:
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC applications

Open issues:
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
OpenMP
!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
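The listing above is pseudocode sketching how the 1D FFT loops are distributed over threads. Below is a minimal, self-contained OpenMP example in C (an illustrative sketch, not taken from the slides) showing the same directive-based loop partitioning on a simple element-wise array operation; it can be compiled with, e.g., gcc -fopenmp.

/* Minimal OpenMP loop-partitioning sketch (illustrative, not from the slides):
 * the iterations of the parallel loop are divided among the threads. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
        c[i] = a[i] + b[i];               /* each element handled by one thread */
    }

    printf("c[N-1] = %g (omp_get_max_threads() = %d)\n", c[N - 1], omp_get_max_threads());
    return 0;
}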
Accelerator/GPGPU
[Figure: element-wise sum of two 1D arrays]
CUDA sample
void CPUCode( int* input1, int* input2, int* output, int length ) {
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input1[ i ] + input2[ i ];
    }
}

__global__ void GPUCode( int* input1, int* input2, int* output, int length ) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input1[ idx ] + input2[ idx ];
    }
}

Each thread executes one loop iteration.
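For completeness, here is a hedged sketch of the host code that could allocate the buffers, copy the data and launch the GPUCode kernel above (assumed to be in the same file); the buffer names, the array length and the block size of 256 are illustrative assumptions, not part of the slide. The grid size (length + threads - 1) / threads rounds up so that every element is covered by one thread, matching the if (idx < length) guard in the kernel.

/* Illustrative host-side driver for the GPUCode kernel above. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int length = 1 << 20;
    size_t bytes = length * sizeof(int);

    int *h_in1 = (int *)malloc(bytes);
    int *h_in2 = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    for (int i = 0; i < length; ++i) { h_in1[i] = i; h_in2[i] = 2 * i; }

    int *d_in1, *d_in2, *d_out;
    cudaMalloc((void **)&d_in1, bytes);
    cudaMalloc((void **)&d_in2, bytes);
    cudaMalloc((void **)&d_out, bytes);

    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    /* one thread per array element: enough 256-thread blocks to cover 'length' */
    int threads = 256;
    int blocks  = (length + threads - 1) / threads;
    GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, length);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("h_out[10] = %d\n", h_out[10]);    /* expect 30 */

    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
    free(h_in1); free(h_in2); free(h_out);
    return 0;
}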
CUDA, OpenCL
Main characteristics:
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single-iteration parallelization
• Ad-hoc memory
• Few HPC applications

Open issues:
• Memory copy
• Standards
• Tools
• Integration with other languages
Hybrid (MPI + OpenMP + CUDA + … + Python)
Take the best of all models
Exploit the memory hierarchy
Many HPC applications are adopting this model
Mainly due to developer inertia: it is hard to rewrite millions of source lines
Hybrid parallel programming
Python: ensemble simulations
MPI: domain partition
OpenMP: external loop partition
CUDA: assign inner-loop iterations to GPU threads
Quantum ESPRESSO, http://www.qe-forge.org/
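A minimal sketch of the MPI + OpenMP layering listed above (illustrative only, not Quantum ESPRESSO code): each MPI rank owns a sub-domain, OpenMP threads split the loop over that sub-domain, and MPI combines the per-rank results; the array name u, the size NLOC and the reduction are assumptions for the example.

/* Illustrative MPI + OpenMP hybrid sketch (not Quantum ESPRESSO code). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NLOC 1000000                      /* local sub-domain size (assumption) */

int main(int argc, char **argv)
{
    int provided;
    /* FUNNELED: only the master thread of each rank makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[NLOC];
    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP splits the loop over the rank's sub-domain among its threads */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NLOC; i++) {
        u[i] = rank + 1.0;
        local_sum += u[i];
    }

    /* MPI combines the per-rank partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);

    MPI_Finalize();
    return 0;
}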
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
PRACE
The PRACE Research Infrastructure (www.prace-ri.eu) is the top level of the European HPC ecosystem.
The vision of PRACE is to enable and support European global leadership in public and private research and development.
CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.
Tier 0: European (PRACE)
Tier 1: National (CINECA today)
Tier 2: Local
FERMI @ CINECA
PRACE Tier-0 System
Architecture: 10 BG/Q frames
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1GByte / core
Internal Network: 5D Torus
Disk Space: 2PByte of scratch space
Peak Performance: 2PFlop/s
ISCRA & PRACE calls for projects are now open!
Conclusion
Parallel programming trends in extremely scalable architectures
• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!