Slides - Lattice 2013

The PRACE project and the Application
Development Programme (WP8-2IP)
Claudio Gheller (ETH-CSCS)
PRACE - Partnership for Advanced
Computing in Europe
• PRACE has the aim of creating a European Research Infrastructure providing world-class systems and services and coordinating their use throughout Europe.
PRACE History – An Ongoing Success Story
The timeline spans 2004–2013: HPCEUR (HPC becomes part of the ESFRI Roadmap; creation of a vision involving 15 European countries), creation of the Scientific Case, HET and the signature of the MoU, the PRACE Initiative, the PRACE Preparatory Phase Project, the creation of the PRACE Research Infrastructure (PRACE RI), and the PRACE-1IP, PRACE-2IP and PRACE-3IP implementation projects.
PRACE-2IP
• 22 partners (21 countries), funding 18 Million €
• Preparation/Coordination: FZJ/JSC/PRACE PMO
• 1.9.2011 – 31.8.2013, extended to 31.8.2014 (only
selected WPs)
• Main objectives:
– Provision of access to HPC resources
– Refactoring and scaling of major user codes
– Tier-1 Integration (DEISA → PRACE)
– Consolidation of the Research Infrastructure
PRACE-3IP
• Funding 20 Million €
• Started: summer 2013
• Objectives
– Provision of access to HPC resources
– Planned: Pre-commercial procurement exercise
– Planned: Industry application focus
Access to Tier-0 supercomputers
• Open Call for Proposals (~ 2 months)
• Technical Peer Review, carried out by technical experts in PRACE systems and software, and Scientific Peer Review, carried out by researchers with expertise in the scientific field of the proposal (~ 3 months)
• Prioritisation + Resource Allocation by the Access Committee; the PRACE director decides on the proposal of the Access Committee
• Project + Final Report (~ 1 year)
Distribution of resources
Shares across scientific domains (pie chart): Astrophysics, Chemistry and Materials, Earth Sciences and Environment, Engineering and Energy, Fundamental Physics, Mathematics and Computer Science, Medicine and Life Sciences; individual shares range from 3% to 28%.
PRACE-2IP WP8: Enabling Scientific Codes to the Next Generation of HPC Systems
PRACE 2IP workpackages
– WP1 Management
– WP2 Framework for Resource Interchange
– WP3 Dissemination
– WP4 Training
– WP5 Best Practices for HPC Systems Commissioning
– WP6 European HPC Infrastructure Operation and Evolution
– WP7 Scaling Applications for Tier-0 and Tier-1 Users
– WP8 Community Code Scaling (ETH leading the WP)
– WP9 Industrial Application Support
– WP10 Advancing the Operational Infrastructure
– WP11 Prototyping
– WP12 Novel Programming Techniques
WP8: involved centers
WP8 objectives
• Initiate a sustainable programme in application development for the coming generation of supercomputing architectures, with a selection of community codes targeted at problems of high scientific impact that require HPC.
• Refactor community codes in order to optimally map applications to future supercomputing architectures.
• Integrate and validate these new developments into the existing application communities.
WP8 principles
• scientific communities, with their high-end research challenges, are the main drivers for software development;
• synergy between HPC experts and application developers from the communities;
• supercomputing centres have to recast their service activities in order to support, guide and enable scientific program developers and researchers in refactoring codes and re-engineering algorithms;
• a strong commitment from the scientific community has to be secured.
WP8 workflow
Task 1: Scientific Domains and Communities Selection – Scientific Communities Engagement – Codes screening – Codes Performance Analysis and Model – Communities build-up – Codes and kernels selection
Task 2: Codes Refactoring – Communities consolidation – Prototypes experimentation
Task 3: Code Validation and reintegration
Communities selection (task 1)
• the candidate community must have high impact on science
and/or society;
• the candidate community must rely on and leverage high
performance computing;
• WP8 can have a high impact on the candidate community;
• the candidate community must be willing to actively invest in
software refactoring and algorithm re-engineering.
o Astrophysics
o Climate
o Material Science
o Particle Physics
o Engineering
Codes and kernels selection methodology (task 1)
• Performance Modelling methodology
• An objective and quantitative way to select codes and estimate possible performance improvements
– The goal of performance modelling is to gain insight into an application's performance on a given computer system
– achieved first by measurement and analysis, and then by the synthesis of the application and computing system characteristics
– it also represents a predictive tool, estimating the behaviour on a different computing architecture and identifying the most promising areas for performance improvement
Selected codes and institution in charge (task 1)
RAMSES – ETH
PFARM – STFC
EAF-PAMR – UC-LCA
OASIS – CEA
I/O – ICHEC
ICON – ETH
NEMO – STFC
Fluidity/ICOM – STFC
ABINIT – CEA
QuantumESPRESSO – CINECA
YAMBO – UC-LCA
SIESTA – BSC
OCTOPUS – UC-LCA
EXCITING/ELK – ETH
PLQCD – CASTORC
ELMER – VSB-TUO
CODE_SATURN – STFC
ALYA – BSC
ZFS – HLRS
Codes Refactoring (task 2)
• Still running (in its last few weeks)
• Specific code kernels are being re-designed and re-implemented according to the workplans defined in task 1
• Each group works independently
• Check points at face-to-face workshops and all-hands meetings
• A dedicated wiki site was set up to report progress, collect and exchange information and documents, and to manage and release the implemented code: http://prace2ip-wp8.hpcforge.org
Codes validation and re-introduction (task 3)
• Collaborative work (on a daily basis) involving developers and HPC experts
• Dedicated workshops
• Face-to-face meetings
• Participation and contribution to conferences
This way, no special/specific code re-integration procedure was needed.
Case study: RAMSES
• The RAMSES code was developed to study the evolution of the large-scale structure of the universe and the process of galaxy formation.
• Adaptive mesh refinement (AMR) multi-species code (baryons – hydrodynamics – plus dark matter – N-Body).
• Gravity couples the two components; it is solved by a multigrid approach.
• Other components are supported (e.g. MHD, radiative transfer), but are not the subject of our analysis.
Performance analysis example
Parallel profiling, large test (512³ base grid, 9 refinement levels – 250 GB): strong scaling.
For this test, communication becomes the most relevant part, and it is dominated by synchronisations, due to the difficulty of load balancing the AMR-multigrid algorithms.
Strong improvements can be obtained by tuning the load balance among computational elements (nodes?).
Performance analysis: conclusions
The performance analysis identified the critical kernels of the code:
• Hydro: all the functions needed to solve the hydrodynamic problem. These include the functions that collect, from grids at different resolutions, the data necessary to update each single cell, those that calculate fluxes to solve the conservation equations, the Riemann solvers and the finite-volume solvers.
• Gravity: this group comprises the functions needed to calculate the gravitational potential at different resolutions using a multigrid-relaxation approach.
• MPI: comprises all the communication-related MPI calls (data transfer, synchronisation, management).
HPC architectures model
Performance improvements
Two main objectives:
1. Hybrid OpenMP+MPI parallelisation, to exploit systems with distributed nodes, each accounting for cores with shared memory (a minimal sketch of this pattern follows below)
2. Exploitation of accelerators, in particular GPUs, adopting different paradigms (CUDA, OpenCL, directives)
From the analysis of the performance and of the characteristics of the kernels under investigation we can say that:
• The Hydro kernel is suitable for both approaches. Specific care must be paid to memory access issues.
• The Gravity kernel can benefit from the hybrid implementation.
• Due to the multigrid structure, however, an efficient GPU version can be particularly challenging, so it will be considered only if time and resources permit.
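To make objective 1 concrete, the following minimal sketch shows the hybrid MPI+OpenMP pattern: MPI ranks own distinct grid blocks, and OpenMP threads share the cell loop within each block. The routine and variable names (update_block, cells, ncells) are illustrative placeholders, not taken from the actual RAMSES sources.

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    // Illustrative hybrid pattern: MPI ranks own distinct grid blocks,
    // OpenMP threads share the cell loop inside each rank's block.
    static void update_block(double *cells, int ncells)
    {
        #pragma omp parallel for
        for (int i = 0; i < ncells; ++i)
            cells[i] += 1.0;              // stand-in for the per-cell hydro update
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int ncells = 1 << 20;       // cells owned by this rank (placeholder size)
        double *cells = new double[ncells]();
        update_block(cells, ncells);      // threads work inside the rank

        // Ghost-cell / halo exchange between ranks would go here (MPI_Sendrecv, ...)
        std::printf("rank %d of %d updated %d cells\n", rank, nranks, ncells);

        delete[] cells;
        MPI_Finalize();
        return 0;
    }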
Performance modeling
• Hybrid version (trivial modelling):
  $T_{\mathrm{HYBRID}} = T_{\mathrm{MPI}} \, e_{\mathrm{MPI},N_{\mathrm{TOT}}} / (e_{\mathrm{OMP},N_{\mathrm{cores}}} \, e_{\mathrm{MPI},N_{\mathrm{nodes}}})$
• GPU version:
  $T_{\mathrm{TOT}} = T_{\mathrm{CPU}} + T_{\mathrm{CPU\text{-}GPU}} + T_{\mathrm{GPU\text{-}GPU}} + T_{\mathrm{GPU}}$
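As an illustration of how the two models can be evaluated, the small host-side snippet below plugs timings and parallel efficiencies into the expressions above; the numeric inputs are made-up placeholders, not WP8 measurements.

    #include <cstdio>

    // Hybrid model: T_HYBRID = T_MPI * e_MPI(N_TOT) / (e_OMP(N_cores) * e_MPI(N_nodes))
    double t_hybrid(double t_mpi, double e_mpi_ntot,
                    double e_omp_ncores, double e_mpi_nnodes)
    {
        return t_mpi * e_mpi_ntot / (e_omp_ncores * e_mpi_nnodes);
    }

    // GPU model: T_TOT = T_CPU + T_CPU-GPU + T_GPU-GPU + T_GPU
    double t_gpu_total(double t_cpu, double t_cpu_gpu,
                       double t_gpu_gpu, double t_gpu)
    {
        return t_cpu + t_cpu_gpu + t_gpu_gpu + t_gpu;
    }

    int main()
    {
        // Placeholder inputs: pure-MPI time and parallel efficiencies
        double est_hybrid = t_hybrid(100.0, 0.85, 0.90, 0.95);
        // Placeholder inputs: residual CPU work, transfers, GPU compute
        double est_gpu = t_gpu_total(10.0, 9.2, 0.5, 23.2);
        std::printf("hybrid estimate: %.2f s, GPU estimate: %.2f s\n",
                    est_hybrid, est_gpu);
        return 0;
    }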
Performance model example
Results: Hybrid code (OpenMP+MPI)
GPU implementation – approach 1
Step 2: solve the Hydro equations for cell i,j,k
Step 3: compose the results array (new Hydro variables)
Step 4: copy the results back to the CPU
(A minimal sketch of this pattern follows below.)
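The sketch below shows this monolithic pattern in CUDA: the whole input state is copied to the device, one thread updates one cell, and the full result array is copied back. Array names, sizes and the kernel body are placeholders, not the actual RAMSES port.

    #include <cuda_runtime.h>

    // Stand-in for the per-cell hydro update (approach 1: one thread per cell)
    __global__ void hydro_update(const double *u_in, double *u_out, int ncells)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < ncells)
            u_out[i] = u_in[i];   // placeholder for flux computation / Riemann solve
    }

    void solve_hydro_on_gpu(const double *h_in, double *h_out, int ncells)
    {
        double *d_in, *d_out;
        size_t bytes = (size_t)ncells * sizeof(double);
        cudaMalloc(&d_in,  bytes);
        cudaMalloc(&d_out, bytes);

        // Copy the whole input state to the GPU in one shot
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        // Steps 2-3: solve the hydro equations for every cell, results in d_out
        int threads = 256, blocks = (ncells + threads - 1) / threads;
        hydro_update<<<blocks, threads>>>(d_in, d_out, ncells);

        // Step 4: copy the new hydro variables back to the CPU
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }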
Results
Sedov blast wave test (hydro only, unigrid); times in seconds:

ACC          NVECTOR   Ttot    Tacc    Ttransf   Eff. speed-up
OFF (1 PE)   10        94.54   0       0         –
ON           512       55.83   38.22   9.2       2.012820513
ON           1024      45.66   29.27   9.2       2.669969252
ON           2048      42.08   25.36   9.2       3.068611987
ON           4096      41.32   23.2    9.2       3.293965517
ON           8192      41.19   23.15   9.2       3.304535637

20 GB transferred in/out (constant overhead).
Performance pitfalls
• Amount of transferred data
  – Overhead increasing linearly with data size
• Data structure, irregular data distribution
  – PREVENTS any asynchronous operation: NO overlap of computation and data transfer
  – Ineffective memory access
  – Prevents coalesced memory access
• Low flops-per-byte ratio
  – this is intrinsic to the algorithm…
• Asynchronous operations not permitted
  – see above…
GPU implementation – approach 2
• Data is moved to and from the GPU in chunks
• Data transfer and computation can be overlapped
Step 1: compose data chunks (Hydro variables, gravitational forces, other quantities) in CPU memory; data chunks are the basic building blocks of the RAMSES AMR hierarchy: OCTs and their refinements
Step 2: copy multiple data chunks to the GPU
Step 3: solve the Hydro equations for chunks N, M…
Step 4: compose the results array (new Hydro variables) and copy it back to the CPU
(A sketch of this overlapped pattern follows below.)
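The sketch below illustrates the chunked, overlapped pattern with CUDA streams: a small pool of streams cycles over the chunks so that the copy of one chunk can overlap with the computation on another. Chunk size, names and the kernel body are placeholders, and the host buffers are assumed to be pinned; the real implementation works on RAMSES octs.

    #include <cuda_runtime.h>

    __global__ void hydro_update_chunk(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];          // placeholder for the per-cell update
    }

    // Process 'nchunks' chunks of 'chunk' cells each, overlapping H2D copies,
    // kernel execution and D2H copies across a small pool of streams.
    void solve_hydro_chunked(const double *h_in, double *h_out,
                             int nchunks, int chunk)
    {
        const int nstreams = 4;
        cudaStream_t s[nstreams];
        double *d_in[nstreams], *d_out[nstreams];
        size_t bytes = (size_t)chunk * sizeof(double);
        for (int k = 0; k < nstreams; ++k) {
            cudaStreamCreate(&s[k]);
            cudaMalloc(&d_in[k],  bytes);
            cudaMalloc(&d_out[k], bytes);
        }

        int threads = 256, blocks = (chunk + threads - 1) / threads;
        for (int c = 0; c < nchunks; ++c) {
            int k = c % nstreams;                   // round-robin over streams
            const double *src = h_in  + (size_t)c * chunk;
            double       *dst = h_out + (size_t)c * chunk;
            // Async copies need pinned host memory (cudaHostAlloc / cudaHostRegister)
            cudaMemcpyAsync(d_in[k], src, bytes, cudaMemcpyHostToDevice, s[k]);
            hydro_update_chunk<<<blocks, threads, 0, s[k]>>>(d_in[k], d_out[k], chunk);
            cudaMemcpyAsync(dst, d_out[k], bytes, cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                    // wait for all chunks
        for (int k = 0; k < nstreams; ++k) {
            cudaStreamDestroy(s[k]);
            cudaFree(d_in[k]);
            cudaFree(d_out[k]);
        }
    }

Reusing the same device buffers across chunks is safe here because operations issued to one stream execute in order: a later chunk in the same stream cannot overwrite a buffer before the previous chunk's copy back has completed.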
Advantages over previous implementation
• Data is regularly distributed in each chunk and its
access is efficient. Improved flop per byte ratio
• Effective usage of the GPU computing architecture
• Data re-organization is performed on the CPU and its
overhead hidden by asynchronous processes
• Data transfer overhead almost completely hidden
• AMR naturally supported
• DRAWBACKS: much more complex implementation
Conclusions
• PRACE is providing European scientists with top-level HPC services
• PRACE-2IP WP8 successfully introduced a methodology for code development relying on a close synergy between scientists, community code developers and HPC experts
• Many community codes were re-designed and re-implemented to exploit novel HPC architectures (see http://prace2ip-wp8.hpcforge.org/ for details)
• Most of the WP8 results are already available to the community
• WP8 is going to be extended for one more year (there is no similar activity in PRACE-3IP)