Next decade in supercomputing
José M. Cela
Director, CASE Department
BSC-CNS
josem.cela@bsc.es
Talk outline
● Supercomputing from the past…
● Architecture evolution
● Applications and algorithms
● …Supercomputing for the future
● Technology trends
● Multidisciplinary top-down approach
● BSC-CNS activities
● Conclusions
Once upon a time… ENIAC, 1946
● ENIAC, 1946, Moore School
● 18,000 vacuum tubes, 70,000 resistors and 5 million soldered connections
● Power consumption = 140 kW
● Dimensions = 8 x 3 x 100 feet
● Weight > 30 tons
● Computing capacity = 5,000 additions and 360 multiplications per second
Technological Achievements
● Transistor (Bell Labs, 1947)
● DEC PDP-1 (1957)
● IBM 7090 (1960)
● Integrated circuit (1958)
● IBM System 360 (1965)
● DEC PDP-8 (1965)
● Microprocessor (1971)
● Intel 4004
● 2,300 transistors
● Could access 300 bytes of memory
Technology Trends: Microprocessor Capacity
● Moore's Law: 2X transistors/chip every 1.5 years
● Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
● Microprocessors have become smaller, denser, and more powerful.
● Not just processors: bandwidth, storage, etc. scale as well.
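Stated as a formula (my restatement of the trend above, not something given on the slide), the transistor count N grows from a reference count N_0 as:

```latex
N(t) \approx N_0 \cdot 2^{(t - t_0)/1.5}
```

with t − t_0 in years and the 1.5-year doubling period quoted here.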
Pipeline (H. Ford)
DRAM access bottleneck
● Not everything is scaling up fast
● DRAM access speed has hardly improved
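A minimal C sketch (illustrative only, not from the talk; the function name is made up) of why this matters: the same arithmetic runs far slower when every access misses the caches and pays DRAM latency.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M doubles: much larger than any on-chip cache */

/* Sum all elements, visiting them with a given stride (same total work). */
static double sum_stride(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    size_t strides[] = {1, 1024};   /* sequential vs. cache-hostile */
    for (int t = 0; t < 2; t++) {
        clock_t t0 = clock();
        double s = sum_stride(a, N, strides[t]);
        printf("stride %4zu: sum=%.0f  time=%.3f s\n",
               strides[t], s, (double)(clock() - t0) / CLOCKS_PER_SEC);
    }
    free(a);
    return 0;
}
```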
Latencies and Pipelines
[Figure: processor pipeline (fetch, decode/map, queue, register read, execute, dcache/store, retire) fed by the register file, instruction cache, L2 cache and main memory. Approximate latencies: registers and L1 caches 1-3 cycles, L2 cache 10-20 cycles, main memory 100-1000 cycles.]
Processor on a chip
Hybrid SMP-cluster parallel systems
● Most modern high-performance computing systems are clusters of SMP nodes (performance/cost trade-off)
[Figure: several SMP nodes, each with a few processors sharing a local memory, connected through an interconnection network.]
● MPI parallel level
● Threads (OpenMP) parallel level (both levels are sketched below)
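A minimal hybrid MPI + OpenMP sketch in C (an illustrative pattern, not code from the talk): MPI provides the level across SMP nodes, OpenMP threads provide the level inside each node.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 1000000;          /* local work per MPI rank */
    double local = 0.0;

    /* Intra-node level: OpenMP threads share this rank's portion */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; i++)
        local += 1.0 / (double)(rank * n + i + 1);

    /* Inter-node level: MPI combines the per-rank partial sums */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, sum = %f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```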
TOP500
Technology Outlook (Shekhar Borkar, Micro37)
High-volume manufacturing year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
● Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
● Integration capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
● Delay = CV/I scaling: 0.7, ~0.7, >0.7; delay scaling will slow down
● Energy/logic op scaling: >0.35, >0.5, >0.5; energy scaling will slow down
● Bulk planar CMOS: high probability, moving to low probability
● Alternate, 3G etc: low probability, moving to high probability
● Variability: medium, then high, then very high
● ILD (K): ~3, <3, reducing slowly towards 2-2.5
● RC delay: 1 (roughly constant across generations)
● Metal layers: 6-7, 7-8, 8-9; 0.5 to 1 layer added per generation
Increasing CPU performance: a delicate balancing act
● Packing more gates into a tight knot while decreasing the cycle time of the processor
● Increase clock rate & transistor density; lower voltage
[Figure: evolution from a single core with its cache to dual-core, quad-core and many-core chips sharing caches.]
● We have seen increasing numbers of gates on a chip and increasing clock speeds.
● Heat is becoming an unmanageable problem; Intel processors > 100 Watts.
● We will not see dramatic increases in clock speed in the future.
● However, the number of gates on a chip will continue to increase.
Moore’s law
Multicore chips
ORNL Computing Power and Cooling 2006 - 2011
● Immediate need to add 8 MW to prepare for 2007 installs of new systems
● NLCF petascale system could require an additional 10 MW by 2008
● Need a total of 40-50 MW for projected systems by 2011
● Numbers are just for the computers: add 75% for cooling
● Cooling will require 12,000 - 15,000 tons of chiller capacity
[Figure: Computer Center Power Projections, 2005-2011, power (MW) split between computers and cooling; projected annual electricity cost grows from about $3M to $31M. Cost estimates based on $0.05 per kWh.]

Annual Average Electrical Power Rates ($/MWh)
Site: LBNL, ANL, ORNL, PNNL
FY 2005: 43.70, 44.92, 46.34, 49.82
FY 2006 - FY 2010: 50.23, 53.43, 57.51, 58.20, 56.40*, 53.01, 51.33, N/A
Data taken from Energy Management System-4 (EMS4). EMS4 is the DOE corporate system for collecting energy information from the sites. EMS4 is a web-based system that collects energy consumption and cost information for all energy sources used at each DOE site. Information is entered into EMS4 by the site and reviewed at Headquarters for accuracy.
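As a rough check of the cost labels on the projection chart (my arithmetic, assuming continuous operation at the quoted $0.05 per kWh):

```latex
40\,\mathrm{MW}\times 8760\,\mathrm{h/yr}\times 0.05\,\$/\mathrm{kWh}\approx \$17.5\mathrm{M/yr},
\qquad
70\,\mathrm{MW}\times 8760\,\mathrm{h/yr}\times 0.05\,\$/\mathrm{kWh}\approx \$30.7\mathrm{M/yr}
```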
View from the Computer Room
How to reduce energy but not performance?
● Reduce the amount of DRAM memory per core and
redesign everything for energy saving
● Blue Gene Solution
● Eliminate the cache coherency in a multicore chip and use
accelerators instead of general purpose cores
● Cell/B.E. solution
● GPU solution
● FPGA solution
Blue Gene/P
● Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
● Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM, supports 4-way SMP
● Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 (or 4.0) GB DDR
● Node Card: 32 chips (4x4x2), 32 compute and 0-1 I/O cards, 435 GF/s, 64 GB
● Rack: 32 node cards, 1024 chips, 4096 procs, 14 TF/s, 2 TB
● System: 72 racks, cabled 8x8x16; final system: 1 PF/s, 144 TB; November 2007: 0.596 PF/s
● Front End Node / Service Node: JS21 / Power5, Linux SLES10
● HPC SW: compilers, GPFS, ESSL, LoadLeveler
Cell Broadband Engine architecture
● 235 Mtransistors
● 235 mm²
Cell Broadband Engine Architecture™ (CBEA)
Technology Competitive Roadmap (2006-2010)
● Cell BE (1+8), 90 nm SOI
● Cell BE (1+8), 65 nm SOI (cost reduction)
● Advanced Cell BE (1+8 eDP SPE), 65 nm SOI (performance enhancements / scaling)
● Next Gen (2 PPE' + 32 SPE'), 45 nm SOI, ~1 TFlop (est.)
● All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.
First PetaFlop computer (Nov 2008): Roadrunner at LANL
● ~7,000 dual-core Opterons: ~50 TeraFlop/s (total)
● ~13,000 eDP Cell chips: 1.4 PetaFlop/s (Cell)
● "Connected Unit" cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
● CU clusters connected by a 2nd-generation InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
How are we going to program it?
● The MPI layer will continue
● Hybrid codes will be mandatory, if only for load balancing
● OpenMP on homogeneous processors
● But with heterogeneous processors:
● OpenCL
● CUDA
● …
● SIMD code should be provided by the compiler (see the sketch below)
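A small C sketch of the thread-plus-SIMD combination (an illustrative pattern, not code from the talk): the outer loop is split across OpenMP threads and the inner loop is written so the compiler can emit SIMD code for it. Note that the explicit `#pragma omp simd` belongs to a later OpenMP revision than was current at the time of this talk; with older compilers the inner loop would simply be auto-vectorized.

```c
#include <stddef.h>

/* y = a*x + y over an n x m grid: threads across rows, SIMD within each row.
   Compile with OpenMP enabled (e.g. -fopenmp). */
void saxpy_grid(size_t n, size_t m, float a,
                const float *restrict x, float *restrict y) {
    #pragma omp parallel for              /* thread-level parallelism */
    for (size_t i = 0; i < n; i++) {
        const float *xi = &x[i * m];
        float       *yi = &y[i * m];
        #pragma omp simd                  /* ask the compiler for SIMD code */
        for (size_t j = 0; j < m; j++)
            yi[j] = a * xi[j] + yi[j];
    }
}
```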
● BSC-CNS activities
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
● Mission
● Investigate, develop and manage technology to
facilitate the advancement of science.
● Objectives
● Operate national supercomputing facility
● R&D in Supercomputing
● Collaborate in R&D e-Science
● Public Consortium
● The Spanish Government (MEC): 51%
● The Catalan Government (DURSI): 37%
● The Technical University of Catalonia (UPC): 12%
Location
Blades, blade center and racks
JS21 Processor Blade
• 2x2 PowerPC 970MP, 2.3 GHz
• 8 GB memory
• 36 GB SAS HD
• 2x1 Gb Ethernet on board
• Myrinet daughter card
Blade Center
• 14 blades per chassis (7U)
• 56 processors
• 112 GB memory
• Gigabit Ethernet switch
Rack (42U)
• 6 chassis in a rack
• 336 processors
• 672 GB memory
Network: Myrinet
[Figure: fat-tree built from 2 Spine 1280 switches and 10 Clos 256x256 switches; 128 links up to the spines, 256 links per Clos switch (1 to each node, nodes 0-255), 250 MB/s in each direction.]
MareNostrum
● Blade centers: 2560 JS21 blades
- 2 PowerPC 970MP, 2.3 GHz
- 8 GB memory (20 TB total)
- 36 GB SAS HD
- Myrinet daughter card
- 2x1 Gb Ethernet on board
● Myrinet
- 10 Clos 256+256 switches
- 2 Spine 1280 switches
● 20 storage nodes
- 2 P615, 2 Power4+, 4 GB
- 28 SATA disks of 512 GB (280 TB total)
● Gigabit switch, operations rack, 10/100 switches
● Performance summary
- 4 instructions per cycle, 2.3 GHz
- 10,240 processors
- 94.21 TFlops
- 20 TB memory, 300 TB disk
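The peak figure follows from the per-processor rate (my arithmetic from the numbers above, assuming "4 instructions per cycle" refers to 4 floating-point operations per cycle per processor):

```latex
10{,}240 \times 4\,\mathrm{flop/cycle} \times 2.3\,\mathrm{GHz} \approx 94.2\,\mathrm{TFlop/s}
```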
Additional Systems
● Tape facility
● 6 Petabytes
● LTO4 Technology
● HSM and Backup
● Shared memory system (ALTIX)
● 128 cores Montecito
● 2.5 TByte Main Memory
Spanish Supercomputing Network
● Magerit
● CaesarAugusta
● Altamira
● LaPalma
● Picasso
● Tirant
RES services
● Red Española de Supercomputación (RES) supercomputers can be accessed free of charge by any public Spanish research group. MareNostrum is the main RES node.
● The web application form and instructions can be found on the web page: www.bsc.es (support&services / RES)
● An external committee evaluates the proposals
● Access is reviewed every 4 months
● For any question contact BSC operations director
● Sergi Girona (sergi.girona@bsc.es)
Top500: who is who?
[Chart: Top500 share by country, covering Australia, Belgium, Brazil, Bulgaria, Canada, China, Denmark, Finland, France, Germany, India, Ireland, Israel, Italy, Japan, South Korea, Malaysia, Mexico, the Netherlands, New Zealand, Norway, Poland, Russia, Slovenia, South Africa, Spain, Sweden, Switzerland, Taiwan, the United Kingdom and the United States.]
Can Europe compete?
[Charts: Top500 share of the EU, Japan and China compared with the USA, and the breakdown of the EU share among the United Kingdom, France, Germany, Italy, Sweden, Poland, Spain, Denmark, the Netherlands, Belgium, Bulgaria, Finland, Ireland and Slovenia.]
ESFRI: European Infrastructure Roadmap
● The high-end (capability) resources should be implemented
every 2-3 years in a “renewal spiral” process
● Tier0 Centre total cost over a 5 year period shall be in the
range of 200-400 M€
● With supporting actions in the national/regional centers to
maintain the transfer of knowledge and feed projects to the
top capability layer
[Figure: pyramid of tier-0, tier-1 and tier-2 centres.]
PRACE
[Figure: map of PRACE Principal Partners, General Partners and Associated Partners; the tier-1 ecosystem, e.g. GENCI.]
BSC-IBM MareIncognito project
● Our 10 Petaflop research project for BSC (2011)
● Port/develop applications to reduce time-to-production once installed
● Programming models
● Tools for application development and to support previous evaluations
● Evaluate node architecture
● Evaluate interconnect options
[Figure: project work packages: application development and tuning; performance analysis and prediction tools; fine-grain programming models; load balancing; model and prototype of the interconnect; processor and node.]
BSC Departments
• Computational Mechanics
• Applied Computer Science
• Optimization
What are the CASE objectives?
● Identify scientific communities with supercomputing needs and help
them to develop software
● Material Science (SIESTA)
● Fusion (EUTERPE, EIRENE, BIT1)
● Spectroscopy (OCTOPUS, ALYA)
● Atmospheric modeling (ALYA, WRF)
● Geophysics (BSIT, ALYA)
● Develop our own technology in Computational Mechanics
● ALYA, BSIT, …
● Perform technology transfer with companies
● REPSOL, AIRBUS, …
Who needs 10 Petaflops?
Airbus A380 Design
Seismic Imaging: RTM (REPSOL)
[Figures: seismic images obtained with SPM and with RTM.]
RTM Performance on Cell

Platform   Gflops   Power (W)   Gflops/W
JS21       8.3      267         0.03
QS20       108.2    315         0.34
QS21       116.6    370         0.32

[Figure: measured vs. ideal speed-up for 1 to 8 SPUs; 22.1 GB/s of memory bandwidth used.]
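RTM is at heart a time-stepped finite-difference wave propagation. Below is a minimal C sketch of the kind of stencil update at its core (an illustrative 2nd-order example; the actual BSC/REPSOL kernel, its order and its data layout are not given in the slides). Its low arithmetic intensity is why the memory-bandwidth figure above is the number that matters.

```c
#include <stddef.h>

/* One time step of the 2D acoustic wave equation, 2nd order in space/time:
   p_next = 2*p - p_prev + (v*dt/h)^2 * laplacian(p).
   Each point touches 5 neighbours for only ~8 flops: strongly bandwidth-bound. */
void rtm_step(size_t nx, size_t nz,
              const float *restrict vel2,    /* precomputed (v*dt/h)^2 */
              const float *restrict p_prev,
              const float *restrict p,
              float *restrict p_next) {
    for (size_t i = 1; i < nx - 1; i++) {
        for (size_t k = 1; k < nz - 1; k++) {
            size_t c = i * nz + k;
            float lap = p[c - nz] + p[c + nz] + p[c - 1] + p[c + 1] - 4.0f * p[c];
            p_next[c] = 2.0f * p[c] - p_prev[c] + vel2[c] * lap;
        }
    }
}
```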
ALYA: Computational Mechanics and Design
● In-house development
● Parallel
● Coupled multiphysics
● Fluid dynamics
● Structure dynamics
● Heat transfer
● Wave propagation
● Excitable media
● …
Alya: Multiphysics Code
● Kernel services: mesh, coupling, solvers, input/output
● Dodeme: domain decomposition parallelization
● Solmum: MUMPS sparse direct solver
● Optima: optimization
● Physics modules:
• Solidz: structure dynamics, $\rho\,\partial_t v = \nabla\cdot\boldsymbol{\sigma} + \rho b$
• Nastin: incompressible Navier-Stokes, $\partial_t u + \nabla\cdot(u\otimes u - \boldsymbol{\sigma}) = g$, $\nabla\cdot u = 0$
• Turbul: turbulence models ($k$-$\varepsilon$, $k$-$\omega$, Spalart-Allmaras)
• Temper: heat transfer, $\rho c_p\,\partial_t T + \rho c_p\,u\cdot\nabla T - \nabla\cdot(k\nabla T) = 2\,\boldsymbol{\varepsilon}(u):\boldsymbol{\varepsilon}(u)$
• Nastal: compressible Navier-Stokes, $\partial_t\rho + \nabla\cdot(\rho u) = 0$, $\partial_t(\rho u) + \nabla\cdot(\rho\,u\otimes u - \boldsymbol{\sigma}) = \rho g$, $\partial_t(\rho E) + \nabla\cdot(u\,\rho E - k\nabla T - \boldsymbol{\sigma} u) = \rho\,u\cdot g$
• Exmedi: excitable media
• Apelme: fracture mechanics
• Gotita: droplet impingement (icing)
• Wavequ: wave propagation, $\tfrac{1}{c^2}\,\partial_t^2 u - \nabla\cdot(\nabla u) = f$
ALYA keywords
● Multi-physics modular code for High Performance Computational Mechanics
● Numerical solution of PDEs
● Variational methods are preferred (FEM)
● Coupling between multi-physics (loose or strong)
● Explicit and implicit formulations
● Hybrid meshes, non-conforming meshes
● Advanced meshing issues
● Parallelization by MPI + OpenMP
● Automatic mesh partition using Metis (see the sketch after this list)
● Portability is a must (compiled on Windows, Linux, MacOS)
● Porting to new architectures: Cell, …
● Scalability tested on:
• IBM JS21 blades on MareNostrum: BSC, 10,000 CPUs
• IBM Blue Gene/P & /L: IBM Labs Montpellier and Watson, 4,000 CPUs
• SGI Altix shared memory: BSC, Barcelona, 128 CPUs
• PC clusters, 10-80 CPUs
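A minimal C sketch of the mesh-partitioning step (an illustrative use of the public METIS 5 graph API on the dual graph of the mesh; the function name is made up, and Alya's own wrapper and options are not shown in the slides):

```c
#include <metis.h>
#include <stdio.h>

/* Partition the dual graph of a mesh (elements = vertices, shared faces =
   edges) into nparts subdomains, one per MPI rank.
   xadj/adjncy are the usual CSR arrays describing the graph. */
int partition_mesh_graph(idx_t nelem, idx_t *xadj, idx_t *adjncy,
                         idx_t nparts, idx_t *part) {
    idx_t ncon = 1;      /* one balance constraint: element count */
    idx_t objval;        /* edge-cut returned by METIS */
    int status = METIS_PartGraphKway(&nelem, &ncon, xadj, adjncy,
                                     NULL, NULL, NULL,    /* no weights */
                                     &nparts, NULL, NULL, NULL,
                                     &objval, part);
    if (status == METIS_OK)
        printf("partitioned %ld elements into %ld parts, edge-cut %ld\n",
               (long)nelem, (long)nparts, (long)objval);
    return status;
}
```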
Alya speed-up
● MareNostrum, IBM JS21 blades; boundary layer flow, 25M hexahedra
● NASTAL module: explicit compressible flow, fractional step
● NASTIN module: implicit incompressible flow, fractional step
[Figure: speed-up curves for both modules.]
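For reference, the speed-up and efficiency figures quoted here and in the later slides follow the standard definitions (not restated in the original slides):

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}
```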
CASE R&D: Aero-Acoustics
● High Speed train
CASE R&D: Automotive
● Ahmed body benchmark
● Wind speed 120 km/h
CASE R&D: Building & Energy
● Benchmark cavity
● MareNostrum Cooling
CASE R&D: Aerospace
● Icing Simulation
● Subsonic / Transonic / Supersonic flows
● Adjoint methods in Shape Optimization
CASE R&D: Aerospace
● Subsonic cavity flow (0.82 Mach)
CASE R&D: Free surface problems
● Level set method
CASE R&D: Mesh generation
● Meshing boundary layer
CASE R&D: Mesh adaptivity
● Meshing
CASE R&D: Atmospheric Flows
● San Antonio Quarter (Barcelona)
CASE R&D: Meteo Mesh
● Surface from topography
● Semi-structured in volume
CASE R&D: Biomechanics
● Cardiac Simulator
● By-pass flow
● Arterial brain system
CASE R&D: Biomechanics
● Nose air flow
Scalability problems:
The deflated PCG
The deflated PCG
• The mesh partitioner slices arteries, so most subdomains have two neighbours
• But there are fat meeting points of arteries, which produce subdomains with many more neighbours
Parallel footprint (512 processors)

              Efficiency   Load balance
Overall       0.67         0.92
GMRES         0.74         0.92
Deflated CG   0.43         0.83

[Figure: execution trace showing the momentum solver (GMRES) and pressure solver (deflated CG) phases, 120 ms and 6.6 ms, with Sendrecv exchanges and All_reduce operations.]
● 170 μs: very fine grain
● 8-byte reductions: support for fast reductions would be useful
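The 8-byte All_reduce operations come from the global dot products inside the Krylov solvers. A minimal C sketch of that pattern (illustrative, not Alya code) shows why every iteration forces a latency-bound global reduction:

```c
#include <mpi.h>
#include <stddef.h>

/* Global dot product of two distributed vectors: each rank owns n_local
   entries. Every CG/GMRES iteration needs a few of these, and each one
   is a single 8-byte MPI_Allreduce, i.e. pure latency at large scale. */
double dot_global(const double *x, const double *y, size_t n_local,
                  MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    for (size_t i = 0; i < n_local; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```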
Solver continuity: Deflated CG
[Figure: communication pattern of one deflated CG iteration: Sendrecv exchanges with neighbours, one All_reduce of 500x8 bytes for the coarse (deflated) system and two 8-byte All_reduce operations; subdomains with lots of neighbours are highlighted.]
● The conclusions
The accelerator era
[Figure: performance "Wedge of Opportunity" spanning multi-core, multi-threading, Cell, GPUs, FPGAs and vector units.]
Near Future Supercomputing Trends
● Performance will be provided by
● Multi-core
● Without cache coherency
● With Accelerators (top-down approach)
● Programming is going to undergo a revolution
● OpenCL
● CUDA
● …
● Compilers should provide the SIMD parallelism level
Thank you !