HPC as a Driver for Computing
Technology and Education
Tarek El-Ghazawi
The George Washington University
Washington D.C., USA
NOW (July 2015): The TOP 10 Systems

| Rank | Site | Computer | Cores | Rmax [PFlops] | % of Peak | Power [MW] | MFlops/Watt |
|---|---|---|---|---|---|---|---|
| 1 | National Super Computer Center in Guangzhou, China | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | 3,120,000 | 33.9 | 62 | 17.8 | 1905 |
| 2 | DOE/OS, Oak Ridge Nat Lab, USA | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | 560,640 | 17.6 | 65 | 8.3 | 2120 |
| 3 | DOE/NNSA, Lawrence Livermore Nat Lab, USA | Sequoia, BlueGene/Q (16c) + Custom | 1,572,864 | 17.2 | 85 | 7.9 | 2063 |
| 4 | RIKEN Advanced Inst for Comp Sci, Japan | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | 705,024 | 10.5 | 93 | 12.7 | 827 |
| 5 | DOE/OS, Argonne Nat Lab, USA | Mira, BlueGene/Q (16c) + Custom | 786,432 | 8.16 | 85 | 3.95 | 2066 |
| 6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | 115,984 | 6.27 | 81 | 2.3 | 2726 |
| 7 | KAUST, Saudi Arabia | Shaheen II, Cray XC40, Xeon 16C + Custom | 196,608 | 5.54 | 77 | 4.5 | 1146 |
| 8 | TACC, USA | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | 204,900 | 5.17 | 61 | 4.5 | 1489 |
| 9 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | 458,752 | 5.01 | 85 | 2.30 | 2178 |
| 10 | DOE/NNSA, LLNL, USA | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | 393,216 | 4.29 | 85 | 1.97 | 2177 |
| 500 (422) | Software Comp., USA | HP Cluster | 18,896 | 0.309 | 48 | - | - |
HPC is a Top National Priority!
Executive Order from the White House: establishment of the National Strategic Computing Initiative (NSCI), 29 July 2015
National Strategic Computing Initiative
Five strategic themes of the NSCI:
1) Create systems that can apply exaflops of computing
power to exabytes of data
2) Keep the United States at the forefront of HPC
capabilities
3) Improve HPC application developer productivity
4) Make HPC readily available
5) Establish hardware technology for future HPC
systems
Future Investments - International Exascale HPC Programs

| Country | Funding | Year(s) | Remarks |
|---|---|---|---|
| European Union | €700M | 2014-20 | Private-public partnership commitment through the European Technology Platform for HPC (ETP4HPC); €143.4M in 2014-15 |
| European Union | €74M | 2011- | 6 dedicated FP7 exascale projects |
| India | $2B | 2014-20 | Led by IISc (Indian Institute of Science) and ISRO (Indian Space Research Organization); targeting a 132 ExaFLOP/s machine |
| India | $750M | 2014-19 | C-DAC (Center for Development of Advanced Computing) to set up 70 supercomputers over 5 years |
| Japan | $1.38B | 2013-20 | Post-K computer to be installed at RIKEN; tentatively based on the extreme-SIMD chip "PACS-G" |
| China | - | - | Due to the U.S. DoC ban, will use Chinese parts to upgrade the current #1 system |
Why is HPC Important?
 Critical for economic competitiveness (highlighted by Minister Daoudi) because of its wide range of applications (through simulations and intensive data analyses)
 Drives computer hardware and software innovations for future conventional computing
 Is becoming ubiquitous, i.e., all computing/information technology is turning parallel!
Is that why it is turning into an international HPC muscle-flexing contest?
Why is HPC Important?
(1) Competitiveness
[Diagram: the traditional Design → Build → Test cycle vs. the HPC-enabled Design → Model → Simulate → Build cycle]
Why is HPC Important?
(1) Competitiveness: HPC Application Examples
 Molecular dynamics (HIV-1 protease with an inhibitor drug), 2 ns simulation: 2 weeks on a desktop vs. 6 hours on a supercomputer
 Gene sequence alignment, phylogenetic analysis: 32 days on a desktop vs. 1.5 hours on a supercomputer
 Car crash simulations, 2-million-element simulation: 4 days on a desktop vs. 25 minutes on a supercomputer
 Understanding the fundamental structure of matter: requires a billion billion (10^18) calculations per second
Why is HPC Important?
(2) HPC of Today is Conventional Computing for Tomorrow
 The ASCI Red supercomputer: 9,000 chips for 3 TeraFLOPS in 1997
 The Intel 80-core chip: 1 chip for 1 TeraFLOPS in 2007
Why is HPC Important?
(3) HPC Concepts are Becoming Ubiquitous
 Samsung Galaxy S6: 8 cores
 Sony PS3: uses the Cell processor
 Roadrunner (the fastest supercomputer in 2008): uses Cell processors
 Tile64: a 64-CPU chip that could be in your future laptop
HPC is ubiquitous! All computing is becoming HPC; can we become bystanders?
How Did We Get Here - Supercomputers in Recent History

| Computer | Processor | # Pr. | Year | Rmax (TFlops) |
|---|---|---|---|---|
| Tianhe-2 (MilkyWay-2) | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 2013-now | 33,862 |
| Titan | Cray XK7, Opteron 16-core 2.2 GHz, Nvidia K20X | 560,640 | 2012 | 17,600 |
| K-Computer, Japan | SPARC64 VIIIfx 2.0 GHz | 705,024 | 2011 | 10,510 |
| Tianhe-1A, China | Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 GFlops) + NVIDIA GPU, FT-1000 8C | 186,368 | 2010 | 2,566 |
| Jaguar, Cray | Cray XT5-HE, Opteron six-core 2.6 GHz | 224,162 | 2009 | 1,759 |
| Roadrunner, IBM | PowerXCell 8i 3200 MHz (12.8 GFlops) | 122,400 | 2008 | 1,026 |
| BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 212,992 | 2007 | 478 |
| BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 131,072 | 2005 | 280 |
| BlueGene/L beta-System, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 32,768 | 2004 | 70.7 |
| Earth-Simulator / NEC | NEC 1000 MHz (8 GFlops) | 5,120 | 2002 | 35.8 |
| IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8,192 | 2001 | 7.2 |
| IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8,192 | 2000 | 4.9 |
| Intel ASCI Red | Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops) | 9,632 | 1999 | 2.4 |
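The growth rate implied by this table is easy to check. A minimal Python sketch, using only the Rmax endpoints from the table above:

```python
# Growth implied by the table: Intel ASCI Red (1999) to Tianhe-2 (2013).
rmax_1999_tf = 2.4        # Rmax of ASCI Red, in TFlops
rmax_2013_tf = 33_862.0   # Rmax of Tianhe-2, in TFlops
years = 2013 - 1999

factor = rmax_2013_tf / rmax_1999_tf
annual = factor ** (1.0 / years)
print(f"Total growth: ~{factor:,.0f}x over {years} years")
print(f"Annual rate:  ~{annual:.2f}x per year")
```

That is, the fastest machine improved by about four orders of magnitude in 14 years, roughly doubling every year.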
How Did We Get Here - Supercomputers in Recent History
See: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer
How Did We Get Here - Supercomputers in Recent History
[Chart: performance over time. Vector machines gave way to massively parallel processors (TeraFLOPS; 1993, HPCC), then to MPPs with multicores and heterogeneous accelerators (PetaFLOPS; discrete ~2008, integrated ~2011). The end of Moore's Law in clocking marks the transition.]
NOW (July 2015): The TOP 10 Systems
(This slide repeats the TOP 10 table shown earlier; see above.)
How to Make Progress
 Launch a competitive funding cycle or a large national project
 Pose a system challenge
 ~33.8 PFLOPS at 17.8 MW is about 2 GFLOPS/Watt; reaching exascale (10^18 FLOPS) at the same total power would require roughly 56 GFLOPS/Watt, about a 30x gain in energy efficiency (see the sketch after this list)
 Pose application challenge(s)
 Let the community compete for government funding with innovative ideas
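The efficiency gap can be checked directly from the TOP500 numbers. A minimal Python sketch, using the Tianhe-2 figures from the table above:

```python
# Back-of-envelope exascale power arithmetic (Tianhe-2, July 2015 TOP500).
rmax_flops = 33.8e15   # ~33.8 PFLOPS sustained
power_watts = 17.8e6   # ~17.8 MW

today = rmax_flops / power_watts / 1e9   # GFLOPS per Watt achieved today
needed = 1e18 / power_watts / 1e9        # GFLOPS/Watt for exascale at the same power
print(f"Today:    ~{today:.1f} GFLOPS/Watt")    # ~1.9
print(f"Exascale: ~{needed:.1f} GFLOPS/Watt")   # ~56
print(f"Required efficiency gain: ~{needed / today:.0f}x")  # ~30x
```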
Challenges - The End of Moore's Law
The phenomenon of exponential improvement in processors was observed in 1965 by Intel co-founder Gordon Moore. Popular formulations of the "law":
 The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same (wrong, not anymore!)
 The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity (OK, for now)
 The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same (OK, for now)
No faster clocking but more Cores?
Source: Ed Davis, Intel
Accelerators and Dealing with the Moore's Law Challenge Through Parallelism

| Device | Fab. process (nm) | Freq (GHz) | # Cores | Peak SPFP (GFlops) | Peak DPFP (GFlops) | Power (W) | DP Flops/W | Memory BW (GB/s) | Memory type |
|---|---|---|---|---|---|---|---|---|---|
| PowerXCell 8i | 65 | 3.2 | 1+8 | 204 | 102.4 | 92 | 1.11 | 25.6 | XDR |
| Nvidia Kepler K40 | 28 | 0.75 | 2880 | 4290 | 1430 | 235 | 6.1 | 288 | GDDR5 |
| Intel Xeon Phi 7120P | 22 | 1.24 | 61 (244 threads) | 2417 | 1208 | 300 | 4.0 | 352 | GDDR5 |
| Intel Xeon E5-2697v2 (12C, 2.7 GHz) | 22 | 2.7 | 12 | 518.4 | 259.2 | 130 | 1.99 | 59.7 | DDR3-1866 |
| AMD Opteron 6370P (Interlagos) | 32 | 2.5 | 16 | 320 | 160 | 99 | 1.62 | 42.7 | DDR3-1333 |
| Xilinx XC7VX1140T | 28 | - | - | 801 | 241 | 43 | 5.6 | - | - |
| Xilinx XCUV440 | 20 | - | - | 1306 | 402 | 80* | 5.0* | - | - |
| Altera Stratix V GSB8 | 28 | - | - | 604 | 296 | 59 | 5.0 | - | - |

(* = estimated)
Accelerators/Heterogeneous Computing
Accelerators (FPGAs, Cell, GPUs, Xeon Phi, ...) paired with a microprocessor can yield dramatic application gains:

| Application | Speedup | Cost savings | Power savings | Size savings |
|---|---|---|---|---|
| DNA Match | 8,723x | 22x | 779x | 253x |
| DES Breaker | 38,514x | 96x | 3,439x | 1,116x |

El-Ghazawi et al., "The Promise of High-Performance Reconfigurable Computing," IEEE Computer, February 2008
A General Execution Model for Heterogeneous Computers
[Diagram: a host microprocessor (e.g., a PC) paired with an accelerator (Cell B.E., GPU, ClearSpeed, FPGA, or Intel Xeon Phi). The host sends a transfer of control and the input data to the accelerator; the accelerator returns the output data and transfers control back.]
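A minimal sketch of this offload pattern in Python, assuming a CUDA-capable GPU and the CuPy library (the variable names and the kernel itself are illustrative; any accelerator API follows the same four steps):

```python
import numpy as np
import cupy as cp  # assumes an NVIDIA GPU with CUDA and CuPy installed

# The host microprocessor prepares the input data in its own memory.
x_host = np.random.rand(1_000_000).astype(np.float32)

# Steps 1-2: transfer of control and input data to the accelerator.
x_dev = cp.asarray(x_host)          # host -> device copy

# The accelerator executes the compute-intensive kernel.
y_dev = cp.sqrt(x_dev) * 2.0        # runs on the GPU

# Steps 3-4: output data and transfer of control back to the host.
y_host = cp.asnumpy(y_dev)          # device -> host copy
```

The same control/data handshake applies whether the accelerator is a Cell, a GPU, an FPGA, or a Xeon Phi; only the API changes.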
Challenges for Accelerators
1. The application must lend itself to the 90-10 rule, and different accelerators suit different types of computations (see the sketch after this list)
2. The programmer partitions the code across the CPU and the accelerator
3. The programmer co-schedules the CPU and the accelerator, and ensures good utilization of the expensive accelerator resources
4. The programmer explicitly transfers data between the CPU and the accelerator
5. Accelerators are fast compared to the link, whose overhead can render the use of the accelerator useless or even harmful
6. Multiple programming paradigms are needed
7. A new accelerator means learning, and porting to, a new programming interface
8. Changing the ratio of CPUs to accelerators also requires substantial reprogramming, unless the accelerators are virtualized
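Challenges 1 and 5 can be quantified with Amdahl's law plus a transfer-overhead term. A small Python sketch (the 90-10 split, the 100x accelerator speedup, and the 20% overhead are illustrative assumptions, not measurements):

```python
def offload_speedup(parallel_fraction, accel_speedup, overhead_fraction=0.0):
    """Amdahl's law with an added data-transfer overhead term.

    parallel_fraction: share of runtime that can be offloaded (e.g., 0.9)
    accel_speedup:     how much faster the accelerator runs that share
    overhead_fraction: transfer cost as a fraction of the original runtime
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / accel_speedup + overhead_fraction)

# 90-10 rule: 90% of the time is spent in 10% of the code.
print(offload_speedup(0.9, 100))       # ~9.2x: the serial part dominates
print(offload_speedup(0.9, 100, 0.2))  # ~3.2x: link overhead erodes the gain
```

Even an infinitely fast accelerator cannot exceed 10x here, and a slow link cuts that further, which is exactly why the host-accelerator link matters so much.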
Challenges for Advancing or for Exascale
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
Most of these are data-movement and/or programming related.
Source: DoE ASCAC Subcommittee Report, Feb 2014
Exascale Technological Challenges
 The Power Wall: frequency scaling is no longer possible; power increases rapidly
 The Memory Wall: the gap between processor speed and memory speed is widening
 The Interconnect Wall: available bandwidth per compute operation is dropping, and the power needed for data movement is increasing
 The Programmability Wall, the Resilience Wall, ...
The Data Movement Challenge
[Plots: bandwidth density vs. system distance, and energy vs. system distance. Source: ASCAC 14]
 Locality matters a lot: cost (energy and time) rises rapidly with distance
 Locality should be exploited at short distances, and is needed even more at far distances (see the illustrative sketch below)
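To make the distance effect concrete, here is a purely illustrative Python sketch. The energy values are assumed order-of-magnitude figures of the kind reported in ASCAC 14 and similar exascale studies, not measurements:

```python
# Assumed, order-of-magnitude energy per 64-bit operand (picojoules).
ENERGY_PJ = {
    "double-precision FLOP": 10,      # the computation itself
    "on-chip move (mm scale)": 100,   # wire energy grows with distance
    "off-chip DRAM access": 1_000,
    "off-node network hop": 10_000,
}

flop_pj = ENERGY_PJ["double-precision FLOP"]
for op, pj in ENERGY_PJ.items():
    print(f"{op:26s} ~{pj:>6} pJ ({pj / flop_pj:>5.0f}x a FLOP)")
```

Under these assumptions, moving an operand across the system can cost a thousand times more energy than computing with it, which is the whole case for locality.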
Data Movement and the Hierarchical Locality Challenge
Locality is Not Flat Anymore – Chip and System
[Series of diagrams: locality hierarchies at the chip level (e.g., the Tilera Tile64) and at the extreme-scale system level (e.g., the Cray XC40).]
What Does that Mean for Programmers?
 Exploiting hierarchical locality, at the machine level and the chip level
 Hierarchical tiled data structures (see the sketch after this list)
 Hierarchical locality exploitation with the runtime system (RTS)
 MPI+X
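A minimal sketch of a hierarchically tiled traversal in Python (the sizes `OUTER` and `INNER` are placeholder assumptions standing in for, e.g., a NUMA-domain tile and a cache-level tile):

```python
import numpy as np

N, OUTER, INNER = 4096, 512, 64   # assumed problem and tile sizes

def hierarchical_tiles(n, outer, inner):
    """Yield (rows, cols) slices: large tiles for far (node/NUMA) locality,
    with small tiles nested inside them for near (cache) locality."""
    for oi in range(0, n, outer):
        for oj in range(0, n, outer):
            for ii in range(oi, min(oi + outer, n), inner):
                for jj in range(oj, min(oj + outer, n), inner):
                    yield (slice(ii, min(ii + inner, n)),
                           slice(jj, min(jj + inner, n)))

A = np.zeros((N, N), dtype=np.float32)
for rows, cols in hierarchical_tiles(N, OUTER, INNER):
    A[rows, cols] += 1.0   # stand-in for the real per-tile computation
```

The same nesting idea extends to more levels, one per tier of the locality hierarchy, whether the tiles are handled by the programmer, a runtime system, or an MPI+X decomposition.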
General Implications
 Short term: a programming challenge, and a golden opportunity for smart programmers
 New hardware advances are needed first, and they will influence software
 They may be silicon based, or nanotechnologies such as IBM's carbon-nanotube transistors (9 nm), which may keep things the way they are on the software side for a while
General Implications - The Longer Run
 Long-term hardware technology may move toward nano-photonics for computing and quantum computing
 Many of the new hardware computing innovations may show up first as discrete accelerators, then as on-chip accelerators, and then move closer to the processor's internal circuitry (the data path)
The Longer Term
 The bad news: as the limits of silicon are approached, we may see departures from conventional methods of computing that dramatically change the way we conceive of software
 The good news: history has shown that good ideas from the past get resurrected in new ways
Conclusions
 Graduating an intelligent IT workforce can be a golden egg for countries like Morocco
 You can teach skills, but it is imperative to teach and stress concepts in the curriculum
 Stress parallelism
 Stress locality
 See the recommendations by IEEE/NSF and SIAM for incorporating parallelism into Computer Science, Computer Engineering, and Computational Science and Engineering curricula, and add locality
 For the very long term: there is nothing better than good foundations in physics and math, even for CS and CE majors
Conclusions (cont.)
 Integrate the teaching of soft skills, as President Ouaouicha said:
 Communications
 Entrepreneurship and marketing, individually and in groups
 Patenting and legal issues