EECS 594 Spring 2011
Lecture 2: Overview of High-Performance Computing
[Figure: Performance development in the Top500, 1993-2010, log scale from 100 Mflop/s to 100 Pflop/s.
In 1993: SUM 1.17 TFlop/s, N=1 59.7 GFlop/s, N=500 400 MFlop/s.
In 2010: SUM 44.16 PFlop/s, N=1 2.56 PFlop/s, N=500 31 TFlop/s.
The curves trail one another by roughly 6-8 years; "My Laptop" and "My iPhone" (~40 Mflop/s) are marked for comparison.]
Looking at the Gordon Bell Prize
(Recognizes outstanding achievement in high-performance computing applications
and encourages development of parallel processing.)
•  1 GFlop/s; 1988; Cray Y-MP; 8 processors
   – Static finite element analysis
•  1 TFlop/s; 1998; Cray T3E; 1,024 processors
   – Modeling of metallic magnet atoms, using a variation of the locally
     self-consistent multiple scattering method
•  1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors
   – Superconductive materials
•  1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads)
A factor of 1,000 every decade, i.e., roughly doubling every year.
Performance Development in Top500
[Figure: SUM, N=1, and N=500 performance, 1994-2020, log scale from 100 Mflop/s to 1 Eflop/s, with the trend lines extrapolated to 2020. Gordon Bell Prize winners and "My laptop" (~1 Gflop/s) are marked.]
[Charts: Top500 breakdowns. Processor vendors: Intel 81% (406 systems), AMD 11% (57), IBM 8% (40). Also shown: Countries Share and Customer Segments.]
Performance of Top20 Over 10 Years
[Figure: performance by rank (1-19), Jun-01 through Jun-10, in Pflop/s (0 to 3.0). Successive leading systems annotated: ASCI White (LLNL), Earth Simulator, BG/L (LLNL), Roadrunner (LANL), Jaguar (ORNL), Tianhe-1A (NSCC).]
Pflop/s Club (11 systems; Peak)

Name          Peak Pflop/s   "Linpack" Pflop/s   Country    System
Tianhe-1A     4.70           2.57                China      NUDT: Hybrid Intel/Nvidia/Self
Nebulae       2.98           1.27                China      Dawning: Hybrid Intel/Nvidia/IB
Jaguar        2.33           1.76                US         Cray: AMD/Self
Tsubame 2.0   2.29           1.19                Japan      HP: Hybrid Intel/Nvidia/IB
RoadRunner    1.38           1.04                US         IBM: Hybrid AMD/Cell/IB
Hopper        1.29           1.054               US         Cray: AMD/Self
Tera-100      1.25           1.050               France     Bull: Intel/IB
Mole-8.5      1.14           0.207               China      CAS: Hybrid Intel/Nvidia/IB
Kraken        1.02           0.831               US         Cray: AMD/Self
Cielo         1.02           0.817               US         Cray: AMD/Self
JuGene        1.00           0.825               Germany    IBM: BG-P/Self
Performance of Countries
[Figure, shown in four build-up steps: aggregate Top500 performance over time on a log scale, for the US, then adding the EU, Japan, and China.]
•  Of the 500 fastest supercomputers worldwide, industrial use is > 60%
•  Application areas include: Aerospace, Automotive, Biology, CFD, Database,
   Defense, Digital Content Creation, Digital Media, Electronics, Energy,
   Environment, Finance, Gaming, Geophysics, Image Processing/Rendering,
   Information Processing Service, Information Service, Life Science, Media,
   Medicine, Pharmaceutics, Research, Retail, Semiconductor, Telecomm,
   Weather and Climate Research, and Weather Forecasting
•  Google facilities are leveraging hydroelectric power and old aluminum plants
   – "Hiding in Plain Sight, Google Seeks More Power", by John Markoff,
     New York Times, June 14, 2006
   – Google plant in The Dalles, Oregon (from NYT, June 14, 2006)
•  Microsoft and Yahoo are building big data centers upstream in Wenatchee and
   Quincy, Wash.
   – To keep up with Google, which means they need cheap electricity and
     readily accessible data networking
•  Microsoft, Quincy, Wash.: 470,000 sq ft, 47 MW
•  Facebook, Prineville, OR: 300,000 sq ft, 1.5 cents per kWh
•  Microsoft, Chicago: 700,000 sq ft
•  Apple, rural NC: 500,000 sq ft, 4 cents per kWh
Top 10 systems on the Top500 (Rmax = achieved Linpack performance, in Pflops; power efficiency in Mflops/Watt):

1.  Nat. SuperComputer Center in Tianjin (China): Tianhe-1A, NUDT YH, X5670 2.93 GHz 6C + NVIDIA GPU
    186,368 cores; Rmax 2.57; 55% of peak; 4.04 MW; 636 Mflops/W
2.  DOE / OS, Oak Ridge Nat Lab (USA): Jaguar, Cray XT5, six-core 2.6 GHz
    224,162 cores; Rmax 1.76; 75% of peak; 7.0 MW; 251 Mflops/W
3.  Nat. Supercomputer Center in Shenzhen (China): Nebulae, Dawning TC3600 Blade, Intel X5650 + Nvidia C2050 GPU
    120,640 cores; Rmax 1.27; 43% of peak; 2.58 MW; 493 Mflops/W
4.  GSIC Center, Tokyo Institute of Technology (Japan): Tsubame 2.0, HP ProLiant SL390s G7, Xeon 6C X5670 + Nvidia GPU
    73,278 cores; Rmax 1.19; 52% of peak; 1.40 MW; 850 Mflops/W
5.  DOE/SC/LBNL/NERSC (USA): Hopper, Cray XE6, 12-core 2.1 GHz
    153,408 cores; Rmax 1.054; 82% of peak; 2.91 MW; 362 Mflops/W
6.  Commissariat a l'Energie Atomique (CEA) (France): Tera-100, Bull bullx supernode S6010/S6030
    138,368 cores; Rmax 1.050; 84% of peak; 4.59 MW; 229 Mflops/W
7.  DOE / NNSA, Los Alamos Nat Lab (USA): Roadrunner, IBM BladeCenter QS22/LS21
    122,400 cores; Rmax 1.04; 76% of peak; 2.35 MW; 446 Mflops/W
8.  NSF / NICS / U of Tennessee (USA): Kraken, Cray XT5, six-core 2.6 GHz
    98,928 cores; Rmax 0.831; 81% of peak; 3.09 MW; 269 Mflops/W
9.  Forschungszentrum Juelich (FZJ) (Germany): Jugene, IBM Blue Gene/P Solution
    294,912 cores; Rmax 0.825; 82% of peak; 2.26 MW; 365 Mflops/W
10. DOE / NNSA, Los Alamos Nat Lab (USA): Cielo, Cray XE6, 8-core 2.4 GHz
    107,152 cores; Rmax 0.817; 79% of peak; 2.95 MW; 277 Mflops/W
•  China has 3 Pflop/s systems:
   – NUDT, Tianhe-1A, located in Tianjin
     Dual Intel 6-core + Nvidia Fermi with custom interconnect
     •  Budget 600M RMB
     •  MOST 200M RMB, Tianjin Government 400M RMB
   – CIT, Dawning 6000, Nebulae, located in Shenzhen
     Dual Intel 6-core + Nvidia Fermi with QDR InfiniBand
     •  Budget 600M RMB
     •  MOST 200M RMB, Shenzhen Government 400M RMB
   – Mole-8.5 Cluster: 320 x 2 Intel quad-core Xeon E5520 2.26 GHz +
     320 x 6 Nvidia Tesla C2050, QDR InfiniBand
•  A fourth one is planned for Shandong
•  Loongson (Chinese: 龙芯; academic name: Godson, also known as Dragon chip)
   is a family of general-purpose MIPS-compatible CPUs developed at the
   Institute of Computing Technology, Chinese Academy of Sciences.
•  The chief architect is Professor Weiwu Hu.
•  The 65 nm Loongson 3 (Godson-3) is able to run at a clock speed between
   1.0 and 1.2 GHz, with 4 CPU cores (10 W) first and 8 cores later (20 W);
   it is expected to debut in 2010.
•  The chip will be used as the basis for a petascale system in 2010.
ORNL's Jaguar was recently upgraded to a 2.3 Pflop/s system with more than
224K processor cores, using AMD's 6-core chip.
   Peak performance         2.3 PF
   System memory            300 TB
   Disk space               10 PB
   Disk bandwidth           240+ GB/s
   Interconnect bandwidth   374 TB/s

•  University of Tennessee's National Institute for Computational Sciences
•  Housed at ORNL, operated for the NSF, named Kraken
•  Number 8 on the Top500
   – Just upgraded to 1 Pflop/s peak: 99,072 cores, AMD 2.6 GHz 6-core chip,
     with 129 TB memory
University of Illinois: Blue Waters will be the powerhouse of the National
Science Foundation's strategy to support supercomputers for scientists
nationwide.
   T1: Blue Waters, NCSA/Illinois: 10 Pflop/s peak, 1 Pflop/s sustained, in 2010
   T2: Kraken, NICS/U of Tennessee: 1 Pflop/s peak
       Ranger, TACC/U of Texas: 504 Tflop/s peak
   T3: Campuses across the U.S., several sites: 50-100 Tflop/s peak
Cores per chip across the Top500 systems:
•  427 use quad-core
•  59 use dual-core
•  6 use 9-core
•  2 use 6-core
Example chips: Intel Clovertown (4 cores), Sun Niagara2 (8 cores), Fujitsu
Venus (8 cores), AMD Istanbul (6 cores), IBM Power 7 (8 cores), IBM Cell
(9 cores), IBM BG/P (4 cores), Intel Polaris [experimental] (80 cores)
•  Today
   – Typical server node chip ~ 8 cores
   – 1K-node cluster → 8,000 cores
   – Laptop ~ 2 cores (low power)
   – Intel SCC: 48 cores
•  By 2020
   – Typical server node chip ~ 400 cores
   – 1K-node cluster → 400,000 cores
   – Laptop ~ 100 cores (low power)
Cores per Die
[Chart: predicted number of CPU cores per die over the next 13 years, assuming
continuation of the 58% historical density improvement per year; separate
curves for server and laptop chips. Reference points: Intel 80-core (teraflop)
research chip, Tilera 100 general-purpose cores.]
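For scale, here is a small sketch of what the chart's 58%-per-year assumption compounds to, compared with the "cores double every two years" rule of thumb quoted later in this lecture. This is plain compounding arithmetic, not data read off the chart.

#include <stdio.h>
#include <math.h>

/* Compound an annual improvement rate over n years. */
double factor(double annual_rate, int years) {
    return pow(1.0 + annual_rate, years);
}

int main(void) {
    /* 58%/year density growth (the chart's assumption) vs. core counts
       doubling every two years (the rule of thumb used later). */
    for (int years = 2; years <= 12; years += 2)
        printf("%2d years: x%7.1f at 58%%/yr, x%5.1f doubling every 2 years\n",
               years, factor(0.58, years), pow(2.0, years / 2.0));
    return 0;
}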
•  Most likely a hybrid design
•  Think standard multicore chips plus accelerators (GPUs)
•  Today accelerators are attached; the next generation will be more integrated
•  Intel's "Knights Ferry" and "Knights Corner" to come
   – 48 x86 cores
•  AMD's Fusion in 2011-2013
   – Multicore with embedded ATI graphics
•  Nvidia's Project Denver plans to develop an integrated chip using the ARM
   architecture in 2013
" 
High levels of parallelism
Many GPU cores, serial kernel execution
[ e.g. 240 in the Nvidia Tesla; up to 512 in Fermi – to have concurrent kernel
execution ]
" 
Hybrid/heterogeneous architectures
Match algorithmic requirements to architectural
strengths
[ e.g. small, non-parallelizable tasks to run on CPU, large and parallelizable on
GPU ]
" 
Compute
vs communication gap
Exponentially growing gap; persistent challenge
[ Processor speed improves 59%, memory bandwidth 23%, latency 5.5% ]
[ on all levels, e.g. a GPU Tesla C1070 (4 x C1060) has compute power o
O(1,000)
Gflop/s but GPUs communicate through the CPU using O(1) GB/s connection ]
Moore's Law is Alive and Well
[Chart: transistors per chip (in thousands), 1970-2010, log scale, still
doubling on schedule.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris
Batten, and Krste Asanović. Slide from Kathy Yelick.]
But Clock Frequency Scaling Replaced by Scaling Cores / Chip
[Chart: transistors (in thousands), frequency (MHz), and cores per chip,
1970-2010, log scale. Fifteen years of exponential clock-frequency growth
(~2x per year) has ended; core counts now grow instead.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris
Batten, and Krste Asanović. Slide from Kathy Yelick.]
Performance Has Also Slowed, Along with Power
[Chart: transistors (in thousands), frequency (MHz), power (W), and cores,
1970-2010, log scale. Power is the root cause of all this; a hardware issue
just became a software problem.
Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris
Batten, and Krste Asanović. Slide from Kathy Yelick.]
•  Power ∝ Voltage² x Frequency (V²F)
•  Frequency ∝ Voltage
•  Power ∝ Frequency³
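A worked consequence of these relations (a rough sketch that ignores constant
factors and leakage, and assumes the work parallelizes perfectly): since
Power ∝ Frequency³, one core run at half the clock uses 1/8 of the power, so
two such cores deliver the same aggregate ops/s as one full-speed core while
drawing 2 x 1/8 = 1/4 of the power. This is the basic argument for more,
slower cores rather than one faster core.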
•  The number of cores per chip doubles every two years, while clock speed
   decreases (not increases)
   – Need to deal with systems with millions of concurrent threads
   – Future generations will have billions of threads!
   – Need to be able to easily replace inter-chip parallelism with
     intra-chip parallelism
•  The number of threads of execution doubles every two years
Average Number of Cores Per Supercomputer
[Chart: average number of cores per system in the Top500, rising steeply over
time (scale up to 100,000).]
•  We must rethink the design of our software
   – Another disruptive technology
     •  Similar to what happened with cluster computing and message passing
   – Rethink and rewrite the applications, algorithms, and software
Systems                  2009           2015               2018
System peak              2 Pflop/s      100-200 Pflop/s    1 Eflop/s
System memory            0.3 PB         5 PB               10 PB
Node performance         125 Gflop/s    400 Gflop/s        1-10 Tflop/s
Node memory BW           25 GB/s        200 GB/s           >400 GB/s
Node concurrency         12             O(100)             O(1000)
Interconnect BW          1.5 GB/s       25 GB/s            50 GB/s
System size (nodes)      18,700         250,000-500,000    O(10^6)
Total concurrency        225,000        O(10^8)            O(10^9)
Storage                  15 PB          150 PB             300 PB
I/O                      0.2 TB/s       10 TB/s            20 TB/s
MTTI                     days           days               O(1 day)
Power                    7 MW           ~10 MW             ~20 MW
•  Steepness of the ascent from terascale to petascale to exascale
•  Extreme parallelism and hybrid design
   – Preparing for million/billion-way parallelism
•  Tightening memory/bandwidth bottleneck
   – Limits on power/clock speed and their implications for multicore
   – The pressure to reduce communication will become much more intense
   – Memory per core changes; the byte-to-flop ratio will change
•  Necessary fault tolerance
   – MTTF will drop
   – Checkpoint/restart has limitations
The software infrastructure to address this does not exist today.
(www.exascale.org)
•  For the last decade or more, the research investment strategy has been
   overwhelmingly biased in favor of hardware.
•  This strategy needs to be rebalanced: barriers to progress are increasingly
   on the software side.
•  Moreover, the return on investment is more favorable to software.
   – Hardware has a half-life measured in years, while software has a
     half-life measured in decades.
•  The high-performance ecosystem is out of balance
   – Hardware, OS, compilers, software, algorithms, applications
•  There is no Moore's Law for software, algorithms, and applications
•  The simplest and most useful way to classify modern parallel computers is
   by their memory model:
   – shared memory
   – distributed memory
[Diagram: several processors (P) connected by a BUS to a single shared memory.]
Shared memory: single address space. All processors have access to a pool of
shared memory. (Examples: SGI Origin, Sun E10000)
Distributed memory: each processor has its own local memory; processors must
do message passing to exchange data. (Examples: Cray T3E, IBM SP, clusters)
[Diagram: processors (P), each with its own memory (M), connected by a network.]
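A minimal sketch of the message-passing style used on distributed-memory
machines, written with MPI. The buffer size and message tag are arbitrary
choices for illustration; this is not from the original slides.

#include <mpi.h>
#include <stdio.h>

/* Two processes exchange data explicitly: rank 0 sends, rank 1 receives.
   There is no shared address space; all communication is via messages. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0};
    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i;          /* local data */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}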
Uniform memory access (UMA): each processor has uniform access to memory;
also known as symmetric multiprocessors (SMPs). (Example: Sun E10000)
[Diagram: processors sharing one memory over a single BUS.]

Non-uniform memory access (NUMA): the time for a memory access depends on the
location of the data; local access is faster than non-local access. Easier to
scale than SMPs. (Example: SGI Origin)
[Diagram: several bus-based processor-memory units connected by a network.]
•  Processor-memory nodes are connected by some type of interconnect network
   – Massively Parallel Processor (MPP): tightly integrated, single system
     image
   – Cluster: individual computers connected by software
[Diagram: CPU + MEM nodes attached to an interconnect network.]
•  Latency: how long does it take to start sending a "message"? Measured in
   microseconds.
   (Also in processors: how long does it take to output results of some
   operations, such as floating-point add, divide, etc., which are pipelined?)
•  Bandwidth: what data rate can be sustained once the message is started?
   Measured in MBytes/sec.
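These two numbers are often combined into a simple transfer-time model,
time ≈ latency + message size / bandwidth. A small sketch follows; the latency
and bandwidth values are placeholders, not measurements.

#include <stdio.h>

/* Simple alpha-beta model: time to move an n-byte message is the startup
   latency plus n divided by the sustained bandwidth. */
double transfer_time(double n_bytes, double latency_s, double bandwidth_Bps) {
    return latency_s + n_bytes / bandwidth_Bps;
}

int main(void) {
    double latency = 1e-6;     /* 1 microsecond startup (placeholder) */
    double bandwidth = 1e9;    /* 1 GB/s sustained (placeholder) */
    for (double n = 8; n <= 8e6; n *= 10)
        printf("%10.0f bytes: %.3f us\n",
               n, 1e6 * transfer_time(n, latency, bandwidth));
    return 0;
}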
Percentage of peak
•  A rule of thumb that often applies: a contemporary processor, for a
   spectrum of applications, delivers (i.e., sustains) 10% of peak performance
•  There are exceptions to this rule, in both directions
•  Why such low efficiency?
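For reference, efficiency is just sustained performance divided by peak. A
one-line check against the Tianhe-1A numbers quoted earlier in these slides:

#include <stdio.h>

/* Efficiency = Rmax (sustained Linpack) / Rpeak (theoretical peak). */
int main(void) {
    double rmax = 2.57, rpeak = 4.70;   /* Tianhe-1A, Pflop/s (from the slides) */
    printf("Tianhe-1A Linpack efficiency: %.0f%% of peak\n",
           100.0 * rmax / rpeak);
    /* Prints ~55%, matching the "% of Peak" column; full applications often
       sustain far less, closer to the ~10% rule of thumb. */
    return 0;
}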
Why Fast Machines Run Slow
•  Latency
   – Waiting for access to memory or other parts of the system
•  Overhead
   – Extra work that has to be done to manage program concurrency and
     parallel resources, beyond the real work you want to perform
•  Starvation
   – Not enough work to do, due to insufficient parallelism or poor load
     balancing among distributed resources
•  Contention
   – Delays due to fighting over which task gets to use a shared resource
     next. Network bandwidth is a major constraint.
Processor-DRAM Memory Gap
[Chart: "Moore's Law" processor performance improves ~60% per year (2x/1.5 yr)
while DRAM improves ~9% per year (2x/10 yrs); the processor-memory performance
gap grows about 50% per year.]
Memory hierarchy
[Figure: the memory hierarchy, with typical latencies for today's technology.]
My Laptop
•  2.13 GHz
   – 2 ops/cycle DP per core, 2 cores → 8.54 Gflop/s peak
•  FSB 1.07 GHz
   – 64-bit data path (8 bytes) → 8.56 GB/s
•  With 8 bytes/word (DP)
   – 1.07 GW/s from memory
Intel Clovertown
•  Quad-core processor
   – Each core does 4 floating-point ops per cycle
   – Say 2.4 GHz
   – Thus 4 cores x 4 flops/cycle x 2.4 GHz = 38.4 Gflop/s peak
•  FSB 1.066 GHz
   – 1.066 GHz x 8 B / 8 (B/word) = 1.066 GW/s
   – There's your problem
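One way to summarize the mismatch is the machine "balance": peak flops per word
delivered from memory. A small sketch using the numbers from the two slides
above:

#include <stdio.h>

/* Machine balance: peak Gflop/s divided by sustained GW/s from memory.
   Values are taken from the laptop and Clovertown slides above. */
int main(void) {
    double laptop_flops = 8.54,  laptop_words = 1.07;   /* Gflop/s, GW/s */
    double clover_flops = 38.4,  clover_words = 1.066;
    printf("Laptop:     %.1f flops per word from memory\n",
           laptop_flops / laptop_words);
    printf("Clovertown: %.1f flops per word from memory\n",
           clover_flops / clover_words);
    /* Roughly 8 and 36 flops must be performed per word fetched to stay busy;
       codes that cannot reuse data in cache run at memory speed instead. */
    return 0;
}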
Three Types of Cache Misses
•  Compulsory (or cold-start) misses
   – First access to data
   – Can be reduced via bigger cache lines
   – Can be reduced via some pre-fetching
•  Capacity misses
   – Misses due to the cache not being big enough
   – Can be reduced via a bigger cache
•  Conflict misses
   – Misses due to some other memory line having evicted the needed cache line
   – Can be reduced via higher associativity
Tuning for Caches
1. Preserve locality.
2. Reduce cache thrashing.
3. Use loop blocking when out of cache (see the sketch below).
4. Use software pipelining.
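A sketch of item 3, loop blocking (tiling), applied to matrix multiply. The
tile size B is a tuning parameter (chosen here so three tiles roughly fit in
cache); it is not a value from the slides.

#define B 64   /* tile size: pick so 3*B*B doubles fit in cache (tunable) */

/* Blocked C = C + A*B for n x n matrices in row-major order.
   Each tile of A, B, and C is reused many times while it is cache-resident. */
void matmul_blocked(int n, const double *A, const double *Bm, double *C) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i*n + j] += aik * Bm[k*n + j];
                    }
}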
The Principle of Locality
•  The Principle of Locality: programs access a relatively small portion of
   the address space at any instant of time.
•  Two different types of locality:
   – Temporal locality (locality in time): if an item is referenced, it will
     tend to be referenced again soon (e.g., loops, reuse)
   – Spatial locality (locality in space): if an item is referenced, items
     whose addresses are close by tend to be referenced soon (e.g.,
     straight-line code, array access)
•  For the last 15 years, hardware has relied on locality for speed
Principles of Locality
•  Temporal: an item referenced now will be referenced again soon.
•  Spatial: an item referenced now causes its neighbors to be referenced soon.
•  Lines, not words, are moved between memory levels, so both principles are
   exploited. There is an optimal line size based on the properties of the
   data bus and the memory subsystem design.
•  Cache lines are typically 32-128 bytes.
Counting cache misses
•  n x n 2-D array, element size = e bytes, cache line size = b bytes
•  Traversing the array along the dimension that is contiguous in memory
   (i.e., along cache lines):
   – One cache miss for every cache line: n² x e / b misses
   – Total number of memory accesses: n²
   – Miss rate: e/b. Example: 4 bytes / 64 bytes = 6.25%
     (unless the array is very small)
•  Traversing the array across cache lines (the non-contiguous dimension):
   – One cache miss for every access. Example: miss rate = 100%
     (unless the array is very small)
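A small sketch of the two traversal orders discussed above for a row-major C
array: row-wise access touches each cache line once, while column-wise access
misses on essentially every reference once the array is large.

#include <stdio.h>
#define N 2048

static double a[N][N];   /* row-major: a[i][j] and a[i][j+1] are adjacent */

/* Row-wise: stride-1, one miss per cache line (e.g. 8 doubles per 64-byte line). */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise: stride of N*8 bytes, so each access lands on a new line and
   the line is usually evicted before its neighbors are used. */
double sum_by_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}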
Cache Thrashing
•  Thrashing occurs when frequently used cache lines replace each other. There
   are three primary causes:
   – Instructions and data can conflict, particularly in unified caches.
   – Too many variables, or arrays that are too large, are accessed and do not
     fit into cache.
   – Indirect addressing, e.g., sparse matrices.
•  Machine architects can add sets to the associativity, and users can buy
   another vendor's machine; however, neither solution is realistic.
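A sketch of the second cause above: with a power-of-two leading dimension, the
elements of a column map onto only a few cache sets, so column sweeps suffer
conflict misses; padding each row by one cache line spreads the column across
sets. The sizes below are illustrative, and the actual effect depends on the
cache organization.

#include <stdio.h>

enum { N = 1024, PAD = 8 };          /* 1024 doubles = 8 KB per row; pad = 1 line */

static double a[N][N];               /* power-of-two leading dimension: column
                                        elements are 8 KB apart and share sets */
static double b[N][N + PAD];         /* padded rows: column elements spread out */

double col_sum_unpadded(int j) {     /* prone to conflict misses / thrashing */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i][j];
    return s;
}

double col_sum_padded(int j) {       /* same work, far fewer conflict misses */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += b[i][j];
    return s;
}

int main(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++) s += col_sum_unpadded(j) + col_sum_padded(j);
    printf("%f\n", s);
    return 0;
}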
HW#2
Look at matrix multiply, C = C + A x B:
   for _ = 1, n
     for _ = 1, n
       for _ = 1, n
         Cij = Cij + Aik * Bkj
Look at the performance for the various orderings of i, j, and k.
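A starting-point sketch for the homework, assuming row-major C arrays. Swapping
the loop variables gives the six orderings to compare; the timing harness is
left to you.

/* One of the six orderings (i, j, k); C, A, B are n x n, row-major. */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];   /* Cij += Aik * Bkj */
}

/* Reordering the loops changes the stride of the inner-loop accesses, and
   hence the cache behavior, without changing the arithmetic
   (try jik, kij, jki, ...). */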