CS 267:
Introduction to Parallel Machines
and Programming Models
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
Outline
• Overview of parallel machines (~hardware) and
programming models (~software)
• Shared memory
• Shared address space
• Message passing
• Data parallel
• Clusters of SMPs
• Grid
• Parallel machine may or may not be tightly
coupled to programming model
• Historically, tight coupling
• Today, portability is important
• Trends in real machines
A generic parallel architecture
[Figure: several processors (Proc) connected through an interconnection network to several memories.]
• Where is the memory physically located?
• Is it connected directly to processors?
• What is the connectivity of the network?
Parallel Programming Models
• Programming model is made up of the languages and
libraries that create an abstract view of the machine
• Control
• How is parallelism created?
• What orderings exist between operations?
• Data
• What data is private vs. shared?
• How is logically shared data accessed or communicated?
• Synchronization
• What operations can be used to coordinate parallelism?
• What are the atomic (indivisible) operations?
• Cost
• How do we account for the cost of each of the above?
Simple Example
• Consider applying a function f to the elements
of an array A and then computing its sum:
n 1

f ( A[i ])
i 0
• Questions:
• Where does A live? All in single memory?
Partitioned?
• What work will be done by each processor?
• They need to coordinate to get a single result, how?
A = array of all data
fA = f(A)
s = sum(fA)
[Figure: A → f → fA → sum → s]
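Example (added): serial baseline in C
• Before looking at parallel versions, a minimal serial C sketch of this computation; the array contents, its length, and the choice f(x) = x² are illustrative assumptions, not from the slides:

  #include <stdio.h>

  /* Placeholder for the function f applied to each element. */
  static double f(double x) { return x * x; }

  int main(void) {
      double A[] = {3.0, 5.0, 1.0, 2.0};      /* A = array of all data */
      int n = sizeof(A) / sizeof(A[0]);
      double s = 0.0;
      for (int i = 0; i < n; i++)             /* fA = f(A); s = sum(fA) */
          s += f(A[i]);
      printf("s = %g\n", s);
      return 0;
  }

• The parallel programming models that follow differ in where A lives, who computes which f(A[i]), and how the partial results are combined into s.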
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
• Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common
blocks, or global heap.
• Threads communicate implicitly by writing and reading shared
variables.
• Threads coordinate by synchronizing on shared variables
[Figure: shared memory model — threads P0, P1, ..., Pn each have private memory (private variable i: 2, 5, 8) and all access a shared variable s (one thread executes s = ..., another y = ..s...).]
Simple Example
• Shared memory strategy:
• small number p << n=size(A) processors
• attached to single memory
• Parallel Decomposition:
n 1

f ( A[i ])
i 0
• Each evaluation and each partial sum is a task.
• Assign n/p numbers to each of p procs
• Each computes independent “private” results and partial sum.
• Collect the p partial sums and compute a global sum.
Two Classes of Data:
• Logically Shared
• The original n numbers, the global sum.
• Logically Private
• The individual function evaluations.
• What about the individual partial sums?
Shared Memory “Code” for Computing a Sum
static int s = 0;
Thread 1
for i = 0, n/2-1
s = s + f(A[i])
Thread 2
for i = n/2, n-1
s = s + f(A[i])
• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
- two processors (or two threads) access the same
variable, and at least one does a write.
- The accesses are concurrent (not synchronized) so
they could happen simultaneously
Shared Memory “Code” for Computing a Sum
• Assume A = [3, 5], f(x) = x², and s = 0 initially

static int s = 0;

Thread 1
  …
  compute f(A[i]) and put in reg0    (reg0 = 9)
  reg1 = s                           (reg1 = 0)
  reg1 = reg1 + reg0                 (reg1 = 9)
  s = reg1                           (s = 9)
  …

Thread 2
  …
  compute f(A[i]) and put in reg0    (reg0 = 25)
  reg1 = s                           (reg1 = 0)
  reg1 = reg1 + reg0                 (reg1 = 25)
  s = reg1                           (s = 25)
  …

• For this program to work, s should be 3² + 5² = 34 at the end
• But depending on the interleaving it may be 34, 9, or 25; the trace above shows a bad interleaving in which both threads read s = 0
• The atomic operations are reads and writes
• You never see half of one number, but there is no atomic += operation
• All computations happen in (private) registers
Improved Code for Computing a Sum
static int s = 0;
static lock lk;
Thread 1
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);
• Since addition is associative, it’s OK to rearrange order
• Most computation is on private variables
- Sharing frequency is also reduced, which might improve speed
- But there is still a race condition on the update of shared s
- The race condition can be fixed by adding locks (only one
thread can hold a lock at a time; others wait for it)
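Example (added): the lock-based sum with POSIX threads
• A minimal sketch of this lock-based strategy using POSIX threads; the array size, the fixed two-thread split, and the worker function name are illustrative assumptions, not from the slides:

  #include <pthread.h>
  #include <stdio.h>

  #define N 1000
  static double A[N];
  static double s = 0.0;
  static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

  static double f(double x) { return x * x; }

  /* Each thread sums its half of A into a private local_s,
     then adds the partial sum into the shared s under the lock. */
  static void *worker(void *arg) {
      int id = *(int *)arg;                       /* 0 or 1 */
      int lo = id * N / 2, hi = (id + 1) * N / 2;
      double local_s = 0.0;
      for (int i = lo; i < hi; i++)
          local_s += f(A[i]);
      pthread_mutex_lock(&lk);
      s += local_s;
      pthread_mutex_unlock(&lk);
      return NULL;
  }

  int main(void) {
      for (int i = 0; i < N; i++) A[i] = i;
      pthread_t t[2];
      int id[2] = {0, 1};
      for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &id[i]);
      for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
      printf("s = %g\n", s);
      return 0;
  }

• Most of the work happens on the private local_s; the lock is taken only once per thread, so sharing (and contention) stays low.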
Machine Model 1a: Shared Memory
• Processors all connected to a large shared memory.
• Typically called Symmetric Multiprocessors (SMPs)
• SGI, Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP)
• Multicore chips, except that all caches are shared
• Difficulty scaling to large numbers of processors
• <= 32 processors typical
• Advantage: uniform memory access (UMA)
• Cost: much cheaper to access data in cache than main memory.
[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a bus to memory, possibly through a shared cache. Note: $ = cache]
Problems Scaling Shared Memory Hardware
• Why not put more processors on (with larger memory?)
• The memory bus becomes a bottleneck
• Caches need to be kept coherent
• Example from a Parallel Spectral Transform Shallow
Water Model (PSTSWM) demonstrates the problem
• Experimental results (and slide) from Pat Worley at ORNL
• This is an important kernel in atmospheric models
• 99% of the floating point operations are multiplies or adds, which generally run well on all processors
• But it sweeps through memory with little reuse of operands, so it uses the bus and shared memory frequently
• These experiments show serial performance, with one
“copy” of the code running independently on varying
numbers of procs
• The best case for shared memory: no sharing
• But the data doesn’t all fit in the registers/cache
Example: Problem in Scaling Shared Memory
• Performance degradation
is a “smooth” function of
the number of processes.
• No shared data between
them, so there should be
perfect parallelism.
• (Code was run for 18 vertical levels with a range of horizontal sizes.)
From Pat Worley, ORNL
Machine Model 1b: Multithreaded Processor
• Multiple thread “contexts” without full processors
• Memory and some other state is shared
• Sun Niagara processor (for servers)
• Up to 64 threads all running simultaneously (8 threads x 8 cores)
• In addition to sharing memory, they share floating point units
• Why? Switch between threads for long-latency memory operations
• Cray MTA and Eldorado processors (for HPC)
[Figure: thread contexts T0, T1, ..., Tn over shared $, shared floating point units, etc., and a single memory.]
Eldorado Processor (logical view)
[Figure: Eldorado processor (logical view) — serial code and programs running in parallel are broken into subproblems (A, B) and then into concurrent threads of computation (i = 0, 1, 2, ..., n), which are mapped onto the 128 hardware streams; unused streams sit idle while an instruction-ready pool feeds the pipeline of executing instructions.]
Source: John Feo, Cray
Machine Model 1c: Distributed Shared Memory
• Memory is logically shared, but physically distributed
• Any processor can access any address in memory
• Cache lines (or pages) are passed around machine
• SGI Origin is canonical example (+ research machines)
• Scales to 512 processors (SGI Altix (Columbia) at NASA/Ames)
• Limitation is cache coherency protocols – how to
keep cached copies of the same address consistent
[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected by a network to physically distributed memories.]
Cache lines (pages) must be large to amortize overhead → locality is still critical to performance
Programming Model 2: Message Passing
• Program consists of a collection of named processes.
• Usually fixed at program startup time
• Thread of control plus local address space -- NO shared data.
• Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
• Coordination is implicit in every communication event.
• MPI (Message Passing Interface) is the most commonly used SW
[Figure: message passing model — processes P0, P1, ..., Pn each have only private memory (their own copies of s and i, e.g., s: 12, 14, 11) and are connected by a network; data moves only through matching operations, e.g., Pn executes send P1,s while P1 executes receive Pn,s and then y = ..s...]
Computing s = A[1]+A[2] on each processor
° First possible solution – what could go wrong?
Processor 1
xlocal = A[1]
send xlocal, proc2
receive xremote, proc2
s = xlocal + xremote
Processor 2
xlocal = A[2]
send xlocal, proc1
receive xremote, proc1
s = xlocal + xremote
° What if send/receive acts like the telephone system? The post office?
° Second possible solution
Processor 1
xlocal = A[1]
send xlocal, proc2
receive xremote, proc2
s = xlocal + xremote
Processor 2
xlocal = A[2]
receive xremote, proc1
send xlocal, proc1
s = xlocal + xremote
° What if there are more than 2 processors?
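Example (added): the same exchange in MPI
• As a hedged sketch (not from the slides), MPI_Sendrecv expresses the exchange without the programmer ordering the sends and receives by hand, avoiding the deadlock risk of the first solution; the two-process setup and the array values are assumptions for illustration:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* assume exactly 2 processes */

      double A[2] = {1.0, 2.0};                 /* illustrative data */
      double xlocal = A[rank];                  /* process 0 owns A[0], process 1 owns A[1] */
      double xremote;
      int other = 1 - rank;

      /* Combined send+receive: the MPI library schedules both transfers,
         so neither process blocks waiting for the other to post a receive. */
      MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                   &xremote, 1, MPI_DOUBLE, other, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      double s = xlocal + xremote;
      printf("rank %d: s = %g\n", rank, s);

      /* With more than 2 processes, a collective such as MPI_Allreduce
         computes the same global sum on every process in one call. */
      MPI_Finalize();
      return 0;
  }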
MPI – the de facto standard
MPI has become the de facto standard for parallel
computing using message passing
Pros and Cons of standards
• MPI finally created a standard for application development in the HPC community → portability
• The MPI standard is a least common denominator building on mid-80s technology, so it may discourage innovation
Programming Model reflects hardware!
“I am not sure how I will program a Petaflops computer,
but I am sure that I will need MPI somewhere” – HDS 2001
Machine Model 2a: Distributed Memory
• Cray T3E, IBM SP2
• PC Clusters (Berkeley NOW, Beowulf)
• IBM SP-3, Millennium, CITRIS are distributed memory
machines, but the nodes are SMPs.
• Each processor has its own memory and cache but
cannot directly access another processor’s memory.
• Each “node” has a Network Interface (NI) for all
communication and synchronization.
[Figure: nodes P0, P1, ..., Pn, each with its own memory and network interface (NI), connected by an interconnect.]
PC Clusters: Contributions of Beowulf
• An experiment in parallel computing systems
• Established vision of low cost, high end computing
• Demonstrated effectiveness of PC clusters for
some (not all) classes of applications
• Provided networking software
• Conveyed findings to broad community (great PR)
• Tutorials and book
• Design standard to rally
community!
• Standards beget:
books, trained people,
software … virtuous cycle
Adapted from Gordon Bell, presentation at Salishan 2000
Tflop/s and Pflop/s Clusters
The following are examples of clusters configured out of
separate networks and processor components
• 82% of Top 500 (Nov 2008, up from 72% in 2005),
• 3 of top 10
• IBM Cell cluster at Los Alamos (Roadrunner) is #1
• 12,960 Cell chips + 6,948 dual-core AMD Opterons
• 129,600 cores altogether
• 1.45 PFlops peak, 1.1 PFlops Linpack, 2.5 MWatts
• InfiniBand interconnection network
• In 2005 Walt Disney Feature Animation (The Hive)
was #96
• 1110 Intel Xeons @ 3 GHz
• Gigabit Ethernet
• For more details use “database/sublist generator” at www.top500.org
Machine Model 2b: Internet/Grid Computing
• SETI@Home: Running on 500,000 PCs
• ~1000 CPU Years per Day
• 485,821 CPU Years so far
• Sophisticated Data & Signal Processing Analysis
• Distributes Datasets from Arecibo Radio Telescope
Next Step: Allen Telescope Array
Programming Model 2a: Global Address Space
• Program consists of a collection of named threads.
• Usually fixed at program startup time
• Local and shared data, as in the shared memory model
• But, shared data is partitioned over local processes
• Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate
point between message passing and shared memory
[Figure: partitioned global address space — threads P0, P1, ..., Pn each have private memory (i: 1, 5, 8) plus a partition of a shared array s (s[0]: 26, s[1]: 32, ..., s[n]: 27); any thread may read y = ..s[i].. and each typically writes its own element, s[myThread] = ...]
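Example (added): a rough UPC version of the sum
• A rough sketch of the running sum example in UPC (one of the languages listed above); the array size, the cyclic data layout, and the use of a single shared accumulator protected by a UPC lock are illustrative choices, and the exact dialect details should be checked against a UPC compiler:

  #include <upc.h>
  #include <stdio.h>

  #define N 1024
  shared double A[N];            /* logically shared, partitioned (cyclically) over threads */
  shared double s;               /* shared accumulator, affinity to thread 0 */
  upc_lock_t *lk;

  double f(double x) { return x * x; }

  int main(void) {
      int i;
      lk = upc_all_lock_alloc();           /* collective: all threads get the same lock */

      /* upc_forall runs iteration i on the thread with affinity to A[i]. */
      upc_forall (i = 0; i < N; i++; &A[i])
          A[i] = i;
      upc_barrier;

      double local = 0.0;
      upc_forall (i = 0; i < N; i++; &A[i])
          local += f(A[i]);                /* private work on locally owned elements */

      upc_lock(lk);                        /* add this thread's partial sum into shared s */
      s += local;
      upc_unlock(lk);
      upc_barrier;

      if (MYTHREAD == 0) printf("s = %g\n", s);
      return 0;
  }

• The cost model is visible in the code: accesses through &A[i] with local affinity are cheap, while updates to the remote shared s are the expensive, synchronized part.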
Machine Model 2c: Global Address Space
• Cray T3D, T3E, X1, and HP Alphaserver cluster
• Clusters built with Quadrics, Myrinet, or Infiniband
• The network interface supports RDMA (Remote Direct
Memory Access)
• NI can directly access memory without interrupting the CPU
• One processor can read/write memory with one-sided
operations (put/get)
• Not just a load/store as on a shared memory machine
• Continue computing while waiting for memory op to finish
• Remote data is typically not cached locally
[Figure: nodes P0, P1, ..., Pn, each with memory and a network interface (NI), connected by an interconnect.]
Global address space may be supported in varying degrees
Programming Model 3: Data Parallel
• Single thread of control consisting of parallel operations.
• Parallel operations applied to all (or a defined subset) of a
data structure, usually an array
• Communication is implicit in parallel operators
• Elegant and easy to understand and reason about
• Coordination is implicit – statements executed synchronously
• Similar to Matlab language for array operations
• Drawbacks:
• Not all problems fit this model
• Difficult to map onto coarse-grained machines
A = array of all data
fA = f(A)
s = sum(fA)
[Figure: A → f → fA → sum → s]
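Example (added): the data-parallel style spelled out
• The slides compare this style to Matlab's whole-array operations. As a rough C analogue (an assumption for illustration, since C has no built-in array operations), the same computation can be phrased as two whole-array operators, an elementwise map and a reduction, driven by a single thread of control, with any parallelism hidden inside the operators:

  #include <stdio.h>

  #define N 8

  double f(double x) { return x * x; }

  /* Whole-array operators: in a true data-parallel language these are
     language primitives the compiler/runtime parallelizes; here they are plain loops. */
  void map_f(const double *A, double *fA, int n) {
      for (int i = 0; i < n; i++) fA[i] = f(A[i]);
  }

  double reduce_sum(const double *fA, int n) {
      double s = 0.0;
      for (int i = 0; i < n; i++) s += fA[i];
      return s;
  }

  int main(void) {
      double A[N], fA[N];
      for (int i = 0; i < N; i++) A[i] = i;
      map_f(A, fA, N);                  /* fA = f(A) */
      double s = reduce_sum(fA, N);     /* s = sum(fA) */
      printf("s = %g\n", s);
      return 0;
  }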
Machine Model 3a: SIMD System
• A large number of (usually) small processors.
• A single “control processor” issues each instruction.
• Each processor executes the same instruction.
• Some processors may be turned off on some instructions.
• Originally machines were specialized to scientific computing,
few made (CM2, Maspar)
• Programming model can be implemented in the compiler
• mapping n-fold parallelism to p processors, n >> p, but it’s hard
(e.g., HPF)
[Figure: a control processor broadcasting instructions to many processors P1, ..., each with its own memory and network interface (NI), connected by an interconnect.]
Machine Model 3b: Vector Machines
• Vector architectures are based on a single processor
• Multiple functional units
• All performing the same operation
• Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
• Historically important
• Overtaken by MPPs in the 90s
• Re-emerging in recent years
• At a large scale in the Earth Simulator (NEC SX6) and Cray X1
• At a small scale in SIMD media extensions to microprocessors
• SSE, SSE2 (Intel: Pentium/IA64)
• Altivec (IBM/Motorola/Apple: PowerPC)
• VIS (Sun: Sparc)
• At a larger scale in GPUs
• Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
Vector Processors
• Vector instructions operate on a vector of elements
• These are specified as operations on vector registers
[Figure: a scalar add r3 = r1 + r2 beside a vector add vr3 = vr1 + vr2, which logically performs #elts adds in parallel.]
• A supercomputer vector register holds ~32-64 elts
• The number of elements is larger than the amount of parallel
hardware, called vector pipes or lanes, say 2-4
• The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps
[Figure: the same vector add vr3 = vr1 + vr2 as executed by the hardware, which actually performs #pipes adds in parallel per step.]
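Example (added): the small-scale SIMD form of this idea
• A small illustration (added, not from the slides) of the SIMD-extension form of vector operations, using SSE intrinsics: one instruction adds four single-precision lanes at a time, the small-scale analogue of a vector pipe:

  #include <xmmintrin.h>   /* SSE intrinsics */
  #include <stdio.h>

  int main(void) {
      float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      float y[8] = {10, 20, 30, 40, 50, 60, 70, 80};
      float z[8];

      /* Each iteration issues one vector add over 4 floats (4 lanes). */
      for (int i = 0; i < 8; i += 4) {
          __m128 a = _mm_loadu_ps(&x[i]);
          __m128 b = _mm_loadu_ps(&y[i]);
          __m128 c = _mm_add_ps(a, b);
          _mm_storeu_ps(&z[i], c);
      }

      for (int i = 0; i < 8; i++) printf("%g ", z[i]);
      printf("\n");
      return 0;
  }

• In practice the compiler's auto-vectorizer often generates such instructions from plain loops, which is exactly the "compiler does the difficult work" point above.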
Cray X1: Parallel Vector Architecture
Cray combines several technologies in the X1
• 12.8 Gflop/s vector processors (MSP)
• Shared caches (unusual on earlier vector machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 processors
• Remote put/get between nodes (faster than MPI)
Earth Simulator Architecture
Parallel Vector
Architecture
• High speed (vector)
processors
• High memory
bandwidth (vector
architecture)
• Fast network (new
crossbar switch)
Rearranging commodity
parts can’t match this
performance
Programming Model 4: Hybrids
• These programming models can be mixed
• Message passing (MPI) at the top level with shared
memory within a node is common
• New DARPA HPCS languages mix data parallel and
threads in a global address space
• Global address space models can (often) call
message passing libraries or vice versa
• Global address space models can be used in a
hybrid mode
• Shared memory when it exists in hardware
• Communication (done by the runtime system) otherwise
• For better or worse….
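Example (added): MPI across nodes, shared memory within a node
• A minimal sketch of the common MPI-plus-shared-memory hybrid mentioned above, here using OpenMP inside each MPI process; the problem size, the block partitioning, and the reduction pattern are illustrative assumptions:

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  double f(double x) { return x * x; }

  int main(int argc, char **argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Each MPI process (say, one per SMP node) owns a block of the index
         space and sums it with a shared-memory OpenMP loop across its cores. */
      long lo = (long)rank * N / nprocs, hi = (long)(rank + 1) * N / nprocs;
      double local = 0.0;
      #pragma omp parallel for reduction(+:local)
      for (long i = lo; i < hi; i++)
          local += f((double)i);

      /* Message passing combines the per-node partial sums. */
      double global = 0.0;
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0) printf("global sum = %g\n", global);

      MPI_Finalize();
      return 0;
  }

• The same split (shared memory inside an SMP, message passing between SMPs) is one answer to the "programming model #4" question on the next slide.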
Machine Model 4: Clusters of SMPs
• SMPs are the fastest commodity machine, so use them
as a building block for a larger machine with a network
• Common names:
• CLUMP = Cluster of SMPs
• Hierarchical machines, constellations
• Many modern machines look like this:
• Millennium, IBM SPs, ASCI machines
• What is an appropriate programming model #4 ???
• Treat machine as “flat”, always use message
passing, even within SMP (simple, but ignores an
important part of memory hierarchy).
• Shared memory within one SMP, but message
passing outside of an SMP.
Outline
• Overview of parallel machines and programming
models
• Shared memory
• Shared address space
• Message passing
• Data parallel
• Clusters of SMPs
• Trends in real machines (www.top500.org)
TOP500
- Listing of the 500 most powerful
Computers in the World
- Yardstick: Rmax from Linpack
Ax=b, dense problem
- Updated twice a year:
  ISC‘xy in Germany, June xy
  SC‘xy in USA, November xy
- All data available from www.top500.org
EXTRA SLIDES
(TOP 500 FROM NOV 2007)
30th List: The TOP10
Rank, manufacturer, computer, Rmax [TF/s], installation site, country, year, #cores:
1. IBM, BlueGene/L (eServer Blue Gene), 478.2, DOE/NNSA/LLNL, USA, 2007, 212,992
2. IBM, JUGENE (BlueGene/P Solution), 167.3, Forschungszentrum Juelich, Germany, 2007, 65,536
3. SGI, SGI Altix ICE 8200, 126.9, New Mexico Computing Applications Center, USA, 2007, 14,336
4. HP, Cluster Platform 3000 BL460c, 117.9, Computational Research Laboratories, TATA SONS, India, 2007, 14,240
5. HP, Cluster Platform 3000 BL460c, 102.8, Swedish Government Agency, Sweden, 2007, 13,728
6. Sandia/Cray, Red Storm (Cray XT3), 102.2, DOE/NNSA/Sandia, USA, 2006, 26,569
7. Cray, Jaguar (Cray XT3/XT4), 101.7, DOE/ORNL, USA, 2007, 23,016
8. IBM, BGW (eServer Blue Gene), 91.29, IBM Thomas Watson, USA, 2005, 40,960
9. Cray, Franklin (Cray XT4), 85.37, NERSC/LBNL, USA, 2007, 19,320
10. IBM, New York Blue (eServer Blue Gene), 82.16, Stony Brook/BNL, USA, 2007, 36,864
Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007
Competition between Manufacturers,
Countries and Sites
Top500 Status
• 1st Top500 List in June 1993 at ISC’93 in Mannheim
• 29th Top500 List on June 27, 2007 at ISC’07 in Dresden
• 30th Top500 List on November 13, 2007 at SC07 in Reno
• 31st Top500 List on June 18, 2008 at ISC’08 in Dresden
• 32nd Top500 List on November 18, 2008 at SC08 in Austin
• 33rd Top500 List on June 24, 2009 at ISC’09 in Hamburg
Acknowledged by HPC-users, manufacturers and media
Competition between Manufacturers,
Countries and Sites
1st List (06/1993) vs. 30th List (11/2007), systems per country:

Country           1993 Count   1993 Share   2007 Count   2007 Share
USA               225          45.00 %      283          56.60 %
Japan             111          22.20 %      20           4.00 %
Germany           59           11.80 %      31           6.20 %
France            26           5.20 %       17           3.40 %
United Kingdom    25           5.00 %       48           9.60 %
Australia         9            1.80 %       1            0.20 %
Italy             6            1.20 %       6            1.20 %
Netherlands       6            1.20 %       6            1.20 %
Switzerland       4            0.80 %       7            1.40 %
Canada            3            0.60 %       5            1.00 %
Denmark           3            0.60 %       1            0.20 %
Korea             3            0.60 %       1            0.20 %
China             —            —            10           2.00 %
India             —            —            9            1.80 %
Others            20           4.00 %       55           11.00 %
Totals            500          100 %        500          100 %
Competition between Manufacturers,
Countries and Sites
Countries/Systems
[Chart: number of Top500 systems per country, 1993-2007 (US, Japan, Germany, UK, France, Italy, Korea, China, Others).]
Competition between Manufacturers,
Countries and Sites
1st List (06/1993) vs. 30th List (11/2007), systems per manufacturer:

Manufacturer                 1993 Count   1993 Share   2007 Count   2007 Share
Cray Research / Cray Inc.    205          41.00 %      14           2.80 %
Fujitsu                      69           13.80 %      3            0.60 %
Thinking Machines            54           10.80 %      —            —
Intel                        44           8.80 %       1            0.20 %
Convex / Hewlett-Packard     36           7.20 %       166          33.20 %
NEC                          32           6.40 %       2            0.40 %
Kendall Square Research      21           4.20 %       —            —
MasPar                       18           3.60 %       —            —
Meiko                        9            1.80 %       —            —
Hitachi / Hitachi-Fujitsu    6            1.20 %       1            0.20 %
Parsytec                     3            0.60 %       —            —
nCube                        3            0.60 %       —            —
IBM                          —            —            232          46.40 %
SGI                          —            —            22           4.40 %
Dell                         —            —            24           4.80 %
Others                       —            —            35           7.00 %
Totals                       500          100 %        500          100 %
Competition between Manufacturers,
Countries and Sites
Manufacturer/Systems
[Chart: number of Top500 systems per manufacturer, 1993-2007 (Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others).]
Competition between Manufacturers,
Countries and Sites
Top Sites through 30 Lists
Rank  Site                                       Country          % over time
1     Lawrence Livermore National Laboratory     United States    5.39
2     Sandia National Laboratories               United States    3.70
3     Los Alamos National Laboratory             United States    3.41
4     Government                                 United States    3.34
5     The Earth Simulator Center                 Japan            1.99
6     National Aerospace Laboratory of Japan     Japan            1.70
7     Oak Ridge National Laboratory              United States    1.39
8     NCSA                                       United States    1.31
9     NASA/Ames Research Center/NAS              United States    1.25
10    University of Tokyo                        Japan            1.21
11    NERSC/LBNL                                 United States    1.19
12    Pittsburgh Supercomputing Center           United States    1.15
13    Semiconductor Company (C)                  United States    1.11
14    Naval Oceanographic Office (NAVOCEANO)     United States    1.08
15    ECMWF                                      United Kingdom   1.02
16    ERDC MSRC                                  United States    0.91
17    IBM Thomas J. Watson Research Center       United States    0.86
18    Forschungszentrum Juelich (FZJ)            Germany          0.84
19    Japan Atomic Energy Research Institute     Japan            0.83
20    Minnesota Supercomputer Center             United States    0.74
My Supercomputer Favorite
in the Top500 Lists
Number of Teraflop/s Systems in the
Top500
[Chart: number of Teraflop/s systems in the Top500 over time (log scale, 1 to 500).]
The 30th List as of November 2007
Processor Architecture/Systems
[Chart: Top500 systems by processor architecture (SIMD, Vector, Scalar), 1993-2007.]
The 30th List as of November 2007
Operating Systems/Systems
[Chart: Top500 systems by operating system (Unix, Linux, BSD Based, Mixed, Mac OS, Windows, N/A), 1993-2007.]
The 30th List as of November 2007
Processor Generation/Systems (November 2007)
The 30th List as of November 2007
Interconnect Family/Systems
[Chart: Top500 systems by interconnect family (Gigabit Ethernet, Myrinet, Infiniband, SP Switch, Cray Interconnect, Quadrics, Fat Tree, Crossbar, Proprietary, Others, N/A), 1993-2007.]
The 30th List as of November 2007
Architectures / Systems
[Chart: Top500 systems by architecture (MPP, Cluster, Constellations, SMP, Single Processor, SIMD), 1993-2007.]
Performance Development
and Projection
Performance Development
[Chart: Top500 performance development, 1993-2007 — the total (SUM) grew from 1.167 TF/s to 6.97 PF/s, the #1 system (N=1) from 59.7 GF/s to 478.2 TF/s (BlueGene/L at LLNL), and the #500 system (N=500) from 0.4 GF/s to 5.9 TF/s; labeled #1 systems include the Fujitsu 'NWT' at NAL, Intel ASCI Red at Sandia, IBM ASCI White at LLNL, the NEC Earth Simulator, and BlueGene/L, with "Jack's Notebook" shown for comparison.]
Performance Development
and Projection
Performance Projection
[Chart: Top500 performance projection to 2015 — extrapolations of the SUM, N=1, and N=500 trends (with "Jack's Notebook" for comparison) toward the 1 Pflop/s and 1 Eflop/s levels; the N=1 curve trails the SUM by roughly 6-8 years, and N=500 trails N=1 by roughly a further 8-10 years.]
Analysis of TOP500 Data
Annual performance growth is about a factor of 1.82.
Two factors contribute almost equally to the annual total performance growth:
• Processor count grows per year on average by a factor of 1.30, and
• Processor performance grows by a factor of 1.40 (compared to 1.58 for Moore's Law).
Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp 1517-1544.
Bell‘s Law
Bell's Law of Computer Class formation
was discovered about 1972. It states that technology
advances in semiconductors, storage, user interface and
networking advance every decade enable a new, usually
lower priced computing platform to form. Once formed, each
class is maintained as a quite independent industry structure.
This explains mainframes, minicomputers, workstations and
Personal computers, the web, emerging web services, palm
and mobile devices, and ubiquitous interconnected networks.
We can expect home and body area networks to follow this
path.
From Gordon Bell (2007),
http://research.microsoft.com/~GBell/Pubs.htm
Bell‘s Law
Bell’s Law states that:
Important classes of computer architectures come in cycles of about 10 years.
It takes about a decade for each phase:
• Early research
• Early adoption and maturation
• Prime usage
• Phase out past its prime
Can we use Bell’s Law to classify computer
architectures in the TOP500?
Bell‘s Law
Gordon Bell (1972): 10 year cycles for computer classes
Computer classes in HPC based on the TOP500:
Data Parallel Systems:
Vector (Cray Y-MP and X1, NEC SX, …)
SIMD (CM-2, …)
Custom Scalar Systems:
MPP (Cray T3E and XT3, IBM SP, …)
Scalar SMPs and Constellations (Cluster of big SMPs)
Commodity Cluster: NOW, PC cluster, Blades, …
Power-Efficient Systems (BG/L or BG/P as the first example of low-power / embedded systems = potential new class?)
Tsubame with Clearspeed, Roadrunner with Cell ?
Bell‘s Law
Computer Classes / Systems
[Chart: Top500 systems by computer class (Vector/SIMD, Custom Scalar, Commodity Cluster, BG/L or BG/P), 1993-2007.]
Bell‘s Law
Computer Classes - refined / Systems
[Chart: Top500 systems by refined computer class (Vector, SIMD, MPP, SMP/Constellations, Commodity Cluster, BG/L or BG/P), 1993-2007.]
Bell‘s Law
HPC Computer Classes
Class                    Early Adoption starts   Prime Use starts   Past Prime Usage starts
Data Parallel Systems    Mid 70’s                Mid 80’s           Mid 90’s
Custom Scalar Systems    Mid 80’s                Mid 90’s           Mid 2000’s
Commodity Cluster        Mid 90’s                Mid 2000’s         Mid 2010’s ???
BG/L or BG/P             Mid 2000’s              Mid 2010’s ???     Mid 2020’s ???
Supercomputing, quo vadis?
Countries/Systems (November 2007)
Supercomputing, quo vadis?
Countries/Performance (November 2007)
Supercomputing, quo vadis?
Countries/Systems
(November 2007)
TOP50
Supercomputing, quo vadis?
Continents/Systems
[Chart: Top500 systems by continent (Americas, Europe, Asia, Others), 1993-2007.]
Supercomputing, quo vadis?
Asian Countries / Systems
[Chart: Top500 systems in Asian countries (Japan, South Korea, China, India, Others), 1993-2007; vertical scale 0-120.]
Supercomputing, quo vadis?
Producing Regions / Systems
[Chart: Top500 systems by producing region (USA, Japan, Europe, Others), 1993-2007.]
Top500, quo vadis?
www.top500.org
Top500, quo vadis?
Summary after Fifteen Years of Experience
TOP500 proved itself by correcting the Mannheim supercomputer statistics.
It is simplistic, but (or because of it) it gets trends right.
It does not easily allow one to track market size.
It is inventory based; this smoothes seasonal fluctuation.
Turn-over is very high, so it still reflects recent developments.
[Chart: replacement rate per list, Nov 93 through Nov 07.]
Top500, quo vadis?
Motivation for Additional
Benchmarks
Linpack Benchmark
Pros
• One number
• Simple to define and rank
• Allows problem size to change with machine and over time
• Allows competitions
Cons
• Emphasizes only “peak” CPU speed and number of CPUs
• Does not stress local bandwidth
• Does not stress the network
• No single number can reflect overall performance
Clearly need something more than Linpack: HPC Challenge Benchmark and others
Top500, quo vadis?
Presented at ISC’06
June 27-30 2006
Top500, quo vadis?
http://www.green500.org
Green500 rank, MFLOPS/W, site, computer, total power (kW), TOP500 rank:
1. 357.23, Science and Technology Facilities Council - Daresbury Laboratory, Blue Gene/P Solution, 31.10 kW, TOP500 #121
2. 352.25, Max-Planck-Gesellschaft MPI/IPP, Blue Gene/P Solution, 62.20 kW, TOP500 #40
3. 346.95, IBM - Rochester, Blue Gene/P Solution, 124.40 kW, TOP500 #24
4. 336.21, Forschungszentrum Juelich (FZJ), Blue Gene/P Solution, 497.60 kW, TOP500 #2
5. 310.93, Oak Ridge National Laboratory, Blue Gene/P Solution, 70.47 kW, TOP500 #41
6. 210.56, Harvard University, eServer Blue Gene Solution, 44.80 kW, TOP500 #170
7. 210.56, High Energy Accelerator Research Organization /KEK, eServer Blue Gene Solution, 44.80 kW, TOP500 #171
8. 210.56, IBM - Almaden Research Center, eServer Blue Gene Solution, 44.80 kW, TOP500 #172
9. 210.56, IBM Research, eServer Blue Gene Solution, 44.80 kW, TOP500 #173
10. 210.56, IBM Thomas J. Watson Research Center, eServer Blue Gene Solution, 44.80 kW, TOP500 #174
Summary
• Historically, each parallel machine was unique, along
with its programming model and programming language.
• It was necessary to throw away software and start over
with each new kind of machine.
• Now we distinguish the programming model from the
underlying machine, so we can write portably correct
codes that run on many machines.
• MPI now the most portable option, but can be tedious.
• Writing portably fast code requires tuning for the
architecture.
• Algorithm design challenge is to make this process easy.
• Example: picking a blocksize, not rewriting whole algorithm.
Reading Assignment
• Extra reading for today
• Cray X1
http://www.sc-conference.org/sc2003/paperpdfs/pap183.pdf
• Clusters
http://www.mirror.ac.uk/sites/www.beowulf.org/papers/ICPP95/
• "Parallel Computer Architecture: A Hardware/Software Approach"
by Culler, Singh, and Gupta, Chapter 1.
• Next week: Current high performance architectures
• Shared memory (for Monday)
• Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Gharachorloo et al., Proceedings of the International Symposium on Computer Architecture, 1990.
• Or read about the Altix system on the web (www.sgi.com)
• Blue Gene L (for Wednesday)
• http://sc-2002.org/paperpdfs/pap.pap207.pdf
Open Source Software Model for HPC
• Linus's law, named after Linus Torvalds, the
creator of Linux, states that "given enough
eyeballs, all bugs are shallow".
• All source code is “open”
• Everyone is a tester
• Everything proceeds a lot faster when
everyone works on one code (HPC: nothing gets done
if resources are scattered)
• Software is or should be free (Stallman)
• Anyone can support and market the code
for any price
• Zero cost software attracts users!
• Prevents community from losing HPC software
(CM5, T3E)
Cluster of SMP Approach
• A supercomputer is a stretched high-end server
• Parallel system is built by assembling nodes that are
modest size, commercial, SMP servers – just put more
of them together
Image from LLNL