SC’99: The 14th Mannheim Supercomputing Conference
June 10, 1999
“looking 10 years ahead”
Gordon Bell
http://www.research.microsoft.com/users/gbell
Microsoft
What a difference 25 years and spending >10x makes!
• ESRDC c2002: 40 Tflops, 5,120 processors, 640 computers
• LLNL center c1978: 150 Mflops, 7600 & Cray-1
Talk plan
• We are at a new beginning… many views: installations, parallelism, machine intros(t), timeline, cost to get results, and scalabilities
• SCI c1985, the beginning: 1K processors (MPP)
• ASCI c1998, new beginning: 10K processors
• Why I traded places with Greg Papadopoulos re. clusters and SMPs
• Questions that users & architects will resolve
• New structures: Beowulf and the NT equivalent, Condor, COW, Legion, Globus, Grid …
Comments from an LLNL program manager:
Lessons learned with “full-system mode”
– It is harder than you think
– It takes longer than you think
– It requires more people than you can believe

Just as in the very beginning of computing, leading-edge users are building their own computers.
Are we at a new beginning?
“Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning.” – W. Churchill, 11/10/1942 (quoted at the 1999 Salishan HPC Conference)
“You should not focus NSF CS research on parallelism. I can barely write a correct sequential program.” – Don Knuth, 1987 (to GBell)
“Parallel processing is impossible for people to create well, much less debug.” – Ken Thompson, 1987
“I’ll give $100 to anyone who can run a program on more than 100 processors.” – Alan Karp (198x?)
“I’ll give a $2,500 prize for parallelism every year.” – Gordon Bell (1987)
Yes… we are at a new beginning!
Based on clustered computing: single jobs composed of 1000s of quasi-independent programs running in parallel on 1000s of processors.
Processors (or computers) of all types are distributed and interconnected in every fashion, from a collection using a single shared memory to globally dispersed computers.
TOP500 Systems by Manufacturers
[Chart: number of TOP500 systems from 6/93 through 6/99, broken out by manufacturer: Cray, SGI, IBM, TMC, HP, Intel, Convex, Sun, DEC, Compaq, Japan Inc., and others.]
U.S. Tax Dollars At Work. How many processors does your center have?
• Intel/Sandia: 9,000 Pentium Pro
• LLNL/IBM SP2: 3 × (488 × 8) PowerPC
• LANL/Cray: 6,144 processors in 48 × 128-processor DSM clusters
High-performance architectures timeline, 1950–2000
[Timeline diagram: vacuum tubes → transistors → MSI (minis) → micros (“IBM PC”) → RISC micros, the “killer micros”. Sequential programming (a single execution stream, e.g. Fortran) with processor overlap and lookahead; the Cray era (6600, 7600, Cray-1, X, Y, C, T) of functional units, pipelines, and vectors leading to SMPs; mainframes and “multis” leading to SMP and DSM (Mmax, KSR, DASH, SGI); SIMD and vector parallelization; and THE NEW BEGINNING: parallel programs, aka cluster computing, on multicomputers; the MPP era (Ncube, Intel, IBM; “MPP” if n>1000), clusters (Tandem, VAX, IBM, UNIX), and local and global networks (NOW, Beowulf, Grid; n>10,000).]
Computer types
[Diagram: computer types arranged along a connectivity axis (SM, DSM, SAN, WAN/LAN): NEC mP and NEC super, Cray X…T (all mPv), VPPuni, T3E, SGI DSM, mainframes, SP2 (mP), clusters & multis, Beowulf, NOW, NT clusters, SGI DSM workstations and PCs, Condor, Legion, GRID, networked supers.]
Technical computer types
[Diagram: the same map of machine types split into the old world of single-program-stream computing (NEC mP, NEC super, Cray X…T, T series (all mPv), VPPuni, SGI DSM, mainframes) and the new world of clustered computing with multiple program streams (SP2 (mP), clusters & multis, Beowulf, NOW, SGI DSM, workstations, PCs, Condor, Legion, GRID), arranged by interconnect: SM, DSM, SAN, WAN/LAN.]
Technical computer types
[Diagram: the map annotated with programming models: vectorize and parallelize for the SM/DSM machines (NEC mP, NEC super, Cray X…T, VPPuni, T series (all mPv), SGI DSM, mainframes); MPI, Linda, and PVM to parallelize SAN clusters (SP2 (mP), clusters & multis, NOW, Beowulf, SGI DSM, workstations, PCs); and ??? for distributed computing over WAN/LAN (networked supers, Condor, Legion, GRID).]
Technical computer types: pick of 4 nodes, 2-3 interconnects
[Diagram: the node and interconnect choices: SAN clusters (NEC, Fujitsu, Hitachi, IBM, ?PC?, SGI cluster, Beowulf/NT, T3, HP?); DSM (SGI DSM); SMP (NEC super, Cray ???, Fujitsu, Hitachi, HP, IBM, Intel, Sun); and plain old PCs.]
Bell Prize and Future Peak Tflops (t)
[Chart, 1985–2010, log scale from 0.0001 to 1000 Tflops: Bell Prize and peak performance rising from the XMP and NCube (c1985) through the CM2 and NEC machines toward the Petaflops study target (c2005-2010).]
SCI c1983 (Strategic Computing Initiative)
Funded by DARPA in the early 80s and aimed at a teraflops! An era of State computers and many efforts to build high-speed computers… led to HPCC, Thinking Machines, the Intel supers, and the Cray T3 series.
Humble beginning: the “killer micro”?
In 1981, did you predict this would be the basis of supers?
SCI (c1980s): the Strategic Computing Initiative funded
ATT/Columbia (Non-Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Cedar (U. of IL), Encore, ESL, GE (like the Connection Machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine).
Those who gave their lives in the search for parallelism
Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Denelcor, Elxsi, ETA, E&S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.
What can we learn from this?
• SCI: ARPA-funded product development failed. No successes. Intel prospered.
• ASCI: DOE-funded product purchases create competition.
• First efforts in startups… all failed.
  – Too much competition (with each other)
  – Too little time to establish themselves
  – Too little market. No apps to support them.
  – Too little cash
• Supercomputing is for the large & rich… or is it? Beowulf, shrink-wrap clusters, NOW, Condor, Legion, Grid, etc.
2010 ground rules: the component specs
2010 component characteristics: ~100x improvement at 60% annual growth (checked below)

  Chip density          500 Mtransistors
  Bytes/chip            8 GB
  On-chip clock         2.5 GHz
  Inter-system clock    0.5 GHz
  Disk                  1 TB
  Fiber speed (1 ch)    10 Gbps
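
As a sanity check on the “100x at 60% growth” headline above, a minimal sketch (the 2000-2010 decade span is an assumption, not stated on the slide):

```python
# Compound 60%/year improvement over a decade is on the order of 100x.
growth = 1.60    # 60% improvement per year (from the slide)
years = 10       # assumed span, roughly c2000 -> c2010
print(f"{growth ** years:.0f}x")   # prints ~110x
```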
Computer ops/sec × word length / $
[Chart, 1880–2000, log scale: performance per dollar doubling every 7.5 years in the earliest era, then every 2.3 years, and most recently every 1.0 year; the fitted curve is roughly 1.565^(t-1959.4), i.e. y = 1E-248·e^(0.2918x).]
Processor Limit: the DRAM Gap
[Chart, 1980–2000, log performance scale (“Moore’s Law”): µProc performance growing at 60%/yr. vs. DRAM at 7%/yr.; the processor-memory performance gap grows ~50%/year.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks × 4, or 432 instructions (checked below)
• Caches in the Pentium Pro: 64% of area, 88% of transistors
*Taken from a Patterson-Keeton talk to SIGMOD
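
A quick check of the cache-miss arithmetic in the bullet above; the ~600 MHz clock implied by the 1.7 ns cycle time is an assumption, not stated on the slide:

```python
# Instruction-issue slots forgone during one full cache miss on a 4-way Alpha 21264.
miss_ns = 180.0              # main-memory latency (from the slide)
clock_ns = 1.0 / 0.6         # assumed ~600 MHz part -> ~1.67 ns per clock
issue_width = 4              # instructions issued per clock

clocks = miss_ns / clock_ns
print(f"{clocks:.0f} clocks, ~{clocks * issue_width:.0f} instruction slots")   # ~108 clocks, ~432 slots
```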
Gordon B. & Greg P.: trading places, or why I switched from SMPs to clusters
“Miles’ Law: where you stand depends on where you sit.”
1993:
  GB: SMP and DSM inevitability, after 30 years of belief in (and building) mPs
  GP: multicomputers a la CM5
2000+:
  GB: commodity clusters, improved log(p)
  GP: SMPs => DSM
[Photos: GB with an NT, Compaq, HP cluster; an AOL server farm]
WHY BEOWULFS?
• best price/performance
• rapid response to technology trends
• no single-point vendor
• just-in-place configuration
• scalable
• leverages large software development investment
• mature, robust, accessible
• user empowerment
• meets low expectations created by MPPs
from Thomas Sterling
IT'S THE COST, STUPID
• $28 per sustained MFLOPS
• $11 per peak MFLOPS
from Thomas Sterling
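
To put Sterling's figures in perspective, a back-of-envelope sketch (the 1-Tflops target is illustrative, not from the slide):

```python
# Cost of a hypothetical 1-Tflops Beowulf at Sterling's 1999 price points.
dollars_per_peak_mflops = 11.0
dollars_per_sustained_mflops = 28.0
target_mflops = 1.0e6    # 1 Tflops = 1,000,000 MFLOPS

print(f"peak:      ~${dollars_per_peak_mflops * target_mflops / 1e6:.0f}M")       # ~$11M
print(f"sustained: ~${dollars_per_sustained_mflops * target_mflops / 1e6:.0f}M")  # ~$28M
```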
Why did I trade places, i.e. switch to clustered computing?
• Economics: commodity components give a 10-100x advantage in price/performance
  – backplane-connected processors (incl. DSMs) vs. board-connected processors
• Difficulty of making large SMPs (and DSMs)
  – single system image… clearly needs more work
• SMPs (and DSMs) are NOT scalable in:
  – size: all have very lumpy memory access patterns
  – reliability: redundancy and fault tolerance are required
  – cross-generation: every 3-5 years, start over
  – spatial distribution: put your computers in multiple locations
• Clusters are the only structure that scales!
Technical users have alternatives (making the market size too small)
• PCs work fine for smaller problems
• “Do it yourself” clusters, e.g. Beowulf, work!
• MPI, PVM, Linda: programming models that don’t exploit shared memory… are they the lowest common denominator? (see the sketch after this list)
• ISVs have to use the lowest common denominator to survive
• SMPs are expensive. Parallelization is limited.
• Clusters are required for scalabilities, or for apps requiring extraordinary performance
• …so DSM only adds to the already complex parallelization problem
• Non-U.S. users buy SMPvectors for capacity for legacy apps, until cluster-ready apps arrive
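
As a concrete illustration of the message-passing model referenced in the list above, here is a minimal sketch using mpi4py, a Python MPI binding that the talk itself does not mention; the script name in the usage note is hypothetical.

```python
# Minimal message-passing sketch: each rank computes a partial sum and
# rank 0 combines the results; no shared memory is assumed anywhere.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the job
size = comm.Get_size()   # total number of cooperating processes

# Each rank sums its own slice of 0..999 (quasi-independent work).
local = sum(range(rank, 1000, size))

# Explicit communication replaces shared memory: reduce the partial sums.
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks computed total = {total}")
```

Launched with something like `mpiexec -n 8 python partial_sum.py`, the same program runs unchanged on an SMP, a Beowulf, or an MPP, which is exactly the lowest-common-denominator property the bullet questions.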
C1999 Clusters of computers.
It’s MPP when processors/cluster > 1000.

  Who (vendor)      ΣP.pap   ΣP     P.pap    ΣP.pap/C   ΣP/C   ΣMp/C   ΣM.s
                    Tflops   #K     Gflops   Gflops     #      GB      TB
  LLNL (IBM)        3.9      5.9    0.66     5.3        8      2.5     62
  LANL (SGI)        3.1      6.1    0.5      64         128    32      76
  Sandia (Intel)    2.7      9.1    0.3      0.6        2      -       0.5
  Beowulf           -        -      -        2.0        4      -       -
  Fujitsu           1.2      0.13   9.6      9.6        1      4-16    -
  NEC               4.0      0.5    8        128        16     128     -
  ESRDC             40       5.12   8        64         8      16      -
Commercial users don’t need them
• Highest growth is & will be web servers delivering pages, audio, and video
• Apps are inherently, embarrassingly parallel
• Databases and TP are parallelized and transparent
• A single SMP handles traditional apps
• Clusters are required for reliability and scalabilities
Questions for builders & users
• Can we count on Moore’s Law continuing?
• Vector vs. scalar using commodity chips?
• Clustered computing vs. traditional SMPv?
• Can MPP apps be written for scalable parallelism?
• Cost: how much time and money for apps?
• Benefit/need: in time & cost of execution?
• When will DSM occur or be pervasive?
• Commodity, proprietary, or net interconnections?
• VendorIX (or Linux) vs. NT? Shrink-wrap supers?
• When will computer science research & teach parallelism?
• Did the Web divert follow-through efforts and funding?
• What’s the prognosis for gov’t leadership and funding?
The Physical Processor
• Commodity, aka Intel, micros
  – Does VLIW work better as a micro than it did as a mini at Cydrome & Multiflow?
• Vector processor… abandoned or reborn?
• Multiple processors per chip, or multi-threading
• FPGA chip-based special processors, or other higher-volume processors
What Is The Processor Architecture?
Clearly polarized as US vs. Japan: VECTORS OR VECTORS

Comp. Sci. view:                    Super Computer view:
  MISC >> CISC                        Language directed
  RISC                                VCISC (vectors)
  Super-scalar                        RISC, multiple pipes
  Extra-Long Instruction Word         MTA
Weather model performance: 40 Tflops Earth Simulator R&D Center, c2002
Mercury & Sky Computers: performance & $
• Rugged system with 10 modules ~ $100K; $1K/lb
• Scalable to several K processors; ~1-10 Gflops/ft³
• 10 9U boards × 4 PPC750s → 440 SPECfp95 in 1 ft³ (18.5 × 8 × 10.75”) … 256 Gflops/$3M
• Sky 384-signal-processor system, #20 on the ‘Top 500’, $3M
[Photos: Mercury VME Platinum System; Sky PPC daughtercard]
Russian Elbrus E2K

                       E2K        Merced
  Clock (GHz)          1.2        0.8
  SPEC int/fp          135/350    45/70
  Size, mm² (.18µ)     126        300
  Power (W)            35         60
  PAP (Gflops)         10.2       -
  Pin B/W (GB/s)       1.9        -
  Cache (KB)           64/256     -
  System ship          Q4 2001    -
Computer (P-Mp) system alternatives
• Node size: the most cost-effective SMPs
  – now 1-2 on a single board, evolving to 4-8
  – evolves based on n processors per chip
• Continued use of the single-bus SMP “multi”, with enhancements for performance & reliability
• Large, backplane-bus-based SMPs provide a single system image for small systems, but are not cost- or space-efficient for use as cluster components
• SMPs evolving to weak-coherency DSMs
“Petaflops by 2010”
DOE Accelerated Strategic Computing Initiative (ASCI)
1994 Petaflops Workshop, c2007-2014: clusters of clusters. Something for everyone:

                    SMP             Clusters            Active Memory
  Processors        400             4-40K               400K
  Per processor     1 Tflops*       10-100 Gflops       1 Gflops
  Memory            400 TB SRAM     400 TB DRAM         0.8 TB embedded
  Chips             250K            60K-100K            4K
  Time/result       1 ps/result     10-100 ps/result    -

  *100 × 10 Gflops threads
Petaflops Disks: just compute it at the source
• 100,000 1-TByte discs => 100 Petabytes (about 10 failures/day)
• 8 GBytes of memory per chip
• 10 Gflops of processing per chip
• NT, Linux, or whatever OS
• 10 Gbps network interface
Result: 1.0 petaflops at the disks
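
A back-of-envelope check of the slide's totals (all input values are taken from the slide itself):

```python
# "Compute at the disks": aggregate storage and flops across 100,000 disk nodes.
disks = 100_000
tb_per_disk = 1.0          # 1-TByte disc per node
flops_per_disk = 10e9      # 10 Gflops of processing per chip/node

petabytes = disks * tb_per_disk / 1000       # TB -> PB
petaflops = disks * flops_per_disk / 1e15    # flops -> Pflops
print(f"{petabytes:.0f} PB of storage, {petaflops:.1f} Pflops at the disks")
```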
HT-MT…
• Mechanical: cooling and signals
• Chips: design tools, fabrication
• Chips: memory, PIM
• Architecture: MTA on steroids
• Storage material
Global clusters… a goal, challenge, possibility?
“Our vision ... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.”
Grimshaw, Wulf, et al., “Legion”, CACM, Jan. 1997
Utilize in situ workstations!
• NOW (Berkeley): set the sort record, decrypting
• Grid, Globus, Condor, and other projects
• Need a “standard” interface and programming model for clusters using “commodity” platforms & fast switches
• Giga- and tera-bit links and switches allow geo-distributed systems
• Each PC in a computational environment should have an additional 1 GB/9 GB!
In 2010 every organization will have its own petaflops supercomputer!
• 10,000 nodes in 1999, or 10x over 1987
• Assume 100K nodes in 2010
• 10 Gflops / 10 GBy / 1,000 GB nodes for low-end c2010 PCs
• Communication is the first problem… use the network, which will be >10 Gbps
• Programming is still the major barrier
• Will any problems or apps fit it? Will any apps exploit it?
The Grid: Blueprint for a New Computing Infrastructure
Ian Foster, Carl Kesselman (Eds.), Morgan Kaufmann, 1999
• Published July 1998; ISBN 1-55860-475-8
• 22 chapters by expert authors including: Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others.
“A source book for the history of the future” -- Vint Cerf
http://www.mkp.com/grids
The Grid
“Dependable, consistent, pervasive access to [high-end] resources”
• Dependable: can provide performance and functionality guarantees
• Consistent: uniform interfaces to a wide variety of resources
• Pervasive: the ability to “plug in” from anywhere
2004 Computer Food Chain ???
[Diagram: the computer food chain: mainframes, vector supers, massively parallel processors, minis, and portable computers, alongside networks of workstations/PCs.]
Dave Patterson, UC/Berkeley
Summary
• A 1000x increase in PAP has not been accompanied by RAP, insight, infrastructure, and use. We are finally at the beginning.
• “The PC World Challenge” is to provide “shrink-wrap”, clustered parallelism to the commercial and technical communities
• This only becomes true if system suppliers, e.g. Microsoft, deliver commodity control software
• ISVs must believe that clusters are the future
• Computer science has to get with the program
• The Grid etc., using world-wide resources including in situ PCs, is the new idea
2010 architecture evolution
• High-end computing will continue. Advantage: SMPvector clusters
  – Unclear whether the U.S. will produce one, versus “staying the course” using 10x “killer micros”
• Shrink-wrap clusters become pervasive.
  – SmP (m>16) will be the cluster component, including SmP-on-a-chip and board-level “multis”.
  – Cost-effective systems come from the best nodes. Backplanes are not cost-effective interconnects.
  – Interconnection nets, log(p), are the challenge.
• Apps determine whether clusters become a general-purpose versus niche structure
Technical computer types: pick of 4 nodes, 2-3 interconnects
[Diagram: the node and interconnect choices: SAN clusters (NEC, Fujitsu, Hitachi, IBM, ?PC?, SGI cluster, Beowulf, ?HP?); DSM (SGI DSM, ?PC?); SMP (NEC super, Cray ???, Fujitsu, Hitachi, HP, IBM, ?PC?, Sun); and plain old PCs.]
The end