ISHPC: International Symposium on High-Performance Computing, 26 May 1999
Gordon Bell
http://www.research.microsoft.com/users/gbell
Microsoft
What a difference spending >10X/system & 25 years makes!
[Images] 40 Tflops: ESRDC c2002 (artist's view) vs. 150 Mflops: CDC 7600 + Cray 1, LLNL c1978
Supercomputers(t)

Time    $M      Structure                          Example
1950    1       mainframes                         many...
1960    3       instruction //sm, mainframe SMP    IBM / CDC
1970    10      pipelining                         7600 / Cray 1
1980    30      vectors; SCI                       "Crays"
1990    250     MIMDs: mC, SMP, DSM                "Crays"/MPP
2000    1,000   ASCI, COTS MPP                     Grid, Legion
Supercomputing: speed at any price, using parallelism

Intra-processor
  Memory overlap & instruction lookahead
  Functional parallelism (2-4)
  Pipelining (10)
  SIMD a la ILLIAC: 2D array of 64 PEs, vs. vectors
  Wide instruction word (2-4)
  MTA (10-20) with parallelization of a stream
MIMD multiprocessors: parallelization allows programs to stay with ONE stream
  SMP (4-64)
  Distributed Shared Memory SMPs (100)
MIMD multicomputers force multiple streams
  Multicomputers aka MPP aka clusters (10K)
  Grid: 100K
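To make "stay with ONE stream" concrete, here is a minimal shared-memory sketch (mine, not from the talk) in C with OpenMP: a single program stream, where one directive lets the SMP's processors split a loop over shared arrays. File name and problem size are illustrative.

/* Sketch: shared-memory (SMP/DSM) parallelization keeps ONE program stream.
   Build with e.g.:  gcc -fopenmp axpy.c   (illustrative file name) */
#include <stdio.h>
#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* The only parallel construct: the loop iterations are divided among
       the SMP's processors, all of which see the same shared x[] and y[]. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %g\n", y[0]);
    return 0;
}

On a multicomputer the same loop must become many cooperating streams, each owning a slice of the arrays and exchanging results by messages; the MPI sketch later in the deck ("Technical users have alternatives") shows that style.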
Growth in Computational Resources Used for UK Weather Forecasting
[Chart, 1950-2000] From Leo through Mercury, KDF9, 195, 205, and YMP to ~10 Tflops by 2000:
roughly 10^10 in 50 years, i.e. about 1.58x per year.
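A quick check of the annual growth factor implied by the chart (my arithmetic, not on the slide):

\[ x^{50} = 10^{10} \;\Rightarrow\; x = 10^{10/50} = 10^{0.2} \approx 1.585 \]

i.e. roughly 58% per year, sustained over five decades.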
Talk plan

The very beginning: "build it yourself"
Supercomputing with one computer… the Cray era, 1960-1995
Supercomputing with many computers… parallel computing, 1987-
SCI: what was learned?
Why I gave up on shared memory…
From the humble beginnings
Petaflops: when, … how, how much
New ideas: NOW, Legion, Grid, Globus …
Beowulf: "build it yourself"
Supercomputer: old definition(s)

In the beginning, everyone built their own computer
Largest computer of the day
Scientific and engineering apps
Large government defense, weather, and aero laboratories and centers are the first buyers
Price is no object: $3M … 30M, 50M, 150M … 250M
Worldwide market: 3-5, xx, or xxx?
Supercomputing: new definition

Was a single, sequential program
Has become a single, large-scale job/program composed of many programs running in parallel
Distributed within a room
Evolving to be distributed within a region and across the globe
Cost, effort, and time are extraordinary
Back to the future: build your own super with shrink-wrap software!
Manchester: the first computer. Baby, Mark I, and Atlas
von Neumann computers: Rand Johnniac
When laboratories built their own computers
Seymour Cray, 1925-1996 (see gbell home page)
CDC 1604 & 6600
CDC 7600: pipelining
CDC STAR… ETA10. Scalar matters
Cray 1 #6 from LLNL, located at The Computer Museum History Center, Moffett Field
Cray 1: 150 kW MG set & heat exchanger
Cray XMP/4 proc. c1984
A look at the beginning of the new beginning

SCI (Strategic Computing Initiative), funded by DARPA and aimed at a Teraflops!
Era of State computers and many efforts to build high-speed computers… led to HPCC
Thinking Machines, Intel supers, Cray T3 series
Minisupercomputers: a market whose time never came.
Alliant, Convex, Ardent + Stellar = Stardent = 0
Cydrome and Multiflow: prelude to wide-word parallelism in Merced

Minisupers with VLIW attack the market
Like the minisupers, they are repelled
It's software, software, and software
Was it a basically good idea that will now work as Merced?
KSR 1: first commercial DSM
NUMA (non-uniform memory access) aka COMA (cache-only memory architecture)
Intel's iPSC 1 & Touchstone Delta
"In Dec. 1995, computers with 1,000 processors will do most of the scientific processing."
   Danny Hillis, 1990 (1 paper or 1 company)
The Bell-Hillis Bet: Massive Parallelism in 1995
Compared, TMC (MPP) vs. world-wide supers, on:
  Applications
  Petaflops / mo.
  Revenue
Thinking Machines: CM1 & CM5 c1983-1993
Bell-Hillis Bet: wasn't paid off!

My goal was not necessarily to just win the bet!
Hennessy and Patterson were to evaluate what was really happening…
Wanted to understand the degree of MPP progress and programmability
SCI (c1980s): Strategic Computing Initiative funded:
ATT/Columbia (Non-Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell),
CMU (Production Systems), Encore, ESL, GE (like Connection Machine), Georgia Tech,
Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs,
Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas,
Thinking Machines (Connection Machine), …
Those who gave their lives in the search for parallelism
Alliant, American Supercomputer, Ametek,
AMT, Astronautics, BBN Supercomputer,
Biin, CDC, Chen Systems, CHOPP, Cogent,
Convex (now HP), Culler, Cray Computers,
Cydrome, Denelcor, Elxsi, ETA, E & S
Supercomputers, Flexible, Floating Point
Systems, Gould/SEL, IPM, Key, KSR,
MasPar, Multiflow, Myrias, Ncube, Pixar,
Prisma, SAXPY, SCS, SDSA, Supertek (now
Cray), Suprenum, Stardent (Ardent+Stellar),
Supercomputer Systems Inc., Synapse,
Thinking Machines, Vitec, Vitesse,
Wavetracer.
What can we learn from this?

The classic flow, university research to product development, worked
SCI: ARPA-funded product development failed. No successes. Intel prospered.
ASCI: DOE-funded product purchases create competition
First efforts in startups… all failed:
  Too much competition (with each other)
  Too little time to establish themselves
  Too little market. No apps to support them
  Too little cash
Supercomputing is for the large & rich… or is it? Beowulf, shrink-wrap clusters
Humble beginning: in 1981… would you have predicted this would be the basis of supers?
The Virtuous Economic Cycle that drives the PC industry
[Diagram: the cycle, built around standards]

Platform Economics
Traditional computers: custom or semi-custom, high-tech and high-touch
New computers: high-tech and no-touch
[Chart] Price (K$), volume (K), and application price by computer type, from mainframe through WS to browser, on a log scale (0.01 to 100,000).
Computer ops/sec x word length / $
[Chart, 1880-2000] Performance per dollar grows exponentially: doubling every 7.5 years (early era),
every 2.3 years (fit y = 1E-248 * e^(0.2918x)), and every 1.0 year recently; trend ≈ 1.565^(t-1959.4).
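As a cross-check (mine, not on the slide), the exponential fit shown implies a doubling time consistent with the "doubles every 2.3" label:

\[ t_{2\times} = \frac{\ln 2}{0.2918} \approx 2.4 \ \text{years} \]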
Intel's iPSC 1 & Touchstone Delta
GB with NT, Compaq, HP cluster

The Alliance LES NT Supercluster
"Supercomputer performance at mail-order prices" -- Jim Gray, Microsoft
  Andrew Chien, CS UIUC --> UCSD
  Rob Pennington, NCSA
  Myrinet network, HPVM, Fast Messages
  Microsoft NT OS, MPI API
  192 HP 300 MHz and 64 Compaq 333 MHz systems
Are we at a new beginning?

"Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning."
   1999 Salishan HPC Conference, from W. Churchill, 11/10/1942

"You should not focus NSF CS research on parallelism. I can barely write a correct sequential program."
   Don Knuth, 1987 (to GBell)

"I'll give $100 to anyone who can run a program on more than 100 processors."
   Alan Karp (198x?)

"I'll give a $2,500 prize for parallelism every year."
   Gordon Bell (1987)
Bell Prize and Future Peak Gflops (t)
[Chart, 1985-2010, log scale 0.0001-1000 Gflops] Bell Prize performance from the XMP, NCube, and CM2 era trending up toward the Petaflops study target.
1989 Predictions vs. 1999 Observations

Predicted 1 Tflops PAP in 1995. Actual: 1996. Very impressive progress! (RAP < 1 TF)
More diversity => less software progress!
  Predicted: SIMD, mC (incl. W/S), scalable SMP, DSM, and supers would continue as significant
  Got: SIMD disappeared, 2 mC, 1-2 SMP/DSM, 4 supers, 2 mC with one address space;
  1 SMP became larger and clusters, MTA, workstation clusters, GRID
$3B (unprofitable?) industry; 10+ platforms
PCs and workstations diverted users
MPP apps market DID/could NOT materialize
U.S. Tax Dollars At Work. How many processors does your center have?

Intel/Sandia: 9000 Pentium Pro
LLNL/IBM: 488x8x3 PowerPC (SP2)
LANL/Cray: 6144 P in DSM clusters
Maui Supercomputer Center: 512x1 SP2
ASCI Blue Mountain: 3.1 Tflops SGI Origin 2000

12,000 sq. ft. of floor space
1.6 MWatts of power
530 tons of cooling
384 cabinets to house 6144 CPUs with 1536 GB (32 GB / 128 CPUs)
48 cabinets for metarouters
96 cabinets for 76 TB of RAID disks
36 x HIPPI-800 switch cluster interconnect
9 cabinets for 36 HIPPI switches
About 348 miles of fiber cable
Half of LASL
Comments from LLNL program manager

Lessons learned with "Full-System Mode":
  It is harder than you think
  It takes longer than you think
  It requires more people than you can believe

Just as in the very beginning of computing, leading-edge users are building their own computers.
NEC Supers
40 Tflops Earth Simulator R&D Center (ESRDC) c2002
Fujitsu VPP5000 multicomputer (not available in the U.S.)

Computing nodes:
  speed: 9.6 Gflops vector, 1.2 Gflops scalar
  primary memory: 4-16 GB
  memory bandwidth: 76 GB/s (9.6 x 64 Gb/s)
  inter-processor comm: 1.6 GB/s, non-blocking, with global addressing among all nodes
  I/O: 3 GB/s to SCSI, HIPPI, gigabit Ethernet, etc.
1-128 computers deliver 1.22 Tflops
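The peak figure follows directly from the node count and vector speed (my arithmetic):

\[ 128 \times 9.6\ \text{Gflops} = 1228.8\ \text{Gflops} \approx 1.22\ \text{Tflops} \]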
C1999 clusters of computers. It's MPP when processors/cluster > 1000

Who             ΣP.pap   ΣP.#   P.pap    ΣP.pap/C  Σp/C  ΣMp/C   ΣM.s
                Tflops   K      Gflops   Gflops    #     GB      TB
LLNL (IBM)      3.9      5.9    .66      5.3       8     2.5     62
LANL (SGI)      3.1      6.1    .5       64        128   32
Sandia (Intel)  2.7      9.1    .3       .6        2     -
Beowulf         -        -      .5       2.0       4     -
Fujitsu         1.2      .13    9.6      9.6       1     4-16
NEC             4.0      .5     8        128       16    128
ESRDC           40       5.12   8        64        8     16      76
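Two spot checks of the table's internal consistency (my arithmetic): the ESRDC and LLNL rows multiply out to their stated totals,

\[ 5120 \times 8\ \text{Gflops} \approx 41\ \text{Tflops} \approx 40, \qquad 5900 \times 0.66\ \text{Gflops} \approx 3.9\ \text{Tflops}. \]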
High-performance architecture/program timeline

1950: Vtubes    1960: Trans.    1970: MSI (mini)    1980: Micro    1990: RISC    2000: nMicro

Sequential programming ----> SIMD, Vector --//-- Parallelization
Parallel programming: multicomputers --> ultracomputers (10X in size & price!) -->
"in situ" resources (100x in //sm); <-- MPP era (10x MPP) --> NOW, VLSC, Grid
Yes… we are at a new beginning!

Single jobs, composed of 1000s of quasi-independent programs running in parallel on 1000s of processors (or computers).
Processors (or computers) of all types are distributed (i.e. connected) in every fashion, from a collection using a single shared memory to globally dispersed computers.
Future

2010 component characteristics: 100x improvement @ 60% growth

Chip density          500 Mt
Bytes/chip            8 GB
On-chip clock         2.5 GHz
Inter-system clock    0.5
Disk                  1 TB
Fiber speed (1 ch)    10 Gbps
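The "100x improvement @ 60% growth" figures are consistent over the roughly ten years from this talk to 2010 (my check):

\[ 1.6^{10} \approx 110 \approx 100\times \]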
Application Taxonomy

Technical:
  If real rich, then SMP clusters; else PC clusters (U.S. only)
  General-purpose, non-parallelizable codes (PCs have it!)
  Vectorizable & //able (supers & all SMPs)
  Hand-tuned, one-of-a-kind
  MPP coarse grain
  MPP embarrassingly // (clusters of anything)

Commercial:
  If real rich, then IBM mainframes or large SMPs; else PC clusters
  Database
  Database/TP
  Web host
  Stream audio/video
C2000+ Architecture Taxonomy

SMP:
  Xpt SMPs (mainframes)
  Xpt-SMP vector
  Xpt-multithread (Tera)
  "multi" as a component (mainline)
  Xpt-"multi" hybrid
  DSM (commodity-SCI)?
  DSM (scalar)
  DSM (vector)
Multicomputers aka clusters (MPP when n > 1000 processors):
  Commodity "multis" (mainline)
  Clusters of "multis"
  Clusters of DSMs (scalar & vector)
Questions that will get answered

How long will Moore's Law continue?
MPP (clusters of >1K processors) vs. SMP (incl. DSM)?
How much time and money for programming?
How much time and money for execution?
When, or will, DSM be pervasive?
Is the issue of processor architecture (scalar, MTA, VLIW/MII, vector) important?
Commodity vs. proprietary chips?
Commodity, proprietary, or net interconnections?
Unix vs. VendorIX vs. NT?
Commodity vs. proprietary systems?
Can we find a single, all-pervasive programming model for scalable parallelism to support apps?
When will computer science teach parallelism?
Switching from a long-term belief in SMPs (e.g. DSM, NUMA) to clusters

1963-1993: SMP => DSM inevitability, after 30 years of belief in & building mPs
1993: clusters are inevitable
2000+: commodity clusters, improved log(p); SMPs => DSM
SNAP Systems circa 2000
[Diagram] A space, time (bandwidth), generation, and reliability scalable environment:
  Portables; person servers (PCs); mobile nets
  Local & global data comm; wide-area global ATM network
  Telecomputers aka Internet terminals; TC = TV + PC at home (CATV, ATM, or satellite)
  Legacy mainframe & minicomputer servers & terminals
  ATM & Ethernet to PCs, workstations, & servers
  Scalable computers built from PCs & SANs
  Centralized & departmental servers built from PCs
Scaling dimensions include:
  reliability… including always up
  number of nodes
    most cost-effective system is built from best nodes… PCs with NO backplane
    highest throughput distributes disks to each node versus into a single node
  location within a region or continent
Why did I switch to clusters aka multicomputers aka MPP?

Economics: commodity components give a 10-100x advantage in price/performance
  Backplane-connected processors (incl. DSMs) vs. board-connected processors
Difficulty of making large SMPs (and DSMs)
  Single system image… clearly needs more work
SMPs (and DSMs) fail ALL scalabilities!
  size and lumpiness
  reliability
  cross-generation
  spatial
We need a single programming model
Clusters are the only structure that scales!
Technical users have alternatives

PCs work fine for smaller problems
"Do it yourself" clusters, e.g. Beowulf, work!
MPI: the lcd (lowest common denominator?) programming model; doesn't exploit shared memory (see the sketch after this list)
ISVs have to use the lcd to survive
SMPs are expensive
Clusters are required for scalabilities or for apps requiring extraordinary performance
…so DSM only adds to the already complex parallelization problem
Non-U.S. users continue to use vectors
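A minimal sketch (mine, not from the talk) of the message-passing "lcd" style: the same SPMD code runs on a Beowulf, an SMP, or a DSM, but even processes that share physical memory communicate by explicit messages. Names and the problem size are illustrative.

/* Sketch: SPMD message passing with MPI -- the portable lowest-common-
   denominator model.  Each rank owns a slice of the work; a reduction
   combines partial results by messages, even on a shared-memory node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process computes a partial sum over its own slice. */
    const long n = 1000000;              /* illustrative problem size */
    double local = 0.0;
    for (long i = rank; i < n; i += nprocs)
        local += 1.0 / (double)(i + 1);

    /* Explicit communication: combine partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1/i for i=1..%ld ~= %f\n", n, total);

    MPI_Finalize();
    return 0;
}

Typically built and launched with something like "mpicc sum.c -o sum && mpirun -np 4 ./sum"; exact commands depend on the MPI installation.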
Commercial users don't need them

Highest growth is & will be web servers delivering pages, audio, and video
Apps are inherently, embarrassingly parallel
Databases and TP are parallelized and transparent
A single SMP handles traditional apps
Clusters are required for reliability and scalabilities
2010 architecture

Not much different… I see no surprises, except at the chip level. Good surprises would drive performance more rapidly.
SmP (m<16) will be the component for clusters. Most cost-effective systems are made from best nodes.
Clusters will be pervasive.
Interconnection networks (log(p)) continue to be the challenge
Computer (P-Mp) system alternatives

Node size: most cost-effective SMPs
  Now 1-2 processors on a single board
  Evolves based on n processors per chip
Continued use of the single-bus SMP "multi"
Large SMPs provide a single system image for small systems, but are not cost- or space-efficient for use as cluster components
SMPs evolving to weakly coherent DSMs
Cluster system alternatives

System in a room: SAN-connected, e.g. NOW, Beowulf
System in the building: LAN-connected
System across the continent or globe: inter-/intra-net connected networks
NCSA cluster of 8 x 128-processor SGI Origins
Architects & architectures…
clusters aka MPP (if p > 1000); clusters => NUMA/DSM iff commodity interconnects supply them

U.S. vendors = 9 x scalar processors
  HP, IBM, and Sun: minicomputers aka servers to attack mainframes are the basic building blocks
  SMPs with 100+ processors per system
  surprise: 4-16 processors / chip… MTA?
Intel-based: desktop & small servers
  commodity supercomputers a la Beowulf
Japanese vendors = vector processors
  NEC continues driving the NUMA approach
  Fujitsu will evolve to NUMA/DSM
1994 Petaflops Workshop: c2007-2014. Clusters of clusters. Something for everybody.

                 SMP             Clusters           Active Memory
Processors       400 P           4-40K P            400K P
Speed            1 Tflops*       10-100 Gflops      1 Gflops
Memory           400 TB SRAM     400 TB DRAM        0.8 TB embedded
Chips            250 Kchips      60K-100K chips     4K chips
Time/result      1 ps/result     10-100 ps/result
*100 x 10 Gflops threads

100,000 1-Tbyte discs => 100 Petabytes; 10 failures / day
HT-MT: What's 0.55?

HT-MT…
  Mechanical: cooling and signals
  Chips: design tools, fabrication
  Chips: memory, PIM
  Architecture: MTA on steroids
  Storage material
HTMT & heuristics for computer builders

Mead 11-year rule: time between lab appearance and commercial use
Requires >2 breakthroughs
Team's first computer or super
It's government funded… albeit at a university
Global interconnection

"Our vision ... is a system of millions of hosts… in a loose confederation."
"Users will have the illusion of a very powerful desktop computer through which they can manipulate objects."
   Grimshaw, Wulf, et al., "Legion", CACM, Jan. 1997
Utilize in situ workstations!

NOW (Berkeley) set the sort record, decrypting
Grid, Globus, Condor and other projects
Need a "standard" interface and programming model for clusters using "commodity" platforms & fast switches
Giga- and tera-bit links and switches allow geo-distributed systems
Each PC in a computational environment should have an additional 1 GB/9 GB!
Or more parallelism… and use installed machines

10,000 nodes in 1999, or a 10x increase
Assume 100K nodes
10 Gflops / 10 GB / 100 GB nodes, i.e. low-end c2010 PCs
Communication is the first problem… use the network
Programming is still the major barrier
Will any problems fit it?
The Grid: Blueprint for a New Computing Infrastructure
Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999

Published July 1998; ISBN 1-55860-475-8
22 chapters by expert authors including: Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others. http://www.mkp.com/grids
"A source book for the history of the future" -- Vint Cerf
The Grid

"Dependable, consistent, pervasive access to [high-end] resources"
  Dependable: can provide performance and functionality guarantees
  Consistent: uniform interfaces to a wide variety of resources
  Pervasive: ability to "plug in" from anywhere
Alliance Grid Technology Roadmap: it's not just flops or records/sec

User Interface: Cave5D, Webflow, Virtual Director, VRML, NetMeeting, H.320/323, Java3D, ActiveX, Java
Middleware: CAVERNsoft, Workbenches, Tango, RealNetworks, Visualization, SCIRun, Habanero, Globus, LDAP, QoS
Compute: OpenMP, MPI, HPF, DSM, Clusters, HPVM/FM, Condor, JavaGrande, Symera (DCOM)
Data: Abilene, vBNS, MREN, svPablo, XML, SRB, HDF-5, Emerge (Z39.50), SANs, ODBC, DMF
Summary

1000x increase in PAP has not always been accompanied by RAP, insight, infrastructure, and use. Much remains to be done.
"The PC World Challenge" is to provide commodity, clustered parallelism to commercial and technical communities
This only becomes true if software vendors, e.g. Microsoft, deliver "shrink-wrap" software
ISVs must believe that clusters are the future
Computer science has to get with the program
Grid etc., using world-wide resources including in situ PCs, is the new idea
2004 Computer Food Chain ???
[Diagram, Dave Patterson, UC/Berkeley] Mainframes, vector supers, massively parallel processors, minis, and portable computers in a food chain with networks of workstations/PCs.
The end

"When is a Petaflops possible? What price?"
   Gordon Bell, ACM 1997

Moore's Law                           100x
  (But how fast can the clock tick? Are there any surprises?)
Increase parallelism 10K -> 100K      10x
Spend more ($100M -> $500M)           5x
Centralize center or fast network     3x
Commoditization (competition)         3x
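One way to read the factors (my reading; the slide does not spell it out): the first two alone supply the thousand-fold performance step from teraflops to petaflops,

\[ 100 \times 10 = 1000\times, \]

while the remaining levers (spending, centralization, commoditization) bear mostly on price and on how soon it is affordable.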
Processor Alternatives

Commodity, aka Intel micros
VLIW: does it work better as a micro than it did as the Cydrome & Multiflow minis?
Vector processor
Multiple processors per chip, or multi-threading
MLIW? a.k.a. signal processing
FPGA chip-based special processors
Russian Elbrus E2K Micro

                E2K         Merced
Clock GHz       1.2         0.8
Spec i/fp       135./350    45./70
Size mm2        126.        300.
Power W         35.         60.
Pin B/W GB/s    15.
Cache (KB)      64./256
PAP Gflops      10.2
System ship     Q4/2001
What Is The Processor Architecture?
VECTORS or VECTORS

Comp. Sci. view: MISC >> CISC, RISC, language directed, super-scalar, MTA, extra-long instruction word
Supercomputer view: VCISC (vectors), RISC with multiple pipes
Observation: CMOS supers replaced ECL in Japan

10 Gflops vector units have dual use:
  in traditional mPv supers
  as the basis for computers in mC
Software apps are present
Vector processors out-perform n (n=10) micros for many apps.
It's memory bandwidth, cache prediction, inter-communication, and overall variation
[Chart] Weather model performance
Observation: MPPs 1, Users <1

MPPs, with relatively low-speed micros and lower memory bandwidth, ran over supers but didn't kill 'em.
Did the U.S. industry enter an abyss?
  Is crying "unfair trade" hypocritical?
  Are U.S. users being denied tools?
  Are users not "getting with the program"?
Challenge: we must learn to program clusters...
  Cache idiosyncrasies
  Limited memory bandwidth
  Long inter-communication delays
  Very large numbers of computers
  NO two computers are alike => NO apps
The Law of Massive Parallelism (mine) is based on application scaling

There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other.
Any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time… but it may be completely impractical.

Challenge to theoreticians and tool builders: how well will, or will, an algorithm run?
Challenge for software and programmers: can a package be scalable & portable? Are there models?
Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
Challenge to funders: is the cost justified?
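The scaling claim is essentially the scaled-speedup (Gustafson) argument; stated as a formula (my addition, not on the slide), with serial fraction s and P processors applied to a problem grown with P:

\[ S_{\text{scaled}}(P) = s + (1 - s)\,P \]

so efficiency stays near 1 as long as the parallel part of the work grows with the machine.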
GB's Estimate of Parallelism in Engineering & Scientific Applications
[Chart: log(# apps) vs. granularity & degree of coupling (comp./comm.); regions for PCs, WSs, supers, and clusters aka MPPs aka multicomputers (scalable multiprocessors); "dusty decks" for supers vs. new or scaled-up apps. Gordon's WAG:]
  scalar: 60%
  vector: 15%
  vector & //: 5%
  one-of, >>//: 5%
  embarrassingly & perfectly parallel: 15%