NRC Review Panel on
High Performance Computing
11 March 1994
Gordon Bell
© Gordon Bell
Position
Dual use: Exploit parallelism with in situ nodes & networks
Leverage WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it's Moore's Law
It is possible to help fund computing: Heuristics from federal
funding & use (50 computer systems and 30 years)
Stop Duel Use, genetic engineering of State Computers
• 10+ years: nil payback, mono use, poor, & still to come
• plan for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next generation State Computers "approach" industry
• 10 Tflop ... why?
Summary recommendations
Principal Computing Environments
circa 1994
[Diagram: the >4 networks and interconnect/comm. standards that must be supported -- POTS & 3270 terminals, WAN (comm. stds.), LAN (2 stds.: Ethernet & Token-ring, with gateways, bridges, routers, hubs, etc.), and proprietary clusters -- spanning four worlds: the '50s IBM & proprietary mainframe world (mainframes, clusters, 3270 & PC terminals); the '70s proprietary mini world & '90s UNIX mini world (minicomputers and UNIX multiprocessor servers operated as traditional minicomputers, with ASCII & PC terminals); the '80s UNIX distributed workstations & servers world (UNIX workstations, NFS servers, compute & database uni- & mP servers on LANs); and the late-'80s LAN-PC world (PCs running DOS, Windows, NT with Novell & NT servers). A wide-area data network provides inter-site communication; the POTS net switches mainframes, minis, and terminals.]
Computing Environments
circa 2000
[Diagram: a wide-area global ATM network (also 10-100 Mb/s point-to-point Ethernet) provides a universal high-speed data service for local & global data communication, interconnecting: NT, Windows & UNIX personal servers; the home TC = TV + PC (via CATV or ATM); legacy mainframe & minicomputer servers and their terminals; and centralized & departmental scalable uni- & mP servers (UNIX & NT), built as multicomputers from multiple simple servers, providing NFS, database, compute, print, & communication services. Platforms: x86, PowerPC, SPARC, etc.]
Beyond Dual & Duel Use Technology:
Parallelism can & must be free!
HPCS, corporate R&D, and technical users must
have the goal to design, install and support parallel
environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.
Parallelism is a capability that all computing
environments can & must possess!
-- not a feature to segment "mono use" computers
Parallel applications become a way of computing
utilizing existing, zero-cost resources
-- not a subsidy for specialized, ad hoc computers
Apps follow pervasive computing environments
Computer genetic engineering &
species selection has been ineffective
Although Problem x Machine scalability using SIMD to simulate some physical
systems has been demonstrated, given extraordinary resources, the efficacy of
larger problems in justifying cost-effectiveness has not.
Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be the drivers.
ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should be to multiple platforms,
including workstations & compatible multis that support // environments, to
ensure portability and understand mainline cost-effectiveness.
Continued "supply side" programs aimed at designing, purchasing, supporting,
sponsoring, & porting of apps to specialized State Computers, including
programs aimed at 10 Tflops, should be re-directed to networked computing.
Users must be free to choose and buy any computer, including PCs & WSs, WS
clusters, multiprocessor servers, supercomputers, mainframes, and even highly
distributed, coarse-grain, data-parallel, MPP State Computers.
The teraflops
[Chart: performance vs. year (1988-2000, log scale from 10 to 10,000) on the road to a teraflop, marking Bell Prize results, Cray supers ($30M), NEC, CM5 at $30M/$120M/$240M, Intel at $55M/$300M, and the Cray/Intel DARPA teraflops machines.]
We get no Teraflop before its time:
it's Moore's Law!
Flops = f(t, $), not f(t); technology plans, e.g. BAA 94-08, ignore the $s!
All Flops are not equal: peak announced performance (PAP) vs. real app performance (RAP)
Flops(CMOS, PAP)* < C x 1.6^(t - 1992) x $; C = 128 x 10^6 flops / $30,000
Flops(RAP) = Flops(PAP) x 0.5 for real apps; 1/2 of PAP is a great goal
Flops(supers) = Flops(CMOS) x 0.1; improvement of supers is 15-40%/year;
higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
'92-'94: Flops(PAP)/$ = 4K; Flops(supers)/$ = 500; Flops(vsp)/$ = 50M (1.6 Gflops @ $25)
*Assumes primary & secondary memory size & costs scale with time:
memory = $50/MB in 1992-1994 violates Moore's Law;
disks = $1/MB in 1993; size must continue to increase at 60%/year
When does a Teraflop arrive if only $30 million** is spent on a super?
1 Tflop (CMOS, PAP) in 1996 (x7.8) with 1 Gflop nodes!!!; or 1997 if RAP
10 Tflops (CMOS, PAP) will be reached in 2001 (x78), or 2002 if RAP
How do you get a teraflop earlier?
**A $60 - $240 million Ultracomputer reduces the time by 1.5 - 4.5 years.
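As a rough check of this arithmetic, the PAP rule above can be coded directly. A minimal sketch (the function names and the whole-year rounding are mine, not from the talk; with the slide's constants the $30M crossover lands in 1996-97, matching the estimate above to within a year):

def cmos_pap_flops(year, dollars):
    """Peak announced performance per the slide's rule:
    Flops ~ C * 1.6**(year - 1992) * $, with C = 128e6 flops / $30,000 in 1992."""
    C = 128e6 / 30_000.0          # ~4.3 Kflops per dollar in 1992
    return C * 1.6 ** (year - 1992) * dollars

def first_teraflop_year(budget, target=1e12, rap=False):
    """First year a machine costing `budget` reaches `target` flops (RAP = 0.5 * PAP)."""
    year = 1992
    while cmos_pap_flops(year, budget) * (0.5 if rap else 1.0) < target:
        year += 1
    return year

if __name__ == "__main__":
    for budget in (30e6, 60e6, 240e6):   # $30M super vs. the Ultracomputer budgets above
        print(f"${budget/1e6:.0f}M: PAP {first_teraflop_year(budget)}, "
              f"RAP {first_teraflop_year(budget, rap=True)}")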
Funding Heuristics
(50 computers & 30 years of hindsight)
1. Demand side works, i.e., "we need this product/technology for x";
supply side doesn't work! "Field of Dreams": build it and they will come.
2. Direct funding of university research resulting in technology and product
prototypes that is carried over to start up a company is the most effective
-- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett -> Stanford to fund various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs... rare, an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to
influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers, "Fernbach policy";
unclear if labs are effective product or apps or process developers
b. Universities were effective at influencing computing in timesharing,
graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence they
are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training,
and other effects. ARPAnet established an industry after it escaped BBN!
Funding Heuristics-2
5. Funding product development, targeted purchases, and other subsidies to
establish "State Companies" in a vibrant and overcrowded market is wasteful,
likely to be wrong, and likely to impede computer development (e.g. by having to
feed an overpopulated industry). Furthermore, it is likely to have a deleterious
effect on a healthy industry (e.g. supercomputers).
A significantly smaller universe of computing environments is needed. Cray
& IBM are givens; SGI is probably the most profitable technical vendor; HP/Convex is
likely to be a contender, & others (e.g., DEC) are trying. No State Company (Intel, TMC,
Tera) is likely to be profitable & hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far
it hasn't worked, nor is it likely to unless the company invests. It appears to be
a way to help companies fund marginal people and projects.
7. CRADAs, or co-operative research and development agreements, are very closely
allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or the porting of apps to one platform (e.g.,
EMI analysis), is a way to keep marginal computers afloat.
If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs
from those who have not used or built a successful computer.
Scalability: The Platform of HPCS
& why continued funding is unnecessary
Mono-use computers, aka MPPs, have been, are, and will be doomed
The law of scalability
Four scalabilities: machine, problem x machine,
generation (t), & now spatial
How do flops, memory size, efficiency & time vary with
problem size? Does insight increase with problem size?
What's the nature of problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines? e.g. workstations,
workstation clusters, supers, scalable multis, attached P?
Special, mono-use MPPs are doomed...
no matter how much fedspend!
Special because it has non-standard nodes & networks -- with no apps
Having not evolved to become mainline -- events have overtaken them.
It's special purpose if it's only in Dongarra's Table 3.
Flop rate, execution time, and memory size vs. problem size show applicability
limited to very large scale problems that must be scaled up to cover the
inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility
because of the apps and generality
-- running more, smaller sized problems with a plan produces more insight
The problem domain is limited & now they have to compete with:
• supers -- do scalars, fine grain, and work, and have apps
• workstations -- do very long grain, are in situ, and have apps
• workstation clusters -- have identical characteristics and have apps
• low priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
• scalable multiprocessors -- formed from multis, are in the design stage
Mono useful (>>//) -- hence, illegal because they are not dual use
Duel use -- only useful to keep a high budget intact, e.g., 10 TF
The Law of Massive Parallelism is
based on application scale
There exists a problem that can be made sufficiently large
such that any network of computers can run it efficiently,
given enough memory, searching, & work -- but this
problem may be unrelated to any other problem.
A ... any parallel problem can be scaled to run on an arbitrary
network of computers, given enough memory and time
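The arithmetic behind this law is the familiar fixed-size (Amdahl) vs. scaled (Gustafson) speedup contrast -- not a formula from this talk, just an illustrative sketch with an assumed 10% serial/overhead fraction. Scaling the problem with the machine always makes the speedup look good, which is exactly why big runs alone don't demonstrate insight:

def amdahl_speedup(p, serial_frac=0.10):
    """Fixed-size problem: the serial fraction caps speedup no matter how large p gets."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / p)

def gustafson_speedup(p, serial_frac=0.10):
    """Scaled problem: grow the work with p so the parallel part dominates the run."""
    return serial_frac + (1.0 - serial_frac) * p

for p in (8, 64, 1024):
    print(p, round(amdahl_speedup(p), 1), round(gustafson_speedup(p), 1))
# at p=1024: fixed-size speedup ~9.9 (near the 1/0.1 limit), scaled speedup ~921.7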
Challenge to theoreticians: How well will an algorithm run?
Challenge for software: Can package be scalable & portable?
Challenge to users: Do larger scale, faster, longer run times,
increase problem insight and not just flops?
Challenge to HPCC: Is the cost justified? if so let users do it!
Scalabilities
Size scalable computers are designed from a few
components, with no bottleneck component.
Generation scalable computers can be implemented with
the next generation of technology with no rewrite/recompile.
Problem x machine scalability -- the ability of a problem,
algorithm, or program to exist at a range of sizes so that
it can be run efficiently on a given, scalable computer.
Although large scale problems allow high flops, large problems
running longer may not produce more insight.
Spatial scalability -- ability of a computer to be scaled over a
large physical space to use in situ resources.
Linpack rate in Gflops
vs Matrix Order
[Chart: Linpack Gflops (1 to 100, log scale) vs. matrix order (100 to 100,000), comparing a 1K-node CM5 with a 4-processor NEC SX-3.]
Linpack Solution Time
vs Matrix Order
[Chart: solution time (1 to 1,000, log scale) vs. matrix order (100 to 100,000) for the same 1K-node CM5 and 4-processor NEC SX-3.]
GB's Estimate of Parallelism in
Engineering & Scientific Applications
[Chart: log(# of apps) vs. granularity & degree of coupling (comp./comm.). Approximate shares: scalar 60%; vector 15%; vector mP (<8) 5%; >>// 5%; embarrassingly or perfectly parallel 15%. WSs cover the scalar end and supers the vector "dusty decks"; new or scaled-up apps for massive mCs & WSs sit at the parallel end; scalable multiprocessors span the range.]
MPPs are only for unique,
very large scale, data parallel apps
[Chart: price ($0.01M to $100M, log scale) vs. application characterization (scalar | vector | vector mP | data // | emb. // | gp work | viz | apps), placing mono-use >>// machines only in the data-parallel column at the highest prices, while supers (s), multiprocessors (mP), and workstations (WS) cover the remaining columns at successively lower price points.]
Applicability of various
technical computer alternatives
[Table: applicability rankings (1 = best, 2, 3, na) of PCs/WSs, multiprocessor servers, supercomputers & mainframes, >>// (MPPs), and WS clusters across the domains: scalar, vector, vector mP, data //, embarrassingly & informally //, general-purpose workload, visualization, and apps.]
*Current micros are weak, but improving rapidly, such that
subsequent >>//s that use them will have no advantage for
node vectorization
Performance using distributed
computers depends on
problem & machine granularity
Berkeley's LogP model characterizes granularity &
needs to be understood, measured, and used.
Three parameters (l, o, g) are given in terms of processing ops:
l = latency -- delay time to communicate a message between processors
o = overhead -- time lost transmitting messages
g = gap -- minimum time between messages, i.e., 1 / message-passing rate (bandwidth)
p = number of processors
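A minimal sketch of how these parameters bound efficiency, assuming each grain of work ends with one message and that overhead and latency are not overlapped with computation; the machine numbers below are illustrative assumptions, not measurements:

def grain_efficiency(grain_ops, latency, overhead, gap):
    """Fraction of time spent computing when each grain of work ends with one message.
    All parameters are expressed in processor ops, as on the slide."""
    comm = max(2 * overhead + latency, gap)   # per-message cost, limited by the gap (1 / msg rate)
    return grain_ops / (grain_ops + comm)

# Illustrative numbers: an MPP-like network vs. a LAN of workstations,
# for a 100 Mops/s processor (so 1 microsecond is roughly 100 ops).
for name, (l, o, g) in {"MPP-ish": (1_000, 200, 400), "LAN-ish": (100_000, 5_000, 10_000)}.items():
    for grain in (100, 1_000, 10_000, 1_000_000):
        print(name, grain, round(grain_efficiency(grain, l, o, g), 3))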
Granularity Nomograph
[Nomograph: relates processor speed (10 Mops to 1 Gops; 1993 & 1995 micros, C90, Ultra), grain communication latency & synchronization overhead (100 ns to 1 s, with MPPs around 1-10 µs, LANs around 1 ms, and WANs around 100 ms), and grain length in ops (fine ~100, medium ~1,000, coarse ~10K-100K, very coarse ~1M-10M).]
Granularity Nomograph
[Nomograph: the same axes -- processor speed, grain comm. latency & synch. overhead, and grain length -- with machines placed: the C90 and supers' memory at the fine-grain end (~100 ns), the Cray T3D, VPP 500, and 1993/1995 micros & the Ultra in the medium-to-coarse range (µs latencies), and WANs & LANs at the very coarse end.]
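Read numerically, the nomograph says the grain length needed to hide a given latency is roughly processor speed x latency. A small sketch; the band boundaries are taken from the chart labels, and the speed/latency pairs are illustrative assumptions:

def required_grain_ops(proc_ops_per_s, latency_s):
    """Grain length (ops) at which compute time matches the comm. latency being hidden."""
    return proc_ops_per_s * latency_s

def band(ops):
    """Granularity bands roughly as labeled on the nomograph."""
    if ops < 1_000:
        return "fine"
    if ops < 10_000:
        return "medium"
    if ops < 1_000_000:
        return "coarse"
    return "very coarse"

for name, (speed, lat) in {
    "super memory": (1e9, 100e-9),   # ~100 ns
    "MPP switch":   (1e8, 10e-6),    # ~10 µs
    "LAN of WSs":   (1e8, 1e-3),     # ~1 ms
    "WAN":          (1e8, 100e-3),   # ~100 ms
}.items():
    ops = required_grain_ops(speed, lat)
    print(f"{name}: ~{ops:,.0f} ops -> {band(ops)}")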
Economics of Packaged Software

Platform              Cost      Leverage   # copies
MPP                   >100K     1          1-10 copies
Minis, mainframes*    10-100K   10-100     1000s of copies
Workstation           1-100K    1-10K      1-100K copies
PC                    25-500    50K-1M     1-10M copies

* also evolving high performance multiprocessor servers
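The leverage column is the whole argument: a fixed development cost has to be recovered over the installed base. A toy sketch; the $5M development cost and the representative copy counts are illustrative assumptions, not figures from the talk:

def breakeven_price(dev_cost, copies_sold):
    """Price per copy needed just to recover development cost across the installed base."""
    return dev_cost / copies_sold

dev_cost = 5e6   # hypothetical cost to develop & maintain one packaged app
for platform, copies in {"MPP": 10, "mini/mainframe": 1_000,
                         "workstation": 100_000, "PC": 10_000_000}.items():
    print(f"{platform}: ${breakeven_price(dev_cost, copies):,.2f} per copy")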
Chuck Seitz comments
on multicomputers
“I believe that the commercial, medium grained
multicomputers aimed at ultra-supercomputer
performance have adopted a relatively unprofitable scaling
track, and are doomed to extinction. ... they may as
Gordon Bell believes be displaced over the next several
years by shared memory multiprocessors. ... For loosely
coupled computations at which they excel, ultra-super
multicomputers will, in any case, be more economically
implemented as networks of high-performance
workstations connected by high-bandwidth, local area
networks...”
Convergence to a single architecture
with a single address space
that uses a distributed, shared memory
limited (<20) scalability multiprocessors
>> scalable multiprocessors
workstations with 1-4 processors
>> workstation clusters & scalable multiprocessors
workstation clusters
>> scalable multiprocessors
State Computers built as message passing multicomputers
>> scalable multiprocessors
Evolution of scalable
multiprocessors, multicomputers, &
workstations to shared memory computers
[Diagram: convergence to one architecture. Note, only two structures: 1. shared memory mP with uniform & non-uniform memory access; and 2. networked workstations, shared nothing. Limited-scalability, uniform-memory-access mPs -- bus- and ring-based multis (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.) and mainframes & supers (Convex, Cray, Fujitsu, Hitachi, IBM, NEC) -- continue to be the main line and, under competition from micros and with high-bandwidth switches & comm. protocols (e.g. ATM), evolve toward scalable, non-uniform-memory-access smPs: a 1st smP with no cache, then caches for locality, DSM with some cache (1995?), all-cache (KSR Allcache), and next-generation smP research (e.g. DDM, DASH+). Networked workstations (Apollo, SUN, HP, etc.) evolve naturally into very coarse grain smCs -- WS clusters via special switches (1994) & ATM (1995). Scalable, medium-to-coarse grain, non-uniform-memory multicomputers (smC) -- experimental (Cm* '75, Butterfly '85, Cedar '88), 1st-generation hypercube & grid machines built from WS micros and fast switches (Cosmic Cube, iPSC 1, NCUBE, Transputer-based; Fujitsu, Intel, Meiko, NCUBE, TMC, 1985-1994), and fine-grain machines (Mosaic-C, J-machine) -- converge via DSM (DASH, Convex, Cray T3D, SCI) to the next-generation smP, 1995?]
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives
more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to support a DARPA U. to learn computer design -- the market is only
$200 million and the R&D is billions -- competition works far better.
The inevitable movement to standard networks and nodes need not be
accelerated; these best evolve by a normal market mechanism driven by users.
Dual use of Networks & Nodes is the path to widescale parallelism, not weird
computers
Networking is free via ATM
Nodes are free via in situ workstations
Apps follow pervasive computing environments
Applicability was small and is getting smaller very fast, with many experienced
computer companies entering the market with fine products, e.g. Convex/HP,
Cray, DEC, IBM, SGI & SUN, that are leveraging their R&D and apps, apps, & apps.
Japan has a strong supercomputer industry. The more we jeopardize ours by
mandating use of weird machines that take away from its use, the weaker it
becomes.
MPP won; mainstream vendors have adopted multiple CMOS. Stop funding!
Environments & apps are needed, but are unlikely because the market is small.
Recommendations to HPCS
Goal: By 2000, massive parallelism must exist as a by-product that leverages
a widescale national network & workstation/multi HW/SW nodes.
Dual use, not duel use, of products and technology, or the principle of "elegance":
one part serves more than one function --
network companies supply networks;
node suppliers use ordinary workstations/servers with existing apps, which will
leverage $30 billion x 10^6 R&D.
Fund high speed, low latency networks as a ubiquitous service and the base of all
forms of interconnection, from WANs to supercomputers (in addition, some
special networks will exist for small grain problems).
Observe the heuristics in future federal program funding scenarios
... eliminate
direct or indirect product development and mono-use computers.
Fund Challenges, who in turn fund purchases, not product development.
Funding or purchase of apps porting must be driven by Challenges, but must build on
binary compatible workstation/server apps to leverage nodes, be cross-platform
based to benefit multiple vendors, & have cross-platform use.
Review the effectiveness of State Computers, e.g., need, economics, efficacy.
Each committee member might visit 2-5 sites using a >>// computer.
Review // program environments & their efficacy in producing & supporting apps.
Eliminate all forms of State Computers & recommend a balanced HPCS program:
nodes & networks, based on the industrial infrastructure;
stop funding the development of mono computers, including the 10 Tflop;
it must be acceptable & encouraged to buy any computer for any contract.
Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that
will enable multicomputers as a by-product
Provide many small, scalable computers vs. large, centralized ones
Encourage (revert to) & support not-so-grand challenges
Grand Challenges (GCs) need explicit goals & plans --
disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams
Drop the funding & directed purchase of state computers
Revert to university research -> company & product
development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
Disclaimer
This talk may appear inflammatory ...
i.e. the speaker may have appeared
"to flame".
It is not the speaker's intent to make
ad hominem attacks on people,
organizations, countries, or
computers ... it just may appear
that way.
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with
problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?