Goal of the Committee
The goal of the committee is to assess the status of supercomputing
in the United States, including the characteristics of relevant systems
and architecture research in government, industry, and academia and
the characteristics of the relevant market. The committee will examine
key elements of context--the history of supercomputing, the erosion of
research investment, the needs of government agencies for
supercomputing capabilities--and assess options for progress. Key
historical or causal factors will be identified. The committee will
examine the changing nature of problems demanding supercomputing
(e.g., weapons design, molecule modeling and simulation,
cryptanalysis, bioinformatics, climate modeling) and the implications
for systems design. It will seek to understand the role of national
security in the supercomputer market and the long-term federal
interest in supercomputing.
1
© Gordon Bell
NRC-CSTB
Future of Supercomputing Committee
22 May 2003
NRC Brooks-Sutherland Committee
11 March 1994
NRC OSTP Report
18 August 1984
Gordon Bell
Microsoft Bay Area Research Center
San Francisco
2
© Gordon Bell
Outline
• Community re-Centric Computing vs. Computer Centers
• Background: Where we are and how we got here.
• Performance (t). Hardware trends and questions
• If we didn’t have general purpose computation centers, would we invent them?
• Atkins Report: Past and future concerns…be careful what we wish for
• Appendices: NRC Brooks-Sutherland ’94 comments; CISE (gbell) Concerns re. Supercomputing Centers (’87); CISE (gbell) //ism goals (’87); NRC Report ’84 comments.
• Bottom line, independent of the question:
It has been and always will be the software and apps, stupid!
And now it’s the data, too!
© Gordon Bell
3
Community re-centric Computing...
• Goal: Enable technical communities to create their own
computing environments for personal, data, and program
collaboration and distribution.
• Design based on technology trends, especially networking, apps program maintenance, databases, & the provision of web and other services
• Many alternative styles and locations are possible
– Service from existing centers, including many state centers
– Software vendors could be encouraged to supply apps svcs
– NCAR style center around data and apps
– Instrument-based databases, both central & distributed when multiple viewpoints create the whole
– Wholly distributed with many individual groups
4
© Gordon Bell
Community re-Centric Computing
Time for a major change
• Community Centric
• Community is responsible
– Planned & budgeted as resources
– Responsible for its infrastructure
– Apps are community centric
– Computing is integral to sci./eng.
• In sync with technologies
– 1-3 Tflops/$M; 1-3 PBytes/$M
• New scalables are “centers” fast
– Community can afford
– Dedicated to a community
– Program, data & database centric
– May be aligned with instruments or other community activities
• Output = web service; an entire community demands real-time web service

• Centers Centric
• Center is responsible
– Computing is “free” to users
– Provides a vast service array for all
– Runs & supports all apps
– Computing grant disconnected
• Counter to technologies directions -- to buy smallish Tflops & PBytes
– More costly. Large centers operate at a dis-economy of scale
• Based on unique, fast computers
– Center can only afford
– Divvy cycles among all communities
– Cycles centric; but politically difficult to maintain highest power vs more centers
– Data is shipped to centers, requiring expensive, fast networking
• Output = diffuse among gp centers; Are centers ready or able to support on-demand, real time web services?
5
© Gordon Bell
Background: scalability at last
• Q: How did we get to scalable computing and parallel
processing?
• A: Scalability evolved from a two-decade-old vision and plan starting at DARPA & NSF. Now picked up by DOE & the rest of the world.
• Q: What should be done now?
• A: Realize scalability, the web, and now web services
change everything. Redesign to get with the program!
• Q: Why do you seem to be wanting to de-center?
• A: Besides the fact that user demand has been and is
totally de-coupled from supply, I believe technology
doesn’t necessarily support users or their mission, and
that centers are potentially inefficient compared with a
more distributed approach.
Steve Squires & Gordon Bell at our “Cray” at the start of DARPA’s SCI program, c1984.
20 years later: Clusters of killer micros become the single standard.
Copyright Gordon Bell
Lost in the search for parallelism
ACRI
Alliant
American Supercomputer
Ametek
Applied Dynamics
Astronautics
BBN
CDC
Cogent
Convex > HP
Cray Computer
Cray Research > SGI > Cray
Culler-Harris
Culler Scientific
Cydrome
Dana/Ardent/Stellar/Stardent
Denelcor
Encore
Elexsi
ETA Systems
Evans and Sutherland Computer
Exa
Flexible
Floating Point Systems
Galaxy YH-1
Goodyear Aerospace MPP
Gould NPL
Guiltech
Intel Scientific Computers
International Parallel Machines
Kendall Square Research
Key Computer Laboratories (searching again)
MasPar
Meiko
Multiflow
Myrias
Numerix
Pixar
Parsytec
nCube
Prisma
Pyramid
Ridge
Saxpy
Scientific Computer Systems (SCS)
Soviet Supercomputers
Supertek
Supercomputer Systems
Suprenum
Tera > Cray Company
Thinking Machines
Vitesse Electronics
Wavetracer
Copyright Gordon Bell
A brief, simplified history of HPC
1. Cray formula smPv evolves for Fortran. ’60-’02 (US: ’60-’90)
2. 1978: VAXen threaten computer centers…
3. NSF response: Lax Report. Create 7 Cray centers, 1982
4. SCI: DARPA searches for parallelism using killer micros
5. Scalability found: “bet the farm” on micros clusters. Users “adapt”: MPI, lcd programming model found. >’95. Result: EVERYONE gets to re-write their code!!
6. Beowulf clusters form by adopting PCs and Linus’s Linux to create the cluster standard! (In spite of funders.) >1995
7. ASCI: DOE gets petaflops clusters, creating “arms” race!
8. “Do-it-yourself” Beowulfs negate computer centers since everything is a cluster and shared power is nil! >2000
9. 1997-2002: Tell Fujitsu & NEC to get “in step”!
10. High speed nets enable peer2peer & Grid or Teragrid
11. Atkins Report: Spend $1B/year, form more and larger centers and connect them as a single center…
The Virtuous Economic Cycle drives the PC industry… & Beowulf
[Cycle diagram: standards attract suppliers -> greater availability @ lower cost -> attracts users -> creates apps, tools, training -> which reinforce the standards.]
Copyright Gordon Bell
Lessons from Beowulf
• An experiment in parallel computing systems, ’92
• Established vision: low cost, high end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Provided cluster management tools
• Conveyed findings to broad community via tutorials and the book
• Provided design standard to rally community!
• Standards beget: books, trained people, software … a virtuous cycle that allowed apps to form
• Industry began to form beyond a research project
Copyright Gordon Bell
Courtesy, Thomas Sterling, Caltech.
Technology: peta-bytes, -flops, -bps
We get no technology before its time
• Moore’s Law 2004-2012: 40X
• The big surprise will be the 64 bit micro
• 2004: O(100) processors = 300 GF PAP, $100K
– 3 TF/$M, not a diseconomy of scale for large systems
– 1 PF => ~$330M, but 330K processors; other paths (see the arithmetic sketch after this slide)
• Storage 1-10 TB disks; 100-1000 disks
• Internet II killer app – NOT teragrid
– Access Grid, new methods of communication
– Response time to provide web services
Copyright Gordon Bell
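A quick arithmetic check of the slide’s 2004 figures (all numbers come from the slide itself; this is only a sketch, in Python):

    # $100K buys ~100 processors delivering 300 GF PAP -> ~3 GF/processor, ~3 TF per $M
    gf_per_proc = 300 / 100                      # GF per processor
    dollars_per_gf = 100_000 / 300               # ~$333 per GF
    pf_cost_musd = 1e6 * dollars_per_gf / 1e6    # 1 PF = 1e6 GF -> ~$333M
    pf_processors = 1e6 / gf_per_proc            # ~333,000 processors
    print(round(pf_cost_musd), "M$ and", int(pf_processors), "processors for 1 PF")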
Performance metrics (t)
[Chart: log-scale plot, 1987-2009, of RAP (GF), processor count (#), density (Gb/in), 2009 cost ($M), and Flops/$, with growth-rate annotations of 60%, 100%, and 110% per year; the Earth Simulator (ES) is marked.]
Computing Laws
Perf (PAP) = c x 1.6**(t-1992); c = 128 GF/$300M
‘94 prediction: c = 128 GF/$30M
[Chart: Flops (PAP) per $M on a log scale (1E+08 to 1E+16), 1992-2012, with markers for GB peak and the $30M, $100M, and $300M supers. A short evaluation sketch follows this slide.]
14
© Gordon Bell
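A minimal sketch of the slide’s growth law, assuming c is the 1992 performance per dollar and the 1.6 base means 60%/year improvement (the interpretation is mine; the constants are the slide’s):

    # Perf(PAP) = c * 1.6**(t - 1992), evaluated in GF per $M
    def perf_gf_per_musd(t, c_gf_per_musd):
        return c_gf_per_musd * 1.6 ** (t - 1992)

    c_slide = 128 / 300     # 128 GF per $300M (the slide's c)
    c_pred94 = 128 / 30     # 128 GF per $30M  (the '94 prediction)

    for year in (1992, 2004, 2012):
        print(year, round(perf_gf_per_musd(year, c_slide), 1), "GF/$M")
    print(2004, round(perf_gf_per_musd(2004, c_pred94), 1), "GF/$M under the '94 prediction")
    # 2004 comes out near 120 GF/$M with the slide's c; note 1.6**8 ≈ 43,
    # which matches the "40X" Moore's-law factor quoted for 2004-2012.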
Performance (TF) vs. cost ($M) of noncentral and centrally distributed systems
[Chart: log-log plot of performance (0.01-100 TF) against cost base ($0.1M-$100M), comparing non-central systems, centers delivery, and center purchase; centers (old style supers) are marked with +.]
National Semiconductor Technology Roadmap (size)
[Chart: 1995-2010 roadmap of memory size (MBytes/chip) and microprocessor Mtransistors/chip on a log scale (1-10,000), against line width shrinking from 0.35 toward 0.05 µm; the 1 Gbit part is marked.]
Disk Density Explosion
• Magnetic disk recording density (bits per mm²) grew at 25% per year from 1975 until 1989.
• Since 1989 it has grown at 60-70% per year.
• Since 1998 it has grown at >100% per year.
– This rate will continue into 2003. (A compounding sketch follows this slide.)
• Factors causing accelerated growth:
– Improvements in head and media technology
– Improvements in signal processing electronics
– Lower head flying heights
Courtesy Richie Lary
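A rough compounding sketch of what those rates imply for total density growth; the 65% and 100% figures are my mid-range picks from the ranges stated above:

    # 25%/yr for 1975-1989, ~65%/yr for 1989-1998, ~100%/yr for 1998-2003
    growth = 1.25 ** (1989 - 1975) * 1.65 ** (1998 - 1989) * 2.0 ** (2003 - 1998)
    print(f"{growth:,.0f}x")   # roughly 66,000x growth in areal density, 1975-2003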
National Storage Roadmap 2000
[Chart: storage density roadmap contrasting a trend line labeled “~10x/decade = 60%/year” with a steeper one labeled “100x/decade = 100%/year”.]
Computing Laws
Disk / Tape Cost Convergence
[Chart: retail price, $0.00-$3.00, of a 5400 RPM ATA disk vs. an SDLT tape cartridge, 1/01 through 1/05.]
• 3½” ATA disk could cost less than an SDLT cartridge in 2004, if disk manufacturers maintain the 3½”, multi-platter form factor.
• Volumetric density of disk will exceed tape in 2001.
• A “Big Box of ATA Disks” could be cheaper than a tape library of equivalent size in 2001.
Courtesy of Richard Lary
19
Disk Capacity / Performance Imbalance
• Capacity growth is outpacing performance growth: 140x vs. 3x over 9 years (73%/yr vs. 13%/yr), 1992-2001.
• The difference must be made up by better caching and load balancing.
• Actual disk capacity may be capped by the market (red line); shift to smaller disks (already happening for high speed disks).
[Chart: capacity and performance indices, 1992-2001, log scale 1-100. A rate-check sketch follows this slide.]
Courtesy of Richard Lary
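A one-line check that the stated 9-year multiples match the quoted annual rates (numbers from the slide):

    cap_rate = 140 ** (1 / 9) - 1    # ≈ 0.73 -> ~73%/yr capacity growth
    perf_rate = 3 ** (1 / 9) - 1     # ≈ 0.13 -> ~13%/yr performance growth
    print(f"{cap_rate:.0%} {perf_rate:.0%}")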
Re-Centering to Community Centers
• There is little rational support for general purpose centers
– Scalability changes the architecture of the entire Cyberinfrastructure
– No need to have a computer bigger than the largest parallel app.
– They aren’t super.
– World is substantially data driven, not cycles driven.
– Demand is de-coupled from supply planning, payment or services
• Scientific / Engineering computing has to be the responsibility of each of its communities
– Communities form around instruments, programs, databases, etc.
– Output is web service for the entire community
21
© Gordon Bell
Grand Challenge (c2002) problems
become desktop (c2004) tractable
I don’t buy the problem-growth mantra: 2x res. > 2**4 (yrs.)
• Radical machines will come from the low cost 64-bit explosion
• Today’s desktop has and will increasingly trump yesteryear’s super, simply due to the memory size explosion
• Pittsburgh Alpha: 3D MRI skeleton computing/viewing is a desktop problem, given a large memory
• Tony Jamieson: “I can model an entire 747 on my laptop!”
22
© Gordon Bell
Centers aren’t very super…
• Pittsburgh: 6; NCAR: 10; LSU: 17; Buffalo: 22; FSU: 38; San Diego: 52; NCSA: 82; Cornell: 88; Utah: 89
• 17 universities, world-wide, in the top 100
• Massive upgrade is continuously required:
– Large memories: machines aren’t balanced and haven’t been. Bandwidth: 1 Byte/flop vs. 24 Bytes/flop
– File storage > databases
• Since centers’ systems have >4 year lives, they start out obsolete and overpriced…and then get worse.
23
© Gordon Bell
Centers: The role going forward
• The US builds scalable clusters, NOT supercomputers
– Scalables are 1 to n commodity PCs that anyone can assemble.
– Unlike the “Crays” all are equal. Use is allocated in small clusters.
– Problem parallelism sans ∞// has been elusive (limited to 1K)
– No advantage of having a computer larger than a //able program
• User computation can be acquired and managed effectively.
– Computation is divvied up in small clusters e.g. 128 nodes that
individual groups can acquire and manage effectively
• The basic hardware evolves, and doesn’t especially favor centers
– 64-bit architecture. 512 Mbit × 32/DIMM = 2 GB; × 4 DIMMs/system = 8 GB; systems >> 16 GB (a memory-arithmetic sketch follows this slide)
(Centers’ machines will be obsolete, by memory / balance rules.)
– 3 year timeframe: 1 TB disks at $0.20/TB
– Last mile communication costs are not decreasing to favor centers or grids.
24
© Gordon Bell
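The memory arithmetic behind the node sizing above, as a sketch (Mb = megabits, GB = gigabytes; the chip, DIMM, and system counts are the slide’s):

    chip_bits = 512 * 1024**2                   # one 512 Mbit DRAM chip
    dimm_gb = chip_bits * 32 / 8 / 1024**3      # 32 chips per DIMM  -> 2.0 GB
    system_gb = dimm_gb * 4                     # 4 DIMMs per system -> 8.0 GB
    print(dimm_gb, system_gb)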
Review the bidding
• 1984: Japanese are coming. CMOS and killer Micros. Build // machines.
– 40+ computers were built & failed based on CMOS and/or micros
– No attention to software or apps
• 1994: Parallelism and Grand Challenges
– Converge to Linux Clusters (Constellations nodes >1 Proc.) & MPI
– No noteworthy middleware software to aid apps & replace Fortran e.g.
HPF failed.
– Whatever happened to the Grand Challenges??
• 2004: Teragrid has potential as a massive computer and massive research
– We have and will continue to have the computer clusters to reach a
<$300M Petaflop
– Massive review and re-architecture of centers and their function.
– Instrument/program/data/community centric (CERN, Fermi, NCAR,
Calera)
25
© Gordon Bell
Recommendations
• Give careful advice on the Atkins Report
(It is just the kind of plan that is likely to fly.)
• Community Centric Computing
– Community / instrument / data / program centric (Calera, CERN, NCAR)
• Small number of things to report
– Forget about hardware for now…it’s scalables. The die has been cast.
– Support training, apps, and any research to ease apps development.
– Databases represent the biggest gain. Don’t grep, just reference it.
26
© Gordon Bell
The End
27
© Gordon Bell
Atkins Report: Be careful of what you ask for
• Suggestions (gbell)
– Centers to be re-centered in light of data versus flops
– Overall re-architecture based on user need & technology
– Highly distributed structure aligned with users who plan
their facilities
– Very skeptical of “gridized” projects, e.g. tera, GGF
– Training in the use of databases is needed! It will get
more productivity than another generation of computers.
• The Atkins Report
– $1.02 billion per year recommendation for research, software purchase, and $600M to build and maintain more centers that are certain to be obsolete and non-productive
28
© Gordon Bell
Summary to Atkins Report 2/15/02 15:00 gbell
• Same old concerns: “I don’t have as many flops as users at the national labs.”
• Many facilities should be distributed, with build-it-yourself Beowulf clusters to get extraordinary cycles and bytes.
• Centers need to be re-centered; see Bell & Gray, “What’s Next in High Performance Computing?”, Comm. ACM, Feb. 2002, pp. 91-95.
• Scientific computing needs re-architecting based on networking, communication, computation, and storage. Centrality versus distributed depends on costs and the nature of the work, e.g. instrumentation that generates lots of data. (Last mile problem is significant.)
– Fedex’d hard drive is low cost. Cost of hard drive < network cost. Net is very expensive!
– Centers’ flops and bytes are expensive. Distributed likely to be less so.
– Many sciences need to be reformulated as distributed computing/dbase
• Network costs (last mi.) are a disgrace. $1 billion boondoggle with NGI, Internet II.
• Grid funding: Not in line with COTS or IETF model. Another very large SW project!
• Give funding to scientists in joint grants with tool builders, e.g. the www came from a user
• Database technology is not understood by users and computer scientists
– Training, tool funding, & combined efforts, especially when large & distributed
– Equipment, problems, etc. are dramatically outstripping capabilities!
• It is still about software, especially in light of scalable computers that require reformulation into a new, // programming model
Atkins Report: the critical challenges
1) build real synergy between computer and information science
research and development, and its use in science and
engineering research and education;
2) capture the cyberinfrastructure commonalities across science
and engineering disciplines;
3) use cyberinfrastructure to empower and enable, not impede,
collaboration across science and engineering disciplines;
4) exploit technologies being developed commercially and apply
them to research applications, as well as feed back new
approaches from the scientific realm into the larger world;
5) engage social scientists to work constructively with other
scientists and technologists.
30
© Gordon Bell
Atkins Report: Be careful of what you ask for
1. fundamental research to create advanced cyberinfrastructure
($60M);
2. research on the application of cyberinfrastructure to specific
fields of science and engineering research ($100M);
3. acquisition and development of production quality software
for cyberinfrastructure and supported applications ($200M);
4. provisioning and operations (including computational centers,
data repositories, digital libraries, networking, and application
support) ($600M).
5. archives for software ($60M)
31
© Gordon Bell
NRC Review Panel on
High Performance Computing
11 March 1994
Gordon Bell
32
© Gordon Bell
Position
Dual use: Exploit parallelism with in situ nodes & networks
Leverage WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it’s Moore’s Law
It is possible to help fund computing: Heuristics from federal funding & use (50 computer systems and 30 years)
Stop Duel Use, genetic engineering of State Computers
• 10+ years: nil pay back, mono use, poor, & still to come
• plan for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next generation State Computers "approach" industry
• 10 Tflop ... why?
Summary recommendations
Summary recommendations
33
© Gordon Bell
Principal Computing Environments circa 1994
[Diagram: four computing worlds and the >4 networks needed to support them -- the IBM & proprietary mainframe world of the ’50s (mainframes, mainframe clusters, 3270 & PC terminals); the ’70s proprietary mini world and ’90s UNIX mini world (minicomputers, minicomputer clusters, ASCII & PC terminals, UNIX multiprocessor servers operated as traditional minicomputers); the ’80s UNIX distributed workstation & server world (UNIX workstations, NFS servers, compute & database uni- & mP servers); and the late-’80s LAN-PC world (PCs running DOS, Windows, NT with Novell & NT servers). Interconnect & comm. standards: the POTS net for switching mainframes and terminals, WAN comm. standards with a wide-area data network for inter-site communication, two LAN standards (Ethernet and token ring, with gateways, bridges, routers, hubs), and proprietary clusters.]
34
© Gordon Bell
Computing Environments circa 2000
[Diagram: NT, Windows & UNIX person(al) servers; legacy mainframes & minicomputers acting as servers with their terminals; centralized & departmental uni- & mP servers (UNIX & NT) and scalable multicomputers built from multiple simple servers, providing NFS, database, compute, print & communication services (platforms: x86, PowerPC, SPARC, etc.); the TC = TV + PC home via CATV or ATM; all tied together by local & global data comm over a wide-area ATM network and local area networks (ATM, plus 10-100 Mb/s point-to-point Ethernet) serving terminals, PCs, workstations & servers; a universal high speed data service using ATM or ??]
35
© Gordon Bell
Beyond Dual & Duel Use Technology:
Parallelism can & must be free!
HPCS, corporate R&D, and technical users must
have the goal to design, install and support parallel
environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.
Parallelism is a capability that all computing
environments can & must possess!
--not a feature to segment "mono use" computers
Parallel applications become a way of computing
utilizing existing, zero cost resources
-- not subsidy for specialized ad hoc computers
Apps follow pervasive computing environments
36
© Gordon Bell
Computer genetic engineering &
species selection has been ineffective
Although Problem x Machine Scalability using SIMD for simulating some physical
systems has been demonstrated, given extraordinary resources, the efficacy of
larger problems to justify cost-effectiveness has not.
Hamming:"The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be drivers.
ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should be to multiple platforms, including workstations & compatible multis that support // environments, to ensure portability and understand main line cost-effectiveness
Continued "supply side"programs aimed at designing, purchasing, supporting,
sponsoring, & porting of apps to specialized, State Computers, including
programs aimed at 10 Tflops, should be re-directed to networked computing.
User must be free to choose and buy any computer, including PCs & WSs, WS
Clusters, multiprocessor servers, supercomputers, mainframes, and even highly
distributed, coarse grain, data parallel, MPP State computers.
37
© Gordon Bell
The teraflops
[Chart: performance (Gflops, log scale 1-10,000) vs. year, 1988-2000, plotting Bell Prize results against announced machines -- Cray DARPA, Intel $300M, CM5 $240M, CM5 $120M, Intel $55M, NEC, CM5 $30M, and the $30M Cray super.]
38
© Gordon Bell
We get no Teraflop before it's time:
it's Moore's Law!
Flops = f(t, $), not f(t); technology plans, e.g. BAA 94-08, ignore $s!
All flops are not equal: peak announced performance (PAP) vs. real app performance (RAP)
Flops(CMOS, PAP)* < C x 1.6**(t-1992) x $; C = 128 x 10**6 flops / $30,000
Flops(RAP) = Flops(PAP) x 0.5 for real apps; 1/2 of PAP is a great goal
Flops(supers) = Flops(CMOS) x 0.1; improvement of supers is 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
'92-'94: Flops(PAP)/$ = 4K; Flops(supers)/$ = 500; Flops(vsp)/$ = 50M (1.6G @ $25)
*Assumes primary & secondary memory size & costs scale with time: memory = $50/MB in 1992-1994 violates Moore's Law; disks = $1/MB in 1993, size must continue to increase at 60%/year
When does a Teraflop arrive if only $30 million** is spent on a super?
1 Tflop CMOS PAP in 1996 (x7.8) with 1 GFlop nodes!!!; or 1997 if RAP
10 Tflop CMOS PAP will be reached in 2001 (x78), or 2002 if RAP
How do you get a teraflop earlier? (A sketch of this calculation follows this slide.)
**A $60-$240 million Ultracomputer reduces the time by 1.5-4.5 years.
39
© Gordon Bell
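A minimal sketch evaluating the formula above: with C = 128e6 flops per $30,000 and 60%/year growth, it reproduces the slide’s arrival years (the interpretation of the constants is mine):

    import math

    C = 128e6 / 30_000          # flops per dollar in 1992
    budget = 30e6               # a $30M super

    def arrival_year(target_flops):
        factor = target_flops / (C * budget)            # growth still needed
        return 1992 + math.log(factor) / math.log(1.6)  # at 60%/year

    print(arrival_year(1e12))   # ≈ 1996.4 -> 1 Tflop PAP "in 1996" (x7.8)
    print(arrival_year(2e12))   # ≈ 1997.9 -> 1 Tflop RAP, since RAP = PAP/2
    print(arrival_year(1e13))   # ≈ 2001.3 -> 10 Tflop PAP "in 2001" (x78)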
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives
more than one, 3 year computer generation. Hence no app base can form.
•No inter-generational, MPPs exist with compatible networks & nodes.
•All parts of an architecture must scale from generation to generation!
•An architecture must be designed for at least three, 3 year generations!
High price to support a DARPA U. to learn computer design -- the market is only $200 million and R&D is billions -- competition works far better
The inevitable movement of standard networks and nodes need not (and cannot) be accelerated; these best evolve by a normal market mechanism driven by users
Dual use of Networks & Nodes is the path to widescale parallelism, not weird
computers
Networking is free via ATM
Nodes are free via in situ workstations
Apps follow pervasive computing environments
Applicability was small and getting smaller very fast with many experienced
computer companies entering the market with fine products e.g. Convex/HP,
Cray, DEC, IBM, SGI & SUN that are leveraging their R&D, apps, apps, & apps
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from use, the weaker it becomes.
MPP won, mainstream vendors have adopted multiple CMOS. Stop funding!
Environments & apps are needed, but are unlikely because the market is small
40
© Gordon Bell
Recommendations to HPCS
Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes
Dual use, not duel use, of products and technology -- or the principle of "elegance": one part serves more than one function
network companies supply networks,
node suppliers use ordinary workstations/servers with existing apps will
leverage $30 billion x 10**6 R&D
Fund high speed, low latency, networks for a ubiquitous service as the base of all
forms of interconnections from WANs to supercomputers (in addition, some
special networks will exist for small grain probs)
Observe heuristics in future federal program funding scenarios
... eliminate
direct or indirect product development and mono-use computers
Fund Challenges who in turn fund purchase, not product development
Funding or purchase of apps porting must be driven by Challenges, but must build on binary compatible workstation/server apps to leverage nodes, be cross-platform based to benefit multiple vendors & have cross-platform use
Review effectiveness of State Computers e.g., need, economics, efficacy
Each committee member might visit 2-5 sites using a >>// computer
Review // program environments & the efficacy to produce & support apps
Eliminate all forms of State Computers & recommend a balanced HPCS program:
nodes & networks; based on industrial infrastructure
stop funding the development of mono computers, including the 10Tflop
it must be acceptable & encouraged to buy any computer for any contract
© Gordon Bell
41
Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility....
Focus on Gigabit NREN with low overhead connections that
will enable multicomputers as a by-product
Provide many small, scalable computers vs large, centralized
Encourage (revert to) & support not so grand challenges
Grand Challenges (GCs) need explicit goals & plans -- disciplines fund & manage (demand side)... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams
(efforts that are rat holes that you can’t get out of)
Drop the funding & directed purchase of state computers
Revert to university research -> company & product
development
Review the HPCC & GCs program's output ...
*High Performance Cash Conscriptor; Big Spenders
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with
problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?
43
© Gordon Bell
Disclaimer
This talk may appear inflammatory ...
i.e. the speaker may have appeared
"to flame".
It is not the speaker's intent to make
ad hominem attacks on people,
organizations, countries, or
computers ... it just may appear
that way.
44
© Gordon Bell
Backups
45
© Gordon Bell
Funding Heuristics
(50 computers & 30 years of hindsight)
1. Demand side works i.e., we need this product/technology for x;
Supply side doesn't work! "Field of Dreams": build it and they will come.
2. Direct funding of university research resulting in technology and product prototypes that are carried over to start up a company is the most effective
-- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett > Stanford to fund various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs... rare, an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to
influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers, "Fernbach policy";
unclear if labs are effective product or apps or process developers
b. Universities were effective at influencing computing in timesharing,
graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence they are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!
© Gordon Bell
46
Funding Heuristics-2
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g. by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g. supercomputers).
A significantly smaller universe of computing environments is needed. Cray & IBM are given; SGI is probably the most profitable technical vendor; HP/Convex are likely to be a contender, & others (e.g., DEC) are trying. No state co. (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.
6. "University-Company collaboration is a new area of government R&D. So far
it hasn't worked nor is it likely to, unless the company invests. Appears to be
a way to help company fund marginal people and projects.
7. CRADAs, or co-operative research and development agreements, are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or the porting of apps to one platform (e.g., EMI analysis), is a way to keep marginal computers afloat.
If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs
from those who have not used or built a successful computer.
47
© Gordon Bell
Scalability: The Platform of HPCS
& why continued funding is unnecessary
Mono use aka MPPs have been, are, and will be doomed
The law of scalability
Four scalabilities: machine, problem x machine,
generation (t), & now spatial
How do flops, memory size, efficiency & time vary with
problem size? Does insight increase with problem size?
What's the nature of problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines? e.g. workstations,
workstation clusters, supers, scalable multis, attached P?
48
© Gordon Bell
Special, mono-use MPPs are doomed...
no matter how much fedspend!
Special because it has non-standard nodes & networks -- with no apps
Having not evolved to become mainline -- events have over-taken them.
It's special purpose if it's only in Dongarra's Table 3.
Flop rate, execution time, and memory size vs problem size shows limited
applicability to very large scale problems that must be scaled to cover the
inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility
because of the apps and generality
-- running more, smaller sized problems with a plan produces more insight
The problem domain is limited & now they have to compete with:
•supers -- do scalars, fine grain, and work, and have apps
•workstations -- do very long grain, are in situ and have apps
•workstation clusters -- have identical characteristics and have apps
•low priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
•scalable multiprocessors -- formed from multis, are in the design stage
Mono useful (>>//) -- hence, are illegal because they are not dual use
Duel use -- only useful to keep a high budget intact e.g., 10 TF
49
© Gordon Bell
The Law of Massive Parallelism is
based on application scale
There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other problem.
A ... any parallel problem can be scaled to run on an arbitrary
network of computers, given enough memory and time
Challenge to theoreticians: How well will an algorithm run?
Challenge for software: Can package be scalable & portable?
Challenge to users: Do larger scale, faster, longer run times,
increase problem insight and not just flops?
Challenge to HPCC: Is the cost justified? if so let users do it!
50
© Gordon Bell
Scalabilities
Size scalable computers are designed from a few
components, with no bottleneck component.
Generation scalable computers can be implemented with the next generation technology with no rewrite/recompile.
Problem x machine scalability - ability of a problem,
algorithm, or program to exist at a range of sizes so that
it can be run efficiently on a given, scalable computer.
Although large scale problems allow high flops, large probs
running longer may not produce more insight.
Spatial scalability -- ability of a computer to be scaled over a
large physical space to use in situ resources.
51
© Gordon Bell
Linpack Rate in Gflops vs. Matrix Order
[Chart: Linpack Gflops rate (1-100, log scale) vs. matrix order (100-100,000) for a 1K-processor CM5 and a 4-processor SX-3.]
52
© Gordon Bell
Linpack Solution Time vs. Matrix Order
[Chart: solution time (1-1000, log scale) vs. matrix order (100-100,000) for the 1K-processor CM5 and the 4-processor SX-3.]
53
© Gordon Bell
GB's Estimate of Parallelism in Engineering & Scientific Applications
[Chart: log(# of apps) vs. granularity & degree of coupling (comp./comm.), spanning scalar, vector, mP (<8), >>//, and embarrassingly/perfectly parallel classes with labeled shares of 60%, 15%, 15%, 5%, and 5%; regions marked for WSs, supers, scalable multiprocessors, and massive mCs & WSs, and for dusty decks for supers vs. new or scaled-up apps.]
54
© Gordon Bell
MPPs are only for unique, very large scale, data parallel apps
[Chart: price ($0.01M-$100M, log scale) vs. application characterization (scalar | vector | vector mP | data // | emb. // | gp work | viz | apps), placing mono-use >>// machines only at the high-price, data-parallel end, with supers (s), multiprocessors (mP), and workstations (WS) covering the rest of the domains.]
55
© Gordon Bell
Applicability of various technical computer alternatives
[Table: applicability ratings (1 = best, 2, 3, na = not applicable) of PC/WS, multi servers, supercomputers & mainframes, >>// machines, and WS clusters across the domains scalar, vector, vector mP, data //, ep & inf. //, general-purpose workload, visualization, and apps.]
*Current micros are weak, but improving rapidly, such that subsequent >>//s that use them will have no advantage for node vectorization
56
© Gordon Bell
Performance using distributed
computers depends on
problem & machine granularity
Berkeley's log(p) model characterizes granularity &
needs to be understood, measured, and used
Three parameters are given in terms of processing ops:
l = latency -- delay time to communicate between apps
o = overhead -- time lost transmitting messages
g = gap -- 1 / message-passing rate (bandwidth) between messages
p = number of processors
(A small efficiency sketch follows this slide.)
57
© Gordon Bell
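One simple, hedged way to use these parameters (my illustration, not Berkeley's full model): if each grain of useful work is followed by one message whose cost in processor ops is roughly o + l, then efficiency is about grain / (grain + o + l). The numbers below are hypothetical:

    def efficiency(grain_ops, latency_ops, overhead_ops):
        # fraction of time doing useful work when every grain pays one message cost
        return grain_ops / (grain_ops + overhead_ops + latency_ops)

    # e.g., a 100 Mops processor on a ~1 ms LAN sees ~100,000 ops of latency per message,
    # so grains must be very coarse before efficiency approaches 1:
    for grain in (1_000, 100_000, 10_000_000):
        print(grain, round(efficiency(grain, latency_ops=100_000, overhead_ops=1_000), 3))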
Granularity Nomograph
[Nomograph relating processor speed (10 M to 1 G ops/s; 1993 and 1995 micros, C90, Ultra), grain communication latency & synchronization overhead (100 ns to 1 s; MPPs around 1-10 µs, LANs around 1 ms, WANs around 100 ms), and grain length in ops (100 = fine, 1000 = medium, 10K-100K = coarse, 1M-10M = very coarse).]
58
© Gordon Bell
Granularity Nomograph (machines placed)
[Same nomograph with machines located on it: 1993/1995 micros, the 1993 super, and Ultra for processor speed; Cray T3D, VPP 500, C90 (supers memory), and VP for latency & synchronization overhead; LAN (~1 ms) and WAN (~100 ms) latencies for comparison; grain lengths from fine (100 ops) to very coarse (10 M ops).]
59
© Gordon Bell
Economics of Packaged Software
Platform            Cost       Leverage   # copies
MPP                 >100K      1          1-10 copies
Minis, mainframe*   10-100K    10-100     1000s of copies
Workstation         1-100K     1-10K      1-100K copies
PC                  25-500     50K-1M     1-10M copies
*also, evolving high performance multiprocessor servers
(An amortization sketch follows this slide.)
60
© Gordon Bell
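A toy sketch of the economics the table implies: a fixed development cost amortized over the copies sold sets the floor price per copy. The $10M development cost is an illustrative assumption, not a figure from the slide:

    dev_cost = 10_000_000    # assumed cost to build and maintain a package
    for platform, copies in [("MPP", 10), ("Mini/mainframe", 1_000),
                             ("Workstation", 100_000), ("PC", 10_000_000)]:
        print(f"{platform:<15} ${dev_cost / copies:>12,.0f} per copy")
    # MPP software must recover its cost from a handful of copies, hence >$100K prices;
    # PC software spreads the same cost over millions of copies.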
Chuck Seitz comments
on multicomputers
“I believe that the commercial, medium grained
multicomputers aimed at ultra-supercomputer
performance have adopted a relatively unprofitable scaling
track, and are doomed to extinction. ... they may as
Gordon Bell believes be displaced over the next several
years by shared memory multiprocessors. ... For loosely
coupled computations at which they excel, ultra-super
multicomputers will, in any case, be more economically
implemented as networks of high-performance
workstations connected by high-bandwidth, local area
networks...”
61
© Gordon Bell
Convergence to a single architecture
with a single address space
that uses a distributed, shared memory
limited (<20) scalability multiprocessors
>> scalable multiprocessors
workstations with 1-4 processors
>> workstation clusters & scalable multiprocessors
workstation clusters
>> scalable multiprocessors
State Computers built as message passing multicomputers
>> scalable multiprocessors
62
© Gordon Bell
Evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers
[Diagram: convergence to one architecture. Limited-scalability (<20), uniform-memory-access multiprocessors -- bus-based multis (minis, workstations: DEC, Encore, Sequent, Stratus, SGI, SUN, etc.) and ring-based multis, plus mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC) -- remain the main line and evolve, via caches for locality, into scalable non-uniform-memory-access smP/DSM around 1995 (experimental precursors: Cm* '75, Butterfly '85, Cedar '88; DASH, Convex, Cray T3D, SCI; KSR Allcache; next-generation research e.g. DDM, DASH+). Scalable multicomputers (smC) with non-uniform memory -- hypercube, Transputer, and grid machines (Cosmic Cube, iPSC 1, NCUBE, Meiko, TMC, Fujitsu, Intel, 1985-1994; fine-grain Mosaic-C, J-machine) -- converge toward DSM/smP in the next generation. Networked, shared-nothing workstations (Apollo, SUN, HP, etc.) become very-coarse-grain smC clusters via high-bandwidth switches and comm. protocols (special switches 1994, ATM 1995). Note: only two structures remain -- 1. shared memory mP with uniform & non-uniform memory access, and 2. networked workstations, shared nothing.]
63
© Gordon Bell
Re. Centers Funding August 1987 gbell memo to E. Bloch
A fundamentally broken system!
1. Status quo. NSF funds them, as we do now, in competition with computer science… use is completely decoupled from the supply... If I make the decision to trade off, it will not favor the centers...
2. Central facility. NSF funds ASC as an NSF central facility. This allows the Director, who has the purview for
all facilities and research to make the trade-offs across the foundation.
3. NSF Directorate use taxation. NSF funds it via some combination of the directorates on a taxed basis. The
overall budget is set by AD’s. DASC would present the options, and administer the program.
4. Directorate-based centers. The centers (all or in part) are “given” to the research directorates. NCAR
provides an excellent model. Engineering might also operate a facility. I see great economy, increased
quality, and effectiveness coming through specialization of programs, databases, and support.
5. Co-pay. In order to differentially charge for all the upgrades …a tax would be levied on various allocation
awards. Such a tax would be nominal (e.g. 5%) in order to deal with the infinite appetite for new hardware
and software. This would allow other agencies who use the computer to also help pay.
6. Manufacturer support. Somehow, I don’t see this changing for a long time. A change would require knowing something about the power of the machines so that manufacturers could compete to provide lower costs. BTW: Erich Bloch and I visited Cray Research and succeeded in getting their assistance.
7. Make the centers larger to share support costs. Manufacturers or service providers could contract with the
centers to “run” facilities. This would reduce our costs somewhat on a per machine basis.
8. Fewer physical centers. While we could keep the number of centers constant, greater economy of scale
would be created by locating machines in a central facility… LANL and LLNL each run 8 Crays to share
operators, mass storage and other hardware and software support.
With decent networks, multiple centers are even less important.
9. Simply have fewer centers. but with increasing power. This is the sole argument for centers!
10. Maintain centers at their current or constant core levels for some specified period. Each center would be
totally responsible for upgrades, etc. and their own ultimate fate.
11. Free market mechanism. Provide grant money for users to buy time. This might cost more because I’m sure we get free rides, e.g. Berkeley, Michigan, Texas, and the increasing number of institutions providing service.
69
© Gordon Bell
GB Interview as CISE AD July 1987
• We, together with our program advisory committees have
described the need for basic work in parallel processing to
exploit both the research challenge and the plethora of
parallel-processing machines that are available and
emerging. We believe NSF’s role is to sponsor a wide range
of software research about these machines.
• This research includes basic computational models more
suited to parallelism, new algorithms, standardized
primitives (a small number) for addition to the standard
programming languages, new languages based on parallel-computation primitives rather than extensions to sequential
languages, and new applications that exploit parallelism.
• Three approaches to parallelism are clearly here now:
70
© Gordon Bell
Bell CISE Interview July 1987
• First, vector processing has become a primitive in supercomputers and mini-supercomputers. In becoming so, it has created a revolution in scientific
applications. Unfortunately, computer science and engineering
departments are not part of the revolution in scientific computation that is
occurring as a result of the availability of vectors. New texts and curricula
are needed.
• Second, message-passing models of computation can be used now on
workstation clusters, on the various multicomputers such as the Hypercube
and VAX clusters, and on the shared-memory multiprocessors (from
supercomputers to multiple microprocessors). The Unix pipes mechanism
may be acceptable as a programming model, but it has to be an awful lot
faster for use in problems where medium-grain parallelism occurs. A
remote procedure-call mechanism may be required for control.
• Third, microtasking of a single process using shared-memory
multiprocessors must also be used independently. On shared- memory
multiprocessors, both mechanisms would be provided and used in forms
appropriate to the algorithms and applications. Of course, other forms of
parallelism will be used because it is relatively easy to build large, useful SIMD [single-instruction, multiple-data] machines.
71
© Gordon Bell
Q: What performance do you expect from parallelism in the next
decade?
A: Our goal is obtaining a factor of 100 in the performance of
computing, not counting vectors, within the decade and a factor of
10 within five years. I think 10 will be easy because it is inherently
there in most applications right now. The hardware will clearly be
there if the software can support it or the users can use it.
Many researchers think this goal is aiming too low. They think it should be a factor of 1 million within 15 years. However, I am skeptical: anything more than our goal will be too difficult in this time period. Still, a factor of 1 million may be possible through SIMD.
The reasoning behind the NSF goals is that we have parallel
machines now and on the near horizon that can actually achieve
these levels of performance. Virtually all new computer systems
support parallelism in some form (such as vector processing or
clusters of computers). However, this quiet revolution demands a
major update of computer science, from textbooks and curriculum
to applications research.
72
© Gordon Bell
Bell Prize
Initiated…
73
© Gordon Bell
NRC/OSTP Report 18 August 1984
• Summary
– Pare the poor projects; fund proven researchers
– Understand the range of technologies required and
especially the Japanese position; also vector processors
• Heuristics for the program
– Apply current multiprocessors and multicomputers now
– Fund software and applications starting now for //ism…
74
© Gordon Bell
…the report greatly underestimates the position and underlying strength of the Japanese in
regard to Supercomputers. The report fails to make a substantive case about the U. S. position,
based on actual data in all the technologies from chips (where the Japanese clearly lead) to
software engineering.
The numbers used for present and projected performance appear to be wildly optimistic with
no real underlying experimental basis. A near term future based on parallelism other than
evolving pipelining is probably not realistic.
The report continues the tradition of recommending that funding science is good, and in
addition everything be funded. The conclusions to continue to invest in small scale
fundamental research without a prioritization across the levels of integration or kinds of
projects would seem to be of little value to decision makers. For example, the specific
knowledge that we badly need in order to exploit parallelism is not addressed. Nor is the issue
of how we go about getting this knowledge.
My own belief is that small scale research around a single researcher is the only style of work
we understand or are effective with. This may not get us very far in supercomputers.
Infrastructure is more important than wild, new computer structures if the "one professor"
research model is to be useful in the supercomputer effort. While this is useful to generate
small startup companies, it also generates basic ideas for improving the Japanese state of the
art. This occurs because the Japanese excel in the transfer of knowledge from world research
laboratories into their products and because the U.S. has a declining technological base of
product and process (manufacturing) engineering.
The problem of organizing experimental research in the many projects requiring a small
laboratory (Cray-style lab of 40 or so) to actually build supercomputer prototypes isn't
addressed; these larger projects have been uniformly disastrous and the transfer to
non-Japanese products negligible.
Surprisingly, no one asked Seymour Cray whether there was anything he wanted in order to stay ahead…
75
© Gordon Bell
1. Narrow the choice of architectures that are to be pursued. There are simply too many poor ones, and too few that can be adequately staffed.
2. Fund only competent, full-time efforts where people have proven ability to build … systems. These projects should be carried out by full-time people, not researchers servicing multiple contracts and doing consulting. New entrants must demonstrate competence by actually building something!
3. Have competitive proposals and projects. If something is really an important area to fund…, then have two projects with …information exchange.
4. Fund balanced hardware/software/systems applications. Doing architectures without user involvement (or understanding) is sure to produce useless toys.
5. Recognize the various types of projects and what the various organizational structures are likely to be able to produce.
6. A strong infrastructure of chips to systems to support individual researchers will continue to produce interesting results. These projects are not more than a dozen people because professors don't work for or with other professors well.
7. There are many existing multicomputers and multiprocessors that could be delivered to universities to understand parallelism before we go off to build…
8. It is essential to get the Cray X-MP alongside the Fujitsu machine to understand …parallelism associated with multiple processors, and pipelines.
9. Build "technology transfer mechanisms" in up front. Transfer doesn't happen automatically. Monitor the progress associated with "the transfer".
76
© Gordon Bell
Residue
77
© Gordon Bell
NSF TeraGrid c2003
78
© Gordon Bell
Notes with Jim
• Applications software and its development is still the no. 1 problem
• 64 bit addressing will change everything
• Many machines are used for their large memory
• Centers will always use all available time: cycles bottom feeders.
• Allocation is still central and a bad idea.
• Not big enough centers
• Can’t really architect or recommend a plan unless you have some notion of the needs and costs!
• No handle on communication costs, especially for the last mile where it’s 50-150 Mbps, not fiber (10 Gbps). Two orders of magnitude low…
• BeoW happened as an embarrassment to funders, not because of them
• Walk through 7 > 2, now 3 centers.
• A center is a $50M/year expense when you upgrade!
• NSF: The tools development is questionable. Part of business. Feel very, very uncomfortable developing tools
• Centers should be functionally specialized around communities and databases
• Planning, budgets and allocation must be with the disciplines. People vs. Machines.
• Teragrid problem: having not solved the clusters problem, move to a larger problem
• File oriented vs. database hpss.
79
© Gordon Bell