
Reinvention of Computing for Many-Core Parallelism
Requires Addressing Programmer’s Productivity
Uzi Vishkin
Common wisdom [cf. tribal lore collected by DARPA HPCS, 2005]:
Programming for parallelism is easy
It is the programming for performance that makes it hard
A less fatalistic position:
Programming for parallelism is easy
But, the difficulty of programming for performance depends on the system
Productivity in Parallel Computing
The large parallel machines story
Funding of productivity: $650M DARPA HPCS (High Productivity Computing Systems), ~2002
Met # Gflops goals: up by 1000X since mid-90’s; Exascale talk & plans
Met power goals. Also: groomed eloquent spokespeople
Progress on productivity: No agreed benchmarks. No spokesperson.
Elusive! In fact, not much has changed since: “as intimidating and time
consuming as programming in assembly language”--NSF Blue Ribbon
Committee, 2003 or even “parallel software crisis”, CACM 1991.
Common sense engineering: Untreated bottleneck → diminished
returns on improvements → bottleneck becomes more critical
Next 10 years: New specific programs on flops and power. What about
productivity?!
Reality: economic island. Cleared by marketing: DOE applications
Enter: mainstream many-cores
Every CS major should be able to program many-cores
Coherence Issue
When you come to a fork in the road, take it!-Yogi Berra
Camp 1 Many US best minds opt for occupations that do not involve programming
• NSF tries to lure them to CS in HS by: (1) presenting the steady march and broad
reach of computing across the sciences, industries, culture and society, correcting
the current narrow focus on programming in introductory courses [New Programs Aim
to Lure Young Into Digital Jobs, NYTimes, 12/09]; (2) productivity; (3) computational thinking
Camp 2 Power/performance → Reinvent mainstream computing for parallelism
• Vendors try to build many-cores that require decomposition-first programming.
Railroading to productivity “disaster area”. Hacking. Insufficient support from
parallel algorithms design & analysis. Short on outreach/productivity/abstraction
Unintended outcome of “taking the fork” (prod vs. power/perf)
• Camp cheerleaders: core CS (alg design & analysis style) is radical. Peer review
favors both sides over center. Centrists as extremists is an oxymoron!
• Building wrong expectations among prospective CS majors. Disappointment will
lead to “Get me out of this major”
• Pool of CS majors to be engaged in decomposition-first too limited (after
subtracting the lured-to-breadth-over-programming and the core)
Consequences of “taking the fork” surrealism
• Eventual casualties: # students, credibility & productivity
Research/comparison of several holistic parallel platforms could: (i) prevent much of
the damage, (ii) build up the real diversity needed for natural selection, and (iii)
advise the NSF on programs that otherwise could cancel one another
Lessons from Invention of Computing
“It should be noted that in comparing codes four viewpoints must be kept in
mind, all of them of comparable importance:
• Simplicity and reliability of the engineering solutions required by the code;
• Simplicity, compactness and completeness of the code;
• Ease and speed of the human procedure of translating mathematically
conceived methods into the code [“COMPUTATIONAL THINKING”], and also
of finding and correcting errors in coding or of applying to it changes that
have been decided upon at a later stage;
• Efficiency of the code in operating the machine near its full intrinsic speed.”
-H. Goldstine, J. von Neumann. Planning and coding problems for an electronic computing instrument, 1947
Take home
- Comparing codes is a pivotal and broad issue
- Concern for Productivity is as old as computing (development-time)
- Human process: intellectual/algorithm/planning plus skill/coding
- Contrast with: Tendency to understand HW upgrade from application code
(even if machine not yet built, A. Ghuloum, Intel, CACM 9/09) –
unreasonable expectation from application code developers
How was the “human procedure” addressed?
Answer: Basically, By Abstraction and Induction
1. Since general-purpose computing is about a platform for your
future (whatever) program, as opposed to a specific application, a
general method for the human procedure was key
2. GvN47 based coding on mathematical induction (known from
math proofs and as an axiom of the natural numbers)
3. It worked for establishing serial computing. This method led
to simplicity, compactness and completeness of the resulting
code. References:
- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms.
Chapter 1: Basic concepts. 1.1 Algorithms. 1.2 Math Prelims. 1.2.1 Math Induction
Algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.
Gold standards
Definiteness: Induction
Effectiveness: “Uniform cost criterion" [AHU74] abstraction
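As an illustration of the point above (a hypothetical sketch, not from the slides): in serial code, a loop invariant plays the role of the inductive hypothesis, which is exactly what gives code the definiteness that GvN47 and Knuth describe.

```python
def running_max(a):
    """Serial code justified by mathematical induction.

    Invariant (inductive hypothesis): before processing a[i],
    m holds the maximum of a[0..i-1].
    Base case: m = a[0].
    Inductive step: m = max(m, a[i]) re-establishes the invariant,
    so on termination m is the maximum of the whole list.
    """
    m = a[0]
    for x in a[1:]:
        m = max(m, x)
    return m

print(running_max([3, 1, 4, 1, 5, 9, 2, 6]))  # 9
```

The induction is what lets a finite program text specify unboundedly long executions: the same step, proved once, is trusted at every iteration.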
“Killer app” for general-purpose many cores:
Let the app-dreamers do their magic
• Oxymoron?.. general-purpose: no one application in particular
Not really: If possible, a killer application would be helpful
• However, wrong as condition for progress
General-purpose computing is an infrastructure for the IT sector and
the economy
• The general-purpose computing infrastructure has been realized
by the software spiral (the cyclic process of hardware
improvements leading to software improvements that lead back to
hardware improvements and so on; Andy Grove, Intel)
• Instituting a parallel software spiral is a killer application for
many-cores: as in the past app-dreamers will invent uses
Not surprisingly, the killer application is also an infrastructure
• Government has a role in building infrastructure
Instituting a parallel software spiral merits government funding
However, there is insufficient empowerment for creating and developing
alternative platforms to the point of establishing their merit.
Serial Abstraction & A Parallel Counterpart Example
• Rudimentary abstraction that made serial computing simple: any single instruction
available for execution in a serial program executes immediately
[Figure: Serial execution, based on the serial abstraction, vs. parallel execution, based on the parallel abstraction (“what could I do in parallel at each step assuming unlimited hardware”). Both plot #ops against time. Serial: Time = Work. Parallel: Work = total #ops, Time << Work.]
Abstracts away different execution time for different operations (e.g., memory hierarchy).
Used by programmers to conceptualize serial computing and supported by hardware
and compilers. The program provides the instruction to be executed next (inductively)
• Rudimentary abstraction for making parallel computing simple: that indefinitely many
instructions, which are available for concurrent execution, execute immediately, dubbed
Immediate Concurrent Execution (ICE)
Step-by-step (inductive) explication of the instructions available next for concurrent
execution. # processors not even mentioned. Falls back on the serial abstraction if 1
instruction/step.
CACM’10: Using simple abstraction to guide the
reinvention of computing for parallelism
[Overall: old Work-Depth description. Only “minimalist
abstraction”: ICE builds only on induction, itself a
rudimentary concept]
• [SV82] conjectured that the rest (the full PRAM algorithm) is
just a matter of skill
• Lots of evidence that “work-depth” works. Used as
framework in PRAM algorithms texts: JaJa-92, KKT-01
• ICE in line with PRAM: Only really successful parallel
algorithmic theory. Latent, though not widespread,
knowledgebase
• Widely agreed: work&depth are necessary. Jury is out
on: what else. Our position: as little as possible.
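The work-and-depth accounting above can be sketched concretely (a hypothetical Python simulation under the ICE assumption that all operations available for concurrent execution execute immediately; `wd_sum` and its return convention are my own names, not from the slides):

```python
def wd_sum(a):
    """Work-depth (ICE-style) summation via a balanced tree.

    At each step, every pairwise addition that is available for
    concurrent execution is assumed to execute immediately.
    Returns (total, work, depth), where work = total #ops and
    depth = #parallel steps; note Time (= depth) << Work.
    """
    vals = list(a)
    work = depth = 0
    while len(vals) > 1:
        # one parallel step: combine all adjacent pairs at once
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        work += len(vals) // 2        # ops performed in this step
        if len(vals) % 2:             # odd element carries over
            nxt.append(vals[-1])
        vals = nxt
        depth += 1                    # one unit of time per step
    return vals[0], work, depth

print(wd_sum(range(16)))  # (120, 15, 4): n-1 work, log2(n) depth
```

The number of processors never appears; as the slide says, the step-by-step explication only states what is available next, and the method falls back on the serial abstraction when one instruction is available per step.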
Workflow from parallel algorithms to programming
versus trial-and-error
Option 1: domain decomposition, or task decomposition → program → [insufficient inter-thread bandwidth? rethink algorithm: take better advantage of cache] → compiler → hardware
Option 2: PAT, parallel algorithmic thinking (ICE/WD/PRAM) → program → compiler → hardware
Is Option 1 good enough for the parallel programmer’s model?
Options 1B and 2 start with a PRAM algorithm, but not option 1A.
Options 1A and 2 represent workflow, but not option 1B.
Option 2 workflow: PAT → prove correctness → program (still correct) → tune (still correct) → hardware
Not possible in the 1990s. Possible now: XMT@UMD. Why settle for less?
Mark Twain on the PRAM
We should be careful to get out of an experience only the wisdom that is in it—
and stop there; lest we be like the cat that sits down on a hot stove-lid. She
will never sit down on a hot stove-lid again— and that is well; but also she
will never sit down on a cold one anymore— Mark Twain
PRAM algorithms did not become standard CS knowledge in 1988-90 since
“hot stove-lid”: No 1990s implementable computer architecture allowed
programmers to look at a computer as a PRAM
The XMT project @UMD changed that
PS NVidia was happy to report success with 2 PRAM algorithms in IPDPS09. Great
to see that from a major vendor.
[These 2 algorithms are decomposition-based, unlike most PRAM algorithms.
Freshmen programmed the same 2 algorithms on our XMT machine.]
The Parallel Programmer’s Productivity Landscape
Postulation: a continental divide
Ocean ← decomposition-first programming | work-depth programming → Great Lakes
How different can productivity of many-core architectures be? Answer: very!
Metaphor: Dropping rain a short distance apart. Very different outcomes.
Think of programmer’s productivity as cost of producing usable water.
The decomposition-first programming side requires domain decomposition or
task decomposition, which have not worked in spite of big investment. (It looks
greener, since invested; but what if it drains to the ocean while the arid side drains to Sweetwater?)
The work-depth initial abstraction is decomposition-free. (Arid, under-invested.)
Requires a leap of faith for investment.
Validation of Ease of Programming To Date
1. Comparison with MPI by DARPA-HPCS SW Eng leaders [HochsteinBasiliVGilbert]
2. Teachability demonstrated so far [TorbertVTzurEllison, SIGCSE’10 to appear]:
- To a freshman class with 11 non-CS students. Some programming assignments:
median finding, merge-sort, integer-sort & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded simulator, assignments, class notes, from
XMT page. Self-taught. Recommends: Teach XMT first. Easiest to set up
(simulator), program, analyze: ability to anticipate performance (as in
serial). Can do not just for embarrassingly parallel. Teaches also OpenMP,
MPI, CUDA. Lookup keynote at CS4HS’09@CMU + interview with teacher.
- High school & Middle School (some 10 year olds) students from
underrepresented groups by HS Math teacher.
Teachability: necessary (but not sufficient) condition for ease-of-programming.
Itself necessary (but not sufficient) condition for productivity.
Hence, teachability as good a benchmark as any out there for productivity
Conclusion
1. Want future mainstream programmers to embrace general-purpose parallelism
(every CS major; for common SW architectures). Yet, in the past: insufficient
evidence on productivity.
2. History of repeated surprise: parallel machines repel programmers.
Research Drivers
Empower select holistic (HW+SW) parallel platforms for merit-based
comparison. Imagine a new world with the given platform. Consider all aspects:
e.g., is it sufficient for reinstating the SW spiral? Is the barrier-to-entry for
creative applications low enough? How will the CS curriculum look? Who will
be attracted to study CS?
Then, gather evidence:
Methodically compare productivity (development-time, run-time) of platforms.
Ownership stake role for Indian partner (Prof. PJ Narayan, IIIT,
Hyderabad): India – largest producer of SW. New platform requires
sufficient Indian interest. Lead benchmarking/comparison for
productivity, etc.
For this session: Coming from algorithms, computer vision and computational biology,
compare select platforms for performance, productivity (development-time and
run-time), and overall for reinstating the SW spiral. Benchmark algorithms and
applications based on their inherent parallelism for future machine platforms, as
opposed to using existing code written for yesterday’s (serial or parallel)
machines. Issue: How to benchmark for productivity?
Not just a theory. XMT: prototyped HW&SW
64-core, 75MHz FPGA prototype
[SPAA’07, Computing Frontiers’08]
Original explicit multi-threaded (XMT)
architecture [SPAA98]
Interconnection Network for 128-core. 9mmX5mm, IBM90nm
process. 400 MHz prototype [HotInterconnects’07]
Same design as 64-core FPGA. 10mmX10mm,
IBM90nm process. 150 MHz prototype
The design scales to 1000+ cores on-chip
There has never been a successful general-purpose parallel computer (easy to
program, good speedups, up- and down-scalable). IF you could
program it → great speedups.
Motivation: Fix the IF
Programmer’s Model: Engineering Workflow
• Arbitrary CRCW Work-depth algorithm. Reason about correctness &
complexity in synchronous model
• SPMD reduced synchrony
– Threads advance at own speed, not lockstep
– Main construct: spawn-join block. Note: can start any number of
processes at once. Can express locality (“decomposition-second”)
– Prefix-sum (ps). Independence of order semantics (IOS).
– Establish correctness & complexity by relating to WD analyses.
– Circumvents “The problem with threads”, e.g., [Lee].
[Diagram: nested spawn … join, spawn … join blocks]
• Tune (compiler or expert programmer): (i) length of sequence of round
trips to memory, (ii) QRQW, (iii) WD. [VCL08]
• Trial & error contrast: similar start, then while (insufficient inter-thread
bandwidth) { rethink algorithm to take better advantage of cache }
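The spawn-join plus prefix-sum pattern might be sketched as follows. This is a plain-Python simulation, not XMTC (which is a C extension of the real XMT programming model); `ps` here is a stand-in for XMT's atomic prefix-sum instruction, and the virtual threads are run serially, which independence of order semantics (IOS) permits.

```python
def spawn_join_compact(a):
    """Simulated XMT spawn-join: compact the nonzeros of a into d.

    ps(1) models the prefix-sum (fetch-and-add) instruction: it
    atomically returns the old counter value and increments it.
    IOS means any interleaving of the virtual threads yields a
    correct compaction (possibly a different permutation); here
    the 'spawned' threads are simulated in index order.
    """
    base = [0]                 # shared ps counter
    d = [None] * len(a)
    def ps(inc):               # atomic fetch-and-add sketch
        old = base[0]
        base[0] += inc
        return old
    # spawn: one virtual thread per element i
    for i in range(len(a)):
        if a[i] != 0:
            d[ps(1)] = a[i]    # each thread claims a unique slot
    # join: all virtual threads have terminated
    return d[:base[0]]

print(spawn_join_compact([0, 5, 0, 3, 7, 0, 2]))  # [5, 3, 7, 2]
```

Correctness can be related back to the work-depth analysis exactly as the slide describes: each virtual thread does O(1) work, and the ps counter serializes only the slot assignment, not the threads themselves.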
Performance
• Simulation of 1024 processors: 100X on a standard
benchmark suite for VHDL gate-level simulation [GV06]
• [SPAA’09]: ~10X relative to Intel Core 2 Duo
with 64-processor XMT; same silicon area as 1
commodity processor (core)
• Promise of 100X with 1024 processors also for
irregular, fine-grained parallelism with up- and
down-scalability.
Some Credits
Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz
Saybasili, Alex Tzannes. Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen
• Industry design experts (pro-bono)
• Rajeev Barua, Compiler. Co-advisor of 2 CS grad students. 2008 NSF grant
• Gang Qu, VLSI and Power. Co-advisor
• Steve Nowick, Columbia U., Asynch computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, U. Colorado, K12 Education. Co-advisor. 2008 NSF seed funding
K12: Montgomery Blair Magnet HS, MD, Thomas Jefferson HS, VA, Baltimore (inner city) Ingenuity Project
Middle School 2009 Summer Camp, Montgomery County Public Schools
• Marc Olano, UMBC, Computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, Power. Co-advisor.
• Bernie Brooks, NIH. Co-advisor.
• Marty Peckerar, Microelectronics.
• Igor Smolyaninov, Electro-optics.
• Funding: NSF, NSA (2008 deployed XMT computer), NIH.
• 6 issued patents; more patent applications.
• Informal industry partner: Intel.
Reinvention of Computing for Parallelism. Selected for Maryland Research Center
of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC,
UMBI, UMSOM. Mostly applications.