Teaching Parallelism Panel, SPAA'11
Uzi Vishkin, University of Maryland
Dream opportunity
Limited interest in parallel computing evolved into a quest for general-purpose parallel computing in mainstream computers. Alas:
- Only heroic programmers can exploit the vast parallelism in today’s
mainstream computers.
- Rejection of their parallel programming by most programmers: all but certain.
- Widespread working assumption: programming models for larger-scale & mainstream systems are similar. Not so in the serial days.
- Parallel computing plagued with programming difficulties. ['Build-first, figure-out-how-to-program-later' → fitting parallel languages to these arbitrary architectures → standardization of language fits → doomed later parallel architectures]
- Working assumption → import of parallel computing's ills to the mainstream
Shock-and-awe example (inflict the 1st parallel programming trauma ASAP): start a parallel programming course with the tile-based parallel algorithm for matrix multiplication. How many tiles are needed to fit 1000x1000 matrices in the cache of a modern PC? Teaching it later: OK
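For concreteness, a minimal C sketch of the tile-based multiplication in question. N and T below are hypothetical choices, not from the talk: with T = 64, the three tiles touched at any time occupy 3*64*64*8 B ~ 96 KB of doubles, small enough for a typical per-core cache, and each 1000x1000 matrix splits into ceil(1000/64)^2 = 256 tiles.

    #include <string.h>

    #define N 1024  /* hypothetical: 1000x1000 padded up to a multiple of T */
    #define T 64    /* hypothetical tile edge; 3 TxT tiles of doubles ~ 96 KB */

    static double A[N][N], B[N][N], C[N][N];

    /* Tile-based matrix multiplication. Each (ib, jb) pair computes one TxT
       output tile independently of the others, so a parallel version would
       assign output tiles to processors; the per-step working set is three
       tiles, sized to stay cache-resident. */
    void matmul_tiled(void)
    {
        memset(C, 0, sizeof C);
        for (int ib = 0; ib < N; ib += T)
            for (int jb = 0; jb < N; jb += T)
                for (int kb = 0; kb < N; kb += T)
                    for (int i = ib; i < ib + T; i++)
                        for (int k = kb; k < kb + T; k++)
                            for (int j = jb; j < jb + T; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

The point of the shock: getting even this far requires cache-size arithmetic before any parallelism has been discussed.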
Missing Many-Core Understanding
Comparison of many-core platforms for: ease-of-programming & achieving hard speedups.
Summary of my thoughts
1. In class: parallel PRAM algorithmic theory
- 2nd in magnitude only to serial algorithmic theory
- Won the "battle of ideas" in the 1980s. Repeatedly:
- Challenged without success → no real alternative!
Is this another case of 'the older we get, the better we were'?
2. Parallel programming experience for concreteness:
• In homework: extensive programming assignments
[The XMT HW/SW/algorithms solution; programming for locality: a 2nd-order consideration]
• Must be trauma-free, providing hard speedups over the best serial code
3. Tread carefully: consider non-parallel-computing colleague instructors. Limited line of credit. Future change is certain.
Pushing may backfire when cooperation is needed in the future.
Parallel Random-Access Machine/Model
PRAM: n synchronous processors, all having unit-time access to a shared memory.
Reactions
[Important to convey plurality, plus coherent approach(es)]
You've got to be kidding, this is way:
- Too easy
- Too difficult: Why even mention processors? What to do with n processors? How to allocate processors to instructions?
Immediate Concurrent Execution (ICE)
'Work-Depth framework' [SV82], adopted in parallel algorithms texts [J92, KKT01].
ICE is the basis for architecture specs:
Vishkin, Using simple abstraction to reinvent computing for parallelism, CACM 1/2011.
[Similar to the role of the stored program & program counter in architecture specs for serial computing]
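To make the ICE abstraction concrete: each step simply states all operations that may execute concurrently, with no mention of processors or of allocating them, which answers the 'what to do with n processors' worries above. A minimal sketch of vector addition, in work-depth pseudocode and then in XMTC-style C ($ denotes the virtual-thread ID, following the XMT class material; treat the syntax as illustrative):

    /* Work-depth pseudocode: for i, 0 <= i < n pardo: C(i) := A(i) + B(i)
       -- one step, n concurrent operations, no processor allocation.
       XMTC rendering: */
    spawn(0, n - 1) {
        C[$] = A[$] + B[$];
    }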
Algorithms-aware many-core is feasible
PRAM-On-Chip HW prototypes (slide figure: algorithms, programming, and the programmer's workflow built around the XMT hardware):
- 64-core, 75 MHz FPGA prototype of XMT [SPAA98..CF08]
- Toolchain: compiler + simulator [HIPS'11]; rudimentary yet stable compiler
- 128-core interconnection network, IBM 90nm: 9mm x 5mm, 400 MHz [HotI07]
- FPGA design → ASIC, IBM 90nm: 10mm x 10mm, 150 MHz
- Architecture scales to 1000+ cores on-chip
XMT homepage: www.umiacs.umd.edu/users/vishkin/XMT/index.shtml, or search 'XMT'. Considerable material/suggestions for teaching: class notes, toolchain, lecture videos, programming assignments.
Elements in my education platform
• Identify 'thinking in parallel' with the basic abstraction behind the SV82b work-depth framework. Note: the presentation framework in the PRAM texts J92, KKT01.
• Teach as many PRAM algorithms as the timing and developmental stage of the students permit; extensive 'dry' theory homework: required from graduate students, little from high-school students.
• Students self-study programming in XMTC (standard C plus 2 commands, spawn and prefix-sum; see the sketch after this list) and do demanding programming assignments.
• Provide a programmer's workflow that links the simple PRAM abstraction with XMTC (even tuned) programming. The synchronous PRAM provides ease of algorithm design and of reasoning about correctness and complexity. Multi-threaded programming relaxes this synchrony for implementation. Since reasoning directly about the soundness and performance of multi-threaded code is known to be error-prone, the workflow only tasks the programmer with establishing that the code behavior matches the PRAM-like algorithm.
• Unlike the PRAM, XMTC incorporates locality, but as a 2nd-order consideration. Unlike many approaches, XMTC preempts the harm that locality concerns inflict on programmer's productivity.
• If the XMT architecture is presented at all: only at the end of the course. Parallel programming is already more difficult than serial programming, which… does not require teaching architecture.
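As a taste of such an assignment, a sketch of the standard XMTC array-compaction example, using the two added commands. The keyword spellings (spawn, ps, psBaseReg, $) follow the XMT tutorial material and should be checked against the current toolchain:

    psBaseReg count;        /* base register for the prefix-sum primitive */

    /* Copy the nonzero elements of A[0..n-1] into B, in arbitrary order.
       One virtual thread per element; $ is the virtual-thread ID. */
    count = 0;
    spawn(0, n - 1) {
        int e = 1;
        if (A[$] != 0) {
            ps(e, count);   /* atomically: e = old count; count += 1 */
            B[e] = A[$];    /* each nonzero lands in a unique slot of B */
        }
    }
    /* count now equals the number of elements copied into B[0..count-1] */

The ps primitive replaces locks and explicit thread coordination, which is what keeps the code close to the PRAM-style algorithm it came from.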
Anecdotal Validation (?)
• Breadth-first search (BFS) example, 42 students, joint UIUC/UMD course:
- <1X speedups using OpenMP on an 8-processor SMP
- 7X-25X speedups on the 64-processor XMT FPGA prototype [built at UMD]
What's the big deal about 64 processors beating 8? The silicon area of 64 XMT processors ~= 1-2 SMP processors.
• Questionnaire, ranking approaches for achieving (hard) speedups: all students but one placed XMTC ahead of OpenMP
• Order-of-magnitude: teachability/learnability (MS, HS & up; SIGCSE'10)
• SPAA’11: >100X speedup on max-flow relative to 2.5X on GPU (IPDPS’10)
• Fleck/Kuhn: research too esoteric to be reliable → exoteric validation!
Reward alert: try to publish a paper boasting easy-to-obtain results
→ Ease-of-programming (EoP): 1. Badly needed. Yet, 2. A lose-lose proposition.
Where to find a machine that effectively supports such parallel algorithms?
• Parallel algorithms researchers realized decades ago that the main reason parallel machines are difficult to program is that bandwidth between processors/memories is limited. Lower bounds: [VW85, MNV94].
• [BMM94]: 1. HW vendors see the cost benefit of lowering the performance of interconnects, but grossly underestimate the programming difficulties and the high software development costs implied. 2. Their exclusive focus on runtime benchmarks misses critical costs, including: (i) the time to write the code, and (ii) the time to port the code to a different distribution of data, or to different machines that require a different distribution of data.
G. Blelloch, B. Maggs & G. Miller. The hidden cost of low bandwidth communication. In Developing a CS Agenda for HPC (Ed. U. Vishkin), ACM Press, 1994.
• Patterson, CACM04: Latency Lags Bandwidth. HP12: as latency improved by 30-80X, bandwidth improved by 10-25KX.
→ Isn't this great news? The cost benefit of low bandwidth is drastically decreasing.
• Not so fast. X86Gen senior engineer, 1/2011: "Okay, you do have a 'convenient' way to do parallel programming; so what's the big deal?!"
• Commodity HW → decomposition-first programming doctrine → heroic programmers → sigh… Has the 'bandwidth → ease-of-programming' opportunity gotten lost?
Sociologists of science
• Debates between adherents of different thought styles consist almost entirely of misunderstandings. Members of both parties are talking of different things (though they are usually under an illusion that they are talking about the same thing). They are applying different methods and criteria of correctness (although they are usually under an illusion that their arguments are universally valid and if their opponents do not want to accept them, then they are either stupid or malicious).
Comment on the need for breadth of knowledge
Where are your specs?
One example: what is your parallel algorithms abstraction?
• 'First-specs, then-build' is "not uncommon"… for engineering
• 2 options for architects with respect to this example:
A. 1. Learn parallel algorithms. 2. Develop an abstraction that meets EoP. 3. Develop specs. 4. Build.
B. Start from an abstraction with proven EoP.
It is similarly important for algorithms people to learn architecture and applications.