Talk (pptx, 47 slides) - University of Maryland Institute for Advanced Computer Studies (UMIACS)
No Need to Constrain Many-Core Parallel
Programming:
Time for Hardware Upgrade
The pompous version
- After 40 years of “wandering in the desert”, general-purpose parallelism is very
close to capturing the “promised land” of mainstream computing
- For that, we need the soldiers/programmers
- Vendors want programmers to embrace parallelism
- But, currently they don’t support the easiest possible form of parallelism
- A proper HW upgrade can provide the needed support
Uzi Vishkin
Many-Cores are Productivity Limited
Uninviting programmers' models simply turn programmers away.
"Ten ways to waste a parallel computer” (Keynote, ISCA09). But
you don't need 10 ways. Just repel the programmer and ... you
don't have to worry about the rest.
Many-Cores are Productivity Limited
~2003: Wall-Street-traded companies gave up the safety of the only paradigm that ever worked for them, moving to parallel computing.
The Challenge: Reproduce the success of the serial paradigm for many-core computing, where obtaining strong, but not absolutely the best, performance is relatively easy.
[Reinvent HW, programming, training and education. My favorite
question: how will the algorithms course look?]
Positive News: Vendors are opening up to 40 years of parallel computing, and to SW that matches vendors' HW (2009 acquisitions). But did they pick the right part for adoption?
Never: An easy-to-program, fast general-purpose parallel computer for single-task completion time has never existed. Less politically correct: current parallel architectures never really worked for productivity.
1991: “parallel software crisis”
2003: “as intimidating and time consuming as programming in assembly language” --NSF Blue Ribbon Committee
Why drag the whole field to a recognized disaster area?
The business food chain
- SW developers are those who directly serve the customers
- The “software spiral” (the cyclic process of HW improvement
leading to SW improvement, e.g., around the von-Neumann
model) is broken
- The customer will benefit from HW improvements only if SW
uses them
- If HW developers do not get used to the idea of serving SW developers by starting to benchmark HW for productivity, guess what will happen to the customers of their HW
Many-cores are productivity limited
Is there any really good news?
Many-core programming is too constrained
If only we could “set the programmer free”
Priorities for today’s presentation
1. What does it mean to “set free” parallel algorithmic
thinking (PAT)?
2. Architecture functions/capabilities that support PAT
3. HW hooks enabling these functions
[Goal: Interest you in reading more → Google “XMT”]
Vendors must incorporate such functions
Simple way: just add these HW hooks to enhance your
design (if possible, with your design)
Example of HW hook Prefix-Sum
• 1500 cars enter a gas station with 1000 pumps
Function
• Direct, in unit time, a car to EVERY pump
• Then, direct in unit time a car to EVERY pump
becoming available
Proposed HW hook
Prefix-sum functional unit.
[HW enhancement of Fetch&Add, US Patent]
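To make the function concrete: a minimal C11 sketch (our illustration, not the HW design) in which the pump counter behaves like Fetch&Add, so each arriving car atomically receives a distinct pump number. The HW prefix-sum unit serves many simultaneous requests in a single cycle; this software version serializes them.

#include <stdatomic.h>

static atomic_int next_pump;                /* base value; starts at 0        */

/* One arriving car: returns a distinct pump number (the old counter value). */
int assign_pump(void) {
    return atomic_fetch_add(&next_pump, 1); /* Fetch&Add semantics            */
}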
Objective for programmer’s model:
Parallel Algorithmic Thinking (PAT)
• CLRS-09 and others: analysis should be work-depth. Why not design for your analysis? (as in serial). Example: if 1 op now, why not any number next?
• Key question: What could I do in parallel at each step, assuming unlimited hardware?
[Figure: Serial paradigm: 1 op per time step, so Time = Work. Natural (parallel) paradigm: any number of ops per time step, so Work = total #ops and Time << Work.]
[SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill.
Lots of evidence that “work-depth” works. Used as framework in PRAM algorithms
texts: JaJa-92, Keller-Kessler-Traeff-01.
PRAM: Only really successful parallel algorithmic theory. Latent, though not
widespread, knowledgebase
NVidia was happy to report success with 2 PRAM algorithms at IPDPS09. Great to see that from a major vendor.
However: these 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.
XMT (Explicit Multi-Threading):
A PRAM-On-Chip Vision
• IF you could program a current many-core → great speedups. XMT: Fix the IF
• XMT was designed from the ground up with the following features:
- Allows a programmer’s workflow, whose first step is algorithm design for
work-depth. Thereby, harness the whole PRAM theory
- No need to program for locality beyond use of local thread variables, post
work-depth
- Hardware-supported dynamic allocation of “virtual threads” to processors.
- Sufficient interconnection network bandwidth
- Gracefully moving between serial & parallel execution (no off-loading)
- Backwards compatibility on serial code
- Support irregular, fine-grained algorithms (unique). Some role for hashing.
• Unlike approaches that merely match current HW
• Today’s position: Enable (replicate) these functions
• Tested HW & SW prototypes
• Software release of full XMT environment
• SPAA’09: ~10X relative to Intel Core 2 Duo
• For links to detailed info: See Proc. ICCD’09
Hardware prototypes of PRAM-On-Chip
- 64-core, 75MHz FPGA prototype [SPAA’07, Computing Frontiers’08]
- Original explicit multi-threaded (XMT) architecture [SPAA98] (Cray started to use “XMT” 7+ years later)
- Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects’07]
- Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype
- The design scales to 1000+ cores on-chip
Programmer’s Model: Workflow Function
• Arbitrary CRCW Work-depth algorithm.
- Reason about correctness & complexity in synchronous model
• SPMD reduced synchrony
– Main construct: spawn-join block. Can start any number of processes at
once. Threads advance at own speed, not lockstep
– Prefix-sum (ps). Independence of order semantics (IOS)
– Establish correctness & complexity by relating to WD analyses
– Circumvents “The problem with threads”, e.g., [Lee]
[Diagram: serial code alternating with spawn-join parallel blocks]
• Tune (compiler or expert programmer): (i) Length of sequence
of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
Workflow from parallel algorithms to programming
versus trial-and-error
Option 1: Domain decomposition or task decomposition → Program → Compiler → Hardware. (Insufficient inter-thread bandwidth? Rethink the algorithm to take better advantage of cache.)
Option 2: PAT, parallel algorithmic thinking (say, PRAM) → Program → Compiler → Hardware.
Is Option 1 good enough for the parallel programmer’s model?
Options 1B and 2 start with a PRAM algorithm, but option 1A does not.
Options 1A and 2 represent a workflow, but option 1B does not.
PAT → Prove correctness → Program (still correct) → Tune (still correct) → Hardware
Not possible in the 1990s.
Possible now.
Why settle for less?
Snapshot: XMT High-level language
Cartoon: Spawn creates threads; a thread progresses at its own speed and expires at its Join.
Synchronization: only at the Joins; so virtual threads avoid busy-waits by expiring. New: independence of order semantics (IOS).
The array compaction (artificial)
problem
Input: Array A[1..n] of elements.
Map, in some order, all A(i) not equal 0 into array D.
[Figure: A = (1, 0, 5, 0, 0, 0, 4, 0, 0). The nonzero values 1, 5, 4 are mapped into D at the positions returned by prefix-sum into the thread-local variables e0, e2, e6.]
For the program below: e$ is local to thread $; x ends up 3.
XMT-C
Single-program multiple-data (SPMD) extension of standard C.
Includes Spawn and PS - a multi-operand instruction.
Essence of an XMT-C program
int x = 0;
Spawn(0, n) /* Spawn n threads; $ ranges 0 to n − 1 */
{ int e = 1;
if (A[$] not-equal 0)
{ PS(x,e);
D[e] = A[$] }
}
n = x;
Notes: (i) PS is defined next (think F&A). See results for
e0,e2, e6 and x. (ii) Join instructions are implicit.
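For readers without the XMT toolchain, here is a plain-C emulation (our sketch, not XMT code) of the same compaction: C11 atomic fetch-and-add stands in for PS, and each loop iteration plays one virtual thread $. Under IOS, any order of the nonzero elements in D is acceptable.

#include <stdatomic.h>

int compact(const int *A, int *D, int n) {
    atomic_int x = 0;
    for (int dollar = 0; dollar < n; dollar++) {
        if (A[dollar] != 0) {
            int e = atomic_fetch_add(&x, 1);   /* PS(x,e) with e == 1: e gets old x */
            D[e] = A[dollar];
        }
    }
    return x;                                  /* the new n (3 for the A above)    */
}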
XMT Assembly Language
Standard assembly language, plus 3 new instructions: Spawn, Join, and PS.
The PS multi-operand instruction
New kind of instruction: Prefix-sum (PS).
Individual PS, PS Ri Rj, has an inseparable (“atomic”) outcome:
(i) Store Ri + Rj in Ri, and
(ii) Store original value of Ri in Rj.
Several successive PS instructions define a multiple-PS instruction. E.g., the
sequence of k instructions:
PS R1 R2; PS R1 R3; ...; PS R1 R(k + 1)
performs the prefix-sum of base R1 elements R2,R3, ...,R(k + 1) to get:
R2 = R1; R3 = R1 + R2; ...; R(k + 1) = R1 + ... + Rk; R1 = R1 + ... + R(k + 1).
Idea: (i) Several ind. PS’s can be combined into one multi-operand instruction.
(ii) Executed by a new multi-operand PS functional unit.
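In C-sketch form (illustration only; on XMT this is one inseparable instruction, and the multi-operand unit executes a whole chain at once):

/* PS Ri Rj: Ri is the base, Rj the increment. */
void PS(int *Ri, int *Rj) {
    int old = *Ri;
    *Ri = *Ri + *Rj;   /* (i)  store Ri + Rj in Ri                  */
    *Rj = old;         /* (ii) store the original value of Ri in Rj */
}
/* Calling PS(&R1,&R2); PS(&R1,&R3); ... reproduces the prefix sums above. */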
Mapping PRAM Algorithms onto XMT
(1) PRAM parallelism maps into a thread structure
(2) Assembly-language threads are not-too-short, to increase locality of reference (see the sketch after this list)
(3) the threads satisfy IOS
How (summary):
I. Use work-depth methodology [SV-82] for “thinking
in parallel”. The rest is skill.
II. Go through PRAM or not. Ideally compiler:
III. Produce XMTC program accounting also for:
(1) Length of sequence of round trips to memory,
(2) QRQW.
Issue: nesting of spawns.
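A small sketch of point (2), thread coarsening (our example; the grain size G is an arbitrary assumption and vec_add_coarsened is illustrative, not compiler output): n one-operation PRAM "threads" are grouped into n/G not-too-short threads of G consecutive operations each.

enum { G = 64 };   /* grain size: an arbitrary illustrative choice */

/* Serial emulation; on XMT the outer loop becomes spawn(0, nthreads-1)
   and t is the thread id $. */
void vec_add_coarsened(const int *A, const int *B, int *C, int n) {
    int nthreads = (n + G - 1) / G;
    for (int t = 0; t < nthreads; t++)
        for (int j = t * G; j < (t + 1) * G && j < n; j++)
            C[j] = A[j] + B[j];        /* G consecutive PRAM ops per thread */
}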
Merging: Example for Algorithm & Program
Input: Two arrays A[1. . n], B[1. . n]; elements from a totally
ordered domain S. Each array is monotonically nondecreasing.
Merging: map each of these elements into a monotonically nondecreasing array C[1..2n]
Serial Merging algorithm
SERIAL-RANK(A[1..n]; B[1..n])
Starting from A(1) and B(1), in each round:
1. compare an element from A with an element of B
2. determine the rank of the smaller among them
Complexity: O(n) time (and O(n) work...)
PRAM Challenge: O(n) work, least time
Also (new): fewest spawn-joins
Merging algorithm (cont’d)
“Surplus-log” parallel algorithm for Merging/Ranking
for 1 ≤ i ≤ n pardo
• Compute RANK(i,B) using standard binary search
• Compute RANK(i,A) using binary search
Complexity: W = O(n log n), T = O(log n)
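A serial C sketch of the surplus-log algorithm (our emulation; on XMT each iteration of the outer loop is one thread of the pardo). Ties are broken by ranking equal B-elements after equal A-elements, so all 2n target slots are distinct:

/* Number of elements of T[0..n-1] smaller than x (strict / non-strict). */
static int rank_lt(int x, const int *T, int n) {
    int lo = 0, hi = n;
    while (lo < hi) { int mid = (lo + hi) / 2;
        if (T[mid] < x) lo = mid + 1; else hi = mid; }
    return lo;
}
static int rank_le(int x, const int *T, int n) {
    int lo = 0, hi = n;
    while (lo < hi) { int mid = (lo + hi) / 2;
        if (T[mid] <= x) lo = mid + 1; else hi = mid; }
    return lo;
}

void surplus_log_merge(const int *A, const int *B, int *C, int n) {
    for (int i = 0; i < n; i++) {              /* pardo: 2 searches per i */
        C[i + rank_lt(A[i], B, n)] = A[i];     /* RANK(i,B) */
        C[i + rank_le(B[i], A, n)] = B[i];     /* RANK(i,A) */
    }
}

Each of the 2n binary searches costs O(log n), giving W = O(n log n); all searches are independent, so T = O(log n).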
The partitioning paradigm
n: input size for a problem. Design a 2-stage parallel
algorithm:
1. Partition the input into a large number, say p, of independent small jobs, such that the size of the largest small job is roughly n/p.
2. Actual work - do the small jobs concurrently, using a
separate (possibly serial) algorithm for each.
Linear work parallel merging: using a single spawn
Stage 1 of algorithm: Partitioning. for 1 ≤ i ≤ n/p pardo [p ≤ n/log n and p | n]
• b(i) := RANK(p(i-1)+1, B) using standard binary search
• a(i) := RANK(p(i-1)+1, A) using binary search
Stage 2 of algorithm: Actual work
Observe: The overall ranking task is broken into 2p independent “slices”.
Example of a slice: Start at A(p(i-1)+1) and B(b(i)). Using serial ranking, advance until the termination condition: either some A(pi+1) or some B(jp+1) loses.
Parallel program: 2p concurrent threads, using a single spawn-join for the whole algorithm.
Example [refers to the figure]: The thread of 20 binary-searches B; its rank is 11 (index of 15 in B) + 9 (index of 20 in A). Then: compare 21 to 22 and rank 21; compare 23 to 22 to rank 22; compare 23 to 24 to rank 23; compare 24 to 25, but terminate since the thread of 24 will rank 24.
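A serial C sketch of the two-stage scheme, for the A-side slices only (our emulation under the stated assumptions p | n and p ≤ n/log n; the symmetric B-side pass, with the opposite tie-breaking, is omitted for brevity). On XMT the outer loop is the single spawn of concurrent threads:

/* Number of elements of B[0..n-1] strictly smaller than x. */
static int rank_in_B(int x, const int *B, int n) {
    int lo = 0, hi = n;
    while (lo < hi) { int mid = (lo + hi) / 2;
        if (B[mid] < x) lo = mid + 1; else hi = mid; }
    return lo;
}

void merge_A_slices(const int *A, const int *B, int *C, int n, int p) {
    int len = n / p;
    for (int i = 0; i < p; i++) {            /* one thread per A-slice  */
        int j = i * len;                     /* A(p(i-1)+1), 0-based    */
        int b = rank_in_B(A[j], B, n);       /* stage 1: b(i)           */
        for (; j < (i + 1) * len; j++) {     /* stage 2: serial ranking */
            while (b < n && B[b] < A[j]) b++;
            C[j + b] = A[j];
        }
    }
}

Note: the full algorithm also cuts slices at B-boundaries (hence 2p slices, none larger than 2n/p), which bounds each thread's running time; this sketch keeps only the A-side cuts.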
Linear work parallel merging (cont’d)
Observation 2p slices. None larger than 2n/p.
(not too bad since average is 2n/2p=n/p)
Complexity Partitioning takes W=O(p log n), and T=O(log n) time,
or O(n) work and O(log n) time, for p <= n/log n.
Actual work employs 2p serial algorithms, each takes O(n/p)
time.
Total W=O(n), and T=O(n/p), for p <= n/log n.
IMPORTANT: Correctness & complexity of parallel program
Same as for algorithm.
This is a big deal. Other parallel programming approaches do
not have a simple concurrency model, and need to reason w.r.t.
the program.
Example of PRAM-like Algorithm
Input: (i) All world airports.
(ii) For each, all airports to which
there is a non-stop flight.
Find: smallest number of flights
from DCA to every other
airport.
Parallel: parallel data-structures.
Inherent serialization: S.
Gain relative to serial: (first cut) ~T/S!
Decisive also relative to coarse-grained
parallelism.
Basic algorithm
Step i: For all airports requiring i-1 flights, for all their outgoing flights, mark (concurrently!) all “yet unvisited” airports as requiring i flights (note the nesting).
Serial: uses a “serial queue”. O(T) time; T = total # of flights.
Notes: (i) “Concurrently” is the only change to the serial algorithm. (ii) No “decomposition”/“partition”. (iii) Takes the better part of a semester to teach!
Please take into account that, based on experience with scores of good students, this semester-long course is needed to make full sense of the approach presented here.
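A serial, level-synchronous C sketch of the basic algorithm (our emulation; the adjacency arrays adj/deg are an assumed input format). level[v] is the number of flights from the source and -1 means "yet unvisited"; on XMT the two inner loops become (nested) spawns. For simplicity this sketch rescans all airports at each step; the serial-queue version is the one that achieves O(T).

#include <string.h>

void bfs_levels(int nv, int *const *adj, const int *deg,
                int src, int *level) {
    memset(level, -1, nv * sizeof *level);          /* all unvisited      */
    level[src] = 0;
    for (int i = 1, changed = 1; changed; i++) {    /* Step i             */
        changed = 0;
        for (int v = 0; v < nv; v++)                /* needs i-1 flights  */
            if (level[v] == i - 1)
                for (int k = 0; k < deg[v]; k++) {  /* outgoing flights   */
                    int w = adj[v][k];
                    if (level[w] == -1) { level[w] = i; changed = 1; }
                }
    }
}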
XMT Architecture Overview
• One serial core – master thread
control unit (MTCU)
• Parallel cores (TCUs) grouped
in clusters
• Global memory space evenly
partitioned in cache banks using
hashing
• No local caches at TCU. Avoids
expensive cache coherence
hardware
• HW-supported run-time load-balancing of concurrent threads over processors. Low thread-creation overhead. (Extends the classic stored-program + program-counter model; cited by 15 Intel patents; prefix-sum to registers & to memory.)
[Block diagram: MTCU; hardware scheduler/prefix-sum unit; clusters 1..C of TCUs; parallel interconnection network; shared memory (L1 cache) partitioned into memory banks 1..M; DRAM channels 1..D.]
- Enough interconnection network
bandwidth
How-To Nugget
Seek 1st (?) upgrade of program-counter
& stored program since 1946
[Flowchart: Virtual over physical, a distributed solution.
Von Neumann (1946--??): a single hardware PC steps through the stored program, from Start to Done.
XMT: many hardware PCs serve virtual threads. Each TCU sets $ := TCU-ID and asks “Is $ > n?”; if No, it executes thread $ and uses PS to get a new $; if Yes, it is done. When PC1 hits Spawn 1000000, a spawn unit broadcasts 1000000 and the code between Spawn and Join to PC1, PC2, ..., PC1000 on a designated bus.]
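In code form, our sketch of each TCU's role (assumed pseudocode, not actual XMT microcode; run_thread and the serialized counter stand in for hardware mechanisms, and on XMT many TCUs issue the prefix-sum request simultaneously):

static int ps_counter;    /* PS base register; initialized to the number of TCUs */

/* One TCU serving virtual threads 0..n. */
void tcu_main(int tcu_id, int n, void (*run_thread)(int)) {
    int dollar = tcu_id;             /* $ := TCU-ID           */
    while (dollar <= n) {            /* "Is $ > n?" -> Done   */
        run_thread(dollar);          /* Execute thread $      */
        dollar = ps_counter++;       /* Use PS to get new $   */
    }
}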
Ease of Programming
• Benchmark: Can any CS major program your many-core? You cannot really avoid it.
Teachability demonstrated so far for XMT:
- To a freshman class with 11 non-CS students. Some programming assignments: merge-sort*, integer-sort* & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page; self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: ability to anticipate performance (as in serial). Can do not just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS’09@CMU + the interview with the teacher.
- High school & middle school (some 10-year-olds) students from underrepresented groups, taught by a HS math teacher.
*Also in Nvidia’s Satish, Harris & Garland IPDPS09
Middle School Summer Camp
Class Picture, July’09 (20 of 22
students)
Software release
Allows you to use your own computer for programming in an XMT environment & experimenting with it, including:
a) Cycle-accurate simulator of the XMT machine
b) Compiler from XMTC to that machine
Also provided: extensive material for teaching or self-study of parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of the 9/15/07 HS tutorial (300 minutes)
(iv) Video recording of grad Parallel Algorithms lectures (30+ hours)
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html, or just Google “XMT”
Q&A
Question: Why do PRAM-type parallel algorithms matter, when we can get by with existing serial algorithms and parallel programming methods like OpenMP on top of them?
Answer: With the latter, you need a strong-willed Comp. Sci. PhD to come up with an efficient parallel program at the end. With the former (the study of parallel algorithmic thinking and PRAM algorithms), high-school kids can write efficient (more efficient if fine-grained & irregular!) parallel programs.
Conclusion
• XMT provides a viable answer to the biggest challenges for the field
– Ease of programming
– Scalability (up & down)
– Facilitates code portability
• Preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2
• ICPP’08 paper compares with GPUs → XMT + GPU beats all-in-one
• Easy to build: one student completed the hardware design and the FPGA-based XMT computer in slightly more than two years → faster time to market, lower implementation cost
• Replicate functions, perhaps by replicating solutions (HW hooks)
Is this enough to sway vendors?!
• An eye-opening Viewpoint, A. Ghuloum (Intel), CACM 9/09
notes: “..hardware vendors tend to understand the
requirements from the examples that software developers
provide… Re-architecting software now for scalability onto
(what appears to be) a highly parallel processor roadmap for
the foreseeable future will accelerate the assistance that
hardware and tool vendors can provide.”
• Ghuloum reports a worrisome reality: SW developers are
expected to develop elaborate code for processors that have
not yet been built, since… HW vendors are less likely to build
machines for code that has not yet been written.
• But, why would SW developers do that?!
Current Participants
Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli,
Beliz Saybasili, Alex Tzannes, Joe Zhou. Recent grads: Aydin Balkan, Mike
Horak, Xingzhi Wen
• Industry design experts (pro-bono).
• Rajeev Barua, Compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and Power. Co-advisor.
• Steve Nowick, Columbia U., Asynch computing. Co-advisor. 2008 NSF team
grant.
• Ron Tzur, Purdue U., K12 Education. Co-advisor. 2008 NSF seed funding
K12: Montgomery Blair Magnet HS, MD, Thomas Jefferson HS, VA, Baltimore (inner city)
Ingenuity Project Middle School 2009 Summer Camp, Montgomery County Public Schools
• Marc Olano, UMBC, Computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, Power. Co-advisor.
• Bernie Brooks, NIH. Co-advisor.
• Marty Peckerar, Microelectronics
• Igor Smolyaninov, Electro-optics
Funding: NSF; NSA (2008 deployed XMT computer); NIH
Industry partner: Intel
Reinvention of Computing for Parallelism. Selected for Maryland Research
Center of Excellence (MRCE) by USM. Not yet funded. 17 members,
including UMBC, UMBI, UMSOM. Mostly applications.
Backup slides
Many forget that the only reason that PRAM algorithms did not
become standard CS knowledge is that there was no
demonstration of an implementable computer architecture that
allowed programmers to look at a computer like a PRAM. XMT
changed that, and now we should let Mark Twain complete the
job.
We should be careful to get out of an experience only the wisdom that is in it— and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again— and that is well; but also she will never sit down on a cold one anymore.— Mark Twain
PERFORMANCE PROGRAMMING & ITS PRODUCTIVITY
[Figure: two workflows, with numbered paths (1-4 as in the original diagram).
Serial: Basic algorithm (sometimes informal) → add data-structures (for serial algorithm) → serial program (C) → (1) standard computer; or (3) the serial program enters parallel programming (Culler-Singh): decomposition → assignment → orchestration → mapping → (2) parallel computer.
XMT: Basic algorithm → add parallel data-structures (for PRAM-like algorithm) → parallel program (XMT-C) → (4) XMT computer (or simulator). Low overheads!]
• 4 easier than 2
• Problems with 3
• 4 competitive with 1: cost-effectiveness; natural
APPLICATION PROGRAMMING & ITS PRODUCTIVITY
[Figure: application programmer's interfaces (APIs) (OpenGL, VHDL/Verilog, Matlab) feed a compiler. Paths, each labeled with “Automatic?”: serial program (C) → standard computer (Yes); parallel programming (Culler-Singh): decomposition → assignment → orchestration → mapping → parallel computer (simulator) (Maybe); parallel program (XMT-C) → XMT architecture (Yes).]
XMT Block Diagram – Back-up slide
ISA
• Any serial (MIPS, X86). MIPS R3000.
• Spawn (cannot be nested)
• Join
• SSpawn (can be nested)
• PS
• PSM
• Instructions for (compiler) optimizations
The Memory Wall
Concerns: 1) latency to main memory, 2) bandwidth to main memory.
Position papers: “the memory wall” (Wulf), “it’s the memory, stupid!” (Sites)
Note: (i) Larger on-chip caches are possible, but for serial computing the return on using them is diminishing. (ii) Few cache misses can overlap (in time) in serial computing, so even the limited bandwidth to memory is underused.
XMT does better on both accounts:
• makes more use of the high bandwidth to cache.
• hides latency by overlapping cache misses; makes more use of the bandwidth to main memory by generating concurrent memory requests; however, use of the cache alleviates the penalty from overuse.
Conclusion: using PRAM parallelism coupled with IOS, XMT reduces the effect of cache stalls.
Memory architecture, interconnects
• High bandwidth memory architecture.
- Use hashing to partition the memory and avoid hot spots.
- Understood, BUT a (needed) departure from mainstream practice.
• High bandwidth on-chip interconnects
• Allow infrequent global synchronization (with IOS).
Attractive: lower power.
• Couple with strong MTCU for serial code.
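For concreteness, one plausible bank-hashing scheme (our assumption for illustration; not XMT's actual hash function): spread consecutive cache lines across banks so that no bank becomes a hot spot.

/* Map a byte address to one of nbanks cache banks. */
static unsigned bank_of(unsigned long addr, unsigned nbanks) {
    unsigned long line = addr >> 6;        /* 64-byte line index       */
    line ^= (line >> 7) * 2654435761UL;    /* cheap multiplicative mix */
    return (unsigned)(line % nbanks);
}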
Some supporting evidence
(12/2007)
Large on-chip caches in shared memory. An 8-cluster (128-TCU!) XMT has only 8 load/store units, one per cluster. [IBM Cell: bandwidth 25.6 GB/s from 2 channels of XDR. Niagara 2: bandwidth 42.7 GB/s from 4 FB-DIMM channels.]
With a reasonable (even relatively high) rate of cache misses, it is really not difficult to see that off-chip bandwidth is not likely to be a show-stopper for, say, a 1GHz 32-bit XMT.
Some experimental results
• AMD Opteron 2.6 GHz, RedHat Linux Enterprise 3, 64KB+64KB L1 cache, 1MB L2 cache (none in XMT), memory bandwidth 6.4 GB/s (2.67X that of XMT)
• M_Mult was 2000x2000; QSort was 20M
• XMT enhancements: broadcast, prefetch + buffer, non-blocking store, non-blocking caches

XMT wall clock time (in seconds)
App.     XMT Basic   XMT     Opteron
M-Mult   179.14      63.7    113.83
QSort    16.71       6.59    2.61
Assume (arbitrary yet conservative) ASIC XMT: 800MHz and 6.4 GB/s.
Bandwidth reduced to 0.6 GB/s and times projected back from the 75MHz FPGA by the 800/75 clock ratio.

XMT projected time (in seconds)
App.     XMT Basic   XMT     Opteron
M-Mult   23.53       12.46   113.83
QSort    1.97        1.42    2.61
- Simulation of 1024 processors: 100X speedup on a standard benchmark suite for VHDL gate-level simulation [Gu-V06]
- Silicon area of a 64-processor XMT is the same as one commodity processor (core)
Naming Contest for New Computer
Paraleap, chosen out of ~6000 submissions.
A single (hard-working) person (X. Wen) completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, with no prior design experience. Attests to the basic simplicity of the XMT architecture → faster time to market, lower implementation cost.
XMT Development – HW Track
– Interconnection network. Led so far to:
  - ASAP’06 best-paper award for the mesh-of-trees (MoT) study
  - Using IBM+Artisan tech files: 4.6 Tbps average output at max frequency (1.3-2.1 Tbps for alternative networks)! No way to get such results without such access
  - 90nm ASIC tapeout: bare die photo of the 8-terminal interconnection network chip, IBM 90nm process, 9mm x 5mm, fabricated August 2007
– Synthesizable Verilog of the whole architecture. Led so far to:
  - Cycle-accurate simulator. Slow. For 11-12K X faster:
  - 1st commitment to silicon: 64-processor, 75MHz computer on FPGA, the industry standard for pre-ASIC prototypes
  - 1st ASIC prototype: 90nm, 10mm x 10mm, 64-processor tapeout in 2008, by 4 grad students
Bottom Line
Cures a potentially fatal problem for the growth of general-purpose processors: how to program them for single-task completion time?
Positive record:
Proposal                           Over-delivering
NSF ’97-’02: experimental algs.    architecture
NSF 2003-8:  arch. simulator       silicon (FPGA)
DoD 2005-7:  FPGA                  FPGA + 2 ASICs
Final thought: Created our own coherent planet
• When was the last time that a university project offered a (separate) algorithms class on its own language, using its own compiler and its own computer?
• Colleagues could not provide an example since
at least the 1950s. Have we missed anything?
For more info:
http://www.umiacs.umd.edu/users/vishkin/XMT/