Introduction to parallel algorithms – Fall 2006 COT 5405 Ashok Srinivasan

COT 5405 – Fall 2006
Introduction to parallel algorithms
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
Outline
• Background
• Primitives
• Algorithms
• Important points
Background
• Terminology
– Time complexity
– Speedup
– Efficiency
– Scalability
• Communication cost model
Time complexity
• Parallel computation
  – A group of processors work together to solve a problem
  – Time required for the computation is the period from when the first processor starts working until when the last processor stops
• [Figure: execution timelines comparing sequential, bad parallel, ideal parallel, and realistic parallel runs]
Other terminology
• Notation
  – P = Number of processors
  – T1 = Time on one processor
  – TP = Time on P processors
• Speedup: S = T1/TP
• Efficiency: E = S/P
• Work: W = P TP
• Scalability
  – How does TP decrease as we increase P to solve the same problem?
  – How should the problem size increase with P, to keep E constant?
Communication cost model
• Processes spend some time doing useful work, and some time communicating
• Model communication cost as TC = ts + L tb
  – L = message size
  – Independent of location of processes
  – Any process can communicate with any other process
  – A process can simultaneously send and receive one message
I/O model
• We will ignore I/O issues, for the most part
• We will assume that input and output are distributed across the processors in a manner of our choosing
• Example: Sorting
  – Input: x1, x2, ..., xn
    • Initially, xi is on processor i
  – Output: xp1, xp2, ..., xpn
    • xpi is on processor i
    • xpi ≤ xpi+1
Primitives
• Reduction
• Broadcast
• Gather/Scatter
• All gather
• Prefix
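These primitives map directly onto standard MPI collectives. The sketch below is not from the slides; it shows the correspondence using mpi4py's object interface (reduce, bcast, scatter, gather, allgather, scan).

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    total    = comm.reduce(rank, op=MPI.SUM, root=0)                            # Reduction
    value    = comm.bcast("data" if rank == 0 else None, root=0)                # Broadcast
    piece    = comm.scatter(list(range(size)) if rank == 0 else None, root=0)   # Scatter
    pieces   = comm.gather(rank, root=0)                                        # Gather
    everyone = comm.allgather(rank)                                             # All gather
    prefix   = comm.scan(rank, op=MPI.SUM)                                      # Prefix (inclusive scan)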
Reduction -- 1
• Compute x1 + x2 + ... + xn
• [Figure: processors 2, 3, ..., n each send their value to processor 1, which receives and adds them one at a time]
• Tn = n-1 + (n-1)(ts + tb)
• Sn = 1/(1 + ts + tb)
Reduction -- 2
• [Figure: Reduction-1 is applied to {x1, ..., xn/2} on one processor and to {xn/2+1, ..., xn} on another; one partial sum is then sent over and added]
• Tn = n/2 - 1 + (n/2 - 1)(ts + tb) + (ts + tb) + 1 = n/2 + n/2 (ts + tb)
• Sn ~ 2/(1 + ts + tb)
Reduction -- 3
• [Figure: Reduction-1 is applied to each quarter {x1, ..., xn/4}, {xn/4+1, ..., xn/2}, {xn/2+1, ..., x3n/4}, {x3n/4+1, ..., xn}, and the partial sums are combined pairwise]
• Apply Reduction-2 recursively
  – Divide and conquer (see the tree-reduction sketch below)
• Tn ~ log2 n + (ts + tb) log2 n
• Sn ~ (n / log2 n) × 1/(1 + ts + tb)
• Note that any associative operator can be used in place of +
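A minimal sequential sketch of Reduction-3 (my own code, not the slides'): the list stands in for the processors, each iteration of the while loop is one parallel round, and n is assumed to be a power of two.

    def tree_reduce(x, op):
        vals = list(x)
        d = 1
        while d < len(vals):                         # one parallel round per iteration
            for i in range(0, len(vals), 2 * d):     # these pairs operate concurrently
                vals[i] = op(vals[i], vals[i + d])   # "processor" i+d sends to i, which combines
            d *= 2
        return vals[0]

    print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))   # 36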
Parallel addition features
• If n >> P
  – Each processor adds n/P distinct numbers
  – Perform parallel reduction on P numbers
  – TP ~ n/P + (1 + ts + tb) log P
  – Optimal P obtained by differentiating with respect to P
    • Popt ~ n/(1 + ts + tb)
    • If communication cost is high, then fewer processors ought to be used
  – E = [1 + (1 + ts + tb) P log P / n]^-1 (a small numeric check follows below)
    • As problem size increases, efficiency increases
    • As number of processors increases, efficiency decreases
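A small numeric check of these formulas (the parameter values are arbitrary and the helper names are mine): TP combines n/P local additions with a log P reduction, and for fixed n the efficiency falls as P grows.

    import math

    def t_parallel(n, P, ts, tb):
        return n / P + (1 + ts + tb) * math.log2(P)      # local additions + tree reduction

    def efficiency(n, P, ts, tb):
        return 1.0 / (1 + (1 + ts + tb) * P * math.log2(P) / n)

    for P in (16, 256, 4096):
        print(P, efficiency(n=10**6, P=P, ts=10, tb=1))  # efficiency drops as P grows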
Some common collective operations
• Broadcast: one process starts with A; afterwards, every process has a copy of A
• Scatter: one process starts with A, B, C, D; process i ends up with the i-th piece
• Gather: process i starts with its own piece; one process ends up with A, B, C, D
• All gather: process i starts with its own piece; afterwards, every process has A, B, C, D
Broadcast
• [Figure: x1 is broadcast among 8 processors; the number of processors holding x1 doubles in each step]
• T ~ (ts + L tb) log P (a sketch of the doubling pattern follows below)
  – L: Length of data
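A sketch of the doubling pattern behind the log P bound (illustrative code, not from the slides; P is assumed to be a power of two): in each round, every process that already holds the value forwards it to one that does not.

    def tree_broadcast(P, value):
        has = [None] * P
        has[0] = value                 # the root starts with the data
        d = 1
        while d < P:                   # log2(P) rounds, each costing ts + L*tb
            for i in range(d):         # all of these sends happen in parallel
                has[i + d] = has[i]
            d *= 2
        return has                     # every process now holds the value

    print(tree_broadcast(8, "x1"))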
Gather/Scatter
• [Figure: binary tree over 8 processors; the leaves hold x1, ..., x8, intermediate nodes hold x1-2, x3-4, x5-6, x7-8 and x1-4, x5-8, and the root holds x1-8; message sizes grow as L, 2L, 4L towards the root]
• Note: Σi=0..log P-1 2^i = (2^log P - 1)/(2 - 1) = P - 1 ~ P
• Gather: data move towards the root
• Scatter: Review question
• T ~ ts log P + P L tb (a quick cost check follows below)
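A quick arithmetic check of the gather cost (my own sketch, assuming P is a power of two): at level k the messages carry 2^k L data, so the bandwidth term sums to (P - 1) L tb, roughly P L tb.

    import math

    def gather_time(P, L, ts, tb):
        levels = int(math.log2(P))     # one message per level on the critical path
        return sum(ts + (2 ** k) * L * tb for k in range(levels))

    print(gather_time(8, L=1, ts=0, tb=1))   # 7 == (P - 1) * L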
All gather
• Equivalent to each processor broadcasting to all the processors
• [Figure: 8 processors, with xi initially on processor i]
• [Figure: step 1 - neighboring processors exchange messages of size L; each pair now holds x1-2, x3-4, x5-6, or x7-8]
• [Figure: step 2 - pairs exchange messages of size 2L; each group of four now holds x1-4 or x5-8]
• [Figure: step 3 - the two groups exchange messages of size 4L; every processor now holds x1-8]
• Tn ~ ts log P + P L tb (see the recursive-doubling sketch below)
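A sketch of the recursive-doubling all-gather shown in the figures (illustrative code, not from the slides; P is assumed to be a power of two): in round k each process exchanges everything it has collected so far with the process whose rank differs in bit k, so message sizes grow as L, 2L, 4L, ...

    def all_gather(items):
        P = len(items)
        have = [[v] for v in items]        # process i starts with only item i
        d = 1
        while d < P:                       # log2(P) rounds
            new = []
            for i in range(P):
                j = i ^ d                  # partner: rank differing in bit log2(d)
                new.append(have[i] + have[j] if i < j else have[j] + have[i])
            have = new
            d *= 2
        return have                        # every process now holds all P items, in order

    print(all_gather(["x1", "x2", "x3", "x4"]))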
Review question: Pipelining
• Useful when repeatedly and regularly performing a large number of primitive operations
  – Optimal time for a single broadcast = log P
    • But doing this n times takes n log P time
  – Pipelining the broadcasts takes n + P time (a rough comparison follows below)
    • Almost constant amortized time per broadcast if n >> P
    • n + P << n log P when n >> P
• Review question: How can you accomplish this time complexity?
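A back-of-the-envelope comparison of the two counts above (my own sketch, counting unit-time hops along a chain of P processors); it only checks the arithmetic, not any particular pipelining scheme.

    import math

    def pipelined_time(n, P):
        return n + P - 1                   # message k leaves the root at step k and needs P-1 hops

    def repeated_broadcast_time(n, P):
        return n * math.ceil(math.log2(P))

    print(pipelined_time(10**4, 64), repeated_broadcast_time(10**4, 64))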
Sequential prefix
• Input
  – Values xi, 1 ≤ i ≤ n
• Output
  – Xi = x1 * x2 * ... * xi, 1 ≤ i ≤ n
  – * is an associative operator
• Algorithm (transcribed in the sketch below)
  – X1 = x1
  – for i = 2 to n
    • Xi = Xi-1 * xi
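The sequential algorithm transcribed directly into Python; op stands for the associative operator *.

    def sequential_prefix(x, op):
        X = [x[0]]                     # X_1 = x_1
        for xi in x[1:]:               # for i = 2 to n
            X.append(op(X[-1], xi))    # X_i = X_{i-1} * x_i
        return X

    print(sequential_prefix([1, 2, 3, 4], lambda a, b: a + b))   # [1, 3, 6, 10]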
Parallel prefix
• Input
  – Processor i has xi
• Output
  – Processor i has Xi = x1 * x2 * ... * xi
• Divide and conquer: define f(a,b) so that it yields the following for a ≤ i ≤ b
  – Xi = xa * ... * xi on Proc Pi
  – Ti = xa * ... * xb on Proc Pi (the total for the whole range)
  – f(1,n) then solves the problem
• f(a,b):
  – if a == b
    • Xi = xi on Proc Pi
    • Ti = xi on Proc Pi
  – else
    • compute in parallel
      – f(a, (a+b)/2)
      – f((a+b)/2+1, b)
    • Pi and Pj exchange Ti and Tj, for a ≤ i ≤ (a+b)/2 and j = i + (b-a+1)/2
    • Xj = Ti * Xj on Pj
    • Ti = Ti * Tj on Pi, and Tj = Ti * Tj on Pj
• T(n) = T(n/2) + 2 + (ts + tb) => T(n) = O(log n)
• An iterative implementation improves the constant (a simulation follows below)
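A sequential simulation of the divide-and-conquer scheme (the names X, T and the 0-based indexing are mine, and n is assumed to be a power of two so the halves pair up evenly): X[i] plays the role of processor Pi's prefix, T[i] its range total, and the two recursive calls and the pairwise exchanges would run in parallel.

    def parallel_prefix(x, op):
        n = len(x)
        X = list(x)                      # X[i] becomes x[0] op ... op x[i]
        T = list(x)                      # T[i] holds the total of the current range

        def f(a, b):
            if a == b:
                return
            m = (a + b) // 2
            f(a, m)                      # in parallel with ...
            f(m + 1, b)                  # ... this call
            half = m - a + 1
            for i in range(a, m + 1):    # each pair (Pi, Pj) works concurrently
                j = i + half
                t_left, t_right = T[i], T[j]       # the values the pair exchanges
                X[j] = op(t_left, X[j])            # right-half prefixes absorb the left total
                T[i] = T[j] = op(t_left, t_right)  # both now know the total of [a, b]

        f(0, n - 1)
        return X

    print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))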
Iterative parallel prefix example
(xij denotes xi * xi+1 * ... * xj)

  Start:   x0   x1   x2   x3   x4   x5   x6   x7
  Step 1:  x0   x01  x12  x23  x34  x45  x56  x67
  Step 2:  x0   x01  x02  x03  x14  x25  x36  x47
  Step 3:  x0   x01  x02  x03  x04  x05  x06  x07
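The table can be reproduced with the iterative (recursive-doubling) scan sketched below (not from the slides): in each step, element i combines with element i - d, and d doubles, so log2 n steps suffice.

    def iterative_prefix(x, op):
        X = list(x)
        d = 1
        while d < len(X):
            # one parallel step: every element i >= d combines with element i - d
            X = [X[i] if i < d else op(X[i - d], X[i]) for i in range(len(X))]
            d *= 2
        return X

    print(iterative_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
    # [1, 3, 6, 10, 15, 21, 28, 36]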
Algorithms
• Linear recurrence
• Matrix vector multiplication
Linear recurrence
• Determine each xi, 2 ≤ i ≤ n
  – xi = ai xi-1 + bi xi-2
  – x0 and x1 are given
• Sequential solution (transcribed below)
  – for i = 2 to n
    • xi = ai xi-1 + bi xi-2
  – Follows directly from the recurrence
  – This approach is not easily parallelized
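The sequential solution transcribed directly (the coefficient lists a and b are assumed to hold ai and bi at indices 2..n, with unused placeholders at 0 and 1):

    def recurrence_seq(a, b, x0, x1):
        x = [x0, x1]
        for i in range(2, len(a)):                       # for i = 2 to n
            x.append(a[i] * x[i - 1] + b[i] * x[i - 2])  # xi = ai*x(i-1) + bi*x(i-2)
        return x

    # Example: a[i] = b[i] = 1 gives the Fibonacci numbers.
    print(recurrence_seq([0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], 0, 1))  # [0, 1, 1, 2, 3, 5]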
Linear recurrence in parallel
• Given xi = ai xi-1 + bi xi-2
  – x2i = a2i x2i-1 + b2i x2i-2
  – x2i+1 = a2i+1 x2i + b2i+1 x2i-1
• Rewrite this in matrix form, Xi = Ai Xi-1:

    [ x2i   ]   [ b2i         a2i               ] [ x2i-2 ]
    [ x2i+1 ] = [ a2i+1 b2i   b2i+1 + a2i+1 a2i ] [ x2i-1 ]

• Xi = Ai Ai-1 ... A1 X0, where X0 = [x0, x1]^T
• This is a parallel prefix computation, since matrix multiplication is associative
• Solved in O(log n) time (a sketch follows below)
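A sketch of the matrix-product formulation (the helper names are mine; n is assumed odd so the steps pair up as (x2i, x2i+1), and the sequential prefix() stands in for the parallel prefix primitive):

    def matmul2(P, Q):                          # 2x2 matrix product P @ Q
        return [[sum(P[r][k] * Q[k][c] for k in range(2)) for c in range(2)]
                for r in range(2)]

    def prefix(seq, op):                        # stand-in for the parallel prefix primitive
        out = [seq[0]]
        for v in seq[1:]:
            out.append(op(out[-1], v))
        return out

    def recurrence_parallel(a, b, x0, x1):
        n = len(a) - 1                          # highest index; assumed odd
        # One matrix Ai per pair of steps, exactly as in the rewrite above.
        A = [[[b[2*i],             a[2*i]],
              [a[2*i+1] * b[2*i],  b[2*i+1] + a[2*i+1] * a[2*i]]]
             for i in range(1, (n - 1) // 2 + 1)]
        # Prefix products Mi = Ai ... A1; associative, so the parallel prefix applies.
        M = prefix(A, lambda P, Q: matmul2(Q, P))
        x = [x0, x1]
        for Mi in M:
            x.append(Mi[0][0] * x0 + Mi[0][1] * x1)   # x_{2i}   from Xi = Mi X0
            x.append(Mi[1][0] * x0 + Mi[1][1] * x1)   # x_{2i+1} from Xi = Mi X0
        return x

    print(recurrence_parallel([0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], 0, 1))  # [0, 1, 1, 2, 3, 5]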
Matrix-vector multiplication
• c = A b
  – Often performed repeatedly, e.g. bi = A bi-1
  – We then need the same data distribution for c and b
• One-dimensional decomposition
  – Example: row-wise block striped for A
    • b and c replicated
  – Each process computes its components of c independently
  – Then all-gather the components of c
1-D matrix-vector multiplication
• Data distribution: A row-wise block striped, b replicated, c replicated
• Each process computes its components of c independently
  – Time = Θ(n^2/P)
• Then all-gather the components of c (see the sketch below)
  – Time = ts log P + tb n
• Note: P ≤ n
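A hedged mpi4py sketch of the 1-D scheme (sizes and data are illustrative; n is assumed divisible by P): each rank holds a block of rows of A and a full copy of b, computes its slice of c, and an all-gather rebuilds the replicated c.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()
    n = 8 * P                                  # assume n is a multiple of P

    A_local = np.random.rand(n // P, n)        # this rank's block of rows of A
    b = np.empty(n)
    if rank == 0:
        b[:] = np.random.rand(n)
    comm.Bcast(b, root=0)                      # replicate b on every process

    c_local = A_local @ b                      # Theta(n^2 / P) local work
    c = np.empty(n)
    comm.Allgather(c_local, c)                 # ts log P + tb n to rebuild c on all ranks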
2-D matrix-vector multiplication
• [Figure: 4x4 process grid; process Pij holds block Aij, the block Bi of b starts on Pi0, and the block Ci of c is produced on Pi0]
• Processes Pi0 send Bi to P0i
  – Time: ts + tb n/P^0.5
• Processes P0j broadcast Bj to all Pij
  – Time = ts log P^0.5 + tb n log P^0.5 / P^0.5
• Processes Pij compute Cij = Aij Bj
  – Time = Θ(n^2/P)
• Processes Pij reduce Cij onto Pi0, 0 ≤ i < P^0.5
  – Time = ts log P^0.5 + tb n log P^0.5 / P^0.5
• Total time = Θ(n^2/P + ts log P + tb n log P / P^0.5)
  – P ≤ n^2
  – More scalable than the 1-dimensional decomposition (a sketch of the four steps follows below)
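A hedged mpi4py sketch of the four steps above (my own setup: P is assumed to be a perfect square, the block size nb is illustrative, and process Pij is rank i*sqrt(P)+j):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()
    q = int(round(P ** 0.5))                   # process grid is q x q; assume P = q*q
    i, j = rank // q, rank % q                 # grid coordinates of this process
    nb = 4                                     # block size, so n = q * nb

    A_block = np.random.rand(nb, nb)                            # Aij
    B_block = np.random.rand(nb) if j == 0 else np.empty(nb)    # Bi starts on Pi0

    # Step 1: Pi0 sends Bi to P0i (rank of Pi0 is i*q, rank of P0i is i).
    if j == 0 and i != 0:
        comm.Send(B_block, dest=i)
    if i == 0 and j != 0:
        comm.Recv(B_block, source=j * q)
    # Step 2: P0j broadcasts Bj down column j.
    col = comm.Split(color=j, key=i)
    col.Bcast(B_block, root=0)
    # Step 3: local block product, Theta(n^2 / P) work.
    C_partial = A_block @ B_block
    # Step 4: reduce the partial results along row i onto Pi0.
    row = comm.Split(color=i, key=j)
    C_block = np.empty(nb) if j == 0 else None
    row.Reduce(C_partial, C_block, op=MPI.SUM, root=0)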
Important points
• Efficiency
– Increases with increase in problem size
– Decreases with increase in number of processors
• Aggregation of tasks to increase granularity
– Reduces communication overhead
• Data distribution
– 2-dimensional may be more scalable than 1-dimensional
– Has an effect on load balance too
• General techniques
– Divide and conquer
– Pipelining