
Potential for Parallel Computation

Chapter 2
Fundamentals of Parallel Processing
G. Alaghband
Main Topics
• Prefix Algorithms
• Speedup and Efficiency
• Amdahl's Law
Examples of Parallel Programming Design
• Sequential/Parallel Add
• Sum Prefix Algorithm
• Parameters of Parallel Algorithms
• Generalized Prefix Algorithm
• Divide and Conquer
• Upper/Lower Algorithm
• Size and Depth of Upper/Lower Algorithm
• Odd/Even Algorithm
• Size and Depth of Odd/Even Algorithm
• A Parallel Prefix Algorithm with Small Size and Depth
• Size and Depth Analysis
A Simple Algorithm:
Adding numbers: assume a vector of numbers in V[1:N].
Sequential add:
S := V[1];
for i := 2 step 1 until N
S := S + V[i];
[Figure: data dependence graph for sequential summation]
Data Dependence Graph for Parallel Summation
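A minimal Python sketch of the pairwise scheme this graph depicts (our illustration, not from the slides; the operator is fixed to +). Each pass of the while loop corresponds to one parallel time step, because all the pair additions inside it are independent:

def pairwise_sum(v):
    """Tree summation: SIZE = N - 1 operations, DEPTH = ceil(log2 N) levels."""
    v = list(v)
    while len(v) > 1:
        # All of these additions are independent and could run simultaneously.
        pairs = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
        v = pairs + ([v[-1]] if len(v) % 2 else [])
    return v[0]

print(pairwise_sum(range(1, 9)))   # 36, in 3 parallel steps instead of 7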
A Slightly More Complicated Algorithm
Sum Prefix:
for i := 2 step 1 until N
V[i] := V[i-1] + V[i];
Each term, V’[i], is the sum of all numbers in V[1:i], i ≤ N. (A runnable rendering follows the list below.)
[Figure: dependence graph for sequential prefix]
Many applications:
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms
• Assigning space in a farmers market
• Allocating memory to parallel threads
• Allocating memory buffers for communication channels
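A direct Python sketch of the prefix loop above (illustrative only; the slides' pseudocode is Algol-style):

def sum_prefix(v):
    """In-place sequential sum prefix: v[i] becomes v[0] + ... + v[i]."""
    for i in range(1, len(v)):
        v[i] = v[i - 1] + v[i]   # depends on the result of iteration i-1
    return v

print(sum_prefix([1, 2, 3, 4]))   # [1, 3, 6, 10]

Each iteration reads the value written by the previous one, which is exactly the length-(N-1) chain in the dependence graph.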
Questions to contemplate and try to answer as we study this section:
• Do sequential computations involve different amounts of parallelism?
• What can be done in parallel in an arbitrary algorithm?
PARAMETERS OF PARALLEL ALGORITHMS
SIZE:
Number of operations
DEPTH:
Number of operations in the longest chain from any input to
any output.
EXAMPLES
Sequential sum of N inputs:
SIZE = N - 1
DEPTH = N - 1
Parallel sum of N inputs (pairwise summation):
SIZE = N - 1
DEPTH = ⌈log₂ N⌉
Sequential sum prefix of N inputs:
SIZE = N - 1
DEPTH = N - 1
A simply stated problem having several different algorithms is the
Generalized Prefix Problem:
Given an associative operator ∘ and N variables V1, V2, ..., VN, form the N results:
V1, V1 ∘ V2, V1 ∘ V2 ∘ V3,
…,
V1 ∘ V2 ∘ V3 ∘ ... ∘ VN.
There are several different algorithms to solve this problem, each with
different characteristics.
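The sequential solution works for any associative operator. A small Python sketch (a hypothetical helper of ours, not from the slides) with the operator passed in:

def generalized_prefix(op, values):
    """Return [v1, v1 o v2, v1 o v2 o v3, ...] for an associative op."""
    out = [values[0]]
    for v in values[1:]:
        out.append(op(out[-1], v))   # N - 1 operations, chained to depth N - 1
    return out

print(generalized_prefix(max, [3, 1, 4, 1, 5]))   # running maximum: [3, 3, 4, 4, 5]

(Python's itertools.accumulate provides the same thing.) The parallel algorithms below reorganize exactly this computation.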
Divide and Conquer
A general technique for constructing non-trivial parallel algorithms
is the divide and conquer technique.
The idea is to split a problem into 2 smaller problems whose solutions can be simply combined to solve the larger problem.
The splitting is continued recursively until problems are so small
that they are easy to solve.
In this case we split the prefix problem on V1, V2, ..., VN into 2
problems:
Prefix on V1, V2, ..., V⌈N/2⌉,
and
Prefix on V⌈N/2⌉+1, V⌈N/2⌉+2, ..., VN.
That is, we split inputs to the prefix computation into a lower half
and an upper half, and solve the problem separately on each half.
The Upper/Lower Construction
The solutions to the two half-size problems are combined by the construction below:
[Figure: the Upper/Lower combining construction]
Recall that the ceiling of X, ⌈X⌉, is the least integer ≥ X, and the floor of X, ⌊X⌋, is the greatest integer ≤ X.
Recursively applying the Upper/Lower construction will eventually result in
prefix computations on no more than 2 inputs, which is trivial.
For example, for 4 inputs we obtain:
N = 4
Size = 4
Depth = 2
A larger example of the parallel prefix resulting from recursive
Upper/Lower construction Pul(8):
N=8
Size = 12
Depth = 3
Finally, Pul(16):
N = 16
Size = 32
Depth = 4
Having developed a way to produce a prefix algorithm which
allows parallel operations, we should now characterize it in terms
of its size and depth.
The depth of the algorithm is trivial to analyze.
The construction must be repeated ⌈log₂ N⌉ times to reduce everything to two inputs.
For each application of the construction, the path from the rightmost input to the rightmost output passes through one more operation.
Therefore
Depth = ⌈log₂ N⌉
Analysis of Size of Upper/Lower
Assume N is a power of 2 (the easiest case to analyze).
Theorem: Let s(N) = Size(Pul(N)). Then for N a power of 2:
s(2^k) = k·2^(k-1) = (N/2) log₂ N,
where N = 2^k.
Proof: s(2^k) = k·2^(k-1) = (N/2) log₂ N, by induction on k.
The initial condition for k = 1 is: s(2^1) = Size(Pul(2)) = 1 = 1·2^(1-1).
Assume the result is true for k = i and prove it for k = i + 1. We assume
s(2^i) = i·2^(i-1).
The size of Pul(2^(i+1)) is related to s(2^i) by counting operations in the recursive construction: two half-size prefixes plus the 2^i combining operations at the last step:
s(2^(i+1)) = 2·s(2^i) + 2^i = 2(i·2^(i-1)) + 2^i = (i+1)·2^i = (i+1)·2^((i+1)-1).
Thus if the result holds for k = i, it holds for k = i + 1, and it follows for arbitrary integer k by finite induction.
With N = 2^k and k = log₂ N, this gives s(N) = (N/2) log₂ N.
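A quick numeric check of the theorem (illustrative only), using the recursion counted in the proof, s(N) = 2·s(N/2) + N/2:

def size_ul(n):
    """Operation count of Pul(n), for n a power of 2."""
    return 0 if n == 1 else 2 * size_ul(n // 2) + n // 2

for k in range(1, 6):
    n = 2 ** k
    assert size_ul(n) == (n // 2) * k   # s(2^k) = k 2^(k-1) = (N/2) log2 N
    print(n, size_ul(n))                # 2:1, 4:4, 8:12, 16:32, 32:80

The values for N = 4, 8, 16 match the examples on the preceding slides.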
If we have unlimited processors (arithmetic units)
available then the minimum depth algorithm finishes
soonest.
The Upper/Lower construction gives an algorithm
with minimum depth.
If the number of processors is limited, then we have to keep the size small:
the ODD/EVEN Algorithm.
Parallel prefix with larger depth but smaller size:
Divide the inputs into sets with odd and even index values.
Combine each odd input with the next higher even input.
Do the parallel prefix on the new combined evens, and combine the evens with the next higher odds at the output.
Recursive application of the odd/even construction continues until a prefix of 2 inputs is reached: Poe(N).
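A recursive Python sketch of the odd/even construction (our rendering, 0-based indices, N a power of 2):

def prefix_oe(v, op=lambda a, b: a + b):
    """Odd/Even prefix: combine input pairs, recurse on the N/2 sums,
    then fill in the remaining outputs at the end."""
    n = len(v)
    if n <= 2:
        return v[:] if n == 1 else [v[0], op(v[0], v[1])]
    combined = [op(v[i], v[i + 1]) for i in range(0, n, 2)]  # N/2 input combines
    inner = prefix_oe(combined, op)                          # prefix on N/2 inputs
    out = [None] * n
    out[0] = v[0]
    for i in range(1, n, 2):
        out[i] = inner[i // 2]         # inner results are the even (1-based) prefixes
    for i in range(2, n, 2):
        out[i] = op(out[i - 1], v[i])  # N/2 - 1 output combines
    return out

print(prefix_oe([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]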
Examples:
Poe(4): The odd/even construction for 4 inputs is presented first:
N=4
Size = 4
Depth = 2
Note: Poe(4) is a special case; it is equivalent to Pul(4). Notice that
the longest path is only one more than in the 2-input case.
Odd/Even construction for 8 inputs shows the recursive construction:
N=8
Size = 11
Depth = 4
Size and Depth
The size and depth analysis of the Odd/Even algorithm is simple for N a power of 2.
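The slide's derivation is a figure; the recurrences it tabulates can be reconstructed from the construction (one combine level on the way in, one on the way out) and checked against the Poe(4) and Poe(8) examples above. A sketch of ours:

def size_oe(n):    # s(N) = s(N/2) + (N - 1), with s(2) = 1
    return 1 if n == 2 else size_oe(n // 2) + n - 1

def depth_oe(n):   # d(N) = d(N/2) + 2, with the special case d(4) = 2
    if n <= 2: return 1
    if n == 4: return 2
    return depth_oe(n // 2) + 2

for n in (4, 8, 16, 32):
    print(n, size_oe(n), depth_oe(n))   # sizes 4, 11, 26, 57; depths 2, 4, 6, 8

For N = 2^k these solve to Size = 2N - log₂ N - 2 and Depth = 2 log₂ N - 2 (N ≥ 4).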
Thus the size of the Odd/Even algorithm is less than the size of Upper/Lower,
but its depth is greater (roughly twice).
[Figure: depth and size of the sequential, Upper/Lower, and Odd/Even prefix algorithms as functions of N; the observations are summarized on the next slide]
The sequential algorithm is very deep;
Odd/Even is about twice as deep as Upper/Lower,
but both are much shallower than the sequential case.
The size of the sequential algorithm is the smallest.
The size of Upper/Lower grows faster with N than the size of Odd/Even.
The size of Odd/Even is less than twice the size of the sequential algorithm.
It is possible to find a parallel prefix algorithm with
minimum depth which also has a size proportional to
N instead of N log₂ N.
A Parallel Algorithm with Small Depth and Size
Reference:
Ladner, R. E. and Fischer, M. J., “Parallel Prefix Computation,” JACM,
vol. 27, no. 4, pp. 831-838, Oct. 1980.
By combining the 2 methods:
(Upper/Lower and Odd/Even),
we can define a set of prefix algorithms Pj(N).
For j ≥ 1, Pj(N) is defined by the Odd/Even construction
using Pj-1(N/2), until we get to P0. P0 is the subject
of our interest!
P0(N) is defined differently, using the Upper/Lower construction with P1
and P0 of fewer inputs:
P0(N/2) will do upper/lower with P1 and P0 of N/4.
P1(N/2) will do odd/even of P0 of N/4.
To get P0(N), only P0 and P1 are needed.
For example:
To get P0(16) we need
P1(8) and P0(8)
To get P1(8) we need
P0(4)
To get P0(8) we need
P1(4) and P0(4).
Note that P1(4) happens to be the same as P0(4).
The depth of P0 can be seen by realizing that although P1 has
depth log₂ N + 1 for N inputs, it is applied only to half the
inputs (N/2) at the first application.
Note that the longest path from any input to the highest-numbered
output of P1(N/2) is only k - 1. This value is added to all outputs
of P0(N/2), adding one more to the depth, giving k.
Thus the longest path from any input to the highest-numbered
output of P0(N) is only k = log₂ N.
Depth grows by one for each doubling of N.
[Figure: the highlighted chain goes through only 3 operations]
Counting operations in the two constructions gives the size recurrences (our reconstruction of the slide's figure, for N a power of 2):
Size of P1 (odd/even step): S1(N) = S0(N/2) + N - 1.
Size of P0 (upper/lower step): S0(N) = S0(N/2) + S1(N/2) + N/2.
Theorem: If N = 2^k, then
S0(N) = 4N - F(2+k) - 2F(3+k) + 1 and
S1(N) = 3N - F(1+k) - 2F(2+k),
where F(m) is the m-th Fibonacci number.
Recall:
F(0) = 0, F(1) = 1,
F(m) = F(m-1) + F(m-2), m ≥ 2:
0, 1, 1, 2, 3, 5, 8, ...
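The theorem can be checked numerically. Assuming the size recurrences stated above for the two constructions, a short script of ours confirms the Fibonacci closed forms:

def fib(m):
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a

def s0(n):   # upper/lower step: S0(N) = S0(N/2) + S1(N/2) + N/2
    return 1 if n == 2 else s0(n // 2) + s1(n // 2) + n // 2

def s1(n):   # odd/even step: S1(N) = S0(N/2) + N - 1
    return 1 if n == 2 else s0(n // 2) + n - 1

for k in range(1, 12):
    n = 2 ** k
    assert s0(n) == 4 * n - fib(k + 2) - 2 * fib(k + 3) + 1
    assert s1(n) == 3 * n - fib(k + 1) - 2 * fib(k + 2)
print("closed forms verified")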
Two problems remain in understanding Ladner and
Fischer’s P0.
1. What happens if N is not a power of 2?
In this case there is an upper bound, for N ≥ 1:
Sj(N) < 4N,
so P0(N) is no more than a few times larger than the
sequential algorithm.
2. Some insight into the behavior of Fibonacci
numbers would help, as most people have little
intuition about how Fibonacci numbers behave.
We can do this by using an asymptotic formula for
large m.
The asymptotic formula for large m is Binet's approximation:
F(m) ≈ φ^m / √5, where φ = (1 + √5)/2 ≈ 1.618.
The Fibonacci terms in S0(N) therefore grow only like N^(log₂ φ) ≈ N^0.69, i.e., they are positive but sublinear,
which shows the size of P0(N) is less than 4N.
Speedup and Efficiency
• Speedup and Efficiency of Parallel Algorithms
• Arithmetic Expression Evaluations
• Vector and Matrix Algorithms
Speedup and Efficiency of Algorithms
For any given computation (algorithm):
Let TP be the time to perform the computation with P
processors (arithmetic units, or PEs).
We assume that any P independent operations can be
done simultaneously.
Note: The depth of an algorithm is T∞, the minimum
execution time.
The speedup with P processors is
SP = T1 / TP,
and the efficiency is
EP = SP / P = T1 / (P · TP).
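As a worked example (ours) using the pairwise summation of slide 5: T1 = N - 1 and TP = log₂ N with P = N/2 processors:

from math import log2

def speedup(t1, tp):         # S_P = T1 / T_P
    return t1 / tp

def efficiency(t1, tp, p):   # E_P = S_P / P
    return speedup(t1, tp) / p

n = 1024
print(speedup(n - 1, log2(n)))              # ~102x
print(efficiency(n - 1, log2(n), n // 2))   # ~0.2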
These numbers,
SP and EP,
refer to an algorithm and not to a machine.
Similar numbers can be defined for specific hardware.
The time T1 can be chosen in different ways:
To evaluate how good an algorithm is,
it should be the time for the
“BEST” sequential algorithm.
The Minimum Number of Processors
Giving the Maximum Speedup:
Let P̂ be the minimum number of processors such that
TP̂ = T∞,
i.e.
P̂ = min { P | TP = T∞ }.
Then
TP̂, SP̂, EP̂
are the best known
time, speedup, and efficiency,
respectively.
What are the T1 values for Upper/Lower and Odd/Even?
(Executed on one processor, T1 is just the algorithm's size: (N/2) log₂ N for Upper/Lower and 2N - log₂ N - 2 for Odd/Even.)
If we have one processor, we should use the smaller sequential algorithm, so T1 = N - 1.
This makes the efficiency of Upper/Lower (with P̂ = N/2 and TP̂ = log₂ N)
EP̂ = (N - 1) / ((N/2) log₂ N) ≈ 2 / log₂ N.
Thus the efficiency of Upper/Lower really decreases as
the problem size grows.
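A small table (our sketch, under the same assumptions P̂ = N/2 and TP̂ = log₂ N) makes the decline visible:

for k in (4, 8, 12, 16, 20):
    n = 2 ** k
    e = (n - 1) / ((n / 2) * k)   # E ~ 2 / log2 N
    print(n, round(e, 3))          # 0.469, 0.249, 0.167, 0.125, 0.1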
Evaluation of Arithmetic Expressions
Most problems are not so simple that the best sequential algorithm is known,
to say nothing of the best parallel algorithm.
Arithmetic expression evaluation is a case in which general results are
known:
Definition:
An atom is a constant or variable appearing in an expression.
Let E<N> be an expression in +, -, ×, /, (, ) having N atoms.
The minimum time for sequential evaluation of any such
expression is N - 1 steps.
Evaluation of E<N> takes a minimum amount of time as shown
by the next lemma.
Lemma 1:
For any number P of processors, the time to evaluate an
expression in N atoms satisfies
TP(E<N>) ≥ ⌈log₂ N⌉.
The proof is based on the fact that all atoms are combined into one result.
Proof:
Since there is a single result and all operations are dyadic (they
take 2 operands), there is only one result at the last step and no more
than 2 intermediate results at the next-to-last step.
In general, there are at most 2^l intermediate values at the l-th step from the end.
Since all N atoms must enter the computation, N ≤ 2^(number of stages), so the number of stages is at least ⌈log₂ N⌉.
QED
Expressions can be transformed by mathematical operations into more
parallel forms.
Using the associative, commutative, and distributive laws we can reduce the
height of an expression tree.
Consider the expression:
E<8> = A + B(CDE + F + G) + H
E<8> = A + B(CDE + F + G) + H
Including the distributive law, we can get an even smaller depth.
But the number of operations will increase.
E<8> = A + B(CDE + F + G) + H
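The effect can be checked by computing tree heights. The tree shapes below are our own balancing (the slide's trees are figures): a left-to-right chain, a version rebalanced with associativity/commutativity, and the distributed form A + BCDE + BF + BG + H:

# Expression trees as nested tuples (op, left, right); atoms are strings.
def depth(t):
    return 0 if isinstance(t, str) else 1 + max(depth(t[1]), depth(t[2]))

chain = ('+', ('+', 'A', ('*', 'B',
        ('+', ('+', ('*', ('*', 'C', 'D'), 'E'), 'F'), 'G'))), 'H')
assoc = ('+', ('+', 'A', 'H'), ('*', 'B',
        ('+', ('*', ('*', 'C', 'D'), 'E'), ('+', 'F', 'G'))))
dist = ('+', ('+', ('+', 'A', 'H'), ('+', ('*', 'B', 'F'), ('*', 'B', 'G'))),
        ('*', ('*', 'B', 'C'), ('*', 'D', 'E')))
print(depth(chain), depth(assoc), depth(dist))   # 7, 5, 4

With these shapes, distribution lowers the depth from 5 to 4 but raises the operation count from 7 to 9.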
Using associativity and commutativity, the evaluation time of an expression is
bounded.
In addition to the lower bound on the evaluation time of an arithmetic expression,
we can also get an upper bound when associativity and commutativity are
used to put the expression into the most parallel form possible:
Theorem 1:
If E<N,d> is an arithmetic expression in N atoms with depth d of
parenthesis nesting, then using commutativity and associativity only,
E<N,d> can be transformed so that
TP(E<N,d>) ≤ ⌈log₂ N⌉ + 2d + 1.
Reference:
J.L. Baer and D.P. Bovet, “Compilation of Arithmetic Expressions for Parallel
Computation,” Proc. IFIP Congress 1968, North Holland, Amsterdam, pp.
340-346.
If distributivity is also used, the upper bound is independent of parenthesis
nesting.
But the size of the computation may increase, so the bound on P also will
increase.
Theorem:
An expression E<N> in N atoms can be transformed by
associativity, commutativity, and distributivity so that the evaluation
time is bounded by a multiple of log₂ N (a bound of about 4 log₂ N,
with the number of operations, and hence the bound on P, growing in proportion to N).
Note: The time bound only applies to transformed expressions. The
transformation itself takes on the order of N log₂ N steps.
For computations larger than single expressions,
we must look at specific cases.
Non-Associativity of Floating-Point
Accumulation
• It originates from the limited precision and range of the IEEE
floating-point representation.
• Picture a hypothetical numerical representation with a precision
of 32 “digits”, in which two very large numbers are added to
a very small number; the large numbers are conceived
to be at the extremes of the representation range.
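The same effect is easy to reproduce in IEEE double precision (a Python illustration of ours):

a, b, c = 1.0e16, -1.0e16, 1.0   # two numbers at the extremes, plus a tiny one
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- c is absorbed when added to b first

So floating-point addition is not associative, and a parallel summation that reorders operands (e.g., pairwise summation) can give a slightly different result than the sequential loop.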
Amdahl’s Law
Let T(P) be the execution time with hardware parallelism P.
Let S be the time spent on the sequential part of the work,
and let Q be the time to do the parallel part of the work sequentially;
i.e., S and Q are the sequential and parallel amounts of work,
measured by time on one processor.
The total time with P processors is
T(P) = S + Q/P.
Amdahl’s Law
Expressing this in terms of the fraction of serial work,
f = S / (S + Q),
Amdahl’s law states that
Speedup: S(P) = T(1)/T(P) = 1 / (f + (1 - f)/P)
Efficiency: E(P) = S(P)/P = 1 / (1 + f(P - 1))
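A one-line calculator (ours) shows how harshly the serial fraction limits speedup:

def amdahl_speedup(f, p):
    """Speedup on p processors with serial fraction f: 1 / (f + (1 - f)/p)."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.05, 0.10):
    s = amdahl_speedup(f, 100)
    print(f, round(s, 1), round(s / 100, 2))   # speedup and efficiency
# Even f = 1% caps 100 processors at ~50x speedup (50% efficiency).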
• A very small amount of unparallelized code can
have a very large effect on efficiency if the amount of
parallelism is large;
• A fast vector processor must also have a fast
scalar processor in order to achieve a sizeable
fraction of its peak performance;
• Effort in parallelizing a small fraction of code that
is currently executed sequentially may pay off in
large performance gains;
• Hardware that allows even a small fraction of new
things to be done in parallel may be considerably
more efficient.
Although Amdahl’s law is a simple performance
model, it need not be taken simplistically.
The behavior of the sequential fraction f, for example, can be quite
important.
System sizes, especially the number of processors P, are often
increased for the purpose of running larger problems.
Increasing the problem size often does not increase the absolute
amount of sequential work significantly.
In this case, f is a decreasing function of problem size,
and if the problem size is increased with P, the somewhat pessimistic
implications of the equations above look much more favorable.
See Problem 2.16 for a specific example.
The behavior of performance as both problem and
system size increase is called scalability.