3 INTRODUCTION TO PARALLEL ALGORITHMS
3.1 Performance measures
There are a number of different ways to characterize the performance of both
parallel computers and parallel algorithms. Usually, the peak performance of
a machine is expressed in units of millions of instructions executed per second
(MIPS) or millions of floating point operations executed per second
(MFLOPS). However, in practice, the realizable performance is clearly a
function of the match between the algorithms and the architecture.
3.1.1 Speed-up, efficiency, and redundancy
Perhaps the best known measure of the effectiveness of parallel algorithms is
the speed-up ratio (Sp) with respect to the best serial algorithm, which is
defined as the ratio of the time required to solve a given problem using the
best known serial method to the time required to solve the same problem by a
parallel algorithm using p processors. Thus, if T(N) denotes the time required
by the best known serial algorithm to solve a problem of size N, and Tp(N)
denotes the time taken by a parallel algorithm using p processors to solve
a problem of the same size, then

Sp = T(N) / Tp(N).
Clearly, Sp is a function of N, the size of the problem, and p, the number of
processors. For notational simplicity, we often do not explicitly denote this
dependence. Another related measure is known as the processor efficiency,
Ep, which is the ratio of the speed-up to the number of processors. Thus
Ep = Sp(N) / p.
Not all serial algorithms admit speed-up through parallel implementation,
but every parallel algorithm can be implemented serially in a
straightforward way. Thus, since a parallel algorithm using p > 1 processors
should perform at least as fast as a serial algorithm, it is desired that Sp ≥ 1.
In the above definition of speed-up, the parallel algorithm may be quite
different from the serial algorithm. It is conceivable that in many cases one
might be interested in comparing the performance of the same algorithm on
the serial machine and on a given parallel processor. In such cases, the
speed-up with respect to a given algorithm is simply defined as

S*p = T1(N) / Tp(N).
Recall that Ti (N ) is the time needed to execute the algorithm on i processors.
Likewise, the processor efficiency in this case is
E*p = S*p(N) / p.
Clearly, S*p ≥ 1. Further, one step of a parallel algorithm using p processors
takes at most p steps when implemented serially. Thus T1(N) ≤ p·Tp(N). This
in turn implies that 1 ≤ S*p ≤ p and 0 < E*p ≤ 1. Since T(N) ≤ T1(N), it readily
follows that Sp ≤ S*p.
Any parallel algorithm that asymptotically attains linear speed-up,
that is, Sp(N) = Θ(p), is said to have optimal speed-up. It is conceivable
that an algorithm with optimal speed-up may have very low processor
efficiency, and likewise, an algorithm with good processor efficiency may not
have large speed-up. In other words, a good serial algorithm may be a bad
parallel algorithm, and a throwaway serial algorithm may end up being a
reasonable parallel algorithm.
Another factor that reflects the performance of a parallel algorithm is the
total number of scalar operations it requires. In the case of serial
algorithms, the serial time complexity function itself denotes the number of
scalar operations. But it is common practice to design parallel algorithms
that introduce extra or redundant scalar computations to achieve speed-up.
The redundancy factor, Rp, of a p-processor algorithm is defined as the ratio
of the total scalar operations performed by the parallel algorithm to the
total scalar operations of the serial algorithm. Clearly, Rp ≥ 1.
Consider the problem of adding 8 numbers x1, x2, … , x8. A serial algorithm
solving the problem takes the following form:

1st step: y1 = x1 + x2
2nd step: y2 = y1 + x3
3rd step: y3 = y2 + x4
4th step: y4 = y3 + x5
5th step: y5 = y4 + x6
6th step: y6 = y5 + x7
7th step: y7 = y6 + x8
The same problem can be solved by parallel algorithms.

1st parallel algorithm

1st parallel step: y1 = x1 + x2, y2 = x3 + x4, y3 = x5 + x6, y4 = x7 + x8
2nd parallel step: z1 = y1 + y2, z2 = y3 + y4
3rd parallel step: w1 = z1 + z2
2nd parallel algorithm

1st parallel step: y1 = x1 + x2, y2 = x3 + x4
2nd parallel step: y3 = x5 + x6, y4 = x7 + x8
3rd parallel step: z1 = y1 + y2, z2 = y3 + y4
4th parallel step: w1 = z1 + z2
Let us calculate the above described performance measures for these parallel
algorithms.
Results of calculations for the 1st parallel algorithm are as follows:

p = 4, T = 7, T4 = 3, S4 = 7/3 ≈ 2.3, E4 = 2.3/4 ≈ 0.58.
Results of calculations for the 2nd parallel algorithm are as follows:

p = 4, T = 7, T4 = 4, S4 = 7/4 = 1.75, E4 = 1.75/4 ≈ 0.44.
It is obvious that the 1st parallel algorithm is the better of the two.
The number of parallel steps is called the height of a parallel algorithm.
On the other hand, the maximal number of operations performed within a
single parallel step is called its weight. If there are several parallel
algorithms, the one having minimal height is the best; usually this algorithm
also has the maximal weight. In order to design the best parallel algorithm,
in each step we try to find the maximal number of independent operations
and carry them out simultaneously.
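The pairwise (fan-in) scheme of the 1st parallel algorithm can be simulated in Python; the following sketch (the name fan_in_sum is chosen here, not part of the notes) treats each pass of the loop as one parallel step, so the pass count is the height of the algorithm:

```python
def fan_in_sum(values):
    """Sum a list by repeated pairwise combination, as in the 1st
    parallel algorithm: each round adds adjacent pairs, and all the
    additions within a round are independent (one parallel step)."""
    steps = 0
    while len(values) > 1:
        # one parallel step: combine values[i] with values[i+1]
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, height = fan_in_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, height)   # 36 computed in 3 parallel steps
```

For 8 inputs the height is 3, matching T4 = 3 of the 1st parallel algorithm above.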
3.2 Basic arithmetic operations
Parallelism in problem solving can be achieved in at least three different
levels: procedure level, expression level, and basic arithmetic operations level.
At the first two levels, the time complexity function merely indicates the
number of basic (arithmetic or comparison) operations. For example, at these
two levels, addition of two integers is taken as a unit operation,
independent of the size of the integers. But it is common knowledge that
it takes more time to add two large integers than to add two small ones. Thus,
for the overall speed-up, it is necessary to use the best possible algorithm to
perform these basic arithmetic operations. Below we analyse the number of
bit operations such as AND and OR, needed to perform basic arithmetic
operations in parallel. Since this is the lowest level at which parallelism can be
achieved, it is called bit level or micro level parallelism.
3.2.1 Analysis of parallel addition
Addition is the simplest of the basic arithmetic operations. Despite its
simplicity, the development of an "optimal" parallel addition algorithm is
far from simple.
Brent’s algorithm Let a = aN aN-1 … a2 a1 and b = bN bN-1 … b2 b1 be two
integers in binary to be added. Let s = a + b, where s = sN sN-1 … s2 s1. It is
well-known that si = ai ⊕ bi ⊕ ci-1, where c0 = 0, and for i = 1 to N,

ci = pi ∧ (gi ∨ ci-1)
pi = ai ∨ bi
gi = ai ∧ bi

where ⊕, ∨, and ∧ are the logical exclusive-OR, inclusive-OR and AND
operations, respectively. ci is called the carry from the ith bit position; pi and gi
are the carry propagate condition and the carry generate condition,
respectively. Clearly, we need c0 through cN-1 for computing s. Since c0 = 0, by
distributing p1 we see that

c1 = p1 ∧ g1
c2 = p2 ∧ (g2 ∨ (p1 ∧ g1))

and in general

ci = pi ∧ (gi ∨ (pi-1 ∧ (gi-1 ∨ … ∧ (p1 ∧ g1)))).
The propagate bits pi and the generate bits gi can all be computed in
parallel in one step using a linear array of N AND gates and N OR gates with
two inputs. Once all the carry bits are known, the si can be computed in just
two more steps. Thus, if TA(N) is the time required to add two N-bit binary
numbers and TC(N-1) is the time to compute the N-1 carry bits c1 through
cN-1 (that is, the height of the carry computation), then TA(N) = TC(N-1) + 3.
The carry bits can be computed in log N units of time by associative fan-in,
so TA(N) = log N + 3.
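The carry recurrence can be checked with a short Python sketch (an illustration, not part of the notes; the names add_with_carries and to_int, and the least-significant-bit-first convention, are assumptions). The carries are computed serially here; Brent's parallel algorithm evaluates the very same recurrence with a fan-in circuit in O(log N) time:

```python
def add_with_carries(a_bits, b_bits):
    """Bitwise addition via the propagate/generate recurrence
    p_i = a_i OR b_i, g_i = a_i AND b_i, c_i = p_i AND (g_i OR c_{i-1}).
    Bits are least-significant first."""
    n = len(a_bits)
    p = [a | b for a, b in zip(a_bits, b_bits)]   # carry propagate
    g = [a & b for a, b in zip(a_bits, b_bits)]   # carry generate
    c = [0] * (n + 1)                             # c[0] = 0
    for i in range(1, n + 1):
        c[i] = p[i - 1] & (g[i - 1] | c[i - 1])
    # sum bits s_i = a_i XOR b_i XOR c_{i-1}, plus the final carry-out
    return [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)] + [c[n]]

def to_int(bits):                                 # least-significant bit first
    return sum(b << i for i, b in enumerate(bits))

s = add_with_carries([1, 0, 1, 1], [1, 0, 1, 1])  # 1101 + 1101, i.e. 13 + 13
print(to_int(s))  # 26
```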
3.2.2 Parallel multiplication
We first analyse the complexity of the Grade-School algorithm for parallel
multiplication of two numbers.
Grade-School algorithm The following example illustrates this algorithm.
Let a = 1101 and b = 1101. Then the shifted partial products are

x1 = 0001101, x2 = 0000000, x3 = 0110100, x4 = 1101000.

Clearly, x1 + x2 = 00001101 and x3 + x4 = 10011100 can be computed in parallel,
and the final result is obtained in one more addition: s = (x1 + x2) + (x3 + x4) = 10101001.
The first step, forming the partial products, can be done in one unit of time.
The parallel addition algorithm takes O(log N) units of time to find the sum
of two 2N-bit integers, and the N partial products are summed pairwise by
associative fan-in in log N rounds. Therefore, the second step takes O(log²N)
units of time to compute the sum of the products. So, the overall complexity
(or height) of the Grade-School algorithm for the product of two N-bit
integers is O(log²N).
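The two phases can be sketched in Python (an illustrative simulation; the name grade_school_multiply is chosen here). The partial products are all formed independently, and the while-loop rounds mirror the fan-in tree of additions:

```python
def grade_school_multiply(a, b, n_bits):
    """Form the n_bits shifted partial products (one parallel step),
    then sum them pairwise as a balanced tree (log N rounds)."""
    partials = [(a if (b >> i) & 1 else 0) << i for i in range(n_bits)]
    while len(partials) > 1:
        # one fan-in round: adjacent pairs are added in parallel
        partials = [partials[i] + (partials[i + 1] if i + 1 < len(partials) else 0)
                    for i in range(0, len(partials), 2)]
    return partials[0]

print(grade_school_multiply(0b1101, 0b1101, 4))  # 169, i.e. 0b10101001
```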
Ofman-Wallace algorithm By cleverly exploiting the properties of the full
adder, Ofman and Wallace independently developed a faster algorithm for
multiplying two N-bit integers. The algorithm is based on the following
result.
Let x, y, and z be three N-bit integers. Then there exist two (N+1)-bit
integers a = aN+1 aN … a2 a1 and b = bN+1 bN … b2 b1, with aN+1 = 0 and
b1 = 0, such that a + b = x + y + z. The bits of a and b can be obtained from
x, y, and z in parallel by using the formulae

ai = xi ⊕ yi ⊕ zi,    bi+1 = (xi ∧ yi) ∨ ((xi ⊕ yi) ∧ zi)

for 1 ≤ i ≤ N. Since the computations of ai and bi+1 for different i are
independent, a and b can be obtained in three steps. As an example, let
x = 1011, y = 0111, and z = 1101. Then it can be easily verified that
a = 00001 and b = 11110.
Applying this reduction repeatedly, the product of two N-bit integers can be obtained in O(log N) steps.
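The three-to-two (carry-save) reduction can be written out in Python as a sketch (the name carry_save is chosen here). Every bit of a and of b depends only on the bits of x, y, z in one position, so the whole step takes constant parallel time:

```python
def carry_save(x, y, z, n_bits):
    """Reduce three n_bits-bit integers to two integers a, b with
    a + b == x + y + z. Each bit position is handled independently,
    so all positions can be processed in parallel."""
    a = 0
    b = 0
    for i in range(n_bits):
        xi, yi, zi = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        a |= (xi ^ yi ^ zi) << i                          # a_i = x_i XOR y_i XOR z_i
        b |= ((xi & yi) | ((xi ^ yi) & zi)) << (i + 1)    # carry goes to b_{i+1}
    return a, b

a, b = carry_save(0b1011, 0b0111, 0b1101, 4)  # 11 + 7 + 13
print(a, b, a + b)  # 1 30 31
```

The values 1 = 00001 and 30 = 11110 match the worked example in the text.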
Karatsuba-Ofman algorithm We now describe an algorithm based on the
divide-and-conquer strategy. Let a and b be N-bit integers. First, rewrite a
and b as

a = A1·2^(N/2) + A0,
b = B1·2^(N/2) + B0,

where Ai and Bi are N/2-bit integers: A0 and B0 are the remainders and A1
and B1 are the quotients when a and b are divided by 2^(N/2). Define
r0 = A0·B0
r1 = (A1 + A0)·(B1 + B0)
r2 = A1·B1

Then a·b = r2·2^N + (r1 − r2 − r0)·2^(N/2) + r0. Thus the product is obtained
by performing three multiplications of (N/2)-bit integers (the recursive
step) and a total of six additions/subtractions. Each addition/subtraction
can be done in O(log N) steps. The overall complexity of the algorithm is
O(log²N).
4 PRAM ALGORITHMS
4.1 A model of serial computation
The random access machine (RAM) is a model of a serial computer. A RAM
consists of a memory, a read-only input tape, a write-only output tape, and a
program. The program is not stored in memory and cannot be modified. The
input tape contains a sequence of integers. Every time an input value is read,
the input head advances one square; likewise, the output head advances
after every write. Memory consists of an unbounded sequence of registers,
designated r0, r1, r2, … . Each register can hold a single integer. Register r0 is
the accumulator, where computations are performed. The allowed RAM
operations include load, store, read, write, add, subtract, multiply, divide,
test, jump, and halt.
4.2 PRAM model of parallel computation
A PRAM consists of a control unit, global memory, and an unbounded set of
processors, each with its own private memory. Although active processors
execute identical instructions, every processor has a unique index, and the
value of a processor’s index can be used to enable or disable the processor or
to influence which memory location is accessed.
A PRAM computation begins with the input stored in global memory
and a single active processing element. During each step of the computation
an active, enabled processor may read a value from a single private or global
memory location, perform a single RAM operation, and write into one local or
global memory location. Alternatively, during a computation step a processor
may activate another processor. All active, enabled processors must execute
the same instruction, albeit on different memory locations. The computation
terminates when the last processor halts.
[Figure: the PRAM model — a control unit and processors P1, P2, … , Pn, each with its own private memory, connected through an interconnection network to a shared global memory.]
Various PRAM models differ in how they handle read or write conflicts; i.e., when
two or more processors attempt to read from, or write to, the same global
memory location. Most of the results in the research literature have been
based upon one of the following models:
1. EREW (Exclusive Read Exclusive Write): read or write conflicts are
not allowed.
2. CREW (Concurrent Read Exclusive Write): concurrent read is
allowed; multiple processors may read from the same global
memory location during the same instruction step. Write conflicts
are not allowed.
3. CRCW (Concurrent Read Concurrent Write): concurrent reading
and concurrent writing allowed. A variety of CRCW models exist
with different policies for handling concurrent writes to the same
global address. We list three different models:
a. COMMON. All processors concurrently writing into the
same global address must be writing the same value.
b. ARBITRARY. If multiple processors concurrently write to the
same global address, one of the competing processors is
arbitrarily chosen as the “winner,” and its value is written
into the register.
c. PRIORITY. If multiple processors concurrently write to the
same global address, the processor with the lowest index
succeeds in writing its value into the memory location.
The EREW PRAM model is the weakest. Clearly a CREW PRAM can execute
any EREW PRAM algorithm in the same amount of time; the concurrent read
facility is simply not used. Similarly, a CRCW PRAM can execute any CREW
PRAM algorithm in the same amount of time.
The PRIORITY PRAM model is the strongest. Any algorithm designed
for the COMMON PRAM model will execute with the same complexity on
the ARBITRARY PRAM and PRIORITY PRAM models as well, for if all
processors writing to the same location write the same value, choosing an
arbitrary processor as the “winner” yields the same result. Likewise, if an
algorithm executes correctly when an arbitrary processor is chosen as the
“winner,” the processor with the lowest index is as reasonable an alternative
as any other. Hence any algorithm designed for the ARBITRARY PRAM
model will execute with the same time complexity on the PRIORITY PRAM.
4.3 PRAM algorithms
PRAM algorithms have two phases. In the first phase a sufficient number of
processors are activated, and in the second phase these activated processors
perform the computation in parallel.
4.3.1 Parallel reduction on EREW PRAM
Given a set of n values a1, a2, … , an and an associative binary operator +,
reduction is the process of computing a1 + a2 + … + an. Parallel summation
is an example of a reduction operation: pairs are added in parallel, then
pairs of partial sums, and so on. For example, 4+3+8+2+9+1+0+5+6+3 = 41
can be reduced in this way in four parallel steps.
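A sketch of the EREW reduction in Python (an illustrative simulation; the name pram_reduce is chosen here). In round j the processor at each index that is a multiple of 2^(j+1) adds in the cell 2^j positions to its right, so no two processors ever read or write the same cell:

```python
def pram_reduce(a):
    """EREW PRAM reduction: in each round, disjoint pairs of cells are
    combined, so all reads and writes are exclusive. log n rounds
    leave the total in a[0]."""
    a = list(a)
    stride = 1
    while stride < len(a):
        for i in range(0, len(a) - stride, 2 * stride):  # one parallel step
            a[i] = a[i] + a[i + stride]
        stride *= 2
    return a[0]

print(pram_reduce([4, 3, 8, 2, 9, 1, 0, 5, 6, 3]))  # 41
```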
4.3.2 Prefix sums on EREW PRAM
Given a set of n values a1, a2, … , an, and an associative binary operator +, the
prefix sums problem is to compute the n quantities:
a1
a1+ a2
a1+ a2+ a3
…
a1+ a2+ a3+ … + an
[Figure: prefix sums of 4, 3, 8, 2, 9, 1, 0, 5, 6, 3 on an EREW PRAM. Successive rows show the array after each round; the last two rows are 4, 7, 15, 17, 26, 27, 27, 32, 34, 34 and finally 4, 7, 15, 17, 26, 27, 27, 32, 38, 41.]
For example, given the operation + and the integers 3, 1, 0, 4, and 2, the
prefix sums of the integers are 3, 4, 4, 8, 10. The realization of prefix sums on
an EREW PRAM for 4, 3, 8, 2, 9, 1, 0, 5, 6, and 3 is shown in the figure above.
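The rounds in the figure can be simulated in Python (an illustration; the name prefix_sums is chosen here). In round j every element with index at least 2^j adds in the old value 2^j positions to its left; all additions in a round are independent:

```python
def prefix_sums(a):
    """EREW prefix sums: ceil(log2 n) rounds of independent additions.
    Reading from a snapshot of the previous round models the PRAM
    read-then-write discipline."""
    a = list(a)
    d = 1
    while d < len(a):
        prev = list(a)                  # values as of the previous round
        for i in range(d, len(a)):      # these run in parallel
            a[i] = prev[i - d] + prev[i]
        d *= 2
    return a

print(prefix_sums([4, 3, 8, 2, 9, 1, 0, 5, 6, 3]))
# [4, 7, 15, 17, 26, 27, 27, 32, 38, 41]
```

The output matches the final row of the figure above.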
4.3.3 List ranking
Consider a problem of finding, for each of n elements on a linked list, the
suffix sums of the last i elements on the list, where i=1, … , n . The suffix sums
problem is a variant of the prefix sums problem, where an array is replaced
by a linked list, and the sums are computed from the end, rather than from
8
the beginning. If the values are 0 and 1, the problem is called the list ranking
problem.
If we associate a processor with every list element and jump pointers in
parallel, the distance to the end of the list is cut in half through the instruction
next[i]:=next[next[i]]. Hence a logarithmic number pointer jumping steps are
sufficient to collapse the list so that every list element points to the last
element.
[Figure: suffix sums by pointer jumping on a linked list with values 4, 3, 8, 2, 9, 1, 0, 5, 6, 3; as pointers are doubled each element accumulates its suffix sum, and the first element finally holds 41.]
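Pointer jumping can be sketched in Python (an illustrative simulation; the names suffix_sums, next_ and value are chosen here, with None marking the end of the list). Each round, every processor adds its successor's value into its own and then jumps its pointer:

```python
def suffix_sums(next_, value):
    """Pointer jumping: each round performs value[i] += value[next[i]]
    and next[i] := next[next[i]] for all elements at once, reading from
    snapshots so all 'processors' see the previous round's state."""
    n = len(value)
    value, next_ = list(value), list(next_)
    for _ in range(max(1, n.bit_length())):   # about log2(n) rounds suffice
        old_v, old_n = list(value), list(next_)
        for i in range(n):                    # these run in parallel
            if old_n[i] is not None:
                value[i] = old_v[i] + old_v[old_n[i]]
                next_[i] = old_n[old_n[i]]
    return value

# list 4 -> 3 -> 8 -> 2 stored in array order; element i points to i+1
print(suffix_sums([1, 2, 3, None], [4, 3, 8, 2]))  # [17, 13, 10, 2]
```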
Preorder tree traversal Sometimes it is appropriate to reduce a
complicated-looking problem to a simpler one for which a fast parallel
algorithm is already known. The problem of numbering the vertices of a
rooted tree in preorder (depth-first search order) is a case in point. Note that
a preorder tree traversal visits the nodes of the given tree according to
the principle root-left-right.
The algorithm works in the following way. In step one the algorithm
constructs a singly-linked list. Each element of the singly-linked list
corresponds to a downward or upward edge traversal of the tree.
In step two the algorithm assigns weights to the elements of the newly
created singly-linked list. In the preorder traversal a vertex is labelled as
soon as it is encountered via a downward edge traversal, so every list
element corresponding to a downward edge gets the weight 1, meaning that
the node count is incremented when this edge is traversed. List elements
corresponding to upward edges get the weight 0, because the node count
does not increase when the preorder traversal works its way back up the
tree through previously labelled nodes. In the third step we compute, for
each element of the singly-linked list, the rank of that list element. In step
four the processors associated with downward edges use the ranks they
have computed to assign each vertex its preorder traversal number.
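The four steps can be sketched in Python (an illustration; the function name, the dictionary representation of the tree, and the example tree itself are assumptions made here). The weighted-sum step is done with a running sum; on a PRAM it would be the pointer-jumping ranking described above:

```python
def preorder_numbers(children, root):
    """Step 1: build the list of downward/upward edge traversals
    (an Euler tour). Steps 2-4: weight downward edges 1, upward edges 0,
    sum the weights along the list (serially here), and let each
    downward edge assign its target vertex a preorder number."""
    tour = []
    def walk(v):
        for c in children.get(v, []):
            tour.append(('down', c))
            walk(c)
            tour.append(('up', c))
    walk(root)
    number = {root: 1}
    count = 1
    for kind, v in tour:
        if kind == 'down':        # weight 1; 'up' elements weigh 0
            count += 1
            number[v] = count
    return number

tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'E': ['G', 'H'], 'C': ['F']}
print(preorder_numbers(tree, 'A'))
# A:1 B:2 D:3 E:4 G:5 H:6 C:7 F:8
```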
Merging two sorted lists Many PRAM algorithms achieve low time
complexity by performing more operations than an optimal RAM algorithm.
The problem of merging two sorted lists is another example.
[Figure: preorder tree traversal — (a) a rooted tree on the vertices A through H; (b) the singly-linked list of downward and upward edge traversals, with downward edges weighted 1 and upward edges weighted 0; (c) the rank of each list element; (d) the resulting preorder numbers of the vertices A through H.]
The parallel algorithm assigns one processor to each list element. So,
altogether, there are 2n processors, each keeping track of a particular entry
in the lists. Every processor finds the position of its own element in the
other list using binary search. Because an element’s index in its own list is
already known, its place in the merged list is found by adding the two
indices. All elements can then be inserted into the merged list by their
processors in constant time.
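The idea can be sketched in Python (an illustrative simulation; the name parallel_merge is chosen here, and each loop iteration plays the role of one processor). Ties between equal elements are broken by ranking B's copies after A's, which keeps all output positions distinct:

```python
from bisect import bisect_left, bisect_right

def parallel_merge(A, B):
    """CREW merge: the processor holding A[i] binary-searches B for the
    count of B-elements that precede it; its output slot is the sum of
    the two indices. B's processors do the symmetric search in A."""
    out = [None] * (len(A) + len(B))
    for i, x in enumerate(A):           # each iteration is one processor
        out[i + bisect_left(B, x)] = x
    for j, y in enumerate(B):
        out[j + bisect_right(A, y)] = y
    return out

print(parallel_merge([1, 5, 7, 9], [2, 5, 6, 8]))
# [1, 2, 5, 5, 6, 7, 8, 9]
```

Each processor does O(log n) work, so the merge takes O(log n) time with 2n processors, versus O(n) total operations on a serial RAM.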
Graph coloring Determining whether the vertices of a graph can be colored
with c colors so that no two adjacent vertices are assigned the same color is
called the graph coloring problem. To solve the problem quickly, we can
create a processor for every possible coloring of the graph; each processor
then checks whether the coloring it represents is valid.
[Figure: graph coloring on a CREW PRAM — the adjacency matrix of a 3-vertex path graph and, for each of the 2³ = 8 candidate colorings, the candidate array before and after checking.]

Coloring   Initial value   After checking
0,0,0      1               0
0,0,1      1               0
0,1,0      1               1
0,1,1      1               0
1,0,0      1               0
1,0,1      1               1
1,1,0      1               0
1,1,1      1               0

Number of valid colorings: 2
Assume that the graph has n vertices. Given the n×n adjacency matrix and a
positive constant c, a processor is created for every possible coloring of the
graph. Each processor initially sets its value in the n-dimensional candidate
array to 1. It then determines whether, for the particular assignment of
colors to vertices it represents, two adjacent vertices have been given the
same color: A[j,k] = 1 means that vertices j and k are adjacent, and ij = ik
means that vertices j and k have the same color. If a processor detects an
invalid coloring, it sets its value in the candidate array to 0. After n²
comparisons, if an element of the candidate array is still 1, then the
corresponding coloring is valid. By summing over all c^n elements of the
candidate array, it can be determined whether there exists a valid coloring.
The CREW PRAM algorithm for graph coloring appears below.