IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 2, NO. 2, APRIL 1991
Clustering on a Hypercube Multicomputer
Sanjay Ranka, Member, IEEE, and Sartaj Sahni, Fellow, IEEE
Abstract-In this paper, squared error clustering algorithms for SIMD hypercubes are presented. These algorithms are asymptotically faster than previously known algorithms and require less memory per PE. For a clustering problem with N patterns, M features per pattern, and K clusters, our algorithms complete in O(K + log NMK) steps on NM processor hypercubes. This is optimal up to a constant factor. We extend these results to the case when NMK processors are available. Experimental results from an MIMD medium grain hypercube are also presented.

Index Terms-Clustering, feature vector, hypercube multicomputer, pattern recognition, MIMD, SIMD.
I. INTRODUCTION
Feature vector is a basic notion of pattern recognition. A feature vector v is a set of measurements (v_1, v_2, ..., v_M) which maps the important properties of an image into a Euclidean space of dimension M [1]. Clustering partitions a set of feature vectors into groups. It is a valuable tool in exploratory pattern analysis and helps in making hypotheses about the structure of data. It is important in syntactic pattern recognition, image segmentation, and registration. There are many methods for clustering feature vectors [1], [3], [5], [6], [12], [13]. One popular technique is squared error clustering.
Let N represent the number of patterns which are to be partitioned and let M represent the number of features per pattern. Let F[0..N-1, 0..M-1] be the feature matrix such that F[i, j] denotes the value of the jth feature in the ith pattern. Let S_1, S_2, ..., S_K be K clusters. Each pattern belongs to exactly one of the clusters. Let C[i] represent the cluster to which pattern i belongs. Thus, we can define S_k as

S_k = { i | C[i] = k },  0 ≤ k ≤ K - 1.

Further, |S_k| is the cardinality or size of the partition S_k. The center of cluster k is a 1 × M vector defined as

center[k, j] = (1/|S_k|) Σ_{i ∈ S_k} F[i, j],  0 ≤ j < M.

The squared distance d² between pattern i and cluster k is

d²[i, k] = Σ_{j=0}^{M-1} (F[i, j] - center[k, j])².

The squared error for the kth cluster is defined as

E²[k] = Σ_{i ∈ S_k} d²[i, k],  0 ≤ k < K
Manuscript received May 25, 1990; revised October 30, 1990. This work was supported in part by the National Science Foundation under Grants DCR-8420935 and MIP 86-17374.
S. Ranka is with the School of Computer Science, Syracuse University, Syracuse, NY 13244.
S. Sahni is with the Department of Computer and Information Sciences, University of Florida, Gainesville, FL 32611.
IEEE Log Number 9042569.
procedure CLUSTER(N, M, K)
{iteratively improve the clustering}
Step 1: [Cluster Reassignment]
  NewCluster[i] := q such that d²[i, q] = min_{0≤k<K} {d²[i, k]}, 0 ≤ i < N
  {ties are broken arbitrarily}
Step 2: [Termination Criterion and Cluster Update]
  if NewCluster[i] = C[i], 0 ≤ i < N, then terminate;
  else [C[i] := NewCluster[i], 0 ≤ i < N, and goto Step 3]
Step 3: [Cluster Center Update]
  Recompute center[k, j], 0 ≤ k < K, 0 ≤ j < M, using the new cluster assignments.
end {of CLUSTER}

Fig. 1. One pass of the iterative cluster improvement algorithm.
and the squared error for the clustering is

ERROR = Σ_{k=0}^{K-1} E²[k].
In the clustering problem, we are required to partition the N patterns such that the squared error for the clustering is minimum. In practice, this is done by trying out several different values of K. For each K, the clusters are constructed using an iterative refinement technique in which we begin with an initial set of K clusters, move each pattern to the cluster with which it has the minimum squared distance, and recompute the cluster centers. The last two steps are iterated until no pattern is moved from its current cluster. The final clustering obtained in this way is not guaranteed to be a global minimum. In fact, different initial clusterings can result in different final clusters. One pass of the algorithm is given in Fig. 1.
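For concreteness, the following is a minimal sequential C sketch of one pass of this iterative improvement (essentially one K-means step). The function and array names are our own illustration and are not from the paper; the paper's contribution is carrying out these steps in parallel on a hypercube.

#include <stdlib.h>

/* One pass of squared error cluster improvement (sequential sketch).
   F is N x M (row major), center is K x M, C holds current clusters.
   Returns 1 if any pattern changed cluster, 0 otherwise. */
int cluster_pass(int N, int M, int K,
                 const double *F, double *center, int *C)
{
    int changed = 0;
    /* Step 1: reassign each pattern to its nearest cluster center. */
    for (int i = 0; i < N; i++) {
        int best = 0;
        double bestd = -1.0;
        for (int k = 0; k < K; k++) {
            double d2 = 0.0;
            for (int j = 0; j < M; j++) {
                double t = F[i*M + j] - center[k*M + j];
                d2 += t * t;
            }
            if (bestd < 0.0 || d2 < bestd) { bestd = d2; best = k; }
        }
        if (C[i] != best) { C[i] = best; changed = 1; }
    }
    if (!changed) return 0;                 /* Step 2: termination criterion */
    /* Step 3: recompute the cluster centers as coordinatewise means. */
    double *sum = calloc((size_t)K * M, sizeof *sum);
    int *cnt = calloc((size_t)K, sizeof *cnt);
    for (int i = 0; i < N; i++) {
        cnt[C[i]]++;
        for (int j = 0; j < M; j++) sum[C[i]*M + j] += F[i*M + j];
    }
    for (int k = 0; k < K; k++)
        for (int j = 0; j < M; j++)
            if (cnt[k] > 0)                 /* empty cluster keeps old center */
                center[k*M + j] = sum[k*M + j] / cnt[k];
    free(sum); free(cnt);
    return 1;
}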
One pass of the cluster improvement algorithm takes O(NMK) time on a uniprocessor computer. Since several passes are needed before an acceptable K and corresponding clustering is obtained, the computational requirements of the algorithm are such that several researchers have studied parallel implementations of one pass of the algorithm. For example, Ni and Jain [10] have developed systolic arrays for clustering, Hwang and Kim [7] have developed clustering algorithms for multiprocessors with orthogonally shared memory, and Li and Fang [8] have developed an SIMD hypercube algorithm. This latter algorithm requires that each processor have O(K) memory and has a run time of O(K log(NM)) on an SIMD hypercube with NM PE's.

In this paper, we first improve upon the algorithm of [8] by developing one that runs in O(K + log NM) time. We present two algorithms with this run time. Both use NM PE's. However, one requires O(K) memory per PE and the other O(1). Next, we show how these may be extended to the case when NMK PE's are available. In this case, the run time becomes O(log NMK). Finally, we present a parallel clustering algorithm suitable for use on an MIMD medium-grained hypercube. Experimental results using the NCUBE hypercube are also presented.
II. PRELIMINARIES
A. Hypercube Multicomputer
The important features of an SIMD hypercube and the programming notation we use are:

There are P = 2^p processing elements connected together via a hypercube interconnection network (to be described later). Each PE has a unique index in the range [0, 2^p - 1]. We shall use brackets ([ ]) to index an array and parentheses (( )) to index PE's. Thus, A[i] refers to the ith element of array A and A(i) refers to the A register of PE i. Also, A[j](i) refers to the jth element of array A in PE i. The local memory in each PE holds data only (i.e., no executable instructions). Hence, PE's need to be able to perform only the basic arithmetic operations (i.e., no instruction fetch or decode is needed).

A p-dimensional hypercube network connects 2^p PE's. Let i_{p-1} i_{p-2} ... i_0 be the binary representation of the PE index i. Let \bar{i}_k be the complement of bit i_k. A hypercube network directly connects pairs of processors whose indexes differ in exactly one bit; i.e., processor i_{p-1} i_{p-2} ... i_0 is connected to processors i_{p-1} ... \bar{i}_k ... i_0, 0 ≤ k ≤ p - 1. We use the notation i^{(b)} to represent the number that differs from i in exactly bit b.

There is a separate program memory and control unit. The control unit performs instruction sequencing, fetching, and decoding. In addition, instructions and masks are broadcast by the control unit to the PE's for execution. An instruction mask is a Boolean function used to select certain PE's to execute an instruction. For example, in the instruction

A(i) := A(i) + 1,  (i_0 = 1)

(i_0 = 1) is a mask that selects only those PE's whose index has bit 0 equal to 1; i.e., odd indexed PE's increment their A registers by 1. Sometimes, we shall omit the PE indexing of registers. So, the above statement is equivalent to the statement:

A := A + 1,  (i_0 = 1)
Interprocessor assignments are denoted using the symbol ←, while intraprocessor assignments are denoted using the symbol :=. Thus, the assignment statement

B(i^{(2)}) ← B(i),  (i_2 = 0)

is executed only by the processors with bit 2 equal to 0. These processors transmit their B register data to the corresponding processors with bit 2 equal to 1.

In a unit route, data may be transmitted from one processor to another if it is directly connected. We assume that the links in the interconnection network are unidirectional. Hence, at any given time, data can be transferred either from PE i (i_b = 0) to PE i^{(b)} or from PE i (i_b = 1) to PE i^{(b)}. Hence, the instruction B(i^{(b)}) ← B(i), (i_b = 0) takes one unit route, while the instruction B(i^{(b)}) ← B(i) takes two unit routes.

Since the asymptotic complexity of all our algorithms is determined by the number of unit routes, our complexity analysis will count only these.

B. Hypercube Embedding of F and center

Throughout this paper, we shall assume that N, M, and K are powers of 2. This assumption greatly simplifies our discussion as the number of processors in a hypercube is always a power of two. In case N, M, and K are not powers of 2, we can do the following:
1) If N is not a power of 2, introduce additional patterns so that the total number of patterns becomes a power of 2. Note that this can be done by at most doubling the number of patterns. These additional patterns have the same feature vector which is chosen to be far removed from the feature vectors of the remaining patterns. As a result, these additional patterns cluster together in a separate cluster and do not affect the clustering of the original patterns.
2) If M is not a power of 2, introduce additional features to the feature vector so that the total number of features becomes a power of 2. Note that this can be done by at most doubling the number of features. The values for these additional features are set to zero for all patterns. As a result, the additional features do not affect the clustering.
3) If K is not a power of 2, then we replace K by the next power of 2, say J. We start with J - K clusters with centers such that no pattern can be assigned to one of these clusters. The center with all coordinates ∞ can be used for this purpose for all J - K clusters.
The above changes do not affect the final clusters obtained
and since they at most double N , M , and K , the asymptotic
complexity of the resulting algorithms is also not affected.
Assume that the number of PE's in the hypercube is NM. The hypercube may be viewed as an N × M grid of PE's as in Fig. 2. If N = 2^n and M = 2^m, then a PE index has n + m bits in it. The first n bits give the row number and the last m the column number. Note that each row forms a subhypercube with M PE's and each column forms a subhypercube with N PE's. We shall use the notation PE(i, j) to refer to the PE in row i and column j. Its index is obtained by appending the bits in j to those in i.

The initial configuration we assume for our algorithms has F[i, j] in the F register of PE(i, j), i.e., F(i, j) = F[i, j], 0 ≤ i < N, 0 ≤ j < M. Also, the center matrix is stored in the top K rows of the N × M hypercube such that center(i, j) = center[i, j], 0 ≤ i < K, 0 ≤ j < M (see Fig. 2).
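In code, this row-major embedding and the single-bit neighbor relation are one-liners. The following C sketch is our own illustration (the helper names are hypothetical, not from the paper):

/* Index arithmetic for an N x M grid embedded in an (n+m)-cube,
   N = 2^n, M = 2^m.  PE(i, j) has index (i << m) | j, and the
   neighbor across dimension b is obtained by complementing bit b. */
static inline int pe_index(int i, int j, int m) { return (i << m) | j; }
static inline int neighbor(int idx, int b)      { return idx ^ (1 << b); }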
C. Basic Data Manipulation Operations

1) Data Broadcast: In a data broadcast, data originate at one PE of a subhypercube and are to be transmitted to the remaining PE's of the subhypercube. If the subhypercube has P PE's, a data broadcast can be done in log P unit routes [2].

2) Window Broadcast: Here, data originate in an R × S subhypercube of a larger T × U subhypercube and are to be replicated over this larger subhypercube. The larger subhypercube may be naturally tiled by R × S windows and essentially data from one window are to be broadcast to the others. This can be done using log(TU/(RS)) unit routes [2].

3) Data Sum: Assume the window tiling of Section II-C2. For each window, the data in the A registers of the PE's in this window are to be summed and the result left in one of the PE's (same relative PE for each window) of the window. For windows of size R × S, the summing operation takes log(RS) unit routes [2].

4) Window Sum: Assume the tiling of Section II-C2. This time, one of the R × S windows is to sum up the A register values in corresponding PE's of all the windows; i.e., the PE in position (i, j) of a designated window is to accumulate the sum of the A registers of the PE's in position (i, j) of all the TU/(RS) windows, 0 ≤ i < R, 0 ≤ j < S. This can be done using log(TU/(RS)) unit routes [2].
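As a concrete illustration of these logarithmic-cost primitives, the following C fragment simulates a data sum over a subhypercube of 2^d PE's by recursive halving; a data broadcast is the same loop run in the opposite direction. This is our own sketch of the standard technique of [2], not code from the paper.

/* Simulated data sum on a 2^d PE subhypercube: after the loop,
   A[0] holds the sum of all A values.  Each pass over dimension b
   models one unit route along that dimension. */
void data_sum(double *A, int d)
{
    for (int b = d - 1; b >= 0; b--)           /* halve the cube */
        for (int i = 0; i < (1 << d); i++)
            if ((i & (1 << b)) == 0)           /* mask: bit b = 0 */
                A[i] += A[i | (1 << b)];       /* A(i) += A(i^(b)) */
}

Replacing the addition by a min (and carrying an index along) gives the minimum-finding variant used later for cluster reassignment.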
Fig. 2. A 32 PE hypercube viewed as an 8 × 4 grid. (Each PE is labeled with its 5-bit index; the first three bits give the row and the last two the column.)

5) Data Circulation: The data in the A registers of each of the R processors in an R processor subhypercube are to be circulated through each of the remaining R - 1 PE's in the subhypercube. This can be accomplished using R - 1 unit routes. The circulation algorithm uses the exchange sequence X_q, R = 2^q, defined recursively as [2]:

X_1 = 0,  X_q = X_{q-1}, q - 1, X_{q-1}  (q > 1)

(e.g., X_2 = 0, 1, 0 and X_3 = 0, 1, 0, 2, 0, 1, 0). This sequence essentially treats a q-dimensional hypercube as two (q - 1)-dimensional hypercubes. Data circulation is done in each of these in parallel using X_{q-1}. Next an exchange is done along bit q - 1. This causes the data in the two halves to be swapped. The swapped data are again circulated in the two half hypercubes using X_{q-1}. Let f(q, i) be the ith number (left to right) in the sequence X_q, 1 ≤ i < 2^q. The resulting SIMD data circulation algorithm is given in Fig. 3. Here, it is assumed that the r bits that define the subhypercube are bits 0, 1, 2, ..., r - 1. Because of our assumption of unidirectional links, each iteration of the for loop of Fig. 3 takes two unit routes. Hence, Fig. 3 takes 2(P - 1) unit routes. The function f can be computed by the control processor in O(P) time and saved in an array of size P - 1 (actually it is convenient to compute f on the fly using a stack of height log P). The following lemma allows each processor to compute the origin of the current A value.

procedure Circulate(A)
{data circulation}
for i := 1 to P - 1 do
  A(j^{(f(p, i))}) ← A(j)
end

Fig. 3. Data circulation in an SIMD hypercube.

Lemma 1 [11]: Let A_0, A_1, ..., A_{2^p - 1} be the values in A(0), A(1), ..., A(2^p - 1) initially. Let index(j, i) be such that A[index(j, i)] is in A(j) following the ith iteration of the for loop of Fig. 3. Initially, index(j, 0) = j. For every i, i > 0, index(j, i) = index(j, i - 1) ⊕ 2^{f(p, i)} (⊕ is the Exclusive OR operator).

6) Consecutive Sum: ConsecutiveSum(X, S) works in row windows of size 1 × S. Each of the S PE's in such a window has S values X[i], 0 ≤ i < S. The jth PE in the window computes

A(j) = Σ_{i=0}^{S-1} X[j](i),  0 ≤ j < S.

The ith PE in the S block begins by initializing A to X[i]. Next, the A values of the S PE's in the block circulate through the block, accumulating the remaining X's needed. Finally, the A's are moved back to the originating PE's. The algorithm is given in Fig. 4. At all times, in(p) gives the index of the PE at which the current A originated. The correctness of this statement follows from Lemma 1. The number of unit routes required is 2S.

procedure ConsecutiveSum(X, S)
begin
  in(p) := p mod S;
  A(p) := X[in(p)](p);
  for i := 1 to S - 1 do
  begin
    j := f(log S, i);
    A(p) ← A(p^{(j)});
    in(p) := in(p) ⊕ 2^j;
    A(p) := A(p) + X[in(p)](p);
  end
  {move A's back to originating PE's}
  j := log S - 1;
  A(p^{(j)}) ← A(p);
end; {of ConsecutiveSum}

Fig. 4. Consecutive sum.

7) Term Computation: This is done independently in all K × 1 column windows of the NM PE hypercube. The ith PE of each such window has an F and a center value, F(i) and center(i), initially. Each PE, i, of the window computes the K values

term[k](i) = (F(i) - center(k))²,  0 ≤ k < K.

This computation is done by circulating the center values through the K × 1 window as in Fig. 5. The number of unit routes is 2(K - 1).
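The exchange sequence and the origin bookkeeping of Lemma 1 are easy to check in software. The following C sketch is our own illustration (not code from the paper): it generates X_q and tracks index(j, i) exactly as the lemma states.

#include <stdio.h>

/* Build the exchange sequence X_q into seq[0 .. 2^q - 2]. */
static int build_X(int q, int *seq, int pos)
{
    if (q == 1) { seq[pos++] = 0; return pos; }
    pos = build_X(q - 1, seq, pos);
    seq[pos++] = q - 1;
    return build_X(q - 1, seq, pos);
}

int main(void)
{
    enum { Q = 3, P = 1 << Q };        /* an 8 PE subhypercube */
    int seq[P - 1], origin[P];
    build_X(Q, seq, 0);
    for (int j = 0; j < P; j++) origin[j] = j;   /* index(j, 0) = j */
    for (int i = 0; i < P - 1; i++)
        /* circulation step i+1: every PE j receives from PE j ^ 2^seq[i],
           so the origin of the value now at j flips bit seq[i] (Lemma 1) */
        for (int j = 0; j < P; j++) origin[j] ^= 1 << seq[i];
    /* after 2^Q - 1 steps every PE has seen every other PE's datum */
    for (int j = 0; j < P; j++)
        printf("PE %d last holds A[%d]\n", j, origin[j]);
    return 0;
}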
8) Distance Computation: This is done independently in all S × S windows, where S is a parameter to the operation. The PE in position (i, j) of the window computes

dist(i, j) = Σ_{q=0}^{S-1} (F(i, q) - center(j, q))²,  0 ≤ i < S, 0 ≤ j < S.

procedure TermComputation(M, K)
begin
  {index p of the processor at (i, j) is p = iM + j}
  S(p) := center(i, j);
  in(p) := i mod K;
  term[in(p)](p) := (F(p) - S(p))²;
  for q := 1 to K - 1 do
  begin
    l := f(log K, q);
    S(p) ← S(p^{(l)});
    in(p) := in(p) ⊕ 2^l;
    term[in(p)](p) := (F(p) - S(p))²;
  end
end; {of TermComputation}

Fig. 5. Term computation.

Step 1: Broadcast the K × M cluster center window to the remaining N/K - 1 K × M windows.
Step 2: The PE's in each K × 1 column window compute term[q], 0 ≤ q < K.
Step 3: The PE's in each 1 × K row window compute A := ConsecutiveSum(term, K).
Step 4: The A values are summed up over the M/K 1 × K windows in each row using a window sum. This results in d2(i, k) = Σ_{j=0}^{M-1} term[k](i, j), 0 ≤ k < K.
Step 5: Compute NewCluster(i, 0) = NewCluster[i] by finding q such that d2(i, q) = min_{0≤k<K} {d2(i, k)}.
Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 6. O(K) memory cluster assignment, K ≤ M.

Fig. 7. Complexity analysis of Fig. 6: Step 1: log(N/K); Step 2: 2K - 2; Step 3: 2K; Step 4: log(M/K); Step 5: 2 log K; Step 6: log M.

We note that the computation of dist is quite similar to the matrix multiplication C = A * B, where C, A, and B are S × S matrices. In fact, if we let A = F and B = center^T (i.e., the transpose of center) and replace a_{iq} * b_{qj} by (a_{iq} - b_{qj})² in the definition of matrix product, we end up with the definition of dist. Hence, dist may be computed using the following modification of the S² PE matrix product algorithm of [2]:

Step 1: Compute B = transpose of the center matrix.
Step 2: Use the matrix product algorithm of [2] to "multiply" F and B. However, each time two terms of F and B are to be multiplied, compute the square of their difference instead.

The number of unit routes required to compute dist this way is 4S + O(log S).
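The "multiply becomes squared difference" substitution is easy to see sequentially. The sketch below is our own illustration (the names are hypothetical): it computes dist for an S × S window exactly as the modified product defines it; the parallel algorithm of [2] evaluates the same sums in 4S + O(log S) routes.

/* dist = "matrix product" of F (S x S) and centerT = transpose of
   center, with each product a*b replaced by (a - b)^2. */
void dist_window(int S, const double *F, const double *centerT, double *dist)
{
    for (int i = 0; i < S; i++)
        for (int j = 0; j < S; j++) {
            double d = 0.0;
            for (int q = 0; q < S; q++) {
                double t = F[i*S + q] - centerT[q*S + j]; /* b_{qj} */
                d += t * t;
            }
            dist[i*S + j] = d;  /* squared distance, pattern i to cluster j */
        }
}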
9) Summing Random Access Write (SRAW): This is done in K × 1 column windows. The K PE's of a window originate data A(i) that are to be sent to the dest(i)th PE in the window. If two or more PE's have data that are to be sent to the same PE, then their sum is needed at the destination PE. Thus, following the operation, the jth PE in the K × 1 window has the value

Σ_{i : dest(i) = j} A(i).

This can be done in O(log²K) unit routes by a modification of the random access write algorithm of Nassimi and Sahni [9]. In this modification, when two A's reach one PE, they are replaced by a single A which is the sum of the two.
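Semantically, an SRAW is a scatter with addition. The two-line C loop below (our own illustration) states the postcondition; the point of the algorithm of [9] is to realize it in O(log²K) unit routes on the cube rather than by a sequential loop.

/* Sequential statement of the SRAW postcondition for one K x 1 window:
   result[j] accumulates every A[i] whose destination is j. */
void sraw(int K, const double *A, const int *dest, double *result)
{
    for (int j = 0; j < K; j++) result[j] = 0.0;
    for (int i = 0; i < K; i++) result[dest[i]] += A[i];
}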
III. NM PE CLUSTERING

We consider two cases for an NM PE SIMD hypercube. In one, each PE has O(K) memory. In the other, each PE has O(1) memory. In both cases, we assume the initial data configuration of Section II-B. For each of these two cases, we develop algorithms for the cluster reassignment (Step 1) and center update (Step 3) steps of Fig. 1.

A. Cluster Reassignment

1) O(K) Memory: The cases K ≤ M and K > M result in two slightly different algorithms. The algorithm for the case K ≤ M is given in Fig. 6 while that for the case K > M is given in Fig. 8. In Fig. 6, we begin by broadcasting the K × M cluster center matrix to the remaining N/K - 1 K × M windows of the N × M hypercube. This is done using a window broadcast. Next, in Step 2, PE(i, j) computes term[q] = (F[i, j] - center[q, j])², 0 ≤ q < K. This is done by circulating the center values through column windows of size K × 1 as discussed in Section II-C5. The objective of Steps 3 and 4 is to compute d²(i, k), 0 ≤ i < N, 0 ≤ k < K; d²(i, k) is stored in the d2 register of PE(i, k). First, in Step 3 the jth PE in each 1 × K row window computes the sum of the K term[j] values in the window (i.e., A(j) = Σ_{q ∈ 1×K window} term[j](q), 0 ≤ j < K, is computed in all the 1 × K windows). This is done using a consecutive sum in 1 × K windows. Next, the PE's in the first 1 × K window of each row sum up the values computed by corresponding PE's in the 1 × K windows in their row. This gives the K d² values for the pattern represented in the row. The minimum of these can be found using a data sum with add replaced by min. Once the new cluster for each pattern is known, it can be broadcast to all the PE's in the pattern row (Step 6) for later use. A complexity analysis is provided in Fig. 7. The overall complexity is 4K + O(log NMK) unit routes.

The algorithm for the case K > M is given in Fig. 8. The strategy is similar to that for the case K ≤ M. Fig. 9 provides a complexity analysis. The total number of unit routes is again 4K + O(log NMK).
2) O(1) Memory: Once again, we need to develop different algorithms for the two cases K ≤ M and K > M. Fig. 10 gives the algorithm for the former case and Fig. 11 for the latter. The complexity analysis for both is done in Fig. 12. The number of unit routes required by each is 4K + O(log NMK).

Let us go through the steps of Fig. 10. Recall that this algorithm is for the case K ≤ M. First, the cluster center window is broadcast such that it resides in all K × M windows of the NM PE hypercube. The objective of Steps 2 and 3 is to compute, in PE(i, j),

d²(i, j) = Σ_{q=0}^{M-1} (F(i, q) - center(j, q))²,  0 ≤ i < N, 0 ≤ j < K.

In Step 2, PE(i, j) computes in dist(i, j) the sum
dist(i, j) = Σ_{r=0}^{K-1} (F(i, lK + r) - center(j, lK + r))²,  0 ≤ i < N, 0 ≤ j < M,

where l = ⌊j/K⌋. Then, in Step 4, d² is computed by adding the dist values in corresponding PE's of the 1 × K windows of the 1 × M rows. Once d² has been computed, the new cluster values are easily obtained and broadcast to all PE's representing the pattern.

Step 1: Broadcast the K × M cluster center window to the remaining N/K - 1 K × M windows.
Step 2: The PE's in each K × 1 column window compute term[q], 0 ≤ q < K.
Step 3: Each row forms a 1 × M window for the consecutive sum operation. This operation is repeated K/M times. On the ith iteration, the values term[iM + j], 0 ≤ j < M, of each PE are involved in the operation. Thus, each PE computes K/M A values A[0 .. K/M - 1].
Step 4: At this time, the PE's in each row have K A values. Each represents a different d²(i, k) value. Each PE computes D = min_{0≤l<K/M} {A[l]}.
Step 5: PE(i, 0) computes NewCluster[i] by computing the minimum D in its row and the cluster index corresponding to this.
Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 8. O(K) memory cluster assignment, K > M.

Fig. 9. Complexity analysis of Fig. 8: Step 1: log(N/K); Step 2: 2K - 2; Step 3: 2K; Step 4: 0; Step 5: 2 log K; Step 6: log M.

Step 1: Broadcast the K × M cluster center window to the remaining N/K - 1 K × M windows.
Step 2: The PE's in each K × K window perform a distance computation as discussed in Section II-C8. The result is left in the dist registers of the PE's.
Step 3: The dist values in the M/K 1 × K windows of each row are summed using a window sum. The result of this is left in the d2 registers of the PE's in the first 1 × K window of each row.
Step 4: Compute NewCluster(i, 0) = NewCluster[i] by finding q such that d2(i, q) = min_{0≤k<K} {d2(i, k)}.
Step 5: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 10. O(1) memory cluster assignment, K ≤ M.

Step 1: Broadcast the K × M cluster center window to the remaining N/K - 1 K × M windows.
Step 2: In each K × M window, regard the K × M cluster center matrix as K/M M × M cluster center matrices. These will be circulated through the K/M M × M windows of the larger K × M window using the data circulation procedure of Section II-C5. As a result, each M × M cluster center window will visit each M × M window exactly once. Whenever a new M × M cluster center window is received, the M × M PE window does Steps 3 and 4; i.e., these are done a total of K/M times.
Step 3: Each M × M window does a distance computation as described in Section II-C8. Because of the window size used, each computed distance represents the squared distance between a pattern and a cluster.
Step 4: Each PE remembers the smallest distance value it has computed so far. It also remembers the cluster index that corresponds to this.
Step 5: Compute NewCluster(i, 0) by finding q such that d2(i, q) = min_{0≤k<K} {d2(i, k)} and using the cluster center index remembered by PE(i, q).
Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 11. O(1) memory cluster assignment, K > M.

Fig. 12. Complexity analysis of Figs. 10 and 11. For Fig. 10: Step 1: log(N/K); Step 2: 4K + O(log K); Step 3: log(M/K); Step 4: log K; Step 5: log M. For Fig. 11: Step 1: log(N/K); Steps 2-4 together: K/M (Step 3 + Step 4 + 1), with Step 3: 4M + O(log M) and Step 4: 0; Steps 5 and 6: log M each.

B. Cluster Update

For this operation, we assume that PE(i, j) in the topmost K × M window has values FeatureSum(i, j) and Number(i, j) defined as

FeatureSum(i, j) = Σ_{q ∈ S_i} F(q, j),  0 ≤ i < K, 0 ≤ j < M
Number(i, j) = |S_i|,  0 ≤ i < K, 0 ≤ j < M.

The algorithm to update the cluster centers is given in Fig. 13. Steps 1 and 2 are performed in K × M windows. The (i, j) PE in each such window computes the change in FeatureSum(i, j) and Number(i, j) contributed by the patterns in this window. These two steps can be restricted to PE's for which NewCluster(i, j) ≠ C(i, j). In Steps 3 and 4 the topmost window accumulates the sum of these changes. Steps 5-8 update the clustering data. The complexity analysis is provided in Fig. 14. A total of O(log²K + log(N/K)) unit routes are used.

Step 1: PE(i, j) does an SRAW of F(i, j) to the (NewCluster(i, j), j) PE in its K × M window. It also does an SRAW of -F(i, j) to the (C(i, j), j) PE in its K × M window. Note that both SRAW's involve data movement in K × 1 column windows only. Let the resulting sum in PE(i, j) be A(i, j).
Step 2: PE(i, j) does an SRAW of +1 to the (NewCluster(i, j), j) PE in its K × M window. It does an SRAW of -1 to the (C(i, j), j) PE in its K × M window. Note that both SRAW's involve data movement in K × 1 column windows only. Let the resulting sum in PE(i, j) be B(i, j).
Step 3: The A values of corresponding PE's in the N/K K × M windows are added using a window sum. The results are in the D registers of the topmost K × M window.
Step 4: The B values of corresponding PE's in the N/K K × M windows are added using a window sum. The results are in the E registers of the topmost K × M window.
Step 5: FeatureSum(i, j) := FeatureSum(i, j) + D(i, j), 0 ≤ i < K, 0 ≤ j < M.
Step 6: Number(i, j) := min {∞, Number(i, j) + E(i, j)}, 0 ≤ i < K, 0 ≤ j < M.
Step 7: center(i, j) := FeatureSum(i, j)/Number(i, j), 0 ≤ i < K, 0 ≤ j < M.
Step 8: C(i, j) := NewCluster(i, j), 0 ≤ i < N, 0 ≤ j < M.

Fig. 13. Cluster updating.

Fig. 14. Complexity analysis of Fig. 13: Steps 1 and 2: O(log²K) each; Steps 3 and 4: log(N/K) each; Steps 5-8: 0.

Overall Complexity: The total number of unit routes used by our algorithms for one pass of Fig. 1 is 4K + O(log²K) + O(log NMK) regardless of whether the amount of memory available is O(K) or O(1). This improves on the algorithm of Li and Fang [8], which requires O(K log NM) unit routes and O(K) memory per PE.
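The ± bookkeeping of Fig. 13 has a simple sequential reading: only patterns that switch clusters contribute, each adding its feature vector and a count of +1 to the new cluster's running sums and subtracting them from the old one. The C sketch below is our own illustration in the spirit of Fig. 13's Steps 1, 2, and 5-7, not the paper's code.

/* Incremental center update: pattern with features Fi[0..M-1] moved
   from cluster oldk to cluster newk.  FeatureSum is K x M, Number is K. */
void move_pattern(int M, const double *Fi, int oldk, int newk,
                  double *FeatureSum, int *Number, double *center)
{
    for (int j = 0; j < M; j++) {
        FeatureSum[newk*M + j] += Fi[j];   /* SRAW of +F(i, j) */
        FeatureSum[oldk*M + j] -= Fi[j];   /* SRAW of -F(i, j) */
    }
    Number[newk]++;                        /* SRAW of +1 */
    Number[oldk]--;                        /* SRAW of -1 */
    for (int j = 0; j < M; j++) {          /* recompute affected centers */
        if (Number[newk] > 0)
            center[newk*M + j] = FeatureSum[newk*M + j] / Number[newk];
        if (Number[oldk] > 0)
            center[oldk*M + j] = FeatureSum[oldk*M + j] / Number[oldk];
    }
}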
IV. NMK PE CLUSTERING

An NMK PE hypercube may be viewed as an N × M × K array with each N × M subarray forming one plane. The F matrix begins in plane 0 and the center matrix is in the same plane: F(i, j, 0) = F[i, j], 0 ≤ i < N, 0 ≤ j < M, and center(i, j, 0) = center[i, j], 0 ≤ i < K, 0 ≤ j < M. Cluster reassignment can be done in O(log NMK) time using the algorithm of Fig. 15. Cluster centers may be updated in O(log NMK) time using the algorithm of Fig. 16.

Step 1: Broadcast the center and F matrices such that center(i, j, k) = center[k, j] and F(i, j, k) = F[i, j], 0 ≤ i < N, 0 ≤ j < M, 0 ≤ k < K.
Step 2: Compute term(i, j, k) = (F(i, j, k) - center(i, j, k))² = (F[i, j] - center[k, j])².
Step 3: Compute d2(i, 0, k) = Σ_{j=0}^{M-1} term(i, j, k) = d2[i, k] by summing along the second dimension.
Step 4: Compute NewCluster(i, 0, 0) = q such that d2(i, 0, q) = min_{0≤k<K} {d2(i, 0, k)}.
Step 5: Broadcast NewCluster(i, 0, 0) along the second dimension.

Fig. 15. Cluster reassignment with NMK PE's.

Step 1: Broadcast F, NewCluster, and C such that F(i, j, k) = F(i, j, 0), NewCluster(i, j, k) = NewCluster(i, j, 0), and C(i, j, k) = C(i, j, 0), 0 ≤ i < N, 0 ≤ j < M, 0 ≤ k < K.
Step 2: if NewCluster(i, j, k) = C(i, j, k), set A(i, j, k) and N(i, j, k) to 0;
  else if NewCluster(i, j, k) = k, set A(i, j, k) to F(i, j, k) and N(i, j, k) to 1;
  else if C(i, j, k) = k, set A(i, j, k) to -F(i, j, k) and N(i, j, k) to -1;
  else set A(i, j, k) to 0 and N(i, j, k) to 0.
Step 3: Compute B(0, j, k) = Σ_{i=0}^{N-1} A(i, j, k) and D(0, j, k) = Σ_{i=0}^{N-1} N(i, j, k).
Step 4: Translate B and D so that E(k, j, 0) = B(0, j, k) and F(k, j, 0) = D(0, j, k), 0 ≤ j < M, 0 ≤ k < K.
Step 5: FeatureSum(k, j, 0) := FeatureSum(k, j, 0) + E(k, j, 0); Number(k, j, 0) := min {∞, Number(k, j, 0) + F(k, j, 0)}; and center(k, j, 0) := FeatureSum(k, j, 0)/Number(k, j, 0), 0 ≤ j < M, 0 ≤ k < K.
  C(i, j, 0) := NewCluster(i, j, 0), 0 ≤ i < N, 0 ≤ j < M.

Fig. 16. Cluster center update with NMK PE's.
V. CLUSTERING ON A MEDIUM GRAIN HYPERCUBE

In the previous sections, we have developed algorithms to perform clustering on a fine grain hypercube. Such a computer has the property that the cost of interprocessor communication is comparable to that of a basic arithmetic operation.
In this section, we shall consider the clustering problem on a hypercube in which interprocessor communication is relatively expensive and the number of processors is small relative to the number of patterns N. In particular, we shall experiment with an NCUBE/7 hypercube which is capable of having up to 128 processors. The NCUBE/7 available to us, however, has only 64 processors. The time to perform a 2 byte integer addition on each hypercube processor is 4.3 μs whereas the time to communicate b bytes to a neighbor processor is approximately 447 + 2.4b μs [4].

Since the hypercube computer is attached to a host computer, two cases of the clustering problem can be studied. These vary in the initial location of the pattern and the location of the final results:
1) Host-to-host: The pattern and cluster information is initially at the host and the result is to be left in the host also.
2) Hypercube-to-hypercube: The pattern and cluster information is initially at the nodes and the result is to be left at the nodes.

Let p be the number of hypercube processors. We assume that the N feature vectors that constitute the feature matrix are distributed equally among the p processors and that the center matrix is located initially at node 0. Further, we assume that each processor has enough memory to hold its share of the pattern feature matrix and the whole cluster center matrix.

Step 1: Receive the partial feature matrix from the host. Node s receives F[i, j], s(N/p) ≤ i < (s + 1)(N/p), 0 ≤ s < p. Node 0 also receives the cluster center matrix.
Step 2: Steps 3-7 are repeated "iteration" number of times.
Step 3: Node 0 broadcasts the cluster center matrix to all nodes.
Step 4: Each node calculates the new clusters for each pattern using the cluster center matrix.
Step 5: Each node s calculates

T[s][t][j] = Σ_{i ∈ S_t, s(N/p) ≤ i < (s+1)(N/p)} F[i, j],  0 ≤ t < K, 0 ≤ j < M
N[s][t] = |{ i | i ∈ S_t, s(N/p) ≤ i < (s+1)(N/p) }|,  0 ≤ t < K,

where S_t denotes the tth cluster.
Step 6: At node 0, the following information is gathered:

T[t][j] = Σ_{s=0}^{p-1} T[s][t][j],  0 ≤ t < K, 0 ≤ j < M
N[t] = Σ_{s=0}^{p-1} N[s][t],  0 ≤ t < K.

This is done using a binary tree as in [2]. At each stage the node receiving the information adds its information to the received information and sends it to its parent.
Step 7: Node 0 calculates the new cluster center matrix.
Step 8: Each node sends the information about the final value of C[i], s(N/p) ≤ i < (s + 1)(N/p), to the host.

Fig. 17. Clustering algorithm on the hypercube.
Fig. 18. Host-to-host times. Times are in seconds; number of features = 20; number of iterations = 10.
(a) 512 patterns:

No. of                 No. of Clusters
Processors        16          32          64
1                 44.188      86.842      171.481
2                 22.346      43.659      86.286
4                 11.463      22.282      43.921
8                 6.080       11.700      22.941
16                3.440       6.504       12.632
32                2.184       4.015       7.675
64                1.632       2.889       5.405

(b) 1024 patterns.

Fig. 19. Hypercube-to-hypercube times. Times are in seconds; number of features = 20; number of iterations = 10.
(a) 512 patterns:

No. of                 No. of Clusters
Processors        16          32          64
1                 44.182      86.568      171.341
2                 22.228      43.544      86.175
4                 11.303      22.128      43.779
8                 5.891       11.516      22.768
16                3.236       6.307       12.450
32                1.959       3.799       7.477
64                1.372       2.641       5.178

(b) 1024 patterns.
Fig. 17 gives our hypercube algorithm for the host-to-host case. This algorithm assumes that p divides N. If this is not the case, then, in Step 1, the rows of the feature matrix can be distributed so that each of q = N mod p processors gets ⌈N/p⌉ rows of the feature matrix and the remaining p - q processors get ⌊N/p⌋ rows each. The algorithm of Fig. 17 does not require that N, M, and K be powers of 2. The current cluster center matrix is broadcast, in Step 3, to all p processors using the standard binary tree broadcast scheme [2]. Steps 4 and 5 require no interprocessor communication. Step 6 uses the broadcast binary tree used in Step 3 backward (i.e., from the leaves to the root rather than from the root to the leaves). Step 7 requires no interprocessor communication.
For the hypercube-to-hypercube case, Steps 1 and 8 are omitted. Interprocessor communication is done in Steps 3 and 6. The time needed for this is O(MK log p). The processing time for Steps 4, 5, and 6 is O(NMK/p). The overall time complexity for one iteration of Steps 3-7 is therefore O(NMK/p + MK log p). When N/p is Ω(log p), the time becomes O(NMK/p).
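A modern reading of Steps 3-7 maps directly onto collective operations. The following C/MPI sketch of one iteration is our own illustration (the paper's NCUBE implementation used hand-coded binary tree broadcasts and gathers, not MPI); the function name and data layout are assumptions.

#include <mpi.h>
#include <stdlib.h>

/* One iteration of Steps 3-7, hypercube-to-hypercube case.
   Each rank owns Nloc = N/p rows of F; rank 0 owns the centers. */
void cluster_iteration(int Nloc, int M, int K, const double *F,
                       double *center, int *C, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    /* Step 3: broadcast the K x M center matrix (tree inside MPI). */
    MPI_Bcast(center, K * M, MPI_DOUBLE, 0, comm);

    /* Step 4: local reassignment; Step 5: partial sums T and N. */
    double *T    = calloc((size_t)K * M, sizeof *T);
    double *Ncnt = calloc((size_t)K, sizeof *Ncnt);
    for (int i = 0; i < Nloc; i++) {
        int best = 0; double bestd = -1.0;
        for (int k = 0; k < K; k++) {
            double d2 = 0.0;
            for (int j = 0; j < M; j++) {
                double t = F[i*M + j] - center[k*M + j];
                d2 += t * t;
            }
            if (bestd < 0.0 || d2 < bestd) { bestd = d2; best = k; }
        }
        C[i] = best;
        Ncnt[best] += 1.0;
        for (int j = 0; j < M; j++) T[best*M + j] += F[i*M + j];
    }
    /* Step 6: tree reduction of the partial sums to node 0. */
    MPI_Reduce(rank == 0 ? MPI_IN_PLACE : T,    T,    K * M, MPI_DOUBLE,
               MPI_SUM, 0, comm);
    MPI_Reduce(rank == 0 ? MPI_IN_PLACE : Ncnt, Ncnt, K,     MPI_DOUBLE,
               MPI_SUM, 0, comm);
    /* Step 7: node 0 recomputes the centers. */
    if (rank == 0)
        for (int k = 0; k < K; k++)
            if (Ncnt[k] > 0.0)
                for (int j = 0; j < M; j++)
                    center[k*M + j] = T[k*M + j] / Ncnt[k];
    free(T); free(Ncnt);
}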
The clustering algorithm of Fig. 17 was programmed in C and run on an NCUBE/7 hypercube. We considered two random feature matrices. One had N = 512 patterns (or feature vectors) and M = 20 and the other had N = 1024 and M = 20. The use of random feature matrices is justified in our experiments as we are interested in measuring the effectiveness of our parallel algorithm in terms of speedup rather than in the convergence of the clustering algorithm. The convergence properties of the parallel clustering algorithm are the same as those of the sequential one and the run time of the parallel algorithm is insensitive to the actual feature values.

Run times for ten iterations of Steps 3-7 are given in Figs. 18 and 19. Run times are reported for p = 1, 2, 4, 8, 16, 32, and 64 for the case N = 512 and for p = 2, 4, 8, 16, 32, and 64 for the case N = 1024. Fig. 18 is for the host-to-host case while Fig. 19 is for the hypercube-to-hypercube case. For small p, there is little difference between the times for the two cases; i.e., the time for Steps 1 and 8 is small compared to the time for the remaining steps. Since the host-to-hypercube and hypercube-to-host data transfer times of Steps 1 and 8, respectively, are relatively insensitive to the value of p and since the time for Steps 2-7 decreases as 1/p, the significance of the time for Steps 1 and 8 increases as we increase p. In fact, for p = 64 and K = 16, the time for Steps 1 and 8 is approximately 20% of the time for ten iterations of Steps 3-7 when N = 512 and approximately 23% when N = 1024. The run times of Fig. 19 closely agree with our analysis for Steps 3-7. Note that for our test data N/p ≥ log2 p, so the predicted run time is O(NMK/p). Hence, we expect the run time to increase linearly with N, M, and K, and the dependency on p to be as 1/p.
The speedup and efficiency (speedup/p) for N = 512, K = 32, M = 20 for ten iterations are given in Fig. 20. Fig. 21 gives these figures for the case N = 1024, K = 32, M = 20, and number of iterations = 10. From Fig. 20, we see that when N = 512, we get greater than 80% efficiency so long as the number of processors is no more than 16. When p = 64, the efficiency drops to approximately 50%. This drop in efficiency as p increases while the problem size is held fixed is expected as the useful work per processor declines and the effects of the nonuseful work (e.g., interprocessor communication) become more significant. The efficiency is expected to be larger for larger N as the amount of useful work per processor increases relative to the amount of nonuseful work being performed.
Fig. 20. 512 patterns: Speedup and efficiency. Number of clusters = 32; number of features = 20; number of iterations = 10.

       Host-to-host               Hypercube-to-hypercube
p      Speedup     Efficiency     Speedup     Efficiency
1      1.000       1.000          1.000       1.000
2      1.989       0.994          1.988       0.994
4      3.897       0.974          3.910       0.978
8      7.422       0.928          7.517       0.939
16     13.352      0.834          13.725      0.857
32     21.629      0.676          22.787      0.712
64     30.059      0.470          32.778      0.512

Fig. 21. 1024 patterns: Speedup and efficiency (based on estimated time for one processor). Number of clusters = 32; number of features = 20; number of iterations = 10.
With an N of 1024, the efficiency for p = 64 is approximately 60% (since we did not have enough memory to solve an N = 1024 instance on one processor, the p = 1 time is estimated from the p = 2 time using an efficiency of 0.994; this is the efficiency for the case N = 512).
VI. CONCLUSIONS

In this paper, optimal algorithms for squared error clustering were developed. We considered the two cases when the number of PE's available is NM and NMK. For the former, we developed algorithms for the cases of O(K) and O(1) memory per PE. While the algorithms for both cases use a comparable number of unit routes (4K + O(log²K) + O(log NMK)), those for the case of O(K) memory are simpler. Our algorithm for the case of NMK PE's runs in O(log NMK) time and uses O(1) memory per PE. All our algorithms are optimal to within a constant factor. Experimental results obtained on an NCUBE/7 hypercube indicate that the clustering problem can be solved efficiently on commercial medium grain multicomputers.
REFERENCES

[1] D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[2] E. Dekel, D. Nassimi, and S. Sahni, "Parallel matrix and graph algorithms," SIAM J. Comput., pp. 657-675, 1981.
[3] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[4] T. H. Dunigan, "Hypercube performance," in Hypercube Multiprocessors, M. T. Heath, Ed. SIAM, 1987, pp. 178-192.
[5] K. S. Fu, Syntactic Methods in Pattern Recognition. New York: Academic, 1974.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972.
[7] K. Hwang and D. Kim, "Parallel pattern clustering on a multiprocessor with orthogonally shared memory," in Proc. 1987 Int. Conf. Parallel Processing, pp. 913-916.
[8] X. Li and Z. Fang, "Parallel algorithms for clustering on hypercube SIMD computers," in Proc. 1986 Conf. Computer Vision and Pattern Recognition, 1986, pp. 130-133.
[9] D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers," IEEE Trans. Comput., vol. C-30, pp. 101-107, Feb. 1981.
[10] L. M. Ni and A. K. Jain, "A VLSI systolic architecture for pattern clustering," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-7, no. 1, pp. 80-89, Jan. 1985.
[11] S. Ranka and S. Sahni, "Image template matching on an SIMD hypercube multicomputer," Univ. of Minnesota Tech. Rep., 1987.
[12] A. Rosenfeld and A. C. Kak, Digital Picture Processing. New York: Academic, 1982.
[13] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974.
Sanjay Ranka (S'87-M'88) received the B.Tech. degree in computer science and engineering from the Indian Institute of Technology, Kanpur, in May 1985 and the Ph.D. degree in computer and information science from the University of Minnesota, Minneapolis, in August 1988.

Since August 1988, he has been an Assistant Professor in the School of Computer Science, Syracuse University, Syracuse, NY. His main areas of interest are parallel and distributed computing, algorithms, and neural networks. He is particularly interested in artificial intelligence, computer vision, pattern recognition, and VLSI. He has coauthored a monograph, Hypercube Algorithms for Pattern Analysis and Machine Intelligence (New York: Springer-Verlag). He is also a Guest Editor of a special issue of IEEE Computer (February 1991).
Sartaj Sahni (M'79-SM'86-F'88) received the B.Tech. (electrical engineering) degree from the Indian Institute of Technology, Kanpur, and the M.S. and Ph.D. degrees in computer science from Cornell University, Ithaca, NY.

He is Professor of Computer and Information Sciences at the University of Florida. He has published over 100 research papers and written several texts. His research publications are on the design and analysis of efficient algorithms, parallel computing, interconnection networks, and design automation. He is coauthor of the texts Fundamentals of Data Structures, Fundamentals of Data Structures in Pascal, and Fundamentals of Computer Algorithms, and author of the texts Concepts in Discrete Mathematics and Software Development in Pascal.

Dr. Sahni is the area editor for algorithms and multiprocessors for the Journal of Parallel and Distributed Computing and is on the editorial boards of IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Information and Software Technology, and Computer Systems: Science and Engineering.