Abstract-In this paper, squared error clustering algorithms for SIMD hypercubes are presented. These algorithms are asymptotically faster than previously known algorithms and require less memory per PE. For a clustering problem with N patterns, M features per pattern, and K clusters, our algorithms complete in O(K + log NMK) steps on NM processor hypercubes. This is optimal up to a constant factor. We extend these results to the case when NMK processors are available. Experimental results from an MIMD medium grain hypercube are also presented.

Index Terms-Clustering, feature vector, hypercube multicomputer, pattern recognition, MIMD, SIMD.

I. INTRODUCTION

A feature vector is a basic notion of pattern recognition. A feature vector v is a set of measurements $(v_1, v_2, \ldots, v_M)$ which map the important properties of an image into a Euclidean space of dimension M [1]. Clustering partitions a set of feature vectors into groups. It is a valuable tool in exploratory pattern analysis and helps in making hypotheses about the structure of data. It is important in syntactic pattern recognition, image segmentation, and registration. There are many methods for clustering feature vectors [1], [3], [5], [6], [12], [13]. One popular technique is squared error clustering.

Let N represent the number of patterns which are to be partitioned and let M represent the number of features per pattern. Let $F[0 \ldots N-1, 0 \ldots M-1]$ be the feature matrix such that F[i, j] denotes the value of the jth feature in the ith pattern. Let $S_0, S_1, \ldots, S_{K-1}$ be K clusters. Each pattern belongs to exactly one of the clusters. Let C[i] represent the cluster to which pattern i belongs. Thus, we can define $S_k$ as

$S_k = \{ i \mid C[i] = k \}, \quad 0 \le k \le K - 1.$

Further, $|S_k|$ is the cardinality or size of the partition $S_k$.
The center of cluster k is a 1 x M vector defined as

$center[k, j] = \frac{1}{|S_k|} \sum_{i \in S_k} F[i, j], \quad 0 \le j < M.$

The squared distance $d^2$ between pattern i and cluster k is

$d^2[i, k] = \sum_{j=0}^{M-1} (F[i, j] - center[k, j])^2.$

The squared error for the kth cluster is defined as

$E^2[k] = \sum_{i \in S_k} d^2[i, k], \quad 0 \le k < K$

and the squared error for the clustering is

$ERROR[K] = \sum_{k=0}^{K-1} E^2[k].$

    procedure CLUSTER(N, M, K)
    {iteratively improve the clustering}
    Step 1: [Cluster Reassignment]
            NewCluster[i] := q such that $d^2[i, q] = \min_{0 \le k < K} \{ d^2[i, k] \}$
            {ties are broken arbitrarily}
    Step 2: [Termination Criterion and Cluster Update]
            if NewCluster[i] = C[i], 0 ≤ i < N then terminate;
            else [C[i] := NewCluster[i], 0 ≤ i < N and goto Step 3]
    Step 3: [Cluster Center Update]
            Recompute center[i, j], 0 ≤ i < K, 0 ≤ j < M using the new cluster assignments.
    end {of Cluster}

Fig. 1. One pass of the iterative cluster improvement algorithm.

Manuscript received May 25, 1990; revised October 30, 1990. This work was supported in part by the National Science Foundation under Grants DCR8420935 and MIP 86-17374. S. Ranka is with the School of Computer Science, Syracuse University, Syracuse, NY 13244. S. Sahni is with the Department of Computer and Information Sciences, University of Florida, Gainesville, FL 32611. IEEE Log Number 9042569.

In the clustering problem, we are required to partition the N patterns such that the squared error for the clustering is minimum. In practice, this is done by trying out several different values of K. For each K, the clusters are constructed using an iterative refinement technique in which we begin with an initial set of K clusters, move each pattern to a cluster with which it has the minimum squared distance, and recompute the cluster centers. The last two steps are iterated until no pattern is moved from its current cluster. The final clustering obtained in this way is not guaranteed to be a global minimum. In fact, different initial clusterings can result in different final clusters.
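The pass of Fig. 1 can be illustrated with a short sequential sketch. This is a minimal uniprocessor version written for illustration only, not the parallel algorithm developed in this paper. The function name `cluster_pass`, the tie-breaking rule (lowest cluster index), and the handling of an empty cluster (its old center is kept) are assumptions that do not come from the text.

```python
def cluster_pass(F, center, C):
    """One pass of Fig. 1 on a uniprocessor.

    F      : N x M list of feature rows
    center : K x M list of cluster centers
    C      : current cluster index of each pattern
    Returns (new_C, new_center, changed).
    """
    N, M, K = len(F), len(F[0]), len(center)

    # Step 1: move each pattern to the cluster with minimum squared distance.
    new_C = []
    for i in range(N):
        d2 = [sum((F[i][j] - center[k][j]) ** 2 for j in range(M))
              for k in range(K)]
        # Assumption: ties are broken in favor of the lowest cluster index.
        new_C.append(min(range(K), key=lambda k: d2[k]))

    # Step 2: terminate when no pattern changed cluster.
    if new_C == C:
        return C, center, False

    # Step 3: recompute each center as the mean of its members.
    new_center = []
    for k in range(K):
        members = [i for i in range(N) if new_C[i] == k]
        if members:
            new_center.append([sum(F[i][j] for i in members) / len(members)
                               for j in range(M)])
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_center.append(list(center[k]))
    return new_C, new_center, True
```

Calling `cluster_pass` repeatedly until `changed` is false gives the iterative refinement described above. The triple loop in Step 1 is the source of the O(NMK) uniprocessor cost.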
One pass of the algorithm is given in Fig. 1. One pass of the cluster improvement algorithm takes O(NMK) time on a uniprocessor computer. Since several passes are needed before an acceptable K and corresponding clustering are obtained, the computational requirements are high, and several researchers have studied parallel implementations of one pass of the algorithm. For example, Ni and Jain [10] have developed systolic arrays for clustering, Hwang and Kim [7] have developed clustering algorithms for multiprocessors with orthogonally shared memory, and Li and Fang [8] have developed an SIMD hypercube algorithm. This latter algorithm requires that each processor have O(K) memory and has a run time of O(K log(NM)) on an SIMD hypercube with NM PE's.

In this paper, we first improve upon the algorithm of [8] by developing one that runs in O(K + log NMK) time. We present two algorithms with this run time. Both use NM PE's. However, one requires O(K) memory per PE and the other O(1). Next, we show how these may be extended to the case when NMK PE's are available. In this case, the run time becomes O(log NMK). Finally, we present a parallel clustering algorithm suitable for use on an MIMD medium-grained hypercube. Experimental results using the NCUBE hypercube are also presented.

In the SIMD model, PE's need to be able to perform only the basic arithmetic operations (i.e., no instruction fetch or decode is needed). A p-dimensional hypercube network connects $2^p$ PE's. Let $i_{p-1} i_{p-2} \ldots i_0$ be the binary representation of the PE index i. Let $\bar{i}_k$ be the complement of bit $i_k$. A hypercube network directly connects pairs of processors whose indexes differ in exactly one bit. I.e., processor $i_{p-1} i_{p-2} \ldots i_0$ is connected to processors $i_{p-1} \ldots \bar{i}_k \ldots i_0$, 0 ≤ k ≤ p - 1. We use the notation $i^{(b)}$ to represent the number that differs from i in exactly bit b. There is a separate program memory and control unit.
The control unit performs instruction sequencing, fetching, and decoding. In addition, instructions and masks are broadcast by the control unit to the PE's for execution. An instruction mask is a Boolean function used to select certain PE's to execute an instruction. For example, in the instruction

A(i) := A(i) + 1, ($i_0 = 1$)

($i_0 = 1$) is a mask that selects only those PE's whose index has bit 0 equal to 1. I.e., odd indexed PE's increment their A registers by 1. Sometimes, we shall omit the PE indexing of registers. So, the above statement is equivalent to the statement

A := A + 1, ($i_0 = 1$).

In case N, M, and K are not powers of 2, we can do the following:

1) If N is not a power of 2, introduce additional patterns so that the total number of patterns becomes a power of 2. Note that this can be done by at most doubling the number of patterns. These additional patterns have the same feature vector which is chosen to be far removed from the feature vectors of the remaining patterns. As a result, these additional patterns cluster together in a separate cluster and do not affect the clustering of the original patterns.

2) If M is not a power of 2, introduce additional features to the feature vector so that the total number of features becomes a power of 2. Note that this can be done by at most doubling the number of features. The values for these additional features are set to zero for all patterns. As a result, the additional features do not affect the clustering.

3) If K is not a power of 2, then we replace K by the next power of 2, say J. We start with J - K clusters with centers such that no pattern can be assigned to one of these clusters. The center with all coordinates ∞ can be used for this purpose for all J - K clusters.

The above changes do not affect the final clusters obtained and since they at most double N, M, and K, the asymptotic complexity of the resulting algorithms is also not affected.
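The three padding rules above can be sketched as follows. This is an illustrative sketch only; the helper names `next_pow2` and `pad_problem` and the default `far` value are hypothetical, and a real implementation would choose `far` relative to the range of the actual feature values.

```python
def next_pow2(x):
    """Smallest power of 2 that is >= x (for x >= 1)."""
    p = 1
    while p < x:
        p *= 2
    return p


def pad_problem(F, center, far=1e9):
    """Pad N, M, and K up to powers of 2 following rules 1) to 3).

    `far` is a hypothetical value assumed to be far removed from
    every real feature value.
    """
    N, M, K = len(F), len(F[0]), len(center)
    N2, M2, K2 = next_pow2(N), next_pow2(M), next_pow2(K)
    inf = float("inf")

    # Rule 2: additional features are zero for all patterns.
    Fp = [list(row) + [0.0] * (M2 - M) for row in F]
    # Rule 1: additional patterns share one far-removed feature vector.
    Fp += [[far] * M + [0.0] * (M2 - M) for _ in range(N2 - N)]

    # Real centers get zeros in the additional feature positions.
    cp = [list(row) + [0.0] * (M2 - M) for row in center]
    # Rule 3: additional centers have all coordinates infinite, so no
    # pattern can ever be assigned to them.
    cp += [[inf] * M2 for _ in range(K2 - K)]
    return Fp, cp
```

Each dimension at most doubles, which is why the asymptotic bounds are unaffected.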
RANKA AND SAHNI: CLUSTERING ON A HYPERCUBE MULTICOMPUTER

Interprocessor assignments are denoted using the symbol ←, while intraprocessor assignments are denoted using the symbol :=. Thus, the assignment statement

$B(i^{(2)}) \leftarrow B(i)$, ($i_2 = 0$)

is executed only by the processors with bit 2 equal to 0. These processors transmit their B register data to the corresponding processors with bit 2 equal to 1.

In a unit route, data may be transmitted from one processor to another if it is directly connected. We assume that the links in the interconnection network are unidirectional. Hence, at any given time, data can be transferred either from PE i ($i_b = 0$) to PE $i^{(b)}$ or from PE i ($i_b = 1$) to PE $i^{(b)}$. Hence, the instruction $B(i^{(2)}) \leftarrow B(i)$, ($i_2 = 0$) takes one unit route, while the instruction $B(i^{(2)}) \leftarrow B(i)$ takes two unit routes.

Since the asymptotic complexity of all our algorithms is determined by the number of unit routes, our complexity analysis will count only these.

B. Hypercube Embedding of F and center

Throughout this paper, we shall assume that N, M, and K are powers of 2. This assumption greatly simplifies our discussion as the number of processors in a hypercube is always a power of two.

Assume that the number of PE's in the hypercube is NM. The hypercube may be viewed as an N x M grid of PE's as in Fig. 2. If $N = 2^n$ and $M = 2^m$, then a PE index has n + m bits in it. The first n bits give the row number and the last m the column number. Note that each row forms a subhypercube with M PE's and each column forms a subhypercube with N PE's. We shall use the notation PE(i, j) to refer to the PE in row i and column j. Its index is obtained by appending the bits in j to those in i. The initial configuration we assume for our algorithms has F[i, j] in the F register of PE(i, j), i.e., F(i, j) = F[i, j], 0 ≤ i < N, 0 ≤ j < M. Also, the center matrix is stored in the top K rows of the N x M hypercube such that center(i, j) = center[i, j], 0 ≤ i < K, 0 ≤ j < M (see Fig. 2).

Fig. 2. A 32 PE hypercube viewed as an 8 x 4 grid.

C. Basic Data Manipulation Operations

1) Data Broadcast: In a data broadcast, data originate at one PE of a subhypercube and are to be transmitted to the remaining PE's of the subhypercube. If the subhypercube has P PE's, a data broadcast can be done in log P unit routes [2].

2) Window Broadcast: Here, data originate in an R x S subhypercube of a larger T x U subhypercube and are to be replicated over this larger subhypercube. The larger subhypercube may be naturally tiled by R x S windows and essentially data from one window are to be broadcast to the others. This can be done using log(TU/(RS)) unit routes [2].

3) Data Sum: Assume the window tiling of Section II-C2. For each window, the data in the A registers of the PE's in this window are to be summed and the result left in one of the PE's (same relative PE for each window) of the window. For windows of size R x S, the summing operation takes log(RS) unit routes [2].

4) Window Sum: Assume the tiling of Section II-C2. This time, one of the R x S windows is to sum up the A register values in corresponding PE's of all the windows. I.e., the PE in position (i, j) of a designated window is to accumulate the sum of the A registers of the PE's in position (i, j) of all the TU/(RS) windows, 0 ≤ i < R, 0 ≤ j < S. This can be done using log(TU/(RS)) unit routes [2].

5) Data Circulation: The data in the A registers of each of the P processors in a P processor subhypercube are to be circulated through each of the remaining P - 1 PE's in the subhypercube. This can be accomplished using P - 1 exchange steps. The circulation algorithm uses the exchange sequence $X_p$, $P = 2^p$, defined recursively as [2]:

$X_1 = 0, \quad X_q = X_{q-1},\, q-1,\, X_{q-1} \quad (q > 1).$

This sequence essentially treats a q-dimensional hypercube as two (q - 1)-dimensional hypercubes. Data circulation is done in each of these in parallel using $X_{q-1}$. Next an exchange is done along bit q - 1. This causes the data in the two halves to be swapped. The swapped data are again circulated in the two half hypercubes using $X_{q-1}$. Let f(q, i) be the ith number (left to right) in the sequence $X_q$, $1 \le i < 2^q$. The resulting SIMD data circulation algorithm is given in Fig. 3. Here, it is assumed that the p bits that define the subhypercube are bits 0, 1, 2, ..., p - 1. Because of our assumption of unidirectional links, each iteration of the for loop of Fig. 3 takes 2 unit routes. Hence, Fig. 3 takes 2(P - 1) unit routes.

    procedure Circulate(A);
    {data circulation}
    for i := 1 to P - 1 do
        $A(j^{(f(p,i))}) \leftarrow A(j)$;
    end

Fig. 3. Data circulation in an SIMD hypercube.

Lemma 1 [11]: Let $A_0, A_1, \ldots, A_{2^p - 1}$ be the values in $A(0), A(1), \ldots, A(2^p - 1)$ initially. Let index(j, i) be such that A[index(j, i)] is in A(j) following the ith iteration of the for loop of Fig. 3. Initially, index(j, 0) = j. For every i, i > 0, $index(j, i) = index(j, i-1) \oplus 2^{f(p,i)}$ ($\oplus$ is the exclusive OR operator).

6) Consecutive Sum: ConsecutiveSum(X, S) works in row windows of size 1 x S. Each of the S PE's in such a window has S values X[i], 0 ≤ i < S. The jth PE in the window computes

$A(j) = \sum_{i=0}^{S-1} X[j](i), \quad 0 \le j < S.$

    procedure ConsecutiveSum(X, S)
    begin
        in(p) := p mod S;
        A(p) := X[in(p)](p);
        for i := 1 to S - 1 do
        begin
            l := f(log S, i);
            $A(p^{(l)}) \leftarrow A(p)$;
            $in(p) := in(p) \oplus 2^l$;
            A(p) := A(p) + X[in(p)](p);
        end
        {move A's back to originating PE's}
        j := log S - 1;
        $A(p^{(j)}) \leftarrow A(p)$;
    end; {of ConsecutiveSum}

Fig. 4. Consecutive sum.
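The data circulation and consecutive sum operations can be checked with a small sequential simulation. The sketch below is an assumption-laden illustration: the PE registers of one subhypercube are modeled as a Python list, every exchange is applied to all PE's at once, and unit routes are not counted. The function names are hypothetical.

```python
def exchange_sequence(q):
    """Exchange sequence X_q: X_1 = 0, X_q = X_{q-1}, q-1, X_{q-1}."""
    if q <= 0:
        return []
    if q == 1:
        return [0]
    prev = exchange_sequence(q - 1)
    return prev + [q - 1] + prev


def circulate(A):
    """Simulate the data circulation loop on P = len(A) PE's (P a power of 2).

    Returns, for each iteration, the register contents and the origin
    index of the value held by each PE as predicted by Lemma 1.
    """
    P = len(A)
    p = P.bit_length() - 1
    origin = list(range(P))              # index(j, 0) = j
    history = []
    for b in exchange_sequence(p):
        # PE j receives the A value of its neighbor along bit b.
        A = [A[j ^ (1 << b)] for j in range(P)]
        # Lemma 1: index(j, i) = index(j, i-1) XOR 2^f(p, i).
        origin = [o ^ (1 << b) for o in origin]
        history.append((list(A), list(origin)))
    return history


def consecutive_sum(X):
    """Simulate ConsecutiveSum in one 1 x S window.

    X[p][i] is the value X[i] held by PE p.  On return, A[j] is the sum
    over all PE's p of X[p][j].
    """
    S = len(X)
    q = S.bit_length() - 1
    origin = list(range(S))              # in(p)
    A = [X[p][origin[p]] for p in range(S)]
    for b in exchange_sequence(q):
        A = [A[p ^ (1 << b)] for p in range(S)]
        origin = [o ^ (1 << b) for o in origin]
        A = [A[p] + X[p][origin[p]] for p in range(S)]
    if q > 0:
        # Move the A's back to their originating PE's (bit log S - 1).
        b = q - 1
        A = [A[p ^ (1 << b)] for p in range(S)]
    return A
```

Every bit below q - 1 appears an even number of times in $X_q$ and bit q - 1 appears once, so after the loop each accumulated A sits one exchange along bit q - 1 away from its origin. That is why a single final exchange returns it.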
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 2, NO. 2, APRIL 1991

The function f used in Fig. 3 can be computed by the control processor in O(P) time and saved in an array of size P - 1 (actually it is convenient to compute f on the fly using a stack of height log P). Lemma 1 allows each processor to compute the origin of the current A value.

In ConsecutiveSum (Fig. 4), the ith PE in the S block begins by initializing A to X[i]. Next, the A values of the S PE's in the block circulate through the block accumulating the remaining X's needed. Finally, the A's are moved back to the originating PE. At all times, in(p) gives the index of the PE at which the current A originated. The correctness of this statement follows from Lemma 1. The number of unit routes required is 2S.

7) Term Computation: This is done independently on all K x 1 column windows of the NM PE hypercube. The ith PE of each such window has an F and center value, F(i) and center(i), initially. Each PE, i, of the window computes the K values:

$term[k](i) = (F(i) - center(k))^2, \quad 0 \le k < K.$

This computation is done by circulating the center() values through the K x 1 window as in Fig. 5. The number of unit routes is 2(K - 1).

    procedure TermComputation(X, K)
    begin
        {index p of processor at (i, j) is p = iM + j}
        S(p) := center(i, j);
        in(p) := i mod K;
        $term[in(p)](p) := (F(p) - S(p))^2$;
        for q := 1 to K - 1 do
        begin
            l := f(log K, q);
            {row bit l of the window is bit l + log M of p}
            $S(p^{(l + \log M)}) \leftarrow S(p)$;
            $in(p) := in(p) \oplus 2^l$;
            $term[in(p)](p) := (F(p) - S(p))^2$;
        end
    end; {of TermComputation}

Fig. 5. Term computation.

8) Distance Computation: This is done independently in all S x S windows where S is a parameter to the operation. The PE in position (i, j) of the window computes

$dist(i, j) = \sum_{q=0}^{S-1} (F(i, q) - center(j, q))^2, \quad 0 \le i < S, \; 0 \le j < S.$

We note that the computation of dist is quite similar to the matrix multiplication C = A * B where C, A, and B are S x S matrices. In fact, if we let A = F and $B = center^T$ (i.e., the transpose of center) and replace $a_{iq} * b_{qj}$ by $(a_{iq} - b_{qj})^2$ in the definition of matrix product, we end up with the definition of dist. Hence, dist may be computed using the following modification of the $S^2$ PE matrix product algorithm of [2]:

Step 1: Compute B = transpose of the center matrix.
Step 2: Use the matrix product algorithm of [2] to "multiply" F and B. However, each time two terms of F and B are to be multiplied, compute the square of their difference instead.

The number of unit routes required to compute dist this way is 4S + O(log S).

    Step 1: Broadcast the K x M cluster center window to the remaining N/K - 1 K x M windows.
    Step 2: The PE's in each K x 1 column window compute term[q], 0 ≤ q < K.
    Step 3: The PE's in each 1 x K row window compute A := ConsecutiveSum(term, K).
    Step 4: The values of A are summed up over the M/K 1 x K windows in each row using window sum. This results in $d2(i, k) = \sum_{j=0}^{M-1} term[k](i, j)$, 0 ≤ k < K.
    Step 5: Compute NewCluster(i, 0) = NewCluster[i] by finding q such that $d2(i, q) = \min_{0 \le k < K} \{ d2(i, k) \}$.
    Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 6. O(K) memory cluster assignment, K ≤ M.

9) Summing Random Access Write (SRAW): This is done in K x 1 column windows. The K PE's of a window originate data A(i) that are to be sent to the dest(i)th PE in the window. If two or more PE's have data that are to be sent to the same PE, then their sum is needed at the destination PE. Thus, following the operation, the jth PE in the K x 1 window has the sum

$\sum_{\{ i \,\mid\, dest(i) = j \}} A(i).$

This can be done in $O(\log^2 K)$ unit routes by a modification of the random access write algorithm of Nassimi and Sahni [9]. In this modification, when two A's reach one PE, they are replaced by a single A which is the sum of the two.
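The results that the distance computation of Section II-C8 and the SRAW of Section II-C9 are expected to produce can be written down as sequential reference code. The sketch below only states what each operation computes within one window; it does not model the hypercube routing, the matrix product algorithm of [2], or the random access write algorithm of [9]. The function names are hypothetical.

```python
def dist_window(F, center):
    """Reference result of the distance computation in one S x S window.

    Written as a 'matrix product' of F and the transpose of center in
    which each product a * b is replaced by (a - b) ** 2.
    """
    S = len(F)
    # B = transpose of center, so B[q][j] = center[j][q].
    B = [[center[j][q] for j in range(S)] for q in range(S)]
    return [[sum((F[i][q] - B[q][j]) ** 2 for q in range(S))
             for j in range(S)]
            for i in range(S)]


def sraw(A, dest):
    """Reference result of a summing random access write in one window.

    PE i sends A[i] to PE dest[i]; values arriving at the same PE are
    summed.  PE's that receive nothing are assumed to end with 0.
    """
    out = [0] * len(A)
    for i, d in enumerate(dest):
        out[d] += A[i]
    return out
```

For example, `sraw([1, 2, 3, 4], [0, 0, 2, 0])` leaves 7 in PE 0 and 3 in PE 2. A parallel implementation can be validated against these functions on small windows.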
III. NM PE CLUSTERING

We consider two cases for an NM PE SIMD hypercube. In one, each PE has O(K) memory. In the other, each PE has O(1) memory. In both cases, we assume the initial data configuration of Section II-B. For each of these two cases, we develop algorithms for the cluster reassignment (Step 1) and center update (Step 3) steps of Fig. 1.

A. Cluster Reassignment

1) O(K) Memory: The cases K ≤ M and K > M result in two slightly different algorithms. The algorithm for the case K ≤ M is given in Fig. 6 while that for the case K > M is given in Fig. 8. In Fig. 6, we begin by broadcasting the K x M cluster center matrix to the remaining N/K - 1 K x M windows of the N x M hypercube. This is done using a window broadcast. Next, in Step 2, PE(i, j) computes $term[q] = (F[i, j] - center[q, j])^2$, 0 ≤ q < K. This is done by circulating the center values through column windows of size K x 1 as discussed in Section II-C5. The objective of Steps 3 and 4 is to compute d2(i, k), 0 ≤ k < K. d2(i, k) is stored in the d2 register of PE(i, k). First, in Step 3 the jth PE in each 1 x K row window computes the sum of the K term[j] values in the window (i.e., $A(j) = \sum_{q \in 1 \times K \text{ window}} term[j](q)$, 0 ≤ j < K, is computed in all the 1 x K windows). This is done using consecutive sum in 1 x K windows. Next, the PE's in the first 1 x K window of each row sum up the values computed by corresponding PE's in the 1 x K windows in their row. This gives the K d2 values for the pattern represented in the row. The minimum of these can be found using a data sum with add replaced by min. Once the new cluster for each pattern is known, it can be broadcast to all the PE's in the pattern row (Step 6) for later use. A complexity analysis is provided in Fig. 7. The overall complexity is 4K + O(log NMK) unit routes.

    Step 1:  log(N/K)
    Step 2:  2K - 2
    Step 3:  2K
    Step 4:  log(M/K)
    Step 5:  2 log K
    Step 6:  log M

Fig. 7. Complexity analysis of Fig. 6.
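The per-step unit route counts can be added up for concrete problem sizes to see how the 4K term dominates. The sketch below assumes the step costs as read from Fig. 7 (log(N/K), 2K - 2, 2K, log(M/K), 2 log K, and log M for Steps 1 to 6); the function name and the example sizes are illustrative assumptions.

```python
def unit_routes_fig6(N, M, K):
    """Unit routes of the O(K) memory reassignment for K <= M.

    N, M, and K are assumed to be powers of 2, as in the text.
    Returns the per-step counts and their total.
    """
    if K > M:
        raise ValueError("this tally covers only the case K <= M")

    def lg(x):
        # Exact base-2 logarithm of a power of 2.
        return x.bit_length() - 1

    steps = {
        "Step 1 (window broadcast)": lg(N // K),
        "Step 2 (term computation)": 2 * K - 2,
        "Step 3 (consecutive sum)": 2 * K,
        "Step 4 (window sum)": lg(M // K),
        "Step 5 (minimum)": 2 * lg(K),
        "Step 6 (row broadcast)": lg(M),
    }
    return steps, sum(steps.values())
```

For N = 1024, M = 32, and K = 16 the total is 82 unit routes, of which 62 come from the two circulation-based steps and the remaining 20 are the logarithmic terms.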
The algorithm for the case K > M is given in Fig. 8. The strategy is similar to that for the case K ≤ M. Fig. 9 provides a complexity analysis. The total number of unit routes is 4K + O(log NMK).

    Step 1: Broadcast the K x M cluster center window to the remaining N/K - 1 K x M windows.
    Step 2: The PE's in each K x 1 column window compute term[q], 0 ≤ q < K.
    Step 3: Each row forms a 1 x M window for the consecutive sum operation. This operation is to be repeated K/M times. On the ith iteration term[iM + j], 0 ≤ j < M, of each PE are involved in the operation. Thus, each PE computes K/M A values A[0 ... K/M - 1].
    Step 4: At this time, the PE's in each row have K A values. Each represents a different d2(i, k) value. Each PE computes $D = \min_{0 \le l < K/M} \{ A[l] \}$.
    Step 5: PE(i, 0) computes NewCluster[i] by computing the minimum D in its row and the cluster index corresponding to this.
    Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 8. O(K) memory cluster assignment, K > M.

    Step 1:  log(N/K)
    Step 2:  2K - 2
    Step 3:  2K
    Step 4:  0
    Step 5:  2 log K
    Step 6:  log M

Fig. 9. Complexity analysis of Fig. 8.

2) O(1) Memory: Once again, we need to develop different algorithms for the two cases K ≤ M and K > M. Fig. 10 gives the algorithm for the former case and Fig. 11 for the latter. The complexity analysis for both is done in Fig. 12. The number of unit routes required by each is 4K + O(log NMK).

Let us go through the steps of Fig. 10. Recall that this algorithm is for the case K ≤ M. First, the cluster center window is broadcast such that it resides in all K x M windows of the NM PE hypercube. The objective of Steps 2 and 3 is to compute

$d2(i, j) = \sum_{q=0}^{M-1} (F(i, q) - center(j, q))^2, \quad 0 \le i < N, \; 0 \le j < K$

in PE(i, j). In Step 2, PE(i, j) computes in dist(i, j) the sum

$dist(i, j) = \sum_{r=0}^{K-1} (F(i, lK + r) - center(j \bmod K, lK + r))^2, \quad 0 \le i < N, \; 0 \le j < M$

where $l = \lfloor j/K \rfloor$. Then, in Step 3, d2 is computed by adding the dist values in corresponding PE's of the 1 x K windows of the 1 x M rows. Once d2 has been computed, the new cluster values are easily obtained and broadcast to all PE's representing the pattern.

    Step 1: Broadcast the K x M cluster center window to the remaining N/K - 1 K x M windows.
    Step 2: The PE's in each K x K window perform a distance calculation as discussed in Section II-C8. The result is left in the dist registers of the PE's.
    Step 3: The dist values in the M/K 1 x K windows of each row are summed using a window sum. The result of this is left in the d2 registers of the PE's in the first 1 x K window of each row.
    Step 4: Compute NewCluster(i, 0) = NewCluster[i] by finding q such that $d2(i, q) = \min_{0 \le k < K} \{ d2(i, k) \}$.
    Step 5: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 10. O(1) memory cluster assignment, K ≤ M.

    Step 1: Broadcast the K x M cluster center window to the remaining N/K - 1 K x M windows.
    Step 2: In each K x M window regard the K x M cluster center matrix as K/M M x M cluster center matrices. These will be circulated through the K/M M x M windows of the larger K x M window using the data circulation procedure of Section II-C5. As a result, each M x M cluster center window will visit each M x M window exactly once. Whenever a new M x M cluster center window is received, the M x M PE window does Steps 3 and 4. I.e., these are done a total of K/M times.
    Step 3: Each M x M window does a distance computation as described in Section II-C8. Because of the window size used, each computed distance represents the squared distance between a pattern and a cluster.
    Step 4: Each PE remembers the smallest distance value it has computed so far. It also remembers the cluster index that corresponds to this.
    Step 5: Compute NewCluster(i, 0) by finding q such that $d2(i, q) = \min_{0 \le k < M} \{ d2(i, k) \}$ and using the cluster center index remembered by PE(i, q).
    Step 6: Broadcast NewCluster(i, 0) to the remaining M - 1 PE's in the ith row, 0 ≤ i < N.

Fig. 11. O(1) memory cluster assignment, K > M.

             K ≤ M             K > M
    Step 1:  log(N/K)          log(N/K)
    Step 2:  4K + O(log K)     (K/M)(Step 3 + Step 4 + 1)
    Step 3:  log(M/K)          4M + O(log M)
    Step 4:  log K             0
    Step 5:  log M             log M
    Step 6:  -                 log M

Fig. 12. Complexity analysis of Figs. 10 and 11.

B. Cluster Update

For this operation, we assume that PE(i, j) in the topmost K x M window has values FeatureSum(i, j) and Number(i, j) defined as

$FeatureSum(i, j) = \sum_{q \in S_i} F(q, j), \quad 0 \le i < K, \; 0 \le j < M$

$Number(i, j) = |S_i|, \quad 0 \le i < K, \; 0 \le j < M.$

The algorithm to update the cluster centers is given in Fig. 13. Steps 1 and 2 are performed in K x M windows. The (i, j) PE in each such window computes the change in FeatureSum(i, j) and Number(i, j) contributed by the patterns in this window. These two steps can be restricted to PE's for which NewCluster(i, j) ≠ C(i, j). In Steps 3 and 4 the topmost window accumulates the sum of these changes. Steps 5-8 update the clustering data. The complexity analysis is provided in Fig. 14. A total of $O(\log^2 K + \log(N/K))$ unit routes are used.

Overall Complexity: The total number of unit routes used by our algorithms for one pass of Fig. 1 is $4K + O(\log^2 K) + O(\log NMK)$ regardless of whether the amount of memory available is O(K) or O(1). This improves on the algorithm of Li and Fang [8] which requires O(K log NM) unit routes and O(K) memory per PE.

IV. NMK PE CLUSTERING

An NMK PE hypercube may be viewed as an N x M x K array with each N x M subarray forming one plane.
The F matrix begins in the plane PE(i, j, 0) and the center matrix is in center(i, j, 0), 0 ≤ i < K, 0 ≤ j < M. Cluster reassignment can be done in O(log NMK) time using the algorithm of Fig. 15. Cluster centers may be updated in O(log NMK) time using the algorithm of Fig. 16.

    Step 1: PE(i, j) does an SRAW of F(i, j) to the (NewCluster(i, j), j) PE in its K x M window. It also does an SRAW of -F(i, j) to the (C(i, j), j) PE in its K x M window. Note that both SRAW's involve data movement in K x 1 column windows only. Let the resulting sum in PE(i, j) be A(i, j).
    Step 2: PE(i, j) does an SRAW of +1 to the (NewCluster(i, j), j) PE in its K x M window. It does an SRAW of -1 to the (C(i, j), j) PE in its K x M window. Note that both SRAW's involve data movement in K x 1 column windows only. Let the resulting sum in PE(i, j) be B(i, j).
    Step 3: The A values of corresponding PE's in the N/K K x M windows are added using window sum. The results are in the D registers of the topmost K x M window.
    Step 4: The B values of corresponding PE's in the N/K K x M windows are added using window sum. The results are in the E registers of the topmost K x M window.
    Step 5: FeatureSum(i, j) := FeatureSum(i, j) + D(i, j), 0 ≤ i < K, 0 ≤ j < M.
    Step 6: Number(i, j) := min{∞, Number(i, j) + E(i, j)}, 0 ≤ i < K, 0 ≤ j < M.
    Step 7: center(i, j) := FeatureSum(i, j)/Number(i, j), 0 ≤ i < K, 0 ≤ j < M.
    Step 8: C(i, j) := NewCluster(i, j), 0 ≤ i < N, 0 ≤ j < M.

Fig. 13. Cluster updating.

    Step 1:  O(log^2 K)
    Step 2:  O(log^2 K)
    Step 3:  log(N/K)
    Step 4:  log(N/K)
    Step 5:  0
    Step 6:  0
    Step 7:  0
    Step 8:  0

Fig. 14. Complexity analysis of Fig. 13.

    Step 1: Broadcast the center and F matrices such that center(i, j, k) = center[k, j] and F(i, j, k) = F[i, j], 0 ≤ i < N, 0 ≤ j < M, 0 ≤ k < K.
    Step 2: Compute $term(i, j, k) = (F(i, j, k) - center(i, j, k))^2 = (F[i, j] - center[k, j])^2$.
    Step 3: Compute $d2(i, 0, k) = \sum_{j=0}^{M-1} term(i, j, k) = d2[i, k]$ by summing along the second dimension.
    Step 4: Compute NewCluster(i, 0, 0) = q such that $d2(i, 0, q) = \min_{0 \le k < K} \{ d2(i, 0, k) \}$.
    Step 5: Broadcast NewCluster(i, 0, 0) along the second dimension.

Fig. 15. Cluster reassignment with NMK PE's.

    Step 1: Broadcast F, NewCluster, and C such that F(i, j, k) = F(i, j, 0), NewCluster(i, j, k) = NewCluster(i, j, 0), C(i, j, k) = C(i, j, 0), 0 ≤ i < N, 0 ≤ j < M, 0 ≤ k < K.
    Step 2: if NewCluster(i, j, k) = C(i, j, k) set A(i, j, k) and N(i, j, k) to 0
            else if NewCluster(i, j, k) = k set A(i, j, k) = F(i, j, k) and N(i, j, k) to 1
            else if C(i, j, k) = k set A(i, j, k) to -F(i, j, k) and N(i, j, k) to -1
            else set A(i, j, k) to 0 and N(i, j, k) to 0.
    Step 3: Compute $B(0, j, k) = \sum_{i=0}^{N-1} A(i, j, k)$ and $D(0, j, k) = \sum_{i=0}^{N-1} N(i, j, k)$.
    Step 4: Translate B and D so that E(k, j, 0) = B(0, j, k) and F(k, j, 0) = D(0, j, k), 0 ≤ j < M, 0 ≤ k < K.
    Step 5: FeatureSum(k, j, 0) := FeatureSum(k, j, 0) + E(k, j, 0),
            Number(k, j, 0) := min{∞, Number(k, j, 0) + F(k, j, 0)}, and
            center(k, j, 0) := FeatureSum(k, j, 0)/Number(k, j, 0), 0 ≤ j < M, 0 ≤ k < K.
            C(i, j, 0) := NewCluster(i, j, 0), 0 ≤ i < N, 0 ≤ j < M.

Fig. 16. Cluster center update with NMK PE's.

V. CLUSTERING ON A MEDIUM GRAIN HYPERCUBE

In the previous sections, we have developed algorithms to perform clustering on a fine grain hypercube. Such a computer has the property that the cost of interprocessor communication is comparable to that of a basic arithmetic operation. In this section, we shall consider the clustering problem on a hypercube in which interprocessor communication is relatively expensive and the number of processors is small relative to the number of patterns N. In particular we shall experiment with an NCUBE/7 hypercube which is capable of having up to 128 processors. The NCUBE/7 available to us, however, has only 64 processors. The time to perform a 2 byte integer addition on each hypercube processor is 4.3 µs whereas the time to communicate b bytes to a neighbor processor is approximately 447 + 2.4b µs [4].

Since the hypercube computer is attached to a host computer, two cases of the clustering problem can be studied. These vary in the initial location of the pattern and the location of the final results:

1) Host-to-host: The pattern and cluster information is initially at the host and the result is to be left in the host also.

2) Hypercube-to-hypercube: The pattern and cluster information is initially at the nodes and the result is to be left at the nodes.

    Step 1: Receive partial feature matrix from host. Node s receives F[i, j], s(N/p) ≤ i < (s + 1)(N/p), 0 ≤ s < p. Node 0 also receives the cluster center matrix.
    Step 2: Steps 3-7 are repeated "iteration" number of times.
    Step 3: Node 0 broadcasts the cluster center matrix to all nodes.
    Step 4: Each node calculates the new clusters for each pattern using the cluster center matrix.
    Step 5: Each node s calculates
            $T[s][t, j] = \sum_{i \in S_t,\; s(N/p) \le i < (s+1)(N/p)} F[i, j], \quad 0 \le t < K, \; 0 \le j < M$
            $N[s][t] = \sum_{i \in S_t,\; s(N/p) \le i < (s+1)(N/p)} 1, \quad 0 \le t < K$
            where $S_t$ denotes the tth cluster.
    Step 6: At node 0, the following information is gathered:
            $S[t, j] = \sum_{s=0}^{p-1} T[s][t, j], \quad 0 \le t < K, \; 0 \le j < M$
            $N[t] = \sum_{s=0}^{p-1} N[s][t], \quad 0 \le t < K$
            This is done using a binary tree as in [2]. At each stage the node receiving the information adds its information to the received information and sends it to its parent.
    Step 7: Node 0 calculates the new cluster center matrix.
    Step 8: Each node s sends the final cluster assignment of its patterns i, s(N/p) ≤ i < (s + 1)(N/p), to the host.

Fig. 17. Clustering algorithm on the hypercube.

    (a) 512 patterns
    No. of Processors    No. of Clusters
                         16         32         64
     1                   44.188     86.842     171.481
     2                   22.346     43.659     86.286
     4                   11.463     22.282     43.921
     8                   6.080      11.700     22.941
    16                   3.440      6.504      12.632
    32                   2.184      4.015      7.675
    64                   1.632      2.889      5.405

    (b) 1024 patterns
    No. of Processors    No. of Clusters
                         16         32         64
     2                   44.519     ...        171.942
     4                   22.651     44.053     86.858
     8                   11.778     22.689     44.510
    16                   6.390      12.099     23.518
    32                   3.760      6.192      13.218
    64                   2.520      4.439      8.278

    Times are in seconds. Number of features = 20. Number of iterations = 10.

Fig. 18. Host-to-host times.

    (a) 512 patterns
    No. of Processors    No. of Clusters
                         16         32         64
     1                   44.182     86.568     171.341
     2                   22.228     43.544     86.175
     4                   11.303     22.128     43.779
     8                   5.891      11.516     22.768
    16                   3.236      6.307      12.450
    32                   1.959      3.799      7.477
    64                   1.372      2.641      5.178

    (b) 1024 patterns
    No. of Processors    No. of Clusters
                         16         32         64
     2                   44.284     86.760     171.714
     4                   22.331     43.736     86.549
     8                   11.405     22.321     44.153
    16                   5.993      11.709     23.142
    32                   3.338      6.500      12.823
    64                   2.062      3.991      7.851

    Times are in seconds. Number of features = 20. Number of iterations = 10.

Fig. 19. Hypercube-to-hypercube times.

Let p be the number of hypercube processors. We assume that the N feature vectors that constitute the feature matrix are distributed equally among the p processors and that the center matrix is located initially at node 0. Further we assume that each processor has enough memory to hold its share of the pattern feature matrix and the whole cluster center matrix. Fig. 17 gives our hypercube algorithm for the host-to-host case. This algorithm assumes that p divides N. If this is not the case, then, in Step 1, the rows of the feature matrix can be distributed so that each of q = N mod p processors gets $\lceil N/p \rceil$ rows of the feature matrix and the remaining p - q processors get $\lfloor N/p \rfloor$ rows each.
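The uneven row distribution for the case when p does not divide N can be sketched as below. The text does not say which q processors receive the extra row; this sketch assumes they are the first q nodes, and the function name `row_ranges` is hypothetical.

```python
def row_ranges(N, p):
    """Half-open row ranges [start, end) of the feature matrix per node.

    q = N mod p nodes get ceil(N/p) rows and the remaining p - q nodes
    get floor(N/p) rows.  Assumption: the first q nodes get the extra row.
    """
    base, q = divmod(N, p)
    ranges = []
    start = 0
    for s in range(p):
        count = base + 1 if s < q else base
        ranges.append((start, start + count))
        start += count
    return ranges
```

For example, `row_ranges(10, 4)` gives nodes 0 and 1 three rows each and nodes 2 and 3 two rows each. When p divides N, every node s gets rows s(N/p) through (s + 1)(N/p) - 1, as in Step 1 of Fig. 17.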
The algorithm of Fig. 17 does not require that N, M, and K be powers of 2. The current cluster center matrix is broadcast, in Step 3, to all p processors using the standard binary tree broadcast scheme [2]. Steps 4 and 5 require no interprocessor communication. Step 6 uses the broadcast binary tree of Step 3 backward (i.e., from the leaves to the root rather than from the root to the leaves). Step 7 requires no interprocessor communication. For the hypercube-to-hypercube case, Steps 1 and 8 are omitted. Interprocessor communication is done in Steps 3 and 6. The time needed for this is $O(MK \log p)$. The processing time for Steps 4, 5, and 6 is $O(NMK/p)$. The overall time complexity for one iteration of Steps 3-7 is therefore $O(NMK/p + MK \log p)$. When $N/p$ is $\Omega(\log p)$, the first term dominates and the time becomes $O(NMK/p)$.

The clustering algorithm of Fig. 17 was programmed in C and run on an NCUBE/7 hypercube. We considered two random feature matrices. One had N = 512 patterns (or feature vectors) and M = 20, and the other had N = 1024 and M = 20. The use of random feature matrices is justified in our experiments, as we are interested in measuring the effectiveness of our parallel algorithm in terms of speedup rather than in the convergence of the clustering algorithm. The convergence properties of the parallel clustering algorithm are the same as those of the sequential one, and the run time of the parallel algorithm is insensitive to the actual feature values.

Run times for ten iterations of Steps 3-7 are given in Figs. 18 and 19. Run times are reported for p = 1, 2, 4, 8, 16, 32, and 64 for the case N = 512 and for p = 2, 4, 8, 16, 32, and 64 for the case N = 1024. Fig. 18 is for the host-to-host case, while Fig. 19 is for the hypercube-to-hypercube case. For small p, there is little difference between the times for the two cases. That is, the time for Steps 1 and 8 is small compared to the time for the remaining steps.
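To make the data flow of Steps 4-7 concrete, the following C sketch simulates the p nodes in a single address space. It is an illustration under stated assumptions, not the program that was run on the NCUBE/7. The function names, the flat row-major array layout, the tie-breaking rule, and the treatment of empty clusters are all choices made here. A real implementation would replace the loop in `tree_reduce` with sends and receives between neighboring nodes.

```c
#include <assert.h>
#include <float.h>

/* Steps 4 and 5 on one node: reassign each local pattern to its nearest
 * center, then accumulate per-cluster feature sums T[k*M + j] and counts
 * cnt[k]. F holds `rows` local patterns of M features each (row-major).
 * T and cnt must be zeroed by the caller. */
void local_step(const double *F, int rows, int M,
                const double *center, int K,
                int *C, double *T, int *cnt)
{
    for (int i = 0; i < rows; i++) {
        int best = 0;
        double best_d2 = DBL_MAX;
        for (int k = 0; k < K; k++) {
            double d2 = 0.0;
            for (int j = 0; j < M; j++) {
                double diff = F[i * M + j] - center[k * M + j];
                d2 += diff * diff;
            }
            if (d2 < best_d2) {   /* ties go to the lowest k (assumption) */
                best_d2 = d2;
                best = k;
            }
        }
        C[i] = best;
        cnt[best]++;
        for (int j = 0; j < M; j++)
            T[best * M + j] += F[i * M + j];
    }
}

/* Step 6, simulated: T and cnt hold the p per-node buffers back to back.
 * At each stage node s + stride passes its buffer to node s, which adds
 * it to its own. After ceil(log2 p) stages node 0 holds the totals. */
void tree_reduce(double *T, int *cnt, int p, int K, int M)
{
    assert(p > 0);
    for (int stride = 1; stride < p; stride *= 2) {
        for (int s = 0; s + stride < p; s += 2 * stride) {
            for (int x = 0; x < K * M; x++)
                T[s * K * M + x] += T[(s + stride) * K * M + x];
            for (int k = 0; k < K; k++)
                cnt[s * K + k] += cnt[(s + stride) * K + k];
        }
    }
}

/* Step 7 at node 0: new center = feature sum / cluster size.
 * Empty clusters keep their old center (an assumption; the text does
 * not say how they are handled). */
void update_centers(const double *S, const int *n, int K, int M,
                    double *center)
{
    for (int k = 0; k < K; k++)
        if (n[k] > 0)
            for (int j = 0; j < M; j++)
                center[k * M + j] = S[k * M + j] / n[k];
}
```

In this sketch `local_step` performs work proportional to (N/p)MK on each node, and `tree_reduce` runs about log p stages that each move MK sums and K counts, which mirrors the two terms of the per-iteration time bound.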
Since the host-to-hypercube and hypercube-to-host data transfer time of Steps 1 and 8, respectively, is relatively insensitive to the value of p, and since the time for Steps 2-7 decreases as 1/p, the significance of the time for Steps 1 and 8 increases as we increase p. In fact, for p = 64 and K = 16, the time for Steps 1 and 8 is approximately 20% of the time for ten iterations of Steps 3-7 when N = 512 and approximately 23% when N = 1024.

The run times of Fig. 19 closely agree with our analysis for Steps 3-7. Note that for our test data $N/p \ge \log_2 p$, so the predicted run time is $O(NMK/p)$. Hence, we expect the run time to increase linearly with N, M, and K and to vary as 1/p with p.

The speedup and efficiency (speedup/p) for N = 512, K = 32, M = 20 for ten iterations are given in Fig. 20. Fig. 21 gives these figures for the case N = 1024, K = 32, M = 20, and number of iterations = 10. From Fig. 20, we see that when N = 512, we get greater than 80% efficiency so long as the number of processors is no more than 16. When p = 64, the efficiency drops to approximately 50%. This drop in efficiency as p increases while the problem size is held fixed is expected, as the useful work per processor declines and the effects of the nonuseful work (e.g., interprocessor communication) become more significant. The efficiency is expected to be larger for larger N, as the amount of useful work per processor increases relative to the amount of nonuseful work being performed. With an N of 1024, the efficiency for p = 64 is approximately 60%. Since we did not have enough memory to solve an N = 1024 instance on one processor, the p = 1 time is estimated from the p = 2 time using an efficiency of 0.994, which is the p = 2 efficiency for the case N = 512.

        Host-to-host             Hypercube-to-hypercube
p       Speedup    Efficiency    Speedup    Efficiency
1       1.000      1.000         1.000      1.000
2       1.989      0.994         1.988      0.994
4       3.897      0.974         3.910      0.978
8       7.422      0.928         7.517      0.939
16      13.352     0.834         13.725     0.857
32      21.629     0.676         22.787     0.712
64      30.059     0.470         32.778     0.512

Number of clusters = 32. Number of features = 20. Number of iterations = 10.
Fig. 20. 512 patterns: Speedup and efficiency.

Number of clusters = 32. Number of features = 20. Number of iterations = 10.
Fig. 21. 1024 patterns: Speedup and efficiency (based on estimated time for one processor).

VI. CONCLUSIONS

In this paper, optimal algorithms for squared error clustering were developed. We considered the two cases when the number of PE's available is NM and NMK. For the former, we developed algorithms for the cases of $O(K)$ and $O(1)$ memory per PE. While the algorithms for both cases use a comparable number of unit routes ($4K + O(\log^2 K) + O(\log NMK)$), those for the case of $O(K)$ memory are simpler. Our algorithm for the case of NMK PE's runs in $O(\log NMK)$ time and uses $O(1)$ memory per PE. All our algorithms are optimal to within a constant factor. Experimental results obtained on an NCUBE/7 hypercube indicate that the clustering problem can be solved efficiently on commercial medium grain multicomputers.

[9] D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers," IEEE Trans. Comput., vol. C-30, pp. 101-107, Feb. 1981.
[10] L. M. Ni and A. K.