ALFRED P. SLOAN SCHOOL OF MANAGEMENT
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE, CAMBRIDGE, MASSACHUSETTS 02139

WORKING PAPER

Examining the Validity of Sample Clusters Using the Bootstrap Method

by M. A. Wong and D. A. Shera

WP 1469-83                                                   August 1983

Abstract

An important problem in clustering research is the stability of sample clusters. Cluster diagnostics, based on the bootstrap subsampling procedure and Fowlkes and Mallows' Bk statistic, are developed in this study to aid the users of cluster analysis in assessing the stability and validity of sample clusters.

1. INTRODUCTION

An important problem in clustering research is the stability and validity of the sample clusters. For Euclidean data, Hartigan (1981), Wong (1982), and Wong and Lane (1983) have developed procedures to evaluate sample clusters using the density-contour clustering model. However, it is often the case that in clustering the nations of the world, or the states of the United States, or the companies of an industry, the objects under study cannot be reasonably viewed as a sample from some underlying population. Under these circumstances, it is not reasonable to talk about sampling errors in the computed clusters. But the stability of the sample clusters can be evaluated in other ways. In the approach taken by Baker (1974), Hubert (1974), and Baker and Hubert (1975), the question addressed is how the clustering solutions provided by the single and complete linkage methods are affected by changes in the distance or similarity matrix. This is a reasonable question, as different data collected on the objects might change the distances or similarities. Ling (1973) took a different approach and developed an exact probability theory for testing the compactness and isolation of single linkage clusters, under the (admittedly unrealistic) assumption that the rank order of the entries in the distance matrix is completely random.

Another approach is appropriate when the sample similarity matrix provides an approximation of the population similarity matrix for a fixed set of objects. For example, in marketing research, if the brand-switching behavior of the total population of coffee consumers were known, a population similarity matrix (in terms of relative frequency of switching between brands) could be obtained for the various coffee brands. For such a finite population, a true hierarchical clustering can be defined on the objects (here, the coffee brands) using the population similarity matrix (e.g., a block-distance hierarchical clustering model can be used). However, only a sample of coffee consumers is available in practice, and the sample similarity matrix obtained from it provides only an aggregate approximation of the true similarities between brands. Consequently, the sample hierarchical clustering obtained from this similarity matrix is merely a sample estimate of the true hierarchical clustering. In this type of study, standard sampling procedures like the bootstrap or cross-validation methods described in Efron (1979a, 1979b), or the error analysis scheme given in Hartigan (1969, 1971), can be usefully applied to the sample subjects (e.g., coffee consumers) to perform two main tasks:

1. to assess the similarity between the sample and population hierarchical clusterings, and
2. to assess the stability of the sample clusters,

both using the Bk measure developed by Fowlkes and Mallows at Bell Laboratories (1983).

In Section 2, we describe the theoretical background for this study. There are also discussions on the type of data we are concerned with, the clustering methods used, the statistics to compare clustering trees, and how the bootstrap method is used. The sampling experiments are described in detail in Section 3; the simulation results and their implications are also discussed there. Section 4 draws conclusions and describes how the techniques can be used most effectively.

2. RESEARCH BACKGROUND

In this study, our main concern is to develop diagnostic tools for assessing the validity and stability of sample clusters in the case where a true hierarchical clustering can be defined on the objects of a finite population by the ultrametrics model (Johnson 1967), and the sample similarity matrix is merely an approximation of the population similarity matrix. In order to evaluate the clusters obtained by various clustering techniques from the sample similarity matrix, we need to examine the distribution of the degree of agreement between the sample and population clusters in a series of simulated sampling experiments. We will adopt Fowlkes and Mallows' Bk statistic as a measure of the degree of agreement. The clustering techniques used in this study are reviewed in section 2.1, and the Bk statistic is described in section 2.2. In section 2.3, the bootstrap procedure will be outlined.

2.1. CLUSTERING METHODS

Clustering is the splitting of a set of objects into partitions, sets, or groups of one or more objects. A hierarchical clustering is a set of clusterings {Ci}, indexed from 1 to N, the number of objects. The index i is the number of clusters in partition Ci. What makes it hierarchical is that if objects X and Y are in the same cluster in partition Ci, then they are in the same cluster in Cj for all j < i. A tree, a dendrogram, a taxonomy, and a number of other things can all mean hierarchical clustering. The term tree reflects the property that a hierarchical clustering can be represented on paper by something which resembles an upside-down botanical tree.

Here is a more dynamic definition of a tree. Given a distance matrix, each clustering method proceeds as follows: start with each object in its own cluster. At each step, combine two clusters into a single cluster. Continue linking until all the objects are in a single cluster. Thus for N objects we will have N-1 linkings and a tree of N clusterings, the first and last being trivial.

The differences in the methods come from the different criteria used in linking to decide which two clusters to combine at each step. One of the most common criteria is the distance between the closest two objects of the two clusters. This is known as single linkage. Complete linkage defines the distance between two clusters to be the maximum of the distances between any object in the first cluster and any object in the second. At each step, this method links the two clusters which are closest together by that distance measurement. Average linkage is the same except it uses the average of the distances between objects in each cluster.
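The three linkage criteria just described are available in standard software. The following is a minimal sketch (ours, not part of the original paper, assuming SciPy) that builds all three trees from a small distance matrix:

```python
# A minimal sketch (ours): building single, complete, and average linkage
# trees from a distance matrix with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# A small symmetric distance matrix on four objects: two tight pairs.
D = np.array([[0.0, 0.1, 0.7, 0.7],
              [0.1, 0.0, 0.7, 0.7],
              [0.7, 0.7, 0.0, 0.1],
              [0.7, 0.7, 0.1, 0.0]])

condensed = squareform(D)  # SciPy expects the upper triangle as a flat vector
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)  # (N-1) x 4 table: one row per linking
    print(method, "linking heights:", Z[:, 2])
```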
Wong and Lane (1983) proposed a method which involves looking at the Kth nearest neighbor to each object, where K is a parameter chosen by the data analyst. The rationale comes from density estimation, so that it is like estimating the density of the true distribution within the Kth nearest neighborhood of each object. The clusters of highest density are linked together first, and two clusters are linked only on the additional condition that at least some object in one cluster be within the Kth nearest neighborhood of some object of the other cluster. As a result, this method also produces an estimate for the number of clusters. To decide which K to use, Wong and Lane have suggested looking at the results for all values of K and seeing what the most common value of the number of clusters is. There are many other existing clustering methods and criteria, but they will not be considered here.

2.2. THE Bk STATISTIC

The definition of Bk is as follows. The k in Bk refers to the number of clusters, and we will compare the clusterings with k clusters in the two trees. Consider a table Mij, where i and j run from 1 to k, and where Mij equals the number of objects in cluster i of the clustering from the first tree and in cluster j of the clustering from the second. Let Mi0 be the number of objects in cluster i of the first clustering, and M0j be the number of objects in cluster j of the second; Mi0 and M0j are the marginal totals of the rows and columns of table Mij. Given k, let

    Pk = (sum over i of Mi0^2) - N
    Qk = (sum over j of M0j^2) - N
    Tk = (sum over i and j of Mij^2) - N

and then

    Bk = Tk / sqrt(Pk * Qk).

Fowlkes and Mallows (1983) calculated the null distribution for Bk, that is, the distribution for two completely independent trees. However, it is difficult to imagine a situation in which the null distribution should be considered, since there is almost always going to be clustering of some kind. Most situations will have significant deviation from the null case.

Rand (1971) also proposed a statistic to measure the similarity between clusterings. With the same Pk, Qk, and Tk as defined above for Bk, Rand's statistic was

    Rk = 1 + (Tk - (Pk + Qk)/2) / C(N, 2),

where C(n, 2) equals the number of combinations of n objects taken 2 at a time. In the paper introducing Bk, Fowlkes and Mallows show that Rand's statistic was insensitive, since it did not indicate important differences in situations where Bk rightfully did. We will use the Bk statistic in this study.
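To make the definition concrete, here is a short sketch (ours, not from the paper) computing Bk for two clusterings given as label vectors; scikit-learn's metrics.fowlkes_mallows_score computes the same quantity.

```python
# A minimal sketch (ours) of the Bk computation defined above.
import numpy as np

def bk(labels1, labels2):
    """Fowlkes-Mallows Bk for two clusterings given as label vectors."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    n = len(labels1)
    c1, c2 = np.unique(labels1), np.unique(labels2)
    # M[i, j] = number of objects in cluster i of the first clustering
    # and in cluster j of the second.
    M = np.array([[np.sum((labels1 == a) & (labels2 == b)) for b in c2]
                  for a in c1])
    Tk = np.sum(M ** 2) - n
    Pk = np.sum(M.sum(axis=1) ** 2) - n
    Qk = np.sum(M.sum(axis=0) ** 2) - n
    return Tk / np.sqrt(Pk * Qk)

# Two clusterings of six objects into k = 2 clusters:
print(bk([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 2]))  # 0.617...
```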
2.3. THE BOOTSTRAP METHOD

Bootstrapping is a numerical technique proposed by Efron (1979a, b, 1983) to estimate the distribution of a statistic. Consider a set X = {Xi} of N independent and identically distributed (iid) random variables with true probability distribution F, and a statistic R(X, F); the objective of the bootstrap is to estimate the distribution of R. To do this, consider the empirical distribution of X, in which each Xi has probability 1/N, and call this distribution G. (Note that not each value of Xi has equal probability, because Xi might take the same value for two or more values of i.) Now draw a number of new samples Yj = {Yi}j from G, where the size of each sample Yj is the same as that of X, i.e. N. Then calculate R(Yj, G) for each bootstrap sample Yj.

The main assumption is that G is a good approximation of F. With this assumption, we say that the distribution of the R(Yj, G)'s is a good approximation of the true distribution of R(X, F). This is a restrictive but not an unreasonable assumption: the same kind of assumption applies when a statistician rejects a model because the data lies outside some confidence region; he or she is assuming with some confidence that the data is unusual but hardly exceptional. Standard probability theory tells us that for large N, G approaches F almost surely, under suitable conditions.
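A minimal sketch of the procedure follows (ours; the statistic R here is the sample median, chosen purely for illustration):

```python
# A minimal bootstrap sketch (ours): estimating the variability of a
# statistic R by resampling from the empirical distribution G.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=25)      # the original sample of size N from an unknown F
R = np.median                # the statistic of interest (illustration only)

boot_values = []
for _ in range(1000):
    Y = rng.choice(X, size=len(X), replace=True)  # a bootstrap sample from G
    boot_values.append(R(Y))

# The spread of the R(Yj, G)'s approximates that of R(X, F).
print("bootstrap standard error of the median:", np.std(boot_values))
```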
The data analyst may decide to smooth the data by adding random noise or fitting a smooth distribution to it, if he or she feels that is appropriate. A bootstrap sample is defined as a sample drawn from the empirical distribution, G, based on the original sample, or perhaps a smoothed version of it. The term sample by itself will mean a sample from the true distribution, not a bootstrap. In this study, we have an unknown "population" similarity matrix and a sample similarity matrix. The bootstrapping procedure will be applied to the elements of the sample similarity matrix independently to obtain subsample matrices.

2.4. SIMULATION DATA

It is difficult to make general statements about data because it comes in so many different forms. In fact, translating data to computer-usable form usually requires writing a new program for each problem. Here we describe our data and how it is treated in this paper.

Objects are the things of interest which we want to cluster. These objects might be variables or countries or brands of chewing gum. N will denote the number of objects here. Data is made up of responses or single values; e.g., a single draw Xi from a distribution F is a response. A sample is a collection of independent responses. (Some objects of interest will not change within the scope of a single problem. In many studies the sample size is an important part of the technique, and there will be samples and resamples. The bootstrap procedure we used has the resample size fixed equal to the original sample size. There have been studies where the sample size does change for different samples (Hartigan 1981), but we will not be concerned with that here.)

Since all the clustering techniques here involve operations on a distance matrix, eventually we will want to change the raw data into a distance matrix between the objects. The resulting distance matrix need not be a true metric in the mathematical sense, i.e. it need not satisfy the triangle inequality. Since this is not the focus of the paper, we will assume that the transformation from X to its distance matrix is clear and done implicitly. Thus, the methods presented here will apply to any situation where one can create a distance or similarity matrix from the data.

Here are some examples of responses and their relation to distance matrices:

a) A sample of distance matrices, one for each individual, giving his or her associations between the objects. These responses, combined in some average fashion, would then become the distance matrix corresponding to the sample.

b) A vector of values for each individual, where we are trying to cluster the variables. We might measure the association between variables by linear or monotone correlations, or something more sophisticated.

c) A simple distance matrix between the objects. This is the case addressed by Hubert (1974) and will not be approached here.

In our experiments, the ultrametrics tree model will be used to generate the population distance matrix. Below is the distance matrix for data set Al; it corresponds to a hierarchical clustering of 10 objects with 3 well-defined clusters. Following it is the tree corresponding to set Al. Other data sets will be introduced in Section 3 as they are used.

Table 2.4.1: True Distances of Set Al

        Al   A2   A3   Bl   B2   B3   Cl   C2   C3   C4
  Al    -   .10  .10  .40  .40  .40  .70  .70  .70  .70
  A2         -   .10  .40  .40  .40  .70  .70  .70  .70
  A3              -   .40  .40  .40  .70  .70  .70  .70
  Bl                   -   .05  .05  .70  .70  .70  .70
  B2                        -   .05  .70  .70  .70  .70
  B3                             -   .70  .70  .70  .70
  Cl                                  -   .15  .15  .15
  C2                                       -   .15  .15
  C3                                            -   .15
  C4                                                 -

The application in mind can be described as follows. Suppose the objects were brands of detergents and the respondents were detergent consumers. The real distance between detergent 1 and detergent 2 is the proportion of consumers who would not use each as a substitute for the other. In marketing research studies, a finite sample of, say, 100 people would be tested as to whether they would accept one product as a substitute for the other. So if 80 percent of the people switched, then the distance is .20, and it is a reasonable estimate of the true distance in the population.

Figure 2.4.2: Tree for Data Set Al. [Dendrogram for the 10 objects AAABBBCCCC: the B's join at height .05, the A's at .10, the C's at .15, the A and B clusters join at .40, and all objects join at .70; the vertical scale runs from 0.00 to 1.00.]

To simulate the survey described above, the natural thing to do is to create binomial random variables with probability parameters equal to the distances and with n = 100 trials. Sample distances are distributed Binomial(Pij, 100)/100, where Pij is the true distance between objects i and j. Likewise, if Sij is the sample distance, the bootstrap distances are distributed Binomial(Sij, 100)/100. The next two distance matrices are examples of a sample from set Al and then one of the bootstrap samples based on that sample.

Table 2.4.3: Example of Sample Distances, Set Al. [The sample and bootstrap distance matrices are not legible in the source.]
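This sampling scheme is easy to sketch in code (ours, not the authors'; the 4-object matrix P below is a stub standing in for Table 2.4.1):

```python
# A sketch (ours) of the survey simulation: each off-diagonal distance is
# drawn independently as Binomial(p, 100)/100.
import numpy as np

rng = np.random.default_rng(1)

def sample_distances(P, n=100):
    """Draw a sample distance matrix, entry (i, j) ~ Binomial(n, P[i, j])/n."""
    N = P.shape[0]
    S = np.zeros_like(P)
    for i in range(N):
        for j in range(i + 1, N):
            S[i, j] = S[j, i] = rng.binomial(n, P[i, j]) / n
    return S

# P stands in for the true matrix of Table 2.4.1 (a 4-object stub here).
P = np.array([[0.00, 0.10, 0.70, 0.70],
              [0.10, 0.00, 0.70, 0.70],
              [0.70, 0.70, 0.00, 0.15],
              [0.70, 0.70, 0.15, 0.00]])

S = sample_distances(P)   # a sample distance matrix
B = sample_distances(S)   # a bootstrap distance matrix based on that sample
```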
3. RESULTS

3.1. BK PLOTS

3.1.1. Data Set Al, Complete Linkage

In the first experiment based on data set Al, 50 samples from the true distribution were generated and complete linkage was used to make 50 trees. Bk was calculated by comparing each of the 50 trees with the true tree. The distribution of Bk for each k is shown in figure 3.1.1.1. The horizontal scale runs from 1 to N (N = 10); k = 1 and k = N lie at the left and right borders, respectively. The vertical scale runs from 0 at the bottom to 1 at the top, with hatch marks at increments of .10. For each k, the numbers count how many Bk's took on that value.

Figure 3.1.1.1: Plot of Bk versus k. Set Al, Complete Linkage. [Frequency plot not reproduced.]

Figure 3.1.1.2 shows the means of Bk, indicated by stars, "*", and plus and minus one sample standard deviation, indicated by dashes, "-". They are truncated at the top and bottom boundaries because Bk will only lie in the unit interval.

Figure 3.1.1.2: Plot of Bk versus k. Set Al, Complete Linkage. [Mean and standard deviation plot not reproduced.]

The following are box plots of the same results. The median is represented by a number sign, "#", the quartiles by a plus sign, "+", and the first and seventh eighths by minus signs or dashes, "-". The regions in between the quartiles were filled in with vertical bars, "|", to make the box-plots more visible. Often, because the eighths are equal to the quartiles and/or the quartiles equal to the median, only one of the symbols is shown. The discreteness of the distribution causes this problem. It also often causes rather large boxes for higher values of k.

Figure 3.1.1.3: Plot of Bk versus k. Set Al, Complete Linkage. [Box plot not reproduced.]

Because of the discreteness, simply finding the average may not be an appropriate summary of the variability. The box plots seem to convey the information efficiently. The frequency plot contains the most information, but it can be difficult to read, especially when there are many objects.

Before we continue with more experiments, let us explain one important feature of the Bk plot. It is clear that the plot of Bk is not smooth but jagged, and actually contains many peaks, valleys, and steep cliffs. The reason this is desirable is that the peaks indicate the more stable and relevant clusterings.

To see the reason, consider the structure of Set Al and the process of linking for the samples. The objects labelled with B's are most likely to link together before those with A's or C's. At the beginning, after 1 and 2 linkings, when k is 9 and 8, none of the clusters is completely linked, and Bk is small because only random pieces have been linked together. The B cluster is usually the first to be completely linked, and Bk rises when it is. The peak is not always at exactly the same k, though: sometimes it comes at k + 1 or k - 1, because some A's may get a small distance in the sample, or the B's a large distance, and then A's link before the B cluster is complete. So by looking at the clustering at level k when k is one of these peaks, one will likely find stable clusters.
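Every Bk plot in this section follows the same recipe; a compressed sketch of one experiment (ours, reusing P, sample_distances, and bk from the earlier sketches, with complete linkage as in section 3.1.1) is:

```python
# A sketch (ours) of one Bk-plot experiment: 50 sample distance matrices,
# a complete linkage tree from each, compared with the true tree at every k.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def tree_labels(D, k):
    """Cut a complete linkage tree of distance matrix D into k clusters."""
    Z = linkage(squareform(D), method="complete")
    return fcluster(Z, t=k, criterion="maxclust")

N = P.shape[0]                     # P, sample_distances, bk defined earlier
bk_values = np.zeros((50, N - 2))  # Bk is defined for k = 2, ..., N-1
for s in range(50):
    S = sample_distances(P)
    for k in range(2, N):
        bk_values[s, k - 2] = bk(tree_labels(P, k), tree_labels(S, k))

# Each column is the distribution of Bk for one k; peaks in the column
# means (or medians) mark the more stable clusterings.
print(bk_values.mean(axis=0))
```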
3.1.2. Data Set Bl, Complete Linkage

Figure 3.1.2.1: Tree for Data Set Bl. [Dendrogram for the 18 objects AAAABBBBCCCCDDDDEE, with linking heights at .30, .50, .55, .65, and 1.00.]

The three figures here are for Data Set Bl, which represents a different population consisting of 18 objects. There are 3 distinct clusters of 4 objects each, labelled with A's, B's, and C's, and 6 other objects, labelled with D's and E's, which are in a less distinct cluster.

Figure 3.1.2.2: Plot of Bk versus k. Set Bl, Complete Linkage, 50 Samples. [Frequency plot not reproduced.]

More objects made the distribution smoother, except for the very high values of k; but it is still not smooth enough for us to use only the sample mean and standard deviation as descriptors for the distribution.

Figure 3.1.2.3: Plot of Bk versus k. Set Bl, Complete Linkage, 50 Samples. [Box plot not reproduced.]

3.1.3. Data Set Al, Other Methods

Section 3.1.1 contained the plots of Bk for 50 samples when complete linkage was the method used to form the trees. Next are the results when single linkage, average linkage, and the kth nearest neighbor algorithm are used.

Figure 3.1.3.1: Plot of Bk versus k. Set Al, Single Linkage, 50 Samples. [Box plot not reproduced.]

The Bk's for single linkage are typically higher, especially near high values of k, near N, the number of objects. Complete linkage finds the stable clusters at lower values of k, also near 1. In their experiments, Fowlkes and Mallows also see that the single linkage Bk decayed faster than the complete linkage. This is easily explained by the fact that single linkage is known to be "continuous" while complete linkage is known to be "discontinuous". Discontinuous means that drastic changes in the result can come from small changes in the data. However, single linkage has its problems, too. It is sensitive to "chaining", where small linking distances can lead to clusters being linked earlier than they should normally be.

Figure 3.1.3.2: Plot of Bk versus k. Set Al, Average Linkage, 50 Samples. [Box plot not reproduced.]

By using the average distance between objects in the clusters instead of the minimum or maximum, chaining and discontinuity might be avoided. As one would expect, Bk with average linkage (figure 3.1.3.2) usually ends up between the single and complete linkage results.

Figure 3.1.3.3: Plot of Bk versus k. Set Al, Nearest Neighbor (2) Linkage, 50 Samples. [Box plot not reproduced.]

In the plot shown above, it is seen that the kth nearest neighbor method failed to identify the two population clusters at k = 2, indicated by low Bk's, although it did find stable clusters at k = 3. The reason for this confusion is that the kth nearest neighbor method is oriented towards picking out the number of clusters and the clusters themselves, and not towards reproducing the whole tree. Here, 3 is its choice for the number of clusters, and that is a perfectly good answer. Thus, it is not appropriate to evaluate the kth nearest neighbor method by how well it reproduces the tree, nor to use it together with Bk. One additional point: this method is not designed for a small set of objects but more for Euclidean data with more than 100 objects. For these experiments the parameter K was 2; when other parameters were used, the tree reproduction was worse. For these reasons, we will not consider this method for the remainder of the study.

3.1.4. Data Set Bl, Other Methods

Recall that figure 3.1.2.1, the tree, showed us that the D's and E's are not really clusters. However, one might be misled into declaring 4 stable clusters by simply looking at the clustering at k = 9, when only clusters A, B, and C are linked. The graph of Bk in figure 3.1.2.3 above would tell: it shows unstable clusters at 4 while 9 shows stable clusters. Complete linkage says 4 and 9; single linkage shows unstable clusters for k = 2, 3, 4, 5, 6, and 7. Average linkage has stable clusters at 2, 3, 4, and 5, and of course 9. It is difficult to say which method is more correct.

Figure 3.1.4.1: Plot of Bk versus k. Set Bl, Single Linkage. [Box plot not reproduced.]

Figure 3.1.4.2: Plot of Bk versus k. Set Bl, Average Linkage, 50 Samples. [Box plot not reproduced.]

3.1.5. Data Sets A2 and A3, Complete Linkage

Set Al is an unusually nice set with well separated clusters. By changing the distances but conserving the topology of the tree, set A2 is formed with less distinct clusters.

Table 3.1.5.1: Distances for Data Set A2

        Al   A2   A3   Bl   B2   B3   Cl   C2   C3   C4
  Al    -   .20  .20  .25  .25  .25  .50  .50  .50  .50
  A2         -   .20  .25  .25  .25  .50  .50  .50  .50
  A3              -   .25  .25  .25  .50  .50  .50  .50
  Bl                   -   .15  .15  .50  .50  .50  .50
  B2                        -   .15  .50  .50  .50  .50
  B3                             -   .50  .50  .50  .50
  Cl                                  -   .10  .10  .10
  C2                                       -   .10  .10
  C3                                            -   .10
  C4                                                 -

Figure 3.1.5.2: Tree for Data Set A2. [Dendrogram for AAABBBCCCC: the C's join at .10, the B's at .15, the A's at .20, the A and B clusters join at .25, and all objects join at .50.]

Figure 3.1.5.3: Plot of Bk versus k. Set A2, Complete Linkage. [Box plot not reproduced.]

The stability depends not only on the difference in distances, but also on the variation that sampling will produce. The maximum variance in a binomial random variable occurs when the probability is .50. So the real difference between links at .50 and .55 is less than the difference between those at .20 and .25, even though the arithmetic difference, .05, is the same. This property is something that a data analyst could easily forget while simply looking at a tree, especially when the variation in the data is not well known. As expected, the Bk plot for data set A3, pictured below, with complete linkage has even less stability at 3 clusters.
Figure 3.1.5.4: Tree for Data Set A3. [Dendrogram for AAABBBCCCC, with linking heights at .40, .45, .50, .55, and .80.]

Figure 3.1.5.5: Plot of Bk versus k. Set A3, Complete Linkage. [Box plot not reproduced.]

Because we cannot define a true null distribution, it is difficult to say exactly how one could tell whether a peak is significant or not. It would be easy to make an arbitrary cutoff point at, say, .90 or .85; but this may not be appropriate, since as k increases, the Bk plot must eventually decay. Perhaps this question of significant peaks can only be answered after a few hundred plots of experience.

3.1.6. Data Set B2

Figure 3.1.6.1: Data Set B2. [Dendrogram for the 18 objects AAAABBBBCCCCDDDDEE, with linking heights at .40, .50, .55, and 1.00.]

Data set B2 is a less stable version of Bl. Average and complete linkage find stable clusters at k = 9 while single linkage does not. However, these clusters are not as stable as those in Set Bl, as indicated by the lower values of Bk.

Figure 3.1.6.2: Plot of Bk versus k. Set B2, Complete Linkage. [Box plot not reproduced.]

Figure 3.1.6.3: Plot of Bk versus k. Set B2, Single Linkage. [Box plot not reproduced.]

Notice that average linkage (figure 3.1.6.4) has stable clusters at 4, which is odd because set B2 does not have 4 distinct clusters. It also has stable clusters at k = 9.

Figure 3.1.6.4: Plot of Bk versus k. Set B2, Average Linkage. [Box plot not reproduced.]

So we have shown that the Bk plot is useful in finding the stable and relevant clusters. The problem is that we were taking many samples from a known truth, while in practice we get only one sample and no truth. Now the question is: "Can one find a reliable estimator of the true Bk between the sample and population trees?"

3.2. THE DISTRIBUTION OF BK

Since we do not know the population in practice, we must use subsampling techniques like the bootstrap to simulate sampling from the true distribution. As stated above, the bootstrap uses the sample as an approximation of the population to see the distribution of the statistic.

In order to see if the bootstrap-to-sample Bk's were actually distributed close to the true-to-sample Bk's, 50 samples were drawn from the true distances and 50 bootstrap samples were chosen from each of them. Then the sample Bk could be compared with the bootstrapped Bk's to see if the bootstrap reproduced the original situation well.

To compare the distributions, a standard chi-squared goodness of fit test was used. The null hypothesis is that the distributions are the same, and a low value of the chi-squared statistic, relative to the degrees of freedom, will indicate that the null hypothesis is acceptable. Binning was based on the distribution of the 50 true-to-sample Bk's; the Bk's from the bootstraps were binned with the closest sample Bk.
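A sketch of this comparison follows (ours; the nearest-value binning is our reading of the procedure described above):

```python
# A sketch (ours) of the goodness-of-fit comparison: bootstrap Bk's are
# binned by the nearest of the observed sample Bk values.
import numpy as np
from scipy.stats import chisquare

def chisq_against_sample(sample_bks, boot_bks):
    sample_bks, boot_bks = np.asarray(sample_bks), np.asarray(boot_bks)
    values, counts = np.unique(sample_bks, return_counts=True)
    # Assign each bootstrap Bk to the closest distinct sample Bk.
    idx = np.abs(boot_bks[:, None] - values[None, :]).argmin(axis=1)
    observed = np.bincount(idx, minlength=len(values))
    # Rescale expected counts to the bootstrap total, as chisquare requires.
    expected = counts * observed.sum() / counts.sum()
    return chisquare(observed, expected)

stat, p = chisq_against_sample([0.5, 0.5, 0.7, 0.7, 0.9],
                               [0.5, 0.7, 0.7, 0.7, 0.9])
```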
For complete linkage and bootstraps on each of 6 samples from set Al, the results are in Table 3.2.1.

Table 3.2.1: Results of Chi-squared Tests
Set Al, Complete Linkage, Samples 1 through 6

  k   df   chi-squared, samples 1 through 6                     95th percentile
  4    2    61.879   16.425    6.901   32.516    5.014    5.209      5.991
  5    4    64.729  241.342  195.578  147.015  306.318   71.061      9.488
  6    1     7.219     .357     .357    7.219    2.228    7.219      3.841
  7    3   106.452  106.390   20.121   93.056  122.238    2.400      7.814
  8    1     8.000   18.000    5.120     .750    4.083   21.333      3.841
  9    1      .750     .000     .720   11.520    6.480   33.333      3.841

The results show that the distribution of the sample Bk is not close to that of the true. The cases when k = 2 or 3 could not be tested because the sample distributions were degenerate. Even with different binning schemes, the results were still the same: the distributions are different. This was also true for all three clustering methods. The difference was that the bootstrapped Bk's were nearly always shifted from the sampled ones. Usually, the shift was to the lower side, which is what one would expect.

This difference is made clear by the two plots shown below. The Bk plot in figure 3.2.3 compares the 50 bootstrap samples (based on sample number 33) with sample number 33 itself, while the one in figure 3.2.2 compares the 50 original samples with the population.

Figure 3.2.2: Plot of Bk versus k. Set Bl, Average Linkage, 50 Samples. [Box plot not reproduced.]

For data sets with 18 objects, e.g. set Bl, the situation was the same; so the problem was not simply caused by the small number of objects. Even the least sensitive method, average linkage, had bootstrap Bk's that were very different from the sample.

Figure 3.2.3: Plot of Bk versus k. Set Bl, Average Linkage, 50 Bootstrap Samples. [Box plot not reproduced.]

These results are not particularly good news. It means that the empirical distributions defined by the samples can be very different from the true distribution; in these cases, it was always different. However, all is not lost. Notice that the median of the bootstrap Bk's is equal to the median of the sample Bk's for k = 2, 3, 4, 5, 9, and 10. Even though the distributions of the Bk's are not the same, some statistics like the median may be very close to the original. The purpose of section 3.3 is to examine how close it really is.

3.3. ESTIMATORS FOR BK

It is proposed here that the sample-true Bk can be estimated from the bootstrapped Bk's of one sample, rather than from its full distribution. We looked at 3 natural estimators, the mean, median, and mode, calculated from the 50 bootstraps of one sample. The values are given in Table 3.3.1.

Table 3.3.1: Estimators of Bk. Set Al, Complete Linkage. [Values not legible in the source.]

Most of the time, the median equaled the mode. After examining many trials it appears that the median and mode are either exactly equal to the true Bk, or they are further from it than the mean.

3.3.1. Data Set Al, Complete Linkage

In order to get a better idea of how accurate these estimators are, the bootstrap can be used again to estimate the distribution of these statistics. One sample from the true distribution was taken, and 30 sets of 50 bootstrap samples from the sample distribution were created. For each set, the mean, median, and mode of the 50 bootstraps were calculated and the true Bk was subtracted.
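The estimator itself is simple to sketch (ours, reusing P, sample_distances, bk, and tree_labels from the earlier sketches):

```python
# A sketch (ours) of the section 3.3 estimator: the median of 50
# bootstrap-to-sample Bk's estimates the unobservable sample-to-true Bk.
import numpy as np

k = 3
S = sample_distances(P)  # the one observed sample (helpers as defined above)
true_bk = bk(tree_labels(P, k), tree_labels(S, k))  # unknown in practice

boot_bks = [bk(tree_labels(S, k), tree_labels(sample_distances(S), k))
            for _ in range(50)]

print("true Bk:", true_bk, " median-of-bootstraps estimate:", np.median(boot_bks))
```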
Figure 3.3.1.1: Deviation of Estimator from the True Bk. Data Set Al, Complete Linkage, Sample Number 10, MEAN. [Scatter plot of deviations against k not reproduced.]

Figure 3.3.1.2: Deviation of Estimator from True Bk. Set Al, Complete Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

The median appeared better than the mode and mean as an estimator, if only slightly. So we will present only the median in the remainder of the study.

Figure 3.3.1.3: Deviation of Estimator from True Bk. Set Al, Complete Linkage, Sample 10, MODE. [Plot not reproduced.]

3.3.2. Data Set Al, Other Methods

Here we present the results for the same data set using single and average linkage. The median does slightly worse with single linkage than with complete, while it does slightly better with average linkage.

Figure 3.3.2.1: Deviation of Estimator from True Bk. Set Al, Single Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

All three methods did perfectly for k = 2, 3, 5, and 9. In this context, it does not mean those are stable clusters; it only means that we accurately estimated the true Bk.

Figure 3.3.2.2: Deviation of Estimator from True Bk. Set Al, Average Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

3.3.3. Data Set Bl, All Three Methods

Figure 3.3.3.1: Deviation of Estimator from True Bk. Set Bl, Complete Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

With more objects, the estimation becomes less accurate. However, when the true-sample Bk was high, the bootstrap-sample estimator was usually accurate. Recall that for set Al, the estimators were perfect for k equal to 2 and 3. Here, the estimate of Bk for k = 9 is high also.

Figure 3.3.3.2: Deviation of Estimator from True Bk. Set Bl, Single Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

Figure 3.3.3.3: Deviation of Estimator from True Bk. Set Bl, Average Linkage, Sample 10, MEDIAN. [Plot not reproduced.]

3.3.4. Data Set Al, Other Samples

Most of the results above are based on one sample, the tenth. In order to make sure that sample is not atypical, the same procedure was run on other samples, numbers 11 through 13.

Figure 3.3.4.1: Deviation of Estimator from True Bk. Set Al, Average Linkage, Sample 11, MEDIAN. [Plot not reproduced.]

These plots show that certain samples will give reasonable bootstrap estimators of the true Bk. Different samples will give different results, however. In any situation where the bootstrap is used, it would be advisable to create and examine statistics on many samples from possible models; in that way the reliability of the bootstrap in that particular situation would be known.
Figure 3.3.4.2: Deviation of Estimator from True Bk. Set Al, Average Linkage, Sample 12, MEDIAN. [Plot not reproduced.]

Figure 3.3.4.3: Deviation of Estimator from True Bk. Set Al, Average Linkage, Sample 13, MEDIAN. [Plot not reproduced.]

Complete linkage worked perfectly on sample number 13.

Figure 3.3.4.4: Deviation of Estimator from True Bk. Set Al, Complete Linkage, Sample 13, MEDIAN. [Plot not reproduced.]

3.3.5. Data Set B2, Average Linkage

This last plot shows how the technique works on a set with less distinct clusters. The results are not significantly different from those of the other experiments.

Figure 3.3.5.1: Deviation of Estimator from True Bk. Set B2, Average Linkage, Sample 13, MEDIAN. [Plot not reproduced.]

4. CONCLUSIONS AND SUMMARY

In this study we propose to use the bootstrap and the Bk statistic in combination to perform two tasks:

1) We can find the more stable and important clusters by looking at those clusterings which tend to produce peaks in the Bk plot. This information about the variability of the data set will be very useful when interpreting the clustering tree.

2) We can estimate the sample-to-true Bk by examining the bootstrap-to-sample Bk's. The best results in this study came from using the median of the bootstrapped Bk's and either average or complete linkage.

We have proposed these procedures here and documented the results of a simulation study; the results indicate the usefulness of the procedure. The model used in the experiments was appropriate for the common example described above. However, it is not known whether the variation in other situations is comparable. It could be much more, with worse results, or much less, with better results. The methods presented here show promise but, because of the limited scope of the study, have yet to be tested in real applications.

BIBLIOGRAPHY

Baker, F. B. (1974), "Stability of Two Hierarchical Grouping Techniques, Case I: Sensitivity to Data Errors", JASA 69, pp. 440-465.

Baker, F. B., and L. J. Hubert (1975), "Hierarchical Clustering and the Concept of Power", JASA 70, pp. 31-38.

Efron, B. (1979a), "Computers and the Theory of Statistics: Thinking the Unthinkable", SIAM Review 21, pp. 460-480.

Efron, B. (1979b), "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics 7, pp. 1-26.

Efron, B., and G. Gong (1983), "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation", American Statistician 37 #1, pp. 36-48.

Fowlkes, E. B., and C. L. Mallows (1983), "A Method for Comparing Two Hierarchical Clusterings", JASA (to appear in September).

Hartigan, J. A. (1969), "Using Subsample Values as Typical Values", JASA 64, pp. 1303-1317.

Hartigan, J. A. (1971), "Error Analysis by Replaced Samples", J. Royal Stat. Soc. Series B 33, pp. 98-110.

Hartigan, J. A. (1975), Clustering Algorithms, Wiley.

Hartigan, J. A. (1981), "Consistency of Single Linkage for High-Density Clusters", JASA 76, pp. 388-394.

Hubert, L. (1974), "Approximate Evaluation Techniques for the Single Link and Complete Link Hierarchical Clustering Procedures", JASA 69, pp. 698-704.

Johnson, S. C. (1967), "Hierarchical Clustering Schemes", Psychometrika 32, pp. 241-254.

Ling, R. F. (1973), "A Probability Theory of Cluster Analysis", JASA 68, pp. 159-164.
Clustering Schemes", Theory of Cluster JASA (1971), Clustering Methods". "Objective Criteria for Evaluation of JASA 66, pp 846-850. "A Hybrid Clustering Algorithm for (1982), Wong, M. A. Identifying High-Density Clusters", JASA 77, pp 841-847. 68 "A Kth Nearest Neighbor Wong, M. A., and T. Lane (1983), Clustering Procedure", J... Eo^ai SiAt^ Soc. Z&il&s. E (to appear) 69 ii22l U03 3 "IDSD DDM Mfll lt.1 Date Due Lib-26-67 ?»^^