I have asked three knowledgeable researchers in the area to review the paper; their reviews are enclosed below. Two of the reviewers are generally very happy with the paper; the third, however, has one major concern and a number of more minor ones. As a result, we will conditionally accept your paper for publication in the journal pending the following revisions:

- Reviewers 2 and 3 believe that the paper is longer than it probably needs to be. I encourage you to follow suggestions from each of the reviewers for places where material could be cut.

- Reviewer 2 has supplied detailed comments on suggested changes in wording, notation, and emphasis. Both reviewers 2 and 3 voice concerns about the paper's (implicit) assumption that cluster ensembles will be good because classifier ensembles have been shown to be good. Changing this emphasis will have ramifications for the introduction and conclusions, and is likely to provide a tighter focus for (and shorten) the discussion of some aspects of related work. Reviewer 1 has included additional references that will remain important to include even in the shortened paper.

- Reviewer 3 has more serious concerns. Most importantly, that reviewer notes that the paper provides no link between the ANMI objective function (section 2) and the proposed heuristic algorithms (section 3), and suggests comparisons to an algorithm that tries to directly optimize the ANMI criterion. At a minimum, this gap needs to be explicitly addressed in the resubmitted paper.

When submitting the revised article, please send me a point-by-point response to each of the reviewers' major comments with the resubmission. (I'm afraid I don't know what the options are for doing this via the ICAAP system. If there are none, just send the response via regular e-mail.)

-----------------------------------------------------------------
Review 1
-----------------------------------------------------------------

Review of "Cluster Ensembles -- A Knowledge Reuse Framework for Combining Multiple Partitions" by Strehl and Ghosh

I recommend that the paper be accepted. This is an excellent paper and I have few substantive suggestions for revisions.

The motivation and background discussion is excellent. I would look at Cutting, Karger, and Pedersen (16th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993), Fisher (J. AI Research 1996 and KDD conference 1995), Zhang et al. (KDD Journal, 1997), Bradley and Fayyad (ICML 1998), and Fayyad, Reina, and Bradley (KDD Conference, 1998) for work on clustering clusters where a summary representation (probabilistic, centroid) in terms of the base features is given for each cluster. In general, I like the lack of assumptions about the form of cluster summary representations in this paper, but the paper should acknowledge the possibility of exploiting summary representations for purposes of combining partitions, particularly in an ODC task with possibly non-overlapping data, and cite earlier work than Kargupta et al. 1999. We have added some earlier work than Kargupta. Thanks for providing some additional references.

A minor stylistic bias (p. 3) is that terms like "YOUR preferences/provider/bookmarks" be replaced by "USER/AN// preferences/provider/bookmarks" respectively. Done

(p. 8) "clusterings 1 and 2" --> "clusterings lambda(1) and lambda(2)", and similarly for 3 and 4. Done

My review is short. This is a solid paper that should motivate subsequent work.
-----------------------------------------------------------------
Review 2
-----------------------------------------------------------------

Review Form for Journal of Machine Learning Research
Title: Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions
Author(s): Alexander Strehl, Joydeep Ghosh
Due Date: 05/13/2002

------------------------------------------------------------
APPROPRIATENESS: Is this paper appropriate for this journal?
------------------------------------------------------------
Yes. The paper addresses combining multiple clusterings, which is important from both a theoretical and a practical point of view.

------------------------------------------------------------
GOALS: Are the specific goals of the research stated early and clearly? Is the learning task unambiguously identified?
------------------------------------------------------------
Yes. The authors provide a lengthy and complete introduction section where they not only state precisely what the goal of the paper is, but also provide quite complete references to prior and related work.

------------------------------------------------------------
DESCRIPTION: Are the paper's algorithms correct? Are they described in sufficient detail so that readers could replicate the work? For instance, are the algorithm's inputs and outputs unambiguously described? Does the paper include enough illustrative examples? Are they sufficiently detailed? Are both the learning and performance tasks clearly specified?
------------------------------------------------------------
Yes. The paper provides theoretical justification for using cluster ensembles and then introduces three (plus a fourth, combined) heuristics. In this reviewer's view, any heuristic is by definition correct (provided the syntax is correct and the semantics addresses the problem at hand). The authors provide sufficient detail to implement each of their approaches. In addition, the authors include enough examples, including both simulated and real-life data sets, to demonstrate the utility of their approach.

------------------------------------------------------------
EVALUATION: Are the results of the research evaluated? For instance, has the author systematically run experiments, provided insightful theoretical analysis, shown psychological validity, or otherwise given evidence of the generality of the presented approach? For empirical work: are experimental comparisons made to relevant alternatives? Have the results been shown to be statistically significant? Were reasonable testbeds used?
------------------------------------------------------------
Yes. The evaluation is quite complete (see additional comments at the end for some specific details/suggestions/comments). The theoretical analysis is at the level appropriate for the current paper, i.e., an introductory paper where the authors introduce their methodology and give some theoretical justification. A more formal evaluation of the limitations and applicability of the approach is left for future work (which is appropriate) and is described as such in the closing section of the paper.

------------------------------------------------------------
DISCUSSION: Does the paper discuss the strengths, limitations, and implications of the presented approach? Does it discuss relevant previous research? Are the relative strengths and weaknesses described? Are promising directions for future work outlined?
------------------------------------------------------------
Yes.
Limitations and the ``high-level picture'' are appropriately described throughout the paper.

------------------------------------------------------------
PRESENTATION: Is this paper well organized and well written? Does it use standard terminology? Has the author provided sufficient background? Are an appropriate number of informative figures included? Are experimental results presented clearly, uncluttered with excessive detail? Does the abstract adequately reflect the contents?
------------------------------------------------------------
The organization of the paper is reviewed at the end of this form; please see the comments there. The terminology is standard, and the figures' content and level of detail are appropriate, but the figures need additional work (see comments at the end of the form). The abstract is also appropriate.

------------------------------------------------------------
GENERAL COMMENTS: Does this paper make an original, significant, and technically sound contribution? Does anything need to be added or deleted? Feel free to give additional comments.
------------------------------------------------------------
Yes. The paper makes an original, significant, and technically sound contribution. One concern this reviewer has is about the reference to the previous work by the same authors (proceedings of the AAAI, 2002). The title implies that the work presented there is very similar to the work currently submitted. However, since the reference is to conference proceedings, and further since it is by the same authors, this reviewer is comfortable with considering this a novel and original contribution presented for publication for the first time in a journal. Additional comments are at the end of this form.

Recommendation: o conditional accept after MINOR revisions

There are certain revisions that this reviewer will require prior to acceptance of the paper for publication. Since the revisions do not concern the quality of the results, the number of experiments and examples, the conclusions of the paper, etc., they are considered minor. Note, however, that some of the revisions require the authors to modify some sections significantly.

------------------------------------------------------------
ADDITIONAL COMMENTS:
------------------------------------------------------------
High Level:
------------------------------------------------------------
This is a well-conceived paper with relevant contributions to the field. However, it is the authors' responsibility to proofread the paper, correct basic spelling errors, correct basic style errors, and correct the grammar (e.g., the use of the articles ``the'' and ``a''). This reviewer will point out only some of the more obvious omissions - the rest is up to the authors to correct.

The notation is almost correct, but there are several omissions and inaccuracies. More generally, this reviewer found the notation to be of slightly variable rigor that falls into the paradigm of ``measure with a micrometer, mark with chalk, cut with an axe.'' A typical example is section 4.5, page 26, where the authors first advertise their approach as being applicable to real-life data sets, but then go into minute detail of using the ``ceiling'' operator when defining the cardinality of a partition.

In the introduction, the authors try to justify their approach of using ensembles, but each justification invariably reduces to the success of the related approaches in classification.
In other words, they make an implicit assumption that the introduction of ensembles is a concept that generally improves the performance of algorithms. While this is indeed observed in practice, the paper should serve as another example that this conjecture is correct, rather than using it as a starting point.

Figures in the paper use color. This is not explicitly discouraged by the JMLR author instructions, but most readers will print the paper on a fast laser printer and will therefore have a hard time understanding the figures (e.g., this was the case with this reviewer). It is left to the authors to decide whether to use color in the final version. Some of the figures are too small, with significant overplotting (figures 9 and 10). These figures should either be larger (though that would make them not fit on a single page) and/or the data should be subsampled to show details.

Low Level:
------------------------------------------------------------
Pg 2, par 3: ``We call the combination...<snip> as the cluster <snip>''; grammar: remove ``as''. Done

Pg 2, par 3: ``...number of clusters... depends on the scale...''; it is easy to find a case where this is not so; replace with ``might depend,'' ``often depends,'' or similar. Done

Pg 2, bottom: ``Quality and Robustness.'' This bullet mixes related work with the usefulness of the proposed method. Only the first sentence relates to the topic of the paper, while the rest represents related work. The two do not fit together, since the bullet should address why *clustering* ensembles are useful, not why similar approaches are useful in related tasks, unless a statement is made that ensembles in general improve *any* learning task. In other words, the paragraph on related work neither supports nor establishes the quality or robustness of the proposed approach. Note: there is another mismatch similar to this one later in the paper, hence this reviewer's earlier comment on ``implied generalization of ensembles.''

Pg 3, par 2: ``... no clustering method is perfect, ...''; consider using ``universal'' rather than ``perfect''. A clustering method can be perfect for a specific domain. Done

Pg 3, par 2, bottom: ``... there should be a potential for greater gains...''; too strong: use ``could'' or similar. The higher variability in clustering is not enough to guarantee or suggest improved performance; it does suggest a possible direction of further investigation (which is performed in the paper), though. Done

Pg 5, par 1: Work of Neumann and Norton: the end of paragraph 1 and paragraph 2 do not fit into the general flow of the paper. The detail provided in the paper is well beyond referencing that work, but is not enough to make a meaningful picture of why the work is related. For example, terms like ``lattice,'' ``supremum,'' ``infimum,'' ``consensus interval,'' etc. are not defined at all (e.g., supremum of what?). Suggestion: either remove all the implementation details (e.g., the partial order relation, consensus intervals, etc.) and briefly mention the limitations, or include the details and definitions at a level appropriate for a reader who is not familiar with the reference. This referee would prefer omitting the details, since the purpose of the reference is to give the interested reader information about where all the details can be found. Done. Details omitted.
Pg 6, par 1 and 2: These two paragraphs again try to establish properties of *cluster* ensembles based on the counterpart properties of *classification* ensembles by making an implicit assumption that the ensemble concept generalizes to all learning tasks. These two paragraphs should be rewritten as ``related work,'' rather than as the ``use of cluster ensembles.''

Pg 7, Notation: for a fixed number of clusters k, \lambda *can* be defined as an element of N^n (since it lies in a subset of it), but then equation 1 (pg. 9) becomes incorrect, since it is defined as a mapping N^{n \times r} -> N^n. A mapping must map *all* elements in the domain to a subset of the target space. In fact, our implementation will map all elements in the domain to a subset of the target space. Any element of N^n constitutes a valid clustering (however, not necessarily a unique one).

Pg 7, Notation: ``... delivers a label vector given an ordered set ...''; sets are inherently unordered (mathematically). Use ``tuple'' or a similar term to denote ordered input. Done. Using ``tuple.''

Pg 8, Figure 2: correct the punctuation of x_1, ..., x_7. Consider removing color. Done. Removed color and inserted commas.

Pg 9, par 2: the definition of the single representation that is used to represent the class of k! possible representations is not correct. For example, if the first and last object belong to the same class, then condition (ii) cannot be satisfied, since \lambda_n = \lambda_1 = 1. Similarly, \lambda_i is not defined; the assumption is that it represents the classification of the i-th object in the input tuple (a problem with the earlier notation). Also, unless the correct notation obviously shows uniqueness, provide a brief sketch/proof of uniqueness. This is an example of the inconsistent rigor mentioned earlier. We believe the unique form is correct. In your example, the only constraint concerning the last element of \lambda is for i = n-1. Therefore (ii) reads \lambda_n \leq \max_{j=1,\ldots,n-1} \lambda_j + 1. The RHS is greater than or equal to 2 and the LHS is 1, so (ii) is satisfied. We have added a couple of sentences that should help clarify that.
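To make the normal form concrete: it is exactly the relabeling of clusters in order of first appearance. A minimal sketch (the function and the paraphrase of conditions (i) and (ii) in its comment are ours, not the paper's notation):

def canonical_form(labels):
    """Relabel a clustering so cluster names appear in order of first
    occurrence: the first object gets label 1, and every new cluster
    gets the smallest unused positive integer. This enforces
    (i) lambda_1 = 1 and (ii) lambda_{i+1} <= max_{j<=i} lambda_j + 1."""
    mapping, out = {}, []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1  # next unused label
        out.append(mapping[lab])
    return out

# Example: [3, 3, 1, 2, 3] and [1, 1, 2, 3, 1] denote the same partition
# and both map to the canonical form [1, 1, 2, 3, 1]; two label vectors
# describe the same clustering iff their canonical forms are equal.
assert canonical_form([3, 3, 1, 2, 3]) == [1, 1, 2, 3, 1]

In the reviewer's example, the first and last objects both receive the canonical label 1, and (ii) still holds because the maximum on the right-hand side already includes \lambda_1.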
Pg 9: ``... (k!)^{r-1} possible association patterns.''; it is not clear what ``association patterns'' refers to, and it is not clear whether this number is correct. Unless the authors feel that it is trivial, a sentence on the reasoning behind this number would be appropriate. Done. Dropped association patterns and the equation, since they are not necessary for understanding.

Pg 9, below eq. 1: ``The optimal combined clustering should share...''; if there is a reason why it *should* share the most information, state it explicitly. If you *define* the optimal clustering in this sentence, say something along the lines of: ``We define the optimal clustering to...'' Note that the fact that this is an intuitive argument does not validate the choice as the only one, as implied by using ``should.''

Pg 10, equation 6: First, if you do not refer to this equation later in the text, remove the equation number or include the equation in the text (a suggestion, not a requirement). Second, this equation is far from obvious, especially with the alternating sign in the sum. From the combinatorial point of view, there are k^n possible classifications (each of the n objects can be classified into one of the k classes), leading to a smaller number of clusterings due to the exchangeability of class labels. It is beyond the duty of this referee to try to establish that equation 6 correctly represents this number. A sentence or two on the reasoning behind the equation would really help.
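For what it is worth, the alternating sign is the usual inclusion-exclusion correction: one counts the surjections from n objects onto k labeled clusters and divides by k! to forget the labels, which gives the Stirling number of the second kind. A small sketch that checks this closed form against brute-force enumeration of canonical label vectors (we assume here that equation 6 is this standard formula):

from itertools import product
from math import comb, factorial

def stirling2(n, k):
    """Number of partitions of n objects into exactly k non-empty,
    unlabeled groups: surjections onto k labeled clusters (counted by
    inclusion-exclusion) divided by k! to remove label permutations."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k + 1)) // factorial(k)

def brute_force(n, k):
    """Count canonical label vectors of length n using exactly k labels."""
    return sum(1 for lab in product(range(1, k + 1), repeat=n)
               if len(set(lab)) == k
               and all(lab[i] <= max(lab[:i], default=0) + 1
                       for i in range(n)))

assert all(stirling2(n, k) == brute_force(n, k)
           for n in range(1, 7) for k in range(1, n + 1))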
Pg 11, par 2 of section 3.4: The definition of the matrix H^(q) is not a definition of a matrix, but rather of a vector of length n * k^(q). The cross product in the exponent is not sufficient to make it a matrix, since the cross product of two numbers is yet another number. Consider changing this notation.
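Whatever notation is chosen, the object in question is simply the binary membership indicator with one row per object and one column per cluster; concatenating these blocks column-wise over all r clusterings yields the hypergraph representation, with each column read as one hyperedge. A minimal sketch (the function name is ours):

def indicator_matrix(labels, k):
    """Binary membership matrix H for one clustering: H[i][j] == 1 iff
    object i belongs to cluster j + 1 (an n x k 0/1 matrix)."""
    return [[1 if lab == j + 1 else 0 for j in range(k)] for lab in labels]

# lambda = (1, 1, 2, 3) with k = 3 gives the 4 x 3 matrix
# [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]].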
Pg 12, bottom of par 1 in section 3.3: Why do you look at components of approximately the same size? In reality you might have components of significantly different sizes, and further, it might be that the small components are the ones you are primarily interested in (e.g., novelty/outlier detection).

Pg 13, figure 4: a non-collinear layout might work better. Done

Pg 13/14, description of MCLA, whole section 3.4: this referee is not familiar enough with the specific background to comment on the correctness of the statements/approach presented in that section.

Pg 15, par 2: ``First, a we partition...''; remove ``a''. Done

Pg 16, fig 6: use different symbols for ``random labels'' and ``original labels''. Done

Pg 17, par 1, last sentence: ``This indicates... is indeed appropriate...''; the statement is too strong. The experiment shows that it is plausible to use it, not that it is *the* correct one.

Pg 17, section 4, last sentence: ``... several very limited subset...''; change to ``... several very limited subsets...''. Done

Pg 18, first sentence: specifying the variance to describe Gaussian covariance matrices is not sufficient - state that you are (presumably) using diagonal covariance matrices; as stated, the off-diagonal elements could be anything. Done

Pg 18, footnote 7: Although this reviewer knows what the authors meant, the statement is not correct and needs to be changed. The number of clusters depends on the model used, and it is very easy to construct a model that will have only two clusters in a noisy XOR problem (e.g., the generative model itself). Even if clusters are required to be compact, a simple metric can be invented to make the problem consist of only two compact clusters (e.g., instead of (x1-x2)^2 + (y1-y2)^2 use (|x1|-|x2|)^2 + (|y1|-|y2|)^2)... I understand that a different generative model might make our statement untrue. However, in the canonical geometric sense there are four clusters, which should suffice for the purpose of this example. Regarding your example, I do not follow, since in XOR x1, x2, y1, y2 \in {0,1}, so taking the absolute value does not change anything.

Pg 19, bottom of par 2: ``category prior probability'' should be put into context and defined. None of the previous sections mention any priors. Further, the sentence that the ``prior might be close to 1 while the desired value should be close to 0'' is at minimum ambiguous: a prior represents prior probability and as such is inherently subjective. Why not just set the prior to 0 if that is the desired value? Or is the term ``prior'' misused here? Removed the notion of ``priors'' in favor of category sizes.

Pg 19, par 3: ``Monolithic clusterings are evaluated at 0...''; What is a 0 of a clustering, and how is a clustering evaluated at 0? Did you mean ``scored as 0''? Yes. Changed to ``scored.''

Pg 19, footnote 8: what? Removed the footnote. Clarified in the text, replacing the notion of the ``best clustering'' by the notion of ``categorization.''

Pg 19, paragraph just above section 4.3: ``unbiased'' should be put into context. In statistical terms, an estimator of a quantity X is an unbiased estimator if its expected value equals X. The sentence in question claims that the ``measure is unbiased,'' which is not the standard usage of ``unbiased'' and hence should be explained at a minimum. This is especially true in conjunction with the ``measure being conservative'' (e.g., if it is conservative, it is likely to be biased). Our use of ``unbiased'' is not correct in strict statistical terms. We replaced it with the term ``not biased towards high k.''

Pg 19, last sentence above section 4.3: ``... to evaluate the cluster quality using mutual information...''; please specify how you evaluate this quality using the mutual information (e.g., you refer to the quality later). The term ``quality'' refers to \phi^{ANMI}. We use ``quality'' rather than ``mutual information'' since there might be other ways to define quality using mutual information. Added the symbol \phi^{ANMI} to make this connection clear.

Pg 20, par 2: ``Different sample sizes provide insight how cluster...''; change to ``Different sample sizes provide insight into how cluster...''. Done

Pg 20, par 2: ``... quality maxes out.''; this is jargon and should be replaced. Done

Pg 20, par 2: what is the ``random clustering'' used as a baseline? Is it randomly choosing a clusterer, or randomly assigning class labels? If it is the latter, a comment on why such a simple baseline is appropriate would be in place. It is the latter. It is the simplest baseline.

Pg 20, par 3: ``... the RCC results are better than all individual...''; replace ``all'' by ``each''. A similar comment applies throughout this paragraph. Done

Pg 20, par 3: ``The average RCC quality at 0.85 is ...''; please specify 0.85 of *what* (e.g., 0.85 peanuts, 0.85 parsecs, 0.85 inches^2), unless it refers to the value of the quality, in which case you need to define what the quality is (see previous comments) and replace ``at'' by ``of'' or similar. Done

Pg 20, par 6: ``Figure 8 shows...''; What is an ``average single technique''? Why not just call it a ``single technique,'' since the average of a single number is the number itself? If the ``average'' refers to the average of the 10 techniques, how is it visible in figure 8 that RCC is better than the average? This referee also had a problem with ``looking at the three consensus techniques,'' since it appears they are not plotted in figure 8; the combiner of the three is plotted as RCC instead (?). Done

Pg 23, first row: ``...such scenarios result in distributed...''; it should be the other way around: the scenarios are a consequence of distributed DBs, not the cause. Done

Pg 23, second row: ``...big flat file...'' is jargon. Further, whether it is big, flat, and/or a file does not matter at all for the rest of the sentence - it is the information that cannot be collected in a single centralized location... Done

Pg 23, par 3: Figure 10 is introduced before figure 9, which should be reversed. Either relabel the figures or swap the several sentences describing them. Done. Swapped figures 9 and 10.

Pg 24, figure 9: The plots are too small and crowded, otherwise quite informative. The same holds for figure 10. The main purpose is to demonstrate that the consensus clustering is far superior to the individual clusterings. The figure should serve that purpose even though it cannot be expected that all 1000 points can be distinguished. Enlarging the plots would either distort them or leave them split over several pages, making it harder to see the point. Similarly, if parts of the figure were omitted, the information presented would not be complete.

Pg 25, par 2: ``The supra... chooses either MCLA and CSPA results but the difference...'' should read: ``The supra... chooses either MCLA or CSPA results, but the difference...''; Check the punctuation in this and several following sentences. Done

Pg 25, par 2: There are statements about the performance of individual consensus functions, but there are no numbers/figures to demonstrate this.

Pg 26, par 3: ``clustering is referred to as inner loop clustering''; add ``the'' before ``inner''. Done

Pg 26, below eq. 7: In many journals the use of ``iff'' is discouraged and ``if and only if'' is preferred, the argument being that the former looks too much like a typo. This is a suggestion, and the authors are free to do as they find appropriate. Done

Pg 26: Footnote 11 is not in its proper place. It appears next to the definition of overlap (it should also be on the other side of the period), while the footnote talks about the existence of the overlap. Done. Footnote dropped, since it is not necessary for basic understanding.

Pg 26: parametrized -> parameterized. Done

Pg 26: Define ``l'' (el), the degree of overlap. Changed some notation to simplify; in the course of this, el was dropped.

Pg 26: ``v>1 to fix...''; define ``fix''. This got changed in the process as well.

Pg 26: The use of the ceiling is inappropriate; much like 1 and 1.00 stand for [0, 2] and [0.99, 1.01] respectively, the use of the ceiling defines an implicit error of a single data point (e.g., as stated, the analysis would not be valid if one used the floor instead, nor would it be valid if one had a single extra data point). Since the presented analysis is at best approximate, the ceiling should be omitted altogether and ``='' should be replaced with ``\sim''. Note that even if the specific sampling technique uses exactly this number of data points, the use of the ceiling is still not appropriate, since the method is advertised as being quite a bit more general. Specifically, the beginning of section 4.5 talks about how ODC can occur in real data sets, where it is unlikely that the partitions would have exactly this number of data points. Done. Simplified notation in that section. See above comments.

Pg 26, last line: instead of ``k'' say ``number of clusters''. Done

Pg 26, everything below eq. 7: make clear whether the introduced number of partitions, the number of data points per partition, etc. is a requirement or just an implementation detail of the presented algorithm. Since this is a simulated distributed scenario, all choices are implementation choices. Added ``simulated scenario'' in several places to make this clearer.

Pg 26: ``Every partition should have the same number of objects for load balancing.''; Is this a requirement for the method to work? If not, please rephrase. See the last comment.

Pg 27, par 2, description of figure 11: ``... p ranges from 2 to 72''; There is something wrong with the plots. If the ``x'' represents the data point (1,1), then p starts from 0, since there is at least one circle to the left of the ``x'' (see figure 11). The x was removed, since it was a legacy from times when the x-axis showed speedup. Thanks for pointing that mistake out.

Pg 27, several places: ``...retains around 90% (80%) of the ...''; What does the number in brackets refer to? The number in brackets refers to the experimental variation described in brackets.

Pg 27, par 4: again, the three consensus algorithms are not shown (?)

Pg 27, par 5: ``First, through the reduction, ..., the problem is simplified as speedup increases''; This sentence does not relate to the loss of quality as advertised. Further, several sentences are not clear: if the sampling cannot maintain the balancing, how can it enforce it (thus hurting the quality)?

Pg 27, bottom line: linear -> linearly. Done

Pg 28, first line: This analysis is not correct as stated. There is a mix-up of actual time performance and asymptotic complexity. The linear term *could* be negligible, but not on account of being linear. It is true that the asymptotic time complexity remains O(n^2), but this is not because the linear term is negligible, but because O(n + n^2) = O(n^2). It is easy to conceive of a whole set of methodologies that are asymptotically linear but would completely dominate the quadratic term for anything but infinite samples.
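The reviewer's point is easy to make concrete with invented constants (ours, purely illustrative):

# Illustrative only: a linear term can dominate a quadratic one at all
# practical sample sizes, even though O(c1*n + c2*n^2) = O(n^2).
c1, c2 = 1e6, 1.0  # hypothetical per-object and per-pair costs

for n in (10**3, 10**4, 10**5):
    linear, quadratic = c1 * n, c2 * n * n
    print(n, linear > quadratic)  # True for every n < c1/c2 = 1e6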
Pg 29, par 1: ``Consequently, the sequential speedup can be multiplied by p...''; Note the punctuation. Also, the speedup can be multiplied by PI, 42, 1e200, etc. Please rephrase.

Pg 29: the time complexity analysis mixes up asymptotic complexity and actual running time. The example with p=4 and v=2 does not have a correct derivation. While s=1 might hold, the authors must first establish what the actual running time is, show that it is heavily dominated by the quadratic term for the sample sizes of interest, and only then can the analysis be deemed correct.

Pg 29, section 5: ``We defined ... that enables us ... and allows one ...''; be consistent with ``us'' vs. ``one''.

Pg 29, section 5: ``speed-up''; use consistent spelling. Done: ``speedup'' is used consistently.

Pg 29, section 5: ``... dramatically improve... for a large variety of domains''; the sentence is too strong, since 4 domains do not represent a large variety. Something along the lines of ``for several quite different domains'' is more appropriate. Done

Pg 29, section 5, par 2: This referee found the future work to be described in more detail than necessary, but it appears that it is the journal policy to enforce such a section. Unchanged

Pg 30, appendix: H(X) is defined as entropy, but is then used in the context of the cross entropy H(X,Y). Define H(X,Y) prior to first use. Done

Pg 30, below eq. 12: The upper bound is not tight, and even if it were, that would not follow from min(Hx,Hy) <= (Hx+Hy)/2. For example, by the argument of that sentence, since min(Hx,Hy) <= 100*(Hx+Hy)/2, the quantity 100*(Hx+Hy)/2 would represent a tight bound as well. Removed the word ``tight.'' It was meant to express that the upper bound is actually attained in particular cases, i.e., min(Hx,Hy) = (Hx+Hy)/2 can hold, while it is safe to say that min(Hx,Hy) < 100*(Hx+Hy)/2.
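A minimal sketch of the attainment case, using the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y) (the helper names are ours):

from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from empirical joint counts."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# For two identical labelings, I(X;Y) = H(X) = H(Y), so the upper
# bound (H(X) + H(Y)) / 2 is attained exactly.
x = [1, 1, 2, 2, 3, 3]
assert abs(mutual_information(x, x) - (entropy(x) + entropy(x)) / 2) < 1e-12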
This concludes the most important points. Note that the spelling/grammar/structure of sentences must be checked by the authors prior to final submission.

-----------------------------------------------------------------
Review 3
-----------------------------------------------------------------

PAPER: Strehl & Ghosh: Cluster Ensembles

Summary: The paper proceeds in four steps: (i) it poses a problem that is claimed to be highly relevant, namely how to combine multiple data partitionings without accessing the original pattern representation, (ii) it proposes an objective function to measure how well a combined partitioning agrees with a set of given partitionings, namely a criterion based on mutual information called ANMI, (iii) it suggests three different heuristics for combining partitionings, and (iv) it investigates several concrete settings (and data sets), motivated by ``practical'' problems.

Assessment & Comments: Obviously, the paper tries to cover a lot of ground. The authors bring forward a number of ideas and suggestions, each of which certainly has some merit. Yet overall the paper is lacking a contribution of real significance, and each of the above steps is carried out somewhat superficially, dealing with a number of details in a rather ad hoc manner. Moreover, the paper lacks overall coherence. In particular, there seems to be no obvious connection between the proposed objective function and the proposed heuristic algorithms, a fact that I consider to be the most crucial flaw of the paper.

Looking in more detail at the above steps:

(i): The introduction spends pages discussing work on combining classifiers. As it turns out, very little of this is actually relevant for the proposed method. Is it supposed to just provide some motivation? If so, it seems to be based primarily on analogy, since it is not clear that what works for classification should also work for data clustering (and even for the same reasons). On the other hand, the introduction has failed to fully convince me of a real need for ``cluster ensemble'' techniques. Phrases like ``collective data mining'' and ``knowledge reuse'' sound great, but the references to real applications remain somewhat fuzzy. Maybe there are just too many applications? It would still be good to discuss some relevant applications in greater detail. In addition, it appears to me that exchanging some sort of ``approximate sufficient statistics'' seems possible in most cases, for example, by exchanging cluster centroids in the case of distributed clustering. My impression is that the whole setting might be overly prohibitive as far as practical problems are concerned. This would be OK for a ``theory'' paper, where one would actually pose a problem in an overly ``radical'' fashion in order to prove interesting theorems, but this is not what the submitted manuscript is aiming at.

(ii): The objective function makes a lot of sense to me. The notation could be improved, though. For example, n^(h)_l is actually symmetric in h and l, but the indexing convention makes it very difficult to grasp that. There must be a way to remove some of the clutter from Eqs. (2-4)! Why is the trivial (but pretty much futile) way of establishing a normal form in 2.1 being discussed? It does not seem that this is used later on.

(iii): As said earlier, it is not clear how the heuristics proposed in Section 3 relate to the objective function established in Section 2. Do you just hope by pure luck that one of these will produce an approximate solution to the average mutual information maximization problem? There should be a way to come up with an approximation scheme that at least finds a local maximum. The reader gets the impression that the authors apply the same scheme of combining heterogeneous algorithms at the ``meta'' level again. The principle seems to be: ``If there is no algorithm that solves the problem optimally, take a bunch of algorithms and see which one works best on a case-by-case basis.'' While this may not be an unreasonable approach for engineers who are solving real-world problems, it is not clear to me on a more intellectual level what lesson can be learned here.

Additional comment(s): I am not convinced that the hypergraph representation is very useful. The only algorithm that really makes use of this representation is HGPA and, as the authors point out themselves, this algorithm seems to be inadequate in the way cluster ``violations'' contribute to the cost function. The representation is used by all three algorithms.
Calling the representation a hypergraph representation might mislead the reader towards the conception that it is used only for HGPA, which would be unfortunate. However, we believe it is the most descriptive name for the representation.

(iv): The authors have included a number of experiments on different data sets, artificial and real. As a result, a number of interesting insights are gained, and overall it seems that the proposed cluster ensemble design methods ``work.'' Although I think none of the applications qualifies as a real ``killer'' application for the proposed framework, the authors have put some effort into the empirical evaluation and are able to claim a success, for example, on the robust clustering problem. Yet I would be curious to see how an algorithm would perform that optimizes some combination of objective functions to start with. It is not fully convincing to sell a limitation (no access to the underlying pattern representations) as a feature! Why should one be worse off by using more information? This and other questions reinforce the impression that there might be yet other techniques that would do even better in these applications.

Presentation: The text spends too many pages on issues that are inessential to the main topic of the paper. Basically all sections of the paper can be shortened and condensed; thereby, the paper would gain in focus. Appendix A is trivial; I would suggest dropping it altogether. Also, I personally think that the main idea needs to be presented earlier in the paper. In my opinion, the paper also has a tendency to be overly formal where it should focus on communicating the main ideas. I would revise a couple of statements/sentences:

+ ``It is well known that no clustering method is perfect.'' (Deeper meaning unclear) Done

+ Agglomerative clustering methods do not always minimize an objective.

+ What does ``worse than a random partitioning'' exactly mean? It is possible to have less mutual information than a random labeling. However, since this is not fully explained at the place where the statement is made, the wording has been changed to ``not better than'' random.

+ ``statistical information shared between two distributions'' (inaccurate; you need a joint distribution to start with, etc.)

+ the magic number 1.05

+ ``unbiased'' is a technical term from statistical inference; you seem to use it in a much more colloquial sense. Done

Recommendation: Overall I think the paper has some merit, but I think it would be more suitable to publish the paper in a journal that puts more emphasis on ``data mining'' and ``knowledge engineering.'' From a machine learning point of view, I think the contributions made in the paper are only of limited significance. Alternatively, I would probably feel more comfortable supporting the paper if the authors could close the gap between Sections 2 & 3 and come up with an algorithm that would directly optimize the ANMI criterion.

We added a greedy optimization scheme for ANMI in a new section and compare it to the heuristics in the controlled experiment. However, it is computationally very slow; in fact, we could compute results for the comparison section but not for the real experiments. The heuristics tend to come up with the same results as the direct optimization, several orders of magnitude faster.
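For readers curious what such a greedy scheme can look like, here is a minimal sketch building on the entropy and mutual_information helpers above. The single-label local search shown is our illustration of the idea, not necessarily the exact algorithm added to the paper:

def nmi(x, y):
    """Normalized mutual information, I(X;Y) / sqrt(H(X) * H(Y))."""
    hx, hy = entropy(x), entropy(y)
    return mutual_information(x, y) / (hx * hy) ** 0.5 if hx and hy else 0.0

def anmi(labels, ensemble):
    """Average NMI between a candidate labeling and the r given ones."""
    return sum(nmi(labels, lam) for lam in ensemble) / len(ensemble)

def greedy_anmi(ensemble, k, init):
    """Greedy local search: repeatedly move a single object to the label
    that most increases ANMI; stop in a local maximum. Each full sweep
    costs n * k ANMI evaluations, which is why such a direct scheme is
    slow compared to the consensus heuristics."""
    labels = list(init)
    best = anmi(labels, ensemble)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            for c in range(1, k + 1):
                old = labels[i]
                if c == old:
                    continue
                labels[i] = c
                score = anmi(labels, ensemble)
                if score > best:
                    best, improved = score, True
                else:
                    labels[i] = old  # undo the move
    return labels, best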