jmlrrebuttal

I have asked three knowledgeable researchers in the area to review the
paper; their reviews are enclosed below. Two of the reviewers are
generally very happy with the paper; the third, however, has one major
concern and a number of more minor ones. As a result, we will
conditionally accept your paper for publication in the journal pending
the following revisions:
- Reviewers 2 and 3 believe that the paper is longer than it probably
needs to be. I encourage you to follow suggestions from each of the
reviewers for places where material could be cut.
- Reviewer 2 has supplied detailed comments on suggested changes in
wording, notation, emphasis. Both reviewers 2 and 3 voice concerns
about the paper's (implicit) assumption that cluster ensembles will
be good because classifier ensembles have been shown to be good.
Changing this emphasis will have ramifications for the introduction
and conclusions, and is likely to provide a tighter focus for (and
shorten) the discussion of some aspects of related work. Reviewer 1
has included additional references that will remain important to
include even in the shortened paper.
- Reviewer 3 has more serious concerns. Most importantly, that
reviewer notes that the paper provides no link between the ANMI
objective function (section 2) and the proposed heuristic algorithms
(section 3), and suggests comparisons to an algorithm that tries to
directly optimize the ANMI criterion. At a minimum, this gap needs
to be explicitly addressed in the resubmitted paper.
When submitting the revised article please send to me a point-by-point
response to each of the reviewers' major comments with the
resubmission. (I'm afraid I don't know what the options are for doing
this via the ICAAP system. If there are none, just send the response
via regular e-mail.)
-----------------------------------------------------------------
Review 1
-----------------------------------------------------------------
Review of "Cluster Ensembles -- A Knowledge Reuse Framework for
Combining Multiple Partitions"
by Strehl and Ghosh
I recommend that the paper be accepted. This is an excellent paper and
I have few substantive suggestions for revisions.
The motivation and background discussion is excellent. I would look at
Cutting, Karger, and Pedersen (16th International ACM SIGIR Conference
on Research and Development in Information Retrieval, 1993),
Fisher (J. AI Research 1996 and KDD conference 1995),
Zhang et al (KDD Journal, 1997), Bradley and Fayyad (ICML 1998),
and Fayyad, Reina, and Bradley (KDD Conference, 1998)
for work on clustering clusters where a summary
representation (probabilistic, centroid)
in terms of the base features is given for each cluster. In general, I
like the lack of assumptions about the form of cluster summary representations
in this paper, but the paper should acknowledge the possibility
of exploiting summary representations for purposes of combining
partitions, particularly in an ODC task with possibly non-overlapping
data and cite earlier work than Kargupta et al 1999.
We have added work that predates Kargupta et al. Thank you for providing the
additional references.
A minor stylistic bias (p. 3) is that terms like
"YOUR preferences/provider/bookmarks" be replaced by
"USER/AN// preferences/provider/bookmarks" respectively.
Done
(p. 8) "clusterings 1 and 2" --> "clusterings lambda(1) and lambda(2)"
and similarly for 3 and 4.
Done
My review is short. This is a solid paper that should motivate subsequent
work.
-----------------------------------------------------------------
Review 2
-----------------------------------------------------------------
Review Form
for
Journal of Machine Learning Research
Title: Cluster Ensembles - A Knowledge Reuse Framework for Combining
Multiple Partitions
Author(s): Alexander Strehl, Joydeep Ghosh
Due Date: 05/13/2002
------------------------------------------------------------
APPROPRIATENESS: Is this paper appropriate for this journal?
------------------------------------------------------------
Yes. The paper addresses combining multiple clusterings, which is
important from both theoretical and practical points of view.
------------------------------------------------------------
GOALS:
Are the specific goals of the research
stated early and clearly? Is the learning
task unambiguously identified?
------------------------------------------------------------
Yes. The authors provide a lengthy and complete introduction section where
they not only state precisely what the goal of the paper is, but
provide quite complete references to prior and related work.
------------------------------------------------------------
DESCRIPTION:
Are the paper's algorithms correct? Are
they described in sufficient detail so that
readers could replicate the work? For
instance, are the algorithm's inputs and
outputs unambiguously described? Does the
paper include enough illustrative examples?
Are they sufficiently detailed? Are both
the learning and performance tasks clearly
specified?
------------------------------------------------------------
Yes. The paper provides theoretical justification for using cluster
ensembles and then introduces three (plus a fourth, combined)
heuristics. In this reviewer's view, any heuristic is by definition
correct (provided syntax is correct and semantics addresses the
problem at hand). Authors provide sufficient detail to implement each
of their approaches. In addition, authors include enough examples,
including both simulated and real-life data sets, to demonstrate
utility of their approach.
------------------------------------------------------------
EVALUATION:
Are the results of the research evaluated?
For instance, has the author systematically
run experiments, provided insightful
theoretical analysis, shown psychological
validity, or otherwise given evidence of
generality of the presented approach?
For empirical work: are experimental
comparisons made to relevant alternatives?
Have the results been shown to be
statistically significant? Were reasonable
testbeds used?
------------------------------------------------------------
Yes. The evaluation is quite complete (see additional comments at the
end for some specific details/suggestions/comments). The theoretical
analysis is at the level appropriate for the current paper (i.e., the
introductory paper), where the authors introduce their methodology and give
some theoretical justification. The more formal evaluation of
limitations and applicability of the approach is left for future
work (which is appropriate), and is described as such in the closing
section of the paper.
------------------------------------------------------------
DISCUSSION:
Does the paper discuss the strengths,
limitations, and implications of the
presented approach? Does it discuss relevant
previous research? Are the relative strengths
and weaknesses described? Are promising
directions for future work outlined?
------------------------------------------------------------
Yes. Limitations and the ``high-level picture" are appropriately
described throughout the paper.
------------------------------------------------------------
PRESENTATION:
Is this paper well organized and well
written? Does it use standard terminology?
Has the author provided sufficient background?
Are an appropriate number of informative
figures included? Are experimental results
presented clearly, uncluttered with
excessive detail? Does the abstract
adequately reflect the contents?
------------------------------------------------------------
The organization of the paper is reviewed at the end of this form;
please see comments there. The terminology is standard, the figures'
content and level of detail are appropriate, but they need additional
work (see comments at the end of the form). The abstract is also
appropriate.
------------------------------------------------------------
GENERAL COMMENTS: Does this paper make an original, significant,
and technically sound contribution?
Does anything need to be added or deleted?
Feel free to give additional comments.
-------------------------------------------------------------
Yes. The paper makes an original, significant and technically sound
contribution. One concern this reviewer has is about the reference to
the previous work by the same authors (proceedings of the AAAI,
2002). The title implies that the work presented there is very similar
to the work currently submitted. However, since the reference is to
the conference proceedings, and further since it is by the same
authors, this reviewer is comfortable with considering this a novel
and original contribution presented for publication for the first time
in a journal. Additional comments are at the end of this form.
Recommendation:
o conditional accept after MINOR revisions
There are certain revisions that this reviewer will require prior to
acceptance of the paper for publication. Since revisions are not
concerned with the quality of the results, number of experiments and
examples, conclusions of the paper, etc., they are considered
minor. Note, however, that some of the revisions require authors to
modify some sections significantly.
------------------------------------------------------------
ADDITIONAL COMMENTS:
------------------------------------------------------------
High Level:
------------------------------------------------------------
This is a well-conceived paper with relevant contributions in the
field. However, it is the authors' responsibility to proofread the paper,
correct basic spelling errors, correct basic style errors and correct
the grammar (e.g., use of articles ``the" and ``a"). This reviewer
will point out some of the more obvious omissions only - the rest is
up to authors to correct.
The notation is almost correct, but there are several omissions and
inaccuracies. More generally, this reviewer found the notation to be
of slightly variable rigor that falls into paradigm of ``measure with
micrometer, mark with chalk, cut with an axe." A typical example is
section 4.5, page 26, where authors first advertise their approach as
being applicable to real-life data sets, but then go into minute
detail of using the ``ceiling" operator when defining cardinality of a
partition.
In the introduction, the authors try to justify their approach of
using ensembles, but each justification invariably reduces to the
success of the related approaches in classification. In other words,
they make an implicit assumption that the introduction of ensembles is
a concept that generally improves performance of algorithms. While
this is indeed observed in practice, the paper should serve as another
example that this conjecture is correct, rather than using it as a
starting point.
Figures in the paper use color. This is not explicitly discouraged by
the JMLR author instructions, but most of the readers will print the
paper out on a fast laser printer and will therefore have a hard time
understanding figures (e.g., this was the case with this reviewer). It
is left to authors to decide on whether to use color or not in the
final version.
Some of the figures are too small with significant overplotting
(figures 9 and 10). These figures should be either larger (though it
would make them not fit on a single page), and/or the data should be
subsampled to show details.
Low Level:
------------------------------------------------------------
Pg 2, par 3: ``We call the combination...<snip> as the cluster
<snip>"; grammar: remove ``as".
Done
Pg 2, par 3: ``...number of clusters... depends on the scale..."; it
is easy to find a case where this is not the case; replace with
``might depend," ``often depends" or similar.
Done
Pg 2, bottom: ``Quality and Robustness." This bullet mixes related
work with the usefulness of the proposed method. Only the first
sentence relates to the topic of the paper, while the rest represents
the related work. The two do not fit together, since the bullet should
address why *clustering* ensembles are useful, not why similar
approaches are useful in related tasks, unless a statement is made
that the ensembles in general improve *any* learning task. In other
words, the paragraph on related work neither supports nor
establishes quality or robustness of the proposed approach. Note: there
is another mismatch similar to this one later in the paper, hence this
reviewer's earlier comment on ``implied generalization of ensembles."
Pg 3, par 2: ``... no clustering method is perfect, ..."; consider
using ``universal" rather than ``perfect". A clustering method can be
perfect for a specific domain.
Done
Pg 3, par 2, bottom: ``... there should be a potential for greater
gains..."; too strong: use ``could" or similar. The higher variability
in clustering is not enough to guarantee or suggest the improved
performance; it does suggest a possible direction of further
investigation (which is performed in the paper), though.
Done
Pg 5, par 1: Work of Neumann and Norton: end of paragraph 1 and
paragraph 2 does not fit into the general flow of the paper. The
detail provided in the paper is well beyond referencing that work, but
is not enough to make a meaningful picture of why the work is
related. For example, terms like ``lattice", ``supremum", ``infimum",
``consensus interval", etc. are not defined at all (e.g., supremum of
what?). Suggestion: either remove all the implementation details
(e.g., partial order relation, consensus intervals, etc.) and briefly
mention the limitations, or, include the details and definitions at
the level appropriate for a reader who is not familiar with the
reference. This referee would prefer omitting the details, since the
purpose of the reference is to give the interested reader information
about where all the details can be found.
Done. Details omitted.
Pg 6, par 1 and 2: These two paragraphs again try to establish
properties of the *cluster* ensembles based on the counterpart
properties of the *classification* ensembles by making an implicit
assumption that the ensemble concept generalizes to all learning
tasks. These two paragraphs should be rewritten as a ``related work",
rather than the ``use of cluster ensembles".
Pg 7, Notation: for a fixed number of clusters k, \lambda *can* be
defined as an element of N^n (since it lies in a subset of it), but
then equation 1 (pg, 9) becomes incorrect since it is defined as a
mapping N^{n\times r} -> N^n. A mapping must map *all* elements in the
domain to a subset of the target space.
In fact, our implementation maps every element of the domain to a subset of
the target space. Any element of N^n constitutes a valid clustering
(though not necessarily a unique one).
Pg 7, Notation: ``... delivers a label vector given an ordered set
..."; sets are inherently unordered (mathematically). Use
``tuple" or similar term to denote ordered input.
Done. Using ‘tuple’.
Pg 8, Figure 2: correct punctuation of x_1, ..., x_7. Consider
removing color.
Done. Removed color and inserted commas.
Pg 9, par 2: the definition of the single representation that is used
to represent the class of k! possible representations is not
correct. For example, if the first and last object belong to the same
class, then condition (ii) cannot be satisfied since \lambda_n =
\lambda_1 = 1. Similarly, \lambda_i is not defined; assumption is that
it represents the classification of the i-th object in the input tuple
(problem with earlier notation). Also, unless the correct notation
obviously shows uniqueness, provide a brief sketch/proof of
uniqueness. This is an example of inconsistent rigor mentioned
earlier.
We believe the unique form is correct. In your example, the only constraint
concerning the last element of \lambda is the one for i = n-1. Therefore (ii) reads
\lambda_n \leq max_{j = 1…n-1} \lambda_j + 1. The RHS is greater than or equal to 2
and the LHS is 1, so (ii) is satisfied.
We have added a couple of sentences that should help clarify this.
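The constraint can also be checked mechanically. The following is a minimal sketch, not the paper's implementation. It assumes condition (i) is \lambda_1 = 1 and condition (ii) is \lambda_{i+1} \leq max_{j = 1…i} \lambda_j + 1 for i = 1…n-1. The function names to_canonical and is_canonical are hypothetical.

```python
def to_canonical(labels):
    """Relabel clusters by order of first appearance: 1, 2, 3, ..."""
    mapping = {}
    out = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1
        out.append(mapping[lab])
    return out


def is_canonical(labels):
    """Check (i) labels[0] == 1 and (ii) each label <= running max + 1."""
    if not labels:
        return True
    if labels[0] != 1:
        return False
    running_max = labels[0]
    for lab in labels[1:]:
        if lab < 1 or lab > running_max + 1:
            return False
        running_max = max(running_max, lab)
    return True


# First and last object share a cluster, as in the reviewer's example.
print(is_canonical([1, 2, 2, 3, 1]))
print(to_canonical([3, 3, 1, 2, 3]))
```

In the reviewer's example the last label is 1 and the running maximum before it is at least 1, so the check in (ii) passes.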
Pg 9: ``... (k!)^{r-1} possible association patterns."; it is not
clear what ``association patterns" refers to and it is not clear
whether this number is correct. Unless authors feel that it is
trivial, a sentence on reasoning behind this number would be
appropriate.
Done. Dropped association patterns & equation since they are not necessary for
understanding.
Pg 9, below eq. 1: ``The optimal combined clustering should share...";
if there is a reason why it *should* share the most information, state
it explicitly. If you *define* the optimal clustering in this
sentence, say something along the lines of: ``We define the optimal
clustering to..." Note that the fact it is an intuitive argument does
not validate the choice as the only one, as implied by using
``should".
Pg 10, equation 6: First, if you do not refer to this equation later
in the text, remove the equation number or include it in the text
(suggestion, not a requirement). Second, this equation is far from
obvious, especially with the alternating sign in the sum. From the
combinatorial point of view, there are k^n possible classifications
(e.g., each object can be classified into one of the k classes),
leading to a smaller number of clusterings due to the exchangeability
of class labels. It is beyond the duty of this referee to try to establish
that equation 6 correctly represents this number. A sentence or two on
the reasoning behind the equation would really help.
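One possible reading of such a formula is sketched below, assuming equation 6 is the standard inclusion-exclusion expression for the number of ways to split n objects into exactly k non-empty groups (a Stirling number of the second kind). Under that assumption, the alternating sign removes labelings that leave some label unused, and the division by k! accounts for the exchangeability of labels. The function names are hypothetical, and the brute-force enumeration is only a cross-check for small n.

```python
from itertools import product
from math import comb, factorial


def count_clusterings(n, k):
    """Inclusion-exclusion count of partitions of n objects into k non-empty groups."""
    onto = sum((-1) ** (k - l) * comb(k, l) * l ** n for l in range(1, k + 1))
    return onto // factorial(k)  # divide out the k! relabelings


def brute_force(n, k):
    """Enumerate all k**n labelings, keep those using all k labels, merge relabelings."""
    seen = set()
    for labels in product(range(k), repeat=n):
        if len(set(labels)) != k:
            continue
        first_seen = {}
        canon = tuple(first_seen.setdefault(lab, len(first_seen)) for lab in labels)
        seen.add(canon)
    return len(seen)


print(count_clusterings(4, 2), brute_force(4, 2))
```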
Pg 11, par 2 of section 3.4: Definition of matrix H^q is not a
definition of matrix, but rather a vector of length n*k(q). The cross
product in the exponent is not sufficient to make it a matrix since
the cross product of two numbers is yet another number. Consider
changing this notation.
Pg 12, bottom of par 1 in section 3.3: Why do you look at components
of approximately the same size? In reality you might have components
of significantly different sizes, and further, it might be that the
small components are the ones you are primarily interested in (e.g.,
novelty/outlier detection).
Pg 13, figure 4: a non-collinear layout might work better.
Done
Pg 13/14, description of MCLA, whole section 3.4: this referee is not
familiar enough with the specific background to comment on the
correctness of the statements/approach presented in that section.
Pg 15, par 2: ``First, a we partition..."; remove ``a".
Done
Pg 16, fig 6: use different symbols for ``random labels" and
``original labels".
Done
Pg 17, par 1, last sentence: ``This indicates... is indeed
appropriate..."; the statement is too strong. The experiment shows
that it is plausible to use it, not that it is *the* correct one.
Pg 17, section 4, last sentence: ``... several very limited subset..."
change to ``... several very limited subsets...".
Done
Pg 18, first sentence: specifying variance to describe Gaussian
covariance matrices is not sufficient - state that you are
(presumably) using diagonal covariance matrices; as stated, off
diagonal elements could be anything.
Done
Pg 18, footnote 7: Although this reviewer knows what the authors
meant, the statement is not correct and needs to be changed. The
number of clusters depends on the model used, and it is very easy to
construct a model that will have only two clusters in a noisy XOR
problem (e.g., the generative model itself). Even if clusters are
required to be compact, a simple metric can be invented to make the
problem consist of only two compact clusters (e.g., instead of
(x1-x2)^2 + (y1-y2)^2 use (|x1|-|x2|)^2 + (|y1|-|y2|)^2)...
I understand that a different generative model might make our statement untrue.
However, in the canonical geometric sense there are four clusters, which should
suffice for the purpose of this example. Regarding your example, I do not follow,
since in XOR x1,x2,y1,y2 \in {0,1}, so taking the absolute value does not change
anything.
Pg 19, bottom of par 2: ``category prior probability" should be put into
context and defined. None of the previous sections mention any
priors. Further, the sentence that the ``prior might be close to 1
while the desired value should be close to 0" is at minimum ambiguous:
prior represents prior probability and as such is inherently
subjective. Why not just set the prior to 0 if that is the desired
value? Or is the term ``prior" misused here?
Removed the notion of ‘priors’ in favor of category sizes.
Pg 19, par 3: ``Monolithic clusterings are evaluated at 0..." What is
a 0 of clustering, and how is a clustering evaluated at 0? Did you
mean ``scored as 0"?
Yes. Changed to ‘scored’.
Pg 19, footnote 8: what?
Removed footnote. Clarified in the text replacing the notion of the ‘best
clustering’ by the notion of ‘categorization’.
Pg 19, paragraph just above section 4.3: ``unbiased" should be put
into context. In statistical terms, an estimator of quantity X is an
unbiased estimator if its expected value equals X. The sentence in
question claims that the ``measure is unbiased" which is not the
standard usage of ``unbiased", hence should be explained at the
minimum. This is especially true in conjunction with the ``measure
being conservative" (e.g., if it is conservative, it is likely to be
biased).
Our use of unbiased is not correct in strict statistical terms. We replaced it
with the phrase ‘not biased towards high k’.
Pg 19, last sentence above section 4.3: ``... to evaluate the cluster
quality using mutual information..."; please specify how you evaluate
this quality using the mutual information (e.g., you refer to the
quality later).
The term ‘quality’ refers to \phi^ANMI. We use ‘quality’ rather than ‘mutual
information’ since there might be other ways to define quality using mutual
information. Added the symbol \phi^ANMI to make this connection clear.
Pg 20, par 2: ``Different sample sizes provide insight how
cluster..."; change to ``Different sample sizes provide insight into
how cluster...".
Done
Pg 20, par 2: ``... quality maxes out."; this is jargon and should be
replaced.
Done
Pg 20, par 2: what is ``random clustering" used as a baseline? Is it
randomly choosing a clusterer, or randomly assigning class-labels? If
it is the latter, a comment on why such a simple baseline is
appropriate would be in place.
It is the latter. It is the simplest baseline.
Pg 20, par 3: ``... the RCC results are better than all
individual..."; replace ``all" by ``each". Similar comment applies
throughout this paragraph.
Done
Pg 20, par 3: ``The average RCC quality at 0.85 is ..."; please
specify at *what* 0.85 (e.g., at 0.85 peanuts, 0.85 parsecs, 0.85
inches^2), unless it refers to the value of quality, in which case you
need to define what the quality is (see previous comments) and replace
``at" by ``of" or similar.
Done
Pg 20, par 6: ``Figure 8 shows..."; What is an ``average single
technique"? Why not just call it a ``single technique" since the
average of a single number is the number itself? If the ``average"
refers to the average of the 10 techniques, how is it visible in
figure 8 that RCC is better than the average? This referee also had a
problem in ``looking at the three consensus techniques" since it
appears they are not plotted in figure 8; the combiner of the three is
plotted as RCC instead (?).
Done
Pg 23, first row: ``...such scenarios result in distributed...";
should be the other way around: scenarios are a consequence of
distributed DB's, not the cause.
Done
Pg 23, second row: ``...big flat file..." is jargon. Further, whether
it is big, flat and/or file does not matter at all for the rest of the
sentence - it is the information that cannot be collected in a single
centralized location...
Done
Pg 23, par 3: Figure 10 is introduced before figure 9 which should be
reversed. Either relabel figures or swap the several sentences
describing them.
Done. Swapped figures 9 and 10.
Pg 24, figure 9: Plots are too small and crowded, otherwise quite
informative. Same for figure 10.
The main purpose is to demonstrate that the consensus clustering is far superior
to the individual clusterings. The figure should serve that purpose even
though it cannot be expected that all 1000 points can be distinguished.
Enlarging the plots would either distort them or leave them split over several
pages, making it harder to see the points. Similarly, if parts of the figure were
omitted, the information presented would not be complete.
Pg 25, par 2: ``The supra... chooses either MCLA and CSPA results but
the difference..." should read: ``The supra... chooses either MCLA or
CSPA results, but the difference..."; Check punctuation in this, and
several following sentences.
Done
Pg 25, par 2: There are statements about performance of individual
consensus functions but there are no numbers/figures to demonstrate
this.
Pg 26, par 3: ``clustering is referred to as inner loop clustering";
add ``the" before ``inner".
Done
Pg 26, below eq. 7: In many journals the use of ``iff" is discouraged
and ``if and only if" is preferred, argument being that the former
looks too much like a typo. This is a suggestion, but the authors are
free to do it as they find appropriate.
Done
Pg 26: Footnote 11 is not in proper place. It appears next to the
definition of overlap (it should also be on the other side of the
dot), while the footnote talks about existence of the overlap.
Done. Footnote dropped since not necessary for basic understanding.
Pg 26: parametrized -> parameterized
Done
Pg 26: Define ``l" (el), the degree of overlap.
Changed some notation to simplify. In the process, el was dropped.
Pg 26: ``v>1 to fix..."; define ``fix".
This changed in the process as well.
Pg 26: Use of ceiling is inappropriate; much like 1 and 1.00 stand for
[0, 2] and [0.99, 1.01] respectively, use of ceiling defines an
implicit error of a single data point (e.g., as stated, the analysis
would not be valid if one used the floor instead, nor would it be
valid if one had a single data point extra). Since the presented
analysis is at best approximate, the ceiling should be omitted
altogether and ``='' should be replaced with ``\sim". Note that even
if the specific sampling technique uses exactly this number of data
points, the use of ceiling is still not appropriate since the method
is advertised as being quite a bit more general. Specifically, the
beginning of section 4.5 talks about how ODC can occur in real data
sets where it is unlikely that the partitions would have exactly this
number of data points.
Done. Simplified notation in that section. See comments above.
Pg 26, last line: instead of ``k" say ``number of clusters".
Done
Pg 26, everything below eq. 7: make clear whether the introduced
number of partitions, the number of data points per partition, etc. is
a requirement, or just an implementation detail of presented
algorithm.
Since this is a simulated distributed scenario, all choices are implementation
choices. Added ‘simulated scenario’ on several occasions to make this clearer.
Pg 26: ``Every partition should have the same number of objects for
load balancing."; Is this a requirement for the method to work? If
not, please rephrase.
See last comment.
Pg 27, par 2, description of figure 11: ``... p ranges from 2 to 72";
There is something wrong with the plots. If the ``x" represents the
data point (1,1), then p starts from 0, since there is at least one
circle to the left of ``x" (see figure 11).
The x was removed; it was a leftover from when the x-axis was speedup.
Thanks for pointing out that mistake.
Pg 27, several places: ``...retains around 90% (80%) of the ..."; What
does the number in brackets refer to?
The number in brackets refers to the experimental variation described in
brackets.
Pg 27, par 4: again, the three consensus algorithms are not shown (?)
Pg 27, par 5: ``First, through the reduction, ... , the problem is
simplified as speedup increases"; This sentence does not relate to the
loss of quality as advertised. Further several sentences are not
clear: if the sampling cannot maintain the balancing, how can it
enforce it (thus hurting the quality)?
Pg 27, bottom line: linear -> linearly.
Done
Pg 28, first line: This analysis is not correct as stated. There is a
mix up of actual time performance and the asymptotic complexity. The
linear term *could* be negligible, but not on the account of being
linear. It is true that the asymptotic time complexity remains O(n^2),
but this is not because the linear term is negligible, but because
O(n+n^2) = O(n^2). It is easy to conceive a whole set of methodologies
that are asymptotically linear, but would completely dominate the
quadratic term for anything but infinite samples.
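The distinction can be made concrete with a toy cost model. The sketch below assumes a running time of the form a*n + b*n^2 with made-up constants a and b (they are not measurements from the paper). Under that assumption the linear term dominates the actual running time for every n below a/b, even though the asymptotic complexity is O(n^2) throughout.

```python
def crossover(a, b):
    """Sample size where the two terms are equal: a*n == b*n**2, so n == a/b."""
    return a / b


def linear_share(n, a, b):
    """Fraction of the modeled running time a*n + b*n**2 spent in the linear term."""
    linear = a * n
    quadratic = b * n ** 2
    return linear / (linear + quadratic)


# Hypothetical constants, chosen only for illustration.
a, b = 1000.0, 0.5
print("terms are equal at n =", crossover(a, b))
for n in (20, 2_000, 200_000):
    print(n, round(linear_share(n, a, b), 4))
```

Whether the linear term is negligible at the sample sizes of interest therefore depends on the constants, which is what measured running times would establish.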
Pg 29, par 1: ``Consequently, the sequential speedup can be multiplied
by p..."; Note punctuation. Also, the speedup can be multiplied by PI,
42, 1e200, etc. Please rephrase.
Pg 29: the time complexity analysis mixes up asymptotic complexity and
the actual running time. The example with p=4 and v=2 does not have a
correct derivation. While s=1 might hold, authors must first establish
what the actual running time is, show that it is heavily dominated by
the quadratic term for the sample sizes of interest, and only then can
the analysis be deemed correct.
Pg 29, section 5: ``We defined ... that enables us ... and allows one
..."; be consistent with ``us" vs. ``one".
Pg 29, section 5: ``speed-up"; use consistent spelling.
Done. Speedup is used consistently.
Pg 29, section 5: ``... dramatically improve... for a large variety of
domains"; the sentence is too strong since 4 domains do not represent
a large variety. Something along ``for several quite different
domains" is more appropriate.
Done
Pg 29, section 5, par 2: This referee found the future work to be
described in more detail than necessary, but it appears that it is the
journal policy to enforce such a section.
Unchanged
Pg 30, appendix: H(X) is defined as entropy, but is then used in the
context of joint entropy H(X,Y). Define H(X,Y) prior to first use.
Done
Pg 30, below eq. 12: The upper bound is not tight, and even if it
were, it would not follow from the min(Hx,Hy)<=(Hx+Hy)/2. For example,
by the argument of the sentence one would have that since min(Hx,Hy)<=
100*(|Hx|+|Hy|)/2, then 100*(|Hx|+|Hy|)/2 represents a tight bound as
well.
Removed the word ‘tight’. It was meant to express that the upper bound is
actually reached in particular cases, min(Hx,Hy)==(|Hx|+|Hy|)/2, while it is safe
to say that min(Hx,Hy)<100*(|Hx|+|Hy|)/2.
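The chain of bounds under discussion can be checked numerically on small label vectors. The sketch below is an illustration only, using plug-in (empirical) estimates of entropy and mutual information; the function names are hypothetical. It relies on the standard fact that I(X,Y) <= min(H(X),H(Y)), together with min(Hx,Hy) <= (Hx+Hy)/2 from the comment above, and shows that both inequalities become equalities when the two clusterings are identical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Empirical entropy (in nats) of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two label vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


x = [1, 1, 2, 2, 3, 3]
y = [1, 1, 1, 2, 2, 2]
hx, hy = entropy(x), entropy(y)
print(mutual_information(x, y) <= min(hx, hy) <= (hx + hy) / 2)
# Identical clusterings: the mutual information equals the entropy (up to rounding).
print(abs(mutual_information(x, x) - hx) < 1e-12)
```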
This concludes the most important points. Note that
spelling/grammar/structure of sentences must be checked by authors
prior to final submission.
-----------------------------------------------------------------
Review 3
-----------------------------------------------------------------
PAPER:
Strehl & Ghosh: Cluster Ensembles
Summary:
The paper proceeds in four steps: (i) it poses a problem that is
claimed to be highly relevant, namely how to combine multiple data
partitionings without accessing the original pattern representation,
(ii) it proposes an objective function to measure how well a combined
partitioning agrees with a set of given partitionings, namely a
criterion based on mutual information called ANMI, (iii) it suggests
different (three) heuristics of how to combine partitionings, and (iv)
it investigates several concrete settings (and data sets), motivated
by "practical" problems.
Assessment & Comments:
Obviously, the paper tries to cover a lot of ground. The authors bring
forward a number of ideas and suggestions, each of which certainly has
some merit. Yet overall the paper is lacking a contribution of real
significance and each of the above steps is carried out somewhat
superficially, dealing with a number of details in a rather ad hoc
manner. Moreover, the paper lacks overall coherence. In particular
there seems to be no obvious connection between the proposed objective
function and the proposed heuristic algorithms, a fact that I consider
to be the most crucial flaw of the paper.
Looking in more detail at the above steps:
(i): The introduction spends pages on discussing work on combining
classifiers. As it turns out, very little of this is actually relevant
for the proposed method. Is it supposed to just provide some
motivation? If so, it seems to be based primarily on analogy, since it
is not clear that what works for classification should also work for
data clustering (and even for the same reasons). On the other hand,
the introduction has failed to fully convince me of a real need for
"cluster ensemble" techniques. Phrases like "collective data mining"
and "knowledge reuse" sound great, but the references to real
applications remain somewhat fuzzy. Maybe there are just too many
applications? It would still be good to discuss some relevant
applications in greater detail. In addition, it appears to me that
exchanging some sort of "approximate sufficient statistics" seems
possible in most cases, for example, by exchanging cluster centroids
in the case of distributed clustering. My impression is that the whole
setting might be overly prohibitive as far as practical problems are
concerned. This would be OK for a "theory" paper, where one would
actually pose a problem in an overly "radical" fashion in order to
prove interesting theorems, but this is not what the submitted
manuscript is aiming at.
(ii): The objective function makes a lot of sense to me. The notation
could be improved though. For example, n^(h)_l is actually symmetric
in h and l, but the indexing convention makes it very difficult to
grasp that. There must be a way to remove some of the clutter from
Eqs. (2-4)! Why is the trivial (but pretty much futile) way of
establishing a normal form in 2.1 being discussed? It doesn't seem
that this is used later on.
(iii): As said earlier, it is not clear how the heuristics proposed in
Section 3 relate to the objective function established in Section
2. Do you just hope by pure luck that one of these will produce an
approximate solution to the average mutual information maximization
problem? There should be a way to come up with an approximation scheme
that at least finds a local maximum. The reader gets the impression
that the authors apply the same scheme of combining heterogeneous
algorithms at the "meta"-level again. The principle seems to be: "If
there is no algorithm that solves the problem optimally, take a bunch
of algorithms and see which one works best on a case by case basis."
While this may not be an unreasonable approach for engineers who are
solving real-world problems, it's not clear to me on a more
intellectual level what lesson can be learned here.
Additional comment(s): I'm not convinced that the hypergraph
representation is very useful. The only algorithm that really makes
use of this representation is HGPA and as the authors point out
themselves, this algorithm seems to be inadequate in the way cluster
"violations" contribute to the cost function.
The representation is used by all three algorithms. Calling it a hypergraph
representation may unfortunately lead the reader to think that it is used only
for HGPA. However, we believe it is the most descriptive name for the
representation.
(iv): The authors have included a number of experiments on different
datasets, artificial and synthetic. As a result a number of
interesting insights are gained and overall it seems that the proposed
cluster ensemble design methods "work". Although I think none of the
applications qualifies as a real "killer" application for the proposed
framework, the authors have put some effort in the empirical
evaluation and are able to claim a success, for example, on the robust
clustering problem. Yet I would be curious to see, how an algorithm
would perform that optimizes some combination of objective functions
to start with. It is not fully convincing to sell a limitation (no
access to the underlying pattern representations) as a feature! Why
should one be worse off by using more information? This and other
questions reinforce the impression that there might be yet other
techniques that would do even better in these applications.
Presentation:
The text spends too many pages on issues that are inessential for the
main topic of the paper. Basically all sections of the paper can be
shortened and condensed. Thereby, the paper would gain in
focus. Appendix A is trivial; I would suggest dropping it
altogether. Also, I personally think that the main idea needs to be
presented earlier on in the paper. In my opinion, the paper
also has a tendency to be overly formal where it should focus on
communicating the main ideas.
I would revise a couple of statements/sentences:
+ "It is well known that no clustering method is perfect." (Deeper meaning
unclear)
Done
+ Agglomerative clustering methods do not always minimize an objective.
+ What does "worse than a random partitioning" exactly mean?
It is possible to have less mutual information than a random labeling. However,
since this is not fully explained where the statement is made, the wording has
been changed to ‘not better than’ random.
+ "statistical information shared between two distributions"
(inaccurate, you need a joint distribution to start with, etc.)
+ the magic number 1.05
+ "unbiased" is a technical term from statistical inference, you seem
to use it in a much more colloquial sense
Done
Recommendation:
Overall I think the paper has some merit, but I think it would be more
suitable to publish the paper in a journal that puts more emphasis on
"data mining" and "knowledge engineering". From a machine learning
point of view, I think the contributions made in the paper are only of
limited significance. Alternatively, I would probably feel more
comfortable in supporting the paper, if the authors could close the
gap between Sections 2 & 3 and come up with an algorithm that would
directly optimize the ANMI criterion.
We added a greedy optimization scheme for ANMI in a new section and compare it
to the heuristics in the controlled experiment. However, it is computationally
very slow: we could compute results for the comparison section but not for the
real experiments. The heuristics tend to reach the same results as the direct
optimization while running several orders of magnitude faster.
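As an illustration only, and not the implementation used in the paper, a greedy ANMI optimization of this kind could be sketched as follows. The sketch assumes that labels are integers in 0..k-1, that mutual information is normalized by the geometric mean of the two entropies, and that the search starts from the input labeling with the highest ANMI and relabels one object at a time. The function names (`nmi`, `anmi`, `greedy_anmi`) and the `max_sweeps` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Assumed normalization: I(a; b) / sqrt(H(a) * H(b)); 0 if an entropy is 0."""
    denom = math.sqrt(entropy(a) * entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0


def anmi(candidate, labelings):
    """Average NMI between a candidate labeling and all given labelings."""
    return sum(nmi(candidate, lab) for lab in labelings) / len(labelings)


def greedy_anmi(labelings, k, max_sweeps=10):
    """Greedy local search over single-object relabelings (illustrative only)."""
    # Start from the input labeling that already has the highest ANMI.
    best = list(max(labelings, key=lambda lab: anmi(lab, labelings)))
    best_score = anmi(best, labelings)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(best)):
            for new_label in range(k):
                if new_label == best[i]:
                    continue
                old_label = best[i]
                best[i] = new_label
                score = anmi(best, labelings)
                if score > best_score + 1e-12:
                    # Keep the move only if it strictly increases ANMI.
                    best_score = score
                    improved = True
                else:
                    best[i] = old_label
        if not improved:
            break
    return best, best_score


if __name__ == "__main__":
    example = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1]]
    print(greedy_anmi(example, k=2))
```

Each candidate move re-evaluates ANMI against every input labeling, so one sweep costs on the order of n * k ANMI evaluations. This is consistent with the slowness noted above.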