ALFRED P. SLOAN SCHOOL OF MANAGEMENT
WORKING PAPER

Examining The Validity of Sample Clusters
Using the Bootstrap Method

by M. A. Wong and D. A. Shera

WP 1469-83                                     August 1983

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139
Abstract

An important problem in clustering research is the stability of sample clusters. Cluster diagnostics, based on the bootstrap subsampling procedure and Fowlkes and Mallows' Bk statistic, are developed in this study to aid the users of cluster analysis in assessing the stability and validity of sample clusters.
1. INTRODUCTION

An important problem in clustering research is the stability and validity of the sample clusters. For Euclidean data, Hartigan (1981), Wong (1982), and Wong and Lane (1983) have developed procedures to evaluate sample clusters using the density-contour clustering model. However, it is often true that, as in clustering the nations of the world, the political states of the United States, or the companies of a major industry, the objects under study cannot be reasonably viewed as a sample from some such underlying population. Under these circumstances, it is not reasonable to talk about sampling errors in the computed clusters. But the stability of the sample clusters can be evaluated in other ways.
In the approach taken by Baker (1974), Hubert (1974), and Baker and Hubert (1975), the question of how the clustering solutions provided by the single and complete linkage methods are affected by changes in the distance or similarity matrix is addressed. This is a reasonable question, as different data collected on the objects might change the distances or similarities. Ling (1973) took a different approach and developed an exact probability theory for testing the compactness and isolation of single linkage clusters, under the (admittedly unrealistic) assumption that the rank order of the entries in the distance matrix is completely random.
Another approach is appropriate when the sample similarity matrix provides an approximation of the population similarity matrix for a fixed set of objects. For example, in marketing research, if the brand-switching behavior of the total population of coffee consumers were known, a population similarity matrix (in terms of the relative frequency of switching between brands) between various coffee brands could be obtained. For such a finite population, a true hierarchical clustering can be defined on the objects (here, the coffee brands) using the population similarity matrix (e.g., the block-distance hierarchical clustering model can be used). However, only a sample of coffee consumers is available in practice, and the sample similarity matrix obtained from their brand-switching behavior only provides an approximation of the true aggregate similarities between brands. Consequently, the sample hierarchical clustering obtained from this similarity matrix is merely a sample estimate of the true hierarchical clustering.
In this type of study, standard sampling procedures like the bootstrap or cross-validation methods described in Efron (1979a, 1979b), or the error analysis scheme given in Hartigan (1969, 1971), can be usefully applied to the sample subjects (e.g., coffee consumers) to perform two main tasks:

1. to assess the similarity between the sample and population hierarchical clusterings, and

2. to assess the stability of the sample clusters,

both using the Bk measure developed by Fowlkes and Mallows at Bell Laboratories (1983).
In Section 2, we describe the theoretical background for this study. There are discussions on the type of data we are concerned with, the clustering methods used, the statistics to compare clustering trees, and how the bootstrap method is used. The sampling experiments are described in detail in Section 3. The simulation results and their implications are also discussed there. Section 4 draws conclusions and describes how the techniques can be used most effectively.
2. RESEARCH BACKGROUND

In this study, our main concern is to develop diagnostic tools for assessing the stability and validity of sample clusters in the case where the true hierarchical clustering can be defined on a finite population of objects by the ultrametrics model (Johnson 1967), and the sample similarity matrix is merely an approximation of the population similarity matrix. In order to evaluate the clusters obtained by various clustering techniques from the sample similarity matrix, we need to examine the degree of agreement between the sample and population clusters and its distribution in a series of simulated sampling experiments. We will adopt Fowlkes and Mallows' Bk statistic as a measure of the degree of agreement. The clustering techniques used in this study are reviewed in Section 2.1, and the Bk statistic is described in Section 2.2. In Section 2.3, the bootstrap procedure is outlined.
2.1. CLUSTERING METHODS
Clustering is the splitting of a set of objects into partitions, sets, or groups of one or more objects. A hierarchical clustering is a set of clusterings, {Ci}, of objects indexed from 1 to N, the number of objects. The index i is the number of clusters in partition Ci. What makes it hierarchical is that if objects X and Y are in the same cluster in partition Ci, then they are in the same cluster for all Cj where j < i. A tree will mean a hierarchical clustering; a dendrogram, a taxonomy, and a number of other things can all mean the same. The term tree reflects the property that a hierarchical clustering can be represented on paper by something which resembles an upside-down botanical tree.
Here is a more dynamic definition of a tree. Given a distance matrix, each clustering method proceeds as follows: Start with each object in its own cluster. At each step, combine two clusters into a single cluster. Continue linking until all the objects are in a single cluster. Thus, for N objects, we will have N-1 linkings and a tree of N clusterings, the first and last being trivial.
The differences in the different linking methods come from the criteria used to decide which two clusters to combine at each step. One of the most common criteria is the distance between the closest two objects of the two clusters. This is known as single linkage. Complete linkage defines the distance between two clusters to be the maximum of the distances between any object in the first cluster and any object in the second. At each step, this method links the two clusters which are closest together by that distance measurement. Average linkage is the same except it uses the average of the distances between objects in each cluster.
Wong and Lane (1983) proposed a method which involves looking at the Kth nearest neighbor to each object, where K is a parameter chosen by the data analyst. The rationale comes from density estimation, so that it is like estimating the density of the true distribution at each object. Clusters are linked together by highest density first, and only on the additional condition that at least some object in one cluster be within the Kth nearest neighborhood of some object of the other cluster. As a result, this method also produces an estimate for the number of clusters. Wong and Lane have suggested a way to decide which K to use: look at the results for all values of K and see what the most common value of the number of clusters is. There are many other existing clustering methods and criteria, but they will not be considered here.
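As a concrete aside, the three linkage criteria just described can be sketched with modern tools. The following minimal Python sketch (using SciPy, which is of course not the software of this 1983 study; the example matrix and all names are our own invention) builds one tree per criterion and cuts each at k = 3 clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # An invented symmetric distance matrix over N = 10 objects.
    rng = np.random.default_rng(0)
    D = rng.uniform(0.05, 0.95, size=(10, 10))
    D = (D + D.T) / 2.0
    np.fill_diagonal(D, 0.0)

    # linkage() takes the condensed (upper-triangular) form of the matrix.
    condensed = squareform(D)

    # One tree per criterion: single (closest pair), complete (farthest
    # pair), average (mean of all between-cluster distances).
    for method in ("single", "complete", "average"):
        Z = linkage(condensed, method=method)
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut at k = 3
        print(method, labels)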
2.2. THE Bk STATISTIC
The definition of Bk is as follows: The k in Bk refers to the number of clusters, and we will compare the clusterings with k clusters in the two trees. Consider a table Mij, where i and j run from 1 to k and where Mij equals the number of objects in cluster i of clustering 1, the clustering from the first tree, and in cluster j of clustering 2, the clustering from the second. Let Mio be the number of objects in cluster i of the first clustering, and Moj be the number of objects in cluster j of the second; Mio and Moj are the marginal totals of the rows and columns of table Mij. Given k, let

    P_k = \sum_i M_{io}^2 - N
    Q_k = \sum_j M_{oj}^2 - N
    T_k = \sum_{ij} M_{ij}^2 - N

and then

    B_k = T_k / \sqrt{P_k Q_k}
Fowlkes and Mallows (1983) calculated the null distribution for Bk, that is, the distribution for two completely independent trees. However, it is difficult to imagine a situation in which the null distribution should be considered, since there is almost always going to be some clustering of some kind. Most situations will have significant deviation from the null case.
Rand (1971) also proposed a statistic to measure the similarity between clusterings. With the same P, Q, and T as defined above for Bk, Rand's statistic was

    R_k = T_k / C - (P_k + Q_k) / (2C) + 1

where C equals Comb(n, 2), the number of combinations of n objects taken 2 at a time. In the paper introducing Bk, Fowlkes and Mallows show clearly that Rand's statistic was insensitive, since it did not indicate important differences in situations where Bk rightfully did.
We will use the Bk statistic in this study.
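Since the definition is purely combinatorial, Bk is short to compute. The following is a minimal Python sketch of our reading of the formulas above, with two invented label vectors standing in for the two clusterings cut at level k:

    import numpy as np

    def b_k(labels1, labels2):
        # Bk for two partitions of the same N objects, each given as an
        # array of cluster labels.
        n = len(labels1)
        c1, c2 = np.unique(labels1), np.unique(labels2)
        # Mij = objects in cluster i of tree 1 and cluster j of tree 2.
        M = np.array([[np.sum((labels1 == a) & (labels2 == b)) for b in c2]
                      for a in c1])
        tk = np.sum(M ** 2) - n               # Tk
        pk = np.sum(M.sum(axis=1) ** 2) - n   # Pk from row totals Mio
        qk = np.sum(M.sum(axis=0) ** 2) - n   # Qk from column totals Moj
        return tk / np.sqrt(pk * qk)

    # Two invented partitions of 6 objects into k = 2 clusters.
    print(b_k(np.array([1, 1, 1, 2, 2, 2]),
              np.array([1, 1, 2, 2, 2, 2])))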
2.3. THE BOOTSTRAP METHOD
Bootstrapping is a numerical technique proposed by Efron (1979a, b, 1983) to estimate the distribution of a statistic. Given a set of N independent and identically distributed (iid) random variables X = {Xi}, with true probability distribution F, and a statistic R(X, F), the objective is to estimate the distribution of R. To do the bootstrap, consider the empirical distribution of X, i.e. each Xi has probability 1/N, and call this distribution G.
Note that not each value of Xi has equal probability, because Xi might have the same value for two or more values of i. Now draw a number of new samples Yj = {Yi}j from G, where the size of each sample Yj is the same as that of X, i.e. i runs from 1 to N. Then calculate R(Yj, G) for each bootstrap sample Yj.
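A minimal sketch of this procedure (the data, the choice of statistic R, and all sizes here are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    # X: N iid draws from an unknown F; an exponential sample stands in.
    N = 25
    X = rng.exponential(scale=2.0, size=N)

    def R(sample):
        # The statistic whose distribution we want; say, the median.
        return np.median(sample)

    # Draw B bootstrap samples Yj from G, i.e. sample the observed
    # values with replacement, each sample of the original size N.
    B = 500
    boot_R = np.array([R(rng.choice(X, size=N, replace=True))
                       for _ in range(B)])

    # The spread of the R(Yj, G)'s approximates the distribution of R.
    print(R(X), boot_R.mean(), boot_R.std())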
The main assumption is that G is a good approximation of F. With this assumption, we say that the distribution of the R(Yj, G)'s is a good approximation of the true distribution of R(X, F). This is not an unreasonable assumption: standard probability theory tells us that for large N, G approaches F almost surely under suitable, but hardly restrictive, conditions. It is the same assumption a statistician applies when he or she rejects a model because the empirical value lies outside some confidence region: he or she is assuming that the data is not unusual or exceptional.
The data analyst may decide to smooth the data by adding random noise or fitting a smooth distribution to it if he or she feels that is appropriate.
A bootstrap sample is a sample drawn from the empirical distribution, G, defined by the original sample, or perhaps by a smoothed version. The term sample by itself will mean a sample from the true distribution, not a bootstrap. In this study, we have an unknown "population" similarity matrix and a sample similarity matrix. The bootstrapping procedure will be applied to the elements of the sample similarity matrix independently to obtain subsample matrices.
2.4 SIMULATION DATA
It is difficult to make general statements about data because it comes in so many different forms. In fact, translating data to computer-usable form usually requires writing a new program for each problem. Here we describe our data and how it is treated in this paper.

Objects are the things of interest which we want to cluster. These objects might be variables or countries or brands of chewing gum. N will denote the number of objects here.

Data is made up of responses or single values, e.g. a single draw Xi from a distribution F is a response. A sample is a collection of independent responses.
In this study, the set of objects of interest will not change within the scope of a single problem. On the other hand, getting many different samples and resamples will be an important part of the technique. The bootstrap procedure we used has the resample size fixed equal to the original sample size. (Some studies have been done where the sample size does change for different samples (Hartigan 1981), but we will not be concerned with that here.)
Since all the clustering techniques presented here involve operations on a distance matrix, eventually we will want to change the raw data into a distance matrix between the objects. The resulting distance need not be a true metric in the mathematical sense, i.e. it need not satisfy the triangle inequality. Since this is not the focus of the paper, we will assume that the transformation from X to its distance matrix is clear and done implicitly. Thus, the clustering methods we will be using here will apply to any situation where one can create a distance or similarity matrix from the data.
Here are some examples of responses and their relation to distance matrices:

a) A sample of distance matrices, one for each individual, giving his or her associations between objects, combined in some fashion. Some average of these responses would then become the distance matrix corresponding to the sample.

b) A vector of values for each individual, where we are trying to cluster the variables. We might measure the association between variables by linear or monotone correlations, or something more sophisticated.

c) A simple distance matrix between the objects. This is the case addressed by Hubert (1974) and will not be approached here.
In our experiments, the ultrametrics tree model will be used to generate the population distance matrix for hierarchical clustering. Below is the distance matrix for data set A1, consisting of 10 objects with 3 well-defined clusters; following that is the tree corresponding to set A1. Other data sets will be introduced in Section 3 as they are used.
Table 2.4.1: True Distances of Set A1

        A1   A2   A3   B1   B2   B3   C1   C2   C3   C4
  A1     -  .10  .10  .40  .40  .40  .70  .70  .70  .70
  A2         -   .10  .40  .40  .40  .70  .70  .70  .70
  A3              -   .40  .40  .40  .70  .70  .70  .70
  B1                   -   .05  .05  .70  .70  .70  .70
  B2                        -   .05  .70  .70  .70  .70
  B3                             -   .70  .70  .70  .70
  C1                                  -   .15  .15  .15
  C2                                       -   .15  .15
  C3                                            -   .15
  C4                                                 -
The application we have in mind can be described as follows: Suppose the objects were brands of detergents and the respondents were consumers. The real distance between detergent 1 and detergent 2 is the proportion of consumers who would not use each as a substitute for the other.
In marketing research studies, a finite sample of, say, 100 people would be tested on whether they would accept one product as a substitute for the other. So if 80 percent of the people switched, then the distance is .20, and it is a reasonable estimate of the true distance in the population.
Figure 2.4.2: Tree for Data Set A1
[Dendrogram of the 10 objects AAABBBCCCC on a vertical scale from 0.00 to 1.00: the A's join at .10, the B's at .05, the C's at .15, the A and B clusters merge at .40, and everything merges at .70.]

To simulate the survey described above, the natural thing to do is to create binomial random variables with probability parameters equal to the distances and with n = 100 trials. Sample distances are distributed Binomial(Pij, 100)/100, where Pij is the true distance between objects i and j. Likewise, if Sij is the sample distance, the bootstrap distances are distributed Binomial(Sij, 100)/100.
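A minimal Python sketch of this data generator, with the Set A1 true distances of Table 2.4.1 rebuilt in code (all names here are our own):

    import numpy as np

    rng = np.random.default_rng(2)

    # True distances for Set A1 (Table 2.4.1): 3 A's, 3 B's, 4 C's.
    groups = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
    within = {"A": .10, "B": .05, "C": .15}
    between = {("A", "B"): .40, ("A", "C"): .70, ("B", "C"): .70}
    N = len(groups)
    P = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j:
                gi, gj = sorted((groups[i], groups[j]))
                P[i, j] = within[gi] if gi == gj else between[(gi, gj)]

    def binomial_matrix(D, n=100):
        # Entry (i, j) of the result is Binomial(D[i, j], n) / n,
        # drawn once per pair and symmetrized.
        iu = np.triu_indices(D.shape[0], k=1)
        S = np.zeros_like(D)
        S[iu] = rng.binomial(n, D[iu]) / n
        return S + S.T

    S = binomial_matrix(P)       # sample distances, Binomial(Pij,100)/100
    S_boot = binomial_matrix(S)  # bootstrap distances, Binomial(Sij,100)/100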
The next two distance matrices are examples of a sample from set A1 and then one of the bootstrap samples based on that sample.

Table 2.4.3: Example of Sample Distances
[sample distance matrix over the objects A1 through C4]
3. RESULTS

3.1 BK PLOTS

3.1.1 Data Set A1, Complete Linkage
In the first experiment based on data set A1, 50 samples from the true distribution were generated, and complete linkage was used to make 50 trees. Bk was calculated by comparing each of the 50 trees with the true tree.
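A compressed Python sketch of this experiment (scikit-learn's fowlkes_mallows_score computes the same Bk as defined in Section 2.2; the matrix P and the binomial sampler follow the Section 2.4 sketch, rebuilt here so the fragment stands alone):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import fowlkes_mallows_score

    rng = np.random.default_rng(3)

    # Set A1-style truth: 3 A's, 3 B's, 4 C's.
    P = np.full((10, 10), .70)
    P[:3, :3], P[3:6, 3:6], P[6:, 6:] = .10, .05, .15
    P[:3, 3:6] = P[3:6, :3] = .40
    np.fill_diagonal(P, 0.0)

    def tree_labels(D, k):
        # Complete-linkage tree from a distance matrix, cut at k clusters.
        Z = linkage(squareform(D), method="complete")
        return fcluster(Z, t=k, criterion="maxclust")

    def sample_matrix(D):
        # One sample: each off-diagonal entry ~ Binomial(Dij, 100)/100.
        iu = np.triu_indices_from(D, k=1)
        S = np.zeros_like(D)
        S[iu] = rng.binomial(100, D[iu]) / 100.0
        return S + S.T

    # 50 samples from the truth; compare each sample tree with the
    # true tree at every level k.
    for k in range(2, 10):
        bks = [fowlkes_mallows_score(tree_labels(P, k),
                                     tree_labels(sample_matrix(P), k))
               for _ in range(50)]
        print(k, round(np.mean(bks), 3), round(np.std(bks), 3))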
The distribution of Bk for each k is shown in figure 3.1.1.1. The horizontal scale runs from 1 to N (N = 10); hatch marks at 1 and N lie at the left and right borders, respectively. The vertical scale runs from 0 at the bottom to 1 at the top, in increments of .10. For each k, the numbers count how many Bk's took on that value.
Figure 3.1.1.1: Plot of Bk versus k
Set A1, Complete Linkage
[frequency plot of the 50 Bk values at each k]
Figure 3.1.1.2 shows the means of Bk, indicated by stars, "*", and plus and minus one sample standard deviation on each side, indicated by dashes, "-". They are truncated at the top and bottom boundaries because Bk will only lie in the unit interval.
Figure 3.1.1.2: Plot of Bk versus k
Set A1, Complete Linkage
[means of Bk marked "*", plus and minus one standard deviation marked "-"]
The following are box plots of the same results. The median is represented by a number sign, "#", the quartiles by a plus sign, "+", and the first and seventh eighths by minus signs or dashes, "-". The regions in between the quartiles were filled in with vertical bars, "I", to make the boxes in the box-plots more visible. Often, because the eighths are equal to the quartiles and/or the quartiles equal to the median, only one of the symbols is shown. The discreteness of the distribution causes this problem. It also often causes rather large boxes for higher values of k.
Figure 3.1.1.3: Plot of Bk versus k
Set A1, Complete Linkage
[box plots of the 50 Bk values at each k]
Because of the discreteness, simply finding the average may not be an appropriate summary of the information. The box plots seem to convey the information efficiently. The frequency plot of Bk contains the most information about the variability, but it can be difficult to read, especially when there are many objects.
Before we continue with the experiments, let us explain that the plot of Bk is actually not a smooth one, but jagged, and contains many peaks, valleys, and steep cliffs. It is clear that this is natural and desirable, because the more important feature of the Bk plot is that the peaks indicate the more stable and relevant clusterings.
To see the reason, consider the structure of Set A1 and the perturbation of the linking process for the samples. The objects labelled with B's are most likely to link together before those with A's or C's, because their distances are smaller. After 2 linkings, when k = 8, the B cluster is usually completely linked, and Bk is higher. At k = 8 + 1 (9), none of the clusters is completely linked, and only random pieces have been linked, so Bk is smaller. Similarly, for k = 8 - 1 (7), Bk is smaller, because the cluster of A's is only beginning to link together. Even so, the peak is not always exactly at k = 8: sometimes some A's may get a small distance in the sample, or the B's a large distance, and then the A's link before the B cluster is complete. So by looking at the clustering at level k when k is one of these peaks, one will likely find stable clusters.

3.1.2. Data Set B1, Complete Linkage
Figure 3.1.2.1: Tree for Data Set B1
[Dendrogram of the 18 objects AAAABBBBCCCCDDDDEE on a vertical scale from 0.00 to 1.00, with merges at .30, .50, .55, .65, and 1.00.]
The three figures here are for Data Set B1, which represents a different population consisting of 18 objects. There are 3 distinct clusters of 4 objects each, labelled with A's, B's, and C's, and 6 other objects, labelled with D's and E's, which are in a less distinct cluster.
Figure 3.1.2.2: Plot of Bk versus k
Set B1, Complete Linkage, 50 Samples
[frequency plot of the 50 Bk values at each k]

More objects made the distribution smoother, except for the very high values of k; but it is still not smooth enough for us to use only the sample mean and standard deviation as descriptors for the distribution.
Figure 3.1.2.3: Plot of Bk versus k
Set B1, Complete Linkage, 50 Samples
[box plots of the 50 Bk values at each k]
3.1.3. Data Set A1, Other Methods

Section 3.1.1 contained the plots of Bk for 50 samples when complete linkage was the method used to form the trees. Next are the results when single linkage, average linkage, and the kth nearest neighbor algorithm are used.
Figure 3.1.3.1: Plot of Bk versus k
Set A1, Single Linkage, 50 Samples
The Bk's of single linkage are typically higher, especially for high values of k near N, the number of objects. Complete linkage finds the stable clusters at lower values of k; for single linkage, they are also near 1. In their experiments, Fowlkes and Mallows see that the complete linkage decayed faster than single linkage. This is easily explained by the fact that single linkage is "continuous" while complete is known to be "discontinuous". Discontinuous means that drastic changes in the result can come from small changes in the data.
However, single linkage has its problems, too. It is sensitive to "chaining", or the linking of small distances, which can lead to clusters being linked earlier than they should normally be linked.
Figure 3.1.3.2: Plot of Bk versus k
Set A1, Average Linkage, 50 Samples
By using the average distance between objects in the clusters, instead of the minimum or maximum, chaining and discontinuity might be avoided. As one would expect, Bk with average linkage (figure 3.1.3.2) usually ends up between the single and complete linkage results.
Figure 3.1.3.3: Plot of Bk versus k
Set A1, Nearest Neighbor (2) Linkage, 50 Samples
In the plot shown above, it is indicated that the kth nearest neighbor method failed to identify the population clusters, shown by the low Bk's at k = 3, although it did find the two stable clusters at k = 2. The reason for this confusion is that the kth nearest neighbor method is oriented towards picking out the number of clusters and the clusters themselves, and not towards reproducing the whole tree.
Here, its choice of the number of clusters is 3, and that is a perfectly good answer. Thus, it is not appropriate to evaluate the kth nearest neighbor method by how well it reproduces the tree, nor to use it together with Bk. One additional point: this method is not designed for a small set of objects, but more for Euclidean data with more than 100 objects. When parameters besides 2 were used, the reproduction of the tree was worse than for 2. For these reasons, we will not consider this method for the remainder of the study.
3.1.4. Data Set B1, Other Methods

Recall that the graph for complete linkage, figure 3.1.2.3 above, showed stable clusters at k = 4 and k = 9, when only clusters A, B, and C are linked. By simply looking at the tree, figure 3.1.2.1, one might be misled into declaring the clusterings at k = 2, 3, 5, 6, and 7 stable, while they are unstable; the graph of Bk would tell us that the D's and E's are not really clusters. Single linkage, figure 3.1.4.1, shows 9 clusters! Average linkage has stable clusters at 2, 3, 4, and 5, and of course 9. It is difficult to say which method is more correct.
Figure 3.1.4.1: Plot of Bk versus k
Set B1, Single Linkage
[box plots of the Bk values at each k; annotated at k = 9]
Figure 3.1.4.2: Plot of Bk versus k
Set B1, Average Linkage, 50 Samples
[box plots of the Bk values at each k; annotated at k = 9]
3.1.5. Data Sets A2 and A3, Complete Linkage

Set A1 is an unusually nice set with well-separated clusters. By changing the distances but conserving the topology of the tree, set A2 is formed with less distinct clusters.
Table 3.1.5.1: Distances for Data Set A2

        A1   A2   A3   B1   B2   B3   C1   C2   C3   C4
  A1     -  .20  .20  .25  .25  .25  .50  .50  .50  .50
  A2         -   .20  .25  .25  .25  .50  .50  .50  .50
  A3              -   .25  .25  .25  .50  .50  .50  .50
  B1                   -   .15  .15  .50  .50  .50  .50
  B2                        -   .15  .50  .50  .50  .50
  B3                             -   .50  .50  .50  .50
  C1                                  -   .10  .10  .10
  C2                                       -   .10  .10
  C3                                            -   .10
  C4                                                 -
Figure 3.1.5.2: Tree for Data Set A2
[dendrogram for set A2 on a vertical scale from 0.00 to 1.00]

Figure 3.1.5.3: Plot of Bk versus k
Set A2, Complete Linkage
[box plots of the Bk values at each k]
The stability depends not only on the difference in distances, e.g. .20 to .25 in A2, but also on the variation that sampling will produce. The maximum variance in a binomial random variable occurs when the probability is .50. So the real difference between the links in set A2 at .20 and .25 is more than that between the links at .50 and .55 in data set A3, pictured below, even though the arithmetic difference, .05, is the same. This property is something that is not well known, and a data analyst could easily forget it while simply looking at a single tree. As expected, the Bk plot for set A3 with complete linkage has even less stability at 3 clusters.
Figure 3.1.5.4: Tree for Data Set A3
[Dendrogram of the 10 objects AAABBBCCCC on a vertical scale from 0.00 to 1.00, with merges at .40, .45, .50, .55, .80, and 1.00.]
Figure 3.1.5.5: Plot of Bk versus k
Set A3, Complete Linkage
[box plots of the Bk values at each k]
Because we can not define a true null distribution, it is difficult to say exactly how one could tell whether a peak is significant or not. It would be easy to make an arbitrary cut-off point at, say, .90 or .85; but as k increases, this may not be appropriate, since the Bk plot must eventually decay. Perhaps this question of significant peaks can only be answered after a few hundred plots of experience.
3.1.6. Data Set B2

Figure 3.1.6.1: Data Set B2
[Dendrogram of the 18 objects AAAABBBBCCCCDDDDEE on a vertical scale from 0.00 to 1.00, with merges at .40, .50, .55, and 1.00.]
Data set B2 is a less stable version of B1. Average and complete linkage find stable clusters at k = 9, while single linkage does not. However, these clusters are not as stable as those in Set B1, as indicated by the lower values of Bk.
Figure 3.1.6.2: Plot of Bk versus k
Set B2, Complete Linkage
[box plots of the Bk values at each k]
Figure 3.1.6.3: Plot of Bk versus k
Set B2, Single Linkage
[box plots of the Bk values at each k]

Notice that average linkage (figure 3.1.6.4) has stable clusters at 4, which is odd because set B2 does not. It also has stable clusters at k = 9.
Figure 3.1.6.4: Plot of Bk versus k
Set B2, Average Linkage
[box plots of the Bk values at each k]
So we have shown that the Bk plot is useful in finding the stable and relevant clusters. The problem is that we were taking many samples from a known truth, while in practice we get only one sample and no truth. Now the question is: "Can one find a reliable estimator of the true Bk between the sample and population trees?"
3.2. THE DISTRIBUTION OF BK
Since we do not know the population in practice, we must use techniques like the bootstrap to simulate the true distribution.
As stated above, the bootstrap uses subsampling from the sample to see an approximation of the sampling distribution of the population. In order to see if the distribution of the sample-to-bootstrap Bk was actually close to that of the true-to-sample Bk, 50 samples were drawn from the true distances; some of these samples were chosen, and 50 bootstrap-samples were taken from each of them. Then the sample Bk could be compared with the bootstrapped Bk's to see if the bootstrap reproduced the original situation well.
To compare the distributions, a standard chi-squared goodness-of-fit test was used. The null hypothesis is that the distributions are the same, and a low value of the chi-squared statistic, relative to the degrees of freedom, will indicate that the null hypothesis is acceptable.
Binning was based on the distribution of the 50 true-to-sample Bk's. Those Bk's from the bootstraps were binned with the closest sample Bk. For complete linkage and bootstraps on each of 6 samples from set A1, the results are in Table 3.2.1.
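A minimal sketch of such a comparison in Python (the Bk arrays and the binning below are invented stand-ins; SciPy's chisquare performs the goodness-of-fit test):

    import numpy as np
    from scipy.stats import chisquare

    rng = np.random.default_rng(4)

    # Stand-ins for 50 true-to-sample Bk's and 50 bootstrapped Bk's.
    sample_bk = rng.normal(0.80, 0.05, size=50)
    boot_bk = rng.normal(0.75, 0.05, size=50)

    # Bin edges based on the distribution of the true-to-sample Bk's;
    # bootstrap Bk's outside the range are counted in the closest bin.
    edges = np.quantile(sample_bk, [0.0, 0.25, 0.5, 0.75, 1.0])
    expected = np.histogram(sample_bk, bins=edges)[0]
    observed = np.histogram(np.clip(boot_bk, edges[0], edges[-1]),
                            bins=edges)[0]

    # A low chi-squared value (relative to the degrees of freedom)
    # means the two distributions are indistinguishable.
    stat, pval = chisquare(observed, f_exp=expected)
    print(stat, pval)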
Table 3.2.1: Results of Chi-squared Tests
Set A1, Complete Linkage, Samples 1 through 6

  k   df   chisq (samples 1 through 6)                            95th percentile
  4    2    61.879   16.425    6.901   32.516    5.014    5.209        5.991
  5    4    64.729  241.342  195.578  147.015  306.318   71.061        9.488
  6    1     7.219     .357     .357    7.219    2.228    7.219        3.841
  7    3   106.452  106.390   20.121   93.056  122.238    2.400        7.814
  8    1     8.000   18.000    5.120     .750    4.083   21.333        3.841
  9    1      .750     .000     .720   11.520    6.480   33.333        3.841
The results show that the distribution of the sample Bk is not close to that of the true. The cases when k = 2 or 3 could not be tested because the sample distributions were degenerate. Even with different binning schemes, the results were still the same: the distributions are different. This was also true for all three clustering methods. The difference was that the bootstrapped Bk's were nearly always shifted from the sampled. Usually, it was shifted to the lower side, which is what one would expect. This difference is made clear by the two plots shown below.
The Bk plot in figure 3.2.3 compares the 50 bootstrap samples (based on sample number 33) with sample number 33 itself, while the one in figure 3.2.2 compares the 50 original samples with the population.
Figure 3.2.2: Plot of Bk versus k
Set B1, Average Linkage, 50 Samples
[box plots of the Bk values at each k; annotated at k = 9]
For data sets with at least 18 objects, e.g. set B1, the situation was the same. So the problem was not simply caused by the small number of objects. Even average linkage, the least sensitive method, had bootstrap Bk's that were very different from the sample.
Figure 3.2.3: Plot of Bk versus k
Set B1, Average Linkage, 50 Bootstrap Samples
[box plots of the Bk values at each k]

These results are not particularly good news. It means that the empirical distributions defined by the samples can be very different from the true distribution.
However, all is not lost. Notice that the median of the bootstrap Bk's is equal to the median of the sample Bk's for k = 2, 3, 4, 5, 9, and 10 above, even though in these cases the distribution was always different. Even though the distributions of the Bk's are not the same, some statistics like the median may be very close to the original. The purpose of section 3.3 is to examine how close it really is.
3.3. ESTIMATORS FOR BK
It is proposed here that, to estimate the sample-true Bk rather than its distribution, it may be sufficient to look at natural estimators, the mean, median, and mode, which can be calculated from bootstrapped Bk's. We looked at the 3 estimators calculated from 50 bootstraps of one sample. The values are given in Table 3.3.1.

Table 3.3.1: Estimators of Bk
Set A1, Complete Linkage
[mean, median, and mode of the bootstrapped Bk's, and the true Bk, at each k]

Most of the time, the median equaled the mode. After examining many trials, it appears that the median and mode are either exactly equal to the true Bk, or they are further from it than the mean.
3.3.1. Data Set A1, Complete Linkage

In order to get a better idea of how accurate the estimators are, the bootstrap can be used again to estimate the distribution of these statistics. One sample from the true distribution was taken, and 30 sets of 50 bootstrap samples from the sample distribution were created. For each set, the mean, median, and mode of the 50 boots were calculated, and the true Bk was subtracted.
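Schematically, in Python (reusing the toy truth, sampler, and Bk scorer from the Section 3.1.1 sketch so the fragment stands alone; all names are ours):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import fowlkes_mallows_score

    rng = np.random.default_rng(5)

    P = np.full((10, 10), .70)          # Set A1-style truth
    P[:3, :3], P[3:6, 3:6], P[6:, 6:] = .10, .05, .15
    P[:3, 3:6] = P[3:6, :3] = .40
    np.fill_diagonal(P, 0.0)

    def labels(D, k):
        return fcluster(linkage(squareform(D), method="complete"),
                        t=k, criterion="maxclust")

    def resample(D):
        iu = np.triu_indices_from(D, k=1)
        S = np.zeros_like(D)
        S[iu] = rng.binomial(100, D[iu]) / 100.0
        return S + S.T

    k = 3
    S = resample(P)                      # one sample from the truth
    true_bk = fowlkes_mallows_score(labels(P, k), labels(S, k))

    # 30 sets of 50 bootstraps; each set yields one deviation of the
    # median estimator from the true Bk.
    for _ in range(30):
        boots = [fowlkes_mallows_score(labels(S, k),
                                       labels(resample(S), k))
                 for _ in range(50)]
        print(np.median(boots) - true_bk)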
Figure 3.3.1.1: Deviation of Estimator from the True Bk
Data Set A1, Complete Linkage, Sample Number 10, MEAN
[deviations of the mean plotted against k]
Figure 3.3.1.2: Deviation of Estimator from True Bk
Set A1, Complete Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]

The median appeared better than the mode and mean as an estimator, if only slightly. So we will present only the median in the remainder of the study.
Figure 3.3.1.3: Deviation of Estimator from True Bk
Set A1, Complete Linkage, Sample 10, MODE
[deviations of the mode plotted against k]
3.3.2. Data Set A1, Other Methods

Here we present the results for the same data set using single and average linkage. The median does slightly worse with single linkage than with complete, while it does slightly better with average linkage.
Figure 3.3.2.1: Deviation of Estimator from True Bk
Set A1, Single Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]

All three methods did perfectly for k = 2, 3, 5, and 9. In this context, it does not mean those are stable clusters. It only means that we accurately estimated the true Bk.
Figure 3.3.2.2: Deviation of Estimator from True Bk
Set A1, Average Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]
3.3.3. Data Set B1, All Three Methods

Figure 3.3.3.1: Deviation of Estimator from True Bk
Set B1, Complete Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]
With more objects, the estimation becomes less accurate. However, when the true-sample Bk was high, the bootstrap-sample estimator was usually accurate. Here, the estimate of Bk for k = 9 is high also. Recall that for set A1, the estimators were perfect for k equal to 2 and 3.
Figure 3.3.3.2: Deviation of Estimator from True Bk
Set B1, Single Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]
Figure 3.3.3.3: Deviation of Estimator from True Bk
Set B1, Average Linkage, Sample 10, MEDIAN
[deviations of the median plotted against k]

3.3.4. Data Set A1, Other Samples
Most of the results above are based on one sample, the tenth. In order to make sure that the sample is not atypical, the same procedure was run on other samples, numbers 11 through 13.
Figure 3.3.4.1: Deviation of Estimator from True Bk
Set A1, Average Linkage, Sample 11, MEDIAN
[deviations of the median plotted against k]
These plots show that certain samples will give different results, however. Different samples will give different statistics on the bootstrap estimators of the true Bk. In any reasonable situation where the bootstrap is used, it would be advisable to create and examine many samples from possible models. That way, the reliability of the bootstrap in that particular situation would be known.
Figure 3.3.4.2: Deviation of Estimator from True Bk
Set A1, Average Linkage, Sample 12, MEDIAN
[deviations of the median plotted against k]
Figure 3.3.4.3: Deviation of Estimator from True Bk
Set A1, Average Linkage, Sample 13, MEDIAN
[deviations of the median plotted against k]

Complete linkage worked perfectly on sample number 13.

Figure 3.3.4.4: Deviation of Estimator from True Bk
Set A1, Complete Linkage, Sample 13, MEDIAN
[deviations of the median plotted against k]
3.3.5. Data Set B2, Average Linkage

This last plot shows how the technique works on a set with less distinct clusters. The results are not significantly different from the other experiments.
Figure 3.3.5.1: Deviation of Estimator from True Bk
Set B2, Average Linkage, Sample 13, MEDIAN
[deviations of the median plotted against k]
4. CONCLUSIONS AND SUMMARY

In this study we propose to use the bootstrap and the Bk statistic in combination to perform two tasks:

1) We can find the more stable and important clusters by looking at those clusterings which tend to produce peaks in the Bk plot. This information about the variability of the data set will be very useful when interpreting the clustering tree.

2) We can estimate the sample-to-true Bk by examining the bootstrap-to-sample Bk's. The best results in this study came from using the median of bootstrapped Bk's and either average or complete linkage.

We have proposed these procedures here and documented the results of a simulation study. The results indicate the usefulness of the procedure. The model used in the experiments was appropriate for the common example described above. However, it is not known whether the variation in other situations is comparable. It could be much more, with worse results, or much less, with better results. The methods presented here show promise but, because of the limited scope of the study, have yet to be tested in real applications.
BIBLIOGRAPHY
Baker, F. B. (1974), "Stability of Two Hierarchical Grouping Techniques Case I: Sensitivity to Data Errors", JASA 69, pp. 440-465.

Baker, F. B., and Hubert, L. J. (1975), "Hierarchical Clustering and the Concept of Power", JASA 70, pp. 31-38.

Efron, B. (1979a), "Computers and the Theory of Statistics: Thinking the Unthinkable", SIAM Review 21, pp. 460-480.

Efron, B. (1979b), "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics 7, pp. 1-26.

Efron, B., and G. Gong (1983), "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation", American Statistician 37 #1, pp. 36-48.

Fowlkes, E. B., and C. L. Mallows (1983), "A Method for Comparing Two Hierarchical Clusterings", JASA (to appear in September).

Hartigan, J. A. (1969), "Using Subsample Values as Typical Values", JASA 64, pp. 1303-1317.

Hartigan, J. A. (1971), "Error Analysis by Replaced Samples", J. Royal Stat. Soc. Series B 33, pp. 98-110.

Hartigan, J. A. (1975), Clustering Algorithms. Wiley.

Hartigan, J. A. (1981), "Consistency of Single Linkage for High-Density Clusters", JASA 76, pp. 388-394.

Hubert, L. (1974), "Approximate Evaluation Techniques for the Single Link and Complete Link Hierarchical Clustering Procedures", JASA 69, pp. 698-704.

Johnson, S. C. (1967), "Hierarchical Clustering Schemes", Psychometrika 32, pp. 241-254.

Ling, R. F. (1973), "A Probability Theory of Cluster Analysis", JASA 68, pp. 159-164.

Rand, W. M. (1971), "Objective Criteria for Evaluation of Clustering Methods", JASA 66, pp. 846-850.

Wong, M. A. (1982), "A Hybrid Clustering Algorithm for Identifying High-Density Clusters", JASA 77, pp. 841-847.

Wong, M. A., and T. Lane (1983), "A Kth Nearest Neighbor Clustering Procedure", J. Royal Stat. Soc. Series B (to appear).