Chapter 10

Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 10: Unsupervised Learning & Clustering
• Introduction
• Mixture Densities and Identifiability
• ML Estimates
• Application to Normal Mixtures
• K-means algorithm
• Unsupervised Bayesian Learning
• Data description and clustering
• Criterion functions for clustering
• Hierarchical clustering
• The problem of the number of clusters and cluster validation
• On-line clustering
• Graph-theoretic methods
• PCA and ICA
• Low-dimensional representations and multidimensional scaling (self-organizing maps)
• Clustering and dimensionality reduction
Introduction
• Previously, all our training samples were labeled: these samples were said to be "supervised".
• Why are we interested in "unsupervised" procedures, which use unlabeled samples?
  1) Collecting and labeling a large set of sample patterns can be costly.
  2) We can train with large amounts of (less expensive) unlabeled data, and then use supervision to label the groupings found. This is appropriate for large "data mining" applications where the contents of a large database are not known beforehand.
  3) Patterns may change slowly with time; improved performance can be achieved if classifiers running in an unsupervised mode are used.
  4) We can use unsupervised methods to identify features that will then be useful for categorization → "smart" feature extraction.
  5) We gain some insight into the nature (or structure) of the data → which set of classification labels?
Mixture Densities & Identifiability
• Assume:
  • the functional forms for the underlying probability densities are known
  • the value of an unknown parameter vector must be learned
  • i.e., like Chapter 3 but without class labels
• Specific assumptions:
  • The samples come from a known number c of classes
  • The prior probabilities P(ωj) for each class are known (j = 1, …, c)
  • The forms of the class-conditional densities p(x | ωj, θj) (j = 1, …, c) are known
  • The values of the c parameter vectors θ1, θ2, …, θc are unknown
  • The category labels are unknown
• The PDF for the samples is:

  p(x | θ) = Σ_{j=1}^{c} p(x | ωj, θj) P(ωj)

  where the p(x | ωj, θj) are the component densities, the P(ωj) are the mixing parameters, and θ = (θ1, θ2, …, θc)^t.

• This density function is called a mixture density.
• Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
• Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.
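To make the definition concrete, here is a minimal numpy sketch (not from the slides) of a one-dimensional normal mixture density p(x | θ) = Σ_j P(ωj) p(x | ωj, θj); the means, variances, and mixing weights are made-up illustrative values.

import numpy as np

def normal_pdf(x, mu, var):
    """Univariate normal density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, means, variances, priors):
    """Mixture density p(x | theta) = sum_j P(w_j) p(x | w_j, theta_j)."""
    return sum(P * normal_pdf(x, mu, var)
               for P, mu, var in zip(priors, means, variances))

# Hypothetical two-component mixture (illustrative values only).
means, variances, priors = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]
x = np.linspace(-6, 6, 7)
print(mixture_pdf(x, means, variances, priors))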
• Can θ be recovered from the mixture?
• Consider the case where:
  • we have an unlimited number of samples
  • we use a nonparametric technique to find p(x | θ) for every x
  • if several θ result in the same p(x | θ) → we cannot find a unique solution
  This is the issue of solution identifiability.
• Definition: Identifiability
  A density p(x | θ) is said to be identifiable if θ ≠ θ' implies that there exists an x such that p(x | θ) ≠ p(x | θ').
As a simple example, consider the case where x is binary and P(x | θ) is the mixture

  P(x | θ) = (1/2) θ1^x (1 − θ1)^(1−x) + (1/2) θ2^x (1 − θ2)^(1−x)

           = { (1/2)(θ1 + θ2)       if x = 1
             { 1 − (1/2)(θ1 + θ2)   if x = 0

Assume that P(x = 1 | θ) = 0.6, and hence P(x = 0 | θ) = 0.4.
We know P(x | θ) but not θ.
We can say that θ1 + θ2 = 1.2, but not what θ1 and θ2 are individually.
Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible.
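To see the unidentifiability numerically, the short sketch below (illustrative, not from the slides) evaluates this binary mixture for two different parameter vectors whose components sum to 1.2; both give exactly the same distribution.

def binary_mixture(x, theta1, theta2):
    """P(x | theta) = 1/2 th1^x (1-th1)^(1-x) + 1/2 th2^x (1-th2)^(1-x), x in {0, 1}."""
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

# Two different parameter vectors with theta1 + theta2 = 1.2.
for theta in [(0.9, 0.3), (0.7, 0.5)]:
    print(theta, binary_mixture(1, *theta), binary_mixture(0, *theta))
# Both print P(x=1) = 0.6 and P(x=0) = 0.4: theta cannot be recovered from P(x | theta).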
• In discrete distributions, too many components can be problematic:
  • too many unknowns
  • perhaps more unknowns than independent equations
  → identifiability can become a serious problem!
• While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

  p(x | θ) = [P(ω1)/√(2π)] exp[−½ (x − θ1)²] + [P(ω2)/√(2π)] exp[−½ (x − θ2)²]

  cannot be uniquely identified if P(ω1) = P(ω2) (we cannot recover a unique θ even from an infinite amount of data!)
• θ = (θ1, θ2) and θ = (θ2, θ1) are two possible vectors that can be interchanged without affecting p(x | θ).
• Identifiability can be a problem; we always assume that the densities we are dealing with are identifiable.
ML Estimates
Suppose that we have a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture density

  p(x | θ) = Σ_{j=1}^{c} p(x | ωj, θj) P(ωj)

where θ is fixed but unknown. The MLE is

  θ̂ = arg max_θ p(D | θ),  with  p(D | θ) = Π_{k=1}^{n} p(xk | θ)
Then the log-likelihood is

  l = Σ_{k=1}^{n} ln p(xk | θ)

and the gradient of the log-likelihood with respect to θi is

  ∇_{θi} l = Σ_{k=1}^{n} P(ωi | xk, θ) ∇_{θi} ln p(xk | ωi, θi)
Since the gradient must vanish at the value of θi that maximizes l, the ML estimate θ̂i must satisfy the conditions

  Σ_{k=1}^{n} P(ωi | xk, θ̂) ∇_{θi} ln p(xk | ωi, θ̂i) = 0   (i = 1, …, c)
The ML estimates for P(ωi) and θ̂i must satisfy

  P̂(ωi) = (1/n) Σ_{k=1}^{n} P̂(ωi | xk, θ̂)

  Σ_{k=1}^{n} P̂(ωi | xk, θ̂) ∇_{θi} ln p(xk | ωi, θ̂i) = 0

where

  P̂(ωi | xk, θ̂) = p(xk | ωi, θ̂i) P̂(ωi) / Σ_{j=1}^{c} p(xk | ωj, θ̂j) P̂(ωj)
Applications to Normal Mixtures
p(x | ωi, θi) ~ N(μi, Σi)

  Case | μi | Σi | P(ωi) | c
  -----+----+----+-------+---
    1  |  ? |  × |   ×   | ×
    2  |  ? |  ? |   ?   | ×
    3  |  ? |  ? |   ?   | ?

  (? = unknown, × = known)

Case 1 = simplest case
Case 2 = more realistic case
Case 1: Multivariate normal, unknown mean vectors

Here θi = μi for i = 1, …, c. The ML estimate of μ = (μ1, …, μc) must satisfy

  μ̂i = Σ_{k=1}^{n} P(ωi | xk, μ̂) xk / Σ_{k=1}^{n} P(ωi | xk, μ̂)        (1)

where P(ωi | xk, μ̂) is the fraction of those samples having value xk that come from the ith class, and μ̂i is the (weighted) average of the samples coming from the i-th class.
• Unfortunately, equation (1) does not give μ̂i explicitly.
• However, if we have some way of obtaining good initial estimates μ̂i(0) for the unknown means, equation (1) can be seen as an iterative process for improving the estimates (a small sketch of this iteration follows below):

  μ̂i(j+1) = Σ_{k=1}^{n} P(ωi | xk, μ̂(j)) xk / Σ_{k=1}^{n} P(ωi | xk, μ̂(j))
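A minimal numpy sketch of this fixed-point iteration for Case 1 (means unknown; equal priors and unit variances assumed known here for simplicity); the synthetic data and starting means are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from a two-component mixture (true means -2 and +2).
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])

priors = np.array([0.5, 0.5])          # assumed known (Case 1)
mu = np.array([-1.0, 1.0])             # initial estimates mu_i(0)

for _ in range(50):
    # P(w_i | x_k, mu): responsibilities under unit-variance normals.
    lik = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) * priors
    resp = lik / lik.sum(axis=1, keepdims=True)
    # Equation (1): responsibility-weighted sample means.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # should approach the true means -2 and +2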
• This is a gradient-ascent procedure for maximizing the log-likelihood function.
• Example: consider the simple two-component one-dimensional normal mixture

  p(x | μ1, μ2) = [1/(3√(2π))] exp[−½ (x − μ1)²] + [2/(3√(2π))] exp[−½ (x − μ2)²]

  (2 clusters!)

  Let us set μ1 = −2, μ2 = 2 and draw 25 samples sequentially from this mixture. The log-likelihood function is

  l(μ1, μ2) = Σ_{k=1}^{n} ln p(xk | μ1, μ2)
The maximum value of l occurs at

  μ̂1 = −2.130 and μ̂2 = 1.668

(which are not far from the true values μ1 = −2 and μ2 = +2).
There is another peak, at μ̂1 = 2.085 and μ̂2 = −1.257, which has almost the same height, as can be seen from the corresponding figure in the text.
This mixture of normal densities is identifiable.
When the mixture density is not identifiable, the ML solution is not unique.
Case 2: All parameters unknown

• No constraints are placed on the covariance matrix.
• Let p(x | μ, σ²) be the two-component normal mixture:

  p(x | μ, σ²) = [1/(2√(2π) σ)] exp[−½ ((x − μ)/σ)²] + [1/(2√(2π))] exp[−½ x²]
Suppose μ = x1; therefore

  p(x1 | μ, σ²) = 1/(2√(2π) σ) + [1/(2√(2π))] exp[−½ x1²]

For the rest of the samples,

  p(xk | μ, σ²) ≥ [1/(2√(2π))] exp[−½ xk²]

Finally,

  p(x1, …, xn | μ, σ²) ≥ { 1/σ + exp[−½ x1²] } · [1/(2√(2π))^n] exp[−½ Σ_{k=2}^{n} xk²]

and the term in braces → ∞ as σ → 0. The likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular.
• Assumption: the MLE is well-behaved at local maxima.
• Consider the largest of the finite local maxima of the likelihood function and use it as the ML estimate.
• We obtain the following local-maximum-likelihood estimates (an iterative scheme):

  P̂(ωi) = (1/n) Σ_{k=1}^{n} P̂(ωi | xk, θ̂)

  μ̂i = Σ_{k=1}^{n} P̂(ωi | xk, θ̂) xk / Σ_{k=1}^{n} P̂(ωi | xk, θ̂)

  Σ̂i = Σ_{k=1}^{n} P̂(ωi | xk, θ̂) (xk − μ̂i)(xk − μ̂i)^t / Σ_{k=1}^{n} P̂(ωi | xk, θ̂)
where

  P̂(ωi | xk, θ̂) = |Σ̂i|^{−1/2} exp[−½ (xk − μ̂i)^t Σ̂i^{−1} (xk − μ̂i)] P̂(ωi)
                   / Σ_{j=1}^{c} |Σ̂j|^{−1/2} exp[−½ (xk − μ̂j)^t Σ̂j^{−1} (xk − μ̂j)] P̂(ωj)
K-Means Clustering
• Goal: find the c mean vectors μ1, μ2, …, μc.
• Replace the squared Mahalanobis distance (xk − μ̂i)^t Σ̂i^{−1} (xk − μ̂i) by the squared Euclidean distance ‖xk − μ̂i‖².
• Find the mean μ̂m nearest to xk and approximate P̂(ωi | xk, θ̂) as

  P̂(ωi | xk, θ̂) ≈ { 1 if i = m
                    { 0 otherwise

• Use the iterative scheme to find μ̂1, μ̂2, …, μ̂c.
• If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

  begin initialize n, c, μ1, μ2, …, μc (randomly selected)
      do  classify the n samples according to the nearest μi
          recompute μi
      until no change in μi
      return μ1, μ2, …, μc
  end
Complexity is O(ndcT) where d is the # features, T the # iterations
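A minimal runnable version of the pseudocode above (a sketch, not the book's code); initialization here simply samples c points from the data, and an empty cluster keeps its old mean.

import numpy as np

def k_means(x, c, rng=None, max_iter=100):
    """Basic k-means: x is an (n, d) array, c the desired number of clusters."""
    rng = np.random.default_rng(rng)
    mu = x[rng.choice(len(x), size=c, replace=False)]       # random initial means
    for _ in range(max_iter):
        # Classify the n samples according to the nearest mu_i.
        labels = np.argmin(((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Recompute the means (keep the old mean if a cluster went empty).
        new_mu = np.array([x[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):          # until no change in mu_i
            break
        mu = new_mu
    return mu, labels

# Example usage on synthetic 2-D data.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
means, labels = k_means(x, c=2, rng=0)
print(means)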
(Figure: k-means clustering of the data from the previous example; figure omitted.)
Unsupervised Bayesian Learning
• Besides the ML estimate, the Bayesian estimation technique can also be used in the unsupervised case (see the ML and Bayesian methods in Chapter 3 of the textbook).
• Assumptions:
  • the number of classes is known
  • the class priors are known
  • the forms of the class-conditional probability densities p(x | ωj, θj) are known
  • however, the full parameter vector θ is unknown
• Part of our knowledge about θ is contained in the prior p(θ); the rest of our knowledge of θ is in the training samples.
• We compute the posterior distribution using the training samples.
• We can compute p(θ | D) as seen previously:

  P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / Σ_{j=1}^{c} p(x | ωj, D) P(ωj | D)
               = p(x | ωi, D) P(ωi) / Σ_{j=1}^{c} p(x | ωj, D) P(ωj)

  since P(ωi | D) = P(ωi): the selection of ωi is independent of the previous samples. Passing through the usual formulation, introducing the unknown parameter vector θ,

  p(x | ωi, D) = ∫ p(x, θ | ωi, D) dθ = ∫ p(x | θ, ωi, D) p(θ | ωi, D) dθ = ∫ p(x | ωi, θi) p(θ | D) dθ

• Hence, the best estimate of p(x | ωi) is obtained by averaging p(x | ωi, θi) over θi.
• The goodness of this estimate depends on p(θ | D); this is the main issue of the problem.
From Bayes we get

  p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

where independence of the samples yields the likelihood

  p(D | θ) = Π_{k=1}^{n} p(xk | θ)

or alternately (denoting by D^n the set of n samples) the recursive form

  p(θ | D^n) = p(xn | θ) p(θ | D^{n−1}) / ∫ p(xn | θ) p(θ | D^{n−1}) dθ

• If p(θ) is almost uniform in the region where p(D | θ) peaks, then p(θ | D) peaks in the same place.
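As an illustration of the recursive form (not from the slides), the sketch below maintains p(θ | D^n) on a discrete grid of θ values for a one-dimensional mixture whose only unknown is the mean of the first component; each new sample multiplies the current posterior by the mixture likelihood and renormalizes.

import numpy as np

rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(-2.0, 1.0, 15), rng.normal(2.0, 1.0, 15)])
rng.shuffle(samples)

# Unknown parameter: theta = mean of component 1 (component 2 fixed at mean +2).
theta_grid = np.linspace(-6, 6, 601)
posterior = np.ones_like(theta_grid)          # p(theta): uniform prior on the grid
posterior /= posterior.sum()

def mixture_lik(x, theta):
    """p(x | theta) = 1/2 N(theta, 1) + 1/2 N(+2, 1), evaluated for each grid theta."""
    n1 = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
    n2 = np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)
    return 0.5 * n1 + 0.5 * n2

for x in samples:
    posterior *= mixture_lik(x, theta_grid)   # p(x_n | theta) p(theta | D^{n-1})
    posterior /= posterior.sum()              # normalization (the integral in the denominator)

print(theta_grid[np.argmax(posterior)])       # posterior mode, should be near -2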
• If the only significant peak occurs at θ = θ̂ and the peak is very sharp, then

  p(x | ωi, D) ≈ p(x | ωi, θ̂i)

  and

  P(ωi | x, D) ≈ p(x | ωi, θ̂i) P(ωi) / Σ_{j=1}^{c} p(x | ωj, θ̂j) P(ωj)

• Therefore, the ML estimate θ̂ is justified.
• Both approaches coincide if large amounts of data are available.
• In small-sample-size problems they may or may not agree, depending on the form of the distributions.
• The ML method is typically easier to implement than the Bayesian one.
• Formal Bayesian solution: unsupervised learning of the parameters of a mixture density is similar to the supervised learning of the parameters of a component density.
• Significant differences: identifiability and computational complexity.
• The issue of identifiability:
  • With supervised learning, the lack of identifiability means that instead of a unique parameter vector we obtain an equivalence class of vectors; this presents no theoretical difficulty, since all of them yield the same component density.
  • With unsupervised learning, the lack of identifiability means that the mixture cannot be decomposed into its true components: p(x | D^n) may still converge to p(x), but p(x | ωi, D^n) will not in general converge to p(x | ωi), hence there is a theoretical barrier.
• The issue of computational complexity:
  • With supervised learning, the existence of sufficient statistics allows solutions that are computationally feasible.
• With unsupervised learning, the samples come from a mixture density, and there is little hope of finding simple exact solutions for p(D | θ): n samples result in 2^n terms (corresponding to the ways in which the n samples could have been drawn from the 2 classes).
• Another way of comparing unsupervised and supervised learning is to consider the usual recursive equation with the mixture density written out explicitly:

  p(θ | D^n) = p(xn | θ) p(θ | D^{n−1}) / ∫ p(xn | θ) p(θ | D^{n−1}) dθ

             = [ Σ_{j=1}^{c} p(xn | ωj, θj) P(ωj) ] p(θ | D^{n−1})
               / ∫ [ Σ_{j=1}^{c} p(xn | ωj, θj) P(ωj) ] p(θ | D^{n−1}) dθ
Recall from the previous slide that

  p(θ | D^n) = [ Σ_{j=1}^{c} p(xn | ωj, θj) P(ωj) ] p(θ | D^{n−1})
               / ∫ [ Σ_{j=1}^{c} p(xn | ωj, θj) P(ωj) ] p(θ | D^{n−1}) dθ

• If we consider the case in which P(ω1) = 1 and all the other prior probabilities are zero, corresponding to the supervised case in which all samples come from class ω1, then we get

  p(θ | D^n) = p(xn | ω1, θ1) p(θ | D^{n−1}) / ∫ p(xn | ω1, θ1) p(θ | D^{n−1}) dθ

• Comparing the two equations, we see that observing an additional sample changes the estimate of θ in both cases.
• Ignoring the denominator, which is independent of θ, the only significant difference is that
  • in the supervised case we multiply the "prior" density for θ by the component density p(xn | ω1, θ1), whereas
  • in the unsupervised case we multiply it by the whole mixture Σ_{j=1}^{c} p(xn | ωj, θj) P(ωj).
• Assuming that the sample did come from class ω1, the effect of not knowing this category is to diminish the influence of xn in changing θ for category ω1.
Data Clustering
• The structure of multidimensional patterns is important for clustering.
• If we know that the data come from a specific distribution, the data can be represented by a compact set of parameters (sufficient statistics).
• If the samples are assumed to come from a specific distribution but actually do not, these statistics are a misleading representation of the data.
• Approximation of density functions:
  • Mixtures of normal distributions can approximate arbitrary PDFs.
  • In these cases, one can use parametric methods to estimate the parameters of the mixture density.
  • No free lunch → the dimensionality issue remains!
• Caveat: if little prior knowledge can be assumed, the assumption of a parametric form is meaningless.
  • Issue: imposing structure vs. finding structure.
  • Use a nonparametric method to estimate the unknown mixture density.
• Alternatively, for subclass discovery:
  • use a clustering procedure
  • identify data points having strong internal similarities
Similarity measures
• What do we mean by similarity? Two issues:
  • How do we measure the similarity between samples?
  • How do we evaluate a partitioning of a set into clusters?
• The obvious measure of similarity (or dissimilarity) is the distance between samples.
• Samples in the same cluster should be closer to each other than to samples in different clusters.
• Euclidean distance is a possible metric:
  • e.g., assume two samples belong to the same cluster if the distance between them is less than a threshold d0.
• Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships.
• Achieving invariance:
  • normalize the data, e.g., so that they all have zero mean and unit variance, or
  • use principal components for invariance to rotation.
• A broad class of metrics is the Minkowski metric

  d(x, x') = ( Σ_{k=1}^{d} |xk − x'k|^q )^{1/q}

  where q ≥ 1 is a selectable parameter:
  q = 1 → Manhattan (city block) metric
  q = 2 → Euclidean metric
• One can also use a nonmetric similarity function s(x, x') to compare two vectors.
• s(x, x') is typically a symmetric function whose value is large when x and x' are similar.
• For example, the normalized inner product

  s(x, x') = x^t x' / (‖x‖ ‖x'‖)

• In the case of binary-valued features we have, e.g.,

  s(x, x') = x^t x' / d

  s(x, x') = x^t x' / (x^t x + x'^t x' − x^t x')        (Tanimoto distance)
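A small sketch of these measures (illustrative only): the Minkowski metric for a chosen q, the normalized inner product, and the Tanimoto measure for binary-valued vectors.

import numpy as np

def minkowski(x, y, q=2):
    """Minkowski metric; q = 1 gives the city block metric, q = 2 the Euclidean metric."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def cosine_similarity(x, y):
    """Normalized inner product s(x, x') = x^t x' / (||x|| ||x'||)."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto(x, y):
    """Tanimoto measure for binary-valued feature vectors."""
    shared = x @ y
    return shared / (x @ x + y @ y - shared)

a, b = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), cosine_similarity(a, b))
print(tanimoto(np.array([1, 1, 0, 1]), np.array([1, 0, 0, 1])))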
Clustering as optimization
• The second issue: how do we evaluate a partitioning of a set into clusters?
• Clustering can be posed as the optimization of a criterion function:
  • the sum-of-squared-error criterion and its variants
  • scatter criteria
• The sum-of-squared-error criterion
  • Let ni be the number of samples in Di and mi the mean of those samples:

    mi = (1/ni) Σ_{x ∈ Di} x
• The sum of squared error is defined as

  Je = Σ_{i=1}^{c} Σ_{x ∈ Di} ‖x − mi‖²

• This criterion defines clusters by their mean vectors mi: it measures the sum of the squared lengths of the errors x − mi.
• The partition that minimizes Je is the minimum-variance partition.
• Results:
  • good when the clusters form well-separated, compact clouds
  • bad when there are large differences in the number of samples in different clusters
• Scatter criteria
  • These use the scatter matrices from multiple discriminant analysis, i.e., the within-cluster scatter matrix SW and the between-cluster scatter matrix SB, with

    ST = SB + SW

  • Note:
    • ST (the total scatter matrix) does not depend on the partitioning.
    • In contrast, SB and SW do depend on the partitioning.
  • Two approaches: minimize the within-cluster scatter, or maximize the between-cluster scatter.
• The trace (the sum of the diagonal elements) is the simplest scalar measure of a scatter matrix:

  tr[SW] = Σ_{i=1}^{c} tr[Si] = Σ_{i=1}^{c} Σ_{x ∈ Di} ‖x − mi‖² = Je

• It is proportional to the sum of the variances in the coordinate directions.
• This is exactly the sum-of-squared-error criterion Je.
• As tr[ST] = tr[SW] + tr[SB] and tr[ST] is independent of the partitioning, no new results can be derived by minimizing tr[SB].
• However, seeking to minimize the within-cluster criterion Je = tr[SW] is equivalent to maximizing the between-cluster criterion

  tr[SB] = Σ_{i=1}^{c} ni ‖mi − m‖²

  where m is the total mean vector:

  m = (1/n) Σ_{x ∈ D} x = (1/n) Σ_{i=1}^{c} ni mi
Iterative optimization
• Clustering → a discrete optimization problem.
• Finite data set → finite number of possible partitions.
• What is the cost of exhaustive search? Approximately c^n / c! partitions for c clusters: not a good idea.
• Typically, iterative optimization is used:
  • start from a reasonable initial partition
  • redistribute samples so as to minimize the criterion function
  → this guarantees local, not global, optimization.
• Consider an iterative procedure to minimize the sum-of-squared-error criterion Je:

  Je = Σ_{i=1}^{c} Ji,   where   Ji = Σ_{x ∈ Di} ‖x − mi‖²

  is the effective error per cluster.
• Moving a sample x̂ from cluster Di to Dj changes the errors in the two clusters by

  Jj* = Jj + [nj / (nj + 1)] ‖x̂ − mj‖²

  Ji* = Ji − [ni / (ni − 1)] ‖x̂ − mi‖²
• Hence, the transfer is advantageous if the decrease in Ji is larger than the increase in Jj:

  [ni / (ni − 1)] ‖x̂ − mi‖² > [nj / (nj + 1)] ‖x̂ − mj‖²
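A sketch of this transfer test (illustrative only): move x̂ from Di to Dj only when the decrease in Ji exceeds the increase in Jj.

import numpy as np

def transfer_is_advantageous(x_hat, m_i, n_i, m_j, n_j):
    """True if moving x_hat from cluster i to cluster j decreases J_e."""
    if n_i <= 1:                     # a singleton cluster has nothing to give up
        return False
    decrease_i = n_i / (n_i - 1) * np.sum((x_hat - m_i) ** 2)
    increase_j = n_j / (n_j + 1) * np.sum((x_hat - m_j) ** 2)
    return decrease_i > increase_j

# Example: a point lying much closer to cluster j's mean than to its own cluster's mean.
print(transfer_is_advantageous(np.array([1.8, 2.1]),
                               m_i=np.array([-2.0, -2.0]), n_i=50,
                               m_j=np.array([2.0, 2.0]),  n_j=50))   # True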
• Algorithm 3 is a sequential version of the k-means algorithm:
  • Algorithm 3 updates the means each time a single sample is reclassified.
  • k-means waits until all n samples have been reclassified before updating.
• Algorithm 3 can get trapped in local minima:
  • the result depends on the order of presentation of the samples
  • it is basically a myopic approach, but it is online!
• The starting point is always a problem. Approaches:
  1. Random centers for the clusters
  2. Repetition with different random initializations
  3. Use as the c-cluster starting point the solution of the (c−1)-cluster problem plus the sample farthest from the nearest cluster center
Hierarchical Clustering
• Many times clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters:
  • the first is a partition into n clusters, each containing exactly one sample
  • the second is a partition into n−1 clusters, the third into n−2, and so on, until the n-th, in which there is only one cluster containing all of the samples
  • at level k in the sequence, c = n−k+1.
• Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k, they remain grouped at all higher levels.
• Hierarchical clustering has a natural tree representation called a dendrogram.
• Are the groupings natural or forced? Check the similarity values at which merges occur.
• If the similarity values are evenly distributed, there is no justification for any particular grouping.
• Another representation is based on sets, e.g., Venn diagrams.
• Hierarchical clustering can be divided into agglomerative and divisive procedures:
  • Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters.
  • Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
Agglomerative hierarchical clustering
• The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than the mean or a representative vector for each cluster.
• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level.
• To find the nearest clusters, one can use

  dmin(Di, Dj)  = min_{x ∈ Di, x' ∈ Dj} ‖x − x'‖
  dmax(Di, Dj)  = max_{x ∈ Di, x' ∈ Dj} ‖x − x'‖
  davg(Di, Dj)  = (1/(ni nj)) Σ_{x ∈ Di} Σ_{x' ∈ Dj} ‖x − x'‖
  dmean(Di, Dj) = ‖mi − mj‖

  which behave quite similarly if the clusters are hyperspherical and well separated.
• The computational complexity is O(c n² d²), with n ≫ c.
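A naive (unoptimized) sketch of agglomerative clustering with a selectable cluster distance; dmin is shown, and dmax or davg could be swapped in. Illustrative only.

import numpy as np

def d_min(A, B):
    """Single-linkage distance between two clusters (lists of points)."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agglomerate(points, c, cluster_dist=d_min):
    """Merge the two nearest clusters until only c clusters remain."""
    clusters = [[p] for p in points]              # start from n singleton clusters
    while len(clusters) > c:
        # Find the pair of clusters with the smallest distance.
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters

rng = np.random.default_rng(5)
pts = list(np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(2, 0.5, (10, 2))]))
for cl in agglomerate(pts, c=2):
    print(len(cl))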
Nearest-neighbor algorithm (single linkage)
• dmin is used as the cluster distance.
• Viewed in graph terms, at each step an edge is added between the nearest nonconnected components.
• It is equivalent to Prim's minimum spanning tree algorithm.
• It terminates when the distance between the nearest clusters exceeds an arbitrary threshold.
• The use of dmin as the distance measure in agglomerative clustering generates a minimal spanning tree.
• Chaining effect: a defect of this distance measure (long, straggly clusters can form).
The farthest-neighbor algorithm (complete linkage)
• dmax is used as the cluster distance.
• This method discourages the growth of elongated clusters.
• In graph-theoretic terms:
  • every cluster is a complete subgraph
  • the distance between two clusters is determined by the most distant pair of nodes in the two clusters
  • the procedure terminates when the distance between the nearest clusters exceeds an arbitrary threshold
• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters.
• All the procedures involving minima or maxima are sensitive to outliers; the use of dmean or davg is a natural compromise.
The problem of the number of clusters
• How many clusters should there be?
• For clustering by extremizing a criterion function:
  • repeat the clustering with c = 1, c = 2, c = 3, etc.
  • look for large changes in the criterion function
• Alternatively:
  • set a threshold for the creation of a new cluster
  • useful for on-line cases
  • but sensitive to the order of presentation of the data
• These approaches are similar to model selection procedures.
Graph-theoretic methods
• Caveat: there is no uniform way of posing clustering as a graph-theoretic problem.
• One approach: generalize from a threshold distance to arbitrary similarity measures.
• If s0 is a threshold value, we can say that xi is similar to xj if s(xi, xj) > s0.
• We can then define a similarity matrix S = [sij] with

  sij = { 1 if s(xi, xj) > s0
        { 0 otherwise
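An illustrative sketch that builds such a threshold matrix and extracts the connected components of the induced graph (the single-linkage clusters described next). Note the assumption here: Euclidean distance is used as the dissimilarity, so the threshold test is "distance < s0" rather than "similarity > s0".

import numpy as np

def similarity_clusters(x, s0):
    """Threshold graph with s_ij = 1 iff ||x_i - x_j|| < s0; return the component label
    of each point (clusters = connected components of the graph)."""
    n = len(x)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = (dist < s0).astype(int)                     # similarity matrix S = [s_ij]
    labels = [-1] * n
    current = 0
    for start in range(n):                          # depth-first search per component
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = current
            stack.extend(j for j in range(n) if S[i, j] and labels[j] == -1)
        current += 1
    return labels

rng = np.random.default_rng(6)
x = np.vstack([rng.normal(-2, 0.3, (8, 2)), rng.normal(2, 0.3, (8, 2))])
print(similarity_clusters(x, s0=1.0))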
• This matrix induces a similarity graph, dual to S, in which the nodes correspond to points and an edge joins nodes i and j iff sij = 1.
• Single-linkage algorithm: two samples x and x' are in the same cluster if there exists a chain x, x1, x2, …, xk, x' such that x is similar to x1, x1 to x2, and so on → the clusters are the connected components of the graph.
• Complete-link algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
• The nearest-neighbor algorithm can be viewed as a method for finding the minimum spanning tree, and vice versa:
  • removal of the longest edge produces a 2-cluster grouping, removal of the next longest edge produces a 3-cluster grouping, and so on.
• This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs:
  • e.g., in selecting an edge to remove, compare its length with the lengths of the other edges incident on its nodes.
• One useful statistic that can be estimated from the minimal spanning tree is the edge length distribution.
• Consider, for instance, the case of two dense clusters immersed in a sparse set of points (see the corresponding figure in the text).