Theoretical Foundations of
Clustering
Margareta Ackerman
The Theory-Practice Gap
Clustering is one of the most widely used tools for
exploratory data analysis.
Identifying target markets
Constructing phylogenetic trees
Facility allocation for city planning
Personalization
...
The Theory-Practice Gap
“While the interest in and application of
cluster analysis has been rising rapidly,
the abstract nature of the tool is still
poorly understood”
- Wright, 1973.
“There has been relatively little work aimed at
reasoning about clustering independently of
any particular algorithm, objective function,
or generative data model”
- Kleinberg, 2002.
Inherent obstacles:
Clustering is ill-defined
Clustering aims to organize data into groups of
similar items, but beyond that there is very little
consensus on the definition of clustering.
Clustering algorithms:
A few classical examples
How can we partition data into k groups?
• Use Kruskal’s algorithm for MST (single-linkage)
• Find the minimum cut (motivates spectral
clustering methods)
• Find k “centers” that minimize the average
distance to a center (k-median, k-means, ...)
• Many more...
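To make the MST connection concrete, here is a minimal sketch (not from the talk) of single-linkage as Kruskal’s algorithm stopped once k components remain; the function names and the dense distance-matrix input are illustrative assumptions.

# Minimal sketch: single-linkage as Kruskal's MST algorithm stopped at k components.
# Assumes a symmetric distance matrix over points 0..n-1; names are illustrative.
def single_linkage_k(dist, k):
    """Partition points into k clusters by repeatedly joining the closest components."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # All pairs sorted by distance: Kruskal's edge order.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:           # this edge merges two current clusters
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Example: two well-separated pairs of points, k = 2.
dist = [[0, 1, 9, 10],
        [1, 0, 8, 9],
        [9, 8, 0, 1],
        [10, 9, 1, 0]]
print(single_linkage_k(dist, 2))  # [[0, 1], [2, 3]]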
Inherent obstacles:
Clustering is inherently ambiguous
• There are many clustering algorithms with
different (often implicit) objective functions
• Different algorithms have radically different
input-output behavior
• There may be multiple reasonable clusterings
• There is usually no ground truth
Different input-output behavior of
clustering algorithms
Progress despite these obstacles: Overview
• Axioms of clustering quality measures (Ackerman & Ben-David, 08)
• Study and compare notions of clusterability (Ackerman and Ben-David, 09)
• Characterizing linkage-based algorithms (Ackerman, Ben-David, and Loker, 2010)
• Framework for clustering algorithm selection (Ackerman, Ben-David, and Loker, 2010)
• Characterizing hierarchical linkage-based algorithms (Ackerman & Ben-David, 2011)
• Properties of phylogenetic algorithms (Ackerman, Brown, and Loker, 2012)
• Properties in the weighted clustering setting (Ackerman, Ben-David, Branzei, and Loker, 2012)
• Clustering oligarchies (Ackerman, Ben-David, Loker, and Sabato, 2013)
• Perturbation robust clustering (Ackerman & Schulman, 2013)
• Online clustering (Ackerman & Dasgupta, 2014)
Outline
• Axiomatic treatment of clustering
• Clustering algorithm selection
• Characterizing Linkage-Based clustering
Formal setup
For a finite domain set X, a distance function d
assigns a distance to every pair of domain points.
A clustering function maps
Input: a distance function d over X
to
Output: a partition (clustering) of X
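One concrete rendering of this setup (the representation is an assumption, not from the talk): X as indices {0, ..., n-1}, the distance function as a symmetric matrix, and a clustering as a list of disjoint blocks that cover X.

# Sketch of the formal setup (representation choices are assumptions).
# X = {0, ..., n-1}; d is a symmetric matrix with zero diagonal;
# a clustering function maps (X, d) to a partition of X.
def is_partition(clustering, n):
    """Check that the blocks are disjoint and together cover X = {0, ..., n-1}."""
    seen = [x for block in clustering for x in block]
    return sorted(seen) == list(range(n))

d = [[0, 1, 5],
     [1, 0, 4],
     [5, 4, 0]]
C = [{0, 1}, {2}]
print(is_partition(C, len(d)))  # True: C is a clustering (partition) of X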
Kleinberg’s axioms
Scale Invariance:
f(c·d) = f(d) for all d and all strictly positive c.
Consistency:
If d' equals d, except for shrinking distances within
clusters of f(d) or stretching between-cluster
distances, then f(d') = f(d).
Richness:
For any clustering C of X, there exists a distance
function d over X so that f(d) = C.
Theorem [Kleinberg, ‘02]:
These axioms are inconsistent. Namely, no
function can satisfy these three axioms.
Why are “axioms” that seem to capture our
intuition about clustering inconsistent??
Our answer: The formalization of these
axioms is stronger than the intuition
they intend to capture
We express that same intuition in an alternative framework,
and achieve consistency.
Clustering quality measures
How good is this clustering?
Clustering-quality measures
quantify the quality of clusterings.
Defining clustering quality measures
A clustering-quality measure is a function
m(dataset, clustering) ∈ R
satisfying some properties that make this
function a meaningful clustering quality
measure.
What properties should it satisfy?
Rephrasing Kleinberg’s axioms for clustering
quality measures
Scale Invariance:
m(C, αd) = m(C, d) for all C, d and strictly positive α.
Richness:
For any clustering C of X, there exists a distance
function d over X so that C = argmax_C' m(C', d).
Consistency:
If d' equals d, except for shrinking distances within
clusters of C or stretching between-cluster distances,
then m(C, d) ≤ m(C, d').
Major gain - consistency of new axioms
Theorem [Ackerman & Ben-David, NIPS ’08]:
Consistency, scale invariance, and richness for
clustering quality measures form a
consistent set of requirements.
Dunn’s index (’73):
min_{x ≁_C y} d(x, y) / max_{x ∼_C y} d(x, y)
(the minimum between-cluster distance divided by the maximum
within-cluster distance, where x ∼_C y means x and y share a cluster in C)
This clustering quality measure satisfies consistency,
scale-invariance, and richness.
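A small sketch (not from the paper) that computes this index for a given clustering; the list-of-index-sets representation of a clustering and the function name are assumptions.

# Minimal sketch of Dunn's index: minimum between-cluster distance divided by
# maximum within-cluster distance (higher is better). Assumes at least two
# clusters and at least one cluster with more than one point.
def dunn_index(dist, clustering):
    between = min(dist[x][y]
                  for i, ci in enumerate(clustering)
                  for cj in clustering[i + 1:]
                  for x in ci for y in cj)
    within = max(dist[x][y]
                 for c in clustering
                 for x in c for y in c if x != y)
    return between / within

dist = [[0, 1, 9, 10],
        [1, 0, 8, 9],
        [9, 8, 0, 2],
        [10, 9, 2, 0]]
print(dunn_index(dist, [{0, 1}, {2, 3}]))  # 8 / 2 = 4.0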
Additional measures satisfying our axioms
• C-index (Dalrymple-Alford, 1970)
• Gamma (Baker & Hubert, 1975)
• Adjusted ratio of clustering (Roenker et al., 1971)
• D-index (Dalrymple-Alford, 1970)
• Modified ratio of repetition (Bower, Lesgold, and
Tieman, 1969)
• Variations of Dunn’s index (Bezdek and Pal, 1998)
• Strict separation (Balcan, Blum, and Vempala, 2008)
• And many more...
Why is the quality measure formulation more
faithful to intuition?
In the earlier setting of clustering functions, consistent changes to
the underlying distance should not create any new contenders
for the best clustering of the data.
A clustering function that satisfies Kleinberg’s Consistency must
output the same clustering C on the changed distance function d';
it cannot output a different clustering C'.
Why is the quality measure formulation more
faithful to intuition?
In the setting of clustering-quality measures, consistency
requires only that the quality of clustering C not get
worse.
A different clustering can have better quality than the original.
Outline
• Axiomatic treatment of clustering
• Clustering algorithm selection
• Characterizing Linkage-Based clustering
Clustering algorithm selection
There is a wide variety of clustering algorithms, which
can produce very different clusterings.
How should a user decide
which algorithm to use for
a given application?
Clustering algorithm selection
Users rely on cost-related considerations: running
times, space usage, software purchasing costs, etc.
There is inadequate emphasis on
input-output behavior.
Our framework for algorithm selection
We propose a framework that lets a user utilize prior
knowledge to select an algorithm
• Identify properties that distinguish between different
input-output behavior of clustering paradigms
• The properties should be:
1) Intuitive and “user-friendly”
2) Useful for distinguishing clustering algorithms
Ex. Kleinberg’s axioms, order invariance, etc.
Property-based classification for fixed k
Ackerman, Ben-David, and Loker, NIPS 2010
[Table: rows are the algorithms single linkage, average linkage, complete linkage,
k-means, k-medoids, min-sum, ratio-cut, and normalized cut; columns are the
properties locality, outer consistency, inner consistency, consistency,
refinement preserving, order invariance, richness, outer richness,
representation independence, and scale invariance; check marks indicate which
algorithm satisfies which property. Single linkage satisfies all ten properties.]
Kleinberg’s axioms for fixed k
[Same property table as on the previous slide, highlighting the columns
corresponding to Kleinberg’s axioms.]
Kleinberg’s axioms are consistent when k is given.
Single-linkage satisfies everything
Single linkage satisfies all ten properties: locality, outer consistency,
inner consistency, consistency, refinement preserving, order invariance,
richness, outer richness, representation independence, and scale invariance.
Recall: Single linkage is Kruskal’s algorithm for
Minimum Spanning Tree.
It’s not a good clustering algorithm in practice!
Classification in Weighted Setting
Ackerman, Ben-David, Branzei, and Loker (AAAI, 2012)
Weight robust: ignores element duplicates
Weight sensitive: output can always be changed by duplicating some
of the data
Weight considering: element duplication affects the output on some
data sets, but not others.
                     Partitional                               Hierarchical
Weight Robust        Min Diameter, k-center                    Single Linkage, Complete Linkage
Weight Sensitive     k-means, k-medoids, k-median, min-sum     Ward’s Method, Bisecting k-means
Weight Considering   Ratio Cut                                 Average Linkage
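A toy, hand-constructed illustration of weight sensitivity (the data values and helper names are assumptions, not from the paper): brute-force optimal 2-means on a 1-D data set, before and after duplicating one point. A weight-robust method such as min diameter is unaffected by the duplicates, since they add no new distances.

# Toy illustration of weight sensitivity: duplicating a point moves the optimal
# 2-means split. Brute force over 1-D split points; example values are assumed.
def sse(points):
    """Sum of squared distances to the cluster mean."""
    if not points:
        return 0.0
    mu = sum(points) / len(points)
    return sum((p - mu) ** 2 for p in points)

def best_two_means(points):
    pts = sorted(points)
    # In 1-D an optimal 2-means partition is an interval split of the sorted data.
    i = min(range(1, len(pts)), key=lambda j: sse(pts[:j]) + sse(pts[j:]))
    return pts[:i], pts[i:]

data = [0, 1, 2, 3, 4, 5]
left, right = best_two_means(data)
print(max(left), "|", min(right))          # 2 | 3 : split falls between 2 and 3
left, right = best_two_means(data + [2] * 20)
print(max(left), "|", min(right))          # 3 | 4 : duplicating 2 moves the split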
Using property-based classification to
choose an algorithm
• Enables users to identify a suitable algorithm
without the overhead of executing many algorithms
• This framework helps understand the behavior of
existing and new algorithms
• The long-term goal is to construct a property-based
classification for many useful clustering algorithms
Outline
• Axiomatic treatment of clustering
• Clustering algorithm selection
• Characterizing linkage-based clustering
Characterizing Linkage-Based Clustering
We characterize a popular family of clustering
algorithms, called “linkage-based.”
We show that
1) all linkage-based algorithms satisfy two
natural properties, and
2) no algorithm outside that family satisfies these
properties.
Formal setting:
Dendrograms and clusterings
Ci is a cluster in a dendrogram D if there exists a
node in the dendrogram so that Ci is the set of its
leaf descendants.
Formal setting:
Dendrograms and clusterings
C = {C1, ..., Ck} is a clustering in a dendrogram D if
– Ci is a cluster in D for all 1 ≤ i ≤ k, and
– clusters are disjoint.
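A small sketch of these definitions, with a dendrogram represented as nested pairs whose leaves are element indices (the representation and helper name are illustrative assumptions).

# A dendrogram as nested pairs; leaves are elements of X.
D = (((0, 1), 2), (3, 4))

def leaves(node):
    """The set of leaf descendants of a node, i.e. the cluster that node represents."""
    if isinstance(node, int):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)

# {0, 1, 2} and {3, 4} are clusters of D; being disjoint, together they form
# a clustering in D.
print(leaves(D[0]), leaves(D[1]))  # {0, 1, 2} {3, 4}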
Formal setting:
Hierarchical clustering algorithm
A Hierarchical Clustering Algorithm A maps
Input: A data set X with a distance function d,
to
Output: A dendrogram of X
Linkage-Based Algorithms
• Create a leaf node for every element of X
• Repeat the following until a single tree remains:
– Consider the clusters represented by the remaining root nodes
– Merge the closest pair of clusters by assigning them a common
parent node
Examples of linkage-based algorithms
• The choice of linkage function distinguishes between
different linkage-based algorithms.
• Examples of common linkage-functions
– Single-linkage: min between-cluster distance
– Average-linkage: average between-cluster distance
– Complete-linkage: max between-cluster distance
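A minimal sketch of the generic linkage-based procedure above, with the three linkage functions just listed plugged in (the dense distance-matrix input, the frozenset cluster representation, and the function names are assumptions).

# Generic linkage-based (agglomerative) clustering; the linkage function is a parameter.
def single(c1, c2, dist):
    return min(dist[x][y] for x in c1 for y in c2)

def average(c1, c2, dist):
    return sum(dist[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))

def complete(c1, c2, dist):
    return max(dist[x][y] for x in c1 for y in c2)

def linkage_based(dist, linkage):
    """Repeatedly merge the closest pair of root clusters; return the merge history."""
    clusters = [frozenset([i]) for i in range(len(dist))]  # one leaf node per element
    merges = []
    while len(clusters) > 1:
        # Find the pair of current root clusters with the smallest linkage value.
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], dist))
        merges.append((clusters[i], clusters[j]))           # the new parent node
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

dist = [[0, 1, 9, 10],
        [1, 0, 8, 9],
        [9, 8, 0, 2],
        [10, 9, 2, 0]]
print(linkage_based(dist, average))  # merges {0,1}, then {2,3}, then everything

Swapping in single or complete changes only the merge criterion, which is exactly the sense in which the linkage function distinguishes the algorithms in this family.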
Characterizing Linkage-Based Clustering
Partitional Setting
[Same property table as before: single, average, and complete linkage versus
k-means, k-medoids, min-sum, ratio-cut, and normalized cut on the ten properties.]
Characterizing Linkage-Based Clustering
Ackerman, Ben-David, and Loker, COLT 2010
[Property table restricted to single, average, and complete linkage.]
The 2010 characterization applies in the partitional setting, by using
the k-stopping criterion.
This characterization distinguished linkage-based algorithms from
other partitional techniques.
Characterizing Linkage-Based Clustering in the
Hierarchical Setting
(Ackerman & Ben-David, IJCAI ‘11)
• Propose two intuitive properties that uniquely
identify hierarchical linkage-based clustering
algorithms.
• Show that common hierarchical algorithms,
including bisecting k-means, cannot be
simulated by any linkage-based algorithm.
Locality
D = A(X, d);  D' = A(X', d), where X' is a cluster in D.
If we select a cluster from a dendrogram, and run the
algorithm on the data in this cluster, we obtain a result that
is consistent with the original dendrogram.
Outer consistency
Let C be a clustering in A(X, d), and let d' be obtained from d
by an outer-consistent change (stretching between-cluster
distances with respect to C).
If A is outer-consistent, then A(X, d') will include the clustering C.
Theorem [Ackerman & Ben-David, IJCAI ’11]:
A hierarchical clustering algorithm is Linkage-Based
if and only if
it is Local and Outer-Consistent.
Easy direction of proof
Every linkage-based hierarchical clustering
algorithm is Local and Outer-Consistent.
The proof is quite straightforward.
Interesting direction of proof
If A is Local and Outer-Consistent, then A is
linkage-based.
To prove this direction we first need to formalize
linkage-based clustering, by formally defining
what a Linkage Function is.
What do we expect from a linkage function?
A linkage function
ℓ : {(X1, X2, d) : d is a distance function over X1 ∪ X2} → R+
satisfies the following:
Monotonicity:
If we increase distances that go between X1 and X2,
then ℓ(X1, X2, d) does not decrease.
Representation independence:
ℓ(X1, X2, d) does not change if we re-label the data.
Proof Sketch
Recall direction:
If A satisfies Outer-Consistency and Locality, then
it is linkage-based.
Goal
Define a linkage function ℓ so that the linkage-based
clustering based on ℓ outputs A(X, d)
(for every X and d).
Proof Sketch
• Define an operator <A : (X, Y, d1) <A (Z, W, d2) if, when we run A on
(X ∪ Y ∪ Z ∪ W, d), where d extends d1 and d2, X and Y are merged
before Z and W.
• Prove that <A can be extended to a partial ordering.
• Use the ordering to define ℓ.
Sketch of proof, continued:
Show that <A is a partial ordering
We show that <A is cycle-free.
Lemma: Given a hierarchical algorithm A that is
Local and Outer-Consistent, there exists no finite
sequence so that
(X1, Y1, d1) <A ··· <A (Xn, Yn, dn) <A (X1, Y1, d1).
Proof Sketch (continued)
• By the above Lemma, the transitive closure of <A
is a partial ordering.
• This implies that there exists an order-preserving
function ℓ that maps pairs of data sets to R+.
• It can be shown that ℓ satisfies the properties of a
Linkage Function.
Future Directions
• Identify properties that are significant for
specific clustering applications (some previous work in
this direction by Ackerman, Brown, and Loker (ICCABS, 2012)).
• Analyze clustering algorithms in alternative
settings, such as categorical data, fuzzy
clustering, and using a noise bucket
• Online clustering
• Axiomatize clustering functions