Towards Theoretical Foundations of Clustering.

Margareta Ackerman
Caltech
Joint work with Shai Ben-David and David Loker
The Theory-Practice Gap
Clustering is one of the most widely used tools
for exploratory data analysis.
Social Sciences
Biology
Astronomy
Computer Science
….
All apply clustering to gain a first understanding of
the structure of large data sets.
The Theory-Practice Gap
“While the interest in and application of cluster
analysis has been rising rapidly, the abstract
nature of the tool is still poorly understood”
(Wright, 1973)
“There has been relatively little work aimed at
reasoning about clustering independently
of any particular algorithm, objective
function, or generative data model”
(Kleinberg, 2002)
Both statements still apply today.
Inherent Obstacles:
Clustering is ill-defined
Clustering aims to assign data into groups of
similar items
Beyond that, there is very little consensus
on the definition of clustering
Inherent Obstacles
• Clustering is inherently ambiguous
• There may be multiple reasonable
clusterings
• There is usually no ground truth
Differences in Input/Output Behavior
of Clustering Algorithms
Outline
• Previous work
• Clustering algorithm selection
• Characterization of Linkage-Based clustering
• Conclusions and future work
Previous Work Towards a
General Theory
• Axioms of clustering [(Wright, ‘73), (Meila, ACM
‘05), (Pattern Recognition, ‘00), (Kleinberg, NIPS ‘02),
(Ackerman & Ben-David, NIPS ‘08)].
• Clusterability [(Balcan, Blum, and Vempala, STOC
‘08),(Balcan, Blum and Gupta, SODA ‘09), (Ackerman &
Ben-David, AISTATS ’09)].
Outline
• Previous work
• Clustering algorithm selection
• Characterization of Linkage-Based clustering
• Conclusions and future work
Selecting a Clustering Algorithm
There are a wide variety of clustering
algorithms, which often produce very
different clusterings.
How should a user decide
which algorithm to use for
a given application?
Selecting a Clustering Algorithm
Users rely on cost-related considerations: running times,
space usage, software purchasing costs, etc.
There is inadequate emphasis on
input-output behaviour.
Our Framework for Selecting a
Clustering Algorithm
• Identify properties that distinguish between
different input-output behaviour of clustering
paradigms
• The properties should be:
1) Intuitive and “user-friendly”
2) Useful for distinguishing clustering
algorithms
Our Framework for Selecting a
Clustering Algorithm
• Enables users to identify a suitable algorithm
without the overhead of executing many
algorithms
• Helps understand the behaviour of algorithms
• The long-term goal is to construct a large
property-based classification for many useful
clustering algorithms
Taxonomy of Partitional Algorithms
(Ackerman, Ben-David, Loker, NIPS 2010)
Properties vs. Axioms
Characterization of Linkage-Based Clustering
(Ackerman, Ben-David, Loker, COLT 2010)
The 2010 characterization applies in the partitional
setting, using the k-stopping criterion.
This characterization distinguished linkage-based
algorithms from other partitional algorithms.
Characterizing Linkage-Based Clustering in
the Hierarchical Setting
(Ackerman and Ben-David, IJCAI 2011)
• Propose two intuitive properties that uniquely
identify hierarchical linkage-based clustering
algorithms.
• Show that common hierarchical algorithms,
including bisecting k-means, cannot be
simulated by any linkage-based algorithm.
Outline
• Previous work
• Clustering algorithm selection
• Characterization of Linkage-Based clustering
• Conclusions and future work
Formal Setup:
Dendrograms and clusterings
C_i is a cluster in a dendrogram D if there exists a
node in the dendrogram such that C_i is the set of its
leaf descendants.
Formal Setup:
Dendrograms and clusterings
C = {C1, … , Ck} is a clustering in a dendrogram D if
– Ci is a cluster in D for all 1≤ i ≤ k, and
– The clusters are disjoint
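The following is a minimal Python sketch of these two definitions (not from the talk; the DendroNode class and helper names are my own), representing a dendrogram as a tree whose leaves hold data elements.

```python
# Sketch (assumed names): dendrogram nodes and the two definitions above.
class DendroNode:
    def __init__(self, elements=None, children=None):
        # A leaf stores its element(s); an internal node stores its children.
        self.children = list(children) if children else []
        self._elements = set(elements) if elements else set()

    def leaves(self):
        """Return the set of leaf descendants of this node."""
        if not self.children:
            return set(self._elements)
        return set().union(*(c.leaves() for c in self.children))


def nodes(root):
    """Yield every node of the dendrogram rooted at `root`."""
    yield root
    for child in root.children:
        yield from nodes(child)


def is_cluster(C, root):
    """C_i is a cluster in D iff some node's leaf descendants are exactly C_i."""
    return any(node.leaves() == set(C) for node in nodes(root))


def is_clustering(clusters, root):
    """A clustering in D: every C_i is a cluster in D and the clusters are disjoint."""
    disjoint = all(set(A).isdisjoint(B)
                   for i, A in enumerate(clusters) for B in clusters[i + 1:])
    return disjoint and all(is_cluster(C, root) for C in clusters)


# Example: the dendrogram ((1, 2), 3)
leaf = lambda x: DendroNode(elements=[x])
D = DendroNode(children=[DendroNode(children=[leaf(1), leaf(2)]), leaf(3)])
assert is_cluster({1, 2}, D) and is_clustering([{1, 2}, {3}], D)
```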
Formal Setup:
Hierarchical clustering algorithm
A Hierarchical Clustering Algorithm A
maps
Input: A data set X with a dissimilarity function d,
denoted (X,d)
to
Output: A dendrogram of X
Linkage-Based Algorithm
• Create a leaf node for every element of X
• Repeat the following until a single tree remains:
  – Consider the clusters represented by the remaining root nodes.
    Merge the closest pair of clusters by assigning them a
    common parent node.
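A sketch of this generic agglomerative loop (assumed names, reusing the DendroNode class from the earlier sketch); the linkage function that scores cluster pairs is supplied by the caller and discussed next, and ties are broken arbitrarily here.

```python
import itertools

# Sketch: generic linkage-based construction; `linkage(C1, C2, d)` scores a pair of
# clusters, and `d` is a dissimilarity function d(x, y).
def linkage_based_dendrogram(X, d, linkage):
    # Create a leaf node for every element of X.
    roots = [DendroNode(elements=[x]) for x in X]
    # Repeat until a single tree remains.
    while len(roots) > 1:
        # Consider the clusters represented by the remaining root nodes and
        # merge the closest pair by assigning them a common parent node.
        a, b = min(itertools.combinations(roots, 2),
                   key=lambda pair: linkage(pair[0].leaves(), pair[1].leaves(), d))
        roots = [r for r in roots if r is not a and r is not b]
        roots.append(DendroNode(children=[a, b]))
    return roots[0]
```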
Example Linkage-Based Algorithms
• The choice of Linkage Function distinguishes
between different linkage-based algorithms.
• Examples of common linkage-functions
– Single-linkage: shortest between-cluster distance
– Average-linkage: average between-cluster distance
– Complete-linkage: maximum between-cluster distance
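A sketch of these three linkage functions, matching the linkage(C1, C2, d) signature assumed above:

```python
# Sketch: standard linkage functions usable with linkage_based_dendrogram.
def single_linkage(C1, C2, d):
    # Shortest between-cluster distance.
    return min(d(a, b) for a in C1 for b in C2)

def complete_linkage(C1, C2, d):
    # Maximum between-cluster distance.
    return max(d(a, b) for a in C1 for b in C2)

def average_linkage(C1, C2, d):
    # Average between-cluster distance.
    return sum(d(a, b) for a in C1 for b in C2) / (len(C1) * len(C2))

# Example on the line, with d(x, y) = |x - y|:
# linkage_based_dendrogram([0.0, 1.0, 5.0], lambda x, y: abs(x - y), single_linkage)
```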
Locality
Informal Definition: If we select a set of disjoint clusters from a dendrogram
D = A(X,d), and run the algorithm on the union X' of these clusters, we obtain
a dendrogram D' = A(X',d) that is consistent with the original dendrogram.
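Under the informal definition above, one simple necessary consequence of locality can be checked on a given instance. The sketch below (reusing the is_cluster / is_clustering helpers from the formal setup) is only an illustration of that consequence, not the formal definition.

```python
# Sketch: partial, instance-level check of locality for an algorithm A(X, d) -> dendrogram.
# After restricting the data to the union X' of the selected clusters,
# each selected cluster should still appear as a cluster of A(X', d).
def check_locality_instance(A, X, d, selected_clusters):
    D = A(X, d)
    assert is_clustering([set(C) for C in selected_clusters], D)
    X_prime = list(set().union(*(set(C) for C in selected_clusters)))
    D_prime = A(X_prime, d)   # d restricted to X': same function, fewer points
    return all(is_cluster(set(C), D_prime) for C in selected_clusters)
```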
Outer Consistency
An outer-consistent change of (X,d) with respect to a clustering C may increase
between-cluster distances, but leaves within-cluster distances unchanged.
If A is outer-consistent and C is a clustering in A(X,d), then for any
outer-consistent change d', A(X,d') will also include the clustering C.
Theorem (Ackerman & Ben-David, IJCAI 2011):
A hierarchical clustering algorithm is
Linkage-Based
if and only if
it is Local and Outer-Consistent.
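As an illustration of the outer-consistency property (helper names are my own, not from the paper), one can build a single outer-consistent variant of d and check that the clustering survives:

```python
# Sketch: construct one outer-consistent variant d' of d with respect to a
# clustering C, and check that C still appears in A(X, d').
def outer_consistent_variant(d, C, stretch=2.0):
    cluster_of = {x: i for i, Ci in enumerate(C) for x in Ci}
    def d_prime(x, y):
        cx, cy = cluster_of.get(x), cluster_of.get(y)
        if cx is not None and cy is not None and cx != cy:
            return stretch * d(x, y)   # between-cluster distances only increase
        return d(x, y)                 # all other distances are left unchanged
    return d_prime

def check_outer_consistency_instance(A, X, d, C):
    clusters = [set(Ci) for Ci in C]
    assert is_clustering(clusters, A(X, d))
    return is_clustering(clusters, A(X, outer_consistent_variant(d, C)))
```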
Easy Direction of Proof
Every Linkage-Based hierarchical clustering
algorithm is Local and Outer-Consistent.
The proof is quite straightforward.
Interesting Direction of Proof
If A is Local and Outer-Consistent, then A is
Linkage-Based.
To prove this direction, we first need to
formalize Linkage-Based clustering by
formally defining what a Linkage
Function is.
What Do We Expect From Linkage
Functions?
A Linkage Function is a function
l : {(X1, X2, d) : d is a distance function over X1 ∪ X2} → R+
that satisfies the following:
– Representation independence: l doesn't change if we re-label the data.
– Monotonicity: if we increase the edges that go between X1 and X2,
  then l(X1, X2, d) doesn't decrease.
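A small instance-level sanity check of the monotonicity condition might look as follows; this is only a sketch under the assumptions above (it stretches every between-cluster edge by a fixed factor), not a full verification of the property.

```python
# Sketch: instance-level monotonicity check for a candidate linkage function.
def check_monotonicity_instance(linkage, X1, X2, d, stretch=1.5):
    def d_stretched(x, y):
        if (x in X1 and y in X2) or (x in X2 and y in X1):
            return stretch * d(x, y)   # increase the between-cluster edges
        return d(x, y)                 # within-cluster edges unchanged
    return linkage(X1, X2, d_stretched) >= linkage(X1, X2, d)
```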
Proof Sketch
Recall direction:
If A satisfies Outer-Consistency and Locality, then A
is Linkage-Based.
Goal:
Define a linkage function l so that the linkage-based
clustering based on l outputs A(X,d)
(for every X and d).
Proof Sketch
• Define an operator <A :
  (X,Y,d1) <A (Z,W,d2) if, when we run A on (X ∪ Y ∪ Z ∪ W, d),
  where d extends d1 and d2, X and Y are merged before Z and W.
• Prove that <A can be extended to a partial ordering
• Use the ordering to define l
Proof Sketch (continued):
Show that <A is a partial ordering
We show that <A is cycle-free.
Lemma: Given a hierarchical algorithm A that is
Local and Outer-Consistent, there exists no finite
sequence so that
(X1,Y1,d1) <A … <A (Xn,Yn,dn) <A (X1,Y1,d1).
Proof Sketch (continued…)
• By the above Lemma, the transitive closure of
<A is a partial ordering.
• This implies that there exists an order
preserving function l that maps pairs of data
sets to R+.
• It can be shown that l satisfies the properties
of a Linkage Function.
Hierarchical but Not Linkage-Based
P-Divisive algorithms construct dendrograms top-down,
using a partitional 2-clustering algorithm P (e.g. k-means with k = 2)
to split nodes.
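A sketch of the P-divisive scheme (assumed names, reusing the DendroNode class sketched earlier). Plugging in 2-means as P gives bisecting k-means; the toy splitter below is only for illustration on 1-D data.

```python
# Sketch: build a dendrogram top-down by splitting nodes with a partitional
# 2-clustering algorithm P of signature P(X, d) -> (C1, C2).
def p_divisive(X, d, P):
    X = list(X)
    if len(X) == 1:
        return DendroNode(elements=X)        # leaf node for a single element
    left, right = P(X, d)                    # split the node with P
    return DendroNode(children=[p_divisive(left, d, P),
                                p_divisive(right, d, P)])

def largest_gap_split(X, d):
    # Toy stand-in for P on 1-D data: cut at the largest gap (ignores d for brevity).
    xs = sorted(X)
    cut = max(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i])
    return xs[:cut + 1], xs[cut + 1:]

# Example: p_divisive([0.0, 1.0, 5.0, 6.0], lambda x, y: abs(x - y), largest_gap_split)
```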
Hierarchical but Not Linkage-Based
A partitional 2-clustering algorithm P is
Context Sensitive if there exists a d' extending d so that
P({x,y,z}, d) = {{x}, {y,z}} and P({x,y,z,w}, d') = {{x,y}, {z,w}}.
Examples: k-means, min-sum, min-diameter.
Theorem [Ackerman & Ben-David, IJCAI '11]:
If P is context-sensitive, then the
P-divisive algorithm fails the locality
property.
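The following worked 1-D example (the numbers are my own illustration, not from the talk) shows context sensitivity for exact 2-means, computed by brute force over all 2-partitions: the pairwise distances among x, y, z are unchanged when w is added, so the second dissimilarity extends the first, yet the chosen split changes from {{x}, {y,z}} to {{x,y}, {z,w}}.

```python
# Worked illustration: exact 2-means (brute force) is context sensitive.
def two_means_cost(parts):
    def sse(part):
        mu = sum(part) / len(part)
        return sum((p - mu) ** 2 for p in part)
    return sum(sse(part) for part in parts)

def exact_two_means(points):
    best = None
    for mask in range(1, 2 ** len(points) - 1):   # all splits into two non-empty parts
        A = [p for i, p in enumerate(points) if mask & (1 << i)]
        B = [p for i, p in enumerate(points) if not mask & (1 << i)]
        cand = (two_means_cost([A, B]), (tuple(sorted(A)), tuple(sorted(B))))
        best = cand if best is None else min(best, cand)
    return best[1]

x, y, z, w = 0.0, 2.0, 3.0, 5.0
print(exact_two_means([x, y, z]))     # ((0.0,), (2.0, 3.0))      i.e. {x}, {y, z}
print(exact_two_means([x, y, z, w]))  # ((0.0, 2.0), (3.0, 5.0))  i.e. {x, y}, {z, w}
```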
Hierarchical but Not Linkage-Based
• The input-output behaviour of some natural
divisive algorithms is distinct from that of all
linkage-based algorithms.
• The bisecting k-means algorithm, and other
natural divisive algorithms, cannot be simulated
by any linkage-based algorithm.
Conclusions
• We present a new framework for clustering
algorithm selection
• Provide a property-based classification of
common clustering algorithms
• Characterize linkage-based clustering in
terms of two natural properties
• Show that no linkage-based algorithm can
simulate some natural divisive algorithms
What’s Next?
• Apply our approach to specific clustering applications
  (Ackerman, Brown, and Loker, ICCABS '12).
• Bridging the gap in other clustering settings
  – clustering with a "noise cluster"
  – algorithms for categorical data
• Axioms of clustering algorithms