Weighted Clustering - Computer Science and Engineering

advertisement
Weighted Clustering
Margareta Ackerman
Work with Shai Ben-David, Simina Branzei, and David Loker
The Theory-Practice Gap
Clustering is one of the most widely used tools
for exploratory data analysis.
Social Sciences
Biology
Astronomy
Computer Science
….
All apply clustering to gain a first understanding of
the structure of large data sets.
2
The Theory-Practice Gap
“While the interest in and application of
cluster analysis has been rising rapidly,
the abstract nature of the tool is still
poorly understood” (Wright, 1973)
“There has been relatively little work aimed at
reasoning about clustering independently of
any particular algorithm, objective function,
or generative data model” (Kleinberg, 2002)
Both statements still apply today.
3
Inherent Obstacles:
Clustering is ill-defined
Clustering aims to assign data into groups of
similar items
Beyond that, there is very little consensus on
the definition of clustering
4
Inherent Obstacles
• Clustering is inherently ambiguous
– There may be multiple reasonable clusterings
– There is usually no ground truth
• There are many clustering algorithms with
different (often implicit) objective
functions
• Different algorithms have radically
different input-output behaviour
5
Differences in Input/Output Behavior of
Clustering Algorithms
6
Differences in Input/Output Behavior of
Clustering Algorithms
7
Clustering Algorithm Selection
There are a wide variety of clustering algorithms, which
can produce very different clusterings.
How should a user decide
which algorithm to use for a
given application?
8
Clustering Algorithm Selection
Users rely on cost related considerations: running
times, space usage, software purchasing costs, etc…
There is inadequate emphasis on
input-output behaviour
9
Our Framework for Algorithm Selection
We propose a framework that lets a user utilize prior
knowledge to select an algorithm
• Identify properties that distinguish between different
input-output behaviour of clustering paradigms
• The properties should be:
1) Intuitive and “user-friendly”
2) Useful for distinguishing clustering algorithms
10
Our Framework for Algorithm Selection
In essence, our goal is to understand
fundamental differences between
clustering methods, and convey them
formally, clearly, and as simply as
possible.
11
Previous Work
• Axiomatic perspective
• Impossibility Result: Kleinberg (NIPS, 2003)
• Consistent axioms for quality measures: Ackerman &
Ben-David (NIPS, 2009)
• Axioms in the weighted setting: Wright (Pattern
Recognition, 1973)
12
Previous Work
• Characterizations of Single-Linkage
• Partitional Setting: Bosah Zehad and Ben-David (UAI,
2009)
• Hierarchical Setting: Jarvis and Sibson (Mathematical
Taxonomy, 1981) and Carlsson and Memoli (JMLR, 2010).
• Characterizations of Linkage-Based Clustering
• Partitional Setting: Ackerman, Ben-David, and Loker (COLT,
2010).
• Hierarchical Setting: Ackerman & Ben-David (IJCAI, 2011).
13
Previous Work
• Classifications of clustering methods
• Fischer and Van Ness (Biometrica, 1971)
• Ackerman, Ben-David, and Loker (NIPS, 2010)
14
What’s Left To Be Done?
Despite much work on clustering properties, some basic
questions remain unanswered.
Consider some of the most popular clustering methods:
k-means, single-linkage, average-linkage, etc…
• What are the advantages of k-means over other
methods?
• Previous classifications are missing key properties.
15
Our Contributions (at a high level)
We indentify 3 fundamental categories that clearly
delineate some essential differences between common
clustering methods
The strength of these categories lies in their simplicity.
We hope this gives insight into core differences between
popular clustering methods.
To define these categories, we first present the weighted
clustering setting.
16
Outline
Outline
•
•
•
•
Formal framework
Categories and classification
A result from each category
Conclusions and future work
Weighted Clustering
Every element is associated with a
real valued weight, representing its
mass or importance.
Generalizes the notion of element duplication.
Algorithm design, particularly design of approximation
algorithms, is often done in this framework.
18
Other Reasons to Add Weight:
An Example
• Apply clustering to facility allocation, such as the
placement of police stations in a new district.
• The distribution of stations should enable quick
access to most areas in the district.
• Accessibility of different
institutions to a station may
have varying importance.
• The weighted setting enables
a convenient method for
prioritizing certain landmarks.
19
Algorithms in the Weighted Clustering
Setting
Traditional clustering algorithms can be readily
translated into the weighted setting by
considering their behavior on data containing
element duplicates.
20
Formal Setting
• For a finite domain set X, a weight function
w: X →R+
defines the weight of every element.
• For a finite domain set X, a distance function
d: X x X →R + u {0}
is the distance defined between the domain points.
Formal Setting
Partitional Clustering Algorithm
(X,d) denotes unweighted data
(w[X],d) denotes weighted data
A Partitional Algorithm maps
Input: (w[X],d,k)
to
Output: a k-partition (k-clustering) of X
Formal Setting
Hierarchical Clustering Algorithm
A Hierarchical Algorithm maps
Input: (w[X],d)
to
Output: dendrogram of X
A dendrogram of (X,d) is a strictly binary tree whose
leaves correspond to elements of X
C appears in A(w[X],d) if its clusters are in the dendrogram
Our Contributions
• We utilize the weighted framework to indentify 3
fundamental categories, describing how algorithms
respond to weight.
• Classify traditional algorithms according to these
categories
• Fully characterize when different algorithms react to
weight
24
Towards Basic Categories
Range(X,d)
PARTITIONAL:
Range(A(X, d,k)) = {C | ∃ w s.t. C=A(w[X], d)}
The set of clusterings that A outputs on (X, d) over all possible
weight functions.
HIERARCHICAL:
Range(A(X, d)) = {D | ∃ w s.t. D=A(w[X], d)}
The set of dendrograms that A outputs on (X, d) over all
possible weight functions.
Outline
Outline
•
•
•
•
Formal framework
Categories and classification
A result from each category
Conclusions and future work
Categories:
Weight Robust
A is weight-robust if for all (X, d), |Range(X,d)| = 1.
A never responds to weight.
27
Categories:
Weight Sensitive
A is weight-sensitive if for all (X, d),|Range(X,d)| > 1.
A always responds to weight.
28
Categories:
Weight Considering
An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X,d)|=1.
2) There exists (X, d) where |Range(X,d)|>1.
A responds to weight on some
data sets, but not others.
29
Summary of Categories
Range(A(X, d)) = {C | ∃ w such that A(w[X], d) = C}
Range(A(X, d)) = {D | ∃ w such that A(w[X], d) = D}
Weight-robust: for all (X, d), |Range(X,d)| = 1.
Weight-sensitive: for all (X, d),|Range(X,d)| > 1.
Weight-considering:
1) ∃ (X, d) where |Range(X,d)|=1.
2) ∃ (X, d) where |Range(X,d)|>1.
30
Outline
Connecting
To Applications
The desired category depends on the application.
In the facility allocation
example above, a weightsensitive algorithm may be
preferred.
In phylogeny, where sampling
procedures can be highly biased,
some degree of weight
robustness may be desired.
Classification
Partitional
Hierarchical
Weight Robust
Min Diameter
K-center
Single Linkage
Complete Linkage
Weight Sensitive
K-means, k-medoids,
k-median, min-sum
Ward’s Method
Bisecting K-means
Weight
Considering
Ratio Cut
Average Linkage
For the weight considering algorithms, we fully
characterize when they are sensitive to weight.
Outline
Outline
•
•
•
•
•
Formal framework
Categories and classification
A result from each category
Classification of heuristics
Conclusions and future work
Classification
Partitional
Weight Robust
Min Diameter
K-center
Weight Sensitive
K-means
k-medoids
k-median, min-sum
Weight
Considering
Ratio Cut
Hierarchical
Single Linkage
Complete Linkage
Ward’s Method
Bisecting K-means
Average Linkage
Zooming Into:
Weight Sensitive Algorithms
We show that k-means is weight-sensitive.
A is weight-separable if for any data set (X, d) and
subset S of X with at most k points, ∃ w so that
A(w[X],d,k) separates all points of S.
Fact: Every algorithm that is weight-separable is also
weight-sensitive.
35
K-means is Weight-Sensitive
Theorem: k-means is weight-sensitive.
Proof:
• Show that k-means is weight-separable
• Consider any (X,d) and S⊂X on at least k points
• Increase weight of points in S until each belongs to a distinct cluster.
36
Zooming Into:
Weight Considering Algorithms
• We show that Average-Linkage is Weight
Considering.
• Characterize the precise conditions under
which it is sensitive to weight.
Recall:
An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X,d)|=1.
2) There exists (X, d) where |Range(X,d)|>1.
37
Average Linkage
• Average-Linkage is a hierarchical algorithm.
• It starts by creating a leaf for every element.
• It then repeatedly merges the “closest”
clusters using the following linkage function:
Average weighted distance between clusters
38
Average Linkage is Weight Considering
(X,d) where Range(X,d) =1:
A
B
C
D
A
The same dendrogram is output
for every weight function.
B
C
D
Average Linkage is Weight Considering
(X,d) where Range(X,d) >1:
1+ϵ
1
A
B
1
C
2+2ϵ
D
E
A
1+ ϵ
1
A
B
1
C
B
C
D
E
2+2ϵ
D
E
A
B
C
D
E
When is Average Linkage
Sensitive to Weight?
We showed that Average-Linkage is weightconsidering.
Can we show when it is sensitive to weight?
We provide a complete characterization of
when Average-Linkage is sensitive to
weight, and when it is not.
41
Nice Clustering
A clustering is nice if every point is closer to all points
within its cluster than to all other points.
Nice
42
Nice Clustering
A clustering is nice if every point is closer to all points
within its cluster than to all other points.
Nice
43
Nice Clustering
A clustering is nice if every point is closer to all points
within its cluster than to all other points.
Not nice
44
Characterizing When Average Linkage is
Sensitive to Weight
A dendrogram is nice if all of its clusterings are nice.
Theorem: Range(AL(X,d)) = 1 if and only if (X,d)
has a nice dendrogram.
45
Characterizing When Average Linkage is Sensitive to
Weight: Proof
Theorem: Range(AL(X,d)) = 1 if and only if (X,d)
has a nice dendrogram.
Proof:
Show that:
1) If there is a nice dendrogram for (X,d), then
Average-Linkage outputs it.
2) If a clustering that is not nice appears in dendrogram
AL(w[X],d) for some w, then Range(AL(X,d)) > 1.
46
Characterizing When Average Linkage is Sensitive to
Weight: Proof (cnt.)
Lemma: If there is a nice dendrogram for (X,d), then
Average-Linkage outputs it.
Proof Sketch:
1) Assume that (w[X],d) has a nice dendrogram.
2) Main idea: Show that every nice clustering of the
data appears in AL(w[X],d).
3) For that, we show that each cluster in a nice
clustering is formed by the algorithm.
47
Characterizing When Average Linkage is Sensitive to
Weight: Proof (cnt.)
Given a nice clustering C, it can be shown that
for any clusters Ci and Cj of C, any disjoint subsets Y and Z
of Ci, and any subset W of Cj, Y and Z are closer than Y
and W.
This implies that C appears in the dendrogram.
48
Characterizing When Average Linkage Responds to
Weight: Proof (cnt.)
Lemma: If a clustering C that is not nice appears in
AL(w[X],d), then range(X,d)>1.
Proof:
• Since C is not nice, there exist points x, y, and z, so that
• x and y are belong to the same cluster in C
• x and z belong to difference clusters
• yet d(x,z) < d(x,y)
• If x, y and z are sufficiently heavier than all other points,
then x and z will be merged before x and y, so C will not
be formed.
49
Characterizing When Average Linkage is
Sensitive to Weight
Theorem: Range(AL(X,d)) = 1 if and only if (X,d)
has a nice dendrogram.
Average Linkage is robust to weight whenever there is
a dendrogram of (X,d) consisting of only nice
clusterings, and it is sensitive to weight otherwise.
50
Zooming Into:
Weight Robust Algorithms
These algorithms are invariant to element duplication.
Ex. Min-Diameter returns a clustering that minimizes
the length of the longest within-cluster edge.
As this quantity is not effected by the number of points
(or weight) at any location, Min-Diameter is weight
robust.
51
Outline
Outline
•
•
•
•
Introduce framework
Present categories and classification
Show several results from different categories
Conclusions and future work
Conclusions
• We introduced three basic categories
describing how algorithms respond to weights
• We characterize the precise conditions under
which algorithms respond to weights
• The same results apply in the non-weighted
setting for data duplicates
• This classification can be used to help select
clustering algorithms for specific applications
Future Directions
• Capture differences between objective functions
similar to k-means (ex. k-medians, k-medoids,
min-sum)
• Show bounds on the size of the Range of weight
considering and weight sensitive methods
• Analyze clustering algorithms for categorical data
• Analyze clustering algorithms with a noise bucket
• Indentify properties that are significant for
specific clustering applications (some previous work in this
directions by Ackerman, Brown, and Loker (ICCABS, 2012)).
Supplementary Material
Characterizing When Ratio Cut
is Weight Responsive
Ratio-cut is a similarity based clustering function
A clustering is perfect if all within-cluster distances
are shorter than all between-cluster distances.
A clustering is separation uniform if all betweencluster distances are equal.
56
Characterizing When Ratio Cut
is Weight Responsive
Theorem: Given a clustering C of (X, s) where every
cluster has more than one element, ratio-cut is
weight-responsive on C if and only if either
• C is not perfect, or
• C is not separation-uniform.
57
What about heuristics?
• Our analysis for k-means and similar methods is for
their corresponding objective functions.
• Unfortunately, optimal partitions are NP-hard to find.
In practice, heuristics such as the Lloyd method are
used.
• We analyze several variations of the Lloyd method for
k-means, as well as the Partitional Around Medoids
(PAM) algorithm for k-medoids.
58
Heuristics Classification
Partitional
Weight Sensitive
•Lloyd with random initialization
•K-means++
•PAM
Weight Considering
•The Lloyd Method with Further
Centroid Initialization
Note that the more popular heuristics respond to
weights in the same way as the k-means and kmedoids objective functions.
Heuristics Classification
Partitional
Weight Sensitive
•Lloyd with random initialization
•K-means++
•PAM
Weight Considering
•The Lloyd Method with Further
Centroid Initialization
Just like Average-Linkage, the Lloyd Method with
Furthest Centroid initialization responds to weight
only on clusterings that are not nice.
Classification
Partitional
Hierarchical
Weight Robust
•Min Diameter
•K-center
•Single Linkage
•Complete Linkage
Weight Sensitive
•K-means
•k-medoids
•k-median
•Min-Sum
•Randomized Lloyd
•k-means++
•Ward’s Method
•Bisecting K-means
Weight
Considering
•Ratio Cut
•Lloyd with Furthest
Centroid
•Average Linkage
Wright’s Axioms
In 1973, Wright proposed axioms requiring in this
setting
• Points with zero mass can be treated as nonexistent
• Multiple points with mass at the same location are
equivalent to one point whose weight is the sum of
these masses
62
Characterizing When Average Linkage Responds to
Weight: Proof (cnt.)
We use the following lemma to show that if there is a nice dendrogram,
then it is output by Avergae-Linkage.
Logic: Every nice clustering is output by AL. So if there is a nice dendrog,
consider the set of clusters it has. All of them have to appear in any
dendrogam output by AL. Because a binary tree always have the same
number of nodes, and every node corresponds to a unique cluster,
every dendrog produces the same clusters, which means it has the same
clusterings as the nice dendrogam. Show by induction that every level in
a level-condensed dendrogram has the same clusters.
Lemma: Consider any data set (w[X], d). If a
clustering C of X is nice, then C appears in the
dendrogram Average-Linkage(w[X],d).
63
Previous Work
• Classifications of clustering methods
• Fischer and Van Ness (Biometrica, 1971)
64
Download