What is clustering?

Text Clustering
PengBo
Nov 1, 2010
Today’s Topic

Document clustering



Motivations
Clustering algorithms
 Partitional
 Hierarchical
Evaluation
What’s Clustering?
What is clustering?

Clustering: the process of grouping a set of
objects into classes of similar objects

The most common form of unsupervised learning


Unsupervised learning = learning from raw data, as opposed
to supervised data where a classification of examples is given
A common and important task that finds many
applications in IR and other places
Clustering Internal Criterion
How many clusters?


High intra-cluster similarity
Low inter-cluster similarity
Issues for clustering

Representation for clustering



Document representation
 Vector space or language model?
Similarity/distance
 Cosine similarity or KL distance
How many clusters?


Fixed a priori?
Completely data driven?
 Avoid “trivial” clusters - too large or small
Clustering Algorithms

Hard clustering algorithms


Compute a hard assignment: each document is a
member of exactly one cluster.
Soft clustering algorithms

Compute a soft assignment: a document’s assignment is a distribution over
all clusters.
Clustering Algorithms

Flat algorithms




Create cluster set without explicit structure
Usually start with a random (partial) partitioning
Refine it iteratively
 K means clustering
 Model based clustering
Hierarchical algorithms


Bottom-up, agglomerative
Top-down, divisive
Evaluation
Think about it…

Evaluate by high internal criterion scores?

Objective function for high intra-cluster similarity and low
inter-cluster similarity
Application
User judgment
Internal judgment
Example
[Figure: documents grouped into Cluster I, Cluster II, and Cluster III]
External criteria for clustering quality




What is the test set? What is the ground truth?
Assume documents with C gold standard classes,
while our clustering algorithms produce K clusters,
ω1, ω2, …, ωK with ni members each.
A simple measure: purity, the ratio of the number of documents of the
dominant class ci in a cluster ωk to the size of ωk.
Ω = {ω1, ω2, …, ωK} is the set of clusters and C
= {c1, c2, …, cJ} the set of classes.
Purity example
[Figure: documents from three classes grouped into Cluster I, Cluster II, and Cluster III]
Cluster I: Purity = 1/6 * max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 * max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 * max(2, 0, 3) = 3/5
Total: Purity = 1/17 * (5+4+3) = 12/17
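The purity arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `purity` and the table layout (one row per cluster, one column per class) are assumptions made for illustration, and the counts are the ones used in the purity example.

```python
def purity(contingency):
    """Purity of a clustering.

    contingency[k][j] is the number of documents of class j in cluster k
    (a layout assumed here for illustration).
    """
    total = sum(sum(row) for row in contingency)
    # For each cluster, count only the documents of its dominant class.
    majority = sum(max(row) for row in contingency)
    return majority / total


# Class counts per cluster, taken from the purity example.
counts = [
    [5, 1, 0],  # Cluster I
    [1, 4, 1],  # Cluster II
    [2, 0, 3],  # Cluster III
]

score = purity(counts)
print(score)  # 12/17, about 0.706
```

Purity grows trivially as the number of clusters grows (one document per cluster gives purity 1), so it is best compared across clusterings with the same K.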
Rand Index





View it as a series of decisions, one for each of
the N(N − 1)/2 pairs of documents in the
collection.
true positive (TP) decision assigns two similar
documents to the same cluster
true negative (TN) decision assigns two dissimilar
documents to different clusters.
false positive (FP) decision assigns two dissimilar
documents to the same cluster.
false negative (FN) decision assigns two similar
documents to different clusters.
Rand Index
Number of pairs of points:

                                    Same cluster in clustering   Different clusters in clustering
Same class in ground truth          TP                           FN
Different classes in ground truth   FP                           TN
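The four pair counts combine into the Rand index as RI = (TP + TN) / (TP + FP + FN + TN). The following is a minimal sketch of that pairwise count. The function name `rand_index`, the label lists, and the class names "a", "b", "c" are assumptions; the per-cluster class counts (5, 1, 0), (1, 4, 1), (2, 0, 3) are those of the purity example.

```python
from itertools import combinations


def rand_index(clusters, classes):
    """Return (RI, (TP, FP, FN, TN)) over all pairs of documents.

    clusters[i] is the cluster label of document i, classes[i] its
    gold-standard class. Assumes at least two documents.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn), (tp, fp, fn, tn)


# Rebuild per-document labels from the class counts of each cluster.
counts = {"I": (5, 1, 0), "II": (1, 4, 1), "III": (2, 0, 3)}
clusters, classes = [], []
for cluster_name, row in counts.items():
    for class_name, n in zip("abc", row):
        clusters += [cluster_name] * n
        classes += [class_name] * n

ri, (tp, fp, fn, tn) = rand_index(clusters, classes)
print(tp, fp, fn, tn)  # 20 20 24 72
print(ri)              # 92/136, about 0.68
```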
Rand index Example
[Figure: documents grouped into Cluster I, Cluster II, and Cluster III]
K Means Algorithm
Partitioning Algorithms

Given:


a set of documents D and the number K
Find:

Find a partition into K clusters that optimizes the partitioning criterion
 Globally optimal: exhaustively enumerate all
partitions
 Effective heuristic methods: K-means algorithms
Partitioning criterion: residual sum of squares (RSS)
K-Means



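Assume documents are real-valued vectors.
Clusters are based on the centroids (aka the center of
gravity or mean) of the points in a cluster ω.
Instances are assigned to clusters by their distance to the cluster
centroids: each instance goes to the nearest centroid.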
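The two alternating steps, reassignment and centroid recomputation, can be sketched in plain Python. This is an illustrative sketch rather than a reference implementation. Squared Euclidean distance, random seed selection from the documents, the handling of empty clusters, and all names (`k_means`, `max_iterations`, the sample points) are assumptions.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignment, centroids, rss) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K documents as seeds
    assignment = None
    for _ in range(max_iterations):
        # Reassignment: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: clusters no longer change
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    rss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, rss


# Hypothetical two-dimensional "documents" forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids, rss = k_means(points, k=2)
print(assignment, rss)
```

Cluster labels are arbitrary, so only the grouping is meaningful, not which group is called 0 or 1.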
K Means Example
(K=2)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
K-Means Algorithm
Convergence

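Why does the K-means algorithm converge, i.e. reach a state in which clusters don’t change?

Reassignment: RSS decreases monotonically (it never increases), because each vector is assigned to its nearest
centroid.
Recomputation: each RSSk decreases monotonically (mk is the number of
members in cluster k):

Which value of a = μ(ωk) minimizes RSSk?
$RSS_k(a) = \sum_{x \in \omega_k} |x - a|^2$
Setting the derivative with respect to a to zero:
$\sum_{x \in \omega_k} -2(x - a) = 0$
$\sum_{x \in \omega_k} x = \sum_{x \in \omega_k} a$
$m_k a = \sum_{x \in \omega_k} x$
$a = \frac{1}{m_k} \sum_{x \in \omega_k} x$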
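The result of this derivation, that the mean minimizes the RSS of a cluster, can be checked numerically. The sketch below uses a hypothetical one-dimensional cluster and a handful of candidate values around the mean; the names `rss`, `points`, and `candidates` are illustrative.

```python
def rss(points, a):
    """Residual sum of squares of one-dimensional points around a."""
    return sum((x - a) ** 2 for x in points)


points = [1.0, 2.0, 4.0, 9.0]      # hypothetical cluster members
mean = sum(points) / len(points)   # a = (1/m_k) * sum of x

# Compare the mean against nearby candidate representatives.
candidates = [mean + delta for delta in (-1.0, -0.1, 0.0, 0.1, 1.0)]
for a in candidates:
    print(round(a, 2), round(rss(points, a), 3))
```

The smallest RSS in the printed list is the one at the mean itself.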
Convergence = Global Minimum?

There is unfortunately no guarantee that a global
minimum in the objective function will be reached
[Figure: example with an outlier]
Seed Choice


In the above, if you start
with B and E as centroids
you converge to {A,B,C}
and {D,E,F}
If you start with D and F
you converge to
{A,B,D,E} {C,F}
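The choice of seeds affects the result.
Some seeds lead to slow convergence,
or to convergence to sub-optimal
clusterings.
Select seeds using a heuristic (e.g., the doc
least similar to any existing
mean)
Try multiple starting points
Initialize with the results of another
clustering method (e.g., by sampling)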
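One way to read the "doc least similar to any existing mean" heuristic is farthest-first seeding: each new seed is the document farthest from all seeds chosen so far. The sketch below makes that reading concrete. Squared Euclidean distance, the choice of the first seed, and the sample points are assumptions, not part of the lecture.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def farthest_first_seeds(points, k, first=0):
    """Pick k seeds; each new seed is the point farthest from existing seeds."""
    seeds = [points[first]]  # assumption: start from an arbitrary document
    while len(seeds) < k:
        # Distance of a point to the seed set = distance to its nearest seed.
        next_seed = max(
            points,
            key=lambda p: min(squared_distance(p, s) for s in seeds),
        )
        seeds.append(next_seed)
    return seeds


# Hypothetical two-dimensional documents.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 0)]
seeds = farthest_first_seeds(points, k=3)
print(seeds)  # [(0, 0), (10, 11), (5, 0)]
```

A known drawback of this kind of heuristic is that it tends to pick outliers as seeds, so it is often combined with trying several starting points.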
How Many Clusters?

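How do we determine a suitable K?

Balance producing more clusters (each one more internally similar) against producing too many
clusters (e.g., a high browsing cost)
For example:

Define Benefit: the cosine similarity of a doc to the centroid of its cluster.
The sum of the benefits of all docs is the Total Benefit.
Define a Cost for each cluster
Define the Value of a clustering = Total Benefit - Total Cost.
Among all possible K, choose the one with the largest Value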
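The selection rule can be sketched as follows. The lecture does not say how the cost of a cluster is defined, so this sketch assumes a fixed cost per cluster (Total Cost = K times that cost), and the Total Benefit figures are hypothetical numbers standing in for the result of actually clustering with each K.

```python
def best_k(total_benefit_by_k, cost_per_cluster):
    """Return the K maximizing Value = Total Benefit - Total Cost.

    Assumes Total Cost = K * cost_per_cluster (an illustrative choice).
    """
    def value(k):
        return total_benefit_by_k[k] - cost_per_cluster * k

    return max(total_benefit_by_k, key=value)


# Hypothetical Total Benefit obtained by clustering with each K.
# Benefit keeps rising with K, but by less and less.
benefits = {1: 4.0, 2: 7.5, 3: 8.6, 4: 9.0, 5: 9.2}

chosen = best_k(benefits, cost_per_cluster=1.0)
print(chosen)  # 3
```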
Is K-Means Efficient?

Time Complexity





Computing distance between two docs is O(M) where
M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations, or
O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
M is …


A document is a sparse vector, but a centroid is not
K-medoids algorithms: the element closest to the
center as "the medoid"
Efficiency: Medoid As Cluster
Representative

Medoid: use a single document as the representative of the cluster
e.g., the document closest to the centroid
One reason this is useful
Consider the representative of a very large cluster (>1000
documents)
The centroid of this cluster will be a dense vector
The medoid of this cluster will be a sparse vector
Analogous to:
mean vs. median
centroid vs. medoid
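The sparsity argument can be made concrete with sparse vectors stored as dictionaries. This is a minimal sketch that follows the definition used here (the medoid is the document closest to the centroid). The documents, term weights, and function names are hypothetical.

```python
def centroid(docs):
    """Mean of sparse vectors; has a non-zero entry for every term seen."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


def squared_distance(a, b):
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


def medoid(docs):
    """The document closest to the centroid."""
    c = centroid(docs)
    return min(docs, key=lambda d: squared_distance(d, c))


# Hypothetical term-weight vectors for one cluster.
docs = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "engine": 1.0},
    {"jaguar": 1.0, "car": 1.0, "speed": 1.0},
    {"car": 1.0, "dealer": 1.0},
]

c = centroid(docs)
m = medoid(docs)
print(len(c), len(m))  # 5 non-zero terms in the centroid, 2 in the medoid
```

With thousands of documents the centroid accumulates almost the whole cluster vocabulary, while the medoid stays as sparse as a single document.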
Hierarchical Clustering Algorithm
Hierarchical Agglomerative
Clustering (HAC)


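Assume a similarity function that determines
the similarity of two instances.
Greedy algorithm:
Start with each instance in its own cluster
Pick the two most similar clusters and merge them into a new
cluster
Repeat until only one cluster
remains
The merge history forms a
binary tree or hierarchy, drawn as a dendrogram.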
Dendrogram: Document Example

As clusters agglomerate, docs likely to fall into a
hierarchy of “topics” or concepts.
[Figure: dendrogram over d1 to d5; d1 and d2 merge into {d1,d2}, d4 and d5 merge into {d4,d5}, and d3 joins them to form {d3,d4,d5}]
HAC Algorithm, pseudo-code
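A minimal sketch of the greedy HAC procedure in Python (an illustration, not the lecture's own pseudo-code). It recomputes cluster similarities from scratch at every step, so it is far slower than an efficient implementation. The function name `hac`, the `linkage` parameter (`max` for single-link, `min` for complete-link), and the one-dimensional sample points are assumptions.

```python
def hac(points, similarity, linkage=max):
    """Naive hierarchical agglomerative clustering.

    Returns the merge history as a list of (cluster_a, cluster_b) pairs,
    where each cluster is a tuple of point indices.
    linkage=max gives single-link, linkage=min gives complete-link.
    """
    clusters = [(i,) for i in range(len(points))]  # one cluster per instance
    merges = []

    def cluster_similarity(a, b):
        return linkage(similarity(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical points on a line; similarity is negative distance.
points = [1.0, 2.0, 6.0, 7.5, 13.0]
merges = hac(points, similarity=lambda a, b: -abs(a - b), linkage=max)
for left, right in merges:
    print(left, "+", right)
```

Read bottom-up, the merge history is the dendrogram: the first merges join the closest instances, and the last merge joins everything into a single cluster.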
Hierarchical Clustering algorithms

Agglomerative (bottom-up):




Start with each document being a single cluster.
Eventually all documents belong to the same cluster.
Divisive (top-down):

Start with all documents belonging to the same cluster.

Eventually each node forms a cluster on its own.
No need to fix the number of clusters k in advance
Key notion: cluster representative



How do we compute which two clusters are closest?
To do this efficiently, how should each
cluster be represented (cluster representation)?
The representative can be some “typical” or
central point in the cluster:


point inducing smallest radii to docs in cluster
 smallest squared distances, etc.
point that is the “average” of all docs in the cluster
 Centroid or center of gravity
“Closest pair” of clusters

“Center of gravity”
Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link
Average cosine similarity between pairs of elements
Single-link
Nearest points (similarity of the most cosine-similar)
Complete-link
Farthest points (similarity of the “furthest” points, the least
cosine-similar)
Single Link Example
Can produce chaining
$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
$\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$
Complete Link Example
Affected by outliers
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
$\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$
Computational Complexity


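In the first iteration, HAC computes the similarity between all pairs
of instances: O(n^2).
In each of the subsequent n − 2 merging iterations, it computes the similarity between the most recently created
cluster and all other existing clusters.
All other similarities are unchanged.
To achieve overall O(n^2) performance,
computing the similarity to each other cluster must take constant time.
Otherwise: O(n^2 log n) or O(n^3)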
Centroid Agglomerative Clustering
Example: n=6, k=3, closest pair of centroids
[Figure: points d1 to d6, with the centroid after the first step and the centroid after the second step marked]
Group Average Agglomerative
Clustering


Average similarity over all pairs in the merged cluster:
$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$
Can this be computed in constant time?
Vectors are all normalized to unit length.
Keep the sum of vectors for each cluster:
$s(c_j) = \sum_{x \in c_j} x$
$\mathrm{sim}(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$
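The sum-of-vectors shortcut can be checked against the brute-force definition. The sketch below assumes that sim(x, y) is the dot product of unit-length vectors (that is, cosine similarity). The vectors and function names are hypothetical. "Constant time" here means independent of the cluster sizes; the cost still grows with the vector dimensionality.

```python
import math


def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return tuple(a / length for a in v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vector_sum(cluster):
    return tuple(sum(coords) for coords in zip(*cluster))


def group_average_fast(sum_i, size_i, sum_j, size_j):
    """Group-average similarity from stored vector sums and cluster sizes."""
    s = tuple(a + b for a, b in zip(sum_i, sum_j))
    n = size_i + size_j
    # s.s sums sim(x, y) over all ordered pairs, including the n self-pairs,
    # each of which contributes exactly 1 for unit-length vectors.
    return (dot(s, s) - n) / (n * (n - 1))


def group_average_brute(ci, cj):
    """Average of sim(x, y) over all ordered pairs x != y in the merged cluster."""
    merged = ci + cj
    n = len(merged)
    total = sum(
        dot(merged[a], merged[b]) for a in range(n) for b in range(n) if a != b
    )
    return total / (n * (n - 1))


# Hypothetical document vectors, normalized to unit length.
ci = [normalize(v) for v in [(1.0, 0.0, 1.0), (2.0, 1.0, 0.0)]]
cj = [normalize(v) for v in [(0.0, 3.0, 1.0), (1.0, 1.0, 1.0), (0.5, 0.0, 2.0)]]

fast = group_average_fast(vector_sum(ci), len(ci), vector_sum(cj), len(cj))
brute = group_average_brute(ci, cj)
print(fast, brute)  # the two values agree up to rounding
```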
Exercise

Consider agglomerative clustering of n points on a line.
Can you avoid n^3 distance/similarity computations? How many
does your approach need?
Efficiency: “Using approximations”


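In the standard algorithm, every step must find the closest pair of
centroids
Approximate algorithm: find a nearly closest pair

Simplistic example: maintain the closest pair based on
distances in a projection onto a random line
[Figure: points projected onto a random line]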
Applications in IR
Navigating document collections
Index
Aardvark, 15
Blueberry, 200
Capricorn, 1, 45-55
Dog, 79-99
Egypt, 65
Falafel, 78-90
Giraffes, 45-59
…


Table of Contents
1. Science of Cognition
1.a. Motivations
1.a.i. Intellectual Curiosity
1.a.ii. Practical Applications
1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing
Information Retrieval: a book index
Document clusters: a table of contents
Scatter/Gather: Cutting, Karger, and Pedersen
For better navigation of search results
Navigating search results (2)



Cluster documents by the sense of a word
Cluster the search results (say Jaguar, or NLP) into groups of related documents
Can be viewed as a form of word sense disambiguation
For speeding up vector space retrieval




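Retrieval in the VSM requires finding the doc vectors closest
to the query vector
Computing the similarity of every doc in the collection to the query doc:
slow (for some applications)
An optimization: use an inverted index and compute similarity only for
docs that contain a term of the query doc
By clustering docs in corpus a priori

Compute similarity only on a subset: the cluster that the query doc falls in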
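Cluster-based pruning can be sketched in two steps: find the centroid most similar to the query, then rank only the documents in that cluster. The sketch assumes a precomputed clustering, dense vectors, and dot-product similarity. The document ids, vectors, and the `search` function are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def search(query, centroids, clusters, top_n=2):
    """Rank only the documents of the cluster whose centroid best matches the query.

    clusters[c] is a list of (doc_id, vector) pairs; centroids[c] is its centroid.
    """
    best = max(range(len(centroids)), key=lambda c: dot(query, centroids[c]))
    ranked = sorted(clusters[best], key=lambda item: dot(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]


# Hypothetical precomputed clustering of four documents.
clusters = [
    [("d1", (1.0, 0.0)), ("d2", (0.9, 0.1))],
    [("d3", (0.0, 1.0)), ("d4", (0.2, 0.8))],
]
centroids = [(0.95, 0.05), (0.1, 0.9)]

results = search((0.1, 1.0), centroids, clusters)
print(results)  # ['d3', 'd4']
```

The speed-up comes at a price: a relevant document that sits in a different cluster from the query is never scored.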
Resources

Weka 3 - Data Mining with
Open Source Machine
Learning Software in Java
Summary of this lecture

Text Clustering
Evaluation
 Purity, NMI, Rand Index
Partition Algorithm
 K-Means
  Reassignment
  Recomputation
Hierarchical Algorithm
 Cluster representation
 Closeness measure of a cluster pair
  Single link
  Complete link
  Average link
  Centroid
Thank You!
Q&A
Readings


[1] IIR Ch. 16.1-4, Ch. 17.1-4
[2] F. Beil, M. Ester, and X. Xu,
"Frequent term-based text clustering," in
Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining. Edmonton, Alberta, Canada:
ACM, 2002.
Cluster Labeling
Major issue - labeling


After clustering algorithm finds clusters - how can
they be useful to the end user?
Need pithy label for each cluster


In search results, say “Animal” or “Car” in the jaguar
example.
In topic trees (Yahoo), need navigational cues.
 Often done by hand, a posteriori.
How to Label Clusters

Show titles of typical documents




Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles which may not fully
represent cluster
Show words/phrases prominent in cluster



More likely to fully represent cluster
Use distinguishing words/phrases
 Differential labeling (think about Feature Selection)
But harder to scan
Labeling

Common heuristics: list the 5-10 most frequent
terms in the centroid vector.
Drop stop-words; stem.
Differential labeling by frequent terms
Within a collection “Computers”, clusters all have the
word computer as a frequent term.
Discriminant analysis of centroids.
Perhaps better: distinctive noun phrase
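Both labeling heuristics can be sketched on toy data. Frequent-term labeling takes the heaviest terms of the cluster centroid after dropping stop words. For differential labeling, the sketch scores each term by how much its weight in the cluster centroid exceeds its weight in the collection centroid. That scoring rule is an assumption chosen for simplicity, not the discriminant analysis mentioned above. Stemming is omitted, and the documents are hypothetical.

```python
def mean_weights(docs, stop_words):
    """Centroid of sparse term-frequency vectors, ignoring stop words."""
    totals = {}
    for doc in docs:
        for term, weight in doc.items():
            if term not in stop_words:
                totals[term] = totals.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in totals.items()}


def frequent_labels(cluster, stop_words, n):
    c = mean_weights(cluster, stop_words)
    return sorted(c, key=lambda t: (-c[t], t))[:n]


def differential_labels(cluster, collection, stop_words, n):
    c = mean_weights(cluster, stop_words)
    g = mean_weights(collection, stop_words)
    # Assumed score: excess of the cluster centroid over the collection centroid.
    score = {t: c[t] - g.get(t, 0.0) for t in c}
    return sorted(score, key=lambda t: (-score[t], t))[:n]


# Hypothetical "Computers" collection with two clusters.
graphics = [
    {"computer": 3, "the": 5, "graphics": 2},
    {"computer": 2, "the": 4, "graphics": 3, "render": 1},
]
networks = [
    {"computer": 3, "the": 6, "network": 2},
    {"computer": 2, "the": 3, "network": 3, "router": 1},
]
stop_words = {"the"}

frequent = frequent_labels(graphics, stop_words, n=2)
differential = differential_labels(graphics, graphics + networks, stop_words, n=2)
print(frequent)      # ['computer', 'graphics']
print(differential)  # ['graphics', 'render']
```

The word "computer" is frequent in every cluster, so it survives frequent-term labeling but drops out of the differential labels.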