Language Independent Methods of Clustering Similar Contexts (with applications)
Ted Pedersen
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
tpederse@d.umn.edu
EACL-2006 Tutorial
Language Independent Methods
- Do not utilize syntactic information
  - No parsers, part of speech taggers, etc. required
- Do not utilize dictionaries or other manually created lexical resources
- Based on lexical features selected from corpora
  - Assumption: word segmentation can be done by looking for white spaces between strings
- No manually annotated data of any kind, methods are completely unsupervised in the strictest sense
Clustering Similar Contexts
- A context is a short unit of text
  - often a phrase to a paragraph in length, although it can be longer
- Input: N contexts
- Output: K clusters
  - where each member of a cluster is more similar to the other members of its cluster than to the contexts found in other clusters
Applications
- Headed contexts (contain target word)
  - Name Discrimination
  - Word Sense Discrimination
- Headless contexts
  - Email Organization
  - Document Clustering
  - Paraphrase identification
- Clustering Sets of Related Words
Tutorial Outline
- Identifying lexical features
  - Measures of association & tests of significance
- Context representations
  - First & second order
- Dimensionality reduction
  - Singular Value Decomposition
- Clustering
  - Partitional techniques
  - Cluster stopping
  - Cluster labeling
- Hands On Exercises
General Info
- Please fill out short survey
- Break from 4:00-4:30pm
- Finish at 6pm
- Slides and video from the tutorial will be posted (I will send you email when that is ready)
- Questions are welcome
  - Now, or via email to me or the SenseClusters list
  - Comments, observations, criticisms are all welcome
- Knoppix CD will give you Linux and SenseClusters when your computer is booted from the CD
- Reception tonight at 7pm at Castle (?)
SenseClusters
- A package for clustering contexts
  - http://senseclusters.sourceforge.net
  - SenseClusters Live! (Knoppix CD)
- Integrates with various other tools
  - Ngram Statistics Package
  - CLUTO
  - SVDPACKC
Many thanks…
- Amruta Purandare (M.S., 2004)
  - Founding developer of SenseClusters (2002-2004)
  - Now a PhD student in Intelligent Systems at the University of Pittsburgh
  - http://www.cs.pitt.edu/~amruta/
- Anagha Kulkarni (M.S., 2006, expected)
  - Enhancing SenseClusters since Fall 2004!
  - http://www.d.umn.edu/~kulka020/
- National Science Foundation (USA) for supporting Amruta, Anagha and me via CAREER award #0092784
Background and Motivations
Headed and Headless Contexts
- A headed context includes a target word
  - Our goal is to cluster the target words based on their surrounding contexts
  - The target word is the center of the context and our attention
- A headless context has no target word
  - Our goal is to cluster the contexts based on their similarity to each other
  - The focus is on the context as a whole
Headed Contexts (input)
- I can hear the ocean in that shell.
- My operating system shell is bash.
- The shells on the shore are lovely.
- The shell command line is flexible.
- The oyster shell is very hard and black.
Headed Contexts (output)
- Cluster 1:
  - My operating system shell is bash.
  - The shell command line is flexible.
- Cluster 2:
  - The shells on the shore are lovely.
  - The oyster shell is very hard and black.
  - I can hear the ocean in that shell.
Headless Contexts (input)
- The new version of Linux is more stable and has better support for cameras.
- My Chevy Malibu has had some front end troubles.
- Osborne made one of the first personal computers.
- The brakes went out, and the car flew into the house.
- With the price of gasoline, I think I’ll be taking the bus more often!
Headless Contexts (output)
- Cluster 1:
  - The new version of Linux is more stable and has better support for cameras.
  - Osborne made one of the first personal computers.
- Cluster 2:
  - My Chevy Malibu has had some front end troubles.
  - The brakes went out, and the car flew into the house.
  - With the price of gasoline, I think I’ll be taking the bus more often!
Web Search as Application
- Web search results are headed contexts
  - The search term is the target word (found in snippets)
- Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”
- If you click on search results or follow links in pages found, you will encounter headless contexts too…
Name Discrimination
George Millers!
Email Foldering as Application
- Email (public or private) is made up of headless contexts
  - Short, usually focused…
- Cluster similar email messages together
  - Automatic email foldering
  - Take all messages from a sent-mail file or inbox and organize them into categories
Clustering News as Application
- News articles are headless contexts
  - Entire article or first paragraph
  - Short, usually focused
- Cluster similar articles together
What is it to be “similar”?
- You shall know a word by the company it keeps
  - Firth, 1957 (Studies in Linguistic Analysis)
- Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
  - Harris, 1968 (Mathematical Structures of Language)
- Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
  - Miller and Charles, 1991 (Language and Cognitive Processes)
- Various extensions…
  - Similar contexts will have similar meanings, etc.
  - Names that occur in similar contexts will refer to the same underlying person, etc.
General Methodology
- Represent contexts to be clustered using first or second order feature vectors
  - Lexical features
- Reduce dimensionality to make vectors more tractable and/or understandable
  - Singular value decomposition
- Cluster the context vectors
  - Find the number of clusters
  - Label the clusters
- Evaluate and/or use the contexts!
Identifying Lexical Features
Measures of Association and
Tests of Significance
What are features?
- Features represent the (hopefully) salient characteristics of the contexts to be clustered
- Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features
- Vectors/contexts that include many of the same features will be similar to each other
Where do features come from?
- In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
  - This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection
  - It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
    - Email or news articles
Feature Selection
- “Test” data – the contexts to be clustered
  - Assume that the feature selection data is the same as the test data, unless otherwise indicated
- “Training” data – a separate corpus of held out feature selection data (that will not be clustered)
  - May need to use this if you have a small number of contexts to cluster (e.g., web search results)
  - This sense of “training” is due to Schütze (1998)
Lexical Features
- Unigram – a single word that occurs more than a given number of times
- Bigram – an ordered pair of words that occur together more often than expected by chance
  - Consecutive or may have intervening words
- Co-occurrence – an unordered bigram
- Target Co-occurrence – a co-occurrence where one of the words is the target word
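To make these feature types concrete, here is a minimal sketch (in Python, not the Ngram Statistics Package itself) of collecting unigram counts with a frequency cutoff and bigram counts within a small window from whitespace-segmented text; the tokenizer and the threshold values are illustrative assumptions.

```python
# A minimal sketch (not NSP) of unigram and bigram feature counting from
# whitespace-segmented text, with a frequency cutoff and a bigram window.
from collections import Counter

def tokenize(text):
    # Assumption from the slides: words are whatever whitespace separates.
    return [w.strip(".,!?").lower() for w in text.split()]

def unigram_features(texts, min_count=2):
    # Unigrams: single words that occur more than a given number of times.
    counts = Counter(w for t in texts for w in tokenize(t))
    return {w: c for w, c in counts.items() if c >= min_count}

def bigram_counts(texts, window=2):
    # An ordered pair (w1, w2) is counted whenever w2 follows w1 within
    # `window` positions; window=2 means consecutive words only, window=3
    # allows one intervening word, and so on.
    counts = Counter()
    for t in texts:
        words = tokenize(t)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + window]:
                counts[(w1, w2)] += 1
    return counts
```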
Bigrams
- fine wine (window size of 2)
- baseball bat
- house of representatives (window size of 3)
- president of the republic (window size of 4)
- apple orchard
- Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)
Co-occurrences
- tropics water
- boat fish
- law president
- train travel
- Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations
Bigrams and Co-occurrences
- Pairs of words tend to be much less ambiguous than unigrams
  - “bank” versus “river bank” and “bank card”
  - “dot” versus “dot com” and “dot product”
- Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)
- Unigrams are noisy, but bountiful
“occur together more often than expected by chance…”
- Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
  - Throw out bigrams that include one or two stop words
- Expected values are calculated, based on the model of independence and the observed values
  - How often would you expect these words to occur together, if they only occurred together by chance?
  - If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.
2x2 Contingency Table

               Intelligence   !Intelligence
Artificial         100                            400
!Artificial
                   300                        100,000
2x2 Contingency Table

               Intelligence   !Intelligence
Artificial         100             300            400
!Artificial        200          99,400         99,600
                   300          99,700        100,000
2x2 Contingency Table

Observed values, with expected values under independence in parentheses:

               Intelligence          !Intelligence
Artificial     100.0    (1.2)        300.0    (398.8)          400
!Artificial    200.0  (298.8)        99,400.0 (99,301.2)    99,600
               300                   99,700               100,000
Measures of Association

$$ G^2 = 2 \sum_{i,j=1}^{2} \mathrm{observed}(w_i, w_j) \times \log \frac{\mathrm{observed}(w_i, w_j)}{\mathrm{expected}(w_i, w_j)} $$

$$ X^2 = \sum_{i,j=1}^{2} \frac{[\mathrm{observed}(w_i, w_j) - \mathrm{expected}(w_i, w_j)]^2}{\mathrm{expected}(w_i, w_j)} $$
Measures of Association

G^2 = 750.88
X^2 = 8191.78
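The following minimal sketch shows the whole 2x2 calculation: expected values under the model of independence, then G^2 (log-likelihood ratio) and X^2 (Pearson's chi-squared). Plugging in the observed counts from the "artificial intelligence" table above reproduces the two scores (approximately).

```python
# A minimal sketch of the 2x2 calculation: expected values under
# independence, then G^2 and X^2.
from math import log

def expected(observed):
    # observed = [[n11, n12], [n21, n22]]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    return [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

def g2(observed):
    exp = expected(observed)
    return 2 * sum(observed[i][j] * log(observed[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if observed[i][j] > 0)

def x2(observed):
    exp = expected(observed)
    return sum((observed[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

table = [[100, 300], [200, 99400]]     # the "artificial intelligence" table
print(round(g2(table), 2), round(x2(table), 2))   # roughly 750.88 and 8191.78
```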
Interpreting the Scores…
- G^2 and X^2 are asymptotically approximated by the chi-squared distribution…
- This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get a chi-squared distribution
Interpreting the Scores…
- Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
  - H0: the words in the bigram are independent
  - 3.841 is associated with 95% confidence that the null hypothesis should be rejected
Measures of Association
- There are numerous measures of association that can be used to identify bigram and co-occurrence features
- Many of these are supported in the Ngram Statistics Package (NSP)
  - http://www.d.umn.edu/~tpederse/nsp.html
Measures Supported in NSP
- Log-likelihood Ratio (ll)
- True Mutual Information (tmi)
- Pearson’s Chi-squared Test (x2)
- Pointwise Mutual Information (pmi)
- Phi coefficient (phi)
- T-test (tscore)
- Fisher’s Exact Test (leftFisher, rightFisher)
- Dice Coefficient (dice)
- Odds Ratio (odds)
Summary
- Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data
  - Language independent
  - Remember, there is no labeled data from which to learn, so features are somewhat less effective than in the supervised case
- Unigrams are usually only selected by frequency
- Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association
  - Bigrams and co-occurrences need not be consecutive
  - Stop words should be eliminated
  - Frequency thresholds are helpful (e.g., a unigram/bigram that occurs once may be too rare to be useful)
Related Work
- Moore, 2004 (EMNLP) – follow-up to Dunning and Pedersen on log-likelihood and exact tests
  - http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf
- Pedersen, 1996 (SCSUG) – explanation of exact tests, and comparison to log-likelihood
  - http://arxiv.org/abs/cmp-lg/9608010
  - (also see Pedersen, Kayaalp, and Bruce, AAAI-1996)
- Dunning, 1993 (Computational Linguistics) – introduces the log-likelihood ratio for collocation identification
  - http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
Context Representations
First and Second Order Methods
Once features selected…
- We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful
  - We also have any frequency and measure of association scores that have been used in their selection
- Convert the contexts to be clustered into a vector representation based on these features
First Order Representation
- Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context
  - Value may be binary, a frequency count, or an association score
- Context by Feature representation
Contexts
- Cxt1: There was an island curse of black magic cast by that voodoo child.
- Cxt2: Harold, a known voodoo child, was gifted in the arts of black magic.
- Cxt3: Despite their military might, it was a serious error to attack.
- Cxt4: Military might is no defense against a voodoo child or an island curse.
Unigram Feature Set

island   1000
black     700
curse     500
magic     400
child     200

(assume these are frequency counts obtained from some corpus…)
First Order Vectors of Unigrams

        island   black   curse   magic   child
Cxt1       1       1       1       1       1
Cxt2       0       1       0       1       1
Cxt3       0       0       0       0       0
Cxt4       1       0       1       0       1
Bigram Feature Set

island curse      189.2
black magic       123.5
voodoo child      120.0
military might    100.3
serious error      89.2
island child       73.2
voodoo might       69.4
military error     54.9
black child        43.2
serious curse      21.2

(assume these are log-likelihood scores based on frequency counts from some corpus)
First Order Vectors of Bigrams

        black magic   island curse   military might   serious error   voodoo child
Cxt1         1              1               0                0               1
Cxt2         1              0               0                0               1
Cxt3         0              0               1                1               0
Cxt4         0              1               1                0               1
First Order Vectors
- Can have binary values or weights associated with frequency, etc.
- Forms a context by feature matrix
- May optionally be smoothed/reduced with Singular Value Decomposition
  - More on that later…
- The contexts are ready for clustering…
  - More on that later…
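To make the construction concrete, here is a minimal sketch (not SenseClusters itself) that rebuilds the binary context by feature matrix of bigram features shown above from the four example contexts; the simple whitespace tokenizer is an assumption.

```python
# A minimal sketch of building first order context-by-feature vectors from
# the example contexts and the bigram feature set.

contexts = {
    "Cxt1": "There was an island curse of black magic cast by that voodoo child.",
    "Cxt2": "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Cxt3": "Despite their military might, it was a serious error to attack.",
    "Cxt4": "Military might is no defense against a voodoo child or an island curse.",
}

features = ["black magic", "island curse", "military might",
            "serious error", "voodoo child"]

def tokenize(text):
    # Whitespace segmentation, lower-cased, punctuation stripped.
    return [w.strip(".,!?").lower() for w in text.split()]

def first_order_vector(text, features):
    # 1 if the bigram occurs (consecutively) in the context, else 0.
    words = tokenize(text)
    bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
    return [1 if f in bigrams else 0 for f in features]

for name, text in contexts.items():
    print(name, first_order_vector(text, features))
# Reproduces the binary context-by-feature matrix shown above.
```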
Second Order Features
- First order features encode the occurrence of a feature in a context
  - Feature occurrence represented by a binary value
- Second order features encode something ‘extra’ about a feature that occurs in a context
  - Feature occurrence represented by word co-occurrences
  - Feature occurrence represented by context occurrences
Second Order Representation
- First, build a word by word matrix from the features
  - Based on bigrams or co-occurrences
  - First word is the row, second word is the column, the cell is the score
  - (optionally) reduce dimensionality w/ SVD
  - Each row forms a vector of first order co-occurrences
- Second, replace each word in a context with its row/vector as found in the word by word matrix
- Average all the word vectors in the context to create the second order representation
  - Due to Schütze (1998), related to LSI/LSA
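A minimal sketch of these two steps (not the SenseClusters implementation, and skipping the optional SVD step): build the word by word matrix from the scored bigrams, then average the row vectors of the words found in a context.

```python
# A minimal sketch of the second order representation: a word-by-word matrix
# from scored bigrams, then the average of the row vectors of a context's words.

bigram_scores = {            # log-likelihood scores from the earlier slide
    ("island", "curse"): 189.2, ("black", "magic"): 123.5,
    ("voodoo", "child"): 120.0, ("military", "might"): 100.3,
    ("serious", "error"): 89.2, ("island", "child"): 73.2,
    ("voodoo", "might"): 69.4,  ("military", "error"): 54.9,
    ("black", "child"): 43.2,   ("serious", "curse"): 21.2,
}

rows = sorted({w1 for w1, _ in bigram_scores})   # first words (matrix rows)
cols = sorted({w2 for _, w2ings in [(0, 0)] for w2 in []} | {w2 for _, w2 in bigram_scores})  # second words

# word-by-word matrix: row word -> vector of scores over the column words
word_vectors = {r: [bigram_scores.get((r, c), 0.0) for c in cols] for r in rows}

def second_order(context, word_vectors, dims):
    # Replace each word that has a row vector with that vector, then average.
    words = [w.strip(".,!?").lower() for w in context.split()]
    found = [word_vectors[w] for w in words if w in word_vectors]
    if not found:
        return [0.0] * dims
    return [sum(v[i] for v in found) / len(found) for i in range(dims)]

cxt1 = "There was an island curse of black magic cast by that voodoo child."
print(cols)
print(second_order(cxt1, word_vectors, len(cols)))
# Averages the rows for "island", "black" and "voodoo", giving a vector close
# to the one on the worked example slide below.
```

(The line computing `cols` should simply collect the second words of the bigrams, i.e., `cols = sorted({w2 for _, w2 in bigram_scores})`.)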
Word by Word Matrix

            magic   curse   might   error   child
black       123.5     0       0       0      43.2
island        0     189.2     0       0      73.2
military      0       0     100.3    54.9     0
serious       0      21.2     0      89.2     0
voodoo        0       0      69.4     0     120.0
Word by Word Matrix
- …can also be used to identify sets of related words
- In the case of bigrams, rows represent the first word in a bigram and columns represent the second word
  - Matrix is asymmetric
- In the case of co-occurrences, rows and columns are equivalent
  - Matrix is symmetric
- The vector (row) for each word represents a set of first order features for that word
- Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context
There was an island curse of black magic cast by that voodoo child.

            magic   curse   might   error   child
black       123.5     0       0       0      43.2
island        0     189.2     0       0      73.2
voodoo        0       0      69.4     0     120.0
Second Order Co-Occurrences
- Word vectors for “black” and “island” show similarity as both occur with “child”
- “black” and “island” are second order co-occurrences of each other, since both occur with “child” but not with each other (i.e., “black island” is not observed)
Second Order Representation
- There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child
- [curse, child] + [magic, child] + [might, child]
There was an island curse of black magic cast by that voodoo child.

        magic   curse   might   error   child
Cxt1     41.2    63.1    24.4     0      78.8
Second Order Representation
- Results in a Context by Feature (Word) Representation
- Cell values do not indicate whether the feature occurred in the context. Rather, they show the strength of association of that feature with other words that occur with a word in the context.
Summary
- First order representations are intuitive, but…
  - Can suffer from sparsity
  - Contexts are represented based only on the features that occur in those contexts
- Second order representations are harder to visualize, but…
  - Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps)
  - Allow a context to be represented by the words that occur with the words in the context
  - Help combat sparsity…
Related Work
- Pedersen and Bruce 1997 (EMNLP) – presented a first order method of discrimination
  - http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
- Schütze 1998 (Computational Linguistics) – introduced the second order method
  - http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
- Purandare and Pedersen 2004 (CoNLL) – compared first and second order methods
  - http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
  - First order better if you have lots of data
  - Second order better with smaller amounts of data
Dimensionality Reduction
Singular Value Decomposition
Motivation
- First order matrices are very sparse
  - Context by feature
  - Word by word
- NLP data is noisy
  - No stemming performed
  - synonyms
Many Methods
- Singular Value Decomposition (SVD)
  - SVDPACKC http://www.netlib.org/svdpack/
- Multi-Dimensional Scaling (MDS)
- Principal Components Analysis (PCA)
- Independent Components Analysis (ICA)
- Linear Discriminant Analysis (LDA)
- etc…
Effect of SVD
- SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space
  - If “dog”, “collie” and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be collapsed into a single dimension that represents “canines”
Effect of SVD
- The dimensions of the matrix after SVD are principal components that represent the meaning of concepts
  - Similar columns are grouped together
- SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD
How can SVD be used?
- SVD on first order contexts will reduce a context by feature representation down to a smaller number of features
  - Latent Semantic Analysis typically performs SVD on a feature by context representation, where the contexts are reduced
- SVD is used in creating second order context representations
  - Reduce the word by word matrix
Word by Word Matrix

          apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc          2      0      0     1    3     1     0        0         0      0      0
body        0      3      0     0    0     0     2        0         0      2      1
disk        1      0      0     2    0     3     0        1         2      0      0
petri       0      2      1     0    0     0     2        0         1      0      1
lab         0      0      3     0    2     0     2        0         2      1      3
sales       0      0      0     2    3     0     0        1         2      0      0
linux       2      0      0     1    3     2     0        1         1      0      0
debt        0      0      0     2    3     4     0        2         0      0      0
Singular Value Decomposition
A=UDV’
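A minimal sketch of this decomposition using numpy (the tutorial itself relies on SVDPACKC): compute U, D and V', then keep only the largest singular values to get a smoothed, low-rank version of the word by word matrix. The choice of rank k = 2 here is only an illustrative assumption.

```python
# A minimal sketch (numpy, not SVDPACKC) of A = U D V' and a low-rank rebuild.
import numpy as np

A = np.array([              # rows: pc, body, disk, petri, lab, sales, linux, debt
    [2,0,0,1,3,1,0,0,0,0,0],  # columns: apple blood cells ibm data box tissue
    [0,3,0,0,0,0,2,0,0,2,1],  #          graphics memory organ plasma
    [1,0,0,2,0,3,0,1,2,0,0],
    [0,2,1,0,0,0,2,0,1,0,1],
    [0,0,3,0,2,0,2,0,2,1,3],
    [0,0,0,2,3,0,0,1,2,0,0],
    [2,0,0,1,3,2,0,1,1,0,0],
    [0,0,0,2,3,4,0,2,0,0,0],
], dtype=float)

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V'
print(np.round(d, 2))        # singular values (the diagonal of D)

k = 2                        # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))      # smoothed word-by-word matrix, few zero cells
```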
U (matrix of left singular vectors)
D = diag(9.19, 6.36, 3.99, 3.25, 2.52, 2.30, 1.26, 0.66, 0.00, 0.00, 0.00)
V (matrix of right singular vectors)
Word by Word Matrix After SVD

          apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
pc         .73    .00    .11   1.3  2.0    .01      .86      .77     .00    .09
body       .00    1.2    1.3   .00  .33    1.6      .00      .85     .84    1.5
disk       .76    .00    .01   1.3  2.1    .00      .91      .72     .00    .00
germ       .00    1.1    1.2   .00  .49    1.5      .00      .86     .77    1.4
lab        .21    1.7    2.0   .35  1.7    2.5      .18      1.7     1.2    2.3
sales      .73    .15    .39   1.3  2.2    .35      .85      .98     .17    .41
linux      .96    .00    .16   1.7  2.7    .03      1.1      1.0     .00    .13
debt       1.2    .00    .00   2.1  3.2    .00      1.5      1.1     .00    .00
Second Order Representation
- I got a new disk today!
- What do you think of linux?

          apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
disk       .76    .00    .01   1.3  2.1    .00      .91      .72     .00    .00
linux      .96    .00    .16   1.7  2.7    .03      1.1      1.0     .00    .13

- These two contexts share no words in common, yet they are similar! disk and linux both occur with “apple”, “ibm”, “data”, “graphics”, and “memory”
- The two contexts are similar because they share many second order co-occurrences
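To see that similarity numerically, here is a small sketch that compares the two rows above (each context's second order vector is essentially the row of its one known word) using the cosine measure.

```python
# A minimal sketch: cosine similarity of the "disk" and "linux" row vectors.
import math

disk  = [.76, .00, .01, 1.3, 2.1, .00, .91, .72, .00, .00]
linux = [.96, .00, .16, 1.7, 2.7, .03, 1.1, 1.0, .00, .13]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

print(round(cosine(disk, linux), 3))   # close to 1, even with no shared words
```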
Relationship to LSA
- Latent Semantic Analysis uses a feature by context first order representation
  - Indicates all the contexts in which a feature occurs
- Use SVD to reduce the dimensions (contexts)
- Cluster features based on the similarity of the contexts in which they occur
- Represent sentences using an average of feature vectors
Feature by Context Representation

                  Cxt1   Cxt2   Cxt3   Cxt4
black magic         1      1      0      1
island curse        1      0      0      1
military might      0      0      1      0
serious error       0      0      1      0
voodoo child        1      1      0      1
References
- Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, 1990
- Landauer, T. and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review, vol. 104, 1997
- Schütze, H., Automatic Word Sense Discrimination, Computational Linguistics, vol. 24, 1998
- Berry, M.W., Drmac, Z., and Jessup, E.R., Matrices, Vector Spaces, and Information Retrieval, SIAM Review, vol. 41, 1999
Clustering
Partitional Methods
Cluster Stopping
Cluster Labeling
Many many methods…
- Cluto supports a wide range of different clustering methods
  - Agglomerative
    - Average, single, complete link…
  - Partitional
    - K-means (Direct)
  - Hybrid
    - Repeated bisections
- SenseClusters integrates with Cluto
  - http://www-users.cs.umn.edu/~karypis/cluto/
General Methodology
- Represent contexts to be clustered in first or second order vectors
- Cluster the context vectors directly
  - vcluster
- … or convert to a similarity matrix and then cluster
  - scluster
Partitional Methods
- Randomly create centroids equal to the number of clusters you wish to find
- Assign each context to the nearest centroid
- After all contexts are assigned, re-compute the centroids
  - “best” location decided by a criterion function
- Repeat until stable clusters are found
  - Centroids don’t shift from iteration to iteration
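A minimal k-means sketch of this procedure (not CLUTO's implementation; it uses Euclidean distance and plain Python lists, and the convergence test is simply that the centroids stop moving).

```python
# A minimal k-means sketch: random centroids, assign, re-compute, repeat.
import random

def kmeans(vectors, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)              # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each context vector to its nearest centroid (Euclidean)
        clusters = [[] for _ in range(k)]
        for v in vectors:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(v, centroids[c])))
            clusters[idx].append(v)
        # re-compute each centroid as the mean of its assigned vectors
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:              # stable: centroids stopped moving
            break
        centroids = new_centroids
    return clusters, centroids
```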
Partitional Methods
- Advantages: fast
- Disadvantages:
  - Results can be dependent on the initial placement of centroids
  - Must specify the number of clusters ahead of time
    - maybe not…
Vectors to be clustered
Random Initial Centroids (k=2)
Assignment of Clusters
Recalculation of Centroids
Reassignment of Clusters
Recalculation of Centroid
Reassignment of Clusters
Partitional Criterion Functions
- Intra-Cluster (Internal) similarity/distance
  - How close together are members of a cluster?
  - Closer together is better
- Inter-Cluster (External) similarity/distance
  - How far apart are the different clusters?
  - Further apart is better
Intra Cluster Similarity
- Ball of String (I1)
  - How far is each member from each other member
- Flower (I2)
  - How far is each member of the cluster from the centroid
Contexts to be Clustered
Ball of String (I1 Internal Criterion Function)
Flower (I2 Internal Criterion Function)
Inter Cluster Similarity
- The Fan (E1)
  - How far is each centroid from the centroid of the entire collection of contexts
  - Maximize that distance
The Fan (E1 External Criterion Function)
Hybrid Criterion Functions
- Balance internal and external similarity
  - H1 = I1/E1
  - H2 = I2/E1
- Want internal similarity to increase, while external similarity decreases
- Want internal distances to decrease, while external distances increase
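The rough sketch below is only meant to illustrate the flavor of these criterion functions; the exact definitions used by CLUTO differ in weighting and normalization, so treat I2, E1 and H2 = I2/E1 here as assumptions made for illustration.

```python
# A rough illustration (not CLUTO's definitions) of internal and external
# similarity and the hybrid ratio H2 = I2 / E1.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def criteria(clusters):
    everything = [v for cl in clusters for v in cl]
    c_all = centroid(everything)
    # I2: similarity of each member to its own cluster centroid (higher = tighter)
    I2 = sum(cosine(v, centroid(cl)) for cl in clusters for v in cl)
    # E1: similarity of each cluster centroid to the collection centroid
    #     (lower = centroids further from the middle, i.e., better separated)
    E1 = sum(len(cl) * cosine(centroid(cl), c_all) for cl in clusters)
    return I2, E1, I2 / E1            # H2 = I2 / E1
```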
Cluster Stopping
Cluster Stopping
- Many clustering algorithms require that the user specify the number of clusters prior to clustering
- But the user often doesn’t know the number of clusters, and in fact finding that out might be the goal of clustering
Criterion Functions Can Help
- Run the partitional algorithm for k = 1 to deltaK
  - deltaK is a user estimated or automatically determined upper bound for the number of clusters
- Find the value of k at which the criterion function does not significantly increase at k+1
- Clustering can stop at this value, since no further improvement in the solution is apparent with additional clusters (increases in k)
SenseClusters’ Approach to Cluster Stopping
- Will be the subject of a Demo at EACL
  - Demo Session 2, 5th April, 14:30-16:00
  - Ted Pedersen and Anagha Kulkarni: Selecting the "Right" Number of Senses Based on Clustering Criterion Functions
H2 versus k
T. Blair – V. Putin – S. Hussein
PK2
- Based on Hartigan, 1975
- When the ratio approaches 1, clustering is at a plateau
- Select the value of k which is closest to but outside of the standard deviation interval

$$ PK2(k) = \frac{H2(k)}{H2(k-1)} $$
PK2 predicts 3 senses
T. Blair – V. Putin – S. Hussein
PK3
- Related to Salvador and Chan, 2004
- Inspired by the Dice Coefficient
- Values close to 1 mean clustering is improving …
- Select the value of k which is closest to but outside of the standard deviation interval

$$ PK3(k) = \frac{2 \times H2(k)}{H2(k-1) + H2(k+1)} $$
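A minimal sketch of PK2 and PK3 as stated on these slides; using the plain sample standard deviation for the interval and the example H2 values are assumptions made only for illustration.

```python
# A minimal sketch of PK2 / PK3 style cluster stopping from H2(k) scores.
import statistics

def pk2(h2):
    # h2 maps k -> H2(k); PK2(k) = H2(k) / H2(k-1)
    return {k: h2[k] / h2[k - 1] for k in sorted(h2) if k - 1 in h2}

def pk3(h2):
    # PK3(k) = 2 * H2(k) / (H2(k-1) + H2(k+1))
    return {k: 2 * h2[k] / (h2[k - 1] + h2[k + 1])
            for k in sorted(h2) if k - 1 in h2 and k + 1 in h2}

def select_k(scores):
    # pick the k whose score is closest to 1 while lying outside the
    # one-standard-deviation band around 1
    band = statistics.stdev(list(scores.values()))
    outside = {k: s for k, s in scores.items() if abs(s - 1) > band}
    return min(outside, key=lambda k: abs(outside[k] - 1)) if outside else None

h2 = {1: 0.50, 2: 0.71, 3: 0.85, 4: 0.86, 5: 0.87}   # hypothetical H2 values
print(select_k(pk2(h2)))
print(select_k(pk3(h2)))
```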
PK3 predicts 3 senses
T. Blair – V. Putin – S. Hussein
References
- Hartigan, J., Clustering Algorithms, Wiley, 1975
  - basis for SenseClusters stopping method PK2
- Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An Evaluation, The Computer Journal, vol. 20, 1977
  - basis for SenseClusters stopping method PK1
- Milligan, G. and Cooper, M., An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, vol. 50, 1985
  - Very extensive comparison of cluster stopping methods
- Tibshirani, R., Walther, G., and Hastie, T., Estimating the Number of Clusters in a Dataset via the Gap Statistic, Journal of the Royal Statistical Society (Series B), 2001
- Pedersen, T. and Kulkarni, A., Selecting the "Right" Number of Senses Based on Clustering Criterion Functions, Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006
  - Describes the SenseClusters stopping methods
Cluster Labeling
Cluster Labeling
- Once a cluster is discovered, how can you generate a description of the contexts of that cluster automatically?
- In the case of contexts, you might be able to identify significant lexical features from the contents of the clusters, and use those as a preliminary label
Results of Clustering
- Each cluster consists of some number of contexts
- Each context is a short unit of text
- Apply measures of association to the contents of each cluster to determine the N most significant bigrams
- Use those bigrams as a label for the cluster
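A minimal sketch of this labeling step (the helper names are hypothetical, and this is not the SenseClusters implementation); the `score` argument stands in for whatever measure of association is applied to the cluster's bigram counts, such as a G^2 scorer like the one sketched earlier.

```python
# A minimal sketch of descriptive and discriminating cluster labels built
# from each cluster's most significant bigrams.
from collections import Counter

def tokenize(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def bigram_counts(contexts):
    counts = Counter()
    for text in contexts:
        words = tokenize(text)
        counts.update(zip(words, words[1:]))
    return counts

def descriptive_label(cluster_contexts, score, n=5):
    # rank each bigram in the cluster by its association score, keep the top N
    counts = bigram_counts(cluster_contexts)
    return sorted(counts, key=score, reverse=True)[:n]   # score: bigram -> float

def discriminating_label(labels, this_cluster, m=5):
    # keep only bigrams that appear in no other cluster's descriptive label
    others = set().union(*(set(l) for c, l in labels.items() if c != this_cluster))
    return [b for b in labels[this_cluster] if b not in others][:m]
```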
Label Types
- The N most significant bigrams for each cluster will act as a descriptive label
- The M most significant bigrams that are unique to each cluster will act as a discriminating label
Baseline Algorithm
- Baseline Algorithm – group all instances into one cluster; this will reach an “accuracy” equal to the majority classifier
- What if the clustering said everything should be in the same cluster?
Baseline Performance

Columns ordered S1 S2 S3:

          S1   S2   S3   Totals
C1         0    0    0      0
C2         0    0    0      0
C3        80   35   55    170
Totals    80   35   55    170

(0+0+55)/170 = .32   if C3 is S3

Columns reordered S3 S2 S1:

          S3   S2   S1   Totals
C1         0    0    0      0
C2         0    0    0      0
C3        55   35   80    170
Totals    55   35   80    170

(0+0+80)/170 = .47   if C3 is S1
Things to Try
- Feature Identification
  - Type of feature
  - Measures of association
- Context Representation (1st or 2nd order)
- Automatic Stopping (or not)
- SVD (or not)
- Clustering Algorithm and Criterion Function
- Evaluation
- Labeling
Thank you!
- Questions or comments on the tutorial or SenseClusters are welcome at any time: tpederse@d.umn.edu
- SenseClusters is freely available via LIVE CD, the Web, and in source code form
  - http://senseclusters.sourceforge.net
- SenseClusters papers available at:
  - http://www.d.umn.edu/~tpederse/senseclusters-pubs.html