Language Independent Methods
of Clustering Similar Contexts
(with applications)
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse/SCTutorial.html
July 17, 2006
AAAI-2006 Tutorial
Language Independent Methods
• Do not utilize syntactic information
– No parsers, part of speech taggers, etc. required
• Do not utilize dictionaries or other manually
created lexical resources
• Based on lexical features selected from corpora
– Assumption: word segmentation can be done by
looking for white spaces between strings
• No manually annotated data of any kind,
methods are completely unsupervised in the
strictest sense
Clustering Similar Contexts
• A context is a short unit of text
– often a phrase to a paragraph in length,
although it can be longer
• Input: N contexts
• Output: K clusters
– Where the contexts in each cluster are more similar
to each other than to the contexts found in other clusters
Applications
• Headed contexts (contain target word)
– Name Discrimination
– Word Sense Discrimination
• Headless contexts
– Email Organization
– Document Clustering
– Paraphrase identification
• Clustering Sets of Related Words
Tutorial Outline
• Identifying lexical features
– Measures of association & tests of significance
• Context representations
– First & second order
• Dimensionality reduction
– Singular Value Decomposition
• Clustering
– Partitional techniques
– Cluster stopping
– Cluster labeling
• Evaluation
SenseClusters
• A package for clustering contexts
– http://senseclusters.sourceforge.net
– SenseClusters Live! (Knoppix CD)
• Integrates with various other tools
– Ngram Statistics Package
– CLUTO
– SVDPACKC
Many thanks…
• Amruta Purandare (M.S., 2004)
– Founding developer of SenseClusters (2002-2004)
– Now PhD student in Intelligent Systems at the
University of Pittsburgh http://www.cs.pitt.edu/~amruta/
• Anagha Kulkarni (M.S., 2006, expected)
– Enhancing SenseClusters since Fall 2004!
– Will start as PhD student at CMU/LTI in Fall 2006
http://www.d.umn.edu/~kulka020/
• NSF for supporting Amruta, Anagha and Ted via
CAREER award #0092784
Background and Motivations
Headed and Headless Contexts
• A headed context includes a target word
– Our goal is to cluster the target words based
on their surrounding contexts
– Target word is center of context and our
attention
• A headless context has no target word
– Our goal is to cluster the contexts based on
their similarity to each other
– The focus is on the context as a whole
Headed Contexts (input)
• I can hear the ocean in that shell.
• My operating system shell is bash.
• The shells on the shore are lovely.
• The shell command line is flexible.
• The oyster shell is very hard and black.
Headed Contexts (output)
• Cluster 1:
– My operating system shell is bash.
– The shell command line is flexible.
• Cluster 2:
– The shells on the shore are lovely.
– The oyster shell is very hard and black.
– I can hear the ocean in that shell.
Headless Contexts (input)
• The new version of Linux is more stable and has
better support for cameras.
• My Chevy Malibu has had some front end
troubles.
• Osborne made one of the first personal
computers.
• The brakes went out, and the car flew into the
house.
• With the price of gasoline, I think I’ll be taking
the bus more often!
Headless Contexts (output)
• Cluster 1:
– The new version of Linux is more stable and has better
support for cameras.
– Osborne made one of the first personal computers.
• Cluster 2:
– My Chevy Malibu has had some front end troubles.
– The brakes went out, and the car flew into the house.
– With the price of gasoline, I think I’ll be taking the bus
more often!
Web Search as Application
• Web search results are headed contexts
– Search term is target word (found in snippets)
• Web search results are often disorganized – two
people sharing same name, two organizations
sharing same abbreviation, etc. often have their
pages “mixed up”
• If you click on search results or follow links in
pages found, you will encounter headless
contexts too…
Email Foldering as Application
• Email (public or private) is made up of
headless contexts
– Short, usually focused…
• Cluster similar email messages together
– Automatic email foldering
– Take all messages from sent-mail file or inbox
and organize into categories
Clustering News as Application
• News articles are headless contexts
– Entire article or first paragraph
– Short, usually focused
• Cluster similar articles together
What is it to be “similar”?
• You shall know a word by the company it keeps
– Firth, 1957 (Studies in Linguistic Analysis)
• Meanings of words are (largely) determined by their
distributional patterns (Distributional Hypothesis)
– Harris, 1968 (Mathematical Structures of Language)
• Words that occur in similar contexts will have similar
meanings (Strong Contextual Hypothesis)
– Miller and Charles, 1991 (Language and Cognitive Processes)
• Various extensions…
– Similar contexts will have similar meanings, etc.
– Names that occur in similar contexts will refer to the same
underlying person, etc.
General Methodology
• Represent contexts to be clustered using first or
second order feature vectors
– Lexical features
• Reduce dimensionality to make vectors more
tractable and/or understandable
– Singular value decomposition
• Cluster the context vectors
– Find the number of clusters
– Label the clusters
• Evaluate and/or use the contexts!
Identifying Lexical Features
Measures of Association and
Tests of Significance
What are features?
• Features represent the (hopefully) salient
characteristics of the contexts to be
clustered
• Eventually we will represent each context
as a vector, where the dimensions of the
vector are associated with features
• Vectors/contexts that include many of the
same features will be similar to each other
Where do features come from?
• In unsupervised clustering, it is common for the
feature selection data to be the same data that is
to be clustered
– This is not cheating, since data to be clustered does
not have any labeled classes that can be used to
assist feature selection
– It may also be necessary, since we may need to
cluster all available data, and not hold out some for a
separate feature identification step
• Email or news articles
Feature Selection
• “Test” data – the contexts to be clustered
– Assume that the feature selection data is the same as
the test data, unless otherwise indicated
• “Training” data – a separate corpus of held out
feature selection data (that will not be clustered)
– may need to use if you have a small number of
contexts to cluster (e.g., web search results)
– This sense of “training” due to Schütze (1998)
Lexical Features
• Unigram – a single word that occurs more than a
given number of times
• Bigram – an ordered pair of words that occur
together more often than expected by chance
– Consecutive or may have intervening words
• Co-occurrence – an unordered bigram
• Target Co-occurrence – a co-occurrence where
one of the words is the target word
Bigrams
• fine wine (window size of 2)
• baseball bat
• house of representatives (window size of 3)
• president of the republic (window size of 4)
• apple orchard
• Selected using a small window size (2-4 words),
trying to capture a regular (localized) pattern
between two words (collocation?)
Co-occurrences
• tropics water
• boat fish
• law president
• train travel
• Usually selected using a larger window (7-10
words) of context, hoping to capture pairs of
related words rather than collocations
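As a rough illustration of both feature types (a toy Python sketch, not the NSP implementation), ordered bigrams can be counted in a small window and unordered co-occurrences in a larger one, with white space as the only segmentation:

```python
# Toy sketch: count ordered bigrams (small window) and unordered
# co-occurrences (larger window) from whitespace-tokenized text.
from collections import Counter

def count_pairs(tokens, window, ordered):
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # w2 may be up to (window - 1) tokens to the right of w1
        for j in range(i + 1, min(i + window, len(tokens))):
            w2 = tokens[j]
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            counts[pair] += 1
    return counts

text = "the president of the republic met the president of the senate"
tokens = text.split()                                     # segmentation = white space
bigrams = count_pairs(tokens, window=4, ordered=True)     # ordered pairs, window size 4
cooccur = count_pairs(tokens, window=8, ordered=False)    # unordered pairs, larger window
print(bigrams[("president", "republic")])                 # 1
print(cooccur[("met", "president")])                      # 2, order ignored
```

Pairs that survive stop word removal and a frequency cutoff would then be scored with a measure of association.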
Bigrams and Co-occurrences
• Pairs of words tend to be much less
ambiguous than unigrams
– “bank” versus “river bank” and “bank card”
– “dot” versus “dot com” and “dot product”
• Trigrams and longer Ngrams occur much less
frequently (Ngrams are very Zipfian)
• Unigrams are noisy, but bountiful
“occur together more often than
expected by chance…”
• Observed frequencies for two words occurring
together and alone are stored in a 2x2 matrix
– Throw out bigrams that include one or two stop words
• Expected values are calculated, based on the
model of independence and observed values
– How often would you expect these words to occur
together, if they only occurred together by chance?
– If two words occur “significantly” more often than the
expected value, then the words do not occur together
by chance.
2x2 Contingency Table
(observed counts, with expected values under independence in parentheses)

              Intelligence      !Intelligence          Totals
Artificial    100.0 (1.2)       300.0 (398.8)             400
!Artificial   200.0 (298.8)     99,400.0 (99,301.2)    99,600
Totals        300               99,700                100,000
Measures of Association
G^2 = 2 * Σ_{i,j} observed(w_i, w_j) * log [ observed(w_i, w_j) / expected(w_i, w_j) ]

X^2 = Σ_{i,j} [ observed(w_i, w_j) - expected(w_i, w_j) ]^2 / expected(w_i, w_j)
Interpreting the Scores…
• G^2 and X^2 are asymptotically approximated
by the chi-squared distribution…
• This means…if you fix the marginal totals of a
table, randomly generate internal cell values in
the table, calculate the G^2 or X^2 scores for
each resulting table, and plot the distribution of
the scores, you *should* get …
Interpreting the Scores…
• Values above a certain level of
significance can be considered grounds
for rejecting the null hypothesis
– H0: the words in the bigram are independent
– 3.841 is associated with 95% confidence that
the null hypothesis should be rejected
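As a minimal illustration (not NSP itself), the G^2 and X^2 scores for the Artificial/Intelligence table above can be computed and compared against the 3.841 threshold:

```python
# Minimal sketch: G^2 and X^2 for the 2x2 Artificial/Intelligence table,
# with counts taken from the slide above (a 100,000 word sample).
from math import log

observed = [[100.0, 300.0],        # Artificial:  Intelligence, !Intelligence
            [200.0, 99400.0]]      # !Artificial: Intelligence, !Intelligence
n = sum(sum(row) for row in observed)
rows = [sum(row) for row in observed]
cols = [sum(observed[i][j] for i in range(2)) for j in range(2)]
expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

g2 = 2 * sum(observed[i][j] * log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2) if observed[i][j] > 0)
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

print(round(g2, 1), round(x2, 1))
print(g2 > 3.841, x2 > 3.841)      # True True: reject the independence hypothesis
```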
Measures of Association
• There are numerous measures of
association that can be used to identify
bigram and co-occurrence features
• Many of these are supported in the Ngram
Statistics Package (NSP)
– http://www.d.umn.edu/~tpederse/nsp.html
Summary
• Identify lexical features based on frequency counts or
measures of association – either in the data to be
clustered or in a separate set of feature selection data
– Language independent
• Unigrams usually only selected by frequency
– Remember, no labeled data from which to learn, so somewhat
less effective as features than in supervised case
• Bigrams and co-occurrences can also be selected by
frequency, or better yet measures of association
– Bigrams and co-occurrences need not be consecutive
– Stop words should be eliminated
– Frequency thresholds are helpful (e.g., unigram/bigram that
occurs once may be too rare to be useful)
Context Representations
First and Second Order Methods
Once features selected…
• We have a set of unigrams, bigrams, co-occurrences, or target co-occurrences
– We believe/hope that these are descriptive of
the contexts
– We also have the frequency counts and measures of
association scores that were used in their selection
• Convert contexts to be clustered into a
vector representation based on these
features
First Order Representation
• Each context is represented by a vector
with M dimensions, each of which
indicates whether or not a particular
feature occurred in that context
– Value may be binary, a frequency count, or an
association score
• Context by Feature representation
Contexts
• Cxt1: There was an island curse of black
magic cast by that voodoo child.
• Cxt2: Harold, a known voodoo child, was
gifted in the arts of black magic.
• Cxt3: Despite their military might, it was a
serious error to attack.
• Cxt4: Military might is no defense against
a voodoo child or an island curse.
Unigram Feature Set
• island 1000
• black   700
• curse   500
• magic   400
• child   200
• (assume these are frequency counts obtained
from some corpus…)
First Order Vectors of Unigrams
        island   black   curse   magic   child
Cxt1       1       1       1       1       1
Cxt2       0       1       0       1       1
Cxt3       0       0       0       0       0
Cxt4       1       0       1       0       1
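A toy Python sketch of how this context by unigram matrix could be built (assuming the bare whitespace tokenization used throughout, plus stripping of trailing punctuation):

```python
# Toy sketch: binary first order vectors over the unigram feature set.
features = ["island", "black", "curse", "magic", "child"]

contexts = [
    "There was an island curse of black magic cast by that voodoo child.",
    "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Despite their military might, it was a serious error to attack.",
    "Military might is no defense against a voodoo child or an island curse.",
]

def first_order_vector(context, features):
    tokens = {w.strip(".,").lower() for w in context.split()}
    return [1 if f in tokens else 0 for f in features]

for i, cxt in enumerate(contexts, 1):
    print("Cxt%d" % i, first_order_vector(cxt, features))
# Cxt1 [1, 1, 1, 1, 1], Cxt2 [0, 1, 0, 1, 1], Cxt3 [0, 0, 0, 0, 0], Cxt4 [1, 0, 1, 0, 1]
```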
Bigram Feature Set
• island curse    189.2
• black magic     123.5
• voodoo child    120.0
• military might  100.3
• serious error    89.2
• island child     73.2
• voodoo might     69.4
• military error   54.9
• black child      43.2
• serious curse    21.2
(assume these are log-likelihood scores based on frequency counts from
some corpus)
First Order Vectors of Bigrams
        black magic   island curse   military might   serious error   voodoo child
Cxt1         1              1               0                0              1
Cxt2         1              0               0                0              1
Cxt3         0              0               1                1              0
Cxt4         0              1               1                0              1
First Order Vectors
• Can have binary values or weights
associated with frequency, etc.
• Forms a context by feature matrix
• May optionally be smoothed/reduced with
Singular Value Decomposition
– More on that later…
• The contexts are ready for clustering…
– More on that later…
Second Order Features
• First order features encode the occurrence of a
feature in a context
– Feature occurrence represented by binary value
• Second order features encode something ‘extra’
about a feature that occurs in a context
– Feature occurrence represented by word co-occurrences
– Feature occurrence represented by context occurrences
Second Order Representation
• First, build word by word matrix from features
– Based on bigrams or co-occurrences
– First word is row, second word is column, cell is score
– (optionally) reduce dimensionality w/SVD
– Each row forms a vector of first order co-occurrences
• Second, replace each word in a context with its
row/vector as found in the word by word matrix
• Average all the word vectors in the context to
create the second order representation
– Due to Schütze (1998), related to LSI/LSA
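A minimal Python sketch of this construction, using the toy bigram scores above as the word by word matrix (illustrative only):

```python
# Minimal sketch: build the word by word matrix from the bigram scores
# above, then average the row vectors of the words found in a context.
import numpy as np

columns = ["magic", "curse", "might", "error", "child"]
word_vectors = {                       # rows of the word by word matrix
    "black":    [123.5,   0.0,   0.0,  0.0,  43.2],
    "island":   [  0.0, 189.2,   0.0,  0.0,  73.2],
    "military": [  0.0,   0.0, 100.3, 54.9,   0.0],
    "serious":  [  0.0,  21.2,   0.0, 89.2,   0.0],
    "voodoo":   [  0.0,   0.0,  69.4,  0.0, 120.0],
}

def second_order(context):
    tokens = [w.strip(".,").lower() for w in context.split()]
    rows = [np.array(word_vectors[t]) for t in tokens if t in word_vectors]
    return np.mean(rows, axis=0) if rows else np.zeros(len(columns))

cxt1 = "There was an island curse of black magic cast by that voodoo child."
print(dict(zip(columns, second_order(cxt1).round(1).tolist())))
# {'magic': 41.2, 'curse': 63.1, 'might': 23.1, 'error': 0.0, 'child': 78.8}
```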
Word by Word Matrix
            magic    curse    might    error    child
black       123.5      0        0        0      43.2
island        0      189.2      0        0      73.2
military      0        0      100.3     54.9     0
serious       0       21.2      0       89.2     0
voodoo        0        0       69.4      0      120.0
Word by Word Matrix
• …can also be used to identify sets of related words
• In the case of bigrams, rows represent the first word in a
bigram and columns represent the second word
– Matrix is asymmetric
• In the case of co-occurrences, rows and columns are
equivalent
– Matrix is symmetric
• The vector (row) for each word represents a set of first
order features for that word
• Each word in a context to be clustered for which a vector
exists (in the word by word matrix) is replaced by that
vector in that context
There was an island curse of black
magic cast by that voodoo child.
            magic    curse    might    error    child
black       123.5      0        0        0      43.2
island        0      189.2      0        0      73.2
voodoo        0        0       69.4      0      120.0
Second Order Co-Occurrences
• Word vectors for “black” and “island” show
similarity, as both occur with “child”
• “black” and “island” are second order co-occurrences
of each other, since both occur with “child” but not
with each other (i.e., “black island” is not observed)
Second Order Representation
• There was an [curse, child] curse of
[magic, child] magic cast by that [might,
child] child
• [curse, child] + [magic, child] + [might,
child]
There was an island curse of black
magic cast by that voodoo child.
        magic   curse   might   error   child
Cxt1    41.2    63.1    24.4      0     78.8
Second Order Representation
• Results in a Context by Feature (Word)
Representation
• Cell values do not indicate if feature
occurred in context. Rather, they show the
strength of association of that feature with
other words that occur with a word in the
context.
Summary
• First order representations are intuitive, but…
– Can suffer from sparsity
– Contexts represented based on the features that
occur in those contexts
• Second order representations are harder to
visualize, but…
– Allows a word to be represented by the words it co-occurs with (i.e., the company it keeps)
– Allows a context to be represented by the words that
occur with the words in the context
– Helps combat sparsity…
Related Work
• Pedersen and Bruce 1997 (EMNLP) presented first order method of
discrimination
http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
• Schütze 1998 (Computational Linguistics) introduced second order
method
http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
• Purandare and Pedersen 2004 (CoNLL) compared first and second
order methods
http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
– First order better if you have lots of data
– Second order better with smaller amounts of data
Dimensionality Reduction
Singular Value Decomposition
Effect of SVD
• SVD reduces a matrix to a given number
of dimensions. This may convert a word-level
space into a semantic or conceptual space
– If “dog” and “collie” and “wolf” are
dimensions/columns in a word co-occurrence
matrix, after SVD they may be a single
dimension that represents “canines”
Effect of SVD
• The dimensions of the matrix after SVD are
principal components that represent the
meaning of concepts
– Similar columns are grouped together
• SVD is a way of smoothing a very sparse
matrix, so that there are very few zero
valued cells after SVD
How can SVD be used?
• SVD on first order contexts will reduce a
context by feature representation down to a
smaller number of features
– Latent Semantic Analysis typically performs SVD
on a feature by context representation, where
the contexts are reduced
• SVD used in creating second order context
representations
– Reduce word by word matrix
Word by Word Matrix
[A word by word matrix of raw co-occurrence counts, with rows pc, body, disk, germ, lab, sales, linux, debt and columns apple, blood, cells, ibm, data, tissue, graphics, memory, organ, plasma.]
Singular Value Decomposition
A=UDV’
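A small numpy sketch of the A = UDV' factorization and a rank-k truncation (SenseClusters itself uses SVDPACKC; this is only illustrative):

```python
# Sketch: truncated SVD of a (stand-in) word by word matrix.
import numpy as np

A = np.random.rand(8, 10)                     # stand-in for a word by word count matrix
U, d, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                         # keep the k largest singular values
A_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]   # smoothed, rank-k approximation of A
word_vectors = U[:, :k] * d[:k]               # reduced k-dimensional row (word) vectors

print(A_k.shape, word_vectors.shape)          # (8, 10) (8, 2)
```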
Word by Word Matrix After SVD
        apple   blood   cells   ibm    data   tissue   graphics   memory   organ   plasma
pc       .73     .00     .11    1.3    2.0     .01       .86        .77     .00      .09
body     .00     1.2     1.3    .00    .33     1.6       .00        .85     .84      1.5
disk     .76     .00     .01    1.3    2.1     .00       .91        .72     .00      .00
germ     .00     1.1     1.2    .00    .49     1.5       .00        .86     .77      1.4
lab      .21     1.7     2.0    .35    1.7     2.5       .18        1.7     1.2      2.3
sales    .73     .15     .39    1.3    2.2     .35       .85        .98     .17      .41
linux    .96     .00     .16    1.7    2.7     .03       1.1        1.0     .00      .13
debt     1.2     .00     .00    2.1    3.2     .00       1.5        1.1     .00      .00
Second Order Representation
• I got a new disk today!
• What do you think of linux?
        apple   blood   cells   ibm    data   tissue   graphics   memory   organ   plasma
disk     .76     .00     .01    1.3    2.1     .00       .91        .72     .00      .00
linux    .96     .00     .16    1.7    2.7     .03       1.1        1.0     .00      .13
• These two contexts share no words in common, yet they are
similar! disk and linux both occur with “Apple”, “IBM”, “data”,
“graphics”, and “memory”
• The two contexts are similar because they share many second
order co-occurrences
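A quick check of that claim: since each of these two contexts contains only one word with a row in the matrix, its second order vector is simply that row, and the cosine between the two rows is high:

```python
# Cosine similarity of the second order vectors for the two contexts,
# i.e., the reduced "disk" and "linux" rows from the table above.
import numpy as np

disk  = np.array([.76, .00, .01, 1.3, 2.1, .00, .91, .72, .00, .00])
linux = np.array([.96, .00, .16, 1.7, 2.7, .03, 1.1, 1.0, .00, .13])

cosine = disk @ linux / (np.linalg.norm(disk) * np.linalg.norm(linux))
print(round(cosine, 3))    # close to 1.0, i.e., very similar
```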
Relationship to LSA
• Latent Semantic Analysis uses feature by
context first order representation
– Indicates all the contexts in which a feature
occurs
– Use SVD to reduce dimensions (contexts)
– Cluster features based on similarity of
contexts in which they occur
– Represent sentences using an average of
feature vectors
Feature by Context Representation
                  Cxt1   Cxt2   Cxt3   Cxt4
black magic        1      1      0      1
island curse       1      0      0      1
military might     0      0      1      0
serious error      0      0      1      0
voodoo child       1      1      0      1
References
• Deerwester, S. and Dumais, S.T. and Furnas, G.W. and
Landauer, T.K. and Harshman, R., Indexing by Latent
Semantic Analysis, Journal of the American Society for
Information Science, vol. 41, 1990
• Landauer, T. and Dumais, S., A Solution to Plato's
Problem: The Latent Semantic Analysis Theory of
Acquisition, Induction and Representation of Knowledge,
Psychological Review, vol. 104, 1997
• Schütze, H., Automatic Word Sense Discrimination,
Computational Linguistics, vol. 24, 1998
• Berry, M.W. and Drmac, Z. and Jessup, E.R., Matrices,
Vector Spaces, and Information Retrieval, SIAM Review,
vol. 41, 1999
Clustering
Partitional Methods
Cluster Stopping
Cluster Labeling
Many many methods…
• Cluto supports a wide range of different
clustering methods
– Agglomerative
• Average, single, complete link…
– Partitional
• K-means (Direct)
– Hybrid
• Repeated bisections
• SenseClusters integrates with Cluto
– http://www-users.cs.umn.edu/~karypis/cluto/
General Methodology
• Represent contexts to be clustered in first
or second order vectors
• Cluster the context vectors directly
– vcluster
• … or convert to similarity matrix and then
cluster
– scluster
Partitional Methods
• Randomly create centroids equal to the
number of clusters you wish to find
• Assign each context to nearest centroid
• After all contexts assigned, re-compute
centroids
– “best” location decided by criterion function
• Repeat until stable clusters found
– Centroids don’t shift from iteration to iteration
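A bare-bones Python sketch of that procedure (illustrative only; in practice SenseClusters hands the clustering off to CLUTO):

```python
# Sketch of a partitional (k-means style) clusterer: random centroids,
# assign, re-compute, repeat until the centroids stop shifting.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        # assign each context vector to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-compute centroids; stop once they no longer shift
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(42)
X = np.vstack([rng.random((20, 5)), rng.random((20, 5)) + 3])  # two well separated groups
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))                                     # roughly 20 contexts per cluster
```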
Partitional Methods
• Advantages : fast
• Disadvantages
– Results can be dependent on the initial
placement of centroids
– Must specify number of clusters ahead of time
• maybe not…
Partitional Criterion Functions
• Intra-Cluster (Internal) similarity/distance
– How close together are members of a cluster?
– Closer together is better
• Inter-Cluster (External) similarity/distance
– How far apart are the different clusters?
– Further apart is better
Intra Cluster Similarity
• Ball of String (I1)
– How far is each member from each other
member
• Flower (I2)
– How far is each member of cluster from
centroid
Contexts to be Clustered
Ball of String
(I1 Internal Criterion Function)
Flower
(I2 Internal Criterion Function)
Inter Cluster Similarity
• The Fan (E1)
– How far is each centroid from the centroid of
the entire collection of contexts
– Maximize that distance
The Fan
(E1 External Criterion Function)
Hybrid Criterion Functions
• Balance internal and external similarity
– H1 = I1/E1
– H2 = I2/E1
• Want internal similarity to increase, while
external similarity decreases
• Want internal distances to decrease, while
external distances increase
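As a rough sketch of these quantities (following the descriptions above with cosine similarity, not CLUTO's exact definitions), I2, E1 and H2 = I2/E1 might be computed like this:

```python
# Rough sketch: internal (I2), external (E1) and hybrid (H2 = I2/E1)
# criterion scores for a given clustering, using cosine similarity.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def h2_score(X, labels, k):
    collection_centroid = normalize(X.mean(axis=0))
    I2, E1 = 0.0, 0.0
    for j in range(k):
        members = X[labels == j]
        centroid = normalize(members.mean(axis=0))
        I2 += sum(normalize(x) @ centroid for x in members)    # members vs. their own centroid
        E1 += len(members) * (centroid @ collection_centroid)  # centroid vs. collection centroid
    return I2 / E1

X = np.abs(np.random.rand(12, 5))     # toy non-negative context vectors
labels = np.repeat([0, 1, 2], 4)      # a made-up 3-way clustering
print(round(h2_score(X, labels, k=3), 3))
```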
Cluster Stopping
Cluster Stopping
• Many Clustering Algorithms require that
the user specify the number of clusters
prior to clustering
• But, the user often doesn’t know the
number of clusters, and in fact finding that
out might be the goal of clustering
Criterion Functions Can Help
• Run partitional algorithm for k=1 to deltaK
– DeltaK is a user estimated or automatically
determined upper bound for the number of clusters
• Find the value of k at which the criterion function
does not significantly increase at k+1
• Clustering can stop at this value, since no further
improvement in solution is apparent with
additional clusters (increases in k)
H2 versus k
T. Blair – V. Putin – S. Hussein
PK2
• Based on Hartigan, 1975
• When ratio approaches 1, clustering is at a plateau
• Select value of k which is closest to but outside of
standard deviation interval
PK2(k) = H2(k) / H2(k-1)
PK2 predicts 3 senses
T. Blair – V. Putin – S. Hussein
PK3
• Related to Salvador and Chan, 2004
• Inspired by Dice Coefficient
• Values close to 1 mean clustering is improving…
• Select value of k which is closest to but outside of
standard deviation interval

PK3(k) = 2 * H2(k) / ( H2(k-1) + H2(k+1) )
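A small Python sketch of both measures applied to a made-up sequence of H2(k) scores; the selection rule below simply follows the "closest to but outside of the standard deviation interval" description on these slides:

```python
# Sketch: PK2 and PK3 over hypothetical criterion function scores H2(k).
import numpy as np

H2 = {1: 0.52, 2: 0.71, 3: 0.83, 4: 0.845, 5: 0.85, 6: 0.853}   # made-up values

PK2 = {k: H2[k] / H2[k - 1] for k in range(2, 7)}
PK3 = {k: 2 * H2[k] / (H2[k - 1] + H2[k + 1]) for k in range(2, 6)}

def pick_k(scores):
    std = np.std(list(scores.values()))
    outside = {k: v for k, v in scores.items() if v > 1 + std}  # still clearly improving
    # the k whose score is closest to the interval around 1, yet outside it
    return min(outside, key=outside.get) if outside else max(scores)

print(pick_k(PK2), pick_k(PK3))
```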
PK3 predicts 3 senses
T. Blair – V. Putin – S. Hussein
References
• Hartigan, J. Clustering Algorithms, Wiley, 1975
– basis for SenseClusters stopping method PK2
• Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An
Evaluation, The Computer Journal, vol 20, 1977
– basis for SenseClusters stopping method PK1
• Milligan, G. and Cooper, M., An Examination of Procedures for
Determining the Number of Clusters in a Data Set, Psychometrika,
vol. 50, 1985
– Very extensive comparison of cluster stopping methods
• Tibshirani, R. and Walther, G. and Hastie, T., Estimating the Number
of Clusters in a Dataset via the Gap Statistic,Journal of the Royal
Statistics Society (Series B), 2001
• Pedersen, T. and Kulkarni, A. Selecting the "Right" Number of
Senses Based on Clustering Criterion Functions, Proceedings of the
Posters and Demo Program of the Eleventh Conference of the
European Chapter of the Association for Computational Linguistics,
2006
– Describes SenseClusters stopping methods
Cluster Labeling
Cluster Labeling
• Once a cluster is discovered, how can you
generate a description of the contexts of
that cluster automatically?
• In the case of contexts, you might be able
to identify significant lexical features from
the contents of the clusters, and use those
as a preliminary label
Results of Clustering
• Each cluster consists of some number of
contexts
• Each context is a short unit of text
• Apply measures of association to the
contents of each cluster to determine N
most significant bigrams
• Use those bigrams as a label for the
cluster
Label Types
• The N most significant bigrams for each
cluster will act as a descriptive label
• The M most significant bigrams that are
unique to each cluster will act as a
discriminating label
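A toy Python sketch of the two label types, assuming each cluster's candidate bigrams have already been scored with some measure of association (the clusters, bigrams, and scores below are invented):

```python
# Toy sketch: descriptive (top-N overall) vs. discriminating (top-M unique)
# bigram labels for each cluster; the scores are hypothetical.
scored = {
    "C1": {"operating system": 92.1, "command line": 77.4, "shell script": 60.2},
    "C2": {"oyster shell": 83.0, "sea shore": 58.7, "command line": 12.5},
}

N = M = 2
descriptive = {c: sorted(b, key=b.get, reverse=True)[:N] for c, b in scored.items()}

discriminating = {}
for c, b in scored.items():
    others = set().union(*(scored[o] for o in scored if o != c))
    unique = {bg: s for bg, s in b.items() if bg not in others}
    discriminating[c] = sorted(unique, key=unique.get, reverse=True)[:M]

print(descriptive)      # N most significant bigrams per cluster
print(discriminating)   # M most significant bigrams unique to each cluster
```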
Evaluation Techniques
Comparison to gold standard data
Evaluation
• If sense-tagged text is available, it can be
used for evaluation
– But don’t use sense tags for clustering or
feature selection!
• Assume that sense tags represent “true”
clusters, and compare these to discovered
clusters
– Find mapping of clusters to senses that
attains maximum accuracy
Evaluation
• Pseudo words are especially useful, since
it is hard to find data that is discriminated
– Pick two words or names from a corpus, and
conflate them into one name. Then see how
well you can discriminate.
– http://www.d.umn.edu/~tpederse/tools.html
• Baseline Algorithm– group all instances
into one cluster, this will reach “accuracy”
equal to majority classifier
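A toy sketch of creating a pseudo-word: two names are conflated into a single token, while the original name is kept aside as the gold-standard tag for later evaluation (the names and contexts below are invented):

```python
# Toy sketch: conflate two names into one pseudo-word, remembering the
# original name so the discovered clusters can be scored afterwards.
import re

def conflate(contexts, name1, name2, pseudo="NAME1_NAME2"):
    data = []
    for text in contexts:
        truth = name1 if re.search(r"\b%s\b" % name1, text) else name2
        data.append((re.sub(r"\b(%s|%s)\b" % (name1, name2), pseudo, text), truth))
    return data

contexts = ["Putin met with his cabinet.", "Blair spoke to parliament."]
for text, truth in conflate(contexts, "Putin", "Blair"):
    print(truth, "->", text)
# Putin -> NAME1_NAME2 met with his cabinet.
# Blair -> NAME1_NAME2 spoke to parliament.
```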
Evaluation
• Pseudo words are especially useful, since
it is hard to find data that is discriminated
– Pick two or more words or names from a
corpus, and conflate them into one name.
Then see how well you can discriminate.
– http://www.d.umn.edu/~kulka020/kanaghaName.html
Baseline Algorithm
• Baseline Algorithm – group all instances
into one cluster, this will reach “accuracy”
equal to majority classifier
• What if the clustering said everything
should be in the same cluster?
Baseline Performance
        S1    S2    S3    Totals
C1       0     0     0       0
C2       0     0     0       0
C3      80    35    55     170
Totals  80    35    55     170

(0+0+55)/170 = .32 if C3 is labeled S3

        S3    S2    S1    Totals
C1       0     0     0       0
C2       0     0     0       0
C3      55    35    80     170
Totals  55    35    80     170

(0+0+80)/170 = .47 if C3 is labeled S1
Evaluation
• Suppose that C1 is labeled S1, C2 as S2, and C3 as S3
• Accuracy = (10 + 0 + 10) / 170 = 12%
• Diagonal shows how many members of the cluster actually
belong to the sense given on the column
• Can the “columns” be rearranged to improve the overall
accuracy?
– Optimally assign clusters to senses
        S1    S2    S3    Totals
C1      10    30     5      45
C2      20     0    40      60
C3      50     5    10      65
Totals  80    35    55     170
Evaluation
• The assignment of C1 to
S2, C2 to S3, and C3 to S1
results in 120/170 = 71%
• Find the ordering of the
columns in the matrix that
maximizes the sum of the
diagonal.
• This is an instance of the
Assignment Problem from
Operations Research, or
finding the Maximal
Matching of a Bipartite
Graph from Graph Theory.
        S2    S3    S1    Totals
C1      30     5    10      45
C2       0    40    20      60
C3       5    10    50      65
Totals  35    55    80     170
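This optimal mapping can be found with the Hungarian algorithm; a short sketch using SciPy on the confusion matrix above:

```python
# Sketch: optimal cluster-to-sense assignment via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

confusion = np.array([[10, 30,  5],    # rows: clusters C1..C3
                      [20,  0, 40],    # columns: senses S1..S3
                      [50,  5, 10]])

rows, cols = linear_sum_assignment(confusion, maximize=True)
accuracy = confusion[rows, cols].sum() / confusion.sum()
for c, s in zip(rows, cols):
    print("C%d -> S%d" % (c + 1, s + 1))   # C1 -> S2, C2 -> S3, C3 -> S1
print(round(accuracy, 2))                  # 0.71
```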
Alternatives?
• Unsupervised methods may not discover clusters
equivalent to the classes learned in supervised learning
• Evaluation based on assuming that sense tags represent
the “true” clusters is likely a bit harsh. Alternatives?
– Humans could look at the members of each cluster and
determine the nature of the relationship or meaning that they all
share
– Use the contents of the cluster to generate a descriptive label
that could be inspected by a human
Thank you!
• Questions or comments on tutorial or
SenseClusters are welcome at any time
tpederse@d.umn.edu
• SenseClusters is freely available via LIVE
CD, the Web, and in source code form
http://senseclusters.sourceforge.net
• SenseClusters papers available at:
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html