Language Independent Methods of Clustering Similar Contexts (with applications)

Ted Pedersen
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
tpederse@d.umn.edu

EuroLAN-2005 Summer School

The Problem

- A context is a short unit of text
  - often a phrase to a paragraph in length, although it can be longer
- Input: N contexts
- Output: K clusters
  - where the contexts within a cluster are more similar to each other than to the contexts found in other clusters

Language Independent Methods

- Do not utilize syntactic information
  - No parsers, part of speech taggers, etc. required
- Do not utilize dictionaries or other manually created lexical resources
- Based on lexical features selected from corpora
- No manually annotated data of any kind; methods are completely unsupervised in the strictest sense
- Assumption: word segmentation can be done by looking for white spaces between strings

Outline (Tutorial)

- Background and motivations
- Identifying lexical features
  - Measures of association & tests of significance
- Context representations
  - First & second order
- Dimensionality reduction
  - Singular Value Decomposition
- Clustering methods
  - Agglomerative & partitional techniques
- Cluster labeling
- Evaluation techniques
  - Gold standard comparisons

Outline (Practical Session)

- Headed contexts
  - Name Discrimination
  - Word Sense Discrimination
  - Abbreviations
- Headless contexts
  - Email/Newsgroup Organization
  - Newspaper text
- Identifying Sets of Related Words

SenseClusters

- A package designed to cluster contexts
- Integrates with various other tools
  - Ngram Statistics Package
  - Cluto
  - SVDPACKC
- http://senseclusters.sourceforge.net

Many thanks…

- Satanjeev ("Bano") Banerjee (M.S., 2002)
  - Founding developer of the Ngram Statistics Package (2000-2001)
  - Now PhD student in the Language Technology Institute at Carnegie Mellon University: http://www-2.cs.cmu.edu/~banerjee/
- Amruta Purandare (M.S., 2004)
  - Founding developer of SenseClusters (2002-2004)
  - Now PhD student in Intelligent Systems at the University of Pittsburgh: http://www.cs.pitt.edu/~amruta/
- Anagha Kulkarni (M.S., 2006, expected)
  - Enhancing SenseClusters since Fall 2004!
  - http://www.d.umn.edu/~kulka020/
- National Science Foundation (USA) for supporting Bano, Amruta, Anagha and me (!) via CAREER award #0092784

Practical Session

- Experiment with SenseClusters
  - http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi
  - Has both a command line and web interface (above)
- Can be installed on a Linux/Unix machine without too much work
  - http://senseclusters.sourceforge.net
  - Has some dependencies that must be installed, so having supervisor access and/or sysadmin experience helps
  - Complete system (SenseClusters plus dependencies) is available on CD

Background and Motivations

Headed and Headless Contexts

- A headed context includes a target word
  - Our goal is to collect multiple contexts that mention a particular target word in order to try to identify different senses of that word
- A headless context has no target word
  - Our goal is to identify the contexts that are similar to each other

Headed Contexts (input)

- I can hear the ocean in that shell.
- My operating system shell is bash.
- The shells on the shore are lovely.
- The shell command line is flexible.
- The oyster shell is very hard and black.

Headed Contexts (output)

- Cluster 1:
  - My operating system shell is bash.
  - The shell command line is flexible.
- Cluster 2:
  - The shells on the shore are lovely.
  - The oyster shell is very hard and black.
  - I can hear the ocean in that shell.

Headless Contexts (input)

- The new version of Linux is more stable and has better support for cameras.
- My Chevy Malibu has had some front end troubles.
- Osborne made one of the first personal computers.
- The brakes went out, and the car flew into the house.
- With the price of gasoline, I think I'll be taking the bus more often!

Headless Contexts (output)

- Cluster 1:
  - The new version of Linux is more stable and has better support for cameras.
  - Osborne made one of the first personal computers.
- Cluster 2:
  - My Chevy Malibu has had some front end troubles.
  - The brakes went out, and the car flew into the house.
  - With the price of gasoline, I think I'll be taking the bus more often!

Applications

- Web search results are headed contexts
  - The term you search for is included in the snippet
- Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages "mixed up"
  - Organizing web search results is an important problem.
- If you click on search results or follow links in the pages found, you will encounter headless contexts too…

Applications

- Email (public or private) is made up of headless contexts
  - Short, usually focused…
- Cluster similar email messages together
  - Automatic email foldering
  - Take all messages from a sent-mail file or inbox and organize them into categories

Applications

- News articles are another example of headless contexts
  - Entire article or first paragraph
  - Short, usually focused
- Cluster similar articles together

Underlying Premise…

- You shall know a word by the company it keeps
  - Firth, 1957 (Studies in Linguistic Analysis)
- Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
  - Harris, 1968 (Mathematical Structures of Language)
- Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
  - Miller and Charles, 1991 (Language and Cognitive Processes)
- Various extensions…
  - Similar contexts will have similar meanings, etc.
  - Names that occur in similar contexts will refer to the same underlying person, etc.

Identifying Lexical Features
Measures of Association and Tests of Significance

What are features?

- Features represent the (hopefully) salient characteristics of the contexts to be clustered
- Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features
- Vectors/contexts that include many of the same features will be similar to each other

Where do features come from?

- In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
  - This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection
  - It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
    - Email or news articles

Feature Selection

- "Test" data – the contexts to be clustered
  - Assume that the feature selection data is the same as the test data, unless otherwise indicated
- "Training" data – a separate corpus of held out feature selection data (that will not be clustered)
  - May be needed if you have a small number of contexts to cluster (e.g., web search results)
  - This sense of "training" is due to Schütze (1998)

Lexical Features

- Unigram – a single word that occurs more than a given number of times
- Bigram – an ordered pair of words that occur together more often than expected by chance
  - Consecutive, or may have intervening words
- Co-occurrence – an unordered bigram
- Target Co-occurrence – a co-occurrence where one of the words is the target word

Bigrams

- fine wine (window size of 2)
- baseball bat
- house of representatives (window size of 3)
- president of the republic (window size of 4)
- apple orchard
- Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)

Co-occurrences

- tropics water
- boat fish
- law president
- train travel
- Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations

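To make the collection step concrete, here is a minimal sketch (illustrative Python, not NSP code; the tokenization and the exact window convention are assumptions) that counts ordered bigrams or unordered co-occurrences falling within a window of a given size:

```python
# Count word pairs within a window; ordered pairs approximate bigram features,
# unordered pairs approximate co-occurrence features.
from collections import Counter

def pair_counts(tokens, window=2, ordered=True):
    """Count pairs of words whose members occur within `window` tokens of each other.
    ordered=True  -> bigrams (first word precedes second, gaps allowed)
    ordered=False -> co-occurrences (order ignored)"""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1:i + window]:
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            counts[pair] += 1
    return counts

tokens = "the president of the republic spoke to the president today".split()
print(pair_counts(tokens, window=4, ordered=True).most_common(3))
```

With window=2 only consecutive pairs are counted; with window=4 a pair such as (president, republic) from "president of the republic" is also captured.
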
Bigrams and Co-occurrences

- Pairs of words tend to be much less ambiguous than unigrams
  - "bank" versus "river bank" and "bank card"
  - "dot" versus "dot com" and "dot product"
- Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)
- Unigrams are noisy, but bountiful

"Occur together more often than expected by chance…"

- Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
  - Throw out bigrams that include one or two stop words
- Expected values are calculated, based on the model of independence and the observed values
  - How often would you expect these words to occur together, if they only occurred together by chance?
  - If two words occur "significantly" more often than the expected value, then the words do not occur together by chance.

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial        100                            400
!Artificial
                  300                        100,000

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial        100             300            400
!Artificial       200          99,400         99,600
                  300          99,700        100,000

2x2 Contingency Table (observed, with expected values in parentheses)

              Intelligence        !Intelligence
Artificial     100.0 (1.2)        300.0 (398.8)          400
!Artificial    200.0 (298.8)   99,400.0 (99,301.2)    99,600
               300                99,700              100,000

Measures of Association

$$G^2 = 2 \sum_{i,j=1}^{2} \mathrm{observed}(w_i, w_j) \cdot \log \frac{\mathrm{observed}(w_i, w_j)}{\mathrm{expected}(w_i, w_j)}$$

$$X^2 = \sum_{i,j=1}^{2} \frac{\left[\mathrm{observed}(w_i, w_j) - \mathrm{expected}(w_i, w_j)\right]^2}{\mathrm{expected}(w_i, w_j)}$$

Measures of Association

For the table above:

G^2 = 750.88
X^2 = 8191.78

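A small worked sketch of these calculations (plain Python, not NSP): it derives the expected values from the marginal totals of the table above and computes G^2 and X^2, reproducing the two scores shown.

```python
# Expected values, log-likelihood ratio G^2, and Pearson's X^2 for a 2x2 table.
import math

# Observed counts for (artificial, intelligence) from the example above.
n11, n12 = 100, 300        # artificial+intelligence, artificial+!intelligence
n21, n22 = 200, 99_400     # !artificial+intelligence, !artificial+!intelligence
total = n11 + n12 + n21 + n22

observed = [[n11, n12], [n21, n22]]
row_totals = [n11 + n12, n21 + n22]
col_totals = [n11 + n21, n12 + n22]

# Expected counts under the model of independence.
expected = [[row_totals[i] * col_totals[j] / total for j in range(2)]
            for i in range(2)]

g2 = 2 * sum(observed[i][j] * math.log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2))
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

print(round(g2, 2), round(x2, 2))   # 750.88 8191.78, matching the slide above
```
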
Interpreting the Scores…

- G^2 and X^2 are asymptotically approximated by the chi-squared distribution…
- This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get…

Interpreting the Scores…

- Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
  - H0: the words in the bigram are independent
  - 3.841 is associated with 95% confidence that the null hypothesis should be rejected

Measures of Association

- There are numerous measures of association that can be used to identify bigram and co-occurrence features
- Many of these are supported in the Ngram Statistics Package (NSP)
  - http://www.d.umn.edu/~tpederse/nsp.html

Measures Supported in NSP

- Log-likelihood Ratio (ll)
- True Mutual Information (tmi)
- Pearson's Chi-squared Test (x2)
- Pointwise Mutual Information (pmi)
- Phi coefficient (phi)
- T-test (tscore)
- Fisher's Exact Test (leftFisher, rightFisher)
- Dice Coefficient (dice)
- Odds Ratio (odds)

NSP

- Will explore NSP during the practical session
- Integrated into SenseClusters; may also be used in stand-alone mode
- Can be installed easily on a Linux/Unix system from CD or downloaded from
  - http://www.d.umn.edu/~tpederse/nsp.html
- I'm told it can also be installed on Windows (via Cygwin or ActivePerl), but I have no personal experience of this…

Summary

- Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data
  - Language independent
- Unigrams usually only selected by frequency
  - Remember, no labeled data from which to learn, so somewhat less effective as features than in the supervised case
- Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association
  - Bigrams and co-occurrences need not be consecutive
  - Stop words should be eliminated
  - Frequency thresholds are helpful (e.g., a unigram/bigram that occurs once may be too rare to be useful)

Related Work

- Moore, 2004 (EMNLP) – follow-up to Dunning and Pedersen on log-likelihood and exact tests
  - http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf
- Pedersen, 1996 (SCSUG) – explanation of exact tests, and comparison to log-likelihood
  - http://arxiv.org/abs/cmp-lg/9608010
  - (also see Pedersen, Kayaalp, and Bruce, AAAI-1996)
- Dunning, 1993 (Computational Linguistics) – introduces the log-likelihood ratio for collocation identification
  - http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

Context Representations
First and Second Order Methods

Once features are selected…

- We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful
  - We also have any frequency counts and measure of association scores that were used in their selection
- Convert the contexts to be clustered into a vector representation based on these features

First Order Representation

- Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context
  - Values may be binary, a frequency count, or an association score
- Context by Feature representation

Contexts

- C1: There was an island curse of black magic cast by that voodoo child.
- C2: Harold, a known voodoo child, was gifted in the arts of black magic.
- C3: Despite their military might, it was a serious error to attack.
- C4: Military might is no defense against a voodoo child or an island curse.

Unigram Feature Set

- island   1000
- black     700
- curse     500
- magic     400
- child     200
- (assume these are frequency counts obtained from some corpus…)

First Order Vectors of Unigrams

       island   black   curse   magic   child
C1       1        1       1       1       1
C2       0        1       0       1       1
C3       0        0       0       0       0
C4       1        0       1       0       1

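A minimal sketch of how such first order vectors could be built (illustrative Python, not SenseClusters code; the crude tokenization is an assumption):

```python
# Binary context-by-feature vectors for the four example contexts above.
contexts = [
    "There was an island curse of black magic cast by that voodoo child.",
    "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Despite their military might, it was a serious error to attack.",
    "Military might is no defense against a voodoo child or an island curse.",
]
features = ["island", "black", "curse", "magic", "child"]

def first_order_vector(context, features):
    """1 if the feature occurs in the context, else 0."""
    words = set(context.lower().replace(",", "").replace(".", "").split())
    return [1 if f in words else 0 for f in features]

for i, c in enumerate(contexts, 1):
    print(f"C{i}", first_order_vector(c, features))
# C1 [1, 1, 1, 1, 1]  C2 [0, 1, 0, 1, 1]  C3 [0, 0, 0, 0, 0]  C4 [1, 0, 1, 0, 1]
```
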
Bigram Feature Set

- island curse      189.2
- black magic       123.5
- voodoo child      120.0
- military might    100.3
- serious error      89.2
- island child       73.2
- voodoo might       69.4
- military error     54.9
- black child        43.2
- serious curse      21.2
- (assume these are log-likelihood scores based on frequency counts from some corpus)

First Order Vectors of Bigrams

      black magic   island curse   military might   serious error   voodoo child
C1         1             1               0                0               1
C2         1             0               0                0               1
C3         0             0               1                1               0
C4         0             1               1                0               1

First Order Vectors

- Can have binary values or weights associated with frequency, etc.
- May optionally be smoothed/reduced with Singular Value Decomposition
  - More on that later…
- The contexts are ready for clustering…
  - More on that later…

Second Order Representation

- Build a word by word matrix from the features
  - Must be bigrams or co-occurrences
  - (optionally) reduce dimensionality with SVD
  - Each row represents the first order co-occurrences of a word
- Represent a context by replacing each word that has an entry in the word by word matrix with its associated vector
- Average the word vectors found for the context
- Due to Schütze (1998)

Word by Word Matrix

           magic    curse    might    error    child
black      123.5      0        0        0      43.2
island       0      189.2      0        0      73.2
military     0        0      100.3     54.9     0
serious      0       21.2      0       89.2     0
voodoo       0        0       69.4      0     120.0

Word by Word Matrix

- …can also be used to identify sets of related words
- In the case of bigrams, rows represent the first word in a bigram and columns represent the second word
  - Matrix is asymmetric
- In the case of co-occurrences, rows and columns are equivalent
  - Matrix is symmetric
- The vector (row) for each word represents a set of first order features for that word
- Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context

There was an island curse of black magic cast by that voodoo child.

          magic    curse    might    error    child
black     123.5      0        0        0      43.2
island      0      189.2      0        0      73.2
voodoo      0        0       69.4      0     120.0

Second Order Representation

- There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child
- [curse, child] + [magic, child] + [might, child]

There was an island curse of black magic cast by that voodoo child.

       magic    curse    might    error    child
C1      41.2     63.1     24.4      0       78.8

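A rough sketch of the replace-and-average step just shown, using the example word by word matrix (illustrative Python with numpy, not SenseClusters code):

```python
# Replace each context word that has a row in the word-by-word matrix with that
# row, then average the rows to form the second order context vector.
import numpy as np

dims = ["magic", "curse", "might", "error", "child"]
word_vectors = {                       # rows of the example word-by-word matrix
    "black":    np.array([123.5,   0.0,   0.0,  0.0,  43.2]),
    "island":   np.array([  0.0, 189.2,   0.0,  0.0,  73.2]),
    "military": np.array([  0.0,   0.0, 100.3, 54.9,   0.0]),
    "serious":  np.array([  0.0,  21.2,   0.0, 89.2,   0.0]),
    "voodoo":   np.array([  0.0,   0.0,  69.4,  0.0, 120.0]),
}

context = "There was an island curse of black magic cast by that voodoo child."
rows = [word_vectors[w] for w in context.lower().rstrip(".").split()
        if w in word_vectors]
context_vector = np.mean(rows, axis=0)   # average of the island, black, voodoo rows
print(dict(zip(dims, context_vector.round(1))))   # roughly the C1 vector above
```
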
First versus Second Order

- First Order represents a context by showing which features occurred in that context
  - This is what feature vectors normally do…
- Second Order allows additional information about a word to be incorporated into the representation
  - Feature values based on information found outside of the immediate context

Second Order Co-Occurrences

- "black" and "island" show similarity because both words have occurred with "child"
- "black" and "island" are second order co-occurrences of each other, since both occur with "child" but not with each other (i.e., "black island" is not observed)

Second Order Co-occurrences

- Imagine a co-occurrence graph
  - Word network
  - First order co-occurrences are directly connected
  - Second order co-occurrences are connected to each other via one other word
- The kocos.pl program in the Ngram Statistics Package finds kth order co-occurrences

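A tiny sketch of this graph view (illustrative Python, far simpler than kocos.pl): first order co-occurrences are edges, and second order co-occurrences share a neighbor without being directly connected.

```python
# Find second order co-occurrences as two-hop neighbors in a word network.
from collections import defaultdict

edges = [("black", "magic"), ("black", "child"), ("island", "curse"),
         ("island", "child"), ("voodoo", "child"), ("voodoo", "might")]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

def second_order(word):
    """Words reachable in exactly two hops that are not first order co-occurrences."""
    two_hops = {w for n in neighbors[word] for w in neighbors[n]}
    return two_hops - neighbors[word] - {word}

print(second_order("black"))   # includes 'island' and 'voodoo', both via 'child'
```
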
Summary

- First order representations are intuitive, but…
  - Can suffer from sparsity
  - Contexts are represented based only on the features that occur in those contexts
- Second order representations are harder to visualize, but…
  - Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps)
  - Allow a context to be represented by the words that occur with the words in the context
  - Help combat sparsity…

Related Work

- Pedersen and Bruce, 1997 (EMNLP) – presented a first order method of discrimination
  - http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
- Schütze, 1998 (Computational Linguistics) – introduced the second order method
  - http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
- Purandare and Pedersen, 2004 (CoNLL) – compared first and second order methods
  - http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
  - First order better if you have lots of data
  - Second order better with smaller amounts of data

Dimensionality Reduction
Singular Value Decomposition

Motivation

- First order matrices are very sparse
  - Word by word
  - Context by feature
- NLP data is noisy
  - No stemming performed
  - Synonyms

Many Methods

- Singular Value Decomposition (SVD)
  - SVDPACKC: http://www.netlib.org/svdpack/
- Multi-Dimensional Scaling (MDS)
- Principal Components Analysis (PCA)
- Independent Components Analysis (ICA)
- Linear Discriminant Analysis (LDA)
- etc…

Effect of SVD

- SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space
  - If "dog", "collie" and "wolf" are dimensions/columns in a word co-occurrence matrix, after SVD they may contribute to a single dimension that represents "canines"

Effect of SVD

- The dimensions of the matrix after SVD are principal components that represent the meaning of concepts
  - Similar columns are grouped together
- SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD

How can SVD be used?

- SVD on first order contexts will reduce a context by feature representation down to a smaller number of features
  - Latent Semantic Analysis typically performs SVD on a word by context representation, where the contexts are reduced
- SVD is used in creating second order context representations
  - Reduce the word by word matrix
- SVD could also be used on the resulting second order context representations (although this is not supported)

Word by Word Matrix

        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc        2      0      0     1    3     1     0        0        0       0       0
body      0      3      0     0    0     0     2        0        0       2       1
disk      1      0      0     2    0     3     0        1        2       0       0
petri     0      2      1     0    0     0     2        0        1       0       1
lab       0      0      3     0    2     0     2        0        2       1       3
sales     0      0      0     2    3     0     0        1        2       0       0
linux     2      0      0     1    3     2     0        1        1       0       0
debt      0      0      0     2    3     4     0        2        0       0       0

Singular Value Decomposition

A = UDV'

U

[matrix of left singular vectors, one row per word; numeric values not reproduced here]

D

9.19  6.36  3.99  3.25  2.52  2.30  1.26  0.66  0.00  0.00  0.00

V

[matrix of right singular vectors, one row per feature; numeric values not reproduced here]

Word by Word Matrix After SVD

        apple  blood  cells  ibm   data  tissue  graphics  memory  organ  plasma
pc       .73    .00    .11   1.3   2.0    .01      .86       .77     .00    .09
body     .00    1.2    1.3   .00   .33    1.6      .00       .85     .84    1.5
disk     .76    .00    .01   1.3   2.1    .00      .91       .72     .00    .00
germ     .00    1.1    1.2   .00   .49    1.5      .00       .86     .77    1.4
lab      .21    1.7    2.0   .35   1.7    2.5      .18       1.7     1.2    2.3
sales    .73    .15    .39   1.3   2.2    .35      .85       .98     .17    .41
linux    .96    .00    .16   1.7   2.7    .03      1.1       1.0     .00    .13
debt     1.2    .00    .00   2.1   3.2    .00      1.5       1.1     .00    .00

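As an illustration of the reduction step, the sketch below (plain numpy rather than SVDPACKC; the choice of rank k = 2 is an arbitrary assumption) keeps only the top singular values of the original word by word matrix and reconstructs it. The values will not exactly match the table just shown (which also drops the "box" column and relabels "petri" as "germ"), but the smoothing effect is the same: similar columns fold together and zero cells get filled in.

```python
# Rank-k SVD reconstruction of a word-by-word co-occurrence matrix.
import numpy as np

# rows: pc body disk petri lab sales linux debt
# cols: apple blood cells ibm data box tissue graphics memory organ plasma
A = np.array([
    [2, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 2, 0, 0, 2, 1],
    [1, 0, 0, 2, 0, 3, 0, 1, 2, 0, 0],
    [0, 2, 1, 0, 0, 0, 2, 0, 1, 0, 1],
    [0, 0, 3, 0, 2, 0, 2, 0, 2, 1, 3],
    [0, 0, 0, 2, 3, 0, 0, 1, 2, 0, 0],
    [2, 0, 0, 1, 3, 2, 0, 1, 1, 0, 0],
    [0, 0, 0, 2, 3, 4, 0, 2, 0, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of dimensions to keep
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]    # rank-k reconstruction (smoothed matrix)
print(A_k.round(2))
```
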
Second Order Representation

- I got a new disk today!
- What do you think of linux?

        apple  blood  cells  ibm   data  tissue  graphics  memory  organ  plasma
disk     .76    .00    .01   1.3   2.1    .00      .91       .72     .00    .00
linux    .96    .00    .16   1.7   2.7    .03      1.1       1.0     .00    .13

- These two contexts share no words in common, yet they are similar! disk and linux both occur with "apple", "ibm", "data", "graphics", and "memory"
- The two contexts are similar because they share many second order co-occurrences

Clustering Methods
Agglomerative and Partitional

Many, many methods…

- Cluto supports a wide range of different clustering methods
  - Agglomerative
    - Average, single, complete link…
  - Partitional
    - K-means
  - Hybrid
    - Repeated bisections
- SenseClusters integrates with Cluto
  - http://www-users.cs.umn.edu/~karypis/cluto/

General Methodology

- Represent the contexts to be clustered as first or second order vectors
- Cluster the vectors directly, or convert to a similarity matrix and then cluster
  - vcluster
  - scluster

Agglomerative Clustering

- Create a similarity matrix of the instances to be discriminated
  - Results in a symmetric "instance by instance" matrix, where each cell contains the similarity score between a pair of instances
  - Typically a first order representation, where similarity is based on the features observed in the pair of instances

Measuring Similarity

- Integer Values
  - Matching Coefficient: $|X \cap Y|$
  - Jaccard Coefficient: $\frac{|X \cap Y|}{|X \cup Y|}$
  - Dice Coefficient: $\frac{2\,|X \cap Y|}{|X| + |Y|}$
- Real Values
  - Cosine: $\frac{\vec{X} \cdot \vec{Y}}{\|\vec{X}\|\,\|\vec{Y}\|}$

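A quick sketch of these measures (illustrative Python; here X and Y are feature sets for the integer-valued measures and numpy vectors for the cosine):

```python
# Pairwise similarity measures over feature sets / feature vectors.
import numpy as np

def matching(X, Y):
    return len(X & Y)

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def dice(X, Y):
    return 2 * len(X & Y) / (len(X) + len(Y))

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

X = {"island", "curse", "black", "magic", "child"}                 # features of C1
Y = {"military", "might", "voodoo", "child", "island", "curse"}    # features of C4
print(matching(X, Y), round(jaccard(X, Y), 2), round(dice(X, Y), 2))
print(round(cosine(np.array([1, 1, 1, 1, 1]), np.array([1, 0, 1, 0, 1])), 2))
```
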
Agglomerative Clustering

- Apply an agglomerative clustering algorithm to the similarity matrix
  - To start, each instance is its own cluster
  - Form a cluster from the most similar pair of instances
  - Repeat until the desired number of clusters is obtained
- Advantages: high quality clustering
- Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons

Average Link Clustering

Initial similarity matrix:

        S1   S2   S3   S4
S1      -    3    4    2
S2      3    -    2    0
S3      4    2    -    1
S4      2    0    1    -

S1 and S3 are the most similar pair (4), so they are merged first. With average link, the similarity of the merged cluster S1S3 to S2 is (3 + 2) / 2 = 2.5 and to S4 is (2 + 1) / 2 = 1.5:

        S1S3   S2    S4
S1S3     -     2.5   1.5
S2       2.5   -     0
S4       1.5   0     -

S1S3 and S2 are then merged (2.5), and the process repeats until the desired number of clusters remains.

Partitional Methods

- Select some number of contexts in feature space to act as centroids
- Assign each context to the nearest centroid, forming clusters
- After all contexts are assigned, recompute the centroids
- Repeat until stable clusters are found
  - Centroids don't shift from iteration to iteration

Partitional Methods

- Advantages: fast
- Disadvantages: very dependent on the initial placement of centroids

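For illustration, here is a bare-bones sketch of the assign/recompute loop described above (plain numpy; a toy k-means, not Cluto's far more sophisticated partitional methods):

```python
# Toy k-means: assign contexts to nearest centroid, recompute, repeat until stable.
import numpy as np

def kmeans(vectors, k, iterations=100, seed=0):
    """vectors: (n, d) numpy array of context vectors."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # assign each context vector to its nearest centroid
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; stop when they no longer shift
        new_centroids = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```
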
Cluster Labeling

Results of Clustering

- Each cluster consists of some number of contexts
- Each context is a short unit of text
- Apply measures of association to the contents of each cluster to determine the N most significant bigrams
- Use those bigrams as a label for the cluster

Label Types

- The N most significant bigrams for each cluster will act as a descriptive label
- The M most significant bigrams that are unique to each cluster will act as a discriminating label

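A toy sketch of the two label types (illustrative Python; it assumes the association scores for each cluster's bigrams have already been computed, e.g., with NSP):

```python
# Descriptive labels: top-n bigrams per cluster by association score.
# Discriminating labels: top-m bigrams that occur in no other cluster.
def labels(cluster_scores, n=3, m=3):
    """cluster_scores: {cluster_id: {bigram: association_score}}"""
    descriptive = {c: sorted(s, key=s.get, reverse=True)[:n]
                   for c, s in cluster_scores.items()}
    discriminating = {}
    for c, s in cluster_scores.items():
        elsewhere = {b for other, sc in cluster_scores.items() if other != c for b in sc}
        unique = {b: v for b, v in s.items() if b not in elsewhere}
        discriminating[c] = sorted(unique, key=unique.get, reverse=True)[:m]
    return descriptive, discriminating
```
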
Evaluation Techniques
Comparison to gold standard data

Evaluation

- If sense-tagged text is available, it can be used for evaluation
  - But don't use the sense tags for clustering or feature selection!
- Assume that the sense tags represent the "true" clusters, and compare these to the discovered clusters
  - Find the mapping of clusters to senses that attains maximum accuracy

Evaluation

- Pseudo-words are especially useful, since it is hard to find data that has already been discriminated
  - Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  - http://www.d.umn.edu/~tpederse/tools.html
  - http://www.d.umn.edu/~kulka020/kanaghaName.html
- Baseline Algorithm – group all instances into one cluster; this will reach "accuracy" equal to the majority classifier

Baseline Algorithm

- Baseline Algorithm – group all instances into one cluster; this will reach "accuracy" equal to the majority classifier
- What if the clustering said everything should be in the same cluster?

Baseline Performance

        S1   S2   S3   Totals            S3   S2   S1   Totals
C1       0    0    0      0      C1       0    0    0      0
C2       0    0    0      0      C2       0    0    0      0
C3      80   35   55    170      C3      55   35   80    170
Totals  80   35   55    170      Totals  55   35   80    170

(0 + 0 + 55) / 170 = .32          (0 + 0 + 80) / 170 = .47
if C3 is S3                       if C3 is S1

Evaluation

- Suppose that C1 is labeled S1, C2 as S2, and C3 as S3
- Accuracy = (10 + 0 + 10) / 170 = 12%
- The diagonal shows how many members of the cluster actually belong to the sense given on the column
- Can the "columns" be rearranged to improve the overall accuracy?
  - Optimally assign clusters to senses

        S1   S2   S3   Totals
C1      10   30    5     45
C2      20    0   40     60
C3      50    5   10     65
Totals  80   35   55    170

Evaluation

- The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%
- Find the ordering of the columns in the matrix that maximizes the sum of the diagonal.
- This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory.

        S2   S3   S1   Totals
C1      30    5   10     45
C2       0   40   20     60
C3       5   10   50     65
Totals  35   55   80    170

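One way to solve this assignment problem in practice is the Hungarian method; the sketch below (illustrative Python using scipy, not necessarily what SenseClusters itself runs) finds the optimal cluster-to-sense mapping for the confusion matrix above and recovers the 71% figure.

```python
# Optimal assignment of clusters to senses via the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = clusters C1..C3, columns = senses S1..S3 (confusion matrix from the slide)
confusion = np.array([
    [10, 30,  5],
    [20,  0, 40],
    [50,  5, 10],
])

rows, cols = linear_sum_assignment(confusion, maximize=True)
for c, s in zip(rows, cols):
    print(f"C{c + 1} -> S{s + 1}")          # C1 -> S2, C2 -> S3, C3 -> S1
accuracy = confusion[rows, cols].sum() / confusion.sum()
print(round(accuracy, 2))                    # 0.71
```
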
Analysis

- Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning
- Evaluation based on assuming that sense tags represent the "true" clusters is likely a bit harsh. Alternatives?
  - Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
  - Use the contents of the cluster to generate a descriptive label that could be inspected by a human

Practical Session
Experiments with SenseClusters

Experimental Data

- Available on the web site
  - http://senseclusters.sourceforge.net
- Available on CD
  - Data/SenseClusters-Data
- SenseClusters requires data to be in the Senseval-2 lexical sample format
  - Plenty of such data is available on CD and from the web site

Creating Experimental Data

- NameConflate program
  - Creates name conflated data from the English GigaWord corpus
- Text2Headless program
  - Converts plain text into headless contexts
- http://www.d.umn.edu/~tpederse/tools.html

Name Conflation Data

- Smaller Data Set (also on Web as SC-Web…)
  - Adidas - Puma
  - Emile Lahoud - Askar Akayev
- Larger Data Sets (also on Web as Split-Smaller…)
  - David Beckham - Ronaldo
  - Microsoft - IBM
- CICLING data (CD only)
  - Country - Noun
  - Name - Name
  - Noun - Noun
- ACL 2005 demo data (CD only)
  - Name - Name

Clustering Contexts

- ACL 2005 Demo (also on Web as Email…)
  - Various partitions of the 20 newsgroups data set
- Spanish Data (web only)
  - News articles, each of which mentions the abbreviation PP or PSOE

Name Discrimination

George Millers!

Headed Clustering

- Name Discrimination
  - Tom Hanks
  - Russell Crowe

Headless Contexts

- Email / 20 newsgroups data
- Spanish Text

If, after all these matrices, you crave knowledge-based resources…
Read on…

WordNet-Similarity

- Not language independent
  - Based on English WordNet
- But, can be combined with distributional methods to good effect
  - McCarthy, et al., ACL-2004
- Perl module
  - http://search.cpan.org/dist/WordNet-Similarity
- Web interface
  - http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

Many thanks!

- Satanjeev "Bano" Banerjee (M.S., 2002)
  - Inventor of the Adapted Lesk Algorithm (IJCAI-2003), which is the earliest origin and motivation for WordNet-Similarity…
  - Now PhD student at LTI/CMU…
- Siddharth Patwardhan (M.S., 2003)
  - Founding developer of WordNet-Similarity (2001-2003)
  - Now PhD student at the University of Utah: http://www.cs.utah.edu/~sidd/
- Jason Michelizzi (M.S., 2005)
  - Enhanced WordNet-Similarity in many ways and applied it to all-words sense disambiguation (2003-2005)
  - http://www.d.umn.edu/~mich0212
- NSF for supporting Bano, and the University of Minnesota for supporting Bano, Sid and Jason via various internal sources

Vector measure

- Build a word by word matrix from the WordNet Gloss Corpus
  - 1.4 million words
- Treat glosses as contexts, and use the second order representation where words are replaced with vectors from the matrix
  - Average together all vectors to represent a concept/definition
- High correlation with human relatedness judgements

Many other measures

- Path Based
  - Path
  - Leacock & Chodorow
  - Wu and Palmer
- Information Content Based
  - Resnik
  - Lin
  - Jiang & Conrath
- Relatedness
  - Hirst & St-Onge
  - Adapted Lesk
  - Vector

Thank you!

- Questions are welcome at any time. Feel free to contact me in person or via email (tpederse@d.umn.edu) at any time!
- All of our software is free and open source; you are welcome to download, modify, redistribute, etc.
  - http://www.d.umn.edu/~tpederse/code.html