Direct Word Sense Matching for Lexical Substitution

Bar-Ilan University
The Department of Computer Science

Direct Word Sense Matching for Lexical Substitution

by
Efrat Hershkovitz-Marmorshtein

Submitted in partial fulfillment of the requirements for the Master's Degree in the Department of Computer Science, Bar-Ilan University.

Ramat Gan, Israel
October 2006, Tishrei 5767

This work was carried out under the supervision of Dr. Ido Dagan, Department of Computer Science, Bar-Ilan University.
Acknowledgements
This thesis was completed with the help of a number of people, and I wish to express my heartfelt thanks to them.
I am grateful to Dr. Ido Dagan at Bar-Ilan University for his supervision of the thesis
and for his tutoring on how to observe, analyze and scientifically formalize. It has been a
great pleasure to work with him on this subject and I have learned a lot.
I would like to thank Oren Glickman at Bar-Ilan University for his guidance at the outset, and for his advice, both professional and technical, throughout the work.
I would also like to thank our Italian colleagues, Alfio Gliozzo and Carlo Strapparava from ITC-Irst, with whom I enjoyed a fruitful collaboration. I wish to thank them for their assistance in implementing some of the methods, and for their helpful suggestions based on their professional experience.
I would like to thank Hami Margliyot for her help in professional matters and in
wording the paper.
I would like to thank our NLP group for their mutual support and helpful comments.
Finally, I would like to thank my family, my husband Yair, and our daughter Hadar Tova for their understanding, support and encouragement.
Table of Contents
Table of Contents ................................................................................................................ 4
List of tables and figures ..................................................................................................... 5
Abstract ............................................................................................................................... 6
1. Introduction ..................................................................................................................... 9
2. Background ................................................................................................................... 13
2.1 Word Senses and Lexical Ambiguity ..................................................................... 13
2.1.1 WordNet – a word sense database ....................................................... 13
2.1.2 Lexical ambiguity and Senseval .......................................................... 14
2.2 WSD and Lexical Expansion .................................................................. 15
2.3 Textual Entailment .................................................................................................. 17
2.4 Classification Algorithms ....................................................................................... 19
2.4.1 Binary classification using SVM (Support Vector Machine) .......................... 20
2.4.1.1 One-class SVM ......................................................................................... 22
2.4.2 The kNN (k-nearest neighbor) classification algorithm .................. 23
3. Problem Setting and Dataset ......................................................................... 27
4. Investigated Methods .................................................................................... 32
4.1 Feature set and classifier ......................................................................................... 33
4.2 Supervised Methods ................................................................................................ 34
4.2.1 Indirect approach ............................................................................................. 34
4.2.2 Direct approach ................................................................................................ 34
4.3 Unsupervised Methods............................................................................................ 35
4.3.1 Indirect approach ............................................................................................. 36
4.3.2 Direct approaches............................................................................................. 36
4.3.2.1 Direct approach one-class SVM .............................................................. 37
4.3.2.2 Direct approach KNN-based ranking........................................................ 39
5. Evaluation ..................................................................................................................... 42
5.1 Evaluation measures ............................................................................................... 42
5.1.1 Classification measure ..................................................................................... 42
5.1.2 Ranking measure .............................................................................................. 42
5.2 Classification measure results ................................................................................. 43
5.2.1 Baselines .......................................................................................................... 43
5.2.2 Supervised Methods ........................................................................................ 44
5.2.3 Unsupervised methods ..................................................................................... 46
5.3 Ranking measure results ......................................................................................... 49
6. Conclusion and future work .......................................................................................... 53
7. References ..................................................................................................................... 55
List of tables and figures
Tables List
Table 1 - Source and target pairs ................................................................... 29
Table 2 - Positive and negative examples for the source-target synonym pair 'record-disc' ........ 30
Table 3 - Example instances for the source-target synonym pair 'level-degree', where two senses of the source word 'degree' are considered positive ........ 31
Table 4 - A noisy training example and an appropriate training example for the source word 'level' and the target word 'degree' ........ 39
Table 5A - Classification results on the sense matching task - supervised methods ........ 44
Table 5B - Classification results on the sense matching task - unsupervised methods ........ 44
Table 6 - Mean Average Precision .................................................................. 50
Figures List
Figure 1 - Pseudo-code for our kNN classifier algorithm .............................. 41
Figure 2 - Direct supervised results varying J ................................................ 41
Figure 3 - One-class evaluation varying ν ...................................................... 45
Figure 4 - Precision, Recall and F1 of kNN with cosine metric and k=10, for various thresholds ........ 47
Figure 5 - Macro-averaged recall-precision curves ........................................ 50
Figure 6 - Results of kNN with different values of k ..................................... 51
Figure 7 - Results of kNN with different similarity metrics .......................... 52
Abstract
This thesis investigates, conceptually and empirically, the novel sense matching task of recognizing whether the senses of two synonymous words match in context. Sense
matching enables substituting a word by its synonym, which is called lexical substitution.
It is a commonly used operation for increasing recall in information seeking applications,
like Information Retrieval (IR) and Question Answering (QA). For example, there are contexts in which the given source word 'design' (which might be part of a search query) may be substituted by the target word 'plan'; however, one should recognize that 'plan' has a different sense than 'design' in sentences such as "They discussed plans for a new bond issue", while in the sentence "The construction plans were drawn" it should be recognized that the sense of the target word 'plan' does match the meaning of the source word 'design'.
This thesis addresses the task of verifying that the senses of two given words do indeed match in a given context; in other words, recognizing texts in which the specified source word might be substituted with a synonymous target word. Such improved recognition of sense matching would improve the eventual precision of applications, which typically decreases somewhat when applying lexical substitution.
To perform lexical substitution, a source of synonymous words is required. One of the
most common sources, which was also used in our work, is WordNet (Fellbaum, 1998).
Given a pair of synonymous words, the binary classification task of sense matching may be addressed by various methods, which are categorized by two basic characteristics. The first is whether the sense matching is direct or indirect. In the indirect approach, the senses of the source word and the target word are explicitly identified relative to predefined lists of word senses, a process called Word Sense Disambiguation (WSD), and then compared. In the direct approach it is determined whether the senses match without explicitly identifying the sense identity. Apparently, the indirect approach solves a harder intermediate problem than eventually required and relies on a set of explicitly stipulated senses, while in the direct approach there is no explicit reference to predefined senses.
The second distinction between methods is whether they are supervised or
unsupervised. Supervised methods require manually labeled learning data, while
unsupervised methods do not require any manual intervention.
In this thesis we investigate sense matching methods of all the above types. We
experimented with a supervised indirect method, which makes use of the standard multi-class WSD setting to identify the sense of the target word, and classifies positively for the sense matching task if the selected sense matches one of the senses of the source word. The supervised direct method we examined is trained on binary annotated learning data of matching and non-matching target words, which correspond to the multiple senses of the source word. The unsupervised indirect method we implemented matches example words in the given context of the target word with the sense definitions of the source word, obtained from a common dictionary resource.
The most powerful approach we investigated is the unsupervised direct one, which
avoids the intermediate step of explicit word sense disambiguation, and thus circumvents
the problematic reliance on a set of explicitly stipulated senses, and does not require
manual labeling as well. The underlying assumption of this approach is that if the sense of the substituting target word matches the original source word, then its context should be valid (typical) for the source word. The classification scheme we suggest in this thesis learns a model from unlabeled occurrences of the source word, referring to all of them as positive examples, and tests whether this model matches the context of the given occurrence of the target word. We applied two different methods within this approach: one is based on the one-class SVM algorithm, which tries to delimit a region containing most of the training examples, and classifies a substituting target word as matching if it falls within this region.
The other method is based on a kNN approach, which calculates similarity between the
substituting target word and the occurrences of the source word, and ranks the level of
matching between the two words according to the level of similarity between the target
word context and the k most similar occurrences of the source word.
We used two different measures to evaluate the results of the methods described
above, one for evaluating classification accuracy and the other for evaluating ranking
quality. The ranking measure could be applied only for the kNN method and the
supervised direct method we implemented, since only those gave a score for each
substituting word, which enabled ranking. Classification could be applied to all methods, by setting a threshold for positive classification on the ranking methods' scores, converting their results to binary classifications.
Positive empirical results are presented for all methods, substantially improving over the baselines. We focused on the direct unsupervised approach, which does not require any manual intervention and does not rely on any form of external information. As described above, we applied two different methods within this approach, the kNN method and the one-class method, where the former obtained better results. These results are accompanied by some stimulating analysis for future research.
1. Introduction
In many language processing settings it is necessary to recognize that a given word or term
may be substituted by a synonymous one. In a typical information seeking scenario, an
information need is specified by some given source words. When looking for texts that
match the specified need the original source words might be substituted with
synonymous target words. For example, given the source word ‘weapon’ a system may
substitute it with the target synonym ‘arm’ when searching for relevant texts about
weapons.
This scenario, which is generally referred to here as lexical substitution, is a common
technique for increasing recall in Natural Language Processing (NLP) applications. In
Information Retrieval (IR) and Question Answering (QA), it is typically termed
query/question expansion (Moldovan and Mihalcea, 2000; Negri, 2004). Lexical
Substitution is also commonly applied to identify synonyms in text summarization, for
paraphrasing in text generation, or is integrated into the features of supervised tasks such
as Text Categorization and Information Extraction. Naturally, lexical substitution is a
very common first step in textual entailment recognition, which models semantic
inference between a pair of texts in a generalized application independent setting (Dagan
et al., 2005).
To perform lexical substitution NLP applications typically utilize a knowledge source
of synonymous word pairs. The most commonly used resource for lexical substitution is
the manually constructed WordNet (Fellbaum, 1998). Another option is to use statistical word similarities, such as in the database constructed by Dekang Lin (e.g., Lin, 1998).[1] We generically refer to such resources as substitution lexicons.[2]
When using a substitution lexicon it is assumed that there are some contexts in which
the given synonymous words share the same meaning. Yet, due to polysemy, it is needed
to verify that the senses of the two words do indeed match in a given context. For
example, there are contexts in which the source word ‘weapon’ may be substituted by the
target word ‘arm’; however one should recognize that ‘arm’ has a different sense than
‘weapon’ in sentences such as “repetitive movements could cause injuries to hands,
wrists and arms.”
Since the sense matching involves sense disambiguation of both words, either
explicitly or implicitly, a mismatch between the source and target words may be caused
by wrong sense disambiguation of either of them. To illustrate these two cases of
mismatch, let us first consider the pair of source word weapon and target word arm,
when arm appears in the following context: "Look, could you grab hold of this box before my arms drop off?" In this sentence, the word arm appears in a different sense than weapon - not the desired sense. The second type of mismatch happens when the original
source word is substituted by a word that is not synonymous to the sense of the original
word in the given context. For example, assume that the source word 'paper' appears in a
given query "photocopying paper". In this case it would be wrong to substitute it with the
target word 'newspaper', which is synonymous to a different sense of 'paper'. The focus of
our research is to solve the mismatch of the first type, while the same method could be
applied to solve the second type, when switching roles between the source and target words.

[1] Available from http://armena.cs.ualberta.ca/lindek/downloads
[2] While focusing on synonymy in this thesis, lexical substitution may be based on additional lexical semantic relations such as hyponymy.
A commonly proposed approach to address sense matching in lexical substitution is
applying Word Sense Disambiguation (WSD) to identify the senses of the source and
target words. In this approach, substitution is applied only if the words have the same
sense (or synset, in WordNet terminology). In settings in which the source is given as a
single term without context, sense disambiguation is performed only to the target word;
substitution is then applied only if the target word’s sense matches at least one of the
possible senses of the source word.
One might observe that such application of WSD addresses the task at hand in a
somewhat indirect manner. In fact, lexical substitution only requires knowing that the
source and target senses do match, but it does not require that the matching senses be
explicitly identified. Selecting explicitly the right sense in context, which is then followed
by verifying the desired matching, might be solving a harder intermediate problem than
required. Instead, we can define the sense matching problem directly as a binary
classification task for a pair of synonymous source and target words. This task requires deciding whether the senses of the two words do or do not match in a given context (but it does not require explicitly identifying the matching senses).
A highly related task was proposed in (McCarthy, 2002). McCarthy's proposal was to
ask systems to suggest possible "semantically similar replacements" of a target word in
context, where alternative replacements should be grouped together. While this task is
somewhat more complicated as an evaluation setting than our binary recognition task, it
was motivated by similar observations and applied goals. From another perspective, sense
matching may be viewed as a lexical sub-case of the general textual entailment
recognition setting, where we need to recognize whether the meaning of the target word
"entails" the meaning of the source word in a given context.
This thesis[3] provides a first investigation of the novel sense matching problem. To
allow comparison with the classical WSD setting we derived an evaluation dataset for the
new problem from the Senseval-3 English lexical sample dataset (Mihalcea and
Edmonds, 2004). We then evaluated alternative supervised and unsupervised methods
that perform sense matching either indirectly or directly (i.e. with or without the
intermediate sense identification step). Our findings suggest that in the supervised setting
the results of the direct and indirect approaches are comparable. However, addressing
directly the binary classification task has practical advantages and can yield high
precision values as desired in precision-oriented applications such as IR and QA.
More importantly, direct sense matching sets the ground for implicit unsupervised
approaches that may utilize practically unlimited volumes of unlabeled training data.
Furthermore, such approaches circumvent the Sisyphean need for specifying explicitly a
set of stipulated senses. We present initial implementations of such approaches based on a
one-class classifier and a KNN-style ranking method. These methods are trained on
unlabeled occurrences of the source word and are applied to classify and rank test
occurrences of the target word. The presented results outperform the unsupervised
baselines and put forth a whole new direction for future research.
[3] Major parts of this research were published in (Dagan et al., 2006), which was based on the current thesis work.
2. Background
2.1 Word Senses and Lexical Ambiguity
2.1.1 WordNet – a word sense database
The WordNet ontology (Fellbaum, 1998) is the most influential computational lexical resource, and we refer to it here in order to obtain an application-oriented view of prominent lexical relations.
WordNet is a lexical database which is available online, and provides a large
repository of English lexical items. It was developed by a group of lexicographers led by
Miller, Fellbaum and others at Princeton University and has been constantly updated and
improved during the last fifteen years. Inspired by current psycholinguistic theories of
human lexical memory, it consists of English nouns, verbs, adjectives and adverbs
organized into synonym sets – synsets, each representing one underlying sense.
A synset includes a set of synonyms and their definition. The specific meaning of one word under one part of speech (POS) is called a sense, and each sense of a word appears in a different synset. Synsets are thus equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it
represents. For example, the words 'night', 'nighttime' and 'dark' constitute a single synset
that has the following gloss: “the time after sunset and before sunrise while it is dark
outside." Synsets are connected to one another through explicit semantic relations. Some
of these relations (hypernym and hyponym for nouns, hypernym and troponym for verbs)
constitute is-a-kind-of (hyperonymy) and is-a-part-of (meronymy for nouns) hierarchies.
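To make the synset structure concrete, the following minimal sketch (an illustration using NLTK's WordNet interface, which is our assumption; the thesis accessed WordNet directly) looks up the synsets of a word and prints their lemmas, glosses and hypernyms.

```python
# A minimal sketch of browsing WordNet synsets, assuming NLTK and its
# WordNet corpus are installed (illustrative; the thesis used WordNet
# directly rather than through NLTK).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('plan', pos=wn.NOUN):
    # Each synset groups synonymous lemmas under one underlying sense.
    print(synset.name(), synset.lemma_names())
    print('  gloss:', synset.definition())
    # Hypernyms form the is-a-kind-of hierarchy described above.
    print('  hypernyms:', [h.name() for h in synset.hypernyms()])
```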
Diverse WordNet relations have been used in various NLP tasks as a source of
candidate lexical substitutes for expansion. Expansion consists of altering a given text
(usually a query) by adding terms of similar meaning. For example, many question
answering systems perform expansion in the retrieval phase using query related words
based on WordNet’s lexical relations, such as synonymy or hyponymy (e.g. (Harabagiu et
al., 2000), (Hovy et al., 2001)). Automatic indexing has been improved by adding the
synsets of query words and their hypernyms to the query (Mihalcea and Moldovan,
2000). Scott and Matwin (1998) exploited WordNet hypernyms to increase the accuracy
of Text Classification. Chaves (1998) enhanced a document summarization task through
merging WordNet hyponymy chains, while Flank (1998) introduced a layered approach
to term similarity computation for information retrieval, which assigns the highest weights to synonymy relations, ranks hyponyms next, and gives meronymy relations the lowest scores in the final similarity weights. Notably, each of the above
works addressed the problem within the narrow setting of a specific application, while
none has induced a clear generic definition of the types of ontological relations that
contribute to semantic substitution.
2.1.2 Lexical ambiguity and Senseval
Word Sense Disambiguation (WSD) is the problem of deciding which sense a word has
in any given context. It has been very difficult to formalize the process of
disambiguation, which humans perform so effortlessly. For virtually all applications of
language technology, word sense ambiguity is a potential source of error. One example is
Machine Translation (MT). If the English word 'drug' translates into French as either
'drogue' (narcotic) or 'médicament' (medication), then an English-French MT system
needs to disambiguate every use of 'drug' in order to make the correct translation.
Similarly, information retrieval systems may erroneously retrieve documents about an
illegal narcotic when the item of interest is a medication; analogously, information
extraction systems may make wrong assertions; and text-to-speech applications may confuse violin bows with a ship's bows.
Senseval (http://www.senseval.org/) is the international organization devoted to the
evaluation of Word Sense Disambiguation systems. Its mission is to organize and run
evaluation and related activities to test the strengths and weaknesses of WSD systems
with respect to different words, different aspects of language, and different languages. In
actual applications, WSD is often fully integrated into the system and often cannot be
separated. But in order to study and evaluate WSD, Senseval has, to date, concentrated on
standalone, generic systems for WSD.
2.2 WSD and Lexical Expansion
Despite some initial skepticism about the usefulness of WSD in practical tasks
(Voorhees, 1993; Sanderson, 1994), there is some evidence that WSD can improve
performance in typical NLP tasks such as IR and QA. For example, Schütze and
Pedersen (1995) give a clear indication of the potential for WSD to improve the precision
of an IR system. They tested the use of WSD on a standard IR test collection (TREC-1B),
improving precision by more than 4%.
The use of WSD has also produced successful experiments with query expansion techniques. In particular, some attempts exploited WordNet to enrich queries with semantically related terms. For instance, Voorhees (1994) manually expanded 50 queries over the
TREC-1 collection using synonymy and other WordNet relations. She found that the
expansion was useful with short and incomplete queries, leaving the task of proper
automatic expansion as an open problem.
Gonzalo et al. (1998) demonstrate an improvement in performance over an IR test collection using the sense data contained in SemCor, compared to a purely term-based model. In practice, they experimented with searching SemCor using disambiguated and expanded queries.
Their work shows that a WSD system, even if not performing perfectly, combined with
synonymy enrichment increases retrieval performance.
Moldovan and Mihalcea (2000) introduce the idea of using WordNet to extend Web
searches based on semantic similarity. Their results showed that WSD-based query
expansion actually improves retrieval performance in a Web scenario. Recently Negri
(2004) proposed a sense-based relevance feedback scheme for query enrichment in a QA
scenario (TREC-2003 and ACQUAINT), demonstrating improvement in retrieval
performance.
While all these works clearly show the potential usefulness of WSD in practical tasks,
nonetheless they do not necessarily justify the efforts for refining fine-grained sense
repositories and for building large sense-tagged corpora. We suggest that the sense
matching task, as presented in the introduction, may relieve major drawbacks of applying
WSD in practical scenarios.
It is worth mentioning a related approach of word sense discrimination (Pedersen and
Bruce, 1997; Schütze, 1998). Word sense discrimination intends to divide the usages of a
word into different meanings without regard to any particular existing sense inventory.
Typically approached with unsupervised techniques, sense discrimination divides the
occurrences of a word into a number of classes by determining for any two occurrences
whether they belong to the same sense or not. Consequently, sense discrimination does not determine the actual "meaning" (i.e., sense "label") but rather identifies which occurrences of the same word have an equivalent meaning. Overall, word sense discrimination can be viewed as an indirect approach which assigns unsupervised senses.
In our preliminary work we assessed the importance of identifying expansion mismatches in applied settings and evaluated the causes of such mismatches. Using several pairs of source words, we checked whether substituting them with target synonyms actually results in sense mismatches in randomly retrieved sentences. We discovered that the main cause of inappropriately retrieved sentences is indeed word sense mismatch, which accounted for 77% of retrieval mismatches. For example, consider the
original word pair 'cut job', where the source word 'job' is substituted by the target word
'position'. A successful substitution is found in the sentence: "40% of the positions at the
company were cut." The following sentence is an example of a sense mismatch: "The
company’s market position suffered a cut after a bad quarter". In this sentence the word
position has a different sense than job.
2.3 Textual Entailment
Textual entailment (TE) has been proposed recently as a generic framework for modeling
semantic variability in many Natural Language Processing applications, such as Question
Answering (QA), Information Extraction (IE) and Information Retrieval (IR). Textual entailment is defined as a relationship between a coherent text T and a language expression H, which is considered as a hypothesis. Then, T entails H (H is a consequent of T), denoted by T => H, if the meaning of H, as interpreted in the context of T, can be inferred from the meaning of T. For example, "Shirley inherited the house" => "Shirley owned the house".
Identifying entailment between texts is a complex task, and many researchers have addressed sub-tasks of TE. For example, Geffet and Dagan (2004) explored the correspondence between the distributional characterizations of pairs of words (which may rarely co-occur, as is usually the case for synonyms) and the kind of tight semantic relationship that might hold between them, in particular entailment at the lexical
distributional similarity lists, which better approximate the lexical entailment relation.
This method still applies a standard measure for distributional vector similarity (over
vectors with the improved feature weights), and thus produces many loose similarities
that do not correspond to entailment.
In a later paper, they explore more deeply the relationship between distributional
characterization of words and lexical entailment, proposing two new hypotheses as a
refinement of the distributional similarity hypothesis. The main idea is that if one word
entails the other then we would expect that virtually all the characteristic context features
of the entailing word will actually occur also with the entailed word. To illustrate this
idea let us consider an entailing pair: company => firm, and the following set of
characteristic features of “company” – {“(company)’s profit”, “chairman of the
(company)”}. Then these features are expected to appear with “firm” as well in some
large corpus - “firm’s profit” and “chairman of the firm”. Other researchers have explored
other aspects of textual entailment: Glickman, Dagan, and Kopel (2004) propose a
general generative probabilistic setting for textual entailment. They focus on the sub-task
of recognizing whether the lexical concepts present in a given hypothesis are entailed
from a given text.
Glickman, Bar-Haim and Szpektor (2005) suggest an analysis of sub-components and tasks
within textual entailment, proposing two levels: Lexical and Lexical-Syntactic. At the
lexical level, they match (possibly multi-word) terms of one text (T) and a second text
(hypothesis H), ignoring function words. At the lexical-syntactic level, they match the
syntactic dependency relations within H and T.
The sense matching problem we tried to deal with is actually a binary classification
task: to decide whether the occurrence of the target word in the given sentence entails the
source word (i.e., at least one of the meanings of the source word). An example we have
already mentioned is the sentence 'Repetitive movements could cause injuries to hands,
wrists and arms.', where the word arm substitutes the word weapon, but with the wrong
sense. In this case, the target word arm does not entail the source word weapon. On the
other hand, in the sentence 'This house was badly mauled by careless soldiers searching
for arms' the target word arm entails the source word weapon. In our work we suggest a
novel approach of using an implicit WSD method that identifies such lexical entailment
in context.
2.4 Classification Algorithms
As we have mentioned in the introduction, we can define the sense matching problem
directly as a binary classification task for a pair of synonymous source and target words.
For this task we used two algorithms: SVM (Support Vector Machine) and a kNN (k-nearest neighbors) based method. We used two existing implementations of the SVM algorithm, LibSVM and SVMLight, and implemented the kNN algorithm ourselves.
2.4.1 Binary classification using SVM (Support Vector Machine)
The SVM algorithm represents each source-target pair as a point (or vector) in a multi-dimensional space, where each dimension corresponds to a feature of the pair. Given a collection of such training points in the feature space, each tagged positive or negative, we would like to separate the positive and the negative ones as neatly as possible, by the simplest plane. The task of determining whether an untagged point is negative or positive is called classification, and a group of identically tagged points is called a class.
In some cases, the two classes of positive and negative points can be separated by a
multi-dimensional 'plane’, which is called a hyperplane. A method that uses such a
hyperplane is therefore called linear classification. If such a hyperplane exists, we would
like to choose the one that separates the data points with maximum distance between it
and the closest data points from both classes. This distance is called the margin. We
desire this property, because it makes the separation between the two classes greater If we
then add another data point to the points we already have, we can more accurately
classify the new point more accurately. Such a hyperplane is known as the maximummargin hyperplane, or the optimal hyperplane. The vectors (data points) that are closest
to this hyperplane are called the support vectors.
Let us now give a more detailed view of the SVM algorithm. We consider a set of training data points of the form $\{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$, where each $c_i$ is either 1 or $-1$, denoting the class (positive or negative) to which the point $x_i$ belongs. Each $x_i$ is an $n$-dimensional vector of values scaled to $[0,1]$ or $[-1,1]$. The scaling is important to guard against features with larger variance that might otherwise dominate the classification. These points constitute the training data, which denote the correct classification that the SVM should eventually reproduce. The classification is defined by means of the dividing hyperplane, which takes the form

(1) $w \cdot x - b = 0$

where $w$ and $b$ are the parameters of the optimal hyperplane that we need to learn.
As we are interested in the maximum margin, the dominant data points are the support vectors, and the hyperplanes closest to these support vectors in either class are parallel to the optimal separating hyperplane. It can be shown that they can be described by the equations

(2) $w \cdot x - b = 1$
(3) $w \cdot x - b = -1$

We would like these hyperplanes to maximize the distance from the dividing hyperplane and to have no data points between them. Geometrically, the distance between the hyperplanes is $2/\|w\|$, so $\|w\|$ should be minimized in order to maximize the margin. To exclude data points, it should be ensured that for all $i$,

(4) $w \cdot x_i - b \geq 1$ or $w \cdot x_i - b \leq -1$

This can be rewritten as:

(5) $c_i (w \cdot x_i - b) \geq 1, \quad 1 \leq i \leq n$
The problem now is to minimize $\|w\|$ subject to constraint (5). This is a quadratic programming (QP) optimization problem, which is solved by the SVM training algorithm. After the SVM has been trained, it can be used to classify unseen 'test' data. This is achieved using the following decision rule:

(6) $c = \begin{cases} 1 & \text{if } w \cdot x - b > 0 \\ -1 & \text{if } w \cdot x - b < 0 \end{cases}$

where $c$ is the class of the new data point $x$. Writing the classification rule in its dual form reveals that classification is a function only of the support vectors, i.e., the training data points that lie on the margin.
In the cases where there is no hyperplane that can split the positive and negative training examples, the Soft Margin method is applied. This method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables $\xi_i$, and equation (5) now transforms to

(7) $c_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n$

and the optimization problem becomes

(8) $\min \|w\|^2 + C \sum_i \xi_i \quad \text{such that} \quad c_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n$

The constraint in (7), along with the objective of minimizing $\|w\|$, can be solved using Lagrange multipliers or by setting up a dual optimization problem to eliminate the slack variables.
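To illustrate the soft-margin formulation, the following minimal sketch (an assumption on our part: it uses scikit-learn's SVC for brevity, whereas the thesis used the SVMLight and LibSVM packages) trains a linear soft-margin SVM in which the parameter C plays the role of the misclassification cost in equation (8).

```python
# A minimal soft-margin SVM sketch (illustrative only; the thesis used
# SVMLight and LibSVM). C is the cost attached to the slack variables in
# equation (8): larger C penalizes misclassified training points more.
import numpy as np
from sklearn.svm import SVC

# Toy training data: feature vectors x_i with labels c_i in {1, -1}.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
c = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, c)

# Decision rule (6): the predicted class is sign(w . x - b).
print(clf.predict([[0.3, 0.7]]))    # expected: [1]
print(clf.support_vectors_)         # the support vectors defining the margin
```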
2.4.1.1 One-class SVM
In certain cases the training data contain points of one class only, and the above separation is no longer possible. In such cases the aim is to estimate the smallest hypersphere enclosing most of the positive training data points. New test instances are then classified positively if they lie inside the sphere, while outliers are regarded as negatives.

We used the SVMLight[4] classifier (developed by T. Joachims at the University of Dortmund) for the case where the data contain points of both classes, and LibSVM[5], with its one-class option, for the one-class case. LibSVM also enables us to control the ratio between the width of the enclosed region of training points and the number of misclassified training examples, by setting the parameter $\nu \in (0, 1)$. Smaller values of $\nu$ will produce larger positive regions, yielding increased recall but lower precision.
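A minimal sketch of this one-class setting follows (again an illustration using scikit-learn's LibSVM-based wrapper, which is our assumption; the thesis used LibSVM directly); nu corresponds to the ν parameter discussed above.

```python
# A minimal one-class SVM sketch (illustrative; the thesis used LibSVM
# directly). The model is trained on positive examples only; nu controls
# the trade-off between the width of the learned region and the number
# of training points it is allowed to reject.
import numpy as np
from sklearn.svm import OneClassSVM

# Unlabeled occurrences of the source word, as feature vectors
# (all of them treated as positive examples).
X_source = np.array([[0.9, 0.8], [1.0, 0.9], [0.8, 1.0], [0.9, 1.0]])

ocsvm = OneClassSVM(nu=0.1, kernel='rbf', gamma='scale')
ocsvm.fit(X_source)

# +1: the target-word context falls inside the learned region (sense match);
# -1: outlier (no sense match).
print(ocsvm.predict([[0.95, 0.9], [0.1, 0.2]]))
```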
2.4.2 The kNN (k-nearest neighbor) classification algorithm
The k-nearest neighbor algorithm[6] is an intuitive method that classifies unlabeled examples based on their similarity to examples in a given training set. For a given unlabeled example, the algorithm finds the k closest labeled examples in the training data and classifies the unlabeled example according to the most frequent class within the set of the k closest examples. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.
The training examples are mapped into a multidimensional feature space. The space is partitioned into regions by the class labels of the training samples. A point in the space is assigned to class c if c is the most frequent class label among the k nearest training samples. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the actual classification phase, the same features as before are computed for the new test example (whose class is not known). The distances from the new vector to all stored vectors are computed, using a selected distance metric or some vector similarity measure. The k closest samples are then selected, and the new point is predicted to belong to the most frequent class within this set. The performance of the kNN algorithm is influenced by two main factors: (1) the similarity measure used to locate the k nearest neighbors; and (2) the number k of neighbors used to classify the new sample.

[4] http://svmlight.joachims.org/
[5] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Major parts of this paragraph were adapted from Wikipedia: http://en.wikipedia.org/wiki/Nearest_neighbor_(pattern_recognition)
The best choice of k depends upon the data. Generally, larger values of k reduce the
effect of noise on the classification, but make boundaries between classes less distinct. A
good k can be selected by parameter optimization using, for example, cross-validation.
The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their relevance. Much
research effort has been placed into selecting or scaling features to improve classification.
A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling (Dixon, Corne and Oates, 2003). Another popular approach is to scale features by the mutual information of the training data with the training classes (Yang and Pedersen, 1997; Li et al., 2001).
The algorithm is easy to implement, but it is computationally intensive, especially
when the size of the training set grows. Many optimizations have been proposed over the
years; these generally seek to reduce the number of distances actually computed. Some
optimizations involve partitioning the feature space, and only computing distances within
specific nearby volumes.
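To make the procedure concrete, here is a minimal sketch of the kNN classification scheme described above (our own illustrative rendering, not the thesis implementation); the similarity function is a pluggable parameter, anticipating the measures discussed next.

```python
# A minimal kNN classifier sketch (illustrative only). Training consists of
# storing (feature_vector, label) pairs; classification ranks the stored
# vectors by a pluggable similarity function and votes among the top k.
from collections import Counter

def knn_classify(test_vec, train_data, k, similarity):
    """train_data: list of (feature_vector, class_label) pairs."""
    # Score every stored training example against the test example.
    scored = [(similarity(test_vec, vec), label) for vec, label in train_data]
    # Keep the k most similar neighbors.
    neighbors = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    # Predict the most frequent class among the k neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```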
To measure the distance between two vectors, some vector metric or measure of similarity is required. We used three of the most popular similarity measures. The Jaccard measure (Grefenstette, 1994) compares the number of common features with the number of unique features for a pair of examples. When generalizing this scheme to non-binary values, each feature is represented by a real value in the range 0-1. This generalization, known as Weighted Jaccard, replaces intersection with the minimum weight and union with the maximum weight. Set cardinality is generalized to summing over the union of the features of the two examples $w$ and $v$:

$$sim_{WJ}(w, v) = \frac{\sum_{f \in F(w) \cup F(v)} \min(weight(w, f), weight(v, f))}{\sum_{f \in F(w) \cup F(v)} \max(weight(w, f), weight(v, f))}$$

where $F(w)$ and $F(v)$ are the features of the two examples. The advantage of this measure is that it also takes into account the feature weights rather than just the number of common features.
The standard Cosine measure, which was successfully employed in IR (Salton and McGill, 1983) and also for learning similar words (Ruge, 1992; Caraballo, 1999; Gauch et al., 1999; Pantel and Ravichandran, 2004), is the second alternative we examine:

$$sim_{\cos}(w, v) = \frac{\sum_{f} weight(w, f) \cdot weight(v, f)}{\sqrt{\sum_{f} weight(w, f)^2} \cdot \sqrt{\sum_{f} weight(v, f)^2}}$$

Calculating the cosine of the angle between the two vectors considers the difference in direction of the two vectors in feature space, as opposed to their geometric distance. Thus, it overcomes the problem of distance metrics that discriminate too strongly between vectors with significantly different lengths.
The third measure we used is a recent state-of-the-art variant of the weighted Jaccard measure (Weeds and Weir, 2004), which was developed by Lin (1998) and is grounded in principles of information theory. It computes the ratio between what is shared by the features of both vectors and the sum over the features of each vector:

$$sim_{Lin}(w, v) = \frac{\sum_{f \in F(w) \cap F(v)} \left( weight_{MI}(w, f) + weight_{MI}(v, f) \right)}{\sum_{f \in F(w)} weight_{MI}(w, f) + \sum_{f \in F(v)} weight_{MI}(v, f)}$$

$F(w)$ and $F(v)$ are the features of the two examples, and the weight function is defined as the Mutual Information (MI). There are three underlying intuitions to this measure: (1) the more commonality the two objects share, the more similar they are; (2) the more differences they have, the less similar they are; (3) the maximum similarity between objects A and B should only be reached when they are identical, no matter how much commonality they share.
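The three measures can be rendered directly from the formulas above; in the following illustrative sketch (our own, with each example represented as a Python dict mapping features to weights), the Lin measure assumes the weights are already Mutual Information values.

```python
# Illustrative implementations of the three similarity measures above.
# Each example is a dict mapping a feature to its weight (MI weights are
# assumed for sim_lin).
import math

def sim_weighted_jaccard(w, v):
    # Intersection -> minimum weight, union -> maximum weight, summed
    # over the union of the features of the two examples.
    features = set(w) | set(v)
    num = sum(min(w.get(f, 0.0), v.get(f, 0.0)) for f in features)
    den = sum(max(w.get(f, 0.0), v.get(f, 0.0)) for f in features)
    return num / den if den else 0.0

def sim_cosine(w, v):
    # Dot product normalized by the product of the vector norms.
    dot = sum(w[f] * v[f] for f in set(w) & set(v))
    norm = math.sqrt(sum(x * x for x in w.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def sim_lin(w, v):
    # Weight mass shared by both vectors over the total weight mass.
    shared = set(w) & set(v)
    num = sum(w[f] + v[f] for f in shared)
    den = sum(w.values()) + sum(v.values())
    return num / den if den else 0.0
```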
3. Problem Setting and Dataset
To investigate the direct sense matching problem it is necessary to obtain an appropriate
dataset of examples for this binary classification task, along with gold standard
annotation. While there is no such standard (application independent) dataset available it
is possible to derive it automatically from existing WSD evaluation datasets, as described
below. This methodology also allows comparing direct approaches for sense matching
with classical indirect approaches, which apply an intermediate step of identifying first
the most likely WordNet sense.
We chose to work with single words in order to address the direct sense matching problem in its most basic form, avoiding the problematic word dependencies that arise when working with more than a single word at a time. Our dataset was derived from the Senseval-3 English lexical sample dataset
(Mihalcea and Edmonds, 2004), and included all 25 nouns, adjectives and adverbs in this
sample. Verbs were excluded since their sense annotation in Senseval-3 is not based on
WordNet senses but rather on a different dictionary (the available approximate mapping
to WordNet synsets was not sufficiently reliable). The Senseval dataset includes a set of example occurrences in context for each word, split into training and test sets, where each
example is manually annotated with the corresponding WordNet synset.
For the sense matching setting we need examples of pairs of source-target
synonymous words, where at least one of these words should occur in a given context.
Following an applicative motivation, we mimic a typical IR setting in which a single
source word query is expanded (substituted) by a synonymous target word. Then, it is
needed to identify contexts in which the target word appears in a sense that matches the
source word. Accordingly, we considered each of the 25 words in the Senseval sample as
a target word for the sense matching task. Next, we had to pick for each target word a
corresponding synonym to play the role of the source word. This was done by creating a
list of all WordNet synonyms of the target word, under all its possible senses, and picking
randomly one of the synonyms as the source word. For example, the word ‘disc’ is one of
the words in the Senseval lexical sample. For this target word the synonym ‘record’ was
picked, which matches ‘disc’ in its musical sense.
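A minimal sketch of this synonym-collection step is shown below (illustrative, using NLTK's WordNet interface; the helper name pick_source_synonym is our own, and in the thesis the candidate list was further filtered by a lexicographer before the random pick, as described next).

```python
# Illustrative sketch: collect candidate source synonyms for a target word
# from all its WordNet senses, then pick one at random (in the thesis the
# list was first manually filtered by a lexicographer).
import random
from nltk.corpus import wordnet as wn

def pick_source_synonym(target_word):
    candidates = set()
    for synset in wn.synsets(target_word):
        for lemma in synset.lemma_names():
            if lemma != target_word:
                candidates.add(lemma)
    return random.choice(sorted(candidates))

# e.g., for the target 'disc', one candidate is 'record' (its musical sense).
print(pick_source_synonym('disc'))
```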
While creating source-target synonym pairs it was evident that many WordNet
synonyms corresponded to very infrequent senses or word usages, such as the WordNet
synonyms germ and source. Such source synonyms are useless for evaluating sense
matching with the target word since the senses of the two words would rarely match in
perceivable contexts. In fact, considering our motivation for lexical substitution, it is
usually desired to exclude such obscure synonym pairs from substitution lexicons in
practical applications, since they would mostly introduce noise to the system. To avoid
this problem, the list of WordNet synonyms for each target word was filtered by a lexicographer, who manually excluded obscure synonyms that seemed worthless in practice. The lexicographer was also instructed to exclude pairs where the target word
had a more general meaning than the source word.
The source synonym for each target word was then picked randomly from the filtered list. Table 1 shows the 25 source-target pairs created
for our experiments.
Source word    | Target word  | WordNet Sense id
statement      | argument     | argument%1:10:02
level          | degree       | degree%1:07:00::, degree%1:26:01::
raging         | hot          | hot%3:00:00:violent:00
opinion        | judgment     | judgment%1:10:00::
execution      | performance  | performance%1:04:00::
subdivision    | arm          | arm%1:14:00::
deviation      | difference   | difference%1:11:00::
ikon           | image        | image%1:06:00::
arrangement    | organization | organization%1:09:00::
design         | plan         | plan%1:09:01::
atm            | atmosphere   | atmosphere%1:23:00::
dissimilar     | different    | different%3:00:02::
crucial        | important    | important%3:00:02::
newspaper      | paper        | paper%1:06:00::, paper%1:10:03::, paper%1:14:00::
protection     | shelter      | shelter%1:26:00::
hearing        | audience     | audience%1:26:00::
trouble        | difficulty   | difficulty%1:04:00::
sake           | interest     | interest%1:07:01::
company        | party        | party%1:14:02::
variety        | sort         | sort%1:09:00::
camber         | bank         | bank%1:17:02::
record         | disc         | disc%1:06:01::
bare           | simple       | simple%3:00:02:plain:01
substantial    | solid        | solid%3:00:00:sound:01, solid%3:00:00:wholesome:00
root           | source       | source%1:15:00::

Table 1: Source and target pairs
In future work it may be possible to apply automatic methods for filtering infrequent
sense correspondences in the dataset, by adopting algorithms such as in (McCarthy et al., 2004).
Having source-target synonym pairs, a classification instance for the sense matching
task is created from each example occurrence of the target word in the Senseval dataset.
A classification instance is thus defined by a pair of source and target words and a given
occurrence of the target word in context. The instance should be classified as positive if
the sense of the target word in the given context matches one of the possible senses of the
source word, and as negative otherwise. Table 2 illustrates positive and negative example
instances for the source-target synonym pair ‘record-disc’, where only occurrences of
‘disc’ in the musical sense are considered positive.
Sentence | Annotation
This is anyway a stunning disc, thanks to the playing of the Moscow Virtuosi with Spivakov. | positive
He said computer networks would not be affected and copies of information should be made on floppy discs. | negative
Before the dead soldier was placed in the ditch his personal possessions were removed, leaving one disc on the body for identification purposes. | negative

Table 2: Positive and negative examples for the source-target synonym pair 'record-disc'
The gold standard annotation for the binary sense matching task can be derived
automatically from the Senseval annotations and the corresponding WordNet synsets. An
example occurrence of the target word is considered positive if the annotated synset for that example also includes the source word, and negative otherwise. Notice that different
positive examples might correspond to different senses of the source word. This happens
when the source and target share several senses, and hence they appear together in several
synsets (see Table 3). Finally, since in Senseval an example may be annotated with more
than one sense, it was considered positive if any of the annotated synsets for the target
word included the source word.
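The derivation rule can be sketched as follows (illustrative; it assumes the Senseval annotations for an example have already been mapped to WordNet synset objects).

```python
# Illustrative sketch of deriving the binary gold standard: an example
# occurrence of the target word is positive if any of its annotated
# synsets also contains the source word as a lemma.
def is_positive_example(annotated_synsets, source_word):
    """annotated_synsets: the WordNet synsets (e.g., NLTK synset objects)
    annotated for this example in Senseval."""
    return any(source_word in synset.lemma_names()
               for synset in annotated_synsets)
```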
Using this procedure we derived gold standard annotations for all the examples in the
Senseval-3 training section for our 25 target words. For the test set we took up to 40 test
examples for each target word (some words had fewer test examples), yielding 913 test
examples in total, out of which 239 were positive. This test set was used to evaluate the
sense matching methods described in the next section.
Sentence | WordNet sense | Annotation
It can be a very useful means of making a charitable gift towards the end of the tax year when your taxable income for the year can be estimated with some degree of precision | A position on a scale of intensity or amount or quality | positive
The length of time spent stretching depends on the sport you are training for and the degree of flexibility you wish to attain | A specific identifiable position in a continuum or series or especially in a process | positive

Table 3: Example instances for the source-target synonym pair 'level-degree', where two senses of the source word 'degree' are considered positive.
4. Investigated Methods
As explained in the introduction, the sense matching task may be addressed by two
general approaches. The traditional indirect approach would first disambiguate the target
word relative to a predefined set of senses, using standard WSD methods. Then, it would
check whether the selected sense matches the source word. In terms of WordNet synsets,
it would check whether the selected synset for the target word includes the source word
as well. On the other hand, a direct approach would address the binary sense matching
task directly, without selecting explicitly a concrete sense for the target word. In this
research we focus on investigating several direct methods for sense matching and
compare their performance relative to traditional indirect methods, under both supervised
and unsupervised settings.
Two different goals may be set for sense matching methods. The first goal is
classification, where the system needs to decide for each test example whether it is
positive or negative (i.e., whether the target word sense matches the source or not). The
second goal is ranking, where the system only needs to rank all test examples of a given
target word according to their likelihood of being positive, as measured by some
confidence score. From the perspective of the applied lexical substitution task, employing
the sense matching module as a classifier enables utilizing it to filter out
inappropriate contexts of the target word. On the other hand, scored ranking corresponds
to situations in which a hard classification decision is not expected from the sense
matching module, either because the final system output is a ranked list (as in IR and QA)
or because the sense matching score is being integrated with the scores of additional
system modules. As described below, we investigate alternative methods for both the
classification and ranking goals.
4.1 Feature set and classifier
As a vehicle for investigating different classification approaches we implemented a
“vanilla” state of the art architecture for WSD. Following common practice in feature
extraction (e.g. (Yarowsky, 1994)), and using the mxpost[7] part-of-speech tagger and WordNet's lemmatization, the following feature set was used: bag of word lemmas for the context words in the preceding, current and following sentences; unigrams of lemmas and parts of speech in a window of +/- three words, where each position provides a distinct feature [w-3, w-2, w-1, w+1, w+2, w+3]; and bigrams of lemmas in the same window [w-3-2, w-2-1, w-1+1, w+1+2, w+2+3].
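A minimal sketch of this feature extraction is given below (our own illustrative rendering; it assumes the text has already been lemmatized and POS-tagged, whereas the thesis used the mxpost tagger and WordNet's lemmatization).

```python
# Illustrative feature extraction for one target-word occurrence (assumes
# lemmas and POS tags are precomputed; the thesis used mxpost and
# WordNet's lemmatization).
def extract_features(lemmas, tags, i, context_lemmas):
    """i: index of the target word in its sentence; context_lemmas: lemmas
    of the preceding, current and following sentences."""
    feats = set()
    # Bag of word lemmas from the surrounding sentences.
    feats.update('bow=' + lem for lem in context_lemmas)
    # Positional unigrams of lemma and POS in a +/- 3 word window.
    for off in (-3, -2, -1, 1, 2, 3):
        j = i + off
        if 0 <= j < len(lemmas):
            feats.add('lem%+d=%s' % (off, lemmas[j]))
            feats.add('pos%+d=%s' % (off, tags[j]))
    # Positional bigrams of lemmas in the same window.
    for o1, o2 in ((-3, -2), (-2, -1), (-1, 1), (1, 2), (2, 3)):
        j1, j2 = i + o1, i + o2
        if 0 <= j1 < len(lemmas) and 0 <= j2 < len(lemmas):
            feats.add('big%+d%+d=%s_%s' % (o1, o2, lemmas[j1], lemmas[j2]))
    return feats
```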
The SVMLight (Joachims, 1999) classifier was used in the supervised settings with its
default parameters. To obtain a multi-class classifier we used a standard one-vs-all
approach of training a binary SVM for each possible sense and then selected the highest
scoring sense for a test example.
To verify that our implementation provides a reasonable replication of state of the art
WSD, we applied it to the standard Senseval-3 Lexical Sample WSD task. The obtained accuracy[8] was 66.7%, which compares reasonably with the mid-range of systems in the Senseval-3 benchmark (Mihalcea and Edmonds, 2004). This figure is just a few percent lower than that of the (quite complicated) best Senseval-3 system, which achieved about 73% accuracy, and it is much higher than the standard Senseval baselines. We thus regard our
classifier as a fair vehicle for comparing the alternative approaches for sense matching on equal grounds.

[7] ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz
[8] The standard classification accuracy measure equals precision and recall as defined in the Senseval terminology when the system classifies all examples, with no abstentions.
4.2 Supervised Methods
4.2.1 Indirect approach
The indirect approach for sense matching follows the traditional scheme of performing
WSD for lexical substitution. First, the WSD classifier described above was trained for
the target words of our dataset, using the Senseval-3 sense annotated training data for
these words. This was accomplished by training a binary SVM for each possible sense.
Each binary classifier was trained to identify one sense of the target: training examples were labeled as positive when the sense of the target word in the given context matched the specific classifier's sense, and the rest were labeled as negative.
Then, each classifier was applied to each test example of the target words, selecting the
most likely sense for each example by picking the sense of the binary classifier that
scored highest for the test example. Finally, an example was classified as positive if its
selected sense included the source word in its synset. Otherwise, the example was
classified as negative.
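This indirect decision procedure can be sketched as follows (illustrative; the classifier objects and their .score method are hypothetical stand-ins for the trained SVMLight models).

```python
# Illustrative sketch of the indirect supervised approach: select the
# highest-scoring sense via one-vs-all binary classifiers, then classify
# positively iff the selected synset contains the source word.
def indirect_sense_match(example_feats, sense_classifiers, source_word):
    """sense_classifiers: dict mapping each candidate synset to a trained
    binary scorer exposing a .score(features) method (hypothetical)."""
    best_synset = max(sense_classifiers,
                      key=lambda s: sense_classifiers[s].score(example_feats))
    # Positive iff the source word belongs to the selected synset.
    return source_word in best_synset.lemma_names()
```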
4.2.2 Direct approach
As explained above, the direct approach addresses the binary sense matching task
directly, without selecting explicitly a sense for the target word. In the supervised setting
it is easy to obtain such a binary classifier using the annotation scheme described in
Section 3. Under this scheme an example was annotated as positive (for the binary sense
matching task) if the source word is included in the Senseval gold standard synset of the
target word. We trained the classifier using the set of Senseval-3 training examples for
each target word, considering their derived binary annotations. Finally, the trained
classifier was applied to the test examples of the target words, yielding directly a binary
positive-negative classification. We note that the direct binary setting is suitable for
producing rankings as well, using the obtained SVM scores to rank all examples of each
target word. In addition, because this method is direct and applies a single classifier per target word, it allows for shorter running time during the training and test stages. In the indirect method, the training stage must train a binary classifier for each sense. Consequently, in the testing stage, each test example must be checked by all of the binary classifiers, and the running time increases with the number of senses of each target word. Some words have many senses, like the word "hot", which has twenty-one different senses.
4.3 Unsupervised Methods
It is well known that obtaining annotated training examples for WSD tasks is very
expensive, and is often considered infeasible in unrestricted domains. Therefore, many
researchers have investigated unsupervised methods, which do not require annotated
examples. Unsupervised approaches have usually been investigated within Senseval using
the "All Words" dataset, which does not include training examples. In this thesis we
preferred to use the same test set that was used for the supervised setting (created from
the Senseval-3 "Lexical Sample" dataset, as described above), in order to enable
comparison between the two settings. Naturally, in the unsupervised setting the sense
labels in the training set were not utilized.
4.3.1 Indirect approach
State-of-the-art unsupervised WSD systems are quite complex and not easy to replicate.
Thus, we implemented the unsupervised version of the Lesk algorithm (Lesk, 1986) as a
reference system, since it is considered a standard simple baseline for unsupervised
approaches. The Lesk algorithm is one of the first algorithms developed for semantic
disambiguation of all words in unrestricted text. In its original unsupervised version, the
only resource required by the algorithm is a machine-readable dictionary with one
definition for each possible word sense. The algorithm looks for words in the sense
definitions that overlap with context words in the given sentence, and chooses the sense
that yields maximal word overlap. It is based on the intuition that words that co-occur in
a sentence are used to refer to the same topic, and that topically related senses of words
are defined in a dictionary using the same words. We used an implementation of this
algorithm created by our Italian colleague, Carlo Strapparava, from ITC-Irst, which uses
WordNet sense definitions with a context length of ±10 words before and after the target
word.
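For illustration, a bare-bones version of the Lesk overlap criterion can be sketched as
follows, using WordNet glosses via NLTK as a stand-in for the ITC-Irst implementation:

    from nltk.corpus import wordnet as wn

    def lesk_choose(target_word, context_words):
        # Choose the sense whose dictionary definition shares the most
        # words with the +/-10-word context around the target occurrence.
        context = set(context_words)
        best_sense, best_overlap = None, -1
        for synset in wn.synsets(target_word):
            overlap = len(set(synset.definition().split()) & context)
            if overlap > best_overlap:
                best_sense, best_overlap = synset, overlap
        return best_sense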
4.3.2 Direct approaches
It has been well recognized that it is very difficult, and methodologically problematic, to
determine the "right" set of pre-defined senses for WSD. Hence, the direct sense
matching approach may be particularly attractive precisely because it does not assume
any reference to a pre-defined set of senses. However, existing unsupervised algorithms
for the classical WSD task do rely on pre-defined sense repositories, and sometimes also
on dictionary definitions for these senses (as in the Lesk algorithm). For this reason,
standard unsupervised WSD techniques cannot be applied to direct sense matching, in
which the only external information assumed is a substitution lexicon.
The assumption underlying our proposed methods is that if a target word occurrence
has a sense which matches the source word then the context of that occurrence should be
valid for the source word as well. Unlabeled occurrences of the source word can then be
used to learn a model of its typical valid contexts. Next, we can match this model against
test examples of the target word and evaluate whether the given target contexts are valid
for the source word or not, providing a decision criterion for sense matching.
Note that in this proposed approach only positive examples are given, in the form of
unlabeled occurrences of the source word. Learning from positive examples only (also
called one-class learning) is known to be much more difficult than standard supervised
learning for the same task. Yet, this setting arises in many practical situations and is often
the only unsupervised solution available.
4.3.2.1 Direct approach: one-class SVM
Our first unsupervised method utilizes the one-class SVM learning algorithm (Schölkopf
et al., 2001), and was implemented using the LIBSVM package[9] by our Italian
colleague, Alfio Gliozzo, from ITC-Irst. The training examples consist of a given sample
of unlabeled occurrences of the source word, represented by the same feature set of
Subsection 4.1. We used training examples taken from the BNC[10] (British National
Corpus). This created compatibility between the training data and the test data, because
the BNC is one of the sources of Senseval, which we used as a source for the test data.
This compatibility added to the chances of successful learning, since the training data and
the test data had more topics in common.

[9] Freely available from http://www.csie.ntu.edu.tw/~cjlin/libsvm
[10] The BNC (British National Corpus) is a 100 million word collection of samples of
written and spoken language from a wide range of sources. http://www.natcorp.ox.ac.uk
Roughly speaking, a one-class SVM estimates the smallest hypersphere enclosing
most of the training data. New test instances are then classified positively if they lie
inside the sphere, while outliers are regarded as negatives. The ratio between the width of
the enclosed region and the number of misclassified training examples can be varied by
setting the parameter ν ∈ (0, 1). Smaller values of ν will produce larger positive regions,
yielding increased recall. We note that we could utilize the LIBSVM one-class package
only for classification but not for ranking, since it provides just a binary classification
decision rather than a classification score.
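For illustration, this setup can be sketched with a current libsvm wrapper (scikit-learn's
OneClassSVM is shown here as a stand-in; our experiments used the LIBSVM package
directly):

    from sklearn.svm import OneClassSVM

    def train_source_model(X_source, nu=0.5):
        # Fit the smallest region enclosing most of the unlabeled
        # occurrences of the source word; nu in (0,1) trades region
        # width against training errors (smaller nu -> higher recall).
        return OneClassSVM(nu=nu).fit(X_source)

    def sense_matches(model, x_target):
        # predict() returns +1 inside the learned region (sense match)
        # and -1 for outliers (no match).
        return model.predict([x_target])[0] == 1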
Experiments with the one-class SVM (see Section 5) revealed two problems. First,
there is no obvious way to tune the optimal value of the ν parameter in an unsupervised
setting, in which no labeled examples are given. Furthermore, different ν values were
found optimal, in retrospect, for different words. Such optimization of classification
performance is an inherent problem for the unsupervised one-class setting, unlike the
standard supervised setting, in which both positive and negative examples are utilized to
optimize models uniformly during training. Second, when the source word is ambiguous,
only one (or a few) of its senses can be substituted with the target word. However, our
one-class algorithm was trained on all examples of the source word, which include
examples of irrelevant senses of the source word, yielding noisy training sets.
For an example, see Table 4.
Sentence                            Sense            Appropriate/noisy
What level is the office on?        floor            noisy
A high level of care is required    degree, amount   appropriate

Table 4: A noisy training example and an appropriate training example for the source
word 'level' and the target word 'degree'.
4.3.2.2 Direct approach: kNN-based ranking
Consequently, we also developed an unsupervised ranking method based on the
k-nearest-neighbors (kNN) principle. To avoid the first problem, that of optimizing
classification performance, we decided to focus at this stage on the ranking goal. That is,
we aim to score all test examples of a target word such that the positive ones will tend to
appear at the top of the ranked list, but without identifying an optimal classification
boundary. (This method was also evaluated by the classification measure, so that we
would be able to compare the two unsupervised direct methods, one-class and kNN.)
The second problem, source word ambiguity, is addressed by the choice of a kNN
approach. In this approach the score of a test example is determined only by the most
relevant subset of source word occurrences, which are likely to correspond to the relevant
sense of the source word. More concretely, we store in memory all training examples of
the source word, represented by our standard feature set. The score of a test example of
the target word is computed as the average similarity between the test example and the k
most similar training examples. Finally, all test examples of the target word are ranked by
these scores.
The rationale behind this method is that if the sense of the target test example matches
the source, then there are likely to be k occurrences of the corresponding source sense
that are relatively similar to this target example. On the other hand, if the target example
has a sense that does not match the source word, then it is likely to have lower similarity
with the source examples.
The disadvantage of this algorithm is that it uses all the training data at test time,
which makes it memory- and time-expensive, since similarities must be calculated
between the test example and all training examples. We tried to improve the algorithm in
these two respects by building an index over the training data of every source word. The
pseudocode of the improved algorithm appears in the following figure, where the
numbers in brackets below refer to the code lines. The index is implemented by a hash
table where the key is a feature number and the value is the list of indices of the sentences
that include that feature (1). After building the index, the similarities are calculated (2-5).
The index saves us the need to calculate similarities against the entire training set.
Instead, we calculate similarities only for the sentences that share at least one feature with
the test sentence: the algorithm loops over the features of the test sentence (3), and for
every feature, calculates the similarity between the test sentence and the sentences that
were hashed in this entry (4-5). This way we save both time and memory, since we do not
need to load all the training data into the cache.
1  build index I for the training data set
2  for each example X in the test data do
3      for each feature xi in example X
4          for each training example Dj in index entry I[xi]
5              calculate sim(X, Dj)
6      find the K largest scores sim(X, Dj)
7      calculate sim_avg over the K nearest neighbors
8      return sim_avg

Figure 1: Pseudocode for our kNN classifier algorithm
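A runnable Python version of this indexed scorer might look as follows, under the
illustrative assumption that examples are represented as dictionaries mapping feature ids
to weights and that cosine similarity is used:

    import math
    from collections import defaultdict

    def build_index(train_examples):
        # Line 1: inverted index from feature id to the ids of the
        # training sentences containing that feature.
        index = defaultdict(list)
        for j, example in enumerate(train_examples):
            for feature in example:
                index[feature].append(j)
        return index

    def cosine(x, d):
        dot = sum(x[f] * d[f] for f in set(x) & set(d))
        norm = (math.sqrt(sum(v * v for v in x.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0

    def knn_score(x, train_examples, index, k=10):
        # Lines 2-5: compute similarities only for training sentences
        # sharing at least one feature with the test example.
        candidates = {j for f in x for j in index[f]}
        sims = sorted((cosine(x, train_examples[j]) for j in candidates),
                      reverse=True)
        # Lines 6-8: average the k largest similarities.
        top = sims[:k]
        return sum(top) / len(top) if top else 0.0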
5. Evaluation
5.1 Evaluation measures
As described in Section 4, two different goals may be set for sense matching methods:
classification and ranking. To obtain a realistic and comprehensive evaluation of the
methods, we used two evaluation measures, one for each goal.
5.1.1 Classification measure
For binary sense matching, and the corresponding lexical substitution setting, the
standard WSD metrics (Mihalcea and Edmonds, 2004) are less suitable, because we are
interested in the binary decision of whether the target word matches the sense of a given
source word. For this reason we adopted an Information Retrieval evaluation scheme, in
which Precision, Recall and F1 are estimated as follows:
$$\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}}$$

$$\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
In the following section we report micro-averaged results for these measures on our test
set.
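For concreteness, micro-averaging pools the classification counts over all target words
before computing the measures, as in this Python sketch (assuming all denominators are
non-zero):

    def micro_averaged_prf(counts):
        # counts: list of (true_pos, false_pos, false_neg) tuples,
        # one per target word.
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1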
5.1.2 Ranking measure
This measure is very popular in Information Retrieval (IR) and Question Answering
(QA) systems, for which the lexical expansion setting is targeted. It quantifies the
system's ability to rank examples for a given source word, preferring a ranking that
places correct examples before negative ones. A perfect ranking would place all the
positive examples before all the negative examples. Average precision is a common
evaluation measure for system rankings, and is computed as the average of the system's
precision values at all points in the ranked list where recall increases (Voorhees and
Harman, 1999). In our case, the points where recall increases correspond to positive test
examples. More formally, it can be written as follows:

$$\mathrm{AP} = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\text{correct up to rank } i}{i}$$

where n is the number of examples in the test set, R is the total number of positive
examples in the test set, E(i) is 1 if the i-th example is positive and 0 otherwise, and i
ranges over the examples, ordered from the highest-ranked down.
This average precision calculation outputs a value in the 0-1 range, where 1 corresponds
to a perfect ranking. This value corresponds to the area under the non-interpolated
recall-precision curve for the target word. Mean Average Precision (MAP) is defined as
the mean of the average precision values for all test words.
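The computation can be sketched as follows; average_precision takes the binary labels of
a word's test examples ordered from the highest-ranked down, and MAP averages over
the test words:

    def average_precision(ranked_labels):
        # ranked_labels: 1 for a positive example, 0 for a negative,
        # ordered by the system's ranking from the top down.
        positives_seen, ap_sum = 0, 0.0
        for i, label in enumerate(ranked_labels, start=1):
            if label == 1:
                positives_seen += 1
                ap_sum += positives_seen / i  # precision at this recall point
        return ap_sum / positives_seen if positives_seen else 0.0

    def mean_average_precision(rankings_per_word):
        aps = [average_precision(r) for r in rankings_per_word]
        return sum(aps) / len(aps)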
5.2 Classification measure results
5.2.1 Baselines
Following the Senseval methodology, we evaluated two different baselines, for the
unsupervised and supervised methods respectively. The random baseline, used for the
unsupervised algorithms, was obtained by choosing either the positive or the negative
class at random, resulting in P = 0.262, R = 0.5, F1 = 0.344. The Most Frequent baseline,
used for the supervised algorithms, is obtained by assigning the positive class when the
percentage of positive examples in the training set is above 50%, resulting in P = 0.65,
R = 0.41, F1 = 0.51.
Supervised method          Approach   P     R     F1
Most Frequent Baseline     -          0.65  0.41  0.51
Multiclass SVM             Indirect   0.59  0.63  0.61
Binary SVM (J = 0.5)       Direct     0.80  0.26  0.39
Binary SVM (J = 1)         Direct     0.76  0.46  0.57
Binary SVM (J = 2)         Direct     0.68  0.53  0.60
Binary SVM (J = 3)         Direct     0.69  0.55  0.61

Table 5A: Supervised methods

Unsupervised method        Approach   P     R     F1
Random Baseline            -          0.26  0.50  0.34
Lesk                       Indirect   0.24  0.19  0.21
One-Class (ν = 0.3)        Direct     0.26  0.72  0.39
One-Class (ν = 0.5)        Direct     0.29  0.56  0.38
One-Class (ν = 0.7)        Direct     0.28  0.36  0.32
One-Class (ν = 0.9)        Direct     0.23  0.10  0.14

Table 5B: Unsupervised methods

Table 5: Classification results on the sense matching task
5.2.2 Supervised Methods
Both the indirect and the direct supervised methods presented in Subsection 4.2 have
been tested and compared to the most frequent baseline.
Indirect. For the indirect methodology we trained the supervised WSD system for each
target word on the sense-tagged training sample. As described in Subsection 4.2, we
implemented a simple SVM-based WSD system and applied it to the sense-matching
task. Results are reported in Table 5A. The indirect strategy surpasses the most frequent
baseline's F1 score, but the achieved precision is still below it. We note that in this
multi-class setting it is less straightforward to trade off recall for precision, as all senses
compete with each other.
Direct. In the direct supervised setting, sense matching is performed by training a binary
classifier, as described in Subsection 4.2.
The advantage of adopting a binary classification strategy is that the precision/recall
tradeoff can be tuned in a meaningful way. In SVM learning, such tuning is achieved by
varying the parameter J, which modifies the cost function of the SVM learning algorithm.
If J = 1 (default), the weight of errors on positive examples is equal to the weight of
errors on negatives. When J > 1, errors on positive examples are penalized more heavily
(increasing recall), while whenever 0 < J < 1, errors on negative examples are penalized
relatively more (increasing precision). Results obtained by varying this parameter are
reported in Figure 2.
Figure 2: Direct supervised results varying J
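In current SVM toolkits the same tradeoff can be reproduced through per-class error
costs; the following sketch uses scikit-learn's class_weight option as an illustrative
analogue of the SVMlight-style J parameter:

    from sklearn.svm import LinearSVC

    def train_with_j(X, y, J=1.0):
        # J > 1 makes errors on positive examples costlier, pushing the
        # classifier towards higher recall; 0 < J < 1 favors precision.
        return LinearSVC(class_weight={1: J, 0: 1.0}).fit(X, y)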
Adopting the standard parameter settings (i.e. J = 1; see Table 5A), the F1 of the system
is slightly lower than for the indirect approach, while it reaches the indirect figures as J
increases. More importantly, reducing J allows us to boost precision towards 100%.
This feature is of great interest for lexical substitution, particularly in precision-oriented
applications such as IR and QA, for filtering irrelevant candidate answers or documents.
5.2.3 Unsupervised Methods
Indirect. To evaluate the indirect unsupervised setting we implemented the Lesk
algorithm, described in Subsection 4.3.1, and evaluated it on the sense matching task. The
obtained figures, reported in Table 5B, are clearly below the baseline, suggesting that
simple unsupervised indirect strategies cannot be used for this task. In fact, the error of
the first step, due to the low WSD accuracy of the unsupervised technique, is propagated
into the second step, producing poor sense matching.
Unfortunately, state-of-the-art unsupervised systems are actually not much better than
Lesk on the all-words task (Mihalcea and Edmonds, 2004), discouraging the use of
unsupervised indirect methods for the sense matching task.
Direct. Conceptually, the most appealing solution for the sense matching task is the
one-class approach proposed for the direct method (Section 4.3.2). As stated in Section 4,
in order to perform our experiments we trained a different one-class SVM for each source
word, using a sample of its unlabeled occurrences in the BNC as the training set. To
avoid huge training sets and to speed up the learning process, we fixed the maximum
number of training examples at 10,000 occurrences per word, collecting on average about
6,500 occurrences per word. For each target word in the test sample, we applied the
classifier of the corresponding source word. Results for different values of ν are reported
in Figure 3 and summarized in Table 5B.
Figure 3: One-class evaluation varying ν
While the results are somewhat above the baseline, only small improvements in
precision are obtained, and recall is higher than the baseline for ν < 0.6. Such small
improvements may suggest that we are following a relevant direction, even though they
may not yet be useful in an applied sense-matching setting.
Further analysis of the classification results for each word revealed that optimal F1
values are obtained by adopting different values of ν for different words. In the optimal
(in retrospect) parameter settings for each word, performance on the test set is noticeably
boosted, achieving P = 0.40, R = 0.85 and F1 = 0.54. Finding a principled unsupervised
way to automatically tune the ν parameter is thus a promising direction for future work.
Investigating the results per word further, we found that the correlation coefficient
between the optimal ν values and the degree of polysemy of the corresponding source
words is 0.35. More interestingly, we noticed a negative correlation (r = -0.30) between
the achieved F1 and the degree of polysemy of the word, suggesting that polysemous
source words provide poor training models for sense matching. This can be explained by
observing that polysemous source words can be substituted with the target words only for
a strict subset of their senses. On the other hand, our one-class algorithm was trained on
all the examples of the source word, which include irrelevant examples that yield noisy
training sets. A possible solution may be obtained using clustering-based word sense
discrimination methods (Pedersen and Bruce, 1997; Schütze, 1998), in order to train
different one-class models from different sense clusters. Overall, the analysis suggests
that it may be possible in the future to obtain better binary classifiers based on unlabeled
examples of the source word.
As the unsupervised direct approach is the most appealing approach for sense
matching, and we have presented two algorithms that implement it, we would like to
compare their results. For this purpose we evaluated the classification measure for the
kNN algorithm as well (although this algorithm will be examined mostly by the ranking
measure). Since the kNN algorithm ranks the test sentences, we need to set a threshold
separating the negative and positive results in order to compute the classification results.
Figure 4 shows the Precision, Recall and F1 values for various threshold values, with the
cosine similarity metric and k = 10.
One can see quite clearly that the kNN yields somewhat better results than the one-class
SVM algorithm: the optimal F1 achieved by the kNN is 0.42, with a threshold of 0.1,
compared to the optimal F1 of the one-class SVM, which is 0.39.
Figure 4: Precision, Recall and F1 of kNN with the cosine metric and k = 10, for various
thresholds
5.3 Ranking measure results
Table 6 summarizes the MAP (Mean Average Precision) results for the supervised direct
approach and for the kNN-based unsupervised approach, along with a baseline of
randomized ranking averaged over 10 runs. The results indicate that the ranking produced
by the kNN method (k = 10) outperforms random ranking, while still being substantially
lower than supervised performance.
Method                 MAP
Random                 0.36
kNN (Cosine, k = 10)   0.40
Binary SVM (J = 2)     0.60

Table 6: Mean Average Precision results
Figure 5 provides a closer look at the ranking behavior, plotting the macro-averaged
recall-precision curves for each method.

Figure 5: Macro-averaged recall-precision curves
The figure indicates that the kNN-based ranking is better than randomized ranking up to
the 80% recall point. In particular, in the important high-precision range of up to about
25% recall, the kNN-based method is better than random by 8-18%. That is, kNN does
succeed in giving the highest ranks to positive examples substantially better than random.
To the best of our knowledge, this is the first time that such a positive result has been
obtained by a method that does not consider any externally provided information at all,
be it in the form of labeled examples, a sense repository or sense definitions. We
hypothesize that this result can be further improved through better assessment of the
similarity between the target test example and the source training data.
When implementing the kNN algorithm, we tried three similarity measures, Cosine,
Jaccard and Lin, and different values of k: 10, 50 and 100. There was no significant
difference between the results of these attempts, but we still find it valuable to show them
in Figures 6 and 7: Figure 6 shows the results of kNN with the Cosine metric for different
k values, and Figure 7 shows the results of kNN with different metrics with k = 100.
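For reference, the Jaccard measure over the sets of active features can be sketched as
follows (cosine was sketched after Figure 1 above; the Lin measure, which further
weights features by their information content, is omitted for brevity):

    def jaccard_sim(x, y):
        # x, y: dicts mapping feature id to weight; Jaccard compares
        # the sets of features that are active in each example.
        fx, fy = set(x), set(y)
        return len(fx & fy) / len(fx | fy) if (fx | fy) else 0.0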
Figure 6: Results of kNN with the Cosine metric for different values of k (10, 50 and 100)
Figure 7: Results of kNN with different similarity metrics (Cosine, Jaccard and Lin) with
k = 100
6. Conclusion and future work
This thesis defined and investigated the novel sense matching task, which directly
captures the polysemy problem in lexical substitution. We proposed direct approaches for
the task, suggesting the advantages of controlling the precision-recall tradeoff while
avoiding the need for an explicitly defined sense repository. Furthermore, we proposed
novel types of completely unsupervised learning schemes.
To obtain a realistic and comprehensive evaluation of the methods, we used two
evaluation measures, classification and ranking, which correspond to two different goals
for the sense matching task. On both measures the methods yielded better results than the
baselines. In particular, positive results on both measures were obtained by the kNN
method, which does not require any form of external information. Given these
encouraging results, we believe there is great potential for such approaches, to be
explored in future research.
We note again that the algorithms we suggested were aimed at handling one case of
source-target mismatch, the first case mentioned in the introduction, where the target
word has the wrong sense in a given context. The same algorithms could be used with the
roles of source and target switched, to handle the second case of mismatch, where the
target word was selected according to a wrong sense of the source word.
We focused on the direct unsupervised approach as our goal. Possible future
improvements include, for example, adding weights to the features, or creating negative
examples in the training data by using the target words as negative examples while the
source words themselves provide the positive ones. This idea needs further research,
since it induces much noise that would have to be handled. Additionally, ideas for other
methods came up during the research, such as automatic clustering of word instances by
context, i.e. what Schütze (1998) termed sense discrimination: two words would be
considered to be used in the same sense if they fall within the same cluster. We hope that
the idea initiated in this research will lead to further work in this area, and to valuable
progress on the task of lexical substitution.
7. References
B. E. Boser, I. M. Guyon, and V. N. Vapnik. 1992. A training algorithm for optimal
margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop
on COLT, pages 144-152, Pittsburgh, PA.
Caraballo, Sharon A. 1999. Automatic Acquisition of a Hypernym-Labeled Noun
Hierarchy from Text. In 37th Annual Meeting of the Association for Computational
Linguistics: Proceedings of the Conference, pages 120-126.
Chaves R. P. 2001. WordNet and Automated Text Summarization. In Proceedings of the
6th Natural Language Processing Pacific Rim Symposium (NLPRS-01). Tokyo, Japan.
Christopher J. C. Burges. "A Tutorial on Support Vector Machines for Pattern
Recognition". Data Mining and Knowledge Discovery 2:121 - 167, 1998
Ido Dagan. 2000. Contextual Word Similarity. In Rob Dale, Hermann Moisl and Harold
Somers, editors, Handbook of Natural Language Processing, Marcel Dekker Inc.,
Chapter 19, pages 459-476.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising
Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on
Recognising Textual Entailment.

Ido Dagan, Oren Glickman, Alfio Gliozzo, Efrat Marmorshtein and Carlo Strapparava.
2006. Direct Word Sense Matching for Lexical Substitution. In Proceedings of
COLING-ACL 2006.

Ido Dagan, Shaul Marcus and Shaul Markovitch. 1995. Contextual word similarity and
estimation from sparse data. Computer, Speech and Language, 9:123-152.

Belur V. Dasarathy. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification
Techniques. IEEE Computer Society Press.

Phillip William Dixon, David Corne and Martin J. Oates. 2003. Replacing Generality
with Coverage for Improved Learning Classifier Systems. In HIS, pages 185-193.
C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. 1998. Indexing with WordNet synsets
can improve text retrieval. In ACL, Montreal, Canada.

S. Flank. 1998. A layered approach to NLP-based Information Retrieval. In Proceedings
of the ACL/COLING Conference, Montreal, Canada.
Gasperin, Caroline and Renata Vieira. 2004. Using Word Similarity Lists for Resolving
Indirect Anaphora. In Proc. of ACL-04 Workshop on Reference Resolution. Barcelona,
Spain, July, 2004
Gauch, Susan, J. Wang, S. Mahesh Rachakonda. 1999. A Corpus Analysis Approach for
Automatic Query Expansion and its Extension to Multiple Databases. ACM Transactions
on Information Systems, volume 17(3), pp. 250-250, 1999.
Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer
Academic Publishers.
Harabagiu, Sanda M., Dan I. Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu,
Razvan C. Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. Falcon:
Boosting knowledge for answer engines. In Text REtrieval Conference.
Hovy, Eduard H., Ulf Hermjakob, and Chin-Yew Lin. 2001. The use of external
knowledge of factoid QA. In Text Retrieval Conference.
T. Joachims. 1999. Making large-scale SVM learning practical. In B. Sch¨olkopf, C.
Burges, and A. Smola, editors, Advances in kernel methods: support vector learning,
chapter 11, pages 169 – 184. MIT Press.
Lee, Lillian. 1997. Similarity-Based Approaches to Natural Language Processing. Ph.D.
thesis, Harvard University, Cambridge, MA.
M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries:
How to tell a pine cone from an ice cream cone. In Proceedings of the ACM-SIGDOC
Conference, Toronto, Canada.
Li, L. et al., 2001. Gene selection for sample classification based on gene expression data:
Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17,
1131–1142.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of
the 17th international conference on Computational linguistics, pages 768–774,
Morristown, NJ, USA. Association for Computational Linguistics.
Lin, Dekang. 1998a. Automatic Retrieval and Clustering of Similar Words. In Proc. of
COLING–ACL98, Montreal, Canada, August, 1998.
Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Automatic
identification of infrequent word senses. In Proceedings of COLING, pages 1220-1226.

Diana McCarthy. 2002. Lexical substitution as a task for WSD evaluation. In
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation, pages 109-115,
Morristown, NJ, USA. Association for Computational Linguistics.
R. Mihalcea and P. Edmonds, editors. 2004. Proceedings of SENSEVAL-3: Third
International Workshop on the Evaluation of Systems for the Semantic Analysis of Text,
Barcelona, Spain, July.
Mihalcea R. and D. Moldovan. 2000. Semantic Indexing using WordNet Senses. In
Proceedings of ACL Workshop on IR and NLP.
D. Moldovan and R. Mihalcea. 2000. Using wordnet and lexical operators to improve
internet searches. IEEE Internet Computing, 4(1):34–43, January.
M. Negri. 2004. Sense-based blind relevance feedback for question answering. In
SIGIR-2004 Workshop on Information Retrieval for Question Answering (IR4QA),
Sheffield, UK, July.
Patrick Pantel and Deepak Ravichandran. 2004. Automatically Labeling Semantic
Classes. In Proceedings of Human Language Technology / North American chapter of the
Association for Computational Linguistics (HLT/NAACL-04). pp. 321-328. Boston, MA.
T. Pedersen and R. Bruce. 1997. Distinguishing word senses in untagged text. In
EMNLP, Providence, August.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In SIGIR,
Dublin, Ireland, June.
Ruge, Gerda. 1992. Experiments on linguistically-based term associations. Information
Processing & Management, 28(3), pp. 317–332.
Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval.
McGraw Hill.
Scott, S. and S. Matwin. 1998. Text classification using WordNet hypernyms. In
Proceedings of the COLING / ACL Workshop on Usage of WordNet in Natural
Language Processing Systems. Montreal, Canada.
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001.
Estimating the support of a high-dimensional distribution. Neural Computation,
13:1443-1471.

G. Shakhnarovich, T. Darrell, and P. Indyk, editors. 2005. Nearest-Neighbor Methods in
Learning and Vision. The MIT Press.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics,
24(1):97-124.

H. Schütze and J. Pedersen. 1995. Information retrieval based on word senses. In
Proceedings of the 4th Annual Symposium on Document Analysis and Information
Retrieval, Las Vegas.
E. Voorhees and D. Harman, editors. 1999. Proceedings of the Seventh Text REtrieval
Conference (TREC-7), Gaithersburg, MD, USA, July. NIST Special Publication.
E. Voorhees. 1993. Using WordNet to disambiguate word sense for text retrieval. In
SIGIR, Pittsburgh, PA.
E. Voorhees. 1994. Query expansion using lexical semantic relations. In Proceedings of
the 17th ACM SIGIR Conference, Dublin, Ireland, June.
Weeds, Julie, D. Weir, and D. McCarthy. 2004. Characterizing Measures of Lexical
Distributional Similarity. In Proc. of Coling 2004. Switzerland, July, 2004.
Y. Yang and J. O. Pedersen. 1997. A comparative study on feature selection in text
categorization. In International Conference on Machine Learning (ICML).

D. Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent
restoration in Spanish and French. In ACL, pages 88-95, Las Cruces, New Mexico.