Bar Ilan University
The Department of Computer Science

Direct Word Sense Matching for Lexical Substitution

by Efrat Hershkovitz-Marmorshtein

Submitted in partial fulfillment of the requirements for the Master's Degree in the Department of Computer Science, Bar Ilan University.

Ramat Gan, Israel                                   October 2006, Tishrei 5767

This work was carried out under the supervision of Dr. Ido Dagan, Department of Computer Science, Bar-Ilan University.

Acknowledgements

This thesis has been accomplished with the help of a number of people, and I wish to express my heartfelt thanks to them.

I am grateful to Dr. Ido Dagan of Bar-Ilan University for his supervision of the thesis and for his tutoring on how to observe, analyze and scientifically formalize. It has been a great pleasure to work with him on this subject and I have learned a lot.

I would like to thank Oren Glickman of Bar-Ilan University for his guidance at the beginning of the way, and for his advice, both professional and technical, throughout the work.

I would also like to thank our Italian colleagues, Alfio Gliozzo and Carlo Strapparava from ITC-Irst, with whom I enjoyed a fruitful collaboration. I wish to thank them for their assistance in implementing some of the methods, and for their helpful suggestions based on their professional experience.

I would like to thank Hami Margliyot for her help in professional matters and in wording the paper.

I would like to thank our NLP group for their mutual support and helpful comments.

Finally, I would like to thank my family, my husband Yair and our daughter Hadar Tova, for their understanding, support and encouragement.

Table of Contents

Table of Contents ................................................................................................................ 4
List of tables and figures ..................................................................................................... 5
Abstract ............................................................................................................................... 6
1. Introduction ..................................................................................................................... 9
2. Background ................................................................................................................... 13
2.1 Word Senses and Lexical Ambiguity ..................................................................... 13
2.1.1 WordNet – a word sense database ....................................................................... 13
2.1.2 Lexical ambiguity and Senseval .......................................................................... 14
2.2 WSD and Lexical Expansion .................................................................................. 15
2.3 Textual Entailment .................................................................................................. 17
2.4 Classification Algorithms ....................................................................................... 19
2.4.1 Binary classification using SVM (Support Vector Machine) .......................... 20
2.4.1.1 One-class SVM ......................................................................................... 22
2.4.2 The kNN (k-nearest neighbor) classification algorithm .................................. 23
3. Problem Setting and Dataset ......................................................................................... 27
4. Investigated Methods .................................................................................................... 32
4.1 Feature set and classifier ......................................................................................... 33
4.2 Supervised Methods ................................................................................................ 34
4.2.1 Indirect approach ............................................................................................. 34
4.2.2 Direct approach ................................................................................................ 34
4.3 Unsupervised Methods ........................................................................................... 35
4.3.1 Indirect approach ............................................................................................. 36
4.3.2 Direct approaches ............................................................................................. 36
4.3.2.1 Direct approach: one-class SVM .............................................................. 37
4.3.2.2 Direct approach: kNN-based ranking ........................................................ 39
5. Evaluation ..................................................................................................................... 42
5.1 Evaluation measures ............................................................................................... 42
5.1.1 Classification measure ..................................................................................... 42
5.1.2 Ranking measure .............................................................................................. 42
5.2 Classification measure results ................................................................................. 43
5.2.1 Baselines .......................................................................................................... 43
5.2.2 Supervised methods ......................................................................................... 44
5.2.3 Unsupervised methods ..................................................................................... 46
5.3 Ranking measure results ......................................................................................... 49
6. Conclusion and future work .......................................................................................... 53
7. References ..................................................................................................................... 55
List of tables and figures

Tables List
Table 1 - Source and target pairs .....................................................................................29
Table 2 - Positive and negative examples for the source-target synonym pair 'record-disc' ......30
Table 3 - Example instances for the source-target synonym pair 'level-degree', where two senses of the target word 'degree' are considered positive ......31
Table 4 - A noisy training example and an appropriate training example for the source word 'level' and the target word 'degree' ......39
Table 5A - Classification results on the sense matching task - supervised methods ......44
Table 5B - Classification results on the sense matching task - unsupervised methods ......44
Table 6 - Mean Average Precision ......50

Figures List
Figure 1 - Pseudo code for our kNN classifier algorithm ......41
Figure 2 - Direct supervised results varying J ......41
Figure 3 - One-class evaluation varying ν ......45
Figure 4 - Precision, Recall and F1 of kNN with cosine metric and k=10, for various thresholds ......47
Figure 5 - Macro-averaged recall-precision curves ......50
Figure 6 - Results of kNN with different values of k ......51
Figure 7 - Results of kNN with different similarity metrics ......52

Abstract

This thesis investigates, conceptually and empirically, the novel sense matching task of recognizing whether the senses of two synonymous words match in context. Sense matching enables substituting a word by its synonym, an operation called lexical substitution, which is commonly used for increasing recall in information seeking applications such as Information Retrieval (IR) and Question Answering (QA). For example, there are contexts in which the given source word 'design' (which might be part of a search query) may be substituted by the target word 'plan'; however, one should recognize that 'plan' has a different sense than 'design' in sentences such as "they discussed plans for a new bond issue", while in the sentence "The construction plans were drawn" the sense of the target word 'plan' does match the meaning of the source word 'design'.

This thesis addresses the task of verifying that the senses of two given words do indeed match in a given context; in other words, recognizing texts in which the specified source word might be substituted with a synonymous target word. Such improved recognition of sense matching would improve the eventual precision of applications, which typically decreases somewhat when lexical substitution is applied.

To perform lexical substitution, a source of synonymous words is required. One of the most common sources, which was also used in our work, is WordNet (Fellbaum, 1998). Given a pair of synonymous words, the binary classification task of sense matching may be addressed by various methods, which are categorized by two basic characteristics. The first is whether the sense matching is direct or indirect. In the indirect approach, the senses of the source word and the target word are explicitly identified relative to predefined lists of word senses, a process called Word Sense Disambiguation (WSD), and then compared. In the direct approach it is determined whether the senses match without explicitly identifying the sense identity.
Arguably, the indirect approach solves a harder intermediate problem than eventually required, and relies on a set of explicitly stipulated senses, while in the direct approach there is no explicit reference to predefined senses. The second distinction between methods is whether they are supervised or unsupervised. Supervised methods require manually labeled training data, while unsupervised methods do not require any manual intervention.

In this thesis we investigate sense matching methods of all the above types. We experimented with a supervised indirect method, which makes use of the standard multi-class WSD setting to identify the sense of the target word, and classifies positively for the sense matching task if the selected sense matches one of the senses of the source word. The supervised direct method we examined is trained on binary-annotated training data of matching and non-matching target words, which correspond to the multiple senses of the source word. The unsupervised indirect method we implemented matches example words in the given context of the target word with the sense definitions of the source word, obtained from a common resource dictionary.

The most powerful approach we investigated is the unsupervised direct one, which avoids the intermediate step of explicit word sense disambiguation, and thus circumvents the problematic reliance on a set of explicitly stipulated senses, and does not require manual labeling either. The underlying assumption of this approach is that if the sense of the substituting target word matches the original source word, then its context should be valid (typical) for the source word. The classification scheme we suggest in this thesis learns a model from unlabeled occurrences of the source word, referring to all of them as positive examples, and tests whether this model matches the context of the given occurrence of the target word. We applied two different methods for this approach: one is based on the one-class SVM algorithm, which tries to delimit a region containing most of the training examples, and classifies a substituting target word as matching if it falls within this region; the other is based on a kNN approach, which calculates the similarity between the substituting target word and the occurrences of the source word, and ranks the level of matching between the two words according to the similarity between the target word context and the k most similar occurrences of the source word.

We used two different measures to evaluate the results of the methods described above, one for evaluating classification accuracy and the other for evaluating ranking quality. The ranking measure could be applied only to the kNN method and the supervised direct method we implemented, since only those give a score for each substituting word, which enables ranking. Classification could be applied to all methods, by setting a threshold of positive classification on the scores of the ranking methods, converting their results to binary classifications.

Positive empirical results are presented for all methods, substantially improving over the baselines. We focused on the direct unsupervised approach, which does not require any manual intervention and does not rely on any form of external information. As described above, we applied two different methods for this approach, the kNN method and the one-class method, where the former obtained better results. These results are accompanied by some stimulating analysis for future research.

1. Introduction
In many language processing settings it is necessary to recognize that a given word or term may be substituted by a synonymous one. In a typical information seeking scenario, an information need is specified by some given source words. When looking for texts that match the specified need, the original source words might be substituted with synonymous target words. For example, given the source word 'weapon', a system may substitute it with the target synonym 'arm' when searching for relevant texts about weapons.

This scenario, which is generally referred to here as lexical substitution, is a common technique for increasing recall in Natural Language Processing (NLP) applications. In Information Retrieval (IR) and Question Answering (QA), it is typically termed query/question expansion (Moldovan and Mihalcea, 2000; Negri, 2004). Lexical substitution is also commonly applied to identify synonyms in text summarization, for paraphrasing in text generation, or is integrated into the features of supervised tasks such as Text Categorization and Information Extraction. Naturally, lexical substitution is a very common first step in textual entailment recognition, which models semantic inference between a pair of texts in a generalized, application-independent setting (Dagan et al., 2005).

To perform lexical substitution, NLP applications typically utilize a knowledge source of synonymous word pairs. The most commonly used resource for lexical substitution is the manually constructed WordNet (Fellbaum, 1998). Another option is to use statistical word similarities, such as in the database constructed by Dekang Lin (e.g. (Lin, 1998)), available from http://armena.cs.ualberta.ca/lindek/downloads. We generically refer to such resources as substitution lexicons. (While focusing on synonymy in this thesis, lexical substitution may also be based on additional lexical semantic relations, such as hyponymy.)

When using a substitution lexicon it is assumed that there are some contexts in which the given synonymous words share the same meaning. Yet, due to polysemy, it is necessary to verify that the senses of the two words do indeed match in a given context. For example, there are contexts in which the source word 'weapon' may be substituted by the target word 'arm'; however, one should recognize that 'arm' has a different sense than 'weapon' in sentences such as "repetitive movements could cause injuries to hands, wrists and arms."

Since sense matching involves sense disambiguation of both words, either explicitly or implicitly, a mismatch between the source and target words may be caused by wrong sense disambiguation of either of them. To illustrate these two cases of mismatch, let us first consider the pair of source word weapon and target word arm, when arm appears in the following context: "Look, could you grab hold of this box before my arms drop off?". In this sentence the word arm appears in a sense other than weapon, not the desired sense. The second type of mismatch happens when the original source word is substituted by a word that is not synonymous with the sense of the original word in the given context. For example, assume that the source word 'paper' appears in a given query "photocopying paper". In this case it would be wrong to substitute it with the target word 'newspaper', which is synonymous with a different sense of 'paper'.
The focus of our research is to solve mismatches of the first type, while the same method could be applied to solve the second type by switching the roles of the source and target words.

A commonly proposed approach to address sense matching in lexical substitution is applying Word Sense Disambiguation (WSD) to identify the senses of the source and target words. In this approach, substitution is applied only if the words have the same sense (or synset, in WordNet terminology). In settings in which the source is given as a single term without context, sense disambiguation is performed only for the target word; substitution is then applied only if the target word's sense matches at least one of the possible senses of the source word.

One might observe that such application of WSD addresses the task at hand in a somewhat indirect manner. In fact, lexical substitution only requires knowing that the source and target senses do match, but it does not require that the matching senses be explicitly identified. Explicitly selecting the right sense in context, followed by verifying the desired matching, might be solving a harder intermediate problem than required. Instead, we can define the sense matching problem directly as a binary classification task for a pair of synonymous source and target words. This task requires deciding whether the senses of the two words do or do not match in a given context (but it does not require explicitly identifying the matching senses).

A highly related task was proposed in (McCarthy, 2002). McCarthy's proposal was to ask systems to suggest possible "semantically similar replacements" of a target word in context, where alternative replacements should be grouped together. While this task is somewhat more complicated as an evaluation setting than our binary recognition task, it was motivated by similar observations and applied goals. From another perspective, sense matching may be viewed as a lexical sub-case of the general textual entailment recognition setting, where we need to recognize whether the meaning of the target word "entails" the meaning of the source word in a given context.

This thesis provides a first investigation of the novel sense matching problem. (Major parts of this research were published in (Dagan et al., 2006), which was based on the current thesis work.) To allow comparison with the classical WSD setting we derived an evaluation dataset for the new problem from the Senseval-3 English lexical sample dataset (Mihalcea and Edmonds, 2004). We then evaluated alternative supervised and unsupervised methods that perform sense matching either indirectly or directly (i.e. with or without the intermediate sense identification step). Our findings suggest that in the supervised setting the results of the direct and indirect approaches are comparable. However, addressing the binary classification task directly has practical advantages and can yield high precision values, as desired in precision-oriented applications such as IR and QA.

More importantly, direct sense matching sets the ground for implicit unsupervised approaches that may utilize practically unlimited volumes of unlabeled training data. Furthermore, such approaches circumvent the Sisyphean need for specifying explicitly a set of stipulated senses. We present initial implementations of such approaches based on a one-class classifier and a kNN-style ranking method.
These methods are trained on unlabeled occurrences of the source word and are applied to classify and rank test occurrences of the target word. The presented results outperform the unsupervised baselines and put forth a whole new direction for future research.

2. Background

2.1 Word Senses and Lexical Ambiguity

2.1.1 WordNet – a word sense database

To obtain an application-oriented view of prominent lexical relations, one must refer to the WordNet ontology (Fellbaum, 1998), the most influential computational lexical resource. WordNet is a lexical database, available online, which provides a large repository of English lexical items. It was developed by a group of lexicographers led by Miller, Fellbaum and others at Princeton University, and has been constantly updated and improved during the last fifteen years.

Inspired by current psycholinguistic theories of human lexical memory, it consists of English nouns, verbs, adjectives and adverbs organized into synonym sets – synsets – each representing one underlying sense. A synset includes a set of synonyms and their definition. The specific meaning of one word under one part of speech (POS) is called a sense, and each sense of a word appears in a different synset. Synsets are thus equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words 'night', 'nighttime' and 'dark' constitute a single synset that has the following gloss: "the time after sunset and before sunrise while it is dark outside."

Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernym and hyponym for nouns, hypernym and troponym for verbs) constitute is-a-kind-of (hyperonymy) and is-a-part-of (meronymy, for nouns) hierarchies.

Diverse WordNet relations have been used in various NLP tasks as a source of candidate lexical substitutes for expansion. Expansion consists of altering a given text (usually a query) by adding terms of similar meaning. For example, many question answering systems perform expansion in the retrieval phase using query-related words based on WordNet's lexical relations, such as synonymy or hyponymy (e.g. (Harabagiu et al., 2000), (Hovy et al., 2001)). Automatic indexing has been improved by adding the synsets of query words and their hypernyms to the query (Mihalcea and Moldovan, 2000). Scott and Matwin (1998) exploited WordNet hypernyms to increase the accuracy of Text Classification. Chaves (1998) enhanced a document summarization task through merging WordNet hyponymy chains, while Flank (1998) introduced a layered approach to term similarity computation for information retrieval, which assigns the highest weights to synonymy relations, ranks hyponyms next, and lets meronymy relations contribute the lowest scores to the final similarity weights. Notably, each of the above works addressed the problem within the narrow setting of a specific application, while none has induced a clear generic definition of the types of ontological relations that contribute to semantic substitution.

2.1.2 Lexical ambiguity and Senseval

Word Sense Disambiguation (WSD) is the problem of deciding which sense a word has in any given context. It has been very difficult to formalize the process of disambiguation, which humans perform so effortlessly.
For virtually all applications of language technology, word sense ambiguity is a potential source of error. One example is Machine Translation (MT). If the English word 'drug' translates into French as either 'drogue' (narcotic) or 'médicament' (medication), then an English-French MT system needs to disambiguate every use of 'drug' in order to make the correct translation. Similarly, information retrieval systems may erroneously retrieve documents about an illegal narcotic when the item of interest is a medication; analogously, information extraction systems may make wrong assertions; and text-to-speech applications may confuse violin bows with a ship's bows.

Senseval (http://www.senseval.org/) is the international organization devoted to the evaluation of Word Sense Disambiguation systems. Its mission is to organize and run evaluations and related activities to test the strengths and weaknesses of WSD systems with respect to different words, different aspects of language, and different languages. In actual applications, WSD is often fully integrated into the system and cannot easily be separated out. But in order to study and evaluate WSD, Senseval has, to date, concentrated on standalone, generic systems for WSD.

2.2 WSD and Lexical Expansion

Despite some initial skepticism about the usefulness of WSD in practical tasks (Voorhees, 1993; Sanderson, 1994), there is some evidence that WSD can improve performance in typical NLP tasks such as IR and QA. For example, Schütze and Pedersen (1995) give a clear indication of the potential of WSD to improve the precision of an IR system. They tested the use of WSD on a standard IR test collection (TREC-1B), improving precision by more than 4%.

The use of WSD has produced successful experiments with query expansion techniques. In particular, some attempts exploited WordNet to enrich queries with semantically related terms. For instance, Voorhees (1994) manually expanded 50 queries over the TREC-1 collection using synonymy and other WordNet relations. She found that the expansion was useful with short and incomplete queries, leaving the task of proper automatic expansion as an open problem. Gonzalo et al. (1998) demonstrate an increase in performance over an IR test collection using the sense data contained in SemCor, compared to a purely term-based model. In practice, they experimented with searching SemCor with disambiguated and expanded queries. Their work shows that a WSD system, even if not performing perfectly, combined with synonymy enrichment, increases retrieval performance. Moldovan and Mihalcea (2000) introduce the idea of using WordNet to extend Web searches based on semantic similarity. Their results showed that WSD-based query expansion actually improves retrieval performance in a Web scenario. Recently, Negri (2004) proposed a sense-based relevance feedback scheme for query enrichment in a QA scenario (TREC-2003 and ACQUAINT), demonstrating improvement in retrieval performance.

While all these works clearly show the potential usefulness of WSD in practical tasks, they do not necessarily justify the efforts of refining fine-grained sense repositories and building large sense-tagged corpora. We suggest that the sense matching task, as presented in the introduction, may relieve major drawbacks of applying WSD in practical scenarios.

It is worth mentioning the related approach of word sense discrimination (Pedersen and Bruce, 1997; Schütze, 1998).
Word sense discrimination aims to divide the usages of a word into different meanings, without regard to any particular existing sense inventory. Typically approached with unsupervised techniques, sense discrimination divides the occurrences of a word into a number of classes by determining, for any two occurrences, whether they belong to the same sense or not. Consequently, sense discrimination does not determine the actual "meaning" (i.e. a sense "label") but rather identifies which occurrences of the same word have an equivalent meaning. Overall, word sense discrimination can be viewed as an indirect approach which assigns unsupervised senses.

In our preliminary work we assessed the importance of identifying expansion mismatches in applied settings, and evaluated the causes of such mismatches. Using several pairs of source words, we checked whether substituting them with target synonyms actually results in sense mismatches in randomly retrieved sentences. We discovered that the main cause of inappropriately retrieved sentences is indeed word sense mismatch, which caused 77% of the retrieval mismatches. For example, consider the original word pair 'cut job', where the source word 'job' is substituted by the target word 'position'. A successful substitution is found in the sentence: "40% of the positions at the company were cut." The following sentence is an example of a sense mismatch: "The company's market position suffered a cut after a bad quarter". In this sentence the word position has a different sense than job.

2.3 Textual Entailment

Textual entailment (TE) has been proposed recently as a generic framework for modeling semantic variability in many Natural Language Processing applications, such as Question Answering (QA), Information Extraction (IE) and Information Retrieval (IR). Textual entailment is defined as a relationship between a coherent text T and a language expression, which is considered as a hypothesis, H. Then, T entails H (H is a consequent of T), denoted by T => H, if the meaning of H, as interpreted in the context of T, can be inferred from the meaning of T. For example, "Shirley inherited the house" => "Shirley owned the house".

Identifying entailment between texts is a complex task, and many researchers have addressed sub-tasks of TE. For example, Geffet and Dagan (2004) explored the correspondence between the distributional characterizations of pairs of words (which may rarely co-occur, as is usually the case for synonyms) and the kind of tight semantic relationship that might hold between them, in particular entailment at the lexical level. They proposed a feature weighting function (RFF) that yields more accurate distributional similarity lists, which better approximate the lexical entailment relation. This method still applies a standard measure for distributional vector similarity (over vectors with the improved feature weights), and thus produces many loose similarities that do not correspond to entailment. In a later paper, they explore more deeply the relationship between the distributional characterization of words and lexical entailment, proposing two new hypotheses as a refinement of the distributional similarity hypothesis. The main idea is that if one word entails the other, then we would expect that virtually all the characteristic context features of the entailing word will actually occur also with the entailed word.
To illustrate this idea let us consider an entailing pair, company => firm, and the following set of characteristic features of "company": {"(company)'s profit", "chairman of the (company)"}. These features are then expected to appear with "firm" as well in some large corpus, as in "firm's profit" and "chairman of the firm". Other researchers have explored other aspects of textual entailment. Glickman, Dagan and Koppel (2004) propose a general generative probabilistic setting for textual entailment. They focus on the sub-task of recognizing whether the lexical concepts present in a given hypothesis are entailed by a given text. Glickman, Bar-Haim and Szpektor (2005) suggest an analysis of sub-components and tasks within textual entailment, proposing two levels: Lexical and Lexical-Syntactic. At the lexical level, they match (possibly multi-word) terms of one text (T) and a second text (the hypothesis H), ignoring function words. At the lexical-syntactic level, they match the syntactic dependency relations within H and T.

The sense matching problem we deal with is actually a binary classification task: to decide whether the occurrence of the target word in the given sentence entails the source word (i.e., at least one of the meanings of the source word). An example we have already mentioned is the sentence 'Repetitive movements could cause injuries to hands, wrists and arms.', where the word arm substitutes the word weapon, but with the wrong sense. In this case, the target word arm does not entail the source word weapon. On the other hand, in the sentence 'This house was badly mauled by careless soldiers searching for arms', the target word arm does entail the source word weapon. In our work we suggest a novel approach of using an implicit WSD method that identifies such lexical entailment in context.

2.4 Classification Algorithms

As mentioned in the introduction, we can define the sense matching problem directly as a binary classification task for a pair of synonymous source and target words. For this task we used two algorithms: SVM (Support Vector Machine) and a kNN (k-nearest neighbors) based method. We used two existing implementations of the SVM algorithm, LibSVM and SVMLight, and implemented the kNN algorithm ourselves.

2.4.1 Binary classification using SVM (Support Vector Machine)

The SVM algorithm refers to each source-target pair as a point (or vector) in a multidimensional space, where each dimension is any desired feature of the pair. Given a collection of such training points in the feature space, each tagged positive or negative, we would like to separate the positive and the negative ones as neatly as possible, by the simplest plane. The task of determining whether an untagged point is negative or positive is called classification, and a group of identically tagged points is called a class. In some cases, the two classes of positive and negative points can be separated by a multi-dimensional 'plane', which is called a hyperplane. A method that uses such a hyperplane is therefore called linear classification. If such a hyperplane exists, we would like to choose the one that separates the data points with maximum distance between it and the closest data points from both classes. This distance is called the margin. We desire this property because it makes the separation between the two classes greater: if we then add another data point to the points we already have, we can classify the new point more accurately.
Such a hyperplane is known as the maximum-margin hyperplane, or the optimal hyperplane. The vectors (data points) that are closest to this hyperplane are called the support vectors.

Let us now give a more detailed view of the SVM algorithm. We consider a set of training data points of the form $\{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$, where $c_i$ is either 1 or $-1$, denoting the class (positive or negative) to which the point $x_i$ belongs. Each $x_i$ is an n-dimensional vector of values scaled to $[0,1]$ or $[-1,1]$. The scaling is important to guard against features with larger variance that might otherwise dominate the classification. These points constitute the training data, denoting the correct classification that the SVM should eventually reproduce. The classification is defined by means of the dividing hyperplane, which takes the form

(1)  $w \cdot x - b = 0$

where $w$ and $b$ are the parameters of the optimal hyperplane that we need to learn. As we are interested in the maximum margin, the dominant data points are the support vectors, and the hyperplanes closest to these support vectors in either class. These hyperplanes are parallel to the optimal separating hyperplane, and it can be shown that they can be described by the equations

(2)  $w \cdot x - b = 1$
(3)  $w \cdot x - b = -1$

We would like these hyperplanes to maximize the distance from the dividing hyperplane and to have no data points between them. Geometrically, the distance between the hyperplanes is $2/\|w\|$, so $\|w\|$ should be minimized in order to maximize the margin. To exclude data points, it should be ensured that for all $i$,

(4)  $w \cdot x_i - b \geq 1$  or  $w \cdot x_i - b \leq -1$

This can be rewritten as:

(5)  $c_i (w \cdot x_i - b) \geq 1, \quad 1 \leq i \leq n$

The problem now is to minimize $\|w\|$ subject to the constraint (5). This is a quadratic programming (QP) optimization problem, which is solved by the SVM training algorithm. After the SVM has been trained, it can be used to classify unseen 'test' data. This is achieved using the following decision rule:

(6)  $c = \begin{cases} 1 & \text{if } w \cdot x - b \geq 0 \\ -1 & \text{if } w \cdot x - b < 0 \end{cases}$

where $c$ is the class of the new data point $x$. Writing the classification rule in its dual form reveals that classification is a function only of the support vectors, i.e., the training data points that lie on the margin.

In the cases when there is no hyperplane that can split the positive and negative training examples, the Soft Margin method is applied. This method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables $\xi_i$, and equation (5) is transformed to

(7)  $c_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n$

and the optimization problem becomes

(8)  $\min \|w\|^2 + C \sum_i \xi_i \quad \text{such that} \quad c_i (w \cdot x_i - b) \geq 1 - \xi_i, \; \xi_i \geq 0, \; 1 \leq i \leq n$

The constraint in (7), along with the objective of minimizing $\|w\|$, can be solved using Lagrange multipliers, or by setting up a dual optimization problem to eliminate the slack variables.

2.4.1.1 One-class SVM

In certain cases the training data contain points of one class only, and the above separation is no longer possible. In such cases the aim is to estimate the smallest hypersphere enclosing most of the positive training data points. New test instances are then classified positively if they lie inside the sphere, while outliers are regarded as negatives.

We used the SVMlight classifier (http://svmlight.joachims.org/, developed by T. Joachims from the University of Dortmund) for the case where the data contain points of both classes, and LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), with its one-class option, for the one-class case.
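To make this concrete, the following minimal sketch shows one-class training and classification with scikit-learn's OneClassSVM; this is our own illustration over toy data (the thesis itself used the LibSVM package directly), and the nu argument plays the role of the parameter ν discussed next.

```python
# Minimal one-class SVM sketch (illustrative only; the thesis used LibSVM's
# one-class option, not scikit-learn, and real feature vectors, not toy data).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Toy stand-ins for feature vectors of unlabeled source-word occurrences,
# all treated as positive examples.
train = rng.normal(loc=0.0, scale=1.0, size=(200, 10))

# nu in (0, 1) bounds the fraction of training points treated as outliers:
# smaller nu -> larger enclosed region -> higher recall, lower precision.
clf = OneClassSVM(kernel="rbf", nu=0.1)
clf.fit(train)

# Toy stand-ins for target-word test occurrences: +1 = inside the region
# (sense match), -1 = outlier (sense mismatch).
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 10)),   # similar to training contexts
                  rng.normal(5.0, 1.0, size=(5, 10))])  # clearly different contexts
print(clf.predict(test))
```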
LibSVM also enables us to control the ratio between the width of the enclosed region of training points and the number of misclassified training examples, by setting the parameter ν ∈ (0, 1). Smaller values of ν will produce larger positive regions, yielding increased recall but lower precision.

2.4.2 The kNN (k-nearest neighbor) classification algorithm

The k-nearest neighbor algorithm is an intuitive method that classifies unlabeled examples based on their similarity to examples in a given training set. (Major parts of this subsection draw on Wikipedia: http://en.wikipedia.org/wiki/Nearest_neighbor_(pattern_recognition).) For a given unlabeled example, the algorithm finds the k closest labeled examples in the training data, and classifies the unlabeled example according to the most frequent class within this set of k closest examples. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.

The training examples are mapped into a multidimensional feature space. The space is partitioned into regions by the class labels of the training samples. A point in the space is assigned to class c if c is the most frequent class label among the k nearest training samples. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the actual classification phase, the same features as before are computed for the new test example (whose class is not known). The distances from the new vector to all stored vectors are computed, using a selected distance metric or some vector similarity measure. The k closest samples are then selected, and the new point is predicted to belong to the most frequent class within this set.

The performance of the kNN algorithm is influenced by two main factors: (1) the similarity measure used to locate the k nearest neighbors; and (2) the number k of neighbors used to classify the new sample. The best choice of k depends upon the data. Generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by parameter optimization using, for example, cross-validation.

The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their relevance. Much research effort has been placed into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling (Dixon, Corne and Oates, 2003). Another popular approach is to scale features by the mutual information of the training data with the training classes (Yang and Pedersen, 1997; Li et al., 2001).

The algorithm is easy to implement, but it is computationally intensive, especially when the size of the training set grows. Many optimizations have been proposed over the years; these generally seek to reduce the number of distances actually computed. Some optimizations involve partitioning the feature space, and only computing distances within specific nearby volumes.

To measure the distance between two vectors, some vector metric or measure of similarity is required. We used three of the most popular similarity measures.
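Before defining these measures, the following minimal sketch illustrates the kNN decision procedure described above; the function names and the use of cosine similarity here are our own illustrative choices, not the thesis implementation.

```python
# Minimal kNN classification sketch (illustrative; cosine similarity is used
# as the example metric, and all names are our own).
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def knn_classify(test_vec, train_vecs, train_labels, k=10):
    """Return the majority label among the k most similar training examples."""
    sims = [cosine_sim(test_vec, tv) for tv in train_vecs]
    top_k = np.argsort(sims)[-k:]              # indices of the k nearest neighbors
    labels = [train_labels[i] for i in top_k]
    return max(set(labels), key=labels.count)  # most frequent class among them
```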
The weighted Jaccard measure (Grefenstette, 1994) compares the number of common features with the number of unique features for a pair of examples. When generalizing this scheme to non-binary values, each feature is represented by a real value in the range 0–1. This generalization, known as Weighted Jaccard, replaces intersection with the minimum weight and union with the maximum weight; set cardinality is generalized to summing over the union of the features of the two examples w and v:

$$sim_{WJ}(w, v) = \frac{\sum_{f \in F(w) \cup F(v)} \min(weight(w,f),\, weight(v,f))}{\sum_{f \in F(w) \cup F(v)} \max(weight(w,f),\, weight(v,f))}$$

where F(w) and F(v) are the features of the two examples. The advantage of this measure is that it takes into account the feature weights, rather than just the number of common features.

The standard Cosine measure, which was successfully employed in IR (Salton and McGill, 1983), and also for learning similar words (Ruge, 1992; Caraballo, 1999; Gauch et al., 1999; Pantel and Ravichandran, 2004), is the second alternative we examine:

$$sim_{\cos}(w, v) = \frac{\sum_f weight(w,f) \cdot weight(v,f)}{\sqrt{\sum_f weight(w,f)^2} \cdot \sqrt{\sum_f weight(v,f)^2}}$$

Calculating the cosine of the angle between the two vectors considers the difference in direction of the two vectors in feature space, as opposed to their geometric distance. Thus, it overcomes the problem of distance metrics that discriminate too strongly between vectors with significantly different lengths.

The third measure we used is a recent state-of-the-art variant of the weighted Jaccard measure (Weeds and Weir, 2004), which was developed by Lin (1998) and is grounded in principles of information theory. It computes the ratio between what is shared by the features of both vectors and the sum over the features of each vector:

$$sim_{Lin}(w, v) = \frac{\sum_{f \in F(w) \cap F(v)} \left(weight_{MI}(w,f) + weight_{MI}(v,f)\right)}{\sum_{f \in F(w)} weight_{MI}(w,f) + \sum_{f \in F(v)} weight_{MI}(v,f)}$$

F(w) and F(v) are the features of the two examples, and the weight function is defined as the Mutual Information (MI). There are three underlying intuitions to this measure: (1) the more commonality the two objects share, the more similar they are; (2) the more differences they have, the less similar they are; (3) the maximum similarity between objects A and B should be reached only when they are identical, no matter how much commonality they share.

3. Problem Setting and Dataset

To investigate the direct sense matching problem it is necessary to obtain an appropriate dataset of examples for this binary classification task, along with gold standard annotation. While no such standard (application independent) dataset is available, it is possible to derive one automatically from existing WSD evaluation datasets, as described below. This methodology also allows comparing direct approaches for sense matching with classical indirect approaches, which apply an intermediate step of first identifying the most likely WordNet sense.

We chose to work with single words in order to find an abstract solution to the problem of direct sense matching. (We did not want to become involved in working with more than a single word at a time, in order to avoid problematic word dependencies, etc.)

Our dataset was derived from the Senseval-3 English lexical sample dataset (Mihalcea and Edmonds, 2004), and included all 25 nouns, adjectives and adverbs in this sample.
Verbs were excluded since their sense annotation in Senseval-3 is not based on WordNet senses but rather on a different dictionary (the available approximate mapping to WordNet synsets was not sufficiently reliable). The Senseval dataset includes a set of example occurrences in context for each word, split into training and test sets, where each example is manually annotated with the corresponding WordNet synset.

For the sense matching setting we need examples of pairs of source-target synonymous words, where at least one of these words should occur in a given context. Following an applicative motivation, we mimic a typical IR setting in which a single source word query is expanded (substituted) by a synonymous target word. Then, contexts in which the target word appears in a sense that matches the source word need to be identified. Accordingly, we considered each of the 25 words in the Senseval sample as a target word for the sense matching task.

Next, we had to pick for each target word a corresponding synonym to play the role of the source word. This was done by creating a list of all WordNet synonyms of the target word, under all its possible senses, and picking one of the synonyms at random as the source word. For example, the word 'disc' is one of the words in the Senseval lexical sample. For this target word the synonym 'record' was picked, which matches 'disc' in its musical sense.

While creating source-target synonym pairs it was evident that many WordNet synonyms corresponded to very infrequent senses or word usages, such as the WordNet synonyms germ and source. Such source synonyms are useless for evaluating sense matching with the target word, since the senses of the two words would rarely match in perceivable contexts. In fact, considering our motivation for lexical substitution, it is usually desired to exclude such obscure synonym pairs from substitution lexicons in practical applications, since they would mostly introduce noise to the system. To avoid this problem, the list of WordNet synonyms for each target word was filtered by a lexicographer, who manually excluded obscure synonyms that seemed worthless in practice. The lexicographer was also instructed to exclude pairs where the target word had a more general meaning than the source word. Using those manually filtered results, the source synonym for each target word was then picked randomly from the filtered list. Table 1 shows the 25 source-target pairs created for our experiments.
Source word     Target word     WordNet sense id
statement       argument        argument%1:10:02
level           degree          degree%1:07:00::, degree%1:26:01::
raging          hot             hot%3:00:00:violent:00
opinion         judgment        judgment%1:10:00::
execution       performance     performance%1:04:00::
subdivision     arm             arm%1:14:00::
deviation       difference      difference%1:11:00::
ikon            image           image%1:06:00::
arrangement     organization    organization%1:09:00::
design          plan            plan%1:09:01::
atm             atmosphere      atmosphere%1:23:00::
dissimilar      different       different%3:00:02::
crucial         important       important%3:00:02::
newspaper       paper           paper%1:06:00::, paper%1:10:03::, paper%1:14:00::
protection      shelter         shelter%1:26:00::
hearing         audience        audience%1:26:00::
trouble         difficulty      difficulty%1:04:00::
sake            interest        interest%1:07:01::
company         party           party%1:14:02::
variety         sort            sort%1:09:00::
camber          bank            bank%1:17:02::
record          disc            disc%1:06:01::
bare            simple          simple%3:00:02:plain:01
substantial     solid           solid%3:00:00:sound:01, solid%3:00:00:wholesome:00
root            source          source%1:15:00::

Table 1: Source and target pairs

In future work it may be possible to apply automatic methods for filtering infrequent sense correspondences in the dataset, by adopting algorithms such as in (McCarthy et al., 2004).

Having source-target synonym pairs, a classification instance for the sense matching task is created from each example occurrence of the target word in the Senseval dataset. A classification instance is thus defined by a pair of source and target words and a given occurrence of the target word in context. The instance should be classified as positive if the sense of the target word in the given context matches one of the possible senses of the source word, and as negative otherwise. Table 2 illustrates positive and negative example instances for the source-target synonym pair 'record-disc', where only occurrences of 'disc' in the musical sense are considered positive.

Sentence                                                                       Annotation
This is anyway a stunning disc, thanks to the playing of the Moscow            positive
Virtuosi with Spivakov.
He said computer networks would not be affected and copies of                  negative
information should be made on floppy discs.
Before the dead soldier was placed in the ditch his personal possessions       negative
were removed, leaving one disc on the body for identification purposes.

Table 2: Positive and negative examples for the source-target synonym pair 'record-disc'

The gold standard annotation for the binary sense matching task can be derived automatically from the Senseval annotations and the corresponding WordNet synsets. An example occurrence of the target word is considered positive if the annotated synset for that example also includes the source word, and negative otherwise. Notice that different positive examples might correspond to different senses of the source word. This happens when the source and target share several senses, and hence appear together in several synsets (see Table 3). Finally, since in Senseval an example may be annotated with more than one sense, it was considered positive if any of the annotated synsets for the target word included the source word.

Sentence                                                     WordNet sense of 'degree'               Annotation
It can be a very useful means of making a charitable        a position on a scale of intensity      positive
gift towards the end of the tax year when your taxable      or amount or quality
income for the year can be estimated with some degree
of precision

The length of time spent stretching depends on the          a specific identifiable position in     positive
sport you are training for and the degree of flexibility    a continuum or series or especially
you wish to attain                                          in a process

Table 3: Example instances for the source-target synonym pair 'level-degree', where two senses of the target word 'degree' are considered positive.

Using this procedure we derived gold standard annotations for all the examples in the Senseval-3 training section for our 25 target words. For the test set we took up to 40 test examples for each target word (some words had fewer test examples), yielding 913 test examples in total, out of which 239 were positive. This test set was used to evaluate the sense matching methods described in the next section.
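To make the annotation procedure concrete, the following sketch checks the positive/negative criterion through WordNet synsets; it assumes NLTK's WordNet interface and is our own illustration, not the tooling actually used to build the dataset.

```python
# Sketch of deriving a gold-standard label: an occurrence of the target word
# is positive iff one of its annotated synsets also contains the source word.
# Assumes NLTK with the WordNet corpus installed (illustrative only).
from nltk.corpus import wordnet as wn

def sense_matches(source_word, annotated_synsets):
    """True if any annotated synset of the target occurrence includes the source lemma."""
    return any(source_word in s.lemma_names() for s in annotated_synsets)

# Example for the pair record-disc: an occurrence of 'disc' annotated with its
# musical sense (a synset that also contains 'record') is a positive example.
musical_disc = wn.synset('phonograph_record.n.01')
print(sense_matches('record', [musical_disc]))   # True -> positive
```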
4. Investigated Methods

As explained in the introduction, the sense matching task may be addressed by two general approaches. The traditional indirect approach would first disambiguate the target word relative to a predefined set of senses, using standard WSD methods, and then check whether the selected sense matches the source word. In terms of WordNet synsets, it would check whether the selected synset of the target word includes the source word as well. A direct approach, on the other hand, would address the binary sense matching task directly, without explicitly selecting a concrete sense for the target word. In this research we focus on investigating several direct methods for sense matching and compare their performance to traditional indirect methods, under both supervised and unsupervised settings.

Two different goals may be set for sense matching methods. The first goal is classification, where the system needs to decide for each test example whether it is positive or negative (i.e., whether the target word sense matches the source or not). The second goal is ranking, where the system only needs to rank all test examples of a given target word according to their likelihood of being positive, as measured by some confidence score. From the perspective of the applied lexical substitution task, employing the sense matching module as a classifier enables using it to filter out inappropriate contexts of the target word. Scored ranking, on the other hand, corresponds to situations in which a hard classification decision is not expected from the sense matching module, either because the final system output is a ranked list (as in IR and QA) or because the sense matching score is being integrated with the scores of additional system modules. As described below, we investigate alternative methods for both the classification and ranking goals.

4.1 Feature set and classifier

As a vehicle for investigating different classification approaches we implemented a "vanilla" state-of-the-art architecture for WSD. Following common practice in feature extraction (e.g. (Yarowsky, 1994)), and using the mxpost part-of-speech tagger (ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz) and WordNet's lemmatization, the following feature set was used: bag of word lemmas for the context words in the preceding, current and following sentence; unigrams of lemmas and parts of speech in a window of +/- three words, where each position provides a distinct feature [w-3, w-2, w-1, w+1, w+2, w+3]; and bigrams of lemmas in the same window [w-3-2, w-2-1, w-1+1, w+1+2, w+2+3].

The SVMLight (Joachims, 1999) classifier was used in the supervised settings with its default parameters. To obtain a multi-class classifier we used a standard one-vs-all approach, training a binary SVM for each possible sense and then selecting the highest scoring sense for a test example.
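The following sketch illustrates this feature set (our own illustrative rendering; tokenization, lemmatization and POS tagging are assumed to be done elsewhere, whereas the thesis used the mxpost tagger and WordNet's lemmatization):

```python
# Sketch of the feature extraction described above (illustrative; all names
# are our own). tokens holds (lemma, pos) pairs for the current sentence.
def extract_features(tokens, i, context_lemmas):
    """Features for a target word at position i in its tokenized sentence."""
    feats = set()
    # Bag of word lemmas from the preceding, current and following sentences.
    feats.update('bow=' + lem for lem in context_lemmas)
    # Positional unigrams of lemma and POS in a +/-3 window.
    for off in (-3, -2, -1, 1, 2, 3):
        j = i + off
        if 0 <= j < len(tokens):
            lem, pos = tokens[j]
            feats.add('w%+d=%s' % (off, lem))
            feats.add('p%+d=%s' % (off, pos))
    # Positional bigrams of lemmas in the same window.
    for o1, o2 in ((-3, -2), (-2, -1), (-1, 1), (1, 2), (2, 3)):
        j1, j2 = i + o1, i + o2
        if 0 <= j1 < len(tokens) and 0 <= j2 < len(tokens):
            feats.add('w%+d%+d=%s_%s' % (o1, o2, tokens[j1][0], tokens[j2][0]))
    return feats
```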
To verify that our implementation provides a reasonable replication of state-of-the-art WSD, we applied it to the standard Senseval-3 Lexical Sample WSD task. The obtained accuracy was 66.7% (the standard classification accuracy measure equals precision and recall, as defined in the Senseval terminology, when the system classifies all examples with no abstentions). This compares reasonably with the mid-range of systems in the Senseval-3 benchmark (Mihalcea and Edmonds, 2004); it is just a few percent lower than the (quite complicated) best Senseval-3 system, which achieved about 73% accuracy, and much higher than the standard Senseval baselines. We thus regard our classifier as a fair vehicle for comparing the alternative approaches to sense matching on equal grounds.

4.2 Supervised Methods

4.2.1 Indirect approach

The indirect approach for sense matching follows the traditional scheme of performing WSD for lexical substitution. First, the WSD classifier described above was trained for the target words of our dataset, using the Senseval-3 sense-annotated training data for these words. This was accomplished by training a binary SVM for each possible sense: each binary classifier was trained to identify one sense of the target word, with training examples labeled positive when the sense of the target word in the given context matched that classifier's sense, and negative otherwise. Then, each classifier was applied to each test example of the target words, selecting the most likely sense for each example by picking the sense of the binary classifier that scored highest on the test example. Finally, an example was classified as positive if its selected sense included the source word in its synset; otherwise, the example was classified as negative.

4.2.2 Direct approach

As explained above, the direct approach addresses the binary sense matching task directly, without explicitly selecting a sense for the target word. In the supervised setting it is easy to obtain such a binary classifier using the annotation scheme described in Section 3. Under this scheme, an example was annotated as positive (for the binary sense matching task) if the source word is included in the Senseval gold standard synset of the target word. We trained the classifier using the set of Senseval-3 training examples for each target word, considering their derived binary annotations. Finally, the trained classifier was applied to the test examples of the target words, directly yielding a binary positive-negative classification.

We note that the direct binary setting is suitable for producing rankings as well, using the obtained SVM scores to rank all examples of each target word. In addition, because this method is direct and applies a single classifier per target word, it allows for a shorter running time during the training and test stages. In the indirect method, the training stage must train a binary classifier for each sense, and consequently, in the testing stage, each test example must be checked by all of the binary classifiers. The running time thus increases with the number of senses of each target word, and some words have many senses; the word "hot", for example, has twenty-one different senses.

4.3 Unsupervised Methods

It is well known that obtaining annotated training examples for WSD tasks is very expensive, and is often considered infeasible in unrestricted domains.
Therefore, many researchers have investigated unsupervised methods, which do not require annotated examples. Unsupervised approaches have usually been investigated within Senseval using the "All Words" dataset, which does not include training examples. In this thesis we preferred using the same test set that was used for the supervised setting (created from the Senseval-3 "Lexical Sample" dataset, as described above), in order to enable comparison between the two settings. Naturally, in the unsupervised setting the sense labels in the training set were not utilized.

4.3.1 Indirect approach

State-of-the-art unsupervised WSD systems are quite complex and not easy to replicate. Thus, we implemented the unsupervised version of the Lesk algorithm (Lesk, 1986) as a reference system, since it is considered a standard simple baseline for unsupervised approaches. The Lesk algorithm was one of the first algorithms developed for the semantic disambiguation of all words in unrestricted text. In its original unsupervised version, the only resource required by the algorithm is a machine-readable dictionary with one definition for each possible word sense. The algorithm looks for words in the sense definitions that overlap with context words in the given sentence, and chooses the sense that yields maximal word overlap. It is based on the intuition that words co-occurring in a sentence are being used to refer to the same topic, and that topically related senses of words are defined in a dictionary using the same words. We used an implementation of this algorithm created by our Italian colleague Carlo Strapparava from ITC-Irst, which uses WordNet sense definitions with a context length of ±10 words before and after the target word.

4.3.2 Direct approaches

It has been well recognized that it is very difficult, and methodologically problematic, to determine the "right" set of pre-defined senses for WSD. Hence, the direct sense matching approach may be particularly attractive precisely because it does not assume any reference to a pre-defined set of senses. However, existing unsupervised algorithms for the classical WSD task do rely on pre-defined sense repositories, and sometimes also on dictionary definitions for these senses (as in the Lesk algorithm). For this reason, standard unsupervised WSD techniques cannot be applied to direct sense matching, in which the only external information assumed is a substitution lexicon.

The assumption underlying our proposed methods is that if a target word occurrence has a sense which matches the source word, then the context of that occurrence should be valid for the source word as well. Unlabeled occurrences of the source word can then be used to learn a model of its typical valid contexts. Next, we can match this model against test examples of the target word and evaluate whether the given target contexts are valid for the source word or not, providing a decision criterion for sense matching. Notice that in this proposed approach only positive examples are given, in the form of unlabeled occurrences of the source word. Learning from positive examples only (also called one-class learning) is known to be much more difficult than standard supervised learning for the same task. Yet, this setting arises in many practical situations and is often the only unsupervised solution available.
4.3.2 Direct approaches

It has been well recognized that it is very difficult, and methodologically problematic, to determine the "right" set of pre-defined senses for WSD. Hence, the direct sense matching approach may be particularly attractive precisely because it does not assume any reference to a pre-defined set of senses. However, existing unsupervised algorithms for the classical WSD task do rely on pre-defined sense repositories, and sometimes also on dictionary definitions for these senses (as in the Lesk algorithm). For this reason, standard unsupervised WSD techniques cannot be applied to direct sense matching, in which the only external information assumed is a substitution lexicon.

The assumption underlying our proposed methods is that if a target word occurrence has a sense which matches the source word, then the context of that occurrence should be valid for the source word as well. Unlabeled occurrences of the source word can therefore be used to learn a model of its typical valid contexts. Next, we can match this model against test examples of the target word and evaluate whether the given target contexts are valid for the source word or not, providing a decision criterion for sense matching. We note that in this proposed approach only positive examples are given, in the form of unlabeled occurrences of the source word. Learning from positive examples only (also called one-class learning) is known to be much more difficult than standard supervised learning for the same task. Yet, this setting arises in many practical situations and is often the only unsupervised solution available.

4.3.2.1 Direct approach: one-class SVM

Our first unsupervised method utilizes the one-class SVM learning algorithm (Schölkopf et al., 2001), and was implemented using the LIBSVM package[9] by our Italian colleague, Alfio Gliozzo, from ITC-Irst. The training examples consist of a given sample of unlabeled occurrences of the source word, represented by the same feature set described in Subsection 4.1. We used training examples taken from the BNC[10] (British National Corpus). This created compatibility between the training data and the test data, because the BNC is one of the sources of Senseval, which we used as the source for our test data. This compatibility improved the chances of successful learning, since the training data and the test data had more topics in common.

[9] Freely available from http://www.csie.ntu.edu.tw/~cjlin/libsvm
[10] The BNC (British National Corpus) is a 100 million word collection of samples of written and spoken language from a wide range of sources. http://www.natcorp.ox.ac.uk

Roughly speaking, a one-class SVM estimates the smallest hypersphere enclosing most of the training data. New test instances are then classified positively if they lie inside the sphere, while outliers are regarded as negatives. The ratio between the width of the enclosed region and the number of misclassified training examples can be varied by setting the parameter ν ∈ (0, 1). Smaller values of ν produce larger positive regions, yielding increased recall. We note that we could utilize the LIBSVM one-class package only for classification but not for ranking, since it provides just a binary classification decision rather than a classification score.

Experiments with the one-class SVM (see Section 5) revealed two problems. First, there is no obvious way to tune the optimal value of the ν parameter in an unsupervised setting, in which no labeled examples are given. Furthermore, different ν values were found optimal, in retrospect, for different words. Such optimization of classification performance is an inherent problem for the unsupervised one-class setting, unlike the standard supervised setting, in which both positive and negative examples are utilized to optimize models uniformly during training. Second, when the source word is ambiguous, only one (or a few) of its senses can be substituted with the target word. However, our one-class algorithm was trained on all examples of the source word, which include examples of irrelevant senses of the source word, yielding noisy training sets. For an example, see Table 4.

Sentence                           Sense            Appropriate/noisy
What level is the office on?       floor            noisy
A high level of care is required   degree, amount   acceptable

Table 4: A noisy training example and an appropriate training example for the source word 'level' and the target word 'degree'.
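As an illustration of this scheme, here is a minimal sketch assuming scikit-learn's OneClassSVM in place of the LIBSVM package that was actually used (the feature extraction and function names are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def train_source_model(source_contexts, nu=0.5):
    # Train only on the (unlabeled, implicitly positive) occurrences of the source word.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(source_contexts)
    model = OneClassSVM(kernel="linear", nu=nu).fit(X)
    return vectorizer, model

def sense_match(vectorizer, model, target_contexts):
    # predict() returns +1 for contexts inside the learned region (sense match)
    # and -1 for outliers (no match); no graded score is available for ranking.
    return model.predict(vectorizer.transform(target_contexts))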
4.3.2.2 Direct approach: kNN-based ranking

Consequently, we also developed an unsupervised ranking method based on the k-nearest-neighbors (kNN) principle. To avoid the first problem, that of optimizing classification performance, we decided to focus at this stage on the ranking goal. That is, we aim to score all test examples of a target word such that the positive ones tend to appear at the top of the ranked list, but without identifying an optimal classification boundary. (This method was also evaluated by the classification measure, so that we would be able to compare the two unsupervised direct methods, one-class and kNN.) The second problem, source word ambiguity, is addressed by the choice of a kNN approach. In this approach the score of a test example is determined only by the most relevant subset of source word occurrences, which are likely to correspond to the relevant sense of the source word.

More concretely, we store in memory all training examples of the source word, represented by our standard feature set. The score of a test example of the target word is computed as the average similarity between the test example and the k most similar training examples. Finally, all test examples of the target word are ranked by these scores. The rationale behind this method is that if the sense of the target test example matches the source, then there are likely to be k occurrences of the corresponding source sense that are relatively similar to this target example. On the other hand, if the target example has a sense that does not match the source word, then it is likely to have lower similarity to the source examples.

The disadvantage of this algorithm is that it uses all the training data at test time, which makes it expensive in both memory and time, since similarities must be calculated between the test example and all training examples. We tried to improve the algorithm in these two respects by building an index for the training data of every source word. The pseudo-code of the improved algorithm appears in Figure 1 below; the numbers in brackets in the following description refer to its code lines. The index is implemented by a hash table where the key is a feature number and the value is the list of indices of the sentences that include that feature (1). After building the index, the similarities are calculated (2-5). The index saves us the need to calculate similarities against the entire training set. Instead, we calculate similarities only for the sentences that share at least one feature with the test sentence: the algorithm loops over the features of the test sentence (3), and for every feature calculates the similarity between the test sentence and the sentences that were hashed into that entry (4-5). This way we save both time and cache memory, since we do not need to load all the training data into the cache.

1  build index I for the training data set
2  for each example X in the test data do
3      for each feature xi in example X
4          for each training example Dj in index entry I[xi]
5              calculate sim(X, Dj)
6      find the K largest scores sim(X, Dj)
7      calculate sim_avg over the K nearest neighbors
8      return sim_avg

Figure 1: Pseudo-code for our kNN classifier algorithm
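For concreteness, the following Python sketch implements the indexed algorithm of Figure 1 (the sparse dict-based feature representation is an assumption for illustration, not the thesis implementation):

from collections import defaultdict
import heapq
import math

def build_index(train_examples):
    # train_examples: one dict per source-word sentence, mapping feature id -> weight.
    # The index maps each feature id to the training sentences containing it (line 1).
    index = defaultdict(list)
    for j, example in enumerate(train_examples):
        for feature in example:
            index[feature].append(j)
    return index

def cosine(x, d):
    dot = sum(w * d.get(f, 0.0) for f, w in x.items())
    norm = math.sqrt(sum(w * w for w in x.values())) * math.sqrt(sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0

def knn_score(test_example, train_examples, index, k=10):
    # Lines 2-8 of Figure 1: average similarity to the k nearest source occurrences.
    candidates = set()
    for feature in test_example:              # only sentences sharing a feature are considered
        candidates.update(index.get(feature, []))
    sims = [cosine(test_example, train_examples[j]) for j in candidates]
    top_k = heapq.nlargest(k, sims)
    return sum(top_k) / len(top_k) if top_k else 0.0

Only training sentences that share at least one feature with the test sentence are ever touched, exactly as in lines 3-5 of the pseudo-code.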
5. Evaluation

5.1 Evaluation measures

As described in Section 4, two different goals may be set for sense matching methods: classification and ranking. To obtain a realistic and comprehensive evaluation of the methods, we used two evaluation measures, one for each goal.

5.1.1 Classification measure

For binary sense matching, and the corresponding lexical substitution setting, the standard WSD metrics (Mihalcea and Edmonds, 2004) are less suitable, because we are interested in the binary decision of whether the target word matches the sense of a given source word. For this reason we decided to adopt an Information Retrieval evaluation scheme, where Precision, Recall and F1 are estimated as follows:

Precision = TruePositive / (TruePositive + FalsePositive)
Recall    = TruePositive / (TruePositive + FalseNegative)
F1        = 2 * Precision * Recall / (Precision + Recall)

In the following section we report micro-averaged results for these measures on our test set.

5.1.2 Ranking measure

This measure is very popular in Information Retrieval (IR) and Question Answering (QA) systems, for which the lexical expansion setting is targeted. It quantifies the system's ability to rank examples for a given source word, preferring a ranking which places correct examples before negative ones. A perfect ranking would place all the positive examples before all the negative examples. Average precision is a common evaluation measure for system rankings, computed as the average of the system's precision values at all points in the ranked list where recall increases (Voorhees and Harman, 1999). In our case, the points where recall increases correspond to positive test examples. More formally, it can be written as follows:

AvgP = (1/R) * Σ_{i=1}^{n} [ E(i) * correct(i) / i ]

where n is the number of examples in the test set, R is the total number of positive examples in the test set, E(i) is 1 if the i-th example is positive and 0 otherwise, correct(i) is the number of positive examples among the i highest-ranked ones, and i ranges over the examples, ordered by their ranking from the highest down. This average precision calculation outputs a value in the 0-1 range, where 1 corresponds to a perfect ranking. The value corresponds to the area under the non-interpolated recall-precision curve for the target word. Mean Average Precision (MAP) is defined as the mean of the average precision values over all test words.
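The computation can be stated compactly in code; the following hypothetical helpers implement the average precision formula above and its mean over test words:

def average_precision(ranked_labels):
    # ranked_labels: 1/0 labels ordered from highest- to lowest-ranked example.
    # Implements AvgP = (1/R) * sum_i E(i) * correct(i)/i.
    R = sum(ranked_labels)
    if R == 0:
        return 0.0
    correct, total = 0, 0.0
    for i, label in enumerate(ranked_labels, start=1):
        if label:                 # recall increases exactly at positive examples
            correct += 1
            total += correct / i  # precision at this point in the ranked list
    return total / R

def mean_average_precision(rankings_per_word):
    # MAP: mean of the average precision values over all test words.
    aps = [average_precision(r) for r in rankings_per_word]
    return sum(aps) / len(aps)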
5.2 Classification measure results

5.2.1 Baselines

Following the Senseval methodology, we evaluated two different baselines, for the unsupervised and supervised methods respectively. The random baseline, used for the unsupervised algorithms, was obtained by choosing either the positive or the negative class at random, resulting in P = 0.262, R = 0.5, F1 = 0.344. The Most Frequent baseline, used for the supervised algorithms, is obtained by assigning the positive class when the percentage of positive examples in the training set is above 50%, resulting in P = 0.65, R = 0.41, F1 = 0.51.

Supervised                Approach    P     R     F1
Most Frequent Baseline                0.65  0.41  0.51
Multiclass SVM            Indirect    0.59  0.63  0.61
Binary SVM (J = 0.5)      Direct      0.80  0.26  0.39
Binary SVM (J = 1)        Direct      0.76  0.46  0.57
Binary SVM (J = 2)        Direct      0.68  0.53  0.60
Binary SVM (J = 3)        Direct      0.69  0.55  0.61

Table 5A: Supervised methods

Unsupervised              Approach    P     R     F1
Random Baseline                       0.26  0.50  0.34
Lesk                      Indirect    0.24  0.19  0.21
One-Class (ν = 0.3)       Direct      0.26  0.72  0.39
One-Class (ν = 0.5)       Direct      0.29  0.56  0.38
One-Class (ν = 0.7)       Direct      0.28  0.36  0.32
One-Class (ν = 0.9)       Direct      0.23  0.10  0.14

Table 5B: Unsupervised methods

Table 5: Classification results on the sense matching task

5.2.2 Supervised Methods

Both the indirect and the direct supervised methods presented in Subsection 4.2 were tested and compared to the Most Frequent baseline.

Indirect. For the indirect methodology we trained the supervised WSD system described in Subsection 4.2 for each target word on the sense-tagged training sample and applied it to the sense matching task. Results are reported in Table 5A. The indirect strategy surpasses the Most Frequent baseline's F1 score, but the achieved precision is still below it. We note that in this multi-class setting it is less straightforward to trade off recall for precision, as all senses compete with each other.

Direct. In the direct supervised setting, sense matching is performed by training a binary classifier, as described in Subsection 4.2. The advantage of adopting a binary classification strategy is that the precision/recall tradeoff can be tuned in a meaningful way. In SVM learning, such tuning is achieved by varying the parameter J, which modifies the cost function of the SVM learning algorithm. If J = 1 (the default), the weight of the positive examples is equal to the weight of the negatives. When J > 1, misclassified positive examples are penalized more heavily (increasing recall), while, whenever 0 < J < 1, misclassified negative examples are penalized more heavily (increasing precision). Results obtained by varying this parameter are reported in Figure 2.

Figure 2: Direct supervised results, varying J

Adopting the standard parameter settings (i.e. J = 1, see Table 3), the F1 of the system is slightly lower than for the indirect approach, while it reaches the indirect figures when J increases. More importantly, reducing J allows us to boost precision towards 100%. This feature is of great interest for lexical substitution, particularly in precision-oriented applications like IR and QA, for filtering irrelevant candidate answers or documents.
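In scikit-learn, a comparable precision/recall tradeoff can be obtained through per-class weights; the analogy to the SVM cost-factor J is our assumption here, and the sketch is illustrative only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def direct_classifier_with_tradeoff(train_contexts, labels, J=1.0):
    # J > 1 makes errors on positive examples costlier (favoring recall);
    # 0 < J < 1 makes errors on negative examples costlier (favoring precision).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_contexts)
    classifier = LinearSVC(class_weight={1: J, 0: 1.0}).fit(X, labels)
    return vectorizer, classifier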
5.2.3 Unsupervised methods

Indirect. To evaluate the indirect unsupervised setting we implemented the Lesk algorithm, described in Subsection 4.3.1, and evaluated it on the sense matching task. The obtained figures, reported in Table 5B, are clearly below the baseline, suggesting that simple unsupervised indirect strategies cannot be used for this task. In fact, the error of the first step, due to the low WSD accuracy of the unsupervised technique, is propagated to the second step, producing poor sense matching. Unfortunately, state-of-the-art unsupervised systems are actually not much better than Lesk on the all-words task (Mihalcea and Edmonds, 2004), discouraging the use of unsupervised indirect methods for the sense matching task.

Direct. Conceptually, the most appealing solution for the sense matching task is the one-class approach proposed for the direct method (Section 4.3.2). As stated in Section 4, in order to perform our experiments we trained a different one-class SVM for each source word, using a sample of its unlabeled occurrences in the BNC as the training set. To avoid huge training sets and to speed up the learning process, we fixed the maximum number of training examples at 10,000 occurrences per word, collecting on average about 6,500 occurrences per word. For each target word in the test sample, we applied the classifier of the corresponding source word. Results for different values of ν are reported in Figure 3 and summarized in Table 5B.

Figure 3: One-class evaluation, varying ν

While the results are somewhat above the baseline, just small improvements in precision are reported, and recall is higher than the baseline for ν < 0.6. Such small improvements may suggest that we are following a relevant direction, even though they may not yet be useful for an applied sense matching setting. Further analysis of the classification results for each word revealed that optimal F1 values are obtained by adopting different values of ν for different words. With the (in retrospect) optimal parameter settings for each word, performance on the test set is noticeably boosted, achieving P = 0.40, R = 0.85 and F1 = 0.54. Finding a principled unsupervised way to automatically tune the ν parameter is thus a promising direction for future work.

Investigating the results per word further, we found that the correlation coefficient between the optimal ν values and the degree of polysemy of the corresponding source words is 0.35. More interestingly, we noticed a negative correlation (r = -0.30) between the achieved F1 and the degree of polysemy of the word, suggesting that polysemous source words provide poor training models for sense matching. This can be explained by observing that polysemous source words can be substituted with the target words only for a strict subset of their senses. On the other hand, our one-class algorithm was trained on all the examples of the source word, which include irrelevant examples that yield noisy training sets. A possible solution may be obtained using clustering-based word sense discrimination methods (Pedersen and Bruce, 1997; Schütze, 1998), in order to train different one-class models from different sense clusters. Overall, the analysis suggests that it may be possible in the future to obtain better binary classifiers based on unlabeled examples of the source word.

As the unsupervised direct approach is the most appealing approach for sense matching, and we have presented two algorithms which implement that approach, we would like to compare their results. For this purpose we evaluated the classification measure of the kNN algorithm as well (although this algorithm is examined mostly by the ranking measure). Since the kNN algorithm ranks the test sentences, we need to set a threshold separating the negative and positive results in order to compute the classification results. Figure 4 shows the Precision, Recall and F1 values for various threshold values, with the cosine similarity metric and k = 10. One can see quite clearly that the kNN yields somewhat better results than the one-class SVM algorithm: the optimal F1 the kNN achieved is 0.42, with a threshold of 0.1, compared to the optimal F1 of the one-class SVM, which is 0.39.

Figure 4: Precision, Recall and F1 of kNN with the cosine metric and k = 10, for various thresholds

5.3 Ranking measure results

Table 6 summarizes the MAP (Mean Average Precision) results for the supervised direct approach and for the kNN-based unsupervised approach, along with a baseline of randomized ranking averaged over 10 runs. The results indicate that the ranking produced by the kNN method (k = 10) outperforms random ranking, while still being substantially lower than supervised performance.

Method                  MAP
Random                  0.36
kNN (Cosine, k = 10)    0.40
Binary SVM (J = 2)      0.60

Table 6: Mean Average Precision results

Figure 5 provides a closer look at the ranking behavior, plotting the macro-averaged recall-precision curves for each method.

Figure 5: Macro-averaged recall-precision curves

The figure indicates that the kNN-based ranking is better than randomized ranking up to the 80% recall point. In particular, in the important high-precision range, of up to about 25% recall, the kNN-based method is better than random by 8-18%. That is, kNN does succeed in giving the highest ranks to positive examples substantially better than random. To the best of our knowledge, this is the first time that such a positive result has been obtained by a method that does not consider any externally provided information at all, be it in the form of labeled examples, a sense repository or sense definitions. We hypothesize that this result can be further improved through better assessment of the similarity between the target test example and the source training data.

When implementing the kNN algorithm for the ranking results, we tried three similarity measures (Cosine, Jaccard and Lin) and different values of k (10, 50 and 100). There was no significant difference between the results of these attempts, but we still find it valuable to show them in Figures 6 and 7: Figure 6 shows the results of kNN with the Cosine metric for different k values, and Figure 7 shows the results of kNN with the different metrics for k = 100.

Figure 6: Results of kNN with different values of k (Cosine metric)

Figure 7: Results of kNN with different similarity metrics (k = 100)
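For reference, the Jaccard measure over our sparse feature representation can be sketched as follows (a hypothetical helper; Lin's (1998) measure additionally requires association weights and is omitted here). It can be plugged into the kNN scorer sketched in Subsection 4.3.2.2 in place of the cosine:

def jaccard(x, d):
    # x and d are sparse feature vectors given as dicts (feature id -> weight);
    # Jaccard compares only the sets of active features, ignoring the weights.
    fx, fd = set(x), set(d)
    union = fx | fd
    return len(fx & fd) / len(union) if union else 0.0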
6. Conclusion and future work

This thesis defined and investigated the novel sense matching task, which directly captures the polysemy problem in lexical substitution. We proposed direct approaches for the task, pointing out the advantages of controlling the precision-recall tradeoff while avoiding the need for an explicitly defined sense repository. Furthermore, we proposed novel types of completely unsupervised learning schemes. To obtain a realistic and comprehensive evaluation of the methods, we used two evaluation measures, classification and ranking, which correspond to the two different goals of the sense matching task. Under both measures the methods yielded better results than the baselines. In particular, positive results for both measures were obtained by the kNN method, which does not require any form of external information. Given these encouraging results, we believe there is great potential for such approaches, to be explored in future research.

We note again that the algorithms we suggested were aimed at handling one case of source-target mismatch, the first case mentioned in the introduction, where the target word has the wrong sense in a given context. The same algorithms could be used with the roles of source and target switched, to handle the second case of mismatch, where the target word was selected according to a wrong sense of the source word.

We focused on the direct unsupervised approach as our goal. Possible future improvements may be obtained, for example, by adding weights to the features, or by creating negative training examples, using occurrences of the target words as negative examples while occurrences of the source words make up the positive ones. This idea needs further research, since it introduces much noise that would have to be handled. Additionally, ideas for other methods came up during the research, such as the automatic clustering of word instances by contexts, i.e., what Schütze (1998) termed sense discrimination: two words would be considered to be used in the same sense if they fall within the same cluster. We hope that the ideas initiated in this research will lead to further work in this area, and to valuable progress on the task of lexical substitution.

7. References

B. E. Boser, I. M. Guyon, and V. N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA.

Sharon A. Caraballo. 1999. Automatic acquisition of a hypernym-labeled noun hierarchy from text. In 37th Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference, pages 120-126.

R. P. Chaves. 2001. WordNet and automated text summarization.
In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium (NLPRS-01), Tokyo, Japan.

Christopher J. C. Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167.

Ido Dagan. 2000. Contextual word similarity. In Rob Dale, Hermann Moisl and Harold Somers, editors, Handbook of Natural Language Processing, chapter 19, pages 459-476. Marcel Dekker.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

Ido Dagan, Oren Glickman, Alfio Gliozzo, Efrat Marmorshtein and Carlo Strapparava. 2006. Direct word sense matching for lexical substitution. In Proceedings of COLING-ACL 2006.

Ido Dagan, Shaul Marcus and Shaul Markovitch. 1995. Contextual word similarity and estimation from sparse data. Computer, Speech and Language, 9:123-152.

Belur V. Dasarathy. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press.

Phillip William Dixon, David Corne and Martin J. Oates. 2003. Replacing generality with coverage for improved learning classifier systems. In HIS, pages 185-193.

C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. 1998. Indexing with WordNet synsets can improve text retrieval. In ACL, Montreal, Canada.

S. Flank. 1998. A layered approach to NLP-based information retrieval. In Proceedings of the ACL/COLING Conference, Montreal, Canada.

Caroline Gasperin and Renata Vieira. 2004. Using word similarity lists for resolving indirect anaphora. In Proceedings of the ACL-04 Workshop on Reference Resolution, Barcelona, Spain, July.

Susan Gauch, J. Wang and S. Mahesh Rachakonda. 1999. A corpus analysis approach for automatic query expansion and its extension to multiple databases. ACM Transactions on Information Systems, 17(3).

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.

Sanda M. Harabagiu, Dan I. Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan C. Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. Falcon: Boosting knowledge for answer engines. In Text REtrieval Conference.

Eduard H. Hovy, Ulf Hermjakob, and Chin-Yew Lin. 2001. The use of external knowledge in factoid QA. In Text REtrieval Conference.

T. Joachims. 1999. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, chapter 11, pages 169-184. MIT Press.

Lillian Lee. 1997. Similarity-Based Approaches to Natural Language Processing. Ph.D. thesis, Harvard University, Cambridge, MA.

M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the ACM-SIGDOC Conference, Toronto, Canada.

L. Li et al. 2001. Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17:1131-1142.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL 98, pages 768-774, Montreal, Canada. Association for Computational Linguistics.
Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Automatic identification of infrequent word senses. In Proceedings of COLING 2004, pages 1220-1226.

Diana McCarthy. 2002. Lexical substitution as a task for WSD evaluation. In Proceedings of the ACL-02 Workshop on Word Sense Disambiguation, pages 109-115, Morristown, NJ, USA. Association for Computational Linguistics.

R. Mihalcea and P. Edmonds, editors. 2004. Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July.

R. Mihalcea and D. Moldovan. 2000. Semantic indexing using WordNet senses. In Proceedings of the ACL Workshop on IR and NLP.

D. Moldovan and R. Mihalcea. 2000. Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1):34-43, January.

M. Negri. 2004. Sense-based blind relevance feedback for question answering. In SIGIR-2004 Workshop on Information Retrieval for Question Answering (IR4QA), Sheffield, UK, July.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of Human Language Technology / North American Chapter of the Association for Computational Linguistics (HLT/NAACL-04), pages 321-328, Boston, MA.

T. Pedersen and R. Bruce. 1997. Distinguishing word senses in untagged text. In EMNLP, Providence, August.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In SIGIR, Dublin, Ireland, June.

Gerda Ruge. 1992. Experiments on linguistically-based term associations. Information Processing & Management, 28(3):317-332.

G. Salton and M. J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw Hill.

S. Scott and S. Matwin. 1998. Text classification using WordNet hypernyms. In Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada.

B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443-1471.

G. Shakhnarovich, T. Darrell, and P. Indyk, editors. 2005. Nearest-Neighbor Methods in Learning and Vision. The MIT Press.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124.

H. Schütze and J. Pedersen. 1995. Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas.

E. Voorhees and D. Harman, editors. 1999. Proceedings of the Seventh Text REtrieval Conference (TREC-7), Gaithersburg, MD, USA, July. NIST Special Publication.

E. Voorhees. 1993. Using WordNet to disambiguate word sense for text retrieval. In SIGIR, Pittsburgh, PA.

E. Voorhees. 1994. Query expansion using lexical semantic relations. In Proceedings of the 17th ACM SIGIR Conference, Dublin, Ireland, June.

Julie Weeds, D. Weir, and D. McCarthy. 2004. Characterizing measures of lexical distributional similarity. In Proceedings of COLING 2004, Switzerland, July.

Y. Yang and J. O. Pedersen. 1997. A comparative study on feature selection in text categorization. In International Conference on Machine Learning (ICML).

D. Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In ACL, pages 88-95, Las Cruces, New Mexico.