From: AAAI Technical Report FS-92-04. Copyright © 1992, AAAI (www.aaai.org). All rights reserved.

Statistically-Guided Word Sense Disambiguation (1)

Elizabeth D. Liddy & Woojin Paik
School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
315-443-2911
liddy@mailbox.syr.edu
wpaik@mailbox.syr.edu

Abstract

Within the field of Natural Language Processing, lexical disambiguation remains one of the toughest hurdles to overcome in the development of fully operational systems. As part of a larger document detection system (DR-LINK), we have implemented a computational approximation of word sense disambiguation by combining information from a machine-readable dictionary, local context, and corpus statistics. We use the Subject-Field Codes (SFCs) extracted from a machine-readable dictionary to produce a preliminary, multi-tagged semantic coding of words in a text. We then apply local heuristics that evaluate the SFCs of ambiguous words to choose among the multiple SFCs. Choices which cannot be made using local heuristics are resolved by statistical evidence, namely, an SFC correlation matrix that was generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words. The implementation was tested on a sample of 1638 words from the WSJ and selected the correct SFC 89% of the time. The resultant, disambiguated SFC frequencies are summed and normalized to produce a weighted semantic vector representation of each text. These SFC vectors provide the basis on which the system automatically classifies texts as the first stage in DR-LINK.

The Disambiguation Problem

NLP systems take naturally occurring text and create a representation of the meaning of the text that will be used to accomplish the specific task of the system, be it machine translation, document detection, question-answering, knowledge extraction, or information retrieval.
Lexical ambiguity has been a major stumbling block in the development of real-world NLP systems for all these applications due to the fact that a single word may have more than one meaning. According to Gentner (1981), the twenty most frequent nouns in English have an average of 7.3 senses each, while the twenty most frequent verbs have an average of 12.4 senses each. As a result, when attempting to represent a word which has multiple senses, an NLP system must either produce multiple representations for that word or select one sense from amongst the possible choices included in the system's lexicon. The process of selecting from amongst a word's possible senses is referred to as semantic lexical disambiguation. Research into human lexical access and disambiguation has been very active in recent years. Small, Cottrell & Tanenhaus (1988) provide a substantive reader on research on lexical disambiguation from the various fields within cognitive science. In principle, we agree with Small, Cottrell & Tanenhaus (1988) that "in order to resolve ambiguity, an NLU (human or otherwise) has to take into account sources of knowledge". Given that there is no current single theory as to the exact nature of and interaction amongst these sources which can account for all the experimental results in lexical disambiguation, we agree with Prather & Swinney (1988) that there will be "no uniform, invariant solution to lexical ambiguity resolution".
Consequently, we interpret the empirical psycholinguistic results as suggesting that there are three sources of influence on the human disambiguation process:

Local context - the sentence containing the ambiguous word restricts the interpretation of ambiguous words

Domain knowledge - the recognition that a text is concerned with a particular domain activates only the senses appropriate to that domain

Frequency data - the frequency of each sense's general usage affects its accessibility

(1) Support for this research was provided by DARPA under the auspices of the TIPSTER Project.

We have attempted to computationally replicate these three knowledge sources in our disambiguator. Figure 1 provides a mapping from the sources suggested in the psycholinguistic literature as being used by humans to the sources used by the disambiguator in DR-LINK. Each of the DR-LINK sources will be described in later sections of this paper.

Humans                 DR-LINK
local context          unique or high-frequency SFC within a sentence
domain knowledge       SFC correlation matrix
frequency of usage     order of senses in LDOCE

Fig. 1: Sources influencing human and automatic (DR-LINK) disambiguation

We consider the 'uniquely assigned' and 'high-frequency' SFCs of words within a single sentence as providing the local context which activates the correct SFC for an ambiguous word. The SFC correlation matrix, which is based on a large sample of texts of the same type as the text being disambiguated, equates to the domain knowledge (WSJ topics) that is called upon for disambiguation if the local context does not resolve the ambiguity. And finally, the ordering of SFCs in LDOCE replicates the frequency-of-use criterion suggested by Hogaboam & Perfetti (1975), who said that: "The order of search is determined by frequency of usage, the most frequent being first. The search is self-terminating, so that as soon as an acceptable match occurs, no [other] entries will be checked".
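The self-terminating, frequency-ordered search that Hogaboam & Perfetti describe can be sketched as follows; this is a minimal illustration, and the function name and the notion of an 'acceptable' predicate are ours, not part of DR-LINK:

```python
def first_acceptable_sense(senses, acceptable):
    """Search senses most-frequent-first and stop at the first match.

    `senses` is assumed to be ordered by frequency of usage, as the
    senses in an LDOCE entry are; the search is self-terminating, so
    no later entries are checked once a match is found.
    """
    for sense in senses:
        if acceptable(sense):
            return sense
    return None  # no acceptable sense; the caller must fall back
```

For instance, with the senses of 'bank' ordered most-frequent-first, a context predicate that accepts only the second sense causes the search to skip the first and stop there.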
We implement the computational disambiguation process by moving in stages from the more local levels to the more global types of disambiguation, using these sources of information to guide the disambiguation process. The work is unique in that it successfully combines large-scale statistical evidence with the more commonly espoused local heuristics. The document detection task of DR-LINK does not require the precise disambiguation of every word in text, but it does need to disambiguate to the point of knowing which of the SFCs that have been assigned to a word's multiple senses is correct. In some cases this process equates to sense disambiguation. In other cases, the SFC selected by the system may be attached to more than one sense, so in that case our system narrows the choice of senses to those which are in the appropriate semantic domain.

Semantic Representation of Words in DR-LINK

We use the Subject Field Codes from Longman's Dictionary of Contemporary English (LDOCE) as the semantic representation of a text's contents. LDOCE is a British-produced learner's dictionary that has been used in a number of investigations into natural language processing applications (Boguraev & Briscoe, 1989) using the first edition (1978) of the dictionary. We are using the second edition (1987) and began working directly from the typesetters' tape, which we have cleaned up during related research into the automatic extraction of semantic relations from dictionary definitions (Liddy & Paik, 1991) and converted into a lexical database. The 1987 edition of LDOCE contains 35,899 headwords and 53,838 senses, for an average of 1.499 senses per headword. The machine-readable tape of LDOCE contains several fields of information not visible in the hard-copy version, but which are extremely useful in natural language processing tasks.
Some of these are relevant for syntactic processing of text, such as subcategorization codes, while others contain semantic information, such as the Box Codes, which indicate the class of entities to which a noun belongs or the semantic constraints for the arguments of a verb or an adjective, and the Subject Codes. The Subject Codes are based on a classification scheme of 124 major fields and 250 sub-fields. Subject Codes are manually assigned to words in LDOCE by the Longman lexicographers. There are two types of problems, however, with the Subject Code assignments which become obvious when an attempt is made to use them computationally. First, a particular word may function as more than one part of speech and each word may also have more than one sense, and each of these entries and/or senses may be assigned different Subject Codes. The entries for 'acid' in Figure 2 are taken from the LDOCE tape and demonstrate a fairly simple example of this problem.

HEADWORD   PART-OF-SPEECH   SUBJECT FIELDS
acid       noun             SIzc [Science, chemistry]; [Drugs (not pharmaceutical)]
acid       adjective        FOzc [Food, cookery]; XX [General]

Fig. 2: LDOCE entry with Multiple Parts of Speech and SFCs

If an NLP system cannot ascertain either the grammatical function or sense of a word in the text being processed, all Subject Codes for all entries for an orthographic form must be considered. Our system incorporates automatic means for choosing amongst the LDOCE syntactic categories and choosing amongst the senses, thereby limiting which Subject Codes are assigned to each word in a given text. There is also the possibility that no Subject Code has been assigned to a word or any of its individual senses. Of the 53,838 senses in LDOCE '87, 51,383 or 95% have Subject Codes. Of these, however, 27,273 are coded XX for the General class and therefore provide no useful semantic information.
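A minimal sketch of how such entries can be held in a lexical database keyed by headword and part of speech; the in-memory layout and the short code spellings here are our assumptions, not the format of the LDOCE tape:

```python
# Toy stand-in for the LDOCE-derived lexical database; each
# (headword, part-of-speech) entry lists the SFC of each sense.
# The code names "SI", "MD", "FO" are assumed abbreviations.
lexicon = {
    ("acid", "noun"): ["SI", "MD"],        # Science/chemistry; Drugs
    ("acid", "adjective"): ["FO", "XX"],   # Food/cookery; General
}

def candidate_sfcs(word, pos=None):
    """Without a part-of-speech tag, every entry's codes must be
    considered; a tag (e.g. from a tagger) narrows the candidates."""
    if pos is not None:
        return list(lexicon.get((word, pos), []))
    return [code for (w, _), codes in lexicon.items() if w == word
            for code in codes]
```

With a part-of-speech tag the candidate set drops from four codes to two, which is exactly the narrowing the paper attributes to Stage 1 tagging.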
The absence of Subject Codes or the presence of only the General class code poses a problem when word-by-word disambiguation is desired, but when the task is to arrive at a summary semantic representation of the text, the law of large numbers appears to take over. For although only 24,110 senses (45%) in LDOCE have the more informative, domain-specific codes, this has proven sufficient for the task of text classification. In the future, we will investigate ways in which we can use sentence context and the correlation matrix to suggest appropriate SFCs for those words that do not have SFCs in LDOCE. However, the cases of multiple codes assigned to a word impact more immediately on our attempts at classification, since the most frequently used words in our language tend to have many senses and therefore multiple Subject Codes.

Other Work Using Subject Codes

Walker and Amsler, who were the first to make use of the domain information represented by Subject Codes, have reported on a somewhat similar attempt to utilize the Subject Codes to determine the subject domains for a set of texts (1986). However, they used the most frequent Subject Code to characterize a document's content, whereas we represent a document by a vector of frequencies of Subject Codes for words in that text. We find that our research efforts strongly support the suggestions made by Walker and Amsler concerning ways to refine the representation of text using Subject Codes. Slator (1991) has taken the original 124 Subject Codes and added an additional layer of seven pragmatic classes to the original two-level hierarchy. These are communication, economics, entertainment, household, politics, science and transportation. He has found the reconstructed hierarchy useful when attempting to disambiguate multiple senses and Subject Codes attached to words. His metric for preferring one sense over another relies on text-specific values, whereas we add corpus correlation values as a further stage in the disambiguation process.
Krovetz (1991) is exploring the effect of combining the evidence from Subject Codes with evidence from morphology, part of speech, subcategorization and semantic restrictions for selection of the correct sense. His goal is to represent documents by their appropriate word senses rather than just their orthographic forms for use in an information retrieval system.

The Disambiguation Process

The following stages of processing are done, a sentence at a time, to generate vector representations of each document. In Stage 1 processing, we run the documents and query through POST, a probabilistic part-of-speech tagger (Meteer et al., 1991). We use POST to limit the SFCs of a word to those of the appropriate syntactic category of each word as determined by POST. The inclusion of POST has reduced the number of SFCs that need to be further considered for sense disambiguation by an average of 60%. Stage 2 processing consists of retrieving SFCs for each word's correct part of speech from the lexical database. Stage 2 also does some useful pre-processing before the system actually assigns the SFCs. For example, the SFC retrieval process utilizes the weak stemming algorithm of Kelly & Stone (1975), which removes only inflectional endings. This algorithm has proven to be a stemmer of the appropriate strength for processing newspaper text for lexical look-up in a machine-readable dictionary such as LDOCE. Additionally, as a special case warranted by the newspaper text-type, if no hyphenated word can be found in the lexical database for a word that is hyphenated in the text, the system removes the hyphen and searches for the conjoined result in the lexical database. If it is not found, the system re-separates the words, assigns a part of speech to each composite part using POST, and then looks these two words up in the lexical database.
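The hyphenation fallback in Stage 2 can be sketched roughly as below; the lexicon is modeled as a simple set of headwords and the part-of-speech tagging step is elided, both of which are simplifications of the actual system:

```python
def hyphen_lookup(word, lexicon):
    """Stage 2 fallback for hyphenated newspaper forms: try the word
    as written, then the conjoined form, then the re-separated parts
    (which the real system would first tag with POST)."""
    if word in lexicon:                      # hyphenated form is listed
        return [word]
    joined = word.replace("-", "")
    if joined in lexicon:                    # try the conjoined result
        return [joined]
    return [part for part in word.split("-") if part in lexicon]
```

So "take-over" resolves to the conjoined headword "takeover" if the hyphenated form is absent, while "junk-bond" falls through to its two component headwords.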
During this stage, all functional parts of speech (articles, conjunctions, prepositions and pronouns) are eliminated from further processing. Our preliminary tests demonstrated that such frequently occurring function words' SFCs can outnumber the SFCs of the more substantive content words when summed across a document, and really distort the resulting SFC vector representations of the text's content. At Stage 3 we begin sense disambiguation, using local sentence-level context-heuristics. Intellectual analysis of manual disambiguation of text generated the original hypothesis of our work, namely, that unique SFCs and high-frequency SFCs are good local determinants of the subject domain of the sentence. We begin with context-heuristics because empirical results have shown that local context is used successfully by humans for sense disambiguation (Choueka & Lusignan, 1985), and context-heuristics have been experimentally tested in Walker & Amsler's (1986) and Slator's (1991) work with promising results. The input to Stage 3 is a word, its part-of-speech tag, and the SFCs of each sense of that grammatical category. For some words, no disambiguation may be necessary at this stage; however, for the majority of words in each sentence there are multiple SFCs, so the input would be as seen in Figure 3.

Sentence: "State companies employ about one billion people."

State      (n)    ORDERS, POLITICAL SCIENCE
companies  (n)    BUSINESS, MUSIC, THEATER
employ     (v)    LABOR, BUSINESS
billion    (adj)  NUMBERS
people     (n)    ANTHROPOLOGY, SOCIOLOGY, POLITICAL SCIENCE

Fig. 3: Subject Field Codes (with within-sentence frequencies shown in superscript in the original) for words as one part of speech

To select a single SFC for each word in a sentence, Stage 3 uses an ordered set of heuristics. First, the SFCs attached to all words in a sentence are evaluated to determine at the sentence level: 1) whether any words have only one SFC assigned to all senses of the word; 2) the SFCs which are highly frequently assigned across all words in the sentence.
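These two sentence-level checks can be sketched as follows; the input format and the names are our assumptions, not DR-LINK interfaces:

```python
from collections import Counter

def sentence_determined_sfcs(word_sfcs, threshold=3):
    """Step one of Stage 3: gather the SFCs that serve as local
    context.  `word_sfcs` holds one list of candidate SFCs per
    content word in the sentence.
    """
    # 1) SFCs assigned to every sense of some word (unique SFCs)
    unique = {sfcs[0] for sfcs in word_sfcs if len(set(sfcs)) == 1}
    # 2) SFCs whose within-sentence frequency meets the threshold
    counts = Counter(code for sfcs in word_sfcs for code in sfcs)
    frequent = {code for code, n in counts.items() if n >= threshold}
    return unique, frequent
```

A word whose senses all carry one code contributes a unique SFC; a code assigned three or more times across the sentence's words becomes a high-frequency SFC.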
Each sentence may have more than one unique SFC, as there may be more than one word whose senses have all been assigned a single SFC. In Figure 3, NUMBERS is a unique SFC, being the only SFC assigned to the word 'billion', and POLITICAL SCIENCE is the most frequently assigned SFC for this sentence. We have established the criterion that if no SFC has a frequency equal to or greater than three, we do not select a frequency-based SFC for that particular sentence. Preliminary test results show that SFCs with a within-sentence frequency less than three do not accurately represent the domain of the sentence. The second step in Stage 3 evaluates the remaining words in the sentence and chooses a single SFC for each word based on the locally-important SFCs determined in step one. The system scans the SFCs of each remaining word to determine whether the SFCs which have been identified as unique or high-frequency appear amongst the multiple SFCs assigned to each word by LDOCE. In Figure 3, for example, POLITICAL SCIENCE would be selected as the appropriate SFC for both 'state' and 'people' because POLITICAL SCIENCE was determined in step 1 to be a high-frequency SFC value for the sentence. Stage 4 incorporates two global knowledge sources to complete the disambiguation task begun in Stage 3. The primary source is a 122 x 122 correlation matrix computed from the SFC frequencies of the 442,059 words that occurred in a sample of 977 WSJ articles. The matrix, therefore, reflects stable estimates of SFCs which co-occur within documents of this text-type. Although there are actually 124 SFCs used in LDOCE, we chose not to include XX (GENERAL) and CS (CLOSED SYSTEM PART OF SPEECH) in the matrix construction, due to the non-substantive nature of XX and CS.
The second source is the order in which the senses of a word are listed in LDOCE. Since the ordering of senses in LDOCE is determined by Longman's lexicographers based on frequency of use in the English language, we equate the ordering of senses to the notion of frequency of usage suggested in the psycholinguistic literature as influencing human disambiguation. The correlation matrix was computed with SAS using the SFC output of the 977 WSJ articles that were processed through Stages 1 and 2 as described above. That is, each article was represented by a vector of the SFCs of the senses of the correct part of speech of each word as determined by POST. For the matrix calculations, the observation unit was the article and the variables being correlated were the 122 SFCs. The scores for the variables are the within-document frequencies of each SFC. There are 255,709 scores across the 977 articles on which the matrix is computed. The resulting values in the 122 x 122 matrix are the Pearson product-moment correlation coefficients between SFCs and range from +1 to -1, with 0 indicating no relationship between the SFCs. The correlations represent the tendency of a particular SFC to co-occur with each other SFC in a WSJ article. The output matrix is consulted during Stage 4 processing to determine the correlation coefficient between two SFCs and serves as the more global, domain-level data on which we select one SFC for each word that was not disambiguated in Stage 3. The correlation coefficients are quite intuitively reasonable, as can be seen in Figure 4, where the ten highest correlations are listed. The unexpected correlation between LAW and BUILDING is due to the highly frequent usage of the word 'court', which has SFCs for both LAW and BUILDING and contributes greatly to the high correlation between the two SFCs.
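The matrix computation, Pearson product-moment correlations between SFC columns with articles as observation units, can be reproduced in miniature; the frequency table below is fabricated for illustration, while the real matrix is 122 x 122 over 977 articles:

```python
import numpy as np

# Rows = articles (observation units); columns = SFCs (variables).
# Scores are within-document SFC frequencies, as in DR-LINK.
sfc_freq = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 3.0, 1.0],
    [0.0, 1.0, 6.0],
    [1.0, 0.0, 4.0],
])

# Pearson product-moment correlation coefficients between SFCs;
# values range from +1 to -1, with 0 indicating no relationship.
corr = np.corrcoef(sfc_freq, rowvar=False)
```

`rowvar=False` tells NumPy that the variables (SFCs) are the columns, matching the paper's setup of articles as observations.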
Coefficient   SFC-1           SFC-2
.91314        COURT GAMES     NET GAMES
.80801        BUSINESS        ECONOMICS
.73958        LAW             SOCIOLOGY
.73654        ENTERTAINMENT   THEATER
.72199        MUSIC           THEATER
.71428        AGRICULTURE     PLANT NAMES
.70844        AGRICULTURE     ANIMAL HUSBANDRY
.70271        BUSINESS        AGRICULTURE
.68600        BUILDING        LAW
.68177        CARD GAMES      GAMBLING

Fig. 4: Highest Correlations Between SFCs Based on 255,709 SFC Frequencies from 977 WSJ Articles

In Stage 4, one ambiguous word at a time is resolved, accessing the matrix via the unique and high-frequency SFCs determined for a sentence in Stage 3. The system evaluates the correlation coefficients between the unique/frequent SFCs of the sentence and the multiple SFCs assigned to the word being disambiguated in order to determine which of the multiple SFCs has the highest correlation with the unique and/or high-frequency SFCs. The system then selects that SFC as the unambiguous representation of the sense of the word. We have developed heuristics for three separate cases for selecting a single SFC for a word using the correlation matrix. The three cases function better than handling all instances as a single case because of the special treatment needed for words with the less-substantive GENERAL (XX) or CLOSED SYSTEM PART OF SPEECH (CS) codes. For the two cases where there are XX or CS amongst the SFCs for the word being disambiguated, we take the order of the SFCs into consideration, reflecting the fact that the first SFC listed is more likely to be correct, since the most widely used sense is listed first in LDOCE. So to overcome this likelihood, a more substantive SFC listed later in the entry must have a higher correlation with the sentence-determined SFCs.
To clarify the following description of the disambiguation procedure, we refer to the ambiguous word's multiple SFCs as word-attached SFCs, and to the unique and high-frequency SFCs that were established at the sentence level in Stage 3 as sentence-determined SFCs.

Case 1 - Words with no XX or CS SFCs:
If any word-attached SFC has a correlation greater than .6 with any one of the sentence-determined SFCs, select that word-attached SFC. If no word-attached SFC has such a correlation, average the correlations between each word-attached SFC and the sentence-determined SFCs, and select the word-attached SFC with the highest average correlation.

Case 2 - Words with XX or CS listed first in the LDOCE entry:
Select the XX or CS unless a more substantive SFC further down the list of senses has a correlation with the sentence-determined SFCs greater than .6.

Case 3 - Words where XX or CS is not the first listed SFC in the LDOCE entry:
Choose the more substantive SFC which occurs before XX or CS if it has a correlation greater than .4.

Figure 5 presents a sentence which illustrates how the heuristics use the correlation matrix values to select SFCs. For this sentence, BEAUTY CULTURE, CALENDAR, and ECONOMICS were selected at Stage 3 as unique SFCs, based on each being the sole SFC assigned to 'cosmetics', 'November', and 'financing' respectively. Therefore, when an SFC needs to be selected for 'giant', Case 3 says that the more substantive SFC (LITERATURE) must have a correlation greater than .4 with a unique SFC in order to be selected. Since all three correlations are less than .4, the GENERAL SFC (XX) is correctly chosen. In the case of 'junk', since GENERAL is listed first, either DRUGS or NAUTICAL must have a correlation coefficient of at least .6 to be selected over GENERAL. Since neither does, the correct choice of GENERAL is made. For the case of 'floated', the same logic applies, but since BUSINESS has a correlation of .808, BUSINESS is selected over the first-occurring GENERAL.
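The three cases can be sketched as a single selection function; the list of word-attached SFCs and the correlation-lookup interface are our assumptions, not DR-LINK's actual data structures:

```python
def select_by_matrix(word_sfcs, corr_with_sentence):
    """Sketch of the three Stage 4 cases.  `word_sfcs` is the word's
    LDOCE-ordered list of attached SFCs; `corr_with_sentence(sfc)`
    returns that SFC's correlations with each sentence-determined SFC.
    """
    NONSUB = {"XX", "CS"}  # GENERAL and CLOSED SYSTEM PART OF SPEECH

    def strongest(sfc):
        return max(corr_with_sentence(sfc), default=0.0)

    if not any(c in NONSUB for c in word_sfcs):
        # Case 1: take any SFC correlating above .6; otherwise the
        # SFC whose average correlation is highest.
        for c in word_sfcs:
            if strongest(c) > 0.6:
                return c
        def average(sfc):
            corrs = corr_with_sentence(sfc)
            return sum(corrs) / len(corrs) if corrs else 0.0
        return max(word_sfcs, key=average)

    if word_sfcs[0] in NONSUB:
        # Case 2: keep the first-listed XX/CS unless a substantive
        # SFC further down correlates above .6.
        for c in word_sfcs:
            if c not in NONSUB and strongest(c) > 0.6:
                return c
        return word_sfcs[0]

    # Case 3: prefer a substantive SFC listed before XX/CS when it
    # correlates above .4; otherwise fall back to the XX/CS.
    cut = next(i for i, c in enumerate(word_sfcs) if c in NONSUB)
    for c in word_sfcs[:cut]:
        if strongest(c) > 0.4:
            return c
    return word_sfcs[cut]
```

Run against the worked correlations for 'giant', 'junk', and 'floated' from the example sentence, this sketch reproduces the paper's selections.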
WORD (POS): SFCs                       CORRELATION WITH UNIQUE SFCs
He (pron): GENERAL
acquired (v)
the (det)
cosmetics (n): BEAUTY CULTURE
giant (n): LITERATURE, GENERAL         LITERATURE-BEAUTY CULTURE: .328
                                       LITERATURE-CALENDAR: .334
                                       LITERATURE-ECONOMICS: .162
                                       -> Select GENERAL by Case 3
in (prep)
November (n): CALENDAR
financing (v): ECONOMICS
the (det) transaction (n): GENERAL
with (prep)
junk (n): GENERAL, DRUGS, NAUTICAL     DRUGS-BEAUTY CULTURE: .317
                                       DRUGS-CALENDAR: .329
                                       DRUGS-ECONOMICS: .269
                                       NAUTICAL-BEAUTY CULTURE: .376
                                       NAUTICAL-CALENDAR: .434
                                       NAUTICAL-ECONOMICS: .378
                                       -> Select GENERAL by Case 2
bonds (n): GENERAL
floated (v): GENERAL, BUSINESS         BUSINESS-BEAUTY CULTURE: .376
                                       BUSINESS-CALENDAR: .545
                                       BUSINESS-ECONOMICS: .808
                                       -> Select BUSINESS by Case 2
by (prep) DBL, Inc. (prop)

Fig. 5: Sample sentence exemplifying correlation-matrix heuristics

For those sentences which contain neither a unique nor a high-frequency SFC, the SFC which occurs first in the LDOCE ordering of senses is selected.

Testing of Disambiguation Procedures

We tested our SFC disambiguation procedures on a sample of 166 sentences, containing 1638 non-function words which had SFCs in LDOCE. The sentences comprised a set of 12 randomly selected WSJ articles. The system implementation of the disambiguation procedures was run and a single SFC was selected for each word. These SFCs were compared to the sense selections made by an independent judge, who was instructed to read the sentences and the definitions of the senses of each word and then to select that sense of the word which was most correct. Figure 6 summarizes the overall results (att. = attempts, cor. = correct), presented according to the main source of knowledge used in the disambiguation process.

                      att.   cor.   %
Local Heuristics      1134   1032   91
Domain Correlations    268    206   77
Frequency Ordering     236    219   93
Total                 1638   1457   89

Fig. 6: SFC Disambiguation Results Using Multiple Sources of Knowledge on 1638 Words

Analysis of Results

After noting the fact that these results are indeed quite good, what is most obvious in the summary results (Fig. 6) is that disambiguation based on both local heuristics and frequency ordering is much better than disambiguation based on domain correlations. Although this reflects negatively on the role of probabilistic knowledge in disambiguation, the positive view of these results is that they are quite reasonable for a first attempt at using corpus-based correlations and, in fact, the quality of the information in this source can be improved. We will present more detailed results for the three main sources, along with an error analysis of the poorer results and indications of how we are currently incorporating these insights into adjustments to the knowledge sources and disambiguation processes in order to produce an improved version of the system. First, we will micro-analyze the results of using Local Heuristics to select an SFC. Local Heuristics use either unique SFCs or high-frequency SFCs as determined within each sentence to both: 1) self-select the SFC for a word which has one and only one (unique) SFC assigned to it, and; 2) use the unique SFC assigned to one word in the sentence to select from amongst the multiple SFCs assigned to another word which has no unique SFC of its own. In addition, SFCs are summed across all words in the sentence and those with frequency greater than 3 are used to select an SFC for an ambiguous word which has a high-frequency SFC amongst its multiple SFCs. This is the process explained earlier in Stage 3 of the disambiguation process. As seen in Figure 7, when a word has only one SFC assigned to it, the selection by default must be correct. Therefore, the great majority of the selections are correct.
However, it appears that the reliance on an SFC with frequency greater than three as a good indicator of local context is inappropriate, producing only 65% correct results. In fact, when analyzing a sample of the errors made by relying on frequency data, we discovered that rather than considering all SFCs with a frequency greater than 3 as good indicators of local context, our results would be improved in 36% of the cases if we ranked the SFCs according to their frequency and then applied them in that order. In another 36% of the cases, our results would be improved if we raised the frequency threshold from 3 to 4. The remaining 28% of the errors are not directly attributable to any single bias.

                         att.   cor.    %
Uniquely Assigned SFC     819    819   100
Selected by Unique SFC    150    106    71
SFC Frequency > 3         165    107    65
Total                    1134   1032    91

Fig. 7: Micro-analysis of Disambiguation Results based on Local Heuristics

The micro-analysis of the results based on domain correlations is presented in Figure 8. The major source of error here can be attributed to the need to recompute the correlation matrix. The results of our first effort at using statistical information for sense disambiguation can now be used as feedback. After we make the corrections to the heuristics learnt from our micro-analysis, we will run the disambiguator on a new sample of 1000 WSJ articles. The disambiguated SFC vector representations of those articles can then be used to re-compute the correlation matrix, so that when the disambiguator is run on the next sample set of sentences, the values selected from the matrix will be more accurate. Simply recomputing the matrix has the potential for correcting 62% of the errors in the case when XX is not the first SFC in the ordering of senses for a word, which is where disambiguation based on domain correlations is performing worst.
Another cause of error in the case where XX is not the first listed SFC is the SFCs assigned to proper nouns, whose frequent occurrence is a problem when processing newspaper text. In the initial version of the system as reported in this paper, proper nouns were processed the same as common nouns, producing some humorous SFC tags, such as: Bush = HORTICULTURE, Carter = OCCUPATIONS, Baton Rouge, LA = MAGIC + COSMETICS + MUSIC. In computing our experimental results, we did not include the explicit tagging of proper nouns because proper nouns will not be processed in the second version of the system now being implemented. The category of proper noun will become one of the grammatical categories for which the system does not retrieve SFCs. However, we cannot remove the implicit effect of proper noun tagging from the current results, since their tags were included when determining the high-frequency SFCs and when selecting unique SFCs, which are then used by the correlation matrix heuristics. Therefore, the SFCs of proper nouns did impact, in most instances negatively, on the selection of SFCs for ambiguous words, but this source of error will be eliminated from the next version of the system. In results attributable to Case 1 of Stage 4 processing, when there is no XX amongst the SFCs assigned to a word, 50% of the errors have the potential for improvement using a re-computed matrix. An additional source of error in this set is the averaging rule, which comes into play when no single word-attached SFC has a correlation greater than .6 with one of the sentence-determined SFCs. When this occurs, all the correlations between word-attached SFCs and each sentence-determined SFC are averaged and the word-attached SFC with the highest average correlation is chosen.
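Dempster's rule of combination (Shafer, 1976) offers an alternative to this averaging rule; a minimal sketch restricted to singleton hypotheses plus the whole frame (keyed "THETA") follows, where the encoding of correlation evidence as mass functions and all names are our own, not the paper's:

```python
def dempster_combine(m1, m2):
    """Dempster's rule for mass functions over singleton hypotheses
    plus the frame of discernment ("THETA").  Mass on conflicting
    singletons is discarded and the rest renormalized, so strong
    evidence is not diluted by weak evidence as averaging dilutes it.
    """
    combined = {}
    conflict = 0.0
    for a, x in m1.items():
        for b, y in m2.items():
            if a == "THETA":
                key = b                 # THETA ∩ b = b
            elif b == "THETA" or a == b:
                key = a                 # a ∩ THETA = a; a ∩ a = a
            else:
                conflict += x * y       # distinct singletons conflict
                continue
            combined[key] = combined.get(key, 0.0) + x * y
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}
```

Combining two pieces of evidence that both favor hypothesis A yields a belief in A higher than either source alone, whereas averaging the two would sit between them.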
Our analysis shows that this averaging diminishes the effect of what actually is the correct SFC. Therefore, in our next implementation, we will use the Dempster-Shafer formula (Shafer, 1976), which combines multiple sources of evidence (correlations with various sentence-determined SFCs) in a manner that does not diminish the strongest evidence by averaging it with the weaker evidence.

                       att.   cor.   %
XX is first SFC         188    161   86
XX is not first SFC      16     10   62
No XX in SFCs            64     35   55
Total                   268    206   77

Fig. 8: Micro-analysis of Disambiguation Results based on Domain Correlations

Finally, in reviewing the results which depend on the ordering of senses in LDOCE (based on their usage in the language as determined by LDOCE lexicographers), we currently do not see a way to improve the results in Figure 9. These data represent those sentences in which no unique or highly-frequent SFC occurs, so there are no SFCs with which to access the matrix, and the only knowledge source for disambiguation is the simple ordering of SFCs in LDOCE. These cases appear to us to occur in those sentences which in fact are very unsubstantive, being filled with many words having only quite general meanings. Fortunately, the disambiguation results in these cases are quite reasonable.

                  att.   cor.    %
XX is only SFC     144    144   100
Multiple SFCs       92     75    82
Total              236    219    93

Fig. 9: Micro-analysis of Disambiguation Results based on LDOCE Ordering by Frequency

Conclusions

We would like to compare our results to those of others doing similar work, but have not found comparable work which provides quantified results.
Quantitative results in the literature on lexical disambiguation to which we can compare our efforts are those of Lesk (1986), who used a small set of words and reported a success rate for disambiguation of 50% to 70%, and Wilks et al. (1989), who reported 45% success in an experiment on just the word 'bank'. Compared to these results, our experiments on a randomly selected set of sentences containing 1638 words requiring disambiguation reflect a larger-scale testing of a disambiguation methodology. Although our system selects SFCs and not actual senses, there are many text analysis tasks where our approach's successful combination of three knowledge sources for the selection of either a single sense or a more constrained set of senses would contribute to the system's successful completion of its task. Recent work by McRoy (1992), which we find very promising, also uses multiple knowledge sources, but does not report a quantitative evaluation of the disambiguator's performance due to the stated difficulty in acquiring human disambiguation of words. We acknowledge the difficulty in obtaining human judgments, but have chosen to quantify our disambiguator's performance although there may be some noise in the results. In an effort to assure the quality of the results, we have used a single native English speaker with excellent language skills, whose choices are reviewed by two additional confirmatory judges to determine each word's correct sense. In addition, we agree with McRoy that perhaps a more reasonable evaluation of the disambiguator's performance would be in respect to the task for which the NLP system is designed. We have preliminary results on the effectiveness of using document representations consisting of disambiguated SFC vectors for the clustering of documents for use in our document detection system.
Our preliminary experimental efforts (Liddy, Paik & Woelfel, 1992) produce coherent subject-based document clusters whose use in the document detection system reduces the retrieval computation by 80% and actually improves precision as compared to individual document-to-query matching of SFC vectors. In conclusion, we find our results both intuitively and pragmatically pleasing. Our automatic approximation of lexical disambiguation has produced some excellent results for a first implementation. In addition, we know how to address most of the primary causes of the incorrect results. Additionally, the sources of knowledge that have been suggested in the psycholinguistic literature as alternative influences on the human disambiguation process have been successfully replicated in machine-readable resources and computational processes.

Acknowledgments

We wish to thank Longman Group, Ltd. for making the machine-readable version of LDOCE, 2nd Edition, available to us, and BBN for making POST available for our use on this project.

References

Boguraev, B. & Briscoe, T. (1989). Computational lexicography for natural language processing. London: Longman.

Choueka, Y. & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, pp. 147-157.

Gentner, D. (1981). Some interesting differences between verbs and nouns. Cognition and Brain Theory, 4(2), 161-178.

Hogaboam, T. W. & Perfetti, C. A. (1975). Lexical ambiguity and sentence comprehension. Journal of Verbal Learning and Verbal Behavior, 16(3), pp. 265-274.

Kelly, E. F. & Stone, P. J. (1975). Computer recognition of English word senses. Amsterdam: North Holland Publishing Co.

Krovetz, R. (1991). Lexical acquisition and information retrieval. In Zernik, U. (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon. Hillsdale, NJ: Lawrence Erlbaum.

Lesk, M. (1986).
Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. Proceedings of SIGDOC, pp. 24-26.

Liddy, E.D. & Paik, W. (1991). An intelligent semantic relation assigner. Proceedings of the Workshop on Natural Language Learning. Sponsored by IJCAI '91, Sydney, Australia.

Liddy, E.D., Paik, W. & Woelfel, J.K. (1992). Use of subject field codes from a machine-readable dictionary for automatic classification of documents. Advances in Classification Research: Proceedings of the 3rd ASIS SIG/CR Classification Research Workshop. Medford, NJ: Learned Information, Inc.

McRoy, S. W. (1992). Using multiple knowledge sources for word sense disambiguation. Computational Linguistics, 18(1), pp. 1-30.

Meteer, M., Schwartz, R. & Weischedel, R. (1991). POST: Using probabilities in language processing. Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Sydney, Australia.

Prather, P.A. & Swinney, D. A. (1988). Lexical processing and ambiguity resolution: An automatic process in an interactive box. In Small, S., Cottrell, G., & Tanenhaus, M. (Eds.), Lexical ambiguity resolution: Perspectives from psycholinguistics, neuropsychology, and artificial intelligence. San Mateo, CA: Morgan Kaufmann.

Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton University Press.

Slator, B. (1991). Using context for sense preference. In Zernik, U. (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon. Hillsdale, NJ: Lawrence Erlbaum.

Small, S., Cottrell, G., & Tanenhaus, M. (1988). Lexical ambiguity resolution: Perspectives from psycholinguistics, neuropsychology, and artificial intelligence. San Mateo, CA: Morgan Kaufmann.

Walker, D. E. & Amsler, R. A. (1986). The use of machine-readable dictionaries in sublanguage analysis. In Grishman, R. & Kittredge, R. (Eds.), Analyzing language in restricted domains: Sublanguage description and processing. Hillsdale, NJ: Lawrence Erlbaum.
Wilks, Y., Fass, D., Guo, C-M., McDonald, J., Plate, T. & Slator, B. (1989). A tractable machine dictionary as a resource for computational semantics. In Boguraev, B. & Briscoe, T. (Eds.), Computational lexicography for natural language processing. London: Longman.