From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved. The Use of Machine Readable Dictionaries in the Pangloss Project* Stephen Helmreich, Louise Guthrie, and Yorick Wilks ComputingResearch Laboratory Box 30001 NewMexico State University Las Cruces, NM88003-0001 Machine Readable Dictionaries (MRDs)contain muchuseful information about lan£uage. Researchers have workedfor the last decade on ways to extract this information for language processing systems. But processing dictionaries for use in natural language computation is itself a difficult problem. Transforminginformation from a version designed for humanreaders to one usable by computers is a slow process that, for example, mayrequire one to define a "semantics" of font changes on a dictionary printer tape (e.g., that small bold capital letters meanthat the wordso capitalized is a synonymof the headword). Somefairly straight-forward information, such as word lists, part of speech fists, and orthography indications have been used by various language processing systems, yet the rather obvious use of MRDs-- to allow the automatic construction of lexicons for natural language processing systems, and in particular for MT systems-- has not been within our reach until recently. [Farwell, Wilks and Cmthrie 1992] described work which used the LongmanDictionary of Contemporary English (LDOCE)to construct lexical entries semiautomatically for one of the lexicons in the ULTRA machine translation system [Farwell and Wilks 1990]. Cmrenfly, the CRLis working together with the Center for Machine Translation at Carnegie-Mellon and the InformationSciences Institute at the University of Southern California on Pangloss, a Department of Defensefunded machinetranslation project. Someof the goals of that project are to use MRDs in the construction of its lexicons, to improvethe techniquesused in [Farwell et. al 92] and to develop techniques for using bilingual dictionaries to construct the individual languagelexicons automatically. The Pangloss project is foundedon an interfingual, knowledge-based approach combined with statistical methods. Automaticacquisition of lexical information is used at all three levels of the initial system: analysis (Spanish), interlingnal/ontological, and generation (English). For this extraction/acquisition task we have used both LDOCEand Collins Spanish-English and * This research was supported at the NewMexico State University ComputingResearch Laboratory by NSFGrant IRI-9101232 and through DoD Contract No. MDAg04-91-C-9334. 69 English-Spanishbilingual dictionaries. In this paper we address issues raised by the call for papers from the perspective of this project and also describe in somedetail the actual workso far carried out. Lexical levels - an overviewof the lexicons The interlingual nature of the Pangloss system lends itself to a stratified approachto lexicon building: lexical information, for both l~.nglish and Spanish, contains primarily syntactic and morphologicalinformation, together with a pointer to an interlingual conceptual token containing semantic information. In our initial system there are three levels for which automated acquisition is useful: the Intermediate Representation Lexicon based on ULTRA, the Spanish lexicon for analysis (also originally based on ULTRA),and the English lexicon used by the generator, Penman [Penman Natural Langnage Group, 1989]. During the second stage of the project, we use LDOCE extensively for generating entries in the joint ontology, and in the initial processingof Spanish input. Wediscuss each level separately. I. Spanish Lexical Entries Nouns: se_form(san,ios~,ts~f._A,san_jose_x). se_.form(abarrote,tsgn,_A,grcceries0_.0). se_foam(abarrotes,tp,m,_A,groceries0_0). se:form(abrigo,ts,m~A,coatl_ 1). se_form(abrigo,tugn~A,shelter 1_2). se__form(abrigos,tp,m,__A,coatl_l). Spanish entries for nouns and pronouns have the representation above. "se_form"identifies the entry as a noun or pronoun. The components of the five-tuple encode the lexical item, person and numberinformation, gender information, case information and the corresponding interlingnal token. Nounsenses make up more than half the senses in a standard dictionary and will be a large part of the final Panglosss Spanish lexicon. The entries were created in a semi-automatic way, using the Collins Spanish/English dictionary as an aid. Weare developing proceduresto create these entries entirely automatically from the machine-readableversion of the Collins dictionary. The procedures for supplying the first four components (lexical item, person and nnmber,gender, and case) from the dictionary have been developed and we are investigatingtechniquesfor associatingthe correct interlingual token. Verbsas well as adjectives are representedin the Spanish lexicon of Pangloss as shownin Figure 1. "sr form"identifies the entry as a verb or adjective. The ten-tuple representation encodesthe informationbelow whenever it is appropriate.Useof a prolog-stylevariable C_A"for example)indicates componentswhichare not germaneto the entry. Theten fields indicate the lexical token, whetherthe verb is stative or dynamic,agreement information(person and number),agreementinformation (gender), other morphologicalinformation, information on tense, aspect, mood,and voice as well as the correspondinginterlingual token. At present these entries werecreated interactively, with the use of the Collins Bilingual Dictionary. Proceduresare being developedto create entries automaticallyfromthe dictionary: Wehave completedproceduresto derive automaticallythe agreement information (person, numberand gender), tense, aspect, mood,voice and morphologicalinformatiou. We are investigatingmethodsfor deriving the remainingfeatures as well as techniquesfor associatingthe interlingual token. For example, we have obtained as a resource a bilingual (Spanish/English)list of some3000financial terms compiledby Barron’s. This list of Spanishterms with correspondingEnglish terms allows us to identify the relevant possible sense tokens from LDOCE. Where a collocationor termis not listed in Longman’s, an alphabetic rather than numerictag is assigned to the word sense. In the secondstage of the project, weare also using Collins bilingual dictionaries to provide information neededfor pro-processingSpanishtext. For example,we have developeda part-of-speech tagger for Spanish, basedon wordlookupin Collins. To do this, it wasnecessaryto obtain the citation formfor inflected words.For this purlxx~,weused verb classification informationto developa look-uptable for all the inflectional formsof verbslisted in Collins. II. Intermediate Representations: The entries correspondingto nouns and verbs in the intermediaterepresentatiouare basedon the intermediate representation of ULTRA which have been describedin [Farwellet al. 1992]. Theentity entries in Figure 2 have been derived entirely from the machi~ readable version of LDOCE, except for the third component whichis a manuallyassigned semantic category. "ir spec_ent"entries correspondroughly to nouns. The fields encodea semanticcategory, whetherthe nounis proper or common,whether it is mass or count, the LDOCE semantic category (provided from the machine readable version of the dictionary), and the LDOCE 70 domain category (provided for some senses in the machinereadable version of LDOCE). For nouns which do not appear in LDOCE (such as "SanJose" above), the entries werecreated manually.Thefirst component represents the headword,homographand sense in LDOCE. If there is only onehomograph or sense, it is labelled"0". Thus,"grocery0_0" indicatesthat it is the first (andonly) sense of the first (and only) homograph of the headword "grocery". Headwordsnot appearing in LDOCE are given an arbitrary homograph/sense tag as can be seen in "San_jose_X". Theentries in the intermediaterepresentationcorrespondingroughly to verbs and adjectives were also generated semi-automatically. See [Farwell 1992] for details. In the exampleentry above,"ir_spec._rel" indicates a relational entry (verb or adjective). Thefields markthe sense token, whether the sense is dynamicor stative, a semanticclassification for the verb, the semantic roles of its arguments (subject, direct object, indirect object, ff any),andthe semanticclassificationof the entities generallyfilling thoseroles. Duringthe secondstage of the Panglossproject, this IntermediateRepresentationLexiconwill be incorporated within an Ontology created by combinins the already substantive ontologies of Penman’s Upper Model,CMU’s Ontos, and the semantic codinghierarchy of LDOCE semanticclassification codesfor nouns. Into this upper-level network will be placed the LDOCE senses, using the networkof disambiguatedgenusterms describedin [BruceandGuthrie1992]. Duringthe secondstage of the project, the Spanish analysis systemwill access these entries in the ontology rather than the Intermediate RepresentationLexiconto obtain semantic information. Wealso plan that the LDOCE semanticcodeswill be used as a basis for sense disambiguation, in addition to the hand-codedULTRAbasedsemanticcategories. Ill. EnglishLexicalentries for generation As the entries for Penman(Figure 3) are more readable, wewill not describe their content in detail. Givena list of proposedPenman Englishlexical entries, wehave developedprocedures for deriving from LDOCE the inflected forms of nouns, verbs, adjectives and adverbs. InterdependenceandUniformityof Representations Fortheoretical reasons,these different lexical representations--Spanish lexicon, Englishlexicon, andIntermediateRepresentatioulexicon (ontology)--arekept separate. Therepresentationsdiffer accordingto the type of information contained (syntactic vs. semantic) and accordln~to use (analysis vs. generation). Thesecondstage ontology, however,will be accessed by both the analysis and generationmodules,and will also serve as the basis for inferencingwithinthe Interlingualrepresentation, whichwe call the TMR (Text MeaningRepresentation). Use of Automatedprocedures In addition to the automatedproceduresdescribed above, wehave extracted various phrasal lexicons for English/Spanish, and Spaulsh/Fmglish, both from the Collins bilingual dictionaries andfromcorpora. Wehave developeda lexicon of phrases and idic~ns from LDOCE and, by parsing the definitions of LDOCE [Slator 1988] and developingproceduresfor disambiguatingthe hypernymsof nounsenses in the dictionary, wehavecreated a network of over 20,000 noun sense representations [Brtr.e &Cmthrie1992], [Guthrie et al. 1990] whichis being used in the ontology of Pangloss. Weare also developinga procedurefor identifying and sense disambiguationtypical argumentsof verbs, wherethis information is provided by LDOCE. In short, our efforts have been directed towardprovidingscalability in encoding explicit linguistic and semanticinformadonthat techniquesrelying on implicit informationhavebeenable to approachusing purely statistical methods[Brownet al., 1990]. Sharing lexicons Part of our effort has beendirected to extracting informationfromMRDs on-line for the specific purposes of this project. Undoubtedly that will continue, particularly as newMRDs becomeavailable. Anotherpart of oureffort, however,has been directed to the general problemof makinE on-line dictionary information more accessible:processingeachdictionaryandits entries into a format that makesextraction of any informationmuch simplerand faster. For each dictionary, of course, this machine-tractable (as opposed to merely machinereadable)formatdiffers, if only becausedifferent typesof languageanddifferent types of dictionaries contain different kindsof information.Therefore,at a generallevel, weare in the processof constructinga lexical data base, whichcan contain the informationfromseveral dictionaries. BHingualdictionaries Althoughour workwith the Collins dictionaries has not yet been as extensive as that with LDOCE, it appears that at least as muchuseful informationfor MT purposesis available there. Wehavealready madeuse of somemonolingualinformation, such as morphological information and syntactic information. Monolingual semantic informationin the form of synonyms,subject areas, or preferred collocationsis perhapsmoreaccessible and moreextensive than in LDOCE. Phrasal lexicons havebeenextracted fromCollins for both example-based translation and for codingmulti-wordlexical items for the P anglossparser/analyzer. MTmismatches and divergences Mismatchesbetweensource and target languages are numerousand exist at every relevant semanticlevel of language(lexical, syntactic, pragmatic,cultural). Dealingappropriatelywith mismatches at a lexical level is a vital task for anyMTsystem.Mismatches at the lexical level (sense splits, senseoverlaps,completemapping failures) are well-knownproblemsfor MT.Within the interlingual modelof the current Panglosssystem, these sense mismatchesare not resolvedat the lexical level. Thereis no direct mappingfromsource languagelexical itemsto a set of target languagelexical items. Everylanguage-specificlexical entry mapsuniquelyinto an interlingual concept token. If disambiguationof the source text is not or cannotbe accomplished sufficiently well to provide the distinctions necessary for the generation moduleduring analysis, the generator itself must make the distinctions basedon the intermediaterepresentation. This procedure (as employedin the ULTRA system) describedin [Helmreich,Jin, Wilks,andGuillen1992]. Conclusion The (semi-) automatic use of MRDs for applications like MTremainsa desirable goal, but cannot as yet be claimedas proved, though wehave certainly given whatwebelieve to be the first operational proof of the concept. Wedo not suppose that very general MRDs (like LDOCE and Collins) will suffice in the absence domain-specificlexicons described frombilingual corpora (parallel and, moreintriguingly, non-parallel) empirical methods.Whatwe have described here is no morethan a core lexicon. In a separate project (Tipster) weare investigating not only howto join MRD-derived to corpora-derivedlexicons but howto obtain the latter by "tnning" the formeragainst the domaincorpusitself [Anlck&Pustejovsky1990;Cowieet al. 1992,1993]. Asis well known,there is at the moment no wayof evaluatingthe performanceof a lexicon separately from the wholeproject of whichit is a part, other thanby comparingfinal outputperformance with different versionsof a lexiconat differenttimes(whilethe rest of the systemis kept fixed). Webelieve fuller evaluation methodsfor lexical systemswill be derived, but at the moment this project (of automaticlexical provisionfor Pangloss)is subjectto the evaluationof Panglossitself, whichis conducted periodically against other DoDNIT systems fundedunder the sameoverall scheme. sr_form(identifica,dyn,ts,_Afan,prs,simp,indic,actv,identify0_ 1). sr_form(identifica,dyn,tu,_A,fin,prs,simp,indic,actv,identify0_ 1 ). sr_form(identificada,dyn,ts,f,psp,_A~B ,_C,pasv,identify0_ 1). sr_form(identificadas,dyn,tp;f,psp,_A,__B ~C,pasv,identify0_l). sr_form(identificado,dyn,_A,_B ,psp,_C,perf,_D,actv,identify0_ 1 ). sr_form(identificado,dyn,ts,m,psp,_A,_B,_C,pasv,identify0_l ). sr_form(identiticados,dyn,tp~n,psp,_A,_B,_C,npasv,identify0_ 1 ). sr_form(identifican,dyn,tp~A,fin,prs,simp,indic,actv,identify0_ 1 ). sr_form(identificando,dyn,__A,_B ,prp,_C,prog,_D,actv, identify0_1 ). sr_form(identificar,dyn,_A~B,inf,._C,simp,_D,actv,identify0_l). sr_form(identificaron,dyn,tp,_A,fin,pst, simp,indic,actv,identify0_l). sr_form(identificar&dyn,ts,__A,fin,fut, simp,put,actv,identify0_l). sr_form(identiticar&dyn,tu,_A,fin,fut,simp,put,actv,identify0_ 1 ). sr form(identificar(m ,dyn,tp,_A.fin,fut,simp,put,actv,identify0_ 1 sr_form(identific6,dyn,ts~A,fin,pst,simp,indic,actv,identify0_ 1 ). sr_form(identific6,dyn,m,_A,fin,pst, simp,indic,actv,identify0_ 1 ). Figure I. PanglossSpanishLexicon:verbs Entities: ir_spec_.ent(san_jose_x,propJoc,c,_Lc,_Ld). it_ spec_ent(groceriesO 0 ,nrm,p_obj,c,sol _or liq,open). ir_spec_ent(groceryO_O,nrm,place,m,abstract,open). Relations: ir_spec_rel(identifyO_ 1 ,dyn,act.agnt,pat,none,human,p_obj,none). Figure2. IntermediateRepresentation Lexicon:entities andrelations 72 Penmannoun: (lexical-item :name LOS-ANGglFS :spelling "LosAngeles" :sample-sentence .... :features (NOUNNOINFLECTIONS PROPERNOUN COUNTABLE NOT-PERIOD NOT-DETERMINERRF~UIRED NOT-PROVENANCE) :comments"city" :date "10/24/8810:21:02" :editor "HOVY") PenmanVerb: (lexical-item :name IDENTIFY :spelling "identify" :sample-sentence’Thenavyidentified the newRussianship as the Gorky." :features (VERBINFLECTABLE LEXICALNOT-CASEPREPOS1TIONS OBJECTPERM1TYED NOT-TOCOMPQUESTIONCOMP PARTICIPLECOMPNOT-MAKECOMP BAREINFIN1TIVECOMP NOT-COPULAPASSIVE THATCOMP NOT-THATREQUIRED NOT- SUB JUNCTIVEREQUIRED NOT-ADJECTIVECOMP NONE-OF-B1TRANS1TIVE-INDIRECTOBJECT EXPERIVNCEVERB PERCEVFIONMIDDLE OBJECTNOTREQUIRED NOT-SUBJECTCOMP UN1TARYSPV.ILING VISUAL IRR PLURALPASTFORM SECONDSINGULARPASTFORM PLURALFORMFIRSTSINGULARFORM EDPARTICIPI,RFORMTHIRDSINGULARFORM INGPARTICIPLEFORM PASTFORM SECONDSINGULARFORM FIRSTSINGULARPASTFORM THIRDSINGULARPASTFORM) :properties ((INGPARTICIPI ,EFORM "identifying") (EDPARTICIPLEFORM "identitied") (PLURALPASTFORM "identified") (THIRDSINGULARPASTFORM "identitied") (SECONDSINGULARPASTFORM "identified") (FIRSTSINGULARPASTFORM "identified")(PASTFORM"identified") (PLURALFORM "identify")(SECONDSINGULARFORM "identify") (FIRSTSINGULARFORM "identify")(THIRDSINGULARFORM "identifies")) :comments"process" :date "10/26/8812:08:51" :editor "HOVY") Figure3. Penman entries 73 References Anick, P.o and J. Pustejovsky(1990). Knowledge Acquisition from Corpora.COLING-90(2):7-12. Brown,E. J. Cocke,S. Della Pietra, V. Delia Pietra, E Jelinek, J. Lafferty, R. Mercer, and E Roossin (1990). A Statistical Approachto MachineTranslation. Computational Linguistics, 16:79-85. Bruce, R., and L. Guthde (1992). GenusDisambiguation: A Study in Weighted Preference. COLING-92(4):1187-1191. Cowie,J., T. Wakao,L. Guthrie, W.Jin, and J. Pustejovsky (1992). CRL/NMSU and Brandeis: Report on the Diderot Systemas Usedfor the Tipster 12 MonthEvaluation. TIPSTER12-month meeting, Plenary SessionNotebook,San Diego, California. Cowie,J., L. Guthde,W.Jin, and J. Pustejovsky(1993). TheDiderot InformationExtraction System.To be presented at PAC-LING 93, Vancouver,B.C. Farwell, D. et Y. Wilks(1990). ULTRA: a Multi-lingual MachineTranslator. Memoranda in Computingand Cognitive Science, MCCS-90-202,Computing Research Laboratory, NewMexicoState University, Las Cruces,NM. Farwell, D., L2 Guthrie, et Y. Wilks(1992). TheAutomatic Creationof Lexical Entries for a Multilingual MTSystem. COLING-92(2):532-538. Guthrie, Louise,Brian Slator, YorickWilks,and Rebecca Bruce (1990). Is there content in ~mptyHeads? COLING-90(3): 138-143. Helmreich, S., W. Jin, Y. Wilks, R. Guillen (1992). Research Issues in MachineTranslation at the ComputingResearch Laboratory. Memorandain Computingand Cognitive Science, MCCS-92-242, Computing Research Laboratory, NewMexico State University, Las Cruces, NM. PenmanNatural Language Group (1989). The Penman Reference Manual. The PenmanUser Guide. The PenmanPrimer. InformationSciences Institute at the University of SouthernCalifornia, Marinadel Rey, CA. Procter, P., L. Ilson, J. Ayto,et al. (1978).Longman Dictionary of ContemporaryEnglish. Harlow, UK: LongmanGroupLimited. Slator, Brian M. (1988). Constructing Contextually OrgAnizedLexical Semantic Knowledge-bases. Proceedingsof the Third AnnualRockyMountain Conferenceon Artificial Intelligence (RMCAI-88), Denver, CO,pp.142-148. Smith, C., with M. Bermejo Marcos and E. ChangRodriguez (1990). Collins Spanish-English, English-SpanishDictionary. Glasgow,UK:HarperCollinsPublishers 74