The Use of Machine Readable ... in the Pangloss Project* Computing Research Laboratory

From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
The Use of Machine Readable Dictionaries
in the Pangloss Project*
Stephen Helmreich, Louise Guthrie, and Yorick Wilks
ComputingResearch Laboratory
Box 30001
NewMexico State University
Las Cruces, NM88003-0001
Machine Readable Dictionaries (MRDs)contain
muchuseful information about lan£uage. Researchers
have workedfor the last decade on ways to extract this
information for language processing systems. But processing dictionaries for use in natural language computation is itself a difficult problem. Transforminginformation from a version designed for humanreaders to one
usable by computers is a slow process that, for example,
mayrequire one to define a "semantics" of font changes
on a dictionary printer tape (e.g., that small bold capital
letters meanthat the wordso capitalized is a synonymof
the headword). Somefairly straight-forward information,
such as word lists, part of speech fists, and orthography
indications have been used by various language processing systems, yet the rather obvious use of MRDs-- to
allow the automatic construction of lexicons for natural
language processing systems, and in particular for MT
systems-- has not been within our reach until recently.
[Farwell, Wilks and Cmthrie 1992] described work
which used the LongmanDictionary of Contemporary
English (LDOCE)to construct lexical entries semiautomatically for one of the lexicons in the ULTRA
machine translation system [Farwell and Wilks 1990].
Cmrenfly, the CRLis working together with the Center
for Machine Translation at Carnegie-Mellon and the
InformationSciences Institute at the University of Southern California on Pangloss, a Department of Defensefunded machinetranslation project. Someof the goals of
that project are to use MRDs
in the construction of its
lexicons, to improvethe techniquesused in [Farwell et. al
92] and to develop techniques for using bilingual dictionaries to construct the individual languagelexicons automatically.
The Pangloss project is foundedon an interfingual,
knowledge-based approach combined with statistical
methods. Automaticacquisition of lexical information is
used at all three levels of the initial system: analysis
(Spanish), interlingnal/ontological,
and generation
(English). For this extraction/acquisition task we have
used both LDOCEand Collins Spanish-English and
* This research was supported at the NewMexico State
University ComputingResearch Laboratory by NSFGrant
IRI-9101232
and through
DoD Contract
No.
MDAg04-91-C-9334.
69
English-Spanishbilingual dictionaries.
In this paper we address issues raised by the call
for papers from the perspective of this project and also
describe in somedetail the actual workso far carried out.
Lexical levels - an overviewof the lexicons
The interlingual nature of the Pangloss system
lends itself to a stratified approachto lexicon building:
lexical information, for both l~.nglish and Spanish, contains primarily syntactic and morphologicalinformation,
together with a pointer to an interlingual conceptual
token containing semantic information. In our initial system there are three levels for which automated acquisition is useful: the Intermediate Representation Lexicon
based on ULTRA,
the Spanish lexicon for analysis (also
originally based on ULTRA),and the English lexicon
used by the generator, Penman [Penman Natural Langnage Group, 1989]. During the second stage of the project, we use LDOCE
extensively for generating entries in
the joint ontology, and in the initial processingof Spanish
input. Wediscuss each level separately.
I. Spanish Lexical Entries
Nouns:
se_form(san,ios~,ts~f._A,san_jose_x).
se_.form(abarrote,tsgn,_A,grcceries0_.0).
se_foam(abarrotes,tp,m,_A,groceries0_0).
se:form(abrigo,ts,m~A,coatl_ 1).
se_form(abrigo,tugn~A,shelter 1_2).
se__form(abrigos,tp,m,__A,coatl_l).
Spanish entries for nouns and pronouns have the
representation above. "se_form"identifies the entry as a
noun or pronoun. The components of the five-tuple
encode the lexical item, person and numberinformation,
gender information, case information and the corresponding interlingnal token. Nounsenses make up more than
half the senses in a standard dictionary and will be a large
part of the final Panglosss Spanish lexicon. The entries
were created in a semi-automatic way, using the Collins
Spanish/English dictionary as an aid. Weare developing
proceduresto create these entries entirely automatically
from the machine-readableversion of the Collins dictionary. The procedures for supplying the first four components (lexical item, person and nnmber,gender, and case)
from the dictionary have been developed and we are
investigatingtechniquesfor associatingthe correct interlingual token.
Verbsas well as adjectives are representedin the
Spanish lexicon of Pangloss as shownin Figure 1.
"sr form"identifies the entry as a verb or adjective. The
ten-tuple representation encodesthe informationbelow
whenever
it is appropriate.Useof a prolog-stylevariable
C_A"for example)indicates componentswhichare not
germaneto the entry. Theten fields indicate the lexical
token, whetherthe verb is stative or dynamic,agreement
information(person and number),agreementinformation
(gender), other morphologicalinformation, information
on tense, aspect, mood,and voice as well as the correspondinginterlingual token. At present these entries
werecreated interactively, with the use of the Collins
Bilingual Dictionary. Proceduresare being developedto
create entries automaticallyfromthe dictionary: Wehave
completedproceduresto derive automaticallythe agreement information (person, numberand gender), tense,
aspect, mood,voice and morphologicalinformatiou. We
are investigatingmethodsfor deriving the remainingfeatures as well as techniquesfor associatingthe interlingual
token.
For example, we have obtained as a resource a
bilingual (Spanish/English)list of some3000financial
terms compiledby Barron’s. This list of Spanishterms
with correspondingEnglish terms allows us to identify
the relevant possible sense tokens from LDOCE.
Where
a collocationor termis not listed in Longman’s,
an alphabetic rather than numerictag is assigned to the word
sense.
In the secondstage of the project, weare also using
Collins bilingual dictionaries to provide information
neededfor pro-processingSpanishtext. For example,we
have developeda part-of-speech tagger for Spanish,
basedon wordlookupin Collins. To do this, it wasnecessaryto obtain the citation formfor inflected words.For
this purlxx~,weused verb classification informationto
developa look-uptable for all the inflectional formsof
verbslisted in Collins.
II. Intermediate
Representations:
The entries correspondingto nouns and verbs in
the intermediaterepresentatiouare basedon the intermediate representation of ULTRA
which have been
describedin [Farwellet al. 1992]. Theentity entries in
Figure 2 have been derived entirely from the machi~
readable version of LDOCE,
except for the third component whichis a manuallyassigned semantic category.
"ir spec_ent"entries correspondroughly to nouns. The
fields encodea semanticcategory, whetherthe nounis
proper or common,whether it is mass or count, the
LDOCE
semantic category (provided from the machine
readable version of the dictionary), and the LDOCE
70
domain category (provided for some senses in the
machinereadable version of LDOCE).
For nouns which
do not appear in LDOCE
(such as "SanJose" above), the
entries werecreated manually.Thefirst component
represents the headword,homographand sense in LDOCE.
If there is only onehomograph
or sense, it is labelled"0".
Thus,"grocery0_0"
indicatesthat it is the first (andonly)
sense of the first (and only) homograph
of the headword
"grocery". Headwordsnot appearing in LDOCE
are
given an arbitrary homograph/sense
tag as can be seen in
"San_jose_X".
Theentries in the intermediaterepresentationcorrespondingroughly to verbs and adjectives were also
generated semi-automatically. See [Farwell 1992] for
details. In the exampleentry above,"ir_spec._rel" indicates a relational entry (verb or adjective). Thefields
markthe sense token, whether the sense is dynamicor
stative, a semanticclassification for the verb, the semantic roles of its arguments
(subject, direct object, indirect
object, ff any),andthe semanticclassificationof the entities generallyfilling thoseroles.
Duringthe secondstage of the Panglossproject,
this IntermediateRepresentationLexiconwill be incorporated within an Ontology created by combinins the
already substantive ontologies of Penman’s Upper
Model,CMU’s
Ontos, and the semantic codinghierarchy
of LDOCE
semanticclassification codesfor nouns. Into
this upper-level network will be placed the LDOCE
senses, using the networkof disambiguatedgenusterms
describedin [BruceandGuthrie1992].
Duringthe secondstage of the project, the Spanish
analysis systemwill access these entries in the ontology
rather than the Intermediate RepresentationLexiconto
obtain semantic information. Wealso plan that the
LDOCE
semanticcodeswill be used as a basis for sense
disambiguation, in addition to the hand-codedULTRAbasedsemanticcategories.
Ill. EnglishLexicalentries for generation
As the entries for Penman(Figure 3) are more
readable, wewill not describe their content in detail.
Givena list of proposedPenman
Englishlexical entries,
wehave developedprocedures for deriving from LDOCE
the inflected forms of nouns, verbs, adjectives and
adverbs.
InterdependenceandUniformityof Representations
Fortheoretical reasons,these different lexical representations--Spanish
lexicon, Englishlexicon, andIntermediateRepresentatioulexicon (ontology)--arekept separate. Therepresentationsdiffer accordingto the type of
information contained (syntactic vs. semantic) and
accordln~to use (analysis vs. generation). Thesecondstage ontology, however,will be accessed by both the
analysis and generationmodules,and will also serve as
the basis for inferencingwithinthe Interlingualrepresentation, whichwe call the TMR
(Text MeaningRepresentation).
Use of Automatedprocedures
In addition to the automatedproceduresdescribed
above, wehave extracted various phrasal lexicons for
English/Spanish, and Spaulsh/Fmglish, both from the
Collins bilingual dictionaries andfromcorpora. Wehave
developeda lexicon of phrases and idic~ns from LDOCE
and, by parsing the definitions of LDOCE
[Slator 1988]
and developingproceduresfor disambiguatingthe hypernymsof nounsenses in the dictionary, wehavecreated a
network of over 20,000 noun sense representations
[Brtr.e &Cmthrie1992], [Guthrie et al. 1990] whichis
being used in the ontology of Pangloss. Weare also
developinga procedurefor identifying and sense disambiguationtypical argumentsof verbs, wherethis information is provided by LDOCE.
In short, our efforts have
been directed towardprovidingscalability in encoding
explicit linguistic and semanticinformadonthat techniquesrelying on implicit informationhavebeenable to
approachusing purely statistical methods[Brownet al.,
1990].
Sharing lexicons
Part of our effort has beendirected to extracting
informationfromMRDs
on-line for the specific purposes
of this project. Undoubtedly
that will continue, particularly as newMRDs
becomeavailable. Anotherpart of
oureffort, however,has been directed to the general
problemof makinE on-line dictionary information more
accessible:processingeachdictionaryandits entries into
a format that makesextraction of any informationmuch
simplerand faster. For each dictionary, of course, this
machine-tractable (as opposed to merely machinereadable)formatdiffers, if only becausedifferent typesof
languageanddifferent types of dictionaries contain different kindsof information.Therefore,at a generallevel,
weare in the processof constructinga lexical data base,
whichcan contain the informationfromseveral dictionaries.
BHingualdictionaries
Althoughour workwith the Collins dictionaries
has not yet been as extensive as that with LDOCE,
it
appears that at least as muchuseful informationfor MT
purposesis available there. Wehavealready madeuse of
somemonolingualinformation, such as morphological
information and syntactic information. Monolingual
semantic informationin the form of synonyms,subject
areas, or preferred collocationsis perhapsmoreaccessible and moreextensive than in LDOCE.
Phrasal lexicons
havebeenextracted fromCollins for both example-based
translation and for codingmulti-wordlexical items for
the P anglossparser/analyzer.
MTmismatches and divergences
Mismatchesbetweensource and target languages
are numerousand exist at every relevant semanticlevel
of language(lexical, syntactic, pragmatic,cultural).
Dealingappropriatelywith mismatches
at a lexical level
is a vital task for anyMTsystem.Mismatches
at the lexical level (sense splits, senseoverlaps,completemapping
failures) are well-knownproblemsfor MT.Within the
interlingual modelof the current Panglosssystem, these
sense mismatchesare not resolvedat the lexical level.
Thereis no direct mappingfromsource languagelexical
itemsto a set of target languagelexical items. Everylanguage-specificlexical entry mapsuniquelyinto an interlingual concept token. If disambiguationof the source
text is not or cannotbe accomplished
sufficiently well to
provide the distinctions necessary for the generation
moduleduring analysis, the generator itself must make
the distinctions basedon the intermediaterepresentation.
This procedure (as employedin the ULTRA
system)
describedin [Helmreich,Jin, Wilks,andGuillen1992].
Conclusion
The (semi-) automatic use of MRDs
for applications like MTremainsa desirable goal, but cannot as yet
be claimedas proved, though wehave certainly given
whatwebelieve to be the first operational proof of the
concept. Wedo not suppose that very general MRDs
(like LDOCE
and Collins) will suffice in the absence
domain-specificlexicons described frombilingual corpora (parallel and, moreintriguingly, non-parallel)
empirical methods.Whatwe have described here is no
morethan a core lexicon. In a separate project (Tipster)
weare investigating not only howto join MRD-derived
to corpora-derivedlexicons but howto obtain the latter
by "tnning" the formeragainst the domaincorpusitself
[Anlck&Pustejovsky1990;Cowieet al. 1992,1993].
Asis well known,there is at the moment
no wayof
evaluatingthe performanceof a lexicon separately from
the wholeproject of whichit is a part, other thanby comparingfinal outputperformance
with different versionsof
a lexiconat differenttimes(whilethe rest of the systemis
kept fixed). Webelieve fuller evaluation methodsfor
lexical systemswill be derived, but at the moment
this
project (of automaticlexical provisionfor Pangloss)is
subjectto the evaluationof Panglossitself, whichis conducted periodically against other DoDNIT systems
fundedunder the sameoverall scheme.
sr_form(identifica,dyn,ts,_Afan,prs,simp,indic,actv,identify0_
1).
sr_form(identifica,dyn,tu,_A,fin,prs,simp,indic,actv,identify0_
1 ).
sr_form(identificada,dyn,ts,f,psp,_A~B
,_C,pasv,identify0_
1).
sr_form(identificadas,dyn,tp;f,psp,_A,__B
~C,pasv,identify0_l).
sr_form(identificado,dyn,_A,_B
,psp,_C,perf,_D,actv,identify0_
1 ).
sr_form(identificado,dyn,ts,m,psp,_A,_B,_C,pasv,identify0_l
).
sr_form(identiticados,dyn,tp~n,psp,_A,_B,_C,npasv,identify0_
1 ).
sr_form(identifican,dyn,tp~A,fin,prs,simp,indic,actv,identify0_
1 ).
sr_form(identificando,dyn,__A,_B
,prp,_C,prog,_D,actv,
identify0_1 ).
sr_form(identificar,dyn,_A~B,inf,._C,simp,_D,actv,identify0_l).
sr_form(identificaron,dyn,tp,_A,fin,pst,
simp,indic,actv,identify0_l).
sr_form(identificar&dyn,ts,__A,fin,fut,
simp,put,actv,identify0_l).
sr_form(identiticar&dyn,tu,_A,fin,fut,simp,put,actv,identify0_
1 ).
sr form(identificar(m
,dyn,tp,_A.fin,fut,simp,put,actv,identify0_
1
sr_form(identific6,dyn,ts~A,fin,pst,simp,indic,actv,identify0_
1 ).
sr_form(identific6,dyn,m,_A,fin,pst,
simp,indic,actv,identify0_
1 ).
Figure I. PanglossSpanishLexicon:verbs
Entities:
ir_spec_.ent(san_jose_x,propJoc,c,_Lc,_Ld).
it_ spec_ent(groceriesO
0 ,nrm,p_obj,c,sol _or liq,open).
ir_spec_ent(groceryO_O,nrm,place,m,abstract,open).
Relations:
ir_spec_rel(identifyO_
1 ,dyn,act.agnt,pat,none,human,p_obj,none).
Figure2. IntermediateRepresentation
Lexicon:entities andrelations
72
Penmannoun:
(lexical-item
:name LOS-ANGglFS
:spelling "LosAngeles"
:sample-sentence
....
:features
(NOUNNOINFLECTIONS
PROPERNOUN
COUNTABLE
NOT-PERIOD
NOT-DETERMINERRF~UIRED
NOT-PROVENANCE)
:comments"city"
:date "10/24/8810:21:02"
:editor "HOVY")
PenmanVerb:
(lexical-item
:name IDENTIFY
:spelling "identify"
:sample-sentence’Thenavyidentified the newRussianship as the Gorky."
:features
(VERBINFLECTABLE
LEXICALNOT-CASEPREPOS1TIONS
OBJECTPERM1TYED
NOT-TOCOMPQUESTIONCOMP
PARTICIPLECOMPNOT-MAKECOMP
BAREINFIN1TIVECOMP
NOT-COPULAPASSIVE THATCOMP
NOT-THATREQUIRED
NOT- SUB JUNCTIVEREQUIRED
NOT-ADJECTIVECOMP
NONE-OF-B1TRANS1TIVE-INDIRECTOBJECT
EXPERIVNCEVERB
PERCEVFIONMIDDLE
OBJECTNOTREQUIRED
NOT-SUBJECTCOMP
UN1TARYSPV.ILING VISUAL IRR
PLURALPASTFORM
SECONDSINGULARPASTFORM
PLURALFORMFIRSTSINGULARFORM
EDPARTICIPI,RFORMTHIRDSINGULARFORM
INGPARTICIPLEFORM
PASTFORM
SECONDSINGULARFORM
FIRSTSINGULARPASTFORM
THIRDSINGULARPASTFORM)
:properties ((INGPARTICIPI
,EFORM
"identifying") (EDPARTICIPLEFORM
"identitied") (PLURALPASTFORM
"identified") (THIRDSINGULARPASTFORM
"identitied") (SECONDSINGULARPASTFORM
"identified")
(FIRSTSINGULARPASTFORM
"identified")(PASTFORM"identified")
(PLURALFORM
"identify")(SECONDSINGULARFORM
"identify")
(FIRSTSINGULARFORM
"identify")(THIRDSINGULARFORM
"identifies"))
:comments"process"
:date "10/26/8812:08:51"
:editor "HOVY")
Figure3. Penman
entries
73
References
Anick, P.o and J. Pustejovsky(1990). Knowledge
Acquisition from Corpora.COLING-90(2):7-12.
Brown,E. J. Cocke,S. Della Pietra, V. Delia Pietra, E
Jelinek, J. Lafferty, R. Mercer, and E Roossin
(1990). A Statistical Approachto MachineTranslation. Computational
Linguistics, 16:79-85.
Bruce, R., and L. Guthde (1992). GenusDisambiguation: A Study in Weighted Preference. COLING-92(4):1187-1191.
Cowie,J., T. Wakao,L. Guthrie, W.Jin, and J. Pustejovsky (1992). CRL/NMSU
and Brandeis: Report
on the Diderot Systemas Usedfor the Tipster 12
MonthEvaluation. TIPSTER12-month meeting,
Plenary SessionNotebook,San Diego, California.
Cowie,J., L. Guthde,W.Jin, and J. Pustejovsky(1993).
TheDiderot InformationExtraction System.To be
presented at PAC-LING
93, Vancouver,B.C.
Farwell, D. et Y. Wilks(1990). ULTRA:
a Multi-lingual
MachineTranslator. Memoranda
in Computingand
Cognitive Science, MCCS-90-202,Computing
Research Laboratory, NewMexicoState University, Las Cruces,NM.
Farwell, D., L2 Guthrie, et Y. Wilks(1992). TheAutomatic Creationof Lexical Entries for a Multilingual MTSystem. COLING-92(2):532-538.
Guthrie, Louise,Brian Slator, YorickWilks,and Rebecca
Bruce (1990). Is there content in ~mptyHeads?
COLING-90(3):
138-143.
Helmreich, S., W. Jin, Y. Wilks, R. Guillen (1992).
Research Issues in MachineTranslation at the
ComputingResearch Laboratory. Memorandain
Computingand Cognitive Science, MCCS-92-242,
Computing Research Laboratory, NewMexico
State University, Las Cruces, NM.
PenmanNatural Language Group (1989). The Penman
Reference Manual. The PenmanUser Guide. The
PenmanPrimer. InformationSciences Institute at
the University of SouthernCalifornia, Marinadel
Rey, CA.
Procter, P., L. Ilson, J. Ayto,et al. (1978).Longman
Dictionary of ContemporaryEnglish. Harlow, UK:
LongmanGroupLimited.
Slator, Brian M. (1988). Constructing Contextually
OrgAnizedLexical Semantic Knowledge-bases.
Proceedingsof the Third AnnualRockyMountain
Conferenceon Artificial Intelligence (RMCAI-88),
Denver, CO,pp.142-148.
Smith, C., with M. Bermejo Marcos and E. ChangRodriguez (1990). Collins Spanish-English,
English-SpanishDictionary. Glasgow,UK:HarperCollinsPublishers
74