From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Lexical Issues in Dealing with Semantic Mismatches and Divergences

James Barnett and Elaine Rich
MCC
barnett@mcc.com, rich@mcc.com

In this paper, we address the question, "What types of MT divergences and mismatches must be accommodated in the lexicon?" In our work, we have focused on the treatment of divergences and mismatches in an interlingual (as opposed to a transfer-based) MT system. In such systems, one uses monolingual lexicons for each of the languages under consideration, so divergences and mismatches are not handled as explicitly as they must be in transfer-based systems that rely on bilingual lexicons. In an interlingual system, these issues become issues in NL generation into the target language.

Divergences and Mismatches

Two kinds of problems can occur as a result of a divergence [Dorr 90] or a mismatch [Kameyama 91]. The first is the relatively simple case, in which there is a lexical gap, i.e., a straightforward attempt to render a concept that was mentioned in the source language fails to find the required word. When this happens, it is clear that there is a problem, although it is less clear what to do about it.

Example: English know <-> German wissen/kennen

The more difficult case is the one in which the lexicon provides a way to express the exact (or almost exact) concept that was extracted from the source text, but there is also at least one alternative way to express the same (or a similar) thing that is more natural in the target language. In this case, even noticing the problem is not necessarily straightforward. We consider two subcases here:

- The problem is with a single word. This occurs when there are word pairs one of which is a default and the other of which is marked. In these cases, the default may only be used when the marked case cannot be used. For example, in Spanish the word pez (fish) is a default: it can only be used if the more specific word pescado (a caught fish suitable for food) is not appropriate.[1] (A sketch of this selection rule appears at the end of this section.)

- The problem is at the phrasal level. In these cases, although each word is acceptable in some context, the larger phrase sounds awkward. There are several specific kinds of circumstances in which this occurs, including:

  1. The main event has more than one argument/modifier, one of which can be incorporated, along with the main event, into the main verb, with the others being added as syntactic arguments/modifiers. There may be lexical preferences for choosing one argument over another for incorporation. For example,

     French: Il a traversé la rivière à la nage.
     English: He traversed the river by swimming.
              He swam across the river.

     In this example, the main event is physical motion. In French, the fact that the motion was all the way to the other side is incorporated into the main verb, and the kind of motion (swimming) is added as a modifier. Although this same incorporation is possible in English (as shown in the literal translation), the alternative, in which the kind of motion is incorporated into the main verb and the completion of the motion is indicated with a preposition, is more natural.

  2. Although there is a general kind of phrase that can be used to express the desired concept, there is also a more specialized idiom (or semi-idiom) that is applicable in the current situation and should therefore be used. For example, in English, you can say, "I have a pain in my ___" for just about any part of your body. But in a few special cases (e.g., head), it is more natural to say, "I have a ___ ache."

[1] The marked/default distinction we are exploiting here is analogous to the more traditional marked/unmarked distinction that is used in morphology [Jakobson 66].
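The selection rule in the single-word subcase can be made concrete. Below is a minimal sketch in Python, assuming a hypothetical lexicon representation in which each concept lists a default word plus marked alternatives with applicability tests; none of these names come from an actual system.

```python
# Hypothetical lexicon representation: for each concept, a default word plus
# marked alternatives, each with a test for its applicability conditions.
def choose_word(concept, context, lexicon):
    """Use a marked word when its conditions hold; otherwise the default."""
    entry = lexicon[concept]
    for marked in entry.get("marked", []):
        if marked["applies"](context):   # a marked form preempts the default
            return marked["word"]
    return entry["default"]

# The Spanish pez/pescado alternation from the text:
lexicon = {
    "fish": {
        "default": "pez",
        "marked": [{"word": "pescado",   # a caught fish suitable for food
                    "applies": lambda ctx: ctx.get("caught_for_food", False)}],
    }
}
print(choose_word("fish", {"caught_for_food": True}, lexicon))   # pescado
print(choose_word("fish", {}, lexicon))                          # pez
```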
Our goal is to define a tactical generation system that takes as its input a structure we call a discourse-level semantic structure (DLSS). The DLSS is composed of a set of referents that correspond to the objects to be mentioned and a set of assertions about those referents. The assertions will be in terms defined by the knowledge base that underlies the interlingua. Because the DLSS already corresponds roughly to the way that the target text will be expressed, the techniques we are developing are unable to handle cases where global transformations must be applied to the result of understanding the source text. In those cases, additional mechanisms (probably understood right now by no one) must be applied to the result of the source language understanding process before the tactical generation process into the target language can begin.

We want to produce a target text that has two properties:

- It comes as close as possible to rendering exactly the semantic content of the DLSS that was input to the tactical generator.

- It sounds as natural as possible.

These two dimensions are orthogonal. But they cannot be considered independently. Sometimes a very small change in semantics may enable a significant change in naturalness. So the task of designing a generation algorithm within this framework can be divided into the following tasks:

1. Define what it means to be the best rendition, in a given language, of an input DLSS. This definition must include some way of trading off accuracy and naturalness.

2. Define a search procedure that will, in principle, find all the correct renditions of an input DLSS. Any reasonable generation algorithm will work here.

3. Describe a set of heuristics that will make the search procedure tractable.
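To make the generator's input concrete, here is a minimal sketch of a DLSS as just described: a set of referents plus assertions stated in the KB's vocabulary. The class and field names are illustrative assumptions, not the paper's actual data structures; the predicate names are partly borrowed from the lexical-entry example later in the paper and partly (agent) illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Assertion:
    predicate: str   # a term defined by the KB underlying the interlingua
    args: tuple      # referent identifiers (and KB constants)

@dataclass
class DLSS:
    referents: set = field(default_factory=set)     # objects to be mentioned
    assertions: list = field(default_factory=list)  # claims about them

# "He swam across the river," roughly as it might arrive from source analysis:
dlss = DLSS(
    referents={"e1", "x1", "y1"},
    assertions=[
        Assertion("motion-event", ("e1",)),
        Assertion("agent", ("e1", "x1")),
        Assertion("mechanism-of-motion", ("e1", "swimming")),
        Assertion("path", ("e1", "y1")),   # y1 = the river, crossed completely
    ],
)
```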
Viewing the Problem as One of Generation

As suggested above, we view this entire process as one of generation, rather than as translation per se. But even the brief description of it that has so far been presented makes it clear that there is a sense in which what we are doing is transforming the interlingua expression (which resulted from analyzing the source text) into an expression in the target language. At one level, this is very similar to the conventional structural transfer of the sort that takes place in most transfer-based MT systems. But there is a crucial difference. There are no transformations that are driven by relationships between linguistic structures in one language and linguistic structures in another. Instead, there are transformations that are driven by the lexicon and grammar of the target language, which can be thought of as forcing the transformations to produce the most natural, semantically acceptable output in the target language. True, the input to the transformations comes from analyzing expressions in another language, and so its form clearly does depend on the lexicon and the grammar of the source language. And, in fact, we will use some information about the source lexicon in some of our generation heuristics. But in general, the process by which text is generated needs to know little or nothing about the source language.

The only real sense in which the needs of MT have pushed our work on generation in directions different from more generic efforts to build NL generation systems is the following: In some constrained generation environments, it is possible to get away with assuming that the input to the generator (from some external source, such as a problem-solving system) is already in a form that is very close to the form that the final linguistic output should take. This is particularly likely to be true if the input to the generator is based on a knowledge base that was designed with one specific language in mind (as is often the case unless the need to support natural representations of expressions in other languages is explicitly considered while the KB is being built). What MT does is to force us to consider cases in which the input to the generator is in a form quite different from the final output form. And, in particular, we must consider cases in which the KB is not an exact match to the lexicon that the generator must use (since it cannot simultaneously correspond exactly to more than one language's lexicon). But the mechanisms for dealing with these cases need not be specific to MT. Instead, they can be applied just as effectively in other generation environments, particularly those in which the underlying KB is sufficiently rich that there can exist forms that do not happen to correspond to natural linguistic expressions in the target language.

In the rest of this paper, we will sketch the answers that we have so far developed to each of the three issues raised above. These answers require that we view the target language lexicon as containing more information than is often provided. In particular, it is necessary not just to include possible words and phrases, but also to indicate for each when it is the most natural way to express the associated concept.

What is a Correct Rendition of a DLSS?

To define what it means to be a correct rendition of a given DLSS, we need to do three things:

- Define what we mean by accuracy (the extent to which the target text exactly matches the semantics of the DLSS). We will refer to this as semantic closeness.

- Define what we mean by syntactic naturalness (a function solely of the target text itself).

- Specify a way to handle the tradeoff between accuracy and naturalness.

Assume for the moment that we have measures of semantic closeness and naturalness (which we will describe below). One way to handle the tradeoff between the two is to define some kind of combining function. Then a generation algorithm could search the whole space of remotely possible output strings and select the one that was best, given the combining function we chose. There are two problems with this approach. The first is that there appears to be no principled basis for choosing such a function. The second is more pragmatic: the generation algorithm will waste a lot of time exploring possible output texts that will simply get rejected at evaluation time.

An alternative approach is to take a more structured view of the interaction between the two measures and build them into the algorithm in a way that enables us to constrain the generation process more tightly. We do that as follows. The generation procedure takes four inputs:

- the DLSS to be rendered into the target language;

- an absolute semantic closeness threshold θ: we will accept as a correct rendition of the input DLSS only those strings whose semantics is within θ of the DLSS;

- a semantic closeness interval δ(θ);

- a syntactic naturalness threshold N: we will accept as a correct rendition of the input DLSS only those strings whose naturalness is above N.

The generation algorithm then works as follows: Start with a reasonably tight bound on semantic closeness (initially δ(θ)). See if you can generate one or more acceptably natural sentences within that bound. If so, pick the most natural of the candidates. If not, increase the allowable difference and try again. Continue until you reach the absolute bound on closeness. If there are no acceptably natural sentences within this bound, the algorithm fails.
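A minimal sketch of this loop follows, assuming hypothetical stand-ins generate_within (all candidate strings whose semantics lies within the given bound of the DLSS) and naturalness (the measure developed below); the step size by which the bound widens is likewise an assumption.

```python
def generate(dlss, theta, delta, N, generate_within, naturalness):
    """Iteratively widen the semantic closeness bound from delta up to theta."""
    bound = delta
    while bound <= theta:
        candidates = [s for s in generate_within(dlss, bound)
                      if naturalness(s) > N]         # acceptably natural only
        if candidates:
            return max(candidates, key=naturalness)  # most natural candidate
        bound += delta    # loosen the allowable semantic difference and retry
    return None           # no acceptably natural rendition within theta: fail
```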
How tight the original closeness bound is, how loose it is eventually allowed to get, and what level of naturalness is required will be functions of the generation/translation context. In some situations, loose but natural translations will be preferred. In others, tight but perhaps unnatural ones may be better. In any case, by using an iterative approach, it is possible to generate an entire text in which most of the sentences are very close to the input, with only a small number (the ones where the close version sounded really bad) being a bit farther away.

Semantic Closeness

How then do we define the semantic closeness measure that this algorithm requires? We want a measure that captures three things:

- How different are the semantics of the DLSS and the target text? For example, if we assume a set-theoretic model, how many elements are in one but not the other? We will use the expression ΔS to refer to the difference between the DLSS and the information content of the proposed text. ΔS is in principle a statement in the semantics language. What we need for the algorithm, however, is a numeric measure of the size of ΔS. We will use |ΔS| to indicate that measure, which will be computed not by actually computing ΔS and then measuring it, but by assigning costs to each of the inference rules that are used to move from one semantic description to another.

- How likely is it that the semantics of the target text accurately characterizes the situation that the DLSS is describing? For example, the DLSS may correspond to the meaning of the English word know and the target text may end up with semantics corresponding to the German word kennen, but if the situation being described is one of being acquainted with a person, then the target text accurately describes that situation even if it is not identical to the English source text. Let P(C) be the probability that the semantics of the target text accurately characterizes the situation.

- If there are differences between the semantics of the target text and the situation being described, how much is anyone likely to care about those differences? For example, suppose we translate the English word vegetable into the Japanese word yasai, which describes a class that mostly but not completely overlaps the class described by vegetable. In most texts this difference probably does not matter at all, but in a book on horticulture, it might. Let CARE(ΔS) be a measure of how much anyone is likely to care about semantic differences that do exist. In general, if we throw away information, we need to be concerned about whether it was important. If we add information, we need to worry about whether adding it will give the reader some incorrect ideas that matter.

In any real program, it will not be possible to compute any of these measures exactly. But we can exploit heuristics. For the first, we can look at the specific inferences that were made within the knowledge base. For the second, we must also appeal to the KB, although here we may also want to appeal to the source lexicon as well (see below). For the third, we must appeal again to the KB and also to some model (possibly very simple) of what matters in the domain of the current generation task.

To see how these measures interact to define closeness, let's examine each of the common KB inference procedures that can be used to generate sets of assertions that are likely to be "close" to the input DLSS.
First consider the case in which we move up in the KB hierarchy (in other words, we throw out information). This happens in translating the German words kennen and wissen into English, where we have to move up to the more general concept of knowing. A clearer example occurs in translating the Spanish word pescado (meaning a fish that has been caught and is ready to be food) into the English fish. In this case, there is no chance of being wrong. So we define the closeness of a proposed text to the input DLSS as simply

    |ΔS| · CARE(ΔS) [2]

Next consider the case in which we move down in the KB hierarchy (in other words, we add information). This happens in translating the English word fish into the Spanish pescado, or in translating the English know into German. Now we have to consider the risk that the information that has been added is wrong. Sometimes the risk is low (for example, because straightforward type constraints on the other components of the sentence guarantee that the required conditions are met). But sometimes there is a risk. So closeness is defined as

    |ΔS| · P(C) · CARE(ΔS)

Finally, consider the case in which we move sideways, to related situations. This operation is very common, because it occurs in several different kinds of scenarios. It happens in cases (for example, English vegetable vs. Japanese yasai) in which there is no word for the exact concept but there is a word for a very similar one. It also happens in cases of alternative incorporations (as in the swim/traverse example). In both cases, it is possible that we are both adding and throwing away information. In the latter case, the likelihood of making a mistake, however, is very low. In the former, it depends on how similar the concepts are. In either case, we use as a measure of closeness

    |ΔS| · P(C) · CARE(ΔS)

[2] We use multiplication in all of these functions. It isn't critical that it be multiplication. Almost any monotonic function will probably work.
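These definitions can be read as the following sketch, in which |ΔS| is accumulated from the costs assigned to the applied inference rules (as proposed above); the rule records and the P(C) and CARE(ΔS) estimates are hypothetical placeholders for whatever the KB and domain model supply.

```python
def delta_size(applied_rules):
    """|ΔS|: total cost of the inference rules used to move from the
    input DLSS to the semantics of the proposed text."""
    return sum(rule["cost"] for rule in applied_rules)

def closeness(applied_rules, direction, p_correct=1.0, care=1.0):
    """Closeness of a proposed text to the input DLSS, per the formulas
    above. Moving up the hierarchy cannot be wrong, so P(C) plays no
    role there; moving down or sideways risks error, so P(C) enters."""
    size = delta_size(applied_rules)
    if direction == "up":
        return size * care               # |ΔS| * CARE(ΔS)
    return size * p_correct * care       # |ΔS| * P(C) * CARE(ΔS)
```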
Defaultwordsneedjust to be markedas such. Negative idioms needonly be stored, using the sameformat as positive ones (see below), alongwith a rating that indicates the appropropriatelevel of unacceptability. But this approach doesn’treally workin case2. Wedo not wantto haveto enter explicitly into the lexicon everyphrasethat is "eclipsed"by something better. Let’s look at the ’traverse the river" example to makethis issueclear. Weare trying to translate from Frenchto English. Suppose westart generatingwith a very small 8(0), becausewewanta fairly literal translation if wecan find a goodone. Thenwemayproduce,"He traversed the river by swimming."The lexical entries for thesewordstell us nothingabout the fact that this soundsunnatural, nor doesthe grammar Indicate that anyunusualconstructionwasused. Sothere is no basis for giving this sentencea low acceptability score. Yet wewantto, precisely because there is an alternative wayto say it that is very, veryclosesemantically andthat is a lot better in termsof naturalness. Thereare two possible solutions to this problem.Thefirst is to modifythe generationalgorithm and adda sort of cloud around8. If anythingin the cloudis a lot morenatural than the other sentences,then take it. Thebig drawbackto this approachis that it is computationallyvery expensive.Addingto 0 means firing knowledge baseinferencerules, andthat is likely to be explosive. Thealternativeis to run the inferencerules in the otherdirection andcachethe results. In otherwords, start with all the phrases that are markedin the lexicon as being preferred. Take the semantic expressionsthat correspondto themand, starting at those points, apply the inverses of the inference rules that wouldnormally be used during generation. Any phrasesthat correspondto the semantic expressionsthat result shouldbe marked in the lexicon as bad. In other words,weneedto be able to tell whena sentenceis bad(i.e., highly unnatural).But wedo not wantto dependon being given a lot of explcit information aboutbadthings, becausethere are too many of themandbecausethe badnessusually results from the existence of goodthings (so sayingit again wouldbe redundant). Fortunately, however,wecan extract the required information automatically. In addition, it is possibleto addexplicit badness informationas a wayto tune the lexicon, but this should only be necesssary in caseswherethe knowledge baseis incompleteandit is not worththe effort to fix it. So wenowhavetwo kinds of information that mustbe presentin the lexicon in order for naturalness judgements to be possible. Thefirst is that default/marked pairs mustbe indicated. This is easy. 39 Thesecond,andmoresignificant thing is that it mustbe straightforwardto enter phrases,at varying degreesof generality, into the lexicon. 
So we now have two kinds of information that must be present in the lexicon in order for naturalness judgements to be possible. The first is that default/marked pairs must be indicated. This is easy. The second, and more significant, thing is that it must be straightforward to enter phrases, at varying degrees of generality, into the lexicon. Each lexical entry must have the following general form:

    <conceptual pattern>  <linguistic pattern>  (<style>)  <goodness function>

For example, to indicate that we have a preference for using words like swimming (i.e., incorporating the kind of motion into the main verb), we might have the following entry in our English lexicon (using shorthand for the actual patterns):

    motion-event(X), mechanism-of-motion(X, Z), path(X, Y)
    <word corresponding to Z> <path expression for Y>
    1

Often the goodness function will be a single number. In fact, using simply 0 for ordinary word entries and 1 for phrases that are known to be good in certain circumstances (as described by the stated patterns of their arguments) goes a long way. In some cases, however, it is necessary to use a function, which typically takes stylistic information as its argument. So we might have a technical word whose goodness is 1 if we are attempting to generate technical prose and 0 otherwise. Because stylistic information is not always necessary for computing goodness, it is optional in each lexical entry.

Choosing the Most Natural Whole Sentence

So far, we have talked about ways to indicate, in the lexicon, the naturalness of individual words and phrases. But how do we combine those into scores for whole sentences so that we can choose the best? There are two parts to the answer to this question. The first lets us get a rough measure. The second is then used to compare a set of sentences whose scores are nearly equal given the rough measurement.

To compute the rough measurement of the naturalness of a sentence, just average the scores of the components that were used to compose it. We need to use an average so that the generation algorithm can use a single acceptability threshold regardless of the length of the sentence being generated. By this measure, sentences that contain "bad" words or phrases will have lower scores than those that don't.

But what if there are multiple candidate sentences none of which is "bad"? Now we use a different measure. Here we prefer sentences in which the language does more work to ones where it does less. What we mean is that we prefer:

- the use of inflectional morphology to communicate something over the use of additional lexical items;

- the use of a single word to communicate a set of assertions over a phrase (unless the word is stylistically inappropriate in the current context or there is external information, e.g., the source text, that tells us that the word cannot be used);

- the use of a more restricted phrase to the use of a more general-purpose one, and the use of even a general-purpose stored phrase to the use of text that is derived a word at a time using the grammar. A phrase A is more restricted than another phrase B if the semantic conditions on the use of A subsume those on the use of B.

These heuristics, taken together, usually but not always prefer shorter sentences to longer ones. For example, they prefer "He swam across the river" to "He traversed the river by swimming." But "I have a headache" is longer than "My head hurts," even though it is more idiomatic.
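The rough measure, with the 0/1 (or style-dependent) goodness values described above, amounts to the following sketch; the entry representation is a hypothetical simplification of the lexical-entry format given earlier, and the example word is illustrative only.

```python
def entry_goodness(entry, style=None):
    """A goodness value may be a constant or a function of style."""
    g = entry["goodness"]
    return g(style) if callable(g) else g

def sentence_naturalness(entries_used, style=None):
    """Average the component scores, so the generation algorithm can use
    one acceptability threshold regardless of sentence length."""
    scores = [entry_goodness(e, style) for e in entries_used]
    return sum(scores) / len(scores)

# For illustration: a word whose goodness depends on the register generated.
entry = {"word": "traverse",
         "goodness": lambda style: 1 if style == "technical" else 0}
```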
Using the Source Text for Advice

So far, we have described the process of target language generation as one in which only the target language dictionary and grammar need be used. There are cases, though, in which the source text and the source lexicon can be exploited for additional information that can be used during generation. We give two examples.

First consider what happens if, in attempting to generate the target text, the initial attempt to find a word for a concept produces a word that is marked as a default (e.g., pez). In general, it is necessary to find the corresponding marked word (e.g., pescado) and see if its semantic conditions can be proven to be probably true in the current context. If they can, then the marked word must be used. But suppose that the source language has the same default/marked alternation. Then the reason we are looking at the default word is that the corresponding default word was used in the source text. Assuming that text was correctly written, we are guaranteed that the applicability conditions for the marked word are not met. So we do not need to check for them during generation.

As a second example, consider the case in which an entire set of semantic conditions can be accounted for by using a single word, rather than a head word and a set of modifiers. In general, we want to do that. For example, we would want to render the meaning of the English phrase "nasty little dog" as roquet in French. But suppose that the source language also has such a single word and didn't use it (perhaps because this occurrence of the word is in a definition). Then it is probably a good idea not to use it in the target text either.

Notice that in both these cases, although we make use of the source language lexicon during generation, there is no transfer lexicon. Neither source nor target lexicon needs to know anything about the other.
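The first of these heuristics might look as follows; the alternation lookup and the provability test are hypothetical stand-ins for the lexicon and KB machinery described earlier.

```python
def choose_default_or_marked(concept, context, target_lex, source_info):
    """Skip the marked word's applicability proof when the source language
    has the same alternation and the source text used the default."""
    default, marked = target_lex.alternation(concept)   # e.g., pez / pescado
    if (source_info.has_same_alternation(concept)
            and source_info.used_default(concept)):
        # A correctly written source text guarantees that the marked word's
        # applicability conditions are not met, so no check is needed.
        return default.word
    if marked.conditions_provable(context):
        return marked.word     # the marked form must be used when it applies
    return default.word
```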
Considering More than One Sentence at a Time

So far, we have considered the problem of generating a single sentence from a single DLSS. We have assumed the existence of a strategic planning component that precedes the actual generation system and whose goal is to construct the appropriate DLSS. In particular, for MT, we assume that often the DLSS can be patterned after the structures in the source text. Sometimes, however, it may be desirable to change some of the decisions made by the strategic planner during generation. For example, it might be possible to produce a much more natural sounding sentence if the generator were allowed to change the focus or the presuppositions of the sentence. If these kinds of transformations are done, their costs can be incorporated into |ΔS|, just as are the costs of other changes to the semantics of the original DLSS. But since text-level issues tend to be much more common across languages than lexical and morphological ones are, these kinds of transforms are less important than the ones we have focused on in this discussion.

Summary of Implications for the Lexicon

The main implications of this discussion for lexicons designed to support interlingual MT systems are the following:

- Default/marked pairs must be indicated in the lexicon.

- Phrases that represent preferred ways of saying things need to be entered into the lexicon as phrases even if their semantics can be derived compositionally from the meanings of their constituents (and thus there would be no reason to enter them into a lexicon intended just to support understanding rather than generation).

- Restrictions on the applicability of phrases need to be stated as tightly as possible.

- The lexicon needs to be tied to a knowledge base that supports the kinds of reasoning that we have described for getting from a literal meaning to a related meaning that can be more naturally expressed in the target language.

Notice that in some sense none of these requirements is unique to MT, however. All of these things are necessary in any NL generation system that expects to be able to generate from inputs that do not exactly match the way things are typically said in the target language.

References

[Dorr 90] Dorr, B. Solving Thematic Divergences in Machine Translation. In Proceedings of ACL, 1990.

[Jakobson 66] Jakobson, R. Zur Struktur des Russischen Verbums. In Hamp, Householder, & Austerlitz (editors), Readings in Linguistics II. The University of Chicago Press, Chicago, 1966.

[Kameyama 91] Kameyama, M., R. Ochitani & S. Peters. Resolving Translation Mismatches with Information Flow. In Proceedings of the 29th Annual Meeting of the ACL, 1991.