From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved. Howto Compilea Bilingual Collocational Lexicon . Automatically Frank Smadja, smadja@cs.columbia.edu Columbia University Computer ScienceDepartment NowYork, NY10027 1 1.0 Introduction and MoUvation Considerthe followingtwosentences: (1 e) "TheGovernment is wastingmillions of dollars sendingmonthlypensionchequesto wealthysenior citizens and babybonusesto families whodo not need the money." (If) "Le gouvernement gaspille des millions de dollars en envoyantdes chequesde pensiontousles moisaux personnesdg6esriches et des allocationsauxfamilles qui n’ en ont pasbesoin." Thesesentencesare extractedfroma corpusof the proccedingsof the Canadian Parliament,also called the Hansardscorpus. As required by law, the Hansards corpushaveboth the Englishand the Frenchfor each sentence.Thecorpusconsists of a number of pairs of files, onewrittenin Englishandthe other onein French.Weused a versionof the Hansardsin whichthe sentenceshavebeenalignedwith their translationsas 2. Sentence(If) is thus the describedin [Church91] 3translation in Frenchof Sentence(le). Automaticallyproducing(le) from(lf) involves issues that haveyet to be solved.In this paperwe addressthe simplertask of findingcorrect translations for collocations,in orderto producebilingualiexical information.Moreprecisely, in the abovesentencesthe translation for "senior citizens" is "personnesdg~es" and the translation for "babybonuses"is "allocations." This actually raises several problems: 1. Thetranslation of mostcollocationsis not predictable in termsof the meaning of the individualwords,and/or the meaningof the wholecollocation. Althoughthe usualtranslationfor "citizen"is "citoyen,’"the latter is not usedin the translation for "seniorcitizen." Without the knowledge of the propercollocations, one would attempta word-by-word translation and end up with incorrect mmslations (e.g., "un bonuspourbdb#s"). 2. Asdiscussedin [Smadja92], mostdictionaries do not includecollocations. This is true both for monolingual dictionaries andbilingualones. a. Somecollocations wanslateinto single words. More generally, a collocation of n words(1 <=n) might 1. Thisresearchis supportedONR grant N00014-89-J-1782. Proceedings of the AAKI workshop onstatistically basedNLP techniques, July,1992. 2. Wewould like to thankKenChurch andthe Bell Laboratories 3. Although in actuality,(le) mighthavebeentranslatedfrom herefor providing us withthe alignedcorpus. (10. 57 translate in a collocationof p words(] <=p), in which n andp are different. Problem I is the motivationfor our work.It indicates that a collocationalbilingual dictionaryis neededto do proper wansL~tion. Furthermore. since currentlyavailabledictionaries are not adequatethere is a needfor creatingsuch collocationaldictionaries. In this paper, weproposea techniquefor constructing bilingual collocationdictionaries completelyautomatically. Thetechniqueweproposefirst identifies a set of collocationsin onelanguageandthen attemptsto translate themusing the Hansardsas wainingdata. Todo this, we proposeto use Xlract, a collocationcompiler[Smadja92], to identify collocationsandto use mutualinformationstatistics to Iranslatethe collocationsinto the other language. Thealgorithmwedescribeis an iterative methodthat buildsthe Iranslationof a givencollocationbyadding wordsoneby one. This techniqueallowsa collocation containingn wordsto be translatedinto a collocationof p words.Thepaper describes the proposedalgorithmand showshowit is appliedin the WansL’ttion of the following three collocations:"’seniorcitizen.," "Madam Speaker," and"’election campaign." this is onlya substepin the translationprocessandit is not madeto produceanysort of lexicon. Moreover,it is not clear howthese alignmentscouldbe usedin non-statistical approaches.In contrast, weuse statistical techniquesin order to providebilingual lexical informationthat couldbe usedacrossa variety of applications. 3.0 Aligning wordsusing mutual informationsscores. Thefirst stage in aligningcollocationsis not novel (i.e.,[Gale &Church91] and[Brown et.al.91b]). It consists of usingmutualinformationscores to evaluatethe correlation of pairs of Frenchand Englishwords.These scoreswill then be usedin the next stagesto select candidates for inclusionin collocations. Themutualinformationbetweentwo events is usually definedas: p (e^~O )xp(]) O)~(e,./) = log (p(e) 2.0 Related work wheree and fare two separate events, andp(x)denotesthe probability of appearanceof x. g (e,f) measureshowthe twoevents,are correlated. Theworkwedescribe in this paper can be viewedas an extension of the workdoneby [Gale &Church91] and [Brownet.a1.91a] on bilingual sentencealignment.Both worksuse purelystatistical techniquesto identify sentence pairing in corporasimilar to the Hansards.Although aligning sentencesmightseemlike a relatively minortask, it is very productiveas it providesa starting point fromwhich to pursueresearch. Asmentionedbefore, in the workwe presenthere, weusedan alignedcorpusas input data. Weapplied Equation(1) as follows. Let S’nbeagiven sentence chosenrandomlyin the corpus, let Sl~ be the Englishversionof this sentence, andSl~ be the French alignment.Let E and F be respectively an English word and a Frenchword,and let e andfreapectivelybe the eventsthat F E Sl~ andthat E GSINE.Usingthe corpusas training data, wecan compute probabilities using a simple maximum likelihood method,and thus the mutualinformationof the two events can be computed as follows: Anotherthread of researchattemptsto dotranslation using statistical techniquesonly. [Brown et.al.91 b] use a sto(2)~t (e,.f) = log (IS(E) nS(F)I.N) IS(E)I" IS(F) chastic languagemodelbasedon the techniquesusedin speechrecognition[Bahlet.al.83], combined with translation probabilities compiledon the alignedcorpusin order whereSOV)denotes the set of sentencescontainingthe to dosentencetranslation.Although the projectis still at wordW,andNdenotesthe total numberof sentencesin the an early stage, it alreadyproduces quality translationfor training corpus. simplesentenceswithoutanylinguistic or semanticinformation.In the processof translating sentences,theyalso align groupsof wordsto other groupsof words.However, 58 UsingEquation2 on the Hansards corpus, wecompiled mutualinformation scores for most pairs of possible English and French words and we kept all pairs with mutualinformationscores significantly greater than 0. Table 1 showsa randomlyselected subset of these aligned wordswhichhas been sorted in decreasing score for this paper. Mostwordslisted in the Table are actual translation of one another, for example,the days of the week, the months, and mostly unambiguouswords such as afternoon and aprds midi, mandatoryand obligatoire have been corre.tly associated. Suchdata could already be used efficiently by humans.However,we notice that there arc somediscrepancies, due to several factors. Wehave identified the followingtwo factors as accountingfor a vast majority of the incorrectly aligned words: ¯ Ambiguous words. Ambiguouswordsare often translated into several wordsdependingon the context. In the table, we see that, for example,inflation is aligned with the French inflation. This is only true wheninflation means"’an increase in the volumeof moneyand credit," but the translation is not correct wheninflation means"the act of inflating." The moreappropriate translation would be "’gonflage" or "’gonflement." Althoughthe two senses arc related, Frenchhas two different and unrelated words. ¯ Collocations. Somewordsare used as part of a collocation, and a collocation often translates into anothercollocation. As explainedbefore, collocations do not translate well across languages. So that simple mutual information scores might provide wrong associations in which one wordof a given collocation is associated with any wordof the correspondingcollocation. In the table, examplesof wrongassociations due to collocations are in bold fonts. For example,the proper translation for "’senior citizen" is "’personnesa#es," and the correct translation for "election campaign,"is " campagne electorale." In the table, we see that campaigngets associated either with electorale (whichmeans"relating to an election"), or with " campagne " (whichis only true in the context of this collocation), and senior 59 gets associated with a#es (which means old). TABLE1. Someword alignments English october inflation friday december scotia French octobre inflation vendredi d6cembre nouvelle&cosse bay bay afternoon aprts-midi nova nouvelle. ~cosse war guerre patent brevets madam madame thousand milliers mine mines solicitor solliciteur native autochtones mandatory obligatoire morning matin expansion regionale campaign eampagne debt dette welfare bien-f~tre hill colline progr~s progress campaign electorale expansion expansion supervision obligatoire madam pr~sidente mandate mandat supervision surveillance withdraw retirer men hommes constituent ~iecteurs gas 8az expansion indnstrielle water eau defend d6fendre growth croissance cabinet cabinet senior agnes children enfants Score 2.766482 2.760550 2.756804 2.743461 2.688104 2.676008 2.669028 2.649809 2.642909 2.605734 2.590855 2.586640 2.579116 2.561797 2.544999 2.526343 2.525235 2.520080 2.$15319 2.514616 2.505738 2.503827 2.496735 2.492681 2.492560 2.492257 2.488770 2.481933 2.473921 2.469366 2.469220 2.467424 2.457627 2.456365 2.454382 2.443816 2.438468 2.438364 2.426776 2.422183 In this paper, we do not address the case of ambiguous words, but we are mostly concernedwith the case of collocations. ing for the translation of "Madam Speaker, .... the election campaign,"and "senior citizen." TABLE2. Possible translations of senior ag6es sensor 2.426776 sensor troisi~me 1.200094 senior direction 1.226118 semor fonctionnaires 1.238787 semor citoyens 1.156518 senior s6nat 0.915873 semor am(~ricain 0.866879 semor population 0.813242 senior niveau 0.929738 senior circonscription 0.890250 senior femmes 0.706010 senior s6curi.’t6 0.703194 jcnnes senior 0.650396 senior postes 0.636763 senior mois 0.615607 senior comment 0.604284 senmr service 0.587988 senior ministg.re 0.641287 semor conservateur 0.713594 senior aurait 0.536152 senior d6cision 0.534065 senmr contre 0.609833 senior nombre 0.464127 senior prendre 0.531391 senior finances 0.494123 senior parlementaire 0.452941 senior conseil 0.487362 senior sant~ 0.521901 senior personnes 1.621361 senior services 0.656421 senior gens 0.460902 senior mesures 0.460181 senior programme 0.437212 senior encore 0.292530 semor soci6~ 0.278405 4.0 UsingXtract for finding English collocations ProvidingIranslation for collocationspresupposes that collocations are alreadyknowin one language LI andthat one wantsto expressthemin an other languageL2. To identify collocations, weproposeto usea collocationcompiler, XWact [Smadja 92]. UsingXtract ,allows us to start the translationprocessby providingus with a set of collocationsto Iranslate. It thusgreatlyreduces the searchspace in identifying many-to-many associationsbetweenEnglish andFrench.A year of the Hansardshasa vocabularyof morethan 20,000 wordsso that the search space for collocations of length 2 to 8 wouldbe in the order of 1033 whereas Xwactproduces only several thousand collocation of length 2 to 8. 4.1 Xtmct, an Overvlew Described in [Smadja & McKcown 90, Smadja 92] Xtract is a tool for compilingcollocations froman unstructured free text corpus. Xtract producesa wide range of collocations. In particular, Xtract producesflexible collocations of the type "to makea decision," in whichthe words can be inflected, the word order might change and the number of additional wordsvary with the examples. In [Smadja 92] we showthat Xtract can identify such collocations with a precision of 80%.Xtract also produces compounds, such as "The DowJones average of 30 industrial stock," which,are non flexible collocations. In this paper weonly use the collocations of type compounds which have also been identified by other techniques such as [Ch6uekaet.al. 83]. 4.2 CompilingCollocations with Xtract Wehaveused Xtract on the English version of the Hansards to compile compoundcollocations. Amongthe collocations re~evedare: "Madam Speaker,""the election campaign," "regional industrial expansion," "the Prime Minister," and "senior citizen." In this paper we are look- 60 5.0 Translating Collocations 5.1 Hypothesesand Overall Description. The techniqueweproposeuses compound collocations as identified by Xuactas seeds, and attempts to provide translation for them. Theoretically this process must be TABLE3. Possible translations of citizen citoyens citizen 2.090491 citizen ~g6es 2.069875 citizen personnes 1.293101 citizen p6tition 1.271266 citizen ottawa 0.997750 citizen habitants 0.993420 citizen troisii~me 0.981948 citizen lois 0.922442 citizen bien-Stre 0.922442 citizen honneur 0.875496 citizen circonscription 0.824966 citizen qualit6 0.823696 citizen groupe 0.806067 citizen pr6senter 0.783948 citizen s6curit6 0.783605 citizen pmt6ger 0.780380 citizen matin 0.766204 citizen d6cid6 0.765796 citizen population 0.736677 citizen justice 0.735672 citizen acc~ 0.711306 citizen canadiens 0.681.060 citizen eux 0.678879 citizen plupart 0.677068 citizen sant~ 0.675888 citizen droit 0.663617 citizen moyen 0.659368 citizen services 0.659038 citizen vie 0.647867 citizen payer 0.646507 citizen groupes 0.636491 citizen aider 0.633778 citizen parlement 0.629732 citizen besoin 0.619423 applied both from French to English and from English to French, so that many-to-one and one-to-manygrouping could be identified. However,we only applied it from English to French for the moment.The assumption on whichthe Uanslation process is based on says that if two collocations, E and F, are translations of one another, then all the wordsin E are correlated with all the wordsin F. The algorithm we propose to use attempts to build the translation of a seed English collocation by incrementally addingsingle wordsto its Frenchtranslation. 61 TABLE 4. Intermediatetranslations senior citizen gg6es 2.638035, senior citizen porsonnes 1.831326, senior citizen troisi~me 1.476078, senior citizen citoyens 1.385506, senior citizen population 1.083623, senior citizen circonscription 1.073481, senior citizen s6curit6 0.969971, semor citizen niveau 0.878886, senlor citizen services 0.867681, citizen Funes senior 0.820263, senior citizen conservateur 0.817643, citizen centre senior 0.783304, senior citizen sant~ 0.757519, senior citizen nombre 0.730904, senior citizen parlementaire 0.719718, senior citizen gens 0.696520, senior citizen femmes 0.671757, senior citizen am6ricain 0.663165, senior citizen aurait 0.581080, senior citizen encore 0.513549, senior citizen programme 0.377404, senior citizen service 0.331886, senior citizen prendre 0.327677, senior citizen mesures 0.321192, senior citizen soci6t~ 0.285544, senior citizen d6cision 0.240174, 5.2 The algorithm The algorithmis an iterative algorithm that conslxuctsthe translation for a given collocation on a word by word basis. Let {e’l ..... e’n} be an Englishcollocation as identified by Xtract. The algorithmis as follows: 1. Computef%q~ = S’l In which tt (e,./) > 1 and S’i is the set of-~Frenchwordsthat are correlated with each of the words of { e f i ..... e’n }. Leti= 1. 2. Sort the elements of S fi by decreasing mutualinformation scores. 3. For each subset of size i+l of S’i, computethe mutual informationof all its elementstaken as separate events. 4. Removeall the sets containing non correlated elements. . If there is no remaining subsetof size i+l, thenproducethe subsetof size i withthe highestmutualinformationscore with the seed Englishcollocationand go to Step6. Otherwise,Incrementi and go to Step 2. 6. End. In the abovealgorithm,wedefine the mutualinformation of a set of size p > 1 an eventx asthe mutualinformation of conjunctionof the pth elementof the set andthe remaining subsetof size (p -1), withthe event 5.3 late noun-nouncompounds in Frenchsince French syntax doesnot allowfor suchconstructs. Weare currently workingon testing this algorithmfor morecomplex cases, i.e., cases in whichan Englishcollocation of size n is Wanslated in a Frenchcollocationof a different size. In particular whenthe Frenchtranslation consistsof a single word.Weare also evaluatingthe use of statistics other than mutualinformationthat wouldbring better results. In a next stage, wewill applythe technique to a large number of collocationsandwewill then evaluate the results. SomePreliminary Results Wehaveexperimentedwith the abovealgorithmfor three Englishcollocations, andwehavereachedthe correct Frenchequivalentin all three cases. Although this does not allowus to determinethe validity of the algorithmwe considerit an encouraging result. In the rest of this section weshowthe algorithmfor the collocation:"seniorcitizen." Table2 and3, list the associationsof the wordsseniorand citizen respectively.Ascan be seen fromTables2 and3, senior hassome30 possibletranslations andcitizen has some130. ApplyingStep I of the algoritlun wecompute S~i, whichconsists of the following26 words:troisidme, societal, services, service, s6curit6,santLprogramme, prendre,population,personnes,parlementaire,hombre, niveou,mesures,jeunes, gens, femmes,encore,d#cision, contre,conservateur, citoyens,circonscription,aurait, amdricain,dg~es. ApplyingStep 2 of the algorithm, wecomputethe mutual informationof each of the abovewordswith the seed collocation.Table4, indicatestheseresults. After the appficafionof Step3, onlyonesubsetof size 2 remained:"personnesdg6es"whichis the correct Iranslation for "’seniorcitizen." Which then terminatesthe algorithm. Theapplicationof the samealgorithmon "election campaign" and "Madam Speaker," also produced the correct results: "campagne electorale" and "Madame la Pr6sidente," in the samenumber of steps. Thetranslation of "Madam Speaker"is obviouslyspecific to this corpusand cannotbe generalized.In contrast, the wansiationof "election campaign" is generaland valid across domains.In addition,it is interestingbecauseit is a problem to trans- 62 6.0 Conclusion In this paper wehaveproposeda techniquefor compiling a bilingual collocationallexiconcompletelyautomatically. Thetechniquesuse Xtract as a front endin order to identify the collocationsto be translated. Thetranslationsare then constructedon a wordby wordbasis, andthe search space is reducedby only consideringwordswith high mutualinformationwith the original collocationas well as mutuallycorrelated. This paperdescribesthe algorithm andgives somepreliminaryresults. In a next stage, we intend to test the algorithmon morecomplexcases and then producea bilingual collocationlexiconto be used by the research community. BIBLIOGRAPHY [Bahlet.al. 1983]BahlL., Jelinek E, andMercerIL, "A maximum likelihood approachto continuous SpeechRecognition." IEE wansactionson pattern analysis and machineintelligence, 1983,5-(2), pp:179-190,1983 [Brown et.al. 91a] Brown’ E, DellaPietra S., DellaPiewa V., and MercerR. "WordSenseDisambiguation using Statistical Methods."Proceedingsof the 29th AnnualMeeting of the Associationof Computational Linguistics, June 1991,BerkeleyCa. [Brownet.al. 91b] BrownP., Lai J., and MercerR. "’AligningSentencein Parallel Corpora."Proceedingsof the 29th AnnualMeetingof the Assoeiationof Computational Linguistics, June 1991,BerkeleyCa. [Chouekaet.al. 83] ChouekaY., Klein T., and NouwitzE., "’Automatic Retrieval of Frequent Idiomatic and Collocational F, xpressions in a LargeCorpus."Jofimal for Literary and Linguistic Computing,vol. 4, pp: 34-38, 1983. [Gale & Church 91] Gale W., and Church K., ",4 program for Aligning Sentences in Bilingual Corpora." Proceedings of the 29th AnnualMeetingof the Association of ComputationalLinguistics, June 1991, Berkeley Ca. [Smadja & McKeown90] Smadja E, and McKeownK., "’Automatically Extracting and Representing Collocatioas for LanguageGeneration." Proceedings of the 28th Annual Meeting of the Association of ComputationalLinguistics, June 1990, Pittsburgh, Pa. [Smadja 92] $madja E "Retrieving Collocations from Text: Xtract." Journal of ComputationalLinguistics, to appear, smadja 63