How to Compile a Bilingual Collocational Lexicon

advertisement
From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.
Howto Compilea Bilingual
Collocational Lexicon
. Automatically
Frank Smadja,
smadja@cs.columbia.edu
Columbia
University
Computer
ScienceDepartment
NowYork, NY10027
1
1.0 Introduction
and MoUvation
Considerthe followingtwosentences:
(1 e) "TheGovernment
is wastingmillions of dollars
sendingmonthlypensionchequesto wealthysenior citizens and babybonusesto families whodo not need
the money."
(If) "Le gouvernement
gaspille des millions de dollars
en envoyantdes chequesde pensiontousles moisaux
personnesdg6esriches et des allocationsauxfamilles
qui n’ en ont pasbesoin."
Thesesentencesare extractedfroma corpusof the proccedingsof the Canadian
Parliament,also called the
Hansardscorpus. As required by law, the Hansards
corpushaveboth the Englishand the Frenchfor each
sentence.Thecorpusconsists of a number
of pairs of
files, onewrittenin Englishandthe other onein
French.Weused a versionof the Hansardsin whichthe
sentenceshavebeenalignedwith their translationsas
2. Sentence(If) is thus the
describedin [Church91]
3translation in Frenchof Sentence(le).
Automaticallyproducing(le) from(lf) involves
issues that haveyet to be solved.In this paperwe
addressthe simplertask of findingcorrect translations
for collocations,in orderto producebilingualiexical
information.Moreprecisely, in the abovesentencesthe
translation for "senior citizens" is "personnesdg~es"
and the translation for "babybonuses"is "allocations." This actually raises several problems:
1. Thetranslation of mostcollocationsis not predictable
in termsof the meaning
of the individualwords,and/or
the meaningof the wholecollocation. Althoughthe
usualtranslationfor "citizen"is "citoyen,’"the latter is
not usedin the translation for "seniorcitizen." Without
the knowledge
of the propercollocations, one would
attempta word-by-word
translation and end up with
incorrect mmslations
(e.g., "un bonuspourbdb#s").
2. Asdiscussedin [Smadja92], mostdictionaries do not
includecollocations. This is true both for monolingual
dictionaries andbilingualones.
a. Somecollocations wanslateinto single words. More
generally, a collocation of n words(1 <=n) might
1. Thisresearchis supportedONR
grant N00014-89-J-1782.
Proceedings
of the AAKI
workshop
onstatistically basedNLP
techniques,
July,1992.
2. Wewould
like to thankKenChurch
andthe Bell Laboratories 3. Although
in actuality,(le) mighthavebeentranslatedfrom
herefor providing
us withthe alignedcorpus.
(10.
57
translate in a collocationof p words(] <=p), in which
n andp are different.
Problem
I is the motivationfor our work.It indicates that
a collocationalbilingual dictionaryis neededto do proper
wansL~tion.
Furthermore.
since currentlyavailabledictionaries are not adequatethere is a needfor creatingsuch
collocationaldictionaries.
In this paper, weproposea techniquefor constructing
bilingual collocationdictionaries completelyautomatically. Thetechniqueweproposefirst identifies a set of collocationsin onelanguageandthen attemptsto translate
themusing the Hansardsas wainingdata. Todo this, we
proposeto use Xlract, a collocationcompiler[Smadja92],
to identify collocationsandto use mutualinformationstatistics to Iranslatethe collocationsinto the other language.
Thealgorithmwedescribeis an iterative methodthat
buildsthe Iranslationof a givencollocationbyadding
wordsoneby one. This techniqueallowsa collocation
containingn wordsto be translatedinto a collocationof p
words.Thepaper describes the proposedalgorithmand
showshowit is appliedin the WansL’ttion
of the following
three collocations:"’seniorcitizen.," "Madam
Speaker,"
and"’election campaign."
this is onlya substepin the translationprocessandit is not
madeto produceanysort of lexicon. Moreover,it is not
clear howthese alignmentscouldbe usedin non-statistical
approaches.In contrast, weuse statistical techniquesin
order to providebilingual lexical informationthat couldbe
usedacrossa variety of applications.
3.0 Aligning wordsusing mutual
informationsscores.
Thefirst stage in aligningcollocationsis not novel
(i.e.,[Gale &Church91] and[Brown
et.al.91b]). It consists of usingmutualinformationscores to evaluatethe
correlation of pairs of Frenchand Englishwords.These
scoreswill then be usedin the next stagesto select candidates for inclusionin collocations.
Themutualinformationbetweentwo events is usually
definedas:
p (e^~O
)xp(])
O)~(e,./) = log (p(e)
2.0 Related work
wheree and fare two separate events, andp(x)denotesthe
probability of appearanceof x. g (e,f) measureshowthe
twoevents,are correlated.
Theworkwedescribe in this paper can be viewedas an
extension of the workdoneby [Gale &Church91] and
[Brownet.a1.91a] on bilingual sentencealignment.Both
worksuse purelystatistical techniquesto identify sentence
pairing in corporasimilar to the Hansards.Although
aligning sentencesmightseemlike a relatively minortask, it is
very productiveas it providesa starting point fromwhich
to pursueresearch. Asmentionedbefore, in the workwe
presenthere, weusedan alignedcorpusas input data.
Weapplied Equation(1) as follows. Let S’nbeagiven sentence chosenrandomlyin the corpus, let Sl~ be the
Englishversionof this sentence, andSl~ be the French
alignment.Let E and F be respectively an English word
and a Frenchword,and let e andfreapectivelybe the
eventsthat F E Sl~ andthat E GSINE.Usingthe corpusas
training data, wecan compute
probabilities using a simple
maximum
likelihood method,and thus the mutualinformationof the two events can be computed
as follows:
Anotherthread of researchattemptsto dotranslation using
statistical techniquesonly. [Brown
et.al.91 b] use a sto(2)~t (e,.f) = log (IS(E) nS(F)I.N)
IS(E)I" IS(F)
chastic languagemodelbasedon the techniquesusedin
speechrecognition[Bahlet.al.83], combined
with translation probabilities compiledon the alignedcorpusin order
whereSOV)denotes the set of sentencescontainingthe
to dosentencetranslation.Although
the projectis still at
wordW,andNdenotesthe total numberof sentencesin the
an early stage, it alreadyproduces
quality translationfor
training corpus.
simplesentenceswithoutanylinguistic or semanticinformation.In the processof translating sentences,theyalso
align groupsof wordsto other groupsof words.However,
58
UsingEquation2 on the Hansards corpus, wecompiled
mutualinformation scores for most pairs of possible
English and French words and we kept all pairs with
mutualinformationscores significantly greater than 0.
Table 1 showsa randomlyselected subset of these aligned
wordswhichhas been sorted in decreasing score for this
paper. Mostwordslisted in the Table are actual translation
of one another, for example,the days of the week, the
months, and mostly unambiguouswords such as afternoon
and aprds midi, mandatoryand obligatoire have been corre.tly associated. Suchdata could already be used efficiently by humans.However,we notice that there arc
somediscrepancies, due to several factors. Wehave identified the followingtwo factors as accountingfor a vast
majority of the incorrectly aligned words:
¯ Ambiguous words.
Ambiguouswordsare often translated into several
wordsdependingon the context. In the table, we see
that, for example,inflation is aligned with the French
inflation. This is only true wheninflation means"’an
increase in the volumeof moneyand credit," but the
translation is not correct wheninflation means"the act
of inflating." The moreappropriate translation would
be "’gonflage" or "’gonflement." Althoughthe two
senses arc related, Frenchhas two different and unrelated words.
¯ Collocations.
Somewordsare used as part of a collocation, and a
collocation often translates into anothercollocation. As
explainedbefore, collocations do not translate well
across languages. So that simple mutual information
scores might provide wrong associations in which one
wordof a given collocation is associated with any
wordof the correspondingcollocation. In the table,
examplesof wrongassociations due to collocations are
in bold fonts. For example,the proper translation for
"’senior citizen" is "’personnesa#es," and the correct
translation for "election campaign,"is " campagne
electorale." In the table, we see that campaigngets
associated either with electorale (whichmeans"relating to an election"), or with " campagne
" (whichis
only true in the context of this collocation), and senior
59
gets associated with a#es (which means old).
TABLE1. Someword alignments
English
october
inflation
friday
december
scotia
French
octobre
inflation
vendredi
d6cembre
nouvelle&cosse
bay
bay
afternoon
aprts-midi
nova
nouvelle.
~cosse
war
guerre
patent
brevets
madam
madame
thousand
milliers
mine
mines
solicitor
solliciteur
native
autochtones
mandatory
obligatoire
morning
matin
expansion
regionale
campaign
eampagne
debt
dette
welfare
bien-f~tre
hill
colline
progr~s
progress
campaign
electorale
expansion
expansion
supervision obligatoire
madam
pr~sidente
mandate
mandat
supervision surveillance
withdraw
retirer
men
hommes
constituent ~iecteurs
gas
8az
expansion
indnstrielle
water
eau
defend
d6fendre
growth
croissance
cabinet
cabinet
senior
agnes
children
enfants
Score
2.766482
2.760550
2.756804
2.743461
2.688104
2.676008
2.669028
2.649809
2.642909
2.605734
2.590855
2.586640
2.579116
2.561797
2.544999
2.526343
2.525235
2.520080
2.$15319
2.514616
2.505738
2.503827
2.496735
2.492681
2.492560
2.492257
2.488770
2.481933
2.473921
2.469366
2.469220
2.467424
2.457627
2.456365
2.454382
2.443816
2.438468
2.438364
2.426776
2.422183
In this paper, we do not address the case of ambiguous
words, but we are mostly concernedwith the case of collocations.
ing for the translation of "Madam
Speaker, .... the election
campaign,"and "senior citizen."
TABLE2.
Possible
translations
of senior
ag6es
sensor
2.426776
sensor troisi~me
1.200094
senior direction
1.226118
semor fonctionnaires 1.238787
semor citoyens
1.156518
senior s6nat
0.915873
semor am(~ricain
0.866879
semor population
0.813242
senior niveau
0.929738
senior circonscription 0.890250
senior femmes
0.706010
senior s6curi.’t6
0.703194
jcnnes
senior
0.650396
senior postes
0.636763
senior mois
0.615607
senior comment
0.604284
senmr service
0.587988
senior ministg.re
0.641287
semor conservateur
0.713594
senior aurait
0.536152
senior d6cision
0.534065
senmr contre
0.609833
senior
nombre
0.464127
senior prendre
0.531391
senior finances
0.494123
senior parlementaire 0.452941
senior conseil
0.487362
senior
sant~
0.521901
senior
personnes
1.621361
senior services
0.656421
senior gens
0.460902
senior mesures
0.460181
senior programme
0.437212
senior encore
0.292530
semor soci6~
0.278405
4.0 UsingXtract for finding English
collocations
ProvidingIranslation for collocationspresupposes
that
collocations are alreadyknowin one language
LI andthat
one wantsto expressthemin an other languageL2. To
identify collocations, weproposeto usea collocationcompiler, XWact
[Smadja
92]. UsingXtract ,allows us to start
the translationprocessby providingus with a set of collocationsto Iranslate. It thusgreatlyreduces
the searchspace
in identifying many-to-many
associationsbetweenEnglish
andFrench.A year of the Hansardshasa vocabularyof
morethan 20,000 wordsso that the search space for collocations of length 2 to 8 wouldbe in the order of 1033
whereas Xwactproduces only several thousand collocation
of length 2 to 8.
4.1
Xtmct, an Overvlew
Described in [Smadja & McKcown
90, Smadja 92] Xtract
is a tool for compilingcollocations froman unstructured
free text corpus. Xtract producesa wide range of collocations. In particular, Xtract producesflexible collocations
of the type "to makea decision," in whichthe words can
be inflected, the word order might change and the number
of additional wordsvary with the examples. In [Smadja
92] we showthat Xtract can identify such collocations
with a precision of 80%.Xtract also produces compounds,
such as "The DowJones average of 30 industrial stock,"
which,are non flexible collocations. In this paper weonly
use the collocations of type compounds
which have also
been identified by other techniques such as [Ch6uekaet.al.
83].
4.2
CompilingCollocations with Xtract
Wehaveused Xtract on the English version of the Hansards to compile compoundcollocations. Amongthe collocations re~evedare: "Madam
Speaker,""the election
campaign," "regional industrial expansion," "the Prime
Minister," and "senior citizen." In this paper we are look-
60
5.0 Translating Collocations
5.1
Hypothesesand Overall Description.
The techniqueweproposeuses compound
collocations as
identified by Xuactas seeds, and attempts to provide
translation for them. Theoretically this process must be
TABLE3.
Possible
translations
of citizen
citoyens
citizen
2.090491
citizen ~g6es
2.069875
citizen personnes
1.293101
citizen p6tition
1.271266
citizen ottawa
0.997750
citizen habitants
0.993420
citizen troisii~me
0.981948
citizen lois
0.922442
citizen bien-Stre
0.922442
citizen honneur
0.875496
citizen circonscription 0.824966
citizen qualit6
0.823696
citizen groupe
0.806067
citizen pr6senter
0.783948
citizen s6curit6
0.783605
citizen pmt6ger
0.780380
citizen matin
0.766204
citizen d6cid6
0.765796
citizen population
0.736677
citizen justice
0.735672
citizen acc~
0.711306
citizen canadiens
0.681.060
citizen eux
0.678879
citizen plupart
0.677068
citizen sant~
0.675888
citizen droit
0.663617
citizen moyen
0.659368
citizen services
0.659038
citizen vie
0.647867
citizen payer
0.646507
citizen groupes
0.636491
citizen aider
0.633778
citizen parlement
0.629732
citizen besoin
0.619423
applied both from French to English and from English to
French, so that many-to-one and one-to-manygrouping
could be identified. However,we only applied it from
English to French for the moment.The assumption on
whichthe Uanslation process is based on says that if two
collocations, E and F, are translations of one another, then
all the wordsin E are correlated with all the wordsin F.
The algorithm we propose to use attempts to build the
translation of a seed English collocation by incrementally
addingsingle wordsto its Frenchtranslation.
61
TABLE
4. Intermediatetranslations
senior
citizen gg6es
2.638035,
senior
citizen porsonnes
1.831326,
senior
citizen troisi~me
1.476078,
senior
citizen citoyens
1.385506,
senior
citizen population
1.083623,
senior
citizen circonscription 1.073481,
senior
citizen s6curit6
0.969971,
semor
citizen niveau
0.878886,
senlor
citizen services
0.867681,
citizen Funes
senior
0.820263,
senior
citizen conservateur 0.817643,
citizen centre
senior
0.783304,
senior
citizen sant~
0.757519,
senior
citizen nombre
0.730904,
senior
citizen parlementaire 0.719718,
senior
citizen gens
0.696520,
senior
citizen femmes
0.671757,
senior
citizen am6ricain
0.663165,
senior
citizen aurait
0.581080,
senior
citizen encore
0.513549,
senior
citizen programme
0.377404,
senior
citizen service
0.331886,
senior
citizen prendre
0.327677,
senior
citizen mesures
0.321192,
senior
citizen soci6t~
0.285544,
senior
citizen d6cision
0.240174,
5.2
The algorithm
The algorithmis an iterative algorithm that conslxuctsthe
translation for a given collocation on a word by word
basis. Let {e’l ..... e’n} be an Englishcollocation as identified by Xtract. The algorithmis as follows:
1. Computef%q~ = S’l In which tt (e,./) > 1 and S’i
is the set of-~Frenchwordsthat are correlated with each
of the words
of { e f i ..... e’n }.
Leti= 1.
2. Sort the elements of S fi by decreasing mutualinformation scores.
3. For each subset of size i+l of S’i, computethe mutual
informationof all its elementstaken as separate events.
4.
Removeall the sets containing non correlated elements.
.
If there is no remaining
subsetof size i+l, thenproducethe subsetof size i withthe highestmutualinformationscore with the seed Englishcollocationand
go to Step6.
Otherwise,Incrementi and go to Step 2.
6. End.
In the abovealgorithm,wedefine the mutualinformation
of a set of size p > 1 an eventx asthe mutualinformation
of conjunctionof the pth elementof the set andthe
remaining
subsetof size (p -1), withthe event
5.3
late noun-nouncompounds
in Frenchsince French syntax
doesnot allowfor suchconstructs.
Weare currently workingon testing this algorithmfor
morecomplex
cases, i.e., cases in whichan Englishcollocation of size n is Wanslated
in a Frenchcollocationof a
different size. In particular whenthe Frenchtranslation
consistsof a single word.Weare also evaluatingthe use of
statistics other than mutualinformationthat wouldbring
better results. In a next stage, wewill applythe technique
to a large number
of collocationsandwewill then evaluate
the results.
SomePreliminary Results
Wehaveexperimentedwith the abovealgorithmfor three
Englishcollocations, andwehavereachedthe correct
Frenchequivalentin all three cases. Although
this does
not allowus to determinethe validity of the algorithmwe
considerit an encouraging
result. In the rest of this section
weshowthe algorithmfor the collocation:"seniorcitizen."
Table2 and3, list the associationsof the wordsseniorand
citizen respectively.Ascan be seen fromTables2 and3,
senior hassome30 possibletranslations andcitizen has
some130. ApplyingStep I of the algoritlun wecompute
S~i, whichconsists of the following26 words:troisidme,
societal, services, service, s6curit6,santLprogramme,
prendre,population,personnes,parlementaire,hombre,
niveou,mesures,jeunes, gens, femmes,encore,d#cision,
contre,conservateur,
citoyens,circonscription,aurait,
amdricain,dg~es.
ApplyingStep 2 of the algorithm, wecomputethe mutual
informationof each of the abovewordswith the seed collocation.Table4, indicatestheseresults.
After the appficafionof Step3, onlyonesubsetof size 2
remained:"personnesdg6es"whichis the correct Iranslation for "’seniorcitizen." Which
then terminatesthe algorithm.
Theapplicationof the samealgorithmon "election campaign" and "Madam
Speaker," also produced
the correct
results: "campagne
electorale" and "Madame
la Pr6sidente," in the samenumber
of steps. Thetranslation of
"Madam
Speaker"is obviouslyspecific to this corpusand
cannotbe generalized.In contrast, the wansiationof "election campaign"
is generaland valid across domains.In
addition,it is interestingbecauseit is a problem
to trans-
62
6.0 Conclusion
In this paper wehaveproposeda techniquefor compiling
a bilingual collocationallexiconcompletelyautomatically.
Thetechniquesuse Xtract as a front endin order to identify the collocationsto be translated. Thetranslationsare
then constructedon a wordby wordbasis, andthe search
space is reducedby only consideringwordswith high
mutualinformationwith the original collocationas well as
mutuallycorrelated. This paperdescribesthe algorithm
andgives somepreliminaryresults. In a next stage, we
intend to test the algorithmon morecomplexcases and
then producea bilingual collocationlexiconto be used by
the research community.
BIBLIOGRAPHY
[Bahlet.al. 1983]BahlL., Jelinek E, andMercerIL, "A
maximum
likelihood approachto continuous SpeechRecognition." IEE wansactionson pattern analysis and
machineintelligence, 1983,5-(2), pp:179-190,1983
[Brown
et.al. 91a] Brown’
E, DellaPietra S., DellaPiewa
V., and MercerR. "WordSenseDisambiguation
using Statistical Methods."Proceedingsof the 29th AnnualMeeting of the Associationof Computational
Linguistics, June
1991,BerkeleyCa.
[Brownet.al. 91b] BrownP., Lai J., and MercerR.
"’AligningSentencein Parallel Corpora."Proceedingsof
the 29th AnnualMeetingof the Assoeiationof Computational Linguistics, June 1991,BerkeleyCa.
[Chouekaet.al. 83] ChouekaY., Klein T., and NouwitzE.,
"’Automatic Retrieval of Frequent Idiomatic and Collocational F, xpressions in a LargeCorpus."Jofimal for Literary and Linguistic Computing,vol. 4, pp: 34-38, 1983.
[Gale & Church 91] Gale W., and Church K., ",4 program
for Aligning Sentences in Bilingual Corpora." Proceedings of the 29th AnnualMeetingof the Association of
ComputationalLinguistics, June 1991, Berkeley Ca.
[Smadja & McKeown90] Smadja E, and McKeownK.,
"’Automatically Extracting and Representing Collocatioas
for LanguageGeneration." Proceedings of the 28th
Annual Meeting of the Association of ComputationalLinguistics, June 1990, Pittsburgh, Pa.
[Smadja 92] $madja E "Retrieving Collocations from
Text: Xtract." Journal of ComputationalLinguistics, to
appear, smadja
63
Download