Parallel Data & Sentence Alignment
Declan Groves, DCU
dgroves@computing.dcu.ie
Parallel Corpus
We saw in week 2 that data-driven Machine Translation (MT),
based on using real-world translation examples/data, is now the
most prevalent approach
3 approaches to data-driven MT:
Example Based (EBMT)
Statistical Based (SMT)
Hybrid models (a mix of different approaches; may include non-data
driven approaches such as rule-based) which use some probabilistic
processing
All need a parallel corpus (or bitext) of aligned sentences
We can create the resource manually; otherwise, if we have un-aligned
bilingual texts, we can automate the alignment
Automatic Alignment (1/2)
Most alignments are one-to-one:
E1: Often, in the textile industry, businesses close their plant in Montreal to
move to the Eastern townships.
F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va
s’installer dans les Cantons de l’Est.
E2: There is no legislation to prevent them from doing so, for it is a matter of
internal economy.
F2: Il n’y a aucune loi pour empêcher cela, c’est de la régie interne.
E3: That is serious.
F3: C’est grave.
Automatic Alignment (2/2)
But not always:
E1: Honourable members opposite scoff at the freeze suggested by this
party; to them it is laughable.
F1: Les députés d’en face se moquent du gel qu’a proposé notre parti.
F2: Pour eux, c’est une mesure risible.
E1 ↔ { F1, F2 }
So some (like this) have one sentence correspond to two, or
more, or none. Or there may be a two-to-two sentence alignment
without there being a one-to-one relation between the component
sentences.
Good Language Models Come from Good Corpora (1/2)
Any statistical approach to MT requires the availability of aligned bilingual
corpora which are:
large;
good-quality;
representative
Class Question
Assume the following (tiny) corpus:
Mary and John have two children.
The children that Mary and John have are aged 3 and 4.
John has blue eyes.
Q1: what’s P(have) vs. P(has) in a general corpus? Which is more likely?
Q2: what’s P(have | John) vs. P(has | John) in a general corpus?
Q3: what’s P(have) vs. P(has) in this corpus? What’s their relative probability?
Q4: what’s P(have | John) vs. P(has | John) in this corpus?
* SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition (Jurafsky & Martin)
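The counts behind Q3 and Q4 can be checked with a few lines of Python. This is a minimal relative-frequency sketch (no smoothing, within-sentence bigrams only; the tokenisation is my own):

```python
from collections import Counter

# The three-sentence corpus from the slide, pre-tokenised and lowercased.
corpus = [
    "mary and john have two children .",
    "the children that mary and john have are aged 3 and 4 .",
    "john has blue eyes .",
]

tokens = [w for s in corpus for w in s.split()]
unigrams = Counter(tokens)
bigrams = Counter(b for s in corpus
                  for b in zip(s.split(), s.split()[1:]))  # within-sentence bigrams

total = len(tokens)                       # 25 tokens in all
p_have = unigrams["have"] / total         # Q3: 2/25
p_has = unigrams["has"] / total           # Q3: 1/25

# Q4: P(w | john), estimated from bigram counts
p_have_given_john = bigrams[("john", "have")] / unigrams["john"]  # 2/3
p_has_given_john = bigrams[("john", "has")] / unigrams["john"]    # 1/3
```

Note how skewed the tiny corpus is: in this corpus P(have | John) comes out at 2/3, twice P(has | John), even though standard English requires *has* after the singular subject *John*.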
Good Language Models Come from Good Corpora (2/2)
Assume a different small corpus:
Am I right, or am I wrong?
Peter and I are seldom wrong.
I am sometimes right.
Sam and I are often mistaken.
Q5: What two generalisations would a probabilistic language model (based on
bigrams, say) infer from this data, which are not true of English as a whole? Are
there any other generalisations that could be inferred?
Q6: Try to think of some trigrams (and 4-grams, if you can) that cannot be
‘discovered’ by a bigram model? What you’re looking for here is a phrase where the
third (or subsequent) word depends on the first word, which in a bigram model is
‘too far away’...
Note that all the sentences in these corpora are well-formed. If, on the other hand,
the corpus contains ill-formed input, then that too will skew our probability models ...
Bilingual Corpora (1/2)
The previous examples were all monolingual, but the same considerations
apply to bilingual corpora.
One issue that we can think about now is what corpora we’re going
to extract our probabilistic language (and translation) models from.
What sorts of large, good quality, representative bilingual corpora
exist?
Canadian Hansards
Proceedings of the Hong Kong parliament
Dáil Proceedings
…
i.e. while the statistical techniques can be applied to any pair of
languages, this approach is currently limited to only a few language
pairs.
Bilingual Corpora (2/2)
W.r.t. ‘representativeness’, consider the following from a 9000 word
experiment (Brown et al., 1990) on the Canadian Hansards ...
French      Probability
???         .808
entendre    .079
entendu     .026
entends     .024
entendons   .013
Q1: For what English word are these possible candidate translations?
Q2: What’s “???”?
Beware of sparse data and unrepresentative corpora!!
Ditto poor quality language ... though I’ll come back to this one!
Bilingual Corpora (2/2)
W.r.t. ‘representativeness’, consider the following from a 9000 word
experiment (Brown et al., 1990) on the Canadian Hansards ...

French      Probability
bravo       .808
entendre    .079
entendu     .026
entends     .024
entendons   .013
Q1: For what English word are these possible candidate translations?
Q2: What’s “???”?
Beware of sparse data and unrepresentative corpora!!
Ditto poor quality language ... though I’ll come back to this one!
If the corpora are small, or of poor quality, or are unrepresentative, then
our statistical language models will be poor, so any results we achieve will
be poor.
Is the World Wide Web a good corpus to use?
Let’s imagine you want to find the correct translation of the French
compound (a word that consists of more than one token/stem) “groupe de
travail”.
For each of the two French nouns here, there are several translations:
groupe: cluster, group, grouping, concern, collective
travail: work, labor, labour
If groupe de travail is not in our dictionary (cf. proliferation of compounds),
but the two composite nouns are, then any compositional translation is
multiply ambiguous.
Now here’s the trick: let’s search for all 15 possible pairs on the WWW. We
might find:
labour cluster: 2
labour concern: 9
.....
work group: 66593
Is the World Wide Web a good corpus to use?
There are at least two ways in which we could attempt to resolve
the potential ambiguity:
simply take the most often occurring (frequency) term as the
translation;
see which candidate translations surpass some threshold (in terms of
relative frequency, say)
cf. share price, stock price: assuming these two massively outrank
all other candidates, we can keep both as synonymous translations.
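Both disambiguation strategies can be sketched in a few lines. The three hit counts shown on the slide are kept; every other count is an invented placeholder (we are not actually querying the web here), and the 0.3 threshold is an illustrative choice:

```python
# Candidate translations for the French compound "groupe de travail".
groupe = ["cluster", "group", "grouping", "concern", "collective"]
travail = ["work", "labor", "labour"]
candidates = [f"{t} {g}" for t in travail for g in groupe]  # 15 pairs

# Web hit counts: the three shown on the slide are kept; all other
# counts are invented placeholders, not real search results.
hits = {c: 0 for c in candidates}
hits.update({"labour cluster": 2, "labour concern": 9, "work group": 66593})

# Strategy 1: take the most frequently occurring candidate.
best = max(candidates, key=lambda c: hits[c])

# Strategy 2: keep every candidate above a relative-frequency threshold,
# which allows near-synonymous translations (cf. share price / stock price)
# to survive together.
total = sum(hits.values())
keep = [c for c in candidates if hits[c] / total > 0.3]
```

With these counts both strategies select "work group"; with two dominant candidates, strategy 2 would keep both.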
So, we can use the WWW to create a very large bilingual lexicon ...
Can we use the WWW to extract parallel corpora?
Lots of pages where you can click on a link to get a version of that
page in a different language – how to find these?
Query a search engine for a string “English” ‘not too far away’ from
the string “Spanish”. This might get you things like:
Click here for English version vs. Click here for Spanish version
But also:
English literature vs. Spanish literature
What then? Need a process to evaluate these candidates. Using
something like ‘diff’, you could align each case and manually
remove poor candidates:
Can we use the WWW to extract parallel corpora?
Comparison using ‘diff’
sunsite{away}691: diff a b
1c1
< Mary and John have two children.
---
> Maria und Johannes haben zwei Kinder.
3c3
< The children that Mary and John have are aged 3 and 4.
---
> Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.
5c5
< John has blue eyes.
---
> Johannes hat blaue Augen.
Or automate the comparison fully using string compare routines
measured against some % threshold....
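A fully automated comparison might look like the sketch below. Since the two pages are in different languages, a literal string compare is not useful; this stand-in treats a pair as plausibly parallel when the pages have the same number of non-empty lines and a high average per-line length ratio. The function name and the 50% threshold are illustrative choices, not from the source:

```python
def similar_structure(src_lines, tgt_lines, threshold=0.5):
    """Crude automated filter for candidate parallel pages: require the
    same number of non-empty lines, and an average per-line character
    length ratio above `threshold`."""
    src = [l for l in src_lines if l.strip()]
    tgt = [l for l in tgt_lines if l.strip()]
    if len(src) != len(tgt) or not src:
        return False
    # Per-line length similarity in [0, 1], averaged over the page.
    scores = [min(len(a), len(b)) / max(len(a), len(b))
              for a, b in zip(src, tgt)]
    return sum(scores) / len(scores) >= threshold
```

On the English/German example above, each line pair has a length ratio around 0.8, so the pair clears the threshold comfortably; a page pair with mismatched line counts is rejected outright.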
Sentence Alignment
Manual construction of aligned corpora (which are essential for
probabilistic techniques) also avoids the considerable problem of
trying to align source and target texts.
But what if we already have bilingual corpora which are not
aligned?!
Automation!
Automatic Alignment
We’ve just seen one (novel) way of automating this process (using
simple string comparison techniques).
What are the problems?
Most alignments are one-to-one ... but some are not, as we saw
previously:
Some have one sentence correspond to two, or more, or none.
Or there may be a two-to-two sentence alignment without there being a
one-to-one relation between the component sentences.
Let’s look at (some of) the major algorithms for aligning pairs of
sentences ...
Gale & Church’s Algorithm (1/4)
Gale & Church (1993): ‘A program for aligning sentences in bilingual
corpora’, in Computational Linguistics 19(1):75–102.
All sentence alignment algorithms are premised on the notion that good
candidates are similar in length (i.e. shorter sentences tend to have
shorter translations than longer sentences).
G&C use this length-based notion, together with the probability that a
typical translation is one of various many-to-many sentence relations (0-1,
1-0, 1-1, 2-1, 1-2, 2-2 etc).
Distance measure: P(match | d), where match = 1-1, 0-1 etc., and d =
difference in length
How to calculate text length?
word tokens;
characters.
G&C use characters. No language maps character by character to another
language, so first calculate the ratio of chars. in L1 to chars. in L2.
e.g. English text = 100,000 chars; Spanish text = 110,000, then scaling ratio = 1.1
Gale & Church’s Algorithm (2/4)
Q: What’s the difference in length between an English text of 50 chars. and
a Spanish text of 56 chars given this scaling ratio (1.1)?
Typically, the longer the text, the bigger the difference, so we need to
normalize the difference in length to take into account longer texts:
d = (lt − c·ls) / √(lm · s²)

where lt and ls are the lengths of the target and source texts respectively,
lm is the average of the two lengths, c is the scaling factor, and s² is the
variance of d in a typical aligned corpus, which G&C suggest is 6.8.
Gale & Church’s Algorithm (3/4)
Terms like P(match | d) are called conditional probabilities
Q: P(throwing at least 7 with two dice)?
Q: P(throwing at least 7 | the first die thrown was a 6)?
Bayes’ Theorem is used to relate these probabilities:

P(match | d) = P(d | match) · P(match) / P(d)
P(d) is a normalizing constant, so can be ignored. P(match) is estimated
from correctly aligned corpora
G&C give one-to-one alignments a probability of 0.89, with the rest of the
probability space assigned to other possible alignments. P(d| match) has to
be estimated (cf. Trujillo, 1999:71).
Gale & Church’s Algorithm (4/4)
Comment:
Aligning complete texts is computationally expensive.
Luckily, we can use paragraphs, sections, headings, chapters etc. to identify
chunks of text and align these smaller elements.
That is, we’re looking for anchors which can identify text chunks in both
languages. Other things can be tags, proper names, acronyms, cognates,
figures, dates…(Q: can you think of others?)
Results?
G&C report error rates of 4.2% on 1316 alignments.
Most errors occur in non 1-1 alignments.
Selection of the best 80% alignments reduces the error rate to 0.7%.
Interestingly, for European languages at least, the algorithm is not sensitive
to different values for c and s², but texts that are noisy, or from very
different languages, can degrade performance.
Recap
Data-driven (e.g. statistical) MT relies on the availability of parallel data that is:
Of sufficient quantity
Of sufficient (i.e. high) quality
Representative of the language we’re trying to model
Poor quality data = poor quality models
Too little data: data sparseness & poor coverage
Too poor quality: bad quality translations
Not representative: model will make wrong assumptions, produce incorrect
translations
Data extracted from the web can be used to create some parallel texts
Best for dictionary extraction
Finding corresponding documents can be difficult
Parallel data needs to be aligned
Document-level alignment
Sentence-level alignment (manual vs automatic, using rel. sentence length ratio)
Gale & Church (1993) character-based sentence alignment
Brown et al.’s Algorithm (1/3)
Brown et al. (1991): ‘Aligning sentences in parallel corpora’, in Proceedings
of 29th Annual Meeting of the Association for Computational Linguistics,
University of California, Berkeley, pp.169–176.
Brown et al. measure the length of a text in terms of the number of word
tokens.
Their method assumes alignments of text fragments are produced by a
Hidden Markov Model (HMM), and the correct alignment is that which
maximizes the output of the HMM
The HMM of Brown et al. models alignment by determining probabilities for
eight types of alignments, or beads:
s, t, st, sst, stt, ¶s, ¶t, ¶s ¶t
i.e. sst indicates that two source strings align with one target string (2-1), and ¶s
signifies that a source paragraph delimiter matches nothing in the target language
(i.e. it is deleted) (1-0).
Brown et al.’s Algorithm (2/3)
[Diagram: English sentences of 17, 25 and 12 words aligned with French
sentences of 19, 20 and 8 words, via st and stt beads]

The probability of an s bead emitting a sentence of length ls = P(ls)
We let P(ls) = probability of finding a sentence of length ls in the source text
Similarly, the probability of a t bead emitting a sentence of length lt = P(lt)
We let P(lt) = probability of finding a sentence of length lt in the target text
For st (i.e. a 1-1 alignment), P(ls, lt) depends on P(ls) and P(lt | ls)
The log of the ratio r of lt to ls is normally distributed with mean μ and
variance σ²; thus,

if r = log(lt / ls), then P(lt | ls) = α · exp(−(r − μ)² / (2σ²)),

with α a constant ensuring that, summed over all values of lt, P(lt | ls) = 1
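The length-conditioned probability can be sketched as follows. α is left as a parameter here (computing the true normalising constant would require summing over all candidate target lengths), so the value returned is an unnormalised score:

```python
import math

def p_lt_given_ls(lt, ls, mu, sigma2, alpha=1.0):
    """Unnormalised P(lt | ls) = alpha * exp(-(r - mu)^2 / (2*sigma2)),
    with r = log(lt / ls). In Brown et al.'s model, alpha is fixed so
    the probabilities sum to 1 over all candidate target lengths lt."""
    r = math.log(lt / ls)
    return alpha * math.exp(-((r - mu) ** 2) / (2 * sigma2))
```

With mu = 0 the score peaks when the two sentences have equal length and decays as the log length ratio moves away from the mean, which is exactly the length-similarity intuition behind all these algorithms.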
Brown et al.’s Algorithm (3/3)
For stt (or sst) beads, r is calculated by adding the lengths of the two
target (or two source) strings.
The probability of the two sentences is calculated as P(lt1) · P(lt2) and
P(ls1) · P(ls2), respectively
For this method, Brown et al. calculated P(ls) and P(lt) based on a sample
corpus of 1000 sentences with a maximum length of 81 words.
The method used anchor points (major: paragraph markers; minor: session
points, time stamps, names of speakers…) to guide the alignment process
only consider sections that contain the same number of minor anchor points
Results: Reported error rates for st beads for English-French are 0.9%
(very similar to Gale & Church’s approach)
Kay & Röscheisen’s Algorithm (1/2)
Kay & Röscheisen (1993): ‘Text-translation alignment’, in Computational
Linguistics 19(1):121–142
Text-based alignment assumes set of anchors, parallel text and access to
an English-French dictionary
Two sentences will be aligned iff word translation pairs from bilingual
dictionary are found with sufficient frequency between two sentences
(relative to size and to other sentences in corpus)
Compared with length-based algorithms, a disadvantage is that it requires a
bilingual dictionary – is one always available?
Uses information on relative sentence position for English-French
Canadian Hansards.
Previously seen relative sentence length (length-based algorithms)
Kay & Röscheisen’s Algorithm (2/2)
Assume 10 source and 12 target sentences.
Initial estimate alignment maps (s1, t1) and (s10, t12).
Let’s look at s3: assume this aligns with t3 +/- 3 sentences,
i.e. s3 is estimated to align with t1–t6.

[Diagram: s1–s10 listed against t1–t12; ‘file’ occurs in s3 and s4,
‘fichier’ occurs in t2 and t5]

Produce a set of word alignments by selecting (ws, wt) word pairs in
currently aligned sentences.
For example: if file occurs in s3 and s4, and fichier occurs in t2 and t5,
then sentence alignments (s3, t2) and (s4, t5) are made and these
alignments become additional anchors.
Caveat: sentence anchors may not cross,
e.g. if we have (s1, t1) and (s3, t2) then s2 can only align with t1 or t2,
but nothing after t2 as this is already aligned with s3.
Results: the algorithm converges after 4–5 iterations and, on the first 1000
sentences, all but 7 were aligned correctly; but time/space complexity is high.
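The initial envelope estimate can be sketched as follows. The function name is my own, and interpolating between the two corner anchors (s1, t1) and (s10, t12) is my reading of the slide:

```python
def target_range(i, n_src, n_tgt, slack=3):
    """Kay & Röscheisen-style initial envelope: interpolate source
    sentence i (1-based) between the corner anchors (s1, t1) and
    (s_{n_src}, t_{n_tgt}), then allow +/- `slack` sentences."""
    centre = round(1 + (i - 1) * (n_tgt - 1) / (n_src - 1))
    return max(1, centre - slack), min(n_tgt, centre + slack)
```

For the slide's example (10 source, 12 target sentences), s3 maps near t3, so with a slack of 3 it is estimated to align somewhere in t1–t6. As word-pair evidence adds new anchors, the envelopes shrink on each iteration.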
Exact Matches
In Translation Memory (TM) (and EBMT systems) we would like to first
look for exact matches for input or source language sentences
Some non-exact matches:
Different spelling:
Change the colour of the font.
Change the color of the font.
Different punctuation:
Open the file and select the text.
Open the file, and select the text.
Different inflection:
Delete the document.
Delete the documents.
Different numbers:
Use version 1.1.
Use version 1.2.
Different formatting:
Click on OK. (plain)
Click on OK. (with OK in bold)
Fuzzy Matches
Exact matches often do not occur; TM systems then use “fuzzy” matching:
A Fuzzy Match
New input segment: The specified operation failed because it requires the Character
to be active.
Stored TM Unit:
EN: The specified language for the Character is not supported on
the computer
FR: La langue spécifiée pour le Compagnon n’est pas prise en
charge par cet ordinateur.
Fuzzy Matches
Exact matches often do not occur; TM systems then use “fuzzy” matching:
Multiple Fuzzy Matches in Ranked Order
New input segment: The operation was interrupted because the Character was hidden.
Best Match:
EN: The operation was interrupted because the Listening key was
pressed.
FR: L’opération a été interrompue car la touche d’écoute a été
enfoncée.
2nd Best Match:
EN: The specified method failed because the Character is hidden.
FR: La méthode spécifiée a échoué car le Compagnon est masqué.
3rd Best Match:
EN: The operation was interrupted by the application.
FR: L’opération a été interrompue par l’application.
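A toy fuzzy matcher over TM units like those above can be built with Python's SequenceMatcher as a stand-in similarity score. Production TM systems use their own (typically token-based) fuzzy-match measures, so the exact ranking may differ from the slide's:

```python
from difflib import SequenceMatcher

def rank_fuzzy_matches(segment, tm_units, top_n=3):
    """Rank stored (EN, FR) TM units by surface similarity of the EN
    side to the new input segment; return the top_n best scored."""
    scored = [(SequenceMatcher(None, segment.lower(), en.lower()).ratio(), en, fr)
              for en, fr in tm_units]
    return sorted(scored, reverse=True)[:top_n]
```

Above some similarity threshold, the FR side of the best-scoring unit is offered to the translator for post-editing.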
Subsentential Alignment (1/4)
Words and terms tend to have similar distributions within two texts and can
be expected to occur in the same aligned sentences.
Word alignment -> bilingual dictionaries.
e.g. ‘computer’ occurs in 30 English strings and ‘computadora’ in 29
Spanish strings and nowhere else: strong chance they are translations of
each other.
Can measure this co-occurrence using Mutual Information:

I(s; t) = log [ P(s, t) / (P(s) · P(t)) ]

e.g. given an aligned corpus of 100 strings, if ‘store’ occurs in 30 aligned
English strings, ‘speichern’ occurs in 20 aligned German sentences and
these alignments coincide in 19 cases, then for this corpus (taking logs
base 2):

I(store; speichern) = log [ (19/100) / ((30/100) × (20/100)) ] = 1.66
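The slide's example works out as follows (note the logarithm is base 2, which is what yields 1.66):

```python
import math

def mutual_information(n_joint, n_s, n_t, n_total):
    """Pointwise MI of a source/target word pair from co-occurrence
    counts over n_total aligned sentence pairs:
    I(s; t) = log2[ P(s, t) / (P(s) * P(t)) ]."""
    p_joint = n_joint / n_total
    return math.log2(p_joint / ((n_s / n_total) * (n_t / n_total)))

mi = mutual_information(19, 30, 20, 100)  # store/speichern example, ~1.66
```

A pair that co-occurred only as often as chance predicts (P(s, t) = P(s)·P(t)) would score exactly 0.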
Subsentential Alignment (2/4)
Mutual information values:
positive values indicate the occurrence of one word strongly predicts that the
other word will occur
A value near 0 indicates that they occur independently
Negative values predict that if one word occurs the other does not
Problem with MI: doesn’t provide probabilities (difficult to use in SMT
systems)
Trick: use MI to align source–target words; a threshold can be determined
empirically and used to rank them. If ws has a higher MI score with wt1 than
with wt2, then (ws, wt2) is deleted as a less likely translation.
Subsentential Alignment (3/4)
Different technique: the translation of a source word can be found in ‘about
the same position’ in the target sentence to which it is aligned or, if not,
then within a fixed distance.
A Simple Aligned Text
EN: Start the operating system
ES: Comenzar el sistema operativo
EN: Launch the program via the
keyboard
ES: Empezar el programa mediante
el teclado
Algorithm:
1. Estimate word translation and offset (relates to word position) probabilities.
2. Find the most probable alignment for the given text.
Naïve alignment: for a source word 𝑀𝑠 located 25% of the way through a source text,
its translation 𝑀𝑑 is located 25% of the way +/- 20 words.
Ignoring function words (e.g. prepositions, determiners etc.) taken from a stoplist,
we might induce the following alignments:
Subsentential Alignment (4/4)
A Simple Aligned Text
EN: Start the operating system
ES: Comenzar el sistema operativo
EN: Launch the program via the keyboard
ES: Empezar el programa mediante el teclado

Induced alignments (* marks an incorrect one):
comenzar ↔ start
*sistema ↔ start
*empezar ↔ system
teclado ↔ keyboard
Two are right, two are wrong… but their correct translations lie close by.
Assuming enough text, the pairings (sistema, system) and (empezar, launch) can
be confirmed as correct alignments. This leaves fewer candidates to be aligned.
For European languages at least, we can assume that words which are close together in
the source translate as words which are close together in the target.
Similarly, words at the start/end of a source sentence map to words at the start/end of the target
sentence.
Other subsentential alignment techniques:
Distance between occurrences of source word mirrors distance between occurrence of a target
word.
Confirm aligned word pairs, using a third corpus as reference.
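The naive positional scheme can be sketched as below. The function and parameter names are illustrative: `window` is the slide's fixed distance (here in tokens, like the ±20 words mentioned), and the stoplist filters function words:

```python
def positional_candidates(src_tokens, tgt_tokens, window=20, stoplist=()):
    """For each source content word, propose target words whose position
    is within `window` tokens of the source word's relative position,
    mapped proportionally onto the target sentence."""
    candidates = {}
    for i, ws in enumerate(src_tokens):
        if ws in stoplist:
            continue
        # Map the source position onto the target sentence proportionally.
        centre = round(i / len(src_tokens) * len(tgt_tokens))
        lo, hi = max(0, centre - window), min(len(tgt_tokens), centre + window + 1)
        candidates[ws] = [wt for wt in tgt_tokens[lo:hi] if wt not in stoplist]
    return candidates
```

On the toy sentence pair above, with a tight window, "start" is paired with "comenzar" and "system" with "sistema"/"operativo"; as with the slide's example, some proposals are wrong and need confirmation from more text.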