Building Lexicons
Jae Dong Kim
Matthias Eck
Building Lexicons

• Introduction
• Previous Work
• Translation Model Decomposition
• Reestimated Models
• Parameter Estimation
• Method A
• Method B
• Method C
• Evaluation
• Conclusion
Building Lexicons: Introduction
Definitions
• Translational equivalence: a relation that holds between two expressions with the same meaning, where the two expressions are in different languages
• Statistical translation models: statistical models of translational equivalence
• Empirical estimation of statistical translation models is typically based on parallel texts (bitexts)
• Word-to-word lexicon
  • A list of word pairs (source word, target word)
  • Bidirectional
  • Probabilistic word-to-word lexicon: (source word, target word, probability)
Additional Universal Property
• Translation models benefit from the best of both the empiricist and rationalist traditions
• Models to be proposed:
  • Most word tokens translate to only one word token, approximated by a one-to-one assumption (Method A)
  • Most text segments are not translated word for word, handled by an explicit noise model (Method B)
  • Different linguistic objects have statistically different behavior in translation, handled by translation models conditioned on word classes (Method C)
• Human judgment has shown that each of these three estimation biases improves translation model accuracy over a baseline knowledge-free model
Applications of Translation Models
• Where word order is not important
  • Cross-language information retrieval
  • Multilingual document filtering
  • Computer-assisted language learning
  • Certain machine-assisted translation tools
  • Concordancing for bilingual lexicography
  • Corpus linguistics
  • "Crummy" machine translation
• Where word order is important
  • Speech transcription for translation
  • Bootstrapping of OCR systems for new languages
  • Interactive translation
  • Fully automatic high-quality machine translation
Advantages of translation models
• Compared to handcrafted models:
  • The possibility of better coverage
  • The possibility of frequent updates
  • More accurate information about the relative importance of different translations

[Figure: cross-language IR example. A query is translated (T) into target-language queries Qi and sent to an IR system over a document database; should all translations be given uniform importance?]
Building Lexicons: Previous Work
Models of Co-occurrence
• Intuition: words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words
• A boundary-based model assumes that both halves of the bitext have been segmented into s segments, so that segment U_i in one half of the bitext and segment V_i in the other half are mutual translations, 1 ≤ i ≤ s
• Co-occurrence count by Brown et al.:

  cooc(u,v) = \sum_{i=1}^{s} e_i(u) \cdot f_i(v)

• Co-occurrence count by Melamed:

  cooc(u,v) = \sum_{i=1}^{s} \min[\, e_i(u),\, f_i(v) \,]

  where e_i(u) and f_i(v) are the frequencies of u in U_i and of v in V_i
Nonprobabilistic Translation Lexicons (1)
• Summary of nonprobabilistic translation lexicon algorithms (see the sketch below):
  1. Choose a similarity function S between word types in L1 and word types in L2
  2. Compute association scores S(u,v) for a set of word type pairs (u,v) ∈ L1 × L2 that occur in the training data
  3. Sort the word pairs in descending order of their association scores
  4. Discard all word pairs for which S(u,v) is less than a chosen threshold; the remaining word pairs become the entries of the translation lexicon
• Main difference between the algorithms: the choice of similarity function
• These functions are based on a model of co-occurrence, with some linguistically motivated filtering
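A compact sketch of this generic recipe. The slides leave S open, so a PMI-style association score is used here purely as a stand-in; names and the choice of S are assumptions.

```python
import math
from collections import defaultdict

def build_lexicon(cooc, threshold):
    """Generic nonprobabilistic lexicon extraction: score every co-occurring
    pair, sort by score, and keep pairs above a threshold.
    cooc: dict (u, v) -> co-occurrence count."""
    total = sum(cooc.values())
    u_marg = defaultdict(int)
    v_marg = defaultdict(int)
    for (u, v), c in cooc.items():
        u_marg[u] += c
        v_marg[v] += c
    scored = []
    for (u, v), c in cooc.items():
        s = math.log((c * total) / (u_marg[u] * v_marg[v]))  # PMI-style S(u,v)
        scored.append((s, u, v))
    scored.sort(reverse=True)                                # step 3: descending
    return [(u, v, s) for s, u, v in scored if s >= threshold]  # step 4: threshold
```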
Nonprobabilistic Translation Lexicons (2)
• Problem: the independence assumption in step 2
• Models of translational equivalence that are ignorant of indirect associations have "a tendency ... to be confused by collocates"

[Figure: "He nods his head" aligned with "Il hoche la tête"; a direct association holds between a word and its true translation, while an indirect association arises between a word and the collocates of its translation]

• If all the entries in a translation lexicon are sorted by their association scores, the direct associations will be very dense near the top of the list, and sparser towards the bottom
Nonprobabilistic Translation Lexicons (3)
• The very top of the list can be over 98% correct (Gale and Church 1991)
  • Gleaned lexicon entries for about 61% of the word tokens in a sample of 800 English sentences
  • Selected only entries with a high association score
  • Those 61% of word tokens represent only 4.5% of word types
• 71.6% precision on the top 23.8% of noun-noun entries (Fung 1995)
• Automatic acquisition of 6,517 lexicon entries with 86% precision from a 3.3-million-word corpus (Wu and Xia 1994)
  • 19% recall
  • Weighted precision: in {(E1,C1,0.533), (E1,C2,0.277), (E1,C3,0.190)}, if (E1,C3,0.190) is wrong, the precision is 0.810
  • Weighted precision is higher than the unweighted one
Building Lexicons: Translation Model Decomposition
Decomposition of Translation Model (1)
• Two-stage decomposition of a sequence-to-sequence model
• First stage:
  • Every sequence L is just an ordered bag: the bag B can be modeled independently of its order O

    \Pr(L) = \Pr(B, O) = \Pr(B) \cdot \Pr(O \mid B)
Decomposition of Translation Model (2)
• First stage:
  • Let L1 and L2 be two sequences and let A be a one-to-one mapping between the elements of L1 and the elements of L2

    \Pr(L_1 \mid L_2) = \sum_{A} \Pr(L_1, A \mid L_2)

    \Pr(L_1, L_2) = \sum_{A} \Pr(L_1, A, L_2)

    where

    \Pr(L_1, A \mid L_2) = \Pr(B_1, O_1, A \mid L_2) = \Pr(B_1, A \mid L_2) \cdot \Pr(O_1 \mid B_1, A, L_2)

    \Pr(L_1, A, L_2) = \Pr(B_1, O_1, A, B_2, O_2) = \Pr(B_1, A, B_2) \cdot \Pr(O_1, O_2 \mid B_1, A, B_2)
Decomposition of Translation Model (3)
• First stage:
  • Bag-to-bag translation model:

    \Pr(B_1, B_2) = \sum_{A} \Pr(B_1, A, B_2)
Decomposition of Translation Model (4)
• Second stage: from bags of words to the words that they contain
• Bag pair generation process (how the word-to-word model is embedded):
  1. Generate a bag size l; l is also the assignment size
  2. Generate l language-independent concepts C_1, ..., C_l
  3. From each concept C_i, 1 ≤ i ≤ l, generate a pair of word sequences (u_i, v_i) from L1* × L2*, according to the distribution trans(u, v | C_i), to lexicalize the concept in the two languages; some concepts are not lexicalized in some languages, so one of u_i and v_i may be empty
• Bags: B_1 = {u_1, ..., u_l}, B_2 = {v_1, ..., v_l}
• An assignment: a set of index pairs {(i_1, j_1), ..., (i_l, j_l)} recording which u's and v's were generated from the same concept
Decomposition of Translation Model (5)
• Second stage:
  • The probability of generating a pair of bags (B_1, B_2):

    \Pr(B_1, A, B_2 \mid l, \mathcal{C}, \mathrm{trans}) = \Pr(l) \cdot l! \cdot \prod_{(i,j) \in A} \sum_{C \in \mathcal{C}} \Pr(C) \, \mathrm{trans}(u_i, v_j \mid C)

  • trans(u_i, v_j | C) is zero for all concepts except one, so

    \Pr(B_1, A, B_2 \mid l, \mathrm{trans}) = \Pr(l) \cdot l! \cdot \prod_{(i,j) \in A} \mathrm{trans}(u_i, v_j)

  • trans(u_i, v_j) is symmetric, unlike the models of Brown et al.
The One-to-One Assumption
• u and v may consist of at most one word each
• A pair of bags containing m and n nonempty words can be generated by a process where the bag size l is anywhere between max(m,n) and m+n
• Not as restrictive as it may appear: what if we extend the definition of a word so that it may include spaces (i.e., treat multi-word units as single words)?
Building Lexicons: Reestimated Models
Reestimated Seq.-to-Seq. Trans. Model (1)
• Variations on the theme proposed by Brown et al.
• Conditional probabilities, but they can be compared to symmetric models if the latter are normalized marginally
• Only co-occurrence information
• EM re-estimation, where z is a normalizing constant and the sums range over aligned segment pairs (U,V) in the bitext (a code sketch follows below):

  \mathrm{trans}_i(v \mid u) = z \sum_{(U,V)} \frac{\mathrm{trans}_{i-1}(v \mid u)\, e(u)\, f(v)}{\sum_{u' \in U} \mathrm{trans}_{i-1}(v \mid u')}

• With a uniform initial estimate p, the first iteration reduces to

  \mathrm{trans}_1(v \mid u) = z \sum_{(U,V)} \frac{p\, e(u)\, f(v)}{p\, |U|} = z \sum_{(U,V)} \frac{e(u)\, f(v)}{|U|}

• When information about segment lengths is not available, |U| is replaced by a constant c:

  \mathrm{trans}_1(v \mid u) = z \sum_{(U,V)} \frac{e(u)\, f(v)}{c} = \frac{z}{c} \sum_{(U,V)} e(u)\, f(v)

  i.e., the first iteration uses essentially nothing but the co-occurrence count cooc(u,v) of Brown et al.
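A minimal sketch (not from the slides) of this Model-1-style EM re-estimation over a segment-aligned bitext; the names are illustrative, and NULL words and smoothing are omitted.

```python
from collections import defaultdict

def model1_reestimate(bitext, iterations=5):
    """Model-1-style EM: trans(v|u) re-estimated from expected counts.
    bitext: list of (U, V) segment pairs, each segment a list of tokens."""
    trans = defaultdict(lambda: 1.0)          # uniform (unnormalized) start
    for _ in range(iterations):
        counts = defaultdict(float)           # expected link counts c(u, v)
        for U, V in bitext:
            for v in V:
                denom = sum(trans[(u2, v)] for u2 in U)
                for u in U:
                    counts[(u, v)] += trans[(u, v)] / denom   # E-step
        # M-step: normalize over v for each u to obtain trans(v|u)
        totals = defaultdict(float)
        for (u, v), c in counts.items():
            totals[u] += c
        trans = defaultdict(float,
                            {(u, v): c / totals[u] for (u, v), c in counts.items()})
    return trans
```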
Reestimated Seq.-to-Seq. Trans. Model (2)
• Word order correlation biases
  • In any bitext, the positions of words relative to the true bitext map correlate with the positions of their translations
  • The word order correlation bias is most useful when it has high predictive power
  • Absolute word positions (Brown et al. 1988)
  • A much smaller set of relative offset parameters (Dagan, Church, and Gale 1993)
  • Even more efficient parameter estimation using an HMM with some additional assumptions (Vogel, Ney, and Tillmann 1996)
Reestimated Bag-to-Bag Trans. Models
• Another bag-to-bag model, by Hiemstra (1996)
  • The same: one-to-one assumption
  • The difference: empty words are allowed in only one of the two bags, the one representing the shorter sentence
  • Uses the Iterative Proportional Fitting Procedure (IPFP) for parameter estimation
  • IPFP is sensitive to initial conditions
  • With the most advantageous initial conditions, it is more accurate than Model 1
Building Lexicons: Parameter Estimation
Parameter Estimation
• Methods for estimating the parameters of a symmetric word-to-word translation model from a bitext
• We are interested in the probability trans(u,v): the probability of jointly generating the pair of words (u,v)
• trans(u,v) cannot be inferred directly: it is unknown which words were generated together
• The only quantity observable in the bitext is cooc(u,v), the co-occurrence count
Definitions
• Link counts links(u,v): a hypothesis about the number of times u and v were generated together
• Link token: an ordered pair of word tokens
• Link type: an ordered pair of word types
• links(u,v) ranges over link types
• trans(u,v) can be calculated from links(u,v):

  \mathrm{trans}(u,v) = \frac{\mathrm{links}(u,v)}{\sum_{u',v'} \mathrm{links}(u',v')}
Definitions (continued)
• score(u,v): the chance that u and v can ever be mutual translations
  • Similar to trans(u,v), but more convenient for estimation
• The relationship between trans(u,v) and score(u,v) can be direct (depending on the model)
General outline for all Methods
This is in fact an EM-style algorithm (a minimal sketch follows below):
1. Initialize the score parameters to a first approximation based only on cooc(u,v)  (initialization)
REPEAT
2. Approximate links(u,v) based on score and cooc  (E-step)
3. Calculate trans(u,v); stop if there is only little change
4. Re-estimate score(u,v) based on links and cooc  (M-step)
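The same outline as a Python skeleton (a sketch, not the authors' code). It assumes the method-specific pieces are supplied as functions; initial_score, estimate_links, and reestimate_score are placeholder names for what the following slides describe.

```python
def estimate_translation_model(cooc, initial_score, estimate_links,
                               reestimate_score, tol=1e-4, max_iter=20):
    """Generic outline shared by Methods A, B, and C.
    cooc: dict (u, v) -> co-occurrence count."""
    score = initial_score(cooc)                       # step 1: init from cooc only
    trans_old = None
    for _ in range(max_iter):
        links = estimate_links(score, cooc)           # step 2 (E-step): link counts
        total = sum(links.values())
        trans = {uv: c / total for uv, c in links.items()}  # step 3: normalize
        if trans_old is not None and all(
                abs(trans.get(uv, 0.0) - trans_old.get(uv, 0.0)) < tol
                for uv in set(trans) | set(trans_old)):
            break                                     # stop if only little change
        trans_old = trans
        score = reestimate_score(links, cooc)         # step 4 (M-step): new scores
    return trans
```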
EM: Maximum Likelihood Approach
• Find the parameters that maximize the probability of the given bitext:

  \hat{\theta} = \arg\max_{\theta} \Pr(U, V \mid \theta)

  \Pr(U, V \mid \theta) = \sum_{A} \Pr(U, A, V \mid \theta)

• The sum over assignments cannot be decomposed, due to the one-to-one assumption (compare to Brown et al. 1993)
• The exact MLE approach is therefore infeasible
• An approximation to EM is necessary
Maximum a Posteriori
• Evaluate expectations using only the single most probable assignment (the maximum a posteriori, MAP, assignment):

  A_{\max} = \arg\max_{A} \Pr(U, A, V \mid \theta)

           = \arg\max_{A} \Pr(l) \cdot l! \cdot \prod_{(i,j) \in A} \mathrm{trans}(u_i, v_j)

           = \arg\max_{A} \log\Big( \Pr(l) \cdot l! \cdot \prod_{(i,j) \in A} \mathrm{trans}(u_i, v_j) \Big)

           = \arg\max_{A} \Big( \log(\Pr(l) \cdot l!) + \sum_{(i,j) \in A} \log \mathrm{trans}(u_i, v_j) \Big)

           = \arg\max_{A} \sum_{(i,j) \in A} \log \mathrm{trans}(u_i, v_j)

• l: the number of concepts, i.e. the number of generated word pairs
• l and Pr(l) are constant, so the first term can be dropped
Bipartite Graph
• From the MAP derivation:

  A_{\max} = \arg\max_{A} \sum_{(i,j) \in A} \log \mathrm{trans}(u_i, v_j)

  \mathrm{score}_A(u, v) = \log \mathrm{trans}(u, v)

• Represent the bitext as a bipartite graph: the u tokens on one side, the v tokens on the other, with edge weight log trans(u,v)
• Finding A_max is then a weighted maximum matching problem (see the sketch below)
• An exact solution is still too expensive
• The Competitive Linking algorithm approximates it
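To make the matching view concrete, here is a small sketch (my own illustration, not from the slides) that solves the weighted matching exactly for one short segment pair using SciPy's assignment solver; NULL links are ignored, and the Competitive Linking algorithm below is the cheap greedy alternative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_assignment(U, V, trans, floor=1e-12):
    """Exact MAP assignment for one segment pair under the one-to-one assumption.
    trans: dict (u, v) -> joint probability; unseen pairs get a small floor."""
    W = np.log([[max(trans.get((u, v), 0.0), floor) for v in V] for u in U])
    rows, cols = linear_sum_assignment(W, maximize=True)  # maximize sum of log trans
    return [(U[i], V[j]) for i, j in zip(rows, cols)]
```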
Building Lexicons: Method A
Method A: Competitive Linking
Step 1: Initial scores
• Co-occurrence counts arranged as a 2x2 contingency table for each word pair (u,v):

          u            !u            Total
  v       cooc(u,v)    cooc(!u,v)    cooc(.,v)
  !v      cooc(u,!v)   cooc(!u,!v)   cooc(.,!v)
  Total   cooc(u,.)    cooc(!u,.)    cooc(.,.)

• Use the information in the whole table
• Initialize score(u,v) to G²(u,v), the log-likelihood-ratio statistic (similar to chi-square); a sketch follows below
• Good-Turing smoothing gives improvements
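A small sketch (an assumed helper, not from the slides) of the G² statistic computed from the 2x2 table above:

```python
import math

def g_squared(c_uv, c_u, c_v, total):
    """G^2 (log-likelihood-ratio) score for a word pair from its 2x2 table.
    c_uv = cooc(u,v), c_u = cooc(u,.), c_v = cooc(.,v), total = cooc(.,.)."""
    # Observed cell counts of the 2x2 table
    obs = [c_uv, c_u - c_uv, c_v - c_uv, total - c_u - c_v + c_uv]
    # Expected cell counts under independence of u and v
    exp = [c_u * c_v / total,
           c_u * (total - c_v) / total,
           (total - c_u) * c_v / total,
           (total - c_u) * (total - c_v) / total]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
```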
Step 2: Estimation of link counts
• The Competitive Linking algorithm is employed
• A greedy approximation to the MAP assignment (sketch below)
Algorithm:
1. Sort all score(u,v) from highest to lowest
2. For each score(u,v), in order:
   • Link all co-occurring token pairs (u,v) in the bitext (if u is NULL, consider all tokens of v in the bitext linked to NULL, and vice versa)
   • One-to-one assumption: linked words cannot be linked again, so remove all linked words from the bitext
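A minimal per-segment sketch of Competitive Linking (my own illustration, assuming no NULL words): greedily link the highest-scoring co-occurring pair and remove both tokens from further consideration.

```python
def competitive_linking(U, V, score):
    """Greedy one-to-one linking of one segment pair.
    U, V: lists of tokens; score: dict (u, v) -> score. Returns linked pairs."""
    candidates = sorted(
        ((score[(u, v)], i, j) for i, u in enumerate(U) for j, v in enumerate(V)
         if (u, v) in score),
        reverse=True)
    used_i, used_j, links = set(), set(), []
    for s, i, j in candidates:
        if i not in used_i and j not in used_j:   # one-to-one assumption
            links.append((U[i], V[j]))
            used_i.add(i)
            used_j.add(j)
    return links

# links(u,v) is then incremented once for every linked token pair across the bitext.
```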
Example: Competitive Linking

[Figure, three build slides: a worked example of Competitive Linking on a small score table; at each step the highest-scoring remaining pair is linked and its row and column are removed from consideration]
Competitive Linking per sentence

[Figure: Competitive Linking is applied to each sentence pair separately; in one pair the links (a,c) and (b,d) are made (links(a,c)++, links(b,d)++), in another the links (a,d) and (b,e) are made (links(a,d)++, links(b,e)++); the link counters accumulate over the whole bitext]
Building Lexicons: Method B
Method B:
• "Most texts are not translated word-for-word"
• Why is that a problem for Method A?

[Figure: aligned segments containing tokens (such as x) with no translation on the other side; under the one-to-one assumption, Competitive Linking is eventually forced to make an incorrect link: "We are forced to connect (b,d)!"]
Method B:
• After one iteration of Method A on 300k sentence pairs of Hansard data:
  • links = cooc: frequent, probably correct
  • links < cooc: rare, might be correct
  • links << cooc: frequent, probably incorrect
Method B:
• Use the ratio links(u,v)/cooc(u,v) to bias parameter estimation
• Introduce p(u,v) as the probability of u and v being linked when they co-occur
• This leads to a binomial process for each co-occurrence (either linked or not linked)
• The data are too sparse to model p(u,v) for each pair, so use just two cases:
  • p(u,v) = λ+ if u and v are mutual translations (rate of true positives)
  • p(u,v) = λ− if u and v are not mutual translations (rate of false positives)
Method B: Maximum Likelihood Estimation
• λ+ and λ− are estimated by maximum likelihood on 300k sentence pairs of Hansard data
Method B:
Overall score calculation for Method B (a code sketch follows below):
• Probability of generating links(u,v) links given cooc(u,v) co-occurrences if the links are correct (u and v are mutual translations):

  B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{+})

• Probability of generating links(u,v) links given cooc(u,v) co-occurrences if the links are incorrect:

  B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{-})

• The score is the log of their ratio:

  \mathrm{score}_B(u,v) = \log \frac{B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{+})}{B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{-})}
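A sketch of this score using the binomial pmf (illustrative names only; λ+ and λ− stand for the fitted link rates):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability B(k | n, p)."""
    return math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

def score_b(links, cooc, lam_plus, lam_minus):
    """Method B style score: log ratio of the two binomial likelihoods."""
    return math.log(binom_pmf(links, cooc, lam_plus) /
                    binom_pmf(links, cooc, lam_minus))

# e.g. score_b(links=3, cooc=4, lam_plus=0.7, lam_minus=0.005) is large and positive
```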
Building Lexicons: Method C
Method C:
• Improved estimation using preexisting word classes
• Methods A and B: all word pairs that co-occur the same number of times and are linked the same number of times are assigned the same score
• But frequent words are translated less consistently than rare words
• Introduce word classes so that the auxiliary parameters are estimated separately for each class Z = class(u,v) (see the sketch below):

  \mathrm{score}_C(u, v \mid Z = \mathrm{class}(u,v)) = \log \frac{B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{+}_{Z})}{B(\mathrm{links}(u,v) \mid \mathrm{cooc}(u,v), \lambda^{-}_{Z})}
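The change relative to Method B is only that the binomial parameters are looked up per class. A sketch reusing the score_b helper from the Method B sketch above; the per-class tables and the pair_class function are illustrative assumptions.

```python
def score_c(links, cooc, u, v, pair_class, lam_plus, lam_minus):
    """Method C style score: the binomial parameters depend on the
    class Z of the word pair (u, v).
    pair_class: function (u, v) -> class label Z
    lam_plus, lam_minus: dicts mapping class label -> fitted rate."""
    Z = pair_class(u, v)
    return score_b(links, cooc, lam_plus[Z], lam_minus[Z])  # score_b: Method B sketch
```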
Building Lexicons: Evaluation
Method C for Evaluation
We have to choose classes (a toy classifier sketch follows below):
• EOS: end-of-sentence punctuation
• EOP: end-of-phrase punctuation (e.g. , and ;)
• SCM: subordinate clause markers (e.g. quotation marks and parentheses)
• SYM: symbols (e.g. ~ and *)
• NU: the NULL word
• C: content words
• F: function words
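A toy version of such a class assignment (my own illustration; the real class inventories and function-word lists would be defined per language):

```python
def token_class(token):
    """Map a token to one of the word classes used by Method C (toy version)."""
    if token is None:
        return "NU"                       # the NULL word
    if token in {".", "!", "?"}:
        return "EOS"                      # end-of-sentence punctuation
    if token in {",", ";"}:
        return "EOP"                      # end-of-phrase punctuation
    if token in {'"', "(", ")"}:
        return "SCM"                      # subordinate clause markers
    if not token.isalnum():
        return "SYM"                      # other symbols
    if token.lower() in {"the", "a", "of", "and", "le", "la", "de", "et"}:
        return "F"                        # function words (stoplist stand-in)
    return "C"                            # content words

def pair_class(u, v):
    """Class of a word pair, as used by score_c."""
    return (token_class(u), token_class(v))
```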
Experiment 1:
Training data
• 29,614 sentence pairs, French-English (Bible)
Test data
• 250 hand-linked sentences (gold standard)
Procedure
• Single best: the models guess one translation per word on each side
• Whole distribution: the models output all possible translations with their probabilities
Experiment 1 – Results
• Single best – all links (95% confidence intervals)
• Single best – open-class links only (just the content words)
• Whole distribution – all links
• Whole distribution – open-class links only (just the content words)

[Charts comparing the models under each of the four conditions]
Experiment 2:
• Influence of training data size
• Method A is 102% more accurate than Model 1 when trained on only 250 sentence pairs
• Overall, improvements of up to 125%
Evaluation at the Link Type Level
• Sorted scores for all link types
• 1/1, 2/2 and 3/3 correspond to links/cooc
Coverage vs. Accuracy
• "Incomplete": the lexicon contains only part of the correct phrase
Building Lexicons: Conclusion
Conclusion - Overview
• IBM Model 1: co-occurrence information only
• Method A: one-to-one assumption
• Method B: explicit noise model
• Method C: auxiliary parameters conditioned on word classes