Penalized EP for Graphical Models Over Strings
Ryan Cotterell and Jason Eisner
Natural Language is Built from Words
Can store info about each word in a table
Index | Spelling | Meaning | Pronunciation    | Syntax
------|----------|---------|------------------|-------------
123   | ca       |         | [si.ei]          | NNP (abbrev)
124   | can      |         | [kɛɪn]           | NN
125   | can      |         | [kæn], [kɛn], …  | MD
126   | cane     |         | [keɪn]           | NN (mass)
127   | cane     |         | [keɪn]           | NN
128   | canes    |         | [keɪnz]          | NNS
Problem: Too Many Words!
• Technically speaking, # words = ∞
• Really, the set of (possible) words is Σ*
• Names
• Neologisms
• Typos
• Productive processes:
  – friend → friendless → friendlessness → friendlessnessless → …
  – hand + bag → handbag (sometimes can iterate)
Solution: Don’t model every cell separately
[Figure: periodic-table analogy, with labeled regions such as “positive ions” and “noble gases” – a table whose structure predicts its missing cells.]
Can store info about each word in a table
Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text.
Approach: Linguistics + generative modeling + statistical inference.
Modeling ingredients: Finite-state machines + graphical models.
Inference ingredients: Expectation Propagation (this talk).
[Figure: the word table from the previous slide, with these goals overlaid.]
Predicting Pronunciations of Novel Words (Morpho-Phonology)
How do you pronounce the written word “damns”?
[Figure: a morphological paradigm. The suffixes z and eɪʃən attach to the stems dæmn and rizajgn, giving underlying forms dæmnz, dæmneɪʃən, rizajgnz, rizajgneɪʃən. The observed surface pronunciations are dˌæmnˈeɪʃən (damnation), rizˈajnz (resigns), and rˌɛzɪgnˈeɪʃən (resignation); the model must predict the unseen surface form of damns, revealed as dˌæmz.]
Graphical Models over Strings
• Use the graphical model framework to model many strings jointly!
[Figure: a factor graph whose variables X1, X2, … are string-valued. Each belief is a weighted table/automaton over strings (scores for ring, rang, rung, aardvark, …), and each factor ψ1 is a weighted finite-state machine relating its neighboring variables.]
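To ground the idea, here is a minimal sketch (mine, not the talk’s code; the candidate lists and the toy factor psi are invented) of a two-variable factor graph over strings, with exact marginals computed by brute-force enumeration. Real models replace the finite candidate lists with WFSAs over Σ*.

```python
# Hedged sketch: a tiny factor graph over string-valued variables, with
# brute-force marginals over a small candidate set.
import itertools

X1 = ["ring", "rang", "rung"]               # candidate strings
X2 = ["ring", "rang", "rung"]

def psi(a, b):
    # invented toy factor: rewards identical strings
    return 2.0 if a == b else 0.5

Z = sum(psi(a, b) for a, b in itertools.product(X1, X2))
marginal_X1 = {a: sum(psi(a, b) for b in X2) / Z for a in X1}
print(marginal_X1)                          # uniform here: the factor is symmetric
```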
Zooming in on a WFSA
• Compactly represents an (unnormalized) probability distribution over all strings in Σ*
• Marginal belief: How do we pronounce damns?
• Possibilities: /damz/, /dams/, /damnIz/, etc.
[Figure: a small WFSA over pronunciations – arcs d/1, a/1, m/1, then either z/.5 or s/.25, or n/.25 followed by z/1 or by I/1 z/1.]
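Below is a minimal sketch of such a WFSA, assuming nothing about the authors’ implementation: the arcs are hand-coded to mirror the damns picture, and score() runs the forward algorithm, summing weights over accepting paths.

```python
# Hedged sketch of a WFSA: states are ints, arcs carry
# (label, weight, destination), and score() is the forward algorithm.
from collections import defaultdict

class WFSA:
    def __init__(self, start, finals):
        self.start, self.finals = start, set(finals)
        self.arcs = defaultdict(list)       # src -> [(label, weight, dst)]

    def add_arc(self, src, label, weight, dst):
        self.arcs[src].append((label, weight, dst))

    def score(self, string):
        alpha = {self.start: 1.0}           # forward weights per state
        for ch in string:
            nxt = defaultdict(float)
            for state, w in alpha.items():
                for label, aw, dst in self.arcs[state]:
                    if label == ch:
                        nxt[dst] += w * aw
            alpha = nxt
        return sum(w for s, w in alpha.items() if s in self.finals)

# Pronunciations of "damns", mirroring the arcs in the figure above
m = WFSA(start=0, finals={5})
m.add_arc(0, "d", 1.0, 1)
m.add_arc(1, "a", 1.0, 2)
m.add_arc(2, "m", 1.0, 3)
m.add_arc(3, "z", 0.5, 5)                   # /damz/
m.add_arc(3, "s", 0.25, 5)                  # /dams/
m.add_arc(3, "n", 0.25, 4)                  # keep the n, then ...
m.add_arc(4, "z", 1.0, 5)                   # /damnz/
print(m.score("damz"), m.score("dams"), m.score("damnz"))  # 0.5 0.25 0.25
```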
Log-Linear Approximation
• Given a WFSA distribution p, find a log-linear
approximation q
– min KL(p || q) “inclusive KL divergence”
– q corresponds to a smaller/tidier WFSA
• Two Approaches:
– Gradient-Based Optimization (Discussed Here)
– Closed-Form Optimization
ML Estimation = Moment Matching
[Figure: ML estimation as moment matching. From observed strings, broadcast n-gram counts (e.g., bar = 2) into a table (foo 1.2, bar 0.5, baz 4.3); then fit a model that predicts the same counts.]
FSA Approx. = Moment Matching
[Figure: the same picture, but the n-gram counts are now expected counts under a WFSA – compute them with forward–backward! – and we fit a model that predicts the same counts (foo 1.2, bar 0.5, baz 4.3).]
Gradient-Based Minimization
• Objective: KL(p || q_θ) = Σ_x p(x) log [ p(x) / q_θ(x) ]
• Gradient with respect to θ: ∇_θ KL(p || q_θ) = E_{q_θ}[f(x)] − E_p[f(x)]
• Arc weights are determined by a parameter vector θ – just like a log-linear model
• The gradient is a difference between two expectations of feature counts – one under p, one under the weighted DFA q – each computable by forward–backward
• Features are just n-gram counts!
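Here is a hedged sketch of that gradient loop. For illustration, p is enumerated over a tiny invented candidate set rather than computed from a WFSA by forward–backward, and the features are 1- and 2-grams:

```python
# Hedged sketch: fit a log-linear q by moment matching, i.e. gradient
# descent on KL(p || q). In the talk, E_p[f] would come from
# forward-backward over a WFSA; here we just enumerate.
import math
from collections import Counter

def ngrams(s, n):
    s = "^" + s + "$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def feats(s):
    return Counter(ngrams(s, 1) + ngrams(s, 2))

strings = ["ring", "rang", "rung", "rong"]
p = {"ring": 0.5, "rang": 0.3, "rung": 0.2, "rong": 0.0}

target = Counter()                      # target moments E_p[f]
for s, ps in p.items():
    for f, c in feats(s).items():
        target[f] += ps * c

theta = Counter()
for step in range(2000):
    # q(s) proportional to exp(theta . f(s))
    logw = {s: sum(theta[f] * c for f, c in feats(s).items()) for s in strings}
    Z = sum(math.exp(v) for v in logw.values())
    q = {s: math.exp(logw[s]) / Z for s in strings}
    # gradient of KL(p || q) w.r.t. theta is E_q[f] - E_p[f]
    grad = Counter()
    for s, qs in q.items():
        for f, c in feats(s).items():
            grad[f] += qs * c
    grad.subtract(target)
    for f, g in grad.items():
        theta[f] -= 0.2 * g             # plain gradient-descent step
print({s: round(q[s], 3) for s in strings})   # approaches p
```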
Does q need a lot of features?
• Game: what order of n-grams do we need to put probability 1 on a string?
• Word 1: noon
  – Bigram model? No – need a trigram model
• Word 2: papa
  – Trigram model? No – need a 4-gram model – very big!
• Word 3: abracadabra
  – Needs a 6-gram model – way too big!
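A quick hedged check of the first round of the game: any model whose bigram support covers noon also accepts other strings, so no bigram model can put probability 1 on noon (the word list below is invented for illustration).

```python
# Hedged check: strings consistent with noon's bigrams (including the
# boundary bigrams ^n and n$) are not unique, so a bigram model leaks.
bigrams = {"^n", "no", "oo", "on", "n$"}

def consistent(word):
    padded = "^" + word + "$"
    return all(padded[i:i + 2] in bigrams for i in range(len(padded) - 1))

print([w for w in ["noon", "non", "nooon", "no"] if consistent(w)])
# ['noon', 'non', 'nooon'] -- probability mass must be shared among these
```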
Variable Order Approximations
• Intuition: In NLP, marginals are often peaked
  – Probability mass is mostly on a few similar strings!
• q should reward a few long n-grams
  – but it also needs short n-gram features for backoff

6-gram table – too big!
  ^abrac    5.0
  abraca    5.0
  …
  zzzzzz  −500

Variable-order table – very small!
  abra  5.0
  ^a    5.0
  b     4.3
Variable Order Approximations
• Moral: Use only the n-grams you really need!
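One hedged way to realize this moral in code: add an L1 penalty to the moment-matching fit, so most n-gram weights are driven exactly to zero and only the needed n-grams survive. This is a stand-in for the paper’s actual penalty; the strings, the hyperparameters lam and lr, and the ISTA-style soft-threshold update are all illustrative choices.

```python
# Hedged sketch: L1-penalized moment matching. The soft-threshold step
# zeroes out n-gram weights that are not pulling their weight, leaving a
# small variable-order table.
import math
from collections import Counter

def all_ngrams(s, max_n):
    s = "^" + s + "$"
    return Counter(s[i:i + n]
                   for n in range(1, max_n + 1)
                   for i in range(len(s) - n + 1))

strings = ["abracadabra", "abracadabrb", "zzzz"]
p = {"abracadabra": 0.98, "abracadabrb": 0.01, "zzzz": 0.01}
feats = {s: all_ngrams(s, 6) for s in strings}

target = Counter()                      # E_p[f]
for s, ps in p.items():
    for f, c in feats[s].items():
        target[f] += ps * c

theta, lam, lr = Counter(), 0.01, 0.3
for step in range(1000):
    logw = {s: sum(theta[f] * c for f, c in feats[s].items()) for s in strings}
    Z = sum(math.exp(v) for v in logw.values())
    q = {s: math.exp(logw[s]) / Z for s in strings}
    grad = Counter()                    # will hold E_q[f] - E_p[f]
    for s, qs in q.items():
        for f, c in feats[s].items():
            grad[f] += qs * c
    grad.subtract(target)
    for f in set(theta) | set(grad):
        t = theta[f] - lr * grad[f]
        # soft-threshold (proximal step for the L1 penalty)
        theta[f] = math.copysign(max(abs(t) - lr * lam, 0.0), t)
print("active n-grams:", sum(1 for v in theta.values() if v != 0.0),
      "of", len(target), "candidates")
```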
Belief Propagation (BP) in a Nutshell
[Figure: a factor graph over string-valued variables X1–X6. Each message passed along an edge is itself a WFSA – e.g., the small “damns” pronunciation automaton (d/1, a/1, m/1, then z/.5, s/.25, or n/.25 followed by z/1 or I/1 z/1).]
Computing Marginal Beliefs
• The marginal belief at a variable is the pointwise product of the messages arriving at it.
[Figure: variable X3 collects messages from its neighbors X1, X2, X4, X5, X7.]
Computing Marginal Beliefs
• Computing the belief at X3 means intersecting the incoming WFSA messages – and intersection multiplies state spaces, so the computation of the belief results in a large state space.
• What a hairball!
[Figure: the product automaton at X3 is a tangle of states and arcs.]
Computing Marginal Beliefs
• Approximation required!!!
[Figure: the incoming WFSA messages at X3 – the exact product is too big to keep.]
BP over String-Valued Variables
• In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex!
[Figure: a two-variable cycle X1 –ψ1– X2 –ψ2– X1 whose factors can insert an extra “a” (note the a:ε arc). With each round of BP, the messages must represent longer and longer strings of a’s, so the message automata grow without bound.]
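A toy demonstration of the blow-up (illustrative only; real messages are weighted automata, not sets): if a factor on the cycle can append an a, the support a message must represent keeps growing every round.

```python
# Hedged toy: a cyclic factor that optionally appends "a" keeps enlarging
# the support of the message (and the per-length weights a real WFSA
# message would have to track), so it never converges.
support = {"a"}
for bp_round in range(4):
    support |= {s + "a" for s in support}
    print(bp_round, sorted(support, key=len))
```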
Expectation Propagation (EP) in a Nutshell
• EP replaces each WFSA message with a log-linear approximation: a small table of n-gram feature weights (e.g., foo 1.2, bar 0.5, baz 4.3).
[Figure: build by build, each WFSA message flowing toward X3 – from X1, X2, X4, X5, X7 – is swapped for such an n-gram table.]
EP in a Nutshell
• The approximate belief is now a table of n-grams, so the pointwise product is now super easy!
• In log space the product just adds the tables: four messages of (foo 1.2, bar 0.5, baz 4.3) multiply to the belief (foo 4.8, bar 2.0, baz 17.2).
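A hedged sketch matching the numbers above: in log space, the pointwise product of log-linear messages is just the sum of their weight tables.

```python
# Hedged sketch: multiplying log-linear messages = adding their n-gram
# weight tables, since the weights live in log space.
from collections import Counter

def pointwise_product(*messages):
    total = Counter()
    for m in messages:
        total.update(m)                 # Counter.update adds values
    return total

msg = Counter({"foo": 1.2, "bar": 0.5, "baz": 4.3})
print(pointwise_product(msg, msg, msg, msg))
# ~ foo 4.8, bar 2.0, baz 17.2 -- the numbers on the slide
```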
How to approximate a message?
• Form the exact updated belief: the true WFSA message times the current n-gram tables from the other neighbors.
• Choose the new table’s parameters θ (e.g., foo 0.2, bar 1.1, baz −0.3) to minimize KL( exact belief || approximate belief ) with respect to θ.
• As before, this is moment matching: pick θ so that the approximation predicts the same expected n-gram counts as the exact belief.
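A hedged sketch of the resulting update: after fitting θ to the exact belief by moment matching (as in the gradient sketch earlier, which is elided here), dividing out the other approximate messages is a subtraction of weight tables in log space. The numbers reuse the slide’s running example.

```python
# Hedged sketch of an EP message update in log space: new message =
# fitted belief / product of other messages = a table subtraction.
from collections import Counter

def ep_message(theta_belief, other_messages):
    msg = Counter(theta_belief)
    for m in other_messages:
        msg.subtract(m)                 # division of log-linear factors
    return msg

belief = Counter({"foo": 4.8, "bar": 2.0, "baz": 17.2})
others = [Counter({"foo": 1.2, "bar": 0.5, "baz": 4.3})] * 3
print(ep_message(belief, others))       # ~ foo 1.2, bar 0.5, baz 4.3
```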
Results
• Question 1: Does EP work in general (compared to a baseline)?
• Question 2: Do variable-order approximations improve over fixed n-grams?
[Plot: runtime/accuracy trade-offs, colored by method:]
• Unigram EP (green) – fast but inaccurate
• Bigram EP (blue) – also fast and inaccurate
• Trigram EP (cyan) – slow and accurate
• Penalized EP (red) – fast and accurate
• Baseline (black, pruning-based) – accurate but slow
Fin
Thanks for your attention!
For more information on structured models and belief propagation, see the Structured Belief Propagation Tutorial at ACL 2015 by Matt Gormley and Jason Eisner.