Joint Models with Missing Data for Semi-Supervised Learning
Jason Eisner
NAACL Workshop Keynote – June 2009
Outline
1. Why use joint models?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models
The standard story
[Diagram: input x → Task → output y, via a p(y|x) model]
Semi-supervised learning: train on many (x,?) and a few (x,y).
Some running examples
[Same diagram: x → Task → y, via a p(y|x) model; semi-supervised learning trains on many (x,?) and a few (x,y), e.g., in low-resource languages]
- sentence → parse (with David A. Smith)
- lemma → morphological paradigm (with Markus Dreyer)
Semi-supervised learning
Semi-supervised learning: train on many (x,?) and a few (x,y).
Why would knowing p(x) help you learn p(y|x)?
- Shared parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) * p(x|y)
- Estimate p(x,y) to have the appropriate marginal p(x)
- This affects the conditional distribution p(y|x)
[Figure: a sample of p(x), with the points falling into visible clusters]
For any x, we can now recover the cluster c that probably generated it. A few supervised examples may then let us predict y from c: e.g., if p(x,y) = ∑c p(x,y,c) = ∑c p(c) p(y|c) p(x|c) (a joint model!), then p(y|c) needs only a few parameters.
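(A minimal sketch of this cluster-then-label idea, assuming scikit-learn is available and that each mixture component attracts at least one labeled point; the data, component count, and label mapping are invented for illustration.)

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Lots of unlabeled x, drawn here from two latent clusters
    X_unlab = np.vstack([rng.normal(-2, 1, (500, 1)),
                         rng.normal(+2, 1, (500, 1))])
    # A few supervised (x, y) pairs
    X_lab = np.array([[-2.1], [-1.8], [2.0], [2.3]])
    y_lab = np.array([0, 0, 1, 1])

    # Estimate p(x) = sum_c p(c) p(x|c) from unlabeled data alone
    gm = GaussianMixture(n_components=2, random_state=0).fit(X_unlab)

    # Learn p(y|c) from the few labeled points: majority label per cluster
    c_lab = gm.predict(X_lab)
    cluster_to_y = {c: np.bincount(y_lab[c_lab == c]).argmax()
                    for c in np.unique(c_lab)}

    # Predict y for new x by recovering its likely cluster
    y_pred = [cluster_to_y[c] for c in gm.predict([[-1.5], [1.7]])]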
Semi-supervised learning
Semi-supervised learning: train on many (x,?) and a few (x,y).
Why would knowing p(x) help you learn p(y|x)?
- Shared parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) * p(x|y)
- Estimate p(x,y) to have the appropriate marginal p(x)
- This affects the conditional distribution p(y|x)
- The clustering picture is misleading in one respect: there is no need to assume a distance metric (as in TSVM, label propagation, etc.)
- But we do need to choose a model family for p(x,y)
NLP + ML = ???
[Diagram: x → Task → y]
- x: structured input (may be only partly observed, so we infer x too)
- p(y|x) model: depends on features of <x,y> (sparse features?), or on features of <x,z,y> where z is latent (so we infer z too)
- y: structured output (so decoding already needs joint inference, e.g., dynamic programming)
Each task in a vacuum?
[Diagram: four separate tasks, each mapping its own input x1…x4 to its own output y1…y4]
Solved tasks help later ones? (e.g., a pipeline)
[Diagram: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y]
Feedback?
[Same pipeline diagram: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y]
What if Task3 isn't solved yet and we have little <z2,z3> training data?
Feedback?
[Same pipeline diagram]
What if Task3 isn't solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x and y!
A later step benefits from many earlier ones?
[Pipeline diagram, now with arrows from several earlier variables feeding a later task]
A later step benefits from many earlier ones? And conversely?
[Same diagram, with influences running in both directions]
We end up with a Markov Random Field (MRF)
[Diagram: variables x, z1, z2, z3, y connected by factors Φ1, Φ2, Φ3, Φ4]
Variable-centric, not task-centric
[Factor-graph diagram: variables x, z1, z2, z3, y with factors Φ1 … Φ5]
p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
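(To make the factorization concrete, here is a toy version in Python with invented boolean variables and hand-picked factor values; a real model would use structured variables and learned factors.)

    import itertools

    # Hypothetical factors over boolean variables (values chosen arbitrarily)
    def phi1(x, z1):          return 2.0 if x == z1 else 0.5
    def phi2(z1, z2):         return 3.0 if z1 == z2 else 1.0
    def phi3(x, z1, z2, z3):  return 1.0 + (x == z3)
    def phi4(z3, y):          return 4.0 if z3 == y else 1.0
    def phi5(y):              return 1.5 if y else 1.0

    def score(x, z1, z2, z3, y):  # unnormalized probability
        return (phi1(x, z1) * phi2(z1, z2) * phi3(x, z1, z2, z3)
                * phi4(z3, y) * phi5(y))

    # Z sums the score over all assignments (tractable only for toy models!)
    Z = sum(score(*a) for a in itertools.product([False, True], repeat=5))
    p = score(True, True, True, True, True) / Z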
Familiar MRF example
First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
[Chain diagram over the observed (shaded) input sentence "… find preferred tags …"]
A possible tagging (i.e., an assignment to the remaining variables): v v v
Familiar MRF example
[Same CRF diagram]
Another possible tagging: v a n
Familiar MRF example: CRF
A "binary" factor measures the compatibility of 2 adjacent tags:

          next: v   n   a
    prev v      0   2   1
    prev n      2   1   0
    prev a      0   3   1

The model reuses the same parameters at each position.
Familiar MRF example: CRF
A "unary" factor evaluates this tag; its values depend on the corresponding word. For "tags": v 0.2, n 0.2, a 0 (it can't be an adjective).
Familiar MRF example: CRF
The same unary factor (it could be made to depend on the entire observed sentence, not just the corresponding word).
Familiar MRF example: CRF
A different unary factor at each position:

    find:       v 0.3   n 0.02   a 0
    preferred:  v 0.3   n 0      a 0.1
    tags:       v 0.2   n 0.2    a 0
Familiar MRF example: CRF
p(v a n) is proportional to the product of all factors' values on the assignment (v, a, n): the three unary factors above and the two binary factors between adjacent positions.
Familiar MRF example: CRF
p(v a n) is proportional to the product of all factors' values on (v, a, n):

    p(v a n) ∝ … 1 * 3 * 0.3 * 0.1 * 0.2 …

(binary factor (v,a) = 1, binary factor (a,n) = 3, unary values 0.3 for find=v, 0.1 for preferred=a, 0.2 for tags=n)
NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in a well-trained supervised case …).
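(A sketch that scores tag sequences with exactly the factor tables above; enumerating all 3^n taggings to get Z is only for illustration, since forward-backward does it in linear time.)

    import itertools

    TAGS = ["v", "n", "a"]
    # Binary factor: BINARY[prev][next], the 3x3 table above
    BINARY = {"v": {"v": 0, "n": 2, "a": 1},
              "n": {"v": 2, "n": 1, "a": 0},
              "a": {"v": 0, "n": 3, "a": 1}}
    # Unary factors: one per word position
    UNARY = {"find":      {"v": 0.3, "n": 0.02, "a": 0},
             "preferred": {"v": 0.3, "n": 0,    "a": 0.1},
             "tags":      {"v": 0.2, "n": 0.2,  "a": 0}}

    def unnorm(words, tags):
        """Product of all factors' values on this tagging."""
        s = 1.0
        for w, t in zip(words, tags):
            s *= UNARY[w][t]
        for t1, t2 in zip(tags, tags[1:]):
            s *= BINARY[t1][t2]
        return s

    words = ["find", "preferred", "tags"]
    s_van = unnorm(words, ["v", "a", "n"])  # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
    Z = sum(unnorm(words, t) for t in itertools.product(TAGS, repeat=3))
    p_van = s_van / Z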
Task-centered view of the world
[Pipeline diagram: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y]
Variable-centered view of the world
[Factor-graph diagram: variables x, z1, z2, z3, y with factors Φ1 … Φ5]
p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
Variable-centric, not task-centric
Throw in any variables that might help! Model and exploit correlations.
[Word cloud of candidate variables:] semantics, lexicon (word types), entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, tokens, sentences, translation, alignment, editing, quotation, discourse context, resources, speech, misspellings/typos, formatting, entanglement, annotation
Back to our (simpler!) running examples
- sentence → parse (with David A. Smith)
- lemma → morphological paradigm (with Markus Dreyer)
Parser projection (with David A. Smith)
[Diagram: sentence → parse (little direct training data); translation → parse of translation (much more training data)]
Parser projection
[Figure: the German sentence "Auf diese Frage habe ich leider keine Antwort bekommen" alongside its English translation "I did not unfortunately receive an answer to this question"]
Parser projection
[Diagram: sentence → parse (little direct training data); translation → parse of translation (much more training data); a word-to-word alignment connects the two sides]
Parser projection
[Figure: word-to-word alignment between "Auf diese Frage habe ich leider keine Antwort bekommen" and "I did not unfortunately receive an answer to this question", including a NULL token for unaligned words]
Parser projection
[Same diagram; the factor connecting parse, alignment, and parse of translation is where we need an interesting model]
Parses are not entirely isomorphic
[Figure: the aligned German/English parse pair, showing the kinds of divergence: null alignments, head-swapping, siblings, and monotonic correspondences]
Dependency relations
[Figure: inventory of possible cross-lingual dependency configurations, plus "none of the above"]
Parser projection
Different examples observe different subsets of the variables:
- Typical test data: only the sentence (no translation observed)
- Small supervised training set (treebank): sentence + parse
- Moderate treebank in the other language: translation + parse of translation
- Maybe a few gold alignments: sentence + translation + alignment
- Lots of raw bitext: sentence + translation
[Diagram repeated with the observed variables shaded in each case]
Parser projection
Given bitext, try to impute the other variables. Now we have more constraints on the parse, which should help us train the parser.
[Same diagram, with the imputed variables filled in]
We'll see how belief propagation naturally handles this.
English does help us impute the Chinese parse
Seeing the noisy output of an English WSJ parser fixes these Chinese links:

    中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购
    Gloss: China | in | infrastructure | construction | area | , | has begun | to utilize | international | financial | organizations | 's | loans | to implement | international | competitive | bidding | procurement
    Translation: In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement.

[Figure: the corrected links vs. the bad versions found without seeing the English parse]
- Complement verbs swap objects
- Subject attaches to intervening noun
Which does help us train a monolingual Chinese parser
(Could add a 3rd language …)
[Diagram: sentence + parse linked via alignments to two translations and their parses]
(Could add world knowledge …)
[Same diagram, with an additional world-knowledge variable]
(Could add bilingual dictionary …)
[Same diagram, plus a dictionary variable; since the dictionary is incomplete, treat it as a partially observed variable]
Dynamic Markov Random Field
[Diagram: sentence, parse, alignment, translation, and parse of translation, expanded into individual words and links]
Note: these are structured variables. Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, …). Thus the number of fine-grained variables and factors varies by example (but all examples share a single finite parameter vector).
Back to our running examples
- sentence → parse (with David A. Smith)
- lemma → morphological paradigm (with Markus Dreyer)
Morphological paradigm
[Empty paradigm grid for infinitive xyz: rows 1st/2nd/3rd Sg and 1st/2nd/3rd Pl; columns Present and Past]
Morphological paradigm
inf: werfen

            Present   Past
    1st Sg  werfe     warf
    2nd Sg  wirfst    warfst
    3rd Sg  wirft     warf
    1st Pl  werfen    warfen
    2nd Pl  werft     warft
    3rd Pl  werfen    warfen
Morphological paradigm as MRF
[Same paradigm grid, now drawn as a factor graph over the forms]
Each factor is a sophisticated weighted FST.
# observations per form (fine-grained semi-supervision)
inf: 9,393

            Present   Past
    1st Sg   285      1124
    2nd Sg   166         4   ← rare!
    3rd Sg  1410      1124
    1st Pl  1688       673
    2nd Pl  1275         9   ← rare!
    3rd Pl  1688       673

The rare cells are undertrained.
Question: does joint inference help?
gelten 'to hold, to apply'
inf: gelten

            Present   Past
    1st Sg  gelte     galt
    2nd Sg  giltst    galtst (or: galtest)
    3rd Sg  gilt      galt
    1st Pl  gelten    galten
    2nd Pl  geltet    galtet
abbrechen 'to quit'
inf: abbrechen

            Present                     Past
    1st Sg  abbreche (or: breche ab)    abbrach (or: brach ab)
    2nd Sg  abbrichst (or: brichst ab)  abbrachst (or: brachst ab)
    3rd Sg  abbricht (or: bricht ab)    abbrach (or: brach ab)
    1st Pl  abbrechen (or: brechen ab)  abbrachen (or: brachen ab)
    2nd Pl  abbrecht (or: brecht ab)    abbracht (or: bracht ab)
    3rd Pl  abbrechen                   abbrachen
gackern 'to cackle'
inf: gackern

            Present    Past
    1st Sg  gackere    gackerte
    2nd Sg  gackerst   gackertest
    3rd Sg  gackert    gackerte
    1st Pl  gackern    gackerten
    2nd Pl  gackert    gackertet
werfen 'to throw'
inf: werfen

            Present   Past
    1st Sg  werfe     warf
    2nd Sg  wirfst    warfst
    3rd Sg  wirft     warf
    1st Pl  werfen    warfen
    2nd Pl  werft     warft
Preliminary results …
Joint inference helps a lot on the rare forms, but hurts on the others. Can we fix this?
(Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models
Key Idea!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms.
  - These are the "propagators" that compute outgoing messages,
  - even though the product of factors may be intractable or even undecidable to work with.
Why we need approximate inference
- MRFs are great for n-way classification (maxent).
- Also good for predicting sequences ("find preferred tags"); alas, the forward-backward algorithm only allows n-gram features.
- Also good for dependency parsing ("… find preferred links …"); alas, our combinatorial algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness).
Great Ideas in ML: Message Passing
Count the soldiers (adapted from MacKay (2003) textbook):
[Figure: a line of soldiers; each passes along "N before you" in one direction (1, 2, 3, 4, 5) and "N behind you" in the other (5, 4, 3, 2, 1), while thinking "there's 1 of me"]
Great Ideas in ML: Message Passing
[Same figure, focusing on one soldier who only sees his incoming messages: "2 before you" and "3 behind you"]
Belief: must be 2 + 1 + 3 = 6 of us.
Great Ideas in ML: Message Passing
[Same figure, another soldier: "1 before you" and "4 behind you"]
Belief: must be 1 + 1 + 4 = 6 of us (the same answer).
Great Ideas in ML: Message Passing
Now the soldiers stand in a tree: each soldier receives reports from all branches of the tree (adapted from MacKay (2003) textbook).
[Figure: messages such as "3 here", "7 here (= 3+3+1)", and "11 here (= 7+3+1)" flow along the edges; counting "1 of me", every node's belief is "must be 14 of us"]
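(The counting trick in a few lines of Python: each message over an edge reports how many soldiers lie on one side of that edge, and every node's belief comes out the same. The 14-node tree below is made up.)

    # Undirected tree as an adjacency list (hypothetical 14-node tree)
    adj = {1: [2, 3, 4], 2: [1, 5, 6], 3: [1], 4: [1, 7],
           5: [2], 6: [2, 8, 9], 7: [4, 10, 11], 8: [6],
           9: [6], 10: [7, 12], 11: [7], 12: [10, 13, 14],
           13: [12], 14: [12]}

    def msg(src, dst):
        """Soldiers on src's side of the edge src-dst (including src)."""
        return 1 + sum(msg(nb, src) for nb in adj[src] if nb != dst)

    def belief(u):
        """Total count as seen from u: 1 of me, plus all incoming reports."""
        return 1 + sum(msg(nb, u) for nb in adj[u])

    assert all(belief(u) == 14 for u in adj)  # every soldier agrees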
Great ideas in ML: Forward-Backward
In the CRF, message passing = forward-backward.
[Figure: the chain CRF over "find preferred tags". The variable for "preferred" receives an α message from the left, e.g., (v 3, n 1, a 6), a β message from the right, e.g., (v 2, n 1, a 7), and its unary factor's message (v 0.3, n 0, a 0.1). Its belief is the pointwise product: (v 1.8, n 0, a 4.2).]
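(A runnable forward-backward pass over the three-word example, using the unary and binary tables from the earlier slides; here α folds in the unary factor, so the belief at each position is just α · β, normalized.)

    import numpy as np

    TAGS = ["v", "n", "a"]
    T = np.array([[0, 2, 1],          # binary factor: row = previous tag,
                  [2, 1, 0],          #                column = next tag
                  [0, 3, 1]], float)
    U = np.array([[0.3, 0.02, 0.0],   # unary factors for find,
                  [0.3, 0.0,  0.1],   # preferred,
                  [0.2, 0.2,  0.0]])  # tags

    n = len(U)
    alpha = [None] * n
    beta = [None] * n
    alpha[0] = U[0].copy()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * U[t]
    beta[n - 1] = np.ones(3)
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (U[t + 1] * beta[t + 1])

    Z = alpha[-1].sum()                                 # partition function
    beliefs = [a * b / Z for a, b in zip(alpha, beta)]  # marginal p(tag_t)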
Great ideas in ML: Forward-Backward
Extend the CRF to a "skip chain" to capture a non-local factor.
More influences on the belief 
[Figure: an extra factor connects distant tags, contributing an extra incoming message; the combined α message becomes, e.g., (v 5.4, n 0, a 25.2)]
Great ideas in ML: Forward-Backward
Extend the CRF to a "skip chain" to capture a non-local factor.
- More influences on the belief 
- But the red messages are not independent? The graph becomes loopy 
- Pretend they are independent anyway!
[Same figure; the α message (v 5.4, n 0, a 25.2) now double-counts some evidence]
MRF over string-valued variables!
[Paradigm factor graph for infinitive xyz]
Each factor is a sophisticated weighted FST.
MRF over string-valued variables!
What are these messages?
- Probability distributions over strings …
- Represented by weighted FSAs
- Constructed by finite-state operations
- Parameters trainable using finite-state methods
Warning: the FSAs can get larger and larger; must prune back using k-best or a variational approximation.
[Same paradigm factor graph; each factor is a sophisticated weighted FST]
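(Real implementations represent each message as a weighted FSA and compose factors with finite-state operations; as a stand-in, this sketch stores a message as an explicit dict from strings to weights, which only works when the support is small, and shows the k-best pruning step. The example strings are invented.)

    def pointwise_product(m1, m2):
        """Combine two incoming messages about the same string variable."""
        return {s: m1[s] * m2[s] for s in m1.keys() & m2.keys()}

    def prune_to_kbest(msg, k=3):
        """Approximate a growing message by its k highest-weight strings."""
        top = sorted(msg.items(), key=lambda kv: -kv[1])[:k]
        z = sum(w for _, w in top)
        return {s: w / z for s, w in top}

    # Hypothetical messages about a German 2nd Sg past form
    m1 = {"warfst": 0.6, "werfst": 0.3, "warfest": 0.1}
    m2 = {"warfst": 0.5, "warfest": 0.4, "wirfst": 0.1}
    belief = prune_to_kbest(pointwise_product(m1, m2), k=2)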
Key Idea!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms.
  - These are the "propagators" that compute outgoing messages,
  - even though the product of factors may be intractable or even undecidable to work with.
- We just saw this for morphology; now let's see it for parsing.
Local factors in a graphical model
Back to simple variables …
- CRF for POS tagging (tags v, a, n)
- Now let's do dependency parsing!
  - O(n^2) boolean variables for the possible links
[Diagram: the sentence "… find preferred links …" with one boolean variable per possible directed link]
Local factors in a graphical model
[Same diagram; successive slides show:]
- A possible parse, encoded as an assignment (t/f) to these link variables
- Another possible parse: a different assignment
- An illegal parse: the t links contain a cycle
- Another illegal parse: a cycle, plus a word with multiple parents
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
[Diagram: each link variable has a factor such as (t 2, f 1) or (t 1, f 8); as before, the goodness of a link can depend on the entire observed input context, and some links aren't as good given this input sentence]
But what if the best assignment isn't a tree??
Global factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree
  - this is a "hard constraint": the factor is either 0 or 1

    ffffff → 0
    ffffft → 0
    fffftf → 0
    …
    fftfft → 1
    …
    tttttt → 0
Global factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree ("we're legal!")
  - a "hard constraint": the factor is either 0 or 1 (64 entries for 6 link variables)
  - optionally require the tree to be projective (no crossing links)
So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
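(A brute-force illustration of the TREE factor as a hard constraint, with invented link variables for a 3-word sentence; it evaluates 0/1 on each assignment exactly as in the table above. Real parsers never enumerate like this.)

    import itertools

    n = 3  # words 1..n; node 0 is the root

    def tree_factor(links):
        """links[(h, m)] = True iff the h->m link is on. Returns 0 or 1."""
        parent = {}
        for m in range(1, n + 1):
            heads = [h for h in range(n + 1) if h != m and links[(h, m)]]
            if len(heads) != 1:
                return 0            # no parent, or multiple parents
            parent[m] = heads[0]
        for m in range(1, n + 1):   # walk upward; must reach the root
            seen, h = set(), m
            while h != 0:
                if h in seen:
                    return 0        # cycle
                seen.add(h)
                h = parent[h]
        return 1

    edges = [(h, m) for m in range(1, n + 1)
                    for h in range(n + 1) if h != m]
    legal = sum(tree_factor(dict(zip(edges, bits)))
                for bits in itertools.product([False, True], repeat=len(edges)))
    # legal == 16, i.e., (n+1)^(n-1) rooted trees for n = 3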
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree (a hard constraint: 0 or 1)
- Second-order effects: factors on 2 variables
  - e.g., a grandparent factor [2x2 table: value 3 when both links are t, else 1]
Local factors for parsing
(the same factors as above, plus:)
  - a no-cross factor [2x2 table: value 0.2 when two crossing links are both t, else 1]
[Diagram now includes the word "by": "… find preferred links by …"]
Local factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors to evaluate each link in isolation
- A global TREE factor to require that the links form a legal tree (a hard constraint: 0 or 1)
- Second-order effects: factors on 2 variables
  - grandparent
  - no-cross
  - siblings
  - hidden POS tags
  - subcategorization
  - coordination with another parse & alignment
  - …
Exactly Finding the Best Parse
("… find preferred links …")
With arbitrary features, runtime blows up. (To allow fast dynamic programming or MST parsing, standard parsers only use single-edge features.)
- Projective parsing: O(n^3) by dynamic programming
  - + grandparents: O(n^4)
  - + grandparent & sibling bigrams: O(n^5)
  - + POS trigrams: O(n^3 g^6)
  - + sibling pairs (non-adjacent): … O(2^n)
- Non-projective parsing: O(n^2) by minimum spanning tree, but NP-hard with
  - any of the above features
  - soft penalties for crossing links
  - pretty much anything else!
Two great tastes that taste great together
"You got belief propagation in my dynamic programming!"
"You got dynamic programming in my belief propagation!"
Loopy Belief Propagation for Parsing
- Sentence tells word 3, "Please be a verb"
- Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist"
- The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7"
- The TREE factor tells the 10 → 7 link, "You're on!"
- The 10 → 7 link tells word 10, "Could you please be a noun?"
- …
[Diagram: "… find preferred links …"]
Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., grandparent) induce loops
- Let's watch a loop around one triangle …
- Strong links are suppressing or promoting other links …
[Diagram: "… find preferred links …"]
Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., grandparent) induce loops
- Let's watch a loop around one triangle …
- How did we compute the outgoing message to the green link?
  - "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
[Diagram: the TREE factor's 0/1 table (ffffff → 0, …, fftfft → 1, …, tttttt → 0) with a question mark on the green link]
Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
- But this is the outside probability of the green link!
- The TREE factor computes all outgoing messages at once (given all incoming messages).
- Projective case: total O(n^3) time by inside-outside.
- Non-projective case: total O(n^3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).
Loopy Belief Propagation for Parsing
- The outgoing message to each link is that link's outside probability, so the TREE factor computes all outgoing messages at once (given all incoming messages).
- Projective case: total O(n^3) time by inside-outside.
- Non-projective case: total O(n^3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).
Belief propagation assumes the incoming messages to TREE are independent. So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
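(A numpy sketch of the non-projective case, assuming the standard matrix-tree construction of Smith & Smith (2007) and Koo et al. (2007): the link marginals that TREE must send out come from a single matrix inversion. The edge weights below are invented.)

    import numpy as np

    def link_marginals(w):
        """w[h, m] = weight of link h->m; node 0 is the root.
        Returns mu[h, m] = P(link h->m is in the tree)."""
        n = w.shape[0] - 1
        L = np.zeros((n, n))       # Laplacian with root row/col deleted
        for m in range(1, n + 1):
            L[m - 1, m - 1] = sum(w[h, m] for h in range(n + 1) if h != m)
            for h in range(1, n + 1):
                if h != m:
                    L[h - 1, m - 1] = -w[h, m]
        Linv = np.linalg.inv(L)    # one O(n^3) inversion
        mu = np.zeros_like(w)
        for m in range(1, n + 1):
            mu[0, m] = w[0, m] * Linv[m - 1, m - 1]
            for h in range(1, n + 1):
                if h != m:
                    mu[h, m] = w[h, m] * (Linv[m - 1, m - 1]
                                          - Linv[m - 1, h - 1])
        return mu

    w = np.array([[0, 1.0, 2.0],   # root -> word1, root -> word2
                  [0, 0,   3.0],   # word1 -> word2
                  [0, 1.5, 0  ]])  # word2 -> word1
    mu = link_marginals(w)         # each column sums to 1: one parent per word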
Some interesting connections …
- Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
- Global constraints in arc consistency
  - ALLDIFFERENT constraint (Régin 1994)
- Matching constraint in max-product BP
  - for computer vision (Duchi et al., 2006)
  - could be used for machine translation
- As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing at all.
Runtimes for each factor type (see paper)

    Factor type       degree   runtime    count    total
    Tree              O(n^2)   O(n^3)     1        O(n^3)
    Proj. Tree        O(n^2)   O(n^3)     1        O(n^3)
    Individual links  1        O(1)       O(n^2)   O(n^2)
    Grandparent       2        O(1)       O(n^3)   O(n^3)
    Sibling pairs     2        O(1)       O(n^3)   O(n^3)
    Sibling bigrams   O(n)     O(n^2)     O(n)     O(n^3)
    NoCross           O(n)     O(n)       O(n^2)   O(n^3)
    Tag               1        O(g)       O(n)     O(n)
    TagLink           3        O(g^2)     O(n^2)   O(n^2)
    TagTrigram        O(n)     O(ng^3)    1        O(ng^3)
    TOTAL                                          O(n^3) per iteration

Additive, not multiplicative!
Runtimes for each factor type (see paper)
[Same table as above]
Each "global" factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.
Dependency Accuracy
The extra, higher-order features help! (non-projective parsing)

                   Danish   Dutch   English
    Tree+Link       85.5    87.3     88.6
    +NoCross        86.1    88.3     89.1
    +Grandparent    86.1    88.6     89.4
    +ChildSeq       86.5    88.5     90.1
Dependency Accuracy
The extra, higher-order features help! (non-projective parsing)

                               Danish   Dutch   English
    Tree+Link                   85.5    87.3     88.6
    +NoCross                    86.1    88.3     89.1
    +Grandparent                86.1    88.6     89.4
    +ChildSeq                   86.5    88.5     90.1
    Best projective parse
    with all factors
    (exact, slow)               86.0    84.5     90.2
    +hill-climbing              86.1    87.6     90.2

(Hill-climbing doesn't fix enough edges.)
Time vs. Projective Search Error
[Plot: runtime (BP iterations) vs. projective search error, compared with O(n^4) and O(n^5) dynamic programming]
Summary of MRF parsing by BP
- Output probability defined as a product of local and global factors
  - Throw in any factors we want! (log-linear model)
  - Each factor must be fast, but they run independently
- Let local factors negotiate via "belief propagation"
  - Each bit of syntactic structure is influenced by others
  - Some factors need combinatorial algorithms to compute messages fast
    - e.g., existing parsing algorithms using dynamic programming
  - Compare reranking or stacking
  - Each iteration takes total time O(n^3) or even O(n^2); see paper
- Converges to a pretty good (but approximate) global parse
  - Fast parsing for formerly intractable or slow models
  - Extra features of these models really do help accuracy
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models
Training with missing data is hard!
- Semi-supervised learning of HMMs or PCFGs: ouch!
  - Merialdo: just stick with the small supervised training set
  - Adding unsupervised data tends to hurt
- A stronger model helps (McClosky et al. 2007, Cohen et al. 2009)
  - So maybe some hope from good models at the factors
  - And from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.)
- Naïve Bayes would be okay …
  - Variables with unknown values can't hurt you: they have no influence on training or decoding.
  - But they can't help you, either! And the independence assumptions are flaky.
- So I'd like to keep discussing joint models …
Case #1: Missing data that you can't impute
[Diagram: sentence, parse, word-to-word alignment, translation, parse of translation]
Treat like multi-task learning? Shared features between 2 tasks:
- parse Chinese vs. parse Chinese w/ English translation
Or 3 tasks:
- parse Chinese w/ inferred English gist
- vs. parse Chinese w/ English translation
- vs. parse English gist derived from English (supervised)
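(A sketch of the shared-parameter idea, with invented toy feature templates: every task scores a candidate parse with one shared weight vector, and the bilingual features simply go unused when no translation's parse is available.)

    from collections import defaultdict

    w = defaultdict(float)  # one weight vector shared by all tasks

    # A parse here is a list of head positions per word (0 = attach to root)

    def feats_mono(sent, parse):
        """Hypothetical monolingual features, available in every task."""
        return {("root-word", sent[parse.index(0)]): 1.0,
                ("num-links", len(parse)): 1.0}

    def feats_bi(parse, aligned_parse):
        """Hypothetical bilingual features, only when the translation's parse exists."""
        return {("roots-agree", parse.index(0) == aligned_parse.index(0)): 1.0}

    def score(sent, parse, aligned_parse=None):
        f = dict(feats_mono(sent, parse))
        if aligned_parse is not None:    # the bitext task sees extra features
            f.update(feats_bi(parse, aligned_parse))
        return sum(w[k] * v for k, v in f.items())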
Case #2: Missing data you can impute, but maybe badly
[Paradigm factor graph for infinitive xyz; each factor is a sophisticated weighted FST]
Case #2: Missing data you can impute, but maybe badly
- This is where simple cases of EM go wrong.
- Could reduce to case #1 and throw away these variables.
- Or: damp messages from imputed variables to the extent you're not confident in them.
  - Requires confidence estimation (cf. strapping).
  - Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence.
  - But we could train a confidence estimator on supervised data to pay attention to all sorts of things!
- Correspondingly, scale up features for related missing-data tasks, since the damped data are "partially missing".
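(A sketch of the damping idea: interpolate an imputed variable's outgoing message toward an uninformative one as confidence drops, with confidence here crudely derived from the entropy of the node's belief; as noted above, a real version might instead train the confidence estimator on supervised data. The example forms are invented.)

    import math

    def confidence(belief):
        """Crude confidence: 1 minus the normalized entropy of the belief."""
        h = -sum(p * math.log(p) for p in belief.values() if p > 0)
        return 1 - h / math.log(len(belief))

    def damp(message, conf):
        """Shrink a message toward uniform when we doubt the imputed value."""
        uniform = 1 / len(message)
        return {v: conf * p + (1 - conf) * uniform for v, p in message.items()}

    belief = {"warf": 0.90, "warfte": 0.05, "werfte": 0.05}
    outgoing = damp(belief, confidence(belief))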