1
Structured Belief Propagation
for NLP
Matthew R. Gormley & Jason Eisner
ACL ‘14 Tutorial
June 22, 2014
For the latest version of these slides, please visit:
http://www.cs.jhu.edu/~mrg/bp-tutorial/
2
Language has a lot going on at once
Structured representations of utterances
3
Structured knowledge of the language
Many interacting parts
for BP to reason about!
Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
Section 1:
Introduction
Modeling with Factor Graphs
12
Sampling from a Joint Distribution
A joint distribution defines a probability p(x) for each assignment of values x to variables X.
This gives the proportion of samples that will equal x.
Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n
[Figure: linear-chain factor graph over tag variables X0 (<START>) through X5, with factors ψ0–ψ9, for the sentence "time flies like an arrow"]
13
Sampling from a Joint Distribution
A joint distribution defines a probability p(x) for each assignment of values x to variables X.
This gives the proportion of samples that will equal x.
[Figure: four sampled assignments drawn on a factor graph that also includes variables X6 and X7 and factors ψ10–ψ12 above the tag chain X1–X5]
14
Sampling from a Joint Distribution
A joint distribution defines a probability p(x) for each assignment of values x to variables X.
This gives the proportion of samples that will equal x.
Sample 1: n v p d n  (time flies like an arrow)
Sample 2: n n v d n  (time flies like an arrow)
Sample 3: n v p n n  (flies fly with their wings)
Sample 4: p n n v v  (with time you will see)
[Figure: factor graph with tag variables X0 (<START>) through X5, word variables W1–W5, and factors ψ0–ψ9]
15
Factors have local opinions (≥ 0)
Each black box looks at some of the tags Xi and words Wi.

Example tag-bigram factor ψ(Xi-1, Xi):
        v     n     p     d
  v     1     8     1     0.1
  n     6     4     3     8
  p     3     2     1     0
  d     4     0.1   3     0

Example tag-word factor ψ(Xi, Wi):
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

Note: We chose to reuse the same factors at different positions in the sentence.
16
Factors have local opinions (≥ 0)
Each black box looks at some of the tags Xi and words Wi.
p(n, v, p, d, n, time, flies, like, an, arrow) = ?
[Figure: the same factor graph, with the assignment n v p d n filled into the tag variables and the relevant factor entries highlighted]
17
Global probability = product of local opinions
Each black box looks at some of the tags Xi and words Wi.
p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * …)
Uh-oh! The products for the various assignments sum up to Z > 1. So divide them all by Z.
[Figure: the same factor graph, with the factor entries contributing to this product highlighted]
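In symbols, this is the standard factor-graph form of the distribution (stated here in generic notation; the specific numbers above come from the example factor tables):

  p(x) = (1/Z) ∏α ψα(xα),   where  Z = Σx' ∏α ψα(x'α)

Here α ranges over the factors and xα is the sub-assignment of x seen by factor ψα.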
18
Markov Random Field (MRF)
Joint distribution over tags Xi and words Wi.
The individual factors aren't necessarily probabilities.
p(n, v, p, d, n, time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * …)
[Figure: the same factor graph and factor tables as before]
19
Hidden Markov Model
But sometimes we choose to make the factors probabilities.
Constrain each row of a factor to sum to one. Now Z = 1.
p(n, v, p, d, n, time, flies, like, an, arrow) = .3 * .8 * .2 * .5 * …
[Figure: the same factor graph, but each factor's entries are now conditional probabilities, e.g., transition rows like (.2, .3, .1, .1) and emission rows like (.1, .8, .2, .2)]
20
21
Conditional Random Field (CRF)
Conditional distribution over tags Xi given words wi.
The factors and Z are now specific to the sentence w.
p(n, v, p, d, n | time, flies, like, an, arrow) = (1/Z) (4 * 8 * 5 * 3 * …)
[Figure: the same tag-chain factor graph, but the word-specific factors are folded into unary factors on the tags, e.g., (v 3, n 4, p 0.1, d 0.1) at X1 and (v 5, n 5, p 0.1, d 0.2) at X2]
22
How General Are Factor Graphs?
• Factor graphs can be used to describe
– Markov Random Fields (undirected graphical models)
• i.e., log-linear models over a tuple of variables
– Conditional Random Fields
– Bayesian Networks (directed graphical models)
• Inference treats all of these interchangeably.
– Convert your model to a factor graph first.
– Pearl (1988) gave key strategies for exact inference:
• Belief propagation, for inference on acyclic graphs
• Junction tree algorithm, for making any graph acyclic
(by merging variables and factors: blows up the runtime)
Object-Oriented Analogy
• What is a sample?
  A datum: an immutable object that describes a linguistic structure.
• What is the sample space?
  The class of all possible sample objects.
• What is a random variable?
  An accessor method of the class, e.g., one that returns a certain field.
  – Will give different values when called on different random samples.

class Tagging:
  int n;          // length of sentence
  Word[] w;       // array of n words (values wi)
  Tag[] t;        // array of n tags (values ti)

  Word W(int i) { return w[i]; }               // random var Wi
  Tag T(int i) { return t[i]; }                // random var Ti
  String S(int i) { return suffix(w[i], 3); }  // random var Si

Random variable W5 takes value w5 == "arrow" in this sample.
24
Object-Oriented Analogy
• What is a sample?
  A datum: an immutable object that describes a linguistic structure.
• What is the sample space?
  The class of all possible sample objects.
• What is a random variable?
  An accessor method of the class, e.g., one that returns a certain field.
• A model is represented by a different object. What is a factor of the model?
  A method of the model that computes a number ≥ 0 from a sample, based on the sample's values of a few random variables, and parameters stored in the model.
• What probability does the model assign to a sample?
  A product of its factors (rescaled). E.g., uprob(tagging) / Z().
• How do you find the scaling factor?
  Add up the probabilities of all possible samples. If the result Z != 1, divide the probabilities by that Z.

class TaggingModel:
  float transition(Tagging tagging, int i) {    // tag-tag bigram
    return tparam[tagging.t(i-1)][tagging.t(i)]; }
  float emission(Tagging tagging, int i) {      // tag-word bigram
    return eparam[tagging.t(i)][tagging.w(i)]; }

  float uprob(Tagging tagging) {                // unnormalized prob
    float p=1;
    for (i=1; i <= tagging.n; i++) {
      p *= transition(tagging, i) * emission(tagging, i); }
    return p; }
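To make the scaling-factor bullet concrete, here is a minimal Python sketch of the same computation (hypothetical toy code, not the tutorial's software: the tag set and the all-ones parameter dictionaries tparam/eparam are invented for illustration):

from itertools import product

TAGS = ["v", "n", "p", "d"]
WORDS = ["time", "flies", "like", "an", "arrow"]
# made-up factor tables; any nonnegative numbers would do
tparam = {(s, t): 1.0 for s in ["<START>"] + TAGS for t in TAGS}   # tag-tag bigram factors
eparam = {(t, w): 1.0 for t in TAGS for w in WORDS}                # tag-word bigram factors

def uprob(tags, words):
    """Unnormalized probability: the product of all the factors."""
    p, prev = 1.0, "<START>"
    for t, w in zip(tags, words):
        p *= tparam[(prev, t)] * eparam[(t, w)]
        prev = t
    return p

def Z(words):
    """Scaling factor: add up uprob over every possible tagging of these words.
    (Summing over word sequences too would give the joint MRF's Z; here we
    condition on the observed words, as a CRF would.)"""
    return sum(uprob(tags, words) for tags in product(TAGS, repeat=len(words)))

tags = ["n", "v", "p", "d", "n"]
print(uprob(tags, WORDS) / Z(WORDS))   # normalized probability of this tagging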
25
Modeling with Factor Graphs
• Factor graphs can be used to model many
linguistic structures.
• Here we highlight a few example NLP tasks.
– People have used BP for all of these.
• We’ll describe how variables and factors
were used to describe structures and the
interactions among their parts.
26
Annotating a Tree
Given: a sentence and an unlabeled parse tree.
Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals.
[Figure: tag variables X1–X5 (values n v p d n) and nonterminal variables X6–X9 (values np, pp, vp, s) over "time flies like an arrow", connected by factors ψ1–ψ13]
We could add a linear-chain structure between tags. (This creates cycles!)

Constituency Parsing
What if we needed to predict the tree structure too?
Use more variables: predict the nonterminal of each substring, or ∅ if it's not a constituent.
But nothing prevents non-tree structures.
Add a factor which multiplies in 1 if the variables form a tree and 0 otherwise.
[Figure: one variable per substring of "time flies like an arrow"; the gold tree's spans take values s, vp, pp, np and the rest take ∅; an assignment with an extra s-labeled span is ruled out by the tree factor]
35
Example Task: Constituency Parsing
(Naradowsky, Vieira, & Smith, 2012)
• Variables:
  – Constituent type (or ∅) for each of O(n²) substrings
• Interactions:
  – Constituents must describe a binary tree
  – Tag bigrams
  – Nonterminal triples (parent, left-child, right-child) [these factors not shown]
[Figure: the parse tree for "time flies like an arrow" alongside the corresponding factor graph over span variables]
36
Example Task: Dependency Parsing
(Smith & Eisner, 2008)
• Variables:
  – POS tag for each word
  – Syntactic label (or ∅) for each of O(n²) possible directed arcs
• Interactions:
  – Arcs must form a tree
  – Discourage (or forbid) crossing edges
  – Features on edge pairs that share a vertex
    • Learn to discourage a verb from having 2 objects, etc.
    • Learn to encourage specific multi-arc constructions
[Figure from Burkett & Klein (2012): candidate dependency arcs over "time flies like an arrow"]
37
Example Task:
Joint CCG Parsing and Supertagging
• Variables:
– Spans
– Labels on nonterminals
– Supertags on preterminals
• Interactions:
– Spans must form a
tree
– Triples of labels:
parent, left-child, and
right-child
– Adjacent tags
(Auli & Lopez, 2011)
38
Example task:
Transliteration or Back-Transliteration
• Variables (string):
– English and Japanese orthographic strings
– English and Japanese phonological strings
• Interactions:
– All pairs of strings could be relevant
Figure thanks to Markus Dreyer
39
Example task:
Morphological Paradigms
• Variables (string):
– Inflected forms
of the same verb
• Interactions:
– Between pairs of
entries in the
table
(e.g., the infinitive form affects the present singular)
(Dreyer & Eisner, 2009)
40
Application: Word Alignment / Phrase Extraction
• Variables (boolean):
  – For each (Chinese phrase, English phrase) pair, are they linked?
• Interactions:
  – Word fertilities
  – Few "jumps" (discontinuities)
  – Syntactic reorderings
  – "ITG constraint" on alignments
  – Phrases are disjoint (?)
(Burkett & Klein, 2012)
41
Application:
Congressional Voting
• Variables:
– Text of all speeches of a
representative
– Local contexts of
references between two
representatives
• Interactions:
– Words used by
representative and their
vote
– Pairs of representatives
and their local context
(Stoyanov & Eisner, 2012)
42
Application: Semantic Role Labeling with Latent Syntax
• Variables:
  – Semantic predicate sense
  – Semantic dependency arcs
  – Labels of semantic arcs
  – Latent syntactic dependency arcs
• Interactions:
  – Pairs of syntactic and semantic dependencies
  – Syntactic dependency arcs must form a tree
(Naradowsky, Riedel, & Smith, 2012)
(Gormley, Mitchell, Van Durme, & Dredze, 2014)
[Figure: semantic arcs (arg0, arg1) for "time flies like an arrow", and a grid of directed-arc variables Li,j / Ri,j over "<WALL> The barista made coffee"]
Application: Joint NER & Sentiment Analysis
• Variables:
  – Named entity spans
  – Sentiment directed toward each entity
• Interactions:
  – Words and entities
  – Entities and sentiment
Example: "I love Mark Twain", with entity PERSON (Mark Twain) and sentiment POSITIVE.
(Mitchell, Aguilar, Wilson, & Van Durme, 2013)
44
Variable-centric view of the world
When we deeply understand language, what representations (type and token) does that understanding comprise?
semantics, lexicon (word types), entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, tokens, sentences, translation, alignment, editing, quotation, discourse context, resources, speech, misspellings/typos, formatting, entanglement, annotation
46
To recover variables,
model and exploit
their correlations
Section 2:
Belief Propagation Basics
47
Factor Graph Notation
• Variables: X1, X2, X3, …, X9
• Factors: unary factors ψ{1}, ψ{2}, ψ{3}, …; pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, …; higher-order factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}
• Joint Distribution: the product of all the factors, rescaled by Z
[Figure: the factor graph over X1–X9 for "time flies like a …", with each factor drawn as a black square connected to the variables it touches]
50
Factors are Tensors
• Factors:
  – a unary factor over one tag variable is a vector, e.g., (v, n, p, d) → (3, 4, 0.1, 0.1)
  – a binary factor over a tag bigram is a matrix:
        v    n    p    d
    v   1    8    1    0.1
    n   6    4    3    8
    p   3    2    1    0
    d   4    0.1  3    0
  – a ternary factor over a nonterminal triple is a 3rd-order tensor, with slices like:
        s    vp   pp   …
    s   0    2    .3
    vp  3    4    2
    pp  .1   2    1
[Figure: the factor graph from the previous slide, annotated with these tensors]
51
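A quick NumPy illustration of the tensor view (hypothetical toy code; the arrays just reuse the example values above):

import numpy as np

psi_unary = np.array([3, 4, 0.1, 0.1])          # vector over tag values (v, n, p, d)
psi_binary = np.array([[1, 8, 1, 0.1],          # matrix over tag bigrams
                       [6, 4, 3, 8],
                       [3, 2, 1, 0],
                       [4, 0.1, 3, 0]])
psi_ternary = np.zeros((3, 3, 3))               # 3rd-order tensor, e.g., over nonterminal triples

# Storing factors as tensors is convenient because message computations become
# ordinary tensor contractions, e.g., a matrix-vector product for a binary factor:
incoming = np.array([1.0, 2.0, 2.0, 1.0])       # a message over (v, n, p, d)
print(psi_binary @ incoming)                    # outgoing message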
Inference
Given a factor graph, two common tasks …
– Compute the most likely joint assignment,
x* = argmaxx p(X=x)
– Compute the marginal distribution of variable Xi:
p(Xi=xi) for each value xi
Both consider all joint assignments.
Both are NP-Hard in general.
So, we turn to approximations.
p(Xi=xi) = sum of
p(X=x) over joint
assignments with
Xi=xi
52
Marginals by Sampling on Factor Graph
Suppose we took many samples from the distribution over
taggings:
Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n
[Figure: the linear-chain factor graph over X0 (<START>) through X5, with factors ψ0–ψ9, for "time flies like an arrow"]
53
Marginals by Sampling on Factor Graph
The marginal p(Xi = xi) gives the probability that variable Xi takes value xi in a random sample.
(Same six samples and factor graph as above.)
54
Marginals by Sampling on Factor Graph
Estimate the marginals as the fraction of samples with each value:
  X1: n 4/6, v 2/6
  X2: n 3/6, v 3/6
  X3: p 4/6, v 2/6
  X4: d 6/6
  X5: n 6/6
(Same six samples and factor graph as above.)
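In code, this estimate is just counting. A minimal Python sketch (the samples list is typed in from the six samples above):

from collections import Counter

samples = [
    "n v p d n".split(), "n n v d n".split(), "n v p d n".split(),
    "v n p d n".split(), "v n v d n".split(), "n v p d n".split(),
]

for i in range(5):                       # positions X1..X5
    counts = Counter(s[i] for s in samples)
    marginal = {tag: c / len(samples) for tag, c in counts.items()}
    print(f"X{i+1}: {marginal}")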
55
How do we get marginals without sampling?
That’s what Belief Propagation is all about!
Why not just sample?
• Sampling one joint assignment is also NP-hard in general.
– In practice: Use MCMC (e.g., Gibbs sampling) as an anytime algorithm.
– So draw an approximate sample fast, or run longer for a “good” sample.
• Sampling finds the high-probability values xi efficiently.
But it takes too many samples to see the low-probability ones.
– How do you find p(“The quick brown fox …”) under a language model?
• Draw random sentences to see how often you get it? Takes a long time.
• Or multiply factors (trigram probabilities)? That’s what BP would do.
56
Great Ideas in ML: Message Passing
Count the soldiers: each soldier in the line announces "there's 1 of me" and passes along counts "1 before you", "2 before you", …, "5 before you" in one direction and "1 behind you", "2 behind you", …, "5 behind you" in the other.
adapted from MacKay (2003) textbook
57
Great Ideas in ML: Message Passing
Count the soldiers: a soldier who hears "2 before you" and "3 behind you", and knows "there's 1 of me", only sees incoming messages but can conclude:
Belief: Must be 2 + 1 + 3 = 6 of us.
adapted from MacKay (2003) textbook
58
Great Ideas in ML: Message Passing
Count the soldiers: likewise, a soldier who hears "1 before you" and "4 behind you" concludes:
Belief: Must be 1 + 1 + 4 = 6 of us.
adapted from MacKay (2003) textbook
59
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree:
• hearing "3 here" and "7 here" and knowing "1 of me", a soldier reports "11 here (= 7 + 3 + 1)" up the remaining branch
• similarly "7 here (= 3 + 3 + 1)" on another branch
• combining all the reports: Belief: Must be 14 of us.
adapted from MacKay (2003) textbook
64
Message Passing in Belief Propagation
"My other factors think I'm a noun": message (v 1, n 6, a 3)
"But my other variables and I think you're a verb": message (v 6, n 1, a 3)
Both of these messages judge the possible values of variable X.
Their product = belief at X = product of all 3 messages to X.
[Figure: variable X between a factor Ψ and the rest of the graph; the belief at X is shown as (v 6, n 1, a 9)]
Sum-Product Belief Propagation
The algorithm computes four kinds of quantities:
  Beliefs at variables and beliefs at factors;
  Messages from factors to variables and messages from variables to factors.
[Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3, and a factor ψ1 with neighboring variables X1, X2, X3, each annotated with its belief/message formula]
66
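These are the standard sum-product updates (the same quantities the following worked examples compute numerically). Writing N(i) for the factors touching variable Xi and N(α) for the variables touched by factor ψα:

  belief at a variable:   bi(xi) ∝ ∏_{α ∈ N(i)} μ_{α→i}(xi)
  belief at a factor:     bα(xα) ∝ ψα(xα) ∏_{i ∈ N(α)} μ_{i→α}(xi)
  variable → factor:      μ_{i→α}(xi) = ∏_{β ∈ N(i) \ {α}} μ_{β→i}(xi)
  factor → variable:      μ_{α→i}(xi) = Σ_{xα : xα[i] = xi} ψα(xα) ∏_{j ∈ N(α) \ {i}} μ_{j→α}(xj)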
Sum-Product Belief Propagation
Variable Belief: the belief at X1 is the pointwise product of all incoming messages, e.g.,
  from ψ2: (v 0.1, n 3, p 1);  from ψ1: (v 1, n 2, p 2);  from ψ3: (v 4, n 1, p 0)
  belief at X1: (v 0.4, n 6, p 0)
Variable Message: the message from X1 to one neighboring factor is the product of the messages from its other neighbors, e.g.,
  (v 0.1, n 3, p 1) ⊙ (v 1, n 2, p 2) = (v 0.1, n 6, p 2)
Factor Belief: the belief at a factor ψ1 over (X1, X3) is ψ1's table multiplied pointwise by the incoming messages from X1 and X3.
Factor Message: the message from ψ1 to X1 sums ψ1's table against the incoming message from X3, e.g., (p 0.8 + 0.16, d 24 + 0, n 8 + 0.2); for a binary factor this is a matrix-vector product.
72
Sum-Product Belief Propagation
Input: a factor graph with no cycles
Output: exact marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Choose a root node.
3. Send messages from the leaves to the root.
4. Send messages from the root to the leaves.
5. Compute the beliefs (unnormalized marginals).
6. Normalize beliefs and return the exact marginals.
73
CRF Tagging Model
X1
X2
X3
find
preferred
tags
Could be verb or noun
Could be adjective or verb
Could be noun or verb
76
CRF Tagging by Belief Propagation
Forward algorithm = message passing (matrix-vector products)
Backward algorithm = message passing (matrix-vector products)
[Figure: forward messages α and backward messages β flowing along the chain for "find preferred tags"; e.g., a belief of (v 1.8, n 0, a 4.2) at the middle variable]
• Forward-backward is a message passing algorithm.
• It's the simplest case of belief propagation.
77
So Let’s Review Forward-Backward …
X1
X2
X3
find
preferred
tags
Could be verb or noun
Could be adjective or verb
Could be noun or verb
78
So Let's Review Forward-Backward …
[Figure: trellis with START and END states and possible values {v, n, a} for each of X1, X2, X3 over "find preferred tags"]
• Show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
81
Viterbi Algorithm: Most Probable Assignment
[Figure: the trellis, with the path v → a → n highlighted; weights on edges (e.g., ψ{3,4}(a, END)) and on nodes (e.g., ψ{3}(n))]
• So p(v a n) = (1/Z) * product of 7 numbers: the weights on the edges and nodes of this one path
• Most probable assignment = path with highest product weight
83
Forward-Backward Algorithm: Finds Marginals
[Figure: the same trellis over "find preferred tags"]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(X2 = a) = (1/Z) * total weight of all paths through a
  (and likewise p(X2 = n) and p(X2 = v): the total weight of all paths through that value)
87
Forward-Backward Algorithm: Finds Marginals
[Figure: the trellis, highlighting the path prefixes that end at X2 = n and the path suffixes that start at X2 = n]
α2(n) = total weight of these path prefixes (found by dynamic programming: matrix-vector products)
β2(n) = total weight of these path suffixes (found by dynamic programming: matrix-vector products)
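For reference, the standard dynamic-programming recurrences behind these quantities, written with the unigram factors ψ{t} and transition factors ψ{t-1,t} used on these slides (base cases of 1 at the START and END states are assumed):

  αt(x) = Σ_{x'} αt-1(x') ψ{t-1}(x') ψ{t-1,t}(x', x)
  βt(x) = Σ_{x'} ψ{t,t+1}(x, x') ψ{t+1}(x') βt+1(x')

Each update is a matrix-vector product, as noted above.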
89
Forward-Backward Algorithm: Finds Marginals
[Figure: the trellis, with prefixes of total weight a + b + c and suffixes of total weight x + y + z meeting at X2 = n]
α2(n) = total weight of these path prefixes (a + b + c)
β2(n) = total weight of these path suffixes (x + y + z)
Product gives ax + ay + az + bx + by + bz + cx + cy + cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals
Oops! The weight of a path through a state also includes a weight at that state. So α2(n)∙β2(n) isn't enough. The extra weight is the opinion of the unigram factor at this variable, ψ{2}(n).
total weight of all paths through n = α2(n) ∙ ψ{2}(n) ∙ β2(n) = "belief that X2 = n"
91
Forward-Backward Algorithm: Finds Marginals
Likewise, total weight of all paths through v = α2(v) ∙ ψ{2}(v) ∙ β2(v) = "belief that X2 = v"
92
Forward-Backward Algorithm: Finds Marginals
And total weight of all paths through a = α2(a) ∙ ψ{2}(a) ∙ β2(a) = "belief that X2 = a"
The beliefs at X2 are (v 1.8, n 0, a 4.2); their sum is Z = 6, the total weight of all paths. Divide by Z to get the marginal probabilities (v 0.3, n 0, a 0.7).
93
(Acyclic) Belief Propagation
In a factor graph with no cycles:
1. Pick any node to serve as the root.
2. Send messages from the leaves to the root.
3. Send messages from the root to the leaves.
A node computes an outgoing message along an edge
only after it has received incoming messages along all its other edges.
[Figure: the tree-shaped factor graph over X1–X9 for "time flies like an arrow"]
94
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H.
[Figure, adapted from Burkett & Klein (2012): the tree-shaped factor graph partitioned into subgraphs F, G, and H around a variable Xi]
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H (or G, or F).
The marginal of Xi in that smaller model is the message sent to Xi from that subgraph. (Message to a variable.)
[Figure: each of the subgraphs H, G, and F sending its message to Xi]
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in the combined subgraph F ∪ H.
The marginal of Xi in that smaller model is the message sent by Xi out of F ∪ H. (Message from a variable.)
[Figure: the combined subgraph sending its message out through Xi toward G]
Acyclic BP as Dynamic Programming
• If you want the marginal pi(xi) where Xi has degree k, you can think of that
summation as a product of k marginals computed on smaller subgraphs.
• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute
the marginals on all such subgraphs, working from smaller to bigger. So
you can compute all the marginals.
[Figure: the tree-shaped factor graph, with an edge cut to separate two subgraphs]
101
Loopy Belief Propagation
What if our graph has cycles?
• Messages from different subgraphs are no longer independent!
  – Dynamic programming can't help. It's now #P-hard in general to compute the exact marginals.
• But we can still run BP -- it's a local algorithm, so it doesn't "see the cycles."
[Figure: the factor graph that combines the tag chain with the tree factors, which contains cycles]
108
What can go wrong with loopy BP?
[Figure: four variables in a cycle, all currently F; all 4 factors on the cycle enforce equality]
109
What can go wrong with loopy BP?
[Figure: the same cycle, all variables T; one extra factor says the upper variable is twice as likely to be true as false (and that's the true marginal!)]
110
What can go wrong with loopy BP?
[Figure: the cycle of four variables with messages such as (T 2, F 1) and (T 4, F 1) circulating; all 4 factors on the cycle enforce equality, and one extra factor says the upper variable is twice as likely to be T as F (and that's the true marginal!)]
• Messages loop around and around …
• 2, 4, 8, 16, 32, ... More and more convinced that these variables are T!
• So beliefs converge to marginal distribution (1, 0) rather than (2/3, 1/3).
• BP incorrectly treats this message as separate evidence that the variable is T.
  – Multiplies these two messages as if they were independent.
  – But they don't actually come from independent parts of the graph.
  – One influenced the other (via a cycle).
This is an extreme example. Often in practice, the cyclic influences are weak (as cycles are long or include at least one weak correlation).
111
What can go wrong with loopy BP?
A rumor is circulating that Obama secretly owns an insurance company. (Obamacare is actually designed to maximize his profit.)
Your prior doesn't think Obama owns it (T 1, F 99). But everyone's saying he does (each "says so" neighbor sends a message like T 2, F 1). Under a Naïve Bayes model, you therefore believe it (belief T 2048, F 99).
[Figure: an "Obama owns it" variable connected to "Alice says so", "Bob says so", "Charlie says so", "Kathy says so"]
"A lie told often enough becomes truth." -- Lenin
112
What can go wrong with loopy BP?
Better model ... Rush can influence conversation.
– Now there are 2 ways to explain why everyone's repeating the story: it's true, or Rush said it was.
  (Actually 4 ways: but "both" has a low prior and "neither" has a low likelihood, so only 2 good ways.)
– The model favors one solution (probably Rush).
– Yet BP has 2 stable solutions. Each solution is self-reinforcing around cycles; no impetus to switch.
If everyone blames Obama, then no one has to blame Rush. But if no one blames Rush, then everyone has to continue to blame Obama (to explain the gossip).
[Figure: the same reporters, now also connected to a "Rush says so" variable; the belief about "Obama owns it" is marked T ???, F ???]
"A lie told often enough becomes truth." -- Lenin
113
Loopy Belief Propagation Algorithm
• Run the BP update equations on a cyclic graph
– Hope it “works” anyway (good approximation)
• Though we multiply messages that aren’t independent
• No interpretation as dynamic programming
– If largest element of a message gets very big or small,
• Divide the message by a constant to prevent over/underflow
• Can update messages in any order
– Stop when the normalized messages converge
• Compute beliefs from final messages
– Return normalized beliefs as approximate marginals
e.g., Murphy, Weiss & Jordan (1999)
114
Loopy Belief Propagation
Input: a factor graph with cycles
Output: approximate marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Send messages until convergence. Normalize them when they grow too large.
3. Compute the beliefs (unnormalized marginals).
4. Normalize beliefs and return the approximate marginals.
115
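As a concrete companion to this recipe, here is a minimal (loopy) sum-product BP sketch in Python/NumPy. It is hypothetical illustration code written for this transcript, not the tutorial's own software; the FactorGraph class, its method names, and the toy chain at the bottom are all invented for the example:

import numpy as np

class FactorGraph:
    """Tiny factor-graph container: variables with finite domains, factors as NumPy tables."""
    def __init__(self):
        self.card = {}       # variable name -> domain size
        self.factors = {}    # factor name -> (list of variable names, potential table)

    def add_variable(self, v, k):
        self.card[v] = k

    def add_factor(self, name, variables, table):
        self.factors[name] = (list(variables), np.asarray(table, dtype=float))

    def run_bp(self, iters=50):
        # 1. Initialize all messages (both directions) to the uniform distribution.
        msg_vf = {(v, f): np.ones(self.card[v]) / self.card[v]
                  for f, (vs, _) in self.factors.items() for v in vs}
        msg_fv = {(f, v): np.ones(self.card[v]) / self.card[v]
                  for f, (vs, _) in self.factors.items() for v in vs}
        # 2. Send messages for a while; normalize each message to avoid over/underflow.
        for _ in range(iters):
            for (v, f) in list(msg_vf):      # variable -> factor: product of the OTHER factors' messages
                m = np.ones(self.card[v])
                for g, (vs, _) in self.factors.items():
                    if g != f and v in vs:
                        m = m * msg_fv[(g, v)]
                msg_vf[(v, f)] = m / m.sum()
            for (f, v) in list(msg_fv):      # factor -> variable: weight the table, then sum out the others
                vs, t = self.factors[f]
                t = t.copy()
                for axis, u in enumerate(vs):
                    if u != v:
                        shape = [1] * t.ndim
                        shape[axis] = self.card[u]
                        t = t * msg_vf[(u, f)].reshape(shape)
                keep = vs.index(v)
                axes = tuple(a for a in range(t.ndim) if a != keep)
                m = t.sum(axis=axes) if axes else t
                msg_fv[(f, v)] = m / m.sum()
        # 3-4. Compute beliefs (products of incoming messages) and normalize them.
        beliefs = {}
        for v in self.card:
            b = np.ones(self.card[v])
            for f, (vs, _) in self.factors.items():
                if v in vs:
                    b = b * msg_fv[(f, v)]
            beliefs[v] = b / b.sum()
        return beliefs

# Toy example: a 3-variable chain with binary tags and one unary factor.
# (This toy graph happens to be a tree, so the beliefs are exact; the same
# code runs unchanged, but only approximately, on a graph with cycles.)
g = FactorGraph()
for v in ["X1", "X2", "X3"]:
    g.add_variable(v, 2)
g.add_factor("psi12", ["X1", "X2"], [[3.0, 1.0], [1.0, 3.0]])
g.add_factor("psi23", ["X2", "X3"], [[3.0, 1.0], [1.0, 3.0]])
g.add_factor("psi1", ["X1"], [2.0, 1.0])
print(g.run_bp())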
Section 3:
Belief Propagation Q&A
Methods like BP and in what sense
they work
116
Q&A
Q: Forward-backward is to the Viterbi algorithm
as sum-product BP is to __________ ?
A: max-product BP
119
Max-product Belief Propagation
• Sum-product BP can be used to
compute the marginals, pi(Xi)
• Max-product BP can be used to
compute the most likely assignment,
X* = argmaxX p(X)
120
Max-product Belief Propagation
• Change the sum to a max:
• Max-product BP computes max-marginals
– The max-marginal bi(xi) is the (unnormalized)
probability of the MAP assignment under the
constraint Xi = xi.
– For an acyclic graph, the MAP assignment (assuming
there are no ties) is given by:
121
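For reference, "change the sum to a max" gives the standard max-product message (stated here in the notation used earlier):

  μ_{α→i}(xi) = max_{xα : xα[i] = xi} [ ψα(xα) ∏_{j ∈ N(α) \ {i}} μ_{j→α}(xj) ]

and, for an acyclic graph with no ties, the MAP assignment is recovered componentwise as x*_i = argmax_{xi} bi(xi).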
Deterministic Annealing
• Motivation: Smoothly transition from sum-product to max-product
• Add an inverse temperature parameter to each factor: Annealed Joint Distribution
• Send messages as usual for sum-product BP
• Anneal T from 1 to 0:
  T = 1: Sum-product
  T → 0: Max-product
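In the standard formulation that "Annealed Joint Distribution" refers to, each factor is raised to the power 1/T, so

  p_T(x) ∝ ∏α ψα(xα)^(1/T) ∝ p(x)^(1/T)

which recovers the original distribution at T = 1 and concentrates on the MAP assignment(s) as T → 0.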
123
Q&A
Q: This feels like Arc Consistency…
Any relation?
A: Yes, BP is doing (with probabilities) what
people were doing in AI long before.
124
From Arc Consistency to BP
Goal: Find a satisfying assignment
Algorithm: Arc Consistency
1. Pick a constraint
2. Reduce domains to satisfy the constraint
3. Repeat until convergence
Example: X, Y, U, T ∈ {1, 2, 3}, with constraints Y = U and X < T, plus two more ordering constraints relating X to Y and T to U.
Note: these steps can occur in somewhat arbitrary order.
Propagation completely solved the problem!
[Figure: the constraint graph over X, Y, U, T, with values pruned from each {1, 2, 3} domain]
Slide thanks to Rina Dechter (modified)
125
From Arc Consistency to BP
Arc Consistency is a special case of Belief Propagation.
Slide thanks to Rina Dechter (modified)
126
From Arc Consistency to BP
Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence
[Figure: each constraint replaced by a 3×3 table of 0's and 1's, e.g., the identity matrix for Y = U and a strictly triangular matrix for X < T]
Slide thanks to Rina Dechter (modified)
127
From Arc Consistency to BP
[Figure: a sequence of BP message updates on the hard-constraint graph; messages such as (2, 1, 0) and (1, 1, 1) assign belief 0 to exactly the domain values that arc consistency pruned]
From Arc Consistency to BP
Loopy BP will converge to the equivalent solution!
[Figure: the pruned domains, as in the arc-consistency run]
Slide thanks to Rina Dechter (modified)
134
From Arc Consistency to BP
Takeaways:
• Arc Consistency is a special case of Belief Propagation.
• Arc Consistency will only rule out impossible values.
• BP rules out those same values (belief = 0).
Loopy BP will converge to the equivalent solution!
Slide thanks to Rina Dechter (modified)
135
Q&A
Q: Is BP totally divorced from sampling?
A: Gibbs Sampling is also a kind of message
passing algorithm.
136
From Gibbs Sampling to Particle BP to BP
Message representation vs. number of particles:
  A. Belief Propagation: full distribution (# of particles = +∞)
  B. Gibbs sampling: single particle (# of particles = 1)
  C. Particle BP: multiple particles (# of particles = k)
137
From Gibbs Sampling to Particle BP to BP
[Figure: a chain of string-valued variables W -- ψ2 -- X -- ψ3 -- Y, with candidate values such as mean/man/meant for W, too/to/two for X, and taipei/tight/type for Y]
138
From Gibbs Sampling to Particle BP to BP
[Figure: the same chain W -- ψ2 -- X -- ψ3 -- Y with one current value per variable]
Approach 1: Gibbs Sampling
• For each variable, resample the value by conditioning on all the other variables
  – Called the "full conditional" distribution
  – Computationally easy because we really only need to condition on the Markov Blanket
• We can view the computation of the full conditional in terms of message passing
  – Message puts all its probability mass on the current particle (i.e., current value)
139
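As a concrete (hypothetical) sketch of that message-passing view, here is how one could resample X given fixed values of W and Y in the little chain above, in Python; the potential tables psi2 and psi3 hold made-up toy numbers:

import random

X_VALUES = ["too", "to", "two"]

# toy pairwise potentials psi2(W, X) and psi3(X, Y), invented for illustration
psi2 = {("mean", x): p for x, p in zip(X_VALUES, [2.0, 8.0, 1.0])}
psi3 = {(x, "taipei"): p for x, p in zip(X_VALUES, [0.5, 1.0, 4.0])}

def resample_X(w_current, y_current):
    # Each neighboring factor sends X a "message": its table evaluated at the
    # neighbor's current value (a point-mass particle).  The full conditional
    # is proportional to the product of these messages.
    weights = [psi2[(w_current, x)] * psi3[(x, y_current)] for x in X_VALUES]
    total = sum(weights)
    return random.choices(X_VALUES, weights=[w / total for w in weights])[0]

print(resample_X("mean", "taipei"))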
From Gibbs Sampling to Particle BP to BP
[Figure: the full conditional for X computed by message passing: W and Y each send a point-mass message at their current value, which picks out the relevant entries of ψ2 and ψ3; the product gives the distribution that X is resampled from]
From Gibbs Sampling to Particle BP to BP
[Figure: the same chain, now with several current values (particles) per variable]
142
From Gibbs Sampling to Particle BP to BP
[Figure: k copies of the chain, one per independent sampler]
Approach 2: Multiple Gibbs Samplers
• Run each Gibbs Sampler independently
• Full conditionals computed independently
  – k separate messages that are each a point-mass distribution
143
From Gibbs Sampling to Particle BP to BP
[Figure: the chain with k particles per variable]
Approach 3: Gibbs Sampling w/Averaging
• Keep k samples for each variable
• Resample from the average of the full conditionals for each possible pair of variables
  – Message is a uniform distribution over current particles
144
From Gibbs Sampling to Particle BP to BP
[Figure: the averaged full conditional for X, computed from uniform-over-particles messages and the factor tables ψ2 and ψ3]
From Gibbs Sampling to Particle BP to BP
[Figure: the chain with weighted particles, e.g., X: (mean 3, meant 4) and Y: (taipei 2, type 6)]
Approach 4: Particle BP
• Similar in spirit to Gibbs Sampling w/Averaging
• Messages are a weighted distribution over k particles
(Ihler & McAllester, 2009)
147
From Gibbs Sampling to Particle BP to BP
[Figure: each message is now a full distribution over the whole vocabulary, from aardvark to zymurgy]
Approach 5: BP
• In Particle BP, as the number of particles goes to +∞, the estimated messages approach the true BP messages
• Belief propagation represents messages as the full distribution
  – This assumes we can store the whole distribution compactly
(Ihler & McAllester, 2009)
148
From Gibbs Sampling to Particle BP to BP
Tension between approaches…
Sampling values or combinations of values:
• quickly get a good estimate of the frequent cases
• may take a long time to estimate probabilities of infrequent cases
• may take a long time to draw a sample (mixing time)
• exact if you run forever
Enumerating each value and computing its probability exactly:
• have to spend time on all values
• but only spend O(1) time on each value (don't sample frequent values over and over while waiting for infrequent ones)
• runtime is more predictable
• lets you trade off exactness for greater speed (brute force exactly enumerates exponentially many assignments; BP approximates this by enumerating local configurations)
150
Background: Convergence
When BP is run on a tree-shaped factor graph,
the beliefs converge to the marginals of the
distribution after two passes.
151
Q&A
Q: How long does loopy BP take to converge?
A: It might never converge. Could oscillate.
[Figure: a small cyclic factor graph (factors ψ1, ψ2) whose messages can oscillate]
152
Q&A
Q: When loopy BP converges, does it always
get the same answer?
A: No. Sensitive to initialization and update
order.
[Figure: the same cyclic factor graph; different initializations or update orders can converge to different beliefs]
153
Q&A
Q: Are there convergent variants of loopy BP?
A: Yes. It's actually trying to minimize a certain
differentiable function of the beliefs, so you
could just minimize that function directly.
154
Q&A
Q: But does that function have a unique
minimum?
A: No, and you'll only be able to find a local
minimum in practice. So you're still dependent
on initialization.
155
Q&A
Q: If you could find the global minimum,
would its beliefs give the marginals of the
true distribution?
A: No.
We’ve
found the
bottom!!
156
Q&A
Q: Is it finding the
marginals of some
other distribution?
A: No, just a
collection of
beliefs.
Might not be
globally consistent
in the sense of all
being views of the
same elephant.
*Cartoon by G. Renee Guzlas
157
Q&A
Q: Does the global minimum give beliefs that
are at least locally consistent?
A: Yes.
A variable belief and
a factor belief are
locally consistent if
the marginal of the
factor’s belief equals
the variable’s belief.
[Figure: a factor ψα between X1 and X2, with factor belief
         X2=v  X2=n
  X1=p    2     3
  X1=d    1     1
  X1=n    4     6
 whose marginals (X1: p 5, d 2, n 10; X2: v 7, n 10) equal the variable beliefs at X1 and X2]
158
Q&A
Q: In what sense are the beliefs at the global
minimum any good?
A: They are the global minimum of the
Bethe Free Energy.
We’ve
found the
bottom!!
159
Q&A
Q: When loopy BP converges, in what sense
are the beliefs any good?
A: They are a local minimum of the
Bethe Free Energy.
160
Q&A
Q: Why would you want to minimize the Bethe
Free Energy?
A: 1) It’s easy to minimize* because it’s a sum of
functions on the individual beliefs.
2) On an acyclic factor graph, it measures KL
divergence between beliefs and true
marginals, and so is minimized when beliefs
= marginals. (For a loopy graph, we close
our eyes and hope it still works.)
[*] Though we can’t just minimize each function
separately – we need message passing to keep the
beliefs locally consistent.
161
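For reference, the objective the last few slides call the Bethe Free Energy is usually written as follows (a standard form; the slides do not spell out the notation, so this rendering is mine):

F_{\mathrm{Bethe}}(b) \;=\; \sum_{\alpha}\sum_{x_\alpha} b_\alpha(x_\alpha)\,\log\frac{b_\alpha(x_\alpha)}{\psi_\alpha(x_\alpha)} \;-\; \sum_{i}(d_i-1)\sum_{x_i} b_i(x_i)\,\log b_i(x_i)

where d_i is the number of factors adjacent to X_i, and the minimization is subject to the normalization and local-consistency constraints above. On an acyclic graph the minimum recovers the true marginals; BP’s fixed points are stationary points of this objective.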
Section 4:
Incorporating Structure into
Factors and Variables
162
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
163
BP for Coordination of Algorithms
[Figure: a joint factor graph over the sentence pair “the white house” / “la casa blanca”: boolean span and alignment variables (T/F) coupled by factors ψ2 and ψ4, so BP coordinates the two analyses]
165
Sending Messages:
Computational Complexity
From Variables: O(d·k)
  d = # of neighboring factors
  k = # of possible values for Xi
To Variables: O(d·k^d)
  d = # of neighboring variables
  k = maximum # of possible values for a neighboring variable
[Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3; a factor ψ1 with neighboring variables X1, X2, X3]
166
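As a concrete illustration of where these two costs come from, here is a minimal sketch — not the tutorial’s code; messages are simplified to Python dictionaries and the potential to a callable:

from itertools import product

def msg_variable_to_factor(incoming, target):
    """Message X -> psi_target: product of the messages from all *other* factors.
    Cost O(d*k): d neighboring factors, k possible values."""
    values = next(iter(incoming.values())).keys()
    out = {}
    for v in values:
        score = 1.0
        for factor, msg in incoming.items():
            if factor != target:
                score *= msg[v]
        out[v] = score
    return out

def msg_factor_to_variable(potential, incoming, target):
    """Message psi -> X_target: sum over joint assignments of the neighbors.
    Cost O(d*k^d): d neighboring variables, each with up to k values."""
    variables = list(incoming.keys())
    domains = [list(incoming[v].keys()) for v in variables]
    out = {}
    for assignment in product(*domains):          # k^d joint assignments
        a = dict(zip(variables, assignment))
        score = potential(a)
        for v in variables:
            if v != target:
                score *= incoming[v][a[v]]
        out[a[target]] = out.get(a[target], 0.0) + score
    return out

# Toy usage: a factor over two POS variables that prefers agreement.
psi = lambda a: 2.0 if a["X1"] == a["X2"] else 1.0
msgs_into_psi = {"X1": {"n": 0.7, "v": 0.3}, "X2": {"n": 0.5, "v": 0.5}}
print(msg_factor_to_variable(psi, msgs_into_psi, "X2"))
print(msg_variable_to_factor({"psi1": {"n": 2.0, "v": 1.0}, "psi2": {"n": 0.5, "v": 3.0}}, "psi2"))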
INCORPORATING STRUCTURE
INTO FACTORS
168
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: one boolean variable (T/F) per span of “time flies like an arrow”, each with a unary factor (ψ1–ψ13)]
169
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: one boolean variable (T/F) per span of “time flies like an arrow”, each with a unary factor (ψ1–ψ13)]
170
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: the same span variables with a different T/F assignment]
171
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: one boolean variable (T/F) per span of “time flies like an arrow”, each with a unary factor (ψ1–ψ13)]
172
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: one boolean variable (T/F) per span of “time flies like an arrow”, each with a unary factor (ψ1–ψ13)]
173
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: the same span variables with a different T/F assignment]
174
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
[Figure: the span variables (T/F) with their unary factors]
Sending a message to a variable from its unary factors takes only O(d·k^d) time where k=2 and d=1.
(Naradowsky, Vieira, & Smith, 2012)
175
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
But nothing prevents
non-tree structures.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: a T/F assignment to the span variables that does not form a tree]
Sending a message to a variable from its unary factors takes only O(d·k^d) time where k=2 and d=1.
176
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
But nothing prevents
non-tree structures.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: the span variables, now all connected to a single CKYTree factor]
Add a CKYTree factor which
multiplies in 1 if the variables
form a tree and 0 otherwise.
177
Unlabeled Constituency Parsing
Given: a sentence.
Predict: unlabeled parse.
We could predict
whether each span is
present T or not F.
But nothing prevents
non-tree structures.
(Naradowsky, Vieira, & Smith, 2012)
[Figure: the span variables, now all connected to a single CKYTree factor]
Add a CKYTree factor which
multiplies in 1 if the variables
form a tree and 0 otherwise.
178
Unlabeled Constituency Parsing
How long does it take to send a message to a variable from the CKYTree factor?
For the given sentence: O(d·k^d) time, where k=2 and d=15.
For a length-n sentence, this will be O(2^(n·n)).
But we know an algorithm (inside-outside) to compute all the marginals in O(n^3)…
So can’t we do better?
(Naradowsky, Vieira, & Smith, 2012)
[Figure: all span variables for “time flies like an arrow”, attached to the CKYTree factor]
Add a CKYTree factor which
multiplies in 1 if the variables
form a tree and 0 otherwise.
179
Example: The Exactly1 Factor
Variables: d binary variables X1, …, Xd
Global Factor: Exactly1(X1, …, Xd) = 1 if exactly one of the d binary variables Xi is on, 0 otherwise
ψE1
(Smith & Eisner, 2008)
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
180
Messages: The Exactly1 Factor
From Variables
To Variables
ψE1
ψE1
X1
X2
X3
X4
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
ψ1
ψ2
ψ3
ψ4
O(d*2)
O(d·2^d)
d = # of neighboring factors
d = # of neighboring variables
186
Messages: The Exactly1 Factor
From Variables
To Variables
ψE1
ψE1
X1
X2
X3
X4
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
ψ1
ψ2
ψ3
ψ4
Fast!
O(d*2)
O(d·2^d)
d = # of neighboring factors
d = # of neighboring variables
187
Messages: The Exactly1 Factor
To Variables
ψE1
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
O(d·2^d)
d = # of neighboring variables
188
Messages: The Exactly1 Factor
To Variables
But the outgoing messages from
the Exactly1 factor are defined as
a sum over the 2^d possible
assignments to X1, …, Xd.
Conveniently, ψE1(xa) is 0 for all
but d values – so the sum is
sparse!
So we can compute all the
outgoing messages from ψE1 in
O(d) time!
ψE1
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
O(d·2^d)
d = # of neighboring variables
189
Messages: The Exactly1 Factor
To Variables
But the outgoing messages from
the Exactly1 factor are defined as
a sum over the 2^d possible
assignments to X1, …, Xd.
ψE1
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
Fast!
Conveniently, ψE1(xa) is 0 for all
but d values – so the sum is
sparse!
So we can compute all the
outgoing messages from ψE1 in
O(d) time!
O(d·2^d)
d = # of neighboring variables
190
Messages: The Exactly1 Factor
ψE1
A factor has a belief about each of its variables.
X1
X2
X3
X4
ψ1
ψ2
ψ3
ψ4
An outgoing message from a factor is the factor's belief with the incoming message
divided out.
We can compute the Exactly1 factor’s beliefs about each of its variables efficiently. (Each
of the parenthesized terms needs to be computed only once for all the variables.)
(Smith & Eisner, 2008)
191
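A minimal sketch of that O(d) computation (assuming strictly positive incoming messages; the names and representation are mine, not Smith & Eisner’s code):

def exactly1_outgoing(incoming):
    """incoming[i] = (m_i(0), m_i(1)): message from variable X_i into psiE1.
    Returns outgoing[i] = (mu_i(0), mu_i(1)): message from psiE1 back to X_i."""
    prod_off = 1.0                       # prod_j m_j(0), computed once
    ratio_sum = 0.0                      # sum_j m_j(1)/m_j(0), computed once
    for m0, m1 in incoming:
        prod_off *= m0
        ratio_sum += m1 / m0
    outgoing = []
    for m0, m1 in incoming:
        rest = prod_off / m0             # prod_{j != i} m_j(0)  (incoming message divided out)
        mu1 = rest                       # only X_i is on
        mu0 = rest * (ratio_sum - m1 / m0)   # exactly one *other* X_j is on
        outgoing.append((mu0, mu1))
    return outgoing

print(exactly1_outgoing([(1.0, 2.0), (3.0, 1.0), (2.0, 2.0)]))

The two parenthesized quantities (the product and the ratio sum) play the role of the terms the slide says need to be computed only once for all the variables.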
Example: The CKYTree Factor
Variables: O(n²) binary variables Sij
Global Factor: CKYTree(S01, S12, …, S04) = 1 if the span variables form a constituency tree, 0 otherwise
(Naradowsky, Vieira, & Smith, 2012)
[Figure: span variables Sij over the sentence “the barista made coffee” (word positions 0–4)]
192
Messages: The CKYTree Factor
From Variables
To Variables
S04
S03
S01
S12
barista
S03
S24
3
made
S01
4
coffee
0
S12
S24
barista
S34
S23
2
1
the
S14
S13
S02
S34
S23
2
1
the
S14
S13
S02
0
S04
3
made
4
coffee
O(d*2)
O(d·2^d)
d = # of neighboring factors
d = # of neighboring variables
193
Messages: The CKYTree Factor
From Variables
To Variables
S04
S03
S14
S13
S02
S01
0
S04
S12
S24
3
Fast!
the
barista
made
S01
4
coffee
0
S12
S24
barista
S34
S23
2
1
the
S14
S13
S02
S34
S23
2
1
S03
3
made
4
coffee
O(d*2)
O(d·2^d)
d = # of neighboring factors
d = # of neighboring variables
194
Messages: The CKYTree Factor
To Variables
S04
S03
S13
S02
S01
0
S12
S24
barista
S34
S23
2
1
the
S14
3
made
4
coffee
O(d·2^d)
d = # of neighboring variables
195
Messages: The CKYTree Factor
To Variables
But the outgoing messages from
the CKYTree factor are defined as
a sum over the O(2^(n·n)) possible
assignments to {Sij}.
S04
S03
S13
S02
S01
0
ψCKYTree(xa) is 1 for exponentially
many values in the sum – but they
all correspond to trees!
With inside-outside we can
compute all the outgoing
messages from CKYTree in O(n^3)
time!
S12
S24
barista
S34
S23
2
1
the
S14
3
made
4
coffee
O(d·2^d)
d = # of neighboring variables
196
Messages: The CKYTree Factor
To Variables
But the outgoing messages from
the CKYTree factor are defined as
a sum over the O(2^(n·n)) possible
assignments to {Sij}.
S04
S03
S13
S02
S01
Fast!
0
ψCKYTree(xa) is 1 for exponentially
many values in the sum – but they
all correspond to trees!
With inside-outside we can
compute all the outgoing
messages from CKYTree in O(n^3)
time!
S12
S24
barista
S34
S23
2
1
the
S14
3
made
4
coffee
O(d·2^d)
d = # of neighboring variables
197
Example: The CKYTree Factor
For a length n sentence, define an anchored weighted
context free grammar (WCFG).
Each span is weighted by the ratio of the incoming
messages from the corresponding span variable.
[Figure: span variables Sij for “the barista made coffee”, attached to the CKYTree factor]
Run the inside-outside algorithm on the sentence a1, a2, …, an with the anchored WCFG.
(Naradowsky, Vieira, & Smith, 2012)
198
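One standard way to make this recipe concrete (notation mine, not the slides’): weight each anchored span by the odds ratio of its incoming message,

w_{ij} = \frac{\mu_{S_{ij}\to\mathrm{CKYTree}}(\textsc{true})}{\mu_{S_{ij}\to\mathrm{CKYTree}}(\textsc{false})},
\qquad
Z = \sum_{\text{trees } t}\;\prod_{(k,l)\in t} w_{kl},
\qquad
p_{ij} = \frac{1}{Z}\sum_{\text{trees } t \ni (i,j)}\;\prod_{(k,l)\in t} w_{kl}.

Then, up to a constant common to both values (which normalization removes),

\mu_{\mathrm{CKYTree}\to S_{ij}}(\textsc{true}) \propto p_{ij}/w_{ij},
\qquad
\mu_{\mathrm{CKYTree}\to S_{ij}}(\textsc{false}) \propto 1 - p_{ij},

and Z together with all the p_{ij} come out of a single run of inside-outside on the anchored WCFG, in O(n^3).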
Example: The TrigramHMM Factor
Factors can compactly encode the preferences of an entire submodel.
Consider the joint distribution of a trigram HMM over 5 variables:
– It’s traditionally defined as a Bayes Network
– But we can represent it as a (loopy) factor graph
– We could even pack all those factors into a single TrigramHMM
factor (Smith & Eisner, 2008)
X1
X2
X3
X4
X5
W1
W2
W3
W4
W5
time
flies
like
an
arrow
(Smith & Eisner, 2008)
199
Example: The TrigramHMM Factor
Factors can compactly encode the preferences of an entire submodel.
Consider the joint distribution of a trigram HMM over 5 variables:
– It’s traditionally defined as a Bayes Network
– But we can represent it as a (loopy) factor graph
– We could even pack all those factors into a single TrigramHMM
factor (Smith & Eisner, 2008)
ψ10
X1
ψ2
X2
ψ11
ψ4
X3
ψ12
ψ6
X4
ψ8
X5
ψ1
ψ3
ψ5
ψ7
ψ9
time
flies
like
an
arrow
(Smith & Eisner, 2008)
200
Example: The TrigramHMM Factor
Factors can compactly encode the preferences of an entire submodel.
Consider the joint distribution of a trigram HMM over 5 variables:
– It’s traditionally defined as a Bayes Network
– But we can represent it as a (loopy) factor graph
– We could even pack all those factors into a single TrigramHMM
factor (Smith & Eisner, 2008)
ψ10
X1
ψ2
X2
ψ11
ψ4
X3
ψ12
ψ6
X4
ψ8
X5
ψ1
ψ3
ψ5
ψ7
ψ9
time
flies
like
an
arrow
(Smith & Eisner, 2008)
201
Example: The TrigramHMM Factor
Factors can compactly encode the preferences of an entire submodel.
Consider the joint distribution of a trigram HMM over 5 variables:
– It’s traditionally defined as a Bayes Network
– But we can represent it as a (loopy) factor graph
– We could even pack all those factors into a single TrigramHMM
factor (Smith & Eisner, 2008)
X1
X2
X3
X4
X5
time
flies
like
an
arrow
(Smith & Eisner, 2008)
202
Example: The TrigramHMM Factor
Variables: n discrete variables X1, …, Xn
Global Factor: TrigramHMM (X1, …, Xn) = p(X1, …, Xn)
according to
a trigram
HMM model
X1
X2
X3
X4
X5
time
flies
like
an
arrow
(Smith & Eisner, 2008)
203
Example: The TrigramHMM Factor
Variables: n discrete variables X1, …, Xn
Global Factor: TrigramHMM (X1, …, Xn) = p(X1, …, Xn)
according to
a trigram
HMM model
Compute outgoing messages efficiently with
the standard trigram HMM dynamic
programming algorithm (junction tree)!
X1
X2
X3
X4
X5
time
flies
like
an
arrow
(Smith & Eisner, 2008)
204
Combinatorial Factors
• Usually, it takes O(k^n) time to compute outgoing messages from a factor over n variables with k possible values each.
• But not always:
– Factors like Exactly1 with only polynomially
many nonzeroes in the potential table.
– Factors like CKYTree with exponentially many
nonzeroes but in a special pattern.
– Factors like TrigramHMM (Smith & Eisner 2008)
with all nonzeroes but which factor further.
205
Combinatorial Factors
Factor graphs can encode structural constraints
on many variables via constraint factors.
Example NLP constraint factors:
– Projective and non-projective dependency parse
constraint (Smith & Eisner, 2008)
– CCG parse constraint (Auli & Lopez, 2011)
– Labeled and unlabeled constituency parse
constraint (Naradowsky, Vieira, & Smith, 2012)
– Inversion transduction grammar (ITG)
constraint (Burkett & Klein, 2012)
206
Combinatorial Optimization within Max-Product
• Max-product BP computes max-marginals.
– The max-marginal bi(xi) is the (unnormalized) probability
of the MAP assignment under the constraint Xi = xi.
• Duchi et al. (2006) define factors, over many
variables, for which efficient combinatorial
optimization algorithms exist.
– Bipartite matching: max-marginals can be computed with a standard max-flow algorithm and the Floyd-Warshall all-pairs shortest-paths algorithm.
– Minimum cuts: max-marginals can be computed with a
min-cut algorithm.
• Similar to sum-product case: the combinatorial
algorithms are embedded within the standard loopy
BP algorithm.
(Duchi, Tarlow, Elidan, & Koller, 2006)
207
Additional Resources
See NAACL 2012 / ACL 2013 tutorial by Burkett
& Klein “Variational Inference in Structured
NLP Models” for…
– An alternative approach to efficient marginal
inference for NLP: Structured Mean Field
– Also, includes Structured BP
http://nlp.cs.berkeley.edu/tutorials/variational-tutorial-slides.pdf
208
Sending Messages:
Computational Complexity
From Variables: O(d·k)
  d = # of neighboring factors
  k = # of possible values for Xi
To Variables: O(d·k^d)
  d = # of neighboring variables
  k = maximum # of possible values for a neighboring variable
[Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3; a factor ψ1 with neighboring variables X1, X2, X3]
209
INCORPORATING STRUCTURE
INTO VARIABLES
211
String-Valued Variables
Consider two examples from Section 1:
• Variables (string):
– English and Japanese
orthographic strings
– English and Japanese
phonological strings
• Interactions:
– All pairs of strings could
be relevant
• Variables (string):
– Inflected forms
of the same verb
• Interactions:
– Between pairs of
entries in the table
(e.g. infinitive form affects
present-singular)
212
Graphical Models over Strings
[Figure: string variables X1, X2 joined by a factor ψ1; messages and beliefs are small tables over {ring, rang, rung}]
(Dreyer & Eisner, 2009)
• Most of our problems so far:
– Used discrete variables
– Over a small finite set of string
values
– Examples:
• POS tagging
• Labeled constituency parsing
• Dependency parsing
• We use tensors (e.g. vectors,
matrices) to represent the
messages and factors
213
Graphical Models over Strings
[Figure: the same model, but each message and belief is now a table over the entire vocabulary (aardvark, …, ring, rang, rung, …, zymurgy)]
Time Complexity:
  variable → factor: O(d·k)
  factor → variable: O(d·k^d)
(Dreyer & Eisner, 2009)
What happens as
the # of possible
values for a
variable, k,
increases?
We can still keep
the
computational
complexity down
by including only
low arity factors
(i.e. small d).
214
Graphical Models over Strings
[Figure: the same model with vocabulary-sized message tables]
But what if the domain of a variable is Σ*, the infinite set of all possible strings?
(Dreyer & Eisner, 2009)
How can we
represent a
distribution
over one or
more infinite
sets?
215
Graphical Models over Strings
[Figure: each vocabulary-sized message table is replaced by a weighted finite-state machine over letters (r, i, n, g, a, e, u, h, s, ε, …)]
Finite State Machines let us represent
something infinite in finite space!
(Dreyer & Eisner, 2009)
216
Graphical Models over Strings
[Figure: the factor graph redrawn with finite-state machines in place of tables]
messages and beliefs are Weighted Finite State Acceptors (WFSAs)
factors are Weighted Finite State Transducers (WFSTs)
Finite State Machines let us represent
something infinite in finite space!
(Dreyer & Eisner, 2009)
217
Graphical Models over Strings
That solves the problem of representation.
But how do we manage the problem of computation? (We still need to compute messages and beliefs.)
[Figure: the same model with WFSA messages/beliefs and WFST factors]
messages and beliefs are Weighted Finite State Acceptors (WFSAs)
factors are Weighted Finite State Transducers (WFSTs)
Finite State Machines let us represent
something infinite in finite space!
(Dreyer & Eisner, 2009)
218
Graphical Models over Strings
[Figure: WFSA messages flowing between X1, X2 and the WFST factors ψ1]
All the message and belief computations
simply reuse standard FSM dynamic
programming algorithms.
(Dreyer & Eisner, 2009)
219
Graphical Models over Strings
[Figure: two WFSA messages arriving at variable X2]
The pointwise product of two WFSAs is… …their intersection.
Compute the product of (possibly many) messages μα→i (each of which is a WFSA) via WFSA intersection.
(Dreyer & Eisner, 2009)
220
Graphical Models over Strings
Compute the marginalized product of WFSA message μk→α and WFST factor ψα, with:
  domain(compose(ψα, μk→α))
– compose: produces a new WFST with a distribution over (Xi, Xj)
– domain: marginalizes over Xj to obtain a WFSA over Xi only
(Dreyer & Eisner, 2009)
[Figure: string variables X1, X2 joined by a WFST factor ψ1, with WFSA messages]
221
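Spelled out (notation mine), the two finite-state operations implement exactly the ordinary BP formulas:

(\mu_1 \cap \mu_2)(x) \;=\; \mu_1(x)\cdot\mu_2(x)
\qquad\text{(pointwise product of messages = WFSA intersection)}

\big[\mathrm{domain}(\mathrm{compose}(\psi_\alpha, \mu_{k\to\alpha}))\big](x_i)
\;=\; \sum_{x_j} \psi_\alpha(x_i, x_j)\,\mu_{k\to\alpha}(x_j)
\qquad\text{(marginalized product for the factor-to-variable message)}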
Graphical Models over Strings
[Figure: as before — WFSA messages and WFST factors around X1 and X2]
All the message and belief computations
simply reuse standard FSM dynamic
programming algorithms.
(Dreyer & Eisner, 2009)
222
The usual NLP toolbox
• WFSA: weighted finite state automata
• WFST: weighted finite state transducer
• k-tape WFSM: weighted finite state machine jointly
mapping between k strings
They each assign a score to a set of strings.
We can interpret them as factors in a graphical
model.
The only difference is the arity of the factor.
223
WFSA as a Factor Graph
• WFSA: weighted finite state automata
• WFST: weighted finite state transducer
• k-tape WFSM: weighted finite state machine jointly
mapping between k strings
ψ1(x1) = 4.25
A WFSA is a
function which
maps a string to a
score.
[Figure: the WFSA factor ψ1 attached to string variable X1 assigns the string x1 the score 4.25]
224
WFST as a Factor Graph
• WFSA: weighted finite state automata
• WFST: weighted finite state transducer
• k-tape WFSM: weighted finite state machine jointly
mapping between k strings
ψ1(x1, x2) = 13.26
A WFST is a function
that maps a pair of
strings to a score.
[Figure: the WFST factor ψ1 attached to string variables X1 and X2 assigns the string pair (x1, x2) the score 13.26]
(Dreyer, Smith, & Eisner, 2008)
225
k-tape WFSM as a Factor Graph
• WFSA: weighted finite state automata
• WFST: weighted finite state transducer
• k-tape WFSM: weighted finite state machine jointly
mapping between k strings
[Figure: a 4-tape WFSM factor ψ1 attached to X1–X4; each arc is a 4-way edit operation over the four strings]
ψ1(x1, x2, x3, x4) = 13.26
A k-tape WFSM is a function that maps k strings to a score.
What's wrong with a 100-tape WFSM for jointly modeling the 100
distinct forms of a Polish verb?
– Each arc represents a 100-way edit operation
– Too many arcs!
226
Factor Graphs over Multiple Strings
P(x1, x2, x3, x4) = 1/Z ψ1(x1, x2) ψ2(x1, x3) ψ3(x1, x4) ψ4(x2, x3) ψ5(x3, x4)
[Figure: variables X1–X4 connected pairwise by binary factors ψ1–ψ5]
(Dreyer & Eisner, 2009)
Instead, just
build factor
graphs with
WFST factors
(i.e. factors of
arity 2)
227
Factor Graphs over Multiple Strings
P(x1, x2, x3, x4) = 1/Z ψ1(x1, x2) ψ2(x1, x3) ψ3(x1, x4) ψ4(x2, x3) ψ5(x3, x4)
[Figure: the same graph, where the variables are inflected forms of a verb (infinitive; 1st/2nd/3rd person; singular/plural; present/past)]
(Dreyer & Eisner, 2009)
Instead, just build factor graphs with WFST factors (i.e. factors of arity 2)
228
BP for Coordination of Algorithms
[Figure: a joint factor graph over the sentence pair “the white house” / “la casa blanca”: boolean span and alignment variables (T/F) coupled by factors ψ2 and ψ4, so BP coordinates the two analyses]
229
Section 5:
What if even BP is slow?
Computing fewer messages
Computing them faster
230
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
231
Loopy Belief Propagation Algorithm
1. For every directed edge, initialize its message to the uniform distribution.
2. Repeat until all normalized beliefs converge:
   a. Pick a directed edge u → v.
   b. Update its message: recompute u → v from its “parent” messages v’ → u for v’ ≠ v.
More efficient if u has high degree (e.g., CKYTree):
• Compute all outgoing messages u → … at once, based on all incoming messages … → u.
Loopy Belief Propagation Algorithm
1. For every directed edge, initialize its message to the uniform distribution.
2. Repeat until all normalized beliefs converge:
   a. Pick a directed edge u → v.
   b. Update its message: recompute u → v from its “parent” messages v’ → u for v’ ≠ v.
Which edge do we pick and recompute? A “stale” edge?
Message Passing in Belief Propagation
[Figure: a variable X and a factor Ψ exchange messages. The message into Ψ is (v 1, n 6, a 3): “My other factors think I’m a noun.” The message back to X is (v 6, n 1, a 3): “But my other variables and I think you’re a verb.”]
235
Stale Messages
We update this message
from its antecedents.
Now it’s “fresh.” Don’t
need to update it again.
…
X
Ψ
…
antecedents
…
…
236
Stale Messages
We update this message
from its antecedents.
Now it’s “fresh.” Don’t
need to update it again.
…
Ψ
X
…
antecedents
…
…
But it again becomes “stale”
– out of sync with its
antecedents – if they change.
Then we do need to revisit.
The edge is very stale if its antecedents have changed a lot since its last
update. Especially in a way that might make this edge change a lot. 237
Stale Messages
…
Ψ
…
…
…
For a high-degree node that likes to update
all its outgoing messages at once …
We say that the whole node is very stale if its
incoming messages have changed a lot. 238
Stale Messages
…
Ψ
…
…
…
For a high-degree node that likes to update
all its outgoing messages at once …
We say that the whole node is very stale if its
incoming messages have changed a lot. 239
Maintain a Queue of Stale Messages to Update
Initially all
messages are
uniform.
X8
Messages from variables
are actually fresh (in sync
with their uniform antecedents).
X7
X9
X6
X1
X2
X3
X4
time
flies
like
an
X5
arrow
Maintain a Queue of Stale Messages to Update
X8
X7
X9
X6
X1
X2
X3
X4
time
flies
like
an
X5
arrow
Maintain a Queue of Stale Messages to Update
X8
X7
X9
X6
X1
X2
X3
X4
time
flies
like
an
X5
arrow
A Bad Update Order!
X0
X1
X2
X3
X4
X5
time
flies
like
an
arrow
<START>
Acyclic Belief Propagation
In a graph with no cycles:
1. Send messages from the leaves to the root.
2. Send messages from the root to the leaves.
Each outgoing message is sent only after all its incoming messages have
been received.
X8
X7
X9
X6
X1
X2
X3
X4
X5
time
flies
like
an
arrow
244
Loopy Belief Propagation
In what order do we send messages for Loopy BP?
• Asynchronous
– Pick a directed edge:
update its message
– Or, pick a vertex: update all
its outgoing messages at
once
X8
X7
X9
X6
X1
X2
X3
X4
X5
time
flies
like
an
arrow
Wait for your parents
Don’t update a message if its
parents will get a big update.
Otherwise, will have to re-update.
• Size. Send big updates first.
• Forces other messages to
wait for them.
• Topology. Use graph structure.
• E.g., in an acyclic graph, a
message can wait for all
updates before sending.
246
Message Scheduling
The order in which messages are sent has a significant effect on convergence
• Synchronous (SBP)
– Compute all the messages
– Send all the messages
• Asynchronous (ABP)
– Pick an edge: compute and send
that message
• Tree-based Reparameterization
(TRP)
– Successively update embedded
spanning trees (Wainwright et al.,
2001)
– Choose spanning trees such that
each edge is included in at least one
• Residual BP (RBP)
– Pick the edge whose message
would change the most if sent:
compute and send that message
(Elidan et al., 2006)
Figure from (Elidan, McGraw, & Koller, 2006)
247
Message Scheduling
The order in which messages are sent has a significant effect on convergence
Convergence rates:
• Synchronous (SBP)
– Compute all the messages
– Send all the messages
• Asynchronous (ABP)
– Pick an edge: compute and send
that message
• Residual BP (RBP)
– Pick the edge whose message
would change the most if sent:
compute and send that message
(Elidan et al., 2006)
• Tree-based Reparameterization
(TRP)
– Successively update embedded
spanning trees (Wainwright et al.,
2001)
– Choose spanning trees such that
each edge is included in at least one
Figure from (Elidan, McGraw, & Koller, 2006)
248
Message Scheduling
Even better dynamic scheduling is possible
by learning the heuristics for selecting the
next message by reinforcement learning
(RLBP).
(Jiang, Moon, Daumé III, & Eisner, 2013)
249
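Here is a sketch of the residual-BP scheduling idea (Elidan et al., 2006) just described: a priority queue ordered by how much a recomputed message would change. The callbacks (init, recompute, dependents) are placeholders for whatever model you plug in; the norm and convergence test are my choices, not the tutorial’s.

import heapq, itertools

def residual(new, old):
    """How much a message changed (L-infinity norm over its entries)."""
    return max(abs(new[v] - old[v]) for v in new)

def residual_bp(edges, init, recompute, dependents, tol=1e-6, max_updates=100_000):
    """edges: directed edges; init(e): uniform starting message;
    recompute(e, msgs): BP update for e given current messages;
    dependents(e): edges whose update reads message e."""
    msgs = {e: init(e) for e in edges}
    tick = itertools.count()                       # tie-breaker for the heap
    heap = [(-residual(recompute(e, msgs), msgs[e]), next(tick), e) for e in edges]
    heapq.heapify(heap)
    for _ in range(max_updates):
        if not heap:
            break
        _, _, e = heapq.heappop(heap)
        new = recompute(e, msgs)                   # stored priority may be stale; recheck
        if residual(new, msgs[e]) < tol:
            continue                               # already (nearly) fresh; skip it
        msgs[e] = new                              # send the message that changes the most
        for d in dependents(e):                    # its dependents are now stale
            heapq.heappush(heap, (-residual(recompute(d, msgs), msgs[d]), next(tick), d))
    return msgs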
Computing Variable Beliefs
Suppose…
– Xi is a discrete variable
– Each incoming
messages is a
Multinomial
Incoming messages: (ring 1, rang 2, rung 2), (ring 0.1, rang 3, rung 1), (ring 4, rang 1, rung 0)
Pointwise product is easy when the variable’s domain is small and discrete:
  (ring 1·0.1·4 = 0.4,  rang 2·3·1 = 6,  rung 2·1·0 = 0)
250
Computing Variable Beliefs
Suppose…
– Xi is a real-valued
variable
– Each incoming
message is a Gaussian
The pointwise product
of n Gaussians is…
…a Gaussian!
X
251
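Behind this claim is the standard closed form for a product of Gaussian densities — precisions add and precision-weighted means add — which the slide does not spell out:

\prod_{j=1}^{n} \mathcal{N}\!\left(x;\,\mu_j,\sigma_j^2\right) \;\propto\; \mathcal{N}\!\left(x;\,\mu^*,\sigma^{*2}\right),
\qquad
\frac{1}{\sigma^{*2}} = \sum_{j=1}^{n}\frac{1}{\sigma_j^2},
\qquad
\frac{\mu^*}{\sigma^{*2}} = \sum_{j=1}^{n}\frac{\mu_j}{\sigma_j^2}.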
Computing Variable Beliefs
Suppose…
– Xi is a real-valued
variable
– Each incoming
messages is a mixture
of k Gaussians
X
The pointwise product
explodes!
p(x) = p1(x) p2(x) … pn(x)
     = (0.3 q1,1(x) + 0.7 q1,2(x)) · (0.5 q2,1(x) + 0.5 q2,2(x)) · …
252
Computing Variable Beliefs
Suppose…
– Xi is a string-valued
variable (i.e. its
domain is the set of
all strings)
– Each incoming
messages is a FSA
X
The pointwise product
explodes!
253
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
ε
a
a
a
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
254
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
a
ε
a
a
a
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
255
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
a
ε
a
a
a
a
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
256
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
a
a
a
a
ε
a
a
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
257
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
a
a
a
a
a
a
a
a
a
a
a
ε
a
a
a
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
258
Example: String-valued Variables
a
a
a
X1
ψ2
ψ1
a
a
X2
a
ε
a
a
a
a
a
a
• The domain of these variables is infinite (i.e. Σ*)
• A WFSA’s representation is finite – but the size of the representation can grow
• In cases where the domain of each variable is small and finite this is not an issue
• Messages can grow larger when sent
through a transducer factor
• Repeatedly sending messages through a
transducer can cause them to grow to
unbounded size!
(Dreyer & Eisner, 2009)
259
Message Approximations
Three approaches to dealing with complex
messages:
1. Particle Belief Propagation (see Section 3)
2. Message pruning
3. Expectation propagation
260
Message Pruning
• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.
For real variables, try a mixture of K Gaussians:
– E.g., true product is a mixture of K^d Gaussians
– Prune back: Randomly keep just K of them
– Chosen in proportion to weight in full mixture
– Gibbs sampling to efficiently choose them
X
– What if incoming messages are not Gaussian mixtures?
– Could be anything sent by the factors …
– Can extend technique to this case.
(Sudderth et al., 2002 –“Nonparametric BP”)
261
Message Pruning
• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.
For string variables, use a small finite set:
– Each message µi gives positive probability to …
– … every word in a 50,000 word vocabulary
– … every string in ∑* (using a weighted FSA)
– Prune back to a list L of a few “good” strings
– Each message adds its own K best strings to L
– For each x ∈ L, let µ(x) = ∏i µi(x) – each message scores x
– For each x ∉ L, let µ(x) = 0
(Dreyer & Eisner, 2009)
262
Expectation Propagation (EP)
• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.
EP provides four special advantages over pruning:
1. General recipe that can be used in many settings.
2. Efficient. Uses approximations that are very fast.
3. Conservative. Unlike pruning, never forces b(x) to 0.
• Never kills off a value x that had been possible.
4. Adaptive. Approximates µ(x) more carefully
if x is favored by the other messages.
• Tries to be accurate on the most “plausible” values.
(Minka, 2001)
263
Expectation Propagation (EP)
[Figure: a factor graph around X3 (neighbors include X1, X2, X4, X5, X7), with exponential-family approximations inside]
Belief at X3 will be simple! Messages to and from X3 will be simple!
Expectation Propagation (EP)
Key idea: Approximate variable X’s incoming messages µ.
We force them to have a simple parametric form:
µ(x) = exp (θ ∙ f(x)) “log-linear model” (unnormalized)
where f(x) extracts a feature vector from the value x.
For each variable X, we’ll choose a feature function f.
Maybe unnormalizable,
e.g., initial message θ=0
is uniform “distribution”
So by storing a few parameters θ, we’ve defined µ(x) for all x.
Now the messages are super-easy to multiply:
µ1(x) µ2(x) = exp (θ1 ∙ f(x)) exp (θ2 ∙ f(x)) = exp ((θ1+θ2) ∙ f(x))
Represent a message by its parameter vector θ.
To multiply messages, just add their θ vectors!
So beliefs and outgoing messages also have this simple form.
265
Expectation Propagation
• Form of messages/beliefs at X3?
  – Always µ(x) = exp (θ ∙ f(x))
• If x is real:
  – Gaussian: Take f(x) = (x, x²)
• If x is string:
  – Globally normalized trigram model: Take f(x) = (count of aaa, count of aab, … count of zzz)
• If x is discrete:
  – Arbitrary discrete distribution (can exactly represent original message, so we get ordinary BP)
  – Coarsened discrete distribution, based on features of x
• Can’t use mixture models, or other models that use latent variables to define µ(x) = ∑y p(x, y)
[Figure: as before, exponential-family approximations inside the neighborhood of X3]
Expectation Propagation
• Each message to X3 is µ(x) = exp (θ ∙ f(x)) for some θ. We only store θ.
• To take a product of such messages, just add their θ vectors:
  – Easily compute belief at X3 (sum of incoming θ vectors)
  – Then easily compute each outgoing message (belief minus one incoming θ)
• All very easy …
[Figure: exponential-family approximations inside the neighborhood of X3]
Expectation Propagation
• But what about messages from factors?
  – Like the message M4.
  – This is not exponential family! Uh-oh!
  – It’s just whatever the factor happens to send.
• This is where we need to approximate, by µ4.
[Figure: the factor message M4 into X3 is replaced by its approximation µ4]
Expectation Propagation
• blue = arbitrary distribution, green = simple distribution exp (θ ∙ f(x))
• The belief at X3 “should” be p(x) = µ1(x) ∙ µ2(x) ∙ µ3(x) ∙ M4(x)
• But we’ll be using b(x) = µ1(x) ∙ µ2(x) ∙ µ3(x) ∙ µ4(x)
• Choose the simple distribution b that minimizes KL(p || b). That is, choose b that assigns high probability to samples from p.
  – Find b’s params θ in closed form – or follow the gradient: E_{x~p}[f(x)] – E_{x~b}[f(x)]
• Then, work backward from belief b to message µ4:
  – Take the θ vector of b and subtract off the θ vectors of µ1, µ2, µ3.
  – This chooses µ4 to preserve the belief well.
[Figure: incoming messages µ1, µ2, µ3 and factor message M4 ≈ µ4 at X3]
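Written out, the update the last two slides describe is the usual EP step (standard notation, mine rather than the slides’; the moment-matching equivalence holds for the normalized belief):

p(x) \propto \mu_1(x)\,\mu_2(x)\,\mu_3(x)\,M_4(x),
\qquad
b(x) = \exp\!\big(\theta_b \cdot f(x)\big)
\;\;\text{with}\;\;
\theta_b = \arg\min_{\theta} \mathrm{KL}\!\big(p \,\|\, \exp(\theta\cdot f(\cdot))\big)
\;\Longleftrightarrow\;
\mathbb{E}_{x\sim b}[f(x)] = \mathbb{E}_{x\sim p}[f(x)],

\theta_{\mu_4} \;=\; \theta_b - \big(\theta_{\mu_1} + \theta_{\mu_2} + \theta_{\mu_3}\big).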
Example: Factored PCFGs
Expectation Propagation
• Task: Constituency
parsing, with factored
annotations
– Lexical annotations
– Parent annotations
– Latent annotations
• Approach:
– Sentence specific
approximation is an
anchored grammar:
q(A  B C, i, j, k)
– Sending messages is
equivalent to
marginalizing out the
annotations
(Hall & Klein, 2012)
270
Section 6:
Approximation-aware Training
271
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
272
Training
Thus far, we’ve seen how to compute (approximate) marginals, given a factor graph…
…but where do the potential tables ψα come from?
– Some have a fixed structure (e.g. Exactly1, CKYTree)
– Others could be trained ahead of time (e.g. TrigramHMM)
– For the rest, we define them parametrically and learn the parameters!
Two ways to learn:
1. Standard CRF Training (very simple; often yields state-of-the-art results)
2. ERMA (less simple; but takes approximations and loss function into account)
274
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
Observed
variables
Predicted
variables
275
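The equation image is not preserved in this text; the standard log-linear parameterization it refers to is (my rendering), with y the predicted variables and x the observed variables:

\psi_\alpha(y_\alpha, x) \;=\; \exp\!\Big(\sum_k \theta_k\, f_k(y_\alpha, x)\Big) \;=\; \exp\!\big(\theta\cdot f_\alpha(y_\alpha, x)\big),
\qquad
p_\theta(y\mid x) \;=\; \frac{1}{Z(x)} \prod_\alpha \psi_\alpha(y_\alpha, x).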
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
[Figure: a linear-chain factor graph over POS tags (n v p d n) for “time flies like an arrow”, with factors ψ1–ψ9]
276
Standard CRF Parameterization
Define each potential function in terms of a
fixed set of feature functions:
[Figure: the same chain extended with constituency variables (np, pp, vp, s) and factors ψ10–ψ13]
277
What is Training?
That’s easy:
Training = picking good model parameters!
But how do we know if the
model parameters are any “good”?
278
Standard CRF Training
Given labeled training examples:
Maximize Conditional Log-likelihood:
279
Standard CRF Training
Given labeled training examples:
Maximize Conditional Log-likelihood:
280
Standard CRF Training
Given labeled training examples:
Maximize Conditional Log-likelihood:
We can approximate the
factor marginals by the
factor beliefs from BP!
281
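The objective and its gradient (standard CRF training; the equation images are not preserved here, so this rendering is mine) — the second term is where the factor marginals enter, and it is exactly these that the slide says can be replaced by the factor beliefs b_α from BP:

\ell(\theta) = \sum_{i=1}^{N} \log p_\theta\!\big(y^{(i)}\mid x^{(i)}\big),
\qquad
\frac{\partial \ell}{\partial \theta_k}
= \sum_{i=1}^{N}\sum_{\alpha}\Big(
    f_k\big(y^{(i)}_\alpha, x^{(i)}\big)
    \;-\;\sum_{y_\alpha} p_\theta\big(y_\alpha \mid x^{(i)}\big)\, f_k\big(y_\alpha, x^{(i)}\big)
  \Big).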
Stochastic Gradient Descent
Input:
– Training data, {(x(i), y(i)) : 1 ≤ i ≤ N }
– Initial model parameters, θ
Output:
– Trained model parameters, θ.
Algorithm:
While not converged:
– Sample a training example (x(i), y(i))
– Compute the gradient of log(pθ(y(i) | x(i))) with
respect to our model parameters θ.
– Take a (small) step in the direction of the gradient.
(Stoyanov, Ropson, & Eisner, 2011)
282
What’s wrong with the usual approach?
• If you add too many features, your predictions might get worse!
– Log-linear models used to remove features to avoid this overfitting
– How do we fix it now? Regularization!
• If you add too many factors, your predictions might get worse!
– The model might be better, but we replace the true marginals with
approximate marginals (e.g. beliefs computed by BP)
– But approximate inference can cause gradients for structured
learning to go awry! (Kulesza & Pereira, 2008).
283
What’s wrong with the usual approach?
Mistakes made by Standard CRF Training:
1. Using BP (approximate)
2. Not taking loss function into account
3. Should be doing MBR decoding
Big pile of approximations…
…which has tunable parameters.
Treat it like a neural net, and run backprop!
284
Error Back-Propagation
Slide from (Stoyanov & Eisner, 2012)
285
Error Back-Propagation
[Figure: backpropagation through the model: output P(y3=noun|x), parameters ϴ, and an intermediate message computation μ(y1→y2) = μ(y3→y1) · μ(y4→y1)]
Slide from (Stoyanov & Eisner, 2012)
294
Error Back-Propagation
• Applying the chain rule of derivation over
and over.
• Forward pass:
– Regular computation (inference + decoding) in
the model (+ remember intermediate
quantities).
• Backward pass:
– Replay the forward pass in reverse computing
gradients.
295
Empirical Risk Minimization as a
Computational Expression Graph
Forward Pass:
– Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). (beliefs from model parameters)
– Decoder Module: takes marginals as input, and outputs a prediction. (an output from beliefs)
– Loss Module: takes the prediction as input, and outputs a loss for the training example. (loss from an output)
(Stoyanov, Ropson, & Eisner, 2011)
(Stoyanov & Eisner, 2012)
296
Empirical Risk Minimization as a
Computational Expression Graph
Forward Pass
Loss Module: takes the
prediction as input, and
outputs a loss for the
training example.
Decoder Module: takes
marginals as input, and
outputs a prediction.
Loopy BP Module: takes the
parameters as input, and
outputs marginals (beliefs).
(Stoyanov, Ropson, & Eisner, 2011)
(Stoyanov & Eisner, 2012)
297
Empirical Risk Minimization under
Approximations (ERMA)
Input:
–
–
–
–
Training data, {(x(i), y(i)) : 1 ≤ i ≤ N }
Initial model parameters, θ
Decision function (aka. decoder), fθ(x)
Loss function, L
Output:
– Trained model parameters, θ.
Algorithm:
While not converged:
– Sample a training example (x(i), y(i))
– Compute the gradient of L(fθ(x(i)), y(i)) with respect to our
model parameters θ.
– Take a (small) step in the direction of the gradient.
(Stoyanov, Ropson, & Eisner, 2011)
298
Empirical Risk Minimization under
Approximations
Input:
–
–
–
–
Training data, {(x(i), y(i)) : 1 ≤ i ≤ N }
Initial model parameters, θ
Decision function (aka. decoder), fθ(x)
Loss function, L
Output:
– Trained model parameters, θ.
This section is about how to (efficiently) compute this gradient, by treating inference, decoding, and the loss function as a differentiable black-box.
Algorithm:
While not converged:
– Sample a training example (x(i), y(i))
– Compute the gradient of L(fθ(x(i)), y(i)) with respect to our
model parameters θ.
– Take a (small) step in the direction of the gradient.
Figure from (Stoyanov & Eisner, 2012)
299
The Chain Rule
• Version 1:
• Version 2:
Key idea:
1. Represent inference, decoding, and the loss function
as a computational expression graph.
2. Then repeatedly apply the chain rule to compute the partial derivatives.
300
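The two formula images are not preserved in this text; presumably they are the familiar single-variable and multivariate forms of the chain rule, which in my notation read:

\text{Version 1:}\quad \frac{d}{d\theta}\, g\big(h(\theta)\big) \;=\; g'\big(h(\theta)\big)\, h'(\theta)
\qquad
\text{Version 2:}\quad \frac{\partial J}{\partial \theta_k} \;=\; \sum_{j} \frac{\partial J}{\partial u_j}\,\frac{\partial u_j}{\partial \theta_k},

where the u_j are the intermediate quantities in the computational expression graph.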
Module-based AD
• The underlying idea goes by various names:
  – Described by Bottou & Gallinari (1991), as “A Framework for the Cooperation of Learning Algorithms”
  – Automatic Differentiation in the reverse mode
  – Backpropagation is a special case for Neural Network training
• Define a set of modules, connected in a feed-forward topology (i.e. a computational expression graph)
• Each module must define the following:
  – Input variables
  – Output variables
  – Forward pass: function mapping input variables to output variables
  – Backward pass: function mapping the adjoint of the output variables to the adjoint of the input variables
• The forward pass computes the goal
• The backward pass computes the partial derivative of the goal with respect to each parameter in the computational expression graph
(Bottou & Gallinari, 1991)
301
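A minimal rendering of the module contract just listed (the class and example names are illustrative only):

class Module:
    def forward(self, *inputs):
        """Map input variables to output variables; cache intermediate quantities."""
        raise NotImplementedError
    def backward(self, output_adjoint):
        """Map the adjoint of the outputs (dLoss/dOutput) to the adjoints of the inputs."""
        raise NotImplementedError

class Square(Module):                 # toy module: y = x^2
    def forward(self, x):
        self.x = x                    # remember the intermediate quantity
        return x * x
    def backward(self, dy):
        return dy * 2 * self.x        # dLoss/dx = dLoss/dy * dy/dx

# The forward pass computes the goal; the backward pass replays it in reverse.
sq = Square()
y = sq.forward(3.0)                   # 9.0
dx = sq.backward(1.0)                 # 6.0
print(y, dx)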
Empirical Risk Minimization as a
Computational Expression Graph
Forward Pass → Backward Pass:
– beliefs from model parameters → d loss / d model params (by chain rule)
– an output from beliefs → d loss / d beliefs (by chain rule)
– loss from an output → d loss / d output
(Stoyanov, Ropson, & Eisner, 2011)
(Stoyanov & Eisner, 2012)
Loss Module: takes the
prediction as input, and
outputs a loss for the
training example.
Decoder Module: takes
marginals as input, and
outputs a prediction.
Loopy BP Module: takes the
parameters as input, and
outputs marginals (beliefs).
302
Empirical Risk Minimization as a
Computational Expression Graph
Forward Pass
Backward Pass
Loss Module: takes the
prediction as input, and
outputs a loss for the
training example.
Decoder Module: takes
marginals as input, and
outputs a prediction.
Loopy BP Module: takes the
parameters as input, and
outputs marginals (beliefs).
(Stoyanov, Ropson, & Eisner, 2011)
(Stoyanov & Eisner, 2012)
303
Loopy BP as a Computational Expression Graph
305
Loopy BP as a Computational Expression Graph
We obtain a feed-forward (acyclic) topology for the graph by “unrolling” the message passing algorithm.
This amounts to indexing each message with a timestamp.
306
Empirical Risk Minimization under
Approximations (ERMA)
[Table (2×2): training methods classified by Approximation Aware (no/yes) × Loss Aware (no/yes). MLE is neither; ERMA is both; SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], and Softmax-margin [Gimpel & Smith, 2010] fill the remaining cells.]
307
Application:
Example: Congressional Voting
• Task: predict representatives’
votes based on debates
• Novel training method:
– Empirical Risk Minimization
under Approximations
(ERMA)
– Loss-aware
– Approximation-aware
• Findings:
– On highly loopy graphs,
significantly improves over
(strong) loss-aware baseline
(Stoyanov & Eisner, 2012)
308
Section 7:
Software
310
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
311
Pacaya
Features:
– Structured Loopy BP over factor graphs with:
• Discrete variables
• Structured constraint factors
(e.g. projective dependency tree constraint factor)
– Coming Soon:
• ERMA training with backpropagation through structured
factors (Gormley, Dredze, & Eisner, In prep.)
Language: Java
Authors: Gormley, Mitchell, & Wolfe
URL: http://www.cs.jhu.edu/~mrg/software/
(Gormley, Mitchell, Van Durme, & Dredze, 2014)
(Gormley, Dredze, & Eisner, In prep.)
313
ERMA
Features:
ERMA performs inference and training on CRFs and
MRFs with arbitrary model structure over discrete
variables. The training regime, Empirical Risk
Minimization under Approximations is loss-aware and
approximation-aware. ERMA can optimize several loss
functions such as Accuracy, MSE and F-score.
Language: Java
Authors: Stoyanov, Ropson, & Eisner
URL: https://sites.google.com/site/ermasoftware/
(Stoyanov, Ropson, & Eisner, 2011)
(Stoyanov & Eisner, 2012)
314
Graphical Models Libraries
•
Factorie (McCallum, Shultz, & Singh, 2012) is a Scala library allowing modular specification of
inference, learning, and optimization methods. Inference algorithms include belief propagation
and MCMC. Learning settings include maximum likelihood learning, maximum margin learning,
learning with approximate inference, SampleRank, pseudo-likelihood.
http://factorie.cs.umass.edu/
•
LibDAI (Mooij, 2010) is a C++ library that supports inference, but not learning, via Loopy BP,
Fractional BP, Tree-Reweighted BP, (Double-loop) Generalized BP, variants of Loop Corrected
Belief Propagation, Conditioned Belief Propagation, and Tree Expectation Propagation.
http://www.libdai.org
•
OpenGM2 (Andres, Beier, & Kappes, 2012) provides a C++ template library for discrete factor
graphs with support for learning and inference (including tie-ins to all LibDAI inference
algorithms).
http://hci.iwr.uni-heidelberg.de/opengm2/
•
FastInf (Jaimovich, Meshi, Mcgraw, Elidan) is an efficient Approximate Inference Library in C++.
http://compbio.cs.huji.ac.il/FastInf/fastInf/FastInf_Homepage.html
•
Infer.NET (Minka et al., 2012) is a .NET language framework for graphical models with support
for Expectation Propagation and Variational Message Passing.
http://research.microsoft.com/en-us/um/cambridge/projects/infernet
315
References
316
•
•
•
•
•
•
•
•
•
•
M. Auli and A. Lopez, “A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG
Supertagging and Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 470–480.
M. Auli and A. Lopez, “Training a Log-Linear Parser with Loss Functions via Softmax-Margin,” in Proceedings of
the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK., 2011, pp.
333–343.
Y. Bengio, “Training a neural network with a financial criterion rather than a prediction criterion,” in Decision
Technologies for Financial Engineering: Proceedings of the Fourth International Conference on Neural
Networks in the Capital Markets (NNCM’96), World Scientific Publishing, 1997, pp. 36–48.
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice-Hall, Inc.,
1989.
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Athena Scientific,
1997.
L. Bottou and P. Gallinari, “A Framework for the Cooperation of Learning Algorithms,” in Advances in Neural
Information Processing Systems, vol. 3, D. Touretzky and R. Lippmann, Eds. Denver: Morgan Kaufmann, 1991.
R. Bunescu and R. J. Mooney, “Collective information extraction with relational Markov networks,” 2004, p.
438–es.
C. Burfoot, S. Bird, and T. Baldwin, “Collective Classification of Congressional Floor-Debate Transcripts,”
presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Techologies, 2011, pp. 1506–1515.
D. Burkett and D. Klein, “Fast Inference in Phrase Extraction Models with Belief Propagation,” presented at the
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, 2012, pp. 29–38.
T. Cohn and P. Blunsom, “Semantic Role Labelling with Tree Conditional Random Fields,” presented at the
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005, pp.
169–172.
317
•
•
•
•
•
•
•
•
•
•
F. Cromierès and S. Kurohashi, “An Alignment Algorithm Using Belief Propagation and a Structure-Based
Distortion Model,” in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009),
Athens, Greece, 2009, pp. 166–174.
M. Dreyer, “A non-parametric model for the discovery of inflectional paradigms from plain text using graphical
models over strings,” Johns Hopkins University, Baltimore, MD, USA, 2011.
M. Dreyer and J. Eisner, “Graphical Models over Multiple Strings,” presented at the Proceedings of the 2009
Conference on Empirical Methods in Natural Language Processing, 2009, pp. 101–110.
M. Dreyer and J. Eisner, “Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process
Mixture Model,” presented at the Proceedings of the 2011 Conference on Empirical Methods in Natural
Language Processing, 2011, pp. 616–627.
J. Duchi, D. Tarlow, G. Elidan, and D. Koller, “Using Combinatorial Optimization within Max-Product Belief
Propagation,” Advances in neural information processing systems, 2006.
G. Durrett, D. Hall, and D. Klein, “Decentralized Entity-Level Modeling for Coreference Resolution,” presented
at the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), 2013, pp. 114–124.
G. Elidan, I. McGraw, and D. Koller, “Residual belief propagation: Informed scheduling for asynchronous
message passing,” in Proceedings of the Twenty-second Conference on Uncertainty in AI (UAI), 2006.
K. Gimpel and N. A. Smith, “Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions,” in Human
Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for
Computational Linguistics, Los Angeles, California, 2010, pp. 733–736.
J. Gonzalez, Y. Low, and C. Guestrin, “Residual splash for optimally parallelizing belief propagation,” in
International Conference on Artificial Intelligence and Statistics, 2009, pp. 177–184.
D. Hall and D. Klein, “Training Factored PCFGs with Expectation Propagation,” presented at the Proceedings of
the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, 2012, pp. 1146–1156.
318
T. Heskes, “Stable fixed points of loopy belief propagation are minima of the Bethe free energy,” Advances in
neural information processing systems, vol. 15, pp. 359–366, 2003.
A. T. Ihler, J. W. Fisher III, A. S. Willsky, and D. M. Chickering, “Loopy belief propagation: convergence and
effects of message errors,” Journal of Machine Learning Research, vol. 6, no. 5, 2005.
A. T. Ihler and D. A. McAllester, “Particle belief propagation,” in International Conference on Artificial
Intelligence and Statistics, 2009, pp. 256–263.
J. Jancsary, J. Matiasek, and H. Trost, “Revealing the Structure of Medical Dictations with Conditional Random
Fields,” presented at the Proceedings of the 2008 Conference on Empirical Methods in Natural Language
Processing, 2008, pp. 1–10.
J. Jiang, T. Moon, H. Daumé III, and J. Eisner, “Prioritized Asynchronous Belief Propagation,” in ICML
Workshop on Inferning, 2013.
A. Kazantseva and S. Szpakowicz, “Linear Text Segmentation Using Affinity Propagation,” presented at the
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 284–293.
T. Koo and M. Collins, “Hidden-Variable Models for Discriminative Reranking,” presented at the Proceedings of
Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing, 2005, pp. 507–514.
A. Kulesza and F. Pereira, “Structured Learning with Approximate Inference,” in NIPS, 2007, vol. 20, pp. 785–
792.
J. Lee, J. Naradowsky, and D. A. Smith, “A Discriminative Model for Joint Morphological Disambiguation and
Dependency Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 885–894.
S. Lee, “Structured Discriminative Model For Dialog State Tracking,” presented at the Proceedings of the
SIGDIAL 2013 Conference, 2013, pp. 442–451.
319
X. Liu, M. Zhou, X. Zhou, Z. Fu, and F. Wei, “Joint Inference of Named Entity Recognition and Normalization for
Tweets,” presented at the Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 2012, pp. 526–535.
D. J. C. MacKay, J. S. Yedidia, W. T. Freeman, and Y. Weiss, “A Conversation about the Bethe Free Energy and
Sum-Product,” MERL, TR2001-18, 2001.
A. Martins, N. Smith, E. Xing, P. Aguiar, and M. Figueiredo, “Turbo Parsers: Dependency Parsing by
Approximate Variational Inference,” presented at the Proceedings of the 2010 Conference on Empirical
Methods in Natural Language Processing, 2010, pp. 34–44.
D. McAllester, M. Collins, and F. Pereira, “Case-Factor Diagrams for Structured Probabilistic Modeling,” in
Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI’04), 2004.
T. Minka, “Divergence measures and message passing,” Technical report, Microsoft Research, 2005.
T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Uncertainty in Artificial
Intelligence, 2001, vol. 17, pp. 362–369.
M. Mitchell, J. Aguilar, T. Wilson, and B. Van Durme, “Open Domain Targeted Sentiment,” presented at the
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1643–
1654.
K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical
study,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999, pp. 467–475.
T. Nakagawa, K. Inui, and S. Kurohashi, “Dependency Tree-based Sentiment Classification using CRFs with
Hidden Variables,” presented at the Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, 2010, pp. 786–794.
J. Naradowsky, S. Riedel, and D. Smith, “Improving NLP through Marginalization of Hidden Syntactic
Structure,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, 2012, pp. 810–820.
J. Naradowsky, T. Vieira, and D. A. Smith, “Grammarless Parsing for Joint Inference,” in Proceedings of COLING 2012, Mumbai, India, 2012.
J. Niehues and S. Vogel, “Discriminative Word Alignment via Alignment Matrix Modeling,” presented at the
Proceedings of the Third Workshop on Statistical Machine Translation, 2008, pp. 18–25.
320
J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann,
1988.
X. Pitkow, Y. Ahmadian, and K. D. Miller, “Learning unbelievable probabilities,” in Advances in Neural
Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q.
Weinberger, Eds. Curran Associates, Inc., 2011, pp. 738–746.
V. Qazvinian and D. R. Radev, “Identifying Non-Explicit Citing Sentences for Citation-Based Summarization,”
presented at the Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics,
2010, pp. 555–564.
H. Ren, W. Xu, Y. Zhang, and Y. Yan, “Dialog State Tracking using Conditional Random Fields,” presented at the
Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 457–461.
D. Roth and W. Yih, “Probabilistic Reasoning for Entity & Relation Recognition,” presented at the COLING
2002: The 19th International Conference on Computational Linguistics, 2002.
A. Rudnick, C. Liu, and M. Gasser, “HLTDI: CL-WSD Using Markov Random Fields for SemEval-2013 Task 10,”
presented at the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2:
Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 171–177.
T. Sato, “Inside-Outside Probability Computation for Belief Propagation,” in IJCAI, 2007, pp. 2605–2610.
D. A. Smith and J. Eisner, “Dependency Parsing by Belief Propagation,” in Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP), Honolulu, 2008, pp. 145–156.
V. Stoyanov and J. Eisner, “Fast and Accurate Prediction via Evidence-Specific MRF Structure,” in ICML
Workshop on Inferning: Interactions between Inference and Learning, Edinburgh, 2012.
V. Stoyanov and J. Eisner, “Minimum-Risk Training of Approximate CRF-Based NLP Systems,” in Proceedings of
NAACL-HLT, 2012, pp. 120–130.
321
V. Stoyanov, A. Ropson, and J. Eisner, “Empirical Risk Minimization of Graphical Model Parameters Given
Approximate Inference, Decoding, and Model Structure,” in Proceedings of the 14th International Conference
on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, 2011, vol. 15, pp. 725–733.
E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” MIT, Technical
Report 2551, 2002.
E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” in
Proceedings of CVPR, 2003.
E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,”
Communications of the ACM, vol. 53, no. 10, pp. 95–103, 2010.
C. Sutton and A. McCallum, “Collective Segmentation and Labeling of Distant Entities in Information
Extraction,” in ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, 2004.
C. Sutton and A. McCallum, “Piecewise Training of Undirected Models,” in Conference on Uncertainty in
Artificial Intelligence (UAI), 2005.
C. Sutton and A. McCallum, “Improved dynamic schedules for belief propagation,” in Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
M. J. Wainwright, T. Jaakkola, and A. S. Willsky, “Tree-based reparameterization for approximate inference on
loopy graphs,” in NIPS, 2001, pp. 1001–1008.
Z. Wang, S. Li, F. Kong, and G. Zhou, “Collective Personal Profile Summarization with Social Networks,”
presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
2013, pp. 715–725.
Y. Watanabe, M. Asahara, and Y. Matsumoto, “A Graph-Based Approach to Named Entity Categorization in
Wikipedia Using Conditional Random Fields,” presented at the Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 649–657.
322
Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm
in arbitrary graphs,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 736–744, 2001.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Bethe free energy, Kikuchi approximations, and belief propagation
algorithms,” MERL, TR2001-16, 2001.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief
propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,” in NIPS, 2000, vol. 13, pp. 689–
695.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,”
Exploring artificial intelligence in the new millennium, vol. 8, pp. 236–239, 2003.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing Free Energy Approximations and Generalized Belief
Propagation Algorithms,” MERL, TR-2004-040, 2004.
323