Approximation-aware Dependency Parsing by Belief Propagation
Matt Gormley
Mark Dredze
Jason Eisner
September 19, 2015
TACL at EMNLP
Motivation #1: Approximation-unaware Learning
Problem: Approximate inference causes standard learning algorithms to go awry (Kulesza & Pereira, 2008)
[Figure: behavior with exact inference vs. with approximate inference]
Can we take our approximations into account?
Motivation #2: Hybrid Models
Graphical models let you encode domain knowledge. Neural nets are really good at fitting the data discriminatively to make good predictions.
Could we define a neural net that incorporates domain knowledge?
Our Solution
Key idea: Treat your unrolled approximate inference algorithm as a deep network.
[Figure: an unrolled inference network with a chart parser embedded inside one of its layers]
Talk Summary
Loopy BP + Dynamic Prog. = Structured BP
Loopy BP + Backprop. = ERMA / Back-BP
Loopy BP + Dynamic Prog. + Backprop. = This Talk
Loopy BP + Dynamic Prog. + Backprop. = This Talk
Graphical Models + Hypergraphs + Neural Networks = The models that interest me
• If you're thinking, "This sounds like a great direction!"
• Then you're in good company
• And have been since before 1995
Loopy BP + Dynamic Prog. + Backprop. = This Talk
Graphical Models + Hypergraphs + Neural Networks = The models that interest me
• So what's new since 1995?
• Two new emphases:
  1. Learning under approximate inference
  2. Structural constraints
An Abstraction for Modeling
Factor Graph (bipartite graph):
• variables (circles)
• factors (squares)
[Figure: mathematical modeling with a small factor graph — variables y1 and y2, a unary factor ψ2, and a pairwise factor ψ12, each annotated with a table of factor values]
(A concrete sketch follows below.)
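To make the abstraction concrete, here is a minimal Python sketch (not the Pacaya API; the variable names echo the figure but the factor values are made up) of a two-variable factor graph, with its marginals computed by brute-force enumeration.

```python
import itertools

# Two binary variables y1, y2; a unary factor psi2 on y2 and a pairwise
# factor psi12 on (y1, y2). All factor values below are made up.
psi2 = {False: 4.0, True: 2.0}
psi12 = {(False, False): 1.0, (False, True): 3.0,
         (True, False): 5.0, (True, True): 0.5}

def score(y1, y2):
    # Unnormalized score of one joint assignment: product of all factor values.
    return psi2[y2] * psi12[(y1, y2)]

# Exact marginals by brute force (only feasible for tiny graphs; belief
# propagation is what makes this computation scale to real factor graphs).
assignments = list(itertools.product([False, True], repeat=2))
Z = sum(score(y1, y2) for y1, y2 in assignments)
p_y1_true = sum(score(True, y2) for y2 in (False, True)) / Z
print(f"Z = {Z:.2f}, p(y1 = True) = {p_y1_true:.3f}")
```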
Factor Graph for Dependency Parsing
[Figure: a factor graph with one binary variable Y_{h,m} for each possible edge from head h to modifier m over a 4-word sentence (h = 0 is the wall): Y0,1 through Y4,3]
Factor Graph for Dependency Parsing
[Figure: the same variables laid out over the sentence "0 <WALL>  1 Juan_Carlos  2 abdica  3 su  4 reino", with a left-arc and a right-arc variable between each pair of words]
Factor Graph for Dependency Parsing
[Figure: an assignment to the edge variables; the checked (✔) variables are the edges of the correct parse of "<WALL> Juan_Carlos abdica su reino"]
Factor Graph for Dependency Parsing
Unary: local opinion about one edge
[Figure: the same assignment, with a unary factor attached to each edge variable]
Factor Graph for Dependency Parsing
Unary: local opinion about one edge
PTree: hard constraint, multiplying in 1 if the variables form a tree and 0 otherwise
[Figure: the PTree factor connected to all of the edge variables]
Factor Graph for Dependency Parsing
Unary: local opinion about one edge
PTree: hard constraint, multiplying in 1 if the variables form a tree and 0 otherwise
Grandparent: local opinion about a grandparent, head, and modifier
[Figure: a grandparent factor connecting two of the edge variables]
Factor Graph for Dependency Parsing
Unary: local opinion about one edge
PTree: hard constraint, multiplying in 1 if the variables form a tree and 0 otherwise
Grandparent: local opinion about a grandparent, head, and modifier
Sibling: local opinion about a pair of arbitrary siblings
(A sketch of how the Unary and PTree factors combine follows below.)
[Figure: grandparent and sibling factors added to the factor graph over the example sentence]
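To make the Unary and PTree semantics concrete, here is a hedged Python sketch over the example sentence. The potentials are made up (not the paper's features), and for brevity the tree check covers only single-headedness and reachability from the wall; the paper's PTree factor additionally requires projectivity.

```python
from math import prod

words = ["<WALL>", "Juan_Carlos", "abdica", "su", "reino"]
n = len(words) - 1  # real words are 1..n; 0 is the wall

# Hypothetical unary potentials psi_{h,m}(True) for "edge on"; the value for
# "edge off" is fixed at 1.0, so the table only lists the "on" entries.
psi_on = {(h, m): 1.0 for h in range(n + 1) for m in range(1, n + 1) if h != m}
psi_on[(0, 2)] = 9.0   # <WALL> -> abdica
psi_on[(2, 1)] = 6.0   # abdica -> Juan_Carlos
psi_on[(2, 4)] = 5.0   # abdica -> reino
psi_on[(4, 3)] = 4.0   # reino  -> su

def ptree(edges):
    """1 if `edges` (set of (head, modifier) pairs) is a tree over words 1..n rooted at 0."""
    heads = {}
    for h, m in edges:
        if m in heads:                      # a word with two heads is not a tree
            return 0
        heads[m] = h
    if set(heads) != set(range(1, n + 1)):  # every word needs exactly one head
        return 0
    for m in range(1, n + 1):               # every word must reach the wall (no cycles)
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return 0
            seen.add(cur)
            cur = heads[cur]
    return 1

def score(edges):
    """Unnormalized score: product of the unary factors times the PTree factor."""
    return ptree(edges) * prod(psi_on[e] for e in edges)

gold = {(0, 2), (2, 1), (2, 4), (4, 3)}
print(score(gold))                                   # 9 * 6 * 5 * 4 = 1080
print(score({(0, 2), (2, 1), (1, 4), (4, 1)}))       # not a tree -> 0
```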
Factor Graph for Dependency Parsing
[Figure: the full factor graph over variables Y0,1 through Y4,3]
Now we can work at this level of abstraction.
Why dependency parsing?
1. Simplest example for Structured BP
2. Exhibits both polytime and NP-hard problems
The Impact of Approximations
[Figure: Linguistics, Model (pθ(·) = 0.50), Inference, and Learning; inference is usually called as a subroutine in learning]
Machine Learning
Conditional Log-likelihood Training
1. Choose model (such that the derivative in #3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Replace exact inference by approximate inference
Machine Learning
Conditional Log-likelihood Training
1. Choose model (#3 comes from log-linear factors)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule (see the toy sketch below)
4. Replace exact inference by approximate inference
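A toy illustration of steps 1–3 (a generic log-linear model over three outputs, not the parsing model; the features and gold label are made up): the gradient of the conditional log-likelihood is observed features minus expected features, and computing that expectation is exactly the marginal inference that step 4 proposes to approximate.

```python
import jax
import jax.numpy as jnp

def log_prob(theta, feats, gold):
    # feats: (num_outputs, num_features); p(y) ∝ exp(theta · f(y))
    scores = feats @ theta
    return scores[gold] - jax.scipy.special.logsumexp(scores)

feats = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = jnp.zeros(2)
gold = 2

grad = jax.grad(log_prob)(theta, feats, gold)
# The same gradient written as "observed minus expected" features:
p = jax.nn.softmax(feats @ theta)
print(grad, feats[gold] - p @ feats)
```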
Machine Learning
What's wrong with CLL?
How did we compute these approximate marginal probabilities anyway?
By Structured Belief Propagation, of course!
Everything you need to know about: Structured BP
1. It's a message passing algorithm
2. The message computations are just multiplication, addition, and division (see the sketch below)
3. Those computations are differentiable
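A minimal sketch of the two message types for binary variables and tabular factors (a generic sum-product update, not the paper's implementation; the factor table and messages below are made up):

```python
import jax.numpy as jnp

def variable_to_factor(incoming, exclude):
    """Multiply together all messages into a variable except the one from `exclude`."""
    m = jnp.ones(2)
    for factor_name, msg in incoming.items():
        if factor_name != exclude:
            m = m * msg
    return m / m.sum()          # the division: renormalize for numerical stability

def factor_to_variable(table, other_msg):
    """Sum out the other variable: m(y) = sum_y' table[y', y] * other_msg[y']."""
    m = other_msg @ table
    return m / m.sum()

table = jnp.array([[2.0, 9.0],
                   [4.0, 2.0]])
print(factor_to_variable(table, jnp.array([0.7, 0.3])))

msgs_into_y = {"psi_a": jnp.array([0.6, 0.4]), "psi_b": jnp.array([0.2, 0.8])}
print(variable_to_factor(msgs_into_y, exclude="psi_b"))
```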
Inference
Structured Belief Propagation
This is just another factor graph, so we can run Loopy BP.
[Figure: the dependency-parsing factor graph for "<WALL> Juan_Carlos abdica su reino"]
What goes wrong?
• Naïve computation is inefficient
• We can embed the inside-outside algorithm within the structured factor
Algorithmic Differentiation
• Backprop works on more than just neural networks
• You can apply the chain rule to any arbitrary differentiable algorithm
That's the key (old) idea behind this talk.
• Alternatively, we could estimate a gradient by finite-difference approximations – but algorithmic differentiation is much more efficient! (see the sketch below)
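A small illustration of that contrast (the iterative algorithm and its parameter are made up; JAX stands in here for any reverse-mode autodiff tool): differentiate an unrolled fixed-point iteration with the chain rule, then check the result against a central finite-difference estimate.

```python
import jax
import jax.numpy as jnp

def iterate(theta, T=10):
    x = 0.5
    for _ in range(T):              # an "unrolled" iterative algorithm
        x = jax.nn.sigmoid(theta * x)
    return x

theta = 1.3
exact = jax.grad(iterate)(theta)    # chain rule applied through every iteration
eps = 1e-4
approx = (iterate(theta + eps) - iterate(theta - eps)) / (2 * eps)
print(exact, approx)                # nearly identical values; autodiff needs one
                                    # backward pass instead of extra evaluations
                                    # per parameter
```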
Feed-forward Topology of Inference, Decoding and Loss
• Unary factor: vector with 2 entries
• Binary factor: (flattened) matrix with 4 entries
[Figure: model parameters feed into the factors at the bottom of the unrolled network; see the sketch below]
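A sketch of those bottom layers (the parameters and feature vectors are made up; in the real parser they come from the model's feature templates): the parameters feed through the exponential of a linear score to produce each factor's table.

```python
import jax.numpy as jnp

theta = jnp.array([0.5, -1.0, 2.0])            # model parameters (made up)

f_unary = jnp.array([[1.0, 0.0, 0.0],          # one feature vector per
                     [0.0, 1.0, 0.0]])         # assignment of the variable
unary_factor = jnp.exp(f_unary @ theta)        # a vector with 2 entries

f_binary = jnp.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
binary_factor = jnp.exp(f_binary @ theta).reshape(2, 2)   # 4 entries, unflattened
print(unary_factor, binary_factor)
```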
Feed-forward Topology of Inference, Decoding and Loss
• Messages from neighbors are used to compute the next message
• Leads to sparsity in layer-wise connections
[Figure: layers of the unrolled network, bottom to top: model parameters, factors, messages at time t=0, messages at time t=1]
Feed-forward Topology of Inference, Decoding and Loss
• Arrows in a neural net: a linear combination, then a sigmoid
• Arrows in this diagram: a different semantics, given by the algorithm
[Figure: layers of the unrolled network, bottom to top: model parameters, factors, messages at time t=0, messages at time t=1]
Feed-forward Topology
[Figure: the full unrolled network, bottom to top: model parameters, factors, messages at times t=0 through t=3, beliefs, then decode / loss]
Feed-forward Topology
• Arrows in this diagram: a different semantics, given by the algorithm
• Messages from the PTree factor rely on a variant of inside-outside
[Figure: the full unrolled network, bottom to top: model parameters, factors, messages at times t=0 through t=3, beliefs, then decode / loss]
Feed-forward Topology
• Messages from the PTree factor rely on a variant of inside-outside
[Figure: a chart parser embedded inside one layer of the unrolled network]
Machine Learning
Approximation-aware Learning
1. Choose the model to be the computation with all its approximations
2. Choose the objective to likewise include the approximations
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network)
4. Make no approximations! (Our gradient is exact.)
Key idea: Open up the black box! (see the sketch below)
[Figure: the unrolled network with a chart parser embedded inside]
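An end-to-end sketch of this recipe on a deliberately tiny model (a 3-variable cycle with made-up potentials and plain synchronous sum-product, not the paper's structured BP or its features): unroll T iterations of loopy BP into a feed-forward computation, place an L2-style loss on the resulting beliefs, and let reverse-mode autodiff (JAX here) backpropagate through every iteration, yielding the exact gradient of the approximate computation.

```python
import jax
import jax.numpy as jnp

EDGES = [(0, 1), (1, 2), (2, 0)]                    # a 3-variable cycle (loopy)
DIRECTED = EDGES + [(j, i) for (i, j) in EDGES]     # 6 directed messages

def incoming(msgs, var, exclude):
    # Product of messages into `var` from all neighbors except `exclude`.
    prod = jnp.ones(2)
    for d, (s, r) in enumerate(DIRECTED):
        if r == var and s != exclude:
            prod = prod * msgs[d]
    return prod

def unrolled_bp_beliefs(unary, pairwise, T=5):
    """T synchronous sum-product iterations; returns approximate marginals (3, 2)."""
    msgs = jnp.ones((6, 2)) / 2.0
    for _ in range(T):                              # the unrolled "layers"
        new = []
        for s, r in DIRECTED:
            e = EDGES.index((s, r)) if (s, r) in EDGES else EDGES.index((r, s))
            pot = jnp.exp(pairwise[e])
            if (s, r) not in EDGES:                 # orient the table as [y_s, y_r]
                pot = pot.T
            m = (jnp.exp(unary[s]) * incoming(msgs, s, exclude=r)) @ pot
            new.append(m / m.sum())
        msgs = jnp.stack(new)
    b = jnp.exp(unary) * jnp.stack([incoming(msgs, v, exclude=-1) for v in range(3)])
    return b / b.sum(axis=1, keepdims=True)

def l2_loss(unary, pairwise, target):
    # L2 distance between approximate beliefs and target marginals,
    # in the spirit of the paper's L2 objective on beliefs.
    return jnp.sum((unrolled_bp_beliefs(unary, pairwise) - target) ** 2)

unary = jnp.zeros((3, 2))                           # made-up log-potentials
pairwise = jnp.zeros((3, 2, 2))
target = jnp.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# Backprop through all T iterations of BP: the exact gradient of the approximation.
g_unary, g_pairwise = jax.grad(l2_loss, argnums=(0, 1))(unary, pairwise, target)
print(g_unary)
```

Swapping in the dependency-parsing factor graph with a PTree factor additionally requires backpropagating through the embedded inside-outside computation, which is the technical core of the paper.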
Experimental Setup
Goal: Compare two training approaches
1. Standard approach (CLL)
2. New approach (Backprop)
Data: English PTB
– Converted to dependencies using Yamada & Matsumoto (2003) head rules
– Standard train (02-21), dev (22), test (23) split
– TurboTagger predicted POS tags
Metric: Unlabeled Attachment Score (higher is better; see the sketch below)
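For reference, the metric itself is simple to compute (the head indices below are made up):

```python
def uas(predicted_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of words whose predicted head is correct."""
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)

# One head index per word (0 = the wall); 3 of the 4 heads match the gold tree.
print(uas([2, 0, 2, 2], [2, 0, 4, 2]))   # 0.75
```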
Results: Speed-Accuracy Tradeoff
New training approach yields models which are:
1. Faster for a given level of accuracy
2. More accurate for a given level of speed
[Figure: Unlabeled Attachment Score (UAS; higher is more accurate) vs. number of BP iterations (fewer is faster) for CLL and Backprop training on dependency parsing]
Results: Increasingly Cyclic Models
• As we add more factors to the model, our model becomes loopier
• Yet our training by Backprop consistently improves as models get richer
[Figure: Unlabeled Attachment Score (UAS) vs. increasingly rich (and cyclic) dependency parsing models, for CLL and Backprop training]
See our TACL paper for…
1) Results on 19 languages from CoNLL 2006 / 2007
2) Results with alternate training objectives
3) Empirical comparison of exact and approximate inference
[Figure 3 from the paper: Speed/accuracy tradeoff of English PTB-YM UAS vs. the total number of BP iterations t_max for standard conditional likelihood training (CLL) and our approximation-aware training with either an L2 objective (L2) or a staged training of L2 followed by annealed risk (L2+AR). The x-axis shows the number of iterations used for both training and testing; the model is 2nd-order with Grand.+Sib. factors.]
[Figure 4 from the paper: English PTB-YM UAS vs. the types of 2nd-order factors included in the model, for approximation-aware training and standard conditional likelihood training. All models include 1st-order factors (Unary); the 2nd-order models include grandparents (Grand.), arbitrary siblings (Sib.), or both (Grand.+Sib.), and use 4 iterations of BP.]
Machine Learning
Comparison of Two Approaches
1. CLL with approximate inference
– A totally ridiculous thing to do!
– But it's been done for years because it often works well
– (Also named "surrogate likelihood" training by Wainwright (2006))
Machine Learning
Comparison of Two Approaches
Key idea: Open up the black box!
2. Approximation-aware Learning for NLP
– In hindsight, treating the approximations as part of the model is the obvious thing to do (Domke, 2010; Domke, 2011; Stoyanov et al., 2011; Ross et al., 2011; Stoyanov & Eisner, 2012; Hershey et al., 2014)
– Our contribution: approximation-aware learning with structured factors
– But there are some challenges to get it right (numerical stability, efficiency, backprop through structured factors, annealing a decoder's argmin)
– Sum-Product Networks are similar in spirit (Poon & Domingos, 2011; Gens & Domingos, 2012)
Takeaways
• New learning approach for Structured BP maintains high accuracy with fewer iterations of BP, even with cycles
• Need a neural network? Treat your unrolled approximate inference algorithm as a deep network
Questions?
Pacaya – an open source framework for hybrid graphical models, hypergraphs, and neural networks
Features:
– Structured BP
– Coming Soon: Approximation-aware training
Language: Java
Download: https://github.com/mgormley/pacaya