Query-Specific Learning and Inference
for Probabilistic Graphical Models
Anton Chechetka
Thesis committee: Carlos Guestrin
Eric Xing
J. Andrew Bagnell
Pedro Domingos (University of Washington)
14 June 2011
Carnegie Mellon
Motivation
Fundamental problem: reason accurately about noisy, high-dimensional data with local interactions
2
Sensor networks
• noisy: sensors fail, readings are noisy
• high-dimensional: many sensors, several measurements (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations
3
Hypertext classification
• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics
4
Image segmentation
• noisy: local information is not enough (camera sensor noise, compression artifacts)
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky
5
Probabilistic graphical models
Noisy, high-dimensional data with local interactions
⇒ a graph that encodes only direct interactions over many variables
Probabilistic inference:
  P(Q | E) = P(Q, E) / P(E)
  Q: query, E: evidence
6
Graphical models semantics
Factorized distribution:
  P(X) = (1/Z) ∏_{f ∈ F} ψ_f(X_f)
Graph structure:
  [figure: graph over X1, …, X7; e.g. X_f = {X3, X4, X5}]
X_f are small subsets of X ⇒ compact representation
7
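To make the factorization concrete, here is a minimal illustrative sketch (my own toy example, not part of the thesis): it evaluates P(X) = (1/Z) ∏_{f∈F} ψ_f(X_f) for a made-up set of binary factors, computing Z by brute-force enumeration, which is exactly the exponential cost that graph structure is meant to avoid.

```python
import itertools

# Toy factorized model over binary variables X1, X2, X3 (hypothetical example).
# Each factor is (scope, table): table maps an assignment of the scope to a potential.
factors = [
    (("X1", "X2"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    (("X2", "X3"), {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.5}),
]
variables = ["X1", "X2", "X3"]

def unnormalized(assignment):
    """Product of factor potentials psi_f(X_f) for a full assignment."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assignment[v] for v in scope)]
    return p

# Partition function Z by brute-force enumeration: exponential in |X| in general.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def prob(assignment):
    return unnormalized(assignment) / Z

print(prob({"X1": 1, "X2": 1, "X3": 0}))
```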
Graphical models workflow
Factorized distribution:
  P(X) = (1/Z) ∏_{f ∈ F} ψ_f(X_f)
Graph structure: [figure: graph over X1, …, X7]
Workflow: learn/construct structure → learn/define parameters → inference P(Q | E=E)
8
Graph. models fundamental problems
Learn/construct structure: NP-complete
Learn/define parameters: exp(|X|)
Inference: #P-complete (exact), NP-complete (approx)
⇒ compounding errors in P(Q | E=E)
9
Domain knowledge structures don’t help
Domain knowledge-based structures
do not support tractable inference
(example: the link graph over webpages)
10
This thesis: general directions
New algorithms for learning and inference in graphical models
to answer queries better,
emphasizing the computational aspects of the graph
Learn accurate and tractable models
Compensate for reduced expressive power with
exact inference and optimal parameters
Gain significant speedups
Inference speedups via better prioritization of computation
Estimate the long-term effects of propagating information through the
graph
Use long-term estimates to prioritize updates
11
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many
nuisance variables [AISTATS 2010]
12
Generative learning
Learning goal: P(Q, E)
Query goal: P(Q | E) = P(Q, E) / P(E)
Useful when E is not known in advance
Sensors fail unpredictably
Measurements are expensive (e.g. user time), want
adaptive evidence selection
13
Tractable vs intractable models workflow
Tractable models:
  learn a simple tractable structure from domain knowledge + data
  → optimal parameters, exact inference
  → approx. P(Q | E=E)
Intractable models:
  construct an intractable structure from domain knowledge, or learn an intractable structure from data
  → approximate inference algorithms with no quality guarantees
  → approx. P(Q | E=E)
14
Tractability via low treewidth
[figure: example graph over 7 variables]
Treewidth: the size of the largest clique in a triangulated graph, minus one
Exact inference exponential in treewidth (sum-product)
Treewidth NP-complete to compute in general
Low-treewidth graphs are easy to construct
Convenient representation: junction tree
Other tractable model classes exist too
15
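As a small illustration of the treewidth notion above, the following sketch (my own code, with a hypothetical example graph) computes the width induced by one elimination order: eliminate variables one at a time, connect each variable's remaining neighbors (the triangulation fill-in), and track the largest neighbor set created. The treewidth is the minimum of this width over all elimination orders.

```python
def induced_width(edges, order):
    """Width induced by eliminating variables in `order`:
    max number of remaining neighbors over eliminated variables."""
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    width = 0
    eliminated = set()
    for v in order:
        nbrs = adj[v] - eliminated
        width = max(width, len(nbrs))
        # connect remaining neighbors (triangulation fill-in edges)
        for a in nbrs:
            adj[a] |= nbrs - {a}
        eliminated.add(v)
    return width

# Hypothetical 7-variable example: a 4-cycle plus a pendant chain.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (4, 5), (5, 6), (6, 7)]
print(induced_width(edges, order=[7, 6, 5, 1, 2, 3, 4]))  # width 2 -> treewidth <= 2
```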
Junction trees
Cliques connected by edges labeled with separators
Running intersection property
[figure: junction tree for the 7-variable example with cliques C1-C5 over {X1,X4,X5}, {X1,X2,X5}, {X1,X2,X7}, {X4,X5,X6}, {X1,X3,X5} and separators such as {X4,X5}, {X1,X5}, {X1,X2}]
Finding the most likely junction tree of a given treewidth > 1 is NP-complete
We will look for good approximations
16
Independencies in low-treewidth distributions
P(X) factorizes according to a junction tree (C, E):
  P_(C,E)(X) = ∏_{C ∈ C} P(C) / ∏_{S ∈ S} P(S)
⇔ conditional independencies hold across every separator:
  I(X_α, X_β | S) = 0   (conditional mutual information)
Works in the other direction too: if I(X_α, X_β | S) ≤ ε for every S ∈ S,
then KL(P || P_(C,E)) ≤ |X| ε
Example (see figure): for separator S = {X1, X5}, X_α = {X2, X3, X7} and X_β = {X4, X6}
17
Constraint-based structure learning
KLP || P(C , E )   X 

I X   , X   | S    S  S
I(X , X X Look
| S3)for
< JTs
 where this holds
(constraint-based structure learning)
S8
S1: X1X2
S2: X1X3
X
S3: X1X4
…
Sm: Xn-1Xn
all candidate
separators
X X
X1 X4
S1
S7
C1
C4
S3
C2
C5
C3
all variables
X
partition remaining
variables into weakly
dependent subsets
find consistent
junction tree
18
Mutual information complexity
I(X , X- | S) = H(X | S) - H(X | X- S3)
everything except for X
conditional entropy
I(X , X- | S) depends on all assignments to X:
exp(|X|) complexity in general
Our contribution: polynomial-time upper bound
19
Mutual info upper bound: intuition
Computing I(A, B | C) directly is hard
Only look at small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k:
  polynomial number of small subsets, polynomial complexity for every pair
Any conclusions about I(A, B | C)?
  In general, no
  If a good junction tree exists, yes
20
Contribution: mutual info upper bound
Suppose an ε-JT of treewidth k for P(A ∪ B ∪ C) exists:
  I(X_α, X_β | S) ≤ ε  ∀ S ∈ S
Theorem:
Let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ k+1
Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε)
21
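A minimal sketch of how the bound can be used (my own toy code and data; the thesis implementation differs): estimate I(D, F | C) from counts for every pair of small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k+1, take the maximum δ, and report |A ∪ B ∪ C|(δ + ε) as an upper bound on I(A, B | C).

```python
import itertools
from collections import Counter
from math import log

def cond_mutual_info(rows, a_idx, b_idx, c_idx):
    """Empirical I(A;B|C) in nats from complete discrete rows (tuples of values)."""
    def H(idx):
        if not idx:
            return 0.0
        counts = Counter(tuple(r[i] for i in idx) for r in rows)
        n = len(rows)
        return -sum(c / n * log(c / n) for c in counts.values())
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (H(a_idx + c_idx) + H(b_idx + c_idx)
            - H(a_idx + b_idx + c_idx) - H(c_idx))

def mi_upper_bound(rows, A, B, C, k, eps):
    """I(A;B|C) <= |A u B u C| * (delta + eps), with
    delta = max I(D;F|C) over D in A, F in B, |D| + |F| <= k+1."""
    delta = 0.0
    for d_size in range(1, min(k, len(A)) + 1):
        for f_size in range(1, min(k + 1 - d_size, len(B)) + 1):
            for D in itertools.combinations(A, d_size):
                for F in itertools.combinations(B, f_size):
                    delta = max(delta, cond_mutual_info(rows, list(D), list(F), list(C)))
    return (len(A) + len(B) + len(C)) * (delta + eps)

# Toy data: 4 binary columns; A = {0}, B = {1}, C = {2, 3} (indices are hypothetical).
rows = [(0, 0, 0, 1), (1, 1, 0, 0), (0, 1, 1, 0), (1, 0, 1, 1)] * 50
print(mi_upper_bound(rows, A=[0], B=[1], C=[2, 3], k=2, eps=0.01))
```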
Mutual info upper bound: complexity
Direct computation: complexity exp(|A ∪ B ∪ C|)
Our upper bound:
  O(|A ∪ B|^(treewidth+1)) small subsets D, F with |D ∪ F| ≤ treewidth+1
  exp(|C| + treewidth) time for each
  |C| = treewidth for structure learning
⇒ polynomial(|A ∪ B ∪ C|) complexity
22
Guarantees on learned model quality
Theorem:
Suppose a strongly connected ε-JT of treewidth k for P(X) exists.
Then our algorithm will, with probability at least (1 − γ), find a JT such that
  KL(P || P_JT) ≤ (k+1) |X| (ε + 2δ)            [quality guarantee]
using O( log(|X| / γ) / δ² ) samples            [poly samples]
and O( |X|^(2k+3) log(1/γ) / δ² ) time.         [poly time]
Corollary: strongly connected junction trees are PAC-learnable
23
Related work
Reference                   Model                          Guarantees     Time
[Bach+Jordan:2002]          tractable                      local          poly(n)
[Chow+Liu:1968]             tree                           global         O(n² log n)
[Meila+Jordan:2001]         tree mix                       local          O(n² log n)
[Teyssier+Koller:2005]      compact                        local          poly(n)
[Singh+Moore:2005]          all                            global         exp(n)
[Karger+Srebro:2001]        tractable                      const-factor   poly(n)
[Abbeel+al:2006]            compact                        PAC            poly(n)
[Narasimhan+Bilmes:2004]    tractable                      PAC            exp(n)
our work                    tractable                      PAC            poly(n)
[Gogate+al:2010]            tractable with high treewidth  PAC            poly(n)
24
Results – typical convergence time
[plot: test log-likelihood (higher is better) vs. training time]
Good results early on in practice
25
Results – log-likelihood
[plot: test log-likelihood per method (higher is better)]
  OBS: local search in limited in-degree Bayes nets
  Chow-Liu: most likely JTs of treewidth 1
  Karger-Srebro: constant-factor approximation JTs
  our method
26
Conclusions
A tractable upper bound on conditional mutual info
Graceful quality degradation and PAC learnability
guarantees
Analysis on when dynamic programming works
[in the thesis]
Dealing with unknown mutual information threshold
[in the thesis]
Speedups preserving the guarantees
Further speedups without guarantees
27
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many
nuisance variables [AISTATS 2010]
28
Discriminative learning
Query goal and learning goal coincide:
  learn P(Q | E) = P(Q, E) / P(E) directly
Useful when variables E are always the same
Non-adaptive, one-shot observation
Image pixels → scene description
Document text → topic, named entities
Better accuracy than generative models
29
Discriminative log-linear models
feature f_α (domain knowledge), weight w_α (learned from data):
  P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) )
  Z(E, w): evidence-dependent normalization
Evidence:
  don't sum over all values of E
  don't model P(E)
  no need for structure over E
[figure: query variables connected by features f12, f34, with evidence attached]
30
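To make the model concrete, here is a small self-contained sketch (toy features and weights of my own invention, not the thesis code) that computes P(Q | E, w) by brute-force enumeration over Q, exactly as the formula reads; note that Z(E, w) sums over Q only, never over E. Real models replace the enumeration with structured inference.

```python
import itertools
from math import exp

# Hypothetical pairwise features over binary query variables Q1, Q2 and a scalar evidence e.
def features(q, e):
    q1, q2 = q
    return [
        1.0 if q1 == q2 else 0.0,   # agreement feature
        e if q1 == 1 else 0.0,      # evidence-dependent feature for Q1
        e if q2 == 1 else 0.0,      # evidence-dependent feature for Q2
    ]

def conditional(q, e, w):
    """P(Q = q | E = e, w): log-linear model, Z(E, w) by enumeration over Q only."""
    def score(q_):
        return exp(sum(wi * fi for wi, fi in zip(w, features(q_, e))))
    Z = sum(score(q_) for q_ in itertools.product([0, 1], repeat=2))
    return score(q) / Z

w = [0.5, 1.2, -0.3]
print(conditional((1, 1), e=0.8, w=w))
```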
Model tractability still important
Observation #1: tractable models are necessary for
exact inference and parameter learning
in the discriminative setting
Tractability is determined by the structure over the query
31
Simple local models: motivation
[plot: query Q vs. evidence E; the relationship Q = f(E) is locally almost linear]
Exploiting evidence values
overcomes the expressive power deficit of simple models
We will learn local tractable models
32
Context-specific independence
[figure: an edge present for some evidence values, no edge for others]
Observation #2: use evidence values at test time
to tune the structure of the models;
do not commit to a single tractable model
33
Low-dimensional dependencies in generative structure learning
Generative structure learning often relies only on low-dimensional marginals
Junction trees: decomposable scores over cliques and separators
  LLH(C, S) = Σ_{S ∈ S} H(S) − Σ_{C ∈ C} H(C)
Low-dimensional independence tests: I(A, B | S) ≤ ε
Small changes to structure ⇒ quick score recomputation
Discriminative structure learning: needs inference in the full model
for every datapoint, even for small changes in structure
34
Leverage generative learning
Observation #3: generative structure learning
algorithms have very useful properties,
can we leverage them?
35
Observations so far
Discriminative setting has extra information, namely evidence
values at test time
  Want to use them to learn local tractable models
Good structure learning algorithms exist for the generative setting
that only require low-dimensional marginals P(Q_β)
Approach: 1. use local conditionals P(Q_β | E=E) as "fake marginals"
to learn local tractable structures
2. learn exact discriminative feature weights
36
Evidence-specific CRF overview
Approach: 1. use local conditionals P(Q_β | E=E) as "fake marginals"
to learn local tractable structures
2. learn exact discriminative feature weights
[flowchart: evidence value E=E → local conditional density estimators P(Q_β | E) → P(Q_β | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF]
37
Evidence-specific CRF formalism
Observation: an identically zero feature f_α ≡ 0 does not affect the model
Add extra "structural" parameters u:
  P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )
Evidence-specific structure: I_α(E, u) ∈ {0, 1}  (evidence-specific feature values)
[figure: for each evidence value E=E1, E=E2, E=E3, (fixed dense model) × (evidence-specific tree "mask") = (evidence-specific model)]
38
Evidence-specific CRF learning
Learning proceeds in the same order as testing:
[flowchart: evidence value E=E → local conditional density estimators P(Q_β | E) → P(Q_β | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF]
39
Plug in generative structure learning
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )
I(E, u) encodes the output of the chosen structure learning algorithm
Directly generalize generative algorithms:
  Generative:      P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree
  Discriminative:  P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
40
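A compact sketch of the Chow-Liu step referenced above (my own toy implementation with made-up data): estimate pairwise mutual informations and build a maximum spanning tree over the query variables. In the evidence-specific variant, the same routine would be fed conditional estimates P̂(Qi, Qj | E=E, u) instead of empirical marginals.

```python
import itertools
from collections import Counter
from math import log

def pairwise_mi(rows, i, j):
    """Empirical mutual information I(Qi; Qj) in nats."""
    n = len(rows)
    pij = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    return sum((c / n) * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(rows, num_vars):
    """Maximum spanning tree over MI edge weights (Prim's algorithm)."""
    weights = {(i, j): pairwise_mi(rows, i, j)
               for i, j in itertools.combinations(range(num_vars), 2)}
    in_tree, tree_edges = {0}, []
    while len(in_tree) < num_vars:
        best = max(((i, j) for (i, j) in weights
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weights[e])
        tree_edges.append(best)
        in_tree |= set(best)
    return tree_edges

# Toy binary data over 4 query variables (hypothetical).
rows = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1)] * 25
print(chow_liu_tree(rows, num_vars=4))
```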
Evidence-specific CRF learning: structure
Choose a generative structure learning algorithm A (here: Chow-Liu)
Identify the low-dimensional subsets Q_β that A may need (for Chow-Liu: all pairs (Qi, Qj))
Split the original problem (E, Q) into low-dimensional pairwise problems
  (E; Q1,Q2), (E; Q1,Q3), …, (E; Q3,Q4)
and estimate P̂(Q1, Q2 | E, u12), P̂(Q1, Q3 | E, u13), P̂(Q3, Q4 | E, u34), …
41
Estimating low-dimensional conditionals
Use the same features as the baseline high-treewidth model
Baseline CRF:
  P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) )
Scope restriction gives the low-dimensional model:
  P̂(Q_β | E, u_β) = (1/Z(E, u_β)) exp( Σ_α u_βα f_α(Q_α, E) )  s.t. Q_α ⊆ Q_β
End result: optimal u
42
Evidence-specific CRF learning: weights
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )   ("effective features")
Already chose the algorithm behind I(E, u)
Already learned its parameters u
Only need to learn the feature weights w
log P(Q | E, w, u) is concave in w ⇒ unique global optimum
43
Evidence-specific CRF learning: weights
For a tree-structured distribution, the gradient is exact:
  ∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)}[ f_α(Q_α, E) ] ]
[figure: for each datapoint (E=E1, Q=Q1), (E=E2, Q=Q2), (E=E3, Q=Q3), the fixed dense model is masked by the evidence-specific tree, giving exact tree-structured gradients w.r.t. w; their sum is the overall (dense) gradient]
44
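The sketch below (toy model and features of my own, not the thesis code) implements the masked gradient above for a single datapoint: features are switched on or off by the evidence-specific structure indicator, and expected features are computed under the current model. For brevity it enumerates a tiny Q by brute force instead of running exact tree inference, which is what the masked structure enables at scale.

```python
import itertools
from math import exp

def features(q, e):
    q1, q2 = q
    # Hypothetical pairwise + unary features.
    return [1.0 if q1 == q2 else 0.0, e * q1, e * q2]

def masked_score(q, e, w, mask):
    return exp(sum(m * wi * fi for m, wi, fi in zip(mask, w, features(q, e))))

def grad_log_likelihood(q_obs, e, w, mask):
    """d log P(q_obs | e, w, u) / dw with the structure mask I(E, u) applied."""
    states = list(itertools.product([0, 1], repeat=2))
    Z = sum(masked_score(q, e, w, mask) for q in states)
    expected = [0.0] * len(w)
    for q in states:
        p = masked_score(q, e, w, mask) / Z
        for a, f in enumerate(features(q, e)):
            expected[a] += p * mask[a] * f
    return [mask[a] * f - expected[a]
            for a, f in enumerate(features(q_obs, e))]

# One gradient-ascent step on a single (Q, E) datapoint with a hypothetical mask.
w = [0.0, 0.0, 0.0]
mask = [1, 1, 0]              # feature 2 switched off by I(E, u) for this evidence
g = grad_log_likelihood(q_obs=(1, 1), e=0.7, w=w, mask=mask)
w = [wi + 0.1 * gi for wi, gi in zip(w, g)]
print(w)
```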
Results – WebKB
Text + links → webpage topic
Methods: ignore links (SVM), standard dense CRF (RMN), max-margin model (M3N), our work (ESS-CRF)
[plots: prediction error for SVM, RMN, ESS-CRF, M3N (lower is better); time for RMN, ESS-CRF, M3N]
45
Image segmentation - accuracy
Local segment features + neighbor segments → type of object
Methods: ignore links (logistic regression), standard dense CRF, our work (ESS-CRF)
[plot: accuracy for logistic regression, dense CRF, ESS-CRF (higher is better)]
46
Image segmentation - time
Methods: ignore links (logistic regression), standard dense CRF, our work (ESS-CRF)
[plots, log scale: train time and test time for logistic regression, dense CRF, ESS-CRF (lower is better)]
47
Conclusions
Using evidence values to tune low-treewidth model
structure
Compensates for the reduced expressive power
Order of magnitude speedup at test time (sometimes train time too)
General framework for plugging in existing generative
structure learners
Straightforward relational extension [in the thesis]
48
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many
nuisance variables [AISTATS 2010]
49
Why high-treewidth models?
A dense model expressing laws of nature
Protein folding
Max-margin parameters don’t work well (yet?) with
evidence-specific structures
50
Query-Specific inference problem
[figure: a large pairwise model with a few query variables, some evidence, and many variables that are not interesting in themselves]
  P(X) ∝ ∏_{(i,j)∈E} f_ij(X_i, X_j)
Goal: use information about the query to speed up convergence of belief propagation
for the query marginals
51
(loopy) Belief Propagation
Passing messages along edges
[figure: variables i, j, k, r, h, s, u with message m_{i→k}]
Variable belief:
  P̃^(t)(x_i) ∝ ∏_{(i,j)∈E} m^(t)_{j→i}(x_i)
Update rule:
  m^(t+1)_{j→i}(x_i) = Σ_{x_j} f_ij(x_i, x_j) ∏_{(k,j)∈E, k≠i} m^(t)_{k→j}(x_j)
Result: all single-variable beliefs
52
(loopy) Belief Propagation
Message dependencies are local
[figure: the message j→i depends only on the messages into j]
Freedom in scheduling updates
Round-robin schedule:
  fix a message order
  apply updates in that order until convergence
53
Dynamic update prioritization
[figure: propagating a large change is an informative update; propagating a small change is wasted computation]
A fixed update sequence is not the best option
Dynamic update scheduling can speed up convergence:
  Tree-Reweighted BP [Wainwright et al., AISTATS 2003]
  Residual BP [Elidan et al., UAI 2006]
Residual BP: apply the largest change first
54
Residual BP [Elidan et al., UAI 2006]
Update rule:
  m^(NEW)_{j→i}(x_i) = Σ_{x_j} f_ij(x_i, x_j) ∏_{(k,j)∈E, k≠i} m^(OLD)_{k→j}(x_j)
Pick the edge with the largest residual:
  max_{(j,i)} || m^(NEW)_{j→i} − m^(OLD)_{j→i} ||
Update m^(OLD)_{j→i} ← m^(NEW)_{j→i}
More effort on the difficult parts of the model (good),
but no notion of the query (bad)
55
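Below is a minimal residual-BP skeleton for a pairwise model with tabular potentials (my own simplified code with a made-up 4-variable model, not the paper's implementation): recompute a message, measure its residual, and always apply the update with the largest residual from a priority queue. The query-specific variant described next only multiplies each residual by an edge-importance weight.

```python
import heapq

# Toy pairwise model (variables, edges and potentials are hypothetical):
# binary variables 0..3, attractive potential on every edge, local evidence on variable 0.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
def pot(xi, xj):
    return 2.0 if xi == xj else 1.0
def node_pot(v, x):
    return (1.5, 0.5)[x] if v == 0 else 1.0

nbrs = {v: set() for v in range(4)}
for i, j in edges:
    nbrs[i].add(j)
    nbrs[j].add(i)

# Directed messages m[(j, i)](x_i), initialized to uniform.
msg = {(j, i): [0.5, 0.5] for i, j in edges}
msg.update({(i, j): [0.5, 0.5] for i, j in edges})

def new_message(j, i):
    """m_{j->i}(x_i) = sum_{x_j} phi_j(x_j) f_ij(x_i, x_j) prod_{k in nbrs(j)\{i}} m_{k->j}(x_j)."""
    out = []
    for xi in (0, 1):
        total = 0.0
        for xj in (0, 1):
            prod = node_pot(j, xj) * pot(xi, xj)
            for k in nbrs[j] - {i}:
                prod *= msg[(k, j)][xj]
            total += prod
        out.append(total)
    s = sum(out)
    return [v / s for v in out]

def residual(j, i):
    new = new_message(j, i)
    return max(abs(a - b) for a, b in zip(new, msg[(j, i)])), new

# Residual BP: always apply the message update with the largest residual.
heap = [(-residual(j, i)[0], (j, i)) for (j, i) in msg]
heapq.heapify(heap)
updates = 0
while heap and updates < 100:
    _, (j, i) = heapq.heappop(heap)
    r, new = residual(j, i)        # stored priority may be stale; recompute
    if r < 1e-6:
        continue
    msg[(j, i)] = new
    updates += 1
    for k in nbrs[i] - {j}:        # messages out of i now have changed inputs
        heapq.heappush(heap, (-residual(i, k)[0], (i, k)))

def belief(i):
    b = [node_pot(i, x) for x in (0, 1)]
    for k in nbrs[i]:
        b = [b[x] * msg[(k, i)][x] for x in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

print(belief(3))
```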
Why edge importance weights?
[figure: two messages with residual(a) < residual(b); which to update?]
Residual BP updates the message with the larger residual, which has no influence on the query: wasted computation
We want to update the message that will influence the query in the future
Residual BP: maximize the immediate residual reduction
Our work: maximize the approximate eventual effect on P(query)
56
Query-Specific BP
Update rule:
  m^(NEW)_{j→i}(x_i) = Σ_{x_j} f_ij(x_i, x_j) ∏_{(k,j)∈E, k≠i} m^(OLD)_{k→j}(x_j)
Pick the edge with the largest weighted residual:
  max_{(j,i)} || m^(NEW)_{j→i} − m^(OLD)_{j→i} || · A_{j→i}
  (A_{j→i} is the edge importance: the only change!)
Update m^(OLD)_{j→i} ← m^(NEW)_{j→i}
Rest of the talk: defining and computing edge importance
57
Edge importance base case
Priority: || m^(NEW)_{j→i} − m^(OLD)_{j→i} || · A_{j→i},
where A_{j→i} approximates the eventual effect of the update on P(Q)
Base case: an edge directly connected to the query. A_{j→i} = ?
  change in query belief: || P^(NEW)(Q) − P^(OLD)(Q) || ≤ || m^(NEW)_{j→i} − m^(OLD)_{j→i} || · 1
  (change in message; the bound is tight)
Edge importance one step away
Edge one step away from the query: A_{r→j} = ?
  change in query belief: || ΔP(Q) || ≤ || Δm_{j→i} || ≤ sup || ∂m_{j→i} / ∂m_{r→j} || · || Δm_{r→j} ||
  (the sup is over the values of all other messages; || Δm_{r→j} || is the change in message)
The message importance sup || ∂m_{j→i} / ∂m_{r→j} || can be computed in closed form,
looking only at f_ji [Mooij, Kappen; 2007]
Edge importance general case
Base case: A_{j→i} = 1
One step away: A_{r→j} = sup || ∂m_{j→i} / ∂m_{r→j} ||
Generalization: A_{s→h} = sup || ∂P(Q) / ∂m_{s→h} || ?
  expensive to compute, and the bound may be infinite
Instead, for a path π = (s→h, h→r, r→j, j→i) to the query define
  sensitivity(π) = sup || ∂m_{h→r} / ∂m_{s→h} || · sup || ∂m_{r→j} / ∂m_{h→r} || · sup || ∂m_{j→i} / ∂m_{r→j} ||
sensitivity(π): max impact along the path
Edge importance general case
2
query
sup
P(Q)

msh 
i
sup
k
1
r
j
s

h
u
mhr  sup mrj  sup mji
msh
mhr
mrj
sensitivity(): max impact along the path
Ash = max
all paths
 from h
to query sensitivity(
There are a lot of paths in a graph,
trying out every one is intractable 

)
Efficient edge importance computation
A = max over all paths π from the edge to the query of sensitivity(π)
There are a lot of paths in a graph, trying out every one is intractable
But every per-edge factor sup || ∂m / ∂m || is always ≤ 1,
so sensitivity(π) always decreases as the path grows,
and it decomposes into individual edge contributions
⇒ Dijkstra's (shortest paths) algorithm will efficiently find the max-sensitivity paths for every edge
62
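A sketch of the Dijkstra-style computation (my own simplified code; the per-edge sensitivities are given as made-up inputs here, whereas the thesis derives them from the factors via the Mooij-Kappen bound): since every per-edge sensitivity is at most 1, path sensitivity only decreases as a path grows, so a best-first search outward from the query yields the max-sensitivity value A for every directed edge. Edges with no path to the query simply never receive a value.

```python
import heapq

def edge_importance(nbrs, sens, query):
    """A[(j, i)] = max over paths from edge (j -> i) to the query of the product of
    per-edge sensitivities, computed by a max-product Dijkstra from the query."""
    A = {}
    heap = [(-1.0, (j, query)) for j in nbrs[query]]   # base case: edges into the query
    heapq.heapify(heap)
    while heap:
        neg_a, (j, i) = heapq.heappop(heap)
        if (j, i) in A:
            continue                                   # already finalized with a larger value
        A[(j, i)] = -neg_a
        for r in nbrs[j] - {i}:                        # predecessor edges (r -> j)
            cand = sens[(j, i)] * A[(j, i)]
            if (r, j) not in A:
                heapq.heappush(heap, (-cand, (r, j)))
    return A

# Hypothetical 5-variable graph with made-up per-edge sensitivities in (0, 1].
nbrs = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
sens = {}
for i, ns in nbrs.items():
    for j in ns:
        sens[(i, j)] = 0.8 if (i + j) % 2 else 0.5     # arbitrary illustrative values
print(edge_importance(nbrs, sens, query=0))
```

In query-specific BP these values simply multiply the residuals before they enter the update priority queue.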
Query-Specific BP
Run Dijkstra's algorithm starting at the query to get the edge weights
  A_{j→i} = max over all paths π from i to the query of sensitivity(π)
Pick the edge with the largest weighted residual:
  max_{(j,i)} || m^(NEW)_{j→i} − m^(OLD)_{j→i} || · A_{j→i}
Update m^(OLD)_{j→i} ← m^(NEW)_{j→i}
More effort on the difficult and relevant parts of the model
Takes into account not only the graphical structure, but also the strength of dependencies
63
Experiments – single query
[plots: convergence for an easy model (sparse connectivity, weak interactions) and a hard model (dense connectivity, strong interactions); our work vs. standard residual BP]
Faster convergence, but the long initialization is still a problem
64
Anytime query-specific BP
[figure: a query inside the full model]
Query-specific BP: run Dijkstra's algorithm to completion, then BP updates
Anytime QSBP: interleave Dijkstra's algorithm with BP updates;
same BP update sequence!
65
Experiments – anytime QSBP
[plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity, strong interactions); our work, our work + anytime, standard residual BP]
Much shorter initialization
66
Experiments – multiquery
[plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity, strong interactions); our work, our work + anytime, standard residual BP]
67
Conclusions
Weighting edges is a simple and effective way to
improve prioritization
We introduce a principled notion of edge importance
based on both structure and parameters of the model
Robust speedups in the query-specific setting
Don’t spend computation on nuisance variables unless
needed for the query marginal
Deferring BP initialization has a large impact
68
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many
nuisance variables [AISTATS 2010]
69
Future work
More practical JT learning
SAT solvers to construct structure, pruning heuristics, …
Evidence-specific learning
Trade efficiency for accuracy
Max-margin evidence-specific models
Theory on ES structures too
Inference:
Beyond query-specific: better prioritization in general
Beyond BP: query-specific Gibbs sampling?
70
Thesis conclusions
Graphical models are a regularization technique for
high-dimensional distributions
Representation-based structure is well-understood
Conditional independencies
Right now, structured computation is a
“consequence” of representation
Major issues with tractability, approximation quality
Logical next step: structured computation as a primary
basis of regularization
This thesis: computation-centric approaches have
better efficiency and do not sacrifice accuracy
71
Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
72
Mutual info upper bound: quality
Upper bound:
Suppose an ε-JT exists
δ is the largest mutual information over small subsets
Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε)
No need to know the -JT, only that it exists
No connection between C and the JT separators
C can be of any size, no connection to JT treewidth
The bound is loose only when there is no hope to learn a good JT
73
Typical graphical models workflow
Learn/construct structure: a reasonable but intractable structure from domain knowledge
Learn/define parameters
The graph is primarily a representation tool
Inference: approximate algorithms with no quality guarantees ⇒ approx. P(Q|E=e)
74
Contributions – tractable models
Learn accurate and tractable models
In the generative setting [NIPS 2007]
Polynomial-time conditional mutual information upper bound
First PAC-learning result for strongly connected junction trees
Graceful degradation guarantees
Speedup heuristics
In the discriminative setting [NIPS 2010]
General framework for learning CRF structure that depends on
evidence values at test time
Extensions to the relational setting
Empirical: order of magnitude speedups with the same accuracy as
high-treewidth models
75
Contributions – faster inference
Speed up belief propagation for cases with many
nuisance variables [AISTATS 2010]
A framework of importance-weighted residual belief
propagation
A principled measure of eventual impact of an edge update
on the query belief
Prioritize updates by importance for the query instead of absolute
magnitude
An anytime modification to defer much of initialization
Initial inference results available much sooner
Often much faster eventual convergence
The same fixed points as the full model
76
Future work
Two main bottlenecks:
Constructing JTs given mutual information values.
Esp. with non-uniform treewidth, dependence strength
Large sample: learnability guarantees for non-uniform treewidth
Small sample: non-uniform treewidth for regularization
Constraint satisfaction, SAT solvers, etc?
Relax strong connectivity requirement?
Evaluating mutual information:
need to look at 2k+1 variables instead of k+1, large penalty
Branch on features instead of sets of variables? [Gogate+al:2010]
Speedups without guarantees
Local search, greedy separator construction, …
77
Log-linear parameter learning
conditional log-likelihood:
  LLH(D | w) = Σ_{(Q,E)∈D} log P(Q | E, w)
Convex optimization: unique global maximum
Gradient: features − [expected features]
  ∂ log P(Q | E, w) / ∂w_α = f_α(Q_α, E) − E_{P(Q|E,w)}[ f_α(Q_α, E) ]
  need inference: inference for every E given w
78
Log-linear parameter learning
                  Tractable              Intractable
Generative (E=∅)  Closed-form            Approximate gradient-based (no guarantees)
Discriminative    Exact gradient-based   Approximate gradient-based (no guarantees)

Tractable → Intractable: complexity "phase transition"
Generative → Discriminative: "manageable" slowdown, by the number of datapoints
  Generative: inference once per weights update
  Discriminative: inference for every datapoint (Q, E), once per weights update
79
Plug in generative structure learning
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )
I(E, u) encodes the output of the chosen structure learning algorithm
Fix the algorithm ⇒ always get structures with the desired properties (e.g. treewidth):
  Chow-Liu for optimal trees
  our thin junction tree learning from part 1
  Karger-Srebro for high-quality low-diameter junction trees
  local search, etc.
Replace P(Q_β) with approximate conditionals P(Q_β | E=E, u_β) everywhere
80
Evidence-specific CRF learning: weights
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )
Already know the algorithm behind I(E, u); already learned u; only need to learn w
Can find the evidence-specific structure I(E=E, u) for every training datapoint (Q, E)
The structure induced by I(E, u) is always tractable ⇒ learn the optimal w exactly
For a tree-structured distribution:
  ∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)}[ f_α(Q_α, E) ] ]
81
Relational evidence-specific CRF
Relational models: templated features + shared weights
Relation: webpage LinksTo webpage
Learn a single weight wLINK; copy the weight to every grounding
[figure: groundings of LinksTo, each sharing the weight wLINK]
82
Relational evidence-specific CRF
Relational models: templated features + shared weights
Every grounding is a separate datapoint for structure training
⇒ use the propositional approach + shared weights
[figure: grounded model over x1, …, x5 and the training datasets for the "structural" parameters u, one per grounding pair: (x1,x2), (x1,x3), (x1,x4), (x1,x5), (x2,x3), (x2,x4), (x2,x5), (x3,x4), (x3,x5), (x4,x5)]
83
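A tiny sketch of the weight sharing across groundings (relation names, pages, and numbers are all invented for illustration): a single templated weight wLINK is copied to every grounded LinksTo feature, and the per-grounding gradient contributions are summed back into the shared weight.

```python
# Hypothetical relational template: one shared weight per relation name.
shared_w = {"LinksTo": 0.3}

# Groundings of the LinksTo relation between pages (made-up identifiers).
groundings = [("page1", "page2"), ("page1", "page3"), ("page2", "page3")]

def grounded_weights():
    """Copy the shared template weight to every grounding."""
    return {g: shared_w["LinksTo"] for g in groundings}

def aggregate_gradient(per_grounding_grad):
    """Sum per-grounding gradients back into the single shared weight."""
    return {"LinksTo": sum(per_grounding_grad[g] for g in groundings)}

# Gradients computed independently for each grounded feature (placeholder numbers).
per_grounding_grad = {("page1", "page2"): 0.10, ("page1", "page3"): -0.05,
                      ("page2", "page3"): 0.02}
shared_grad = aggregate_gradient(per_grounding_grad)
shared_w["LinksTo"] += 0.1 * shared_grad["LinksTo"]   # one gradient-ascent step
print(grounded_weights())
```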
Future work
Faster learning: pseudolikelihood is really fast, need to
compete
Larger treewidth: trade time for accuracy
Theory on learning “structural parameters” u
Max-margin learning
Inference is a basic step in max-margin learning too ⇒ tractable
models are useful beyond log-likelihood
Optimizing feature weights w given local trees is straightforward
Optimizing “structural parameters” u for max-margin is hard
What is the right objective?
Almost tractable structures, other tractable models
Make sure loops don’t hurt too much
84
Query versus nuisance variables
We may actually care about only few variables
What are the topics of the webpages on the first page of Google
search results for my query?
Smart heating control: is anybody going to be at home for the
next hour?
Does the patient need immediate doctor attention?
But the model may need a lot of other variables to be
accurate enough
Don’t care about them per se, but necessary to look at to get the
query right
Both query and nuisance variables are unknown,
inference algorithms don’t see a difference
Speed up inference by focusing on the query
Only look at nuisance variables to the extent needed to answer
the query
85
Our contributions
Using weighted residuals to prioritize updates
Define message weights reflecting the importance of the
message to the query
Computing importance weights efficiently
Experiments: faster convergence on large relational models
86
Interleaving
Dijkstra's expands the highest-weight edges first
[figure: expanded edges near the query, not-yet-expanded and just-expanded edges further away; A_min = smallest importance among the expanded edges]
Suppose M ≥ max over ALL edges of || m^(NEW)_{j→i} − m^(OLD)_{j→i} ||
Then M · A_min is an upper bound on the priority of any not-yet-expanded edge
So if  max over EXPANDED edges of || m^(NEW)_{j→i} − m^(OLD)_{j→i} || · A_{j→i}  ≥  M · A_min
(actual priority of an expanded edge ≥ upper bound on the priority of an unexpanded one),
there is no need to expand further at this point
87
Deferring BP initialization
Observation: Dijkstra’s alg. expands the most important edges first
Do we really need to look at every low importance edge
before applying BP updates?
No! Can use upper bounds on priority instead.
88
Upper bounds in priority queue
Observation: for edges low in the priority queue,
an upper bound on the priority is enough
[figure: updates priority queue; the exact priority is needed only for the top element, a priority upper bound is enough further down]
89
Priority upper bound for not yet seen edges
Expand several edges with Dijkstra's.
For those: (residual) × (importance weight) = exact priority
For all the other edges:
  priority(edge) = residual(edge) × importance weight(edge)
                 ≤ || factor(edge) || × importance weight(e′)  for any already expanded edge e′
  (unexpanded edges have smaller importance weights than expanded ones)
A component-wise upper bound without looking at the edge's messages!
90
Interleaving BP and Dijkstra’s
[figure: the query inside the full model]
exact priority < upper bound ⇒ Dijkstra: expand an edge
exact priority > upper bound ⇒ BP update
Resulting schedule: Dijkstra, Dijkstra, BP, Dijkstra, BP, BP, …
91