Sample Complexity of Composite Likelihood
Joseph K. Bradley & Carlos Guestrin
PAC-learning parameters for general MRFs & CRFs
via practical methods: pseudolikelihood & structured composite likelihood.
Binary X:  Φ12(X1, X2) = ( X1 ∧ X2,  X1 ∧ ¬X2,  ¬X1 ∧ X2,  ¬X1 ∧ ¬X2 )ᵀ
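A minimal sketch of this indicator-feature parameterization for one binary pair; the θ12 values and helper names below are hypothetical, not from the poster:

```python
import numpy as np

def phi12(x1, x2):
    """Indicator features for one binary pair: (x1∧x2, x1∧¬x2, ¬x1∧x2, ¬x1∧¬x2)."""
    return np.array([x1 & x2, x1 & (1 - x2), (1 - x1) & x2, (1 - x1) & (1 - x2)], float)

theta12 = np.array([0.7, -0.2, -0.2, 0.3])   # hypothetical parameters, one per indicator

def psi12(x1, x2):
    """Factor value Ψ12(x1, x2) = exp(θ12ᵀ Φ12(x1, x2))."""
    return np.exp(theta12 @ phi12(x1, x2))

print([psi12(a, b) for a in (0, 1) for b in (0, 1)])
```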
Regularization
θ̂i' ← argmin_θ  E_data[ -log Pθ(Xi | X-i) ]
θ̂  ← average of the θ̂i' from the separate estimates
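A minimal sketch of this disjoint scheme for an Ising-style parameterization of a binary pairwise MRF (an assumption; the poster's models use the indicator features above). Each conditional P(Xi | X-i) is a logistic function of the other variables, so each conditional can be fit on its own and the two estimates of each pairwise weight averaged. The helper name and the use of scikit-learn are illustrative only, and each column is assumed to take both values in the data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def disjoint_mple_ising(X, C=10.0):
    """Disjoint pseudolikelihood for a binary pairwise (Ising-style) MRF.
    X: (n, d) data matrix with entries in {0, 1}.
    Returns (W, b): symmetric pairwise weights and per-variable biases."""
    n, d = X.shape
    W_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        # P(X_i = 1 | X_-i) is a logistic function of the other variables, so each
        # conditional is its own (L2-regularized) logistic regression.
        clf = LogisticRegression(C=C, solver="lbfgs").fit(X[:, others], X[:, i])
        W_hat[i, others] = clf.coef_[0]
        b_hat[i] = clf.intercept_[0]
    # Each pairwise weight is estimated twice (by conditionals i and j);
    # the disjoint estimator averages the separate estimates.
    return 0.5 * (W_hat + W_hat.T), b_hat
```

The d regressions share no state, which is what makes the disjoint estimator data parallel.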
Gold Standard: MLE is (optimally) statistically efficient.
Hard to compute (inference).
Iterate:
• Compute gradient.
• Step along gradient.
Can we learn without
intractable inference?
Theorem
Sample Complexity Bound
for Disjoint MPLE:
MLE loss:   -log Pθ(X)
MPLE loss:  -Σi log Pθ(Xi | X-i)    (Besag, 1975)
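A toy, brute-force comparison of the two losses on a 3-variable binary chain (hypothetical potentials; enumeration only for illustration). The point is that the MLE loss needs the global partition function, while each pseudolikelihood term only needs a local, per-variable normalizer:

```python
import itertools
import numpy as np

# Tiny chain X1 - X2 - X3 with binary variables and pairwise log-potentials.
edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(0)
theta = {e: rng.normal(size=(2, 2)) for e in edges}   # theta[e][a, b] = log Ψ_e(a, b)

def log_unnorm(x):
    """Log of the unnormalized probability: Σ_e log Ψ_e(x_i, x_j)."""
    return sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges)

states = list(itertools.product([0, 1], repeat=3))
logZ = np.logaddexp.reduce([log_unnorm(x) for x in states])   # global normalizer

def mle_loss(x):
    """-log Pθ(x): requires the partition function Z (inference)."""
    return -(log_unnorm(x) - logZ)

def mple_loss(x):
    """-Σ_i log Pθ(x_i | x_-i): only per-variable normalizers, no global Z."""
    loss = 0.0
    for i in range(3):
        flips = [log_unnorm(tuple(x[:i]) + (v,) + tuple(x[i+1:])) for v in (0, 1)]
        loss -= log_unnorm(x) - np.logaddexp.reduce(flips)
    return loss

x = (1, 0, 1)
print(mle_loss(x), mple_loss(x))
```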
[Plot: L1 parameter error and the error bound vs. training set size.]
Is the bound
still useful
(predictive)?
Pθ(X) ≈ Pθ(X2 | X1) · Pθ(X1 | X2, X3, X4, X5) · ...
(X1: deadline?   X2: bags under eyes?)
[Plot series: MPLE-disjoint.]
Liang and Jordan (2008)
• Asymptotic bounds for pseudolikelihood, composite likelihood.
• Our finite-sample bounds are of the same order.
Chains:  X1 - Ψ12(X1, X2) - X2 - ...
Factors:
• Random:      log Ψ12(x1, x2) ~ Uniform[-s, s]
• Associative: log Ψ12(x1, x2) = s if x1 = x2, else 0
(s = factor strength)
[Plot: learning curve; series r = 23.]
MCLE: the effect of a bad estimator P(X_Ai | X_-Ai) can be averaged out by other, good estimators.
MPLE: one bad estimator P(Xi | X-i) can give bad results.
[Plots: Learning and Test, vs. 1/Λmin.]
[Plot: log loss ratio (other/MLE) for MPLE and combs vs. grid size |X|.]
[Plot: MLE, MPLE, and combs vs. grid size |X|.
Grid; associative factors (fixed strength); 10,000 training samples.]
Combs (MCLE) lower sample complexity, without increasing computation!
(Low-degree factor graphs over discrete X.)
Canonical parameterization (Abbeel et al., 2006):
• Re-write P(X) as a ratio of many small factors P(X_Ci | X_-Ci).
  (Fine print: each factor is instantiated 2^|Ci| times using a reference assignment.)
• Estimate each small factor P(X_Ci | X_-Ci) from data.
⇒ Computing MPLE directly is faster.

Experimental details: 10 runs with separate datasets; optimization by conjugate gradient; MLE on big grids uses stochastic gradient with Gibbs sampling.

Averaging MCLE Components
Intuition: ρmin / Mmax = "average Λmin" over the multiple components estimating each parameter.
ρmin = minj [ Σ over components Ai which estimate θj of Λmin( Hessian of E[-log Pθ(X_Ai | X_-Ai)] at θ* ) ]
Mmax = maxj [ number of components Ai which estimate θj ]
Grid with strong vertical (associative) factors.
[Plot: ρmin; series include MLE, Combs - vertical, r = 5, r = 11.]
Theorem
If the canonical parameterization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization.

⇒ Good choice: vertical combs
Bound on Parameter Error: MCLE
Experimental Setup
E.g., a model with:
• Weak horizontal factors
• Strong vertical factors
Stars
Choosing MCLE components Y_Ai:
• Larger is better.
• Keep inference tractable.
• Use model structure.
n ≥ const · [ 1 / ( (ρmin/Mmax)² ε² ) ] · log( r / δ )
Abbeel et al. (2006)
• Only previous method for PAC-learning high-treewidth discrete MRFs.
• Main idea: their "canonical parameterization" (described above).

Learning with approximate inference
• No previous PAC-style bounds for general MRFs, CRFs.
• c.f.: Hinton (2002), Koller & Friedman (2009), Wainwright (2006)
Grids
Composite Likelihood (MCLE):
Theorem
Related Work
Structures
⇒ Estimate a larger component, but keep inference tractable.
ε = O( (1/Λmin) · √( log(r) / n ) )
Chains. Random factors. 10,000 training examples. MLE (similar results for MPLE).
Chain. |X| = 4. Random factors. MLE.
Yes! Actual error vs. bound:
• Different constants
• Similar behavior
• Nearly independent of r
Pro: No intractable inference required
Pro: Consistent estimator
Con: Less statistically efficient than MLE
Con: No PAC bounds
Ravikumar et al. (2010)
• PAC bounds for regression Yi ~ X with Ising factors.
• Our theory is largely derived from this work.
Combs (Structured MCLE)
improve upon MPLE.
Estimate P(Y_Ai | Y_-Ai) separately, for components Y_Ai ⊆ Y.
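A brute-force sketch of one such component loss, -log P(X_A | X_-A), on a hypothetical 4-variable toy model (not the poster's experimental setup). Enumerating the block's joint assignments stays tractable as long as the block itself is small or low-treewidth:

```python
import itertools
import numpy as np

# Toy 4-variable binary model with pairwise log-potentials on a path.
edges = {(0, 1): 0.8, (1, 2): -0.5, (2, 3): 1.2}   # hypothetical weights

def log_unnorm(x):
    """Log of the unnormalized probability: Σ_edges w_ij · 1[x_i == x_j]."""
    return sum(w * (x[i] == x[j]) for (i, j), w in edges.items())

def component_loss(x, block):
    """-log P(X_block | X_-block): one composite-likelihood component, computed by
    enumerating the 2^|block| joint assignments of the block only."""
    cfgs = []
    for vals in itertools.product((0, 1), repeat=len(block)):
        y = list(x)
        for pos, v in zip(block, vals):
            y[pos] = v
        cfgs.append(log_unnorm(y))
    return -(log_unnorm(list(x)) - np.logaddexp.reduce(cfgs))

x = [1, 0, 0, 1]
print(component_loss(x, block=[1]))      # a singleton block gives one MPLE term
print(component_loss(x, block=[1, 2]))   # a larger MCLE component, still cheap
```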
Predictive Power of Bounds
Pseudolikelihood (MPLE) loss.
Intuition: approximate the distribution as a product of local conditionals.
Hard to compute ⇒ replace it!
[Plot: L1 param error and L1 param error bound vs. training set size.]

Log loss ≤ f(param estimation error)   (tighter bound)

Maximum Pseudolikelihood (MPLE)

CRF training objective:  E_data[ -log Pθ(X | E) ]
Con: Z depends on E!
Compute Z(e) for every training example!
(Lafferty et al., 2001)
[Plot: vs. factor strength (fixed |Y| = 8).]
Something in between?
n ≥ const · ( 1 / (Λmin² ε²) ) · log( r·|X| / δ )    (bound for disjoint MPLE)
[Plot axes: log (base e) loss; L1 param error.]
Pθ(X | E) = exp( θᵀ Φ(X, E) ) / Zθ(E)
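A brute-force sketch of why the conditional normalizer matters: Zθ(e) sums over assignments of X only, so the cost is exponential in |X| but not in |E|, yet it has to be recomputed for every conditioning value e, i.e. for every training example. The feature map and sizes below are hypothetical:

```python
import itertools
import numpy as np

def crf_log_prob(theta, phi, x, e, n_x=3):
    """log Pθ(x | e) = θ·Φ(x, e) - log Zθ(e), with Zθ(e) summed over X only."""
    scores = [theta @ phi(xp, e) for xp in itertools.product((0, 1), repeat=n_x)]
    return theta @ phi(x, e) - np.logaddexp.reduce(scores)

# Hypothetical toy features: one agreement indicator per (x_i, e_i) pair plus a bias per x_i.
def phi(x, e):
    return np.array([float(xi == ei) for xi, ei in zip(x, e)] + [float(xi) for xi in x])

theta = np.zeros(6)
for e in [(1, 1, 0), (0, 0, 1)]:            # a new Z(e) for each example's evidence
    print(crf_log_prob(theta, phi, x=(1, 0, 1), e=e))
```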
MLE objective:
MPLE: Estimate P(Yi|Y-i) separately
Con: Worse bound
(extra factors |X|)
Log loss bound,
given params
Parameter estimation error
≤ f(sample size)
(looser bound)
Model conditional distribution P(X | E) over random variables X, given variables E.
⇒ Inference exponential only in |X|, not in |E|.
(Lindsay, 1988)
Conditional Random Fields (CRFs)
Pro: Model X, not E.
Pro: Data parallel
Tightness of Bounds
MLE: Estimate P(Y) all at once.
MLE Algorithm
Composite Likelihood (MCLE)
Joint MPLE:  θ̂ ← argmin_θ  E_data[ -Σi log Pθ(Xi | X-i) ]
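A minimal sketch of the joint variant for the same Ising-style parameterization assumed earlier: a single optimization over the shared parameters, with the summed negative log pseudolikelihood as the objective (scipy's L-BFGS with numerical gradients, purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_pl(params, X, d):
    """Summed negative log pseudolikelihood -Σ_n Σ_i log Pθ(x_i | x_-i) for 0/1 data."""
    iu = np.triu_indices(d, 1)
    W = np.zeros((d, d)); W[iu] = params[:iu[0].size]; W = W + W.T   # shared edge weights
    b = params[iu[0].size:]
    S = X @ W + b                                 # S[n, i] = Σ_j w_ij x_nj + b_i
    return np.sum(np.logaddexp(0.0, S) - X * S)   # logistic loss per (sample, variable)

def fit_joint_mple(X):
    n, d = X.shape
    theta0 = np.zeros(d * (d - 1) // 2 + d)
    return minimize(neg_log_pl, theta0, args=(X, d), method="L-BFGS-B").x
```

Unlike the disjoint version, all conditionals here constrain one shared parameter vector, which matches the joint objective above but is no longer data parallel per conditional.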
Structured Composite Likelihood
Joint vs. Disjoint Optimization
Disjoint MPLE: estimate each conditional Pθ(Xi | X-i) separately, then average the estimates.
All plots are for associative factors. (Random factors behave similarly.)
(Fmax = max feature magnitude)
L(θ̂) ≤ L(θ*) + Fmax · r · ε
[Plot: MPLE and combs vs. grid width (fixed factor strength).]
MPLE is worse for strong factors.
L(θ̂) ≤ L(θ*) + ( Λmax/2 + Fmax ) · r² · ε²
Maximum Likelihood Estimation (MLE)
L2 regularization is more common; our analysis applies to L1 & L2.
Theorem
If the parameter estimation error ε is small, then the log loss converges quadratically in ε;
else the log loss converges linearly in ε.
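A quick numerical check of the quadratic regime on a tiny brute-force model (hypothetical features and θ*, not the poster's experiments): perturb θ* with average per-parameter error ε and watch the excess expected log loss shrink roughly like ε²:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
states = list(itertools.product((0, 1), repeat=3))
Phi = np.array([[x[0]*x[1], x[1]*x[2], x[0], x[1], x[2]] for x in states], float)
theta_star = rng.normal(scale=0.5, size=5)
r = theta_star.size

def expected_log_loss(theta):
    """E_{Pθ*}[-log Pθ(X)], both distributions by brute-force enumeration."""
    logp_star = Phi @ theta_star; logp_star -= np.logaddexp.reduce(logp_star)
    logp = Phi @ theta; logp -= np.logaddexp.reduce(logp)
    return -(np.exp(logp_star) @ logp)

base = expected_log_loss(theta_star)
direction = rng.normal(size=r); direction /= np.abs(direction).sum()   # L1 norm 1
for eps in (0.1, 0.01, 0.001):
    excess = expected_log_loss(theta_star + r * eps * direction) - base  # avg per-param error = eps
    print(eps, excess)   # excess drops ~100x per 10x drop in eps, i.e. roughly like eps**2
```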
MPLE is worse for big grids.
[Plot: Λmin ratio vs. model size |Y| (fixed factor strength).]
Bound on Log Loss
Example query:  P(X1 | X2, X4) = P( deadline | bags under eyes, losing hair )
[Plot axes: loss; log loss ratio (other/MLE).]
⇒ Our analysis covers their learning method.
Factor:  Ψ12(X1, X2) = exp( θ12ᵀ Φ12(X1, X2) )
MLE objective (minimize):  E_data[ -log Pθ(X) ] + λ‖θ‖₁
[Plot: vs. model size |Y| (fixed factor strength).]
Λmin for MLE: min eigenvalue of the Hessian of the loss at θ*:
    Λmin( ∇²θ E_{P*(X)}[ -log Pθ(X) ] )  at θ = θ*
Λmin for MPLE: mini [ min eigenvalue of the Hessian of loss component i at θ* ]:
    mini Λmin( ∇²θ E_{P*(X)}[ -log Pθ(Xi | X-i) ] )  at θ = θ*

Example MRF: the health of a grad student.
X1: deadline?  X2: bags under eyes?  X3: sick?  X4: losing hair?  X5: overeating?
P(X) ∝ Ψ12(X1, X2) · Ψ14(X1, X4) · ...

Given data: n i.i.d. samples from Pθ*(X).   ε = avg. per-parameter error;  δ = probability of failure.
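For the MLE case this is easy to check numerically on a toy model: the Hessian of E[-log Pθ(X)] is the feature covariance under Pθ, so Λmin at θ* is the smallest eigenvalue of Cov_θ*[Φ(X)]. A brute-force sketch with hypothetical features and parameters:

```python
import itertools
import numpy as np

states = list(itertools.product((0, 1), repeat=3))
Phi = np.array([[x[0]*x[1], x[1]*x[2], x[0], x[1], x[2]] for x in states], float)
theta_star = np.array([1.0, -0.5, 0.2, 0.0, -0.3])    # hypothetical true parameters

logp = Phi @ theta_star
p = np.exp(logp - np.logaddexp.reduce(logp))           # Pθ*(x) for every state
mean = p @ Phi
cov = Phi.T @ (Phi * p[:, None]) - np.outer(mean, mean)  # Covθ*[Φ(X)] = Hessian of the MLE loss
print("Λmin (MLE):", np.linalg.eigvalsh(cov).min())
```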
MPLE is worse for high-degree nodes.
Grids
Model diameter is not important.
Zθ = Σ_x exp( θᵀ Φ(x) )
r = # parameters (length of θ)
[Plot labels: Training time (sec); Λmin ratio; Parameters; Features; Stars; Chains.]
Theorem (Bound on Parameter Error: MLE, MPLE)
MLE or MPLE using L1 or L2 regularization λ = O(n^(-1/2)) achieve avg. per-parameter error
ε = (1/r)‖θ̂ - θ*‖₁  with probability ≥ 1-δ, using n i.i.d. samples from Pθ*(X):
n ≥ const · ( 1 / (Λmin² ε²) ) · log( r / δ )
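To see how the bound scales, here is a plug-in of its functional form; the constant is unspecified on the poster, so C = 1 below is an arbitrary placeholder and only the relative scaling in Λmin, ε, r, and δ is meaningful:

```python
import numpy as np

def n_required(lambda_min, eps, r, delta, C=1.0):
    """Functional form of the bound: n ≥ C / (Λmin² ε²) · log(r / δ)."""
    return C / (lambda_min**2 * eps**2) * np.log(r / delta)

print(n_required(lambda_min=0.2, eps=0.05, r=100, delta=0.01))   # ≈ 9.2e4
print(n_required(lambda_min=0.1, eps=0.05, r=100, delta=0.01))   # 4x larger: n scales as 1/Λmin²
```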
Requires inference. ⇒ Provably hard for general MRFs.
Plotted: Ratio (Λmin for MLE) / (Λmin for other method)
How do the bounds vary w.r.t. model properties?

Markov Random Fields (MRFs)
Model distribution P(X) over random variables X as a log-linear MRF:
Pθ(X) = (1/Zθ) exp( θᵀ Φ(X) )
Λmin for Various Models
Sample Complexity Bounds
Background
[Plot series: Combs - both; MPLE.]
Future Work
• Theoretical understanding of how Λmin varies with model properties.
• Choosing MCLE structure on natural graphs.
• Parallel learning: lowering sample complexity of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.
[Plot: vs. grid width; series include Combs - horizontal.]
Best: component structure matches model structure.
Average: a reasonable choice without prior knowledge of θ*.
Worst: component structure does not match model structure.
Acknowledgements
• Thanks to John Lafferty, Geoff Gordon, and our reviewers for helpful feedback.
• Funded by NSF Career IIS-0644225, ONR YIP N00014-08-1-0752, and ARO MURI W911NF0810242.