Sample Complexity of Composite Likelihood
Joseph K. Bradley & Carlos Guestrin

PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.

Background

Markov Random Fields (MRFs)
Model a distribution P(X) over random variables X as a log-linear MRF with parameters \theta and features \Phi(X):
  P_\theta(X) = \frac{1}{Z_\theta} \exp(\theta^\top \Phi(X)),   Z_\theta = \sum_x \exp(\theta^\top \Phi(x)).
(Low-degree factor graphs over discrete X.)
Example MRF: the health of a grad student. Variables: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?
  P(X) \propto \Psi_{12}(X_1, X_2)\, \Psi_{14}(X_1, X_4) \cdots, with each factor \Psi_{12}(X_1, X_2) = \exp(\theta_{12}^\top \Phi_{12}(X_1, X_2)).
For binary X, \Phi_{12}(X_1, X_2) is the vector of indicator features for the four joint assignments of (X_1, X_2).
Example query: P(X_1 | X_2, X_4) = P(deadline | bags under eyes, losing hair).

Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Model a conditional distribution P(X | E) over random variables X, given variables E:
  P_\theta(X | E) = \frac{1}{Z_\theta(E)} \exp(\theta^\top \Phi(X, E)).
• Pro: Model X, not E.
• Pro: Inference is exponential only in |X|, not in |E|.
• Con: Z depends on E, so Z(e) must be computed for every training example!

Maximum Likelihood Estimation (MLE)
Minimize the objective E_data[-\log P_\theta(X)] + \lambda \|\theta\|_1 (or E_data[-\log P_\theta(X | E)] + \lambda \|\theta\|_1 for CRFs).
Regularization: L2 regularization is more common; our analysis applies to both L1 & L2.
Algorithm: iterate: compute the gradient; step along the gradient. Each gradient requires inference, which is provably hard for general MRFs.
Gold standard: MLE is (optimally) statistically efficient, but hard to compute -- so replace it! Can we learn without intractable inference?

Maximum Pseudolikelihood (MPLE) (Besag, 1975)
MLE loss: -\log P_\theta(X).   Pseudolikelihood (MPLE) loss: -\sum_i \log P_\theta(X_i | X_{-i}).
Intuition: approximate the distribution as a product of local conditionals:
  P_\theta(X) \approx P_\theta(X_1 | X_2, X_3, X_4, X_5) \cdot P_\theta(X_2 | X_1, X_3, X_4, X_5) \cdots
• Pro: No intractable inference required.
• Pro: Consistent estimator.
• Con: Less statistically efficient than MLE.
• Con: No PAC bounds (prior to this work).

Composite Likelihood (MCLE) (Lindsay, 1988)
Something in between MLE and MPLE: estimate a larger component, but keep inference tractable.
• MLE: estimate P(Y) all at once.
• MPLE: estimate each P(Y_i | Y_{-i}) separately.
• MCLE: estimate each P(Y_{A_i} | Y_{-A_i}) separately, for subsets Y_{A_i} \subseteq Y.

Joint vs. Disjoint Optimization
• Joint MPLE: \hat{\theta} \leftarrow \arg\min_\theta E_data[-\sum_i \log P_\theta(X_i | X_{-i})].
• Disjoint MPLE: \hat{\theta}_i' \leftarrow \arg\min_\theta E_data[-\log P_\theta(X_i | X_{-i})] for each i; then \hat{\theta} \leftarrow average of the \hat{\theta}_i' from the separate estimates (each parameter is averaged over the components that estimate it).
  Pro: data parallel.  Con: worse bound (extra factors of |X|).
A minimal sketch of disjoint MPLE follows.
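Below is a minimal sketch (illustrative only, not the authors' code) of disjoint MPLE for a pairwise Ising-style MRF on {-1,+1} variables, where each conditional P_\theta(X_i | X_{-i}) reduces to a logistic regression on the neighbors and each shared edge parameter is averaged over the two conditionals that estimate it. Names such as `disjoint_mple` and `neg_log_conditional` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of disjoint MPLE for a pairwise Ising-style MRF on {-1,+1}
# variables (illustrative helper code, not the authors' implementation).
# Each edge (i, j) carries one parameter theta_ij, and the local conditional is
# logistic:  P(x_i | x_-i) = sigmoid(2 * x_i * sum_j theta_ij * x_j).

def neg_log_conditional(theta_i, i, nbr_list, X, lam=1e-2):
    """Empirical E_data[-log P_theta(X_i | X_-i)] + L2 penalty, for one variable i."""
    scores = X[:, nbr_list] @ theta_i          # sum_j theta_ij * x_j, per sample
    margins = 2.0 * X[:, i] * scores
    losses = np.logaddexp(0.0, -margins)       # -log sigmoid(m), computed stably
    return losses.mean() + lam * np.sum(theta_i ** 2)

def disjoint_mple(X, edges):
    """X: (n, d) array of +/-1 samples; edges: list of (i, j) pairs."""
    d = X.shape[1]
    edges = [(min(a, b), max(a, b)) for a, b in edges]
    nbrs = {i: [b if a == i else a for (a, b) in edges if i in (a, b)] for i in range(d)}
    estimates = {e: [] for e in edges}         # each edge gets 2 estimates (one per endpoint)
    for i in range(d):
        if not nbrs[i]:
            continue
        res = minimize(neg_log_conditional, np.zeros(len(nbrs[i])),
                       args=(i, nbrs[i], X), method="L-BFGS-B")
        for j, val in zip(nbrs[i], res.x):
            estimates[(min(i, j), max(i, j))].append(val)
    # "Avg of theta_i' from separate estimates": average over the components
    # that estimate each shared parameter.
    return {e: float(np.mean(v)) for e, v in estimates.items()}

# Toy usage: a 3-node chain fit to random +/-1 data.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(500, 3))
print(disjoint_mple(X, edges=[(0, 1), (1, 2)]))
```

Because the conditionals are optimized independently, the loop over variables is embarrassingly parallel, which is the "data parallel" advantage of disjoint optimization noted above.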
Sample Complexity Bounds

Setting: n i.i.d. samples from P_{\theta^*}(X); r = number of parameters (length of \theta); \delta = probability of failure; \epsilon = \frac{1}{r}\|\hat{\theta} - \theta^*\|_1 = average per-parameter error; \Phi_{\max} = maximum feature magnitude.

Bound on Parameter Error: MLE, MPLE
Theorem. MLE or MPLE using L1 or L2 regularization with \lambda = O(n^{-1/2}) achieves average per-parameter error \epsilon = \frac{1}{r}\|\hat{\theta} - \theta^*\|_1 with probability \geq 1 - \delta using
  n \geq \mathrm{const} \cdot \frac{1}{\Lambda_{\min}^2 \epsilon^2} \log\frac{r}{\delta}
i.i.d. samples from P_{\theta^*}(X).
• \Lambda_{\min} for MLE: the minimum eigenvalue of the Hessian of the loss at \theta^*:
  \Lambda_{\min} = \lambda_{\min}\big( \nabla_\theta^2\, E_{P^*(X)}[-\log P_\theta(X)] \big)\big|_{\theta=\theta^*}.
• \Lambda_{\min} for MPLE: the minimum over i of the minimum eigenvalue of the Hessian of loss component i at \theta^*:
  \Lambda_{\min} = \min_i \lambda_{\min}\big( \nabla_\theta^2\, E_{P^*(X)}[-\log P_\theta(X_i | X_{-i})] \big)\big|_{\theta=\theta^*}.

Bound on Parameter Error: Disjoint MPLE / MCLE
Theorem (Sample Complexity Bound for Disjoint MPLE). The same guarantee holds using
  n \geq \mathrm{const} \cdot \frac{1}{(\rho_{\min}/M_{\max})^2 \epsilon^2} \log\frac{r}{\delta}
samples, where
  \rho_{\min} = \min_j \sum_{A_i \,:\, A_i \text{ estimates } \theta_j} \lambda_{\min}\big( \nabla_\theta^2\, E_{P^*}[-\log P_\theta(X_{A_i} | X_{-A_i})] \big)\big|_{\theta=\theta^*},
  M_{\max} = \max_j \big|\{A_i : A_i \text{ estimates } \theta_j\}\big|.
Intuition: \rho_{\min}/M_{\max} is an average \Lambda_{\min} over the multiple components estimating each parameter.
• MCLE: the effect of a bad estimator P(X_{A_i} | X_{-A_i}) can be averaged out by other, good estimators.
• MPLE: one bad estimator P(X_i | X_{-i}) can give bad results.
Combs (structured MCLE) improve upon MPLE: estimate each P(Y_{A_i} | Y_{-A_i}) separately, with Y_{A_i} \subseteq Y.

Tightness of Bounds
• Parameter estimation error \leq f(sample size): the looser bound.
• Log loss \leq f(parameter estimation error), given the estimated parameters: the tighter bound.

Bound on Log Loss
Theorem. Let L(\theta) = E_{P_{\theta^*}(X)}[-\log P_\theta(X)]. If the parameter estimation error \epsilon is small, then the log loss converges quadratically in \epsilon:
  L(\hat{\theta}) \leq L(\theta^*) + \tfrac{1}{2}\big(\Lambda_{\max} + \Phi_{\max}^2 r\big)\,\epsilon^2;
otherwise the log loss converges linearly in \epsilon:
  L(\hat{\theta}) \leq L(\theta^*) + \Phi_{\max}\, r\, \epsilon.

Predictive Power of Bounds
Is the bound still useful (predictive)? Yes. Comparing the actual L1 parameter error against the L1 parameter error bound as a function of training set size (chain, |X| = 4, random factors; plotted for MLE and disjoint MPLE, with similar results for MPLE; figures omitted):
• Different constants.
• Similar behavior.
• Nearly independent of r.
A brute-force illustration of \Lambda_{\min} and the resulting bound on a tiny model follows.
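As a concrete illustration of the quantities in these bounds, the sketch below (assumed, not from the paper) computes \Lambda_{\min} for MLE and for MPLE components by exhaustive enumeration on a tiny binary chain with edge features x_i x_j. It uses the standard facts that the Hessian of the expected negative log-likelihood at \theta^* is the feature covariance Cov_{\theta^*}[\Phi(X)], and that the Hessian of pseudolikelihood component i is E_{X_{-i}}[Cov_{\theta^*}(\Phi(X) | X_{-i})]; here each component's Hessian is restricted to the parameters appearing in that component, which is an assumption of this sketch. The final line evaluates the bound up to its unspecified constant.

```python
import itertools
import numpy as np

# Brute-force sketch (assumed, not the authors' code): compute Lambda_min for
# MLE and for MPLE components on a tiny binary chain with edge features
# Phi_ij(x) = x_i * x_j, then plug into n >= const * log(r/delta) / (Lambda_min^2 * eps^2).

def features(x, edges):
    return np.array([x[i] * x[j] for (i, j) in edges], dtype=float)

def enumerate_model(theta, edges, d):
    """All 2^d states, their feature vectors, and their probabilities under P_theta."""
    states = list(itertools.product([-1, 1], repeat=d))
    phis = np.array([features(x, edges) for x in states])
    logp = phis @ theta
    p = np.exp(logp - logp.max())
    return states, phis, p / p.sum()

def lambda_min_mle(theta, edges, d):
    # Hessian of E[-log P_theta(X)] at theta* is the feature covariance Cov[Phi(X)].
    _, phis, p = enumerate_model(theta, edges, d)
    mean = p @ phis
    cov = (phis - mean).T @ (p[:, None] * (phis - mean))
    return np.linalg.eigvalsh(cov)[0]

def lambda_min_mple(theta, edges, d):
    # Component i Hessian: E_{X_-i}[ Cov(Phi(X) | X_-i) ], restricted here to the
    # parameters (edges) that actually appear in P(X_i | X_-i).
    states, phis, p = enumerate_model(theta, edges, d)
    lam = np.inf
    for i in range(d):
        own = [k for k, (a, b) in enumerate(edges) if i in (a, b)]
        H = np.zeros((len(own), len(own)))
        for s_rest in itertools.product([-1, 1], repeat=d - 1):
            # The two states that agree with s_rest on X_-i and differ in X_i.
            idx = [states.index(s_rest[:i] + (xi,) + s_rest[i:]) for xi in (-1, 1)]
            w = p[idx].sum()                    # P*(x_-i)
            cond = p[idx] / w                   # P*(x_i | x_-i)
            f = phis[np.ix_(idx, own)]
            m = cond @ f
            H += w * ((f - m).T @ (cond[:, None] * (f - m)))
        lam = min(lam, np.linalg.eigvalsh(H)[0])
    return lam

edges = [(0, 1), (1, 2), (2, 3)]                # chain over 4 binary variables
theta_star = np.full(len(edges), 1.0)           # associative factors of strength 1
r, eps, delta = len(edges), 0.1, 0.05
for name, lam in [("MLE", lambda_min_mle(theta_star, edges, 4)),
                  ("MPLE", lambda_min_mple(theta_star, edges, 4))]:
    print(name, "Lambda_min =", lam,
          " sample-complexity bound (up to const) =", np.log(r / delta) / (lam ** 2 * eps ** 2))
```

Increasing the associative strength in `theta_star` shrinks the MPLE components' \Lambda_{\min} faster than MLE's, which is one way to see the "MPLE is worse for strong factors" behavior discussed next.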
Λmin for Various Models
How do the bounds vary with model properties? Plotted (figures omitted): the ratio (\Lambda_{\min} for MLE) / (\Lambda_{\min} for the other method), as a function of factor strength (fixed |Y| = 8), model size |Y| (fixed factor strength), and grid width (fixed factor strength).
Structures: chains, stars, grids.
Factors (pairwise, \Psi_{12}(X_1, X_2), with factor strength s):
• Random: \log \Psi_{12}(x_1, x_2) \sim \mathrm{Uniform}[-s, s].
• Associative: \log \Psi_{12}(x_1, x_2) = s if x_1 = x_2, and 0 otherwise.
Findings (all plots are for associative factors; random factors behave similarly):
• MPLE is worse for strong factors.
• MPLE is worse for big grids.
• MPLE is worse for high-degree nodes.
• Model diameter is not important.

Structured Composite Likelihood
Choosing MCLE components Y_{A_i}:
• Larger is better.
• Keep inference tractable.
• Use model structure.
E.g., for a grid model with weak horizontal factors and strong vertical factors, a good choice is vertical combs. (A small tractability-checking sketch appears at the end of this poster.)
On a grid with strong vertical (associative) factors, plotting \rho_{\min} against grid width (for r = 5, 11, 23 parameters; figure omitted) for MLE, MPLE, and comb variants shows:
• Best: component structure matches model structure (vertical combs).
• Average: a reasonable choice without prior knowledge of \theta^* (combs in both directions).
• Worst: component structure does not match model structure (horizontal combs).

Experiments
Experimental setup: 10 runs with separate datasets; objectives optimized with conjugate gradient; MLE on big grids uses stochastic gradient with Gibbs sampling.
• Chains, random factors, 10,000 training examples: learning and test curves (log (base e) loss and L1 parameter error vs. 1/\Lambda_{\min}; figure omitted) behave as the bounds predict, consistent with \epsilon = O\big(\tfrac{1}{\Lambda_{\min}}\sqrt{\log r}\big) at fixed n.
• Grids, associative factors (fixed strength), 10,000 training samples: log loss ratio (other method / MLE) and training time (sec) vs. grid size |X| (figures omitted). MPLE's log loss ratio grows with grid size while combs stay close to MLE, and MLE's training time grows far faster than MPLE's or the combs'.
Combs (structured MCLE) lower sample complexity without increasing computation.

Related Work
• Abbeel et al. (2006): the only previous method for PAC-learning high-treewidth discrete MRFs. Main idea (their "canonical parameterization"): rewrite P(X) as a ratio of many small factors P(X_{C_i} | X_{-C_i}), estimating each small factor from data; fine print: each factor is instantiated 2^{|C_i|} times using a reference assignment. Theorem: if the canonical parameterization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization. Our analysis therefore covers their learning method, and computing MPLE directly is faster.
• Liang and Jordan (2008): asymptotic bounds for pseudolikelihood and composite likelihood. Our finite-sample bounds are of the same order.
• Ravikumar et al. (2010): PAC bounds for regressing Y_i ~ X with Ising factors. Our theory is largely derived from this work.
• Learning with approximate inference: no previous PAC-style bounds for general MRFs and CRFs; cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).

Future Work
• Theoretical understanding of how \Lambda_{\min} varies with model properties.
• Choosing MCLE structure on natural graphs.
• Parallel learning: lowering the sample complexity of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.

Acknowledgements
• Thanks to John Lafferty, Geoff Gordon, and our reviewers for helpful feedback.
• Funded by NSF Career IIS-0644225, ONR YIP N00014-08-1-0752, and ARO MURI W911NF0810242.
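Companion sketch to the structured-MCLE discussion above (illustrative only; the paper's exact comb construction may differ): it builds one plausible "vertical comb" component on a grid, a single horizontal spine row plus every other full column, and checks that the subgraph induced by the component is cycle-free, which is the "keep inference tractable" criterion for choosing Y_{A_i}. All names (`grid_edges`, `vertical_comb`, `induced_is_cycle_free`) are hypothetical.

```python
# Illustrative utility (not from the paper): check that a candidate MCLE
# component Y_Ai induces a cycle-free subgraph of a grid MRF once the remaining
# variables are conditioned on, so exact inference over the component is tractable.

def grid_edges(h, w):
    """Edges of an h x w grid; variables are indexed v = row * w + col."""
    edges = []
    for r in range(h):
        for c in range(w):
            v = r * w + c
            if c + 1 < w:
                edges.append((v, v + 1))      # horizontal edge
            if r + 1 < h:
                edges.append((v, v + w))      # vertical edge
    return edges

def vertical_comb(h, w, spine_row=0, tooth_cols=None):
    """Variables of a comb: one spine row plus the chosen full columns as teeth."""
    tooth_cols = range(0, w, 2) if tooth_cols is None else tooth_cols
    comb = {spine_row * w + c for c in range(w)}
    comb |= {r * w + c for c in tooth_cols for r in range(h)}
    return comb

def induced_is_cycle_free(component, edges):
    """Union-find cycle check on edges with both endpoints inside the component."""
    parent = {v: v for v in component}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in edges:
        if a in component and b in component:
            ra, rb = find(a), find(b)
            if ra == rb:
                return False                  # cycle -> no longer tree-exact inference
            parent[ra] = rb
    return True

h, w = 6, 6
edges = grid_edges(h, w)
comb = vertical_comb(h, w)                    # spine row 0 + every other column
print("comb size:", len(comb), "cycle-free:", induced_is_cycle_free(comb, edges))
print("whole grid cycle-free:", induced_is_cycle_free(set(range(h * w)), edges))
```

The comb passes the check (it is a tree, so exact inference over it is cheap) while the full grid does not, which is why a single large component is not an option and structured components like combs are used instead.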