Thesis Defense: Learning Large-Scale Conditional Random Fields
Joseph K. Bradley
Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst)
1/18/2013, Carnegie Mellon

Modeling Distributions
Goal: Model a distribution P(X) over random variables X.
E.g.: Model the life of a grad student.
X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?
Such a model lets us answer queries, e.g.:
P(X1, X5 | X2, X7) = P(losing sleep, overeating | deadline, taking classes)

Markov Random Fields (MRFs)
Goal: Model a distribution P(X) over random variables X.
P(X) ∝ Ψ16(X1, X6) Ψ13(X1, X3) × ...
[Figure: the eleven grad-student variables in a graph; each Ψ node is a factor (carrying parameters), and the edges give the graphical structure.]

Conditional Random Fields (CRFs) (Lafferty et al., 2001)
MRFs model P(X); CRFs model P(Y | X):
P(Y | X) ∝ Ψ(Y1, X1) Ψ(Y1, Y3) × ...
Simpler structure (over Y only); we do not model P(X).

MRFs & CRFs
Benefits:
• Principled statistical and computational framework
• Large body of literature
Applications:
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...

Challenges
Goal: Given data, learn CRF structure and parameters: Ψ(Y1, X1), Ψ(Y1, Y3), ...
• Structure learning is NP-hard in general (Srebro, 2003).
• Parameter learning is a big structured optimization problem.
• Many learning methods require inference, i.e., answering queries P(A | B), which is NP-hard to approximate (Roth, 1996); approximations often lack strong guarantees.

Thesis Statement
CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Outline: Scaling Core Methods
• Parameter learning: learning without intractable inference
• Structure learning: learning tractable structures
• Parallel scaling: both solved via parallel regression (multicore sparse regression)

Outline: Parameter Learning — learning without intractable inference.

Log-linear MRFs
Goal: Model a distribution P(X) over random variables X.
P(X) ∝ Ψ12(X1, X2) Ψ24(X2, X4) × ..., where each factor is log-linear, e.g., Ψ12(X1, X2) = exp(θ12ᵀ Φ12(X1, X2)), with parameters θ and features Φ.
Compactly: Pθ(X) ∝ exp(θᵀ Φ(X)).
All results generalize to CRFs.

Parameter Learning: MLE
Parameter learning: given the structure Φ and samples from Pθ*(X), learn the parameters θ.
Traditional method: maximum-likelihood estimation (MLE).
Minimize the objective: E_data[−log Pθ(X)] + regularization.
Gold standard: MLE is (optimally) statistically efficient.

But the likelihood requires the partition function:
Pθ(X) = (1/Zθ) exp(θᵀ Φ(X)), where Zθ = Σ_x exp(θᵀ Φ(x))
sums over all joint assignments x. MLE therefore requires inference, which is provably hard for general MRFs (Roth, 1996). Inference makes learning hard; the sketch below makes the blow-up concrete.
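To see why, here is a minimal sketch (mine, not the thesis's code; the toy model and all names are illustrative) of the exact MLE quantities for a tiny binary pairwise MRF. The partition function Zθ sums over all 2^d joint assignments — exactly the inference step that becomes infeasible as d grows.

```python
import itertools
import numpy as np

# Hypothetical toy model: binary pairwise MRF with edge log-potentials
# theta[(i, j)], so log P(x) = sum_ij theta[(i,j)] * x_i * x_j - log Z.
theta = {(0, 1): 1.5, (1, 2): -0.5, (2, 3): 0.8}
d = 4

def log_score(x):
    """Unnormalized log-probability theta^T Phi(x)."""
    return sum(t * x[i] * x[j] for (i, j), t in theta.items())

def log_Z():
    """Partition function: sums over all 2^d assignments -- the
    inference step that makes exact MLE intractable for large d."""
    return np.logaddexp.reduce(
        [log_score(x) for x in itertools.product([0, 1], repeat=d)])

def neg_log_likelihood(data):
    """MLE loss: E_data[-log P_theta(X)]."""
    lZ = log_Z()
    return np.mean([lZ - log_score(x) for x in data])

print(neg_log_likelihood([(1, 1, 0, 0), (0, 1, 1, 1)]))
```

With d = 4 the sum has 16 terms; with d = 100 it would have 2^100, which is why exact MLE is only a gold standard, not a practical default.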
Can we learn without intractable inference?

Parameter Learning: Approximate Inference & Objectives
• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...
• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.

Our Solution (Bradley, Guestrin, 2012)

Method                                     | Sample complexity | Computational complexity | Parallel optimization
Max likelihood estimation (MLE)            | Optimal           | High                     | Difficult
Max composite likelihood estimation (MCLE) | Low               | Low                      | Easy
Max pseudolikelihood estimation (MPLE)     | High              | Low                      | Easy

PAC learnability for many MRFs! We choose the MCLE structure to optimize these trade-offs.

Deriving Pseudolikelihood (MPLE)
MLE: min_θ E_data[−log Pθ(X)] — hard to compute. So replace it!
For P(X) ∝ Ψ12(X1, X2) Ψ24(X2, X4) × ..., each conditional is local, e.g., P(X1 | X\1) ∝ Ψ12(X1, X2).
MPLE (Besag, 1975): min_θ E_data[−Σ_i log Pθ(Xi | X\i)]
Estimate Ψ12 via regression: min_θ12 E_data[−log Pθ12(X1 | X\1)]. Tractable inference!

Pseudolikelihood (MPLE)
Pros:
• No intractable inference!
• Consistent estimator (Besag, 1975)
Cons:
• Less statistically efficient than MLE (Liang & Jordan, 2008)
• No PAC bounds (PAC = Probably Approximately Correct; Valiant, 1984)

Sample Complexity: MLE
Our Theorem: bound on n (# training examples needed):
n ≥ const · (1/Λmin²) · (1/ε²) · log(r/δ)
where Λmin is the minimum eigenvalue of the Hessian of the loss at θ*, r is the number of parameters (length of θ), δ is the probability of failure, and ε is the parameter error (L1).
Recall: MLE requires intractable inference.

Sample Complexity: MPLE
The same bound holds with Λmin := min_i [minimum eigenvalue of the Hessian of component i at θ*].
Recall: MPLE needs only tractable inference. PAC learnability for many MRFs!

Related Work
• Ravikumar et al. (2010): regression Yi ~ X with Ising models; the basis of our theory.
• Liang & Jordan (2008): asymptotic analysis of MLE and MPLE; our bounds match theirs.
• Abbeel et al. (2006): the only previous method with PAC bounds for high-treewidth MRFs. We extend their work (extension to CRFs, algorithmic improvements, analysis); their method is very similar to MPLE.
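In contrast to the brute-force likelihood sketched earlier, here is a sketch of the MPLE objective on the same hypothetical toy model. Each conditional P(Xi | X\i) normalizes over only the two values of Xi, so no sum over 2^d states is ever needed.

```python
import numpy as np

# Same hypothetical toy MRF as before: theta[(i, j)] are edge log-potentials.
theta = {(0, 1): 1.5, (1, 2): -0.5, (2, 3): 0.8}

def log_score(x):
    return sum(t * x[i] * x[j] for (i, j), t in theta.items())

def neg_log_pseudolikelihood(data):
    """MPLE loss: E_data[-sum_i log P(x_i | x_rest)]. Each conditional
    normalizes over just the 2 values of x_i, so inference is trivial."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            x0, x1 = list(x), list(x)
            x0[i], x1[i] = 0, 1
            log_Zi = np.logaddexp(log_score(x0), log_score(x1))
            total += log_Zi - log_score(x)
    return total / len(data)

print(neg_log_pseudolikelihood([(1, 1, 0, 0), (0, 1, 1, 1)]))
```

Each term is just a local logistic-style regression of one variable on its neighbors, which is also why the objective parallelizes so easily.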
Trade-offs: MLE & MPLE
Recall the bound n ≥ const · (1/Λmin²) · (1/ε²) · log(r/δ).
• MLE: larger Λmin ⇒ lower sample complexity, but higher computational complexity.
• MPLE: smaller Λmin ⇒ higher sample complexity, but lower computational complexity.
A sample complexity — computational complexity trade-off.

Trade-offs: MPLE
Joint optimization for MPLE: min_θ E_data[−log Pθ(X1 | X2) − log Pθ(X2 | X1)] — lower sample complexity.
Disjoint optimization for MPLE: min_θ E_data[−log Pθ(X1 | X2)] and min_θ E_data[−log Pθ(X2 | X1)] separately — data-parallel, but it yields 2 estimates of Ψ12, which must be averaged.
A sample complexity — parallelism trade-off.

Synthetic CRFs
Models Pθ(X | E): chains, stars, random graphs, associative grids.
Factor strength = strength of variable interactions.

Predictive Power of Bounds
Errors should be ordered: MLE < MPLE < MPLE-disjoint.
[Figure: L1 parameter error ε vs. # training examples (1 to 10,000) on length-4 chains with random factors of fixed strength; MLE beats MPLE, which beats MPLE-disjoint, as predicted.]
The bounds also predict ε ∝ 1/Λmin.
[Figure: actual ε vs. 1/Λmin on length-6 chains with random factors, 10,000 training examples; MLE error grows with 1/Λmin as predicted.]

Failure Modes of MPLE
Sample complexity: n = O(1/Λmin²). How do Λmin(MLE) and Λmin(MPLE) vary for different models? We examine model diameter, factor strength, and node degree.

Λmin: Model Diameter
[Figure: Λmin ratio MLE/MPLE (higher = MLE better) vs. model diameter, on chains with associative factors of fixed strength.]
Relative MPLE performance is independent of diameter in chains. (Same for random factors.)

Λmin: Factor Strength
[Figure: Λmin ratio MLE/MPLE vs. factor strength, on length-8 chains with associative factors.]
MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)

Λmin: Node Degree
[Figure: Λmin ratio MLE/MPLE vs. node degree, on stars with associative factors of fixed strength.]
MPLE performs poorly with high-degree nodes. (Same for random factors.)
We can often fix this!

Composite Likelihood (MCLE)
MLE: estimate P(Y) all at once.
MPLE: estimate each P(Yi | Y\i) separately.
Something in between? Composite likelihood (MCLE; Lindsay, 1988): estimate each P(Y_Ai | Y\Ai) separately, for variable subsets Ai.
MCLE class: node-disjoint subgraphs which cover the graph.
MCLE generalizes MLE and MPLE, with analogous objective, sample complexity, and joint & disjoint optimization.
Choosing MCLE components (e.g., combs):
• Trees (tractable inference)
• Follow the structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components

Structured MCLE on a Grid
Grid models, associative factors, 10,000 training examples, Gibbs sampling.
[Figure: log loss ratio (other/MLE) vs. grid size |X|: MPLE stays above 1, MCLE (combs) stays near 1. Training time (sec) vs. grid size: MLE is far slower than MCLE (combs) and MPLE.]
MCLE (combs) lowers sample complexity without increasing computation!
MCLE tailored to model structure (see the sketch below).
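Here is a sketch of the composite likelihood objective on the same hypothetical toy model, showing how it interpolates between MPLE and MLE: the normalizer for a component A sums over 2^|A| assignments, so inference cost is controlled by how large we choose the components (combs on a grid are one such choice).

```python
import itertools
import numpy as np

theta = {(0, 1): 1.5, (1, 2): -0.5, (2, 3): 0.8}

def log_score(x):
    return sum(t * x[i] * x[j] for (i, j), t in theta.items())

def neg_log_component(x, A):
    """-log P(x_A | x_rest): normalizes over 2^|A| assignments, so the
    cost of inference is set by the component size |A|."""
    terms = []
    for vals in itertools.product([0, 1], repeat=len(A)):
        x_alt = list(x)
        for idx, v in zip(A, vals):
            x_alt[idx] = v
        terms.append(log_score(x_alt))
    return np.logaddexp.reduce(terms) - log_score(x)

def mcle_loss(data, components):
    """Composite likelihood loss. Singleton components give MPLE;
    one component holding all variables gives MLE."""
    return np.mean([sum(neg_log_component(x, A) for A in components)
                    for x in data])

data = [(1, 1, 0, 0), (0, 1, 1, 1)]
print(mcle_loss(data, [[0], [1], [2], [3]]))  # MPLE
print(mcle_loss(data, [[0, 1], [2, 3]]))      # an intermediate MCLE
print(mcle_loss(data, [[0, 1, 2, 3]]))        # MLE
```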
Also in thesis: tailoring to correlations in the data.

Summary: Parameter Learning
• Finite sample complexity bounds for general MRFs and CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to the model and data

Method                      | Sample complexity | Computational complexity | Parallel optimization
Likelihood (MLE)            | Optimal           | High                     | Difficult
Composite likelihood (MCLE) | Low               | Low                      | Easy
Pseudolikelihood (MPLE)     | High              | Low                      | Easy

Outline: Structure Learning — learning tractable structures.

CRF Structure Learning
P(Y | X) ∝ Π_j Ψj(Y_Cj, X_Dj)
• Structure learning: choose the Y_C, i.e., learn conditional independence.
• Evidence selection: choose the X_D, i.e., select the X relevant to each Y_C.
E.g.: Ψ(Y1, X1), Ψ(Y1, Y3) with Y1: losing sleep? Y2: losing hair? Y3: sick? X1: loud roommate? X2: taking classes? X3: deadline?

Related Work

Previous work          | Method                                 | Structure learning? | Tractable inference? | Evidence selection?
Torralba et al. (2004) | Boosted Random Fields                  | Yes                 | No                   | Yes
Schmidt et al. (2008)  | Block-L1 regularized pseudolikelihood  | Yes                 | No                   | No
Shahaf et al. (2009)   | Edge weights + low-treewidth model     | Yes                 | Yes                  | No

Shahaf et al. (2009) is most similar to our work: they focus on selecting treewidth-k structures; we focus on the choice of edge weight.

Tree CRFs with Local Evidence (Bradley, Guestrin, 2010)
Given: data, and the Xi relevant to each Yi (local evidence): Φ(Yi, Yj, Xij) with |Xij| << |X|.
Goal: learn a tree CRF structure, via a scalable method, with fast inference at test time.

Chow-Liu for MRFs (Chow & Liu, 1968)
Algorithm:
• Weight edges with mutual information: w(i, j) = I(Yi; Yj).
• Choose the max-weight spanning tree.
Chow-Liu finds a max-likelihood structure.

Chow-Liu for CRFs?
Algorithm: weight each possible edge with some w(i, j); choose the max-weight spanning tree.
But what edge weight? It must be efficient to compute.
Global conditional mutual information (CMI): w(i, j) = I(Yi; Yj | X).
Pro: finds a max-likelihood structure (with enough data). Con: intractable for large |X|.

Generalized Edge Weights
Global CMI: w(i, j) = I(Yi; Yj | X) = −H(Yi, Yj | X) + H(Yi | X) + H(Yj | X)
Local Linear Entropy Scores (LLES): w(i, j) = a linear combination of entropies over Yi, Yj, Xi, Xj.
Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).

Heuristic Edge Weights
Local CMI: w(i, j) = I(Yi; Yj | Xi, Xj) = −H(Yi, Yj | Xi, Xj) + H(Yi | Xi, Xj) + H(Yj | Xi, Xj)
Decomposable Conditional Influence (DCI): w(i, j) = −H(Yi, Yj | Xi, Xj) + H(Yi | Xi) + H(Yj | Xj)

Method     | Guarantees                           | Compute w(i,j) tractably? | Comments
Global CMI | Recovers the true tree               | No                        | Shahaf et al. (2009)
Local CMI  | Lower-bounds the likelihood gain     | Yes                       | Fails with strong Yi—Xi potentials
DCI        | Exact likelihood gain for some edges | Yes                       | Best empirically

The generalized Chow-Liu skeleton these weights plug into is sketched below.
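For brevity, this illustrative sketch uses plain empirical mutual information as a stand-in edge weight (the thesis's Local CMI and DCI would instead combine the conditional entropies above); the spanning-tree step is ordinary Kruskal with union-find. All names here are mine, not the thesis's.

```python
import numpy as np
from itertools import combinations

def mutual_information(a, b):
    """Empirical mutual information between two discrete samples.
    A stand-in for the thesis's edge weights (Local CMI / DCI)."""
    mi = 0.0
    for va in set(a):
        for vb in set(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def max_weight_spanning_tree(Y):
    """Generalized Chow-Liu skeleton: weight every candidate edge,
    then greedily add edges (Kruskal + union-find) to form a tree."""
    n_vars = Y.shape[1]
    edges = sorted(((mutual_information(Y[:, i], Y[:, j]), i, j)
                    for i, j in combinations(range(n_vars), 2)),
                   reverse=True)
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(500, 5))  # toy binary data
print(max_weight_spanning_tree(Y))
```

Swapping the weight function is the entire design space here, which is why the choice between Global CMI, Local CMI, and DCI matters so much.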
Synthetic Tests
Trees with associative factors; |Y| = 40; 1000 test samples; error bars: 2 std. errors.
[Figure: fraction of true edges recovered vs. # training examples (0–500): DCI nearly matches Global CMI; Local CMI is lower; Schmidt et al. recovers few edges.]
[Figure: runtime (seconds) vs. # training examples: Global CMI is far slower than DCI, Local CMI, and Schmidt et al.]

fMRI Tests (application & data from Palatucci et al., 2009)
X: fMRI voxels (500); predict Y: semantic features (218).
[Figure: test E[log P(Y | X)] for the disconnected model (Palatucci et al., 2009) vs. DCI variants 1 and 2; DCI improves on the disconnected baseline. Brain image from http://en.wikipedia.org/wiki/File:FMRI.jpg]

Summary: Structure Learning
• Analyzed generalizing Chow-Liu to CRFs
• Proposed a class of edge weights: Local Linear Entropy Scores
• Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data
Generalized Chow-Liu: compute edge weights w12, w23, ..., then take the max-weight spanning tree.

Outline: Parallel Scaling
Parameter learning (pseudolikelihood, canonical parameterization) regresses each variable on its neighbors: P(Xi | X\i). Structure learning (generalized Chow-Liu) computes edge weights via P(Yi, Yj | Xij). Both are solved via parallel regression: multicore sparse regression.

Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)
Useful in the high-dimensional setting (# features >> # examples): Lasso and sparse logistic regression.
Lasso (Tibshirani, 1996). Goal: predict y ∈ ℝ from x ∈ ℝ^d, given samples {(x^i, y^i)}_i.
Objective: min_w ½ ||Xw − y||₂² + λ ||w||₁
The L1 penalty biases towards sparse solutions.

Parallelizing Lasso
Many Lasso optimization algorithms exist: gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding.
Coordinate descent (a.k.a. Shooting; Fu, 1998) is one of the fastest (Yuan et al., 2010).
Parallel optimization options:
• Matrix-vector ops (e.g., interior point): not great empirically.
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d.
• Shooting: inherently sequential?
Shotgun: parallel coordinate descent for L1 regression — a simple algorithm with an elegant analysis.

Shooting: Sequential SCD
min_w F(w), where F(w) = ½ ||Xw − y||₂² + λ ||w||₁
Stochastic coordinate descent (SCD): while not converged, choose a random coordinate j and update wj (closed-form minimization; sketched in code below).

Shotgun: Parallel SCD
Shotgun algorithm (parallel SCD): while not converged, on each of P processors, choose a random coordinate j and update wj (same update as Shooting).
Is SCD inherently sequential? Nice case: uncorrelated features. Bad case: correlated features.

Shotgun: Theory
Convergence Theorem. Assume P ≤ d/ρ, where ρ is the spectral radius of XᵀX. Then after T iterations with P parallel updates,
E[F(w^(T))] − F(w*) ≤ d · (½ ||w*||₂² + 2 F(w^(0))) / (T · P)
(final objective minus optimal objective). This generalizes the bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
Nice case (uncorrelated features): ρ = 1 ⇒ Pmax = d. Bad case (correlated features): ρ = d ⇒ Pmax = 1 (at worst).
So up to a problem-dependent threshold Pmax = d/ρ, the theory predicts linear speedups — and experiments match our theory:
[Figure: T (iterations to convergence) vs. P (parallel updates) for Mug32_singlepixcam (d = 1024, Pmax = 79) and SparcoProblem7 (d = 2560, Pmax = 284); T falls linearly in P up to the predicted threshold.]
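Here is a minimal sequential Shooting sketch (illustrative, not the thesis's implementation): the closed-form coordinate update is a soft-threshold. Shotgun simply runs P of these updates concurrently; per the theorem above, that is safe up to P ≤ d/ρ.

```python
import numpy as np

def soft_threshold(c, lam):
    return np.sign(c) * max(abs(c) - lam, 0.0)

def shooting(X, y, lam, iters=10000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for
    min_w 0.5 * ||Xw - y||^2 + lam * ||w||_1.
    Shotgun would run P such updates in parallel per round."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                  # residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)  # a_j = ||X_j||^2
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        j = rng.integers(d)        # choose a random coordinate
        c = X[:, j] @ r + col_sq[j] * w[j]
        w_new = soft_threshold(c, lam) / col_sq[j]
        r -= (w_new - w[j]) * X[:, j]  # update residual in O(n)
        w[j] = w_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]      # a sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=100)
print(np.round(shooting(X, y, lam=0.5), 2))
```

Each update touches only one column of X and the shared residual, which is what makes the parallel version both cheap per update and sensitive to feature correlation.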
Lasso Experiments
Compared many algorithms: interior point (L1_LS), shrinkage (FPC_AS, SpaRSA), projected gradient (GPSR_BB), iterative hard thresholding (Hard_IO); also ran GLMNET, LARS, SMIDAS.
35 datasets; λ = 0.5, 10; Shooting, and Shotgun with P = 8 (multicore).
Dataset groups: Single-Pixel Camera (Pmax = 1); Sparco (van den Berg et al., 2009) (Pmax ∈ [1, 8683]); Sparse Compressed Imaging (Pmax ∈ [1432, 5889]); Large, Sparse Datasets (Pmax ∈ [107, 1036]).
Shotgun proves most scalable & robust.

Shotgun: Speedup
Aggregated results from all tests.
[Figure: speedup vs. # cores (1–8): optimal line; Lasso iteration speedup is near-optimal; logistic regression time speedup is good; Lasso time speedup lags.]
Lasso iteration speedup is great, but its time speedup is not so good on average — even though we are doing fewer iterations! Explanation: the memory wall (Wulf & McKee, 1995) — the memory bus gets flooded. Logistic regression uses more FLOPS per datum, so the extra computation hides memory latency, giving better speedups.

Summary: Parallel Regression
• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.
Shotgun decomposes computation by coordinate updates, trading a little extra computation for a lot of parallelism.

Recall: Thesis Statement
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization. Decompositions use model structure & locality; trade-offs use model- and data-specific methods.
• Parameter learning: structured composite likelihood (MLE, MCLE, MPLE)
• Structure learning: generalized Chow-Liu
• Parallel regression: Shotgun (parallel coordinate descent)

Future Work: Unified System
• Structure learning + parameter learning: combine L1 structure learning with structured MCLE? Automatically choose the MCLE structure & parallelization strategy to optimize trade-offs, tailored to model & data.
• Learning trees: learn trees for parameter estimators?
• Parallel regression: extend Shotgun (multicore) to the distributed setting, with limited communication; handle complex objectives (e.g., MCLE).

Summary
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
Parameter learning — structured composite likelihood:
• Finite sample complexity bounds; empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
• Analyzed the canonical parameterization of Abbeel et al. (2006)
Structure learning — generalizing Chow-Liu to CRFs:
• Proposed a class of edge weights: Local Linear Entropy Scores (insufficient for recovering trees)
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data
Parallel regression — Shotgun, parallel coordinate descent on multicore:
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods); our theory predicts empirical behavior well; one of the most scalable methods

Thank you!