Parallel Coordinate Descent for L1-Regularized Loss Minimization
Joseph K. Bradley, Danny Bickson, Aapo Kyrola, Carlos Guestrin (Carnegie Mellon)

L1-Regularized Regression
- Example: predicting stock volatility (label) from bigrams in financial reports (features) (Kogan et al., 2009): 5x10^6 features, 3x10^4 samples.
- Lasso (Tibshirani, 1996); sparse logistic regression (Ng, 2004).
- Produces sparse solutions; useful in high-dimensional settings (# features >> # examples).

From Sequential Optimization...
- Many algorithms: gradient descent, stochastic gradient, interior point methods, hard/soft thresholding, ...
- Coordinate descent (a.k.a. Shooting (Fu, 1998)) is one of the fastest algorithms (Friedman et al., 2010; Yuan et al., 2010).
- But what about big problems (5x10^6 features, 3x10^4 samples)?

...to Parallel Optimization
- We use the multicore setting: shared memory, low latency.
- We could parallelize:
  - Matrix-vector ops, e.g., interior point methods: not great empirically.
  - W.r.t. samples, e.g., stochastic gradient (Zinkevich et al., 2010): best for many samples, not many features; the analysis does not cover L1.
  - W.r.t. features, e.g., shooting: inherently sequential? Surprisingly, no!

Our Work
- Shotgun: parallel coordinate descent for L1-regularized regression.
- Parallel convergence analysis: linear speedups up to a problem-dependent limit.
- Large-scale empirical study: 37 datasets, 9 algorithms.

Lasso (Tibshirani, 1996)
- Goal: regress $y \in \mathbb{R}$ on $a \in \mathbb{R}^d$, given samples $\{(a_i, y_i)\}_i$.
- Objective: $\min_x F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$ (squared error + L1 regularization).

Shooting: Sequential SCD
- Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009):
  While not converged:
    choose a random coordinate j;
    update x_j (closed-form minimization of F along coordinate j).

Shotgun: Parallel SCD
- Shotgun (parallel SCD):
  While not converged:
    on each of P processors,
      choose a random coordinate j;
      update x_j (same update as for Shooting).

Is SCD inherently sequential?
- Nice case: uncorrelated features. Bad case: correlated features.
- Coordinate update: $x_j \leftarrow x_j + \delta x_j$ (closed-form minimization).
- Collective update: $\Delta x$ is the vector whose only nonzero entries are the individual updates $\delta x_{i_j}$ of the chosen coordinates.
- Theorem: if $A$ is normalized so that $\mathrm{diag}(A^T A) = 1$, then the decrease of the objective satisfies
  $F(x + \Delta x) - F(x) \le -\tfrac{1}{2}\sum_{i_j \in \mathcal{P}} (\delta x_{i_j})^2 + \tfrac{1}{2}\sum_{i_j, i_k \in \mathcal{P},\, i_j \ne i_k} (A^T A)_{i_j i_k}\, \delta x_{i_j}\, \delta x_{i_k}$,
  where $\mathcal{P}$ is the set of coordinates updated in parallel. The first term is the sequential progress of the updated coordinates; the second term is the interference between parallel updates.
- Nice case: uncorrelated features, $(A^T A)_{jk} = 0$ for $j \ne k$ (if $A^T A$ is centered): no interference.
- Bad case: correlated features, $(A^T A)_{jk} \ne 0$: interference can cancel progress.
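To make the update concrete, here is a minimal sketch (not the authors' implementation) of the closed-form coordinate update used by Shooting, plus a synchronous stand-in for a Shotgun step that applies P such updates at once. It assumes a dense numpy design matrix with columns normalized so that diag(A^T A) = 1, as in the analysis; the function names (soft_threshold, shooting_scd, shotgun_step) are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding: argmin_x 0.5*(x - z)**2 + lam*|x|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def shooting_scd(A, y, lam, iters=10000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for the Lasso.
    Assumes columns of A are normalized so that diag(A^T A) = 1."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d)
    r = y - A @ x                          # residual, maintained incrementally
    for _ in range(iters):
        j = rng.integers(d)                # choose a random coordinate
        # Closed-form 1-D minimization of F along coordinate j:
        xj_new = soft_threshold(x[j] + A[:, j] @ r, lam)
        delta = xj_new - x[j]
        if delta != 0.0:
            r -= delta * A[:, j]           # keep the residual up to date
            x[j] = xj_new
    return x

def shotgun_step(A, y, lam, x, P, rng):
    """One Shotgun step: P coordinate updates, each computed from the same
    current x, then applied together (a synchronous stand-in for the
    asynchronous, atomic-operation implementation described in the talk)."""
    r = y - A @ x
    coords = rng.integers(A.shape[1], size=P)
    for j in np.unique(coords):
        x[j] = soft_threshold(x[j] + A[:, j] @ r, lam)
    return x
```

With P = 1 the Shotgun step reduces exactly to a Shooting update; the interference term in the theorem above measures how much the P simultaneous updates can work against each other.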
Convergence Analysis
- Lasso: $\min_{x \in \mathbb{R}^d} F(x)$ where $F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$.
- Main Theorem (Shotgun convergence): assume the number of parallel updates satisfies $P < d/\rho + 1$, where $\rho$ is the spectral radius of $A^T A$. Then
  $\mathbb{E}\big[F(x^{(T)})\big] - F(x^*) \le \frac{d\,\big(\tfrac{1}{2}\|x^*\|_2^2 + F(x^{(0)})\big)}{T \cdot P}$,
  where $F(x^{(T)})$ is the final objective after $T$ iterations, $F(x^*)$ is the optimal objective, and $P$ is the number of parallel updates.
- Generalizes the bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
- Nice case: uncorrelated features, $\rho = 1 \Rightarrow P_{\max} = d$.
- Bad case: correlated features, $\rho = d \Rightarrow P_{\max} = 1$ (at worst).

Convergence Analysis: Experiments match the theory!
- Up to a threshold near Pmax, the predicted linear speedups appear: iterations to convergence fall roughly as 1/P.
- Mug32_singlepixcam: d = 1024, ρ = 6.4967, Pmax = 158. [Log-log plot: iterations to convergence (10 to 10000) vs. P, # simulated parallel updates (1 to 100).]
- Ball64_singlepixcam: d = 4096, ρ = 2047.8, Pmax = 3. [Plot: iterations to convergence (14000 to 56870) vs. P, # simulated parallel updates (1 to 4).]
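The per-dataset Pmax values quoted here (and on the experiment slides that follow) come straight from the theorem's condition P < d/ρ + 1. Below is a minimal sketch of that calculation, assuming a dense numpy design matrix with unit-normalized columns; the function name shotgun_pmax is illustrative, and for the large sparse datasets one would estimate ρ with a sparse eigensolver instead.

```python
import numpy as np

def shotgun_pmax(A):
    """Largest integer P satisfying the theorem's condition P < d/rho + 1,
    where rho is the spectral radius of A^T A (assumes columns of A are
    normalized so that diag(A^T A) = 1)."""
    d = A.shape[1]
    rho = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A^T A
    return int(np.ceil(d / rho + 1)) - 1     # largest integer strictly below d/rho + 1

# Uncorrelated (orthonormal) features: rho = 1, so Pmax = d.
# Fully correlated features: rho = d, so Pmax = 1.
```

Plugging in the single-pixel-camera designs above reproduces the quoted limits: d = 1024 with ρ ≈ 6.50 gives Pmax = 158, and d = 4096 with ρ ≈ 2047.8 gives Pmax = 3.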
Thus far...
- Shotgun: the naive parallelization of coordinate descent works!
- Theorem: linear speedups up to a problem-dependent limit.
- Now for some experiments...

Experiments: Lasso
- 7 algorithms:
  - Shotgun, P = 8 (multicore)
  - Shooting (Fu, 1998)
  - Interior point (Parallel L1_LS) (Kim et al., 2007)
  - Shrinkage (FPC_AS, SpaRSA) (Wen et al., 2010; Wright et al., 2009)
  - Projected gradient (GPSR_BB) (Figueiredo et al., 2008)
  - Iterative hard thresholding (Hard_l0) (Blumensath & Davies, 2009)
  - Also ran: GLMNET, LARS, SMIDAS
- 35 datasets: # samples n ∈ [128, 209432]; # features d ∈ [128, 5845762]; λ ∈ {0.5, 10}.
- Optimization details: pathwise optimization (continuation); asynchronous Shotgun with atomic operations.
- Hardware: 8-core AMD Opteron 8384 (2.69 GHz).

Experiments: Lasso — Sparse Compressed Imaging and Sparco
- Sparse compressed imaging: n ∈ [477, 32768], d ∈ [954, 65536], Pmax ∈ [2865, 11779] (avg 7688).
- Sparco (van den Berg et al., 2009): n ∈ [128, 29166], d ∈ [128, 29166], Pmax ∈ [3, 17366] (avg 2987).
- [Runtime scatter plots: Shotgun runtime (sec) vs. other algorithms' runtime (sec); points above the diagonal mean Shotgun is faster. On one (data, λ) pair: Shotgun 1.212 s vs. Shooting 3.406 s.]
- Shotgun & Parallel L1_LS used 8 cores.

Experiments: Lasso — Single-Pixel Camera and Large, Sparse Datasets
- Single-pixel camera (Duarte et al., 2008): n ∈ [410, 4770], d ∈ [1024, 16384], Pmax = 3.
- Large, sparse datasets: n ∈ [30465, 209432], d ∈ [209432, 5845762], Pmax ∈ [214, 2072] (avg 1143).
- [Runtime scatter plots: Shotgun runtime (sec) vs. other algorithms' runtime (sec); Shotgun faster above the diagonal.]
- Shotgun & Parallel L1_LS used 8 cores.
- Takeaway: Shooting is one of the fastest algorithms, and Shotgun provides additional speedups.

Experiments: Logistic Regression
- Coordinate Descent Newton (CDN) (Yuan et al., 2010): uses line search; extensive tests show CDN is very fast.
- Algorithms:
  - Shooting CDN
  - Shotgun CDN
  - Stochastic Gradient Descent (SGD), with lazy shrinkage updates (Langford et al., 2009); used the best of 14 learning rates.
  - Parallel SGD (Zinkevich et al., 2010): averages the results of 8 SGD instances run in parallel.

Experiments: Logistic Regression — Zeta* dataset (low-dimensional setting)
- λ = 1, n = 500,000, d = 2,000.
- [Plot: objective value (lower is better) vs. time (0 to 1600 sec); curves from highest to lowest objective: Shooting CDN, SGD, Parallel SGD, Shotgun CDN.]
- *From the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
- Shotgun & Parallel SGD used 8 cores.

Experiments: Logistic Regression — rcv1 dataset (high-dimensional setting)
- rcv1 (Lewis et al., 2004): λ = 1, n = 18,217, d = 44,504.
- [Plot: objective value (lower is better) vs. time (0.1 to 100 sec, log scale); SGD and Parallel SGD overlap, Shooting CDN is slower, and Shotgun CDN reaches low objective values fastest.]
- Shotgun & Parallel SGD used 8 cores.
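For reference, the Parallel SGD baseline above follows Zinkevich et al. (2010): run several SGD instances independently and average their weight vectors. Below is a minimal sketch, assuming numpy, labels in {-1, +1}, and a plain per-example shrinkage step in place of the lazy (truncated-gradient) shrinkage of Langford et al. (2009) used in the experiments; the instances run sequentially here for clarity, whereas in the experiments each ran on its own core. Function names are illustrative.

```python
import numpy as np

def sgd_logreg_l1(A, y, lam, lr=0.1, epochs=1, seed=0):
    """Plain SGD for L1-regularized logistic regression (labels in {-1, +1}).
    Simplified sketch: shrinkage is applied after every example rather than
    lazily as in the truncated-gradient baseline."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = np.clip(y[i] * (A[i] @ w), -30.0, 30.0)  # clip for stability
            g = -y[i] * A[i] / (1.0 + np.exp(margin))          # logistic loss gradient
            w -= lr * g
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # shrinkage step
    return w

def parallel_sgd(A, y, lam, n_workers=8, **kwargs):
    """Parallel SGD in the style of Zinkevich et al. (2010): run independent
    SGD instances (one per core in practice) and average their weights."""
    ws = [sgd_logreg_l1(A, y, lam, seed=k, **kwargs) for k in range(n_workers)]
    return np.mean(ws, axis=0)
```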
Shotgun: Self-speedup
- Aggregated results from all tests: speedup vs. # cores (1 to 8), against the optimal (linear) speedup.
- Lasso iteration speedup is close to optimal: we really are doing fewer iterations.
- Lasso time speedup is not so great; logistic regression time speedup is better.
- Explanation: the memory wall (Wulf & McKee, 1995): the memory bus gets flooded. Logistic regression uses more FLOPS per datum, so the extra computation hides memory latency and yields better speedups.

Conclusions
- Shotgun: parallel coordinate descent.
  - Linear speedups up to a problem-dependent limit.
  - Significant speedups in practice.
- Large-scale empirical study: Lasso & sparse logistic regression; 37 datasets, 9 algorithms.
- Compared with parallel methods: parallel matrix/vector ops and parallel stochastic gradient descent.
- Code & data: http://www.select.cs.cmu.edu/projects

Future Work
- Hybrid Shotgun + parallel SGD.
- Problems with more FLOPS per datum, e.g., Group Lasso (Yuan and Lin, 2006).
- Alternate hardware, e.g., graphics processors.

Thanks!

References
Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
Figueiredo, M.A.T., Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2008.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
Fu, W.J. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
Kim, S.J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In NAACL-HLT, 2009.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009.
Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
Ng, A.Y. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
Shalev-Shwartz, S. and Tewari, A. Stochastic methods for l1 regularized loss minimization. In ICML, 2009.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yılmaz, O. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, 2009.
Wen, Z., Yin, W., Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010.
Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.
Yuan, G.X., Chang, K.W., Hsieh, C.J., and Lin, C.J. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 11:3183–3234, 2010.
Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.

Backup Slides
- Experiments: Logistic Regression, Zeta dataset (λ = 1, n = 500,000, d = 2,000): objective value and test error vs. time (0 to 1600 sec) for Shooting CDN, SGD, Parallel SGD, and Shotgun CDN.
- Experiments: Logistic Regression, rcv1 dataset (λ = 1, n = 18,217, d = 44,504): objective value and test error vs. time (0.1 to 100 sec) for the same algorithms.
- Shotgun: Improving Self-speedup. Lasso time speedup, Lasso iteration speedup, and logistic regression time speedup vs. # cores (max/mean/min across datasets). Logistic regression uses more FLOPS per datum, giving better speedups on average.
- Shotgun: Self-speedup. Lasso time speedup vs. iteration speedup; the memory bus gets flooded (memory wall, Wulf & McKee, 1995), but we are doing fewer iterations.