Complexity of proximal (quasi-)Newton methods
Katya Scheinberg, Lehigh University, katyas@lehigh.edu
with Xiaocheng Tang
NIPS, Greedy 2013

Introduction
• Proximal gradient methods for large-scale composite optimization have known convergence rates, in both the exact and the inexact case.
• In practice they have had limited success, due to slow convergence.
• Recently, carefully implemented second-order methods (proximal Newton) have been developed to speed up convergence.
• The subproblems arising in second-order methods are difficult, hence should be solved inexactly.
• Recent theory establishes convergence and local convergence rates of the proximal Newton method under some assumptions on the accuracy of the subproblems.
• Verifying sufficient accuracy of the subproblems is a key difficulty.
• We show global convergence rates for a general second-order framework with inexact subproblem optimization.
• We show that using randomized coordinate descent for the subproblems gives the error decay required by the global rates.

Prox-gradient methods for composite optimization
• Minimize the approximation Q_{f,µ}(y) on each iteration:
  Q_{f,µ}(y) = f(x_k) + ∇f(x_k)^T (y − x_k) + (µ/2)||y − x_k||^2 + λ||y||_1
• The minimizer is given by the shrinkage (soft-thresholding) operator: a closed-form solution, O(n) effort (see the first code sketch below).

Proximal (quasi-)Newton for composite optimization
1. Build a model of the objective by approximating the smooth part, f(x), by a quadratic function around the current iterate x_k.
2. "Optimize" the resulting model to obtain a trial point x_trial.
3. Evaluate the objective function F(x_trial).
   – If sufficient decrease has been achieved, accept the trial point as the new iterate, x_{k+1} = x_trial.
   – Otherwise cut back and obtain a new trial point.
4. Return to Step 1.
(M. Schmidt, D. Kim, and S. Sra, NIPS OPT 2010; C.-J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar, NIPS 2011; J. D. Lee, Y. Sun, and M. A. Saunders, NIPS 2012; M. Wytock and Z. Kolter, ICML 2013)

Proximal Newton and QN methods for composite optimization
• Minimize the approximation Q_k(x) ≈ F(x) on each iteration:
  Q_k(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T H_k (x − x_k) + λ||x||_1,
  where H_k is a Hessian approximation.
• The resulting subproblem is not simple: e.g., when the regularizer λ||x||_p is λ||x||_1, the subproblem is itself a Lasso problem.
• Solution: solve the subproblem inexactly, to within a controlled accuracy.
• Key result: with progressively more accurate subproblem solutions, the inexact proximal (quasi-)Newton framework retains global sublinear convergence rates. Similar to M. Schmidt, N. Le Roux, and F. Bach, NIPS 2011, but for proximal quasi-Newton.

Progressively more accurate solutions
• The main algorithm builds the quasi-Newton Hessian matrix H_k and the approximation Q_k; each subproblem is optimized inexactly, with some error.
• [Schematic: a sequence of outer iterations, each with an inner subproblem solve whose iteration count is marked "?". How many inner iterations are needed?]

Popular, simple stopping rule for subproblems
• [Schematic: outer iterations k, k+1, k+2, k+3, k+4, …, each with an inexactly solved subproblem.] For instance, Hsieh et al., NIPS 2011; Wytock and Kolter, ICML 2013.

Simple rule works if a linearly convergent method is used for the subproblem
• The required error decay follows when the inner solver converges linearly, using results on randomized coordinate descent (RCD) by P. Richtárik and M. Takáč (see the second code sketch below).
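To make the shrinkage step concrete, here is the first code sketch: a minimal Python rendering of one proximal gradient step, assuming the composite objective F(x) = f(x) + λ||x||_1, a user-supplied gradient grad_f, and a step parameter mu that upper-bounds the Lipschitz constant of ∇f. The function names are illustrative, not from the talk.

```python
import numpy as np

def soft_threshold(z, tau):
    """Shrinkage operator: the closed-form solution of
    min_y 0.5*||y - z||^2 + tau*||y||_1, computed in O(n)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_gradient_step(x, grad_f, lam, mu):
    """One prox-gradient step: minimize Q_{f,mu}(y) = f(x)
    + grad_f(x)^T (y - x) + (mu/2)*||y - x||^2 + lam*||y||_1,
    whose minimizer is the shrinkage of a plain gradient step."""
    return soft_threshold(x - grad_f(x) / mu, lam / mu)
```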
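And the second code sketch: a simplified rendering of the inexact proximal quasi-Newton framework, with the k-th subproblem solved by k randomized coordinate descent steps (the schedule stated in the conclusions below). This is an illustration under stated assumptions, not the authors' implementation: the acceptance test is a plain decrease check on F, "cutting back" is modeled by doubling the model curvature, and the quasi-Newton update of H_k is left out.

```python
import numpy as np

def soft_threshold(z, tau):  # shrinkage operator, as in the previous sketch
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def rcd_subproblem(x_k, g, H, lam, n_iters, rng):
    """Approximately minimize the piecewise-quadratic model
    Q_k(x) = g^T (x - x_k) + 0.5 (x - x_k)^T H (x - x_k) + lam*||x||_1
    by randomized coordinate descent (linearly convergent in expectation,
    hence summable subproblem errors across outer iterations)."""
    x = x_k.copy()
    Hd = np.zeros_like(x)              # maintains H @ (x - x_k)
    for _ in range(n_iters):
        i = rng.integers(x.size)
        grad_i = g[i] + Hd[i]          # i-th partial of the smooth part
        x_i = soft_threshold(x[i] - grad_i / H[i, i], lam / H[i, i])
        Hd += H[:, i] * (x_i - x[i])   # O(n) here; O(1) with a low-rank H
        x[i] = x_i
    return x

def inexact_prox_qn(f, grad_f, lam, x0, H0, n_outer=50, seed=0):
    """Outer loop: form the model with H_k, solve the k-th subproblem
    inexactly with k RCD steps, accept on decrease of F = f + lam*||.||_1,
    and otherwise 'cut back' by inflating the model curvature."""
    rng = np.random.default_rng(seed)
    F = lambda x: f(x) + lam * np.abs(x).sum()
    x, H = x0, H0
    for k in range(1, n_outer + 1):
        g = grad_f(x)
        x_trial = rcd_subproblem(x, g, H, lam, n_iters=k, rng=rng)
        while F(x_trial) > F(x):       # simplified sufficient-decrease test
            H = 2.0 * H                # cut back: a more conservative model
            x_trial = rcd_subproblem(x, g, H, lam, n_iters=k, rng=rng)
        x = x_trial                    # a real run would also update H (BFGS)
    return x
```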
Limited memory quasi-Newton approximations and randomized coordinate descent
• Many proximal Newton methods (also known as SQA) use coordinate descent (CD) to optimize the subproblems when the regularizer λ||x||_p is λ||x||_1, but with no convergence rate guarantees.
• Using randomized coordinate descent, we get linear convergence rates (in expectation) and summable errors, hence global rates for the entire algorithm.
• Using a low-rank Hessian matrix, the complexity of each CD iteration is constant and small (see the sketch at the end).
• For the k-th subproblem, only k CD iterations are needed.
• Open question: can these ideas be extended to solving subproblems inexactly, but with known rates, for other regularizers (for instance, indicator functions) using conditional gradient methods?
• Future work: analyze the greedy active-set selection strategy that we use in our practical implementation.

Thank you!

"Complexity of Inexact Proximal Newton Methods", K. Scheinberg and X. Tang.
http://www.optimization-online.org/DB_FILE/2013/11/4134.pdf
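Finally, a sketch of the "constant cost per CD iteration" point from the conclusions. It assumes a compact low-rank quasi-Newton form H_k = γI + A Aᵀ with A of size n×m and m ≪ n (the paper's exact compact form may differ); maintaining the m-vector v = Aᵀ(x − x_k) makes each coordinate step cost O(m), independent of n.

```python
import numpy as np

def rcd_lowrank_subproblem(x_k, g, gamma, A, lam, n_iters, rng):
    """RCD on Q(x) = g^T (x - x_k) + 0.5 (x - x_k)^T H (x - x_k) + lam*||x||_1
    with the assumed low-rank form H = gamma*I + A @ A.T.
    Each coordinate step costs O(m), constant with respect to n."""
    n, m = A.shape
    diag_H = gamma + (A ** 2).sum(axis=1)    # H_ii, precomputed once in O(n*m)
    x = x_k.copy()
    d = np.zeros(n)                          # d = x - x_k
    v = np.zeros(m)                          # v = A.T @ d, kept up to date
    for _ in range(n_iters):
        i = rng.integers(n)
        grad_i = g[i] + gamma * d[i] + A[i] @ v   # [g + H d]_i in O(m)
        z = x[i] - grad_i / diag_H[i]
        x_new = np.sign(z) * max(abs(z) - lam / diag_H[i], 0.0)
        step = x_new - x[i]
        v += A[i] * step                     # O(m) bookkeeping, no n-vector work
        d[i] += step
        x[i] = x_new
    return x
```

With limited memory ℓ, m is on the order of 2ℓ, so each coordinate step touches only O(ℓ) numbers regardless of the problem dimension.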