Complexity of proximal (quasi-)Newton methods

Katya Scheinberg
Lehigh University
katyas@lehigh.edu
with Xiaocheng Tang
NIPS, Greedy 2013
Introduction
•  Proximal gradient methods for large-scale composite optimization have known convergence rates for both the exact and inexact cases.
•  In practice they have had limited success, due to slow convergence.
•  Recently, carefully implemented second-order methods (proximal Newton) have been developed to speed up convergence.
•  Subproblems of second-order methods are difficult, hence they should be solved inexactly.
•  Recently, theory was developed for the convergence and local convergence rates of the proximal Newton method under some assumptions on the accuracy of the subproblems.
•  Verifying sufficient accuracy of the subproblems is a key difficulty.
•  We show global convergence rates for a general second-order framework with inexact subproblem optimization.
•  We show that using randomized coordinate descent for the subproblems gives appropriate error decay for the global rates.
Prox-gradient methods for composite optimization
•  Minimize the approximation function $Q_{f,\mu}(x)$ on each iteration:
$$
x_{k+1} = \arg\min_x \; Q_{f,\mu}(x) := f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2\mu}\|x - x_k\|_2^2 + \lambda \|x\|_1.
$$
The minimizer is given by the shrinkage (soft-thresholding) operator: a closed-form solution, O(n) effort.
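A minimal sketch of this step in Python (function names are illustrative; the model above assumes the ℓ1 regularizer):

```python
import numpy as np

def shrinkage(v, t):
    """Soft-thresholding: closed-form minimizer of (1/(2t))*||x - v||^2 + ||x||_1,
    applied componentwise in O(n)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_step(x_k, grad_k, mu, lam):
    """One proximal gradient step: the minimizer of Q_{f,mu} above is the
    shrinkage operator applied to a plain gradient step."""
    return shrinkage(x_k - mu * grad_k, mu * lam)
```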
Proximal (quasi-)Newton for composite optimization
1.  Build a model of the objective by approximating the smooth part, f(x), by a quadratic function around the current iterate x_k.
2.  "Optimize" the resulting model to obtain a trial point x_trial.
3.  Evaluate the objective function at x_trial.
    –  If sufficient decrease has been achieved, accept the trial point as the new iterate, x_{k+1} = x_trial.
    –  Otherwise cut back and obtain a new trial point.
4.  Return to Step 1.

See M. Schmidt, D. Kim, and S. Sra, NIPS OPT 2010; C.-J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar, NIPS 2011; J. D. Lee, Y. Sun, and M. A. Saunders, NIPS 2012; M. Wytock and Z. Kolter, ICML 2013.
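A minimal sketch of this four-step loop, under assumptions: `model_min` is a hypothetical callable standing in for steps 1 and 2 (building H_k and inexactly minimizing the model around x with step scale mu), and the prox-style sufficient-decrease test is one common choice, not necessarily the paper's exact condition. The test is checked on the full composite objective F = f + g.

```python
import numpy as np

def prox_quasi_newton(f, g, model_min, x0, max_iter=50, beta=0.5, sigma=1e-4):
    """Hypothetical outer loop for the 4-step framework above."""
    F = lambda z: f(z) + g(z)          # full composite objective
    x, mu = x0, 1.0
    for _ in range(max_iter):
        x_trial = model_min(x, mu)     # steps 1-2: build and "optimize" the model
        for _ in range(30):            # step 3: cut back until sufficient decrease
            if F(x_trial) <= F(x) - (sigma / mu) * np.linalg.norm(x_trial - x) ** 2:
                break
            mu *= beta                 # cut back and obtain a new trial point
            x_trial = model_min(x, mu)
        x = x_trial                    # accept the trial point; step 4: repeat
    return x
```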
Proximal Newton and QN methods for composite optimization
•  Minimize the approximation function $Q(x) \approx F(x)$ on each iteration:
$$
x_{k+1} = \arg\min_x \; Q(x) := f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2}(x - x_k)^\top H_k (x - x_k) + \lambda \|x\|_1,
$$
where H_k is a Hessian approximation. The resulting subproblem is not simple; e.g., it is a Lasso problem for $\lambda\|x\|_p = \lambda\|x\|_1$.
•  Solution: solve the subproblem inexactly, with controlled accuracy.
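One way to see the Lasso claim (a standard completing-the-square step, assuming $H_k \succ 0$ with factorization $H_k = R^\top R$):
$$
\min_x\; \nabla f(x_k)^\top (x - x_k) + \tfrac12 (x - x_k)^\top H_k (x - x_k) + \lambda\|x\|_1
\;\;\Longleftrightarrow\;\;
\min_x\; \tfrac12\|Rx - c\|_2^2 + \lambda\|x\|_1,
$$
with $c = R x_k - R^{-\top}\nabla f(x_k)$ and constants dropped, i.e., exactly a Lasso problem.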
Key Result: a global sublinear convergence rate for the inexact proximal quasi-Newton framework.
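A hedged sketch of the flavor of this result, following the linked Scheinberg and Tang paper (the constant $C$ is schematic, absorbing problem-dependent quantities):
$$
F(x_k) - F(x^*) \;\le\; \frac{C}{k},
$$
provided the subproblem errors decay appropriately (e.g., are summable).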
Similar to M. Schmidt, N. Le Roux, and F. Bach, NIPS 2011, but for proximal quasi-Newton: progressively more accurate subproblem solutions are required.
Main algorithm: builds the quasi-Newton Hessian matrix H_k and approximation Q_k on each outer iteration; each subproblem is solved by inexact optimization, with some error.
[Diagram: a sequence of outer iterations, each with an unknown number of inner iterations for the subproblem ("How many iterations?").]
Popular, simple stopping rule for subproblems: run a prescribed, increasing number of inner iterations (k on the k-th subproblem, then k+1, k+2, ...).
[Diagram: the same schematic, with the number of inner iterations per subproblem now fixed in advance rather than unknown.]
(For instance, Hsieh et al., NIPS 2011; Wytock and Kolter, ICML 2013.)
Simple rule works if a linearly convergent method is used for the subproblem.
[Diagram: the same schematic as above.]
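The reasoning this slide sketches, in one step: suppose the inner solver converges linearly with rate $\rho \in (0,1)$ and the initial gap on each subproblem is bounded by a constant $C$ (the notation $\varepsilon_k$ for the $k$-th subproblem error is ours). Running $m_k$ inner iterations on the $k$-th subproblem gives
$$
\varepsilon_k \;\le\; C\,\rho^{m_k},
\qquad
m_k = k \;\Longrightarrow\; \sum_{k\ge 1} \varepsilon_k \;\le\; \frac{C\rho}{1-\rho} < \infty,
$$
so prescribing k inner iterations on the k-th subproblem already yields geometrically decaying, hence summable, errors.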
(Using results on RCD by P. Richtárik and M. Takáč.)
Limited-memory quasi-Newton approximations and randomized coordinate descent
•  Many proximal Newton methods (also SQA) use coordinate descent (CD) to optimize the subproblems when ||x||_p = ||x||_1, but with no convergence rates!
•  Using randomized coordinate descent we get linear convergence rates (in expectation) and summable errors, hence global rates for the entire algorithm.
•  Using a low-rank Hessian approximation, the complexity of each CD iteration is constant and small. For the k-th subproblem only k CD iterations are needed (see the sketch below).
•  Extend these ideas to solving subproblems inexactly, but with known rates, for other ||x||_p (for instance, indicator functions) using conditional gradient methods?
•  Analyze the greedy active-set selection strategy which we use in the practical implementation.
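A minimal sketch of the constant-cost CD step from the third bullet, under assumptions: the Hessian approximation is taken in a hypothetical low-rank-plus-diagonal form H = gamma*I + V V^T with V of size n-by-r (compact limited-memory quasi-Newton matrices have this flavor, though the exact compact L-BFGS formula has more structure), and `rcd_subproblem` is an illustrative name, not the paper's implementation. Maintaining u = V^T d keeps each coordinate step at O(r) cost, independent of n:

```python
import numpy as np

rng = np.random.default_rng(0)

def rcd_subproblem(x, grad, V, gamma, lam, num_iters):
    """Randomized CD on the model  q(d) = grad'd + (1/2) d'(gamma*I + V V')d
    + lam*||x + d||_1.  Each step touches one coordinate and costs O(r)."""
    n, r = V.shape
    d = np.zeros(n)
    u = np.zeros(r)                                 # invariant: u = V' d
    diag = gamma + np.einsum('ij,ij->i', V, V)      # H_ii = gamma + ||V_i||^2
    for _ in range(num_iters):
        i = rng.integers(n)                         # coordinate chosen uniformly at random
        b = grad[i] + gamma * d[i] + V[i] @ u       # i-th partial derivative of the smooth part
        w0 = x[i] + d[i]
        a = w0 - b / diag[i]
        w = np.sign(a) * max(abs(a) - lam / diag[i], 0.0)  # 1-D soft-threshold minimizer
        z = w - w0                                  # step in coordinate i
        d[i] += z
        u += z * V[i]                               # restore the invariant in O(r)
    return d
```

With the memory r fixed and small (e.g., around 10), the per-iteration cost does not grow with n, which is what makes the rule of k CD iterations on the k-th subproblem affordable.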
Thank you!
“Complexity of Inexact Proximal Newton methods”, K. Scheinberg and X. Tang
http://www.optimization-online.org/DB_FILE/2013/11/4134.pdf