A Stochastic Quasi-Newton Method for Large-Scale Learning
Jorge Nocedal, Northwestern University
With S. Hansen, R. Byrd and Y. Singer
IPAM, UCLA, Feb 2014

Goal
Propose a robust quasi-Newton method that operates in the stochastic approximation regime:
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} F(w_k)
• purely stochastic method (not batch) – to compete with the stochastic gradient (SG) method
• Full non-diagonal Hessian approximation
• Scalable to millions of parameters

Outline
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
Are iterations of this form viable?
- theoretical considerations; iteration costs
- differencing noisy gradients?
Key ideas: compute curvature information pointwise at regular intervals and build on the strength of BFGS updating, recalling that it is an overwriting (and not an averaging) process.
- results on text and speech problems
- examine both training and testing errors

Problem
    \min_w F(w) = E[ f(w; \xi) ]
Stochastic function of the variable w with random variable \xi.
Applications:
• Simulation optimization
• Machine learning
\xi = selection of an input-output pair (x, z):
    f(w; \xi) = f(w; x_i, z_i) = \ell( h(w; x_i), z_i ),    F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)
Algorithm not (yet) applicable to simulation-based optimization.

Stochastic gradient method
For the loss function F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i), the Robbins-Monro or stochastic gradient method is
    w_{k+1} = w_k - \alpha_k \hat{\nabla} F(w_k),    \alpha_k = O(1/k),
using the mini-batch stochastic gradient (estimator)
    \hat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w_k; x_i, z_i),    b = |S| \ll N,    E[ \hat{\nabla} F(w) ] = \nabla F(w).
(A minimal code sketch of this iteration appears a few slides below.)

Why it won't work ….
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} F(w_k)
1. Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method?
2. Iteration costs are so high that even if the method is faster than SG in terms of training costs, it will be a weaker learner.

Theoretical Considerations
    w_{k+1} = w_k - \alpha_k B_k^{-1} \hat{\nabla} F(w_k),    \alpha_k = O(1/k)
Number of iterations needed to compute an epsilon-accurate solution:
• B_k = I: depends on the condition number of the Hessian at the true solution (Murata 98); cf. Bottou-Bousquet.
• B_k \to \nabla^2 F(w^*): completely removes the dependency on the condition number; depends on the Hessian at the true solution and the gradient covariance matrix.
Motivation for the quasi-Newton choice of B_k.

Computational cost
Assuming we obtain the efficiencies of classical quasi-Newton methods in limited memory form:
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
• Each iteration requires 4Md operations
• M = memory in the limited memory implementation; M = 5
• d = dimension of the optimization problem
    4Md + d = 21d    vs.    d,
where d is the cost of computing \hat{\nabla} f(w_k) in the stochastic gradient method.

Mini-batching
    \hat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w_k; x_i, z_i),    b = |S|
• assuming a mini-batch of b = 50, the cost of the stochastic gradient is 50d
• 4Md + bd = 20d + 50d = 70d    vs.    50d
Use of small mini-batches will be a game-changer: b = 10, 50, 100.

Game changer? Not quite…
Mini-batching makes operation counts favorable but does not resolve challenges related to noise.
1. Avoid differencing noise
• Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013))
• Quasi-Newton updating is an overwriting process, not an averaging process
• Control quality of curvature information
2. Cost of curvature computation
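As a concrete reference point for the cost and noise discussion above, here is a minimal sketch of the mini-batch Robbins-Monro iteration from the "Stochastic gradient method" slide, written for a binary logistic-regression loss. The function names, the {0, 1} label convention, and the step-size constant beta are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def logistic_grad(w, X, z):
    """Gradient of the average logistic loss over a (mini-)batch:
    (1/b) * sum_i log(1 + exp(-y_i * x_i^T w)), with y_i = 2*z_i - 1 in {-1, +1}."""
    y = 2.0 * z - 1.0
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))        # per-example coefficient: -y_i * sigmoid(-m_i)
    return X.T @ coef / X.shape[0]

def sgd(X, z, beta=1.0, b=50, num_iters=1000, seed=0):
    """Robbins-Monro / stochastic gradient iteration
        w_{k+1} = w_k - alpha_k * grad_hat F(w_k),   alpha_k = beta / k,
    with grad_hat F computed on a random mini-batch S of size b."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for k in range(1, num_iters + 1):
        S = rng.choice(N, size=b, replace=False)   # mini-batch, b = |S| << N
        g_hat = logistic_grad(w, X[S], z[S])       # unbiased estimator of grad F(w_k)
        w -= (beta / k) * g_hat                    # alpha_k = O(1/k)
    return w
```

With b = 50 and dense features, each iteration touches b examples and costs roughly bd operations, which is the 50d count used on the Mini-batching slide.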
Design of Stochastic Quasi-Newton Method
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
Propose a method based on the famous BFGS formula
• all components seem to fit together well
• numerical performance appears to be strong
Propose a new quasi-Newton updating formula
• specifically designed to deal with noisy gradients
• work in progress

Review of the deterministic BFGS method
    w_{k+1} = w_k - \alpha_k B_k^{-1} \nabla F(w_k) = w_k - \alpha_k H_k \nabla F(w_k)
At every iteration compute and save
    y_k = \nabla F(w_{k+1}) - \nabla F(w_k),    s_k = w_{k+1} - w_k,
    H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T,    \rho_k = \frac{1}{s_k^T y_k}.
The correction pairs (y_k, s_k) uniquely determine the BFGS updating.

The remarkable properties of the BFGS method (convex case)
Superlinear convergence; global convergence for strongly convex problems; self-correction properties.
    \frac{ \| (B_k - \nabla^2 F(x^*)) s_k \| }{ \| s_k \| } \to 0
Only need to approximate the Hessian in a subspace.
If the algorithm takes a bad step, the matrix is corrected:
    tr(B_{k+1}) = tr(B_k) - \frac{ \| B_k s_k \|^2 }{ s_k^T B_k s_k } + \frac{ \| y_k \|^2 }{ y_k^T s_k },
    det(B_{k+1}) = det(B_k) \, \frac{ y_k^T s_k }{ s_k^T B_k s_k }.
(Powell 76; Byrd-N 89)

Adaptation to stochastic setting
Cannot mimic the classical approach and update after each iteration,
    y_k = \hat{\nabla} F(w_{k+1}) - \hat{\nabla} F(w_k),    s_k = w_{k+1} - w_k,
since the batch size b is small: this would yield highly noisy curvature estimates.
Instead: use a collection of iterates to define the correction pairs.

Stochastic BFGS: Approach 1
Define two collections of size L:
    \{ (w_i, \hat{\nabla} F(w_i)) : i \in I \},    \{ (w_j, \hat{\nabla} F(w_j)) : j \in J \},    |I| = |J| = L.
Define the average iterate/gradient:
    \bar{w}_I = \frac{1}{|I|} \sum_{i \in I} w_i,    \bar{g}_I = \frac{1}{|I|} \sum_{i \in I} \hat{\nabla} F(w_i).
New curvature pair:
    s = \bar{w}_I - \bar{w}_J,    y = \bar{g}_I - \bar{g}_J.

Stochastic L-BFGS: First Approach
[Diagram: along the iterate sequence w_0, w_1, …, w_14, consecutive windows of L iterates define the collections J and I; each pair of windows yields averages and a curvature pair s = \bar{w}_I - \bar{w}_J, y = \bar{g}_I - \bar{g}_J, producing the matrices H_1, H_2, H_3 used in w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k).]

Stochastic BFGS: Approach 1
We could not make this work in a robust manner!
1. Two sources of error in s = \bar{w}_I - \bar{w}_J, y = \bar{g}_I - \bar{g}_J:
• sample variance
• lack of sample uniformity
2. Initial reaction:
• control the quality of the average gradients
• use of sample variance … dynamic sampling
Proposed solution: control the quality of the curvature estimate y directly.

Key idea: avoid differencing
The standard definition y = \hat{\nabla} F(\bar{w}_I) - \hat{\nabla} F(\bar{w}_J) arises from
    \hat{\nabla} F(\bar{w}_I) - \hat{\nabla} F(\bar{w}_J) \approx \hat{\nabla}^2 F(\bar{w}_I) (\bar{w}_I - \bar{w}_J).
Hessian-vector products are often available. Define the curvature vector for L-BFGS via a Hessian-vector product,
    y = \hat{\nabla}^2 F(\bar{w}_I) \, s,
performed only every L iterations.

Structure of Hessian-vector product
    y = \hat{\nabla}^2 F(\bar{w}_I) \, s,    \hat{\nabla}^2 F(w) \, s = \frac{1}{b_H} \sum_{i \in S_H} \nabla^2 f(w; x_i, z_i) \, s,    |S_H| = b_H
(a mini-batch estimate, as for the stochastic gradient)
1. Code the Hessian-vector product directly (a sketch follows below)
2. Achieve sample uniformity automatically (cf. Schraudolph)
3. Avoid numerical problems when ||s|| is small
4. Control the cost of the y computation

The Proposed Algorithm
[Diagram: same layout as the first-approach figure, with iterates w_0, …, w_14, windows J and I of length L, and matrices H_1, H_2, H_3, but now each curvature pair is s = \bar{w}_I - \bar{w}_J, y = \hat{\nabla}^2 F(\bar{w}_I) s, used in w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k). A code sketch of this loop follows below.]

Algorithmic Parameters
• b: stochastic gradient batch size
• b_H: Hessian-vector batch size
• L: controls the frequency of quasi-Newton updating
• M: memory parameter in L-BFGS updating; M = 5 – use the limited memory form

Need a Hessian to implement a quasi-Newton method?? Are you out of your mind?
We don't need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.
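The "Structure of Hessian-vector product" slide above recommends coding \hat{\nabla}^2 F(w) s directly. For binary logistic regression this is particularly simple, since the sub-sampled Hessian is X_H^T D X_H with diagonal D; the sketch below is my own illustration under that assumption (the function names are not from the slides) and forms the product at O(b_H d) cost without ever building a d x d matrix.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian_vector_product(w, s, X, bH=None, rng=None):
    """Sub-sampled Hessian-vector product for binary logistic regression,
        y = (1 / b_H) * sum_{i in S_H} sigma_i (1 - sigma_i) x_i x_i^T s,
    computed as X_H^T (D (X_H s)) so the d x d Hessian is never formed.
    Note that the labels z_i drop out of the logistic-regression Hessian."""
    if rng is None:
        rng = np.random.default_rng()
    if bH is not None and bH < X.shape[0]:
        S_H = rng.choice(X.shape[0], size=bH, replace=False)   # Hessian batch, |S_H| = b_H
        X = X[S_H]
    sig = sigmoid(X @ w)
    weights = sig * (1.0 - sig)            # diagonal of D
    return X.T @ (weights * (X @ s)) / X.shape[0]
```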
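Putting the pieces together, here is one possible reading of the proposed algorithm as a code sketch: mini-batch stochastic gradient steps scaled by H_t through the standard L-BFGS two-loop recursion, averaged iterates over windows of L steps, and a curvature pair (s, y) formed from a sub-sampled Hessian-vector product at the end of each window. The curvature safeguard, the initial scaling of H, and the callable interfaces grad_fn(w, b, rng) and hvp_fn(w, s, bH, rng) (for example, thin wrappers around the logistic-regression helpers sketched earlier) are my own assumptions; this is not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def lbfgs_two_loop(grad, s_list, y_list):
    """Apply the L-BFGS matrix H_t to grad via the two-loop recursion
    (roughly 4*M*d multiplications for M stored correction pairs)."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:                                   # initial matrix H^0 = (s^T y / y^T y) I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ q)
        q += (a - b) * s
    return q

def sqn(grad_fn, hvp_fn, w0, beta=1.0, b=50, bH=1000, L=20, M=5, num_iters=2000, seed=0):
    """Sketch of the stochastic quasi-Newton iteration
        w_{k+1} = w_k - (beta / k) * H_t * grad_hat F(w_k),
    with a new curvature pair every L iterations:
        s = wbar_I - wbar_J,   y = (sub-sampled Hessian at wbar_I) * s."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    s_pairs, y_pairs = deque(maxlen=M), deque(maxlen=M)   # limited memory of size M
    w_sum = np.zeros_like(w)
    wbar_prev = None                                      # average over the previous window (J)
    for k in range(1, num_iters + 1):
        g_hat = grad_fn(w, b, rng)                        # mini-batch gradient, batch size b
        if s_pairs:
            direction = lbfgs_two_loop(g_hat, list(s_pairs), list(y_pairs))
        else:
            direction = g_hat                             # plain SGD until the first pair exists
        w -= (beta / k) * direction
        w_sum += w
        if k % L == 0:                                    # a window I of L iterates is complete
            wbar = w_sum / L
            w_sum[:] = 0.0
            if wbar_prev is not None:
                s = wbar - wbar_prev
                y = hvp_fn(wbar, s, bH, rng)              # Hessian-vector product on batch b_H
                if s @ y > 1e-10 * (s @ s):               # skip non-positive curvature (assumption)
                    s_pairs.append(s)
                    y_pairs.append(y)
            wbar_prev = wbar
    return w
```

One call to lbfgs_two_loop costs about 4Md operations, which is the figure used on the Computational cost slide, while the Hessian-vector product is amortized over L iterations.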
Numerical Tests
Stochastic gradient method (SGD):       w_{k+1} = w_k - \alpha_k \hat{\nabla} f(w_k),        \alpha_k = \beta / k
Stochastic quasi-Newton method (SQN):   w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k),    \alpha_k = \beta / k
\beta: parameter fixed at the start for each method; found by a bisection procedure.
It is well known that SGD is highly sensitive to the choice of steplength, and so will be the SQN method (though perhaps less).

RCV1 Problem: n = 112919, N = 688329
• b = 50, 300, 1000
• M = 5, L = 20, b_H = 1000
[Plot: SGD vs. SQN; horizontal axis: accessed data points, a count that includes the Hessian-vector products.]

Speech Problem: n = 30315, N = 191607
• b = 100, 500; M = 5, L = 20, b_H = 1000
[Plot: SGD vs. SQN.]

Varying Hessian batch b_H: RCV1, b = 300
[Plot.]

Varying memory size M in limited memory BFGS: RCV1
[Plot.]

Varying L-BFGS memory size: synthetic problem
[Plot.]

Generalization Error: RCV1 Problem
[Plot: test error for SQN and SGD.]

Test Problems
• Synthetically generated logistic regression (Singer et al.)
  – n = 50, N = 7000
  – training data: x_i \in \mathbb{R}^n, z_i \in \{0, 1\}
• RCV1 dataset
  – n = 112919, N = 688329
  – training data: x_i \in [0, 1]^n, z_i \in \{0, 1\}
• SPEECH dataset
  – NF = 235, |C| = 129; n = NF \times |C| = 30315, N = 191607
  – training data: x_i \in \mathbb{R}^{NF}, z_i \in \{1, \dots, 129\}

Iteration Costs
SGD:    w_{k+1} = w_k - \alpha_k \hat{\nabla} F(w_k)
• mini-batch stochastic gradient                    →  bn
SQN:    w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k)
• mini-batch stochastic gradient
• Hessian-vector product every L iterations
• matrix-vector product                             →  bn + b_H n / L + 4Mn

Iteration Costs
SGD: bn    vs.    SQN: bn + b_H n / L + 4Mn,   i.e. 300n vs. 370n for the typical parameter values:
• b = 50-1000      (here b = 300)
• b_H = 100-1000   (here b_H = 1000)
• L = 10-20        (here L = 20)
• M = 3-20         (here M = 5)

Hasn't this been done before?
Hessian-free Newton method: Martens (2010), Byrd et al. (2011)
- claim: stochastic Newton is not competitive with stochastic BFGS
Prior work: Schraudolph et al.
- similar, but cannot ensure the quality of y
- changes the BFGS formula to a one-sided form

Supporting theory?
Work in progress: Figen Oztoprak, Byrd, Solntsev
- combine the classical analysis (Murata, Nemirovsky et al.) with asymptotic quasi-Newton theory
- effect on the constants (condition number)
- invoke the self-correction properties of BFGS
Practical implementation: limited memory BFGS
- loses the superlinear convergence property
- enjoys the self-correction mechanisms

Small batches: RCV1 Problem
b_H = 1000, M = 5, L = 200
SGD: b accessed data points (adp) per iteration; bn work per iteration
SQN: b + b_H / L adp per iteration; bn + b_H n / L + 4Mn work per iteration
The parameters L, M and b_H provide freedom in adapting the SQN method to a specific application.

Alternative quasi-Newton framework
The BFGS method was not derived with noisy gradients in mind
- how do we know it is an appropriate framework?
- start from scratch: derive quasi-Newton updating formulas tolerant to noise

Foundations
Define a quadratic model of f(w) around a reference point z:
    q_z(w) = g^T (w - z) + \frac{1}{2} (w - z)^T B (w - z).
Using a collection indexed by I, it is natural to require
    \sum_{j \in I} [ \nabla q_z(w_j) - \hat{\nabla} f(w_j) ] = 0,
i.e. the residuals are zero in expectation. This is not enough information to determine the whole model.

Mean square error
Given a collection I, choose the model q to minimize
    F(g, B) = \sum_{j \in I} \| \nabla q_z(w_j) - \hat{\nabla} f(w_j) \|^2.
Define s_j = w_j - z and restate the problem as
    \min_{g, B} \sum_{j \in I} \| B s_j + g - \hat{\nabla} f(w_j) \|^2.
Differentiating with respect to g:
    \sum_{j \in I} \{ B s_j + g - \hat{\nabla} f(w_j) \} = 0.
Encouraging: we obtain the residual condition. (This step is spelled out below.)

The End
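For completeness, the differentiation step on the Mean-square-error slide written out in full. The indexing s_j = w_j - z, the factor of 2, and the signs (chosen so that \nabla q_z(w) = g + B(w - z)) are my reconstruction of the garbled slide, not a verbatim quote.

```latex
% Gradient of the quadratic model: \nabla q_z(w) = g + B (w - z), so with s_j = w_j - z
F(g,B) \;=\; \sum_{j\in I} \bigl\| B s_j + g - \hat{\nabla} f(w_j) \bigr\|^{2}.

% Setting the gradient with respect to g to zero:
\nabla_{g} F(g,B) \;=\; 2 \sum_{j\in I} \bigl( B s_j + g - \hat{\nabla} f(w_j) \bigr) \;=\; 0
\;\Longleftrightarrow\;
\sum_{j\in I} \bigl[ \nabla q_z(w_j) - \hat{\nabla} f(w_j) \bigr] \;=\; 0 ,

% i.e. the least-squares model automatically satisfies the zero-mean residual
% condition postulated on the Foundations slide.
```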