A Stochastic Quasi-Newton Method for Large-Scale Learning
Jorge Nocedal, Northwestern University
With S. Hansen, R. Byrd and Y. Singer
IPAM, UCLA, Feb 2014

Goal
Propose a robust quasi-Newton method that operates in the stochastic approximation regime:
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} F(w_k)
• purely stochastic method (not batch) – to compete with the stochastic gradient (SG) method
• Full non-diagonal Hessian approximation
• Scalable to millions of parameters

Outline
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
Are iterations of this form viable?
- theoretical considerations; iteration costs
- differencing noisy gradients?
Key ideas: compute curvature information pointwise at regular intervals and build on the strength of BFGS updating, recalling that it is an overwriting (and not an averaging) process.
- results on text and speech problems
- examine both training and testing errors

Problem
    \min_w F(w) = E[ f(w; \xi) ]
Stochastic function of the variable w with random variable \xi.
Applications:
• Simulation optimization
• Machine learning
\xi = selection of an input-output pair (x, z):
    f(w; \xi) = f(w; x_i, z_i) = \ell( h(w; x_i), z_i ),    F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)
Algorithm not (yet) applicable to simulation-based optimization.

Stochastic gradient method
For the loss function F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i), the Robbins-Monro or stochastic gradient method is
    w_{k+1} = w_k - \alpha_k \hat{\nabla} F(w_k),    \alpha_k = O(1/k),
using the mini-batch stochastic gradient (estimator)
    \hat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w_k; x_i, z_i),    b = |S| \ll N,    E[ \hat{\nabla} F(w) ] = \nabla F(w).
(A minimal code sketch of this iteration appears a few slides below.)

Why it won't work ….
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} F(w_k)
1. Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method?
2. Iteration costs are so high that even if the method is faster than SG in terms of training costs, it will be a weaker learner.

Theoretical Considerations
    w_{k+1} = w_k - \alpha_k B_k^{-1} \hat{\nabla} F(w_k),    \alpha_k = O(1/k)
Number of iterations needed to compute an epsilon-accurate solution:
• B_k = I: depends on the condition number of the Hessian at the true solution (Murata 98); cf. Bottou-Bousquet.
• B_k \to \nabla^2 F(w^*): completely removes the dependency on the condition number; depends on the Hessian at the true solution and the gradient covariance matrix.
Motivation for the quasi-Newton choice of B_k.

Computational cost
Assuming we obtain the efficiencies of classical quasi-Newton methods in limited memory form:
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
• Each iteration requires 4Md operations
• M = memory in the limited memory implementation; M = 5
• d = dimension of the optimization problem
    4Md + d = 21d    vs.    d,
where d is the cost of computing \hat{\nabla} f(w_k) in the stochastic gradient method.

Mini-batching
    \hat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w_k; x_i, z_i),    b = |S|
• assuming a mini-batch of b = 50, the cost of the stochastic gradient is 50d
• 4Md + bd = 20d + 50d = 70d    vs.    50d
Use of small mini-batches will be a game-changer: b = 10, 50, 100.

Game changer? Not quite…
Mini-batching makes operation counts favorable but does not resolve challenges related to noise.
1. Avoid differencing noise
• Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013))
• Quasi-Newton updating is an overwriting process, not an averaging process
• Control quality of curvature information
2. Cost of curvature computation
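As a concrete reference point for the cost and noise discussion above, here is a minimal sketch of the mini-batch Robbins-Monro iteration from the "Stochastic gradient method" slide, written for a binary logistic-regression loss. The function names, the {0, 1} label convention, and the step-size constant beta are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def logistic_grad(w, X, z):
    """Gradient of the average logistic loss over a (mini-)batch:
    (1/b) * sum_i log(1 + exp(-y_i * x_i^T w)), with y_i = 2*z_i - 1 in {-1, +1}."""
    y = 2.0 * z - 1.0
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))        # per-example coefficient: -y_i * sigmoid(-m_i)
    return X.T @ coef / X.shape[0]

def sgd(X, z, beta=1.0, b=50, num_iters=1000, seed=0):
    """Robbins-Monro / stochastic gradient iteration
        w_{k+1} = w_k - alpha_k * grad_hat F(w_k),   alpha_k = beta / k,
    with grad_hat F computed on a random mini-batch S of size b."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for k in range(1, num_iters + 1):
        S = rng.choice(N, size=b, replace=False)   # mini-batch, b = |S| << N
        g_hat = logistic_grad(w, X[S], z[S])       # unbiased estimator of grad F(w_k)
        w -= (beta / k) * g_hat                    # alpha_k = O(1/k)
    return w
```

With b = 50 and dense features, each iteration touches b examples and costs roughly bd operations, which is the 50d count used on the Mini-batching slide.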
Design of Stochastic Quasi-Newton Method
    w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)
Propose a method based on the famous BFGS formula
• all components seem to fit together well
• numerical performance appears to be strong
Propose a new quasi-Newton updating formula
• specifically designed to deal with noisy gradients
• work in progress

Review of the deterministic BFGS method
    w_{k+1} = w_k - \alpha_k B_k^{-1} \nabla F(w_k) = w_k - \alpha_k H_k \nabla F(w_k)
At every iteration compute and save
    y_k = \nabla F(w_{k+1}) - \nabla F(w_k),    s_k = w_{k+1} - w_k,
    H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T,    \rho_k = \frac{1}{s_k^T y_k}.
The correction pairs (y_k, s_k) uniquely determine the BFGS updating.

The remarkable properties of the BFGS method (convex case)
Superlinear convergence; global convergence for strongly convex problems; self-correction properties.
    \frac{ \| (B_k - \nabla^2 F(x^*)) s_k \| }{ \| s_k \| } \to 0
Only need to approximate the Hessian in a subspace.
If the algorithm takes a bad step, the matrix is corrected:
    tr(B_{k+1}) = tr(B_k) - \frac{ \| B_k s_k \|^2 }{ s_k^T B_k s_k } + \frac{ \| y_k \|^2 }{ y_k^T s_k },
    det(B_{k+1}) = det(B_k) \, \frac{ y_k^T s_k }{ s_k^T B_k s_k }.
(Powell 76; Byrd-N 89)

Adaptation to stochastic setting
Cannot mimic the classical approach and update after each iteration,
    y_k = \hat{\nabla} F(w_{k+1}) - \hat{\nabla} F(w_k),    s_k = w_{k+1} - w_k,
since the batch size b is small: this would yield highly noisy curvature estimates.
Instead: use a collection of iterates to define the correction pairs.

Stochastic BFGS: Approach 1
Define two collections of size L:
    \{ (w_i, \hat{\nabla} F(w_i)) : i \in I \},    \{ (w_j, \hat{\nabla} F(w_j)) : j \in J \},    |I| = |J| = L.
Define the average iterate/gradient:
    \bar{w}_I = \frac{1}{|I|} \sum_{i \in I} w_i,    \bar{g}_I = \frac{1}{|I|} \sum_{i \in I} \hat{\nabla} F(w_i).
New curvature pair:
    s = \bar{w}_I - \bar{w}_J,    y = \bar{g}_I - \bar{g}_J.

Stochastic L-BFGS: First Approach
[Diagram: along the iterate sequence w_0, w_1, …, w_14, consecutive windows of L iterates define the collections J and I; each pair of windows yields averages and a curvature pair s = \bar{w}_I - \bar{w}_J, y = \bar{g}_I - \bar{g}_J, producing the matrices H_1, H_2, H_3 used in w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k).]

Stochastic BFGS: Approach 1
We could not make this work in a robust manner!
1. Two sources of error in s = \bar{w}_I - \bar{w}_J, y = \bar{g}_I - \bar{g}_J:
• sample variance
• lack of sample uniformity
2. Initial reaction:
• control the quality of the average gradients
• use of sample variance … dynamic sampling
Proposed solution: control the quality of the curvature estimate y directly.

Key idea: avoid differencing
The standard definition y = \hat{\nabla} F(\bar{w}_I) - \hat{\nabla} F(\bar{w}_J) arises from
    \hat{\nabla} F(\bar{w}_I) - \hat{\nabla} F(\bar{w}_J) \approx \hat{\nabla}^2 F(\bar{w}_I) (\bar{w}_I - \bar{w}_J).
Hessian-vector products are often available. Define the curvature vector for L-BFGS via a Hessian-vector product,
    y = \hat{\nabla}^2 F(\bar{w}_I) \, s,
performed only every L iterations.

Structure of Hessian-vector product
    y = \hat{\nabla}^2 F(\bar{w}_I) \, s,    \hat{\nabla}^2 F(w) \, s = \frac{1}{b_H} \sum_{i \in S_H} \nabla^2 f(w; x_i, z_i) \, s,    |S_H| = b_H
(a mini-batch estimate, as for the stochastic gradient)
1. Code the Hessian-vector product directly (a sketch follows below)
2. Achieve sample uniformity automatically (cf. Schraudolph)
3. Avoid numerical problems when ||s|| is small
4. Control the cost of the y computation

The Proposed Algorithm
[Diagram: same layout as the first-approach figure, with iterates w_0, …, w_14, windows J and I of length L, and matrices H_1, H_2, H_3, but now each curvature pair is s = \bar{w}_I - \bar{w}_J, y = \hat{\nabla}^2 F(\bar{w}_I) s, used in w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k). A code sketch of this loop follows below.]

Algorithmic Parameters
• b: stochastic gradient batch size
• b_H: Hessian-vector batch size
• L: controls the frequency of quasi-Newton updating
• M: memory parameter in L-BFGS updating; M = 5 – use the limited memory form

Need a Hessian to implement a quasi-Newton method?? Are you out of your mind?
We don't need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.
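The "Structure of Hessian-vector product" slide above recommends coding \hat{\nabla}^2 F(w) s directly. For binary logistic regression this is particularly simple, since the sub-sampled Hessian is X_H^T D X_H with diagonal D; the sketch below is my own illustration under that assumption (the function names are not from the slides) and forms the product at O(b_H d) cost without ever building a d x d matrix.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian_vector_product(w, s, X, bH=None, rng=None):
    """Sub-sampled Hessian-vector product for binary logistic regression,
        y = (1 / b_H) * sum_{i in S_H} sigma_i (1 - sigma_i) x_i x_i^T s,
    computed as X_H^T (D (X_H s)) so the d x d Hessian is never formed.
    Note that the labels z_i drop out of the logistic-regression Hessian."""
    if rng is None:
        rng = np.random.default_rng()
    if bH is not None and bH < X.shape[0]:
        S_H = rng.choice(X.shape[0], size=bH, replace=False)   # Hessian batch, |S_H| = b_H
        X = X[S_H]
    sig = sigmoid(X @ w)
    weights = sig * (1.0 - sig)            # diagonal of D
    return X.T @ (weights * (X @ s)) / X.shape[0]
```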
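Putting the pieces together, here is one possible reading of the proposed algorithm as a code sketch: mini-batch stochastic gradient steps scaled by H_t through the standard L-BFGS two-loop recursion, averaged iterates over windows of L steps, and a curvature pair (s, y) formed from a sub-sampled Hessian-vector product at the end of each window. The curvature safeguard, the initial scaling of H, and the callable interfaces grad_fn(w, b, rng) and hvp_fn(w, s, bH, rng) (for example, thin wrappers around the logistic-regression helpers sketched earlier) are my own assumptions; this is not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def lbfgs_two_loop(grad, s_list, y_list):
    """Apply the L-BFGS matrix H_t to grad via the two-loop recursion
    (roughly 4*M*d multiplications for M stored correction pairs)."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:                                   # initial matrix H^0 = (s^T y / y^T y) I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ q)
        q += (a - b) * s
    return q

def sqn(grad_fn, hvp_fn, w0, beta=1.0, b=50, bH=1000, L=20, M=5, num_iters=2000, seed=0):
    """Sketch of the stochastic quasi-Newton iteration
        w_{k+1} = w_k - (beta / k) * H_t * grad_hat F(w_k),
    with a new curvature pair every L iterations:
        s = wbar_I - wbar_J,   y = (sub-sampled Hessian at wbar_I) * s."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    s_pairs, y_pairs = deque(maxlen=M), deque(maxlen=M)   # limited memory of size M
    w_sum = np.zeros_like(w)
    wbar_prev = None                                      # average over the previous window (J)
    for k in range(1, num_iters + 1):
        g_hat = grad_fn(w, b, rng)                        # mini-batch gradient, batch size b
        if s_pairs:
            direction = lbfgs_two_loop(g_hat, list(s_pairs), list(y_pairs))
        else:
            direction = g_hat                             # plain SGD until the first pair exists
        w -= (beta / k) * direction
        w_sum += w
        if k % L == 0:                                    # a window I of L iterates is complete
            wbar = w_sum / L
            w_sum[:] = 0.0
            if wbar_prev is not None:
                s = wbar - wbar_prev
                y = hvp_fn(wbar, s, bH, rng)              # Hessian-vector product on batch b_H
                if s @ y > 1e-10 * (s @ s):               # skip non-positive curvature (assumption)
                    s_pairs.append(s)
                    y_pairs.append(y)
            wbar_prev = wbar
    return w
```

One call to lbfgs_two_loop costs about 4Md operations, which is the figure used on the Computational cost slide, while the Hessian-vector product is amortized over L iterations.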
Numerical Tests
Stochastic gradient method (SGD):       w_{k+1} = w_k - \alpha_k \hat{\nabla} f(w_k),        \alpha_k = \beta / k
Stochastic quasi-Newton method (SQN):   w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k),    \alpha_k = \beta / k
\beta: parameter fixed at the start for each method; found by a bisection procedure.
It is well known that SGD is highly sensitive to the choice of steplength, and so will be the SQN method (though perhaps less).

RCV1 Problem: n = 112919, N = 688329
• b = 50, 300, 1000
• M = 5, L = 20, b_H = 1000
[Plot: SGD vs. SQN; horizontal axis: accessed data points, a count that includes the Hessian-vector products.]

Speech Problem: n = 30315, N = 191607
• b = 100, 500; M = 5, L = 20, b_H = 1000
[Plot: SGD vs. SQN.]

Varying Hessian batch b_H: RCV1, b = 300
[Plot.]

Varying memory size M in limited memory BFGS: RCV1
[Plot.]

Varying L-BFGS memory size: synthetic problem
[Plot.]

Generalization Error: RCV1 Problem
[Plot: test error for SQN and SGD.]

Test Problems
• Synthetically generated logistic regression (Singer et al.)
  – n = 50, N = 7000
  – training data: x_i \in \mathbb{R}^n, z_i \in \{0, 1\}
• RCV1 dataset
  – n = 112919, N = 688329
  – training data: x_i \in [0, 1]^n, z_i \in \{0, 1\}
• SPEECH dataset
  – NF = 235, |C| = 129; n = NF \times |C| = 30315, N = 191607
  – training data: x_i \in \mathbb{R}^{NF}, z_i \in \{1, \dots, 129\}

Iteration Costs
SGD:    w_{k+1} = w_k - \alpha_k \hat{\nabla} F(w_k)
• mini-batch stochastic gradient                    →  bn
SQN:    w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k)
• mini-batch stochastic gradient
• Hessian-vector product every L iterations
• matrix-vector product                             →  bn + b_H n / L + 4Mn

Iteration Costs
SGD: bn    vs.    SQN: bn + b_H n / L + 4Mn,   i.e. 300n vs. 370n for the typical parameter values:
• b = 50-1000      (here b = 300)
• b_H = 100-1000   (here b_H = 1000)
• L = 10-20        (here L = 20)
• M = 3-20         (here M = 5)

Hasn't this been done before?
Hessian-free Newton method: Martens (2010), Byrd et al. (2011)
- claim: stochastic Newton is not competitive with stochastic BFGS
Prior work: Schraudolph et al.
- similar, but cannot ensure the quality of y
- changes the BFGS formula to a one-sided form

Supporting theory?
Work in progress: Figen Oztoprak, Byrd, Solntsev
- combine the classical analysis (Murata, Nemirovsky et al.) with asymptotic quasi-Newton theory
- effect on the constants (condition number)
- invoke the self-correction properties of BFGS
Practical implementation: limited memory BFGS
- loses the superlinear convergence property
- enjoys the self-correction mechanisms

Small batches: RCV1 Problem
b_H = 1000, M = 5, L = 200
SGD: b accessed data points (adp) per iteration; bn work per iteration
SQN: b + b_H / L adp per iteration; bn + b_H n / L + 4Mn work per iteration
The parameters L, M and b_H provide freedom in adapting the SQN method to a specific application.

Alternative quasi-Newton framework
The BFGS method was not derived with noisy gradients in mind
- how do we know it is an appropriate framework?
- start from scratch: derive quasi-Newton updating formulas tolerant to noise

Foundations
Define a quadratic model of f(w) around a reference point z:
    q_z(w) = g^T (w - z) + \frac{1}{2} (w - z)^T B (w - z).
Using a collection indexed by I, it is natural to require
    \sum_{j \in I} [ \nabla q_z(w_j) - \hat{\nabla} f(w_j) ] = 0,
i.e. the residuals are zero in expectation. This is not enough information to determine the whole model.

Mean square error
Given a collection I, choose the model q to minimize
    F(g, B) = \sum_{j \in I} \| \nabla q_z(w_j) - \hat{\nabla} f(w_j) \|^2.
Define s_j = w_j - z and restate the problem as
    \min_{g, B} \sum_{j \in I} \| B s_j + g - \hat{\nabla} f(w_j) \|^2.
Differentiating with respect to g:
    \sum_{j \in I} \{ B s_j + g - \hat{\nabla} f(w_j) \} = 0.
Encouraging: we obtain the residual condition. (This step is spelled out below.)

The End
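For completeness, the differentiation step on the Mean-square-error slide written out in full. The indexing s_j = w_j - z, the factor of 2, and the signs (chosen so that \nabla q_z(w) = g + B(w - z)) are my reconstruction of the garbled slide, not a verbatim quote.

```latex
% Gradient of the quadratic model: \nabla q_z(w) = g + B (w - z), so with s_j = w_j - z
F(g,B) \;=\; \sum_{j\in I} \bigl\| B s_j + g - \hat{\nabla} f(w_j) \bigr\|^{2}.

% Setting the gradient with respect to g to zero:
\nabla_{g} F(g,B) \;=\; 2 \sum_{j\in I} \bigl( B s_j + g - \hat{\nabla} f(w_j) \bigr) \;=\; 0
\;\Longleftrightarrow\;
\sum_{j\in I} \bigl[ \nabla q_z(w_j) - \hat{\nabla} f(w_j) \bigr] \;=\; 0 ,

% i.e. the least-squares model automatically satisfies the zero-mean residual
% condition postulated on the Foundations slide.
```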