A Stochastic Quasi-Newton Method for
Large-Scale Learning
Jorge Nocedal
Northwestern University
With S. Hansen, R. Byrd and Y. Singer
IPAM, UCLA, Feb 2014
1
Goal
Propose a robust quasi-Newton method that operates in
the stochastic approximation regime
wk 1  wk   k H k ̂F(wk )
•
•
•
purely stochastic method (not batch) – to compete
with stochatic gradient (SG) method
Full non-diagonal Hessian approximation
Scalable to millions of parameters
2
Outline
wk 1  wk   k H k ̂f (wk )
Are iterations of this following form viable?
- theoretical considerations; iteration costs
- differencing noisy gradients?
Key ideas: compute curvature information pointwise at
regular intervals, build on strength of BFGS updating
recalling that it is an overwriting (and not averaging
process)
- results on text and speech problems
- examine both training and testing errors
3
Problem
$\min_{w \in \mathbb{R}^n} F(w) = \mathbb{E}[\, f(w; \xi) \,]$

Stochastic function of variable $w$ with random variable $\xi$

Applications
• Simulation optimization
• Machine learning: $\xi$ = selection of an input-output pair $(x, z)$

$f(w, \xi) = f(w; x_i, z_i) = \ell(h(w; x_i), z_i)$

$F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)$
Algorithm not (yet) applicable to simulation based optimization
4
Stochastic gradient method
For loss function
$F(w) = \frac{1}{N} \sum_{i=1}^{N} f(w; x_i, z_i)$
Robbins-Monro or stochastic gradient method
wk 1  wk   k ̂F(wk )
 k  O(1 / k)
using stochastic gradient (estimator)
̂F(wk ) :
min-batch
E [̂F(w)]  F(w)
b | S | N
1
̂F(wk )    f (w; xi , zi ),
b iS
| S | b
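As a concrete reference point, here is a minimal NumPy sketch of this mini-batch SGD iteration (the per-example gradient function `grad_f` and all names are illustrative, not from the talk):

```python
import numpy as np

def sgd(grad_f, w0, X, Z, b=50, beta=1.0, iters=1000, seed=0):
    """Mini-batch Robbins-Monro / stochastic gradient sketch with alpha_k = beta / k."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = X.shape[0]
    for k in range(1, iters + 1):
        S = rng.choice(N, size=b, replace=False)   # sample mini-batch S, |S| = b
        g = grad_f(w, X[S], Z[S])                  # stochastic gradient estimate of grad F
        w -= (beta / k) * g                        # alpha_k = O(1/k) step
    return w
```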
5
Why it won’t work ….
wk 1  wk   k H k ̂F(wk )
1.
Is there any reason to think that including a Hessian
approximation will improve upon stochastic gradient
method?
2. Iteration costs are so high that even if method is faster than
SG in terms of training costs it will be a weaker learner
6
Theoretical Considerations
wk1  wk   k Bk1̂F(wk ),  k  O(1/ k)
Bk  
I 2 F(w* )
Completely removes the
Depends
on the condition
dependency
on the condition number
number of the Hessian at the
(Murata 98); cf Bottou-Bousquet
true solution
Number of iterations
needed to compute an
epsilon-accurate
solution:
Motivation for choosing
Depends on the Hessian
2
true
solution
the
Batk 
F andQuasi-Newton
gradient covariance
matrix
7
Computational cost
Assuming we obtain efficiencies of classical quasi-Newton
methods in limited memory form
wk 1  wk   k H k ̂f (wk )
•
•
•
Each iteration requires 4Md operations
M = memory in limited memory implementation; M=5
d = dimension of the optimization problem
4 Md  d  21d
vs
d
Stochastic gradient
method
cost of computing
̂f (wk )
8
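The 4Md operation count above comes from the standard L-BFGS two-loop recursion; a minimal NumPy sketch (illustrative only, not the authors' code):

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    """Compute H_k * grad from the last M stored correction pairs (s_i, y_i).
    Each loop touches every stored pair once, giving roughly 4*M*d operations."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):     # newest pair first
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append(a)
    if s_list:                                               # scaled initial matrix H_k^0
        s, y = s_list[-1], y_list[-1]
        q *= np.dot(s, y) / np.dot(y, y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest pair first
        rho = 1.0 / np.dot(y, s)
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return q
```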
Mini-batching

$\hat{\nabla} F(w_k) = \frac{1}{b} \sum_{i \in S} \nabla f(w; x_i, z_i), \qquad |S| = b$

• assuming a mini-batch b = 50
• cost of the stochastic gradient = 50d

$4Md + bd = 20d + 50d = 70d$ vs. $50d$

Use of small mini-batches will be a game-changer: b = 10, 50, 100
9
Game changer? Not quite…
Mini-batching makes operation counts favorable but does not
resolve challenges related to noise
1. Avoid differencing noise
   • Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013))
   • Quasi-Newton updating is an overwriting process, not an averaging process
   • Control the quality of curvature information
2. Cost of curvature computation
10
Design of the Stochastic Quasi-Newton Method

$w_{k+1} = w_k - \alpha_k H_k \hat{\nabla} f(w_k)$

Propose a method based on the famous BFGS formula
• all components seem to fit together well
• numerical performance appears to be strong

Propose a new quasi-Newton updating formula
• specifically designed to deal with noisy gradients
• work in progress
11
Review of the deterministic BFGS method
wk 1  wk   k Bk1F(wk )
wk 1  wk   k H k F(wk )
At every iteration compute and save
yk  F(wk1 )  F(wk ),
sk  wk1  wk ,
H k1  (I  k sk y )H k (I  k y s )   s s
T
k
T
k k
T
k k k
k 
1
skT yk
Correction pairs yk , sk uniquely determine BFGS updating
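A minimal NumPy sketch of this inverse-Hessian update (an illustration of the formula above, not the talk's implementation):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse Hessian approximation:
    H_new = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T, with rho = 1 / (s^T y)."""
    rho = 1.0 / np.dot(y, s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```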
12
The remarkable properties of BFGS method (convex case)
Superlinear convergence; global convergence
for strongly convex problems, self-correction properties
$\frac{\|(B_k - \nabla^2 F(x^*))\, s_k\|}{\|s_k\|} \to 0$

Only need to approximate the Hessian in a subspace.
If the algorithm takes a bad step, the matrix is corrected:

$\operatorname{tr}(B_{k+1}) = \operatorname{tr}(B_k) - \frac{\|B_k s_k\|^2}{s_k^T B_k s_k} + \frac{\|y_k\|^2}{y_k^T s_k}$

$\det B_{k+1} = \det B_k \cdot \frac{y_k^T s_k}{s_k^T s_k} \cdot \frac{s_k^T s_k}{s_k^T B_k s_k}$
Powell 76
Byrd-N 89
13
Adaptation to stochastic setting
Cannot mimic classical approach and update after each
iteration
$y_k = \hat{\nabla} F(w_{k+1}) - \hat{\nabla} F(w_k), \qquad s_k = w_{k+1} - w_k$
Since batch size b is small this will yield highly noisy
curvature estimates
Instead: Use a collection of iterates to define the correction
pairs
14
Stochastic BFGS: Approach 1
Define two collections of size L:

$\{\, w_i, \hat{\nabla} F(w_i) \mid i \in I \,\}, \qquad \{\, w_j, \hat{\nabla} F(w_j) \mid j \in J \,\}, \qquad |I| = |J| = L$

Define the average iterate and gradient:

$w_I = \frac{1}{|I|} \sum_{i \in I} w_i, \qquad g_I = \frac{1}{|I|} \sum_{i \in I} \hat{\nabla} F(w_i)$

New curvature pair:

$s = w_I - w_J, \qquad y = g_I - g_J$
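A short sketch of how such a correction pair could be formed from the two most recent collections (hedged; the variable names and bookkeeping are illustrative):

```python
import numpy as np

def correction_pair_approach1(w_hist, g_hist, L):
    """Average the last 2L iterates/gradients in two blocks J (older) and I (newer),
    then difference the averages: s = w_I - w_J, y = g_I - g_J."""
    w_hist, g_hist = np.asarray(w_hist), np.asarray(g_hist)   # rows = iterates / gradients
    wI, gI = w_hist[-L:].mean(axis=0), g_hist[-L:].mean(axis=0)
    wJ, gJ = w_hist[-2 * L:-L].mean(axis=0), g_hist[-2 * L:-L].mean(axis=0)
    return wI - wJ, gI - gJ
```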
15
Stochastic L-BFGS: First Approach
wk1
Ht ̂F(w
k1  wkk   kk ̂F(w
k) k)
s  wI  wJ
s  wI  wJ
s  wI  wJ
y  gI  gJ
 H2
y  gI  gJ
 H1
y  gI  gJ
 H3
w , ̂F(w ) | j J 
j
j
w , ̂F(w ) | j J 
j
j
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14

wi , ̂F(wi ) | i I


wi , ̂F(w i ) | i I


wi , ̂F(w i ) | i I
16

Stochastic BFGS: Approach 1
We could not make this work in a robust manner!
1. Two sources of error
   • sample variance
   • lack of sample uniformity
   $s = w_I - w_J, \qquad y = g_I - g_J$
2. Initial reaction
   • control the quality of the average gradients
   • use of sample variance … dynamic sampling

Proposed solution: control the quality of the curvature estimate y directly.
17
Key idea: avoid differencing
Standard definition
$y = \hat{\nabla} F(w_I) - \hat{\nabla} F(w_J)$

arises from

$\hat{\nabla} F(w_I) - \hat{\nabla} F(w_J) \approx \hat{\nabla}^2 F(w_I)\,(w_I - w_J)$

Hessian-vector products are often available:

$y = \hat{\nabla}^2 F(w_I)\, s$

Define the curvature vector for L-BFGS via a Hessian-vector product, performed only every L iterations.
18
Structure of Hessian-vector product
$y = \hat{\nabla}^2 F(w_I)\, s$

Mini-batch stochastic Hessian-vector product:

$\hat{\nabla}^2 F(w_k)\, s = \frac{1}{b_H} \sum_{i \in S_H} \nabla^2 f(w; x_i, z_i)\, s, \qquad |S_H| = b_H$

1. Code the Hessian-vector product directly (an example sketch follows below)
2. Achieve sample uniformity automatically (cf. Schraudolph)
3. Avoid numerical problems when $\|s\|$ is small
4. Control the cost of the y computation
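As an example, for binary logistic regression the Hessian-vector product can be coded directly in closed form; a minimal sketch of the mini-batch version (assuming this particular loss and an optional ℓ2 term, which may differ from the loss used in the experiments):

```python
import numpy as np

def hessian_vector_product(w, s, X, lam=0.0):
    """Mini-batch Hessian-vector product for binary logistic regression:
    (1/b_H) * sum_i p_i (1 - p_i) (x_i^T s) x_i + lam * s, with p_i = sigmoid(w^T x_i)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))     # predictions on the b_H-sample (rows of X)
    d = p * (1.0 - p) * (X @ s)          # per-example curvature weight times x_i^T s
    return X.T @ d / X.shape[0] + lam * s
```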
19
The Proposed Algorithm
$w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k)$

[Schematic: as on slide 16, but each correction pair is now $s = w_I - w_J$, $y = \hat{\nabla}^2 F(w_I)\, s$, computed from successive collections $\{w_j \mid j \in J\}$ and $\{w_i \mid i \in I\}$ along $w_0, w_1, \ldots, w_{14}$ and yielding $H_1, H_2, H_3, \ldots$]
20
Algorithmic Parameters
• b: stochastic gradient batch size
• bH: Hessian-vector batch size
• L: controls frequency of quasi-Newton updating
• M: memory parameter in L-BFGS updating; M = 5
- use limited memory form
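Putting the pieces together, a condensed sketch of the overall SQN loop under these parameters (it assumes the `lbfgs_two_loop` and `hessian_vector_product` helpers sketched earlier are in scope; an illustration, not the authors' implementation):

```python
import numpy as np

def sqn(grad_f, hess_vec, w0, X, Z, b=300, bH=1000, L=20, M=5,
        beta=1.0, iters=1000, seed=0):
    """Stochastic quasi-Newton sketch: mini-batch gradient every iteration,
    one Hessian-vector correction pair (s, y) every L iterations, L-BFGS memory M."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = X.shape[0]
    s_list, y_list = [], []
    w_sum, w_prev_avg = np.zeros_like(w), None
    for k in range(1, iters + 1):
        S = rng.choice(N, size=b, replace=False)
        g = grad_f(w, X[S], Z[S])                   # stochastic gradient on batch of size b
        d = lbfgs_two_loop(g, s_list, y_list) if s_list else g   # H_t * g (plain SGD at first)
        w -= (beta / k) * d                         # alpha_k = beta / k
        w_sum += w
        if k % L == 0:                              # curvature update every L iterations
            w_avg = w_sum / L                       # average iterate over the last block
            if w_prev_avg is not None:
                s = w_avg - w_prev_avg              # s = w_I - w_J
                SH = rng.choice(N, size=bH, replace=False)
                y = hess_vec(w_avg, s, X[SH])       # y = Hessian-vector product on b_H sample
                s_list.append(s); y_list.append(y)
                if len(s_list) > M:                 # keep only the last M pairs
                    s_list.pop(0); y_list.pop(0)
            w_prev_avg, w_sum = w_avg, np.zeros_like(w)
    return w
```

Note how the curvature cost is amortized: only one Hessian-vector product on a sample of size bH is performed every L iterations.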
21
Need Hessian to implement a quasi-Newton method?
Are you out of your
mind?
We don't need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.
22
Numerical Tests
wk 1  wk   k ̂f (wk )
k   / k
wk 1  wk   k H k ̂f (wk )
k   / k
Stochastic gradient
method (SGD)
Stochastic quasi-Newton
method (SQN)
parameter  to be fixed at start for each method;
found by bisection procedure
It is well know that SGD is highly sensitive to choice of steplength,
and so will be the SQN method (though perhaps less)
23
RCV1 Problem
n = 112919, N = 688329

[Plot: SGD vs. SQN against accessed data points (including the Hessian-vector products)]
• b = 50, 300, 1000
• M = 5, L = 20, bH = 1000
24
Speech Problem
n = 30315, N = 191607

[Plot: SGD vs. SQN]
• b = 100, 500; M = 5, L = 20, bH = 1000
25
Varying Hessian batch bH: RCV1
b=300
26
Varying memory size M in limited memory BFGS: RCV1
27
Varying L-BFGS Memory Size: Synthetic problem
28
Generalization Error: RCV1 Problem
[Plot: SQN vs. SGD]
29
Test Problems
• Synthetically Generated Logistic Regression: Singer et al.
  – n = 50, N = 7000
  – Training data: $x_i \in \mathbb{R}^n$, $z_i \in [0, 1]$
• RCV1 dataset
  – n = 112919, N = 688329
  – Training data: $x_i \in [0, 1]^n$, $z_i \in [0, 1]$
• SPEECH dataset
  – NF = 235, |C| = 129
  – n = NF × |C| → n = 30315, N = 191607
  – Training data: $x_i \in \mathbb{R}^{NF}$, $z_i \in [1, 129]$
30
Iteration Costs
SGD: $w_{k+1} = w_k - \alpha_k \hat{\nabla} F(w_k)$
• mini-batch stochastic gradient
Cost per iteration: $bn$

SQN: $w_{k+1} = w_k - \alpha_k H_t \hat{\nabla} F(w_k)$
• mini-batch stochastic gradient
• Hessian-vector product every L iterations
• matrix-vector product
Cost per iteration: $bn + b_H n / L + 4Mn$
31
Iteration Costs
SGD: $bn$ vs. SQN: $bn + b_H n / L + 4Mn$, i.e. $300n$ vs. $370n$ for the typical values below.

Typical Parameter Values
• b = 50–1000; here b = 300
• bH = 100–1000; here bH = 1000
• L = 10–20; here L = 20
• M = 3–20; here M = 5
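The 370n figure is just the cost formula evaluated at these values:

```latex
bn + \frac{b_H\, n}{L} + 4Mn
  \;=\; 300n + \frac{1000\,n}{20} + 4 \cdot 5\, n
  \;=\; 300n + 50n + 20n
  \;=\; 370n
  \qquad\text{vs.}\qquad bn = 300n \ \text{for SGD.}
```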
32
Hasn’t this been done before?
Hessian-free Newton method: Martens (2010), Byrd et al. (2011)
- claim: stochastic Newton is not competitive with stochastic BFGS

Prior work: Schraudolph et al.
- similar, but cannot ensure the quality of y
- changes the BFGS formula into a one-sided form
33
Supporting theory?
Work in progress: Figen Oztoprak, Byrd, Solntsev
- combine the classical analysis (Murata, Nemirovsky et al.)
- with asymptotic quasi-Newton theory
- effect on constants (condition number)
- invoke self-correction properties of BFGS
Practical Implementation: limited memory BFGS
- loses superlinear convergence property
- enjoys self-correction mechanisms
34
Small batches: RCV1 Problem
bH=1000, M=5, L=200
SGD: b adp (accessed data points) per iteration; bn work per iteration
SQN: b + bH / L adp per iteration; bn + bH n / L + 4Mn work per iteration
Parameters L, M and bH provide freedom in adapting the SQN
method to a specific application
35
Alternative quasi-Newton framework
The BFGS method was not derived with noisy gradients in mind
- how do we know it is an appropriate framework?

Start from scratch
- derive quasi-Newton updating formulas that are tolerant to noise
36
Foundations
Define a quadratic model around a reference point z:

$q_z(w) = f(z) + g^T (w - z) + \tfrac{1}{2} (w - z)^T B (w - z)$

Using a collection indexed by I, it is natural to require

$\sum_{j \in I} \bigl[ \nabla q(w_j) - \hat{\nabla} f(w_j) \bigr] = 0$

i.e. the residuals are zero in expectation.

Not enough information to determine the whole model.
37
$q_z(w) = g^T (w - z) + \tfrac{1}{2} (w - z)^T B (w - z)$

Mean square error: given a collection I, choose the model q to minimize

$F(g, B) = \sum_{j \in I} \| \nabla q(w_j) - \hat{\nabla} f(w_j) \|^2$

Define $s_j = w_j - z$ and restate the problem as

$\min_{g, B} \sum_{j \in I} \| B s_j + g - \hat{\nabla} f(w_j) \|^2$

Differentiating w.r.t. g:

$\sum_{j \in I} \bigl\{ B s_j + g - \hat{\nabla} f(w_j) \bigr\} = 0$

Encouraging: we recover the residual condition.
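Spelling out the differentiation step (a direct computation from the least-squares problem above; since $\nabla q(w_j) = g + B s_j$, the optimality condition for g is exactly the residual condition of the previous slide):

```latex
\frac{\partial}{\partial g} \sum_{j \in I} \| B s_j + g - \hat{\nabla} f(w_j) \|^2
  = 2 \sum_{j \in I} \bigl( B s_j + g - \hat{\nabla} f(w_j) \bigr) = 0
\;\Longleftrightarrow\;
g + B\bar{s} = \frac{1}{|I|} \sum_{j \in I} \hat{\nabla} f(w_j),
\qquad \bar{s} = \frac{1}{|I|} \sum_{j \in I} s_j .
```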
38
The End
39