Generalization properties of multiple passes stochastic gradient method

Silvia Villa
joint work with Lorenzo Rosasco
Laboratory for Computational and Statistical Learning, IIT and MIT
http://lcsl.mit.edu/data/silviavilla

– Computational and Statistical trade-offs in learning –
Paris 2016
Introduction

Motivation

Large scale learning = Statistics + Optimization

Stochastic gradient method: single or multiple passes over the data?
Outline

Problem setting
Tikhonov regularization for learning
  Assumptions: source condition
  Theoretical results
Learning with the stochastic gradient method
  Algorithms
  Theoretical results and discussion
Proof
Setting

Problem setting

H separable Hilbert space
ρ probability distribution on H × R

Problem: minimize over w ∈ H the expected risk
    E(w) = ∫_{H×R} (⟨w, x⟩ − y)² dρ(x, y),
given {(x_1, y_1), . . . , (x_n, y_n)} i.i.d. with respect to ρ.
Special cases of interest: linear and functional regression

Let
    y_i = ⟨w_*, x_i⟩ + δ_i,    i = 1, . . . , n,
with x_i, δ_i random i.i.d. and w_*, x_i ∈ H.

- random design linear regression, H = R^d
- functional regression, H infinite dimensional Hilbert space
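As a running illustration for the sketches that follow, here is a minimal synthetic-data generator for the random design case H = R^d. The Gaussian design, the noise level, and the name `make_data` are assumptions made for these examples, not part of the slides.

```python
import numpy as np

def make_data(n, d, noise=0.5, seed=0):
    """Sample (x_i, y_i) i.i.d. with y_i = <w_*, x_i> + delta_i (random design, H = R^d)."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d) / np.sqrt(d)   # fixed target vector w_*
    X = rng.standard_normal((n, d))                # x_i ~ N(0, I_d)
    y = X @ w_star + noise * rng.standard_normal(n)
    return X, y, w_star
```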
Special cases of interest: learning with kernels

Ξ × R input/output space with probability µ.
H_K RKHS with reproducing kernel K, w(ξ) = ⟨w, K(ξ, ·)⟩_{H_K}

Problem: minimize over w ∈ H_K
    ∫_{Ξ×R} (w(ξ) − y)² dµ(ξ, y)

If ρ is the distribution of (ξ, y) ↦ (K(ξ, ·), y) = (x, y), then
    ∫_{Ξ×R} (w(ξ) − y)² dµ(ξ, y) = ∫_{Ξ×R} (⟨w, K(ξ, ·)⟩ − y)² dµ = ∫_{H_K×R} (⟨w, x⟩ − y)² dρ(x, y).
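To make the identification x = K(ξ, ·) concrete, a small sketch with a Gaussian kernel: a candidate w = Σ_j c_j K(ξ_j, ·) is evaluated through the Gram matrix, exactly as ⟨w, x⟩ is in the linear case. The kernel choice and the helper names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2)) for row-wise inputs."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def evaluate(c, Xi_train, Xi_new, sigma=1.0):
    """Evaluate w = sum_j c_j K(xi_j, .) at new points: w(xi) = <w, K(xi, .)> = sum_j c_j K(xi_j, xi)."""
    return gaussian_kernel(Xi_new, Xi_train, sigma) @ c
```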
Tikhonov regularization for learning

A classical approach: Tikhonov regularization of the empirical risk

    ŵ_λ = argmin_{w ∈ H} Ê(w) + λ R(w)

- empirical risk: Ê(w) = (1/n) Σ_{i=1}^n (⟨w, x_i⟩ − y_i)²
- regularizer: R = ‖·‖²_H
- regularization parameter: λ > 0

What about statistics?
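A minimal finite-dimensional sketch of the Tikhonov estimator ŵ_λ (ridge regression): in H = R^d the minimizer of the regularized empirical risk has the closed form solved below. The `make_data` helper from the earlier sketch is assumed; the choice of λ in the usage comment anticipates the error bound stated later, under the assumption r = 1.

```python
import numpy as np

def tikhonov(X, y, lam):
    """Minimize (1/n) sum_i (<w, x_i> - y_i)^2 + lam ||w||^2 via the normal equations."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Example usage (lambda_hat = n^{-1/(2r+1)}, assuming r = 1):
# X, y, w_star = make_data(n=500, d=20)
# w_hat = tikhonov(X, y, lam=500 ** (-1 / 3))
```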
Assumptions

Boundedness. There exist κ > 0 and M > 0 such that
    |y| ≤ M and ‖x‖²_H ≤ κ    a.s.

Existence of a (minimal norm) solution.
    O = argmin_H E ≠ ∅,    w† = argmin_{w ∈ O} ‖w‖_H

More assumptions are needed for finite sample error bounds...
Source condition

The boundedness assumption implies that the operator
    T : H → H,    w ↦ ∫_H ⟨w, x⟩ x dρ_H(x),    ρ_H marginal of ρ,
is well defined.

Source condition. Let r ∈ [1/2, +∞[ and assume that
    ∃ h ∈ H such that w† = T^{r−1/2} h.    (SC)
Source condition: remarks

- If r = 1/2, no assumption
- If r > 1/2, (SC) implies w† is in a subspace of H

[Figure: nested subspaces T^{r_2−1/2}(H) ⊆ T^{r_1−1/2}(H) ⊆ H, with r_1 < r_2]

Spectral point of view
If (σ_i, v_i)_{i∈I} is the eigenbasis of T, then
    ‖w†‖²_H = Σ_i |⟨w†, v_i⟩|² < +∞.
If h = T^{1/2−r} w†, then
    ‖h‖²_H = Σ_i |⟨w†, v_i⟩|² / σ_i^{2r−1} < +∞.
Results: error bounds

Theorem. Assume boundedness and (SC) for some r ∈ ]1/2, +∞[. Let λ̂ = n^{−1/(2r+1)}. Then, with high probability,
    ‖ŵ_λ̂ − w†‖_H = O(n^{−(r−1/2)/(2r+1)})   if r ≤ 3/2,
    ‖ŵ_λ̂ − w†‖_H = O(n^{−1/2})              if r > 3/2.
Proof: bias-variance trade-off

Set
    w_λ = argmin_{w ∈ H} E(w) + λ‖w‖²_H,
and decompose the error:
    ‖ŵ_λ − w†‖_H ≤ ‖ŵ_λ − w_λ‖_H + ‖w_λ − w†‖_H
                    (variance)       (bias)
Remarks

- The bounds are minimax [Blanchard-Muecke 2016]: if
      P_r = {ρ | boundedness and (SC) are satisfied},
  then
      min_{ŵ ∈ H} max_{ρ ∈ P_r} E‖ŵ − w†‖_H ≥ C n^{−(r−1/2)/(2r+1)}.
- Saturation for r > 3/2.
- Adaptivity via the Lepskii/balancing principle [De Vito-Pereverzev-Rosasco 2010].
- Improved bounds under further assumptions on the decay of the eigenvalues of T [De Vito-Caponnetto 2006, ...].

BUT... what about the optimization error?
A new perspective [Bottou-Bousquet 2008]

Let ŵ_{λ,t} be the t-th iterate of some algorithm for regularized empirical risk minimization. Then
    ‖ŵ_{λ,t} − w†‖_H ≤ ‖ŵ_{λ,t} − ŵ_λ‖_H + ‖ŵ_λ − w_λ‖_H + ‖w_λ − w†‖_H
                        (optimization)      (variance)       (bias)
where variance + bias = statistics.

A new trade-off ⇒ optimization accuracy tailored to statistical accuracy
Optimization error

Recently, a *lot* of interest in methods to solve
    min_{w ∈ H} Σ_{i=1}^n V_i(w)

- Let V_i(w) = V(⟨w, x_i⟩_H, y_i) + λ‖w‖²_H for some loss function V
- Large scale setting, n “large”
- Focus on batch and stochastic first order methods
A new perspective [Bottou-Bousquet 2008]

    ‖ŵ_{λ,t} − w†‖_H ≤ ‖ŵ_{λ,t} − ŵ_λ‖_H + ‖ŵ_λ − w_λ‖_H + ‖w_λ − w†‖_H
                        (optimization)      (variance)       (bias)

This suggests
- Approach 1: combine statistics with optimization
- Approach 2: use a different decomposition
Learning with the stochastic gradient method

Another approach: the stochastic gradient method

Directly minimize the expected risk E:
    ŵ_t = ŵ_{t−1} − γ_t (⟨ŵ_{t−1}, x_{i_t}⟩ − y_{i_t}) x_{i_t}

- each iteration corresponds to one sample/gradient evaluation
- no explicit regularization
- several theoretical results

BUT... in practice, MULTIPLE PASSES over the data are used
Multiple passes stochastic gradient learning

Rewrite the iteration. Let ŵ_0 = 0 and γ > 0. For t ∈ N, iterate:
    v̂_0 = ŵ_t
    for i = 1, . . . , n:  v̂_i = v̂_{i−1} − (γ/n)(⟨v̂_{i−1}, x_i⟩ − y_i) x_i
    ŵ_{t+1} = v̂_n

- incremental gradient method for the empirical risk Ê [Bertsekas-Tsitsiklis 2000]
- each inner cycle over the data is one pass of the stochastic gradient method
- t is the number of passes over the data (epochs)
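A direct sketch of the iteration above for H = R^d: one inner loop over the data is one pass (epoch), and the outer index t counts passes. The function names are illustrative; γ is a fixed universal step-size as in the slides.

```python
import numpy as np

def incremental_epoch(w, X, y, gamma):
    """One pass over the data: v_i = v_{i-1} - (gamma/n) (<v_{i-1}, x_i> - y_i) x_i."""
    n = X.shape[0]
    v = w.copy()
    for i in range(n):
        v -= (gamma / n) * (v @ X[i] - y[i]) * X[i]
    return v

def multipass_sgd(X, y, gamma, n_epochs):
    """Run t = n_epochs passes starting from w_0 = 0; returns one iterate per epoch."""
    w = np.zeros(X.shape[1])
    iterates = [w.copy()]
    for _ in range(n_epochs):
        w = incremental_epoch(w, X, y, gamma)
        iterates.append(w.copy())
    return iterates
```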
Main question

How many passes do we need to approximately minimize the expected risk E?
Multiple passes stochastic gradient learning

We have
    ŵ_t → argmin_H Ê,
but we would like
    ŵ_t → argmin_H E.

What’s the catch?
Stability and early stopping

Consider the gradient descent iteration for the expected risk:
    w_0 = 0 ∈ H
    v_0 = w_t
    for i = 1, . . . , n:  v_i = v_{i−1} − (γ/n) ∫_{H×R} (⟨v_{i−1}, x⟩ − y) x dρ(x, y)
    w_{t+1} = v_n
Note: step-size γ/n.

    w_t → w_{t+1} → . . . → argmin_H E
    ŵ_t → ŵ_{t+1} → . . . ↘ argmin_H Ê

The empirical iterates ŵ_t track the population iterates w_t for a while, but eventually drift towards the minimizer of the empirical risk.
Early stopping - semi-convergence

[Figure: empirical error and expected error as a function of the iteration number, illustrating semi-convergence of the expected error.]
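A sketch of early stopping in practice: run the multiple-pass iteration, monitor the error on a held-out set as a proxy for the expected risk, and keep the iterate with the smallest held-out error. The train/validation split, the helper names, and the use of a held-out proxy are assumptions made for this example.

```python
import numpy as np

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def early_stopping_run(X_tr, y_tr, X_val, y_val, gamma, max_epochs):
    """Run the multiple-pass iteration, track held-out error per epoch, return the best iterate."""
    n, d = X_tr.shape
    w = np.zeros(d)
    best_w, best_err, history = w.copy(), mse(w, X_val, y_val), []
    for t in range(1, max_epochs + 1):
        for i in range(n):                                   # one pass over the data
            w -= (gamma / n) * (w @ X_tr[i] - y_tr[i]) * X_tr[i]
        history.append((mse(w, X_tr, y_tr), mse(w, X_val, y_val)))
        if history[-1][1] < best_err:
            best_w, best_err = w.copy(), history[-1][1]
    return best_w, history
```

Plotting the two columns of `history` against the epoch index reproduces the semi-convergence picture above: the training error keeps decreasing while the held-out error eventually turns upward.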
Early stopping - example

[Figures: the estimate after the first, 10th, and 100th epoch.]
Main results: consistency

Theorem. Assume boundedness. Let γ ∈ ]0, κ^{−1}[ and let t_*(n) be such that
    t_*(n) → +∞ and t_*(n) (log n / n)^{1/3} → 0.
Assume O = argmin_H E ≠ ∅. Then
    ‖ŵ_{t_*(n)} − w†‖ → 0    ρ-a.s.

- universal step-size fixed a priori
- early stopping needed for consistency
- multiple passes are needed
First comparison with one pass stochastic gradient

Consistency holds in the following cases:

- Multiple passes stochastic gradient method:
      γ_t = γ/n,    t_*(n) ∼ (n/log n)^{1/3}
- One pass stochastic gradient method:
      γ_t = γ/√n,    t_*(n) = 1 (+ averaging)
  [∼ Ying-Pontil 2008 and Dieuleveut-Bach 2014]

Why do multiple passes make sense?
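For comparison, a sketch of the one pass variant: a single sweep over the data with step-size γ/√n and averaging of the iterates, as in the regime quoted above. Again a finite-dimensional illustration with assumed names.

```python
import numpy as np

def onepass_sgd(X, y, gamma0=1.0, average=True):
    """Single pass with step-size gamma0/sqrt(n); optionally return the average of the iterates."""
    n, d = X.shape
    gamma = gamma0 / np.sqrt(n)
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for k in range(n):
        w -= gamma * (w @ X[k] - y[k]) * X[k]
        w_bar += (w - w_bar) / (k + 1)      # running average of w_1, ..., w_n
    return w_bar if average else w
```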
Main results: error bounds

Theorem. Assume boundedness and (SC) with r ∈ ]1/2, +∞[. If
    t_*(n) = ⌈n^{1/(2r+1)}⌉,
then, with high probability,
    ‖ŵ_{t_*(n)} − w†‖_H = O(n^{−(r−1/2)/(2r+1)}).

- optimal capacity independent rates for ‖ŵ_t − w†‖_H
- no saturation w.r.t. r
- the stopping rule depends on the source condition ⇒ a balancing principle can be used
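How the stopping rule of the theorem would be used with the multiple-pass sketch given earlier; r is unknown in practice (hence the balancing principle mentioned above), so the value below is only an assumption for illustration.

```python
import numpy as np

n, r = 500, 1.0                                   # r assumed known here only for illustration
t_star = int(np.ceil(n ** (1 / (2 * r + 1))))     # t_*(n) = ceil(n^{1/(2r+1)})
# X, y, w_star = make_data(n, d=20)
# w_hat = multipass_sgd(X, y, gamma=0.1, n_epochs=t_star)[-1]
```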
Stochastic gradient method (one pass)

    ŵ_t = ŵ_{t−1} − γ_t ((⟨ŵ_{t−1}, x_t⟩ − y_t) x_t + λ_t ŵ_{t−1})

- classically studied in stochastic optimization, for strongly convex functions (λ_t = 0) (Robbins-Monro), in the finite dimensional setting
- in a RKHS, square loss first in [Smale-Yao 2006]
Stochastic gradient method (one pass) in RKHS - square loss

Assume the source condition.

- Let r ∈ ]1/2, +∞[, γ_t = n^{−2r/(2r+1)}, λ_t = 0; then [Ying-Pontil 2008]
      E‖w_t − w†‖_H ≤ c t^{−(r−1/2)/(2r+1)}
- Let r ∈ ]1/2, 1], γ_t = n^{−2r/(2r+1)}, λ_t = n^{−1/(2r+1)}; then [Tarrès-Yao 2011]
      ‖w_t − w†‖_H = O(t^{−(r−1/2)/(2r+1)})    with high probability
- Optimal rates, capacity dependent bounds in expectation on the risk [Dieuleveut-Bach 2014] (saturation for r > 1)
Comparison between multiple passes and one pass

There are two regimes with optimal error bounds:

- Multiple passes stochastic gradient method:
      γ_t = γ n^{−1},    t_*(n) ∼ n^{1/(2r+1)}
- One pass stochastic gradient method:
      γ_t = γ n^{−2r/(2r+1)},    t_*(n) = 1

...in practice, model selection is needed, and multiple passes stochastic gradient is a natural approach.
Comparison with gradient descent learning

Let ŵ_0 = 0 and γ > 0. Iterate:
    ŵ_{t+1} = ŵ_t − (γ/n) Σ_{i=1}^n (⟨ŵ_t, x_i⟩ − y_i) x_i

Multiple passes stochastic gradient vs. gradient descent: same computational and statistical properties [Bauer-Pereverzev-Rosasco 2007], [Caponnetto-Yao 2008], [Raskutti-Wainwright-Yu 2013]
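A sketch of the batch gradient descent iteration from this slide, for H = R^d; with step-size γ/n it has, per iteration, the same cost (n gradient evaluations) as one pass of the incremental method sketched earlier.

```python
import numpy as np

def batch_gd(X, y, gamma, n_iters):
    """Gradient descent on the empirical risk: w_{t+1} = w_t - (gamma/n) sum_i (<w_t, x_i> - y_i) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        w -= (gamma / n) * (X.T @ (X @ w - y))
    return w
```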
Proof

Proof: bias-variance trade-off

Define
    w_0 = 0 ∈ H
    v_0 = w_t
    for i = 1, . . . , n:  v_i = v_{i−1} − (γ/n) ∫_{H×R} (⟨v_{i−1}, x⟩ − y) x dρ(x, y)
    w_{t+1} = v_n
so that w_t is the (nt)-th iteration of gradient descent with step-size γ/n applied to the expected risk.

Then
    ‖ŵ_t − w†‖_H ≤ ‖ŵ_t − w_t‖_H + ‖w_t − w†‖_H
                    (variance)      (bias = optimization)
Variance - Step 1

ŵ_t can be written as a perturbed gradient descent iteration on the empirical risk:
    ŵ_{t+1} = (I − γ T̂) ŵ_t + γ ĝ + γ² (Â ŵ_t − b̂)
with
    T̂ = (1/n) Σ_{i=1}^n x_i ⊗ x_i,    where (x ⊗ x) w = ⟨w, x⟩ x
    ĝ = (1/n) Σ_{i=1}^n y_i x_i
    Â = (1/n²) Σ_{k=2}^n [ Π_{i=k+1}^n (I − (γ/n) x_i ⊗ x_i) ] (x_k ⊗ x_k) Σ_{j=1}^{k−1} x_j ⊗ x_j
    b̂ = (1/n²) Σ_{k=2}^n [ Π_{i=k+1}^n (I − (γ/n) x_i ⊗ x_i) ] (x_k ⊗ x_k) Σ_{j=1}^{k−1} y_j x_j
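A small numerical sanity check of the rewriting above (not part of the proof): for small n and d one can form T̂, ĝ, Â, b̂ explicitly and verify that one epoch of the incremental loop coincides with the perturbed gradient step (I − γT̂)ŵ + γĝ + γ²(Âŵ − b̂). Indices are 0-based in the code; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 6, 3, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

# One epoch of the incremental (multiple passes SG) update.
v = w.copy()
for i in range(n):
    v -= (gamma / n) * (v @ X[i] - y[i]) * X[i]

# Operators of the perturbed gradient step (slide indices k = 2..n are 1..n-1 here).
T_hat = X.T @ X / n                         # (1/n) sum_i x_i (x) x_i
g_hat = X.T @ y / n                         # (1/n) sum_i y_i x_i
A_hat = np.zeros((d, d))
b_hat = np.zeros(d)
for k in range(1, n):
    S = np.eye(d)                           # S = (I - a_{n-1}) ... (I - a_{k+1}), a_i = (gamma/n) x_i (x) x_i
    for i in range(k + 1, n):
        S = (np.eye(d) - (gamma / n) * np.outer(X[i], X[i])) @ S
    xk = np.outer(X[k], X[k])
    A_hat += S @ xk @ (X[:k].T @ X[:k]) / n**2
    b_hat += S @ xk @ (X[:k].T @ y[:k]) / n**2

v_closed = (np.eye(d) - gamma * T_hat) @ w + gamma * g_hat + gamma**2 * (A_hat @ w - b_hat)
assert np.allclose(v, v_closed)
```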
Variance - Step 1

w_t is a perturbed gradient descent iteration with step γ on the risk:
    w_{t+1} = (I − γT) w_t + γ g + γ² (A w_t − b)
with
    T = E[x ⊗ x],    g = E[g_ρ(x) x]
    A = (1/n²) Σ_{k=2}^n [ Π_{i=k+1}^n (I − (γ/n) T) ] T Σ_{j=1}^{k−1} T
    b = (1/n²) Σ_{k=2}^n [ Π_{i=k+1}^n (I − (γ/n) T) ] T Σ_{j=1}^{k−1} g
Variance - Step 2

    ‖ŵ_t − w_t‖_H ≤ γ Σ_{k=0}^{t−1} ‖ (I − γ T̂ + γ² Â)^{t−k+1} [ (T − T̂) w_k + γ(Â − A) w_k + (ĝ − g) − γ(b̂ − b) ] ‖
                  ≤ γ Σ_{k=0}^{t−1} [ (‖T − T̂‖ + γ‖Â − A‖) ‖w_k‖_H + ‖ĝ − g‖_H + γ‖b̂ − b‖_H ]

The differences T − T̂, Â − A, ĝ − g, b̂ − b are sums of (bounded) martingale differences; applying the Pinelis concentration inequality,
    ‖ŵ_t − w_t‖_H ≤ c_1 log(16/δ) t / √n.
Bias

Convergence results for gradient descent applied to E:
- standard approach based on spectral calculus (the square loss is used here!)
- convergence depends on the step-size γ ∈ ]0, nκ^{−1}[ and on the source condition:
      ‖w_t − w†‖_H ≤ c ((r − 1/2)/(γ t))^{r−1/2}
Bias-variance trade-off - again

Then, with probability greater than 1 − δ,
    ‖ŵ_t − w†‖_H ≤ log(16/δ) ( c_1 t n^{−1/2} + c_2 t^{1/2−r} ).
Contributions and future work

Contributions
- first results on the generalization properties of the multiple passes stochastic gradient method
- the results support commonly used heuristics, e.g. early stopping

Future work
- extension to other losses and sampling techniques [Lin-Rosasco 2016]
- capacity dependent bounds and optimal bounds for the risk
- unified analysis for one pass and multiple passes
References

L. Rosasco and S. Villa. Learning with Incremental Iterative Regularization. NIPS 2015.
Finite sample bounds for the risk

Corollary. Assume boundedness and the source condition with r ∈ ]1/2, +∞[. Choosing t_*(n) = n^{1/(2(r+1))}, with high probability
    E(ŵ_{t_*(n)}) − inf_H E = O(n^{−r/(r+1)}).

The rates are not optimal... but they are valid under a more general source condition (even in the non-attainable case).
Experimental results

Incremental gradient in a RKHS

H RKHS of functions from X to Y with kernel K : X × X → R. Let ŵ_0 = 0; then
    ŵ_t = Σ_{k=1}^n (α_t)_k K_{x_k},
where α_t = ((α_t)_1, . . . , (α_t)_n) ∈ R^n satisfies α_{t+1} = c_t^n, c_t^0 = α_t, and for i = 1, . . . , n
    (c_t^i)_k = (c_t^{i−1})_k − (γ/n) ( Σ_{j=1}^n K(x_i, x_j)(c_t^{i−1})_j − y_i )   if k = i,
    (c_t^i)_k = (c_t^{i−1})_k                                                        if k ≠ i.
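A sketch of the coefficient update above for a precomputed Gram matrix; the name `kernel_incremental_sgd` and the use of a dense Gram matrix are assumptions made for clarity rather than an efficient implementation.

```python
import numpy as np

def kernel_incremental_sgd(K, y, gamma, n_epochs):
    """Multiple passes incremental gradient in a RKHS, parametrized by alpha in R^n.

    K is the n x n Gram matrix K[i, j] = K(x_i, x_j); each inner step updates one coefficient:
    c_i <- c_i - (gamma/n) (sum_j K(x_i, x_j) c_j - y_i).
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_epochs):
        c = alpha.copy()
        for i in range(n):
            c[i] -= (gamma / n) * (K[i] @ c - y[i])
        alpha = c
    return alpha
```

Prediction at a new point ξ is then ŵ_t(ξ) = Σ_k (α_t)_k K(x_k, ξ), e.g. via a Gram block between training and test points as in the earlier kernel sketch.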