Generalization properties of multiple passes stochastic gradient method
Silvia Villa, joint work with Lorenzo Rosasco
Laboratory for Computational and Statistical Learning, IIT and MIT
http://lcsl.mit.edu/data/silviavilla
Computational and Statistical trade-offs in learning, Paris 2016

Motivation
Large scale learning = Statistics + Optimization.
Stochastic gradient method: single or multiple passes over the data?

Outline
- Problem setting
- Tikhonov regularization for learning: assumptions (source condition), theoretical results
- Learning with the stochastic gradient method: algorithms, theoretical results and discussion
- Proof

Problem setting
$H$ separable Hilbert space, $\rho$ probability distribution on $H \times \mathbb{R}$.
Problem: minimize over $w \in H$
$E(w) = \int_{H \times \mathbb{R}} (\langle w, x\rangle - y)^2 \, d\rho(x, y)$,
given $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ i.i.d. with respect to $\rho$.

Special cases of interest: linear and functional regression
Let $y_i = \langle w_*, x_i\rangle + \delta_i$, $i = 1, \ldots, n$, with $x_i, \delta_i$ random i.i.d. and $w_*, x_i \in H$.
- random design linear regression: $H = \mathbb{R}^d$
- functional regression: $H$ infinite dimensional Hilbert space

Special cases of interest: learning with kernels
$\Xi \times \mathbb{R}$ input/output space with probability $\mu$; $H_K$ RKHS with reproducing kernel $K$, so that $w(\xi) = \langle w, K(\xi, \cdot)\rangle_{H_K}$.
Problem: minimize over $w \in H_K$
$\int_{\Xi \times \mathbb{R}} (w(\xi) - y)^2 \, d\mu(\xi, y)$.
If $\rho$ is the distribution of $(\xi, y) \mapsto (K(\xi, \cdot), y) = (x, y)$, then
$\int_{\Xi \times \mathbb{R}} (w(\xi) - y)^2 \, d\mu(\xi, y) = \int_{\Xi \times \mathbb{R}} (\langle w, K(\xi, \cdot)\rangle - y)^2 \, d\mu = \int_{H_K \times \mathbb{R}} (\langle w, x\rangle - y)^2 \, d\rho(x, y)$,
so learning with kernels is again an instance of the problem above.

Tikhonov regularization for learning
A classical approach: Tikhonov regularization of the empirical risk,
$\hat w_\lambda = \operatorname{argmin}_{w \in H} \hat E(w) + \lambda R(w)$,
with
- empirical risk $\hat E(w) = \frac{1}{n} \sum_{i=1}^n (\langle w, x_i\rangle - y_i)^2$
- regularizer $R = \|\cdot\|_H^2$
- regularization parameter $\lambda > 0$
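As a concrete reference point, here is a minimal NumPy sketch of this estimator in the finite-dimensional case $H = \mathbb{R}^d$, where $\hat w_\lambda$ has a closed form via the normal equations. The function name, the toy data, and the particular value of $\lambda$ are illustrative choices, not part of the talk.

```python
import numpy as np

def tikhonov_estimator(X, y, lam):
    """Tikhonov (ridge) regularized least squares in H = R^d.

    Solves w_hat = argmin_w (1/n) * ||X w - y||^2 + lam * ||w||^2
    via the normal equations (X^T X / n + lam * I) w = X^T y / n.
    """
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Toy usage: noisy linear data y_i = <w_*, x_i> + noise.
rng = np.random.default_rng(0)
n, d = 200, 10
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
# Some fixed regularization value; its statistical choice is exactly the question discussed next.
w_hat = tikhonov_estimator(X, y, lam=n ** (-0.5))
```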
What about statistics?

Assumptions
Boundedness: there exist $\kappa > 0$ and $M > 0$ such that $|y| \le M$ and $\|x\|_H^2 \le \kappa$ almost surely.
Existence of a (minimal norm) solution: $O = \operatorname{argmin}_H E \neq \emptyset$ and $w^\dagger = \operatorname{argmin}_{w \in O} \|w\|_H$.
More assumptions are needed for finite sample error bounds...

Source condition
The boundedness assumption implies that the operator
$T \colon H \to H$, $w \mapsto \int_H \langle w, x\rangle\, x \, d\rho_H(x)$, with $\rho_H$ the marginal of $\rho$ on $H$,
is well defined.
Source condition (SC): let $r \in [1/2, +\infty[$ and assume that there exists $h \in H$ such that $w^\dagger = T^{r - 1/2} h$.

Source condition: remarks
If $r = 1/2$, (SC) is no assumption; if $r > 1/2$, (SC) implies that $w^\dagger$ lies in a subspace of $H$.
[Figure: nested ranges $T^{r_2 - 1/2}(H) \subset T^{r_1 - 1/2}(H) \subset H$ for $1/2 < r_1 < r_2$.]
Spectral point of view: if $(\sigma_i, v_i)_{i \in I}$ is the eigenbasis of $T$, then
$\|w^\dagger\|_H^2 = \sum_i |\langle w^\dagger, v_i\rangle|^2 < +\infty$,
and if $h = T^{1/2 - r} w^\dagger$, then
$\|h\|_H^2 = \sum_i |\langle w^\dagger, v_i\rangle|^2 / \sigma_i^{2r - 1} < +\infty$.

Results: error bounds
Theorem. Assume boundedness and (SC) for some $r \in \, ]1/2, +\infty[$. Let $\hat\lambda = n^{-\frac{1}{2r+1}}$. Then, with high probability,
$\|\hat w_{\hat\lambda} - w^\dagger\|_H = O\big(n^{-\frac{r - 1/2}{2r+1}}\big)$ if $r \le 3/2$, and $\|\hat w_{\hat\lambda} - w^\dagger\|_H = O(n^{-1/2})$ if $r > 3/2$.

Proof: bias-variance trade-off
Set $w_\lambda = \operatorname{argmin}_{w \in H} E(w) + \lambda \|w\|_H^2$ and decompose the error as
$\|\hat w_\lambda - w^\dagger\|_H \le \underbrace{\|\hat w_\lambda - w_\lambda\|_H}_{\text{Variance}} + \underbrace{\|w_\lambda - w^\dagger\|_H}_{\text{Bias}}$.

Remarks
- The bounds are minimax [Blanchard-Muecke 2016]: if $P_r = \{\rho \mid \text{boundedness and (SC) are satisfied}\}$, then $\min_{\hat w} \max_{\rho \in P_r} \mathbb{E}\|\hat w - w^\dagger\|_H \ge C n^{-\frac{r - 1/2}{2r+1}}$.
- Saturation for $r > 3/2$.
- Adaptivity via the Lepskii/balancing principle [De Vito-Pereverzev-Rosasco 2010].
- Improved bounds under further assumptions on the decay of the eigenvalues of $T$ [De Vito-Caponnetto 2006, ...].
BUT... what about the optimization error?

A new perspective [Bottou-Bousquet 2008]
Let $\hat w_{\lambda, t}$ be the $t$-th iteration of some algorithm for regularized empirical risk minimization. Then
$\|\hat w_{\lambda, t} - w^\dagger\|_H \le \underbrace{\|\hat w_{\lambda, t} - \hat w_\lambda\|_H}_{\text{Optimization}} + \underbrace{\|\hat w_\lambda - w_\lambda\|_H}_{\text{Variance}} + \underbrace{\|w_\lambda - w^\dagger\|_H}_{\text{Bias}}$,
where the last two terms form the statistical error.
A new trade-off ⇒ optimization accuracy tailored to statistical accuracy.

Optimization error
Recently there has been a *lot* of interest in methods to solve
$\min_{w \in H} \sum_{i=1}^n V_i(w)$.
Let $V_i(w) = V(\langle w, x_i\rangle_H, y_i) + \lambda \|w\|_H^2$ for some loss function $V$.
- Large scale setting, $n$ "large".
- Focus on batch and stochastic first order methods.

The decomposition above suggests two approaches:
- Approach 1: combine statistics with optimization.
- Approach 2: use a different decomposition.

Another approach: stochastic gradient method
Directly minimize the expected risk $E$:
$\hat w_t = \hat w_{t-1} - \gamma_t (\langle \hat w_{t-1}, x_t\rangle - y_t) x_t$
- each iteration corresponds to one sample/gradient evaluation
- no explicit regularization
- several theoretical results
BUT... in practice, MULTIPLE PASSES over the data are used.

Multiple passes stochastic gradient learning
Rewrite the iteration. Let $\hat w_0 = 0$ and $\gamma > 0$. For $t \in \mathbb{N}$, iterate:
$\hat v_0 = \hat w_t$
for $i = 1, \ldots, n$: $\hat v_i = \hat v_{i-1} - (\gamma / n)(\langle \hat v_{i-1}, x_i\rangle - y_i) x_i$
$\hat w_{t+1} = \hat v_n$
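A minimal NumPy sketch of this multiple passes scheme in the finite-dimensional case $H = \mathbb{R}^d$; function and variable names are illustrative. Each outer iteration is one in-order pass over the data with constant step size $\gamma / n$.

```python
import numpy as np

def multiple_passes_sg(X, y, gamma, n_passes):
    """Incremental (multiple passes) stochastic gradient for least squares.

    One outer iteration sweeps once over the n points in order,
    each inner step using the gradient of a single squared residual
    with constant step size gamma / n.
    """
    n, d = X.shape
    w = np.zeros(d)            # w_hat_0 = 0
    for _ in range(n_passes):  # t = 1, ..., n_passes (epochs)
        v = w.copy()           # v_hat_0 = w_hat_t
        for i in range(n):     # one pass over the data
            residual = X[i] @ v - y[i]
            v = v - (gamma / n) * residual * X[i]
        w = v                  # w_hat_{t+1} = v_hat_n
    return w
```

Note that $\gamma$ is fixed once and for all, so the only remaining tuning parameter is the number of passes, which is the quantity studied in the rest of the talk.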
- The scheme above is the incremental gradient method for the empirical risk $\hat E$ [Bertsekas-Tsitsiklis 2000].
- Each inner loop over $i = 1, \ldots, n$ is one pass of the stochastic gradient method over the data.
- $t$ is the number of passes over the data (epochs).

Main question
How many passes do we need to approximately minimize the expected risk $E$?

Multiple passes stochastic gradient learning
We have $\hat w_t \to \operatorname{argmin}_H \hat E$, but we would like $\hat w_t \to \operatorname{argmin}_H E$.
What's the catch?

Stability and early stopping
Consider the gradient descent iteration for the expected risk: $w_0 = 0 \in H$ and, for $t \in \mathbb{N}$:
$v_0 = w_t$
for $i = 1, \ldots, n$: $v_i = v_{i-1} - (\gamma / n) \int_{H \times \mathbb{R}} (\langle v_{i-1}, x\rangle - y)\, x \, d\rho(x, y)$
$w_{t+1} = v_n$
Note: step size $\gamma / n$.
Then $w_t \to w_{t+1} \to \cdots \to \operatorname{argmin}_H E$, while the empirical iterates $\hat w_t \to \hat w_{t+1} \to \cdots$ track them for a while and eventually drift towards $\operatorname{argmin}_H \hat E$.
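The expected error is not observable, so in practice one often monitors a hold-out set and keeps the best iterate seen so far. This is a common heuristic rather than the a priori stopping rules analyzed below; a minimal sketch, reusing the update above (all names illustrative):

```python
import numpy as np

def early_stopped_sg(X_train, y_train, X_val, y_val, gamma, max_passes):
    """Run the multiple-passes update, keeping the iterate with the smallest
    hold-out error seen so far (a practical proxy for the expected risk)."""
    n, d = X_train.shape
    w = np.zeros(d)
    best_w, best_val = w.copy(), np.mean((X_val @ w - y_val) ** 2)
    for _ in range(max_passes):
        for i in range(n):  # one pass of incremental updates
            residual = X_train[i] @ w - y_train[i]
            w = w - (gamma / n) * residual * X_train[i]
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_val:
            best_val, best_w = val_err, w.copy()
    return best_w
```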
Early stopping - semi-convergence
[Figure: empirical error and expected error as a function of the iteration, illustrating semi-convergence of the expected error.]

Early stopping - example
[Figures: the fitted function after the first epoch, after the 10-th epoch, and after the 100-th epoch.]

Main results: consistency
Theorem. Assume boundedness and let $0 < \gamma \le \kappa^{-1}$. Let $t_*(n)$ be such that
$t_*(n) \to +\infty$ and $t_*(n) (\log n / n)^{1/3} \to 0$.
Assume $O = \operatorname{argmin}_H E \neq \emptyset$. Then $\|\hat w_{t_*(n)} - w^\dagger\| \to 0$ $\rho$-almost surely.
- universal step size fixed a priori
- early stopping is needed for consistency
- multiple passes are needed

First comparison with one pass stochastic gradient
Consistency holds in the following cases:
- Multiple passes stochastic gradient method: $\gamma_t = \gamma / n$, $t_*(n) \sim (n / \log n)^{1/3}$.
- One pass stochastic gradient method: $\gamma_t = \gamma / \sqrt{n}$, $t_*(n) = 1$ (+ averaging) [~ Ying-Pontil 2008 and Dieuleveut-Bach 2014].
Why do multiple passes make sense?

Main results: error bounds
Theorem. Assume boundedness and (SC) with $r \in \, ]1/2, +\infty[$. If
$t_*(n) = \lceil n^{1/(2r+1)} \rceil$,
then with high probability
$\|\hat w_{t_*(n)} - w^\dagger\|_H = O\big(n^{-\frac{r - 1/2}{2r+1}}\big)$.
- optimal capacity independent rates for $\|\hat w_t - w^\dagger\|_H$
- no saturation with respect to $r$
- the stopping rule depends on the source condition ⇒ a balancing principle can be used
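To make the stopping rule concrete, a small sketch of the prescription of this theorem: run the scheme above with the universal step size and $t_*(n) = \lceil n^{1/(2r+1)} \rceil$ passes. In practice $r$ is unknown (hence the balancing principle mentioned above); the values of $n$, $r$ and $\gamma$ below are purely illustrative, and $\gamma$ should stay below $\kappa^{-1}$.

```python
import math

def n_passes(n, r):
    """Number of epochs prescribed by the error bound: ceil(n^(1/(2r+1)))."""
    return math.ceil(n ** (1.0 / (2 * r + 1)))

# Illustration: with n = 10**4 samples and r = 1, about 22 passes suffice,
# each with the same universal step size gamma / n.
t_star = n_passes(10**4, r=1.0)
# w_hat = multiple_passes_sg(X, y, gamma=0.5, n_passes=t_star)  # reusing the earlier sketch
```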
Stochastic gradient method (one pass)
$\hat w_t = \hat w_{t-1} - \gamma_t \big( (\langle \hat w_{t-1}, x_t\rangle - y_t) x_t + \lambda_t \hat w_{t-1} \big)$
- classically studied in stochastic optimization, for strongly convex functions ($\lambda_t = 0$) (Robbins-Monro), in the finite dimensional setting
- in a RKHS, for the square loss, first studied in [Smale-Yao 2006]

Stochastic gradient method (one pass) in RKHS - square loss
Assume the source condition.
- Let $r \in \, ]1/2, +\infty[$, $\gamma_t = n^{-2r/(2r+1)}$, $\lambda_t = 0$; then [Ying-Pontil 2008]
$\mathbb{E}\|w_t - w^\dagger\|_H \le c\, t^{-\frac{r - 1/2}{2r+1}}$.
- Let $r \in \, ]1/2, 1]$, $\gamma_t = n^{-2r/(2r+1)}$, $\lambda_t = n^{-1/(2r+1)}$; then [Tarrès-Yao 2011], with high probability,
$\|w_t - w^\dagger\|_H = O\big(t^{-\frac{r - 1/2}{2r+1}}\big)$.
- Optimal rates and capacity dependent bounds in expectation on the risk [Dieuleveut-Bach 2014] (saturation for $r > 1$).

Comparison between multiple passes and one pass
There are two regimes with optimal error bounds:
- Multiple passes stochastic gradient method: $\gamma_t = \gamma n^{-1}$, $t_*(n) \sim n^{1/(2r+1)}$.
- One pass stochastic gradient method: $\gamma_t = \gamma n^{-2r/(2r+1)}$, $t_*(n) = 1$.
...in practice model selection is needed, and multiple passes stochastic gradient is a natural approach.

Comparison with gradient descent learning
Let $\hat w_0 = 0$ and $\gamma > 0$. Iterate:
$\hat w_{t+1} = \hat w_t - \frac{\gamma}{n} \sum_{i=1}^n (\langle \hat w_t, x_i\rangle - y_i) x_i$
Multiple passes stochastic gradient vs. gradient descent: same computational and statistical properties [Bauer-Pereverzev-Rosasco 2007], [Caponnetto-Yao 2008], [Raskutti-Wainwright-Yu 2013].

Proof: bias-variance trade-off
Define $w_0 = 0 \in H$ and, for $t \in \mathbb{N}$:
$v_0 = w_t$
for $i = 1, \ldots, n$: $v_i = v_{i-1} - (\gamma / n) \int_{H \times \mathbb{R}} (\langle v_{i-1}, x\rangle - y)\, x \, d\rho(x, y)$
$w_{t+1} = v_n$
so $w_t$ is the $nt$-th gradient descent iterate with step size $\gamma / n$ for the expected risk.
Then the error decomposes as
$\|\hat w_t - w^\dagger\|_H \le \underbrace{\|\hat w_t - w_t\|_H}_{\text{Variance}} + \underbrace{\|w_t - w^\dagger\|_H}_{\text{Bias = Optimization}}$.

Variance - Step 1
$\hat w_t$ can be written as a perturbed gradient descent iteration on the empirical risk,
$\hat w_{t+1} = (I - \gamma \hat T) \hat w_t + \gamma \hat g + \gamma^2 (\hat A \hat w_t - \hat b)$,
with
$\hat T = \frac{1}{n} \sum_{i=1}^n x_i \otimes x_i$, where $(x \otimes x) w = \langle w, x\rangle x$,
$\hat g = \frac{1}{n} \sum_{i=1}^n y_i x_i$,
$\hat A = \frac{1}{n^2} \sum_{k=2}^n \Big[ \prod_{i=k+1}^n \Big( I - \frac{\gamma}{n} x_i \otimes x_i \Big) \Big] (x_k \otimes x_k) \sum_{j=1}^{k-1} x_j \otimes x_j$,
$\hat b = \frac{1}{n^2} \sum_{k=2}^n \Big[ \prod_{i=k+1}^n \Big( I - \frac{\gamma}{n} x_i \otimes x_i \Big) \Big] (x_k \otimes x_k) \sum_{j=1}^{k-1} y_j x_j$.
Similarly, $w_t$ is a perturbed gradient descent iteration with step $\gamma$ on the expected risk,
$w_{t+1} = (I - \gamma T) w_t + \gamma g + \gamma^2 (A w_t - b)$,
with
$T = \mathbb{E}[x \otimes x]$, $g = \mathbb{E}[g_\rho(x)\, x]$,
$A = \frac{1}{n^2} \sum_{k=2}^n \Big[ \prod_{i=k+1}^n \Big( I - \frac{\gamma}{n} T \Big) \Big] T \sum_{j=1}^{k-1} T$,
$b = \frac{1}{n^2} \sum_{k=2}^n \Big[ \prod_{i=k+1}^n \Big( I - \frac{\gamma}{n} T \Big) \Big] T \sum_{j=1}^{k-1} g$.

Variance - Step 2
$\|\hat w_t - w_t\|_H \le \gamma \sum_{k=0}^{t-1} \Big\| (I - \gamma \hat T + \gamma^2 \hat A)^{t-k+1} \big[ (T - \hat T) w_k + \gamma (\hat A - A) w_k + (\hat g - g) - \gamma (\hat b - b) \big] \Big\|_H$
$\le \gamma \sum_{k=0}^{t-1} \big[ (\|T - \hat T\| + \gamma \|\hat A - A\|) \|w_k\|_H + \|\hat g - g\|_H + \gamma \|\hat b - b\|_H \big]$.
The operator differences (applied to the bounded iterates) are sums of bounded martingales, the remaining terms are sums of martingales; Pinelis' concentration inequality then gives, with high probability,
$\|\hat w_t - w_t\|_H \le \frac{c_1 \log(16/\delta)\, t}{\sqrt{n}}$.

Bias
Convergence results for gradient descent applied to $E$: a standard approach based on spectral calculus (the square loss is used here!). Convergence depends on the step size $\gamma \in \, ]0, n\kappa^{-1}[$ and on the source condition:
$\|w_t - w^\dagger\|_H \le c \left( \frac{r - 1/2}{\gamma t} \right)^{r - 1/2}$.

Bias-variance trade-off - again
Then, with probability greater than $1 - \delta$,
$\|\hat w_t - w^\dagger\|_H \le \log\frac{16}{\delta} \left( c_1\, t\, n^{-1/2} + c_2\, t^{1/2 - r} \right)$.
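The stopping rule of the main theorem can be read off from this bound by balancing the two terms; a sketch of the computation, ignoring constants and the logarithmic factor:
$t\, n^{-1/2} \asymp t^{1/2 - r} \;\Longrightarrow\; t^{r + 1/2} \asymp n^{1/2} \;\Longrightarrow\; t_*(n) \asymp n^{\frac{1}{2r+1}}$,
and with this choice both terms are of order $n^{-\frac{r - 1/2}{2r+1}}$, which is the rate stated in the main theorem.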
Contributions and future work
Contributions:
- first results on the generalization properties of the multiple passes stochastic gradient method
- the results support commonly used heuristics, e.g. early stopping
Future work:
- extension to other losses and sampling techniques [Lin-Rosasco 2016]
- capacity dependent bounds and optimal bounds for the risk
- a unified analysis for one pass and multiple passes

References
L. Rosasco and S. Villa. Learning with Incremental Iterative Regularization. NIPS 2015.

Finite sample bounds for the risk
Corollary. Assume boundedness and the source condition with $r \in \, ]1/2, +\infty[$. Then, choosing $t_*(n) = n^{1/(2(r+1))}$, with high probability
$E(\hat w_{t_*(n)}) - \inf_H E = O\big(n^{-\frac{r}{r+1}}\big)$.
The rates are not optimal... but they are valid under a more general source condition (even in the non-attainable case).

Experimental results: incremental gradient in a RKHS
$H$ RKHS of functions from $X$ to $Y$ with kernel $K \colon X \times X \to \mathbb{R}$. Let $\hat w_0 = 0$; then
$\hat w_t = \sum_{k=1}^n (\alpha_t)_k K_{x_k}$,
where the coefficients $\alpha_t = ((\alpha_t)_1, \ldots, (\alpha_t)_n) \in \mathbb{R}^n$ satisfy $\alpha_{t+1} = c_t^n$, $c_t^0 = \alpha_t$, and, for $i = 1, \ldots, n$,
$(c_t^i)_k = (c_t^{i-1})_k - \frac{\gamma}{n} \Big( \sum_{j=1}^n K(x_i, x_j)(c_t^{i-1})_j - y_i \Big)$ if $k = i$, and $(c_t^i)_k = (c_t^{i-1})_k$ if $k \neq i$.
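A minimal NumPy sketch of this kernelized coefficient update; the Gaussian kernel and all names are illustrative choices. One call performs a single pass over the data, following the coefficient recursion above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K(a_i, b_j) for the Gaussian kernel (illustrative choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_incremental_pass(K, y, alpha, gamma):
    """One pass (epoch) of the incremental gradient update on the coefficients.

    K is the n x n kernel matrix K(x_i, x_j); the i-th coefficient is updated
    using the current prediction sum_j K(x_i, x_j) * c_j at the point x_i.
    """
    n = len(y)
    c = alpha.copy()
    for i in range(n):
        prediction = K[i] @ c
        c[i] = c[i] - (gamma / n) * (prediction - y[i])
    return c

# Usage sketch: alpha_0 = 0, then t_star passes give the coefficients of w_hat_t,
# and the predictions at the training points are K @ alpha.
# alpha = np.zeros(len(y))
# for _ in range(t_star):
#     alpha = kernel_incremental_pass(K, y, alpha, gamma)
```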