Gradient Descent

Review: Gradient Descent
• In step 3, we have to solve the following optimization problem: θ* = arg min_θ L(θ), where L is the loss function and θ the parameters.
• Suppose that θ has two variables {θ1, θ2}. Randomly start at θ^0 = [θ1^0, θ2^0]^T. The gradient is ∇L(θ) = [∂L/∂θ1, ∂L/∂θ2]^T.
• θ^1 = θ^0 − η∇L(θ^0), i.e. [θ1^1, θ2^1]^T = [θ1^0, θ2^0]^T − η[∂L(θ^0)/∂θ1, ∂L(θ^0)/∂θ2]^T
• θ^2 = θ^1 − η∇L(θ^1), i.e. [θ1^2, θ2^2]^T = [θ1^1, θ2^1]^T − η[∂L(θ^1)/∂θ1, ∂L(θ^1)/∂θ2]^T
• Gradient: the normal direction of the contour lines of the loss. Start at position θ^0; compute the gradient at θ^0; move to θ^1 = θ^0 − η∇L(θ^0); compute the gradient at θ^1; move to θ^2 = θ^1 − η∇L(θ^1); ……

Gradient Descent Tip 1: Tuning your learning rates
• Update rule: θ^i = θ^{i−1} − η∇L(θ^{i−1}). Set the learning rate η carefully.
• If η is very small, the loss decreases very slowly; if η is large, the loss may oscillate; if η is very large, the loss can even increase.
• If there are more than three parameters, you cannot visualize the loss surface. But you can always visualize the loss against the number of parameter updates.

Adaptive Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
• At the beginning, we are far from the destination, so we use a larger learning rate.
• After several epochs, we are close to the destination, so we reduce the learning rate.
• E.g.
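The update loop above can be sketched in a few lines. This is a minimal illustration, assuming a toy loss L(θ1, θ2) = (θ1 − 3)² + (θ2 + 1)² of my own choosing (not from the slides), whose minimum is at (3, −1):

```python
# Vanilla gradient descent θ^i = θ^{i-1} − η ∇L(θ^{i-1}) on a toy quadratic
# loss L(θ1, θ2) = (θ1 − 3)^2 + (θ2 + 1)^2 (illustrative example only).

def grad_L(theta):
    t1, t2 = theta
    return (2 * (t1 - 3), 2 * (t2 + 1))  # (∂L/∂θ1, ∂L/∂θ2)

def gradient_descent(theta0, eta=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        g1, g2 = grad_L(theta)
        theta = (theta[0] - eta * g1, theta[1] - eta * g2)  # move against the gradient
    return theta

theta = gradient_descent((0.0, 0.0))  # approaches the minimum (3, -1)
```

With η = 0.1 each coordinate's error shrinks by a constant factor per step; setting η too large here (e.g. η > 1) would make the iterates diverge, which is the learning-rate tuning issue described above.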
1/t decay: η^t = η / sqrt(t + 1)
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.

Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
• Notation: g^t = ∂L(θ^t)/∂w and η^t = η / sqrt(t + 1).
• Vanilla gradient descent: w^{t+1} ← w^t − η^t g^t
• Adagrad: w^{t+1} ← w^t − (η^t / σ^t) g^t, where w is one parameter and σ^t is the root mean square of the previous derivatives of parameter w. σ^t is parameter dependent.
• Step by step:
  w^1 ← w^0 − (η^0 / σ^0) g^0,  σ^0 = sqrt((g^0)²)
  w^2 ← w^1 − (η^1 / σ^1) g^1,  σ^1 = sqrt((1/2)[(g^0)² + (g^1)²])
  w^3 ← w^2 − (η^2 / σ^2) g^2,  σ^2 = sqrt((1/3)[(g^0)² + (g^1)² + (g^2)²])
  ……
  w^{t+1} ← w^t − (η^t / σ^t) g^t,  σ^t = sqrt( (1/(t+1)) Σ_{i=0}^t (g^i)² )
• Since η^t = η / sqrt(t+1) and σ^t contains a 1/sqrt(t+1) factor, the sqrt(t+1) terms cancel, giving the simpler form:
  w^{t+1} ← w^t − η / sqrt( Σ_{i=0}^t (g^i)² ) · g^t

Contradiction?
• Vanilla gradient descent: larger gradient, larger step.
• Adagrad: the g^t in the numerator says larger gradient, larger step, but the denominator sqrt(Σ_{i=0}^t (g^i)²) says larger gradient, smaller step.

Intuitive Reason
• The ratio measures how surprising the current gradient is — its contrast with the past.
• E.g. for one parameter the gradients g0 … g4 are 0.001, 0.001, 0.003, 0.002, 0.1 — the last one is especially large compared with the history.
• For another parameter they are 10.8, 20.9, 31.7, 12.1, 0.1 — the last one is especially small.
• Dividing by sqrt(Σ (g^i)²) creates this contrast effect.

Larger gradient, larger steps?
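The simplified Adagrad update can be sketched as follows, assuming the cancelled form w ← w − η / sqrt(Σ (g^i)²) · g^t from the slides; the toy loss L(w) = (w − 5)² is my own illustrative choice:

```python
import math

# Adagrad for a single parameter w, using the simplified update
# w ← w − η / sqrt(Σ_{i=0}^t (g^i)^2) · g^t on the toy loss L(w) = (w − 5)^2.

def adagrad(w0, eta=1.0, steps=500):
    w, sum_sq = w0, 0.0
    for _ in range(steps):
        g = 2 * (w - 5)                      # g^t = dL/dw at the current w
        sum_sq += g * g                      # accumulate (g^i)^2 over all past steps
        w -= eta / math.sqrt(sum_sq) * g     # per-parameter adapted step
    return w

w = adagrad(0.0)  # approaches the minimum at w = 5
```

Note how the effective step size automatically shrinks as gradients accumulate, so even η = 1.0 is stable here, while the same η in plain gradient descent would need manual tuning.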
• Consider y = ax² + bx + c. Starting from x0, the best step is the distance to the minimum: |x0 + b/(2a)| = |2a x0 + b| / (2a).
• Since |dy/dx| = |2ax + b|, a larger first-order derivative means we are farther from the minimum — but do not cross parameters.

Comparison between different parameters
• On w1, point a has a larger derivative than point b (a > b) and is indeed farther from the minimum; likewise c > d on w2.
• But the cross-parameter comparison fails: a point on w2 can have a larger derivative than a point on w1 while being closer to its own minimum.

Second Derivative
• y = ax² + bx + c, dy/dx = |2ax + b|, d²y/dx² = 2a.
• The best step |2a x0 + b| / (2a) is therefore |first derivative| / (second derivative).
• A larger first-order derivative means far from the minimum only after dividing by the second derivative: w1 has a smaller second derivative, w2 a larger one.
• Adagrad, w^{t+1} ← w^t − η / sqrt(Σ_{i=0}^t (g^i)²) · g^t, uses the first derivatives to estimate the second: where the second derivative is larger, the sampled first derivatives tend to be larger in magnitude, so sqrt(Σ (g^i)²) acts as a cheap proxy for the second derivative.

Gradient Descent Tip 2: Stochastic Gradient Descent
• Make the training faster.
• The loss is the summation over all training examples: L = Σ_n ( ŷ^n − (b + Σ_i w_i x_i^n) )²
• Gradient descent: θ^i = θ^{i−1} − η∇L(θ^{i−1})
• Stochastic gradient descent: pick an example x^n and use the loss for only that example, L^n = ( ŷ^n − (b + Σ_i w_i x_i^n) )², then update θ^i = θ^{i−1} − η∇L^n(θ^{i−1}). Faster!
• Demo
• Gradient descent updates after seeing all examples; stochastic gradient descent updates for each example. If there are 20 examples, SGD makes 20 updates in the time gradient descent makes one — 20 times faster.
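The per-example update can be sketched as below. This is a minimal illustration with my own toy data (three points exactly on y = 1 + 2x) and hyperparameters, not a setup from the slides:

```python
import random

# Stochastic gradient descent on the linear model y = b + w·x with squared
# error: pick one example per step and follow ∇L^n only for that example.

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # pairs (x^n, ŷ^n), generated by y = 1 + 2x

def sgd(steps=5000, eta=0.01, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y_hat = rng.choice(data)      # pick a single example x^n
        err = y_hat - (b + w * x)        # residual for this example only
        # gradient of L^n = (ŷ^n − (b + w x^n))^2 w.r.t. w and b
        w += eta * 2 * err * x
        b += eta * 2 * err
    return w, b

w, b = sgd()  # approaches w = 2, b = 1
```

Because the data are noiseless and exactly fit the model, every per-example gradient vanishes at (w, b) = (2, 1), so SGD with a constant learning rate settles there; with noisy data one would decay η instead.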
• Stochastic gradient descent sees only one example per update; gradient descent sees all examples.

Gradient Descent Tip 3: Feature Scaling
Source of figure: http://cs231n.github.io/neuralnetworks-2/
• For y = b + w1 x1 + w2 x2, make different features have the same scaling.
• If x1 takes values like 1, 2, …… while x2 takes values like 100, 200, ……, then changes in w2 affect the loss far more than changes in w1, the loss contours are stretched ellipses, and gradient descent zig-zags. After scaling both features to a similar range (e.g. 1, 2, ……), the contours are closer to circles and the update points toward the minimum.
• For each dimension i, with mean m_i and standard deviation σ_i over the examples x^1, x^2, x^3, ……, x^R:
  x_i^r ← (x_i^r − m_i) / σ_i
• After scaling, the means of all dimensions are 0 and the variances are all 1.

Gradient Descent Theory
Question
• When solving θ* = arg min_θ L(θ) by gradient descent, each time we update the parameters, we obtain θ that makes L(θ) smaller:
  L(θ^0) > L(θ^1) > L(θ^2) > ……
  Is this statement correct?

Warning of Math

Formal Derivation
• Suppose that θ has two variables {θ1, θ2}. Given a point, we can easily find the point with the smallest value nearby. How?

Taylor Series
• Taylor series: let h(x) be any function infinitely differentiable around x = x0:
  h(x) = Σ_{k=0}^∞ h^(k)(x0) / k! · (x − x0)^k
       = h(x0) + h′(x0)(x − x0) + h″(x0) / 2! · (x − x0)² + ……
• When x is close to x0: h(x) ≈ h(x0) + h′(x0)(x − x0)
• E.g. the Taylor series for h(x) = sin(x) around x0 = π/4. The approximation is good around π/4.
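The standardization above can be sketched directly; the sample data below is my own, chosen so the two dimensions have very different scales:

```python
import math

# Feature scaling: for each dimension i, x_i^r ← (x_i^r − m_i) / σ_i,
# where m_i and σ_i are the mean and standard deviation of dimension i
# over all R examples.

def feature_scale(X):
    """X is a list of examples, each a list of feature values."""
    R, dims = len(X), len(X[0])
    means = [sum(x[i] for x in X) / R for i in range(dims)]
    stds = [math.sqrt(sum((x[i] - means[i]) ** 2 for x in X) / R)
            for i in range(dims)]
    return [[(x[i] - means[i]) / stds[i] for i in range(dims)] for x in X]

X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]   # dimensions on very different scales
X_scaled = feature_scale(X)                      # every dimension: mean 0, variance 1
```

After this transform both dimensions contribute comparably to the loss, which is exactly what turns the elliptical contours into near-circles.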
Multivariable Taylor Series
• h(x, y) = h(x0, y0) + ∂h(x0, y0)/∂x · (x − x0) + ∂h(x0, y0)/∂y · (y − y0) + something related to (x − x0)² and (y − y0)² + ……
• When x and y are close to x0 and y0:
  h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x · (x − x0) + ∂h(x0, y0)/∂y · (y − y0)

Back to Formal Derivation
• Based on the Taylor series: if the red circle centred at (a, b) is small enough, then inside the red circle
  L(θ) ≈ L(a, b) + ∂L(a, b)/∂θ1 · (θ1 − a) + ∂L(a, b)/∂θ2 · (θ2 − b)
• Let s = L(a, b), u = ∂L(a, b)/∂θ1, v = ∂L(a, b)/∂θ2, so that L(θ) ≈ s + u(θ1 − a) + v(θ2 − b), with s a constant.
• Find θ1 and θ2 in the red circle minimizing L(θ), subject to (θ1 − a)² + (θ2 − b)² ≤ d². Simple, right?

Gradient descent – two variables
• Red circle (if the radius is small): L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)
• Writing (Δθ1, Δθ2) = (θ1 − a, θ2 − b), minimizing L(θ) subject to (Δθ1)² + (Δθ2)² ≤ d² means choosing (Δθ1, Δθ2) opposite in direction to (u, v), scaled to reach the boundary:
  [Δθ1, Δθ2]^T = −η [u, v]^T
  [θ1, θ2]^T = [a, b]^T − η [u, v]^T
• Substituting u and v:
  [θ1, θ2]^T = [a, b]^T − η [∂L(a, b)/∂θ1, ∂L(a, b)/∂θ2]^T
  This is gradient descent.
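The derivation can be checked numerically. This sketch uses a toy loss L(θ1, θ2) = θ1² + 4θ2² of my own choosing: near a point (a, b), the linear Taylor approximation is close to the true loss, and a small step against (u, v) lowers the loss:

```python
# Numeric check of the first-order Taylor argument on L(θ1, θ2) = θ1^2 + 4 θ2^2.

def L(t1, t2):
    return t1 ** 2 + 4 * t2 ** 2

a, b = 2.0, 1.0
s = L(a, b)      # s = L(a, b) = 8
u = 2 * a        # u = ∂L/∂θ1 at (a, b)
v = 8 * b        # v = ∂L/∂θ2 at (a, b)

# First-order Taylor approximation at a nearby point
t1, t2 = a + 0.01, b - 0.02
approx = s + u * (t1 - a) + v * (t2 - b)
exact = L(t1, t2)                 # approx and exact agree closely

# A small gradient step (Δθ1, Δθ2) = −η (u, v) decreases the loss
eta = 0.05
stepped = L(a - eta * u, b - eta * v)   # stepped < s
```

If η were made large, the point would leave the "red circle" where the linear approximation holds and the loss could increase, which is the caveat raised next.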
• The approximation is not satisfied if the red circle (i.e. the learning rate) is not small enough.
• You can consider the second-order term, e.g. Newton's method.

End of Warning

More Limitations of Gradient Descent
• Very slow at the plateau (∂L/∂w ≈ 0).
• Stuck at a saddle point (∂L/∂w = 0).
• Stuck at local minima (∂L/∂w = 0).

Acknowledgement
• Thanks to Victor Chen for finding the typos on the slides.
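The Newton's method mentioned above connects back to the "best step" |first derivative| / (second derivative): for a quadratic y = ax² + bx + c, one Newton step from any x0 lands exactly on the minimum x* = −b/(2a). A sketch with my own example coefficients:

```python
# One step of Newton's method on y = a x^2 + b x + c:
# x1 = x0 − (dy/dx) / (d²y/dx²), which equals the "best step" from the slides.

def newton_step(x0, a, b):
    first = 2 * a * x0 + b   # dy/dx at x0
    second = 2 * a           # d²y/dx², constant for a quadratic
    return x0 - first / second

a_coef, b_coef = 3.0, -12.0              # minimum at x* = −b/(2a) = 2.0
x1 = newton_step(10.0, a_coef, b_coef)   # lands exactly at 2.0 in one step
```

For non-quadratic losses the second derivative varies, so Newton's method iterates rather than finishing in one step; its cost of computing second derivatives is why Adagrad's first-derivative proxy is attractive.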