
Gradient Descent (v2)

Gradient Descent
Review: Gradient Descent
• In step 3, we have to solve the following optimization problem:

  \theta^\ast = \arg\min_\theta L(\theta)

  L: loss function, θ: parameters

• Suppose that θ has two variables {θ1, θ2}. Randomly start at

  \theta^0 = \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix},
  \qquad
  \nabla L(\theta) = \begin{bmatrix} \partial L(\theta)/\partial\theta_1 \\ \partial L(\theta)/\partial\theta_2 \end{bmatrix}

• First update: \theta^1 = \theta^0 - \eta \nabla L(\theta^0), i.e.

  \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix}
  = \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix}
  - \eta \begin{bmatrix} \partial L(\theta^0)/\partial\theta_1 \\ \partial L(\theta^0)/\partial\theta_2 \end{bmatrix}

• Second update: \theta^2 = \theta^1 - \eta \nabla L(\theta^1), i.e.

  \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \end{bmatrix}
  = \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix}
  - \eta \begin{bmatrix} \partial L(\theta^1)/\partial\theta_1 \\ \partial L(\theta^1)/\partial\theta_2 \end{bmatrix}
Review: Gradient Descent
[Figure: contour plot of the loss over (θ1, θ2); red arrows mark the gradients ∇L(θ^0), ∇L(θ^1), ∇L(θ^2), ∇L(θ^3), blue arrows mark the movements between θ^0, θ^1, θ^2, θ^3.]

• Gradient: the direction normal to the contour lines of the loss
• Start at position θ^0
• Compute the gradient at θ^0
• Move to θ^1 = θ^0 − η∇L(θ^0)
• Compute the gradient at θ^1
• Move to θ^2 = θ^1 − η∇L(θ^1)
• ……
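To make the update loop above concrete, here is a minimal sketch in Python/NumPy. The quadratic loss, its gradient grad_L, the learning rate eta, and the iteration count are all made-up placeholders for illustration, not part of the lecture.

```python
import numpy as np

def grad_L(theta):
    # Hypothetical loss L(theta) = (theta1 - 3)^2 + 10 * (theta2 + 1)^2,
    # whose gradient is [2 * (theta1 - 3), 20 * (theta2 + 1)].
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

eta = 0.05                        # learning rate
theta = np.array([10.0, 10.0])    # starting point theta^0

for t in range(100):
    # theta^{t+1} = theta^t - eta * grad L(theta^t)
    theta = theta - eta * grad_L(theta)

print(theta)   # approaches the minimum at (3, -1)
```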
Gradient Descent
Tip 1: Tuning your learning rates
Learning Rate

Set the learning rate η carefully.

\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})

[Figure: left — the loss surface with update trajectories for different learning rates (small, large, very large); right — loss vs. number of parameter updates for each choice of η.]

• If there are more than three parameters, you cannot visualize the loss surface.
• But you can always visualize the curve of loss vs. number of parameter updates.
Adaptive Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • At the beginning, we are far from the destination, so we use a larger learning rate.
  • After several epochs, we are close to the destination, so we reduce the learning rate.
  • E.g. 1/t decay: \eta^t = \eta / \sqrt{t+1} (a small sketch of this schedule appears below).
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
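A tiny sketch of the 1/t decay schedule mentioned above; the base rate eta0 is an arbitrary value chosen only for illustration.

```python
import math

eta0 = 0.5   # initial learning rate (arbitrary choice for illustration)

def eta_decay(t):
    # 1/t decay: eta^t = eta0 / sqrt(t + 1)
    return eta0 / math.sqrt(t + 1)

print([round(eta_decay(t), 3) for t in range(5)])
# [0.5, 0.354, 0.289, 0.25, 0.224] -- the step size shrinks as training proceeds
```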
Adagrad
\eta^t = \frac{\eta}{\sqrt{t+1}}, \qquad g^t = \frac{\partial L(\theta^t)}{\partial w}

• Divide the learning rate of each parameter by the root mean square of its previous derivatives.

Vanilla gradient descent:   w^{t+1} \leftarrow w^t - \eta^t g^t

Adagrad:                    w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t

Here w is one parameter, and σ^t is the root mean square of the previous derivatives of parameter w — so it is parameter dependent.
Adagrad
w^1 \leftarrow w^0 - \frac{\eta^0}{\sigma^0} g^0, \qquad \sigma^0 = \sqrt{(g^0)^2}

w^2 \leftarrow w^1 - \frac{\eta^1}{\sigma^1} g^1, \qquad \sigma^1 = \sqrt{\tfrac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}

w^3 \leftarrow w^2 - \frac{\eta^2}{\sigma^2} g^2, \qquad \sigma^2 = \sqrt{\tfrac{1}{3}\left[(g^0)^2 + (g^1)^2 + (g^2)^2\right]}

……

w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t, \qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}
Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.

1/t decay:  \eta^t = \frac{\eta}{\sqrt{t+1}}, \qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}

w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t

The \sqrt{t+1} factors in \eta^t and \sigma^t cancel, so the update simplifies to

w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t
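Translated into code, the simplified update above looks roughly like the sketch below. The two-parameter loss, the base rate eta, and the small epsilon guarding against division by zero are illustrative assumptions, not from the slides.

```python
import numpy as np

def grad_L(w):
    # Hypothetical loss L(w) = w1^2 + 100 * w2^2 with very different
    # curvature per parameter; its gradient is [2 * w1, 200 * w2].
    return np.array([2.0 * w[0], 200.0 * w[1]])

eta = 1.0
w = np.array([5.0, 5.0])
sum_sq_grad = np.zeros_like(w)   # running sum of squared gradients, per parameter

for t in range(200):
    g = grad_L(w)
    sum_sq_grad += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum_{i<=t} (g^i)^2) * g^t
    w = w - eta / (np.sqrt(sum_sq_grad) + 1e-8) * g

print(w)   # both parameters shrink toward 0 despite very different gradient scales
```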
Contradiction?
\eta^t = \frac{\eta}{\sqrt{t+1}}, \qquad g^t = \frac{\partial L(\theta^t)}{\partial w}

Vanilla gradient descent:

w^{t+1} \leftarrow w^t - \eta^t g^t     → larger gradient, larger step

Adagrad:

w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t

→ in the numerator, a larger gradient means a larger step; in the denominator, a larger gradient means a smaller step.
Intuitive Reason
• The adaptive denominator measures how surprising the current gradient is compared with the past ones (反差, "contrast").

  One parameter:     g^0 = 0.001, g^1 = 0.001, g^2 = 0.003, g^3 = 0.002, g^4 = 0.1 ……  → g^4 is especially large
  Another parameter: g^0 = 10.8,  g^1 = 20.9,  g^2 = 31.7,  g^3 = 12.1,  g^4 = 0.1 ……  → g^4 is especially small

• In the Adagrad update

  w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t

  dividing by the accumulated past gradients creates this contrast effect: the same gradient 0.1 yields a comparatively large step for the first parameter and a comparatively small step for the second.
Larger gradient, larger steps?
Consider a single-variable quadratic y = ax^2 + bx + c, with its minimum at x = -\frac{b}{2a} and

\left|\frac{\partial y}{\partial x}\right| = |2ax_0 + b| \text{ at } x_0

Best step from x_0:

\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a}

[Figure: the parabola y = ax^2 + bx + c, the point x_0, the minimum at -b/(2a), and the distance |2ax_0 + b|/(2a) between them.]

A larger first-order derivative means we are farther from the minimum — for a single parameter.
Do not cross parameters

[Figure: comparison between different parameters. Along w1, point a has a larger first derivative than point b (a > b) and is indeed farther from the minimum; along w2, point c has a larger first derivative than point d (c > d). But the rule breaks down across parameters: a point in a sharply curved direction can have a larger first derivative while being closer to its minimum.]
Second Derivative
y = ax^2 + bx + c, \qquad
\left|\frac{\partial y}{\partial x}\right| = |2ax + b|, \qquad
\frac{\partial^2 y}{\partial x^2} = 2a

Best step from x_0:

\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a} = \frac{|\text{first derivative}|}{\text{second derivative}}

[Figure: the parabola with x_0, the minimum at -b/(2a), and the distance |2ax_0 + b|/(2a).]

So the best step is |first derivative| / second derivative; a larger first-order derivative alone does not tell the whole story.
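Spelling out the one-line derivation behind that ratio (assuming a > 0, so the parabola has a minimum):

```latex
y = ax^2 + bx + c, \qquad x^\ast = -\frac{b}{2a}
\;\Longrightarrow\;
|x_0 - x^\ast| \;=\; \Bigl|x_0 + \frac{b}{2a}\Bigr|
             \;=\; \frac{|2ax_0 + b|}{2a}
             \;=\; \frac{\bigl|\partial y/\partial x\bigr|_{x_0}}{\partial^2 y/\partial x^2}
```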
Do not cross parameters

[Figure: comparison between different parameters, now taking curvature into account. The best step is |first derivative| / second derivative. Along w1 (smaller second derivative), a > b; along w2 (larger second derivative), c > d. Dividing the first derivative by the second derivative makes the step size comparable across parameters.]
w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t

The best step is |first derivative| / second derivative — how does Adagrad's denominator relate to the second derivative?

Use the first derivative to estimate the second derivative.

[Figure: two loss curves, one with a smaller second derivative (w1) and one with a larger second derivative (w2). Sampling the magnitude of the first derivative at many points gives, on average, larger values for the curve with the larger second derivative, so \sqrt{\sum_i (g^i)^2} acts as a proxy for the second derivative.]
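A tiny numerical sketch of this idea, using two hypothetical quadratic losses with different curvature: sampling the first derivative at a few points and taking the root mean square gives a larger value for the sharper loss, so the RMS of past gradients serves as a cheap proxy for the second derivative.

```python
import numpy as np

# Two hypothetical 1-D losses: a flat one and a sharp one.
grad_flat  = lambda w: 2.0 * w     # L(w) = w^2,       second derivative = 2
grad_sharp = lambda w: 20.0 * w    # L(w) = 10 * w^2,  second derivative = 20

samples = np.linspace(-3.0, 3.0, 7)   # points at which first derivatives are sampled
rms_flat  = np.sqrt(np.mean(grad_flat(samples) ** 2))
rms_sharp = np.sqrt(np.mean(grad_sharp(samples) ** 2))

print(rms_flat, rms_sharp)   # 4.0 vs 40.0 -- the ratio matches the ratio of second derivatives
```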
Gradient Descent
Tip 2: Stochastic Gradient Descent
Make the training faster
Stochastic Gradient Descent
Loss is the summation over all training examples:

L = \sum_n \left( \hat{y}^n - \left( b + \sum_i w_i x_i^n \right) \right)^2

Gradient descent:

\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})

Stochastic gradient descent: pick one example x^n and use the loss for that example only:

L^n = \left( \hat{y}^n - \left( b + \sum_i w_i x_i^n \right) \right)^2

\theta^i = \theta^{i-1} - \eta \nabla L^n(\theta^{i-1})    Faster!
• Demo
Stochastic Gradient Descent

[Figure: update trajectories on the loss contours for gradient descent vs. stochastic gradient descent.]

• Gradient descent: update after seeing all examples — one step per pass over the data.
• Stochastic gradient descent: update for each example, after seeing only one example. If there are 20 examples, SGD performs 20 updates in the time gradient descent performs one.
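A minimal sketch contrasting the two update rules on the linear model above, reduced to a single feature so the sums over i collapse. The toy data, learning rate, and iteration counts are made up for illustration.

```python
import numpy as np

# Toy data for y_hat ≈ b + w * x (one feature), generated by b = 1, w = 2.
x     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.0, 5.0, 7.0, 9.0])
eta = 0.01

# (Batch) gradient descent: one update per pass, using ALL examples.
b, w = 0.0, 0.0
for _ in range(1000):
    err = y_hat - (b + w * x)          # residuals of every example
    b += eta * 2.0 * err.sum()         # -dL/db = 2 * sum(err)
    w += eta * 2.0 * (err * x).sum()   # -dL/dw = 2 * sum(err * x)

# Stochastic gradient descent: one update per example.
b_s, w_s = 0.0, 0.0
for _ in range(1000):
    for xn, yn in zip(x, y_hat):
        err_n = yn - (b_s + w_s * xn)  # residual of ONE example
        b_s += eta * 2.0 * err_n
        w_s += eta * 2.0 * err_n * xn

print(b, w, b_s, w_s)   # both pairs approach b = 1, w = 2
```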
Gradient Descent
Tip 3: Feature Scaling
Source of figure: http://cs231n.github.io/neuralnetworks-2/
Feature Scaling
y = b + w_1 x_1 + w_2 x_2

[Figure: the distributions of x_1 and x_2 before scaling (very different ranges) and after scaling (comparable ranges).]

Make different features have the same scaling.
Feature Scaling
Why scaling matters for y = b + w_1 x_1 + w_2 x_2:

• Without scaling, x_1 takes values like 1, 2, … while x_2 takes values like 100, 200, …; a small change in w_2 then affects the output (and the loss L) much more than the same change in w_1.
• After scaling, x_1 and x_2 both take values like 1, 2, …, so w_1 and w_2 affect the loss comparably.

[Figure: contours of the loss L in the (w_1, w_2) plane — elongated ellipses before scaling, nearly circular after scaling, where the negative gradient points toward the minimum and the update is more efficient.]
Feature Scaling
Given R examples x^1, x^2, …, x^r, …, x^R, each with components x_i^r, for each dimension i:

• compute the mean m_i and the standard deviation \sigma_i over the R examples;
• standardize each component:

  x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}

After this, the mean of every dimension is 0 and its variance is 1.
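A minimal NumPy sketch of this standardization step; the data matrix below is a hypothetical example with rows as examples x^r and columns as dimensions i.

```python
import numpy as np

# Hypothetical data: R = 4 examples, 2 dimensions with very different ranges.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

m     = X.mean(axis=0)        # mean m_i of each dimension
sigma = X.std(axis=0)         # standard deviation sigma_i of each dimension
X_scaled = (X - m) / sigma    # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.var(axis=0))   # ~[1, 1]
```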
Gradient Descent
Theory
Question
• When solving \theta^\ast = \arg\min_\theta L(\theta) by gradient descent, each time we update the parameters we obtain a new θ that makes L(θ) smaller:

  L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots

  Is this statement correct?
Warning of Math
Formal Derivation
• Suppose that θ has two variables {θ1, θ2}.

[Figure: contour plot of L(θ) over (θ1, θ2) with the starting point θ^0 and a small neighbourhood around it.]

Given a point, we can easily find the point with the smallest value of L(θ) nearby. How?
Taylor Series
• Taylor series: let h(x) be any function infinitely differentiable around x = x_0:

  h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^k
       = h(x_0) + h'(x_0)(x - x_0) + \frac{h''(x_0)}{2!}(x - x_0)^2 + \cdots

• When x is close to x_0:

  h(x) \approx h(x_0) + h'(x_0)(x - x_0)

• E.g. the Taylor series for h(x) = sin(x) around x_0 = π/4.

[Figure: sin(x) together with its truncated Taylor expansions; the approximation is good around π/4.]
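A quick numerical check of the sin(x) example, keeping only the first-order term; the evaluation points are arbitrary.

```python
import numpy as np

x0 = np.pi / 4
# h(x) ~ h(x0) + h'(x0) * (x - x0), with h = sin and h' = cos
approx = lambda x: np.sin(x0) + np.cos(x0) * (x - x0)

for x in [x0 + 0.1, x0 + 0.5, x0 + 1.5]:
    print(x, np.sin(x), approx(x))   # close near pi/4, increasingly poor farther away
```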
Multivariable Taylor Series
h(x, y) = h(x_0, y_0)
        + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0)
        + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)
        + \text{terms involving } (x - x_0)^2, (y - y_0)^2, \ldots

When x and y are close to x_0 and y_0:

h(x, y) \approx h(x_0, y_0)
        + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0)
        + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)
Back to Formal Derivation
Based on the Taylor series: if the red circle centred at (a, b) is small enough, then inside it

L(\theta) \approx L(a, b)
          + \frac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a)
          + \frac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)

Writing s = L(a, b), \; u = \frac{\partial L(a, b)}{\partial \theta_1}, \; v = \frac{\partial L(a, b)}{\partial \theta_2}:

L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)

[Figure: contour plot of L(θ) with the red circle centred at (a, b).]
Back to Formal Derivation
Based on the Taylor series: if the red circle is small enough, inside it

L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)

where s = L(a, b) is a constant and u = \frac{\partial L(a, b)}{\partial \theta_1}, \; v = \frac{\partial L(a, b)}{\partial \theta_2}.

Find θ1 and θ2 in the red circle (radius d) minimizing L(θ):

(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2

[Figure: the red circle of radius d centred at (a, b) on the contour plot.]

Simple, right?
Gradient descent – two variables
Red circle (if the radius is small):

L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b) = s + u\,\Delta\theta_1 + v\,\Delta\theta_2,
\qquad \Delta\theta_1 = \theta_1 - a, \; \Delta\theta_2 = \theta_2 - b

Find θ1 and θ2 in the red circle minimizing L(θ):

(\Delta\theta_1)^2 + (\Delta\theta_2)^2 \le d^2

Since s is constant, minimizing L(θ) means minimizing the inner product of (\Delta\theta_1, \Delta\theta_2) with (u, v); the minimum is attained by pointing (\Delta\theta_1, \Delta\theta_2) opposite to (u, v) and making it as long as the circle allows:

\begin{bmatrix} \Delta\theta_1 \\ \Delta\theta_2 \end{bmatrix} = -\eta \begin{bmatrix} u \\ v \end{bmatrix}
\quad\Longrightarrow\quad
\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} u \\ v \end{bmatrix}
Back to Formal Derivation
Based on the Taylor series: if the red circle is small enough, inside it

L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)

where s = L(a, b) is a constant and u = \frac{\partial L(a, b)}{\partial \theta_1}, \; v = \frac{\partial L(a, b)}{\partial \theta_2}.

Find θ1 and θ2 yielding the smallest value of L(θ) in the circle:

\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}
= \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} u \\ v \end{bmatrix}
= \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} \partial L(a, b)/\partial\theta_1 \\ \partial L(a, b)/\partial\theta_2 \end{bmatrix}

This is gradient descent.

The approximation is not satisfied if the red circle (i.e. the learning rate) is not small enough — which is why the loss is not guaranteed to decrease at every update. You can also consider the second-order term, e.g. Newton's method.
End of Warning
More Limitations of Gradient Descent

[Figure: the loss L plotted against the value of the parameter w, showing a plateau, a saddle point, and a local minimum.]

• Very slow at the plateau (∂L/∂w ≈ 0)
• Stuck at a saddle point (∂L/∂w = 0)
• Stuck at a local minimum (∂L/∂w = 0)
Acknowledgement
• Thanks to Victor Chen for spotting the typos on the slides.