
EE 381V: Large Scale Optimization                                    Fall 2014
Lecture 7 — September 18
Lecturer: Sanghavi          Scribe: David Inouye, Jimmy Lin, & Vutha Va

7.1 Recap
In the last lecture we studied the coordinate descent and steepest descent methods. Toward the end of the lecture we looked at an example illustrating that a change of coordinates ($x = Ay$) can alter the convergence behavior of gradient descent. In this lecture we will study Newton’s method, which is affine invariant, i.e. a change of coordinates $x = Ay$ does not affect its convergence behavior, in contrast to gradient descent.
7.2 Introduction to Newton’s Method
Definition 1. For $f$ with positive definite Hessian $\nabla^2 f(x) \succ 0$, the Newton updating rule with step size $t$ is defined as
$$x^+ = x + t\,\Delta x_{\mathrm{nt}} = x - t\,\nabla^2 f(x)^{-1}\nabla f(x), \tag{7.1}$$
where $\Delta x_{\mathrm{nt}}$ is called the Newton step.
Note that since $\nabla^2 f(x) \succ 0$ we have
$$\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$
whenever $\nabla f(x) \neq 0$, so the Newton step is a descent direction.
There are various ways to interpret this choice of updating rule.
Minimizer of Quadratic Approximation
Consider a quadratic approximation of $f$ around $x$,
$$\tilde f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2}\, v^T \nabla^2 f(x)\, v. \tag{7.2}$$
This quadratic function is minimized at $v^\star = -\nabla^2 f(x)^{-1}\nabla f(x)$. Note that if $f$ is quadratic, this approximation is exact and $x + v^\star$ is the exact minimizer of $f$.
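As a quick numerical illustration of this interpretation (a sketch added here, not part of the original notes; the matrix $Q$, vector $b$, and starting point are arbitrary choices), one full Newton step ($t = 1$) lands exactly on the minimizer when $f$ itself is quadratic:

    import numpy as np

    # f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Q x - b and the Hessian is Q everywhere.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    Q = A @ A.T + 5 * np.eye(5)               # positive definite Hessian
    b = rng.standard_normal(5)

    x = rng.standard_normal(5)                # arbitrary starting point
    v_star = -np.linalg.solve(Q, Q @ x - b)   # Newton step: -(Hessian)^{-1} gradient
    x_plus = x + v_star                       # full step, t = 1

    x_min = np.linalg.solve(Q, b)             # true minimizer of the quadratic
    print(np.allclose(x_plus, x_min))         # True: one Newton step is exact here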
A Special Case of Steepest Descent
Newton’s method can also be viewed as the steepest descent method in the norm
$$\|u\|_{\nabla^2 f(x)} \triangleq \sqrt{u^T \nabla^2 f(x)\, u}. \tag{7.3}$$
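To see this concretely (a numerical sketch added here, not from the notes; $H$ and $g$ below are arbitrary stand-ins for the Hessian and gradient at $x$): the normalized steepest descent direction for the norm (7.3) is the unit-norm direction minimizing $\nabla f(x)^T v$, and that minimizer is proportional to the Newton step $-H^{-1}g$. The snippet checks that no random direction of unit $\|\cdot\|_{\nabla^2 f(x)}$-norm does better:

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((4, 4))
    H = B @ B.T + 4 * np.eye(4)                # stand-in for the Hessian at x
    g = rng.standard_normal(4)                 # stand-in for the gradient at x

    h_norm = lambda v: np.sqrt(v @ H @ v)      # the norm defined in (7.3)

    v_nt = -np.linalg.solve(H, g)              # Newton direction
    v_nt = v_nt / h_norm(v_nt)                 # normalized to unit H-norm

    V = rng.standard_normal((10000, 4))        # random directions, normalized the same way
    V = V / np.array([h_norm(v) for v in V])[:, None]
    print(g @ v_nt <= (V @ g).min() + 1e-12)   # True: the Newton direction is steepest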
Linear Approximation of Gradient around x
Consider a linear approximation of $\nabla f(x + v)$,
$$\nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x)\, v. \tag{7.4}$$
The Newton updating rule is obtained by setting the right-hand side to zero, which is an approximation to the optimality condition $\nabla f(x^\star) = 0$.
Affine Invariance of Newton’s Method
As mentioned earlier, an important feature of Newton’s method is affine invariance: a change of coordinates does not alter the convergence behavior, in contrast to what we observed for gradient descent. We now state this formally and prove it.
Lemma 7.1. Newton’s method is affine invariant, i.e. if $A$ is invertible and we define $g(y) = f(Ay)$, and
• $y^+$ is the Newton update for $g$ at $y$,
• $x^+$ is the Newton update for $f$ at $x$,
then $x = Ay$ implies $x^+ = Ay^+$.
Proof: Let $x = Ay$. Then
$$\nabla^2_y g(y) = A^T \nabla^2_x f(x)\, A, \tag{7.5}$$
$$\nabla_y g(y) = A^T \nabla_x f(x). \tag{7.6}$$
Substituting these into the Newton updating rule for $g$,
$$\begin{aligned}
y^+ &= y - t\left(A^T \nabla^2_x f(x)\, A\right)^{-1} A^T \nabla_x f(x) \\
    &= y - t\, A^{-1} \nabla^2_x f(x)^{-1} A^{-T} A^T \nabla_x f(x) \\
    &= y - t\, A^{-1} \nabla^2_x f(x)^{-1} \nabla_x f(x).
\end{aligned}$$
Multiplying both sides by $A$,
$$Ay^+ = x - t\, \nabla^2_x f(x)^{-1} \nabla_x f(x) = x^+.$$
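A quick numerical check of Lemma 7.1 (an added sketch, not part of the notes; the test function and the matrix $A$ are arbitrary choices): take one Newton step on $f$ at $x = Ay$ and one on $g(y) = f(Ay)$ at $y$, and confirm $x^+ = Ay^+$.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3
    A = rng.standard_normal((n, n)) + 3 * np.eye(n)    # a (generically invertible) change of coordinates

    # Smooth strongly convex test function: f(x) = sum(exp(x_i)) + 0.5 ||x||^2.
    grad_f = lambda x: np.exp(x) + x
    hess_f = lambda x: np.diag(np.exp(x)) + np.eye(n)

    # Chain rule gives the gradient and Hessian of g(y) = f(Ay), as in (7.5)-(7.6).
    grad_g = lambda y: A.T @ grad_f(A @ y)
    hess_g = lambda y: A.T @ hess_f(A @ y) @ A

    t = 0.7                                            # any common step size
    y = rng.standard_normal(n)
    x = A @ y

    x_plus = x - t * np.linalg.solve(hess_f(x), grad_f(x))
    y_plus = y - t * np.linalg.solve(hess_g(y), grad_g(y))
    print(np.allclose(x_plus, A @ y_plus))             # True: the updates agree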
7.3 Convergence of Newton’s Method
We make two major assumptions in this analysis:
1. $f$ is strongly convex, with $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x$.
2. $\nabla^2 f$ is Lipschitz continuous with constant $L > 0$, i.e.
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\,\|x - y\|_2 \quad \forall x, y. \tag{7.7}$$
Note that the norm on the left is the spectral norm, defined as the largest singular value. $L$ can be interpreted as a bound on the third derivative of $f$: the smaller $L$ is, the better $f$ can be approximated by a quadratic function. Since each step of Newton’s method minimizes a quadratic approximation of $f$, the performance of Newton’s method will be best for functions with small $L$.
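As an illustration of assumption 2 (added here; the function and the box are my own arbitrary choices), one can estimate a Lipschitz constant for the Hessian of $f(x) = \sum_i e^{x_i} + \tfrac12\|x\|_2^2$ empirically by sampling pairs of points:

    import numpy as np

    rng = np.random.default_rng(3)
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)   # Hessian of sum(exp(x)) + 0.5||x||^2

    # Ratios ||H(x) - H(y)||_2 / ||x - y||_2 over random pairs in the box [-1, 1]^4;
    # their maximum is an empirical lower estimate of the Lipschitz constant L there.
    ratios = []
    for _ in range(2000):
        x, y = rng.uniform(-1.0, 1.0, size=(2, 4))
        ratios.append(np.linalg.norm(hess(x) - hess(y), 2) / np.linalg.norm(x - y))
    print(max(ratios))       # stays below e ~ 2.72, the Lipschitz constant on this box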
For notational convenience we denote $g = \nabla f(x)$ and $H = \nabla^2 f(x)$ from this point on.
Our main result of this lecture is Theorem 7.2; we will devote the rest of the lecture to understanding and proving it. Before stating the theorem, let us first recall Backtracking Line Search (BTLS). In BTLS, parameters $\alpha$ and $\beta$ are first chosen with $0 < \alpha < 1/2$ and $0 < \beta < 1$; then, starting with $t = 1$, repeat:
    while true
        if f(x + t∆x) ≤ f(x) + α t gᵀ∆x
            exit
        else
            t ← βt
        end
    end
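A minimal, self-contained sketch of damped Newton with BTLS (my own code, not from the notes; the test problem and the choices α = 0.25, β = 0.5 are arbitrary):

    import numpy as np

    def newton_btls(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=50):
        # Damped Newton's method with backtracking line search.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            dx = -np.linalg.solve(hess(x), g)     # Newton step: solve H dx = -g
            t = 1.0
            while f(x + t * dx) > f(x) + alpha * t * (g @ dx):   # BTLS exit condition
                t *= beta
            x = x + t * dx
        return x

    # Example: a smooth strongly convex test problem.
    f    = lambda x: np.sum(np.exp(x)) + 0.5 * (x @ x)
    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

    x_hat = newton_btls(f, grad, hess, x0=[3.0, -2.0, 1.0])
    print(np.linalg.norm(grad(x_hat)))            # ~0: the optimality condition holds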
We are now ready to state and prove the theorem.
Theorem 7.2. There exist $\eta$ and $\gamma$ with $0 < \eta \le \frac{m^2}{L}$ and $\gamma = \alpha\beta\eta^2\frac{m}{M^2}$ such that Newton’s method with BTLS satisfies:
(a). Damped Newton Phase: If $\|g\|_2 \ge \eta$, then $f(x^+) - f(x) \le -\gamma$.
(b). Quadratic Phase: If $\|g\|_2 < \eta$, then BTLS selects $t = 1$ and
$$\frac{L}{2m^2}\,\|\nabla f(x^+)\|_2 \le \left(\frac{L}{2m^2}\,\|\nabla f(x)\|_2\right)^2. \tag{7.8}$$
Before proceeding to the proof, let us first interpret the implications of this theorem.
Implication of (a)
In the damped Newton phase, $f$ decreases by at least $\gamma$ at each iteration, so the total number of iterations in this phase cannot exceed
$$\frac{f(x^{(0)}) - p^\star}{\gamma},$$
since otherwise $f(x)$ would drop below $p^\star$, contradicting the optimality of $p^\star$. In other words, the quadratic phase starts after at most $\frac{f(x^{(0)}) - p^\star}{\gamma}$ iterations.
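To make the bound concrete (an illustrative plug-in of arbitrary constants added here, not values from the notes): with $\alpha = 0.25$, $\beta = 0.5$, $m = 1$, $M = 10$, $L = 10$, and $\eta = m^2/L$,
$$\eta = \frac{m^2}{L} = 0.1,\qquad \gamma = \alpha\beta\eta^2\frac{m}{M^2} = 0.25\cdot 0.5\cdot (0.1)^2\cdot\frac{1}{100} = 1.25\times 10^{-5},$$
so if $f(x^{(0)}) - p^\star = 1$, the damped phase ends within $1/\gamma = 8\times 10^4$ iterations. The bound is conservative; in practice far fewer damped steps are typically needed.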
Implication of (b)
Let $k$ be the first iteration at which $\|g\|_2 < \eta$, and let $\ell \ge 1$ index the iterations from the $k$-th onward. For simplicity, define
$$a_\ell = \frac{L}{2m^2}\,\|\nabla f(x^{(k+\ell-1)})\|_2. \tag{7.9}$$
First, let us establish a bound on $a_1$. In the quadratic phase, since $\|\nabla f(x^{(k)})\|_2 < \eta$ and $\eta \le \frac{m^2}{L}$ by assumption, we have
$$\frac{L}{2m^2}\,\|\nabla f(x^{(k)})\|_2 < \frac{L}{2m^2}\,\eta \le \frac{1}{2}.$$
Thus, $a_1 < \frac12$. Now from part (b) of the theorem, we also have $a_{\ell+1} \le a_\ell^2$. Therefore, we have the following sequence:
$$a_\ell \le (a_{\ell-1})^2 \le (a_{\ell-2})^{2^2} \le (a_{\ell-3})^{2^3} \le \cdots \le (a_1)^{2^{\ell-1}}$$
$$\Longrightarrow\; a_\ell \le (a_1)^{2^{\ell-1}} \le \left(\frac12\right)^{2^{\ell-1}} \tag{7.10}$$
$$\Longrightarrow\; \|\nabla f(x^{(k+\ell-1)})\|_2 \le \frac{2m^2}{L}\left(\frac12\right)^{2^{\ell-1}}. \tag{7.11}$$
For strongly convex functions, we also have
$$f(x^{(k+\ell-1)}) - p^\star \le \frac{1}{2m}\,\|\nabla f(x^{(k+\ell-1)})\|_2^2 \le \frac{1}{2m}\left(\frac{2m^2}{L}\left(\frac12\right)^{2^{\ell-1}}\right)^2 = \frac{2m^3}{L^2}\left(\frac12\right)^{2^{\ell}}, \tag{7.12}$$
thus, $f(x) \to p^\star$ quadratically.
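Empirically, the squaring predicted by (7.8) is easy to see (an added illustration, not from the notes; pure Newton steps $t = 1$ on the same arbitrary test function as above, from a starting point close enough that no damping is needed):

    import numpy as np

    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

    x = np.array([1.0, -1.0, 0.5])
    for k in range(6):
        print(k, np.linalg.norm(grad(x)))          # gradient norms; they eventually square at each step
        x = x - np.linalg.solve(hess(x), grad(x))  # pure Newton step, t = 1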
Therefore, if we want $a_\ell \le \epsilon$, it suffices (using $a_\ell \le (a_1)^{2^{\ell-1}}$) that
$$(a_1)^{2^{\ell-1}} \le \epsilon
\;\Longleftrightarrow\; 2^{\ell-1}\log a_1 \le \log\epsilon
\;\Longleftrightarrow\; 2^{\ell-1} \ge \mathrm{constant}\times\log\frac{1}{\epsilon}
\;\Longleftrightarrow\; \ell - 1 \ge \log\log\frac{1}{\epsilon} + \mathrm{constant},$$
so only on the order of $\log\log(1/\epsilon)$ iterations are needed in the quadratic phase.
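A short sketch of how quickly the bound collapses (added here; the tolerance $10^{-12}$ is an arbitrary choice):

    import math

    # Quadratic-phase bound a_ell <= (1/2)^(2^(ell-1)).
    for ell in range(1, 8):
        print(ell, 0.5 ** (2 ** (ell - 1)))
    # roughly 0.5, 0.25, 6.3e-2, 3.9e-3, 1.5e-5, 2.3e-10, 5.4e-20

    eps = 1e-12
    ell = 1 + math.ceil(math.log2(math.log2(1.0 / eps)))   # ell - 1 >= log2(log2(1/eps))
    print(ell)                                             # 7 iterations already suffice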
7.3.1 Convergence Proof
For readability, we divide the proof into lemmas that emphasize its overall flow, so that we do not get lost in the details of the derivation.
Lemma 7.3. Under the assumptions of part (a), $t = \frac{m}{M}$ satisfies the BTLS exit condition, i.e.
$$f(x + t\Delta x_{\mathrm{nt}}) \le f(x) + \alpha t\, g^T \Delta x_{\mathrm{nt}}.$$
Lemma 7.4. Under the assumptions of part (b), $t = 1$ satisfies the BTLS exit condition.
The proofs of both lemmas are given in Section 7.3.2.
Proof (Theorem 7.2 part (a)): By Lemma 7.3 (whose proof shows that the exit condition in fact holds for every $t \le \frac{m}{M}$), BTLS terminates with a step size $t \ge \beta\frac{m}{M}$. Using this and substituting $\Delta x_{\mathrm{nt}} = -H^{-1}g$ into the exit condition, we have
$$f(x^+) \le f(x) - \alpha\beta\,\frac{m}{M}\, g^T H^{-1} g. \tag{7.13}$$
By strong convexity $H \preceq MI$, so $H^{-1} \succeq \frac{1}{M}I$, and we have
$$g^T H^{-1} g \ge \frac{1}{M}\,\|g\|_2^2, \tag{7.14}$$
therefore,
$$f(x^+) \le f(x) - \alpha\beta\,\frac{m}{M}\cdot\frac{1}{M}\,\|g\|_2^2, \tag{7.15}$$
$$f(x^+) - f(x) \le -\underbrace{\alpha\beta\,\frac{m}{M^2}\,\eta^2}_{\gamma}, \tag{7.16}$$
where the last inequality follows because $\|g\|_2 \ge \eta$.
Proof (Theorem 7.2 part (b)): Using Lemma 7.4, we can set $t = 1$. Thus we have $x^+ = x - H^{-1}g$ and
$$\nabla f(x^+) = \nabla f(x - H^{-1}g) - g + HH^{-1}g, \tag{7.17}$$
where the last two terms cancel; they are written in this way to set up the integral representation below.
Applying the fundamental theorem of calculus to $\nabla f(x - H^{-1}g) - g$ we have¹
$$\nabla f(x - H^{-1}g) - g = \int_0^1 \nabla^2 f(x - tH^{-1}g)\,(-H^{-1}g)\, dt, \tag{7.18}$$
thus, writing $HH^{-1}g = -\int_0^1 H\,(-H^{-1}g)\,dt$ and combining with (7.17),
$$\nabla f(x^+) = \int_0^1 \left[\nabla^2 f(x - tH^{-1}g) - H\right](-H^{-1}g)\, dt. \tag{7.19}$$

¹This point can be seen easily in the scalar case: let $a(t) = \nabla f(x - tH^{-1}g)$; then $a(0) = \nabla f(x)$, $a(1) = \nabla f(x - H^{-1}g)$, and $\int_0^1 \frac{d}{dt}a(t)\,dt = a(1) - a(0)$.
Taking the norm of both sides,
$$\|\nabla f(x^+)\|_2 = \left\|\int_0^1 \left[\nabla^2 f(x - tH^{-1}g) - H\right](-H^{-1}g)\, dt\right\|_2 \tag{7.20}$$
$$\le \int_0^1 \left\|\nabla^2 f(x - tH^{-1}g) - H\right\|_2\,\left\|H^{-1}g\right\|_2\, dt. \tag{7.21}$$
By Lipschitz continuity of $\nabla^2 f$ we have
$$\|\nabla^2 f(x - tH^{-1}g) - H\|_2 \le L\,\|tH^{-1}g\|_2 = Lt\,\|H^{-1}g\|_2, \tag{7.22}$$
thus,
$$\|\nabla f(x^+)\|_2 \le \int_0^1 Lt\,\|H^{-1}g\|_2^2\, dt = \frac{L}{2}\,\|H^{-1}g\|_2^2. \tag{7.23}$$
Also, $H \succeq mI$, so $H^{-1} \preceq \frac{1}{m}I$ and
$$\|H^{-1}g\|_2^2 \le \frac{1}{m^2}\,\|g\|_2^2. \tag{7.24}$$
Substituting this,
$$\|\nabla f(x^+)\|_2 \le \frac{L}{2m^2}\,\|g\|_2^2, \tag{7.25}$$
and multiplying both sides by $\frac{L}{2m^2}$ we obtain the result stated in the theorem.
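The integral identity (7.18) used above can also be checked numerically (an added sanity check, not from the notes; the test function is the same arbitrary one as before, and the integral is approximated by a trapezoidal sum):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 3
    grad = lambda x: np.exp(x) + x                      # f(x) = sum(exp(x)) + 0.5||x||^2
    hess = lambda x: np.diag(np.exp(x)) + np.eye(n)

    x = rng.standard_normal(n)
    g, H = grad(x), hess(x)
    v = -np.linalg.solve(H, g)                          # v = -H^{-1} g

    ts = np.linspace(0.0, 1.0, 2001)
    vals = np.stack([hess(x + t * v) @ v for t in ts])  # integrand of (7.18) on a grid
    w = np.full(ts.size, 1.0); w[0] = w[-1] = 0.5       # trapezoidal weights
    rhs = (w[:, None] * vals).sum(axis=0) * (ts[1] - ts[0])

    lhs = grad(x + v) - g                               # left-hand side of (7.18)
    print(np.allclose(lhs, rhs, atol=1e-6))             # True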
7.3.2 Proof of Lemmas
Proof (Lemma 7.3): Using the quadratic upper bound $f(x + v) \le f(x) + \nabla f(x)^T v + \frac{M}{2}\|v\|_2^2$ (which follows from $\nabla^2 f \preceq MI$) with $v = -tH^{-1}g$,
$$f(x^+) = f(x - tH^{-1}g) \tag{7.26}$$
$$\le f(x) - t\,g^T H^{-1} g + \frac{M}{2}\,t^2\, g^T H^{-1} H^{-1} g. \tag{7.27}$$
Note that²
$$g^T H^{-1} H^{-1} g = g^T H^{-1/2} H^{-1} H^{-1/2} g \le \frac{1}{m}\, g^T H^{-1} g. \tag{7.28}$$
Thus,
$$f(x^+) \le f(x) - t\,g^T H^{-1} g + \frac{M}{2m}\,t^2\, g^T H^{-1} g. \tag{7.29}$$
Setting $t = \frac{m}{M}$,
$$f(x^+) \le f(x) - \frac{m}{2M}\, g^T H^{-1} g. \tag{7.30}$$
This satisfies the exit condition for $t = \frac{m}{M}$, namely $f(x^+) \le f(x) - \alpha\frac{m}{M}\, g^T H^{-1} g$, since $\alpha < 1/2$.

²Recall the definition of the square root of a matrix: if $A$ is positive definite, then we can write $A = U\Lambda U^T$, where $U$ is unitary and $\Lambda$ is diagonal, and $A^{1/2} = U\Lambda^{1/2}U^T$.
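A small numerical check of inequality (7.28) (added here; $H$ and $g$ are arbitrary, and $m$ is taken to be the smallest eigenvalue of $H$):

    import numpy as np

    rng = np.random.default_rng(5)
    B = rng.standard_normal((4, 4))
    H = B @ B.T + np.eye(4)                 # positive definite stand-in for the Hessian
    g = rng.standard_normal(4)

    m = np.linalg.eigvalsh(H).min()         # strong convexity constant for this H
    Hinv = np.linalg.inv(H)
    lhs = g @ Hinv @ Hinv @ g               # g^T H^{-1} H^{-1} g
    rhs = (g @ Hinv @ g) / m                # (1/m) g^T H^{-1} g
    print(lhs <= rhs + 1e-12)               # True, as in (7.28)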
Proof (Lemma 7.4): In this proof we will find $\alpha < \frac12$ such that $t = 1$ satisfies the BTLS exit condition. Our goal is to find $\alpha$ such that
$$f(x + \Delta x_{\mathrm{nt}}) \le f(x) + \alpha\, g^T \Delta x_{\mathrm{nt}}. \tag{7.31}$$
For notational convenience we denote
$$\lambda(x) = (\Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}})^{1/2} = (g^T H^{-1} g)^{1/2}, \tag{7.32}$$
which is known as the Newton decrement at $x$; the second equality follows because $\Delta x_{\mathrm{nt}} = -H^{-1}g$.
By the Lipschitz condition, for $t \ge 0$,
$$\|\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - H\|_2 \le tL\,\|\Delta x_{\mathrm{nt}}\|_2, \tag{7.33}$$
and hence
$$\left|\Delta x_{\mathrm{nt}}^T\left(\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - H\right)\Delta x_{\mathrm{nt}}\right| \le tL\,\|\Delta x_{\mathrm{nt}}\|_2^3. \tag{7.34}$$
Now define $\tilde f(t) = f(x + t\Delta x_{\mathrm{nt}})$; its second derivative with respect to $t$ is $\tilde f''(t) = \Delta x_{\mathrm{nt}}^T \nabla^2 f(x + t\Delta x_{\mathrm{nt}})\, \Delta x_{\mathrm{nt}}$. Substituting $\tilde f''$ into the above inequality we have
$$|\tilde f''(t) - \tilde f''(0)| \le tL\,\|\Delta x_{\mathrm{nt}}\|_2^3. \tag{7.35}$$
We will use this inequality to find an upper bound on $\tilde f(t)$. Starting with³
$$\tilde f''(t) \le \tilde f''(0) + tL\,\|\Delta x_{\mathrm{nt}}\|_2^3 \tag{7.36}$$
$$\le \lambda^2(x) + t\,\frac{L}{m^{3/2}}\,\lambda^3(x), \tag{7.37}$$
since $\tilde f''(0) = \Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}} = \lambda^2(x)$ and, from strong convexity $H \succeq mI$,
$$\lambda^2(x) = \Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}} \ge m\,\|\Delta x_{\mathrm{nt}}\|_2^2 \;\Rightarrow\; \|\Delta x_{\mathrm{nt}}\|_2 \le m^{-1/2}\lambda(x).$$

³Note that if $a, b, c > 0$ and $|a - b| \le c$, then $a \le b + c$: if $a \ge b$ we can remove the absolute value sign; if $a \le b$, then $a \le b \le b + c$ since $c$ is positive.
Now integrate both sides of (7.37) with respect to $t$:
$$\tilde f'(t) \le \tilde f'(0) + t\lambda^2(x) + t^2\,\frac{L}{2m^{3/2}}\,\lambda^3(x) = -\lambda^2(x) + t\lambda^2(x) + t^2\,\frac{L}{2m^{3/2}}\,\lambda^3(x),$$
since $\tilde f'(0) = \Delta x_{\mathrm{nt}}^T g = -g^T H^{-1} g = -\lambda^2(x)$. Integrating once more,
$$\tilde f(t) \le \tilde f(0) - t\lambda^2(x) + \frac{t^2}{2}\lambda^2(x) + t^3\,\frac{L}{6m^{3/2}}\,\lambda^3(x).$$
Setting $t = 1$ we have
$$f(x + \Delta x_{\mathrm{nt}}) \le f(x) - \frac12\lambda^2(x) + \frac{L}{6m^{3/2}}\lambda^3(x)
= f(x) - \lambda^2(x)\left(\frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right)
= f(x) + g^T\Delta x_{\mathrm{nt}}\left(\frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right).$$
Again using strong convexity, we have
$$\lambda(x) = (g^T H^{-1} g)^{1/2} \le \frac{1}{m^{1/2}}\,\|g\|_2 < \frac{1}{m^{1/2}}\,\eta,$$
where the last inequality follows from the assumption $\|g\|_2 < \eta$. Hence if we choose $\alpha$ such that
$$\alpha \le \frac12 - \frac{L}{6m^{2}}\,\eta \;\left(< \frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right),$$
then the coefficient of $g^T\Delta x_{\mathrm{nt}}$ above is at least $\alpha$; since $g^T\Delta x_{\mathrm{nt}} < 0$, this gives $f(x + \Delta x_{\mathrm{nt}}) \le f(x) + \alpha\, g^T\Delta x_{\mathrm{nt}}$, i.e. $t = 1$ satisfies the BTLS exit condition.