
EE 381V: Large Scale Optimization                                    Fall 2014
Lecture 7 — September 18
Lecturer: Sanghavi          Scribe: David Inouye, Jimmy Lin, & Vutha Va

7.1 Recap
In the last lecture we studied the coordinate descent and steepest descent methods. Toward the end of the lecture we looked at an example illustrating that a change of coordinates ($x = Ay$) can alter the convergence behavior of gradient descent. In this lecture we will study Newton’s method, which is affine invariant, i.e. a change of coordinates $x = Ay$ does not affect its convergence behavior, in contrast to gradient descent.
7.2 Introduction to Newton’s Method
Definition 1. For $f$ with positive definite Hessian $\nabla^2 f(x) \succ 0$, the Newton updating rule with step size $t$ is defined as
$$x^+ = x + t\,\Delta x_{\mathrm{nt}} = x - t\,\nabla^2 f(x)^{-1}\nabla f(x), \tag{7.1}$$
where $\Delta x_{\mathrm{nt}}$ is called the Newton step.
Note that since $\nabla^2 f(x) \succ 0$ we have
$$\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$
whenever $\nabla f(x) \neq 0$, so the Newton step is a descent direction.
There are various ways to interpret this choice of updating rule.
Minimizer of Quadratic Approximation
Consider a quadratic approximation of $f$ around $x$,
$$\tilde f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2}\, v^T \nabla^2 f(x)\, v. \tag{7.2}$$
This quadratic function is minimized at $v^\star = -\nabla^2 f(x)^{-1}\nabla f(x)$. Note that if $f$ is quadratic, this approximation is exact and $x + v^\star$ is the exact minimizer of $f$.
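As a quick numerical illustration of this interpretation (a sketch added here, not part of the original notes; the matrix $Q$, vector $b$, and starting point are arbitrary choices), one full Newton step ($t = 1$) lands exactly on the minimizer when $f$ itself is quadratic:

    import numpy as np

    # f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Q x - b and the Hessian is Q everywhere.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    Q = A @ A.T + 5 * np.eye(5)               # positive definite Hessian
    b = rng.standard_normal(5)

    x = rng.standard_normal(5)                # arbitrary starting point
    v_star = -np.linalg.solve(Q, Q @ x - b)   # Newton step: -(Hessian)^{-1} gradient
    x_plus = x + v_star                       # full step, t = 1

    x_min = np.linalg.solve(Q, b)             # true minimizer of the quadratic
    print(np.allclose(x_plus, x_min))         # True: one Newton step is exact here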
A Special Case of Steepest Descent
Newton’s method can also be viewed as the steepest descent method in the norm
$$\|u\|_{\nabla^2 f(x)} \triangleq \sqrt{u^T \nabla^2 f(x)\, u}. \tag{7.3}$$
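To see this concretely (a numerical sketch added here, not from the notes; $H$ and $g$ below are arbitrary stand-ins for the Hessian and gradient at $x$): the normalized steepest descent direction for the norm (7.3) is the unit-norm direction minimizing $\nabla f(x)^T v$, and that minimizer is proportional to the Newton step $-H^{-1}g$. The snippet checks that no random direction of unit $\|\cdot\|_{\nabla^2 f(x)}$-norm does better:

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((4, 4))
    H = B @ B.T + 4 * np.eye(4)                # stand-in for the Hessian at x
    g = rng.standard_normal(4)                 # stand-in for the gradient at x

    h_norm = lambda v: np.sqrt(v @ H @ v)      # the norm defined in (7.3)

    v_nt = -np.linalg.solve(H, g)              # Newton direction
    v_nt = v_nt / h_norm(v_nt)                 # normalized to unit H-norm

    V = rng.standard_normal((10000, 4))        # random directions, normalized the same way
    V = V / np.array([h_norm(v) for v in V])[:, None]
    print(g @ v_nt <= (V @ g).min() + 1e-12)   # True: the Newton direction is steepest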
Linear Approximation of Gradient around x
Consider a linear approximation of $\nabla f(x + v)$,
$$\nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x)\, v. \tag{7.4}$$
The Newton updating rule is obtained by setting the right-hand side to zero, which is an approximation to the optimality condition $\nabla f(x^\star) = 0$.
Affine Invariance of Newton’s Method
As mentioned earlier, an important feature of Newton’s method is affine invariance: a change of coordinates does not alter the convergence behavior, in contrast to what we observed for gradient descent. We now state this formally and prove it.
Lemma 7.1. Newton’s method is affine invariant, i.e. if $A$ is invertible and we define $g(y) = f(Ay)$, and
• $y^+$ is the Newton update for $g$ at $y$,
• $x^+$ is the Newton update for $f$ at $x$,
then $x = Ay$ implies $x^+ = Ay^+$.
Proof: Let $x = Ay$. Then
$$\nabla^2_y g(y) = A^T \nabla^2_x f(x)\, A, \tag{7.5}$$
$$\nabla_y g(y) = A^T \nabla_x f(x). \tag{7.6}$$
Substituting these into the Newton updating rule for $g$,
$$\begin{aligned}
y^+ &= y - t\left(A^T \nabla^2_x f(x)\, A\right)^{-1} A^T \nabla_x f(x) \\
    &= y - t\, A^{-1} \nabla^2_x f(x)^{-1} A^{-T} A^T \nabla_x f(x) \\
    &= y - t\, A^{-1} \nabla^2_x f(x)^{-1} \nabla_x f(x).
\end{aligned}$$
Multiplying both sides by $A$,
$$Ay^+ = x - t\, \nabla^2_x f(x)^{-1} \nabla_x f(x) = x^+.$$
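A quick numerical check of Lemma 7.1 (an added sketch, not part of the notes; the test function and the matrix $A$ are arbitrary choices): take one Newton step on $f$ at $x = Ay$ and one on $g(y) = f(Ay)$ at $y$, and confirm $x^+ = Ay^+$.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3
    A = rng.standard_normal((n, n)) + 3 * np.eye(n)    # a (generically invertible) change of coordinates

    # Smooth strongly convex test function: f(x) = sum(exp(x_i)) + 0.5 ||x||^2.
    grad_f = lambda x: np.exp(x) + x
    hess_f = lambda x: np.diag(np.exp(x)) + np.eye(n)

    # Chain rule gives the gradient and Hessian of g(y) = f(Ay), as in (7.5)-(7.6).
    grad_g = lambda y: A.T @ grad_f(A @ y)
    hess_g = lambda y: A.T @ hess_f(A @ y) @ A

    t = 0.7                                            # any common step size
    y = rng.standard_normal(n)
    x = A @ y

    x_plus = x - t * np.linalg.solve(hess_f(x), grad_f(x))
    y_plus = y - t * np.linalg.solve(hess_g(y), grad_g(y))
    print(np.allclose(x_plus, A @ y_plus))             # True: the updates agree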
7.3 Convergence of Newton’s Method
We make two major assumptions in this analysis:
1. $f$ is strongly convex, with $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x$.
2. $\nabla^2 f$ is Lipschitz continuous with constant $L > 0$, i.e.
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\,\|x - y\|_2 \quad \forall x, y. \tag{7.7}$$
Note that the norm on the left is the spectral norm, defined as the largest singular value. $L$ can be interpreted as a bound on the third derivative of $f$: the smaller $L$ is, the better $f$ can be approximated by a quadratic function. Since each step of Newton’s method minimizes a quadratic approximation of $f$, the performance of Newton’s method will be best for functions with small $L$.
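As an illustration of assumption 2 (added here; the function and the box are my own arbitrary choices), one can estimate a Lipschitz constant for the Hessian of $f(x) = \sum_i e^{x_i} + \tfrac12\|x\|_2^2$ empirically by sampling pairs of points:

    import numpy as np

    rng = np.random.default_rng(3)
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)   # Hessian of sum(exp(x)) + 0.5||x||^2

    # Ratios ||H(x) - H(y)||_2 / ||x - y||_2 over random pairs in the box [-1, 1]^4;
    # their maximum is an empirical lower estimate of the Lipschitz constant L there.
    ratios = []
    for _ in range(2000):
        x, y = rng.uniform(-1.0, 1.0, size=(2, 4))
        ratios.append(np.linalg.norm(hess(x) - hess(y), 2) / np.linalg.norm(x - y))
    print(max(ratios))       # stays below e ~ 2.72, the Lipschitz constant on this box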
For notational convenience we denote $g = \nabla f(x)$ and $H = \nabla^2 f(x)$ from this point on.
Our main result of this lecture is Theorem 7.2; we will devote the rest of the lecture to understanding and proving it. Before stating the theorem, let us first recall Backtracking Line Search (BTLS). In BTLS, parameters $\alpha$ and $\beta$ are first chosen with $0 < \alpha < 1/2$ and $0 < \beta < 1$; then, starting with $t = 1$, repeat:
    while true
        if f(x + t∆x) ≤ f(x) + α t gᵀ∆x
            exit
        else
            t ← βt
        end
    end
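A minimal, self-contained sketch of damped Newton with BTLS (my own code, not from the notes; the test problem and the choices α = 0.25, β = 0.5 are arbitrary):

    import numpy as np

    def newton_btls(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=50):
        # Damped Newton's method with backtracking line search.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            dx = -np.linalg.solve(hess(x), g)     # Newton step: solve H dx = -g
            t = 1.0
            while f(x + t * dx) > f(x) + alpha * t * (g @ dx):   # BTLS exit condition
                t *= beta
            x = x + t * dx
        return x

    # Example: a smooth strongly convex test problem.
    f    = lambda x: np.sum(np.exp(x)) + 0.5 * (x @ x)
    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

    x_hat = newton_btls(f, grad, hess, x0=[3.0, -2.0, 1.0])
    print(np.linalg.norm(grad(x_hat)))            # ~0: the optimality condition holds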
We are now ready to state and prove the theorem.
Theorem 7.2. There exist $\eta$ and $\gamma$ with $0 < \eta \le \frac{m^2}{L}$ and $\gamma = \alpha\beta\eta^2\frac{m}{M^2}$ such that Newton’s method with BTLS satisfies:
(a). Damped Newton Phase: If $\|g\|_2 \ge \eta$, then $f(x^+) - f(x) \le -\gamma$.
(b). Quadratic Phase: If $\|g\|_2 < \eta$, then BTLS selects $t = 1$ and
$$\frac{L}{2m^2}\,\|\nabla f(x^+)\|_2 \le \left(\frac{L}{2m^2}\,\|\nabla f(x)\|_2\right)^2. \tag{7.8}$$
Before proceeding to the proof, let us first interpret the implications of this theorem.
Implication of (a)
In the damped Newton phase, $f$ decreases by at least $\gamma$ at each iteration, so the total number of iterations in this phase cannot exceed
$$\frac{f(x^{(0)}) - p^\star}{\gamma},$$
since otherwise $f(x)$ would drop below $p^\star$, contradicting the optimality of $p^\star$. In other words, the quadratic phase starts after at most $\frac{f(x^{(0)}) - p^\star}{\gamma}$ iterations.
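To make the bound concrete (an illustrative plug-in of arbitrary constants added here, not values from the notes): with $\alpha = 0.25$, $\beta = 0.5$, $m = 1$, $M = 10$, $L = 10$, and $\eta = m^2/L$,
$$\eta = \frac{m^2}{L} = 0.1,\qquad \gamma = \alpha\beta\eta^2\frac{m}{M^2} = 0.25\cdot 0.5\cdot (0.1)^2\cdot\frac{1}{100} = 1.25\times 10^{-5},$$
so if $f(x^{(0)}) - p^\star = 1$, the damped phase ends within $1/\gamma = 8\times 10^4$ iterations. The bound is conservative; in practice far fewer damped steps are typically needed.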
Implication of (b)
Let $k$ be the first iteration at which $\|g\|_2 < \eta$, and let $\ell \ge 1$ index the iterations from the $k$-th onward. For simplicity, define
$$a_\ell = \frac{L}{2m^2}\,\|\nabla f(x^{(k+\ell-1)})\|_2. \tag{7.9}$$
First, let us establish a bound on $a_1$. In the quadratic phase, since $\|\nabla f(x^{(k)})\|_2 < \eta$ and $\eta \le \frac{m^2}{L}$ by assumption, we have
$$\frac{L}{2m^2}\,\|\nabla f(x^{(k)})\|_2 < \frac{L}{2m^2}\,\eta \le \frac{1}{2}.$$
Thus, $a_1 < \frac12$. Now from part (b) of the theorem, we also have $a_{\ell+1} \le a_\ell^2$. Therefore, we have the following sequence:
$$a_\ell \le (a_{\ell-1})^2 \le (a_{\ell-2})^{2^2} \le (a_{\ell-3})^{2^3} \le \cdots \le (a_1)^{2^{\ell-1}}$$
$$\Longrightarrow\; a_\ell \le (a_1)^{2^{\ell-1}} \le \left(\frac12\right)^{2^{\ell-1}} \tag{7.10}$$
$$\Longrightarrow\; \|\nabla f(x^{(k+\ell-1)})\|_2 \le \frac{2m^2}{L}\left(\frac12\right)^{2^{\ell-1}}. \tag{7.11}$$
For strongly convex functions, we also have
$$f(x^{(k+\ell-1)}) - p^\star \le \frac{1}{2m}\,\|\nabla f(x^{(k+\ell-1)})\|_2^2 \le \frac{1}{2m}\left(\frac{2m^2}{L}\left(\frac12\right)^{2^{\ell-1}}\right)^2 = \frac{2m^3}{L^2}\left(\frac12\right)^{2^{\ell}}, \tag{7.12}$$
thus, $f(x) \to p^\star$ quadratically.
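Empirically, the squaring predicted by (7.8) is easy to see (an added illustration, not from the notes; pure Newton steps $t = 1$ on the same arbitrary test function as above, from a starting point close enough that no damping is needed):

    import numpy as np

    grad = lambda x: np.exp(x) + x
    hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

    x = np.array([1.0, -1.0, 0.5])
    for k in range(6):
        print(k, np.linalg.norm(grad(x)))          # gradient norms; they eventually square at each step
        x = x - np.linalg.solve(hess(x), grad(x))  # pure Newton step, t = 1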
Therefore, if we want $a_\ell \le \epsilon$, it suffices (using $a_\ell \le (a_1)^{2^{\ell-1}}$) that
$$(a_1)^{2^{\ell-1}} \le \epsilon
\;\Longleftrightarrow\; 2^{\ell-1}\log a_1 \le \log\epsilon
\;\Longleftrightarrow\; 2^{\ell-1} \ge \mathrm{constant}\times\log\frac{1}{\epsilon}
\;\Longleftrightarrow\; \ell - 1 \ge \log\log\frac{1}{\epsilon} + \mathrm{constant},$$
so only on the order of $\log\log(1/\epsilon)$ iterations are needed in the quadratic phase.
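A short sketch of how quickly the bound collapses (added here; the tolerance $10^{-12}$ is an arbitrary choice):

    import math

    # Quadratic-phase bound a_ell <= (1/2)^(2^(ell-1)).
    for ell in range(1, 8):
        print(ell, 0.5 ** (2 ** (ell - 1)))
    # roughly 0.5, 0.25, 6.3e-2, 3.9e-3, 1.5e-5, 2.3e-10, 5.4e-20

    eps = 1e-12
    ell = 1 + math.ceil(math.log2(math.log2(1.0 / eps)))   # ell - 1 >= log2(log2(1/eps))
    print(ell)                                             # 7 iterations already suffice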
7.3.1 Convergence Proof
For readability, we divide the proof into lemmas that emphasize its overall flow, so that we do not get lost in the details of the derivation.
Lemma 7.3. Under the assumptions of part (a), $t = \frac{m}{M}$ satisfies the BTLS exit condition, i.e.
$$f(x + t\Delta x_{\mathrm{nt}}) \le f(x) + \alpha t\, g^T \Delta x_{\mathrm{nt}}.$$
Lemma 7.4. Under the assumptions of part (b), $t = 1$ satisfies the BTLS exit condition.
The proofs of both lemmas are given in Section 7.3.2.
Proof (Theorem 7.2 part (a)): By Lemma 7.3 (whose proof shows that the exit condition in fact holds for every $t \le \frac{m}{M}$), BTLS terminates with a step size $t \ge \beta\frac{m}{M}$. Using this and substituting $\Delta x_{\mathrm{nt}} = -H^{-1}g$ into the exit condition, we have
$$f(x^+) \le f(x) - \alpha\beta\,\frac{m}{M}\, g^T H^{-1} g. \tag{7.13}$$
By strong convexity $H \preceq MI$, so $H^{-1} \succeq \frac{1}{M}I$, and we have
$$g^T H^{-1} g \ge \frac{1}{M}\,\|g\|_2^2, \tag{7.14}$$
therefore,
$$f(x^+) \le f(x) - \alpha\beta\,\frac{m}{M}\cdot\frac{1}{M}\,\|g\|_2^2, \tag{7.15}$$
$$f(x^+) - f(x) \le -\underbrace{\alpha\beta\,\frac{m}{M^2}\,\eta^2}_{\gamma}, \tag{7.16}$$
where the last inequality follows because $\|g\|_2 \ge \eta$.
Proof (Theorem 7.2 part (b)): Using Lemma 7.4, we can set $t = 1$. Thus we have $x^+ = x - H^{-1}g$ and
$$\nabla f(x^+) = \nabla f(x - H^{-1}g) - g + HH^{-1}g, \tag{7.17}$$
where the last two terms cancel; they are written in this way to set up the integral representation below.
Applying the fundamental theorem of calculus to $\nabla f(x - H^{-1}g) - g$ we have¹
$$\nabla f(x - H^{-1}g) - g = \int_0^1 \nabla^2 f(x - tH^{-1}g)\,(-H^{-1}g)\, dt, \tag{7.18}$$
thus, writing $HH^{-1}g = -\int_0^1 H\,(-H^{-1}g)\,dt$ and combining with (7.17),
$$\nabla f(x^+) = \int_0^1 \left[\nabla^2 f(x - tH^{-1}g) - H\right](-H^{-1}g)\, dt. \tag{7.19}$$

¹This point can be seen easily in the scalar case: let $a(t) = \nabla f(x - tH^{-1}g)$; then $a(0) = \nabla f(x)$, $a(1) = \nabla f(x - H^{-1}g)$, and $\int_0^1 \frac{d}{dt}a(t)\,dt = a(1) - a(0)$.
Taking the norm of both sides,
$$\|\nabla f(x^+)\|_2 = \left\|\int_0^1 \left[\nabla^2 f(x - tH^{-1}g) - H\right](-H^{-1}g)\, dt\right\|_2 \tag{7.20}$$
$$\le \int_0^1 \left\|\nabla^2 f(x - tH^{-1}g) - H\right\|_2\,\left\|H^{-1}g\right\|_2\, dt. \tag{7.21}$$
By Lipschitz continuity of $\nabla^2 f$ we have
$$\|\nabla^2 f(x - tH^{-1}g) - H\|_2 \le L\,\|tH^{-1}g\|_2 = Lt\,\|H^{-1}g\|_2, \tag{7.22}$$
thus,
$$\|\nabla f(x^+)\|_2 \le \int_0^1 Lt\,\|H^{-1}g\|_2^2\, dt = \frac{L}{2}\,\|H^{-1}g\|_2^2. \tag{7.23}$$
Also, $H \succeq mI$, so $H^{-1} \preceq \frac{1}{m}I$ and
$$\|H^{-1}g\|_2^2 \le \frac{1}{m^2}\,\|g\|_2^2. \tag{7.24}$$
Substituting this,
$$\|\nabla f(x^+)\|_2 \le \frac{L}{2m^2}\,\|g\|_2^2, \tag{7.25}$$
and multiplying both sides by $\frac{L}{2m^2}$ we obtain the result stated in the theorem.
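The integral identity (7.18) used above can also be checked numerically (an added sanity check, not from the notes; the test function is the same arbitrary one as before, and the integral is approximated by a trapezoidal sum):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 3
    grad = lambda x: np.exp(x) + x                      # f(x) = sum(exp(x)) + 0.5||x||^2
    hess = lambda x: np.diag(np.exp(x)) + np.eye(n)

    x = rng.standard_normal(n)
    g, H = grad(x), hess(x)
    v = -np.linalg.solve(H, g)                          # v = -H^{-1} g

    ts = np.linspace(0.0, 1.0, 2001)
    vals = np.stack([hess(x + t * v) @ v for t in ts])  # integrand of (7.18) on a grid
    w = np.full(ts.size, 1.0); w[0] = w[-1] = 0.5       # trapezoidal weights
    rhs = (w[:, None] * vals).sum(axis=0) * (ts[1] - ts[0])

    lhs = grad(x + v) - g                               # left-hand side of (7.18)
    print(np.allclose(lhs, rhs, atol=1e-6))             # True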
7.3.2 Proof of Lemmas
Proof (Lemma 7.3): Using the quadratic upper bound $f(x + v) \le f(x) + \nabla f(x)^T v + \frac{M}{2}\|v\|_2^2$ (which follows from $\nabla^2 f \preceq MI$) with $v = -tH^{-1}g$,
$$f(x^+) = f(x - tH^{-1}g) \tag{7.26}$$
$$\le f(x) - t\,g^T H^{-1} g + \frac{M}{2}\,t^2\, g^T H^{-1} H^{-1} g. \tag{7.27}$$
Note that²
$$g^T H^{-1} H^{-1} g = g^T H^{-1/2} H^{-1} H^{-1/2} g \le \frac{1}{m}\, g^T H^{-1} g. \tag{7.28}$$
Thus,
$$f(x^+) \le f(x) - t\,g^T H^{-1} g + \frac{M}{2m}\,t^2\, g^T H^{-1} g. \tag{7.29}$$
Setting $t = \frac{m}{M}$,
$$f(x^+) \le f(x) - \frac{m}{2M}\, g^T H^{-1} g. \tag{7.30}$$
This satisfies the exit condition for $t = \frac{m}{M}$, namely $f(x^+) \le f(x) - \alpha\frac{m}{M}\, g^T H^{-1} g$, since $\alpha < 1/2$.

²Recall the definition of the square root of a matrix: if $A$ is positive definite, then we can write $A = U\Lambda U^T$, where $U$ is unitary and $\Lambda$ is diagonal, and $A^{1/2} = U\Lambda^{1/2}U^T$.
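A small numerical check of inequality (7.28) (added here; $H$ and $g$ are arbitrary, and $m$ is taken to be the smallest eigenvalue of $H$):

    import numpy as np

    rng = np.random.default_rng(5)
    B = rng.standard_normal((4, 4))
    H = B @ B.T + np.eye(4)                 # positive definite stand-in for the Hessian
    g = rng.standard_normal(4)

    m = np.linalg.eigvalsh(H).min()         # strong convexity constant for this H
    Hinv = np.linalg.inv(H)
    lhs = g @ Hinv @ Hinv @ g               # g^T H^{-1} H^{-1} g
    rhs = (g @ Hinv @ g) / m                # (1/m) g^T H^{-1} g
    print(lhs <= rhs + 1e-12)               # True, as in (7.28)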
Proof (Lemma 7.4): In this proof we will find $\alpha < \frac12$ such that $t = 1$ satisfies the BTLS exit condition. Our goal is to find $\alpha$ such that
$$f(x + \Delta x_{\mathrm{nt}}) \le f(x) + \alpha\, g^T \Delta x_{\mathrm{nt}}. \tag{7.31}$$
For notational convenience we denote
$$\lambda(x) = (\Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}})^{1/2} = (g^T H^{-1} g)^{1/2}, \tag{7.32}$$
which is known as the Newton decrement at $x$; the second equality follows because $\Delta x_{\mathrm{nt}} = -H^{-1}g$.
By the Lipschitz condition, for $t \ge 0$,
$$\|\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - H\|_2 \le tL\,\|\Delta x_{\mathrm{nt}}\|_2, \tag{7.33}$$
and hence
$$\left|\Delta x_{\mathrm{nt}}^T\left(\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - H\right)\Delta x_{\mathrm{nt}}\right| \le tL\,\|\Delta x_{\mathrm{nt}}\|_2^3. \tag{7.34}$$
Now define $\tilde f(t) = f(x + t\Delta x_{\mathrm{nt}})$; its second derivative with respect to $t$ is $\tilde f''(t) = \Delta x_{\mathrm{nt}}^T \nabla^2 f(x + t\Delta x_{\mathrm{nt}})\, \Delta x_{\mathrm{nt}}$. Substituting $\tilde f''$ into the above inequality we have
$$|\tilde f''(t) - \tilde f''(0)| \le tL\,\|\Delta x_{\mathrm{nt}}\|_2^3. \tag{7.35}$$
We will use this inequality to find an upper bound on $\tilde f(t)$. Starting with³
$$\tilde f''(t) \le \tilde f''(0) + tL\,\|\Delta x_{\mathrm{nt}}\|_2^3 \tag{7.36}$$
$$\le \lambda^2(x) + t\,\frac{L}{m^{3/2}}\,\lambda^3(x), \tag{7.37}$$
since $\tilde f''(0) = \Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}} = \lambda^2(x)$ and, from strong convexity $H \succeq mI$,
$$\lambda^2(x) = \Delta x_{\mathrm{nt}}^T H \Delta x_{\mathrm{nt}} \ge m\,\|\Delta x_{\mathrm{nt}}\|_2^2 \;\Rightarrow\; \|\Delta x_{\mathrm{nt}}\|_2 \le m^{-1/2}\lambda(x).$$

³Note that if $a, b, c > 0$ and $|a - b| \le c$, then $a \le b + c$: if $a \ge b$ we can remove the absolute value sign; if $a \le b$, then $a \le b \le b + c$ since $c$ is positive.
Now integrate both sides of (7.37) with respect to $t$:
$$\tilde f'(t) \le \tilde f'(0) + t\lambda^2(x) + t^2\,\frac{L}{2m^{3/2}}\,\lambda^3(x) = -\lambda^2(x) + t\lambda^2(x) + t^2\,\frac{L}{2m^{3/2}}\,\lambda^3(x),$$
since $\tilde f'(0) = \Delta x_{\mathrm{nt}}^T g = -g^T H^{-1} g = -\lambda^2(x)$. Integrating once more,
$$\tilde f(t) \le \tilde f(0) - t\lambda^2(x) + \frac{t^2}{2}\lambda^2(x) + t^3\,\frac{L}{6m^{3/2}}\,\lambda^3(x).$$
Setting $t = 1$ we have
$$f(x + \Delta x_{\mathrm{nt}}) \le f(x) - \frac12\lambda^2(x) + \frac{L}{6m^{3/2}}\lambda^3(x)
= f(x) - \lambda^2(x)\left(\frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right)
= f(x) + g^T\Delta x_{\mathrm{nt}}\left(\frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right).$$
Again using strong convexity, we have
$$\lambda(x) = (g^T H^{-1} g)^{1/2} \le \frac{1}{m^{1/2}}\,\|g\|_2 < \frac{1}{m^{1/2}}\,\eta,$$
where the last inequality follows from the assumption $\|g\|_2 < \eta$. Hence if we choose $\alpha$ such that
$$\alpha \le \frac12 - \frac{L}{6m^{2}}\,\eta \;\left(< \frac12 - \frac{L\lambda(x)}{6m^{3/2}}\right),$$
then the coefficient of $g^T\Delta x_{\mathrm{nt}}$ above is at least $\alpha$; since $g^T\Delta x_{\mathrm{nt}} < 0$, this gives $f(x + \Delta x_{\mathrm{nt}}) \le f(x) + \alpha\, g^T\Delta x_{\mathrm{nt}}$, i.e. $t = 1$ satisfies the BTLS exit condition.