MATH-GA 2010.001 / CSCI-GA 2420.001, Georg Stadler (NYU Courant)
Fall 2015: Numerical Methods I
Assignment 4 (due Nov. 9, 2015)

1. [Behavior of descent methods, 6pt] Consider the unconstrained optimization problem

       min f(x, y) ≡ −cos(x) cos(y/10).

   (a) Find and classify all stationary points in the region −π/2 ≤ x ≤ π/2, −10π/2 ≤ y ≤ 10π/2.
   (b) There is a portion of the problem region within which the Hessian matrix of f(x, y) is positive definite. Give expressions for this portion; you should be able to do this analytically.
   (c) Derive expressions for the search directions associated with the steepest descent and Newton methods.
   (d) Write a program that performs both iterations, both without a line search and with an exact line search. Note that you will not be able to find the optimal step length analytically; instead, determine it numerically.^1
   (e) Run your program for various initial guesses within the region. Verify the following:^2
       • Steepest descent converges to the minimum x* for any starting point within the region.
       • Newton's method with line search converges to the minimum only for initial points at which the Hessian matrix is positive definite.
       • Newton's method without line search has an even more restricted radius of convergence.
   (f) What do you observe about the convergence rate in these cases?

   ^1 You can use a built-in one-dimensional solver (in MATLAB, e.g., fminbnd to minimize the step-length function, or fzero to find a root of its derivative). While we use an exact line search here, this is usually too costly, since it requires a large number of function evaluations. As we have discussed, one therefore uses an inexact step size that satisfies, for instance, the Wolfe conditions to guarantee convergence to a stationary point.
   ^2 One way to illustrate this (non)convergence is to choose initializations randomly and draw them as dots in different colors, depending on whether the method, started from that initialization, converged.

2. [Scaling of descent methods, 2pt] For a twice differentiable function f_1 : R^n → R, consider the minimization problems

       min_{x ∈ R^n} f_1(x)   and   min_{x ∈ R^n} f_2(x),

   where f_2(x) = β f_1(x) + γ with β > 0 and γ ∈ R.

   (a) Compare the steepest descent and the Newton directions at a point x^0 ∈ R^n for these two optimization problems, which obviously have the same minimizers. In class we argued that (locally) a good step size for Newton's method is α = 1, and thus we initialize a backtracking line search with that value. Why is it not possible to give a similarly good initial step length for the steepest descent method?
   (b) Newton's method for optimization problems can also be seen as a method for finding zeros of the gradient g := ∇f, i.e., points x where g(x) = 0. Show that the Newton step for g(x) = 0 coincides with the Newton step for the modified problem Bg(x) = 0, where B ∈ R^{n×n} is a regular matrix.^3

   ^3 This property is called affine invariance of Newton's method, and it is one of the reasons why Newton's method is so efficient. A comprehensive reference for Newton's method is the book by P. Deuflhard, Newton Methods for Nonlinear Problems, Springer, 2006.

3. [Globalization of Newton descent, 3+1pt] As we have seen, the Newton direction for solving a minimization problem is only a descent direction if the Hessian matrix is positive definite. This is not always the case, in particular far from the minimizer. To guarantee a descent direction in Newton's method, a simple idea is as follows (where we choose 0 < α_1 < 1 and α_2 > 0):

   • Compute a direction d^k by solving the Newton equation ∇²f(x^k) d^k = −∇f(x^k). If that is possible and d^k satisfies^4

         (−∇f(x^k)^T d^k) / (‖∇f(x^k)‖ ‖d^k‖) ≥ min{α_1, α_2 ‖∇f(x^k)‖},        (1)

     then use d^k as the descent direction.
   • Otherwise, use the steepest descent direction d^k = −∇f(x^k).
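   This direction selection can be summarized in a short routine. The following MATLAB sketch is purely illustrative and not part of the required hand-in: the function name descent_direction and the handles gradf and hessf for the gradient and Hessian are placeholders, and the rcond threshold is an arbitrary heuristic for deciding whether the Newton system can be solved reliably.

       % Globalized Newton direction (problem 3): try the Newton step and fall
       % back to steepest descent if the Newton system cannot be solved reliably
       % or the angle condition (1) is violated.
       function d = descent_direction(x, gradf, hessf, alpha1, alpha2)
           g = gradf(x);
           H = hessf(x);
           d = -g;                                   % fallback: steepest descent
           if rcond(H) > 1e-12                       % heuristic solvability check
               dN = -(H \ g);                        % Newton step
               cosangle = -g' * dN / (norm(g) * norm(dN));
               if cosangle >= min(alpha1, alpha2 * norm(g))
                   d = dN;                           % accept the Newton direction
               end
           end
       end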
   To illustrate this globalization, let f : R^2 → R be defined by

       f(x) = (1/2) (x_1^2 + x_2^2) exp(x_1^2 − x_2^2).

   (a) Using the initial iterate x^0 = (1, 1)^T, find a local minimum of f using the modified Newton method described above, combined with an Armijo line search with backtracking. Hand in a listing of your implementation.^5
   (b) Carry out the computation also with the modified Newton matrix ∇²f(x) + 3I. Discuss your findings.

   ^4 This is a condition on the angle between the negative gradient and the Newton direction, which must be less than 90°. Note, however, that with this condition the angle may approach 90° at the same speed as ‖∇f(x^k)‖ approaches zero. Recall from class that what is required to guarantee convergence for a descent method with a Wolfe line search is that a certain infinite sum involving the square of the right-hand side in (1) and the norm of the gradient is finite.
   ^5 It is sufficient to hand in a listing of the important parts of your implementation, i.e., the Armijo line search and the computation of the descent direction.

4. [Modified metric in steepest descent, 2pt] Consider f : R^n → R continuously differentiable, and x ∈ R^n with ∇f(x) ≠ 0. For a symmetric positive definite matrix A ∈ R^{n×n}, we define the A-weighted norm ‖y‖_A = sqrt(y^T A y). Derive the unit-norm steepest descent direction of f at x with respect to the ‖·‖_A-norm, i.e., find the solution to the problem^6

       min_{‖d‖_A = 1} ∇f(x)^T d.

   Hint: Use the factorization A = B^T B and the Cauchy–Schwarz inequality.

   ^6 If f is twice differentiable with a positive definite Hessian matrix, one can choose A = ∇²f(x). This shows that the Newton direction is the steepest descent direction in the metric in which norms are weighted by the Hessian matrix.

5. [Equality-constrained optimization, 2pt] Solve the optimization problem

       min_{x ∈ R^2} x_1 x_2

   subject to the equality constraint x_1^2 + x_2^2 − 1 = 0. We know that at the minimizer(s) x* there exists a Lagrange multiplier λ* such that the Lagrangian function L(x, λ) := x_1 x_2 − λ(x_1^2 + x_2^2 − 1) is stationary, i.e., its partial derivatives with respect to x and λ, L_x(x*, λ*) and L_λ(x*, λ*), vanish. Use this to compute the minima of this equality-constrained optimization problem.

6. [Largest eigenvalues using the power method, 2+1pt] The power method for computing the largest eigenvalue only requires the application of the matrix to vectors rather than the matrix itself. To illustrate this, we use the power method to compute the largest eigenvalue of a matrix A of which we only know its action on vectors.

   (a) Use the power method (with a reasonable stopping tolerance) to compute the largest eigenvalue of the symmetric matrix A, which is given implicitly through a function Afun()^7 that applies A to vectors v. Use the convergence rate to estimate the magnitude of the next-largest eigenvalue. Report results for n = 10, 50, where v ∈ R^{n^2} and A ∈ R^{n^2 × n^2}.
   (b) Compare your result for the largest two eigenvalues with what you obtain with an available eigenvalue solver.^8

   ^7 Download the MATLAB function from http://cims.nyu.edu/~stadler/num1/material/Afun.m.
   ^8 In MATLAB, eigs finds the largest eigenvalues of a matrix that is given through its application to vectors.
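   As a starting point for part (a), here is a minimal MATLAB sketch of the power iteration. It assumes that the downloaded Afun takes the vector as its only argument (adjust the call if the provided function expects additional arguments); the tolerance, the iteration limit, and the choice n = 10 are illustrative only.

       % Power method sketch for problem 6(a); only the action v -> Afun(v) is used.
       n = 10;                           % v lives in R^(n^2)
       v = rand(n^2, 1);  v = v / norm(v);
       lam = 0;  tol = 1e-8;  maxit = 1000;
       for k = 1:maxit
           w = Afun(v);                  % apply A without ever forming it
           lamold = lam;
           lam = v' * w;                 % Rayleigh quotient estimate (A is symmetric)
           v = w / norm(w);
           if abs(lam - lamold) <= tol * abs(lam)
               break;
           end
       end
       fprintf('largest eigenvalue approx. %g after %d iterations\n', lam, k);

   The linear convergence of this iteration is governed by the ratio |λ_2|/|λ_1| of the two largest eigenvalue magnitudes, which is what part (a) asks you to exploit.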
7. [Stability of eigenvalues, 2+1pt] Let A_ε be the family of matrices

             | λ  1             |
             |    λ  1          |
       A_ε = |       ⋱  ⋱       |  ∈ R^{n×n},
             |          λ  1    |
             | ε             λ  |

   i.e., λ on the diagonal, ones on the first superdiagonal, the entry ε in the lower-left corner, and zeros elsewhere. Obviously, A_0 has λ as its only eigenvalue, with multiplicity n.

   (a) Show that for ε > 0, A_ε has n distinct eigenvalues, given by

       λ_{ε,k} = λ + ε^{1/n} exp(2πik/n),   k = 0, ..., n − 1,

   and thus that |λ − λ_{ε,k}| = ε^{1/n}.
   (b) Based on the above result, what accuracy can be expected for the eigenvalues of A_0 when the machine epsilon is 10^{-16}?
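   Although part (a) asks for an analytical derivation, the result is easy to check numerically. The following MATLAB sketch is illustrative only; the values λ = 2, n = 8, and ε = 10^{-8} are arbitrary choices.

       % Numerical check of problem 7(a): eigenvalues of the perturbed Jordan block.
       lambda = 2;  n = 8;  epsil = 1e-8;             % sample values, chosen arbitrarily
       Aeps = lambda * eye(n) + diag(ones(n-1,1), 1); % lambda on the diagonal, ones above it
       Aeps(n, 1) = epsil;                            % perturbation in the lower-left corner
       mu = eig(Aeps);                                % computed eigenvalues
       k = (0:n-1)';
       pred = lambda + epsil^(1/n) * exp(2i*pi*k/n);  % formula from part (a)
       err = arrayfun(@(p) min(abs(mu - p)), pred);   % distance to nearest computed eigenvalue
       fprintf('radius %.2e, max mismatch %.2e\n', epsil^(1/n), max(err));

   With these sample values ε^{1/n} = 10^{-1}, so the eigenvalues move a distance 0.1 away from λ even though the perturbation itself has size only 10^{-8}; this is exactly the sensitivity that part (b) asks about.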