VV471 — Introduction to Numerical Methods
Project
Ailin & Manuel — UM-JI (Summer 2022)

Goals of the project
• Perform research beyond what is taught
• Select content and clearly organise ideas
• Orally present results in an appealing way
• Be critical of your results

1 Instructions

1.1 Format

Each of you must be in one and only one 3-member team for the project. The project will be graded according to the following three aspects:
1. Slide presentation of your work
2. Clarity of the explanations of your work
3. Demonstration of the accuracy and efficiency of your algorithm
Each of these three aspects carries equal weight. Each member of the same team will receive the same project mark.

1.2 Due date

The oral presentation and testing of your algorithm are due in week 12. The exact time for the presentation and testing will be released in week 11.

1.3 Expectations

The following two sections present topics relevant to getting your own research started on how to solve the problems stated in Section 4. They are by no means the algorithms, or the only algorithms, that you need to solve the problems. We expect you to read academic articles as well as textbooks.

We expect your presentation to summarise the important aspects of your research on the problem effectively within 15 minutes; it will be followed by a 3-5 minute question-and-answer period. We do not expect you to present every aspect of what you did during the research process. However, we expect you as a group to know your work well enough to explain it if we choose to ask a question about a different aspect of your solution.

After your presentation, we expect to see your algorithm in action. You will demonstrate your algorithm on your own machine. We expect accuracy and efficiency. Each group that achieves a certain accuracy will be ranked according to the average running time of its algorithm over 100 trials.

We expect each student to put at least 2 hours of work per week into the project.

2 Multidimensional root-finding

In this part, we consider the problem of solving a system of n equations in n unknowns,

f_1(x_1, x_2, ..., x_n) = 0,
f_2(x_1, x_2, ..., x_n) = 0,
⋮
f_n(x_1, x_2, ..., x_n) = 0.

For simplicity, we express this system of nonlinear equations in vector form, F(x) = 0, where F : D ⊆ R^n → R^n is a vector-valued function of n variables represented by the vectors

x = (x_1, x_2, ..., x_n)^T and F = (f_1, f_2, ..., f_n)^T.

In general, the component functions f_1, f_2, ..., f_n are functions of x_1, x_2, ..., x_n.

2.1 Limit, Continuity and Differentiability in R^n

Recall that the notion of limit is easily generalised to vector-valued functions and functions of several variables. Given a function f : D ⊆ R^n → R and a point x_0 ∈ D, we write lim_{x→x_0} f(x) = L if, for any ϵ > 0, there exists a δ > 0 such that

|f(x) − L| < ϵ whenever x ∈ D and 0 < ∥x − x_0∥ < δ,

where ∥·∥ denotes any appropriate norm on R^n.

The function f is said to be continuous at x_0 ∈ D if lim_{x→x_0} f(x) = f(x_0). Furthermore, f is said to be differentiable at x_0 ∈ D if there is a constant vector a such that

lim_{x→x_0} [ f(x) − f(x_0) − a^T(x − x_0) ] / ∥x − x_0∥ = 0.

Similarly, for F : D ⊆ R^n → R^n and x_0 ∈ D, we write lim_{x→x_0} F(x) = L if and only if lim_{x→x_0} f_i(x) = L_i for all i = 1, 2, ..., n, and F is said to be continuous at x_0 ∈ D if lim_{x→x_0} F(x) = F(x_0). Furthermore, F is said to be differentiable at x_0 if there exists a linear transformation T : R^n → R^n such that

lim_{x→x_0} ∥F(x) − F(x_0) − T(x − x_0)∥ / ∥x − x_0∥ = 0,

where the matrix associated with the linear transformation T is the matrix of partial derivatives, namely the Jacobian matrix J.
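To make the vector notation concrete, the following minimal MATLAB sketch shows one way a small system F(x) = 0 and its Jacobian J_F could be represented; the two component equations here are made up purely for illustration and are not part of the project.

% Hypothetical 2-by-2 nonlinear system, used only to illustrate the notation:
%   f1(x) = x1^2 + x2^2 - 1 = 0
%   f2(x) = x1 - x2^3       = 0
F = @(x) [x(1)^2 + x(2)^2 - 1;
          x(1) - x(2)^3];

% Jacobian matrix J_F(x): entry (i,j) is the partial derivative of f_i
% with respect to x_j.
JF = @(x) [2*x(1),  2*x(2);
           1,      -3*x(2)^2];

% Evaluate both at a sample point.
x0 = [0.5; 0.5];
disp(F(x0))
disp(JF(x0))

These two handles are reused in the Newton and Broyden sketches later in this document.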
2.2 Fixed-point iteration

Now suppose we have a function of the form F(x) = G(x) − x, where G : D ⊆ R^n → R^n. In class we discussed why fixed-point iteration can be used to find the root of a single function of the form f(x) = g(x) − x, where g is Lipschitz continuous with Lipschitz constant c ∈ [0, 1). In the follow-up assignment, we learned that if the function g is instead continuously differentiable and maps an interval I into itself, then g has a fixed point in I; furthermore, if there is a constant c < 1 such that |g′(x)| ≤ c for all x ∈ I, then the fixed-point iteration converges to the fixed point.

The existence and uniqueness of fixed points of vector-valued functions of several variables, and the corresponding convergence result, are in line with the single-variable case. Below are the analogues of the two theorems in higher dimensions.

Let D ⊆ R^n be closed and G : D → D be Lipschitz continuous, that is, there is a constant c such that

∥G(x_1) − G(x_2)∥ ≤ c ∥x_1 − x_2∥ for all x_1, x_2 ∈ D.

If this c, which is known as a Lipschitz constant, satisfies 0 ≤ c < 1, then the function G has a unique fixed point x* ∈ D, and the sequence defined by x_0 ∈ D and x_k = G(x_{k−1}) for k = 1, 2, 3, ... converges to x*.

Let D ⊆ R^n be closed and G : D → D be continuously differentiable. If, furthermore, there exist a constant c < 1 and a matrix norm ∥·∥ such that ∥J_G(x)∥ ≤ c for all x ∈ D, then G has a unique fixed point x* in D, and fixed-point iteration is guaranteed to converge to x* for any initial point x_0 ∈ D.

2.3 Newton's method

We have learned in class that Newton's method can achieve quadratic convergence. Recall that this method uses the iteration formula

x_{k+1} = x_k − f(x_k)/f′(x_k), k = 0, 1, 2, ...,

where x_0 is an initial point. Here we extend Newton's method to a system of nonlinear equations. Let us define

G(x) = x − J_F(x)^{-1} F(x),

where J_F(x) is the Jacobian matrix of F evaluated at x. Similarly to what we did in class for the single-variable case, in order to show quadratic convergence, all we need to show is that the derivative, here the Jacobian of G, is zero at the root x = x*, given that F has an appropriate level of smoothness. If we write x = (x_1, ..., x_n)^T, F = (f_1, ..., f_n)^T and G = (g_1, ..., g_n)^T, where f_i and g_i are the component functions of F and G (all functions of x_1, ..., x_n), then we have

∂g_i/∂x_k = ∂/∂x_k ( x_i − Σ_{j=1}^{n} a_{ij}(x) f_j(x) )
          = δ_{ik} − Σ_{j=1}^{n} a_{ij}(x) ∂f_j/∂x_k − Σ_{j=1}^{n} (∂a_{ij}/∂x_k) f_j(x),

where a_{ij} is the ij-th element of J_F(x)^{-1}. Note that the first summation equals δ_{ik}: since ∂f_j/∂x_k is the jk-th element of J_F(x) by definition, that summation is the ik-th element of J_F(x)^{-1} J_F(x) = I. The second summation vanishes at x*, since F(x*) = 0. Hence we reach the conclusion

∇g_i(x*) = 0 for all i = 1, 2, ..., n, that is, J_G(x*) = 0.

Since J_G is the zero matrix at x*, and G is continuously differentiable at x* for a sufficiently smooth F with an invertible J_F(x*), there is a constant c < 1 and a norm ∥·∥ such that ∥J_G(x)∥ ≤ c for all x in some neighbourhood of x*. This shows that Newton's method in this setting can achieve quadratic convergence near the root x*.

To apply the multidimensional version of Newton's method, we choose an initial point x_0 in the hope that it is sufficiently close to x*. Then we iterate as follows:

y_k = −F(x_k),
w_k = J_F(x_k)^{-1} y_k,
x_{k+1} = x_k + w_k,

where the vector w_k is computed by solving the linear system J_F(x_k) w_k = y_k using a method such as Gaussian elimination with back substitution.

The above scheme can be applied to a function F : R^n → R^m with only a slight modification to how we solve for w_k. When m < n, we use the pseudoinverse, with which we can find one of potentially many roots of F. When m > n, we can try the least-squares solution, but the convergence of this scheme becomes doubtful.
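As a concrete illustration of the iteration above, here is a minimal MATLAB sketch of Newton's method for a square system. The handles F and JF, the starting point x0, the tolerance and the iteration cap are placeholders supplied by the caller; for brevity the linear solve is written with the backslash operator, which the project rules ask you to replace with your own solver (e.g. Gaussian elimination with back substitution).

function x = newton_system(F, JF, x0, tol, maxit)
% Minimal sketch of Newton's method for F(x) = 0 with F: R^n -> R^n.
%   F   - handle returning an n-by-1 vector
%   JF  - handle returning the n-by-n Jacobian matrix
%   x0  - initial point (n-by-1)
    x = x0;
    for k = 1:maxit
        y = -F(x);           % y_k = -F(x_k)
        w = JF(x) \ y;       % solve J_F(x_k) w_k = y_k (replace with your own solver)
        x = x + w;           % x_{k+1} = x_k + w_k
        if norm(w) < tol     % stop once the Newton step is small enough
            return;
        end
    end
end

With the hypothetical F and JF defined earlier, a call might look like x = newton_system(F, JF, [0.5; 0.5], 1e-12, 50).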
2.4 Quasi-Newton methods

When n becomes large, Newton's method becomes very expensive because of the computation of the Jacobian matrix J_F(x_k) in each iteration, and the computation of w_k by solving

J_F(x_k) w_k = y_k.

Furthermore, no information can be carried from one iteration to the next, since the Jacobian matrix changes at every iteration. An alternative is to modify Newton's method by using approximate derivatives, as in the secant method for a single nonlinear equation. The secant method uses the approximation

f′(x_1) ≈ (f(x_1) − f(x_0)) / (x_1 − x_0) =: A_1

as a replacement for f′(x_1) in the single-variable Newton's method in the first iteration. Note that the approximation A_1 is a scalar such that

A_1 (x_1 − x_0) = f(x_1) − f(x_0).

Along this line of thinking, we replace the Jacobian matrix J_F(x_1) by a matrix A_1 that satisfies

A_1 (x_1 − x_0) = F(x_1) − F(x_0).

Any nonzero vector in R^n can be written as the sum of a scalar multiple of x_1 − x_0 and a vector in the orthogonal complement of x_1 − x_0. So, to define the matrix A_1 uniquely, we also need to specify how A_1 maps the orthogonal complement of x_1 − x_0. However, no information is available about the change of F in a direction orthogonal to x_1 − x_0, so we specify that there is no update in those directions, that is,

A_1 z = J_F(x_0) z whenever (x_1 − x_0)^T z = 0.

It is clear that the following matrix satisfies both conditions:

A_1 = J_F(x_0) + [ F(x_1) − F(x_0) − J_F(x_0)(x_1 − x_0) ] (x_1 − x_0)^T / ∥x_1 − x_0∥_2^2.

So we use this A_1 in place of J_F(x_1) to determine x_2, then A_2 to determine x_3, and so on. In general,

s_k = x_k − x_{k−1},
y_k = F(x_k) − F(x_{k−1}),
A_k = A_{k−1} + (y_k − A_{k−1} s_k) s_k^T / ∥s_k∥_2^2,
w_k = A_k^{-1} F(x_k),
x_{k+1} = x_k − w_k.

However, simply approximating J_F(x_k) by A_k does not reduce the cost by much, because we still incur the same cost of solving the linear system A_k w_k = F(x_k). To truly cut the cost, we have to find a way to update the inverse of A_k efficiently from previously available information. This can be done by employing a matrix inversion formula known as the Sherman-Morrison formula:

(B + a b^T)^{-1} = B^{-1} − (B^{-1} a b^T B^{-1}) / (1 + b^T B^{-1} a),

where B is invertible and 1 + b^T B^{-1} a ≠ 0. So, by letting

B = A_{k−1}, a = (y_k − A_{k−1} s_k) / ∥s_k∥_2^2 and b = s_k,

the Sherman-Morrison formula gives the following update for the inverse:

A_k^{-1} = A_{k−1}^{-1} + [ (s_k − A_{k−1}^{-1} y_k) s_k^T A_{k−1}^{-1} ] / ( s_k^T A_{k−1}^{-1} y_k ).

Combining the above approximation of the Jacobian matrix with this inversion formula gives what is known as Broyden's method. Note that it is not necessary to compute the Jacobian approximation A_k itself for k ≥ 1; A_k^{-1} is computed directly.
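The following minimal MATLAB sketch implements Broyden's method with the Sherman-Morrison update for the inverse. The handles F and JF, the starting point x0, the tolerance and the iteration cap are assumptions supplied by the caller; the initial inverse is taken as J_F(x_0)^{-1}, and in your own code the call to inv should be replaced by a routine you implement yourself, as the project rules require.

function x = broyden(F, JF, x0, tol, maxit)
% Minimal sketch of Broyden's method for F(x) = 0 with F: R^n -> R^n.
% The inverse of the Jacobian approximation is updated directly via the
% Sherman-Morrison formula, so no linear system is solved per iteration.
    x = x0;
    B = inv(JF(x0));         % B ~ A_0^{-1}; replace inv with your own routine
    Fx = F(x);
    for k = 1:maxit
        w = B * Fx;          % quasi-Newton step using the current inverse
        xnew = x - w;        % x_{k+1} = x_k - w_k
        s = xnew - x;        % s_k = x_k - x_{k-1} (here: step just taken)
        Fnew = F(xnew);
        y = Fnew - Fx;       % y_k = F(x_k) - F(x_{k-1})
        By = B * y;
        % Sherman-Morrison update: B <- B + (s - B*y) s^T B / (s^T B y)
        B = B + ((s - By) * (s' * B)) / (s' * By);
        x = xnew;
        Fx = Fnew;
        if norm(s) < tol
            return;
        end
    end
end

Compared with the Newton sketch above, each iteration costs only matrix-vector products and a rank-one update, i.e. O(n^2) operations, instead of an O(n^3) linear solve.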
3 Optimisation

3.1 Introduction

In this part, we consider the general problem of finding the minimum of f : R^n → R, with or without constraints of the form

g(x) = 0 and/or h(x) ≥ 0, where x ∈ R^n.

3.2 Nonsmoothness

If the function f is sufficiently smooth and convex, then in the absence of any constraint we can turn the optimisation problem into a root-finding problem by finding x* such that ∇f(x*) = 0, so our discussion of Newton's method and quasi-Newton methods applies with little or no modification. In fact, finding a minimum of a function of many variables is easier than multidimensional root-finding, because the component functions of an n-dimensional gradient vector are not completely arbitrary functions; rather, they obey so-called integrability conditions. Those methods are therefore usually fast for well-behaved functions, but they are still not very robust: if the function is not sufficiently smooth or the initial guess is not sufficiently close to the minimum, they may converge very slowly or even diverge.

The downhill simplex method is an alternative that makes no assumptions about f and requires no derivatives. It is less efficient, but it is very easy to understand and implement. We cover it here so that we have a fallback method for functions that are too problematic for Newton and quasi-Newton methods to handle. In many cases, the downhill simplex method is faster than simulated annealing but usually slower than Newton and quasi-Newton methods.

The method is based on the geometric object known as a simplex. To define a simplex properly, consider the following definitions. A linear combination of x_1, x_2, ..., x_n,

Σ_{i=1}^{n} α_i x_i = α_1 x_1 + α_2 x_2 + ··· + α_n x_n,

is known as a convex combination if all coefficients are non-negative and sum to 1, that is,

Σ_{i=1}^{n} α_i = 1 and α_i ≥ 0 for all i = 1, ..., n.

The convex hull of a set S ⊆ R^n is the set of all convex combinations of points x_i ∈ S. A simplex is a geometric object in R^n that is the convex hull of n + 1 points in R^n, which are known as the vertices of the simplex. So a simplex in R is a line segment, a simplex in R^2 is a triangle, a simplex in R^3 is a tetrahedron, and so on.

The downhill simplex method iteratively generates a sequence of smaller and smaller simplexes in R^n to enclose the minimum point x*. At each iteration, the vertices {x_j}_{j=1}^{n+1} of the simplex are ordered according to their f(x) values,

f(x_1) ≤ f(x_2) ≤ ··· ≤ f(x_{n+1}).

The point x_1 is referred to as the best vertex, and x_{n+1} as the worst vertex. The downhill simplex method uses four possible operations to iterate from one simplex to the next, each associated with a scalar parameter:
• reflection: α > 0
• expansion: β > 1
• contraction: 0 < γ < 1
• shrink: 0 < δ < 1
The operations also use the centroid of the n vertices other than the worst vertex,

x̄ = (1/n) Σ_{j=1}^{n} x_j.

The downhill simplex method iterates by the following scheme (a MATLAB sketch of the whole scheme is given after the list):

1. Sort: Evaluate f at all n + 1 vertices of the current simplex and sort the vertices so that f(x_1) ≤ f(x_2) ≤ ··· ≤ f(x_{n+1}).

2. Reflection: Compute the reflection point x_r = x̄ + α(x̄ − x_{n+1}). # Since x_{n+1} is the worst vertex, we want to move away from it. This so-called reflection point is just a test point along the line defined by x̄ and x_{n+1}, away from the point x_{n+1}.

3. Conditional execution: If f(x_1) ≤ f(x_r) < f(x_n), then replace the worst vertex x_{n+1} with x_r; next iteration. If f(x_r) < f(x_1), then do the Expansion step. If f(x_n) ≤ f(x_r) < f(x_{n+1}), then jump to the Outside contraction step. If f(x_{n+1}) ≤ f(x_r), then jump to the Inside contraction step.

4. Expansion: Compute the expansion point x_e = x̄ + β(x_r − x̄). # If moving along the line defined by x̄ and x_{n+1}, away from the point x_{n+1}, gives a new best vertex, perhaps the minimum is just farther away in that region. So we push further along this direction.

5. Conditional execution: If f(x_e) < f(x_r), then replace the worst vertex x_{n+1} with x_e; else replace x_{n+1} with x_r. Next iteration.

6. Outside contraction: Compute the outside contraction point x_oc = x̄ + γ(x_r − x̄). # If the test point x_r is almost as bad as the worst vertex, then we contract along this direction to see whether we have pushed too far in this direction.

7. Conditional execution: If f(x_oc) ≤ f(x_r), then replace the worst vertex x_{n+1} with x_oc; next iteration. Else do the Shrink step.

8. Inside contraction: Compute the inside contraction point x_ic = x̄ − γ(x_r − x̄). # If the test point x_r is even worse than the worst vertex, then we contract along this direction so as to stay on the same side as x_{n+1}.

9. Conditional execution: If f(x_ic) < f(x_{n+1}), then replace the worst vertex x_{n+1} with x_ic; next iteration. Else do the Shrink step.

10. Shrink: Shrink the simplex toward the best vertex by assigning, for 2 ≤ j ≤ n + 1, x_j = x_1 + δ(x_j − x_1). Next iteration. # If moving away from the worst vertex x_{n+1} alone is not useful, we have to consider moving towards the best vertex first before moving away from the worst vertex again. So this is a back-and-forth process, like the movement of an amoeba.

Even when f is sufficiently smooth and the gradient or Hessian is not expensive to evaluate, the downhill simplex method can still provide good initial guidance before Newton's method and quasi-Newton methods are employed to refine the solution once we are sufficiently close.
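Below is the MATLAB sketch of the scheme referred to above. The objective handle f, the initial simplex X (an n-by-(n+1) matrix whose columns are the vertices) and the iteration count are assumptions supplied by the caller; the common parameter choices α = 1, β = 2, γ = 0.5 and δ = 0.5 are hard-coded for simplicity.

function [xbest, fbest] = downhill_simplex(f, X, maxit)
% Minimal sketch of the downhill simplex method described above.
%   f  - function handle, f: R^n -> R
%   X  - n-by-(n+1) matrix whose columns are the initial vertices
    alpha = 1; beta = 2; gamma = 0.5; delta = 0.5;   % common parameter choices
    n = size(X, 1);
    for it = 1:maxit
        % 1. Sort vertices by objective value (best first, worst last).
        fv = zeros(1, n + 1);
        for j = 1:n+1, fv(j) = f(X(:, j)); end
        [fv, order] = sort(fv);
        X = X(:, order);
        xbar = mean(X(:, 1:n), 2);                   % centroid of the n best vertices
        % 2. Reflection.
        xr = xbar + alpha * (xbar - X(:, n+1));
        fr = f(xr);
        if fv(1) <= fr && fr < fv(n)
            X(:, n+1) = xr;                          % 3. accept the reflection point
        elseif fr < fv(1)
            xe = xbar + beta * (xr - xbar);          % 4. Expansion.
            if f(xe) < fr, X(:, n+1) = xe; else, X(:, n+1) = xr; end
        elseif fr < fv(n+1)
            xoc = xbar + gamma * (xr - xbar);        % 6. Outside contraction.
            if f(xoc) <= fr
                X(:, n+1) = xoc;
            else                                     % 10. Shrink toward the best vertex.
                X(:, 2:n+1) = X(:, 1) + delta * (X(:, 2:n+1) - X(:, 1));
            end
        else
            xic = xbar - gamma * (xr - xbar);        % 8. Inside contraction.
            if f(xic) < fv(n+1)
                X(:, n+1) = xic;
            else                                     % 10. Shrink toward the best vertex.
                X(:, 2:n+1) = X(:, 1) + delta * (X(:, 2:n+1) - X(:, 1));
            end
        end
    end
    xbest = X(:, 1);
    fbest = f(xbest);
end

In practice, a termination test, for example stopping once f(x_{n+1}) − f(x_1) or the simplex diameter falls below a tolerance, would replace the fixed iteration count.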
4 Tasks

The following are the actual tasks for this project.

4.1 Task 1

Let x, θ ∈ R^5, y ∈ {0, 1} and h_θ(x) = g(θ^T x), where g(z) = 1/(1 + e^{−z}). Solve arg min_θ J(θ), where

J(θ) = (1/n) Σ_{i=1}^{n} [ −y_i ln h_θ(x_i) − (1 − y_i) ln(1 − h_θ(x_i)) ].

The sets {y_i} and {x_i} are provided in text files on Canvas.

4.2 Task 2

Suppose

π(x) = exp( Σ_{j=0}^{n} λ_j x^j ) for x ∈ (0, 1).

Solve

arg min_λ ( ∫_0^1 π(x) dx − Σ_{i=0}^{n} λ_i m_i ).

The ordered set {m_0, m_1, m_2, ..., m_n} is provided in the text file m1.txt on Canvas.

4.3 Task 3

Discuss whether the following integral is solvable or not. If so, proceed to solve it; otherwise, prove that it cannot be solved.

2 ∫_0^1 ··· ∫_0^1 ∏_{i<j} ( (u_i − u_j)/(u_i + u_j) )^2 du_1/u_1 ··· du_5/u_5

4.4 Reminder

The material provided with this document should serve as a starting point for your research. Higher marks will be given to groups that go beyond what has been given here and covered in class. You need to find the right numerical methods and implement them correctly by writing your own MATLAB functions from scratch to complete these tasks. You are not allowed to use high-level MATLAB functions to avoid implementing key algorithms in your solution. For example, high-level functions including, but not limited to, the following are not allowed: mldivide, qr, svd, vpasolve, integral, fminbnd.
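For concreteness, here is a minimal MATLAB sketch of evaluating the Task 1 objective J(θ) and its gradient. The variable names X (an n-by-5 matrix whose rows are the x_i) and y (an n-by-1 vector of labels) are assumptions about how you might store the data from the Canvas files; how you load the files and which minimisation method you apply to J are up to you.

function [J, grad] = logistic_cost(theta, X, y)
% Objective of Task 1 and its gradient.
%   theta - 5-by-1 parameter vector
%   X     - n-by-5 matrix, one sample x_i per row (assumed layout)
%   y     - n-by-1 vector of labels in {0, 1}
    n = size(X, 1);
    h = 1 ./ (1 + exp(-X * theta));                      % h_theta(x_i) for all i
    J = -(1 / n) * sum(y .* log(h) + (1 - y) .* log(1 - h));
    grad = (1 / n) * (X' * (h - y));                     % gradient of J w.r.t. theta
end

The gradient is included because gradient-based methods (Section 3.2) need it; the downhill simplex method would use only the value J.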