COMP 558 lecture 18 Nov. 15, 2010

Least squares

We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to solve them.

Version 1: Given an m × n matrix A, where m > n, find a unit length vector x that minimizes ‖Ax‖.

Here, and in the rest of this lecture, the norm ‖·‖ is the L2 norm. Note that minimizing the L2 norm is equivalent to minimizing the sum of squares of the elements of the vector Ax, since the L2 norm is just the square root of that sum of squares. Also, the reason we restrict the minimization to x of unit length is that, without this restriction, the minimum is achieved at x = 0, which is uninteresting.

We can solve this constrained least squares problem using Lagrange multipliers, by finding the x that minimizes

    x^T A^T A x + λ (x^T x − 1).     (*)

Taking the derivative of (*) with respect to λ and setting it to 0 enforces that x has unit length. Taking derivatives with respect to the components of x and setting them to 0 gives A^T A x + λ x = 0, which says that x is an eigenvector of A^T A. So which eigenvector gives the least value of x^T A^T A x? Since x^T A^T A x ≥ 0 for every x, the eigenvalues of A^T A are all greater than or equal to 0. Thus, to minimize ‖Ax‖, we take the (unit) eigenvector of A^T A with the smallest eigenvalue, and we are sure this eigenvalue is non-negative.

Version 2: Given an m × n matrix A with m > n, and given an m-vector b, minimize ‖Ax − b‖.

If b = 0 then we have the same problem as above, so let's assume b ≠ 0. We don't need Lagrange multipliers in this case, since the trivial solution x = 0 is no longer a minimizer, so there is nothing to avoid. Instead, we expand

    ‖Ax − b‖² = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b

and take the partial derivatives with respect to the components of x and set them to 0. This gives 2 A^T A x − 2 A^T b = 0, or

    A^T A x = A^T b     (1)

which are called the normal equations. Assume the m × n matrix A has full column rank n, i.e. its columns are linearly independent, so that A^T A is invertible. (That full column rank implies A^T A is invertible is not so obvious, but true.) Then the solution is

    x = (A^T A)^{-1} A^T b.

What is the geometric interpretation of Eq. (1)? Since b is an m-dimensional vector and A is an m × n matrix, we can uniquely write b as the sum of a vector in the column space of A and a vector orthogonal to the column space of A. To minimize ‖Ax − b‖ we, by definition, find the x such that the distance from Ax to b is as small as possible. This is done by choosing x such that Ax is the component of b that lies in the column space of A, that is, Ax is the orthogonal projection of b onto the column space of A. Note that if b already belonged to the column space of A, then the "error" ‖Ax − b‖ would be 0 and there would be an exact solution.

Pseudoinverse of A

In the above problem, we define the pseudoinverse of A to be the n × m matrix

    A^+ ≡ (A^T A)^{-1} A^T.

Recall we are assuming m > n, so that A maps from a lower dimensional space to a higher dimensional one. (If m = n, then A^+ = A^{-1}.) Note that A^+ A = I. Moreover, A A^+ = A (A^T A)^{-1} A^T, and A A^+ projects any vector b ∈ ℝ^m onto the column space of A; that is, it removes from b the component that is perpendicular to the column space of A. This is illustrated in the figure below. Note that the plane in the figure on the right contains the origin, since it includes Ax = 0 for x = 0.

[Figure: A (m × n) maps x ∈ ℝ^n to Ax ∈ ℝ^m; the pseudoinverse A^+ (n × m) maps b ∈ ℝ^m back to x = A^+ b ∈ ℝ^n.]

The pseudoinverse maps in the reverse direction of A, namely it maps b in an m-dimensional space to an x in an n-dimensional space. Rather than inverting the mapping A, however, it only "inverts" the component of b that belongs to the column space of A, i.e. A^+ A = I. It ignores the component of b that is orthogonal to the column space of A.
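To make the two versions concrete, here is a minimal NumPy sketch; the random A and b are illustrative choices, not data from the lecture. It solves version 1 with the smallest eigenvector of A^T A, solves version 2 via the normal equations, and checks that A^+ A = I and that A A^+ b is the orthogonal projection of b onto the column space of A.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 6, 3
    A = rng.standard_normal((m, n))      # arbitrary full-column-rank example
    b = rng.standard_normal(m)

    # Version 1: minimize ||Ax|| over unit vectors x.
    # Take the eigenvector of A^T A with the smallest eigenvalue.
    evals, evecs = np.linalg.eigh(A.T @ A)    # symmetric; eigenvalues ascending
    x1 = evecs[:, 0]                          # unit eigenvector, smallest eigenvalue

    # Version 2: minimize ||Ax - b|| via the normal equations A^T A x = A^T b.
    x2 = np.linalg.solve(A.T @ A, A.T @ b)

    # Pseudoinverse A+ = (A^T A)^{-1} A^T; then x2 = A+ b.
    A_plus = np.linalg.solve(A.T @ A, A.T)    # n x m
    print(np.allclose(A_plus @ A, np.eye(n))) # A+ A = I
    print(np.allclose(A_plus @ b, x2))

    # A A+ b is the orthogonal projection of b onto the column space of A:
    # the residual b - A A+ b is perpendicular to every column of A.
    proj = A @ (A_plus @ b)
    print(np.allclose(A.T @ (b - proj), 0))

In numerical practice one would typically call numpy.linalg.lstsq or numpy.linalg.pinv rather than forming A^T A explicitly, since forming A^T A squares the condition number; the explicit version above just mirrors the formulas in the text.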
Non-linear least squares (Gauss-Newton method)

Let's look at a common application of version 2 above. Suppose we have m differentiable functions f_i(x) of x ∈ ℝ^n, which together define a map f(x) from ℝ^n to ℝ^m. The problem we consider now is: given an initial value x_0 ∈ ℝ^n, find a nearby x that minimizes ‖f(x)‖. Note that this problem is not well defined, in the sense that "nearby" is not well defined. Nonetheless, it is worth considering problems for which one can seek to improve the solution from some initial x_0.

Consider a linear approximation of the m-vector of functions f(x) in the neighborhood of x_0, so that we are trying to minimize

    ‖ f(x_0) + (∂f/∂x)(x − x_0) ‖,

where the Jacobian ∂f/∂x is evaluated at x_0. It is an m × n matrix. This new linearized minimization problem is of the form we saw above in version 2, with A = ∂f/∂x and b = (∂f/∂x) x_0 − f(x_0).

Since we are using a linear approximation to the functions f(x), we do not expect to minimize ‖f(x)‖ exactly in one step. Instead, we iterate

    x^(k+1) ← x^(k) + Δx,

where Δx is the minimizer of the linearized problem at x^(k). At each step, we evaluate the Jacobian at the new point x^(k). (A small numerical sketch of this iteration is given after the examples below.)

Examples

• To fit a line y = mx + b to a set of points (x_i, y_i), we found the m and b that minimize Σ_i (y_i − m x_i − b)². This is a linear least squares problem (version 2).

• To fit a line x cos θ + y sin θ = r to a set of points (x_i, y_i), we found the θ and r that minimize Σ_i (a x_i + b y_i − r)², subject to the constraint a² + b² = 1, where a = cos θ and b = sin θ. As we saw in lecture 15, this reduces to solving for the (a, b) that minimize Σ_i (a(x_i − x̄) + b(y_i − ȳ))², which is the version 1 least squares problem. Once we have θ, we can solve for r (recall lecture 15): r = x̄ cos θ + ȳ sin θ.

• Vanishing point detection (lecture 15). It is of the version 2 form.

• Recall the image registration problem, where we wanted the (h_x, h_y) that minimizes

    Σ_{(x,y) ∈ Ngd(x_0, y_0)} { I(x + h_x, y + h_y) − J(x, y) }²,

where the sum is over a neighborhood Ngd(x_0, y_0) of the point (x_0, y_0). This is a non-linear least squares problem, since (h_x, h_y) are parameters of I(x + h_x, y + h_y). To solve the problem, we took a first order Taylor series expansion of I(x + h_x, y + h_y) and thereby turned it into a linear least squares problem. Our initial estimate of (h_x, h_y) was (0, 0). We then iterated to try to find a better solution. Note that we required A to be of rank 2, so that A^T A is invertible. This condition says that the second moment matrix M needs to be invertible.
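Here is a small NumPy sketch of the Gauss-Newton iteration described above. The particular residual function (fitting a circle's centre and radius to 2-D points) and the helper gauss_newton are illustrative choices of my own, not examples from the lecture; convergence also depends on a reasonable initial guess.

    import numpy as np

    def gauss_newton(f, jac, x0, num_iters=20):
        """Iterate x <- x + dx, where dx solves the linearized least squares
        problem J^T J dx = -J^T f(x), with J the Jacobian of f at x."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_iters):
            r = f(x)                   # m-vector of residuals at the current x
            J = jac(x)                 # m x n Jacobian at the current x
            dx = np.linalg.solve(J.T @ J, -J.T @ r)
            x = x + dx
        return x

    # Toy problem: points roughly on a circle of centre (1, 2) and radius 3.
    rng = np.random.default_rng(1)
    theta = rng.uniform(0, 2 * np.pi, 50)
    pts = np.stack([1 + 3 * np.cos(theta), 2 + 3 * np.sin(theta)], axis=1)
    pts += 0.05 * rng.standard_normal(pts.shape)

    def f(p):                          # residuals: distance to centre minus radius
        cx, cy, r = p
        return np.hypot(pts[:, 0] - cx, pts[:, 1] - cy) - r

    def jac(p):                        # analytic Jacobian of the residuals
        cx, cy, r = p
        d = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)
        return np.stack([(cx - pts[:, 0]) / d,
                         (cy - pts[:, 1]) / d,
                         -np.ones_like(d)], axis=1)

    print(gauss_newton(f, jac, x0=[0.0, 0.0, 1.0]))   # approaches (1, 2, 3)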
SVD (Singular Value Decomposition)

In the version 1 least squares problem, we needed to find the eigenvector of A^T A with the smallest eigenvalue. In the following, we use the eigenvectors and eigenvalues of A^T A to decompose A into a product of simple matrices. This singular value decomposition is a very heavily used tool in data analysis in many fields. Strangely, it is not taught in most introductory linear algebra courses. For this reason, I give you the derivation.

Let A be an m × n matrix with m ≥ n. The first step is to note that the n × n matrix A^T A is symmetric and positive semi-definite; that's easy to see since x^T A^T A x ≥ 0 for any real x. So A^T A has an orthonormal set of eigenvectors, and the eigenvalues are all real and non-negative. Let V be an n × n matrix whose columns are the orthonormal eigenvectors of A^T A. Since the eigenvalues are non-negative, we can write them as σ_i². (Forgive me for using the symbol σ yet again, but σ is always used in the SVD.) Define Σ to be the n × n diagonal matrix with values Σ_ii = σ_i on the diagonal. The elements σ_i are called the singular values of A. Note that they are the square roots of the eigenvalues of A^T A. Also note that Σ² is an n × n diagonal matrix, and A^T A V = V Σ².

Next define

    Ũ_{m×n} ≡ A_{m×n} V_{n×n}.     (2)

Then Ũ^T Ũ = V^T A^T A V = Σ². Thus, the n columns of Ũ are orthogonal, and the i-th column has length σ_i. We now define an m × n matrix U whose columns are orthonormal (length 1), so that U Σ = Ũ. (Divide the i-th column of Ũ by σ_i; if some σ_i = 0, take any unit vector orthogonal to the other columns.) Substituting into Eq. (2) and right multiplying by V^T gives us

    A_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n}.

Thus, any m × n matrix A can be decomposed into A = U Σ V^T, where U is m × n, Σ is an n × n diagonal matrix, and V is n × n. A similar construction can be given when m < n.

One often defines the singular value decomposition of A slightly differently from this, namely one defines U to be m × m, by adding m − n further orthonormal columns. One also needs to add m − n rows of 0's to Σ to make it m × n, giving

    A_{m×n} = U_{m×m} Σ_{m×n} V^T_{n×n}.

For our purposes there is no important difference between these two decompositions.

Finally, note that Matlab has a function svd which computes the singular value decomposition. One can use svd(A) to compute the eigenvectors and eigenvalues of A^T A: the columns of V are the eigenvectors, and the squares of the singular values are the eigenvalues.
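As a quick numerical check of the decomposition, here is a NumPy sketch (numpy.linalg.svd plays the role of Matlab's svd; the random A is an arbitrary example, not from the lecture).

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 6, 3
    A = rng.standard_normal((m, n))

    # "Economy" SVD: U is m x n, s holds the n singular values, Vt is V^T (n x n).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(np.allclose(U @ np.diag(s) @ Vt, A))       # A = U Sigma V^T

    # Columns of V are orthonormal eigenvectors of A^T A, with eigenvalues sigma_i^2.
    evals, evecs = np.linalg.eigh(A.T @ A)           # eigenvalues in ascending order
    print(np.allclose(np.sort(s**2), evals))         # sigma_i^2 are the eigenvalues

    # Version 1 revisited: the minimizer of ||Ax|| over unit x is the right
    # singular vector with the smallest singular value (last row of Vt here).
    x = Vt[-1]
    print(np.linalg.norm(A @ x), s[-1])              # the two values agree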