CO902 Supporting Lecture Notes: Least Squares Derivation in Matrix Mode & the Pseudoinverse

Lecture 5 — 4 Feb 2013

1 Fitting Variables to Data with Least Squares

We have data $Y = (Y_1, Y_2, \ldots, Y_n)'$ that we wish to fit with a linear combination of the columns of $X$, where $X = [\,\mathbf{1}\ X_1 \cdots X_D\,]$ and $X_d = \{X_{di}\}_{i=1}^n$ are the data on the $d$th variable, $d = 1, \ldots, D$. To find the "least squares solution" we need to find the vector $w$ of length $D+1$ that minimizes the sum of squared errors $(Xw - Y)'(Xw - Y)$. It is useful to see how to do this in "matrix mode".

The following two results from matrix calculus are useful. For column vectors $x$ and $a$ of the same length,
$$\frac{\partial}{\partial x}\, a'x = \frac{\partial}{\partial x}\, x'a = a.$$
For a column vector $x$ and a matrix $A$,
$$\frac{\partial}{\partial x}\, x'Ax = (A + A')x,$$
and, if $A$ is symmetric, $\frac{\partial}{\partial x}\, x'Ax = 2Ax$ (think $(d/dx)\,x^2 = 2x$).

We take the derivative of the sum of squared errors w.r.t. $w$, set it to zero, and solve for $\hat{w}$:
$$\begin{aligned}
0 &= \frac{\partial}{\partial w}\,(Xw - Y)'(Xw - Y) \\
  &= \frac{\partial}{\partial w}\,\left[\, w'X'Xw - w'X'Y - Y'Xw + Y'Y \,\right] \\
  &= \frac{\partial}{\partial w}\, w'X'Xw - 2\,\frac{\partial}{\partial w}\, w'X'Y - 0 \\
  &= 2X'Xw - 2X'Y.
\end{aligned}$$

This leads to the definition of the Least Squares "Normal Equations"
$$X'Y = X'Xw.$$
This system of $D+1$ equations defines the Least Squares solutions: any vector $w$ that satisfies it is a "Least Squares" estimate. Of course, if $X$ is full rank, then we have the standard answer
$$\hat{w} = (X'X)^{-1}X'Y.$$
However, you should avoid at all costs ever using this formula to compute $\hat{w}$: it is numerically unstable if $X$ is poorly conditioned. The "best practice" for computing least squares estimates is to solve the problem via a QR decomposition of $X$ (this is what happens inside any regression program). But, in Matlab, the easiest thing to use is the Moore-Penrose pseudoinverse.

2 Moore-Penrose Pseudoinverse

For an arbitrary matrix $A$, not necessarily square nor even full rank, the Moore-Penrose pseudoinverse is written $A^-$ and satisfies the following conditions:

1. $AA^-$ is symmetric,
2. $A^-A$ is symmetric,
3. $AA^-A = A$,
4. $A^-AA^- = A^-$.
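The four conditions can be checked numerically. A minimal sketch in NumPy (the notes work in Matlab; `np.linalg.pinv` is the analogue of Matlab's `pinv`, and the matrix below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary rectangular, rank-deficient matrix A: the third column
# is the sum of the first two, so A is 6x3 but only rank 2.
A = rng.standard_normal((6, 2))
A = np.column_stack([A, A[:, 0] + A[:, 1]])

Ap = np.linalg.pinv(A)  # the Moore-Penrose pseudoinverse, written A^- above

# The four defining conditions:
assert np.allclose(A @ Ap, (A @ Ap).T)   # 1. AA^- is symmetric
assert np.allclose(Ap @ A, (Ap @ A).T)   # 2. A^-A is symmetric
assert np.allclose(A @ Ap @ A, A)        # 3. AA^-A = A
assert np.allclose(Ap @ A @ Ap, Ap)      # 4. A^-AA^- = A^-

# The derived identity A' = A'AA^- also holds:
assert np.allclose(A.T, A.T @ A @ Ap)
```

Note that the checks pass even though $A$ is rank deficient; none of the four conditions require invertibility.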
From these you can show that $(A^-)^- = A$ and $(A^-)' = (A')^-$, and also this useful result: $A' = A'AA^-$. The gory details for this last one are
$$\begin{aligned}
(AA^-)' &= AA^- && \text{(by condition 1)} \\
\Leftrightarrow\quad (A^-)'A' &= AA^- \\
\Leftrightarrow\quad A'(A^-)'A' &= A'AA^- && \text{(premultiply by } A'\text{)} \\
\Leftrightarrow\quad A'(A')^-A' &= A'AA^- && \text{(by } (A^-)' = (A')^-\text{)} \\
\Leftrightarrow\quad A' &= A'AA^- && \text{(by condition 3).}
\end{aligned}$$

But this means that the pseudoinverse of $X$ times $Y$ is a solution to the Normal Equations! Consider $\tilde{w} = X^-Y$; the Normal Equations are then
$$X'Y = X'X\tilde{w} = X'X X^-Y = X'Y,$$
using the previous result. The reason to use $\hat{w} = X^-Y$ instead of $\hat{w} = (X'X)^{-1}X'Y$ is: (1) the pseudoinverse is numerically stable, and (2) it will work even if $X$ is rank deficient. Of course, if $X$ is rank deficient then there are an infinite number of solutions to the Normal Equations, but we can be happy in the knowledge that the Moore-Penrose pseudoinverse will always give a single, unique solution (in fact, the solution of minimum norm). In Matlab, use pinv(X) to get the pseudoinverse.

TEN / March 1, 2013
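The claim that $\tilde{w} = X^-Y$ solves the Normal Equations even when $X$ is rank deficient can be verified concretely. A NumPy sketch, with synthetic data invented for illustration (`np.linalg.pinv` standing in for Matlab's `pinv`):

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient design: the last column duplicates x1, so X'X is
# singular and the textbook formula (X'X)^{-1} X'Y cannot be used.
n = 40
x1 = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x1])            # three columns, rank 2
Y = 3.0 + 2.0 * x1 + 0.1 * rng.standard_normal(n)    # synthetic responses

# w~ = X^-Y, the analogue of Matlab's pinv(X)*Y:
w = np.linalg.pinv(X) @ Y

# It satisfies the Normal Equations X'Y = X'X w:
assert np.allclose(X.T @ Y, X.T @ X @ w)

# Shifting weight between the two identical columns gives another
# solution of the Normal Equations, but one with a larger norm;
# the pseudoinverse picks the minimum-norm solution.
w_alt = w + np.array([0.0, 1.0, -1.0])
assert np.allclose(X.T @ Y, X.T @ X @ w_alt)
assert np.linalg.norm(w) < np.linalg.norm(w_alt)
```

This makes the trade-off in the text tangible: the explicit inverse fails outright here, while the pseudoinverse quietly returns the minimum-norm member of the infinite solution set.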