Some Different Perspectives on Linear Least Squares

Al Lehnen, Madison Area Technical College, 2/17/2016

A standard problem in statistics is to measure a response or dependent variable, $y$, at fixed values of one or more independent variables. Sometimes there exists a deterministic model $y = f(x, \theta)$, where $x$ stands collectively for all of the independent variables of the system and $\theta$ stands collectively for all of the model parameters. In the probabilistic model it is often assumed that each measured $y$ value can be expressed as $y_i = f(x_i, \theta) + \varepsilon_i$, where $i$ is an index that labels the fixed input condition at $x_i$ and $\varepsilon_i$ represents the random error associated with the given measurement. For a valid model the mean random error should be zero. For a given set of $n$ data points the problem of choosing the "best" values for the parameters is typically solved via the method of least squares: the parameters are chosen to minimize the function

$$F(\theta) = \sum_{i=1}^{n} \bigl[y_i - f(x_i, \theta)\bigr]^2 .$$

If the function $f(x_i, \theta)$ is linear in $\theta$, the procedure is referred to as a linear least squares or linear regression problem. The purpose of these notes is to examine the least squares solution from several different points of view.

The Least Squares Approximation to a Vector in an n-Dimensional Euclidean Space

Consider an $n$-dimensional Euclidean space $V$ with an inner product of vectors $v$ and $u$ denoted as $\langle u, v \rangle$. Let $W$ be a subspace of $V$; then $V = W \oplus W^{\perp}$, where $W^{\perp}$ is the orthogonal complement of $W$. Let $v$ be any element of $V$; then $v = w + u$, where $w \in W$ and $u \in W^{\perp}$. The vector $w$ is called the projection of $v$ onto $W$. Let $g(v') = \lVert v - v' \rVert^2 = \langle v - v',\, v - v' \rangle$. The following argument shows that $w$ is the unique minimizer of this function on $W$. First, by expanding the inner product, $g(v')$ can be expressed as

$$g(v') = \lVert v - w \rVert^2 + \lVert w - v' \rVert^2 + 2\,\langle v - w,\; w - v' \rangle .$$

If $v' \in W$, then $w - v' \in W$ and $\langle v - w,\, w - v' \rangle = \langle u,\, w - v' \rangle = 0$. So on $W$, $g(v') = \lVert u \rVert^2 + \lVert w - v' \rVert^2 \ge \lVert u \rVert^2$, with equality only at $v' = w$. If $V = \mathbb{R}^n$ and $W$ has the orthonormal basis $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$ with $m \le n$, then the column vector which is the projection of $v$ onto $W$ is given by

$$w = \sum_{j=1}^{m} \langle \hat{C}_j, v \rangle\, \hat{C}_j .$$

Define the $n \times m$ matrix $Q$ as $Q = [\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m]$, that is, $Q_{i,j} = (\hat{C}_j)_{i,1}$. The $n \times n$ matrix $Q Q^T$, built from the "outer products" of the basis vectors, projects $v$ onto $W$:

$$w_{i,1} = \sum_{j=1}^{m} \langle \hat{C}_j, v \rangle\, (\hat{C}_j)_{i,1} = \sum_{j=1}^{m} (\hat{C}_j)_{i,1}\, \bigl(Q^T v\bigr)_{j,1} = \sum_{j=1}^{m} Q_{i,j}\, \bigl(Q^T v\bigr)_{j,1} = \bigl(Q Q^T v\bigr)_{i,1} .$$

Thus, the vector $Q Q^T v$ minimizes the squared distance from $v$ to $W$.

Linear Least Squares in Matrix Notation: The Generalized Inverse

If a model is linear in its $m$ parameters, designated by the vector $\theta$ in $\mathbb{R}^m$, then all $n$ estimated $y$ responses can be generated as a vector in $\mathbb{R}^n$ given by $A\theta$, where $A$ is an $n \times m$ matrix. We will assume that $A$ has rank $m$ with $m \le n$. If the rank of $A$ is less than $m$, the model parameters are not all independent and need to be reduced in number so as to make a "design matrix" with all of its columns linearly independent. Denote the $k$th column of $A$ as $A_k$, i.e., $A = [A_1, A_2, \ldots, A_m]$. Since the columns of $A$ are linearly independent vectors in $\mathbb{R}^n$,

$$A b = \sum_{k=1}^{m} b_k A_k = 0_n \quad \text{if and only if each } b_k = 0 .$$

Thinking of $A$ as a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$, the image of $A$, $W$, is the span of $A_1, A_2, \ldots, A_m$ and has dimension $m$, while the kernel of $A$ has dimension zero and consists of only the zero vector $0_m$. Every vector in $W^{\perp}$ is orthogonal to each column of $A$; stated differently, if $u \in W^{\perp}$, then $A^T u = 0_m$. Conversely, if $A^T u = 0_m$, then every column of $A$ is perpendicular to $u$, so $u \in W^{\perp}$. Thus, the kernel of $A^T$ is the orthogonal complement of the column space of $A$. Let $y \in \mathbb{R}^n$; then $y = w + u$, where $w \in W$ and $u \in W^{\perp}$.
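The projection formula $w = Q Q^T v$ translates directly into a few lines of linear algebra code. The following is a minimal sketch in NumPy (illustrative data only, not from the notes); the orthonormal basis is obtained here from a QR factorization of an arbitrary full-rank matrix whose columns span $W$.

```python
import numpy as np

# Sketch: project a vector v in R^n onto the subspace W spanned by the orthonormal
# columns of Q, using the matrix Q Q^T. Illustrative data only.
rng = np.random.default_rng(0)

n, m = 5, 2
A = rng.normal(size=(n, m))          # two linearly independent columns spanning W
Q, _ = np.linalg.qr(A)               # orthonormal basis C_1, ..., C_m for W (n x m)

v = rng.normal(size=n)
w = Q @ (Q.T @ v)                    # projection of v onto W:  w = Q Q^T v
u = v - w                            # component in the orthogonal complement of W

# u is orthogonal to every basis vector of W, so w is the closest point of W to v
print(np.allclose(Q.T @ u, 0))       # True
```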
Unless $u = 0_n$, the system of equations $A b = y$ has no solutions. However, $A b = w$ does have a solution, $\theta$, and since $A_1, A_2, \ldots, A_m$ is a linearly independent set, this solution is unique. Furthermore, $w$ is the closest vector in the column space of $A$ to $y$, and the minimum value of $F_y(b) = \lVert y - A b \rVert^2$ as $b$ varies over $\mathbb{R}^m$ is $F_y(\theta) = \lVert y - w \rVert^2 = \lVert y - A\theta \rVert^2$. Now $A\theta = w$, so

$$A^T A\,\theta = A^T w = A^T (y - u) = A^T y - A^T u = A^T y ,$$

since $\ker A^T = W^{\perp}$. Suppose that for some $b \in \mathbb{R}^m$, $A^T A b = 0_m$; then $b^T A^T A b = (A b)^T (A b) = \langle A b, A b \rangle = \lVert A b \rVert^2 = 0$. So if $A^T A b = 0_m$, then $A b = 0_n$, which implies that $b = 0_m$ since the kernel of $A$ consists of only $0_m$. Therefore $A^T A$ is nonsingular, and the least squares solution minimizing $F_y(b)$ is given by

$$\theta = \bigl(A^T A\bigr)^{-1} A^T y .$$

The $m \times n$ matrix $(A^T A)^{-1} A^T$ is sometimes called the generalized inverse of the $n \times m$ matrix $A$.

An Alternate Formulation of Linear Least Squares Using Multivariable Calculus

The least squares solution is the vector $\theta$ in $\mathbb{R}^m$ that minimizes the function of $m$ variables

$$F_y(b) = \lVert y - A b \rVert^2 = \langle y - A b,\; y - A b \rangle = \langle y, y \rangle - 2\,\langle A b, y \rangle + \langle A b, A b \rangle$$

as $b$ varies over $\mathbb{R}^m$. Rewriting this function using sigma notation gives

$$F_y(b) = \langle y, y \rangle - 2 \sum_{i=1}^{n} \sum_{j=1}^{m} A_{i,j}\, b_j\, y_i + \sum_{j=1}^{m} \sum_{\ell=1}^{m} b_j\, b_\ell\, \bigl(A^T A\bigr)_{j,\ell} .$$

A necessary condition for $F_y(b)$ to be at a minimum is that all of the partial derivatives $\partial F_y / \partial b_k$ vanish for every integer $k$ with $1 \le k \le m$. Since $\partial b_j / \partial b_k = \delta_{j,k}$, the following simplification results:

$$\frac{\partial F_y}{\partial b_k} = -2 \sum_{i=1}^{n} A_{i,k}\, y_i + 2 \sum_{j=1}^{m} \bigl(A^T A\bigr)_{k,j}\, b_j = -2\,\bigl(A^T y\bigr)_{k,1} + 2\,\bigl(A^T A b\bigr)_{k,1} .$$

The condition $\partial F_y / \partial b_k = 0$ for every integer $k$ with $1 \le k \le m$ requires that $A^T A b = A^T y$. This has the unique solution $\theta = (A^T A)^{-1} A^T y$, which is identical to the result obtained in the last section.

A Formulation of Linear Least Squares Using QR Factorization

Since the columns of $A$ are linearly independent, a Gram-Schmidt procedure on the columns yields an orthonormal basis for $W$. Designate this basis as $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$. Specifically, this orthonormal basis is defined recursively for $j \ge 2$ as

$$\hat{C}_j = \frac{\tilde{C}_j}{\lVert \tilde{C}_j \rVert}, \qquad \text{where} \qquad \tilde{C}_j = A_j - \sum_{k=1}^{j-1} \langle A_j, \hat{C}_k \rangle\, \hat{C}_k \qquad \text{and} \qquad \hat{C}_1 = \frac{A_1}{\lVert A_1 \rVert} .$$

The column vector which is the projection of $y$ onto $W$ is given by $w = Q Q^T y$, where the $n \times m$ matrix $Q$ is defined by $Q_{i,j} = (\hat{C}_j)_{i,1}$, i.e., $Q = [\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m]$. Now $A = Q Q^T A$, since the projection of each column of the matrix $A$ onto $A$'s column space is just that original column of $A$. Let the $m \times m$ matrix $R$ be defined as $R = Q^T A$, so that $R_{i,j} = \hat{C}_i^{\,T} A_j = \langle \hat{C}_i, A_j \rangle$. By the Gram-Schmidt procedure, for each $j \le m$ the span of $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_j$ is equal to the span of $A_1, A_2, \ldots, A_j$, so

$$A_j = \tilde{C}_j + \sum_{k=1}^{j-1} \hat{C}_k\, \langle \hat{C}_k, A_j \rangle .$$

Now, if $i > j$, then $\langle \hat{C}_i, A_j \rangle = 0$; that is, if $i > j$, $\hat{C}_i$ is orthogonal to the subspace spanned by $A_1, A_2, \ldots, A_j$. Therefore the matrix $R$ is upper triangular. Consider, for $m$ real numbers $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_m$, the sum

$$\sum_{j=1}^{m} R_{i,j}\, \alpha_j = \sum_{j=1}^{m} \alpha_j\, \langle \hat{C}_i, A_j \rangle = \langle \hat{C}_i, z \rangle , \qquad \text{where} \qquad z = \sum_{j=1}^{m} \alpha_j A_j \in W .$$

If $\sum_{j=1}^{m} R_{i,j}\, \alpha_j = 0$ for every integer $i$ with $1 \le i \le m$, then $z = \sum_{i=1}^{m} \langle \hat{C}_i, z \rangle\, \hat{C}_i = 0_n$. But since the columns of $A$ are linearly independent, this requires that $\alpha_1 = \alpha_2 = \alpha_3 = \cdots = \alpha_m = 0$. Hence the columns of the matrix $R$ are linearly independent, and $R$ is therefore nonsingular.
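Before using this factorization, here is a minimal sketch of the Gram-Schmidt construction just described (classical Gram-Schmidt, illustrative data only): the orthonormal columns of $Q$ are built one at a time from the columns of $A$, and $R = Q^T A$ then comes out upper triangular.

```python
import numpy as np

# Sketch of the classical Gram-Schmidt construction described above: build the
# orthonormal columns C_1, ..., C_m of Q from the columns of A, then form R = Q^T A.
rng = np.random.default_rng(4)

n, m = 6, 3
A = rng.normal(size=(n, m))              # assumed to have linearly independent columns

Q = np.zeros((n, m))
for j in range(m):
    c = A[:, j].copy()
    for k in range(j):                   # subtract projections onto earlier basis vectors
        c -= (A[:, j] @ Q[:, k]) * Q[:, k]
    Q[:, j] = c / np.linalg.norm(c)      # normalize C~_j to obtain C^_j

R = Q.T @ A
print(np.allclose(Q.T @ Q, np.eye(m)))   # True: Q has orthonormal columns
print(np.allclose(Q @ R, A))             # True: A = Q R
print(np.allclose(np.tril(R, -1), 0))    # True: R is upper triangular
```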
This factorization of $A$, $A = Q Q^T A = Q R$, is generically called the QR factorization. Since the unique solution $\theta$ of the linear least squares problem solves $A\theta = w = Q Q^T y$, we have $Q R \theta = Q Q^T y$. Now $Q^T Q = I_m$, i.e.,

$$\bigl(Q^T Q\bigr)_{i,j} = \hat{C}_i^{\,T}\, \hat{C}_j = \langle \hat{C}_i, \hat{C}_j \rangle = \delta_{i,j} .$$

So $R\theta = Q^T Q R \theta = Q^T Q Q^T y = Q^T y$, and the unique linear least squares solution is given by $\theta = R^{-1} Q^T y$. The following steps show that this solution is the same as that obtained by the generalized inverse:

$$A = Q R \;\Longrightarrow\; A^T = R^T Q^T ,$$
$$A^T A = R^T Q^T Q R = R^T I_m R = R^T R ,$$
$$\bigl(A^T A\bigr)^{-1} = \bigl(R^T R\bigr)^{-1} = R^{-1} \bigl(R^T\bigr)^{-1} ,$$
$$\bigl(A^T A\bigr)^{-1} A^T = R^{-1} \bigl(R^T\bigr)^{-1} R^T Q^T = R^{-1} Q^T .$$

Calculating the Parameter Variances

From the theory of the probability distributions of random variables we have the following fundamental result: if $y_1, y_2, y_3, \ldots, y_n$ are $n$ statistically independent random variables with the variance of $y_j$ being $\sigma_j^2$, and $h = \sum_{j=1}^{n} \lambda_j y_j$, then $\sigma_h^2 = \sum_{j=1}^{n} \lambda_j^2 \sigma_j^2$. Now, using the generalized inverse, the linear least squares solution for parameter $\theta_j$ is given by

$$\theta_j = \bigl[(A^T A)^{-1} A^T y\bigr]_{j,1} = \sum_{k=1}^{n} \bigl[(A^T A)^{-1} A^T\bigr]_{j,k}\, y_k = \sum_{k=1}^{n} y_k \sum_{i=1}^{m} A_{k,i}\, \bigl[(A^T A)^{-1}\bigr]_{j,i} .$$

Hence,

$$\sigma_{\theta_j}^2 = \sum_{k=1}^{n} \sigma_k^2 \left( \sum_{i=1}^{m} A_{k,i}\, \bigl[(A^T A)^{-1}\bigr]_{j,i} \right)^{\!2} .$$

In the special case where all of the random variables have a common variance, $\sigma^2$, this simplifies to

$$\sigma_{\theta_j}^2 = \sigma^2 \sum_{k=1}^{n} \left( \sum_{i=1}^{m} A_{k,i}\, \bigl[(A^T A)^{-1}\bigr]_{j,i} \right)^{\!2} = \sigma^2 \sum_{k=1}^{n} \bigl[(A^T A)^{-1} A^T\bigr]_{j,k}^{\,2} .$$

In the QR factorization solution of the linear least squares problem,

$$\theta_j = \bigl[R^{-1} Q^T y\bigr]_{j,1} = \sum_{k=1}^{n} \bigl(R^{-1} Q^T\bigr)_{j,k}\, y_k = \sum_{k=1}^{n} y_k \sum_{i=1}^{m} R^{-1}_{j,i}\, \bigl(\hat{C}_i\bigr)_{k} ,$$

so that

$$\sigma_{\theta_j}^2 = \sum_{k=1}^{n} \sigma_k^2 \left( \sum_{i=1}^{m} R^{-1}_{j,i}\, \bigl(\hat{C}_i\bigr)_{k} \right)^{\!2} .$$

In the special case where all of the random variables have a common variance, this simplifies as follows:

$$\sigma_{\theta_j}^2 = \sigma^2 \sum_{k=1}^{n} \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell}\, \bigl(\hat{C}_i\bigr)_{k} \bigl(\hat{C}_\ell\bigr)_{k} = \sigma^2 \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell}\, \langle \hat{C}_i, \hat{C}_\ell \rangle = \sigma^2 \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell}\, \delta_{i,\ell} = \sigma^2 \sum_{i=1}^{m} \bigl(R^{-1}_{j,i}\bigr)^2 .$$
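As a numerical check of the two solution routes and the two variance formulas above, the following sketch (illustrative data; $\sigma^2$ is an assumed common variance) confirms that $R^{-1} Q^T y$ and $(A^T A)^{-1} A^T y$ coincide and that the corresponding variance expressions agree.

```python
import numpy as np

# Sketch: compare the QR and generalized-inverse routes numerically, including the
# two parameter-variance formulas above. Illustrative data; sigma2 is an assumed
# common variance of the y_i.
rng = np.random.default_rng(2)

n, m, sigma2 = 25, 3, 0.5
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

Q, R = np.linalg.qr(A)                                  # A = Q R, Q^T Q = I_m, R upper triangular
theta_qr = np.linalg.solve(R, Q.T @ y)                  # theta = R^{-1} Q^T y
theta_gi = np.linalg.solve(A.T @ A, A.T @ y)            # theta = (A^T A)^{-1} A^T y
print(np.allclose(theta_qr, theta_gi))                  # True

# sigma_theta_j^2 = sigma^2 * sum_k [(A^T A)^{-1} A^T]_{j,k}^2   (generalized inverse)
#                 = sigma^2 * sum_i (R^{-1})_{j,i}^2             (QR factorization)
G = np.linalg.inv(A.T @ A) @ A.T
var_gi = sigma2 * np.sum(G**2, axis=1)
var_qr = sigma2 * np.sum(np.linalg.inv(R)**2, axis=1)
print(np.allclose(var_gi, var_qr))                      # True
```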
An Example of the Methods: The Linear Model in One Independent Variable x

If the model is $y_{\text{estimate}} = \theta_1 x + \theta_0$, and the $2 \times 1$ parameter vector is defined as $\theta = \begin{pmatrix} \theta_1 \\ \theta_0 \end{pmatrix}$, then the "design" matrix is

$$A = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix} .$$

Hence,

$$A^T A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & n\bar{x} \\ n\bar{x} & n \end{pmatrix}
\qquad \text{and} \qquad
\bigl(A^T A\bigr)^{-1} = \frac{1}{SS_{xx}} \begin{pmatrix} 1 & -\bar{x} \\ -\bar{x} & \bar{x}^2 + SS_{xx}/n \end{pmatrix} .$$

Here the average value of a variable $u$ is denoted as $\bar{u} = \tfrac{1}{n}\sum_{i=1}^{n} u_i$, while the covariation of two variables $u$ and $v$, $SS_{uv}$, is defined as

$$SS_{uv} = \sum_{i=1}^{n} (u_i - \bar{u})(v_i - \bar{v}) = \sum_{i=1}^{n} u_i v_i - n\,\bar{u}\,\bar{v} = \sum_{i=1}^{n} u_i v_i - \frac{1}{n}\left(\sum_{i=1}^{n} u_i\right)\!\left(\sum_{i=1}^{n} v_i\right) .$$

Also,

$$A^T y = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} ,$$

so the least squares solution is given by

$$\theta = \begin{pmatrix} \theta_1 \\ \theta_0 \end{pmatrix} = \bigl(A^T A\bigr)^{-1} A^T y = \frac{1}{SS_{xx}} \begin{pmatrix} \displaystyle \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\!\left(\sum_{i=1}^{n} y_i\right) \\[2ex] \displaystyle \Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr) \sum_{i=1}^{n} y_i - \bar{x} \sum_{i=1}^{n} x_i y_i \end{pmatrix} .$$

These are the "regression" equations, which are sometimes expressed as

$$\theta_1 = \frac{\displaystyle \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\!\left(\sum_{i=1}^{n} y_i\right)}{SS_{xx}} = \frac{SS_{xy}}{SS_{xx}}$$

and

$$\theta_0 = \frac{\displaystyle \left(\sum_{i=1}^{n} x_i^2\right)\!\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\!\left(\sum_{i=1}^{n} x_i y_i\right)}{n\, SS_{xx}} = \frac{\bar{y}\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i y_i}{SS_{xx}} = \frac{\bar{y}\,\bigl(SS_{xx} + n\bar{x}^2\bigr) - \bar{x}\,\bigl(SS_{xy} + n\bar{x}\bar{y}\bigr)}{SS_{xx}} = \bar{y} - \frac{SS_{xy}}{SS_{xx}}\,\bar{x} = \bar{y} - \theta_1 \bar{x} .$$

Assuming a common variance $\sigma^2$ for each independent random variable $y_i$, the variances of the model parameters are computed as follows. For the slope,

$$\sigma_{\theta_1}^2 = \sigma^2 \sum_{k=1}^{n} \bigl[(A^T A)^{-1} A^T\bigr]_{1,k}^{\,2} = \sigma^2 \sum_{k=1}^{n} \left( \frac{x_k - \bar{x}}{SS_{xx}} \right)^{\!2} = \frac{\sigma^2}{SS_{xx}^2} \sum_{k=1}^{n} (x_k - \bar{x})^2 = \frac{\sigma^2}{SS_{xx}} ,$$

and for the intercept,

$$\sigma_{\theta_0}^2 = \sigma^2 \sum_{k=1}^{n} \bigl[(A^T A)^{-1} A^T\bigr]_{2,k}^{\,2} = \sigma^2 \sum_{k=1}^{n} \left( \frac{\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_k}{SS_{xx}} \right)^{\!2} .$$

Writing $\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_k = SS_{xx}/n - \bar{x}\,(x_k - \bar{x})$ and using $\sum_{k=1}^{n}(x_k - \bar{x}) = 0$ and $\sum_{k=1}^{n} (x_k - \bar{x})^2 = SS_{xx}$ gives

$$\sum_{k=1}^{n} \Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n} - \bar{x}\,x_k\Bigr)^{2} = n\Bigl(\tfrac{SS_{xx}}{n}\Bigr)^{2} + \bar{x}^2\, SS_{xx} = SS_{xx}\Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr) ,$$

so that

$$\sigma_{\theta_0}^2 = \frac{\sigma^2}{SS_{xx}^2}\; SS_{xx}\Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}} \right) .$$

Using the QR formulation to calculate the parameters requires an orthonormal basis for the column space of $A$. A Gram-Schmidt procedure results in the following vectors:

$$\hat{C}_1 = \frac{1}{\sqrt{\sum_{i=1}^{n} x_i^2}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \frac{1}{\sqrt{n\bar{x}^2 + SS_{xx}}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} ,
\qquad
\tilde{C}_2 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} - \langle A_2, \hat{C}_1 \rangle\, \hat{C}_1 ,
\qquad
\bigl(\tilde{C}_2\bigr)_i = 1 - \frac{n\bar{x}\, x_i}{\sum_{j=1}^{n} x_j^2} = \frac{\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_i}{\bar{x}^2 + SS_{xx}/n} .$$

Normalizing $\tilde{C}_2$ requires the previously encountered sum, $\sum_{i=1}^{n} \bigl(\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_i\bigr)^2 = SS_{xx}\bigl(\bar{x}^2 + SS_{xx}/n\bigr)$, so that

$$\bigl(\hat{C}_2\bigr)_i = \frac{\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_i}{\sqrt{SS_{xx}\bigl(\bar{x}^2 + SS_{xx}/n\bigr)}} .$$

Thus

$$Q = \begin{pmatrix} \dfrac{x_1}{\sqrt{n\bar{x}^2 + SS_{xx}}} & \dfrac{\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_1}{\sqrt{SS_{xx}\bigl(\bar{x}^2 + SS_{xx}/n\bigr)}} \\ \vdots & \vdots \\ \dfrac{x_n}{\sqrt{n\bar{x}^2 + SS_{xx}}} & \dfrac{\bar{x}^2 + SS_{xx}/n - \bar{x}\,x_n}{\sqrt{SS_{xx}\bigl(\bar{x}^2 + SS_{xx}/n\bigr)}} \end{pmatrix}$$

and

$$R = Q^T A = \begin{pmatrix} \langle \hat{C}_1, A_1 \rangle & \langle \hat{C}_1, A_2 \rangle \\ 0 & \langle \hat{C}_2, A_2 \rangle \end{pmatrix} = \begin{pmatrix} \sqrt{n\bar{x}^2 + SS_{xx}} & \dfrac{n\bar{x}}{\sqrt{n\bar{x}^2 + SS_{xx}}} \\[2ex] 0 & \sqrt{\dfrac{n\, SS_{xx}}{n\bar{x}^2 + SS_{xx}}} \end{pmatrix} ,$$

so that, since $\det R = \sqrt{n\, SS_{xx}}$,

$$R^{-1} = \begin{pmatrix} \dfrac{1}{\sqrt{n\bar{x}^2 + SS_{xx}}} & -\,\bar{x}\sqrt{\dfrac{n}{SS_{xx}\bigl(n\bar{x}^2 + SS_{xx}\bigr)}} \\[2ex] 0 & \sqrt{\dfrac{n\bar{x}^2 + SS_{xx}}{n\, SS_{xx}}} \end{pmatrix} .$$

Proceeding,

$$Q^T y = \begin{pmatrix} \dfrac{\displaystyle\sum_{i=1}^{n} x_i y_i}{\sqrt{n\bar{x}^2 + SS_{xx}}} \\[3ex] \dfrac{\Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr)\displaystyle\sum_{i=1}^{n} y_i - \bar{x}\displaystyle\sum_{i=1}^{n} x_i y_i}{\sqrt{SS_{xx}\bigl(\bar{x}^2 + SS_{xx}/n\bigr)}} \end{pmatrix} .$$

Computing the parameter vector gives

$$\theta = R^{-1} Q^T y = \begin{pmatrix} \dfrac{1}{n\bar{x}^2 + SS_{xx}} \left[ \displaystyle\sum_{i=1}^{n} x_i y_i - \dfrac{n\bar{x}}{SS_{xx}} \left( \Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr)\displaystyle\sum_{i=1}^{n} y_i - \bar{x}\displaystyle\sum_{i=1}^{n} x_i y_i \right) \right] \\[3ex] \dfrac{1}{n\, SS_{xx}} \left[ \displaystyle\sum_{i=1}^{n} x_i^2 \displaystyle\sum_{i=1}^{n} y_i - \displaystyle\sum_{i=1}^{n} x_i \displaystyle\sum_{i=1}^{n} x_i y_i \right] \end{pmatrix} .$$

The expression for $\theta_0$ agrees with the earlier derivation, while the previous expression for $\theta_1$ is recovered as shown by the following steps:

$$\theta_1 = \frac{1}{n\bar{x}^2 + SS_{xx}} \left[ \sum_{i=1}^{n} x_i y_i - \frac{n\bar{x}}{SS_{xx}} \left( \Bigl(\bar{x}^2 + \tfrac{SS_{xx}}{n}\Bigr)\sum_{i=1}^{n} y_i - \bar{x}\sum_{i=1}^{n} x_i y_i \right) \right]
= \frac{\bigl(SS_{xx} + n\bar{x}^2\bigr)\displaystyle\sum_{i=1}^{n} x_i y_i - \bar{x}\bigl(n\bar{x}^2 + SS_{xx}\bigr)\displaystyle\sum_{i=1}^{n} y_i}{SS_{xx}\bigl(n\bar{x}^2 + SS_{xx}\bigr)}
= \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \bar{x}\displaystyle\sum_{i=1}^{n} y_i}{SS_{xx}} = \frac{SS_{xy}}{SS_{xx}} .$$

Assuming a common variance $\sigma^2$ for each independent random variable $y_i$, in the QR factorization the variances of the model parameters are computed as follows:

$$\sigma_{\theta_1}^2 = \sigma^2 \sum_{i=1}^{2} \bigl(R^{-1}_{1,i}\bigr)^2 = \sigma^2 \left[ \frac{1}{n\bar{x}^2 + SS_{xx}} + \frac{n\bar{x}^2}{SS_{xx}\bigl(n\bar{x}^2 + SS_{xx}\bigr)} \right] = \sigma^2\, \frac{SS_{xx} + n\bar{x}^2}{SS_{xx}\bigl(n\bar{x}^2 + SS_{xx}\bigr)} = \frac{\sigma^2}{SS_{xx}} ,$$

$$\sigma_{\theta_0}^2 = \sigma^2 \sum_{i=1}^{2} \bigl(R^{-1}_{2,i}\bigr)^2 = \sigma^2 \left[ 0 + \frac{n\bar{x}^2 + SS_{xx}}{n\, SS_{xx}} \right] = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}} \right) .$$

Both of these results agree with the previous analysis that used the generalized inverse.
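The closed-form regression equations and their variances can be checked numerically against the matrix formulas. The following sketch uses illustrative simulated data (the true slope, intercept, and noise variance below are assumptions made only for the example).

```python
import numpy as np

# Sketch: the one-variable model y ~ theta_1 * x + theta_0, checking the closed-form
# regression equations and parameter variances against the matrix solution.
# Illustrative data; the true slope 2.0, intercept 1.0, and sigma2 are assumptions.
rng = np.random.default_rng(3)

n, sigma2 = 30, 0.25
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=np.sqrt(sigma2), size=n)

A = np.column_stack([x, np.ones(n)])               # design matrix with rows [x_i, 1]
theta = np.linalg.solve(A.T @ A, A.T @ y)          # [theta_1, theta_0]

SSxx = np.sum((x - x.mean())**2)
SSxy = np.sum((x - x.mean()) * (y - y.mean()))
theta1 = SSxy / SSxx                               # slope: SS_xy / SS_xx
theta0 = y.mean() - theta1 * x.mean()              # intercept: y-bar - theta_1 * x-bar
print(np.allclose(theta, [theta1, theta0]))        # True

# Parameter variances for a common variance sigma2, matrix formula vs. closed form
G = np.linalg.inv(A.T @ A) @ A.T
var_matrix = sigma2 * np.sum(G**2, axis=1)         # sigma^2 * sum_k [(A^T A)^{-1} A^T]_{j,k}^2
var_closed = sigma2 * np.array([1.0 / SSxx, 1.0 / n + x.mean()**2 / SSxx])
print(np.allclose(var_matrix, var_closed))         # True
```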