Estimation of the Response Mean
Copyright 2012 Dan Nettleton (Iowa State University), Statistics 511

The Gauss-Markov Linear Model

    y = Xβ + ε

- y is an n × 1 random vector of responses.
- X is an n × p matrix of constants with columns corresponding to explanatory variables. X is sometimes referred to as the design matrix.
- β is an unknown parameter vector in IR^p.
- ε is an n × 1 random vector of errors.
- E(ε) = 0 and Var(ε) = σ²I, where σ² is an unknown parameter in IR^+.

The Column Space of the Design Matrix

Xβ is a linear combination of the columns of X:

    Xβ = [x₁, ..., x_p] (β₁, ..., β_p)′ = β₁x₁ + ··· + β_p x_p.

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by C(X) = {Xa : a ∈ IR^p}.

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ²I for some positive real number σ², i.e., E(y) ∈ C(X) and Var(y) = σ²I, σ² ∈ IR^+.

An Example Column Space

    X = (1, 1)′  ⇒  C(X) = {Xa : a ∈ IR}
                         = {(1, 1)′ a₁ : a₁ ∈ IR}
                         = {(a₁, a₁)′ : a₁ ∈ IR}.

Another Example Column Space

    X = [ 1 0 ]
        [ 1 0 ]
        [ 0 1 ]
        [ 0 1 ]

    ⇒  C(X) = {X(a₁, a₂)′ : a ∈ IR^2}
            = {a₁(1, 1, 0, 0)′ + a₂(0, 0, 1, 1)′ : a₁, a₂ ∈ IR}
            = {(a₁, a₁, 0, 0)′ + (0, 0, a₂, a₂)′ : a₁, a₂ ∈ IR}
            = {(a₁, a₁, a₂, a₂)′ : a₁, a₂ ∈ IR}.

Different Matrices with the Same Column Space

    W = [ 1 0 ]        X = [ 1 1 0 ]
        [ 1 0 ]            [ 1 1 0 ]
        [ 0 1 ]            [ 1 0 1 ]
        [ 0 1 ]            [ 1 0 1 ]

    x ∈ C(W)  ⇒  x = Wa for some a ∈ IR^2
              ⇒  x = X(0, a₁, a₂)′ for some a ∈ IR^2
              ⇒  x = Xb for some b ∈ IR^3
              ⇒  x ∈ C(X).

Thus, C(W) ⊆ C(X). Conversely,

    x ∈ C(X)  ⇒  x = Xa for some a ∈ IR^3
              ⇒  x = a₁(1, 1, 1, 1)′ + a₂(1, 1, 0, 0)′ + a₃(0, 0, 1, 1)′ for some a ∈ IR^3
              ⇒  x = (a₁ + a₂, a₁ + a₂, a₁ + a₃, a₁ + a₃)′ for some a₁, a₂, a₃ ∈ IR
              ⇒  x = W(a₁ + a₂, a₁ + a₃)′ for some a₁, a₂, a₃ ∈ IR
              ⇒  x = Wb for some b ∈ IR^2
              ⇒  x ∈ C(W).

Thus, C(X) ⊆ C(W). We previously showed that C(W) ⊆ C(X), so it follows that C(W) = C(X).

Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y). We could, of course, use y itself to estimate E(y); y is obviously an unbiased estimator of E(y), but it is often not a very sensible one. For example, suppose

    y = (y₁, y₂)′ = (1, 1)′ μ + (ε₁, ε₂)′,

and we observe y = (6.1, 2.3)′. Should we estimate E(y) = (μ, μ)′ by y = (6.1, 2.3)′?

The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y). Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance). This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that Ê(y) would be better notation). By definition,

    ||y − ŷ|| = min_{z ∈ C(X)} ||y − z||,   where ||a|| ≡ √(Σ_{i=1}^n a_i²).
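The following is a minimal numerical sketch of that definition in Python with numpy (my own illustration; the variable names and the use of numpy are not part of the notes). It finds the point of C(X) closest to y for the example above by solving the least-squares problem min_b ||y − Xb||; the result, (4.2, 4.2)′, matches the orthogonal projection P_X y computed in the example later in these notes.

```python
import numpy as np

# Example from the notes: X is a single column of ones and y = (6.1, 2.3)'.
X = np.array([[1.0],
              [1.0]])
y = np.array([6.1, 2.3])

# The point of C(X) closest to y in Euclidean distance is X b*, where b*
# minimizes ||y - X b||; numpy's least-squares routine finds such a b*.
b_star, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b_star

print(y_hat)                      # [4.2 4.2] -- the orthogonal projection of y onto C(X)
print(np.linalg.norm(y - y_hat))  # the minimized distance from y to C(X)
```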
Orthogonal Projection Matrices

In Homework Assignment 2, we will formally prove the following:

1. ∀ y ∈ IR^n, ŷ = P_X y, where P_X is a unique n × n matrix known as an orthogonal projection matrix.
2. P_X is idempotent: P_X P_X = P_X.
3. P_X is symmetric: P_X = P_X′.
4. P_X X = X and X′P_X = X′.
5. P_X = X(X′X)⁻X′, where (X′X)⁻ is any generalized inverse of X′X.

Why Does P_X X = X?

    P_X X = P_X [x₁, ..., x_p] = [P_X x₁, ..., P_X x_p] = [x₁, ..., x_p] = X,

since each column x_j of X already lies in C(X), so its orthogonal projection onto C(X) is x_j itself.

Generalized Inverses

G is a generalized inverse of a matrix A if AGA = A. We usually denote a generalized inverse of A by A⁻.

If A is nonsingular, i.e., if A⁻¹ exists, then A⁻¹ is the one and only generalized inverse of A: AA⁻¹A = IA = A.

If A is singular, i.e., if A⁻¹ does not exist, then there are infinitely many generalized inverses of A.

An Algorithm for Finding a Generalized Inverse of a Matrix A

1. Find any r × r nonsingular submatrix of A, where r = rank(A). Call this matrix W.
2. Invert and transpose W, i.e., compute (W⁻¹)′.
3. Replace each element of W in A with the corresponding element of (W⁻¹)′.
4. Replace all other elements of A with zeros.
5. Transpose the resulting matrix to obtain G, a generalized inverse of A.

Invariance of P_X = X(X′X)⁻X′ to the Choice of (X′X)⁻

If X′X is nonsingular, then P_X = X(X′X)⁻¹X′, because the only generalized inverse of X′X is (X′X)⁻¹.

If X′X is singular, then P_X = X(X′X)⁻X′ and the choice of the generalized inverse (X′X)⁻ does not matter, because X(X′X)⁻X′ turns out to be the same matrix no matter which generalized inverse of X′X is used. To see this, suppose (X′X)₁⁻ and (X′X)₂⁻ are any two generalized inverses of X′X. Then

    X(X′X)₁⁻X′ = X(X′X)₂⁻X′ X(X′X)₁⁻X′ = X(X′X)₂⁻X′.

A numerical check of this invariance appears in the sketch at the end of this section.

An Example Orthogonal Projection Matrix

Suppose y = (y₁, y₂)′ = (1, 1)′ μ + ε, and we observe y = (6.1, 2.3)′. With X = (1, 1)′,

    X(X′X)⁻X′ = (1, 1)′ [(1, 1)(1, 1)′]⁻ (1, 1)
              = (1, 1)′ [2]⁻¹ (1, 1)
              = (1/2) (1, 1)′ (1, 1)
              = [ 1/2  1/2 ]
                [ 1/2  1/2 ].

An Example Orthogonal Projection

Thus, the orthogonal projection of y = (6.1, 2.3)′ onto the column space of X = (1, 1)′ is

    P_X y = [ 1/2  1/2 ] [ 6.1 ]  =  [ 4.2 ]
            [ 1/2  1/2 ] [ 2.3 ]     [ 4.2 ],

i.e., each coordinate of ŷ is the average of the two observations.
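As a numerical check of the facts above (a minimal sketch in Python with numpy; my own illustration, not part of the notes), the code below works with the 4 × 3 matrix X from the column-space examples, whose X′X is singular. It builds one generalized inverse of X′X with the algorithm just described and compares it with the Moore-Penrose pseudoinverse (another generalized inverse): both satisfy the defining property, and both give the same P_X = X(X′X)⁻X′, which is symmetric and idempotent.

```python
import numpy as np

# The 4 x 3 matrix X from the column-space examples; it has rank 2,
# so X'X is a singular 3 x 3 matrix.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
A = X.T @ X

# Generalized inverse 1: the Moore-Penrose pseudoinverse (one particular
# generalized inverse of A).
G1 = np.linalg.pinv(A)

# Generalized inverse 2: the algorithm from the notes, using the nonsingular
# 2 x 2 submatrix W = A[1:3, 1:3] (r = rank(A) = 2): invert and transpose W,
# place it back in W's positions, zero everything else, and transpose the result.
G2 = np.zeros_like(A)
G2[1:3, 1:3] = np.linalg.inv(A[1:3, 1:3]).T
G2 = G2.T

# Both satisfy the defining property A G A = A ...
assert np.allclose(A @ G1 @ A, A) and np.allclose(A @ G2 @ A, A)

# ... and both give the same P_X = X (X'X)^- X', which is symmetric
# and idempotent.
P1, P2 = X @ G1 @ X.T, X @ G2 @ X.T
assert np.allclose(P1, P2)
assert np.allclose(P1, P1.T) and np.allclose(P1 @ P1, P1)
print(np.round(P1, 3))
```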
Why Is P_X Called an Orthogonal Projection Matrix?

Suppose, for example, that X = (1, 2)′, so that C(X) is a line through the origin in IR^2, and that y is a point in IR^2 that does not lie in C(X).

[Figure: the line C(X), the point y, the orthogonal projection ŷ of y onto C(X), and the vector y − ŷ connecting ŷ to y.]

The angle between ŷ and y − ŷ is 90°; that is, the vectors ŷ and y − ŷ are orthogonal:

    ŷ′(y − ŷ) = ŷ′(y − P_X y)
              = ŷ′(I − P_X)y
              = (P_X y)′(I − P_X)y
              = y′P_X′(I − P_X)y
              = y′P_X(I − P_X)y
              = y′(P_X − P_X P_X)y
              = y′(P_X − P_X)y
              = 0.

Optimality of ŷ as an Estimator of E(y)

ŷ is an unbiased estimator of E(y):

    E(ŷ) = E(P_X y) = P_X E(y) = P_X Xβ = Xβ = E(y).

It can be shown that ŷ = P_X y is the best estimator of E(y) in the class of linear unbiased estimators, i.e., estimators of the form My for M satisfying

    E(My) = E(y) ∀ β ∈ IR^p  ⇔  MXβ = Xβ ∀ β ∈ IR^p  ⇔  MX = X.

Under the Gauss-Markov linear model with normal errors, ŷ = P_X y is best among all unbiased estimators of E(y).

Ordinary Least Squares (OLS) Estimation of E(y)

OLS: find a vector b* ∈ IR^p such that Q(b*) ≤ Q(b) ∀ b ∈ IR^p, where

    Q(b) ≡ Σ_{i=1}^n (y_i − x_(i)′b)²

and x_(i)′ denotes the i-th row of X. Note that

    Q(b) = Σ_{i=1}^n (y_i − x_(i)′b)² = (y − Xb)′(y − Xb) = ||y − Xb||².

To minimize this sum of squares, we need to choose b* ∈ IR^p such that Xb* is the point in C(X) that is closest to y. In other words, we need to choose b* such that Xb* = P_X y = X(X′X)⁻X′y. Clearly, choosing b* = (X′X)⁻X′y will work.

Ordinary Least Squares and the Normal Equations

It can be shown that Q(b*) ≤ Q(b) ∀ b ∈ IR^p if and only if b* is a solution to the normal equations:

    X′Xb = X′y.

If X′X is nonsingular, multiplying both sides of the normal equations by (X′X)⁻¹ shows that the only solution to the normal equations is b* = (X′X)⁻¹X′y.

If X′X is singular, there are infinitely many solutions, which include (X′X)⁻X′y for every choice of generalized inverse of X′X:

    X′X[(X′X)⁻X′y] = X′[X(X′X)⁻X′]y = X′P_X y = X′y.

Henceforth, we will use β̂ to denote any solution to the normal equations.

Ordinary Least Squares Estimator of E(y) = Xβ

We call

    Xβ̂ = P_X Xβ̂ = X(X′X)⁻X′Xβ̂ = X(X′X)⁻X′y = P_X y = ŷ

(the third equality uses the normal equations, X′Xβ̂ = X′y) the OLS estimator of E(y) = Xβ. It might be more appropriate to put the hat over the entire quantity Xβ, rather than writing Xβ̂, because we are estimating Xβ rather than pre-multiplying an estimator of β by X.

As we shall soon see, it does not make sense to estimate β when X′X is singular. However, it does make sense to estimate E(y) = Xβ whether X′X is singular or nonsingular; the sketch below illustrates this point numerically.
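Here is a short numerical sketch of that last point in Python with numpy (my own illustration, not part of the notes; the first two response values are taken from the earlier example, while the last two are made up for illustration). With the singular 4 × 3 design matrix used above, two different generalized inverses of X′X give two different solutions β̂ to the normal equations, yet both yield the same Xβ̂ = P_X y = ŷ.

```python
import numpy as np

# The 4 x 3 design matrix used earlier; X'X is singular, so the normal
# equations X'X b = X'y have infinitely many solutions.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
# Response vector: the first two values come from the notes' example; the
# last two are made up for illustration.
y = np.array([6.1, 2.3, 5.0, 7.0])

A, b = X.T @ X, X.T @ y   # X'X and X'y

# Two different generalized inverses of X'X ...
G1 = np.linalg.pinv(A)                        # Moore-Penrose pseudoinverse
G2 = np.zeros_like(A)                         # the algorithm from the notes
G2[1:3, 1:3] = np.linalg.inv(A[1:3, 1:3]).T
G2 = G2.T

# ... give two different solutions beta_hat = (X'X)^- X'y to the normal equations,
beta1, beta2 = G1 @ b, G2 @ b
assert np.allclose(A @ beta1, b) and np.allclose(A @ beta2, b)
print(beta1, beta2)       # two distinct solutions

# but the OLS estimator of E(y) = X beta is the same either way:
# X beta_hat = P_X y = y_hat.
assert np.allclose(X @ beta1, X @ beta2)
print(X @ beta1)          # y_hat = (4.2, 4.2, 6.0, 6.0): each group's mean
```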