Statistics 511. © 2011 Dept. of Statistics (Iowa State University)

Reminder from the last section of the notes:

y = Xβ + ε

Important pieces of information for what follows:
- X is sometimes referred to as the design matrix. It is an n × p matrix of constants with columns corresponding to explanatory variables.
- β is an unknown parameter vector in IRᵖ.
- We saw two possible X matrices for the t-test. This section focuses on the question: does it matter which X we use?

The Column Space of the Design Matrix

Xβ is a linear combination of the columns of X:

Xβ = [x1, …, xp] (β1, …, βp)′ = β1 x1 + ⋯ + βp xp.

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by

C(X) = {Xa : a ∈ IRᵖ}.

Geometry of the Gauss-Markov Linear Model

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ²I for some positive real number σ², i.e.,

E(y) ∈ C(X) and Var(y) = σ²I, σ² ∈ IR⁺.

An Example Column Space

X = (1, 1)′ ⇒ C(X) = {Xa1 : a1 ∈ IR} = {a1 (1, 1)′ : a1 ∈ IR} = {(a1, a1)′ : a1 ∈ IR}

What does this column space "look like"? It is a line in IR²: the set of all points whose two coordinates are equal.

[Figure: plot with axes X1 and X2, each from −1.0 to 1.0, picturing this column space.]

Another Example Column Space

X =
    1 0
    1 0
    0 1
    0 1

⇒ C(X) = { X (a1, a2)′ : (a1, a2)′ ∈ IR² }
       = { a1 (1, 1, 0, 0)′ + a2 (0, 0, 1, 1)′ : a1, a2 ∈ IR }
       = { (a1, a1, a2, a2)′ : a1, a2 ∈ IR }

What is this column space? A plane "living" in IR⁴.

A Third Column Space Example

X2 =
    1 1 0
    1 1 0
    1 0 1
    1 0 1

x ∈ C(X2) ⇒ x = X2 a for some a ∈ IR³
          ⇒ x = a1 (1, 1, 1, 1)′ + a2 (1, 1, 0, 0)′ + a3 (0, 0, 1, 1)′ for some a ∈ IR³
          ⇒ x = (a1 + a2, a1 + a2, a1 + a3, a1 + a3)′ = (b1, b1, b2, b2)′ for some b1, b2 ∈ IR.

This is also a plane in IR⁴. Is it the same plane?

Proving that two column spaces, C(X1) and C(X2), are the same

Concept: If you can start with x ∈ C(X1) and derive x = X2 b, this implies x ∈ C(X2) and hence C(X1) ⊆ C(X2). N.B. this does not (at least yet) give C(X1) = C(X2), because there may be some b for which X2 b is not in C(X1). If you can also show C(X2) ⊆ C(X1), then

C(X1) ⊆ C(X2) and C(X2) ⊆ C(X1) ⇒ C(X1) = C(X2).

The next few slides have the details. Take X1 to be the 4 × 2 matrix from the second example above and X2 the 4 × 3 matrix from the third example.

Proving two column spaces are the same (continued)

x ∈ C(X1) ⇒ x = X1 a for some a ∈ IR²
          ⇒ x = X2 (0, a′)′ for some a ∈ IR²
          ⇒ x = X2 b for some b ∈ IR³
          ⇒ x ∈ C(X2).

Thus, C(X1) ⊆ C(X2).

Proving two column spaces are the same (continued)

x ∈ C(X2) ⇒ x = X2 a for some a ∈ IR³
          ⇒ x = a1 (1, 1, 1, 1)′ + a2 (1, 1, 0, 0)′ + a3 (0, 0, 1, 1)′ for some a ∈ IR³
          ⇒ x = (a1 + a2, a1 + a2, a1 + a3, a1 + a3)′ for some a1, a2, a3 ∈ IR
          ⇒ x = X1 (a1 + a2, a1 + a3)′ for some a1, a2, a3 ∈ IR
          ⇒ x = X1 b for some b ∈ IR²
          ⇒ x ∈ C(X1).

Thus, C(X2) ⊆ C(X1). We previously showed that C(X1) ⊆ C(X2). Thus, C(X1) = C(X2).
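The containment arguments above can also be checked numerically. The sketch below (a NumPy companion, not part of the original notes) uses the fact that two matrices with the same number of rows have equal column spaces exactly when rank(X1) = rank(X2) = rank([X1 X2]):

```python
import numpy as np

# X1 and X2 from the examples above; both should span the same plane in IR^4.
X1 = np.array([[1, 0],
               [1, 0],
               [0, 1],
               [0, 1]])
X2 = np.array([[1, 1, 0],
               [1, 1, 0],
               [1, 0, 1],
               [1, 0, 1]])

# C(X1) == C(X2) iff rank(X1) == rank(X2) == rank([X1 | X2]).
r1 = np.linalg.matrix_rank(X1)
r2 = np.linalg.matrix_rank(X2)
r12 = np.linalg.matrix_rank(np.hstack([X1, X2]))
print(r1, r2, r12)       # 2 2 2
print(r1 == r2 == r12)   # True: the two planes coincide
```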
Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y). We could, of course, use y itself to estimate E(y). y is obviously an unbiased estimator of E(y), but it is often not a very sensible estimator.

For example, suppose

(y1, y2)′ = (1, 1)′ μ + (ε1, ε2)′, and we observe y = (6.1, 2.3)′.

Should we estimate E(y) = (μ, μ)′ by ŷ = (6.1, 2.3)′?

Estimation of E(y) (continued)

The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y).

Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance). This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that Ê(y) might be better notation). By definition,

||y − ŷ|| = min over z ∈ C(X) of ||y − z||, where ||a|| ≡ √(a1² + ⋯ + an²).

In a picture

Suppose X = (1, 1)′ and y = (6.1, 2.3)′.

[Figure: the line C(X), the point X, the observed y, and its projection ŷ.]

A second example and picture

Suppose X = (1, 2)′ and y = (3, 4)′.

[Figure: a sequence of panels showing y and C(X), then the projection ŷ, then the residual vector y − ŷ.]

What the geometry tells a statistician

ŷ is the point in C(X) that minimizes ||y − ŷ||², the sum of (yᵢ − ŷᵢ)² over i = 1, …, n. We're doing least squares estimation!

Geometrically, ||y − ŷ|| is minimized when the angle between the vector ŷ and the vector y − ŷ is 90°:
- The vectors ŷ and y − ŷ are orthogonal.
- The correlation between ŷ and y − ŷ is 0, i.e., predicted values ŷ and residuals y − ŷ are uncorrelated.

Pythagorean Theorem: ||y||² = ||ŷ||² + ||y − ŷ||². This is essentially the ANOVA decomposition of sums of squares: SStotal = SSmodel + SSerror.

But how do you compute ŷ without drawing lines?
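Before turning to the matrix answer, the geometry above can be verified with any least squares routine. A minimal NumPy sketch (not part of the original notes) for the X = (1, 1)′, y = (6.1, 2.3)′ example:

```python
import numpy as np

X = np.array([[1.0],
              [1.0]])
y = np.array([6.1, 2.3])

# lstsq minimizes ||y - Xb||^2, so X @ b is the closest point in C(X).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
resid = y - y_hat

print(y_hat)            # [4.2 4.2]
print(y_hat @ resid)    # ~0: y_hat and y - y_hat are orthogonal
# Pythagorean theorem / ANOVA decomposition:
print(np.isclose(y @ y, y_hat @ y_hat + resid @ resid))   # True
```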
Orthogonal Projection Matrices

We can find ŷ by matrix multiplication:

ŷ = PX y ∀ y ∈ IRⁿ,

where PX is a unique n × n matrix known as an orthogonal projection matrix.

What is PX? If (X′X)⁻¹ exists, i.e., X′X is full rank,

PX = X(X′X)⁻¹X′.

If not, PX = X(X′X)⁻X′, where (X′X)⁻ is any generalized inverse of X′X.

It can be shown that:
- PX is symmetric: PX′ = PX.
- PX is idempotent: PX PX = PX.
- PX X = X and X′PX = X′.

Why does PX X = X?

Algebra (assuming full rank X):

PX X = X(X′X)⁻¹X′X = X.

Geometry: Each column of X represents a point in C(X) (by definition). The projection is the closest point in C(X). Already there!

Why is PX idempotent?

Algebra (assuming full rank X of dimension n × p): consider any matrix A of dimension n × k. Then

PX PX A = X(X′X)⁻¹X′X(X′X)⁻¹X′A = X(X′X)⁻¹(X′X)(X′X)⁻¹X′A = X(X′X)⁻¹X′A = PX A.

Geometry: If X is an n × p matrix, consider any n × k matrix A. Each column of PX A represents a point in C(X) (by definition). Projecting a second time doesn't move PX A. Since this is true for any A, PX PX = PX.

An Example Orthogonal Projection Matrix

Suppose (y1, y2)′ = (1, 1)′ μ + (ε1, ε2)′, and we observe y = (6.1, 2.3)′. Then

X(X′X)⁻X′ = (1, 1)′ [(1, 1)(1, 1)′]⁻ (1, 1)
          = (1, 1)′ [2]⁻¹ (1, 1)
          = (1/2) (1, 1)′ (1, 1)
          =
            1/2 1/2
            1/2 1/2

An Example Orthogonal Projection

Thus, the orthogonal projection of y = (6.1, 2.3)′ onto the column space of X = (1, 1)′ is

PX y = (1/2 (6.1 + 2.3), 1/2 (6.1 + 2.3))′ = (4.2, 4.2)′.

Generalized Inverses

We've already seen that some X matrices are not full rank. Hence, (X′X)⁻¹ is not defined. We still need to do statistics with these design matrices!

Generalized inverse, A⁻: G is a generalized inverse of a matrix A if AGA = A.
- If A is nonsingular, i.e., if A⁻¹ exists, then A⁻¹ is the one and only generalized inverse of A: AA⁻¹A = IA = A.
- If A is singular, i.e., if A⁻¹ does not exist, there are infinitely many generalized inverses of A.

Invariance of PX = X(X′X)⁻X′ to Choice of (X′X)⁻

If X′X is nonsingular, then PX = X(X′X)⁻¹X′ because the only generalized inverse of X′X is (X′X)⁻¹.

If X′X is singular, then PX = X(X′X)⁻X′, and the choice of the generalized inverse (X′X)⁻ does not matter: PX = X(X′X)⁻X′ turns out to be the same matrix no matter which generalized inverse of X′X is used. Suppose (X′X)⁻₁ and (X′X)⁻₂ are any two generalized inverses of X′X. Then

X(X′X)⁻₁X′ = X(X′X)⁻₂X′X(X′X)⁻₁X′ = X(X′X)⁻₂X′,

where both equalities use the fact that X(X′X)⁻X′X = X (and its transpose, X′X(X′X)⁻X′ = X′) for any generalized inverse (X′X)⁻. Hence, PX = X(X′X)⁻₁X′ = X(X′X)⁻₂X′.

A picture we've seen before

[Figure: y, its projection ŷ, and the line C(X) through X.]

Why is PX called an orthogonal projection matrix?

The angle between ŷ and y − ŷ is 90°:

ŷ′(y − ŷ) = ŷ′(y − PX y) = ŷ′(I − PX)y = (PX y)′(I − PX)y = y′PX′(I − PX)y
          = y′PX(I − PX)y = y′(PX − PX PX)y = y′(PX − PX)y = 0.

The vectors ŷ and y − ŷ are orthogonal.
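The properties claimed above are easy to check numerically. A brief sketch (assuming the full-rank formula; not part of the original notes) that reproduces the example projection matrix:

```python
import numpy as np

X = np.array([[1.0],
              [1.0]])
P = X @ np.linalg.inv(X.T @ X) @ X.T   # P_X = X (X'X)^{-1} X'

print(P)                          # [[0.5 0.5]
                                  #  [0.5 0.5]]
print(np.allclose(P, P.T))        # True: symmetric
print(np.allclose(P @ P, P))      # True: idempotent
print(np.allclose(P @ X, X))      # True: P_X X = X
print(P @ np.array([6.1, 2.3]))   # [4.2 4.2], as computed above
```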
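The invariance of PX to the choice of generalized inverse can likewise be illustrated numerically. In this sketch (not from the original notes), X2 is the rank-deficient matrix from the earlier examples, G1 is the Moore-Penrose inverse of X′X, and G2 is a different generalized inverse built by inverting a nonsingular block of X′X and padding with zeros:

```python
import numpy as np

X2 = np.array([[1.0, 1, 0],
               [1, 1, 0],
               [1, 0, 1],
               [1, 0, 1]])
A = X2.T @ X2                     # X'X: 3 x 3 and singular (rank 2)

G1 = np.linalg.pinv(A)            # the Moore-Penrose inverse is one g-inverse

G2 = np.zeros((3, 3))             # a second g-inverse: invert the nonsingular
G2[1:, 1:] = np.linalg.inv(A[1:, 1:])   # lower-right 2 x 2 block of A

# Both satisfy the defining property A G A = A ...
print(np.allclose(A @ G1 @ A, A), np.allclose(A @ G2 @ A, A))   # True True
# ... and both produce the same projection matrix P_X:
print(np.allclose(X2 @ G1 @ X2.T, X2 @ G2 @ X2.T))              # True
```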
Optimality of ŷ as an Estimator of E(y)

ŷ is an unbiased estimator of E(y):

E(ŷ) = E(PX y) = PX E(y) = PX Xβ = Xβ = E(y).

It can be shown that ŷ = PX y is the best estimator of E(y) in the class of linear unbiased estimators, i.e., estimators of the form My for M satisfying

E(My) = E(y) ∀ β ∈ IRᵖ ⇐⇒ MXβ = Xβ ∀ β ∈ IRᵖ ⇐⇒ MX = X.

(Gauss-Markov Theorem) The estimator ŷ = X(X′X)⁻X′y is the best linear unbiased estimator of E(y). If, in addition, ε ∼ N(0, σ²I), then ŷ is the best unbiased estimator: under the normal theory Gauss-Markov linear model, ŷ = PX y is best among all unbiased estimators of E(y).

Ordinary Least Squares (OLS) Estimation of E(y) = Xβ

OLS: find a vector b* ∈ IRᵖ such that Q(b*) ≤ Q(b) ∀ b ∈ IRᵖ, where

Q(b) ≡ (y − Xb)′(y − Xb) = ||y − Xb||²,

the sum of (yᵢ − x(i)′b)² over i = 1, …, n, with x(i)′ the i-th row of X.

To minimize this sum of squares, we need to choose b* such that Xb* is the point in C(X) that is closest to y. In other words, we need to choose b* such that Xb* = PX y = X(X′X)⁻X′y. Clearly, choosing b* = (X′X)⁻X′y will work.

Ordinary Least Squares and the Normal Equations

Often calculus is used to show that Q(b*) ≤ Q(b) ∀ b ∈ IRᵖ if and only if b* is a solution to the normal equations:

X′Xb = X′y.

If X′X is nonsingular, multiplying both sides of the normal equations by (X′X)⁻¹ shows that the only solution is b* = (X′X)⁻¹X′y.

If X′X is singular, there are infinitely many solutions, and they include (X′X)⁻X′y for every choice of generalized inverse of X′X:

X′X[(X′X)⁻X′y] = X′[X(X′X)⁻X′]y = X′PX y = X′y.

Ordinary Least Squares Estimator of E(y) = Xβ

Henceforth, we will use β̂ to denote any solution to the normal equations. We call

Xβ̂ = X(X′X)⁻X′y = PX y = ŷ

the OLS estimator of E(y) = Xβ. It might be more appropriate to put the hat over the whole quantity Xβ rather than writing Xβ̂, because we are estimating Xβ rather than pre-multiplying an estimator of β by X.

As we shall soon see, it does not make sense to estimate β when X′X is singular. However, it does make sense to estimate E(y) = Xβ whether X′X is singular or nonsingular.

Return to our t-test

Our study has 2 groups with 4 observations per group.

The 4 × 2 matrix X1 from the earlier examples describes 2 groups with 2 observations per group. It corresponds to the cell means model yij = μi + εij, Var(ε) = σ²I.

The 4 × 3 matrix X2 also describes 2 groups with 2 observations per group. It corresponds to the effects model yij = μ + αi + εij, Var(ε) = σ²I.

It is straightforward to extend both matrices to 2 groups with 4 observations per group. The two models have the same predicted values because C(X1) = C(X2) (see the numerical sketch at the end of this section).

Summary

- A linear model is a set of E(y)'s describing the mean of each random variable. E(y) is a point in the column space of X.
- Two X matrices with the same column space represent the same model for the mean.
- The choice of X matrix determines the interpretation of β.
- Some X matrices are not full column rank. Then there are many possible solutions for β, but all of them give the same ŷ.

Questions still to be answered:
- What functions of β can be estimated?
- What are the properties of estimable functions of β?
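As a closing numerical sketch (with made-up responses y; not part of the original notes), the singular effects-model matrix X2 admits many solutions β̂ to the normal equations, yet every solution, and the full-rank cell means fit based on X1, yields the same ŷ:

```python
import numpy as np

X1 = np.array([[1.0, 0], [1, 0], [0, 1], [0, 1]])               # cell means
X2 = np.array([[1.0, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]])   # effects model
y = np.array([3.0, 5.0, 8.0, 10.0])     # made-up data: 2 groups of 2

A = X2.T @ X2                           # singular, so many solutions exist
b1 = np.linalg.pinv(A) @ X2.T @ y       # solution from the Moore-Penrose inverse
G = np.zeros((3, 3))
G[1:, 1:] = np.linalg.inv(A[1:, 1:])    # another generalized inverse of X'X
b2 = G @ X2.T @ y                       # a different solution: (0, 4, 9)

print(b1, b2)                           # two different beta_hat vectors
print(X2 @ b1, X2 @ b2)                 # same fitted values: [4 4 9 9]

bc = np.linalg.solve(X1.T @ X1, X1.T @ y)   # cell means: group averages (4, 9)
print(X1 @ bc)                              # [4 4 9 9] again, since C(X1) = C(X2)
```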