A review of linear models

A simple linear model

Suppose we observe a univariate predictor Xi and a univariate response Yi for i = 1, ..., n. A simple linear regression model assumes that

    Yi = β0 + β1 Xi + εi,   i = 1, ..., n,

where β0 and β1 are unknown parameters and εi is a random error with mean 0 and variance σ².

A multiple linear regression

In general, a multiple linear regression model is

    Yi = β0 + β1 Xi1 + ... + βp−1 Xi(p−1) + εi,   i = 1, ..., n,

where Xi1, ..., Xi(p−1) are p − 1 predictors (p > 1) and β0, ..., βp−1 are unknown parameters.

A multiple linear model in matrix form

- Let Y = (Y1, ..., Yn)^T be the n-dimensional response vector.
- Let β = (β0, ..., βp−1)^T be the p-dimensional vector of unknown coefficients.
- Let ε = (ε1, ..., εn)^T be the n-dimensional random error vector.
- Let Xi = (1, Xi1, ..., Xi(p−1))^T be the p-dimensional predictor vector for the i-th individual, and define the n × p design matrix X = (X1, ..., Xn)^T.

Then the linear regression model is

    Y = Xβ + ε,   where E(ε) = 0 and Var(ε) = σ² In.

Example: Beverage study data set

Consider the beverage study data set introduced in the last lecture. Let us focus on the expression of gene 1 measured at hour 1. Each sample carries an expression value and a beverage group label (1-4):

    Sample     Expression  Group      Sample     Expression  Group
    GSM87854   7.01942     2          GSM87907   7.04565     2
    GSM87859   7.04051     3          GSM87911   6.88402     4
    GSM87864   6.93354     1          GSM87921   6.8528      2
    GSM87869   6.97392     4          GSM87925   7.00089     2
    GSM87874   7.20032     4          GSM87930   7.13008     3
    GSM87878   7.15189     2          GSM87935   7.03175     1
    GSM87883   6.59645     3          GSM87939   7.07665     4
    GSM87888   7.08282     1          GSM87944   6.80101     1
    GSM87892   7.06286     3          GSM87949   6.95144     3
    GSM87897   6.94016     1          GSM87954   6.86817     4
    GSM87902   7.0484      4          GSM87957   7.05255     2

We study the effect of beverages on gene expression using linear models. As discussed in the last class, the covariates are defined by dummy variables: the i-th row of the design matrix is (1, 1{group = 1}, 1{group = 2}, 1{group = 3}, 1{group = 4}). Therefore the response and design matrix for the linear regression are

    Y = (7.01942, 7.04051, 6.93354, 6.97392, 7.20032, 7.15189, 6.59645, ..., 6.86817, 7.05255)^T

    X = [ 1 0 1 0 0 ]   (sample 1, group 2)
        [ 1 0 0 1 0 ]   (sample 2, group 3)
        [ 1 1 0 0 0 ]   (sample 3, group 1)
        [ 1 0 0 0 1 ]   (sample 4, group 4)
        [    ...     ]
        [ 1 0 1 0 0 ]   (sample 22, group 2)

Estimation: least squares method

- If the number of unknown parameters were equal to the number of observations and there were no random error, one could obtain β by solving Xβ = Y exactly, which can also be viewed as minimizing the objective function (Y − Xβ)^T (Y − Xβ).
- In general, n > p in most applications mentioned in this course. If n = p, the model is called a saturated model.
- The least squares method estimates β by minimizing this objective function:

    β̂ = arg min_β (Y − Xβ)^T (Y − Xβ).

Least squares estimation

- If rank(X) = p, i.e., X is of full rank, the least squares solution for β is β̂ = (X^T X)^{−1} X^T Y.
- If rank(X) < p, however, the solution is not unique: multiple vectors minimize the least squares objective function, so we cannot identify a single estimate of β.

Example: Beverage study data set

In the beverage study, X is a 22 × 5 matrix, so p = 5 is the number of unknown parameters. However, the rank of X is only 4, because the four dummy columns sum to the intercept column. Since rank(X) < p, the least squares estimator is not unique. This can be verified directly in R, as the sketch below shows.
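A minimal sketch that rebuilds the response and design matrix from the table above and checks the rank. The variable names y, group, and X are our own choices, but the later code in these notes uses y and X in exactly these roles.

    # Expression values and beverage group labels, in the sample order
    # of the table above
    y <- c(7.01942, 7.04051, 6.93354, 6.97392, 7.20032, 7.15189, 6.59645,
           7.08282, 7.06286, 6.94016, 7.0484,  7.04565, 6.88402, 6.8528,
           7.00089, 7.13008, 7.03175, 7.07665, 6.80101, 6.95144, 6.86817,
           7.05255)
    group <- c(2, 3, 1, 4, 4, 2, 3, 1, 3, 1, 4, 2, 4, 2, 2, 3, 1, 4, 1, 3, 4, 2)

    # Intercept plus one dummy column per beverage group; the dummies
    # sum to the intercept column, so X cannot be of full rank
    X <- cbind(1, outer(group, 1:4, "==") + 0)
    dim(X)       # 22 x 5, so p = 5
    qr(X)$rank   # 4 < p: the least squares solution is not unique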
Estimability

- When X is of full rank, the least squares estimate of β exists and is meaningful. In this case it is also meaningful to estimate a linear combination c^T β (for some c ∈ R^p) by c^T β̂ = c^T (X^T X)^{−1} X^T Y.
- When X is not of full rank, β itself is not estimable. However, some linear combinations of β are still estimable in the following sense.

Definition: c^T β is estimable if there exists a linear combination a^T Y of the responses (a ∈ R^n) such that E(a^T Y) = c^T β for every β.

Estimability

The following conditions are equivalent:

- There exists a vector a ∈ R^n such that a^T Xβ = c^T β for any β ∈ R^p.
- c belongs to the row space of X, i.e., c is a linear combination of the rows of X.
- Xβ1 = Xβ2 implies c^T β1 = c^T β2.

Example: Beverage study data set

The design matrix X in this example has four distinct rows, so its row space is

    R(X) = { c1 (1,1,0,0,0)^T + c2 (1,0,1,0,0)^T + c3 (1,0,0,1,0)^T + c4 (1,0,0,0,1)^T : c1, ..., c4 ∈ R }.

A vector c therefore yields an estimable c^T β exactly when it lies in this span, which can be checked numerically as in the sketch below.
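A rough numerical check of estimability (the helper function is our own, not from the slides): c lies in the row space of X if and only if projecting c onto that row space returns c unchanged. We use ginv() from the MASS package, the Moore-Penrose generalized inverse that the remaining code in these notes also relies on.

    library(MASS)   # provides ginv()

    # c^T beta is estimable iff c is invariant under the projection
    # onto the row space of X, namely P = (X^T X)^- (X^T X)
    is_estimable <- function(cvec, X, tol = 1e-8) {
      P <- ginv(t(X) %*% X) %*% (t(X) %*% X)
      max(abs(P %*% cvec - cvec)) < tol
    }

    is_estimable(c(0, 0, 0, 1, -1), X)   # TRUE:  beta3 - beta4 is estimable
    is_estimable(c(0, 0, 0, 1,  0), X)   # FALSE: beta3 alone is not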
Gauss-Markov model

Consider the Gauss-Markov model

    Y = Xβ + ε,   where E(ε) = 0 and Var(ε) = σ² In.

Mean and variance of the least squares estimator

- For any estimable function Cβ, where C = (c1, ..., cℓ)^T is an ℓ × p matrix, the LS estimator is

    Cβ̂ = C (X^T X)^− X^T Y,

  where (X^T X)^− is a generalized inverse of X^T X. (For estimable Cβ, this does not depend on the choice of generalized inverse.)
- It is easy to show that Cβ̂ has mean Cβ and variance σ² C (X^T X)^− C^T.
- In particular, if β itself is estimable, then E(β̂) = β and Var(β̂) = σ² (X^T X)^{−1}.

Estimation of σ²

- Define the projection matrix PX = X (X^T X)^− X^T.
- The residual of the least squares fit is ε̂ = Y − Xβ̂ = Y − PX Y = (In − PX) Y.
- It is easy to see that E(ε̂) = 0 and Var(ε̂) = σ² (In − PX).
- We can also check that E(ε̂^T ε̂) = {n − rank(X)} σ². This suggests the unbiased estimator

    σ̂² = ε̂^T ε̂ / {n − rank(X)}.

Example: Beverage study data set

Consider estimating the difference between the effects of red wine and water on gene expression, namely c^T β = β3 − β4 with c = (0, 0, 0, 1, −1)^T. As we saw above, this is an estimable function. The corresponding LS estimate is

    cvec <- c(0, 0, 0, 1, -1)
    cbeta <- cvec %*% ginv(t(X) %*% X) %*% t(X) %*% y
    cbeta
    #           [,1]
    # [1,] -0.052312

The estimate of σ² is

    epsilonhat <- y - X %*% ginv(t(X) %*% X) %*% t(X) %*% y
    n <- length(y)
    rankx <- qr(X)$rank
    SSE <- sum(epsilonhat^2)
    MSE <- SSE / (n - rankx)
    MSE
    # [1] 0.01953277

Two useful lemmas

The following two lemmas will be useful.

Lemma 1: Suppose A is an n × n symmetric matrix with rank k, and Y ∼ N(µ, Σ) with Σ > 0. If AΣ is idempotent (i.e., AΣAΣ = AΣ), then Y^T A Y ∼ χ²_k(µ^T Aµ). In particular, if Aµ = 0, then Y^T A Y ∼ χ²_k.

Lemma 2: Suppose Y ∼ N(µ, σ² In) and BA = 0.
(a) If A is symmetric, then Y^T A Y and BY are independent.
(b) If both A and B are symmetric, then Y^T A Y and Y^T B Y are independent.

For proofs of Lemmas 1 and 2, see Muirhead (1982) or Khuri (2010).

Statistical inference under the Gauss-Markov model

Under the Gauss-Markov model with normal errors, we would like to make inference on estimable functions Cβ and on the error variance σ². We will discuss:

(a) confidence intervals and confidence regions;
(b) hypothesis testing;
(c) prediction and prediction intervals.

Confidence interval for σ²

By Lemma 1, it can be shown that ε̂^T ε̂ / σ² ∼ χ²_{n−rank(X)}. Therefore a 1 − α confidence interval for σ² is

    ( SSE / χ²_{n−rank(X), α/2} ,  SSE / χ²_{n−rank(X), 1−α/2} ),

where SSE = ε̂^T ε̂.

Example: Beverage study data set

A 95% confidence interval for σ² is

    low95sigma2 <- SSE / qchisq(0.975, n - rankx)
    upp95sigma2 <- SSE / qchisq(0.025, n - rankx)
    CI95sigma2 <- c(low95sigma2, upp95sigma2)
    CI95sigma2
    # [1] 0.01115224 0.04271664

Confidence interval for c^T β

Under the Gauss-Markov model with normal errors,

    (c^T β̂ − c^T β) / sqrt( MSE · c^T (X^T X)^− c ) ∼ t_{n−rank(X)},

where MSE = SSE / {n − rank(X)}. A 1 − α confidence interval for c^T β is

    c^T β̂ ± t_{n−rank(X), α/2} · sqrt( MSE · c^T (X^T X)^− c ).

Example: Beverage study data set

A 95% confidence interval for β3 − β4 is

    cvec <- c(0, 0, 0, 1, -1)
    cbeta <- cvec %*% ginv(t(X) %*% X) %*% t(X) %*% y
    varcbeta <- t(cvec) %*% ginv(t(X) %*% X) %*% cvec
    sigmahat <- sqrt(MSE)   # square root of the MSE computed earlier
    low95cbeta <- cbeta - qt(0.975, n - rankx) * sigmahat * sqrt(varcbeta)
    upp95cbeta <- cbeta + qt(0.975, n - rankx) * sigmahat * sqrt(varcbeta)
    CI95cbeta <- c(low95cbeta, upp95cbeta)
    CI95cbeta
    # [1] -0.2301103  0.1254863

Hypothesis testing for H0: c^T β = u0

Consider testing H0: c^T β = u0 vs. H1: c^T β ≠ u0. An α-level rejection region is

    Rα = { |Tn| > t_{n−rank(X), α/2} },  where  Tn = (c^T β̂ − u0) / sqrt( MSE · c^T (X^T X)^− c ).

Example: Beverage study data set

Testing H0: β3 = β4 vs. H1: β3 ≠ β4 (here u0 = 0):

    Tn <- cbeta / (sigmahat * sqrt(varcbeta))
    pvalue <- 2 * (1 - pt(abs(Tn), n - rankx))
    pvalue
    #           [,1]
    # [1,] 0.5442288

Hypothesis testing for H0: Cβ = d

Suppose we want to test H0: Cβ = d vs. H1: Cβ ≠ d, where C is an ℓ × p matrix with ℓ ≤ p and rank(C) = ℓ, and d is a constant vector. An α-level test rejects H0 if Fn > F_{ℓ, n−rank(X); α}, where

    Fn = (SSH0 / ℓ) / ( SSE / {n − rank(X)} )

and SSH0 = (Cβ̂ − d)^T { C (X^T X)^− C^T }^{−1} (Cβ̂ − d).

Example: Beverage study data set

Testing H0: β1 = β2 = β3 = β4 vs. H1: βk ≠ βl for some k ≠ l:

    Cmat <- rbind(c(0, 1, -1, 0, 0),
                  c(0, 0, 1, -1, 0),
                  c(0, 0, 0, 1, -1))
    Cmatbeta <- Cmat %*% ginv(t(X) %*% X) %*% t(X) %*% y
    CovCmatbeta <- Cmat %*% ginv(t(X) %*% X) %*% t(Cmat)
    SSH0 <- t(Cmatbeta) %*% solve(CovCmatbeta) %*% Cmatbeta
    Fn <- (SSH0 / 3) / MSE
    pvalF <- 1 - pf(Fn, 3, n - rankx)
    pvalF
    #           [,1]
    # [1,] 0.8142451

Prediction for a new observation

Suppose we would like to predict a new observation y* that is independent of Y, with y* ∼ N(c^T β, r σ²) for a known constant r (r = 1 for a single new response). Because y* is independent of Y, the best linear unbiased prediction (BLUP) of y* is c^T β̂. (Why?)

Example: Beverage study data set

A prediction for a new observation from the red wine group, i.e., y* ∼ N(β0 + β3, σ²), is

    cpred <- c(1, 0, 0, 1, 0)
    predy <- t(cpred) %*% ginv(t(X) %*% X) %*% t(X) %*% y
    predy
    #          [,1]
    # [1,] 6.956268

Prediction interval

Because the BLUP of y* is c^T β̂ and

    (c^T β̂ − y*) / sqrt( MSE · (r + c^T (X^T X)^− c) ) ∼ t_{n−rank(X)},

a 1 − α prediction interval for y* is

    c^T β̂ ± t_{n−rank(X), α/2} · sqrt( MSE · (r + c^T (X^T X)^− c) ).

Example: Beverage study data set

A 95% prediction interval for a new observation from the red wine group (so r = 1) is

    MSEpred <- MSE * (1 + t(cpred) %*% ginv(t(X) %*% X) %*% cpred)
    low95pred <- predy - qt(0.975, n - rankx) * sqrt(MSEpred)
    upp95pred <- predy + qt(0.975, n - rankx) * sqrt(MSEpred)
    predintvaly <- c(low95pred, upp95pred)
    predintvaly
    # [1] 6.634619 7.277917
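As a closing sanity check (not part of the original analysis), the same quantities can be reproduced with base R's lm(), which resolves the rank deficiency by dropping a redundant dummy column under its default treatment coding. The coefficient names below assume group 1 is taken as the baseline level.

    # One-way model; lm() drops one redundant column, so all estimable
    # functions agree with the generalized-inverse computations above
    fit <- lm(y ~ factor(group))

    # Overall F test of equal group effects, H0: beta1 = ... = beta4;
    # this should reproduce the p-value 0.8142451 found above
    anova(fit)

    # Under treatment coding, beta3 - beta4 is the difference of the
    # group-3 and group-4 coefficients; should reproduce -0.052312
    coef(fit)["factor(group)3"] - coef(fit)["factor(group)4"]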