A review of linear models
A simple linear model
Suppose we observe a univariate predictor Xi and a univariate response Yi for i = 1, · · · , n. A simple linear regression model assumes that

Yi = β0 + β1 Xi + εi,   i = 1, · · · , n,

where β0, β1 are unknown parameters and εi is a random error with mean 0 and variance σ².
A multiple linear regression
In general, a multiple linear regression model is
Yi = β0 + β1 Xi1 + · · · + βp−1 Xi(p−1) + εi for i = 1, · · · , n
where Xi1 , · · · , Xi(p−1) are p − 1 predictors (p > 1) and
β0 , · · · , βp−1 are unknown parameters.
A multiple linear model in a matrix form
- Let Y = (Y1, · · · , Yn)ᵀ be the n-dim response vector.
- Let β = (β0, · · · , βp−1)ᵀ be the p-dim vector of unknown coefficients.
- Let ε = (ε1, · · · , εn)ᵀ be the n-dim random error vector.
- Let Xi = (1, Xi1, · · · , Xi(p−1))ᵀ be the p-dim predictor vector for the i-th individual, and define the n × p design matrix X = (X1, · · · , Xn)ᵀ.

Then the linear regression model is

Y = Xβ + ε,

where E(ε) = 0 and Var(ε) = σ² In.
Example: Beverage study data set
Consider the beverage study data set introduced in the last
lecture. Let us now focus on the gene expression measured at
hour 1 from gene 1.
Sample      Expression (gene 1, hour 1)   Beverage group
GSM87854    7.01942                       2
GSM87859    7.04051                       3
GSM87864    6.93354                       1
GSM87869    6.97392                       4
GSM87874    7.20032                       4
GSM87878    7.15189                       2
GSM87883    6.59645                       3
GSM87888    7.08282                       1
GSM87892    7.06286                       3
GSM87897    6.94016                       1
GSM87902    7.0484                        4
GSM87907    7.04565                       2
GSM87911    6.88402                       4
GSM87921    6.8528                        2
GSM87925    7.00089                       2
GSM87930    7.13008                       3
GSM87935    7.03175                       1
GSM87939    7.07665                       4
GSM87944    6.80101                       1
GSM87949    6.95144                       3
GSM87954    6.86817                       4
GSM87957    7.05255                       2
Example: Beverage study data set
Study the effect of beverages on the gene expression using
linear models. As we discussed in the last class, the covariates
are defined by the dummy variables. Therefore, the response
and design matrix for the linear regression are

Y = (7.01942, 7.04051, 6.93354, 6.97392, 7.20032, 7.15189, 6.59645, · · · , 6.86817, 7.05255)ᵀ

and X is the 22 × 5 design matrix whose i-th row is (1, Xi1, Xi2, Xi3, Xi4), where Xik = 1 if the i-th sample belongs to beverage group k and 0 otherwise:

X = ( 1 0 1 0 0
      1 0 0 1 0
      1 1 0 0 0
      1 0 0 0 1
        · · ·
      1 0 1 0 0 ).
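In R, objects like these can be built directly from the data table; a minimal sketch is below (y and X are the objects used by the later code snippets, group is introduced here only for convenience, and the dummy columns are assumed to follow the order of the group labels 1–4):

# Sketch: build the response vector and design matrix from the data table
y <- c(7.01942, 7.04051, 6.93354, 6.97392, 7.20032, 7.15189, 6.59645,
       7.08282, 7.06286, 6.94016, 7.0484, 7.04565, 6.88402, 6.8528,
       7.00089, 7.13008, 7.03175, 7.07665, 6.80101, 6.95144, 6.86817,
       7.05255)
group <- c(2, 3, 1, 4, 4, 2, 3,
           1, 3, 1, 4, 2, 4, 2,
           2, 3, 1, 4, 1, 3, 4,
           2)
X <- cbind(1, outer(group, 1:4, "==") + 0)  # intercept plus one dummy per group
dim(X)                                      # 22 x 5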
Estimation: least squares method
- If the number of unknown parameters equals the number of observations and there is no random error, one could solve Xβ = Y exactly; this can be viewed as driving the objective function (Y − Xβ)ᵀ(Y − Xβ) to zero.
- In general, n > p in most applications mentioned in this course. If n = p, the model is called a saturated model.
- The least squares method estimates β by minimizing this objective function:

  β̂ = arg min_β (Y − Xβ)ᵀ(Y − Xβ).
Least squares estimation
- If rank(X) = p, that is, X has full column rank, then the least squares solution for β is

  β̂ = (XᵀX)⁻¹ XᵀY.

- However, if rank(X) < p, the solution is not unique: multiple vectors minimize the least squares objective function, so we cannot identify a single estimate of β. (A numerical sketch of the full-rank case follows below.)
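As a small illustration of the full-rank formula, here is a sketch on simulated data (the objects nsim, xf1, xf2, Xf, yf are made up for this example and are not part of the beverage data):

# Sketch: least squares via (X^T X)^{-1} X^T Y on a simulated full-rank design
set.seed(1)
nsim <- 50
xf1 <- rnorm(nsim); xf2 <- rnorm(nsim)
Xf <- cbind(1, xf1, xf2)                          # 50 x 3, full column rank
yf <- Xf %*% c(1, 2, -1) + rnorm(nsim)
betahat <- solve(t(Xf) %*% Xf) %*% t(Xf) %*% yf   # should be close to (1, 2, -1)
betahat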
Example: Beverage study data set
Using the beverage study data set, X is a 22 × 5 matrix, so p = 5 is the number of unknown parameters. However, the rank of X is only 4, because the four dummy columns sum to the intercept column. Therefore rank(X) < p and the least squares estimator is not unique.
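Assuming X has been constructed as in the earlier sketch, the rank can be checked directly in R:

qr(X)$rank   # 4, whereas ncol(X) = p = 5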
Estimability
- When X is full rank, the least squares estimate of β exists and is well defined. In this case it is also meaningful to estimate a linear combination of β (i.e., cᵀβ for some c ∈ Rᵖ) by cᵀβ̂ = cᵀ(XᵀX)⁻¹ XᵀY.
- When X is not full rank, β itself is not estimable. However, some linear combinations of β are still estimable in the following sense.

Definition: cᵀβ is estimable if there exists a linear combination of Y (i.e., aᵀY for some a ∈ Rⁿ) such that E(aᵀY) = cᵀβ.
Estimability
The following conditions are equivalent:
- There exists a vector a ∈ Rⁿ such that aᵀXβ = cᵀβ for every β ∈ Rᵖ.
- c belongs to the row space of X, i.e., c is a linear combination of the rows of X.
- Xβ1 = Xβ2 implies cᵀβ1 = cᵀβ2.
Example: Beverage study data set
The row space of the design matrix X in this example is

R(X) = { a ∈ Rᵖ : a = c1 (1, 1, 0, 0, 0)ᵀ + c2 (1, 0, 1, 0, 0)ᵀ + c3 (1, 0, 0, 1, 0)ᵀ + c4 (1, 0, 0, 0, 1)ᵀ for some c1, c2, c3, c4 ∈ R }.

Hence cᵀβ is estimable exactly when c lies in this span; for example, c = (0, 0, 0, 1, −1)ᵀ (the third spanning vector minus the fourth) is estimable, whereas a single coefficient such as β3 (c = (0, 0, 0, 1, 0)ᵀ) is not.
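The second equivalent condition on the previous slide suggests a simple numerical check: c is in the row space of X exactly when appending c as an extra row does not increase the rank. A minimal sketch, assuming the X constructed earlier (the helper in_rowspace is introduced here only for illustration):

# Sketch: check estimability of c^T beta via the row-space condition
in_rowspace <- function(cvec, X) {
  qr(rbind(X, cvec))$rank == qr(X)$rank
}
in_rowspace(c(0, 0, 0, 1, -1), X)   # TRUE:  beta3 - beta4 is estimable
in_rowspace(c(0, 0, 0, 1,  0), X)   # FALSE: beta3 alone is not estimable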
Gauss-Markov model
Consider the Gauss-Markov model

Y = Xβ + ε,

where E(ε) = 0 and Var(ε) = σ² In.
Mean and variance of the least squares estimator
- For any estimable function Cβ, where C = (c1, · · · , cℓ)ᵀ is an ℓ × p matrix, the least squares estimator is

  Cβ̂ = C(XᵀX)⁻ XᵀY,

  where (XᵀX)⁻ is a generalized inverse of XᵀX.
- It is easy to show that Cβ̂ has mean Cβ and variance σ² C(XᵀX)⁻Cᵀ.
- In particular, if β itself is estimable (i.e., X is full rank), then E(β̂) = β and Var(β̂) = σ² (XᵀX)⁻¹.
Estimation of σ²
- Define the projection matrix PX = X(XᵀX)⁻Xᵀ.
- The residual from the least squares fit is ε̂ = Y − Xβ̂ = Y − PX Y = (In − PX)Y.
- It is easy to see that E(ε̂) = 0 and Var(ε̂) = σ² (In − PX).
- We can also check that E(ε̂ᵀε̂) = {n − rank(X)} σ². This suggests the unbiased estimator

  σ̂² = ε̂ᵀε̂ / {n − rank(X)}.
Example: Beverage study data set
Consider estimating the difference between the effects of red wine and water on the gene expression, namely cᵀβ = β3 − β4. Here c = (0, 0, 0, 1, −1)ᵀ. As we know, this is an estimable function. The corresponding LS estimate is

library(MASS)                               # for ginv()
cvec<-c(0,0,0,1,-1)
cbeta<-cvec%*%ginv(t(X)%*%X)%*%t(X)%*%y     # c^T (X^T X)^- X^T Y
cbeta
          [,1]
[1,] -0.052312
Example: Beverage study data set
The estimate of σ² is

epsilonhat<-y-X%*%ginv(t(X)%*%X)%*%t(X)%*%y   # residuals (I - P_X) y
n<-length(y)
rankx<-qr(X)$rank
SSE<-sum(epsilonhat^2)
MSE<-SSE/(n-rankx)                            # unbiased estimate of sigma^2
MSE
[1] 0.01953277
Two useful lemmas
The following two lemmas will be useful.

Lemma 1: Suppose that A is an n × n symmetric matrix with rank k. Assume Y ∼ N(µ, Σ) with Σ > 0. If AΣ is idempotent (i.e., AΣAΣ = AΣ), then YᵀAY ∼ χ²_k(µᵀAµ), the noncentral chi-square distribution with k degrees of freedom and noncentrality parameter µᵀAµ. In particular, if Aµ = 0, then YᵀAY ∼ χ²_k.

Lemma 2: Suppose that Y ∼ N(µ, σ² In) and BA = 0. (a) If A is symmetric, then YᵀAY and BY are independent. (b) If both A and B are symmetric, then YᵀAY and YᵀBY are independent.

For proofs of Lemmas 1 and 2, see Muirhead (1982) or Khuri (2010).
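As a quick sanity check of Lemma 1, here is a simulation sketch on made-up data (Z, nz, sigma are invented for this illustration, not taken from the beverage study): with Σ = σ² In and A = (In − PZ)/σ², the product AΣ = In − PZ is idempotent and Aµ = 0, so the quadratic form YᵀAY should follow a χ² distribution with n − rank(Z) degrees of freedom.

# Simulation sketch of Lemma 1 with A = (I - P_Z)/sigma^2 and mu = Z b
set.seed(2)
nz <- 10; sigma <- 1.5
Z <- matrix(rnorm(nz * 3), nz, 3)             # a made-up full-rank design
P <- Z %*% solve(t(Z) %*% Z) %*% t(Z)         # projection onto col(Z)
qf <- replicate(5000, {
  yy <- Z %*% c(1, -1, 2) + rnorm(nz, sd = sigma)
  drop(t(yy) %*% (diag(nz) - P) %*% yy) / sigma^2
})
mean(qf)          # close to the chi-square mean, n - rank(Z) = 7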
Statistical inference under Gauss-Markov model
Under the Gauss-Markov model, we would like to make inference on estimable functions Cβ and the error variance σ². We will discuss the following:
(a) confidence intervals and confidence regions;
(b) hypothesis testing;
(c) prediction and prediction intervals.
Confidence interval for σ²
By Lemma 1, it can be shown that ε̂ᵀε̂/σ² ∼ χ²_{n−rank(X)}. Therefore, a 1 − α confidence interval for σ² is

(SSE/χ²_{n−rank(X), α/2} , SSE/χ²_{n−rank(X), 1−α/2}),

where SSE = ε̂ᵀε̂ and χ²_{df, γ} denotes the upper-γ quantile of the χ²_{df} distribution.
Example: Beverage study data set
A 95% confidence interval for σ² is
low95sigma2<-SSE/qchisq(0.975,n-rankx)
upp95sigma2<-SSE/qchisq(0.025,n-rankx)
CI95sigma2<-c(low95sigma2,upp95sigma2)
CI95sigma2
[1] 0.01115224 0.04271664
Confidence interval for cᵀβ
Under the Gauss-Markov model,

(cᵀβ̂ − cᵀβ) / √{MSE · cᵀ(XᵀX)⁻c} ∼ t_{n−rank(X)},

where MSE = SSE/{n − rank(X)}.
A 1 − α confidence interval for cᵀβ is

cᵀβ̂ ± t_{n−rank(X), α/2} √{MSE · cᵀ(XᵀX)⁻c}.
Example: Beverage study data set
A 95% confidence interval for β3 − β4 is
sigmahat<-sqrt(MSE)
cvec<-c(0,0,0,1,-1)
cbeta<-cvec%*%ginv(t(X)%*%X)%*%t(X)%*%y
varcbeta<-t(cvec)%*%ginv(t(X)%*%X)%*%cvec     # c^T (X^T X)^- c
low95cbeta<-cbeta-qt(0.975,n-rankx)*sigmahat*sqrt(varcbeta)
upp95cbeta<-cbeta+qt(0.975,n-rankx)*sigmahat*sqrt(varcbeta)
CI95cbeta<-c(low95cbeta,upp95cbeta)
CI95cbeta
[1] -0.2301103  0.1254863
Hypothesis testing for H0: cᵀβ = u0
Consider testing

H0: cᵀβ = u0 vs. H1: cᵀβ ≠ u0.

An α-level rejection region is Rα = {|Tn| > t_{n−rank(X), α/2}}, where

Tn = (cᵀβ̂ − u0) / √{MSE · cᵀ(XᵀX)⁻c}.
Example: Beverage study data set
A test of H0: β3 = β4 vs. H1: β3 ≠ β4 (i.e., cᵀβ = β3 − β4 with u0 = 0) is
Tn<-cbeta/(sigmahat*sqrt(varcbeta))
pvalue<-2*(1-pt(abs(Tn),n-rankx))
pvalue
[,1]
[1,] 0.5442288
Hypothesis testing for H0 : Cβ = d
Suppose we want to test

H0: Cβ = d vs. H1: Cβ ≠ d,

where C is an ℓ × p matrix (ℓ ≤ p and rank(C) = ℓ) and d is a constant vector.
An α-level test rejects H0 if Fn > F_{ℓ, n−rank(X); α}, where

Fn = (SSH0/ℓ) / [SSE/{n − rank(X)}]   and   SSH0 = (Cβ̂ − d)ᵀ {C(XᵀX)⁻Cᵀ}⁻¹ (Cβ̂ − d).
Example: Beverage study data set
A test of H0: β1 = β2 = β3 = β4 vs. H1: βk ≠ βl for some k ≠ l (here d = 0) is

Cmat<-rbind(c(0,1,-1,0,0),c(0,0,1,-1,0),
            c(0,0,0,1,-1))
Cmatbeta<-Cmat%*%ginv(t(X)%*%X)%*%t(X)%*%y
CovCmatbeta<-Cmat%*%ginv(t(X)%*%X)%*%t(Cmat)
SSH0<-t(Cmatbeta)%*%solve(CovCmatbeta)%*%Cmatbeta
Fn<-(SSH0/3)/MSE
pvalF<-1-pf(Fn,3,n-rankx)
pvalF
[1,] 0.8142451
Prediction for a new observation
Suppose we would like to predict a new observation y∗ which is independent of Y and

y∗ ∼ N(cᵀβ, r σ²).

Because y∗ is independent of Y, the best linear unbiased prediction (BLUP) of y∗ is cᵀβ̂. (Why?)
Example: Beverage study data set
A prediction for a new observation from the red wine group,
namely y ∗ ∼ N(β0 + β3 , σ 2 ) is
cpred<-c(1,0,0,1,0)
predy<-t(cpred)%*%ginv(t(X)%*%X)%*%t(X)%*%y
predy
[,1]
[1,] 6.956268
Prediction interval
Because the BLUP of y∗ is cᵀβ̂ and

(cᵀβ̂ − y∗) / √{MSE · (r + cᵀ(XᵀX)⁻c)} ∼ t_{n−rank(X)},

a 1 − α prediction interval for y∗ is

cᵀβ̂ ± t_{n−rank(X), α/2} √{MSE · (r + cᵀ(XᵀX)⁻c)}.
Example: Beverage study data set
A 95% prediction interval for a new observation from the red
wine group, namely y ∗ ∼ N(β0 + β3 , σ 2 ) is
MSEpred<-MSE*(1+t(cpred)%*%ginv(t(X)%*%X)%*%cpred)   # r = 1 for one new observation
low95pred<-predy-qt(0.975,n-rankx)*sqrt(MSEpred)
upp95pred<-predy+qt(0.975,n-rankx)*sqrt(MSEpred)
predintvaly<-c(low95pred,upp95pred)
predintvaly
[1] 6.634619 7.277917