Estimation of the Response Mean
Copyright 2012 Dan Nettleton (Iowa State University), Statistics 511
The Gauss-Markov Linear Model
y = Xβ + ε, where

y is an n × 1 random vector of responses.

X is an n × p matrix of constants with columns corresponding to explanatory variables. X is sometimes referred to as the design matrix.

β is an unknown parameter vector in ℝ^p.

ε is an n × 1 random vector of errors.

E(ε) = 0 and Var(ε) = σ²I, where σ² is an unknown parameter in ℝ^+.
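As a concrete, non-authoritative illustration of the model, here is a minimal NumPy sketch that simulates one realization of y = Xβ + ε. The particular X, β, and σ are arbitrary choices for the example, and normal errors are just one convenient distribution satisfying E(ε) = 0 and Var(ε) = σ²I.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example design: n = 4 observations, p = 2 group-indicator columns.
X = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
beta = np.array([2.0, 5.0])     # unknown in practice; fixed here only to simulate
sigma = 1.0                     # sqrt(sigma^2), also unknown in practice

# Normal errors are one convenient choice with E(eps) = 0 and Var(eps) = sigma^2 I.
eps = rng.normal(loc=0.0, scale=sigma, size=X.shape[0])
y = X @ beta + eps              # the Gauss-Markov model: y = X beta + eps

print("E(y)       =", X @ beta)
print("observed y =", y)
```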
The Column Space of the Design Matrix
Xβ is a linear combination of the columns of X:

$$ X\beta = [x_1, \ldots, x_p] \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} = \beta_1 x_1 + \cdots + \beta_p x_p. $$

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by C(X) = {Xa : a ∈ ℝ^p}.

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ²I for some positive real number σ², i.e., E(y) ∈ C(X) and Var(y) = σ²I, σ² ∈ ℝ^+.
An Example Column Space
$$ X = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \implies C(X) = \{Xa : a \in \mathbb{R}\} = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix} a_1 : a_1 \in \mathbb{R} \right\} = \left\{ \begin{bmatrix} a_1 \\ a_1 \end{bmatrix} : a_1 \in \mathbb{R} \right\}. $$
Another Example Column Space

$$ X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \implies C(X) = \left\{ \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R} \right\}
= \left\{ a_1 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_2 \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} : a_1, a_2 \in \mathbb{R} \right\}
= \left\{ \begin{bmatrix} a_1 \\ a_1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R} \right\}
= \left\{ \begin{bmatrix} a_1 \\ a_1 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R} \right\}. $$
Different Matrices with the Same Column Space

$$ W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} $$

$$
\begin{aligned}
x \in C(W) &\implies x = Wa \text{ for some } a \in \mathbb{R}^2 \\
&\implies x = X\begin{bmatrix} 0 \\ a \end{bmatrix} \text{ for some } a \in \mathbb{R}^2 \\
&\implies x = Xb \text{ for some } b \in \mathbb{R}^3 \\
&\implies x \in C(X).
\end{aligned}
$$

Thus, C(W) ⊆ C(X).

$$ W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} $$

$$
\begin{aligned}
x \in C(X) &\implies x = Xa \text{ for some } a \in \mathbb{R}^3 \\
&\implies x = a_1\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + a_2\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_3\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} \text{ for some } a \in \mathbb{R}^3 \\
&\implies x = \begin{bmatrix} a_1 + a_2 \\ a_1 + a_2 \\ a_1 + a_3 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R} \\
&\implies x = W\begin{bmatrix} a_1 + a_2 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R} \\
&\implies x = Wb \text{ for some } b \in \mathbb{R}^2 \\
&\implies x \in C(W).
\end{aligned}
$$

Thus, C(X) ⊆ C(W).

We previously showed that C(W) ⊆ C(X).

Thus, it follows that C(W) = C(X).
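The conclusion C(W) = C(X) can also be confirmed numerically (a sketch, not part of the original slides): two matrices have the same column space exactly when rank(W) = rank(X) = rank([W X]), or equivalently when their orthogonal projection matrices coincide. Here `numpy.linalg.pinv` serves as one generalized inverse.

```python
import numpy as np

W = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

# Rank criterion: C(W) = C(X) iff rank(W) = rank(X) = rank([W X]).
print(np.linalg.matrix_rank(W),
      np.linalg.matrix_rank(X),
      np.linalg.matrix_rank(np.hstack([W, X])))    # 2 2 2

# Equivalent check: the orthogonal projectors onto C(W) and C(X) coincide.
P_W = W @ np.linalg.pinv(W.T @ W) @ W.T            # pinv = one generalized inverse
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
print(np.allclose(P_W, P_X))                       # True
```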
Estimation of E(y)
A fundamental goal of linear model analysis is to estimate E(y).
We could, of course, use y to estimate E(y).
y is obviously an unbiased estimator of E(y), but it is often not a
very sensible estimator.
For example, suppose

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \quad \text{and we observe } y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}. $$

Should we estimate E(y) = (µ, µ)′ by y = (6.1, 2.3)′?
Estimation of E(y)
The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y).

Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance).

This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that $\widehat{E(y)}$ might be better notation).

By definition, $\|y - \hat{y}\| = \min_{z \in C(X)} \|y - z\|$, where $\|a\| \equiv \sqrt{\sum_{i=1}^{n} a_i^2}$.
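A quick empirical check of this closest-point property (a sketch with an arbitrary X and y, using the orthogonal projection matrix PX introduced on the following slides): no point Xb in C(X) should come closer to y than ŷ.

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[1., 1., 0.],      # grouped design from the earlier examples
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = np.array([6.1, 2.3, 4.0, 5.5])        # arbitrary observed response

P_X = X @ np.linalg.pinv(X.T @ X) @ X.T   # orthogonal projection onto C(X)
y_hat = P_X @ y
best = np.linalg.norm(y - y_hat)

# No randomly chosen point Xb in C(X) should beat the projection.
dists = [np.linalg.norm(y - X @ rng.normal(scale=10.0, size=X.shape[1]))
         for _ in range(10_000)]
print(best, min(dists))
print(best <= min(dists) + 1e-12)         # True
```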
Orthogonal Projection Matrices
In Homework Assignment 2, we will formally prove the following:

1. ∀ y ∈ ℝ^n, ŷ = PX y, where PX is a unique n × n matrix known as an orthogonal projection matrix.
2. PX is idempotent: PX PX = PX.
3. PX is symmetric: PX = PX′.
4. PX X = X and X′PX = X′.
5. PX = X(X′X)⁻X′, where (X′X)⁻ is any generalized inverse of X′X.
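For a particular design matrix these properties are easy to verify numerically; the sketch below (not from the original slides) uses `numpy.linalg.pinv(X.T @ X)` as one generalized inverse of X′X.

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

P_X = X @ np.linalg.pinv(X.T @ X) @ X.T     # property 5 with one choice of (X'X)^-

print(np.allclose(P_X @ P_X, P_X))          # property 2: idempotent
print(np.allclose(P_X, P_X.T))              # property 3: symmetric
print(np.allclose(P_X @ X, X))              # property 4: P_X X = X
print(np.allclose(X.T @ P_X, X.T))          # property 4: X' P_X = X'
```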
Why Does PX X = X?
$$ P_X X = P_X[x_1, \ldots, x_p] = [P_X x_1, \ldots, P_X x_p] = [x_1, \ldots, x_p] = X, $$

since each column x_j of X lies in C(X), and the orthogonal projection onto C(X) leaves any point of C(X) unchanged.
Generalized Inverses
G is a generalized inverse of a matrix A if AGA = A.

We usually denote a generalized inverse of A by A⁻.

If A is nonsingular, i.e., if A⁻¹ exists, then A⁻¹ is the one and only generalized inverse of A: AA⁻¹A = AI = IA = A.

If A is singular, i.e., if A⁻¹ does not exist, then there are infinitely many generalized inverses of A.
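As a concrete illustration (a sketch, not from the slides), the Moore-Penrose pseudoinverse returned by `numpy.linalg.pinv` is one particular generalized inverse, and for a singular A it is easy to exhibit others, e.g. via the standard fact that G + (I − GA)U is again a generalized inverse of A for any U.

```python
import numpy as np

rng = np.random.default_rng(2)

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
A = X.T @ X                          # 3 x 3 and singular (rank 2)
print(np.linalg.matrix_rank(A))      # 2

G = np.linalg.pinv(A)                # one generalized inverse of A
print(np.allclose(A @ G @ A, A))     # AGA = A holds

# A singular matrix has infinitely many generalized inverses; for instance,
# G + (I - G A) U satisfies the definition for any U (checked here for a random U).
U = rng.normal(size=A.shape)
G2 = G + (np.eye(A.shape[0]) - G @ A) @ U
print(np.allclose(A @ G2 @ A, A))    # True
print(np.allclose(G, G2))            # False: a genuinely different generalized inverse
```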
An Algorithm for Finding a Generalized Inverse of a Matrix A

1. Find any r × r nonsingular submatrix of A, where r = rank(A). Call this matrix W.
2. Invert and transpose W, i.e., compute (W⁻¹)′.
3. Replace each element of W in A with the corresponding element of (W⁻¹)′.
4. Replace all other elements of A with zeros.
5. Transpose the resulting matrix to obtain G, a generalized inverse of A.
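Below is a rough Python sketch of this algorithm for small matrices. It simply searches over r × r submatrices until it finds a nonsingular one, so it is meant only to illustrate the steps, not to be an efficient or numerically robust implementation.

```python
import numpy as np
from itertools import combinations

def generalized_inverse(A, tol=1e-10):
    """Return one generalized inverse of A via the submatrix algorithm above.
    Brute-force search over submatrices: suitable for small examples only."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    r = np.linalg.matrix_rank(A, tol=tol)
    # Step 1: find any r x r nonsingular submatrix W of A.
    for rows in combinations(range(m), r):
        for cols in combinations(range(n), r):
            W = A[np.ix_(rows, cols)]
            if abs(np.linalg.det(W)) > tol:
                # Steps 2-4: place (W^{-1})' in W's positions, zeros elsewhere.
                B = np.zeros_like(A)
                B[np.ix_(rows, cols)] = np.linalg.inv(W).T
                # Step 5: transpose to obtain G.
                return B.T
    raise ValueError("no nonsingular r x r submatrix found")

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
A = X.T @ X                        # singular, so A has no ordinary inverse
G = generalized_inverse(A)
print(G)
print(np.allclose(A @ G @ A, A))   # True: AGA = A
```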
Invariance of PX = X(X′X)⁻X′ to Choice of (X′X)⁻

If X′X is nonsingular, then PX = X(X′X)⁻¹X′ because the only generalized inverse of X′X is (X′X)⁻¹.

If X′X is singular, then PX = X(X′X)⁻X′ and the choice of the generalized inverse (X′X)⁻ does not matter because PX = X(X′X)⁻X′ will turn out to be the same matrix no matter which generalized inverse of X′X is used.

To see this, suppose (X′X)⁻₁ and (X′X)⁻₂ are any two generalized inverses of X′X. Then

$$ X(X'X)_1^- X' = X(X'X)_2^- X'X(X'X)_1^- X' = X(X'X)_2^- X', $$

where the first equality uses X(X′X)⁻₂X′X = X and the second uses X′X(X′X)⁻₁X′ = X′ (the properties PX X = X and X′PX = X′ listed earlier).
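A numerical sketch of this invariance (not from the original slides): the two generalized inverses of X′X below, the pseudoinverse and the matrix produced by the submatrix algorithm above with W equal to the leading 2 × 2 block of X′X, differ from each other, yet they give the same PX. The design matrix is the grouped example used earlier.

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
A = X.T @ X                                   # X'X, singular here (rank 2)

G1 = np.linalg.pinv(A)                        # one generalized inverse of X'X
G2 = np.array([[ 0.5, -0.5, 0.],              # another one, from the submatrix
               [-0.5,  1.0, 0.],              # algorithm with W = the leading
               [ 0.0,  0.0, 0.]])             # 2 x 2 block of X'X

print(np.allclose(A @ G1 @ A, A), np.allclose(A @ G2 @ A, A))   # True True
print(np.allclose(G1, G2))                                      # False

P1 = X @ G1 @ X.T
P2 = X @ G2 @ X.T
print(np.allclose(P1, P2))    # True: P_X is the same for either choice of (X'X)^-
```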
An Example Orthogonal Projection Matrix
Suppose

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \quad \text{and we observe } y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}. $$

Then

$$
X(X'X)^- X' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left([1\ 1]\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)^- [1\ 1]
= \begin{bmatrix} 1 \\ 1 \end{bmatrix}[2]^{-1}[1\ 1]
= \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}[1\ 1]
= \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}.
$$
An Example Orthogonal Projection
Thus, the orthogonal projection of y = (6.1, 2.3)′ onto the column space of X = (1, 1)′ is

$$ P_X y = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}\begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix} = \begin{bmatrix} 4.2 \\ 4.2 \end{bmatrix}. $$
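The example in the last two slides can be reproduced in a few lines (a sketch using `numpy.linalg.pinv` for the generalized inverse):

```python
import numpy as np

X = np.array([[1.],
              [1.]])
y = np.array([6.1, 2.3])

P_X = X @ np.linalg.pinv(X.T @ X) @ X.T   # [[0.5, 0.5], [0.5, 0.5]]
y_hat = P_X @ y                           # [4.2, 4.2]
print(P_X)
print(y_hat)
```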
Why is PX called an orthogonal projection matrix?

Suppose X = (1, 2)′ and y = (3, 4)′.

[Figure: a sequence of plots in ℝ^2 showing the point X, the line C(X) through the origin and X, the observed point y off that line, the orthogonal projection ŷ of y onto C(X), and the residual vector y − ŷ connecting ŷ to y.]
Why is PX called an orthogonal projection matrix?
The angle between ŷ and y − ŷ is 90°.

The vectors ŷ and y − ŷ are orthogonal:

$$
\begin{aligned}
\hat{y}'(y - \hat{y}) &= \hat{y}'(y - P_X y) = \hat{y}'(I - P_X)y \\
&= (P_X y)'(I - P_X)y = y'P_X'(I - P_X)y \\
&= y'P_X(I - P_X)y = y'(P_X - P_X P_X)y \\
&= y'(P_X - P_X)y = 0.
\end{aligned}
$$
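A floating-point spot check of this orthogonality, with an arbitrary simulated X and y (a sketch, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(4)

X = rng.normal(size=(10, 3))                 # arbitrary simulated design matrix
y = rng.normal(size=10)                      # arbitrary simulated response

P_X = X @ np.linalg.pinv(X.T @ X) @ X.T      # orthogonal projection onto C(X)
y_hat = P_X @ y

inner = y_hat @ (y - y_hat)                  # zero in exact arithmetic
print(inner)
print(np.isclose(inner, 0.0, atol=1e-10))    # True up to floating-point error
```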
Optimality of ŷ as an Estimator of E(y)
ŷ is an unbiased estimator of E(y):
E(ŷ) = E(PX y) = PX E(y) = PX Xβ = Xβ = E(y).
It can be shown that ŷ = PX y is the best estimator of E(y) in the
class of linear unbiased estimators, i.e., estimators of the form My
for M satisfying
E(My) = E(y) ∀ β ∈ ℝ^p ⟺ MXβ = Xβ ∀ β ∈ ℝ^p ⟺ MX = X.
Under the Gauss-Markov Linear Model, ŷ = PX y is best among all
unbiased estimators of E(y).
Ordinary Least Squares (OLS) Estimation of E(y)
OLS: Find a vector b* ∈ ℝ^p such that Q(b*) ≤ Q(b) ∀ b ∈ ℝ^p, where

$$ Q(b) \equiv \sum_{i=1}^{n} (y_i - x_{(i)}'b)^2. $$

Note that

$$ Q(b) = \sum_{i=1}^{n} (y_i - x_{(i)}'b)^2 = (y - Xb)'(y - Xb) = \|y - Xb\|^2. $$

To minimize this sum of squares, we need to choose b* ∈ ℝ^p such that Xb* will be the point in C(X) that is closest to y.

In other words, we need to choose b* such that Xb* = PX y = X(X′X)⁻X′y.

Clearly, choosing b* = (X′X)⁻X′y will work.
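A sketch of the OLS computation for the grouped design used earlier, with an arbitrary y: `numpy.linalg.pinv` supplies one generalized inverse, and `numpy.linalg.lstsq` serves as an independent least-squares cross-check.

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = np.array([6.1, 2.3, 4.0, 5.5])               # arbitrary response vector

b_star = np.linalg.pinv(X.T @ X) @ X.T @ y       # b* = (X'X)^- X'y
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
print(np.allclose(X @ b_star, P_X @ y))          # X b* is the projection of y

# lstsq also minimizes ||y - Xb||^2; its fitted values must agree with X b*.
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ b_ls, X @ b_star))         # True, even if b_ls differs from b*
```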
Ordinary Least Squares and the Normal Equations
It can be shown that Q(b*) ≤ Q(b) ∀ b ∈ ℝ^p if and only if b* is a solution to the normal equations: X′Xb = X′y.

If X′X is nonsingular, multiplying both sides of the normal equations by (X′X)⁻¹ shows that the only solution to the normal equations is b* = (X′X)⁻¹X′y.

If X′X is singular, there are infinitely many solutions, which include (X′X)⁻X′y for all choices of generalized inverse of X′X:

$$ X'X[(X'X)^- X'y] = X'[X(X'X)^- X']y = X'P_X y = X'y. $$

Henceforth, we will use β̂ to denote any solution to the normal equations.
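When X′X is singular the normal equations have many solutions, but they all produce the same fitted vector Xβ̂. The sketch below checks this with two different generalized inverses (the second is the matrix obtained earlier from the submatrix algorithm) and an arbitrary y.

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = np.array([6.1, 2.3, 4.0, 5.5])           # arbitrary response vector
A, b = X.T @ X, X.T @ y                      # normal equations: A beta = b

beta1 = np.linalg.pinv(A) @ b                # solution from one generalized inverse
G2 = np.array([[ 0.5, -0.5, 0.],             # a different generalized inverse of X'X
               [-0.5,  1.0, 0.],             # (from the submatrix algorithm earlier)
               [ 0.0,  0.0, 0.]])
beta2 = G2 @ b                               # solution from another generalized inverse

print(np.allclose(A @ beta1, b), np.allclose(A @ beta2, b))   # both solve X'Xb = X'y
print(np.allclose(beta1, beta2))             # False: beta-hat is not unique here
print(np.allclose(X @ beta1, X @ beta2))     # True: the fitted values X beta-hat agree
```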
Ordinary Least Squares Estimator of E(y) = Xβ
We call Xβ̂ = PX Xβ̂ = X(X′X)⁻X′Xβ̂ = X(X′X)⁻X′y = PX y = ŷ the OLS estimator of E(y) = Xβ.

It might be more appropriate to use $\widehat{X\beta}$ rather than Xβ̂ to denote our estimator because we are estimating Xβ rather than pre-multiplying an estimator of β by X.

As we shall soon see, it does not make sense to estimate β when X′X is singular.

However, it does make sense to estimate E(y) = Xβ whether X′X is singular or nonsingular.