The Gauss-Markov Linear Model

The Gauss-Markov linear model is

    y = Xβ + ε,

where

    y is an n × 1 random vector of responses.

    X is an n × p matrix of constants with columns corresponding to explanatory variables. X is sometimes referred to as the design matrix.

    β is an unknown parameter vector in IR^p.

    ε is an n × 1 random vector of errors.

    E(ε) = 0 and Var(ε) = σ²I, where σ² is an unknown parameter in IR^+.

The Column Space of the Design Matrix

Xβ is a linear combination of the columns of X:

    Xβ = [x1, . . . , xp](β1, . . . , βp)' = β1 x1 + · · · + βp xp.

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by

    C(X) = {Xa : a ∈ IR^p}.

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ²I for some positive real number σ², i.e.,

    E(y) ∈ C(X) and Var(y) = σ²I, σ² ∈ IR^+.
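As a numerical aside (not part of the original slides), the following numpy sketch illustrates the two statements above: Xβ is the linear combination β1 x1 + · · · + βp xp of the columns of X, and a response simulated as y = Xβ + ε with Var(ε) = σ²I is one draw from the model. The design matrix, β, and σ used here are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small design matrix (n = 4, p = 2) and a parameter vector (illustrative values).
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
beta = np.array([2.0, -1.0])

# X @ beta is the linear combination beta_1 * x_1 + beta_2 * x_2 of the columns of X.
lin_comb = beta[0] * X[:, 0] + beta[1] * X[:, 1]
print(np.allclose(X @ beta, lin_comb))   # True

# Simulate one realization of y = X beta + eps with Var(eps) = sigma^2 I.
sigma = 0.5
eps = rng.normal(scale=sigma, size=X.shape[0])
y = X @ beta + eps
print(y)
```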
An Example Column Space

Suppose

    X = (1, 1, 1)'.

Then

    C(X) = {Xa : a ∈ IR¹}
         = {(1, 1, 1)' a1 : a1 ∈ IR}
         = {(a1, a1, a1)' : a1 ∈ IR},

the set of all 3 × 1 vectors whose entries are equal.
Another Example Column Space

Suppose

    X = [ 1 0
          1 0
          0 1
          0 1 ].

Then

    C(X) = {Xa : a ∈ IR²}
         = {X(a1, a2)' : a1, a2 ∈ IR}
         = {a1 (1, 1, 0, 0)' + a2 (0, 0, 1, 1)' : a1, a2 ∈ IR}
         = {(a1, a1, 0, 0)' + (0, 0, a2, a2)' : a1, a2 ∈ IR}
         = {(a1, a1, a2, a2)' : a1, a2 ∈ IR}.
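A quick way to check numerically whether a particular vector lies in a column space is to project it onto C(X) by least squares and look at the residual; the residual is zero exactly when the vector is in C(X). The sketch below (not from the original slides) does this for the design matrix above; the test vectors are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

def in_column_space(X, x, tol=1e-10):
    """Return True if x is (numerically) in C(X).

    x is in C(X) exactly when the least-squares residual min_a ||x - X a|| is zero.
    """
    a, *_ = np.linalg.lstsq(X, x, rcond=None)
    return np.linalg.norm(x - X @ a) < tol

print(in_column_space(X, np.array([3.0, 3.0, -1.0, -1.0])))  # True: of the form (a1, a1, a2, a2)'
print(in_column_space(X, np.array([1.0, 2.0, 0.0, 0.0])))    # False: first two entries differ
```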
Another Column Space Example

Now suppose

    X1 = [ 1 0
           1 0
           0 1
           0 1 ]

and

    X2 = [ 1 1 0
           1 1 0
           1 0 1
           1 0 1 ].

Then

    x ∈ C(X1)
    ⟹ x = X1 a for some a ∈ IR²
    ⟹ x = X2 (0, a1, a2)' for some a = (a1, a2)' ∈ IR²
    ⟹ x = X2 b for some b ∈ IR³
    ⟹ x ∈ C(X2).

Thus, C(X1) ⊆ C(X2).

Conversely,

    x ∈ C(X2)
    ⟹ x = X2 a for some a ∈ IR³
    ⟹ x = a1 (1, 1, 1, 1)' + a2 (1, 1, 0, 0)' + a3 (0, 0, 1, 1)' for some a ∈ IR³
    ⟹ x = (a1 + a2, a1 + a2, a1 + a3, a1 + a3)' for some a1, a2, a3 ∈ IR
    ⟹ x = X1 (a1 + a2, a1 + a3)' for some a1, a2, a3 ∈ IR
    ⟹ x = X1 b for some b ∈ IR²
    ⟹ x ∈ C(X1).

Thus, C(X2) ⊆ C(X1). Combined with C(X1) ⊆ C(X2) shown above, it follows that C(X1) = C(X2).

Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y).

We could, of course, use y itself to estimate E(y). y is obviously an unbiased estimator of E(y), but it is often not a very sensible estimator.

For example, suppose

    (y1, y2)' = (1, 1)' μ + (ε1, ε2)',

and we observe y = (6.1, 2.3)'. Should we estimate E(y) = (μ, μ)' by y = (6.1, 2.3)'?
The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y).

Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance).

This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that Ê(y) might be better notation).

By definition, ||y − ŷ|| = min_{z ∈ C(X)} ||y − z||, where ||a|| ≡ √(a1² + · · · + an²) for a = (a1, . . . , an)'.

Orthogonal Projection Matrices

It can be shown that:

1. ∀ y ∈ IR^n, ŷ = PX y, where PX is a unique n × n matrix known as an orthogonal projection matrix.

2. PX is idempotent: PX PX = PX.

3. PX is symmetric: PX' = PX.

4. PX X = X and X'PX = X'.

5. PX = X(X'X)⁻X', where (X'X)⁻ is any generalized inverse of X'X.
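These facts are easy to verify numerically. The sketch below (not part of the original slides) builds PX = X(X'X)⁻X' using numpy's Moore-Penrose pseudoinverse as the generalized inverse, for an arbitrary design matrix whose X'X happens to be singular, and checks idempotence, symmetry, and PX X = X.

```python
import numpy as np

# Design matrix whose columns are linearly dependent, so X'X is singular
# and a generalized inverse is needed; np.linalg.pinv supplies one.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])

XtX = X.T @ X
P = X @ np.linalg.pinv(XtX) @ X.T          # P_X = X (X'X)^- X'

y = np.array([6.1, 2.3, 4.0, 5.5])
y_hat = P @ y                              # orthogonal projection of y onto C(X)
print(y_hat)

print(np.allclose(P @ P, P))               # idempotent: P_X P_X = P_X
print(np.allclose(P.T, P))                 # symmetric:  P_X' = P_X
print(np.allclose(P @ X, X))               # P_X X = X
print(np.allclose(X.T @ P, X.T))           # X' P_X = X'
```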
Generalized Inverses

G is a generalized inverse of a matrix A if AGA = A. We usually denote a generalized inverse of A by A⁻.

If A is nonsingular, i.e., if A⁻¹ exists, then A⁻¹ is the one and only generalized inverse of A:

    AA⁻¹A = IA = A.

If A is singular, i.e., if A⁻¹ does not exist, then there are infinitely many generalized inverses of A.

Why Does PX X = X?

Each column xj of X lies in C(X), and the orthogonal projection of a vector that is already in C(X) is that vector itself, so PX xj = xj. Therefore

    PX X = PX [x1, . . . , xp]
         = [PX x1, . . . , PX xp]
         = [x1, . . . , xp]
         = X.
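For intuition (not from the original slides), the sketch below checks the defining property AGA = A when G is the Moore-Penrose pseudoinverse of a singular matrix; the particular rank-2 matrix is an arbitrary example.

```python
import numpy as np

# A singular matrix (row 1 = row 2 + row 3, so rank 2 and no ordinary inverse).
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
print(np.linalg.matrix_rank(A))   # 2 < 3, so A is singular

# The Moore-Penrose pseudoinverse is one generalized inverse of A.
G = np.linalg.pinv(A)
print(np.allclose(A @ G @ A, A))  # AGA = A holds

# For a nonsingular matrix, the only generalized inverse is the ordinary inverse.
B = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np.allclose(B @ np.linalg.inv(B) @ B, B))
```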
Invariance of PX = X(X'X)⁻X' to the Choice of (X'X)⁻

If X'X is nonsingular, then PX = X(X'X)⁻¹X' because the only generalized inverse of X'X is (X'X)⁻¹.

If X'X is singular, then PX = X(X'X)⁻X', and the choice of the generalized inverse (X'X)⁻ does not matter because PX = X(X'X)⁻X' turns out to be the same matrix no matter which generalized inverse of X'X is used.

Suppose (X'X)₁⁻ and (X'X)₂⁻ are any two generalized inverses of X'X. Then

    X(X'X)₁⁻X' = X(X'X)₂⁻X'X(X'X)₁⁻X' = X(X'X)₂⁻X'

(both equalities use PX X = X and X'PX = X' with PX written in terms of each generalized inverse).

An Example Orthogonal Projection Matrix

Suppose

    (y1, y2)' = (1, 1)' μ + (ε1, ε2)',

so that X = (1, 1)', and we observe y = (6.1, 2.3)'. Then

    X(X'X)⁻X' = (1, 1)' ([1 1](1, 1)')⁻ [1 1]
              = (1, 1)' [2]⁻¹ [1 1]
              = (1/2) (1, 1)' [1 1]
              = [ 1/2 1/2
                  1/2 1/2 ].

An Example Orthogonal Projection

Thus, the orthogonal projection of y = (6.1, 2.3)' onto the column space of X = (1, 1)' is

    PX y = [ 1/2 1/2
             1/2 1/2 ] (6.1, 2.3)' = (4.2, 4.2)'.
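The sketch below (not part of the original slides) reproduces the 2 × 2 projection matrix and the projected value (4.2, 4.2)' computed above, and then illustrates the invariance result with an arbitrary design matrix whose X'X is singular, using two different generalized inverses.

```python
import numpy as np

# Part 1: the 2 x 2 example from the slides.
X = np.array([[1.0],
              [1.0]])
y = np.array([6.1, 2.3])
P = X @ np.linalg.inv(X.T @ X) @ X.T       # X'X = [2] is nonsingular here
print(P)                                   # [[0.5, 0.5], [0.5, 0.5]]
print(P @ y)                               # [4.2, 4.2]

# Part 2: invariance of X (X'X)^- X' to the choice of generalized inverse,
# illustrated with a design matrix whose X'X is singular.
X2 = np.array([[1.0, 1.0, 0.0],
               [1.0, 1.0, 0.0],
               [1.0, 0.0, 1.0],
               [1.0, 0.0, 1.0]])
A = X2.T @ X2                              # singular (rank 2)

G1 = np.linalg.pinv(A)                     # one generalized inverse
v = np.array([1.0, -1.0, -1.0])            # a null vector of A (A @ v = 0)
G2 = G1 + np.outer(v, v)                   # another generalized inverse: A G2 A = A
print(np.allclose(A @ G2 @ A, A))          # True

P1 = X2 @ G1 @ X2.T
P2 = X2 @ G2 @ X2.T
print(np.allclose(P1, P2))                 # True: same projection matrix either way
```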
Why is PX called an orthogonal projection matrix?

[A sequence of figures illustrates the answer for a simple two-dimensional example: each plot shows the column space C(X) as a line through the origin, the observed vector y, the orthogonal projection ŷ = PX y lying on C(X), and the residual vector y − ŷ, which meets C(X) at a right angle.]
The vectors ŷ and y − ŷ are orthogonal; the angle between ŷ and y − ŷ is 90°:

    ŷ'(y − ŷ) = ŷ'(y − PX y) = ŷ'(I − PX)y
              = (PX y)'(I − PX)y = y'PX'(I − PX)y = y'PX(I − PX)y
              = y'(PX − PX PX)y = y'(PX − PX)y = 0.

Optimality of ŷ as an Estimator of E(y)

ŷ is an unbiased estimator of E(y):

    E(ŷ) = E(PX y) = PX E(y) = PX Xβ = Xβ = E(y).

It can be shown that ŷ = PX y is the best estimator of E(y) in the class of linear unbiased estimators, i.e., estimators of the form My for M satisfying

    E(My) = E(y) ∀ β ∈ IR^p ⟺ MXβ = Xβ ∀ β ∈ IR^p ⟺ MX = X.

Under the Normal Theory Gauss-Markov Linear Model, ŷ = PX y is best among all unbiased estimators of E(y).
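As an illustration (not part of the original slides), the Monte Carlo sketch below uses the earlier two-observation example with E(y) = (μ, μ)' and arbitrary values μ = 4 and σ = 1: it checks ŷ'(y − ŷ) = 0 for one draw and then compares the average squared error of ŷ = PX y with that of y as estimators of E(y).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-observation example from the earlier slides: E(y) = (mu, mu)'.
X = np.array([[1.0],
              [1.0]])
mu, sigma = 4.0, 1.0
Ey = X @ np.array([mu])                       # true mean vector (mu, mu)'

P = X @ np.linalg.inv(X.T @ X) @ X.T          # P_X

# Orthogonality check for one simulated y: y_hat' (y - y_hat) = 0.
y = Ey + rng.normal(scale=sigma, size=2)
y_hat = P @ y
print(np.isclose(y_hat @ (y - y_hat), 0.0))   # True

# Monte Carlo comparison of the two estimators of E(y).
n_sim = 100_000
Y = Ey + rng.normal(scale=sigma, size=(n_sim, 2))
Yhat = Y @ P.T                                # P_X y for each simulated y (P is symmetric)
mse_y    = np.mean(np.sum((Y    - Ey) ** 2, axis=1))
mse_yhat = np.mean(np.sum((Yhat - Ey) ** 2, axis=1))
print(mse_y, mse_yhat)                        # about 2 * sigma^2 versus sigma^2
```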
Ordinary Least Squares (OLS) Estimation of E(y)

OLS: Find a vector b* ∈ IR^p such that Q(b*) ≤ Q(b) ∀ b ∈ IR^p, where

    Q(b) ≡ Σ_{i=1}^n (yi − x(i)'b)²

and x(i)' denotes the ith row of X.

Note that

    Q(b) = Σ_{i=1}^n (yi − x(i)'b)² = (y − Xb)'(y − Xb) = ||y − Xb||².

To minimize this sum of squares, we need to choose b* ∈ IR^p such that Xb* is the point in C(X) that is closest to y.

In other words, we need to choose b* such that Xb* = PX y = X(X'X)⁻X'y.

Clearly, choosing b* = (X'X)⁻X'y will work.

Ordinary Least Squares and the Normal Equations

Often calculus is used to show that Q(b*) ≤ Q(b) ∀ b ∈ IR^p if and only if b* is a solution to the normal equations:

    X'Xb = X'y.

If X'X is nonsingular, multiplying both sides of the normal equations by (X'X)⁻¹ shows that the only solution to the normal equations is b* = (X'X)⁻¹X'y.

If X'X is singular, there are infinitely many solutions, which include (X'X)⁻X'y for every choice of generalized inverse of X'X:

    X'X[(X'X)⁻X'y] = X'[X(X'X)⁻X']y = X'PX y = X'y.

Henceforth, we will use β̂ to denote any solution to the normal equations.
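A minimal numpy sketch of the nonsingular case (not part of the original slides; the simple-regression design and data are arbitrary): it solves the normal equations directly and confirms that Xb* equals the projection PX y and agrees with numpy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)

# A simple-regression design with a nonsingular X'X.
n = 10
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])          # intercept and slope columns
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X b = X'y.
b_star = np.linalg.solve(X.T @ X, X.T @ y)

# Q(b) = ||y - Xb||^2 is minimized at b_star, and X b_star is the projection P_X y.
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(X @ b_star, P @ y))         # True

# np.linalg.lstsq minimizes ||y - Xb||^2 directly and agrees with b_star.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_star, b_lstsq))           # True
```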
Ordinary Least Squares Estimator of E(y) = Xβ

We call

    Xβ̂ = PX Xβ̂ = X(X'X)⁻X'Xβ̂ = X(X'X)⁻X'y = PX y = ŷ

the OLS estimator of E(y) = Xβ (the third equality uses the fact that β̂ solves the normal equations, X'Xβ̂ = X'y).

It might be more appropriate to put the hat over the entire product Xβ, rather than writing Xβ̂, because we are estimating Xβ rather than pre-multiplying an estimator of β by X.

As we shall soon see, it does not make sense to estimate β when X'X is singular. However, it does make sense to estimate E(y) = Xβ whether X'X is singular or nonsingular.
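This closing point can be seen numerically. In the sketch below (not part of the original slides; the design and response values are arbitrary), X'X is singular, so the normal equations have infinitely many solutions: two different solutions β̂ disagree about β, yet they give exactly the same Xβ̂ = ŷ.

```python
import numpy as np

# A design matrix whose X'X is singular (intercept plus two group indicators).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
y = np.array([6.1, 2.3, 4.0, 5.5])

XtX, Xty = X.T @ X, X.T @ y

# One solution to the normal equations, built from a generalized inverse.
beta_hat_1 = np.linalg.pinv(XtX) @ Xty
print(np.allclose(XtX @ beta_hat_1, Xty))            # solves X'X b = X'y

# Adding any multiple of a null vector of X gives another solution.
v = np.array([1.0, -1.0, -1.0])                      # X @ v = 0
beta_hat_2 = beta_hat_1 + 3.0 * v
print(np.allclose(XtX @ beta_hat_2, Xty))            # also solves the normal equations

# The two solutions differ, but they estimate E(y) = X beta identically.
print(np.allclose(beta_hat_1, beta_hat_2))           # False
print(np.allclose(X @ beta_hat_1, X @ beta_hat_2))   # True: same X beta_hat = y_hat
```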