Geometry of the Gauss-Markov Linear Model

Statistics 511, Dept. of Statistics, Iowa State University (copyright 2011)

Reminder from the last section of the notes: y = Xβ + ε.

Important pieces of information for what follows:

X is sometimes referred to as the design matrix. It is an n × p matrix of constants with columns corresponding to explanatory variables.

β is an unknown parameter vector in IRᵖ.

E(y) ∈ C(X) and Var(y) = σ²I, σ² ∈ IR⁺.

We saw two possible X matrices for the t-test. This section focuses on the question: does it matter which X we use?

The Column Space of the Design Matrix

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ²I for some positive real number σ².

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by C(X) = {Xa : a ∈ IRᵖ}.

Xβ is a linear combination of the columns of X: Xβ = [x1, …, xp][β1; …; βp] = β1 x1 + ⋯ + βp xp.

An Example Column Space

X = [1; 1] (writing matrices row by row, with rows separated by semicolons)

=⇒ C(X) = {Xa1 : a1 ∈ IR} = {[1; 1] a1 : a1 ∈ IR} = {[a1; a1] : a1 ∈ IR}.

What does this column space “look like”?

[Figure: the column space of X = [1; 1] is the line through the origin and the point (1, 1) in the (X1, X2) plane; both axes run from −1.0 to 1.0.]
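The notes contain no code, but the definition is easy to experiment with. Below is a minimal numpy sketch (added for illustration, not part of the original slides; the test vectors v_in and v_out are made up): a vector v lies in C(X) exactly when Xb = v has a solution, which np.linalg.lstsq can certify via a zero residual.

```python
import numpy as np

# Column space sketch for X = [1; 1]: every Xa is a point (a, a),
# so C(X) is the 45-degree line through the origin in IR^2.
X = np.array([[1.0], [1.0]])

for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    v = X @ np.array([a])          # a point in C(X)
    print(a, v)                    # v is always (a, a)

# A vector v is in C(X) exactly when X b = v has a solution b;
# lstsq finds the best b, and X b reproducing v certifies membership.
v_in = np.array([0.7, 0.7])        # made-up vector on the line
v_out = np.array([1.0, -1.0])      # made-up vector off the line
for v in (v_in, v_out):
    b, res, rank, _ = np.linalg.lstsq(X, v, rcond=None)
    print(v, "in C(X)?", np.allclose(X @ b, v))
```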
Another Example Column Space

X = [1 0; 1 0; 0 1; 0 1]

=⇒ C(X) = {X [a1; a2] : a ∈ IR²}
= {a1 [1; 1; 0; 0] + a2 [0; 0; 1; 1] : a1, a2 ∈ IR}
= {[a1; a1; a2; a2] : a1, a2 ∈ IR}.

What is this column space? A plane “living” in IR⁴.

A Third Column Space Example

X2 = [1 1 0; 1 1 0; 1 0 1; 1 0 1]

x ∈ C(X2) =⇒ x = X2 a for some a ∈ IR³
=⇒ x = a1 [1; 1; 1; 1] + a2 [1; 1; 0; 0] + a3 [0; 0; 1; 1] for some a ∈ IR³
=⇒ x = [a1 + a2; a1 + a2; a1 + a3; a1 + a3] for some a1, a2, a3 ∈ IR.

This is also a plane in IR⁴. Is it the same plane?

Proving that two column spaces, C(X1) and C(X2), are the same

Concept:

If you can start with x ∈ C(X1) and derive x = X2 b, this implies x ∈ C(X2) and C(X1) ⊆ C(X2).

N.B. Not (at least yet) C(X1) = C(X2), because there may be some b for which X2 b is not in C(X1).

If you can also show C(X2) ⊆ C(X1), then C(X1) ⊆ C(X2) and C(X2) ⊆ C(X1) =⇒ C(X1) = C(X2).

The next few slides have the details.

Proving two column spaces are the same (continued)

X1 = [1 0; 1 0; 0 1; 0 1] and X2 = [1 1 0; 1 1 0; 1 0 1; 1 0 1].

x ∈ C(X1) =⇒ x = X1 a for some a ∈ IR²
=⇒ x = X2 [0; a] for some a ∈ IR²
=⇒ x = X2 b for some b ∈ IR³
=⇒ x ∈ C(X2).

Thus, C(X1) ⊆ C(X2).
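Here is a small numpy check of this containment argument, using the X1 and X2 above (a sketch added for illustration, not part of the original notes):

```python
import numpy as np

# The two design matrices from the slides above.
X1 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X2 = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)

# C(X1) is contained in C(X2) iff every column of X1 can be written as X2 b.
# Solve X2 B = X1 in the least-squares sense; X2 B reproducing X1 exactly
# certifies the containment.
B, *_ = np.linalg.lstsq(X2, X1, rcond=None)
print(np.allclose(X2 @ B, X1))   # True: C(X1) ⊆ C(X2)
```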
Proving two column spaces are the same (continued)

x ∈ C(X2) =⇒ x = X2 a for some a ∈ IR³
=⇒ x = a1 [1; 1; 1; 1] + a2 [1; 1; 0; 0] + a3 [0; 0; 1; 1] for some a ∈ IR³
=⇒ x = [a1 + a2; a1 + a2; a1 + a3; a1 + a3] for some a1, a2, a3 ∈ IR
=⇒ x = X1 [a1 + a2; a1 + a3] for some a1, a2, a3 ∈ IR
=⇒ x = X1 b for some b ∈ IR²
=⇒ x ∈ C(X1).

Thus, C(X2) ⊆ C(X1).

We previously showed that C(X1) ⊆ C(X2). Thus, it follows that C(X1) = C(X2).
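A standard numerical way to confirm C(X1) = C(X2) is a rank comparison: the two spaces coincide exactly when neither matrix adds a new direction to the other. A short numpy sketch of that criterion (illustrative, not from the original notes):

```python
import numpy as np

X1 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X2 = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)

# C(X1) = C(X2) iff rank(X1) = rank(X2) = rank([X1 | X2]):
# stacking the columns side by side adds no new directions.
r1 = np.linalg.matrix_rank(X1)
r2 = np.linalg.matrix_rank(X2)
r12 = np.linalg.matrix_rank(np.hstack([X1, X2]))
print(r1, r2, r12)   # 2 2 2  =>  the two column spaces coincide
```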
Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y).

We could, of course, use y itself to estimate E(y). y is obviously an unbiased estimator of E(y), but it is often not a very sensible estimator.

For example, suppose [y1; y2] = [1; 1] μ + [ε1; ε2], and we observe y = [6.1; 2.3]. Should we estimate E(y) = [μ; μ] by ŷ = [6.1; 2.3]?

Estimation of E(y)

The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y).

Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance). This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that Ê(y) might be better notation).

By definition, ||y − ŷ|| = min over z ∈ C(X) of ||y − z||, where ||a|| ≡ √(a1² + ⋯ + an²).
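A brute-force illustration of this “closest point” definition, using the running example X = [1; 1] and y = [6.1; 2.3] (the grid search is purely illustrative; the closed form a* = x′y/x′x for a one-column X is the standard shortcut):

```python
import numpy as np

# Running example: X = [1; 1], observed y = (6.1, 2.3).
X = np.array([[1.0], [1.0]])
y = np.array([6.1, 2.3])

# C(X) = {a*(1, 1) : a in IR}; scan candidate points z = a*(1, 1)
# and keep the one closest to y in Euclidean distance.
grid = np.linspace(0.0, 10.0, 100001)
dists = [np.linalg.norm(y - a * X[:, 0]) for a in grid]
a_best = grid[int(np.argmin(dists))]
print(a_best)              # ~4.2, so y_hat ~ (4.2, 4.2)

# The exact answer for a single column x: a* = (x'y)/(x'x) = 8.4/2 = 4.2.
x = X[:, 0]
print(x @ y / (x @ x))     # 4.2
```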
In a picture

Suppose X = [1; 1] and y = [6.1; 2.3].

[Figure: the line C(X) with the vector X marked on it, the observed y off the line, and its projection ŷ on the line.]

A second example and picture

Suppose X = [1; 2] and y = [3; 4].

[Figure: the line C(X) through the origin and X, with y off the line.]

A second example and picture (continued)

[Figure: the same picture with the projection ŷ of y onto C(X) added.]

A second example and picture (continued)

[Figure: the same picture with the residual vector y − ŷ added, drawn perpendicular to C(X).]
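The pictures above are easy to recreate. A matplotlib sketch for the first example (assuming X = [1; 1] and y = [6.1; 2.3]; the projection formula used here anticipates the slides that follow):

```python
import numpy as np
import matplotlib.pyplot as plt

# Recreate the first projection picture: X = [1; 1], y = (6.1, 2.3).
x = np.array([1.0, 1.0])
y = np.array([6.1, 2.3])
y_hat = (x @ y) / (x @ x) * x        # orthogonal projection of y onto C(X)

t = np.linspace(-1.0, 7.0, 2)
plt.plot(t * x[0], t * x[1], label="C(X)")                    # the line C(X)
plt.scatter([x[0], y[0], y_hat[0]], [x[1], y[1], y_hat[1]])   # X, y, y_hat
plt.annotate("X", x)
plt.annotate("y", y)
plt.annotate("y_hat", y_hat)
plt.plot([y[0], y_hat[0]], [y[1], y_hat[1]], "--", label="y - y_hat")
plt.axis("equal")
plt.legend()
plt.show()
```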
What the geometry tells a statistician

ŷ is the point in C(X) that minimizes ||y − ŷ||² = (y1 − ŷ1)² + ⋯ + (yn − ŷn)². We’re doing least squares estimation!

The vectors ŷ and y − ŷ are orthogonal. The correlation between ŷ and y − ŷ is 0, i.e., predicted values ŷ and residuals y − ŷ are uncorrelated.

Geometrically, ||y − ŷ|| is minimized when the angle between the vector ŷ and the vector y − ŷ is 90°.

Pythagorean Theorem: ||y||² = ||ŷ||² + ||y − ŷ||², essentially the ANOVA decomposition of sums of squares: SStotal = SSmodel + SSerror.

But how do you compute ŷ without drawing lines?

Orthogonal Projection Matrices

Can find ŷ by matrix multiplication: ŷ = PX y ∀ y ∈ IRⁿ, where PX is a unique n × n matrix known as an orthogonal projection matrix.

What is PX?

If (X′X)⁻¹ exists, i.e., X′X is full rank, PX = X(X′X)⁻¹X′.

If not, PX = X(X′X)⁻X′, where (X′X)⁻ is any generalized inverse of X′X.

It can be shown that:

PX X = X and X′PX = X′.

PX is idempotent: PX PX = PX.

PX is symmetric: PX′ = PX.

Why Does PX X = X?

Algebra (assuming full rank X): PX X = X(X′X)⁻¹X′X = X.

Geometry: Each column of X represents a point in C(X) (by definition). The projection is the closest point in C(X). Already there!

Why is PX idempotent?

Algebra (assuming full rank X of dimension n × p): Consider any matrix A of dimension n × k.

PX PX A = X(X′X)⁻¹X′X(X′X)⁻¹X′A = X(X′X)⁻¹(X′X)(X′X)⁻¹X′A = X(X′X)⁻¹X′A = PX A.

Geometry: If X is an n × p matrix, consider any n × k matrix A. Each column of PX A represents a point in C(X) (by definition). Projecting a second time doesn’t move PX A. Since this is true for any A, PX PX = PX.
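These claimed properties of PX can be verified numerically. A numpy sketch using the full-rank cell-means X from slide “Another Example Column Space” (the data vector y is made up):

```python
import numpy as np

# Check the claimed properties of P_X for a full-rank example design matrix
# (two groups, two observations each, cell-means coding).
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
PX = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(PX @ X, X))      # P_X X = X
print(np.allclose(PX @ PX, PX))    # idempotent
print(np.allclose(PX.T, PX))       # symmetric

# Pythagorean / ANOVA decomposition for a made-up y.
y = np.array([3.0, 1.0, 7.0, 5.0])
y_hat = PX @ y
print(np.isclose(y @ y, y_hat @ y_hat + (y - y_hat) @ (y - y_hat)))
print(np.isclose(y_hat @ (y - y_hat), 0.0))   # fitted values ⟂ residuals
```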
Generalized Inverses

We’ve already seen that some X matrices are not full rank. Hence, (X′X)⁻¹ is not defined. We still need to do statistics with these design matrices!

Generalized inverse, A⁻: G is a generalized inverse of a matrix A if AGA = A.

If A is nonsingular, i.e., if A⁻¹ exists, then A⁻¹ is the one and only generalized inverse of A: AA⁻¹A = AI = A.

If A is singular, i.e., if A⁻¹ does not exist, there are infinitely many generalized inverses of A.

Invariance of PX = X(X′X)⁻X′ to Choice of (X′X)⁻

If X′X is nonsingular, then PX = X(X′X)⁻¹X′ because the only generalized inverse of X′X is (X′X)⁻¹.

If X′X is singular, then PX = X(X′X)⁻X′ and the choice of the generalized inverse (X′X)⁻ does not matter: PX = X(X′X)⁻X′ will turn out to be the same matrix no matter which generalized inverse of X′X is used.

Suppose (X′X)₁⁻ and (X′X)₂⁻ are any two generalized inverses of X′X. Then

X(X′X)₁⁻X′ = X(X′X)₂⁻X′X(X′X)₁⁻X′ = X(X′X)₂⁻X′.

Hence, PX = X(X′X)₁⁻X′ = X(X′X)₂⁻X′.

An Example Orthogonal Projection Matrix

Suppose [y1; y2] = [1; 1] μ + [ε1; ε2], and we observe y = [6.1; 2.3]. Then

X(X′X)⁻X′ = [1; 1] ([1 1] [1; 1])⁻ [1 1]
= [1; 1] [2]⁻¹ [1 1]
= (1/2) [1; 1] [1 1]
= (1/2) [1 1; 1 1]
= [1/2 1/2; 1/2 1/2].

An Example Orthogonal Projection

Thus, the orthogonal projection of y = [6.1; 2.3] onto the column space of X = [1; 1] is

PX y = [1/2 1/2; 1/2 1/2] [6.1; 2.3] = [4.2; 4.2].
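A numpy sketch of these last few slides (illustrative only): it builds one generalized inverse with np.linalg.pinv, manufactures a second one from the standard identity that G + (I − GA)U + V(I − AG) is again a g-inverse of A for any U and V, confirms that both give the same PX, and then reproduces the 2 × 2 example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient design from "A Third Column Space Example":
# X'X is singular, so generalized inverses are needed.
X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)
A = X.T @ X

# G1: the Moore-Penrose pseudoinverse is one particular generalized inverse.
G1 = np.linalg.pinv(A)
print(np.allclose(A @ G1 @ A, A))      # AGA = A

# G2: if G is a g-inverse, so is G + (I - GA)U + V(I - AG) for any U, V.
I = np.eye(3)
U, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
G2 = G1 + (I - G1 @ A) @ U + V @ (I - A @ G1)
print(np.allclose(A @ G2 @ A, A))      # still a g-inverse of X'X

# Invariance: P_X is the same no matter which g-inverse is used.
print(np.allclose(X @ G1 @ X.T, X @ G2 @ X.T))

# The tiny example from the slides: X = [1; 1] gives
# P_X = [1/2 1/2; 1/2 1/2] and P_X y = (4.2, 4.2).
x = np.array([[1.0], [1.0]])
Px = x @ np.linalg.pinv(x.T @ x) @ x.T
print(Px, Px @ np.array([6.1, 2.3]))
```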
A picture we’ve seen before

[Figure: y, its projection ŷ, and the line C(X), with the vector X marked on the line.]

Why is PX called an orthogonal projection matrix?

The vectors ŷ and y − ŷ are orthogonal: the angle between ŷ and y − ŷ is 90°.

ŷ′(y − ŷ) = ŷ′(y − PX y) = ŷ′(I − PX)y
= (PX y)′(I − PX)y = y′PX′(I − PX)y = y′PX(I − PX)y
= y′(PX − PX PX)y = y′(PX − PX)y = 0.

Optimality of ŷ as an Estimator of E(y)

ŷ is an unbiased estimator of E(y): E(ŷ) = E(PX y) = PX E(y) = PX Xβ = Xβ = E(y).

It can be shown that ŷ = PX y is the best estimator of E(y) in the class of linear unbiased estimators, i.e., estimators of the form My for M satisfying

E(My) = E(y) ∀ β ∈ IRᵖ ⇐⇒ MXβ = Xβ ∀ β ∈ IRᵖ ⇐⇒ MX = X.

The estimator ŷ = X(X′X)⁻X′y is the best linear unbiased estimator of E(y) (Gauss-Markov Theorem). If, in addition, ε ∼ N(0, σ²I), then ŷ is best among all unbiased estimators: under the Normal Theory Gauss-Markov Linear Model, ŷ = PX y is best among all unbiased estimators of E(y).

Ordinary Least Squares (OLS) Estimation of E(y) = Xβ

OLS: Find a vector b* ∈ IRᵖ such that Q(b*) ≤ Q(b) ∀ b ∈ IRᵖ, where Q(b) ≡ (y1 − x(1)′b)² + ⋯ + (yn − x(n)′b)² and x(i)′ is the ith row of X.

Note that Q(b) = (y − Xb)′(y − Xb) = ||y − Xb||².

To minimize this sum of squares, we need to choose b* ∈ IRᵖ such that Xb* will be the point in C(X) that is closest to y. In other words, we need to choose b* such that Xb* = PX y = X(X′X)⁻X′y. Clearly, choosing b* = (X′X)⁻X′y will work.
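A numpy sketch of this OLS characterization (the rank-deficient X and the data vector are made up for illustration): no randomly drawn b beats b* = (X′X)⁻X′y, and Xb* matches PX y.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient X: Q(b) has many minimizers, but (X'X)^- X'y is one of them.
X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)
y = np.array([3.0, 1.0, 7.0, 5.0])   # made-up data

def Q(b):
    r = y - X @ b
    return r @ r                     # ||y - Xb||^2

b_star = np.linalg.pinv(X.T @ X) @ X.T @ y
print(Q(b_star))

# No randomly chosen b does better than b*.
print(all(Q(b_star) <= Q(b_star + rng.normal(size=3)) for _ in range(1000)))

# And X b* equals the orthogonal projection P_X y.
PX = X @ np.linalg.pinv(X.T @ X) @ X.T
print(np.allclose(X @ b_star, PX @ y))
```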
Ordinary Least Squares and the Normal Equations

Often calculus is used to show that Q(b*) ≤ Q(b) ∀ b ∈ IRᵖ if and only if b* is a solution to the normal equations: X′Xb = X′y.

If X′X is nonsingular, multiplying both sides of the normal equations by (X′X)⁻¹ shows that the only solution to the normal equations is b* = (X′X)⁻¹X′y.

If X′X is singular, there are infinitely many solutions; these include (X′X)⁻X′y for every choice of generalized inverse of X′X, because

X′X[(X′X)⁻X′y] = X′[X(X′X)⁻X′]y = X′PX y = X′y.

Henceforth, we will use β̂ to denote any solution to the normal equations.

Ordinary Least Squares Estimator of E(y) = Xβ

We call Xβ̂ = X(X′X)⁻X′y = PX y = ŷ the OLS estimator of E(y) = Xβ.

It might be more appropriate to use (Xβ)̂ — a hat over the whole of Xβ — rather than Xβ̂ to denote our estimator, because we are estimating Xβ rather than pre-multiplying an estimator of β by X.

As we shall soon see, it does not make sense to estimate β when X′X is singular. However, it does make sense to estimate E(y) = Xβ whether X′X is singular or nonsingular.

Return to our t-test

Our study has 2 groups with 4 observations per group.

The X matrix from “Another Example Column Space” (X1) describes 2 groups with 2 observations per group. It corresponds to the cell means model yij = μi + εij, Var(ε) = σ²I.

The X matrix from “A Third Column Space Example” (X2) also describes 2 groups with 2 observations per group. It corresponds to the effects model yij = μ + αi + εij, Var(ε) = σ²I.

Both are straightforward to extend to 2 groups with 4 observations per group.

Those two models have the same predicted values because C(X1) = C(X2).
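A closing numerical check of this point (the group data are invented for illustration): the cell-means and effects codings for 2 groups with 4 observations per group give identical fitted values.

```python
import numpy as np

# Two groups, 4 observations per group, as in the t-test discussion above.
# Cell-means coding (like X1) and effects coding (like X2).
X1 = np.kron(np.eye(2), np.ones((4, 1)))         # y_ij = mu_i + e_ij
X2 = np.hstack([np.ones((8, 1)), X1])            # y_ij = mu + alpha_i + e_ij

y = np.array([5.2, 4.9, 6.1, 5.8, 2.3, 2.7, 1.9, 2.1])   # made-up data

def proj(X):
    return X @ np.linalg.pinv(X.T @ X) @ X.T

# Same column space, so identical fitted values (each group's sample mean).
print(np.allclose(proj(X1) @ y, proj(X2) @ y))   # True
print(proj(X1) @ y)                              # group means, repeated
```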
Summary

A linear model is a set of E(y)’s describing the mean of each random variable. E(y) is a point in the column space of X.

Two X matrices with the same column space represent the same model for the mean. The choice of X matrix determines the interpretation of β.

Questions still to be answered:

Some X matrices are not full column rank, so there are many possible solutions for β; all give the same ŷ. What functions of β can be estimated? What are the properties of estimable functions of β?