Underfitting and Overfitting
Copyright 2012 Dan Nettleton (Iowa State University), Statistics 611

Underfitting
Suppose the true model is
y = Xβ + η + ε,
where η is an unknown fixed vector and ε satisfies the GMM.
Suppose we incorrectly assume that
y = Xβ + ε.
This example of misspecifying the model is known as underfitting.
Note that η may equal Wα for some design matrix W whose columns
could contain explanatory variables excluded from X.
What are the implications of underfitting?
Find E(c′β̂) for an estimable function c′β.
Find E(σ̂²).
Derivation of E(c′β̂)

Since c′β is estimable, c′ = a′X for some vector a. Then

E(c′β̂) = E(a′Xβ̂)
        = E(a′P_X y)
        = a′P_X E(y)
        = a′P_X (Xβ + η)
        = a′Xβ + a′P_X η
        = c′β + a′P_X η.

Thus, c′β̂ is biased for c′β unless a′P_X η = 0.
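As a numerical aside (not from the original slides), here is a minimal numpy sketch of this bias formula. The design X, the vectors β, η, and a, and all numbers are arbitrary illustrative choices; the sketch computes a′P_X η exactly and confirms it by simulation.

```python
import numpy as np

# Minimal check of E(c'beta_hat) = c'beta + a'P_X eta under underfitting.
rng = np.random.default_rng(0)
n = 6
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # intercept + covariate
beta = np.array([1.0, 2.0])
eta = np.array([0.5, -0.2, 0.3, 0.0, -0.4, 0.1])              # omitted fixed effect

P_X = X @ np.linalg.solve(X.T @ X, X.T)                       # projector onto C(X)
a = np.zeros(n)
a[-1] = 1.0
c = X.T @ a                                                   # c' = a'X, so c'beta is estimable

bias = a @ P_X @ eta                                          # analytic bias a'P_X eta
print("c'beta:", c @ beta, " analytic bias:", bias)

# Monte Carlo confirmation under y = X beta + eta + eps, eps ~ N(0, I):
ests = [c @ np.linalg.lstsq(X, X @ beta + eta + rng.standard_normal(n), rcond=None)[0]
        for _ in range(20000)]
print("MC mean of c'beta_hat:", np.mean(ests), " vs c'beta + bias:", c @ beta + bias)
```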
Note that if η is close to 0, the bias a′P_X η may be small.

If η ∈ C(X)⊥ = N(X′), then

X′η = 0 ⇒ X(X′X)⁻X′η = 0 ⇒ a′P_X η = 0.
As an example of this last point, suppose we fit a multiple regression
but omit one explanatory variable.
Suppose for our sample of n observations, the vector x∗ = x − x̄1
contains the values of the missing variable centered so that the sample
mean is zero.
If the sample covariance of the missing variable x with each of the included variables x1, . . . , xp is 0, then the LSE β̂ of the multiple regression coefficients will still be unbiased for β even though x is excluded, ∵ X′x∗ = 0 (zero sample covariances make x∗ orthogonal to each included covariate, and centering makes it orthogonal to the intercept column).
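A small numpy sketch of this point (synthetic data; the residualization used to construct x∗ is just a convenient way to force the orthogonality):

```python
import numpy as np

# If the centered missing variable x* is orthogonal to every column of X,
# the LSE computed from X alone remains unbiased.
rng = np.random.default_rng(1)
n = 8
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # intercept + included covariate

P_X = X @ np.linalg.solve(X.T @ X, X.T)
x_star = (np.eye(n) - P_X) @ rng.standard_normal(n)        # orthogonal to C(X) by construction
print("X' x* =", X.T @ x_star)                             # numerically ~ 0

beta = np.array([1.0, 2.0])
eta = 3.0 * x_star                                         # contribution of the omitted variable

# E(beta_hat) = (X'X)^{-1} X'(X beta + eta) = beta, because X'eta = 0:
print(np.linalg.solve(X.T @ X, X.T @ (X @ beta + eta)))    # recovers beta exactly
```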
Derivation of E(σ̂²)

With r = rank(X), and using E(y′Ay) = E(y)′A E(y) + tr(A Var(y)),

(n − r)E(σ̂²) = E(y′(I − P_X)y)
             = (Xβ + η)′(I − P_X)(Xβ + η) + tr((I − P_X)σ²I)
             = η′(I − P_X)η + σ²(n − r),

∵ X′(I − P_X) = 0 and (I − P_X)X = 0.

∴ E(σ̂²) = η′(I − P_X)η/(n − r) + σ².
Note that

η′(I − P_X)η = η′(I − P_X)′(I − P_X)η = ‖(I − P_X)η‖² = ‖η − P_X η‖².
Thus, E(σ̂²) = σ² iff

(I − P_X)η = 0 ⇐⇒ η ∈ N(I − P_X)
           ⇐⇒ η ∈ C(P_X) = C(X)
           ⇐⇒ ∃ α ∋ Xα = η
           ⇐⇒ ∃ α ∋ E(y) = Xβ + η = Xβ + Xα = X(β + α)
           ⇐⇒ E(y) ∈ C(X).
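A quick simulation check of the expectation formula (again not from the slides; X, β, η, and σ² are arbitrary picks):

```python
import numpy as np

# Monte Carlo check of E(sigma_hat^2) = eta'(I - P_X)eta/(n - r) + sigma^2.
rng = np.random.default_rng(2)
n, sigma2 = 10, 4.0
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
r = np.linalg.matrix_rank(X)
beta = np.array([1.0, -1.0])
eta = rng.standard_normal(n)                       # fixed, unmodeled mean component

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # I - P_X
theory = eta @ M @ eta / (n - r) + sigma2

draws = []
for _ in range(20000):
    y = X @ beta + eta + rng.normal(scale=np.sqrt(sigma2), size=n)
    draws.append(y @ M @ y / (n - r))              # sigma_hat^2 = y'(I - P_X)y/(n - r)
print("theory:", theory, " simulation:", np.mean(draws))
```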
Example 1
Consider an experiment with two experimental units (mice in this case)
for each of two treatments.
We might assume the GMM holds with

\[
E(y) = E\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
= X\beta
= \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{pmatrix}.
\]
Example 1 (continued)
Suppose the person who conducted the experiment neglected to
mention that, in each treatment group, one of the experimental units
was male and the other was female.
Example 1 (continued)

Then the true model may require

\[
E(y) = \begin{pmatrix} \mu + \tau_1 + \alpha/2 \\ \mu + \tau_1 - \alpha/2 \\ \mu + \tau_2 + \alpha/2 \\ \mu + \tau_2 - \alpha/2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{pmatrix}
+ \begin{pmatrix} \alpha/2 \\ -\alpha/2 \\ \alpha/2 \\ -\alpha/2 \end{pmatrix}
= X\beta + \begin{pmatrix} 1/2 \\ -1/2 \\ 1/2 \\ -1/2 \end{pmatrix}\alpha
= X\beta + W\alpha = X\beta + \eta,
\]

where α is the sex effect.
Example 1 (continued)

If we analyze the data assuming the GMM with E(y) = Xβ, determine

1. \(E(\widehat{\tau_1 - \tau_2})\), and
2. E(σ̂²).
Example 1 (continued)

From the derivation of E(c′β̂) above,

\[
\begin{aligned}
E(\widehat{\tau_1 - \tau_2}) &= \tau_1 - \tau_2 + a'P_X\eta \\
&= \tau_1 - \tau_2 + a'P_X \begin{pmatrix} 1/2 \\ -1/2 \\ 1/2 \\ -1/2 \end{pmatrix}\alpha \\
&= \tau_1 - \tau_2 + a'\,\mathbf{0}\,\alpha \\
&= \tau_1 - \tau_2,
\end{aligned}
\]

since η is a within-group contrast and the columns of X are constant within treatment groups, so P_X η = 0.

Thus, the LSE of τ1 − τ2 is unbiased in this case.
Example 1 (continued)

From the expression for E(σ̂²) derived above,

\[
\begin{aligned}
E(\hat\sigma^2) &= \frac{\eta'(I - P_X)\eta}{n - r} + \sigma^2
= \frac{\eta'(\eta - P_X\eta)}{n - r} + \sigma^2
= \frac{\eta'(\eta - 0)}{n - r} + \sigma^2 \\
&= \frac{\eta'\eta}{n - r} + \sigma^2
= \frac{\alpha^2}{4 - 2} + \sigma^2
= \sigma^2 + \frac{\alpha^2}{2}.
\end{aligned}
\]

Thus, σ̂² is biased upward for σ² in this case.
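Both conclusions can be verified numerically. The sketch below is a hypothetical numeric instance of Example 1 (α = 3 is an arbitrary pick); it shows P_X η = 0 and a bias of exactly α²/2 in σ̂².

```python
import numpy as np

# Deterministic check of Example 1.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
alpha = 3.0
eta = alpha * np.array([0.5, -0.5, 0.5, -0.5])   # W alpha: sexes alternate within groups

P_X = X @ np.linalg.pinv(X)                      # X has rank 2, so use a generalized inverse
print("P_X eta:", np.round(P_X @ eta, 12))       # = 0, so the LSE of tau1 - tau2 is unbiased

n, r = 4, np.linalg.matrix_rank(X)
M = np.eye(n) - P_X
print("bias of sigma_hat^2:", eta @ M @ eta / (n - r), "= alpha^2/2 =", alpha**2 / 2)
```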
Example 2
Once again consider an experiment with two experimental units (mice)
for each of two treatments.
Suppose we assume the GMM holds with

\[
E(y) = E\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
= X\beta
= \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{pmatrix}.
\]
Example 2 (continued)
Suppose the person who conducted the experiment neglected to
mention that both experimental units in treatment group 1 were female
and that both experimental units in treatment group 2 were male.
Example 2 (continued)

Then the true model may require

\[
E(y) = \begin{pmatrix} \mu + \tau_1 + \alpha/2 \\ \mu + \tau_1 + \alpha/2 \\ \mu + \tau_2 - \alpha/2 \\ \mu + \tau_2 - \alpha/2 \end{pmatrix}
= \begin{pmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{pmatrix}
+ \begin{pmatrix} \alpha/2 \\ \alpha/2 \\ -\alpha/2 \\ -\alpha/2 \end{pmatrix}
= X\beta + \begin{pmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{pmatrix}\alpha
= X\beta + W\alpha = X\beta + \eta.
\]
Example 2 (continued)

If we analyze the data assuming the GMM with E(y) = Xβ, determine

1. \(E(\widehat{\tau_1 - \tau_2})\), and
2. E(σ̂²).
Example 2 (continued)

From the derivation of E(c′β̂) above,

\[
\begin{aligned}
E(\widehat{\tau_1 - \tau_2}) &= \tau_1 - \tau_2 + a'P_X\eta \\
&= \tau_1 - \tau_2 + a'P_X \begin{pmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{pmatrix}\alpha \\
&= \tau_1 - \tau_2 + a'\begin{pmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{pmatrix}\alpha
= \tau_1 - \tau_2 + \alpha,
\end{aligned}
\]

since here η ∈ C(X), so P_X η = η.
Example 2 (continued)
Note that \(\widehat{\tau_1 - \tau_2}\) = ȳ1· − ȳ2·.
The previous slide shows that ȳ1· − ȳ2· is not an unbiased estimator of
the difference between treatment effects.
However, ȳ1· − ȳ2· is an unbiased estimator of the difference between
the means of the two treatment groups; i.e.,
E(ȳ1· − ȳ2· ) = (µ + τ1 + α/2) − (µ + τ2 − α/2) = τ1 − τ2 + α.
Part of the difference may be due to treatment, but part may be due to
sex of the mice.
Example 2 (continued)

From the expression for E(σ̂²) derived above,

\[
E(\hat\sigma^2) = \frac{\eta'(I - P_X)\eta}{n - r} + \sigma^2
= \frac{\eta'(\eta - P_X\eta)}{n - r} + \sigma^2
= \frac{\eta'(\eta - \eta)}{n - r} + \sigma^2
= \sigma^2.
\]

Thus, σ̂² is unbiased for σ² in this case.
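A hypothetical numeric instance of Example 2 (µ, τ1, τ2, α are arbitrary picks) confirms both conclusions: the estimated treatment difference picks up the sex effect α, while σ̂² stays unbiased.

```python
import numpy as np

# Deterministic check of Example 2: eta now lies in C(X).
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
mu, tau1, tau2, alpha = 10.0, 2.0, -1.0, 3.0
eta = alpha * np.array([0.5, 0.5, -0.5, -0.5])   # W alpha: sex confounded with treatment
Ey = X @ np.array([mu, tau1, tau2]) + eta

P_X = X @ np.linalg.pinv(X)
print("eta in C(X):", np.allclose(P_X @ eta, eta))

# ybar_1. - ybar_2. = a'y with a = (1/2, 1/2, -1/2, -1/2)'; note a'X = (0, 1, -1):
a = np.array([0.5, 0.5, -0.5, -0.5])
print("E(LSE of tau1 - tau2):", a @ Ey, "= tau1 - tau2 + alpha =", tau1 - tau2 + alpha)

# sigma_hat^2 stays unbiased: eta'(I - P_X)eta = 0.
M = np.eye(4) - P_X
print("bias of sigma_hat^2:", round(eta @ M @ eta / (4 - 2), 12))
```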
Example 2 (continued)
Because η ∈ C(X), both assumptions
E(y) = Xβ
and
E(y) = Xβ + η
are equivalent to E(y) ∈ C(X).
Thus, even though we ignore sex of the mice, our model for the mean
is correct.
The only mistake we would make is to assume that the difference in
means for the treatment groups is due only to treatment rather than to
a combination of treatment and sex.
Overfitting
Now suppose we consider the model

y = Xβ + ε,

where

\[
X = [X_1, X_2] \quad\text{and}\quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}
\;\ni\; X\beta = X_1\beta_1 + X_2\beta_2.
\]

Furthermore, suppose that (unknown to us) X2β2 = 0.
In this case, we say that we are overfitting.
Note that we are fitting a model that is more complicated than it needs
to be.
To examine the impact of overfitting, consider the case where X = [X1, X2] has full column rank.
If we were to fit the simpler and correct model y = X1β1 + ε, the LSE of β1 would be β̃1 = (X1′X1)⁻¹X1′y. Then

\[
E(\tilde\beta_1) = (X_1'X_1)^{-1}X_1'E(y) = (X_1'X_1)^{-1}X_1'X_1\beta_1 = \beta_1.
\]
\[
Var(\tilde\beta_1) = (X_1'X_1)^{-1}X_1'\,Var(y)\,X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}X_1'X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}.
\]
If we were to fit the full model

y = X1β1 + X2β2 + ε,

which is correct but more complicated than it needs to be, then the LSE of β would be

\[
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
= \left([X_1, X_2]'[X_1, X_2]\right)^{-1}[X_1, X_2]'y
= \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}
\begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}.
\]
If X1′X2 = 0, then

\[
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
= \begin{pmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{pmatrix}^{-1}
\begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}
= \begin{pmatrix} (X_1'X_1)^{-1}X_1'y \\ (X_2'X_2)^{-1}X_2'y \end{pmatrix}
= \begin{pmatrix} \tilde\beta_1 \\ (X_2'X_2)^{-1}X_2'y \end{pmatrix}.
\]
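A short sketch of this identity on a synthetic design (X2 is residualized against X1 purely to force X1′X2 = 0):

```python
import numpy as np

# When X1'X2 = 0, the overfit LSE of beta_1 equals the simple-model LSE beta_1 tilde.
rng = np.random.default_rng(3)
n = 9
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2 = ((np.eye(n) - P1) @ rng.standard_normal(n)).reshape(-1, 1)  # X1'X2 = 0
print("X1'X2:", X1.T @ X2)                                       # ~ 0

y = rng.standard_normal(n)                        # any response works for this identity
X = np.hstack([X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # full-model LSE
beta_tilde = np.linalg.solve(X1.T @ X1, X1.T @ y) # simple-model LSE
print(beta_hat[:2], beta_tilde)                   # first block matches beta_tilde
```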
Now suppose X1′X2 ≠ 0.
\[
E\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
= (X'X)^{-1}X'E(y)
= (X'X)^{-1}X'X\beta
= \beta
= \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.
\]

Thus, E(β̂1) = β1.
\[
Var(\hat\beta) = Var\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
= \sigma^2(X'X)^{-1}
= \sigma^2\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}.
\]
By Exercise A.72,

\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} A^{-1} + A^{-1}BE^{-1}CA^{-1} & -A^{-1}BE^{-1} \\ -E^{-1}CA^{-1} & E^{-1} \end{pmatrix},
\]

where E = D − CA⁻¹B.

Thus, Var(β̂1) is σ² times
\[
(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}
\]
\[
= (X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]

Thus,

\[
Var(\hat\beta_1) - Var(\tilde\beta_1) = \sigma^2(X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]
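A numerical check of this variance-inflation formula (synthetic correlated design; σ² = 1 and the 0.8/0.6 mixing weights are arbitrary):

```python
import numpy as np

# Evaluate Var(beta_1 hat) - Var(beta_1 tilde) and confirm it is NND; also
# cross-check the block-inverse expression against sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(4)
n, sigma2 = 12, 1.0
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = (0.8 * X1[:, 1] + 0.6 * rng.standard_normal(n)).reshape(-1, 1)  # X1'X2 != 0

A = np.linalg.inv(X1.T @ X1)                   # Var(beta_1 tilde) / sigma^2
P1 = X1 @ A @ X1.T
S = X2.T @ (np.eye(n) - P1) @ X2               # X2'(I - P_X1)X2
diff = sigma2 * A @ X1.T @ X2 @ np.linalg.inv(S) @ X2.T @ X1 @ A

# The beta_1 block of sigma^2 (X'X)^{-1} should equal sigma^2 A + diff:
X = np.hstack([X1, X2])
V11 = sigma2 * np.linalg.inv(X.T @ X)[:2, :2]
print("block identity holds:", np.allclose(V11, sigma2 * A + diff))
print("eigenvalues of diff:", np.linalg.eigvalsh(diff))  # all >= 0, so NND
```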
In a homework problem, you will show that Var(β̂1) − Var(β̃1) is nonnegative definite (NND).

Thus, one cost of overfitting is increased variability of the estimators of the regression coefficients.
How is estimation of σ² affected?
Let r1 = rank(X1 ) and r2 = rank(X2 ).
If we fit the simpler model y = X1β1 + ε, then

\[
\tilde\sigma^2 = \frac{y'(I - P_{X_1})y}{n - r_1},
\]

and

\[
E\big(y'(I - P_{X_1})y\big) = (n - r_1)\sigma^2 \;\Rightarrow\; E(\tilde\sigma^2) = \sigma^2.
\]
If we overfit with the model y = Xβ + ε, then

\[
\hat\sigma^2 = \frac{y'(I - P_X)y}{n - r},
\]

and E(σ̂²) = σ².

Thus, overfitting does not lead to biased estimation of σ².
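A brief simulation illustrating both points at once (synthetic design throughout; the extra columns in X2 are fitted even though β2 = 0 in truth):

```python
import numpy as np

# Both estimators of sigma^2 are unbiased when X2 beta_2 = 0, but the overfit
# estimator is left with fewer degrees of freedom.
rng = np.random.default_rng(5)
n, sigma2 = 10, 2.0
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 2))                 # fitted, but beta_2 = 0 in truth
X = np.hstack([X1, X2])
r1, r = np.linalg.matrix_rank(X1), np.linalg.matrix_rank(X)
beta1 = np.array([1.0, 3.0])

M1 = np.eye(n) - X1 @ np.linalg.pinv(X1)         # I - P_X1
M = np.eye(n) - X @ np.linalg.pinv(X)            # I - P_X
s_tilde, s_hat = [], []
for _ in range(20000):
    y = X1 @ beta1 + rng.normal(scale=np.sqrt(sigma2), size=n)  # true model
    s_tilde.append(y @ M1 @ y / (n - r1))
    s_hat.append(y @ M @ y / (n - r))
print("E(sigma_tilde^2) ~", np.mean(s_tilde), " E(sigma_hat^2) ~", np.mean(s_hat))
print("df:", n - r1, "vs", n - r)                # n - r < n - r1
```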
However, as we will see later in the course, overfitting leads to a loss of
degrees of freedom (n − r < n − r1 ), which can lead to a loss of power
for testing hypotheses about β.