Underfitting and Overfitting

Copyright 2012 Dan Nettleton (Iowa State University), Statistics 611

Underfitting

Suppose the true model is
\[
y = X\beta + \eta + \varepsilon,
\]
where $\eta$ is an unknown fixed vector and $\varepsilon$ satisfies the Gauss-Markov model (GMM) assumptions. Suppose we incorrectly assume that $y = X\beta + \varepsilon$. This way of misspecifying the model is known as underfitting.

Note that $\eta$ may equal $W\alpha$ for some design matrix $W$ whose columns could contain explanatory variables excluded from $X$.

What are the implications of underfitting? Suppose $c'\beta$ is estimable under the assumed model, so that $c' = a'X$ for some $a$, and let $r = \operatorname{rank}(X)$. Find $E(c'\hat{\beta})$ and $E(\hat{\sigma}^2)$.

Derivation of $E(c'\hat{\beta})$

\[
E(c'\hat{\beta}) = E(a'X\hat{\beta}) = E(a'P_X y) = a'P_X E(y) = a'P_X(X\beta + \eta) = a'X\beta + a'P_X\eta = c'\beta + a'P_X\eta.
\]
Thus, $c'\hat{\beta}$ is biased for $c'\beta$ unless $a'P_X\eta = 0$.

Note that if $\eta$ is close to $0$, the bias $a'P_X\eta$ may be small. If $\eta \in \mathcal{C}(X)^\perp = \mathcal{N}(X')$, then
\[
X'\eta = 0 \;\Rightarrow\; X(X'X)^{-}X'\eta = 0 \;\Rightarrow\; a'P_X\eta = 0.
\]

As an example of this last point, suppose we fit a multiple regression but omit one explanatory variable. Suppose that, for our sample of $n$ observations, the vector $x_* = x - \bar{x}\mathbf{1}$ contains the values of the missing variable, centered so that the sample mean is zero. If the sample covariance of the missing variable $x$ with each of the included variables $x_1, \ldots, x_p$ is 0, then the LSE $\hat{\beta}$ of the multiple regression coefficients will still be unbiased for $\beta$ even though $x$ is excluded, because $X'x_* = 0$.

Derivation of $E(\hat{\sigma}^2)$

\[
(n - r)E(\hat{\sigma}^2) = E(y'(I - P_X)y) = (X\beta + \eta)'(I - P_X)(X\beta + \eta) + \operatorname{tr}((I - P_X)\sigma^2 I) = \eta'(I - P_X)\eta + \sigma^2(n - r),
\]
because $X'(I - P_X) = 0$ and $(I - P_X)X = 0$. Therefore,
\[
E(\hat{\sigma}^2) = \frac{\eta'(I - P_X)\eta}{n - r} + \sigma^2.
\]

Note that
\[
\eta'(I - P_X)\eta = \eta'(I - P_X)'(I - P_X)\eta = \|(I - P_X)\eta\|^2 = \|\eta - P_X\eta\|^2.
\]
Thus, $E(\hat{\sigma}^2) = \sigma^2$ if and only if
\[
(I - P_X)\eta = 0 \iff \eta \in \mathcal{N}(I - P_X) \iff \eta \in \mathcal{C}(P_X) = \mathcal{C}(X)
\]
\[
\iff \exists\,\alpha \text{ such that } X\alpha = \eta \iff \exists\,\alpha \text{ such that } E(y) = X\beta + \eta = X\beta + X\alpha = X(\beta + \alpha) \iff E(y) \in \mathcal{C}(X).
\]

Example 1

Consider an experiment with two experimental units (mice in this case) for each of two treatments. We might assume the GMM holds with
\[
E(y) = E\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{bmatrix}
= X\beta
= \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix}
= \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}.
\]

Suppose the person who conducted the experiment neglected to mention that, in each treatment group, one of the experimental units was male and the other was female. Then the true model may require
\[
E(y) = \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}
+ \begin{bmatrix} \alpha/2 \\ -\alpha/2 \\ \alpha/2 \\ -\alpha/2 \end{bmatrix}
= X\beta + \begin{bmatrix} 1/2 \\ -1/2 \\ 1/2 \\ -1/2 \end{bmatrix}\alpha
= X\beta + W\alpha = X\beta + \eta.
\]

If we analyze the data assuming the GMM with $E(y) = X\beta$, determine (1) $E(\widehat{\tau_1 - \tau_2})$ and (2) $E(\hat{\sigma}^2)$.

From the derivation of $E(c'\hat{\beta})$ above,
\[
E(\widehat{\tau_1 - \tau_2}) = \tau_1 - \tau_2 + a'P_X\eta
= \tau_1 - \tau_2 + a'P_X\begin{bmatrix} 1/2 \\ -1/2 \\ 1/2 \\ -1/2 \end{bmatrix}\alpha
= \tau_1 - \tau_2 + a'\,0\,\alpha = \tau_1 - \tau_2,
\]
because every column of $X$ is orthogonal to $W$, so that $P_X W = 0$. Thus, the LSE of $\tau_1 - \tau_2$ is unbiased in this case.

From the derivation of $E(\hat{\sigma}^2)$ above,
\[
E(\hat{\sigma}^2) = \frac{\eta'(I - P_X)\eta}{n - r} + \sigma^2
= \frac{\eta'(\eta - P_X\eta)}{n - r} + \sigma^2
= \frac{\eta'(\eta - 0)}{n - r} + \sigma^2
= \frac{\eta'\eta}{n - r} + \sigma^2
= \frac{\alpha^2}{4 - 2} + \sigma^2
= \frac{\alpha^2}{2} + \sigma^2.
\]
Thus, $\hat{\sigma}^2$ is biased for $\sigma^2$ in this case.
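As a numerical sanity check of Example 1's two conclusions (this sketch is an illustration, not part of the original derivation; the value $\alpha = 3$ is an arbitrary choice), we can compute both bias terms directly with numpy:

```python
import numpy as np

# Design from Example 1: intercept plus two treatment indicators.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
W = np.array([0.5, -0.5, 0.5, -0.5])  # sex split within each treatment group
alpha = 3.0                           # hypothetical sex effect
eta = W * alpha

# P_X = X (X'X)^- X'; pinv supplies a generalized inverse.
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

# tau1 - tau2 = a'X beta with a = (1/2, 1/2, -1/2, -1/2)'.
a = np.array([0.5, 0.5, -0.5, -0.5])
print("bias of LSE of tau1 - tau2:", a @ P_X @ eta)  # ~0: unbiased

n, r = 4, np.linalg.matrix_rank(X)                   # r = 2
bias = eta @ (np.eye(n) - P_X) @ eta / (n - r)
print("bias of sigma^2 hat:", bias, "= alpha^2/2 =", alpha**2 / 2)
```

Both printed values match the algebra above: the contrast is unbiased, while $\hat{\sigma}^2$ overshoots $\sigma^2$ by $\alpha^2/2$.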
Example 2

Once again consider an experiment with two experimental units (mice) for each of two treatments. Suppose we assume the GMM holds with
\[
E(y) = E\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{bmatrix}
= X\beta
= \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix}
= \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}.
\]

Suppose the person who conducted the experiment neglected to mention that both experimental units in treatment group 1 were female and that both experimental units in treatment group 2 were male. Then the true model may require
\[
E(y) = \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}
+ \begin{bmatrix} \alpha/2 \\ \alpha/2 \\ -\alpha/2 \\ -\alpha/2 \end{bmatrix}
= X\beta + \begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{bmatrix}\alpha
= X\beta + W\alpha = X\beta + \eta.
\]

If we analyze the data assuming the GMM with $E(y) = X\beta$, determine (1) $E(\widehat{\tau_1 - \tau_2})$ and (2) $E(\hat{\sigma}^2)$.

From the derivation of $E(c'\hat{\beta})$ above,
\[
E(\widehat{\tau_1 - \tau_2}) = \tau_1 - \tau_2 + a'P_X\eta
= \tau_1 - \tau_2 + a'P_X\begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{bmatrix}\alpha
= \tau_1 - \tau_2 + a'\begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{bmatrix}\alpha
= \tau_1 - \tau_2 + \alpha,
\]
where the third equality holds because $W \in \mathcal{C}(X)$, so $P_X W = W$, and the last holds because $a'W = 1$ for $a = (1/2, 1/2, -1/2, -1/2)'$.

Note that $\widehat{\tau_1 - \tau_2} = \bar{y}_{1\cdot} - \bar{y}_{2\cdot}$. The result above shows that $\bar{y}_{1\cdot} - \bar{y}_{2\cdot}$ is not an unbiased estimator of the difference between treatment effects. However, $\bar{y}_{1\cdot} - \bar{y}_{2\cdot}$ is an unbiased estimator of the difference between the means of the two treatment groups; i.e.,
\[
E(\bar{y}_{1\cdot} - \bar{y}_{2\cdot}) = (\mu + \tau_1 + \alpha/2) - (\mu + \tau_2 - \alpha/2) = \tau_1 - \tau_2 + \alpha.
\]
Part of the difference may be due to treatment, but part may be due to sex of the mice.

From the derivation of $E(\hat{\sigma}^2)$ above,
\[
E(\hat{\sigma}^2) = \frac{\eta'(I - P_X)\eta}{n - r} + \sigma^2
= \frac{\eta'(\eta - P_X\eta)}{n - r} + \sigma^2
= \frac{\eta'(\eta - \eta)}{n - r} + \sigma^2
= \sigma^2.
\]
Thus, $\hat{\sigma}^2$ is unbiased for $\sigma^2$ in this case.

Because $\eta \in \mathcal{C}(X)$, both assumptions $E(y) = X\beta$ and $E(y) = X\beta + \eta$ are equivalent to $E(y) \in \mathcal{C}(X)$. Thus, even though we ignore sex of the mice, our model for the mean is correct. The only mistake we would make is to assume that the difference in means for the treatment groups is due only to treatment rather than to a combination of treatment and sex.
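The same style of check applies to Example 2 (again an illustration with an arbitrary $\alpha = 3$, not part of the original slides); now the sex effect lies in $\mathcal{C}(X)$, so the biases trade places:

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

alpha = 3.0                                   # hypothetical sex effect
W = np.array([0.5, 0.5, -0.5, -0.5])          # sex now coincides with treatment
eta = W * alpha

# eta lies in C(X): projecting onto C(X) leaves it unchanged.
print("P_X eta == eta:", np.allclose(P_X @ eta, eta))  # True

a = np.array([0.5, 0.5, -0.5, -0.5])          # a'X beta = tau1 - tau2
print("bias of LSE of tau1 - tau2:", a @ P_X @ eta)    # alpha = 3.0

n, r = 4, 2
print("bias of sigma^2 hat:", eta @ (np.eye(n) - P_X) @ eta / (n - r))  # ~0
```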
Overfitting

Now suppose we consider the model $y = X\beta + \varepsilon$, where
\[
X = [X_1, X_2] \quad \text{and} \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix},
\quad \text{so that } X\beta = X_1\beta_1 + X_2\beta_2.
\]
Furthermore, suppose that (unknown to us) $X_2\beta_2 = 0$. In this case, we say that we are overfitting: we are fitting a model that is more complicated than it needs to be.

To examine the impact of overfitting, consider the case where $X = [X_1, X_2]$ is of full column rank. If we were to fit the simpler and correct model $y = X_1\beta_1 + \varepsilon$, the LSE of $\beta_1$ would be $\tilde{\beta}_1 = (X_1'X_1)^{-1}X_1'y$. Then
\[
E(\tilde{\beta}_1) = (X_1'X_1)^{-1}X_1'E(y) = (X_1'X_1)^{-1}X_1'X_1\beta_1 = \beta_1
\]
and
\[
\operatorname{Var}(\tilde{\beta}_1) = (X_1'X_1)^{-1}X_1'\operatorname{Var}(y)X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}X_1'X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}.
\]

If we were to fit the full model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ that is correct but more complicated than it needs to be, then the LSE of $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$ would be
\[
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \left([X_1, X_2]'[X_1, X_2]\right)^{-1}[X_1, X_2]'y
= \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}.
\]

If $X_1'X_2 = 0$, then
\[
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}
= \begin{bmatrix} (X_1'X_1)^{-1}X_1'y \\ (X_2'X_2)^{-1}X_2'y \end{bmatrix}
= \begin{bmatrix} \tilde{\beta}_1 \\ (X_2'X_2)^{-1}X_2'y \end{bmatrix}.
\]

Now suppose $X_1'X_2 \neq 0$. Then
\[
E\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix},
\]
so $E(\hat{\beta}_1) = \beta_1$. Also,
\[
\operatorname{Var}(\hat{\beta}) = \operatorname{Var}\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \sigma^2(X'X)^{-1}
= \sigma^2\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}.
\]

By Exercise A.72,
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} A^{-1} + A^{-1}BE^{-1}CA^{-1} & -A^{-1}BE^{-1} \\ -E^{-1}CA^{-1} & E^{-1} \end{bmatrix},
\quad \text{where } E = D - CA^{-1}B.
\]
Thus, $\operatorname{Var}(\hat{\beta}_1)$ is $\sigma^2$ times
\[
(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}
\]
\[
= (X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]
Thus,
\[
\operatorname{Var}(\hat{\beta}_1) - \operatorname{Var}(\tilde{\beta}_1)
= \sigma^2(X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]

In a homework problem, you will show that $\operatorname{Var}(\hat{\beta}_1) - \operatorname{Var}(\tilde{\beta}_1)$ is nonnegative definite (NND). Thus, one cost of overfitting is increased variability of the estimators of regression coefficients.
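To see the variance cost concretely, here is a small numerical sketch. The design, seed, and dimensions are arbitrary synthetic choices (none come from the slides); the check confirms that the difference of the two covariance matrices matches the closed-form expression above and is NND:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary synthetic design
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
# Make X2 correlated with X1 so that X1'X2 != 0.
X2 = (0.8 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)
X = np.hstack([X1, X2])

V_tilde = np.linalg.inv(X1.T @ X1)             # Var(beta1-tilde) / sigma^2
V_hat = np.linalg.inv(X.T @ X)[:2, :2]         # Var(beta1-hat) / sigma^2

# Closed-form difference from the derivation above:
P_X1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
middle = np.linalg.inv(X2.T @ (np.eye(n) - P_X1) @ X2)
formula = V_tilde @ X1.T @ X2 @ middle @ X2.T @ X1 @ V_tilde

D = V_hat - V_tilde
print("matches formula:", np.allclose(D, formula))          # True
print("eigenvalues of difference:", np.linalg.eigvalsh(D))  # all >= 0: NND
```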
How is estimation of $\sigma^2$ affected? Let $r_1 = \operatorname{rank}(X_1)$ and, as before, $r = \operatorname{rank}(X)$. If we fit the simpler model $y = X_1\beta_1 + \varepsilon$, then
\[
\tilde{\sigma}^2 = \frac{y'(I - P_{X_1})y}{n - r_1}
\quad \text{and} \quad
E(y'(I - P_{X_1})y) = (n - r_1)\sigma^2 \;\Rightarrow\; E(\tilde{\sigma}^2) = \sigma^2.
\]
If we overfit with the model $y = X\beta + \varepsilon$, then
\[
\hat{\sigma}^2 = \frac{y'(I - P_X)y}{n - r}
\quad \text{and} \quad
E(\hat{\sigma}^2) = \sigma^2.
\]
Thus, overfitting does not lead to biased estimation of $\sigma^2$. However, as we will see later in the course, overfitting leads to a loss of degrees of freedom ($n - r < n - r_1$), which can lead to a loss of power for testing hypotheses about $\beta$.
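A short simulation can illustrate both points at once. Everything here is a hypothetical setup (synthetic design, $\sigma^2 = 4$, arbitrary coefficients, 20,000 replicates): both estimators of $\sigma^2$ come out essentially unbiased, but the overfit model has fewer error degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary synthetic setup
n, sigma2 = 30, 4.0
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))                   # superfluous columns: X2 beta2 = 0
X = np.hstack([X1, X2])
beta1 = np.array([1.0, 2.0])                   # true model: y = X1 beta1 + eps

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # I - P_X1
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T        # I - P_X
r1, r = 2, 4

simple, full = [], []
for _ in range(20_000):
    y = X1 @ beta1 + rng.normal(scale=np.sqrt(sigma2), size=n)
    simple.append(y @ M1 @ y / (n - r1))
    full.append(y @ M @ y / (n - r))

print("mean of sigma-tilde^2:", np.mean(simple))  # ~4.0
print("mean of sigma-hat^2:  ", np.mean(full))    # ~4.0
print("error df:", n - r1, "vs", n - r)           # 28 vs 26: overfitting loses df
```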