Underfitting and Overfitting

Copyright 2012 Dan Nettleton (Iowa State University), Statistics 611

Underfitting

Suppose the true model is
\[
y = X\beta + \eta + \varepsilon,
\]
where $\eta$ is an unknown fixed vector and $\varepsilon$ satisfies the GMM assumptions. Suppose we incorrectly assume that $y = X\beta + \varepsilon$. This way of misspecifying the model is known as underfitting.

Note that $\eta$ may equal $W\alpha$ for some design matrix $W$ whose columns could contain explanatory variables excluded from $X$.

What are the implications of underfitting? Find $E(c'\hat{\beta})$. Find $E(\hat{\sigma}^2)$.

Example 1

Consider an experiment with two experimental units (mice in this case) for each of two treatments. We might assume the GMM holds with
\[
E(y) = E\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{bmatrix}
= X\beta
= \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix}
= \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}.
\]

Suppose the person who conducted the experiment neglected to mention that, in each treatment group, one of the experimental units was male and the other was female. Then the true model may require
\[
E(y) = \begin{bmatrix} \mu + \tau_1 + \alpha/2 \\ \mu + \tau_1 - \alpha/2 \\ \mu + \tau_2 + \alpha/2 \\ \mu + \tau_2 - \alpha/2 \end{bmatrix}
= \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}
+ \begin{bmatrix} \alpha/2 \\ -\alpha/2 \\ \alpha/2 \\ -\alpha/2 \end{bmatrix}
= X\beta + \begin{bmatrix} 1/2 \\ -1/2 \\ 1/2 \\ -1/2 \end{bmatrix} \alpha
= X\beta + W\alpha = X\beta + \eta.
\]

If we analyze the data assuming the GMM with $E(y) = X\beta$, determine

1. $E(\widehat{\tau_1 - \tau_2})$, and
2. $E(\hat{\sigma}^2)$.

Example 2

Once again consider an experiment with two experimental units (mice) for each of two treatments, and suppose we assume the GMM holds with the same $E(y) = X\beta$ as in Example 1. This time, suppose the person who conducted the experiment neglected to mention that both experimental units in treatment group 1 were female and that both experimental units in treatment group 2 were male. Then the true model may require
\[
E(y) = \begin{bmatrix} \mu + \tau_1 + \alpha/2 \\ \mu + \tau_1 + \alpha/2 \\ \mu + \tau_2 - \alpha/2 \\ \mu + \tau_2 - \alpha/2 \end{bmatrix}
= \begin{bmatrix} \mu + \tau_1 \\ \mu + \tau_1 \\ \mu + \tau_2 \\ \mu + \tau_2 \end{bmatrix}
+ \begin{bmatrix} \alpha/2 \\ \alpha/2 \\ -\alpha/2 \\ -\alpha/2 \end{bmatrix}
= X\beta + \begin{bmatrix} 1/2 \\ 1/2 \\ -1/2 \\ -1/2 \end{bmatrix} \alpha
= X\beta + W\alpha = X\beta + \eta.
\]

If we analyze the data assuming the GMM with $E(y) = X\beta$, again determine

1. $E(\widehat{\tau_1 - \tau_2})$, and
2. $E(\hat{\sigma}^2)$.
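To make the two questions concrete, here is a small numerical check; it is not part of the original notes. It uses the facts that, under the true model $E(y) = X\beta + \eta$, a linear estimator $l'y$ has $E(l'y) = l'X\beta + l'\eta$, and $E(\hat{\sigma}^2) = \sigma^2 + \eta'(I - P_X)\eta/(n - r)$. Here $\widehat{\tau_1 - \tau_2} = \bar{y}_{1\cdot} - \bar{y}_{2\cdot} = l'y$ with $l = (1/2, 1/2, -1/2, -1/2)'$, and the values of $\mu$, $\tau_1$, $\tau_2$, $\alpha$, and $\sigma^2$ are arbitrary illustrative choices.

```python
# Exact expectations under underfitting for Examples 1 and 2 (a sketch;
# parameter values below are illustrative, not from the notes).
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])                       # assumed design, n = 4
n, r = X.shape[0], np.linalg.matrix_rank(X)        # r = rank(X) = 2
P = X @ np.linalg.pinv(X)                          # projection onto C(X)

l = np.array([0.5, 0.5, -0.5, -0.5])               # tau1hat - tau2hat = l'y

mu, tau1, tau2, alpha, sigma2 = 10.0, 3.0, 1.0, 2.0, 1.5
beta = np.array([mu, tau1, tau2])

for label, w in [("Example 1 (sexes split within groups) ",
                  np.array([0.5, -0.5, 0.5, -0.5])),
                 ("Example 2 (sex confounded with groups)",
                  np.array([0.5, 0.5, -0.5, -0.5]))]:
    eta = w * alpha
    E_contrast = l @ (X @ beta + eta)              # E(tau1hat - tau2hat)
    E_s2 = sigma2 + eta @ (np.eye(n) - P) @ eta / (n - r)  # E(sigmahat^2)
    print(f"{label}: E(tau1hat - tau2hat) = {E_contrast:.2f}, "
          f"E(sigmahat^2) = {E_s2:.2f}")
```

With these choices, Example 1 returns $E(\widehat{\tau_1 - \tau_2}) = 2 = \tau_1 - \tau_2$ but $E(\hat{\sigma}^2) = \sigma^2 + \alpha^2/2 = 3.5$, while Example 2 returns $E(\widehat{\tau_1 - \tau_2}) = 4 = (\tau_1 - \tau_2) + \alpha$ but $E(\hat{\sigma}^2) = \sigma^2 = 1.5$, previewing how the answers to the questions above differ between the two examples.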
Overfitting

Now suppose we consider the model $y = X\beta + \varepsilon$, where $X = [X_1, X_2]$ and
\[
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}
\ni X\beta = X_1\beta_1 + X_2\beta_2.
\]
Furthermore, suppose that (unknown to us) $X_2\beta_2 = 0$. In this case, we say that we are overfitting: we are fitting a model that is more complicated than it needs to be.

To examine the impact of the overfitting, consider the case where $X = [X_1, X_2]$ has full column rank.

If we were to fit the simpler and correct model $y = X_1\beta_1 + \varepsilon$, the LSE of $\beta_1$ is $\tilde{\beta}_1 = (X_1'X_1)^{-1}X_1'y$. Then
\[
E(\tilde{\beta}_1) = (X_1'X_1)^{-1}X_1'E(y) = (X_1'X_1)^{-1}X_1'X_1\beta_1 = \beta_1
\]
and
\[
\mathrm{Var}(\tilde{\beta}_1) = (X_1'X_1)^{-1}X_1'\mathrm{Var}(y)X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}X_1'X_1(X_1'X_1)^{-1}
= \sigma^2(X_1'X_1)^{-1}.
\]

If we were to fit the full model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, which is correct but more complicated than it needs to be, then the LSE of $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$ is
\[
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \left([X_1, X_2]'[X_1, X_2]\right)^{-1}[X_1, X_2]'y
= \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}.
\]

If $X_1'X_2 = 0$, then
\[
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}
= \begin{bmatrix} (X_1'X_1)^{-1}X_1'y \\ (X_2'X_2)^{-1}X_2'y \end{bmatrix}
= \begin{bmatrix} \tilde{\beta}_1 \\ (X_2'X_2)^{-1}X_2'y \end{bmatrix}.
\]

Now suppose $X_1'X_2 \neq 0$. Then
\[
E\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta
= \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}.
\]
Thus, $E(\hat{\beta}_1) = \beta_1$. Also,
\[
\mathrm{Var}(\hat{\beta}) = \mathrm{Var}\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \sigma^2(X'X)^{-1}
= \sigma^2 \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}.
\]

By Exercise A.72,
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} A^{-1} + A^{-1}BE^{-1}CA^{-1} & -A^{-1}BE^{-1} \\ -E^{-1}CA^{-1} & E^{-1} \end{bmatrix},
\quad \text{where } E = D - CA^{-1}B.
\]
Thus, $\mathrm{Var}(\hat{\beta}_1)$ is $\sigma^2$ times
\[
(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}
= (X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]
Thus,
\[
\mathrm{Var}(\hat{\beta}_1) - \mathrm{Var}(\tilde{\beta}_1)
= \sigma^2 (X_1'X_1)^{-1}X_1'X_2\left(X_2'(I - P_{X_1})X_2\right)^{-1}X_2'X_1(X_1'X_1)^{-1}.
\]

In a homework problem, you will show that $\mathrm{Var}(\hat{\beta}_1) - \mathrm{Var}(\tilde{\beta}_1)$ is nonnegative definite (NND). Thus, one cost of overfitting is increased variability of the estimators of the regression coefficients.

How is estimation of $\sigma^2$ affected? Let $r_1 = \mathrm{rank}(X_1)$, $r_2 = \mathrm{rank}(X_2)$, and $r = \mathrm{rank}(X)$.

If we fit the simpler model $y = X_1\beta_1 + \varepsilon$, then
\[
\tilde{\sigma}^2 = \frac{y'(I - P_{X_1})y}{n - r_1}
\quad \text{and} \quad
E\left(y'(I - P_{X_1})y\right) = (n - r_1)\sigma^2
\;\Rightarrow\; E(\tilde{\sigma}^2) = \sigma^2.
\]

If we overfit with the model $y = X\beta + \varepsilon$, then
\[
\hat{\sigma}^2 = \frac{y'(I - P_X)y}{n - r}
\quad \text{and} \quad
E(\hat{\sigma}^2) = \sigma^2.
\]
Thus, overfitting does not lead to biased estimation of $\sigma^2$.

However, as we will see later in the course, overfitting leads to a loss of degrees of freedom ($n - r < n - r_1$), which can lead to a loss of power for testing hypotheses about $\beta$.
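Since the proof of the NND claim is deferred to homework, here is a quick numerical illustration; it is not from the notes, and it is not a proof. It simulates a full-column-rank $X = [X_1, X_2]$ with $X_1'X_2 \neq 0$, verifies that $\mathrm{Var}(\hat{\beta}_1) - \mathrm{Var}(\tilde{\beta}_1)$ matches the closed form derived above, and checks that its eigenvalues are nonnegative. The dimensions, seed, and $\sigma^2$ are arbitrary.

```python
# Numerical illustration of the overfitting variance comparison (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 20, 3, 2
X1 = rng.standard_normal((n, p1))
# Add an X1 component to X2 so that X1'X2 != 0.
X2 = rng.standard_normal((n, p2)) + X1 @ rng.standard_normal((p1, p2))
X = np.hstack([X1, X2])
sigma2 = 1.0

V_tilde = sigma2 * np.linalg.inv(X1.T @ X1)        # Var(beta1-tilde), simple model
V_full = sigma2 * np.linalg.inv(X.T @ X)           # Var(beta-hat), overfit model
V_hat = V_full[:p1, :p1]                           # Var(beta1-hat): top-left block

# Closed form: sigma^2 (X1'X1)^{-1} X1'X2 (X2'(I - P_X1)X2)^{-1} X2'X1 (X1'X1)^{-1}
P1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T          # projection onto C(X1)
A = np.linalg.inv(X1.T @ X1) @ X1.T @ X2
D_closed = sigma2 * A @ np.linalg.inv(X2.T @ (np.eye(n) - P1) @ X2) @ A.T

D = V_hat - V_tilde
print("matches closed form:", np.allclose(D, D_closed))
print("eigenvalues of Var(beta1-hat) - Var(beta1-tilde):",
      np.round(np.linalg.eigvalsh(D), 6))          # all >= 0, i.e., NND
```

Running this prints `matches closed form: True` and a set of nonnegative eigenvalues, consistent with the claim that overfitting inflates the variance of the estimator of $\beta_1$ whenever $X_1'X_2 \neq 0$.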