Stat 475 Notes 5

Reading: Lohr, Chapter 3.2-3.3

I. Comment on Ratio Estimator

We estimate the ratio $B = \bar{y}_U / \bar{x}_U$ by
$$\hat{B} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i}.$$
What about the estimator
$$\tilde{B} = \frac{1}{n} \sum_{i=1}^n \frac{y_i}{x_i}\,?$$
$\tilde{B}$ is not as good an estimator as $\hat{B}$. For a large sample, $\hat{B}$ will have small bias and small variance, but $\tilde{B}$ can still have substantial bias because, in general, for two random variables $X$ and $Y$, $E(Y/X)$ might not equal $E(Y)/E(X)$.

II. Regression Estimator

The regression estimator is another estimator besides the ratio estimator for $\bar{y}_U$ when we have available an auxiliary variable $x$ for which we know $\bar{x}_U$.

Suppose a linear regression model holds in the population:
$$y_i = B_0 + B_1 x_i + e_i, \qquad E(e_i \mid x_i) = 0.$$
Then
$$E(y_i \mid x_i = \bar{x}_U) = B_0 + B_1 \bar{x}_U = \bar{y}_U. \qquad (1.1)$$
The regression estimator substitutes the least squares estimates of $B_0$ and $B_1$ from the sample into (1.1) to estimate $\bar{y}_U$. Specifically, the least squares estimates are
$$\hat{B}_1 = \frac{\sum_{i \in S} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S} (x_i - \bar{x})^2}, \qquad \hat{B}_0 = \bar{y} - \hat{B}_1 \bar{x},$$
and the regression estimator is
$$\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1 \bar{x}_U.$$

Bias and standard error of the regression estimator: These properties do not assume that the regression model (1.1) holds. The standard error is an estimate of the standard deviation that uses a Taylor series expansion.

Bias:
$$E[\hat{\bar{y}}_{reg} - \bar{y}_U] = E[\bar{y} - \bar{y}_U] - E[\hat{B}_1(\bar{x} - \bar{x}_U)] = -\mathrm{Cov}(\hat{B}_1, \bar{x}).$$
The bias is approximately zero for large samples.

Let $s^2_{e,reg}$ be the sample variance of the residuals from the least squares regression in the sample:
$$s^2_{e,reg} = \frac{1}{n-2} \sum_{i \in S} (y_i - \hat{B}_0 - \hat{B}_1 x_i)^2.$$
Then
$$SE(\hat{\bar{y}}_{reg}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,reg}}{n}}.$$

Also, if we are interested in estimating the population total $t_y$ for $y$ and we know the population size $N$, then $\hat{t}_{reg} = N \hat{\bar{y}}_{reg}$ and
$$SE(\hat{t}_{reg}) = N \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,reg}}{n}}.$$
Unlike the ratio estimator, the regression estimator cannot be used to estimate a population total if we do not know $N$.

Comparison of the ratio estimator and the regression estimator for estimating the mean:

Let $s^2_{e,ratio}$ be the sample variance of the residuals from the ratio fit $\hat{B} = \bar{y}/\bar{x}$:
$$s^2_{e,ratio} = \frac{1}{n-1} \sum_{i \in S} (y_i - \hat{B} x_i)^2.$$
Then
$$SE(\hat{\bar{y}}_{ratio}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,ratio}}{n}}.$$
Thus, the ratio estimator and the regression estimator perform similarly when $s^2_{e,reg} \approx s^2_{e,ratio}$, which will be the case if the regression line (1.1) goes through the origin. But if the regression line (1.1) does not go through the origin, then $s^2_{e,reg}$ might be much less than $s^2_{e,ratio}$ and the regression estimator will be much better than the ratio estimator.

As with the ratio estimator, the regression estimator will gain more over the sample mean when $y$ and $x$ are highly correlated.

To decide whether using the ratio estimator, the regression estimator, or the sample mean is a good idea, we should first plot the data. If a simple linear regression model appears to approximately hold with the regression line approximately going through the origin, and there is a reasonably high correlation between $y$ and $x$ (say above 0.5), then the ratio or the regression estimator is reasonable. If a simple linear regression model appears to approximately hold but the regression line does not approximately go through the origin, then the regression estimator is the best choice. If a simple linear regression model does not appear to approximately hold, then the regression estimator could have substantial bias, and a more advanced regression estimator than we study here could be considered.
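The formulas above can be collected into a small R function. The following is a minimal sketch, not from Lohr's text: the function name reg.estimator and its arguments are illustrative only. Given the sample values y and x, the known population mean xbarU of the auxiliary variable, and the population size N, it returns the regression estimator of the population mean and its standard error.

reg.estimator <- function(y, x, xbarU, N) {
  n <- length(y)
  fit <- lm(y ~ x)                                  # least squares fit within the sample
  B0hat <- unname(coef(fit)[1])                     # intercept estimate
  B1hat <- unname(coef(fit)[2])                     # slope estimate
  ybar.reg <- B0hat + B1hat * xbarU                 # regression estimator of the population mean
  s2.e.reg <- sum(resid(fit)^2) / (n - 2)           # residual variance s^2_{e,reg}
  se.ybar.reg <- sqrt((1 - n / N) * s2.e.reg / n)   # SE with finite population correction
  c(estimate = ybar.reg, SE = se.ybar.reg)
}

Applied to the data in the example below, reg.estimator(field, photo, 11.3, 100) should reproduce the values 11.99 and 0.417 obtained in the transcript.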
Example: To estimate the number of dead trees in an area, we divide the area into 100 square plots and count the number of dead trees on a photograph of each plot. Photo counts can be made quickly, but sometimes a tree is misclassified or not detected. So we select a simple random sample of size 25 of the plots for field counts of dead trees. From the photographs we know that the population mean photo count of dead trees per plot is $\bar{x}_U = 11.3$.

The following is a scatterplot of the data and the least squares regression line:

photo=c(10,12,7,13,13,6,17,16,15,10,14,12,10,5,12,10,10,9,6,11,7,9,11,10,10);
field=c(15,14,9,14,8,5,18,15,13,15,11,15,12,8,13,9,11,12,9,12,13,11,10,9,8);
regmodel=lm(field~photo);
plot(photo,field);
abline(a=coef(regmodel)[1],b=coef(regmodel)[2]);

The following is a residual plot for the simple linear regression model:

plot(photo,resid(regmodel),ylab="Residuals",main="Residual Plot for Simple Linear Regression Model");
abline(0,0);

The simple linear regression model appears to approximately hold, but the regression line does not approximately go through the origin, so the regression estimator is a better choice than the ratio estimator. The correlation is reasonably high,

> cor(photo,field)
[1] 0.6241967

so the regression estimator may gain over the sample mean. The regression estimator is computed as follows:

> regmodel=lm(field~photo); # Least squares regression of field on photo
> B0hat=coef(regmodel)[1]; # Least squares estimate of intercept
> B0hat
(Intercept)
   5.059292
> B1hat=coef(regmodel)[2]; # Least squares estimate of slope
> B1hat
    photo
0.6132743
> ybar.reg=B0hat+B1hat*11.3; # Regression estimator; uses that the known population mean of photo is 11.3
> ybar.reg
(Intercept)
   11.98929
> sereg.sq=sum(resid(regmodel)^2/(length(field)-2)); # Estimated variance of residuals from least squares regression
> se.ybar.reg=sqrt((1-(25/100))*sereg.sq/25); # Standard error of regression estimator
> se.ybar.reg
[1] 0.4167579
> ybar=mean(field); # Sample mean
> ybar
[1] 11.56
> se.ybar=sqrt((1-(25/100))*var(field)/25); # Standard error of sample mean
> se.ybar
[1] 0.5222069

The regression estimator has a considerably smaller standard error, 0.417, than the sample mean, 0.522. Our estimate of the total number of trees with the regression estimator is
$$\hat{t}_{y,reg} = \hat{\bar{y}}_{reg} \times N = (11.99)(100) = 1199,$$
$$SE(\hat{t}_{y,reg}) = SE(\hat{\bar{y}}_{reg}) \times N = (0.417)(100) = 41.7.$$
An approximate 95% confidence interval for the total number of trees is
$$1199 \pm t_{0.025,\,n-2}(41.7) = 1199 \pm 2.07(41.7) = (1113,\ 1285).$$
Here we use the t-distribution percentile (with $n-2 = 23$ degrees of freedom) of 2.07 rather than the normal distribution percentile of 1.96 because of the relatively small sample size.
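The confidence interval for the total can also be computed directly in R. The following sketch assumes the objects ybar.reg and se.ybar.reg created in the transcript above are still in the workspace; the names that.reg, se.that.reg, tcrit, and ci are not from the notes, just illustrative.

N <- 100; n <- 25                       # population and sample sizes from the example
that.reg <- N * ybar.reg                # estimated total number of dead trees
se.that.reg <- N * se.ybar.reg          # standard error of the estimated total
tcrit <- qt(0.975, df = n - 2)          # t percentile with 23 degrees of freedom, about 2.07
ci <- that.reg + c(-1, 1) * tcrit * se.that.reg
ci                                      # roughly (1113, 1285)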