Stat 475 Notes 5
Reading: Lohr, Chapter 3.2-3.3
I. Comment on Ratio Estimator
We estimate the ratio $B = \bar{y}_U / \bar{x}_U$ by
$$\hat{B} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}.$$
What about the estimator $\tilde{B} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i}$?
$\tilde{B}$ is not as good an estimator as $\hat{B}$.
For a large sample, $\hat{B}$ will have small bias and small variance, but $\tilde{B}$ can still have substantial bias because, in general, for two random variables $X$ and $Y$, $E(Y/X)$ need not equal $E(Y)/E(X)$.
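To see this concretely, here is a minimal simulation sketch in R; the population below is invented purely for illustration (any population whose regression line has a nonzero intercept behaves similarly):
set.seed(475);
N=1000; # Hypothetical population size, for illustration only
x=runif(N,1,10); # Auxiliary variable, bounded away from 0
y=1+2*x+rnorm(N); # Study variable; its regression line has a nonzero intercept
B=mean(y)/mean(x); # Target ratio ybar_U/xbar_U
Bhat=replicate(10000,{S=sample(N,30); sum(y[S])/sum(x[S])}); # Ratio of sums over repeated SRSs of size 30
Btilde=replicate(10000,{S=sample(N,30); mean(y[S]/x[S])}); # Mean of the per-unit ratios y_i/x_i
mean(Bhat)-B; # Bias is near zero
mean(Btilde)-B; # Bias is substantial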
II. Regression Estimator
The regression estimator is another estimator of $\bar{y}_U$, besides the ratio estimator, for use when an auxiliary variable $x$ with known population mean $\bar{x}_U$ is available.
Suppose a linear regression model holds in the population:
$$y_i = B_0 + B_1 x_i + e_i, \qquad E(e_i \mid x_i) = 0.$$
Then
$$E(y_i \mid x_i = \bar{x}_U) = B_0 + B_1 \bar{x}_U = \bar{y}_U. \tag{1.1}$$
The regression estimator substitutes the least squares estimates of $B_0$ and $B_1$ from the sample into (1.1) to estimate $\bar{y}_U$.
Specifically, the least squares estimates are
( xi  x )( yi  y )

Bˆ1  iS
 ( xi  x )2
iS
Bˆ0  y  Bˆ1 x
and the regression estimator is
yˆ reg  Bˆ0  Bˆ1 xU
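In R, this can be computed directly from an lm fit. Here is a minimal sketch; the function name ybar.reg.est and its arguments are ours, not from Lohr:
ybar.reg.est=function(y,x,xbarU){ # y, x: SRS values; xbarU: known population mean of x
  fit=lm(y~x); # Least squares estimates B0hat, B1hat from the sample
  unname(coef(fit)[1]+coef(fit)[2]*xbarU); # B0hat + B1hat*xbarU
}
For the example later in these notes, ybar.reg.est(field,photo,11.3) should reproduce the value 11.98929 computed step by step there.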
Bias and standard error of the regression estimator:
These properties do not assume that the regression model (1.1) holds. The standard error is an estimate of the standard deviation that uses a Taylor series expansion.
Bias: Since $\hat{\bar{y}}_{\text{reg}} = \bar{y} - \hat{B}_1 (\bar{x} - \bar{x}_U)$ and, under simple random sampling, $E[\bar{x}] = \bar{x}_U$,
$$E[\hat{\bar{y}}_{\text{reg}} - \bar{y}_U] = E[\bar{y} - \bar{y}_U] - E[\hat{B}_1 (\bar{x} - \bar{x}_U)] = -\mathrm{Cov}(\hat{B}_1, \bar{x}).$$
The bias is approximately zero for large samples.
Let $s^2_{e,\text{reg}}$ be the sample variance of the residuals from the least squares regression in the sample:
$$s^2_{e,\text{reg}} = \frac{1}{n-2} \sum_{i \in S} (y_i - \hat{B}_0 - \hat{B}_1 x_i)^2.$$
Then
$$SE(\hat{\bar{y}}_{\text{reg}}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,\text{reg}}}{n}}.$$
Also, if we are interested in estimating the population total $t_y$ for $y$ and we know the population size $N$, then
$$\hat{t}_{\text{reg}} = N \hat{\bar{y}}_{\text{reg}} \quad \text{and} \quad SE(\hat{t}_{\text{reg}}) = N \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,\text{reg}}}{n}}.$$
Unlike the ratio estimator, the regression estimator cannot be used to estimate a population total if we do not know $N$.
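A matching sketch for the standard errors above (again, the function name is ours; it assumes an SRS of size length(y) from a population of size N):
se.ybar.reg.est=function(y,x,N){
  n=length(y);
  fit=lm(y~x);
  sereg.sq=sum(resid(fit)^2)/(n-2); # Sample variance of residuals, s^2_{e,reg}
  sqrt((1-n/N)*sereg.sq/n); # SE of the estimated mean; multiply by N for SE of the estimated total
}
For the example later in these notes, se.ybar.reg.est(field,photo,100) should match the step-by-step computation there.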
Comparison of ratio estimator and regression estimator for estimating the mean:
Let $\hat{B} = \bar{y}/\bar{x}$ and
$$s^2_{e,\text{ratio}} = \frac{1}{n-1} \sum_{i \in S} (y_i - \hat{B} x_i)^2.$$
Then
$$SE(\hat{\bar{y}}_{\text{ratio}}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{s^2_{e,\text{ratio}}}{n}}.$$
Thus, the ratio estimator and regression estimator perform similarly when $s^2_{e,\text{reg}} \approx s^2_{e,\text{ratio}}$, which will be the case if the regression line (1.1) goes through the origin. But if the regression line (1.1) does not go through the origin, then $s^2_{e,\text{reg}}$ might be much less than $s^2_{e,\text{ratio}}$, and the regression estimator will be much better than the ratio estimator.
As with the ratio estimator, the regression estimator will gain more over the sample mean when $y$ and $x$ are highly correlated.
To decide whether to use the ratio estimator, the regression estimator, or the sample mean, we should first plot the data. If a simple linear regression model appears to approximately hold, the regression line approximately goes through the origin, and there is a reasonably high correlation between $y$ and $x$ (say above 0.5), then either the ratio or the regression estimator is reasonable. If a simple linear regression model appears to approximately hold but the regression line does not approximately go through the origin, then the regression estimator is the best choice. If a simple linear regression model does not appear to approximately hold, then the regression estimator could have substantial bias, and a more advanced regression estimator than the one we study here could be considered.
Example: To estimate the number of dead trees in an area, we divide the area into 100 square plots and count the number of dead trees on a photograph of each plot. Photo counts can be made quickly, but sometimes a tree is misclassified or not detected. So we select a simple random sample of size 25 of the plots for field counts of dead trees. We know that the population mean photo count of dead trees per plot is 11.3.
The following is a scatterplot of the data and the least squares
regression line:
photo=c(10,12,7,13,13,6,17,16,15,10,14,12,10,5,12,10,10,9,6,11,7,9,11,10,10);
field=c(15,14,9,14,8,5,18,15,13,15,11,15,12,8,13,9,11,12,9,12,13,11,10,9,8);
regmodel=lm(field~photo);
plot(photo,field);
abline(a=coef(regmodel)[1],b=coef(regmodel)[2]);
The following is a residual plot for the simple linear regression
model:
plot(photo,resid(regmodel),ylab="Residuals",main="Residual Plot for Simple Linear Regression Model");
abline(0,0);
The simple linear regression model appears to approximately
hold but the regression line does not approximately go through
the origin, so the regression estimator is a better choice than the
ratio estimator.
The correlation is reasonably high:
> cor(photo,field)
[1] 0.6241967
so the regression estimator may gain over the sample mean.
The regression estimator is computed as follows:
> regmodel=lm(field~photo); # Least squares regression of field on photo
> B0hat=coef(regmodel)[1]; # Least squares estimate of intercept
> B0hat
(Intercept)
5.059292
> B1hat=coef(regmodel)[2]; # Least squares estimate of slope
> B1hat
photo
0.6132743
> ybar.reg=B0hat+B1hat*11.3; # Regression estimator; uses the known population mean of photo, 11.3
> ybar.reg
(Intercept)
11.98929
> sereg.sq=sum(resid(regmodel)^2)/(length(field)-2); # Estimated variance of the residuals from the least squares regression
> se.ybar.reg=sqrt((1-(25/100))*sereg.sq/25); # Standard error of regression estimator
> se.ybar.reg
[1] 0.4167579
>
> ybar=mean(field); # Sample mean
> ybar
[1] 11.56
> se.ybar=sqrt((1-(25/100))*var(field)/25); # Standard error of sample mean
> se.ybar
[1] 0.5222069
The regression estimator has a considerably smaller standard
error, 0.417, than the sample mean, 0.522.
Our estimate of the total number of dead trees with the regression estimator is
$$\hat{t}_{y,\text{reg}} = \hat{\bar{y}}_{\text{reg}} \times N = (11.99)(100) = 1199,$$
$$SE(\hat{t}_{y,\text{reg}}) = SE(\hat{\bar{y}}_{\text{reg}}) \times N = (0.417)(100) = 41.7.$$
An approximate 95% confidence interval for the total number of dead trees is
$$1199 \pm t_{0.025,\,n-2} (41.7) = 1199 \pm 2.07(41.7) = (1113,\ 1285).$$
Here we use the t-distribution percentile (with $n-2 = 23$ degrees of freedom) of 2.07 rather than the normal distribution percentile of 1.96 because of the relatively small sample size.
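Continuing the R session above, one way to reproduce the total, its standard error, and the confidence interval:
> that.reg=100*ybar.reg; # Estimated total, N=100
> se.that.reg=100*se.ybar.reg; # Standard error of estimated total
> that.reg+c(-1,1)*qt(0.975,23)*se.that.reg # 95% CI, t percentile with n-2=23 df
[1] 1112.716 1285.142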