Stat 475 Notes 7
Reading: Lohr, Chapter 4.1-4.2
I. Generalized Regression Estimator: Example
The Florida Game and Freshwater Fish Commission is
interested in estimating the mean weight of alligators. The
lengths of alligators are much more easily observed. It is known
that the mean length of alligators in the population is 100 inches,
the mean squared length is 8000 square inches, and the
mean cubed length is 810,000 cubic inches. A simple
random sample of 25 alligators is taken, and the weight
and length of each alligator are recorded.
length=c(94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78);
weight=c(130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57);
plot(length,weight);
For the sample data, a cubic regression model
$E(Y \mid X) = B_0 + B_1 X + B_2 X^2 + B_3 X^3$
fits much better than a simple linear regression model.
> summary(lm(weight~length));
Call:
lm(formula = weight ~ length)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -393.2640    47.5341  -8.273 2.40e-08 ***
length         5.9024     0.5448  10.833 1.65e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 54.01 on 23 degrees of freedom
Multiple R-squared: 0.8361, Adjusted R-squared: 0.829
F-statistic: 117.4 on 1 and 23 DF, p-value: 1.654e-10
lengthsq=length^2;
lengthcubed=length^3;
summary(lm(weight~length+lengthsq+lengthcubed));
Call:
lm(formula = weight ~ length + lengthsq + lengthcubed)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.778e+02  1.591e+02  -1.747 0.095330 .
length       1.147e+01  5.175e+00   2.217 0.037807 *
lengthsq    -1.542e-01  5.418e-02  -2.846 0.009676 **
lengthcubed  8.070e-04  1.811e-04   4.456 0.000218 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.36 on 21 degrees of freedom
Multiple R-squared: 0.9934, Adjusted R-squared: 0.9924
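The improvement from the linear to the cubic fit can also be checked with a nested-model F test. The following is a minimal sketch that reuses the sample data from this example; the object names linfit, cubfit and fit.summary are illustrative, and I(length^2) is used in place of a separate squared variable only for compactness.

```r
# Sample data from the alligator example
length <- c(94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78)
weight <- c(130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57)

linfit <- lm(weight ~ length)                               # simple linear model
cubfit <- lm(weight ~ length + I(length^2) + I(length^3))   # cubic model

# Nested-model F test: do the quadratic and cubic terms improve the fit?
print(anova(linfit, cubfit))

# Compare residual standard errors and R-squared values side by side
fit.summary <- c(sigma.linear = summary(linfit)$sigma,
                 sigma.cubic  = summary(cubfit)$sigma,
                 r2.linear    = summary(linfit)$r.squared,
                 r2.cubic     = summary(cubfit)$r.squared)
print(fit.summary)
```

A small p-value in the F test, together with a much smaller residual standard error for the cubic model, supports using the cubic model as the basis for the regression estimator.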
Suppose the cubic regression model holds in the population:
$y_i = B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3 + e_i$, $E(e_i \mid x_i) = 0$, i.e.,
$E(y_i \mid x_i) = B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3$.
Then,
$$\bar{y}_U = E[E(y_i \mid x_i)]$$
$$= E[B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3 + e_i]$$
$$= B_0 + B_1 \frac{1}{N}\sum_{i=1}^{N} x_i + B_2 \frac{1}{N}\sum_{i=1}^{N} x_i^2 + B_3 \frac{1}{N}\sum_{i=1}^{N} x_i^3. \qquad (1.1)$$
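Equation (1.1) can be checked numerically. The sketch below builds a synthetic population (every value in it is made up for illustration and is not the alligator population), fits the cubic regression to the whole population, and compares the population mean of y with the right-hand side of (1.1). The two agree up to rounding because least-squares residuals from a fit with an intercept average to zero.

```r
set.seed(475)                            # arbitrary seed, for reproducibility only
N <- 2000                                # hypothetical population size
x <- runif(N, 50, 150)                   # hypothetical lengths
y <- 0.0002 * x^3 + rnorm(N, sd = 10)    # hypothetical weights, roughly cubic in x

# Census regression: coefficients B0, ..., B3 computed from the whole population
B <- coef(lm(y ~ x + I(x^2) + I(x^3)))

# Right-hand side of (1.1): B0 + B1*mean(x) + B2*mean(x^2) + B3*mean(x^3)
rhs <- sum(B * c(1, mean(x), mean(x^2), mean(x^3)))

print(c(ybarU = mean(y), rhs = rhs))
```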
In the generalized regression estimator, we estimate
$B_0, B_1, B_2, B_3$ from the regression of $y_i$ on $x_i, x_i^2, x_i^3$ in the
sample; denote these estimates by $\hat{B}_0, \hat{B}_1, \hat{B}_2, \hat{B}_3$.
The generalized regression estimator plugs $\hat{B}_0, \hat{B}_1, \hat{B}_2, \hat{B}_3$ into
(1.1) for $B_0, B_1, B_2, B_3$:
$$\hat{\bar{y}}_{gen,reg} = \hat{B}_0 + \hat{B}_1 \frac{1}{N}\sum_{i=1}^{N} x_i + \hat{B}_2 \frac{1}{N}\sum_{i=1}^{N} x_i^2 + \hat{B}_3 \frac{1}{N}\sum_{i=1}^{N} x_i^3$$
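# Fit the cubic regression to the sample and extract the estimated coefficients
cubicreg=lm(weight~length+lengthsq+lengthcubed);
B0hat=coef(cubicreg)[1];
B1hat=coef(cubicreg)[2];
B2hat=coef(cubicreg)[3];
B3hat=coef(cubicreg)[4];
# Plug in the known population means of length, length^2 and length^3
yhat.gen.reg=B0hat+B1hat*100+B2hat*8000+B3hat*810000;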
> yhat.gen.reg
(Intercept)
289.5714
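The same calculation can be wrapped in a small function. This is only a sketch: greg.mean and its argument xbar.pop are hypothetical names introduced here, and the function assumes the cubic model used in this example. Because it fits the same regression, it should agree with yhat.gen.reg when given the same data and population means.

```r
# Sample data from the alligator example
length <- c(94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78)
weight <- c(130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57)

# Hypothetical helper: regression estimate of the population mean of y, given
# the known population means of x, x^2 and x^3 (in that order) in xbar.pop.
greg.mean <- function(y, x, xbar.pop) {
  fit <- lm(y ~ x + I(x^2) + I(x^3))
  # Plug the estimated coefficients into (1.1)
  unname(sum(coef(fit) * c(1, xbar.pop)))
}

# Known population means of length, length^2, length^3 from the example
est <- greg.mean(weight, length, xbar.pop = c(100, 8000, 810000))
print(est)

# Sanity check: plugging in the sample means instead returns the sample mean of
# weight, because a least-squares fit with an intercept passes through the means.
check <- greg.mean(weight, length, c(mean(length), mean(length^2), mean(length^3)))
print(c(check = check, sample.mean = mean(weight)))
```

The sanity check shows how the estimator works: it starts from the sample mean of weight and adjusts it according to how far the sample means of length, length squared and length cubed are from their known population values.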
We will study a technique called the bootstrap for obtaining the
standard error of complex estimators like the generalized
regression estimator later in the course.
II. Stratified Sampling
A stratified random sample is a probability sample obtained by
separating the population elements into nonoverlapping groups,
called strata, and then selecting a simple random sample from
each stratum.
Motivating example: Suppose a public opinion poll designed to
estimate the proportion of voters who favor spending more tax
revenue on an improved ambulance service is to be conducted in
a certain county. The county contains two cities and a rural
area. The population of interest for the poll is all men and
women of voting age who reside in the county. A stratified
random sample of adults residing in the county can be obtained
by selecting a simple random sample of adults from each city
and another simple random sample of adults from the rural area.
That is, the two cities and the rural area represent three strata
from which we obtain simple random samples.
Reasons for using a stratified random sample rather than a simple
random sample for the county poll:
• Our goal in designing surveys is to maximize the
information obtained, i.e., minimize the standard deviation
of the estimate, for a fixed expenditure. Samples
displaying small variability among the measurements will
produce small standard deviations. Thus, if all the adults in
one city (say, city A) tend to think alike on the ambulance
service issue, we can obtain a very accurate estimate of the
proportion in question with a relatively small sample.
Similarly, if all the adults in the second city (city B) tend to
think alike on this issue, although they may differ in
opinion from those in city A, then we can again obtain an
accurate estimate with a small sample. This situation may
arise if city A has a hospital and hence has no great need
for improved ambulance service, whereas city B does not
have a hospital and hence has great need for an improved
ambulance service. The opinions in the rural area may be
more varied, but a smaller number of adults may reside
here and enough resources may be available for a careful
study of this area. When results of the stratified random
sample are combined, the final estimate of the proportion of
voters favoring more expenditures for an ambulance
service may have a much smaller standard deviation than
would an estimate from a simple random sample of
comparable size.
• The cost of obtaining observations varies with the design of
the survey. The cost of selecting adults to be sampled, the
cost of interviewer time and travel, and the cost of
administering the overall sampling procedure may all be
minimized by a carefully planned stratified random sample
in compact, well-defined geographic areas. Such cost
savings allow the investigators to use a larger sample size
than they could use for a simple random sample of the
same total cost.
• Estimates of a population parameter (e.g., the mean) may
be desired for certain subsets of the population, i.e.,
domains. In the county poll, each city commissioner may
want to see an estimate of the proportion of voters favoring
an expanded ambulance service for his or her own city. Stratified
random sampling allows us to ensure that the sample size in
each stratum is sufficient for obtaining accurate estimates
of the population parameter for the stratum.
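The first reason can be illustrated with a small calculation. The sketch below uses a hypothetical county: the stratum sizes, the proportions in favor and the sample size are all made up. It compares the approximate standard deviation of the estimated proportion under simple random sampling with that under stratified sampling with proportional allocation, using the standard variance formulas and ignoring finite population corrections for simplicity.

```r
# Hypothetical county: stratum sizes and true proportions in favor (made up)
N.h <- c(cityA = 5000, cityB = 4000, rural = 1000)
p.h <- c(cityA = 0.10, cityB = 0.90, rural = 0.50)
n   <- 100                               # total sample size

W.h <- N.h / sum(N.h)                    # stratum weights N_h / N
n.h <- n * W.h                           # proportional allocation: 50, 40, 10
p   <- sum(W.h * p.h)                    # overall population proportion

# Approximate standard deviations, ignoring finite population corrections
sd.srs   <- sqrt(p * (1 - p) / n)                        # simple random sample
sd.strat <- sqrt(sum(W.h^2 * p.h * (1 - p.h) / n.h))     # stratified sample

print(round(c(sd.srs = sd.srs, sd.strat = sd.strat), 4))
```

Because opinions within each city are nearly unanimous in this made-up county, the stratified estimate has a noticeably smaller standard deviation than the simple random sample estimate of the same total size.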
In summary, the principal reasons for using stratified random
sampling rather than simple random sampling are as follows:
1. Stratification may produce a smaller standard deviation of
the estimate (and hence a narrower confidence interval)
than would be produced by a simple random sample of the
same size. This is particularly true if measurements
within each stratum are homogeneous. This advantage of
stratification is similar to the advantage of blocking in
randomized experiments.
2. The cost per observation in the survey may be reduced by
stratification of the population elements into convenient
groupings.
3. Estimates of population parameters may be desired for
subgroups of the population. These subgroups should then be
identifiable strata.
Examples of surveys in which stratified random sampling is
advantageous:
• Sampling hospital patients on a certain diet to assess weight
gain may be more efficient if the patients are stratified by
gender because men tend to weigh more than women.
• A poll of college students at a large university may be more
conveniently administered and carried out if students are
stratified into on-campus and off-campus residents.
• A quality control sampling plan in a manufacturing plant
may be stratified by production lines because estimates of
proportions of defective products may be required by the
manager of each line.
Most major surveys have some degree of stratification
incorporated into the design. As examples, we consider three
important surveys conducted by the U.S. Bureau of Labor
Statistics:
• The Consumer Price Index (CPI) is a measure of the average
change in prices for a fixed collection of goods and services
for urban consumers. The CPI is actually calculated from
at least four different types of surveys: surveys of cities,
surveys of urban families, surveys of outlets providing
goods and services, and surveys of specific goods and
services. In the design of most CPI surveys, sampling units
(counties or groups of contiguous counties) are identified in
the population and then grouped into strata. Strata are
chosen on the basis of geography, population size, rate of
population increase, major industry, percentage nonwhite
and percentage urban. The sampling units within a stratum
are chosen to be as much alike as possible with regard to
these characteristics.
• The Current Population Survey (CPS) measures aspects of
employment, unemployment and people not in the labor
force. It uses strata similar to those used in the CPI
surveys, except rural sampling units are used and the
number of farms becomes an important quantity for
• The Establishment Survey (ES) collects data on work hours
and earnings for nonagricultural establishments in the
United States. Establishments are stratified according to
industry type and size, primarily for homogeneity of
measurements but also for provision of estimates for
various types of industries. For example, information is
provided for such industrial categories as mining,
construction, manufacturing, transportation, and finance,
insurance and real estate.
III. Drawing a stratified random sample and estimating the
population mean and total.
The first step in the selection of a stratified random sample is to
clearly specify the strata; then each sampling unit of the
population is placed into its appropriate stratum. This step may
be more difficult than it sounds. For example, suppose you plan
to stratify the sampling units – say, households – into rural and
urban units. What should be done with households in a town of
1000 inhabitants? Are these households rural or urban? They
may be rural if the town is isolated in the country, or they may
be urban if the town is adjacent to a large city. Hence, it is essential to
specify what is meant by urban and rural so that each
sampling unit clearly falls into exactly one stratum.
For stratified sampling to work, we need to know the population
size in each stratum. Suppose there are H strata. The population
sizes in the strata are denoted by $N_1, \ldots, N_H$, and
$N_1 + \cdots + N_H = N$,
where N is the total number of units in the entire population.
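The selection step can be sketched in R as follows. The sampling frame, the stratum sizes and the stratum sample sizes are all hypothetical, and the estimates at the end use the standard stratified estimator, which weights each stratum sample mean by N_h / N; this is offered as an illustration under those assumptions rather than as the worked example from the book.

```r
set.seed(7)                                   # arbitrary seed, for reproducibility only

# Hypothetical sampling frame: every unit belongs to exactly one stratum
frame <- data.frame(
  stratum = rep(c("cityA", "cityB", "rural"), times = c(500, 400, 100)),
  y       = c(rbinom(500, 1, 0.10), rbinom(400, 1, 0.90), rbinom(100, 1, 0.50))
)
N.h <- table(frame$stratum)                   # stratum sizes N_1, ..., N_H
N   <- sum(N.h)                               # N_1 + ... + N_H = N
n.h <- c(cityA = 50, cityB = 40, rural = 10)  # hypothetical stratum sample sizes

# Select a simple random sample, without replacement, within each stratum
rows <- unlist(lapply(names(n.h), function(h) {
  sample(which(frame$stratum == h), n.h[[h]])
}))
samp <- frame[rows, ]

# Standard stratified estimates of the population mean and total
ybar.h    <- tapply(samp$y, samp$stratum, mean)             # stratum sample means
ybar.str  <- sum(as.numeric(N.h[names(ybar.h)]) / N * ybar.h)
total.str <- N * ybar.str
print(c(mean = ybar.str, total = total.str))
```

Note that the stratum sizes N.h enter the estimate directly, which is why they must be known before the sample is drawn.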
Theory of Stratified Sampling and Example from book.