Stat 475 Notes 7
Reading: Lohr, Chapter 4.1-4.2

I. Generalized Regression Estimator: Example

The Florida Game and Freshwater Fish Commission is interested in estimating the mean weight of alligators. The lengths of alligators are much more easily observed. It is known that the mean length of alligators in the population is 100 inches, the mean squared length is 8000 square inches, and the mean cubed length is 810,000 cubic inches. A simple random sample of 25 alligators is taken, and the length and weight of each alligator are recorded.

# lengths and weights of the 25 sampled alligators
length=c(94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78);
weight=c(130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57);
plot(length,weight);

For the sample data, a cubic regression model $E(Y \mid X) = B_0 + B_1 X + B_2 X^2 + B_3 X^3$ fits much better than a simple linear regression model.

> summary(lm(weight~length));

Call:
lm(formula = weight ~ length)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -393.2640    47.5341  -8.273 2.40e-08 ***
length         5.9024     0.5448  10.833 1.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.01 on 23 degrees of freedom
Multiple R-squared: 0.8361, Adjusted R-squared: 0.829
F-statistic: 117.4 on 1 and 23 DF, p-value: 1.654e-10

# add squared and cubed length as predictors
lengthsq=length^2;
lengthcubed=length^3;
summary(lm(weight~length+lengthsq+lengthcubed));

Call:
lm(formula = weight ~ length + lengthsq + lengthcubed)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.778e+02  1.591e+02  -1.747 0.095330 .
length       1.147e+01  5.175e+00   2.217 0.037807 *
lengthsq    -1.542e-01  5.418e-02  -2.846 0.009676 **
lengthcubed  8.070e-04  1.811e-04   4.456 0.000218 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.36 on 21 degrees of freedom
Multiple R-squared: 0.9934, Adjusted R-squared: 0.9924

Suppose the cubic regression model holds in the population:

$y_i = B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3 + e_i$, $E(e_i \mid x_i) = 0$,

i.e., $E(y_i \mid x_i) = B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3$. Then,

$$\bar{y}_U = E[E(y_i \mid x_i)] = E[B_0 + B_1 x_i + B_2 x_i^2 + B_3 x_i^3 + e_i] = B_0 + B_1 \frac{1}{N}\sum_{i=1}^{N} x_i + B_2 \frac{1}{N}\sum_{i=1}^{N} x_i^2 + B_3 \frac{1}{N}\sum_{i=1}^{N} x_i^3. \qquad (1.1)$$

In the generalized regression estimator, we estimate $B_0, B_1, B_2, B_3$ from the regression of $y_i$ on $x_i, x_i^2, x_i^3$ in the sample; denote these estimates by $\hat{B}_0, \hat{B}_1, \hat{B}_2, \hat{B}_3$. The generalized regression estimator plugs $\hat{B}_0, \hat{B}_1, \hat{B}_2, \hat{B}_3$ into (1.1) for $B_0, B_1, B_2, B_3$:

$$\hat{\bar{y}}_{gen,reg} = \hat{B}_0 + \hat{B}_1 \frac{1}{N}\sum_{i=1}^{N} x_i + \hat{B}_2 \frac{1}{N}\sum_{i=1}^{N} x_i^2 + \hat{B}_3 \frac{1}{N}\sum_{i=1}^{N} x_i^3$$

cubicreg=lm(weight~length+lengthsq+lengthcubed);
B0hat=coef(cubicreg)[1];
B1hat=coef(cubicreg)[2];
B2hat=coef(cubicreg)[3];
B3hat=coef(cubicreg)[4];
# plug in the known population means of x, x^2 and x^3
yhat.gen.reg=B0hat+B1hat*100+B2hat*8000+B3hat*810000;
> yhat.gen.reg
(Intercept)
   289.5714

Later in the course we will study a technique called the bootstrap for obtaining the standard error of complex estimators like the generalized regression estimator.

II. Stratified Sampling

A stratified random sample is a probability sample obtained by separating the population elements into nonoverlapping groups, called strata, and then selecting a simple random sample from each stratum.

Motivating example: Suppose a public opinion poll is to be conducted in a certain county to estimate the proportion of voters who favor spending more tax revenue on an improved ambulance service. The county contains two cities and a rural area. The population of interest for the poll is all men and women of voting age who reside in the county. A stratified random sample of adults residing in the county can be obtained by selecting a simple random sample of adults from each city and another simple random sample of adults from the rural area.
That is, the two cities and the rural area represent three strata from which we obtain simple random samples.

Reasons for using a stratified random sample rather than a simple random sample for the county poll:

Our goal in designing surveys is to maximize the information obtained, i.e., minimize the standard deviation of the estimate, for a fixed expenditure. Samples displaying small variability among the measurements will produce small standard deviations. Thus, if all the adults in one city (say, city A) tend to think alike on the ambulance service issue, we can obtain a very accurate estimate of the proportion in question with a relatively small sample. Similarly, if all the adults in the second city (city B) tend to think alike on this issue, although they may differ in opinion from those in city A, then we can again obtain an accurate estimate with a small sample. This situation may arise if city A has a hospital and hence has no great need for improved ambulance service, whereas city B does not have a hospital and hence has great need for an improved ambulance service. The opinions in the rural area may be more varied, but fewer adults may reside there, and enough resources may be available for a careful study of this area. When the results of the stratified random sample are combined, the final estimate of the proportion of voters favoring more expenditures for an ambulance service may have a much smaller standard deviation than would an estimate from a simple random sample of comparable size.

The cost of obtaining observations varies with the design of the survey. The cost of selecting adults to be sampled, the cost of interviewer time and travel, and the cost of administering the overall sampling procedure may all be minimized by a carefully planned stratified random sample in compact, well-defined geographic areas.
Such cost savings allow the investigators to use a larger sample size than they could use for a simple random sample of the same total cost.

Estimates of a population parameter (e.g., the mean) may be desired for certain subsets of the population, i.e., domains. In the county poll, each city commissioner may want to see an estimate of the proportion of voters favoring an expanded ambulance service for his or her own city. Stratified random sampling allows us to ensure that the sample size in each stratum is sufficient for obtaining accurate estimates of the population parameter for the stratum.

In summary, the principal reasons for using stratified random sampling rather than simple random sampling are as follows:
1. Stratification may produce a smaller standard deviation of the estimate (resulting in a narrower confidence interval) than would be produced by a simple random sample of the same size. This is particularly true if measurements within each stratum are homogeneous. This advantage of stratification is similar to the advantage of blocking in randomized experiments.
2. The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be identifiable strata.

Examples of surveys in which stratified random sampling is advantageous:

Sampling hospital patients on a certain diet to assess weight gain may be more efficient if the patients are stratified by gender, because men tend to weigh more than women.

A poll of college students at a large university may be more conveniently administered and carried out if students are stratified into on-campus and off-campus residents.

A quality control sampling plan in a manufacturing plant may be stratified by production line, because estimates of the proportion of defective products may be required by the manager of each line.
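The first reason in the summary can be checked with a small simulation. The R sketch below is only an illustration under invented assumptions: the stratum sizes, the proportions in favor (0.1 in city A, 0.9 in city B, 0.5 in the rural area), the sample size of 200, and the use of proportional allocation are all hypothetical and do not come from the county poll. It compares the simulated standard deviation of the sample proportion from a simple random sample with that of a stratified estimate that weights each stratum's sample proportion by its share of the population.

```r
set.seed(475)

# Hypothetical county: stratum sizes and true proportions in favor (assumed values)
N_h <- c(cityA = 4000, cityB = 4000, rural = 2000)
p_h <- c(cityA = 0.1, cityB = 0.9, rural = 0.5)
N <- sum(N_h)

# Build the population: stratum label and a 0/1 opinion for each adult
stratum <- rep(names(N_h), times = N_h)
y <- rbinom(N, size = 1, prob = p_h[stratum])

n <- 200
n_h <- n * N_h / N   # proportional allocation (an assumption)

# Estimate from one simple random sample of size n
srs_estimate <- function() mean(sample(y, n))

# Estimate from one stratified sample: SRS within each stratum,
# then weight the stratum sample proportions by N_h / N
str_estimate <- function() {
  phat_h <- sapply(names(N_h), function(h) mean(sample(y[stratum == h], n_h[h])))
  sum(N_h / N * phat_h)
}

# Repeat both designs many times and compare the spread of the estimates
reps <- 2000
sd_srs <- sd(replicate(reps, srs_estimate()))
sd_str <- sd(replicate(reps, str_estimate()))
c(srs = sd_srs, stratified = sd_str)
```

Because opinions in this made-up population are fairly homogeneous within each city, the stratified estimate should show the smaller standard deviation; making the three proportions equal removes most of the gain.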
Most major surveys have some degree of stratification incorporated into the design. As examples, we consider three important surveys conducted by the U.S. Bureau of Labor Statistics:

The Consumer Price Index (CPI) is a measure of the average change in prices for a fixed collection of goods and services for urban consumers. The CPI is actually calculated from at least four different types of surveys: surveys of cities, surveys of urban families, surveys of outlets providing goods and services, and surveys of specific goods and services. In the design of most CPI surveys, sampling units (counties or groups of contiguous counties) are identified in the population and then grouped into strata. Strata are chosen on the basis of geography, population size, rate of population increase, major industry, percentage nonwhite, and percentage urban. The sampling units within a stratum are chosen to be as much alike as possible with regard to these characteristics.

The Current Population Survey (CPS) measures aspects of employment, unemployment, and people not in the labor force. It uses strata similar to those used in the CPI surveys, except that rural sampling units are used and the number of farms becomes an important quantity for stratification.

The Establishment Survey (ES) collects data on work hours and earnings for nonagricultural establishments in the United States. Establishments are stratified according to industry type and size, primarily for homogeneity of measurements but also to provide estimates for various types of industries. For example, information is provided for such industrial categories as mining; construction; manufacturing; transportation; and finance, insurance, and real estate.

III. Drawing a stratified random sample and estimating the population mean and total.

The first step in the selection of a stratified random sample is to clearly specify the strata; then each sampling unit of the population is placed into its appropriate stratum.
This step may be more difficult than it sounds. For example, suppose you plan to stratify the sampling units (say, households) into rural and urban units. What should be done with households in a town of 1000 inhabitants? Are these households rural or urban? They may be rural if the town is isolated in the country, or they may be urban if the town is adjacent to a large city. Hence, it is essential to specify what is meant by urban and rural so that each sampling unit clearly falls into exactly one stratum.

For stratified sampling to work, we need to know the population size in each stratum. Suppose there are $H$ strata. The population sizes in the strata are denoted by $N_1, \ldots, N_H$, and $N_1 + \cdots + N_H = N$, where $N$ is the total number of units in the entire population.

Theory of Stratified Sampling and Example from book.
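As a rough R sketch of the selection step described in Section III, suppose a sampling frame lists every household with its stratum label. Everything in the block is hypothetical: the frame, the stratum sizes (700 urban, 300 rural), the per-stratum sample sizes, and the measured variable y are invented for illustration. The estimate computed at the end weights each stratum sample mean by $N_h/N$; this is the standard stratified estimator of the population mean, given here as an assumption about what the theory will use.

```r
set.seed(1)

# Hypothetical sampling frame: one row per household, each in exactly one stratum
frame <- data.frame(
  id = 1:1000,
  stratum = rep(c("urban", "rural"), times = c(700, 300)),
  y = c(rnorm(700, mean = 50, sd = 5), rnorm(300, mean = 30, sd = 5))
)

# Stratum population sizes N_1, ..., N_H and total N
N_h <- table(frame$stratum)
N <- sum(N_h)

# Assumed sample sizes within each stratum
n_h <- c(rural = 30, urban = 70)

# Draw a simple random sample (without replacement) of ids in each stratum
sampled_ids <- unlist(lapply(names(n_h), function(h) {
  sample(frame$id[frame$stratum == h], n_h[h])
}))
samp <- frame[frame$id %in% sampled_ids, ]

# Stratum sample means, combined with weights N_h / N
ybar_h <- tapply(samp$y, samp$stratum, mean)
ybar_str <- sum(N_h[names(ybar_h)] / N * ybar_h)

# Corresponding estimate of the population total
t_hat <- N * ybar_str
c(mean = ybar_str, total = t_hat)
```

Note that the stratum sizes come from the frame, not from the sample, which is why the population size in each stratum must be known in advance.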