Notes 4

Stat 475 Notes 4

Reading: Lohr, Chapter 3.1

Corrections from earlier notes:

Notes 2: Bottom of page 10, top of page 11, the true standard deviation of the sample mean should be

truesd.samplemean=((sd(dioxin)/sqrt(50))*sqrt(1-50/646))

# true SD of sample mean

> truesd.samplemean

[1] .3589683

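Notes 2: Equation (1.1) on page 15 should be:

$$\left(\bar{y} - 1.96\,\frac{s}{\sqrt{n}}\sqrt{1-\frac{n}{N}},\ \ \bar{y} + 1.96\,\frac{s}{\sqrt{n}}\sqrt{1-\frac{n}{N}}\right) = \left(\bar{y} - 1.96\,SE(\bar{y}),\ \ \bar{y} + 1.96\,SE(\bar{y})\right) \qquad (1.1)$$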

Chart of minimum sample size needed for margins of error for a proportion, assuming the worst case that the true proportion is 0.5

Margin of error    Sample size
0.01               9604
0.02               2401
0.03               1068
0.04               601
0.05               385
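The sample sizes in the chart can be checked with a short R sketch. It assumes the usual worst-case formula $n = 1.96^2 (0.5)(0.5)/e^2$ for a margin of error $e$, rounded up, with no finite population correction; the variable names are illustrative and not from the original notes.

```r
# Worst-case sample size for a 95% margin of error on a proportion:
# n = z^2 * p * (1 - p) / e^2 with p = 0.5, rounded up to an integer.
# Assumption: the finite population correction is ignored.
z <- 1.96
p <- 0.5
margin <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# round() guards against floating-point noise nudging an exact
# integer (such as 9604) just above itself before ceiling().
n.needed <- ceiling(round(z^2 * p * (1 - p) / margin^2, 6))

data.frame(margin, n.needed)
```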

I. Ratio Estimation: Motivating examples

Suppose that for each member of the population, two variables are measured: $x_i$ and $y_i$.

Ratio Estimation: Estimation of

$$B = \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}.$$

The natural estimator of $B$ from a simple random sample is

$$\hat{B} = \frac{\bar{y}}{\bar{x}}.$$

Motivation 1: We are directly interested in estimating $B = t_y/t_x = \bar{y}_U/\bar{x}_U$, i.e., the ratio of the population mean of $y$ to that of $x$ or, equivalently, the ratio of the population total of $y$ to that of $x$.

Examples:

(1) In a survey of households, y is total monthly food budget, x is total monthly expenditures and B is the proportion of expenditures that are spent on food.

(2) In a survey of farms, y is acres of wheat planted, x is total acreage and B is the proportion of acres planted that are wheat.

Motivation 2: We are interested in the population total of $y$ but the population size $N$ is unknown. Thus, we cannot estimate $t_y = N\bar{y}_U$ by $N\bar{y}$. However, suppose that the population total $t_x$ of $x$ is known. Note that $t_y = B t_x$. Then we can estimate $t_y$ by $\hat{B} t_x$.

Example:

The wholesale price paid for oranges in large shipments is based on the sugar content of the load. The exact sugar content cannot be determined prior to the purchase and extraction of the juice from the entire load; however, it can be estimated. One approach is to estimate the mean sugar content per orange from a sample of oranges and then multiply by the number of oranges $N$ in the load. Unfortunately this method is not feasible because it is too time consuming and costly to determine $N$ (that is, to count the total number of oranges in the load).

We can instead use the ratio estimation method with $x$ being the weight of an orange. It is easy to find $t_x$ by weighing the shipment of oranges. Then, we can estimate $t_y$ by taking a sample of oranges, finding the ratio of the sample mean of the sugar content of the oranges to the sample mean of the weight of the oranges and then multiplying this ratio by $t_x$.
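As a minimal sketch of the orange calculation in R: the sample values and the shipment weight below are hypothetical numbers invented for illustration (they are not data from the notes), and the variable names are likewise assumptions.

```r
# Hypothetical sample of 10 oranges (values made up for illustration).
sugar  <- c(0.021, 0.030, 0.025, 0.022, 0.033,
            0.027, 0.019, 0.021, 0.023, 0.025)  # y: sugar content (lb)
weight <- c(0.40, 0.48, 0.43, 0.42, 0.50,
            0.46, 0.39, 0.41, 0.42, 0.44)       # x: weight (lb)
t.x <- 1800  # hypothetical total weight of the shipment (lb)

B.hat <- mean(sugar) / mean(weight)  # estimated sugar per pound of orange
t.y.hat <- B.hat * t.x               # ratio estimate of total sugar content
t.y.hat
```

Note that the number of oranges in the load never enters the calculation; only the total weight of the shipment is needed.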

Motivation 3: We are interested in the population mean of $y$, $\bar{y}_U$, but know $\bar{x}_U$ and would like to use $x$ to improve our estimate by estimating $\bar{y}_U$ by $\hat{\bar{y}}_r = \hat{B}\bar{x}_U$ instead of $\bar{y}$.

Example: We will consider a population of $N = 393$ short stay hospitals from Herkson (1976). The data is in hospitals.txt. Let $y_i$ denote the number of patients discharged from the $i$th hospital in January 1968. We are interested in $\bar{y}_U$. Without doing any sampling, for each hospital $i$, we have available the number of beds in the hospital, $x_i$, and know $\bar{x}_U$.

The idea behind using the ratio estimator $\hat{\bar{y}}_r = \hat{B}\bar{x}_U$ instead of using $\bar{y}$ is the following: We expect $x_i$ and $y_i$ to be closely related in the population, since a hospital with a large number of beds should tend to have a large number of discharges. A scatterplot of $x_i$ versus $y_i$ in the population is shown below:

R code:
hospitaldata=read.table("hospitals.txt",header=TRUE);
discharges=hospitaldata$discharges;
beds=hospitaldata$beds;
plot(beds,discharges);

Because of the close relationship between $x_i$ and $y_i$, if the sample underestimates the mean number of beds, i.e., $\bar{x} < \bar{x}_U$, then the sample probably also underestimates the mean number of discharges. The ratio estimate

$$\hat{\bar{y}}_r = \hat{B}\bar{x}_U = \bar{y}\,\frac{\bar{x}_U}{\bar{x}}$$

multiplies $\bar{y}$ by $\bar{x}_U/\bar{x}$, which is a better estimate than $\bar{y}$ as long as $\bar{y}_U/\bar{y}$ is closely related to $\bar{x}_U/\bar{x}$.

The following is a simulation study comparison of $\hat{\bar{y}}_r$ to $\bar{y}$ for the hospital data for samples of size 64.

# Simulation study comparison of usual sample mean to ratio estimator for
# mean number of discharges in hospital data
nosims=10000;
samplemeanvec=rep(0,nosims);
ratioestvec=rep(0,nosims);
N=393;
n=64;
beds.population.mean=mean(beds);
discharges.population.mean=mean(discharges);
> discharges.population.mean
[1] 814.603

for(i in 1:nosims){
  # Draw a simple random sample of n hospitals without replacement
  tempsample=sample(N,n,replace=FALSE);
  discharges.sample=discharges[tempsample];
  beds.sample=beds[tempsample];
  samplemeanvec[i]=mean(discharges.sample);
  ratioestvec[i]=(mean(discharges.sample)/mean(beds.sample))*beds.population.mean;
}

# Bias of two estimators
bias.samplemean=mean(samplemeanvec)-discharges.population.mean;
> bias.samplemean
[1] -0.4465956
bias.ratioest=mean(ratioestvec)-discharges.population.mean;
> bias.ratioest
[1] 0.7369439

# Mean squared error of two estimators
mse.samplemean=mean((samplemeanvec-discharges.population.mean)^2);
> mse.samplemean
[1] 4482.136
mse.ratioest=mean((ratioestvec-discharges.population.mean)^2);
> mse.ratioest
[1] 904.0052

# Histograms of the two estimators
par(mfrow=c(2,1));
hist(samplemeanvec,xlim=c(500,1100));
hist(ratioestvec,xlim=c(500,1100));

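The ratio estimator is dramatically better than the sample mean for this population: its simulated mean squared error (904) is about one fifth that of the sample mean (4482), and the simulated bias of each estimator is less than one discharge.

We will study the general conditions under which $\hat{\bar{y}}_r$ is better than $\bar{y}$ below.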

II. Standard Errors of Ratio Estimators

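Unlike the sample mean, the ratio estimator is biased. Using a Taylor expansion,

$$E[\hat{B} - B] \approx \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{x}_U^2}\left(B S_x^2 - R S_x S_y\right) = \frac{1}{\bar{x}_U^2}\left[B\,V(\bar{x}) - Cov(\bar{x},\bar{y})\right],$$

where $S_x$ and $S_y$ are the population standard deviations of $x$ and $y$ and $R$ is their population correlation coefficient.

Note that for a large sample size $n$ and a small sampling fraction $n/N$, we have

$$V(\bar{x}) = \frac{S_x^2}{n}\left(1 - \frac{n}{N}\right) \approx 0 \quad \text{and} \quad [Cov(\bar{x},\bar{y})]^2 \le V(\bar{x})\,V(\bar{y}) \approx 0$$

(where the inequality follows from the Cauchy-Schwarz inequality) so that the bias is small.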

Using a Taylor expansion, we have the following formulas for estimates of the standard deviations of various ratio estimators (these are not unbiased estimators but for a large sample size $n$ and a small sampling fraction $n/N$, they are accurate):

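For estimating a population ratio (Lohr, pg. 61, 68)

$$\hat{B} = \frac{\bar{y}}{\bar{x}}, \qquad SE(\hat{B}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}}, \qquad \text{where } s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \hat{B}x_i\right)^2.$$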

For estimating a population total (Lohr, pg. 61, 68)

$$\hat{t}_{yr} = \hat{B}\,t_x, \qquad SE(\hat{t}_{yr}) = t_x\,SE(\hat{B}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}.$$

For estimating a population mean (Lohr, pg. 61, 68)

$$\hat{\bar{y}}_{ratio} = \hat{B}\,\bar{x}_U, \qquad SE(\hat{\bar{y}}_{ratio}) = \bar{x}_U\,SE(\hat{B}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}.$$

Using a central limit theorem and the Delta method, an approximate 95% confidence interval for a population quantity of interest $\theta$ (e.g., $B$, $\bar{y}_U$ or $t_y$) is

$$\left(\hat{\theta} - 1.96\,SE(\hat{\theta}),\ \ \hat{\theta} + 1.96\,SE(\hat{\theta})\right).$$
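The standard error formulas and the confidence interval above can be collected into one small R function. This is a sketch under stated assumptions, not code from the original notes: the function name `ratio.summary`, its arguments, and the toy data are all hypothetical, and it assumes a simple random sample with $\bar{x}_U$ and $N$ known (so that $t_x = N\bar{x}_U$).

```r
# Ratio estimates of B, the population mean and the population total,
# with approximate standard errors and 95% confidence intervals.
# y, x  : sample values from a simple random sample of size n
# xbarU : known population mean of x
# N     : known population size
ratio.summary <- function(y, x, xbarU, N) {
  n <- length(y)
  B.hat <- mean(y) / mean(x)
  s2e <- sum((y - B.hat * x)^2) / (n - 1)   # s_e^2
  se.mean <- sqrt((1 - n / N) * s2e / n)    # SE of the ratio estimate of the mean
  est <- c(ratio = B.hat, mean = B.hat * xbarU, total = B.hat * N * xbarU)
  se  <- c(ratio = se.mean / xbarU, mean = se.mean, total = N * se.mean)
  data.frame(estimate = est, se = se,
             lower = est - 1.96 * se, upper = est + 1.96 * se)
}

# Toy data, invented purely to show the call
x <- c(10, 20, 30, 40)
y <- c(22, 38, 63, 77)
out <- ratio.summary(y, x, xbarU = 24, N = 100)
out
```

The three standard errors differ only by the constants $1/\bar{x}_U$ and $N$, which is why the function computes the standard error for the mean once and rescales it.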

III. Comparison of the sample mean to the ratio estimate of the mean

We now compare the sample mean estimate $\bar{y}$ of the population mean $\bar{y}_U$ to the ratio estimator, $\hat{\bar{y}}_{ratio}$. Let MSE denote the mean squared error, i.e.,

$$MSE(\bar{y}) = E\left[(\bar{y} - \bar{y}_U)^2\right] = \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n}.$$

The mean squared error is a good measure for comparing the accuracy and desirability of estimators.

Define the population correlation coefficient to be

$$R = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y},$$

where $S_x$, $S_y$ are the population standard deviations of $x$ and $y$.

Using a Taylor expansion approximation, we have that

$$MSE[\hat{\bar{y}}_{ratio}] \le MSE[\bar{y}] \quad \text{if and only if} \quad R \ge \frac{B S_x}{2 S_y} = \frac{CV(x)}{2\,CV(y)}.$$

If the coefficients of variation are approximately equal, then it pays to use ratio estimation when the correlation between x and y is larger than 0.5.

The same criterion applies for estimating the population total.
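The criterion can be seen from the first-order (Taylor) approximation to the MSE of the ratio estimator. The derivation below is a sketch that assumes $B > 0$, $S_x > 0$ and $S_y > 0$, and writes $CV(x) = S_x/\bar{x}_U$ and $CV(y) = S_y/\bar{y}_U$ for the coefficients of variation.

```latex
\begin{align*}
MSE(\hat{\bar{y}}_{ratio}) &\approx \left(1-\frac{n}{N}\right)\frac{1}{n}
    \left(S_y^2 - 2BRS_xS_y + B^2S_x^2\right), \\
MSE(\bar{y}) &= \left(1-\frac{n}{N}\right)\frac{S_y^2}{n}, \\
MSE(\hat{\bar{y}}_{ratio}) \le MSE(\bar{y})
    &\iff B^2S_x^2 - 2BRS_xS_y \le 0 \\
    &\iff R \ge \frac{BS_x}{2S_y}
      = \frac{\bar{y}_U\,S_x}{2\,\bar{x}_U\,S_y}
      = \frac{CV(x)}{2\,CV(y)}.
\end{align*}
```

When $CV(x) \approx CV(y)$ the right-hand side is about $1/2$, which gives the "correlation larger than 0.5" rule of thumb.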

For the hospital data,

> cor(beds,discharges)

[1] 0.9109203

> ratio=discharges.population.mean/beds.population.mean

> ratio

[1] 2.964085

> ratio*sd(beds)/(2*sd(discharges))

[1] 0.5359784

Thus, $R = 0.91 > \frac{B S_x}{2 S_y} = 0.54$ and it pays to use ratio estimation.

In practice, we wouldn't know $R$, $B$, $S_x$, $S_y$ but can estimate them from the sample to decide whether to use ratio estimation.

Example: One of the main uses of ratio estimation is in the updating of information across time. A simple example of this can be seen in the way agricultural crop forecasters can use a sample of current data to update complete crop reports from earlier years. The crop used in this example is sugarcane, an important economic crop for the four states of Florida, Hawaii, Louisiana and Texas and grown in a total of about 32 counties from across those states. Our goal is to estimate the mean number of sugarcane acres harvested in these 32 counties.

Suppose we are near the end of 1999 and do not have complete data on the sugarcane crop from that year from all counties. We do, however, have complete data for all counties for the year 1997. In addition, we have the resources to collect preliminary information from six sample counties. The table below shows the acres harvested for sugarcane in the six sampled counties.

State   County         1999 Acreage   1997 Acreage
FL      Hendry         57,000         54,400
HI      Kauai          13,900         12,300
LA      Saint Landry   15,500         9,100
LA      Calcasieu      3,900          1,700
LA      Iberia         59,900         57,200
TX      Cameron        10,400         12,900

By checking the complete records for 1997, we can find that the 1997 average acres harvested per county, across all 32 counties, was 27,752 acres. Use these data to estimate the mean acreage for sugarcane across all 32 counties for 1999 and calculate an approximate 95% confidence interval.

Solution: Let $x$ be the acreage harvested in 1997 and $y$ be the acreage harvested in 1999. The plot of the sample data shows a strong, positive trend in the relationship between the acreage values for the 2 years. This bodes well for ratio estimation.

# Acreage of sugarcane harvested data
acreage.1997=c(54400,12300,9100,1700,57200,12900);
acreage.1999=c(57000,13900,15500,3900,59900,10400);
plot(acreage.1997,acreage.1999,xlab="Acreage, 1997",ylab="Acreage, 1999",main="Sugarcane acreage in 1997 versus 1999");

# Use criterion R>=(B*S_x)/(2*S_y) to decide whether to use ratio estimation
ratioest=mean(acreage.1999)/mean(acreage.1997);
> ratioest
[1] 1.088076
sxest=sd(acreage.1997);
syest=sd(acreage.1999);
> ratioest*sxest/(2*syest)
[1] 0.5359384
Rest=cor(acreage.1999,acreage.1997);
> Rest
[1] 0.9934721

The ratio estimator appears much better than the sample mean since $\hat{R} = 0.99 > \frac{\hat{B}\hat{S}_x}{2\hat{S}_y} = 0.54$.

The ratio estimate of the mean of acreage harvested in 1999 is

$$\hat{\bar{y}}_{ratio} = \hat{B}\,\bar{x}_U = 1.088(27{,}752) = 30{,}194.$$

For calculating the standard error of the ratio estimate, we need to calculate

$$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \hat{B}x_i\right)^2.$$

> sehatsq=sum((acreage.1999-ratioest*acreage.1997)^2)/5;

> sehatsq

[1] 11860709

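Then we have

$$SE(\hat{\bar{y}}_{ratio}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}} = \sqrt{\left(1 - \frac{6}{32}\right)\frac{11860709}{6}} = 1267.$$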

Thus, an approximate 95% confidence interval for $\bar{y}_U$ is

$$\hat{\bar{y}}_{ratio} \pm 1.96\,SE(\hat{\bar{y}}_{ratio}) = (27711,\ 32677).$$

Note that because the sample size is small, it would be a better approximation to use the 0.975 quantile of the t distribution with $n - 1$ degrees of freedom rather than 1.96 in the confidence interval.

The 0.975 quantile of the t-distribution with 5 degrees of freedom is

> qt(.975,5)
[1] 2.570582

so the approximate 95% confidence interval is

$$\hat{\bar{y}}_{ratio} \pm t_{0.975,\,n-1}\,SE(\hat{\bar{y}}_{ratio}) = (26938,\ 33450).$$
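The whole sugarcane calculation can be reproduced in one short R script. This is a sketch using the data vectors from the R session above; the new variable names (`B.hat`, `ybar.ratio`, `se.ratio`, `ci.z`, `ci.t`) are illustrative. Because the script carries the unrounded ratio through the calculation, its point estimate and interval endpoints may differ by a few acres from the hand-rounded values quoted in the notes.

```r
# Sugarcane example: ratio estimate of the mean 1999 acreage per county.
acreage.1997 <- c(54400, 12300, 9100, 1700, 57200, 12900)   # x
acreage.1999 <- c(57000, 13900, 15500, 3900, 59900, 10400)  # y
N <- 32                    # counties in the population
n <- length(acreage.1999)  # counties sampled
xbarU <- 27752             # known 1997 mean acreage per county

B.hat <- mean(acreage.1999) / mean(acreage.1997)   # about 1.088
ybar.ratio <- B.hat * xbarU                        # ratio estimate of the mean
s2e <- sum((acreage.1999 - B.hat * acreage.1997)^2) / (n - 1)
se.ratio <- sqrt((1 - n / N) * s2e / n)            # about 1267

# Normal-based and t-based approximate 95% confidence intervals
ci.z <- ybar.ratio + c(-1, 1) * qnorm(0.975) * se.ratio
ci.t <- ybar.ratio + c(-1, 1) * qt(0.975, df = n - 1) * se.ratio

round(c(estimate = ybar.ratio, se = se.ratio))
round(rbind(normal = ci.z, t = ci.t))
```

With only six sampled counties the t-based interval is noticeably wider than the normal-based one, which is the point of the remark about small sample sizes.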
