Stat 475/920 Notes 2
Reading: Lohr, Chapter 2
Office hour and R session notes:
Prof. Small’s office hours: Tues., Thurs., 4:45-5:45; by
appointment. Huntsman Hall 464.
Xin Fu’s office hours: Monday, 5-6.
Introductory tutorial on R: Monday, Sept. 15th, 5-6, Huntsman
Hall 441.
The goal of survey sampling is to obtain information about a
population by examining only a fraction of the population.
To simplify presentation, we assume for now that the sampled
population is the target population, that the sampling frame is
complete, that there is no nonresponse or missing data and that
there are no errors of observation. We will return to
nonsampling errors later.
I. Population
In survey sampling, we are typically concerned with a finite
population (universe) with $N$ units labeled $\{1, 2, \ldots, N\}$.
Let $y_i$ be a characteristic associated with the $i$th unit, e.g.,
the income of the $i$th person in the population.
The population quantities we will most often be interested in
estimating are
(1) The population mean $\bar{y}_U$:
$$\bar{y}_U = \frac{1}{N} \sum_{i=1}^{N} y_i .$$
(2) The population total $t$:
$$t = \sum_{i=1}^{N} y_i .$$
(3) The population variance $S^2$:
$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2 .$$
(4) The population standard deviation $S$:
$$S = \sqrt{S^2} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2} .$$
The population standard deviation can be thought of as
measuring the typical absolute deviation of a randomly
chosen unit’s y from the population mean.
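As a quick illustration, these population quantities can be computed in R on a small made-up population (the income numbers below are hypothetical):

```r
# Hypothetical population of N = 5 incomes (in $1000s), for illustration only
y <- c(30, 45, 50, 60, 90)
N <- length(y)
ybarU <- mean(y)                   # population mean: 55
t <- sum(y)                        # population total: 275
S2 <- sum((y - ybarU)^2)/(N - 1)   # population variance (N-1 divisor): 500
S <- sqrt(S2)                      # population standard deviation
CV <- S/ybarU                      # coefficient of variation
```

Note that R's built-in var() and sd() use the same $N-1$ divisor, so var(y) and sd(y) give $S^2$ and $S$ directly.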
The coefficient of variation (CV) is a measure of variability in
the population relative to the mean:
$$CV(y) = \frac{S}{\bar{y}_U} .$$
It is sometimes helpful to have a special notation for
proportions. The proportion of units having a characteristic is
simply a special case of the mean, obtained by letting $y_i = 1$ if
the $i$th unit has the characteristic of interest (e.g., has income
above $100,000 per year) and $y_i = 0$ if the $i$th unit does not
have the characteristic of interest. Let $p$ be the proportion of
units with the characteristic of interest:
$$p = \frac{\sum_{i=1}^{N} y_i}{N} = \frac{\text{number of units with the characteristic in the population}}{N} .$$
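Continuing the 0-1 coding above, the proportion is just the mean of the indicator variables. A small R sketch (with hypothetical numbers):

```r
# Hypothetical population of incomes (in $1000s); characteristic: income above 50
y <- c(30, 45, 50, 60, 90)
z <- as.numeric(y > 50)   # y_i = 1 if the unit has the characteristic, 0 otherwise
p <- mean(z)              # proportion with the characteristic: 2/5 = 0.4
```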
II. Simple Random Sampling
Probability sampling: Each unit in the population has a known
probability of being selected in the sample and a chance method
such as using numbers from a random number table is used to
choose the specific units to be included in the sample.
The simplest type of probability sampling is simple random
sampling without replacement (SRS). A simple random sample
without replacement of size $n$ can be drawn by putting the
numbers $\{1, \ldots, N\}$ into a hat and randomly selecting $n$ of them.
In a simple random sample with replacement, after each draw
we put the number drawn back into the hat. For short, we'll call
simple random sampling without replacement simply "simple
random sampling," since sampling without replacement is
preferable to sampling with replacement for a finite population.
In a simple random sample of size $n$, every possible subset of $n$
distinct units has the same probability of being selected into the
sample. There are $\binom{N}{n}$ possible samples and each is equally
likely, so the probability of selecting any individual sample $S$ of
$n$ units is
$$P(S) = \frac{1}{\binom{N}{n}} = \frac{n!(N-n)!}{N!} .$$
Let $\pi_i$ denote the probability that the $i$th unit will appear in the
sample. In simple random sampling, $\pi_i = n/N$ for all the units
(we'll prove this later).
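We can check $\pi_i = n/N$ empirically by drawing many simple random samples and recording how often each unit appears. A small simulation sketch (the population and sample sizes here are arbitrary):

```r
set.seed(1)                      # for reproducibility
N <- 20; n <- 5
nsims <- 10000
counts <- rep(0, N)              # number of times each unit is sampled
for (k in 1:nsims) {
  s <- sample(N, n, replace=FALSE)
  counts[s] <- counts[s] + 1
}
counts/nsims                     # each entry should be close to n/N = 0.25
```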
We can draw a simple random sample of size n from a
population of size N in R by the following commands:
> n=25
> N=500
> sample(N,n,replace=FALSE)
[1] 274 218 160 249 50 451 271 496 378 178 84 3 328 19
141 190 279 248 221
[20] 37 425 106 40 404 43
Example of simple random sampling: The Effect of Agent
Orange on Troops in Vietnam
During the Vietnam War, the U.S. army engaged in herbicidal
warfare in which the objective was to destroy forests that
provided cover for the Viet Cong army. The most widely used
herbicide was known as Agent Orange.
[Photo: A UH-1D helicopter from the 336th Aviation Company sprays a defoliation agent on a dense jungle area
in the Mekong Delta. 07/26/1969, National Archives photograph.]
Many Vietnam veterans are concerned that their health may
have been affected by exposure to Agent Orange. The
particularly worrisome component of Agent Orange is dioxin,
which in high doses is known to be associated with certain
cancers. High levels of dioxin can be detected 20 or more years
after heavy exposure to Agent Orange. To examine the dioxin
levels in Vietnam veterans, researchers from the Centers for
Disease Control used military records to identify a sampling
frame of 646 living U.S. Army combat personnel who served in
Vietnam during 1967 and 1968 in the areas that were most
heavily treated with Agent Orange (Centers for
Disease Control Veterans Health Studies, "Serum 2,3,7,8-
Tetrachlorodibenzo-p-dioxin Levels in U.S. Army Vietnam-era
Veterans," Journal of the American Medical Association 260,
Sept. 2, 1988: 1249-1254).
In the actual study, the dioxin levels in 1987 were obtained
from all 646 veterans. The dioxin level is measured as
concentration in parts per trillion. The data can be found on the
class web site
http://wwwstat.wharton.upenn.edu/~dsmall/stat475-f08/ under
Data Sets. The data can be read into R by the following
command
agent.orange.data=read.table("agent_orange_data.txt",
header=TRUE)
• The population mean is
mean(dioxin)
[1] 4.260062
where dioxin=agent.orange.data$dioxin.
• The population standard deviation is
sd(dioxin)
[1] 2.642617
• Histogram of the population distribution of dioxin:
hist(dioxin)
• As a point of comparison, the mean dioxin level in 1987 in
a sample of veterans who served between 1965-1971 in the
United States and Germany was 4.19.
Suppose that instead of obtaining the dioxin levels of all 646
veterans, we would like to take a sample of size 50.
n=50
N=646
sample1=sample(N,n,replace=FALSE)
sample1
[1] 132 527 542 110 475 354 458 586 370 389 544 626 538 324
159 359 272 233 213
[20] 422 238 522 596 182 329 37 74 480 563 528 623 38 81
610 111 543 577 525
[39] 344 335 4 198 83 452 145 249 455 418 540 175
mean(dioxin[sample1])
[1] 4.42
Another sample is
sample2=sample(N,n,replace=FALSE)
sample2
[1] 198 501 254 392 8 417 237 221 363 146 102 390 600 388
453 55 233 211 292
[20] 34 190 549 185 579 478 111 583 217 407 103 441 263 46
452 486 199 499 70
[39] 401 529 219 1 68 227 477 415 618 77 341 362
> mean(dioxin[sample2])
[1] 3.78
How accurate is the sample mean as an estimate of the
population mean? The key to answering this question is to
understand the sampling distribution of the sample mean.
III. Sampling Distribution
Let $\bar{y}$ denote the sample mean. The sampling distribution of $\bar{y}$
is the distribution of $\bar{y}$ over repeated samples.
Mean and variance of the sampling distribution of $\bar{y}$ (to be
proved later):
Mean: $E(\bar{y}) = \bar{y}_U$.
The sample mean is an unbiased estimator of the population
mean.
Variance: $Var(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$.
Finite population correction:
The usual formula for the variance of the sample mean of
independent and identically distributed random variables is the
variance divided by the sample size. The members of our
sample are not independent because we are sampling without
replacement. For simple random sampling without replacement,
we multiply by an additional factor $\left(1 - \frac{n}{N}\right)$, which is called the
finite population correction (fpc). Intuitively, we make this
correction because with small populations, the greater our
sampling fraction $n/N$, the more information we have about
the population and thus the smaller the variance. If we take a
complete census, i.e., we sample the whole population so that
$n = N$, then the fpc is 0 and $Var(\bar{y}) = 0$ since $\bar{y}$ will always
equal $\bar{y}_U$.
For most samples that are taken from extremely large
populations, the fpc is approximately 1. For large populations it
is the size of the sample taken, not the sampling fraction, that
determines the precision of the estimator. For example, a
sample of size 100 from a population of 100,000 has almost the
same precision as a sample of size 100 from a population of 100
million:
$$Var(\bar{y}) = \frac{S^2}{100}\left(1 - \frac{100}{100{,}000}\right) = 0.999 \cdot \frac{S^2}{100} \quad \text{for } N = 100{,}000$$
$$Var(\bar{y}) = \frac{S^2}{100}\left(1 - \frac{100}{100{,}000{,}000}\right) = 0.999999 \cdot \frac{S^2}{100} \quad \text{for } N = 100{,}000{,}000$$
If your soup is well stirred, you need to taste only one or two
spoonfuls to check the seasoning, whether you have made 1 liter
or 20 liters of soup.
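The two fpc factors in the comparison above are easy to check in R:

```r
n <- 100
fpc1 <- 1 - n/100000      # N = 100,000:     0.999
fpc2 <- 1 - n/100000000   # N = 100 million: 0.999999
```

Both factors are essentially 1, so the two variances are nearly identical.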
The variance of $\bar{y}$ involves the population variance $S^2$, which
depends on the values for the entire population. We can
estimate the population variance by the sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i \in S} (y_i - \bar{y})^2 ,$$
where $S$ denotes the sample.
An unbiased estimator of the variance of $\bar{y}$ is
$$\widehat{Var}(\bar{y}) = \frac{s^2}{n}\left(1 - \frac{n}{N}\right) .$$
We usually report not the estimated variance of $\bar{y}$ but its square
root, the standard error (SE):
$$SE(\bar{y}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{N}\right)} .$$
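These estimates translate directly into R. A sketch for a single simple random sample (a simulated stand-in population is used here in place of the dioxin data):

```r
set.seed(2)
N <- 646; n <- 50
y.pop <- rgamma(N, shape=10, rate=1)   # simulated stand-in population
s <- sample(N, n, replace=FALSE)       # simple random sample of unit labels
ybar <- mean(y.pop[s])                 # sample mean
s2 <- var(y.pop[s])                    # sample variance s^2
varhat <- (1 - n/N)*s2/n               # estimated variance of ybar (with fpc)
se <- sqrt(varhat)                     # standard error of ybar
```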
We can simulate the sampling distribution of y for the Agent
Orange data:
nosims=2000; # number of samples to be simulated
n=50; # size of sample
N=646; # size of population
samplemean=rep(0,nosims);
for(i in 1:nosims){
tempsample=sample(N,n,replace=FALSE);
samplemean[i]=mean(dioxin[tempsample]);
}
mean(samplemean) # mean of sample means
[1] 4.25794
mean(dioxin) # population mean
[1] 4.260062
sd(samplemean)
[1] 0.3562426
truesd.samplemean=(sd(dioxin)/sqrt(50))*sqrt(1-50/646) # true
SD of sample mean (note: the fpc enters under the square root)
> truesd.samplemean
[1] 0.3589683
hist(samplemean) # histogram of sample means
Estimating proportions: For proportions, the above formulas for
the sampling distribution can be specialized.
Let $p = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}_U$ denote the population proportion and
$\hat{p} = \bar{y}$ denote the sample proportion.
Then,
$$S^2 = \frac{\sum_{i=1}^{N} (y_i - p)^2}{N-1} = \frac{\sum_{i=1}^{N} y_i^2 - 2p \sum_{i=1}^{N} y_i + Np^2}{N-1} = \frac{Np(1-p)}{N-1}$$
(using $y_i^2 = y_i$ for 0-1 variables), and
$$Var(\hat{p}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right) = \frac{p(1-p)}{n}\left(\frac{N-n}{N-1}\right) .$$
Also,
$$s^2 = \frac{1}{n-1} \sum_{i \in S} (y_i - \hat{p})^2 = \frac{n}{n-1}\, \hat{p}(1-\hat{p})$$
$$\widehat{Var}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)$$
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)}$$
Example:
Suppose we are interested in the proportion of Vietnam veterans
with dioxin level greater than 5.
n=50;
N=646;
tempsample=sample(N,n,replace=FALSE);
phat=mean(dioxin[tempsample]>5);
var.phat=(phat*(1-phat)/(n-1))*(1-n/N);
se.phat=sqrt(var.phat);
phat
[1] 0.14
se.phat
[1] 0.04761262
Estimating the population total:
The natural estimate of the population total is $\hat{t} = N\bar{y}$.
(Note that $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$.)
The sampling distribution of $\hat{t}$ is
$$E(\hat{t}) = t$$
$$Var(\hat{t}) = N^2 Var(\bar{y}) = N^2 \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$$
$$\widehat{Var}(\hat{t}) = N^2 \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$$
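The total estimate and its standard error are simple to compute in R (a sketch; the population here is simulated as a stand-in):

```r
set.seed(3)
N <- 646; n <- 50
y.pop <- rgamma(N, shape=10, rate=1)          # simulated stand-in population
s <- sample(N, n, replace=FALSE)
that <- N*mean(y.pop[s])                      # estimated population total
se.that <- N*sqrt((1 - n/N)*var(y.pop[s])/n)  # standard error of the estimated total
```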
IV. Confidence Intervals
Consider our first sample from the Vietnam veterans population:
n=50
N=646
sample1=sample(N,n,replace=FALSE)
mean(dioxin[sample1])
[1] 4.42
It is not sufficient to just report the sample mean. We would
like to give an idea of the accuracy of our estimate of the
population mean and a range of plausible values for the
population mean. This is done by means of a confidence
interval (CI).
A 95% confidence interval for a population parameter should
have the following property:
If we take samples from our population over and over again and
construct a confidence interval using our procedure for each of
the samples, 95% of the resulting intervals should include the
true value of the population parameter.
Another way of thinking about confidence intervals: if we take
a survey on a different subject each day and construct a
valid 95% confidence interval for the population mean each time,
then over our lifetime, about 95% of the intervals will contain the
true population means.
In introductory statistics, for situations in which the population
is infinite, a central limit theorem for infinite population
sampling with replacement was introduced and used to form
approximate confidence intervals:
If $y_1, y_2, \ldots$ are iid random variables with mean $\mu$ and variance
$\sigma^2$, then under regularity conditions, for $n$ sufficiently large,
$$\frac{\frac{1}{n}\sum_{i=1}^{n} y_i - \mu}{\sigma / \sqrt{n}}$$
has approximately a standard normal distribution.
For finite population sampling, the usual central limit theorem
cannot be applied because the sample size cannot exceed the
population size. However, there is a finite population central
limit theorem in which we consider a sequence of populations
and sample sizes such that the population sizes and sample sizes
increase to infinity. Hájek (1960) proved that if certain
technical conditions hold and if $n$, $N$, and $N - n$ are all
"sufficiently large," then
$$\frac{\bar{y} - \bar{y}_U}{\sqrt{\left(1 - \frac{n}{N}\right)}\, \frac{S}{\sqrt{n}}}$$
has approximately a standard normal distribution.
For $n$, $N$, and $N - n$ "sufficiently large," an approximate 95%
confidence interval for the population mean is
$$\left(\bar{y} - 1.96 \frac{s}{\sqrt{n}} \sqrt{1 - \frac{n}{N}},\ \bar{y} + 1.96 \frac{s}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}\right) = \left(\bar{y} - 1.96\, SE(\bar{y}),\ \bar{y} + 1.96\, SE(\bar{y})\right) \quad (1.1)$$
Note that if n / N is small, then this confidence interval is
approximately the same as the usual confidence interval from
introductory statistics.
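Interval (1.1) is straightforward to compute in R; a sketch for a single simple random sample (using a simulated stand-in population in place of the dioxin data):

```r
set.seed(4)
N <- 646; n <- 50
y.pop <- rgamma(N, shape=10, rate=1)     # simulated stand-in population
s <- sample(N, n, replace=FALSE)
ybar <- mean(y.pop[s])
se <- sqrt((1 - n/N)*var(y.pop[s])/n)    # standard error with fpc
ci <- c(ybar - 1.96*se, ybar + 1.96*se)  # approximate 95% CI for the population mean
```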
The imprecise term "sufficiently large" in the central limit theorem
occurs because the adequacy of the normal approximation
depends on $n$ and on how closely the population $\{y_i, i = 1, \ldots, N\}$
resembles a population generated from a normal distribution.
The magic number of $n = 30$, often cited in introductory
statistics books as a sample size that is "sufficiently large" for
the central limit theorem to apply, often does not suffice in finite
population sampling problems. Many populations are highly
skewed.
Simulation study of performance of confidence intervals for the
Vietnam veteran agent orange study:
Recall that the dioxin distribution is fairly skewed.
To simulate the coverage (proportion of times the confidence
interval contains the true population parameter) for the
approximate 95% CI (1.1) for a sample size of n (n=50 in the
code), we use the following R code:
nosims=10000;
n=50;
N=646;
samplemean=rep(0,nosims); # vector to store the sample mean
lowerci=rep(0,nosims); # vector to store lower end point of confidence interval
upperci=rep(0,nosims); # vector to store upper end point of confidence interval
popmean=mean(dioxin); # population mean
for(i in 1:nosims){
tempsample=sample(N,n,replace=FALSE); # take sample
samplemean[i]=mean(dioxin[tempsample]); # compute sample mean
se=sqrt((1-n/N)*sd(dioxin[tempsample])^2/n);
# computes standard error of sample mean
lowerci[i]=samplemean[i]-1.96*se; # lower end point of CI
upperci[i]=samplemean[i]+1.96*se; # upper end point of CI
}
hist(samplemean) # histogram of sample means
ci.coverage=mean((lowerci<=popmean)*(upperci>=popmean));
# Proportion of times confidence interval contains the population mean
ci.coverage
Sample size n    Coverage of large-sample 95% CI (1.1)
15               0.903
30               0.911
50               0.919
75               0.921
100              0.922
300              0.921
The confidence interval only has about 91% coverage for $n = 30$ and
around 92% coverage for $n = 50$.
For a more normally distributed population, the confidence
intervals perform better.
We simulate a population from a Gamma distribution with
shape=10 and rate=1.
# Simulation from Gamma distribution
N=646
gammapop=rgamma(N,shape=10,rate=1);
hist(gammapop);
For this population, the simulated confidence interval coverage
is
Sample size n    Coverage of large-sample 95% CI (1.1)
15               0.929
30               0.937
50               0.946
75               0.946
100              0.946
300              0.949
For skewed populations, stratified sampling is useful for
obtaining more accurate estimates and more accurate confidence
intervals (stratified sampling will be studied in Chapter 4).
Confidence intervals for proportions: For a proportion, the
central limit theorem based confidence interval is
$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} ,$$
which, for $n/N$ small and $n$ large, is approximately
$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} .$$
Brown, Cai and DasGupta (2001, Statistical Science) show that
this confidence interval has erratic coverage properties. They
recommend several alternatives, the easiest of which is to use
$$\tilde{p} = \frac{\sum_{i \in S} y_i + 2}{n + 4}$$
in place of $\hat{p}$, i.e., the CI is
$$\tilde{p} \pm 1.96 \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n}}$$
($\tilde{p}$ is the Bayes estimate of $p$ from a Beta(2,2) prior).
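A sketch of this adjusted interval in R, following the formula in these notes (the sample count of successes below is hypothetical):

```r
n <- 50
successes <- 7                     # hypothetical number of sample units with the characteristic
ptilde <- (successes + 2)/(n + 4)  # Bayes estimate of p from a Beta(2,2) prior
se.tilde <- sqrt(ptilde*(1 - ptilde)/n)
ci <- c(ptilde - 1.96*se.tilde, ptilde + 1.96*se.tilde)
```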