Simple random sampling lecture

SECTION 2
SIMPLE RANDOM SAMPLING
2.1 What is Simple Random Sampling?
2.1.1 Definition
Simple Random Sampling: A method of probability
sampling in which a sample of n elements is randomly
chosen without replacement from a population of N
elements. This is SRS without replacement (SRSWOR),
as opposed to SRS with replacement (SRSWR).
2.1.2 One Selection Procedure for Simple Random
Sampling
A. Number the elements in the population (i.e.,
sampling frame) from 1 to N.
B. Using a table of random numbers, select and record
a random number between 1 and N.
C. Select a second random number between 1 and N.
If the second number is the same as the first
selected number, discard it and go to the next step.
If the second number is not the same as the first
number, record it.
D. Select a third random number between 1 and N. If
this number is the same as either one of the
previous numbers, discard it and go to the next
step. If the number is not the same as the previous
numbers, record it.
E. Continue in this manner until n different numbers
between 1 and N have been chosen.
F. Population elements corresponding to selected
numbers are an SRS sample of size n.
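Steps A through F can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name `select_srs` is made up here, and Python's pseudo-random generator stands in for the table of random numbers.

```python
import random

def select_srs(N, n, rng=None):
    """Draw n distinct labels from 1..N by repeated draws, discarding repeats."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    if rng is None:
        rng = random.Random()
    selected = []   # recorded numbers, in order of selection
    seen = set()    # same numbers, kept for fast duplicate checks
    while len(selected) < n:
        candidate = rng.randint(1, N)   # a random number between 1 and N
        if candidate in seen:
            continue                    # duplicate: discard it and draw again
        seen.add(candidate)
        selected.append(candidate)      # new number: record it
    return selected

# Example call with the sizes used later in Section 2.5 (the seed is arbitrary)
sample = select_srs(N=270, n=10, rng=random.Random(2024))
```

In practice the standard-library call `random.sample(range(1, N + 1), n)` draws the same kind of without-replacement sample in one step; the loop above is written out only to mirror the lettered steps.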
2.1.3 Some Statistical Notes about Simple Random
Samples
If we use SRS to select a sample of size n from a
population of N elements:
A. All possible SRS samples have the same chance
of being selected.
B. The probability that any one population element
will be chosen is n/N.
C. Observations taken from elements in an SRS are
not statistically independent.
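Notes A through C can be checked by brute force on a tiny population. The sketch below assumes N = 5 and n = 2 (values chosen only to keep the enumeration small) and lists every possible sample.

```python
from fractions import Fraction
from itertools import combinations

N, n = 5, 2
samples = list(combinations(range(1, N + 1), n))   # every possible SRS sample
prob_each = Fraction(1, len(samples))              # note A: all equally likely

# Note B: probability that element 1 is in the sample
p_1 = sum(prob_each for s in samples if 1 in s)

# Note C: probability that elements 1 and 2 are both in the sample
p_12 = sum(prob_each for s in samples if 1 in s and 2 in s)

print(p_1, p_12, p_1 * p_1)
```

Here `p_1` equals n/N, while `p_12` differs from `p_1 * p_1`. Selecting one element changes the chance of selecting another, which is why observations from an SRS are not statistically independent.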
2.2 Estimating a Population Mean from a Simple
Random Sample
2.2.1 Setting
A. We have selected an SRS of size n from a
population of N elements.
B. We wish to use our sample to estimate the
population mean per element (denoted by the
symbol $\bar{Y}$) for some characteristic of the
population.
C. Examples:
(1) Average annual dental care expenses for
employees of a large corporation.
(2) Average expenditures for prescription drugs
paid by customers of a drug store chain.
(3) Average height in inches of adult male
students at a state university.
2.2.2 Estimator of $\bar{Y}$
$$\hat{\bar{Y}} = \bar{y}_{srs} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n} \tag{2.1}$$
where $y_i$ refers to the value of the i-th element
selected in the sample.
2.2.3 Some Statistical Notes about $\bar{y}_{srs}$
A. Different SRS samples are likely to produce
different values for $\bar{y}_{srs}$; hence $\bar{y}_{srs}$ is a random
variable with a sampling distribution.
B. $\bar{y}_{srs}$ is an unbiased estimator of $\bar{Y}$. (go to D&C;
“parameter”)
C. When n is large (i.e., greater than 30), the
sampling distribution for $\bar{y}_{srs}$ closely resembles
the normal distribution; this characteristic can
be used when forming confidence intervals or
testing hypotheses.
2.2.4 Estimated Variance of $\bar{y}_{srs}$
$$v(\bar{y}_{srs}) = \left(\frac{1-f}{n}\right) s^2 \tag{2.2}$$
where f = n/N is the sampling rate and $s^2$ is the
estimated element variance for the population,
calculated as
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_{srs})^2}{n-1} = \frac{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)} \tag{2.3}$$
2.2.5 Some Statistical Notes about $v(\bar{y}_{srs})$
A. The term (1-f) in formula (2.2) is called the
finite population correction (fpc), which is a
special adjustment to account for the fact that
our sample was chosen without replacement
from a finite population (i.e., an existing
population of limited size). This correction
factor is very nearly 1 and can be effectively
ignored when the sampling rate is small (i.e.,
less than 0.05).
B. Different SRS samples of the same size which are
chosen from the same population are likely to
produce different values for $v(\bar{y}_{srs})$; hence $v(\bar{y}_{srs})$
is a random variable with a sampling distribution.
C. $v(\bar{y}_{srs})$ is an unbiased estimator of the true variance
of $\bar{y}_{srs}$.
D. $s^2$ is an unbiased estimator of the population
element variance.
2.2.6 Estimated Standard Error of $\bar{y}_{srs}$
$$se(\bar{y}_{srs}) = \sqrt{v(\bar{y}_{srs})} = \sqrt{\frac{1-f}{n}}\; s \tag{2.4}$$
where s is the square root of $s^2$ computed by formula
(2.3).
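Formulas (2.1) through (2.4) can be chained into one short routine. The sketch below is illustrative only: the function name `srs_mean_summary` and the four data values are hypothetical, not taken from the lecture.

```python
import math

def srs_mean_summary(y, N):
    """Return (mean, s2, v, se) for an SRS `y` drawn from N elements."""
    n = len(y)
    if n < 2 or n > N:
        raise ValueError("need 2 <= n <= N")
    f = n / N                                          # sampling rate
    ybar = sum(y) / n                                  # formula (2.1)
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # formula (2.3), first form
    v = (1 - f) / n * s2                               # formula (2.2)
    se = math.sqrt(v)                                  # formula (2.4)
    return ybar, s2, v, se

# Hypothetical data: n = 4 observations from a population of N = 20
y = [2, 4, 6, 8]
ybar, s2, v, se = srs_mean_summary(y, N=20)

# The second (computational) form of formula (2.3) gives the same s2
n = len(y)
s2_alt = (n * sum(yi ** 2 for yi in y) - sum(y) ** 2) / (n * (n - 1))
```

With f = 0.2 the finite population correction is 0.8, so it noticeably shrinks the variance here; with f below 0.05 it would change the result very little.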
2.2.7 Confidence Interval for $\bar{Y}$ ($n > 30$)
Lower Boundary: $\bar{y}_{srs} - t \cdot se(\bar{y}_{srs})$   (2.5)
Upper Boundary: $\bar{y}_{srs} + t \cdot se(\bar{y}_{srs})$   (2.6)
The value for t depends on the confidence level that we
choose. For example:
Confidence Level
(In Percent)          t
     68             1.00
     95             1.96
     99             2.58
Interpretation (t = 1.96): We are 95 percent sure that
$\bar{Y}$ is covered by the interval whose boundaries are
defined by formulas (2.5) and (2.6).
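A small sketch of formulas (2.5) and (2.6) follows. The estimate and standard error are hypothetical numbers. The helper `t_for_level` is an assumption added here to show that the tabulated t values are rounded quantiles of the standard normal distribution.

```python
from statistics import NormalDist

# t values as tabulated in Section 2.2.7
T_TABLE = {68: 1.00, 95: 1.96, 99: 2.58}

def t_for_level(level_percent):
    """Two-sided standard normal multiplier for a confidence level in percent."""
    return NormalDist().inv_cdf(0.5 + level_percent / 200)

def confidence_interval(estimate, se, t=1.96):
    """Lower and upper boundaries, formulas (2.5) and (2.6)."""
    return estimate - t * se, estimate + t * se

# Hypothetical estimate of 50.0 with an estimated standard error of 2.0
lower, upper = confidence_interval(estimate=50.0, se=2.0, t=T_TABLE[95])
```

The same `confidence_interval` function applies unchanged to any estimator whose sampling distribution is close to normal, given its own estimate and standard error.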
2.3 Estimating a Population Total from a Simple Random
Sample
2.3.1 Setting
A. We have selected an SRS of size n from a
population of N elements.
B. We wish to use our sample to estimate the
population aggregate total (denoted by the symbol
$\mathring{Y}$) for some characteristic of the population.
C. Examples:
(1) Total combined income for all United
States citizens if individual income is the
characteristic of interest.
(2) Total number of dental visits experienced
by persons living in some small city.
(3) Total dollar value of private health
insurance premiums paid by workers in a
large industrial plant.
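D. We know that $\mathring{Y} = N\bar{Y}$.
2.3.2 Estimator of $\mathring{Y}$
$$\hat{\mathring{Y}} = \mathring{y}_{srs} = N\bar{y}_{srs} = \left(\frac{N}{n}\right)\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \frac{y_i}{n/N} \tag{2.7}$$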
2.3.3 Some Statistical Notes about $\mathring{y}_{srs}$
A. Different SRS samples of the same size which
are chosen from the same population are likely
to produce different values for $\mathring{y}_{srs}$; hence $\mathring{y}_{srs}$
is a random variable with a sampling
distribution.
B. $\mathring{y}_{srs}$ is an unbiased estimator of $\mathring{Y}$.
C. The sampling distribution for $\mathring{y}_{srs}$ is very
similar to the normal distribution when n is
greater than 30.
2.3.4 Estimated Variance of $\mathring{y}_{srs}$
$$v(\mathring{y}_{srs}) = N^2\, v(\bar{y}_{srs}) = \left[\frac{N^2 (1-f)}{n}\right] s^2 \tag{2.8}$$
2.3.5 Some Statistical Notes about $v(\mathring{y}_{srs})$
A. The term (1-f) is the finite population correction
(see Section 2.2.5).
B. Different SRS samples of the same size which are
chosen from the same population are likely to
produce different values for $v(\mathring{y}_{srs})$; hence $v(\mathring{y}_{srs})$ is
a random variable with a sampling distribution.
C. $v(\mathring{y}_{srs})$ is an unbiased estimator of the true
variance of $\mathring{y}_{srs}$.
2.3.6 Estimated Standard Error of $\mathring{y}_{srs}$
$$se(\mathring{y}_{srs}) = \sqrt{v(\mathring{y}_{srs})} = \sqrt{\frac{1-f}{n}}\; N s \tag{2.9}$$
where s is the square root of $s^2$ computed by formula
(2.3).
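Formulas (2.7) through (2.9) differ from the formulas for the mean only by factors of N. A minimal sketch, using a hypothetical function name `srs_total_summary` and made-up data:

```python
import math

def srs_total_summary(y, N):
    """Return (total, v_total, se_total) for an SRS `y` from N elements."""
    n = len(y)
    if n < 2 or n > N:
        raise ValueError("need 2 <= n <= N")
    f = n / N
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # formula (2.3)
    total = N * ybar                                   # formula (2.7)
    v_total = N ** 2 * (1 - f) / n * s2                # formula (2.8)
    se_total = math.sqrt(v_total)                      # formula (2.9)
    return total, v_total, se_total

# Hypothetical data: n = 4 observations from a population of N = 20
y = [2, 4, 6, 8]
total, v_total, se_total = srs_total_summary(y, N=20)

# Last form in (2.7): each observation is divided by the sampling rate n/N
total_alt = sum(yi / (len(y) / 20) for yi in y)
```

The last line shows the "expansion" reading of (2.7): each sampled element stands in for N/n population elements.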
2.3.7 Confidence Interval for $\mathring{Y}$ ($n > 30$)
Lower Boundary: $\mathring{y}_{srs} - t \cdot se(\mathring{y}_{srs})$   (2.10)
Upper Boundary: $\mathring{y}_{srs} + t \cdot se(\mathring{y}_{srs})$   (2.11)
where the value of t is determined by the confidence
level (see Section 2.2.7).
Interpretation (t = 1.96): We are 95 percent sure that
$\mathring{Y}$ is covered by the interval whose boundaries are
defined by formulas (2.10) and (2.11).
2.4 Estimating a Population Proportion from a Simple
Random Sample
2.4.1 Setting
A. We have selected an SRS of size n from a
population of N elements.
B. We wish to estimate the proportion of all elements
in the population which possess some
attribute; we denote the population proportion to
be estimated by the symbol, P.
C. Examples:
(1) Proportion of residents of a large nursing
home who favor comprehensive national
health insurance.
(2) Proportion of patients in a large hospital
who are discharged in two or fewer days.
(3) Proportion of emergency medical workers
in North Carolina who have experienced
one or more episodes of violence in the
line of work during the last six months.
D. A population proportion is a special type of
population mean in which the characteristic
associated with each element is equal to 1 if the
element has the attribute (e.g., favoring national
health insurance) and 0 if the element does not
have the attribute (e.g., not favoring national
health insurance). The mean of this type of
dichotomous 0-or-1 characteristic is also the
proportion of all population elements
possessing the attribute.
E. Quite clearly $0 \le P \le 1$.
2.4.2 Estimator of P
$$\hat{P} = p_{srs} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{\text{number of sample elements possessing the attribute}}{\text{number of sample elements}} \tag{2.12}$$
where $y_i = 0$ if the i-th sample element does not
possess the attribute and $y_i = 1$ if it does.
2.4.3 Some Statistical Notes about $p_{srs}$
A. Different SRS samples of the same size which are
chosen from the same population are likely to
produce different values for $p_{srs}$; hence $p_{srs}$ is a
random variable with a sampling distribution.
B. $p_{srs}$ is an unbiased estimator of P.
C. The sampling distribution for $p_{srs}$ is very similar to
the normal distribution when $n p_{srs}$ and $n(1 - p_{srs})$ are
greater than 10.
2.4.4 Estimated Variance of $p_{srs}$
$$v(p_{srs}) = \left(\frac{1-f}{n}\right)\left(\frac{n}{n-1}\right) p_{srs}(1 - p_{srs}) = \left(\frac{1-f}{n-1}\right) p_{srs}(1 - p_{srs}) \tag{2.13}$$
since
$$s^2 = \frac{n}{n-1}\; p_{srs}(1 - p_{srs})$$
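The expression for $s^2$ follows from formula (2.3) once we note that a 0-or-1 characteristic satisfies $y_i^2 = y_i$. The intermediate steps below are a sketch filled in here, not steps shown in the original notes:

```latex
\begin{aligned}
\sum_{i=1}^{n} y_i^2 &= \sum_{i=1}^{n} y_i = n\,p_{srs}
  && \text{(because } y_i \in \{0, 1\}\text{)} \\
s^2 &= \frac{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)}
     = \frac{n^2 p_{srs} - n^2 p_{srs}^2}{n(n-1)}
     = \frac{n}{n-1}\, p_{srs}(1 - p_{srs}) \\
v(p_{srs}) &= \left(\frac{1-f}{n}\right) s^2
     = \left(\frac{1-f}{n-1}\right) p_{srs}(1 - p_{srs})
\end{aligned}
```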
2.4.5 Some Statistical Notes about $v(p_{srs})$
A. The term (1-f) is the finite population correction
(see Section 2.2.5).
B. Different SRS samples of the same size which
are chosen from the same population are likely
to produce different values for $v(p_{srs})$; hence
$v(p_{srs})$ is a random variable with a sampling
distribution.
C. $v(p_{srs})$ is an unbiased estimator of the true
variance of $p_{srs}$.
2.4.6 Standard Error of $p_{srs}$
$$se(p_{srs}) = \sqrt{v(p_{srs})} = \sqrt{\left(\frac{1-f}{n-1}\right) p_{srs}(1 - p_{srs})} \tag{2.14}$$
2.4.7 Confidence Interval for P ($n p_{srs} > 10$)
Lower Boundary: $p_{srs} - t \cdot se(p_{srs})$   (2.15)
Upper Boundary: $p_{srs} + t \cdot se(p_{srs})$   (2.16)
The value of t is determined by the confidence level
(see Section 2.2.7).
Interpretation (t = 1.96): We are 95 percent sure that
P is covered by the interval whose boundaries are
defined by formulas (2.15) and (2.16).
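Formulas (2.12) through (2.16) can be sketched together. The function name `srs_proportion_summary` and the sample (50 elements, 20 of them with the attribute, from a population of 1,000) are hypothetical.

```python
import math

def srs_proportion_summary(indicators, N, t=1.96):
    """Return (p, se, (lower, upper), normal_ok) for 0/1 data from an SRS."""
    n = len(indicators)
    if n < 2 or n > N:
        raise ValueError("need 2 <= n <= N")
    f = n / N
    p = sum(indicators) / n                    # formula (2.12)
    v = (1 - f) / (n - 1) * p * (1 - p)        # formula (2.13)
    se = math.sqrt(v)                          # formula (2.14)
    interval = (p - t * se, p + t * se)        # formulas (2.15) and (2.16)
    # Rule of thumb from Section 2.4.3 for using the normal approximation
    normal_ok = n * p > 10 and n * (1 - p) > 10
    return p, se, interval, normal_ok

# Hypothetical sample: 20 of 50 sampled elements have the attribute, N = 1000
indicators = [1] * 20 + [0] * 30
p, se, (lower, upper), normal_ok = srs_proportion_summary(indicators, N=1000)
```

The `normal_ok` flag is only a convenience added in this sketch; when it is false, the interval should be read with caution.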
2.5 Illustrative Example of Simple Random Sampling
2.5.1 Setting
A. We have a population of N=270 blocks from a
small town.
B. We wish to estimate the following from a
sample of n=10 blocks:
(1) $\bar{Y}$: The average number of rented
dwellings per block in the town.
(2) $\mathring{Y}$: The total number of rented
dwellings in the town.
(3) P: The proportion of blocks in the
town with ten or more rented
dwellings.
2.5.2 Sampling Frame
List of blocks presented in Table 2.1
2.5.3 Selection Procedure
A. We must randomly choose 10 different numbers
between 1 and 270.
B. Using the random numbers in Table 2.2, we start
in the upper left-hand corner and move across the
page as if we are reading a book, choosing one
three-digit number at a time. Numbers between 271
and 999 (also 000) are not useful.
C. Following this procedure the following 10 numbers
are chosen:
      Random    Number of Rented    Attribute: >= 10
 i    Number    Dwellings (y_i)     Rented Dwellings (y_i)
 1     256             0                   0
 2     106            27                   1
 3      54            12                   1
 4     267             3                   0
 5      51            30                   1
 6       8            30                   1
 7     154             1                   0
 8      48            58                   1
 9     112            44                   1
10     160             4                   0
(Skip to Section 2.6)
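The table-reading rule in step B can be sketched as follows. Table 2.2 is not reproduced in these notes, so the digit string below is invented purely for illustration and does not match the numbers selected above; the function name `pick_from_digits` is also hypothetical.

```python
def pick_from_digits(digits, N, n, width=3):
    """Read `width`-digit numbers left to right, keeping the first n usable ones."""
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        number = int(digits[start:start + width])
        if number < 1 or number > N:
            continue            # 000 and numbers above N are not useful
        if number in chosen:
            continue            # repeated number: discard it
        chosen.append(number)
        if len(chosen) == n:
            return chosen
    raise ValueError("ran out of digits before n numbers were chosen")

# Made-up digits, read as 417, 093, 000, 266, 093, 815, 152
digits = "417093000266093815152"
picked = pick_from_digits(digits, N=270, n=3)
print(picked)   # [93, 266, 152]
```

With N = 270 roughly 73 percent of three-digit groups are skipped, which is one reason table-based selection becomes tedious for larger samples.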
2.5.4 Estimates
A. Mean number of rented dwellings per block:
$$\bar{y}_{srs} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{209}{10} = 20.9$$
B. Total number of rented dwellings:
$$\mathring{y}_{srs} = N\bar{y}_{srs} = (270)(20.9) = 5643$$
C. Proportion of blocks with ten or more rented
dwellings:
$$p_{srs} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{6}{10} = 0.6$$
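The three estimates can be reproduced directly from the ten sampled blocks. The variable names in this sketch are illustrative.

```python
# Rented dwellings on the 10 sampled blocks, in order of selection (Section 2.5.3)
y = [0, 27, 12, 3, 30, 30, 1, 58, 44, 4]
N = 270
n = len(y)

ybar = sum(y) / n                # mean per block, formula (2.1)
total = N * sum(y) / n           # town total, formula (2.7)

# 0/1 attribute: block has ten or more rented dwellings
attribute = [1 if yi >= 10 else 0 for yi in y]
p = sum(attribute) / n           # proportion, formula (2.12)

print(ybar, total, p)
```

The printed values agree with 20.9, 5643, and 0.6 above.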
2.5.5 Variances and Standard Errors
A. Mean number of rented dwellings per block:
$$v(\bar{y}_{srs}) = \left(\frac{1-f}{n}\right)\left[\frac{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n-1)}\right] = \left(\frac{1 - 0.037}{10}\right)\left[\frac{(10)(7999) - (209)^2}{(10)(9)}\right] = 38.85$$
$$se(\bar{y}_{srs}) = \sqrt{v(\bar{y}_{srs})} = \sqrt{38.85} = 6.23$$
B. Total number of rented dwellings:
$$v(\mathring{y}_{srs}) = N^2\, v(\bar{y}_{srs}) = (270)^2 (38.85) = 2.832 \times 10^6$$
$$se(\mathring{y}_{srs}) = \sqrt{v(\mathring{y}_{srs})} = \sqrt{2.832 \times 10^6} = 1682.9$$
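The variance calculations can be checked numerically from the same ten blocks. The sketch also applies formula (2.14) to the proportion, a calculation the notes do not work out, so that part is an extension added here.

```python
import math

y = [0, 27, 12, 3, 30, 30, 1, 58, 44, 4]   # sampled blocks, Section 2.5.3
N = 270
n = len(y)
f = n / N                                   # about 0.037

sum_y = sum(y)                              # 209
sum_y2 = sum(yi * yi for yi in y)           # 7999
s2 = (n * sum_y2 - sum_y ** 2) / (n * (n - 1))   # formula (2.3)

v_mean = (1 - f) / n * s2                   # formula (2.2)
se_mean = math.sqrt(v_mean)                 # formula (2.4)
v_total = N ** 2 * v_mean                   # formula (2.8)
se_total = math.sqrt(v_total)               # formula (2.9)

# Extension (not in the notes): standard error of the proportion
p = sum(1 for yi in y if yi >= 10) / n
se_p = math.sqrt((1 - f) / (n - 1) * p * (1 - p))   # formula (2.14)

print(round(v_mean, 2), round(se_mean, 2), round(se_total, 1), round(se_p, 3))
```

Because the code keeps f unrounded, the last digits can differ slightly from the hand calculation, which rounds intermediate values. Note also that $n p_{srs} = 6$ here, below the threshold in Section 2.4.7, so a normal-approximation interval for P would not be well supported by this sample.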
2.6 Some Final Notes on Simple Random Sampling
A. The variance of an estimate derived from SRS
designs (and other designs as well).
(1) Variance is a measure of the statistical
quality of the estimate.
(2) Precision is inversely related to the size of
the variance (i.e., high variance implies low
precision; low variance implies high
precision).
B. Implications of changes in sample size on variance
(1) Variance is reduced when the sample size (n)
is increased.
(2) Larger sample sizes also contribute to smaller
variances by increasing f=n/N (and thereby
reducing 1-f).
(3) A change in sample size has a more
pronounced effect on the variance than does
the corresponding change in f.
Example: N = 500,000
First SRS Sampling Design:    $n_1 = 1000$,  $f_1 = 0.002$
Second SRS Sampling Design:   $n_2 = 5000$,  $f_2 = 0.010$
Thus, if we assume that $s_1^2 = s_2^2$,
$$\frac{v_2(\bar{y}_{srs})}{v_1(\bar{y}_{srs})} = \frac{\dfrac{(1-f_2)}{n_2}\, s_2^2}{\dfrac{(1-f_1)}{n_1}\, s_1^2} = \frac{(1-f_2)\, n_1}{(1-f_1)\, n_2} = \left(\frac{0.990}{0.998}\right)\left(\frac{1{,}000}{5{,}000}\right) = (0.992)(0.200) = 0.198$$
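The ratio can be split into its two factors to see which one drives the reduction. A minimal sketch, with a hypothetical function name `variance_ratio` and the same assumption of equal element variances:

```python
def variance_ratio(N, n1, n2):
    """v2 / v1 for two SRS designs on the same population, equal s^2 assumed."""
    f1, f2 = n1 / N, n2 / N
    fpc_factor = (1 - f2) / (1 - f1)   # effect of the change in f
    size_factor = n1 / n2              # effect of the change in n
    return fpc_factor * size_factor, fpc_factor, size_factor

ratio, fpc_factor, size_factor = variance_ratio(N=500_000, n1=1_000, n2=5_000)
print(round(ratio, 3), round(fpc_factor, 3), round(size_factor, 3))
```

The sample-size factor (0.200) accounts for almost all of the reduction, while the change in the finite population correction (0.992) contributes very little, which is point (3) above.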
C. Simple random sampling is the simplest
probability sampling method.
D. Simple random sampling is only rarely used in
practice. One might conceivably use it however
when both of the following are true:
(1) The sample size is small and either:
a. The population is relatively large
with a sequence of numbers
uniquely identifying each element
(e.g., employee ID numbers
assigned sequentially to employees);
or
b. The population is relatively small.
(2) Stratification is either not feasible or not
possible (see Section 4).
Supplementary Reading
[1] Mendenhall, W., Ott, L., and Scheaffer, R.L., Elementary
Survey Sampling, Duxbury Press, Belmont, California,
1971, Chapter 4.
[2] Kish, L., Survey Sampling, Wiley and Sons, 1965,
Sections 2.0-2.6.
[3] Cochran, W.G., Sampling Techniques, 3rd Edition, Wiley
and Sons, 1977, Sections 2.1-2.9; 3.1-3.3.