PSTAT 262 AS: Survey Sampling and Estimation

STK 4600: Statistical methods for social sciences.
Survey sampling and statistical demography.
Surveys for households and individuals.
Survey sampling: 4 major topics
1. Traditional design-based statistical inference – 6 weeks
2. Likelihood considerations – 1 week
3. Model-based statistical inference – 3 weeks
4. Missing data – nonresponse – 2 weeks
Statistical demography
• Mortality
• Life expectancy
• Population projections
• 2-3 weeks
Course goals
• Give students knowledge about:
– planning surveys in social sciences
– major sampling designs
– basic concepts and the most important estimation methods in
traditional applied survey sampling
– Likelihood principle and its consequences for survey
sampling
– Use of modeling in sampling
– Treatment of nonresponse
– A basic knowledge of demography
But first: Basic concepts in sampling
Population (target population): The universe of all units of interest for a certain study.
• Denoted U = {1, 2, ..., N}, with N being the size of the population. All units can be identified and labeled.
• Ex: Political poll – all adults eligible to vote
• Ex: Employment/unemployment in Norway – all persons in Norway, age 15 or more
• Ex: Consumer expenditure: unit = household
Sample: A subset of the population, to be observed. The sample should be ”representative” of the population.
Sampling design:
• The sample is a probability sample if all units in the sample have been
chosen with certain probabilities, and such that each unit in the population
has a positive probability of being chosen to the sample
• We shall only be concerned with probability sampling
• Example: simple random sample (SRS). Let n denote the sample size.
Every possible subset of n units has the same chance of being the sample.
Then all units in the population have the same probability n/N of being
chosen to the sample.
• The probability distribution for SRS on all subsets of U is an example of
a sampling design: The probability plan for selecting a sample s from the
population:
$p(s) = 1/\binom{N}{n}$ if $|s| = n$
$p(s) = 0$ if $|s| \ne n$
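A quick simulation sketch of the SRS design (plain Python; the values N = 10, n = 3 are hypothetical): each unit's empirical inclusion frequency should be close to n/N.

```python
import random
from collections import Counter

N, n, reps = 10, 3, 100_000
counts = Counter()
for _ in range(reps):
    counts.update(random.sample(range(1, N + 1), n))  # one SRS of size n

# each unit's empirical inclusion frequency should be close to n/N = 0.3
print({i: round(counts[i] / reps, 3) for i in sorted(counts)})
```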
Basic statistical problem: Estimation
• A typical survey has many variables of interest
• Aim of a sample is to obtain information regarding totals or
averages of these variables for the whole population
• Examples : Unemployment in Norway– Want to
estimate the total number t of individuals unemployed.
For each person i (at least 15 years old) in Norway:
$y_i = 1$ if person i is unemployed, 0 otherwise. Then:
$t = \sum_{i=1}^N y_i$
• In general, variable of interest: y, with $y_i$ equal to the value of y for unit i in the population; the total is denoted $t = \sum_{i=1}^N y_i$
• The typical problem is to estimate t or t/N
• Sometimes it is also of interest to estimate ratios of totals.
Example – estimating the rate of unemployment:
$y_i = 1$ if person i is unemployed, 0 otherwise
$x_i = 1$ if person i is in the labor force, 0 otherwise
with totals $t_y$, $t_x$.
Unemployment rate: $t_y/t_x$
Sources of error in sample surveys
1. Target population U vs frame population $U_F$
Access to the population is thru a list of units – a register $U_F$. U and $U_F$ may not be the same. Three possible errors in $U_F$:
– Undercoverage: some units in U are not in $U_F$
– Overcoverage: some units in $U_F$ are not in U
– Duplicate listings: a unit in U is listed more than once in $U_F$
• $U_F$ is sometimes called the sampling frame
2. Nonresponse – missing data
• Some persons cannot be contacted
• Some refuse to participate in the survey
• Some may be ill and incapable of responding
• In postal surveys: can be as much as 70% nonresponse
• In telephone surveys: 50% nonresponse is not uncommon
• Possible consequences:
– Bias in the sample, which is then not representative of the population
– Estimation becomes more inaccurate
• Remedies:
– imputation, weighting
3. Measurement error – the correct value of $y_i$ is not measured
– In interviewer surveys:
• Incorrect marking
• Interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use
• Misunderstanding of the question, or not remembering correctly
4. Sampling error
– The error caused by observing a sample instead of the whole population
– To assess this error, a margin of error: measure the sample-to-sample variation
– The design approach deals with calculating sampling errors for different sampling designs
– One such measure, the 95% confidence interval: if we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t
• The first 3 errors: nonsampling errors
– Can be much larger than the sampling error
• In this course:
– Sampling error
– nonresponse bias
– Shall assume that the frame population is
identical to the target population
– No measurement error
Summary of basic concepts
• Population, target population
• Unit
• Sample
• Sampling design
• Estimation
– estimator
– measure of bias
– measure of variance
– confidence interval
• Survey errors:
– register/frame population
– measurement error
– nonresponse
– sampling error
Example – Psychiatric Morbidity Survey
1993 from Great Britain
• Aim: Provide information about prevalence of
psychiatric problems among adults in GB as well
as their associated social disabilities and use of
services
• Target population: Adults aged 16-64 living in
private households
• Sample: Thru several stages, 18,000 addresses were chosen and 1 adult in each household was chosen
• 200 interviewers, each visiting 90 households
Result of the sampling process

Sample of addresses                        18,000
  Vacant premises                             927
  Institutions/business premises              573
  Demolished                                  499
  Second home/holiday flat                    236
Private household addresses                15,765
  Extra households found                      669
Total private households                   16,434
  Households with no one aged 16-64         3,704
Eligible households                        12,730
Nonresponse                                 2,622
Sample                                     10,108
(households with responding adults aged 16-64)
Why sampling ?
• reduces costs for acceptable level of accuracy
(money, manpower, processing time...)
• may free up resources to reduce nonsampling error
and collect more information from each person in
the sample
– ex: 400 interviewers at $5 per interview: lower sampling error;
200 interviewers at $10 per interview: lower nonsampling error
• much quicker results
When is a sample representative?
• Balance on gender and age:
– proportion of women in sample ≈ proportion in population
– proportions of age groups in sample ≈ proportions in population
• An ideal representative sample:
– a miniature version of the population,
– implying that every unit in the sample represents the characteristics of a known number of units in the population
• Appropriate probability sampling ensures a representative sample ”on the average”
Alternative approaches for statistical inference
based on survey sampling
• Design-based:
– No modeling, only stochastic element is the
sample s with known distribution
• Model-based: The values yi are assumed to be
values of random variables Yi:
– Two stochastic elements: Y = (Y1, …,YN) and s
– Assumes a parametric distribution for Y
– Example : suppose we have an auxiliary
variable x. Could be: age, gender, education. A
typical model is a regression of Yi on xi.
• Statistical principles of inference imply that the model-based approach is the most sound and valid approach
• We start with the design-based approach, since it is the approach most applied to survey sampling by national statistical institutes and most research institutes for the social sciences.
– It is also the easy way out: no modeling is needed. All statisticians working with survey sampling in practice need to know this approach
Design-based statistical inference
• Can also be viewed as a distribution-free
nonparametric approach
• The only stochastic element: Sample s, distribution
p(s) for all subsets s of the population U={1, ..., N}
• No explicit statistical modeling is done for the
variable y. All yi’s are considered fixed but unknown
• Focus on sampling error
• Sets the sample survey theory apart from usual
statistical analysis
• The traditional approach, started by Neyman in 1934
Estimation theory – simple random sample
SRS of size n: each sample s of size n has $p(s) = 1/\binom{N}{n}$.
Can be performed in principle by drawing one unit at a time, at random, without replacement.
Estimation of the population mean of a variable y: $\mu = \sum_{i=1}^N y_i/N$
A natural estimator – the sample mean: $\bar y_s = \sum_{i\in s} y_i/n$
Desirable properties:
(I) Unbiasedness: an estimator $\hat\mu$ is unbiased if $E(\hat\mu) = \mu$.
$\bar y_s$ is unbiased under the SRS design.
The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE):
$\mathrm{Var}(\hat\mu) = E(\hat\mu - \mu)^2$, if $E(\hat\mu) = \mu$
$\hat V(\hat\mu)$ is an (unbiased) estimate of $\mathrm{Var}(\hat\mu)$, and $\mathrm{SE}(\hat\mu) = \sqrt{\hat V(\hat\mu)}$
Some results for SRS:
(1) Let $\pi_i$ be the probability that unit i is in the sample. Then $\pi_i = n/N = f$, the sampling fraction.
(2) $E(\bar y_s) = \mu$
(3) Let $\sigma^2$ be the population variance: $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu)^2$. Then
$\mathrm{Var}(\bar y_s) = \frac{\sigma^2}{n}(1-f)$
Here, the factor $(1-f)$ is called the finite population correction
• usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 − f = 0.998
n = 1,000 and N = 400,000: 1 − f = 0.9975
n = 1,000 and N = 5,000,000: 1 − f = 0.9998
• the effect of changing n is much more important than the effect of changing n/N
An unbiased estimator of $\sigma^2$ is given by the sample variance
$s^2 = \frac{1}{n-1}\sum_{i\in s}(y_i - \bar y_s)^2$
The estimated variance: $\hat V(\bar y_s) = (1-f)\frac{s^2}{n}$
Usually we report the standard error of the estimate: $\mathrm{SE}(\bar y_s) = \sqrt{\hat V(\bar y_s)}$
Confidence intervals for $\mu$ are based on the Central Limit Theorem:
For large n, N − n: $Z = (\bar y_s - \mu)\big/\sqrt{\sigma^2(1-f)/n} \sim N(0,1)$ approximately.
Approximate 95% CI for $\mu$: $\bar y_s \pm 1.96\cdot\mathrm{SE}(\bar y_s)$
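A minimal code sketch of these SRS formulas (Python with numpy; the function name is ours, not from the course):

```python
import numpy as np

def srs_mean_ci(sample, N, z=1.96):
    """Sample mean, SE with finite population correction, approx 95% CI (SRS)."""
    y = np.asarray(sample, dtype=float)
    n = len(y)
    f = n / N                       # sampling fraction
    ybar = y.mean()
    s2 = y.var(ddof=1)              # unbiased sample variance s^2
    se = np.sqrt((1 - f) * s2 / n)  # SE(ybar) = sqrt((1 - f) s^2 / n)
    return ybar, se, (ybar - z * se, ybar + z * se)
```

Applied to the Ames figures below (n = 90, s² = 75, N = 341), the same arithmetic gives SE ≈ 0.78.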
Example
N = 341 residential blocks in Ames, Iowa
$y_i$ = number of dwellings in block i
1000 independent SRS for each of several values of n:

n    Proportion of samples with |Z| < 1.64    Proportion of samples with |Z| < 1.96
30   0.88                                     0.93
50   0.88                                     0.93
70   0.88                                     0.94
90   0.90                                     0.95
For one SRS with n = 90: $\bar y_s = 13$, $s^2 = 75$
$\mathrm{SE}(\bar y_s) = \sqrt{(1 - 90/341)\cdot 75/90} = 0.78$
Approximate 95% CI: $13 \pm 1.96\cdot 0.78 = 13 \pm 1.53 = (11.47,\ 14.53)$
The absolute value of the sampling error is not informative unless it is related to the value of the estimate. For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.
The coefficient of variation of the estimate:
$\mathrm{CV}(\bar y_s) = \mathrm{SE}(\bar y_s)/\bar y_s$
In the example: $\mathrm{CV}(\bar y_s) = 0.78/13 = 0.06 = 6\%$
• A measure of the relative variability of an estimate.
• It does not depend on the unit of measurement.
• More stable over repeated surveys; can be used for planning, for example determining sample size.
• More meaningful when estimating proportions.
Estimation of a population proportion p with a certain characteristic A:
p = (number of units in the population with A)/N
Let $y_i = 1$ if unit i has characteristic A, 0 otherwise. Then p is the population mean of the $y_i$'s.
Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as
$\hat p = \bar y_s = X/n$
Then under SRS:
$E(\hat p) = p$ and $\mathrm{Var}(\hat p) = \frac{p(1-p)}{n}\left(1 - \frac{n-1}{N-1}\right)$
since the population variance equals $\sigma^2 = \frac{Np(1-p)}{N-1}$.
The sample variance is $s^2 = \frac{n\,\hat p(1-\hat p)}{n-1}$.
So the unbiased estimate of the variance of the estimator is
$\hat V(\hat p) = \frac{\hat p(1-\hat p)}{n-1}\left(1 - \frac{n}{N}\right)$
Examples
A political poll: suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by
$\hat p = 280/1000 = 0.28$
$\mathrm{SE}(\hat p) = \sqrt{\frac{\hat p(1-\hat p)}{n-1}\left(1-\frac{n}{N}\right)} = \sqrt{\frac{0.28\cdot 0.72}{999}} \approx 0.0142$
A confidence interval requires the normal approximation. Can use the guideline from the binomial distribution, when N − n is large: $np \ge 5$ and $n(1-p) \ge 5$.
In this example: n = 1000 and N = 4,000,000.
Approximate 95% CI: $\hat p \pm 1.96\cdot\mathrm{SE}(\hat p) = 0.280 \pm 0.028 = (0.252,\ 0.308)$
Ex: Psychiatric Morbidity Survey 1993 from Great Britain
p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
N ≈ 40,000,000
$\hat p = 0.14$
$\mathrm{SE}(\hat p) = \sqrt{(1 - 0.00024)\cdot 0.14\cdot 0.86/9791} = 0.0035$
95% CI: $0.14 \pm 1.96\cdot 0.0035 = 0.14 \pm 0.007 = (0.133,\ 0.147)$
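A small sketch of the proportion formulas applied to the poll (Python; the function name is hypothetical):

```python
import math

def prop_se(x, n, N):
    """p-hat and its SE under SRS: V(p-hat) = p(1-p)/(n-1) * (1 - n/N)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
    return p_hat, se

p_hat, se = prop_se(280, 1000, 4_000_000)   # the political poll above
print(f"{p_hat:.3f} +/- {1.96 * se:.3f}")   # 0.280 +/- 0.028
```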
General probability sampling
• Sampling design: p(s) - known probability of selection for
each subset s of the population U
• Actually: The sampling design is the probability distribution
p(.) over all subsets of U
• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
• The inclusion probability:
$\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i\in s) = \sum_{\{s:\,i\in s\}} p(s)$
Illustration
U = {1,2,3,4}
Sample of size 2; 6 possible samples
Sampling design:
p({1,2}) = ½, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8
The inclusion probabilities:
$\pi_1 = \sum_{\{s:\,1\in s\}} p(s) = p(\{1,2\}) + p(\{1,4\}) = 1/2 + 1/8 = 5/8$
$\pi_2 = p(\{1,2\}) + p(\{2,3\}) = 1/2 + 1/4 = 6/8$
$\pi_3 = p(\{2,3\}) + p(\{3,4\}) = 1/4 + 1/8 = 3/8$
$\pi_4 = p(\{3,4\}) + p(\{1,4\}) = 1/8 + 1/8 = 2/8$
Some results
(I) $\pi_1 + \pi_2 + \cdots + \pi_N = E(n)$, where n is the sample size
(II) If the sample size is fixed at n in advance: $\pi_1 + \pi_2 + \cdots + \pi_N = n$
Proof:
Let $Z_i = 1$ if unit i is included in the sample, 0 otherwise. Then
$\pi_i = P(Z_i = 1) = E(Z_i)$ and $n = \sum_{i=1}^N Z_i \Rightarrow E(n) = \sum_{i=1}^N E(Z_i) = \sum_{i=1}^N \pi_i$
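The illustration above, and result (II), are easy to verify in code (a minimal sketch; the design is the one just given):

```python
design = {frozenset({1, 2}): 1/2, frozenset({2, 3}): 1/4,
          frozenset({3, 4}): 1/8, frozenset({1, 4}): 1/8}

# pi_i = sum of p(s) over all samples s containing unit i
pi = {i: sum(p for s, p in design.items() if i in s) for i in range(1, 5)}
print(pi)                 # {1: 0.625, 2: 0.75, 3: 0.375, 4: 0.25}
print(sum(pi.values()))   # 2.0 = n, the fixed sample size (result II)
```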
Estimation theory – probability sampling in general
Problem: estimate a population quantity for the variable y.
For the sake of illustration: the population total $t = \sum_{i=1}^N y_i$.
An estimator of t based on the sample: $\hat t$.
Expected value: $E(\hat t) = \sum_s \hat t(s)\,p(s)$
Variance: $\mathrm{Var}(\hat t) = E[\hat t - E\hat t]^2 = \sum_s [\hat t(s) - E\hat t]^2\,p(s)$
Bias: $E(\hat t) - t$; $\hat t$ is unbiased if $E(\hat t) = t$.
Let $\hat V(\hat t)$ be an (unbiased if possible) estimate of $\mathrm{Var}(\hat t)$.
The standard error of $\hat t$: $\mathrm{SE}(\hat t) = \sqrt{\hat V(\hat t)}$
Coefficient of variation of $\hat t$: $\mathrm{CV}(\hat t) = \mathrm{SE}(\hat t)/\hat t$
CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases.
Margin of error: $2\cdot\mathrm{SE}(\hat t)$, because typically
$P(\hat t - 2\,\mathrm{SE}(\hat t) \le t \le \hat t + 2\,\mathrm{SE}(\hat t)) \approx 0.95$ for large n, N − n,
since $\hat t$ is approximately normally distributed for large n, N − n. Hence $\hat t \pm 2\cdot\mathrm{SE}(\hat t)$ is approximately a 95% CI.
Some peculiarities in the estimation theory
Example: N = 3, n = 2, simple random sample:
$s_1 = \{1,2\},\ s_2 = \{1,3\},\ s_3 = \{2,3\}$, with $p(s_k) = 1/3$ for k = 1, 2, 3.
Let $\hat t_1 = 3\bar y_s$, which is unbiased.
Let $\hat t_2$ be given by:
$\hat t_2(s_1) = 3\left(\tfrac{1}{2}y_1 + \tfrac{1}{2}y_2\right) = \hat t_1(s_1)$
$\hat t_2(s_2) = 3\left(\tfrac{1}{2}y_1 + \tfrac{2}{3}y_3\right) = \hat t_1(s_2) + \tfrac{1}{2}y_3$
$\hat t_2(s_3) = 3\left(\tfrac{1}{2}y_2 + \tfrac{1}{3}y_3\right) = \hat t_1(s_3) - \tfrac{1}{2}y_3$
Also $\hat t_2$ is unbiased:
$E(\hat t_2) = \sum_s \hat t_2(s)p(s) = \frac{1}{3}\sum_{k=1}^3 \hat t_2(s_k) = \frac{1}{3}\cdot 3t = t$
$\mathrm{Var}(\hat t_1) - \mathrm{Var}(\hat t_2) = \frac{1}{6}\,y_3(3y_2 - 3y_1 - y_3)$
$\Rightarrow \mathrm{Var}(\hat t_1) > \mathrm{Var}(\hat t_2)$ if $y_3 > 0$ and $3y_2 - 3y_1 > y_3$
If the $y_i$ are 0/1 variables, this happens when $y_1 = 0,\ y_2 = y_3 = 1$.
For this set of values of the $y_i$'s (t = 2):
$\hat t_1(s_1) = 1.5,\ \hat t_1(s_2) = 1.5,\ \hat t_1(s_3) = 3$: never correct
$\hat t_2(s_1) = 1.5,\ \hat t_2(s_2) = 2,\ \hat t_2(s_3) = 2.5$
$\hat t_2$ has clearly less variability than $\hat t_1$ for these y-values.
Let y be the population vector of the y-values.
This example shows that $N\bar y_s$ is not uniformly best (minimum variance for all y) among linear design-unbiased estimators.
The example shows that the ”usual” basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models.
In fact, we have the following much stronger result:
Theorem: Let p(·) be any sampling design. Assume each $y_i$ can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t.
Proof:
Let $\hat t$ be unbiased, and let $\mathbf{y}^0$ be one possible value of y. Then there exists an unbiased $\hat t_0$ with $\mathrm{Var}(\hat t_0) = 0$ when $\mathbf{y} = \mathbf{y}^0$:
$\hat t_0(s,\mathbf{y}) = \hat t(s,\mathbf{y}) - \hat t(s,\mathbf{y}^0) + t_0$, where $t_0$ is the total for $\mathbf{y}^0$.
1) $\hat t_0$ is unbiased: $E(\hat t_0) = t - \sum_s \hat t(s,\mathbf{y}^0)p(s) + t_0 = t$
2) When $\mathbf{y} = \mathbf{y}^0$: $\hat t_0 = t_0$ for all samples s $\Rightarrow \mathrm{Var}(\hat t_0) = 0$
This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible.
Determining sample size
• The sample size has a decisive effect on the cost of the survey
• How large n should be depends on the purpose of the survey
• In a poll for determining voting preference, n = 1000 is typically enough
• In the quarterly Labor Force Survey in Norway, n = 24,000
Mainly three factors to consider:
1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest
2. Homogeneity of the population. Smaller samples suffice if there is little variation in the population
3. Estimation for subgroups, domains, of the population
It is often factor 3 that puts the highest demand on the survey
• If we want to estimate totals for domains of the
population we should take a stratified sample
• A sample from each domain
• A stratified random sample: From each domain a
simple random sample
H strata that constitute the whole population
Sample sizes: $n_1, n_2, \ldots, n_H$
Total sample size: $n = n_1 + n_2 + \cdots + n_H$
Must determine each $n_h$
Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p.
Let n be the sample size of this stratum, and assume that n/N is negligible.
Desired accuracy for this stratum: the 95% CI for p should be ± 5%.
95% CI for p: $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}$
The accuracy requirement:
$1.96\sqrt{\hat p(1-\hat p)/n} \le 0.05 = \frac{1}{20}$
$\Rightarrow n \ge 1.96^2\cdot 20^2\cdot\hat p(1-\hat p)$, which is at most 384 (attained at $\hat p = 0.5$)
The estimate is unknown in the planning phase. Use the conservative size 384, or a planning value $p_0$ with $n = 1536\,p_0(1-p_0)$.
E.g., with $p_0 = 0.2$: n = 246.
In general, with accuracy requirement d (95% CI $= \hat p \pm d$):
$n \ge 3.84\,p_0(1-p_0)/d^2$
Alternative accuracy requirement: the length of the 95% CI should be proportional to $\hat p$ (when $\hat p \le 0.5$; otherwise estimate $1-p$):
$1.96\sqrt{\hat p(1-\hat p)/n} \le d\cdot\hat p \iff \mathrm{CV}(\hat p) \le d/1.96 = e$
46
1 1  pˆ
SE( pˆ ) / pˆ  e  n  2 
pˆ
e
1 1  p0
Planning value p0 : n  2 
p0
e
With e = 0.1, then we require approximately that
when p0  0.5 : 95% CI  pˆ  0.10 and n  100
when p0  0.1 : 95% CI  pˆ  0.02 and n  900
47
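A minimal sketch of these two sample-size rules (Python; function names hypothetical). Note 1.96² = 3.8416, so the exact conservative size is 385 rather than the rounded 384:

```python
import math

def n_for_margin(p0, d, z=1.96):
    """Smallest n with z*sqrt(p0(1-p0)/n) <= d, i.e. 95% CI = p-hat +/- d."""
    return math.ceil(z**2 * p0 * (1 - p0) / d**2)

def n_for_cv(p0, e):
    """Smallest n with CV(p-hat) <= e, i.e. n >= (1-p0)/(p0 e^2)."""
    return math.ceil((1 - p0) / (p0 * e**2))

print(n_for_margin(0.5, 0.05))   # 385 (the slide's 384 uses 3.84 for 1.96^2)
print(n_for_margin(0.2, 0.05))   # 246
print(n_for_cv(0.1, 0.1))        # 900
```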
Example: monthly unemployment rate
Important to detect changes in unemployment rates from month to month. Planning value $p_0 = 0.05$.
Desired accuracy:
$1.96\,\mathrm{SE}(\hat p) \le d \Rightarrow n \ge 3.84\,p_0(1-p_0)/d^2 = 0.1824/d^2$
$d = 0.001$ (margin of error 0.1%) $\Rightarrow n = 182{,}400$
$d = 0.002 \Rightarrow n = 45{,}600$
$d = 0.005 \Rightarrow n = 7{,}300$
Note: $d = 0.005 \Rightarrow \mathrm{CV}(\hat p) = 0.00255/0.05 = 0.051 \approx 5\%$
Two basic estimators:
Ratio estimator
Horvitz-Thompson estimator
• Ratio estimator in simple random samples
• H-T estimator for unequal probability
sampling: The inclusion probabilities are
unequal
• The goal is to estimate a population total t
for a variable y
Ratio estimator
Suppose we have known auxiliary information for the whole population:
$\mathbf{x} = (x_1, x_2, \ldots, x_N)$
Ex: age, gender, education, employment status.
Let $X = \sum_{i=1}^N x_i$.
The ratio estimator of the y-total t:
$\hat t_R = X\cdot\frac{\sum_{i\in s} y_i}{\sum_{i\in s} x_i} = X\,\frac{\bar y_s}{\bar x_s}$
We can express the ratio estimator in the following form:
$\hat t_R = \frac{X}{N\bar x_s}\,(N\bar y_s)$
It adjusts the usual “sample mean estimator” in the cases where the x-values in the sample are too small or too large. This is reasonable if there is a positive correlation between x and y.
Example: a university with 4000 students, SRS of 400.
Estimate the total number t of women planning a career in teaching; t = Np, where p is the proportion.
$y_i = 1$ if student i is a woman planning to be a teacher; t is the y-total.
Results: 84 out of the 240 women in the sample plan to be teachers.
$\hat p = 84/400 = 0.21$, $\hat t = N\hat p = 840$
HOWEVER: it was noticed that the university has 2700 women (67.5%), while in the sample we had 60% women.
A better estimate, which corrects for the underrepresentation of women, is obtained by the ratio estimate using the auxiliary variable x = 1 if the student is a woman:
$\hat t_R = \frac{2700}{4000\cdot 0.6}\,(840) = 945$
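A minimal sketch of the ratio estimator applied to this example (Python; names hypothetical):

```python
def ratio_estimator(y_sample, x_sample, X_total):
    """Ratio estimator of the y-total: t_R = X * sum(y in s) / sum(x in s)."""
    return X_total * sum(y_sample) / sum(x_sample)

# University example: 84 of the 240 sampled women plan to teach; X = 2700 women
y = [1] * 84 + [0] * 316    # 1 = woman planning to be a teacher
x = [1] * 240 + [0] * 160   # 1 = woman
print(ratio_estimator(y, x, 2700))   # 945.0
```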
In business surveys it is very common to use a
ratio estimator.
Ex: yi = amount spent on health insurance by
business i
xi = number of employees in business i
We shall now do a comparison between the ratio
estimator and the sample mean based estimator. We
need to derive expectation and variance for the
ratio estimator
First: must define the population covariance
$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$
where $\mu_x, \mu_y$ are the population means of the x and y variables, and
$\sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu_y)^2, \qquad \sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu_x)^2$
The population correlation coefficient:
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$
Let $R = \sum_{i=1}^N y_i\big/\sum_{i=1}^N x_i = t/X$ and $\hat R = N\bar y_s/(N\bar x_s) = \bar y_s/\bar x_s$.
(I) Bias: $E(\hat t_R - t) = -\mathrm{Cov}(\hat R, N\bar x_s)$
Proof:
$\hat t_R - N\bar y_s = \frac{N\bar y_s}{N\bar x_s}(X - N\bar x_s) = -\hat R\,(N\bar x_s - X)$
Since $E(N\bar y_s) = t$ and $E(N\bar x_s) = X$:
$E(\hat t_R - t) = -E[\hat R\,(N\bar x_s - X)] = -\mathrm{Cov}(\hat R, N\bar x_s)$
It follows that
$\frac{|\mathrm{Bias}(\hat t_R)|}{\mathrm{SE}(\hat t_R)} = \frac{|\mathrm{Cov}(\hat R, N\bar x_s)|}{\sqrt{\mathrm{Var}(\hat t_R)}} \le |\mathrm{Corr}(\hat R, N\bar x_s)|\cdot\mathrm{CV}(\bar x_s) \le \mathrm{CV}(\bar x_s)$
using $\mathrm{Var}(\hat t_R) \approx X^2\,\mathrm{Var}(\hat R)$ and $\mathrm{CV}(N\bar x_s) = \mathrm{CV}(\bar x_s)$.
Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x sample mean is small.
Certainly true for large n.
(II) $E(\hat t_R) \approx t$ for large n.
(III) $\mathrm{Var}(\hat t_R) \approx N^2\,\frac{1-f}{n}\,(\sigma_y^2 - 2R\,\sigma_{xy} + R^2\sigma_x^2) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$
Note: The ratio estimator is very precise when the
population points (yi , xi) lie close around a straight
line thru the origin with slope R.
The regression model generates the ratio estimator
Comparing the ratio estimator with $N\bar y_s$:
$\mathrm{Var}(\hat t_R) \approx N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$
and recalling that
$\mathrm{Var}(N\bar y_s) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - \mu_y)^2$,
$\mathrm{Var}(\hat t_R) \le \mathrm{Var}(N\bar y_s) \iff \sum_{i=1}^N (y_i - Rx_i)^2 \le \sum_{i=1}^N (y_i - \mu_y)^2$
The ratio estimator is more accurate if $Rx_i$ predicts $y_i$ better than $\mu_y$ does.
Estimated variance for the ratio estimator
Estimate $\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$ by $\frac{1}{n-1}\sum_{i\in s}(y_i - \hat R x_i)^2$:
$\hat V(\hat t_R) = \left(\frac{\bar X}{\bar x_s}\right)^2 N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s}(y_i - \hat R x_i)^2$, where $\bar X = X/N$.
Note: if $\bar x_s$ is very small, then $\hat R$ is more uncertain, and the variance estimate becomes larger to reflect that.
For large n, N − n: approximate normality holds, and an approximate 95% confidence interval is given by
$\hat t_R \pm 1.96\,\frac{X}{\bar x_s}\sqrt{\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s}(y_i - \hat R x_i)^2}$
Unequal probability sampling
Inclusion probabilities:
$\pi_i = P(i\in s) > 0$ for all $i = 1, \ldots, N$
Example: Psychiatric Morbidity Survey: selected individuals from households.
$\pi_i = 1/M_i$, where $M_i$ = number of adults aged 16-64 in the household that individual i belongs to.
Horvitz-Thompson estimator – unequal probability sampling
$\pi_i = P(i\in s) > 0$ for all $i = 1, \ldots, N$
Let's try to use $N\bar y_s$. Let $Z_i = 1$ if $i\in s$, 0 otherwise; $E(Z_i) = \pi_i$.
$E(N\bar y_s) = \frac{N}{n}\,E\left(\sum_{i=1}^N y_i Z_i\right) = \frac{N}{n}\sum_{i=1}^N y_i\pi_i \ne t$
– not unbiased.
The bias is large if the inclusion probabilities tend to increase or decrease systematically with $y_i$.
Use weighting to correct for the bias:
$\hat t = \sum_{i\in s} w_i y_i$, where $w_i$ does not depend on s.
$E(\hat t) = E\left(\sum_{i=1}^N w_i y_i Z_i\right) = \sum_{i=1}^N w_i \pi_i y_i$
and $\hat t$ is unbiased for all possible values $y_i$ if and only if $w_i = 1/\pi_i$:
$\hat t_{HT} = \sum_{i\in s}\frac{y_i}{\pi_i}$
In SRS, $\pi_i = n/N$ and $\hat t_{HT} = N\bar y_s$.
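A minimal sketch of the HT estimator (Python; names hypothetical), with the SRS special case as a check:

```python
def horvitz_thompson(y_sample, pi_sample):
    """HT estimator of a total: sum over the sample of y_i / pi_i."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))

# SRS check: pi_i = n/N reproduces N * ybar_s (here N = 12, n = 3)
y = [2.0, 5.0, 3.0]
print(horvitz_thompson(y, [3 / 12] * 3))   # 40.0 = 12 * mean(y)
```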
a) $\mathrm{Var}(\hat t_{HT}) = \sum_{i=1}^N \frac{1-\pi_i}{\pi_i}\,y_i^2 + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^N \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\,y_i y_j$
If $|s| = n$ (fixed sample size), then
b) $\mathrm{Var}(\hat t_{HT}) = \sum_{i=1}^{N-1}\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$
where $\pi_{ij} = P(i, j\in s) = P(Z_i = Z_j = 1)$.
The Horvitz-Thompson estimator is widely used, e.g., in official statistics.
Note that the variance is small if we determine the inclusion probabilities such that the $y_i/\pi_i$ are approximately equal, i.e., $\pi_i$ increases with increasing $y_i$.
Of course, we do not know the value of $y_i$ when planning the survey; use a known auxiliary $x_i$ and choose
$\pi_i \propto x_i \Rightarrow \pi_i = n\,x_i/X$, since $\sum_{i=1}^N \pi_i = n$.
If $y_i$ and $\pi_i$ are unrelated or negatively ”correlated”, $\mathrm{Var}(\hat t_{HT})$ can be enormous, and one should not use the HT estimator even though the $\pi_i$'s are unequal.
Example: a population of 3 elephants, to be shipped. An estimate of the total weight is needed.
• Weighing an elephant is no simple matter. The owner wants to estimate the total weight by weighing just one elephant.
• He knows from earlier that elephant 2 has a weight $y_2$ close to the average weight, and wants to use this elephant and take $3y_2$ as the estimate.
• However: to get an unbiased estimator, all inclusion probabilities must be positive.
• Sampling design: $|s| = 1$ and $\pi_2 = 0.90$, $\pi_1 = \pi_3 = 0.05$
• The weights: 1, 2, 4 tons; total = 7 tons
• HT estimator: $\hat t_{HT} = y_i/\pi_i$ if $s = \{i\}$:
  20 if s = {1}
  2.22 if s = {2}
  80 if s = {3}
Hopeless! Always far from the true total of 7. It cannot be used, even though $E(\hat t_{HT}) = 7 = t$.
Problem:
$\mathrm{Var}(\hat t_{HT}) = (20-7)^2\cdot 0.05 + (2.22-7)^2\cdot 0.90 + (80-7)^2\cdot 0.05 = 295.46$
True $\mathrm{SE}(\hat t_{HT}) = \sqrt{295.46} = 17.2$!!!
The planned estimator, even though this is not an SRS:
$\hat t_{eleph} = 3\bar y_s = 3y_i$ if $s = \{i\}$. Possible values: 3, 6, 12.
$E(\hat t_{eleph}) = 6.15$: not unbiased, but look at
$\mathrm{SE}(\hat t_{eleph}) = \sqrt{2.2275} = 1.49$
$\mathrm{MSE}(\hat t_{eleph}) = E(\hat t_{eleph} - t)^2 = \mathrm{Bias}^2 + \mathrm{Var}(\hat t_{eleph}) = 2.95$
$\sqrt{\mathrm{MSE}(\hat t_{eleph})} = 1.72$
$\hat t_{eleph}$ is clearly preferable to $\hat t_{HT}$.
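The elephant example is easy to verify numerically (a minimal sketch in Python):

```python
weights = {1: 1.0, 2: 2.0, 3: 4.0}   # tons; true total t = 7
pi = {1: 0.05, 2: 0.90, 3: 0.05}
t = sum(weights.values())

ht  = {i: weights[i] / pi[i] for i in weights}   # HT estimate when s = {i}
alt = {i: 3 * weights[i] for i in weights}       # planned estimator 3 * y_i

def mean_and_mse(est):
    mean = sum(est[i] * pi[i] for i in est)
    mse = sum((est[i] - t) ** 2 * pi[i] for i in est)
    return round(mean, 2), round(mse, 2)

print(mean_and_mse(ht))    # (7.0, 295.44): unbiased but hopeless
print(mean_and_mse(alt))   # (6.15, 2.95): slightly biased, far better
```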
Variance estimate for the H-T estimator
Assume the size of the sample is determined in advance to be n.
An unbiased estimator of $\mathrm{Var}(\hat t_{HT})$, provided all joint inclusion probabilities $\pi_{ij} > 0$:
$\hat V(\hat t_{HT}) = \sum_{i\in s}\sum_{j\in s,\ j>i}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$
Approximate 95% CI, for large n, N − n:
$\hat t_{HT} \pm 1.96\sqrt{\hat V(\hat t_{HT})}$
• The variance estimate can always be computed, since necessarily $\pi_{ij} > 0$ for all i, j in the sample s
• But: if not all $\pi_{ij} > 0$, one should not use this estimate! It can give very incorrect estimates
• The variance estimate can be negative, but for most sampling designs it is always positive
A modified H-T estimator
Consider first estimating the population mean $\bar y = t/N$.
An obvious choice: $\hat{\bar y}_{HT} = \hat t_{HT}/N$.
Alternative: estimate N as well, whether N is known or not:
$\hat N = \sum_{i\in s}\frac{1}{\pi_i}$ (the HT estimator with $y_i \equiv 1$)
$E(\hat N) = E\left(\sum_{i=1}^N \frac{Z_i}{\pi_i}\right) = \sum_{i=1}^N \frac{\pi_i}{\pi_i} = N$
For SRS, $\pi_i = n/N \Rightarrow \hat N = \frac{N}{n}\cdot n = N$.
$\hat{\bar y}_w = \hat t_{HT}/\hat N = \frac{\sum_{i\in s} y_i/\pi_i}{\sum_{i\in s} 1/\pi_i}$ and $\hat t_w = N\hat{\bar y}_w$
Interestingly, $\hat t_w$ is often better than $\hat t_{HT}$, even though it is only approximately unbiased. It usually has smaller variance. So $\hat t_w$ is ordinarily the estimator to use, whether N is known or not. We note that it is a ratio estimator.
Illustration: $y_i = c$ for all $i = 1, \ldots, N$. Then
$\hat t_{HT} = c\sum_{i\in s} 1/\pi_i = c\hat N$, while $\hat t_w = Nc = t$, a better estimate if $\mathrm{Var}(\hat N) > 0$.
If the sample size varies, then the “ratio” estimator performs better than the H-T estimator; the ratio is more stable than the numerator.
Example: $y_i = c$ for $i = 1, \ldots, N$.
Sampling design = Bernoulli sampling: each unit in the population is selected with probability $\pi$, independently.
The $Z_i$'s are i.i.d. with $\pi_i = P(Z_i = 1) = \pi$.
n is a stochastic variable with a binomial $(N, \pi)$ distribution, and $E(n) = N\pi$.
$\hat t_{HT} = \frac{nc}{\pi}\qquad\left(E(\hat t_{HT}) = \frac{E(n)}{\pi}\,c = Nc = t\right)$
$\hat t_w = N\,\frac{nc/\pi}{n/\pi} = Nc = t$
The H-T estimator varies because n varies, while the modified H-T estimator is perfectly stable.
Review of Advantages of Probability
Sampling
• Objective basis for inference
• Permits unbiased or approximately unbiased
estimation
• Permits estimation of sampling errors of
estimators
– Use central limit theorem for confidence interval
– Can choose n to reduce SE or CV for estimator
Outstanding issues in design-based inference
• Estimation for subpopulations, domains
• Choice of sampling design –
– discuss several different sampling designs
– appropriate estimators
• More on use of auxiliary information to improve
estimates
• More on variance estimation
Estimation for domains
• Domain (subpopulation): a subset of the
population of interest
• Ex: Population = all adults aged 16-64
Examples of domains:
– Women
– Adults aged 35-39
– Men aged 25-29
– Women of a certain ethnic group
– Adults living in a certain city
• Partition population U into D disjoint domains
U1,…,Ud,..., UD of sizes N1,…,Nd,…,ND
Estimating domain means
Simple random sample from the population.
True domain mean: $\mu_d = \sum_{i\in U_d} y_i/N_d$
• e.g., proportion of divorced women with psychiatric problems.
Estimate $\mu_d$ by the sample mean from $U_d$: with $s_d$ = the part of the sample s in $U_d$,
$\bar y_{s_d} = \sum_{i\in s_d} y_i/n_d$, where $n_d = |s_d|$
Note: $n_d$ is a random variable.
The estimator is a ratio estimator: define
$u_i = y_i$ if $i\in U_d$, 0 otherwise
$x_i = 1$ if $i\in U_d$, 0 otherwise
$\mu_d = \sum_{i=1}^N u_i\big/\sum_{i=1}^N x_i = R$
$\bar y_{s_d} = \sum_{i\in s} u_i\big/\sum_{i\in s} x_i = \bar u_s/\bar x_s = \hat R$
$\bar y_{s_d}$ is approximately unbiased for large n. The ratio-estimator variance formula (with $\bar x_s = n_d/n$, $\bar X = N_d/N$) gives
$\hat V(\bar y_{s_d}) = \left(\frac{n}{n_d}\right)^2\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s}(u_i - \bar y_{s_d}x_i)^2$
Let $s_d^2$ be the sample variance in the domain:
$s_d^2 = \frac{1}{n_d - 1}\sum_{i\in s_d}(y_i - \bar y_{s_d})^2$
Since $u_i - \bar y_{s_d}x_i = y_i - \bar y_{s_d}$ for $i\in s_d$ and 0 otherwise:
$\hat V(\bar y_{s_d}) = \frac{n^2}{n_d^2}\cdot\frac{1-f}{n(n-1)}\,(n_d - 1)\,s_d^2 \approx (1-f)\frac{s_d^2}{n_d}$
$\mathrm{SE}(\bar y_{s_d}) = \sqrt{(1-f)\,s_d^2/n_d}$, with $f = n/N$
For large samples, $f_d = n_d/N_d \approx f$.
• Can then treat $s_d$ as an SRS from $U_d$
• Whatever the size of n is, conditional on $n_d$, $s_d$ is an SRS from $U_d$ – conditional inference
Example: Psychiatric Morbidity Survey 1993. Proportions with psychiatric problems:

Domain d          n_d     ȳ_sd    SE(ȳ_sd)
Women             4933    0.18    √(0.18·0.82/4932) = 0.005
Divorced women    314     0.29    √(0.29·0.71/313) = 0.026
Estimating domain totals
• $N_d$ known: use $N_d\,\bar y_{s_d}$
• $N_d$ unknown: must be estimated. Since $N_d$ is the x-total:
$\hat N_d = N\bar x_s = N\,n_d/n$
$\hat t_d = \hat N_d\,\bar y_{s_d} = N\,\frac{1}{n}\sum_{i\in s} u_i = N\bar u_s$
$\mathrm{SE}(\hat t_d) = N\sqrt{(1-f)\,s_u^2/n}$
Stratified sampling
• Basic idea: Partition the population U into H
subpopulations, called strata.
• Nh = size of stratum h, known
• Draw a separate sample from each stratum, sh of size nh
from stratum h, independently between the strata
• In social surveys: Stratify by geographic regions, age
groups, gender
• Ex – business survey. Canadian survey of employment. Establishments stratified by
o Standard Industrial Classification – 16 industry divisions
o Size – number of employees, 4 groups: 0-19, 20-49, 50-199, 200+
o Province – 12 provinces
Total number of strata: 16 × 4 × 12 = 768
Reasons for stratification
1. Strata form domains of interest for which separate estimates of given precision are required, e.g. strata = geographical regions
2. To “spread” the sample over the whole population. Easier to get a representative sample
3. To get more accurate estimates of population totals, reduce sampling variance
4. Can use different modes of data collection in different strata, e.g. telephone versus home interviews
Stratified simple random sampling
• The most common stratified sampling design
• SRS from each stratum
• Notation:
From stratum h: sample $s_h$ of size $n_h$
Total sample size: $n = \sum_{h=1}^H n_h$
Values from stratum h: $y_{hi},\ i = 1, \ldots, N_h$; sample: $(y_{hi} : i\in s_h)$
Sample mean: $\bar y_h = \sum_{i\in s_h} y_{hi}/n_h$
$t_h$ = y-total for stratum h: $t_h = \sum_{i=1}^{N_h} y_{hi}$
The population total: $t = \sum_{h=1}^H t_h$
Consider estimation of $t_h$: $\hat t_h = N_h\bar y_h$, assuming no auxiliary information in addition to the “stratifying variables”.
The stratified estimator of t:
$\hat t_{st} = \sum_{h=1}^H \hat t_h = \sum_{h=1}^H N_h\bar y_h$
To estimate the population mean t/N:
Stratified mean: $\bar y_{st} = \hat t_{st}/N = \sum_{h=1}^H \frac{N_h}{N}\,\bar y_h$
A weighted average of the sample stratum means.
• Properties of the stratified estimator follow from the properties of SRS estimators.
• Notation:
Mean in stratum h: $\mu_h = \sum_{i=1}^{N_h} y_{hi}/N_h$
Variance in stratum h: $\sigma_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h}(y_{hi} - \mu_h)^2$
$E(\hat t_{st}) = t$: $\hat t_{st}$ is unbiased.
$\mathrm{Var}(\hat t_{st}) = \sum_{h=1}^H \mathrm{Var}(\hat t_h) = \sum_{h=1}^H N_h^2\,\frac{\sigma_h^2}{n_h}(1 - f_h)$, with $f_h = n_h/N_h$
The estimated variance is obtained by estimating each stratum variance with the stratum sample variance
$s_h^2 = \frac{1}{n_h - 1}\sum_{i\in s_h}(y_{hi} - \bar y_h)^2$:
$\hat V(\hat t_{st}) = \sum_{h=1}^H N_h^2\,\frac{s_h^2}{n_h}(1 - f_h)$
Approximate 95% confidence interval, if n and N − n are large:
$\hat t_{st} \pm 1.96\sqrt{\hat V(\hat t_{st})}$
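A minimal sketch of the stratified estimator and its SE (Python with numpy; the function name is hypothetical):

```python
import numpy as np

def stratified_total(samples, N_h):
    """Stratified estimator of the total and its SE; SRS within each stratum.

    samples: one array of sampled y-values per stratum; N_h: stratum sizes."""
    t_hat = v_hat = 0.0
    for y, N in zip(samples, N_h):
        y = np.asarray(y, dtype=float)
        n = len(y)
        t_hat += N * y.mean()                            # N_h * ybar_h
        v_hat += N**2 * y.var(ddof=1) / n * (1 - n / N)  # N_h^2 s_h^2/n_h (1 - f_h)
    return t_hat, np.sqrt(v_hat)
```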
Estimating a population proportion in stratified simple random sampling
$p_h$: proportion in stratum h with a certain characteristic A
$\hat p_h = \bar y_h$, where $y_{hi} = 1$ if unit i in stratum h has characteristic A
p is the population mean: $p = t/N = \sum_{h=1}^H N_h p_h/N$
Stratified estimator:
$\hat p_{st} = \bar y_{st} = \sum_{h=1}^H (N_h/N)\,\hat p_h$
Stratified estimator of the total t = number of units in the population with characteristic A:
$\hat t_{st} = N\hat p_{st} = \sum_{h=1}^H N_h\hat p_h$
Estimated variance:
$\hat V(\hat p_h) = \frac{\hat p_h(1-\hat p_h)}{n_h - 1}\left(1 - \frac{n_h}{N_h}\right)$ (the SRS formula for a proportion derived earlier)
$\Rightarrow \hat V(\hat p_{st}) = \sum_{h=1}^H \hat V(W_h\hat p_h) = \sum_{h=1}^H W_h^2\,\frac{\hat p_h(1-\hat p_h)}{n_h - 1}\left(1 - \frac{n_h}{N_h}\right)$, where $W_h = N_h/N$,
and
$\hat V(\hat t_{st}) = \hat V\left(N\sum_h W_h\hat p_h\right) = N^2\sum_{h=1}^H W_h^2\,\frac{\hat p_h(1-\hat p_h)}{n_h - 1}\left(1 - \frac{n_h}{N_h}\right)$
Allocation of the sample units
• Important to determine the sizes of the stratum samples, given the total sample size n and the strata partitioning:
– how to allocate the sample units to the different strata
• Proportional allocation
– A representative sample should mirror the population
– Strata proportions: $W_h = N_h/N$
– The strata sample proportions should be the same: $n_h/n = W_h$
– Proportional allocation:
$n_h = n\,\frac{N_h}{N} \iff \frac{n_h}{N_h} = \frac{n}{N}$ for all h
The stratified estimator under proportional allocation
Inclusion probabilities: $\pi_{hi} = n_h/N_h = n/N$ – the same for all units in the population, but it is not an SRS.
$\hat t_{st} = \sum_{h=1}^H N_h\bar y_h = \sum_{h=1}^H N_h\,\frac{1}{n_h}\sum_{i\in s_h} y_{hi} = \frac{N}{n}\sum_{h=1}^H\sum_{i\in s_h} y_{hi} = N\bar y_s$
The stratified mean: $\bar y_{st} = \hat t_{st}/N = \bar y_s$
The equally weighted sample mean (the sample is self-weighting: every unit in the sample represents the same number of units in the population, N/n).
Variance and estimated variance under proportional allocation
$\mathrm{Var}(\hat t_{st}) = \sum_{h=1}^H N_h^2\,\frac{\sigma_h^2}{n_h}(1 - f_h) = N^2\,\frac{1-f}{n}\sum_{h=1}^H W_h\sigma_h^2$
$\hat V(\hat t_{st}) = N^2\,\frac{1-f}{n}\sum_{h=1}^H W_h s_h^2$, with $f = n/N$, $W_h = N_h/N$
• The estimator in a simple random sample: $\hat t_{SRS} = N\bar y_s$
• Under proportional allocation: $\hat t_{st} = \hat t_{SRS}$
• but the variances are different:
Under SRS: $\mathrm{Var}_{SRS}(\hat t_{SRS}) = N^2\,\frac{1-f}{n}\,\sigma^2$
Under proportional allocation: $\mathrm{Var}(\hat t_{st}) = N^2\,\frac{1-f}{n}\sum_{h=1}^H W_h\sigma_h^2$
Nh 1 Nh
Nh
Using the approximat ions

and
 1:
N 1
N
Nh 1
 2  h 1Wh h2  h 1Wh (  h   ) 2
H
H
Total variance = variance within strata + variance between strata
Implications:
1. No matter what the stratification scheme is:
Proportional allocation gives more accurate estimates of
population total than SRS
2. Choose strata with little variability, smaller strata variances. Then
the strata means will vary more and between variance becomes
larger and precision of estimates increases compared to SRS
3. This is also essentiall y true in general, as seen from
H
2
2 1 fh
ˆ
V (t st )  N h 1Wh
 h2
nh
97
Optimal allocation
If the only concern is to estimate the population total t:
• Choose $n_h$ such that the variance of the stratified estimator is minimized
• The solution depends on the unknown stratum variances
• If the stratum variances are approximately equal, proportional allocation minimizes the variance of the stratified estimator
Optimal allocation: $n_h = n\,\frac{N_h\sigma_h}{\sum_{k=1}^H N_k\sigma_k}$
Proof:
Minimize $\mathrm{Var}(\hat t_{st})$ with respect to the sample sizes $n_h$, subject to $n = \sum_{h=1}^H n_h$ being fixed.
Use the Lagrange multiplier method: minimize
$Q = \sum_{h=1}^H N_h^2\sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right) + \lambda\left(\sum_{h=1}^H n_h - n\right)$
$\frac{\partial Q}{\partial n_h} = 0 \Rightarrow -\frac{1}{n_h^2}\,N_h^2\sigma_h^2 + \lambda = 0 \Rightarrow n_h = N_h\sigma_h/\sqrt{\lambda}$
The result follows since the sample sizes must add up to n.
• Called Neyman allocation (Neyman, 1934)
• Should sample heavily in strata if
– The stratum accounts for a large part of the population
– The stratum variance is large
• If the stratum variances are equal, this is
proportional allocation
• Problem, of course: Stratum variances are unknown
– Take a small preliminary sample (pilot)
– The variance of the stratified estimator is not very
sensitive to deviations from the optimal allocation. Need
just crude approximations of the stratum variances
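A minimal sketch of Neyman allocation (Python; the two-stratum example is hypothetical). Since only crude approximations of the stratum variances are needed, rounding the allocation is harmless:

```python
def neyman_allocation(n, N_h, sigma_h):
    """Neyman allocation: n_h proportional to N_h * sigma_h."""
    w = [N * s for N, s in zip(N_h, sigma_h)]
    return [round(n * wh / sum(w)) for wh in w]

# Hypothetical strata: equal sizes, stratum 2 twice as variable
print(neyman_allocation(300, [1000, 1000], [1.0, 2.0]))   # [100, 200]
```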
Optimal allocation when considering the cost
of a survey
• C represents the total cost of the survey, fixed – our
budget
• c0 : overhead cost, like maintaining an office
• $c_h$: cost of taking an observation in stratum h
– Home interviews: traveling cost + interview
– Telephone or postal surveys: $c_h$ is the same for all strata
– In some strata: telephone, in others home interviews
$C = c_0 + \sum_{h=1}^H n_h c_h$
• Minimize the variance of the stratified estimator for a given total cost C
Minimize $\mathrm{Var}(\hat t_{st}) = N^2\sum_{h=1}^H W_h^2\sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right)$
subject to: $c_0 + \sum_{h=1}^H n_h c_h = C$
Solution:
$n_h \propto W_h\sigma_h/\sqrt{c_h} \Rightarrow n_h = \frac{W_h\sigma_h}{\sqrt{c_h}}\cdot\frac{C - c_0}{\sum_{k=1}^H W_k\sigma_k\sqrt{c_k}}$
Hence, for a fixed total cost C:
$n = \frac{(C - c_0)\sum_{h=1}^H N_h\sigma_h/\sqrt{c_h}}{\sum_{h=1}^H N_h\sigma_h\sqrt{c_h}}$
In particular, if $c_h = c$ for all h: $n = (C - c_0)/c$.
We can express the optimal sample sizes in relation to n:
$n_h = n\,\frac{W_h\sigma_h/\sqrt{c_h}}{\sum_{k=1}^H W_k\sigma_k/\sqrt{c_k}}$
1. Large samples in inexpensive strata
2. If the $c_h$'s are equal: Neyman allocation
3. If the $c_h$'s are equal and the $\sigma_h$'s are equal: proportional allocation
Other issues with optimal allocation
• Many survey variables
• Each variable leads to a different optimal solution
– Choose one or two key variables
– Use proportional allocation as a compromise
• If nh > Nh, let nh =Nh and use optimal allocation
for the remaining strata
• If $n_h = 1$, the variance cannot be estimated. Force $n_h = 2$, or collapse strata for variance estimation
• Number of strata: for a given n it is often best to increase the number of strata as much as possible. Depends on the available information
• Sometimes the main interest is in precision
of the estimates for stratum totals and less
interest in the precision of the estimate for
the population total
• Need to decide nh to achieve desired
accuracy for estimate of th, discussed earlier
– If we decide to do proportional allocation, it
can mean in small strata (small Nh) the sample
size nh must be increased
Poststratification
• Stratification reduces the uncertainty of the
estimator compared to SRS
• In many cases one wants to stratify according to
variables that are not known or used in sampling
• Can then stratify after the data have been collected
• Hence, the term poststratification
• The estimator is then the usual stratified estimator
according to the poststratification
• If we take a SRS and N-n and n are large, the
estimator behaves like the stratified estimator with
proportional allocation
Poststratification to reduce nonresponse bias
• Poststratification is mostly used to correct for
nonresponse
• Choose strata with different response rates
• Poststratification amounts to assuming that the
response sample in poststratum h is representative
for the nonresponse group in the sample from
poststratum h
Systematic sampling
• Idea: order the population and select every kth unit
• Procedure: U = {1, …, N} and N = nk + c, c < n
1. Select a random integer r between 1 and k, with equal probability
2. Select the sample $s_r$ by the systematic rule
$s_r = \{i : i = r + (j-1)k,\ j = 1, \ldots, n_r\}$
where the actual sample size $n_r$ takes the values [N/k] or [N/k] + 1
k: the sampling interval = [N/n]
• Very easy to implement: visit every 10th house, or interview every 50th name in the telephone book
• k distinct samples, each selected with probability 1/k:
$p(s) = 1/k$ if $s = s_r$ for some $r = 1, \ldots, k$; 0 otherwise
• Unlike in SRS, many subsets of U have zero probability
Examples:
1) N = 20, n = 4. Then k = 5 and c = 0. Suppose we select r = 1; the sample is {1, 6, 11, 16}. 5 possible distinct samples; in SRS: 4845 distinct samples.
2) N = 149, n = 12. Then k = 12, c = 5. Suppose r = 3: s3 = {3, 15, 27, 39, 51, 63, 75, 87, 99, 111, 123, 135, 147}, and the sample size is 13.
3) N = 20, n = 8. Then k = 2 and c = 4. The sample size is $n_r$ = 10.
4) N = 100,000, n = 1500. Then k = 66, c = 1000 and c/k = 15.15, with [c/k] = 15. $n_r$ = 1515 or 1516.
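A minimal sketch of the selection procedure (Python; covers the simple rule with k = [N/n]):

```python
import random

def systematic_sample(N, n):
    """1-in-k systematic sample from U = {1, ..., N}, with k = [N/n]."""
    k = N // n                    # sampling interval
    r = random.randint(1, k)      # random start in 1..k
    return list(range(r, N + 1, k))

print(systematic_sample(20, 4))   # e.g. [1, 6, 11, 16] when r = 1
```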
Estimation of the population total
$t(s) = \sum_{i\in s} y_i$, $n(s)$ = sample size
Two estimators (equal when N = nk):
1) $\hat t(s) = k\,t(s) = [N/n]\,t(s)$
2) $\tilde t(s) = N\bar y_s = N\,\frac{t(s)}{n(s)}$
These estimators are approximately the same: $n(s) = [N/k]$ or $[N/k] + 1$, so that $N/n(s) \approx k$.
$\hat t$ is unbiased:
$E(\hat t) = \sum_{r=1}^k \hat t(s_r)\,p(s_r) = \sum_{r=1}^k k\,t(s_r)\cdot\frac{1}{k} = \sum_{r=1}^k t(s_r) = t$
$\tilde t$ is only approximately unbiased (it's a ratio estimator), but it usually has slightly smaller variance than $\hat t$.
• Advantage of systematic sampling: it can be implemented even where no population frame exists
• E.g., sample every 10th person admitted to a hospital, or every 100th tourist arriving at LA airport.
$\mathrm{Var}(\hat t) = E(\hat t - t)^2 = \sum_{r=1}^k (\hat t(s_r) - t)^2 p(s_r) = \frac{1}{k}\sum_{r=1}^k (k\,t(s_r) - t)^2 = k\sum_{r=1}^k (t(s_r) - \bar t)^2$
where $\bar t = \sum_{r=1}^k t(s_r)/k$ is the average of the sample totals.
• The variance is small if $t(s_r)$ varies little, i.e., if the ”strata” {1,…,k}, {k+1,…,2k}, etc. are very homogeneous
• Or, equivalently, if the values within the possible samples $s_r$ are very different; the samples are heterogeneous
• Problem: the variance cannot be estimated properly, because we have only one observation of $t(s_r)$
Systematic sampling as Implicit Stratification
In practice: very often when using systematic sampling (a common design in national statistical institutes), the population is ordered such that the first k units constitute a homogeneous “stratum”, the second k units another “stratum”, etc.

Implicit stratum   Units
1                  1, 2, …, k
2                  k+1, …, 2k
⋮                  ⋮
n                  (n−1)k+1, …, nk      (n = N/k assumed)

Systematic sampling selects 1 unit from each stratum at random.
Systematic sampling vs SRS
• Systematic sampling is more efficient if the study variable is homogeneous within the implicit strata
– Ex: households ordered according to house numbers within neighbourhoods, and a study variable related to income
• Households in the same neighbourhood are usually homogeneous with respect to socio-economic variables
• If the population is in random order (all N! permutations are equally likely): systematic sampling is similar to SRS
• Systematic sampling can be very bad if y has periodic variation relative to k:
– Approximately: $y_1 = y_{k+1}$, $y_2 = y_{k+2}$, etc.
Variance estimation
• No direct estimate; impossible to obtain an unbiased estimate
• If the population is in random order: can use the variance estimate from SRS as an approximation
• Develop a conservative variance estimator by collapsing the “implicit strata”; this overestimates the variance
• The most promising approach may be: under a statistical model, estimate the expected value of the design variance
• Typically, systematic sampling is used in the second stage of two-stage sampling (to be discussed later); it may not be necessary to estimate this variance then
Cluster sampling and multistage sampling
• Sampling designs so far: direct sampling of the units in a single stage of sampling
• For economical and practical reasons it may be necessary to modify these sampling designs:
– There exists no population frame (register: a list of all units in the population), and it is impossible or very costly to produce such a register
– The population units are scattered over a wide area, and a direct sample will also be widely scattered. In the case of personal interviews, the traveling costs would be very high and it would not be possible to visit the whole sample
• Modified sampling can be done by
1. Selecting the sample indirectly in groups, called clusters: cluster sampling
– The population is grouped into clusters
– The sample is obtained by selecting a sample of clusters and observing all units within the clusters
– Ex: in Labor Force Surveys: clusters = households, units = persons
2. Selecting the sample in several stages: multistage sampling
3. In two-stage sampling:
• Population is grouped into primary sampling
units (PSU)
• Stage 1: A sample of PSUs
• Stage 2: For each PSU in the sample at stage
1, we take a sample of population units, now
also called secondary sampling units (SSU)
• Ex: PSUs are often geographical regions
Examples
1. Cluster sampling. We want a sample of high school students in a certain area, to investigate smoking and alcohol use. If a list of high school classes is available, we can select a sample of high school classes and give the questionnaire to every student in the selected classes: cluster sampling with high school classes as the clusters.
2. Two-stage cluster sampling. If a list of classes is not available, we can first select high schools, then classes, and finally all students in the selected classes. Then we have a 2-stage cluster sample:
1. PSU = high school
2. SSU = classes
3. Units = students
Psychiatric Morbidity Survey is a 4-stage sample
– Population: adults aged 16-64 living in private
households in Great Britain
– PSUs = postal sectors
– SSUs = addresses
– 3SUs = households
– Units = individuals
Sampling process:
1) 200 PSUs selected
2) 90 SSUs selected within each sampled PSU
(interviewer workload)
3) All households selected per SSU
4) 1 adult selected per household
Cluster sampling
Number of clusters in the population: N
Number of units in cluster i: $M_i$
Population size: $M = \sum_{i=1}^N M_i$
$s_I$ = sample of clusters, $n = |s_I|$
Final sample of units: s = all units in the clusters of $s_I$
Size of the final sample s: $m = \sum_{i\in s_I} M_i$, not fixed in advance
$t_i$ = y-total in cluster i, $t = \sum_{i=1}^N t_i$
Population mean for the y-variable: $\mu_y = t/M$
Simple random cluster sampling
Ratio-to-size estimator
Use auxiliary information: the sizes of the sampled clusters
$\hat t_R = M\,\frac{\sum_{i\in s_I} t_i}{\sum_{i\in s_I} M_i}$
Approximately unbiased, with approximate variance
$\mathrm{Var}(\hat t_R) \approx N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N M_i^2(\bar y_i - \mu)^2$
where $\bar y_i = t_i/M_i$, the cluster mean, and $\mu = t/M$.
estimated by
$\hat V(\hat t_R) = \left(\frac{M/N}{m/n}\right)^2 N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s_I} M_i^2(\bar y_i - \bar y_s)^2$
where $f = n/N$ and $\bar y_s = \sum_{i\in s_I} t_i\big/\sum_{i\in s_I} M_i$ is the usual sample mean.
Note that this ratio estimator is in fact the usual sample-mean-based estimator with respect to the y-variable: $\hat t_R = M\bar y_s$.
The corresponding estimator of the population mean of y is $\bar y_s$, which can be used also if M is unknown.
• The estimator's variance is highly influenced by how the clusters are constructed.
Choose clusters so as to make $\sum M_i^2(\bar y_i - \mu)^2$ small ⇒ make the clusters heterogeneous, such that most of the variation in the y-values lies within the clusters, making the $\bar y_i$ values similar.
• Note: the opposite of stratified sampling.
• Typically, clusters are formed by “nearby units” like households, schools, hospitals, for economical and practical reasons, with little variation within the clusters: simple random cluster sampling will then lead to much less precise estimates compared to SRS, but this is offset by big cost reductions. Sometimes SRS is not possible; information may be known only for clusters.
Design Effects
A design effect (deff) compares the efficiency of two design-estimation strategies (sampling design and estimator) for the same sample size.
Now: compare
Strategy 1: simple random cluster sampling with the ratio estimator
Strategy 2: SRS, of the same sample size m, with the usual sample mean estimator
In terms of estimating the population mean:
Strategy 1 estimator: $\hat t_R/M = \bar y_s$
Strategy 2 estimator: $\bar y_s$
The design effect of simple random cluster sampling, SCS, is then
$\mathrm{deff}(SCS, \bar y_s) = \mathrm{Var}_{SCS}(\bar y_s)/\mathrm{Var}_{SRS}(\bar y_s)$
Estimated deff: $\hat V_{SCS}(\bar y_s)/\hat V_{SRS}(\bar y_s)$
In the probation example:
$\hat V_{SRS}(\hat p) = [\hat p(1-\hat p)/(m-1)](1-f) \approx \hat p(1-\hat p)/(m-1) = 0.00387^2$
Estimated deff $= 0.0302^2/0.00387^2 = 60.9$
Conclusion: cluster sampling is much less efficient.
Note: we can estimate the f.p.c. factor $1 - m/M$ by letting $\hat M = N\cdot(m/n)$, so that $1 - m/\hat M = 1 - n/N = 16/26 = 0.615$:
estimated deff $= 60.9/0.615 = 99$!
Two-stage sampling
• Basic justification: with homogeneous clusters and a given budget, it is inefficient to survey all units in the cluster; one can instead select more clusters
• The population is partitioned into N primary sampling units (PSU)
• Stage 1: select a sample $s_I$ of PSUs
• Stage 2: for each selected PSU i in $s_I$: select a sample $s_i$ of units (secondary sampling units, SSU)
• The cluster totals $t_i$ must be estimated from the sample
n | sI | size of stage 1 sample of PSUs
mi | si |
Total sample size : m  is mi | s |
I
General two-stage sampling plan:
 Ii  P( PSU i  sI )
 j|i  P( SSU j  si | i  sI )
Inclusion probabilit y for unit (SSU) j in cluster i :
 ij   Ii   j|i
128
Horvitz-Thompson estimator of the total $t_i$, $i\in s_I$:
$\hat t_{i,HT} = \sum_{j\in s_i}\frac{y_{ij}}{\pi_{j|i}}$
where $y_{ij}$ = value of y for unit j in cluster i.
Suggested estimator of the population total t:
$\hat t = \sum_{i\in s_I}\frac{\hat t_{i,HT}}{\pi_{Ii}} = \sum_{i\in s_I}\frac{1}{\pi_{Ii}}\sum_{j\in s_i}\frac{y_{ij}}{\pi_{j|i}} = \sum_{i\in s_I}\sum_{j\in s_i}\frac{y_{ij}}{\pi_{ij}} = \hat t_{HT}$
An unbiased estimator.
$\mathrm{Var}(\hat t) = \mathrm{Var}\left(\sum_{i\in s_I}\frac{t_i}{\pi_{Ii}}\right) + \sum_{i=1}^N \frac{\mathrm{Var}(\hat t_{i,HT})}{\pi_{Ii}}$
1. The first component expresses the sampling uncertainty at stage 1, since we are selecting a sample of PSUs. It is the variance of the HT estimator with the $t_i$ as observations.
2. The second component is the stage-2 variance; it tells us how well we are able to estimate each $t_i$ in the whole population.
3. The second component is often negligible because of the small variability within the clusters.
A special case: clusters of equal size and SRS at stage 1 and stage 2
$M_i = M_0$, $i = 1, \ldots, N$; $m_i = m_0$: equal sample sizes at stage 2
$\pi_{Ii} = n/N$, $\pi_{j|i} = m_0/M_0 \Rightarrow \pi_{ij} = \frac{n}{N}\cdot\frac{m_0}{M_0} = \frac{m}{M}$
$\hat t = \frac{M}{m}\sum_{i\in s_I}\sum_{j\in s_i} y_{ij} = M\bar y_s$
Self-weighting sample: equal inclusion probabilities for all units in the population.
Unequal cluster sizes. PPS – SRS sampling
• In social surveys there are good reasons to have equal inclusion probabilities (a self-weighting sample) for all units in the population (similar representation of all domains)
• Stage 1: select PSUs with probability proportional to size $M_i$:
$\pi_{Ii} = n\,\frac{M_i}{M}$
• Stage 2: SRS (or systematic sample) of SSUs, with $\pi_{j|i} = m_i/M_i$, such that the sample is self-weighting:
$\pi_{ij} = \pi_{Ii}\,\pi_{j|i} = m/M$
$m_i = m/n$: equal sample sizes in all selected PSUs
$\hat t = M\bar y_s$
Remarks
• Usually one interviewer for each selected PSU
• First stage sampling is often stratified PPS
• With self-weighting PPS-SRS:
– equal workload for each interviewer
– Total sample size m is fixed
II. Likelihood in statistical inference
and survey sampling
• Problems with design-based inference
• Likelihood principle, conditionality principle and
sufficiency principle
• Fundamental equivalence
• Likelihood and likelihood principle in survey sampling
Traditional approach: design-based inference
• Population (target population): the universe of all units of interest for a certain study: U = {1, 2, …, N}
– All units can be identified and labeled
– Variable of interest y with population values $\mathbf{y} = (y_1, y_2, \ldots, y_N)$
– Typical problem: estimate the total t or the population mean t/N
• Sample: a subset s of the population, to be observed
• The sampling design p(s) is known for all possible subsets:
– the probability distribution of the stochastic sample
Problems with design-based inference
• Generally: Design-based inference is with respect to
hypothetical replications of sampling for a fixed population
vector y
• Variance estimates may fail to reflect information in a
given sample
• If we want to measure how a certain estimation method
does in quarterly or monthly surveys, then y will vary from
quarter to quarter or month to month
– need to assume that y is a realization of a random
vector
• Use: Likelihood and likelihood principle as guideline on
how to deal with these issues
Problem with design-based variance measure – Illustration 1
a) N + 1 possible samples: {1}, {2}, …, {N}, {1,2,…,N}
b) Sampling design: p({i}) = 1/(2N) for i = 1, …, N; p({1,2,…,N}) = 1/2
c) Use $\bar y_s$ as estimator for the population mean $\mu$.
Unbiased: $E(\bar y_s) = \sum_s p(s)\,\bar y_s = \sum_{i=1}^N \frac{1}{2N}\,y_i + \frac{1}{2}\,\mu = \mu$
Design variance:
$\mathrm{Var}(\bar y_s) = E(\bar y_s - \mu)^2 = \sum_{i=1}^N \frac{1}{2N}(y_i - \mu)^2 = \frac{1}{2}\,\tilde\sigma^2$, where $\tilde\sigma^2 = \frac{1}{N}\sum_{i=1}^N (y_i - \mu)^2$
d) Assume we select the “sample” {1,2,…,N}. Then we claim that the “precision” of the resulting estimate (known to be without error) is $\tilde\sigma^2/2$.
Problem with design-based variance measure – Illustration 2
a) Expert 1: SRS, estimate $\bar y_s$.
Precision is measured by $(1-f)\,\frac{\sigma^2}{n}$, with $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu)^2$, $f = n/N$
b) Expert 2: SRS with replacement, estimate $\bar y_s$; measures precision by $\tilde\sigma^2/n$.
Both experts select the same sample and compute the same estimate, but give different measures of precision…
The likelihood principle, LP – general model
Model: $X \sim f_\theta(x)$, $\theta\in\Theta$; $\theta$ are the unknown parameters in the model.
• The likelihood function, with data x: $l_x(\theta) = f_\theta(x)$
l is quite a different animal than f!! It measures the likelihood of different $\theta$ values in light of the data x.
• LP: the likelihood function contains all information about the unknown parameters
• More precisely: two proportional likelihood functions for $\theta$, from the same or different experiments, should give identically the same statistical inference
• Maximum likelihood estimation satisfies LP, using the
curvature of the likelihood as a measure of precision
(Fisher)
• LP is controversial, but hard to argue against because
of the fundamental result by Birnbaum, 1962:
• LP follows from sufficiency (SP) and conditionality
principles (CP) that ”no one” disagrees with.
• SP: Statistical inference should be based on sufficient
statistics
• CP: If you have 2 possible experiments and choose one
at random, the inference should depend only on the
chosen experiment
Illustration of CP
• A choice is to be made between a census and taking a sample of size 1, each with probability ½.
• The census is chosen.
• Unconditional approach:
$\pi_i = P(\text{census}) + P(\text{sample of size 1 and } i \text{ is selected}) = \frac{1}{2} + \frac{1}{2}\cdot\frac{1}{N} \approx \frac{1}{2}$
The Horvitz-Thompson estimator:
$\hat t_{HT} \approx 2\sum_U y_i = 2t$!
Conditional approach: $\pi_i = 1$, and the HT estimate is t.
LP, SP and CP
Model: $X \sim f_\theta(x)$, $\theta\in\Theta$.
An experiment is a triple $E = \{X, \theta, \{f_\theta : \theta\in\Theta\}\}$.
$I(E, x)$: the inference about $\theta$ in experiment E with observation x.
Likelihood principle:
Let $E_1 = \{X_1, \theta, \{f_\theta^1\}\}$ and $E_2 = \{X_2, \theta, \{f_\theta^2\}\}$. Assume
$l_{1,x_1}(\theta) = c\,l_{2,x_2}(\theta)$, with c independent of $\theta$ (i.e., $f_\theta^1(x_1) = c\,f_\theta^2(x_2)$).
Then: $I(E_1, x_1) = I(E_2, x_2)$
This includes the case where $E_1 = E_2$ and $x_1$ and $x_2$ are two different observations from the same experiment.
Sufficiency principle: let T be a sufficient statistic for $\theta$ in the experiment E. Assume $T(x_1) = T(x_2)$. Then $I(E, x_1) = I(E, x_2)$.
Conditionality principle:
Let $E_1 = \{X_1, \theta, \{f_\theta^1\}\}$ and $E_2 = \{X_2, \theta, \{f_\theta^2\}\}$.
Consider the mixture experiment $E^*$, where $E_1$ is chosen with probability 1/2 and $x_1$ is observed, or $E_2$ is chosen with probability 1/2 and $x_2$ is observed. The observation in $E^*$ is then the value of $X^* = (J, X_J)$, $J = 1, 2$:
$E^* = \{X^*, \theta, \{f_\theta^*\}\}$, where $f_\theta^*(j, x_j) = \frac{1}{2}\,f_\theta^j(x_j)$
CP: $I(E^*, (j, x_j)) = I(E_j, x_j)$
Theorem: CP and SP ⟺ LP
Proof: ⟸ exercise.
⟹ (for discrete variables, the important implication):
Given $E_1$ and $E_2$ and observations $x_1^0$, $x_2^0$ such that $f_\theta^1(x_1^0) = c\,f_\theta^2(x_2^0)$.
We shall show that $I(E_1, x_1^0) = I(E_2, x_2^0)$.
Consider the mixture experiment $E^*$. From CP:
$I(E_1, x_1^0) = I(E^*, (1, x_1^0))$ and $I(E_2, x_2^0) = I(E^*, (2, x_2^0))$
It remains to show that SP implies $I(E^*, (1, x_1^0)) = I(E^*, (2, x_2^0))$.
It is enough to find a sufficient T in $E^*$ with $T(1, x_1^0) = T(2, x_2^0)$.
Reduce $X^*$ as little as possible:
$T(1, x_1^0) = T(2, x_2^0) = (1, x_1^0)$, and $T(j, x_j) = (j, x_j)$ otherwise.
T is sufficient: let first $t \ne (1, x_1^0) = t^0$:
$P_\theta(X^* = (j, x_j) \mid T = t) = 1$ if $(j, x_j) = t$, and 0 otherwise; independent of $\theta$.
With $t^0 = (1, x_1^0)$: $P_\theta(X^* = (1, x_1^0) \mid T = t^0) + P_\theta(X^* = (2, x_2^0) \mid T = t^0) = 1$, and
$P_\theta(X^* = (1, x_1^0) \mid T = t^0) = \frac{P_\theta(X^* = (1, x_1^0))}{P_\theta(T = t^0)} = \frac{\frac{1}{2}f_\theta^1(x_1^0)}{\frac{1}{2}f_\theta^1(x_1^0) + \frac{1}{2}f_\theta^2(x_2^0)} = \frac{c\,f_\theta^2(x_2^0)}{c\,f_\theta^2(x_2^0) + f_\theta^2(x_2^0)} = \frac{c}{c+1}$
which is independent of $\theta$.
Consequences for statistical analysis
• Statistical analysis, given the observed data: The sample
space is irrelevant
• The usual criteria like confidence levels and P-values do
not necessarily measure reliability for the actual inference
given the observed data
• Frequentistic measures evaluate methods
– not necessarily relevant criteria for the observed data
Illustration – Bernoulli trials
$X_1, \ldots, X_i, \ldots$ with $X_i = 1$ (success) with probability $\theta$.
Two experiments to gain information about $\theta$:
$E_1$: n = 12 observations; observe $Y_1 = \sum_{i=1}^{12} X_i$
$E_2$: continue the trials until we get 3 failures (0's); observe $Y_2$ = number of successes
Suppose the results are $y_1 = y_2 = 9$.
The likelihood functions:
$l_9^{(1)}(\theta) = \binom{12}{9}\theta^9(1-\theta)^3$ (binomial)
$l_9^{(2)}(\theta) = \binom{11}{9}\theta^9(1-\theta)^3$ (negative binomial)
Proportional likelihoods:
$l_9^{(2)}(\theta) = (1/4)\,l_9^{(1)}(\theta)$
LP: inference about $\theta$ should be identical in the two cases.
Frequentist analyses give different results. E.g., test $H_0: \theta = 1/2$ against $H_1: \theta > 1/2$:
$(E_1, 9)$: P-value = 0.0730
$(E_2, 9)$: P-value = 0.0327
because of the different sample spaces: {0, 1, …, 12} and {0, 1, …}.
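Both P-values can be checked directly (a minimal sketch in Python):

```python
from math import comb

theta = 0.5   # H0: theta = 1/2

# E1: Y1 ~ Binomial(12, theta); P-value = P(Y1 >= 9)
p1 = sum(comb(12, y) * theta**y * (1 - theta)**(12 - y) for y in range(9, 13))

# E2: Y2 = number of successes before the 3rd failure (negative binomial);
# P(Y2 = y) = C(y+2, y) theta^y (1-theta)^3, so P-value = P(Y2 >= 9)
p2 = 1 - sum(comb(y + 2, y) * theta**y * (1 - theta)**3 for y in range(9))

print(round(p1, 4), round(p2, 4))   # 0.073 0.0327
```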
Frequentistic vs. likelihood
• Frequentistic approach: statistical methods are evaluated pre-experimentally, over the sample space
• LP evaluates statistical methods post-experimentally, given the data
• History and discussion after Birnbaum, 1962: an overview in ”Breakthroughs in Statistics, 1890-1989, Springer 1991”
Likelihood function in design-based inference
• Unknown parameter: y = (y_1, y_2, ..., y_N)
• Data: x = {(i, y_obs,i) : i ∈ s}
• Likelihood function = probability of the data, considered as a function of the parameters
• Let Ω_x = {y : y_i = y_obs,i for i ∈ s}
• Sampling design: p(s)
• Likelihood function:
  l_x(y) = p(s) if y ∈ Ω_x, and 0 otherwise
• All possible y are equally likely!!
151
• Likelihood principle, LP: the likelihood function contains all information about the unknown parameters
• According to LP:
  – The design model is such that the data contain no information about the unobserved part of y, y_unobs
  – One has to assume in advance that there is a relation between the data and y_unobs:
    • as a consequence of LP, it is necessary to assume a model
  – The sampling design is irrelevant for statistical inference, because two sampling designs leading to the same s will have proportional likelihoods
152
Let p_0 and p_1 be two sampling designs. Assume we get the same sample s in either case. Then the data x are the same and Ω_x is the same for both experiments.
The likelihood function for sampling design p_i, i = 0, 1:
l_{i,x}(y) = p_i(s) if y ∈ Ω_x, and 0 otherwise
⟹ l_{1,x}(y) / l_{0,x}(y) = p_1(s)/p_0(s) for y ∈ Ω_x
and then for all y:
l_{1,x}(y) = [p_1(s)/p_0(s)] · l_{0,x}(y)
153
• Same inference under the two different designs. This is in direct opposition to usual design-based inference, where the only stochastic evaluation is thru the sampling design, for example the Horvitz-Thompson estimator
• Concepts like design unbiasedness and design variance are irrelevant according to LP when it comes to doing the actual statistical analysis
• Note: LP is not concerned with method performance, but with the statistical analysis after the data have been observed
• This does not mean the sampling design is unimportant. It is important for ensuring a good, representative sample. But once the sample is collected, the sampling design should not play a role in the inference phase, according to LP
154
Model-based inference
• Assumes a model for the y vector
• Conditions on the actual sample
• Uses modeling to combine information
• Problem: dependence on the model
  – introduces a subjective element
  – almost impossible to model all variables in a survey
• The design approach is "objective" in a perfect world of no nonsampling errors
155
III. Model-based inference in survey
sampling
• Model-based approach, also called the prediction approach
  – Assumes a model for the y vector
  – Uses modeling to construct the estimator
  – Ex: ratio estimator
• Model-based inference
  – Inference is based on the assumed model
  – Treats the sample s as fixed, conditioning on the actual sample
• Best linear unbiased predictors
• Variance estimation for different variance measures
156
Model-based approach
y_1, y_2, ..., y_N are realized values of the random variables Y_1, Y_2, ..., Y_N
Two stochastic elements:
1) sample s ~ p(·)
2) (Y_1, Y_2, ..., Y_N) ~ f_θ
Treat the sample s as fixed
[Model-assisted approach: use the distribution assumption of Y to construct the estimator, and evaluate according to the distribution of s, given the realized vector y]
We can decompose the total t as follows:
t = ∑_{i=1}^N y_i = ∑_{i∈s} y_i + ∑_{i∉s} y_i
157
Since ∑_{i∈s} y_i is known, the problem is to estimate
z = ∑_{i∉s} y_i, the realized value of Z = ∑_{i∉s} Y_i
• The unobserved z is a realized value of the random variable Z, so the problem is actually to predict the value z of Z.
• This can be done by predicting each unobserved y_i: ŷ_i, i ∉ s
Estimator: t̂_pred = ∑_{i∈s} y_i + ∑_{i∉s} ŷ_i = ∑_{i∈s} y_i + ẑ
ẑ is a predictor for z
• The prediction approach, the prediction based estimator
Determine ŷi by modeling
158
Remarks:
1. Any estimator can be expressed on the "prediction form":
   t̂ = ∑_{i∈s} y_i + ẑ_t̂, letting ẑ_t̂ = t̂ − ∑_{i∈s} y_i
2. Can then use this form to see if the estimator
makes any sense
159
Ex 1. t̂ = N ȳ_s = ∑_{i∈s} y_i + (N − n) ȳ_s = ∑_{i∈s} y_i + ∑_{i∉s} ȳ_s
Hence, ẑ = ∑_{i∉s} ȳ_s and ŷ_i = ȳ_s for all i ∉ s

Ex 2. t̂_HT = ∑_{i∈s} y_i/π_i with π_i = n x_i/t_x, t_x = ∑_{i=1}^N x_i
(a reasonable sampling design when y and x are positively correlated)

t̂_HT = ∑_{i∈s} (t_x/(n x_i)) y_i
     = ∑_{i∈s} y_i + ∑_{i∈s} y_i (t_x − n x_i)/(n x_i)
     = ∑_{i∈s} y_i + β̂_HT ∑_{i∉s} x_i = ∑_{i∈s} y_i + ẑ_HT
where, using ∑_{i∉s} x_i = t_x − n x̄_s,
β̂_HT = [1/(t_x − n x̄_s)] ∑_{i∈s} y_i (t_x − n x_i)/(n x_i)
so that ẑ_HT = ∑_{i∉s} β̂_HT x_i = ∑_{i∉s} ŷ_i
β̂_HT is a rather unusual regression coefficient
160
Three common models
I. A model for business surveys, the ratio model:
• assume the existence of an auxiliary variable x, known for all units in the population:
Y_i = β x_i + ε_i
with E(ε_i) = 0, Var(ε_i) = σ² x_i and Cov(ε_i, ε_j) = 0
⟹ E(Y_i) = β x_i, Var(Y_i) = σ² x_i and Cov(Y_i, Y_j) = 0
161
II. A model for social surveys, simple linear regression:
Y_i = β_1 + β_2 x_i + ε_i, E(ε_i) = 0, Var(ε_i) = σ² and Cov(ε_i, ε_j) = 0
• Ex: x_i is a measure of the "size" of unit i, and y_i tends to increase with increasing x_i. In business surveys, the regression goes thru the origin in many cases.
III. Common mean model:
E(Y_i) = μ, Var(Y_i) = σ² and the Y_i's are uncorrelated
162
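To make the models concrete, the sketch below simulates a finite population from the ratio model (model I); N, β, σ and the x-distribution are illustrative assumptions, not values from the course.

```python
# Minimal sketch: simulate a finite population from the ratio model
#   Y_i = beta * x_i + eps_i,  E(eps_i) = 0,  Var(eps_i) = sigma^2 * x_i.
import numpy as np

rng = np.random.default_rng(1)

N, beta, sigma = 1000, 2.0, 1.5
x = rng.uniform(1.0, 10.0, size=N)           # auxiliary variable, known for all units
eps = rng.normal(0.0, sigma * np.sqrt(x))    # Var(eps_i) = sigma^2 * x_i
y = beta * x + eps                           # realized population values

t_y = y.sum()                                # the unknown total to be estimated
print(f"population total t = {t_y:.1f}")
```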
Model-based estimators (predictors)
1. Predictor: T̂ = ∑_{i∈s} Y_i + Ẑ
2. Model parameters: θ
3. T̂ is model-unbiased if E_θ(T̂ − T | s) = 0 for all θ, where T = ∑_{i=1}^N Y_i
4. The model variance of a model-unbiased predictor is the variance of the prediction error, also called the prediction variance:
   Var_θ(T̂ − T | s) = E_θ((T̂ − T)² | s)
5. From now on, we skip s in the notation: all expectations and variances are given the selected sample s, for example
   E(T̂ − T) = E(T̂ − T | s)
   Var(T̂ − T) = Var(T̂ − T | s)
163
Prediction variance as a variance measure for the
actual observed sample
Illustration 1, slide 137
N +1 possible samples: {1}, {2},…,{N}, {1,2,…N}
Use T̂ = N Ȳ_s as the estimator for the population total T.
Assume we select the "sample" {1, 2, ..., N}.
Then T̂ = N Ȳ = T
Prediction variance: Var(T̂ − T) = Var(0) = 0
Illustration 2, slide 138: Exactly the same prediction
variance for the two sampling designs
164
Linear predictor: T̂ = ∑_{i∈s} a_i(s) Y_i
6. Definition: T̂_0 is the best linear unbiased (BLU) predictor for T if
   1) T̂_0 is model-unbiased
   2) T̂_0 has uniformly minimum prediction variance among all model-unbiased linear predictors: for any model-unbiased linear predictor T̂,
      Var_θ(T̂_0 − T) ≤ Var_θ(T̂ − T) for all θ
165
Model:
Y_i = β x_i + ε_i, E(ε_i) = 0 and Var(ε_i) = σ² v(x_i)
Y_1, ..., Y_N are uncorrelated, Cov(ε_i, ε_j) = 0
Usually, v(x) = x^g, 0 ≤ g ≤ 2
Suggested predictor:
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} β̂_opt x_i
where β̂_opt is the best linear unbiased estimator (BLUE) of β:
β̂_opt = [∑_{i∈s} x_i Y_i / v(x_i)] / [∑_{i∈s} x_i² / v(x_i)]
166
β̂ = ∑_{i∈s} c_i(s) Y_i
E(β̂) = β ⟺ ∑_{i∈s} c_i(s) x_i · β = β ⟺ ∑_{i∈s} c_i(s) x_i = 1
Var(β̂) = σ² ∑_{i∈s} c_i² v(x_i)
Minimize ∑_{i∈s} c_i² v(x_i) subject to ∑_{i∈s} c_i x_i = 1, using the Lagrange method:
Q = ∑_{i∈s} c_i² v(x_i) − λ (∑_{i∈s} c_i x_i − 1)
∂Q/∂c_i = 2 c_i v(x_i) − λ x_i = 0
⟹ c_i = (λ/2) x_i / v(x_i)
167
Determine (λ/2) such that ∑_{i∈s} c_i x_i = 1:
(λ/2) ∑_{i∈s} x_i² / v(x_i) = 1
⟹ (λ/2) = 1 / ∑_{i∈s} x_i² / v(x_i)
and c_{i,opt} = [x_i / v(x_i)] / ∑_{j∈s} x_j² / v(x_j)
and β̂_opt = ∑_{i∈s} c_{i,opt} Y_i = [∑_{i∈s} x_i Y_i / v(x_i)] / [∑_{j∈s} x_j² / v(x_j)]
This is the weighted least squares estimate, based on the rescaled observations Y_i/√v(x_i)
168
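The resulting estimator is a one-line weighted least squares computation. A minimal sketch, assuming v(x) = x^g and illustrative sample data:

```python
# Sketch: BLUE of beta under Var(Y_i) = sigma^2 * v(x_i), with v(x) = x^g:
#   beta_opt = sum_s x_i Y_i / v(x_i)  /  sum_s x_i^2 / v(x_i)
import numpy as np

def beta_opt(y_s: np.ndarray, x_s: np.ndarray, g: float = 1.0) -> float:
    v = x_s ** g                                 # working variance function
    return np.sum(x_s * y_s / v) / np.sum(x_s ** 2 / v)

# Illustrative sample data (assumed, not from the course):
rng = np.random.default_rng(2)
x_s = rng.uniform(1.0, 10.0, size=50)
y_s = 2.0 * x_s + rng.normal(0.0, 1.5 * np.sqrt(x_s))

print(beta_opt(y_s, x_s, g=1.0))   # g = 1: the sample ratio sum(y)/sum(x)
print(beta_opt(y_s, x_s, g=2.0))   # g = 2: the mean of the ratios y_i/x_i
```

The two printed values preview the special cases v(x) = x and v(x) = x² treated below.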
• We shall show that T̂_pred is the best linear unbiased (BLU) predictor for T.
Let T̂ be a model-unbiased, linear predictor.
Let Ẑ = T̂ − ∑_{i∈s} Y_i and β̂ = Ẑ / ∑_{i∉s} x_i
⟹ T̂ = ∑_{i∈s} Y_i + β̂ ∑_{i∉s} x_i
T̂ linear predictor ⟹ β̂ is linear in (Y_i, i ∈ s)
and T̂ model-unbiased ⟺ E(β̂) = β
169
since E(T̂ − T) = E(β̂ ∑_{i∉s} x_i − ∑_{i∉s} Y_i)
  = E(β̂) ∑_{i∉s} x_i − β ∑_{i∉s} x_i = [E(β̂) − β] ∑_{i∉s} x_i
such that E(T̂ − T) = 0 ⟺ E(β̂) = β
The prediction variance of a model-unbiased predictor:
Var(T̂ − T) = Var(β̂ ∑_{i∉s} x_i − ∑_{i∉s} Y_i)
  = Var(β̂ ∑_{i∉s} x_i) + Var(∑_{i∉s} Y_i)
  = (∑_{i∉s} x_i)² Var(β̂) + σ² ∑_{i∉s} v(x_i)
Minimizing the prediction variance is therefore equivalent to minimizing Var(β̂), giving us T̂_pred as the BLU predictor.
170
The prediction variance of the BLU predictor:
Var(T̂_pred − T) = (∑_{i∉s} x_i)² Var(β̂_opt) + σ² ∑_{i∉s} v(x_i)
  = (∑_{i∉s} x_i)² σ² / [∑_{i∈s} x_i² / v(x_i)] + σ² ∑_{i∉s} v(x_i)
  = σ² [ (∑_{i∉s} x_i)² / (∑_{i∈s} x_i² / v(x_i)) + ∑_{i∉s} v(x_i) ]
A variance estimate is obtained by using the model-unbiased estimator for σ²:
σ̂² = [1/(n − 1)] ∑_{i∈s} (Y_i − β̂_opt x_i)² / v(x_i)
171
V̂(T̂_pred − T) = σ̂² [ (∑_{i∉s} x_i)² / (∑_{i∈s} x_i² / v(x_i)) + ∑_{i∉s} v(x_i) ]
The central limit theorem applies, such that for large n and N − n,
(T̂_pred − T) / √V̂(T̂_pred − T) is approximately N(0, 1)
Approximate 95% confidence interval for the value t of T:
t̂_pred ± 1.96 √V̂(T̂_pred − T)
Also called a 95% prediction interval for the random variable T
172
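Putting the pieces together: the sketch below computes the BLU predictor, the estimated prediction variance, and the approximate 95% prediction interval for a general variance function v. The population, sample size, and the choice v(x) = x are illustrative assumptions, and the helper function name is hypothetical.

```python
# Sketch: BLU predictor of the total, estimated prediction variance, and
# an approximate 95% prediction interval, for Var(Y_i) = sigma^2 * v(x_i).
import numpy as np

def blu_total(y_s, x_s, x_r, v):
    """y_s, x_s: sample values; x_r: x for non-sampled units; v: variance function."""
    n = len(y_s)
    vs = v(x_s)
    beta = np.sum(x_s * y_s / vs) / np.sum(x_s**2 / vs)     # BLUE of beta
    t_pred = y_s.sum() + beta * x_r.sum()                   # observed part + prediction
    sigma2 = np.sum((y_s - beta * x_s)**2 / vs) / (n - 1)   # model-unbiased sigma^2 est.
    var_pred = sigma2 * (x_r.sum()**2 / np.sum(x_s**2 / vs) + v(x_r).sum())
    return t_pred, var_pred

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=500)
y = 2.0 * x + rng.normal(0.0, 1.5 * np.sqrt(x))
s = rng.choice(500, size=60, replace=False)                 # SRS of n = 60
mask = np.zeros(500, dtype=bool); mask[s] = True

t_hat, v_hat = blu_total(y[mask], x[mask], x[~mask], v=lambda u: u)
half = 1.96 * np.sqrt(v_hat)
print(f"t_pred = {t_hat:.1f}, 95% PI: ({t_hat - half:.1f}, {t_hat + half:.1f})")
print(f"true total = {y.sum():.1f}")
```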
Three special cases: 1) v(x) = x, the ratio model; 2) v(x) = x²; and 3) x_i = 1 for all i, the common mean model

1. v(x) = x:
β̂_opt = [∑_{i∈s} x_i Y_i / v(x_i)] / [∑_{i∈s} x_i² / v(x_i)] = ∑_{i∈s} Y_i / ∑_{i∈s} x_i = R̂, the usual sample ratio
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} R̂ x_i = R̂ ∑_{i∈s} x_i + R̂ ∑_{i∉s} x_i = R̂ · t_x,
the usual ratio estimator
Var(T̂_pred − T) = σ² [ (∑_{i∉s} x_i)² / ∑_{i∈s} x_i + ∑_{i∉s} x_i ]
  = N² ((1 − f)/n) (x̄_r x̄ / x̄_s) σ²,
f = n/N, x̄_r = ∑_{i∉s} x_i / (N − n) and x̄ = ∑_{i=1}^N x_i / N
173
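The sketch below checks numerically, on illustrative data, that with v(x) = x the BLU predictor collapses to R̂·t_x, and that the closed-form variance above agrees with the general expression:

```python
# Sketch: with v(x) = x the BLU predictor equals the ratio estimator, and
# the closed-form variance equals the general formula.  Illustrative data.
import numpy as np

rng = np.random.default_rng(4)
N, n = 400, 40
x = rng.uniform(1.0, 10.0, size=N)
y = 2.0 * x + rng.normal(0.0, 1.5 * np.sqrt(x))
s = rng.choice(N, size=n, replace=False)
in_s = np.zeros(N, dtype=bool); in_s[s] = True

R_hat = y[in_s].sum() / x[in_s].sum()                  # sample ratio
t_pred = y[in_s].sum() + R_hat * x[~in_s].sum()        # BLU predictor, v(x) = x
print(np.isclose(t_pred, R_hat * x.sum()))             # True: equals R_hat * t_x

f, xbar, xbar_s, xbar_r = n / N, x.mean(), x[in_s].mean(), x[~in_s].mean()
sigma2 = 1.5**2
var_closed = N**2 * (1 - f) / n * (xbar_r * xbar / xbar_s) * sigma2
var_direct = sigma2 * (x[~in_s].sum()**2 / x[in_s].sum() + x[~in_s].sum())
print(np.isclose(var_closed, var_direct))              # True: same formula
```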
2. v(x) = x²:
β̂_opt = [∑_{i∈s} x_i Y_i / v(x_i)] / [∑_{i∈s} x_i² / v(x_i)] = (1/n) ∑_{i∈s} Y_i / x_i,
the sample mean of the ratios
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} β̂_opt x_i
  = ∑_{i∈s} Y_i + [(1/n) ∑_{i∈s} Y_i / x_i] ∑_{i∉s} x_i
Var(T̂_pred − T) = σ² [ (∑_{i∉s} x_i)² / (∑_{i∈s} x_i² / v(x_i)) + ∑_{i∉s} v(x_i) ]
  = σ² [ (∑_{i∉s} x_i)² / n + ∑_{i∉s} x_i² ]
174
This resembles the H-T estimator when π_i = n x_i / t_x:
Let R_i = Y_i / x_i and R̄_s = ∑_{i∈s} R_i / n
T̂_HT = ∑_{i∈s} t_x Y_i / (n x_i) = t_x · R̄_s   (also model-unbiased)
T̂_pred = ∑_{i∈s} Y_i + R̄_s ∑_{i∉s} x_i = t_x · R̄_s + ∑_{i∈s} (Y_i − R̄_s x_i)
When the sampling fraction f is small, or when the x_i values vary little, these two estimators are approximately the same. In the latter case:
R̄_s ≈ [1/(n x̄_s)] ∑_{i∈s} Y_i and ∑_{i∈s} R̄_s x_i ≈ ∑_{i∈s} Y_i
175
3. x_i = 1:
Model:
Y_i = μ + ε_i, E(ε_i) = 0 and Var(ε_i) = σ²
Y_1, ..., Y_N are uncorrelated, Cov(ε_i, ε_j) = 0
β̂_opt = [∑_{i∈s} x_i Y_i / v(x_i)] / [∑_{i∈s} x_i² / v(x_i)] = (1/n) ∑_{i∈s} Y_i = Ȳ_s, the sample mean
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} Ȳ_s = N · Ȳ_s
Var(N Ȳ_s − T) = σ² [ (∑_{i∉s} x_i)² / (∑_{i∈s} x_i² / v(x_i)) + ∑_{i∉s} v(x_i) ]
  = σ² [ (N − n)²/n + (N − n) ] = N² (1 − f) σ² / n
This is also the usual, design-based variance
formula under SRS
176
We see that the variance estimate is given by
N² (1 − f) σ̂² / n
σ̂² = [1/(n − 1)] ∑_{i∈s} (y_i − ȳ_s)², the sample variance
Exactly the same as in the design-approach, but the
interpretation is different
177
Simple linear regression model
Y_i = β_1 + β_2 x_i + ε_i, E(ε_i) = 0, Var(ε_i) = σ²
Y_1, ..., Y_N are uncorrelated
BLU predictor:
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} (β̂_1 + β̂_2 x_i)
where β̂_1 and β̂_2 are the LS estimators:
β̂_2 = [∑_{i∈s} (x_i − x̄_s)(Y_i − Ȳ_s)] / [∑_{i∈s} (x_i − x̄_s)²] = [∑_{i∈s} (x_i − x̄_s) Y_i] / [∑_{i∈s} (x_i − x̄_s)²]
β̂_1 = Ȳ_s − β̂_2 x̄_s
178
T̂_pred = ∑_{i∈s} Y_i + ∑_{i∉s} (β̂_1 + β̂_2 x_i)
  = n Ȳ_s + (N − n) Ȳ_s + β̂_2 (∑_{i∉s} x_i − (N − n) x̄_s)
  = N Ȳ_s + β̂_2 (t_x − N x̄_s)
⟹ T̂_pred = N [Ȳ_s + β̂_2 (x̄ − x̄_s)]
Clearly, T̂_pred is model-unbiased:
E(T) = ∑_{i=1}^N (β_1 + β_2 x_i) = N (β_1 + β_2 x̄)
and
E(T̂_pred) = N { (1/n) ∑_{i∈s} (β_1 + β_2 x_i) + β_2 (x̄ − x̄_s) } = N (β_1 + β_2 x̄)
179
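A minimal sketch of this predictor on simulated data (β_1, β_2, σ and the population are illustrative assumptions):

```python
# Sketch: BLU predictor under the simple linear regression model,
#   T_pred = N * (Ybar_s + beta2_hat * (xbar - xbar_s)).
import numpy as np

def regression_predictor(y_s, x_s, x_all):
    N = len(x_all)
    b2 = np.sum((x_s - x_s.mean()) * y_s) / np.sum((x_s - x_s.mean())**2)  # LS slope
    return N * (y_s.mean() + b2 * (x_all.mean() - x_s.mean()))

rng = np.random.default_rng(5)
N, n = 500, 50
x = rng.uniform(0.0, 10.0, size=N)
y = 3.0 + 2.0 * x + rng.normal(0.0, 2.0, size=N)       # beta1 = 3, beta2 = 2 (assumed)
s = rng.choice(N, size=n, replace=False)

print(f"T_pred = {regression_predictor(y[s], x[s], x):.1f}, true T = {y.sum():.1f}")
```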
We shall now show that this predictor is BLU.
Assume first that x̄ ≠ x̄_s. Let T̂ be a linear, model-unbiased predictor, and let b = (T̂/N − Ȳ_s) / (x̄ − x̄_s):
(1/N) T̂ = Ȳ_s + b (x̄ − x̄_s) ⟺ T̂ = N [Ȳ_s + b (x̄ − x̄_s)]
Hence, any predictor can be expressed on this form, and the predictor is linear if and only if b is linear in the Y_i's.
Also, T̂ is model-unbiased ⟺ E(b) = β_2:
E(T̂) = E(T) = N (β_1 + β_2 x̄)
⟺ N [β_1 + β_2 x̄_s + (x̄ − x̄_s) E(b)] = N (β_1 + β_2 x̄)
⟺ (x̄ − x̄_s) E(b) = β_2 x̄ − β_2 x̄_s = β_2 (x̄ − x̄_s)
180
Prediction variance:
Var(T̂ − T) = Var[(N − n) Ȳ_s + N b (x̄ − x̄_s)] + σ² (N − n)
b = ∑_{i∈s} c_i(s) Y_i, an unbiased estimator of β_2:
E(b) = β_2 ⟺ ∑_{i∈s} c_i (β_1 + β_2 x_i) = β_2 ⟺ β_1 ∑_{i∈s} c_i + β_2 ∑_{i∈s} c_i x_i = β_2
E(b) = β_2 ⟺ (1) ∑_{i∈s} c_i = 0 and (2) ∑_{i∈s} c_i x_i = 1
So we need to minimize the prediction variance with
respect to the ci’s under (1) and (2)
181
i.e. minimize
Var[(N − n) Ȳ_s + N b (x̄ − x̄_s)] = Var[ ∑_{i∈s} ((N − n)/n + N (x̄ − x̄_s) c_i) Y_i ]
  = σ² ∑_{i∈s} ((N − n)/n + N (x̄ − x̄_s) c_i)²
  = σ² [ N² (x̄ − x̄_s)² ∑_{i∈s} c_i² + 2 ((N − n)/n) N (x̄ − x̄_s) ∑_{i∈s} c_i + (N − n)²/n ]
Since ∑_{i∈s} c_i = 0, it is enough to minimize ∑_{i∈s} c_i² under conditions (1) and (2)
182
Q = ∑_{i∈s} c_i² − 2λ_1 (∑_{i∈s} c_i) − 2λ_2 (∑_{i∈s} c_i x_i − 1)
∂Q/∂c_i = 2c_i − 2λ_1 − 2λ_2 x_i = 0 ⟹ c_i = λ_1 + λ_2 x_i
(1): ∑_{i∈s} c_i = 0 ⟹ λ_1 + λ_2 x̄_s = 0 ⟹ λ_1 = −λ_2 x̄_s
(2): ∑_{i∈s} c_i x_i = 1 ⟹ λ_1 n x̄_s + λ_2 ∑_{i∈s} x_i² = 1
from (1) and (2): λ_2 ∑_{i∈s} x_i² − λ_2 n x̄_s² = 1
⟹ λ_2 = 1 / ∑_{i∈s} (x_i − x̄_s)²
183
xi  xs
ci  1  2 xi  2 ( xi  xs ) 
2
(
x

x
)
 js j s
( xi  xs )Yi
xi  xs

is
ˆ
and b  is Yi



2
2
2
(
x

x
)
(
x

x
)
 js j s  js j s
The prediction variance is given by
2
2


n
(
x

x
)
N
n
2
s
ˆ
Var (T pred  T ) 
 (1  ) 
2
n
N
is ( xi  xs ) 

and variance estimate is obtained by estimating  2 with
2
1
ˆ
ˆ 
(Yi  Ys   2 ( xi  xs ))

is
n2
2
184
So far, x̄ ≠ x̄_s. What if x̄ = x̄_s?
Then T̂_pred = N Ȳ_s, and it is the BLU predictor:
For any linear predictor T̂ = ∑_{i∈s} a_i Y_i,
Var(T̂ − T) = Var[∑_{i∈s} (a_i − 1) Y_i] + σ² (N − n) = σ² [ ∑_{i∈s} (a_i − 1)² + (N − n) ]
Let T̂_ā = ā ∑_{i∈s} Y_i, with ā = ∑_{i∈s} a_i / n:
Var(T̂_ā − T) = σ² [ ∑_{i∈s} (ā − 1)² + (N − n) ] = σ² [ n (ā − 1)² + (N − n) ]
185
Since ∑_{i∈s} (a_i − 1)² ≥ n (ā − 1)²,
⟹ Var(T̂ − T) ≥ Var(T̂_ā − T)
and T̂ model-unbiased ⟹ ∑_{i∈s} a_i = N and ā = N/n:
T̂_ā = N Ȳ_s = T̂_pred.
186
Anticipated variance (method variance)
We want a variance measure that tells us about the
expected uncertainty in repeated surveys
1. Conditional on the sample s, with model-unbiased T̂: Var(T̂ − T | s) measures the uncertainty for this particular sample s
2. The expected uncertainty over repeated surveys: E_p{Var(T̂ − T | s)}, over the sampling distribution p(·)
3. This is called the anticipated variance.
4. It can be regarded as a variance measure that describes how the estimation method performs in repeated surveys
187
If T̂ is not model-unbiased, we use
E_p{E(T̂ − T)²}
as a criterion for uncertainty, the anticipated mean square error.
Note: if T̂ is design-unbiased, then
E_p{E(T̂ − T)²} = E{E_p((T̂ − T)² | Y)}
and
E_p((T̂ − T)² | Y = y) = E_p(t̂ − t)² = Var_p(t̂)
The anticipated MSE then becomes the expected design variance, also called the anticipated design variance:
E_p{E(T̂ − T)²} = E{Var_p(T̂)}
188
Example: Simple linear regression and simple
random sample
If the sample mean N Ȳ_s is used: it is not model-unbiased, but it is design-unbiased:
E_p{E(N Ȳ_s − T)²} = E{Var_p(N Ȳ_s)} = N² ((1 − f)/n) E{ [1/(N − 1)] ∑_{i=1}^N (Y_i − Ȳ)² }
  = N² ((1 − f)/n) { σ² + [1/(N − 1)] ∑_{i=1}^N (μ_i − μ̄)² }
μ_i = E(Y_i) = β_1 + β_2 x_i, μ̄ = β_1 + β_2 x̄
E{Var_p(N Ȳ_s)} = N² ((1 − f)/n) { σ² + β_2² S_x² }
S_x² = [1/(N − 1)] ∑_{i=1}^N (x_i − x̄)²
189
Let us now study the BLU predictor. (It can be shown that it is approximately design-unbiased.)
Var(T̂_pred − T) = (N² σ²/n) [ (1 − n/N) + n (x̄ − x̄_s)² / ∑_{i∈s} (x_i − x̄_s)² ]
⟹ E_p{Var(T̂_pred − T)} = (N² σ²/n) [ (1 − n/N) + E_p{ n (x̄ − x̄_s)² / ∑_{i∈s} (x_i − x̄_s)² } ]
  ≈ (N² σ²/n) [ (1 − f) + E_p{n (x̄ − x̄_s)²} / E_p{∑_{i∈s} (x_i − x̄_s)²} ]
E_p{n (x̄_s − x̄)²} = n Var_p(x̄_s) = (1 − f) S_x²
E_p{∑_{i∈s} (x_i − x̄_s)²} = (n − 1) S_x²
190
E_p{Var(T̂_pred − T)} ≈ (N² σ²/n) [ (1 − f) + (1 − f)/(n − 1) ]
  = N² (1 − f) σ² / (n − 1) ≈ N² (1 − f) σ² / n
compared to
E{Var_p(N Ȳ_s)} = N² ((1 − f)/n) { σ² + β_2² S_x² }
T̂_pred eliminates the term β_2² S_x²
and is much more efficient than N Ȳ_s
191
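The efficiency gain can be checked by simulation: draw a new realization of Y and a new SRS in each replicate, so that the Monte Carlo variance of the error estimates the anticipated variance. All population parameters below are illustrative assumptions.

```python
# Sketch: Monte Carlo check that the regression predictor removes the
# beta2^2 * Sx^2 term from the anticipated variance of N*Ybar_s under SRS.
import numpy as np

rng = np.random.default_rng(6)
N, n, b1, b2, sigma = 1000, 50, 3.0, 2.0, 2.0
x = rng.uniform(0.0, 10.0, size=N)

err_mean, err_reg = [], []
for _ in range(5000):
    y = b1 + b2 * x + rng.normal(0.0, sigma, size=N)   # new realization of Y
    s = rng.choice(N, size=n, replace=False)           # new SRS
    T = y.sum()
    err_mean.append(N * y[s].mean() - T)
    slope = np.sum((x[s] - x[s].mean()) * y[s]) / np.sum((x[s] - x[s].mean())**2)
    err_reg.append(N * (y[s].mean() + slope * (x.mean() - x[s].mean())) - T)

f, Sx2 = n / N, x.var(ddof=1)
print(f"MC anticipated var, N*Ybar_s: {np.var(err_mean):.3e}")
print(f"theory, N^2(1-f)/n (s^2 + b2^2 Sx^2): {N**2*(1-f)/n*(sigma**2+b2**2*Sx2):.3e}")
print(f"MC anticipated var, T_pred  : {np.var(err_reg):.3e}")
print(f"approx. theory, N^2(1-f)/n s^2      : {N**2*(1-f)/n*sigma**2:.3e}")
```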
Remarks
• From a design-based viewpoint, the sample-mean-based estimator is unbiased, while the linear regression estimator is not
• Considering only the design bias, we might choose the sample-mean-based estimator
• The linear regression estimator would only be selected over the sample-mean-based estimator because it has smaller anticipated variance
• Hence, it is difficult to justify design-unbiasedness as a criterion for choosing among estimators
192
Robust variance estimation
• The model assumed is really a "working model"
• In particular, the variance assumption may be misspecified, and it is not always easy to detect this kind of model failure
  – like constant variance
  – or variance proportional to the size measure x_i
• Standard least squares variance estimates are sensitive to misspecification of the variance assumption
• We are therefore concerned with robust variance estimators
193
Variance estimation for the ratio estimator
Working model:
Y_i = β x_i + ε_i, E(ε_i) = 0 and Var(ε_i) = σ² x_i
Y_1, ..., Y_N are uncorrelated, Cov(ε_i, ε_j) = 0
Under this working model, the unbiased estimator of the prediction variance of the ratio estimator is
V̂_R(R̂ t_x − T) = N² ((1 − f)/n) (x̄_r x̄ / x̄_s) σ̂²
σ̂² = [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)² / x_i
R̂ = Ȳ_s / x̄_s
194
This variance estimator is non-robust to misspecification of the variance model.
Suppose the true model has
E(Y_i) = β x_i and Var(Y_i) = σ² v(x_i)
The ratio estimator is still model-unbiased, but the prediction variance is now
Var(R̂ t_x − T) = (∑_{i∉s} x_i)² Var(R̂) + σ² ∑_{i∉s} v(x_i)
  = (∑_{i∉s} x_i)² σ² ∑_{i∈s} v(x_i) / (∑_{i∈s} x_i)² + σ² ∑_{i∉s} v(x_i)
  = σ² [ ((N − n)² x̄_r² / (n² x̄_s²)) ∑_{i∈s} v(x_i) + ∑_{i∉s} v(x_i) ]
195
Var(R̂ t_x − T) = σ² [ ((N − n)² x̄_r² / (n x̄_s²)) v̄_s + (N − n) v̄_r ]
  = N² ((1 − f)/n) σ² [ (1 − f) v̄_s (x̄_r/x̄_s)² + f · v̄_r ]
where v̄_s = ∑_{i∈s} v(x_i)/n and v̄_r = ∑_{i∉s} v(x_i)/(N − n)
Moreover, E(σ̂²) ≠ σ²:
E(σ̂²) = [1/(n − 1)] ∑_{i∈s} E(Y_i − R̂ x_i)² / x_i
  = σ² [ (v/x)_s + (1/(n − 1)) { (v/x)_s − v̄_s/x̄_s } ],
where (v/x)_s = (1/n) ∑_{i∈s} v(x_i)/x_i
196
Robust variance estimator for the ratio
estimator
Var(R̂ t_x − T) = σ² N² ((1 − f)/n) [ (1 − f) v̄_s (x̄_r/x̄_s)² + f · v̄_r ]
  = σ² N² ((1 − f)/n) [ v̄_s (x̄_r/x̄_s)² + f · { v̄_r − v̄_s (x̄_r/x̄_s)² } ]
  ≈ σ² v̄_s · N² ((1 − f)/n) (x̄_r/x̄_s)²,
the leading term in the prediction variance
and: σ² v̄_s = (1/n) ∑_{i∈s} σ² v(x_i) = (1/n) ∑_{i∈s} Var(Y_i)
⟹ σ² v̄_s = (1/n) ∑_{i∈s} E(Y_i − β x_i)² = E{ (1/n) ∑_{i∈s} (Y_i − β x_i)² }
197
This suggests that we may use
σ̂²_rob · v̄_s = [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)²
leading to the robust variance estimator
V̂_rob(R̂ t_x − T) = (x̄_r/x̄_s)² N² ((1 − f)/n) [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)²
Almost the same as the design variance estimator in SRS:
V̂_SRS(R̂ t_x) = (x̄/x̄_s)² N² ((1 − f)/n) [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)²
198
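A sketch contrasting the working-model and robust variance estimators on data where the working model v(x) = x is deliberately wrong (the true variance is proportional to x²); all numbers are illustrative assumptions.

```python
# Sketch: working-model vs. robust variance estimators for the ratio
# estimator, following the formulas above.
import numpy as np

rng = np.random.default_rng(7)
N, n, sigma = 800, 60, 0.5
x = rng.uniform(1.0, 10.0, size=N)
y = 2.0 * x + rng.normal(0.0, sigma * x)      # true Var(Y_i) = sigma^2 * x_i^2
s = rng.choice(N, size=n, replace=False)
in_s = np.zeros(N, dtype=bool); in_s[s] = True

f = n / N
xbar, xbar_s, xbar_r = x.mean(), x[in_s].mean(), x[~in_s].mean()
R = y[in_s].mean() / xbar_s                   # R_hat = Ybar_s / xbar_s
res = y[in_s] - R * x[in_s]

# Working-model estimator (unbiased only if Var(Y_i) = sigma^2 * x_i):
sig2_w = np.sum(res**2 / x[in_s]) / (n - 1)
V_work = N**2 * (1 - f) / n * (xbar_r * xbar / xbar_s) * sig2_w

# Robust estimator (targets the leading term whatever v(x) is):
V_rob = (xbar_r / xbar_s)**2 * N**2 * (1 - f) / n * np.sum(res**2) / (n - 1)

print(f"working-model variance estimate: {V_work:.3e}")
print(f"robust variance estimate       : {V_rob:.3e}")
```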
E{ V̂_rob(R̂ t_x − T) } ≈ (x̄_r/x̄_s)² N² ((1 − f)/n) σ² v̄_s ≈ Var(R̂ t_x − T)
Can we do better?
Require the estimator to be exactly unbiased under the ratio model, v(x) = x:
When v(x) = x:
E{ [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)² } = [1/(n − 1)] ∑_{i∈s} E(Y_i − R̂ x_i)²
  = [σ²/(n − 1)] ∑_{i∈s} x_i (1 − x_i/(n x̄_s))
  = σ² x̄_s [ 1 − (1/n)(s_x²/x̄_s²) ],
with s_x² = [1/(n − 1)] ∑_{i∈s} (x_i − x̄_s)²
199
The prediction variance when v(x) = x:
V(R̂ t_x − T) = N² ((1 − f)/n) (x̄_r x̄ / x̄_s) σ²
while
E{ V̂_rob(R̂ t_x − T) } = N² ((1 − f)/n) (x̄_r/x̄_s)² x̄_s [ 1 − (1/n)(s_x²/x̄_s²) ] σ²
So a robust variance estimator that is exactly unbiased under the working model, v(x) = x:
V̂_{R,rob}(R̂ t_x − T) = (x̄/x̄_r) [ 1 − (1/n)(s_x²/x̄_s²) ]⁻¹ V̂_rob(R̂ t_x − T)
  = [ 1 − (1/n)(s_x²/x̄_s²) ]⁻¹ (x̄_r x̄ / x̄_s²) N² ((1 − f)/n) [1/(n − 1)] ∑_{i∈s} (Y_i − R̂ x_i)²
  = [ 1 − (1/n)(s_x²/x̄_s²) ]⁻¹ (x̄_r/x̄) V̂_SRS(R̂ t_x)
200
General approach to robust variance
estimation
1. Find robust estimators of Var(Y_i) that do not depend on model assumptions about the variance.
2. For T̂ = ∑_{i∈s} w_{is} Y_i :
   Var(T̂ − T) = ∑_{i∈s} (w_{is} − 1)² Var(Y_i) + ∑_{i∉s} Var(Y_i)
3. For i ∈ s: V̂(Y_i) = (Y_i − μ̂_i)², where μ̂_i estimates E(Y_i) under the true model.
4. Estimate only the leading (typically dominating) term in the prediction variance, or estimate the second term from the more general model.
201
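A minimal sketch of this recipe, with ratio-type weights w_i = t_x/(n x̄_s) and fitted means μ̂_i = R̂ x_i as illustrative (assumed) choices:

```python
# Sketch: robust estimate of the leading term sum_s (w_i - 1)^2 Var(Y_i),
# plugging in (Y_i - mu_hat_i)^2 for Var(Y_i).
import numpy as np

def robust_leading_term(y_s, w_s, mu_hat_s):
    """Estimate sum_s (w_i - 1)^2 Var(Y_i) by sum_s (w_i - 1)^2 (Y_i - mu_hat_i)^2."""
    return np.sum((w_s - 1.0) ** 2 * (y_s - mu_hat_s) ** 2)

rng = np.random.default_rng(8)
N, n = 800, 60
x = rng.uniform(1.0, 10.0, size=N)
y = 2.0 * x + rng.normal(0.0, 0.5 * x)       # true Var(Y_i) proportional to x_i^2
s = rng.choice(N, size=n, replace=False)

R = y[s].sum() / x[s].sum()                  # sample ratio
w = np.full(n, x.sum() / x[s].sum())         # ratio estimator: R*t_x = sum_s w_i y_i
mu_hat = R * x[s]                            # fitted means under the ratio model
print(f"robust leading-term estimate: {robust_leading_term(y[s], w, mu_hat):.3e}")
```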
• Reference to robust variance estimation:
• Valliant, Dorfman and Royall (2000):
Finite Population Sampling and Inference.
A Prediction Approach, ch. 5
202
Model-assisted approach
• Design-based approach
• Use modeling to improve the basic HT estimator. Assume the population values y are realized values of random Y
• Assume the existence of auxiliary variables, known for all units in the population
• Basic idea:
Suppose ŷ_i = β̂ x_i is a regression-based "estimate" of y_i for each unit in the population (here x_i is known for the whole population). Then
t = ∑_{i=1}^N ŷ_i + ∑_{i=1}^N (y_i − ŷ_i)
and e = ∑_{i=1}^N e_i, where e_i = y_i − ŷ_i, is much easier to estimate; it can be estimated by the HT estimator
203
ê_HT = ∑_{i∈s} e_i / π_i
Final estimator, the regression estimator:
t̂_reg = ∑_{i=1}^N β̂ x_i + ê_HT
Alternative expression:
t̂_reg = ∑_{i∈s} y_i/π_i + β̂ (t_x − ∑_{i∈s} x_i/π_i), with t_x = ∑_{i=1}^N x_i
t̂_reg = t̂_{y,HT} + β̂ (t_x − t̂_{x,HT})
204
Simple random sample
t̂_reg = N ȳ_s + β̂ (t_x − N x̄_s)
Model: the Y_i's are independent and
Y_i = β x_i + ε_i, E(ε_i) = 0 and Var(ε_i) = σ² x_i
⟹ best linear unbiased estimator: β̂ = ȳ_s / x̄_s
⟹ t̂_reg = N ȳ_s + (ȳ_s/x̄_s)(t_x − N x̄_s) = (ȳ_s/x̄_s) t_x, the ratio estimator
205
In general with this “ratio model”, in order to get
approximately design-unbiased estimators:
We can regard the β-estimate as an estimate of ∑_{i=1}^N y_i / ∑_{i=1}^N x_i
The numerator is estimated by t̂_{y,HT} = ∑_{i∈s} y_i/π_i
The denominator is estimated by t̂_{x,HT} = ∑_{i∈s} x_i/π_i
⟹ use β̂ = t̂_{y,HT} / t̂_{x,HT} = [∑_{i∈s} y_i/π_i] / [∑_{i∈s} x_i/π_i]
⟹ t̂_reg = β̂ t_x = ∑_{i=1}^N ŷ_i, where ŷ_i = β̂ x_i
206
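A sketch of this estimator with π_i ∝ x_i; the Poisson-sampling draw and all population parameters are illustrative assumptions that keep the example self-contained.

```python
# Sketch: model-assisted (GREG) ratio estimator,
#   t_reg = t_y,HT + beta_hat * (t_x - t_x,HT) = beta_hat * t_x,
# with beta_hat = t_y,HT / t_x,HT and pi_i = n * x_i / t_x.
import numpy as np

rng = np.random.default_rng(9)
N, n_expected = 1000, 100
x = rng.uniform(1.0, 10.0, size=N)
y = 2.0 * x + rng.normal(0.0, 1.0 * np.sqrt(x))

pi = n_expected * x / x.sum()           # pi_i = n * x_i / t_x (all < 1 here)
s = rng.random(N) < pi                  # Poisson sampling with prob. pi_i

t_y_ht = np.sum(y[s] / pi[s])           # HT estimator of t_y
t_x_ht = np.sum(x[s] / pi[s])           # HT estimator of t_x
beta_hat = t_y_ht / t_x_ht
t_reg = t_y_ht + beta_hat * (x.sum() - t_x_ht)

print(f"t_reg = {t_reg:.1f} (= beta_hat * t_x = {beta_hat * x.sum():.1f})")
print(f"t_HT  = {t_y_ht:.1f},  true t = {y.sum():.1f}")
```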
Variance and variance estimation
Reference: Särndal, Swensson and Wretman: Model Assisted Survey Sampling (1992, ch. 6), Wiley
• The regression estimator is approximately unbiased
• Variance estimation:
The sample residuals: e_i = y_i − ŷ_i, i ∈ s, where ŷ_i = x_i β̂
If |s| = n, fixed in advance:
V̂(t̂_reg) = (1/2) ∑_{i∈s} ∑_{j∈s, j≠i} [ (π_i π_j − π_ij) / π_ij ] (e_i/π_i − e_j/π_j)²
207
Approximate 95% CI, for large n and N − n:
t̂_reg ± 1.96 √V̂(t̂_reg)
• Remark: In SSW (1992, ch. 6), an alternative variance estimator is mentioned that may be preferable in many cases
208
Common mean model
E(Y_i) = μ, Var(Y_i) = σ² and the Y_i's are uncorrelated
This is the ratio model with x_i = 1:
β̂ = t̂_{y,HT} / t̂_{x,HT} = [∑_{i∈s} y_i/π_i] / [∑_{i∈s} 1/π_i] = t̂_{y,HT} / N̂ = ỹ_s,
where N̂ = ∑_{i∈s} 1/π_i
t̂_reg = t_x β̂ = N β̂ = N ỹ_s
This is the modified H-T estimator (slides 73–74).
Typically much better than the H-T estimator when the π_i's vary
209
With e_i = y_i − ỹ_s:
V̂(N ỹ_s) = (1/2) ∑_{i∈s} ∑_{j∈s, j≠i} [ (π_i π_j − π_ij) / π_ij ] (e_i/π_i − e_j/π_j)²
Alternatively,
V̂(N ỹ_s) = (N/N̂)² (1/2) ∑_{i∈s} ∑_{j∈s, j≠i} [ (π_i π_j − π_ij) / π_ij ] (e_i/π_i − e_j/π_j)²
210
Remarks:
1. The model-assisted regression estimator often has the form t̂_reg = ∑_{i=1}^N ŷ_i
2. The prediction approach makes it clear that there is no need to estimate the observed y_i
3. Any estimator can be expressed on the "prediction form":
   t̂ = ∑_{i∈s} y_i + ẑ_t̂, letting ẑ_t̂ = t̂ − ∑_{i∈s} y_i
4. Can then use this form to see if the estimator makes any sense
211