Advanced Topics in Survey Sampling

Jae-Kwang Kim
Wayne A. Fuller
Pushpal Mukhopadhyay
Department of Statistics
Iowa State University
World Statistics Congress Short Course
July 23-24, 2015
Outline

1. Probability sampling from a finite population
2. Use of auxiliary information in estimation
3. Use of auxiliary information in design
4. Replication variance estimation
5. Models used in conjunction with sampling
6. Analytic studies
Chapter 1
Probability sampling from a finite universe
World Statistics Congress Short Course
July 23-24, 2015
Probability Sampling

$U = \{1, 2, \ldots, N\}$ : list for the finite population; the sampling frame
$\mathcal{F} = \{y_1, y_2, \ldots, y_N\}$ : finite population (finite universe)
$A$ ($\subset U$) : index set of the sample
$\mathcal{A}$ : set of possible samples
Sampling Design

Definition
$p(\cdot)$ is a sampling design $\Leftrightarrow$ $p(a)$ is a function from $\mathcal{A}$ to $[0, 1]$ such that
1. $p(a) \in [0, 1]$, $\forall a \in \mathcal{A}$,
2. $\sum_{a \in \mathcal{A}} p(a) = 1$;
i.e., $p(a)$ is a probability mass function defined on $\mathcal{A}$.
Notation for Sampling

$$I_i = \begin{cases} 1 & \text{if } i \in A \\ 0 & \text{otherwise} \end{cases}$$
$d = (I_1, I_2, \ldots, I_N)$
$n = \sum_{i=1}^N I_i$ : (realized) sample size
$\pi_i = E[I_i]$ : first-order inclusion probability
$\pi_{ij} = E[I_i I_j]$ : second-order inclusion probability
Design-estimator Characteristics

Definition
$\hat\theta = \hat\theta(y_i; i \in A)$ is design unbiased for $\theta_N = \theta(y_1, y_2, \ldots, y_N)$
$\Leftrightarrow E\{\hat\theta \mid \mathcal{F}\} = \theta_N$, $\forall (y_1, y_2, \ldots, y_N)$, where $E\{\hat\theta \mid \mathcal{F}\} = \sum_{a \in \mathcal{A}} \hat\theta(a) p(a)$.

Definition
$\hat\theta$ is a design linear estimator $\Leftrightarrow$ $\hat\theta = \sum_{i \in A} w_i y_i$, where the $w_i$ are fixed with respect to the sampling design.

Note: If $w_i = \pi_i^{-1}$, then $\hat\theta = \sum_{i \in A} w_i y_i$ is the Horvitz-Thompson estimator.
Theorem 1.2.1

Theorem
If $\hat\theta = \sum_{i \in A} w_i y_i$ is a design linear estimator, then
$$E\{\hat\theta \mid \mathcal{F}\} = \sum_{i=1}^N w_i \pi_i y_i$$
$$V\{\hat\theta \mid \mathcal{F}\} = \sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i \pi_j)\, w_i y_i w_j y_j.$$
Proof of Theorem 1.2.1

Because $E\{I_i\} = \pi_i$ and because $w_i y_i$, $i = 1, 2, \ldots, N$, are fixed,
$$E\{\hat\theta \mid \mathcal{F}\} = \sum_{i=1}^N E\{I_i \mid \mathcal{F}\} w_i y_i = \sum_{i=1}^N \pi_i w_i y_i.$$
Using $E(I_i I_k) = \pi_{ik}$ and $\mathrm{Cov}(I_i, I_k) = E(I_i I_k) - E(I_i)E(I_k)$,
$$V\{\hat\theta \mid \mathcal{F}\} = V\left\{\sum_{i=1}^N I_i w_i y_i \mid \mathcal{F}\right\} = \sum_{i=1}^N \sum_{k=1}^N \mathrm{Cov}(I_i, I_k)\, w_i y_i w_k y_k = \sum_{i=1}^N \sum_{k=1}^N (\pi_{ik} - \pi_i \pi_k)\, w_i y_i w_k y_k.$$
Horvitz-Thompson Estimator of Total

Corollary
Let $\pi_i > 0$ for all $i \in U$. Then $\hat T_y = \sum_{i \in A} \pi_i^{-1} y_i$ satisfies
(i) $E(\hat T_y \mid \mathcal{F}) = T_y$
(ii) $V(\hat T_y \mid \mathcal{F}) = \sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i \pi_j) \dfrac{y_i}{\pi_i} \dfrac{y_j}{\pi_j}$

Proof of (ii): Substitute $\pi_i^{-1}$ for $w_i$ in Theorem 1.2.1.
Unbiased Variance Estimation

Theorem
Let $\pi_{ij} > 0$, $\forall i, j \in U$, and let $\hat\theta = \sum_{i \in A} w_i y_i$ be design linear. Then
$$\hat V = \sum_{i \in A} \sum_{j \in A} \frac{1}{\pi_{ij}} (\pi_{ij} - \pi_i \pi_j)\, w_i y_i w_j y_j \quad \text{satisfies} \quad E(\hat V \mid \mathcal{F}) = V(\hat\theta \mid \mathcal{F}).$$

Proof: Let $g(y_i, y_j) = (\pi_{ij} - \pi_i \pi_j) w_i y_i w_j y_j$. By the argument of Theorem 1.2.1,
$$E\left\{\sum_{i, j \in A} \pi_{ij}^{-1} g(y_i, y_j) \mid \mathcal{F}\right\} = \sum_{i=1}^N \sum_{j=1}^N g(y_i, y_j).$$
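To make these formulas concrete, here is a minimal numerical sketch (not from the course materials; population and seed are invented) that computes the Horvitz-Thompson total and its unbiased variance estimator under SRS without replacement, where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$ for $i \ne j$:

import numpy as np

rng = np.random.default_rng(7)
N, n = 1000, 50
y = rng.gamma(shape=2.0, scale=10.0, size=N)   # hypothetical population

# SRS without replacement: pi_i = n/N, pi_ij = n(n-1)/(N(N-1)) for i != j
pi = n / N
pij = n * (n - 1) / (N * (N - 1))

sample = rng.choice(N, size=n, replace=False)
ys = y[sample]

# Horvitz-Thompson estimator of the total
T_hat = np.sum(ys / pi)

# Unbiased variance estimator:
# sum_{i,j in A} (pi_ij - pi_i pi_j)/pi_ij * (y_i/pi_i)(y_j/pi_j)
w = ys / pi
Pij = np.full((n, n), pij)
np.fill_diagonal(Pij, pi)            # pi_ii = pi_i
coef = (Pij - pi * pi) / Pij
V_hat = w @ coef @ w

print(f"true total {y.sum():.1f}, HT estimate {T_hat:.1f}, "
      f"variance estimate {V_hat:.1f}")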
Simple Random Sampling (SRS)

Choose n units from N units without replacement with equal probability.
1. Each subset of n distinct units is equally likely to be selected.
2. There are $\binom{N}{n}$ samples of size n from N.
3. Give equal probability of selection to each subset with n units.

Definition
Sampling design for SRS:
$$P(A) = \begin{cases} \binom{N}{n}^{-1} & \text{if } |A| = n \\ 0 & \text{otherwise.} \end{cases}$$
Lemma

Under SRS, the inclusion probabilities are
$$\pi_i = \binom{N-1}{n-1}\binom{N}{n}^{-1} = \frac{n}{N}$$
$$\pi_{ij} = \binom{N-2}{n-2}\binom{N}{n}^{-1} = \frac{n(n-1)}{N(N-1)} \quad \text{for } i \ne j.$$
Simple Random Samples

Let $\bar y_n = n^{-1} \sum_{i \in A} y_i$, $\hat V = n^{-1}(1 - N^{-1} n)\, s_n^2$,
$s_n^2 = (n-1)^{-1} \sum_{i \in A} (y_i - \bar y_n)^2$, $\bar y_N = N^{-1} \sum_{i=1}^N y_i$,
and $S_N^2 = (N-1)^{-1} \sum_{i=1}^N (y_i - \bar y_N)^2$. Then,
(i) $E(\bar y_n \mid \mathcal{F}) = \bar y_N$
(ii) $V(\bar y_n \mid \mathcal{F}) = n^{-1}(1 - N^{-1} n)\, S_N^2$
(iii) $E(\hat V \mid \mathcal{F}) = V(\bar y_n \mid \mathcal{F})$
Poisson Sampling

Definition:
$$I_i \stackrel{ind}{\sim} \mathrm{Bernoulli}(\pi_i), \quad i = 1, 2, \ldots, N.$$
Estimation (of $T_y = \sum_{i=1}^N y_i$):
$$\hat T_y = \sum_{i=1}^N I_i y_i / \pi_i, \qquad E\{\hat T_y \mid \mathcal{F}\} = \sum_{i=1}^N \pi_i y_i / \pi_i = T_y$$
Variance:
$$\mathrm{Var}(\hat T_y \mid \mathcal{F}) = \sum_{i=1}^N (\pi_i - \pi_i^2)\, y_i^2 / \pi_i^2 = \sum_{i=1}^N \left(\frac{1}{\pi_i} - 1\right) y_i^2$$
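A small simulation (an illustration with an invented population, not from the text) can confirm the unbiasedness and the variance formula under Poisson sampling:

import numpy as np

rng = np.random.default_rng(11)
N = 500
y = rng.lognormal(mean=3.0, sigma=0.5, size=N)    # hypothetical population
pi = np.clip(0.1 * y / y.mean(), 0.01, 0.9)       # unequal inclusion probabilities

# theoretical variance: sum (1/pi - 1) y^2
V_theory = np.sum((1 / pi - 1) * y**2)

reps = 20000
T_hats = np.empty(reps)
for r in range(reps):
    I = rng.random(N) < pi          # independent Bernoulli trials
    T_hats[r] = np.sum(y[I] / pi[I])

print(f"true total {y.sum():.1f}, mean of HT estimates {T_hats.mean():.1f}")
print(f"theoretical variance {V_theory:.1f}, Monte Carlo variance {T_hats.var():.1f}")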
Poisson Sampling

Optimal design: minimize $\mathrm{Var}(\hat T_y \mid \mathcal{F})$ subject to $\sum_{i=1}^N \pi_i = n$:
$$\min \sum_{i=1}^N \pi_i^{-1} y_i^2 + \lambda\left(\sum_{i=1}^N \pi_i - n\right)$$
$$\Rightarrow \pi_i^{-2} y_i^2 = \lambda \quad \Rightarrow \quad \pi_i \propto y_i$$
A Design Result

Definition
Superpopulation model ($\xi$) : the model for $y = (y_1, y_2, \ldots, y_N)$.

Definition
Anticipated variance : the expected value of the design variance of the planned estimator, calculated under the superpopulation model.

Theorem (1.2.3)
Let $y = (y_1, y_2, \ldots, y_N)$ be iid $(\mu, \sigma^2)$, let $d = (I_1, I_2, \ldots, I_N)$ be independent of y, and define $\hat T_y = \sum_{i \in A} \pi_i^{-1} y_i$. Then $V\{\hat T_y - T_y\}$ is minimized at $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$, $i \ne j$.
Proof of Theorem 1.2.3

$$V\{\hat T_y - T_y\} = V\{E(\hat T_y - T_y \mid \mathcal{F})\} + E\{V(\hat T_y - T_y \mid \mathcal{F})\} = E\left\{\sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}\right\}$$
$$= \mu^2 \sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i \pi_j) \frac{1}{\pi_i} \frac{1}{\pi_j} + \sigma^2 \sum_{i=1}^N (\pi_i - \pi_i^2) \frac{1}{\pi_i^2}$$

(i) $\min \sum_{i=1}^N \pi_i^{-1}$ s.t. $\sum_{i=1}^N \pi_i = n$ $\Rightarrow$ $\pi_i = N^{-1} n$

(ii) $\sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i \pi_j)\, \pi_i^{-1} \pi_j^{-1} = V\{\sum_{i=1}^N I_i \pi_i^{-1}\} \ge 0$, and $V\{\sum_{i=1}^N I_i \pi_i^{-1}\} = 0$ for $\pi_i = N^{-1} n$ and $\pi_{ij} = \dfrac{n(n-1)}{N(N-1)}$.
Discussion of Theorem 1.2.3

(i) Finding the $\pi_i$ and $\pi_j$ that minimize $V\{\hat\theta \mid \mathcal{F}\}$ is not possible, because $V\{\hat\theta \mid \mathcal{F}\}$ is a function of N unknown values: Godambe (1955), Godambe & Joshi (1965), Basu (1971).
(ii) If $y_1, y_2, \ldots, y_N \sim \xi$ for some model $\xi$ (superpopulation model), then we can sometimes find an optimal strategy (design and estimator). Under an iid model and HT estimation, the optimal design is SRS.
Example: Stratified Sampling

Definition
1. The finite population is stratified into H subpopulations: $U = U_1 \cup \cdots \cup U_H$.
2. Within each subpopulation (or stratum), samples are drawn independently across strata:
$$\Pr(i \in A_h, j \in A_g) = \Pr(i \in A_h)\Pr(j \in A_g) \quad \text{for } h \ne g,$$
where $A_h$ is the index set of the sample in stratum h, $h = 1, 2, \ldots, H$.

Example: Stratified SRS
1. Stratify the population. Let $N_h$ be the population size of $U_h$.
2. Sample size allocation: determine $n_h$.
3. Perform SRS independently in each stratum (select $n_h$ sample elements from $N_h$).
Estimation under Stratified SRS

1. HT estimator:
$$\hat T_y = \sum_{h=1}^H N_h \bar y_h, \quad \text{where } \bar y_h = n_h^{-1} \sum_{i \in A_h} y_{hi}.$$
2. Variance:
$$\mathrm{Var}(\hat T_y) = \sum_{h=1}^H \frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right) S_h^2, \quad \text{where } S_h^2 = (N_h - 1)^{-1} \sum_{i=1}^{N_h} (y_{hi} - \bar Y_h)^2.$$
3. Variance estimation:
$$\hat V(\hat T_y) = \sum_{h=1}^H \frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right) s_h^2, \quad \text{where } s_h^2 = (n_h - 1)^{-1} \sum_{i \in A_h} (y_{hi} - \bar y_h)^2.$$
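A minimal sketch (with made-up strata and sampled values) of the stratified HT estimator and its variance estimator:

import numpy as np

rng = np.random.default_rng(3)
Nh = np.array([706, 1238, 1370, 2170])     # hypothetical stratum sizes
nh = np.array([16, 28, 31, 48])            # allocated sample sizes

T_hat, V_hat = 0.0, 0.0
for N_h, n_h in zip(Nh, nh):
    y_h = rng.normal(loc=100 + N_h / 50, scale=20, size=n_h)  # sampled values
    ybar_h = y_h.mean()
    s2_h = y_h.var(ddof=1)                 # stratum sample variance s_h^2
    T_hat += N_h * ybar_h
    V_hat += N_h**2 / n_h * (1 - n_h / N_h) * s2_h

print(f"stratified HT total {T_hat:.1f}, s.e. {np.sqrt(V_hat):.1f}")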
Optimal Strategy under Stratified Sampling

Theorem (1.2.6)
Let $\mathcal{F}$ be a stratified finite population in which the elements in stratum h are realizations of iid$(\mu_h, \sigma_h^2)$ random variables. Let C be the total cost for sample observation and assume that it costs $c_h$ to observe an element in stratum h. Then a sampling and estimation strategy for $T_y$ that minimizes the anticipated variance in the class of linear unbiased estimators and probability designs is: select independent simple random samples in each stratum, selecting $n_h^*$ in stratum h, where
$$n_h^* \propto \frac{N_h \sigma_h}{\sqrt{c_h}}$$
with $C = \sum_h n_h^* c_h$, subject to $n_h^* \le N_h$, and use the HT estimator.
Comments on Theorem 1.2.6

Anticipated variance:
$$AV\{\hat\theta - \theta_N\} = E\{E[(\hat\theta - \theta_N)^2 \mid \mathcal{F}]\} - [E\{E(\hat\theta - \theta_N \mid \mathcal{F})\}]^2$$
For HT estimation, $E(\hat\theta - \theta_N \mid \mathcal{F}) = 0$ and the anticipated variance becomes
$$AV\{\hat T_y - T_y\} = \sum_{h=1}^H N_h^2 (n_h^{-1} - N_h^{-1})\, \sigma_h^2$$
Minimizing $\sum_{h=1}^H n_h^{-1} N_h^2 \sigma_h^2$ subject to $\sum_{h=1}^H n_h c_h = C$ leads to the optimal allocation
$$n_h^* \propto \frac{N_h \sigma_h}{\sqrt{c_h}}.$$
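A quick sketch (all inputs assumed) of the cost-optimal allocation: scale $N_h \sigma_h / \sqrt{c_h}$ so that the total cost equals the budget.

import numpy as np

Nh = np.array([2000, 5000, 3000])      # stratum sizes (assumed)
sigma_h = np.array([8.0, 15.0, 4.0])   # stratum standard deviations (assumed)
ch = np.array([4.0, 1.0, 1.0])         # per-unit observation costs (assumed)
C = 5000.0                             # total budget (assumed)

# n_h* proportional to N_h sigma_h / sqrt(c_h), scaled so sum n_h c_h = C
base = Nh * sigma_h / np.sqrt(ch)
k = C / np.sum(base * ch)
n_star = np.minimum(k * base, Nh)      # respect n_h <= N_h
print(np.round(n_star, 1))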
Sample Selection Using SAS®
PROC SURVEYSELECT

PROC SURVEYSELECT
Probability-based random sampling
  equal probability selection
  PPS selection
Stratification and clustering
Sample size allocation
Sampling weights
  inclusion probabilities
  joint inclusion probabilities
Sampling Methods
Simple random with and without replacement
Systematic
Sequential
PPS
PPS Sampling Methods
With and without replacement
Systematic
Sequential with minimum replacement
Two units per stratum: Brewer’s, Murthy’s
Sampford’s method
Digitech Cable Company

Digital TV, high-speed Internet, digital phone
13,471 customers in four states: AL, FL, GA, and SC
Customer satisfaction survey (high-speed Internet service)
Sampling with Stratification

Can afford to call only 300 customers
The sampling frame contains the list of customer identifications, addresses, and types
Need adequate sampling units in every stratum (state and type)
Select a simple random sample without replacement within strata
Sampling Frame

CustomerID   State   Type
416874322    AL      Platinum
288139763    GA      Gold
339008654    GA      Gold
118980542    GA      Platinum
421670342    FL      Platinum
623189201    SC      Platinum
324550324    FL      Gold
832902397    AL      Gold
Sort Sampling Frame by Strata before Selection

proc sort data=Customers;
   by State Type;
run;
Select Stratified Sample

proc surveyselect data=Customers method=srs n=300
                  seed=3232445 out=SampleStrata;
   strata State Type / alloc=prop;
run;
The SURVEYSELECT Procedure

Selection Method      Simple Random Sampling
Strata Variables      State, Type
Allocation            Proportional
Input Data Set        CUSTOMERS
Random Number Seed    3232445
Number of Strata      8
Total Sample Size     300
Output Data Set       SAMPLESTRATA
Strata Sizes

State   Type       SampleSize   PopSize
AL      Gold       16           706
AL      Platinum   28           1238
FL      Gold       31           1370
FL      Platinum   48           2170
GA      Gold       43           1940
GA      Platinum   78           3488
SC      Gold       19           875
SC      Platinum   37           1684
Data Collection

Important practical considerations that the computer cannot decide for you:
  Telephone interview
  Rating, age, household size, ...
  Auxiliary variables: data usage, average annual income, home ownership rate, ...
  Callbacks, edits, and imputations
Survey Objective: Digitech Cable

Rate customer satisfaction
Are customers willing to recommend Digitech?
Is satisfaction related to household size? race?
Is usage time related to data usage?
Digitech Cable Data Collection

Survey variables: Rating, Recommend, Usage time, Household size, Race
Auxiliary variables: Data usage, Neighborhood income, Competitors
Large Sample Results for Survey Samples

Complex designs: weights
Few distributional assumptions
Heavy reliance on large-sample theory
Central Limit Theorem
Review of Large Sample Results

Mann and Wald notation for order in probability.
Sequence: $X_1, X_2, \ldots, X_n, \ldots$ ($g_n > 0$)
$$X_n = o_p(g_n) \Leftrightarrow X_n / g_n \to 0 \text{ in probability} \Leftrightarrow \lim_{n\to\infty} P[|X_n / g_n| > \epsilon] = 0 \ \forall \epsilon > 0$$
$$X_n = O_p(g_n) \Leftrightarrow X_n / g_n \text{ bounded in probability} \Leftrightarrow \forall \epsilon > 0 \ \exists M_\epsilon : P[|X_n / g_n| > M_\epsilon] < \epsilon \ \forall n$$
Examples of Order in Probability

Let $\bar x_n \sim N(0, n^{-1})$. Then the following statements hold:
$$P\{|\bar x_n| > 2 n^{-1/2}\} < 0.05 \ \forall n$$
$$P\{|\bar x_n| > \Phi^{-1}(1 - 0.5\epsilon)\, n^{-1/2}\} \le \epsilon \ \forall n$$
therefore $\bar x_n = O_p(n^{-1/2})$.
If $\bar x_n \sim N(0, n^{-1}\sigma^2)$, then $\bar x_n = O_p(?)$
If $\bar x_n \sim N(\mu, n^{-1}\sigma^2)$, then $\bar x_n = O_p(?)$
Example of $o_p$

Again, let $\bar x_n \sim N(0, n^{-1})$.
$$\lim_{n\to\infty} P\{|\bar x_n| > k\} = 0 \ \forall k > 0 \ \Rightarrow \ \bar x_n = o_p(1)$$
$$\lim_{n\to\infty} P\{|n^{1/4} \bar x_n| > k\} = 0 \ \forall k > 0 \ \Rightarrow \ \bar x_n = o_p(n^{-1/4})$$
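A brief simulation (illustrative only; $\bar x_n$ is drawn directly from its exact $N(0, n^{-1})$ distribution) showing that $\sqrt{n}\,\bar x_n$ stays bounded in probability while $n^{1/4}\bar x_n$ shrinks:

import numpy as np

rng = np.random.default_rng(0)
for n in [10**2, 10**4, 10**6]:
    xbar = rng.normal(scale=n**-0.5, size=5000)   # 5000 replicates of x̄_n
    q95_root_n = np.quantile(np.abs(np.sqrt(n) * xbar), 0.95)   # stable: O_p(n^{-1/2})
    q95_quarter = np.quantile(np.abs(n**0.25 * xbar), 0.95)     # -> 0: o_p(n^{-1/4})
    print(f"n={n:>7}: |sqrt(n) xbar| q95={q95_root_n:.3f}, "
          f"|n^0.25 xbar| q95={q95_quarter:.4f}")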
Properties of Order in Probability

Let $f_n > 0$, $g_n > 0$, $X_n = O_p(f_n)$, $Y_n = O_p(g_n)$. Then
$$X_n Y_n = O_p(f_n g_n)$$
$$|X_n|^s = O_p(f_n^s), \quad s \ge 0$$
$$X_n + Y_n = O_p(\max\{f_n, g_n\}).$$
Chebyshev's Inequality

For given $r > 0$ with $E\{|X|^r\} < \infty$,
$$P[|X - A| \ge \epsilon] \le \frac{E\{|X - A|^r\}}{\epsilon^r}.$$

Corollary
$$E\{X_n^2\} = O(a_n^2) \Rightarrow X_n = O_p(a_n)$$
By Chebyshev's inequality,
$$P\left[\frac{|X_n|}{a_n} > M\right] \le \frac{E\{X_n^2\}}{a_n^2 M^2} < \epsilon \quad \text{if we choose } M = \sqrt{K/\epsilon},$$
where K is the upper bound of $a_n^{-2} E\{X_n^2\}$.
Central Limit Theorems

Lindeberg: $X_1, X_2, \ldots$ independent $(\mu_i, \sigma_i^2)$, $B_n^2 = \sum_{i=1}^n \sigma_i^2$:
$$\frac{1}{B_n^2} \sum_{i=1}^n E\{(X_i - \mu_i)^2\, I(|X_i - \mu_i| > \epsilon B_n)\} \to 0 \ \forall \epsilon > 0 \ \Rightarrow \ \frac{\sum_{i=1}^n (X_i - \mu_i)}{B_n} \stackrel{L}{\to} N(0, 1)$$

Liapounov: $X_1, X_2, \ldots$ independent $(\mu_i, \sigma_i^2)$, $B_n^2 = \sum_{i=1}^n \sigma_i^2$:
$$\sum_{i=1}^n E|X_i - \mu_i|^{2+\delta} = o(B_n^{2+\delta}) \text{ for some } \delta > 0 \ \Rightarrow \ \frac{\sum_{i=1}^n (X_i - \mu_i)}{B_n} \stackrel{L}{\to} N(0, 1)$$

Note: the Liapounov condition implies the Lindeberg condition.
Slutsky's Theorem

$\{X_n\}, \{Y_n\}$ are sequences of random variables satisfying $X_n \stackrel{L}{\to} X$ and $\mathrm{plim}\, Y_n = c$. Then
$$X_n + Y_n \stackrel{L}{\to} X + c, \qquad Y_n X_n \stackrel{L}{\to} cX.$$
Theorem 1.3.1: Samples of Samples

Theorem
$y_1, \ldots, y_N$ are iid with d.f. $F(y)$ and c.f. $\varphi(t) = E\{e^{itY}\}$, $i = \sqrt{-1}$;
$d = (I_1, \ldots, I_N)$ : random vector independent of y
$\Rightarrow (y_k; k \in A) \mid d$ are iid with c.f. $\varphi(t)$.

Proof: In book. An SRS of an SRS is an SRS.
Application of Theorem 1.3.1

$y_1, \ldots, y_N \sim$ iid $N(\mu_y, \sigma_y^2)$; SRS of size $n_N$ from $\mathcal{F}_N$
$\Rightarrow (y_k, k \in A) \mid d$ are iid $N(\mu_y, \sigma_y^2)$ and $(y_k, k \in U \cap A^c) \mid d$ are iid $N(\mu_y, \sigma_y^2)$
$\Rightarrow \bar y_n \sim N(\mu_y, \sigma_y^2 / n_N)$ and $\bar y_{N-n} \sim N(\mu_y, \sigma_y^2 / (N - n_N))$, independent of $\bar y_n$
$\Rightarrow \bar y_n - \bar Y_N \sim N(0, n_N^{-1}(1 - f_N)\sigma_y^2)$ and
$$\{n_N^{-1}(1 - f_N)\, s_n^2\}^{-1/2} (\bar y_n - \bar Y_N) \sim t_{n-1}$$
Finite Population Sampling

Motivation:
Is $\bar x_n - \bar x_N = o_p(1)$?
Does $\dfrac{\bar x_n - \bar x_N}{\sqrt{\hat V\{\bar x_n - \bar x_N \mid \mathcal{F}_N\}}} \stackrel{L}{\to} N(0, 1)$?
We'll be able to answer these shortly.
$n \to N$ isn't very interesting.
Need $n \to \infty$ and $N - n \to \infty$.
Need a sequence of samples from a sequence of finite populations.
Sequence of Samples from a Sequence of Populations

Approach 1: $\{\mathcal{F}_N\}$ is a sequence of fixed vectors.
Approach 2: $\{y_1, y_2, \ldots, y_N\}$ is a realization from a superpopulation model.

Notation
$U_N = \{1, 2, \ldots, N\}$ : N-th finite population
$\mathcal{F}_N = \{y_{1N}, \ldots, y_{NN}\}$
$y_{iN}$ : observation associated with the i-th element in the N-th population
$A_N$ : sample index set selected from $U_N$, with size $n_N = |A_N|$
Design Consistency

Definition
$\hat\theta$ is design consistent for $\theta_N$ if, for every $\epsilon > 0$,
$$\lim_{N\to\infty} P\{|\hat\theta - \theta_N| > \epsilon \mid \mathcal{F}_N\} = 0$$
almost surely, where $P(\cdot \mid \mathcal{F}_N)$ denotes the probability induced by the sampling mechanism.
Design Consistency for ȳ_n in SRS

Approach 1 (fixed sequence): Assume a sequence of finite populations $\{\mathcal{F}_N\}$ s.t.
$$\lim_{N\to\infty} N^{-1} \sum_{i=1}^N (y_i, y_i^2) = (\theta_1, \theta_2), \quad \theta_2 - \theta_1^2 > 0.$$
By Chebyshev's inequality,
$$P\{|\bar y_n - \bar Y_N| \ge \epsilon \mid \mathcal{F}_N\} \le \frac{n_N^{-1}(1 - f_N)\, S_N^2}{\epsilon^2},$$
where $f_N = n_N N^{-1}$.

Approach 2: $y_1, \ldots, y_N \sim$ iid$(\mu, \sigma_y^2)$
$\Rightarrow \lim_{N\to\infty} S_N^2 = \sigma_y^2$ a.s.
$\Rightarrow V[\bar y_n - \bar Y_N \mid \mathcal{F}_N] = O_p(n_N^{-1})$
$\Rightarrow \bar y_n - \bar Y_N \mid \mathcal{F}_N = O_p(n_N^{-1/2})$
Central Limit Theorem (1.3.2), Part 1

Theorem (Part 1)
$\{y_{1N}, \ldots, y_{NN}\} \sim$ iid$(\mu, \sigma^2)$ with $2 + \delta$ moments ($\delta > 0$);
SRS, $\bar y_n = n^{-1} \sum_{i=1}^N I_i y_{iN}$, $\bar Y_N = N^{-1} \sum_{i=1}^N y_{iN}$
$$\Rightarrow [V(\bar y_n - \bar Y_N)]^{-1/2} (\bar y_n - \bar Y_N) \mid d \stackrel{L}{\to} N(0, 1).$$
Proof: Write $\bar y_n - \bar Y_N = N^{-1} \sum_{i=1}^N c_{iN} y_{iN}$, where $c_{iN} = n^{-1} N I_i - 1$.
$$B_n^2 = N^2 V(\bar y_n - \bar Y_N) = N^2 V(\bar y_n - \bar Y_N \mid d) = \sum_{i=1}^N c_{iN}^2 \sigma^2 = (N - n) N n^{-1} \sigma^2.$$
Apply the Lindeberg CLT.
Theorem 1.3.2, Part 2

Theorem (Part 2)
Furthermore, if $\{y_{iN}\}$ has bounded fourth moments, then
$$[\hat V(\bar y_n - \bar Y_N)]^{-1/2} (\bar y_n - \bar Y_N) \stackrel{L}{\to} N(0, 1).$$
Proof: Want to apply Slutsky's theorem:
$$\frac{\bar y_n - \bar Y_N}{\sqrt{\hat V(\bar y_n - \bar Y_N)}} = \frac{\bar y_n - \bar Y_N}{\sqrt{V(\bar y_n - \bar Y_N)}} \cdot \sqrt{\frac{V(\bar y_n - \bar Y_N)}{\hat V(\bar y_n - \bar Y_N)}} \stackrel{L}{\to} N(0, 1).$$
Proof of 1.3.2 Part 2, Continued

It is enough to show that
$$\frac{V(\bar y_n - \bar Y_N)}{\hat V(\bar y_n - \bar Y_N)} = \frac{(n^{-1} - N^{-1})\, \sigma_y^2}{(n^{-1} - N^{-1})\, s_n^2} \stackrel{p}{\to} 1.$$
To show $s_n^2 \stackrel{p}{\to} \sigma_y^2$, note that
$$\frac{s_n^2}{\sigma_y^2} = \frac{1}{\sigma_y^2} \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y_n)^2 = \frac{1}{\sigma_y^2} \frac{1}{n-1} \sum_{i=1}^n (y_i - \mu)^2 - \frac{1}{\sigma_y^2} \frac{n}{n-1} (\bar y_n - \mu)^2 = \frac{1}{\sigma_y^2} \frac{1}{n} \sum_{i=1}^n (y_i - \mu)^2 + O_p(n^{-1})$$
$$\stackrel{p}{\to} 1 \quad \text{if } E(y_i - \mu)^4 \le M_4 < \infty.$$
Comments on Theorem 1.3.2

1. The CLT in Theorem 1.3.2 is derived with Approach 2 (using a superpopulation model).
2. The result can be extended to stratified random sampling (textbook): with $\{y_{hiN}\} \sim$ iid$(\mu_h, \sigma_h^2)$,
$$\hat\theta_n = \sum_{h=1}^{H_N} N^{-1} N_h \bar y_{hn}, \qquad \bar Y_N = \sum_{h=1}^{H_N} N^{-1} N_h \bar Y_{hN},$$
$$[\hat V(\hat\theta_n - \bar Y_N)]^{-1/2} (\hat\theta_n - \bar Y_N) \stackrel{L}{\to} N(0, 1).$$
Poisson Sampling

Population: $y_1, y_2, \ldots, y_N$; probabilities: $\pi_1, \pi_2, \ldots, \pi_N$.
The sampling process is a Bernoulli trial for each i (independent trials); the sample is those i for which the trial is a success: N independent random variables.
CLT under Approach 1 (Fixed Sequences)

$y_1, y_2, \ldots$ : sequence of real vectors
$\pi_1, \pi_2, \ldots$ : sequence of selection probabilities
$g_i = (1, y_i, \alpha_N \pi_i^{-1}, \alpha_N \pi_i^{-1} y_i)'$, where $\alpha_N = E(n_N)/N = n_{BN}/N$
$x_i = g_i I_i$, $I_i \sim$ Bernoulli$(\pi_i)$ (i.e., Poisson sampling)
Let $\hat\mu_x = n_{BN}^{-1} \sum_{i=1}^N x_i = n_{BN}^{-1} \sum_{i=1}^N g_i I_i$ and $\mu_{xN} = n_{BN}^{-1} \sum_{i=1}^N g_i \pi_i$, so $E(\hat\mu_x \mid \mathcal{F}_N) = \mu_{xN}$.
$$\hat T_{y,HT} = \sum_{i=1}^N I_i y_i \pi_i^{-1} = \alpha_N^{-1} \sum_{i=1}^N \alpha_N \pi_i^{-1} y_i I_i = \alpha_N^{-1} \sum_{i=1}^N x_{i4}.$$
Theorem 1.3.3, Part 1

Theorem (Part 1)
Assume Poisson sampling and
(i) $\lim_{N\to\infty} n_{BN}^{-1} \sum_{i=1}^N g_i \pi_i = \mu_x$
(ii) $\lim_{N\to\infty} n_{BN}^{-1} \sum_{i=1}^N \pi_i (1 - \pi_i)\, g_i' g_i = \Sigma_{xx} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$, with $\Sigma_{11}, \Sigma_{22}$ positive definite
(iii) $\lim_{N\to\infty} \max_{1 \le k \le N} \dfrac{(\gamma' g_k)^2}{\sum_{i=1}^N (\gamma' g_i)^2 \pi_i (1 - \pi_i)} = 0$ for all $\gamma$ s.t. $\gamma' \Sigma_{xx} \gamma > 0$.
Then
$$\sqrt{n_{BN}}\, (\hat\mu_x - \mu_{xN}) \mid \mathcal{F}_N \stackrel{L}{\to} N(0, \Sigma_{xx}).$$
Proof: textbook.
Theorem 1.3.3, Part 2

Theorem (Part 2)
Under conditions (i)-(iii), if
(iv) $\lim_{N\to\infty} n_{BN}^{-1} \sum_{i=1}^N \pi_i |g_i|^4 = M_4$,
then $[\hat V(\hat T_y)]^{-1/2} (\hat T_y - T_y) \mid \mathcal{F}_N \to N(0, I)$, where
$$\hat V(\hat T_y) = \sum_{i \in A} \frac{\pi_i (1 - \pi_i)}{\pi_i} \frac{y_i}{\pi_i} \frac{y_i'}{\pi_i}.$$
Proof: textbook.
Theorem 1.3.4: CLT for SRS

Theorem
$\{y_i\}$ : sequence of real numbers with bounded 4th moments; SRS without replacement
$$\Rightarrow \hat V_n^{-1/2} (\bar y_n - \bar y_N) \mid \mathcal{F}_N \stackrel{L}{\to} N(0, 1)$$
The result is obtained by showing that there is an SRS mean that differs from the Poisson mean by $o_p(n^{-1/2})$.
Function of Means (Theorem 1.3.7)

Theorem
(i) $\sqrt{n}(\bar x_n - \mu_x) \stackrel{L}{\to} N(0, \Sigma_{xx})$
(ii) $g(x)$ : continuous and differentiable at $x = \mu_x$
(iii) $h_x(x) = \partial g(x)/\partial x$ : continuous at $x = \mu_x$
$$\Rightarrow \sqrt{n}\,[g(\bar x_n) - g(\mu_x)] \stackrel{L}{\to} N(0, h_x'(\mu_x)\, \Sigma_{xx}\, h_x(\mu_x)).$$
Proof of Theorem 1.3.7

[Step 1] By a Taylor expansion, $g(\bar x_n) = g(\mu_x) + (\bar x_n - \mu_x)\, h_x(\mu_x^*)$, where $\mu_x^*$ is on the line segment joining $\bar x_n$ and $\mu_x$.
[Step 2] Show $\mu_x^* - \mu_x = o_p(1)$.
[Step 3] Using the assumption that $h_x(x)$ is continuous at $x = \mu_x$, show that $h_x(\mu_x^*) - h_x(\mu_x) = o_p(1)$.
[Step 4] Because
$$\sqrt{n}\,[g(\bar x_n) - g(\mu_x)] = \sqrt{n}\,(\bar x_n - \mu_x)\, h_x(\mu_x^*) = \sqrt{n}\,(\bar x_n - \mu_x)\, h_x(\mu_x) + o_p(1),$$
we apply Slutsky's theorem to get the result.
Example on Curvature

$\bar x_n \stackrel{\cdot}{\sim} N(\mu, V(\bar x_n))$; approximation: $\bar x_n^2 \stackrel{\cdot}{\sim} N(\mu^2, (2\mu)^2 V(\bar x_n))$.
Let $\mu = 2$ and $V(\bar x_n) = 0.01$. Then
$$E\{\bar x_n^2\} = 2^2 + 0.01 \doteq \mu^2, \qquad V\{\bar x_n^2\} = 2(0.01)^2 + 4(2^2)(0.01) \doteq 4\mu^2 V(\bar x_n).$$
Let $\mu = 2$ and $V(\bar x_n) = 3$. Then
$$E\{\bar x_n^2\} = 4 + 3 \ne \mu^2, \qquad V\{\bar x_n^2\} = 2(3)^2 + 4(2^2)(3) \ne 4\mu^2 V(\bar x_n).$$
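A quick Monte Carlo check (illustrative numbers only) of the delta-method approximation for $\bar x_n^2$ in the two cases above; the approximation holds when $V(\bar x_n)$ is small and fails when it is large:

import numpy as np

rng = np.random.default_rng(1)
mu = 2.0
for v in (0.01, 3.0):                     # V(x̄_n): small vs large
    xbar = rng.normal(mu, np.sqrt(v), size=200_000)
    approx_var = (2 * mu)**2 * v          # delta-method variance 4 mu^2 V(x̄_n)
    print(f"V(xbar)={v}: E[xbar^2]={np.mean(xbar**2):.3f} vs mu^2={mu**2}, "
          f"V[xbar^2]={np.var(xbar**2):.3f} vs delta approx {approx_var:.3f}")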
Large Sample Bias

$n^{1/2}(\hat\theta - \theta) \stackrel{L}{\to} N(0, 1)$ does not imply $E\{\hat\theta\} \to \theta$.
For example, if $(\bar y_n, \bar x_n) \sim N((\mu_y, \mu_x), \Sigma)$ and $\mu_x \ne 0$, then $E\left(\dfrac{\bar y_n}{\bar x_n}\right)$ is not defined.
Ratio Estimation (Theorem 1.3.8, Part 1)

Theorem
$x_i = (x_{1i}, x_{2i})$, $\bar X_{1N} \ne 0$, $\hat R = \bar x_{2,HT} / \bar x_{1,HT}$
$$\sqrt{n}\, N^{-1}(\hat T_x - T_{xN}) \stackrel{L}{\to} N(0, M_{xx})$$
$$\Rightarrow \sqrt{n}(\hat R - R_N) \stackrel{L}{\to} N(0, h_N M_{xx} h_N'),$$
where $\hat T_x = \sum_{i \in A} \pi_i^{-1} x_i$, $T_{xN} = \sum_{i=1}^N x_i$, $R_N = \bar X_{2N} / \bar X_{1N}$, and $h_N = \bar x_{1N}^{-1}(-R_N, 1)$.
Proof of Theorem 1.3.8, Part 1

$$\hat R = \frac{\bar x_{2,HT}}{\bar x_{1,HT}} = \frac{\bar X_{2N}}{\bar X_{1N}} + \frac{1}{\bar X_{1N}} (\bar x_{2,HT} - \bar X_{2N}) - \frac{\bar X_{2N}}{\bar X_{1N}^2} (\bar x_{1,HT} - \bar X_{1N}) + \text{Remainder}$$
Method 1: Mean value theorem and continuity of the first-order partial derivatives $\Rightarrow$ Remainder $= o_p(n^{-1/2})$:
$$\hat R = R + \left.\frac{\partial R}{\partial x}\right|_{\bar x^*} (\bar x_{HT} - \bar X_N)'$$
Method 2: Second-order Taylor expansion plus continuity of the second-order partial derivatives $\Rightarrow$ Remainder $= O_p(n^{-1})$:
$$\hat R = R + \frac{\partial R}{\partial x}(\bar x_{HT} - \bar X_N)' + \frac{1}{2}(\bar x_{HT} - \bar X_N) \left.\frac{\partial^2 R}{\partial x\, \partial x'}\right|_{\bar x^{**}} (\bar x_{HT} - \bar X_N)'$$
Theorem 1.3.8, Part 2

Theorem
In addition, if $\{V(\hat T_x \mid \mathcal{F}_N)\}^{-1} \hat V_{HT}(\hat T_x) - I = o_p(n^{-1/2})$, then
$$[\hat V(\hat R)]^{-1/2} (\hat R - R_N) \stackrel{L}{\to} N(0, 1),$$
where
$$\hat V(\hat R) = \sum_{i \in A} \sum_{j \in A} \pi_{ij}^{-1} (\pi_{ij} - \pi_i \pi_j)\, \pi_i^{-1} \hat d_i\, \pi_j^{-1} \hat d_j,$$
$\hat d_i = \hat T_{x1}^{-1}(x_{2i} - \hat R x_{1i})$, $\hat T_x = \sum_{i \in A} \pi_i^{-1} x_i$, $T_{xN} = \sum_{i=1}^N x_i$, $R_N = \bar x_{2N} / \bar x_{1N}$.
Remarks on Ratios

1. Variance estimation:
$$\hat V(\hat R) = \hat V\left(\sum_{i \in A} \pi_i^{-1} \hat d_i\right), \qquad \hat d_i = \hat T_{x1}^{-1}(x_{2i} - \hat R x_{1i}).$$
2. If $x_{1i} = 1$ and $x_{2i} = y_i$, then the Hájek estimator
$$\bar y_\pi = \frac{\sum_{i \in A} \pi_i^{-1} y_i}{\sum_{i \in A} \pi_i^{-1}}$$
$$\Rightarrow V(\bar y_\pi - \bar y_N \mid \mathcal{F}) \doteq V\left[N^{-1} \sum_{i \in A} \pi_i^{-1}(y_i - \bar y_N) \mid \mathcal{F}\right],$$
whereas
$$V[N^{-1} \hat T_y \mid \mathcal{F}] = V\left[N^{-1} \sum_{i \in A} \pi_i^{-1} y_i \mid \mathcal{F}\right].$$
Approximations for Complex Estimators

$\hat\theta$ is defined through an estimating equation
$$\sum_{i \in A} w_i\, g(x_i, \hat\theta) = 0.$$
Let
$$\hat G(\theta) = \sum_{i \in A} w_i\, g(x_i; \theta), \qquad G(\theta) = N^{-1} \sum_{i=1}^N g(x_i; \theta),$$
$$\hat H(\theta) = \partial \hat G(\theta)/\partial\theta, \qquad H(\theta) = \partial G(\theta)/\partial\theta.$$
Theorem 1.3.9

Theorem
Under suitable conditions,
$$\sqrt{n}(\hat\theta - \theta_N) \stackrel{L}{\to} N(0, V),$$
where
$$V = n\,[H(\theta_N)]^{-1}\, V\{\hat G(\theta_N) \mid \mathcal{F}_N\}\, [H(\theta_N)]^{-1}.$$
Also, $V(\hat\theta)$ can be estimated by
$$\hat V = n\,[\hat H(\hat\theta)]^{-1}\, \hat V\{\hat G(\hat\theta) \mid \mathcal{F}_N\}\, [\hat H(\hat\theta)]^{-1}.$$
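As a concrete (purely illustrative) instance of solving a weighted estimating equation, the sketch below applies Newton-Raphson to $\hat G(\theta) = \sum_{i \in A} w_i g(x_i, \theta)$ with the assumed choice $g(x, \theta) = x - \theta$, for which the solution is the weighted mean — an easy correctness check on the machinery:

import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.gamma(2.0, 5.0, size=n)       # sampled values (hypothetical)
w = rng.uniform(1.0, 3.0, size=n)     # design weights (hypothetical)

# estimating function g(x, theta) = x - theta (assumed for illustration);
# Newton-Raphson on G_hat(theta) = sum_i w_i g(x_i, theta)
def G_hat(theta):
    return np.sum(w * (x - theta))

def H_hat(theta):                     # derivative of G_hat w.r.t. theta
    return -np.sum(w)

theta = x.mean()                      # starting value
for _ in range(20):
    step = G_hat(theta) / H_hat(theta)
    theta -= step
    if abs(step) < 1e-10:
        break

print(f"solution {theta:.4f}  (weighted-mean check: {np.sum(w*x)/np.sum(w):.4f})")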
Comments

It is difficult to show a CLT for the general HT estimator.
Exception: large number of strata.
The CLT requires:
  Large samples (that is, effectively large)
  Moments: no extreme observations
  No extreme weights
  Functions: curvature small relative to the s.e.

Basic Estimators Using SAS
PROC SURVEYMEANS
PROC SURVEYFREQ
PROC SURVEYMEANS

Univariate analysis: population totals, means, ratios, and quantiles
Variances and confidence limits
Domain analysis
Poststratified analysis
PROC SURVEYFREQ

One-way to n-way frequency and crosstabulation tables
Totals and proportions
Tests of association between variables
Estimates of risk differences, odds ratios, and relative risks
Standard errors and confidence limits
Digitech Cable
Describe:
Satisfaction ratings
Usage time
Satisfaction ratings based on household sizes
PROC SURVEYMEANS

ods graphics on;
proc surveymeans data=ResponseData mean total=tot;
   strata State Type;
   weight SamplingWeight;
   class Rating;
   var Rating UsageTime;
run;
ods graphics off;
The SURVEYMEANS Procedure

Data Summary
Number of Strata          8
Number of Observations    300
Sum of Weights            13471

Statistics
Variable   Level                  Label                Mean        Std Error of Mean
UsageTime                         Computer Usage Time  284.953667  10.904880
Rating     Extremely Unsatisfied  Customer Satisfaction  0.247287  0.024463
Rating     Unsatisfied            Customer Satisfaction  0.235889  0.025091
Rating     Neutral                Customer Satisfaction  0.224797  0.024247
Rating     Satisfied              Customer Satisfaction  0.200509  0.023548
Rating     Extremely Satisfied    Customer Satisfaction  0.091518  0.016986
PROC SURVEYFREQ

proc surveyfreq data=ResponseData;
   strata State Type;
   weight SamplingWeight;
   tables Rating / chisq
          testp=(0.25 0.20 0.20 0.20 0.15);
run;
The SURVEYFREQ Procedure

Customer Satisfaction
Rating                 Frequency  Weighted Frequency  Std Err of Wgt Freq  Percent  Test Percent  Std Err of Percent
Extremely Unsatisfied  70         3154                315.53192            24.7287  25.00         2.4739
Unsatisfied            67         3009                323.64922            23.5889  20.00         2.5376
Neutral                64         2867                312.75943            22.4797  20.00         2.4522
Satisfied              57         2557                303.73618            20.0509  20.00         2.3814
Extremely Satisfied    26         1167                219.09943            9.1518   15.00         1.7178
Total                  284        12754               7.72778E-6           100.000

Frequency Missing = 16
Rao-Scott Chi-Square Test

Pearson Chi-Square      9.1865
Design Correction       0.9857
Rao-Scott Chi-Square    9.3195
DF                      4
Pr > ChiSq              0.0536
F Value                 2.3299
Num DF                  4
Den DF                  1104
Pr > F                  0.0543

Sample Size = 284
Domain Estimation

Domains are subsets of the entire population
Domain sample size is not fixed
Variance estimation should account for random sample sizes in domains
Degrees of freedom are measured using the entire sample
Use the DOMAIN statement
Do NOT use the BY statement for domain analysis
Describe Usage Based on Household Sizes

proc surveymeans data=ResponseData
                 mean total=tot;
   strata State Type;
   weight SamplingWeight;
   var UsageTime;
   domain HouseholdSize;
run;
Describe Rating Based on Household Sizes

proc surveyfreq data=ResponseData;
   strata State Type;
   weight SamplingWeight;
   tables HouseholdSize * Rating /
          plot=all;
run;
Chapter 2
Use of Auxiliary Information in Estimation
World Statistics Congress Short Course
July 23-24, 2015
Ratio Estimation

Population: observe $\bar x_N = N^{-1} \sum_{i=1}^N x_i$
Sample: observe $(\bar x_{HT}, \bar y_{HT}) = N^{-1} \sum_{i \in A} \pi_i^{-1} (x_i, y_i)$
Ratio estimator:
$$\bar y_{rat} = \bar x_N \frac{\bar y_{HT}}{\bar x_{HT}}$$
Let $R_N = \bar x_N^{-1} \bar y_N$ be the population ratio, where $(\bar x_N, \bar y_N) = N^{-1} \sum_{i=1}^N (x_i, y_i)$.
Assume that $(\bar x_{HT}, \bar y_{HT}) - (\bar x_N, \bar y_N) = O_p(n^{-1/2})$.
Assume that $\bar x_N \ne 0$.
Asymptotic Properties of the Ratio Estimator (1)

Linear approximation:
$$\bar y_{rat} - \bar y_N = \bar y_{HT} - R_N \bar x_{HT} + O_p(n^{-1})$$
Proof:
$$\bar y_{rat} - \bar y_N = \bar x_{HT}^{-1} \bar x_N (\bar y_{HT} - R_N \bar x_{HT}) = \left\{1 + O_p(n^{-1/2})\right\} (\bar y_{HT} - R_N \bar x_{HT}) = \bar y_{HT} - R_N \bar x_{HT} + O_p(n^{-1})$$
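A small simulation (an invented population, for illustration only) of the ratio estimator under SRS, showing the variance gain over the sample mean when y is roughly proportional to x and the smallness of the $O(n^{-1})$ bias:

import numpy as np

rng = np.random.default_rng(2)
N, n = 5000, 100
x = rng.gamma(4.0, 2.0, size=N)
y = 3.0 * x + rng.normal(0, np.sqrt(x))       # y roughly proportional to x
xbar_N, ybar_N = x.mean(), y.mean()

est_rat, est_srs = [], []
for _ in range(5000):
    s = rng.choice(N, size=n, replace=False)  # SRS: the HT mean is the sample mean
    est_srs.append(y[s].mean())
    est_rat.append(xbar_N * y[s].mean() / x[s].mean())

print(f"true mean {ybar_N:.3f}")
print(f"SRS mean : bias {np.mean(est_srs)-ybar_N:+.4f}, var {np.var(est_srs):.4f}")
print(f"ratio est: bias {np.mean(est_rat)-ybar_N:+.4f}, var {np.var(est_rat):.4f}")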
Asymptotic Properties of the Ratio Estimator (2)

Bias approximation (uses a second-order Taylor expansion):
$$\bar x_{HT}^{-1} \bar y_{HT} = R_N + \bar x_N^{-1}(\bar y_{HT} - R_N \bar x_{HT}) + \bar x_N^{-2}\left\{R_N (\bar x_{HT} - \bar x_N)^2 - (\bar x_{HT} - \bar x_N)(\bar y_{HT} - \bar y_N)\right\} + O_p(n^{-3/2}).$$
Under moment conditions for $\bar x_{HT}^{-1} \bar y_{HT}$ and $\bar x_N \ne 0$,
$$\mathrm{Bias}(\bar y_{rat}) = E(\bar y_{rat} - \bar y_N) = \bar x_N^{-1}[R_N V(\bar x_{HT}) - \mathrm{Cov}(\bar x_{HT}, \bar y_{HT})] + O(n^{-2}) = O(n^{-1}).$$
Thus the bias of $\hat\theta = \bar y_{rat}$ is negligible because
$$R.B.(\hat\theta) = \frac{\mathrm{Bias}(\hat\theta)}{\sqrt{\mathrm{Var}(\hat\theta)}} \to 0 \quad \text{as } n \to \infty.$$
Asymptotic Properties of the Ratio Estimator (3)

Given the conditions of Theorem 1.3.8,
$$\frac{\hat\theta - \theta}{\sqrt{\mathrm{Var}(\hat\theta)}} \to N(0, 1),$$
and
$$\hat V(\hat\theta) = \left(\frac{\bar x_{HT}}{\bar x_N}\right)^{-2} \hat V(\bar y_{HT} - \hat R \bar x_{HT}).$$
Other Properties of the Ratio Estimator

1. The ratio estimator is the best linear unbiased estimator under
$$y_i = x_i \beta + e_i, \qquad e_i \sim (0, x_i \sigma^2).$$
2. Scale invariant, not location invariant.
3. Linear in y but not design linear.
4. Ratio estimators in a stratified sample ($W_h = N_h / N$):
$$\bar y_{st,s} = \sum_{h=1}^H W_h \bar y_h \frac{\bar x_{hN}}{\bar x_h} \quad \text{: separate ratio estimator}$$
$$\bar y_{st,c} = \left(\sum_{h=1}^H W_h \bar x_{hN}\right) \frac{\sum_{h=1}^H W_h \bar y_h}{\sum_{h=1}^H W_h \bar x_h} \quad \text{: combined ratio estimator}$$
§2.2 Regression Estimation

Sample: observe $(x_i, y_i)$.
Population: observe $x_i = (1, x_{1i})$, $i = 1, 2, \ldots, N$, or only $\bar x_N$.
Interested in estimating $\bar y_N = N^{-1} \sum_{i=1}^N y_i$.
Regression model:
$$y_i = x_i \beta + e_i,$$
$e_i$ independent of $x_j$ for all i and j, $e_i \sim \mathrm{ind}(0, \sigma_e^2)$.
Under the normal model, regression gives the best predictor.
Regression Model: Estimation

$$y_i = x_i \beta + e_i, \qquad e_i \sim \mathrm{ind}(0, \sigma_e^2)$$
Linear estimator: $\bar y_w = \sum_{i \in A} w_i y_i$.
To find the best linear unbiased estimator of $\bar x_N \beta = E\{\bar y_N\}$,
$$\min V\left\{\sum_{i \in A} w_i y_i \mid X, \bar x_N\right\} \ \text{s.t.} \ E\left\{\sum_{i \in A} w_i y_i - \bar y_N \mid X, \bar x_N\right\} = 0$$
$$\Leftrightarrow \min \sum_{i \in A} w_i^2 \ \text{s.t.} \ \sum_{i \in A} w_i x_i = \bar x_N,$$
where $X' = (x_1', x_2', \ldots, x_n')$.
Best Linear Unbiased Estimator

Lagrange multiplier method:
$$Q = \frac{1}{2}\sum_{i \in A} w_i^2 + \lambda'\left(\sum_{i \in A} w_i x_i - \bar x_N\right)'$$
$$\frac{\partial Q}{\partial w_i} = w_i + \lambda' x_i'$$
$$\sum_{i \in A} w_i x_i = \bar x_N \ \Rightarrow \ \lambda' = -\bar x_N \left(\sum_{i \in A} x_i' x_i\right)^{-1}$$
$$\therefore \ w_i = \bar x_N \left(\sum_{i \in A} x_i' x_i\right)^{-1} x_i' = \bar x_N (X'X)^{-1} x_i'$$
The regression estimator is the solution that minimizes the variance in the class of linear estimators that are unbiased under the model.
Properties of the Regression Estimator

1. Linear in y; location-scale invariant.
2. Alternative expression: for $x_i = (1, x_{1i})$,
$$\bar y_{reg} = \bar y_n + (\bar x_{1N} - \bar x_{1n})\hat\beta_1 = \sum_{i \in A} w_i y_i$$
$$\hat\beta_1 = \left[\sum_{i \in A} (x_{1i} - \bar x_{1n})'(x_{1i} - \bar x_{1n})\right]^{-1} \sum_{i \in A} (x_{1i} - \bar x_{1n})' y_i$$
$$w_i = \frac{1}{n} + (\bar x_{1N} - \bar x_{1n})\left[\sum_{i \in A} (x_{1i} - \bar x_{1n})'(x_{1i} - \bar x_{1n})\right]^{-1}(x_{1i} - \bar x_{1n})'$$
3. Writing $\bar y_{reg} = \bar y_n + (\bar x_{1N} - \bar x_{1n})\hat\beta_1 = \hat\beta_0 + \bar x_{1N}\hat\beta_1$, the regression estimator can be viewed as the predicted value of $Y = \beta_0 + x_1\beta_1 + e$ at $x_1 = \bar x_{1N}$ under the regression model.
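A minimal sketch (simulated SRS data, invented model) of the regression estimator $\bar y_{reg} = \bar y_n + (\bar x_{1N} - \bar x_{1n})\hat\beta_1$:

import numpy as np

rng = np.random.default_rng(4)
N, n = 10_000, 50
x1 = rng.normal(5, 2, size=N)
y = 1.0 + 2.0 * x1 + rng.normal(0, 2, size=N)   # linear superpopulation model
x1bar_N = x1.mean()

s = rng.choice(N, size=n, replace=False)        # SRS
xs, ys = x1[s], y[s]
beta1 = np.sum((xs - xs.mean()) * ys) / np.sum((xs - xs.mean())**2)
ybar_reg = ys.mean() + (x1bar_N - xs.mean()) * beta1

print(f"population mean {y.mean():.3f}, sample mean {ys.mean():.3f}, "
      f"regression estimate {ybar_reg:.3f}")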
Example: Artificial Population

[Population plot: y versus x, with the population mean point (3, 4) marked.]

Example (Cont'd): SRS of size n = 20

[Sample plot (n = 20): population mean = 4 at (3, 4); sample mean = 3.42 at (2.61, 3.42); regression estimate = 3.85 at (3, 3.85).]
Properties of the Regression Estimator (2)

4. For the mean-adjusted regression model
$$y_i = \gamma_0 + (x_{1i} - \bar x_{1N})\gamma_1 + e_i,$$
the OLS estimator of $\gamma_0$ is
$$\hat\gamma_0 = \bar y_n - (\bar x_{1n} - \bar x_{1N})\hat\gamma_1,$$
where $\hat\gamma_1 = \hat\beta_1$. That is, $\hat\gamma_0 = \bar y_{reg}$.
5. Under the linear regression model:
(a) $\bar y_{reg}$ is unbiased (by construction):
$$E(\bar y_{reg} - \bar y_N \mid X_N) = E\left\{N^{-1}\sum_{i=1}^N x_i\hat\beta - N^{-1}\sum_{i=1}^N (x_i\beta + e_i) \mid X_N\right\} = N^{-1}\sum_{i=1}^N x_i\beta - N^{-1}\sum_{i=1}^N x_i\beta = 0$$
(because $E(\hat\beta \mid X_N) = \beta$ and $E(e_i \mid X_N) = 0$).
Properties of the Regression Estimator (3)

(b) Variance:
$$V(\bar y_{reg} - \bar y_N \mid X, \bar x_N) = n^{-1}(1 - f)\sigma_e^2 + (\bar x_{1N} - \bar x_{1n})\, V(\hat\beta_1 \mid X)\, (\bar x_{1N} - \bar x_{1n})'$$
$$V(\hat\beta_1 \mid X) = \left[\sum_{i \in A}(x_{1i} - \bar x_{1n})'(x_{1i} - \bar x_{1n})\right]^{-1}\sigma_e^2,$$
because
$$\bar y_{reg} = \bar y_n + (\bar x_{1N} - \bar x_{1n})\hat\beta_1 = \beta_0 + \bar x_{1n}\beta_1 + \bar e_n + (\bar x_{1N} - \bar x_{1n})\beta_1 + (\bar x_{1N} - \bar x_{1n})(\hat\beta_1 - \beta_1) = \beta_0 + \bar x_{1N}\beta_1 + \bar e_n + (\bar x_{1N} - \bar x_{1n})(\hat\beta_1 - \beta_1)$$
and $\bar y_N = \beta_0 + \bar x_{1N}\beta_1 + \bar e_N$
$$\Rightarrow \bar y_{reg} - \bar y_N = \bar e_n - \bar e_N + (\bar x_{1N} - \bar x_{1n})(\hat\beta_1 - \beta_1).$$
Properties of the Regression Estimator (4)

If $x_{1i} \sim$ Normal,
$$V\{\bar y_{reg} - \bar y_N\} = (1 - f)\frac{1}{n}\left(1 + \frac{k}{n - k - 2}\right)\sigma_e^2,$$
where $k = \dim(x_i)$, while $V\{\bar y_n - \bar y_N\} = n^{-1}(1 - f)\sigma_y^2$.
If
$$R_{adj}^2 = 1 - \frac{\sigma_e^2}{\sigma_y^2} \ge \frac{k}{n - 2},$$
then $V\{\bar y_n - \bar y_N\} \ge V\{\bar y_{reg} - \bar y_N\}$.
Best Linear Predictor

Let $y_i$ for $i \in A$ be observations from the model $y_i = x_i\beta + e_i$, with $e_i \sim \mathrm{ind}(0, \sigma_e^2)$.
Predict $\bar y_{N-n} = \bar x_{N-n}\beta + \bar e_{N-n}$. The BLUP of $\bar e_{N-n}$ is 0, and the BLUP (estimator) of $\bar x_{N-n}\beta$ is $\bar x_{N-n}\hat\beta$. Thus, the BLUP of $\bar y_N$ is
$$N^{-1}[n\bar y_n + (N - n)\bar x_{N-n}\hat\beta] = N^{-1}[n\bar x_n\hat\beta + (N - n)\bar x_{N-n}\hat\beta] = \bar x_N\hat\beta,$$
because $\bar y_n - \bar x_n\hat\beta = 0$.
General Population: SRS Regression

Since $\bar y_n - \bar x_n\hat\beta = 0$,
$$\bar y_{reg} - \bar y_N = \bar x_N\hat\beta - \bar y_N = \bar y_n + (\bar x_N - \bar x_n)\hat\beta - \bar y_N = \bar y_n - \bar y_N + (\bar x_N - \bar x_n)\beta_N + (\bar x_N - \bar x_n)(\hat\beta - \beta_N) = \bar a_n - \bar a_N + (\bar x_N - \bar x_n)(\hat\beta - \beta_N),$$
where
$$a_i = y_i - x_i\beta_N, \qquad \bar a_N = 0, \qquad \beta_N = \left(\sum_{i \in U} x_i'x_i\right)^{-1}\sum_{i \in U} x_i'y_i.$$
Bias of the Regression Estimator

The design bias is negligible (assume moments):
$$E[\bar y_{reg} \mid \mathcal{F}_N] = \bar y_N + E\{\bar a_n - \bar a_N \mid \mathcal{F}_N\} + E\{(\bar x_N - \bar x_n)(\hat\beta - \beta_N) \mid \mathcal{F}_N\} = \bar y_N + E\{(\bar x_N - \bar x_n)(\hat\beta - \beta_N) \mid \mathcal{F}_N\}.$$
$$\mathrm{Bias}(\bar y_{reg} \mid \mathcal{F}_N) = -E[(\bar x_n - \bar x_N)(\hat\beta - \beta_N) \mid \mathcal{F}_N] = -\mathrm{tr}\left\{\mathrm{Cov}(\bar x_n', \hat\beta') \mid \mathcal{F}_N\right\}$$
$$[\mathrm{Bias}(\bar y_{reg} \mid \mathcal{F}_N)]^2 \le \sum_{j=1}^k [V(\bar x_{j,n} \mid \mathcal{F}_N)][V(\hat\beta_j \mid \mathcal{F}_N)] = O(n^{-1})\, O(n^{-1})$$
$$\Rightarrow \mathrm{Bias}(\bar y_{reg} \mid \mathcal{F}_N) = O(n^{-1}).$$
Variance of the Approximate Distribution

$$\bar y_{reg} - \bar y_N = \bar a_n - \bar a_N + (\bar x_N - \bar x_n)(\hat\beta - \beta_N) = \bar a_n - \bar a_N + O_p(n^{-1})$$
$$V(\bar a_n \mid \mathcal{F}_N) = (1 - f)\, n^{-1} S_a^2$$
By Theorem 1.3.4, $[V(\bar a_n \mid \mathcal{F}_N)]^{-1/2}(\bar a_n - \bar a_N) \stackrel{L}{\to} N(0, 1)$. Recall $\bar a_N = 0$.
Estimated Variance

$$\hat V\{\bar y_{reg}\} = (1 - f)\, n^{-1}(n - k)^{-1}\sum_{i \in A}\hat a_i^2,$$
where $\hat a_i = y_i - x_i\hat\beta$ and k is the dimension of $x_i$.
$$\sum_{i \in A}\hat a_i^2 = \sum_{i \in A}\left[a_i - x_i(\hat\beta - \beta_N)\right]^2 = \sum_{i \in A}a_i^2 - 2(\hat\beta - \beta_N)'\sum_{i \in A}x_i'a_i + (\hat\beta - \beta_N)'\left(\sum_{i \in A}x_i'x_i\right)(\hat\beta - \beta_N) = \sum_{i \in A}a_i^2 + O_p(1)$$
Limiting Distribution of Coefficients

Theorem 2.2.2 (SRS). Assume $(y_i, x_i)$ iid, with existence of eighth moments.
$$\hat\beta = \hat M_{xx}^{-1}\hat M_{xy} = \left(n^{-1}\sum_{i \in A}x_i'x_i\right)^{-1}n^{-1}\sum_{i \in A}x_i'y_i$$
$$\beta_N = M_{xx,N}^{-1}M_{xy,N} = \left(N^{-1}\sum_{i \in U}x_i'x_i\right)^{-1}N^{-1}\sum_{i \in U}x_i'y_i$$
$$\Rightarrow [V\{\hat\beta - \beta_N \mid \mathcal{F}_N\}]^{-1/2}(\hat\beta - \beta_N) \mid \mathcal{F}_N \stackrel{L}{\to} N(0, I)$$
$$V\{\hat\beta - \beta_N \mid \mathcal{F}_N\} = n^{-1}(1 - f_N)\, M_{xx,N}^{-1}V_{bb,N}M_{xx,N}^{-1}, \qquad V_{bb,N} = N^{-1}\sum_{i \in U}x_i'a_i^2x_i, \quad a_i = y_i - x_i\beta_N.$$
Proof of Theorem 2.2.2

$$\hat\beta - \beta_N = \hat M_{xx}^{-1}\hat M_{xa} =: \hat M_{xx}^{-1}\left(n^{-1}\sum_{i \in A}b_i\right),$$
where $\hat M_{xa} = n^{-1}\sum_{i \in A}x_i'a_i$.
Given the moment conditions,
$$\sqrt{n}\left\{\hat M_{xa} - M_{xa,N}\right\} \stackrel{L}{\to} N[0, (1 - f_N)V_{bb,N}] \quad \text{(Theorem 1.3.4)},$$
where $M_{xa,N} = N^{-1}\sum_{i \in U}x_i'a_i = 0$. Now $\hat M_{xx} \stackrel{p}{\to} M_{xx,N}$ and, by Slutsky's theorem, the result follows.
Regression for General Designs (Theorem 2.2.1)

$\mathcal{F}_N = \{z_1, z_2, \ldots, z_N\}$ where $z_i = (y_i, x_i)$, $Z_n' = (z_1', z_2', \ldots, z_n')$. Define
$$\hat M_{z\phi z} = n^{-1}Z_n'\Phi_n^{-1}Z_n = \begin{pmatrix}\hat M_{x\phi x} & \hat M_{x\phi y}\\ \hat M_{y\phi x} & \hat M_{y\phi y}\end{pmatrix}$$
and $\hat\beta = \hat M_{x\phi x}^{-1}\hat M_{x\phi y}$, where $\Phi_n$ is an $n \times n$ positive definite matrix, e.g. $\Phi_n = (N/n)\,\mathrm{diag}\{\pi_1, \ldots, \pi_n\}$. Assume
(i) $V\{\bar z_{HT} - \bar z_N \mid \mathcal{F}_N\} = O_p(n^{-1})$ a.s., where $\bar z_{HT} = N^{-1}\sum_{i \in A}\pi_i^{-1}z_i$
(ii) $\hat M_{z\phi z} - M_{z\phi z,N} = O_p(n^{-1/2})$ a.s. and $\hat M_{z\phi z}$ nonsingular
(iii) $K_1 < Nn^{-1}\pi_i < K_2$
(iv) $[V\{\bar z_{HT} - \bar z_N \mid \mathcal{F}_N\}]^{-1/2}(\bar z_{HT} - \bar z_N) \stackrel{L}{\to} N(0, I)$
Theorem 2.2.1

Theorem
Given moments and design-consistent HT estimators,
(i) $\hat\beta - \beta_N = M_{x\phi x,N}^{-1}\bar b_{HT}' + O_p(n^{-1})$ a.s., where $\beta_N = M_{xx,N}^{-1}M_{xy,N} = \left(\sum_{i \in U}x_i'x_i\right)^{-1}\sum_{i \in U}x_i'y_i$
(ii) $[\hat V(\hat\beta \mid \mathcal{F}_N)]^{-1/2}(\hat\beta - \beta_N) \stackrel{L}{\to} N(0, I)$,
where $b_i' = n^{-1}N\pi_i\xi_i a_i$, $a_i = y_i - x_i\beta_N$, $\xi_i$ is the i-th column of $X_n'\Phi_n^{-1}$, $\bar b_{HT} = N^{-1}\sum_{i \in A}\pi_i^{-1}b_i$, and
$$\hat V(\hat\beta \mid \mathcal{F}_N) = \hat M_{x\phi x}^{-1}\hat V(\bar b_{HT}')\hat M_{x\phi x}^{-1}.$$
Proof of Theorem 2.2.1

$$\hat\beta - \beta_N = \hat M_{x\phi x}^{-1}(n^{-1}X_n'\Phi_n^{-1}a_n), \qquad n^{-1}X_n'\Phi_n^{-1}a_n = O_p(n^{-1/2}), \qquad \hat M_{x\phi x}^{-1} = M_{x\phi x,N}^{-1} + O_p(n^{-1/2})$$
$$\Rightarrow \hat\beta - \beta_N = M_{x\phi x,N}^{-1}(n^{-1}X_n'\Phi_n^{-1}a_n) + O_p(n^{-1}),$$
$$n^{-1}X_n'\Phi_n^{-1}a_n = n^{-1}\sum_{i \in A}\xi_i a_i = N^{-1}\sum_{i \in A}\pi_i^{-1}b_i' = \bar b_{HT}',$$
where $\xi_i$ is the i-th column of $X_n'\Phi_n^{-1}$ and $b_i' = n^{-1}N\pi_i\xi_i a_i$.
Remarks on Theorem 2.2.1

1. The choice of $\Phi_n$ is arbitrary (i.e., the result in Theorem 2.2.1 holds for any given $\Phi_n$). A simple case is $\Phi_n = (N/n)\,\mathrm{diag}\{\pi_1, \ldots, \pi_n\}$.
2. Variance estimation: $\hat V(\hat\beta) = \hat M_{x\phi x}^{-1}\hat V_{bb}\hat M_{x\phi x}^{-1}$, where $\hat V_{bb}$ is the estimated sampling variance of $\bar b_{HT}'$ calculated with $\hat b_i' = n^{-1}N\pi_i\xi_i\hat a_i$ and $\hat a_i = y_i - x_i\hat\beta$.
3. The result holds for a general regression estimator. That is, the asymptotic normality of $\bar x_N\hat\beta$ follows from the asymptotic normality of $\hat\beta$.
4. Theorem 2.2.1 states the consistency of $\hat\beta$ for $\beta_N$. But we are also interested in the consistency of the estimator of $\bar y_N$.
Theorem 2.2.3: Design Consistency of ȳ_reg for ȳ_N

Theorem
Let $\mathrm{plim}\{\tilde\beta - \beta_N \mid \mathcal{F}_N\} = 0$. Then
$$\mathrm{plim}\{\bar y_{reg} - \bar y_N \mid \mathcal{F}_N\} = 0 \iff \mathrm{plim}_{N\to\infty}\frac{1}{N}\sum_{i=1}^N a_i = 0,$$
where $a_i = y_i - x_i\beta_N$ and $\bar y_{reg} = \bar x_N\tilde\beta$.
Proof: Because $\tilde\beta$ is design consistent for $\beta_N$,
$$\mathrm{plim}\left\{\bar y_N - \bar x_N\tilde\beta \mid \mathcal{F}_N\right\} = \mathrm{plim}\left\{N^{-1}\sum_{i=1}^N(y_i - x_i\tilde\beta) \mid \mathcal{F}_N\right\} = \mathrm{plim}\left\{N^{-1}\sum_{i=1}^N(y_i - x_i\beta_N) \mid \mathcal{F}_N\right\}.$$
Condition for Design Consistency

Corollary (2.2.3.1)
Assume design consistency for $\bar z_{HT}$ and for sample moments. Let $\bar y_{reg} = \bar x_N\hat\beta$, $\hat\beta = (X_n'\Phi_n^{-1}X_n)^{-1}X_n'\Phi_n^{-1}y_n$. If there exists $\gamma_n$ such that
$$X_n\gamma_n = \Phi_n D_\pi^{-1}J_n \tag{1}$$
where $D_\pi = \mathrm{diag}(\pi_1, \pi_2, \ldots, \pi_n)$ and $J_n$ is a column vector of 1's, then $\bar y_{reg}$ is design consistent for $\bar y_N$.
Proof of Corollary 2.2.3.1

By Theorem 2.2.3, we have only to show that $N^{-1}\sum_{i \in U}(y_i - x_i\beta_N) = 0$ in the limit, where $\beta_N = \mathrm{plim}\,\hat\beta$. Now,
$$(y - X_n\hat\beta)'\Phi_n^{-1}X_n = 0 \ \Rightarrow \ (y - X_n\hat\beta)'\Phi_n^{-1}X_n\gamma_n = 0 \ \Rightarrow \ (y - X_n\hat\beta)'D_\pi^{-1}J_n = N(\bar y_{HT} - \bar x_{HT}\hat\beta) = 0$$
$$\Rightarrow \mathrm{plim}\{\bar y_{HT} - \bar x_{HT}\beta_N\} = \mathrm{plim}\{\bar y_{HT} - \bar x_{HT}\hat\beta\} = 0,$$
and $\mathrm{plim}\{\bar y_{HT} - \bar x_{HT}\beta_N\} = \lim N^{-1}\sum_{i \in U}(y_i - x_i\beta_N)$.
Remarks on Corollary 2.2.3.1

1. The condition $\Phi_n D_\pi^{-1}J_n \in \mathcal{C}(X_n)$ is crucial for the design consistency of the regression estimator of the form $\bar y_{reg} = \bar x_N\hat\beta$ with $\hat\beta = (X_n'\Phi_n^{-1}X_n)^{-1}X_n'\Phi_n^{-1}y_n$.
2. If the condition $\Phi_n D_\pi^{-1}J_n \in \mathcal{C}(X_n)$ does not hold, one can expand the $X_n$ matrix by including $z_0 = \Phi_n D_\pi^{-1}J_n$ and use $Z_n = [z_0, X_n]$ to construct the regression estimator:
$$\bar y_{reg} = \bar z_N\hat\gamma = (\bar z_{0N}, \bar x_N)\begin{pmatrix}z_0'\Phi_n^{-1}z_0 & z_0'\Phi_n^{-1}X_n\\ X_n'\Phi_n^{-1}z_0 & X_n'\Phi_n^{-1}X_n\end{pmatrix}^{-1}\begin{pmatrix}z_0'\Phi_n^{-1}y\\ X_n'\Phi_n^{-1}y\end{pmatrix}$$
Examples for $\Phi_n D_\pi^{-1}J_n \in \mathcal{C}(X_n)$

1. $\Phi_n = D_\pi$ and $x_i = (1, x_{1i})$
$$\Rightarrow \bar y_{reg} = \bar y_\pi + (\bar x_{1,N} - \bar x_{1,\pi})\hat\beta_1,$$
where
$$(\bar y_\pi, \bar x_{1,\pi}) = \left(\sum_{i \in A}\pi_i^{-1}\right)^{-1}\sum_{i \in A}\pi_i^{-1}(y_i, x_{1i}),$$
$$\hat\beta_1 = \left[\sum_{i \in A}\pi_i^{-1}(x_{1i} - \bar x_{1,\pi})'(x_{1i} - \bar x_{1,\pi})\right]^{-1}\sum_{i \in A}\pi_i^{-1}(x_{1i} - \bar x_{1,\pi})'y_i.$$
2. $\Phi_n = I_n$, $x_i = (\pi_i^{-1}, x_{1i}) = (w_i, x_{1i})$, $\bar w_N = N^{-1}\sum_{i=1}^N w_i$
$$\Rightarrow \bar y_{reg,\omega} = \bar w_N\bar y_\omega + (\bar x_{1,N} - \bar w_N\bar x_{1,\omega})\hat\beta_{1,ols},$$
$$(\bar y_\omega, \bar x_{1,\omega}) = \left(\sum_{i \in A}\pi_i^{-1}w_i\right)^{-1}\sum_{i \in A}\pi_i^{-1}(y_i, x_{1i}),$$
$$(\hat\beta_{0,ols}, \hat\beta_{1,ols}')' = \left(\sum_{i \in A}x_i'x_i\right)^{-1}\sum_{i \in A}x_i'y_i.$$
Design Optimal Regression Estimator (Theorem 2.2.4)

Theorem
Consider a sequence of populations and designs giving consistent estimators of moments, and consider $\bar y_{reg}(\hat\beta) = \bar y_\pi + (\bar x_{1N} - \bar x_{1\pi})\hat\beta$ for some $\hat\beta$.
(i) $\bar y_{reg}$ is design consistent for $\bar y_N$ for any reasonable $\hat\beta$.
(ii) $\hat\beta^* = [\hat V(\bar x_{1\pi})]^{-1}\hat C(\bar x_{1\pi}, \bar y_{1\pi})$ minimizes the estimated variance of $\bar y_{reg}(\hat\beta)$.
(iii) A CLT for $\bar y_{reg}(\hat\beta^*)$ can be established.
Remarks on Theorem 2.2.4

1. The optimal estimator can be viewed as Rao-Blackwellization based on
$$\begin{pmatrix}\bar x_\pi'\\ \bar y_\pi\end{pmatrix} \sim N\left[\begin{pmatrix}\bar x_N'\\ \bar y_N\end{pmatrix}, \begin{pmatrix}V(\bar x_\pi') & C(\bar x_\pi', \bar y_\pi)\\ C(\bar x_\pi', \bar y_\pi)' & V(\bar y_\pi)\end{pmatrix}\right]$$
2. GLS interpretation:
$$\begin{pmatrix}\bar x_\pi' - \bar x_N'\\ \bar y_\pi\end{pmatrix} = \begin{pmatrix}0\\ 1\end{pmatrix}\bar y_N + \begin{pmatrix}e_1\\ e_2\end{pmatrix}$$
§2.3 Linear Model Prediction

Population model: $y_N = X_N\beta + e_N$, $e_N \sim (0, \Sigma_{ee\,NN})$, $e_N$ independent of $X_N$; assume $\Sigma_{ee\,NN}$ known.
Sample model: $y_A = X_A\beta + e_A$, $e_A \sim (0, \Sigma_{ee\,AA})$.
Best linear predictor (BLUP):
$$\hat\theta = N^{-1}\left[\sum_{i \in A}y_i + \sum_{i \in \bar A}\left\{\hat y_i + \Sigma_{ee\,\bar A A}\Sigma_{ee\,AA}^{-1}(y_A - X_A\hat\beta)\right\}\right],$$
where $\hat y_i = x_i\hat\beta$ and $\hat\beta = (X_A'\Sigma_{ee\,AA}^{-1}X_A)^{-1}X_A'\Sigma_{ee\,AA}^{-1}y_A$.
Q: When is the model-based predictor $\hat\theta$ design consistent?
Model and Design Consistency

Theorem (2.3.1)
If $\Sigma_{ee\,AA}(D_\pi^{-1}J_n - J_n) - \Sigma_{ee\,A\bar A}J_{N-n} \in \mathcal{C}(X_A)$, then
$$\hat\theta = \bar y_{HT} + (\bar x_N - \bar x_{HT})\hat\beta \quad \text{and} \quad \hat\theta - \bar y_N = \bar a_{HT} - \bar a_N + O_p(n^{-1}),$$
where $a_i = y_i - x_i\beta_N$.
Analogous to Corollary 2.2.3.1, for prediction.
If the model satisfies the conditions for design consistency, it is called a full model; otherwise it is called a reduced model (or restricted model).
Model and Design Consistency

General strategy (for a general-purpose survey):
(a) Pick important y.
(b) Find a model $y = X\beta + e$.
(c) Use $\bar y_{reg} = \bar x_N\hat\beta$ (full model) or use $\bar y_{reg,\pi} = \bar y_\pi + (\bar x_N - \bar x_\pi)\hat\beta$, where $\hat\beta = (X_n'\Sigma_{ee}^{-1}X_n)^{-1}(X_n'\Sigma_{ee}^{-1}y_n)$.
If the design consistency condition does not hold, we can expand the $X_A$ matrix by including $z_0$ such as $\Sigma_{ee\,AA}D_\pi^{-1}J_n$, $Z = [z_0, X]$. If $\bar z_{0N}$ is not known, use $\bar y_{reg,\pi}$ of (c).
§2.3.2 Nonlinear Models (All x Values Known)

Superpopulation model:
$$y_i = \alpha(x_i; \theta) + e_i, \quad E(e_i) = 0, \quad e_i \text{ indep. of } x_j \text{ for all } i \text{ and } j.$$
1. $\bar y_{c,reg} = \bar y_{HT} + N^{-1}\sum_{i \in U}\alpha(x_i; \hat\theta) - N^{-1}\sum_{i \in A}\pi_i^{-1}\alpha(x_i; \hat\theta)$
2. $\bar y_{m,reg} = N^{-1}\left[\sum_{i \in A}y_i + \sum_{i \in \bar A}\alpha(x_i; \hat\theta)\right]$
Remark: $\bar y_{c,reg} = \bar y_{m,reg}$ if $\sum_{i \in A}(\pi_i^{-1} - 1)(y_i - \hat y_i) = 0$ (Firth and Bennett, 1998).
Consistency of Nonlinear Regression (Theorem 2.3.2)

Theorem
(i) There exists $\theta_N$ such that $\hat\theta - \theta_N = O_p(n^{-1/2})$, a.s.
(ii) $\alpha(x, \theta)$ is a continuously differentiable function of $\theta$ with derivative uniformly continuous on B, a closed set containing $\theta_N$.
(iii) The partial derivative $h(x_i; \theta) = \partial\alpha(x_i; \theta)/\partial\theta$ satisfies
$$\sup_{\theta \in B}\left|N^{-1}\sum_{i \in A}\pi_i^{-1}h(x_i; \theta) - N^{-1}\sum_{i \in U}h(x_i; \theta)\right| = O_p(n^{-1/2}) \text{ a.s.}$$
$$\Rightarrow \bar y_{c,reg} - \bar y_N = \bar a_{HT} - \bar a_N + O_p(n^{-1}), \quad \text{where } a_i = y_i - \alpha(x_i; \theta_N).$$
Calibration

Minimize $\omega'V\omega$ s.t. $\omega'X = \bar x_N$.
By the Cauchy-Schwarz inequality,
$$(\omega'V\omega)(aX'V^{-1}Xa') \ge (\omega'Xa')^2,$$
with equality iff $\omega'V^{1/2} \propto aX'V^{-1/2}$, i.e. $\omega' = kaX'V^{-1}$ for a constant k. Then
$$\omega'X = kaX'V^{-1}X = \bar x_N \ \Rightarrow \ ka = \bar x_N(X'V^{-1}X)^{-1}$$
$$\therefore \ \omega' = \bar x_N(X'V^{-1}X)^{-1}X'V^{-1} \quad \text{and} \quad \omega'V\omega \ge \bar x_N(X'V^{-1}X)^{-1}\bar x_N'.$$
Note: this minimizes $V(\omega'y \mid X, d)$ s.t. $E(\omega'y - \bar y_N \mid X, d) = 0$.
Alternative Minimization

Lemma
Let $\alpha$ be a given n-dimensional vector, let $\omega_a = \arg\min_\omega \omega'V\omega$ s.t. $\omega'X = \bar x_N$, and let $\omega_b = \arg\min_\omega (\omega - \alpha)'V(\omega - \alpha)$ s.t. $\omega'X = \bar x_N$. If $V\alpha \in \mathcal{C}(X)$, then $\omega_a = \omega_b$.
Proof: Writing $V\alpha = X\lambda$ and using $\omega'X = \bar x_N$,
$$(\omega - \alpha)'V(\omega - \alpha) = \omega'V\omega - \alpha'V\omega - \omega'V\alpha + \alpha'V\alpha = \omega'V\omega - \lambda'X'\omega - \omega'X\lambda + \alpha'V\alpha = \omega'V\omega - 2\lambda'\bar x_N' + \alpha'V\alpha.$$
If $\alpha = D_\pi^{-1}J_n$, then $V\alpha \in \mathcal{C}(X)$ is the condition for design consistency in Corollary 2.2.3.1.
General Objective Function

$$\min \sum_{i \in A}G(\omega_i, \alpha_i) \quad \text{s.t.} \quad \sum_{i \in A}\omega_i x_i = \bar x_N$$
Lagrange multiplier method:
$$g(\omega_i, \alpha_i) - \lambda'x_i' = 0, \quad \text{where } g(\omega_i, \alpha_i) = \frac{\partial G}{\partial\omega_i}$$
$$\Rightarrow \omega_i = g^{-1}(\lambda'x_i', \alpha_i), \quad \text{where } \lambda \text{ solves } \sum_{i \in A}g^{-1}(\lambda'x_i', \alpha_i)\, x_i = \bar x_N.$$
GREG Estimator

$$\min Q(\omega, d) = \sum_{i \in A}d_i\left(\frac{\omega_i}{d_i} - 1\right)^2 q_i \quad \text{s.t.} \quad \sum_{i \in A}\omega_i x_i = \bar x_N.$$
$$\Rightarrow d_i^{-1}(\omega_i - d_i)\, q_i + \lambda'x_i' = 0 \ \Rightarrow \ \omega_i = d_i + \lambda'd_i x_i'/q_i$$
$$\Rightarrow \sum_{i \in A}\omega_i x_i = \sum_{i \in A}d_i x_i + \lambda'\sum_{i \in A}d_i x_i'x_i/q_i$$
$$\therefore \ \lambda' = (\bar x_N - \bar x_{HT})\left(\sum_{i \in A}d_i x_i'x_i/q_i\right)^{-1}$$
$$\therefore \ w_i = d_i + (\bar x_N - \bar x_{HT})\left(\sum_{i \in A}d_i x_i'x_i/q_i\right)^{-1}d_i x_i'/q_i$$
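A minimal sketch (hypothetical data, $q_i = 1$, weights scaled for mean estimation) of the GREG/calibration weights, verifying that the calibrated weights reproduce the known population x-mean:

import numpy as np

rng = np.random.default_rng(6)
n, N = 100, 5000
x = np.column_stack([np.ones(n), rng.gamma(3.0, 2.0, size=n)])  # x_i = (1, x_1i)
d = np.full(n, N / n) / N          # d_i = 1/pi_i, scaled to mean-estimation weights
xbar_N = np.array([1.0, 6.2])      # known population means (assumed)

xbar_HT = d @ x                    # HT estimate of the x-mean
M = (x * d[:, None]).T @ x         # sum_i d_i x_i' x_i  (q_i = 1)
lam = np.linalg.solve(M, xbar_N - xbar_HT)
w = d + (x @ lam) * d              # w_i = d_i + (xbar_N - xbar_HT) M^{-1} d_i x_i'

print("calibration check:", w @ x, "should equal", xbar_N)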
Other Objective Functions

Pseudo empirical likelihood:
$$Q(\omega, d) = -\sum d_i\log\left(\frac{\omega_i}{d_i}\right), \qquad \omega_i = d_i/(1 + x_i\lambda)$$
Kullback-Leibler distance:
$$Q(\omega, d) = \sum \omega_i\log\left(\frac{\omega_i}{d_i}\right), \qquad \omega_i = d_i\exp(x_i\lambda)$$
where $d_i = 1/\pi_i$.
Theorem 2.7.1: Deville and Särndal (1992)

Theorem
Let $G(\omega, \alpha)$ be a continuous convex function with a first derivative that is zero for $\omega = \alpha$. Under some regularity conditions, the solution $\omega_i$ that minimizes
$$\sum_{i \in A}G(\omega_i, \alpha_i) \quad \text{s.t.} \quad \sum_{i \in A}\omega_i x_i = \bar x_N$$
satisfies
$$\sum_{i \in A}\omega_i y_i = \sum_{i \in A}\alpha_i y_i + (\bar x_N - \bar x_\alpha)\hat\beta + O_p(n^{-1}),$$
where
$$\hat\beta = \left(\sum_{i \in A}x_i'x_i/\phi_{ii}\right)^{-1}\sum_{i \in A}x_i'y_i/\phi_{ii}, \qquad \phi_{ii} = \left.\frac{\partial^2 G(\omega_i, \alpha_i)}{\partial\omega_i^2}\right|_{\omega_i = \alpha_i}.$$
Weight Bounds

$\omega_i = d_i + d_i\lambda'x_i'/c_i$ can take negative (or very large) values.
Add $L_1 \le \omega_i \le L_2$ to the constraints $\sum \omega_i x_i = \bar x_N$.
Approaches:
1. Huang and Fuller (1978):
$$Q(w_i, d_i) = \sum d_i\,\Psi\left(\frac{w_i}{d_i}\right), \quad \Psi : \text{Huber function}$$
2. Husain (1969):
$$\min \omega'\omega + \gamma(\omega'X - \bar x_N)\,\Sigma_{\bar x\bar x}^{-1}(\omega'X - \bar x_N)' \text{ for some } \gamma$$
3. Other methods, e.g. quadratic programming.
Comments

Regression estimation is large-sample superior to mean and ratio estimation for $k \ll n$.
Applications require restrictions on the regression weights (e.g., $w_i > 1/N$).
The model estimator is design consistent if $X\gamma = \Sigma_{ee}D_\pi^{-1}J$.

Regression Estimators Using SAS
PROC SURVEYREG
Regression Estimators

The response variable is correlated with a list of auxiliary variables
Population totals for the auxiliary variables are known
Efficient estimators can be constructed by using a linear contrast from a regression model
Digitech Cable

Can you improve the estimate of the average usage time by taking data usage into account?
Average data usage (MB) for the population is available
Data usage for every unit in the sample is available
ESTIMATE Statement

proc surveyreg data=ResponseData
               plot=fit(weight=heatmap shape=hex nbins=20);
   strata State Type;
   weight SamplingWeight;
   model UsageTime = DataUsage;
   estimate 'Regression Estimator'
            intercept 1 DataUsage 4002.14;
run;
The SURVEYREG Procedure

Regression Analysis for Dependent Variable UsageTime

Fit Statistics
R-Square          0.6555
Root MSE          121.56
Denominator DF    292

Estimate
Label                 Estimate  Standard Error  DF   t Value  Pr > |t|
Regression Estimator  279.18    6.9860          292  39.96    <.0001
Poststratification Using SAS
PROC SURVEYMEANS

Strata identification is unknown, but strata totals or percentages are known
Stratification after the sample is observed
Use poststratification to
  produce efficient estimators
  adjust for nonresponse bias
  perform direct standardization
Variance estimators must be adjusted
Digitech Cable

Known distribution of race in the four states
Adjust the distribution of race in the sample to match the population
Estimate the average usage time after adjusting for race
POSTSTRATA Statement

proc surveymeans data=ResponseData
                 mean stderr;
   strata State Type;
   weight SamplingWeight;
   var UsageTime;
   poststrata Race /
              postpct=RacePercent;
run;
Poststratified Estimator

The SURVEYMEANS Procedure

Statistics
Variable   Label                Mean        Std Error of Mean
UsageTime  Computer Usage Time  288.477541  10.612532

A set of poststratification-adjusted weights is created
The variance estimator uses the poststratification information
Store the poststratification-adjusted replicate weights from PROC SURVEYMEANS and use the adjusted replicate weights in other survey procedures
Chapter 3
Use of Auxiliary Information in Design
World Statistics Congress Short Course
July 23-24, 2015
Design Strategy

Find the best strategy (design, estimator) for $\bar y_N$ under the model
$$y_i = x_i\beta + e_i, \quad e_i \sim \mathrm{ind}(0, \gamma_{ii}\sigma^2),$$
with $\bar x_N$, $\gamma_{ii}$ known and $\beta$, $\sigma^2$ unknown.
Estimator class: $\hat\theta = \sum_{i \in A}w_i y_i$, linear in y, with $E\{(\hat\theta - \bar y_N) \mid d, X_N\} = 0$, so $\hat\theta$ is model unbiased and design consistent.
Criterion: anticipated variance
$$AV\{\hat\theta - \bar y_N\} = E\{V(\hat\theta - \bar y_N \mid \mathcal{F})\}$$
Candidate Estimator

$$\hat\theta = N^{-1}\left(\sum_{i \in A}y_i + \sum_{i \in A^c}x_i\hat\beta\right)$$
$$\hat\beta = (X'D_\gamma^{-1}X)^{-1}X'D_\gamma^{-1}y = \left(\sum_{i \in A}x_i'x_i/\gamma_{ii}\right)^{-1}\sum_{i \in A}x_i'y_i/\gamma_{ii}, \qquad D_\gamma = \mathrm{diag}(\gamma_{11}, \gamma_{22}, \ldots, \gamma_{nn})$$
If the vector of $\gamma_{ii}$ is in the column space of X, then $\sum_{i \in A}(y_i - x_i\hat\beta) = 0$ and $\hat\theta = \bar y_{reg} = \bar x_N\hat\beta$.
If $\pi_i \propto \gamma_{ii}^{1/2}$ and the vector of $\gamma_{ii}^{1/2}$ is in the column space of X, then $\bar y_{reg} = \bar y_{HT} + (\bar x_N - \bar x_{HT})\hat\beta$.
Theorem 3.1.1 (Isaki and Fuller, 1982)

Under moment conditions, if $\pi_i \propto \gamma_{ii}^{1/2}$, $\gamma_{ii} = x_i\tau_1$, and $\gamma_{ii}^{1/2} = x_i\tau_2$ for some $\tau_1$ and $\tau_2$, then
$$\lim_{N\to\infty}n\,AV\{\bar y_{reg} - \bar y_N\} = \lim_{N\to\infty}\left[\left(\frac{1}{N}\sum_{i=1}^N\gamma_{ii}^{1/2}\right)^2 - \frac{n}{N^2}\sum_{i=1}^N\gamma_{ii}\right]\sigma^2$$
and
$$\lim_{N\to\infty}n\,[AV\{\bar y_{reg} - \bar y_N\} - AV\{\Psi_l - \bar y_N\}] \le 0$$
for all $\Psi_l \in D_l$ and all $p \in P_c$, where
$$D_l = \left\{\Psi_l : \Psi_l = \sum_{i \in A}\alpha_i y_i \text{ and } E\{(\Psi_l - \bar y_N) \mid d, X\} = 0\right\}$$
and $P_c$ is the class of fixed-sample-size nonreplacement designs with fixed probabilities admitting design-consistent estimators of $\bar y_N$.
Proof of Theorem 3.1.1

For $\Psi_l = \alpha'y$:
$$E\{\Psi_l - \bar y_N \mid d, X_N\} = 0 \ \Leftrightarrow \ \alpha'X = \bar x_N$$
$$V\{\Psi_l - \bar y_N \mid d, X_N\} = (\alpha'D_\gamma\alpha - 2N^{-1}\alpha'D_\gamma J_n + N^{-2}J_N'D_{\gamma N}J_N)\,\sigma^2$$
$$\alpha'D_\gamma J_n = \alpha'X\tau_1 = \bar x_N\tau_1 = N^{-1}J_N'X_N\tau_1 = N^{-1}J_N'D_{\gamma N}J_N$$
$$\therefore \ V\{\Psi_l - \bar y_N \mid d, X_N\} = (\alpha'D_\gamma\alpha - N^{-2}J_N'D_{\gamma N}J_N)\,\sigma^2$$
It is enough to find $\alpha$ that minimizes $\alpha'D_\gamma\alpha$ s.t. $\alpha'X = \bar x_N$:
$$\alpha^{*\prime} = \bar x_N(X'D_\gamma^{-1}X)^{-1}X'D_\gamma^{-1}, \qquad \alpha'D_\gamma\alpha \ge \bar x_N(X'D_\gamma^{-1}X)^{-1}\bar x_N'$$
(see Section 2.7).
Remarks on Theorem 3.1.1 (1)

Under the model $y_i = x_i\beta + e_i$,
$$\bar y_{reg} - \bar y_N = \bar e_{HT} - \bar e_N + O_p(n^{-1}), \qquad AV(\bar y_{reg} - \bar y_N) \doteq AV(\bar e_{HT} - \bar e_N) = N^{-2}\sum_{i=1}^N[(1 - \pi_i)\pi_i^{-1}]\,\gamma_{ii}\sigma^2$$
Minimizing $AV(\bar y_{reg} - \bar y_N)$ s.t. $\sum_{i=1}^N\pi_i = n$ gives $\pi_i = n\gamma_{ii}^{1/2}\big/\sum_{j=1}^N\gamma_{jj}^{1/2}$, and
$$AV(\bar y_{reg} - \bar y_N) \ge N^{-2}\left[n^{-1}\left(\sum_{i=1}^N\gamma_{ii}^{1/2}\right)^2 - \sum_{i=1}^N\gamma_{ii}\right]\sigma^2$$
(the Godambe-Joshi lower bound).
Remarks on Theorem 3.1.1 (2)

For the model $y_i = x_i\beta + e_i$, $e_i \sim (0, \gamma_{ii}\sigma^2)$, the best strategy is $\bar y_{reg} = \bar y_{HT} + (\bar x_N - \bar x_{HT})\hat\beta$ with $\pi_i \propto \gamma_{ii}^{1/2}$.
To achieve a sampling design with $\pi_i \propto \gamma_{ii}^{1/2}$:
1. Use Poisson sampling: n is not fixed (not covered by the Theorem).
2. Use systematic sampling.
3. Use an approximation by stratified random sampling: with $\gamma_{11} \le \gamma_{22} \le \cdots \le \gamma_{NN}$, choose $U_h$ s.t. $\sum_{i \in U_h}\gamma_{ii}^{1/2} \doteq H^{-1}\sum_{i \in U}\gamma_{ii}^{1/2}$.
Stratification, Example 3.1.2

$y_i = \beta_0 + x_i\beta_1 + e_i$, $e_i \sim (0, \sigma_e^2)$ and $x_i = i$, $i = 1, 2, \ldots, N = 1{,}600$;
stratified sampling with $N_h = N/H (= M)$, $n_h = n/H$, $n = 64$.
$$\bar y_{st} = \sum_{h=1}^H\frac{N_h}{N}\bar y_h = \frac{1}{H}\sum_{h=1}^H\bar y_h$$
$$V(\bar y_{st}) = \frac{1}{H^2}\sum_{h=1}^H\frac{1}{n_h}\sigma_{yh}^2 = \frac{1}{n}\sigma_w^2,$$
where $\sigma_w^2 = H^{-1}\sum_{h=1}^H\sigma_{yh}^2$ and
$$\sigma_{yh}^2 = E(y_{hi} - \bar y_h)^2 \doteq \frac{M^2 - 1}{12}\beta_1^2 + \sigma_e^2,$$
with $(M^2 - 1)/12$ the variance of $\{1, 2, \ldots, M\}$.
Example 3.1.2, Continued

Number of    V(ȳ_st)               V{V̂(ȳ_st)}
Strata       ρ² = 0.25  ρ² = 0.9   ρ² = 0.25  ρ² = 0.90
2            81         32         67         7.7
4            77         16         62         2.4
8            75         11         64         1.5
32           75         10         111        2.0

$\rho^2 = 1 - (\sigma_y^2)^{-1}\sigma_e^2$
Remarks on Stratification

1. For efficient point estimation, increase H.
2. $V\{\hat V(\bar y_{st})\}$ depends on $\rho$: $V\{\hat V(\bar y_{st})\}$ can decrease in H and then increase in H.
One per Stratum

A common procedure is to select one unit per stratum and to combine or "collapse" two adjacent strata to form a variance estimation stratum:
$$\hat V_{col}\{\bar y_{col}\} = 0.25(y_1 - y_2)^2, \qquad E\left[\hat V_{col}\{\bar y_{col}\}\right] = 0.25(\mu_1 - \mu_2)^2 + 0.25(\sigma_1^2 + \sigma_2^2)$$
Two-per-stratum design:
$$V\{\bar y_{2,st}\} = 0.125(\mu_1 - \mu_2)^2 + 0.25(\sigma_1^2 + \sigma_2^2)$$
A controlled two-per-stratum design (§3.1.4) can be used to reduce the variance of the two-per-stratum design.
Cluster Sampling

Population of clusters of elements
Clusters may have different sizes
Cluster size can be either known or unknown at the design stage
Clusters are sampling units
Model for Cluster Sampling

$$y_{ij} = \mu_y + b_i + e_{ij}, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, M_i$$
$$b_i \sim (0, \sigma_b^2), \quad e_{ij} \sim (0, \sigma_e^2), \quad M_i \sim (\mu_M, \sigma_M^2),$$
and $b_i$, $e_{ij}$, and $M_i$ are independent.
$$y_i = \sum_{j=1}^{M_i}y_{ij} \sim (M_i\mu_y, \gamma_{ii}), \qquad \gamma_{ii} = V(y_i \mid M_i) = M_i^2\sigma_b^2 + M_i\sigma_e^2$$
Strategies for Mean per Element

1. $\hat\theta_{n,1} = \bar M_n^{-1}\bar y_n$ : SRS
2. $\hat\theta_{n,2} = \bar M_{HT}^{-1}\bar y_{HT}$ : with $\pi_i \propto M_i$
3. $\hat\theta_{n,3} = \bar M_{HT}^{-1}\bar y_{HT}$ : with $\pi_i \propto \gamma_{ii}^{1/2}$
Then $V(\hat\theta_{n,3}) \le V(\hat\theta_{n,2}) \le V(\hat\theta_{n,1})$.
Two-Stage Sampling

Population of N clusters (PSUs)
Select a sample of $n_1$ clusters
Select a sample of $m_i$ elements from the $M_i$ elements in cluster i
Sampling within clusters is independent
A model:
$$y_{ij} = \mu_y + b_i + e_{ij}, \quad b_i \sim (0, \sigma_b^2) \text{ independent of } e_{ij} \sim (0, \sigma_e^2)$$
Estimation of Mean per Element

$$\theta_N = \left(\sum_{i=1}^N M_i\right)^{-1}\sum_{i=1}^N\sum_{j=1}^{M_i}y_{ij}, \qquad \hat\theta_{SRC} = \left(\sum_{i=1}^n M_i\right)^{-1}\sum_{i=1}^n M_i\bar y_{i\cdot}, \quad \text{where } \bar y_{i\cdot} = m_i^{-1}\sum_{j=1}^{m_i}y_{ij}.$$
With equal $M_i$, equal $m_i$, and SRS at both stages,
$$V(\hat\theta_{SRC} - \theta_N) = \frac{1}{n_1}\left(1 - \frac{n_1}{N}\right)\sigma_b^2 + \frac{1}{n_1 m}\left(1 - \frac{n_1 m}{NM}\right)\sigma_e^2 \doteq \frac{1}{n_1}\left(\sigma_b^2 + \frac{1}{m}\sigma_e^2\right)$$
Optimal Allocation

Cost function: $C = c_1 n_1 + c_2 n_1 m$.
Minimize $V\{\hat\theta - \theta_N\}$ s.t. $C = c_1 n_1 + c_2 n_1 m$:
$$m^* = \left(\frac{\sigma_e^2 c_1}{\sigma_b^2 c_2}\right)^{1/2}$$
Proof: C is fixed. Minimize
$$C \cdot V\{\hat\theta - \theta_N\} = \left(n_1^{-1}\sigma_b^2 + n_1^{-1}m^{-1}\sigma_e^2\right)(c_1 n_1 + c_2 n_1 m) = \sigma_b^2 c_1 + \sigma_e^2 c_2 + \sigma_b^2 c_2 m + c_1\sigma_e^2 m^{-1}.$$
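A tiny sketch (all variance components, costs, and budget assumed) of the optimal subsample size and the implied number of clusters:

import numpy as np

sigma_b2, sigma_e2 = 4.0, 36.0   # between- and within-cluster variances (assumed)
c1, c2 = 20.0, 1.0               # per-cluster and per-element costs (assumed)
C = 2000.0                       # total budget (assumed)

m_star = np.sqrt(sigma_e2 * c1 / (sigma_b2 * c2))   # optimal elements per cluster
n1 = C / (c1 + c2 * m_star)                         # clusters affordable at m*
V = (sigma_b2 + sigma_e2 / m_star) / n1             # resulting variance
print(f"m* = {m_star:.2f}, n1 = {n1:.1f}, V = {V:.4f}")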
Two-Phase Sampling

1. Phase one: select $A_1$ from U; observe $x_i$.
2. Phase two: select $A_2$ from $A_1$; observe $(x_i, y_i)$.
$$\pi_{1i} = \Pr[i \in A_1], \qquad \pi_{2i|1i} = \Pr[i \in A_2 \mid i \in A_1], \qquad \pi_{2i} = \pi_{1i}\pi_{2i|1i}$$
Two-Phase Sampling for Stratification
xi
xig
T̂2pr ,st
= (xi1 , · · · , xiG )
1
if i ∈ group g
=
0
otherwise
=
G
X
N̂1g ȳ2πg : reweighted expansion estimator
g =1
N̂1g
=
X
−1
π1i
xig
i∈A1
P
ȳ2πg
=
i∈A2
P
−1 −1
π1i
π2i|1i xig yi
i∈A2
If π2i|1i = f2g
−1 −1
π1i
π2i|1i xig
P
−1
n2g
i∈A2 π1i xig yi
=
, i ∈ group g , then ȳ2πg = P
.
−1
n1g
π
x
i∈A
1i ig
2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
164 / 318
Theorem 3.3.1
Let the second phase sample be stratified random sampling with π2i|1i
fixed and constant within groups. Under moment conditions,
L
[V {ȳ2p,st |FN }]−1/2 (ȳ2p,st − ȳN )|FN → N(0, 1)
where V (ȳ2p,st |FN ) = V (ȳ1π |FN ) + E
ȳ1π =

G
X
1
1
−
n2g
n1g
−1

g =1

X
X
−1 
w1i yi , w1i = 
π1j
i∈A1
2
S̃1ug
2
n1g
2
S̃1ug
|FN



−1
π1i
j∈A1
X
= (n1g − 1)−1
(ui − ū1g )2
i∈A1g
ū1g
−1
= n1g
X
w1i yi
i∈A1g
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
165 / 318
Proof of Theorem 3.3.1
ȳ2p,st − ȳN = (ȳ2p,st − ȳ1π ) + (ȳ1π − ȳN )
L
(i) {V [ȳ2p,st − ȳ1π |FN ]}−1/2 (ȳ2p,st − ȳ1π )|(A1 , FN ) → N(0, 1)
∵ ȳ2p,st − ȳ1π =
G
X
n1g (ū2g − ū1g )
g =1
ū2g
=
1 X
1 X
wi yi , ū1g =
wi yi
n2g
n1g
i∈A2g
V {ȳ2p,st − ȳ1π |A1 , FN } =
G
X
2
n1g
i∈A1g
−1
n2g
−
−1
n1g
2
S̃1ug
g =1
E {ȳ2p,st − ȳ1π |A1 , FN } = 0
L
(ii) V [ȳ1π − ȳN |FN ]−1/2 (ȳ1π − ȳN ) | FN → N(0, 1)
By Theorem 1.3.6 (p 54-55), the result follows
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
166 / 318
Separate Samples with Common Characteristics
Two independent surveys
A1 : observe x
A2 : observe (x, y )
Interested in estimating θ = (x̄N , ȳN )0


1 0
x̄01
0
 x̄2  =  1 0
0 1
ȳ2

e1
V = V  e2
e3
u = Zθ + e




x̄0N
ȳN


e1
+  e2 
e3


V11 0
0
 =  0 V22 V23 
0 V32 V33
= (Z 0 V −1 Z )−1 Z 0 V −1 u
θ̂GLS
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
167 / 318
Composite Estimation
Example(Two Time Periods: O = observed)
Sample
A
B
C
t=1
O
O
t=2
O
O
→ core panel part (detecting change)
supplemental panel survey (cross sectional)
Sample A : ȳ1A , ȳ2A ,
Sample B : ȳ1B ,
Sample C : ȳ2C
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
168 / 318
Two Time Periods GLS
 


1 0 e1
ȳ1B
 e2
 ȳ1A   1 0  ȳ1,N


=

+
 e3
 ȳ2A   0 1  ȳ2,N
0 1
e4
ȳ2C
  −1

nB
0
0
0
e1
−1
−1
 e2   0
nA
nA ρ 0
=
V
−1
 e3   0 n ρ n−1
0
A
A
−1
e4
0
0
0
nC









Composite estimator θ̂ = (Z0 V−1 Z)−1 Z0 V−1 y
Design is complex because more than one item.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
169 / 318
Rejective Sampling
yi = xi β + ei , xi , i = 1, 2, ..., N known
Vxx = V {x̄p } under initial design Pd , with pi initial selection probability.
Procedure: Select sample using Pd
−1
If (x̄p − x̄N )Vxx
(x̄p − x̄N )0 < Kd keep sample.
Otherwise reject and select a new sample
Continue until sample is kept
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
170 / 318
Result for Rejective Sampling
If ȳreg
= z̄N β̂ design consistent under Pd
xi c = (1 − pi )−1 pi , for some c,
then ȳreg
= z̄N β̂ design consistent for Rejection
−1

X
X
zi0 φi pi−2 yi
β̂ = 
zi0 φi pi−2 zi 
i∈Arej
i∈Arej
zi
= (xi , z2i ), z2i design var, eg. stratum indicators
φi
= (1 − pi ) for Poisson
φi
= (Nh − 1)−1 (Nh − nh ) stratification
V̂ {ȳreg } is design consistent for Vrej {ȳreg }
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
171 / 318
Sample Design (Fairy Tale)
Client: Desire estimate of average daily consumption of Cherrios by
females 18-80 who made any purchase at store x between January 1, 2015
and July 1, 2015 (here is list) with a CV of 2%
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
172 / 318
Design Discussion
Objectives ?
What is population of interest ?
What data are needed ?
How are data to be collected?
What has been done before ?
What information (auxiliary data) available ?
How soon must it be done?
Budget
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
173 / 318
7/23-24/2015
174 / 318
BLM Sage Grouse Study
Bureau of Land Management
Rangeland Health
Emphasis on sage grouse habitat
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
Sage Grouse
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
175 / 318
Chapter 3
7/23-24/2015
176 / 318
Sage Grouse
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Sample Frame, Sample Units
Public Land Survey System (PLSS)
Central and western US
Grid system of square miles (sections) (1.7km)
Quarter section 0.5mi on a slide (segment)
Much ownership based on PLSS
ISU has used PLSS as a frame
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
177 / 318
Chapter 3
7/23-24/2015
178 / 318
Low Sage Grouse Density
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Sample Selection
Stratify area (Thiessen polygons)
Select random point locations (two per stratum)
Segment containing point is sample segment
Select observation points in segment
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
179 / 318
Chapter 3
7/23-24/2015
180 / 318
Stratification
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Sample Points
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
181 / 318
7/23-24/2015
182 / 318
Selection Probabilities
Segments vary in size
Segment rate less than 1/300
Segment πi treated as proportional to size
Joint probability treated as πi πj
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
Variables
xhi = Elevation of selected point
yhi = created variable at point (Range health)
Chi = acre size of segment
h: stratum, i: segment
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
183 / 318
7/23-24/2015
184 / 318
Weights
−1
Segment probability = TCh
(nh Chi )
TCh = acre size of stratum
−1
−1
−1
Point weight (acres) = πhi
= [TCh
(nh Chi )]−1 Chi mhi
mhi = number of points observed in segment hi
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
PPS - Fixed Take
Let
One point = one acre
Chi = acre size of segment i in stratum h
m = fixed number of points per segment
Probability of selection (point j in segment i):
P(acreij ) = P(segi )P(acreij | segi )
−1
−1
= Tch
Chi × mChi−1 = Tch
m
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
185 / 318
One Point Per Segment
ȳst
= (136)
−1
68 X
2
X
yhi = 1.9458
h=1 i=1
[V̂ (ȳst )]0.5 =
" 68
X
#0.5
N −2 Nh2 V̂ (ȳh )
h=1
68 X
2
X
−1
(136)
(yhi − ȳh )2
"
=
#0.5
= 0.0192
h=1 i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
186 / 318
Regression Estimator
yhi
= xhi β1 +
H
X
δj,hi β1+j + ehi
j=1
δj,hi
=
x0hi
1
0
if j = h
otherwise.
= (xhi , δ1,hi , δ2,hi , · · · , δH,hi )
ȳreg
= (x̄N − x̄st ) β̂
= ȳst + (x̄N − x̄st )β̂1 .
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
187 / 318
Regression Estimator
x̄N = 0.9878 mi. known, x̄st = 0.9875
ȳreg = ȳst + (x̄N − x̄st )β̂1
= 1.9458 + (0.0003)(0.9788) = 1.9461
(0.0165)
( 68 2
)−1 68 2
XX
XX
2
β̂1 =
(xhi − x̄h )
(xhi − x̄h )(yhi − ȳh )
h=1 i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
h=1 i=1
Chapter 3
7/23-24/2015
188 / 318
Regression Weights
ȳreg
=
68 X
2
X
whi yhi
h=1 i=1
whi
( 68 2
)−1
XX
= n−1 + (x̄N − x̄st )
(xhi − x̄h )2
(xhi − x̄h )
h=1 i=1
V̂ {ȳreg } =
68 X
2
X
whi2
n
o2
yhi − ȳh − (xhi − x̄h )β̂1
h=1 i=1
68 X
2
X
whi2
( 68 2
)−1
XX
= n−1 + (x̄N − x̄st )2
(xhi − x̄h )2
h=1 i=1
h=1 i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
189 / 318
Model Variance
yhj = β0 + β1 (xhi − x̄hN ) + ehi , ehi ∼ (0, σe2 )
#
68 X
2
X
V {ȳreg } = n−1 + (x̄N − x̄st )2 {
(xhi − x̄h )2 }−1 σe2
"
h=1 i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
190 / 318
Sample Number Two
x̄N
ȳreg
= 0.9878
x̄st = 0.9589
= 1.9689 + (0.0289)(1.0203) = 1.9984
#
68 X
2
X
n−1 + (x̄N − x̄st )2 {
(xhi − x̄h )2 }−1 σe2
"
V {ȳreg } =
h=1 i=1
−2
V̂ {ȳreg } = {0.7353 + 0.0656}10
σ̂e2
= (67)
−1
= (0.0172)2
68 X
2
X
{yhi − ȳh − (xhi − x̄h )β̂1 }2 = 1.000.
h=1 i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
7/23-24/2015
191 / 318
7/23-24/2015
192 / 318
Comments on Design
Stratification
Model for design
Error model – selection probabilities
Avoid “Over design”
Variance estimation
Simple - explanation for users
Rejective - avoids “bad” samples
Data will be used for more than designed for
Be prepared (budget cut, use sampling again).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 3
Chapter 4
Replication Variance Estimation
World Statistics Congress Short Course
July 23-24, 2015
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
193 / 318
7/23-24/2015
194 / 318
Jackknife Variance Estimation
Create a new sample by deleting one observation
nx̄ − xk
n−1
xk − x̄
= −
n−1
x̄ (k) =
x̄ (k) − x̄
n
n
k=1
k=1
X
n − 1 X (k)
1
2
∴
(x̄ − x̄) =
(xk − x̄)2 = n−1 sx2
n
n(n − 1)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
Alternative Jackknife Weights
(k)
x̄ψ
= ψxk + (1 − ψ)x̄ (k)
(k)
= ψ(xk − x̄) + (1 − ψ)(x̄ (k) − x̄)
(k)
= (ψ −
x̄ψ − x̄
x̄ψ − x̄
n
X
(k)
(x̄ψ − x̄)2 =
k=1
nψ − 1
1−ψ
)(xk − x̄) = (
)(xk − x̄)
n−1
n−1
n
(nψ − 1)2 X
2
(x
−
x̄)
k
(n − 1)2
k=1
n
X
n
−
1
1
(k)
If (nψ − 1)2 =
(1 − f ), then
(x̄ψ − x̄)2 = (1 − f )sx2 .
n
n
k=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
195 / 318
Random Group Jackknife
n = mb : m groups of size b, x̄1 , ..., x̄m
m
1 X
x̄ =
x¯i
m
i=1
m
V̂ (x̄) =
1 1 X
(x¯i − x̄)2
mm−1
i=1
(k)
x̄b
x̄n
(k)
=
− x̄
=
nx̄ − bx̄k
mx̄ − x̄k
=
n−b
m−1
nx̄ − bx̄k
1
− x̄ = −
(x̄k − x̄)
n−b
m−1
m
m
k=1
k=1
X
1
m − 1 X (k)
2
V̂RGJK (x̄) ≡
(x̄b − x̄) =
(x̄k − x̄)2
m
m(m − 1)
Unbiased but d.f . = m − 1.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
196 / 318
Theorem 4.1.1
Theorem
FN
= {y1 , . . . , yN } : sequence of finite population
yi
∼
g (·)
:
iid(µy , σy2 ) with finite 4 + δ moments
continuous function with continuous first derivative at µy
n
n−1X
0
⇒
{g (ȳ (k) ) − g (ȳ )}2 = [g (ȳ )]2 V̂ (ȳ ) + op (n−1 )
n
k=1
where V̂ (ȳ ) = n−1 s 2 and g 0 (ȳ ) =
∂g (ȳ )
.
∂ ȳ
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
197 / 318
Proof of Theorem 4.1.1
By a Taylor linearization,
0
0
g (ȳ (k) ) = g (ȳ ) + g (ȳk∗ )(ȳ (k) − ȳ ) = g (ȳ ) + g (ȳ )(ȳ (k) − ȳ ) + Rnk (ȳ (k) − ȳ )
for some ȳk∗ ∈ Bδk (ȳ ), δk =k ȳ (k) − ȳ k where Rnk = g 0 (ȳk∗ ) − g 0 (ȳ ). Thus,
n n
n
o2
X
X
0 ∗ 2 (k)
=
g (ȳ (k) ) − g (ȳ )
g (ȳk ) (ȳ − ȳ )2
k=1
=
k=1
n
X
0
2
g (ȳ ) + Rnk (ȳ (k) − ȳ )2
k=1
(i) max1≤k≤n |ȳk∗ − ȳ | → 0 in probability.
0
0
(ii) max1≤k≤n |g (ȳk∗ ) − g (ȳ )| → 0 in probability.
n
∴
n−1X
0
{g (ȳ (k) ) − g (ȳ )}2 = [g (ȳ )]2 V̂ (ȳ ) + op (n−1 )
n
k=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
198 / 318
Remainder with Second Derivatives
If g (·) has continuous second derivatives at µy , then
P
n−1 (n − 1) nk=1 [g (ȳ (k) ) − g (ȳ )]2 = [g 0 (ȳ )]2 V̂ (ȳ ) + Op (n−2 ).
Proof :
[g (ȳ (k) ) − g (ȳ )]2 = [g 0 (ȳk∗ )]2 (ȳ (k) − ȳ )2
g 0 (ȳk∗ )2 = [g 0 (ȳ )]2 + 2[g 0 (ȳk∗∗ )]g 00 (ȳk∗∗ )(ȳk∗ − ȳ )
⇒
[g 0 (ȳk∗ )]2 = [g 0 (ȳ )]2 + K1 |ȳk∗ − ȳ |
for some K1 . Thus, since ȳk∗ − ȳ = Op (n−1 ), we have the result.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
199 / 318
Jackknife Often Larger than Taylor
R̂ = x̄ −1 ȳ
R̂ (k) − R̂ = [x̄ (k) ]−1 ȳ (k) − x̄ −1 ȳ
= −[x̄ (k) ]−1 (yk − R̂xk )(n − 1)−1
n
n − 1 X (k)
(R̂ − R̂)2
V̂JK (R̂) =
n
k=1
n
=
X
1
[x̄ (k) ]−2 (yk − R̂xk )2
n(n − 1)
k=1
n
vs. V̂L (R̂) =
X
1
(x̄)−2 (yk − R̂xk )2
n(n − 1)
k=1
E [(x̄ (k) )−2 ] ≥ E [(x̄)−2 ]
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
200 / 318
Quantiles
ξp = Q(p) = F −1 (p) , p ∈ (0, 1) where F (y ) is cdf
ξˆp = Q̂(p) = inf {F̂ (y ) ≥ p}, p = 0.5 for median
y
!−1
X
X
F̂ (y ) =
wi
wi I (yi ≤ y )
i∈A
i∈A
To reduce the bias, use interpolation:
ξˆp = ξˆp0 +
ξˆp1 − ξˆp0
{p − F̂ (ξˆp0 )}
ˆ
ˆ
F̂ (ξp1 ) − F̂ (ξp0 )
where ξˆp1 = inf x1 ,...,xn {x; F̂ (x) ≥ p} & ξˆp0 = supx1 ,...,xn {x; F̂ (x) ≤ p}
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
201 / 318
Test Inversion for Quantile C.I.
Construct acceptance region for H0 : p = p0
{p0 − 2[V̂ (p̂0 )]1/2 , p0 + 2[V̂ (p̂0 )]1/2 }
Invert p-interval to give C.I. for ξp0
{Q̂(p0 − 2[V̂ (p̂0 )]1/2 ), Q̂(p0 + 2[V̂ (p̂0 )]1/2 )}
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
202 / 318
Plots of CDF and Inverse CDF
CDF (F)
Inverse CDF (Q)
ζp
p
ζp
p
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
203 / 318
Bahadur Representation
Let F̂ (x) be an unbiased estimator of F (x), the population CDF of X .
For given p ∈ (0, 1), we can define ζp = F −1 (p) to be the p-th
population quantile of X . Let ζ̂p = F̂ −1 (p) be the p-th sample
quantile of X using F̂ . Also, define p̂ = F̂ (ζp ).
Bahadur (1966):
ζp = F̂ −1 (p̂)
d F̂ −1 (p)
−1
∼
(p̂ − p)
= F̂ (p) +
dp
dF −1 (p)
−1
∼
(p̂ − p)
= F̂ (p) +
dp
1
= ζ̂p +
(p̂ − p).
f (ζp )
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
204 / 318
Variance Estimation for Quantiles (1)
Bahadur Representation
√ ˆ
p(1 − p)
(SRS)
n(ξp − ξp ) → N 0, 0
[F (ξp )]2
.
V [F (ξˆb )] =
Q̂ 0 (p) = [F̂ 0 (ξp )]−1
∂F
∂ξ
2
V (ξˆp )
q
p
Q̂(p̂ + 2 V (p̂)) − Q̂(p̂ − 2 V̂ (p̂))
q
=
=: γ̂
p
(p̂ + 2 V̂ (p̂)) − (p̂ − 2 V (p̂))
V̂ (ξˆp ) = γ̂ 2 V̂ {F̂ (ξp )}
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
205 / 318
7/23-24/2015
206 / 318
Variance Estimation for Quantiles (2)
1
Jackknife variance estimator is not consistent.
Median θ̂ = 0.5(xm + xm+1 ) for n=2m even
V̂JK = 0.25(n − 1)[xm+1 − xm ]2
2
Bootstrap and BRR are O.K.
3
Smoothed quantile
ξˆp = ξˆp0 + γ̂[p − F̂ (ξˆp0 )]
(k)
ξˆp
= ξˆp0 + γ̂[p − F̂ (k) (ξˆp0 )]
X
(k)
V̂JK (ξˆp ) =
ck (ξˆp − ξˆp )2
k
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
§4.4 Two-Phase Samples
ȳ2p,reg
β̂2
= ȳ2p,REE = ȳ2π + (x̄1π − x̄2π )β̂2

−1
X
 X
0
=
w2i (xi − x̄2π ) (xi − x̄2π )
w2i (xi − x̄2π )0 yi


i∈A2
i∈A2
−1
w2i−1 = π1i π2i|1i , w1i = π1i
V̂JK (ȳ2p,REE ) =
L
X
h
i2
(k)
ck ȳ2p,REE − ȳ2p,REE
k=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
207 / 318
Two-Phase Samples
(k)
(k)
−1
where w2i = (w1i )(π2i|1i
)
−1 

(k)
x̄1π
= 
X
(k)
w1i 
X

= 
−1 
X
(k)
w2i 
i∈A2
(k)
=

X

(k)
w2i (xi , yi )
i∈A2
−1

β̂2
(k)
w1i xi 
i∈A1
i∈A1

(k) (k)
(x̄2π , ȳ2π )

X
(k)
(k)
(k)
w2i (xi − x̄2π )0 (xi − x̄2π )
i∈A2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
X
(k)
(k)
(k)
w2i (xi − x̄2π )0 (yi − ȳ2π )
i∈A2
Chapter 4
7/23-24/2015
208 / 318
Theorem 4.2.1 Kim, Navarro, Fuller (2006 JASA),
Assumptions
(i) Second phase is stratified with π2i|1i constant within group
(ii) KL < Nn−1 π1i < KU for some KL &KU
(iii) V {T̂1y |F} ≤ KM V {T̂1y ,SRS } where T̂1y =
(iv) nV {T̂1y |F} =
PN PN
i=1

(v) E 
!2
V̂1 (T̂1y )
V (T̂1y |F)
−1
j=1 Ωij yi yj where
P
i∈A1
PN
−1
π1i
yi
i=1 |Ωij |
= O(N −1 )

|F  = o(1)
(k)
(vi) E {[ck (T̂1y − T̂1y )2 ]2 |F} < KL L−2 [V (T̂1y )]2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
209 / 318
Theorem 4.2.1 Kim, Navarro, Fuller (2006 JASA), Result
N
1 X −1
κ2i (1 − κ2i )ei2 + op (n−1 )
⇒ V̂ {ȳ2p,reg } = V (ȳ2p,reg |F) − 2
N
i=1
where κ2i
ei
βN
= π2i|1i
= yi − ȳN − (xi − x̄)βN
" N
#−1 N
X
X
=
(xi − x̄N )0 (xi − x̄N )
(xi − x̄N )0 yi .
i=1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
i=1
Chapter 4
7/23-24/2015
210 / 318
Variance Estimation
Replication is computationally efficient for large surveys, simple for
users
Jackknife works if Taylor appropriate
Grouping for computational efficiency
Theoretical improvement for quantiles
Problem with rare items
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
211 / 318
Variance Estimation Using SAS
Always use weights, strata, clusters, and domains
Taylor series linearization
Replication variance estimation
Balanced repeated replication (BRR)
Jackknife repeated replication (delete-one jackknife)
User-specified replicate weights
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
212 / 318
Taylor Series Linearization Variance Estimation
Use first-stage sampling units (PSUs)
Compute pooled variance from strata
Compute stratum variance based on cluster (PSU)
totals
Use the VARMETHOD=TAYLOR option
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
213 / 318
Taylor Series Linearization Variance Estimation
X
X
−1
V̂ (θ̂) =
(nh − 1) nh (1 − fh )
(ehi+ − ēh.. )2
i
h
ehi+

−1
X X

=
whij  whij (yhij − ȳ... )
j
hij
−1

erc,hi+ =
X
j
Kim & Fuller & Mukhopadhyay (ISU & SAS)
X

whij 
whij (δrc,hij − P̂rc )
hij
Chapter 4
7/23-24/2015
214 / 318
VARMETHOD=TAYLOR
p r o c s u r v e y f r e q d a t a=R e s p o n s e D a t a
varmethod=T a y l o r t o t a l=t o t ;
s t r a t a S t a t e Type ;
weight SamplingWeight ;
t a b l e s Rating ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
215 / 318
Replication Variance Estimation Using SAS
Methods include delete-one jackknife, BRR, and
user-specified replicate weights
The quantity of interest is computed for every
replicate subsample, and the deviation from the full
sample estimate is measured
Design information is not necessary if the replicate
weights are supplied
Use the VARMETHOD=JACKKNIFE | BRR option
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
216 / 318
Replication Variance Estimation Using SAS
Create R replicate samples based on the replication method specified
For any statistic θ, compute θ̂ for the full sample and θ̂(r ) for every
replicate sample
The replication variance estimator is
X
V̂ (θ̂) =
αr (θ̂(r ) − θ̂)2
r
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
217 / 318
SAS Statements and Options
VARMETHOD= TAYLOR | JACKKNIFE | BRR
OUTWEIGHT=
OUTJKCOEFS=
REPWEIGHTS statement
JKCOEFS=
TOTAL | RATE =
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
218 / 318
Create Replicate Weights
p r o c s u r v e y m e a n s d a t a=R e s p o n s e D a t a
varmethod=j a c k k n i f e
( o u t w e i g h t=ResDataJK
o u t j k c o e f s=ResJKCoef ) ;
s t r a t a S t a t e Type ;
weight SamplingWeight ;
v a r UsageTime ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
219 / 318
The SURVEYMEANS Procedure
Variance Estimation
Method
Jackknife
Number of Replicates
300
Statistics
Variable
Label
UsageTime
Computer Usage Time
Kim & Fuller & Mukhopadhyay (ISU & SAS)
N
Mean
Std Error
of Mean
300
284.953667
11.028403
Chapter 4
95% CL for Mean
263.248431
306.658903
7/23-24/2015
220 / 318
Use Replicate Weights
p r o c s u r v e y f r e q d a t a=ResDataJK ;
weight SamplingWeight ;
t a b l e s Rating / chisq
t e s t p =(0.25 0 . 2 0 0 . 2 0 0 . 2 0 0 . 1 5 ) ;
t a b l e s Recommend ;
r e p w e i g h t s RepWt : /
j k c o e f s=ResJKCoef ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
221 / 318
The SURVEYFREQ Procedure
Variance Estimation
Method
Jackknife
Replicate Weights
RESDATAJK
Number of Replicates
300
Customer Satisfaction
Frequency
Weighted
Frequency
Extremely
Unsatisfied
70
3154
318.59453
Unsatisfied
67
3009
Neutral
64
Satisfied
Extremely
Satisfied
Rating
Total
Std Err of
Wgt Freq Percent
Test
Percent
Std Err of
Percent
24.7287
25.00
2.4800
326.36987
23.5889
20.00
2.5368
2867
316.40091
22.4797
20.00
2.4535
57
2557
305.15457
20.0509
20.00
2.3809
26
1167
219.47434
9.1518
15.00
1.7185
284
12754
173.34658
100.000
Frequency Missing = 16
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
222 / 318
Recommend
Recommend
Frequency
Weighted
Frequency
Std Err of
Wgt Freq Percent
Std Err of
Percent
0
171
7683
385.01543
59.1915
2.8930
1
118
5297
380.02703
40.8085
2.8930
Total
289
12979
144.14289
100.000
Frequency Missing = 11
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 4
7/23-24/2015
223 / 318
Chapter 5
Models Used in Conjunction with Sampling
World Statistics Congress Short Course
July 23-24, 2015
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
224 / 318
Nonresponse
Unit Nonresponse: weight adjustment
Item Nonresponse: imputation
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
225 / 318
Two-Phase Setup for Item Nonresponse
Phase one (A): Observe xi
Phase two (AR ) : Observe (xi , yi )
π1i
π2i|1i
= Pr[i ∈ A] : inclusion probability phase one (known)
= Pr[i ∈ AR |i ∈ A] : inclusion probability phase two (unknown)
Response indicator variable:
1 if i ∈ AR
Ri =
for i ∈ A
0 if i ∈
/ AR
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
226 / 318
Two-Phase Setup for Unit Nonresponse
We are interested in estimating the population mean of Y using
weighted mean of the observations:
P
i∈A wi yi
ȳR = P R
i∈AR wi
−1 −1
where wi = π1i
π̂2i|1i
Regression weighting approach
ȳreg ,1 = x̄N β̂ or ȳreg ,2 = x̄1 β̂
P
−1 −1 P
−1
where x̄1 = ( i∈A π1i
) ( i∈A π1i
xi ) and
β̂ = (
X
−1 0
π1i
xi xi )−1 (
i∈AR
Kim & Fuller & Mukhopadhyay (ISU & SAS)
X
−1 0
π1i
xi yi ).
i∈AR
Chapter 5
7/23-24/2015
227 / 318
7/23-24/2015
228 / 318
Theorem 5.1.1
Theorem
(i) V [N −1
−1
i∈A π1i (xi , yi )|F]
)|F] = O(n−3 )
P
(ii) V [V̂ (ȲHT
(iii) KL < π2i|1i < KU ,
= O(n−1 )
−1
π2i|1i
= xi α for some α,
(iv) xi λ = 1 for all i for some λ,
(iv) Ri : independent
⇒ ȳreg ,1 − ȳN =
1 X −1
π2i ei + Op (n−1 ),
N
i∈AR
where π2i = π1i π2i|1i , ei = yi − xi βN , and
P
P
βN = ( i∈U π2i|1i x0i xi )−1 i∈U π2i|1i x0i yi .
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
Proof of Theorem 5.1.1
P
P
−1 0
−1 0
Since β̂ = ( i∈AR π1i
xi xi )−1 i∈AR π1i
xi y i ,
β̂ − βN = Op (n−1/2 )
P
P
where βN = ( i∈U π2i|1i x0i xi )−1 i∈U π2i|1i x0i yi .
ȳreg ,1 − ȳN
= x̄N β̂ − x̄N β
N
N
X
X
−1
∵
(yi − xi βN ) =
π2i|1i
π2i|1i (yi − xi βN )
i=1
=
i=1
N
X
(α0 x0i )π2i|1i (yi − xi βN ) = 0
i=1
−1
= xi α, transform xi to show
Use π2i|1i
x̄N (β̂ − βN ) = N
−1
X
−1
ei + Op (n−1 ).
π2i
i∈AR
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
229 / 318
Variance Estimation for ȳreg ,1

ȳreg ,1 =
X
i∈AR
−1

x̄N 
X
−1 0 
π1j
xj xj

−1 0 
π1i
xi y i
j∈AR
1 X 1 1
=:
yi
N
π1i π̂2i|1i
i∈AR
−1
Small f = n/N, let b̂j = π̂2j|1j
êj , êj = yj − xj β̂.
V̂ =
1 X X π1ij − π1i π1j b̂i b̂j
N2
π1ij
π1i π1j
i∈AR j∈AR
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
230 / 318
Justification
Variance


X

XX
V
w2i ei | F
=
(π2ij − π2i π2j ) w2i w2j ei ej


i∈AR
i∈U j∈U
X
=
π2i|1i π2j|1j (π1ij − π1i π1j )w2i w2j ei ej
i6=j;i,j∈U
+
X
2
(π2i − π2i
)w2i2 ei2
i∈U
where
π2ij =
Kim & Fuller & Mukhopadhyay (ISU & SAS)
π1ij π2i|1i π2j|1j
π1i π2i|1i
for i 6= j
for i = j.
Chapter 5
7/23-24/2015
231 / 318
Justification (Cont’d)
Expectation of variance estimator



X X
−1
E
π1ij (π1ij − π1i π1j )w2i ei w2j ej | F


i∈AR j∈AR
X
2
=
(π1i − π1i
)π2i|1i w2i2 ei2
i∈U
+
X
π2i|1i π2j|1j (π1ij − π1i π1j )w2i ei w2j ej
i6=j;i,j∈U
XX
=
(π2ij − π2i π2j )w2i ei w2j ej
i∈U j∈U
+
X
π2i (π2i − π1i )w2i2 ei2 ,
i∈U
−1
where w2i = N −1 π2i
. The second term is the bias of the variance
estimator and it is of order O(N −1 ).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
232 / 318
Variance Estimation for ȳreg ,2
ȳreg ,2 = x̄1 β̂
ȳreg ,2 − ȳN
= (x̄1 − x̄N )βN + x̄N (β̂ − βN ) + Op (n−1 )
X
−1
−1
= (x̄1 − x̄N )βN + N
π2i
(yi − xi βN ) + Op (n−1 ).
i∈AR
Variance estimator
1 X X π1ij − π1i π1j b̂i2 b̂j2
V̂2 = 2
N
π1ij
π1i π1j
i∈A j∈A
where b̂j2 = (xj − x̄1 )β̂ + (Nx̄1 )
Kim & Fuller & Mukhopadhyay (ISU & SAS)
P
−1 0
i∈AR π1i xi xi
Chapter 5
−1
Rj x0j êj .
7/23-24/2015
233 / 318
Imputation
Fill in missing values with plausible values
Provides a complete data file: we can apply standard complete data
methods
By filling in missing values, analyses by different users will be consistent
Good imputation model reduces the nonresponse bias
Makes full use of information
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
234 / 318
A Hot Deck Imputation Procedure
Partition the sample into G groups: A = A1 ∪ A2 ∪ · · · ∪ AG .
In group g , we have ng elements, rg respondents, and mg = ng − rg
nonrespondents.
For each group Ag , select mg imputed values from rg respondents
with replacement (or without replacement).
Imputation model: yi ∼ iid(µg , σg2 ), i ∈ Ag (respondents and
missing)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
235 / 318
Example 5.2.1: Hot Deck Imputation Under SRS
- yi : study variable. subject to missing
- xi : auxiliary variable. always observed (Group indicator)
- Ii : sampling indicator function for unit i
- Ri : response indicator function for yi
- yi∗ : imputed value
Ag = ARg ∪ AMg with ARg = {i ∈ Ag ; Ri = 1} and
AMg = {i ∈ Ag ; Ri = 0}.
Imputation: yj∗ = yi with probability 1/rg for i ∈ ARg and j ∈ AMg .
Imputed estimator of ȳN :
X
X
−1
∗
−1
ȳI = n
{Ri yi + (1 − Ri ) yi } =: n
yIi
i∈A
Kim & Fuller & Mukhopadhyay (ISU & SAS)
i∈A
Chapter 5
7/23-24/2015
236 / 318
Variance of Hot Deck Imputed Mean
V (ȳI ) = V {EI (ȳI | yn )} + E {VI (ȳI | yn )}




G
G



X
X
2 
−2
−1
−1
mg 1 − rg SRg
ng ȳRg + E n
= V n




g =1
g =1
P
P
2
2 = (r − 1)−1
where ȳRg = rg−1 i∈ARg yi and SRg
g
i∈ARg (yi − ȳRg ) ,
yn = (y1 , y2 , ..., yn )
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
237 / 318
Variance of Hot Deck Imputed Sample (2)
Model : yi | i ∈ Ag ∼ iid(µg , σg2 )
V {ȳI } = V {ȳn } + n−2
G
X
ng mg rg−1 σg2 + n−2
g =1
= V {ȳn } + n
−2
G
X
G
X
mg (1 − rg−1 )σg2
g =1
cg σg2
g =1
Reduced sample size: n−2 ng2 (rg−1 − ng−1 )σg2
Randomness due to stochastic imputation: n−2 mg (1 − rg−1 )σg2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
238 / 318
Variance Estimation
Naive approach: Treat imputed values as if observed
Naive approach underestimates the true variance!
Example: Naive: V̂I = n−1 SI2
n
X
−1
(n − 1)
(yIi − ȳI )2
(
E SI2
= E
)
i=1
.
= (n − 1)−1 E {(yIi − µ)2 } − V {ȳI }
.
= E (Sy2,n )
Bias corrected estimator V̂ = V̂I +
G
X
2
cg SRg
g =1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
239 / 318
Other Approaches for Variance Estimation
Multiple imputation: Rubin (1987)
Adjusted jackknife: Rao and Shao (1992)
Fractional imputation: Kim and Fuller (2004), Fuller and Kim (2005)
Linearization: Shao and Steel (1999), Kim and Rao (2009)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
240 / 318
Fractional Imputation
Basic Idea
Split the record with missing item into M imputed values
Assign fractional weights
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
241 / 318
Fractional imputation
Features
Split the record with missing item into m(> 1) imputed values
Assign fractional weights
The final product is a single data file with size ≤ nm.
For variance estimation, the fractional weights are replicated.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
242 / 318
Fractional imputation
Example (n = 10)
ID Weight
1
w1
2
w2
3
w3
4
w4
5
w5
6
w6
7
w7
8
w8
9
w9
10
w10
?: Missing
y1
y1,1
y2,1
?
y4,1
y5,1
y6,1
?
?
y9,1
y10,2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
y2
y1,2
?
y3,2
y4,2
y5,2
y6,2
y7,2
?
y9,2
y10,2
Chapter 5
7/23-24/2015
243 / 318
Fractional imputation (categorical case)
Fully Efficient Fractional Imputation (FEFI)
If both y1 and y2 are categorical, then fractional imputation is easy to
apply.
We have only finite number of possible values.
Imputed values = possible values
The fractional weights are the conditional probabilities of the possible
values given the observations.
Can use “EM by weighting” method of Ibrahim (1990) to compute
the fractional weights.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
244 / 318
FEFI
Example (y1 , y2 : dichotomous, taking 0 or 1)
ID Weight y1
y2
1 w1
y1,1 y1,2
∗
2 w2 w2,1
y2,1
0
∗
w2 w2,2
y2,1
1
∗
3 w3 w3,1
0
y3,2
∗
w3 w3,2
1
y3,2
4 w4
y4,1 y4,2
5 w5
y5,1 y5,2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
245 / 318
7/23-24/2015
246 / 318
FEFI
Example (y1 , y2 : dichotomous, taking 0 or 1)
ID Weight
y1
y2
6 w6
y6,1
y6,2
∗
7 w7 w7,1
0
y7,2
∗
w7 w7,2
1
y7,2
∗
8 w8 w8,1
0
0
∗
w8 w8,2
0
1
∗
w8 w8,3
1
0
∗
w8 w8,4
1
1
9 w9
y9,1
y9,2
10 w10
y10,1 y10,2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
FEFI
Example (Cont’d)
E-step: Fractional weights are the conditional probabilities of the
imputed values given the observations.
∗(j)
wij∗ = P̂(yi,mis | yi,obs )
∗(j)
=
π̂(yi,obs , yi,mis )
∗(l)
l=1 π̂(yi,obs , yi,mis )
PMi
where (yi,obs , yi,mis ) is the (observed, missing) part of
yi = (yi1 , · · · , yi,p ).
M-step: Update the joint probability using the fractional weights.
π̂ab =
with N̂ =
n Mi
1 XX
N̂
∗(j)
∗(j)
wi wij∗ I (yi,1 = a, yi,2 = b)
i=1 j=1
Pn
i=1 wi .
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
247 / 318
FEFI
Example (Cont’d) Variance estimation
Recompute the fractional weights for each replication
Apply the same EM algorithm using the replicated weights.
E-step: Fractional weights are the conditional probabilities of the
imputed values given the observations.
∗(j)
∗(k)
wij
=
π̂ (k) (yi,obs , yi,mis )
PMi
l=1
∗(l)
π̂ (k) (yi,obs , yi,mis )
M-step: Update the joint probability using the fractional weights.
(k)
π̂ab
where N̂ (k) =
=
n Mi
1 XX
N̂ (k)
Pn
Kim & Fuller & Mukhopadhyay (ISU & SAS)
i=1
(k)
∗(k)
wi wij
∗(j)
∗(j)
I (yi,1 = a, yi,2 = b)
i=1 j=1
(k)
wi .
Chapter 5
7/23-24/2015
248 / 318
FEFI
Example (Cont’d) Final Product
Replication Weights
Rep 1
Rep 2
· · · Rep L
(1)
(2)
(L)
w1
w1
· · · w1
(1) ∗(1)
(2) ∗(2)
(L) ∗(L)
w2 w2,1
w2 w2,1
· · · w2 w2,1
Weight
w1
∗
w2 w2,1
x
x1
x2
y1
y1,1
y2,2
y2
y1,2
0
∗
w2 w2,2
x2
y2,2
1
w2 w2,2
∗
w3 w3,1
x3
0
y3,2
w3 w3,1
∗
w3 w3,2
x3
1
y3,2
w3 w3,2
w4
w5
w6
x4
x5
x6
y4,1
y5,1
y6,1
y4,2
y5,2
y6,2
w4
(1)
w5
(1)
w6
(1)
∗(1)
(1)
∗(1)
(1)
∗(1)
(1)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
(2)
∗(2)
···
w2 w2,2
(2)
∗(2)
···
w3 w3,1
(2)
∗(2)
···
w3 w3,2
···
···
···
w4
(L)
w5
(L)
w6
w2 w2,1
w3 w3,1
w3 w3,2
(2)
w4
(2)
w5
(2)
w6
Chapter 5
(L)
∗(L)
(L)
∗(L)
(L)
∗(L)
(L)
7/23-24/2015
249 / 318
FEFI
Example (Cont’d) Final Product
Weight
∗
w7 w7,1
x
x7
y1
0
y2
y7,2
Replication Weights
Rep 1
Rep 2
Rep L
(1) ∗(1)
(2) ∗(2)
(L) ∗(L)
w7 w7,1
w7 w7,1
w7 w7,1
∗
w7 w7,2
x7
1
y7,2
w7 w7,2
∗
w8 w8,1
x8
0
0
w8 w8,1
∗
w8 w8,2
x8
0
1
w8 w8,2
∗
w8 w8,3
x8
1
0
w8 w8,3
∗
w8 w8,4
x8
1
1
w8 w8,4
w9
w10
x9
x10
y9,1
y10,1
y9,2
y10,2
Kim & Fuller & Mukhopadhyay (ISU & SAS)
(1)
∗(1)
(1)
∗(1)
(1)
∗(1)
(1)
∗(1)
(1)
∗(1)
(1)
w9
(1)
w10
Chapter 5
(2)
∗(2)
(2)
∗(2)
(2)
∗(2)
(2)
∗(2)
(2)
∗(2)
w7 w7,2
w8 w8,1
(L)
∗(L)
(L)
∗(L)
(L)
∗(L)
(L)
∗(L)
w8 w8,2
w8 w8,3
w8 w8,3
w8 w8,4
(2)
∗(L)
w8 w8,1
w8 w8,2
w9
(2)
w10
(L)
w7 w7,2
w8 w8,4
···
···
(L)
w9
(L)
w10
7/23-24/2015
250 / 318
Fractional Hot-Deck Imputation Using SAS
PROC SURVEYIMPUTE
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
251 / 318
Missing Values
Exclude observations with missing weights
Analyze missing levels as a separate level
Delete observations with missing values (equivalent to
imputing missing values with the estimated values
from the analysis model [SAS default])
Analyze observations without missing values in a
separate domain (equivalent to imputing missing
values by 0 [NOMCAR])
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
252 / 318
Nonresponse
Follow-up interviews
Weight adjustment, poststratification
Hot-deck and fractional hot-deck imputation
Multiple imputation (MI and MIANALYZE procedures
in SAS)
Other techniques
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
253 / 318
PROC SURVEYIMPUTE
Hot-deck imputation
Simple random samples
Proportional to weights
Approximate Bayesian bootstrap
Fully efficient fractional imputation (FEFI)
Imputation-adjusted replicate weights
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
254 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
255 / 318
FEFI Detail
Use all possible levels that a missing item can take, given the levels of
the observed items
Assign fractional weights proportional to the weighted frequencies of
the imputed levels in the observed data
Unit
Kim & Fuller & Mukhopadhyay (ISU & SAS)
X Y
1
0
0
2
0
.
3
0
1
4
0
0
5
.
1
6
1
0
7
1
1
8
1
1
9
.
.
Chapter 5
7/23-24/2015
256 / 318
FEFI: Initialization
Fill in missing values
Use the complete cases to compute fractional weights
Recipient
ImpWt Unit
X Y
0
1.00000
1
0
0
1
0.66667
2
0
0
2
0.33333
2
0
1
0
1.00000
3
0
1
0
1.00000
4
0
0
1
0.33333
5
0
1
2
0.66667
5
1
1
0
1.00000
6
1
0
0
1.00000
7
1
1
0
1.00000
8
1
1
1
0.33333
9
0
0
2
0.16667
9
0
1
3
0.16667
9
1
0
4
0.33333
9
1
1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
257 / 318
7/23-24/2015
258 / 318
M Step: Compute Proportions
The FREQ Procedure
Frequency
Percent
Row Pct
Col Pct
Table of X by Y
Y
X
0
1
Total
3
33.33
62.07
72.00
1.83333
20.37
37.93
37.93
4.83333
53.70
1 1.16667
12.96
28.00
28.00
3
33.33
72.00
62.07
4.16667
46.30
4.16667
46.30
4.83333
53.70
9
100.00
0
Total
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
E Step: Adjust Fractional Weights
Recipient
Kim & Fuller & Mukhopadhyay (ISU & SAS)
ImpWt Unit
X Y
0
1.00000
1
0
0
1
0.62069
2
0
0
2
0.37931
2
0
1
0
1.00000
3
0
1
0
1.00000
4
0
0
1
0.37931
5
0
1
2
0.62069
5
1
1
0
1.00000
6
1
0
0
1.00000
7
1
1
0
1.00000
8
1
1
1
0.33333
9
0
0
2
0.20370
9
0
1
3
0.12963
9
1
0
4
0.33333
9
1
1
Chapter 5
7/23-24/2015
259 / 318
Repeat E-M Steps in Replicate Samples
Recipient
ImpWt FracWgt
ImpRepWt_1 ImpRepWt_2 ImpRepWt_9 Unit
X Y
0
1.00000
1.00000
0.00000
1.12500
1.12500
1
0
0
1
0.58601
0.58601
0.46072
0.00000
0.65877
2
0
0
2
0.41399
0.41399
0.66428
0.00000
0.46623
2
0
1
0
1.00000
1.00000
1.12500
1.12500
1.12500
3
0
1
0
1.00000
1.00000
1.12500
1.12500
1.12500
4
0
0
1
0.41399
0.41399
0.49821
0.37510
0.46623
5
0
1
2
0.58601
0.58601
0.62679
0.74990
0.65877
5
1
1
0
1.00000
1.00000
1.12500
1.12500
1.12500
6
1
0
0
1.00000
1.00000
1.12500
1.12500
1.12500
7
1
1
0
1.00000
1.00000
1.12500
1.12500
1.12500
8
1
1
1
0.32330
0.32330
0.22659
0.32143
0.00000
9
0
0
2
0.22840
0.22840
0.32669
0.21434
0.00000
9
0
1
3
0.12500
0.12500
0.16071
0.16071
0.00000
9
1
0
4
0.32330
0.32330
0.41101
0.42851
0.00000
9
1
1
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
260 / 318
Digitech Cable
Item nonresponse: Rating, Recommend, ...
Impute missing items from the observed data
Use imputation cells
Create an imputed data set and a set of replicate
weights for future use
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
261 / 318
PROC SURVEYIMPUTE
p r o c s u r v e y i m p u t e d a t a=R e s p o n s e D a t a method= f e f i ;
s t r a t a S t a t e Type ;
weight SamplingWeight ;
c l a s s R a t i n g Recommend H o u s e h o l d S i z e
Race ;
v a r R a t i n g Recommend H o u s e h o l d S i z e
Race ;
c e l l s ImputationCells ;
o u t p u t o u t=ImputedData
o u t j k c o e f=J K C o e f f i c i e n t s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
262 / 318
The SURVEYIMPUTE Procedure
Missing Data Patterns
Group
Sum of Unweighted
Freq Weights
Percent
Rating
Recommend
HouseholdSize
Race
1
X
X
X
X
269
12081.4
89.67
2
X
X
.
.
6
270.2982
2.00
3
X
.
X
X
9
402.655
3.00
4
.
X
X
X
14
627.7105
4.67
5
.
.
X
X
1
44.21429
0.33
6
.
.
.
.
1
44.71795
0.33
Missing Data
Patterns
Group
Weighted
Percent
1
89.68
2
2.01
3
2.99
4
4.66
5
0.33
6
0.33
Imputation Summary
Number of
Sum of
Observations Weights
Observation Status
Nonmissing
269
12081.4
Missing
31
1389.596
Missing, Imputed
31
1389.596
Missing, Not Imputed
0
0
Missing, Partially Imputed
0
0
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
NumberOfDonors
NumberOfUnits
NumberOfRows
0
269
269
1
3
3
2
9
18
3
2
6
4
5
20
5
5
25
6
1
6
8
2
16
9
2
18
10
1
10
59
1
59
300
450
263 / 318
Output data set contains the imputed data
New variables: Replicate Weights, Recipient Index, Unit ID
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
264 / 318
UnitID
Recipient
ImpWt SamplingWeight
ImputationCells
186
1
11.2356
44.7179
1
186
2
11.2356
44.7179
1
186
3
22.2467
44.7179
1
264
1
1.1099
45.5135
1
264
2
1.1431
45.5135
1
264
3
2.2784
45.5135
1
264
4
1.7025
45.5135
1
264
5
7.1105
45.5135
1
264
6
3.4138
45.5135
1
264
7
6.2488
45.5135
1
264
8
2.2761
45.5135
1
264
9
1.1099
45.5135
1
264
10
19.1205
45.5135
1
Rating
Recommend
HouseholdSize
Race
Neutral
0
Medium
Other
Satisfied
0
Medium
Other
Unsatisfied
0
Medium
Other
Extremely Unsatisfied
0
Large
NA
Extremely Unsatisfied
0
Large
White
Extremely Unsatisfied
0
Medium
Black
Extremely Unsatisfied
0
Medium
Hispanic
Extremely Unsatisfied
0
Medium
White
Extremely Unsatisfied
0
Small
Black
Extremely Unsatisfied
0
Small
Hispanic
Extremely Unsatisfied
0
Small
NA
Extremely Unsatisfied
0
Small
Other
Extremely Unsatisfied
0
Small
White
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
265 / 318
Analyses of FEFI Data
The imputed data set can be used for any analyses
Use the imputation-adjusted weights
Use the imputation-adjusted replicate weights
The number of rows in the imputed data set is NOT
the same as the number of observation units
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
266 / 318
Use the Imputed Data to Estimate Usage
p r o c s u r v e y m e a n s d a t a=ImputedData mean
varmethod=j a c k k n i f e ;
w e i g h t ImpWt ;
v a r UsageTime ;
r e p w e i g h t s ImpRepWt : /
j k c o e f s=J K C o e f f i c i e n t s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
267 / 318
7/23-24/2015
268 / 318
The SURVEYMEANS Procedure
Data Summary
Number of Observations
Sum of Weights
450
13471
Variance Estimation
Method
Jackknife
Replicate Weights
Number of Replicates
IMPUTEDDATA
300
Statistics
Variable
Label
UsageTime
Computer Usage Time
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
Mean
Std Error
of Mean
284.953667
11.028403
Use the Imputed Data to Estimate Rating
p r o c s u r v e y f r e q d a t a=ImputedData
varmethod=j a c k k n i f e ;
w e i g h t ImpWt ;
t a b l e s R a t i n g Recommend ;
r e p w e i g h t s ImpRepWt : /
j k c o e f s=J K C o e f f i c i e n t s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
269 / 318
The SURVEYFREQ Procedure
Customer Satisfaction
Frequency
Weighted
Frequency
108
3297
338.60477
24.4745
2.5136
97
3215
343.59043
23.8664
2.5506
102
3054
339.31908
22.6703
2.5189
Satisfied
99
2678
319.57801
19.8770
2.3723
Extremely Satisfied
44
1227
230.65076
9.1117
1.7122
450
13471
4.0261E-10
100.000
Rating
Extremely Unsatisfied
Unsatisfied
Neutral
Total
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
Std Err of
Wgt Freq Percent
Std Err of
Percent
7/23-24/2015
270 / 318
Recommend
Recommend
Frequency
Weighted
Frequency
0
261
8027
389.02594
59.5863
2.8879
1
189
5444
389.02594
40.4137
2.8879
Total
450
13471
2.0157E-10
100.000
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Std Err of
Wgt Freq Percent
Chapter 5
Std Err of
Percent
7/23-24/2015
271 / 318
§5.5 Small area estimation
Basic Setup
Original sample A is decomposed into G domains such that
A = A1 ∪ · · · ∪ AG and n = n1 + · · · + nG
n is large but ng can be very small.
P
Direct estimator of Yg = i∈Ug yi
Ŷd,g
X 1
=
yi
πi
i∈Ag
Unbiased
May have high variance.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
272 / 318
If there is some auxiliary information available, then we can do
something:
Synthetic estimator of Yg
Ŷs,g = Xg β̂
P
where Xg = i∈Ug xi is the known total of xi in Ug and β̂ is an
estimated regression coefficient.
Low variance (if xi does P
not contain the domain indicator).
Could be biased (unless i∈Ug (yi − x0i B) = 0)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
273 / 318
Composite estimation: consider
Ŷc,g = αg Ŷd,g + (1 − αg ) Ŷs,g
for some αg ∈ (0, 1). We are interested in finding αg∗ that minimizes
the MSE of Ŷc . The optimal choice is
MSE Ŷs,g
∗ ∼
αg =
MSE Ŷd,g + MSE Ŷs,g
For the direct estimation part, MSE Ŷd,g = V Ŷd,g can be
estimated.
n
o
2
For the synthetic estimation part, MSE Ŷs,g = E (Ŷs,g − Yg )
cannot be computed directly without assuming some error model.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
274 / 318
Area level estimation
Basic Setup
Parameter of interest: Ȳg = Ng−1
P
i∈Ug
yi
Model
Ȳg = X̄0g β + ug
and ug ∼ 0, σu2 .
Also, we have
ˆ
Ȳd,g ∼ Ȳg , Vg
with Vg = V (Ȳˆd,g ).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
275 / 318
Area level estimation (Cont’d)
Two model can be written
Ȳˆd,g
= Ȳg + eg
X̄0g β = Ȳg − ug
where eg and ug are independent error terms with mean zeros and
variance Vg and σu2 , respectively. Thus, the best linear unbiased predictor
(BLUP) can be written as
Ȳˆg∗ = αg∗ Ȳˆd,g + 1 − αg∗ X̄0g β
where αg∗ = σu2 /(Vg + σu2 ).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
276 / 318
Area level estimation (Cont’d)
MSE: If β, Vg , and σu2 are known, then
ˆ
ˆ
∗
∗
MSE Ȳg
= V Ȳg − Ȳg
n 0
o
∗ ˆ
∗
= V αg Ȳd,g − Ȳg + 1 − αg X̄g β − Ȳg
2
2
= αg∗ Vg + 1 − αg∗ σu2
2
∗
∗
= αg Vg = 1 − αg σu .
Note that, since 0 < αg∗ < 1,
ˆ
∗
MSE Ȳg < Vg
and
ˆ
∗
MSE Ȳg < σu2 .
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
277 / 318
Area level estimation (Cont’d)
If β and σu2 are unknown:
1
2
Find a consistent estimator of β and σu2 .
Use
Ȳˆg∗ (α̂g∗ , β̂) = α̂g∗ Ȳˆd,g + 1 − α̂g∗ X̄0g β̂.
where α̂g∗ = σ̂u2 /(V̂g + σ̂u2 )
Estimation of σu2 : Method of moment
2
X G σ̂u2 =
kg
Ȳˆd,g − X̄0g β̂ − V̂d,g ,
G −p
g
n
o−1
P
2
2
where kg ∝ σ̂u + V̂g
and G
g =1 kg = 1. If σ̂u is negative, then
we set it to zero.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
278 / 318
Area level estimation (Cont’d)
MSE
n
o
n
o
ˆ
ˆ
∗ ∗
∗ ∗
MSE Ȳg (α̂g , β̂)
= V Ȳg (α̂g , β̂) − Ȳg
n o
0
∗ ˆ
∗
= V α̂g Ȳd,g − Ȳg + 1 − α̂g X̄g β̂ − Ȳg
o
n 2
∗ 2
∗ 2
0
= αg Vg + 1 − αg
σu + X̄g V (β̂)X̄g
+V (α̂g ) Vg + σu2
∗
∗ 2 0
= αg Vg + 1 − αg X̄g V (β̂)X̄g
+V (α̂g ) Vg + σu2
MSE estimation (Prasad and Rao, 1990):
n
o
ˆ
∗
∗
∗
∗ 2 0
ˆ
= α̂g V̂g + 1 − α̂g X̄g V̂ (β̂)X̄g
MSE Ȳg (α̂g , β̂)
n
o
2
+2V̂ (α̂g ) V̂g + σ̂u .
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
279 / 318
Unit level estimation
Unit level estimation: Battese, Harter, and Fuller (1988).
Use a unit level modeling
ygi = x0gi β + ug + egi
and
Ŷg∗
o
Xn
0
=
xgi β̂ + ûg .
i∈Ug
It can be shown that
Ȳˆg∗ = α̂g∗ Ȳreg ,g + 1 − α̂g∗ Ȳs,g
where
Ȳreg ,g
0
ˆ
ˆ
= Ȳd,g + X̄g − X̄d,g β̂
and
Ȳs,g = X̄0g β̂.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 5
7/23-24/2015
280 / 318
Chapter 6
Analytic Studies
World Statistics Congress Short Course
July 23-24, 2015
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
281 / 318
Parameters
Two types of parameters
Descriptive parameter: “How many people in the United States were
unemployed on March 10, 2015?”
Analytic parameter: “If personal income (in the United States)
increases 2%, how much will the consumption of beef increase ?”
Basic approach to estimating analytic parameters
1
2
Specify a model that describes the relationship among the variables
(often called superpopulation model).
Estimate the parameters in the model using the realized sample.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
282 / 318
Parameter Estimation for Model Parameter
1
2
θN : finite population characteristic for θ satisfying
E (θN ) = θ + O(N −1 ) and V (θN − θ) = O(N −1 ), where the
distribution is with respect to the model.
n
o
θ̂: estimator of θN satisfying E θ̂ − θN | FN = Op (n−1 ) and
n
o
V θ̂ | FN = Op (n−1 ) almost surely.
E (θ̂) = θ + O(n−1 )
n o
V (θ̂ − θ) = E V θ̂ | FN
+ V (θN )
V̂ (θ̂ − θ) = V̂ (θ̂ | FN ) + V̂ (θ̂N − θ)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
283 / 318
A Regression Model
The finite population is a realization from a model
yi = xi β + ei
(2)
where the ei are independent (0, σ 2 ) random variables independent of
xj for all i and j.
We are interested in estimating β from the sample.
First order inclusion probabilities πi are available.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
284 / 318
Estimation of Regression Coefficients
OLS estimator
!−1
β̂ols =
X
x0i xi
i∈A
X
x0i yi
i∈A
Probability weighted estimator
!−1
β̂π =
X
πi−1 x0i xi
X
πi−1 x0i yi
i∈A
i∈A
If superpopulation model also holds for the sample, then OLS
estimator is optimal.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
285 / 318
Informative Sampling
Non-informative sampling design (with respect to the superpopulation
model) satisfies
P (yi ∈ B | xi , i ∈ A) = P (yi ∈ B | xi )
(3)
for any measurable set B. The left side is the sample model and the
right side is the population model.
Informative sampling design: Equality (3) does not hold.
Non-informative sampling for regression implies
E x0i ei | i ∈ A = 0.
(4)
If condition (4) is satisfied and moment conditions hold, β̂ols is
consistent
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
286 / 318
Hypothesis Testing
Thus, one may want to test (4), or test directly
n
o
n o
H0 : E β̂ols = E β̂π
1
2
(5)
From the sample, fit a regression of yi on (xi , zi ) where zi = πi−1 xi
Perform a test for γ = 0 under the expanded model
y = Xβ + Zγ + a
where a is the error term satisfying E (a | X, Z) = 0.
Justification : Testing (5) is equivalent to testing
E Z0 (I − PX ) y = 0
(6)
where PX = X (X 0 X )−1 X 0 . Since
γ̂ = {Z 0 (I − PX ) Z }−1 Z 0 (I − PX ) y, testing for γ = 0 is equivalent
to testing for (6).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
287 / 318
Remarks on Testing
When performing the hypothesis testing, design consistent variance
estimator is preferable.
Rejecting the null hypothesis means that we cannot directly use the
OLS estimator under the current model.
Include more x’s until the sampling design is non-informative under the
expanded model. (Example 6.3.1)
Use the probability weighted estimator or use other consistent
estimators (Section 6.3.2)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
288 / 318
Example 6.3.1: Birthweight and Age Data
y : gestational age
x: birthweight
stratified sample of babies
OLS result
ŷ
= 25.765 + 0.389 · x
(0.370) (0.012)
Weighted regression
ŷ
= 28.974 + 0.297 · x
(0.535) (0.016)
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
289 / 318
Example 6.3.1 (Cont’d)
DuMouchel & Duncan test: Fit the OLS regression of y on
(1, x, w , wx), where wi = πi−1 , to obtain
(β̂0 , β̂1 , γ̂0 , γ̂1 ) = (22.088, 0.583, 8.287, −0.326)
(0.532) (0.033) (0.861) (0.332)
The hypothesis is rejected. (F (2, 86) = 55.59.)
Thus, we cannot use OLS method in this data.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
290 / 318
Example 6.3.1 (Cont’d)
However, if we include a quadratic term into the model then the
sampling design becomes noninformative.
OLS result
ŷ
= 28.335 + 0.331 · x − 0.887 · x2
(0.343) (0.010)
(0.082)
where x2 = 0.01(x − 30)2 .
Weighted regression
ŷ
= 28.458 + 0.327 · x − 0.864 · x2
(0.386) (0.011)
(0.108)
DuMouchel & Duncan test: F (3, 84) = 0.49.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
291 / 318
Estimators Under Informative Sampling
Pfeffermann and Sverchkov (1999) estimator: Minimize
X wi
Q(β) =
(yi − xi β)2
w̃i
i∈A
where wi = πi−1 , w̃i = E (wi | xi , i ∈ A). w̃i can be a function of xi .
(Estimated) GLS estimator: Minimize
X
Q(β) =
wi (yi − xi β)2 /vi2
(7)
i∈A
where
1
2
vi2
n
o
2
= E wi (yi − xi β) | xi .
Obtain β̂π and compute êi = yi − xi β̂π .
Fit a (nonlinear) regression model ai2 = wi êi2 on xi ,
ai2 = qa (xi ; γa ) + rai
to get v̂i2 = qa (xi ; γ̂a ) and insert v̂i2 in (7).
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
292 / 318
Comments
Fitting models to complex survey data
Always test for informative design
If the hypothesis of noninformative design is rejected:
Examine model
Use HT estimator or more complex design consistent estimator
Variance estimation for clusters and two-stage designs must recognize
clusters
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
293 / 318
7/23-24/2015
294 / 318
Linear and Logistic Regression Using SAS
PROC SURVEYREG
PROC SURVEYLOGISTIC
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
PROC SURVEYREG
Linear regression
Regression coefficients
Significance tests
Estimates and contrasts
Regression estimator
Comparisons of domain means
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
295 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
296 / 318
PROC SURVEYLOGISTIC
Categorical response
Logit, probit, complementary log-log, and generalized
logit regressions
Regression coefficients
Estimates and contrasts
Odds ratios
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
297 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
298 / 318
Pseudo-likelihood Estimation
Finite population parameter is defined
by the
P
population likelihood, lN (θN ) = UN log {L(θN , xi )}
A sample-based estimate of the likelihood is used to
estimate the parameter,
n
o
P −1
lπ (θ̂) = A πi log L(θ̂, xi )
Variance estimators assume fixed population values,
V (θ̂|FN )
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
299 / 318
Taylor Series Linearization Variance Estimation
Sandwich variance estimator that accounts for strata,
cluster, and weights:
V̂ (θ̂) = I −1 GI −1
G = (n − p)
−1
X
X
−1
(n − 1)
(nh − 1) nh (1 − fh )
(ehi+ − ēh.. )0 (ehi+ − ēh.. )
i
h
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
300 / 318
Replication Variance Estimation
Estimate θ in the full sample and in every replicate
sample:
V̂ (θ̂) =
X
αr (θ̂(r ) − θ̂)(θ̂(r ) − θ̂)0
r
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
301 / 318
Digitech Cable
Customer satisfaction survey
Describe usage time based on data usage after
adjusting for race
Describe usage time based on race
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
302 / 318
Linear Regression
p r o c s u r v e y r e g d a t a=ImputedData ;
w e i g h t ImpWt ;
c l a s s Race ;
model UsageTime = DataUsage Race /
solution ;
r e p w e i g h t s ImpRepWt : /
j k c o e f s=J K C o e f f i c i e n t s ;
l s m e a n s Race / d i f f ;
o u t p u t o u t=RegOut
p r e d i c t e d=F i t t e d V a l u e s
r e s i d u a l=R e s i d u a l s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
303 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
304 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
305 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
306 / 318
The SURVEYREG Procedure
Regression Analysis for Dependent Variable UsageTime
Fit Statistics
R-Square
0.7136
Root MSE
111.20
Denominator DF
300
Tests of Model Effects
Effect
Num DF F Value
Pr > F
Model
5
135.08
<.0001
Intercept
1
70.87
<.0001
DataUsage
1
484.50
<.0001
Race
4
22.09
<.0001
Note: The denominator degrees of freedom for the F tests is 300.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
307 / 318
Estimated Regression Coefficients
Standard
Error t Value
Parameter
Estimate
Pr > |t|
Intercept
30.395290
10.7067145
2.84
0.0048
DataUsage
0.055261
0.0025106
22.01
<.0001
Race Black
43.460901
20.2069823
2.15
0.0323
Race Hispanic
38.191181
23.6997846
1.61
0.1081
Race NA
202.394973
23.3000699
8.69
<.0001
Race Other
141.623908
33.7004078
4.20
<.0001
Race White
0.000000
0.0000000
.
.
Note: The degrees of freedom for the t tests is 300.
Matrix X'WX is singular and a generalized inverse was used to solve the normal equations.
Estimates are not unique.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
308 / 318
Digitech Cable
Customer satisfaction survey
Describe usage time based on data usage after
adjusting for race
Describe usage time based on race
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
309 / 318
Comparisons of Domain Means
p r o c s u r v e y r e g d a t a=ImputedData ;
w e i g h t ImpWt ;
c l a s s Race ;
model UsageTime = DataUsage Race /
solution ;
r e p w e i g h t s ImpRepWt : /
j k c o e f s=J K C o e f f i c i e n t s ;
l s m e a n s Race / d i f f ;
o u t p u t o u t=RegOut
p r e d i c t e d=F i t t e d V a l u e s
r e s i d u a l=R e s i d u a l s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
310 / 318
Differences of Race Least Squares Means
Race
Race
Black
Hispanic
Black
NA
Black
Black
Estimate
Standard
Error
5.2697
DF t Value
Pr > |t|
29.5971
300
0.18
0.8588
-158.93
29.0487
300
-5.47
<.0001
Other
-98.1630
37.3004
300
-2.63
0.0089
White
43.4609
20.2070
300
2.15
0.0323
Hispanic
NA
-164.20
31.4596
300
-5.22
<.0001
Hispanic
Other
-103.43
39.9883
300
-2.59
0.0102
Hispanic
White
38.1912
23.6998
300
1.61
0.1081
NA
Other
60.7711
39.2901
300
1.55
0.1230
NA
White
202.39
23.3001
300
8.69
<.0001
Other
White
141.62
33.7004
300
4.20
<.0001
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
311 / 318
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
312 / 318
Digitech Cable
Customer satisfaction survey
Describe recommendation based on race after
adjusting for data usage
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
313 / 318
7/23-24/2015
314 / 318
Logistic Regression
p r o c s u r v e y l o g i s t i c d a t a=ImputedData ;
w e i g h t ImpWt ;
c l a s s Race ;
model Recommend=DataUsage Race ;
r e p w e i g h t s ImpRepWt : /
j k c o e f s=J K C o e f f i c i e n t s ;
run ;
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
The SURVEYLOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter
Estimate
Intercept
DataUsage
Standard
Error t Value
Pr > |t|
0.4507
0.2839
1.59
0.1135
0.000046
0.000042
1.08
0.2815
-0.2246
0.3302
-0.68
0.4969
Race
Black
Race
Hispanic
0.0466
0.3505
0.13
0.8944
Race
NA
0.4332
0.6435
0.67
0.5014
Race
Other
0.1304
0.5145
0.25
0.8000
NOTE: The degrees of freedom for the t tests is 300.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
315 / 318
7/23-24/2015
316 / 318
Odds Ratio Estimates
Point
Estimate
Effect
95%
Confidence
Limits
DataUsage
1.000
1.000
1.000
Race
Black vs White
1.175
0.583
2.367
Race
Hispanic vs White
1.541
0.728
3.260
Race
NA
2.268
0.474
10.850
Race
Other vs White
1.675
0.496
5.661
vs White
NOTE:
The degrees of freedom in computing the
confidence limits is 300.
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
Survey Data Analysis Using SAS
Do not use non-survey procedures for survey data
analysis
Always use complete design information: weight,
strata, clusters, ...
SURVEYSELECT, SURVEYIMPUTE,
SURVEYMEANS, SURVEYFREQ, SURVEYREG,
SURVEYLOGISTIC, and SURVEYPHREG procedures
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
317 / 318
R
For More Information about SAS/STAT
support.sas.com/statistics/
In-depth information about statistical products and
link to e-newsletter
support.sas.com/STAT/
Portal to SAS Technical Support, discussion forum,
documentation, and more
Kim & Fuller & Mukhopadhyay (ISU & SAS)
Chapter 6
7/23-24/2015
318 / 318
Download