Advanced Topics in Survey Sampling
Jae-Kwang Kim, Wayne A. Fuller, Pushpal Mukhopadhyay
Department of Statistics, Iowa State University
World Statistics Congress Short Course, July 23-24, 2015
Kim & Fuller & Mukhopadhyay (ISU & SAS), Advanced Topics in Survey Sampling, 7/23-24/2015

Outline
1 Probability sampling from a finite population
2 Use of auxiliary information in estimation
3 Use of auxiliary information in design
4 Replication variance estimation
5 Models used in conjunction with sampling
6 Analytic studies

Chapter 1: Probability Sampling from a Finite Universe

Probability Sampling
U = {1, 2, ..., N}: list for the finite population; the sampling frame
F = {y_1, y_2, ..., y_N}: finite population / finite universe
A (⊂ U): index set of the sample
𝒜: set of possible samples

Sampling Design
Definition: p(·) is a sampling design ⇔ p(a) is a function from 𝒜 to [0, 1] such that
1. p(a) ∈ [0, 1] for all a ∈ 𝒜, and
2. Σ_{a∈𝒜} p(a) = 1;
i.e., p(a) is a probability mass function defined on 𝒜.
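As an illustration (a Python sketch, not part of the slides), the SRS design introduced later can be written down explicitly as a probability mass function on 𝒜; the population size, sample size, and values below are arbitrary toy choices:

```python
from itertools import combinations
from fractions import Fraction
from math import comb

N, n = 6, 2
U = range(1, N + 1)

# SRS design: every size-n subset a of U gets p(a) = 1 / C(N, n)
design = {a: Fraction(1, comb(N, n)) for a in combinations(U, n)}

assert sum(design.values()) == 1           # p(.) is a pmf on the set of samples
pi_1 = sum(p for a, p in design.items() if 1 in a)
assert pi_1 == Fraction(n, N)              # first-order inclusion probability n/N
```

Summing p(a) over the samples containing unit 1 recovers the inclusion probability π_1 = n/N derived later for SRS.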
Notation for Sampling
I_i = 1 if i ∈ A, 0 otherwise
d = (I_1, I_2, ..., I_N)
n = Σ_{i=1}^N I_i: (realized) sample size
π_i = E[I_i]: first-order inclusion probability
π_ij = E[I_i I_j]: second-order inclusion probability

Design-Estimator Characteristics
Definition: θ̂ = θ̂(y_i; i ∈ A) is design unbiased for θ_N = θ(y_1, y_2, ..., y_N)
⇔ E{θ̂ | F} = θ_N for all (y_1, y_2, ..., y_N), where E{θ̂ | F} = Σ_{a∈𝒜} θ̂(a) p(a).
Definition: θ̂ is a design linear estimator ⇔ θ̂ = Σ_{i∈A} w_i y_i, where the w_i are fixed with respect to the sampling design.
Note: if w_i = π_i⁻¹, then θ̂ = Σ_{i∈A} w_i y_i is the Horvitz-Thompson estimator.

Theorem 1.2.1
Theorem: if θ̂ = Σ_{i∈A} w_i y_i is a design linear estimator, then
  E{θ̂ | F} = Σ_{i=1}^N π_i w_i y_i,
  V{θ̂ | F} = Σ_{i=1}^N Σ_{j=1}^N (π_ij − π_i π_j) w_i y_i w_j y_j.

Proof of Theorem 1.2.1
Because E{I_i} = π_i and because the w_i y_i, i = 1, 2, ..., N, are fixed,
  E{θ̂ | F} = Σ_{i=1}^N E{I_i | F} w_i y_i = Σ_{i=1}^N π_i w_i y_i.
Using E(I_i I_k) = π_ik and Cov(I_i, I_k) = E(I_i I_k) − E(I_i)E(I_k),
  V{θ̂ | F} = V{ Σ_{i=1}^N I_i w_i y_i | F } = Σ_{i=1}^N Σ_{k=1}^N Cov(I_i, I_k) w_i y_i w_k y_k
           = Σ_{i=1}^N Σ_{k=1}^N (π_ik − π_i π_k) w_i y_i w_k y_k.

Horvitz-Thompson Estimator of the Total
Corollary: let π_i > 0 for all i ∈ U. Then T̂_y = Σ_{i∈A} π_i⁻¹ y_i satisfies
(i) E(T̂_y | F) = T_y,
(ii) V(T̂_y | F) = Σ_{i=1}^N Σ_{j=1}^N (π_ij − π_i π_j) y_i y_j / (π_i π_j).
Proof of (ii): substitute π_i⁻¹ for the w_i of Theorem 1.2.1.

Unbiased Variance Estimation
Theorem: let π_ij > 0 for all i, j ∈ U and let θ̂ = Σ_{i∈A} w_i y_i be design linear.
Then
  V̂ = Σ_{i∈A} Σ_{j∈A} π_ij⁻¹ (π_ij − π_i π_j) w_i y_i w_j y_j
satisfies E(V̂ | F) = V(θ̂ | F).
Proof: let g(y_i, y_j) = (π_ij − π_i π_j) w_i y_i w_j y_j. By Theorem 1.2.1,
  E{ Σ_{i,j∈A} π_ij⁻¹ g(y_i, y_j) | F } = Σ_{i=1}^N Σ_{j=1}^N g(y_i, y_j).

Simple Random Sampling (SRS)
Choose n units from N units without replacement with equal probability.
1. Each subset of n distinct units is equally likely to be selected.
2. There are C(N, n) samples of size n from N.
3. Give equal probability of selection to each subset with n units.
Definition: the sampling design for SRS is
  P(A) = 1/C(N, n) if |A| = n, and 0 otherwise.

Lemma
Under SRS, the inclusion probabilities are
  π_i = C(N−1, n−1) / C(N, n) = n/N,
  π_ij = C(N−2, n−2) / C(N, n) = n(n−1) / {N(N−1)}  for i ≠ j.

Simple Random Samples
Let ȳ_n = n⁻¹ Σ_{i∈A} y_i, V̂ = n⁻¹(1 − nN⁻¹) s_n²,
s_n² = (n−1)⁻¹ Σ_{i∈A} (y_i − ȳ_n)², Ȳ_N = N⁻¹ Σ_{i=1}^N y_i,
and S_N² = (N−1)⁻¹ Σ_{i=1}^N (y_i − Ȳ_N)². Then
(i) E(ȳ_n | F) = Ȳ_N,
(ii) V(ȳ_n | F) = n⁻¹(1 − nN⁻¹) S_N²,
(iii) E(V̂ | F) = V(ȳ_n | F).

Poisson Sampling
Definition: I_i ~ independent Bernoulli(π_i), i = 1, 2, ..., N.
Estimation (of T_y = Σ_{i=1}^N y_i):
  T̂_y = Σ_{i=1}^N I_i y_i / π_i,  E{T̂_y | F} = Σ_{i=1}^N π_i y_i / π_i = T_y.
Variance:
  Var(T̂_y | F) = Σ_{i=1}^N (π_i − π_i²) y_i² / π_i² = Σ_{i=1}^N (π_i⁻¹ − 1) y_i².

Poisson Sampling: Optimal Design
Minimize Var(T̂_y | F) subject to Σ_{i=1}^N π_i = n:
  min Σ_{i=1}^N π_i⁻¹ y_i² + λ( Σ_{i=1}^N π_i − n )
  ⇒ π_i⁻² y_i² = λ  ⇒  π_i ∝ y_i.

A Design Result
Definition: Superpopulation model (ξ): the model for y = (y_1, y_2, ..., y_N).
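The Poisson-sampling mean and variance formulas above can be verified exactly for a small toy population (a Python sketch with arbitrary values; not from the slides) by enumerating all 2^N outcomes of the independent Bernoulli trials:

```python
from itertools import product

y  = [3.0, 1.0, 4.0, 1.5, 5.0]   # hypothetical population values
pi = [0.5, 0.2, 0.8, 0.3, 0.6]   # hypothetical inclusion probabilities

# Enumerate every realization d of the N independent Bernoulli trials
mean = var = 0.0
for d in product([0, 1], repeat=len(y)):
    p = 1.0
    for di, pii in zip(d, pi):
        p *= pii if di else (1 - pii)
    t = sum(di * yi / pii for di, yi, pii in zip(d, y, pi))  # HT total
    mean += p * t
    var  += p * t * t
var -= mean ** 2

assert abs(mean - sum(y)) < 1e-9   # E{T̂_y | F} = T_y (design unbiased)
assert abs(var - sum((1 / p - 1) * yi ** 2 for yi, p in zip(y, pi))) < 1e-9
```

The enumeration reproduces both E{T̂_y | F} = T_y and Var(T̂_y | F) = Σ (π_i⁻¹ − 1) y_i² with no Monte Carlo error.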
Definition: Anticipated variance: the expected value, under the superpopulation model, of the design variance of the planned estimator.

Theorem 1.2.3
Let y = (y_1, y_2, ..., y_N) be iid(μ, σ²), let d = (I_1, I_2, ..., I_N) be independent of y, and define T̂_y = Σ_{i∈A} π_i⁻¹ y_i. Then V{T̂_y − T_y} is minimized at π_i = n/N and π_ij = n(n−1)/{N(N−1)}, i ≠ j.

Proof of Theorem 1.2.3
  V{T̂_y − T_y} = V{E(T̂_y − T_y | F)} + E{V(T̂_y − T_y | F)}
    = E{ Σ_{i=1}^N Σ_{j=1}^N (π_ij − π_i π_j) y_i y_j / (π_i π_j) }
    = μ² Σ_{i=1}^N Σ_{j=1}^N (π_ij − π_i π_j) / (π_i π_j) + σ² Σ_{i=1}^N (π_i − π_i²) / π_i².
(i) min Σ_{i=1}^N π_i⁻¹ s.t. Σ_{i=1}^N π_i = n ⇒ π_i = N⁻¹n.
(ii) Σ_{i=1}^N Σ_{j=1}^N (π_ij − π_i π_j) π_i⁻¹ π_j⁻¹ = V{ Σ_{i=1}^N I_i π_i⁻¹ } ≥ 0, and
  V{ Σ_{i=1}^N I_i π_i⁻¹ } = 0 for π_i = N⁻¹n and π_ij = n(n−1){N(N−1)}⁻¹.

Discussion of Theorem 1.2.3
(i) Finding the π_i and π_ij that minimize V{θ̂ | F} is not possible, because V{θ̂ | F} is a function of the N unknown values.
    Godambe (1955), Godambe & Joshi (1965), Basu (1971)
(ii) If y_1, y_2, ..., y_N ~ ξ for some model ξ (superpopulation model), then we can sometimes find an optimal strategy (design and estimator). Under iid y and HT estimation, the optimal design is SRS.

Example: Stratified Sampling
Definition:
1. The finite population is stratified into H subpopulations: U = U_1 ∪ ... ∪ U_H.
2. Within each subpopulation (stratum), samples are drawn independently:
   Pr(i ∈ A_h, j ∈ A_g) = Pr(i ∈ A_h) Pr(j ∈ A_g) for h ≠ g,
   where A_h is the index set of the sample in stratum h, h = 1, 2, ..., H.

Example: Stratified SRS
1. Stratify the population. Let N_h be the population size of U_h.
2. Sample size allocation: determine n_h.
3. Perform SRS independently in each stratum (select n_h sample elements from the N_h).
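Steps 2-3 of stratified SRS can be sketched in Python (an illustration, not from the slides; the frame, strata labels, and proportional allocation n_h ∝ N_h are all hypothetical choices):

```python
import random
from math import floor

# Hypothetical frame: (unit id, stratum) pairs with two strata
frame = [(i, "h1" if i < 40 else "h2") for i in range(100)]
n = 10

# Step 2 (one common choice): proportional allocation, n_h ∝ N_h
sizes = {h: sum(1 for _, s in frame if s == h) for h in ("h1", "h2")}
alloc = {h: floor(n * Nh / len(frame)) for h, Nh in sizes.items()}

# Step 3: independent SRS without replacement within each stratum
rng = random.Random(1)
sample = []
for h, n_h in alloc.items():
    sample.extend(rng.sample([u for u in frame if u[1] == h], n_h))

assert len(sample) == sum(alloc.values())
```

Because the within-stratum draws use separate `rng.sample` calls, the selections are independent across strata, matching the definition above.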
Estimation under Stratified SRS
1. HT estimator:
   T̂_y = Σ_{h=1}^H N_h ȳ_h,  where ȳ_h = n_h⁻¹ Σ_{i∈A_h} y_hi.
2. Variance:
   Var(T̂_y) = Σ_{h=1}^H (N_h²/n_h)(1 − n_h/N_h) S_h²,
   where S_h² = (N_h − 1)⁻¹ Σ_{i=1}^{N_h} (y_hi − Ȳ_h)².
3. Variance estimation:
   V̂(T̂_y) = Σ_{h=1}^H (N_h²/n_h)(1 − n_h/N_h) s_h²,
   where s_h² = (n_h − 1)⁻¹ Σ_{i∈A_h} (y_hi − ȳ_h)².

Optimal Strategy under Stratified Sampling
Theorem 1.2.6: Let F be a stratified finite population in which the elements in stratum h are realizations of iid(μ_h, σ_h²) random variables. Let C be the total cost for sample observation, and assume that it costs c_h to observe an element in stratum h. Then a sampling and estimation strategy for T_y that minimizes the anticipated variance, in the class of linear unbiased estimators and probability designs, is: select independent simple random samples in each stratum, selecting n_h* in stratum h, where
  n_h* ∝ N_h σ_h / √c_h  with  C = Σ_{h} n_h* c_h,
subject to n_h* ≤ N_h, and use the HT estimator.

Comments on Theorem 1.2.6
Anticipated variance:
  AV{θ̂ − θ_N} = E{E[(θ̂ − θ_N)² | F]} − [E{E(θ̂ − θ_N | F)}]².
For HT estimation, E(θ̂ − θ_N | F) = 0 and the anticipated variance becomes
  AV{T̂_y − T_y} = Σ_{h=1}^H N_h² (n_h⁻¹ − N_h⁻¹) σ_h².
Minimizing Σ_{h=1}^H n_h⁻¹ N_h² σ_h² subject to Σ_{h=1}^H n_h c_h = C leads to the optimal allocation n_h* ∝ N_h σ_h / √c_h.

Sample Selection Using SAS: PROC SURVEYSELECT

PROC SURVEYSELECT
Probability-based random sampling
  equal-probability selection
  PPS selection
Stratification and clustering
Sample size allocation
Sampling weights
  inclusion probabilities
  joint inclusion probabilities

Sampling Methods
Simple random, with and without replacement
Systematic
Sequential
PPS

PPS Sampling Methods
With and without replacement
Systematic
Sequential with minimum replacement
Two units per stratum: Brewer's, Murthy's
Sampford's method

Digitech Cable Company
Digital TV, high-speed Internet, digital phone
13,471 customers in four states: AL, FL, GA, and SC
Customer satisfaction survey (high-speed Internet service)

Sampling with Stratification
Can afford to call only 300 customers
The sampling frame contains the list of customer identifications, addresses, and types
Need adequate sampling units in every stratum (state and type)
Select a simple random sample without replacement within strata

Sampling Frame
CustomerID   State  Type
416874322    AL     Platinum
288139763    GA     Gold
339008654    GA     Gold
118980542    GA     Platinum
421670342    FL     Platinum
623189201    SC     Platinum
324550324    FL     Gold
832902397    AL     Gold

Sort Sampling Frame by Strata before Selection
proc sort data=Customers;
   by State Type;
run;
Select Stratified Sample
proc surveyselect data=Customers method=srs n=300
                  seed=3232445 out=SampleStrata;
   strata State Type / alloc=prop;
run;

The SURVEYSELECT Procedure
Selection Method     Simple Random Sampling
Strata Variables     State Type
Allocation           Proportional
Input Data Set       CUSTOMERS
Random Number Seed   3232445
Number of Strata     8
Total Sample Size    300
Output Data Set      SAMPLESTRATA

Strata Sizes
State  Type      SampleSize  PopSize
AL     Gold      16          706
AL     Platinum  28          1238
FL     Gold      31          1370
FL     Platinum  48          2170
GA     Gold      43          1940
GA     Platinum  78          3488
SC     Gold      19          875
SC     Platinum  37          1684

Data Collection
Important practical considerations that the computer cannot decide for you:
Telephone interview
Rating, age, household size, ...
Auxiliary variables: data usage, average annual income, home ownership rate, ...
Callbacks, edits, and imputations

Survey Objective: Digitech Cable
Rate customer satisfaction
Are customers willing to recommend Digitech?
Is satisfaction related to household size? race?
Is usage time related to data usage?
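The proportional allocation in the Strata Sizes table can be reproduced in Python under a largest-remainder rounding assumption (the rounding rule is our assumption for illustration, not documented SAS behavior):

```python
from math import floor

# Population sizes by stratum, copied from the Strata Sizes table
pops = {("AL", "Gold"): 706, ("AL", "Platinum"): 1238, ("FL", "Gold"): 1370,
        ("FL", "Platinum"): 2170, ("GA", "Gold"): 1940, ("GA", "Platinum"): 3488,
        ("SC", "Gold"): 875, ("SC", "Platinum"): 1684}
n, N = 300, sum(pops.values())                  # N = 13471

# Proportional allocation n_h = n * N_h / N, rounded by largest remainder
exact = {h: n * Nh / N for h, Nh in pops.items()}
alloc = {h: floor(e) for h, e in exact.items()}
for h in sorted(exact, key=lambda h: exact[h] - alloc[h],
                reverse=True)[: n - sum(alloc.values())]:
    alloc[h] += 1

assert sum(alloc.values()) == 300
assert alloc[("AL", "Gold")] == 16 and alloc[("GA", "Platinum")] == 78
```

The rounded allocation matches every SampleSize shown in the table above.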
Digitech Cable Data Collection
Survey variables: Rating, Recommend, Usage time, Household size, Race
Auxiliary variables: Data usage, Neighborhood income, Competitors

Large Sample Results for Survey Samples
Complex designs: weights
Few distributional assumptions
Heavy reliance on large sample theory
Central Limit Theorem

Review of Large Sample Results
Mann and Wald notation for order in probability. Sequence: X_1, X_2, ..., X_n, ... (g_n > 0)
  X_n = o_p(g_n) ⇔ X_n/g_n → 0 in probability ⇔ lim_{n→∞} P[|X_n/g_n| > ε] = 0 for all ε > 0
  X_n = O_p(g_n) ⇔ X_n/g_n bounded in probability ⇔ for every ε > 0 there is an M_ε such that P[|X_n/g_n| > M_ε] < ε for all n

Examples of Order in Probability
Let x̄_n ~ N(0, n⁻¹). Then the following statements hold:
  P{|x̄_n| > 2n^{-1/2}} < 0.05 for all n,
  P{|x̄_n| > Φ⁻¹(1 − 0.5ε) n^{-1/2}} < ε for all n,
therefore x̄_n = O_p(n^{-1/2}).
If x̄_n ~ N(0, n⁻¹σ²), then x̄_n = O_p(?)
If x̄_n ~ N(μ, n⁻¹σ²), then x̄_n = O_p(?)

Example of o_p
Again, let x̄_n ~ N(0, n⁻¹).
  lim_{n→∞} P{|x̄_n| > k} = 0 for all k > 0 ⇒ x̄_n = o_p(1)
  lim_{n→∞} P{|n^{1/4} x̄_n| > k} = 0 for all k > 0 ⇒ x̄_n = o_p(n^{-1/4})

Properties of Order in Probability
For f_n > 0, g_n > 0, X_n = O_p(f_n), Y_n = O_p(g_n):
  X_n Y_n = O_p(f_n g_n),
  |X_n|^s = O_p(f_n^s), s ≥ 0,
  X_n + Y_n = O_p(max{f_n, g_n}).

Chebyshev's Inequality
For given r > 0 with E{|X|^r} < ∞,
  P[|X − A| ≥ ε] ≤ E{|X − A|^r} / ε^r.
Corollary: E{X_n²} = O(a_n²) ⇒ X_n = O_p(a_n).
By Chebyshev's inequality,
  P(|X_n| / a_n > M) ≤ E{X_n²} / (a_n² M²) < ε;
choose M = √(K/ε), where K is the upper bound of a_n⁻² E{X_n²}.
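The bounded-in-probability statement x̄_n = O_p(n^{-1/2}) can be illustrated by simulation (a Monte Carlo sketch, not from the slides): the exceedance probability of the bound 2n^{-1/2} stays below about 0.05 uniformly in n.

```python
import random
import statistics

# For iid N(0,1) data, P(|x̄_n| > 2 n^{-1/2}) ≈ 0.046 for every n
rng = random.Random(7)
for n in (10, 100, 400):
    hits = sum(
        abs(statistics.fmean(rng.gauss(0, 1) for _ in range(n))) > 2 / n ** 0.5
        for _ in range(2000)
    )
    assert hits / 2000 < 0.07   # bounded near 0.05, uniformly in n
```

The same bound M = 2 works for every n, which is exactly what O_p(n^{-1/2}) asserts.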
Central Limit Theorems
Lindeberg: X_1, X_2, ...: independent (μ_i, σ_i²), B_n² = Σ_{i=1}^n σ_i².
  B_n⁻² Σ_{i=1}^n E{(X_i − μ_i)² I(|X_i − μ_i| > εB_n)} → 0 for all ε > 0
  ⇒ Σ_{i=1}^n (X_i − μ_i) / B_n →_L N(0, 1).
Liapounov: X_1, X_2, ...: independent (μ_i, σ_i²), B_n² = Σ_{i=1}^n σ_i².
  Σ_{i=1}^n E|X_i − μ_i|^{2+δ} = o(B_n^{2+δ}) for some δ > 0
  ⇒ Σ_{i=1}^n (X_i − μ_i) / B_n →_L N(0, 1).
Note: the Liapounov condition implies the Lindeberg condition.

Slutsky's Theorem
{X_n}, {Y_n} are sequences of random variables satisfying X_n →_L X and plim Y_n = c. Then
  X_n + Y_n →_L X + c,
  Y_n X_n →_L cX.

Theorem 1.3.1: Samples of Samples
Theorem: y_1, ..., y_N are iid with d.f. F(y) and c.f. φ(t) = E{e^{itY}}, i = √−1;
d = (I_1, ..., I_N): random vector independent of y
⇒ (y_k; k ∈ A) | d are iid with c.f. φ(t).
Proof: in book. A SRS of a SRS is a SRS.

Application of Theorem 1.3.1
y_1, ..., y_N ~ iid N(μ_y, σ_y²); SRS of size n_N from F_N
⇒ (y_k, k ∈ A) | d are iid N(μ_y, σ_y²) and (y_k, k ∈ U ∩ A^c) | d are iid N(μ_y, σ_y²)
⇒ ȳ_n ~ N(μ_y, σ_y²/n_N) and ȳ_{N−n} ~ N(μ_y, σ_y²/(N − n_N)), independent of ȳ_n
⇒ ȳ_n − Ȳ_N ~ N(0, n_N⁻¹(1 − f_N)σ_y²) and
  {n_N⁻¹(1 − f_N) s_n²}^{-1/2} (ȳ_n − Ȳ_N) ~ t_{n−1}.

Finite Population Sampling Motivation
Is x̄_n − x̄_N = o_p(1)?
Does (x̄_n − x̄_N) / √(V̂{x̄_n − x̄_N | F_N}) →_L N(0, 1)?
We'll be able to answer these shortly.
n → N isn't very interesting; need n → ∞ and N − n → ∞.
Need a sequence of samples from a sequence of finite populations.

Sequence of Samples from a Sequence of Populations
Approach 1: {F_N} is a sequence of fixed vectors.
Approach 2: {y_1, y_2, ..., y_N} is a realization from a superpopulation model.
Notation
U_N = {1, 2, ..., N}: N-th finite population
F_N = {y_{1N}, ..., y_{NN}}
y_{iN}: observation associated with the i-th element in the N-th population
A_N: sample index set selected from U_N, with size n_N = |A_N|

Design Consistency
Definition: θ̂ is design consistent for θ_N if for every ε > 0,
  lim_{N→∞} P{|θ̂ − θ_N| > ε | F_N} = 0
almost surely, where P(· | F_N) denotes the probability induced by the sampling mechanism.

Design Consistency for ȳ_n in SRS
Approach 1 (fixed sequence): assume a sequence of finite populations {F_N} s.t.
  lim_{N→∞} N⁻¹ Σ_{i=1}^N (y_i, y_i²) = (θ_1, θ_2), with θ_2 − θ_1² > 0.
By Chebyshev's inequality,
  P(|ȳ_n − Ȳ_N| ≥ ε | F_N) ≤ n_N⁻¹(1 − f_N) S_N² / ε²,
where f_N = n_N N⁻¹.
Approach 2: y_1, ..., y_N ~ iid(μ, σ_y²)
⇒ lim_{N→∞} S_N² = σ_y² a.s.
⇒ V[ȳ_n − Ȳ_N | F_N] = O_p(n_N⁻¹)
⇒ ȳ_n − Ȳ_N | F_N = O_p(n_N^{-1/2}).

Central Limit Theorem (1.3.2), Part 1
Theorem (Part 1): {y_{1N}, ..., y_{NN}} ~ iid(μ, σ²) with 2 + δ moments (δ > 0);
SRS, ȳ_n = n⁻¹ Σ_{i=1}^N I_i y_{iN}, Ȳ_N = N⁻¹ Σ_{i=1}^N y_{iN}
⇒ [V(ȳ_n − Ȳ_N)]^{-1/2} (ȳ_n − Ȳ_N) | d →_L N(0, 1).
Proof: write
  ȳ_n − Ȳ_N = N⁻¹ Σ_{i=1}^N c_{iN} y_{iN}, where c_{iN} = n⁻¹NI_i − 1.
  B_n² = N² V(ȳ_n − Ȳ_N) = N² V(ȳ_n − Ȳ_N | d) = Σ_{i=1}^N c_{iN}² σ² = (N − n)(N/n)σ².
Apply the Lindeberg CLT.

Theorem 1.3.2, Part 2
Theorem (Part 2): furthermore, if {y_{iN}} has bounded fourth moments, then
  [V̂(ȳ_n − Ȳ_N)]^{-1/2} (ȳ_n − Ȳ_N) →_L N(0, 1).
Proof: we want to apply Slutsky's theorem:
  (ȳ_n − Ȳ_N) / √V̂(ȳ_n − Ȳ_N) = [(ȳ_n − Ȳ_N) / √V(ȳ_n − Ȳ_N)] · √{V(ȳ_n − Ȳ_N) / V̂(ȳ_n − Ȳ_N)} →_L N(0, 1).
Then it is enough to show that
  V(ȳ_n − Ȳ_N) / V̂(ȳ_n − Ȳ_N) = (n⁻¹ − N⁻¹)σ_y² / {(n⁻¹ − N⁻¹)s_n²} →_p 1.
To show s_n² →_p σ_y², note that
  s_n² / σ_y² = σ_y⁻² (n−1)⁻¹ Σ_{i=1}^n (y_i − ȳ_n)²
             = σ_y⁻² (n−1)⁻¹ Σ_{i=1}^n (y_i − μ)² − σ_y⁻² n(n−1)⁻¹ (ȳ_n − μ)²
             = σ_y⁻² n⁻¹ Σ_{i=1}^n (y_i − μ)² + O_p(n⁻¹)
             →_p 1 if E(y_i − μ)⁴ ≤ M_4 < ∞.

Comment on Theorem 1.3.2
1. The CLT in Theorem 1.3.2 is derived under Approach 2 (using a superpopulation model).
2. The result can be extended to stratified random sampling (textbook):
  {y_{hiN}} ~ iid(μ_h, σ_h²),
  θ̂_n = Σ_{h=1}^{H_N} N⁻¹ N_h ȳ_hn,  Ȳ_N = Σ_{h=1}^{H_N} N⁻¹ N_h Ȳ_hN,
  [V̂(θ̂_n − Ȳ_N)]^{-1/2} (θ̂_n − Ȳ_N) →_L N(0, 1).

Poisson Sampling
Population: y_1, y_2, ..., y_N; probabilities: π_1, π_2, ..., π_N.
The sampling process is a Bernoulli trial for each i (independent trials).
The sample is those i for which the trial is a success: N independent random variables.

CLT under Approach 1 (Fixed Sequences)
y_1, y_2, ...: sequence of real vectors; π_1, π_2, ...: sequence of selection probabilities.
  g_i = (1, y_i, α_N π_i⁻¹, α_N π_i⁻¹ y_i)′, where α_N = E(n_N)/N = n_{BN}/N,
  x_i = g_i I_i, I_i ~ Bernoulli(π_i) (i.e., Poisson sampling).
Let μ̂_x = n_{BN}⁻¹ Σ_{i=1}^N x_i = n_{BN}⁻¹ Σ_{i=1}^N g_i I_i and μ_{xN} = n_{BN}⁻¹ Σ_{i=1}^N g_i π_i. Then
  E(μ̂_x | F_N) = μ_{xN},
  T̂_{y,HT} = Σ_{i=1}^N I_i y_i π_i⁻¹ = α_N⁻¹ Σ_{i=1}^N α_N π_i⁻¹ y_i I_i = α_N⁻¹ Σ_{i=1}^N x_{i4}.

Theorem 1.3.3, Part 1
Theorem (Part 1): assume Poisson sampling and
(i) lim_{N→∞} n_{BN}⁻¹ Σ_{i=1}^N g_i π_i = μ_x,
(ii) lim_{N→∞} n_{BN}⁻¹ Σ_{i=1}^N π_i(1 − π_i) g_i g_i′ = Σ_xx = [Σ_11 Σ_12; Σ_12′ Σ_22], with Σ_11, Σ_22 positive definite,
(iii) lim_{N→∞} max_{1≤k≤N} (γ′g_k)² / Σ_{i=1}^N (γ′g_i)² π_i(1 − π_i) = 0 for all γ s.t. γ′Σ_xx γ > 0.
Then
  √n_{BN} (μ̂_x − μ_{xN}) | F_N →_L N(0, Σ_xx).
Proof: textbook.

Theorem 1.3.3, Part 2
Theorem (Part 2): under conditions (i)-(iii), if
(iv) lim_{N→∞} n_{BN}⁻¹ Σ_{i=1}^N π_i |g_i|⁴ = M_4,
then [V̂(T̂_y)]^{-1/2} (T̂_y − T_y) | F_N →_L N(0, I), where
  V̂(T̂_y) = Σ_{i∈A} {π_i(1 − π_i)/π_i} (y_i/π_i)(y_i/π_i)′.
Proof: textbook.

Theorem 1.3.4: CLT for SRS
Theorem: {y_i}: sequence of real numbers with bounded 4th moments; SRS without replacement
⇒ V̂_n^{-1/2} (ȳ_n − ȳ_N) | F_N →_L N(0, 1).
The result is obtained by showing that there is a SRS mean that differs from a Poisson mean by o_p(n^{-1/2}).

Function of Means (Theorem 1.3.7)
Theorem:
(i) √n(x̄_n − μ_x) →_L N(0, Σ_xx),
(ii) g(x): continuous and differentiable at x = μ_x,
(iii) h_x(x) = ∂g(x)/∂x: continuous at x = μ_x
⇒ √n[g(x̄_n) − g(μ_x)] →_L N(0, h_x′(μ_x) Σ_xx h_x(μ_x)).

Proof of Theorem 1.3.7
[Step 1] By a Taylor expansion, g(x̄_n) = g(μ_x) + (x̄_n − μ_x) h_x′(μ_x*), where μ_x* is on the line segment joining x̄_n and μ_x.
[Step 2] Show μ_x* − μ_x = o_p(1).
[Step 3] Using the assumption that h_x(x) is continuous at x = μ_x, show that h_x(μ_x*) − h_x(μ_x) = o_p(1).
[Step 4] Because
  √n[g(x̄_n) − g(μ_x)] = √n(x̄_n − μ_x) h_x′(μ_x*) = √n(x̄_n − μ_x) h_x′(μ_x) + o_p(1),
we apply Slutsky's theorem to get the result.

Example on Curvature
x̄_n ~ N(μ, V(x̄_n)); approximation: x̄_n² ~ N(μ², (2μ)² V(x̄_n)).
Let μ = 2 and V(x̄_n) = 0.01. Then
  E{x̄_n²} = 2² + 0.01 ≈ μ²,
  V{x̄_n²} = 2(0.01)² + 4(2²)(0.01) ≈ 4μ² V(x̄_n).
Let μ = 2 and V(x̄_n) = 3. Then
  E{x̄_n²} = 4 + 3 ≠ μ²,
  V{x̄_n²} = 2(3)² + 4(2²)(3) ≠ 4μ² V(x̄_n).
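The curvature example can be made numeric with the exact normal moments E{x̄²} = μ² + V and Var{x̄²} = 4μ²V + 2V² (the delta method keeps only the 4μ²V term); a quick Python check:

```python
# Exact vs delta-method variance of x̄_n² when x̄_n ~ N(mu, V)
def exact_var_sq(mu, V):
    return 4 * mu ** 2 * V + 2 * V ** 2   # exact Var of a squared normal mean

def delta_var_sq(mu, V):
    return 4 * mu ** 2 * V                # (2*mu)^2 * V from Theorem 1.3.7

# V small relative to mu^2: delta approximation is essentially exact
assert abs(exact_var_sq(2, 0.01) - delta_var_sq(2, 0.01)) / exact_var_sq(2, 0.01) < 0.002

# V large: the neglected curvature term 2V^2 is a large share of the variance
assert abs(exact_var_sq(2, 3) - delta_var_sq(2, 3)) / exact_var_sq(2, 3) > 0.25
```

With V = 0.01 the relative error of the approximation is about 0.1%; with V = 3 the neglected term is 18 of the exact 66, matching the "≠" lines above.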
Large Sample Bias
n^{1/2}(θ̂ − θ) →_L N(0, 1) does not imply E{θ̂} → θ.
For example, if (ȳ_n, x̄_n) ~ N((μ_y, μ_x), Σ) and μ_x ≠ 0, then E(ȳ_n/x̄_n) is not defined.

Ratio Estimation (Theorem 1.3.8, Part 1)
Theorem: x_i = (x_{1i}, x_{2i}), X̄_{1N} ≠ 0, R̂ = x̄_{2,HT}/x̄_{1,HT}.
  √n N⁻¹(T̂_x − T_{xN}) →_L N(0, M_xx)
  ⇒ √n(R̂ − R_N) →_L N(0, h_N M_xx h_N′),
where T̂_x = Σ_{i∈A} π_i⁻¹ x_i, T_{xN} = Σ_{i=1}^N x_i, R_N = X̄_{2N}/X̄_{1N}, and h_N = X̄_{1N}⁻¹(−R_N, 1).

Proof of Theorem 1.3.8, Part 1
  R̂ = x̄_{2,HT}/x̄_{1,HT} = X̄_{2N}/X̄_{1N} + X̄_{1N}⁻¹(x̄_{2,HT} − X̄_{2N}) − X̄_{1N}⁻² X̄_{2N}(x̄_{1,HT} − X̄_{1N}) + Remainder.
Method 1: mean value theorem & continuity of the first-order partial derivatives ⇒ Remainder = o_p(n^{-1/2}).
Method 2: second-order Taylor expansion + continuity of the second-order partial derivatives ⇒ Remainder = O_p(n⁻¹):
  R̂ = R + ∂R/∂x|_{x̄*} (x̄_HT − X̄_N)′
    = R + ∂R/∂x (x̄_HT − X̄_N)′ + ½ (x̄_HT − X̄_N) ∂²R/∂x∂x′|_{x̄**} (x̄_HT − X̄_N)′.

Theorem 1.3.8, Part 2
Theorem: in addition, if {V(T̂_x | F_N)}⁻¹ V̂_HT(T̂_x) − I = o_p(n^{-1/2}), then
  [V̂(R̂)]^{-1/2} (R̂ − R_N) →_L N(0, 1),
where
  V̂(R̂) = Σ_{i∈A} Σ_{j∈A} π_ij⁻¹ (π_ij − π_i π_j) π_i⁻¹ d̂_i π_j⁻¹ d̂_j,
  d̂_i = T̂_{x1}⁻¹ (x_{2i} − R̂ x_{1i}),  T̂_x = Σ_{i∈A} π_i⁻¹ x_i,  T_{xN} = Σ_{i=1}^N x_i,  R_N = x̄_{2N}/x̄_{1N}.

Remarks on Ratios
1. Variance estimation:
  V̂(R̂) = V̂( Σ_{i∈A} π_i⁻¹ d̂_i ),  d̂_i = T̂_{x1}⁻¹ (x_{2i} − R̂ x_{1i}).
2. If x_{1i} = 1 and x_{2i} = y_i, then the Hájek estimator is
  ȳ_π = Σ_{i∈A} π_i⁻¹ y_i / Σ_{i∈A} π_i⁻¹
  ⇒ V(ȳ_π − ȳ_N | F) = V[ N⁻¹ Σ_{i∈A} π_i⁻¹ (y_i − ȳ_N) | F ],
versus V[N⁻¹ T̂_y | F] = V[ N⁻¹ Σ_{i∈A} π_i⁻¹ y_i | F ].

Approximations for Complex Estimators
θ̂ is defined through an estimating equation Σ_{i∈A} w_i g(x_i, θ̂) = 0. Let
  Ĝ(θ) = Σ_{i∈A} w_i g(x_i; θ),  G(θ) = N⁻¹ Σ_{i=1}^N g(x_i; θ),
  Ĥ(θ) = ∂Ĝ(θ)/∂θ,  H(θ) = ∂G(θ)/∂θ.

Theorem 1.3.9
Theorem: under suitable conditions, √n(θ̂ − θ_N) →_L N(0, V), where
  V = n[H(θ_N)]⁻¹ V{Ĝ(θ_N) | F_N} [H(θ_N)]⁻¹.
Also, V(θ̂) can be estimated by
  V̂ = n[Ĥ(θ̂)]⁻¹ V̂{Ĝ(θ̂) | F_N} [Ĥ(θ̂)]⁻¹.

Comments
It is difficult to show a CLT for the general HT estimator. Exception: a large number of strata.
The CLT requires:
  Large samples (that is, effectively large)
  Moments: no extreme observations
  No extreme weights
  Functions: curvature small relative to s.e.

Basic Estimators Using SAS
PROC SURVEYMEANS
PROC SURVEYFREQ

PROC SURVEYMEANS
Univariate analysis: population totals, means, ratios, and quantiles
Variances and confidence limits
Domain analysis
Poststratified analysis

PROC SURVEYFREQ
One-way to n-way frequency and crosstabulation tables
Totals and proportions
Tests of association between variables
Estimates of risk differences, odds ratios, and relative risks
Standard errors and confidence limits

Digitech Cable
Describe:
  Satisfaction ratings
  Usage time
  Satisfaction ratings based on household sizes

PROC SURVEYMEANS
ods graphics on;
proc surveymeans data=ResponseData mean total=tot;
   strata State Type;
   weight SamplingWeight;
   class Rating;
   var Rating UsageTime;
run;
ods graphics off;

The SURVEYMEANS Procedure
Data Summary
Number of Strata         8
Number of Observations   300
Sum of Weights           13471

Statistics
Variable   Level                  Label                  Mean        Std Error of Mean
UsageTime                         Computer Usage Time    284.953667  10.904880
Rating     Extremely Unsatisfied  Customer Satisfaction  0.247287    0.024463
Rating     Unsatisfied            Customer Satisfaction  0.235889    0.025091
Rating     Neutral                Customer Satisfaction  0.224797    0.024247
Rating     Satisfied              Customer Satisfaction  0.200509    0.023548
Rating     Extremely Satisfied    Customer Satisfaction  0.091518    0.016986

PROC SURVEYFREQ
proc surveyfreq data=ResponseData;
   strata State Type;
   weight SamplingWeight;
   tables Rating / chisq testp=(0.25 0.20 0.20 0.20 0.15);
run;

The SURVEYFREQ Procedure
Customer Satisfaction
Rating                 Frequency  Weighted Frequency  Std Err of Wgt Freq  Percent  Test Percent  Std Err of Percent
Extremely Unsatisfied  70         3154                315.53192            24.7287  25.00         2.4739
Unsatisfied            67         3009                323.64922            23.5889  20.00         2.5376
Neutral                64         2867                312.75943            22.4797  20.00         2.4522
Satisfied              57         2557                303.73618            20.0509  20.00         2.3814
Extremely Satisfied    26         1167                219.09943            9.1518   15.00         1.7178
Total                  284        12754               7.72778E-6           100.000
Frequency Missing = 16

Rao-Scott Chi-Square Test
Pearson Chi-Square     9.1865
Design Correction      0.9857
Rao-Scott Chi-Square   9.3195
DF                     4
Pr > ChiSq             0.0536
F Value                2.3299
Num DF                 4
Den DF                 1104
Pr > F                 0.0543
Sample Size = 284

Domain Estimation
Domains are subsets of the entire population
Domain sample size is not fixed
Variance estimation should account for random sample sizes in domains
Degrees of freedom measured using the entire sample
Use the DOMAIN statement
Do NOT use the BY statement for domain analysis

Describe Usage Based on Household Sizes
proc surveymeans data=ResponseData mean total=tot;
   strata State Type;
   weight SamplingWeight;
   var UsageTime;
   domain HouseholdSize;
run;

Describe Rating Based on Household Sizes
proc surveyfreq data=ResponseData;
   strata State Type;
   weight SamplingWeight;
   tables HouseholdSize * Rating / plots=all;
run;
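The one-way SURVEYFREQ table for Rating shown earlier can be sanity-checked in Python: each displayed Percent is the weighted-frequency share of the weighted total (the loose tolerance allows for the rounding of weighted frequencies in the display):

```python
# Rows of the Customer Satisfaction table: (weighted frequency, displayed percent)
rows = [(3154, 24.7287), (3009, 23.5889), (2867, 22.4797),
        (2557, 20.0509), (1167, 9.1518)]

total = sum(w for w, _ in rows)
assert total == 12754                            # matches the displayed total

for w, pct in rows:
    assert abs(100 * w / total - pct) < 0.01     # percent = share of wgt total
```

This is only a consistency check of the printed output, not a recomputation of the design-based standard errors.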
Chapter 2: Use of Auxiliary Information in Estimation
World Statistics Congress Short Course, July 23-24, 2015

Ratio Estimation
Population: observe x̄_N = N⁻¹ Σ_{i=1}^N x_i.
Sample: observe (x̄_HT, ȳ_HT) = N⁻¹ Σ_{i∈A} π_i⁻¹ (x_i, y_i).
Ratio estimator:
  ȳ_rat = (x̄_N / x̄_HT) ȳ_HT.
Let R_N = x̄_N⁻¹ ȳ_N be the population ratio, where (x̄_N, ȳ_N) = N⁻¹ Σ_{i=1}^N (x_i, y_i).
Assume that (x̄_HT, ȳ_HT) − (x̄_N, ȳ_N) = O_p(n^{-1/2}) and that x̄_N ≠ 0.

Asymptotic Properties of the Ratio Estimator (1)
Linear approximation: ȳ_rat − ȳ_N = ȳ_HT − R_N x̄_HT + O_p(n⁻¹).
Proof:
  ȳ_rat − ȳ_N = x̄_HT⁻¹ x̄_N (ȳ_HT − R_N x̄_HT)
             = {1 + O_p(n^{-1/2})} (ȳ_HT − R_N x̄_HT)
             = ȳ_HT − R_N x̄_HT + O_p(n⁻¹).

Asymptotic Properties of the Ratio Estimator (2)
Bias approximation, using a second-order Taylor expansion:
  x̄_HT⁻¹ ȳ_HT = R_N + x̄_N⁻¹(ȳ_HT − R_N x̄_HT)
             + x̄_N⁻² {R_N (x̄_HT − x̄_N)² − (x̄_HT − x̄_N)(ȳ_HT − ȳ_N)} + O_p(n^{-3/2}).
Under moment conditions for x̄_HT⁻¹ ȳ_HT and x̄_N ≠ 0,
  Bias(ȳ_rat) = E(ȳ_rat − ȳ_N) = x̄_N⁻¹ [R_N V(x̄_HT) − Cov(x̄_HT, ȳ_HT)] + O(n⁻²) = O(n⁻¹).
Thus the bias of θ̂ = ȳ_rat is negligible because
  R.B.(θ̂) = Bias(θ̂) / √Var(θ̂) → 0 as n → ∞.

Asymptotic Properties of the Ratio Estimator (3)
Given the conditions of Theorem 1.3.8,
  (θ̂ − θ) / √Var(θ̂) →_L N(0, 1),
and
  V̂(θ̂) = (x̄_HT / x̄_N)⁻² V̂(ȳ_HT − R̂ x̄_HT).

Other Properties of the Ratio Estimator
1. The ratio estimator is the best linear unbiased estimator under y_i = x_i β + e_i, e_i ~ (0, x_i σ²).
2. Scale invariant, not location invariant.
3. Linear, but not design linear.
4. Ratio estimators in a stratified sample (W_h = N_h/N):
  ȳ_st,s = Σ_{h=1}^H W_h ȳ_h (x̄_hN / x̄_h): separate ratio estimator,
  ȳ_st,c = ( Σ_{h=1}^H W_h x̄_hN ) ( Σ_{h=1}^H W_h ȳ_h ) / ( Σ_{h=1}^H W_h x̄_h ): combined ratio estimator.

§2.2 Regression Estimation
Sample: observe (x_i, y_i). Population: observe x_i = (1, x_{1i}), i = 1, 2, ..., N, or only x̄_N.
Interested in estimation of ȳ_N = N⁻¹ Σ_{i=1}^N y_i.
Regression model:
  y_i = x_i β + e_i,
with e_i independent of x_j for all i and j, and e_i ~ ind(0, σ_e²).
Under the normal model, regression gives the best predictor.

Regression Model: Estimation
y_i = x_i β + e_i, e_i ~ ind(0, σ_e²). Linear estimator: ȳ_w = Σ_{i∈A} w_i y_i.
To find the best linear unbiased estimator of x̄_N β = E{ȳ_N}:
  min V( Σ_{i∈A} w_i y_i | X, x̄_N ) s.t. E{ Σ_{i∈A} w_i y_i − ȳ_N | X, x̄_N } = 0
  ⇔ min Σ_{i∈A} w_i² s.t. Σ_{i∈A} w_i x_i = x̄_N,
where X′ = (x_1′, x_2′, ..., x_n′).

Best Linear Unbiased Estimator
Lagrange multiplier method:
  Q = ½ Σ_{i∈A} w_i² + ( Σ_{i∈A} w_i x_i − x̄_N ) λ,
  ∂Q/∂w_i = w_i + x_i λ = 0,
  Σ_{i∈A} w_i x_i = x̄_N ⇒ λ′ = −x̄_N ( Σ_{i∈A} x_i′x_i )⁻¹,
  ∴ w_i = x̄_N ( Σ_{i∈A} x_i′x_i )⁻¹ x_i′ = x̄_N (X′X)⁻¹ x_i′.
The regression estimator is the solution that minimizes the variance in the class of linear estimators that are unbiased under the model.
Properties of the Regression Estimator
1. Linear in y. Location-scale invariant.
2. Alternative expression: for x_i = (1, x_{1i}),
  ȳ_reg = ȳ_n + (x̄_{1N} − x̄_{1n}) β̂_1 = Σ_{i∈A} w_i y_i,
  β̂_1 = [ Σ_{i∈A} (x_{1i} − x̄_{1n})′(x_{1i} − x̄_{1n}) ]⁻¹ Σ_{i∈A} (x_{1i} − x̄_{1n})′ y_i,
  w_i = n⁻¹ + (x̄_{1N} − x̄_{1n}) [ Σ_{i∈A} (x_{1i} − x̄_{1n})′(x_{1i} − x̄_{1n}) ]⁻¹ (x_{1i} − x̄_{1n})′.
3. Writing ȳ_reg = ȳ_n + (x̄_{1N} − x̄_{1n}) β̂_1 = β̂_0 + x̄_{1N} β̂_1, the regression estimator can be viewed as the predicted value of Y = β_0 + x_1 β_1 + e at x_1 = x̄_{1N} under the regression model.

Example: Artificial Population
[Figure: scatter plot of the artificial population (x, y), with the population mean (x̄_N, ȳ_N) = (3, 4) marked.]

Example (Cont'd): SRS of Size n = 20
[Figure: scatter plot of the sample; pop. mean = 4, sample mean = 3.42 at (x̄_n, ȳ_n) = (2.61, 3.42), regression estimate = 3.85 at (x̄_N, ȳ_reg) = (3, 3.85).]

Properties of the Regression Estimator (2)
4. For the mean-adjusted regression model y_i = γ_0 + (x_{1i} − x̄_{1N})γ_1 + e_i, the OLS estimator of γ_0 is
  γ̂_0 = ȳ_n − (x̄_{1n} − x̄_{1N}) γ̂_1, where γ̂_1 = β̂_1.
That is, γ̂_0 = ȳ_reg.
5 Under the linear regression model:

a ȳ_reg is unbiased (by construction):

E(ȳ_reg − ȳ_N | X_N) = E{ N^(−1) Σ_{i=1}^{N} x_i β̂ − N^(−1) Σ_{i=1}^{N} (x_i β + e_i) | X_N }
= N^(−1) Σ_{i=1}^{N} x_i β − N^(−1) Σ_{i=1}^{N} x_i β = 0

(∵ E(β̂ | X_N) = β and E(e_i | X_N) = 0)

Properties of Regression Estimator (3)

b Variance:

V(ȳ_reg − ȳ_N | X, x̄_N) = n^(−1)(1 − f)σ_e² + (x̄_1N − x̄_1n) V(β̂₁ | X) (x̄_1N − x̄_1n)′

V(β̂₁ | X) = [Σ_{i∈A} (x_{1i} − x̄_1n)′(x_{1i} − x̄_1n)]^(−1) σ_e²

∵ ȳ_reg = ȳ_n + (x̄_1N − x̄_1n)β̂₁
= β₀ + x̄_1n β₁ + ē_n + (x̄_1N − x̄_1n)β₁ + (x̄_1N − x̄_1n)(β̂₁ − β₁)
= β₀ + x̄_1N β₁ + ē_n + (x̄_1N − x̄_1n)(β̂₁ − β₁),

ȳ_N = β₀ + x̄_1N β₁ + ē_N

⇒ ȳ_reg − ȳ_N = ē_n − ē_N + (x̄_1N − x̄_1n)(β̂₁ − β₁)

Properties of Regression Estimator (4)

If x_{1i} ∼ Normal,

V{ȳ_reg − ȳ_N} = (1 − f) n^(−1) [1 + k/(n − k − 2)] σ_e² , where k = dim(x_i),

and V{ȳ_n − ȳ_N} = n^(−1)(1 − f)σ_y². If

R²_adj = 1 − σ_e²/σ_y² ≥ k/(n − 2),

then V{ȳ_n − ȳ_N} ≥ V{ȳ_reg − ȳ_N}.

Best Linear Predictor

Let y_i for i ∈ A be observations from the model y_i = x_i β + e_i , with e_i ∼ ind(0, σ_e²). Predict ȳ_{N−n} = x̄_{N−n} β + ē_{N−n}. The BLUP of ē_{N−n} is 0, and the BLUP (estimator) of x̄_{N−n} β is x̄_{N−n} β̂. Thus, the BLUP of ȳ_N is

N^(−1)[n ȳ_n + (N − n) x̄_{N−n} β̂] = N^(−1)[n x̄_n β̂ + (N − n) x̄_{N−n} β̂] = x̄_N β̂

because ȳ_n − x̄_n β̂ = 0.

General Population: SRS Regression

Since ȳ_n − x̄_n β̂ = 0,

ȳ_reg − ȳ_N = x̄_N β̂ − ȳ_N
= ȳ_n + (x̄_N − x̄_n)β̂ − ȳ_N
= ȳ_n − ȳ_N + (x̄_N − x̄_n)β_N + (x̄_N − x̄_n)(β̂ − β_N)
= ā_n − ā_N + (x̄_N − x̄_n)(β̂ − β_N),

where a_i = y_i − x_i β_N , ā_N = 0, and β_N = (Σ_{i∈U} x′_i x_i)^(−1) Σ_{i∈U} x′_i y_i.
X X = x0i xi x0i yi i∈U Kim & Fuller & Mukhopadhyay (ISU & SAS) i∈U Chapter 2 7/23-24/2015 106 / 318 Bias of Regression Estimator Design bias is negligible (assume moments) E [ȳreg n o | FN ] = ȳN + E {ān − āN | FN } + E (x̄N − x̄n )(β̂ − βN )|FN n o = ȳN + E (x̄N − x̄n )(β̂ − βN )|FN . Bias(ȳreg | FN ) = −E [(x̄n − x̄N )(β̂ − βN )|FN ] o n 0 0 = −tr Cov (x̄n , β̂ ) | FN [Bias(ȳreg | FN )]2 ≤ k X V (x̄j,n | FN )][V (β̂j | FN )] = O(n−1 )O(n−1 ). j=1 ⇒ Bias(ȳreg | FN ) = O(n−1 ). Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 107 / 318 Variance of Approximate Distribution ȳreg − ȳN = ān − āN + (x̄N − x̄n )(β̂ − βN ) = ān − āN + Op (n−1 ) V (ān | FN ) = (1 − f )n−1 Sa2 L By Theorem 1.3.4, [V (ān | FN )]−1/2 (ān − āN ) → N(0, 1). Recall āN = 0. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 108 / 318 Estimated Variance V̂ {ȳreg } = (1 − f )n −1 −1 (n − k) X âi2 , i∈A where âi = yi − xi β̂ and k = dimension of xi . X âi2 = i∈A Xh ai − xi (β̂ − βN ) i2 i∈A ! = X = X ai2 − 2(β̂ − βN )0 X x0i ai + (β̂ − βN )0 i∈A i∈A X x0i xi (β̂ − βN ) i∈A ai2 + Op (1) i∈A Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 109 / 318 Limiting Distribution of Coefficients Theorem 2.2.2 (SRS) Assume (yi , xi ) iid, existence of eighth moments. !−1 β̂ = (M̂xx )−1 M̂xy = n−1 X x0i xi n−1 i∈A X x0i yi i∈A !−1 βN −1 = (Mxx,N ) Mxy ,N = N −1 X x0i xi i∈U N −1 X x0i yi i∈U L ⇒ V {β̂ − βN |FN }−1/2 (β̂ − βN )|FN → N(0, I) −1 V {β̂ − βN |FN } = n−1 (1 − fN )M−1 xx,N Vbb,N Mxx,N X Vbb,N = N −1 x0i ai2 xi , ai = yi − xi βN i∈U Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 110 / 318 Proof of Theorem 2.2.2 ! −1 β̂ − βN = M̂−1 xx M̂xa =: M̂xx n−1 X bi i∈A P where M̂xa = n−1 i∈A x0i ai . Given moment conditions, o √ n L n M̂xa − Mxa,N → N [0, (1 − fN )Vbb,N ] (Theorem 1.3.4) P p where Mxa,N = N −1 i∈U xi ai = 0. Now M̂xx → Mxx,N and, by Slutsky’s theorem, the result follows. 
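Theorem 2.2.2 can be illustrated by computing β̂ and β_N on simulated data and forming the sandwich variance estimate n^(−1)(1 − f_N) M⁻¹_xx V_bb M⁻¹_xx. A sketch; the population, sample size, and seed are invented:

```python
import numpy as np

rng = np.random.default_rng(21)

# Invented finite population
N, n = 20000, 500
x1 = rng.normal(2.0, 1.0, N)
y = 1.0 + 0.5 * x1 + rng.normal(0.0, 1.0, N)
X = np.column_stack([np.ones(N), x1])

# beta_N = (sum_U x_i' x_i)^{-1} sum_U x_i' y_i
beta_N = np.linalg.solve(X.T @ X, X.T @ y)

idx = rng.choice(N, n, replace=False)       # SRS
Xs, ys = X[idx], y[idx]
beta_hat = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# Sandwich variance: n^{-1}(1 - f) Mxx^{-1} Vbb Mxx^{-1}
a_hat = ys - Xs @ beta_hat                  # residuals a_i
Mxx = Xs.T @ Xs / n
Vbb = (Xs.T * a_hat**2) @ Xs / n
Mxx_inv = np.linalg.inv(Mxx)
V_hat = (1 - n / N) / n * Mxx_inv @ Vbb @ Mxx_inv
```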
Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 111 / 318 Regression for General Designs (Theorem 2.2.1) FN = {z1 , z2 , · · · , zN } where zi = (yi , xi ), Z0n = (z10 , z20 , · · · , zn0 ). Define M̂zφz = n−1 Z0n Φ−1 n Zn = M̂xφx M̂y φx M̂xφy M̂y φy and β̂ = M̂−1 xφx M̂xφy , where Φn :n × n matrix (positive definite) e.g. Φn = (N/n)diag{π1 , · · · , πn }. Assume (i) V {z̄HT − z̄N |FN } = Op (n−1 ) a.s. where z̄HT = N −1 P i∈A πi−1 zi (ii) M̂zφz − Mzφz,N = Op (n−1/2 ) a.s. and M̂zφz nonsingular. (iii) K1 < Nn−1 πi < K2 L (iv) [V {z̄HT − z̄N |FN }]−1/2 (z̄HT − z̄N ) → N(0, I) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 112 / 318 Theorem 2.2.1 Theorem Given moments, design consistent HT estimators, 0 −1 (i) β̂ − βN = M−1 xφx,N b̄HT + Op (n ) a.s. P P 0 x −1 0 where βN = (Mxx,N )−1 Mxy ,N = x i i∈U i i∈U xi yi . L (ii) [V̂ (β̂|FN )]−1/2 (β̂ − βN ) → N(0, I) where b0i = n−1 Nπi ξi ai , ai = yi − xi βN , −1 ξi is the i-th column of X0n Φ−1 n , b̄HT = N P i∈A πi−1 bi and −1 −1 0 V̂ (β̂ | FN ) = M̂xφx V̂ (b̄HT )M̂xφx Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 113 / 318 Proof of Theorem 2.2.1 β̂ − βN −1 = M̂xφx (n−1 Xn0 Φ−1 n an ) −1/2 n−1 Xn0 Φ−1 ) n an = Op (n −1 M̂xφx β̂ − βN n−1 Xn0 Φ−1 n an −1 = Mxφx, + Op (n−1/2 ) N −1 −1 = Mxφx, (n−1 Xn0 Φ−1 n an ) + Op (n ) N X X −1 −1 0 = n ξ i ai = N πi−1 bi0 = b̄HT i∈A i∈A −1 0 ξi is the i-th column of Xn0 Φ−1 n , bi = n Nπi ξi ai . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 114 / 318 Remarks on Theorem 2.2.1 1 The choice of Φn is arbitrary. (i.e. The result in Theorem 2.2.1 holds for given Φn ) A simple case is Φn = (N/n)diag{π1 , · · · , πn } 2 Variance estimation: −1 −1 V̂ (β̂) = M̂xφx V̂bb M̂xφx where V̂bb is the estimated sampling variance of b̄0HT calculated with b̂0i = n−1 Nπi ξi âi and âi = yi − xi β̂. 3 Result holds for a general regression estimator. That is, the asymptotic normality of x̄N β̂ follows from the asymptotic normality of β̂. 
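Remark 1's simple choice Φ_n = (N/n)diag{π₁, ..., π_n} makes β̂ a π⁻¹-weighted least-squares coefficient; the N/n factor cancels. A hypothetical sketch, with a check that equal probabilities reduce it to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented sample data
n = 100
x1 = rng.uniform(0.0, 5.0, n)
y = 1.0 + 2.0 * x1 + rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x1])

def beta_pi(X, y, pi):
    # beta_hat = (X' Phi^{-1} X)^{-1} X' Phi^{-1} y with
    # Phi = (N/n) diag(pi); the constant N/n factor cancels.
    W = 1.0 / pi
    return np.linalg.solve((X.T * W) @ X, (X.T * W) @ y)

pi_unequal = rng.uniform(0.1, 0.9, n)
beta_w = beta_pi(X, y, pi_unequal)

# With constant probabilities the estimator is OLS
beta_eq = beta_pi(X, y, np.full(n, 0.3))
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```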
4 Theorem 2.2.1 states the consistency of β̂ for βN . But we are also interested in the consistency of the estimator of ȳN . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 115 / 318 Theorem 2.2.3: Design Consistency of ȳreg for ȳN Theorem n o Let p lim β̃ − βN | FN = 0. Then, p lim {ȳreg N 1 X − ȳN | FN } = 0 ⇐⇒ p lim ai = 0 N→∞ N i=1 where ai = yi − xi βN and ȳreg = x̄N β̃. Proof : Because β̃ is design consistent for βN , ) ( N n o X = p lim N −1 (yi − xi β̃) | FN p lim ȳN − x̄N β̃ | FN ( = p lim N −1 i=1 N X ) (yi − xi βN ) | FN . i=1 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 116 / 318 Condition for Design Consistency Corollary (2.2.3.1) Assume design consistency for z̄HT and for sample moments. −1 0 −1 Let ȳreg = x̄N β̂, β̂ = (X0n Φ−1 n Xn ) Xn Φn yn If ∃γn such that Xn γn = Φn D−1 π Jn (1) where Dπ = diag(π1 , π2 , · · · , πn ) and Jn is a column vector of 1’s, then ȳreg is design-consistent for ȳN . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 117 / 318 7/23-24/2015 118 / 318 Proof of Corollary 2.2.3.1 By Theorem 2.2.3, we have only to show that X −1 N (yi − xi βN ) = 0, i∈U where βN = p lim β̂. Now, (y − Xn β̂)0 Φ−1 n Xn = 0 ⇒ (y − Xn β̂)0 Φ−1 n Xn γn = 0 ⇒ (y − Xn β̂)0 Dπ−1 Jn = N(ȳHT − x̄HT β̂) = 0 ⇒ p lim{ȳHT − x̄HT βN } = p lim{ȳ P HT − x̄HT β̂} = 0 −1 & p lim{ȳHT − x̄HT βN } = N i∈U (yi − xi βN ). Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 Remarks on Corollary 2.2.3.1 1 2 Condition Φn Dπ−1 Jn ∈ C(Xn ) is a crucial condition for the design consistency of the regression estimator of the form ȳreg = x̄N β̂ with −1 0 −1 β̂ = (Xn0 Φ−1 n Xn ) Xn Φn yn . 
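The mechanism behind Corollary 2.2.3.1 can be seen numerically: with Φ_n = D_π and an intercept in x_i, Φ_n D_π⁻¹ J_n = J_n lies in C(X_n), so the π⁻¹-weighted residuals sum to zero by the normal equations, and the HT total of y equals the HT total of the fitted values x_i β̂. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(8)

# Invented sample with inclusion probabilities
n = 80
x1 = rng.uniform(0.0, 4.0, n)
y = 2.0 + x1 + rng.normal(0.0, 1.0, n)
pi = rng.uniform(0.2, 0.8, n)
X = np.column_stack([np.ones(n), x1])   # intercept => Jn in C(Xn)

# Phi_n = D_pi gives pi^{-1}-weighted least squares
W = 1.0 / pi
beta = np.linalg.solve((X.T * W) @ X, (X.T * W) @ y)

resid_wsum = np.sum(W * (y - X @ beta))   # zero by the normal equations
ty_HT = np.sum(W * y)                     # HT total of y
tfit_HT = np.sum(W * (X @ beta))          # HT total of fitted values
```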
If condition Φn Dπ−1 Jn ∈ C(Xn ) does not hold, one can expand the Xn matrix by including z0 = Φn Dπ−1 Jn and use Zn = [z0 , Xn ] to construct the regression estimator: ȳreg = z̄N γ̂ = (z̄0N , x̄N ) z00 Φ−1 z00 Φ−1 n z0 n Xn 0 −1 Xn0 Φ−1 n z0 Xn Φn Xn Kim & Fuller & Mukhopadhyay (ISU & SAS) −1 Chapter 2 z00 Φ−1 n y Xn0 Φ−1 n y 7/23-24/2015 119 / 318 Examples for Φn Dπ−1 Jn ∈ C(Xn ) 1 Φn = Dπ and xi = (1, x1i ) ⇒ ȳreg = ȳπ + (x̄1,N − x̄1,π ) β̂1 P −1 −1 P −1 where (ȳπ , x̄1,π ) = π i∈A i i∈A πi (yi , x1i ) P −1 P −1 −1 0 (x − x̄ 0 β̂1 = π (x − x̄ ) ) 1,π 1,π 1i 1i i∈A i i∈A πi (x1i − x̄1,π ) yi . 2 P Φn = In , xi = πi−1 , x1i = (wi , x1i ), w̄N = N −1 N i=1 wi , ⇒ ȳreg ,ω = w̄N ȳω + (x̄1,N − w̄N x̄1,ω ) β̂1,ols (ȳω , x̄1,ω ) = −1 P −1 −1 i∈A πi wi i∈A πi (yi , x1i ) P 0 (β̂0,ols , β̂1,ols )0 = 0 i∈A xi xi P Kim & Fuller & Mukhopadhyay (ISU & SAS) −1 P 0 i∈A xi yi . Chapter 2 7/23-24/2015 120 / 318 Design Optimal Regression Estimator (Theorem 2.2.4) Theorem Sequence of populations and designs giving consistent estimators of moments. Consider ȳreg (β̂) = ȳπ + (x̄1N − x̄1π )β̂ for some β̂ (i) ȳreg is design consistent for ȳN for any reasonable β̂. (ii) β̂ ∗ = [V̂ (x̄1π )]−1 Ĉ (x̄1π , ȳ1π ) minimizes the estimated variance of ȳreg (β̂) (iii) CLT for ȳreg (β̂ ∗ ) can be established Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 121 / 318 Remarks on Theorem 2.2.4 1 Optimal estimator can be viewed as Rao-Blackwellization based on 0 0 x̄π x̄N V(x̄0π ) C(x̄π , ȳπ ) ∼N , C(x̄0π , ȳπ ) V (ȳπ ) ȳπ ȳN 2 GLS interpretation Kim & Fuller & Mukhopadhyay (ISU & SAS) x̄0π − x̄0N ȳπ = Chapter 2 0 1 ȳN + e1 e2 7/23-24/2015 122 / 318 §2.3 Linear Model Prediction Population model : yN = XN β + eN , eN ∼ (0, Σee NN ) eN independent of XN . 
Assume Σee NN known Sample model : yA = XA β + eA , eA ∼ (0, Σee AA ) Best linear predictor (BLUP) X X −1 θ̂ = N yi + {ŷi + Σee ĀA Σ−1 ee AA (yA − XA β̂)} i∈A i∈Ā −1 −1 where ŷi = xi β̂ and β̂ = (X0A Σ−1 ee AA XA ) XA Σee AA yA Q : When is the model-based predictor θ̂ design consistent? Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 123 / 318 Model and Design Consistency Theorem (2.3.1) If Σee AA (D−1 π Jn − Jn ) − Σee AĀ JN−n ∈ C(XA ), then θ̂ = ȳHT + (x̄N − x̄HT )β̂ and θ̂ − ȳN = āHT − āN + Op (n−1 ) where ai = yi − xi βN Analogous to Corollary 2.2.3.1 for prediction. If the model satisfies the conditions for design consistency, then the model is called full model. Otherwise, it is called reduced model (or restricted model). Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 124 / 318 Model and Design Consistency General strategy (for general purpose survey) a Pick important y b Find a model y = X β + e c Use ȳreg = x̄N β̂ (Full model) or use ȳreg ,π = ȳπ + (x̄N − x̄π )β̂ −1 0 −1 where β̂ = (Xn0 Σ−1 ee Xn ) (Xn Σee yn ) If the design consistency condition does not hold, we can expand the XA matrix by including z0 such as Σee AA Dπ−1 Jn , Z = [z0 , X ]. If z̄0N is not known, use ȳreg ,π of (c). Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 125 / 318 §2.3.2 Nonlinear Models (All x Values Known) Superpopulation model yi = α(xi ; θ) + ei , E (ei ) = 0, ei indep. of xj , for all i and j. 1 2 P P ȳc,reg = ȳHT + N −1 i∈U α(xi ; θ̂) − N −1 i∈A πi−1 α(xi ; θ̂) hP i P −1 ȳm,reg = N i∈A yi + i∈Ā α(xi ; θ̂) Remark: ȳc,reg = ȳm,reg if Bennett, 1988). Kim & Fuller & Mukhopadhyay (ISU & SAS) −1 i∈A (πi P − 1)(yi − ŷi ) = 0. (Firth and Chapter 2 7/23-24/2015 126 / 318 Consistency of Nonlinear Regression (Theorem 2.3.2) Theorem (i) There exist θN such that θ̂ − θN = Op (n−1/2 ), a.s.. (ii) α(x, θ) is a continuous differentiable function of θ with derivative uniformly continuous on B, a closed set containing θN . 
(iii) The partial derivative h(xi ; θ) = ∂α(xi ; θ)/∂θ satisfies X X −1 −1 sup N πi h(xi ; θ) − h(xi ; θ) = Op (n−1/2 ) a.s. θ∈B i∈A i∈U ⇒ ȳc,reg − ȳN = āHT − āN + Op (n−1 ) where ai = yi − α(xi ; θN ) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 127 / 318 Calibration Minimize ω 0 V ω s.t. ω 0 X = x̄N (ω 0 Vω)(aX0 V−1 Xa0 ) ≥ (ω 0 Xa0 )2 with equality iff ω 0 V 1/2 ∝ aX 0 V −1/2 ω 0 ∝ aX 0 V −1 ω 0 = kaX 0 V −1 , k : constant ω0X = kaX 0 V −1 X x̄N (X 0 V −1 X )−1 = ka ∴ ω 0 = x̄N (X 0 V −1 X )−1 X 0 V −1 & ω 0 V ω ≥ x̄N (X 0 V −1 X )−1 x̄0N Note Minimize V (ω 0 y | X, d) s.t. E (ω 0 y − ȳN | X, d) = 0. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 128 / 318 Alternative Minimization Lemma α : given n-dimensional vector Let ωa = arg minω ω 0 Vω s.t ω 0 X = x̄N Let ωb = arg minω (ω − α)0 V(ω − α) s.t ω 0 X = x̄N If V α ∈ C(X), then ωa = ωb . Proof : (ω − α)0 V(ω − α) = ω 0 Vω − α0 Vω − ω 0 Vα + α0 Vα = ω 0 Vω − λ0 X0 ω − ω 0 Xλ + α0 Vα = ω 0 Vω − 2λ0 x̄0N + α0 Vα where V α = Xλ ∵ ω 0 X = x̄N If α = Dπ−1 Jn , then V α ∈ C(X ) is the condition for design consistency in Corollary 2.2.3.1. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 129 / 318 General Objective Function min X G (ωi , αi ) s.t. i∈A X ωi xi = x̄N i∈A Lagrange multiplier method ∂G g (ωi , αi ) − λ0 x0i = 0 where g (ωi , αi ) = ∂ω X i g −1 (λ0 x0i , αi )xi = x̄N ⇒ ωi = g −1 (λ0 x0i , αi ) where λ0 is from i∈A Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 130 / 318 GREG Estimator min Q(ω, d) = X di i∈A ωi −1 di 2 qi s.t. X ωi xi = x̄N . 
i∈A ⇒ di−1 (ωi − di )qi + λ0 x0i = 0 ⇒ ωi = di + λ0 di x0i /qi X X X 0 ⇒ ωi x i = di xi + λ di x0i xi /qi i∈A ∴ i∈A i∈A X 0 λ = (x̄N − x̄HT )( di x0i xi /qi )−1 i∈A ∴ X wi = di + (x̄N − x̄HT )( di x0i xi /qi )−1 di x0i /qi i∈A Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 131 / 318 Other Objective Functions Pseudo empirical likelihood Q(ω, d ) = − X di log ωi di ωi di , ωi = di /(1 + xi λ) Kullback-Leibler distance: Q(ω, d) = X ωi log , ωi = di exp(xi λ) where di = 1/πi . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 132 / 318 Theorem 2.7.1 Deville and Särndal (1992) Theorem Let G (ω, α) be a continuous convex function with a first derivative that is zero for ω = α. Under some regularity conditions, the solution ωi that minimizes X X G (ωi , αi ) s.t. ωi xi = x̄N i∈A i∈A satisfies X ωi yi = i∈A X αi yi + (x̄N − x̄α ) β̂ + Op (n−1 ) i∈A P P 0 x /φ −1 0 where β̂ = x i ii i∈A i i∈A xi yi /φii and φii = ∂ 2 G (ωi , αi )/∂ωi2 ω =α . i i Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 133 / 318 Weight Bounds ωi = di + di λ0 x0i /ci can take negative values (or take very large values) P Add L1 ≤ ωi ≤ L2 to ωi xi = x̄N . Approaches 1 Huang and Fuller (1978) Q(wi , di ) = 2 X di Ψ wi di , Ψ : Huber function Husain (1969) 0 min ω 0 ω + γ(ω 0 X − x̄N )0 Σ−1 x̄ x̄ (ω X − x̄N ) for some γ 3 Other methods, quadratic programming. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 134 / 318 Comments Regression estimation is large sample superior to mean and ratio estimation for k << n. Applications require restrictions on regression weights ( wi > 1/N ) Model estimator is design consistent if X γ = Σee Dπ−1 J. 
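The GREG weights derived above can be computed directly and checked against the calibration constraint Σ ω_i x_i = x̄_N. A sketch (data and probabilities are invented; here d_i carries a factor N⁻¹ so the constraint is stated in means, and with an intercept the weights sum to one):

```python
import numpy as np

rng = np.random.default_rng(11)

# Invented population with x_i = (1, x_1i)
N = 1000
x1 = rng.uniform(0.0, 10.0, N)
X_pop = np.column_stack([np.ones(N), x1])
xbar_N = X_pop.mean(axis=0)

# Unequal-probability (Poisson-type) sample
pi = 0.05 + 0.10 * x1 / x1.max()
take = rng.random(N) < pi
Xs, pis = X_pop[take], pi[take]

d = 1.0 / (N * pis)                 # HT weights scaled for the mean
q = np.ones(len(pis))               # q_i = 1
xbar_HT = d @ Xs

# lambda' = (xbar_N - xbar_HT) (sum d_i x_i' x_i / q_i)^{-1}
T = (Xs.T * (d / q)) @ Xs
lam = np.linalg.solve(T, xbar_N - xbar_HT)

# GREG calibration weights: w_i = d_i + lambda' d_i x_i' / q_i
w = d + (d / q) * (Xs @ lam)
```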
Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 135 / 318 7/23-24/2015 136 / 318 Regression Estimators Using SAS PROC SURVEYREG Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 Regression Estimators Response variable is correlated to a list of auxiliary variables Population totals for the auxiliary variables are known Efficient estimators can be constructed by using a linear contrast from a regression model Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 137 / 318 Digitech Cable Can you improve the estimate of the average usage time by taking data usage into account? Average data usage (MB) for the population is available Data usage for every unit in the sample is available Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 138 / 318 ESTIMATE Statement p r o c s u r v e y r e g d a t a=R e s p o n s e D a t a p l o t= f i t ( w e i g h t=heatmap s h a p e=hex n b i n s =20); s t r a t a S t a t e Type ; weight SamplingWeight ; model UsageTime = DataUsage ; estimate ’ Regression Estimator ’ i n t e r c e p t 1 DataUsage 4 0 0 2 . 
1 4 ; run ; Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 139 / 318 7/23-24/2015 140 / 318 The SURVEYREG Procedure Regression Analysis for Dependent Variable UsageTime Fit Statistics R-Square 0.6555 Root MSE 121.56 Denominator DF 292 Estimate Label Regression Estimator Kim & Fuller & Mukhopadhyay (ISU & SAS) Estimate Standard Error 279.18 6.9860 Chapter 2 DF t Value Pr > |t| 292 <.0001 39.96 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 141 / 318 Poststratification Using SAS PROC SURVEYMEANS Strata identification is unknown, but strata totals or percentages are known Stratification after the sample is observed Use poststratification to produce efficient estimators adjust for nonresponse bias perform direct standardization Variance estimators must be adjusted Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 142 / 318 Digitech Cable Known distribution of race in the four states Adjust the distribution of race in the sample to match the population Estimate the average usage time after adjusting for race Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 143 / 318 7/23-24/2015 144 / 318 POSTSTRATA Statement p r o c s u r v e y m e a n s d a t a=R e s p o n s e D a t a mean s t d e r r ; s t r a t a S t a t e Type ; weight SamplingWeight ; v a r UsageTime ; p o s t s t r a t a Race / p o s t p c t=R a c e P e r c e n t ; run ; Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 Poststratified Estimator The SURVEYMEANS Procedure Statistics Variable Label UsageTime Computer Usage Time Mean Std Error of Mean 288.477541 10.612532 A set of poststratified-adjusted weights is created The variance estimator uses the poststratification information Store the poststratified-adjusted replicate weights from PROC SURVEYMEANS and use the adjusted replicate weights in other survey procedures Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 2 7/23-24/2015 145 / 318 7/23-24/2015 146 / 318 Chapter 3 Use of Auxiliary Information in 
Design World Statistics Congress Short Course July 23-24, 2015 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 Design Strategy Find the best strategy (design, estimator) for ȳN under the model yi = xi β + ei , ei ∼ ind (0, γii σ 2 ) x̄N , γii known, β, σ 2 unknown P Estimator class : θ̂ = i∈A wi yi : linear in y and E {(θ̂ − ȳN ) | d , XN } = 0, so θ̂ is model-unbiased & design consistent. Criterion : Anticipated variance AV{θ̂ − ȳN } = E {V (θ̂ − ȳN |F)} Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 147 / 318 Candidate Estimator ! θ̂ = N −1 X yi + P xi β̂ i∈Ac i∈A −1 0 −1 β̂ = (X 0 D−1 γ X ) X Dγ y = X 0 i∈A xi xi /γii −1 P 0 i∈A xi yi /γii Dγ = diag(γ11 , γ22 , ..., γnn ) If the vector of γii is in the column space of X , P i∈A (yi − xi β̂) = 0 and θ̂ = ȳreg = x̄N β̂. 1/2 1/2 If πi ∝ γii and the vector of γii ȳreg = ȳHT + (x̄N − x̄HT )β̂. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 in in the column space of X , 7/23-24/2015 148 / 318 Theorem 3.1.1 (Isaki and Fuller, 1982) 1/2 1/2 Under moment conditions, if πi ∝ γii , γii = xi τ1 , and γii = xi τ2 for some τ1 and τ2 , then !2 N N n X 2 1 X 1/2 − 2 γii γii σ lim nAV{ȳreg − ȳN } = lim N→∞ N→∞ N N i=1 i=1 and lim n [AV{ȳreg − ȳN } − AV{Ψl − ȳN }] ≤ 0 N→∞ for all Ψl ∈ Dl and all p ∈ Pc , where X Dl = {Ψl ; Ψl = αi yi and E {(Ψl − ȳN ) | d, X } = 0} i∈A and Pc is the class of fixed-sample-size nonreplacement designs with fixed probabilities admitting design-consistent estimators of ȳN . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 149 / 318 Proof of Theorem 3.1.1 For Ψl = α0 y E {Ψl − ȳN |d , XN } = 0 ⇔ α0 X = x̄N V {Ψl − ȳN |d , XN } = α0 Dγ α − 2N −1 α0 Dγ Jn + N −2 J0N Dγ N JN σ 2 α0 Dγ Jn = α0 X τ2 = x̄N τ2 = N −1 J0N XN τ2 = N −1 J0N Dγ N JN 0 2 −2 0 ∴ V {Ψl − ȳN |d, XN } = α Dγ α − N JN Dγ N JN σ Enough to find α that minimizes α0 Dγ α s.t. 
α0 X = x̄N −1 0 −1 ⇒ α∗0 = x̄N (X 0 D−1 γ X ) X Dγ α0 Dγ α ≥ x̄N (X 0 Dγ−1 X )−1 x̄0N (See Section 2.7) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 150 / 318 Remarks on Theorem 3.1.1 (1) Under the model, yi = xi β + ei , = ēHT − ēN + Op (n−1 ) . − ȳN ) = AV(ēHT − ēN ) ȳreg − ȳN AV(ȳreg = N −2 N X [(1 − πi )πi−1 ]γii σ 2 i=1 min AV(ȳreg − ȳN ) s.t. PN P 1/2 N j=1 γjj i=1 πi AV(ȳreg − ȳN ) ≥ N −2 n−1 = n, πi = n !2 N N X 1/2 X γii − γii σ 2 i=1 1/2 γii i=1 (Godambe-Joshi lower bound) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 151 / 318 Remarks on Theorem 3.1.1 (2) For model: yi = xi β + ei , ei ∼ (0, γii σ 2 ), best strategy is 1/2 ȳreg = ȳHT + (x̄N − x̄HT )β̂ with πi ∝ γii 1/2 To achieve a sampling design with πi ∝ γii 1 2 3 Use Poisson sampling : n is not fixed (Not covered by Theorem). Use systematic sampling Use approximation by stratified random sampling P P 1/2 . 1/2 Choose Uh s.t. i∈Uh γii = H −1 i∈U γii , γ11 ≤ γ22 ≤ · · · ≤ γNN Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 152 / 318 Stratification, Example 3.1.2 yi = β0 + xi β1 + ei , ei ∼ (0, σe2 ) and xi = i, i = 1, 2 · · · , N = 1, 600 stratified sampling with Nh = N/H(= M), nh = n/H, n = 64 ȳst H X Nh = H 1 X ȳh = ȳh N H h=1 h=1 H H 1 X 1 X 1 2 1 2 V (ȳ ) = σ σ = h H2 H2 nh yh n w V (ȳst ) = h=1 where σw2 = H −1 PH 2 h=1 σyh , h=1 2 σyh = E (yhi − ȳh )2 . 
= M 2 −1 2 12 β1 + σe2 (M 2 − 1)/12 = variance of {1, 2, · · · , M} Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 153 / 318 Example 3.1.2 continued Number of Strata 2 4 8 32 ρ2 V (ȳst ) = 0.25 ρ2 81 77 75 75 = 0.9 32 16 11 10 ρ2 V {V̂ (ȳst }) = 0.25 ρ2 = 0.90 67 7.7 62 2.4 64 1.5 111 2.0 ρ2 = 1 − (σy2 )−1 σe2 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 154 / 318 Remarks on Stratification 1 For efficient point estimation, increase H 2 V {V̂ (ȳst )} depends on ρ: V {V̂ (ȳst )} can decrease in H and then increase in H Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 155 / 318 One per Stratum A common procedure is to select one unit per stratum and to combine or “collapse” two adjacent strata to form a variance estimation stratum V̂cal {ȳcol } = 0.25(y1 − y2 )2 h i E V̂cal {ȳcol } = 0.25(µ1 − µ2 )2 + 0.25(σ12 + σ22 ) Two-per-stratum design V {ȳ2,st } = 0.125(µ1 − µ2 )2 + 0.25(σ12 + σ22 ) Controlled two-per-stratum design (§ 3.1.4) can be used to reduce the variance of the two-per-stratum design Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 156 / 318 Cluster Sampling Population of cluster of elements May have different cluster sizes Cluster size can be either known or unknown at the design stage Clusters are sampling units Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 157 / 318 Model for Cluster Sampling yij = µy + bi + eij , i = 1, 2, · · · , N, j = 1, 2, · · · , Mi bi ∼ (0, σb2 ), eij ∼ (0, σe2 ), Mi ∼ (µM , σM2 ) and bi , eij , and Mi are independent. 
M i X yi = yij ∼ (Mi µy , γii ) j=1 Mi γii = V (yi |Mi ) = Mi2 σb2 + Mi σe2 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 158 / 318 Strategies for Mean per Element 1 θ̂n,1 = M̄n−1 ȳn : SRS 2 −1 θ̂n,2 = M̄HT ȳHT : with πi ∝ Mi 3 −1 θ̂n,3 = M̄HT ȳHT : with πi ∝ γii 1/2 Then V (θ̂n,3 ) ≤ V (θ̂n,2 ) ≤ V (θ̂n,1 ) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 159 / 318 Two Stage Sampling Population of N clusters (PSUs) Select a sample of n1 clusters Select a sample of mi elements from Mi elements in cluster i Sampling within clusters is independent A model: yij = µy + bi + eij bi ∼ (0, σb2 ) ind of eij ∼ (0, σe2 ) Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 160 / 318 Estimation of Mean per Element N X θN = !−1 Mi i=1 θ̂SRC = where ȳi· = mi−1 yij i=1 j=1 n X !−1 Mi i=1 mi X Mi N X X n X Mi ȳi· i=1 yij . j=1 With equal Mi , equal mi , and SRS at both stages, V(θ̂SRC − θN ) = . = 1 n1 1 n1 n1 2 1 h n1 m i 2 1− σb + 1− σ N n1 m NM e 1 2 2 σb + σe m Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 161 / 318 Optimal Allocation Cost function: C = c1 n1 + c2 n1 m Minimize V{θ̂ − θN } s.t. C = c1 n1 + c2 n1 m: σe2 c1 m = σb2 c2 ∗ 1/2 Proof: C is fixed. Minimize C · V {θ̂ − θn } = n1−1 σb2 + m−1 σe2 (c1 n1 + c2 n1 m) = σb2 c1 + σe2 c2 + σb2 c2 m + c1 σe2 m−1 . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 162 / 318 Two-Phase Sampling 1 Phase-one : Select A1 from U. Observe xi 2 Phase-two : Select A2 from A1 . Observe (xi , yi ) π1i π2i|1i π2i = Pr[i ∈ A1 ] = Pr[i ∈ A2 |i ∈ A1 ] = π1i π2i|1i Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 163 / 318 Two-Phase Sampling for Stratification xi xig T̂2pr ,st = (xi1 , · · · , xiG ) 1 if i ∈ group g = 0 otherwise = G X N̂1g ȳ2πg : reweighted expansion estimator g =1 N̂1g = X −1 π1i xig i∈A1 P ȳ2πg = i∈A2 P −1 −1 π1i π2i|1i xig yi i∈A2 If π2i|1i = f2g −1 −1 π1i π2i|1i xig P −1 n2g i∈A2 π1i xig yi = , i ∈ group g , then ȳ2πg = P . 
−1 n1g π x i∈A 1i ig 2 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 164 / 318 Theorem 3.3.1 Let the second phase sample be stratified random sampling with π2i|1i fixed and constant within groups. Under moment conditions, L [V {ȳ2p,st |FN }]−1/2 (ȳ2p,st − ȳN )|FN → N(0, 1) where V (ȳ2p,st |FN ) = V (ȳ1π |FN ) + E ȳ1π = G X 1 1 − n2g n1g −1 g =1 X X −1 w1i yi , w1i = π1j i∈A1 2 S̃1ug 2 n1g 2 S̃1ug |FN −1 π1i j∈A1 X = (n1g − 1)−1 (ui − ū1g )2 i∈A1g ū1g −1 = n1g X w1i yi i∈A1g Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 165 / 318 Proof of Theorem 3.3.1 ȳ2p,st − ȳN = (ȳ2p,st − ȳ1π ) + (ȳ1π − ȳN ) L (i) {V [ȳ2p,st − ȳ1π |FN ]}−1/2 (ȳ2p,st − ȳ1π )|(A1 , FN ) → N(0, 1) ∵ ȳ2p,st − ȳ1π = G X n1g (ū2g − ū1g ) g =1 ū2g = 1 X 1 X wi yi , ū1g = wi yi n2g n1g i∈A2g V {ȳ2p,st − ȳ1π |A1 , FN } = G X 2 n1g i∈A1g −1 n2g − −1 n1g 2 S̃1ug g =1 E {ȳ2p,st − ȳ1π |A1 , FN } = 0 L (ii) V [ȳ1π − ȳN |FN ]−1/2 (ȳ1π − ȳN ) | FN → N(0, 1) By Theorem 1.3.6 (p 54-55), the result follows Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 166 / 318 Separate Samples with Common Characteristics Two independent surveys A1 : observe x A2 : observe (x, y ) Interested in estimating θ = (x̄N , ȳN )0 1 0 x̄01 0 x̄2 = 1 0 0 1 ȳ2 e1 V = V e2 e3 u = Zθ + e x̄0N ȳN e1 + e2 e3 V11 0 0 = 0 V22 V23 0 V32 V33 = (Z 0 V −1 Z )−1 Z 0 V −1 u θ̂GLS Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 167 / 318 Composite Estimation Example(Two Time Periods: O = observed) Sample A B C t=1 O O t=2 O O → core panel part (detecting change) supplemental panel survey (cross sectional) Sample A : ȳ1A , ȳ2A , Sample B : ȳ1B , Sample C : ȳ2C Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 168 / 318 Two Time Periods GLS 1 0 e1 ȳ1B e2 ȳ1A 1 0 ȳ1,N = + e3 ȳ2A 0 1 ȳ2,N 0 1 e4 ȳ2C −1 nB 0 0 0 e1 −1 −1 e2 0 nA nA ρ 0 = V −1 e3 0 n ρ n−1 0 A A −1 e4 0 0 0 nC Composite estimator θ̂ = (Z0 V−1 Z)−1 Z0 V−1 y Design is complex because more than one item. 
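The GLS combination θ̂_GLS = (Z′V⁻¹Z)⁻¹Z′V⁻¹u for two surveys with a common item can be sketched as follows (all variances and estimates below are invented numbers); the checks confirm unbiasedness (AZ = I) and that pooling never does worse than either single survey:

```python
import numpy as np

# Survey 1 reports xbar; survey 2 reports (xbar, ybar); theta = (xbar_N, ybar_N)
Z = np.array([[1.0, 0.0],     # xbar from survey 1
              [1.0, 0.0],     # xbar from survey 2
              [0.0, 1.0]])    # ybar from survey 2
u = np.array([10.2, 9.8, 5.1])            # invented estimates
V = np.array([[0.04, 0.00, 0.00],
              [0.00, 0.09, 0.03],         # survey-2 estimates correlated
              [0.00, 0.03, 0.09]])

Vinv = np.linalg.inv(V)
info = Z.T @ Vinv @ Z
A = np.linalg.inv(info) @ (Z.T @ Vinv)    # GLS coefficient matrix
theta_gls = A @ u                         # GLS estimate of (xbar_N, ybar_N)
var_gls = np.linalg.inv(info)             # its covariance matrix
```

Because A Z = I, θ̂_GLS is unbiased for any θ, and its variance for x̄_N cannot exceed that of either single-survey estimate.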
Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 169 / 318 Rejective Sampling yi = xi β + ei , xi , i = 1, 2, ..., N known Vxx = V {x̄p } under initial design Pd , with pi initial selection probability. Procedure: Select sample using Pd −1 If (x̄p − x̄N )Vxx (x̄p − x̄N )0 < Kd keep sample. Otherwise reject and select a new sample Continue until sample is kept Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 170 / 318 Result for Rejective Sampling If ȳreg = z̄N β̂ design consistent under Pd xi c = (1 − pi )−1 pi , for some c, then ȳreg = z̄N β̂ design consistent for Rejection −1 X X zi0 φi pi−2 yi β̂ = zi0 φi pi−2 zi i∈Arej i∈Arej zi = (xi , z2i ), z2i design var, eg. stratum indicators φi = (1 − pi ) for Poisson φi = (Nh − 1)−1 (Nh − nh ) stratification V̂ {ȳreg } is design consistent for Vrej {ȳreg } Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 171 / 318 Sample Design (Fairy Tale) Client: Desire estimate of average daily consumption of Cherrios by females 18-80 who made any purchase at store x between January 1, 2015 and July 1, 2015 (here is list) with a CV of 2% Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 172 / 318 Design Discussion Objectives ? What is population of interest ? What data are needed ? How are data to be collected? What has been done before ? What information (auxiliary data) available ? How soon must it be done? 
Budget Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 173 / 318 7/23-24/2015 174 / 318 BLM Sage Grouse Study Bureau of Land Management Rangeland Health Emphasis on sage grouse habitat Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 Sage Grouse Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 175 / 318 Chapter 3 7/23-24/2015 176 / 318 Sage Grouse Kim & Fuller & Mukhopadhyay (ISU & SAS) Sample Frame, Sample Units Public Land Survey System (PLSS) Central and western US Grid system of square miles (sections) (1.7km) Quarter section 0.5mi on a slide (segment) Much ownership based on PLSS ISU has used PLSS as a frame Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 177 / 318 Chapter 3 7/23-24/2015 178 / 318 Low Sage Grouse Density Kim & Fuller & Mukhopadhyay (ISU & SAS) Sample Selection Stratify area (Thiessen polygons) Select random point locations (two per stratum) Segment containing point is sample segment Select observation points in segment Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 179 / 318 Chapter 3 7/23-24/2015 180 / 318 Stratification Kim & Fuller & Mukhopadhyay (ISU & SAS) Sample Points Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 181 / 318 7/23-24/2015 182 / 318 Selection Probabilities Segments vary in size Segment rate less than 1/300 Segment πi treated as proportional to size Joint probability treated as πi πj Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 Variables xhi = Elevation of selected point yhi = created variable at point (Range health) Chi = acre size of segment h: stratum, i: segment Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 7/23-24/2015 183 / 318 7/23-24/2015 184 / 318 Weights −1 Segment probability = TCh (nh Chi ) TCh = acre size of stratum −1 −1 −1 Point weight (acres) = πhi = [TCh (nh Chi )]−1 Chi mhi mhi = number of points observed in segment hi Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 3 PPS - Fixed Take Let One point = one acre Chi = acre 
size of segment i in stratum h; m = fixed number of points per segment.
Probability of selection (point j in segment i):
P(acre_ij) = P(seg_i) P(acre_ij | seg_i) = T_ch^{-1} C_hi × m C_hi^{-1} = T_ch^{-1} m

One Point Per Segment
ȳ_st = (136)^{-1} Σ_{h=1}^{68} Σ_{i=1}^{2} y_hi = 1.9458
[V̂(ȳ_st)]^{1/2} = [Σ_{h=1}^{68} N^{-2} N_h^2 V̂(ȳ_h)]^{1/2} = 0.0192,
where V̂(ȳ_h) is computed from the two segment values in stratum h.

Regression Estimator
y_hi = x_hi β_1 + Σ_{j=1}^{H} δ_{j,hi} β_{1+j} + e_hi,
δ_{j,hi} = 1 if j = h, 0 otherwise,
x′_hi = (x_hi, δ_{1,hi}, δ_{2,hi}, · · · , δ_{H,hi})
ȳ_reg = ȳ_st + (x̄_N − x̄_st) β̂_1

Regression Estimator
x̄_N = 0.9878 mi. (known), x̄_st = 0.9875
ȳ_reg = ȳ_st + (x̄_N − x̄_st) β̂_1 = 1.9458 + (0.0003)(0.9788) = 1.9461 (0.0165)
β̂_1 = {Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)^2}^{-1} Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)(y_hi − ȳ_h)

Regression Weights
ȳ_reg = Σ_{h=1}^{68} Σ_{i=1}^{2} w_hi y_hi
w_hi = n^{-1} + (x̄_N − x̄_st) {Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)^2}^{-1} (x_hi − x̄_h)
V̂{ȳ_reg} = Σ_{h=1}^{68} Σ_{i=1}^{2} w_hi^2 {y_hi − ȳ_h − (x_hi − x̄_h) β̂_1}^2
Σ_{h=1}^{68} Σ_{i=1}^{2} w_hi^2 = n^{-1} + (x̄_N − x̄_st)^2 {Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)^2}^{-1}

Model Variance
y_hi = β_0 + β_1 (x_hi − x̄_hN) + e_hi,  e_hi ∼ (0, σ_e^2)
V{ȳ_reg} = [n^{-1} + (x̄_N − x̄_st)^2 {Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)^2}^{-1}] σ_e^2

Sample Number Two
x̄_N = 0.9878, x̄_st = 0.9589
ȳ_reg = 1.9689 + (0.0289)(1.0203) = 1.9984
V{ȳ_reg} = [n^{-1} + (x̄_N − x̄_st)^2 {Σ_{h=1}^{68} Σ_{i=1}^{2} (x_hi − x̄_h)^2}^{-1}] σ_e^2
V̂{ȳ_reg} = {0.7353 + 0.0656} 10^{-2} σ̂_e^2 = (0.0172)^2
σ̂_e^2 = (67)^{-1} Σ_{h=1}^{68} Σ_{i=1}^{2} {y_hi − ȳ_h − (x_hi − x̄_h) β̂_1}^2 = 1.000

Comments on Design
Stratification
Model for design
Error model – selection probabilities
Avoid "over-design"
Variance estimation: simple – explanation for users; rejective – avoids "bad" samples
Data will be used for more than it was designed for
Be prepared (budget cut, use of the sample again)

Chapter 4
Replication Variance Estimation
World Statistics Congress Short Course July 23-24, 2015

Jackknife Variance Estimation
Create a new sample by deleting one observation:
x̄^(k) = (n x̄ − x_k)/(n − 1),  x̄^(k) − x̄ = −(x_k − x̄)/(n − 1)
∴ Σ_{k=1}^{n} ((n − 1)/n) (x̄^(k) − x̄)^2 = {n(n − 1)}^{-1} Σ_{k=1}^{n} (x_k − x̄)^2 = n^{-1} s_x^2

Alternative Jackknife Weights
x̄_ψ^(k) = ψ x_k + (1 − ψ) x̄^(k)
x̄_ψ^(k) − x̄ = ψ(x_k − x̄) + (1 − ψ)(x̄^(k) − x̄) = {ψ − (1 − ψ)/(n − 1)}(x_k − x̄) = {(nψ − 1)/(n − 1)}(x_k − x̄)
Σ_{k=1}^{n} (x̄_ψ^(k) − x̄)^2 = {(nψ − 1)^2/(n − 1)^2} Σ_{k=1}^{n} (x_k − x̄)^2
If (nψ − 1)^2 = (1 − f), then Σ_{k=1}^{n} ((n − 1)/n)(x̄_ψ^(k) − x̄)^2 = n^{-1}(1 − f) s_x^2.

Random Group Jackknife
n = mb: m groups of size b, with group means x̄_1, ..., x̄_m
x̄ = m^{-1} Σ_{i=1}^{m} x̄_i,  V̂(x̄) = {m(m − 1)}^{-1} Σ_{i=1}^{m} (x̄_i − x̄)^2
Delete group k:
x̄_b^(k) = (n x̄ − b x̄_k)/(n − b) = (m x̄ − x̄_k)/(m − 1)
x̄_b^(k) − x̄ = −(x̄_k − x̄)/(m − 1)
V̂_RGJK(x̄) ≡ ((m − 1)/m) Σ_{k=1}^{m} (x̄_b^(k) − x̄)^2 = {m(m − 1)}^{-1} Σ_{k=1}^{m} (x̄_k − x̄)^2
Unbiased, but d.f. = m − 1.

Theorem 4.1.1
Theorem. F_N = {y_1, ..., y_N}: sequence of finite populations.
y_i ∼ iid(μ_y, σ_y^2) with finite 4 + δ moments.
g(·): continuous function with continuous first derivative at μ_y.
⇒ ((n − 1)/n) Σ_{k=1}^{n} {g(ȳ^(k)) − g(ȳ)}^2 = [g′(ȳ)]^2 V̂(ȳ) + o_p(n^{-1}),
where V̂(ȳ) = n^{-1} s^2 and g′(ȳ) = ∂g(ȳ)/∂ȳ.

Proof of Theorem 4.1.1
By Taylor linearization,
g(ȳ^(k)) = g(ȳ) + g′(ȳ_k*)(ȳ^(k) − ȳ) = g(ȳ) + g′(ȳ)(ȳ^(k) − ȳ) + R_nk (ȳ^(k) − ȳ)
for some ȳ_k* ∈ B_{δ_k}(ȳ), δ_k = ‖ȳ^(k) − ȳ‖, where R_nk = g′(ȳ_k*) − g′(ȳ). Thus,
Σ_{k=1}^{n} {g(ȳ^(k)) − g(ȳ)}^2 = Σ_{k=1}^{n} [g′(ȳ_k*)]^2 (ȳ^(k) − ȳ)^2 = Σ_{k=1}^{n} [g′(ȳ) + R_nk]^2 (ȳ^(k) − ȳ)^2
(i) max_{1≤k≤n} |ȳ_k* − ȳ| → 0 in probability.
(ii) max_{1≤k≤n} |g′(ȳ_k*) − g′(ȳ)| → 0 in probability.
∴ ((n − 1)/n) Σ_{k=1}^{n} {g(ȳ^(k)) − g(ȳ)}^2 = [g′(ȳ)]^2 V̂(ȳ) + o_p(n^{-1})

Remainder with Second Derivatives
If g(·) has continuous second derivatives at μ_y, then
n^{-1}(n − 1) Σ_{k=1}^{n} [g(ȳ^(k)) − g(ȳ)]^2 = [g′(ȳ)]^2 V̂(ȳ) + O_p(n^{-2}).
Proof: [g(ȳ^(k)) − g(ȳ)]^2 = [g′(ȳ_k*)]^2 (ȳ^(k) − ȳ)^2, and
[g′(ȳ_k*)]^2 = [g′(ȳ)]^2 + 2 g′(ȳ_k**) g″(ȳ_k**)(ȳ_k* − ȳ) ⇒ [g′(ȳ_k*)]^2 = [g′(ȳ)]^2 + K_1 |ȳ_k* − ȳ| for some K_1.
Thus, since ȳ_k* − ȳ = O_p(n^{-1}), we have the result.

Jackknife Often Larger than Taylor
R̂ = x̄^{-1} ȳ
R̂^(k) − R̂ = [x̄^(k)]^{-1} ȳ^(k) − x̄^{-1} ȳ = −[x̄^(k)]^{-1} (y_k − R̂ x_k)(n − 1)^{-1}
V̂_JK(R̂) = ((n − 1)/n) Σ_{k=1}^{n} (R̂^(k) − R̂)^2 = {n(n − 1)}^{-1} Σ_{k=1}^{n} [x̄^(k)]^{-2} (y_k − R̂ x_k)^2
vs. V̂_L(R̂) = {n(n − 1)}^{-1} Σ_{k=1}^{n} (x̄)^{-2} (y_k − R̂ x_k)^2
E[(x̄^(k))^{-2}] ≥ E[(x̄)^{-2}]

Quantiles
ξ_p = Q(p) = F^{-1}(p), p ∈ (0, 1), where F(y) is the cdf
ξ̂_p = Q̂(p) = inf{y : F̂(y) ≥ p};  p = 0.5 for the median
F̂(y) = (Σ_{i∈A} w_i)^{-1} Σ_{i∈A} w_i I(y_i ≤ y)
To reduce the bias, use interpolation:
ξ̂_p = ξ̂_p0 + {(ξ̂_p1 − ξ̂_p0)/(F̂(ξ̂_p1) − F̂(ξ̂_p0))} {p − F̂(ξ̂_p0)},
where ξ̂_p1 = inf_{x_1,...,x_n} {x : F̂(x) ≥ p} and ξ̂_p0 = sup_{x_1,...,x_n} {x : F̂(x) ≤ p}

Test Inversion for Quantile C.I.
Construct an acceptance region for H_0 : p = p_0:
{p_0 − 2[V̂(p̂_0)]^{1/2}, p_0 + 2[V̂(p̂_0)]^{1/2}}
Invert the p-interval to give a C.I. for ξ_p0:
{Q̂(p_0 − 2[V̂(p̂_0)]^{1/2}), Q̂(p_0 + 2[V̂(p̂_0)]^{1/2})}

Plots of CDF and Inverse CDF
[Figure: the CDF F, mapping ξ_p to p, and the inverse CDF Q, mapping p to ξ_p.]

Bahadur Representation
Let F̂(x) be an unbiased estimator of F(x), the population CDF of X. For given p ∈ (0, 1), define ξ_p = F^{-1}(p), the p-th population quantile of X. Let ξ̂_p = F̂^{-1}(p) be the p-th sample quantile based on F̂. Also define p̂ = F̂(ξ_p). Bahadur (1966):
ξ_p = F̂^{-1}(p̂)
≅ F̂^{-1}(p) + {dF̂^{-1}(p)/dp}(p̂ − p)
≅ F̂^{-1}(p) + {dF^{-1}(p)/dp}(p̂ − p)
= ξ̂_p + {f(ξ_p)}^{-1}(p̂ − p).

Variance Estimation for Quantiles (1)
Bahadur representation (SRS):
√n (ξ̂_p − ξ_p) → N(0, p(1 − p)/[F′(ξ_p)]^2)
V(ξ̂_p) = [Q̂′(p)]^2 V[F̂(ξ_p)], with Q̂′(p) = [F̂′(ξ_p)]^{-1}
Estimate the slope by a difference quotient:
γ̂ = {Q̂(p̂ + 2[V̂(p̂)]^{1/2}) − Q̂(p̂ − 2[V̂(p̂)]^{1/2})} / {(p̂ + 2[V̂(p̂)]^{1/2}) − (p̂ − 2[V̂(p̂)]^{1/2})}
V̂(ξ̂_p) = γ̂^2 V̂{F̂(ξ_p)}

Variance Estimation for Quantiles (2)
1. The jackknife variance estimator is not consistent. For the median θ̂ = 0.5(x_m + x_{m+1}) with n = 2m even,
V̂_JK = 0.25(n − 1)[x_{m+1} − x_m]^2.
2. Bootstrap and BRR are O.K.
3. Smoothed quantile:
ξ̂_p = ξ̂_p0 + γ̂[p − F̂(ξ̂_p0)]
ξ̂_p^(k) = ξ̂_p0 + γ̂[p − F̂^(k)(ξ̂_p0)]
V̂_JK(ξ̂_p) = Σ_k c_k (ξ̂_p^(k) − ξ̂_p)^2

§4.4 Two-Phase Samples
ȳ_2p,reg = ȳ_2p,REE = ȳ_2π + (x̄_1π − x̄_2π) β̂_2
β̂_2 = {Σ_{i∈A_2} w_2i (x_i − x̄_2π)′(x_i − x̄_2π)}^{-1} Σ_{i∈A_2} w_2i (x_i − x̄_2π)′ y_i
w_2i = (π_1i π_2i|1i)^{-1},  w_1i = π_1i^{-1}
V̂_JK(ȳ_2p,REE) = Σ_{k=1}^{L} c_k [ȳ_2p,REE^(k) − ȳ_2p,REE]^2

Two-Phase Samples
where w_2i^(k) = w_1i^(k) (π_2i|1i)^{-1},
x̄_1π^(k) = (Σ_{i∈A_1} w_1i^(k))^{-1} Σ_{i∈A_1} w_1i^(k) x_i
(x̄_2π^(k), ȳ_2π^(k)) = (Σ_{i∈A_2} w_2i^(k))^{-1} Σ_{i∈A_2} w_2i^(k) (x_i, y_i)
β̂_2^(k) = {Σ_{i∈A_2} w_2i^(k) (x_i − x̄_2π^(k))′(x_i − x̄_2π^(k))}^{-1} Σ_{i∈A_2} w_2i^(k) (x_i − x̄_2π^(k))′(y_i − ȳ_2π^(k))

Theorem 4.2.1
Kim, Navarro, Fuller (2006 JASA), Assumptions
(i) The second phase is stratified, with π_2i|1i constant within a group
(ii) K_L < N n^{-1} π_1i < K_U for some K_L and K_U
(iii) V{T̂_1y | F} ≤ K_M V{T̂_1y,SRS}, where T̂_1y = Σ_{i∈A_1} π_1i^{-1} y_i
(iv) n V{T̂_1y | F} = Σ_{i=1}^{N} Σ_{j=1}^{N} Ω_ij y_i y_j, where Σ_{i=1}^{N} |Ω_ij| = O(N^{-1})
(v) E{(V̂_1(T̂_1y)/V(T̂_1y | F) − 1)^2 | F} = o(1)
(vi) E{[c_k (T̂_1y^(k) − T̂_1y)^2]^2 | F} < K_L L^{-2} [V(T̂_1y)]^2

Theorem 4.2.1
Kim, Navarro, Fuller (2006 JASA), Result
⇒ V̂{ȳ_2p,reg} = V(ȳ_2p,reg | F) − N^{-2} Σ_{i=1}^{N} κ_2i^{-1}(1 − κ_2i) e_i^2 + o_p(n^{-1}),
where κ_2i = π_2i|1i,  e_i = y_i − ȳ_N − (x_i − x̄_N) β_N, and
β_N = [Σ_{i=1}^{N} (x_i − x̄_N)′(x_i − x̄_N)]^{-1} Σ_{i=1}^{N} (x_i − x̄_N)′ y_i.
Variance Estimation
Replication is computationally efficient for large surveys and simple for users
Jackknife works if Taylor linearization is appropriate
Grouping for computational efficiency
Theoretical improvement for quantiles
Problem with rare items

Variance Estimation Using SAS
Always use weights, strata, clusters, and domains
Taylor series linearization
Replication variance estimation:
Balanced repeated replication (BRR)
Jackknife repeated replication (delete-one jackknife)
User-specified replicate weights

Taylor Series Linearization Variance Estimation
Use first-stage sampling units (PSUs)
Compute stratum variances based on cluster (PSU) totals, then pool across strata
Use the VARMETHOD=TAYLOR option

Taylor Series Linearization Variance Estimation
V̂(θ̂) = Σ_h (n_h − 1)^{-1} n_h (1 − f_h) Σ_i (e_hi+ − ē_h··)^2
e_hi+ = (Σ_{hij} w_hij)^{-1} Σ_j w_hij (y_hij − ȳ···)
e_rc,hi+ = (Σ_{hij} w_hij)^{-1} Σ_j w_hij (δ_rc,hij − P̂_rc)

VARMETHOD=TAYLOR
proc surveyfreq data=ResponseData varmethod=taylor total=tot;
   strata State Type;
   weight SamplingWeight;
   tables Rating;
run;

Replication Variance Estimation Using SAS
Methods include delete-one jackknife, BRR, and user-specified replicate weights
The quantity of interest is computed for every replicate subsample, and the deviation from the full-sample estimate is measured
Design information is not necessary if the replicate weights are supplied
Use the VARMETHOD=JACKKNIFE | BRR option

Replication Variance Estimation Using SAS
Create R replicate samples based on the replication method specified
For any statistic θ, compute θ̂ for the full sample and θ̂^(r) for every replicate sample
The replication variance estimator is
V̂(θ̂) = Σ_r α_r (θ̂^(r) − θ̂)^2

SAS Statements and Options
VARMETHOD= TAYLOR | JACKKNIFE | BRR
OUTWEIGHT=, OUTJKCOEFS=
REPWEIGHTS statement, JKCOEFS=
TOTAL= | RATE=

Create Replicate Weights
proc surveymeans data=ResponseData
   varmethod=jackknife(outweight=ResDataJK outjkcoefs=ResJKCoef);
   strata State Type;
   weight SamplingWeight;
   var UsageTime;
run;

The SURVEYMEANS Procedure
Variance Estimation Method: Jackknife; Number of Replicates: 300
Variable: UsageTime (Computer Usage Time)
N = 300, Mean = 284.953667, Std Error of Mean = 11.028403, 95% CL for Mean: 263.248431 to 306.658903
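The replication formula V̂(θ̂) = Σ_r α_r (θ̂^(r) − θ̂)² above can be sketched outside SAS. A minimal Python illustration (not the SAS implementation) of the delete-one jackknife, with the SRS coefficients α_r = (n − 1)/n, applied to the sample mean:

```python
import numpy as np

def jackknife_variance(y, estimator):
    """Delete-one jackknife: V-hat = ((n-1)/n) * sum_k (theta^(k) - theta)^2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    theta = estimator(y)
    # One replicate estimate per deleted observation
    reps = np.array([estimator(np.delete(y, k)) for k in range(n)])
    return (n - 1) / n * np.sum((reps - theta) ** 2)

y = np.array([2.0, 4.0, 6.0, 8.0])
v_jk = jackknife_variance(y, np.mean)
# For the sample mean the jackknife reproduces s^2/n exactly
v_exact = np.var(y, ddof=1) / len(y)
```

For the mean the two quantities agree exactly; for smooth nonlinear statistics the jackknife approximates the Taylor variance, as in Theorem 4.1.1.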
Use Replicate Weights
proc surveyfreq data=ResDataJK;
   weight SamplingWeight;
   tables Rating / chisq testp=(0.25 0.20 0.20 0.20 0.15);
   tables Recommend;
   repweights RepWt: / jkcoefs=ResJKCoef;
run;

The SURVEYFREQ Procedure
Variance Estimation Method: Jackknife; Replicate Weights: RESDATAJK; Number of Replicates: 300

Customer Satisfaction
Rating                 Frequency  Weighted Freq  Std Err of Wgt Freq  Percent  Test Percent  Std Err of Percent
Extremely Unsatisfied         70           3154            318.59453  24.7287         25.00              2.4800
Unsatisfied                   67           3009            326.36987  23.5889         20.00              2.5368
Neutral                       64           2867            316.40091  22.4797         20.00              2.4535
Satisfied                     57           2557            305.15457  20.0509         20.00              2.3809
Extremely Satisfied           26           1167            219.47434   9.1518         15.00              1.7185
Total                        284          12754            173.34658  100.000
Frequency Missing = 16

Recommend
Recommend  Frequency  Weighted Freq  Std Err of Wgt Freq  Percent  Std Err of Percent
0                171           7683            385.01543  59.1915              2.8930
1                118           5297            380.02703  40.8085              2.8930
Total            289          12979            144.14289  100.000
Frequency Missing = 11

Chapter 5
Models Used in Conjunction with Sampling
World Statistics Congress Short Course July 23-24, 2015

Nonresponse
Unit nonresponse: weight adjustment
Item nonresponse: imputation

Two-Phase Setup for Item Nonresponse
Phase one (A): observe x_i
Phase two (A_R): observe (x_i, y_i)
π_1i = Pr[i ∈ A]: phase-one inclusion probability (known)
π_2i|1i = Pr[i ∈ A_R | i ∈ A]: phase-two inclusion probability (unknown)
Response indicator: R_i = 1 if i ∈ A_R, 0 if i ∉ A_R, for i ∈ A

Two-Phase Setup for Unit Nonresponse
We are interested in estimating the population mean of Y using a weighted mean of the observations:
ȳ_R = (Σ_{i∈A_R} w_i)^{-1} Σ_{i∈A_R} w_i y_i, where w_i = π_1i^{-1} π̂_2i|1i^{-1}
Regression weighting approach: ȳ_reg,1 = x̄_N β̂ or ȳ_reg,2 = x̄_1 β̂,
where x̄_1 = (Σ_{i∈A} π_1i^{-1})^{-1} Σ_{i∈A} π_1i^{-1} x_i and
β̂ = (Σ_{i∈A_R} π_1i^{-1} x′_i x_i)^{-1} Σ_{i∈A_R} π_1i^{-1} x′_i y_i.

Theorem 5.1.1
Theorem. Assume
(i) V[N^{-1} Σ_{i∈A} π_1i^{-1} (x_i, y_i) | F] = O(n^{-1})
(ii) V[V̂(Ȳ_HT) | F] = O(n^{-3})
(iii) K_L < π_2i|1i < K_U, with π_2i|1i^{-1} = x_i α for some α
(iv) x_i λ = 1 for all i, for some λ
(v) R_i: independent
⇒ ȳ_reg,1 − ȳ_N = N^{-1} Σ_{i∈A_R} π_2i^{-1} e_i + O_p(n^{-1}),
where π_2i = π_1i π_2i|1i, e_i = y_i − x_i β_N, and
β_N = (Σ_{i∈U} π_2i|1i x′_i x_i)^{-1} Σ_{i∈U} π_2i|1i x′_i y_i.

Proof of Theorem 5.1.1
Since β̂ = (Σ_{i∈A_R} π_1i^{-1} x′_i x_i)^{-1} Σ_{i∈A_R} π_1i^{-1} x′_i y_i, we have β̂ − β_N = O_p(n^{-1/2}),
where β_N = (Σ_{i∈U} π_2i|1i x′_i x_i)^{-1} Σ_{i∈U} π_2i|1i x′_i y_i.
ȳ_reg,1 − ȳ_N = x̄_N β̂ − x̄_N β_N, because
Σ_{i=1}^{N} (y_i − x_i β_N) = Σ_{i=1}^{N} π_2i|1i^{-1} π_2i|1i (y_i − x_i β_N) = Σ_{i=1}^{N} (α′ x′_i) π_2i|1i (y_i − x_i β_N) = 0.
Use π_2i|1i^{-1} = x_i α and transform x_i to show
x̄_N (β̂ − β_N) = N^{-1} Σ_{i∈A_R} π_2i^{-1} e_i + O_p(n^{-1}).

Variance Estimation for ȳ_reg,1
ȳ_reg,1 = Σ_{i∈A_R} x̄_N (Σ_{j∈A_R} π_1j^{-1} x′_j x_j)^{-1} π_1i^{-1} x′_i y_i =: N^{-1} Σ_{i∈A_R} (π_1i π̂_2i|1i)^{-1} y_i
For small f = n/N, let b̂_j = π̂_2j|1j^{-1} ê_j, with ê_j = y_j − x_j β̂.
V̂ = N^{-2} Σ_{i∈A_R} Σ_{j∈A_R} {(π_1ij − π_1i π_1j)/π_1ij} b̂_i b̂_j /(π_1i π_1j)

Justification
Variance:
V(Σ_{i∈A_R} w_2i e_i | F) = Σ_{i∈U} Σ_{j∈U} (π_2ij − π_2i π_2j) w_2i w_2j e_i e_j
= Σ_{i≠j; i,j∈U} π_2i|1i π_2j|1j (π_1ij − π_1i π_1j) w_2i w_2j e_i e_j + Σ_{i∈U} (π_2i − π_2i^2) w_2i^2 e_i^2,
where π_2ij = π_1ij π_2i|1i π_2j|1j for i ≠ j and π_2ij = π_1i π_2i|1i for i = j.
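The regression-weighting estimator ȳ_reg,1 = x̄_N β̂ for unit nonresponse can be sketched numerically. This is a hedged illustration with simulated data, not the course's data: the population, sample size, and response mechanism below are all invented, and the response is taken independent of y (MCAR), the simplest case covered by the response model.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5000, 400

# Hypothetical population: y linear in x (intercept plus one covariate)
x = np.column_stack([np.ones(N), rng.uniform(0, 2, N)])
beta_true = np.array([1.0, 2.0])
y = x @ beta_true + rng.normal(0, 0.5, N)

sample = rng.choice(N, size=n, replace=False)   # SRS: pi_1i = n/N
w1 = N / n                                      # constant phase-one weight
respond = rng.random(n) < 0.7                   # response indicator R_i (MCAR)
xr, yr = x[sample][respond], y[sample][respond]

# beta-hat from the respondents, weighted by phase-one weights
# (with constant weights this reduces to OLS on the respondents)
beta_hat = np.linalg.solve((w1 * xr).T @ xr, (w1 * xr).T @ yr)

xbar_N = x.mean(axis=0)          # known population mean of x
ybar_reg1 = xbar_N @ beta_hat    # regression estimator y-bar_reg,1 = x-bar_N beta-hat
```

Because x̄_N is a population quantity, the estimator borrows strength from the full frame even though y is observed only for respondents.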
Justification (Cont'd)
Expectation of the variance estimator:
E[Σ_{i∈A_R} Σ_{j∈A_R} π_1ij^{-1} (π_1ij − π_1i π_1j) w_2i e_i w_2j e_j | F]
= Σ_{i∈U} (π_1i − π_1i^2) π_2i|1i w_2i^2 e_i^2 + Σ_{i≠j; i,j∈U} π_2i|1i π_2j|1j (π_1ij − π_1i π_1j) w_2i e_i w_2j e_j
= Σ_{i∈U} Σ_{j∈U} (π_2ij − π_2i π_2j) w_2i e_i w_2j e_j + Σ_{i∈U} π_2i (π_2i − π_1i) w_2i^2 e_i^2,
where w_2i = N^{-1} π_2i^{-1}. The second term is the bias of the variance estimator; it is of order O(N^{-1}).

Variance Estimation for ȳ_reg,2
ȳ_reg,2 = x̄_1 β̂
ȳ_reg,2 − ȳ_N = (x̄_1 − x̄_N) β_N + x̄_N (β̂ − β_N) + O_p(n^{-1})
= (x̄_1 − x̄_N) β_N + N^{-1} Σ_{i∈A_R} π_2i^{-1} (y_i − x_i β_N) + O_p(n^{-1}).
Variance estimator:
V̂_2 = N^{-2} Σ_{i∈A} Σ_{j∈A} {(π_1ij − π_1i π_1j)/π_1ij} b̂_2i b̂_2j /(π_1i π_1j),
where b̂_2j = (x_j − x̄_1) β̂ + (N x̄_1)(Σ_{i∈A_R} π_1i^{-1} x′_i x_i)^{-1} R_j x′_j ê_j.

Imputation
Fill in missing values with plausible values
Provides a complete data file: standard complete-data methods can be applied
By filling in missing values, analyses by different users will be consistent
A good imputation model reduces the nonresponse bias
Makes full use of the available information

A Hot-Deck Imputation Procedure
Partition the sample into G groups: A = A_1 ∪ A_2 ∪ · · · ∪ A_G.
In group g there are n_g elements, r_g respondents, and m_g = n_g − r_g nonrespondents.
For each group A_g, select m_g imputed values from the r_g respondents, with replacement (or without replacement).
Imputation model: y_i ∼ iid(μ_g, σ_g^2), i ∈ A_g (respondents and missing).

Example 5.2.1: Hot-Deck Imputation Under SRS
- y_i: study variable, subject to missingness
- x_i: auxiliary variable (group indicator), always observed
- I_i: sampling indicator for unit i
- R_i: response indicator for y_i
- y_i*: imputed value
A_g = A_Rg ∪ A_Mg, with A_Rg = {i ∈ A_g : R_i = 1} and A_Mg = {i ∈ A_g : R_i = 0}.
Imputation: y_j* = y_i with probability 1/r_g, for i ∈ A_Rg and j ∈ A_Mg.
Imputed estimator of ȳ_N:
ȳ_I = n^{-1} Σ_{i∈A} {R_i y_i + (1 − R_i) y_i*} =: n^{-1} Σ_{i∈A} y_Ii

Variance of Hot-Deck Imputed Mean
V(ȳ_I) = V{E_I(ȳ_I | y_n)} + E{V_I(ȳ_I | y_n)}
= V{n^{-1} Σ_{g=1}^{G} n_g ȳ_Rg} + E{n^{-2} Σ_{g=1}^{G} m_g (1 − r_g^{-1}) S_Rg^2},
where ȳ_Rg = r_g^{-1} Σ_{i∈A_Rg} y_i, S_Rg^2 = (r_g − 1)^{-1} Σ_{i∈A_Rg} (y_i − ȳ_Rg)^2, and y_n = (y_1, y_2, ..., y_n).

Variance of Hot-Deck Imputed Sample (2)
Model: y_i | i ∈ A_g ∼ iid(μ_g, σ_g^2)
V{ȳ_I} = V{ȳ_n} + n^{-2} Σ_{g=1}^{G} n_g m_g r_g^{-1} σ_g^2 + n^{-2} Σ_{g=1}^{G} m_g (1 − r_g^{-1}) σ_g^2 = V{ȳ_n} + n^{-2} Σ_{g=1}^{G} c_g σ_g^2
Reduced sample size: n^{-2} n_g^2 (r_g^{-1} − n_g^{-1}) σ_g^2
Randomness due to stochastic imputation: n^{-2} m_g (1 − r_g^{-1}) σ_g^2

Variance Estimation
Naive approach: treat imputed values as if observed.
The naive approach underestimates the true variance!
Example: naive V̂_I = n^{-1} S_I^2, with S_I^2 = (n − 1)^{-1} Σ_{i=1}^{n} (y_Ii − ȳ_I)^2.
E(S_I^2) ≐ (n − 1)^{-1} E{Σ_{i=1}^{n} (y_Ii − μ)^2} − V{ȳ_I} ≐ E(S_y,n^2)
Bias-corrected estimator:
V̂ = V̂_I + Σ_{g=1}^{G} c_g S_Rg^2

Other Approaches for Variance Estimation
Multiple imputation: Rubin (1987)
Adjusted jackknife: Rao and Shao (1992)
Fractional imputation: Kim and Fuller (2004), Fuller and Kim (2005)
Linearization: Shao and Steel (1999), Kim and Rao (2009)

Fractional Imputation
Basic idea:
Split each record with a missing item into M imputed values
Assign fractional weights

Fractional Imputation
Features:
Split each record with a missing item into m (> 1) imputed values
Assign fractional weights
The final product is a single data file with size ≤ nm
For variance estimation, the fractional weights are replicated

Fractional Imputation
Example (n = 10); ?: missing
ID  Weight  y1        y2
1   w1      y_{1,1}   y_{1,2}
2   w2      y_{2,1}   ?
3   w3      ?         y_{3,2}
4   w4      y_{4,1}   y_{4,2}
5   w5      y_{5,1}   y_{5,2}
6   w6      y_{6,1}   y_{6,2}
7   w7      ?         y_{7,2}
8   w8      ?         ?
9   w9      y_{9,1}   y_{9,2}
10  w10     y_{10,1}  y_{10,2}

Fractional Imputation (Categorical Case)
Fully Efficient Fractional Imputation (FEFI)
If both y1 and y2 are categorical, fractional imputation is easy to apply: there is only a finite number of possible values.
Imputed values = possible values
The fractional weights are the conditional probabilities of the possible values given the observations.
Can use the "EM by weighting" method of Ibrahim (1990) to compute the fractional weights.
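Returning to the group-wise hot deck of Example 5.2.1, a minimal Python sketch (the data and group structure below are invented for illustration): within each imputation cell, each missing y is replaced by a donor value drawn with replacement from that cell's respondents.

```python
import numpy as np

rng = np.random.default_rng(7)

def hot_deck_impute(y, resp, group):
    """Within each group, fill in missing y by drawing donors
    from that group's respondents, with replacement."""
    y = np.array(y, dtype=float)
    for g in np.unique(group):
        donors = np.where(resp & (group == g))[0]
        recips = np.where(~resp & (group == g))[0]
        y[recips] = y[rng.choice(donors, size=len(recips))]
    return y

# Two cells; NaN marks item nonresponse
y = np.array([1.0, 2.0, 3.0, np.nan, 10.0, 12.0, np.nan, np.nan])
resp = ~np.isnan(y)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

y_imp = hot_deck_impute(y, resp, group)
ybar_I = y_imp.mean()   # imputed estimator of the mean
```

Treating y_imp as complete data and computing n^{-1} S_I² would be the naive variance estimator criticized above; the slide's correction adds Σ_g c_g S²_Rg.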
FEFI Example (y1, y2: dichotomous, taking 0 or 1)
ID  Weight        y1       y2
1   w1            y_{1,1}  y_{1,2}
2   w2 w*_{2,1}   y_{2,1}  0
    w2 w*_{2,2}   y_{2,1}  1
3   w3 w*_{3,1}   0        y_{3,2}
    w3 w*_{3,2}   1        y_{3,2}
4   w4            y_{4,1}  y_{4,2}
5   w5            y_{5,1}  y_{5,2}

FEFI Example (y1, y2: dichotomous, taking 0 or 1)
ID  Weight        y1       y2
6   w6            y_{6,1}  y_{6,2}
7   w7 w*_{7,1}   0        y_{7,2}
    w7 w*_{7,2}   1        y_{7,2}
8   w8 w*_{8,1}   0        0
    w8 w*_{8,2}   0        1
    w8 w*_{8,3}   1        0
    w8 w*_{8,4}   1        1
9   w9            y_{9,1}  y_{9,2}
10  w10           y_{10,1} y_{10,2}

FEFI Example (Cont'd)
E-step: the fractional weights are the conditional probabilities of the imputed values given the observations:
w*_ij = P̂(y*(j)_i,mis | y_i,obs) = π̂(y_i,obs, y*(j)_i,mis) / Σ_{l=1}^{M_i} π̂(y_i,obs, y*(l)_i,mis),
where (y_i,obs, y_i,mis) is the (observed, missing) part of y_i = (y_i1, · · · , y_ip).
M-step: update the joint probability using the fractional weights:
π̂_ab = N̂^{-1} Σ_{i=1}^{n} Σ_{j=1}^{M_i} w_i w*_ij I(y*(j)_i,1 = a, y*(j)_i,2 = b), with N̂ = Σ_{i=1}^{n} w_i.

FEFI Example (Cont'd)
Variance estimation: recompute the fractional weights for each replicate and apply the same EM algorithm with the replicated weights.
E-step:
w*(k)_ij = π̂^(k)(y_i,obs, y*(j)_i,mis) / Σ_{l=1}^{M_i} π̂^(k)(y_i,obs, y*(l)_i,mis)
M-step:
π̂^(k)_ab = (N̂^(k))^{-1} Σ_{i=1}^{n} Σ_{j=1}^{M_i} w^(k)_i w*(k)_ij I(y*(j)_i,1 = a, y*(j)_i,2 = b), where N̂^(k) = Σ_{i=1}^{n} w^(k)_i.
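The E- and M-steps above can be sketched for two dichotomous items. A hedged Python illustration of "EM by weighting" (toy data and unit weights of 1 are invented; the E-step assigns each incomplete record fractional weights equal to its conditional cell probabilities, and the M-step re-estimates the joint table):

```python
import numpy as np

# Toy data: (y1, y2) pairs, None marking a missing item; all unit weights = 1
data = [(0, 0), (0, 1), (1, 1), (1, 0), (0, None), (None, 1), (None, None)]

# Start from a uniform joint probability table pi[a, b]
pi = np.full((2, 2), 0.25)

for _ in range(50):  # EM iterations
    counts = np.zeros((2, 2))
    for y1, y2 in data:
        # Candidate cells consistent with the observed part of the record
        cells = [(a, b) for a in (0, 1) for b in (0, 1)
                 if (y1 is None or a == y1) and (y2 is None or b == y2)]
        # E-step: fractional weights = conditional cell probabilities
        probs = np.array([pi[c] for c in cells])
        probs = probs / probs.sum()
        # Distribute the record over its candidate cells
        for c, p in zip(cells, probs):
            counts[c] += p
    # M-step: update the joint table from the fractionally weighted counts
    pi = counts / counts.sum()
```

Each incomplete record contributes fractional counts to several cells, so no record is discarded, which is the sense in which FEFI is "fully efficient."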
FEFI Example (Cont'd)
Final Product
[Table: the final imputed data file. Each complete record appears once; each incomplete record appears once per imputed value with fractional weight w_i w*_{i,j}, together with replication weights w^(k)_i w*(k)_{i,j} for Rep 1 through Rep L.]

Fractional Hot-Deck Imputation Using SAS
PROC SURVEYIMPUTE

Missing Values
Exclude observations with missing weights
Analyze missing levels as a separate level
Delete observations with missing values (equivalent to imputing missing values with the estimated values from the analysis model) [SAS default]
Analyze observations without missing values in a separate domain (equivalent to imputing missing values by 0) [NOMCAR]

Nonresponse
Follow-up interviews
Weight adjustment, poststratification
Hot-deck and fractional hot-deck imputation
Multiple imputation (MI and MIANALYZE procedures in SAS)
Other techniques

PROC SURVEYIMPUTE
Hot-deck imputation: simple random samples; proportional to weights; approximate Bayesian bootstrap
Fully efficient fractional imputation (FEFI)
Imputation-adjusted replicate weights

FEFI Detail
Use all possible levels that a missing item can take, given the levels of the observed items
Assign fractional weights proportional to the weighted frequencies of the imputed levels in the observed data

Unit  X  Y
1     0  0
2     0  .
3     0  1
4     0  0
5     .  1
6     1  0
7     1  1
8     1  1
9     .  .

FEFI: Initialization
Fill in missing values; use the complete cases to compute fractional weights.
Recipient  ImpWt    Unit  X  Y
0          1.00000  1     0  0
1          0.66667  2     0  0
2          0.33333  2     0  1
0          1.00000  3     0  1
0          1.00000  4     0  0
1          0.33333  5     0  1
2          0.66667  5     1  1
0          1.00000  6     1  0
0          1.00000  7     1  1
0          1.00000  8     1  1
1          0.33333  9     0  0
2          0.16667  9     0  1
3          0.16667  9     1  0
4          0.33333  9     1  1

M Step: Compute Proportions
The FREQ Procedure: table of X by Y (fractionally weighted frequencies)
       Y=0      Y=1      Total
X=0    3.00000  1.83333  4.83333
X=1    1.16667  3.00000  4.16667
Total  4.16667  4.83333  9.00000

E Step: Adjust Fractional Weights
Recipient  ImpWt    Unit  X  Y
0          1.00000  1     0  0
1          0.62069  2     0  0
2          0.37931  2     0  1
0          1.00000  3     0  1
0          1.00000  4     0  0
1          0.37931  5     0  1
2          0.62069  5     1  1
0          1.00000  6     1  0
0          1.00000  7     1  1
0          1.00000  8     1  1
1          0.33333  9     0  0
2          0.20370  9     0  1
3          0.12963  9     1  0
4          0.33333  9     1  1

Repeat E-M Steps in Replicate Samples
[Table: the same records with the fractional weights recomputed in each replicate (ImpRepWt_1, ImpRepWt_2, ..., ImpRepWt_9); e.g. unit 2 has full-sample fractional weights 0.58601/0.41399 and replicate weights 0.46072/0.66428 (Rep 1), 0/0 (Rep 2), and 0.65877/0.46623 (Rep 9).]

Digitech Cable
Item nonresponse: Rating, Recommend, ...
Impute missing items from the observed data
Use imputation cells
Create an imputed data set and a set of replicate weights for future use

PROC SURVEYIMPUTE
proc surveyimpute data=ResponseData method=fefi;
   strata State Type;
   weight SamplingWeight;
   class Rating Recommend HouseholdSize Race;
   var Rating Recommend HouseholdSize Race;
   cells ImputationCells;
   output out=ImputedData outjkcoef=JKCoefficients;
run;

The SURVEYIMPUTE Procedure
Missing Data Patterns
Group  Rating  Recommend  HouseholdSize  Race  Unweighted Freq  Sum of Weights  Percent  Weighted Percent
1      X       X          X              X                 269         12081.4    89.67             89.68
2      X       X          .              .                   6        270.2982     2.00              2.01
3      X       .          X              X                   9         402.655     3.00              2.99
4      .       X          X              X                  14        627.7105     4.67              4.66
5      .       .          X              X                   1        44.21429     0.33              0.33
6      .       .          .              .                   1        44.71795     0.33              0.33

Imputation Summary
Observation Status          Number of Observations  Sum of Weights
Nonmissing                                     269         12081.4
Missing                                         31        1389.596
Missing, Imputed                                31        1389.596
Missing, Not Imputed                             0               0
Missing, Partially Imputed                       0               0

NumberOfDonors  NumberOfUnits  NumberOfRows
0                         269           269
1                           3             3
2                           9            18
3                           2             6
4                           5            20
5                           5            25
6                           1             6
8                           2            16
9                           2            18
10                          1            10
59                          1            59
Total                     300           450

The output data set contains the imputed data.
New variables: replicate weights, recipient index, unit ID.
[Table: example rows of the imputed data set, showing UnitID, Recipient, ImpWt, SamplingWeight, ImputationCells, and the imputed values of Rating, Recommend, HouseholdSize, and Race; e.g. unit 186 (SamplingWeight 44.7179) is split into three rows with ImpWts 11.2356, 11.2356, and 22.2467, and unit 264 (SamplingWeight 45.5135) into ten rows.]

Analyses of FEFI Data
The imputed data set can be used for any analyses
Use the imputation-adjusted weights
Use the imputation-adjusted replicate weights
The number of rows in the imputed data set is NOT the same as the number of observation units

Use the Imputed Data to Estimate Usage
proc surveymeans data=ImputedData mean varmethod=jackknife;
   weight ImpWt;
   var UsageTime;
   repweights ImpRepWt: / jkcoefs=JKCoefficients;
run;

The SURVEYMEANS Procedure
Data Summary: Number of Observations 450; Sum of Weights 13471
Variance Estimation Method: Jackknife; Replicate Weights: IMPUTEDDATA; Number of Replicates: 300
Variable: UsageTime (Computer Usage Time), Mean = 284.953667, Std Error of Mean = 11.028403

Use the Imputed Data to Estimate Rating
proc surveyfreq data=ImputedData varmethod=jackknife;
   weight ImpWt;
   tables Rating Recommend;
   repweights ImpRepWt: / jkcoefs=JKCoefficients;
run;

The SURVEYFREQ Procedure
Customer Satisfaction
Rating                 Frequency  Weighted Freq  Std Err of Wgt Freq  Percent  Std Err of Percent
Extremely Unsatisfied        108           3297            338.60477  24.4745              2.5136
Unsatisfied                   97           3215            343.59043  23.8664              2.5506
Neutral                      102           3054            339.31908  22.6703              2.5189
Satisfied                     99           2678            319.57801  19.8770              2.3723
Extremely Satisfied           44           1227            230.65076   9.1117              1.7122
Total                        450          13471           4.0261E-10  100.000

Recommend
Recommend  Frequency  Weighted Freq  Std Err of Wgt Freq  Percent  Std Err of Percent
0                261           8027            389.02594  59.5863              2.8879
1                189           5444            389.02594  40.4137              2.8879
Total            450          13471           2.0157E-10  100.000

§5.5 Small Area Estimation
Basic setup: the original sample A is decomposed into G domains such that A = A_1 ∪ · · · ∪ A_G and n = n_1 + · · · + n_G.
n is large, but n_g can be very small.
Direct estimator of Y_g = Σ_{i∈U_g} y_i:
Ŷ_d,g = Σ_{i∈A_g} π_i^{-1} y_i
Unbiased, but may have high variance.
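The instability of the direct estimator for a small domain can be seen in a short simulation. This is a hedged sketch with an invented population (50 domains, SRS of 200 from 10,000), showing that Ŷ_d,g = Σ_{i∈A_g} π_i^{-1} y_i is unbiased over repeated samples but has a large standard deviation when the realized n_g is tiny:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, G = 10_000, 200, 50
group = rng.integers(0, G, N)       # domain labels
y = 5 + rng.normal(0, 1, N)

Y0 = y[group == 0].sum()            # true total in domain 0

ests = []
for _ in range(2000):
    s = rng.choice(N, n, replace=False)       # SRS, pi_i = n/N
    in0 = group[s] == 0
    ests.append((N / n) * y[s][in0].sum())    # direct HT domain estimator
ests = np.array(ests)
# ests.mean() is close to Y0 (unbiased); ests.std() is large,
# because the domain sample size averages only n/G = 4 units
```

This is the motivation for the synthetic and composite estimators that follow.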
If some auxiliary information is available, we can do better.
Synthetic estimator of Y_g:
Ŷ_s,g = X_g β̂,
where X_g = Σ_{i∈U_g} x_i is the known total of x_i in U_g and β̂ is an estimated regression coefficient.
Low variance (if x_i does not contain the domain indicator).
Could be biased (unless Σ_{i∈U_g} (y_i − x′_i B) = 0).

Composite estimation: consider
Ŷ_c,g = α_g Ŷ_d,g + (1 − α_g) Ŷ_s,g for some α_g ∈ (0, 1).
We want the α_g* that minimizes the MSE of Ŷ_c,g. The optimal choice is approximately
α_g* ≅ MSE(Ŷ_s,g) / {MSE(Ŷ_d,g) + MSE(Ŷ_s,g)}.
For the direct part, MSE(Ŷ_d,g) = V(Ŷ_d,g) can be estimated.
For the synthetic part, MSE(Ŷ_s,g) = E{(Ŷ_s,g − Y_g)^2} cannot be computed directly without assuming some error model.

Area-Level Estimation
Basic setup:
Parameter of interest: Ȳ_g = N_g^{-1} Σ_{i∈U_g} y_i
Model: Ȳ_g = X̄′_g β + u_g, with u_g ∼ (0, σ_u^2).
Also, Ȳ̂_d,g ∼ (Ȳ_g, V_g), with V_g = V(Ȳ̂_d,g).

Area-Level Estimation (Cont'd)
The two models can be written
Ȳ̂_d,g = Ȳ_g + e_g
X̄′_g β = Ȳ_g − u_g,
where e_g and u_g are independent error terms with zero means and variances V_g and σ_u^2, respectively. Thus, the best linear unbiased predictor (BLUP) can be written as
Ȳ̂_g* = α_g* Ȳ̂_d,g + (1 − α_g*) X̄′_g β, where α_g* = σ_u^2/(V_g + σ_u^2).

Area-Level Estimation (Cont'd)
MSE: if β, V_g, and σ_u^2 are known, then
MSE(Ȳ̂_g*) = V(Ȳ̂_g* − Ȳ_g) = V{α_g*(Ȳ̂_d,g − Ȳ_g) + (1 − α_g*)(X̄′_g β − Ȳ_g)}
= (α_g*)^2 V_g + (1 − α_g*)^2 σ_u^2 = α_g* V_g = (1 − α_g*) σ_u^2.
Note that, since 0 < α_g* < 1,
MSE(Ȳ̂_g*) < V_g and MSE(Ȳ̂_g*) < σ_u^2.
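The area-level BLUP above reduces to a two-line computation once V_g and σ_u² are known. A minimal sketch (the numeric inputs below are invented for illustration):

```python
def blup_area(ybar_direct, x_synth, V_g, sigma_u2):
    """Area-level composite: alpha* = sigma_u^2 / (V_g + sigma_u^2),
    estimate = alpha* * direct + (1 - alpha*) * synthetic,
    MSE = alpha* * V_g = (1 - alpha*) * sigma_u^2 (known-parameter case)."""
    alpha = sigma_u2 / (V_g + sigma_u2)
    est = alpha * ybar_direct + (1 - alpha) * x_synth
    mse = alpha * V_g
    return est, alpha, mse

# Hypothetical area: noisy direct estimate 12.0 (V_g = 4.0),
# synthetic prediction 10.0, between-area variance sigma_u^2 = 1.0
est, alpha, mse = blup_area(ybar_direct=12.0, x_synth=10.0, V_g=4.0, sigma_u2=1.0)
# alpha = 0.2, est = 0.2*12 + 0.8*10 = 10.4, mse = 0.8
```

The MSE (0.8) is smaller than both V_g (4.0) and σ_u² (1.0), matching the inequality on the slide.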
Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 5 7/23-24/2015 277 / 318 Area level estimation (Cont’d) If β and σu2 are unknown: 1 2 Find a consistent estimator of β and σu2 . Use Ȳˆg∗ (α̂g∗ , β̂) = α̂g∗ Ȳˆd,g + 1 − α̂g∗ X̄0g β̂. where α̂g∗ = σ̂u2 /(V̂g + σ̂u2 ) Estimation of σu2 : Method of moment 2 X G σ̂u2 = kg Ȳˆd,g − X̄0g β̂ − V̂d,g , G −p g n o−1 P 2 2 where kg ∝ σ̂u + V̂g and G g =1 kg = 1. If σ̂u is negative, then we set it to zero. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 5 7/23-24/2015 278 / 318 Area level estimation (Cont’d) MSE n o n o ˆ ˆ ∗ ∗ ∗ ∗ MSE Ȳg (α̂g , β̂) = V Ȳg (α̂g , β̂) − Ȳg n o 0 ∗ ˆ ∗ = V α̂g Ȳd,g − Ȳg + 1 − α̂g X̄g β̂ − Ȳg o n 2 ∗ 2 ∗ 2 0 = αg Vg + 1 − αg σu + X̄g V (β̂)X̄g +V (α̂g ) Vg + σu2 ∗ ∗ 2 0 = αg Vg + 1 − αg X̄g V (β̂)X̄g +V (α̂g ) Vg + σu2 MSE estimation (Prasad and Rao, 1990): n o ˆ ∗ ∗ ∗ ∗ 2 0 ˆ = α̂g V̂g + 1 − α̂g X̄g V̂ (β̂)X̄g MSE Ȳg (α̂g , β̂) n o 2 +2V̂ (α̂g ) V̂g + σ̂u . Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 5 7/23-24/2015 279 / 318 Unit level estimation Unit level estimation: Battese, Harter, and Fuller (1988). Use a unit level modeling ygi = x0gi β + ug + egi and Ŷg∗ o Xn 0 = xgi β̂ + ûg . i∈Ug It can be shown that Ȳˆg∗ = α̂g∗ Ȳreg ,g + 1 − α̂g∗ Ȳs,g where Ȳreg ,g 0 ˆ ˆ = Ȳd,g + X̄g − X̄d,g β̂ and Ȳs,g = X̄0g β̂. Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 5 7/23-24/2015 280 / 318 Chapter 6 Analytic Studies World Statistics Congress Short Course July 23-24, 2015 Kim & Fuller & Mukhopadhyay (ISU & SAS) Chapter 6 7/23-24/2015 281 / 318 Parameters Two types of parameters Descriptive parameter: “How many people in the United States were unemployed on March 10, 2015?” Analytic parameter: “If personal income (in the United States) increases 2%, how much will the consumption of beef increase ?” Basic approach to estimating analytic parameters 1 2 Specify a model that describes the relationship among the variables (often called superpopulation model). 
Estimate the parameters in the model using the realized sample.

Parameter Estimation for a Model Parameter

- θ_N: finite population characteristic for θ satisfying E(θ_N) = θ + O(N^{−1}) and V(θ_N − θ) = O(N^{−1}), where the distribution is with respect to the model.
- θ̂: estimator of θ_N satisfying E{θ̂ − θ_N | F_N} = O_p(n^{−1}) and V{θ̂ | F_N} = O_p(n^{−1}) almost surely.

Then
E(θ̂) = θ + O(n^{−1})
V(θ̂ − θ) = E{V(θ̂ | F_N)} + V(θ_N)
V̂(θ̂ − θ) = V̂(θ̂ | F_N) + V̂(θ_N − θ)

A Regression Model

The finite population is a realization from the model
y_i = x_i β + e_i,  (2)
where the e_i are independent (0, σ²) random variables, independent of x_j for all i and j. We are interested in estimating β from the sample. First order inclusion probabilities π_i are available.

Estimation of Regression Coefficients

OLS estimator:
β̂_ols = (Σ_{i∈A} x_i′x_i)^{−1} Σ_{i∈A} x_i′y_i
Probability weighted estimator:
β̂_π = (Σ_{i∈A} π_i^{−1} x_i′x_i)^{−1} Σ_{i∈A} π_i^{−1} x_i′y_i
If the superpopulation model also holds for the sample, the OLS estimator is optimal.

Informative Sampling

A non-informative sampling design (with respect to the superpopulation model) satisfies
P(y_i ∈ B | x_i, i ∈ A) = P(y_i ∈ B | x_i)  (3)
for any measurable set B. The left side is the sample model and the right side is the population model.
An informative sampling design is one for which equality (3) does not hold.
Non-informative sampling for regression implies
E(x_i′e_i | i ∈ A) = 0.  (4)
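The contrast between β̂_ols and β̂_π under an informative design can be sketched in a small simulation; the design below, with π_i increasing in y_i, is an artificial illustration, not from the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite population generated from the model y_i = x_i * beta + e_i, beta = 1.
N = 100_000
x = rng.uniform(1.0, 3.0, N)
y = x * 1.0 + rng.normal(0.0, 1.0, N)

# Informative Poisson design: inclusion probability increases with y itself,
# so the sample model differs from the population model.
pi = np.clip(0.02 + 0.02 * (y - y.min()), 0.0, 1.0)
sampled = rng.random(N) < pi
xs, ys, ws = x[sampled], y[sampled], 1.0 / pi[sampled]

# OLS estimator (ignores the design) versus the probability-weighted estimator.
b_ols = (xs * ys).sum() / (xs * xs).sum()
b_pi  = (ws * xs * ys).sum() / (ws * xs * xs).sum()
print(b_ols, b_pi)
```

Units with large residuals are over-represented, so E(x_i′e_i | i ∈ A) > 0 and the unweighted fit is biased upward, while the π-weighted fit stays near β = 1.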
If condition (4) is satisfied and moment conditions hold, β̂_ols is consistent.

Hypothesis Testing

Thus, one may want to test (4), or test directly
H_0: E(β̂_ols) = E(β̂_π).  (5)
1. From the sample, fit a regression of y_i on (x_i, z_i), where z_i = π_i^{−1} x_i.
2. Perform a test for γ = 0 under the expanded model
y = Xβ + Zγ + a,
where a is the error term satisfying E(a | X, Z) = 0.
Justification: Testing (5) is equivalent to testing
E{Z′(I − P_X) y} = 0,  (6)
where P_X = X(X′X)^{−1}X′. Since γ̂ = {Z′(I − P_X)Z}^{−1} Z′(I − P_X) y, testing γ = 0 is equivalent to testing (6).

Remarks on Testing

- When performing the hypothesis test, a design consistent variance estimator is preferable.
- Rejecting the null hypothesis means that we cannot directly use the OLS estimator under the current model. Options:
  - Include more x's until the sampling design is non-informative under the expanded model (Example 6.3.1).
  - Use the probability weighted estimator or other consistent estimators (Section 6.3.2).

Example 6.3.1: Birthweight and Age Data

- y: gestational age
- x: birthweight
- stratified sample of babies

OLS result:
ŷ = 25.765 + 0.389·x
     (0.370)  (0.012)
Weighted regression:
ŷ = 28.974 + 0.297·x
     (0.535)  (0.016)

Example 6.3.1 (Cont'd)

DuMouchel and Duncan test: Fit the OLS regression of y on (1, x, w, wx), where w_i = π_i^{−1}, to obtain
(β̂_0, β̂_1, γ̂_0, γ̂_1) = (22.088, 0.583, 8.287, −0.326)
                         (0.532) (0.033) (0.861) (0.332)
The hypothesis is rejected (F(2, 86) = 55.59). Thus, we cannot use the OLS method for these data.
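The mechanics of the DuMouchel and Duncan device (regress y on (1, x, w, wx), where w_i = π_i^{−1}, and F-test the two weight terms) can be sketched on simulated data; the strongly informative design below is an artificial illustration, not the birthweight data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite population from y_i = x_i*beta + e_i with beta = 1 (illustrative).
N = 50_000
x = rng.uniform(1.0, 3.0, N)
y = x + rng.normal(0.0, 1.0, N)

# Informative Poisson design: pi_i increases with y_i.
pi = np.clip(0.05 + 0.02 * (y - y.min()), None, 1.0)
s = rng.random(N) < pi
xs, ys, w = x[s], y[s], 1.0 / pi[s]

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

one = np.ones_like(xs)
X_full = np.column_stack([one, xs, w, w * xs])   # expanded model (1, x, w, wx)
X_red  = np.column_stack([one, xs])              # null model: gamma = 0

n, q = len(ys), 2                                # q = number of tested terms
sse_f, sse_r = sse(X_full, ys), sse(X_red, ys)
F = ((sse_r - sse_f) / q) / (sse_f / (n - X_full.shape[1]))
print(F)   # far above the F(q, n-4) critical value: the design is informative
```

A large F leads to rejection, as in the birthweight example; under a non-informative design the weight terms add nothing and F stays near its null distribution.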
Example 6.3.1 (Cont'd)

However, if we include a quadratic term in the model, the sampling design becomes noninformative.
OLS result:
ŷ = 28.335 + 0.331·x − 0.887·x2
     (0.343)  (0.010)  (0.082)
where x2 = 0.01(x − 30)².
Weighted regression:
ŷ = 28.458 + 0.327·x − 0.864·x2
     (0.386)  (0.011)  (0.108)
DuMouchel and Duncan test: F(3, 84) = 0.49.

Estimators Under Informative Sampling

Pfeffermann and Sverchkov (1999) estimator: Minimize
Q(β) = Σ_{i∈A} (w_i / w̃_i)(y_i − x_i β)²,
where w_i = π_i^{−1} and w̃_i = E(w_i | x_i, i ∈ A); w̃_i can be a function of x_i.

(Estimated) GLS estimator: Minimize
Q(β) = Σ_{i∈A} w_i (y_i − x_i β)² / v_i²,  (7)
where v_i² = E{w_i (y_i − x_i β)² | x_i}.
1. Obtain β̂_π and compute ê_i = y_i − x_i β̂_π.
2. Fit a (nonlinear) regression model of a_i² = w_i ê_i² on x_i, a_i² = q_a(x_i; γ_a) + r_ai, to get v̂_i² = q_a(x_i; γ̂_a), and insert v̂_i² in (7).

Comments

Fitting models to complex survey data:
- Always test for an informative design.
- If the hypothesis of a noninformative design is rejected:
  - Examine the model.
  - Use the HT estimator or a more complex design consistent estimator.
- Variance estimation for clustered and two-stage designs must recognize the clusters.

Linear and Logistic Regression Using SAS

- PROC SURVEYREG
- PROC SURVEYLOGISTIC

PROC SURVEYREG

Linear regression:
- Regression coefficients
- Significance tests
- Estimates and contrasts
- Regression estimator
- Comparisons of domain means

PROC SURVEYLOGISTIC

Categorical response:
- Logit, probit, complementary log-log, and generalized logit regressions
- Regression coefficients
- Estimates and contrasts
- Odds ratios

Pseudo-likelihood Estimation

- The finite population parameter is defined by the population likelihood,
l_N(θ_N) = Σ_{U_N} log{L(θ_N, x_i)}.
- A sample-based estimate of the likelihood is used to estimate the parameter,
l_π(θ̂) = Σ_A π_i^{−1} log{L(θ̂, x_i)}.
- Variance estimators assume fixed population values, V(θ̂ | F_N).

Taylor Series Linearization Variance Estimation

Sandwich variance estimator that accounts for strata, clusters, and weights:
V̂(θ̂) = I^{−1} G I^{−1}
G = (n − p)^{−1}(n − 1) Σ_h (n_h − 1)^{−1} n_h (1 − f_h) Σ_i (e_hi+ − ē_h··)′(e_hi+ − ē_h··)

Replication Variance Estimation

Estimate θ in the full sample and in every replicate sample:
V̂(θ̂) = Σ_r α_r (θ̂^(r) − θ̂)(θ̂^(r) − θ̂)′

Digitech Cable

Customer satisfaction survey:
- Describe usage time based on data usage after adjusting for race
- Describe usage time based on race

Linear Regression

proc surveyreg data=ImputedData;
   weight ImpWt;
   class Race;
   model UsageTime = DataUsage Race / solution;
   repweights ImpRepWt: / jkcoefs=JKCoefficients;
   lsmeans Race / diff;
   output out=RegOut predicted=FittedValues residual=Residuals;
run;

The SURVEYREG Procedure
Regression Analysis for Dependent Variable UsageTime

Fit Statistics:
  R-Square          0.7136
  Root MSE          111.20
  Denominator DF    300

Tests of Model Effects:
  Effect      Num DF   F Value   Pr > F
  Model            5    135.08   <.0001
  Intercept        1     70.87   <.0001
  DataUsage        1    484.50   <.0001
  Race             4     22.09   <.0001
Note: The denominator degrees of freedom for the F tests is 300.

Estimated Regression Coefficients:
  Parameter          Estimate   Standard Error   t Value   Pr > |t|
  Intercept         30.395290       10.7067145      2.84     0.0048
  DataUsage          0.055261        0.0025106     22.01     <.0001
  Race Black        43.460901       20.2069823      2.15     0.0323
  Race Hispanic     38.191181       23.6997846      1.61     0.1081
  Race NA          202.394973       23.3000699      8.69     <.0001
  Race Other       141.623908       33.7004078      4.20     <.0001
  Race White         0.000000        0.0000000         .          .
Note: The degrees of freedom for the t tests is 300. Matrix X'WX is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique.

Comparisons of Domain Means

proc surveyreg data=ImputedData;
   weight ImpWt;
   class Race;
   model UsageTime = DataUsage Race / solution;
   repweights ImpRepWt: / jkcoefs=JKCoefficients;
   lsmeans Race / diff;
   output out=RegOut predicted=FittedValues residual=Residuals;
run;

Differences of Race Least Squares Means:
  Race       Race       Estimate   Standard Error    DF   t Value   Pr > |t|
  Black      Hispanic     5.2697          29.5971   300      0.18     0.8588
  Black      NA          -158.93          29.0487   300     -5.47     <.0001
  Black      Other       -98.1630         37.3004   300     -2.63     0.0089
  Black      White        43.4609         20.2070   300      2.15     0.0323
  Hispanic   NA          -164.20          31.4596   300     -5.22     <.0001
  Hispanic   Other       -103.43          39.9883   300     -2.59     0.0102
  Hispanic   White        38.1912         23.6998   300      1.61     0.1081
  NA         Other        60.7711         39.2901   300      1.55     0.1230
  NA         White        202.39          23.3001   300      8.69     <.0001
  Other      White        141.62          33.7004   300      4.20     <.0001

Digitech Cable

Customer satisfaction survey:
- Describe recommendation based on race after adjusting for data usage

Logistic Regression

proc surveylogistic data=ImputedData;
   weight ImpWt;
   class Race;
   model Recommend = DataUsage Race;
   repweights ImpRepWt: / jkcoefs=JKCoefficients;
run;

The SURVEYLOGISTIC Procedure

Analysis of Maximum Likelihood Estimates:
  Parameter        Estimate   Standard Error   t Value   Pr > |t|
  Intercept          0.4507           0.2839      1.59     0.1135
  DataUsage        0.000046         0.000042      1.08     0.2815
  Race Black        -0.2246           0.3302     -0.68     0.4969
  Race Hispanic      0.0466           0.3505      0.13     0.8944
  Race NA            0.4332           0.6435      0.67     0.5014
  Race Other         0.1304           0.5145      0.25     0.8000
Note: The degrees of freedom for the t tests is 300.

Odds Ratio Estimates:
  Effect                   Point Estimate   95% Confidence Limits
  DataUsage                         1.000      1.000      1.000
  Race Black vs White               1.175      0.583      2.367
  Race Hispanic vs White            1.541      0.728      3.260
  Race NA vs White                  2.268      0.474     10.850
  Race Other vs White               1.675      0.496      5.661
Note: The degrees of freedom in computing the confidence limits is 300.

Survey Data Analysis Using SAS

- Do not use non-survey procedures for survey data analysis.
- Always use complete design information: weight, strata, clusters, ...
- SURVEYSELECT, SURVEYIMPUTE, SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures

For More Information about SAS/STAT

- support.sas.com/statistics/ : In-depth information about statistical products and a link to the e-newsletter
- support.sas.com/STAT/ : Portal to SAS Technical Support, discussion forum, documentation, and more