STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography. Surveys for households and individuals.

Survey sampling: 4 major topics
1. Traditional design-based statistical inference (6 weeks)
2. Likelihood considerations (1 week)
3. Model-based statistical inference (3 weeks)
4. Missing data and nonresponse (2 weeks)

Statistical demography (2-3 weeks)
• Mortality
• Life expectancy
• Population projections

Course goals. Give students knowledge about:
– planning surveys in the social sciences
– major sampling designs
– basic concepts and the most important estimation methods in traditional applied survey sampling
– the likelihood principle and its consequences for survey sampling
– the use of modeling in sampling
– the treatment of nonresponse
– basic demography

But first: basic concepts in sampling.

Population (target population): the universe of all units of interest for a certain study, denoted U = {1, 2, ..., N}, where N is the size of the population. All units can be identified and labeled.
• Ex: political poll – all adults eligible to vote
• Ex: employment/unemployment in Norway – all persons in Norway aged 15 or more
• Ex: consumer expenditure – unit = household

Sample: a subset of the population, to be observed. The sample should be "representative" of the population.

Sampling design:
• The sample is a probability sample if the units in the sample have been chosen with known probabilities, such that each unit in the population has a positive probability of being chosen for the sample.
• We shall only be concerned with probability sampling.
• Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen for the sample.
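The SRS design above can be illustrated with a short simulation (a sketch; the population size, sample size, and seed are illustrative), checking empirically that every unit's inclusion probability is close to n/N:

```python
import random
from collections import Counter

def srs(population, n, rng=random):
    """Draw a simple random sample of size n without replacement."""
    return rng.sample(population, n)

# Empirical check that each unit's inclusion probability is about n/N.
random.seed(1)
N, n, reps = 20, 5, 20_000
U = list(range(1, N + 1))
counts = Counter()
for _ in range(reps):
    counts.update(srs(U, n))

# Each relative frequency should be near n/N = 0.25.
assert all(abs(counts[i] / reps - n / N) < 0.02 for i in U)
```

Every subset of size n is equally likely under `random.sample`, so the simulated inclusion frequencies all settle around the common value n/N.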
• The probability distribution for SRS over all subsets of U is an example of a sampling design, i.e. the probability plan for selecting a sample s from the population:
$$p(s) = 1\Big/\binom{N}{n} \text{ if } |s| = n, \qquad p(s) = 0 \text{ if } |s| \ne n$$

Basic statistical problem: Estimation
• A typical survey has many variables of interest.
• The aim of a sample is to obtain information regarding totals or averages of these variables for the whole population.
• Example: unemployment in Norway. We want to estimate the total number t of unemployed individuals. For each person i (at least 15 years old) in Norway, let $y_i = 1$ if person i is unemployed, 0 otherwise. Then $t = \sum_{i=1}^N y_i$.

• In general, the variable of interest is y, with $y_i$ the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^N y_i$.
• The typical problem is to estimate t or t/N.
• Sometimes it is also of interest to estimate ratios of totals. Example – estimating the rate of unemployment: $y_i = 1$ if person i is unemployed, 0 otherwise; $x_i = 1$ if person i is in the labor force, 0 otherwise; with totals $t_y$ and $t_x$. Unemployment rate: $t_y/t_x$.

Sources of error in sample surveys

1. Target population U vs frame population $U_F$. Access to the population is through a list of units – a register $U_F$. U and $U_F$ may not be the same. Three possible errors in $U_F$:
– Undercoverage: some units in U are not in $U_F$
– Overcoverage: some units in $U_F$ are not in U
– Duplicate listings: a unit in U is listed more than once in $U_F$
$U_F$ is sometimes called the sampling frame.

2. Nonresponse – missing data
• Some persons cannot be contacted
• Some refuse to participate in the survey
• Some may be ill and incapable of responding
• In postal surveys: can be as much as 70% nonresponse
• In telephone surveys: 50% nonresponse is not uncommon
• Possible consequences:
– bias in the sample, which is then not representative of the population
– estimation becomes more inaccurate
• Remedies: imputation, weighting

3. Measurement error – the correct value of $y_i$ is not measured. In interviewer surveys:
• incorrect marking
• interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use
• misunderstanding of the question, or not remembering correctly

4. Sampling error
– The error caused by observing a sample instead of the whole population
– To assess this error, the margin of error measures sample-to-sample variation
– The design approach deals with calculating sampling errors for different sampling designs
– One such measure, the 95% confidence interval: if we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t

• The first 3 errors are nonsampling errors
– They can be much larger than the sampling error
• In this course we treat:
– sampling error
– nonresponse bias
• We shall assume that the frame population is identical to the target population, and that there is no measurement error.

Summary of basic concepts
• population, target population
• unit
• sample
• sampling design
• estimation
– estimator
– measure of bias
– measure of variance
– confidence interval
• survey errors:
– register/frame population
– measurement error
– nonresponse
– sampling error

Example – Psychiatric Morbidity Survey 1993 from Great Britain
• Aim: provide information about the prevalence of psychiatric problems among adults in GB, as well as their associated social disabilities and use of services
• Target population: adults aged 16-64 living in private households
• Sample: through several stages, 18,000 addresses were chosen and 1 adult in each household was selected
• 200 interviewers, each visiting 90 households

Result of the sampling process
Sample of addresses: 18,000
– vacant premises: 927
– institutions/business premises: 573
– demolished: 499
– second home/holiday flat: 236
Private household addresses: 15,765
– extra households found: 669
Total private households: 16,434
– households with no one aged 16-64: 3,704
Eligible households: 12,730
– nonresponse: 2,622
Sample: 10,108 households with responding adults aged 16-64

Why sampling?
• reduces costs for an acceptable level of accuracy (money, manpower, processing time, ...)
• may free up resources to reduce nonsampling error and collect more information from each person in the sample
– ex: 400 interviewers at $5 per interview: lower sampling error; 200 interviewers at $10 per interview: lower nonsampling error
• much quicker results

When is a sample representative?
• Balance on gender and age:
– proportion of women in sample ≈ proportion in population
– proportions of age groups in sample ≈ proportions in population
• An ideal representative sample:
– a miniature version of the population,
– implying that every unit in the sample represents the characteristics of a known number of units in the population
• Appropriate probability sampling ensures a representative sample "on the average"

Alternative approaches for statistical inference based on survey sampling
• Design-based:
– no modeling; the only stochastic element is the sample s, with known distribution
• Model-based: the values $y_i$ are assumed to be values of random variables $Y_i$:
– two stochastic elements: $Y = (Y_1, \ldots, Y_N)$ and s
– assumes a parametric distribution for Y
– example: suppose we have an auxiliary variable x (could be age, gender, education); a typical model is a regression of $Y_i$ on $x_i$

• Statistical principles of inference imply that the model-based approach is the most sound and valid approach.
• We start by learning the design-based approach, since it is the most applied approach to survey sampling, used by national statistical institutes and most research institutes for the social sciences.
– It is the easy way out: no need to model.
– All statisticians working with survey sampling in practice need to know this approach.

Design-based statistical inference
• Can also be viewed as a distribution-free, nonparametric approach
• The only stochastic element: the sample s, with distribution p(s) over all subsets s of the population U = {1, ..., N}
• No explicit statistical modeling is done for the variable y; all $y_i$'s are considered fixed but unknown
• Focus on sampling error
• Sets sample survey theory apart from usual statistical analysis
• The traditional approach, started by Neyman in 1934

Estimation theory – simple random sample
SRS of size n: each sample s of size n has $p(s) = 1\big/\binom{N}{n}$. It can in principle be performed by drawing one unit at a time, at random, without replacement.
Estimation of the population mean of a variable y: $\mu = \sum_{i=1}^N y_i/N$.
A natural estimator is the sample mean: $\bar y_s = \sum_{i\in s} y_i/n$.
Desirable property: (I) unbiasedness. An estimator $\hat\mu$ is unbiased if $E(\hat\mu) = \mu$. The sample mean $\bar y_s$ is unbiased under the SRS design.

The uncertainty of an unbiased estimator is measured by its estimated sampling variance, or standard error (SE):
$\mathrm{Var}(\hat\mu) = E(\hat\mu - \mu)^2$, if $E(\hat\mu) = \mu$;
$\hat V(\hat\mu)$ is an (unbiased) estimate of $\mathrm{Var}(\hat\mu)$;
$SE(\hat\mu) = \sqrt{\hat V(\hat\mu)}$.

Some results for SRS:
(1) Let $\pi_i$ be the probability that unit i is in the sample. Then $\pi_i = n/N = f$, the sampling fraction.
(2) $E(\bar y_s) = \mu$.
(3) Let $\sigma^2$ be the population variance, $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu)^2$. Then
$$\mathrm{Var}(\bar y_s) = \frac{\sigma^2}{n}(1 - f)$$
Here the factor (1 − f) is called the finite population correction.
• It is usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 − f = 0.998
n = 1,000 and N = 400,000: 1 − f = 0.9975
n = 1,000 and N = 5,000,000: 1 − f = 0.9998
• The effect of changing n is much more important than the effect of changing n/N.

An unbiased estimator of $\sigma^2$ is given by the sample variance
$$s^2 = \frac{1}{n-1}\sum_{i\in s}(y_i - \bar y_s)^2$$
The estimated variance is $\hat V(\bar y_s) = (1-f)\,s^2/n$. Usually we report the standard error of the estimate, $SE(\bar y_s) = \sqrt{\hat V(\bar y_s)}$.
Confidence intervals for $\mu$ are based on the central limit theorem. For large n and N − n:
$$Z = \frac{\bar y_s - \mu}{\sqrt{(1-f)\,\sigma^2/n}} \;\approx\; N(0,1)$$
Approximate 95% CI for $\mu$: $\bigl(\bar y_s - 1.96\,SE(\bar y_s),\; \bar y_s + 1.96\,SE(\bar y_s)\bigr)$, i.e. $\bar y_s \pm 1.96\,SE(\bar y_s)$.

Example. N = 341 residential blocks in Ames, Iowa; $y_i$ = number of dwellings in block i; 1000 independent SRS for each of several values of n.

n | proportion of samples with |Z| < 1.64 | proportion of samples with |Z| < 1.96
30 | 0.88 | 0.93
50 | 0.88 | 0.93
70 | 0.88 | 0.94
90 | 0.90 | 0.95

For one SRS with n = 90: $\bar y_s = 13$, $s^2 = 75$,
$$SE(\bar y_s) = \sqrt{(1 - 90/341)\cdot 75/90} = 0.78$$
Approximate 95% CI: $13 \pm 1.96\cdot 0.78 = 13 \pm 1.53 = (11.47,\; 14.53)$.

The absolute value of the sampling error is not informative when it is not related to the value of the estimate. For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3. The coefficient of variation of the estimate:
$$CV(\bar y_s) = SE(\bar y_s)/\bar y_s$$
In the example: $CV(\bar y_s) = 0.78/13 = 0.06 = 6\%$.
• A measure of the relative variability of an estimate.
• It does not depend on the unit of measurement.
• More stable over repeated surveys; can be used for planning, for example for determining sample size.
• More meaningful when estimating proportions.

Estimation of a population proportion p with a certain characteristic A:
p = (number of units in the population with A)/N.
Let $y_i = 1$ if unit i has characteristic A, 0 otherwise. Then p is the population mean of the $y_i$'s. Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as $\hat p = \bar y_s = X/n$.

Under SRS: $E(\hat p) = p$ and
$$\mathrm{Var}(\hat p) = \frac{p(1-p)}{n}\Bigl(1 - \frac{n-1}{N-1}\Bigr)$$
since the population variance equals $\sigma^2 = \frac{Np(1-p)}{N-1}$. The sample variance is $s^2 = \frac{n\,\hat p(1-\hat p)}{n-1}$, so the unbiased estimate of the variance of the estimator is
$$\hat V(\hat p) = \frac{\hat p(1-\hat p)}{n-1}\Bigl(1 - \frac{n}{N}\Bigr)$$

Example – a political poll. Suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is
$$\hat p = 280/1000 = 0.28$$
$$SE(\hat p) = \sqrt{\Bigl(1 - \frac{n}{N}\Bigr)\frac{\hat p(1-\hat p)}{n-1}} \approx \sqrt{\frac{0.28\cdot 0.72}{999}} \approx 0.014$$
A confidence interval requires the normal approximation.
We can use the guideline from the binomial distribution, valid when N − n is large: $np \ge 5$ and $n(1-p) \ge 5$.

In this example: n = 1000 and N = 4,000,000.
Approximate 95% CI: $\hat p \pm 1.96\,SE(\hat p) = 0.280 \pm 0.028 = (0.252,\; 0.308)$.

Ex: Psychiatric Morbidity Survey 1993 from Great Britain
p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
N ≈ 40,000,000
$\hat p = 0.14$
$SE(\hat p) = \sqrt{(1 - 0.00024)\cdot 0.14\cdot 0.86/9791} = 0.0035$
95% CI: $0.14 \pm 1.96\cdot 0.0035 = 0.14 \pm 0.007 = (0.133,\; 0.147)$.

General probability sampling
• Sampling design: p(s), the known probability of selection for each subset s of the population U.
• Actually, the sampling design is the probability distribution p(·) over all subsets of U.
• Typically p(s) = 0 for most s. In SRS of size n, every s with size different from n has p(s) = 0.
• The inclusion probability:
$$\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i\in s) = \sum_{s:\, i\in s} p(s)$$

Illustration. U = {1, 2, 3, 4}; sample of size 2; 6 possible samples.
Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8.
The inclusion probabilities:
$\pi_1 = \sum_{s:\,1\in s} p(s) = p(\{1,2\}) + p(\{1,4\}) = 5/8$
$\pi_2 = \sum_{s:\,2\in s} p(s) = p(\{1,2\}) + p(\{2,3\}) = 3/4 = 6/8$
$\pi_3 = \sum_{s:\,3\in s} p(s) = p(\{2,3\}) + p(\{3,4\}) = 3/8$
$\pi_4 = \sum_{s:\,4\in s} p(s) = p(\{3,4\}) + p(\{1,4\}) = 2/8$

Some results
(I) $\pi_1 + \pi_2 + \cdots + \pi_N = E(n)$, where n is the sample size.
(II) If the sample size is determined to be n in advance: $\pi_1 + \pi_2 + \cdots + \pi_N = n$.
Proof: Let $Z_i = 1$ if unit i is included in the sample, 0 otherwise. Then $\pi_i = P(Z_i = 1) = E(Z_i)$ and $n = \sum_{i=1}^N Z_i$, so
$$E(n) = \sum_{i=1}^N E(Z_i) = \sum_{i=1}^N \pi_i$$

Estimation theory – probability sampling in general
Problem: estimate a population quantity for the variable y. For the sake of illustration: the population total $t = \sum_{i=1}^N y_i$.
For an estimator $\hat t$ of t based on the sample:
Expected value: $E(\hat t) = \sum_s \hat t(s)\,p(s)$
Variance: $\mathrm{Var}(\hat t) = E[\hat t - E\hat t]^2 = \sum_s [\hat t(s) - E\hat t]^2\,p(s)$
Bias: $E(\hat t) - t$; $\hat t$ is unbiased if $E(\hat t) = t$.

Let $\hat V(\hat t)$ be an (unbiased if possible) estimate of $\mathrm{Var}(\hat t)$.
The standard error of $\hat t$: $SE(\hat t) = \sqrt{\hat V(\hat t)}$.
The coefficient of variation of $\hat t$: $CV(\hat t) = SE(\hat t)/\hat t$.
CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases.
Margin of error: $2\,SE(\hat t)$, because typically
$$P\bigl(\hat t - 2SE(\hat t) \le t \le \hat t + 2SE(\hat t)\bigr) \approx 0.95 \quad \text{for large } n,\, N-n$$
Since $\hat t$ is approximately normally distributed for large n and N − n, $\hat t \pm 2\,SE(\hat t)$ is approximately a 95% CI.

Some peculiarities in the estimation theory
Example: N = 3, n = 2, simple random sample.
$s_1 = \{1,2\}$, $s_2 = \{1,3\}$, $s_3 = \{2,3\}$, with $p(s_k) = 1/3$ for k = 1, 2, 3.
Let $\hat t_1 = 3\bar y_s$, which is unbiased. Let $\hat t_2$ be given by:
$$\hat t_2(s_1) = \tfrac{3}{2}(y_1 + y_2) = \hat t_1(s_1)$$
$$\hat t_2(s_2) = 3\bigl(\tfrac{1}{2}y_1 + \tfrac{2}{3}y_3\bigr) = \hat t_1(s_2) + \tfrac{1}{2}y_3$$
$$\hat t_2(s_3) = 3\bigl(\tfrac{1}{2}y_2 + \tfrac{1}{3}y_3\bigr) = \hat t_1(s_3) - \tfrac{1}{2}y_3$$

$\hat t_2$ is also unbiased:
$$E(\hat t_2) = \sum_s \hat t_2(s)\,p(s) = \tfrac{1}{3}\sum_{k=1}^3 \hat t_2(s_k) = \tfrac{1}{3}\cdot 3t = t$$
Moreover,
$$\mathrm{Var}(\hat t_1) - \mathrm{Var}(\hat t_2) = \tfrac{1}{6}\,y_3(3y_2 - 3y_1 - y_3)$$
so $\mathrm{Var}(\hat t_1) > \mathrm{Var}(\hat t_2)$ if $y_3 > 0$ and $3y_2 > 3y_1 + y_3$.
If the $y_i$ are 0/1 variables, this happens when $y_1 = 0$, $y_2 = y_3 = 1$. For this set of values of the $y_i$'s:
$\hat t_1(s_1) = 1.5$, $\hat t_1(s_2) = 1.5$, $\hat t_1(s_3) = 3$: never correct
$\hat t_2(s_1) = 1.5$, $\hat t_2(s_2) = 2$, $\hat t_2(s_3) = 2.5$
$\hat t_2$ clearly has less variability than $\hat t_1$ for these y-values.

Let y be the population vector of the y-values.
This example shows that $N\bar y_s$ is not uniformly best (minimum variance for all y) among linear design-unbiased estimators. The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models. In fact, we have the following much stronger result:

Theorem: Let p(·) be any sampling design. Assume each $y_i$ can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t.

Proof: Let $\hat t$ be unbiased, and let $y^0$ be one possible value of y. Then there exists an unbiased $\hat t_0$ with $\mathrm{Var}(\hat t_0) = 0$ when $y = y^0$:
$$\hat t_0(s, y) = \hat t(s, y) - \hat t(s, y^0) + t_0, \qquad t_0 \text{ the total for } y^0$$
1) $\hat t_0$ is unbiased: $E(\hat t_0) = t - \sum_s \hat t(s, y^0)\,p(s) + t_0 = t - t_0 + t_0 = t$.
2) When $y = y^0$: $\hat t_0 = t_0$ for all samples s, so $\mathrm{Var}(\hat t_0) = 0$.
This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible.

Determining sample size
• The sample size has a decisive effect on the cost of the survey.
• How large n should be depends on the purpose of the survey.
• In a poll for determining voting preference, n = 1000 is typically enough.
• In the quarterly labor force survey in Norway, n = 24,000.
Mainly three factors to consider:
1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest.
2. Homogeneity of the population. Smaller samples are needed if there is little variation in the population.
3. Estimation for subgroups, domains, of the population.

It is often factor 3 that puts the highest demand on the survey.
• If we want to estimate totals for domains of the population, we should take a stratified sample: a sample from each domain.
• A stratified random sample: from each domain, a simple random sample.
H strata constitute the whole population, with sample sizes $n_1, n_2, \ldots, n_H$ and total sample size $n = n_1 + n_2 + \cdots + n_H$. We must determine each $n_h$.

Assume the problem is to estimate a population proportion p for a certain stratum, and that we use the sample proportion from the stratum to estimate p. Let n be the sample size of this stratum, and assume that n/N is negligible.
Desired accuracy for this stratum: the 95% CI for p should be $\hat p \pm 0.05$.
95% CI for p: $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}$.
The accuracy requirement:
$$1.96\sqrt{\hat p(1-\hat p)/n} \le 0.05 = \tfrac{1}{20} \;\Rightarrow\; n \ge 1.96^2\cdot 20^2\cdot \hat p(1-\hat p)$$
The estimate is unknown in the planning phase. Use the conservative size n = 384 (corresponding to $\hat p(1-\hat p) = 1/4$), or a planning value $p_0$ with $n = 1536\,p_0(1-p_0)$. For example, with $p_0 = 0.2$: n = 246.
In general, with accuracy requirement d for the 95% CI ($\hat p \pm d$):
$$n = 3.84\,p_0(1-p_0)/d^2$$
Alternative accuracy requirement: the length of the 95% CI is proportional to $\hat p$ (when $\hat p \le 0.5$; otherwise estimate 1 − p):
$$1.96\sqrt{\hat p(1-\hat p)/n} \le d\,\hat p \;\Leftrightarrow\; CV(\hat p) \le d/1.96 \equiv e$$
$$SE(\hat p)/\hat p \le e \;\Leftrightarrow\; n \ge \frac{1}{e^2}\cdot\frac{1-\hat p}{\hat p}$$
With planning value $p_0$: $n = \frac{1}{e^2}\cdot\frac{1-p_0}{p_0}$.
With e = 0.1 we then require approximately:
when $p_0 = 0.5$: 95% CI $\hat p \pm 0.10$ and n = 100
when $p_0 = 0.1$: 95% CI $\hat p \pm 0.02$ and n = 900

Example: monthly unemployment rate. It is important to detect changes in unemployment rates from month to month. Planning value $p_0 = 0.05$.
Desired accuracy $1.96\,SE(\hat p) \le d$ gives $n = 3.84\,p_0(1-p_0)/d^2 = 0.1824/d^2$:
d = 0.001 (margin of error 0.1%): n = 182,400
d = 0.002: n = 45,600
d = 0.005: n ≈ 7,300
Note: d = 0.005 gives $CV(\hat p) = 0.00255/0.05 = 0.051 \approx 5\%$.

Two basic estimators: the ratio estimator and the Horvitz-Thompson estimator.
• Ratio estimator: in simple random samples.
• H-T estimator: for unequal probability sampling, where the inclusion probabilities are unequal.
• The goal is to estimate a population total t for a variable y.

Ratio estimator
Suppose we have known auxiliary information for the whole population: $x = (x_1, x_2, \ldots, x_N)$. Ex: age, gender, education, employment status. Let $X = \sum_{i=1}^N x_i$.
The ratio estimator for the y-total t:
$$\hat t_R = \frac{\sum_{i\in s} y_i}{\sum_{i\in s} x_i}\,X = \frac{\bar y_s}{\bar x_s}\,X$$

We can express the ratio estimator in the following form:
$$\hat t_R = \frac{X}{N\bar x_s}\,(N\bar y_s)$$
It adjusts the usual "sample mean estimator" in the cases where the x-values in the sample are too small or too large. This is reasonable if there is a positive correlation between x and y.

Example: a university with 4000 students, SRS of 400. Estimate the total number t of women planning a career in teaching; t = Np, where p is the proportion. $y_i = 1$ if student i is a woman planning to be a teacher; t is the y-total.
Results: 84 out of the 240 women in the sample plan to be a teacher.
$$\hat p = 84/400 = 0.21, \qquad \hat t = N\hat p = 840$$
HOWEVER: the university has 2700 women (67.5%), while in the sample we had 60% women. A better estimate, which corrects for the underrepresentation of women, is obtained by the ratio estimator using the auxiliary x = 1 if the student is a woman:
$$\hat t_R = \frac{2700}{4000\cdot 0.6}\,(840) = 945$$

In business surveys it is very common to use a ratio estimator. Ex: $y_i$ = amount spent on health insurance by business i; $x_i$ = number of employees in business i.
We shall now compare the ratio estimator with the sample-mean-based estimator, and need to derive the expectation and variance of the ratio estimator.
First we must define the population covariance:
$$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$$
where $\mu_x,\,\mu_y$ are the population means of the x and y variables, and
$$\sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu_y)^2, \qquad \sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu_x)^2$$
The population correlation coefficient: $\rho = \sigma_{xy}/(\sigma_x\sigma_y)$.

Let $R = \sum_{i=1}^N y_i \big/ \sum_{i=1}^N x_i = t/X$ and $\hat R = N\bar y_s/N\bar x_s = \bar y_s/\bar x_s$.
(I) Bias: $E(\hat t_R - t) = -\mathrm{Cov}(\hat R, N\bar x_s)$.
Proof: Since $\hat t_R = \hat R X$ and $E(N\bar y_s) = t$,
$$\mathrm{Cov}(\hat R, N\bar x_s) = E(\hat R\cdot N\bar x_s) - E(\hat R)\,E(N\bar x_s) = E(N\bar y_s) - E(\hat R)\,X = t - E(\hat t_R)$$
It follows that
$$\frac{|\mathrm{Bias}(\hat t_R)|}{\sqrt{\mathrm{Var}(\hat t_R)}} = \frac{|\mathrm{Cov}(\hat R, N\bar x_s)|}{X\sqrt{\mathrm{Var}(\hat R)}} = |\mathrm{Corr}(\hat R, N\bar x_s)|\cdot CV(\bar x_s) \le CV(\bar x_s)$$
Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x-sample mean is small. This is certainly true for large n.

(II) $E(\hat t_R) \approx t$ for large n.
(III)
$$\mathrm{Var}(\hat t_R) \approx N^2\,\frac{1-f}{n}\,(\sigma_y^2 - 2R\sigma_{xy} + R^2\sigma_x^2) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - Rx_i)^2$$

Note: the ratio estimator is very precise when the population points $(y_i, x_i)$ lie close to a straight line through the origin with slope R. The regression model generates the ratio estimator.

Recalling that
$$\mathrm{Var}(N\bar y_s) = N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^N (y_i - \mu_y)^2$$
we get
$$\mathrm{Var}(\hat t_R) \le \mathrm{Var}(N\bar y_s) \;\Leftrightarrow\; \sum_{i=1}^N (y_i - Rx_i)^2 \le \sum_{i=1}^N (y_i - \mu_y)^2$$
The ratio estimator is more accurate if $Rx_i$ predicts $y_i$ better than $\mu_y$ does.

Estimated variance for the ratio estimator: estimate $\sum_{i=1}^N (y_i - Rx_i)^2/(N-1)$ by $\sum_{i\in s}(y_i - \hat R x_i)^2/(n-1)$, giving
$$\hat V(\hat t_R) = N^2\,\frac{1-f}{n}\Bigl(\frac{\mu_x}{\bar x_s}\Bigr)^2\frac{1}{n-1}\sum_{i\in s}(y_i - \hat R x_i)^2$$
Note: if $\bar x_s$ is very small, then $\hat R$ is more uncertain, and the variance estimate becomes larger to reflect that.

For large n and N − n, approximate normality holds, and an approximate 95% confidence interval is given by
$$\hat t_R \pm 1.96\,\frac{X}{\bar x_s}\sqrt{\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s}(y_i - \hat R x_i)^2}$$

Unequal probability sampling
Inclusion probabilities: $\pi_i = P(i\in s) > 0$ for all i = 1, ..., N.
Example – the Psychiatric Morbidity Survey, where one adult was selected from each sampled household: $\pi_i \propto 1/M_i$, with $M_i$ the number of adults aged 16-64 in the household that individual i belongs to.

Horvitz-Thompson estimator – unequal probability sampling
$\pi_i = P(i\in s) > 0$ for all i = 1, ..., N. Let us first try to use $N\bar y_s$. Let $Z_i = 1$ if $i\in s$, 0 otherwise, so $E(Z_i) = \pi_i$:
$$E(N\bar y_s) = E\Bigl(\frac{N}{n}\sum_{i=1}^N y_i Z_i\Bigr) = \frac{N}{n}\sum_{i=1}^N y_i\pi_i \ne t$$
so it is not unbiased. The bias is large if the inclusion probabilities tend to increase or decrease systematically with $y_i$.

Use weighting to correct for the bias: $\hat t = \sum_{i\in s} w_i y_i$, where $w_i$ does not depend on s. Then
$$E(\hat t) = E\Bigl(\sum_{i=1}^N w_i y_i Z_i\Bigr) = \sum_{i=1}^N w_i\pi_i y_i$$
and $\hat t$ is unbiased for all possible values $y_i$ if and only if $w_i = 1/\pi_i$:
$$\hat t_{HT} = \sum_{i\in s}\frac{y_i}{\pi_i}$$
In SRS, $\pi_i = n/N$ and $\hat t_{HT} = N\bar y_s$.

a)
$$\mathrm{Var}(\hat t_{HT}) = \sum_{i=1}^N \frac{1-\pi_i}{\pi_i}\,y_i^2 + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^N \frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\,y_i y_j$$
b) If |s| = n, then
$$\mathrm{Var}(\hat t_{HT}) = \sum_{i=1}^{N-1}\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\Bigl(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Bigr)^2$$
where $\pi_{ij} = P(i, j\in s) = P(Z_i Z_j = 1)$.
The Horvitz-Thompson estimator is widely used, e.g. in official statistics.

Note that the variance is small if we determine the inclusion probabilities such that the $y_i/\pi_i$ are approximately equal, i.e. $\pi_i$ increases with increasing $y_i$. Of course, we do not know the value of $y_i$ when planning the survey; use a known auxiliary $x_i$ and choose $\pi_i \propto x_i$, i.e. $\pi_i = nx_i/X$, since $\sum_{i=1}^N \pi_i = n$.

If $y_i$ and $\pi_i$ are unrelated, or negatively "correlated", $\mathrm{Var}(\hat t_{HT})$ can be enormous, and one should not use the H-T estimator even though the $\pi_i$'s are unequal.
Example: a population of 3 elephants, to be shipped. We need an estimate of the total weight.
• Weighing an elephant is no simple matter. The owner wants to estimate the total weight by weighing just one elephant.
• He knows from earlier that elephant 2 has a weight $y_2$ close to the average weight, so he wants to use this elephant and take $3y_2$ as the estimate.
• However: to get an unbiased estimator, all inclusion probabilities must be positive.
• Sampling design: |s| = 1 with $\pi_2 = 0.90$ and $\pi_1 = \pi_3 = 0.05$.
• The weights: 1, 2, 4 tons; total = 7 tons.
• H-T estimator: $\hat t_{HT} = y_i/\pi_i$ if s = {i}:
20 if s = {1}; 2.22 if s = {2}; 80 if s = {3}.
Hopeless! Always far from the true total of 7. It cannot be used, even though $E(\hat t_{HT}) = 7 = t$.
Problem:
$$\mathrm{Var}(\hat t_{HT}) = (20-7)^2\cdot 0.05 + (2.22-7)^2\cdot 0.90 + (80-7)^2\cdot 0.05 = 295.46$$
True $SE(\hat t_{HT}) = \sqrt{295.46} = 17.2$ (!)
The planned estimator, even though not based on an SRS: $\hat t_{eleph} = 3\bar y_s = 3y_i$ if s = {i}; possible values: 3, 6, 12.
$E(\hat t_{eleph}) = 6.15$: not unbiased, but look at
$$SE(\hat t_{eleph}) = \sqrt{2.2275} = 1.49$$
$$MSE(\hat t_{eleph}) = E(\hat t_{eleph} - t)^2 = \mathrm{Bias}^2 + \mathrm{Var}(\hat t_{eleph}) = 2.95, \qquad \sqrt{MSE(\hat t_{eleph})} = 1.72$$
$\hat t_{eleph}$ is clearly preferable to $\hat t_{HT}$.

Variance estimate for the H-T estimator. Assume the size of the sample is determined in advance to be n. An unbiased estimator of $\mathrm{Var}(\hat t_{HT})$, provided all joint inclusion probabilities $\pi_{ij} > 0$:
$$\hat V(\hat t_{HT}) = \sum_{i\in s}\sum_{j\in s,\,j>i}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\Bigl(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Bigr)^2$$
Approximate 95% CI for large n and N − n: $\hat t_{HT} \pm 1.96\sqrt{\hat V(\hat t_{HT})}$.
• We can always compute this variance estimate, since necessarily $\pi_{ij} > 0$ for all i, j in the sample s.
• But if not all $\pi_{ij} > 0$, we should not use this estimate!
It can give very incorrect estimates.
• The variance estimate can be negative, but for most sampling designs it is always positive.

A modified H-T estimator
Consider first estimating the population mean $\mu_y = t/N$. An obvious choice: $\hat\mu_{HT} = \hat t_{HT}/N$. Alternative: estimate N as well, whether N is known or not:
$$\hat N = \sum_{i\in s}\frac{1}{\pi_i} \quad (\text{take } y_i \equiv 1 \text{ in } \hat t_{HT})$$
$$E(\hat N) = E\Bigl(\sum_{i=1}^N \frac{1}{\pi_i}Z_i\Bigr) = \sum_{i=1}^N \frac{1}{\pi_i}\pi_i = N$$
For SRS, $\pi_i = n/N$ and $\hat N = \sum_{i\in s} N/n = N$.

$$\hat\mu_w = \hat t_{HT}/\hat N = \frac{\sum_{i\in s} y_i/\pi_i}{\sum_{i\in s} 1/\pi_i}, \qquad \hat t_w = N\hat\mu_w$$
Interestingly, $\hat t_w$ is often better than $\hat t_{HT}$, even though it is only approximately unbiased. It usually has smaller variance. So $\hat t_w$ is ordinarily the estimator to use, whether N is known or not. We note that it is a ratio estimator.
Illustration: $y_i = c$ for all i = 1, ..., N. Then $\hat t_{HT} = c\sum_{i\in s} 1/\pi_i = c\hat N$, while $\hat t_w = Nc = t$, a better estimate if $\mathrm{Var}(\hat N) > 0$.

If the sample size varies, then this "ratio" estimator performs better than the H-T estimator: the ratio is more stable than the numerator.
Example: $y_i = c$ for i = 1, ..., N. Sampling design – Bernoulli sampling: each unit in the population is selected with probability $\theta$, independently of the others. The $Z_i$'s are i.i.d. with $\pi_i = P(Z_i = 1) = \theta$, and n is a stochastic variable with a binomial (N, θ) distribution, $E(n) = N\theta$.

Then
$$\hat t_{HT} = \frac{nc}{\theta} \quad \Bigl(E(\hat t_{HT}) = \frac{N\theta\,c}{\theta} = Nc = t\Bigr), \qquad \hat t_w = N\,\frac{nc/\theta}{n/\theta} = Nc = t$$
The H-T estimator varies because n varies, while the modified H-T estimator is perfectly stable.

Review of advantages of probability sampling
• Objective basis for inference
• Permits unbiased or approximately unbiased estimation
• Permits estimation of the sampling errors of estimators
– use the central limit theorem for confidence intervals
– can choose n to reduce the SE or CV of an estimator

Outstanding issues in design-based inference
• Estimation for subpopulations, domains
• Choice of sampling design
– discuss several different sampling designs
– appropriate estimators
• More on the use of auxiliary information to improve estimates
• More on variance estimation

Estimation for domains
• Domain (subpopulation): a subset of the population of interest
• Ex: population = all adults aged 16-64. Examples of domains:
– women
– adults aged 35-39
– men aged 25-29
– women of a certain ethnic group
– adults living in a certain city
• Partition the population U into D disjoint domains $U_1, \ldots, U_d, \ldots, U_D$ of sizes $N_1, \ldots, N_d, \ldots, N_D$.

Estimating domain means
Take a simple random sample from the population. The true domain mean:
$$\mu_d = \sum_{i\in U_d} y_i\big/N_d$$
• e.g., the proportion of divorced women with psychiatric problems.
Estimate d by the sample mean from U d : sd the part of the sample s in U d ysd is yi / nd d nd | sd | Note: nd is a random variable 80 The estimator is a ratio estimator: Define yi if i U d ui 0 otherwise 1 if i U d xi 0 otherwise d i 1 ui / i 1 xi R N N y sd is ui / is xi u s / xs Rˆ 81 ysd is approximat ely unbiased for large n 2 1 Nd / N 2 1 f 1 2 ˆ N V ( ysd ) 2 ( u y x ) i sd i is N d nd / n n n 1 Let sd2 be the sample variance for the domain, 1 2 s ( y y ) i sd nd 1 isd 2 d 2 2 s n 1 f Vˆ ( ysd ) 2 (nd 1) sd2 (1 f ) d nd n(n 1) nd SE( ysd ) (1 f )sd2 / nd , f n / N 82 For large samples f d nd / N d f • Can then treat sd as a SRS from Ud • Whatever size of n is, conditional on nd, sd is a SRS from Ud – conditional inference Example: Psychiatric Morbidity Survey 1993 Proportions with psychiatric problems y sd SE ( y sd ) Domain d nd women 4933 0.18 .18 0.82 / 4932 0.005 Divorced women 314 0.29 0.71 / 313 0.026 0.29 83 Estimating domain totals • Nd is known: Use N d y sd • Nd unknown, must be estimated Since N d is the x - total : Nˆ d Nxs N nd / n ˆt d Nˆ d ys N 1 ui Nu s d n is 2 ˆ SE (td ) N (1 f )su / n 84 Stratified sampling • Basic idea: Partition the population U into H subpopulations, called strata. • Nh = size of stratum h, known • Draw a separate sample from each stratum, sh of size nh from stratum h, independently between the strata • In social surveys: Stratify by geographic regions, age groups, gender • Ex –business survey. Canadian survey of employment. Establishments stratified by o Standard Industrial Classification – 16 industry divisions o Size – number of employees, 4 groups, 0-19, 20-49, 50199, 200+ o Province – 12 provinces Total number of strata: 16x4x12=768 85 Reasons for stratification 1. Strata form domains of interest for which separate estimates of given precision is required, e.g. strata = geographical regions 2. To “spread” the sample over the whole population. Easier to get a representative sample 3. 
To get more accurate estimates of population totals, reduce sampling variance 4. Can use different modes of data collection in different strata, e.g. telephone versus home interviews 86 Stratified simple random sampling • The most common stratified sampling design • SRS from each stratum • Notation: From stratum h : sample sh of size nh Total sample size : n h 1 nh H Values from stratum h : yhi , i 1,..., N h Sample : ( yhi : i sh ) Sample mean : yh is yhi / nh h 87 th = y-total for stratum h: th Nh i 1 yhi The population total : t h1 th H Consider estimation of th: tˆh N h yh Assuming no auxiliary information in addition to the “stratifying variables” The stratified estimator of t: tˆst h1 tˆh h1 N h yh H H 88 To estimate the population mean t / N : H Nh ˆ Stratified mean : yst t st / N h 1 yh N A weighted average of the sample stratum means. •Properties of the stratified estimator follows from properties of SRS estimators. •Notation: Mean in stratum h : h i 1 yhi / N h Nh 1 Nh 2 Variance in stratum h : ( y ) hi h i 1 Nh 1 2 h 89 E( t̂ st ) t , t̂ st is unbiased 2 2 h Var( t̂ st ) hH1Var( t̂ h ) hH1 N h nh (1 fh ) Estimated variance is obtained by estimating the stratum variance with the stratum sample variance sh2 1 2 ( y y ) hi h nh 1 ish 2 s Vˆ (tˆst ) h 1 N h2 h (1 f h ) nh H Approximate 95% confidence interval if n and N-n are large: tˆst 1.96 Vˆ (t st ) 90 Estimating population proportion in stratified simple random sampling ph : proportion in stratum h with a certain characteristic A pˆ h yh where yhi 1 if unit i in stratum h has characteri stic A p is the population mean: p = t/N Stratum mean estimator: H h 1 N h ph / N pˆ st yst h1 ( N h / N ) pˆ h H Stratified estimator of the total t = number of units in the with characteristic A: ˆt st Npˆ st H N h pˆ h h 1 91 Estimated variance: p̂h ( 1 p̂h ) nh V̂ ( p̂h ) (1 ) (slide 31) nh 1 Nh nh pˆ h (1 pˆ h ) H H 2 ˆ ˆ V ( pˆ st ) h 1V (Wh pˆ h ) h 1Wh (1 ) Nh nh 1 where Wh N h /N and ˆ h (1 pˆ h ) n p H H 2 2 h 
ˆ ˆ V (tˆst ) h 1V ( N Wh pˆ h ) N h 1Wh (1 ) Nh nh 1 92 Allocation of the sample units • Important to determine the sizes of the stratum samples, given the total sample size n and given the strata partitioning – how to allocate the sample units to the different strata • Proportional allocation – A representative sample should mirror the population – Strata proportions: Wh=Nh/N – Strata sample proportions should be the same: nh/n = Wh – Proportional allocation: Nh nh n nh n for all h N Nh N 93 The stratified estimator under proportional allocation Inclusion probabilit ies : hi nh / N h n / N the same for all units in the population , but it is not a SRS 1 H H ˆ t st h 1 N h yh h 1 N h nh N n H h 1 ish ish yhi yhi Nys The stratified mean : yst tˆst / N ys The equally weighted sample mean ( sample is selfweighting: Every unit in the sample represents the same number of units in the population , N/n) 94 Variance and estimated variance under proportional allocation 2 2 h Var (tˆst ) h 1 N h H 1 f N n 2 nh (1 f h ) 2 W h1 h h , H 2 1 f ˆ ˆ V (t st ) N n f n / N , Wh N h / N 2 W s h1 h h H 95 • The estimator in simple random sample: tˆSRS Ny s • Under proportional allocation: tˆst tˆSRS • but the variances are different: 2 1 f ˆ Under SRS : VarSRS (t SRS ) N 2 n 2 1 f ˆ Under proportion al allocation : Var (t st ) N n 2 W h1 h h H 96 Nh 1 Nh Nh Using the approximat ions and 1: N 1 N Nh 1 2 h 1Wh h2 h 1Wh ( h ) 2 H H Total variance = variance within strata + variance between strata Implications: 1. No matter what the stratification scheme is: Proportional allocation gives more accurate estimates of population total than SRS 2. Choose strata with little variability, smaller strata variances. Then the strata means will vary more and between variance becomes larger and precision of estimates increases compared to SRS 3. 
This is also essentially true in general, as seen from
$$\hat{V}(\hat{t}_{st}) = N^2 \sum_{h=1}^{H} W_h^2 \frac{s_h^2}{n_h} (1 - f_h)$$
97
Optimal allocation
If the only concern is to estimate the population total t:
• Choose $n_h$ such that the variance of the stratified estimator is minimized
• The solution depends on the unknown stratum variances
• If the stratum variances are approximately equal, proportional allocation minimizes the variance of the stratified estimator
98
Optimal allocation:
$$n_h = n \frac{N_h \sigma_h}{\sum_{k=1}^{H} N_k \sigma_k}$$
Proof: Minimize $\mathrm{Var}(\hat{t}_{st})$ with respect to the sample sizes $n_h$, subject to $n = \sum_{h=1}^{H} n_h$ being fixed.
Use the Lagrange multiplier method: minimize
$$Q = \sum_{h=1}^{H} N_h^2 \sigma_h^2 \Big(\frac{1}{n_h} - \frac{1}{N_h}\Big) + \lambda \Big(\sum_{h=1}^{H} n_h - n\Big)$$
$$\frac{\partial Q}{\partial n_h} = -\frac{N_h^2 \sigma_h^2}{n_h^2} + \lambda = 0 \;\Rightarrow\; n_h = N_h \sigma_h / \sqrt{\lambda}$$
The result follows since the sample sizes must add up to n.
99
• Called Neyman allocation (Neyman, 1934)
• Should sample heavily in a stratum if
– The stratum accounts for a large part of the population
– The stratum variance is large
• If the stratum variances are equal, this is proportional allocation
• Problem, of course: the stratum variances are unknown
– Take a small preliminary sample (pilot)
– The variance of the stratified estimator is not very sensitive to deviations from the optimal allocation.
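The Neyman formula $n_h = n\, N_h \sigma_h / \sum_k N_k \sigma_k$ can be sketched directly (hypothetical helper; it returns real-valued allocations, which would be rounded in practice):

```python
def neyman_allocation(n, N_sigma):
    """Neyman (optimal) allocation: n_h proportional to N_h * sigma_h.

    N_sigma is a list of (N_h, sigma_h) pairs; n is the total sample size.
    """
    denom = sum(N_h * s_h for N_h, s_h in N_sigma)
    return [n * N_h * s_h / denom for N_h, s_h in N_sigma]
```

With equal stratum standard deviations, the result reduces to proportional allocation, as stated on slide 99.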
Need just crude approximations of the stratum variances
100
Optimal allocation when considering the cost of a survey
• C represents the total cost of the survey, fixed – our budget
• $c_0$: overhead cost, like maintaining an office
• $c_h$: cost of taking an observation in stratum h
– Home interviews: traveling cost + interview
– Telephone or postal surveys: $c_h$ is the same for all strata
– In some strata: telephone, in others home interviews
$$C = c_0 + \sum_{h=1}^{H} n_h c_h$$
• Minimize the variance of the stratified estimator for a given total cost C
101
Minimize $\mathrm{Var}(\hat{t}_{st}) = \sum_{h=1}^{H} N_h^2 \sigma_h^2 \big(\frac{1}{n_h} - \frac{1}{N_h}\big)$ subject to $c_0 + \sum_{h=1}^{H} n_h c_h = C$.
Solution:
$$n_h \propto W_h \sigma_h / \sqrt{c_h}\,, \qquad n_h = (C - c_0) \frac{W_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{H} W_k \sigma_k \sqrt{c_k}}$$
Hence, for a fixed total cost C:
$$n = \sum_{h=1}^{H} n_h = (C - c_0) \frac{\sum_{h=1}^{H} N_h \sigma_h / \sqrt{c_h}}{\sum_{h=1}^{H} N_h \sigma_h \sqrt{c_h}}$$
102
In particular, if $c_h = c$ for all h: $n = (C - c_0)/c$.
We can express the optimal sample sizes in relation to n:
$$n_h = n \frac{W_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{H} W_k \sigma_k / \sqrt{c_k}}$$
1. Large samples in inexpensive strata
2. If the $c_h$'s are equal: Neyman allocation
3. If the $c_h$'s are equal and the $\sigma_h$'s are equal: proportional allocation
103
Other issues with optimal allocation
• Many survey variables
• Each variable leads to a different optimal solution
– Choose one or two key variables
– Use proportional allocation as a compromise
• If $n_h > N_h$, let $n_h = N_h$ and use optimal allocation for the remaining strata
• If $n_h = 1$, the variance cannot be estimated. Force $n_h = 2$, or collapse strata for variance estimation
• Number of strata: for a given n it is often best to increase the number of strata as much as possible.
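The cost-constrained solution $n_h = (C - c_0)\,(W_h \sigma_h/\sqrt{c_h})\,/\,\sum_k W_k \sigma_k \sqrt{c_k}$ can be sketched as below (hypothetical helper name); by construction the allocation exhausts the budget, $c_0 + \sum_h n_h c_h = C$:

```python
import math

def cost_optimal_allocation(C, c0, strata):
    """Variance-minimizing allocation for total budget C with overhead c0.

    `strata` is a list of (W_h, sigma_h, c_h) triples, W_h = N_h / N.
    n_h is proportional to W_h * sigma_h / sqrt(c_h).
    """
    denom = sum(W * s * math.sqrt(c) for W, s, c in strata)
    return [(C - c0) * W * s / math.sqrt(c) / denom for W, s, c in strata]
```

When all $c_h$ are equal, this reduces to Neyman allocation, as in point 2 above.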
Depends on the available information
104
• Sometimes the main interest is in the precision of the estimates for stratum totals, with less interest in the precision of the estimate for the population total
• Need to decide $n_h$ to achieve the desired accuracy for the estimate of $t_h$, as discussed earlier
– If we decide to do proportional allocation, this can mean that in small strata (small $N_h$) the sample size $n_h$ must be increased
105
Poststratification
• Stratification reduces the uncertainty of the estimator compared to SRS
• In many cases one wants to stratify according to variables that are not known or not used in the sampling
• Can then stratify after the data have been collected
• Hence the term poststratification
• The estimator is then the usual stratified estimator according to the poststratification
• If we take an SRS and N − n and n are large, the estimator behaves like the stratified estimator with proportional allocation
106
Poststratification to reduce nonresponse bias
• Poststratification is mostly used to correct for nonresponse
• Choose strata with different response rates
• Poststratification amounts to assuming that the response sample in poststratum h is representative for the nonresponse group in the sample from poststratum h
107
Systematic sampling
• Idea: order the population and select every kth unit
• Procedure: U = {1,…,N} and N = nk + c, c < n
1. Select a random integer r between 1 and k, with equal probability
2. Select the sample $s_r$ by the systematic rule
$$s_r = \{i : i = r + (j - 1)k,\ j = 1, \ldots, n_r\}$$
where the actual sample size $n_r$ takes the value [N/k] or [N/k] + 1
k: sampling interval = [N/n]
• Very easy to implement: visit every 10th house, or interview every 50th name in the telephone book
108
• k distinct samples, each selected with probability 1/k:
$$p(s) = \begin{cases} 1/k & \text{if } s = s_r,\ r = 1, \ldots, k \\ 0 & \text{otherwise} \end{cases}$$
• Unlike in SRS, many subsets of U have zero probability
Examples:
1) N = 20, n = 4. Then k = 5 and c = 0. Suppose we select r = 1. Then the sample is {1,6,11,16}
5 possible distinct samples.
In SRS: 4845 distinct samples
2) N = 149, n = 12. Then k = 12, c = 5. Suppose r = 3. Then
$s_3$ = {3,15,27,39,51,63,75,87,99,111,123,135,147} and the sample size is 13
3) N = 20, n = 8. Then k = 2 and c = 4. The sample size is $n_r$ = 10
4) N = 100 000, n = 1500. Then k = 66, c = 1000 and c/k = 15.15 with [c/k] = 15. $n_r$ = 1515 or 1516
109
Estimation of the population total
$t(s) = \sum_{i \in s} y_i$, $n(s)$ = sample size
Two estimators (equal when N = nk):
1) $\hat{t}(s) = k\, t(s) = [N/n]\, t(s)$
2) $\hat{t}(s) = N \bar{y}_s = N\, t(s)/n(s)$
These estimators are approximately the same, since $n(s) = [N/k]$ or $[N/k] + 1$ and $k \approx N/n$, so $N/n(s) \approx k$
110
$\hat{t}$ is unbiased:
$$E(\hat{t}) = \sum_{r=1}^{k} \hat{t}(s_r)\, p(s_r) = \frac{1}{k} \sum_{r=1}^{k} k\, t(s_r) = \sum_{r=1}^{k} t(s_r) = t$$
The second estimator, $N \bar{y}_s$, is only approximately unbiased (it is a ratio estimator) – it usually has slightly smaller variance than $\hat{t}$
• Advantage of systematic sampling: it can be implemented even where no population frame exists
• E.g. sample every 10th person admitted to a hospital, every 100th tourist arriving at LA airport.
111
$$\mathrm{Var}(\hat{t}) = E(\hat{t} - t)^2 = \sum_{r=1}^{k} (\hat{t}(s_r) - t)^2 p(s_r) = \frac{1}{k} \sum_{r=1}^{k} (k\, t(s_r) - t)^2 = k \sum_{r=1}^{k} (t(s_r) - \bar{t})^2$$
where $\bar{t} = \sum_{r=1}^{k} t(s_r)/k$ is the average of the sample totals
• The variance is small if $t(s_r)$ varies little, i.e., if the “strata” {1,…,k}, {k+1,…,2k}, etc. are very homogeneous
• Or, equivalently, if the values within the possible samples $s_r$ are very different; the samples are heterogeneous
• Problem: the variance cannot be estimated properly, because we have only one observation of $t(s_r)$
112
Systematic sampling as implicit stratification
In practice: very often when using systematic sampling (a common design in national statistical institutes), the population is ordered such that the first k units constitute a homogeneous “stratum”, the second k units another “stratum”, etc.
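The systematic rule $s_r = \{r, r+k, r+2k, \ldots\}$ from slide 108 is a one-liner; the sketch below (hypothetical helper name) reproduces examples 1 and 2 above:

```python
def systematic_sample(N, k, r):
    """Select every k-th unit of U = {1,...,N}, starting at r in {1,...,k}."""
    return list(range(r, N + 1, k))
```

In practice r is drawn at random, e.g. `r = random.randint(1, k)`.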
Implicit strata (n = N/k assumed):
stratum 1: units 1, 2, …, k
stratum 2: units k+1, …, 2k
…
stratum n: units (n−1)k+1, …, nk
Systematic sampling selects 1 unit from each stratum at random
113
Systematic sampling vs SRS
• Systematic sampling is more efficient if the study variable is homogeneous within the implicit strata
– Ex: households ordered according to house numbers within neighbourhoods, with a study variable related to income
• Households in the same neighbourhood are usually homogeneous with respect to socio-economic variables
• If the population is in random order (all N! permutations equally likely): systematic sampling is similar to SRS
• Systematic sampling can be very bad if y has periodic variation relative to k:
– Approximately: $y_1 = y_{k+1}$, $y_2 = y_{k+2}$, etc.
114
Variance estimation
• No direct estimate; impossible to obtain an unbiased estimate
• If the population is in random order: can use the variance estimate from SRS as an approximation
• Develop a conservative variance estimator by collapsing the “implicit strata”; this overestimates the variance
• The most promising approach may be: under a statistical model, estimate the expected value of the design variance
• Typically, systematic sampling is used in the second stage of two-stage sampling (to be discussed later); it may then not be necessary to estimate this variance
115
Cluster sampling and multistage sampling
• Sampling designs so far: direct sampling of the units in a single stage of sampling
• For economical and practical reasons it may be necessary to modify these sampling designs:
– There exists no population frame (register: list of all units in the population), and it is impossible or very costly to produce such a register
– The population units are scattered over a wide area, and a direct sample will also be widely scattered. In the case of personal interviews, the traveling costs would be very high and it would not be possible to visit the whole sample
116
• Modified sampling can be done by
1.
Selecting the sample indirectly in groups, called clusters: cluster sampling
– The population is grouped into clusters
– The sample is obtained by selecting a sample of clusters and observing all units within the clusters
– Ex: in labor force surveys: clusters = households, units = persons
2. Selecting the sample in several stages: multistage sampling
117
3. In two-stage sampling:
• The population is grouped into primary sampling units (PSUs)
• Stage 1: a sample of PSUs
• Stage 2: for each PSU in the sample at stage 1, we take a sample of population units, now also called secondary sampling units (SSUs)
• Ex: PSUs are often geographical regions
118
Examples
1. Cluster sampling. We want a sample of high school students in a certain area, to investigate smoking and alcohol use. If a list of high school classes is available, we can select a sample of high school classes and give the questionnaire to every student in the selected classes: cluster sampling, with the high school class being the cluster.
2. Two-stage cluster sampling. If a list of classes is not available, we can first select high schools, then classes, and finally all students in the selected classes. Then we have a 2-stage cluster sample:
1. PSU = high school
2. SSU = classes
3.
Units = students
119
The Psychiatric Morbidity Survey is a 4-stage sample
– Population: adults aged 16–64 living in private households in Great Britain
– PSUs = postal sectors
– SSUs = addresses
– 3SUs = households
– Units = individuals
Sampling process:
1) 200 PSUs selected
2) 90 SSUs selected within each sampled PSU (interviewer workload)
3) All households selected per SSU
4) 1 adult selected per household
120
Cluster sampling
Number of clusters in the population: N
Number of units in cluster i: $M_i$
Population size: $M = \sum_{i=1}^{N} M_i$
$s_I$ = sample of clusters, $n = |s_I|$
Final sample of units: s = all units in the clusters in $s_I$
Size of the final sample: $m = \sum_{i \in s_I} M_i$ – not fixed in advance
$t_i$ = y-total in cluster i, $t = \sum_{i=1}^{N} t_i$
Population mean for the y-variable: $\bar{y}_U = t/M$
121
Simple random cluster sampling
Ratio-to-size estimator
Use auxiliary information: the sizes of the sampled clusters
$$\hat{t}_R = M \frac{\sum_{i \in s_I} t_i}{\sum_{i \in s_I} M_i}$$
Approximately unbiased, with approximate variance
$$\mathrm{Var}(\hat{t}_R) \approx N^2 \frac{1 - f}{n} \cdot \frac{1}{N - 1} \sum_{i=1}^{N} M_i^2 (\bar{y}_i - \bar{y}_U)^2$$
where $\bar{y}_i = t_i / M_i$, the cluster mean, and $\bar{y}_U = t/M$
122
estimated by
$$\hat{V}(\hat{t}_R) = N^2 \frac{1 - f}{n} \cdot \frac{1}{n - 1} \sum_{i \in s_I} M_i^2 (\bar{y}_i - \bar{y}_s)^2, \qquad f = n/N$$
where $\bar{y}_s = \sum_{i \in s_I} t_i / \sum_{i \in s_I} M_i$ is the usual sample mean
(if needed, the unknown average cluster size M/N is estimated by m/n)
Note that this ratio estimator is in fact the usual sample-mean-based estimator with respect to the y-variable: $\hat{t}_R = M \bar{y}_s$
And the corresponding estimator of the population mean of y is $\bar{y}_s$ – which can be used also if M is unknown
123
• The estimator’s variance is highly influenced by how the clusters are constructed.
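The ratio-to-size estimator and the variance estimator in the form reconstructed above can be sketched as follows (illustrative helper name; inputs are the population size M, the number of clusters N, and the sampled cluster sizes and totals):

```python
def cluster_ratio_estimate(M, N, clusters):
    """Ratio-to-size estimator under simple random cluster sampling.

    `clusters` is a list of (M_i, t_i) pairs for the n sampled clusters.
    Returns (t_hat_R, v_hat).
    """
    n = len(clusters)
    sum_t = sum(t for _, t in clusters)
    sum_M = sum(Mi for Mi, _ in clusters)
    ybar_s = sum_t / sum_M                  # overall sample mean of y
    t_hat_R = M * ybar_s                    # = M * (sum t_i / sum M_i)
    f = n / N
    v_hat = (N ** 2 * (1 - f) / n
             * sum(Mi ** 2 * (t / Mi - ybar_s) ** 2 for Mi, t in clusters)
             / (n - 1))
    return t_hat_R, v_hat
```

When all sampled cluster means $\bar{y}_i$ are equal, the estimated variance is zero, reflecting that the variance is driven by the between-cluster variation in the $\bar{y}_i$.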
Choose clusters that make $\sum_i M_i^2 (\bar{y}_i - \bar{y}_U)^2$ small:
make the clusters heterogeneous, such that most of the variation in the y-values lies within the clusters, making the cluster means $\bar{y}_i$ similar
• Note: the opposite of stratified sampling
• Typically, clusters are formed by “nearby units” like households, schools, hospitals, for economical and practical reasons, with little variation within the clusters: simple random cluster sampling will then lead to much less precise estimates compared to SRS, but this is offset by big cost reductions
• Sometimes SRS is not possible: information is only known for clusters
124
Design effects
A design effect (deff) compares the efficiency of two design–estimation strategies (sampling design and estimator) for the same sample size
Now compare
Strategy 1: simple random cluster sampling with the ratio estimator
Strategy 2: SRS of the same sample size m, with the usual sample mean estimator
In terms of estimating the population mean:
Strategy 1 estimator: $\hat{t}_R / M = \bar{y}_s$
Strategy 2 estimator: $\bar{y}_s$
125
The design effect of simple random cluster sampling (SCS) is then
$$\mathrm{deff}(SCS, \bar{y}_s) = \mathrm{Var}_{SCS}(\bar{y}_s) / \mathrm{Var}_{SRS}(\bar{y}_s)$$
Estimated deff: $\hat{V}_{SCS}(\bar{y}_s) / \hat{V}_{SRS}(\bar{y}_s)$
In the probation example:
$$\hat{V}_{SRS}(\hat{p}) = [\hat{p}(1 - \hat{p})/(m - 1)](1 - f) \approx \hat{p}(1 - \hat{p})/(m - 1) = 0.00387^2$$
Estimated deff $= 0.0302^2 / 0.00387^2 = 60.9$
Conclusion: cluster sampling is much less efficient
Note: we can estimate the finite population correction factor $1 - m/M$ by letting $\hat{M} = N(m/n)$, so that $1 - m/\hat{M} = 1 - n/N = 16/26 = 0.615$, and
est. deff $= 60.9 / 0.615 = 99$!
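The arithmetic in the probation example above can be checked directly (the two standard errors are taken from the slide; everything else follows):

```python
se_scs = 0.0302    # estimated SE of p-hat under simple random cluster sampling
se_srs = 0.00387   # estimated SE of p-hat under SRS of the same size (no fpc)

deff = se_scs ** 2 / se_srs ** 2   # estimated design effect, about 60.9
fpc = 16 / 26                      # 1 - n/N = 0.615, the estimated fpc factor
deff_fpc = deff / fpc              # about 99 after accounting for the fpc
```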
126
Two-stage sampling
• Basic justification: with homogeneous clusters and a given budget, it is inefficient to survey all units in a cluster – we can instead select more clusters
• The population is partitioned into N primary sampling units (PSUs)
• Stage 1: select a sample $s_I$ of PSUs
• Stage 2: for each selected PSU i in $s_I$, select a sample $s_i$ of units (secondary sampling units, SSUs)
• The cluster totals $t_i$ must be estimated from the sample
127
$n = |s_I|$ = size of the stage-1 sample of PSUs
$m_i = |s_i|$; total sample size: $m = \sum_{i \in s_I} m_i = |s|$
General two-stage sampling plan:
$$\pi_{Ii} = P(\text{PSU } i \in s_I), \qquad \pi_{j|i} = P(\text{SSU } j \in s_i \mid i \in s_I)$$
Inclusion probability for unit (SSU) j in cluster i: $\pi_{ij} = \pi_{Ii}\, \pi_{j|i}$
128
Horvitz–Thompson estimator of the total $t_i$, $i \in s_I$:
$$\hat{t}_{i,HT} = \sum_{j \in s_i} \frac{y_{ij}}{\pi_{j|i}}$$
where $y_{ij}$ = value of y for unit j in cluster i
Suggested estimator for the population total t:
$$\hat{t} = \sum_{i \in s_I} \frac{\hat{t}_{i,HT}}{\pi_{Ii}} = \sum_{i \in s_I} \frac{1}{\pi_{Ii}} \sum_{j \in s_i} \frac{y_{ij}}{\pi_{j|i}} = \sum_{i \in s_I} \sum_{j \in s_i} \frac{y_{ij}}{\pi_{ij}}$$
An unbiased estimator
129
$$\mathrm{Var}(\hat{t}) = \mathrm{Var}\Big(\sum_{i \in s_I} \frac{t_i}{\pi_{Ii}}\Big) + \text{stage-2 component}$$
1. The first component expresses the sampling uncertainty at stage 1, since we are selecting a sample of PSUs. It is the variance of the HT estimator with the $t_i$ as observations
2. The second component is the stage-2 variance and tells us how well we are able to estimate each $t_i$ in the whole population
3. The second component is often negligible, because of little variability within the clusters
130
A special case: clusters of equal size, with SRS at stage 1 and stage 2
$m_i = m_0$ – equal sample sizes at stage 2; $M_i = M_0$, $i = 1, \ldots, N$
$$\pi_{Ii} = n/N, \qquad \pi_{j|i} = m_0/M_0, \qquad \pi_{ij} = \frac{n}{N} \cdot \frac{m_0}{M_0} = \frac{m}{M}$$
$$\hat{t} = \frac{M}{m} \sum_{i \in s_I} \sum_{j \in s_i} y_{ij} = M \bar{y}_s$$
Self-weighting sample: equal inclusion probabilities for all units in the population
131
Unequal cluster sizes.
PPS–SRS sampling
• In social surveys there are good reasons to have equal inclusion probabilities (a self-weighting sample) for all units in the population (similar representation of all domains)
• Stage 1: select PSUs with probability proportional to size $M_i$
• Stage 2: SRS (or systematic sample) of SSUs
• Such that the sample is self-weighting:
$$\pi_{Ii} = n \frac{M_i}{M} \quad \text{and} \quad \pi_{j|i} = m_i/M_i, \quad \text{so that} \quad \pi_{ij} = \pi_{Ii}\, \pi_{j|i} = m/M$$
with $m_i = m/n$ – equal sample sizes in all selected PSUs – and $\hat{t} = M \bar{y}_s$
132
Remarks
• Usually one interviewer for each selected PSU
• First-stage sampling is often stratified PPS
• With self-weighting PPS–SRS:
– equal workload for each interviewer
– the total sample size m is fixed
133
II. Likelihood in statistical inference and survey sampling
• Problems with design-based inference
• Likelihood principle, conditionality principle and sufficiency principle
• Fundamental equivalence
• Likelihood and the likelihood principle in survey sampling
134
Traditional approach: design-based inference
• Population (target population): the universe of all units of interest for a certain study: U = {1, 2, …, N}
– All units can be identified and labeled
– Variable of interest y with population values $\mathbf{y} = (y_1, y_2, \ldots, y_N)$
– Typical problem: estimate the total t or the population mean t/N
• Sample: a subset s of the population, to be observed
• Sampling design p(s), known for all possible subsets:
– the probability distribution of the stochastic sample
135
Problems with design-based inference
• Generally: design-based inference is with respect to hypothetical replications of sampling, for a fixed population vector $\mathbf{y}$
• Variance estimates may fail to reflect the information in a given sample
• If we want to measure how a certain estimation method does in quarterly or monthly surveys, then $\mathbf{y}$ will vary from quarter to quarter or month to month – we need to assume that $\mathbf{y}$ is a realization of a random vector
• Use: the likelihood function and the likelihood principle as guidelines on how to deal with these issues
136
Problem with design-based variance measure
Illustration 1
a) N + 1 possible samples: {1}, {2}, …, {N}, {1,2,…,N}
b) Sampling design: p({i}) = 1/2N for i = 1, …, N; p({1,2,…,N}) = 1/2
c) Use $\bar{y}_s$ as the estimator for the population mean $\mu = \sum_{i=1}^{N} y_i / N$
Unbiased: $E(\bar{y}_s) = \sum_s p(s)\, \bar{y}_s = \sum_{i=1}^{N} \frac{1}{2N} y_i + \frac{1}{2} \mu = \mu$
Design variance:
$$\mathrm{Var}(\bar{y}_s) = E(\bar{y}_s - \mu)^2 = \sum_{i=1}^{N} \frac{1}{2N} (y_i - \mu)^2 = \frac{1}{2} \tilde{\sigma}^2, \qquad \tilde{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$$
d) Assume we select the “sample” {1,2,…,N}. Then we claim that the “precision” of the resulting estimate (known to be without error) is $\tilde{\sigma}^2/2$
137
Problem with design-based variance measure
Illustration 2
a) Expert 1: SRS and estimate $\bar{y}_s$. Precision is measured by
$$(1 - f)\, \frac{1}{n} \cdot \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \mu)^2, \qquad f = n/N$$
b) Expert 2: SRS with replacement and estimate $\bar{y}_s$; measures precision by $\tilde{\sigma}^2/n$
Both experts select the same sample and compute the same estimate, but give different measures of precision…
138
The likelihood principle, LP
General model: $X \sim f_\theta(x),\ \theta \in \Theta$; $\theta$ are the unknown parameters in the model
• The likelihood function, with data x: $l_x(\theta) = f_\theta(x)$
l is quite a different animal than f!! It measures the likelihood of different $\theta$ values in light of the data x
• LP: The likelihood function contains all information about the unknown parameters
• More precisely: two proportional likelihood functions for $\theta$, from the same or different experiments, should give identically the same statistical inference
139
• Maximum likelihood estimation satisfies LP, using the curvature of the likelihood as a measure of precision (Fisher)
• LP is controversial, but hard to argue against because of the fundamental result by Birnbaum, 1962:
• LP follows from the sufficiency principle (SP) and the conditionality principle (CP), which ”no one” disagrees with.
• SP: Statistical inference should be based on sufficient statistics
• CP: If you have 2 possible experiments and choose one at random, the inference should depend only on the chosen experiment
140
Illustration of CP
• A choice is to be made between a census or taking a sample of size 1, each with probability ½.
• The census is chosen
• Unconditional approach:
$$\pi_i = P(\text{census}) + P(\text{sample of size 1 and } i \text{ is selected}) = \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{N}$$
141
The Horvitz–Thompson estimator:
$$\hat{t}_{HT} = \sum_{i \in U} \frac{y_i}{\pi_i} = \sum_{i \in U} \frac{y_i}{\tfrac{1}{2} + \tfrac{1}{2N}} \approx 2t \,!$$
Conditional approach: $\pi_i = 1$ and the HT estimate is t
142
LP, SP and CP
Model: $X \sim f_\theta(x),\ \theta \in \Theta$; $\theta$ are the unknown parameters in the model
An experiment is a triple $E = \{X, \theta, \{f_\theta\}\}$
$I(E, x)$: the inference about $\theta$ in the experiment E with observation x
Likelihood principle: Let $E_1 = \{X_1, \theta, \{f_\theta^1\}\}$ and $E_2 = \{X_2, \theta, \{f_\theta^2\}\}$. Assume
$$l_{1,x_1}(\theta) = c\, l_{2,x_2}(\theta), \quad c \text{ independent of } \theta \quad (f_\theta^1(x_1) = c f_\theta^2(x_2))$$
Then: $I(E_1, x_1) = I(E_2, x_2)$
This includes the case where $E_1 = E_2$ and $x_1$ and $x_2$ are two different observations from the same experiment
143
Sufficiency principle: Let T be a sufficient statistic for $\theta$ in the experiment E. Assume $T(x_1) = T(x_2)$. Then $I(E, x_1) = I(E, x_2)$.
Conditionality principle: Let $E_1 = \{X_1, \theta, \{f_\theta^1\}\}$ and $E_2 = \{X_2, \theta, \{f_\theta^2\}\}$. Consider the mixture experiment $E^*$ where $E_1$ is chosen with probability 1/2 and $x_1$ is observed, or $E_2$ is chosen with probability 1/2 and $x_2$ is observed. The observation in $E^*$ is then the value of $X^* = (J, X_J)$, $J = 1, 2$:
$$E^* = \{X^*, \theta, \{f_\theta^*\}\} \quad \text{where} \quad f_\theta^*(j, x_j) = \tfrac{1}{2} f_\theta^j(x_j)$$
CP: $I(E^*, (j, x_j)) = I(E_j, x_j)$
144
Theorem: CP and SP $\Rightarrow$ LP
Proof (exercise; for discrete variables, the important implication):
Given $E_1$ and $E_2$ and observations $x_1^0$ and $x_2^0$ such that $f_\theta^1(x_1^0) = c f_\theta^2(x_2^0)$.
We shall show that $I(E_1, x_1^0) = I(E_2, x_2^0)$.
Consider the mixture experiment $E^*$.
From CP: $I(E_1, x_1^0) = I(E^*, (1, x_1^0))$ and $I(E_2, x_2^0) = I(E^*, (2, x_2^0))$
It remains to show that SP $\Rightarrow I(E^*, (1, x_1^0)) = I(E^*, (2, x_2^0))$
It is enough to find a sufficient statistic T in $E^*$ with $T(1, x_1^0) = T(2, x_2^0)$
145
Reduce $X^*$ as little as possible:
$$T(j, x_j) = \begin{cases} (1, x_1^0) & \text{if } (j, x_j) \in \{(1, x_1^0), (2, x_2^0)\} \\ (j, x_j) & \text{otherwise} \end{cases}$$
T is sufficient: Let first $t \ne (1, x_1^0) = t^0$. Then $P_\theta(X^* = (j, x_j) \mid T = t) = 1$ if $(j, x_j) = t$, and 0 otherwise – independent of $\theta$.
For $t^0 = (1, x_1^0)$: $P_\theta(X^* = (1, x_1^0) \mid T = t^0) + P_\theta(X^* = (2, x_2^0) \mid T = t^0) = 1$, and
$$P_\theta(X^* = (1, x_1^0) \mid T = t^0) = \frac{P_\theta(X^* = (1, x_1^0))}{P_\theta(T = t^0)} = \frac{\tfrac{1}{2} f_\theta^1(x_1^0)}{\tfrac{1}{2} f_\theta^1(x_1^0) + \tfrac{1}{2} f_\theta^2(x_2^0)} = \frac{c f_\theta^2(x_2^0)}{c f_\theta^2(x_2^0) + f_\theta^2(x_2^0)} = \frac{c}{c + 1}$$
which is independent of $\theta$, so T is sufficient.
146
Consequences for statistical analysis
• Statistical analysis, given the observed data: the sample space is irrelevant
• The usual criteria like confidence levels and P-values do not necessarily measure reliability for the actual inference, given the observed data
• Frequentistic measures evaluate methods – not necessarily relevant criteria for the observed data
147
Illustration – Bernoulli trials
$X_1, \ldots, X_i, \ldots$ with $X_i = 1$ (success) with probability $\theta$
Two experiments to gain information about $\theta$:
$E_1$: n = 12 observations; observe $Y_1 = \sum_{i=1}^{12} X_i$
$E_2$: continue the trials until we get 3 failures (0’s); observe $Y_2$ = number of successes
Suppose the results are $y_1 = y_2 = 9$
148
The likelihood functions:
$$l_9^{(1)}(\theta) = \binom{12}{9} \theta^9 (1 - \theta)^3 \quad \text{binomial}$$
$$l_9^{(2)}(\theta) = \binom{11}{9} \theta^9 (1 - \theta)^3 \quad \text{negative binomial}$$
Proportional likelihoods: $l_9^{(2)}(\theta) = (1/4)\, l_9^{(1)}(\theta)$
LP: inference about $\theta$ should be identical in the two cases
Frequentistic analyses give different results. For example, test $H_0: \theta = 1/2$ against $H_1: \theta > 1/2$:
$(E_1, 9)$: P-value = 0.0730
$(E_2, 9)$: P-value = 0.0327
because the sample spaces are different: (0,1,…,12) and (0,1,…)
149
Frequentistic vs. likelihood
• Frequentistic approach: statistical methods are evaluated pre-experimentally, over the sample space
• LP evaluates statistical methods post-experimentally, given the data
• History and discussion after Birnbaum, 1962: an overview in ”Breakthroughs in Statistics, 1890–1989, Springer 1991”
150
Likelihood function in design-based inference
• Unknown parameter: $\mathbf{y} = (y_1, y_2, \ldots, y_N)$
• Data: $x = \{(i, y_{obs,i}) : i \in s\}$
• Likelihood function = probability of the data, considered as a function of the parameters. Let
$$\Omega_x = \{\mathbf{y} : y_i = y_{obs,i} \text{ for } i \in s\}$$
• Sampling design: p(s)
• Likelihood function:
$$l_x(\mathbf{y}) = \begin{cases} p(s) & \text{if } \mathbf{y} \in \Omega_x \\ 0 & \text{otherwise} \end{cases}$$
• All possible $\mathbf{y}$ are equally likely!!
151
• Likelihood principle, LP: the likelihood function contains all information about the unknown parameters
• According to LP:
– The design model is such that the data contain no information about the unobserved part of $\mathbf{y}$, $\mathbf{y}_{unobs}$
– One has to assume in advance that there is a relation between the data and $\mathbf{y}_{unobs}$
• As a consequence of LP: it is necessary to assume a model
– The sampling design is irrelevant for statistical inference, because two sampling designs leading to the same s will have proportional likelihoods
152
Let $p_0$ and $p_1$ be two sampling designs. Assume we get the same sample s in either case. Then the data x are the same, and $\Omega_x$ is the same for both experiments. The likelihood function for sampling design $p_i$, i = 0, 1:
$$l_{i,x}(\mathbf{y}) = \begin{cases} p_i(s) & \text{if } \mathbf{y} \in \Omega_x \\ 0 & \text{otherwise} \end{cases}$$
and then for all $\mathbf{y}$:
$$l_{1,x}(\mathbf{y}) = \frac{p_1(s)}{p_0(s)}\, l_{0,x}(\mathbf{y})$$
153
• Same inference under the two different designs. This is in direct opposition to usual design-based inference, where the only stochastic evaluation is through the sampling design, for example via the Horvitz–Thompson estimator
• Concepts like design unbiasedness and design variance are irrelevant according to LP when it comes to doing the actual statistical analysis.
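The two P-values from the Bernoulli illustration on slide 149 can be verified directly from the binomial and negative binomial tail probabilities (a sketch; `math.comb` is standard library):

```python
from math import comb

# E1: Y1 ~ binomial(12, theta); P-value = P(Y1 >= 9) under theta = 1/2
p1 = sum(comb(12, y) for y in range(9, 13)) / 2 ** 12

# E2: Y2 = number of successes before the 3rd failure (negative binomial):
# P(Y2 = y) = C(y+2, y) theta^y (1-theta)^3 ; P-value = P(Y2 >= 9) = 1 - P(Y2 <= 8)
p2 = 1 - sum(comb(y + 2, y) / 2 ** (y + 3) for y in range(0, 9))
```

Both tests use the same likelihood up to the constant 1/4, yet give p1 ≈ 0.0730 and p2 ≈ 0.0327 — the point of the illustration.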
• Note: LP is not concerned with method performance, but with the statistical analysis after the data have been observed
• This does not mean the sampling design is not important. It is important to ensure that we get a good, representative sample. But once the sample is collected, the sampling design should not play a role in the inference phase, according to LP
154
Model-based inference
• Assumes a model for the $\mathbf{y}$ vector
• Conditions on the actual sample
• Uses modeling to combine information
• Problem: dependence on the model
– Introduces a subjective element
– It is almost impossible to model all variables in a survey
• The design approach is “objective” – in a perfect world of no nonsampling errors
155
III. Model-based inference in survey sampling
• Model-based approach, also called the prediction approach
– Assumes a model for the $\mathbf{y}$ vector
– Uses modeling to construct the estimator
– Ex: ratio estimator
• Model-based inference
– Inference is based on the assumed model
– Treats the sample s as fixed, conditioning on the actual sample
• Best linear unbiased predictors
• Variance estimation for different variance measures
156
Model-based approach
$y_1, y_2, \ldots, y_N$ are realized values of random variables $Y_1, Y_2, \ldots, Y_N$
Two stochastic elements:
1) sample $s \sim p(\cdot)$
2) $(Y_1, Y_2, \ldots, Y_N) \sim f_\theta$
Treat the sample s as fixed
[Model-assisted approach: use the distributional assumption on $\mathbf{Y}$ to construct the estimator, and evaluate according to the distribution of s, given the realized vector $\mathbf{y}$]
We can decompose the total t as follows:
$$t = \sum_{i=1}^{N} y_i = \sum_{i \in s} y_i + \sum_{i \notin s} y_i$$
157
Since $\sum_{i \in s} y_i$ is known, the problem is to estimate $z = \sum_{i \notin s} y_i$, the realized value of $Z = \sum_{i \notin s} Y_i$
• The unobserved z is a realized value of the random variable Z, so the problem is actually to predict the value z of Z. This can be done by predicting each unobserved $y_i$: $\hat{y}_i,\ i \notin s$
$$\text{Estimator:} \quad \hat{t}_{pred} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i = \sum_{i \in s} y_i + \hat{z}$$
$\hat{z}$ is a predictor for z
• The prediction approach; the prediction-based estimator
Determine $\hat{y}_i$ by modeling
158
Remarks:
1.
Any estimator can be expressed in the “prediction form”:
$$\hat{t} = \sum_{i \in s} y_i + \hat{z}_{\hat{t}}, \quad \text{letting } \hat{z}_{\hat{t}} = \hat{t} - \sum_{i \in s} y_i$$
2. We can then use this form to see if the estimator makes any sense
159
Ex 1.
$$\hat{t} = N \bar{y}_s = \sum_{i \in s} y_i + (N - n)\, \bar{y}_s$$
Hence, $\hat{z} = \sum_{i \notin s} \hat{y}_i$ with $\hat{y}_i = \bar{y}_s$ for all $i \notin s$
Ex 2.
$$\hat{t}_{HT} = \sum_{i \in s} y_i / \pi_i \quad \text{with} \quad \pi_i = n x_i / t_x, \quad t_x = \sum_{i=1}^{N} x_i$$
A reasonable sampling design when y and x are positively correlated.
$$\hat{t}_{HT} = \frac{t_x}{n} \sum_{i \in s} \frac{y_i}{x_i} = \sum_{i \in s} y_i + \sum_{i \in s} y_i\, \frac{t_x - n x_i}{n x_i} = \sum_{i \in s} y_i + \hat{z}_{HT}$$
Since $\sum_{i \notin s} x_i = t_x - n \bar{x}_s$, we can write $\hat{z}_{HT} = \hat{\beta}_{HT} \sum_{i \notin s} x_i = \sum_{i \notin s} \hat{y}_i$ with $\hat{y}_i = \hat{\beta}_{HT} x_i$, where
$$\hat{\beta}_{HT} = \frac{\sum_{i \in s} y_i (t_x - n x_i)/(n x_i)}{t_x - n \bar{x}_s}$$
$\hat{\beta}_{HT}$ is a rather unusual regression coefficient
160
Three common models
I. A model for business surveys, the ratio model:
• assume the existence of an auxiliary variable x, known for all units in the population:
$$Y_i = \beta x_i + \varepsilon_i \quad \text{with } E(\varepsilon_i) = 0,\ \mathrm{Var}(\varepsilon_i) = \sigma^2 x_i,\ \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$
$$E(Y_i) = \beta x_i, \qquad \mathrm{Var}(Y_i) = \sigma^2 x_i, \qquad \mathrm{Cov}(Y_i, Y_j) = 0$$
161
II. A model for social surveys, simple linear regression:
$$Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0,\ \mathrm{Var}(\varepsilon_i) = \sigma^2,\ \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$
• Ex: $x_i$ is a measure of the “size” of unit i, and $y_i$ tends to increase with increasing $x_i$. In business surveys, the regression goes through the origin in many cases
III. Common mean model:
$$E(Y_i) = \mu, \quad \mathrm{Var}(Y_i) = \sigma^2, \quad \text{and the } Y_i\text{'s are uncorrelated}$$
162
Model-based estimators (predictors)
1. Predictor: $\hat{T} = \sum_{i \in s} Y_i + \hat{Z}$
2. Model parameters: $\theta$
3. $\hat{T}$ is model-unbiased if $E_\theta(\hat{T} - T \mid s) = 0$ for all $\theta$, where $T = \sum_{i=1}^{N} Y_i$
4. The model variance of a model-unbiased predictor is the variance of the prediction error, also called the prediction variance:
$$\mathrm{Var}_\theta(\hat{T} - T \mid s) = E_\theta\big((\hat{T} - T)^2 \mid s\big)$$
5. From now on we skip s in the notation: all expectations and variances are given the selected sample s, for example
$$E(\hat{T} - T) = E(\hat{T} - T \mid s), \qquad \mathrm{Var}(\hat{T} - T) = \mathrm{Var}(\hat{T} - T \mid s)$$
163
Prediction variance as a variance measure for the actual observed sample
Illustration 1, slide 137: N + 1 possible samples: {1}, {2}, …, {N}, {1,2,…,N}
Use $\hat{T} = N \bar{Y}_s$ as the estimator for the population total T
Assume we select the “sample” {1,2,…,N}.
Then $\hat{T} = N \bar{Y}_s = T$
Prediction variance: $\mathrm{Var}(\hat{T} - T) = \mathrm{Var}(0) = 0$
Illustration 2, slide 138: exactly the same prediction variance for the two sampling designs
164
Linear predictor: $\hat{T} = \sum_{i \in s} a_i(s)\, Y_i$
6. Definition: $\hat{T}_0$ is the best linear unbiased (BLU) predictor for T if
1) $\hat{T}_0$ is model-unbiased
2) $\hat{T}_0$ has uniformly minimum prediction variance among all model-unbiased linear predictors: for any model-unbiased linear predictor $\hat{T}$,
$$\mathrm{Var}_\theta(\hat{T}_0 - T) \le \mathrm{Var}_\theta(\hat{T} - T) \text{ for all } \theta$$
165
Model: $Y_i = \beta x_i + \varepsilon_i$, $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2 v(x_i)$;
$Y_1, \ldots, Y_N$ are uncorrelated, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$
Usually, $v(x) = x^g$, $0 \le g \le 2$
Suggested predictor:
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \sum_{i \notin s} \hat{\beta}_{opt}\, x_i$$
where $\hat{\beta}_{opt}$ is the best linear unbiased estimator (BLUE) of $\beta$:
$$\hat{\beta}_{opt} = \frac{\sum_{i \in s} x_i Y_i / v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)}$$
166
Consider linear estimators $\hat{\beta} = \sum_{i \in s} c_i(s)\, Y_i$. Then $E(\hat{\beta}) = \beta \sum_{i \in s} c_i(s)\, x_i$, so unbiasedness requires $\sum_{i \in s} c_i(s)\, x_i = 1$, and
$$\mathrm{Var}(\hat{\beta}) = \sigma^2 \sum_{i \in s} c_i^2\, v(x_i)$$
Minimize $\sum_{i \in s} c_i^2 v(x_i)$ subject to $\sum_{i \in s} c_i x_i = 1$ using the Lagrange method:
$$Q = \sum_{i \in s} c_i^2 v(x_i) + \lambda \Big(\sum_{i \in s} c_i x_i - 1\Big)$$
$$\partial Q / \partial c_i = 2 c_i v(x_i) + \lambda x_i = 0 \;\Rightarrow\; c_i = (-\lambda/2)\, \frac{x_i}{v(x_i)}$$
167
Determine $(-\lambda/2)$ such that $\sum_{i \in s} c_i x_i = 1$:
$$(-\lambda/2) \sum_{i \in s} x_i^2 / v(x_i) = 1 \;\Rightarrow\; (-\lambda/2) = 1 \Big/ \sum_{i \in s} x_i^2 / v(x_i)$$
$$c_{i,opt} = \frac{x_i / v(x_i)}{\sum_{j \in s} x_j^2 / v(x_j)} \quad \text{and} \quad \hat{\beta}_{opt} = \sum_{i \in s} c_{i,opt}\, Y_i = \frac{\sum_{i \in s} x_i Y_i / v(x_i)}{\sum_{j \in s} x_j^2 / v(x_j)}$$
This is the least squares estimate based on the scaled observations $Y_i / \sqrt{v(x_i)}$
168
• We shall show that $\hat{T}_{pred}$ is the best linear unbiased (BLU) predictor for T
Let $\hat{T}$ be a model-unbiased and linear predictor.
Let $\hat{Z} = \hat{T} - \sum_{i \in s} Y_i$ and $\hat{\beta} = \hat{Z} \big/ \sum_{i \notin s} x_i$.
Then $\hat{T} = \sum_{i \in s} Y_i + \hat{\beta} \sum_{i \notin s} x_i$.
$\hat{T}$ a linear predictor $\Rightarrow \hat{\beta}$ is linear in $(Y_i,\ i \in s)$, and $\hat{T}$ model-unbiased $\Rightarrow E(\hat{\beta}) = \beta$,
169
since
$$E(\hat{T} - T) = E\Big(\hat{\beta} \sum_{i \notin s} x_i - \sum_{i \notin s} Y_i\Big) = \big[E(\hat{\beta}) - \beta\big] \sum_{i \notin s} x_i$$
such that $E(\hat{T} - T) = 0 \Leftrightarrow E(\hat{\beta}) = \beta$.
The prediction variance of a model-unbiased predictor:
$$\mathrm{Var}(\hat{T} - T) = \mathrm{Var}\Big(\hat{\beta} \sum_{i \notin s} x_i - \sum_{i \notin s} Y_i\Big) = \Big(\sum_{i \notin s} x_i\Big)^2 \mathrm{Var}(\hat{\beta}) + \sigma^2 \sum_{i \notin s} v(x_i)$$
Minimizing the prediction variance is thus equivalent to minimizing $\mathrm{Var}(\hat{\beta})$, giving us $\hat{T}_{pred}$ as the BLU predictor
170
The prediction variance of the BLU predictor:
$$\mathrm{Var}(\hat{T}_{pred} - T) = \Big(\sum_{i \notin s} x_i\Big)^2 \mathrm{Var}(\hat{\beta}_{opt}) + \sigma^2 \sum_{i \notin s} v(x_i) = \sigma^2\, \frac{\big(\sum_{i \notin s} x_i\big)^2}{\sum_{i \in s} x_i^2 / v(x_i)} + \sigma^2 \sum_{i \notin s} v(x_i)$$
A variance estimate is obtained by using the model-unbiased estimator for $\sigma^2$:
$$\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i \in s} \frac{(Y_i - \hat{\beta}_{opt} x_i)^2}{v(x_i)}$$
171
$$\hat{V}(\hat{T}_{pred} - T) = \hat{\sigma}^2 \Bigg[\frac{\big(\sum_{i \notin s} x_i\big)^2}{\sum_{i \in s} x_i^2 / v(x_i)} + \sum_{i \notin s} v(x_i)\Bigg]$$
The central limit theorem applies, such that for large n, N − n,
$(\hat{T}_{pred} - T) \big/ \sqrt{\hat{V}(\hat{T}_{pred} - T)}$ is approximately N(0,1)
Approximate 95% confidence interval for the value t of T: $\hat{t}_{pred} \pm 1.96 \sqrt{\hat{V}(\hat{T}_{pred} - T)}$
Also called a 95% prediction interval for the random variable T
172
Three special cases: 1) v(x) = x, the ratio model, 2) v(x) = x², and 3) $x_i = 1$ for all i, the common mean model
1. v(x) = x:
$$\hat{\beta}_{opt} = \frac{\sum_{i \in s} x_i Y_i / v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)} = \frac{\sum_{i \in s} Y_i}{\sum_{i \in s} x_i} = \hat{R}, \quad \text{the usual sample ratio}$$
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \hat{R} \sum_{i \notin s} x_i = \hat{R} \sum_{i \in s} x_i + \hat{R} \sum_{i \notin s} x_i = \hat{R}\, t_x, \quad \text{the usual ratio estimator}$$
$$\mathrm{Var}(\hat{T}_{pred} - T) = \sigma^2 \Big[\big(\textstyle\sum_{i \notin s} x_i\big)^2 \big/ \textstyle\sum_{i \in s} x_i + \sum_{i \notin s} x_i\Big] = N^2\, \frac{1 - f}{n} \cdot \frac{\bar{x}_r\, \bar{x}}{\bar{x}_s}\, \sigma^2$$
where $f = n/N$, $\bar{x}_r = \sum_{i \notin s} x_i / (N - n)$ and $\bar{x} = \sum_{i=1}^{N} x_i / N$
173
2.
v(x) = x²:
$$\hat{\beta}_{opt} = \frac{\sum_{i \in s} x_i Y_i / v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)} = \frac{\sum_{i \in s} Y_i / x_i}{n}, \quad \text{the sample mean of the ratios}$$
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \hat{\beta}_{opt} \sum_{i \notin s} x_i = \sum_{i \in s} Y_i + \Big(\frac{1}{n} \sum_{i \in s} \frac{Y_i}{x_i}\Big) \sum_{i \notin s} x_i$$
$$\mathrm{Var}(\hat{T}_{pred} - T) = \sigma^2\, \frac{\big(\sum_{i \notin s} x_i\big)^2}{\sum_{i \in s} x_i^2 / v(x_i)} + \sigma^2 \sum_{i \notin s} v(x_i) = \sigma^2\, \frac{\big(\sum_{i \notin s} x_i\big)^2}{n} + \sigma^2 \sum_{i \notin s} x_i^2$$
174
This resembles the H-T estimator when $\pi_i = n x_i / t_x$: let $R_i = Y_i / x_i$ and $\bar{R}_s = \sum_{i \in s} R_i / n$. Then
$$\hat{T}_{HT} = \sum_{i \in s} \frac{t_x Y_i}{n x_i} = t_x \bar{R}_s \qquad \text{(also model-unbiased)}$$
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \bar{R}_s \sum_{i \notin s} x_i = t_x \bar{R}_s + \sum_{i \in s} (Y_i - \bar{R}_s x_i)$$
When the sampling fraction f is small, or when the $x_i$ values vary little, these two estimators are approximately the same. In the latter case:
$$\bar{R}_s \approx \frac{1}{n \bar{x}_s} \sum_{i \in s} Y_i \quad \text{and} \quad \sum_{i \in s} \bar{R}_s x_i \approx \sum_{i \in s} Y_i$$
175
3. $x_i = 1$:
Model: $Y_i = \mu + \varepsilon_i$, $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$;
$Y_1, \ldots, Y_N$ are uncorrelated, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$
$$\hat{\beta}_{opt} = \frac{\sum_{i \in s} x_i Y_i / v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)} = \frac{1}{n} \sum_{i \in s} Y_i = \bar{Y}_s, \quad \text{the sample mean}$$
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \sum_{i \notin s} \bar{Y}_s = N \bar{Y}_s$$
$$\mathrm{Var}(N \bar{Y}_s - T) = \sigma^2\, \frac{(N - n)^2}{n} + \sigma^2 (N - n) = N^2\, \frac{1 - f}{n}\, \sigma^2$$
This is also the usual, design-based variance formula under SRS
176
We see that the variance estimate is given by
$$N^2 (1 - f)\, \frac{\hat{\sigma}^2}{n}, \qquad \hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i \in s} (y_i - \bar{y}_s)^2 \text{ – the sample variance}$$
Exactly the same as in the design approach, but the interpretation is different
177
Simple linear regression model
$$Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0,\ \mathrm{Var}(\varepsilon_i) = \sigma^2; \quad Y_1, \ldots, Y_N \text{ uncorrelated}$$
BLU predictor:
$$\hat{T}_{pred} = \sum_{i \in s} Y_i + \sum_{i \notin s} (\hat{\beta}_1 + \hat{\beta}_2 x_i)$$
where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the LS estimators:
$$\hat{\beta}_2 = \frac{\sum_{i \in s} (x_i - \bar{x}_s)(Y_i - \bar{Y}_s)}{\sum_{i \in s} (x_i - \bar{x}_s)^2} = \frac{\sum_{i \in s} (x_i - \bar{x}_s) Y_i}{\sum_{i \in s} (x_i - \bar{x}_s)^2}, \qquad \hat{\beta}_1 = \bar{Y}_s - \hat{\beta}_2 \bar{x}_s$$
178
$$\hat{T}_{pred} = n \bar{Y}_s + (N - n) \bar{Y}_s + \hat{\beta}_2 \Big(\sum_{i \notin s} x_i - (N - n) \bar{x}_s\Big) = N \bar{Y}_s + \hat{\beta}_2 (t_x - N \bar{x}_s)$$
$$\hat{T}_{pred} = N \big[\bar{Y}_s + \hat{\beta}_2 (\bar{x} - \bar{x}_s)\big]$$
Clearly, $\hat{T}_{pred}$ is model-unbiased:
$$E(T) = \sum_{i=1}^{N} (\beta_1 + \beta_2 x_i) = N (\beta_1 + \beta_2 \bar{x})$$
and
$$E(\hat{T}_{pred}) = N \Big\{\frac{1}{n} \sum_{i \in s} (\beta_1 + \beta_2 x_i) + \beta_2 (\bar{x} - \bar{x}_s)\Big\} = N (\beta_1 + \beta_2 \bar{x})$$
179
We shall now show that this predictor is BLU.
Assume first that $\bar{x} \ne \bar{x}_s$. Let $\hat{T}$ be a linear, model-unbiased predictor, and let $b = (\hat{T}/N - \bar{Y}_s)/(\bar{x} - \bar{x}_s)$.
Then \(\hat T/N = \bar Y_s + b(\bar x - \bar x_s)\), i.e. \(\hat T = N[\bar Y_s + b(\bar x - \bar x_s)]\). Hence, any predictor can be expressed on this form, and the predictor is linear if and only if \(b\) is linear in the \(Y_i\)'s.

Also, \(\hat T\) is model-unbiased \(\Leftrightarrow E(b)=\beta_2\):
\[E(\hat T) = E(T) = N(\beta_1+\beta_2\bar x) \Leftrightarrow N[\beta_1+\beta_2\bar x_s + (\bar x-\bar x_s)E(b)] = N(\beta_1+\beta_2\bar x) \Leftrightarrow (\bar x-\bar x_s)E(b) = \beta_2\bar x - \beta_2\bar x_s = \beta_2(\bar x-\bar x_s).\] 180

Prediction variance:
\[\mathrm{Var}(\hat T - T) = \mathrm{Var}\Big((N-n)\bar Y_s + Nb(\bar x-\bar x_s) - \sum_{i\notin s}Y_i\Big) = \mathrm{Var}\big((N-n)\bar Y_s + Nb(\bar x-\bar x_s)\big) + (N-n)\sigma^2.\]
Write \(b = \sum_{i\in s}c_i(s)Y_i\). Unbiasedness, \(E(b)=\beta_2\):
\[E(b) = \sum_{i\in s}c_i(\beta_1+\beta_2x_i) = \beta_1\sum_{i\in s}c_i + \beta_2\sum_{i\in s}c_ix_i = \beta_2 \Leftrightarrow (1)\ \sum_{i\in s}c_i = 0\ \text{and}\ (2)\ \sum_{i\in s}c_ix_i = 1.\]
So we need to minimize the prediction variance with respect to the \(c_i\)'s under (1) and (2), 181

i.e. minimize
\[\mathrm{Var}\big((N-n)\bar Y_s + Nb(\bar x-\bar x_s)\big) = \mathrm{Var}\Big(\sum_{i\in s}Y_i\Big[\frac{N-n}{n} + N(\bar x-\bar x_s)c_i\Big]\Big) = \sigma^2\sum_{i\in s}\Big[\frac{N-n}{n} + N(\bar x-\bar x_s)c_i\Big]^2\]
\[= \frac{(N-n)^2}{n}\sigma^2 + \sigma^2\Big[N^2(\bar x-\bar x_s)^2\sum_{i\in s}c_i^2 + 2N(\bar x-\bar x_s)\frac{N-n}{n}\sum_{i\in s}c_i\Big].\]
Since \(\sum_{i\in s}c_i = 0\), it is enough to minimize \(\sum_{i\in s}c_i^2\) under conditions (1) and (2). 182

\[Q = \sum_{i\in s}c_i^2 - 2\lambda_1\sum_{i\in s}c_i - 2\lambda_2\Big(\sum_{i\in s}c_ix_i - 1\Big),\qquad \partial Q/\partial c_i = 2c_i - 2\lambda_1 - 2\lambda_2x_i = 0 \Rightarrow c_i = \lambda_1 + \lambda_2x_i.\]
(1) \(\Rightarrow \lambda_1 + \lambda_2\bar x_s = 0\); (2) \(\Rightarrow \lambda_1 n\bar x_s + \lambda_2\sum_{i\in s}x_i^2 = 1\). Substituting \(\lambda_1 = -\lambda_2\bar x_s\) from (1):
\[\lambda_2\sum_{i\in s}x_i^2 - \lambda_2 n\bar x_s^2 = 1 \Rightarrow \lambda_2 = 1\Big/\sum_{i\in s}(x_i-\bar x_s)^2.\] 183

\[c_i = \lambda_1+\lambda_2x_i = \frac{x_i-\bar x_s}{\sum_{j\in s}(x_j-\bar x_s)^2}\quad\text{and}\quad b = \sum_{i\in s}c_iY_i = \frac{\sum_{i\in s}(x_i-\bar x_s)Y_i}{\sum_{j\in s}(x_j-\bar x_s)^2} = \hat\beta_2.\]
The prediction variance is given by
\[\mathrm{Var}(\hat T_{pred}-T) = (1-f)\frac{\sigma^2N^2}{n}\Big[1 + \frac{n(\bar x-\bar x_s)^2}{(1-f)\sum_{i\in s}(x_i-\bar x_s)^2}\Big],\]
and a variance estimate is obtained by estimating \(\sigma^2\) with
\[\hat\sigma^2 = \frac{1}{n-2}\sum_{i\in s}\big(Y_i - \bar Y_s - \hat\beta_2(x_i-\bar x_s)\big)^2.\] 184

So far, \(\bar x \neq \bar x_s\). What if \(\bar x = \bar x_s\)? Then \(\hat T_{pred} = N\bar Y_s\), and it is the BLU predictor. For any linear predictor \(\hat T = \sum_{i\in s}a_iY_i\):
\[\mathrm{Var}(\hat T - T) = \mathrm{Var}\Big[\sum_{i\in s}(a_i-1)Y_i - \sum_{i\notin s}Y_i\Big] = \sigma^2\Big[\sum_{i\in s}(a_i-1)^2 + (N-n)\Big].\]
Let \(\hat T_{\bar a} = \bar a\sum_{i\in s}Y_i\) with \(\bar a = \sum_{i\in s}a_i/n\). Then
\[\mathrm{Var}(\hat T_{\bar a}-T) = \sigma^2\Big[\sum_{i\in s}(\bar a-1)^2 + (N-n)\Big] = \sigma^2[n(\bar a-1)^2 + (N-n)].\] 185

Since \(\sum_{i\in s}(a_i-1)^2 \geq n(\bar a-1)^2\): \(\mathrm{Var}(\hat T-T) \geq \mathrm{Var}(\hat T_{\bar a}-T)\), and \(\hat T\) model-unbiased \(\Rightarrow \sum_{i\in s}a_i = N\), i.e. \(\bar a = N/n\), so that \(\hat T_{\bar a} = N\bar Y_s = \hat T_{pred}\). 186

Anticipated variance (method variance). We want a variance measure that tells us about the expected uncertainty in repeated surveys.
1. Conditional on the sample \(s\), with model-unbiased \(\hat T\): \(\mathrm{Var}(\hat T - T)\) measures the uncertainty for this particular sample \(s\).
2.
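The Lagrange solution above can be checked numerically: the optimal \(c_i\) satisfy constraints (1) and (2), and \(b=\sum_s c_iY_i\) reproduces the least-squares slope. A small sketch on hypothetical sample data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample values (illustration only)
xs = rng.uniform(0.0, 10.0, 40)
ys = 2.0 + 0.7 * xs + rng.normal(0.0, 1.0, 40)

# Optimal coefficients from the Lagrange solution
c = (xs - xs.mean()) / ((xs - xs.mean()) ** 2).sum()

# Constraints (1) and (2)
print(c.sum(), (c * xs).sum())        # close to 0 and 1

# b = sum_s c_i * Y_i equals the least-squares slope
b = (c * ys).sum()
b2_hat = np.polyfit(xs, ys, 1)[0]     # slope from a standard LS fit
print(b, b2_hat)
```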
2. The expected uncertainty over repeated surveys: \(E_p\{\mathrm{Var}(\hat T - T)\}\), taken over the sampling distribution \(p(\cdot)\).
3. This is called the anticipated variance.
4. It can be regarded as a variance measure that describes how the estimation method performs in repeated surveys. 187

If \(\hat T\) is not model-unbiased, we use \(E_p\{E(\hat T-T)^2\}\) as a criterion for uncertainty, the anticipated mean squared error.
Note: if \(\hat T\) is design-unbiased, then
\[E_p\{E(\hat T-T)^2\} = E\{E_p((\hat T-T)^2\mid \mathbf Y)\}\quad\text{and}\quad E_p\big((\hat T-T)^2\mid \mathbf Y=\mathbf y\big) = E_p(\hat t-t)^2 = \mathrm{Var}_p(\hat t),\]
so the anticipated MSE becomes the expected design variance, also called the anticipated design variance:
\[E_p\{E(\hat T-T)^2\} = E\{\mathrm{Var}_p(\hat T)\}.\] 188

Example: simple linear regression and simple random sampling. If \(N\bar Y_s\) is used: it is not model-unbiased, but it is design-unbiased:
\[E_p\{E(N\bar Y_s - T)^2\} = E\{\mathrm{Var}_p(N\bar Y_s)\} = N^2\frac{1-f}{n}E\Big\{\frac{1}{N-1}\sum_{i=1}^N(Y_i-\bar Y)^2\Big\} = N^2\frac{1-f}{n}\Big\{\sigma^2 + \frac{1}{N-1}\sum_{i=1}^N(\mu_i-\bar\mu)^2\Big\},\]
where \(\mu_i = E(Y_i) = \beta_1+\beta_2x_i\) and \(\bar\mu = \beta_1+\beta_2\bar x\). Hence
\[E\{\mathrm{Var}_p(N\bar Y_s)\} = N^2\frac{1-f}{n}\{\sigma^2 + \beta_2^2S_x^2\},\qquad S_x^2 = \frac{1}{N-1}\sum_{i=1}^N(x_i-\bar x)^2.\] 189

Let us now study the BLU predictor (it can be shown that it is approximately design-unbiased):
\[\mathrm{Var}(\hat T_{pred}-T) = (1-f)\frac{\sigma^2N^2}{n}\Big[1 + \frac{n(\bar x-\bar x_s)^2}{(1-f)\sum_{i\in s}(x_i-\bar x_s)^2}\Big],\]
\[E_p\{\mathrm{Var}(\hat T_{pred}-T)\} \approx (1-f)\frac{\sigma^2N^2}{n}\Big[1 + \frac{E_p\{n(\bar x_s-\bar x)^2\}}{(1-f)E_p\{\sum_{i\in s}(x_i-\bar x_s)^2\}}\Big],\]
with \(E_p\{n(\bar x_s-\bar x)^2\} = n\mathrm{Var}_p(\bar x_s) = (1-f)S_x^2\) and \(E_p\{\sum_{i\in s}(x_i-\bar x_s)^2\} = (n-1)S_x^2\). 190

\[E_p\{\mathrm{Var}(\hat T_{pred}-T)\} \approx (1-f)\frac{\sigma^2N^2}{n}\Big[1+\frac{1}{n-1}\Big] \approx N^2(1-f)\frac{\sigma^2}{n},\]
compared to
\[E\{\mathrm{Var}_p(N\bar Y_s)\} = N^2\frac{1-f}{n}\{\sigma^2+\beta_2^2S_x^2\}.\]
\(\hat T_{pred}\) eliminates the term \(\beta_2^2S_x^2\) and is much more efficient than \(N\bar Y_s\). 191

Remarks
• From a design-based perspective, the sample-mean-based estimator is unbiased, while the linear regression estimator is not.
• Considering only the design bias, we might therefore choose the sample-mean-based estimator.
• The linear regression estimator would be selected over the sample-mean-based estimator only because it has the smaller anticipated variance.
• Hence, it is difficult to see design-unbiasedness as a means to
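The efficiency comparison above can be illustrated by Monte Carlo: repeatedly drawing a model realization and an SRS, and comparing the empirical anticipated MSEs of \(N\bar Y_s\) and the regression predictor with the formulas. All population constants below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population parameters (illustration only)
N, n, b1, b2, sigma = 400, 40, 1.0, 2.0, 1.0
x = rng.uniform(0.0, 10.0, N)
f = n / N
Sx2 = ((x - x.mean()) ** 2).sum() / (N - 1)

err_mean, err_reg = [], []
for _ in range(2000):
    y = b1 + b2 * x + rng.normal(0.0, sigma, N)   # fresh model realization
    s = rng.choice(N, size=n, replace=False)      # fresh SRS draw
    xs, ys = x[s], y[s]
    b2h = ((xs - xs.mean()) * (ys - ys.mean())).sum() / ((xs - xs.mean()) ** 2).sum()
    err_mean.append(N * ys.mean() - y.sum())
    err_reg.append(N * (ys.mean() + b2h * (x.mean() - xs.mean())) - y.sum())

amse_mean = np.mean(np.square(err_mean))
amse_reg = np.mean(np.square(err_reg))
theory_mean = N**2 * (1 - f) / n * (sigma**2 + b2**2 * Sx2)
theory_reg = N**2 * (1 - f) * sigma**2 / n
print(amse_mean / theory_mean, amse_reg / theory_reg)
```

Both ratios come out near 1, and the regression predictor's MSE is far below that of \(N\bar Y_s\), reflecting the eliminated \(\beta_2^2S_x^2\) term.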
choose among estimators. 192

Robust variance estimation
• The assumed model is really a "working model".
• In particular, the variance assumption may be misspecified, and this kind of model failure is not always easy to detect
  – e.g. constant variance
  – or variance proportional to the size measure \(x_i\).
• Standard least squares variance estimates are sensitive to misspecification of the variance assumption.
• We are therefore concerned with robust variance estimators. 193

Variance estimation for the ratio estimator. Working model: \(Y_i = \beta x_i+\varepsilon_i\), \(E(\varepsilon_i)=0\) and \(\mathrm{Var}(\varepsilon_i) = \sigma^2x_i\); \(Y_1,\dots,Y_N\) are uncorrelated, \(\mathrm{Cov}(\varepsilon_i,\varepsilon_j)=0\). Under this working model, the unbiased estimator of the prediction variance of the ratio estimator is
\[\hat V_R(\hat R t_x - T) = N^2\frac{1-f}{n}\,\frac{\bar x_r\bar x}{\bar x_s}\,\hat\sigma^2,\qquad \hat\sigma^2 = \frac{1}{n-1}\sum_{i\in s}\frac{1}{x_i}(Y_i-\hat Rx_i)^2,\qquad \hat R = \bar Y_s/\bar x_s.\] 194

This variance estimator is non-robust to misspecification of the variance model. Suppose the true model has \(E(Y_i) = \beta x_i\) and \(\mathrm{Var}(Y_i) = \sigma^2v(x_i)\). The ratio estimator is still model-unbiased, but the prediction variance is now
\[\mathrm{Var}(\hat Rt_x - T) = \Big(\sum_{i\notin s}x_i\Big)^2\mathrm{Var}(\hat R) + \sigma^2\sum_{i\notin s}v(x_i) = \frac{(\sum_{i\notin s}x_i)^2}{(\sum_{i\in s}x_i)^2}\,\sigma^2\sum_{i\in s}v(x_i) + \sigma^2\sum_{i\notin s}v(x_i)\]
\[= \frac{(N-n)^2\bar x_r^2}{n\bar x_s^2}\,\sigma^2\bar v_s + (N-n)\sigma^2\bar v_r,\] 195

i.e.
\[\mathrm{Var}(\hat Rt_x - T) = N^2\frac{1-f}{n}\,\sigma^2\big\{(1-f)\bar v_s(\bar x_r/\bar x_s)^2 + f\,\bar v_r\big\},\]
with \(\bar v_s = \sum_{i\in s}v(x_i)/n\) and \(\bar v_r = \sum_{i\notin s}v(x_i)/(N-n)\). Moreover, \(E(\hat\sigma^2) \neq \sigma^2\):
\[E(\hat\sigma^2) = \frac{1}{n-1}E\Big(\sum_{i\in s}\frac{1}{x_i}(Y_i-\hat Rx_i)^2\Big) = \sigma^2\frac{n}{n-1}\Big\{\overline{(v/x)}_s - \frac{\bar v_s}{n\,\bar x_s}\Big\},\qquad \overline{(v/x)}_s = \frac{1}{n}\sum_{i\in s}v(x_i)/x_i.\] 196

Robust variance estimator for the ratio estimator. Rewrite
\[\mathrm{Var}(\hat Rt_x - T) = N^2\frac{1-f}{n}\,\sigma^2\big[\bar v_s(\bar x_r/\bar x_s)^2 + f\{\bar v_r - \bar v_s(\bar x_r/\bar x_s)^2\}\big] \approx N^2\frac{1-f}{n}\,\sigma^2\bar v_s\,(\bar x_r/\bar x_s)^2,\]
the leading term in the prediction variance. Moreover,
\[\sigma^2\bar v_s = \frac{1}{n}\sum_{i\in s}\sigma^2v(x_i) = \frac{1}{n}\sum_{i\in s}\mathrm{Var}(Y_i) = \frac{1}{n}\sum_{i\in s}E(Y_i-\beta x_i)^2 = \frac{1}{n}E\Big\{\sum_{i\in s}(Y_i-\beta x_i)^2\Big\}.\] 197

This suggests we may use
\[\widehat{\sigma^2\bar v_s} = \frac{1}{n-1}\sum_{i\in s}(Y_i-\hat Rx_i)^2,\]
leading to the robust variance estimator
\[\hat V_{rob}(\hat Rt_x - T) = \frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,N^2\,\frac{1}{n-1}\sum_{i\in s}(Y_i-\hat Rx_i)^2.\]
Almost the same as the design variance estimator in SRS:
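The sensitivity of the working-model estimator, and the behavior of the robust one, can be sketched numerically. In the hypothetical scenario below the true variance is \(\sigma^2x_i^2\) while the working model assumes \(\sigma^2x_i\); all constants are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical misspecification: true Var(Y_i) = sigma^2 * x_i^2,
# while the working model assumes Var(Y_i) = sigma^2 * x_i
N, n, beta, sigma = 800, 80, 2.0, 0.3
x = rng.uniform(1.0, 10.0, N)
y = beta * x + rng.normal(0.0, sigma * x)

s = rng.choice(N, size=n, replace=False)
mask = np.ones(N, dtype=bool)
mask[s] = False
xs, ys, xr = x[s], y[s], x[mask]
f = n / N
R_hat = ys.mean() / xs.mean()
resid = ys - R_hat * xs

# Working-model estimator V_R: unbiased only when Var(Y_i) = sigma^2 * x_i
V_work = N**2 * (1 - f) / n * (xr.mean() * x.mean() / xs.mean()) \
         * (resid**2 / xs).sum() / (n - 1)

# Robust estimator: squared residuals stand in for Var(Y_i), no variance model needed
V_rob = N**2 * (1 - f) / n * (xr.mean() / xs.mean())**2 \
        * (resid**2).sum() / (n - 1)

# True prediction variance under the true model (known here by construction)
V_true = (xr.sum() / xs.sum())**2 * sigma**2 * (xs**2).sum() + sigma**2 * (xr**2).sum()
print(V_work, V_rob, V_true)
```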
\[\hat V_{SRS}(\hat Rt_x) = \frac{1-f}{n}\,(\bar x/\bar x_s)^2\,N^2\,\frac{1}{n-1}\sum_{i\in s}(Y_i-\hat Rx_i)^2.\] 198

\[E\big\{\hat V_{rob}(\hat Rt_x - T)\big\} \approx \frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,N^2\,\sigma^2\bar v_s \approx \mathrm{Var}(\hat Rt_x - T).\]
Can we do better? Require the estimator to be exactly unbiased under the ratio model, \(v(x)=x\). When \(v(x)=x\):
\[E\Big\{\frac{1}{n-1}\sum_{i\in s}(Y_i-\hat Rx_i)^2\Big\} = \sigma^2\bar x_s\Big(1 - \frac{1}{n}\cdot\frac{s_x^2}{\bar x_s^2}\Big),\qquad s_x^2 = \frac{1}{n-1}\sum_{i\in s}(x_i-\bar x_s)^2.\] 199

The prediction variance when \(v(x)=x\):
\[V(\hat Rt_x - T) = N^2\frac{1-f}{n}\,\frac{\bar x_r\bar x}{\bar x_s}\,\sigma^2,\quad\text{while}\quad E\{\hat V_{rob}(\hat Rt_x - T)\} = N^2\frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,\sigma^2\bar x_s\Big(1-\frac{1}{n}\frac{s_x^2}{\bar x_s^2}\Big).\]
So a robust variance estimator that is exactly unbiased under the working model \(v(x)=x\) is
\[\hat V_{R,rob}(\hat Rt_x - T) = \Big\{1-\frac{1}{n}\frac{s_x^2}{\bar x_s^2}\Big\}^{-1}\frac{\bar x_r\bar x}{\bar x_s^2}\,N^2\frac{1-f}{n}\,\frac{1}{n-1}\sum_{i\in s}(Y_i-\hat Rx_i)^2 = \Big\{1-\frac{1}{n}\frac{s_x^2}{\bar x_s^2}\Big\}^{-1}\frac{\bar x_r}{\bar x}\,\hat V_{SRS}(\hat Rt_x).\] 200

General approach to robust variance estimation:
1. Find robust estimators of \(\mathrm{Var}(Y_i)\) that do not depend on model assumptions about the variance.
2. \(\hat T = \sum_{i\in s}w_{is}Y_i \Rightarrow \mathrm{Var}(\hat T - T) = \sum_{i\in s}(w_{is}-1)^2\mathrm{Var}(Y_i) + \sum_{i\notin s}\mathrm{Var}(Y_i)\).
3. For \(i\in s\): \(\hat V(Y_i) = (Y_i-\hat\mu_i)^2\), where \(\hat\mu_i\) estimates \(E(Y_i)\) under the true model.
4. Estimate only the leading (typically dominating) first term of the prediction variance, or estimate the second term from the more general model. 201

• Reference on robust variance estimation:
• Valliant, Dorfman and Royall (2000): Finite Population Sampling and Inference. A Prediction Approach, ch. 5. 202

Model-assisted approach
• Design-based approach.
• Use modeling to improve the basic HT estimator. Assume the population values \(y\) are realized values of random \(Y\).
• Assume the existence of auxiliary variables, known for all units in the population.
• Basic idea: suppose \(\hat y_i = \hat\beta x_i\) is a regression-based "estimate" of each \(y_i\) in the population.
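The four-step recipe above can be sketched for the ratio estimator, where the weights \(w_{is}\) happen to be constant, \(w = t_x/\sum_s x_i\). A minimal illustration on hypothetical data, keeping only the leading term of step 4:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population (illustration only)
N, n = 600, 60
x = rng.uniform(1.0, 10.0, N)
y = 2.0 * x + rng.normal(0.0, 0.5 * np.sqrt(x))

s = rng.choice(N, size=n, replace=False)
mask = np.ones(N, dtype=bool)
mask[s] = False
xs, ys = x[s], y[s]

# Step 2: ratio estimator written as sum_s w*Y_i; the weight is constant, w = t_x / sum_s x_i
w = x.sum() / xs.sum()
T_hat = w * ys.sum()                        # equals R_hat * t_x

# Steps 1 and 3: squared residuals stand in for Var(Y_i)
mu_hat = (ys.sum() / xs.sum()) * xs         # fitted means under the working mean model
# Step 4: keep the leading term sum_s (w-1)^2 * Var_hat(Y_i)
V_lead = ((w - 1.0) ** 2 * (ys - mu_hat) ** 2).sum()
print(T_hat, V_lead)
```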
Here, \(x_i\) is known for the whole population. Then
\[t = \sum_{i=1}^N\hat y_i + \sum_{i=1}^N(y_i-\hat y_i),\quad\text{and}\quad e = \sum_{i=1}^N e_i,\ \text{where } e_i = y_i-\hat y_i,\]
is much easier to estimate, and can be estimated by the HT estimator 203

\[\hat e_{HT} = \sum_{i\in s}\frac{e_i}{\pi_i}.\]
Final estimator, the regression estimator:
\[\hat t_{reg} = \sum_{i=1}^N\hat\beta x_i + \hat e_{HT}.\]
Alternative expression:
\[\hat t_{reg} = \sum_{i\in s}\frac{y_i}{\pi_i} + \hat\beta\Big(t_x - \sum_{i\in s}\frac{x_i}{\pi_i}\Big),\qquad t_x = \sum_{i=1}^Nx_i,\]
i.e. \(\hat t_{reg} = \hat t_{y,HT} + \hat\beta(t_x - \hat t_{x,HT})\). 204

Simple random sample: \(\hat t_{reg} = N\bar y_s + \hat\beta(t_x - N\bar x_s)\).
Model: the \(Y_i\)'s are independent and \(Y_i = \beta x_i+\varepsilon_i\), \(E(\varepsilon_i)=0\), \(\mathrm{Var}(\varepsilon_i) = \sigma^2x_i\).
Best linear unbiased estimator: \(\hat\beta = \bar y_s/\bar x_s\), so
\[\hat t_{reg} = N\bar y_s + \frac{\bar y_s}{\bar x_s}(t_x - N\bar x_s) = \frac{\bar y_s}{\bar x_s}\,t_x,\ \text{the ratio estimator.}\] 205

In general with this "ratio model", in order to get approximately design-unbiased estimators: regard the \(\beta\)-estimate as an estimate of \(\sum_{i=1}^Ny_i\big/\sum_{i=1}^Nx_i\). The numerator is estimated by \(\hat t_{y,HT} = \sum_{i\in s}y_i/\pi_i\) and the denominator by \(\hat t_{x,HT} = \sum_{i\in s}x_i/\pi_i\), so use
\[\hat\beta = \hat t_{y,HT}/\hat t_{x,HT} \Rightarrow \hat t_{reg} = \frac{\sum_{i\in s}y_i/\pi_i}{\sum_{i\in s}x_i/\pi_i}\,t_x = \hat\beta\,t_x = \sum_{i=1}^N\hat y_i,\quad\text{where } \hat y_i = \hat\beta x_i.\] 206

Variance and variance estimation. Reference: Särndal, Swensson and Wretman: Model Assisted Survey Sampling (1992, ch. 6), Wiley.
• The regression estimator is approximately unbiased.
• Variance estimation: the sample residuals are \(e_i = y_i-\hat y_i\), \(i\in s\), where \(\hat y_i = x_i\hat\beta\). If \(|s| = n\), fixed in advance:
\[\hat V(\hat t_{reg}) = \frac{1}{2}\sum_{i\in s}\sum_{j\in s,\,j\neq i}\frac{\pi_i\pi_j-\pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2.\] 207

Approximate 95% CI, for large \(n\), \(N-n\): \(\hat t_{reg} \pm 1.96\sqrt{\hat V(\hat t_{reg})}\).
• Remark: in SSW (1992, ch. 6) an alternative variance estimator is mentioned that may be preferable in many cases. 208

Common mean model: \(E(Y_i) = \beta\), \(\mathrm{Var}(Y_i) = \sigma^2\), and the \(Y_i\)'s are uncorrelated. This is the ratio model with \(x_i=1\):
\[\hat\beta = \hat t_{y,HT}/\hat t_{x,HT} = \frac{\sum_{i\in s}y_i/\pi_i}{\sum_{i\in s}1/\pi_i} = \frac{\hat t_{y,HT}}{\hat N} = \tilde y_s,\qquad \hat N = \sum_{i\in s}1/\pi_i,\]
\[\hat t_{reg} = t_x\hat\beta = N\hat\beta = N\tilde y_s.\]
This is the modified H-T estimator (slides 73, 74). It is typically much better than the H-T estimator when the \(\pi_i\) vary. 209

With \(e_i = y_i - \tilde y_s\):
\[\hat V(N\tilde y_s) = \frac{1}{2}\sum_{i\in s}\sum_{j\in s,\,j\neq i}\frac{\pi_i\pi_j-\pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2.\]
Alternatively,
\[\hat V(N\tilde y_s) = (N/\hat N)^2\cdot\frac{1}{2}\sum_{i\in s}\sum_{j\in s,\,j\neq i}\frac{\pi_i\pi_j-\pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2.\] 210

Remarks:
1. The model-assisted regression estimator often has the form \(\hat t_{reg} = \sum_{i=1}^N\hat y_i + \hat e_{HT}\).
2.
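Under SRS the two expressions for the regression estimator collapse to the familiar ratio estimator, which can be verified numerically. A minimal sketch on a hypothetical population:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical population (illustration only)
N, n = 500, 50
x = rng.uniform(1.0, 10.0, N)
y = 1.5 * x + rng.normal(0.0, 1.0, N)
t_x = x.sum()

s = rng.choice(N, size=n, replace=False)
xs, ys = x[s], y[s]
pi = n / N                                   # SRS inclusion probability, equal for all units

t_y_HT = ys.sum() / pi                       # Horvitz-Thompson estimates of t_y and t_x
t_x_HT = xs.sum() / pi
beta_hat = t_y_HT / t_x_HT                   # HT-based slope under the ratio model
t_reg = t_y_HT + beta_hat * (t_x - t_x_HT)   # regression estimator
print(t_reg)
```

With \(\hat\beta = \hat t_{y,HT}/\hat t_{x,HT}\), the expression reduces algebraically to \(\hat\beta\,t_x = (\bar y_s/\bar x_s)\,t_x\).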
The prediction approach makes it clear that there is no need to "estimate" the observed \(y_i\)'s.
3. Any estimator can be expressed on the prediction form
\[\hat t = \sum_{i\in s}y_i + \hat z_{\hat t},\quad\text{letting } \hat z_{\hat t} = \hat t - \sum_{i\in s}y_i.\]
4. One can then use this form to see whether the estimator makes sense. 211