Analysis of survey data, AN. Part 1
Stockholm University, 2008
Dan Hedlin
Acknowledgement: Milorad Kovacevic

Analysis of (complex) survey data

- What is survey data? What makes data complex?
- “A survey concerns a set of objects that comprises a population.” (Dalenius, from Biemer & Lyberg, 2003, p. 2)
- In this context, the term “complex” usually refers to “not iid”, that is, not independently and identically distributed. Examples are a heterogeneous population that calls for different models in different strata, and a population with clusters.
- “Complex survey data” refers to micro data, as opposed to macro data (a collection of point estimates).

Analysis and inference?

- What is analysis? It is not a well-defined term. In this context, analysis usually means that you are interested in model parameters rather than finite population parameters.
- What is statistical inference? “The development of generalization from sample data, usually with calculated degrees of uncertainty” (Dictionary of epidemiology). This is not a definition, but it is still a good description of the term.

What characterises survey data?

- Often refers to heterogeneous populations
- There may be a wealth of auxiliary variables
- Often repeated samples
- Often large populations
- Often large samples (n > 10 000)
- Units in samples are drawn with varying inclusion probabilities
- The aim of the analysis is sometimes unclear or debatable

Some history

- Cochran (1977, and earlier editions): omit the finite population correction in certain analyses
- Kish (1965): design effect
- Kish and Frankel (1974): large simulation study
- 1980–: papers by Gad Nathan, Tim Holt, Danny Pfeffermann and others
- Skinner, Holt and Smith (1989): first book devoted solely to the subject
- 1992–: textbooks with one or two chapters on analysis of survey data (e.g. the “Yellow book”, Lohr)
- Chambers and Skinner (2003): Analysis of Survey Data

Finite population

- Examples: the population of Sweden at a certain point in time; the population of businesses with at least five employees; the municipalities of Sweden.
- A variable is the set of values of some characteristic associated with the population objects, e.g. income. Formally, $\mathbf{y} = (y_1, y_2, \ldots, y_N)$, where $N$ is the population size. Or a vector $\mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N)$, where $\mathbf{y}_j = (y_{j1}, y_{j2}, \ldots, y_{jN})$.
- The finite population is completely known if there is a census, no nonresponse and no measurement errors.
- Here: finite population = actual and definite population.

Finite population parameter

- A finite population parameter is a function of $\mathbf{y}$, for example mean income.
- It can also be a function of two variables, for example the correlation of income and tax:

$$\rho_{xy} = \frac{S_{xy}}{S_x S_y},$$

where

$$S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2, \qquad \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U),$$

and $S_x^2$ and $\bar{x}_U$ are defined analogously.

- If we are interested in finite population parameters, the aim is descriptive. “Descriptive population quantity” (Pfeffermann 1993).

Model parameter

- $Y$ has probability density function (pdf) $f_Y(y)$. We may be interested in $E(Y)$. This is a model parameter, and the aim is analytic.
- Note that this is different from the finite population parameter:
  - There is no actual and definite population of size $N$; this population is conceptual.
  - $E(Y)$ is defined in terms of a pdf.
  - A finite population parameter is not model dependent, i.e. it does not depend on a distribution or pdf.

Contingency table example

Consider an r x c table with proportions or probabilities.

1. Proportions of people in an actual population => descriptive aim
2. Cell probabilities, with probabilities from a certain model => analytic aim

How do we know whether we are interested in a model parameter or a finite population parameter?

Example (Thompson 1997, p. 199): Smokers in Ontario…

- …whether they have smoked brand A during October => probably descriptive aim (how many smokers in this definite population?)
- …whether they have switched brands in the month prior to the survey => then we may be interested in the probability that a randomly selected smoker, in Ontario or a larger area, will switch brands in some future month under similar conditions

Another analytic aim of use of survey data

- A shoe manufacturer may be interested in the properties of soles and glue, e.g. which combination offers the best quality.
- Is this manufacturer interested in the 1000 shoes that were produced last month, and from which a sample was taken? Most likely not. The manufacturer will be interested in the shoes that will be produced under similar conditions.
- Note that this is a finite population too, but a conceptual one. For example, its size is not determined.

Common analytic aims

- To establish theory about associations (relationships, causal links, etc.) between the variables
- To assess the likely impact of policy changes, or to make predictions about the possible consequences of a “no change” policy
- To draw conclusions that hold beyond the population at the time it was sampled

A more difficult example

- We work for a national statistical institute (NSI) as producers of official statistics.
- A labour force survey produces estimates of the number of employed last month.
- Using personal identification numbers, we match the labour force survey data with education data.
- We make a contingency table of employment status vs level of education.
- Are we interested in the finite population parameters in the population of Sweden last month, or in some model parameters?

Note two “dimensions” in the target:

- Population: analysis at the level of the actual and definite population, or at the level of a conceptual population
- Parameter: a model parameter or a finite population parameter

Population (rows) by parameter (columns: finite population parameter, model parameter):

- Actual, definite population: e.g. books by Cochran and Lohr
- Conceptual population: via a superpopulation, e.g.
  analysis of variance

NB: aim, not type of inference.

The distinction between descriptive and analytic questions/aims is often hard to make:

- Many descriptive questions can be expressed as modelling, thus analytic, questions.
- Sometimes analytic questions can be constrained to the relationships in a finite population.
- The distinction is useful but is not a barrier.

The difference may not be large or crucial

- For a large population, finite population parameters and “corresponding” model parameters may be very similar.
- [analytic questions] involve stochastic models that attempt to represent the associations that the descriptive statistics portray (Skinner, Holt and Smith, 1989)
- Corresponding descriptive population quantity (CDPQ) (Pfeffermann 1993)

Model

$$Y_i = \beta x_i + \varepsilon_i, \qquad Y_i \sim N(\beta x_i, \sigma^2) \text{ independently}$$

- Superpopulation: $\beta$
- Finite population: $B$
- Sample: $\hat{B}$

At the level of the conceptual population

Estimating $\beta$ from a random sample of size $n$:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

Define the CDPQ $B$:

$$B = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$$

This is an “estimate” of $\beta$ if the $N$ units are viewed as a random sample from the superpopulation distribution. Note that $B$ is not model dependent.

Formal definition of a CDPQ (Pfeffermann 1993)

- Let $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_N)$ be a vector of random variables from a family of distributions indexed by a vector $\theta$ ($\beta$ in the example).
- Let $R_Y(\theta)$ be an estimation rule. Here: $R_Y(\beta)$ is $\min_{\beta} \sum_{i=1}^{N} (y_i - \beta x_i)^2$.
- The estimation rule leads to some estimating equations $U(\mathbf{Y}, \theta) = 0$. Here: the normal equations.
- The solution $T(\mathbf{Y})$ to $U(\mathbf{Y}, \theta) = 0$ (i.e. the quantity that satisfies $U(\mathbf{Y}, T(\mathbf{Y})) = 0$) is the CDPQ for $\theta$ under the estimation rule.
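
The three levels in the regression-through-the-origin example can be illustrated with a small simulation. The following is a minimal Python sketch, not part of the original slides: it generates a finite population from the superpopulation model, computes the CDPQ $B$ as the census fit, and estimates it from a simple random sample. The values of `beta`, `sigma`, `N` and `n`, the uniform distribution of the $x_i$, and the function names are all illustrative assumptions.

```python
import random


def slope_through_origin(x, y):
    """Least-squares slope of a line through the origin: sum(x*y) / sum(x**2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def simulate(beta=2.0, sigma=1.0, N=10_000, n=200, seed=1):
    """Return (beta, B, B_hat) for one simulated population and one sample."""
    rng = random.Random(seed)
    # Superpopulation model (assumed values): Y_i = beta * x_i + eps_i,
    # with eps_i ~ N(0, sigma^2) and x_i drawn uniformly on [1, 10].
    x = [rng.uniform(1.0, 10.0) for _ in range(N)]
    y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
    # CDPQ B: the estimation formula applied to all N units (the census fit).
    B = slope_through_origin(x, y)
    # Simple random sample without replacement. With equal inclusion
    # probabilities the unweighted sample formula is used to estimate B.
    s = rng.sample(range(N), n)
    B_hat = slope_through_origin([x[i] for i in s], [y[i] for i in s])
    return beta, B, B_hat


if __name__ == "__main__":
    beta, B, B_hat = simulate()
    print(f"beta = {beta}, B = {B:.4f}, B_hat = {B_hat:.4f}")
```

With a population this large one would expect $B$ to lie much closer to $\beta$ than $\hat{B}$ does to $B$, since $B$ is computed from all $N$ units and $\hat{B}$ from only $n$ of them.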

$$t_s - \theta = (t_s - T) + (T - \theta) = O(n^{-0.5}) + O(N^{-0.5}) = O(n^{-0.5})$$

- In the example, $t_s = \hat{B}$, $T = B$ and $\theta = \beta$.
- This suggests that the “superpopulation error” $T - \theta$ is negligible compared to the sampling error $t_s - T$ when $n$ is much smaller than $N$.
- One interpretation: if you find an estimator $t_s$ that gives a decent estimate of $B$, then you also have a decent estimate of $\beta$.

Summary; aim

Choices to make:

- In terms of parameter: finite population parameter or model parameter
- In terms of population: actual and definite population or a conceptual one
- Can have both: a model parameter and a finite population parameter that “correspond” to each other

Design-based vs model-based inference

- Model-based inference is “ordinary” frequentist inference. For example, a model is assumed or fitted, and ML estimates are calculated. Aim: to estimate a model parameter in a conceptual population.
- Design-based inference is frequentist too, but different (see the following slides).

Recap of design-based survey sampling for finite populations

- Finite population $U$ of $N$ units, conceptually labelled 1, 2, 3, …, $N$
- This is not just notation: without labels, design-based inference is problematic (Thompson 1997, p. 147)
- The population values of one variable are denoted $\mathbf{y} = (y_1, y_2, \ldots, y_N)$
- For design-based inference, $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ is regarded as fixed
- Widely used at NSIs

Design-based randomness

- A sample $s$ is a subset of $U$. The collection of all possible samples $s$ is denoted by $S$.
- Randomness comes from the sample design; what is perceived as random is the sample that happens to be drawn.
- A probability is associated with each possible sample $s$ and interpreted as the probability that $s$ is drawn:

$$p(s) \ge 0 \quad \text{and} \quad \sum_{s \in S} p(s) = 1$$

The function $p(s)$ is referred to as the sampling design.
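
For a very small population, a sampling design can be written out in full. The following minimal Python sketch (an illustration, not part of the original slides) enumerates every possible sample of size 2 from a population of 4 labelled units, assigns each the same probability, and then computes, for each unit, the probability that it ends up in the sample. It also averages an estimator over all possible samples while the population values stay fixed. The population size, the sample size, the `y` values and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations


def srs_design(N, n):
    """Design p(s) that gives every size-n subset of {1, ..., N} the same probability."""
    samples = list(combinations(range(1, N + 1), n))
    p = Fraction(1, len(samples))
    return {s: p for s in samples}


def inclusion_probabilities(design, N):
    """Probability that unit i is drawn: sum of p(s) over all samples s containing i."""
    return {i: sum(p for s, p in design.items() if i in s) for i in range(1, N + 1)}


def design_expectation(design, estimator):
    """Expected value of estimator(s) when s is drawn according to the design."""
    return sum(p * estimator(s) for s, p in design.items())


if __name__ == "__main__":
    design = srs_design(4, 2)
    print("number of possible samples:", len(design))
    print("sum of p(s):", sum(design.values()))
    print("probability of drawing each unit:", inclusion_probabilities(design, 4))

    # Fixed population values (hypothetical); only the sample is random.
    y = {1: 10, 2: 20, 3: 30, 4: 60}
    sample_mean = lambda s: Fraction(sum(y[i] for i in s), len(s))
    print("design expectation of the sample mean:", design_expectation(design, sample_mean))
```

Exact fractions are used instead of floating point so that the check that the probabilities sum to one is not disturbed by rounding.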
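
Example: simple random sampling without replacement of size $n$ is defined as the sampling design where all $p(s)$ are equal. Hence

$$p(s) = 1 \Big/ \binom{N}{n}$$

It can be shown that the probability of drawing unit $i$ is $n/N$.

In general, the probability of drawing unit $i$ is referred to as the inclusion probability, sometimes denoted by $\pi_i$.

Design-based estimation of finite population parameters

Estimating the total of $y$, i.e. $t = \sum_{i=1}^{N} y_i$:

$$\hat{t} = \sum_{i=1}^{n} \frac{y_i}{\pi_i}$$

This is the Horvitz-Thompson estimator (HT estimator). Alternatively,

$$\hat{t} = \sum_{i=1}^{n} w_i y_i,$$

where $w_i = 1/\pi_i$ are the design weights (base weights).

- A design-based interpretation: unit $i$ represents $w_i$ units including itself.
- Representation principle: unit $i$ must represent (about) $w_i$ units including itself, otherwise the HT estimator will be poor (Brewer 1999, Basu’s (1971) elephants).

The design-based variance of an estimator $\hat{\theta}$:

$$V_p(\hat{\theta}) = E_p\left[\left(\hat{\theta} - E_p(\hat{\theta})\right)^2\right] = \sum_{s \in S} p(s)\left(\hat{\theta}_s - E_p(\hat{\theta})\right)^2$$

For example, for the HT estimator:

$$V(\hat{t}) = \sum_{i=1}^{N}\sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j},$$

where $\pi_{ij} = P(i \in s, j \in s)$ is the second order inclusion probability.

For the HT estimator and simple random sampling:

$$V(\hat{t}) = N^2\,\frac{1 - n/N}{n}\,S_y^2, \qquad \text{with } S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2, \text{ where } \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$$

The finite population covariance matrix

Finite population parameter:

$$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)$$

With some algebra:

$$S_{xy} = \frac{1}{N-1}\,t_{xy} - \frac{1}{N(N-1)}\,t_x t_y$$

Estimator:

$$\hat{S}_{xy} = \frac{1}{\hat{N}-1}\,\hat{t}_{xy} - \frac{1}{\hat{N}(\hat{N}-1)}\,\hat{t}_x \hat{t}_y$$

with

$$\hat{N} = \sum_{i=1}^{n} \frac{1}{\pi_i}, \qquad \hat{t}_y = \sum_{i=1}^{n} \frac{y_i}{\pi_i} \ (\hat{t}_x \text{ analogously}), \qquad \hat{t}_{xy} = \sum_{i=1}^{n} \frac{x_i y_i}{\pi_i}$$

Equivalently,

$$\hat{S}_{xy} = \frac{1}{\hat{N}-1}\sum_{i=1}^{n} \frac{(x_i - \tilde{x}_s)(y_i - \tilde{y}_s)}{\pi_i}, \qquad \text{with } \tilde{y}_s = \frac{1}{\hat{N}}\sum_{i=1}^{n} \frac{y_i}{\pi_i} \ (\tilde{x}_s \text{ analogously})$$

Estimators for other finite population parameters are given in the “Yellow book” (Särndal, Swensson and Wretman, 1992).

Finite population regression coefficient

What is estimated is the “census fit”, which is a finite population parameter:

$$B = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$$

Estimator:

$$\hat{B} = \frac{\sum_{i=1}^{n} q_i w_i x_i y_i}{\sum_{i=1}^{n} q_i w_i x_i^2}$$

Here $q_i = 1/\sigma_i^2$, where $\sigma_i^2$ is viewed as the variance of the residuals, just as in “ordinary” weighted regression analysis. Alternatively, $q_i$ is viewed as an unspecified weight, in which case often $q_i = 1$.

The numerator and the denominator can be viewed as HT estimators of totals, for example

$$\hat{t}_{qxy} = \sum_{i=1}^{n} \frac{q_i x_i y_i}{\pi_i}, \qquad \text{where } w_i = 1/\pi_i$$

Summary,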
design-based inference

- What is perceived as random is the sample that you happen to obtain.
- In a census there is nothing random (except for measurement and nonresponse).
- The HT estimator is a basic estimator that is a component of more complicated estimators.
- The design weights play a crucial role.

References

Basu, D. (1971). An Essay on the Logical Foundations of Survey Sampling. In Foundations of Statistical Inference (eds. V.P. Godambe and D.A. Sprott). Toronto: Holt, Rinehart and Winston, 203–242.
Biemer, P.P. and Lyberg, L.E. (2003). Introduction to Survey Quality. New York: Wiley.
Brewer, K.R.W. (1999). Design-Based or Prediction-Based Inference? Stratified Random vs Stratified Balanced Sampling. International Statistical Review, 67, 35–47.
Chambers, R.L. and Skinner, C.J. (eds.) (2003). Analysis of Survey Data. Chichester: Wiley.
Cochran, W.G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.
Folsom, R., LaVange, L. and Williams, R.L. (1989). A Probability Sampling Perspective on Panel Data Analysis. In Panel Surveys (eds. D. Kasprzyk, G.J. Duncan, G. Kalton and M.P. Singh). New York: Wiley, 108–138.
Kish, L. (1987). Statistical Design for Research. New York: Wiley.
Kish, L. and Frankel, M.R. (1974). Inference from Complex Samples (with discussion). Journal of the Royal Statistical Society, Series B, 36, 1–37.
Lehtonen, R. and Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd ed. New York: Wiley.
Lohr, S. (1999). Sampling: Design and Analysis. Pacific Grove, CA: Duxbury.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–337.
Skinner, C.J., Holt, D. and Smith, T.M.F. (eds.) (1989). Analysis of Complex Surveys. Chichester: Wiley.
Statistiska centralbyrån (2008). Urval – från teori till praktik. ISSN 1654-7268.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Thompson, M.E. (1997).
Theory of Sample Surveys. London: Chapman & Hall.