Analysis of survey data, AN. Part 1
Stockholms universitet 2008
Dan Hedlin
Acknowledgement: Milorad Kovacevic
Analysis of (complex) survey data
What is survey data? What makes data complex?
“A survey concerns a set of objects that comprises a population.”
(Dalenius, from Biemer & Lyberg, 2003, p. 2)
In this context, the term “complex” usually refers to “not iid”, i.e. not
independent and identically distributed. For example, a
heterogeneous population that calls for different models in
different strata, or a population with clusters.
“Complex survey data” refers to micro data, as opposed to macro
data (a collection of point estimates)
Analysis and inference?
What is analysis? Not a well-defined term. In this context, analysis
is usually when you are interested in model parameters rather than
finite population parameters.
What is statistical inference?
“The development of generalization from sample data, usually
with calculated degrees of uncertainty”
(Dictionary of epidemiology)
Not a definition but still a good description of the term statistical
inference.
What characterises survey data?
• Refers often to heterogeneous populations
• May be a wealth of auxiliary variables
• Often repeated samples
• Often large populations
• Often large samples (n > 10 000)
• Units in samples drawn with varying inclusion probabilities
• Aim of analysis is sometimes unclear or debatable
Some history
• Cochran (1977, and earlier editions): omit the finite population correction in certain analyses
• Kish (1965): design effect
• Kish and Frankel (1974): large simulation study
• 1980–: papers by Gad Nathan, Tim Holt, Danny Pfeffermann and others
• Skinner, Holt and Smith (1989): first book devoted solely to the subject
• 1992–: textbooks with one or two chapters on analysis of survey data (e.g. the “Yellow book” by Särndal, Swensson and Wretman; Lohr)
• Chambers and Skinner (2003): Analysis of Survey Data
Finite population
Examples: the population of Sweden at a certain point in time; the
population of businesses with at least five employees; the
municipalities of Sweden
A variable is the set of values of some characteristic associated
with the population objects; e.g. income. Formally
$y = (y_1, y_2, \ldots, y_N)$, where $N$ is the population size.
Or a vector $\mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N)$, where $\mathbf{y}_j = (y_{j1}, y_{j2}, \ldots, y_{jN})$
The population values are completely known if there is a census with no
nonresponse and no measurement errors
Here: finite population = actual and definite population
Finite population parameter
A finite population parameter is a function of y. For example, mean
income. Or a function of two variables; for example, the
correlation of income and tax:
$$\rho_{xy} = \frac{S_{xy}}{S_y S_x}, \quad \text{where } S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2,$$
with
$$\bar{y}_U = \frac{1}{N} \sum_{i=1}^{N} y_i \quad \text{and} \quad S_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)$$
Descriptive aim, if we are interested in finite population
parameters. “Descriptive population quantity” (Pfeffermann 1993)
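As a minimal sketch of what "a function of the population values" means in practice, the finite population mean and correlation can be computed directly when every value is known. The five income and tax figures below are made up for illustration, and the function names are hypothetical.

```python
import math

def population_mean(values):
    """Finite population mean: sum of all N values divided by N."""
    return sum(values) / len(values)

def population_covariance(x, y):
    """S_xy with divisor N - 1, computed over the whole population."""
    n_pop = len(x)
    x_bar, y_bar = population_mean(x), population_mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n_pop - 1)

def population_correlation(x, y):
    """rho_xy = S_xy / (S_x * S_y)."""
    s_x = math.sqrt(population_covariance(x, x))
    s_y = math.sqrt(population_covariance(y, y))
    return population_covariance(x, y) / (s_x * s_y)

# Hypothetical population of N = 5 units (illustrative values only)
income = [200.0, 250.0, 300.0, 400.0, 600.0]
tax = [40.0, 55.0, 70.0, 100.0, 170.0]

mean_income = population_mean(income)
rho = population_correlation(tax, income)
```

No distributional assumption enters anywhere: the quantities are fixed numbers determined by the population values.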
Model parameter
$Y$ has probability density function (pdf) $f_Y(y)$. We may be
interested in $\mu = E(Y)$. This is a model parameter. Analytic aim.
Note that this is different from the finite population parameter:
• There is no actual and definite population of size $N$; this population is conceptual
• $\mu = E(Y)$ is defined in terms of a pdf
• A finite population parameter is not model dependent, i.e. it does not depend on a distribution or pdf
Contingency table example
Consider an r x c table with proportions or probabilities.
1. Proportions of people in an actual population => descriptive aim
2. Cell probabilities, with probabilities from a certain model =>
analytic aim
How do we know whether we are interested in a model parameter
or a finite population parameter?
Example (Thompson 1997, p. 199):
Smokers in Ontario…
…whether they have smoked brand A during October => probably
descriptive aim (how many smokers in this definite population?)
…whether they have switched brands in the month prior to the
survey => then we may be interested in the probability that a
randomly selected smoker, in Ontario or a larger area, will switch
brands in some future month under similar conditions
Another analytic aim of use of survey data
A shoe manufacturer may be interested in the properties of soles
and glue, e.g. what combination offers best quality. Is this
manufacturer interested in the 1000 shoes that were produced last
month, and from which a sample was taken? Most likely not.
This manufacturer will be interested in the shoes that will be
produced under similar conditions. Note that this is a finite
population too, but a conceptual one. For example, its size is not
determined.
Common analytic aims
• To establish theory about associations (relationships, causal links, etc.) between the variables
• To assess the likely impact of policy changes or to make predictions about the possible consequences of a “no change” policy
• To draw conclusions that hold beyond the population at the time it was sampled
A more difficult example
We work for a national statistical institute (NSI) as producers of
official statistics. A labour force survey produces estimates of the
number of people employed last month. Using personal identification
numbers, we match the labour force survey data with education
data. We make a contingency table of employment status vs level
of education. Are we interested in the finite population parameters
in the population of Sweden last month, or in some model
parameters?
Note two “dimensions” in target:
• Analysis at the level of the actual and definite population or at the level of a conceptual population
• Model parameter or a finite population parameter

                              | Finite population parameter    | Model parameter
Actual, definite population   | e.g. books by Cochran and Lohr |
Conceptual population         | Via a superpopulation          | e.g. analysis of variance

NB: aim, not type of inference
The distinction between descriptive and analytic questions/aims is
often hard to make:
• Many descriptive questions can be expressed as modelling, thus analytic, questions.
• Sometimes analytic questions can be constrained to the relationships in a finite population.
• Distinction is useful but is not a barrier.
The difference may not be large or crucial. For a large population,
finite population parameters and “corresponding” model
parameters may be very similar.
[analytic questions] involve stochastic models that attempt to
represent the associations that the descriptive statistics portray
(Skinner, Holt and Smith, 1989)
Corresponding descriptive population quantity (CDPQ)
(Pfeffermann 1993)
Model: $Y_i = \beta x_i + \varepsilon_i$, with the $Y_i$ independent and $Y_i \sim N(\beta x_i, \sigma^2)$
Superpopulation: $\beta$
Finite population: $B$
Sample: $\hat{B}$
At the level of the conceptual population:
Estimating $\beta$ from a random sample of size $n$:
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
Define CDPQ $B$:
$$B = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$$
This is an “estimate” of $\beta$ if the $N$ units are viewed as a random
sample from the superpopulation distribution. Note that $B$ is not model
dependent.
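The three levels (superpopulation, finite population, sample) can be sketched in a few lines. This is an illustration under assumed values: the slope 2, the residual standard deviation 1, the sizes N = 10 000 and n = 200, the uniform x-values and the function name are all chosen here for the example and are not taken from the slides.

```python
import random

def slope_through_origin(x, y):
    """Least-squares slope for y = b * x: sum(x * y) / sum(x ** 2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

rng = random.Random(2008)      # fixed seed so the sketch is reproducible
beta, sigma = 2.0, 1.0         # assumed superpopulation parameters
N, n = 10_000, 200             # assumed population and sample sizes

# Finite population generated from the superpopulation model
x_pop = [rng.uniform(1.0, 10.0) for _ in range(N)]
y_pop = [beta * xi + rng.gauss(0.0, sigma) for xi in x_pop]

# CDPQ B: the same formula applied to all N units; once the population
# values are fixed, B is just a number and needs no model
B = slope_through_origin(x_pop, y_pop)

# Simple random sample without replacement; with equal inclusion
# probabilities the weights cancel in the ratio
s = rng.sample(range(N), n)
B_hat = slope_through_origin([x_pop[i] for i in s], [y_pop[i] for i in s])
```

With a population this large, `B` should lie very close to `beta`, while `B_hat` carries the larger sampling error.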
Formal definition of a CDPQ
Pfeffermann (1993)
Let $Y = (Y_1, Y_2, \ldots, Y_N)$ be a vector of random variables from a
family of distributions, indexed by a vector $\theta$ ($\beta$ in the example)
Let $R_Y(\theta)$ be an estimation rule
Here: $R_Y(\beta) = \min_{\beta} \sum_{i=1}^{N} (y_i - \beta x_i)^2$
The estimation rule leads to some estimating equations $U(Y, \theta) = 0$
Here: the normal equations
The solution $T(Y)$ to $U(Y, \theta) = 0$ (i.e. the quantity that satisfies
$U(Y, T(Y)) = 0$) is the CDPQ for $\theta$ under the estimation rule.
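For the regression-through-the-origin example, the step from the estimation rule to the CDPQ can be written out. This is a worked sketch: the estimating equation is obtained by differentiating the sum of squares with respect to $\beta$ and dropping the constant factor $-2$.

```latex
\begin{align*}
R_Y(\beta) &= \min_{\beta} \sum_{i=1}^{N} (y_i - \beta x_i)^2 \\
U(Y, \beta) &= \sum_{i=1}^{N} x_i (y_i - \beta x_i) = 0 \\
T(Y) &= \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2} = B
\end{align*}
```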
$$t_s - \theta = (t_s - T) + (T - \theta) = O(n^{-0.5}) + O(N^{-0.5}) = O(n^{-0.5})$$
In the example, $t_s = \hat{B}$, $T = B$, $\theta = \beta$
Suggests that the “superpopulation error” $T - \theta$ is negligible compared
to the sampling error $t_s - T$ when $n$ is much smaller than $N$
One interpretation: if you find an estimator $t_s$ that gives a decent
estimate of $B$, then you also have a decent estimate of $\beta$
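A small Monte Carlo sketch can make the two error components visible. All settings are assumptions made for this illustration (slope 2, residual standard deviation 1, N = 2000, n = 50, 200 replications, simple random sampling), and the helper names are hypothetical.

```python
import math
import random

def slope_through_origin(x, y):
    """sum(x * y) / sum(x ** 2), used for both B and B-hat."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def rms(values):
    """Root mean square of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))

rng = random.Random(1993)
beta, sigma = 2.0, 1.0                 # assumed model parameters
N, n, replications = 2000, 50, 200     # assumed sizes

sampling_errors, superpop_errors = [], []
for _ in range(replications):
    # New finite population from the superpopulation model
    x_pop = [rng.uniform(1.0, 10.0) for _ in range(N)]
    y_pop = [beta * xi + rng.gauss(0.0, sigma) for xi in x_pop]
    B = slope_through_origin(x_pop, y_pop)
    # Simple random sample without replacement from that population
    s = rng.sample(range(N), n)
    B_hat = slope_through_origin([x_pop[i] for i in s], [y_pop[i] for i in s])
    sampling_errors.append(B_hat - B)   # t_s - T
    superpop_errors.append(B - beta)    # T - theta

rms_sampling = rms(sampling_errors)
rms_superpop = rms(superpop_errors)
```

In this setting `rms_sampling` should come out several times larger than `rms_superpop`, in line with the orders $n^{-0.5}$ and $N^{-0.5}$.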
Summary; aim
Choices to make
• In terms of parameter, finite population parameter or model parameter
• In terms of population, actual and definite population or a conceptual one
Can have both:
A model parameter and a finite population parameter that
“correspond” to each other
Design-based vs model-based inference
Model-based inference is “ordinary” frequentist inference. For
example, a model is assumed or fitted, and ML estimates
calculated. Aim: to estimate a model parameter in a conceptual
population.
Design-based inference is frequentist, but different (see following
pages)
Recap of design-based survey sampling for
finite populations
• Finite population of $N$ units, conceptually labelled 1, 2, 3, …, $N$
• This is not just notation: without labels design-based inference is problematic (Thompson 1997, p. 147)
• Population values of one variable are denoted $y = (y_1, y_2, \ldots, y_N)$
• For design-based inference, $y = (y_1, y_2, \ldots, y_N)$ is regarded as fixed
• Widely used at NSIs
Design-based randomness
• A sample $s$ is a subset of the population $U = \{1, 2, \ldots, N\}$. The collection of all possible samples $s$ is denoted by $S$.
• Randomness comes from the sample design; what is perceived as random is the sample that happens to be drawn
• A probability is associated with each possible sample $s$ and interpreted as the probability that $s$ is drawn: $p(s) \ge 0$ and $\sum_{s \in S} p(s) = 1$
The function $p(s)$ is referred to as the sampling design.
Example: simple random sampling without replacement of size $n$ is
defined as the sampling design where all $p(s)$ are equal for the samples of size $n$.
Hence $p(s) = 1 \big/ \binom{N}{n}$
Can show that the probability to draw unit $i$ is $n/N$
In general: the probability that unit $i$ is included in the sample is referred to as the inclusion
probability, sometimes denoted by $\pi_i$
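For a very small population the whole sampling design can be enumerated, which makes the statements about $p(s)$ and the inclusion probability easy to check. The sketch below assumes N = 5 and n = 2; the function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def srs_design(N, n):
    """Map every sample of size n (units labelled 1..N) to its probability p(s)."""
    p = Fraction(1, comb(N, n))
    return {s: p for s in combinations(range(1, N + 1), n)}

def inclusion_probability(design, unit):
    """Sum of p(s) over all samples s that contain the unit."""
    return sum(p for s, p in design.items() if unit in s)

N, n = 5, 2                               # assumed sizes for the illustration
design = srs_design(N, n)
total_probability = sum(design.values())  # the p(s) must sum to 1
pi = {i: inclusion_probability(design, i) for i in range(1, N + 1)}
```

Every entry of `pi` equals `Fraction(n, N)`, the value n/N stated above.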
Design-based estimation of finite population
parameters
Estimating the total of $y$, i.e. $t = \sum_{i=1}^{N} y_i$:
$$\hat{t} = \sum_{i=1}^{n} \frac{y_i}{\pi_i}$$
the Horvitz-Thompson estimator (HT estimator)
Alternatively
$$\hat{t} = \sum_{i=1}^{n} w_i y_i$$
where $w_i = 1/\pi_i$ are design weights (base weights)
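A minimal sketch of the HT estimator follows, together with a check of its design expectation under simple random sampling by enumerating all samples. The population values and the function name are made up for the illustration.

```python
from fractions import Fraction
from itertools import combinations

def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate of a total: sum of y_i / pi_i over the sample."""
    return sum(Fraction(y) / Fraction(p) for y, p in zip(y_sample, pi_sample))

# Hypothetical population of N = 5 values, samples of size n = 2
y_pop = [1, 2, 3, 4, 10]
N, n = len(y_pop), 2
pi = Fraction(n, N)            # inclusion probability under SRS

# All possible samples are equally likely, so the design expectation
# is the plain average of the estimates over the samples
samples = list(combinations(range(N), n))
estimates = [ht_total([y_pop[i] for i in s], [pi] * n) for s in samples]
design_expectation = sum(estimates) / len(samples)
```

The average over all samples equals the true total `sum(y_pop)`, which is what design-unbiasedness means here.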
A design-based interpretation: unit i represents wi units including
itself.
Representation principle: unit i must represent (about) wi units
including itself, otherwise the HT estimator will be poor (Brewer
1999, Basu’s (1971) elephants)
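The design-based variance of an estimator $\hat{\theta}$:
$$V_p(\hat{\theta}) = E_p\left[\left(\hat{\theta} - E_p(\hat{\theta})\right)^2\right] = \sum_{s \in S} p(s) \left[\hat{\theta}_s - E_p(\hat{\theta})\right]^2$$
For example, for the HT estimator:
$$V(\hat{t}) = \sum_{i=1}^{N} \sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}$$
where $\pi_{ij} = P(i \in s, j \in s)$ is the second order inclusion probability (with $\pi_{ii} = \pi_i$)
For the HT estimator and simple random sampling:
$$V(\hat{t}) = N^2 \, \frac{1 - n/N}{n} \, S_y^2$$
with $S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2$, where $\bar{y}_U = \frac{1}{N} \sum_{i=1}^{N} y_i$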
The finite population covariance matrix
Finite population parameter:
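$$S_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)$$
with some algebra:
$$S_{xy} = \frac{1}{N-1} t_{xy} - \frac{1}{N(N-1)} t_x t_y$$
where $t_x$, $t_y$ and $t_{xy}$ are the population totals of $x_i$, $y_i$ and $x_i y_i$
Estimator:
$$\tilde{S}_{xy} = \frac{1}{\hat{N}-1} \hat{t}_{xy} - \frac{1}{\hat{N}(\hat{N}-1)} \hat{t}_x \hat{t}_y$$
with
$$\hat{N} = \sum_{i=1}^{n} \frac{1}{\pi_i}, \quad \hat{t}_y = \sum_{i=1}^{n} \frac{y_i}{\pi_i}, \quad \text{and} \quad \hat{t}_{xy} = \sum_{i=1}^{n} \frac{x_i y_i}{\pi_i}$$
Equivalently,
$$\tilde{S}_{xy} = \frac{1}{\hat{N}-1} \sum_{i=1}^{n} \frac{(x_i - \tilde{x}_s)(y_i - \tilde{y}_s)}{\pi_i}$$
with $\tilde{y}_s = \frac{1}{\hat{N}} \sum_{i=1}^{n} \frac{y_i}{\pi_i}$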
Estimators for other finite population parameters are given in the “Yellow book” (Särndal, Swensson and Wretman, 1992)
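The covariance estimator can be sketched in both of its forms: built from estimated totals, and as a weighted sum of cross-products of deviations. The deviation form is assumed here to carry the weights $1/\pi_i$, which is what makes the two forms agree algebraically. The data, the inclusion probabilities and the function names are made up for the illustration.

```python
def weighted_total(values, pi):
    """HT-type estimated total: sum of value_i / pi_i over the sample."""
    return sum(v / p for v, p in zip(values, pi))

def cov_from_totals(x, y, pi):
    """Covariance estimator built from estimated totals and estimated N."""
    n_hat = weighted_total([1.0] * len(pi), pi)
    t_x, t_y = weighted_total(x, pi), weighted_total(y, pi)
    t_xy = weighted_total([a * b for a, b in zip(x, y)], pi)
    return t_xy / (n_hat - 1) - t_x * t_y / (n_hat * (n_hat - 1))

def cov_from_deviations(x, y, pi):
    """Same estimator as a weighted sum of deviations from the weighted means."""
    n_hat = weighted_total([1.0] * len(pi), pi)
    x_tilde = weighted_total(x, pi) / n_hat
    y_tilde = weighted_total(y, pi) / n_hat
    dev = [(a - x_tilde) * (b - y_tilde) for a, b in zip(x, y)]
    return weighted_total(dev, pi) / (n_hat - 1)

# Hypothetical sample with unequal inclusion probabilities
x = [40.0, 55.0, 70.0, 100.0]
y = [200.0, 250.0, 300.0, 400.0]
pi = [0.1, 0.2, 0.2, 0.5]

s_totals = cov_from_totals(x, y, pi)
s_deviations = cov_from_deviations(x, y, pi)
```

When every $\pi_i$ equals 1 (a census), both functions reduce to the finite population covariance $S_{xy}$.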
Finite population regression coefficient
What is estimated is the “census fit”, which is a finite population
parameter:
$$B = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$$
Estimator:
$$\hat{B} = \frac{\sum_{i=1}^{n} q_i w_i x_i y_i}{\sum_{i=1}^{n} q_i w_i x_i^2}$$
$q_i = 1/\sigma_i^2$ (with $\sigma_i^2$ viewed as the variance of the residuals, just as in “ordinary,
weighted, regression analysis”, or $q_i$ viewed as an unspecified weight, in
which case often $q_i = 1$)
The numerators and the denominators can be viewed as totals and
HT-estimators of them
$$\hat{t}_{qxy} = \sum_{i=1}^{n} \frac{q_i x_i y_i}{\pi_i}, \quad \text{where } w_i = \frac{1}{\pi_i}$$
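A minimal sketch of the weighted slope estimator as a ratio of two HT-type estimated totals is given below. The function name and the data are hypothetical; `q` defaults to 1 for every unit, matching the "unspecified weight" case.

```python
def weighted_slope(x, y, pi, q=None):
    """Estimate B as the ratio of two HT-type totals.

    Numerator:   sum of q_i * x_i * y_i / pi_i
    Denominator: sum of q_i * x_i ** 2 / pi_i
    """
    if q is None:
        q = [1.0] * len(x)     # default: q_i = 1 for all sampled units
    numerator = sum(qi * xi * yi / p for qi, xi, yi, p in zip(q, x, y, pi))
    denominator = sum(qi * xi * xi / p for qi, xi, p in zip(q, x, pi))
    return numerator / denominator

# Hypothetical sample with unequal inclusion probabilities
x = [1.0, 2.0, 4.0, 5.0]
y = [2.1, 3.9, 8.2, 9.8]
pi = [0.5, 0.25, 0.25, 0.1]
B_hat = weighted_slope(x, y, pi)
```

If all inclusion probabilities are 1 and all `q` are 1, the function returns the census fit $B$ itself.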
Summary, design-based inference
• What is perceived as random is the sample that you happen to obtain
• In a census there is nothing random (except for measurement errors and nonresponse)
• The HT estimator is a basic estimator that is a component of more complicated estimators
• The design weights play a crucial role
References
Basu, D. (1971). An Essay on the Logical Foundations of Survey Sampling. In Foundations of Statistical Inference (eds. V.P. Godambe and D.A. Sprott). Toronto: Holt, Rinehart and Winston, 203–242.
Biemer, P.P. and Lyberg, L.E. (2003). Introduction to Survey Quality. New York: Wiley.
Brewer, K.R.W. (1999). Design-Based or Prediction-Based Inference? Stratified Random vs Stratified Balanced Sampling. International Statistical Review, 67, 35–47.
Chambers, R.L. and Skinner, C.J. (eds.) (2003). Analysis of Survey Data. Chichester: Wiley.
Cochran, W.G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.
Folsom, R., LaVange, L. and Williams, R.L. (1989). A Probability Sampling Perspective on Panel Data Analysis. In Panel Surveys (eds. D. Kasprzyk, G.J. Duncan, G. Kalton and M.P. Singh). New York: Wiley, 108–138.
Kish, L. (1987). Statistical Design for Research. New York: Wiley.
Kish, L. and Frankel, M.R. (1974). Inference from Complex Samples (with discussion). Journal of the Royal Statistical Society, Series B, 36, 1–37.
Lehtonen, R. and Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd ed. New York: Wiley.
Lohr, S. (1999). Sampling: Design and Analysis. Pacific Grove, CA: Duxbury.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317–337.
Skinner, C.J., Holt, D. and Smith, T.M.F. (eds.) (1989). Analysis of Complex Surveys. Chichester: Wiley.
Statistiska centralbyrån (2008). Urval – från teori till praktik. ISSN 1654-7268.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Thompson, M.E. (1997). Theory of Sample Surveys. London: Chapman & Hall.