Sources of Variation - Statistics

Author(s): Kerby Shedden, Ph.D., 2010
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution Share Alike 3.0 License:
http://creativecommons.org/licenses/by-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis
or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please
speak to your physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
1 / 42
Sources of Variation
Kerby Shedden
Department of Statistics, University of Michigan
April 8, 2011
2 / 42
Populations
The population is the set of all units (i.e. people, rats, cars) that are of
interest in a research study.
Examples:
1. If we are retrospectively studying voter participation in a past US
presidential election, the population is everyone who was eligible to vote
in the election.
2. If we are studying factors that were associated with a voter’s decision
about which candidate to vote for in a past US election, the population
would be everyone who actually voted in the election.
3. If we are studying patients’ responses to a new type of leukemia therapy,
the population is everyone with leukemia (or everyone with leukemia who
is in some sense “eligible” for the treatment).
4. If we are studying energy metabolism of Sprague Dawley rats, the
population is the set of all Sprague Dawley rats.
3 / 42
Samples and sampling variation
A sample is a selected subset of a population.
A sample will differ from the population, and different samples will
generally differ from each other.
The goal of working with a sample is to identify characteristics of the
sample that are likely to generalize to the population.
Since a sample only allows us to estimate properties of the population, we
must assess our confidence that any generalizations we make are correct.
Variation in our estimates due to the process of sampling is called
“sampling variation.”
4 / 42
Sampling variation
For example, suppose the population consists of the three values x =
1, 2, 5, with a population mean of EX = 8/3. We sample two of the
values (“without replacement”), and use the sample mean to estimate
the population mean. The following table gives the possible results.
Sample    Estimated mean
1, 2      1.5
1, 5      3.0
2, 5      3.5
The variation among 1.5, 3.0, and 3.5 is the sampling variability of the
sample mean X̄ .
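
As a quick check (a minimal Python sketch; Python is not part of these slides), the table can be reproduced by enumerating every size-2 sample drawn without replacement:

    from itertools import combinations

    population = [1, 2, 5]
    print("population mean:", sum(population) / len(population))   # 8/3

    # Every possible sample of size 2, drawn without replacement, and its sample mean.
    for sample in combinations(population, 2):
        print(sample, sum(sample) / 2)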
5 / 42
Measurement error
What if we are unable to measure the quantity we are interested in
exactly?
For example, when we sample a unit with X = 1, suppose we observe
Y = 1 + ε, where ε has a normal distribution with mean 0 and standard
deviation σ = 0.2. Then we might get
Ideal sample (X)    Actual sample (Y)    Measurement error    Estimated mean (Ȳ)
1, 2                0.71, 2.05           -0.29, 0.05          1.38
1, 5                1.03, 5.13           0.03, 0.13           3.08
2, 5                1.86, 5.31           -0.14, 0.31          3.59
The variation among 1.38, 3.08, and 3.59 is a combination of sampling
and measurement variability.
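
A rough simulation sketch of this setup (Python, assuming the measurement-error model above: observed value = true value + N(0, 0.2²) noise); the random draws will not reproduce the exact numbers in the table:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 0.2                      # measurement error SD used in the example

    ideal_sample = np.array([1.0, 2.0])                       # one of the ideal samples above
    observed = ideal_sample + rng.normal(0.0, sigma, size=2)   # add random measurement error

    print("ideal mean:   ", ideal_sample.mean())
    print("observed mean:", observed.mean())   # reflects sampling + measurement variability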
6 / 42
Random and systematic measurement error
Measurement error is a random variable and hence has both a mean and
a variance.
If the measurement error has mean zero, the measurement is correct on
average, and the measurement error is purely random.
If the measurement error has a nonzero mean, there is a systematic
component to the measurement error (measurement bias).
Some common working assumptions about measurement error:
- The measurement error is independent across units.
- The measurement error is independent of the underlying true value being sampled (e.g. if measuring heights, the measurement error variance is the same for tall and short people).
7 / 42
Sampling and measurement variability in practice
When working with humans or experimental animals, people often talk
about biological variability and technical variability.
For example, suppose we are measuring insulin levels in rat serum.
- Some rats truly have higher circulating insulin levels than other rats. This is biological variability.
- The assay used to measure insulin levels is imprecise, and one measurement may by chance be higher or lower than the true value. This is technical variability.
8 / 42
Sampling and measurement variability in practice
Here are some possible sources of measurement error in a survey asking
about voting behavior:
- Voters may incorrectly remember whether they voted, or who they voted for.
  - Innocent errors of recall may be purely random, but it is also possible that supporters of a particular candidate are more susceptible to such errors.
- Voters may deliberately misstate whether they voted, or who they voted for.
9 / 42
Sampling and measurement variability in practice
Suppose we are studying whether a certain drug benefits patients with a
particular form of cancer. Depending on our goals, we will select an
appropriate endpoint for our analysis. Depending on the endpoint, we
will have different sources of measurement error:
- If the endpoint is tumor size, the errors result because the techniques used to measure tumor size have limited accuracy.
- If the endpoint is a quality of life symptom, like pain, anxiety, sleeplessness, skin rashes, etc., measurement error results from recall errors, deliberate misstatements, and differing interpretations of questions and concepts (e.g. a concept like “severe pain” will mean different things to different people).
- If the endpoint is overall survival from diagnosis (a “hard endpoint”), the measurement is usually made without appreciable error.
10 / 42
Blocking
Suppose there are factors influencing the quantity we are studying that
we cannot control.
For example, we may be interested in which of two types of crop plant
produces a greater yield (e.g. of corn). If we have 10 plots of land to use
for our study, how should we proceed?
One possibility is to randomly assign five plots of land to be planted with
one type of plant, and have the other five plots of land be planted with
the other type of plant.
However, it is likely that some plots are more fertile than others (due to
better water, drainage, sunlight, soil conditions, etc.) – these are
ordinarily factors we cannot control.
A better approach is to divide each plot of land in half, and randomly
assign one of the half-plots to be planted with one type of plant, with the
other half-plot being planted with the other type of plant.
This strategy is called blocking.
11 / 42
Blocking
Blocking reduces the influence of uncontrolled factors on the results of a
study. For example, if one plot of land is more favorable for crop yields,
having half of it planted with each crop type ensures that no advantage is
given to either plant type.
Blocking also allows us to apply a modified form of comparison that gives
us more power in many situations.
The paired t-test is the most basic analysis that uses blocking. If the goal
is to analyze the difference in expected values between populations X
and Y (e.g. to make inferences about EX − EY ) the paired t-test can be
applied when the sample can be grouped into homogeneous pairs, with
each pair consisting of one observation from the X population and one
observation from the Y population.
12 / 42
Paired t-test
Examples: Paired t-tests can be used in the following situations:
- A “split plot study” (e.g. in an agricultural comparison), where each primary plot is partitioned into two sub-plots that are treated with the two treatments being compared.
- A “before/after” comparison, where each individual is his or her own control, e.g. we may measure a person’s blood pressure prior to a treatment, and again after the treatment, with the goal of estimating the treatment effect.
- Comparisons in clustered populations – for example, if we are interested in the difference between taking a standardized test on a computer, or on a paper exam sheet. We might select pairs of children from the same classroom, and randomly assign one of the children to the computer test, and one to the paper test.
13 / 42
Paired t-test
Blocking is most useful if the units within a block are very similar. For
example, suppose we are considering the effectiveness of a drug designed
to lower blood pressure.
We may have the following data:
[Figure: pre-treatment and post-treatment blood pressure (80-140) plotted against patient number (1-30).]
14 / 42
Paired t-test
The key feature of these data is that people who have higher than
average blood pressure before treatment tend to have higher than
average blood pressure after treatment, and similarly for people with
lower than average blood pressure:
[Figure: scatterplot of post-treatment blood pressure (90-140) against pre-treatment blood pressure (90-140).]
The similarities between pre-treatment and post-treatment blood pressure
are due to other risk factors (diet, exercise, smoking, genetics, . . .) that
are not affected by the treatment.
15 / 42
Paired t-test
The overall distributions for pre-treatment and post-treatment data are
similar. If we use the standard two-sample comparison, it will require a
large sample to have good power for detecting the difference:
[Figure: density estimates of pre-treatment and post-treatment blood pressure (90-150); the two distributions largely overlap.]
16 / 42
Paired t-test
But the differences between pre-treatment and post-treatment measures
are overwhelmingly positive. This suggests that there may be more
information in the data than we realize.
[Figure: density of pre-treatment minus post-treatment blood pressure; nearly all of the mass is above zero.]
17 / 42
Paired t-test
To carry out the paired t-test, let Xi and Yi be the pre-treatment and
post-treatment measurements for the i th subject, and let
Di = Xi − Yi
be the difference between them. The expected value of Di is
EDi = E (Xi − Yi ) = EXi − EYi = EX − EY .
The variance of Di is
var(Di ) = var(Xi ) + var(Yi ) − 2cov(Xi , Yi ),
which can be estimated as

σ̂_D² = σ̂_X² + σ̂_Y² − 2 ĉov(X, Y).
18 / 42
Paired t-test
Under the null hypothesis ED = EX − EY = 0, so the test statistic

√n · D̄ / σ̂_D

is large if there is a lot of evidence in the data that EY and EX are
different. A p-value can be obtained by comparing the test statistic to a
standard normal (or t) reference distribution.

Note that D̄ = X̄ − Ȳ, so the numerators of the paired and unpaired
two-sample statistics are the same. If the unpaired two-sample statistic is
used to analyze paired data (ignoring the pairing), we get

(X̄ − Ȳ) / √(σ̂_X²/n + σ̂_Y²/n) = √n · D̄ / √(σ̂_X² + σ̂_Y²).

So if ĉov(X, Y) > 0, the paired statistic is always larger than the
unpaired statistic, and hence has greater power.
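
A sketch comparing the two statistics on simulated paired data (hypothetical numbers, not the blood pressure data shown earlier); scipy's ttest_rel and ttest_ind are used only as a check on the hand-computed statistics:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 30

    # Post-treatment values track pre-treatment values, shifted down by a treatment effect.
    pre = rng.normal(120, 12, size=n)
    post = pre - 8 + rng.normal(0, 4, size=n)

    d = pre - post
    t_paired = np.sqrt(n) * d.mean() / d.std(ddof=1)
    t_unpaired = (pre.mean() - post.mean()) / np.sqrt(pre.var(ddof=1) / n + post.var(ddof=1) / n)

    print("paired t:  ", t_paired, stats.ttest_rel(pre, post).statistic)
    print("unpaired t:", t_unpaired, stats.ttest_ind(pre, post, equal_var=False).statistic)

In this simulated example the paired statistic is much larger, because the positive covariance between pre and post shrinks its denominator.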
19 / 42
Simple random samples
A simple random sample of size m is a subset of a population that is
generated in such a way that any subset of size m is equally likely to be
chosen as the sample.
Simple random samples are typically drawn “without replacement” (i.e.
the same unit may only appear in a sample once).
It is common to analyze statistical methods as if the sampling were done
“with replacement.” This way the data become an iid sample (iid stands
for “independent and identically distributed”). An iid sample is easier to
analyze than an SRS. For moderate or large samples, the results obtained
when treating the sampling as being with or without replacement are
very similar.
In practical terms, the only certain way to generate a simple random
sample is to use a list of all members of the population, and draw entries
from this list at random.
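
As a small illustration (hypothetical population list), numpy draws a simple random sample with replace=False and an iid sample with replace=True:

    import numpy as np

    rng = np.random.default_rng(2)
    population = np.arange(1000)     # stand-in for a list of all population members

    srs = rng.choice(population, size=50, replace=False)   # simple random sample
    iid = rng.choice(population, size=50, replace=True)    # iid sample (units can repeat)

    print(srs.mean(), iid.mean())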
20 / 42
Systematic sampling
A systematic sample is generated by following a fixed rule for determining
which units are in the sample.
For example, suppose we have access to the units in sequence, and every
k th unit in the sequence is sampled.
As a concrete example, we could include in our sample every 10th person
arriving at a hospital emergency room in an ambulance. This would be a
systematic sample of all patients who arrive at the hospital’s emergency
room by ambulance.
A systematic sample may approximate a simple random sample as long as
there are no trends or periodicities in the sequential arrangement of the
units.
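
A minimal sketch of the “every k-th unit” rule, assuming the units are available in their sequential order (the random starting point is a common refinement, not something stated on this slide):

    import numpy as np

    rng = np.random.default_rng(3)
    arrivals = np.arange(500)        # units in their sequential order of arrival

    k = 10
    start = rng.integers(k)          # random start, then every k-th unit thereafter
    systematic = arrivals[start::k]

    print(len(systematic), systematic[:5])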
21 / 42
Stratified sampling
Suppose we are studying a particular treatment for lung cancer, and our
goal is to estimate the treatment response, as measured on a scale from
1 to 100.
It may be that the responses of men and women differ. Since the goal of
the study is to estimate the overall treatment response, we aren’t
interested in sex-specific response as a research aim. But sex-specific
response may affect the precision of our measurement of overall
treatment response.
22 / 42
Stratified sampling
Suppose the treatment population consists of a fraction qf of females
and qm of males, where qf + qm = 1. Also suppose the sex-specific
treatment responses have the following means and standard deviations
Group     Mean    SD
Female    µf      σf
Male      µm      σm
Let X be the treatment response of a randomly-selected patient.
Applying the double expectation theorem, the overall expected treatment
response is
EX = Esex E (X |sex) = qf µf + qm µm .
23 / 42
Stratified sampling
The overall variance in treatment responses can be obtained from the law
of total variation:
var(X) = Esex var(X | sex) + varsex E(X | sex)
       = qf σf² + qm σm² + qf (µf − EX)² + qm (µm − EX)²
       = qf σf² + qm σm² + qf ((1 − qf)µf − qm µm)² + qm ((1 − qm)µm − qf µf)²
       = qf σf² + qm σm² + qf (1 − qf)²(µf − µm)² + qm (1 − qm)²(µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)²(µf − µm)² + qf²(1 − qf)(µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)(1 − qf + qf)(µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)(µf − µm)²
24 / 42
Stratified sampling
First suppose we draw an iid sample X1 , . . . , Xn from the population, and
use the sample mean X̄ (iid) to estimate EX .
The expected value and variance of X̄ (iid) are

E X̄ (iid) = qf µf + qm µm = EX

and

var(X̄ (iid)) = (qf σf² + qm σm² + qf (1 − qf)(µf − µm)²) / n.
25 / 42
Stratified sampling
Now suppose that instead of an iid sample, we randomly sample qf n
females from the set of all females in the population, and qm n males
from the set of all males in the population. Let X̄ (strat) denote the mean
of this stratified random sample.
What are the mean and variance of X̄ (strat) ?
For notation, let mf = qf n and mm = qm n, let X1f , . . . , Xmf f be the female data,
and let X1m , . . . , Xmm m be the male data.

E X̄ (strat) = (E X1f + · · · + E Xmf f + E X1m + · · · + E Xmm m) / n
           = (mf µf + mm µm) / n
           = qf µf + qm µm
           = EX.
26 / 42
Stratified sampling
var X̄ (strat) = (var X1f + · · · + var Xmf f + var X1m + · · · + var Xmm m) / n²
            = (mf σf² + mm σm²) / n²
            = (qf σf² + qm σm²) / n.

Therefore

var X̄ (iid) − var X̄ (strat) = qf (1 − qf)(µf − µm)² / n.
So stratified sampling is always equal to or better than iid sampling in
terms of variance.
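
A simulation sketch of this comparison, using hypothetical sex-specific means and SDs (not values from the slides); the empirical gap between the two variances should be close to qf (1 − qf)(µf − µm)²/n:

    import numpy as np

    rng = np.random.default_rng(4)

    qf, qm = 0.6, 0.4
    muf, mum, sigf, sigm = 70.0, 60.0, 10.0, 12.0
    n, nrep = 100, 20000

    def iid_mean():
        # Each unit's sex is random, as in an iid draw from the whole population.
        sex_f = rng.random(n) < qf
        x = np.where(sex_f, rng.normal(muf, sigf, n), rng.normal(mum, sigm, n))
        return x.mean()

    def strat_mean():
        # Proportionate allocation: exactly qf*n females and qm*n males.
        nf = int(qf * n)
        x = np.concatenate([rng.normal(muf, sigf, nf), rng.normal(mum, sigm, n - nf)])
        return x.mean()

    var_iid = np.var([iid_mean() for _ in range(nrep)])
    var_strat = np.var([strat_mean() for _ in range(nrep)])

    print(var_iid, var_strat, var_iid - var_strat)
    print("theoretical gap:", qf * (1 - qf) * (muf - mum) ** 2 / n)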
27 / 42
Stratified sampling
In the preceding example, the proportion of males in the sample was the
same as the proportion of males in the population (mm = qm n), and also
with females (mf = qf n). This is called proportionate allocation.
Suppose that we sample fractions wf of females and wm of males, where
wf + wm = 1. So if the total sample size is n, we sample wf n females and
wm n males. If wf ≠ qf and wm ≠ qm, we have disproportionate
allocation.
28 / 42
Stratified sampling
Let X̄f and X̄m denote the sample means of the female and male
responses.

To obtain an unbiased estimate of the overall mean response, we use a
weighted average of the sex-specific means:

qf X̄f + qm X̄m.
The variance of this estimate is

var(qf X̄f + qm X̄m) = qf² var(X̄f) + qm² var(X̄m)
                   = qf² σf² / (wf n) + qm² σm² / (wm n)
                   = n⁻¹ (qf² σf² / wf + qm² σm² / wm).
Exercise: Use calculus to show that the variance is minimized when
wf = qf σf /(qf σf + qm σm ).
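
A numerical check of the exercise (hypothetical values of qf, qm, σf, σm): minimizing the variance over a grid of wf should agree with the closed-form allocation:

    import numpy as np

    qf, qm = 0.6, 0.4
    sigf, sigm = 10.0, 20.0
    n = 100

    def var_weighted(wf):
        # Variance of qf*X̄f + qm*X̄m when a fraction wf of the sample is female.
        wm = 1 - wf
        return (qf**2 * sigf**2 / wf + qm**2 * sigm**2 / wm) / n

    grid = np.linspace(0.01, 0.99, 9801)
    wf_grid = grid[np.argmin(var_weighted(grid))]
    wf_closed = qf * sigf / (qf * sigf + qm * sigm)

    print(wf_grid, wf_closed)   # the two should agree to grid precision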
29 / 42
Cluster sampling
Suppose the population under study can be organized into a number of
“clusters,” for example:
- We are interested in the mean mathematics test score for all 3rd grade students in the state of Michigan. The students are clustered by their classrooms.
- We are interested in the rate of serious infections among patients in intensive care units in United States hospitals. The patients are clustered by hospital.
- We are interested in the purity of pills synthesized in a pharmaceutical factory. The pills are clustered by the batch of raw materials and reagents used in the chemical synthesis.
- We are interested in whether US adults plan to make a major purchase in the next month. The adults are clustered by census tract.
30 / 42
Cluster sampling
In practice, it is often easier to obtain a sample by randomly selecting a
subset of the clusters, then including all units from those clusters in the
sample (or taking iid samples from the selected clusters if the clusters are
large).
This is called cluster sampling, and the clusters are called primary
sampling units (PSU’s).
If we form an estimate Ȳ of the population mean EY by averaging the
data in the sample, how does the fact that the data were sampled as
clusters affect the statistical properties of this estimate?
Units in the same PSU tend to be more similar to each other than units
in different PSU’s. We will see that for this reason, a cluster sample of
size n gives less precision than an iid sample of size n.
31 / 42
Cluster sampling
One way to think about cluster sampling is to model cluster i as having
its own mean value µi , which is not directly observed. Let Yij denote the
j th observation in cluster i. We can model Yij |µi as having mean µi and
variance σ 2 .
The µi values are modeled as being independent random values that
follow their own distribution with mean µ and variance τ 2 .
The conditional (within-cluster) mean and variance of the data are
described by:
E(Yij | µi) = µi
var(Yij | µi) = σ²
32 / 42
Cluster sampling
Setting: The goal of our analysis is to estimate the overall mean µ ≡ EY .
The unconditional mean is
EYij = Eµi E (Yij |µi ) = Eµi µi = µ,
using the double expectation theorem.
The unconditional variance is
varYij = Eµi var(Yij |µi ) + varµi E (Yij |µi ) = σ 2 + τ 2 ,
using the law of total variation.
The unconditional mean and variance tell us about the expected value
and variance of Ȳ iid – what we would obtain if we pooled all units from
all clusters and sampled randomly from that set of values.
33 / 42
Cluster Sampling
To calculate the variance for cluster sampling we will need the following
result first. Take two values Yij and Yi′j′. The covariance between these
values is

cov(Yij, Yi′j′) = E cov(Yij, Yi′j′ | µi, µi′) + cov(E(Yij | µi), E(Yi′j′ | µi′))
              = σ² δii′ δjj′ + cov(µi, µi′)
              = σ² δii′ δjj′ + τ² δii′,
where δij is 1 if i = j and is 0 otherwise.
34 / 42
Cluster Sampling
If there are ni observations in the i-th cluster, the covariance matrix for
the cluster looks like this (an ni × ni matrix with σ² + τ² on the diagonal
and τ² everywhere off the diagonal):

    σ² + τ²   τ²        · · ·   τ²
    τ²        σ² + τ²   · · ·   τ²
    · · ·     · · ·     · · ·   · · ·
    τ²        τ²        · · ·   σ² + τ²
All between-cluster covariances are zero.
35 / 42
Cluster Sampling
Recall that if Y has covariance matrix Σ, then the variance of Ȳ is
∑ij Σij / n².

Suppose we sample ni people from the i-th of q clusters, where
n1 + · · · + nq = n.

The contribution to ∑ij Σij from the i-th cluster is
ni (σ² + τ²) + ni (ni − 1)τ² = ni σ² + ni² τ². Thus the variance of Ȳ clust is

n⁻² ∑_{i=1}^{q} (ni σ² + ni² τ²) = σ²/n + τ² ∑_{i=1}^{q} (ni/n)².

The estimation variance of cluster sampling is greater than or equal to
the estimation variance of iid sampling:

σ²/n + τ² ∑_{i=1}^{q} (ni/n)² − (σ² + τ²)/n = τ² (∑_{i=1}^{q} (ni/n)² − 1/n).
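
A simulation sketch of this formula, with hypothetical variance components (σ² = 0.6, τ² = 0.4) and five clusters of five units each; the empirical variance of the cluster-sample mean should be close to σ²/n + τ² ∑(ni/n)² and larger than the iid value (σ² + τ²)/n:

    import numpy as np

    rng = np.random.default_rng(5)

    sigma2, tau2 = 0.6, 0.4
    ni = np.array([5, 5, 5, 5, 5])          # units sampled from each of q = 5 clusters
    n = ni.sum()
    nrep = 20000

    def cluster_mean():
        mus = rng.normal(0.0, np.sqrt(tau2), size=len(ni))   # unobserved cluster means
        y = np.concatenate([rng.normal(m, np.sqrt(sigma2), k) for m, k in zip(mus, ni)])
        return y.mean()

    var_sim = np.var([cluster_mean() for _ in range(nrep)])
    var_formula = sigma2 / n + tau2 * np.sum((ni / n) ** 2)
    var_iid = (sigma2 + tau2) / n

    print(var_sim, var_formula, var_iid)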
36 / 42
Cluster sampling
The variance of the sample mean of a cluster sample, var(Ȳ clust ),
involves the quantity
∑_{i=1}^{q} (ni/n)².

We can learn a few interesting things about this expression...

First note that ∑_{i=1}^{q} ni/n = 1.

If there are only two clusters (q = 2 above), we can relabel as x = n1/n
and y = n2/n.

We have x + y = 1, 0 ≤ x, y ≤ 1, and we are looking at the quantity
x² + y².
37 / 42
Cluster sampling
The length of the line segment from (0, 0) to (x, y) is √(x² + y²), so we
have a picture like this:

[Figure: the line x + y = 1 in the unit square, with the curves x² + y² = 1 and x² + y² = 1/2.]
The blue line is the line x + y = 1, on which x, y must lie. The outer
black curve is the set of points where x 2 + y 2 = 1 and the inner black
curve is the set of points where x 2 + y 2 = 1/2. We can see that x 2 + y 2
is maximized when x = 0 or when y = 0, and x 2 + y 2 is minimized when
x = y.
38 / 42
Cluster sampling
The facts given on the previous slide for q = 2 can be generalized to get
that ∑_{i=1}^{q} (ni/n)² is maximized when all but one of the ni equal zero,
and is minimized when all the ni are equal.

Thus ∑_{i=1}^{q} (ni/n)² is minimized when ni = n/q for all i, so

∑_{i=1}^{q} (ni/n)² = ∑_{i=1}^{q} 1/q² = 1/q,

and since q ≤ n, it follows that 1/q ≥ 1/n, so

(∑_{i=1}^{q} (ni/n)²) − 1/n ≥ 0.
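
A small numerical illustration (hypothetical cluster sizes) of the two extremes of ∑(ni/n)²:

    import numpy as np

    n, q = 20, 5

    equal = np.full(q, n / q)                             # all clusters the same size
    lopsided = np.array([n - (q - 1)] + [1] * (q - 1))    # nearly all units in one cluster

    for ni in (equal, lopsided):
        print(ni, np.sum((ni / n) ** 2))   # equals 1/q when balanced, larger when lopsided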
39 / 42
Cluster sampling
Cluster sampling and iid sampling have the same estimation variance in
two situations:
- τ² = 0, so all the µi are the same.
- Every unit is in its own cluster, so every ni = 1.
Otherwise, cluster sampling has (strictly) greater estimation variance
than iid sampling.
40 / 42
Illustration
These plots all depict cluster sampling with 5 clusters when the variance
of one observation is σ 2 + τ 2 = 1. The orange line is the population of
Xi , the grey lines are the distributions of Xi within clusters, and the green
line (not to scale) is the distribution of X̄ based on n = 25:
[Figure: four panels, for (σ² = 0.8, τ² = 0.2), (σ² = 0.6, τ² = 0.4), (σ² = 0.4, τ² = 0.6), and (σ² = 0.2, τ² = 0.8), each showing the population density, the within-cluster densities, and the distribution of X̄.]
41 / 42
Summary of sampling designs
Stratified sampling: Most precise of the three approaches listed here;
requires that we identify and measure factors that contribute to variation
in the response; somewhat more complex to analyze than an iid sample.
IID sampling: Simple to analyze, intermediate in precision among the
three approaches listed here; may be difficult to carry out if the
population is dispersed over a large area.
Cluster sampling: Least precise of the three approaches listed here, but
may be significantly easier to implement than the others if the population
is naturally organized into clusters; somewhat more complex to analyze
than an iid sample.
42 / 42