Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: some medical content is graphic and may not be suitable for all viewers.

Sources of Variation
Kerby Shedden
Department of Statistics, University of Michigan
April 8, 2011

Populations

The population is the set of all units (i.e. people, rats, cars) that are of interest in a research study.

Examples:

1. If we are retrospectively studying voter participation in a past US presidential election, the population is everyone who was eligible to vote in the election.
2. If we are studying factors that were associated with a voter’s decision about which candidate to vote for in a past US election, the population would be everyone who actually voted in the election.
3. If we are studying patients’ responses to a new type of leukemia therapy, the population is everyone with leukemia (or everyone with leukemia who is in some sense “eligible” for the treatment).
4. If we are studying energy metabolism of Sprague Dawley rats, the population is the set of all Sprague Dawley rats.

Samples and sampling variation

A sample is a selected subset of a population. A sample will differ from the population, and different samples will generally differ from each other.

The goal of working with a sample is to identify characteristics of the sample that are likely to generalize to the population. Since a sample only allows us to estimate properties of the population, we must assess our confidence that any generalizations we make are correct.

Variation in our estimates due to the process of sampling is called “sampling variation.”

Sampling variation

For example, suppose the population consists of the three values x = 1, 2, 5, with a population mean of EX = 8/3. We sample two of the values (“without replacement”), and use the sample mean to estimate the population mean. The following table gives the possible results.

Sample           1,2    1,5    2,5
Estimated mean   1.5    3.0    3.5

The variation among 1.5, 3.0, and 3.5 is the sampling variability of the sample mean X̄.

Measurement error

What if we are unable to measure the quantity we are interested in exactly? For example, when we sample a unit with X = 1, suppose we observe Y = 1 + ε, where ε has a normal distribution with mean 0 and standard deviation σ = 0.2.
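As a concrete illustration of how sampling variation and measurement error combine, here is a minimal Python sketch (the use of numpy and the particular random seed are assumptions made only for this illustration) that enumerates the three possible samples of size 2 from the population {1, 2, 5} and adds normal measurement error with σ = 0.2 to each sampled value:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
    population = [1, 2, 5]

    for sample in combinations(population, 2):
        x = np.array(sample, dtype=float)         # ideal sample (X)
        y = x + rng.normal(0, 0.2, size=x.size)   # observed sample (Y = X + error)
        print(sample, "ideal mean:", x.mean(), "observed mean:", round(y.mean(), 2))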
Then we might get:

Ideal sample (X)      1,2            1,5            2,5
Actual sample (Y)     0.71, 2.05     1.03, 5.13     1.86, 5.31
Measurement error     -0.29, 0.05    0.03, 0.13     -0.14, 0.31
Estimated mean (Ȳ)    1.38           3.08           3.59

The variation among 1.38, 3.08, and 3.59 is a combination of sampling and measurement variability.

Random and systematic measurement error

Measurement error is a random variable and hence has both a mean and a variance. If the measurement error has mean zero, the measurement is correct on average, and the measurement error is purely random. If the measurement error has a nonzero mean, there is a systematic component to the measurement error (measurement bias).

Some common working assumptions about measurement error:

- The measurement error is independent across units.
- The measurement error is independent of the underlying true value being sampled (e.g. if measuring heights, the measurement error variance is the same for tall and short people).

Sampling and measurement variability in practice

When working with humans or experimental animals, people often talk about biological variability and technical variability. For example, suppose we are measuring insulin levels in rat serum.

- Some rats truly have higher circulating insulin levels than other rats. This is biological variability.
- The assay used to measure insulin levels is imprecise, and one measurement may by chance be higher or lower than the true value. This is technical variability.

Sampling and measurement variability in practice

Here are some possible sources of measurement error in a survey asking about voting behavior:

- Voters may incorrectly remember whether they voted, or who they voted for. Innocent errors of recall may be purely random, but it is also possible that supporters of a particular candidate are more susceptible to such errors.
- Voters may deliberately misstate whether they voted, or who they voted for.

Sampling and measurement variability in practice

Suppose we are studying whether a certain drug benefits patients with a particular form of cancer. Depending on our goals, we will select an appropriate endpoint for our analysis. Depending on the endpoint, we will have different sources of measurement error:

- If the endpoint is tumor size, the errors result because the techniques used to measure tumor size have limited accuracy.
- If the endpoint is a quality of life symptom, like pain, anxiety, sleeplessness, skin rashes, etc., measurement error results from recall errors, deliberate misstatements, and differing interpretations of questions and concepts (e.g. a concept like “severe pain” will mean different things to different people).
- If the endpoint is overall survival from diagnosis (a “hard endpoint”), the measurement is usually made without appreciable error.

Blocking

Suppose there are factors influencing the quantity we are studying that we cannot control. For example, we may be interested in which of two types of crop plant produces a greater yield (e.g. of corn). If we have 10 plots of land to use for our study, how should we proceed?

One possibility is to randomly assign five plots of land to be planted with one type of plant, and have the other five plots of land be planted with the other type of plant. However, it is likely that some plots are more fertile than others (due to better water, drainage, sunlight, soil conditions, etc.) – these are ordinarily factors we cannot control.
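To see how uncontrolled plot-to-plot variation dilutes a completely randomized comparison, here is a minimal simulation sketch (the yield difference, variance components, and numpy usage are illustrative assumptions, not values from the text):

    import numpy as np

    rng = np.random.default_rng(1)  # arbitrary seed
    n_sim, n_plots = 2000, 10
    true_diff = 1.0                 # assumed yield advantage of variety B over A
    sd_plot, sd_noise = 3.0, 1.0    # assumed plot fertility SD and residual SD

    diffs = []
    for _ in range(n_sim):
        fertility = rng.normal(0, sd_plot, n_plots)
        plots = rng.permutation(n_plots)
        a, b = plots[:5], plots[5:]   # completely randomized assignment
        yield_a = fertility[a] + rng.normal(0, sd_noise, 5)
        yield_b = fertility[b] + true_diff + rng.normal(0, sd_noise, 5)
        diffs.append(yield_b.mean() - yield_a.mean())
    print("SD of estimated yield difference:", np.std(diffs))

Most of the spread in the estimated difference comes from the assumed plot-to-plot fertility variation; blocking, described next, removes that component from the comparison.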
A better approach is to divide each plot of land in half, and randomly assign one of the half-plots to be planted with one type of plant, with the other half-plot being planted with the other type of plant. This strategy is called blocking.

Blocking

Blocking reduces the influence of uncontrolled factors on the results of a study. For example, if one plot of land is more favorable for crop yields, having half of it planted with each crop type ensures that no advantage is given to either plant type.

Blocking also allows us to apply a modified form of comparison that gives us more power in many situations. The paired t-test is the most basic analysis that uses blocking. If the goal is to analyze the difference in expected values between populations X and Y (e.g. to make inferences about EX − EY), the paired t-test can be applied when the sample can be grouped into homogeneous pairs, with each pair consisting of one observation from the X population and one observation from the Y population.

Paired t-test

Examples: Paired t-tests can be used in the following situations:

- A “split plot study” (e.g. in an agricultural comparison), where each primary plot is partitioned into two sub-plots that are treated with the two treatments being compared.
- A “before/after” comparison, where each individual is his or her own control, e.g. we may measure a person’s blood pressure prior to a treatment, and again after the treatment, with the goal of estimating the treatment effect.
- Comparisons in clustered populations – for example, if we are interested in the difference between taking a standardized test on a computer or on a paper exam sheet, we might select pairs of children from the same classroom, and randomly assign one of the children to the computer test, and one to the paper test.

Paired t-test

Blocking is most useful if the units within a block are very similar. For example, suppose we are considering the effectiveness of a drug designed to lower blood pressure. We may have the following data:

[Figure: pre-treatment and post-treatment blood pressure (80–140) plotted against patient number.]

Paired t-test

The key feature of these data is that people who have higher than average blood pressure before treatment tend to have higher than average blood pressure after treatment, and similarly for people with lower than average blood pressure:

[Figure: scatterplot of post-treatment versus pre-treatment blood pressure (90–140).]

The similarities between pre-treatment and post-treatment blood pressure are due to other risk factors (diet, exercise, smoking, genetics, . . .) that are not affected by the treatment.

Paired t-test

The overall distributions for pre-treatment and post-treatment data are similar. If we use the standard two-sample comparison, it will require a large sample to have good power for detecting the difference:

[Figure: overlaid density estimates of pre-treatment and post-treatment blood pressure.]

Paired t-test

But the differences between pre-treatment and post-treatment measures are overwhelmingly positive. This suggests that there may be more information in the data than we realize.
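A quick simulation makes this concrete. The sketch below is a hypothetical illustration; the means, standard deviations, treatment effect, and use of scipy.stats are assumptions, not the actual data behind these plots. It generates correlated pre/post blood pressures and compares the unpaired and paired t-tests:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)  # arbitrary seed
    n = 25
    baseline = rng.normal(120, 10, n)          # shared patient-level blood pressure
    pre = baseline + rng.normal(0, 3, n)       # pre-treatment measurement
    post = baseline - 5 + rng.normal(0, 3, n)  # assumed 5-unit treatment effect

    t_unpaired, p_unpaired = stats.ttest_ind(pre, post)
    t_paired, p_paired = stats.ttest_rel(pre, post)
    print("unpaired p-value:", round(p_unpaired, 4))
    print("paired p-value:  ", round(p_paired, 6))

Because the patient-level variation cancels in the differences, the paired test typically gives a far smaller p-value here than the unpaired test.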
[Figure: histogram of pre-treatment minus post-treatment blood pressure differences; nearly all of the differences are positive.]

Paired t-test

To carry out the paired t-test, let Xi and Yi be the pre-treatment and post-treatment measurements for the i-th subject, and let Di = Xi − Yi be the difference between them.

The expected value of Di is

EDi = E(Xi − Yi) = EXi − EYi = EX − EY.

The variance of Di is

var(Di) = var(Xi) + var(Yi) − 2cov(Xi, Yi),

which can be estimated as

σ̂D² = σ̂X² + σ̂Y² − 2ĉov(X, Y),

where ĉov(X, Y) is the sample covariance of the pairs.

Paired t-test

Under the null hypothesis ED = EX − EY = 0, so the test statistic √n·D̄/σ̂D is large if there is a lot of evidence in the data that EY and EX are different. A p-value can be obtained by comparing the test statistic to a standard normal (or t) reference distribution.

Note that D̄ = X̄ − Ȳ, so the numerators of the paired and unpaired two-sample statistics are the same. If the unpaired two-sample statistic is used to analyze paired data (ignoring the pairing), we get

(X̄ − Ȳ)/√(σ̂X²/n + σ̂Y²/n) = √n·D̄/√(σ̂X² + σ̂Y²).

So if ĉov(X, Y) > 0, the paired statistic is always larger than the unpaired statistic, and hence has greater power.

Simple random samples

A simple random sample of size m is a subset of a population that is generated in such a way that any subset of size m is equally likely to be chosen as the sample. Simple random samples are typically drawn “without replacement” (i.e. the same unit may only appear in a sample once).

It is common to analyze statistical methods as if the sampling were done “with replacement.” This way the data become an iid sample (iid stands for “independent and identically distributed”). An iid sample is easier to analyze than an SRS. For moderate or large samples, the results obtained when treating the sampling as being with or without replacement are very similar.

In practical terms, the only certain way to generate a simple random sample is to use a list of all members of the population, and draw entries from this list at random.

Systematic sampling

A systematic sample is generated by following a fixed rule for determining which units are in the sample. For example, suppose we have access to the units in sequence, and every k-th unit in the sequence is sampled. As a concrete example, we could include in our sample every 10th person arriving at a hospital emergency room in an ambulance. This would be a systematic sample of all patients who arrive at the hospital’s emergency room by ambulance.

A systematic sample may approximate a simple random sample as long as there are no trends or periodicities in the sequential arrangement of the units.

Stratified sampling

Suppose we are studying a particular treatment for lung cancer, and our goal is to estimate the treatment response, as measured on a scale from 1 to 100. It may be that the responses of men and women differ. Since the goal of the study is to estimate the overall treatment response, we aren’t interested in sex-specific response as a research aim. But sex-specific response may affect the precision of our measurement of overall treatment response.

Stratified sampling

Suppose the treatment population consists of a fraction qf of females and qm of males, where qf + qm = 1. Also suppose the sex-specific treatment responses have the following means and standard deviations:

Group     Mean   SD
Female    µf     σf
Male      µm     σm

Let X be the treatment response of a randomly-selected patient.
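The two-stage structure of X (first a sex is drawn, then a response given that sex) can be mimicked directly in a short simulation. This sketch is purely illustrative; the numeric values of qf, µf, µm, σf, and σm are assumptions chosen for the example:

    import numpy as np

    rng = np.random.default_rng(3)  # arbitrary seed
    qf, qm = 0.4, 0.6               # assumed population proportions
    mu_f, mu_m = 70.0, 60.0         # assumed sex-specific mean responses
    sd_f, sd_m = 12.0, 8.0          # assumed sex-specific SDs

    n = 1_000_000
    is_female = rng.random(n) < qf                  # stage 1: draw sex
    x = np.where(is_female,
                 rng.normal(mu_f, sd_f, n),         # stage 2: response given sex
                 rng.normal(mu_m, sd_m, n))
    print("simulated EX:", x.mean())                # close to qf*mu_f + qm*mu_m
    print("simulated var(X):", x.var())             # close to qf*sd_f**2 + qm*sd_m**2
                                                    #   + qf*(1-qf)*(mu_f-mu_m)**2

Under these assumed numbers the simulated mean and variance should be close to 64 and 120, matching the formulas derived next.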
Applying the double expectation theorem, the overall expected treatment response is

EX = Esex E(X | sex) = qf µf + qm µm.

Stratified sampling

The overall variance in treatment responses can be obtained from the law of total variation:

var(X) = Esex var(X | sex) + varsex E(X | sex)
       = qf σf² + qm σm² + qf (µf − EX)² + qm (µm − EX)²
       = qf σf² + qm σm² + qf ((1 − qf)µf − qm µm)² + qm ((1 − qm)µm − qf µf)²
       = qf σf² + qm σm² + qf (1 − qf)² (µf − µm)² + qm (1 − qm)² (µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)² (µf − µm)² + qf² (1 − qf) (µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)(1 − qf + qf)(µf − µm)²
       = qf σf² + qm σm² + qf (1 − qf)(µf − µm)².

Stratified sampling

First suppose we draw an iid sample X1, . . . , Xn from the population, and use the sample mean X̄(iid) to estimate EX. The expected value and variance of X̄(iid) are

E X̄(iid) = qf µf + qm µm = EX

and

var(X̄(iid)) = (qf σf² + qm σm² + qf (1 − qf)(µf − µm)²)/n.

Stratified sampling

Now suppose that instead of an iid sample, we randomly sample qf·n females from the set of all females in the population, and qm·n males from the set of all males in the population. Let X̄(strat) denote the mean of this stratified random sample. What are the mean and variance of X̄(strat)?

For notation, let mf = qf·n and mm = qm·n, and write the female data as X1f, . . . , Xmf,f and the male data as X1m, . . . , Xmm,m. Then

E X̄(strat) = (E X1f + · · · + E Xmf,f + E X1m + · · · + E Xmm,m)/n
           = (mf µf + mm µm)/n
           = qf µf + qm µm = EX.

Stratified sampling

var X̄(strat) = (var X1f + · · · + var Xmf,f + var X1m + · · · + var Xmm,m)/n²
             = (mf σf² + mm σm²)/n²
             = (qf σf² + qm σm²)/n.

Therefore

var X̄(iid) − var X̄(strat) = qf (1 − qf)(µf − µm)²/n.

So stratified sampling is always equal to or better than iid sampling in terms of variance.

Stratified sampling

In the preceding example, the proportion of males in the sample was the same as the proportion of males in the population (mm = qm·n), and likewise for females (mf = qf·n). This is called proportionate allocation.

Suppose instead that we sample a fraction wf of females and wm of males, where wf + wm = 1, so if the total sample size is n we sample wf·n females and wm·n males. If wf ≠ qf and wm ≠ qm, we have disproportionate allocation.

Stratified sampling

Let X̄f and X̄m denote the sample means of the female and male responses. In order to unbiasedly estimate the overall mean response we need to use a weighted average of the sex-specific means:

qf X̄f + qm X̄m.

The variance of this estimate is

var(qf X̄f + qm X̄m) = qf² var(X̄f) + qm² var(X̄m)
                    = qf² σf²/(wf n) + qm² σm²/(wm n)
                    = (qf² σf²/wf + qm² σm²/wm)/n.

Exercise: Use calculus to show that the variance is minimized when wf = qf σf/(qf σf + qm σm).

Cluster sampling

Suppose the population under study can be organized into a number of “clusters,” for example:

- We are interested in the mean mathematics test score for all 3rd grade students in the state of Michigan. The students are clustered by their classrooms.
- We are interested in the rate of serious infections among patients in intensive care units in United States hospitals. The patients are clustered by hospital.
- We are interested in the purity of pills synthesized in a pharmaceutical factory. The pills are clustered by the batch of raw materials and reagents used in the chemical synthesis.
- We are interested in whether US adults plan to make a major purchase in the next month. The adults are clustered by census tract.
Cluster sampling

In practice, it is often easier to obtain a sample by randomly selecting a subset of the clusters, then including all units from those clusters in the sample (or taking iid samples from the selected clusters if the clusters are large). This is called cluster sampling, and the clusters are called primary sampling units (PSU’s).

If we form an estimate Ȳ of the population mean EY by averaging the data in the sample, how does the fact that the data were sampled as clusters affect the statistical properties of this estimate? Units in the same PSU tend to be more similar to each other than units in different PSU’s. We will see that for this reason, a cluster sample of size n gives less precision than an iid sample of size n.

Cluster sampling

One way to think about cluster sampling is to model cluster i as having its own mean value µi, which is not directly observed. Let Yij denote the j-th observation in cluster i. We can model Yij | µi as having mean µi and variance σ². The µi values are modeled as independent random values that follow their own distribution with mean µ and variance τ².

The conditional (within-cluster) mean and variance of the data are described by:

E(Yij | µi) = µi
var(Yij | µi) = σ².

Cluster sampling

Setting: The goal of our analysis is to estimate the overall mean µ ≡ EY.

The unconditional mean is

EYij = Eµi E(Yij | µi) = Eµi µi = µ,

using the double expectation theorem. The unconditional variance is

var Yij = Eµi var(Yij | µi) + varµi E(Yij | µi) = σ² + τ²,

using the law of total variation.

The unconditional mean and variance tell us about the expected value and variance of Ȳ(iid) – what we would obtain if we pooled all units from all clusters and sampled randomly from that set of values.

Cluster sampling

To calculate the variance for cluster sampling we will need the following result first. Take two values Yij and Yi′j′. The covariance between these values is

cov(Yij, Yi′j′) = E cov(Yij, Yi′j′ | µi, µi′) + cov(E(Yij | µi), E(Yi′j′ | µi′))
               = σ² δii′ δjj′ + cov(µi, µi′)
               = σ² δii′ δjj′ + τ² δii′,

where δij is 1 if i = j and is 0 otherwise.

Cluster sampling

If there are ni observations in the i-th cluster, the covariance matrix for the cluster is the ni × ni matrix with σ² + τ² on the diagonal and τ² in every off-diagonal position:

σ² + τ²    τ²         · · ·    τ²
τ²         σ² + τ²    · · ·    τ²
· · ·      · · ·      · · ·    · · ·
τ²         τ²         · · ·    σ² + τ²

All between-cluster covariances are zero.

Cluster sampling

Recall that if Y has covariance matrix Σ, then the variance of Ȳ is the sum of all the entries of Σ, divided by n². Suppose we sample ni people from the i-th of q clusters, where n1 + · · · + nq = n.

The contribution to the sum of the entries of Σ from the i-th cluster is

ni(σ² + τ²) + ni(ni − 1)τ² = ni σ² + ni² τ².

Thus the variance of Ȳ(clust) is

(Σi (ni σ² + ni² τ²))/n² = σ²/n + τ² Σi (ni/n)²,

where the sums run over the q clusters. The estimation variance of cluster sampling is greater than or equal to the estimation variance of iid sampling:

σ²/n + τ² Σi (ni/n)² − (σ² + τ²)/n = τ² (Σi (ni/n)² − 1/n).

Cluster sampling

The variance of the sample mean of a cluster sample, var(Ȳ(clust)), involves the quantity Σi (ni/n)². We can learn a few interesting things about this expression. First note that Σi ni/n = 1. If there are only two clusters (q = 2 above), we can relabel as x = n1/n and y = n2/n. We have x + y = 1, 0 ≤ x, y ≤ 1, and we are looking at the quantity x² + y².
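Before looking at the geometry, a small numeric check (a sketch with arbitrarily chosen σ², τ², sample size, and allocations, not values from the text) shows how Σi (ni/n)², and with it var(Ȳ(clust)), changes with the allocation of n = 20 observations across q = 4 clusters:

    import numpy as np

    sigma2, tau2, n = 1.0, 0.5, 20   # assumed variance components and sample size

    def clust_var(counts):
        """Variance of the cluster-sample mean for a given allocation of cluster sizes."""
        counts = np.asarray(counts)
        return sigma2 / n + tau2 * np.sum((counts / n) ** 2)

    for counts in ([5, 5, 5, 5], [10, 6, 2, 2], [17, 1, 1, 1]):
        s = np.sum((np.array(counts) / n) ** 2)
        print(counts, "sum (ni/n)^2 =", round(s, 3), "var =", round(clust_var(counts), 4))
    print("iid variance:", (sigma2 + tau2) / n)

Equal cluster sizes minimize Σi (ni/n)², but even then the cluster-sample variance exceeds the iid value (σ² + τ²)/n whenever τ² > 0.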
Cluster sampling

The length of the line segment from (0, 0) to (x, y) is √(x² + y²), so we have a picture like this:

[Figure: the segment of the line x + y = 1 in the unit square, together with the arcs x² + y² = 1 and x² + y² = 1/2.]

The blue line is the line x + y = 1, on which x, y must lie. The outer black curve is the set of points where x² + y² = 1 and the inner black curve is the set of points where x² + y² = 1/2. We can see that x² + y² is maximized when x = 0 or when y = 0, and x² + y² is minimized when x = y.

Cluster sampling

The facts given on the previous slide for q = 2 can be generalized: Σi (ni/n)² is maximized when all but one of the ni equal zero, and is minimized when all the ni are equal.

Thus Σi (ni/n)² is minimized when ni = n/q for all i, in which case

Σi (ni/n)² = Σi 1/q² = 1/q,

and since q ≤ n, it follows that 1/q ≥ 1/n, so

Σi (ni/n)² − 1/n ≥ 0.

Cluster sampling

Cluster sampling and iid sampling have the same estimation variance in two situations:

- τ² = 0, so all the µi are the same.
- Every unit is in its own cluster, so every ni = 1.

Otherwise, cluster sampling has (strictly) greater estimation variance than iid sampling.

Illustration

These plots all depict cluster sampling with 5 clusters when the variance of one observation is σ² + τ² = 1. The orange line is the population of Xi, the grey lines are the distributions of Xi within clusters, and the green line (not to scale) is the distribution of X̄ based on n = 25:

[Figure: four panels showing σ² = 0.8, τ² = 0.2; σ² = 0.6, τ² = 0.4; σ² = 0.4, τ² = 0.6; and σ² = 0.2, τ² = 0.8.]

Summary of sampling designs

Stratified sampling: Most precise of the three approaches listed here; requires that we identify and measure factors that contribute to variation in the response; somewhat more complex to analyze than an iid sample.

IID sampling: Simple to analyze; intermediate in precision among the three approaches listed here; may be difficult to carry out if the population is dispersed over a large area.

Cluster sampling: Least precise of the three approaches listed here, but may be significantly easier to implement than the others if the population is naturally organized into clusters; somewhat more complex to analyze than an iid sample.
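The relative precision of the three designs can be checked empirically. The sketch below is an illustration only; the population structure, variance components, and sample sizes are invented for the example. It builds a population of 10 groups, then compares the Monte Carlo variance of the sample mean under iid sampling, proportionate stratified sampling (using the groups as strata), and two-stage cluster sampling (using the groups as clusters):

    import numpy as np

    rng = np.random.default_rng(4)  # arbitrary seed
    n_groups, group_size = 10, 100
    group_means = rng.normal(50, 5, n_groups)              # between-group variation (tau)
    pop = (np.repeat(group_means, group_size)
           + rng.normal(0, 5, n_groups * group_size))      # within-group variation (sigma)
    group = np.repeat(np.arange(n_groups), group_size)
    n = 50

    def iid_mean():
        return pop[rng.choice(pop.size, n, replace=False)].mean()

    def strat_mean():  # proportionate allocation: 5 units from every group
        idx = np.concatenate([rng.choice(np.where(group == g)[0], 5, replace=False)
                              for g in range(n_groups)])
        return pop[idx].mean()

    def clust_mean():  # select 5 groups, then 10 units from each selected group
        chosen = rng.choice(n_groups, 5, replace=False)
        idx = np.concatenate([rng.choice(np.where(group == g)[0], 10, replace=False)
                              for g in chosen])
        return pop[idx].mean()

    for name, f in [("iid", iid_mean), ("stratified", strat_mean), ("cluster", clust_mean)]:
        est = [f() for _ in range(2000)]
        print(name, "variance of sample mean:", round(np.var(est), 3))

Under these assumptions the Monte Carlo variances are ordered stratified < iid < cluster, matching the summary above.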