Chapter 3: A Closer Look at Assumptions

Sleuth quote: "Although statistical computer programs faithfully supply confidence intervals and p-values whenever asked, the human data analyst must consider whether the assumptions on which use of the tools is based are met, at least approximately. In this regard, an important distinction exists between the mathematical assumptions on which the use of t-tools is exactly justified and the broader conditions under which such tools work quite well."

3.1 Case Studies

3.1.1 Cloud Seeding to Increase Rainfall
• Scope of Inference?

3.1.2 Effects of Agent Orange on Troops in Vietnam
• Scope of Inference?

3.2 Robustness of the Two-sample t-Tools

Robustness: a statistical procedure is robust to departures from a particular assumption if it is still valid even when the assumption is not met.
• valid = uncertainty measures (CIs and p-values) are nearly equal to the stated rates
  – e.g., Does a 95% confidence interval really cover the true value for 95% of all random samples?
  – e.g., Does a p-value of 0.02 really mean that only 2% of random samples (or randomizations) would have produced estimates as or more extreme than that observed under the assumed (null) distribution?
• Evaluate robustness separately for each assumption.
• What are the ideal assumptions of the two-sample t-tools?
  1. Normality
  2. Equal standard deviations
  3. Independence

1. Robustness against departures from Normality
• Assumption: populations are normally distributed.
• The t-tools remain reasonably valid in large samples for many non-normal populations (robust to non-normality).
  – Main reason why? The Central Limit Theorem (CLT).
    ∗ Caution: the CLT does not address the effects of estimating the population SD of a non-normal distribution.
  – How large does n really have to be?
    ∗ It depends on how non-normal the population distribution is.
  – General effects of skewness and long-tailedness (kurtosis):
    1. Same SDs, approximately same shapes, and about equal sample sizes −→ validity affected moderately by long-tailedness and very little by skewness.
    2. Same SDs, approximately same shapes, different sample sizes −→ validity affected moderately by long-tailedness and substantially by skewness (gets less serious with increasing sample sizes).
    3. If skewness differs in the 2 populations, then the t-tools can be very misleading with small and moderate sample sizes.
• Summary: the t-tools do well when the 2 populations are non-normal but have the same shape and SD and the sample sizes are equal (see Display 3.4).
• What tool can we use to help clarify the role of sample sizes for different situations? Simulation (see the sketch below).
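One way to answer the simulation question: draw many samples from a known non-normal population and count how often the nominal 95% interval actually covers the truth. Below is a minimal R sketch, assuming (arbitrarily, for illustration) lognormal populations and small samples; it is not from the Sleuth.

    # Estimated coverage of a nominal 95% t-based CI for the difference in
    # means when both populations are lognormal (skewed). Both samples come
    # from the same population, so the true difference is 0.
    set.seed(1)
    n1 <- 10; n2 <- 30              # try equal vs. unequal sample sizes
    covered <- replicate(10000, {
      y1 <- rlnorm(n1); y2 <- rlnorm(n2)
      ci <- t.test(y2, y1, var.equal = TRUE)$conf.int
      ci[1] <= 0 && 0 <= ci[2]
    })
    mean(covered)                   # compare with the nominal 0.95

Rerunning with different sample sizes and population shapes traces out patterns like those summarized in Display 3.4.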
2. Robustness against departures from equal SDs:
• More serious problems arise when the SDs of the 2 populations are substantially unequal:
  – The pooled estimate of SD does not estimate any population parameter −→
    ∗ the standard error formula is wrong;
    ∗ the t-ratio does not have a t-distribution.
  – Theory shows the t-tools remain fairly valid with unequal SDs as long as the sample sizes are roughly equal (see Display 3.5).
  – Worst situation: the ratio of SDs is far different from 1 and the population with the smaller sample size has the larger standard deviation.

3. Robustness against departures from independence:
• How can I judge if observations are independent?
  – Does knowledge of one observation tell you anything about another observation? (i.e., if you know one observation is above average, does that allow an improved guess about whether another observation will be above average?) If yes, then independence is lacking!
  – Two common types of dependence:
    1. Cluster effects
      ∗ Often occur when data have been collected in subgroups (by accident or on purpose).
    2. Serial effects
      ∗ Measurements taken over time (time series, temporal autocorrelation)
      ∗ Measurements taken over space (spatial correlation)
  – What is the effect of lack of independence on the t-tools?
    ∗ The standard error (SE) is an inappropriate estimate (usually too small).
    ∗ The t-ratio no longer has a t-distribution, and therefore may give misleading results.
    ∗ How do these relate to confidence intervals and p-values?
    ∗ Seriousness depends on the seriousness of the violation.
    ∗ It is generally a bad idea to use the t-tools if you suspect cluster or serial effects.

3.3 Resistance of the Two-sample t-Tools

Resistance: a statistical procedure is resistant if it does not change much when a small part of the data changes.

• Outlier: an observation judged to be far from its group average.
• Explanations for outliers:
  1. Long-tailed distributions
  2. Contamination of a distribution with an observation that belongs to another population.
• Means (averages) are not resistant to outliers. Medians are resistant.
• Are sample standard deviations resistant to outliers? Why or why not?
• What are the t-tools based on?
• Implication: a small portion of the data can have a major influence on the results. One or two outliers can affect a CI or p-value enough to change a conclusion.
  – The goal of statistics is to describe group characteristics. A conclusion based on one or two data points is fragile!

3.4 Practical Strategies for the Two-sample Problem

• Challenge: evaluate the appropriateness of the t-tools for your data.
  – Involves thinking about independence, graphically examining data, and considering alternatives.
• Consider Serial and Cluster Effects (Independence):
  – Carefully think about HOW the data were collected:
    1. Were the subjects selected in distinct groups?
    2. Were different groups of subjects treated differently in a way that was unrelated to the "treatment"?
    3. Were observations taken at different but proximate times or locations?
  – What to do if independence is lacking?
    ∗ Principal remedy: use a more sophisticated statistical tool.
• Evaluate the Suitability of the t-Tools:
  – Look at plots:
    ∗ Side-by-side boxplots and histograms (on the same scale):
      · Symmetric distributions?
      · Spread approximately the same?
  – What to do if conditions do not appear suitable for the t-tools?
    ∗ Consider a transformation.
    ∗ Consider alternative methods with fewer assumptions (e.g., rank-based or permutation procedures).
• A Strategy for Dealing with Outliers
  – Challenge: decide whether an outlying observation genuinely belongs in the data, was recorded improperly, or is the result of contamination from a different population.
  – Strategies:
    1. Employ a resistant statistical tool (non-parametric); see the sketch at the end of this section.
    2. Use a careful examination strategy (Display 3.6):
      ∗ Agent Orange Data Example
• NOTES:
  – It is not useful to give a precise definition of an outlier. It is subjective! If in any doubt, give the observation further examination.
  – The effect of outliers is greater when sample sizes are smaller.
  – Differences in boxplots may be due to differences in sample sizes. Expect a wider observed range (and more observations that appear extreme) in large samples as compared to small samples, even if the population distributions are identical!
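To make (non-)resistance concrete, here is a small R sketch with made-up numbers: one wild observation moves the non-resistant summaries (mean, SD, t-based p-value) substantially while barely touching the resistant ones (median, rank-sum test).

    # Made-up data: replace one observation in group 2 with a wild outlier.
    y1     <- c(4.1, 5.0, 4.6, 5.3, 4.8, 5.1, 4.4, 5.2)
    y2     <- c(5.9, 6.4, 5.7, 6.1, 6.6, 5.8, 6.3, 6.0)
    y2.out <- c(y2[-8], 60)

    c(mean(y2), mean(y2.out))       # the mean jumps
    c(median(y2), median(y2.out))   # the median barely moves
    c(sd(y2), sd(y2.out))           # the SD is even less resistant

    t.test(y2, y1)$p.value          # t-tools: the conclusion can flip
    t.test(y2.out, y1)$p.value
    wilcox.test(y2, y1)$p.value     # rank-sum test: nearly unchanged
    wilcox.test(y2.out, y1)$p.value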
3.5 Transformations of the Data

3.5.1 The Logarithmic Transformation

• Arguably the most useful transformation for positive data.
  – Advantages:
    ∗ Preserves the ordering of the original data.
    ∗ Results can still be presented on the original scale (after backtransforming).
• The common scale for scientific work is the natural log (ln). When we say log in this class, we mean ln.
• When should I think about using a log transformation?
  1. The data are positive and skewed right, with values spanning more than one order of magnitude (roughly, the ratio of the largest to the smallest measurement exceeds 10).
  2. The group with the larger average also has the larger spread.
• Recall these properties of the logarithm:
  log(e^x) = x,   e^(log(x)) = x,   log(e) = 1,   log(1) = 0,   log10(10^x) = x
  log(x) + log(y) = log(xy)   and   e^x e^y = e^(x+y)
  log(x) − log(y) = log(x/y)  and   e^x / e^y = e^(x−y)
• Behavior of the log transformation: see Display 3.8.
• How do I implement a log transformation?
  1. Take the log of all your measurements (your response variable).
     In R: if your variable name is height, then type log.ht <- log(height).
  2. Do the analysis on your log-transformed data.
  3. Backtransform to the original scale for interpretation (unless the results are appropriately presented on the log scale). This is the tricky step!
     In R: to go from the natural log scale back to the original scale, use the antilogarithm function exp(). For example, if mn.ht <- mean(log.ht), then we can backtransform mn.ht by typing exp(mn.ht).
• Backtransformation and Interpretation:
  I. Population model (random sampling setting):
    ∗ We now make inference about the difference in means of the logged measurements instead of the difference in means on the original scale:
      Mean[log(Y2)] − Mean[log(Y1)]   (instead of Mean(Y2) − Mean(Y1))
    ∗ Problem: the mean of the logged values is not the log of the mean! Therefore, backtransforming the estimate of the mean on the log scale does NOT give an estimate of the mean on the original scale:
      Mean[log(Y)] ≠ log(Mean[Y])
    ∗ IF the log-transformed data have symmetric distributions, then we can say:
      Mean[log(Y)] = Median[log(Y)]
    ∗ The median of the logged values IS the log of the median (because log preserves ordering).
    ∗ Therefore, backtransforming the mean of the logged values gives the median on the original scale (if the distributions are symmetric)!
    ∗ Implication for the 2-sample case:
      Mean[log(Y1)] − Mean[log(Y2)]   estimates   log(Median(Y1)/Median(Y2)),
      and therefore
      exp(Mean[log(Y1)] − Mean[log(Y2)])   estimates   Median(Y1)/Median(Y2).
    – How does this change interpretation?
      ∗ Differences in means −→ ratios of medians!
      ∗ WORDING: It is estimated that the median for population 2 is exp(Average[log(Y2)] − Average[log(Y1)]) times the median for population 1, with an associated 95% confidence interval for the ratio from exp(lower CI limit on the log scale) to exp(upper CI limit on the log scale).
    – WHAT am I really supposed to do???!
      ∗ Backtransform 3 numbers (see the R sketch below):
        · the point estimate obtained on the log scale (i.e., the difference in averages on the log scale), to get an estimate of the ratio of the medians on the original scale;
        · the two ends of the log-scale confidence interval, to obtain a confidence interval for the ratio of the medians on the original scale.
      ∗ DO NOT backtransform variances, standard deviations, or standard errors.
  II. Randomized Experiment Model:
    ∗ Phrase conclusions in terms of treatment effects rather than population differences.
    ∗ Additive treatment effect model: Y* = Y + δ (a unit's response to treatment 2 equals its response to treatment 1 plus δ).
    ∗ Additive treatment effect model on log-transformed data: log(Y*) = log(Y) + δ.
    ∗ Multiplicative treatment effect on the original scale: Y* = e^δ Y.
    ∗ How does this change interpretation?
      · additive treatment effect −→ multiplicative treatment effect
      · WORDING: It is estimated that an experimental unit's response to treatment 2 is exp(Average[log(Y2)] − Average[log(Y1)]) times as large as its response to treatment 1.
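Putting the three implementation steps and the backtransformation rules together, here is a minimal R sketch of the whole two-sample workflow. The data vectors are hypothetical positive measurements, not from the Sleuth case studies.

    # Hypothetical positive-valued samples for groups 1 and 2.
    y1 <- c(12, 35, 8, 150, 47, 22, 61)
    y2 <- c(95, 310, 44, 720, 180, 130, 260)

    # 1. Log transform the response.
    log.y1 <- log(y1); log.y2 <- log(y2)

    # 2. Analyze on the log scale.
    fit <- t.test(log.y2, log.y1, var.equal = TRUE)
    est <- mean(log.y2) - mean(log.y1)  # difference in averages of logs

    # 3. Backtransform the 3 numbers (and nothing else).
    exp(est)           # estimated ratio Median(Y2)/Median(Y1)
    exp(fit$conf.int)  # 95% CI for that ratio of medians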
3.5.3 Other Transformations for Positive Measurements

– Square root:
  ∗ Counts
  ∗ Measurements of area
– Reciprocal:
  ∗ Waiting times
  ∗ Often interpreted directly as a rate or speed
– Arcsine square root:
  ∗ Proportions
– Logit:
  ∗ Proportions

Choosing a transformation:
– There are formal methods for seeking a transformation. However, The Sleuth recommends a trial-and-error approach with graphical analysis.
– Primary goal: establish a scale on which the two groups have roughly the same spread.
– If several transformations seem appropriate, choose based on ease of interpretation.

3.6 Related Issues

• Graphical methods vs. formal tests:
  – There are many formal statistical tests for assessing whether assumptions are adequately met.
  – These tests are widely available, but not very helpful: with small samples they have trouble detecting non-normality, and with large samples they magnify small, unimportant departures from normality and give them small p-values.
    ∗ Does it matter if the two populations are not exactly normal?
    ∗ The formal tests themselves have assumptions! AND they are often less robust to departures from those assumptions than the t-tools are!
    ∗ Graphical displays are more informative, give a good indication of whether or not the t-tools should be applied, and can also suggest a remedy!
• Robustness and transformations for the one-sample and paired t-tools:
  – Assumptions:
    1. Independent observations
    2. Normally distributed population
  – Robustness:
    ∗ The tools remain valid for moderate and large sample sizes with non-normal distributions.
    ∗ Skewness can be a problem for smaller sample sizes.
    ∗ They are not valid when independence is lacking.
  – The log transformation in a paired situation:
    ∗ What suggests use of a log transformation (with positive observations)?
      · An apparent multiplicative treatment effect (in an experiment)
      · A tendency for larger differences in pairs with larger values (observational study)
    ∗ How to apply a log transformation in this case:
      · Apply it before taking the difference!
      · log(a/b) = log(a) − log(b), so the difference of the logs within a pair is the log of the within-pair ratio.
      · Details: analyze the differences of logs with a paired (one-sample) t-test, then backtransform the estimate and the CI endpoints to get an estimated median ratio and its confidence interval (see the sketch below).
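A minimal R sketch of this paired workflow, using hypothetical paired positive measurements (the names before and after are illustrative):

    # Hypothetical paired positive measurements on the same 6 units.
    before <- c(21, 35, 18, 50, 27, 42)
    after  <- c(30, 60, 22, 95, 41, 70)

    # Log BEFORE differencing: each difference is log(after/before),
    # the log of the within-pair ratio.
    d <- log(after) - log(before)

    fit <- t.test(d)     # one-sample t-test on the log ratios
    exp(mean(d))         # estimated median ratio (after/before)
    exp(fit$conf.int)    # 95% CI for that multiplicative effect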