Avoiding lost discoveries by using modern, robust statistical methods: An illustration based on a lifestyle intervention study

Running head: Robust methods
Word total: 4,709

Rand Wilcox1, Mike Carlson2, Stan Azen3, and Florence Clark4

1 Department of Psychology, University of Southern California, USA
2, 4 Division of Occupational Science and Occupational Therapy, University of Southern California, USA
3 Department of Preventive Medicine, University of Southern California, USA

This work was supported by National Institute on Aging, R01 AG021108.

Corresponding Author: Rand Wilcox, Department of Psychology, University of Southern California, 618 Seeley Mudd Building, University Park Campus, Los Angeles, CA 90089-1061, USA. Email: rwilcox@usc.edu

Abstract

Background: When analyzing clinical trial data, it is desirable to take advantage of major advances in statistical techniques for assessing central tendency and measures of association. Although numerous new procedures that have considerable practical value have been developed in the last quarter century, such advances remain underutilized.

Purpose: This article has two purposes. The first is to review common problems associated with standard methodologies (low power, lack of control over Type I errors, incorrect assessments of strength of association), and then summarize some modern methods that can be used to circumvent such problems. The second purpose is to illustrate the practical utility of modern robust methods using data from a recently conducted intervention study.

Methods: To examine the value of robust statistical methodology, data from the Well Elderly 2 randomized controlled trial were analyzed to document the participant characteristics that are associated with pre-to-post change over the course of a lifestyle intervention for older adults.

Results: In multiple instances, robust methods uncovered differences among groups and associations among variables that were not detected by classic techniques. In particular, the results demonstrated that details of the nature and strength of association were sometimes overlooked when using ordinary least squares regression and Pearson's correlation.

Conclusions: Modern robust methods can make a practical difference in detecting and describing differences between groups and associations between variables. Such procedures should be applied more frequently when analyzing trial-based data.

Keywords: robust methods, violation of assumptions, outliers, heteroscedasticity, lifestyle intervention, occupational therapy, quality of life

Introduction

During the last quarter of a century, many new and improved methods for comparing groups and studying associations have been derived that have the potential of documenting important statistical relationships that are likely to be missed when using standard techniques.1-7 Although awareness of these modern procedures is growing, it is evident that their practical utility remains relatively unknown and that they remain underutilized. The goal of this paper is to briefly review why the new methods are important and to illustrate their value using data from a recently conducted randomized controlled trial (RCT).

Problems with Standard Methodology

Appreciating the importance of recently developed robust statistical procedures requires an understanding of modern insights into when and why traditional methods based on means can be highly unsatisfactory.
In the case of detecting differences among groups, there are three fundamental problematic issues that have important implications in controlled trials. The first goal of this article is to briefly review these issues. After this overview, we will then examine the viability of various robust procedures as a means of solving the problems associated with standard procedures.

Problem 1: Heavy-tailed distributions

The first insight has to do with the realization that small departures from normality toward a more heavy-tailed distribution can have an undesirable impact on estimation of the population variance and can greatly affect power.1,7 Moreover, in a seminal paper, Tukey argued that heavier than normal tails are to be expected.8 Random samples from such distributions are prone to having outliers, and experience with modern outlier detection methods supports Tukey's claim. This is not to suggest that heavy-tailed distributions always occur, but to document that it is ill advised to ignore problems that can result from heavy-tailed distributions.

The classic illustration of the practical consequences of heavy-tailed distributions is based on a mixed (contaminated) normal distribution where, with probability 0.9, an observation is sampled from a standard normal; otherwise an observation is sampled from a normal distribution with mean 0 and standard deviation 10. Despite the apparent similarity between the two distributions, as shown in Figure 1, the variance of the mixed normal is 10.9, which greatly reduces power. As an illustration, consider two normal distributions with variance 1 and means 0 and 1, which are shown in the left panel of Figure 2. When testing at the 0.05 level with a sample size of 25 per group, Student's T test has power=0.9. The right panel shows the mixed normal as well as another mixed normal distribution that has been shifted to have mean 1. Now Student's T test has power=0.28.

For two distributions with means μ1 and μ2 and common variance σ², a well-known measure of effect size, popularized by Cohen9, is

δ = (μ1 − μ2)/σ.   (1)

The usual estimate of δ, d, is obtained by estimating the means and the assumed common variance in the standard way. Although the importance of a given value for δ depends on the situation, Cohen suggested that as a general guide a large effect size is one that is visible to the naked eye. Based on this criterion, he concluded that for normal distributions, δ=0.2, 0.5, and 0.8 correspond to small, medium and large effect sizes, respectively. In the left panel of Figure 2, δ=1.0, which from Cohen's perspective would be judged to be quite large. In the right panel, δ=0.28, which would be judged to be a small effect size from a numerical point of view. But from a graphical point of view, the difference between means in the right panel corresponds to what Cohen describes as a large effect size. This illustrates that δ can be highly sensitive to small changes in the tails of a given distribution and that it has the potential of characterizing an effect size as small when in fact it is large. The upshot of this result is that in the context of analyzing RCTs, the use of Cohen's d can oftentimes lead to a spuriously low estimate of effect magnitude.
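To make the contaminated normal example concrete, the following brief R simulation approximates the variance of the mixed normal and the power of Student's T test for the situation just described (two groups of 25 observations whose means differ by 1). It is a minimal sketch of our own; the helper rcnorm and all settings are illustrative rather than part of the trial analyses.

    # Mixed (contaminated) normal: with probability 0.9 sample from N(0,1),
    # otherwise from N(0,10). rcnorm is an illustrative helper, not a standard R function.
    rcnorm <- function(n, p = 0.1, sd.contam = 10) {
      ifelse(runif(n) < p, rnorm(n, 0, sd.contam), rnorm(n, 0, 1))
    }
    set.seed(1)
    var(rcnorm(1e6))    # close to the theoretical variance of 10.9

    # Approximate power of Student's T at the 0.05 level, n = 25 per group, shift = 1
    mean(replicate(5000, t.test(rcnorm(25), rcnorm(25) + 1, var.equal = TRUE)$p.value < 0.05))

The estimated power is far below the value obtained when both groups are standard normal, even though the shift in location is the same.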
Similar anomalies can result when using Pearson's correlation. The left panel of Figure 3 shows a bivariate normal distribution with correlation ρ=0.8, and the middle panel shows a bivariate normal distribution with correlation ρ=0.2. However, in the right panel, although ρ=0.2, the bivariate distribution appears to be similar to the bivariate normal distribution with ρ=0.8. One of the marginal distributions in the third panel in Figure 3 is normal, but the other is a mixed normal, which illustrates that even a small change in one of the marginal distributions can have a large impact on ρ.

Problem 2: Assuming normality via the central limit theorem

Many introductory statistics books claim that with a sample size of 30 or more, when testing H0: μ = μ0, where μ0 is given, normality can be assumed.10-12 Although this claim is not based on wild speculations, two things were missed. First, when studying the sampling distribution of the mean, X̄, early efforts focused on relatively light-tailed distributions for which outliers are relatively rare. However, in moving from skewed, light-tailed distributions to skewed, heavy-tailed distributions, larger than anticipated sample sizes are needed for the sampling distribution of X̄ to be approximately normal. Second, and seemingly more serious, is the implicit assumption that if the distribution of X̄ is approximately normal, then

T = √n(X̄ − μ)/s

will have, approximately, a Student's t distribution with n−1 degrees of freedom, where n is the sample size and s is the sample standard deviation. To illustrate that this is not necessarily the case, imagine that 25 observations are randomly sampled from a lognormal distribution. In such a case the sampling distribution of the mean is approximately normal, but the distribution of T is poorly approximated by a Student's t distribution with 24 degrees of freedom. Figure 4 shows an estimate of the actual distribution of T based on a simulation with 10,000 replications. If T has a Student's t distribution, then P(T ≤ −2.086)=0.025. But when sampling from a lognormal distribution, the actual probability is approximately 0.12. Moreover, P(T ≥ 2.086) is approximately 0.001 and E(T)=−0.54. As a result, Student's T is severely biased. If the goal is to have an actual Type I error probability between 0.025 and 0.075 when testing at the 0.05 level, a sample size of 200 or more is required. For skewed distributions having heavier tails, roughly meaning that the expected proportion of outliers is higher, a sample size in excess of 300 can be required.

A possible criticism of the illustrations just given is that they are based on hypothetical distributions and that in actual studies such concerns may be less severe. But at least in some situations, problems with achieving accurate probability coverage in actual studies have been found to be worse.1 In particular, based on bootstrap samples, the true distribution of T can differ from an assumed Student's t distribution even more than indicated here.1
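The lognormal illustration is easy to reproduce. The R sketch below is our own simulation (not taken from the article's library); it uses the stated sample size of 25 and 10,000 replications and compares the tails of the simulated T values with the Student's t critical values.

    # Distribution of T = sqrt(n)*(xbar - mu)/s when sampling from a standard
    # lognormal distribution, whose population mean is exp(0.5).
    set.seed(1)
    n <- 25
    Tvals <- replicate(10000, {
      x <- rlnorm(n)
      sqrt(n) * (mean(x) - exp(0.5)) / sd(x)
    })
    mean(Tvals <= qt(0.025, n - 1))   # roughly 0.12 rather than the nominal 0.025
    mean(Tvals >= qt(0.975, n - 1))   # close to 0.001
    mean(Tvals)                       # well below 0, reflecting the bias of T

Comparing a density estimate of Tvals with the t density dt(·, n − 1) reproduces the pattern shown in Figure 4.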
Turning to the two-sample case, it should be noted that early simulation studies dealing with non-normality focused on identical distributions. If the sample sizes are equal, the sampling distribution of the difference between the sample means will be symmetric even when the individual distributions are asymmetric but have the same amount of skewness. In the one-sample case, when sampling from a symmetric, heavy-tailed distribution, the actual probability level of Student's T is generally less than the nominal level, which helps explain why early simulation studies regarding the two-sample case concluded that Student's T is robust in terms of controlling the probability of a Type I error. But more recent studies have indicated that when the two distributions being compared differ in skewness, the actual Type I error probability can exceed the nominal level by a substantial amount.1,13

Problem 3: Heteroscedasticity

The third fundamental insight is that violating the usual homoscedasticity assumption (i.e., the assumption that all groups have a common variance) is much more serious than once thought. Both relatively poor power and inaccurate confidence intervals can result. Cressie and Whitford14 established general conditions under which Student's T is not even asymptotically correct under random sampling. When comparing the means of two independent groups, some software packages now default to the heteroscedastic method derived by Welch15, which is asymptotically correct. Welch's method can produce more accurate confidence intervals than Student's T, but serious concerns persist in terms of both Type I errors and power. Indeed, all methods based on means can have relatively low power under arbitrarily small departures from normality.1 Also, when comparing more than two groups, commonly used software employs the homoscedastic ANOVA F test, and typically no heteroscedastic option is provided. This is a concern because with more than two groups, problems are exacerbated in terms of both Type I errors and power.1 Similar concerns arise when dealing with regression.

Robust Solutions

As suggested above, the application of standard methodology for comparing means can seriously bias the conclusions drawn from clinical trial data. Fortunately, a wide array of procedures has been developed that produce more accurate, robust results when the assumptions that underlie standard procedures are violated. We describe some of these methods below.

Alternate Measures of Location

One way of dealing with outliers is to replace the mean with the median. Two centuries ago, Laplace was aware that for heavy-tailed distributions, the usual sample median can have a smaller standard error than the mean.16 But a criticism is that under normality the median is rather unsatisfactory in terms of power. Note that the median belongs to the class of trimmed means. By definition, a trimmed mean is the average after a prespecified proportion of the largest and smallest values is deleted. A crude description of why methods based on the median can have unsatisfactory power is that they trim all but one or two values. A way of dealing with this problem is to trim less, and based on efficiency, Rosenberger and Gasko17 conclude that a 20% trimmed mean is a good choice for general use. That is, its standard error compares well to the standard error of the mean under normality, but unlike the mean, it provides a reasonable degree of protection against the deleterious effects of outliers. Heteroscedastic methods for comparing trimmed means, when dealing with a range of commonly used designs, have also been derived.1 A simple sketch of how such a comparison works is given below.
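To make the comparison concrete, the following R sketch contrasts Student's T, Welch's heteroscedastic test (the default behavior of t.test in R), and Yuen's heteroscedastic test for 20% trimmed means. The code is our own minimal illustration; the winvar and yuen.sketch helpers and the simulated groups g1 and g2 are illustrative only, and the WRS library described later in this article provides full implementations.

    # 20% Winsorized variance: replace the g = floor(0.2*n) smallest and largest
    # order statistics with the adjacent retained values, then take the usual variance.
    winvar <- function(x, tr = 0.2) {
      x <- sort(x); n <- length(x); g <- floor(tr * n)
      x[1:(g + 1)] <- x[g + 1]
      x[(n - g):n] <- x[n - g]
      var(x)
    }

    # Yuen's heteroscedastic test for comparing 20% trimmed means (a sketch).
    yuen.sketch <- function(x, y, tr = 0.2) {
      hx <- length(x) - 2 * floor(tr * length(x))   # observations left after trimming
      hy <- length(y) - 2 * floor(tr * length(y))
      dx <- (length(x) - 1) * winvar(x, tr) / (hx * (hx - 1))
      dy <- (length(y) - 1) * winvar(y, tr) / (hy * (hy - 1))
      tstat <- (mean(x, trim = tr) - mean(y, trim = tr)) / sqrt(dx + dy)
      df <- (dx + dy)^2 / (dx^2 / (hx - 1) + dy^2 / (hy - 1))
      c(statistic = tstat, df = df, p.value = 2 * pt(-abs(tstat), df))
    }

    set.seed(1)
    g1 <- rnorm(40)
    g2 <- c(rnorm(36, 0.6), rnorm(4, 0.6, 9))      # shifted group with occasional outliers
    t.test(g1, g2, var.equal = TRUE)$p.value       # Student's T (homoscedastic)
    t.test(g1, g2)$p.value                         # Welch's heteroscedastic test
    yuen.sketch(g1, g2)                            # 20% trimmed means, Winsorized variances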
Another general approach when dealing with outliers is to use a measure of location that first empirically identifies values that are extreme, after which these extreme values are downweighted or deleted. Robust M-estimators are based in part on this strategy.1,2,5,7 Methods for testing hypotheses based on robust M-estimators and trimmed means have been studied extensively. The strategy of applying standard hypothesis testing methods after outliers are removed is technically unsound and can yield highly inaccurate results because it uses an invalid estimate of the standard error. Theoretically sound methods have been derived.1-7 In both theoretical and simulation studies, when testing hypotheses about skewed distributions, it seems that a 20% trimmed mean has an advantage over a robust M-estimator,1 although no single method dominates.

Transformations

A simple way of dealing with skewness is to transform the data. Well-known possibilities are taking logarithms and using a Box-Cox transformation.18,19 One serious limitation is that simple transformations do not deal effectively with outliers, leading Doksum and Wong20 to conclude that some amount of trimming remains beneficial even when transformations are used. Another concern is that after data are transformed, the resulting distribution can remain highly skewed. However, both theory and simulations indicate that trimmed means reduce practical problems associated with skewness, particularly when used in conjunction with certain bootstrap methods.1,13

Nonparametric regression

There are well-known parametric methods for dealing with regression lines that are nonlinear. But parametric models do not always provide a sufficiently flexible approach, particularly when dealing with more than one predictor. There is a vast literature on nonparametric regression estimators, sometimes called smoothers, for addressing this problem, and robust versions have been developed.1 To provide a crude indication of the strategy used by smoothers, imagine that in a regression situation the goal is to estimate the mean of Y, given that X=6, based on n pairs of observations. The strategy is to focus on the observed X values close to 6 and use the corresponding Y values to estimate the mean of Y. Typically, smoothers give more weight to Y values for which the corresponding X values are close to 6. For pairs of points for which the X value is far from 6, the corresponding Y values are ignored. The general problem, then, is determining which values of X are close to 6 and how much weight the corresponding Y values should be given. Special methods for handling robust measures of variation have also been derived.1

Robust measures of association and effect size

There are two general approaches to robustly measuring the strength of association between two variables. The first is to use some analog of Pearson's correlation that removes or downweights outliers. The other is to fit a regression line and measure the strength of the association based on this fit. Doksum and Samarov20 studied an approach based on this latter strategy using what they call explanatory power. Let Y be some outcome variable of interest, let X be some predictor variable, and let Ŷ be the predicted value of Y based on some fit, which might be based on either a robust linear model or a nonparametric regression estimator that provides a flexible approach to curvature. Then explanatory power is

ρe² = σ²(Ŷ)/σ²(Y),

where σ²(Y) is the variance of Y. When Ŷ is obtained via ordinary least squares, ρe² is equal to ρ², where ρ is Pearson's correlation. A robust version of explanatory power is obtained simply by replacing the variance, σ², with some robust measure of variation.1 Here the biweight midvariance is used, with the understanding that several other choices are reasonable. Roughly, the biweight midvariance empirically determines whether any values among the marginal distributions are unusually large or small. If such values are found, they are downweighted. This measure of variation has good efficiency among the many robust measures of variation that have been proposed. It should be noted, however, that the choice of smoother matters when estimating explanatory power, with various smoothers performing rather poorly in terms of mean squared error and bias, at least with small to moderate sample sizes.21 One smoother that performs relatively well in simulations is the robust version of Cleveland's smoother.22
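To fix ideas, the following R sketch estimates explanatory power with standard tools used as stand-ins: loess, Cleveland's local regression (with its "symmetric" option, which downweights large residuals), takes the place of the robust smoother just described, and the ordinary variance or a crude robust scale estimate (the MAD) takes the place of the biweight midvariance. The data are simulated purely for illustration; none of this code reproduces the article's own functions.

    # Simulated curved relationship with a few aberrant points
    set.seed(1)
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.5)
    y[1:5] <- y[1:5] + 10                       # inject outliers

    fit  <- loess(y ~ x, family = "symmetric")  # robust local regression (Cleveland)
    yhat <- fitted(fit)

    var(yhat) / var(y)        # explanatory power based on the ordinary variance
    (mad(yhat) / mad(y))^2    # rough robust analog, substituting the MAD for the
                              # biweight midvariance used in the article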
When comparing measures of location corresponding to two independent groups, Algina, Keselman, and Penfield23 suggest a robust version of Cohen's measure of effect size that replaces the means and the assumed common variance with a 20% trimmed mean and 20% Winsorized variance. They also rescale their measure of effect size so that under normality it is equal to δ. The resulting estimate is denoted by dt. It is noted that a heteroscedastic analog of δ can be derived using a robust version of explanatory power.24 Briefly, it reflects the variation of the measures of location divided by some robust measure of variation of the pooled data, which has been rescaled so that it estimates the population variance under normality. Here, we label this version of explanatory power ξ², and ξ is called a heteroscedastic measure of effect size. It can be shown that for normal distributions with a common variance, δ=0.2, 0.5 and 0.8 correspond to ξ=0.15, 0.35 and 0.5, respectively. This measure of effect size is readily extended to more than two groups.

Consider again the two mixed normal distributions in the right panel of Figure 2, only now the second distribution is shifted to have a mean equal to 0.8. Then Cohen's d is 0.24, suggesting a small effect size, even though from a graphical perspective we have what Cohen characterized as a large effect size. The Algina et al. measure of effect size, dt, is 0.71, and ξ=0.46, approximately, both of which suggest a large effect size.
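The dt measure is simple to compute. The sketch below is our own illustration of the idea, not the article's implementation: it reuses the winvar helper defined in the earlier Yuen sketch, and the constant 0.642, the approximate 20% Winsorized standard deviation of a standard normal, provides the rescaling that makes the measure match δ under normality.

    # Robust analog of Cohen's d in the spirit of Algina, Keselman and Penfield:
    # difference in 20% trimmed means divided by the pooled 20% Winsorized SD,
    # rescaled by approximately .642 so that it equals delta under normality.
    akp.sketch <- function(x, y, tr = 0.2) {
      nx <- length(x); ny <- length(y)
      sw2 <- ((nx - 1) * winvar(x, tr) + (ny - 1) * winvar(y, tr)) / (nx + ny - 2)
      0.642 * (mean(x, trim = tr) - mean(y, trim = tr)) / sqrt(sw2)
    }
    akp.sketch(g1, g2)   # applied to the illustrative groups generated earlier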
Comments on Software

A practical issue is how to apply robust methods. This can be done using one of the most important software developments of the last thirty years: the free software R, which can be downloaded from http://www.r-project.org/. The illustrations used here are based on a library of over 900 R functions1, which are available at http://www-rcf.usc.edu/~rwilcox/. An R package, WRS, is available as well and can be installed with the R command install.packages("WRS", repos="http://R-Forge.R-project.org"), assuming that the most recent version of R is being used.

Practical illustration of robust methods: Analysis of a lifestyle intervention for older adults

Design

To illustrate the practical benefits of using recently developed robust methods, we report some results stemming from a recent RCT of a lifestyle intervention designed to promote the health and well-being of community-dwelling older adults. Details regarding the wider RCT are summarized by Clark et al.25 Briefly, this trial was conducted to compare a six-month lifestyle intervention to a no-treatment control condition. The design utilized a cross-over component, such that the intervention was administered to experimental participants during the first six study months and to control participants during the second six months. The results reported here are based on a pooled sample of n=364 participants who completed assessments both prior to and following receipt of the intervention.

Our goal was to assess the association between the number of hours of treatment and various outcome variables. Outcome variables included: (a) eight indices of health-related quality of life, based on the SF-36 (physical function, bodily pain, general health, vitality, social function, mental health, physical component scores, and mental component scores)26; (b) depression, based on the Center for Epidemiologic Studies Depression Scale27; and (c) life satisfaction, based on the Life Satisfaction Index-Z Scale21. In each instance, the outcome variable consisted of signed post-treatment minus pre-treatment change scores. Preliminary analyses revealed that all outcome variables had outliers based on boxplots, raising the possibility that modern robust methods might make a practical difference in the conclusions reached.

Prior to discussing the primary results, it is informative to first examine plots of the regression lines estimated via the nonparametric quantile regression method derived by Koenker and Ng28. Figure 5 shows the estimated regression line (based on the R function qsmcobs) when predicting the median value of the variable physical function given the number of hours of individual sessions. Notice that the regression line is approximately horizontal from 0 to 5 hours, but then an association appears. Pearson's correlation over all sessions is 0.178 (p=0.001), indicating an association. However, Pearson's correlation is potentially misleading because the association appears to be nonlinear. The assumption of a straight regression line can be tested with the R function qrchk, which yields p=0.006. The method used by this function stems from results derived by He and Zhu29 and is based on a nonparametric estimate of the median of the outcome variable of interest. As can be seen, it appears that with 5 hours or less, the treatment has little or no association with physical function, but at or above 5 hours, an association appears. In fitting a straight regression line via the Theil-Sen estimator, for hours greater than or equal to 5, the robust explanatory measure of association is 0.347, and the hypothesis of a zero slope, tested with the R function regci, yields p=0.015. For hours less than 5, the Theil-Sen estimate of the slope is 0. In summary, Pearson's correlation indicates an association, but it is misleading in the sense that for less than 5 hours there appears to be little or no association, and for more than 5 hours the association appears to be much stronger than Pearson's correlation indicates.

Note also that after about 5 hours, Figure 5 suggests that the association is strictly increasing. However, the use of a running-interval smoother with bootstrap bagging, applied with the R function rplotsm, yields the plot in Figure 6. Again, there is little or no association initially and around 5-7 hours an association appears, but unlike Figure 5, Figure 6 suggests that the regression line levels off at about 9 or 10 hours. The main point here is that smoothers can play an important role in terms of detecting and describing an association. Indeed, no association is found after 9 hours, although this might be due to low power.
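To convey what such a fit involves, the R sketch below computes a Theil-Sen regression (the slope is the median of all pairwise slopes) restricted to 5 or more session hours. The data frame dat and its columns hours and physfun are hypothetical stand-ins, simulated here only so the code runs; the trial data themselves are not reproduced, and the WRS functions used in the actual analyses (qsmcobs, qrchk, regci, rplotsm) handle inference and plotting.

    # Hypothetical data standing in for session hours and change in physical function
    set.seed(1)
    dat <- data.frame(hours = runif(200, 0, 16.5))
    dat$physfun <- with(dat, ifelse(hours < 5, 0, 2 * (hours - 5))) + rnorm(200, sd = 4)

    # Theil-Sen fit: slope is the median of all pairwise slopes
    theil.sen <- function(x, y) {
      ij <- combn(length(x), 2)
      dx <- x[ij[2, ]] - x[ij[1, ]]
      dy <- y[ij[2, ]] - y[ij[1, ]]
      b1 <- median(dy[dx != 0] / dx[dx != 0])   # drop pairs with tied x values
      c(intercept = median(y - b1 * x), slope = b1)
    }
    sub <- subset(dat, hours >= 5)
    theil.sen(sub$hours, sub$physfun)

A straight-line median regression could also be fit with quantreg::rq(physfun ~ hours, tau = 0.5, data = sub), if the quantreg package is installed.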
Similar issues regarding nonlinear associations occurred when analyzing other outcome variables. Figure 7 shows the median regression line when predicting physical composite based on individual session hours. Pearson's correlation rejects the null hypothesis of no association (r=0.2, p=0.015), but again a smoother indicates that there is little or no association from about 0 to 5 hours, after which an association appears. For 0 to 5 hours, r=-0.071 (p=0.257), and for 5 hours or more, r=0.25 (p=0.045). But the 20% Winsorized correlation for 5 hours or more is rw=0.358 (p=0.0045). (Winsorizing means that the more extreme values are "pulled in", which reduces the deleterious effects of outliers.) The explanatory measure of effect size, ρe, is estimated to be 0.47.

A cautionary remark should be made. When using a smoother, the ends of the regression line can be misleading. In Figure 7, as the number of session hours gets close to the maximum observed value, the regression line appears to be monotonic decreasing. But this could be due to an inherent bias when dealing with predictor values that are close to the maximum value. For example, if the goal is to predict some outcome variable Y when X=8, the method gives more weight to pairs of observations for which the X values are close to 8. But if there are many X values less than 8, and only a few larger than 8, this can have a negative impact on the accuracy of the predicted Y value. Figure 8 shows the estimate of the regression line based on a running-interval smoother with bootstrap bagging. When the number of session hours is relatively high, there is again some indication of a monotonic decreasing association, but it is less pronounced, suggesting that it might be an artifact of the estimators used. One concern is that there are very few points close to the maximum number of session hours, 16.5. Consequently, the precision of the estimates of physical composite, given that the number of session hours is close to 16.5, would be expected to be relatively low.

Table 1 summarizes measures of association between session hours and the 10 outcome variables. Some degree of curvature was found for all of the variables listed. Shown are Pearson's r; the Winsorized correlation rw, which measures the strength of a linear association while guarding against outliers among the marginal distributions; and the robust explanatory measure of the strength of association based on Cleveland's smoother, re. (Note that there is no p-value reported for re. Methods for determining a p-value are available, but the efficacy of these methods, in the context of Type I error probabilities, has not been investigated.) As can be seen, when testing at the 0.05 level, if Pearson's correlation rejects, the Winsorized correlation rejects as well. There are, however, four instances where the Winsorized correlation rejects and Pearson's correlation does not: vitality, mental composite, depression, and life satisfaction. Particularly striking are the results for depression. Pearson's correlation is -0.022 (p=0.694), whereas the Winsorized correlation is -0.132 (p=0.018). This illustrates the basic principle that outliers can mask a strong association among the bulk of the points. Note that the re values are always positive. This is because re measures the strength of the association without assuming that the regression line is monotonic. That is, it allows for the possibility that the regression line is increasing over some region of the predictor values, but decreasing otherwise.
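For readers who wish to see what the Winsorized correlation involves, the short R sketch below Winsorizes each variable by replacing the g = floor(0.2n) smallest and largest values with the adjacent retained values and then computes Pearson's correlation on the Winsorized scores. The helper names and the simulated data are ours; the WRS library contains a fully developed version that also supplies a p-value.

    winsorize <- function(x, tr = 0.2) {
      # Pull the lowest and highest 20% of values in to the nearest retained value
      xs <- sort(x); n <- length(x); g <- floor(tr * n)
      pmin(pmax(x, xs[g + 1]), xs[n - g])
    }
    wincor.sketch <- function(x, y, tr = 0.2) cor(winsorize(x, tr), winsorize(y, tr))

    # Small illustration: an association masked by a few outliers
    set.seed(1)
    x <- rnorm(60); y <- -0.5 * x + rnorm(60, sd = 0.5)
    y[1:3] <- y[1:3] + 15
    cor(x, y)              # Pearson's correlation, distorted by the outliers
    wincor.sketch(x, y)    # Winsorized correlation, closer to the bulk of the points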
Another portion of the study dealt with comparing the change scores of two groups of participants. The variables were measures of physical function, bodily pain, physical composite and a cognitive score. For the first group there was an ethnic match between the participant and therapist; in the other group there was no ethnic match. The total sample size was n=205. Table 2 summarizes the results when comparing means with Welch's test and 20% trimmed means with Yuen's test, which was applied with the R function yuenv2. (A theoretically correct estimate of the standard error of a trimmed mean is based in part on the 20% Winsorized variance, which was used here.) As can be seen, Welch's test rejects at the 0.05 level for three of the four outcome variables. In contrast, Yuen's test rejects in all four situations. Also shown are the three measures of effect size previously described. Note that, following Cohen's suggestion, d is relatively small for the first outcome variable and a medium effect size is indicated in the other three cases. The Algina et al.23 measure of effect size, dt, is a bit larger. Of particular interest is the heteroscedastic measure of effect size, ξ, which effectively deals with outliers. Now the results indicate a medium to large effect size (ξ > 0.3) for three of the outcome variables. (Formal hypothesis testing methods for comparing the population analogs of d and ξ have not been derived.)

Discussion

There are many important applications of robust statistical methodology that extend beyond those described in this paper. For example, the violation of assumptions in more complex experimental designs is associated with especially dire consequences, and robust methods for dealing with these problems have been derived.1 A point worth stressing is that no single method is always best. The optimal method in a given situation depends in part on the magnitude of true differences or associations, and on the nature of the distributions being studied, which are often unknown. Robust methods are designed to perform relatively well under normality and homoscedasticity, while continuing to perform well when these assumptions are violated. It is possible that classic methods offer an advantage in some situations. For example, for skewed distributions, the difference between the means might be larger than the difference between the 20% trimmed means, suggesting that power might be higher using means. However, there is no guarantee that means have more power, because it is often the case that a 20% trimmed mean has a substantially smaller standard error. Of course, classic methods perform well when standard assumptions are met, but modern robust methods are designed to perform nearly as well in this particular situation. In general, complete reliance on classic techniques seems highly questionable given that hundreds of papers published during the last quarter century underscore the strong practical value of modern procedures. All indications are that classic, routinely used methods can be highly unsatisfactory under general conditions. Moreover, access to modern methods is now available via a vast array of R functions. The main message here is that modern technology offers the opportunity to analyze data in a much more flexible and informative manner, enabling researchers to learn more from their data and thereby augmenting the accuracy and practical utility of clinical trial results.
Table 1: Measures of association between hours of treatment and the variables listed in column 1 (n=364)

Variable             Pearson's r   p        rw*      p       re**
PHYSICAL FUNCTION    0.178         0.001    0.135    0.016   0.048
BODILY PAIN          0.170         0.002    0.156    0.005   0.198
GENERAL HEALTH       0.209         0.0001   0.130    0.012   0.111
VITALITY             0.099         0.075    0.139    0.012   0.241
SOCIAL FUNCTION      0.112         0.043    0.157    0.005   0.228
MENTAL HEALTH        0.141         0.011    0.167    0.003   0.071
PHYSICAL COMPOSITE   0.200         0.0002   0.136    0.015   0.255
MENTAL COMPOSITE     0.095         0.087    0.149    0.007   0.028
DEPRESSION           -0.022        0.694    -0.132   0.018   0.134
LIFE SATISFACTION    0.086         0.125    0.118    0.035   0.119

*rw = 20% Winsorized correlation
**re = robust explanatory measure of association

Table 2: P-values and measures of effect size when comparing ethnic matched patients to a non-matched group

Variable             Welch's test: p-value   Yuen's test: p-value   d       dt      ξ
Physical Function    0.1445                  0.0469                 0.212   0.310   0.252
Bodily Pain          0.01397                 <0.0001                0.591   0.666   0.501
Physical Composite   <0.0001                 0.0002                 0.420   0.503   0.391
Cognition            0.0332                  0.0091                 0.415   0.408   0.308

The measures of effect size are d (Cohen's d), dt (a robust measure of effect size that assumes homoscedasticity), and ξ (a robust measure of effect size that allows heteroscedasticity). Under normality and homoscedasticity, d=0.2, 0.5 and 0.8 approximately correspond to ξ=0.15, 0.35 and 0.5, respectively.

Figure 1: Despite the obvious similarity between the standard normal and contaminated normal distributions, the standard normal has variance 1 and the contaminated normal has variance 10.9.
Figure 2: Left panel, d=1, power=0.96. Right panel, d=.3, power=.28.
Figure 3: Illustration of the sensitivity of Pearson's correlation to contaminated (heavy-tailed) distributions.
Figure 4: The solid line is the distribution of Student's T, n = 30, when sampling from a lognormal distribution. The dashed line is the distribution of T when sampling from a normal distribution.
Figure 5: The median regression line for predicting physical function based on the number of session hours.
Figure 6: The running-interval smooth based on bootstrap bagging using the same data as in Figure 5.
Figure 7: The median regression line for predicting physical composite based on the number of session hours.
Figure 8: The estimated regression line for predicting physical composite based on the number of session hours. The estimate is based on the running-interval smoother with bootstrap bagging.

References

1. Wilcox RR. Introduction to robust estimation and hypothesis testing. 2nd ed. Burlington, MA: Elsevier Academic Press, 2005.
2. Hampel FR, Ronchetti EM, Rousseeuw PJ and Stahel WA. Robust statistics: the approach based on influence functions. New York: Wiley, 1986.
3. Hoaglin DC, Mosteller F and Tukey JW. Understanding robust and exploratory data analysis. New York: Wiley, 1983.
4. Hoaglin DC, Mosteller F and Tukey JW. Exploring data tables, trends, and shapes. New York: Wiley, 1985.
5. Huber P and Ronchetti EM. Robust statistics. 2nd ed. New York: Wiley, 2009.
6. Maronna RA, Martin MA and Yohai VJ. Robust statistics: theory and methods. New York: Wiley, 2006.
7. Staudte RG and Sheather SJ. Robust estimation and testing. New York: Wiley, 1990.
8. Tukey JW. A survey of sampling from contaminated normal distributions. In: Olkin I et al. (eds) Contributions to probability and statistics. Stanford, CA: Stanford University Press, 1960, pp. 448-485.
9. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. New York: Academic Press, 1988.
10. Shavelson RJ. Statistical reasoning for the behavioral sciences. Needham Heights, MA: Allyn and Bacon, 1988, p. 266.
11. Goldman RN and Weinberg JS. Statistics: an introduction. Englewood Cliffs, NJ: Prentice-Hall, 1985, p. 252.
12. Huntsberger DV and Billingsley P. Elements of statistical inference. 5th ed. Boston, MA: Allyn and Bacon, 1981.
13. Cribbie RA, Fiksenbaum L, Keselman HJ and Wilcox RR. Effects of nonnormality on test statistics for one-way independent groups designs. British Journal of Mathematical and Statistical Psychology, in press.
14. Cressie NA and Whitford HJ. How to use the two sample t-test. Biometrical Journal 1986; 28: 131-148.
15. Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika 1938; 29: 350-362.
16. Hald A. A history of mathematical statistics from 1750 to 1930. New York: Wiley, 1998.
17. Rosenberger JL and Gasko M. Comparing location estimators: trimmed means, medians, and trimean. In: Hoaglin D, Mosteller F and Tukey J (eds) Understanding robust and exploratory data analysis. New York: Wiley, 1983, pp. 297-336.
18. Box GE and Cox DR. An analysis of transformations. Journal of the Royal Statistical Society B 1964; 26: 211-252.
19. Sakia RM. The Box-Cox transformation: a review. The Statistician 1992; 41: 169-178.
20. Doksum KA and Wong C-W. Statistical tests based on transformed data. Journal of the American Statistical Association 1983; 78: 411-417.
21. Wood V, Wylie ML and Sheafor B. An analysis of a short self-report measure of life satisfaction: correlation with rater judgments. Journal of Gerontology 1969; 24: 465-469.
22. Cleveland WS. Robust locally-weighted regression and smoothing scatterplots. Journal of the American Statistical Association 1979; 74: 829-836.
23. Algina J, Keselman HJ and Penfield RD. An alternative to Cohen's standardized mean difference effect size: a robust parameter and confidence interval in the two independent groups case. Psychological Methods 2005; 10: 317-328.
24. Wilcox RR and Tian T. Measuring effect size: a robust heteroscedastic approach for two or more groups. Journal of Applied Statistics, in press.
25. Clark F, Jackson J, Mandel D, Blanchard J, Carlson M, Azen S, et al. Confronting challenges in intervention research with ethnically diverse older adults: the USC Well Elderly II trial. Clinical Trials 2009; 6: 90-101.
26. Ware JE, Kosinski M and Dewey JE. How to score version two of the SF-36 Health Survey. Lincoln, RI: QualityMetric, 2000.
27. Radloff LS. The CES-D Scale: a self-report depression scale for research in the general population. Applied Psychological Measurement 1977; 1: 385-401.
28. Koenker R and Ng P. Inequality constrained quantile regression. Sankhya, Indian Journal of Statistics 2005; 67: 418-440.
29. He X and Zhu L-X. A lack-of-fit test for quantile regression. Journal of the American Statistical Association 2003; 98: 1013-1022.