Avoiding lost discoveries by using modern, robust statistical methods: An illustration based on
a lifestyle intervention study
Running head: Robust methods
Word total: 4,709
Rand Wilcox1, Mike Carlson2, Stan Azen3, and Florence Clark4
1 Department of Psychology, University of Southern California, USA
2, 4 Division of Occupational Science and Occupational Therapy, University of Southern California, USA
3 Department of Preventive Medicine, University of Southern California, USA
This work was supported by National Institute on Aging, R01 AG021108.
Corresponding Author: Rand Wilcox, Department of Psychology, University of Southern
California. 618 Seeley Mudd Building, University Park Campus, Los Angeles, CA 90089-1061,
USA. Email: rwilcox@usc.edu
Abstract
Background When analyzing clinical trial data, it is desirable to take advantage of major
advances in statistical techniques for assessing central tendency and measures of association.
Although numerous new procedures that have considerable practical value have been developed
in the last quarter century, such advances remain underutilized.
Purpose This article has two purposes. The first is to review common problems associated with
standard methodologies (low power, lack of control over type I errors, incorrect assessments of
strength of association), and then summarize some modern methods that can be used to
circumvent such problems. The second purpose is to illustrate the practical utility of modern
robust methods using data from a recently conducted intervention study.
Methods To examine the value of robust statistical methodology, data from the Well Elderly 2
randomized controlled trial were analyzed to document the participant characteristics that are
associated with pre-to-post change over the course of a lifestyle intervention for older adults.
Results In multiple instances, robust methods uncovered differences among groups and
associations among variables that were not detected by classic techniques. In particular, the
results demonstrated that details of the nature and strength of association were sometimes
overlooked when using ordinary least squares regression and Pearson's correlation.
Conclusions Modern robust methods can make a practical difference in detecting and
describing differences between groups and associations between variables. Such procedures
should be applied more frequently when analyzing trial-based data.
Keywords:
robust methods, violation of assumptions, outliers, heteroscedasticity, lifestyle intervention,
occupational therapy, quality of life
Introduction
During the last quarter of a century, many new and improved methods for comparing groups
and studying associations have been derived that have the potential of documenting important
statistical relationships that are likely to be missed when using standard techniques.1-7 Although
awareness of these modern procedures is growing, it is evident that their practical utility
remains relatively unknown and that they remain underutilized. The goal of this paper is to
briefly review why the new methods are important and to illustrate their value using data from a
recently conducted randomized controlled trial (RCT).
Problems with Standard Methodology
Appreciating the importance of recently developed robust statistical procedures requires an
understanding of modern insights into when and why traditional methods based on means can
be highly unsatisfactory. In the case of detecting differences among groups, there are three
fundamental problematic issues that have important implications in controlled trials. The first
goal of this article is to briefly review these issues. After this overview, we will then examine
the viability of various robust procedures as a means of solving the problems associated with
standard procedures.
Problem 1: Heavy-tailed distributions
The first insight has to do with the realization that small departures from normality toward a
more heavy-tailed distribution can have an undesirable impact on estimation of the population
variance and can greatly affect power.1, 7 Moreover, in a seminal paper, Tukey argued that
heavier than normal tails are to be expected.8 Random samples from such distributions are
prone to having outliers, and experience with modern outlier detection methods supports
Tukey's claim. This is not to suggest that heavy-tailed distributions always occur, but to
document that it is ill advised to ignore problems that can result from heavy-tailed distributions.
The classic illustration of the practical consequences of heavy-tailed distributions is based on a
mixed (contaminated) normal distribution where with probability 0.9 an observation is sampled
from a standard normal, otherwise an observation is sampled from a normal distribution with
mean 0 and standard deviation 10. Despite the apparent similarity between the two
distributions, as shown in Figure 1, the variance of the mixed normal is 10.9, which greatly
reduces power. As an illustration, consider two normal distributions with variance 1 and means
0 and 1, which are shown in the left panel of Figure 2. When testing at the 0.05 level with a
sample size of 25 per group, Student's T test has power=0.9. The right panel shows the mixed
normal as well as another mixed normal distribution that has been shifted to have mean 1. Now
Student's T test has power=0.28.
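This power collapse is easy to reproduce by simulation. The following Python sketch (the analyses in this paper use R; the 2,000-replication count and seed here are illustrative choices) estimates the power of Student's T at n=25 per group under the normal and contaminated-normal shift models:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def contaminated_normal(n):
    """With probability 0.9 draw from N(0,1), otherwise from N(0,100)."""
    x = rng.standard_normal(n)
    x[rng.random(n) < 0.1] *= 10.0
    return x

def power(sampler, n=25, shift=1.0, reps=2000, alpha=0.05):
    """Estimate the power of Student's T for a location shift."""
    rejections = sum(
        stats.ttest_ind(sampler(n), sampler(n) + shift).pvalue < alpha
        for _ in range(reps)
    )
    return rejections / reps

print(power(rng.standard_normal))   # high power under normality
print(power(contaminated_normal))   # power collapses under contamination
```

Only the tails of the sampling model change, yet the estimated power drops by a factor of roughly three.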
For two distributions with means µ1 and µ2 and common variance σ2, a well-known measure of
effect size, popularized by Cohen9, is
δ = (μ1 − μ2)/σ.   (1)
The usual estimate of δ, d, is obtained by estimating the means and the assumed common
variance in the standard way. Although the importance of a given value for δ depends on the
situation, Cohen suggested that as a general guide a large effect size is one that is visible to the
naked eye. Based on this criterion, he concluded that for normal distributions,
δ =0.2, 0.5, and 0.8 correspond to small, medium and large effect sizes, respectively. In the left
panel of Figure 2, δ =1.0, which from Cohen's perspective would be judged to be quite large. In
the right panel, δ=0.28, which would be judged to be a small effect size from a numerical point
of view. But from a graphical point of view, the difference between means in the right panel
corresponds to what Cohen describes as a large effect size. This illustrates that δ can be highly
sensitive to small changes in the tails of a given distribution and that it has the potential of
characterizing an effect size as small when in fact it is large. The upshot of this result is that in
the context of analyzing RCTs, the use of Cohen’s d can oftentimes lead to a spuriously low
estimate of effect magnitude.
Similar anomalies can result when using Pearson's correlation. The left panel of Figure 3 shows
a bivariate normal distribution with correlation ρ =0.8, and the middle panel shows a bivariate
normal distribution with correlation ρ=0.2. However, in the right panel, although ρ=0.2, the
bivariate distribution appears to be similar to the bivariate normal distribution with ρ =0.8. One
of the marginal distributions in the third panel in Figure 3 is normal, but the other is a mixed
normal, which illustrates that even a small change in one of marginal distribution can have a
large impact on ρ.
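The sensitivity of ρ to contamination in a single marginal can be sketched as follows (Python; the same 0.9/0.1 mixture as above, with sample size and seed chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.8

x = rng.standard_normal(n)
eps = rng.standard_normal(n)
y_clean = rho * x + np.sqrt(1 - rho**2) * eps   # bivariate normal, cor = 0.8

# Contaminate the error term: 10% of the draws come from N(0, 100).
eps[rng.random(n) < 0.1] *= 10.0
y_mixed = rho * x + np.sqrt(1 - rho**2) * eps

print(np.corrcoef(x, y_clean)[0, 1])  # close to 0.8
print(np.corrcoef(x, y_mixed)[0, 1])  # much smaller, though the bulk is unchanged
```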
Problem 2: Assuming normality via the central limit theorem
Many introductory statistics books claim that with a sample size of 30 or more, when testing
H0: μ = μ0, with μ0 given, normality can be assumed.10-12 Although this claim is not based on wild
speculations, two things were missed. First, when studying the sampling distribution of the
mean, 𝑋̅, early efforts focused on relatively light-tailed distributions for which outliers are
relatively rare. However, in moving from skewed, light-tailed distributions to skewed, heavy-tailed distributions, larger than anticipated sample sizes are needed for the sampling distribution
of 𝑋̅ to be approximately normal. Second, and seemingly more serious, is the implicit
assumption that if the distribution of 𝑋̅ is approximately normal,
T = √n(X̄ − μ)/s
will have, approximately, a Student's t distribution with n-1 degrees of freedom, where n is the
sample size. To illustrate that this is not necessarily the case, imagine that 25 observations are
randomly sampled from a lognormal distribution. In such a case the sampling distribution of the
mean is approximately normal, but the distribution of T is poorly approximated by a Student's t
distribution with 24 degrees of freedom. Figure 4 shows an estimate of the actual distribution of T based on a simulation with 10,000 replications. If T has a Student's t distribution, then P(T ≤ −2.086) = 0.025. But when sampling from a lognormal distribution, the actual probability is approximately 0.12. Moreover, P(T ≥ 2.086) is approximately 0.001 and E(T) = −0.54.
As a result, Student's t is severely biased. If the goal is to have an actual Type I error probability
between 0.025 and 0.075 when testing at the 0.05 level, a sample size of 200 or more is
required. For skewed distributions having heavier tails, roughly meaning the expected
proportion of outliers is higher, a sample size in excess of 300 can be required.
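A small simulation conveys the problem. The sketch below (Python; the 10,000-replication count mirrors the figure quoted above, everything else is an illustrative choice) draws samples of size 25 from a standard lognormal distribution and tabulates the tails of T:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 10_000
mu = np.exp(0.5)  # mean of the standard lognormal distribution

t_vals = np.empty(reps)
for i in range(reps):
    x = rng.lognormal(size=n)
    t_vals[i] = np.sqrt(n) * (x.mean() - mu) / x.std(ddof=1)

# Under a Student's t distribution both tail probabilities would be 0.025.
print(np.mean(t_vals <= -2.086))  # far above 0.025
print(np.mean(t_vals >= 2.086))   # far below 0.025
print(t_vals.mean())              # negative: T is biased
```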
A possible criticism of the illustrations just given is that they are based on hypothetical
distributions and that in actual studies such concerns may be less severe. But at least in some
situations, problems with achieving accurate probability coverage in actual studies have been
found to be worse.1 In particular, based on bootstrap samples, the true distribution of T can
differ from an assumed Student's t distribution even more than indicated here.1
Turning to the two-sample case, it should be noted that early simulation studies dealing with
non-normality focused on identical distributions. If the sample sizes are equal, the sampling
distribution of the difference between the sample means will be symmetric even when the
individual distributions are asymmetric but have the same amount of skewness. In the one-sample case, when sampling from a symmetric, heavy-tailed distribution, the actual probability level of Student's T is generally less than the nominal level, which helps explain why early simulation studies regarding the two-sample case concluded that Student's T is robust in terms
of controlling the probability of a Type I error. But more recent studies have indicated that
when the two distributions being compared differ in skewness, the actual Type I error
probability can exceed the nominal level by a substantial amount.1, 13
Problem 3: Heteroscedasticity
The third fundamental insight is that violating the usual homoscedasticity assumption (i.e. the assumption that all groups have a common variance) is much more serious than
once thought. Both relatively poor power and inaccurate confidence intervals can result. Cressie
and Whitford14 established general conditions under which Student's T is not even
asymptotically correct under random sampling. When comparing the means of two independent
groups, some software packages now default to the heteroscedastic method derived by Welch15,
which is asymptotically correct. Welch's method can produce more accurate confidence
intervals than Student's T, but serious concerns persist in terms of both Type I errors and power.
Indeed, all methods based on means can have relatively low power under arbitrarily small
departures from normality.1 Also, when comparing more than two groups, commonly used
software employs the homoscedastic ANOVA F test, and typically no heteroscedastic option is
provided. This is a concern because with more than two groups, problems are exacerbated in
terms of both Type I errors and power.1 Similar concerns arise when dealing with regression.
Robust Solutions
As suggested above, the application of standard methodology for comparing means can
seriously bias the conclusions drawn from clinical trial data. Fortunately, a wide array of
procedures has been developed that produce more accurate, robust results when the
assumptions that underlie standard procedures are violated. We describe some of these methods
below.
Alternate Measures of Location
One way of dealing with outliers is to replace the mean with the median. Two centuries ago,
Laplace was aware that for heavy-tailed distributions, the usual sample median can have a
smaller standard error than the mean.16 But a criticism is that under normality the median is
rather unsatisfactory in terms of power.
Note that the median belongs to the class of trimmed means. By definition, a trimmed mean is
the average after a prespecified proportion of the largest and smallest values is deleted. A crude
description of why methods based on the median can have unsatisfactory power is that they
trim all but one or two values. A way of dealing with this problem is to trim less, and based on
efficiency, Rosenberger and Gasko17 conclude that a 20% trimmed mean is a good choice for
general use. That is, its standard error compares well to the standard error of the mean under
normality, but unlike the mean, it provides a reasonable degree of protection against the
deleterious effects of outliers. Heteroscedastic methods for comparing trimmed means, when
dealing with a range of commonly used designs, have also been derived.1
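A 20% trimmed mean is easy to compute directly. The following Python sketch (the data vector is fabricated to show the effect of a single outlier) contrasts it with the ordinary mean:

```python
import numpy as np
from scipy import stats

data = np.array([2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.1, 3.3, 3.4, 95.0])

# 20% trimming removes the 2 smallest and 2 largest of the 10 values,
# so the estimate is the average of the middle 6 observations.
print(data.mean())                 # dragged to about 12 by the single outlier
print(stats.trim_mean(data, 0.2))  # stays near the bulk of the data, about 2.9
```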
Another general approach when dealing with outliers is to use a measure of location that first
empirically identifies values that are extreme, after which these extreme values are down
weighted or deleted. Robust M-estimators are based in part on this strategy.1,2,5,7 Methods for
testing hypotheses based on robust M-estimators and trimmed means have been studied
extensively. The strategy of applying standard hypothesis testing methods after outliers are
removed is technically unsound and can yield highly inaccurate results due to using an invalid
estimate of the standard error. Theoretically sound methods have been derived.1-7 In both
theoretical and simulation studies, when testing hypotheses about skewed distributions, it seems
that a 20% trimmed mean has an advantage over a robust M-estimator,1 although no one single
method dominates.
Transformations
A simple way of dealing with skewness is to transform the data. Well-known possibilities are
taking logarithms and using a Box-Cox transformation.18, 19 One serious limitation is that
simple transformations do not deal effectively with outliers, leading Doksum and Wong20 to
conclude that some amount of trimming remains beneficial even when transformations are
used. Another concern is that after data are transformed, the resulting distribution can remain
highly skewed. However, both theory and simulations indicate that trimmed means reduce
practical problems associated with skewness, particularly when used in conjunction with certain
bootstrap methods.1, 13
Nonparametric regression
There are well-known parametric methods for dealing with regression lines that are nonlinear.
But parametric models do not always provide a sufficiently flexible approach, particularly when
dealing with more than one predictor. There is a vast literature on nonparametric regression
estimators, sometimes called smoothers, for addressing this problem. Robust versions have
been developed.1
To provide a crude indication of the strategy used by smoothers, imagine that in a regression
situation the goal is to estimate the mean of Y, given that X=6, based on n pairs of observations.
The strategy is to focus on the observed X values close to 6 and use the corresponding Y values
to estimate the mean of Y. Typically, smoothers give more weight to Y values for which the
corresponding X values are close to 6. For pairs of points for which the X value is far from 6,
the corresponding Y values are ignored. The general problem, then, is determining which values
of X are close to 6 and how much weight the corresponding Y values should be given. Special
methods for handling robust measures of variation have also been derived.1
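The weighting idea can be sketched in a few lines of Python. The Gaussian kernel and bandwidth below are arbitrary illustrative assumptions, not the smoothers used later in this paper:

```python
import numpy as np

def local_weighted_mean(x, y, x0, h=0.5):
    """Estimate E(Y | X = x0): Y values whose X is near x0 get large
    weight; Y values whose X is far from x0 get essentially none."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

# On an exactly linear relation, the local estimate at an interior
# point recovers the line.
x = np.linspace(0, 12, 121)
y = 2.0 * x
print(local_weighted_mean(x, y, 6.0))  # close to 12, the true value of E(Y | X=6)
```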
Robust measures of association and effect size
There are two general approaches to robustly measuring the strength of association between
two variables. The first is to use some analog of Pearson's correlation that removes or down
weights outliers. The other is to fit a regression line and measure the strength of the association
based on this fit. Doksum and Samarov20 studied an approach based on this latter strategy using
what they call explanatory power. Let Y be some outcome variable of interest, let X be some
predictor variable, and let 𝑌̂ be the predicted value of Y based on some fit, which might be
based on either a robust linear model or a nonparametric regression estimator that provides a
flexible approach to curvature. Then explanatory power is
ρe² = σ²(Ŷ)/σ²(Y),
where σ²(Y) is the variance of Y. When Ŷ is obtained via ordinary least squares, ρe² is equal to ρ², where ρ is Pearson's correlation. A robust version of explanatory power is obtained simply by replacing the variance, σ², with some robust measure of variation.1 Here the biweight
midvariance is used with the understanding that several other choices are reasonable. Roughly,
the biweight midvariance empirically determines whether any values among the marginal
distributions are unusually large or small. If such values are found, they are down weighted.
This measure of variation has good efficiency among the many robust measures of variation
that have been proposed. It should be noted, however, that the choice of smoother matters
when estimating explanatory power, with various smoothers performing rather poorly in terms
of mean squared error and bias, at least with small to moderate sample sizes.21 One smoother
that performs relatively well in simulations is the robust version of Cleveland's smoother.22
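The definition is straightforward to compute. In the least-squares case the identity ρe² = ρ² can be checked numerically (Python sketch on simulated data; the robust biweight midvariance step is omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = 0.8 * x + 0.6 * rng.standard_normal(500)

# Least-squares fit and its predicted values.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

explanatory_power = y_hat.var() / y.var()      # var(Y-hat) / var(Y)
r_squared = np.corrcoef(x, y)[0, 1] ** 2

print(explanatory_power, r_squared)  # identical for a least-squares fit
```

A robust version replaces the two variances with, e.g., the biweight midvariance and the least-squares fit with a robust or nonparametric one.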
When comparing measures of location corresponding to two independent groups, Algina,
Keselman, and Penfield23 suggest a robust version of Cohen's measure of effect size that
replaces the means and the assumed common variance with a 20% trimmed mean and 20%
Winsorized variance. They also rescale their measure of effect size so that under normality, it is
equal to δ. The resulting estimate is denoted by dt. It is noted that a heteroscedastic analog of δ
can be derived using a robust version of explanatory power.24 Briefly, it reflects the variation of
the measures of location divided by some robust measure of variation of the pooled data, which
has been rescaled so that it estimates the population variance under normality. Here, we label
this version of explanatory power ξ², and ξ is called a heteroscedastic measure of effect size. It
can be shown that for normal distributions with a common variance, δ =0.2, 0.5 and 0.8
correspond to ξ =0.15, 0.35 and 0.5, respectively. This measure of effect size is readily
extended to more than two groups.
Consider again the two mixed normal distributions in the right panel of Figure 2, only now the
second distribution is shifted to have a mean equal to 0.8. Then Cohen's d is 0.24, suggesting a
small effect size, even though from a graphical perspective we have what Cohen characterized
as a large effect size. The Algina et al. measure of effect size, dt, is 0.71, and ξ = 0.46,
approximately, both of which suggest a large effect size.
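A sketch of the Algina et al. style estimator in Python (scipy's trim_mean and winsorize stand in for the R routines; the 0.642 rescaling constant is the 20% Winsorized standard deviation of a standard normal, which makes the estimate match δ under normality):

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def robust_d(x, y, trim=0.2):
    """Robust analog of Cohen's d: 20% trimmed mean difference over the
    pooled 20% Winsorized SD, rescaled by 0.642 to match delta under
    normality."""
    tdiff = stats.trim_mean(x, trim) - stats.trim_mean(y, trim)
    wx = np.asarray(winsorize(x, (trim, trim)))
    wy = np.asarray(winsorize(y, (trim, trim)))
    pooled_wvar = (np.var(wx, ddof=1) + np.var(wy, ddof=1)) / 2.0
    return 0.642 * tdiff / np.sqrt(pooled_wvar)

rng = np.random.default_rng(4)
x = rng.normal(0.5, 1.0, 100_000)  # population delta = 0.5 by construction
y = rng.normal(0.0, 1.0, 100_000)
print(robust_d(x, y))  # close to 0.5, as the rescaling intends
```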
Comments on Software
A practical issue is applying robust methods. This can be done using one of the most important
software developments during the last thirty years: the free software R, which can be
downloaded from http://www.r-project.org/. The illustrations used here are based on a library of
over 900 R functions1, which are available at http://www-rcf.usc.edu/~rwilcox/.
An R package WRS is available as well and can be installed with the R command
install.packages("WRS", repos="http://R-Forge.R-project.org"), assuming that the most recent
version of R is being used.
Practical illustration of robust methods: Analysis of a lifestyle intervention for older
adults
Design
To illustrate the practical benefits of using recently developed robust methods, we report some
results stemming from a recent RCT of a lifestyle intervention designed to promote the health
and well-being of community-dwelling older adults. Details regarding the wider RCT are
summarized by Clark et al.25 Briefly, this trial was conducted to compare a six-month lifestyle
intervention to a no treatment control condition. The design utilized a cross-over component,
such that the intervention was administered to experimental participants during the first six
study months and to control participants during the second six months. The results reported
here are based on a pooled sample of n=364 participants who completed assessments both prior
to and following receipt of the intervention.
In assessing the potential importance of using robust statistical methods, our goal was to assess
the association between number of hours of treatment and various outcome variables. Outcome
variables included: (a) eight indices of health-related quality of life, based on the SF -36
(physical function, bodily pain, general health, vitality, social function, mental health, physical
component scores, and mental component scores)26; (b) depression, based on the Center for
Epidemiologic Studies Depression Scale27; and (c) life satisfaction, based on the Life
Satisfaction Index – Z Scale21. In each instance, the outcome variable consisted of signed post-treatment minus pre-treatment change scores. Preliminary analyses revealed that all outcome variables had outliers based on boxplots, raising the possibility that modern
robust methods might make a practical difference in the conclusions reached.
Prior to discussing the primary results, it is informative to first examine plots of the regression
lines estimated via the nonparametric quantile regression method derived by Koenker and Ng28.
Figure 5 shows the estimated regression line (based on the R function qsmcobs) when
predicting the median value of the variable physical function given the number of hours of
individual sessions28. Notice that the regression line is approximately horizontal from 0 to 5
hours, but then an association appears. Pearson's correlation over all sessions is 0.178
(p=0.001), indicating an association. However, Pearson’s correlation is potentially misleading
because the association appears to be nonlinear. The assumption of a straight regression line
can be tested with the R function qrchk, which yields p=0.006. The method used by this
function stems from results derived by He and Zhu29, which is based on a nonparametric
method for estimating the median of a given outcome variable of interest. As can be seen, it
appears that with 5 hours or less, the treatment has little or no association with physical
function, but at or above 5 hours, an association appears. In fitting a straight regression line via
the Theil and Sen estimator, for hours greater than or equal to 5, the robust explanatory measure
of association is 0.347. And the hypothesis of a zero slope, tested with the R function regci,
yields p=0.015. For hours less than 5, the Theil-Sen estimate of the slope is 0. In summary,
Pearson’s correlation indicates an association, but it is misleading in the sense that for less than
5 hours there appears to be little or no association, and for more than 5 hours the association
appears to be much stronger than indicated by Pearson’s correlation.
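The resistance of the Theil-Sen estimator (the median of all pairwise slopes) relative to least squares can be sketched as follows (Python, with simulated data standing in for the trial variables):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 60)
y = 0.5 * x + 0.3 * rng.standard_normal(60)
y[:3] += 15.0  # a few gross outliers at the low end of x

ts_slope = stats.theilslopes(y, x)[0]  # median of all pairwise slopes
ols_slope = np.polyfit(x, y, 1)[0]     # ordinary least squares

print(ts_slope)   # stays near the true slope of 0.5
print(ols_slope)  # badly distorted by the three outliers
```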
Note also that after about 5 hours, Figure 5 suggests that the association is strictly increasing.
However, the use of a running-interval smoother with bootstrap bagging, which was applied
with the R function rplotsm, yields the plot in Figure 6. Again, there is little or no association
initially and around 5-7 hours an association appears, but unlike Figure 5, Figure 6 suggests that
the regression line levels off at about 9 or 10 hours. The main point here is that smoothers can
play an important role in terms of detecting and describing an association. Indeed, no
association is found after 9 hours, although this might be due to low power.
Similar issues regarding nonlinear associations occurred when analyzing other outcome
variables. Figure 7 shows the median regression line when predicting physical composite based
on individual session hours. Pearson's correlation rejects the null hypothesis of no association
(r=.2, p=0.015), but again a smoother indicates that there is little or no association from about 0
to 5 hours, after which an association appears. For 0 to 5 hours, r=-0.071 (p=0.257), and for 5
hours or more, r=0.25 (p=0.045). But the 20% Winsorized correlation for 5 hours or more is
rw=0.358 (p=0.0045). (Winsorizing means that the more extreme values are “pulled in”, which
reduces the deleterious effects of outliers). The explanatory measure of effect size, ρe, is
estimated to be 0.47.
A cautionary remark should be made. When using a smoother, the ends of the regression line
can be misleading. In Figure 8, as the number of session hours gets close to the maximum
observed value, the regression appears to be monotonic decreasing. But this could be due to an
inherent bias when dealing with predictor values that are close to the maximum value. For
example, if the goal is to predict some outcome variable Y when X=8, the method gives more
weight to pairs of observations for which the X values are close to 8. But if there are many X
values less than 8, and only a few larger than 8, this can have a negative impact on the accuracy
of the predicted Y value.
Figure 8 shows the estimate of the regression line based on a running-interval smoother with
bootstrap bagging. When the number of session hours is relatively high, again there is some
indication of a monotonic decreasing association, but it is less pronounced, suggesting that it
might be an artifact of the estimators used. One concern is that there are very few points close
to the maximum number of session hours, 16.5. Consequently, the precision of the estimates of
physical composite, given that the number of session hours is close to 16.5, would be expected
to be relatively low.
Table 1 summarizes measures of association between session hours and the 10 outcome
variables. Some degree of curvature was found for all of the variables listed. Shown are
Pearson's r; the Winsorized correlation rw , which measures the strength of a linear association
while guarding against outliers among the marginal distributions; and the robust explanatory
measure of the strength of association based on Cleveland's smoother, re . (Note that there is no
p-value reported for re . Methods for determining a p-value are available but the efficacy of
these methods, in the context of Type I error probabilities, has not been investigated.)
As can be seen, when testing at the 0.05 level, if Pearson's correlation rejects, the Winsorized
correlation rejects as well. There are, however, four instances where the Winsorized correlation
rejects and Pearson's correlation does not: vitality, mental composite, depression, and life
satisfaction. Particularly striking are the results for depression. Pearson's correlation is -0.022
(p=0.694), whereas the Winsorized correlation is -0.132 (p=0.018). This illustrates the basic
principle that outliers can mask a strong association among the bulk of the points.
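A Winsorized correlation is simple to compute directly. This Python sketch illustrates the masking effect with fabricated data, not the trial data:

```python
import numpy as np
from scipy.stats.mstats import winsorize

def wincor(x, y, trim=0.2):
    """20% Winsorized correlation: pull in the extreme fifth of each
    margin, then compute Pearson's r on the Winsorized values."""
    xw = np.asarray(winsorize(np.asarray(x, float), (trim, trim)))
    yw = np.asarray(winsorize(np.asarray(y, float), (trim, trim)))
    return float(np.corrcoef(xw, yw)[0, 1])

# A strong association in the bulk of the data, masked by two outliers.
x = np.concatenate([np.linspace(0, 1, 30), [0.00, 0.05]])
y = np.concatenate([np.linspace(0, 1, 30), [20.0, 22.0]])

print(np.corrcoef(x, y)[0, 1])  # small or negative: outliers mask the trend
print(wincor(x, y))             # recovers the strong association in the bulk
```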
Note that re values are always positive. This is because it measures the strength of the
association without assuming that the regression line is monotonic. That is, it allows for the
possibility that the regression line is increasing over some region of the predictor values, but
decreasing otherwise.
Another portion of the study dealt with comparing the change scores of two groups of
participants. The variables were measures of physical function, bodily pain, physical composite
and a cognitive score. For the first group there was an ethnic match between the participant and
therapist; in the other group there was no ethnic match. The total sample size was n=205. Table
2 summarizes the results when comparing means with Welch's test and 20% trimmed means
with Yuen's test, which was applied with the R function yuenv2. (A theoretically correct
estimate of the standard error of a trimmed mean is based in part on the 20% Winsorized
variance, which was used here.) As can be seen, Welch's test rejects at the 0.05 level for three
of the four outcome variables. In contrast, Yuen's test rejects in all four situations. Also shown
are the three measures of effect size previously described. Note that following Cohen's
suggestion, d is relatively small for the first outcome variable and a medium effect size is
indicated in the other three cases. The Algina et al.23 measure of effect size, dt, is a bit larger. Of
particular interest is the heteroscedastic measure of effect size, ξ, which effectively deals with outliers. Now the results indicate a medium to large effect size (ξ > 0.3) for three of the outcome variables. (Formal hypothesis testing methods for comparing the population analogs of d and ξ have not been derived.)
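Yuen's test itself is short enough to sketch (Python; in R one would call yuen or yuenv2 from the library noted above). The formulas follow the standard Yuen statistic; this is an illustrative re-implementation, not the code used for Table 2:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def yuen_test(x, y, trim=0.2):
    """Yuen's heteroscedastic test for comparing 20% trimmed means.
    Standard errors are based on the Winsorized variances."""
    n1, n2 = len(x), len(y)
    h1, h2 = n1 - 2 * int(trim * n1), n2 - 2 * int(trim * n2)
    w1 = np.var(np.asarray(winsorize(x, (trim, trim))), ddof=1)
    w2 = np.var(np.asarray(winsorize(y, (trim, trim))), ddof=1)
    d1 = (n1 - 1) * w1 / (h1 * (h1 - 1))
    d2 = (n2 - 1) * w2 / (h2 * (h2 - 1))
    t = (stats.trim_mean(x, trim) - stats.trim_mean(y, trim)) / np.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1**2 / (h1 - 1) + d2**2 / (h2 - 1))
    return t, 2 * stats.t.sf(abs(t), df)  # statistic and two-sided p-value

rng = np.random.default_rng(6)
a = rng.standard_normal(50)
b = rng.standard_normal(50) + 1.0  # true shift of one standard deviation
t_stat, p_val = yuen_test(a, b)
print(t_stat, p_val)  # a clearly significant difference
```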
Discussion
There are many important applications of robust statistical methodology that extend beyond
those described in this paper. For example, the violation of assumptions in more complex
experimental designs is associated with especially dire consequences, and robust methods for
dealing with these problems have been derived1.
A point worth stressing is that no single method is always best. The optimal method in a given
situation depends in part on the magnitude of true differences or associations, and the nature of
the distributions being studied, which are often unknown. Robust methods are designed to
perform relatively well under normality and homoscedasticity, while continuing to perform
well when these assumptions are violated. It is possible that classic methods offer an advantage
in some situations. For example, for skewed distributions, the difference between the means
might be larger than the difference between the 20% trimmed mean, suggesting that power
might be higher using means. However, this is no guarantee that means have more power
because it is often the case that a 20% trimmed mean has a substantially smaller standard error.
Of course, classic methods perform well when standard assumptions are met, but modern robust
methods are designed to perform nearly as well for this particular situation. In general,
complete reliance on classic techniques seems highly questionable in that hundreds of papers
published during the last quarter century underscore the strong practical value of modern
procedures. All indications are that classic, routinely used methods can be highly unsatisfactory
under general conditions. Moreover, access to modern methods is now available via a vast array
of R functions. The main message here is that modern technology offers the opportunity for
analyzing data in a much more flexible and informative manner, which would enable
researchers to learn more from their data, thereby augmenting the accuracy and practical utility
of clinical trial results.
Table 1: Measures of association between hours of treatment and the variables listed in column 1 (n=364)

                     Pearson's r      p       rw*       p       re**
PHYSICAL FUNCTION       0.178       0.001    0.135    0.016    0.048
BODILY PAIN             0.170       0.002    0.156    0.005    0.198
GENERAL HEALTH          0.209       0.0001   0.130    0.012    0.111
VITALITY                0.099       0.075    0.139    0.012    0.241
SOCIAL FUNCTION         0.112       0.043    0.157    0.005    0.228
MENTAL HEALTH           0.141       0.011    0.167    0.003    0.071
PHYSICAL COMPOSITE      0.200       0.0002   0.136    0.015    0.255
MENTAL COMPOSITE        0.095       0.087    0.149    0.007    0.028
DEPRESSION             -0.022       0.694   -0.132    0.018    0.134
LIFE SATISFACTION       0.086       0.125    0.118    0.035    0.119

*rw = 20% Winsorized correlation
**re = robust explanatory measure of association
Table 2: P-values and measures of effect size when comparing ethnically matched patients to a
non-matched group

Variable              Yuen's test   Welch's test   d        dt       ξ
                      p-value       p-value
Physical Function     0.0469        0.1445         0.212    0.310    0.252
Bodily Pain           0.01397       <.0001         0.591    0.666    0.501
Physical Composite    <.0001        0.0002         0.420    0.503    0.391
Cognition             0.0332        0.0091         0.415    0.408    0.308

The measures of effect size are d (Cohen's d), dt (a robust measure of effect size that assumes
homoscedasticity), and ξ (a robust measure of effect size that allows heteroscedasticity). Under
normality and homoscedasticity, d = 0.2, 0.5 and 0.8 approximately correspond to ξ = .15, .35
and .5, respectively.
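Yuen's test, used in Table 2, compares the 20% trimmed means of two independent groups while allowing unequal Winsorized variances. The following is a minimal Python sketch of the standard Yuen (1974) formulas as described in Wilcox (2005); it is an illustrative reimplementation, not the paper's code (the R function yuen in Wilcox's WRS package performs this analysis):

```python
import numpy as np
from scipy import stats

def yuen(x, y, tr=0.2):
    """Yuen's test: compare the 20% trimmed means of two independent
    groups, allowing unequal (Winsorized) variances."""
    def pieces(a):
        a = np.asarray(a, dtype=float)
        n = len(a)
        g = int(np.floor(tr * n))          # observations trimmed per tail
        h = n - 2 * g                      # observations left after trimming
        tm = stats.trim_mean(a, tr)        # 20% trimmed mean
        sw2 = np.asarray(stats.mstats.winsorize(a, limits=(tr, tr))).var(ddof=1)
        d = (n - 1) * sw2 / (h * (h - 1))  # squared standard error contribution
        return tm, d, h

    tm1, d1, h1 = pieces(x)
    tm2, d2, h2 = pieces(y)
    t = (tm1 - tm2) / np.sqrt(d1 + d2)
    # Welch-type degrees of freedom for unequal variances.
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return t, 2 * stats.t.sf(abs(t), df)

# Two groups whose trimmed means differ by roughly one standard deviation.
rng = np.random.default_rng(2)
t_stat, p_val = yuen(rng.normal(0, 1, 100), rng.normal(1, 1, 100))
print(f"T = {t_stat:.3f}, p = {p_val:.4f}")
```

Recent SciPy releases also accept a trim argument in scipy.stats.ttest_ind, which performs a trimmed (Yuen-type) t-test directly.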
Figure 1: Despite the obvious similarity between the standard normal and contaminated normal
distributions, the standard normal has variance 1 and the contaminated normal has variance
10.9.
Figure 2: Left panel, d=1, power=0.96. Right panel, d=.3, power=.28.
Figure 3: Illustration of the sensitivity of Pearson's Correlation to contaminated (heavy-tailed)
distributions.
Figure 4: The solid line is the distribution of Student's T, n = 30, when sampling from a
lognormal distribution. The dashed line is the distribution of T when sampling from a normal
distribution.
Figure 5: The median regression line for predicting physical function based on the number of
session hours.
Figure 6: The running-interval smooth based on bootstrap bagging using the same data in
Figure 5.
Figure 7: The median regression line for predicting physical composite based on the number of
session hours.
Figure 8: The estimated regression line for predicting physical composite based on the number
of session hours. The estimate is based on the running-interval smoother with bootstrap
bagging.
References
1. Wilcox RR. Introduction to robust estimation and hypothesis testing. 2nd Ed. Elsevier
Academic Press, Burlington, MA, 2005.
2. Hampel FR, Ronchetti EM, Rousseeuw PJ, and Stahel WA. Robust statistics: the approach
based on influence functions. Wiley, New York, 1986.
3. Hoaglin DC, Mosteller F, and Tukey, JW. Understanding robust and exploratory data
analysis. Wiley, New York, 1983.
4. Hoaglin DC, Mosteller F, and Tukey JW. Exploring data tables, trends, and shapes. Wiley,
New York, 1985.
5. Huber P and Ronchetti EM. Robust statistics. 2nd Ed. Wiley, New York, 2009.
6. Maronna RA, Martin MA and Yohai VJ. Robust statistics: theory and methods. Wiley, New
York, 2006.
7. Staudte RG and Sheather SJ. Robust estimation and testing. Wiley, New York, 1990.
8. Tukey JW. A survey of sampling from contaminated normal distributions. In: I. Olkin et al.
(Eds.) Contributions to probability and statistics. Stanford University Press, Stanford, CA,
1960, pp. 448-485.
9. Cohen, J. Statistical Power Analysis for the Behavioral Sciences. 2nd Ed. Academic Press,
New York, 1988.
10. Shavelson RJ. Statistical reasoning for the behavioral sciences. Needham Heights, MA:
Allyn and Bacon, 1988, p. 266.
11. Goldman RN and Weinberg JS. Statistics: an introduction. Englewood Cliffs, NJ: Prentice-Hall, 1985, p. 252.
12. Huntsberger DV and Billingsley P. Elements of statistical inference. 5th ed. Boston, MA:
Allyn and Bacon, 1981.
13. Cribbie RA, Fiksenbaum L, Keselman HJ and Wilcox RR. Effects of nonnormality on test
statistics for one-way independent groups designs. British Journal of Mathematical and
Statistical Psychology, in press.
14. Cressie NA and Whitford HJ. How to use the two sample t-test. Biometrical Journal,
1986; 28: 131-148.
15. Welch BL. The significance of the difference between two means when the population
variances are unequal. Biometrika 1938; 29, 350-362.
16. Hald A. A history of mathematical statistics from 1750 to 1930. Wiley, New York, 1998.
17. Rosenberger JL and Gasko M. Comparing location estimators: Trimmed means, medians,
and trimean. In: D. Hoaglin, F. Mosteller, J. Tukey (Eds.) Understanding robust and
exploratory data analysis. Wiley, New York, 1983, pp. 297-336.
18. Box GE and Cox DR. An analysis of transformations. Journal of the Royal Statistical
Society B 1964; 26: 211-252.
19. Sakia R M. The Box-Cox transformation: A review. The Statistician 1992; 41:169-178.
20. Doksum KA and Wong C-W. Statistical tests based on transformed data. Journal of the
American Statistical Association 1983; 78: 411-417.
21. Wood V, Wylie ML and Sheafor B. An analysis of short self-report measure of life
satisfaction: Correlation with rater judgments. Journal of Gerontology 1969; 24:465-469.
22. Cleveland WS. Robust locally-weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 1979; 74: 829-836.
23. Algina J, Keselman HJ and Penfield RD. An alternative to Cohen's standardized mean
difference effect size: A robust parameter and confidence interval in the two independent
groups case. Psychol Methods 2005; 10: 317-328.
24. Wilcox RR and Tian T. Measuring effect size: A robust heteroscedastic approach for two
or more groups. Journal of Applied Statistics, in press.
25. Clark F, Jackson J, Mandel D, Blanchard J, Carlson M, Azen S, et al. Confronting
challenges in intervention research with ethnically diverse older adults: the USC Well Elderly II
trial. Clin Trials 2009; 6: 90-101.
26. Ware JE and Kosinski M, Dewey JE. How to score version two of the SF-36 Health Survey.
QualityMetric, Lincoln, RI, 2000.
27. Radloff LS. The CES-D Scale: A self-report depression scale for research in the general
population. Applied Psychological Measurement 1977; 1:385-401.
28. Koenker R and Ng P. Inequality Constrained Quantile Regression. Sankhya, Indian Journal
of Statistics 2005; 67:418-440.
29. He X and Zhu L-X. A lack-of-fit test for quantile regression. Journal of the American
Statistical Association 2003; 98:1013-1022.