Overview of Robust Methods Analysis

Jinxia Ma
November 7, 2013
Contents
• What are robust methods
• Why robust methods
• How to conduct the robust methods analysis
• Apply robust analysis to your data
What are “robust methods”?
• Robust statistics are statistics that perform well
for data drawn from a wide range of probability
distributions, especially non-normal distributions.
They are designed to be resistant to:
– Outliers
– Departures from parametric distributions
Why robust methods?
• What is the problem with standard
methods?
– Example: Linear regression assumptions
• Linearity
• Independence of errors
• Errors are normally distributed
• Homoscedasticity
– Example: comparing groups (ANOVA F-test)
• Errors have a common variance, are normally distributed,
and are independent
Why robust methods?
– Example: Detecting differences among groups
• Problem 1: Heavy-tailed distributions
Figure 1: Despite the obvious similarity between the standard normal
and contaminated normal distributions, the standard normal has
variance 1 and the contaminated normal has variance 10.9.
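The variance quoted in the caption can be checked with a short simulation. The sketch below is in Python and assumes the contaminated normal is the commonly used mixture that draws from N(0, 1) with probability 0.9 and from N(0, 10²) with probability 0.1; the slide does not state the mixture, so `eps`, `k`, and the function name are illustrative assumptions. Under that assumption the variance is 0.9·1 + 0.1·100 = 10.9.

```python
import random

def contaminated_normal(rng, eps=0.1, k=10.0):
    """Draw N(0, 1) with probability 1 - eps, else N(0, k^2).

    eps and k are assumed values; they are not given on the slide.
    """
    sd = k if rng.random() < eps else 1.0
    return rng.gauss(0.0, sd)

rng = random.Random(2013)
sample = [contaminated_normal(rng) for _ in range(100_000)]

# Both components have mean 0, so the mixture variance is the
# weighted average of the component variances.
theoretical_var = 0.9 * 1.0 + 0.1 * 10.0 ** 2

mean = sum(sample) / len(sample)
sample_var = sum((v - mean) ** 2 for v in sample) / len(sample)

print(f"theoretical variance: {theoretical_var:.1f}")
print(f"simulated variance:   {sample_var:.2f}")
```

A small fraction of observations from the wide component is enough to inflate the variance roughly tenfold, even though the two density curves look alike.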
Why robust methods?
– Example: Detecting differences among groups
• Problem 1: Heavy-tailed distributions
Figure 2: Left panel, power = 0.96. Right panel, power = 0.28
(n = 25 per group, Student’s T test).
Why robust methods?
– Example: Detecting differences among groups
• Problem 1: Heavy-tailed distributions
Figure 3: Left panel, a bivariate normal distribution, corr = .8.
Middle panel, a bivariate normal distribution, corr= .2.
Right panel, one marginal distribution is normal, but the other
is a contaminated normal, corr = .2.
Why robust methods?
– Example: Detecting differences among groups
• Problem 2: Assuming normality via the central limit
theorem
Figure 4: The distribution of Student’s T, n=25, when sampling from a
(standard) lognormal distribution. The dashed line is the distribution
under normality.
Under normality: P(T<=-2.086)=P(T>=2.086)=.025, E(T)=0.
For the “lognormal T”: P(T<=-2.086)=.12, P(T>=2.086)=.001, E(T)=-.54.
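The asymmetry described for Figure 4 can be reproduced approximately by simulation. The following Python sketch assumes a one-sample T statistic centered at the true mean of the standard lognormal, exp(0.5); the number of replications, the seed, and the helper name `one_sample_t` are illustrative choices, so the estimated probabilities will only be close to the values quoted above.

```python
import math
import random

def one_sample_t(sample, mu0):
    """Student's T statistic for testing that the mean equals mu0."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

rng = random.Random(7)
n, reps = 25, 10_000
true_mean = math.exp(0.5)   # mean of the standard lognormal
crit = 2.086                # cutoff quoted on the slide

t_values = []
for _ in range(reps):
    sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    t_values.append(one_sample_t(sample, true_mean))

lower_tail = sum(t <= -crit for t in t_values) / reps
upper_tail = sum(t >= crit for t in t_values) / reps
mean_t = sum(t_values) / reps

print(f"P(T <= -{crit}) ~ {lower_tail:.3f}")
print(f"P(T >= {crit}) ~ {upper_tail:.3f}")
print(f"E(T) ~ {mean_t:.2f}")
```

If T really followed Student's distribution, both tail estimates would be near .025 and the mean near 0.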
Why robust methods?
– Example: Detecting differences among groups
• Problem 3: Heteroscedasticity
– The third fundamental insight is that violating the usual
homoscedasticity assumption (i.e., that all groups have a
common variance) is much more serious than once thought.
Both relatively poor power and inaccurate confidence
intervals can result.
How to test/compare robust methods?
– Example: Comparing dependent groups with
missing values: an approach based on a robust
method
• 1: Simulation
• 2: Bootstrap
How to test/compare robust methods?
– Example: Comparing dependent groups with
missing values: an approach based on a robust
method
• 1: Simulation
– g-and-h distribution
– Let Z be a random variable generated from a standard normal
distribution. Then
W = ((exp(gZ) − 1)/g) · exp(hZ²/2) if g > 0, and
W = Z · exp(hZ²/2) if g = 0,
has a g-and-h distribution.
How to test/compare robust methods?
– Example: Comparing dependent groups with
missing values: an approach based on a robust
method
• 1: Simulation
– g-and-h distribution
» g = h = 0: standard normal
» g > 0: skewed; the bigger the value of g, the more skewed.
» h > 0: heavy-tailed; the bigger the value of h, the more
heavy-tailed.
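A minimal Python sketch of generating g-and-h data, using the standard definition W = ((exp(gZ) − 1)/g)·exp(hZ²/2) for g > 0 and W = Z·exp(hZ²/2) for g = 0, with Z standard normal. The particular (g, h) pairs, the seed, and the helper names are illustrative assumptions, not values taken from the slides.

```python
import math
import random

def g_and_h(z, g=0.0, h=0.0):
    """Transform a standard normal value z into a g-and-h value."""
    tail = math.exp(h * z * z / 2.0)
    if g == 0.0:
        return z * tail
    return (math.exp(g * z) - 1.0) / g * tail

def sample_skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)
z_draws = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

results = {}
for g, h in [(0.0, 0.0), (0.5, 0.0), (0.0, 0.2), (0.5, 0.2)]:
    w = [g_and_h(z, g, h) for z in z_draws]
    results[(g, h)] = (sample_skewness(w), max(abs(v) for v in w))
    skew, extreme = results[(g, h)]
    print(f"g={g}, h={h}: skewness={skew:.2f}, largest |W|={extreme:.1f}")
```

With g = h = 0 the transform returns Z unchanged; raising g pulls the right tail out, and raising h stretches both tails.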
How to test/compare robust methods?
• 1: Simulation
– g-and-h distribution
How to test/compare robust methods?
• 2: Bootstrap (B = 2000)
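One common way to use B = 2000 bootstrap samples is a percentile bootstrap confidence interval. The Python sketch below applies it to the 20% trimmed mean of paired difference scores. It assumes complete pairs only (it does not handle the missing values discussed above), and the data, seed, and function names are made up for illustration.

```python
import math
import random

def trimmed_mean(values, tr=0.2):
    """Mean after dropping the lowest and highest tr fraction."""
    xs = sorted(values)
    g = math.floor(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def percentile_bootstrap_ci(data, stat, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    cut = round(alpha / 2 * B)          # 50 when B = 2000, alpha = .05
    return boot[cut], boot[B - cut - 1]

# Hypothetical paired measurements with one wild value.
rng = random.Random(42)
before = [rng.gauss(50, 10) for _ in range(30)]
after = [b + rng.gauss(3, 5) for b in before]
after[0] += 60
diffs = [a - b for a, b in zip(after, before)]

estimate = trimmed_mean(diffs)
lo, hi = percentile_bootstrap_ci(diffs, trimmed_mean, B=2000)
print(f"20% trimmed mean of differences: {estimate:.2f}")
print(f"95% percentile bootstrap CI: ({lo:.2f}, {hi:.2f})")
```

An interval that excludes 0 would suggest a difference between the two dependent groups.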
Robust solutions
– Alternate Measures of Location
• One way of dealing with outliers is to replace the mean
with alternative measures of location
– Median
– Trimmed mean
– Winsorized mean
– M-estimator
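The first three measures are easy to compute by hand. The Python sketch below uses a made-up sample with one outlier and assumes 20% trimming and Winsorizing. The M-estimator is omitted because it needs an iterative fit, and the function names are illustrative.

```python
import math
import statistics

def trimmed_mean(values, tr=0.2):
    """Drop the g smallest and g largest values, then average the rest."""
    xs = sorted(values)
    g = math.floor(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def winsorized_mean(values, tr=0.2):
    """Pull the g smallest and g largest values in to the nearest kept
    value, then average all n values."""
    xs = sorted(values)
    n = len(xs)
    g = math.floor(tr * n)
    low, high = xs[g], xs[n - g - 1]
    return sum(min(max(v, low), high) for v in xs) / n

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up sample, one outlier

print("mean:           ", statistics.mean(data))
print("median:         ", statistics.median(data))
print("trimmed mean:   ", trimmed_mean(data))
print("Winsorized mean:", winsorized_mean(data))
```

The single value of 100 drags the ordinary mean far above the bulk of the data, while the three alternatives stay near the middle of the sample.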
Robust solutions
– Transformations
• A simple way of dealing with skewness is to transform
the data.
– Logarithms
– Simple transformations do not deal effectively with outliers
– The resulting distributions can remain highly skewed
Robust solutions
– Nonparametric regression
• Sometimes called smoothers.
• Imagine that in a regression situation the goal is to
estimate the mean of Y, given that X=6, based on n
pairs of observations. The strategy is to focus on the
observed X values close to 6 and use the corresponding
Y values to estimate the mean of Y. Typically, smoothers
give more weight to Y values for which the
corresponding X values are close to 6. For pairs of
points for which the X value is far from 6, the
corresponding Y values are ignored.
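The idea can be sketched with a simple kernel smoother in Python. This is only one of many smoothers and is not necessarily the one used for the figures in these slides. The weight function, the `span` parameter, and the data are illustrative assumptions.

```python
def kernel_smooth(xs, ys, x0, span=1.5):
    """Estimate the mean of Y at X = x0 as a weighted average of the Y
    values: weight 1 - u^2 for u = (x - x0) / span, and weight 0 for
    any point farther than span from x0."""
    weights = []
    for x in xs:
        u = (x - x0) / span
        weights.append(max(0.0, 1.0 - u * u))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within span of x0")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Made-up data: Y = 2X + 1 on a grid, plus one wild Y value far from X = 6.
xs = [i * 0.5 for i in range(25)]        # 0.0, 0.5, ..., 12.0
ys = [2.0 * x + 1.0 for x in xs]
ys[0] = 1000.0                           # outlier at X = 0

estimate_at_6 = kernel_smooth(xs, ys, x0=6.0)
print(f"estimated mean of Y at X = 6: {estimate_at_6:.3f}")
```

Because the point at X = 0 gets zero weight, the wild Y value there has no effect on the estimate at X = 6.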
Robust solutions
– Robust measures of association
• Use some analog of Pearson’s correlation that removes
or downweights outliers
• Fit a regression line and measure the strength of the
association based on this fit.
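One analog of this kind is the Winsorized correlation: Winsorize each variable separately, then compute Pearson's correlation on the Winsorized values. The Python sketch below assumes 20% Winsorizing and uses made-up data, and the function names are illustrative.

```python
import math

def winsorize(values, tr=0.2):
    """Pull the g smallest and g largest values in to the nearest kept
    value, preserving the original order of the observations."""
    xs = sorted(values)
    n = len(xs)
    g = math.floor(tr * n)
    low, high = xs[g], xs[n - g - 1]
    return [min(max(v, low), high) for v in values]

def pearson(x, y):
    """Ordinary Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def winsorized_correlation(x, y, tr=0.2):
    """Pearson's correlation computed on the Winsorized values."""
    return pearson(winsorize(x, tr), winsorize(y, tr))

# Made-up data: Y equals X except for one outlying point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, -50]

r = pearson(x, y)
rw = winsorized_correlation(x, y)
print(f"Pearson r:              {r:.3f}")
print(f"Winsorized correlation: {rw:.3f}")
```

A single outlier is enough to make Pearson's r negative here, while the Winsorized correlation still reflects the positive association in the bulk of the points.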
Practical Illustration of Robust
Methods
– Analysis of a lifestyle intervention for older adults
• N=364
• This trial was conducted to compare a six-month
lifestyle intervention to a no treatment control
condition
• Outcome variables: (a) eight indices of health-related
quality of life; (b) depression; (c) life satisfaction.
• Preliminary analysis based on boxplots revealed outliers
in all outcome variables.
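A boxplot flags a value as an outlier when it lies more than 1.5 interquartile ranges beyond the quartiles. The Python sketch below applies that rule to made-up scores. The quartile method is an assumption, since software packages compute quartiles in slightly different ways, and the function name is illustrative.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up data
flagged = boxplot_outliers(scores)
print("flagged as outliers:", flagged)
```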
Practical Illustration of Robust
Methods
– Analysis of a lifestyle intervention for older adults
Figure 5: The median regression line for predicting physical function based on
the number of session hours (R function: qsmcobs).
- r=.178 (p=.001). However, the association appears to be non-linear.
Practical Illustration of Robust
Methods
– Analysis of a lifestyle intervention for older adults
Figure 6: The median regression line for predicting physical composite based on
the number of session hours (R function: qsmcobs).
- For 0 to 5 hours, r=-.071 (p=.257).
- For 5 hours or more, r=.25 (p=.045).
Practical Illustration of Robust
Methods
– Analysis of a lifestyle intervention for older adults
                      Pearson’s r   p        rw *     p       re **
PHYSICAL FUNCTION     0.178         0.001    0.135    0.016   0.048
BODILY PAIN           0.170         0.002    0.156    0.005   0.198
GENERAL HEALTH        0.209         0.0001   0.130    0.012   0.111
VITALITY              0.099         0.075    0.139    0.012   0.241
SOCIAL FUNCTION       0.112         0.043    0.157    0.005   0.228
MENTAL HEALTH         0.141         0.011    0.167    0.003   0.071
PHYSICAL COMPOSITE    0.200         0.0002   0.136    0.015   0.255
MENTAL COMPOSITE      0.095         0.087    0.149    0.007   0.028
DEPRESSION            -0.022        0.694    -0.132   0.018   0.134
LIFE SATISFACTION     0.086         0.125    0.118    0.035   0.119
Table 1: Measures of association between hours of treatment and the variables
listed in column 1 (n = 364).
rw * = 20% Winsorized correlation
Practical Illustration of Robust
Methods
– Analysis of a lifestyle intervention for older adults
                      Welch’s test:   Yuen’s test:
                      p-value         p-value         d       dt      ξ
Physical Function     0.1445          0.0469          0.212   0.310   0.252
Bodily Pain           .01397          <.0001          0.591   0.666   0.501
Physical Composite    <.0001          0.0002          0.420   0.503   0.391
Cognition             0.0332          0.0091          0.415   0.408   0.308
Table 2: P-values when comparing ethnic matched group patients to a nonmatched group.
Welch’s test: dealing with heteroscedasticity
Yuen’s test: based on trimmed means
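The two tests are closely related: Yuen's statistic compares trimmed means using Winsorized variances, and with no trimming it reduces to Welch's statistic. The Python sketch below computes the test statistic and degrees of freedom only. Turning them into a p-value requires a Student's t distribution function, which is not included here. The data and function names are made up for illustration.

```python
import math

def trimmed_mean(values, tr):
    """Mean after dropping the lowest and highest tr fraction."""
    xs = sorted(values)
    g = math.floor(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def winsorized_variance(values, tr):
    """Sample variance (n - 1 denominator) of the Winsorized values."""
    xs = sorted(values)
    n = len(xs)
    g = math.floor(tr * n)
    low, high = xs[g], xs[n - g - 1]
    w = [min(max(v, low), high) for v in xs]
    mean_w = sum(w) / n
    return sum((v - mean_w) ** 2 for v in w) / (n - 1)

def yuen(x, y, tr=0.2):
    """Yuen's statistic and degrees of freedom; tr=0 gives Welch's test."""
    def pieces(values):
        n = len(values)
        h = n - 2 * math.floor(tr * n)      # observations left after trimming
        d = (n - 1) * winsorized_variance(values, tr) / (h * (h - 1))
        return trimmed_mean(values, tr), d, h
    m1, d1, h1 = pieces(x)
    m2, d2, h2 = pieces(y)
    stat = (m1 - m2) / math.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return stat, df

# Made-up groups; the first contains one outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
y = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

t_welch, df_welch = yuen(x, y, tr=0.0)
t_yuen, df_yuen = yuen(x, y, tr=0.2)
print(f"Welch: T = {t_welch:.3f}, df = {df_welch:.1f}")
print(f"Yuen:  T = {t_yuen:.3f}, df = {df_yuen:.1f}")
```

In this made-up example the outlier inflates the variance of the first group, so Welch's statistic is small, while Yuen's statistic on trimmed means is much larger in absolute value.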
No single method is always best.
Software
– R: www.r-project.org
– www.rcf.usc.edu/~rwilcox
– Example: comparing two groups
• > x1=read.table(file=" ")
• > x2=read.table(file=" ")
• > x<-list(x1,x2)
• > lincon(x,tr=0.2,alpha=0.05)
• lincon is a heteroscedastic test of d linear contrasts
using trimmed means.
No single method is always best.
Thank you!