Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 6 – Multiple comparisons,
non-normality, outliers
Marshall University Genomics Core Facility
Analyzing data without a plan
• The framework for hypothesis testing assumes
all aspects of the experimental design are
defined before the experiment and the
analysis are performed
– Not doing this can invalidate the interpretation of
a p-value
– Easy trap to fall into
• This happens a lot!
Multiple comparisons
• Previous lecture discussed multiple
comparisons and their effect on interpretation
of p-values
– In that context, the multiple comparisons were
part of the experimental design
• In this case, can correct for them or otherwise analyze
the data appropriately
• Not having a complete plan can introduce
multiple comparison effects in an uncontrolled
way
Examples of multiple comparisons
• Trying multiple statistical tests for the same data set
– “A t-test doesn’t give me significance… let’s separate into
three groups instead and try an ANOVA”
• Trying multiple algorithmic implementations of the
same test
• In multiple regression (we will discuss this later…),
choosing to include or exclude different independent
variables
– For complex data sets, trying enough approaches will
almost always result in a “statistically significant” result
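A rough illustration of why trying enough approaches almost always produces a "significant" result. The sketch below assumes the tests are independent, which is not stated in the lecture and is not strictly true when several analyses are run on the same data set, so the numbers are only indicative.

```python
# Chance of at least one p < alpha among k tests when every null hypothesis is true.
# Assumption (not from the lecture): the k tests are independent of each other.
def familywise_error_rate(k, alpha=0.05):
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 3, 5, 10, 20):
    print(f"{k:2d} tests: chance of at least one false positive = "
          f"{familywise_error_rate(k):.3f}")
```

Under this assumption the chance of at least one false positive grows quickly with the number of analyses attempted.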
Sequential Analyses
• Another common approach is to try an
experiment, and if the result is not statistically
significant, to then repeat it with additional
samples or experimental replicates
– Another form of multiple comparisons
– Problem with this approach is that it is biased towards
a statistically significant result
• Stop experimenting if a result is statistically significant
• Continue experimenting otherwise
• In theory you can always get a statistically significant result
with this approach
– Though it may take a very long time…
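The bias described above can be seen in a small simulation. This is a minimal sketch under assumptions that are not in the lecture: the data are drawn from a normal distribution with mean 0 and known standard deviation 1 (so the null hypothesis is true), a z-test is used, and the starting sample size, step and maximum sample size are arbitrary illustrative choices.

```python
import math
import random

def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known standard deviation sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def run_experiment(rng, start_n, step, max_n, alpha=0.05):
    """Return (significant at first look, significant at any look).

    All data come from N(0, 1), so every significant result is a false positive.
    """
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    first = z_test_p_value(sample) < alpha
    significant = first
    # Stop as soon as the result is significant; otherwise add more samples.
    while not significant and len(sample) < max_n:
        sample.extend(rng.gauss(0, 1) for _ in range(step))
        significant = z_test_p_value(sample) < alpha
    return first, significant

rng = random.Random(617)  # arbitrary seed for reproducibility
n_sim = 2000
results = [run_experiment(rng, start_n=10, step=5, max_n=100) for _ in range(n_sim)]
fixed_rate = sum(first for first, _ in results) / n_sim
sequential_rate = sum(any_look for _, any_look in results) / n_sim
print(f"false positive rate, fixed sample size: {fixed_rate:.3f}")
print(f"false positive rate, keep adding samples: {sequential_rate:.3f}")
```

With a fixed sample size the false positive rate stays near the nominal 5%, while the stop-when-significant rule gives a noticeably higher rate.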
Publication Bias
• Remember, the interpretation of a p-value is the probability of
observing data at least as extreme as the data observed, assuming
the null hypothesis is true
– This is not the same as the probability the null hypothesis is true
• Most p-values we see are in journal articles
• There is a strong preference to publish results which are
“statistically significant,” i.e. which have p < 0.05
• Some of these results are “real” and some are false positives
• Because the publications are selected based on the p-value, the
interpretation of a p-value in published results is skewed
– If we assume the null hypothesis is true, and the result was published,
the probability of a false positive can be much higher than 5%
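A worked calculation can make the skew concrete. The sketch below assumes that only results with p < 0.05 are published, and it uses hypothetical values for the fraction of tested hypotheses that are real effects and for the statistical power; neither number comes from the lecture.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Fraction of statistically significant results that are false positives.

    prior_true: fraction of tested hypotheses where the effect is real (hypothetical)
    power:      chance of p < alpha when the effect is real (hypothetical)
    """
    false_positives = alpha * (1.0 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)

# Hypothetical scenario: 10% of tested effects are real, power is 80%.
share = false_positive_share(prior_true=0.10, power=0.80)
print(f"share of significant results that are false positives: {share:.2f}")
```

In this hypothetical scenario well over 5% of the significant (and therefore published) results are false positives, even though every individual test used a 5% threshold.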
Normally distributed data
• Many of the statistical tests we will study rely on
the assumption that the data were sampled from
a normal distribution
• How reasonable is this assumption?
• The normal distribution is an ideal distribution
that likely never exists in reality
– Includes arbitrarily large values and arbitrarily small
(negative) values
• However, simulations show that most tests that
rely on the assumption of normality are robust to
deviations from the normal distribution
The ideal normal distribution
• Image shows data sampled from a theoretical normal distribution
• Uses a very large sample size
• Close approximation to theoretical distribution
Samples from a normal distribution
Tests for normality
• It is possible to perform tests to see if the
sample data are consistent with the
assumption that they were sampled from a
normal distribution
– Unfortunately, this is not what we really want to
know
– Would really like to know if the distribution is
close enough to normal for the test we use to be
valid
Tests for normality
• A test for normality is a statistical test for
which the null hypothesis is
The data were sampled from a normal
distribution
• Common normality tests include
– D’Agostino-Pearson omnibus K2 normality test
– Shapiro-Wilk test
– Kolmogorov-Smirnov test
D’Agostino-Pearson omnibus K2
normality test
• The D’Agostino-Pearson omnibus K2 normality
test works by computing two values for the data
– The skewness, which measures how far the data is
from being symmetric
– The kurtosis, which measures how sharply peaked the
data is
• The test then combines these to a single value
that describes how far from normal the data
appear to lie
– Computes a p-value for this combined value
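The two ingredients of the test can be computed directly. The sketch below uses the simple moment-based definitions of skewness and excess kurtosis (kurtosis minus 3, so a normal distribution scores 0); statistical packages often apply small-sample corrections, so their values may differ slightly. It does not compute the combined K2 statistic or its p-value; in practice a library routine such as `scipy.stats.normaltest`, which is based on the D'Agostino-Pearson test, would be used for that.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the sample."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    """Moment-based sample skewness: 0 for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based sample kurtosis minus 3: about 0 for normal data."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 10]  # hypothetical data with one long right tail
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```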
Problem with normality tests
• If the p-value for a normality test is small, the interpretation is:
– If the data were sampled from an ideal normal distribution, it is
unlikely the sample would be this skewed and/or kurtotic
• If the p-value for a normality test is large, then the data are not
inconsistent with being sampled from a normal distribution
• However…
– If the sample size is large, it is possible to get a small p-value even for
small deviations from the normal distribution
• Data are likely sampled from a distribution that is close to, but not exactly,
normal
– If the sample size is small, it is possible to get a large p-value even if
the underlying distribution is far from normal
• Data do not provide sufficient evidence to reject the null hypothesis…
– Useful to examine the values for skewness and kurtosis as well as the
p-value
Skewness and kurtosis
Interpreting skewness and kurtosis
• The real question we would like to answer is
– How much skewness and kurtosis are acceptable?
– Difficult to answer…
• In general, interpret a skewness between -0.5 and 0.5
as being approximately symmetric
– Between -1.0 and -0.5, or 0.5 and 1.0 is moderately skewed
– Less than -1.0 or more than 1.0 is highly skewed
• For kurtosis, values between -2 and 2 are generally
accepted as being “within limits”
– Outside this is evidence the distribution is far from normal
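The rules of thumb above can be written as a small helper. This is only a sketch: the function names are hypothetical, the kurtosis value is assumed to be excess kurtosis (0 for a normal distribution), and the treatment of values exactly on a boundary is an arbitrary choice, since the guidelines do not specify it.

```python
def describe_skewness(skew):
    """Label a skewness value using the rule-of-thumb thresholds 0.5 and 1.0."""
    size = abs(skew)
    if size <= 0.5:
        return "approximately symmetric"
    if size <= 1.0:
        return "moderately skewed"
    return "highly skewed"

def kurtosis_within_limits(excess_kurt):
    """True if the (excess) kurtosis lies between -2 and 2."""
    return -2.0 <= excess_kurt <= 2.0

for value in (0.2, -0.8, 1.7):
    print(value, "->", describe_skewness(value))
print(kurtosis_within_limits(0.4), kurtosis_within_limits(3.5))
```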
What to do if the data fail a test for
normality
• If the data fail a test for normality, the following options are
available
– Can the data be transformed to data that come from a normal
distribution?
• For example, if the data are positively skewed (long right tail), transforming to logs
may give normally distributed data
– Are there a small number of outliers that are causing the data to
fail a normality test?
• Next section discusses outliers
– Is the departure from normality small? I.e. are the skewness and
kurtosis “small”. If so, your statistical tests may still be accurate
– Use a test that does not assume a normal distribution (a nonparametric test)
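The transformation option can be illustrated with simulated data. The sketch below draws values from a lognormal distribution (a hypothetical stand-in for skewed measurements, with arbitrary parameters and seed) and compares the moment-based skewness before and after taking logs.

```python
import math
import random

def skewness(data):
    """Moment-based sample skewness: 0 for perfectly symmetric data."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

rng = random.Random(617)  # arbitrary seed for reproducibility
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # long right tail
logged = [math.log(x) for x in raw]  # logs of lognormal data are normal

print(f"skewness of raw data:         {skewness(raw):.2f}")
print(f"skewness of log-transformed:  {skewness(logged):.2f}")
```

The raw values are strongly skewed to the right, while the log-transformed values are close to symmetric. A log transform requires all values to be positive.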
Non-parametric tests
• The most common statistical tests assume the
data are sampled from a normal distribution
– T-tests, ANOVA, Pearson correlation, etc
• Some other tests do not make this assumption
– Mann-Whitney test, Kruskal-Wallis test, Spearman
correlation, etc
• However, these tests have lower
statistical power than their parametric
equivalents when the data are normally
distributed (slightly lower for large samples,
much lower for small samples)
Choosing non-parametric tests
• When running a series of similar experiments, all data
should be analyzed the same way
– Use normality tests to choose the statistical test for all
experiments together
– Following “common practice” is acceptable…
– Ideally, run one experiment just to determine whether the
data look like they come from a normal distribution
• For small data sets
– A test for normality does not tell you much
• Not likely to get a small p-value anyway
– Violations of the normality assumption are more egregious
– Non-parametric tests have very low statistical power
Outliers
• Outliers are values in the data that are “far” from the other values
• Occur for several reasons:
– Invalid data entry
– Experimental mistakes
– Random chance
• In any distribution, some values are far from the others
– In a normal distribution, these values are rarer, but still exist
– Biological diversity
• If your samples are from patients or animals, the outlier may be
“correct” and due to biological diversity
– May be an interesting finding!
– Wrong assumptions
• For example, in a lognormal distribution, some values are far from the others
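The "random chance" point can be quantified for the normal distribution. The sketch below computes the chance that a single value lies more than a given number of standard deviations from the mean, and the chance that a sample of n independent values contains at least one such value. The 3 SD cutoff and the sample sizes are illustrative choices, not values from the lecture.

```python
import math

def prob_beyond(k_sd):
    """Chance that one normal value lies more than k_sd standard deviations
    from the mean, in either direction."""
    return math.erfc(k_sd / math.sqrt(2))

def prob_at_least_one(n, k_sd):
    """Chance that a sample of n independent normal values contains at least
    one value beyond k_sd standard deviations."""
    return 1.0 - (1.0 - prob_beyond(k_sd)) ** n

print(f"single value beyond 3 SD: {prob_beyond(3):.4f}")
for n in (10, 100, 1000):
    print(f"n = {n:4d}: at least one value beyond 3 SD = {prob_at_least_one(n, 3):.3f}")
```

Such values are rare individually, but in larger samples it becomes quite likely that at least one appears by chance alone.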
Why test for outliers
• Presence of erroneous outliers, or assuming
the wrong distribution, can introduce spurious
results or mask real results
• Trying to detect outliers without a test can be
misleading
– We tend to want to observe patterns in data
– Anything that appears to be counter to these
patterns seems to be an outlier
– We tend to see too many outliers
Before testing for outliers
• Before testing for outliers:
– Check the data entry
• Errors here can often be fixed
– Were there problems with the experiment?
• If errors were observed during the experiment, remove data
associated with those errors
• Many experimental protocols have quality control measures
– Is it possible your data are not normally distributed?
• Most outlier tests assume the (non-outlier) data are normally
distributed
– Was there anything different about any of the samples?
• Was one of the mice phenotypically different, etc?
Outlier tests
• After addressing the concerns on the previous
slide, if you still suspect an outlier you can run
an outlier test
• Outlier tests answer the following question:
If the data were sampled from a normal
distribution, what is the chance of observing
one value as far from the others as is in the
observed data?
Results of an outlier test
• If an outlier test results in a small p-value, then
the conclusion is that the outlying value is
(probably) not from the same distribution as the
other values
– Justifies excluding it from the analysis
• If the outlier test results in a high p-value, there is
no evidence the value came from a different
distribution
– Doesn’t prove it did come from the same distribution,
just that there is no strong evidence to the contrary
Guidelines on removing outliers
• If you address all the previous concerns, and
an outlier test gives strong evidence of an
outlier, then it is legitimate to remove it from
the analysis
– The rules for eliminating outliers should be
established before you generate the data
– You should report the number of outliers removed
and the rationale for doing so in any publication
using the data
How outlier tests work
• Outlier tests work by computing the
difference between the extreme value and
some measure of central tendency
• That value is typically divided by a measure of
the variability
• Resulting ratio is compared with a table or
expected distribution of those values
Grubbs' outlier test
• Grubbs' outlier test calculates the difference
between the extreme value and the mean of
all values (including the extreme value), and
divides by the standard deviation
• Resulting value is then compared to a table of
critical values
– Critical value depends on the sample size
– If the value is larger than the critical value, then
the extreme value can be considered an outlier
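The calculation described above can be sketched as follows. The function name and the data are hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and the absolute difference is used so that the most extreme value in either direction is tested. The critical value is not computed here; it would still need to be looked up in a table for the given sample size.

```python
import statistics

def grubbs_statistic(data):
    """Return (most extreme value, Grubbs statistic G).

    G = |extreme value - mean| / standard deviation, where the mean and the
    standard deviation are computed from all values, including the extreme one.
    """
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    extreme = max(data, key=lambda x: abs(x - mean))
    return extreme, abs(extreme - mean) / sd

# Hypothetical measurements with one suspicious value.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 14.0]
extreme, g = grubbs_statistic(measurements)
print(f"most extreme value: {extreme}, G = {g:.3f}")
```

If G exceeds the tabulated critical value for the sample size (here n = 6), the extreme value can be considered an outlier.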
