t Test for a Single Group Mean (Part 5), Outliers

Psych 5500/6500
The t Test for a Single Group Mean (Part 5):
Outliers
Fall, 2008
1
Viewing Your Data
The first step in analyzing your data is to take
a look at it. When you look at the data you
might notice that there are problems in
using the statistical test you had planned to
use to analyze the data.
2
Outliers
One problem that can occur in your data that
might foul up your analysis is the presence
of one or more outliers. An outlier is a
score that falls far above or below
the bulk of the scores. For example, the
following set of data contains one outlier:
Y = 10, 7, 9, 30, 8, 8, 7, 10, 6, 8
3
1. First we will look at how outliers can
affect our analysis.
2. Then we will look at easy ways to
determine if there are outliers in our data.
3. Finally we will look at what we might
want to do if we find outliers.
4
Data With Outlier
Y = 10, 7, 9, 30, 8, 8, 7, 10, 6, 8
5
Effect of the Outlier on the Mean and Standard Deviation
Y  10.30
S  7.04
Because of the outlier, the mean of the sample is larger
than nine of the ten scores, which puts into doubt the
validity of using the mean to describe the sample. The
standard deviation is also quite large considering that 9
of the 10 scores are within 4 of each other, so we need
to be concerned that the standard deviation does not
accurately portray how close most of the scores are to
each other.
6
The effect of the outlier can be seen by comparing
the statistics when the outlier is included versus
when it is excluded.
Including the outlier: Mean=10.3 Std. Dev.=7.04
Excluding the outlier: Mean=8.11 Std. Dev.=1.36
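The comparison above can be reproduced with a few lines of Python. This is only a sketch using the standard library; it assumes the standard deviation on the slide is the N - 1 version, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Scores from the lecture; 30 is the outlier.
scores = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]
without_outlier = [y for y in scores if y != 30]

# stdev() divides by N - 1 (an assumption about how the slide values were computed).
for label, data in (("Including", scores), ("Excluding", without_outlier)):
    print(f"{label} the outlier: Mean={mean(data):.2f} Std. Dev.={stdev(data):.2f}")
```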
7
Effect of Outlier on the t Test
Both the mean and the standard deviation go into
the formula for computing t_obtained, and thus the
outlier will have an effect on our t test. The
outlier will pull the sample mean either towards
what is predicted by H0 or towards what is
predicted by Ha (thus either helping or hurting
your chances of rejecting H0). It will also
increase the variance of the data (making it more
difficult to reject H0).
8
Example 1: H0: μY ≥ 10.5
Ha: μY < 10.5
All but one of the scores (the outlier) are below 10.5, which
would seem to give strong support to Ha. The outlier,
however, is making the sample mean larger (hurting our
chances to reject H0) and the extra variance it is creating
in the sample is also hurting our chances to reject H0.
With outlier: t=-0.09 p=0.465
Without outlier: t=-5.25 p=.0005
9
Example 2: H0: μY ≤ 6.5
Ha: μY > 6.5
All but one of the scores are above 6.5, which would seem
to give strong support to Ha. The outlier is helping by
making the sample mean even larger (increasing our
chances to reject H0) but the extra variance it is creating
in the sample is still killing our chances to reject H0.
With outlier: t=1.71 p=0.061
Without outlier: t=3.54 p=.004
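Both examples can be checked with a short Python sketch. It assumes the usual one-sample formula, t = (mean - μ0) / (S / √N) with df = N - 1. The helper names `t_obtained`, `t_pdf` and `one_tailed_p` are made up for this illustration. The p value is found by numerically integrating the t density, so expect values close to the slide, not identical to the last digit.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_obtained(data, mu0):
    """One-sample t statistic: (sample mean - mu0) / (S / sqrt(N))."""
    return (mean(data) - mu0) / (stdev(data) / sqrt(len(data)))

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def one_tailed_p(t, df, steps=2000):
    """Area in the tail beyond |t|, using Simpson's rule on [0, |t|].

    This equals the one-tailed p value only when t falls in the direction
    predicted by Ha, which is the case in both examples here.
    """
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

scores = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]
trimmed = [y for y in scores if y != 30]

for mu0 in (10.5, 6.5):  # Example 1, then Example 2
    for label, data in (("With outlier", scores), ("Without outlier", trimmed)):
        t = t_obtained(data, mu0)
        p = one_tailed_p(t, len(data) - 1)
        print(f"mu0={mu0}  {label}: t={t:.2f} p={p:.4f}")
```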
10
Detecting Outliers
Below are four simple ways to detect the
presence of outliers in your data:
1. Sort your data.
2. View a histogram of your data.
3. View a scatterplot of your data.
4. View a boxplot of your data.
11
Detecting Outliers (Sort Your Data)
Have SPSS sort your data in ascending or
descending order, then look at the top and
bottom of the list for any unusual scores.
If you have a large data set, an extreme
score might be hard to find if it is in the
middle of the list. Sorting the data puts it at
the beginning or end of the list, where it is easier
to spot.
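The lecture does this in SPSS. As a rough equivalent outside SPSS, the same idea can be sketched in Python: sort the scores while keeping track of each score's original position, then inspect both ends of the list.

```python
scores = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]

# Pair each score with its 1-based position so it can be found in the raw data,
# then sort the pairs by score.
ranked = sorted(enumerate(scores, start=1), key=lambda pair: pair[1])

print("Lowest three: ", ranked[:3])
print("Highest three:", ranked[-3:])
```

The last pair printed is (4, 30): the extreme score of 30 was the 4th score entered.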
12
Detecting Outliers (Histogram)
We have already seen how a histogram can
make an outlier easy to spot.
13
Detecting Outliers (Scatterplot)
The scatterplot shows the data
from ‘Sample 1’. Each circle
represents a score from the
sample. All but one of the
scores fall roughly between
Y=40 and Y=60, but we can
see there was one unusually
small score (Y=20), which is
an outlier.
14
Detecting Outliers (Boxplot)
The boxplot also shows the data from
‘Sample 1’. We will see how to
interpret a boxplot later, but note
that the outlier of Y=20 is marked
on the plot, along with the number
‘9’ which tells us that this was the
9th score in our sample (rather
convenient if we want to find that
score in the data).
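A boxplot's outlier marking can be imitated numerically. The sketch below assumes the common convention of flagging scores more than 1.5 times the interquartile range beyond the quartiles. That rule is an assumption here, not something stated in the lecture, and statistical packages differ in how they compute quartiles. The 'Sample 1' scores are not listed in the slides, so the earlier Y scores are used instead.

```python
from statistics import quantiles

scores = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]

# Quartiles of the sample (quantiles() sorts the data internally).
q1, _median, q3 = quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Report each flagged score with its 1-based position, as the boxplot label does.
flagged = [(i, y) for i, y in enumerate(scores, start=1)
           if y < lower_fence or y > upper_fence]
print(flagged)
```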
15
Handling Outliers
What we do about outliers depends upon our
belief concerning what might have been the
cause of the outlier. One important decision
is whether the outlier is due to an error or is
a valid score.
16
Error
It is important to determine whether the outlier
might be due to error, either an inputting error or
a measurement error.
17
Error
An inputting error is a mistake made while
typing the data into the computer. You should always
double-check data entry before analyzing it; you put a lot
of hard work into the experiment and you don’t want it
to be fouled up by a typo. If you then turn to analyzing
the data and find an outlier which you think might be an
inputting error, then triple-check to make sure it was
entered correctly. If you find an inputting error, then you
should probably triple-check all of your scores so that
you aren’t just fixing the scores you don’t like.
18
Error
A measurement error involves an error in
measuring that could be due to a mistake by the
researcher (e.g. mishandling the equipment) or by
the participant. Participant error may involve
misunderstanding the question (e.g. thinking that
the question was asking how many times he
engages in some activity during a year when the
question was actually asking how many times he
engages in some activity during a month) or
not giving an honest answer.
19
Error
If you think there might be a measurement error,
then, if at all possible, attempt to test the
validity of the outlier. This may involve
contacting the participant. If you do find that
the outlier was an error, I don't think people will
insist that you then contact all of the participants
again, unless what you discover is a basic flaw in
how you measured.
20
Error
Sometimes you can reasonably infer that an outlier
must be due to some sort of measurement error
simply because of its value, for example,
obtaining a reaction time that is a negative value
(i.e. the participant responded before the
stimulus, which is a violation of the instructions
for the task).
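A check for impossible values like this is easy to automate. The sketch below uses made-up reaction times (they are not from the lecture) and reports any negative value along with its position in the data.

```python
# Hypothetical reaction times in milliseconds, invented for illustration.
reaction_times_ms = [412, 388, 450, -35, 397, 430]

# A negative reaction time means the response came before the stimulus.
impossible = [(i, rt) for i, rt in enumerate(reaction_times_ms, start=1) if rt < 0]
print(impossible)
```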
21
Handling Error
If you have a strong reason to conclude that a
score (in this case an outlier) is in error then
fix it or remove it from the data. Some
people believe you should never do that,
fearing that to remove any datum is to open
the door to manipulating the results, but
others believe that it is more inappropriate to
include in the analysis a score that you have
good reason to think is invalid.
22
Handling Error
If you are less than certain that a score is invalid
then analyze the data both with and without that
score and include both analyses in your
empirical report.
In all cases, if you remove any scores (and perhaps
even if you have scores that seem questionable
but after investigation you decide to keep them)
you should discuss the situation in your
empirical report, describing what steps you took
to determine what to do about those scores.
23
Valid Outliers
Now we will turn to outliers that are 'valid'; in other
words, the score is not an error, it really is from
the population from which we are sampling. The
relevant question here is 'why did we get an
outlier?' We will investigate two possibilities: 1)
the population is normally distributed and the
outlier is simply from one of the tails of the
population; and 2) we obtained the outlier because
the population is not normally distributed.
24
Outlier from a Normal Population
Let us assume that the population is actually normally
distributed. It is more likely that a score sampled from
such a population will be close to the mean, rather than
far away from the mean. But in a sample it can happen
that one or more scores, just by weird luck, are far away
from the mean. Because the population is normally
distributed we have met the assumption of normality
required by the t test, but that doesn't mean that
everything is ok, for we still have a sample whose mean
and standard deviation are not very representative of the
population from which we sampled (we saw earlier how
that can influence the t test).
25
Outlier from a Normal Population
So what can we do about this situation? Not much; we
certainly can't remove a valid score just because we don't
like it. What we can do is to plan on using a large N in
our study. One outlier has less of an effect on the mean
and standard deviation when the sample is large.
Even if increasing N leads to more than one outlier, the
second outlier might very well be on the other side of the
distribution and cancel out the effect the first outlier had
on the mean (although they will both affect the standard
deviation). So, having a large sample is a good way to
lessen the problems of outliers.
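The effect of sample size can be illustrated with a small sketch. The data are artificial: N - 1 'typical' scores cycling through 7, 8 and 9, plus a single outlier of 30. The helper `with_one_outlier` is hypothetical and exists only for this illustration.

```python
from statistics import mean, stdev

def with_one_outlier(n, typical=(7, 8, 9), outlier=30):
    """Build n scores: n - 1 typical scores cycling through 7, 8, 9, plus one outlier."""
    return [typical[i % len(typical)] for i in range(n - 1)] + [outlier]

for n in (10, 100, 1000):
    data = with_one_outlier(n)
    clean = data[:-1]  # the same sample with the outlier left out
    print(f"N={n}: mean shift={mean(data) - mean(clean):.2f}, "
          f"std. dev. {stdev(clean):.2f} -> {stdev(data):.2f}")
```

With these made-up scores, the outlier moves the mean by 2.2 when N=10 but only by 0.22 when N=100, and its effect on the standard deviation shrinks in the same way.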
26
Outlier from a Normal Population
One note before going on, however, is that if you
have a small sample and find that it contains an
outlier, and consequently you decide to sample a
few more times in hope of lowering the effect of
that outlier, then this would be considered an
inappropriate manipulation of the sampling
process.
27
Outlier from a Non-Normal Population
It is also possible that your outlier is due to your
population not being normally distributed; perhaps you
are sampling from a population that has a greater
proportion of extreme scores than is found in the
normal distribution.
We will be covering how to detect non-normality of your
population in the next lecture, as well as some
techniques for handling non-normal data, but as those
issues have more to do with meeting the assumption of
normality required by the t test than they do with
handling outliers we will hold off most of the
discussion until then.
28
Outlier from a Non-Normal Population
If the population is non-normal due to having a
greater-than-normal proportion of extreme
scores, then having a large N might not help; it
may even make things worse (you will simply
get more outliers). Again, we will cover that next
lecture.
If the population is non-normal in some other way
then having a large N will probably help lower
the effect of an outlier.
29