Psych 5500/6500
The t Test for a Single Group Mean (Part 5): Outliers
Fall, 2008

Viewing Your Data

The first step in analyzing your data is to take a look at it. When you look at the data you might notice that there are problems in using the statistical test you had planned to use to analyze the data.

Outliers

One problem that can occur in your data that might foul up your analysis is the presence of one or more outliers. An outlier is a score that falls significantly above or below the bulk of the scores. For example, the following set of data contains one outlier:

Y = 10, 7, 9, 30, 8, 8, 7, 10, 6, 8

1. First we will look at how outliers can affect our analysis.
2. Then we will look at easy ways to determine if there are outliers in our data.
3. Finally we will look at what we might want to do if we find outliers.

Data With Outlier

Y = 10, 7, 9, 30, 8, 8, 7, 10, 6, 8

Effect of the Outlier on the Mean and Standard Deviation

Mean (Ȳ) = 10.30
Standard deviation (S) = 7.04

Because of the outlier the mean of the sample is larger than nine of the ten scores, putting into doubt the validity of using the mean to describe the sample. The standard deviation is also quite large considering that 9 of the 10 scores are within 4 of each other, so we need to be concerned that the standard deviation does not accurately portray how close most of the scores are to each other.

The effect of the outlier can be seen by comparing the statistics when the outlier is included versus when it is excluded.

Including the outlier: Mean = 10.3, Std. Dev. = 7.04
Excluding the outlier: Mean = 8.11, Std. Dev. = 1.36

Effect of Outlier on the t Test

Both the mean and the standard deviation go into the formula for computing t_obtained, and thus the outlier will have an effect on our t test.
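
As a rough check on the figures in this section, the sketch below recomputes the mean, the standard deviation, and t_obtained with and without the outlier. It assumes the usual one-sample formula, t = (mean - mu0) / (S / sqrt(N)) with S computed using N - 1, which is consistent with the means, standard deviations, and t values reported in this lecture. The function names are illustrative, and 10.5 and 6.5 are the null values used in the two examples that follow. Python is used here only as a stand-in for SPSS, and the p value comes from numerically integrating the t density, not from any SPSS routine.

```python
import math
from statistics import mean, stdev

# Scores from the lecture; 30 is the outlier.
with_outlier = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]
without_outlier = [y for y in with_outlier if y != 30]


def t_obtained(scores, mu0):
    """One-sample t: (sample mean - mu0) / (S / sqrt(N)), S using N - 1."""
    standard_error = stdev(scores) / math.sqrt(len(scores))
    return (mean(scores) - mu0) / standard_error


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(t, df, steps=2000):
    """Area in one tail beyond |t| (Simpson's rule on [0, |t|]).

    This is the one-tailed p value only when t falls in the direction
    that Ha predicts, which is the case in both examples.
    """
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3


for label, scores in (("with outlier", with_outlier),
                      ("without outlier", without_outlier)):
    df = len(scores) - 1
    print(f"{label}: mean={mean(scores):.2f}, S={stdev(scores):.2f}")
    for mu0 in (10.5, 6.5):
        t = t_obtained(scores, mu0)
        print(f"  mu0={mu0}: t={t:.2f}, one-tailed p={one_tailed_p(t, df):.4f}")
```

Removing the outlier changes both the numerator (through the mean) and the denominator (through S), which is why a single score can move t so far.
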
The outlier will pull the sample mean either towards what is predicted by H0 or towards what is predicted by Ha (thus either helping or hurting your chances of rejecting H0). It will also increase the variance of the data (making it more difficult to reject H0).

Example 1:
H0: μY ≥ 10.5
Ha: μY < 10.5

All but one of the scores (the outlier) are below 10.5, which would seem to give strong support to Ha. The outlier, however, is making the sample mean larger (hurting our chances to reject H0) and the extra variance it is creating in the sample is also hurting our chances to reject H0.

With outlier: t = -0.09, p = 0.465
Without outlier: t = -5.25, p = .0005

Example 2:
H0: μY ≤ 6.5
Ha: μY > 6.5

All but one of the scores are above 6.5, which would seem to give strong support to Ha. The outlier is helping by making the sample mean even larger (increasing our chances to reject H0) but the extra variance it is creating in the sample is still killing our chances to reject H0.

With outlier: t = 1.71, p = 0.061
Without outlier: t = 3.54, p = .004

Detecting Outliers

Below are four simple ways to detect the presence of outliers in your data:

1. Sort your data.
2. View a histogram of your data.
3. View a scatterplot of your data.
4. View a boxplot of your data.

Detecting Outliers (Sort Your Data)

Have SPSS sort your data in ascending or descending order, then look at the top and bottom of the list for any unusual scores. If you have a large data set, an extreme score might be hard to find if it is in the middle of the list; sorting the data puts it at the start or end of the list, where it is easier to spot.

Detecting Outliers (Histogram)

We have already seen how a histogram can make an outlier easy to spot.

Detecting Outliers (Scatterplot)

The scatterplot shows the data from ‘Sample 1’. Each circle represents a score from the sample. All but one of the scores fall roughly between Y=40 and Y=60, but we can see there was one unusually small score (Y=20), which is an outlier.
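
The sorting and histogram checks described above can also be sketched outside SPSS. The snippet below is only an illustration in Python, using the Y scores from earlier in the lecture (not the 'Sample 1' data, which is not listed); the text histogram is a crude substitute for the SPSS chart.

```python
from collections import Counter

# Y scores from the lecture's first example (one outlier: 30).
scores = [10, 7, 9, 30, 8, 8, 7, 10, 6, 8]

# Check 1: sort, then inspect both ends of the list.
ordered = sorted(scores)
print("lowest three: ", ordered[:3])
print("highest three:", ordered[-3:])

# Check 2: a rough text histogram, one row per integer value in the range.
# A long run of empty rows before an isolated star suggests an outlier.
counts = Counter(scores)
for value in range(min(scores), max(scores) + 1):
    print(f"{value:>3} | {'*' * counts[value]}")
```

In the sorted list the score of 30 sits alone at the high end, and in the histogram it is separated from the other scores by a long run of empty rows.
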
Detecting Outliers (Boxplot)

The boxplot also shows the data from ‘Sample 1’. We will see how to interpret a boxplot later, but note that the outlier of Y=20 is marked on the plot, along with the number ‘9’, which tells us that this was the 9th score in our sample (rather convenient if we want to find that score in the data).

Handling Outliers

What we do about outliers depends upon what we believe caused the outlier. One important decision is whether the outlier is due to an error or is a valid score.

Error

It is important to determine whether the outlier might be due to error, either an inputting error or a measurement error.

An inputting error involves making a mistake while typing the data into the computer. You should always double-check data entry before analyzing the data; you put a lot of hard work into the experiment and you don’t want it to be fouled up by a typo. If you then turn to analyzing the data and find an outlier that you think might be an inputting error, triple-check to make sure it was entered correctly. If you find an inputting error then you should probably triple-check all of your scores so that you aren’t just fixing the scores you don’t like.

A measurement error involves an error in measuring that could be due to a mistake by the researcher (e.g. mishandling the equipment) or by the participant. Participant error may involve misunderstanding the question (e.g. thinking that the question was asking how many times he engages in some activity during a year when the question was actually asking how many times he engages in some activity during a month) or not giving an honest answer.

If you think there might be a measurement error then, if at all possible, attempt to test the validity of the outlier. This may involve contacting the participant.
I don't think people will insist that, having found an error in the outlier, you then contact all of the participants again, unless what you discover is a basic flaw in how you measured.

Sometimes you can reasonably infer that an outlier must be due to some sort of measurement error simply because of its value. An example is obtaining a reaction time that is a negative value (i.e. the participant responded before the stimulus, which is a violation of the instructions for the task).

Handling Error

If you have a strong reason to conclude that a score (in this case an outlier) is in error then fix it or remove it from the data. Some people believe you should never do that, fearing that to remove any datum is to open the door to manipulating the results, but others believe that it is more inappropriate to include in the analysis a score that you have good reason to think is invalid.

If you are less than certain that a score is invalid then analyze the data both with and without that score and include both analyses in your empirical report. In all cases, if you remove any scores (and perhaps even if you have scores that seem questionable but after investigation you decide to keep them) you should discuss the situation in your empirical report, describing what steps you took to determine what to do about those scores.

Valid Outliers

Now we will turn to outliers that are 'valid'. In other words, the score is not an error; it really is from the population from which we are sampling. The relevant question here is 'why did we get an outlier?' We will investigate two possibilities: 1) the population is normally distributed and the outlier is simply from one of the tails of the population; and 2) we obtained the outlier because the population is not normally distributed.

Outlier from a Normal Population

Let us assume that the population is actually normally distributed.
It is more likely that a score sampled from such a population will be close to the mean, rather than far away from the mean. But in a sample it can happen that one or more scores, just by weird luck, are far away from the mean. Because the population is normally distributed we have met the assumption of normality required by the t test, but that doesn't mean that everything is ok, for we still have a sample whose mean and standard deviation are not very representative of the population from which we sampled (we saw earlier how that can influence the t test).

So what can we do about this situation? Not much; we certainly can't remove a valid score just because we don't like it. What we can do is to plan on using a large N in our study. One outlier has less of an effect on the mean and standard deviation when the sample is large. Even if increasing N leads to more than one outlier, the second outlier might very well be on the other side of the distribution and cancel out the effect the first outlier had on the mean (although they will both affect the standard deviation). So, having a large sample is a good way to lessen the problems of outliers.

One note before going on, however: if you have a small sample, find that it contains an outlier, and consequently decide to sample a few more times in the hope of lowering the effect of that outlier, this would be considered an inappropriate manipulation of the sampling process.

Outlier from a Non-Normal Population

It is also possible that your outlier is due to your population not being normally distributed. Perhaps you are sampling from a population that has a greater proportion of extreme scores than is found in the normal distribution.
We will be covering how to detect non-normality of your population in the next lecture, as well as some techniques for handling non-normal data. As those issues have more to do with meeting the assumption of normality required by the t test than they do with handling outliers, we will hold off most of the discussion until then.

If the population is non-normal due to having a greater-than-normal proportion of extreme scores then having a large N might not help; it may even make things worse (you will simply get more outliers). Again, we will cover that next lecture. If the population is non-normal in some other way then having a large N will probably help lower the effect of an outlier.
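
To illustrate how a large N dilutes the effect of a single outlier, the sketch below builds artificial samples of increasing size. This is an assumption-laden construction, not data from the lecture: the nine non-outlier Y scores are simply recycled to fill each sample, and exactly one score of 30 is added. The helper name `sample_with_one_outlier` and the sample sizes are hypothetical choices made for illustration.

```python
from statistics import mean, stdev

# The nine non-outlier Y scores from the lecture, and the outlier.
typical = [10, 7, 9, 8, 8, 7, 10, 6, 8]
outlier = 30


def sample_with_one_outlier(n):
    """Artificial sample of size n: recycled typical scores plus one outlier."""
    recycled = [typical[i % len(typical)] for i in range(n - 1)]
    return recycled + [outlier]


mean_shifts = []
sd_shifts = []
for n in (10, 50, 250):
    with_out = sample_with_one_outlier(n)
    without = with_out[:-1]  # same sample with the outlier dropped
    mean_shift = mean(with_out) - mean(without)
    sd_shift = stdev(with_out) - stdev(without)
    mean_shifts.append(mean_shift)
    sd_shifts.append(sd_shift)
    print(f"N={n:>3}: mean shifts by {mean_shift:.2f}, S shifts by {sd_shift:.2f}")
```

With N = 10 this reproduces the lecture's example; as N grows, the same single outlier moves the mean and the standard deviation less and less. The sketch says nothing about populations that keep producing extreme scores, where a larger N brings more outliers with it.
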