Residuals, outliers, and influential points. Correlation vs. causation.

When we use correlation, we make certain assumptions about the data:
1. A straight-line relationship.
2. Interval data.
3. Random sampling.
4. Normally distributed characteristics (approximate is OK).

Today we're going to look at ways these assumptions can be violated.

First, a tool for finding problems in correlation: residuals. One way to show a correlation is to fit a line through the middle of the data (the line of best fit). If the line slopes clearly upwards and keeps close to the data, you have a correlation. Since a line won't perfectly describe the relationship between two variables, especially when randomness is involved, there's some error left over. These leftovers are called residuals (as in "left behind"). Looking at a graph of the residuals can magnify patterns that were not immediately obvious in the data. In this case, the points dip below the line and then come back above it. If the relationship between two variables really is linear, then whatever is left over should be random noise. That means if we see any obvious pattern in the residuals, including this one, a correlation coefficient isn't going to tell you the whole story.

Sometimes people try to correlate interval data to something ordinal or nominal. This is dumb. These are residuals from trying to correlate a yes/no response to something interval. Using ordinal data will leave huge jumps from one level to the next. Nominal data simply won't fit on a scatterplot. Both cases violate the assumption of interval data.

Sometimes the pattern isn't a trend in the centre of the data; it can also be a trend in the spread of the data. If the variation in y changes as x changes, the relationship between x and y is called heteroscedastic. Hetero means "different" and scedastic means "scattered", so heteroscedastic means there is a different amount of scatter at different points. If you encounter it, it could lower your correlation, so it's worth mentioning. (Look for fan shapes.) If the variation in y is the same everywhere, we call that homoscedastic, meaning "same scatter".

If the Toronto Stock Exchange Index were correlated to something linearly, the residual graph would resemble this. As the index numbers get higher, they tend to jump up and down more. Going from 10,000 to 10,100 is no big deal. Going from 500 to 600 is a big deal.

Residuals should look like this: a horizontal band of noise. There should be no obvious trends or patterns. The occasional point can sit outside the band without issue. But how far out is too far? And what happens when you get there?

Outliers are a violation of the assumption of normality, and correlation can be sensitive to outliers. A value that is far from the rest of the data can throw off the correlation, or create a false correlation.

Example: In the 1960s, a survey was done to gather various facts about TV stars. Intelligence Quotient (IQ) was found to be positively correlated with shoe size (r = 0.503, n = 30). (This story is true; the exact data has been made up.)

Could this be a fluke of the data? Did they falsely find a correlation? With r = .503 and n = 30, the t-score is 3.08. The critical value t* = 2.048 for df = 28 at the .05 significance level (two-tailed), and t* = 2.763 for df = 28 at the .01 level. So p < 0.01 (by computer: p = .0046). That means it's possible, but highly unlikely, that we'd see a correlation of this strength in uncorrelated data by chance alone.

Standard practice is to visualize the data when possible. There's no obvious trend, except... What is that?
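A minimal sketch of the two checks described above: fitting a line, inspecting the residuals, and testing whether r could be a fluke. The x and y arrays here are made-up placeholder data (not the survey data), and the only numbers taken from the notes are r = 0.503 and n = 30 in the closing comment.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data; swap in the real x and y.
rng = np.random.default_rng(0)
x = rng.normal(100, 15, size=30)
y = 0.5 * x + rng.normal(0, 10, size=30)

# Line of best fit and its residuals (the "leftovers").
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Plotting x against the residuals should show a horizontal band of noise.
# A curve suggests a non-linear relationship; a fan shape suggests
# heteroscedasticity.
# plt.scatter(x, residuals)   # if matplotlib is available

# Test whether r is plausibly a fluke: t = r * sqrt(n - 2) / sqrt(1 - r^2).
r = np.corrcoef(x, y)[0, 1]
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value

# Plugging in the values quoted in the notes (r = 0.503, n = 30) gives
# t ≈ 3.08 and p ≈ 0.0046, matching the figures above.
```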
There's one person with a very high IQ and very large shoes. In other words, an outlier. It's Bozo the Clown. He had huge clown shoes, and he was a verified genius. We can't assume normality with that Bozo in the way. So what now?

We could remove Bozo from the dataset, but if we remove data points we don't like, we could come to almost any "conclusion" we wanted. That's why we have the assumption of random selection. If Bozo can't be in the sample, then his chance of being selected is zero (no longer an equal chance for everyone in the population). We can remove him and keep randomization, but doing so implies that Bozo was not in the population of interest. (Equal chance of selection among non-clowns?) Most respondents wear shoes that fit their feet. Bozo wore absurdly large shoes, much larger than his feet, for entertainment. So dismissing the Bozo point as an outlier is reasonable; his shoes are fundamentally different.

Let's try the analysis again without Bozo's data: r = -.006, n = 29. Is there still a significant correlation? ...Not even close. t* = 1.314 at the .20 significance level, so the p-value > .20 (actually p-value = 0.9975). Removing Bozo the Clown from the dataset completely changed our results. Bozo wasn't just an outlier, he was an influential outlier, because he alone changed the results noticeably.

Not every outlier is influential, and not every influential point is an outlier. Outliers are points that don't fit the pattern, and correlation assumes a linear pattern (in this example, r = .032, p-value = .866): an outlier is anything outside the linear trend. For a point to be influential, it just has to change the linear trend. If it's far enough from the mean in the x-direction, it doesn't have to be far from the trend to change the results. Changing this one point from IQ 100 to IQ 110 changes the correlation from .016 to .155.

More formally, an outlier is anything with a large residual. Since normality, and hence symmetry, is assumed, the three-standard-deviation rule applies: anything with a residual more than 3 standard deviations above or below zero is considered an outlier. If the residuals show heteroscedasticity, outliers are more likely to show up, and in greater numbers.

Look at your data closely. Get right in its face. You can use statistics and graphs to intimately know a dataset, but numbers and pictures aren't a substitute for reasoning.

Just because two things happen together (or one after the other) doesn't mean that one of them causes the other. A correlation between two things doesn't imply causation. Consider this crime and sales data of a large city over five years (one point = one month). Homicide rates are strongly positively correlated with ice cream sales (r = .652, n = 60). Jumping from correlation to causation, we find that the availability of ice cream is driving people to kill each other. But correlation works both ways: ice cream sales are correlated with homicide rates. That must also mean that nothing builds an appetite for cold, cold ice cream like... cold, cold murder.

Causation works in one direction. Correlation works both ways. That alone should be enough not to make that leap. Often there's a common explanation for increases in both variables. In this case it's heat: both increase in summer. Simple, right? Then how do mistakes like this get made?

Study: Mercury can cause NY loon population drop.
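A minimal sketch of the two ideas above: flagging outliers with the 3-standard-deviation residual rule, and checking influence by dropping each point in turn and seeing how much r moves. The shoe-size and IQ numbers below are entirely made up for illustration (the real Bozo data isn't reproduced in the notes).

```python
import numpy as np

def residual_outliers(x, y, k=3.0):
    """Flag points whose residual from the best-fit line is more than
    k standard deviations from zero (the 3-SD rule above)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return np.abs(residuals) > k * residuals.std()

def influence_on_r(x, y):
    """For each point, how much does r change when that point is dropped?
    A large change marks an influential point (outlier or not)."""
    r_full = np.corrcoef(x, y)[0, 1]
    changes = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        changes.append(np.corrcoef(x[keep], y[keep])[0, 1] - r_full)
    return np.array(changes)

# Hypothetical data: 29 ordinary respondents plus one Bozo-like point
# with enormous shoes and a very high IQ.
rng = np.random.default_rng(1)
shoe = np.append(rng.normal(10, 1.5, size=29), 25.0)
iq = np.append(rng.normal(100, 15, size=29), 180.0)   # IQ unrelated to shoe size

print(residual_outliers(shoe, iq))      # which points have residuals > 3 SD
print(influence_on_r(shoe, iq)[-1])     # how much dropping the Bozo point moves r
```

Note that a high-leverage point like this can drag the fitted line toward itself, so it may not have a huge residual even though it changes r dramatically; that is exactly the "not every influential point is an outlier" caveat above.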
(Source: Wall Street Journal, June 28, 2:21pm) "A 10-year study of Adirondack loons shows mercury contamination can lead to population declines because birds with elevated mercury levels produce fewer chicks than those with low levels."

"But how can we ever tell causation with statistics?" Short answer: you can't. Good answer: you can't with statistics alone, because dealing with numbers after the fact is observational. But you can use statistics in combination with other fields (experimental design) to manipulate variables. Indoor greenhouses can manipulate soil type, moisture, and light directly, but plants still have randomness. Better answer (for interest only): Google Books for a preview, or look up the term "contrapositive".

Next time, we expand to multiple correlations and partial correlations. We may finish Chapter 10 early.

ASSIGNMENTS DUE AT 4:30PM!!!!