Residuals, outliers and influential points. Correlation vs. causation

When we use correlation, we make certain assumptions
about the data:
1. A straight-line relationship
2. Interval data
3. Random sampling
4. Normally distributed characteristics (approximate is OK)
Today we’re going to look at ways these assumptions can
be violated.
First, a tool for finding problems in correlation: Residuals.
One way to show a correlation is to fit a line through the
middle of the data. (Line of best fit)
If the line slopes clearly upward and keeps close to the data,
you have a correlation.
Since a line won’t perfectly describe the relationship between
two variables, especially when randomness is involved,
there’s some error left over.
These leftovers are called residuals. (as in “left behind”)
Looking at a graph of the residuals can magnify patterns
that were not immediately obvious in the data before.
In this case, the points dip below the line and then come
back above it.
If the relationship between two variables really is linear,
then any other patterns should be random noise.
That means if we see any obvious pattern in the residuals,
including this one, a correlation coefficient isn’t going to tell
you the whole story.
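As a rough illustration (with made-up data, not the lecture’s), here is a minimal sketch of fitting a line of best fit and plotting the leftover residuals; a slightly curved relationship leaves a visible pattern in the residual plot like the one described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data with a slight curve, so a straight line leaves a pattern behind
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 1.5 * x + 0.2 * (x - 5) ** 2 + rng.normal(0, 1, size=x.size)

# Line of best fit (least squares), then the leftovers: residual = observed - fitted
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Residual plot: for a truly linear relationship this should be a band of noise,
# but here the points dip below zero in the middle and rise at the ends
plt.axhline(0, color="grey")
plt.scatter(x, residuals)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```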
Sometimes people try to correlate interval data to
something ordinal or nominal. This is dumb.
These are residuals from trying to correlate a yes/no
response to something interval.
Using ordinal data will leave huge jumps from one level to
the next. Nominal data simply won’t fit on a scatterplot.
Both cases violate the assumption of interval data.
Sometimes the pattern isn’t a trend in the center of the
data; it can also be a trend in the spread of the data.
If the variation in y changes as x changes, the relationship
between x and y is called heteroscedastic.
Hetero means “different” and scedastic means “scattered”.
Heteroscedastic means there is a different amount of
scatter at different data points. If you encounter it, it could
lower your correlation so it’s worth mentioning. (Look for
fan shapes)
If the variation in y is the same everywhere, we call that
homoscedastic, meaning “same scatter”.
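A minimal sketch (simulated, hypothetical data) of what a heteroscedastic residual plot looks like: the noise grows with x, so the residuals fan out.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated heteroscedastic data: the noise gets larger as x gets larger
rng = np.random.default_rng(1)
x = np.linspace(1, 100, 200)
y = 3 * x + rng.normal(0, 0.3 * x)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# The residual plot fans out: small scatter at low x, wide scatter at high x
plt.axhline(0, color="grey")
plt.scatter(x, residuals)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```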
If the Toronto Stock Exchange Index were correlated to
something linearly, the residual graph would resemble this.
As the index numbers get higher, they tend to jump up and
down more. Going from 10,000 to 10,100 is no big deal.
Going from 500 to 600 is a big deal.
Residuals should look like this: a horizontal band of noise.
There should be no obvious trends or patterns.
The occasional point can fall outside the band without issue.
But how far out is too far?
What happens when you get there?
Outliers are a violation of the assumption of normality, and
correlation can be sensitive to outliers.
A value that is far from the rest of the data can throw off
the correlation, or create a false correlation.
Example: In the 1960s, a survey was done to get various
facts about TV stars.
Intelligence Quotient (IQ) was found to be positively
correlated with shoe size. (r = 0.503, n = 30)
(This story is true, the exact data has been made up)
Could this be a fluke of the data? Did they falsely find a
correlation? (r=.503, n=30)
t-score = 3.08
t* = 2.048 for df=28, .05 significance (2 tailed)
t* = 2.763 for df=28, .01 sig
So p < 0.01.
(By computer: p=.0046)
That means it’s possible, but highly unlikely we’ll see a
correlation of this strength in uncorrelated data by chance
alone.
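The t-score quoted above comes from the standard test of a correlation coefficient, t = r·√(n−2) / √(1−r²). A minimal sketch reproducing those numbers with SciPy (the critical values and p-value match the slide):

```python
import math
from scipy import stats

r, n = 0.503, 30
df = n - 2

# t statistic for testing whether the population correlation is zero
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)      # ~3.08

# Two-tailed p-value and the critical values quoted above
p = 2 * stats.t.sf(abs(t), df)                     # ~0.0046
t_crit_05 = stats.t.ppf(1 - 0.05 / 2, df)          # ~2.048
t_crit_01 = stats.t.ppf(1 - 0.01 / 2, df)          # ~2.763
print(t, p, t_crit_05, t_crit_01)
```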
Standard practice is to visualize the data when possible.
There’s no obvious trend, except...
What is that?
There’s one person with very high IQ and very large shoes.
In other words, an outlier.
It’s Bozo the Clown.
He had huge clown shoes, and he was a verified genius.
We can’t assume normality with that Bozo in the way.
So what now?
We could remove Bozo from the dataset, but if we remove
data points we don’t like, we could come to almost any
“conclusion” we wanted.
That’s why we have the assumption of random selection.
If Bozo can’t be in the sample, then his chance of being
selected is zero (no longer equal chance in population).
We can remove him and keep randomization, but it implies
that Bozo was not in the population of interest. (Equal
chance of selection among non-clowns?)
Most respondents wear shoes that fit their feet. Bozo wore
absurdly large shoes, much larger than his feet, for
entertainment.
So dismissing the Bozo data point as an outlier is reasonable:
his shoes are fundamentally different.
Let’s try the analysis again without including Bozo’s data.
r = -.006, n=29
Is there still a significant correlation?
...not even close.
t* = 1.314 at .20 significance,
so p-value > .20
(actually p-value = 0.9975)
So removing Bozo the Clown from the dataset completely
changed our results.
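The underlying survey values aren’t given, so here is a sketch of the general before/after workflow with hypothetical stand-in data: compute r (and its p-value) with and without the single extreme point.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (the real survey values aren't given)
rng = np.random.default_rng(2)
shoe = rng.normal(10, 1.5, 29)    # unremarkable shoe sizes
iq = rng.normal(100, 15, 29)      # unremarkable IQs, unrelated to shoe size

# Add one extreme point playing the role of Bozo: huge shoes, very high IQ
shoe_all = np.append(shoe, 25)
iq_all = np.append(iq, 180)

r_with, p_with = stats.pearsonr(shoe_all, iq_all)
r_without, p_without = stats.pearsonr(shoe, iq)
print(f"with the outlier:    r = {r_with:.3f}, p = {p_with:.4f}")
print(f"without the outlier: r = {r_without:.3f}, p = {p_without:.4f}")
```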
Bozo wasn’t just an outlier, he was an influential outlier
because he alone influenced the results noticeably.
Not every outlier is influential, and
Not every influential point is an outlier.
Outliers are points that don’t fit the pattern. Correlation
assumes a linear pattern.
r = .032, p-value = .866
An outlier is anything outside the linear trend.
For a point to be influential, it just has to change the linear
trend. If it’s far enough from the mean in the x-direction, it
doesn’t have to be far from the trend to change the results.
Changing this one point from IQ 100 to IQ 110 changes the
correlation from .016 to .155.
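A sketch of that leverage effect with made-up numbers (they won’t match the .016 and .155 above exactly): one point far from the mean in x shifts r noticeably when only its y value is nudged.

```python
import numpy as np
from scipy import stats

# 29 unremarkable points plus one point far from the mean in the x-direction
rng = np.random.default_rng(3)
x = np.append(rng.normal(10, 1, 29), 20.0)   # high-leverage x value
y = np.append(rng.normal(100, 10, 29), 100.0)

r_before, _ = stats.pearsonr(x, y)

# Nudge only the high-leverage point's y value (e.g. IQ 100 -> 110)
y[-1] = 110.0
r_after, _ = stats.pearsonr(x, y)
print(f"before: r = {r_before:.3f}   after: r = {r_after:.3f}")
```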
More formally, an outlier is anything with a large residual.
Since normality, and hence symmetry, is assumed, the 3
standard deviation rule applies.
Anything with a residual of 3 standard deviations above or
below zero is considered an outlier.
If residuals show heteroscedasticity, outliers are more
likely to show up, and in greater numbers.
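A minimal sketch of that rule of thumb (a hypothetical helper, not from the lecture): flag any point whose residual is more than 3 standard deviations from zero.

```python
import numpy as np

def flag_outliers(x, y):
    """Flag points whose residual from the line of best fit is more than
    3 standard deviations away from zero (the rule of thumb above)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    return np.abs(residuals) > 3 * residuals.std()

# Hypothetical example: one point planted far off an otherwise perfect line
x = np.arange(20, dtype=float)
y = 2 * x + 1
y[10] += 40
print(np.where(flag_outliers(x, y))[0])   # -> [10]
```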
Look at your data closely. Get right in its face.
You can use statistics and graphs to intimately know a
dataset, but numbers and pictures aren’t a substitute for
reasoning.
Just because two things happen together (or one after
another) doesn’t mean that one of them causes the other.
A correlation between two things doesn’t imply causation.
Consider this crime and sales data of a large city over five
years (one point = one month)
Homicide rates are strongly positively correlated with ice
cream sales. (r = .652, n=60)
Jumping from correlation to causation, we find that
availability of ice cream is driving people to kill each other.
But correlation works both ways. Ice cream sales are
correlated with homicide rates.
That also must mean that nothing builds an appetite for
cold, cold ice cream like...
cold, cold murder.
Causation works in one direction. Correlation works both
ways. That alone should be enough not to make that leap.
Often there’s a common explanation for increases in both
variables. In this case it’s heat: both increase in summer.
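A minimal simulation of that “common cause” idea (entirely made-up numbers): temperature drives both series, so ice cream sales and homicides end up correlated with each other even though neither causes the other.

```python
import numpy as np
from scipy import stats

# Entirely made-up monthly data: temperature (the common cause) drives both series
rng = np.random.default_rng(4)
temperature = rng.normal(15, 10, 60)                        # 60 months
ice_cream = 50 + 3 * temperature + rng.normal(0, 10, 60)    # sales rise with heat
homicides = 5 + 0.2 * temperature + rng.normal(0, 2, 60)    # so do homicides

# The two series correlate with each other even though neither causes the other
r, p = stats.pearsonr(ice_cream, homicides)
print(f"r = {r:.3f}, p = {p:.4f}")
```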
Simple, right? Then how do mistakes like this get made?
Study: Mercury can cause NY loon population
drop. (source: Wall Street Journal June 28, 2:21pm)
“ A 10-year study of Adirondack loons shows mercury contamination can lead to
population declines because birds with elevated mercury levels produce fewer chicks
than those with low levels”
“But how can we ever tell causation with statistics?”
Short answer: You can’t
Good answer: You can’t with statistics alone, because
dealing with numbers after the fact is observational.
But you can use it in combination with other fields
(Experimental Design) to manipulate variables.
Indoor greenhouses can manipulate soil type, moisture, and
light directly, but plants still have randomness.
Better answer: (for interest only)
Google books for a preview, or look up the term
“Contrapositive”
Next time, we expand to multiple correlations and partial
correlations. We may finish chapter 10 early.
ASSIGNMENTS DUE AT 4:30PM!!!!