Transcript

Slide 1
Correlation and Causation, Part II – The Correlation Coefficient
Slide 2
This video is designed to accompany pages 19-24 of the workbook “Making Sense of Uncertainty:
Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company.
Slide 3
While “correlation” is a word that is used by just about everyone, the “correlation coefficient” is a
particular, numerical way of summarizing the strength of the association exhibited between two
variables that you could represent in a scatterplot. This is a useful exercise because it summarizes the
strength of the association on a scale that allows comparison with other associations.
The association in the scatterplot of heights versus weights is clearly positive and fairly strong. How strong is it?
Slide 4
The complex expression you see on this slide is one convenient way that the correlation coefficient can
be calculated. We will illustrate on a small data set. But first, make sure you understand the basic
notion. For example, Σxy simply means that you take all the (x,y) pairs and multiply the x term in each by
the corresponding y term, then add up all those products. The Σ symbol signals that you are going to
add up whatever follows. We’ll illustrate this in just a moment. But before we do, let’s make sure we
understand some important facts about the correlation coefficient.
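The slide itself is not reproduced in this transcript. As a sketch, assuming the slide shows the usual computational form built from the sums the narration names (Σx, Σy, Σxy, Σx², Σy², and n), the expression would read as follows. The three pairs in the second line are made up purely to show what Σxy means.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

% Made-up pairs (1,2), (2,3), (3,5):
\sum xy = (1)(2) + (2)(3) + (3)(5) = 2 + 6 + 15 = 23
```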
Slide 5
The first thing to keep in mind is that the correlation coefficient should only be computed on a pair of
variables that you can exhibit in a scatterplot. So computing a correlation coefficient between height
and weight measurements makes sense. But computing a correlation coefficient between hours
studying and gender does not make sense. You will still hear people talk about the “correlation”
between variables like “hours studying” and “gender.” But they are using the word “correlation” in the
vernacular.
The correlation coefficient only makes sense for scatterplots that exhibit a straight line trend. This is
confusing to a lot of people. It is not enough to just learn how to compute r because the computation of
r for data that have a curvilinear scatterplot can potentially be misleading.
The formula for r guarantees that r is always between -1 and 1. And r will be negative if the association
in the scatterplot is negative; r will be positive if the association in the scatterplot is positive. The closer
r is to 1 or -1 the more tightly packed the points are about the linear trend.
Finally, if r is close to 0 (positive or negative) then the straight line association between the two
variables being studied is likely very weak.
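These facts can be checked on a few small, made-up data sets. The sketch below assumes the usual computational formula for r; the function name pearson_r and all of the data are hypothetical and chosen only for illustration. Points lying exactly on a rising line give r = 1, points on a falling line give r = -1, and points lying exactly on a parabola give r = 0 even though x and y are perfectly related, which is why r can mislead for curvilinear scatterplots.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from the sums of x, y, xy, x^2 and y^2.

    Hypothetical helper for illustration; assumes the standard
    computational formula and that neither variable is constant.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Made-up x values, symmetric about zero
xs = [-3, -2, -1, 0, 1, 2, 3]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # exactly on a rising line
r_down = pearson_r(xs, [-2 * x + 1 for x in xs])  # exactly on a falling line
r_curve = pearson_r(xs, [x * x for x in xs])      # exactly on a parabola

print(f"rising line:  r = {r_up:.2f}")
print(f"falling line: r = {r_down:.2f}")
print(f"parabola:     r = {r_curve:.2f}")
```

The parabola case is the one to remember: the points follow a curve perfectly, yet r reports no straight line association at all.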
Slide 6
The scatterplot you see here is manufactured for the purposes of illustrating how to compute the
correlation coefficient. We have Glucose Level on the y axis and Age on the x, or horizontal axis.
What do you think the correlation coefficient will be for this scatterplot? There is a slight upwards to
the right trend so we’d expect r to be positive. But the points are not that tightly packed about that
upwards to the right trend, so r surely won’t be very close to 1. Let’s compute it and see what we get.
Slide 7
Here are the six (x,y) pairs from the scatterplot. Notice how we formed three more columns, the xy
column, the x-squared column, and the y-squared column.
In the very last row of the table we have added up the x’s, the y’s, the xy’s, the x-squares, and the
y-squares. These all correspond to terms in the formula for r. n is the number of pairs - 6 in this case.
Notice carefully where the terms go. Once they are all plugged in the right spots, it is easy to compute
r, which turns out to be 0.5298. As we predicted, r is positive and not that close to 1.
Using a table like this to compute r is a convenient method for small data sets, and for getting
comfortable with r. For large data sets we’d want to use a computer or a spreadsheet.
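The same table method is easy to hand to a computer. The slide's table is not reproduced in this transcript, so the six (Age, Glucose Level) pairs below are an assumption: illustrative values that happen to give the same r of about 0.5298, not data taken from the slide. The sketch also assumes the usual computational formula for r.

```python
import math

# Assumed, illustrative (age, glucose level) pairs; NOT taken from the slide.
pairs = [(43, 99), (21, 65), (25, 79), (42, 75), (57, 87), (59, 81)]

# One table row per pair: x, y, xy, x squared, y squared
rows = [(x, y, x * y, x * x, y * y) for x, y in pairs]

# The last row of the table: the total of each column
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))
n = len(pairs)  # number of (x, y) pairs

# Plug the totals into the formula for r
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

# Print the table, the totals row, and r
header = ("x", "y", "xy", "x^2", "y^2")
print("".join(f"{name:>8}" for name in header))
for row in rows:
    print("".join(f"{value:>8}" for value in row))
print("".join(f"{total:>8}" for total in (sum_x, sum_y, sum_xy, sum_x2, sum_y2)))
print(f"n = {n}, r = {r:.4f}")
```

For a large data set only the list of pairs would change; the column totals and the final plug-in step stay the same.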
Slide 8
Let’s take a look at some sample scatterplots, all from real data. What do you think would be the values
for r in each case?
For Student Grades versus Time Spent Studying, r is clearly positive and pretty strong. Turns out that for
this plot r is about 0.75.
For LDL Level and Hours of Exercise, r is negative and pretty strong, about -0.93 in this case.
What about Final Exam grades and Quiz Averages? Not surprising that r is about 0.02.
Now have a look at Life Expectancy at Birth and GNP Per Capita. The trend is clearly upwards to the
right, so the association is positive. The higher GNP Per Capita is, the higher the Life Expectancy at Birth.
This makes sense. However, the trend is not a straight line trend but a curvilinear (curved) trend, so it is
not appropriate to compute r for this kind of trend. Still, the points seem to be tightly clustered
about that curvilinear trend so one could argue that the association exhibited in this scatterplot is pretty
strong.
Slide 9
This concludes our video on the correlation coefficient. Remember, the correlation coefficient is the
most common numerical measure of the strength of a straight line relationship between two variables
that can be represented by a scatterplot.