Slide 1 Correlation and Causation, Part II – The Correlation Coefficient

Slide 2 This video is designed to accompany pages 19-24 of the workbook “Making Sense of Uncertainty: Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company.

Slide 3 While “correlation” is a word that is used by just about everyone, the “correlation coefficient” is a particular, numerical way of summarizing the strength of the association between two variables that you could represent in a scatterplot. This is a useful exercise because it summarizes the strength of the association on a scale that allows comparison with other associations. The association in the scatterplot of heights versus weights is clearly positive and fairly strong. How strong is it?

Slide 4 The complex expression you see on this slide is one convenient way that the correlation coefficient can be calculated. We will illustrate on a small data set. But first, make sure you understand the basic notation. For example, Σxy simply means that you take all the (x,y) pairs and multiply the x term in each by the corresponding y term, then add up all those products. The symbol Σ signals that you are going to add up whatever follows. We’ll illustrate this in just a moment. But before we do, let’s make sure we understand some important facts about the correlation coefficient.

Slide 5 The first thing to keep in mind is that the correlation coefficient should only be computed on a pair of variables that you can exhibit in a scatterplot. So computing a correlation coefficient between height and weight measurements makes sense. But computing a correlation coefficient between hours studying and gender does not make sense, because gender is a categorical variable and cannot be plotted on a numerical axis. You will still hear people talk about the “correlation” between variables like “hours studying” and “gender.” But they are using the word “correlation” in the vernacular. The correlation coefficient only makes sense for scatterplots that exhibit a straight-line trend.
This is confusing to a lot of people. It is not enough to just learn how to compute r, because computing r for data that have a curvilinear scatterplot can be misleading. The formula for r guarantees that r is always between -1 and 1. And r will be negative if the association in the scatterplot is negative; r will be positive if the association in the scatterplot is positive. The closer r is to 1 or -1, the more tightly packed the points are about the linear trend. Finally, if r is close to 0 (positive or negative), then the linear association between the two variables being studied is likely very weak.

Slide 6 The scatterplot you see here is manufactured for the purposes of illustrating how to compute the correlation coefficient. We have Glucose Level on the y, or vertical, axis and Age on the x, or horizontal, axis. What do you think the correlation coefficient will be for this scatterplot? There is a slight upward trend to the right, so we’d expect r to be positive. But the points are not that tightly packed about that upward trend, so r surely won’t be very close to 1. Let’s compute it and see what we get.

Slide 7 Here are the six (x,y) pairs from the scatterplot. Notice how we formed three more columns: the xy column, the x-squared column, and the y-squared column. In the very last row of the table we have added up the x’s, the y’s, the xy’s, the x-squares, and the y-squares. These all correspond to terms in the formula for r. Here n is the number of pairs, 6 in this case. Notice carefully where the terms go. Once they are all plugged into the right spots, it is easy to compute r, which turns out to be 0.5298. As we predicted, r is positive and not that close to 1. Using a table like this to compute r is a convenient method for small data sets, and for getting comfortable with r. For large data sets we’d want to use a computer or a spreadsheet.

Slide 8 Let’s take a look at some sample scatterplots, all from real data.
What do you think would be the values for r in each case? For Student Grades versus Time Spent Studying, the association is clearly positive and pretty strong. It turns out that for this plot r is about 0.75. For LDL Level and Hours of Exercise, the association is negative and pretty strong; r is about -0.93 in this case. What about Final Exam grades and Quiz Averages? It is not surprising that r is about 0.02. Now have a look at Life Expectancy at Birth and GNP Per Capita. The trend is clearly upward to the right, so the association is positive. The higher the GNP Per Capita, the higher the Life Expectancy at Birth. This makes sense. However, the trend is not a straight-line trend but a curvilinear (curved) trend, so it is not appropriate to compute r for this kind of trend. Still, the points seem to be tightly clustered about that curvilinear trend, so one could argue that the association exhibited in this scatterplot is pretty strong.

Slide 9 This concludes our video on the correlation coefficient. Remember, the correlation coefficient is the most common numerical measure of the strength of a straight-line relationship between two variables that can be represented by a scatterplot.
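
For readers who want to try the table method from Slide 7 on a computer, here is a minimal Python sketch. It rests on two assumptions. First, the transcript does not reproduce the formula on Slide 4, so the sketch uses the standard computational form built from the same sums the narration names: r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²)). Second, the transcript does not list the six (Age, Glucose Level) pairs, so the pairs below are illustrative values assumed here, not the table from the video. The function name `correlation_coefficient` is also just a label chosen for this sketch.

```python
from math import sqrt


def correlation_coefficient(pairs):
    """Compute r from (x, y) pairs using the column sums of the table method."""
    n = len(pairs)                              # number of (x, y) pairs
    sum_x = sum(x for x, _ in pairs)            # total of the x column
    sum_y = sum(y for _, y in pairs)            # total of the y column
    sum_xy = sum(x * y for x, y in pairs)       # total of the xy column
    sum_x2 = sum(x * x for x, _ in pairs)       # total of the x-squared column
    sum_y2 = sum(y * y for _, y in pairs)       # total of the y-squared column

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# ASSUMED illustrative data: (age, glucose level) pairs.
# These are not taken from the transcript, which does not list the values.
example_pairs = [(43, 99), (21, 65), (25, 79), (42, 75), (57, 87), (59, 81)]

r = correlation_coefficient(example_pairs)
print(round(r, 4))
```

With these assumed pairs the result rounds to 0.5298, positive and not very close to 1, the same kind of value described on Slide 7. The function only does the arithmetic; it does not check whether the scatterplot shows a straight-line trend, so that judgment still has to be made by looking at the plot first.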