Correlation

Correlation describes the strength of association between two quantitative variables. Equivalently, it describes the variability (scatter) of observations around a regression line, and it describes how accurately you can predict the value of one variable when you know the value of the other. Correlation is also helpful in understanding how to design good experiments, as we shall see. The most widely used correlation measure is R, the Pearson linear correlation coefficient. We'll also look at another correlation measure, the Spearman rank correlation coefficient, which uses rank values.

Here is an example of two quantitative variables that are perfectly correlated (R = 1). Suppose that you have a job where you are paid $10 per hour. The table shows the number of hours you work and how many dollars you earn. If you know the number of hours you worked, then you know exactly how many dollars you earned, so the correlation is R = 1.0.

Hours    0   1   2   3   4   5   6   7   8   9   10
Dollars  0  10  20  30  40  50  60  70  80  90  100

Next is an example of two variables (number of drinks bought and dollars left in your wallet) that are perfectly negatively correlated (R = -1). Suppose that you are buying drinks that cost $5 each. The table shows the number of drinks bought and how many dollars are left in your wallet. If you know the number of drinks you bought, then you know exactly how many dollars are left in your wallet. Of course, after 10 drinks you may not remember how many you bought, so you may not correctly estimate how much money you have left.

Drinks bought           1   2   3   4   5   6   7   8   9  10
Dollars in your wallet 50  45  40  35  30  25  20  15  10   5

Here is an example of two variables (number of pills vs. IQ) whose correlation is near zero. Suppose we are testing whether a new drug has any effect on IQ. The table shows the number of pills vs. IQ. In this case, the number of pills doesn't appear to have any relationship to IQ; the correlation is R = -0.005.

Patient ID        1    2    3    4    5    6    7    8    9   10   11   12
Number of pills   0    0    0    1    1    1    2    2    2    3    3    3
IQ              132  141  150  136  151  131  134  132  151  133  139  151

The Pearson linear correlation R ranges from 1.0, meaning perfect prediction, through 0.0, meaning no linear association at all, to -1.0, meaning that the variables are perfectly correlated but move in opposite directions, that is, they are negatively correlated.

Here's an example with positive correlation R = 0.95. Perhaps we have a job that pays tips rather than a fixed hourly wage. Suppose we get these tips for working 1 to 8 hours:

Hours   1   1   2   2   3   3   4   4   5   5   6   6   7   7   8   8
Tips   18  10  19  28  29  45  27  47  52  66  51  60  78  74  81  92

From these examples, you can see that correlation describes the variability (scatter) of observations around a regression line. When all the observations fall exactly on the regression line, there is no scatter around the line, so the correlation is R = 1. When there is only a little scatter around the line, the correlation is slightly smaller, say R = 0.9. As the scatter of observations around the line increases, the correlation approaches zero.

Coefficient of determination R²

If we square the Pearson correlation coefficient R, we get R² (R-squared), the coefficient of determination.

R   -1.00  -0.90  -0.71  -0.50  0.00  0.50  0.71  0.90  1.00
R²   1.00   0.81   0.50   0.25  0.00  0.25  0.50  0.81  1.00

Notice that R² can only take values between 0 and 1. Since R² is always nearer to zero than R is, it doesn't look as impressive. Software sometimes reports R, sometimes R², and sometimes both, so watch out for which is being reported.
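As a quick check of these numbers, here is a minimal Python sketch (assuming NumPy is available) that computes the Pearson R and R² for the tips example above:

import numpy as np

# Tips example: hours worked vs. tips earned (data from the table above)
hours = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8])
tips = np.array([18, 10, 19, 28, 29, 45, 27, 47, 52, 66, 51, 60, 78, 74, 81, 92])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson R between the two variables.
r = np.corrcoef(hours, tips)[0, 1]
print(f"Pearson R = {r:.2f}")     # about 0.95
print(f"R-squared = {r**2:.2f}")  # about 0.90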
Outliers and Spearman rank correlation

The Pearson correlation coefficient may be greatly affected by single influential points (outliers). We'll see examples in a moment. Sometimes we would like a measure of association that is less sensitive to single points, and at those times we can use the Spearman rank correlation. Recall that when we calculate the mean of a set of numbers, a single extreme value can greatly change the mean, but when we calculate the median, which is based on ranks, extreme values have very little influence. The same idea applies to the Pearson and Spearman correlations: Pearson correlation uses the actual values of the observations, while Spearman uses only the ranks of the observations, so it is less affected by outliers. Here is how an outlier changes the correlation in two examples; the code sketch at the end of this section computes both the Pearson and the Spearman values for them.

Example: an outlier increases the Pearson correlation. In the right-hand data set, the last point is the outlier (10, 10).

Without outlier:
x  1  1  2  2  3  3  4  4  4
y  4  1  3  3  1  4  3  2  3
Pearson R = 0.0000

With outlier:
x  1  1  2  2  3  3  4  4  10
y  4  2  3  3  1  4  3  2  10
Pearson R = 0.8123

Example: in this second example, the outlier has a large effect on both the Pearson and the Spearman correlation coefficients.

Without outlier:
Drug dose    5    5    5   10   10   10   15   15   20   20
IQ         151  145  136  137  124  124  111  105  110   98
Pearson R = -0.9222

With outlier (the last IQ value, 98, replaced by 150):
Drug dose    5    5    5   10   10   10   15   15   20   20
IQ         151  145  136  137  124  124  111  105  110  150
Pearson R = -0.4727

Pearson detects linear correlation, while Spearman detects monotonic relationships that are not necessarily linear. Note also that two variables may have zero Pearson linear or Spearman rank correlation and still not be independent: for example, if y = x² for x values ranging symmetrically about zero, then y is completely determined by x, yet both correlations between x and y are zero.

Calculation of the Pearson linear correlation coefficient

Here is the procedure for calculating the Pearson linear correlation coefficient, R. We use the variance and the covariance to calculate the correlation, so we'll start with those. Recall the formula for variance from the section on descriptive statistics. Variance describes variability around the mean value:

\[
\mathrm{Variance}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}
\]

Covariance extends the idea of variance to two variables, and its formula is similar to that for the variance:

\[
\mathrm{Covariance}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}
\]

Correlation uses the covariance of two variables. The correlation of two variables, x and y, is equal to the covariance of x and y divided by a quantity that forces the correlation to lie between -1.0 and 1.0:

\[
\mathrm{Correlation}(x, y) = R = \frac{\mathrm{Covariance}(x, y)}{\sqrt{\mathrm{Var}(x) \cdot \mathrm{Var}(y)}}
\]

The denominator, the square root of Var(x) · Var(y), is what forces the correlation coefficient to lie between -1.0 and 1.0.

Correlation in Design of Experiments

When we design experiments, we usually want to avoid correlation between our independent variables. Suppose we want to measure the effect of the amounts of two reagents on yield. We design the following (bad) experiment to study the effects of Reagent 1 and Reagent 2 on yield in 4 batches:

Batch      1   2   3   4
Reagent 1  0   0   1   1
Reagent 2  0   0   1   1
Yield      0   0  30  30

What can we conclude about the effects of Reagents 1 and 2 on the yield? Unfortunately, we can't tell whether the differences in yield are due to Reagent 1, Reagent 2, or an interaction between them. From this experiment, it is possible that Reagent 1 has no effect on yield; it is also possible that Reagent 2 has no effect on yield. We can't tell, because in this experiment design Reagent 1 and Reagent 2 are correlated, with R = 1. When we do a scatterplot of the levels of Reagent 1 vs. Reagent 2 in the design, it is obvious that they are perfectly correlated.
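Putting the pieces of this section together, here is a minimal Python sketch (assuming NumPy and SciPy are available) that implements R directly from the covariance and variance formulas above, checks the outlier examples, and confirms that the two reagents in the bad design are perfectly correlated:

import numpy as np
from scipy.stats import spearmanr

def pearson_r(x, y):
    """Pearson R from the formulas above: Covariance(x, y) / sqrt(Var(x) * Var(y))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # Covariance(x, y), 1/N form
    return cov / np.sqrt(x.var() * y.var())         # np.var also uses the 1/N form

# First outlier example: Pearson jumps, rank-based Spearman moves much less.
x_out = [1, 1, 2, 2, 3, 3, 4, 4, 10]
y_out = [4, 2, 3, 3, 1, 4, 3, 2, 10]
print(pearson_r(x_out, y_out))     # about 0.81, as in the table above
print(spearmanr(x_out, y_out)[0])  # about 0.16: far less affected by the outlier

# Drug-dose example: here the outlier moves both coefficients substantially.
dose       = [5, 5, 5, 10, 10, 10, 15, 15, 20, 20]
iq         = [151, 145, 136, 137, 124, 124, 111, 105, 110, 98]
iq_outlier = iq[:-1] + [150]       # last IQ value: 98 -> 150
print(pearson_r(dose, iq), pearson_r(dose, iq_outlier))        # about -0.92 vs -0.47
print(spearmanr(dose, iq)[0], spearmanr(dose, iq_outlier)[0])  # about -0.91 vs -0.50

# Bad experiment design: the two reagent settings have the same pattern,
# so their correlation is exactly 1 and their effects are confounded.
reagent1 = [0, 0, 1, 1]
reagent2 = [0, 0, 1, 1]
print(pearson_r(reagent1, reagent2))  # 1.0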
Here is an alternative (good) experiment design that removes the correlation:

Batch      1   2   3   4
Reagent 1  0   1   0   1
Reagent 2  0   0   1   1
Yield      0  30   0  30

What can we conclude about the effects of Reagents 1 and 2 on the yield? In this experiment, it is clear that Reagent 1 increases yield from 0 to 30, and that Reagent 2 has no effect on yield. We can determine the effects of the reagents because, in this experiment design, Reagent 1 and Reagent 2 are not correlated: R = 0. When we do a scatterplot of the levels of Reagent 1 vs. Reagent 2, it is clear that they are not correlated. This second experiment design is much superior to the first, and it is the correlation R = 0 among the independent factors that tells us so.

When two independent variables are perfectly correlated, as in the bad experiment design above, we cannot separate the effects of the two variables. We say that the two variables are confounded, or aliased: confounded because we can't attribute the effects to one variable or the other, and aliased because one variable has the same pattern (in the design) as the other.

Correlation is not the same as interaction

Correlation and interaction are often confused, but they are quite different. Correlation involves two variables; it describes the association between them. Interaction involves three or more variables; it is the effect of two (or more) factors on a third (response) variable. Here is an example of two factors (Time and Temperature) that have zero correlation but show an interaction in their effect on a third variable, cookie yield:

Time          0   1   0   1
Temp          0   0   1   1
Cookie yield 50  70  70  50

There is zero correlation between Time and Temp, but there is an interaction between Time and Temp in their effect on yield: the effect on cookie yield of increasing Temperature depends on the value of Time. At Time 0, raising Temp increases yield from 50 to 70; at Time 1, raising Temp decreases yield from 70 to 50.
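Here is a short Python sketch (again assuming NumPy) that illustrates the distinction: the two factors are exactly uncorrelated, yet the effect of Temp on yield reverses depending on Time:

import numpy as np

time = np.array([0, 1, 0, 1])
temp = np.array([0, 0, 1, 1])
cookie_yield = np.array([50, 70, 70, 50])

# Correlation involves only the two factors, and here it is exactly zero.
print(np.corrcoef(time, temp)[0, 1])  # 0.0

# Interaction involves the response: compare the effect of raising Temp
# at each level of Time (a "difference of differences").
effect_at_time0 = cookie_yield[(time == 0) & (temp == 1)][0] - cookie_yield[(time == 0) & (temp == 0)][0]
effect_at_time1 = cookie_yield[(time == 1) & (temp == 1)][0] - cookie_yield[(time == 1) & (temp == 0)][0]
print(effect_at_time0, effect_at_time1)  # +20 and -20: a strong interaction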