4.0 Correlation

• Correlation describes the strength of association between two quantitative variables.

• Correlation describes the variability (scatter) of observations around a regression line.

• Correlation describes how accurately you can predict the value of one variable when you know the value of another variable.

Correlation is also helpful in understanding how to design good experiments, as
we shall see.
The most widely used correlation measure is R, the Pearson linear correlation coefficient. We'll also look at another correlation measure, the Spearman rank correlation coefficient, which is computed from the ranks of the observations rather than their actual values.
Here is an example of two quantitative variables that are perfectly correlated (R = 1). Suppose that you have a job where you are paid $10 per hour. The table shows the number of hours you work and how many dollars you earn. If you know the number of hours you worked, then you know exactly how many dollars you earned, so the correlation is R = 1.0.

Hours     0  1  2  3  4  5  6  7  8  9  10
Dollars   0 10 20 30 40 50 60 70 80 90 100
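As a numerical check, here is a minimal Python sketch (assuming NumPy is installed) that computes the Pearson correlation for the hours/dollars data above:

    import numpy as np

    hours = np.arange(11)        # 0 through 10 hours worked
    dollars = 10 * hours         # $10 earned per hour

    # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson R between the two variables.
    print(np.corrcoef(hours, dollars)[0, 1])   # 1.0 -- perfect correlation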
Next is an example of two variables (number of drinks bought and dollars left in your wallet) that are perfectly negatively correlated (R = -1). Suppose that you are buying drinks that cost $5 each. The table shows the number of drinks bought and how many dollars are left in your wallet. If you know the number of drinks you bought, then you know exactly how many dollars are left in your wallet. Of course, after 10 drinks you may not remember how many you bought, so you may not correctly estimate how much money you have left.

Drinks bought            1  2  3  4  5  6  7  8  9 10
Dollars in your wallet  50 45 40 35 30 25 20 15 10  5
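The same check works for the negative case (a sketch assuming the wallet starts at $55, consistent with the table):

    import numpy as np

    drinks = np.arange(1, 11)     # 1 through 10 drinks bought
    wallet = 55 - 5 * drinks      # dollars left after each $5 drink

    print(np.corrcoef(drinks, wallet)[0, 1])   # -1.0 -- perfect negative correlation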
Here is an example of two variables (IQ vs. number of pills) that have correlation near zero. Suppose we are testing whether a new drug has any effect on IQ. The table shows the number of pills taken and the measured IQ for each patient. In this case, the number of pills doesn't appear to have any relationship to IQ. The correlation is R = -0.005.
Patient ID   Number of pills   IQ
 1           0                 132
 2           0                 141
 3           0                 150
 4           1                 136
 5           1                 151
 6           1                 131
 7           2                 134
 8           2                 132
 9           2                 151
10           3                 133
11           3                 139
12           3                 151
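Running the same computation on this table reproduces the reported value (a sketch assuming NumPy):

    import numpy as np

    pills = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
    iq = np.array([132, 141, 150, 136, 151, 131,
                   134, 132, 151, 133, 139, 151])

    print(np.corrcoef(pills, iq)[0, 1])   # about -0.005 -- essentially no linear association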
The Pearson linear correlation coefficient R ranges from 1.0, meaning perfect prediction, through 0.0, meaning no linear association at all, to -1.0, meaning that the variables are perfectly correlated but move in opposite directions; that is, they are negatively correlated.
Here's an example with positive correlation R=0.95. Perhaps we have a job that pays
tips, rather than a fixed hourly wage. Suppose we get these tips for working 1 to 8
hours.
Hours  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8
Tips  18 10 19 28 29 45 27 47 52 66 51 60 78 74 81 92
From these examples, you can see that correlation describes the variability (scatter) of
observations around a regression line. When all the observations fall exactly on the
regression line, there is no scatter around the line, so the correlation is R=1. When there
is only a little scatter around the line, correlation is slightly smaller, say R=0.9. As the
scatter of observations around the line increases, the correlation approaches zero.
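To see this numerically, one can add increasing random scatter to a perfect line and watch R fall (a sketch; the exact values depend on the random seed, but the downward trend does not):

    import numpy as np

    rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
    x = np.linspace(0, 10, 100)

    for noise_sd in (0, 1, 5, 20):
        y = 2 * x + rng.normal(0, noise_sd, x.size)   # a line plus scatter
        print(noise_sd, round(np.corrcoef(x, y)[0, 1], 3))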
Coefficient of determination R²
If we square the Pearson correlation coefficient, R, we get R² (R-squared), which is the coefficient of determination.
  R      R²
-1.00   1.00
-0.90   0.81
-0.71   0.50
-0.50   0.25
 0.00   0.00
 0.50   0.25
 0.71   0.50
 0.90   0.81
 1.00   1.00
Notice that R² can only take values between 0 and 1. Since R² is never farther from zero than R, it doesn't look as impressive. Software sometimes reports R, sometimes R², and sometimes both, so watch for which one is being reported.
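A quick check of the table above (values rounded to two decimals, assuming NumPy):

    import numpy as np

    r = np.array([-1.0, -0.9, -0.71, -0.5, 0.0, 0.5, 0.71, 0.9, 1.0])
    print(np.round(r ** 2, 2))   # [1.   0.81 0.5  0.25 0.   0.25 0.5  0.81 1.  ]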
Outliers and Spearman rank correlation
The Pearson correlation coefficient may be greatly affected by single influential points
(outliers). We'll see examples in a moment. Sometimes we would like to have a measure
of association that is less sensitive to single points, and at those times we can use
Spearman rank correlation.
Recall that, when we calculate the mean of a set of numbers, a single extreme value can
greatly increase the mean. But when we calculate the median, which is based on ranks,
extreme values have very little influence. The same idea applies to Pearson and
Spearman correlation. Pearson correlation uses the actual values of the observations,
while Spearman uses only the ranks of the observations, so it is less affected by outliers.
Here are the Pearson and Spearman correlations for some outlier examples.
Example: An outlier increases the Pearson correlation.

Without the outlier (Pearson R = 0.0000):

x value  1  1  2  2  3  3  4  4  4
y value  4  1  3  3  1  4  3  2  3

With an outlier at (10, 10) (Pearson R = 0.812324):

x value  1  1  2  2  3  3  4  4 10
y value  4  2  3  3  1  4  3  2 10
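A short sketch with SciPy (assuming scipy is installed; scipy.stats provides both measures) shows how differently the two coefficients react to the (10, 10) point:

    from scipy import stats

    x0 = [1, 1, 2, 2, 3, 3, 4, 4, 4]    # without the outlier
    y0 = [4, 1, 3, 3, 1, 4, 3, 2, 3]
    x1 = [1, 1, 2, 2, 3, 3, 4, 4, 10]   # with the outlier
    y1 = [4, 2, 3, 3, 1, 4, 3, 2, 10]

    print(stats.pearsonr(x0, y0)[0])    # 0.0
    print(stats.pearsonr(x1, y1)[0])    # about 0.81 -- pulled up by one point
    print(stats.spearmanr(x1, y1)[0])   # about 0.16 -- ranks blunt the outlier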
Example: In this second example (drug dose vs. IQ), the outlier has a large effect on both the Pearson and the Spearman correlation coefficients.
Without the outlier (Pearson R = -0.922204443):

Drug dose   5   5   5  10  10  10  15  15  20  20
IQ        151 145 136 137 124 124 111 105 110  98

With the outlier (the last IQ value changed from 98 to 150; Pearson R = -0.472650854):

Drug dose   5   5   5  10  10  10  15  15  20  20
IQ        151 145 136 137 124 124 111 105 110 150
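Checking the claim with SciPy (a sketch; values rounded) confirms that here the outlier moves both coefficients substantially:

    from scipy import stats

    dose = [5, 5, 5, 10, 10, 10, 15, 15, 20, 20]
    iq_a = [151, 145, 136, 137, 124, 124, 111, 105, 110, 98]    # without the outlier
    iq_b = [151, 145, 136, 137, 124, 124, 111, 105, 110, 150]   # with the outlier

    print(stats.spearmanr(dose, iq_a)[0])   # about -0.91
    print(stats.spearmanr(dose, iq_b)[0])   # about -0.50 -- the outlier hurts Spearman too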
Pearson correlation detects linear association, while Spearman correlation detects monotonic relationships that are not necessarily linear. Note also that two variables may have zero Pearson linear correlation and zero Spearman rank correlation and yet not be independent; a symmetric U-shaped relationship, such as y = x² for values of x centered at zero, is the classic example.
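A minimal sketch of that classic case (assuming NumPy and SciPy):

    import numpy as np
    from scipy import stats

    x = np.array([-3, -2, -1, 0, 1, 2, 3])
    y = x ** 2          # y is completely determined by x

    print(np.corrcoef(x, y)[0, 1])    # 0.0 -- no linear association
    print(stats.spearmanr(x, y)[0])   # 0.0 -- no monotonic association either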
Calculation of the Pearson linear correlation coefficient
Here is the procedure for calculating the Pearson linear correlation coefficient, R. We
use variance and covariance to calculate the correlation, so we'll start with those.
Recall the formula for variance from the section on descriptive statistics. Variance describes variability around the mean value.

\[ \mathrm{Variance}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \]
Covariance extends the idea of variance to two variables. The formula for covariance is similar to that for the variance.

\[ \mathrm{Covariance}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N} \]
Correlation uses the covariance of two variables. The correlation of two variables, x and y, is equal to the covariance of x and y divided by a number that forces correlation to be between -1.0 and 1.0.

\[ \mathrm{Correlation}(x, y) = R = \frac{\mathrm{Covariance}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} \]
The denominator, the square root of Var(x) * Var(y), forces the correlation coefficient to
be between -1.0 and 1.0.
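These three formulas translate directly into code. Here is a minimal from-scratch sketch in plain Python (using the population formulas above, which divide by N):

    def variance(v):
        """Mean squared deviation from the mean (divides by N)."""
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)

    def covariance(x, y):
        """Mean product of paired deviations from the two means."""
        mx, my = sum(x) / len(x), sum(y) / len(y)
        return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

    def correlation(x, y):
        """Pearson R: covariance rescaled to lie between -1 and 1."""
        return covariance(x, y) / (variance(x) * variance(y)) ** 0.5

    print(correlation([0, 1, 2, 3], [0, 10, 20, 30]))   # 1.0 -- the hourly-wage example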
Correlation in Design of Experiments
When we design experiments, we usually want to avoid having correlation between our
independent variables. Suppose we want to measure the effect of the amount of two
reagents on yield. We design the following (bad) experiment to study the effects of
Reagent 1 and Reagent 2 on yield in 4 batches.
Batch  Reagent 1  Reagent 2  Yield
1      0          0           0
2      0          0           0
3      1          1          30
4      1          1          30
What can we conclude about the effects of Reagents 1 and 2 on the yield? Unfortunately, we can't tell whether the differences in yield are due to Reagent 1, Reagent 2, or an interaction between them. From this experiment alone, it is possible that Reagent 1 has no effect on yield, and it is equally possible that Reagent 2 has none. We can't tell, because, in this experiment design, Reagents 1 and 2 are correlated, with R = 1.
When we do a scatterplot of the levels of reagent 1 vs. reagent 2 in the design, it is
obvious that they are perfectly correlated with R = 1.
Here is an alternative (good) experiment design that removes the correlation.
Batch  Reagent 1  Reagent 2  Yield
1      0          0           0
2      1          0          30
3      0          1           0
4      1          1          30
What can we conclude about the effects of Reagents 1 and 2 on the yield? In this experiment, it is clear that Reagent 1 increases yield from 0 to 30, while Reagent 2 has no effect on yield. We can determine the effects of the reagents because, in this experiment design, Reagents 1 and 2 are not correlated: R = 0.
When we do a scatterplot of the levels of Reagent 1 vs. Reagent 2, it is clear that they are not correlated. This second experiment design is much superior to the first, and the correlation of R = 0 among the independent factors is what tells us so.
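One can check a design's factor columns directly (a sketch assuming NumPy):

    import numpy as np

    bad_r1, bad_r2 = [0, 0, 1, 1], [0, 0, 1, 1]     # first (bad) design
    good_r1, good_r2 = [0, 1, 0, 1], [0, 0, 1, 1]   # second (good) design

    print(np.corrcoef(bad_r1, bad_r2)[0, 1])    # 1.0 -- confounded factors
    print(np.corrcoef(good_r1, good_r2)[0, 1])  # 0.0 -- orthogonal factors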
When two independent variables are perfectly correlated (as in the bad experiment design above), we cannot separate the effects of the two variables. We say that the two variables are confounded, or aliased. The variables are confounded because we can't attribute the effects to one or the other. The variables are aliased because one variable has the same pattern (in the design) as the other.
Correlation is not the same as interaction
Correlation and interaction are often confused, but they are quite different.

• Correlation involves two variables. It describes the association between two variables.

• Interaction involves three or more variables. It is the effect of two (or more) factors on a third (response) variable.
Here is an example of two factors (Time and Temperature) that have zero correlation,
but have an interaction in their effect on the third variable, Yield.
Time  Temp  Cookie yield
0     0     50
1     0     70
0     1     70
1     1     50
We have zero correlation between time and temp. But there is an interaction between
time and temp in their effect on yield. The effect on cookie yield of increasing
temperature depends on the value of time.
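A short numerical check (assuming NumPy) makes both halves of that claim concrete:

    import numpy as np

    time = np.array([0, 1, 0, 1])
    temp = np.array([0, 0, 1, 1])
    yield_ = np.array([50, 70, 70, 50])

    print(np.corrcoef(time, temp)[0, 1])   # 0.0 -- the factors are uncorrelated

    # The effect of raising Temp flips sign depending on Time:
    # that sign flip is the signature of an interaction.
    print(yield_[2] - yield_[0])   # +20 at Time = 0
    print(yield_[3] - yield_[1])   # -20 at Time = 1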