Correlation Coefficients

•Pearson’s Product Moment Correlation Coefficient can
only be used with interval or ratio data:
•Its formula is based on the products of statistical
distances from the mean, and naturally those statistical
distances are only meaningful if the mean is an appropriate
measure of central tendency
•We cannot use the mean with ordinal data: It requires
that values have the property of ‘proportionality’
found in interval and ratio data: The value 2 is greater than
1 to the same extent that 3 is greater than 2
•This is not the case for ordinal data: While we can
describe greater than or less than relations between values,
there are not proportional differences
David Tenenbaum – GEOG 090 – UNC-CH Spring 2005
Spearman’s Rank
Correlation Coefficient
•We have an alternative correlation coefficient we can use
with ordinal data: Spearman’s Rank Correlation
Coefficient (rs)
rs = 1 - [6 Σ di²] / (n³ - n),  summing from i = 1 to n
where
n = sample size
di = the difference in the rankings of each value with respect to each variable
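As a sketch, the formula translates directly into a few lines of Python (the function name is illustrative, and tied ranks are assumed absent):

```python
def spearman_rs(x_ranks, y_ranks):
    """Spearman's rank correlation coefficient from two lists of ranks (no ties)."""
    n = len(x_ranks)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
    return 1 - (6 * d_squared) / (n ** 3 - n)

# Identical rankings give perfect positive correlation:
print(spearman_rs([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
# Exactly reversed rankings give perfect negative correlation:
print(spearman_rs([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```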
Spearman’s Rank
Correlation Coefficient
•We can use the rank correlation coefficient with
ordinal data (that is effectively already in a ranked
form), or we can take interval or ratio data and
convert it to rankings by simply enumerating
values in the X and Y variables with values from 1
to n for each variable
•Transforming interval or ratio data to ordinal data
for use with the rank coefficient may be desirable
when our interval or ratio dataset fails to meet an
assumption required for the use of the Pearson’s
Correlation Coefficient
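A minimal sketch of that rank conversion, assuming no tied values (the helper name is illustrative):

```python
def to_ranks(values):
    """Assign rank 1 to the smallest value and rank n to the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

print(to_ranks([0.274, 0.542, 0.419]))  # [1, 3, 2]
```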
Pearson’s r - Assumptions
• To properly apply Pearson’s Correlation Coefficient, we
first have to make sure that the following assumptions
are satisfied:
1. The values need to be either interval or ratio scale data
(later we will examine a different correlation method for
ordinal data)
2. The (x,y) data pairs are selected randomly from a
population of values of X and Y
3. The relationship between X and Y is linear (which can
be qualitatively assessed by looking at the scatterplot)
4. The variables X and Y must share a joint bivariate
normal distribution (which we tend to assume when
sampling from a population)
Spearman’s Rank
Correlation Coefficient
•Thus, we might use the rank correlation coefficient when
we have an interval or ratio data set that is not normally
distributed, but we still want to get a sense of the
association between the two variables
•We can also use Spearman’s Rank rs when we have a
much smaller number of observations (as few as 3),
although a numerical description of association becomes
somewhat nonsensical when the sample size is that small
•For example, suppose we find the TVDI - soil moisture
dataset from Glyndon violates the assumption of normal
distribution (which it probably does, although it is such a
small dataset, it is difficult to assess this):
Spearman’s Rank
Correlation Coefficient
•We can transform the data values into rankings for use in rs:
TVDI (x)   Rank (x)   Theta (y)   Rank (y)   Difference (di)
0.274      1          0.414       9          -8
0.542      7          0.359       7          0
0.419      4          0.396       8          -4
0.286      2          0.458       10         -8
0.374      3          0.350       5          -2
0.489      5          0.357       6          -1
0.623      8          0.255       4          4
0.506      6          0.189       3          3
0.768      10         0.171       2          8
0.725      9          0.119       1          8
•And we can calculate the differences in ranks to use in rs
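As a sketch, the ranking and differencing steps above can be reproduced in a few lines of Python (variable names are illustrative; ties are assumed absent):

```python
# Glyndon TVDI (x) and soil moisture Theta (y) values from the table above
tvdi  = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
theta = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]

def to_ranks(values):
    # Rank 1 = smallest value; assumes no tied values
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

rank_x = to_ranks(tvdi)
rank_y = to_ranks(theta)
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
print(d)  # [-8, 0, -4, -8, -2, -1, 4, 3, 8, 8]
```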
Spearman’s Rank
Correlation Coefficient
•Note that because we square the differences in rankings
their sign does not matter
•Once we have calculated the differences in rankings,
calculating the rs statistic is simply a matter of squaring the
differences and summing them, multiplying the sum by six,
dividing by the denominator (n³ - n), and then finally
subtracting from one:
rs = 1 - {6[(-8)² + (0)² + (-4)² + (-8)² + (-2)² + (-1)² + (4)² +
(3)² + (8)² + (8)²] / [(10)³ - 10]}
= 1 - {6[64 + 0 + 16 + 64 + 4 + 1 + 16 + 9 + 64 + 64] / 990}
= 1 - {6[302] / 990}
= 1 - {1.830}
= -0.830
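The arithmetic can be checked with a short Python snippet, using the rank differences from the table:

```python
# Rank differences di for the Glyndon TVDI / soil moisture data
d = [-8, 0, -4, -8, -2, -1, 4, 3, 8, 8]
n = 10

# rs = 1 - (6 * sum of di^2) / (n^3 - n)
rs = 1 - (6 * sum(di ** 2 for di in d)) / (n ** 3 - n)
print(round(rs, 3))  # -0.83
```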
A Significance Test for rs
• As was the case for Pearson’s Correlation Coefficient,
we can test the significance of an rs result using a t-test
• The test statistic and degrees of freedom are formulated
a little differently for rs, although many of the characteristics
of the distribution of r values are present here as well:
• In this case, rs values follow a t-distribution with (n - 1)
degrees of freedom, and their standard error can be
estimated using:
SErs = 1 / √(n - 1)
yielding the test statistic:
ttest = rs / SErs = rs √(n - 1)
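As a sketch, the standard error and test statistic follow directly (the function name and example values are illustrative):

```python
import math

def spearman_t(rs, n):
    """t statistic for Spearman's rs, using SE = 1 / sqrt(n - 1) as above."""
    return rs * math.sqrt(n - 1)

# Example: rs = -0.83 with n = 10 observations
print(round(spearman_t(-0.83, 10), 2))  # -2.49
```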
A Significance Test for rs
• Again, we use this test in a 2-tailed fashion to assess
whether or not the population correlation coefficient is
equal to zero (no relationship) or not equal to zero
(some relationship):
H0: ρs = 0
HA: ρs ≠ 0
• Again, the test statistic is purely a function of the
correlation coefficient (rs) and sample size (n):
ttest = rs √(n - 1)
• Thus, a given rs may or may not be significant
depending on the size of the sample!
Hypothesis Testing - Significance of rs
t-test Example
• Research question: Is there a significant relationship
between TVDI and soil moisture in the Glyndon data set?
1. H0: ρs = 0 (No significant relationship)
2. HA: ρs ≠ 0 (Some relationship)
3. Select α = 0.05, two-tailed because of how the alternate
hypothesis is formulated
4. In order to compute the t-test statistic, we need to first
calculate Spearman’s Rank Correlation Coefficient. We
have done so earlier in this lecture, finding rs = -0.830, a
very strong inverse relationship between remotely sensed
TVDI and field measurements of soil moisture
Hypothesis Testing - Significance of rs
t-test Example
4. Cont. We calculate the test statistic using:
ttest = rs √(n - 1)
ttest = -0.830 √(10 - 1)
ttest = -0.830 √9
ttest = -0.830 × 3 = -2.49
5. We now need to find the critical t-score, first calculating
the degrees of freedom:
df = (n - 1) = (10 - 1) = 9
We can now look up the tcrit value for our α (0.025 in
each tail) and df = 9, tcrit = 2.262
Hypothesis Testing - Significance of rs
t-test Example
6. |ttest| > |tcrit|, therefore we reject H0 and accept HA,
finding that there is a significant relationship, i.e. the
population correlation coefficient ρs (which we have
estimated using the sample correlation coefficient rs) is not
equal to 0
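The whole decision in steps 4–6 can be sketched in a few lines of Python (tcrit = 2.262 is taken from a t-table for df = 9 with 0.025 in each tail, as above):

```python
import math

rs, n = -0.830, 10
t_crit = 2.262               # from a t-table: df = 9, two-tailed alpha = 0.05

t_test = rs * math.sqrt(n - 1)
print(round(t_test, 2))      # -2.49
print(abs(t_test) > t_crit)  # True -> reject H0: significant relationship
```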
Covariance and Correlation in Excel
•Excel can calculate covariance and correlation in two ways:
•There are built-in functions that can be entered into a
cell to specify the calculation of a Pearson’s Product
Moment Correlation (no Spearman’s Rank available) or
covariance between a pair of variables:
•COVAR(array1, array2) can be used to calculate the
covariance between a pair of variables
•CORREL(array1, array2) or PEARSON(array1,
array2) can be used to calculate the correlation
between a pair of variables
•There are also Data Analysis Tools that can be used to
calculate the correlation or covariance between several
variables
Covariance and Correlation Tools
•In the Data Analysis window, select the appropriate tool:
•Fill in the typical fields in the tool window:
Covariance and Correlation in Excel
•The Analysis Tools are particularly useful because rather
than just computing a covariance or correlation between two
variables, they can compute several at the same time, and
place the results in a covariance or correlation matrix
•In the example shown below, correlations will be computed
between each pair of variables in columns C through K:
Covariance and Correlation in Excel
•The resulting output is a correlation matrix that shows the
correlation between every pair of variables:
•The values of 1 along the diagonal are present because
every variable has perfect positive correlation with itself
•There are only values displayed on one side of the diagonal
to avoid providing redundant correlation coefficients
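For readers working outside Excel, the same kind of correlation matrix can be produced with pandas (a sketch; the column names and data here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly inversely correlated with a
})

corr = df.corr()  # Pearson by default; df.corr(method="spearman") uses ranks
print(corr.round(2))
```

Unlike Excel's tool, pandas fills in both sides of the diagonal, but the off-diagonal values are mirror images of each other.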
Correlation Matrices
•Correlation matrices are particularly useful when
you have a multivariate dataset with lots of
variables, and you want to get some sense of the
relationships between them
•If you find that multiple variables are strongly
correlated, you can use this information to remove
some of these variables from an analysis (e.g. a
multiple linear regression), since any pair of
variables with a very high correlation is essentially
redundant in explaining variation in another
variable: they covary in the same way