Correlation Coefficients
•Pearson’s Product Moment Correlation Coefficient can only be used with interval or ratio data:
•Its formula is based on the products of statistical distances from the mean, and those statistical distances are only meaningful if the mean is an appropriate measure of central tendency
•We cannot use the mean with ordinal data: it requires that values have the property of ‘proportionality’ found in interval and ratio data: the value 2 is greater than 1 to the same extent that 3 is greater than 2
•This is not the case for ordinal data: while we can describe greater-than or less-than relations between values, the differences are not proportional
David Tenenbaum – GEOG 090 – UNC-CH Spring 2005

Spearman’s Rank Correlation Coefficient
•We have an alternative correlation coefficient we can use with ordinal data: Spearman’s Rank Correlation Coefficient (rs):

rs = 1 - [6 Σ di²] / (n³ - n)     (sum over i = 1 to n)

where n = sample size
      di = the difference in the rankings of each observation with respect to the two variables

Spearman’s Rank Correlation Coefficient
•We can use the rank correlation coefficient with ordinal data (which is effectively already in ranked form), or we can take interval or ratio data and convert it to rankings by simply enumerating the values of the X and Y variables from 1 to n for each variable
•Transforming interval or ratio data to ordinal data for use with the rank coefficient may be desirable when our interval or ratio dataset fails to meet an assumption required for the use of Pearson’s Correlation Coefficient

Pearson’s r - Assumptions
•To properly apply Pearson’s Correlation Coefficient, we first have to make sure that the following assumptions are satisfied:
1. The values need to be either interval or ratio scale data (later we will examine a different correlation method for ordinal data)
2. 
The (x,y) data pairs are selected randomly from a population of values of X and Y
3. The relationship between X and Y is linear (which can be qualitatively assessed by looking at the scatterplot)
4. The variables X and Y must share a joint bivariate normal distribution (which we tend to assume when sampling from a population)

Spearman’s Rank Correlation Coefficient
•Thus, we might use the rank correlation coefficient when we have an interval or ratio dataset that is not normally distributed, but we still want to get a sense of the association between the two variables
•We can also use Spearman’s rank rs when we have a much smaller number of observations (as few as 3), although a numerical description of association becomes somewhat nonsensical when the sample size is that small
•For example, suppose we find the TVDI - soil moisture dataset from Glyndon violates the assumption of normal distribution (which it probably does, although it is such a small dataset that this is difficult to assess):

Spearman’s Rank Correlation Coefficient
•We can transform the data values into rankings for use in rs:

TVDI (x)   Rank (x)   Theta (y)   Rank (y)   Difference (di)
 0.274        1         0.414        9            -8
 0.542        7         0.359        7             0
 0.419        4         0.396        8            -4
 0.286        2         0.458       10            -8
 0.374        3         0.350        5            -2
 0.489        5         0.357        6            -1
 0.623        8         0.255        4             4
 0.506        6         0.189        3             3
 0.768       10         0.171        2             8
 0.725        9         0.119        1             8

•And we can calculate the differences in ranks to use in rs

Spearman’s Rank Correlation Coefficient
•Note that because we square the differences in rankings, their sign does not matter
•Once we have calculated the differences in rankings, calculating the rs statistic is simply a matter of squaring the differences and summing them, multiplying the sum by six, dividing by the denominator (n³ - n), and then subtracting the result from one:

rs = 1 - {6[(-8)² + (0)² + (-4)² + (-8)² + (-2)² + (-1)² + (4)² + (3)² + (8)² + (8)²] / [(10)³ - 10]}
   = 1 - {6[64 + 0 + 16 + 64 + 4 + 1 + 16 + 9 + 64 + 64] / 990}
   = 1 - {6[302] / 990}
   = 1 - 1.830
   = -0.830

A Significance Test for rs
•As was the case for Pearson’s Correlation Coefficient, we can test the significance of an rs result using a t-test
•The test statistic and degrees of freedom are formulated a little differently for rs, although many of the characteristics of the distribution of r values are present here as well:
•In this case, rs values follow a t-distribution with (n - 1) degrees of freedom, and their standard error can be estimated using:

SErs = 1 / √(n - 1)

yielding the test statistic:

ttest = rs / SErs = rs √(n - 1)

A Significance Test for rs
•Again, we use this test in a two-tailed fashion to assess whether the population correlation coefficient is equal to zero (no relationship) or not equal to zero (some relationship):

H0: ρs = 0
HA: ρs ≠ 0

•Again, the test statistic is purely a function of the correlation coefficient (rs) and the sample size (n):

ttest = rs √(n - 1)

•Thus, a given rs may or may not be significant depending on the size of the sample!

Hypothesis Testing - Significance of rs t-test Example
•Research question: Is there a significant relationship between TVDI and soil moisture in the Glyndon dataset?
1. H0: ρs = 0 (No significant relationship)
2. HA: ρs ≠ 0 (Some relationship)
3. Select α = 0.05, two-tailed because of how the alternate hypothesis is formulated
4. In order to compute the t-test statistic, we first need to calculate Spearman’s Rank Correlation Coefficient. We did so earlier in this lecture, finding rs = -0.830, a very strong inverse relationship between remotely sensed TVDI and field measurements of soil moisture

Hypothesis Testing - Significance of rs t-test Example
4. Cont. 
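As a check on the rank transformation and rs value used in this example, the calculation can be sketched in Python (a minimal illustration assuming no tied values; `to_ranks` and `spearman_rs` are helper names written for this sketch, not library functions):

```python
# Minimal sketch: convert raw values to ranks (1 = smallest) and compute
# Spearman's r_s = 1 - 6*sum(d_i^2)/(n^3 - n); assumes no tied values.
tvdi  = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
theta = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]

def to_ranks(values):
    # Rank each value by its position in the sorted order (no-ties assumption)
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    rx, ry = to_ranks(x), to_ranks(y)
    n = len(x)
    sum_d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of d_i^2
    return 1 - 6 * sum_d_sq / (n ** 3 - n)

print(to_ranks(tvdi))                        # [1, 7, 4, 2, 3, 5, 8, 6, 10, 9]
print(round(spearman_rs(tvdi, theta), 3))    # -0.83
```

The ranks match the table from the slides, and the sum of squared differences comes out to 302, reproducing the hand calculation.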
We calculate the test statistic using:

ttest = rs √(n - 1)
ttest = -0.830 √(10 - 1)
ttest = -0.830 √9
ttest = -0.830 × 3 = -2.490

5. We now need to find the critical t-score, first calculating the degrees of freedom:

df = (n - 1) = (10 - 1) = 9

We can now look up the tcrit value for our α (0.025 in each tail) and df = 9: tcrit = 2.262

Hypothesis Testing - Significance of rs t-test Example
6. |ttest| > |tcrit|, therefore we reject H0 and accept HA, finding that there is a significant relationship (i.e. the population correlation coefficient ρs, which we have estimated using the sample correlation coefficient rs, is not equal to 0)

Covariance and Correlation in Excel
•Excel can calculate covariance and correlation in two ways:
•There are built-in functions that can be entered into a cell to specify the calculation of a Pearson’s Product Moment Correlation (no Spearman’s Rank available) or covariance between a pair of variables:
•COVAR(array1, array2) can be used to calculate the covariance between a pair of variables
•CORREL(array1, array2) or PEARSON(array1, array2) can be used to calculate the correlation between a pair of variables
•There are also Data Analysis Tools that can be used to calculate the correlation or covariance between several variables

Covariance and Correlation Tools
•In the Data Analysis window, select the appropriate tool:
•Fill in the typical fields in the tool window:

Covariance and Correlation in Excel
•The Analysis Tools are particularly useful because rather than just computing a covariance or correlation between two variables, they can do several at the same time, and place the results in a covariance or correlation matrix
•In the example shown below, correlations will be computed between each pair of variables in columns C through K:
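Returning to the hypothesis-test example: the test statistic and decision rule can also be sketched in code (a minimal illustration; the critical value 2.262 is the t-table value for α = 0.05, two-tailed, df = 9, and rs is the Glyndon value computed earlier):

```python
# Sketch of the slides' significance test for Spearman's r_s:
# t = r_s * sqrt(n - 1), compared two-tailed against t_crit at (n - 1) df.
import math

def rs_t_statistic(rs, n):
    return rs * math.sqrt(n - 1)

t = rs_t_statistic(-0.830, 10)   # -0.830 * sqrt(9) = -2.49
t_crit = 2.262                   # alpha = 0.05, two-tailed, df = 9 (t-table)
reject_h0 = abs(t) > t_crit      # True: reject H0, relationship is significant
print(round(t, 2), reject_h0)
```

Note that because the test statistic scales with √(n - 1), the same rs value that is significant here could fail to reach significance with a smaller sample.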
Covariance and Correlation in Excel
•The resulting output is a correlation matrix that shows the correlation between every pair of variables:
•The values of 1 along the diagonal are present because every variable has a perfect positive correlation with itself
•Values are displayed on only one side of the diagonal to avoid providing redundant correlation coefficients

Correlation Matrices
•Correlation matrices are particularly useful when you have a multivariate dataset with many variables and you want to get some sense of the relationships between them
•If you find that multiple variables are strongly correlated, you can use this information to remove some of them from an analysis (e.g. a multiple linear regression), since any pair of variables with a very high correlation is essentially redundant in explaining variation in another variable: they covary in the same way
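The same kind of correlation matrix Excel produces can be built by hand; here is a sketch in Python with a lower triangle and unit diagonal, as in Excel's output (the three variables and their values are invented purely for illustration):

```python
# Sketch of a Pearson correlation matrix like Excel's Correlation tool output:
# lower triangle plus a unit diagonal. The variables below are made-up examples.
import statistics

def pearson_r(x, y):
    # Pearson's r: covariance term divided by the product of the spreads
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

data = {
    "elev":  [10.0, 22.0, 31.0, 44.0, 55.0],   # hypothetical variables
    "temp":  [20.1, 18.0, 16.2, 14.1, 12.0],
    "slope": [3.0, 1.0, 4.0, 1.0, 5.0],
}
names = list(data)
# Only the lower triangle is stored, since r(x, y) == r(y, x)
matrix = {a: {b: pearson_r(data[a], data[b]) for b in names[: i + 1]}
          for i, a in enumerate(names)}
```

Scanning such a matrix for near-±1 entries is a quick way to spot the redundant variable pairs discussed above before fitting a regression.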