UNC-Wilmington Department of Economics and Finance ECN 377 Dr. Chris Dumas

Correlation Analysis

Correlation Analysis measures the strength/degree of linear relationship (if any) between two quantitative measurement variables. Correlation analysis is not appropriate for categorical variables such as ordinal or nominal variables. (A different type of analysis is used with categorical and ordinal variables. We'll learn about that later.)

Again, Correlation Analysis measures only linear relationships--if two variables are related to each other in a nonlinear way, correlation analysis will not necessarily detect and measure the nonlinear relationship. (Looking ahead, we will use Regression Analysis to detect and measure both linear and nonlinear relationships. Also, we will need to use Correlation Analysis to check an assumption made in Regression Analysis.)

[Figure: scatterplots contrasting linear relationships between Y and X with several nonlinear relationships]

The Pearson Correlation Coefficient, r, is used to estimate the strength of linear correlation (if any) between two variables based on a sample of data.¹ "r" is an estimate of the true strength of linear correlation (if any) between two variables in a population, usually denoted by the Greek letter rho, ρ. The r calculated from our sample is our estimate of ρ in the population. The formula for r for two variables X and Y is given by:

r = \frac{s_{XY}}{s_X \cdot s_Y}

where sX is the standard deviation of X, sY is the standard deviation of Y, and sXY is the Covariance between X and Y, as given by the formulas below (where n is sample size):

s_X = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}} \qquad s_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}} \qquad s_{XY} = \frac{\sum (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{n-1} \quad \text{(no square root)}

Note: The covariance sXY can be either positive or negative, depending on whether X and Y move together (positive covariance) or in opposite directions (negative covariance). Unfortunately, the covariance is affected by the measurement units of X and Y. When calculating the correlation coefficient, r, we divide the covariance by sX and sY in order to remove the effects of the measurement units. Therefore, the correlation coefficient, r, is not affected by the measurement units of X and Y.

Interpretation of r -- The Pearson Correlation Coefficient ranges between -1 and +1, with r = -1 indicating a perfect negative linear correlation, r = +1 indicating a perfect positive linear correlation, and r = 0 indicating no linear correlation. Values of r between 0 and +1 indicate imperfect positive linear correlation, and values between 0 and -1 indicate imperfect negative linear correlation.

[Figure: scatterplots illustrating a perfect positive linear relationship (r = +1), a perfect negative linear relationship (r = -1), no linear relationship (r = 0), an imperfect positive linear relationship (0 < r < +1), an imperfect negative linear relationship (-1 < r < 0), and a nonlinear relationship (a relationship, yes, but not linear) with r = 0]

¹ Side note: Theoretically, the Pearson Correlation Coefficient is only valid when X and Y have normal distributions; however, in practice, it works well even when X and Y are not normally distributed, as long as the sample size is relatively large. When these conditions are not met, Spearman's Correlation Coefficient may be used instead, but Spearman's coefficient measures the strength of any monotone relationship between X and Y rather than the strength of any linear relationship.
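To make the formulas above concrete, here is a minimal sketch in Python (using NumPy and SciPy) that computes r from the definitional formulas for sX, sY, and sXY and then cross-checks the result against SciPy's built-in function. The x and y values are hypothetical, invented only for illustration; they are not from the handout.

import numpy as np
from scipy import stats

# Hypothetical sample data, made up only for illustration
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
y = np.array([310.0, 350.0, 370.0, 420.0, 460.0, 490.0])
n = len(x)

# Sample standard deviations: divide by n - 1, then take the square root
s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))

# Sample covariance: divide by n - 1 (no square root)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson Correlation Coefficient: r = s_XY / (s_X * s_Y)
r = s_xy / (s_x * s_y)
print(f"r from the formulas above:   {r:.4f}")

# Cross-check against SciPy's built-in implementation
r_scipy, p_value = stats.pearsonr(x, y)
print(f"r from scipy.stats.pearsonr: {r_scipy:.4f}")

Dividing the covariance by the two standard deviations is what strips out the measurement units, which is why the two r values agree no matter how x and y are scaled.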
Hypothesis Tests Involving r and ρ

The Pearson Correlation Coefficient, r, can be used to test hypotheses about ρ in the population.

One-Sided Test:   H0: ρ = 0   vs.   H1: ρ > 0
One-Sided Test:   H0: ρ = 0   vs.   H1: ρ < 0
Two-Sided Test:   H0: ρ = 0   vs.   H1: ρ ≠ 0

The t_test statistic for testing the null hypothesis ρ = 0 is given by:

t_{test} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

where r is the Pearson Correlation Coefficient calculated from our sample data, and n is sample size. This t_test value can be compared with a t_critical value from a t-table using degrees of freedom = n - 2 and our selected value of α (for a one-sided test) or α/2 (for a two-sided test). As always, if t_test is farther from zero than t_critical, then we Reject H0 and Accept H1.

Or, if you are given the p-value for the test, you can instead compare the p-value with α (for a one-sided test) or with α/2 (for a two-sided test). As usual, if the p-value is less than α (for a one-sided test) or α/2 (for a two-sided test), then we Reject H0 and Accept H1.

Comments: Note that a relationship can be strong (that is, the value of r can be large) and yet not significant. On the other hand, a relationship can be weak (that is, the value of r can be small) but significant. The key factor is the size of the sample:
   o For small samples, it is easy to produce a strong correlation by chance, and one must pay attention to significance to keep from jumping to conclusions, i.e., rejecting a null hypothesis when you shouldn't.
   o For large samples, it is easy to achieve significance, and one must pay attention to the strength of the correlation to determine whether the relationship explains very much.
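Here is a minimal sketch of the test mechanics in Python. The values of r and n are hypothetical, chosen only to show the calculation; the critical value comes from SciPy's t distribution rather than from a printed t-table.

from scipy import stats

r = 0.45      # Pearson Correlation Coefficient from the sample (hypothetical)
n = 30        # sample size (hypothetical)
alpha = 0.05  # selected significance level
df = n - 2    # degrees of freedom

# Test statistic: t_test = r * sqrt(n - 2) / sqrt(1 - r^2)
t_test = r * (df ** 0.5) / ((1.0 - r ** 2) ** 0.5)

# Critical value for a two-sided test puts alpha/2 in each tail
t_critical = stats.t.ppf(1.0 - alpha / 2.0, df)

print(f"t_test = {t_test:.3f}, t_critical = {t_critical:.3f}")
if abs(t_test) > t_critical:
    print("Reject H0 and Accept H1: evidence that rho differs from 0.")
else:
    print("Fail to reject H0: no evidence that rho differs from 0.")

For a one-sided test, the critical value would instead use α in a single tail, i.e., stats.t.ppf(1.0 - alpha, df).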
Another Interpretation of r

It turns out that the square of r gives the proportion of the variation in one variable that is explained by the variation in the other variable. So, if the correlation between X and Y is r = 0.60, then r² = 0.36, or 36 percent, of the movements (variation) in Y can be explained by movements (variation) in X.

Correlation Does Not Imply Causation

If X and Y are linearly correlated, this does not necessarily mean that X is causing movements in Y (or vice versa). Instead, some third variable (not X or Y) may be causing X and Y to move together in a linear way. Or, it may be simply coincidence that X and Y are moving together in a linear way. Correlation simply means that X and Y are moving together in a linear way; it doesn't tell you why.

Examples: When ice cream consumption goes up in New York City, so does the homicide rate, but eating ice cream is not causing people to commit more murders. Instead, an increase in a third variable, air temperature, causes both ice cream consumption and homicides to increase. When the number of churches in a city is large, so is the number of bars, but going to church is not causing people to drink. Instead, an increase in a third variable, city population size, causes both the number of churches and the number of bars to increase.

Looking ahead, we will use Regression Analysis to rule out possible third variables as causes of correlations between X and Y.

Example

Suppose we calculate the Pearson Correlation Coefficient, r, for two variables in a data set about North Carolina counties: population in NC counties (PopCens) and number of older persons per 10,000 population (Age65per10000). Suppose the value of r = -0.477 and the p-value = 0.001. What can we conclude about a possible linear relationship between PopCens and Age65per10000?

There does appear to be a linear relationship between PopCens and Age65per10000, because the p-value of 0.001 is less than α = 0.05. Because r is negative, the direction of the relationship is negative, and because r is about midway between 0 and 1, the strength of the relationship is moderate; r² = 0.228, so about 23 percent of the variation in Age65per10000 can be explained by PopCens. These results indicate that counties with larger populations tend to have fewer older folks per 10,000 population, on average, and counties with smaller populations tend to have more older folks per 10,000 population, on average. However, this result does not control for the effects of any third variables that might be affecting the relationship between PopCens and Age65per10000. (To control for the effects of other variables, we would need to do a Regression Analysis.)

Correlation Results Are Valid Only Within the Relevant Range of the Sample Data

Theoretically, correlation analysis only describes the relationship between X and Y for the range of data values in the sample, called the "Relevant Range" of the data. For example, if the sample contains values of X between 10 and 20 and values of Y between 300 and 500, as in the figure below, then the correlation analysis results are only valid for X's and Y's in these ranges. If X were 5, say, or if X were 40, we do not have any data on what is happening for those values of X, so we don't know whether or not the linear correlation between X and Y holds for those values of X. We could assume that the correlation would continue to hold for values of X outside the Relevant Range, but it would be simply an assumption.

[Figure: scatterplot of Y against X showing the Relevant Range of X (10 to 20) and the Relevant Range of Y (300 to 500)]
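The sketch below illustrates the relevant-range caveat in Python with hypothetical data matching the figure: X values between 10 and 20 and Y values between 300 and 500. The computed r describes the relationship only within those ranges; applying it at X = 5 or X = 40 would be an assumption, not a finding.

import numpy as np
from scipy import stats

# Hypothetical sample with X in [10, 20] and Y in [300, 500], as in the figure
x = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0, 20.0])
y = np.array([310.0, 340.0, 355.0, 400.0, 415.0, 455.0, 470.0, 495.0])

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f} describes the relationship only for "
      f"X in [{x.min():.0f}, {x.max():.0f}] and Y in [{y.min():.0f}, {y.max():.0f}]")

# We have no observations at X = 5 or X = 40, so assuming the same linear
# correlation holds out there goes beyond what the sample can tell us.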