UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Correlation Analysis
Correlation Analysis measures the strength/degree of linear relationship (if any) between two quantitative
measurement variables. Correlation analysis is not appropriate for categorical variables such as ordinal or
nominal variables. (A different type of analysis is used with categorical and ordinal variables. We’ll learn about
that later.)
Again, Correlation Analysis measures only linear relationships. If two variables are related to each other in a nonlinear way, correlation analysis will not necessarily detect or measure the nonlinear relationship. (Looking ahead, we will use Regression Analysis to detect and measure both linear and nonlinear relationships. Also, we will need to use Correlation Analysis to check an assumption made in Regression Analysis.)
[Figure: scatterplot sketches of X and Y contrasting linear relationships (upward- and downward-sloping) with several nonlinear relationships (curves).]
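As a quick numerical illustration of this point, here is a minimal sketch in Python (using numpy; the simulated data are made up for illustration, not part of the course materials). It shows that the correlation coefficient can be close to zero even when X and Y are strongly related, if the relationship is nonlinear:

```python
# A minimal sketch: correlation analysis detects a linear relationship
# but can miss a strong nonlinear (here, U-shaped) relationship.
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(-3, 3, size=500)

y_linear = 2 * x + rng.normal(0, 1, size=500)   # linear relationship
y_quad = x**2 + rng.normal(0, 1, size=500)      # U-shaped (nonlinear) relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quad = np.corrcoef(x, y_quad)[0, 1]

print(f"r for the linear case:    {r_linear:.3f}")  # close to +1
print(f"r for the quadratic case: {r_quad:.3f}")    # near 0, despite a strong relationship
```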
The Pearson Correlation Coefficient, r, is used to estimate the strength of linear correlation (if any) between two variables based on a sample of data.¹ "r" is an estimate of the true strength of linear correlation (if any) between two variables in a population, usually denoted by the Greek letter rho, ρ.
The formula for r for two variables X and Y is given by:

$$ r = \frac{s_{XY}}{s_X \cdot s_Y} $$

The r calculated from our sample is our estimate of ρ in the population.
where sX is the standard deviation of X, sY is the standard deviation of Y, and sXY is the Covariance between X
and Y, as given by the formulas below (where n is sample size):
$$ s_X = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}} \qquad s_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}} $$

$$ s_{XY} = \frac{\sum (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{n-1} \quad \text{(no square root)} $$
Note: The covariance sXY can be either positive or negative, depending on whether X and Y move together
(positive covariance) or in opposite directions (negative covariance). Unfortunately, the covariance is affected by
the measurement units of X and Y. When calculating the correlation coefficient, r, we divide the covariance by
sX and sY in order to remove the effects of the measurement units. Therefore, the correlation coefficient, r, is not
affected by the measurement units of X and Y.
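To make the formulas concrete, here is a minimal sketch in Python (using numpy; the small sample below is made up for illustration). It computes sX, sY, sXY, and r directly from the formulas above, checks the result against numpy's built-in calculation, and confirms that rescaling X changes the covariance but not r:

```python
# A minimal sketch: r computed by hand from the formulas above,
# using a small made-up sample.
import numpy as np

X = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
Y = np.array([320.0, 350.0, 400.0, 410.0, 470.0, 490.0])
n = len(X)

sX = np.sqrt(np.sum((X - X.mean())**2) / (n - 1))        # standard deviation of X
sY = np.sqrt(np.sum((Y - Y.mean())**2) / (n - 1))        # standard deviation of Y
sXY = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)  # covariance (no square root)

r = sXY / (sX * sY)
print(f"r by hand: {r:.4f}")
print(f"numpy r:   {np.corrcoef(X, Y)[0, 1]:.4f}")  # should match

# Changing the measurement units of X (multiply by 100) scales the
# covariance but leaves r unchanged:
sXY_rescaled = np.sum((X*100 - (X*100).mean()) * (Y - Y.mean())) / (n - 1)
print(f"covariance after rescaling X: {sXY_rescaled:.1f}")   # 100 times larger
print(f"r after rescaling X:          {np.corrcoef(X*100, Y)[0, 1]:.4f}")  # identical
```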
Interpretation of r -- The Pearson Correlation Coefficient ranges between -1 and +1, with r = -1 indicating a
perfect negative linear correlation, r = +1 indicating a perfect positive linear correlation, and r = 0 indicating no
linear correlation. Values of r between 0 and +1 indicate imperfect positive linear correlation, and values between
0 and -1 indicate imperfect negative linear correlation.
[Figure: six scatterplot panels illustrating values of r: a perfect positive linear relationship (r = +1), a perfect negative linear relationship (r = -1), an imperfect positive linear relationship (0 < r < +1), an imperfect negative linear relationship (-1 < r < 0), no linear relationship (r = 0), and a nonlinear relationship (a relationship, yes, but not linear) that also has r = 0.]
¹ Side note: Theoretically, the Pearson Correlation Coefficient is only valid when X and Y have normal distributions; however, in practice, it works well even when X and Y are not normally distributed, as long as the sample size is relatively large. When these conditions are not met, Spearman's Correlation Coefficient may be used instead, but Spearman's coefficient measures the strength of any monotone relationship between X and Y rather than the strength of any linear relationship.
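The side note above can also be illustrated numerically. The sketch below (Python with scipy; simulated, hypothetical data) compares the Pearson and Spearman coefficients on a relationship that is monotone but far from linear:

```python
# A minimal sketch: Pearson vs. Spearman on a monotone, nonlinear relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
x = rng.uniform(0, 5, size=300)
y = np.exp(x) + rng.normal(0, 5, size=300)  # monotone increasing, but far from linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")   # smaller, since the relationship is not linear
print(f"Spearman rho: {spearman_rho:.3f}")  # near +1 for a monotone relationship
```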
Hypothesis Tests Involving r and ρ
The Pearson Correlation Coefficient, r, can be used to test hypotheses about ρ in the population.
One-Sided Test:   H0: ρ = 0    H1: ρ > 0
One-Sided Test:   H0: ρ = 0    H1: ρ < 0
Two-Sided Test:   H0: ρ = 0    H1: ρ ≠ 0
The ttest statistic for testing the null hypothesis ρ = 0 is given by:

$$ t_{test} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
where r is the Pearson Correlation Coefficient calculated from our sample data, and n is sample size.
This ttest value can be compared with a tcritical value from a t-table, using degrees of freedom = n − 2 and our selected value of α (for a one-sided test) or α/2 (for a two-sided test).
As always, if ttest is farther from zero than tcritical, then we Reject H0 and Accept H1.
Or, if you are given the p-value for the test, you can instead:
• compare the p-value with α, if you want to do a one-sided test, or
• compare the p-value with α/2, for a two-sided test.
As usual, if the p-value is less than α (for a one-sided test) or α/2 (for a two-sided test), then we Reject H0 and Accept H1.
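Here is a minimal sketch in Python (using scipy; the values of r and n are made up for illustration) that carries out the one-sided test exactly as described above:

```python
# A minimal sketch of the t-test for H0: rho = 0, using made-up values.
import numpy as np
from scipy import stats

r, n = 0.45, 30  # sample correlation and sample size (hypothetical)
t_test = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

alpha = 0.05
t_critical = stats.t.ppf(1 - alpha, df=n - 2)  # one-sided critical value, df = n - 2
print(f"t_test = {t_test:.3f}, t_critical = {t_critical:.3f}")
# Reject H0 if t_test is farther from zero than t_critical.

p_one_sided = 1 - stats.t.cdf(t_test, df=n - 2)  # one-sided p-value
print(f"one-sided p-value = {p_one_sided:.4f}")  # compare with alpha
```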
Comments:
• Note that a relationship can be strong (that is, the value of r can be large) and yet not significant.
• On the other hand, a relationship can be weak (that is, the value of r can be small) but significant.
• The key factor is the size of the sample:
   o For small samples, it is easy to produce a strong correlation by chance, and one must pay attention to significance to keep from jumping to conclusions, i.e., rejecting a null hypothesis when you shouldn't.
   o For large samples, it is easy to achieve significance, and one must pay attention to the strength of the correlation to determine whether the relationship explains very much.
Another Interpretation of r
It turns out that the square of r gives the proportion of the variation in one variable that is explained by the variation in the other variable. So, if the correlation between X and Y is r = 0.60, then r² = 0.36; that is, 36 percent of the movements (variation) in Y can be explained by movements (variation) in X.
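One way to see this interpretation is to compare r² with the proportion of the variation in Y explained by a least-squares line, which we will meet later as the R² of a simple regression. A minimal sketch (Python with numpy; simulated, hypothetical data):

```python
# A minimal sketch: r squared equals the proportion of the variation in Y
# explained by a fitted least-squares line.
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.normal(0, 1, size=200)
y = 3 * x + rng.normal(0, 4, size=200)

r = np.corrcoef(x, y)[0, 1]

b, a = np.polyfit(x, y, deg=1)     # fitted line: y_hat = a + b*x
y_hat = a + b * x
ss_res = np.sum((y - y_hat)**2)    # unexplained variation
ss_tot = np.sum((y - y.mean())**2) # total variation in Y
r_squared_from_fit = 1 - ss_res / ss_tot

print(f"r**2:         {r**2:.4f}")
print(f"R^2 from fit: {r_squared_from_fit:.4f}")  # same number
```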
Correlation Does Not Imply Causation
If X and Y are linearly correlated, this does not necessarily mean that X is causing movements in Y (or vice
versa). Instead, some third variable (not X or Y) may be causing X and Y to move together in a linear way. Or,
it may be simply coincidence that X and Y are moving together in a linear way. Correlation simply means that X
and Y are moving together in a linear way; it doesn't tell you why.
Examples:
• When ice cream consumption goes up in New York City, so does the homicide rate, but eating ice cream is not causing people to commit more murders. Instead, an increase in a third variable, air temperature, causes both ice cream consumption and homicides to increase.
• When the number of churches in a city is large, so is the number of bars, but going to church is not causing people to drink. Instead, an increase in a third variable, city population size, causes both the number of churches and the number of bars to increase.
Looking ahead, we will use Regression Analysis to rule out third variables as possible causes of correlations between X and Y.
Example
Suppose we calculate the Pearson Correlation Coefficient, r, for two variables in a data set about North Carolina
counties: population in NC counties (PopCens) and number of older persons per 10000 population
(Age65per10000). Suppose the value of r = -0.477 and the p-value = 0.001. What can we conclude about a
possible linear relationship between PopCens and Age65per10000?
There does appear to be a linear relationship between PopCens and Age65per10000, because the p-value of 0.001 is less than α = 0.05. Because r is negative, the direction of the relationship is negative, and because the magnitude of r is about midway between 0 and 1, the strength of the relationship is moderate; r² = 0.228, so about 23 percent of the variation in Age65per10000 can be explained by PopCens. These results indicate that counties with larger populations tend to have fewer older folks per 10,000 population, on average, and counties with smaller populations tend to have more older folks per 10,000, on average. However, this result does not control for the effects of any third variables that might be affecting the relationship between PopCens and Age65per10000. (To control for the effects of other variables, we would need to do a Regression Analysis.)
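For reference, here is how this example could be reproduced in Python. The file name nc_counties.csv is an assumption for illustration; the variable names PopCens and Age65per10000 are those used in the example above:

```python
# A minimal sketch of the county example; "nc_counties.csv" is hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("nc_counties.csv")  # hypothetical data file
r, p_value = stats.pearsonr(df["PopCens"], df["Age65per10000"])

# Note: scipy reports a two-sided p-value.
print(f"r = {r:.3f}, p-value = {p_value:.4f}, r^2 = {r**2:.3f}")
# With r = -0.477 and p = 0.001 < alpha = 0.05, we would reject H0: rho = 0.
```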
Correlations Results Are Valid Only Within the Relevant Range of the Sample Data
Theoretically, correlation analysis only describes the relationship between X and Y for the range of data values in
the sample, called the "Relevant Range" of the data. For example, if the sample contains values of X between 10
and 20 and values of Y between 300 and 500, as in the figure below, then the correlation analysis results are only
valid for X's and Y's in these ranges. If X were 5, say, or if X were 40, we do not have any data on what is happening at those values of X, so we don't know whether or not the linear correlation between X and Y holds for those values of X. We could assume that the correlation would continue to hold for values of X outside the Relevant Range, but it would be simply an assumption.
[Figure: scatterplot titled "Correlation Results Are Valid Only Within the Relevant Range of the Sample Data," showing sample data with the Relevant Range of X between 10 and 20 and the Relevant Range of Y between 300 and 500.]