Lecture_7_Correlation

Correlation
Association, Measures of – a) a measure of the degree of relationship between two
variables. b) understanding the degree of association between two variables might enable
the estimation of one variable based on a value of the other variable.
The relationship between two categorical variables (with two or more categories each)
can be assessed using a Contingency Table/Cross Tabulation and a Chi-square test.
The relationship between a categorical independent variable (with two categories) and a
categorical dependent variable (with two categories) can be assessed using a Difference
of Proportions test. Similarly, the relationship between a categorical independent
variable (with two categories) and an interval/ratio dependent variable can be assessed
using a Difference of Means test. The relationship between a categorical independent
variable (with two or more categories) and an interval/ratio dependent variable or a
categorical dependent variable (with two categories) can be assessed using One-way
Analysis of Variance. The relationship between two interval/ratio variables can be
assessed using a Scatterplot and the Correlation Coefficient.
Scatterplot – a) a mathematical diagram using Cartesian coordinates, X the independent
axis/variable and Y the dependent axis/variable, to display values for two variables for a
set of data. b) a visual representation of the direction (rising is positive and falling is
negative) and strength (strong is tightly clustered along a line and in perfect correlation a
straight line would intersect every point) of the linear correlation. c) can also provide a
representation of nonlinear correlation.
While scatterplots can be useful in the visual assessment of correlation, statistically
significant correlations can appear to be a random pattern.
Cross-product Deviations – a) the product of the deviation of two variables from their
means. b) the contribution of each observation to the direction and strength of the
association.
$$ CP(xy) = (X - \bar{X})(Y - \bar{Y}) $$
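As a minimal sketch of the definition above, the cross-product deviation can be computed for each observation of a small data set. The values of `x` and `y` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

mean_x = sum(x) / len(x)  # 5.0
mean_y = sum(y) / len(y)  # 4.0

# Cross-product deviation of each observation: (X - mean_x) * (Y - mean_y)
cp = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
print(cp)  # [9.0, -1.0, -1.0, 9.0]
```

The first and last observations lie on the same side of both means and contribute positively; the two middle observations lie on opposite sides and contribute negatively.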
Covariation (Sum of Cross-Product Deviations) – a) the sum of the product of the joint
deviations of the individual observations of X and Y from their respective means. b) the
aggregate association between X and Y. c) if there is no association
between X and Y the Covariation will be zero, if the association is positive the
Covariation will be positive and if the association is negative the Covariation will be
negative, although the strength of the association cannot be evaluated by the Covariation.
$$ SCP(xy) = \sum (X - \bar{X})(Y - \bar{Y}) $$
For a particular observation, if both X and Y are greater than their respective means then
its cross-product deviation is greater than zero and it adds to the SCP.
If $X > \bar{X}$ and $Y > \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) > 0$
The same holds true if both X and Y are less than their respective means.
If $X < \bar{X}$ and $Y < \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) > 0$
But if X is greater than its mean, while Y is less than its mean, or vice versa, the
cross-product deviation is less than zero and it subtracts from the SCP.
If $X > \bar{X}$ and $Y < \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) < 0$
If $X < \bar{X}$ and $Y > \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) < 0$
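The sign rules above can be illustrated with a short sketch that splits the Covariation into its positive and negative contributions. The data are hypothetical, and the variable names are illustrative only.

```python
# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
cp = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]

# Observations on the same side of both means contribute positive products;
# observations on opposite sides contribute negative products.
positive_part = sum(c for c in cp if c > 0)  # 18.0
negative_part = sum(c for c in cp if c < 0)  # -2.0
scp = sum(cp)                                # 16.0

print(positive_part, negative_part, scp)  # 18.0 -2.0 16.0
```

Because the positive contributions outweigh the negative ones, the Covariation is positive, which indicates a positive association but says nothing yet about its strength.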
Total Sum of Squares (TSS) in the Standard Deviation equation can be viewed as the
Covariation of a variable with itself.
$$ TSS(x) = \sum (X - \bar{X})^2 = \sum (X - \bar{X})(X - \bar{X}) $$
Covariance – a) the sum of the cross-product deviations divided by the number of cases
less one, that is, the average amount that the paired observations of X and Y covary.
$$ \mathrm{cov}(xy) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} $$
The Covariance is so named because of its similarity to the Variance.
$$ \mathrm{var}(x) = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{n - 1} $$
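The parallel between the Covariance and the Variance can be checked numerically. The sketch below assumes a hypothetical helper named `covariance` and hypothetical data; `statistics.variance` is the standard-library sample variance, which uses n - 1 in the denominator.

```python
import math
import statistics

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

def covariance(a, b):
    """Sum of cross-product deviations divided by n - 1 (hypothetical helper)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    scp = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return scp / (len(a) - 1)

cov_xy = covariance(x, y)        # 16 / 3, about 5.33
cov_xx = covariance(x, x)        # 20 / 3, about 6.67
var_x = statistics.variance(x)   # sample variance of x

# The covariance of a variable with itself is its variance.
print(math.isclose(cov_xx, var_x))  # True
```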
The difficulties of interpreting the Variance that arise from its units being the squared
units of the variable are compounded with the Covariance because its units are the product
of the units of X and Y; that is, the size of the Covariance is a function of the standard
deviations of X and Y.
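A brief sketch of this unit dependence: the same hypothetical measurements are recorded once in metres and once in centimetres, and the Covariance changes by a factor of 100 even though the relationship between the variables is unchanged. The helper `covariance` and the data are assumptions for illustration.

```python
import math
import statistics

def covariance(a, b):
    """Sum of cross-product deviations divided by n - 1 (hypothetical helper)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    scp = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return scp / (len(a) - 1)

# Hypothetical toy data: the same lengths in two different units.
x_m = [2, 4, 6, 8]                  # metres
x_cm = [100 * xi for xi in x_m]     # centimetres
y = [1, 5, 3, 7]

cov_m = covariance(x_m, y)    # about 5.33
cov_cm = covariance(x_cm, y)  # about 533.33

# Rescaling X by 100 rescales the covariance by 100.
print(math.isclose(cov_cm, 100 * cov_m))  # True
```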
Pearson’s Product-Moment Correlation Coefficient – a) a measure of the direction and
magnitude of the linear association of two interval/ratio scale variables that ranges from
negative 1 (a perfect negative relationship) to positive 1 (a perfect positive relationship),
with 0 indicating no linear relationship. b) is the covariance of two variables (X and Y)
divided by the product of the standard deviations of the two variables (X and Y). c) is
symmetrical, in that it does not matter which variable is treated as independent and which
as dependent; the results will be exactly the same regardless. d) is invariant to changes in
location and scale, that is, variables can be transformed/standardized without changing the
correlation (multiplying one variable by a negative constant reverses only the sign). e) a
perfect correlation is usually an indication of a problem with the data, because it means
one variable is an exact linear function of the other, for example the same variable
recorded twice.
$$ r = \frac{\mathrm{cov}(xy)}{sd(x)\, sd(y)} $$
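The definition, the symmetry property, and the invariance to location and scale can be sketched in a few lines. The function names and data are hypothetical; `statistics.stdev` is the standard-library sample standard deviation.

```python
import math
import statistics

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

def covariance(a, b):
    """Sum of cross-product deviations divided by n - 1 (hypothetical helper)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    scp = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return scp / (len(a) - 1)

def pearson_r(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))

r = pearson_r(x, y)                       # about 0.8
r_swapped = pearson_r(y, x)               # symmetry: same value
x_rescaled = [100 * xi + 7 for xi in x]   # change of location and scale
r_rescaled = pearson_r(x_rescaled, y)     # invariance: same value

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
```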
Although the equation for the Correlation Coefficient does not look similar to the
Standard Deviation, it actually is if we transform the Standard Deviation equation.
$$ sd(x) = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\mathrm{var}(x)} = \frac{\mathrm{var}(x)}{sd(x)} $$
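One way to read this similarity, offered as a sketch rather than as the lecture's own derivation, is to set Y equal to X in the correlation formula. The covariance of X with itself is the variance, so the correlation of a variable with itself is 1, which is the same statement as the transformed Standard Deviation equation.

```latex
r(xx) = \frac{\mathrm{cov}(xx)}{sd(x)\, sd(x)}
      = \frac{\mathrm{var}(x)}{sd(x)\, sd(x)}
      = 1
\quad\Longleftrightarrow\quad
sd(x) = \frac{\mathrm{var}(x)}{sd(x)}
```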
The correlation coefficient can also be expressed as the mean of the products of the
standardized scores (with n - 1 rather than n in the denominator). Later, in regression,
this becomes a useful perspective because regression with standardized scores/variables is
simpler to interpret.
$$ z = \frac{X - \bar{X}}{sd(x)} $$
$$ r(xy) = \frac{1}{n - 1} \sum \left( \frac{X - \bar{X}}{sd(x)} \right) \left( \frac{Y - \bar{Y}}{sd(y)} \right) = \frac{\sum \left( \frac{X - \bar{X}}{sd(x)} \right) \left( \frac{Y - \bar{Y}}{sd(y)} \right)}{n - 1} $$
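The standardized-score form can be sketched as follows. The helper `z_scores` and the data are hypothetical; the result should match the covariance-based calculation of r for the same data.

```python
import math
import statistics

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

zx = z_scores(x)
zy = z_scores(y)

# Sum of the products of paired z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(round(r, 3))  # 0.8
```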

The Population Correlation Coefficient is symbolized as the small Greek letter ρ (rho).

$$ \rho = \frac{\mathrm{cov}(xy)}{\sigma(x)\, \sigma(y)} $$
Coefficient of Determination – a) the overall magnitude of the relationship between two
variables. b) the proportion of variation in a dependent variable that is explained by the
independent variable. c) the Pearson Correlation Coefficient squared.
$$ R^2 = r^2 $$
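A short numerical sketch with hypothetical data: r is computed here from the sums of squares and cross-products, which is equivalent to the covariance form because the n - 1 terms cancel, and then squared.

```python
import math

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

scp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 16.0
tss_x = sum((xi - mean_x) ** 2 for xi in x)                          # 20.0
tss_y = sum((yi - mean_y) ** 2 for yi in y)                          # 20.0

r = scp_xy / math.sqrt(tss_x * tss_y)  # the (n - 1) terms cancel
r_squared = r ** 2

print(round(r, 2), round(r_squared, 2))  # 0.8 0.64
```

For these data, about 64 percent of the variation in Y is accounted for by its linear relationship with X.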
Correlation t-test – a) an inferential test of whether a sample correlation is different
from the null hypothesis, which is generally a zero correlation.
$$ t = r \sqrt{\frac{n - 2}{1 - R^2}} $$
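The test statistic can be sketched for a small hypothetical sample. The statistic is compared against a t distribution with n - 2 degrees of freedom; the sketch computes only the statistic and the degrees of freedom, not a p-value.

```python
import math

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

scp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
tss_x = sum((xi - mean_x) ** 2 for xi in x)
tss_y = sum((yi - mean_y) ** 2 for yi in y)
r = scp_xy / math.sqrt(tss_x * tss_y)

# t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
t = r * math.sqrt((n - 2) / (1 - r ** 2))
df = n - 2

print(round(t, 3), df)  # 1.886 2
```

With only four observations, even a correlation of 0.8 gives a modest t value, which shows how strongly the test depends on sample size.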
Spearman’s Rank Correlation Coefficient (Spearman’s rho) – a) a nonparametric
measure of the direction and magnitude of the monotonic association of two ordinal or
interval/ratio scale variables that ranges from -1 (a perfect negative relationship) to 1 (a
perfect positive relationship), with 0 indicating no monotonic relationship. b) is the
Pearson’s Correlation Coefficient calculated on the rank order of the variables X and Y,
that is X and Y converted from interval/ratio to ordinal. c) is the covariance of the rank
order of the two variables (X and Y) divided by the product of the rank order standard
deviations of the two variables, that is the same equation as for the Pearson Correlation
Coefficient. d) if the observations have the exact same rank-order on both variables and
there are no ties between observations, then the Spearman’s coefficient will have a value
of one. e) useful in testing for nonlinear but monotonic association, because it depends
only on rank order and not on the shape of the relationship, which is one of the meanings
of the assertion that it is a nonparametric test.
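The rank-based calculation can be sketched as Pearson's coefficient applied to ranks. The helper names are hypothetical, tied values are given the average of their ranks (a common convention, assumed here rather than stated in the lecture), and the data are a hypothetical nonlinear but perfectly monotonic relationship.

```python
import math

def average_ranks(values):
    """Rank from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(a, b):
    """Pearson's r from sums of squares and cross-products."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    scp = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    tss_a = sum((ai - mean_a) ** 2 for ai in a)
    tss_b = sum((bi - mean_b) ** 2 for bi in b)
    return scp / math.sqrt(tss_a * tss_b)

def spearman_rho(a, b):
    """Pearson's r computed on the ranks of a and b."""
    return pearson_r(average_ranks(a), average_ranks(b))

# Hypothetical data: y = x cubed, nonlinear but perfectly monotonic.
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]

rho = spearman_rho(x, y)  # rank orders are identical
r = pearson_r(x, y)       # below 1 because the relationship is not a straight line
print(round(rho, 3), round(r, 3))
```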
Kendall’s Rank Correlation Coefficient (Kendall’s tau) – a) a nonparametric measure of
the direction and magnitude of the monotonic association of two ordinal or interval/ratio
scale variables that ranges from -1 (a perfect negative relationship) to 1 (a perfect
positive relationship), with 0 indicating no relationship. b) is similar to Spearman’s
Correlation Coefficient but is calculated by comparing the rank order of the variables X
and Y for all possible pairs of observations, with pairs in which the signs of the
differences in X and in Y agree considered concordant and pairs in which the signs
disagree considered discordant.
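The pairwise comparison can be sketched as below. The lecture does not say which variant of tau is intended; this sketch assumes the simplest one (often called tau-a), which divides the difference between concordant and discordant pairs by the total number of pairs and makes no adjustment for ties. The function name and data are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau_a(a, b):
    """(concordant - discordant) / number of pairs; no tie adjustment (assumed variant)."""
    concordant = 0
    discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        sign = (a2 - a1) * (b2 - b1)
        if sign > 0:
            concordant += 1   # differences in X and Y have the same sign
        elif sign < 0:
            discordant += 1   # differences in X and Y have opposite signs
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical toy data, not from the lecture.
x = [2, 4, 6, 8]
y = [1, 5, 3, 7]

tau = kendall_tau_a(x, y)  # 5 concordant pairs, 1 discordant pair, 6 pairs in total
print(round(tau, 3))  # 0.667
```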