Correlation

Association, Measures of – a) a measure of the degree of relationship between two variables. b) understanding the degree of association between two variables may allow one variable to be estimated from a value of the other variable.

The relationship between two categorical variables (with two or more categories each) can be assessed using a Contingency Table/Cross Tabulation and a Chi-square test. The relationship between a categorical independent variable (with two categories) and a categorical dependent variable (with two categories) can be assessed using a Difference of Proportions test. Similarly, the relationship between a categorical independent variable (with two categories) and an interval/ratio dependent variable can be assessed using a Difference of Means test. The relationship between a categorical independent variable (with two or more categories) and an interval/ratio dependent variable or a categorical dependent variable (with two categories) can be assessed using One-way Analysis of Variance. The relationship between two interval/ratio variables can be assessed using a Scatterplot and the Correlation Coefficient.

Scatterplot – a) a mathematical diagram using Cartesian coordinates, with X (the independent variable) on the horizontal axis and Y (the dependent variable) on the vertical axis, to display the values of two variables for a set of data. b) a visual representation of the direction (rising is positive, falling is negative) and strength (a strong correlation is tightly clustered along a line; in a perfect correlation a straight line would pass through every point) of the linear correlation. c) can also reveal nonlinear association. While scatterplots are useful for the visual assessment of correlation, a statistically significant correlation can still look like a random pattern.

Cross-product Deviations – a) the product of the deviations of two variables from their means. b) the contribution of each observation to the direction and strength of the association.
$$CP(xy) = (X - \bar{X})(Y - \bar{Y})$$

Covariation (Sum of Cross-Product Deviations) – a) the sum of the products of the joint deviations of the individual observations of X and Y from their respective means. b) the aggregate association between X and Y. c) if there is no association between X and Y the Covariation will be zero; if the association is positive the Covariation will be positive, and if the association is negative the Covariation will be negative. The strength of the association, however, cannot be evaluated from the Covariation.

$$SCP(xy) = \sum (X - \bar{X})(Y - \bar{Y})$$

For a particular observation, if both X and Y are greater than their respective means, then its cross-product deviation is greater than zero and it adds to the SCP.

If $X > \bar{X}$ and $Y > \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) > 0$

The same holds true if both X and Y are less than their respective means.

If $X < \bar{X}$ and $Y < \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) > 0$

But if X is greater than its mean while Y is less than its mean, or vice versa, the cross-product deviation is less than zero and it subtracts from the SCP.

If $X > \bar{X}$ and $Y < \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) < 0$

If $X < \bar{X}$ and $Y > \bar{Y}$ then $(X - \bar{X})(Y - \bar{Y}) < 0$

The Total Sum of Squares (TSS) in the Standard Deviation equation can be viewed as the Covariation of a variable with itself.

$$TSS(x) = \sum (X - \bar{X})^2 = \sum (X - \bar{X})(X - \bar{X})$$

Covariance – a) the sum of the cross-product deviations divided by the number of cases less one. b) the average amount that the paired observations of X and Y covary.

$$cov(xy) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

The Covariance is so named because of its similarity to the Variance.

$$var(x) = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{n - 1}$$

The difficulties of interpreting the Variance that arise from its units being the (squared) units of the variable are compounded with the Covariance, because its units are a combination of the units of X and Y; that is, the size of the Covariance is a function of the standard deviations of X and Y.
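The definitions of the cross-product deviations, the Covariation and the Covariance can be checked with a few lines of Python. This is only a minimal sketch: the data values and the function names (`cross_products`, `covariation`, `covariance`) are made up for illustration and are not part of these notes.

```python
# Hypothetical paired observations, chosen only for illustration.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

def mean(values):
    return sum(values) / len(values)

def cross_products(xs, ys):
    """Cross-product deviation (X - mean of X)(Y - mean of Y) per observation."""
    mx, my = mean(xs), mean(ys)
    return [(a - mx) * (b - my) for a, b in zip(xs, ys)]

def covariation(xs, ys):
    """Sum of the cross-product deviations, SCP(xy)."""
    return sum(cross_products(xs, ys))

def covariance(xs, ys):
    """SCP(xy) divided by the number of cases less one."""
    return covariation(xs, ys) / (len(xs) - 1)

print(covariation(x, y))   # 16.0 (positive, so the association is positive)
print(covariance(x, y))    # 4.0
print(covariance(x, x))    # 10.0, the variance of x

# Re-expressing X in units 100 times smaller multiplies the covariance by 100,
# even though the underlying association has not changed.
x_rescaled = [100 * value for value in x]
print(covariance(x_rescaled, y))   # 400.0
```

The last line illustrates why the Covariance alone cannot be used to judge the strength of an association: its size changes with the units of measurement.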
Pearson’s Product-Moment Correlation Coefficient – a) a measure of the direction and magnitude of the linear association of two interval/ratio scale variables that ranges from -1 (a perfect negative relationship) to 1 (a perfect positive relationship), with 0 indicating no linear relationship. b) is the covariance of two variables (X and Y) divided by the product of the standard deviations of the two variables (X and Y). c) is symmetrical, in that it does not matter which variable is treated as independent and which as dependent; the result will be exactly the same. d) is invariant to changes in location and scale, that is, variables can be linearly transformed/standardized without changing the magnitude of the correlation. e) a perfect correlation is usually an indication of a problem with the data, because one variable is then an exact linear function of the other (for example, the same variable entered twice).

$$r = \frac{cov(xy)}{sd(x)\,sd(y)}$$

Although the equation for the Correlation Coefficient does not look similar to the Standard Deviation, it actually is if we transform the Standard Deviation equation.

$$sd(x) = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{var(x)} = \frac{var(x)}{sd(x)}$$

The correlation coefficient can also be expressed as the mean (using n - 1) of the products of the standardized scores. Later, in regression, this becomes a useful perspective because regression with standardized scores/variables is simpler to interpret.

$$z = \frac{X - \bar{X}}{sd(x)}$$

$$r(xy) = \frac{1}{n - 1} \sum \left(\frac{X - \bar{X}}{sd(x)}\right)\left(\frac{Y - \bar{Y}}{sd(y)}\right) = \frac{\sum \left(\frac{X - \bar{X}}{sd(x)}\right)\left(\frac{Y - \bar{Y}}{sd(y)}\right)}{n - 1}$$

The Population Correlation Coefficient is symbolized by the lowercase Greek letter ρ (rho).

$$\rho = \frac{cov(xy)}{\sigma(x)\,\sigma(y)}$$

Coefficient of Determination – a) the overall magnitude of the relationship between two variables. b) the proportion of variation in a dependent variable that is explained by the independent variable. c) the Pearson Correlation Coefficient squared.

$$R^2 = r^2$$

Correlation t-test – a) an inferential test of whether a sample correlation is different from the value stated in the null hypothesis, which is generally a zero correlation.
$$t = r \sqrt{\frac{n - 2}{1 - R^2}}$$

Spearman’s Rank Correlation Coefficient (Spearman’s rho) – a) a nonparametric measure of the direction and magnitude of the monotonic association of two ordinal or interval/ratio scale variables that ranges from -1 (a perfect negative relationship) to 1 (a perfect positive relationship), with 0 indicating no relationship. b) is the Pearson’s Correlation Coefficient calculated on the rank order of the variables X and Y, that is, X and Y converted from interval/ratio to ordinal. c) is the covariance of the rank order of the two variables (X and Y) divided by the product of the rank order standard deviations of the two variables, that is, the same equation as for the Pearson Correlation Coefficient. d) if the observations have exactly the same rank order on both variables and there are no ties between observations, then the Spearman’s coefficient will have a value of one. e) useful in testing for nonlinear (but monotonic) association, which is one of the meanings of the assertion that it is a nonparametric test.

Kendall’s Rank Correlation Coefficient (Kendall’s tau) – a) a nonparametric measure of the direction and magnitude of the monotonic association of two ordinal or interval/ratio scale variables that ranges from -1 (a perfect negative relationship) to 1 (a perfect positive relationship), with 0 indicating no relationship. b) is similar to Spearman’s Correlation Coefficient but is calculated by comparing the rank order of the variables X and Y for all possible pairs of observations, with pairs in which the signs of the differences on X and on Y agree considered concordant and pairs in which the signs do not agree considered nonconcordant (discordant).
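The three coefficients and the t statistic can be compared on a small example. The sketch below is illustrative only: the data and function names are hypothetical, the ranking helper assumes there are no tied values, and the Kendall calculation uses the simple form (concordant pairs minus discordant pairs, divided by the total number of pairs), which is an assumption since no equation for tau is given in these notes.

```python
import math
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """cov(xy) / (sd(x) sd(y)); the n - 1 terms cancel, leaving SCP / sqrt(TSSx * TSSy)."""
    mx, my = mean(xs), mean(ys)
    scp = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    tss_x = sum((a - mx) ** 2 for a in xs)
    tss_y = sum((b - my) ** 2 for b in ys)
    return scp / math.sqrt(tss_x * tss_y)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); undefined when r is exactly 1 or -1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def ranks(values):
    """Rank 1 for the smallest value upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Pearson's r calculated on the rank orders of X and Y."""
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """(concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: Y rises with X, but along a curve (Y = X cubed).
x = [1, 2, 3, 4, 5]
y_curved = [1, 8, 27, 64, 125]
r = pearson_r(x, y_curved)
print(r, t_statistic(r, len(x)))   # r is below 1 because the relationship is not linear
print(spearman_rho(x, y_curved))   # 1.0, the rank orders agree exactly
print(kendall_tau(x, y_curved))    # 1.0, every pair is concordant

# Hypothetical data with two adjacent pairs swapped in rank order.
y_swapped = [1, 3, 2, 5, 4]
print(spearman_rho(x, y_swapped))  # 0.8
print(kendall_tau(x, y_swapped))   # 0.6 (8 concordant pairs, 2 discordant, 10 in total)
```

On the curved data both rank coefficients equal one while Pearson's r does not, which is the sense in which the rank-based measures are useful for nonlinear association.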