How Variables Move Jointly: Correlation

This chapter begins a look at how two or more variables covary: that is, how higher values on one variable are associated with higher values on another, and how lower values on the two variables are also associated. The reverse situation also occurs frequently, when higher values on one variable are associated with lower values on another variable. [p. 79]

Understanding Correlation

The degree to which two variables behave in this way (that is, the way they covary) is called correlation. A familiar example is height and weight. They have what’s called a positive correlation: high values on one variable are associated with high values on the other variable. [p. 79]

Notice the figure in cell E2 of Figure 4.1. It is the correlation coefficient. The correlation coefficient, or r, can take on values that range from −1.0 to +1.0. The closer that r is to plus or minus 1.0, the stronger the relationship. When two variables are unrelated, the correlation that you might calculate between the two of them should be close to 0.0. [p. 80]

The Correlation, Calculated

Notice the formula in the formula bar shown in Figure 4.3: [p. 81]

=CORREL(A2:A13,B2:B13)

The important point to recognize is that r expresses the strength of a relationship between two variables. Notice the diagonal line in the chart in Figure 4.4. That’s called a regression line (or, in Excel terms, a trendline). [p. 84]

Note on Figure 4.4: the author uses two data points, which is the simplest case, to illustrate conceptually how covariance and correlation work. I will cover the mathematics in class, as I realize these formulas may be hard to interpret.

Moving from the Covariance to the Correlation [p. 86]

Even without Excel’s CORREL() function, it’s easy to get from the covariance to the correlation.
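For anyone who wants to check that step outside Excel, here is a minimal Python sketch. It computes the covariance and the two standard deviations, then divides the covariance by the product of the standard deviations to get r. The height and weight values are made up for illustration and are not the data from the book's figures. The function names are my own and are not Excel functions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance: the average product of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def std_dev(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))


# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

r = correlation(heights, weights)
print("covariance:", round(covariance(heights, weights), 2))
print("r:", round(r, 3))
```

The sketch divides by n in the covariance and in both standard deviations. Dividing by n − 1 in all three places would give the same r, because the divisor cancels out; the only requirement is to be consistent.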
The definitional formula for the correlation coefficient between variable x and variable y is as follows:

r = s_xy / (s_x * s_y)

In words, the correlation is equal to the covariance (s_xy) divided by the product of the standard deviation of x (s_x) and the standard deviation of y (s_y). In practice, you almost never do the actual calculations, but leave them to the Excel worksheet functions CORREL() for the correlation coefficient and COVAR() for the covariance.

Using the CORREL() Function [p. 86-88]

Housing Price vs. Buyer’s Annual Income: Please try this using the Excel spreadsheet. [p. 86-87]

Correlation coefficients can be tricky. Here are two ways they can steer you wrong:

1. There’s a strong relationship between the two variables, but the normal correlation coefficient, r, obscures that relationship. (The book's example makes the point that r measures linear, straight-line relationships, so data that follow a curve are not well described by the linear approach. There are curve-related formulas that I will show in class.)

2. There’s no strong relationship between the two variables, but one or two highly unusual observations make it seem as though there is one.

Make it a habit to create XY charts of variables that you investigate via correlation analysis. [p. 89]

Using the Analysis Tools

As shown with R, there are "correlation matrices" that can help. In the past I would use the correlation Excel add-in below, though I now use it infrequently. That said, this correlation option can be a quick way to get several correlation values. I will demo this in class, so you can skim or skip this section and resume a few pages down with "Correlation Isn't Causation".

Correlation Isn’t Causation [p. 93]

There’s an important difference between believing that one variable is related to another and believing that changes to one variable cause changes to another.
Besides the issue of the complexity of the relationships, there are two general reasons, discussed next, that you should be very careful of assuming that a correlational relationship is also causal.

A Third Variable

It sometimes happens that you find a strong correlation between two variables that suggests a causal relationship. The classic example is the number of books in school district libraries and scores on the standardized SAT exams. Suppose you found a strong correlation (say, 0.7) between the number of books per student in districts’ libraries and the average performance by those districts’ students on the SATs. A first-glance interpretation might be that the availability of a larger number of books results in more knowledge, thus better outcomes on standardized tests.

A more careful examination might reveal that communities where the annual household income is higher have more in the way of property taxes to spend on schools and their libraries. Such communities also tend to spend more on other important aspects of children’s development, such as nutrition and stable home environments. In other words, children raised in wealthier districts are more likely to score well on standardized tests. In contrast, it is difficult to argue that simply adding more books to a school library will result in higher SAT scores.

The third variable here, in addition to number of library books and SAT scores, is the wealth of the community.
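The third-variable idea can be seen in a small simulation. The Python sketch below is an illustration under my own assumptions, not the book's method or real district data. It generates a "wealth" score for each hypothetical district, then builds "books" and "SAT" values that each depend on wealth plus random noise. Books have no direct effect on SAT in this setup, and the noise level (0.65) is chosen so the books/SAT correlation lands in the neighborhood of the 0.7 used in the example. The last step uses the standard first-order partial correlation formula, a technique not covered in this reading, to show what remains once wealth is held constant.

```python
import math
import random


def correlation(x, y):
    """Pearson r computed from sums of squared and cross deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)        # fixed seed so the run is repeatable
n_districts = 2000    # hypothetical number of districts

# Simulated, standardized values: wealth drives both books and SAT scores.
wealth = [random.gauss(0, 1) for _ in range(n_districts)]
books = [w + random.gauss(0, 0.65) for w in wealth]  # no SAT term here
sat = [w + random.gauss(0, 0.65) for w in wealth]    # no books term here

r_books_sat = correlation(books, sat)
r_books_wealth = correlation(books, wealth)
r_sat_wealth = correlation(sat, wealth)

# Partial correlation of books and SAT, holding wealth constant.
partial = (r_books_sat - r_books_wealth * r_sat_wealth) / math.sqrt(
    (1 - r_books_wealth ** 2) * (1 - r_sat_wealth ** 2)
)

print("r(books, SAT):", round(r_books_sat, 2))
print("r(books, SAT | wealth):", round(partial, 2))
```

In this simulated setup the raw correlation between books and SAT scores is strong even though neither variable influences the other, and it shrinks toward zero once wealth is taken into account.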