How Variables Move Jointly: Correlation
This chapter begins a look at how two or more variables covary: that is, how higher values on one
variable are associated with higher values on another, and how lower values on the two variables are
also associated. The reverse situation also occurs frequently, when higher values on one variable are
associated with lower values on another variable. [p.79]
Understanding Correlation
The degree to which two variables behave in this way—that is, the way they covary—is called
correlation. A familiar example is height and weight. They have what’s called a positive correlation: High
values on one variable are associated with high values on the other variable. [p.79]
Notice the figure in cell E2 of Figure 4.1. It is the correlation coefficient. The correlation
coefficient, or r, can take on values that range from −1.0 to +1.0. The closer that r is to plus or
minus 1.0, the stronger the relationship. When two variables are unrelated, the correlation that you
might calculate between the two of them should be close to 0.0. [p. 80]
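To make the range of r concrete, here is a minimal Python sketch. The helper name pearson_r and the small data lists are my own, made up for illustration; they are not the values in Figure 4.1. It shows one series that moves exactly with x, one that moves exactly against it, and one with no linear trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only (not the book's data).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves exactly with x
falling = [10, 8, 6, 4, 2]   # moves exactly against x
unrelated = [3, 1, 4, 1, 3]  # no linear trend with x

r_rising = pearson_r(x, rising)        # at the top of the range
r_falling = pearson_r(x, falling)      # at the bottom of the range
r_unrelated = pearson_r(x, unrelated)  # essentially zero

print(r_rising, r_falling, r_unrelated)
```

Real data rarely sit exactly at +1.0, −1.0, or 0.0; these three lists were chosen to hit the ends and the middle of the range.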
The Correlation, Calculated
Notice the formula in the formula bar shown in Figure 4.3: [p. 81]
=CORREL(A2:A13,B2:B13)
The important point to recognize is that r expresses the strength of a relationship between two
variables. Notice the diagonal line in the chart in Figure 4.4. That’s called a regression line (or, in Excel
terms, a trendline). [p.84]
Figure 4.4: The author uses two data points, which is the simplest case, to illustrate conceptually
how covariance and correlation work.
I will cover the mathematics in class, as I realize these formulas may be hard to interpret.
Moving from the Covariance to the Correlation [p.86]
Even without Excel’s CORREL() function, it’s easy to get from the covariance to the correlation. The
definitional formula for the correlation coefficient between variable x and variable y is as follows:
r = sxy / (sx × sy)
In words, the correlation is equal to the covariance (sxy) divided by the product of the standard
deviation of x (sx) and the standard deviation of y (sy).
In practice, you almost never do the actual calculations, but leave them to the Excel worksheet
functions CORREL() for the correlation coefficient and COVAR() for the covariance.
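If you want to see the step from covariance to correlation outside of Excel, the following Python sketch follows the definitional formula. The height and weight numbers are hypothetical, and the covariance helper is my own; it is not anything from the book. The sketch also checks that you get the same r whether you divide by n (population, as Excel's COVAR() does) or by n − 1 (sample), as long as the covariance and the standard deviations use the same divisor.

```python
import statistics

def covariance(xs, ys, sample=False):
    """Average cross-product of deviations from the means.

    Divides by n (population) by default, or by n - 1 when sample=True.
    """
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    divisor = len(xs) - 1 if sample else len(xs)
    return total / divisor

# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights = [60, 62, 64, 66, 68, 70]
weights = [115, 120, 130, 140, 150, 165]

# Population version: covariance over the product of the standard deviations.
s_xy = covariance(heights, weights)
s_x = statistics.pstdev(heights)
s_y = statistics.pstdev(weights)
r_population = s_xy / (s_x * s_y)

# Sample version: the n - 1 divisors cancel, so r comes out the same.
r_sample = covariance(heights, weights, sample=True) / (
    statistics.stdev(heights) * statistics.stdev(weights)
)

print(round(r_population, 4), round(r_sample, 4))
```

Dividing by the two standard deviations is what removes the units (inches times pounds) from the covariance and puts the result on the −1.0 to +1.0 scale.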
Using the CORREL() Function [p. 86-88]
Housing Price vs. Buyer’s Annual Income: Please try this using the Excel Spreadsheet. [p. 86-87]
Correlation coefficients can be tricky. Here are two ways they can steer you wrong:

There’s a strong relationship between the two variables, but the normal correlation coefficient, r,
obscures that relationship. (The book’s example: r measures linear (straight-line) relationships, so
data that follow a curve will not work with the linear approach. There are curve-related formulas
that I will show in class.)

There’s no strong relationship between the two variables, but one or two highly unusual observations
make it seem as though there is one.

Make it a habit to create XY charts of variables that you investigate via correlation analysis. [p. 89]
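Both pitfalls are easy to reproduce with a few made-up numbers. The Python sketch below is only an illustration; the data and the pearson_r helper are my own, not the book's. In the first case y is determined exactly by x (y = x squared), yet r comes out at zero because the relationship is a curve. In the second case six points show no linear trend, and adding one extreme observation pushes r close to +1.0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Pitfall 1: a perfect but curved relationship (made-up data).
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]          # y is exactly x squared
r_curve = pearson_r(x_curve, y_curve)       # zero: r sees no straight line

# Pitfall 2: no linear trend until one unusual observation is added.
x_flat = [1, 2, 3, 4, 5, 6]
y_flat = [4, 5, 3, 3, 5, 4]
r_without_outlier = pearson_r(x_flat, y_flat)             # essentially zero
r_with_outlier = pearson_r(x_flat + [30], y_flat + [30])  # close to +1.0

print(r_curve, r_without_outlier, r_with_outlier)
```

An XY chart of either data set would show the problem at a glance, which is the reason for the charting habit recommended above.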
Using the Analysis Tools
As shown with R, there are "correlation matrices" that can help. In the past I used the Excel
correlation add-in covered in this section, though I now use it infrequently. That said, this
correlation option can be a quick way to get several correlation values at once. I will demo this in
class, so you can skim or skip this section and resume a few pages down with "Correlation Isn't
Causation".
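For reference, a correlation matrix is just the correlation of every variable with every other variable, laid out as a grid. The Python sketch below is a rough illustration under my own assumptions: the column names and numbers are invented (loosely echoing the housing price and income example), and correlation_matrix is a hypothetical helper, not the Excel add-in's method.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {row_name: {col_name: r}} for every pair of columns."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }

# Invented values, in thousands of dollars and in years, for illustration only.
data = {
    "price": [200, 250, 300, 350, 400],
    "income": [50, 60, 80, 85, 100],
    "house_age": [30, 12, 25, 8, 15],
}

matrix = correlation_matrix(data)

# Print the grid: the diagonal is 1.0 and the matrix is symmetric.
for row_name, row in matrix.items():
    print(row_name, [round(value, 2) for value in row.values()])
```

Because the grid is symmetric, tools often show only the lower or upper triangle.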
Correlation Isn’t Causation [p. 93]

There’s an important difference between believing that one variable is related to another and
believing that changes to one variable cause changes to another.
Besides the issue of the complexity of the relationships, there are two general reasons, discussed
next, that you should be very careful of assuming that a correlational relationship is also causal.
A Third Variable


It sometimes happens that you find a strong correlation between two variables that suggests a
causal relationship. The classic example is the number of books in school district libraries and scores
on the standardized SAT exams. Suppose you found a strong correlation—say, 0.7—between the
number of books per student in districts’ libraries and the average performance by those districts’
students on the SATs. A first-glance interpretation might be that the availability of a larger number
of books results in more knowledge, thus better outcomes on standardized tests.
A more careful examination might reveal that communities where the annual household income is
higher have more in the way of property taxes to spend on schools and their libraries. Such
communities also tend to spend more on other important aspects of children’s development, such
as nutrition and stable home environments. In other words, children raised in wealthier districts are
more likely to score well on standardized tests. In contrast, it is difficult to argue that simply adding
more books to a school library will result in higher SAT scores. The third variable here, in addition to
number of library books and SAT scores, is the wealth of the community.
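The third-variable idea can be sketched with simulated data. Everything in the Python block below is an assumption made for illustration: the numbers are randomly generated, not real district data, and the setup (wealth drives both books and SAT scores, while books have no direct effect on scores) is built in by construction. The last step uses the standard first-order partial correlation formula, which is not part of this chapter's reading, to show what happens once wealth is held constant.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated, standardized "districts". Wealth feeds both variables;
# books never enter the formula for SAT scores.
rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
wealth = [rng.gauss(0, 1) for _ in range(n)]
books = [w + rng.gauss(0, 1) for w in wealth]
sat = [w + rng.gauss(0, 1) for w in wealth]

r_books_sat = pearson_r(books, sat)        # clearly positive
r_books_wealth = pearson_r(books, wealth)
r_sat_wealth = pearson_r(sat, wealth)

# Partial correlation of books and SAT, holding wealth constant.
partial_r = (r_books_sat - r_books_wealth * r_sat_wealth) / math.sqrt(
    (1 - r_books_wealth ** 2) * (1 - r_sat_wealth ** 2)
)

print(round(r_books_sat, 2), round(partial_r, 2))
```

In this simulation the raw correlation between books and SAT scores is substantial, while the partial correlation is close to zero, which is what you would expect when a third variable accounts for the relationship.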