Correlation and Regression
Geography 450, Urban Research
Elvin Wyly

“To avoid falling for the post hoc fallacy and thus wind up believing many things that are not so, you need to put any statement of relationship through a sharp inspection. The correlation, that convincingly precise figure that seems to prove that something is because of something, can actually be any of several types.”1

“The correlation coefficient is the most commonly seen measure of association between two variables. It is often denoted r or R, and sometimes by the Greek r, ρ (rho). ... the correlation coefficient, R, is not always a sufficient summary of association, but it is useful and often used. The fact is that no ideal summary numbers exist.”2

Suppose we’re doing a study of working-class housing in the Vancouver region, and we’re interested in the circumstances of people who live as renters in mobile homes. We have survey responses from a sample of people, and two of the questions deal with the total monthly rent and the total household income. The survey responses are listed in Table 1. What is the relationship between total monthly rent and total household income? Does rent co-vary with income? In other words, is there a correlation between these two measures?

1 Darrell Huff (1954). How to Lie With Statistics. New York: W.W. Norton, p. 89.
2 Loren Haskins and Kirk Jeffrey (1990). Understanding Quantitative History. Cambridge, MA: MIT Press, p. 234.

Table 1. Rent and Income for a Sample of Renters in Mobile Homes, Vancouver CMA, 2001.

Gross monthly rent | Total household income
1033 | 60000
179  | 38595
850  | 34267
608  | 23071
413  | 34300
710  | 50165
850  | 29064
726  | 61506
350  | 45382
99   | 33501
425  | 59000
825  | 32804
792  | 28688
1192 | 38513
718  | 43411
99   | 16864
99   | 12000
99   | 24312
1300 | 46624
560  | 20608
99   | 27214

Data Source: Statistics Canada (2005). 2001 Census, Public Use Microdata File (PUMF), households and housing file. Ottawa: Statistics Canada.

When two or more things co-vary with one another, they share variance. If households with higher incomes tend to have higher monthly rents, and if those with lower incomes tend to also have lower rents, then these two variables have a positive covariance. If the opposite held -- if households with low incomes tended to have higher rents and those with high incomes had lower rents -- then the two measures would have a negative or inverse covariance. Given everything that we know about household finances and housing markets in this society, in this region, at this point in time, we would not expect to observe negative covariance between income and rent for renters living in mobile homes. It would be logical to anticipate some kind of positive covariance.

But what is covariance? Recall that variance is one of the measures of the ‘spread’ of a set of numerical scores. Take the difference between each observation and its mean, square the result, add up all the squared deviations, and then divide by the number of observations to obtain the mean squared deviation:

$$ s^2 = \frac{\sum (X - \bar{X})^2}{n} $$

In some textbooks you’ll see an equation that is just a little bit different:

$$ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} $$

Statistical purists emphasize that when calculating variance for a sample, the denominator should be n − 1 rather than n in order to provide an unbiased estimate. This adjustment doesn’t make much of a difference when n is large, but it does matter a great deal if you’re working with a small sample.
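To make the arithmetic concrete, here is a small worked example with invented scores (not the survey data). For the four scores 2, 4, 6, and 8, the mean is 5, so:

$$ s^2 = \frac{(2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2}{4} = \frac{9 + 1 + 1 + 9}{4} = 5 $$

With the sample denominator of n − 1, the same scores give 20/3 ≈ 6.67 -- a noticeably larger figure, precisely because n is so small.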
Covariance is the product of the differences, for two separate variables, of each score from its mean value. Instead of multiplying a score’s deviation from the mean by itself (that is, squaring it), we multiply the deviation by the corresponding deviation from the mean for another variable. For variables X and Y, then, covariance is calculated as

$$ \mathrm{COV}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} $$

As in the case of the equation for variance, when you’re working with a sample, the equation is

$$ \mathrm{COV}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} $$

In both of these equations, however, covariance is affected by the scale of measurement of the two variables. In our case, rent is measured on a scale that varies by several hundred dollars, while the range for income is many thousands of dollars. If we multiply the denominator by the product of the standard deviations of the two variables, we can effectively standardize the covariance. This creates a ratio that will always range between -1.0 and +1.0, no matter what the measurement scale of the original variables (kilometers, liters, thousands of dollars, etc.). The only restriction is that the variables must be measured on an interval or ratio scale. The standardized covariance is known as the correlation coefficient:

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_X s_Y} $$

The correlation coefficient is often called Pearson’s r, or Pearson’s product-moment correlation coefficient. Karl Pearson developed this measure in 1895, as part of a series of breakthroughs in measurement, probability theory, and the assessment of “goodness of fit” between observed patterns and expectations derived either from a priori theory or an assumed benchmark of pure, random variation.3

3 For a fascinating history, see M. Eileen Magnello (1999). “The Non-Correlation of Biometrics and Eugenics: Rival Forms of Laboratory Work in Karl Pearson’s Career at University College London, Part 1.” History of Science 37, 79-106, especially p. 96.

If this equation looks a bit cumbersome or complicated, just keep in mind that expressing a score in terms of standard deviations from the mean is the same as taking a z-score. So the correlation coefficient can also be calculated as

$$ r = \frac{\sum z_X z_Y}{n - 1} $$

Table 2 shows the calculations for the variance, covariance, and then the correlation coefficient for our small sample of renters in mobile homes in the Vancouver region. The covariance is positive -- as we expected it would be -- and the correlation coefficient is also positive. Correlation coefficients range between -1.0 and +1.0. If two variables have no relationship whatsoever, the correlation will be close to zero. Two variables that approach “perfect” positive correlation will have a coefficient close to +1.0. Two variables that approach perfect negative correlation will have a coefficient near -1.0.

If we take the square of the correlation coefficient, we obtain the coefficient of determination, r². The coefficient of determination ranges from 0 to 1.0, and it has a more interesting and valuable property: r² measures the proportion of variance that two variables share. For our example, r² is 0.2021. This means that 20.21 percent of the variance in monthly rents for mobile home renters in the Vancouver region can be associated with the variance in total household income.
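Before working through the full worksheet in Table 2, it may help to trace the whole chain on the same invented scores used above. Take X = {2, 4, 6, 8} and Y = {1, 2, 4, 5}, with means of 5 and 3, and sample standard deviations of about 2.58 and 1.83:

$$ \mathrm{COV}(X, Y) = \frac{(-3)(-2) + (-1)(-1) + (1)(1) + (3)(2)}{4 - 1} = \frac{14}{3} \approx 4.67 $$

$$ r = \frac{14}{(4 - 1)(2.58)(1.83)} \approx 0.99 $$

Dividing by the standard deviations rescales the hard-to-interpret covariance of 4.67 into a coefficient with a fixed range: these two invented variables sit very close to perfect positive correlation.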
Table 2. Calculating the Variance, Covariance, and Correlation.

(1) Gross monthly rent | (2) Difference from mean | (3) Squared difference | (4) Total household income | (5) Difference from mean | (6) Squared difference | (7) Column 2 × Column 5
1033 |  460 | 211907 | 60000 |  23815 | 567145153 |  10962751
179  | -394 | 154973 | 38595 |   2410 |   5807182 |   -948662
850  |  277 |  76914 | 34267 |  -1918 |   3679455 |   -531978
608  |   35 |   1248 | 23071 | -13114 | 171981992 |   -463368
413  | -160 |  25493 | 34300 |  -1885 |   3553943 |    301002
710  |  137 |  18860 | 50165 |  13980 | 195435074 |   1919894
850  |  277 |  76914 | 29064 |  -7121 |  50711354 |  -1974943
726  |  153 |  23511 | 61506 |  25321 | 641143395 |   3882524
350  | -223 |  49580 | 45382 |   9197 |  84581305 |  -2047823
99   | -474 | 224360 | 33501 |  -2684 |   7204879 |   1271412
425  | -148 |  21805 | 59000 |  22815 | 520515534 |  -3368987
825  |  252 |  63672 | 32804 |  -3381 |  11432449 |   -853187
792  |  219 |  48107 | 28688 |  -7497 |  56207865 |  -1644384
1192 |  619 | 383574 | 38513 |   2328 |   5418697 |   1441690
718  |  145 |  21122 | 43411 |   7226 |  52212323 |   1050151
99   | -474 | 224360 | 16864 | -19321 | 373308401 |   9151804
99   | -474 | 224360 | 12000 | -24185 | 584923438 |  11455719
99   | -474 | 224360 | 24312 | -11873 | 140972652 |   5623935
1300 |  727 | 529014 | 46624 |  10439 | 108968744 |   7592494
560  |  -13 |    160 | 20608 | -15577 | 242648863 |    197311
99   | -474 | 224360 | 27214 |  -8971 |  80482259 |   4249354

Mean: 573 (rent), 36185 (income)
Variance: 141433 (rent), 195416748 (income)
Standard deviation: 376 (rent), 13979 (income)
A. Sum of the products (Column 7): 47266707.3
B. A divided by n − 1 is the covariance: 2363335.37
C. (n − 1) × (product of the two standard deviations): 105144366
A divided by C is the correlation coefficient: 0.4495

Assessing the Significance of r

If we’re working with sample data, we know that our results will be different if we draw a different random sample. Correlation coefficients, like means, ratios, and other parameters, are subject to random sampling variability. If we find a particular correlation coefficient in our sample (r), how can we know whether r is just the product of chance, random sampling variability? Perhaps the r we observe is just random, chance variation that would lead us to believe there is a relationship when in fact the true, population correlation coefficient (ρ) is actually zero. If we can safely assume that the data for each variable come from a population distribution that is normal, and if we can safely assume that observations are independent -- that is, that one observation for x does not affect the other observations of x, and the same holds for y -- then we can use a t-test to evaluate the significance of a sample correlation coefficient:

$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

If the null hypothesis is correct -- if, in fact, there is no true correlation in the population -- then this statistic will follow a Student’s t distribution, with n − 2 degrees of freedom. This means that even if there is no true correlation in the population, if we were to draw repeated random samples and calculate correlation coefficients for each sample, there would be a sampling distribution something like the one shown in the figure below. Most of the sample correlation coefficients would cluster fairly close to the true zero population correlation. But in a small number of cases -- the “tails” -- we would obtain coefficients very far away from zero. The shape of this distribution depends on the degrees of freedom -- the number of sampled observations minus two (to adjust for the calculation of standard deviations from two different variables). So we calculate the t statistic using the formula above, and then look up the critical values of the t distribution in an appendix of any standard statistics textbook.

The Distribution of Pearson’s r. Source: Perry R. Hinton (1995). Statistics Explained. New York: Routledge, p. 261.

For our example, the formula yields a t value of 2.19.
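Plugging our sample values (r = 0.4495, n = 21) into the formula:

$$ t = \frac{0.4495 \sqrt{21 - 2}}{\sqrt{1 - 0.2021}} = \frac{0.4495 \times 4.359}{0.893} \approx 2.19 $$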
For df = 19 (our sample of 21 households minus 2), a table of “Critical Values of the t Distribution” indicates that in ninety-five percent of all random samples drawn when the population correlation coefficient is zero, the t statistic will be between -2.093 and +2.093. Since our t value is outside this range, we can reject the null hypothesis. We do have sufficient evidence to conclude that there is a statistically significant correlation between the monthly rents paid and the total household incomes of households living in mobile homes in the Vancouver region.

Correlation in Stata

Fortunately, we don’t have to go through all the tedious calculations that Karl Pearson (or, to be much more accurate, Karl Pearson’s many hardworking assistants) had to do in the 1890s. Make sure the 2001 Census of Canada PUMF is located in your c:\data\pumf directory, and then open Stata and issue the following commands:

set memory 200m
use "c:\data\pumf\2001hh.dta"
corr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

The “corr” command asks for an analysis of the correlation between grosrth (monthly gross rent) and totinch (total household income). The specifications after the “if” narrow the analysis to households in the Vancouver CMA (cmah 933) who are renters (tenurh 2) and who live in dwellings classified as “mobile home or other movable dwelling” (dtypeh 8); finally, the analysis excludes households with annual incomes of less than $10,000 or more than $70,000. After you submit these commands, your screen should look something like this:

[Stata output: correlation matrix for grosrth and totinch]

The correlation between grosrth and totinch is 0.4495 for all of the households who meet the criteria in that command. The figure of 0.4495 is precisely what we calculated in the worksheet shown in Table 2. If you would like to request a t-test for the significance of the correlation coefficient, the command is a little bit different:

pwcorr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000, sig

which gives you this:

[Stata output: pairwise correlation with significance level]

The figure below the correlation coefficient -- the 0.0409 below the 0.4495 -- indicates that, given the sample size we’re working with, random sampling variability means that about 4 percent of the time a random sample will yield a correlation this large even when the correlation in the population is actually zero. Any correlation with a probability below 0.05 is usually regarded as “statistically significant” -- meaning that it probably did not occur solely through chance, random sampling variability.

Statistical significance is not the same as practical significance, however. As sample sizes increase, even very small correlation coefficients will yield t statistics that lie in the extreme ranges of the tails of the t distribution. This means that analysts who are working with small sample sizes tend to “accept” correlation coefficients as meaningful if they pass a t test at P<0.10, while analysts working with extremely large sample sizes will focus on the magnitude of the correlation coefficient itself -- say, above 0.50 or 0.75. Keep in mind that our small example yields a correlation coefficient of 0.4495, and so the squared correlation -- the coefficient of determination -- is only 0.2021. Only a fifth of the variance in rent levels can be associated with the variance in total household income. This is not a strong relationship at all.
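If you’d like to verify the significance test by hand, Stata’s display command and its ttail() function (the upper-tail probability of the t distribution) will reproduce these numbers. This is just a sketch using the rounded figures from our example:

* t statistic for r = 0.4495 with n = 21 (so df = 19)
display 0.4495 * sqrt(19) / sqrt(1 - 0.4495^2)
* two-tailed probability of a t at least this extreme with 19 degrees of freedom
display 2 * ttail(19, 2.1934)

The second command returns approximately 0.041, matching the significance level reported by pwcorr (any small discrepancy comes from rounding r).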
Assessing Correlation with Scatter Diagrams

Thus far, we’ve considered the relations between our two variables in terms of variance and covariance. But we can get a simpler and more intuitive view of the ideas behind correlation if we take the data in Table 1 and draw a scatter diagram of the households. Issue the following command:

twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

[Scatterplot: monthly gross rent (vertical axis, 0 to 1500) against total household income (horizontal axis, $10,000 to $60,000)]

This is an immediately intuitive confirmation of a positive -- but weak -- relationship between total household income and monthly gross rent for our sample of households. In fact, the relationship is even weaker than it appears. Since I couldn’t bring myself to work all the way through the calculations for a pathetically small correlation coefficient, the Stata commands we’ve been using in this example have excluded households with incomes lower than $10,000 per year, and a small number of households with incomes over $70,000. Get rid of these restrictions and draw the scatterplot again:

twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8

[Scatterplot: monthly gross rent against total household income, with incomes now ranging from $0 to $100,000]

Here, the relationship is even weaker. The correlation coefficient for this graph is 0.0981, which means that only 0.96 percent -- less than one percent -- of the variance in monthly rents can be associated with variance in total household income. Given the sample size and the small value of the coefficient, a t test fails to reject the null hypothesis -- and so we cannot have confidence that the observed correlation is anything more than a random sampling fluctuation from a zero correlation in the population.

This relationship is a little bit stronger in some places, however. Try these two commands to explore the relations in the Edmonton metropolitan area:

twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8
pwcorr grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8, sig

[Scatterplot: monthly gross rent against total household income for Edmonton, with one household at an income of about $135,000]

The scatter diagram seems to show a somewhat stronger relationship, and the pwcorr command -- “pairwise correlation” -- yields a coefficient of 0.4340, implying a coefficient of determination that can account for about nineteen percent of the variance in gross rent values among mobile-home renters. But the sample size is pretty small -- only 18 sampled households -- and thus there is a 0.0719 probability level attached to the t statistic. It’s a judgment call as to how much confidence to place in this correlation. Indeed, if we are suspicious about that one sample household in the upper right-hand corner -- if there’s any reason to believe that there is something fundamentally unique or un-generalizable about this household with an income of about $135,000 living in a rented mobile home -- then we might make a case for eliminating this “outlier.” We can do this by editing the command like this:

twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8 & totinch < 130000

And we get the scatterplot that appears below. If we also issue the pwcorr command with the same restriction, we see that the scatter indicates no relationship whatsoever, and the correlation coefficient -- 0.0463 -- implies that only about two-tenths of one percent of the variance in rent levels can be associated with the total household income of renters living in mobile homes in Edmonton.
Again, whether it makes sense to exclude the “outlier” household is a subjective judgment call -- one that would be guided by sifting through the data to explore other characteristics of this household, in an attempt to draw a conceptual inference. (For example, perhaps this household is a middle-class family living temporarily in a mobile home while their new, custom home is under construction; we could make a case that this household is different from most other renters living in mobile homes.)

[Scatterplot: monthly gross rent (0 to 1000) against total household income for Edmonton, with the high-income outlier excluded]

You should always draw scatter diagrams when exploring correlations among variables. Scatter diagrams are often the best way to begin your inquiry. This is true even if you’ll eventually calculate correlations and publish the results in a table, as Martin Danyluk and David Ley did when they correlated neighborhood-level gentrification in Vancouver, Toronto, and Montreal with the proportion of workers commuting to work by various means.

Danyluk and Ley’s Correlation Analysis. Source: Martin Danyluk and David Ley (2007). “Modalities of the New Middle Class: Ideology and Behavior in the Journey to Work from Gentrified Neighbourhoods in Canada.” Urban Studies 44(11), 2195-2210.

Scatter diagrams are also essential in detecting non-linear relationships. The correlation coefficient measures the strength of a linear relationship between two variables -- and thus it is entirely possible to obtain weak correlation coefficients for relations that are strong but non-linear. A “tight” scatter of observations along something that looks roughly like a straight line will yield a very large correlation coefficient -- approaching +1.0 if the slope is upward to the right, or approaching -1.0 if the slope is downward to the right (compare the left and right scatters of the top panel in the figure below). Conventionally, the vertical axis Y in a correlation analysis is referred to as the dependent variable, and the horizontal axis X is described as the independent variable.

Scatter Diagrams for Various Kinds of Relations between Two Variables. Source: Martin Bland (2005). Clinical Biostatistics, Lecture Notes, Week 7. Toronto: Department of Health Sciences, York University. Available at http://www-users.york.ac.uk/~mb55/msc/clinbio/week7/corr.htm
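A quick way to convince yourself of this point is to manufacture a relation that is perfectly deterministic but non-linear, and then ask Stata for the correlation. This sketch uses invented data (the variable names x and y are arbitrary), not the PUMF:

clear
set obs 41
generate x = _n - 21     // x runs from -20 to 20
generate y = x^2         // a perfect, but non-linear, relation
correlate y x            // r is essentially zero
twoway scatter y x       // the scatter reveals the strong curvilinear pattern

Even though y is completely determined by x, the correlation coefficient comes out at zero, because the relation is a symmetric curve rather than a straight line. Only the scatter diagram reveals what is really going on.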
Regression

If a correlation coefficient is useful to evaluate the strength of a relationship, and if scatter diagrams are useful to convey this information visually, these approaches still leave important questions unanswered: if there is a relationship, what form does it take? To use our example of mobile home renters, how much does monthly rent increase with each unit change in total household income? Answering questions like these requires a technique known as regression.4 Perry Hinton distinguishes correlation from regression this way: “A linear correlation tells us how close the relationship between two variables is to a straight line. A linear regression is the straight line that best describes the linear relationship between the two variables.”5

4 Why is it called “regression”? The word comes from the Latin regredi, “to go back,” and was used by nineteenth-century researchers to describe a phenomenon known as ‘reversion to the mean.’ Francis Galton, in a series of studies of the heredity of height and other physical characteristics, observed that very tall people tended to have children shorter than themselves (i.e., closer to the average), while very short parents tended to have children who were taller than themselves.
5 Perry Hinton (1995). Statistics Explained. New York: Routledge, p. 262.

Let’s return to our sample of Vancouver-area mobile home renters:

[Scatterplot: monthly gross rent (0 to 1500) against total household income ($10,000 to $60,000) for Vancouver-area mobile home renters]

To describe the straight line that would achieve the “best fit” with these points, we only need to know a few pieces of information. The equation for the straight line would relate the dependent variable (monthly gross rent, Y) to variation in the independent variable (total household income, X) -- while also specifying the point where the line would intersect the vertical axis. In other words, the equation for our line would take the form

$$ Y = a + bX $$

where Y is the value of the dependent variable, a is the value of the vertical axis where the line intersects it (i.e., where X is equal to zero), and b is the slope coefficient that relates changes in units of X to corresponding changes in the value of Y. Regression involves finding the values of a and b that achieve the “best fit” of a line to the scatter of points. Achieving the best fit requires minimizing the sum of the squared deviations of the dependent variable, Y. This simple approach is often labeled “ordinary least squares” or OLS regression.

The Line of Best Fit. Source: Peter J. Taylor (1977). Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press, p. 198.

The sum of squares is at a minimum when

$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} $$

Look carefully at the numerator in this equation. It is identical to the numerator in the equation for the correlation coefficient. The equation expresses the ratio between the joint variation of X and Y and the variation of X with itself (i.e., the sum of the squared deviations). Once we’ve figured out b, then a can be calculated as

$$ a = \bar{Y} - b\bar{X} $$

The figures in the worksheet in Table 2 can be used to calculate a and b for this small sample; we obtain b = 0.0121 and a = 135.05. The line of best fit crosses the vertical axis at Y = $135.05 gross monthly rent, and each one-unit increase on the X axis (i.e., one dollar of total household income) yields a corresponding increase in rent of 0.0121. Since the units for the variables are so different, it might help to express the change in rent associated with, say, an increase of $10,000 in total household income: this is associated with an increase in rent of $121.
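These values come straight from the worksheet in Table 2. The numerator of b is the sum of the products (A = 47266707.3), and the denominator is the sum of the squared income deviations, which equals (n − 1) times the income variance:

$$ b = \frac{47266707.3}{20 \times 195416748} \approx 0.0121 $$

$$ a = \bar{Y} - b\bar{X} \approx 572.67 - (0.0121 \times 36185) \approx 135 $$

(Using the unrounded means and slope carried in the worksheet yields the slightly more precise intercept of 135.05.)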
In Stata, issue the following command:

regress grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

and your results panel will give you this:

[Stata output: regression of grosrth on totinch]

Notice three parts of the output. First, the “R-squared” value in the upper-right corner is 0.2021, which is the coefficient of determination we calculated earlier -- also equivalent to the squared value of the correlation coefficient. About 20.2 percent of the variance in monthly rent levels can be associated with the variance in total household income among mobile home renters in the Vancouver metropolitan area in 2001. The “Adj R-squared” value takes the degrees of freedom into consideration, such that an analysis with comparatively few observations will be ‘penalized’ with a lower coefficient of determination. Second, note the “Coef.” column in the lower-left corner. The coefficient for totinch is 0.0120938, which is our b value, sometimes called a “beta coefficient.” Third, the coefficient for “_cons” is Stata’s way of labeling the intercept, which is also sometimes called the “constant.”

Notice that Stata also provides a column labeled “Std. Err.” When the scatter diagram is diffuse, our line of best fit will provide rather unreliable estimates for the dependent variable. There will be large differences between the line of best fit -- the line of Y values predicted with that Y = a + bX equation -- and the actual values for each sampled household. Notice the right-hand side of the graph, where household income is about $60,000: there are three sample households, with rents ranging from less than $500 to more than $1,000. This introduces considerable uncertainty. The difference between the observed value and the model-predicted value for each observation is known as a residual. If we calculate the residuals for all the observations, they will have their own mean and standard deviation. The standard deviation of the residuals is known as the standard error of the estimate. We can use the standard error of the estimate to calculate a t statistic for the beta coefficient, to test the null hypothesis that the slope in the population is zero, signifying no relationship. In our example, the t-test yields a probability of 0.041, indicating that we can be more than 95 percent confident that the coefficient in the population is not zero. There does seem to be a relationship, although it is a weak one.
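If you would like to inspect the residuals directly, Stata’s post-estimation predict command will generate them after any regress command. A minimal sketch (rhat and resid are arbitrary new variable names invented here):

* after running the regress command above:
predict rhat, xb            // model-predicted rent for each household
predict resid, residuals    // observed rent minus predicted rent
summarize resid             // mean near zero; the standard deviation is the standard error of the estimate described above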
Multiple Regression

Our examples thus far are quite simple, with correlations between one variable and another. Things get more interesting when we consider the effect of multiple independent variables on our dependent variable. If we add one more predictor variable, our simple bivariate regression equation,

$$ Y = a + bX $$

becomes a multivariate regression,

$$ Y = a + b_1 X_1 + b_2 X_2 $$

with two separate beta or slope coefficients. Instead of fitting a line to a scatter of points plotted on a two-dimensional graph, we are now fitting a plane to a cloud of points plotted in a three-dimensional space:

Visualizing Multiple Regression as a Sloping Plane. Source: Peter J. Taylor (1977). Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press, p. 208.

There’s no need for us to remain in the realm of three dimensions; mathematically, the model can be extended to the general form

$$ Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n $$

In Stata, issue this command:

regress grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

and the results panel will yield this:

[Stata output: regression of grosrth on totinch and roomh]

Our R-squared value has increased from 0.2021 to 0.2234 with the addition of a variable measuring the number of rooms in the dwelling. We cannot, however, simply subtract these two values to determine the amount of variance accounted for by the addition of the new variable, because totinch and roomh may themselves be correlated. You can test this by issuing this command,

pwcorr grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

which gives these results:

[Stata output: pairwise correlation matrix for grosrth, totinch, and roomh]

Note that while rent and income are correlated (0.45), there is a much weaker relation between rent and the number of rooms (0.18); the number of rooms is also related to income (0.07). When independent variables exhibit interdependencies, we have the problem of collinearity; when it involves multiple inter-relations among predictors, it’s called multicollinearity.
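One way to see why the two R-squared values cannot simply be subtracted is to fit the nested models separately and compare. A sketch, using the same sample restrictions as above:

* R-squared with income alone (0.2021 in our example)
regress grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000
* R-squared with number of rooms alone
regress grosrth roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000
* R-squared with both predictors (0.2234 in our example)
regress grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

Because the predictors themselves overlap, the R-squared from the two-variable model will generally be smaller than the sum of the two single-variable R-squared values: the portion of rent variance that income and rooms share with each other is counted only once.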