Correlation and Regression
Geography 450, Urban Research
Elvin Wyly
“To avoid falling for the post hoc fallacy and thus wind up believing many
things that are not so, you need to put any statement of relationship
through a sharp inspection. The correlation, that convincingly precise
figure that seems to prove that something is because of something, can
actually be any of several types.”1
“The correlation coefficient is the most commonly seen measure of
association between two variables. It is often denoted r or R, and
sometimes by the Greek r, ρ (rho). ... the correlation coefficient, R, is not
always a sufficient summary of association, but it is useful and often used.
The fact is that no ideal summary numbers exist.”2
Suppose we’re doing a study of working-class housing in the Vancouver region, and
we’re interested in the circumstances of people who live as renters in mobile homes. We
have survey responses from a sample of people, and two of the questions deal with the
total monthly rent, and the total household income. The survey responses are listed in
Table 1. What is the relationship between total monthly rent and total household
income? Does rent co-vary with income? In other words, is there a correlation between
these two measures?
1 Darrell Huff (1954). How to Lie With Statistics. New York: W.W. Norton, p. 89.
2 Loren Haskins and Kirk Jeffrey (1990). Understanding Quantitative History. Cambridge, MA: MIT Press, p. 234.
Table 1. Rent and Income for a Sample of Renters
in Mobile Homes, Vancouver CMA, 2001.

Gross monthly rent    Total household income
1033                  60000
179                   38595
850                   34267
608                   23071
413                   34300
710                   50165
850                   29064
726                   61506
350                   45382
99                    33501
425                   59000
825                   32804
792                   28688
1192                  38513
718                   43411
99                    16864
99                    12000
99                    24312
1300                  46624
560                   20608
99                    27214

Data Source: Statistics Canada (2005). 2001 Census, Public Use Microdata
File (PUMF), households and housing file. Ottawa: Statistics Canada.
When two or more things co-vary with one another, they share variance. If households
with higher incomes tend to have higher monthly rents, and if those with lower incomes
tend to also have lower rents, then these two variables have a positive covariance. If the
opposite held -- if households with low incomes tend to have higher rents and those with
high incomes have lower rents, then the two measures have a negative or inverse
covariance. Given everything that we know about household finances and housing
markets in this society, in this region, at this point in time, we would not expect to
observe negative covariance between income and rent for renters living in mobile homes.
It would be logical to anticipate some kind of positive covariance.
But what is covariance? Recall that variance is one of the measures of the ‘spread’ of a
set of numerical scores: take the difference between each observation and the mean,
square the result, add up all the squared deviations, and then divide by the number of
observations to obtain the mean squared deviation.
s^2 = \frac{\sum (X - \bar{X})^2}{n}
In some textbooks you’ll see an equation which is just a little bit different:
s^2 = \frac{\sum (X - \bar{X})^2}{n-1}
Statistical purists emphasize that when calculating the variance of a sample, the
denominator should be n-1 rather than n in order to provide an unbiased estimate. This
adjustment doesn’t make much of a difference when n is large, but it matters a great deal
if you’re working with a small sample.
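You can see the distinction in Stata. Here is a minimal sketch, assuming a hypothetical variable named x already in memory; summarize reports the n-1 (sample) version, which can then be rescaled to the n (population) version:

quietly summarize x
display r(Var)                      // sample variance, n-1 denominator
display r(Var) * (r(N)-1) / r(N)    // rescaled to the n-denominator version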
Covariance is built from the products of paired deviations from the mean for two
separate variables. Instead of multiplying a score’s deviation from the mean by itself
(that is, squaring it), we multiply the deviation by the corresponding deviation from the
mean for another variable. For variables X and Y, then, covariance is calculated as
COV(X,Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n}
As with the variance, when you’re working with a sample the equation is
COV(X,Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1}
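In Stata, the correlate command’s covariance option reports this sample (n-1) covariance directly. A minimal sketch, again assuming hypothetical variables x and y:

correlate x y, covariance    // the off-diagonal entry is COV(X,Y)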
In both of these equations, however, covariance is affected by the scale of measurement
of the two variables. In our case, rent is measured on a scale that spans several hundred
dollars, while the range for income is many thousands of dollars. If we multiply the
denominator by the product of the standard deviations for both of the variables, we can
effectively standardize the covariance. This creates a ratio that will always range
between -1.0 and +1.0, no matter what the measurement scale of the original variables
(kilometers, liters, thousands of dollars, etc.). The only restriction is that the variables
must be measured on an interval or ratio scale. The standardized covariance is known as
the correlation coefficient:
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n-1)\, s_x s_y}
The correlation coefficient is often called Pearson’s r, or Pearson’s product-moment
correlation coefficient. Karl Pearson developed this measure in 1895, as part of a series
of breakthroughs in measurement, probability theory, and the assessment of “goodness of
fit” between observed patterns and expectations derived either from a priori theory or an
assumed benchmark of pure, random variation.3 If this equation looks a bit cumbersome
or complicated, just keep in mind that expressing each deviation from the mean in units
of standard deviations yields a z-score. So the correlation coefficient can also be
calculated as
r = \frac{\sum z_x z_y}{n-1}
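If you want to confirm this equivalence in Stata, you can construct the z-scores explicitly. A minimal sketch, assuming hypothetical variables x and y in memory:

egen zx = std(x)               // z-scores of x
egen zy = std(y)               // z-scores of y
generate zxzy = zx*zy          // products of the paired z-scores
quietly summarize zxzy
display r(sum) / (r(N) - 1)    // should match the output of: correlate x y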
Table 2 shows the calculations for the variance, covariance, and then the correlation
coefficient for our small sample of renters in mobile homes in the Vancouver region. The
covariance is positive -- as we expected it would be -- and the correlation coefficient is
also positive. Correlation coefficients range between -1.0 and +1.0. If two variables
have no relationship whatsoever, the correlation will be close to zero. Two variables that
approach “perfect” positive correlation will have a coefficient close to +1.0. Two
variables that approach perfect negative correlation will have a coefficient near -1.0.
If we take the square of the correlation coefficient, we obtain the coefficient of
determination, r2. The coefficient of determination ranges from 0 to +1.0, and it
has a more interesting and valuable property: r2 measures the proportion of variance that
two variables share. For our example, r2 is 0.2021. This means that 20.21 percent of the
variance in monthly rents for mobile home renters in the Vancouver region can be
associated with the variance in total household income.
3 For a fascinating history, see M. Eileen Magnello (1999). “The Non-Correlation of Biometrics and
Eugenics: Rival Forms of Laboratory Work in Karl Pearson’s Career at University College London, Part
1.” History of Science 37, 79-106, especially p. 96.
Table 2. Calculating the Variance, Covariance, and Correlation.

Columns: (1) gross monthly rent; (2) difference from mean; (3) squared difference;
(4) total household income; (5) difference from mean; (6) squared difference;
(7) column 2 * column 5.

  (1)     (2)       (3)      (4)      (5)          (6)          (7)
 1033     460    211907    60000    23815    567145153     10962751
  179    -394    154973    38595     2410      5807182      -948662
  850     277     76914    34267    -1918      3679455      -531978
  608      35      1248    23071   -13114    171981992      -463368
  413    -160     25493    34300    -1885      3553943       301002
  710     137     18860    50165    13980    195435074      1919894
  850     277     76914    29064    -7121     50711354     -1974943
  726     153     23511    61506    25321    641143395      3882524
  350    -223     49580    45382     9197     84581305     -2047823
   99    -474    224360    33501    -2684      7204879      1271412
  425    -148     21805    59000    22815    520515534     -3368987
  825     252     63672    32804    -3381     11432449      -853187
  792     219     48107    28688    -7497     56207865     -1644384
 1192     619    383574    38513     2328      5418697      1441690
  718     145     21122    43411     7226     52212323      1050151
   99    -474    224360    16864   -19321    373308401      9151804
   99    -474    224360    12000   -24185    584923438     11455719
   99    -474    224360    24312   -11873    140972652      5623935
 1300     727    529014    46624    10439    108968744      7592494
  560     -13       160    20608   -15577    242648863       197311
   99    -474    224360    27214    -8971     80482259      4249354

Mean                     573                36185
Variance              141433            195416748
Standard deviation       376                13979

A: sum of the products (column 7) = 47266707.3
B: A divided by n-1 = 2363335.37 (the covariance)
C: (n-1) * (product of the two standard deviations) = 105144366
A divided by C = 0.4495 (the correlation coefficient)
Assessing the Significance of r
If we’re working with sample data, we know that our results will be different if we draw
a different random sample. Correlation coefficients, like means, ratios, and other
sample statistics, are subject to random sampling variability. If we find a particular
correlation coefficient in our sample (r), how can we know whether r is just the product
of chance, random sampling variability? Perhaps the r we observe is mere chance
variation that would lead us to believe there is a relationship when in fact the true
population correlation coefficient (ρ) is actually zero.
If we can safely assume that the data for each variable come from a population
distribution that is normal, and if we can safely assume that observations are independent
-- that is, that one observation for x does not affect the other observations of x, and the
same holds for y -- then we can use a t-test to evaluate the significance of a sample
correlation coefficient:
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
If the null hypothesis is correct -- if, in fact, there is no true correlation in the population
-- then this statistic will follow a Student’s t distribution with n-2 degrees of freedom.
This means that even if there is no true correlation in the population, if we were to draw
repeated random samples and calculate correlation coefficients for each sample, there
would be a sampling distribution something like the one shown in the figure below. Most
of the sample correlation coefficients would cluster fairly close to the true zero population
correlation. But in a small number of cases -- the “tails” -- we would obtain coefficients
very far away from zero. The shape of this distribution depends on the degrees of
freedom -- the number of sampled observations minus two (to adjust for the calculation
of standard deviations from two different variables). So we calculate the t statistic using
the formula above, and then look up the critical values of the t distribution in an appendix
of any standard statistics textbook.
The Distribution of Pearson’s r. Source: Perry R. Hinton (1995). Statistics Explained. New York:
Routledge, p. 261.
For our example, the formula yields a t value of about 2.19. For df=19 (our sample of 21
households minus 2), a table of “Critical Values of the t Distribution” indicates that in
ninety-five percent of all random samples when the population correlation coefficient is
zero, the t statistic will be between -2.093 and +2.093. Since our t value is outside this
range, we can reject the null hypothesis. We do have sufficient evidence to conclude
that there is a statistically significant correlation between the monthly rents paid and the
total household income of households living in mobile homes in the Vancouver region.
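You can sketch this arithmetic with Stata’s display command as a calculator, plugging in the values from our example (r = 0.4495, n = 21):

display 0.4495*sqrt(19) / sqrt(1 - 0.4495^2)              // t statistic, about 2.19
display 2*ttail(19, 0.4495*sqrt(19)/sqrt(1 - 0.4495^2))   // two-tailed p, about 0.041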
Correlation in Stata
Fortunately, we don’t have to go through all the tedious calculations that Karl Pearson
(or, to be much more accurate, Karl Pearson’s many hardworking assistants) had to do in
the 1890s. Make sure the 2001 Census of Canada PUMF is located in your c:\data\pumf
directory, and then open Stata and issue the following commands:
set memory 200m
use "c:\data\pumf\2001hh.dta"
corr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 &
totinch > 10000 & totinch < 70000
The “corr” command asks for an analysis of the correlation between grosrth (monthly
gross rent) and totinch (total household income). All the specifications after the “if”
narrow the analysis to the Vancouver CMA (cmah 933), to renters (tenurh 2), and to
dwellings classified as “mobile home or other movable dwelling” (dtypeh 8); finally, the
analysis excludes households with annual incomes of less than $10,000 or more than
$70,000.
After you submit these commands, Stata’s results window reports the correlation matrix
for the two variables.
The correlation between grosrth (monthly gross rent) and totinch (total household
income) is 0.4495 for all of the households who meet the criteria in that command. The
figure of 0.4495 is precisely what we calculated in the worksheet shown in Table 2. If
you would like to request a t-test for the significance of the correlation coefficient, then
the command is a little bit different:
pwcorr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch >
10000 & totinch < 70000, sig
which reports the correlation coefficient together with its significance level.
The figure below the correlation coefficient -- the 0.0409 below the 0.4495 -- indicates
that given the sample size we’re working with, random sampling variability will mean
that about 4 percent of the time, a random sample will yield a correlation this large even
when the correlation in the population is actually zero. Any correlation with a probability
below 0.05 is usually regarded as “statistically significant” -- meaning that it probably did
not occur solely through chance, random sampling variability. Statistical significance is
not the same as practical significance, however. As sample sizes increase, even very
small correlation coefficients will yield t statistics that lie in the extreme ranges of the tail
of the t distribution. This means that analysts who are working with small sample sizes
tend to “accept” correlation coefficients as meaningful if they pass a t test at P<0.10,
while analysts working with extremely large sample sizes will focus on the magnitude of
the correlation coefficient itself -- say, above 0.50 or 0.75. Keep in mind that our small
example yields a correlation coefficient of 0.4495, and so the squared correlation -- the
coefficient of determination -- is only 0.2021. Only a fifth of the variance in rent levels
can be associated with the variance in total household income. This is not a strong
relationship at all.
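To make the sample-size point concrete, here is a hypothetical sketch (not part of our mobile-home analysis) that holds the correlation fixed at a modest r = 0.10 while the sample grows:

foreach n in 30 300 3000 {
    local t = 0.10*sqrt(`n'-2) / sqrt(1 - 0.10^2)
    display "n = `n'   two-tailed p = " 2*ttail(`n'-2, `t')
}

At n = 3000, even r = 0.10 passes any conventional significance test, although it implies an r2 of only 0.01 -- one percent of shared variance.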
Assessing Correlation with Scatter Diagrams
Thus far, we’ve considered the relations between our two variables in terms of variance
and covariance. But we can get a simpler and more intuitive view of the ideas behind
correlation if we take the data in Table 1 and draw a scatter diagram of the households.
Issue the following command
twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 &
totinch > 10000 & totinch < 70000
[Scatterplot: monthly gross rent (vertical axis) against total household income (horizontal axis, $10,000 to $60,000), Vancouver CMA sample]
This is an immediately intuitive confirmation of a positive -- but weak -- relationship
between total household income and monthly gross rent for our sample of households. In
fact, the relationship is even weaker. Since I couldn’t bring myself to work all the way
through the calculations for a pathetically small correlation coefficient, the Stata
commands we’ve been using in this example have excluded households with incomes
lower than $10,000 per year, and a small number of households with incomes over
$70,000. Get rid of these restrictions and draw the scatterplot again:
twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8
[Scatterplot: monthly gross rent against total household income, Vancouver CMA sample with the income restrictions removed]
Here, the relationship is even weaker. The correlation coefficient for this graph is
0.0981, which means that only 0.96 percent -- less than one percent -- of the variance in
monthly rents can be associated with variance in total household income. Given the
sample size and the small value of the coefficient, a t test fails to reject the null
hypothesis -- and so we cannot have confidence that the observed correlation is not just a
random sampling fluctuation from a zero correlation in the population.
This relationship is a little bit stronger in some places, however. Try these two
commands to explore the relations in the Edmonton metropolitan area:
twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8
pwcorr grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8, sig
[Scatterplot: monthly gross rent against total household income, Edmonton CMA]
The scatter diagram seems to show a somewhat stronger relationship, and the pwcorr
command -- “pairwise correlation” -- yields a coefficient of 0.4340, implying a
coefficient of determination that can account for about nineteen percent of all the
variance in gross rent values among mobile-home renters. But the sample size is pretty
small -- only 18 sampled households -- and thus there is a 0.0719 probability level
attached to the t statistic. It’s a judgment call as to how much confidence to place in this
correlation. Indeed, if we are suspicious about that one sample household in the upper
right-hand corner -- if there’s any reason to believe that there is something fundamentally
unique or un-generalizable about this household with an income of about $135,000 living
in a rented mobile home -- then we might make a case for eliminating this “outlier.” We
can do this by editing the command like this:
twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8 &
totinch < 130000
And we get the scatterplot shown below. If we also issue the
pwcorr command, we see that the scatter indicates no relationship whatsoever, and the
correlation coefficient -- 0.0463 -- implies that less than two-tenths of one percent of the
variance in rent levels can be associated with total household income of renters living in
mobile homes in Edmonton. Again, whether it makes sense to exclude the “outlier”
household is a subjective judgment call -- that would be guided by sifting through the
data to explore other characteristics of this household, in an attempt to draw a conceptual
inference. (For example, perhaps this household is a middle-class family living
temporarily in a mobile home while their new, custom home is under construction; we
could make a case that this household is different from most other renters living in
mobile homes.)
[Scatterplot: monthly gross rent against total household income, Edmonton CMA, high-income outlier excluded]
You should always draw scatter diagrams when exploring correlations among variables.
Scatter diagrams are often the best way to begin your inquiry. This is true even if you’ll
eventually calculate correlations and publish the results in a table, like Martin Danyluk
and David Ley did when they correlated neighborhood-level gentrification in Vancouver,
Toronto, and Montreal with the proportion of workers commuting to work by various
means.
Danyluk and Ley’s Correlation Analysis. Source: Martin Danyluk and David Ley (2007). “Modalities of
the New Middle Class: Ideology and Behavior in the Journey to Work from Gentrified Neighbourhoods in
Canada.” Urban Studies 44(11), 2195-2210.
Scatter diagrams are also essential in detecting non-linear relationships (see the panel of
scatter diagrams below). The correlation coefficient measures the strength of a linear
relationship between two variables -- and thus it is entirely possible to obtain weak
correlation coefficients for relations that are strong but non-linear.
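A simulated sketch (hypothetical data, not from the Census PUMF) makes the danger concrete: a variable that depends exactly on another through a quadratic can still show a near-zero Pearson’s r.

clear
set obs 200
set seed 450
generate x = rnormal()     // a standard normal variable
generate y = x^2           // y is perfectly determined by x, but not linearly
correlate y x              // r is close to zero despite the exact relation
twoway scatter y x         // the scatter diagram reveals the curve immediately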
A “tight” scatter of observations along something that looks roughly like a straight line
will yield a very large correlation coefficient -- approaching +1.0 if the slope is upward to
the right, or approaching -1.0 if the slope is down to the right (compare the left and right
scatters of the top panel of the figure below). Conventionally, the vertical axis Y in a correlation
analysis is referred to as the dependent variable, and the horizontal axis X is described as
the independent variable.
Scatter Diagrams for Various Kinds of Relations between Two Variables. Source: Martin Bland
(2005). Clinical Biostatistics, Lecture Notes, Week 7. Toronto: Department of Health Sciences,
York University. Available at http://www-users.york.ac.uk/~mb55/msc/clinbio/week7/corr.htm
Regression
If a correlation coefficient is useful to evaluate the strength of a relationship, and if
scatter diagrams are useful to convey this information visually, these approaches still
leave important questions unanswered: if there is a relationship, what form does it take?
To use our example of mobile home renters, how much does monthly rent increase with
each unit change in total household income? Answering questions like these requires a
technique known as regression.4 Perry Hinton distinguishes correlation from regression
this way: “A linear correlation tells us how close the relationship between two variables is to a
straight line. A linear regression is the straight line that best describes the linear
relationship between the two variables.”5
Let’s return to our sample of Vancouver-area mobile home renters:
[Scatterplot: monthly gross rent against total household income, Vancouver CMA sample]
To describe the straight line that would achieve the “best fit” with these points, we only
need to know a few pieces of information. The equation for the straight line would relate
the dependent variable (monthly gross rent, Y) to variation in the independent variable
(total household income, X) -- while also specifying the point where the line would
intersect the vertical axis. In other words, the equation for our line would take the form
Y=a+bX
where Y is the value of the dependent variable, a is the value of the vertical axis where
the line intersects it (i.e., where X is equal to zero), and b is the slope coefficient that
relates the change in units of X to corresponding changes in the value of Y.
Regression involves finding the values of a and b that achieve the “best fit” of a line to
the scatter of points. Achieving the best fit requires minimizing the sum of the squared
deviations between the observed values of the dependent variable, Y, and the values
predicted by the line. This simple approach is often labeled “ordinary least squares” or
OLS regression.

4 Why is it called “regression”? The word comes from the Latin regredi, “to go back,” and was used by
nineteenth-century researchers to describe a phenomenon known as ‘reversion to the mean.’ Francis
Galton, in a series of studies of the heredity of height and other physical characteristics, observed that very
tall people tended to have children shorter than themselves (i.e., closer to the average), while very short
parents tended to have children who were taller than themselves.
5 Perry Hinton (1995). Statistics Explained. New York: Routledge, p. 262.
The Line of Best Fit. Source: Peter J. Taylor (1977). Quantitative Methods in
Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press,
p. 198.
The sum of squares is at a minimum when
b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
Look carefully at the numerator in this equation. This is identical to the numerator in the
equation for the correlation coefficient. The equation expresses the ratio between the
joint variation of X and Y and the variation of X with itself (i.e., the sum of the squared
deviations). Once we’ve figured out b, then a can be calculated as
a = \bar{Y} - b\bar{X}
The figures in the worksheet in Table 2 can be used to calculate a and b for this small
sample; we obtain b=0.0121 and a=135.05. The line of best fit crosses the vertical axis at
Y=$135.05 gross monthly rent, and each one-unit increase on the X axis (i.e., one dollar
of total household income) yields a corresponding increase in rent of $0.0121. Since the
units for the two variables are so different, it might help to express the change in rent
associated with, say, an increase of $10,000 in total household income: this is associated
with an increase in monthly rent of about $121.
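These values can be checked against the totals in Table 2 with display as a calculator -- a minimal sketch, noting that the sum of squared income deviations equals (n-1) times the income variance:

display 47266707.3 / (20 * 195416748)    // b, about 0.0121
display 573 - 0.0121*36185               // a, about 135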
In Stata, issue the following command:
regress grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch
> 10000 & totinch < 70000
and your results panel will report the full regression output.
Notice three parts of the output. First, the “R-squared” value in the upper-right corner is
0.2021, which is the coefficient of determination we calculated earlier -- also equivalent
to the squared value of the correlation coefficient. About 20.2 percent of the variance in
monthly rent levels can be associated with the variance in total household income among
mobile home renters in the Vancouver metropolitan area in 2001. The “Adj R-Squared”
value takes into consideration the degrees of freedom -- such that analysis with
comparatively few observations will be ‘penalized’ with a lower coefficient of
determination. Second, note the “Coef.” column in the lower-left corner. The coefficient
for totinch is 0.0120938, which is our b value, sometimes called a “beta coefficient.”
Third, the coefficient for “_cons” is Stata’s way of labeling the intercept, which is also
sometimes called the “constant.”
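If you prefer to pull these numbers from Stata’s saved results rather than reading them off the screen, here is a minimal sketch to run immediately after the regress command above:

display e(r2)          // coefficient of determination, 0.2021
display _b[totinch]    // the slope coefficient b
display _b[_cons]      // the intercept a, labeled "_cons"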
Notice that Stata also provides a column labeled “Std. Err.” When the scatter diagram is
diffuse, our line of best fit will provide rather unreliable estimates for the dependent
variable. There will be large differences between the line of best fit -- the line of Y
values predicted with that Y=a+bX equation -- and the actual values for each sampled household.
Notice the right-hand side of the graph, where household income is about $60,000; there
are three sample households, with rents ranging from less than $500 to more than $1,000.
This introduces considerable uncertainty.
The difference between the observed value and model-predicted value for each
observation is known as a residual. If we calculate the residuals for all the observations,
they will have their own mean and standard deviation. The standard deviation of the
residuals is known as the standard error of the estimate. We can use the standard error
of the estimate to calculate t statistics for the beta coefficient, to test the null hypothesis
that the slope in the population is zero, signifying no relationship. In our example, the
t-test yields a probability of 0.041, indicating that we can be more than 95 percent
confident that the coefficient in the population is not zero. There does seem to be a
relationship, although it is a weak relationship.
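Stata’s predict command computes the fitted values and residuals for you. A minimal sketch to run after the regress command above; the variable names rhat and resid are my own:

predict rhat                // predicted rent from the fitted Y = a + bX line
predict resid, residuals    // observed minus predicted rent
summarize resid             // mean near zero; its spread underlies the standard error of the estimate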
Multiple Regression
Our examples thus far are quite simplistic, with correlations between one variable and
another. Things get more interesting when we consider the effect of multiple
independent variables on our dependent variable. If we add one more predictor variable,
our simple bivariate regression equation,
Y=a+bX
becomes a multivariate regression,
Y = a + b_1 X_1 + b_2 X_2
with two separate beta or slope coefficients. Instead of fitting a line to a scatter of points
plotted on a two-dimensional graph, we are now fitting a plane to a cloud of points
plotted in a three-dimensional space:
Visualizing Multiple Regression as a Sloping Plane. Source: Peter J. Taylor (1977).
Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect
Heights, IL: Waveland Press, p. 208.
There’s no need for us to remain in the realm of three dimensions; mathematically, the
model can be extended to the general form,
Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n
In Stata, issue this command:
regress grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 &
totinch > 10000 & totinch < 70000
and the results panel will report the expanded regression output.
Our r-squared value has increased from 0.2021 to 0.2234 with the addition of a variable
measuring the number of rooms in the dwelling. We cannot, however, simply subtract
these two values to determine the amount of variance accounted for by the addition of the
new variable; this is because totinch and roomh may themselves be correlated. You can
test this by issuing this command,
pwcorr grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 &
totinch > 10000 & totinch < 70000
which reports the pairwise correlations among all three variables.
Note that while rent and income are correlated (0.45), there is a much weaker relation
between rent and the number of rooms (0.18); and the number of rooms is itself related to
income (0.071). When independent variables exhibit such interdependencies, we have the
problem of collinearity; when the problem involves multiple inter-relations amongst
several predictors, it is called multicollinearity.
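A standard diagnostic is the variance inflation factor, available after any regression. A minimal sketch to run after the two-predictor regress command above:

estat vif    // VIF near 1: a predictor is nearly uncorrelated with the others;
             // values above about 10 are conventionally read as a warning sign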