12 Simple Linear Regression and Correlation
Copyright © Cengage Learning. All rights reserved.
12.5 Correlation
There are many situations in which the objective in
studying the joint behavior of two variables is to see
whether they are related, rather than to use one to predict
the value of the other.
In this section, we first develop the sample correlation
coefficient r as a measure of how strongly related two
variables x and y are in a sample and then relate r to the
population correlation coefficient $\rho$.
The Sample Correlation Coefficient r
Given n numerical pairs (x1, y1), (x2, y2), …, (xn, yn), it is
natural to speak of x and y as having a positive relationship
if large x’s are paired with large y’s and small x’s with small
y’s. Similarly, if large x’s are paired with small y’s and small
x’s with large y’s, then a negative relationship between the
variables is implied.
Consider the quantity
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Then if the relationship is strongly positive, an xi above the
mean $\bar{x}$ will tend to be paired with a yi above the mean $\bar{y}$,
so that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and this product will also be
positive whenever both xi and yi are below their respective
means.
Thus a positive relationship implies that Sxy will be positive.
An analogous argument shows that when the relationship
is negative, Sxy will be negative, since most of the products
will be negative.
This is illustrated in Figure 12.19.
Figure 12.19 (a) Scatter plot with Sxy positive; (b) scatter plot with Sxy negative
[+ means $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and – means $(x_i - \bar{x})(y_i - \bar{y}) < 0$]
Although Sxy seems a plausible measure of the strength of
a relationship, we do not yet have any idea of how positive
or negative it can be.
Unfortunately, Sxy has a serious defect: By changing the
unit of measurement for either x or y, Sxy can be made
either arbitrarily large in magnitude or arbitrarily close to
zero.
For example, if Sxy = 25 when x is measured in
meters, then Sxy = 25,000 when x is measured in
millimeters and .025 when x is expressed in kilometers.
A reasonable condition to impose on any measure of how
strongly x and y are related is that the calculated measure
should not depend on the particular units used to measure
them.
This condition is achieved by modifying Sxy to obtain the
sample correlation coefficient.
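Definition
The sample correlation coefficient for the n pairs
(x1, y1), …, (xn, yn) is
$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} \tag{12.8}$$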
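The definition translates directly into a few lines of code. The following is a minimal sketch, not a library routine: the function name `sample_correlation` and the five data pairs are hypothetical, made up purely for illustration.

```python
import math

def sample_correlation(x, y):
    """Return (Sxx, Syy, Sxy, r) for paired observations x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / (math.sqrt(sxx) * math.sqrt(syy))
    return sxx, syy, sxy, r

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
sxx, syy, sxy, r = sample_correlation(x, y)
print(sxx, syy, sxy, round(r, 3))
```

For these made-up pairs, Sxx = Syy = 10 and Sxy = 8, so r = 8/10 = .8.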
Example 15
An accurate assessment of soil productivity is critical to
rational land-use planning.
Unfortunately, as the author of the article “Productivity
Ratings Based on Soil Series” (Prof. Geographer, 1980:
158–163) argues, an acceptable soil productivity index is
not so easy to come by.
One difficulty is that productivity is determined partly by
which crop is planted, and the relationship between the
yield of two different crops planted in the same soil may not
be very strong.
To illustrate, the article presents the accompanying data on
corn yield x and peanut yield y (mT/Ha) for eight different
types of soil.
With $S_{xx}$, $S_{yy}$, and $S_{xy}$ computed from these eight pairs, r is obtained from (12.8).
Properties of r
The most important properties of r are as follows:
1. The value of r does not depend on which of the two
variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and
y are measured.
3. $-1 \le r \le 1$
4. r = 1 if and only if (iff) all (xi, yi) pairs lie on a straight line
with positive slope, and r = –1 iff all (xi, yi) pairs lie on
a straight line with negative slope.
5. The square of the sample correlation coefficient gives
the value of the coefficient of determination that would
result from fitting the simple linear regression model; in
symbols, $(r)^2 = r^2$.
Property 1 stands in marked contrast to what happens in
regression analysis, where virtually all quantities of interest
(the estimated slope, estimated y-intercept, s2, etc.) depend
on which of the two variables is treated as the dependent
variable.
However, Property 5 shows that the proportion of variation
in the dependent variable explained by fitting the simple
linear regression model does not depend on which variable
plays this role.
Property 2 is equivalent to saying that r is unchanged if
each xi is replaced by cxi and if each yi is replaced by dyi
(a change in the scale of measurement), as well as if each
xi is replaced by xi – a and yi by yi – b (which changes the
location of zero on the measurement axis).
This implies, for example, that r is the same whether
temperature is measured in °F or °C.
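Property 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: the temperatures and responses are hypothetical values, and `s_xy` and `corr` are helper names introduced here. It converts °C to °F (a change of both scale and location) and compares Sxy and r before and after.

```python
import math

def s_xy(x, y):
    """Sum of cross-products of deviations about the means."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

def corr(x, y):
    """Sample correlation coefficient; note that s_xy(x, x) equals Sxx."""
    return s_xy(x, y) / math.sqrt(s_xy(x, x) * s_xy(y, y))

# Hypothetical temperatures (deg C) and a made-up response variable.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
response = [3.1, 3.9, 5.2, 5.8, 7.1]
temps_f = [1.8 * t + 32.0 for t in temps_c]  # same temperatures in deg F

r_c = corr(temps_c, response)
r_f = corr(temps_f, response)
ratio = s_xy(temps_f, response) / s_xy(temps_c, response)
print(r_c, r_f, ratio)
```

Sxy is multiplied by the scale factor 1.8, while r is the same in both units up to rounding error. The invariance relies on the scale factors being positive, as they are for any change of units; multiplying one variable by a negative constant would reverse the sign of r.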
Property 3 tells us that the maximum value of r,
corresponding to the largest possible degree of positive
relationship, is r = 1, whereas the most negative
relationship is identified with r = –1.
According to Property 4, the largest positive and largest
negative correlations are achieved only when all points lie
along a straight line.
Any other configuration of points, even if the configuration
suggests a deterministic relationship between variables,
will yield an r value less than 1 in absolute magnitude.
Thus r measures the degree of linear relationship among
variables. A value of r near 0 is not evidence of the lack of a
strong relationship, but only the absence of a linear
relation, so that such a value of r must be interpreted with
caution. Figure 12.20 illustrates several configurations of
points associated with different values of r.
Figure 12.20 Data plots for different values of r: (a) r near +1; (b) r near –1; (c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship
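A small numerical case makes the caution about r near 0 concrete. In the sketch below (hypothetical data; `corr` is a helper name introduced here), y is completely determined by x through y = x², yet r is 0 because the relationship is not linear. An exactly linear relationship on the same x values gives r = 1.

```python
import math

def corr(x, y):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x values symmetric about 0 (made up for illustration).
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_quadratic = [xi ** 2 for xi in x]      # deterministic but nonlinear
y_linear = [2.0 * xi + 1.0 for xi in x]  # exactly linear, positive slope

r_quadratic = corr(x, y_quadratic)
r_linear = corr(x, y_linear)
print(r_quadratic, r_linear)
```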
A frequently asked question is, “When can it be said that
there is a strong correlation between the variables, and
when is the correlation weak?” Here is an informal rule of
thumb for characterizing the value of r:
Weak: $-.5 \le r \le .5$
Moderate: either $-.8 < r < -.5$ or $.5 < r < .8$
Strong: either $r \ge .8$ or $r \le -.8$
It may surprise you that an r as substantial as .5 or –.5
goes in the weak category.
The rationale is that if r = .5 or –.5, then $r^2$ = .25 in a
regression with either variable playing the role of y.
A regression model that explains at most 25% of observed
variation is not in fact very impressive.
In Example 15, the correlation between corn yield and
peanut yield would be described as weak.
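The informal rule of thumb is easy to encode. The function below is a sketch only: its name and the returned labels are choices made here, and the cutoffs .5 and .8 are the informal ones stated above, not formal thresholds.

```python
def describe_strength(r):
    """Classify r as weak, moderate, or strong using the informal rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    magnitude = abs(r)
    if magnitude <= 0.5:
        return "weak"
    if magnitude < 0.8:
        return "moderate"
    return "strong"

for value in (0.3, 0.5, -0.65, 0.8, -0.95):
    # r**2 is the proportion of variation explained by the fitted line.
    print(value, describe_strength(value), round(value ** 2, 4))
```

Printing r² alongside the label shows the rationale for the cutoffs: the boundary value r = .5 corresponds to only 25% of the variation explained.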
Inferences About the Population Correlation Coefficient
The correlation coefficient r is a measure of how strongly
related x and y are in the observed sample.
We can think of the pairs (xi, yi) as having been drawn from
a bivariate population of pairs, with (Xi, Yi) having some
joint pmf or pdf.
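We defined the correlation coefficient $\rho(X, Y)$ by
$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$
where
$$\mathrm{Cov}(X, Y) = \begin{cases} \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (X, Y) \text{ discrete} \\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (X, Y) \text{ continuous} \end{cases}$$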
If we think of p(x, y) or f(x, y) as describing the distribution
of pairs of values within the entire population, $\rho$ becomes a
measure of how strongly related x and y are in that
population.
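The population correlation coefficient $\rho$ is a parameter or
population characteristic, just as $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ are, so
we can use the sample correlation coefficient to make
various inferences about $\rho$. In particular, r is a point
estimate for $\rho$, and the corresponding estimator is
$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$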
Example 16
In some locations, there is a strong association between
concentrations of two different pollutants.
The article “The Carbon Component of the Los Angeles
Aerosol: Source Apportionment and Contributions to the
Visibility Budget” (J. of Air Pollution Control Fed., 1984:
643–650) reports the accompanying data on ozone
concentration x (ppm) and secondary carbon concentration
y (μg/m³).
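The summary quantities are n = 16, $\sum x_i = 1.656$, $\sum y_i = 170.6$,
$\sum x_i^2 = .196912$, $\sum x_i y_i = 20.0397$, and $\sum y_i^2 = 2253.56$, from
which
$$r = \frac{20.0397 - (1.656)(170.6)/16}{\sqrt{.196912 - (1.656)^2/16}\,\sqrt{2253.56 - (170.6)^2/16}} = \frac{2.3826}{(.1597)(20.8456)} = .716$$
The point estimate of the population correlation coefficient
$\rho$ between ozone concentration and secondary carbon
concentration is $\hat{\rho} = r = .716$.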
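The value r = .716 can be reproduced from the summary quantities alone, using the computational forms Sxx = Σx² – (Σx)²/n, Syy = Σy² – (Σy)²/n, and Sxy = Σxy – (Σx)(Σy)/n. The sketch below is illustrative; the function name `r_from_sums` is introduced here. It takes Σy = 170.6, the value consistent with the reported r (a sum of 70.6 would give an impossible r greater than 1).

```python
import math

def r_from_sums(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2):
    """Sample correlation coefficient from the six summary quantities."""
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Summary quantities for the ozone / secondary carbon data (n = 16).
r = r_from_sums(n=16, sum_x=1.656, sum_y=170.6,
                sum_x2=0.196912, sum_xy=20.0397, sum_y2=2253.56)
print(round(r, 3))
```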
The small-sample intervals and test procedures presented
in Chapters 7–9 were based on an assumption of
population normality.
To test hypotheses about $\rho$, an analogous assumption
about the distribution of pairs of (x, y) values in the
population is required.
We are now assuming that both X and Y are random,
whereas much of our regression work focused on x fixed by
the experimenter.
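Assumption
The joint probability distribution of (X, Y) is specified by
$$f(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{-\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)}\right\}, \qquad -\infty < x < \infty,\; -\infty < y < \infty \tag{12.9}$$
where $\mu_1$ and $\sigma_1$ are the mean and standard deviation of X,
and $\mu_2$ and $\sigma_2$ are the mean and standard deviation of Y;
f(x, y) is called the bivariate normal probability
distribution.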
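To get a feel for the density, it can help to evaluate it at a few points. The following is a sketch of the standard bivariate normal pdf with means μ1, μ2, standard deviations σ1, σ2, and correlation ρ; the function name and the parameter values are hypothetical. As a sanity check, with ρ = 0 the value should equal the product of the two marginal normal densities.

```python
import math
from statistics import NormalDist

def bivariate_normal_pdf(x, y, mu1, sigma1, mu2, sigma2, rho):
    """Evaluate the bivariate normal density at the point (x, y)."""
    if sigma1 <= 0 or sigma2 <= 0 or not -1 < rho < 1:
        raise ValueError("need sigma1, sigma2 > 0 and -1 < rho < 1")
    zx = (x - mu1) / sigma1
    zy = (y - mu2) / sigma2
    # Quadratic form in the exponent, already divided by (1 - rho^2).
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    norm_const = 2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho * rho)
    return math.exp(-q / 2) / norm_const

# Hypothetical parameter values, chosen only for illustration.
mu1, sigma1, mu2, sigma2 = 10.0, 2.0, 50.0, 8.0
joint_at_rho_zero = bivariate_normal_pdf(11.0, 46.0, mu1, sigma1, mu2, sigma2, 0.0)
product_of_marginals = (NormalDist(mu1, sigma1).pdf(11.0)
                        * NormalDist(mu2, sigma2).pdf(46.0))
peak = bivariate_normal_pdf(mu1, mu2, mu1, sigma1, mu2, sigma2, 0.6)
print(joint_at_rho_zero, product_of_marginals, peak)
```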
The bivariate normal distribution is obviously rather
complicated, but for our purposes we need only a passing
acquaintance with several of its properties.
The surface determined by f(x, y) lies entirely above the x, y plane
[$f(x, y) \ge 0$] and has a three-dimensional bell- or mound-shaped
appearance, as illustrated in Figure 12.21.
Figure 12.21 A graph of the bivariate normal pdf
If we slice through the surface with any plane perpendicular
to the x, y plane and look at the shape of the curve
sketched out on the “slicing plane,” the result is a normal
curve.
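More precisely, if X = x, it can be shown that the
(conditional) distribution of Y is normal with mean
$\mu_{Y \cdot x} = \mu_2 - \rho\mu_1\sigma_2/\sigma_1 + \rho\sigma_2 x/\sigma_1$ and variance $(1 - \rho^2)\sigma_2^2$.
This is exactly the model used in simple linear regression
with $\beta_0 = \mu_2 - \rho\mu_1\sigma_2/\sigma_1$, $\beta_1 = \rho\sigma_2/\sigma_1$, and $\sigma^2 = (1 - \rho^2)\sigma_2^2$
independent of x.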
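The correspondence between the bivariate normal parameters and the regression parameters can be written out as a short function. This is a sketch under assumed values: the function name and the numbers passed to it are hypothetical and serve only to show the arithmetic.

```python
def regression_from_bivariate_normal(mu1, sigma1, mu2, sigma2, rho):
    """Return (beta0, beta1, sigma_sq) of the implied regression of Y on x."""
    if sigma1 <= 0 or sigma2 <= 0 or not -1 < rho < 1:
        raise ValueError("need sigma1, sigma2 > 0 and -1 < rho < 1")
    beta1 = rho * sigma2 / sigma1                # slope
    beta0 = mu2 - rho * mu1 * sigma2 / sigma1    # intercept
    sigma_sq = (1 - rho ** 2) * sigma2 ** 2      # conditional variance of Y
    return beta0, beta1, sigma_sq

# Hypothetical population: mu1 = 10, sigma1 = 2, mu2 = 50, sigma2 = 8, rho = .5.
beta0, beta1, sigma_sq = regression_from_bivariate_normal(10.0, 2.0, 50.0, 8.0, 0.5)
print(beta0, beta1, sigma_sq)
```

With these values the slope is (.5)(8)/2 = 2, the intercept is 50 – (.5)(10)(8)/2 = 30, and the conditional variance is (.75)(64) = 48, which is smaller than the marginal variance 64 of Y. Setting rho to 0 gives slope 0 and intercept μ2.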
The implication is that if the observed pairs (xi, yi) are
actually drawn from a bivariate normal distribution, then the
simple linear regression model is an appropriate way of
studying the behavior of Y for fixed x.
If $\rho = 0$, then $\mu_{Y \cdot x} = \mu_2$ independent of x; in fact, when $\rho = 0$,
the joint probability density function f(x, y) of (12.9) can be
factored as f1(x)f2(y), which implies that X and Y are
independent variables.
Assuming that the pairs are drawn from a bivariate normal
distribution allows us to test hypotheses about $\rho$ and to
construct a CI.
There is no completely satisfactory way to check the
plausibility of the bivariate normality assumption.
A partial check involves constructing two separate normal
probability plots, one for the sample xi’s and another for the
sample yi’s, since bivariate normality implies that the
marginal distributions of both X and Y are normal.
If either plot deviates substantially from a straight-line
pattern, the following inferential procedures should not be
used for small n.
Testing for the Absence of Correlation
When $H_0$: $\rho = 0$ is true, the test statistic
$$T = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}}$$
has a t distribution with n – 2 df.
Alternative Hypothesis     Rejection Region for Level $\alpha$ Test
$H_a$: $\rho > 0$          $t \ge t_{\alpha, n-2}$
$H_a$: $\rho < 0$          $t \le -t_{\alpha, n-2}$
$H_a$: $\rho \ne 0$        either $t \ge t_{\alpha/2, n-2}$ or $t \le -t_{\alpha/2, n-2}$
A P-value based on n – 2 df can be calculated as described
previously.
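As an illustration of the test statistic, the sketch below computes t for r = .716 and n = 16 (the values from Example 16) and compares it with a two-tailed critical value. Applying the test to that example is an illustration added here, and the function name is hypothetical. The critical value $t_{.005,14} = 2.977$ is taken from a standard t table rather than computed.

```python
import math

def t_statistic_for_zero_correlation(r, n):
    """Test statistic for H0: rho = 0; it has n - 2 df when H0 is true."""
    if n <= 2:
        raise ValueError("need n > 2")
    if not -1 < r < 1:
        raise ValueError("need -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.716, 16
t = t_statistic_for_zero_correlation(r, n)

# Two-tailed test at level .01: critical value t_{.005, 14} from a t table.
T_CRITICAL = 2.977
reject_h0 = abs(t) >= T_CRITICAL
print(round(t, 2), reject_h0)
```

Here t is about 3.84, which exceeds 2.977, so H0: ρ = 0 would be rejected at level .01 in favor of Ha: ρ ≠ 0.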
Because $\rho$ measures the extent to which there is a linear
relationship between the two variables in the population,
the null hypothesis $H_0$: $\rho = 0$ states that there is no such
population relationship.
In Section 12.3, we used the t ratio $\hat{\beta}_1 / s_{\hat{\beta}_1}$ to test for a linear
relationship between the two variables in the context of
regression analysis.
It turns out that the two test procedures are completely
equivalent because
$$\frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$
When interest lies only in assessing the strength of any
linear relationship rather than in fitting a model and using it
to estimate or predict, the test statistic formula just
presented requires fewer computations than does the
t-ratio.
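The equivalence of the two statistics can be verified numerically. In the sketch below the data are hypothetical and the function names are introduced here. The regression quantities use the usual formulas: slope estimate Sxy/Sxx, SSE = Syy – (slope)(Sxy), s² = SSE/(n – 2), and estimated standard deviation of the slope s/√Sxx.

```python
import math

def _summaries(x, y):
    """Return n, Sxx, Syy, Sxy for the paired data."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return n, sxx, syy, sxy

def slope_t_ratio(x, y):
    """t ratio: estimated slope divided by its estimated standard deviation."""
    n, sxx, syy, sxy = _summaries(x, y)
    b1 = sxy / sxx
    sse = syy - b1 * sxy
    s = math.sqrt(sse / (n - 2))
    return b1 / (s / math.sqrt(sxx))

def correlation_t(x, y):
    """Test statistic r * sqrt(n - 2) / sqrt(1 - r^2)."""
    n, sxx, syy, sxy = _summaries(x, y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
t_regression = slope_t_ratio(x, y)
t_correlation = correlation_t(x, y)
print(t_regression, t_correlation)
```

The two values agree up to rounding error, but the correlation version needs only r and n.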
Other Inferences Concerning $\rho$
The procedure for testing $H_0$: $\rho = \rho_0$ when $\rho_0 \ne 0$ is not
equivalent to any procedure from regression analysis. The
test statistic is based on a transformation of R called the
Fisher transformation.
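Proposition
When (X1, Y1), …, (Xn, Yn) is a sample from a bivariate
normal distribution, the rv
$$V = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right) \tag{12.10}$$
has approximately a normal distribution with mean and
variance
$$\mu_V = \frac{1}{2} \ln\left(\frac{1 + \rho}{1 - \rho}\right), \qquad \sigma_V^2 = \frac{1}{n - 3}$$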
The rationale for the transformation is to obtain a function
of R that has a variance independent of $\rho$; this would not be
the case with R itself.
Also, the transformation should not be used if n is quite
small, since the approximation will not be valid.
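The test statistic for testing $H_0$: $\rho = \rho_0$ is
$$Z = \frac{V - \frac{1}{2}\ln[(1 + \rho_0)/(1 - \rho_0)]}{1/\sqrt{n - 3}}$$
Alternative Hypothesis     Rejection Region for Level $\alpha$ Test
$H_a$: $\rho > \rho_0$     $z \ge z_\alpha$
$H_a$: $\rho < \rho_0$     $z \le -z_\alpha$
$H_a$: $\rho \ne \rho_0$   either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$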
A P-value can be calculated in the same manner as for
previous z tests.
Example 18
The article “Size Effect in Shear Strength of Large
Beams—Behavior and Finite Element Modelling” (Mag. of
Concrete Res., 2005: 497–509) reported on a study of
various characteristics of large reinforced concrete deep
and shallow beams tested until failure.
Consider the following data on x = cube strength and
y = cylinder strength (both in MPa):
Then $S_{xx} = 367.74$, $S_{yy} = 488.54$, and $S_{xy} = 322.37$, from
which r = .761.
Does this provide strong evidence for concluding that the
two measures of strength are at least moderately positively
correlated?
Our previous interpretation of moderate positive correlation
was $.5 < \rho < .8$, so we wish to test $H_0$: $\rho = .5$ versus
$H_a$: $\rho > .5$. The computed value of V is then
$$v = .5 \ln\left(\frac{1 + .761}{1 - .761}\right) = .999 \quad \text{and} \quad .5 \ln\left(\frac{1 + .5}{1 - .5}\right) = .549$$
Thus $z = (.999 - .549)\sqrt{n - 3} \approx 1.80$, where n is the sample size.
The P-value for an upper-tailed test is .0359. The null
hypothesis can therefore be rejected at significance level
.05 but not at level .01.
This latter result is somewhat surprising in light of the
magnitude of r, but when n is small, a reasonably large r
may result even if $\rho$ is not all that substantial.
At significance level .01, the evidence for a moderately
positive correlation is not compelling.
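The calculation in this example can be sketched in code using only the standard library. One input is an assumption: the sample size is not stated in the surviving text, and n = 19 is used here because it is the value that reproduces the reported P-value of .0359. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_v(r):
    """Fisher transformation v = (1/2) ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def fisher_z_upper_tailed(r, n, rho0):
    """Return (z, P-value) for H0: rho = rho0 versus Ha: rho > rho0."""
    if n <= 3:
        raise ValueError("need n > 3")
    z = (fisher_v(r) - fisher_v(rho0)) * math.sqrt(n - 3)
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# r and rho0 are from the example; n = 19 is an ASSUMPTION (see the text above).
z, p_value = fisher_z_upper_tailed(r=0.761, n=19, rho0=0.5)
print(round(z, 2), round(p_value, 4))
```

The P-value computed this way differs slightly from .0359 in the fourth decimal place because the text rounds v and z before consulting the normal table.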
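To obtain a CI for $\rho$, we first derive an interval for
$\mu_V = \frac{1}{2}\ln[(1 + \rho)/(1 - \rho)]$. Standardizing V, writing a probability
statement, and manipulating the resulting inequalities
yields
$$\left(v - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\; v + \frac{z_{\alpha/2}}{\sqrt{n - 3}}\right) \tag{12.11}$$
as a 100(1 – $\alpha$)% interval for $\mu_V$, where $v = \frac{1}{2}\ln[(1 + r)/(1 - r)]$.
This interval can then be manipulated to yield a CI for $\rho$.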
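A 100(1 – $\alpha$)% confidence interval for $\rho$ is
$$\left(\frac{e^{2c_1} - 1}{e^{2c_1} + 1},\; \frac{e^{2c_2} - 1}{e^{2c_2} + 1}\right)$$
where $c_1$ and $c_2$ are the left and right endpoints,
respectively, of the interval (12.11).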
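The two-step construction (an interval for the mean of V, then a back-transformation) can be sketched as follows. The function name is hypothetical, and applying it to r = .716 with n = 16 from Example 16 is an illustration added here, not a result quoted from the text.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate CI for rho based on the Fisher transformation."""
    if n <= 3:
        raise ValueError("need n > 3")
    if not -1 < r < 1:
        raise ValueError("need -1 < r < 1")
    v = 0.5 * math.log((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    half_width = z_crit / math.sqrt(n - 3)
    c1, c2 = v - half_width, v + half_width  # interval (12.11) for mu_V

    def back_transform(c):
        # Inverse of the Fisher transformation; algebraically equal to tanh(c).
        return (math.exp(2 * c) - 1) / (math.exp(2 * c) + 1)

    return back_transform(c1), back_transform(c2)

lower, upper = correlation_ci(r=0.716, n=16, confidence=0.95)
print(round(lower, 3), round(upper, 3))
```

The resulting interval is not symmetric about r, since the back-transformation is nonlinear, and it is wide because n is small.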