linear relationship

Lecture 11
Chapter 6. Correlation and Linear Regression
6.1 Introduction

This chapter is concerned with relationships
between continuous variables.

Example (see Handout 11)
During the 1950s radioactive water leaked into the Columbia River
in Washington State. Data were collected on an exposure index (X)
and the cancer mortality rate (Y) (deaths per 100,000 per year)
for the years 1959-1964, for each of nine counties downstream:
Exposure (x):    8.3   6.4   3.4   3.8   2.6   11.6   1.2   2.5   1.6
Mortality (y):   210   180   130   170   130   210    120   150   140
Both the variables X and Y are measurements on a
continuous scale.
We are interested in how these two variables are
related, or associated.
As usual, the sensible first step is to look at the
data. The best way to do that here is to plot the
mortality rate against the exposure index.
[Figure: scatter plot of Mortality (vertical axis, 120 to 210) against Exposure (horizontal axis, 0 to 10)]
The plot suggests that there is a clear relationship
(association) between the mortality rate and the
exposure index. The relationship looks approximately
linear (like a straight line).
In this chapter we do two things:
1. Use a measure called correlation to describe the
strength of the association between two variables.
2. Use a method called linear regression to model the
relationship between two variables which are
associated in a way which is approximately linear.
6.2 Correlation

There are several different measures of association in
use, but we will only consider the most common, which is
called Pearson’s product moment correlation coefficient, or
more briefly the sample linear correlation coefficient, or just
the Pearson correlation. It is usually denoted by the letter r.
Additional Notes (Slide 1 of 2)
• The value of r always lies between -1 and +1;
• Values of r near to +1 indicate a strong positive linear relationship;
• Values of r near to -1 indicate a strong negative linear relationship;
• Values of r near to 0 indicate there is very little linear relationship.
Additional Notes (Slide 2 of 2)
• Let’s see what Minitab tells us about the Pearson
correlation for our example above. We use:
Stat>Basic Statistics>Correlation...
Minitab tells us two things:
• the Pearson correlation is r = 0.917
• the P-value is 0.000
Note that this correlation is close to +1, indicating a
strong positive linear relationship.
What about the p-value?
This is the result of a hypothesis test of the null
hypothesis:
H0: The linear correlation in the population is zero.
Minitab reports p = 0.000, which is a value rounded to three
decimal places rather than exactly zero. Such a small p-value
means we reject the null hypothesis: there is strong evidence of
a positive linear relationship between exposure and mortality.
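The Minitab value of r can be checked by hand. The sketch below assumes the standard formula for the sample Pearson correlation, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The lecture does not state this formula, and the function name pearson_r is only illustrative.

```python
import math

# Exposure index (x) and cancer mortality rate (y) for the nine counties.
exposure = [8.3, 6.4, 3.4, 3.8, 2.6, 11.6, 1.2, 2.5, 1.6]
mortality = [210, 180, 130, 170, 130, 210, 120, 150, 140]


def pearson_r(x, y):
    """Sample Pearson correlation, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(exposure, mortality)
print(f"r = {r:.3f}")
```

For these data the result agrees with the value r = 0.917 reported by Minitab.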
The correlation coefficient r is a very useful summary
measure, but it is often misused. Some points to
remember are as follows:
1. A high correlation does not necessarily imply a
cause-and-effect relationship.
2. Although a value of r close to 1 does indicate a
strong positive linear association, a straight line
is not always the most appropriate description.
Always produce a plot of y against x.
3. A value of r close to zero indicates no linear relationship.
That does not necessarily mean there is no
relationship!
For the data plotted below, r = 0.020, and the p-value is
0.854. This correctly identifies there is no linear
relationship, but there clearly is a relationship!
[Figure: scatter plot of y (vertical axis, 0 to 100) against x (horizontal axis, 0 to 20)]
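To see how r can be close to zero even when y depends strongly on x, here is a minimal sketch using made-up data. These are not the data plotted in the lecture: y is an exact quadratic function of x, symmetric about x = 10. The function name pearson_r is illustrative, and the formula is the standard r = Sxy / sqrt(Sxx * Syy).

```python
import math

# Hypothetical data (not the lecture's): y is determined exactly by x,
# but along a symmetric curve rather than a straight line.
x = list(range(21))                # 0, 1, ..., 20
y = [(xi - 10) ** 2 for xi in x]   # perfect quadratic relationship


def pearson_r(x, y):
    """Sample Pearson correlation, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r_curve = pearson_r(x, y)
print(f"r = {r_curve:.3f}")
```

Here r is zero up to rounding error, because the positive and negative deviations cancel in Sxy, even though y can be predicted perfectly from x. This is why a plot should always accompany the correlation.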
6.3 Simple Linear Regression

The correlation coefficient tells us about the strength of
a linear relationship, but it doesn’t allow us to do things
like make predictions about new data.

For this we need a model for the data. If we think there
is an approximately linear relationship, we use the
equation of a straight line, which relates X and Y:
Y = α + βX

Here the values of α (alpha) and β (beta) are the
intercept and the slope of the straight line respectively.
The slope, β, is usually of much more interest, because
it tells us how Y changes with X.

Since we don’t expect the data to lie exactly on
a straight line, we always add a random error
component, ε (epsilon), so the equation
becomes:
Y = α + βX + ε (Equation 1)

Equation 1 is the equation of a simple linear
regression. In order to use it to model our data,
we need to choose the values of α and β which
work best.

E.g. for the exposure-mortality data, we might
obtain....
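[Figure: Minitab regression plot of Mortality (vertical axis, 120 to 220) against Exposure (horizontal axis, 0 to 10), with the fitted line superimposed]
Mortality = 118.449 + 9.03279 Exposure
S = 14.5763    R-Sq = 84.2 %    R-Sq(adj) = 81.9 %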

Notice that in the plot above, α has been
chosen as 118.4, and β as 9.03.

This indicates that in our model, the mortality
rate increases by 9.03 deaths per 100,000 per year
for every unit increase in the exposure index, and
the mortality rate when the exposure index is zero
is 118.4 deaths per 100,000 per year.
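As an illustration of how the fitted line is used for prediction, the sketch below plugs a few exposure values into the equation reported by Minitab. The exposure values 5 and 6 and the function name predict_mortality are hypothetical, chosen only for this example.

```python
# Fitted coefficients as reported in the Minitab output.
ALPHA = 118.449   # intercept: fitted mortality rate at exposure index 0
BETA = 9.03279    # slope: change in mortality rate per unit of exposure


def predict_mortality(exposure_index):
    """Fitted mortality rate (deaths per 100,000 per year)."""
    return ALPHA + BETA * exposure_index


# Hypothetical exposure values, chosen only for illustration.
for x_new in (0.0, 5.0, 6.0):
    print(f"exposure {x_new:4.1f} -> fitted mortality {predict_mortality(x_new):.1f}")

# Moving from exposure 5 to exposure 6 changes the prediction by the slope.
difference = predict_mortality(6.0) - predict_mortality(5.0)
print(f"increase per unit of exposure: {difference:.2f}")
```

At an exposure index of 5 the fitted line gives about 163.6 deaths per 100,000 per year. The observed exposure values range from 1.2 to 11.6, so the value at exposure 0 (the intercept) is an extrapolation slightly beyond the data and should be read with some caution.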

But how were these values chosen?

The usual criterion, and the one used above, is to
use the least squares estimates for α and β: the values
that minimise the sum of the squared vertical distances
between the observed values of Y and the fitted line.
We obtain these in Minitab using:
Stat>Regression>Regression...
if we want the equation etc., and
Stat>Regression>Fitted Line Plot...
if we want the graph with the fitted line
superimposed.
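For readers without Minitab, the least squares estimates can also be computed directly. The sketch below assumes the standard closed-form solution for simple linear regression: slope = Sxy / Sxx and intercept = (mean of y) - slope * (mean of x). It also assumes the usual definitions S = sqrt(SSE / (n - 2)) and R-Sq = 1 - SSE / SST. None of these formulas are stated in the lecture, and the function and variable names are illustrative.

```python
import math

# Exposure index (x) and cancer mortality rate (y) for the nine counties.
exposure = [8.3, 6.4, 3.4, 3.8, 2.6, 11.6, 1.2, 2.5, 1.6]
mortality = [210, 180, 130, 170, 130, 210, 120, 150, 140]


def least_squares(x, y):
    """Return (alpha_hat, beta_hat) minimising sum((y - alpha - beta*x)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx                 # slope
    alpha_hat = y_bar - beta_hat * x_bar   # intercept
    return alpha_hat, beta_hat


alpha_hat, beta_hat = least_squares(exposure, mortality)

# Residuals: vertical distances between the observations and the fitted line.
residuals = [yi - (alpha_hat + beta_hat * xi)
             for xi, yi in zip(exposure, mortality)]
n = len(exposure)
y_bar = sum(mortality) / n
sse = sum(e ** 2 for e in residuals)               # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in mortality)   # total sum of squares
s = math.sqrt(sse / (n - 2))                       # residual standard deviation
r_sq = 1 - sse / sst                               # proportion of variation explained

print(f"Mortality = {alpha_hat:.3f} + {beta_hat:.5f} Exposure")
print(f"S = {s:.4f}, R-Sq = {100 * r_sq:.1f} %")
```

Run on the nine counties, this reproduces the Minitab figures quoted earlier: intercept 118.449, slope 9.03279, S = 14.5763 and R-Sq = 84.2 %.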