Chapter 3 Regression and Correlation
Simple linear regression (SLR) is a method to find the "best fitting line" to a set of n points (x, y). SLR minimizes the sum of squares of the vertical distances from the points to the fitted line, which is why the procedure is also called least squares regression.
Correlation coefficient (r):
A number that measures the strength and direction of the linear association between X and Y (both quantitative variables).
−1 ≤ r ≤ +1 always.
Correlation (r) = + 1 when there is a perfectly linear increasing relationship between X and Y.
Correlation (r) = – 1 when there is a perfectly linear decreasing relationship between X and Y.
No units. Correlation is a unit-less quantity.
R² = (r)² is called the coefficient of determination.
R² measures the percent of variability in the response (Y) explained by the changes in X [or by the regression on X]. What does R² = 0.81 (= 81%) mean?
How do you find r when you are given R²? For example, what is r if R² = 0.81 = 81%?
Example: Suppose your friend claims that she can guess a person's age correctly (well, almost). So, to see if this claim is justifiable, you select a random sample of 10 people, ask your friend to guess their ages, and then ask each person his/her true age. The following are observed:
ID 1 2 3 4 5 6 7 8 9 10
Guessed age 18 52 65 90 28 58 13 66 44 35
True Age 20 45 70 85 25 50 15 60 40 35
The very first step in regression analysis is to identify the independent (explanatory) and the dependent
(response) variables. Since the true age determines your friend's guesses (and your friend's guess has no effect on a person's true age), we have
X = Independent Variable = True age
Y = Response = Dependent Variable = Guessed Age.
The next step is to draw a scatter diagram of the data and interpret what you see (to get some ideas about the relation between two variables).
[Scatterplot of Guessed Age vs True Age: guessed ages (vertical axis, 10 to 100) plotted against true ages (horizontal axis, 10 to 90).]
1. What do you see?
2. Verify the following summary statistics using your calculator: x̄ = 44.5, s_X = 22.42, ȳ = 46.9, s_Y = 24.02.
3. Compute the slope and intercept of the least squares regression line, given that r = 0.9844.

Slope = b = r (s_Y / s_X) = 0.9844 × 24.02 / 22.42 = 1.054651561

Intercept = a = ȳ − b x̄ = 46.9 − 1.054651561 × 44.5 = −0.0319944692
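These hand calculations are easy to check with software. The following is a minimal sketch in Python (NumPy assumed available), applying the same formulas b = r·s_Y/s_X and a = ȳ − b·x̄ to the data above:

    import numpy as np

    x = np.array([20, 45, 70, 85, 25, 50, 15, 60, 40, 35])   # true ages (X)
    y = np.array([18, 52, 65, 90, 28, 58, 13, 66, 44, 35])   # guessed ages (Y)

    r = np.corrcoef(x, y)[0, 1]              # correlation, about 0.9844
    b = r * y.std(ddof=1) / x.std(ddof=1)    # slope = r * s_Y / s_X, about 1.0547
    a = y.mean() - b * x.mean()              # intercept = y-bar - b * x-bar, about -0.032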
Hence the prediction equation is ŷ = −0.03 + 1.05x.

Are these results consistent with what you have observed in the scatter plot?
4. Interpret the numerical results:
Correlation = r = 0.9844, so there is a strong, increasing, linear relationship between the true and guessed ages.
Slope = 1.05: for every one-year increase in the true age, the guessed age increases, on average, by 1.05 years.
Intercept = −0.03. DO NOT INTERPRET the intercept in this case because a) zero is not within the range of observed values of the independent variable (X) and b) zero and −0.03 are not meaningful in this context.
5. Compute R² (coefficient of determination) and interpret it.

R² = (r)² = (0.9844)² = 0.969 = 0.969 × 100% = 96.9%
Interpretation:
96.9% of the variation in guessed ages (Y) is explained by the true age (X).
96.9% of variability in guessed ages is explained by linear regression on true ages.
6. Plot the estimated regression line on the scatter diagram.
For this we choose two values of X (as far apart as meaningful) and predict the value of Y for those two values of X, using the prediction equation ŷ = −0.03 + 1.05x.

For x = 15 we have ŷ = −0.03 + 1.05 × 15 = 15.72.

For x = 90 we get ŷ = −0.03 + 1.05 × 90 = 94.47.

These give us two points, (15, 15.72) and (90, 94.47). Mark the points on the scatter diagram and join them with a ruler.
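Instead of joining the two points by hand, a plotting library can overlay the fitted line on the scatter. A sketch with matplotlib (assumed available; x and y as in the earlier sketch):

    import matplotlib.pyplot as plt

    plt.scatter(x, y)                       # the 10 observed points
    xs = [15, 90]
    ys = [-0.03 + 1.05 * v for v in xs]     # the anchor points (15, 15.72) and (90, 94.47)
    plt.plot(xs, ys)                        # fitted line through them
    plt.xlabel("True Age")
    plt.ylabel("Guessed Age")
    plt.title("Scatterplot of Guessed Age vs True Age")
    plt.show()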
[Scatterplot of Guessed Age vs True Age with the fitted line drawn through (15, 15.72) and (90, 94.47).]
Chapter 11 Inferences for SLR
In Chapter 3, a linear relation between two quantitative variables (denoted by X and Y) was shown by ŷ = a + bX.

In this equation (called the prediction equation):
Y is called the response (dependent) variable,
X is called the explanatory variable,
ŷ is called the predicted value of Y,
a = the estimate of the intercept (α), and
b = the estimate of the slope (β).

Hence, ŷ = a + bX is an estimate for a simple linear regression model.

Also, ŷ is called the estimate of the true (but unknown) regression line µ_Y = α + βX, so we also write ŷ for the estimated µ_Y.
Regression Model: A mathematical (or theoretical) equation that shows the linear relation between the explanatory variable and the response. The simple linear regression model we will use is

Y = α + βX + ε, where ε is called the error term.
Chapters 3 and 11 Fall 2007
Page 6 of 34
Let’s see the error terms graphically:
Total error = y − ȳ = (y − ŷ) + (ŷ − ȳ) = Random error + Regression error

In this relation, the total error is divided into two parts: the random error = y − ŷ and the regression error = ŷ − ȳ = the error due to using the regression model instead of the sample mean, ȳ = 46.9.
[Scatterplot of Guessed Age vs True Age illustrating, for one observation, the split of the total error y − ȳ into the random error y − ŷ and the regression error ŷ − ȳ.]
The unbiased estimators of the parameters (α and β) of the regression line (those given in Chapter 3) are found by the method of least squares estimation (LSE) as

b = r (S_Y / S_X) and a = Ȳ − b X̄.

Hence,
β is the (true and unknown) slope of the regression line, estimated by b = r (S_Y / S_X), and
α is the (true and unknown) y-intercept (or simply intercept) of the regression line, estimated by a = Ȳ − b X̄.
What do the slope and intercept of the regression line tell us?
Slope is the average amount of change in Y for one unit of increase in X.
Note: slope ≠ rise / run. Why?
Intercept is the value of Y when X = 0.
Important Note: We DO NOT use the above interpretation when a) X = 0 is not meaningful, or b) zero is not within the range of, or near, the observed values of X.
Assumptions of Simple Linear Regression:
1. A random sample of n pairs of observations, (X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ).
2. The population of Y's has a normal distribution with mean µ_Y = α + βX, which changes for each value of X, and the same standard deviation, σ, at every value of the independent variable X. The relation between X and Y may also be formulated as Y = α + βX + ε.

3. As a result of the above assumptions, the error terms, ε, are iid (identically and independently distributed) random variables that have a normal distribution with mean zero and standard deviation σ, i.e., ε ~ N(0, σ).
4. These mean that both Y and ε are random variables. [We may choose any value for X, hence it is assumed to be a non-random variable (even when it is random).]
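One way to internalize these assumptions is to generate data from the model. The sketch below draws Y = α + βX + ε with ε ~ N(0, σ) at fixed X values; the parameter values are hypothetical, chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, beta, sigma = 2.0, 1.5, 3.0           # hypothetical true parameters
    x = np.linspace(10, 90, 50)                  # X is treated as fixed (non-random)
    eps = rng.normal(0.0, sigma, size=x.size)    # iid N(0, sigma) error terms
    y = alpha + beta * x + eps                   # Y is random because eps is random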
Are these assumptions satisfied in Example 1?
Assumption 1 (random sample) is satisfied.
To check assumptions 2 and 3 we look at the residuals, where

Residual = Observed value of Y − Predicted value of Y = y − ŷ.
If these residuals do not have any extreme value, we say assumptions 2 and 3 are justifiable, since we do not have any reason to suspect otherwise (more later).
So, let's calculate the residuals using the prediction equation, ŷ = −0.03 + 1.05x, found in Chapter 3. Then we will plot the residuals.
True Age (x)   Observed y   Predicted (ŷ)   Residual (y − ŷ)
     20            18           21.06            −3.06
     45            52           47.43             4.57
     70            65           73.79            −8.79
     85            90           89.61             0.39
     25            28           26.33             1.67
     50            58           52.70             5.30
     15            13           15.79            −2.79
     60            66           63.25             2.75
     40            44           42.15             1.85
     35            35           36.88            −1.88
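The table can be reproduced in a few lines; a sketch using the Minitab coefficients (x and y as in the first sketch):

    y_hat = -0.03 + 1.05462 * x               # predicted guessed ages
    residuals = y - y_hat                      # observed minus predicted
    for xi, yi, pred, res in zip(x, y, y_hat, residuals):
        print(f"{xi:3d} {yi:5d} {pred:8.2f} {res:8.2f}")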
[Histogram of the residuals (response is Guessed Age).]
Do you think the assumption of normality is satisfied? Why or why not?
Inferences about parameters of SLR
The parameters of the regression model are α, β and σ. These parameters are estimated by a, b and S, respectively.
Chapter 11 deals with inferences about the true Simple Linear Regression (SLR) model¹, i.e., a regression model with one explanatory variable (X).
When making inferences about the parameters of the regression model, we will determine
whether X is a "good predictor" of Y,
whether the regression line is useful for making predictions about Y, and
whether the slope is different from zero.

In this chapter we will also see how to find
a prediction interval for an individual response, Y, at X = x, and
confidence intervals for the mean of Y, that is, µ_Y = the mean response, at X = x.

We carry out these using ANOVA.
¹ In Chapter 12 we will see how to make inferences about the parameters of a multiple regression model, i.e., a regression model with several (k ≥ 2) explanatory variables, X₁, X₂, …, X_k.
ANOVA FOR SLR
Is X a good predictor of Y? This is equivalent to asking: is the slope of the line significantly different from zero? [If not, we might as well use ȳ as the predictor.] We can answer these questions using an ANOVA table:
ANOVA for SLR

Source               df      SS      MS                    F
Regression (Model)   1       SSReg   MSReg = SSReg / 1     F = MSReg / MSE
Residuals (Error)    n − 2   SSE     MSE = SSE / (n − 2)
Total                n − 1   SST
Total SS = Model SS + Error SS:

SST = SSReg + SSE, that is,

Σᵢ (yᵢ − ȳ)² = Σᵢ (ŷᵢ − ȳ)² + Σᵢ (yᵢ − ŷᵢ)², where each sum runs over i = 1, …, n.

df: (n − 1) = 1 + (n − 2)
The df for regression = 1 because there is only 1 independent variable.
The df for residuals = n − 2 because we estimate 2 parameters (α and β).
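The three sums of squares come straight from their definitions. A sketch (x, y, and y_hat as in the earlier sketches):

    n = len(y)
    sst = ((y - y.mean()) ** 2).sum()           # total SS, about 5190.9
    ss_reg = ((y_hat - y.mean()) ** 2).sum()    # regression SS, about 5030.0
    sse = ((y - y_hat) ** 2).sum()              # error SS, about 160.9

    mse = sse / (n - 2)                         # MSE with df = n - 2 = 8, about 20.1
    f_stat = (ss_reg / 1) / mse                 # F statistic, about 250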
Assumptions for ANOVA:
Random sample
Normal distribution (of ε and hence of Y)
Constant variance (of ε and of Y)

The hypothesis of interest is Ho: β = 0 vs. Ha: β ≠ 0.
Test statistic = F = MSReg / MSE
To find the p-value, first find the tabulated F-value from the F-tables with df₁ = 1 and df₂ = n − 2; then compare that value with the F in the ANOVA table. The following is the output obtained from Minitab:
Regression Analysis: Guessed Age vs. True Age
Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1   5030.0  5030.0  250.08  0.000
Residual Error   8    160.9    20.1
Total            9   5190.9
To test Ho: β = 0 vs. Ha: β ≠ 0, the test statistic is F = 250.08 from the ANOVA table.
The F-value is extremely large!!! What does it mean?
The p-value = 0.000. What does it mean?
Decision?
Conclusion?
Decision: Reject Ho, since the p-value (< 0.0005) is smaller than any reasonable level of significance.
Conclusion: The observed data indicate that the slope is significantly different from zero.
Using the t-test

We may also use the t-test for testing the above hypotheses, as explained in Chapter 8. For this we use the first block of the Minitab output:
The regression equation is
Guessed Age = – 0.03 + 1.05 True Age
Predictor Coef SE Coef T P
Constant – 0.030 3.289 – 0.01 0.993
True Age 1.05462 0.06669 15.81 0.000
S = 4.48483 R-Sq = 96.9% R-Sq(adj) = 96.5%
In this case the parameter is β.

Estimate = b = 1.05462, SE(Estimate) = 0.06669

Significance Test: Ho: β = 0 vs. Ha: β ≠ 0

Test statistic = T = (b − 0) / SE(b)

Calculated value of the test statistic:

T_cal = (1.05462 − 0) / 0.06669 = 15.81
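SciPy's linregress returns the slope, its standard error, and the two-sided p-value in a single call; a sketch (SciPy assumed available, x and y as before):

    from scipy import stats

    res = stats.linregress(x, y)
    t_cal = res.slope / res.stderr     # (b - 0) / SE(b), about 15.81
    # res.slope  ~ 1.05462, res.stderr ~ 0.06669,
    # res.pvalue ~ 1e-7, which Minitab reports as 0.000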
To find the p-value, go back and look at Ha. We have a 2-sided alternative and hence

P-value = 2 P(T ≥ |T_cal|) = 2 P(T ≥ 15.81) ≈ 0.
This p-value gives us the same decision and conclusion as the one we got from the ANOVA table.
In general, to find the p-value we would look at the t-table for T_cal on the line with df = n − 2. [This is the df for error in the ANOVA table.]

Compare the T_cal above with the F_cal in the ANOVA table. We have the following general relation between T_cal and F_cal in SLR:

T_cal² = F_cal, and equivalently T_cal = √F_cal.

(Check: 15.81² ≈ 250, which matches F_cal = 250.08 up to rounding.)
So the p-value for the t-test is the same as the p-value for the F-test. Hence in SLR, the two significance tests for the slope give the same results, whether we use the F-test (from the ANOVA table) or the t-test.
Observe that the above conclusion does not tell us in what way β is different from zero. We could use the t-test for testing one-sided alternatives about β. However, these should be decided before looking at the data.
Confidence Interval for the Slope:
Remember the general formula for the confidence interval:

CI = Estimate ± ME = Estimate ± t* × SE(Estimate)

This is used in finding a CI for β, where the estimate is b and SE(estimate) is given in the Minitab output. All we need to do is find t* from the t-tables with df = (n − 2) = df_error in the ANOVA table.
For the above example we had the following results from Minitab:
Predictor Coef SE Coef T P
Constant – 0.030 3.289 – 0.01 0.993
True Age 1.05462 0.06669 15.81 0.000
That is, b = slope = 1.05462, SE(Slope) = 0.06669.
Also, since df_error = 8 in the ANOVA table, we use the table of the t-distribution and read the t-value for a 95% CI on the row with df = 8 as t* = 2.306, which gives

ME = t* × SE(Estimate) = 2.306 × 0.06669 = 0.153787.

Hence a 95% CI for β is

CI = 1.05462 ± 0.15379 = (0.90083, 1.20841) ≈ (0.9, 1.2).
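The same interval can be computed in code, with t* taken from the t-distribution with 8 df (SciPy assumed available):

    from scipy import stats

    t_star = stats.t.ppf(0.975, df=8)      # about 2.306
    me = t_star * 0.06669                  # margin of error, about 0.1538
    ci = (1.05462 - me, 1.05462 + me)      # about (0.90, 1.21)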
As in Chapters 7 – 9, we can use the CI to make a decision for the significance test: when zero is not in the CI we reject Ho and conclude that the observed data give strong evidence that the slope of the regression line is different from zero.
Actually, we can say more: since the CI for β in this example is (0.9, 1.2), both ends of the CI are positive, so we can conclude with 95% confidence that the slope of the true regression line is some number between 0.9 and 1.2.
Alternatively we interpret the CI as follows:
We are 95% confident that on the average, as the true age increases by one year, the guessed age increases by somewhere between 0.9 and 1.2 years.
Confidence Interval for Mean Response and Prediction Interval
General formula for CIs:

CI = Estimator ± t* × SE(Estimator)

Additional symbols:

µ_Y|x* = α + βx* = the mean response for the population of ALL Y's that have X = x*; this is the point on the true regression line that corresponds to X = x*.

ŷ|x* = a + bx* = the estimator of the mean response at X = x*.

SE(ŷ|x*) = SE(Estimator of Mean Response) = S √[ 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]

Hence the CI for the mean response is

CI(Mean Response) = ŷ ± t* S √[ 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]
ŷ|x* = a + bx* = the predicted value for one new response at X = x*.

SE(One New Response) = S √[ 1 + 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]

Hence the prediction interval (PI) for one new response is

PI(One New Response) = ŷ ± t* S √[ 1 + 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]
Compare the formulas for the CI and PI to see the difference between them:

CI(Mean Response) = ŷ ± t* S √[ 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]

PI(One New Response) = ŷ ± t* S √[ 1 + 1/n + (x* − X̄)² / Σᵢ (Xᵢ − X̄)² ]
In both of the above formulas,

S = the standard deviation of the points around the regression line = √MSE, with df = df_error;
x* = a particular value of X for which we are making a prediction.
Both the CI and PI are centered around ŷ|x* = the prediction at X = x*.
The PI for a new response is always wider than the CI for the mean response at the same value of X = x*. (Why?)
The SE's, and hence the intervals, will be narrower when x* is close to X̄ = the mean of the sample of X's, and wider when x* is far from X̄. (Why?)
CI and PI for Age Prediction problem
[Fitted line plot: Guessed Age = −0.030 + 1.055 True Age, with 95% CI and 95% PI bands. S = 4.48483, R-Sq = 96.9%, R-Sq(adj) = 96.5%.]
Age prediction example (continued): a) Suppose you want to know, with 95% confidence, the range of your friend's guesses for a 65 year old person.

Here we have one value of X = x* = 65, hence you want a 95% prediction interval at this value. Using the prediction equation we have found, we get the predicted value of Y at X = 65 as

ŷ = −0.03 + 1.05 × 65 = 68.22.
Calculations for the SE's are long and tedious. However, we can use statistical software to get what we want easily. For example, using Minitab we got the prediction interval at X = 65 as PI = (57.22, 79.82).

Observe that the center of the above interval is 68.52. This is the predicted value of Y (ŷ) that Minitab calculated using X = 65. [This is slightly different from what we found because Minitab carries more digits after the decimal point in its calculations.]

b) You want to know, with 95% confidence, what would be the average of your friend's guesses for all people aged 65.

Since we are now looking for the mean of all guessed ages at X = 65, this is a problem of a CI for the mean response. Minitab gives this as CI = (63.98, 73.06).

Observe that both the CI and the PI are centered at the same point, i.e., around ŷ = 68.52.
Finally, observe the difference in the lengths of the intervals we got from Minitab:
95% CI at X = 65 is (63.98, 73.06).
Length of CI = 73.06 – 63.98 = 9.08
95% PI at X = 65 is (57.22, 79.82).
Length of PI = 79.82 – 57.22 = 22.6
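Both intervals can be reproduced from the formulas above. A sketch (x and y as in the first sketch); the results should match Minitab's up to rounding:

    import numpy as np

    n = len(y)
    y_hat = -0.030 + 1.05462 * x
    s = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2))   # S = sqrt(MSE), about 4.485

    x_star = 65
    pred = -0.030 + 1.05462 * x_star                  # about 68.52
    lev = 1 / n + (x_star - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

    t_star = 2.306                                    # t* with df = 8
    ci = (pred - t_star * s * np.sqrt(lev), pred + t_star * s * np.sqrt(lev))
    pi = (pred - t_star * s * np.sqrt(1 + lev), pred + t_star * s * np.sqrt(1 + lev))
    # ci ~ (63.98, 73.06), pi ~ (57.22, 79.82)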
As mentioned before, the PI is ALWAYS wider than the CI at the same level of confidence and the same value of X.
More on R²:

We have seen that R² = (r)². It can also be defined and calculated from the following relation:

R² = SSReg / SST = (Variation in Y explained by the regression) / (Total variation in Y)
This leads to alternative interpretations of R²:

R² is the proportion of variability in Y that is explained by the regression on X, or equivalently,

R² is the proportional reduction in prediction error; that is, R² is the percentage of reduction in prediction error we will see when the prediction equation is used, instead of ȳ = the sample mean of Y, as the predicted value of Y.
Example: In the ANOVA table for the analysis of guessed ages we had the following output:
S = 4.48483 R-Sq = 96.9% R-Sq(adj) = 96.5%
Source          DF      SS      MS       F      P
Regression       1   5030.0  5030.0  250.08  0.000
Residual Error   8    160.9    20.1
Total            9   5190.9

Then, R² = SSReg / SST = 5030.0 / 5190.9 = 0.969 = 96.9%.
This is the same result we had from Minitab, as it should be. We may now interpret this as follows:
The regression model yields a predicted value for Y that has 96.9% less error than we would have if we used the sample mean of Y’s as a predicted value.
More on Residuals:
Residual = the vertical distance from an observed point to the predicted value at the same X
= observed y − predicted y
= y − ŷ, where ŷ = −0.03 + 1.05x.
True Age (x)   Observed y   Predicted (ŷ)   Residual (y − ŷ)
     20            18           21.06            −3.06
     45            52           47.43             4.57
     70            65           73.79            −8.79
     85            90           89.61             0.39
     25            28           26.33             1.67
     50            58           52.70             5.30
     15            13           15.79            −2.79
     60            66           63.25             2.75
     40            44           42.15             1.85
     35            35           36.88            −1.88
Hence, for someone whose actual age is 35, the predicted value of his/her age is 36.88. This means the prediction was 1.88 years higher than the true age.
Positive residuals: Observations above regression line
Negative residuals: Observations below regression line
Sum of residuals = 0 ALWAYS.
We (or computers) can make residual plots to see if there are any problems with the assumptions.
The computer finds "standardized residuals" = z-scores for each observation. Any point that has a z-score bigger than 3 in absolute value, i.e., |z| > 3, is called an outlier.
More on Correlation:
If the distance between a given value of X, say x* and X (in absolute value) is k standard deviations, i.e., | x* – X | = k
S, then the distance (in absolute value) between the predicted value of y ( ˆ ) at x* and
Y is r
k standard deviations, i.e., | y
ˆ –
Y | = r
k
S.
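This relation is easy to verify with the age data; a sketch (x, y, a, b as in the earlier sketches):

    k = 1.0                                      # take x* one sd above the mean
    x_star = x.mean() + k * x.std(ddof=1)        # 44.5 + 22.42 = 66.92
    y_hat_star = a + b * x_star                  # prediction at x*
    z = (y_hat_star - y.mean()) / y.std(ddof=1)  # about r * k = 0.9844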
Example: Suppose Y = heights of children and X = heights of their fathers, and the correlation between the two variables is r = 0.5. Then,

If a father's height is 2 standard deviations above the mean height of all fathers, then the predicted height of his child will be 0.5 × 2 = 1 standard deviation above the mean height of children.

If a father's height is 1.5 standard deviations below the mean height of all fathers, then his child's predicted height will be 0.5 × 1.5 = 0.75 standard deviations below the mean height of all children.
Some more on correlation
Correlation is very much affected by outliers and influential points.
Outliers weaken the correlation.
Influential points (points far from the rest of the observations in the x-direction that do not follow the trend) may change the sign and value of the slope, as the sketch below demonstrates.
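A quick numerical experiment shows the effect; the data below are hypothetical, invented only to demonstrate the point:

    import numpy as np

    rng = np.random.default_rng(0)
    x0 = np.arange(1.0, 11.0)                 # ten x values, 1 through 10
    y0 = 2.0 * x0 + rng.normal(0, 1, 10)      # clear increasing trend, slope near 2

    x1 = np.append(x0, 40.0)                  # one point far out in the x-direction...
    y1 = np.append(y0, 0.0)                   # ...that does not follow the trend

    print(np.polyfit(x0, y0, 1)[0])           # slope near 2
    print(np.polyfit(x1, y1, 1)[0])           # slope pulled down sharply, near zero or negative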
Residual Plots
Residuals are the estimators of the error term (ε) in the regression model. Thus, the assumption of normality of ε can be checked by looking at a histogram of the residuals.

A histogram of residuals that is (almost) bell-shaped (symmetric) supports the assumption of normality of the residuals.
A histogram or a dot plot that shows outliers is indicative of the violation of the assumption of normality.
A normal probability plot or normal quantile plot can also be used to check the normality assumption. Points lying close to a straight line in a normal probability or quantile plot support the assumption of normality.
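Both checks take only a couple of lines; a sketch with matplotlib and SciPy (residuals as computed earlier):

    import matplotlib.pyplot as plt
    from scipy import stats

    plt.hist(residuals, bins=6)           # roughly bell-shaped supports normality
    plt.xlabel("Residual")
    plt.show()

    stats.probplot(residuals, plot=plt)   # normal probability (quantile) plot
    plt.show()                            # points near a straight line support normality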
Plot of residuals against the explanatory variable (X):
Such plots magnify any problems with the assumptions.
If the residuals are randomly scattered around the line residuals = 0, this is good. It means nothing else is left after using X to predict Y.
If the residual plot shows a curved pattern, this indicates that a curvilinear fit (quadratic?) will give better results.
If the residual plot is funnel-shaped, the assumption of constant variance is violated.
If the residual plot shows an outlier, this may mean a violation of normality and/or constant variance, or point to an influential observation.
11.5 Exponential regression
This is one of the nonlinear regression models, of the following form: µ_Y = α β^X or, equivalently, E(Y) = α β^X. The model is called "exponential" because the independent variable X appears as the exponent of the coefficient β.

Observe that when we take the logarithm of the model we obtain log(µ_Y) = log(α) + [log(β)]X; hence the logarithm of the mean of Y is a linear function of X with coefficients log(α) and log(β).

Note that when X = 0, β^X = β^0 = 1. Thus, α gives us the mean of Y at X = 0, since µ_Y = α β^0 = α(1) = α.

The parameter β represents the multiplicative effect of X on Y (as opposed to the additive effect in the simple linear regression we have seen so far). So if, for example, β = 1.5, increasing X by one unit will increase Y by 50% from its previous value; i.e., we need to multiply the value of Y at the previous value of X by 1.5 to obtain the current value.
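Taking logs turns the fit into ordinary SLR on log(y). A sketch with hypothetical data invented for illustration (roughly α = 2, β = 1.5):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.0, 4.4, 6.8, 10.2, 15.1])   # grows multiplicatively with x

    slope, intercept = np.polyfit(x, np.log(y), 1)   # fit log(y) = log(alpha) + log(beta) * x
    alpha = np.exp(intercept)                        # estimated mean of Y at x = 0, near 2
    beta = np.exp(slope)                             # multiplicative effect per unit x, near 1.5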
Summary of SLR
Model: y = α + βx + ε

Assumptions: a) Random sample b) Normal distribution c) Constant variance d) ε ~ N(0, σ)

Parameters and Estimators:
Intercept = α, estimated by a = Ȳ − b X̄
Slope = β, estimated by b = r (S_Y / S_X)
Standard deviation = σ, estimated by S = √MSE
Interpretation of:
Slope
Intercept
R²
r
Testing if the model is good:
ANOVA
The t-test for slope
CI for slope
PI and CI for response
Residual plots and interpretations.