12
Simple Linear Regression
and Correlation
Copyright © Cengage Learning. All rights reserved.
http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
Regression and Causality
http://stats.stackexchange.com/questions/10687/does-simple-linear-regression-imply-causation
Linear Regression: Definitions
X: predictor, explanatory, independent variable
Y: response, dependent variable
Example: Scatterplot
The following data were collected to determine the
relationship between age and the change in
systolic blood pressure (BP, mm Hg) after 24
hours in response to a particular treatment.
a) Draw a scatterplot of this data.
Obs 1 2 3 4 5 6 7 8 9 10 11
Age 70 51 65 70 48 70 45 48 35 48 30
BP -28 -10 -8 -15 -8 -10 -12 3 1 -5 8
Example: Scatterplot (cont)
[Figure: two scatterplots of BP against Age, drawn with different axis scales.]
Error in the Regression Line
Distribution of Y
Linear Regression: Assumptions
1. There is a linear relationship between X and Y.
2. Each (X,Y) pair is random and independent of
the other pairs.
3. Variance of the residuals is constant.
Principle of Least Squares
Height of point − height of line = $y_i - (b_0 + b_1 x_i)$
$g(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2$
$\frac{\partial g(b_0, b_1)}{\partial b_0} = 0, \quad \frac{\partial g(b_0, b_1)}{\partial b_1} = 0$
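Setting the two partial derivatives to zero leads to a pair of linear equations in $b_0$ and $b_1$ (often called the normal equations). A short worked step, using only the symbols already defined on this slide:

```latex
\begin{align*}
\frac{\partial g}{\partial b_0}
  &= -2\sum_{i=1}^{n}\bigl[y_i - (b_0 + b_1 x_i)\bigr] = 0
  &&\Rightarrow\quad n\,b_0 + b_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\
\frac{\partial g}{\partial b_1}
  &= -2\sum_{i=1}^{n} x_i\bigl[y_i - (b_0 + b_1 x_i)\bigr] = 0
  &&\Rightarrow\quad b_0\sum_{i=1}^{n} x_i + b_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\end{align*}
```

Solving these two equations for $b_0$ and $b_1$ gives the least squares estimates.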
Point Estimations
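$b_1 = \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \hat{\beta}_0 = \frac{\sum y_i - b_1 \sum x_i}{n} = \bar{y} - b_1 \bar{x}$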
Example: Least Squares
The following data were collected to determine the
relationship between age and the change in
systolic blood pressure (BP, mm Hg) after 24
hours in response to a particular treatment.
a) What is the regression line for this data?
x̄ = 52.727, ȳ = -7.636, Sxy = -1055.909,
Sxx = 2006.182
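One way to check part a) is to compute the estimates directly from the eleven (Age, BP) pairs in the data table. The sketch below uses only the Python standard library; the function name `least_squares` is an illustrative choice, not something from the slides.

```python
# BP example data copied from the table in the scatterplot example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def least_squares(x, y):
    """Return (b0, b1, s_xx, s_xy) for the fitted line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products about the means.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1, s_xx, s_xy


b0, b1, s_xx, s_xy = least_squares(age, bp)
print(f"Sxx = {s_xx:.3f}, Sxy = {s_xy:.3f}")
print(f"fitted line: BP = {b0:.3f} + ({b1:.4f}) * Age")
```

The computed Sxx and Sxy should agree with the summary values quoted above, and the fitted line comes out at roughly BP = 20.1 − 0.526 · Age.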
Example: Least Squares (cont)
[Figure: plot of BP (vertical axis) against Age (horizontal axis).]
Extrapolation
http://www.sciencedirect.com/science/article/pii/S001021800900114X
Example: Least Squares point estimate
The following data were collected to determine the
relationship between age and the change in
systolic blood pressure (BP, mm Hg) after 24
hours in response to a particular treatment.
c) What is the point estimate for the change in
BP for someone who is 56 years old? 51 years
old?
d) What is the residual at age 51? The
actual data point is (51, -10).
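Parts c) and d) can be worked from the summary statistics quoted earlier in this example (x̄, ȳ, Sxy, Sxx). The following is a minimal sketch; the helper name `predict` is hypothetical.

```python
# Summary statistics as quoted in the least squares example.
x_bar, y_bar = 52.727, -7.636
s_xy, s_xx = -1055.909, 2006.182

b1 = s_xy / s_xx           # slope estimate
b0 = y_bar - b1 * x_bar    # intercept estimate


def predict(age):
    """Point estimate of the BP change at the given age."""
    return b0 + b1 * age


y_hat_56 = predict(56)
y_hat_51 = predict(51)

# Residual = observed value minus fitted value, for the data point (51, -10).
residual_51 = -10 - y_hat_51

print(f"estimate at 56: {y_hat_56:.2f}")
print(f"estimate at 51: {y_hat_51:.2f}")
print(f"residual at 51: {residual_51:.2f}")
```

Both ages lie inside the observed age range (30 to 70), so neither estimate is an extrapolation.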
ANOVA Table
Source               df     SS                                          MS
Model (Regression)   1      $SSM = \sum (\hat{y}_i - \bar{y})^2$          $SSM/1 = SSM$
Error                n-2    $SSE = \sum (y_i - \hat{y}_i)^2$              $MSE = SSE/df_e = SSE/(n-2)$
Total                n-1    $SST = \sum (y_i - \bar{y})^2 = S_{yy}$
Meaning of σ²
Meaning of R²
Cautions about R²
1. Linearity
2. Association
3. Outliers
4. Prediction
Example: Least Squares point estimate
The following data were collected to determine the
relationship between age and the change in
systolic blood pressure (BP, mm Hg) after 24
hours in response to a particular treatment.
e) What proportion of the observed variation in
y can be attributed to the simple linear
regression relationship between x and y?
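The proportion asked for in part e) is R² = 1 − SSE/SST. A sketch of the calculation from the raw (Age, BP) data, standard library only; variable names are illustrative.

```python
# BP example data copied from the table in the scatterplot example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(bp) / n

s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bp))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Total variation in y about its mean (SST = Syy).
sst = sum((y - y_bar) ** 2 for y in bp)
# Variation left unexplained by the fitted line (SSE).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, bp))

r_squared = 1 - sse / sst
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, R^2 = {r_squared:.3f}")
```

For these data R² is a little under 0.6, so roughly 59% of the observed variation in BP change is attributed to the linear relationship with age.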
Example: Least Squares (cont)
[Figure: plot of BP against Age, with one point annotated "outlier?".]
Inference on the slope
http://www.biomedware.com/files/documentation/spacestat/interface/Views/Regression_line.htm
Normality of Y
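$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x}) y_i}{S_{xx}} - \frac{\sum (x_i - \bar{x}) \bar{y}}{S_{xx}}$
$\quad = \frac{\sum (x_i - \bar{x}) y_i}{S_{xx}} - 0 = \frac{\sum (x_i - \bar{x}) y_i}{S_{xx}}$
$\sigma_{b_1}^2 = \frac{\sigma^2}{S_{xx}}$
$\hat{\sigma}_{b_1} = \frac{s}{\sqrt{S_{xx}}} = \sqrt{\frac{MSE}{S_{xx}}}$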
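The variance of the slope follows from the fact that $b_1$ is a linear combination of the $Y_i$. A brief sketch of that step, assuming the $Y_i$ are independent with common variance $\sigma^2$ (the weights $c_i$ are notation introduced here for illustration):

```latex
\begin{align*}
b_1 &= \sum_{i=1}^{n} c_i Y_i, \qquad c_i = \frac{x_i - \bar{x}}{S_{xx}} \\
\operatorname{Var}(b_1)
  &= \sum_{i=1}^{n} c_i^2 \operatorname{Var}(Y_i)
   = \sigma^2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{S_{xx}^2}
   = \frac{\sigma^2}{S_{xx}}
\end{align*}
```

The middle term in the expansion of $b_1$ drops out because $\sum (x_i - \bar{x}) = 0$. Since $\sigma^2$ is unknown in practice, it is estimated by $s^2 = MSE$, which gives the estimated standard error.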
Example: Regression
The cetane number is a critical property in specifying the
ignition quality of a fuel used in a diesel engine.
Determination of this number for a biodiesel fuel is
expensive and time-consuming. Therefore a way of
predicting this number is wanted. The data on the next
slide is x = iodine value (g) and y = cetane number for a
sample of 14 biofuels. The iodine value is the amount
of iodine necessary to saturate a sample of 100g of oil.
Sxx = 6802.7693, Sxy = -1424.41429, x̄ = 93.393, ȳ = 55.657,
SSE = 78.920, SST = 377.174
a) What is the estimated regression line (besides the
equation of the line, include R²)?
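Part a) needs only the summary statistics given above. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Summary statistics quoted in the cetane number example (n = 14 biofuels).
s_xx, s_xy = 6802.7693, -1424.41429
x_bar, y_bar = 93.393, 55.657
sse, sst = 78.920, 377.174

b1 = s_xy / s_xx             # slope: change in cetane number per unit iodine value
b0 = y_bar - b1 * x_bar      # intercept
r_squared = 1 - sse / sst    # proportion of variation explained by the line

print(f"cetane = {b0:.3f} + ({b1:.4f}) * iodine")
print(f"R^2 = {r_squared:.3f}")
```

The slope is negative, so the estimated cetane number decreases as the iodine value increases, and R² is about 0.79.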
Example: Scatterplot
[Figure: scatterplot of cetane number against iodine value (g).]
Example (Example 12.4): CI
The cetane number is a critical property in specifying
the ignition quality of a fuel used in a diesel engine.
Determination of this number for a biodiesel fuel is
expensive and time-consuming. Therefore a way of
predicting this number is wanted. The data on the
next slide is x = iodine value (g) and y = cetane
number for a sample of 14 biofuels. The iodine
value is the amount of iodine necessary to saturate
a sample of 100g of oil.
Sxx = 6802.7693, Sxy = -1424.41429, x̄ = 93.393, ȳ = 55.657,
SSE = 78.920, SST = 377.174
b) What is the 95% CI for the true slope?
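A 95% confidence interval for the slope has the form b1 ± t(0.025, n−2) · s/√Sxx. The sketch below hard-codes the critical value 2.179 for 12 degrees of freedom, taken from a standard t table, so that only the standard library is needed; that value is an input assumed here, not something stated on the slide.

```python
import math

# Summary statistics quoted in the cetane number example.
n = 14
s_xx, s_xy = 6802.7693, -1424.41429
sse = 78.920

# Assumed table value: upper 2.5% point of the t distribution with n - 2 = 12 df.
t_crit = 2.179

b1 = s_xy / s_xx
s = math.sqrt(sse / (n - 2))      # estimate of sigma (square root of MSE)
se_b1 = s / math.sqrt(s_xx)       # estimated standard error of the slope

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The whole interval lies below zero, which is consistent with a negative linear relationship between iodine value and cetane number.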
β1 Hypothesis test: Summary
Example (Example 12.4): Hypothesis test
The cetane number is a critical property in specifying
the ignition quality of a fuel used in a diesel engine.
Determination of this number for a biodiesel fuel is
expensive and time-consuming. Therefore a way of
predicting this number is wanted. The data on the
next slide is x = iodine value (g) and y = cetane
number for a sample of 14 biofuels. The iodine
value is the amount of iodine necessary to saturate
a sample of 100g of oil.
c) Is the model useful (that is, is there a useful linear
relationship between x and y)?
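The model utility question in part c) is usually answered by testing H0: β1 = 0 against Ha: β1 ≠ 0 with t = b1 / s_b1. The sketch below assumes a significance level of 0.05 and the corresponding table value 2.179 for 12 degrees of freedom; neither is specified on the slide.

```python
import math

# Summary statistics quoted in the cetane number example.
n = 14
s_xx, s_xy = 6802.7693, -1424.41429
sse = 78.920

# Assumed: alpha = 0.05, two-sided, so the critical value is t(0.025, 12) = 2.179.
t_crit = 2.179

b1 = s_xy / s_xx
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)

t_stat = b1 / se_b1                 # test statistic for H0: beta1 = 0
reject_h0 = abs(t_stat) > t_crit    # two-sided rejection rule

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

With these numbers the statistic is far beyond the critical value, so H0 is rejected and the linear model is judged useful at the assumed level.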
ANOVA Table
Source               df     SS                                          MS
Model (Regression)   1      $SSM = \sum (\hat{y}_i - \bar{y})^2$          $SSM/1 = SSM$
Error                n-2    $SSE = \sum (y_i - \hat{y}_i)^2$              $MSE = SSE/df_e = SSE/(n-2)$
Total                n-1    $SST = \sum (y_i - \bar{y})^2 = S_{yy}$
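The ANOVA table gives an equivalent route to the model utility test through F = MSM / MSE. A sketch that fills in the table for the cetane example from the quoted SSE and SST; the critical value 4.75 for F(0.05; 1, 12) is an assumed table value, and the 0.05 level is likewise an assumption.

```python
# Sums of squares quoted in the cetane number example (n = 14).
n = 14
sse, sst = 78.920, 377.174

ssm = sst - sse          # model (regression) sum of squares
df_model = 1
df_error = n - 2

msm = ssm / df_model     # mean square for the model
mse = sse / df_error     # mean square error, the estimate of sigma^2
f_stat = msm / mse

# Assumed table value: upper 5% point of F with (1, 12) degrees of freedom.
f_crit = 4.75
reject_h0 = f_stat > f_crit

print(f"SSM = {ssm:.3f}, MSE = {mse:.3f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```

In simple linear regression this F statistic equals the square of the t statistic for the slope, so the two tests always lead to the same conclusion.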
12.4
Inferences Concerning Y x and the
Prediction of Future Y Values

Copyright © Cengage Learning. All rights reserved.
 Hypothesis test: Summary
Example (12.4): Hypothesis test for 
The cetane number is a critical property in specifying the
ignition quality of a fuel used in a diesel engine.
Determination of this number for a biodiesel fuel is
expensive and time-consuming. Therefore a way of
predicting this number is wanted. The data on the next
slide is x = iodine value (g) and y = cetane number for a
sample of 14 biofuels. The iodine value is the amount
of iodine necessary to saturate a sample of 100g of oil.
d) Is the model useful (that is, is there a useful linear
relationship between x and y) using the population
correlation coefficient?
SSE = 78.9192, SST = Syy = 377.1743
Sxx = 6802.7693, Sxy = -1424.41429
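For part d), the sample correlation is r = Sxy / √(Sxx · Syy), and H0: ρ = 0 can be tested with t = r√(n−2) / √(1−r²). The sketch below uses Sxx = 6802.7693, the value given in parts a) and b) of this example, and again assumes a 0.05 level with table value 2.179 for 12 degrees of freedom.

```python
import math

# Summary statistics for the cetane number example (n = 14).
n = 14
s_xx = 6802.7693      # value given in parts a) and b) of the example
s_xy = -1424.41429
s_yy = 377.1743       # SST = Syy

# Assumed: alpha = 0.05, two-sided, critical value t(0.025, 12) = 2.179.
t_crit = 2.179

r = s_xy / math.sqrt(s_xx * s_yy)                      # sample correlation
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic for H0: rho = 0
reject_h0 = abs(t_stat) > t_crit

print(f"r = {r:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The t statistic matches the one from the test on the slope, as expected: in simple linear regression the test of ρ = 0 and the test of β1 = 0 are equivalent.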
Residual Plots
[Figure: residual plot panels labeled "Good" and "Linearity Violation".]
Residual Plots
[Figure: residual plot panels labeled "Good" and "Constant variance violation".]
Residual Plots
Example: SLR 1 – Residual Plot
Example: SLR 1 – Normality