Simple Linear Regression

Simple Linear Regression
Chapter Topics







- Types of Regression Models
- Determining the Simple Linear Regression Equation
- Measures of Variation
- Assumptions of Regression and Correlation
- Residual Analysis
- Measuring Autocorrelation
- Inferences about the Slope
Chapter Topics



(continued)
- Correlation: Measuring the Strength of the Association
- Estimation of Mean Values and Prediction of Individual Values
- Pitfalls in Regression and Ethical Issues
Purpose of Regression Analysis

Regression analysis is used primarily to model causality and provide prediction:
- Predict the values of a dependent (response) variable based on values of at least one independent (explanatory) variable
- Explain the effect of the independent variables on the dependent variable
Types of Regression Models
- Positive linear relationship
- Negative linear relationship
- Relationship NOT linear
- No relationship
Simple Linear Regression Model



- Relationship between the variables is described by a linear function
- A change in one variable causes the other variable to change
- A dependency of one variable on the other
Simple Linear Regression Model
(continued)
The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:

    Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,    \mu_{Y|X} = \beta_0 + \beta_1 X_i

where Y_i is the dependent (response) variable, X_i is the independent (explanatory) variable, \beta_0 is the population Y-intercept, \beta_1 is the population slope coefficient, \varepsilon_i is the random error, and \mu_{Y|X} is the population regression line (conditional mean).
Simple Linear Regression Model
(continued)
[Figure: observed value of Y plotted against X. Each observed value Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i lies a random error \varepsilon_i above or below the population regression line (conditional mean) \mu_{Y|X} = \beta_0 + \beta_1 X_i.]
Linear Regression Equation
The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:

    Y_i = b_0 + b_1 X_i + e_i

where b_0 is the sample Y-intercept, b_1 is the sample slope coefficient, and e_i is the residual. The simple regression equation (fitted regression line, predicted value) is

    \hat{Y} = b_0 + b_1 X
Linear Regression Equation
(continued)

b_0 and b_1 are obtained by finding the values of b_0 and b_1 that minimize the sum of the squared residuals:

    \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2

- b_0 provides an estimate of \beta_0
- b_1 provides an estimate of \beta_1
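As a hedged illustration (not part of the original slides), the following Python sketch computes b_0 and b_1 directly from these formulas; the data values are hypothetical:

    import numpy as np

    # Hypothetical sample data; any paired observations work.
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Least-squares estimates minimize sum((Y_i - Yhat_i)^2):
    # b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2), b0 = Ybar - b1 * Xbar
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()

    e = Y - (b0 + b1 * X)          # residuals e_i = Y_i - Yhat_i
    print(b0, b1, np.sum(e ** 2))  # intercept, slope, minimized SSE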
Linear Regression Equation
(continued)
[Figure: the sample regression line \hat{Y}_i = b_0 + b_1 X_i fitted through observed values Y_i = b_0 + b_1 X_i + e_i, shown against the population line \mu_{Y|X} = \beta_0 + \beta_1 X_i; b_0 estimates \beta_0, b_1 estimates \beta_1, and e_i is the vertical distance from an observed value to the fitted line.]
Interpretation of the Slope and Intercept
- \beta_0 = E(Y | X = 0) is the average value of Y when the value of X is zero
- \beta_1 = \frac{\text{change in } E(Y|X)}{\text{change in } X} measures the change in the average value of Y as a result of a one-unit change in X
Interpretation of the Slope and Intercept
(continued)
- b_0 = \hat{E}(Y | X = 0) is the estimated average value of Y when the value of X is zero
- b_1 = \frac{\text{change in } \hat{E}(Y|X)}{\text{change in } X} is the estimated change in the average value of Y as a result of a one-unit change in X
Simple Linear Regression: Example

You wish to examine the linear dependency of the annual sales of produce stores on their sizes in square footage. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store | Square Feet | Annual Sales ($1000)
1     | 1,726       | 3,681
2     | 1,542       | 3,395
3     | 2,816       | 6,653
4     | 5,555       | 9,543
5     | 1,292       | 3,318
6     | 2,208       | 5,563
7     | 1,313       | 3,760
Scatter Diagram: Example

[Scatter plot of Annual Sales ($000) versus Square Feet for the 7 stores; Excel output.]
Simple Linear Regression Equation: Example

    \hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i

From the Excel printout:

                  Coefficients
    Intercept     1636.414726
    X Variable 1  1.486633657
Graph of the Simple Linear Regression Equation: Example

[Scatter plot of Annual Sales ($000) versus Square Feet with the fitted regression line.]
Interpretation of Results: Example

    \hat{Y}_i = 1636.415 + 1.487 X_i

The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units. The equation estimates that for each increase of 1 square foot in the size of the store, expected annual sales are predicted to increase by $1,487.
Simple Linear Regression in PHStat
- In Excel, use PHStat | Regression | Simple Linear Regression …
- Excel spreadsheet of the regression of Sales on Footage
Measures of Variation: The Sum of Squares

    SST = SSR + SSE
    (Total sample variability) = (Explained variability) + (Unexplained variability)
Measures of Variation: The Sum of Squares
(continued)
- SST = Total Sum of Squares: measures the variation of the Y_i values around their mean \bar{Y}
- SSR = Regression Sum of Squares: explained variation attributable to the relationship between X and Y
- SSE = Error Sum of Squares: variation attributable to factors other than the relationship between X and Y
Measures of Variation: The Sum of Squares
(continued)

    SST = \sum (Y_i - \bar{Y})^2
    SSR = \sum (\hat{Y}_i - \bar{Y})^2
    SSE = \sum (Y_i - \hat{Y}_i)^2

[Figure: at a given X_i, the deviations (Y_i - \bar{Y}), (\hat{Y}_i - \bar{Y}), and (Y_i - \hat{Y}_i) are shown on the scatter plot.]
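A short Python sketch (an addition, using the produce-store data from the earlier example) that computes the three sums of squares and verifies SST = SSR + SSE:

    import numpy as np

    # Produce-store data from the example: square feet (X), annual sales in $000 (Y).
    X = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
    Y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    Y_hat = b0 + b1 * X

    SST = np.sum((Y - Y.mean()) ** 2)      # total variation of Y around its mean
    SSR = np.sum((Y_hat - Y.mean()) ** 2)  # variation explained by the regression
    SSE = np.sum((Y - Y_hat) ** 2)         # unexplained (error) variation

    print(SST, SSR, SSE)                   # SST = SSR + SSE (up to rounding)
    print("r^2 =", SSR / SST)              # coefficient of determination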
Venn Diagrams and Explanatory Power of Regression

[Venn diagram of Sizes and Sales:]
- Variations in store sizes not used in explaining variation in sales
- Variations in sales explained by the error term, i.e., unexplained by sizes (SSE)
- Variations in sales explained by sizes, i.e., variations in sizes used in explaining variation in sales (SSR)
The ANOVA Table in Excel

ANOVA       | df    | SS  | MS                | F       | Significance F
Regression  | k     | SSR | MSR = SSR/k       | MSR/MSE | P-value of the F test
Residuals   | n-k-1 | SSE | MSE = SSE/(n-k-1) |         |
Total       | n-1   | SST |                   |         |
Measures of Variation — The Sum of Squares: Example

Excel output for produce stores:

ANOVA       | df | SS          | MS        | F        | Significance F
Regression  | 1  | 30380456.12 | 30380456  | 81.17909 | 0.000281201
Residual    | 5  | 1871199.595 | 374239.92 |          |
Total       | 6  | 32251655.71 |           |          |

Here df = degrees of freedom: regression (explained) df = 1, error (residual) df = 5, total df = 6; the SS column gives SSR, SSE, and SST.
The Coefficient of Determination

    r^2 = \frac{SSR}{SST} = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}

Measures the proportion of variation in Y that is explained by the independent variable X in the regression model.
Venn Diagrams and Explanatory Power of Regression

    r^2 = \frac{SSR}{SSR + SSE}

[Venn diagram of Sales and Sizes.]
Coefficients of Determination (r^2) and Correlation (r)

[Four panels, each showing a fitted line \hat{Y}_i = b_0 + b_1 X_i:]
- r^2 = 1, r = +1 (all points exactly on an upward-sloping line)
- r^2 = 1, r = -1 (all points exactly on a downward-sloping line)
- r^2 = .81, r = +0.9 (points scattered closely around an upward-sloping line)
- r^2 = 0, r = 0 (no linear relationship)
Standard Error of Estimate

    S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}

Measures the standard deviation (variation) of the Y values around the regression equation.
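A minimal sketch of this computation on the produce-store data (an illustrative addition; the result, about 611.75, matches the Excel output on the next slide):

    import numpy as np

    X = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
    Y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
    n = len(X)

    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    SSE = np.sum((Y - (b0 + b1 * X)) ** 2)

    S_YX = np.sqrt(SSE / (n - 2))   # standard error of the estimate
    print(S_YX)                     # about 611.75 for these data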
Measures of Variation: Produce Store Example

Excel output for produce stores:

    Regression Statistics
    Multiple R          0.9705572
    R Square            0.94198129   <- r^2 = .94
    Adjusted R Square   0.93037754
    Standard Error      611.751517   <- S_YX
    Observations        7            <- n

94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
Linear Regression Assumptions
- Normality
  - Y values are normally distributed for each X
  - Probability distribution of error is normal
- Homoscedasticity (constant variance)
- Independence of errors
Consequences of Violation of the Assumptions
- Violation of the assumptions
  - Non-normality (error not normally distributed)
  - Heteroscedasticity (variance not constant)
    - Usually happens in cross-sectional data
  - Autocorrelation (errors are not independent)
    - Usually happens in time-series data
- Consequences of any violation of the assumptions
  - Predictions and estimations obtained from the sample regression line will not be accurate
  - Hypothesis testing results will not be reliable
- It is important to verify the assumptions
Variation of Errors Around the Regression Line
- Y values are normally distributed around the regression line.
- For each X value, the "spread" or variance around the regression line is the same.

[Figure: normal error distributions f(e) of identical spread centered on the sample regression line at X_1 and X_2.]
Residual Analysis
- Purposes
  - Examine linearity
  - Evaluate violations of assumptions
- Graphical analysis of residuals
  - Plot residuals vs. X and time
Residual Analysis for Linearity

[Figure: two pairs of plots of Y vs. X and residuals e vs. X. A curved residual pattern indicates the relationship is not linear; a patternless band indicates it is linear.]
Residual Analysis for Homoscedasticity

[Figure: two pairs of plots of Y vs. X and standardized residuals (SR) vs. X. A fanning spread indicates heteroscedasticity; a constant spread indicates homoscedasticity.]
Residual Analysis: Excel Output for Produce Stores Example

[Excel output; residual plot of residuals vs. Square Feet.]

Observation | Predicted Y  | Residuals
1           | 4202.344417  | -521.3444173
2           | 3928.803824  | -533.8038245
3           | 5822.775103  | 830.2248971
4           | 9894.664688  | -351.6646882
5           | 3557.14541   | -239.1454103
6           | 4918.90184   | 644.0981603
7           | 3588.364717  | 171.6352829
Residual Analysis for Independence
- The Durbin-Watson statistic
  - Used when data are collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
  - Measures violation of the independence assumption

    D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}

D should be close to 2. If not, examine the model for autocorrelation.
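An illustrative Python sketch of the statistic (the residuals here are hypothetical, not from the slides):

    import numpy as np

    def durbin_watson(e):
        # D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Hypothetical residuals from a time-ordered regression.
    e = np.array([1.2, -0.8, 0.5, 0.9, -1.1, 0.3, -0.4])
    print(durbin_watson(e))  # values near 2 suggest no autocorrelation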
Durbin-Watson Statistic in PHStat
- PHStat | Regression | Simple Linear Regression …
  - Check the box for Durbin-Watson Statistic
Obtaining the Critical Values of the Durbin-Watson Statistic

Table 13.4  Finding Critical Values of the Durbin-Watson Statistic (α = .05)

     |  k = 1      |  k = 2
n    | dL   | dU   | dL   | dU
15   | 1.08 | 1.36 | .95  | 1.54
16   | 1.10 | 1.37 | .98  | 1.54
Using the Durbin-Watson Statistic

H0: No autocorrelation (error terms are independent)
H1: There is autocorrelation (error terms are not independent)

Decision regions on the scale from 0 to 4:
- 0 to dL: reject H0 (positive autocorrelation)
- dL to dU: inconclusive
- dU to 4-dU: accept H0 (no autocorrelation); D near 2
- 4-dU to 4-dL: inconclusive
- 4-dL to 4: reject H0 (negative autocorrelation)
Residual Analysis for Independence: Graphical Approach

[Figure: residuals e plotted against time. A cyclical pattern indicates the errors are not independent; no particular pattern indicates independence. The residual is plotted against time to detect any autocorrelation.]
Inference about the Slope: t Test
- t test for a population slope
  - Is there a linear dependency of Y on X?
- Null and alternative hypotheses
  - H0: β1 = 0 (no linear dependency)
  - H1: β1 ≠ 0 (linear dependency)
- Test statistic:

    t = \frac{b_1 - \beta_1}{S_{b_1}}, \quad S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}, \quad d.f. = n - 2
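A hedged Python sketch of this t test on the produce-store data (an addition; scipy is assumed available for the p-value):

    import numpy as np
    from scipy import stats

    X = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
    Y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
    n = len(X)

    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    SSE = np.sum((Y - (b0 + b1 * X)) ** 2)
    S_YX = np.sqrt(SSE / (n - 2))

    S_b1 = S_YX / np.sqrt(np.sum((X - X.mean()) ** 2))  # standard error of b1
    t = (b1 - 0) / S_b1                                 # test H0: beta1 = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)                # two-tailed p-value
    print(t, p)                                         # about 9.01, 0.00028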
Example: Produce Store

Data for 7 stores:

Store | Square Feet | Annual Sales ($000)
1     | 1,726       | 3,681
2     | 1,542       | 3,395
3     | 2,816       | 6,653
4     | 5,555       | 9,543
5     | 1,292       | 3,318
6     | 2,208       | 5,563
7     | 1,313       | 3,760

Estimated regression equation:

    \hat{Y}_i = 1636.415 + 1.487 X_i

The slope of this model is 1.487. Does square footage affect annual sales?
Inferences about the Slope: t Test Example

H0: β1 = 0    H1: β1 ≠ 0    α = .05    df = 7 - 2 = 5

Test statistic, from the Excel printout (t = b_1 / S_{b_1}):

           | Coefficients | Standard Error | t Stat | P-value
Intercept  | 1636.4147    | 451.4953       | 3.6244 | 0.01515
Footage    | 1.4866       | 0.1650         | 9.0099 | 0.00028

Critical values: ±2.5706 (rejection regions of .025 in each tail). Since t = 9.0099 > 2.5706 (p-value = 0.00028 < .05), reject H0.

Conclusion: there is evidence that square footage affects annual sales.
Inferences about the Slope: Confidence Interval Example

Confidence interval estimate of the slope:

    b_1 \pm t_{n-2} S_{b_1}

Excel printout for produce stores:

           | Lower 95%  | Upper 95%
Intercept  | 475.810926 | 2797.01853
Footage    | 1.06249037 | 1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0. Conclusion: there is a significant linear dependency of annual sales on the size of the store.
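A small sketch reproducing this interval from the rounded slide values (an illustrative addition, assuming scipy for the t critical value):

    from scipy import stats

    # From the produce-store example: b1 = 1.4866, S_b1 = 0.1650, n = 7.
    b1, S_b1, n = 1.4866, 0.1650, 7
    t_crit = stats.t.ppf(0.975, df=n - 2)      # 2.5706 for df = 5

    lower, upper = b1 - t_crit * S_b1, b1 + t_crit * S_b1
    print(lower, upper)                         # about (1.062, 1.911); excludes 0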
Inferences about the Slope: F Test
- F test for a population slope
  - Is there a linear dependency of Y on X?
- Null and alternative hypotheses
  - H0: β1 = 0 (no linear dependency)
  - H1: β1 ≠ 0 (linear dependency)
- Test statistic:

    F = \frac{SSR / 1}{SSE / (n - 2)}

  with numerator d.f. = 1 and denominator d.f. = n - 2
Relationship between a t Test and an F Test
- Null and alternative hypotheses
  - H0: β1 = 0 (no linear dependency)
  - H1: β1 ≠ 0 (linear dependency)

    t_{n-2}^2 = F_{1, n-2}

- The p-value of the t test and the p-value of the F test are exactly the same
- The rejection region of the F test is always in the upper tail
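A quick numerical check of this relationship using the produce-store results (an illustrative addition):

    from scipy import stats

    # From the produce-store example: t = 9.0099 with df = 5, F = 81.179 with df = (1, 5).
    t, F = 9.0099, 81.179
    print(t ** 2)                        # 81.18 -- equals F up to rounding

    # The p-values agree as well:
    print(2 * stats.t.sf(t, df=5))       # two-tailed t-test p-value
    print(stats.f.sf(F, dfn=1, dfd=5))   # upper-tail F-test p-value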
Inferences about the Slope: F Test Example

H0: β1 = 0    H1: β1 ≠ 0    α = .05
numerator df = 1, denominator df = 7 - 2 = 5

Test statistic, from the Excel printout:

ANOVA       | df | SS          | MS          | F      | Significance F
Regression  | 1  | 30380456.12 | 30380456.12 | 81.179 | 0.000281
Residual    | 5  | 1871199.595 | 374239.919  |        |
Total       | 6  | 32251655.71 |             |        |

The critical value of F_{1, n-2} at α = .05 is 6.61. Since F = 81.179 > 6.61 (p-value = 0.000281), reject H0.

Conclusion: there is evidence that square footage affects annual sales.
Purpose of Correlation Analysis
- Correlation analysis is used to measure the strength of the association (linear relationship) between 2 numerical variables
  - Only the strength of the relationship is concerned
  - No causal effect is implied
Purpose of Correlation Analysis
(continued)
- The population correlation coefficient ρ (rho) is used to measure the strength of the relationship between the variables:

    \rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
Purpose of Correlation Analysis
(continued)
- The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations:

    r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
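A minimal sketch of this formula on the produce-store data (an illustrative addition):

    import numpy as np

    X = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
    Y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

    r = (np.sum((X - X.mean()) * (Y - Y.mean()))
         / np.sqrt(np.sum((X - X.mean()) ** 2) * np.sum((Y - Y.mean()) ** 2)))
    print(r)    # about 0.9706, matching Multiple R in the Excel output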
Sample Observations from Various r Values

[Figure: five scatter plots of Y vs. X illustrating r = -1, r = -.6, r = 0, r = .6, and r = 1.]
Features of ρ and r
- Unit free
- Range between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship
t Test for Correlation
- Hypotheses
  - H0: ρ = 0 (no correlation)
  - H1: ρ ≠ 0 (correlation)
- Test statistic:

    t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}

  where

    r = \sqrt{r^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
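A hedged sketch of the test using the slide's rounded values (an addition; scipy assumed):

    import numpy as np
    from scipy import stats

    r, n = 0.9706, 7                          # from the produce-store example
    t = r / np.sqrt((1 - r ** 2) / (n - 2))   # about 9.01, same as the slope t test
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(t, p)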
Example: Produce Stores

Is there any evidence of a linear relationship between the annual sales of a store and its square footage at the .05 level of significance?

From the Excel printout:

    Regression Statistics
    Multiple R          0.9705572   <- r
    R Square            0.94198129
    Adjusted R Square   0.93037754
    Standard Error      611.751517
    Observations        7

H0: ρ = 0 (no association)    H1: ρ ≠ 0 (association)
α = .05, df = 7 - 2 = 5
Example: Produce Stores — Solution

    t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.9706}{\sqrt{\frac{1 - .9420}{5}}} = 9.0099

Critical values: ±2.5706 (rejection regions of .025 in each tail). Since t = 9.0099 > 2.5706, reject H0.

Conclusion: there is evidence of a linear relationship at the 5% level of significance. The value of the t statistic is exactly the same as the t statistic value for the test on the slope coefficient.
Estimation of Mean Values

Confidence interval estimate for \mu_{Y|X = X_i}, the mean of Y given a particular X_i:

    \hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}

where t_{n-2} is the t value from the table with df = n - 2 and S_{YX} is the standard error of the estimate. The size of the interval varies according to the distance of X_i from the mean \bar{X}.
Prediction of Individual Values

Prediction interval for an individual response Y_i at a particular X_i:

    \hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}

The addition of 1 under the square root increases the width of the interval relative to that for the mean of Y.
Interval Estimates for Different Values of X

[Figure: the confidence interval for the mean of Y and the wider prediction interval for an individual Y_i, both widening as X moves away from \bar{X}, shown at a given X.]
Example: Produce Stores

Data for 7 stores:

Store | Square Feet | Annual Sales ($000)
1     | 1,726       | 3,681
2     | 1,542       | 3,395
3     | 2,816       | 6,653
4     | 5,555       | 9,543
5     | 1,292       | 3,318
6     | 2,208       | 5,563
7     | 1,313       | 3,760

Regression model obtained:

    \hat{Y}_i = 1636.415 + 1.487 X_i

Consider a store with 2,000 square feet.
Estimation of Mean Values: Example

Confidence interval estimate for \mu_{Y|X = X_i}: find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

Predicted sales: \hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45 ($000)

With \bar{X} = 2350.29, S_{YX} = 611.75, and t_{n-2} = t_5 = 2.5706:

    \hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 612.66

    3997.02 \le \mu_{Y|X = X_i} \le 5222.34
Prediction Interval for Y: Example

Prediction interval for an individual Y_{X = X_i}: find the 95% prediction interval for the annual sales of one particular store of 2,000 square feet.

Predicted sales: \hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45 ($000)

With \bar{X} = 2350.29, S_{YX} = 611.75, and t_{n-2} = t_5 = 2.5706:

    \hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 1687.68

    2922.00 \le Y_{X = X_i} \le 6297.37
Estimation of Mean Values and Prediction of Individual Values in PHStat
- In Excel, use PHStat | Regression | Simple Linear Regression …
  - Check the "Confidence and Prediction Interval for X=" box
- Excel spreadsheet of the regression of Sales on Footage
Pitfalls of Regression Analysis
- Lacking an awareness of the assumptions underlying least-squares regression
- Not knowing how to evaluate the assumptions
- Not knowing what the alternatives to least-squares regression are if a particular assumption is violated
- Using a regression model without knowledge of the subject matter
Strategy for Avoiding the Pitfalls of Regression
- Start with a scatter plot of Y versus X to observe a possible relationship
- Perform residual analysis to check the assumptions
- Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality
Strategy for Avoiding the Pitfalls of Regression
(continued)
- If there is violation of any assumption, use alternatives to least-squares regression (e.g., least absolute deviation regression or least median of squares regression) or alternative least-squares models (e.g., curvilinear or multiple regression)
- If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
Chapter Summary
- Introduced types of regression models
- Discussed determining the simple linear regression equation
- Described measures of variation
- Addressed assumptions of regression and correlation
- Discussed residual analysis
- Addressed measuring autocorrelation
Chapter Summary
(continued)
- Described inference about the slope
- Discussed correlation: measuring the strength of the association
- Addressed estimation of mean values and prediction of individual values
- Discussed pitfalls in regression and ethical issues
Introduction to Multiple Regression

Chapter Topics
- The multiple regression model
- Residual analysis
- Testing for the significance of the regression model
- Inferences on the population regression coefficients
- Testing portions of the multiple regression model
- Dummy-variables and interaction terms
The Multiple Regression Model

The relationship between 1 dependent and 2 or more independent variables is a linear function:

    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i

where Y_i is the dependent (response) variable, the X_{ji} are the independent (explanatory) variables, \beta_0 is the population Y-intercept, the \beta_j are the population slopes, and \varepsilon_i is the random error.
Multiple Regression Model: Bivariate Model

[Figure: response plane in (X_1, X_2, Y) space. An observed Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i lies a random error \varepsilon_i above or below the population response plane \mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} at the point (X_{1i}, X_{2i}).]
Multiple Regression Equation: Bivariate Model

[Figure: fitted response plane. An observed Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i lies a residual e_i above or below the fitted plane \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} at the point (X_{1i}, X_{2i}).]
Multiple Regression Equation

Too complicated by hand! Ouch!
Interpretation of Estimated Coefficients
- Slope (b_j)
  - The average value of Y is estimated to change by b_j for each 1-unit increase in X_j, holding all other variables constant (ceteris paribus)
  - Example: if b_1 = -2, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1-degree increase in temperature (X_1), given the inches of insulation (X_2)
- Y-intercept (b_0)
  - The estimated average value of Y when all X_j = 0
Multiple Regression Model: Example

Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal) | Temp (°F) | Insulation
275.30    | 40        | 3
363.80    | 27        | 3
164.30    | 40        | 10
40.80     | 73        | 6
94.30     | 64        | 6
230.90    | 34        | 6
366.70    | 9         | 6
300.60    | 8         | 10
237.80    | 23        | 10
121.40    | 63        | 3
31.40     | 65        | 10
203.50    | 41        | 6
441.10    | 21        | 3
323.00    | 38        | 3
52.50     | 58        | 10
Multiple Regression Equation: Example

    \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}

Excel output:

                  Coefficients
    Intercept     562.1510092
    X Variable 1  -5.436580588
    X Variable 2  -20.01232067

    \hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}

For each degree increase in temperature, the estimated average amount of heating oil used decreases by 5.437 gallons, holding insulation constant. For each one-inch increase in insulation, the estimated average use of heating oil decreases by 20.012 gallons, holding temperature constant.
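A hedged Python sketch that refits this equation from the heating-oil data above with ordinary least squares (an illustrative addition; numpy's lstsq is used in place of Excel):

    import numpy as np

    # Heating-oil data from the example: gallons (oil), temperature, insulation.
    oil   = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                      237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
    temp  = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58], dtype=float)
    insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10], dtype=float)

    # Design matrix with a column of ones for the intercept; solve least squares.
    X = np.column_stack([np.ones_like(temp), temp, insul])
    b, *_ = np.linalg.lstsq(X, oil, rcond=None)
    print(b)    # about [562.151, -5.437, -20.012]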
Multiple Regression in PHStat
- PHStat | Regression | Multiple Regression …
  - Excel spreadsheet for the heating oil example
Venn Diagrams and Explanatory Power of Regression

[Venn diagram of Temp and Oil:]
- Variations in Temp not used in explaining variation in Oil
- Variations in Oil explained by the error term (SSE)
- Variations in Oil explained by Temp, i.e., variations in Temp used in explaining variation in Oil (SSR)
Venn Diagrams and Explanatory Power of Regression
(continued)

    r^2 = \frac{SSR}{SSR + SSE}

[Venn diagram of Oil and Temp.]
Venn Diagrams and Explanatory Power of Regression

[Venn diagram of Temp, Insulation, and Oil:]
- Variation NOT explained by Temp nor Insulation (SSE)
- Overlapping variation in both Temp and Insulation is used in explaining the variation in Oil, but NOT in the estimation of β1 nor β2
Coefficient of Multiple Determination
- Proportion of total variation in Y explained by all X variables taken together:

    r^2_{Y.12\ldots k} = \frac{SSR}{SST} = \frac{\text{Explained Variation}}{\text{Total Variation}}

- Never decreases when a new X variable is added to the model
  - Disadvantage when comparing among models
Venn Diagrams and Explanatory Power of Regression

    r^2_{Y.12} = \frac{SSR}{SSR + SSE}

[Venn diagram of Oil, Temp, and Insulation.]
Adjusted Coefficient of Multiple Determination
- Proportion of variation in Y explained by all the X variables, adjusted for the sample size and the number of X variables used:

    r^2_{adj} = 1 - \left(1 - r^2_{Y.12\ldots k}\right) \frac{n - 1}{n - k - 1}

- Penalizes excessive use of independent variables
- Smaller than r^2_{Y.12\ldots k}
- Useful in comparing among models
- Can decrease if an insignificant new X variable is added to the model
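A one-function sketch of this adjustment (an illustrative addition), checked against the example's n = 15, k = 2:

    # Adjusted r^2 penalizes extra explanatory variables:
    # r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    def adjusted_r2(r2, n, k):
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.96561, n=15, k=2))   # about 0.9599, matching the Excel output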
Coefficient of Multiple Determination

Excel output:

    Regression Statistics
    Multiple R          0.982654757
    R Square            0.965610371   <- r^2_{Y.12} = SSR/SST
    Adjusted R Square   0.959878766
    Standard Error      26.01378323
    Observations        15

Adjusted r^2 reflects the number of explanatory variables and the sample size, and is smaller than r^2.
Interpretation of Coefficient of Multiple Determination

    r^2_{Y.12} = \frac{SSR}{SST} = .9656

96.56% of the total variation in heating oil can be explained by temperature and amount of insulation.

    r^2_{adj} = .9599

95.99% of the total fluctuation in heating oil can be explained by temperature and amount of insulation after adjusting for the number of explanatory variables and sample size.
Simple and Multiple Regression Compared
- The slope coefficient in a simple regression picks up the impact of the independent variable plus the impacts of other variables that are excluded from the model but are correlated with the included independent variable and the dependent variable
- Coefficients in a multiple regression net out the impacts of other variables in the equation; hence, they are called net regression coefficients
  - They still pick up the effects of other variables that are excluded from the model but are correlated with the included independent variables and the dependent variable
Simple and Multiple Regression Compared: Example
- Two simple regressions:
  - Oil = \beta_0 + \beta_1 Temp + \varepsilon
  - Oil = \beta_0 + \beta_2 Insulation + \varepsilon
- Multiple regression:
  - Oil = \beta_0 + \beta_1 Temp + \beta_2 Insulation + \varepsilon
Simple and Multiple Regression Compared: Slope Coefficients

Oil = b_0 + b_1 Temp + b_2 Insulation + e:
                  Coefficients
    Intercept     562.1510092
    Temp          -5.436580588
    Insulation    -20.01232067

Oil = b_0 + b_1 Temp + e:
                  Coefficients
    Intercept     436.4382299
    Temp          -5.462207697

Oil = b_0 + b_2 Insulation + e:
                  Coefficients
    Intercept     345.3783784
    Insulation    -20.35027027

The slopes are close but not equal: -5.4366 ≠ -5.4622 and -20.0123 ≠ -20.3503.
Simple and Multiple Regression Compared: r^2

Oil = \beta_0 + \beta_1 Temp + \beta_2 Insulation + \varepsilon:
    Regression Statistics
    Multiple R          0.982654757
    R Square            0.965610371
    Adjusted R Square   0.959878766
    Standard Error      26.01378323
    Observations        15

Oil = \beta_0 + \beta_1 Temp + \varepsilon:
    Regression Statistics
    Multiple R          0.86974117
    R Square            0.756449704
    Adjusted R Square   0.737715065
    Standard Error      66.51246564
    Observations        15

Oil = \beta_0 + \beta_1 Insulation + \varepsilon:
    Regression Statistics
    Multiple R          0.465082527
    R Square            0.216301757
    Adjusted R Square   0.156017277
    Standard Error      119.3117327
    Observations        15

Note: 0.96561 ≠ 0.97275 = 0.75645 + 0.21630; the multiple-regression r^2 is not the sum of the two simple r^2 values.
Example: Adjusted r^2 Can Decrease

Oil = \beta_0 + \beta_1 Temp + \beta_2 Insulation + \varepsilon:
    Regression Statistics
    Multiple R          0.982654757
    R Square            0.965610371
    Adjusted R Square   0.959878766
    Standard Error      26.01378323
    Observations        15

Oil = \beta_0 + \beta_1 Temp + \beta_2 Insulation + \beta_3 Color + \varepsilon:
    Regression Statistics
    Multiple R          0.983482856
    R Square            0.967238528
    Adjusted R Square   0.958303581
    Standard Error      25.72417272
    Observations        15

Adjusted r^2 decreases when k increases from 2 to 3: Color is not useful in explaining the variation in oil consumption.
Using the Regression Equation to Make Predictions

Predict the amount of heating oil used for a home if the average temperature is 30° and the insulation is 6 inches:

    \hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}
              = 562.151 - 5.437(30) - 20.012(6)
              = 278.969

The predicted heating oil used is 278.97 gallons.
Predictions in PHStat
- PHStat | Regression | Multiple Regression …
  - Check the "Confidence and Prediction Interval Estimate" box
- Excel spreadsheet for the heating oil example
Residual Plots
- Residuals vs. \hat{Y}
  - May need to transform the Y variable
- Residuals vs. X_1
  - May need to transform the X_1 variable
- Residuals vs. X_2
  - May need to transform the X_2 variable
- Residuals vs. time
  - May have autocorrelation
Residual Plots: Example

[Temperature residual plot: maybe some nonlinear relationship. Insulation residual plot: no discernible pattern.]
Testing for Overall Significance
- Shows if Y depends linearly on all of the X variables together as a group
- Use the F test statistic
- Hypotheses:
  - H0: β1 = β2 = … = βk = 0 (no linear relationship)
  - H1: at least one βi ≠ 0 (at least one independent variable affects Y)
- The null hypothesis is a very strong statement
- The null hypothesis is almost always rejected
Testing for Overall Significance
(continued)
- Test statistic:

    F = \frac{MSR}{MSE} = \frac{SSR(\text{all}) / k}{MSE(\text{all})}

  where F has k numerator and (n - k - 1) denominator degrees of freedom
Test for Overall Significance — Excel Output: Example

ANOVA       | df | SS       | MS       | F        | Significance F
Regression  | 2  | 228014.6 | 114007.3 | 168.4712 | 1.65411E-09
Residual    | 12 | 8120.603 | 676.7169 |          |
Total       | 14 | 236135.2 |          |          |

Here k = 2 (the number of explanatory variables), total df = n - 1 = 14, F = MSR/MSE is the test statistic, and Significance F is the p-value.
Test for Overall Significance: Example Solution

H0: β1 = β2 = … = βk = 0    H1: at least one βj ≠ 0
α = .05, df = 2 and 12

Test statistic (from the Excel output): F = 168.47

Critical value: 3.89 at α = 0.05. Since F = 168.47 > 3.89, reject H0 at α = 0.05.

Conclusion: there is evidence that at least one independent variable affects Y.
Test for Significance: Individual Variables
- Shows if Y depends linearly on a single X_j individually while holding the effects of the other X's fixed
- Use the t test statistic
- Hypotheses:
  - H0: βj = 0 (no linear relationship)
  - H1: βj ≠ 0 (linear relationship between X_j and Y)
t Test Statistic — Excel Output: Example

            | Coefficients  | Standard Error | t Stat    | P-value
Intercept   | 562.1510092   | 21.09310433    | 26.65094  | 4.77868E-12
Temp        | -5.436580588  | 0.336216167    | -16.1699  | 1.64178E-09
Insulation  | -20.01232067  | 2.342505227    | -8.543127 | 1.90731E-06

The t Stat column gives t = b_i / S_{b_i}: -16.1699 for X_1 (Temperature) and -8.543127 for X_2 (Insulation).
t Test: Example Solution

Does temperature have a significant effect on monthly consumption of heating oil? Test at α = 0.05.

H0: β1 = 0    H1: β1 ≠ 0    df = 12

Test statistic: t = -16.1699

Critical values: ±2.1788 (rejection regions of .025 in each tail). Since |t| = 16.1699 > 2.1788, reject H0 at α = 0.05.

Conclusion: there is evidence of a significant effect of temperature on oil consumption, holding constant the effect of insulation.
Venn Diagrams and Estimation of Regression Model

[Venn diagram of Oil, Temp, and Insulation: only the variation in Temp that does not overlap Insulation is used in the estimation of β1; only the variation in Insulation that does not overlap Temp is used in the estimation of β2; the overlapping information is NOT used in the estimation of β1 nor β2.]
Confidence Interval Estimate for the Slope

Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption):

    b_1 \pm t_{n-k-1} S_{b_1}

            | Coefficients | Lower 95%    | Upper 95%
Intercept   | 562.151009   | 516.1930837  | 608.108935
Temp        | -5.4365806   | -6.169132673 | -4.7040285
Insulation  | -20.012321   | -25.11620102 | -14.90844

    -6.169 \le \beta_1 \le -4.704

We are 95% confident that the estimated average consumption of oil is reduced by between 4.7 and 6.17 gallons for each 1°F increase in temperature, holding insulation constant. We can also perform the test for the significance of individual variables, H0: β1 = 0 vs. H1: β1 ≠ 0, using this confidence interval.
Contribution of a Single Independent Variable X_j
- Let X_j be the independent variable of interest:

    SSR(X_j | \text{all others except } X_j) = SSR(\text{all}) - SSR(\text{all others except } X_j)

- Measures the additional contribution of X_j in explaining the total variation in Y with the inclusion of all the remaining independent variables
Contribution of a Single Independent Variable X_j: Example

    SSR(X_1 | X_2 \text{ and } X_3) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2 \text{ and } X_3)

The first term comes from the ANOVA section of the regression for \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}; the second from the ANOVA section of the regression for \hat{Y}_i = b_0 + b_2 X_{2i} + b_3 X_{3i}. This measures the additional contribution of X_1 in explaining Y with the inclusion of X_2 and X_3.
Coefficient of Partial Determination of X_j

    r^2_{Yj.\text{all others}} = \frac{SSR(X_j | \text{all others})}{SST - SSR(\text{all}) + SSR(X_j | \text{all others})}

- Measures the proportion of variation in the dependent variable that is explained by X_j while controlling for (holding constant) the other independent variables
Coefficient of Partial Determination for X_j
(continued)

Example: model with two independent variables:

    r^2_{Y1.2} = \frac{SSR(X_1 | X_2)}{SST - SSR(X_1, X_2) + SSR(X_1 | X_2)}
Venn Diagrams and Coefficient of Partial Determination for X_j

    r^2_{Y1.2} = \frac{SSR(X_1 | X_2)}{SST - SSR(X_1, X_2) + SSR(X_1 | X_2)}

[Venn diagram of Oil, Temp, and Insulation shading SSR(X_1 | X_2).]
Coefficient of Partial Determination in PHStat
- PHStat | Regression | Multiple Regression …
  - Check the "Coefficient of Partial Determination" box
- Excel spreadsheet for the heating oil example
Contribution of a Subset of Independent Variables
- Let X_s be the subset of independent variables of interest:

    SSR(X_s | \text{all others except } X_s) = SSR(\text{all}) - SSR(\text{all others except } X_s)

- Measures the contribution of the subset X_s in explaining SST with the inclusion of the remaining independent variables
Contribution of a Subset of Independent Variables: Example

Let X_s be X_1 and X_3:

    SSR(X_1 \text{ and } X_3 | X_2) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2)

The first term comes from the ANOVA section of the regression for \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}; the second from the ANOVA section of the regression for \hat{Y}_i = b_0 + b_2 X_{2i}.
Testing Portions of Model
- Examines the contribution of a subset X_s of explanatory variables to the relationship with Y
- Null hypothesis:
  - Variables in the subset do not improve the model significantly when all other variables are included
- Alternative hypothesis:
  - At least one variable in the subset is significant when all other variables are included
Testing Portions of Model
(continued)
- One-tailed rejection region
- Requires comparison of two regressions
  - One regression includes everything
  - Another regression includes everything except the portion to be tested
Partial F Test for the Contribution of a Subset of X Variables
- Hypotheses:
  - H0: variables X_s do not significantly improve the model given all other variables included
  - H1: variables X_s significantly improve the model given all others included
- Test statistic:

    F = \frac{SSR(X_s | \text{all others}) / m}{MSE(\text{all})}

  with df = m and (n - k - 1), where m = number of variables in the subset X_s
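A hedged sketch of the partial F computation using the heating-oil figures from the worked example later in this section (an addition; scipy assumed for the p-value):

    from scipy import stats

    # Values taken from the heating-oil ANOVA tables in the worked example:
    SSR_all = 228014.6    # SSR(X1, X2), regression on Temp and Insulation
    SSR_X2  = 51076.47    # SSR(X2), regression on Insulation alone
    MSE_all = 676.717     # MSE(X1, X2)
    m, n, k = 1, 15, 2    # subset size, sample size, number of X's in full model

    F = (SSR_all - SSR_X2) / m / MSE_all
    p = stats.f.sf(F, dfn=m, dfd=n - k - 1)
    print(F, p)           # about 261.5, far beyond the 4.75 critical value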
Partial F Test for the Contribution of a Single X_j
- Hypotheses:
  - H0: variable X_j does not significantly improve the model given all others included
  - H1: variable X_j significantly improves the model given all others included
- Test statistic:

    F = \frac{SSR(X_j | \text{all others})}{MSE(\text{all})}

  with df = 1 and (n - k - 1); m = 1 here
Testing Portions of Model: Example

Test at the α = .05 level to determine if the variable of average temperature significantly improves the model, given that insulation is included.
Testing Portions of Model: Example

H0: X_1 (temperature) does not improve the model with X_2 (insulation) included
H1: X_1 does improve the model
α = .05, df = 1 and 12; critical value = 4.75

ANOVA (for X_1 and X_2)  | SS           | MS
Regression               | 228014.6263  | 114007.313
Residual                 | 8120.603016  | 676.716918
Total                    | 236135.2293  |

ANOVA (for X_2)          | SS
Regression               | 51076.47
Residual                 | 185058.8
Total                    | 236135.2

    F = \frac{SSR(X_1 | X_2)}{MSE(X_1, X_2)} = \frac{228,015 - 51,076}{676.717} = 261.47

Conclusion: reject H0; X_1 does improve the model.
Testing Portions of Model in PHStat
- PHStat | Regression | Multiple Regression …
  - Check the "Coefficient of Partial Determination" box
- Excel spreadsheet for the heating oil example
Do We Need to Do This for One Variable?
- The F test for the contribution of a single variable after all other variables are included in the model is IDENTICAL to the t test of the slope for that variable
- The only reason to perform an F test is to test several variables together
Dummy-Variable Models
- Categorical explanatory variable with 2 or more levels
- Yes or no, on or off, male or female, etc.
- Use dummy variables (coded as 0 or 1)
- Only intercepts are different
- Assumes equal slopes across categories
- The number of dummy variables needed is (# of levels - 1)
- The regression model has the same form:

    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i
Dummy-Variable Models (with 2 Levels)

Given: \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}

Y = assessed value of house
X_1 = square footage of house
X_2 = desirability of neighborhood (0 if undesirable, 1 if desirable)

Desirable (X_2 = 1):   \hat{Y}_i = b_0 + b_1 X_{1i} + b_2(1) = (b_0 + b_2) + b_1 X_{1i}
Undesirable (X_2 = 0): \hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) = b_0 + b_1 X_{1i}

Both lines have the same slope.
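A minimal sketch of fitting such a model (an illustrative addition; the assessed-value numbers below are hypothetical, not from the slides):

    import numpy as np

    # Hypothetical data: X1 = square footage, X2 = 1 if desirable neighborhood.
    sqft  = np.array([1500, 2000, 2400, 1800, 2200, 1600], dtype=float)
    desir = np.array([0, 1, 1, 0, 1, 0], dtype=float)
    value = np.array([100, 160, 185, 115, 170, 105], dtype=float)

    X = np.column_stack([np.ones_like(sqft), sqft, desir])
    b0, b1, b2 = np.linalg.lstsq(X, value, rcond=None)[0]

    # Two parallel lines: intercept b0 + b2 when desirable, b0 otherwise.
    print("desirable:   value =", b0 + b2, "+", b1, "* sqft")
    print("undesirable: value =", b0, "+", b1, "* sqft")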
Dummy-Variable Models (with 2 Levels)
(continued)

[Figure: two parallel lines of assessed value Y against square footage X_1, with the same slope b_1 but different intercepts, b_0 + b_2 and b_0.]
Interpretation of the Dummy-Variable Coefficient (with 2 Levels)

Example:

    \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}

Y: annual salary of college graduate in thousand $
X_1: GPA
X_2: 0 if non-business degree, 1 if business degree

With the same GPA, college graduates with a business degree are making an estimated 6 thousand dollars more than graduates with a non-business degree, on average.
Dummy-Variable Models (with 3 Levels)

Given:
Y = assessed value of the house ($1000)
X_1 = square footage of the house
Style of the house = split-level, ranch, condo (3 levels; need 2 dummy variables):

    X_2 = 1 if split-level, 0 if not
    X_3 = 1 if ranch, 0 if not

    \hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3
Interpretation of the Dummy-Variable Coefficients (with 3 Levels)

Given the estimated model:

    \hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}

For split-level (X_2 = 1): \hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84
For ranch (X_3 = 1):       \hat{Y}_i = 20.43 + 0.045 X_{1i} + 23.53
For condo:                 \hat{Y}_i = 20.43 + 0.045 X_{1i}

With the same footage, a split-level will have an estimated average assessed value of 18.84 thousand dollars more than a condo, and a ranch 23.53 thousand dollars more than a condo.
Regression Model Containing an Interaction Term
- Hypothesizes interaction between a pair of X variables
  - Response to one X variable varies at different levels of another X variable
- Contains a cross-product term:

    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i

- Can be combined with other models
  - E.g., the dummy-variable model
Effect of Interaction
- Given: Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i
- Without the interaction term, the effect of X_1 on Y is measured by \beta_1
- With the interaction term, the effect of X_1 on Y is measured by \beta_1 + \beta_3 X_2
- The effect changes as X_2 changes
Interaction Example

    Y = 1 + 2 X_1 + 3 X_2 + 4 X_1 X_2

X_2 = 1: Y = 1 + 2 X_1 + 3(1) + 4 X_1 (1) = 4 + 6 X_1
X_2 = 0: Y = 1 + 2 X_1 + 3(0) + 4 X_1 (0) = 1 + 2 X_1

[Figure: the two lines plotted over 0 ≤ X_1 ≤ 1.5.] The effect (slope) of X_1 on Y depends on the X_2 value.
Interaction Regression Model Worksheet

Case i | Y_i | X_{1i} | X_{2i} | X_{1i} X_{2i}
1      | 1   | 1      | 3      | 3
2      | 4   | 8      | 5      | 40
3      | 1   | 3      | 2      | 6
4      | 3   | 5      | 6      | 30
:      | :   | :      | :      | :

Multiply X_1 by X_2 to get X_1 X_2, then run the regression with Y, X_1, X_2, and X_1 X_2, as sketched below.
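A hedged sketch of that regression on the four worksheet cases shown (an illustrative addition; with four cases and four parameters the fit is exact, so in practice more cases would be used):

    import numpy as np

    # The four worksheet cases: Y, X1, X2, and the cross-product X1*X2.
    Y  = np.array([1, 4, 1, 3], dtype=float)
    X1 = np.array([1, 8, 3, 5], dtype=float)
    X2 = np.array([3, 5, 2, 6], dtype=float)

    X = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(b)   # b0, b1, b2, b3; zero residual df here, hence an exact fit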
Interpretation When There Are 3+ Levels

    Y = \beta_0 + \beta_1 MALE + \beta_2 MARRIED + \beta_3 DIVORCED + \beta_4 MALE \cdot MARRIED + \beta_5 MALE \cdot DIVORCED

MALE = 0 if female and 1 if male
MARRIED = 1 if married; 0 if not
DIVORCED = 1 if divorced; 0 if not
MALE·MARRIED = 1 if male and married; 0 otherwise (= MALE times MARRIED)
MALE·DIVORCED = 1 if male and divorced; 0 otherwise (= MALE times DIVORCED)
Interpretation When There Are 3+ Levels
(continued)

    Y = \beta_0 + \beta_1 MALE + \beta_2 MARRIED + \beta_3 DIVORCED + \beta_4 MALE \cdot MARRIED + \beta_5 MALE \cdot DIVORCED

Expected values by group:
FEMALE — single: \beta_0; married: \beta_0 + \beta_2; divorced: \beta_0 + \beta_3
MALE — single: \beta_0 + \beta_1; married: \beta_0 + \beta_1 + \beta_2 + \beta_4; divorced: \beta_0 + \beta_1 + \beta_3 + \beta_5
Interpreting Results

          | FEMALE            | MALE                                    | Difference
Single    | \beta_0           | \beta_0 + \beta_1                       | \beta_1
Married   | \beta_0 + \beta_2 | \beta_0 + \beta_1 + \beta_2 + \beta_4   | \beta_1 + \beta_4
Divorced  | \beta_0 + \beta_3 | \beta_0 + \beta_1 + \beta_3 + \beta_5   | \beta_1 + \beta_5

Main effects: MALE, MARRIED, and DIVORCED. Interaction effects: MALE·MARRIED and MALE·DIVORCED.
Evaluating the Presence of Interaction with a Dummy-Variable
- Suppose X_1 and X_2 are numerical variables and X_3 is a dummy variable
- To test if the slope of Y with X_1 and/or X_2 is the same for the two levels of X_3
- Model:

    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{3i} + \beta_5 X_{2i} X_{3i} + \varepsilon_i

- Hypotheses:
  - H0: β4 = β5 = 0 (no interaction between X_1 and X_3 or X_2 and X_3)
  - H1: β4 and/or β5 ≠ 0 (X_1 and/or X_2 interacts with X_3)
- Perform a partial F test:

    F = \frac{\left[SSR(X_1, X_2, X_3, X_4, X_5) - SSR(X_1, X_2, X_3)\right] / 2}{MSE(X_1, X_2, X_3, X_4, X_5)}
Evaluating the Presence of Interaction with Numerical Variables
- Suppose X_1, X_2 and X_3 are numerical variables
- To test if the independent variables interact with each other
- Model:

    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{2i} + \beta_5 X_{1i} X_{3i} + \beta_6 X_{2i} X_{3i} + \varepsilon_i

- Hypotheses:
  - H0: β4 = β5 = β6 = 0 (no interaction among X_1, X_2 and X_3)
  - H1: at least one of β4, β5, β6 ≠ 0 (at least one pair of X_1, X_2, X_3 interact with each other)
- Perform a partial F test:

    F = \frac{\left[SSR(X_1, X_2, X_3, X_4, X_5, X_6) - SSR(X_1, X_2, X_3)\right] / 3}{MSE(X_1, X_2, X_3, X_4, X_5, X_6)}
Chapter Summary
- Developed the multiple regression model
- Discussed residual plots
- Addressed testing the significance of the multiple regression model
- Discussed inferences on population regression coefficients
- Addressed testing portions of the multiple regression model
- Discussed dummy-variables and interaction terms