Purpose of Regression Analysis
• Regression analysis is used primarily to model causality and provide prediction
– Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable
– Explains the effect of the independent variables on the dependent variable
Types of Regression Models
(Scatter diagrams illustrate four cases: positive linear relationship, negative linear relationship, a relationship that is not linear, and no relationship.)
Simple Linear Regression Model
• Relationship between variables is described by a linear function
• A change in one variable causes a change in the other variable
• A dependency of one variable on the other
Population Linear Regression
The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:

$\mu_{Y|X} = \beta_0 + \beta_1 X_i$

The population linear regression model is

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where
• $Y_i$ = dependent (response) variable
• $X_i$ = independent (explanatory) variable
• $\beta_0$ = population Y-intercept
• $\beta_1$ = population slope coefficient
• $\varepsilon_i$ = random error
Population Linear Regression
(continued)
(Figure: at a given $X_i$, the observed value $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ lies a vertical distance $\varepsilon_i$, the random error, from the conditional mean $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ on the population regression line.)
Sample Linear Regression
The sample regression line provides an estimate of the population regression line, as well as a predicted value of Y.

Sample model: $Y_i = b_0 + b_1 X_i + e_i$

where
• $b_0$ = sample Y-intercept
• $b_1$ = sample slope coefficient
• $e_i$ = residual

Sample regression line (fitted regression line, predicted value):

$\hat{Y}_i = b_0 + b_1 X_i$
Sample Linear Regression
(continued)
• $b_0$ and $b_1$ are the values that minimize the sum of the squared residuals:

$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$

• $b_0$ provides an estimate of $\beta_0$
• $b_1$ provides an estimate of $\beta_1$
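This minimization has a closed-form solution. A minimal sketch, assuming NumPy (the helper name fit_simple_ols is ours, not from the slides):

```python
# Closed-form least-squares estimates for Y = b0 + b1*X.
import numpy as np

def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```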
Sample Linear Regression
(continued)

(Figure: the observed value $Y_i$ lies a vertical distance $e_i$, the residual, from the fitted value $\hat{Y}_i = b_0 + b_1 X_i$ on the sample regression line; the population line $\mu_{Y|X}$ is shown for comparison.)
Interpretation of the
Slope and the Intercept
• $\beta_0$ is the average value of Y when the value of X is zero.
• $\beta_1$ measures the change in the average value of Y as a result of a one-unit change in X.
Interpretation of the
Slope and the Intercept
(continued)
• $b_0$ is the estimated average value of Y when the value of X is zero.
• $b_1$ is the estimated change in the average value of Y as a result of a one-unit change in X.
Simple Linear Regression: Example
You want to examine the linear dependency of the annual sales of produce stores on their size in square footage.
Sample data for seven stores were obtained.
Find the equation of the straight line that fits the data best.
Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Scatter Diagram: Example
(Scatter diagram: annual sales in $1000, 0 to 12,000, versus square feet, 0 to 6,000, for the seven stores.)
Equation for the Sample Regression Line:
Example
$\hat{Y}_i = b_0 + b_1 X_i$

From the Excel printout:

              Coefficients
Intercept     1636.414726
X Variable 1  1.486633657

$\hat{Y}_i = 1636.415 + 1.487 X_i$
Excel Output
Regression Statistics
Multiple R          0.970557
R Square            0.941981
Adjusted R Square   0.930378
Standard Error      611.7515
Observations        7

ANOVA
             df   SS         MS         F          Significance F
Regression   1    30380456   30380456   81.17909   0.000281
Residual     5    1871200    374239.9
Total        6    32251656

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept      1636.415       451.4953         3.624433   0.015149   475.8109    2797.019
X Variable 1   1.486634       0.164999         9.009944   0.000281   1.06249     1.910777
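These numbers can also be reproduced outside Excel. A sketch assuming NumPy and SciPy are available (the data arrays are typed in from the table above):

```python
# Reproduce the key regression numbers for the produce-stores data.
import numpy as np
from scipy import stats

sqft  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

res = stats.linregress(sqft, sales)
print(res.intercept, res.slope)  # ~1636.41 and ~1.4866
print(res.rvalue ** 2)           # R Square, ~0.9420
print(res.stderr)                # standard error of the slope, ~0.1650
print(res.pvalue)                # two-tailed p-value, ~0.000281
```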
Graph of the Sample
Regression Line: Example
(Scatter diagram with the fitted line superimposed: annual sales in $1000, 0 to 12,000, versus square feet, 0 to 6,000.)
Interpretation of Results: Example
$\hat{Y}_i = 1636.415 + 1.487 X_i$

The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.

The model estimates that for each increase of one square foot in the size of the store, expected annual sales are predicted to increase by $1,487.
How Good is the regression?
• R²
• Confidence Intervals
• Residual Plots
• Analysis of Variance
• Hypothesis (t) tests
Measure of Variation:
The Sum of Squares
SST = SSR + SSE
Total sample variability = Explained variability + Unexplained variability
Measure of Variation:
The Sum of Squares
(continued)
• SST = total sum of squares
– Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
• SSR = regression sum of squares
– Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
– Variation attributable to factors other than the relationship between X and Y
Measure of Variation:
The Sum of Squares
(continued)

$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$

(Figure: the three deviations shown for one observation around the fitted line and the mean $\bar{Y}$.)
The Coefficient of Determination
$r^2 = \dfrac{SSR}{SST}$
• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model
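A minimal sketch of the computation, assuming NumPy (r_squared is our helper name):

```python
# r^2 = SSR/SST, computed from observed and OLS-fitted values.
import numpy as np

def r_squared(y, y_hat):
    sst = np.sum((y - y.mean()) ** 2)      # total variation
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
    return ssr / sst
```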
Coefficients of Determination (r²) and Correlation (r)
(Four scatter diagrams, each with a fitted line $\hat{Y}_i = b_0 + b_1 X_i$: a perfect positive fit, $r^2 = 1$, $r = +1$; a strong positive fit, $r^2 = .8$, $r = +0.9$; a perfect negative fit, $r^2 = 1$, $r = -1$; and no fit, $r^2 = 0$, $r = 0$.)
Linear Regression Assumptions
1. Linearity
2. Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
3. Homoscedasticity (constant variance)
4. Independence of errors
Residual Analysis
• Purposes
– Examine linearity
– Evaluate violations of assumptions
• Graphical Analysis of Residuals
– Plot residuals vs. $X_i$, vs. $\hat{Y}_i$, and vs. time (see the sketch below)
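A sketch of the graphical analysis, assuming Matplotlib (residual_plots is our helper name):

```python
# Plot residuals against X and against the fitted values.
import matplotlib.pyplot as plt

def residual_plots(x, y, y_hat):
    e = y - y_hat  # residuals
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(x, e)
    ax1.axhline(0, color="gray")
    ax1.set_xlabel("X")
    ax1.set_ylabel("residual")
    ax2.scatter(y_hat, e)
    ax2.axhline(0, color="gray")
    ax2.set_xlabel("fitted value")
    ax2.set_ylabel("residual")
    plt.show()
```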
Residual Analysis for Linearity

(Figure: when the relationship is not linear, the residual plot against X shows a curved pattern; when it is linear, the residuals scatter randomly around zero.)
Residual Analysis for Homoscedasticity

(Figure: under heteroscedasticity, the spread of the standardized residuals changes with X, a fan shape; under homoscedasticity, the spread is constant across X.)
Variation of Errors around the Regression Line

• Y values are normally distributed around the regression line.
• For each X value, the "spread" or variance around the regression line is the same.

(Figure: identical normal error distributions f(e) centered on the sample regression line at $X_1$ and $X_2$.)
Residual Analysis: Excel Output for Produce Stores Example

Excel Output
Observation   Predicted Y    Residuals
1             4202.344417    -521.3444173
2             3928.803824    -533.8038245
3             5822.775103     830.2248971
4             9894.664688    -351.6646882
5             3557.145410    -239.1454103
6             4918.901840     644.0981603
7             3588.364717     171.6352829
Residual Plot
(Residual plot: residuals e versus square feet, 0 to 6,000.)
Residual Analysis for Independence
Graphical approach: the residual is plotted against time to detect any autocorrelation.

(Figure: a cyclical pattern in the time plot indicates residuals that are not independent; no particular pattern indicates independence.)
Inference about the Slope: t Test
• t test for a population slope
– Is there a linear dependency of Y on X ?
• Null and alternative hypotheses
– $H_0$: $\beta_1 = 0$ (no linear dependency)
– $H_1$: $\beta_1 \neq 0$ (linear dependency)
• Test statistic (with $n - 2$ degrees of freedom)

$t = \dfrac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
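A sketch of the computation, assuming NumPy/SciPy (slope_t_test is our helper name):

```python
# t test for H0: beta_1 = 0 in simple linear regression.
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))        # standard error of the estimate
    s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of the slope
    t = b1 / s_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)                # two-tailed p-value
    return t, p
```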
Example: Produce Store
Data for seven stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Estimated regression equation: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is square footage of the store affecting its annual sales?
Inferences about the Slope: t Test Example
$H_0$: $\beta_1 = 0$    $H_1$: $\beta_1 \neq 0$
$\alpha = .05$, $df = 7 - 2 = 5$
Critical values: $\pm 2.5706$ (reject $H_0$ if $t < -2.5706$ or $t > 2.5706$, with .025 in each tail)

Test statistic, from the Excel printout: $t = b_1 / S_{b_1} = 1.4866 / 0.1650 = 9.0099$

            Coefficients   Standard Error   t Stat   P-value
Intercept   1636.4147      451.4953         3.6244   0.01515
Footage     1.4866         0.1650           9.0099   0.00028

Decision: Reject $H_0$.
Conclusion:
There is evidence that square footage affects annual sales.
The Multiple Regression Model
Relationship between 1 dependent & 2 or more independent variables is a linear function
Population model:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

where $\beta_0$ is the population Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.

Sample model:

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i$

with $Y_i$ the dependent (response) variable, $X_{1i}, \ldots, X_{ki}$ the independent (explanatory) variables, and $e_i$ the residual.
Population Multiple Regression Model
Bivariate model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

(Figure: the response plane $\mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$ in $(X_1, X_2, Y)$ space; an observed Y lies a vertical distance $\varepsilon_i$ from the plane at the point $(X_{1i}, X_{2i})$.)
Sample Multiple
Regression Model
Bivariate model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$

(Figure: the sample regression plane $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$; an observed Y lies a vertical distance $e_i$ from the plane at $(X_{1i}, X_{2i})$.)
Simple and Multiple
Regression Compared
• Coefficients in a simple regression pick up the impact of that variable plus the impacts of other variables that are correlated with it and the dependent variable.
• Coefficients in a multiple regression net out the impacts of other variables in the equation.
Simple and Multiple
Regression Compared:Example
• Two simple regressions:
– $\widehat{\text{Oil}} = b_0 + b_1 \,\text{Temp}$
– $\widehat{\text{Oil}} = b_0 + b_1 \,\text{Insulation}$
• Multiple regression:
– $\widehat{\text{Oil}} = b_0 + b_1 \,\text{Temp} + b_2 \,\text{Insulation}$
Multiple Linear
Regression Equation
Too complicated by hand!
Ouch!
Interpretation of Estimated Coefficients
• Slope ($b_i$)
– The estimated average value of Y changes by $b_i$ for each one-unit increase in $X_i$, holding all other variables constant (ceteris paribus).
– Example: if $b_1 = -2$, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1-degree increase in temperature ($X_1$), given the inches of insulation ($X_2$).
• Y-intercept ($b_0$)
– The estimated average value of Y when all $X_i = 0$.
Multiple Regression Model: Example
Develop a model for estimating the heating oil used by a single-family home in the month of January, based on average temperature and amount of insulation in inches.
Oil (Gal)   Temp (°F)   Insulation (in.)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10
Sample Multiple Regression Equation:
Example
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$

Excel output:

              Coefficients
Intercept     562.1510092
X Variable 1  -5.436580588
X Variable 2  -20.01232067

$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
For each degree increase in temperature, the estimated average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant.
For each increase in one inch of insulation, the estimated average use of heating oil is decreased by 20.012 gallons, holding temperature constant.
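A sketch reproducing this fit with ordinary least squares, assuming NumPy (the arrays are typed in from the data table above):

```python
# Fit Oil = b0 + b1*Temp + b2*Insulation by least squares.
import numpy as np

oil   = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                  237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp  = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58.0])
insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10.0])

X = np.column_stack([np.ones(len(oil)), temp, insul])
b, *_ = np.linalg.lstsq(X, oil, rcond=None)
print(b)               # ~[562.151, -5.4366, -20.0123]
print(b @ [1, 30, 6])  # prediction at temp=30, insulation=6: ~279 gallons
                       # (the prediction slide below shows 278.97 using
                       # rounded coefficients)
```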
Confidence Interval Estimate for the Slope
Provide the 95% confidence interval for the population slope $\beta_1$ (the effect of temperature on oil consumption):

$b_1 \pm t_{n-p-1} \, S_{b_1}$

              Coefficients   Lower 95%      Upper 95%
Intercept     562.151009     516.1930837    608.108935
X Variable 1  -5.4365806     -6.169132673   -4.7040285
X Variable 2  -20.012321     -25.11620102   -14.90844

$-6.169 \le \beta_1 \le -4.704$

The estimated average consumption of oil is reduced by between 4.70 and 6.17 gallons for each 1°F increase in temperature.
Coefficient of
Multiple Determination
• Proportion of total variation in Y explained by all X variables taken together:

$r^2_{Y.12\ldots k} = \dfrac{SSR}{SST} = \dfrac{\text{Explained variation}}{\text{Total variation}}$
• Never decreases when a new X variable is added to model
– Disadvantage when comparing models
Adjusted Coefficient of Multiple Determination
• Proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

$r^2_{adj} = 1 - \left(1 - r^2_{Y.12\ldots k}\right)\dfrac{n-1}{n-k-1}$

– Penalizes excessive use of independent variables
– Smaller than $r^2_{Y.12\ldots k}$
– Useful in comparing among models (see the sketch below)
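A one-line computation (adjusted_r_squared is our helper name; the example numbers come from the Excel output on the next slide):

```python
# Adjusted r^2 for k explanatory variables and n observations.
def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.9656, n=15, k=2))  # ~0.9599
```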
Coefficient of Multiple Determination
Excel Output
Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15

$r^2_{Y.12} = \dfrac{SSR}{SST} = 0.9656$

Adjusted $r^2$ reflects the number of explanatory variables and the sample size, and is smaller than $r^2$.
Interpretation of Coefficient of Multiple
Determination
• $r^2_{Y.12} = \dfrac{SSR}{SST} = .9656$
– 96.56% of the total variation in heating oil use can be explained by temperature and amount of insulation.
• $r^2_{adj} = .9599$
– 95.99% of the total variation in heating oil use can be explained by temperature and amount of insulation, after adjusting for the number of explanatory variables and sample size.
Using The Model to Make Predictions
Predict the amount of heating oil used for a home if the average temperature is 30° and the insulation is six inches.

$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i} = 562.151 - 5.437(30) - 20.012(6) = 278.969$

The predicted heating oil used is 278.97 gallons.
Testing for Overall Significance
• Shows if there is a linear relationship between all of the X variables together and Y
• Use F test statistic
• Hypotheses:
– $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (no linear relationship)
– $H_1$: at least one $\beta_i \neq 0$ (at least one independent variable affects Y)
• The null hypothesis is a very strong statement
• The null hypothesis is almost always rejected
Test for Significance:
Individual Variables
• Shows if there is a linear relationship between the variable $X_i$ and Y
• Use t test statistic
• Hypotheses:
– $H_0$: $\beta_i = 0$ (no linear relationship)
– $H_1$: $\beta_i \neq 0$ (linear relationship between $X_i$ and Y)
Residual Plots
• Residuals vs. $\hat{Y}$
– May need to transform the Y variable
• Residuals vs. $X_1$
– May need to transform the $X_1$ variable
• Residuals vs. $X_2$
– May need to transform the $X_2$ variable
• Residuals vs. time
– May have autocorrelation
Residual Plots: Example
(Temperature residual plot, 20 to 80°F: maybe some nonlinear relationship. Insulation residual plot, 0 to 12 inches: no discernible pattern.)
The Quadratic Regression Model
• Relationship between one response variable and two or more explanatory variables is a quadratic polynomial function
• Useful when scatter diagram indicates nonlinear relationship
• Quadratic model:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$

• The second explanatory variable is the square of the first variable
Quadratic Regression Model
(continued)

Quadratic models may be considered when the scatter diagram takes on one of the following shapes, where $\beta_2$ is the coefficient of the quadratic term:

(Four panels of Y versus $X_1$: two curves with $\beta_2 > 0$, opening upward, and two with $\beta_2 < 0$, opening downward.)
Dummy Variable Models
• Categorical explanatory variable (dummy variable) with two or more levels:
– Yes or no, on or off, male or female
– Coded as 0 or 1
• Only intercepts are different
• Assumes equal slopes across categories
• The number of dummy variables needed is
(number of levels - 1)
• Regression model has the same form:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
Dummy-Variable Models
(with 2 Levels)
Given: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$

Y = assessed value of house
$X_1$ = square footage of house
$X_2$ = desirability of neighborhood (0 if undesirable, 1 if desirable)

Desirable ($X_2 = 1$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(1) = (b_0 + b_2) + b_1 X_{1i}$

Undesirable ($X_2 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) = b_0 + b_1 X_{1i}$

Same slopes.
Dummy-Variable Models
(with 2 Levels)
(continued)

(Figure: assessed value versus square footage; two parallel lines with the same slope $b_1$ and different intercepts, $b_0 + b_2$ for desirable and $b_0$ for undesirable.)
Interpretation of the Dummy Variable
Coefficient (with 2 Levels)
Example: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}$

Y: annual salary of college graduate in thousand $
$X_1$: GPA
$X_2$: 0 if female, 1 if male
On average, male college graduates are making an estimated six thousand dollars more than female college graduates with the same GPA.
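A tiny sketch of how the dummy coefficient enters the prediction (predicted_salary is a hypothetical helper built from the model above):

```python
# Salary model: Y = 20 + 5*GPA + 6*X2, with X2 = 1 for male, 0 for female.
def predicted_salary(gpa, male):
    return 20 + 5 * gpa + 6 * (1 if male else 0)

print(predicted_salary(3.0, male=True))   # 41 -> $41,000
print(predicted_salary(3.0, male=False))  # 35 -> $35,000; the gap is b2 = 6
```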
Dummy-Variable Models
(with 3 Levels)
Given:
Y = assessed value of the house ($1000)
$X_1$ = square footage of the house
Style of the house = split-level, ranch, condo (3 levels; need 2 dummy variables)

$X_2$ = 1 if split-level, 0 if not
$X_3$ = 1 if ranch, 0 if not

$\hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$
Interpretation of the Dummy Variable
Coefficients (with 3 Levels)
Given the estimated model:

$\hat{Y}_i = b_0 + b_1 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}$

For split-level ($X_2 = 1$, $X_3 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i} + 18.84$
For ranch ($X_2 = 0$, $X_3 = 1$): $\hat{Y}_i = b_0 + b_1 X_{1i} + 23.53$
For condo ($X_2 = X_3 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i}$

With the same footage, a split-level will have an estimated average assessed value of 18.84 thousand dollars more than a condo, and a ranch will have an estimated average assessed value of 23.53 thousand dollars more than a condo.
Dummy Variables
• Predict Weekly Sales in a Grocery Store
• Possible independent variables:
– Price
– Grocery Chain
• Data Set:
– Grocery.xls
• Interaction Effect?
Interaction
Regression Model
• Hypothesizes interaction between pairs of X variables
– Response to one X variable varies at different levels of another X variable
• Contains two-way cross product terms
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Can be combined with other models
– e.g., the dummy-variable model
Effect of Interaction
• Given:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Without the interaction term, the effect of $X_1$ on Y is measured by $\beta_1$
• With the interaction term, the effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$
• The effect changes as $X_2$ increases
Interaction Example
Given: $Y = 1 + 2 X_1 + 3 X_2 + 4 X_1 X_2$

For $X_2 = 1$: $Y = 1 + 2 X_1 + 3(1) + 4 X_1(1) = 4 + 6 X_1$
For $X_2 = 0$: $Y = 1 + 2 X_1 + 3(0) + 4 X_1(0) = 1 + 2 X_1$

(Figure: the two lines plotted for $X_1$ from 0 to 1.5 have different slopes.)

The effect (slope) of $X_1$ on Y does depend on the $X_2$ value.
Interaction Regression Model Worksheet
Case i   Y_i   X_1i   X_2i   X_1i X_2i
1        1     1      3      3
2        4     8      5      40
3        1     3      2      6
4        3     5      6      30
:        :     :      :      :

Multiply $X_1$ by $X_2$ to get $X_1 X_2$, then run the regression with Y, $X_1$, $X_2$, and $X_1 X_2$.
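A sketch of the same worksheet steps, assuming NumPy (the four cases come from the table above):

```python
# Build the cross-product column and fit the interaction model.
import numpy as np

y  = np.array([1, 4, 1, 3], dtype=float)
x1 = np.array([1, 8, 3, 5], dtype=float)
x2 = np.array([3, 5, 2, 6], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])  # interaction term
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates of beta_0, beta_1, beta_2, beta_3
# (With only the four cases shown, the fit is exact; the worksheet continues.)
```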
Evaluating Presence of Interaction
• Hypothesize interaction between pairs of independent variables
• Contains 2-way product terms
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
Using Transformations
• Requires data transformation
• Either or both independent and dependent variables may be transformed
• Can be based on theory, logic or scatter diagrams
Inherently Linear Models
• Non-linear models that can be expressed in linear form
– Can be estimated by least squares in linear form
• Require data transformation
Transformed Multiplicative Model (Log-Log)

Original: $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$

Transformed: $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$

(Figure: shapes of Y versus $X_1$ for different ranges of $\beta_1$; similarly for $X_2$.)
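A sketch of how the transformed model is estimated, assuming NumPy (fit_log_log is our helper name; the data must be positive for the logs to exist):

```python
# Estimate the log-log model by OLS on the logged variables.
import numpy as np

def fit_log_log(y, x1, x2):
    ly, lx1, lx2 = np.log(y), np.log(x1), np.log(x2)
    X = np.column_stack([np.ones_like(lx1), lx1, lx2])
    b, *_ = np.linalg.lstsq(X, ly, rcond=None)
    # b[0] estimates ln(beta_0); b[1], b[2] estimate beta_1, beta_2
    return np.exp(b[0]), b[1], b[2]
```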
Square Root Transformation

$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$

(Figure: Y versus $X_1$ curves for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$.)

Transforms a curvilinear model into one that appears linear. Often used to overcome heteroscedasticity.
Linear-Logarithmic Transformation

$Y_i = \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \varepsilon_i$

(Figure: Y versus $X_1$ curves for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$.)

Transformed from an original multiplicative model.
Exponential Transformation (Log-Linear)

Original model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$

(Figure: Y versus $X_1$ curves for $\beta_1 > 0$ and $\beta_1 < 0$.)

Transformed into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$
Model Building / Model Selection
• Find “the best” set of explanatory variables among all the ones given.
• “Best subset” regression (only linear models)
– Requires a lot of computation ($2^N$ regressions for N candidate variables)
• “Stepwise regression”
• “Common Sense” methodology
– Run regression with all variables
– Throw out variables not statistically significant
– “Adjust” model by including some excluded variables, one at a time
• Tradeoff: Parsimony vs. Fit
Association ≠ Causation !
Regression Limitations
• R² measures the association between the independent and dependent variables
• Be careful about doing predictions that involve extrapolation
• Inclusion / Exclusion of independent variables is subject to a type I / type II error
Multi-collinearity
• What?
– When one independent variable is highly correlated (“collinear”) with one or more other independent variables
– Examples:
• square feet and square meters as independent variables to predict house price (1 sq ft is roughly 0.09 sq meters)
• “total rooms” and bedrooms plus bathrooms for a house
• How to detect?
– Run a regression with the “not-so-independent” independent variable (in the examples above: square feet and total rooms) as a function of all other remaining independent variables, e.g.:
$X_1 = \beta_0 + \beta_2 X_2 + \cdots + \beta_k X_k$

– If the $R^2$ of the above regression is greater than 0.8, then one suspects multicollinearity is present.
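A sketch of this detection step, assuming NumPy (auxiliary_r2 is our helper name):

```python
# Regress one "not-so-independent" X on the remaining X variables
# and compute the auxiliary R^2.
import numpy as np

def auxiliary_r2(x_target, others):
    X = np.column_stack([np.ones(len(x_target))] + list(others))
    b, *_ = np.linalg.lstsq(X, x_target, rcond=None)
    sse = np.sum((x_target - X @ b) ** 2)
    sst = np.sum((x_target - x_target.mean()) ** 2)
    return 1 - sse / sst

# auxiliary_r2(...) > 0.8 suggests multicollinearity (equivalently,
# a variance inflation factor 1/(1 - R^2) above 5 is a common flag).
```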
Multi-collinearity
(continued)
• What effect?
– Coefficient estimates are unreliable
– Can still be used for predicting values for Y
– If possible, delete the “not-so-independent” independent variable
• When to check?
– When one suspects that two variables measure the same thing, or when the two variables are highly correlated
– When one suspects that one independent variable is a (linear) function of the other independent variables