
Chapter 05
Regression Models
IIMT3636
Faculty of Business and Economics
The University of Hong Kong
Instructor: Dr. Wei ZHANG
Introduction
• When data is available, how do we understand the underlying relationship between
  ▫ Education and income?
  ▫ Advertising expense and sales volume?
  ▫ Number of policemen and the crime rate in a region?
• If we know the education level of a man, how do we predict his future income level? (This is about correlation.)
• If the number of policemen is reduced by half, how will the crime rate increase? (This is about causation.)
Introduction
• Regression analysis helps us (i) understand the relationship between variables and (ii) predict the value of one based on the others.
• Linear Regression:
  Y = β0 + β1·X + ε

  Y                    X
  Dependent variable   Independent variable
  Explained variable   Explanatory variable
  Response variable    Control variable
  Predicted variable   Predictor variable
  Regressand           Regressor

• The error term ε: the part of Y that cannot be predicted by X.
Scatter Diagrams
• A graphical presentation of the data
  ▫ Independent variable is plotted on the horizontal axis
  ▫ Dependent variable is plotted on the vertical axis
• Triple A Construction data:

  TRIPLE A'S SALES ($100,000s):    6    8    9    5    4.5    9.5
  LOCAL PAYROLL ($100,000,000s):   3    4    6    4    2      5

[Figure: scatter plot of Sales ($100,000) against Payroll ($100 million) with several candidate lines for Y = β0 + β1·X + ε. Hidden relationship: better payroll predicts higher sales. Which line best represents the true relationship?]
Simple Linear Regression
• What are the best estimates of β0 and β1?
  ▫ b0 = estimate of β0
  ▫ b1 = estimate of β1
• Once we have b0 and b1, then given X (payroll) we can predict Y (sales):
  Ŷ = b0 + b1·X
• The chosen line will in some way minimize the “errors”.
  Error = Actual value − Predicted value
  e = Y − Ŷ
• Objective: to minimize the sum of e².
Simple Linear Regression
• The following formulas can be used to compute the “best” intercept and slope:

  X̄ = ΣX/n = average (mean) of X values
  Ȳ = ΣY/n = average (mean) of Y values
  b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
  b0 = Ȳ − b1·X̄
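As a check on these formulas, here is a minimal Python sketch (not part of the course, which uses Excel) that computes b0 and b1 on the Triple A Construction data; the function name fit_simple_ols is my own label.

```python
# Least-squares estimates for simple linear regression, computed with
# the slide's formulas on the Triple A Construction data.

def fit_simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n                      # mean of X
    y_bar = sum(y) / n                      # mean of Y
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)   # slope
    b0 = y_bar - b1 * x_bar                 # intercept
    return b0, b1

payroll = [3, 4, 6, 4, 2, 5]      # X, $100,000,000s
sales   = [6, 8, 9, 5, 4.5, 9.5]  # Y, $100,000s
b0, b1 = fit_simple_ols(payroll, sales)
print(b0, b1)  # 2.0 1.25
```

The output matches the slide's hand computation: Ŷ = 2 + 1.25X.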
Simple Linear Regression: Triple A Construction

  Y     X    (X − X̄)²        (X − X̄)(Y − Ȳ)
  6     3    (3 − 4)² = 1    (3 − 4)(6 − 7) = 1
  8     4    (4 − 4)² = 0    (4 − 4)(8 − 7) = 0
  9     6    (6 − 4)² = 4    (6 − 4)(9 − 7) = 4
  5     4    (4 − 4)² = 0    (4 − 4)(5 − 7) = 0
  4.5   2    (2 − 4)² = 4    (2 − 4)(4.5 − 7) = 5
  9.5   5    (5 − 4)² = 1    (5 − 4)(9.5 − 7) = 2.5

  Ȳ = ΣY/6 = 7    X̄ = ΣX/6 = 4    Σ(X − X̄)² = 10    Σ(X − X̄)(Y − Ȳ) = 12.5

  b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5/10 = 1.25
  b0 = Ȳ − b1·X̄ = 7 − 5 = 2
  Ŷ = 2 + 1.25X
  Sales = 2 + 1.25 × Payroll
Simple Linear Regression
• Discussion: What is the real logic behind the relationship
  Sales = 2 + 1.25 × Payroll?
• Payroll is associated with sales via two channels
  ▫ First, payroll means income. People may want to renovate their homes when they are richer.
  ▫ Second, payroll is correlated with the economic condition. There are more realty transactions and more demand for renovation when the economy is better.
• 1.25 is the aggregate effect. To rule out the impact of the economy, we need to include it as a control variable.
The Fit of Regression Model
• How good or effective is the estimated model? How well does the model “fit” the data?
• One way to evaluate the effectiveness is to compare the predictions with a simple benchmark model: the average of Y.
• Define:
  ▫ The sum of squares total: SST = Σ(Y − Ȳ)².
  ▫ The sum of squares error: SSE = Σ(Y − Ŷ)².
• SSE/SST measures the relative effectiveness of the regression model as compared to the benchmark model.
• An equation: SSR = SST − SSE = Σ(Ŷ − Ȳ)².
The Fit of Regression Model

[Figure: scatter plot of Sales ($100,000) against Payroll ($100 million) with the fitted line Ŷ = 2 + 1.25X. For each data point, the deviation Y − Ȳ splits into the explained part Ŷ − Ȳ and the error Y − Ŷ.]
Coefficient of Determination
• Coefficient of determination (or the so-called R squared):
  r² = 1 − SSE/SST = SSR/SST
• It means the proportion of the variability in Y explained by the regression model.
• For Triple A Construction, r² = 0.6944, which means about 69% of the variation in sales is captured by the regression model based on payroll.
• r² can range from 0 to 1. An r² greater than 0.5 is very good in practice.
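The r² for Triple A Construction can be reproduced with a short Python sketch (the lecture itself computes this in Excel), using the fitted line Ŷ = 2 + 1.25X:

```python
# SST, SSE, SSR, and r² for the Triple A data with Ŷ = 2 + 1.25X.

payroll = [3, 4, 6, 4, 2, 5]
sales   = [6, 8, 9, 5, 4.5, 9.5]

y_bar = sum(sales) / len(sales)
y_hat = [2 + 1.25 * x for x in payroll]          # predictions

sst = sum((y - y_bar) ** 2 for y in sales)               # total variation
sse = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # unexplained part
ssr = sst - sse                                          # explained part
r2  = 1 - sse / sst

print(sst, sse, ssr, round(r2, 4))  # 22.5 6.875 15.625 0.6944
```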
Correlation Coefficient
• This measure, r, expresses the degree of linear relationship in the data:
  r = ±√r²
• It is positive if b1 > 0 and negative if b1 < 0.
• r can range between and including −1 and +1.
• For Triple A Construction,
  r = √0.6944 = 0.8333
• A strong, positive correlation.
Correlation Coefficient

[Figure: four scatter plots of Y against X. (a) Perfect positive correlation: r = +1. (b) Positive correlation: 0 < r < 1. (c) No correlation: r = 0. (d) Perfect negative correlation: r = −1.]
Assumptions of Regression Model
• When performing regression analysis, we often make the
following assumptions about the random error ε:
• 1. Errors are independent (Random sampling)
• 2. Errors are normally distributed
• 3. Errors have a mean of zero
• 4. Errors have a constant variance (Homoscedasticity)
• A plot of the residuals (prediction errors) often
highlights obvious violations of assumptions.
Residual Plot
• When the assumptions are met, the errors are random and no discernible pattern is present.
[Figure: prediction error plotted against X]

Residual Plot
• Non-constant variance
[Figure: prediction error plotted against X]

Residual Plot
• Nonlinear relationship
[Figure: prediction error plotted against X]

Residual Plot
• Normality is violated
[Figure: prediction error plotted against X]
Testing the Model for Significance
• The r² provides a measure of accuracy or “fit” in a regression model. However, when the sample size is too small, it is possible to get a good fit by chance.
[Figure: example scatter plot of Y against X with very few points]
• To see if a linear relationship exists (i.e., β1 ≠ 0), a statistical hypothesis test is performed.
Testing the Model for Significance
• Define the F-statistic as
  F = (SSR/k) / (SSE/(n − k − 1))
  ▫ n = number of observations
  ▫ k = number of independent variables
• F is large if the model is accurate and small otherwise.
  ▫ F is boosted for large n
  ▫ F is discounted for large k
Testing the Model for Significance
• Testing the model: Y = β0 + β1·X + ε
• Null hypothesis H0: β1 = 0
• Alternative hypothesis H1: β1 ≠ 0
• If H0 is true, then SST = Σ(Y − Ȳ)² and SSE = Σ(Y − Ŷ)² should be close. In other words, SSR = SST − SSE and the F-stat should be close to zero.
Testing the Model for Significance
• Given H0, the F-stat follows an F distribution with (df1, df2)
  ▫ df1 = degrees of freedom for the numerator = k
  ▫ df2 = degrees of freedom for the denominator = n − k − 1
• F distribution:
  ▫ https://en.wikipedia.org/wiki/F-distribution
• Select the level of significance α and the threshold value F(α, df1, df2) such that P(F > F(α, df1, df2)) = α.
• Reject H0 if the F-stat > F(α, df1, df2).
Testing the Model for Significance
• Triple A Construction
  H0: no linear relationship between sales and payroll
  H1: linear relationship exists
• df1 = 1
• df2 = 4
• SSE = 6.875
• SSR = 15.625
• F-stat = (15.625/1) / (6.875/4) = 9.09
• F(0.05, 1, 4) = 7.71
• P-value = P(F > F-stat) < 0.05
• The observed data is very unlikely if the null hypothesis is true!
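The F-stat above follows directly from the definition; a few lines of Python confirm the arithmetic (the 7.71 critical value is taken from the slide, i.e., from an F table):

```python
# F test for the Triple A regression, using the numbers from the slide.
k, n = 1, 6
sse, ssr = 6.875, 15.625

f_stat = (ssr / k) / (sse / (n - k - 1))  # (SSR/k) / (SSE/(n-k-1))
f_crit = 7.71                             # F(0.05, df1=1, df2=4), from an F table

print(round(f_stat, 2), f_stat > f_crit)  # 9.09 True
```

Since 9.09 > 7.71, H0 is rejected at the 5% level.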
Analysis of Variance (ANOVA) Table
• When software is used to develop a regression model, an ANOVA table is typically created that shows the observed significance level (p-value) for the F-stat, which can be compared to the level of significance (α) to make a decision.

                DF           SS     MS                      F          SIGNIFICANCE
  Regression    k            SSR    MSR = SSR/k             MSR/MSE    P(F > MSR/MSE)
  Residual      n − k − 1    SSE    MSE = SSE/(n − k − 1)
  Total         n − 1        SST
Using Excel for Regression
• Open Chapter05_Regression.xlsx
Multiple Regression Analysis
• The model:
  Y = β0 + β1·X1 + β2·X2 + ⋯ + βk·Xk + ε
  where
  Y = dependent variable
  Xi = the ith independent variable
  β0 = intercept
  βi = coefficient of the ith independent variable
  k = number of independent variables
  ε = random error
Multiple Regression Analysis
• The estimated equation:
  Ŷ = b0 + b1·X1 + b2·X2 + ⋯ + bk·Xk
  where
  Ŷ = predicted value of Y
  b0 = the estimate of intercept β0
  bi = estimated coefficient of the ith variable
• The estimation procedure is more complex.
• Excel is usually enough.
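To give a feel for what Excel does behind the scenes, here is a pure-Python sketch that fits Ŷ = b0 + b1·X1 + b2·X2 by solving the normal equations (XᵀX)b = XᵀY with Gaussian elimination. The function name and the sample data are hypothetical; the data is generated from Y = 1 + 2·X1 − 3·X2 with no noise, so the fit should recover those coefficients exactly.

```python
# Multiple regression via the normal equations (XᵀX)b = XᵀY.
# Sketch only; real software uses more numerically robust methods.

def fit_multiple(rows, y):
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    m, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(m)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(m)) for a in range(p)]
    # Gaussian elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(XtX[r][c]))
        XtX[c], XtX[piv] = XtX[piv], XtX[c]
        Xty[c], Xty[piv] = Xty[piv], Xty[c]
        for r in range(c + 1, p):
            f = XtX[r][c] / XtX[c][c]
            for j in range(c, p):
                XtX[r][j] -= f * XtX[c][j]
            Xty[r] -= f * Xty[c]
    # Back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (Xty[r] - sum(XtX[r][j] * b[j] for j in range(r + 1, p))) / XtX[r][r]
    return b

# Hypothetical noiseless data from Y = 1 + 2·X1 − 3·X2:
rows = [(1, 0), (0, 1), (2, 1), (3, 2), (1, 2), (4, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
b = fit_multiple(rows, y)
print([round(v, 6) for v in b])  # [1.0, 2.0, -3.0]
```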
Jenny Wilson Realty
• JWR is a real estate firm in Alabama. Jenny wants to develop a model to determine a suggested listing price based on the size and age of the house.
• A sample of historical data includes the selling price (Y), the square footage (X1), the age (X2), and the condition (good, excellent, or mint).
• The model:
  Ŷ = b0 + b1·X1 + b2·X2
• Open Chapter05_Regression.xlsx
Jenny Wilson Realty
[Excel regression output]
Evaluating Multiple Regression Models
• R squared: same as with simple linear regression
  ▫ R squared increases with the number of variables
  ▫ Use the adjusted R squared to correct for the number of variables:
    Adj. r² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
• F test: for overall effectiveness of the model
  ▫ Null hypothesis: β1 = β2 = ⋯ = βk = 0
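The adjusted R squared formula is easy to apply directly; a small sketch below illustrates it with the Triple A numbers (n = 6, k = 1, SSE = 6.875, SST = 22.5) from earlier in the chapter:

```python
# Adjusted R squared from SSE, SST, n, and k, per the slide's formula.

def adjusted_r2(sse, sst, n, k):
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Triple A Construction: plain r² is 0.6944; the adjustment lowers it.
print(round(adjusted_r2(6.875, 22.5, 6, 1), 4))  # 0.6181
```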
Evaluating Multiple Regression Models
• The t test is for the significance of a single variable:
  t-stat = β̂i / st.err.
  Given βi = 0, the t-stat follows Student’s t distribution with n − k − 1 degrees of freedom.
• In Excel, the test is performed for two sides
  ▫ i.e., H0: βi = 0 and H1: βi ≠ 0
• Sometimes, we need a test for only one side
  ▫ e.g., H0: βi ≤ 0 and H1: βi > 0
  ▫ The p-value computed by Excel should be halved
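The halving rule is mechanical once Excel reports the two-sided p-value; the numbers below come from the SF (X1) row of the in-class-exercise output at the end of the chapter:

```python
# One-sided p-value from a two-sided Excel p-value: halve it when the
# sign of the estimate agrees with H1. Example: SF (X1), t = 2.147.
two_sided_p = 0.05288
one_sided_p = two_sided_p / 2
print(one_sided_p)  # 0.02644
```

So a coefficient that narrowly misses significance in a two-sided test (p = 0.05288) is significant at the 5% level against the one-sided alternative H1: βi > 0.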
The t distribution
• It is symmetric.
• Reject H0: βi = 0 if the t-stat falls into either tail region.
• The p-value indicates whether the t-stat is more extreme than the α-threshold values.

[Figure: a symmetric t density with rejection regions in the tails beyond −t(α/2) and t(α/2)]
Binary or Dummy Variables
• Binary (or dummy or indicator) variables are special variables created for qualitative data
  ▫ Whether a person has a college degree
  ▫ Whether a purchase is made by a female customer
  ▫ Whether a call is from New York (or LA or SF)
• A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
• The number of dummy variables must equal one less than the number of categories of the qualitative variable.
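The "one less than the number of categories" rule can be sketched in a few lines of Python, using the three condition categories from the Jenny Wilson example (the function name is my own label):

```python
# Dummy encoding for a 3-category qualitative variable ("condition"):
# two dummies, with "mint" as the baseline category.

def encode_condition(condition):
    """Return (X3, X4): X3 = 1 if good, X4 = 1 if excellent;
    mint is the baseline (0, 0)."""
    return (1 if condition == "good" else 0,
            1 if condition == "excellent" else 0)

print(encode_condition("good"))       # (1, 0)
print(encode_condition("excellent"))  # (0, 1)
print(encode_condition("mint"))       # (0, 0)
```

Adding a third dummy for "mint" would duplicate information already carried by X3 and X4, which is exactly the multicollinearity problem discussed below.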
Jenny Wilson Realty
• A better model can be developed if information about the
condition of the property is included
X3 = 1 if house is in good condition
= 0 otherwise
X4 = 1 if house is in excellent condition
= 0 otherwise
• Two dummy variables are used to describe the three
categories of condition
• No variable is needed for “mint” condition since if both
X3 and X4 = 0, the house must be in mint condition
Jenny Wilson Realty
[Excel regression output]
• The adj. R sq. is greatly improved!
• A high p-value for X4 does not mean no relationship. It means customers do not treat mint and excellent conditions very differently.
Multicollinearity
• When an independent variable is highly correlated with a combination of other independent variables, multicollinearity exists.
• Variables contain duplicate information
  ▫ Square footage, number of bedrooms, and number of bathrooms
  ▫ Dummies for good, excellent, and mint conditions
  ▫ US GDP per capita and S&P 500 index
• When multicollinearity exists, the overall F test is still valid and the model is still useful for prediction, but the tests for individual coefficients are not.
• Normally, a variable may appear to be insignificant even when it is actually significant.
Nonlinear Regression
• Sometimes the relationship is significantly nonlinear.
• The usual solution is to create a linear model that can describe a nonlinear relationship.
  ▫ Polynomial function: Ŷ = b0 + b1·X + b2·X²
  ▫ Exponential function: Ŷ = b0·e^(b1·X), estimated in log form as log Ŷ = log b0 + b1·X

[Figure: two scatter plots, one showing a quadratic relationship and one showing an exponential relationship]
Colonel Motors
• The engineers want to use regression analysis to improve fuel efficiency
• They have been asked to study the impact of weight on miles per gallon (MPG)

  MPG   WEIGHT (1,000 LBS.)      MPG   WEIGHT (1,000 LBS.)
  12    4.58                     20    3.18
  13    4.66                     23    2.68
  15    4.02                     24    2.65
  18    2.53                     33    1.70
  19    3.09                     36    1.95
  19    3.11                     42    1.92
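As a rough check on the Excel results that follow, the linear model MPG = b0 + b1·Weight can be fitted with the least-squares formulas from earlier in the chapter (sketch only; the lecture uses Excel):

```python
# Linear fit of MPG on Weight for the Colonel Motors data.

mpg    = [12, 13, 15, 18, 19, 19, 20, 23, 24, 33, 36, 42]
weight = [4.58, 4.66, 4.02, 2.53, 3.09, 3.11, 3.18, 2.68, 2.65, 1.70, 1.95, 1.92]

n = len(mpg)
w_bar, m_bar = sum(weight) / n, sum(mpg) / n
b1 = sum((w - w_bar) * (m - m_bar) for w, m in zip(weight, mpg)) \
     / sum((w - w_bar) ** 2 for w in weight)   # slope (negative: heavier -> lower MPG)
b0 = m_bar - b1 * w_bar                        # intercept

sse = sum((m - (b0 + b1 * w)) ** 2 for w, m in zip(weight, mpg))
sst = sum((m - m_bar) ** 2 for m in mpg)
r2 = 1 - sse / sst

print(round(b0, 2), round(b1, 2), round(r2, 3))  # negative slope, r² well above 0.5
```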
Colonel Motors
• A linear model for MPG data
[Excel regression output]

Colonel Motors
• A good model with high R squared and F-stat.
[Excel regression output]

Colonel Motors
• A quadratic model for MPG data
[Excel regression output]

Colonel Motors
• A better model. But do not try to interpret the coefficients.
[Excel regression output]
Nonlinear Regression
• When multiple variables are involved, the plot of a marginal relationship may not show the true pattern.
• The residual plot may reveal a nonlinear pattern.
  ▫ E.g., Chapter05_Regression.xlsx

[Figure: left panel, MPG plotted against Weight; right panel, residuals plotted against Weight, revealing a nonlinear pattern]
Cautions and Pitfalls
• Check if the assumptions are met
• Correlation does not mean causation
• Multicollinearity makes the interpretation of coefficients
problematic, but the model may still be good
• Using a regression model beyond the range of X is
questionable
• The significance of intercept is usually not important
• A linear relationship may not be the best relationship
• A nonlinear relationship can exist even if a linear one does not
• A model with a significant relationship but low R squared is of little practical value: the first-order effects are not captured
In-class Exercises
• What can go wrong if X is correlated with the error?
• What is the null hypothesis for the F test?
• What is the meaning of R squared?
• Why do we need the adjusted R squared?
• What is an appropriate regression model for the relationship between firm output size, labor size, and capital size?
• How to capture the impact of seasonality in a model?
(1) When X is correlated with factors included in the error term, the estimated coefficient of X should NOT be interpreted as the marginal impact of X on Y; instead, the coefficient can only be interpreted as the expected difference in Y that is correlated with a unit difference in X when we compare two data points. (2) The null hypothesis for the F test is that Y is not correlated with any independent variable. (3) R squared measures the percentage of variation in Y that is explained by the model. (4) Adjusted R squared corrects for the impact of k and n. As k increases, we need to discount R squared because a model with more parameters is more flexible and thus can better fit the data by default; as n increases, we need to boost R squared because it becomes less likely to get a high R squared just by chance. (5) Take the log on both sides of the Cobb-Douglas model. (6) Use 3 dummies, one less than the number of seasons.
In-class Exercises
Jenny Wilson Realty (SUMMARY OUTPUT)

Regression Statistics
  Multiple R          0.526894
  R Square            0.277617
  Adjusted R Square   0.217419
  Standard Error      34568.4
  Observations        14

ANOVA
              df    SS         MS         F          Significance F
  Regression   1    5.51E+09   5.51E+09   4.611689   0.05288022
  Residual    12    1.43E+10   1.19E+09
  Total       13    1.99E+10

              Coeff.     St. Er.    t Stat     P-value    Lower 95%    Upper 95%
  Intercept   99704.31   31294.9    3.18596    0.007834   31518.5747   167890.04
  SF (X1)     28.68438   13.3572    2.147484   0.05288    -0.4184621   57.7872167