Multiple Regression Analysis

Dr. Mohammed Alahmed
http://fac.ksu.edu.sa/alahmed
alahmed@ksu.edu.sa
(011) 4674108
Introduction

• In simple linear regression we studied the relationship between one explanatory variable and one response variable.
• Now we look at situations where several explanatory variables work together to explain the response.
Introduction

• Following our principles of data analysis, we look first at each variable separately, then at relationships among the variables.
• We look at the distribution of each variable to be used in multiple regression to determine if there are any unusual patterns that may be important in building our regression analysis.
Multiple Regression

• Example: In a study of direct operating cost, Y, for 67 branch offices of a consumer finance chain, four independent variables were considered:
  • X1: average size of loan outstanding during the year
  • X2: average number of loans outstanding
  • X3: total number of new loan applications processed
  • X4: office salary scale index
• The model for this example is:

  Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
Formal Statement of the Model

• General regression model:

  Y = β0 + β1X1 + β2X2 + … + βkXk + ε

Where:
• β0, β1, …, βk are parameters
• X1, X2, …, Xk are known constants
• ε, the error terms, are independent N(0, σ²)
Estimating the parameters of the model

• The values of the regression parameters βi are not known. We estimate them from data.
• As in the simple linear regression case, we use the least-squares method to fit a linear function to the data:

  ŷ = b0 + b1x1 + b2x2 + … + bkxk

• The least-squares method chooses the b's that make the sum of squares of the residuals as small as possible.
Estimating the parameters of the model

• The least-squares estimates are the values that minimize the quantity

  Σi=1..n (yi − ŷi)²

• Since the formulas for the least-squares estimates are complicated and hand calculation is out of the question, we are content to understand the least-squares principle and let software do the computations.
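To make the least-squares idea concrete, here is a minimal sketch in Python using the statsmodels library; the small data arrays are made up for illustration and are not from the text:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: 6 observations, two explanatory variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.2, 6.8, 7.9, 10.2, 11.1])

# Build the design matrix with an intercept column.
X = sm.add_constant(np.column_stack([x1, x2]))

# OLS chooses b0, b1, b2 to minimize the residual sum of squares.
results = sm.OLS(y, X).fit()
print(results.params)   # b0, b1, b2
print(results.resid)    # residuals yi - yhat_i
```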
Estimating the parameters of the model

• The estimate of βi is bi, and it indicates the change in the mean response per unit increase in Xi when the rest of the independent variables in the model are held constant.
• The parameters βi are frequently called partial regression coefficients because they reflect the partial effect of one independent variable when the rest of the independent variables are included in the model and are held constant.
Estimating the parameters of the model

• The observed variability of the responses about this fitted model is measured by the variance:

  s² = (1 / (n − k − 1)) Σi=1..n (yi − ŷi)²

and the regression standard error:

  s = √s²
Estimating the parameters of the model

• In the model, σ² and σ measure the variability of the responses about the population regression equation.
• It is natural to estimate σ² by s² and σ by s.
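As a quick sketch of this estimate in Python (again with made-up data, not from the text); statsmodels reports the same quantity as `mse_resid`:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: n = 6 observations, k = 2 explanatory variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.2, 6.8, 7.9, 10.2, 11.1])
n, k = 6, 2

results = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

resid = results.resid                  # yi - yhat_i
s2 = np.sum(resid**2) / (n - k - 1)    # estimate of sigma^2
s = np.sqrt(s2)                        # regression standard error

assert np.isclose(s2, results.mse_resid)
print(s2, s)
```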
Analysis of Variance Table

• The basic ideas of the regression ANOVA table are the same in simple and multiple regression.
• The sum of squares decomposition and the associated degrees of freedom are:

  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
      SST    =     SSR    +     SSE

• df: n − 1 = k + (n − k − 1)
Analysis of Variance Table

| Source     | Sum of Squares (SS) | df        | Mean Square (MS)      | F-test  |
|------------|---------------------|-----------|-----------------------|---------|
| Regression | SSR                 | k         | MSR = SSR/k           | MSR/MSE |
| Error      | SSE                 | n − k − 1 | MSE = SSE/(n − k − 1) |         |
| Total      | SST                 | n − 1     |                       |         |
F-test for the overall fit of the model

• To test the statistical significance of the regression relation between the response variable y and the set of variables x1, …, xk, i.e. to choose between the alternatives:

  H0: β1 = β2 = … = βk = 0
  Ha: not all βi (i = 1, …, k) equal zero

• We use the test statistic:

  F = MSR / MSE
F-test for the overall fit of the model

• The decision rule at significance level α is: Reject H0 if

  F > F(α; k, n − k − 1)

• where the critical value F(α; k, n − k − 1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.
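A minimal sketch of the overall F-test in Python with made-up data (not from the text); statsmodels reports the same F and p-value via `results.fvalue` and `results.f_pvalue`:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n, k = 30, 3
X = rng.normal(size=(n, k))                 # hypothetical explanatory variables
y = 2 + X @ np.array([1.0, 0.5, -0.7]) + rng.normal(scale=0.5, size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()

yhat = results.fittedvalues
sst = np.sum((y - y.mean())**2)
ssr = np.sum((yhat - y.mean())**2)
sse = np.sum((y - yhat)**2)                  # sst = ssr + sse

F = (ssr / k) / (sse / (n - k - 1))          # MSR / MSE
p = stats.f.sf(F, k, n - k - 1)              # P(F_{k, n-k-1} > F)
print(F, p)                                  # matches results.fvalue, results.f_pvalue
```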
Interval estimation of βi

• For our regression model, we have:

  (bi − βi) / s(bi)  has a t-distribution with n − k − 1 degrees of freedom

• Therefore, an interval estimate for βi with 1 − α confidence coefficient is:

  bi ± t(α/2; n − k − 1) · s(bi)

where

  s(bi) = √( MSE / Σ(x − x̄)² )
Significance tests for βi

• To test:

  H0: βi = 0
  Ha: βi ≠ 0

• We may use the test statistic:

  t = bi / s(bi)

• Reject H0 if

  t > t(α/2; n − k − 1)  or  t < −t(α/2; n − k − 1)
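A brief sketch of these coefficient t-tests and confidence intervals in Python (hypothetical data, not from the text); the hand computation is shown as a check against what statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n, k, alpha = 30, 3, 0.05
X = rng.normal(size=(n, k))                   # hypothetical explanatory variables
y = 2 + X @ np.array([1.0, 0.5, -0.7]) + rng.normal(scale=0.5, size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()

b, se = results.params, results.bse           # bi and s(bi)
t_stats = b / se                              # matches results.tvalues

df = n - k - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)
ci = np.column_stack([b - t_crit * se, b + t_crit * se])
print(t_stats)
print(ci)                                     # matches results.conf_int(alpha)
```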
Multiple regression model building

• Often we have many explanatory variables, and our goal is to use these to explain the variation in the response variable.
• A model using just a few of the variables often predicts about as well as the model using all the explanatory variables.
Multiple regression model building

• We may find that the reciprocal of a variable is a better choice than the variable itself, or that including the square of an explanatory variable improves prediction.
• We may find that the effect of one explanatory variable depends upon the value of another explanatory variable. We account for this situation by including interaction terms.
Multiple regression model building

• The simplest way to construct an interaction term is to multiply the two explanatory variables together, as in the sketch below.
• How can we find a good model?
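A brief sketch of this construction in Python (hypothetical arrays, not from the text):

```python
import numpy as np
import statsmodels.api as sm

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
y  = np.array([2.0, 5.1, 4.2, 9.3, 11.0])

x1x2 = x1 * x2                         # the interaction term
X = sm.add_constant(np.column_stack([x1, x2, x1x2]))
results = sm.OLS(y, X).fit()
print(results.params)                  # b0, b1, b2, b3 (interaction coefficient)
```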
Selecting the best regression equation

• After a lengthy list of potentially useful independent variables has been compiled, some of the independent variables can be screened out. An independent variable:
  • may not be fundamental to the problem,
  • may be subject to large measurement error, or
  • may effectively duplicate another independent variable in the list.
Selecting the best regression equation

• Once the investigator has tentatively decided upon the functional forms of the regression relations (linear, quadratic, etc.), the next step is to obtain a subset of the explanatory variables (x) that "best" explain the variability in the response variable y.
Selecting the best regression equation

• An automatic search procedure that sequentially develops the subset of explanatory variables to be included in the regression model is called a stepwise procedure.
• It was developed to economize on computational effort.
• It ends with the identification of a single regression model as "best".
Example: Sales Forecasting

• Sales forecasting: multiple regression is a popular technique for predicting product sales with the help of other variables that are likely to have a bearing on sales.
• Example: The growth of cable television has created vast new potential in the home entertainment business. The following table gives the values of several variables measured in a random sample of 20 local television stations which offer their programming to cable subscribers. A TV industry analyst wants to build a statistical model for predicting the number of subscribers that a cable station can expect.
Example: Sales Forecasting

• Y = number of cable subscribers (SUBSCRIB)
• X1 = advertising rate which the station charges local advertisers for one minute of prime-time space (ADRATE)
• X2 = number of families living in the station's area of dominant influence (ADI), a geographical division of radio and TV audiences (APIPOP)
• X3 = number of competing stations in the ADI (COMPETE)
• X4 = kilowatt power of the station's non-cable signal (SIGNAL)
Example: Sales Forecasting

• The sample data are fitted by a multiple regression model using the Excel program.
• The marginal t-test provides a way of choosing the variables for inclusion in the equation.
• The fitted model is:

  SUBSCRIBE = β0 + β1·ADRATE + β2·APIPOP + β3·COMPETE + β4·SIGNAL
Example: Sales Forecasting

• Excel summary output:

Regression Statistics
  Multiple R          0.884267744
  R Square            0.781929444
  Adjusted R Square   0.723777295
  Standard Error      142.9354188
  Observations        20

ANOVA
| Source     | df | SS          | MS          | F           | Significance F |
|------------|----|-------------|-------------|-------------|----------------|
| Regression | 4  | 1098857.84  | 274714.4601 | 13.44626923 | 7.52E-05       |
| Residual   | 15 | 306458.0092 | 20430.53395 |             |                |
| Total      | 19 | 1405315.85  |             |             |                |

| Term      | Coefficients | Standard Error | t Stat       | P-value     | Lower 95% | Upper 95% |
|-----------|--------------|----------------|--------------|-------------|-----------|-----------|
| Intercept | 51.42007002  | 98.97458277    | 0.51952803   | 0.610973806 | -159.539  | 262.3795  |
| AD_Rate   | -0.267196347 | 0.081055107    | -3.296477624 | 0.004894126 | -0.43996  | -0.09443  |
| Signal    | -0.020105139 | 0.045184758    | -0.444954014 | 0.662706578 | -0.11641  | 0.076204  |
| APIPOP    | 0.440333955  | 0.135200486    | 3.256896248  | 0.005307766 | 0.152161  | 0.728507  |
| Compete   | 16.230071    | 26.47854322    | 0.61295181   | 0.549089662 | -40.2076  | 72.66778  |
Example: Sales Forecasting

• Do we need all four variables in the model?
• Based on the partial t-test, the variables Signal and Compete are the least significant variables in our model.
• Let's drop the least significant variables one at a time.
Example: Sales Forecasting

• Excel summary output (SIGNAL dropped):

Regression Statistics
  Multiple R          0.882638739
  R Square            0.779051144
  Adjusted R Square   0.737623233
  Standard Error      139.3069743
  Observations        20

ANOVA
| Source     | df | SS          | MS         | F           | Significance F |
|------------|----|-------------|------------|-------------|----------------|
| Regression | 3  | 1094812.92  | 364937.64  | 18.80498277 | 1.69966E-05    |
| Residual   | 16 | 310502.9296 | 19406.4331 |             |                |
| Total      | 19 | 1405315.85  |            |             |                |

| Term      | Coefficients | Standard Error | t Stat      | P-value     | Lower 95%    | Upper 95% |
|-----------|--------------|----------------|-------------|-------------|--------------|-----------|
| Intercept | 51.31610447  | 96.4618242     | 0.531983558 | 0.602046756 | -153.1737817 | 255.806   |
| AD_Rate   | -0.259538026 | 0.077195983    | -3.36206646 | 0.003965102 | -0.423186162 | -0.09589  |
| APIPOP    | 0.433505145  | 0.130916687    | 3.311305499 | 0.004412929 | 0.15597423   | 0.711036  |
| Compete   | 13.92154404  | 25.30614013    | 0.550125146 | 0.589831583 | -39.72506442 | 67.56815  |
Example: Sales Forecasting

• The variable Compete is the next variable to remove.
Example: Sales Forecasting

• Excel summary output (COMPETE dropped):

Regression Statistics
  Multiple R          0.8802681
  R Square            0.774871928
  Adjusted R Square   0.748386273
  Standard Error      136.4197776
  Observations        20

ANOVA
| Source     | df | SS          | MS          | F          | Significance F |
|------------|----|-------------|-------------|------------|----------------|
| Regression | 2  | 1088939.802 | 544469.901  | 29.2562866 | 3.13078E-06    |
| Residual   | 17 | 316376.0474 | 18610.35573 |            |                |
| Total      | 19 | 1405315.85  |             |            |                |

| Term      | Coefficients | Standard Error | t Stat       | P-value     | Lower 95%    | Upper 95%    |
|-----------|--------------|----------------|--------------|-------------|--------------|--------------|
| Intercept | 96.28121395  | 50.16415506    | 1.919322948  | 0.07188916  | -9.556049653 | 202.1184776  |
| AD_Rate   | -0.254280696 | 0.075014548    | -3.389751739 | 0.003484198 | -0.41254778  | -0.096013612 |
| APIPOP    | 0.495481252  | 0.065306012    | 7.587069489  | 7.45293E-07 | 0.357697418  | 0.633265086  |
Example: Sales Forecasting

• All the variables in the model are statistically significant, therefore our final model is:

  Final model: SUBSCRIBE = 96.28 − 0.25·ADRATE + 0.495·APIPOP
Interpreting the Final Model

• What is the interpretation of the estimated parameters?
• Is the association positive or negative?
• Does this make sense intuitively, based on what the data represent?
• What other variables could be confounders?
• Are there other analyses that you might consider doing? New questions raised?
Multicollinearity

• In multiple regression analysis, one is often concerned with the nature and significance of the relations between the explanatory variables and the response variable.
• Questions that are frequently asked are:
  1. What is the relative importance of the effects of the different independent variables?
  2. What is the magnitude of the effect of a given independent variable on the dependent variable?
Multicollinearity

  3. Can any independent variable be dropped from the model because it has little or no effect on the dependent variable?
  4. Should any independent variables not yet included in the model be considered for possible inclusion?
• Simple answers can be given to these questions if:
  • the independent variables in the model are uncorrelated among themselves, and
  • they are uncorrelated with any other independent variables that are related to the dependent variable but omitted from the model.
Multicollinearity

• When the independent variables are correlated among themselves, multicollinearity (or collinearity) among them is said to exist.
• In many non-experimental situations in business, economics, and the social and biological sciences, the independent variables tend to be correlated among themselves.
• For example, in a regression of family food expenditures on the variables family income, family savings, and the age of the head of household, the explanatory variables will be correlated among themselves.
Multicollinearity

• Further, the explanatory variables will also be correlated with other socioeconomic variables not included in the model that do affect family food expenditures, such as family size.
Multicollinearity

• Some key problems that typically arise when the explanatory variables being considered for the regression model are highly correlated among themselves are:
  1. Adding or deleting an explanatory variable changes the regression coefficients.
  2. The estimated standard deviations of the regression coefficients become large when the explanatory variables in the regression model are highly correlated with each other.
  3. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the response variable and the set of explanatory variables.
Multicollinearity Diagnostics

• A formal method of detecting the presence of multicollinearity that is widely used is the Variance Inflation Factor (VIF).
• It measures how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related:

  VIFj = 1 / (1 − Rj²),   j = 1, 2, …, k

• Rj² is the coefficient of determination from the regression of the jth independent variable on the remaining k − 1 independent variables.
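A minimal sketch of this computation in Python; `variance_inflation_factor` from statsmodels implements exactly the 1/(1 − Rj²) formula above (the data here are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)     # deliberately correlated with x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Column 0 is the intercept; VIFs are reported for the k explanatory variables.
for j in range(1, X.shape[1]):
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.2f}")
```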
Multicollinearity Diagnostics

• A VIF near 1 suggests that multicollinearity is not a problem for the independent variable:
  - its estimated coefficient and associated t-value will not change much as the other independent variables are added to or deleted from the regression equation.
• A VIF much greater than 1 indicates the presence of multicollinearity. A maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least-squares estimates:
  - the estimated coefficient attached to the variable is unstable and its associated t-statistic may change considerably as the other independent variables are added to or deleted from the equation.
Multicollinearity Diagnostics

• The simple correlation coefficient between all pairs of explanatory variables (i.e., X1, X2, …, Xk) is helpful in selecting appropriate explanatory variables for a regression model and is also critical for examining multicollinearity.
• While it is true that a correlation very close to +1 or −1 does suggest multicollinearity, the converse does not hold: unless there are only two explanatory variables, we cannot infer that multicollinearity is absent just because no pair of explanatory variables is highly correlated.
Example: Sales Forecasting

Pearson correlation coefficients, N = 20; p-values for H0: ρ = 0 in parentheses:

|          | SUBSCRIB          | ADRATE            | SIGNAL            | APIPOP           | COMPETE          |
|----------|-------------------|-------------------|-------------------|------------------|------------------|
| SUBSCRIB | 1.00000           | -0.02848 (0.9051) | 0.44762 (0.0478)  | 0.90447 (<.0001) | 0.79832 (<.0001) |
| ADRATE   | -0.02848 (0.9051) | 1.00000           | -0.01021 (0.9659) | 0.32512 (0.1619) | 0.34147 (0.1406) |
| SIGNAL   | 0.44762 (0.0478)  | -0.01021 (0.9659) | 1.00000           | 0.45303 (0.0449) | 0.46895 (0.0370) |
| APIPOP   | 0.90447 (<.0001)  | 0.32512 (0.1619)  | 0.45303 (0.0449)  | 1.00000          | 0.87592 (<.0001) |
| COMPETE  | 0.79832 (<.0001)  | 0.34147 (0.1406)  | 0.46895 (0.0370)  | 0.87592 (<.0001) | 1.00000          |
Example: Sales Forecasting

• The three fitted models were:

  SUBSCRIBE = 51.42 − 0.27·ADRATE − 0.02·SIGNAL + 0.44·APIPOP + 16.23·COMPETE
  SUBSCRIBE = 51.32 − 0.26·ADRATE + 0.43·APIPOP + 13.92·COMPETE
  SUBSCRIBE = 96.28 − 0.25·ADRATE + 0.495·APIPOP
Example: Sales Forecasting

• VIF calculation: fit the model

  APIPOP = β0 + β1·SIGNAL + β2·ADRATE + β3·COMPETE

Regression Statistics
  Multiple R          0.878054
  R Square            0.770978
  Adjusted R Square   0.728036
  Standard Error      264.3027
  Observations        20

ANOVA
| Source     | df | SS      | MS       | F       | Significance F |
|------------|----|---------|----------|---------|----------------|
| Regression | 3  | 3762601 | 1254200  | 17.9541 | 2.25472E-05    |
| Residual   | 16 | 1117695 | 69855.92 |         |                |
| Total      | 19 | 4880295 |          |         |                |

| Term      | Coefficients | Standard Error | t Stat   | P-value  | Lower 95%    | Upper 95% |
|-----------|--------------|----------------|----------|----------|--------------|-----------|
| Intercept | -472.685     | 139.7492       | -3.38238 | 0.003799 | -768.9402258 | -176.43   |
| Compete   | 159.8413     | 28.29157       | 5.649786 | 3.62E-05 | 99.86587622  | 219.8168  |
| ADRATE    | 0.048173     | 0.149395       | 0.322455 | 0.751283 | -0.268529713 | 0.364876  |
| Signal    | 0.037937     | 0.083011       | 0.457012 | 0.653806 | -0.138038952 | 0.213913  |
Example: Sales Forecasting

• Fit the model

  COMPETE = β0 + β1·ADRATE + β2·APIPOP + β3·SIGNAL

Regression Statistics
  Multiple R          0.882936
  R Square            0.779575
  Adjusted R Square   0.738246
  Standard Error      1.34954
  Observations        20

ANOVA
| Source     | df | SS       | MS       | F        | Significance F |
|------------|----|----------|----------|----------|----------------|
| Regression | 3  | 103.0599 | 34.35329 | 18.86239 | 1.66815E-05    |
| Residual   | 16 | 29.14013 | 1.821258 |          |                |
| Total      | 19 | 132.2    |          |          |                |

| Term      | Coefficients | Standard Error | t Stat   | P-value  | Lower 95%    | Upper 95% |
|-----------|--------------|----------------|----------|----------|--------------|-----------|
| Intercept | 3.10416      | 0.520589       | 5.96278  | 1.99E-05 | 2.000559786  | 4.20776   |
| ADRATE    | 0.000491     | 0.000755       | 0.649331 | 0.525337 | -0.001110874 | 0.002092  |
| Signal    | 0.000334     | 0.000418       | 0.799258 | 0.435846 | -0.000552489 | 0.001221  |
| APIPOP    | 0.004167     | 0.000738       | 5.649786 | 3.62E-05 | 0.002603667  | 0.005731  |
Example: Sales Forecasting

• Fit the model

  SIGNAL = β0 + β1·ADRATE + β2·APIPOP + β3·COMPETE

Regression Statistics
  Multiple R          0.512244
  R Square            0.262394
  Adjusted R Square   0.124092
  Standard Error      790.8387
  Observations        20

ANOVA
| Source     | df | SS       | MS       | F        | Significance F |
|------------|----|----------|----------|----------|----------------|
| Regression | 3  | 3559789  | 1186596  | 1.897261 | 0.170774675    |
| Residual   | 16 | 10006813 | 625425.8 |          |                |
| Total      | 19 | 13566602 |          |          |                |

| Term      | Coefficients | Standard Error | t Stat   | P-value  | Lower 95%    | Upper 95% |
|-----------|--------------|----------------|----------|----------|--------------|-----------|
| Intercept | 5.171093     | 547.6089       | 0.009443 | 0.992582 | -1155.707711 | 1166.05   |
| APIPOP    | 0.339655     | 0.743207       | 0.457012 | 0.653806 | -1.235874129 | 1.915184  |
| Compete   | 114.8227     | 143.6617       | 0.799258 | 0.435846 | -189.7263711 | 419.3718  |
| ADRATE    | -0.38091     | 0.438238       | -0.86919 | 0.397593 | -1.309935875 | 0.548109  |
Example: Sales Forecasting

• Fit the model

  ADRATE = β0 + β1·SIGNAL + β2·APIPOP + β3·COMPETE

Regression Statistics
  Multiple R          0.399084
  R Square            0.159268
  Adjusted R Square   0.001631
  Standard Error      440.8588
  Observations        20

ANOVA
| Source     | df | SS       | MS       | F        | Significance F |
|------------|----|----------|----------|----------|----------------|
| Regression | 3  | 589101.7 | 196367.2 | 1.010346 | 0.413876018    |
| Residual   | 16 | 3109703  | 194356.5 |          |                |
| Total      | 19 | 3698805  |          |          |                |

| Term      | Coefficients | Standard Error | t Stat   | P-value  | Lower 95%    | Upper 95% |
|-----------|--------------|----------------|----------|----------|--------------|-----------|
| Intercept | 253.7304     | 298.6063       | 0.849716 | 0.408018 | -379.2865355 | 886.7474  |
| Signal    | -0.11837     | 0.136186       | -0.86919 | 0.397593 | -0.407073832 | 0.170329  |
| APIPOP    | 0.134029     | 0.415653       | 0.322455 | 0.751283 | -0.747116077 | 1.015175  |
| Compete   | 52.3446      | 80.61309       | 0.649331 | 0.525337 | -118.5474784 | 223.2367  |
Example: Sales Forecasting

• VIF calculation results:

| Variable | R-Squared | VIF  |
|----------|-----------|------|
| ADRATE   | 0.159268  | 1.19 |
| COMPETE  | 0.779575  | 4.54 |
| SIGNAL   | 0.262394  | 1.36 |
| APIPOP   | 0.770978  | 4.36 |

• There is no significant multicollinearity.
Qualitative Independent Variables

• Many variables of interest in business, economics, and the social and biological sciences are not quantitative but qualitative.
• Examples of qualitative variables are gender (male, female), purchase status (purchase, no purchase), and type of firm.
• Qualitative variables can also be used in multiple regression.
Qualitative Independent Variables

• An economist wished to relate the speed with which a particular insurance innovation is adopted (y) to the size of the insurance firm (x1) and the type of firm. The dependent variable is measured by the number of months elapsed between the time the first firm adopted the innovation and the time the given firm adopted the innovation. The first independent variable, size of the firm, is quantitative, and is measured by the amount of total assets of the firm. The second independent variable, type of firm, is qualitative and is composed of two classes: stock companies and mutual companies.
Indicator variables

• Indicator, or dummy, variables are used to determine the relationship between qualitative independent variables and a dependent variable.
• Indicator variables take on the values 0 and 1.
• For the insurance innovation example, where the qualitative variable has two classes, we might define the indicator variable x2 as follows:

  x2 = 1 if stock company, 0 otherwise
Indicator variables

• A qualitative variable with c classes will be represented by c − 1 indicator variables, as in the sketch below.
• A regression function with an indicator variable with two levels (c = 2) will yield two estimated lines.
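A brief sketch of this coding in Python with pandas; `drop_first=True` keeps c − 1 indicators for a c-class variable (the tiny data frame is made up for illustration):

```python
import pandas as pd

# Hypothetical firms: one qualitative variable with two classes.
df = pd.DataFrame({
    "months": [17, 26, 16, 28],
    "size":   [151, 92, 238, 164],
    "type":   ["Mutual", "Mutual", "Stock", "Stock"],
})

# One indicator column (c - 1 = 1): 1 for stock companies, 0 otherwise.
dummies = pd.get_dummies(df["type"], drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)   # the 'Stock' column plays the role of x2
```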
Interpretation of Regression Coefficients

• In our insurance innovation example, the regression model is:

  y = β0 + β1x1 + β2x2 + ε

• Where:
  x1 = size of firm
  x2 = 1 if stock company, 0 otherwise
Interpretation of Regression Coefficients

• To understand the meaning of the regression coefficients in this model, consider first the case of a mutual firm. For such a firm, x2 = 0 and we have:

  ŷ = b0 + b1x1 + b2(0) = b0 + b1x1          (mutual firms)

• For a stock firm, x2 = 1 and the response function is:

  ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1    (stock firms)
Interpretation of Regression Coefficients

• The response function for mutual firms is a straight line, with y-intercept β0 and slope β1.
• For stock firms, this also is a straight line, with the same slope β1 but with y-intercept β0 + β2.
• With reference to the insurance innovation example, the mean time elapsed before the innovation is adopted is a linear function of the size of the firm (x1), with the same slope β1 for both types of firms.
Interpretation of Regression Coefficients

• β2 indicates how much lower or higher the response function for stock firms is than the one for mutual firms.
• β2 measures the differential effect of type of firm.
• In general, β2 shows how much higher (lower) the mean response line is for the class coded 1 than the line for the class coded 0, for any level of x1.
Example: Insurance Innovation Adoption

• Here is the data set for the insurance innovation example:

| Months Elapsed | Size | Type of firm | Type   |
|----------------|------|--------------|--------|
| 17             | 151  | 0            | Mutual |
| 26             | 92   | 0            | Mutual |
| 21             | 175  | 0            | Mutual |
| 30             | 31   | 0            | Mutual |
| 22             | 104  | 0            | Mutual |
| 0              | 277  | 0            | Mutual |
| 12             | 210  | 0            | Mutual |
| 19             | 120  | 0            | Mutual |
| 4              | 290  | 0            | Mutual |
| 16             | 238  | 1            | Stock  |
| 28             | 164  | 1            | Stock  |
| 15             | 272  | 1            | Stock  |
| 11             | 295  | 1            | Stock  |
| 38             | 68   | 1            | Stock  |
| 31             | 85   | 1            | Stock  |
| 21             | 224  | 1            | Stock  |
| 20             | 166  | 1            | Stock  |
| 13             | 305  | 1            | Stock  |
| 30             | 124  | 1            | Stock  |
| 14             | 246  | 1            | Stock  |
Example: Insurance Innovation Adoption

• Fitting the regression model

  y = β0 + β1x1 + β2x2 + ε

where x1 = size of firm and x2 = 1 if stock company, 0 otherwise.

• The fitted response function is:

  ŷ = 33.87 − 0.1061x1 + 8.77x2
Example: Insurance Innovation Adoption

Regression Statistics
  Multiple R          0.95993655
  R Square            0.92147818
  Adjusted R Square   0.91224031
  Standard Error      2.78630562
  Observations        20

ANOVA
| Source     | df | SS          | MS       | F        | Significance F |
|------------|----|-------------|----------|----------|----------------|
| Regression | 2  | 1548.820517 | 774.4103 | 99.75016 | 4.04966E-10    |
| Residual   | 17 | 131.979483  | 7.763499 |          |                |
| Total      | 19 | 1680.8      |          |          |                |

| Term         | Coefficients | Standard Error | t Stat   | P-value  | Lower 95%    | Upper 95%    |
|--------------|--------------|----------------|----------|----------|--------------|--------------|
| Intercept    | 33.8698658   | 1.562588138    | 21.67549 | 8E-14    | 30.57308841  | 37.16664321  |
| Size         | -0.10608882  | 0.007799653    | -13.6017 | 1.45E-10 | -0.122544675 | -0.089632969 |
| Type of firm | 8.76797549   | 1.286421264    | 6.815789 | 3.01E-06 | 6.053860079  | 11.4820909   |
Example: Insurance Innovation Adoption

• The fitted response function is:

  ŷ = 33.87 − 0.1061x1 + 8.77x2

• Mutual firms response function:

  ŷ = 33.87 − 0.1061x1

• Stock firms response function:

  ŷ = (33.87 + 8.77) − 0.1061x1

• Interpretation?
Accounting for Seasonality in a Multiple Regression Model

• Seasonal patterns are not easily accounted for by the typical causal variables that we use in regression analysis.
• An indicator variable can be used effectively to account for seasonality in our time series data.
• The number of seasonal indicator variables to use depends on the data.
• If we have p periods in our data series, we cannot use more than p − 1 seasonal indicator variables.
Example: Private Housing Starts (PHS)

• Housing starts in the United States, measured in thousands of units.
• These data are plotted for 1990 Q1 through 1999 Q4.
• There are typically few housing starts during the first quarter of the year (January, February, March); there is usually a big increase in the second quarter (April, May, June), followed by some decline in the third quarter (July, August, September), and a further decline in the fourth quarter (October, November, December).
Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) in thousands of units, plotted quarterly for 1990 Q1 through 1999 Q4; "1" marks the first quarter of each year.]
Example: Private Housing Starts (PHS)

• To account for and measure this seasonality in a regression model, we will use three dummy variables: Q2 for the second quarter, Q3 for the third quarter, and Q4 for the fourth quarter (a sketch of the coding follows). These will be coded as follows:
  • Q2 = 1 for all second quarters and zero otherwise.
  • Q3 = 1 for all third quarters and zero otherwise.
  • Q4 = 1 for all fourth quarters and zero otherwise.
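A brief sketch of this coding in Python with pandas; the quarterly index is made up for illustration:

```python
import pandas as pd

# Hypothetical quarterly index: 1990 Q1 through 1991 Q4.
periods = pd.period_range("1990Q1", "1991Q4", freq="Q")
df = pd.DataFrame({"quarter": periods.quarter})

# One indicator per quarter 2-4; Q1 is the base period.
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["quarter"] == q).astype(int)
print(df)
```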
Example: Private Housing Starts (PHS)

• Data for private housing starts (PHS), the mortgage rate (MR), and these seasonal indicator variables are shown in the following table.
• Examine the data carefully to verify your understanding of the coding for Q2, Q3, Q4.
• Since we have assigned dummy variables for the second, third, and fourth quarters, the first quarter is the base quarter for our regression model.
• Note that any quarter could be used as the base, with indicator variables to adjust for differences in other quarters.
Example: Private Housing Starts (PHS)

| PERIOD    | PHS   | MR      | Q2 | Q3 | Q4 |
|-----------|-------|---------|----|----|----|
| 31-Mar-90 | 217   | 10.1202 | 0  | 0  | 0  |
| 30-Jun-90 | 271.3 | 10.3372 | 1  | 0  | 0  |
| 30-Sep-90 | 233   | 10.1033 | 0  | 1  | 0  |
| 31-Dec-90 | 173.6 | 9.9547  | 0  | 0  | 1  |
| 31-Mar-91 | 146.7 | 9.5008  | 0  | 0  | 0  |
| 30-Jun-91 | 254.1 | 9.5265  | 1  | 0  | 0  |
| 30-Sep-91 | 239.8 | 9.2755  | 0  | 1  | 0  |
| 31-Dec-91 | 199.8 | 8.6882  | 0  | 0  | 1  |
| 31-Mar-92 | 218.5 | 8.7098  | 0  | 0  | 0  |
| 30-Jun-92 | 296.4 | 8.6782  | 1  | 0  | 0  |
| 30-Sep-92 | 276.4 | 8.0085  | 0  | 1  | 0  |
| 31-Dec-92 | 238.8 | 8.2052  | 0  | 0  | 1  |
| 31-Mar-93 | 213.2 | 7.7332  | 0  | 0  | 0  |
| 30-Jun-93 | 323.7 | 7.4515  | 1  | 0  | 0  |
| 30-Sep-93 | 309.3 | 7.0778  | 0  | 1  | 0  |
| 31-Dec-93 | 279.4 | 7.0537  | 0  | 0  | 1  |
| 31-Mar-94 | 252.6 | 7.2958  | 0  | 0  | 0  |
| 30-Jun-94 | 354.2 | 8.4370  | 1  | 0  | 0  |
| 30-Sep-94 | 325.7 | 8.5882  | 0  | 1  | 0  |
| 31-Dec-94 | 265.9 | 9.0977  | 0  | 0  | 1  |
| 31-Mar-95 | 214.2 | 8.8123  | 0  | 0  | 0  |
| 30-Jun-95 | 296.7 | 7.9470  | 1  | 0  | 0  |
| 30-Sep-95 | 308.2 | 7.7012  | 0  | 1  | 0  |
| 31-Dec-95 | 257.2 | 7.3508  | 0  | 0  | 1  |
| 31-Mar-96 | 240   | 7.2430  | 0  | 0  | 0  |
| 30-Jun-96 | 344.5 | 8.1050  | 1  | 0  | 0  |
| 30-Sep-96 | 324   | 8.1590  | 0  | 1  | 0  |
| 31-Dec-96 | 252.4 | 7.7102  | 0  | 0  | 1  |
| 31-Mar-97 | 237.8 | 7.7905  | 0  | 0  | 0  |
| 30-Jun-97 | 324.5 | 7.9255  | 1  | 0  | 0  |
| 30-Sep-97 | 314.6 | 7.4692  | 0  | 1  | 0  |
| 31-Dec-97 | 256.8 | 7.1980  | 0  | 0  | 1  |
| 31-Mar-98 | 258.4 | 7.0547  | 0  | 0  | 0  |
| 30-Jun-98 | 360.4 | 7.0938  | 1  | 0  | 0  |
| 30-Sep-98 | 348   | 6.8657  | 0  | 1  | 0  |
| 31-Dec-98 | 304.6 | 6.7633  | 0  | 0  | 1  |
| 31-Mar-99 | 294.1 | 6.8805  | 0  | 0  | 0  |
| 30-Jun-99 | 377.1 | 7.2037  | 1  | 0  | 0  |
| 30-Sep-99 | 355.6 | 7.7990  | 0  | 1  | 0  |
| 31-Dec-99 | 308.1 | 7.8338  | 0  | 0  | 1  |

(Book: Table 5-5, page 216)
Example: Private Housing Starts (PHS)

• The regression model for private housing starts (PHS) is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4)

• In this model we expect b1 to have a negative sign, and we would expect b2, b3, and b4 all to have positive signs. Why?
• Regression results for this model are shown in the next slide.
Example: Private Housing Starts (PHS)

Regression Statistics
  Multiple R          0.885398221
  R Square            0.78393001
  Adjusted R Square   0.759236296
  Standard Error      26.4498851
  Observations        40

ANOVA
| Source     | df | SS          | MS          | F           | Significance F |
|------------|----|-------------|-------------|-------------|----------------|
| Regression | 4  | 88837.93624 | 22209.48406 | 31.74613731 | 3.33637E-11    |
| Residual   | 35 | 24485.87476 | 699.5964217 |             |                |
| Total      | 39 | 113323.811  |             |             |                |

| Term      | Coefficients | Standard Error | t Stat       | P-value     | Lower 95%    | Upper 95%    |
|-----------|--------------|----------------|--------------|-------------|--------------|--------------|
| Intercept | 473.0650749  | 35.54169837    | 13.31014264  | 2.93931E-15 | 400.9115031  | 545.2186467  |
| MR        | -30.04838192 | 4.257226391    | -7.058206249 | 3.21421E-08 | -38.69102153 | -21.40574231 |
| Q2        | 95.74106935  | 11.84748487    | 8.081130334  | 1.6292E-09  | 71.689367    | 119.7927717  |
| Q3        | 73.92904763  | 11.82881519    | 6.249911462  | 3.62313E-07 | 49.91524679  | 97.94284847  |
| Q4        | 20.54778131  | 11.84139803    | 1.73524961   | 0.091495355 | -3.491564078 | 44.5871267   |
Example: Private Housing Starts (PHS)

• Prediction equation:

  Predicted PHS = 473.06 − 30.05(MR) + 95.74(Q2) + 73.93(Q3) + 20.55(Q4)

• Use the prediction equation to make a forecast for each of the four quarters of 1999.
Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) with a simple regression forecast (PHSF1) and a multiple regression forecast (PHSF2), in thousands of units.]
Regression Diagnostics and Residual Analysis

• It is important to check the adequacy of the model before it becomes part of the decision-making process.
• Residual plots can be used to check the model assumptions.
• It is important to study outlying observations to decide whether they should be retained or eliminated.
• If retained, decide whether their influence should be reduced in the fitting process, or whether the regression function should be revised.
Time Series Data and the Problem of Serial Correlation

• In the regression models we assume that the errors εi are independent.
• In business and economics, many regression applications involve time series data.
• For such data, the assumption of uncorrelated or independent error terms is often not appropriate.
Problems of Serial Correlation

• If the error terms in the regression model are autocorrelated, the use of ordinary least-squares procedures has a number of important consequences:
  • MSE underestimates the variance of the error terms.
  • The confidence intervals and tests using the t and F distributions are no longer strictly applicable.
  • The standard errors of the regression coefficients underestimate the variability of the estimated regression coefficients.
First order serial correlation

• The error term in the current period is directly related to the error term in the previous time period.
• Let the subscript t represent time; then the simple linear regression model is:

  yt = β0 + β1xt + εt
  εt = ρεt−1 + ut

Where:
• εt = error at time t
• ρ = the parameter that measures correlation between adjacent error terms
• ut = normally distributed error terms with mean zero and variance σ²
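To see what such errors look like, here is a minimal simulation sketch in Python; the values of ρ, σ, and the regression coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, sigma = 100, 0.7, 1.0

# Build first-order autocorrelated errors: eps_t = rho * eps_{t-1} + u_t.
u = rng.normal(scale=sigma, size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + u[t]

# Response from a simple linear model with these autocorrelated errors.
x = np.arange(n, dtype=float)
y = 2.0 + 0.5 * x + eps
print(y[:5])
```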
Example

• The effects of positive serial correlation in a simple linear regression model:
  • misleading forecasts of future y values;
  • the standard error of the estimate, Sy.x, will underestimate the variability of the y's about the true regression line;
  • strong autocorrelation can make two unrelated variables appear to be related.
Durbin-Watson Test for Serial Correlation

• Recall the first-order serial correlation model:

  yt = β0 + β1xt + εt
  εt = ρεt−1 + ut

• The hypotheses to be tested are:

  H0: ρ = 0
  Ha: ρ > 0

• The alternative hypothesis is ρ > 0 since business and economic time series tend to show positive correlation.
Durbin-Watson Test for Serial Correlation

• The Durbin-Watson statistic is defined as:

  DW = Σt=2..n (et − et−1)² / Σt=1..n et²

• Where:

  et = yt − ŷt = the residual for time period t
  et−1 = yt−1 − ŷt−1 = the residual for time period t − 1
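A minimal sketch of this statistic in Python; `durbin_watson` from statsmodels computes the same ratio (the residual vector here is made up for illustration):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw_statistic(e: np.ndarray) -> float:
    """DW: sum of squared successive differences over sum of squared residuals."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual vector.
e = np.array([-0.026, -0.062, 0.022, 0.164, 0.047, -0.058, -0.094])
print(dw_statistic(e))
print(durbin_watson(e))   # same value
```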
Durbin-Watson Test for Serial Correlation

• The autocorrelation coefficient ρ can be estimated by the lag-1 residual autocorrelation r1(e):

  r1(e) = Σt=2..n et·et−1 / Σt=1..n et²

• And it can be shown that:

  DW ≈ 2(1 − r1(e))
Durbin-Watson Test for Serial Correlation
• If r1(e) = 0, then DW = 2 (there is no correlation.)
• If r1(e) > 0, then DW < 2 (positive correlation)
• If r1(e) < 0, Then DW > 2 (negative correlation)
Dr. mohammed Alahmed
• Since –1 < r1(e) < 1 then 0 < DW < 4
78
Durbin-Watson Test for Serial Correlation

• Decision rule:
  - If DW > U, do not reject H0.
  - If DW < L, reject H0.
  - If L ≤ DW ≤ U, the test is inconclusive.
• The critical upper (U) and lower (L) bounds can be found in the Durbin-Watson table of your text book.
• To use this table you need to know:
  - the significance level (α),
  - the number of independent parameters in the model (k), and
  - the sample size (n).
Example

• The Blaisdell Company wished to predict its sales by using industry sales as a predictor variable.
• The following table gives seasonally adjusted quarterly data on company sales and industry sales for the period 1983-1987.
Example

| Year | Quarter | t  | CompSale | InduSale |
|------|---------|----|----------|----------|
| 1983 | 1       | 1  | 20.96    | 127.3    |
| 1983 | 2       | 2  | 21.4     | 130      |
| 1983 | 3       | 3  | 21.96    | 132.7    |
| 1983 | 4       | 4  | 21.52    | 129.4    |
| 1984 | 1       | 5  | 22.39    | 135      |
| 1984 | 2       | 6  | 22.76    | 137.1    |
| 1984 | 3       | 7  | 23.48    | 141.2    |
| 1984 | 4       | 8  | 23.66    | 142.8    |
| 1985 | 1       | 9  | 24.1     | 145.5    |
| 1985 | 2       | 10 | 24.01    | 145.3    |
| 1985 | 3       | 11 | 24.54    | 148.3    |
| 1985 | 4       | 12 | 24.3     | 146.4    |
| 1986 | 1       | 13 | 25       | 150.2    |
| 1986 | 2       | 14 | 25.64    | 153.1    |
| 1986 | 3       | 15 | 26.36    | 157.3    |
| 1986 | 4       | 16 | 26.98    | 160.7    |
| 1987 | 1       | 17 | 27.52    | 164.2    |
| 1987 | 2       | 18 | 27.78    | 165.6    |
| 1987 | 3       | 19 | 28.24    | 168.7    |
| 1987 | 4       | 20 | 28.78    | 171.7    |
Example

[Figure: Blaisdell Company example; scatter plot of company sales ($ millions) against industry sales ($ millions).]
Example

• The scatter plot suggests that a linear regression model is appropriate.
• The least-squares method was used to fit a regression line to the data.
• The residuals were plotted against the fitted values.
• The plot shows that the residuals are consistently above or below the fitted value for extended periods.
Example

[Figure: residuals plotted against the fitted values.]
Example

• To confirm this graphical diagnosis we will use the Durbin-Watson test for:

  H0: ρ = 0
  Ha: ρ > 0

• The test statistic is:

  DW = Σt=2..n (et − et−1)² / Σt=1..n et²
Example

(t = 1, …, 20 indexes the quarters 1983 Q1 through 1987 Q4; y = company sales, x = industry sales.)

| t  | y     | x     | et       | et − et−1 | (et − et−1)² | et²      |
|----|-------|-------|----------|-----------|--------------|----------|
| 1  | 20.96 | 127.3 | -0.02605 |           |              | 0.000679 |
| 2  | 21.4  | 130   | -0.06202 | -0.03596  | 0.001293     | 0.003846 |
| 3  | 21.96 | 132.7 | 0.022021 | 0.084036  | 0.007062     | 0.000485 |
| 4  | 21.52 | 129.4 | 0.163754 | 0.141733  | 0.020088     | 0.026815 |
| 5  | 22.39 | 135   | 0.04657  | -0.11718  | 0.013732     | 0.002169 |
| 6  | 22.76 | 137.1 | 0.046377 | -0.00019  | 3.76E-08     | 0.002151 |
| 7  | 23.48 | 141.2 | 0.043617 | -0.00276  | 7.61E-06     | 0.001902 |
| 8  | 23.66 | 142.8 | -0.05844 | -0.10205  | 0.010415     | 0.003415 |
| 9  | 24.1  | 145.5 | -0.0944  | -0.03596  | 0.001293     | 0.008911 |
| 10 | 24.01 | 145.3 | -0.14914 | -0.05474  | 0.002997     | 0.022243 |
| 11 | 24.54 | 148.3 | -0.14799 | 0.001152  | 1.33E-06     | 0.021901 |
| 12 | 24.3  | 146.4 | -0.05305 | 0.094937  | 0.009013     | 0.002815 |
| 13 | 25    | 150.2 | -0.02293 | 0.030125  | 0.000908     | 0.000526 |
| 14 | 25.64 | 153.1 | 0.105852 | 0.12878   | 0.016584     | 0.011205 |
| 15 | 26.36 | 157.3 | 0.085464 | -0.02039  | 0.000416     | 0.007304 |
| 16 | 26.98 | 160.7 | 0.106102 | 0.020638  | 0.000426     | 0.011258 |
| 17 | 27.52 | 164.2 | 0.029112 | -0.07699  | 0.005927     | 0.000848 |
| 18 | 27.78 | 165.6 | 0.042316 | 0.013204  | 0.000174     | 0.001791 |
| 19 | 28.24 | 168.7 | -0.04416 | -0.08648  | 0.007478     | 0.00195  |
| 20 | 28.78 | 171.7 | -0.03301 | 0.011152  | 0.000124     | 0.00109  |
|    |       |       |          | Sums:     | 0.097941     | 0.133302 |
Example

• The test statistic is:

  DW = 0.097941 / 0.133302 = 0.735

• Using the Durbin-Watson table of your text book, for k = 1, n = 20, and α = 0.01, we find U = 1.15 and L = 0.95.
• Since DW = 0.735 falls below L = 0.95, we reject the null hypothesis and conclude that the error terms are positively autocorrelated.
Remedial Measures for Serial Correlation

• One major cause of autocorrelated error terms is the omission from the model of one or more key variables that have time-ordered effects on the dependent variable.
• Remedies include:
  • addition of one or more independent variables to the regression model;
  • use of transformed variables;
  • specifying the regression model in terms of changes rather than levels.
Extensions of the Multiple Regression Model

• In some situations, nonlinear terms may be needed as independent variables in a regression analysis.
• Business or economic logic may suggest that non-linearity is expected.
• A graphic display of the data may be helpful in determining whether non-linearity is present.
• One common economic cause for non-linearity is diminishing returns. For example, the effect of advertising on sales may diminish as increased advertising is used.
Extensions of the Multiple Regression Model

• Some common forms of nonlinear functions are:

  Y = β0 + β1X + β2X²
  Y = β0 + β1X + β2X² + β3X³
  Y = β0 + β1(1/X)
  Y = e^(β0 + β1X)
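The first three forms stay linear in the parameters, so they can be fitted with ordinary least squares after constructing the transformed columns; a brief sketch in Python (made-up data, not from the text):

```python
import numpy as np
import statsmodels.api as sm

x = np.linspace(1.0, 10.0, 25)
rng = np.random.default_rng(3)
y = 3 + 2 * x - 0.15 * x**2 + rng.normal(scale=0.5, size=x.size)

# Quadratic form: columns for X and X**2, still linear in the betas.
X_quad = sm.add_constant(np.column_stack([x, x**2]))
print(sm.OLS(y, X_quad).fit().params)      # b0, b1, b2

# Reciprocal form: Y = b0 + b1 * (1/X).
X_recip = sm.add_constant(1.0 / x)
print(sm.OLS(y, X_recip).fit().params)
```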
Extensions of the Multiple Regression Model

• To illustrate the use and interpretation of a non-linear term, we return to the problem of developing a forecasting model for private housing starts (PHS).
• So far we have looked at the following model:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4)

• Where MR is the mortgage rate and Q2, Q3, and Q4 are indicator variables for quarters 2, 3, and 4.
Example: Private Housing Start

• First we add real disposable personal income per capita (DPI) as an independent variable.
• Our new model for this data set is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI)

• Regression results for this model are shown in the next slide.
Example: Private Housing Start

Regression Statistics
  Multiple R          0.943791346
  R Square            0.890742104
  Adjusted R Square   0.874187878
  Standard Error      19.05542121
  Observations        39

ANOVA
| Source     | df | SS          | MS       | F        | Significance F |
|------------|----|-------------|----------|----------|----------------|
| Regression | 5  | 97690.01942 | 19538    | 53.80753 | 6.51194E-15    |
| Residual   | 33 | 11982.59955 | 363.1091 |          |                |
| Total      | 38 | 109672.619  |          |          |                |

| Term      | Coefficients | Standard Error | t Stat       | P-value  | Lower 95%    | Upper 95%    |
|-----------|--------------|----------------|--------------|----------|--------------|--------------|
| Intercept | -31.06403714 | 105.1938477    | -0.2953      | 0.769613 | -245.0826992 | 182.9546249  |
| MR        | -20.1992545  | 4.124906847    | -4.8969      | 2.5E-05  | -28.59144723 | -11.80706176 |
| Q2        | 97.03478074  | 8.900711541    | 10.90191     | 1.78E-12 | 78.9261326   | 115.1434289  |
| Q3        | 75.40017073  | 8.827185877    | 8.541813     | 7.17E-10 | 57.44111179  | 93.35922967  |
| Q4        | 20.35306822  | 8.83373887     | 2.304015     | 0.027657 | 2.380677107  | 38.32545934  |
| DPI       | 0.022407799  | 0.004356973    | 5.142974     | 1.21E-05 | 0.013543464  | 0.031272134  |
Example: Private Housing Start

• The prediction model is:

  Predicted PHS = −31.06 − 20.19(MR) + 97.03(Q2) + 75.40(Q3) + 20.35(Q4) + 0.02(DPI)

• In comparison with the previous model, we see that:
  • R-squared has improved, increasing from 78% to 89%.
  • The standard error of the estimate has decreased from 26.49 for the previous model to 19.05 for the new model.
Example: Private Housing Start

• The value of the DW test has changed from 0.88 for the previous model to 0.78 for the new model.
• At the 5% level the critical values for the DW test, from the Durbin-Watson table, for k = 5 and n = 39 are L = 1.22 and U = 1.79.
• Since the value of the DW test is smaller than L = 1.22, we reject the null hypothesis H0: ρ = 0.
• This implies that there is serial correlation in both models; the assumption of the independence of the error terms is not valid.
Example: Private Housing Start

• The plot of PHS against DPI shows a curvilinear relation.

[Figure: Private Housing Starts plotted against Disposable Personal Income (DPI).]

• Next we introduce a nonlinear term into the regression.
• The square of disposable personal income per capita (DPI²) is included in the regression model.
Example: Private Housing Start

• We also add the dependent variable, lagged one quarter (LPHS), as an independent variable in order to help reduce serial correlation.
• The third model that we fit to our data set is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + β6(DPI²) + β7(LPHS)

• Regression results for this model are shown in the next slide.
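A brief sketch of how these two constructed columns might be built in Python with pandas; the data frame excerpt is hypothetical, and note that the text's table appears to report DPI SQUARED on a different scale:

```python
import pandas as pd

# Hypothetical excerpt of the quarterly series PHS and DPI.
df = pd.DataFrame({
    "PHS": [271.3, 233.0, 173.6, 146.7],
    "DPI": [18063, 18031, 17856, 17748],
})

df["DPI_SQUARED"] = df["DPI"] ** 2    # the nonlinear term (unscaled here)
df["LPHS"] = df["PHS"].shift(1)       # PHS lagged one quarter

# The first row has no lagged value; drop it before fitting.
model_data = df.dropna()
print(model_data)
```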
Example: Private Housing Start

Regression Statistics
  Multiple R          0.97778626
  R Square            0.956065971
  Adjusted R Square   0.946145384
  Standard Error      12.46719572
  Observations        39

ANOVA
| Source     | df | SS          | MS          | F        | Significance F |
|------------|----|-------------|-------------|----------|----------------|
| Regression | 7  | 104854.2589 | 14979.17985 | 96.37191 | 3.07085E-19    |
| Residual   | 31 | 4818.360042 | 155.4309691 |          |                |
| Total      | 38 | 109672.619  |             |          |                |

| Term        | Coefficients | Standard Error | t Stat       | P-value  | Lower 95%    | Upper 95%    |
|-------------|--------------|----------------|--------------|----------|--------------|--------------|
| Intercept   | 716.5926532  | 1017.664989    | 0.704153784  | 0.486593 | -1358.949934 | 2792.13524   |
| MR          | -13.65521724 | 3.093504134    | -4.414158396 | 0.000114 | -19.96446404 | -7.345970448 |
| Q2          | 106.9813297  | 6.069780998    | 17.62523718  | 1.04E-17 | 94.60192287  | 119.3607366  |
| Q3          | 27.72122303  | 9.111432565    | 3.042465916  | 0.004748 | 9.138323433  | 46.30412262  |
| Q4          | -13.37855186 | 7.653050858    | -1.748133144 | 0.09034  | -28.98706069 | 2.22995698   |
| DPI         | -0.060399279 | 0.104412354    | -0.578468704 | 0.567127 | -0.273349798 | 0.15255124   |
| DPI SQUARED | 0.000335974  | 0.000536397    | 0.626354647  | 0.535668 | -0.000758014 | 0.001429963  |
| LPHS        | 0.655786939  | 0.097265424    | 6.742241114  | 1.51E-07 | 0.457412689  | 0.854161189  |
Example: Private Housing Start

• The inclusion of DPI² and lagged PHS has increased the R-squared to 96%.
• The standard error of the estimate has decreased to 12.47.
• The value of the DW test has increased to 2.32, which is greater than U = 1.79, ruling out positive serial correlation.
• You can see that the third model worked best for this data set.
• The following slide gives the data set.
Example: Private Housing Start

| PERIOD    | PHS   | MR      | LPHS  | Q2 | Q3 | Q4 | DPI   | DPI SQUARED  |
|-----------|-------|---------|-------|----|----|----|-------|--------------|
| 30-Jun-90 | 271.3 | 10.3372 | 217   | 1  | 0  | 0  | 18063 | 1,631,359.85 |
| 30-Sep-90 | 233   | 10.1033 | 271.3 | 0  | 1  | 0  | 18031 | 1,625,584.81 |
| 31-Dec-90 | 173.6 | 9.9547  | 233   | 0  | 0  | 1  | 17856 | 1,594,183.68 |
| 31-Mar-91 | 146.7 | 9.5008  | 173.6 | 0  | 0  | 0  | 17748 | 1,574,957.52 |
| 30-Jun-91 | 254.1 | 9.5265  | 146.7 | 1  | 0  | 0  | 17861 | 1,595,076.61 |
| 30-Sep-91 | 239.8 | 9.2755  | 254.1 | 0  | 1  | 0  | 17816 | 1,587,049.28 |
| 31-Dec-91 | 199.8 | 8.6882  | 239.8 | 0  | 0  | 1  | 17811 | 1,586,158.61 |
| 31-Mar-92 | 218.5 | 8.7098  | 199.8 | 0  | 0  | 0  | 18000 | 1,620,000.00 |
| 30-Jun-92 | 296.4 | 8.6782  | 218.5 | 1  | 0  | 0  | 18085 | 1,635,336.13 |
| 30-Sep-92 | 276.4 | 8.0085  | 296.4 | 0  | 1  | 0  | 18036 | 1,626,486.48 |
| 31-Dec-92 | 238.8 | 8.2052  | 276.4 | 0  | 0  | 1  | 18330 | 1,679,944.50 |
| 31-Mar-93 | 213.2 | 7.7332  | 238.8 | 0  | 0  | 0  | 17975 | 1,615,503.13 |
| 30-Jun-93 | 323.7 | 7.4515  | 213.2 | 1  | 0  | 0  | 18247 | 1,664,765.05 |
| 30-Sep-93 | 309.3 | 7.0778  | 323.7 | 0  | 1  | 0  | 18246 | 1,664,582.58 |
| 31-Dec-93 | 279.4 | 7.0537  | 309.3 | 0  | 0  | 1  | 18413 | 1,695,192.85 |
| 31-Mar-94 | 252.6 | 7.2958  | 279.4 | 0  | 0  | 0  | 18154 | 1,647,838.58 |
| 30-Jun-94 | 354.2 | 8.4370  | 252.6 | 1  | 0  | 0  | 18409 | 1,694,456.41 |
| 30-Sep-94 | 325.7 | 8.5882  | 354.2 | 0  | 1  | 0  | 18493 | 1,709,955.25 |
| 31-Dec-94 | 265.9 | 9.0977  | 325.7 | 0  | 0  | 1  | 18667 | 1,742,284.45 |
| 31-Mar-95 | 214.2 | 8.8123  | 265.9 | 0  | 0  | 0  | 18834 | 1,773,597.78 |
| 30-Jun-95 | 296.7 | 7.9470  | 214.2 | 1  | 0  | 0  | 18798 | 1,766,824.02 |
| 30-Sep-95 | 308.2 | 7.7012  | 296.7 | 0  | 1  | 0  | 18871 | 1,780,573.21 |
| 31-Dec-95 | 257.2 | 7.3508  | 308.2 | 0  | 0  | 1  | 18942 | 1,793,996.82 |
| 31-Mar-96 | 240   | 7.2430  | 257.2 | 0  | 0  | 0  | 19071 | 1,818,515.21 |
| 30-Jun-96 | 344.5 | 8.1050  | 240   | 1  | 0  | 0  | 19081 | 1,820,422.81 |
| 30-Sep-96 | 324   | 8.1590  | 344.5 | 0  | 1  | 0  | 19161 | 1,835,719.61 |
| 31-Dec-96 | 252.4 | 7.7102  | 324   | 0  | 0  | 1  | 19152 | 1,833,995.52 |
| 31-Mar-97 | 237.8 | 7.7905  | 252.4 | 0  | 0  | 0  | 19331 | 1,868,437.81 |
| 30-Jun-97 | 324.5 | 7.9255  | 237.8 | 1  | 0  | 0  | 19315 | 1,865,346.13 |
| 30-Sep-97 | 314.6 | 7.4692  | 324.5 | 0  | 1  | 0  | 19385 | 1,878,891.13 |
| 31-Dec-97 | 256.8 | 7.1980  | 314.6 | 0  | 0  | 1  | 19478 | 1,896,962.42 |
| 31-Mar-98 | 258.4 | 7.0547  | 256.8 | 0  | 0  | 0  | 19632 | 1,927,077.12 |
| 30-Jun-98 | 360.4 | 7.0938  | 258.4 | 1  | 0  | 0  | 19719 | 1,944,194.81 |
| 30-Sep-98 | 348   | 6.8657  | 360.4 | 0  | 1  | 0  | 19905 | 1,980,963.41 |
| 31-Dec-98 | 304.6 | 6.7633  | 348   | 0  | 0  | 1  | 20194 | 2,038,980.00 |
| 31-Mar-99 | 294.1 | 6.8805  | 304.6 | 0  | 0  | 0  | 20377 | 2,076,010.87 |
| 30-Jun-99 | 377.1 | 7.2037  | 294.1 | 1  | 0  | 0  | 20472 | 2,095,440.74 |
| 30-Sep-99 | 355.6 | 7.7990  | 377.1 | 0  | 1  | 0  | 20756 | 2,153,982.23 |
| 31-Dec-99 | 308.1 | 7.8338  | 355.6 | 0  | 0  | 1  | 21124 | 2,231,020.37 |