DSCI 5340

DSCI 5340: Predictive Modeling and
Business Forecasting
Spring 2013 – Dr. Nick Evangelopoulos
Lecture 2:
Review of Multiple Regression (Ch. 4-5)
Material based on:
Bowerman-O’Connell-Koehler, Brooks/Cole
slide 1
DSCI 5340
FORECASTING
Review of textbook HW
Page 127-128 Ex 3.12 (Use Excel)
Page 128 Ex 3.13, 3.17
Page 132 Ex 3.25
Page 134 Ex 3.35
slide 2
Excel Data Analysis Add-in
In Excel, make sure the Analysis ToolPak
is enabled as an add-in.
slide 3
Ex 3.12 Page 128 Scatter Plot
An accountant wishes to predict direct labor cost
(y) based on the batch size (x) of a product
produced in a job shop. Data for 12 production
runs are given.
slide 4
Ex 3.13 Page 128 Interpretation of
Mean of Y Given X
Fitted model: Ŷ = 18.49 + 10.15X
a. μy|x=60 = β0 + β1(60): The average value of y
for repeated values of X=60. This is the
point on the regression line predicted for Y
at X=60.
b. μy|x=30 = β0 + β1(30): The average value of y
for repeated values of X=30. This is the
point on the regression line predicted for Y
at X=30. The distribution of values around
X=30 should be similar to that for X=60.
c. Interpretation of slope: As the Batch Size
increases by one unit, the direct labor cost
increases by b1 = 10.1463.
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.99963578
R Square            0.999271693
Adjusted R Square   0.999198862
Standard Error      8.641541386
Observations        12
ANOVA
             df   SS            MS         F          Significance F
Regression   1    1024592.904   1024593    13720.47   5.04436E-17
Residual     10   746.7623752   74.67624
Total        11   1025339.667
               Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept      18.48750754    4.676579789      3.953211   0.002716   8.067438459   28.90757661
Batch_Size_X   10.14625896    0.086620659      117.1344   5.04E-17   9.953256104   10.33926181
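The entries of the summary output are linked by a few identities, and recomputing them is a quick way to see where each number comes from. The sketch below uses only figures printed in the output above; the critical value t_crit is an assumption taken from a standard t table (two-sided 95%, 10 df), and the variable names are illustrative.

```python
import math

# Values copied from the Excel summary output for Ex 3.12.
n, k = 12, 1                                  # observations, predictors
ss_reg, ss_res = 1024592.904, 746.7623752     # SS Regression, SS Residual
ss_tot = ss_reg + ss_res                      # SS Total

r_square = ss_reg / ss_tot
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
std_error = math.sqrt(ss_res / (n - k - 1))   # square root of MS Residual

# Slope estimate and its standard error from the coefficients table.
b1, se_b1 = 10.14625896, 0.086620659
t_b1 = b1 / se_b1

# Assumed tabled t value for a two-sided 95% interval with 10 df.
t_crit = 2.228139
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1

print(f"R Square           {r_square:.9f}")
print(f"Adjusted R Square  {adj_r_square:.9f}")
print(f"Standard Error     {std_error:.6f}")
print(f"t Stat (slope)     {t_b1:.4f}")
print(f"95% CI (slope)     {lower:.6f} to {upper:.6f}")
```

Each printed value should agree with the corresponding cell of the Excel output up to rounding.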
slide 5
Ex 3.13 Page 128 Interpretation of
Model
Intercept b0: 18.49 is the labor cost if the
batch size is 0. Theoretically, this cost would
be 0, but it can be interpreted as fixed costs.
Interpretation of Error Term: There may be
other factors that determine direct labor costs,
such as benefits to employees, type of
product, number of employees, etc. The error
term stands in for these omitted factors, so the
model may be more accurate with additional
independent variables.
slide 6
Ex 3.17 Page 128
Accu-Copiers, Inc., sells and services the
Accu-500 copying machine. As part of its
standard service contract, the company
agrees to perform routine service on the
copier. To obtain information about the time
it takes to perform routine service, Accu-Copiers
has collected data for 11 service
calls, shown in Table 3.7 (p. 126).
slide 7
EX 3.17 Page 128
slide 8
Ex 3.25 Page 132: Test for
correlation
The test for correlation between X and Y,
H0: ρ = 0 vs. Ha: ρ ≠ 0,
has the same test statistic and p-value as the test
for significance of the regression slope coefficient.
However, the two tests use different assumptions.
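The equality of the two test statistics can be checked numerically. This is a minimal sketch on made-up data (the x and y values are hypothetical, not from the textbook): it computes the t statistic for H0: ρ = 0 from the sample correlation, and the t statistic for a zero slope from the least-squares fit.

```python
import numpy as np

# Hypothetical illustration data (not from the textbook).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
n = len(x)

# t statistic for H0: rho = 0, built from the sample correlation r.
r = np.corrcoef(x, y)[0, 1]
t_corr = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# t statistic for H0: slope = 0, built from the least-squares fit.
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
se_b1 = np.sqrt(sse / (n - 2) / sxx)
t_slope = b1 / se_b1

print(t_corr, t_slope)   # the two values agree up to rounding
```

Both statistics are referred to a t distribution with n - 2 degrees of freedom, which is why the p-values match as well.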
slide 9
Ex 3.35 Page 134
A State Department of
Taxation asked taxpayers
to report the time y (in
hours) required to
complete a tax form and
the number of times x
(including this one) the
taxpayer has filled out this
form.
slide 10
Ex 3.35 Page 134
To understand this model, note that as x increases,
1/x decreases and thus μy|x decreases.
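A model of this kind can be fitted by ordinary least squares after transforming the predictor. The sketch below assumes the form μy|x = β0 + β1(1/x), which is inferred from the remark above rather than quoted from the slide, and it uses invented data in place of the textbook table.

```python
import numpy as np

# Hypothetical data: hours y needed on the x-th time the form is filled out.
x = np.array([1, 2, 3, 4, 5, 6, 8, 10], dtype=float)
y = np.array([8.0, 4.7, 3.7, 3.2, 3.0, 2.8, 2.6, 2.5])

# Regress y on the transformed predictor 1/x (assumed model form).
X = np.column_stack([np.ones_like(x), 1.0 / x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

def predicted_mean(x_new):
    """Estimated mean completion time after x_new filings."""
    return b0 + b1 / x_new

print(b0, b1)
print(predicted_mean(1), predicted_mean(10))
```

With a positive b1, the predicted mean falls as x grows and levels off toward b0, which matches the idea that practice shortens the task only up to a point.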
slide 11
Multiple Regression Graphically
slide 12
Residuals
The residuals will be denoted êi:
êi = yi - ŷi
They represent the vertical distance of each
observed value of the dependent variable from the
estimated regression line, or the portion
of the variation in y that cannot be
“explained” with the data available.
What assumptions can we test using these
residuals?
slide 13
Regression model assumptions
What are the Assumptions of Regression
Analysis?
How can these assumptions be checked?
The relationship is linear.
The disturbances εi have constant variance σ²ε.
The disturbances are independent.
The disturbances are normally distributed.
slide 14
Graphical Techniques
scatterplots
residual plots
histograms
(not an exact science)
slide 15
Properties of residual plots
Property 1: The average of the residuals will be equal
to zero. This property holds regardless of
whether the assumptions are true or not and
is a direct result of the way the least-squares
method works.
Property 2: There should be no systematic pattern in a
residual plot.
(What is a systematic pattern?)
Property 3: Residuals should look like random
numbers chosen from a normal distribution.
(How close to normality should the chart
look?)
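Property 1 can be seen in a small experiment. In this sketch the data are deliberately curved (hypothetical values, y = x squared), so the linearity assumption fails, yet a straight-line fit with an intercept still gives residuals that average to zero. The systematic pattern that Property 2 warns about shows up in the signs of the residuals.

```python
import numpy as np

# Hypothetical data with a clearly curved relationship, so the
# linearity assumption is violated on purpose.
x = np.arange(1.0, 11.0)
y = x ** 2

# Straight-line fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print(residuals.mean())      # zero up to floating-point rounding
print(np.round(residuals))   # positive at both ends, negative in the middle
```

A zero average therefore says nothing about whether the model is adequate; the shape of the residual plot is what carries the information.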
slide 16
Residual plots
In a residual analysis it is suggested that the following
plots be used:
1. Plot the residuals versus each explanatory variable.
2. Plot the residuals versus the predicted or fitted values.
3. If the data are measured over time, plot the residuals
versus some variable representing the time sequence.
Which assumptions can each of these plots support,
or show to be violated?
slide 17
Residual plots
Plots may be constructed using the actual residuals,
êi, or the standardized residuals.
The standardized residuals are simply the residuals
divided by their standard deviation.
Why do you think standardized
residuals are sometimes used instead
of regular residuals?
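As a rough illustration of the definition above, the following sketch divides each residual by a single estimate of the residual standard deviation. The data are hypothetical, and using the standard error of the regression as "their standard deviation" is an assumption; other conventions adjust each residual for its leverage.

```python
import numpy as np

# Hypothetical data; the last y value is set unusually high.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 14.9, 17.1, 19.0, 26.0])

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# Assumed scale: the standard error of the regression, sqrt(SSE / (n - p)).
n, p = X.shape
s = np.sqrt(np.sum(residuals ** 2) / (n - p))
standardized = residuals / s

print(np.round(standardized, 2))
```

Because the standardized values are unit-free, the same cutoffs (such as 2 or 3) can be applied regardless of whether y is measured in dollars, hours, or anything else.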
slide 18
No Violations of the Assumptions of
Regression
Plot shows random residuals
slide 19
Does This Plot Suggest That One of the
Assumptions of Regression Analysis Is
Violated?
slide 20
PLOT OF RESIDUALS - Standardized
values are small.
slide 21
Outliers
The method of least squares estimation chooses
the regression coefficient estimates so the error
sum of squares, SSE, is a minimum.
In doing this, the squared distances from the observed y
values, yi, to the points on the regression line
or surface, ŷi, are minimized in total.
Least squares thus tries to avoid any large
distances from yi to ŷi.
slide 22
Outliers
OUTLIER: A sample data point whose y
value is much different from the y values of
the other points in the sample.
As a rule of thumb, an outlier is any observation whose
studentized residual is greater than 2 in absolute value.
An outlier does not have to be influential.
That is, removing the outlier may not change
the regression coefficients very much.
slide 23
No influential observations
slide 24
A High Leverage Observation That is
Not Influential
slide 25
Leverages
The slope of the line appears to be
determined almost entirely by this one point.
The sixth observation is said to have high
leverage and is referred to as a leverage
point.
What do you think the term “leverage point”
means?
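One way to make the idea concrete is to compute the leverages directly. They are the diagonal entries of the hat matrix and depend only on the x values, not on y. The x values in this sketch are hypothetical, chosen so that the sixth observation lies far from the rest.

```python
import numpy as np

# Hypothetical x values: five clustered points and a sixth far to the right.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])

# Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(np.round(leverage, 3))
print(leverage.sum())   # equals the number of estimated coefficients (2 here)
```

A leverage close to 1 means the fitted value at that point is determined almost entirely by the point's own y value, which is how a single observation can end up fixing the slope.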
slide 26
Studentized residuals
Another measure sometimes used in place of
the standardized residual is the standardized
residual computed after deleting the ith
observation. This measure is called the
studentized residual or studentized deleted
residual.
(Note that SAS refers to the standardized
residual as the studentized residual.)
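The two measures can be computed side by side without refitting the model n times. This is a sketch on hypothetical data; the leverage-adjusted formulas used here are the usual textbook ones, and the names standardized and studentized_deleted follow the terminology of this slide.

```python
import numpy as np

# Hypothetical data; the last y value is set unusually high.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 14.9, 17.1, 19.0, 26.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages
e = y - H @ y                          # residuals
sse = np.sum(e ** 2)

# Standardized residual: uses s from the fit to all n observations.
s = np.sqrt(sse / (n - p))
standardized = e / (s * np.sqrt(1 - h))

# Studentized deleted residual: uses s_(i), the standard error of the
# regression refitted without observation i (closed-form shortcut).
s_deleted = np.sqrt((sse - e ** 2 / (1 - h)) / (n - p - 1))
studentized_deleted = e / (s_deleted * np.sqrt(1 - h))

print(np.round(standardized, 2))
print(np.round(studentized_deleted, 2))
```

Deleting an outlier shrinks the estimate of the error standard deviation, so the deleted version tends to flag the outlier more sharply than the standardized version does.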
slide 27
Checking Model Assumptions
Checking Assumption 1 - Normal distribution
Construct a histogram
Checking Assumption 2 - Constant variance
Plot residuals versus predicted Y values
Checking Assumption 3 - Errors are independent
Durbin-Watson statistic
Plot of errors and time
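The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared differences between successive residuals, divided by the sum of squared residuals. The sketch below applies it to two hypothetical residual series listed in time order; the function name is illustrative.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series in time order.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # signs flip every period
drifting = [-3, -2, -1, -1, 1, 1, 2, 3]      # neighbours resemble each other

print(durbin_watson(alternating))   # 3.5
print(durbin_watson(drifting))      # about 0.27
```

The statistic ranges from 0 to 4. Values near 2 are consistent with independent errors, values toward 0 point to positive autocorrelation, and values toward 4 point to negative autocorrelation.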
slide 28
Detecting Sample Outliers
• Sample leverages
• Standardized residuals
• Cook’s distance measure
Cook’s distance measure:
Di = [1 / (k+1)] × [hi / (1 - hi)] × (standardized residual)²
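Cook's distance combines the two other diagnostics: it grows with the leverage hi and with the squared standardized residual. The sketch below evaluates the formula on hypothetical data, taking k + 1 as the number of estimated coefficients (k predictors plus the intercept) and using the leverage-adjusted standardized residual, which is the usual convention but an assumption here.

```python
import numpy as np

# Hypothetical data: the sixth point is far out in x and off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape                          # p = k + 1 coefficients

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
h = np.diag(H)                          # leverages h_i
e = y - H @ y                           # residuals
s2 = np.sum(e ** 2) / (n - p)           # mean square error
standardized = e / np.sqrt(s2 * (1 - h))

# D_i = [1 / (k+1)] * [h_i / (1 - h_i)] * (standardized residual)^2
cooks_d = (1 / p) * (h / (1 - h)) * standardized ** 2

print(np.round(cooks_d, 3))
```

In this example the sixth point has a modest residual, because it pulls the line toward itself, yet by far the largest Cook's distance. That is the pattern of an influential observation.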
slide 29
Example of An Influential
Observation
slide 30
Should an unusual observation be
deleted?
If an observation is exerting undue influence on the fit of the
model, then from an exploratory and data-mining standpoint,
removing the observation may reveal substantial changes
in the model.
An observation may be miscoded or may not belong with
the rest of the collected data.
No more than 10% of the data should be deleted to improve
the model.
slide 31
Dummy Variables
slide 32
Test of Null Hypothesis (F-test)
Tests the null hypothesis:
H0: β2 = β3 = … = βp = 0
Ha: at least one beta is not zero
The null hypothesis is known as a joint or simultaneous
hypothesis, because it constrains the values of all βi
simultaneously. This F test assesses the overall significance
of the regression model.
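The overall F statistic can be written either as a ratio of mean squares or in terms of R², and the two forms give the same value. As a check, the sketch below plugs in the figures from the Ex 3.12 Excel output earlier in this lecture (n = 12 observations, one predictor); the function name overall_f is illustrative.

```python
def overall_f(r_square, n, k):
    """Overall F statistic for k predictors and n observations, from R-squared."""
    return (r_square / k) / ((1 - r_square) / (n - k - 1))

# From R Square in the Ex 3.12 summary output.
f_from_r2 = overall_f(0.999271693, n=12, k=1)

# Same statistic as MS Regression / MS Residual from the ANOVA table.
f_from_anova = (1024592.904 / 1) / (746.7623752 / 10)

print(round(f_from_r2, 2), round(f_from_anova, 2))
```

Both results agree with the F value of 13720.47 reported by Excel, up to the rounding of R Square in the printed output.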
slide 33
Model building: Backward Selection
A “deconstruction” approach
1) Begin with the saturated (full) regression model.
2) Compute the drop in R² as a consequence of eliminating
each predictor variable, and the partial F-test value; treat
each variable as if it were the last to enter the regression
equation.
3) Compare the lowest partial F-test value (designated FL) to
the critical value of F (designated FC).
a. If FL < FC, remove the variable, recompute the
regression equation using the remaining predictor
variables, and return to step 2.
b. If FL > FC, adopt the regression equation as calculated.
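The backward procedure above can be sketched in a few lines. This is an illustrative implementation, not the routine of any particular package: the function names are made up, the critical value FC is passed in as a fixed number rather than looked up for the current degrees of freedom, and the data are simulated.

```python
import numpy as np

def sse(A, y):
    """Error sum of squares for a least-squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def backward_select(X, y, f_crit):
    """Return the column indices of X kept by backward elimination.

    X holds the predictors only; an intercept column is added here.
    f_crit plays the role of FC and is supplied by the caller.
    """
    n = len(y)
    ones = np.ones((n, 1))
    kept = list(range(X.shape[1]))
    while kept:
        sse_full = sse(np.hstack([ones, X[:, kept]]), y)
        mse_full = sse_full / (n - len(kept) - 1)
        # Partial F for each variable, as if it were the last to enter.
        partial_f = []
        for j in kept:
            others = [c for c in kept if c != j]
            sse_reduced = sse(np.hstack([ones, X[:, others]]), y)
            partial_f.append((sse_reduced - sse_full) / mse_full)
        lowest = int(np.argmin(partial_f))
        if partial_f[lowest] >= f_crit:
            break                 # FL > FC: adopt the current equation
        kept.pop(lowest)          # FL < FC: drop the variable and refit
    return kept

# Simulated example: y depends on columns 0 and 1; column 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=60)

# 4.0 is roughly the 5% critical value of F with 1 and about 60 df.
kept = backward_select(X, y, f_crit=4.0)
print(kept)
```

The two real predictors have very large partial F values and survive. Whether the noise column is dropped depends on the particular sample, since about 5% of pure-noise variables will pass a 5% test by chance.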
slide 34
Model building: Stepwise Selection
Calculate correlations of all predictors with the response
variable.
Select the predictor variable with the highest correlation.
Regress Y on Xi. Retain the predictor if there is a
significant F-test value.
Calculate partial correlations of all variables not in the equation
with the response variable. Select as the next predictor to enter
the one that has the highest partial correlation. Call this predictor Xj.
Compute the regression equation with both Xi and Xj
entered. Retain Xj if its partial F-value exceeds the
tabulated F with (1, n-2-1) df.
Now determine whether Xi warrants retention. Compare its
partial F-value as if Xj had been entered into the equation first.
slide 35
Stepwise Continued
Retain Xi if its partial F-value exceeds the tabulated F value.
Enter a new variable Xk. Compute the regression with three
predictors. Compute partial F-values for Xi, Xj and Xk.
Determine whether each should be retained by comparing
its observed partial F with the critical F.
6) Retain regression equation when no other predictor
can be entered or removed from the model.
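The stepwise procedure alternates an entry step with a re-test of the variables already in the equation. The sketch below is one possible reading of the steps, with several simplifications that are assumptions: candidates are ranked by partial F (which orders them the same way as the absolute partial correlation), a single fixed f_crit replaces the tabulated F for each degrees-of-freedom setting, and a loop bound guards against cycling. Function names and data are illustrative.

```python
import numpy as np

def sse(cols, X, y):
    """Error sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.hstack([np.ones((len(y), 1)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def partial_f(j, cols, X, y):
    """Partial F for variable j, treating it as the last of cols to enter."""
    others = [c for c in cols if c != j]
    mse = sse(cols, X, y) / (len(y) - len(cols) - 1)
    return (sse(others, X, y) - sse(cols, X, y)) / mse

def stepwise_select(X, y, f_crit):
    """Return column indices of X chosen by stepwise selection."""
    selected = []
    for _ in range(10 * X.shape[1]):          # guard against cycling
        changed = False
        # Entry step: try the candidate with the largest partial F.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if candidates:
            f_in = [partial_f(j, selected + [j], X, y) for j in candidates]
            best = int(np.argmax(f_in))
            if f_in[best] > f_crit:
                selected.append(candidates[best])
                changed = True
        # Removal step: re-test every variable already in the equation.
        f_out = [partial_f(j, selected, X, y) for j in selected]
        if f_out and min(f_out) < f_crit:
            selected.pop(int(np.argmin(f_out)))
            changed = True
        if not changed:
            break                             # nothing can enter or leave
    return selected

# Simulated example: y depends on columns 0 and 1; column 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=60)

selected = stepwise_select(X, y, f_crit=4.0)
print(selected)
```

Unlike backward selection, this approach never needs to fit the full model first, which helps when there are many candidate predictors relative to the number of observations.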
slide 36