Doing Statistics for Business

Data, Inference, and Decision Making
Marilyn K. Pelosi
Theresa M. Sandifer
Chapter 12
Multiple
Regression Models
Chapter 12 Objectives
• Find the regression equation for a dependent variable Y as a function
  of a set of independent variables, X1, X2, …, Xk.
• Determine whether the relationship is significant.
• Determine which variables contribute to the model and which do not.
• Analyze the results of a regression analysis to determine whether the
  model is appropriate.
The dependent variable, Y, is often referred to as the output variable,
while the independent variables, X1, X2, …, Xk, are referred to as the
input variables.
The true relationship between the dependent variable Y and the set of
independent variables, X1, X2, …, Xk, the multiple regression model,
can be described by the equation:
y = β0 + β1x1 + β2x2 + ... + βkxk + ε
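As an illustration of how the coefficients in the equation above can be estimated from data, here is a minimal sketch using ordinary least squares. It assumes NumPy is available; the function name `fit_multiple_regression` and the data are hypothetical and are not taken from the textbook.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return least-squares estimates [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical, noise-free data generated from y = 2 + 3*x1 - 1*x2,
# so the fit should recover those three coefficients.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

coef = fit_multiple_regression(X, y)
print(np.round(coef, 6))
```

With real data the error term is not zero, so the estimates will differ from the true coefficients; Excel and Minitab report the same kind of estimates in their "Coefficients" and "Coef" columns.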
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.866778185
R Square             0.751304423
Adjusted R Square    0.715776483
Standard Error       37957.7302
Observations         41

ANOVA
             df   SS            MS            F             Significance F
Regression    5   1.52341E+11   30468171569   21.14686162   1.08463E-09
Residual     35   50427624859   1440789282
Total        40   2.02768E+11

                                             Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%     Lower 95.0%
Intercept                                    -140351.6689   112957.8302      -1.242513853   0.222307657   -369668.5358   88965.198     -369668.5358
Avg. Household Income                        1.637671454    0.312118144      5.246960128    7.59702E-06   1.004037162    2.271305746   1.004037162
Total Instructional Expenditures per pupil   39.15871129    21.78140875      1.797804345    0.080839566   -5.059953319   83.3773759    -5.059953319
Mean SAT Score                               89.63285763    113.8778516      0.787096493    0.436523067   -141.5517541   320.8174694   -141.5517541
Population Diversity (% non-white)           -47820.09741   62143.05497      -0.769516359   0.446749366   -173977.3601   78337.1653    -173977.3601
1993/94 Avg. Violent Crime (per 1000)        -1132.378569   2146.787097      -0.527475952   0.601191161   -5490.5934     3225.836262   -5490.5934

Figure 12.1 Computer Output from Excel
Regression Analysis

The regression equation is
1994 Avg. Value _of Home Sold = - 140352 + 1.64 Avg. Household_ Income
    + 39.2 Total Instructional_ Expenditur + 90 Mean SAT_ Score
    - 47820 Population Diversity_ (% non-white
    - 1132 1993/94 Avg. Violent _Crime (per 1000)

Predictor   Coef      StDev     T       P
Constant    -140352   112958    -1.24   0.222
Avg. Hou    1.6377    0.3121    5.25    0.000
Total In    39.16     21.78     1.80    0.081
Mean SAT    89.6      113.9     0.79    0.437
Populati    -47820    62143     -0.77   0.447
1993/94     -1132     2147      -0.53   0.601

S = 37958    R-Sq = 75.1%    R-Sq(adj) = 71.6%

Analysis of Variance
Source       DF   SS            MS            F       P
Regression    5   1.52341E+11   30468171569   21.15   0.000
Error        35   50427624859   1440789282
Total        40   2.02768E+11

Source     DF   Seq SS
Avg. Hou    1   1.46103E+11
Total In    1   3261836377
Mean SAT    1   1202680560
Populati    1   1372121063
1993/94     1   400872070

Unusual Observations
Obs   Avg. Hou   1994 Avg   Fit      StDev Fit   Residual   St Resid
  1   49803      204800     110273   13150       94527      2.65R
  8   30434      58200      42311    28127       15889      0.62 X
 21   196697     361100     397689   30411       -36589     -1.61 X
 38   84007      269926     178710   9319        91216      2.48R
 39   87854      269926     178242   13518       91684      2.58R

Figure 12.1 (con’t) Computer Output from Minitab
TRY IT NOW!
Order Filling
Finding the Multiple Regression Model
A mail-order catalog company is looking at the time it takes to prepare
an order for shipping. In particular, the company is looking at the
amount of time that is spent collecting the items ordered and packing
them. In this operation, an employee (a checker) is given an order to fill.
Items are located in bins in one of six different sections of the warehouse.
The checkers move around the warehouse retrieving the items and
packing them into the shipping cartons. The company has looked at the
operation in some detail and believes that three major variables are
involved in the process: the number of items ordered, the number of
different locations (sections of the warehouse) in which the items are
located, and the experience level (in months) of the checker. Data are
collected on 45 orders. A portion of the data is shown next:
Time (min)   Items   Locations   Experience (months)
9.3          1       6           8
4.4          1       3           13
4.4          5       2           3
5.6          3       3           5
4.9          11      1           4
8.8          14      3           5
The company wants to know how the time it takes to fill an order is
related to the other three variables, so it decides to use a multiple
regression model.
Write down the equation of the regression model.
Interpret the value of each of the coefficients of the model.
TRY IT NOW!
Order Filling
Finding Predicted Values
The mail-order catalog company that is looking at the time it takes to
prepare an order for shipping wants to see how well the model it has
found predicts the time to fill an order. The data for six of the
observations are shown next:
Time (min)   Items   Locations   Experience (months)
9.3          1       6           8
4.4          1       3           13
4.4          5       2           3
5.6          3       3           5
4.9          11      1           4
8.8          14      3           5
Use the model to find the predicted time to fill an order for each of the six
sets of input data you have.
Compare the predicted results to the actual data. Do you think that this
model does a good job of predicting the time to fill an order? Why or
why not?
Source of     Degrees of    Sum of
Variation     Freedom       Squares    Mean Square                F value
Regression    k             SSR        MSR = SSR / k              F = MSR / MSE
Error         n - k - 1     SSE        MSE = SSE / (n - k - 1)
Total         n - 1         SST

Figure 12.2 ANOVA Table for Multiple Regression Model
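The arithmetic in the ANOVA table can be checked with a short script. The sketch below plugs in the sums of squares printed in Figure 12.1 (n = 41 observations, k = 5 independent variables); the function name `anova_table` is hypothetical.

```python
def anova_table(ssr, sse, n, k):
    """ANOVA quantities for a multiple regression with k predictors and n observations."""
    df_regression = k
    df_error = n - k - 1
    msr = ssr / df_regression      # MSR = SSR / k
    mse = sse / df_error           # MSE = SSE / (n - k - 1)
    return {
        "df_regression": df_regression,
        "df_error": df_error,
        "df_total": n - 1,
        "SST": ssr + sse,          # SST = SSR + SSE
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
    }

# Sums of squares as printed in Figure 12.1 (SSR is rounded in the printout).
table = anova_table(ssr=1.52341e11, sse=50427624859, n=41, k=5)
print(round(table["F"], 2))  # 21.15, in line with the F value in Figure 12.1
```

Because SSR is rounded in the printout, the computed F agrees with the reported 21.14686162 only to about four decimal places.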
TRY IT NOW!
Order Filling
Testing the Significance of the Model
The mail-order company looking at factors related to the time
to fill an order wants to know if its model is significant.
Write down the hypotheses that it needs to test.
Using the Excel output from your textbook, locate the values for
MSR and MSE and their degrees of freedom.
Use the values of MSR and MSE to calculate the value of the
F statistic and compare it to the value in the table.
At the 0.05 level of significance, what is the critical value for the test?
What can the mail-order company conclude as a result of its test? Verify
your answer by using the p value from the printout.
The Coefficient of Multiple Determination, R², is a measure of the
percentage of the variation in the dependent variable, Y, that can be
accounted for by the complete set of independent variables,
X1, X2, …, Xk, in the model.
The Adjusted R² is the value of the coefficient of multiple
determination adjusted to reflect the number of variables
in the model.
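The slides do not state the adjustment formula. The sketch below assumes the standard one, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), which does reproduce the "Adjusted R Square" values in the two Excel outputs shown in these slides. The function name is hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    r2: coefficient of multiple determination
    n:  number of observations
    k:  number of independent variables in the model
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Home value model from Figure 12.1: R Square = 0.751304423, n = 41, k = 5.
home_value = adjusted_r_squared(0.751304423, n=41, k=5)
print(round(home_value, 4))  # 0.7158

# Order filling model: R Square = 0.838761061, n = 45, k = 3.
order_filling = adjusted_r_squared(0.838761061, n=45, k=3)
print(round(order_filling, 4))  # 0.827
```

The penalty grows with k, so adding a variable that explains little can lower the adjusted value even though R² itself never decreases.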
TRY IT NOW!
Order Filling
The Coefficient of Multiple Determination
The mail-order company looking at factors related to the time
to fill an order wants to know if the set of independent variables that
it selected does a good job of accounting for the variation in the time
to fill an order.

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.915838993
R Square             0.838761061
Adjusted R Square    0.82696309
Standard Error       1.021396967
Observations         45
Using the output, find the value of the coefficient of multiple
determination.
If you were the manager of the company, would you be satisfied
with this model? Why or why not?
Look at the value of adjusted R². What do you think this value
might be telling the company?
TRY IT NOW!
Order Filling
Testing Individual Regression Coefficients
The mail-order company looking at factors related to the time to fill
an order wants to know how the individual variables contribute to the
model. It decides to look at the computer analysis again.
Look at the coefficients of each of the three variables in the model and
perform the appropriate hypothesis tests.
Which variable(s) have nonzero coefficients? Which variable(s) have
coefficients that are equal to zero?
As a result of these tests, what recommendation would you make?
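The order filling output is in the textbook, so this sketch uses the home value output in Figure 12.1 instead. Each t statistic is the coefficient divided by its standard error, and it is compared with a two-tailed critical value. The value 2.030 (alpha = 0.05, 35 degrees of freedom) is an assumption read from a t table, not something given in the slides.

```python
# (coefficient, standard error) pairs as printed in Figure 12.1
coefficients = {
    "Avg. Household Income": (1.637671454, 0.312118144),
    "Total Instructional Expenditures per pupil": (39.15871129, 21.78140875),
    "Mean SAT Score": (89.63285763, 113.8778516),
    "Population Diversity (% non-white)": (-47820.09741, 62143.05497),
    "1993/94 Avg. Violent Crime (per 1000)": (-1132.378569, 2146.787097),
}

# Assumed two-tailed critical value for alpha = 0.05 with
# n - k - 1 = 41 - 5 - 1 = 35 degrees of freedom (from a t table).
T_CRITICAL = 2.030

def t_statistic(coef, std_err):
    """Test statistic for H0: the coefficient equals zero."""
    return coef / std_err

significant = []
for name, (coef, std_err) in coefficients.items():
    t = t_statistic(coef, std_err)
    if abs(t) > T_CRITICAL:
        significant.append(name)
        decision = "reject H0"
    else:
        decision = "fail to reject H0"
    print(f"{name}: t = {t:.2f}, {decision}")
```

The computed t values match the "t Stat" column in Figure 12.1, and the decisions agree with comparing each P-value to 0.05.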
Discovery Exercise 12.1
Find the Best Model
The office of Institutional Planning at a university in the
Far West is interested in understanding what factors influence the
graduation rate, that is, the percentage of entering freshmen who actually
graduate from the university. The university planners have collected data
on 46 universities with traits similar to theirs and have selected three
variables that they think are related to the graduation rate: 25th percentile
combined SATs of accepted students, the acceptance rate (percentage of
students who apply that are accepted by the university), and educational
expenditure per student ($) by the university.
A sample of the data is shown below:

                                   1995 Actual                                        Education expenditures
Rank/School Name                   Graduation Rate   SAT/ACT 25th   Acceptance Rate   per student
Indiana University - Bloomington   59%               NA             80%               9713
SUNY--Binghamton                   74%               NA             40%               9080
Allegheny University               52%               786            61%               33270
Univ. of California--Riverside     56%               910            78%               13403
Oregon State University            53%               949            89%               10182
University of Hawaii--Manoa        55%               960            65%               13360
New Jersey Inst. of Technology     63%               970            67%               14030
A. Which variable do you think will have the greatest effect
on graduation rate? Explain why you chose this variable.
B. Find a simple linear model that predicts graduation rate using the
variable that you think is most important. What is the value of R² for
your model? Is the model significant?
C. Do this again using the other two independent variables.
D. Which one-variable model do you think is the best? Why?
E. How many different models with two independent variables could
you find? List them all.
F. Find the multiple regression models for all of the two-variable models.
G. Which two-variable model do you think is best? Why? Does the
model you think is best contain the variable from the best one-variable
model? Would you expect it to?
H. Find the multiple regression model that predicts graduation
rate using all three independent variables. What is R² for this model?
Is it significant?
I. Now, fill in the table in your textbook for each of the “best” models
that you have found.
J. If you were going to use a model to predict graduation rate, which of
these three models would you choose? Why?
Model Building Techniques are methods used for identifying the best
multiple regression model from a set of independent variables.
These methods include:
• Forward Selection
• Backward Elimination
• Stepwise Regression
• All Possible Regressions
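To make the first technique in the list concrete, here is a minimal sketch of forward selection. It assumes NumPy and uses an "F-to-enter" rule: at each step the candidate that reduces SSE the most is added, provided its partial F statistic exceeds a threshold. The threshold of 4.0, the function names, and the data are all assumptions for illustration; statistical packages typically use a p value instead.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares for a fit of y on an intercept plus the columns in cols."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Greedy forward selection; f_to_enter is an assumed threshold."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    current_sse = sse(X, y, selected)          # intercept-only model
    while remaining:
        # Try each remaining variable and keep the one with the smallest SSE.
        best_sse, best_col = min((sse(X, y, selected + [c]), c) for c in remaining)
        df_error = n - (len(selected) + 1) - 1  # error df after adding the candidate
        partial_f = (current_sse - best_sse) / (best_sse / df_error)
        if partial_f < f_to_enter:
            break                               # no candidate helps enough; stop
        selected.append(best_col)
        remaining.remove(best_col)
        current_sse = best_sse
    return selected

# Hypothetical data: y depends on columns 0 and 2 but not on column 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 5.0 + 4.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)

selected = forward_selection(X, y)
print(selected)  # column indices in the order they entered the model
```

Backward elimination runs the same idea in reverse, starting from the full model and dropping the weakest variable. Stepwise regression adds a check after each entry to see whether a variable already in the model should be removed.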
TRY IT NOW!
GPA
Choosing the Independent Variables
Many studies have been done on what factors are related to a
college student’s grade point average (GPA). These studies often
focus on pre-college factors, such as high school performance, SAT or
ACT scores and general socioeconomic factors such as family income
and race. Students know that once they are in college, many other factors
influence their GPA.
You are going to try to find a model relating current factors to GPA.
Make a list of all variables that you think are related to a college student’s
GPA.
From this list, identify what you think are the five most important
variables.
How would you go about gathering the data you need to do your study?
TRY IT NOW!
Order Filling
Forward Selection
The mail-order company wants to use a standard method to find
the best model from its set of three variables. It decides on forward
selection because this method is the easiest to understand. The relevant
portions of the computer output are shown in your textbook.
What variable is added on the first step of the procedure? What is its
coefficient? What is the value of R² for the first model considered?
What variable is added on the second step? What is its coefficient?
What does R² change to after this step?
Does the coefficient of the first variable change when the second
variable is added? If so, what is the new value?
Write down the final model.
TRY IT NOW!
Order Filling
Stepwise Regression
The mail-order company looking at the model for order filling
decides to use stepwise regression to see if it finds a model that is
different from the one found using forward selection.
From the Minitab output in your textbook, how many iterations did the
procedure take?
Which variable was added to the model first? What is its coefficient?
Which variable was added second? What is its coefficient?
Did the coefficient for the first variable change? If so, what is its
coefficient after the second variable is added?
What is the final model from the stepwise procedure?
TRY IT NOW!
Order Filling
All Possible Regressions
The mail-order company is pretty sure that it has identified
the best model, but it decides to use the all possible regressions
method to make sure that it is not missing something.
Based on the output in your textbook, is there a one-variable model
that is best on all criteria? If so, what is it?
What is the best two-variable model?
Is there any benefit from moving to a three-variable model?
Why or why not?
What model do you recommend the company use? Why?
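The all possible regressions method can be sketched by fitting every non-empty subset of the independent variables and comparing R² and adjusted R². The example assumes NumPy. The data are invented for illustration (the real 45 orders are in the textbook), and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def r_squared(X, y, cols):
    """R^2 for a least-squares fit of y on an intercept plus the columns in cols."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - sse / sst

def all_possible_regressions(X, y, names):
    """Return (variable names, R^2, adjusted R^2) for every non-empty subset."""
    n, k = X.shape
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            r2 = r_squared(X, y, list(cols))
            adj = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
            results.append(([names[c] for c in cols], r2, adj))
    return results

# Invented data with the same three variables as the order filling example.
rng = np.random.default_rng(0)
items = rng.integers(1, 15, size=45)
locations = rng.integers(1, 7, size=45)
experience = rng.integers(1, 25, size=45)
time = 2.0 + 0.3 * items + 0.8 * locations + rng.normal(scale=1.0, size=45)

X = np.column_stack([items, locations, experience])
results = all_possible_regressions(X, time, ["items", "locations", "experience"])
for variables, r2, adj in results:
    print(f"{', '.join(variables):32s} R2 = {r2:.3f}  adj R2 = {adj:.3f}")
```

With three candidate variables there are 3 + 3 + 1 = 7 models. R² can only rise as variables are added, so adjusted R² is the more useful column when deciding whether a larger model is worth it.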
TRY IT NOW!
Order Filling
All Possible Regressions
Now that it has a model it likes, the mail-order company would
like to make sure that the model does not violate any assumptions
of the multiple regression model. The company creates a set of residual
plots, which are shown below:
[Figure: Locations Residual Plot (Residuals versus Locations)]
[Figure: Number of Items Residual Plot (Residuals versus Number of Items)]
Look at the plot of residuals versus the locations. Does it
appear that there is a problem with the assumption of equal variances?
Look at the plot of residuals versus the number of items. Does it appear
that there is a problem with the assumption of equal variances?
The company also created a normal probability plot of the residuals:
[Figure: Normal Probability Plot (Time versus Sample Percentile)]
From this plot, what can you say about the assumption of normality?
The Cook’s Distance method compares
the values of the regression coefficients
with all observations to the values when
the ith observation is removed from the
model.
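A minimal sketch of the calculation, assuming NumPy, is shown below. Rather than refitting the model n times, it uses the standard closed form D_i = e_i² / (p · MSE) · h_ii / (1 - h_ii)², which gives the same value as comparing the coefficients with and without observation i. Here p is the number of estimated coefficients (k + 1) and h_ii is the leverage of observation i. The data are hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of a least-squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape                       # p = k + 1 estimated coefficients
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    mse = resid @ resid / (n - p)
    # Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X'.
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    h = np.diag(hat)
    return (resid ** 2 / (p * mse)) * h / (1.0 - h) ** 2

# Hypothetical data: a straight line with the last observation shifted upward.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[-1] += 15.0

distances = cooks_distance(x, y)
print(int(np.argmax(distances)))  # 9, the index of the shifted observation
```

Observations with a distance that stands well apart from the rest are candidates for a closer look; any cutoff is a rule of thumb rather than a formal test.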
A multiple-regression model has
Multicollinearity when variables in the set
of independent variables are correlated
with each other.
TRY IT NOW!
Order Filling
Checking For Multicollinearity
The mail-order company wants to check to see if the model
it is thinking about using has any problems with multicollinearity.
It calculates the correlation between the two variables that are in
the final model and finds that the correlation is -0.141.
Based on this information do you think that multicollinearity is a problem
in their model? Why or why not?
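One way to put the reported correlation in perspective is to square it, which gives the share of one variable's variation that is linearly related to the other. The sketch below also includes a plain Pearson correlation function for checking pairs of independent variables. The helper names and the small data set are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def shared_variation(r):
    """Fraction of one variable's variation linearly related to the other (r squared)."""
    return r * r

# Correlation reported in the exercise for the two variables in the final model.
r_reported = -0.141
print(round(shared_variation(r_reported), 4))  # 0.0199

# Hypothetical pair of strongly related predictors, for contrast.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 8, 10, 13]
r_strong = pearson_r(x1, x2)
print(shared_variation(r_strong) > shared_variation(r_reported))
```

With more than two independent variables, the same check is run on every pair, since any strongly correlated pair can cause trouble.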
Multiple Regression Models in Excel
The tools used in Excel for multiple regression models are
the same ones that are used for the simple linear model. The
only difference is that the data range for the X variables will
cover more than one column. It is very important, however,
that the X variable columns be adjacent to each other. If
they are not, you will have to rearrange the worksheet before
you do the analysis.
Stepwise Regression with KaddStat
• From the Kadd menu select Regression and Correlation > Forward
  Stepwise, and the dialog box shown in Figure 12.3 opens. The dialog
  box is identical to the one for Single/Multiple regression, with two
  additional inputs.
• Just above where you indicate which graphical output you want is a
  section labeled Select one:.
• This allows you to indicate whether you want to have only the final
  model output or whether you want each step output.
• In addition, you need to specify the p value to be used for adding and
  dropping variables from the model.
• KaddStat uses the same p value for both.
Chapter 12 Summary
In this chapter you have learned:
• Modeling is an iterative process for which there is no single correct
  answer.
• There are several steps to the modeling process:
    - Identifying potential independent variables
    - Collecting data
    - Finding a potential model
• Once a potential model is found, the process of Model Building takes
  place to find a “best” model.
• The objective is finding a model that does an acceptable job of
  explaining or predicting the dependent variable with as few
  independent variables as possible.
• Some model-building techniques used are:
    - Forward Selection
    - Backward Elimination
    - Stepwise Regression
    - All Possible Regressions
• Once the “best” model is identified, and before it can be used for
  decision-making purposes, it must be checked for problems such as:
    - Violation of Assumptions
    - Influential Observations
    - Multicollinearity