Multiple Regression - Interaction Concepts

CHAPTER 11A
I. Purpose
A. To study the quantitative (or qualitative) relationships between one dependent (ratio or interval) variable and
one or more quantitative independent variables, one or more qualitative variables, and any potential
INTERACTION. What is interaction? Interaction is the recognition of the fact that the presence of a second
independent variable may change the normal effect of the first independent variable.
B. Example - A city slicker who recently bought a farm wanted to burn a large pile of tree trunks that had been
bulldozed down. He asked a local farmer for advice. The farmer suggested using some matches to start the fire. The
city slicker bought a box of wooden matches and attempted to set the logs on fire. After a futile half day of work and
no fire, the city slicker returned to the farmer for more advice. The farmer asked if he had used kerosene to start the
fire. The city slicker said no and went to town and purchased 20 gallons of kerosene. Upon returning, he dumped 5
gallons of kerosene on the pile and nothing happened. He repeated this futile exercise 3 more times until all of the
kerosene was exhausted. Upon returning to the farmer and explaining that the kerosene had not worked either, the
farmer asked if he had used the matches IN COMBINATION with the kerosene. The city slicker said no!! Oh what
a wonderful thing it is to learn about INTERACTIONS!!! The logs can turn to a raging fire when the two are used
in combination! But nothing happens when they are used separately!!!
II. Basic Patterns
A. One quantitative dependent variable and two independent quantitative variables with no interaction.
Model: y = β0 + β1x1 + β2x2 + ε
Pattern and Example
[Graph: Crop Yield (bushels/acre), y, versus Tons Fertilizer/Acre, x1, with one line for High Pesticide Rate (gal./acre) and one for Low Pesticide Rate (gal./acre), x2.]
Interpretation of Interaction (or lack thereof)
The incremental increase in crop yield for each additional ton of fertilizer is the same regardless of whether
the pesticide usage rate was high or low. Thus, the presence of differing amounts of pesticide does not interact
with the fertilizer to change the fertilizer's effect on crop yield.
B. One quantitative dependent variable with two quantitative independent variables with interaction.
Model: y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Pattern and Example
[Graph: Sales (units/week), y, versus Advertising (in $), x1, with one line for High Shelf Space (ft3) and one for Low Shelf Space (ft3), x2.]
Interpretation of Interaction
The incremental increase in sales (units per week) for each additional advertising dollar is DIFFERENT depending
on the amount of shelf space allocated to the product. Thus, with higher shelf space exposure, the incremental effect
of advertising on sales is greater (much like a "booster effect").
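As a minimal sketch of the interaction model above: the change in y per unit of x1 works out to β1 + β3x2, so it shifts with the level of shelf space. The coefficient values below are hypothetical, chosen only for illustration; they are not estimates from any data in this chapter.

```python
# Hypothetical coefficients for y = b0 + b1*x1 + b2*x2 + b3*x1*x2
# (illustrative values only, not estimated from data).
B0, B1, B2, B3 = 500.0, 1.0, 10.0, 0.04


def predicted_sales(advertising, shelf_space):
    """Predicted sales (units/week) from the interaction model."""
    return (B0 + B1 * advertising + B2 * shelf_space
            + B3 * advertising * shelf_space)


def advertising_slope(shelf_space):
    """Change in predicted sales per extra advertising dollar: b1 + b3*x2."""
    return B1 + B3 * shelf_space


low_slope = advertising_slope(30)   # low shelf space
high_slope = advertising_slope(75)  # high shelf space
print(low_slope, high_slope)
```

With b3 = 0 the two slopes would be identical, which is the no-interaction pattern of part A.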
C. One quantitative dependent variable with one quantitative independent variable and one qualitative
independent variable with k = 2 levels (e.g. male/female, yes/no) and no interaction.
Model: y = β0 + β1x1 + β2x2 + ε
where x1 = quantitative variable
x2 = qualitative variable and x2 = (0,1)
Note: When the qualitative variable has k levels, k-1 dummy variables are utilized.
Pattern and Example:
[Graph: GPA, y, versus ACT, x1, with one line for Working Student (x2 = 1) and one for Non-Working Student (x2 = 0).]
Interpretation of Interaction (or lack thereof)
The incremental increase in one's GPA given an increase in ACT scores is the same regardless of whether one
is working or not working. However, working students seem on the average to perform better than non-working
students.
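Substituting the two values of x2 into the no-interaction model shows why the two lines have the same slope. This is only a worked restatement of the model for this case, with the error term omitted:

```latex
\begin{aligned}
x_2 = 0:&\quad y = \beta_0 + \beta_1 x_1 \\
x_2 = 1:&\quad y = (\beta_0 + \beta_2) + \beta_1 x_1
\end{aligned}
```

Both lines have slope β1; they differ only in intercept, by the amount β2.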
D. One quantitative dependent variable with one quantitative independent variable and one qualitative
independent variable with k = 2 levels AND interaction.
Model: y = β0 + β1x1 + β2x2 + β3x1x2 + ε
where x1 = quantitative variable
x2 = qualitative variable and x2 = (0,1)
Pattern and Example:
[Graph: Productivity (units/manhour), y, versus Production Order Size, x1, with one line for TQM Plant (x2 = 1) and one for Non-TQM Plant (x2 = 0).]
Interpretation of Interaction
The incremental increase in productivity given a larger order size is greater at a plant having TQM practices
than at a plant that is not practicing TQM.
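A minimal sketch of how the model in D splits into two lines when x2 is a (0,1) dummy: for x2 = 0 the intercept and slope are β0 and β1, and for x2 = 1 they are β0 + β2 and β1 + β3. The coefficient values are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical coefficients for y = b0 + b1*x1 + b2*x2 + b3*x1*x2,
# where x2 is a 0/1 dummy (illustrative values only).
B0, B1, B2, B3 = 10.0, 0.02, 1.5, 0.01


def line_for_group(x2):
    """Return (intercept, slope) of y versus x1 for dummy value x2 (0 or 1)."""
    intercept = B0 + B2 * x2
    slope = B1 + B3 * x2
    return intercept, slope


non_tqm = line_for_group(0)  # (b0, b1)
tqm = line_for_group(1)      # (b0 + b2, b1 + b3)
print("Non-TQM plant:", non_tqm)
print("TQM plant:    ", tqm)
```

A nonzero b3 is what makes the two lines non-parallel.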
III. Extensions - More Than k = 2 Levels of a Qualitative Variable
A. Oftentimes, a qualitative variable will have more than k = 2 levels. In such cases, the graphs will simply
have more parallel or non-parallel lines representing the k levels.
B. Dummy Variables Utilized for Qualitative Variables
A qualitative variable represents an exhaustive list of all of the mutually exclusive cases of a variable, for
example, student rank is represented by Freshman, Sophomore, Junior and Senior (k = 4 levels).
1. Coding of these terms is done as follows: If k choices are given for a qualitative variable, then (k - 1)
data columns are constructed with (0,1) codes.
e.g. Sr., Jr., Soph., and Freshman (k = 4)
            x1   x2   x3
Senior       1    0    0
Junior       0    1    0
Soph.        0    0    1
Freshman     0    0    0
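The (k - 1) coding rule can be sketched in a few lines of Python. The function name `dummy_code` is hypothetical, and treating the last listed level as the all-zero baseline is an assumption that matches the Freshman row in the example.

```python
def dummy_code(value, levels):
    """Code `value` as k - 1 (0,1) columns.

    `levels` lists all k categories; the last one is the baseline and
    is coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[:-1]]


LEVELS = ["Senior", "Junior", "Soph.", "Freshman"]  # k = 4
for rank in LEVELS:
    print(rank, dummy_code(rank, LEVELS))
```

Only k - 1 columns are needed because the baseline level is identified by all of the other columns being 0.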
IV. Main Effects and Interactions
A. Main Effects can be either quantitative or qualitative variables. They are usually independent variables given
the symbolic notations of x1, x2, x3, x4, ..., xk. Interaction terms are the cross-products of the independent
variables and can be recognized by symbolic notations such as x1x2, x1x3, x2x3, etc.
B. Examples:
1. Sales as a function of advertising and shelf space with interaction:
Y = β0 + β1x1 + β2x2 + β3x1x2
where
Y = Dependent Variable
β0 = Constant Term
β1x1 = Main Effect of Advertising
β2x2 = Main Effect of Shelf Space
β3x1x2 = Interaction
2. GPA as a function of ACT for different classes.
Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x1x2 + β6x1x3 + β7x1x4
where
Y = Dependent Variable
β0 = Constant Term
β1x1 = Main Effect of ACT
β2x2 + β3x3 + β4x4 = Main Effect of Different Classes (k = 4)
β5x1x2 + β6x1x3 + β7x1x4 = Interaction Effect of ACT and the Different Classes
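To make the terms of this model concrete, the sketch below builds one row of predictor values for a single student: the constant, ACT, three class dummies, and the three ACT-by-class cross-products. The function name and the order of the class levels are assumptions made for illustration; the ACT score of 24 is a made-up value.

```python
def design_row(act, rank, levels):
    """Return [1, x1, x2, x3, x4, x1*x2, x1*x3, x1*x4] for one student.

    x1 is the ACT score; x2..x4 are (0,1) dummies for the first k - 1
    entries of `levels` (the last level is the all-zero baseline).
    """
    dummies = [1 if rank == level else 0 for level in levels[:-1]]
    interactions = [act * d for d in dummies]  # cross-products x1 * dummy
    return [1, act] + dummies + interactions


LEVELS = ["Senior", "Junior", "Soph.", "Freshman"]  # assumed ordering
print(design_row(24, "Junior", LEVELS))
print(design_row(24, "Freshman", LEVELS))
```

For the baseline class every dummy and every cross-product is 0, so only the constant and the ACT main effect remain.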
V. Testing for the Significance of the Interaction
A. Individual Coefficients
1. Usually the t test on the individual coefficient is used for testing the significance of the interaction term.
H0: βi = 0
Ha: βi ≠ 0
Test Statistic: t = (bi - 0) / sbi, with n - (k + 1) d.f.
where bi is the estimated coefficient and sbi is its standard error. The t values and related p-values will be shown
on the computer printout. Based on the appropriate t calculated versus t critical (or on the use of the p-value), the
conclusion can be drawn.
VI. Examples Using the Computer
A. Easton is interested in determining if size and age (both ratio variables) and potential interaction might
combine in some way to help predict or "explain" the variation in price. In order to test for interaction, the cross-product of age and size must be calculated and added to the data base. Thus, go to Data Management/Transform
Variables or Cases/Option H: Variable X * Variable Y/ Option C: Multiply/Select Size as Y and Age as X/Assign a
new name called Age*Size/Add to Data Matrix.
B. Now, go to Statistics/Regression Analysis/Multiple Regression/Select Price as Dependent and Size, Age, and
Age*Size as independent/Add Constant?Yes.
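The two menu steps above (create the cross-product column, then regress on both variables plus the cross-product) can be sketched outside the statistics package as well. This is an illustrative sketch only, not the package's own method: it solves the normal equations with plain Gaussian elimination, which is adequate for a small example but less robust numerically than what regression software typically uses. The data are made up and generated exactly from a known equation so the fit can be checked.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_with_interaction(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    rows = [[1.0, u, v, u * v] for u, v in zip(x1, x2)]  # add cross-product
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


# Made-up data generated exactly from y = 2 + 3*x1 + 0.5*x2 + 0.1*x1*x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u + 0.5 * v + 0.1 * u * v for u, v in zip(x1, x2)]
coefficients = fit_with_interaction(x1, x2, y)
print([round(b, 6) for b in coefficients])
```

Because the made-up y values contain no error, the fitted coefficients should reproduce the generating equation up to rounding.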
MULTIPLE REGRESSION
MODEL: Price = 34.7586Size + -1304.51Age + 0.496094Age*Size + 30761.8CNST

            COEF.       SD. ER.     t(514)       P-VALUE
Size        34.7586     5.29087     6.56954      1.23911E-10
Age         -1304.51    1553.15     -0.839913    4.01348E-1
Age*Size    0.496094    0.812011    0.610945     5.41506E-1

Only Size is significant; Age and the Age*Size interaction are not.
R SQ. = 0.509218
SQ. ROOT MSE = 12619.7, F(3/514) = 177.769 (P-VALUE = 4.68475E-79)
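As a quick check on the printout, each t value is just the coefficient divided by its standard error. The sketch below recomputes the three t ratios from the printed COEF. and SD. ER. columns. The p-values it prints use a normal approximation to the t distribution, which is an assumption that is reasonable only because the degrees of freedom (514) are large; they will differ slightly from the exact values on the printout.

```python
from statistics import NormalDist

# (coefficient, standard error) copied from the printout above.
PRINTOUT = {
    "Size": (34.7586, 5.29087),
    "Age": (-1304.51, 1553.15),
    "Age*Size": (0.496094, 0.812011),
}


def t_ratio(coef, std_err):
    """t statistic for H0: beta = 0."""
    return (coef - 0.0) / std_err


def approx_two_sided_p(t):
    """Two-sided p-value using the normal approximation to t (large d.f.)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


for name, (coef, se) in PRINTOUT.items():
    t = t_ratio(coef, se)
    print(f"{name:9s} t = {t:9.5f}  approx. p = {approx_two_sided_p(t):.4f}")
```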
C. Since both size and area are significant (and were determined to be so in the multivariate and stepwise
regression model), the likelihood that interaction exists is greater than in the case where a variable like age was not
significant in the main effects!
Key Point: Interaction will more likely be found among significant main effects than among
insignificant main effects.
In order to examine the interaction of size and area, Size*Area(1) and Size*Area(2) will need to be created and
added to the data base. The procedure is similar to adding Age*Size.
MULTIPLE REGRESSION
MODEL: Price = 34.2616Size + 8493.37Area(1) + -4822.96Area(2) + 8.53168Area(1)*Size +
5.00155Area(2)*Size + 16578.8

                COEF.       SD. ER.    t(512)      P-VALUE
Size            34.2616     1.79021    19.1383     5.49079E-62
Area(1)         8493.37     4140.6     2.05124     4.07519E-2
Area(2)         -4822.96    4397.6     -1.09673    2.73277E-1
Area(1)*Size    8.53168     2.12301    4.01868     6.73169E-5
Area(2)*Size    5.00155     2.26718    2.20607     2.78225E-2
CNST            16578.8     3539.93    4.68337     3.61999E-6

Main effects of Size and Area(1) are significant, as are the interactions of Size with both Area(1) and Area(2).
[Note: All of these p values are < .05.] Since the main effect of Area(2) is not significant, it may be dropped
from the model if so desired.
R SQ. = 0.887382
SQ. ROOT MSE = 6056.94, F(5/512) = 806.871 (P-VALUE = 4.22407E-240)
Note the results of the analysis. With the variables size, area, and their interactions, the R SQ. value is 0.8874
(88.74%) while the Sq. Root MSE is 6056.94. Apart from the constant, the coefficients with the lowest p-values
were Size and the interaction of Size and Area(1), that is, whether the home was in Dallas. In order to further
explain the flexibility one has in model building, a new model is tried which attempts to predict Price as solely a
function of Size and the interaction of Size and Area(1).
MULTIPLE REGRESSION
MODEL: Price = 33.9051Size + 11.5097Area(1)*Size + 20055.3CNST

                COEF.      SD. ER.     t(515)     P-VALUE
Size            33.9051    0.829605    40.8689    9.36923E-164
Area(1)*Size    11.5097    0.295042    39.0105    6.98070E-156
CNST            20055.3    1567.5      12.7944    9.68119E-33

R SQ. = 0.875354
SQ. ROOT MSE = 6353.61, F(2/515) = 1808.35 (P-VALUE = 1.37187E-233)
Notice the R. SQ. in this model is only slightly smaller and the Sq. Root MSE slightly larger than the previous
model. The reason for the selection of this model is to illustrate in a simplified fashion how the equation can be
reduced and then graphed as expected – two straight lines that are not parallel.
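When comparing the printouts, it can help to see that the overall F statistic follows directly from R SQ. and the degrees of freedom: F = (R²/k) / ((1 - R²)/(n - (k + 1))), where k is the number of predictors. The sketch below recomputes F from the R SQ. values printed for the three models in this section; the function name is hypothetical, and small differences from the printed F values are due to rounding of R SQ.

```python
def overall_f(r_squared, num_predictors, error_df):
    """Overall F statistic implied by R-squared.

    F = (R^2 / k) / ((1 - R^2) / error_df), where k is the number of
    predictors (excluding the constant) and error_df = n - (k + 1).
    """
    return (r_squared / num_predictors) / ((1.0 - r_squared) / error_df)


# (R SQ., k, error d.f.) as printed for the three models in this section.
MODELS = {
    "Size, Age, Age*Size": (0.509218, 3, 514),
    "Size, Area, Area*Size": (0.887382, 5, 512),
    "Size, Area(1)*Size": (0.875354, 2, 515),
}

for name, (r2, k, df) in MODELS.items():
    print(f"{name:22s} F({k}/{df}) = {overall_f(r2, k, df):.2f}")
```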
D. The two linear equations for PRICE that result from this model are found by setting Area(1) to 1 if
the house is in Dallas and 0 if the house is NOT in Dallas.
PriceDallas = 33.90Size + 11.51(1)(Size) + 20055.3 = (33.90 + 11.51)Size + 20055.3 = 45.41Size + 20055.3
PriceNot in Dallas = 33.90Size + 11.51(0)(Size) + 20055.3 = 33.90Size + 20055.3
Note the intercept is the same for both equations but the slopes are different. Thus, each additional square foot
adds an average of $45.41 to the price in Dallas, while each additional sq. ft. adds an average of only $33.90 in
Ft. Worth and the Outlying areas. A sketch of the two equations of the price in Dallas and Not in Dallas is shown on
the graph below.
[Graph: Price, y, versus Size, with one line for House in Dallas (Area(1) = 1, slope = $45.41/ft2) and one for House in Ft. Worth or Outlying Areas (Area(1) = 0, slope = $33.90/ft2).]
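The reduction to two lines can be checked numerically with the coefficients from the printed model (Price = 33.9051Size + 11.5097Area(1)*Size + 20055.3). The 2,000 sq. ft. house size is a hypothetical value chosen only to illustrate the calculation.

```python
# Coefficients from the printed model:
# Price = 33.9051*Size + 11.5097*Area(1)*Size + 20055.3
B_SIZE, B_AREA1_SIZE, CONSTANT = 33.9051, 11.5097, 20055.3


def slope(in_dallas):
    """Price per additional square foot; in_dallas is Area(1) (True/False)."""
    return B_SIZE + B_AREA1_SIZE * (1 if in_dallas else 0)


def predicted_price(size_sq_ft, in_dallas):
    """Predicted price for a house of the given size."""
    return slope(in_dallas) * size_sq_ft + CONSTANT


SIZE = 2000  # hypothetical house size in sq. ft., chosen for illustration
gap = predicted_price(SIZE, True) - predicted_price(SIZE, False)
print(f"Dallas slope:     {slope(True):.2f}")
print(f"Non-Dallas slope: {slope(False):.2f}")
print(f"Price gap at {SIZE} sq. ft.: {gap:.2f}")
```

Because the intercepts are equal, the gap between the two predicted prices is the interaction coefficient times the size, so it widens as houses get larger.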
HOMEWORK - CHAPTER 11A
INTERACTION CONCEPTS
1. A large retail discount chain was interested in determining if there was any relationship between its sales in
dollars and both its advertising expenditures in dollars and its shelf space for a particular product. It seemed
reasonable to think that sales would increase with more advertising, but it also seemed reasonable to think that sales
might increase with added shelf space. Someone also suggested that more shelf space might give the advertising an
added effectiveness, since the product advertised would be easier to find on the shelf with a bigger display space.
Twenty weeks of sales of a common household product were recorded with the differing amounts of advertising
dollars and the varying amounts of shelf space (high, medium, and low) recorded.
A data file with the following values and symbols was developed.
y = sales in $
x1 = advertising in $
x2 = shelf space in square feet
            y       x1      x2
Case 1      2010    201     75
Case 2      1850    205     50
Case 3      2400    355     75
Case 4      1575    208     30
Case 5      3550    590     75
Case 6      2015    397     50
Case 7      3908    820     75
Case 8      1870    400     30
Case 9      4877    997     75
Case 10     2190    515     30
Case 11     5005    996     75
Case 12     2500    625     50
Case 13     3005    860     50
Case 14     3480    1012    50
Case 15     5500    1135    75
Case 16     1995    635     30
Case 17     2390    837     30
Case 18     4390    1200    50
Case 19     2785    990     30
Case 20     2989    1205    30
(a) Plot and interpret the following graphs.
y vs. x1
y vs. x2
y vs. x1 broken down by x2
x1 vs. x2
(b) Find regression models for the following situations:
(1) Sales = f (advertising and shelf space)
(2) Sales = f( advertising, shelf space and the interaction between the two)
(3) Sales = f(interaction only)
(c) Complete appropriate tests of hypotheses to determine the "best model".
(d) What would be the marginal efficiency in terms of sales of an extra dollar of advertising if only 50 sq. ft. of shelf
space were to be allocated to the product?
(e) Explain what "interaction" means in relation to this problem.
2. A large microcomputer company is interested in developing some work standards for its repair operators who
work outside the office at remote user locations. Increased worker efficiency and effectiveness are being
dictated by desires to become more profitable and productive. In order to establish some desired work standards, a
management employee has gone on location with 10 randomly selected repairmen on different job assignments and
recorded the time for the repairman to correct the microcomputer problems. This procedure was preferred to
having the repairmen report their own times, since the data would be more accurate with excessive downtime and
rest time controlled.
The table below contains the results of the study.
Repair Time    No. of Units    Experience
(in hrs.)      Repaired        (in months)
1.0            1               12
3.1            3               8
17.0           10              5
14.0           8               2
6.0            5               10
1.8            1               1
11.5           10              10
9.3            5               2
12.2           10              8
6.0            4               6
(a) Because of the time study analysis and the thought of impending doom on the repairmen's part, the repairmen
suggested that a standard time be established for each unit. They suggested that a time standard be developed which
found the average maintenance time per computer plus 20% to allow for random variation in times. What repair time
per computer would they suggest? (Note: Which mean and standard deviation calculation would be appropriate -
weighted average calculations or just a simple average of each repairman's times? Discuss the assumptions of using
both and tell why you selected the method that you did.)
(b) Management, meanwhile, made a counterproposal that also utilized statistics. They decided to use the
same concept but place the standard at a figure that would allow the repairman a 70% chance of completing the job
"under standard". What figure would they be suggesting using this analysis?
(c) Using this figure, what total time would the management likely suggest if the repairman was given 6 computers
to repair and management wanted the repairman to have a 70% chance of finishing below standard? (Note:
Remember the concept of independent combinations of means and variances).
(d) Based on the inconclusive evidence from the initial analysis and reactions, some regression analysis was desired.
Thus, based on a plot of the data of maintenance time vs. number of microcomputers, what type of relationship
might be suggested?
(e) Based on a plot of the data of maintenance time versus experience, what type of relationship might be suggested?
(f) Based on a plot of maintenance time PER COMPUTER (new variable) versus experience, what might be
concluded?
(g) Does there appear to be any credence in the argument put forward by one manager that says that the incremental
time to repair another computer at a location is affected by the amount of experience that the repairman happens to
have? (Plot time/computer versus experience).
(h) One manager suggested that the total time that the repairman should take to repair computers off site was
influenced by two independent factors - the total number that had to be repaired and the experience that the
repairman had. Can we at this point be sure that one or both of these factors did in fact significantly affect the
total repair time?
(i) If this particular model was selected to describe the total repair time, what would be the exact equation that you
would suggest?
(j) What percent of the variation in repair time would be explained by the variation in number of computers to be
repaired and the experience of the repairman?
(k) Using this analysis, if a man was assigned 6 computers to repair and he had 8 months experience, what would be
his expected average time to complete the task? __________ He should complete the task in less than how many
hours 70% of the time?
(l) After completing the analysis, one manager suggested that the results implied that the incremental time to
complete an additional computer was constant regardless of the amount of experience that the repairman had. He
disagreed with the basic hypothesis of the model. He suggested instead that the model would be more appropriate if
the model indicated that each incremental computer would take less time if a repairman had more experience. If this
wasn't so, WHY SHOULD WE BE GIVING HIGHER PAY AND MERIT BONUSES IF THEY WERE NO MORE
CAPABLE THAN A NEW REPAIRMAN? Is this argument valid? (Prove with a hypothesis test).
(m) Using the argument presented in (l) above as valid, what would be the equation which would demonstrate the
relationship between time versus number of computers and experience on the job?
(n) What percent of the variation in repair time is explained under these hypothesized conditions?
(o) Which is the "best model" and why?
(p) Using the best model, what would be the average amount of time it would take to complete 6 computers if the
repairman had 8 months experience on the job?
(q) What total time could have been assigned so that we would be 70% sure that he would have had enough time to
complete the job?
(r) While on a particular job, the customer told the repairman (with 8 months experience) that one additional
computer had broken down and it wasn't on the invoice sent to the main office. Could he also fix that machine while
he was there? He mentioned he would have to call his manager for more time authorization. As the manager, how
much more time on the average should you allow the repairman for this additional computer?