CHAPTER 11A
Multiple Regression - Interaction Concepts

I. Purpose

A. To study the quantitative (or qualitative) relationships between one dependent (ratio or interval) variable and one or more quantitative independent variables, one or more qualitative independent variables, and any potential INTERACTION.

What is interaction? Interaction is the recognition that the presence of a second independent variable may change the effect that the first independent variable would otherwise have on the dependent variable.

B. Example - A city slicker who recently bought a farm wanted to burn a large pile of tree trunks that had been bulldozed down. He asked a local farmer for advice. The farmer suggested using some matches to start the fire. The city slicker bought a box of wooden matches and attempted to set the logs on fire. After a futile half day of work and no fire, the city slicker returned to the farmer for more advice. The farmer asked if he had used kerosene to start the fire. The city slicker said no, went to town, and purchased 20 gallons of kerosene. Upon returning, he dumped 5 gallons of kerosene on the pile and nothing happened. He repeated this futile exercise 3 more times until all of the kerosene was exhausted. When he returned to the farmer and explained that the kerosene had not worked either, the farmer asked if he had used the matches IN COMBINATION with the kerosene. The city slicker said no!! Oh what a wonderful thing it is to learn about INTERACTIONS!!! The logs can turn into a raging fire when the two are used in combination! But nothing happens when they are used separately!!!

II. Basic Patterns

A. One quantitative dependent variable and two quantitative independent variables with no interaction.
Model:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]

Pattern and Example:

[Figure: Crop Yield (bushels/acre), y, plotted against Tons Fertilizer/Acre, \(x_1\), with one line for High Pesticide Rate (gal./acre) and one line for Low Pesticide Rate (gal./acre), \(x_2\).]

Interpretation of Interaction (or lack thereof)

The incremental increase in crop yield for each additional ton of fertilizer is the same regardless of whether the pesticide usage rate is high or low. Thus, differing amounts of pesticide do not interact with the fertilizer to change the fertilizer's effect on crop yield.

B. One quantitative dependent variable with two quantitative independent variables with interaction.

Model:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \]

Pattern and Example:

[Figure: Sales (units/week), y, plotted against Advertising (in $), \(x_1\), with one line for High Shelf Space (ft3) and one line for Low Shelf Space (ft3), \(x_2\).]

Interpretation of Interaction

The incremental increase in sales (units per week) for each additional advertising dollar is DIFFERENT depending on the amount of shelf space allocated to the product. With more shelf space exposure, the incremental effect of advertising on sales is greater (much like a "booster effect").

C. One quantitative dependent variable with one quantitative independent variable and one qualitative independent variable with k = 2 levels (e.g. male/female, yes/no) and no interaction.

Model:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]

where \(x_1\) = quantitative variable
      \(x_2\) = qualitative variable, coded \(x_2\) = (0,1)

Note: When the qualitative variable has k levels, k - 1 dummy variables are utilized.

Pattern and Example:

[Figure: GPA, y, plotted against ACT, \(x_1\), with one line for Working Student (\(x_2\) = 1) and one line for Non-Working Student (\(x_2\) = 0).]

Interpretation of Interaction (or lack thereof)

The incremental increase in one's GPA for a given increase in ACT score is the same regardless of whether one is working or not working. However, working students seem on average to perform better than non-working students.

D. One quantitative dependent variable with one quantitative independent variable and one qualitative independent variable with k = 2 levels AND interaction.
Model:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \]

where \(x_1\) = quantitative variable
      \(x_2\) = qualitative variable, coded \(x_2\) = (0,1)

Pattern and Example:

[Figure: Productivity (units/manhour), y, plotted against Production Order Size, \(x_1\), with one line for TQM Plant (\(x_2\) = 1) and one line for Non-TQM Plant (\(x_2\) = 0).]

Interpretation of Interaction

The incremental increase in productivity for a given increase in order size is greater at a plant with TQM practices than at a plant that is not practicing TQM.

III. Extensions - More Than k = 2 Levels of a Qualitative Variable

A. Oftentimes, a qualitative variable will have more than k = 2 levels. In such cases, the graphs will simply have more parallel or non-parallel lines representing the k levels.

B. Dummy Variables Utilized for Qualitative Variables

A qualitative variable represents an exhaustive list of all of the mutually exclusive cases of a variable. For example, student rank is represented by Freshman, Sophomore, Junior and Senior (k = 4 levels).

1. Coding of these terms is done as follows: If k choices are given for a qualitative variable, then (k - 1) data columns are constructed with (0,1) codes.

e.g. (k = 4)

              x1   x2   x3
   Senior      1    0    0
   Junior      0    1    0
   Soph.       0    0    1
   Freshman    0    0    0

IV. Main Effects and Interactions

A. Main Effects can be either quantitative or qualitative variables. They are usually independent variables given the symbolic notations \(x_1, x_2, x_3, x_4, \ldots, x_k\). Interaction terms are the cross-products of the independent variables and can be recognized by symbolic notations such as \(x_1 x_2\), \(x_1 x_3\), \(x_2 x_3\), etc.

B. Examples:

1. Sales as a function of advertising and shelf space with interaction:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \]

   \(y\)                    Dependent Variable
   \(\beta_0\)              Constant Term
   \(\beta_1 x_1\)          Main Effect of Advertising
   \(\beta_2 x_2\)          Main Effect of Shelf Space
   \(\beta_3 x_1 x_2\)      Interaction

2. GPA as a function of ACT for different classes:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 \]

   \(y\)                                                  Dependent Variable
   \(\beta_1 x_1\)                                        Main Effect of ACT
   \(\beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\)            Main Effect of Different Classes (k = 4)
   \(\beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4\)  Interaction Effect of ACT and the Different Classes

V. Testing for the Significance of the Interaction

A. Individual Coefficients

1. Usually the t test on the individual coefficient is used for testing the significance of the interaction term.

   \(H_0: \beta_i = 0\)
   \(H_a: \beta_i \neq 0\)

   Test Statistic: t with n - (k + 1) d.f.

\[ t = \frac{b_i - 0}{s_{b_i}} \]

   The t values and related p-values are shown on the computer printout. The conclusion is drawn by comparing the calculated t with the critical t (or by using the p-value).

VI. EXAMPLES Using the Computer

A. Easton is interested in determining if size and age (both ratio variables) and their potential interaction might combine in some way to help predict or "explain" the variation in price. In order to test for interaction, the cross-product of age and size must be calculated and added to the data base. Thus, go to Data Management/Transform Variables or Cases/Option H: Variable X * Variable Y/Option C: Multiply/Select Size as Y and Age as X/Assign a new name called Age*Size/Add to Data Matrix.

B. Now, go to Statistics/Regression Analysis/Multiple Regression/Select Price as Dependent and Size, Age, and Age*Size as independent/Add Constant? Yes.

MULTIPLE REGRESSION MODEL:
Price = 34.7586Size + -1304.51Age + 0.496094Age*Size + 30761.8CNST

              COEF.       SD. ER.     t(514)       P-VALUE
   Size       34.7586     5.29087     6.56954      1.23911E-10
   Age        -1304.51    1553.15     -0.839913    4.01348E-1
   Age*Size   0.496094    0.812011    0.610945     5.41506E-1

Only Size is significant; Age and the Age*Size interaction are not.

R SQ. = 0.509218
SQ. ROOT MSE = 12619.7, F(3/514) = 177.769 (P-VALUE = 4.68475E-79)

C. Since both size and area are significant (and were determined to be so in the multivariate and stepwise regression model), the likelihood that interaction exists is greater than in the case above, where a variable like age was not significant in the main effects!

Key Point: Interaction will more likely be found among significant main effects than among insignificant main effects.
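As a rough check on the Price/Size/Age printout in B, each t value is simply the coefficient divided by its standard error, as in the test statistic of Section V. The sketch below recomputes the t values from the printed COEF. and SD. ER. columns. The names `printout` and `t_statistic` are hypothetical, and the cutoff of 1.96 is an assumption: it is the large-sample approximation to the two-sided .05 critical value, used here in place of an exact table lookup for 514 d.f.

```python
# Coefficients and standard errors copied from the Price/Size/Age printout.
printout = {
    "Size": (34.7586, 5.29087),
    "Age": (-1304.51, 1553.15),
    "Age*Size": (0.496094, 0.812011),
}

# Assumption: with 514 d.f. the two-sided .05 critical t is close to the
# normal value 1.96; a t table or software would give the exact figure.
T_CRITICAL = 1.96


def t_statistic(coef, std_err, hypothesized=0.0):
    """t = (b - hypothesized value) / standard error of b."""
    return (coef - hypothesized) / std_err


t_stats = {name: t_statistic(b, se) for name, (b, se) in printout.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name:10s} t = {t:9.5f}  {verdict}")
```

The recomputed values agree with the t(514) column, and only Size exceeds the cutoff, which matches the conclusion drawn from the p-values.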
In order to examine the interaction of size and area, Size*Area(1) and Size*Area(2) will need to be created and added to the data base. The procedure is the same as the one used to add Age*Size.

MULTIPLE REGRESSION MODEL:
Price = 34.2616Size + 8493.37Area(1) + -4822.96Area(2) + 8.53168Area(1)*Size + 5.00155Area(2)*Size + 16578.8CNST

                  COEF.       SD. ER.     t(512)      P-VALUE
   Size           34.2616     1.79021     19.1383     5.49079E-62
   Area(1)        8493.37     4140.6      2.05124     4.07519E-2
   Area(2)        -4822.96    4397.6      -1.09673    2.73277E-1
   Area(1)*Size   8.53168     2.12301     4.01868     6.73169E-5
   Area(2)*Size   5.00155     2.26718     2.20607     2.78225E-2
   CNST           16578.8     3539.93     4.68337     3.61999E-6

The main effects of Size and Area(1) are significant, as are the interactions of Size with both Area(1) and Area(2). [Note: each of these p-values is < .05.] Since the main effect of Area(2) is not significant (p-value = .273), it may be dropped from the model if so desired.

R SQ. = 0.887382
SQ. ROOT MSE = 6056.94, F(5/512) = 806.871 (P-VALUE = 4.22407E-240)

Note the results of the analysis. With Size, Area, and their interactions in the model, the R SQ. value is 88.74% while the Sq. Root MSE is about 6057. The coefficients with the lowest p-values were Size and the interaction of Size and Area(1), that is, whether the home was in Dallas. In order to further illustrate the flexibility one has in model building, a new model is tried which attempts to predict Price solely as a function of Size and the interaction of Size and Area(1).

MULTIPLE REGRESSION MODEL:
Price = 33.9051Size + 11.5097Area(1)*Size + 20055.3CNST

                  COEF.      SD. ER.     t(515)     P-VALUE
   Size           33.9051    0.829605    40.8689    9.36923E-164
   Area(1)*Size   11.5097    0.295042    39.0105    6.98070E-156
   CNST           20055.3    1567.5      12.7944    9.68119E-33

R SQ. = 0.875354
SQ. ROOT MSE = 6353.61, F(2/515) = 1808.35 (P-VALUE = 1.37187E-233)

Notice that the R SQ. in this model is only slightly smaller, and the Sq. Root MSE only slightly larger, than in the previous model.
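The cross-product columns used in the printouts above (Area(1)*Size and Area(2)*Size, and earlier Age*Size) are element-wise products of existing columns. The sketch below shows one way the (k - 1) dummy columns and their interactions with Size might be built outside the course software. The three data rows are made up purely for illustration and are not the Easton data, the function names are hypothetical, and the coding of Area(2) as Ft. Worth is an assumption (the text only states that Area(1) = 1 means the home is in Dallas).

```python
def dummy_columns(value, levels):
    """Code a qualitative value as (k - 1) 0/1 columns.

    `levels` lists the k - 1 levels that get their own column; any other
    value (the baseline level) is coded as all zeros.
    """
    return [1 if value == level else 0 for level in levels]


def add_interactions(size, dummies):
    """Return the cross-products Size * dummy for each dummy column."""
    return [size * d for d in dummies]


# Hypothetical rows (size in sq. ft., area name); NOT the Easton data.
homes = [(1500, "Dallas"), (2000, "Ft. Worth"), (1800, "Outlying")]

# Assumed coding: Area(1) = Dallas, Area(2) = Ft. Worth, baseline = Outlying.
coded_levels = ["Dallas", "Ft. Worth"]

design_rows = []
for size, area in homes:
    dummies = dummy_columns(area, coded_levels)
    # Column order: Size, Area(1), Area(2), Area(1)*Size, Area(2)*Size
    design_rows.append([size] + dummies + add_interactions(size, dummies))

for row in design_rows:
    print(row)
```

For a Dallas home the Area(1)*Size column equals Size, and it is zero otherwise. That is what allows the fitted slope on Size to differ between areas.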
This reduced model was selected to illustrate, in a simplified fashion, how the equation can be reduced and then graphed, as expected, as two straight lines that are not parallel.

D. The two linear equations for PRICE that result from the model are found by setting Area(1) to 1 if the house is in Dallas and to 0 if the house is NOT in Dallas.

   Price (Dallas)        = 33.90Size + 11.51(1)(Size) + 20055.3
                         = (33.90 + 11.51)Size + 20055.3
                         = 45.41Size + 20055.3

   Price (Not in Dallas) = 33.90Size + 11.51(0)(Size) + 20055.3
                         = 33.90Size + 20055.3

Note that the intercept is the same for both equations but the slopes are different. Thus, each additional square foot costs $45.41 in Dallas, while each additional square foot costs an average of only $33.90 in Ft. Worth and the outlying areas.

A sketch of the two price equations, for Dallas and Not in Dallas, is shown on the graph below.

[Figure: Price, y, plotted against Size. House in Dallas, Area(1) = 1: slope = $45.41/ft2. House in Ft. Worth or Outlying Areas, Area(1) = 0: slope = $33.90/ft2.]

HOMEWORK - CHAPTER 11A
INTERACTION CONCEPTS

1. A large retail discount chain was interested in determining if there was any relationship between its sales in dollars and its advertising expenditures in dollars and its shelf space for a particular product. It seemed reasonable to think that sales would increase with more advertising, but it also seemed reasonable to think that sales might increase with added shelf space. Someone also suggested that more shelf space might give the advertising an added effectiveness, since the product advertised would be easier to find on the shelf with a bigger display space. Twenty weeks of sales of a common household product were recorded, along with the differing amounts of advertising dollars and the varying amounts of shelf space (high, medium, and low). A data file with the following values and symbols was developed.
   y  = sales in $
   x1 = advertising in $
   x2 = shelf space in square feet

              y       x1      x2
   Case 1     2010    201     75
   Case 2     1850    205     50
   Case 3     2400    355     75
   Case 4     1575    208     30
   Case 5     3550    590     75
   Case 6     2015    397     50
   Case 7     3908    820     75
   Case 8     1870    400     30
   Case 9     4877    997     75
   Case 10    2190    515     30
   Case 11    5005    996     75
   Case 12    2500    625     50
   Case 13    3005    860     50
   Case 14    3480    1012    50
   Case 15    5500    1135    75
   Case 16    1995    635     30
   Case 17    2390    837     30
   Case 18    4390    1200    50
   Case 19    2785    990     30
   Case 20    2989    1205    30

(a) Plot and interpret the following graphs.
    y vs. x1
    y vs. x2
    y vs. x2 broken down by x3
    x1 vs. x2

(b) Find regression models for the following situations:
    (1) Sales = f(advertising and shelf space)
    (2) Sales = f(advertising, shelf space and the interaction between the two)
    (3) Sales = f(interaction only)

(c) Complete appropriate tests of hypotheses to determine the "best model".

(d) What would be the marginal efficiency in terms of sales of an extra dollar of advertising if only 50 sq. ft. of shelf space were to be allocated to the product?

(e) Explain what "interaction" means in relationship to this problem.

2. A large microcomputer company is interested in developing some work standards for its repair operators who work outside the office at remote user locations. Increased worker efficiency and effectiveness are being dictated by the desire to become more profitable and productive. In order to establish the desired work standards, a management employee has gone on location with 10 randomly selected repairmen on different job assignments and recorded the time it took each repairman to correct the microcomputer problems. This procedure was preferred to having the repairmen report their own times, since the data would be more accurate with excessive downtime and rest time controlled. The table below contains the results of the study.

   Repair Time    No. of Units    Experience
   (in hrs.)      Repaired        (in months)
      1.0             1              12
      3.1             3               8
     17.0            10               5
     14.0             8               2
      6.0             5              10
      1.8             1               1
     11.5            10              10
      9.3             5               2
     12.2            10               8
      6.0             4               6

(a) Because of the time study analysis and the thought of impending doom on the repairmen's part, the repairmen suggested that a standard time be established for each unit. They suggested a time standard equal to the average maintenance time per computer plus 20% to allow for random variation. What repair time per computer would they suggest? (Note: Which mean and standard deviation calculation would be appropriate: a weighted average, or just a simple average of each repairman's times? Discuss the assumptions of using each and tell why you selected the method that you did.)

(b) Management, meanwhile, made a counterproposal that also utilized statistics. They decided to use the same concept but place the standard at a figure that would allow the repairman a 70% chance of completing the job "under standard". What figure would they be suggesting using this analysis?

(c) Using this figure, what total time would management likely suggest if the repairman was given 6 computers to repair and management wanted the repairman to have a 70% chance of finishing below standard? (Note: Remember the concept of independent combinations of means and variances.)

(d) Based on the inconclusive evidence from the initial analysis and reactions, some regression analysis was desired. Based on a plot of maintenance time vs. number of microcomputers, what type of relationship might be suggested?

(e) Based on a plot of maintenance time versus experience, what type of relationship might be suggested?

(f) Based on a plot of maintenance time PER COMPUTER (new variable) versus experience, what might be concluded?
(g) Does there appear to be any credence in the argument put forward by one manager that the incremental time to repair another computer at a location is affected by the amount of experience that the repairman happens to have? (Plot time/computer versus experience.)

(h) One manager suggested that the total time that the repairman should take to repair computers off site was influenced by two independent factors: the total number that had to be repaired and the experience that the repairman had. Can we at this point be sure that one or both of these factors did in fact significantly affect the total repair time?

(i) If this particular model was selected to describe the total repair time, what would be the exact equation that you would suggest?

(j) What percent of the variation in repair time would be explained by the variation in the number of computers to be repaired and the experience of the repairman?

(k) Using this analysis, if a man was assigned 6 computers to repair and he had 8 months experience, what would be his expected average time to complete the task? __________ He should complete the task in less than how many hours 70% of the time?

(l) After completing the analysis, one manager pointed out that the results implied that the incremental time to complete an additional computer was constant regardless of the amount of experience that the repairman had. He disagreed with this basic hypothesis of the model. He suggested instead that the model would be more appropriate if it indicated that each incremental computer would take less time if a repairman had more experience. If this wasn't so, WHY SHOULD WE BE GIVING HIGHER PAY AND MERIT BONUSES IF THEY WERE NO MORE CAPABLE THAN A NEW REPAIRMAN? Is this argument valid? (Prove with a hypothesis test.)

(m) Taking the argument presented in (l) above as valid, what would be the equation which would demonstrate the relationship between time and the number of computers and experience on the job?
(n) What percent of the variation in repair time is explained under these hypothesized conditions?

(o) Which is the "best model" and why?

(p) Using the best model, what would be the average amount of time it would take to complete 6 computers if the repairman had 8 months experience on the job?

(q) What total time could have been assigned so that we would be 70% sure that he would have had enough time to complete the job?

(r) While on a particular job, the customer told the repairman (with 8 months experience) that one additional computer had broken down and wasn't on the invoice sent to the main office. Could he also fix that machine while he was there? The repairman mentioned that he would have to call his manager for more time authorization. As the manager, how much more time on the average should you allow the repairman for this additional computer?