Business Statistics, Can. ed.
By Black, Chakrapani & Castillo

Chapter 14
Building Multiple Regression Models

Prepared by Dr. Clarence S. Bayne
JMSB, Concordia University

Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd.

Learning Objectives
• Analyze and interpret nonlinear variables in multiple regression analysis.
• Understand the role of qualitative variables and how to use them in multiple regression analysis.
• Learn how to build and evaluate multiple regression models.
• Learn what multicollinearity is and how to deal with it.

Mathematical Transformations: Recoding Independent Variables to Create Nonlinear Models

Description of Models and Equations
• First-order model with two independent variables:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
• Second-order model with one independent variable:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$
• Second-order model with an interaction term:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
• Second-order model with two independent variables:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \varepsilon$

A Curvilinear Scatter Plot of Sales Data for 13 Manufacturing Companies

Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
 1                2.1                 2
 2                3.6                 1
 3                6.2                 2
 4               10.4                 3
 5               22.8                 4
 6               35.6                 4
 7               57.1                 5
 8               83.5                 5
 9              109.4                 6
10              128.6                 7
11              196.8                 8
12              280.0                10
13              462.3                11

[Figure: scatter plot of Sales against Number of Representatives]

Excel Simple Linear Regression Output for the Manufacturing Example

Regression Statistics
Multiple R            0.933
R Square              0.870
Adjusted R Square     0.858
Standard Error        51.10
Observations          13

ANOVA
              df    SS        MS        F        Significance F
Regression     1    192395    192395    73.69    0.000
Residual      11     28721      2611
Total         12    221117

              Coefficients   Standard Error   t Stat   P-value
Intercept     -107.03        28.737           -3.72    0.003
numbers         41.026        4.779            8.58    0.000

Second-Order Model with One Independent Variable: Manufacturing Sales Data (Table 14.2)

Scatter Plots Showing the Original Curvilinear Data and the More Linear Transformed Data (Figure 14.2)

Computer Output for Quadratic Model to Predict Sales

Regression Statistics
Multiple R            0.986
R Square              0.973
Adjusted R Square     0.967
Standard Error        24.593
Observations          13

ANOVA
              df    SS        MS        F         Significance F
Regression     2    215069    107534    177.79    0.000
Residual      10      6048       605
Total         12    221117

              Coefficients   Standard Error   t Stat   P-value
Intercept      18.067        24.673            0.73    0.481
MfgrRp        -15.723         9.5450          -1.65    0.131
MfgrRpSq        4.750         0.776            6.12    0.000

Tukey's Ladder of Transformations: The Four-Quadrant Approach
Each quadrant of the figure pairs a curve shape with a suggested move:
• Move toward $y^2, y^3, \ldots$ or toward $\log x$, negative reciprocal transformations of $x$, ...
• Move toward $y^2, y^3, \ldots$ or toward $x^2, x^3, \ldots$
• Move toward $\log y$, negative reciprocal transformations of $y$, ... or toward $\log x$, negative reciprocal transformations of $x$, ...
• Move toward $\log y$, negative reciprocal transformations of $y$, ... or toward $x^2, x^3, \ldots$

Regression Models With Interactions
• In the real world of business and economics, interaction often occurs between two variables.
• One variable acts differently over one range of values of the second variable than it does over another range of values of the second variable.
• In a manufacturing plant, humidity might affect the hardness of a material differently at different temperatures.
• The ANOVA model in Chapter 11 addressed this problem by using an interaction variable as a blocking variable.
• In regression analysis, interaction can be examined as a separate independent variable.
• This is illustrated by using the second-order model design with two independent variables and an interaction term.
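The first-order and quadratic fits for the manufacturing sales data can be reproduced with a short least-squares sketch. This is a minimal illustration using only the Python standard library, not the Excel procedure used on the slides; the helper names `solve`, `ols`, and `r_squared` are hypothetical and introduced here only for the example. The slides report coefficients of about -107.03 and 41.026 for the linear model and 18.067, -15.723, and 4.750 for the quadratic model.

```python
# Sketch: fit Sales = b0 + b1*Reps and Sales = b0 + b1*Reps + b2*Reps^2
# by ordinary least squares on the 13 manufacturing companies.
sales = [2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5, 109.4, 128.6, 196.8, 280.0, 462.3]
reps = [2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(columns, y):
    """Return [b0, b1, ...] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(columns, y, coefs):
    """Coefficient of determination, 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


reps_sq = [x * x for x in reps]  # recoded variable: the square of the predictor
linear = ols([reps], sales)
quadratic = ols([reps, reps_sq], sales)
r2_linear = r_squared([reps], sales, linear)
r2_quadratic = r_squared([reps, reps_sq], sales, quadratic)
print("linear:   ", [round(b, 3) for b in linear], "R^2 =", round(r2_linear, 3))
print("quadratic:", [round(b, 3) for b in quadratic], "R^2 =", round(r2_quadratic, 3))
```

The only modelling step is the recoding line that builds `reps_sq`; the regression itself is still linear in the coefficients, which is why the same least-squares routine serves both models.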
Table 14.3: Share Prices of Three Stocks over a 15-Month Period

Problem Definition: The data represent the closing prices for three corporations over a 15-month period. An investment firm wants to use the prices of stocks 2 and 3 to develop a regression model to predict the price of stock 1.

Stock 1   Stock 2   Stock 3
41         36        35
39         36        35
38         38        32
45         51        41
41         52        39
43         55        55
47         57        52
49         58        54
41         62        65
35         70        77
36         72        75
39         74        74
33         83        81
28        101        92
31        107        91

Develop Model Using a Step-by-Step Approach and Explore for Interaction

First-order model with two independent variables:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
  where: $Y$ = price of stock 1, $X_1$ = price of stock 2, $X_2$ = price of stock 3

Second-order model with an interaction term:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
  which is fitted as
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
  where: $Y$ = price of stock 1, $X_1$ = price of stock 2, $X_2$ = price of stock 3, $X_3 = X_1 X_2$

Initial Regression: First-order Model with Two Independent Variables

The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3

Predictor    Coef       StDev     T        P
Constant     50.855     3.791     13.41    0.000
Stock 2      -0.1190    0.1931    -0.62    0.549
Stock 3      -0.0708    0.1990    -0.36    0.728

S = 4.570    R-Sq = 47.2%    R-Sq(adj) = 38.4%

Analysis of Variance
Source        DF    SS        MS        F       P
Regression     2    224.29    112.15    5.37    0.022
Error         12    250.64     20.89
Total         14    474.93

Excel Regression: Second-order Model with Interaction Term for the Three Stocks

The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter

Predictor    Coef         StDev       T        P
Constant     12.046       9.312        1.29    0.222
Stock 2       0.8788      0.2619       3.36    0.006
Stock 3       0.2205      0.1435       1.54    0.153
Inter        -0.009985    0.002314    -4.31    0.001

S = 2.909    R-Sq = 80.4%    R-Sq(adj) = 75.1%

Analysis of Variance
Source        DF    SS        MS        F        P
Regression     3    381.85    127.28    15.04    0.000
Error         11     93.09      8.46
Total         14    474.93
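The two stock models can be compared with a short sketch that builds the interaction variable as the product of the two predictors and then fits both models by least squares. This is an illustration using only the Python standard library, not the software run shown on the slides; `solve` and `fit` are hypothetical helper names. For comparison, the slides report R-Sq of 47.2% without the interaction term and 80.4% with it.

```python
# Sketch: first-order model versus model with an interaction term
# for the three-stock data of Table 14.3.
stock1 = [41, 39, 38, 45, 41, 43, 47, 49, 41, 35, 36, 39, 33, 28, 31]
stock2 = [36, 36, 38, 51, 52, 55, 57, 58, 62, 70, 72, 74, 83, 101, 107]
stock3 = [35, 35, 32, 41, 39, 55, 52, 54, 65, 77, 75, 74, 81, 92, 91]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(columns, y):
    """Return (coefficients, R^2) for a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    sse = sum((yi - sum(b * v for b, v in zip(coefs, r))) ** 2 for r, yi in zip(rows, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1 - sse / sst


# Interaction variable X3 = X1 * X2, created by recoding the data.
inter = [a * b for a, b in zip(stock2, stock3)]

coefs_first, r2_first = fit([stock2, stock3], stock1)
coefs_inter, r2_inter = fit([stock2, stock3, inter], stock1)
sst = sum((v - sum(stock1) / len(stock1)) ** 2 for v in stock1)

print("total sum of squares:", round(sst, 2))
print("first-order R^2:", round(r2_first, 3))
print("interaction R^2:", round(r2_inter, 3))
```

Because the first-order model is nested in the interaction model, the second R-squared can never be lower than the first; whether the gain is worthwhile is judged from the t test on the interaction coefficient, as the slides do.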
Response Surface for the Stock Example Without and With Interaction

Regression Statistics from Two Excel Output Summaries With and Without Interaction

Summary Regression Statistics for Share Prices of Three Stocks

                      With No Interaction   With Interaction
Multiple R            0.687213365           0.89666084
R Square              0.47226221            0.804000661
Adjusted R Square     0.384305911           0.750546296
Standard Error        4.570195728           2.90902388
Observations          15                    15

Analysis and Conclusions
• By using the interaction term, the coefficient of determination ($R^2$) increases from 0.47 to 0.80.
• The standard error decreases from 4.57 in the first model to 2.909 in the second.
• The t ratios for the $X_1$ term and the interaction term are statistically significant in the second model: t = 3.36 with a p-value of 0.006 for $X_1$, and t = -4.31 with a p-value of 0.001 for $X_1 X_2$.
• Inclusion of $X_1 X_2$ helped the model account for a substantially greater amount of the variation in the dependent variable. It is a significant contributor to the model.
• The second graph in Figure 14.6 shows how the interaction term bends the response surface to fit the data as stock 2 is increased.
• Be cautious in interpreting the partial regression coefficients because of the high likelihood of multicollinearity.

Model-Building: Search Procedures
Search procedures are processes whereby more than one multiple regression model is developed for a given database, and the models are compared and sorted by different criteria, depending on the given procedure. There are many search procedures.
Among the most widely known are:
• All possible regressions
• Stepwise regression
• Forward selection
• Backward elimination
Which approach is best is subject to much debate and depends on the discipline and the philosophy of enquiry that the researcher brings to the research.

All Possible Regressions
• The all possible regressions search procedure computes all possible linear multiple regression models from the data using all variables.
• If a data set contains k independent variables, all possible regressions will determine $2^k - 1$ different models.
• This produces all possible different models with single predictors, two predictors, three predictors, and so on up to all k predictors.
• The next slide shows the predictors for all possible regressions with five independent variables.
• If a research methodology and study design exist that identify all essential variables, the procedure enables the business researcher to examine every model.
• Warning: this search through all possible models can be tedious, time consuming, inefficient, and perhaps overwhelming.

All Possible Regressions with Five Independent Variables

Single Predictor   Two Predictors   Three Predictors   Four Predictors   Five Predictors
X1                 X1,X2            X1,X2,X3           X1,X2,X3,X4       X1,X2,X3,X4,X5
X2                 X1,X3            X1,X2,X4           X1,X2,X3,X5
X3                 X1,X4            X1,X2,X5           X1,X2,X4,X5
X4                 X1,X5            X1,X3,X4           X1,X3,X4,X5
X5                 X2,X3            X1,X3,X5           X2,X3,X4,X5
                   X2,X4            X1,X4,X5
                   X2,X5            X2,X3,X4
                   X3,X4            X2,X3,X5
                   X3,X5            X2,X4,X5
                   X4,X5            X3,X4,X5

Stepwise Regression
• Stepwise regression is a step-by-step process that begins by developing a regression model with a single predictor variable and then adds and deletes predictors one step at a time.
• It allows the researcher to examine the fit of the model at each step until no more significant predictors remain outside the model.
• The process starts by choosing the single-predictor regression with the highest t or F value that is significant at some predetermined alpha value.
• If none of the independent variables meets this criterion, no model is recommended.
• Other variables are then added one at a time, and each is tested for the significance of its contribution to explaining total variation, given the variables already in the model.
• This procedure continues until all significant predictors are included.
• Stepwise regression allows checks for multicollinearity and the dropping of variables that were included in earlier stages.

Forward Selection
Like stepwise regression, except that variables are not reevaluated after entering the model.

Backward Elimination
• Start with the "full model" (all k predictors).
• If all predictors are significant, stop.
• Otherwise, eliminate the most nonsignificant predictor and return to the previous step.
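The count of $2^k - 1$ models in the all possible regressions procedure can be checked by enumerating predictor subsets. A minimal sketch using the Python standard library, with the predictor labels X1 to X5 taken from the five-variable table; the variable names in the code are illustrative only.

```python
from itertools import combinations

# Labels of the five candidate predictors from the table above.
predictors = ["X1", "X2", "X3", "X4", "X5"]

# Every non-empty subset of predictors defines one candidate model.
models_by_size = {
    size: list(combinations(predictors, size))
    for size in range(1, len(predictors) + 1)
}

for size, models in models_by_size.items():
    print(size, "predictor(s):", len(models), "models")

total_models = sum(len(models) for models in models_by_size.values())
print("total:", total_models, "=", 2 ** len(predictors) - 1)
```

The total doubles (plus one) with each added candidate predictor, which is the practical reason the slides warn that the procedure can become overwhelming.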
Stepwise Regression
• Perform k simple regressions and select the best as the initial model.
• Evaluate each variable not in the model:
  - If none meets the criterion, stop.
  - Otherwise, add the best variable to the model, reevaluate the previous variables, and drop any that are not significant.
• Return to the previous step.
• The criteria for inclusion and exclusion of variables may be of a technical nature, of a common-sense observational nature, based on a body of theory, or based on the usefulness of discovering new relationships as insights to meaning.
• The researcher has to be keenly aware of the problem of spurious relationships when using these search procedures.

Choosing the Variables for a Stepwise Regression: Predicting World Crude Oil Production Example

Problem Definition: Predicting world crude oil production
• Choice of a method: many different types of prediction models can be constructed. The researcher adopts an econometric approach using multiple regression.
• After a preliminary survey of the industry and the factors surrounding it, the researcher realizes that much of the world crude oil market is driven by variables related to usage and production in the U.S.

The researcher identifies five independent variables as predictors:
1. U.S. energy consumption
2. Gross U.S. nuclear electricity generation
3. U.S. coal production
4. Total U.S. dry gas (natural gas) production
5. Fuel rate of U.S.-owned automobiles

Systematic Framework Underlying Data Collection
• A survey of published and other data on energy production and usage suggests that world production of crude oil is driven by previous years' activities in the U.S.
• It was expected that as energy consumption in the U.S. increased, so would world production of crude oil.
• It seemed reasonable to introduce nuclear electricity generation, coal production, dry gas production, and fuel rates to the study.
• Rationale: their increased output may be expected to have a negative effect on crude oil production if energy consumption remained fixed.
• Data on the five independent variables and the dependent variable (world crude oil production) were gathered and are presented on the next slide.

Definition and Measurement of Variables: Data for Multiple Regression Model to Predict World Crude Oil Production

Y    World Crude Oil Production (millions of barrels per day)
X1   U.S. Energy Consumption (quadrillion BTUs per year)
X2   U.S. Nuclear Generation (billion kilowatt-hours)
X3   U.S. Coal Production (million short tons)
X4   U.S. Dry Gas Production (trillion cubic feet)
X5   U.S. Fuel Rate for Autos (miles per gallon)

Y       X1      X2       X3        X4      X5
55.7    74.3     83.5     598.6    21.7    13.30
55.7    72.5    114.0     610.0    20.7    13.42
52.8    70.5    172.5     654.6    19.2    13.52
57.3    74.4    191.1     684.9    19.1    13.53
59.7    76.3    250.9     697.2    19.2    13.80
60.2    78.1    276.4     670.2    19.1    14.04
62.7    78.9    255.2     781.1    19.7    14.41
59.6    76.0    251.1     829.7    19.4    15.46
56.1    74.0    272.7     823.8    19.2    15.94
53.5    70.8    282.8     838.1    17.8    16.65
53.3    70.5    293.7     782.1    16.1    17.14
54.5    74.1    327.6     895.9    17.5    17.83
54.0    74.0    383.7     883.6    16.5    18.20
56.2    74.3    414.0     890.3    16.1    18.27
56.7    76.9    455.3     918.8    16.6    19.20
58.7    80.2    527.0     950.3    17.1    19.87
59.9    81.3    529.4     980.7    17.3    20.31
60.6    81.3    576.9    1029.1    17.8    21.02
60.2    81.1    612.6     996.0    17.7    21.69
60.2    82.1    618.8     997.5    17.8    21.68
60.6    83.9    610.3     945.4    18.2    21.04
60.9    85.6    640.4    1033.5    18.9    21.48

Step 1: Stepwise Regression Results with One Predictor
Simple regressions using each independent variable in turn to predict oil production produce the initial regression equation
  $\hat{y} = 13.075 + 0.580 x_1$
where $y$ is world crude oil production and $x_1$ is U.S. energy consumption. Note that the t value (11.77) in Table 14.8 is the highest of all the variables tried, and R-squared is 85.2%.

Excel Output of Regression for Crude Oil Production

Step 2: Stepwise Regression Results with Two Predictors
• $x_1$ is retained in the model, and a search is conducted to determine which of the other variables, together with it, produces the highest significant t value (that is, adds most to explaining the variation in Y).
• The new equation emerging from the computer calculation is $\hat{y} = 7.14 + 0.772 x_1 - 0.517 x_2$, where $x_2$ here denotes U.S. fuel rate (X5 in the data table). Fuel rate has a t value of -3.75, and the model has an R-squared of 90.8%. Both are very significant.

Step 3: Regression Results with Three Predictors
• Step 3 continues the search for additional predictor variables.
• Table 14.10 shows that none of the remaining variables makes a significant contribution when added to the regression obtained at step 2. The t values are very small.
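Step 1 of the stepwise procedure, screening each predictor with a simple regression and keeping the one with the largest absolute t value, can be sketched as follows. This is an illustration with the Python standard library, not the Excel or Minitab run behind the slides; `simple_regression` is a hypothetical helper name. The data table lists 22 observations while the Minitab output reports N = 26, so the values printed by this sketch should not be expected to match the reported 13.075, 0.580, and 11.77 exactly.

```python
import math

# World crude oil production (Y) and the five U.S. predictors, as listed in the data table.
y = [55.7, 55.7, 52.8, 57.3, 59.7, 60.2, 62.7, 59.6, 56.1, 53.5, 53.3,
     54.5, 54.0, 56.2, 56.7, 58.7, 59.9, 60.6, 60.2, 60.2, 60.6, 60.9]
predictors = {
    "X1": [74.3, 72.5, 70.5, 74.4, 76.3, 78.1, 78.9, 76.0, 74.0, 70.8, 70.5,
           74.1, 74.0, 74.3, 76.9, 80.2, 81.3, 81.3, 81.1, 82.1, 83.9, 85.6],
    "X2": [83.5, 114.0, 172.5, 191.1, 250.9, 276.4, 255.2, 251.1, 272.7, 282.8, 293.7,
           327.6, 383.7, 414.0, 455.3, 527.0, 529.4, 576.9, 612.6, 618.8, 610.3, 640.4],
    "X3": [598.6, 610.0, 654.6, 684.9, 697.2, 670.2, 781.1, 829.7, 823.8, 838.1, 782.1,
           895.9, 883.6, 890.3, 918.8, 950.3, 980.7, 1029.1, 996.0, 997.5, 945.4, 1033.5],
    "X4": [21.7, 20.7, 19.2, 19.1, 19.2, 19.1, 19.7, 19.4, 19.2, 17.8, 16.1,
           17.5, 16.5, 16.1, 16.6, 17.1, 17.3, 17.8, 17.7, 17.8, 18.2, 18.9],
    "X5": [13.30, 13.42, 13.52, 13.53, 13.80, 14.04, 14.41, 15.46, 15.94, 16.65, 17.14,
           17.83, 18.20, 18.27, 19.20, 19.87, 20.31, 21.02, 21.69, 21.68, 21.04, 21.48],
}


def simple_regression(x, y):
    """Return (intercept, slope, t statistic of the slope) for y regressed on x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return intercept, slope, slope / se_slope


results = {name: simple_regression(x, y) for name, x in predictors.items()}

# The entry candidate is the predictor with the largest |t|; a full stepwise
# run would also check it against the chosen significance threshold.
best = max(results, key=lambda name: abs(results[name][2]))

for name, (b0, b1, t) in results.items():
    print(f"{name}: y = {b0:.3f} + {b1:.3f} x, t = {t:.2f}")
print("first variable to enter:", best)
```

Later steps repeat the same idea with the entered variable held in the model, comparing the t value of each remaining candidate in a two-predictor regression.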
Minitab Stepwise Output

Stepwise Regression
F-to-Enter: 4.00    F-to-Remove: 4.00
Response is crude oil production on 5 predictors, with N = 26

Step                       1         2
Constant              13.075     7.140
Energy Consumption     0.580     0.772
  T-Value              11.77     11.91
  P-Value              0.000     0.000
Fuel Rate                        -0.52
  T-Value                        -3.75
  P-Value                        0.001
S                       1.52      1.22
R-Sq                   85.24     90.83

Key Concerns
• The search procedures provide a framework for analysis and must be applied subject to common sense and an explanatory theory or analysis.
• Avoid the mistake of using the strict sequential order in which variables enter a computer printout (in stepwise regression and forward selection) to rank the importance of the variables.
• In multiple regression (unlike simple regression), the importance of an independent variable is ranked in terms of its net contribution to explaining Y when used with other variables, not in terms of its individual correlation with Y.
• Problems of multicollinearity require transformation or omission of variable(s) before or as the analysis proceeds. Adding a variable that is highly correlated with other independent variables is very problematic: it distorts the values of the coefficients and renders all tests unreliable.
• An increase in R-squared is not in and of itself a good indicator of the importance of the last variable added.
• Common sense and use value are the final arbiters in choosing the final model.
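One reason an increase in R-squared alone says little is that R-squared can never fall when a predictor is added. The adjusted R-squared reported in the outputs corrects for the number of predictors. The formula below is the standard definition and is not stated on the slides; $n$ is the number of observations and $k$ the number of predictors. The worked values use the stock example (n = 15), and they agree with the adjusted values 0.384 and 0.751 reported there.

```latex
\[
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
\]
\[
\text{No interaction } (k = 2):\quad
1 - (1 - 0.4723)\,\frac{14}{12} \approx 0.384
\]
\[
\text{With interaction } (k = 3):\quad
1 - (1 - 0.8040)\,\frac{14}{11} \approx 0.751
\]
```

Here the adjusted value still rises sharply after the interaction term is added, which is consistent with the significant t test on that term.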
Multicollinearity
A condition that occurs when two or more of the independent variables of a multiple regression model are highly correlated.

Effects of Multicollinearity
• It is difficult, if not impossible, to interpret the estimates of the regression coefficients.
• Inordinately small t values for the regression coefficients may result.
• Standard deviations of the regression coefficients are overestimated, so the t tests and the F test may have no meaning.
• The algebraic sign of a predictor variable's coefficient may be the opposite of what is expected.
• In practice, correlations as high as 60 to 70 percent may be tolerated without causing a serious problem of multicollinearity.

Testing for Multicollinearity
Two techniques for determining the possible existence of multicollinearity:
• Prepare a correlation matrix of the independent variables using Excel or another software program, and identify those pairs of variables that have correlations in excess of 0.70.
• The variance inflation factor (VIF): conduct a regression analysis to predict one independent variable by the other independent variables, so that the independent variable being predicted becomes the dependent variable. This is repeated so that each independent variable takes a turn as the dependent variable, and the coefficient of determination $R_i^2$ of each regression is calculated. The measure
  $VIF = \dfrac{1}{1 - R_i^2}$
  determines whether the standard errors of the estimates are inflated.
• Some researchers follow the guideline that a VIF greater than 10, or an $R_i^2$ greater than 0.90 for the largest VIFs, indicates a severe multicollinearity problem.

Correlations among Oil Production Predictor Variables

                      Energy Consumption   Nuclear   Coal      Dry Gas   Fuel Rate
Energy Consumption    1                    0.856     0.791     0.057     0.796
Nuclear               0.856                1         0.952     -0.404    0.972
Coal                  0.791                0.952     1         -0.448    0.968
Dry Gas               0.057                -0.404    -0.448    1         -0.423
Fuel Rate             0.796                0.972     0.968     -0.423    1
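Both screening rules can be applied directly to the correlation table. For a pair of predictors, the R-squared from regressing one on the other equals the squared correlation, so a pairwise VIF is $1/(1 - r^2)$. The sketch below covers only this two-variable case and is an illustration rather than a full VIF computation with all predictors; `pairwise_vif` is a hypothetical helper name, and the Fuel Rate correlations are read from the Fuel Rate column of the table.

```python
# Pairwise correlations among the oil production predictors (upper triangle of the table).
corr = {
    ("Energy Consumption", "Nuclear"): 0.856,
    ("Energy Consumption", "Coal"): 0.791,
    ("Energy Consumption", "Dry Gas"): 0.057,
    ("Energy Consumption", "Fuel Rate"): 0.796,
    ("Nuclear", "Coal"): 0.952,
    ("Nuclear", "Dry Gas"): -0.404,
    ("Nuclear", "Fuel Rate"): 0.972,
    ("Coal", "Dry Gas"): -0.448,
    ("Coal", "Fuel Rate"): 0.968,
    ("Dry Gas", "Fuel Rate"): -0.423,
}


def pairwise_vif(r):
    """VIF when one predictor is regressed on a single other predictor (R^2 = r^2)."""
    return 1.0 / (1.0 - r * r)


# Rule 1: flag pairs whose correlation exceeds 0.70 in absolute value.
high_corr = sorted(pair for pair, r in corr.items() if abs(r) > 0.70)

# Rule 2: flag pairs whose pairwise VIF exceeds 10.
vifs = {pair: pairwise_vif(r) for pair, r in corr.items()}
flagged = sorted(pair for pair, v in vifs.items() if v > 10)

for pair in high_corr:
    print(pair, "r =", corr[pair], "VIF =", round(vifs[pair], 2))
print("pairs with VIF > 10:", flagged)
```

With more than two predictors in a model, each $R_i^2$ would come from a regression on all the other predictors, so the pairwise figures here are lower bounds on the corresponding VIFs.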
Problem of Interpretation When Multicollinearity Exists: World Crude Oil Production Regression
• The algebraic signs in a regression model must conform to common-sense observation or established theory.
• Note the following three equations considered at different stages of the stepwise regression analysis:
  1. $\hat{Y} = 44.869 + 0.7838(\text{fuel rate})$. The positive fuel rate coefficient can be interpreted in terms of economic theory: the price substitution effect.
  2. $\hat{Y} = 45.072 + 0.0157(\text{coal})$. The positive coal coefficient is explainable in a complementary sense.
  3. $\hat{Y} = 45.806 + 0.0227(\text{coal}) - 0.3934(\text{fuel rate})$. The negative fuel rate coefficient is opposite to that in equation 1 and is contrary to what would normally be expected from economic theory or common-sense observation.
• The apparent contradiction in equation 3 can be attributed to multicollinearity: the correlation between coal and fuel rate is 0.968, so $R^2 = 0.968^2 \approx 0.937$ and VIF $\approx 15.9$, above the guideline of 10.

Copyright Notice
Copyright © 2010 John Wiley & Sons Canada, Ltd. All rights reserved. Reproduction or translation of this work beyond that permitted by Access Copyright (The Canadian Copyright Licensing Agency) is unlawful. Requests for further information should be addressed to the Permissions Department, John Wiley & Sons Canada, Ltd. The purchaser may make back-up copies for his/her own use only and not for distribution or resale. The Publisher assumes no responsibility for errors, omissions, or damages caused by the use of these programs or from the use of the information herein.