Chapter 7
Qualitative Variables and Non-Linearities in Multiple Linear Regression Analysis

Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

Learning Objectives
• Construct and use qualitative independent variables
• Construct and use interaction effects
• Control for non-linear relationships
• Estimate marginal effects as percent changes and elasticities
• Estimate a more fully-specified model

Construct and Use Qualitative Independent Variables
• Qualitative explanatory variable (dummy variable) with two or more levels:
  – yes or no, on or off, male or female
  – coded as 0 or 1
• Regression intercepts are different if the variable is statistically significant
• Assumes equal slopes for the other variables
• The number of dummy variables needed is (number of levels - 1)

Dummy-Variable Model Example (with 2 Levels)
Let:
y = pie sales
x1 = price
x2 = holiday (x2 = 1 if a holiday occurred during the week; x2 = 0 if there was no holiday that week)

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Dummy-Variable Model Example (with 2 Levels) Continued
Holiday (x2 = 1): $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (1) = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 x_1$
No holiday (x2 = 0): $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (0) = \hat{\beta}_0 + \hat{\beta}_1 x_1$
[Figure: y (sales) plotted against x1 (price). The Holiday and No Holiday lines have different intercepts, $\hat{\beta}_0 + \hat{\beta}_2$ and $\hat{\beta}_0$, and the same slope, $\hat{\beta}_1$.]
If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales

Interpretation of the Dummy Variable Coefficient (with 2 Levels)
Example: $\widehat{\text{Sales}} = 300 - 30(\text{Price}) + 15(\text{Holiday})$
Sales: number of pies sold per week
Price: pie price in dollars
Holiday: 1 if a holiday occurred during the week; 0 if no holiday occurred
$\hat{\beta}_2 = 15$: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price

Dummy-Variable Models (more than 2 Levels)
• The number of dummy variables is one less than the number of levels
• Example: y = house price; x1 = square feet
• The style of the house is also thought to matter: Style = ranch, split level, condo
Three levels, so two dummy variables are needed

Dummy-Variable Models (more than 2 Levels) Continued
Let the default category be “condo”:
x2 = 1 if ranch, 0 if not
x3 = 1 if split level, 0 if not

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$

$\hat{\beta}_2$ shows the impact on price if the house is a ranch style, compared to a condo
$\hat{\beta}_3$ shows the impact on price if the house is a split level style, compared to a condo

Interpreting the Dummy Variable Coefficients (with 3 Levels)
Suppose the estimated equation is
$\hat{y} = 20.43 + 0.045 x_1 + 23.53 x_2 + 18.84 x_3$
For a condo (x2 = x3 = 0): $\hat{y} = 20.43 + 0.045 x_1$
For a ranch (x2 = 1, x3 = 0): $\hat{y} = 20.43 + 0.045 x_1 + 23.53$
For a split level (x2 = 0, x3 = 1): $\hat{y} = 20.43 + 0.045 x_1 + 18.84$
All three equations have the same slope.
With the same square feet, a ranch will have an estimated average price 23.53 thousand dollars higher than a condo, and the intercept for a ranch is 20.43 + 23.53 = 43.96.
With the same square feet, a split level will have an estimated average price 18.84 thousand dollars higher than a condo, and the intercept for a split level is 20.43 + 18.84 = 39.27.

Excel Example
What type of relationship exists between energy use per capita and GDP per capita? The initial regression is as follows:
On average, if GDP per capita increases by $1,000 (US dollars), energy consumption per capita increases by 0.07 tons. This is statistically significant at the 1% level.

Scatter Plots of this Relationship for Europe, North America, and South America
[Figure: three scatter plots of energy per capita (tons) against GDP per capita ($1000 of US dollars), one panel each for Europe, North America, and South America.]
Are the intercepts the same for these three locations?
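
The intercept question above can be set up by coding the three locations as 0/1 dummy variables. What follows is a minimal sketch, assuming South America is the omitted group; the names `region_dummies` and `design_row` and the sample GDP value are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: code a three-level qualitative variable as two dummies.
# The omitted (baseline) group is assumed to be South America.

REGIONS = ("Europe", "North America", "South America")
BASELINE = "South America"


def region_dummies(region):
    """Return (europe, north_america) as 0/1 indicators; the baseline gets (0, 0)."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return (int(region == "Europe"), int(region == "North America"))


def design_row(gdp_per_capita, region):
    """Build one regression row: [constant, GDP per capita, Europe, N. America]."""
    europe, north_america = region_dummies(region)
    return [1, gdp_per_capita, europe, north_america]


# Three levels need only two dummy columns (number of levels - 1).
for region in REGIONS:
    print(region, design_row(10.0, region))
```

Adding a third dummy for South America alongside the constant would make the columns perfectly collinear, which is why one level is always left out.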

Excel Example
Are the intercepts different between Europe and North America with South America as the omitted group?
On average, if GDP per capita increases by $1,000 (US dollars), energy consumption per capita increases by 0.07 tons. This is statistically significant at the 10% level. The coefficients on the Europe and North America dummy variables are not statistically significant at the 10% level, so neither intercept differs significantly from South America's.

Construct and Use Interaction Effects
Interaction effects are the product of two different independent variables. We first consider interaction effects between a quantitative variable and a dummy variable. This type of interaction effect lets the slope of the quantitative variable differ across the levels of the dummy variable.

Interaction Regression Model Worksheet
Case, i   y_i   x_1i   x_2i   x_1i x_2i
1         1     1      1      1
2         4     8      0      0
3         1     3      0      0
4         3     5      1      5
:         :     :      :      :
Multiply x1 by x2 to get x1x2, then run the regression with y, x1, x2, and x1x2.

Consider the price of the house with three levels of the dummy variable
Let the default category be “condo”, with x1 = square feet, x2 = 1 if ranch and 0 if not, and x3 = 1 if split level and 0 if not.

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_1 x_2 + \hat{\beta}_5 x_1 x_3$

$\hat{\beta}_2$ shows the change in the intercept if the house is a ranch style, compared to a condo
$\hat{\beta}_3$ shows the change in the intercept if the house is a split level style, compared to a condo
$\hat{\beta}_4$ shows the change in the slope on square feet if the house is a ranch style, compared to a condo
$\hat{\beta}_5$ shows the change in the slope on square feet if the house is a split level style, compared to a condo

Interaction Term Worksheet
Suppose the estimated equation is
$\hat{y} = 18.30 + 98 x_1 + 22.44 x_2 + 16.38 x_3 + 45 x_1 x_2 + 32 x_1 x_3$

Visual Depiction of Interaction Terms with Dummy Variables

Excel Example
Are the slopes and intercepts different between Europe and North America with South America as the omitted group?
On average, if GDP per capita increases by $1,000 (US dollars), energy consumption per capita increases by 0.07 tons. This is statistically significant at the 10% level. The coefficients on the Europe and North America dummy variables are not statistically significant at the 10% level, so neither region differs significantly from South America.

Control for Nonlinear Relationships
• The relationship between the dependent variable and an independent variable may not be linear
• Useful when the scatter diagram indicates a nonlinear relationship
• Example: Quadratic model
  – $y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \varepsilon$
  – The second independent variable is the square of the first variable

Polynomial Regression Model
General form: $y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \cdots + \beta_p x_j^p + \varepsilon$
where:
$\beta_0$ = population regression constant
$\beta_i$ = population regression coefficient for variable $x_j$: j = 1, 2, …, k
p = order of the polynomial
$\varepsilon$ = model error
If p = 2 the model is a quadratic model: $y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \varepsilon$

Linear vs. Nonlinear Fit
[Figure: two panels, each showing y against x with a plot of residuals against x below it.]
Linear fit does not give random residuals
Nonlinear fit gives random residuals

Quadratic Regression Model
$y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \varepsilon$
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
[Figure: four panels of y against x1, one for each sign combination: β1 < 0, β2 > 0; β1 > 0, β2 > 0; β1 < 0, β2 < 0; β1 > 0, β2 < 0.]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term

Marginal Effect for the Quadratic Regression Model
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_j + \hat{\beta}_2 x_j^2$
How does a one unit increase in xj affect the dependent variable y (the marginal effect)? This is the partial derivative of $\hat{y}$ with respect to xj:
$\frac{\partial \hat{y}}{\partial x_j} = \hat{\beta}_1 + 2\hat{\beta}_2 x_j$
Notice that the effect that xj has on y changes depending on the value of xj. To describe a one unit increase that ends at xj, the derivative should be evaluated at xj - 1.

Illustration of the Marginal Effect that xj has on y
The marginal effect is the slope of a line tangent to the curve
At x1j the marginal effect is positive
At x2j the marginal effect is negative
[Figure: tangent lines to the fitted curve at x1j and x2j.]

Empirical Example of the Quadratic Effect: Utility Bill vs. Temperature
[Figure: Average Bill vs. Average Monthly Temperature. Scatter plot of average bill ($60.00 to $160.00) against average monthly temperature (35 to 95).]

Utility Bill vs. Temperature – Simple Linear Regression
Even though the scatter plot shows a clear relationship between utility bill and temperature, the simple linear regression finds no linear relationship between these two variables.

Utility Bill vs. Temperature – Quadratic Regression
$\widehat{\text{UtilityBill}} = 484.12 - 12.08\,\text{temp} + 0.09\,\text{temp}^2$
When a quadratic relationship is fit between utility bill and monthly temperature, the linear and quadratic terms are both statistically significant at the 1% level.

Utility Bill vs. Temperature – Quadratic Regression Interpretation
$\widehat{\text{UtilityBill}} = 484.12 - 12.08\,\text{temp} + 0.09\,\text{temp}^2$
The marginal effect is
$\frac{\partial\, \text{UtilityBill}}{\partial\, \text{temp}} = -12.08 + 2(0.09)\,\text{temp}$
The marginal effect at a temperature of 40 (evaluated at 39) is
$-12.08 + 2(0.09)(39) = -12.08 + 7.02 = -5.06$
which means that if temperature increases from 39 to 40 degrees then the utility bill decreases by $5.06.
The marginal effect at a temperature of 80 (evaluated at 79) is
$-12.08 + 2(0.09)(79) = -12.08 + 14.22 = 2.14$
which means that if temperature increases from 79 to 80 degrees then the utility bill increases by $2.14.

Finding Where the Quadratic Function Reaches a Maximum (or Minimum)
Method: Set the first derivative of the regression equal to 0 and solve for xj.
$\frac{\partial \hat{y}}{\partial x_j} = \hat{\beta}_1 + 2\hat{\beta}_2 x_j = 0 \quad \text{or} \quad x_j = \frac{-\hat{\beta}_1}{2\hat{\beta}_2}$
Using the utility bill example, the function reaches a minimum at
$\text{temp} = \frac{-(-12.08)}{2(0.09)} = 67.11$
or at a temperature of 67.11 degrees. The function will reach a minimum if $\hat{\beta}_2$ is positive and a maximum if $\hat{\beta}_2$ is negative.

Testing for Significance: Quadratic Model
• Test for Overall Relationship between y and xj (test if the two parameters are jointly equal to 0).
  – Use an F-test with the hypotheses
    H0: β1 = β2 = 0 (xj does not affect y)
    HA: not H0 (xj affects y)
• Testing the Quadratic Effect
  – Compare the quadratic model $y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \varepsilon$ with the linear model $y = \beta_0 + \beta_1 x_j + \varepsilon$
  – Use a t-test with the hypotheses
    H0: β2 = 0 (no 2nd order polynomial term)
    HA: β2 ≠ 0 (2nd order polynomial term is needed)

Higher Order Models
[Figure: y plotted against x.]
If p = 3 the model is a cubic form: $y = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \beta_3 x_j^3 + \varepsilon$

Interaction Effects
• Hypothesizes interaction between pairs of x variables
  – Response to one x variable varies at different levels of another x variable
• Contains two-way cross product terms
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1^2 x_2$
Basic terms: the first three terms after the constant. Interactive terms: the two cross product terms.

Effect of Interaction
• Given: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
• Without interaction term, effect of x1 on y is measured by β1
• With interaction term, effect of x1 on y is measured by β1 + β3 x2
• Effect changes as x2 increases

Evaluating Presence of Interaction
• Hypothesize interaction between pairs of independent variables
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
• Hypotheses:
  – H0: β3 = 0 (no interaction between x1 and x2)
  – HA: β3 ≠ 0 (x1 interacts with x2)

Estimate Marginal Effects as Percent Changes and Elasticities
The models are estimated taking natural logarithms of the dependent variable, the independent variable, or both.
  – Log-Linear Model
  – Log-Log Model

Log – Linear Model
The population regression function is specified as
$\ln y = \beta_0 + \beta_1 x_1 + \varepsilon$
and β1 is interpreted as, “on average, if x1 increases by 1 unit then y increases by β1 × 100%.”
Note that this is only an approximation because the natural log is a nonlinear function.
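
A short derivation shows why the log-linear interpretation is only approximate. This is a sketch that holds the error term fixed and writes $y_0$ and $y_1$ (notation introduced here, not in the slides) for the values of y before and after a one unit increase in x1.

```latex
\begin{aligned}
\ln y_1 - \ln y_0 &= \beta_1 \\
\frac{y_1}{y_0} &= e^{\beta_1} \\
\%\Delta y &= 100\,(e^{\beta_1} - 1) \approx 100\,\beta_1
  \quad \text{when } |\beta_1| \text{ is small}
\end{aligned}
```

The last step uses $e^{\beta_1} \approx 1 + \beta_1$, so the approximation is closest for coefficients near zero and becomes less accurate as they grow in magnitude.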

Empirical Example of the Log – Linear Model
The dependent variable is the natural log of energy per capita
The slope coefficient on gdppc is interpreted as, “on average, if GDP per capita increases by $1,000 then energy consumption per capita goes up by (0.026)(100%) or 2.6%.” This coefficient is statistically significant at the 1% level.

Empirical Example of the Log – Linear Model with Dummy Variables
The dependent variable is the natural log of energy per capita with South America as the omitted group
The Europe dummy variable coefficient is interpreted as “on average energy consumption per capita is 50.5% higher in Europe than South America.” The North America dummy variable coefficient is interpreted as “on average energy consumption per capita is 56.6% higher in North America than South America.” Europe is statistically insignificant while North America is marginally significant (significant at the 10% level).

Log – Log Model
The population regression function is specified as
$\ln y = \beta_0 + \beta_1 \ln x_1 + \varepsilon$
and β1 is interpreted as “on average, if x1 increases by 1 percent then y increases by β1 percent.” In the log-log model β1 is an elasticity.

Empirical Example of the Log – Log Model
The dependent variable is the natural log of energy per capita
The slope coefficient on lngdppc is interpreted as, “on average, if GDP per capita increases by 1% then energy consumption per capita goes up by 0.69%.” This coefficient is statistically significant at the 1% level.
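
For changes larger than 1 percent, the elasticity can be applied exactly rather than by simple multiplication. In a log-log model, multiplying x1 by (1 + p) multiplies y by (1 + p) raised to the power β1, holding everything else fixed. The sketch below uses the 0.69 elasticity from the example above; the function name `pct_change_in_y` and the chosen percent increases are illustrative assumptions.

```python
def pct_change_in_y(elasticity, pct_change_in_x):
    """Exact percent change in y implied by a log-log model.

    ln y = b0 + b1 * ln x  implies  y_new / y_old = (x_new / x_old) ** b1.
    """
    ratio_x = 1.0 + pct_change_in_x / 100.0
    if ratio_x <= 0:
        raise ValueError("x must stay positive for its logarithm to exist")
    return 100.0 * (ratio_x ** elasticity - 1.0)


elasticity = 0.69  # slope on ln(GDP per capita) reported in the example

for pct in (1, 10, 50):
    exact = pct_change_in_y(elasticity, pct)
    rule_of_thumb = elasticity * pct  # "b1 percent per 1 percent" scaling
    print(f"{pct:>3}% rise in x: exact {exact:.2f}%, rule of thumb {rule_of_thumb:.2f}%")
```

For a 1 percent change the two figures nearly coincide, which is why the elasticity reading is stated per 1 percent; the gap grows with the size of the change.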

Empirical Example of the Log – Log Model with Dummy Variables
The dependent variable is the natural log of energy per capita with South America as the omitted group
The Europe dummy variable coefficient is interpreted as “on average energy consumption per capita is 9.3% lower in Europe than South America.” The North America dummy variable coefficient is interpreted as “on average energy consumption per capita is 41.5% higher in North America than South America.” Neither of these is statistically significant at the 10% level.
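
Because a dummy variable switches from 0 to 1 rather than changing by a small amount, reading its coefficient b as b × 100% in a model with a logged dependent variable is an approximation; the exact figure is 100(e^b - 1). The sketch below assumes the underlying coefficients are -0.093 and 0.415, inferred from the percentages quoted above since the regression output itself is not reproduced here. The function name is hypothetical.

```python
import math


def exact_pct_effect(coefficient):
    """Exact percent difference in y for a dummy coefficient when ln(y) is the dependent variable."""
    return 100.0 * (math.exp(coefficient) - 1.0)


# Assumed coefficients, inferred from the quoted 9.3% and 41.5% figures.
assumed_coefficients = {"Europe": -0.093, "North America": 0.415}

for region, b in assumed_coefficients.items():
    print(f"{region}: approximate {100 * b:.1f}%, exact {exact_pct_effect(b):.1f}%")
```

The two readings stay close for the small Europe coefficient and diverge more for the larger North America coefficient.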