Chapter 9 Supplement Model Building 1 Introduction • Regression analysis is one of the most commonly used techniques in statistics. • It is considered powerful for several reasons: – It can cover a variety of mathematical models • linear relationships. • non - linear relationships. • nominal independent variables. – It provides efficient methods for model building 2 Polynomial Models • There are models where the independent variables (xi) may appear as functions of a smaller number of predictor variables. • Polynomial models are one such example. 3 Polynomial Models with One Predictor Variable y = b0 + b1x1+ b2x2 +…+ bpxp + e y = b0 + b1x + b2x2 + …+bpxp + e 4 Polynomial Models with One Predictor Variable • First order model (p = 1) y = b0 + b1x + e • Second order model (p=2) y = b0 + b1x + eb2x2 + e b2 < 0 b2 > 0 5 Polynomial Models with One Predictor Variable • Third order model (p = 3) y = b0 + b1x + b2x2 + eb3x3 + e b3 < 0 b3 > 0 6 Polynomial Models with Two Predictor Variables y • First order model y = b0 + b1x1 + eb2x2 + e y x1 x2 x1 x2 7 Polynomial Models with Two Predictor Variables • First order model y = b0 + b1x1 + b2x2 + e The effect of one predictor variable on y is independent of the effect of the other predictor variable on y. X2 = 3 X2 = 2 X2 = 1 • First order model, two predictors,and interaction y = b0 + b1x1 + b2x2 +b3x1x2 + e The two variables interact to affect the value of y. X2 = 3 X2 = 2 X2 =1 x1 x 8 1 Polynomial Models with Two Predictor Variables Second order model y = b0 + b1x1 + b2x2 + b3x12 + b4x22 + e X2 = 3 y = [b0+b2(3)+b4(32)]+ b1x1 + b3x12 + e X2 = 2 y = [b0+b2(2)+b4 (22)]+ b1x1 + b3x1 + e 2 Second order model with interaction y = b0 + b1x1 + b2x2 +b3x12 + b4x22+ b e 5x1x2 + e X2 = 3 X2 = 2 X2 =1 X2 =1 y = [b0+b2(1)+b4(12)]+ b1x1 + b3x12 + e x1 9 Selecting a Model • Several models have been introduced. • How do we select the right model? • Selecting a model: – Use your knowledge of the problem (variables involved and the nature of the relationship between them) to select a model. – Test the model using statistical techniques. 10 Selecting a Model; Example • Example: The location of a new restaurant – A fast food restaurant chain tries to identify new locations that are likely to be profitable. – The primary market for such restaurants is middleincome adults and their children (between the age 5 and 12). – Which regression model should be proposed to predict the profitability of new locations? 11 Selecting a Model; Example • Solution – The dependent variable will be Gross Revenue – Quadratic relationships between Revenue and each predictor variable should be observed. Why? • Families with very young or older kids will not visit the restaurant as frequent as families with mid-range ages of kids. • Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families. Revenue Revenue Low Middle High Income Low Middle High 12 age Selecting a Model; Example • Solution – The quadratic regression model built is Sales = b0 + b1INCOME + b2AGE + b3INCOME2 +b4AGE2 + b5(INCOME)(AGE) +e Include interaction term when in doubt, and test its relevance later. SALES = annual gross sales INCOME = median annual household income in the neighborhood AGE = mean age of children in the neighborhood 13 Selecting a Model; Example To verify the validity of the proposed model for recommending the location of a new fast food restaurant, 25 areas with fast food restaurants were randomly selected. – Each area included one of the firm’s and three competing restaurants. – Data collected included (Xm9-01.xls): • Previous year’s annual gross sales. • Mean annual household income. • Mean age of children 14 Selecting a Model; Example Xm9-01.xls Revenue 1128 1005 1212 . . Added data Income 23.5 17.6 26.3 . . Age 10.5 7.2 7.6 . . Income sq 552.25 309.76 691.69 . . Collected data Age sq 110.25 51.84 57.76 . . (Income)( Age) 246.75 126.72 199.88 . . 15 The Quadratic Relationships – Graphical Illustration REVENUE vs. AGE 1500 1000 500 REVENUE vs. INCOME 0 0.0 1500 5.0 10.0 15.0 20.0 1000 500 0 0.0 10.0 20.0 30.0 40.0 16 Model Validation Regression Statistics Multiple R 0.9521 R Square 0.9065 Adjusted R Square 0.8819 Standard Error 44.70 Observations 25 This is a valid model that can be used to make predictions. ANOVA df Regression Residual Total Intercept Income Age Income sq Age sq (Income)( Age) 5 19 24 SS 368140 37956 406096 Coefficients Standard Error -1134 320.0 173.2 28.20 23.55 32.23 -3.73 0.54 -3.87 1.18 1.97 0.94 MS 73628 1998 F Significance F 36.86 0.0000 t Stat P-value -3.54 0.0022 6.14 0.0000 0.73 0.4739 -6.87 0.0000 -3.28 0.0039 2.08 0.0509 But… 17 Reducing multicollinearity Model Validation The model can be used to make predictions... …but multicollinearity is a problem!! The t-tests may be distorted, therefore, do not interpret the coefficients or test them. In excel: Tools > Data Analysis > Correlation INCOME INCOME AGE INC sq AGE sq INC X AGE AGE INC sq AGE sq INC X AGE 1 0.0201 1 0.9945 -0.045 1 -0.042 0.9845 -0.099 1 0.4596 0.8861 0.3968 0.8405 1 18 Nominal Independent Variables • In many real-life situations one or more independent variables are nominal. • Including nominal variables in a regression analysis model is done via indicator variables. • An indicator variable (I) can assume one out of two values, “zero” or “one”. o 1 if a degree earned is in Finance 1 if the temperature was below 50 11 ififadata firstwere condition collected out ofbefore two is1980 met I= 00 ififthe o orFinance asecond degree earnedwas isout not temperature 50 more 00 ififadata werecondition collected after ofintwo 1980 is met 19 Nominal Independent Variables; Example: Auction Price of Cars A car dealer wants to predict the auction price of a car. Xm9-02a_supp – The dealer believes now that odometer reading and the car color are variables that affect a car’s price. – Three color categories are considered: • White • Silver • Other colors Note: Color is a nominal variable. 20 Nominal Independent Variables; Example: Auction Price of Cars • data - revised (Xm9-02b_supp) 1 if the color is white I1 = 0 if the color is not white 1 if the color is silver I2 = 0 if the color is not silver The category “Other colors” is defined by: I1 = 0; I2 = 0 21 How Many Indicator Variables? • Note: To represent the situation of three possible colors we need only two indicator variables. • Conclusion: To represent a nominal variable with m possible categories, we must create m-1 indicator variables. 22 Nominal Independent Variables; Example: Auction Car Price • Solution – the proposed model is y = b0 + b1(Odometer) + b2I1 + b3I2 + e – The data Price 14636 14122 14016 15590 15568 14718 . . Odometer 37388 44758 45833 30862 31705 34010 . . I-1 1 1 0 0 0 0 . . I-2 0 0 0 0 1 1 . . White car Other color Silver color 23 Example: Auction Car Price The Regression Equation From Excel we get the regression equation PRICE = 16701-.0555(Odometer)+90.48(I-1)+295.48(I-2) For one additional mile the auction price decreases by 5.55 cents. A white car sells, on the average, for $90.48 more than a car of the “Other color” category A silver color car sells, on the average, for $295.48 more than a car of the “Other color” category. 24 Example: Auction Car Price The Regression Equation From Excel (Xm9-02b_supp) we get the regression equation PRICE = 16701-.0555(Odometer)+90.48(I-1)+295.48(I-2) Price Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(1) Price = 16701 - .0555(Odometer) + 90.48(1) + 295.48(0) Price = 16701 - .0555(Odometer) + 45.2(0) + 148(0) Odometer 25 Example: Auction Car Price The Regression Equation SUMMARY OUTPUT Regression Statistics Multiple R 0.8355 R Square 0.6980 Adjusted R Square 0.6886 Standard Error 284.5 Observations 100 There is insufficient evidence Xm9-02b_supp to infer that a white color car and a car of “other color” sell for a different auction price. There is sufficient evidence to infer that a silver color car sells for a larger price than a car of the “other color” category. ANOVA df Regression Residual Total Intercept Odometer I-1 I-2 3 96 99 SS 17966997 7772564 25739561 Coefficients Standard Error 16701 184.3330576 -0.0555 0.0047 90.48 68.17 295.48 76.37 MS 5988999 80964 F Significance F 73.97 0.0000 t Stat P-value 90.60 0.0000 -11.72 0.0000 1.33 0.1876 3.87 0.0002 26 Nominal Independent Variables; Example: MBA Program Admission (MBA II) • The Dean wants to evaluate applications for the MBA program by predicting future performance of the applicants. • The following three predictors were suggested: – Undergraduate GPA – GMAT score – Years of work experience Note: The undergraduate degree is nominal data. • It is now believed that the type of undergraduate degree should be included in the model. 27 Nominal Independent Variables; Example: MBA Program Admission 1 if B.A. I1 = 0 otherwise 1 if B.B.A I2 = 0 otherwise 1 if B.Sc. or B.Eng. I3 = 0 otherwise The category “Other group” is defined by: I1 = 0; I2 = 0; I3 = 0 28 Nominal Independent Variables; Example: MBA Program Admission MBA-II SUMMARY OUTPUT Regression Statistics Multiple R 0.7461 R Square 0.5566 Adjusted R Square 0.5242 Standard Error 0.729 Observations 89 ANOVA df Regression Residual Total Intercept UnderGPA GMAT Work I-1 I-2 I-3 SS 6 82 88 54.75 43.62 98.37 Coefficients Standard Error 0.19 1.41 -0.0061 0.114 0.0128 0.0014 0.098 0.030 -0.34 0.22 0.71 0.24 0.03 0.21 MS 9.13 0.53 F Significance F 17.16 0.0000 t Stat P-value 0.13 0.8930 -0.05 0.9577 9.43 0.0000 3.24 0.0017 -1.54 0.1269 2.93 0.0043 0.17 0.8684 29 Applications in Human Resources Management: Pay-Equity • Pay-equity can be handled in two different forms: – Equal pay for equal work – Equal pay for work of equal value. • Regression analysis is extensively employed in cases of equal pay for equal work. 30 Human Resources Management: Pay-Equity • Example (Xm9-03_supp) – Is there sex discrimination against female managers in a large firm? – A random sample of 100 managers was selected and data were collected as follows: • • • • Annual salary Years of education Years of experience Gender 31 Human Resources Management: Pay-Equity • Solution – Construct the following multiple regression model: y = b0 + b1Education + b2Experience + b3Gender + e – Note the nature of the variables: • Education – Interval • Experience – Interval • Gender – Nominal (Gender = 1 if male; =0 otherwise). 32 Human Resources Management: Pay-Equity • Solution – Continued (Xm9-03) SUMMARY OUTPUT Regression Statistics Multiple R 0.8326 R Square 0.6932 Adjusted R Square 0.6836 Standard Error 16274 Observations 100 Analysis and Interpretation • The model fits the data quite well. • The model is very useful. • Experience is a variable strongly related to salary. • There is no evidence of sex discrimination. ANOVA df Regression Residual Total Intercept Education Experience Gender SS MS 3 57434095083 19144698361 96 25424794888 264841613.4 99 82858889971 CoefficientsStandard Error -5835.1 16082.8 2118.9 1018.5 4099.3 317.2 1851.0 3703.1 t Stat -0.36 2.08 12.92 0.50 F Significance F 72.29 0.0000 P-value 0.7175 0.0401 0.0000 0.6183 33 Human Resources Management: Pay-Equity • Solution – Continued (Xm9-03) SUMMARY OUTPUT Regression Statistics Multiple R 0.8326 R Square 0.6932 Adjusted R Square 0.6836 Standard Error 16274 Observations 100 ANOVA df Regression Residual Total Intercept Education Experience Gender 3 96 99 Analysis and Interpretation • Further studying the data we find: Average experience (years) for women is 12. Average experience (years) for men is 17 • Average salary for female manager is $76,189 Average salary for male manager is $97,832 SS 57434095083 25424794888 82858889971 Coefficients Standard Error -5835.1 16082.8 2118.9 1018.5 4099.3 317.2 1851.0 3703.1 MS 19144698361 264841613 t Stat -0.36 2.08 12.92 0.50 F Significance F 72.29 0.0000 P-value 0.7175 0.0401 0.0000 0.6183 34 Stepwise Regression • Multicollinearity may prevent the study of the relationship between dependent and independent variables. • The correlation matrix may fail to detect multicollinearity because variables may relate to one another in various ways. • To reduce multicollinearity we can use stepwise regression. • In stepwise regression variables are added to or deleted from the model one at a time, based on their contribution to the current model. 35 Model Building • Identify the dependent variable, and clearly define it. • List potential predictors. – Bear in mind the problem of multicollinearity. – Consider the cost of gathering, processing and storing data. – Be selective in your choice (try to use as few variables as possible). 36 • Gather the required observations (have at least six observations for each independent variable). • Identify several possible models. – A scatter diagram of the dependent variables can be helpful in formulating the right model. – If you are uncertain, start with first order and second order models, with and without interaction. – Try other relationships (transformations) if the polynomial models fail to provide a good fit. • Use statistical software to estimate the model. 37 • Determine whether the required conditions are satisfied. If not, attempt to correct the problem. • Select the best model. – Use the statistical output. – Use your judgment!! 38