Pfeifer note: Section 6
Class 26: Model Building Philosophy

Assignment 26
• 1. A two-sample t-test ≡ regression with one dummy variable: t = ±6.2/2.4483 ≈ ±2.53 (via the complicated two-sample formula in Data Analysis, or via the regression with a dummy).
• 2. Single-factor ANOVA ≡ regression with p−1 dummies (see next slide).
• 3. Which is the better predictor? The one with the lower regression standard error (equivalently, the higher adjusted R-square), not the one with the larger coefficient.
• 4. Will they charge less than $4,500? Use the regression's standard error and T.DIST to calculate the probability.

Ready for ANOVA (SAT scores by occupation)

Occupation           SAT scores
Lawyer               44  42  74  42  53  50  45  48  64  38
Physical Therapist   55  78  80  86  60  59  62  52  55  50
Cabinetmaker         54  65  79  69  79  64  59  78  84  60
Systems Analyst      44  73  71  60  64  66  41  55  76  62

Ready for Regression (the same 40 observations stacked into a single SAT column, with p−1 = 3 dummies; Systems Analyst is the base case)

Occupation (10 rows each)   SAT scores                               Dlawyer  DPT  Dcabinet
Lawyer                      44, 42, 74, 42, 53, 50, 45, 48, 64, 38      1      0      0
Physical Therapist          55, 78, 80, 86, 60, 59, 62, 52, 55, 50      0      1      0
Cabinetmaker                54, 65, 79, 69, 79, 64, 59, 78, 84, 60      0      0      1
Systems Analyst             44, 73, 71, 60, 64, 66, 41, 55, 76, 62      0      0      0

Agenda
• IQ demonstration
• What you can do with lots of data
• What you should do with not much data
• Practice using the Oakland A's case
• Remember the Coal Pile!

Remember the Coal Pile!
• Model building involves more than just selecting which of the available X's to include in the model.
– See section 9 of the Pfeifer note to learn about transforming X's.
– We won't do much in this regard…

With lots of data (big data?)
[Slide graphic: a data table with columns X1, X2, …, Xn, Y and rows 1…N, split into a training set (rows 1…N1) and a hold-out sample (rows N1+1…N); the entries shown are illustrative random numbers.]
• Stats like "standard error" and adjusted R-square only measure FIT.
• Performance on a hold-out sample measures how well each model will FORECAST.
1. Split the data into two sets.
2. Use the training set to build several models.
3. Use the hold-out sample to test/compare the models. Use the best performing model. (See the Python sketch below.)

With lots of data (big data?)
• Computer algorithms do a very good job of finding a model.
• They guard against "over-fitting."
• Once you own the software, they are fast and cheap.
• They won't claim, however, to do better than a professional model builder.
• Remember the coal pile!
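Below is a minimal sketch of the split-and-compare workflow from the steps above. It is not from the Pfeifer note: pandas/statsmodels, the toy data, and the column names x1…x3, y are all illustrative assumptions.

```python
# Sketch: compare candidate models on a hold-out sample rather than on FIT.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)  # toy data purely for illustration
df = pd.DataFrame(rng.random((200, 4)), columns=["x1", "x2", "x3", "y"])

train = df.iloc[:150]    # "training set": rows 1..N1
holdout = df.iloc[150:]  # "hold-out sample": rows N1+1..N

for formula in ["y ~ x1", "y ~ x1 + x2 + x3"]:
    fit = smf.ols(formula, data=train).fit()   # build on the training set
    err = holdout["y"] - fit.predict(holdout)  # test on the hold-out rows
    rmse = float(np.sqrt((err ** 2).mean()))
    print(f"{formula}: adj R2 (FIT) = {fit.rsquared_adj:.3f}, "
          f"hold-out RMSE (FORECAST) = {rmse:.3f}")
```

The adjusted R-square is computed on the training rows, so it measures FIT; the hold-out RMSE measures how well each model FORECASTs rows it never saw. Pick the model with the lowest hold-out error.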
Without much data
• You will not be able to use a training set/hold-out sample.
• You get "one shot" to find a GOOD model.
• Regression and all its statistics can tell you which model FIT the data the best.
• Regression and all its statistics CANNOT tell you which model will perform (forecast) the best.
• Not to mention… regression has no clue about what causes what.

Remember…
• The model that does a spectacular job of fitting the past will do worse at predicting the future than a simpler model that more accurately captures the way the world works.
• Better fit leads to poorer forecasts!
– Instead of forecasting 100 for the next IQ, the over-fit model will sometimes predict 110 and other times predict 90!

Requiring low p-values for all coefficients does not protect against over-fitting.
• If there are 100 X's that are of NO help in predicting Y,
– We expect 5 of them will be statistically significant (at the usual 0.05 level).
– And we'll want to use all 5 to predict the future.
– And the model will be over-fit.
– We won't know it, perhaps.
– Our predictions will be WORSE as a result.

Modeling Balancing Act
• Useable (do we know the X's?)
• Simple
• Makes sense
– Use your judgment, given you can't rely solely on the stats/data.
– Signs of coefficients should make sense.
• Significant (low-p) coefficients
– Except for sets of dummies.
• Low standard error
– Consistent with high adjusted R-square.
• Meets all four assumptions
– Linearity (most important)
– Homoskedasticity (equal variance)
– Independence
– Normality (least important)

Oakland A's (A) Case Facts
• Despite making only $40K, pitcher Mark Nobel had a great year for Oakland in 1980.
– Second in the league in ERA (2.53), complete games (24), innings (284 1/3), and strikeouts (180).
– Gold Glove winner (best fielding pitcher).
– Second in Cy Young award voting.

Nobel Wants a Raise
• "I'm not saying anything against Rick Langford or Matt Keough (fellow A's pitchers)… but I filled the stadium last year against Tommy John (star pitcher for the Yankees)."
• Nobel's agent argued:
– Average home attendance for Nobel's 16 starts was 12,663.6.
– Average home attendance for the remaining home games was only 10,859.4.
– Nobel should get "paid" for the difference: 1,804.2 extra tickets per start.

Data from 1980 Home Games

No   DATE     TIX    OPP  POS  GB   DOW  TEMP  PREC  TOG  TV  PROMO  YANKS  NOBEL
1    10-Apr   24415   2    5    1    4    57    0     2    1    0      0      0
2    11-Apr    5729   2    3    1    5    66    0     2    1    0      0      0
3    12-Apr    5783   2    7    1    6    64    0     1    0    0      0      0
…    …            …   …    …    …    …     …    …     …    …    …      …      …
73   26-Sep    5099   6    2   14    5    64    0     2    1    0      0      1
74   27-Sep    4581   6    2   13    6    62    0     1    0    0      0      0
75   28-Sep   10662   6    2   12    7    65    0     1    0    1      0      0

LEGEND
• OPP, Opposing Team: 1 Seattle, 2 Minnesota, 3 California, 4 Yankees, 5 Detroit, 6 Milwaukee, 7 Toronto, 8 White Sox, 9 Boston, 10 Baltimore, 11 Cleveland, 12 Texas, 13 Kansas City
• POS, Position: A's ranking in the American League West.
• GB, Games Behind: minimum number of games needed to move ahead of the current first-place team.
• DOW, Day of Week: Monday = 1, Tuesday = 2, etc.
• PREC, Precipitation: 1 if precipitation, 0 if not.
• TOG, Time of Game: 1 if day, 2 if night.

TASK
• Be ready to report about the model assigned to your table (1 to 7).
– What is the model? (succinct)
– Critique it (succinctly).
– Ignore "Durbin-Watson."
– "Standard deviation of residuals" is regression's standard error by another name.
– The output gives just the t-stat; a t of ±2 corresponds to a p-value of about 0.05 (see the sketch after Model 1).

Model 1: TIX versus NOBEL

Variable   Coefficient   Std. Error   T-stat.
NOBEL        1,804.207    2,753.164     0.655
CONSTANT    10,859.356    1,271.632     8.540

R-Squared = 0.006   Adjusted R-Squared = -0.008
Std. Deviation of Residuals = 9767.6   Durbin-Watson D = 1.196
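The outputs report only t-statistics. As a minimal sketch (my own illustration; scipy and the helper name are assumptions), here is how a t-stat converts to the two-sided p-value the TASK slide alludes to, using df = n − k − 1 with n = 75 home games:

```python
# Sketch: two-sided p-values from the reported t-statistics.
from scipy import stats

def p_value(t_stat: float, df: int) -> float:
    """Two-sided p-value for a regression t-statistic."""
    return 2 * stats.t.sf(abs(t_stat), df)

print(p_value(0.655, 73))  # Model 1 NOBEL: ~0.51, nowhere near significant
print(p_value(2.000, 73))  # ~0.05: the "t of +/-2" rule of thumb
print(p_value(8.540, 73))  # Model 1 CONSTANT: essentially 0
```

Note that Model 1 is exactly the agent's argument in regression form: the CONSTANT (10,859.356) is the average attendance at non-Nobel games, and the NOBEL coefficient (1,804.207) is the agent's 1,804.2-ticket difference. Its t-stat of 0.655, however, says that difference is well within chance variation.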
Model 4: TIX versus OPP, NOBEL

Variable   Coefficient   Std. Error   T-stat.
OPP           -269.135      297.809    -0.904
NOBEL        1,572.135    2,768.562     0.568
CONSTANT    12,807.161    2,182.002     5.869

R-Squared = 0.017   Adjusted R-Squared = -0.010
Std. Deviation of Residuals = 9779.9   Durbin-Watson D = 1.146

Model 2: TIX versus O1 through O12, NOBEL

Variable   Coefficient   Std. Error   T-stat.
NOBEL          323.388    1,755.292     0.184
O1          -4,627.963    3,396.590    -1.363
O2          -1,607.024    3,224.109    -0.498
O3          -3,810.322    3,578.674    -1.065
O4          28,663.478    3,578.674     8.010
O5          -2,177.244    3,526.638    -0.617
O6          -3,412.231    3,358.582    -1.016
O7          -3,628.322    3,578.674    -1.014
O8          -6,516.065    3,358.582    -1.940
O9           1,263.371    3,396.590     0.372
O10            100.833    3,345.816     0.030
O11           -927.898    3,358.582    -0.276
O12         -5,839.463    3,396.590    -1.719
CONSTANT    11,652.167      983.126    11.852

R-Squared = 0.708   Adjusted R-Squared = 0.645
Std. Deviation of Residuals = 5795.1   Durbin-Watson D = 2.291

Model 3: TIX versus O1 through O12, PREC, TEMP, PROMO, NOBEL, OD, DH

Variable   Coefficient   Std. Error   T-stat.
PREC        -3,772.043    3,383.418    -1.115
TEMP          -184.293      237.731    -0.775
PROMO        5,398.545    1,780.857     3.031
NOBEL         -403.502    1,518.000    -0.266
OD          15,382.632    5,652.397     2.721
DH           7,645.224    2,429.894     3.146
O1          -7,213.660    2,999.437    -2.405
O2          -3,203.395    3,046.540    -1.051
O3          -5,780.245    3,242.464    -1.783
O4          25,640.501    3,196.000     8.023
O5          -3,444.192    3,056.500    -1.127
O6          -4,568.433    2,988.677    -1.529
O7          -5,075.192    3,190.707    -1.591
O8          -5,973.904    3,329.604    -1.794
O9           1,966.401    2,971.357     0.662
O10         -2,352.715    3,002.119    -0.784
O11         -1,701.151    3,023.445    -0.563
O12         -5,627.881    2,911.665    -1.933
CONSTANT    22,740.489   14,777.323     1.539

R-Squared = 0.803   Adjusted R-Squared = 0.740
Std. Deviation of Residuals = 5011.0   Durbin-Watson D = 2.269

Model 5: TIX versus PREC, TOG, TV, PROMO, NOBEL, YANKS, WKEND, OD, DH

Variable   Coefficient   Std. Error   T-stat.
PREC        -3,660.109    3,251.502    -1.126
TOG          1,606.406    1,334.121     1.204
TV             223.421    1,982.301     0.113
PROMO        4,382.173    1,658.644     2.642
NOBEL       -1,244.411    1,546.545    -0.805
YANKS       29,493.164    2,532.314    11.647
WKEND        1,468.269    1,328.585     1.105
OD          16,119.831    5,388.174     2.992
DH           5,815.814    2,375.194     2.449
CONSTANT     5,082.356    2,170.419     2.342

R-Squared = 0.742   Adjusted R-Squared = 0.706
Std. Deviation of Residuals = 5273.5   Durbin-Watson D = 1.733

Model 6: TIX versus PROMO, NOBEL, YANKS, DH

Variable   Coefficient   Std. Error   T-stat.
PROMO        4,195.743    1,737.742     2.414
NOBEL       -1,204.082    1,607.869    -0.749
YANKS       29,830.245    2,641.516    11.293
DH           5,274.262    2,457.377     2.146
CONSTANT     8,363.238      527.298    15.861

R-Squared = 0.692   Adjusted R-Squared = 0.675
Std. Deviation of Residuals = 5551.0   Durbin-Watson D = 1.96

Model 7: TIX versus PREC, PROMO, NOBEL, YANKS, OD

Variable   Coefficient   Std. Error   T-stat.
PREC        -1,756.508    3,227.439    -0.544
PROMO        3,758.920    1,687.895     2.227
NOBEL         -209.484    1,549.192    -0.135
YANKS       30,568.223    2,570.535    11.892
OD          15,957.998    5,491.220     2.906
CONSTANT     8,457.002      496.203    17.043

R-Squared = 0.709   Adjusted R-Squared = 0.688
Std. Deviation of Residuals = 5434.5   Durbin-Watson D = 1.873

What does it mean that the coefficient of NOBEL is negative in most of the models?
Why was the coefficient of NOBEL positive in Model 1?
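One way to think about these questions is to score a model's fitted equation for a few scenarios. Below is a minimal sketch built from the Model 6 coefficients reported above; the function is my own illustrative helper, not part of the case.

```python
# Sketch: predictions from Model 6,
# TIX = 8363.238 + 4195.743*PROMO - 1204.082*NOBEL + 29830.245*YANKS + 5274.262*DH
def predict_tix(promo: int, nobel: int, yanks: int, dh: int) -> float:
    return (8363.238 + 4195.743 * promo - 1204.082 * nobel
            + 29830.245 * yanks + 5274.262 * dh)

print(predict_tix(0, 0, 0, 0))  # ordinary game:               ~8,363 tickets
print(predict_tix(0, 1, 0, 0))  # Nobel starts:                ~7,159
print(predict_tix(0, 0, 1, 0))  # Yankees game, other pitcher: ~38,193
print(predict_tix(0, 1, 1, 0))  # Yankees game, Nobel starts:  ~36,989
```

Holding the other dummies fixed, a Nobel start predicts roughly 1,200 fewer tickets; the enormous attendance at his Yankees starts is carried by YANKS, not NOBEL, which is what the closing questions are probing.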