Ognjen Mladenovic, Laura Martinez, Nicolas Rosen, Kalash Tirpude
Assignment 2

QUESTION 1

1.1.

$$y = 7.84 + 2.85x_1 + 2.28x_2 + 7.17x_3 + \varepsilon$$

Where:
y is the dependent variable: net earnings in millions of dollars
x1 is the independent variable: production cost of the movie in millions of dollars
x2 is the independent variable: promotional cost of the movie in millions of dollars
x3 is the independent categorical variable: whether or not the movie is based on a book: 1 = based on book, 0 = not based on book
ε is the error term

1.2.

The model is useful: the adjusted R square is 96% and the standard error of the estimate is 3.69. Overall, we can say that this model is very good. R square measures how much of the variance in our dependent variable is explained by our model (explained variance / total variance).

Multiple Regression for Earn: Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.9832 | 0.9667 | 0.9605 | 3.689501 |

All three independent variables are statistically significant. We can justify that by using three different tests: p-Value, t-Test and the 95% confidence interval.

Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 7.84 | 2.33 | 3.358 | 0.0040 | 2.89 | 12.78 |
| Cost | 2.85 | 0.39 | 7.258 | < 0.0001 | 2.02 | 3.68 |
| Prom | 2.28 | 0.25 | 8.989 | < 0.0001 | 1.74 | 2.82 |
| Book | 7.17 | 1.82 | 3.942 | 0.0012 | 3.31 | 11.02 |

• An explanatory variable is statistically significant if its absolute t-Value is higher than 2. This condition is fulfilled for all three independent variables.
• An explanatory variable is statistically significant if its p-Value is lower than 0.05. This condition is fulfilled for all three independent variables.
• An explanatory variable is statistically significant if its 95% confidence interval does not contain 0. This condition is fulfilled for all three independent variables.
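The three significance checks above can be applied mechanically to the regression table. The following is a minimal Python sketch, not part of the original analysis: the dictionary layout and the function name `is_significant` are illustrative assumptions, and p-Values reported as "< 0.0001" are entered as the bound 0.0001.

```python
# Values copied from the Question 1 regression table.
# p-Values reported as "< 0.0001" are stored as the bound 0.0001
# (an assumption made only so they can be compared numerically).
TABLE = {
    "Constant": {"t": 3.358, "p": 0.0040, "ci": (2.89, 12.78)},
    "Cost": {"t": 7.258, "p": 0.0001, "ci": (2.02, 3.68)},
    "Prom": {"t": 8.989, "p": 0.0001, "ci": (1.74, 2.82)},
    "Book": {"t": 3.942, "p": 0.0012, "ci": (3.31, 11.02)},
}


def is_significant(row, t_cutoff=2.0, alpha=0.05):
    """Return the outcome of the three checks for one table row."""
    lower, upper = row["ci"]
    return {
        "t_test": abs(row["t"]) > t_cutoff,
        "p_value": row["p"] < alpha,
        "ci_excludes_zero": not (lower <= 0.0 <= upper),
    }


results = {name: is_significant(row) for name, row in TABLE.items()}
for name, checks in results.items():
    print(name, checks)
```

For this table all three checks agree for every row, which matches the conclusion stated above.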
To build an approximate confidence interval at the 95% confidence level, we use our model to predict the earnings of a movie that cost 7.5 million to produce, spent 5.5 million on promotion and was based on a book.

$$y = 7.84 + 2.85(7.5) + 2.28(5.5) + 7.17(1) = 48.93$$

$$\text{Confidence Interval} = 48.93 \pm 2 \times 3.69 = [41.55,\ 56.31]$$

As we can see, the confidence interval at the 95% confidence level runs from 41.55 to 56.31. This is a relatively narrow range, showing us that the model is quite good at predicting the earnings of movies.

1.3.

$$y = 7.84 + 2.85(0) + 2.28(0) + 7.17(0) = 7.84$$

The model predicts net earnings of 7.84 million for a movie that has no production or promotion costs and is not based on a book. However, the y-intercept on its own, with all IVs set to zero, should not be given any meaning, as it describes an unrealistic scenario in which the movie costs nothing to produce. Furthermore, this data point would lie outside the range of observed data, where the relationship between the variables might be different.

1.4.

For every additional million US$ in production cost, net earnings go up by an estimated US$ 2.85 million, holding the other variables constant.

1.5.

If the underlying value of the coefficient of the PROM (promotions) variable is 1, an increase in promotion cost of US$ 1 million increases the net earnings of the movie by US$ 1 million. Since it is net earnings that increase by 1 million for each million invested in promotion, promotions would be effective: the cost is already accounted for in the 1 million increase in net earnings. If the dependent variable were revenue instead of net earnings, a coefficient of 1 would mean that promotions are ineffective, as each extra million of revenue would only cover the promotional cost and net earnings would not increase.

1.6.
Based on book:

$$\text{Net Earnings} = 7.84 + 2.85(6) + 2.28(3) + 7.17(1) = 38.95$$

Not based on book:

$$\text{Net Earnings} = 7.84 + 2.85(6) + 2.28(3) + 7.17(0) = 31.78$$

According to our model, the estimated net earnings of a movie that costs $6m to produce, has a promotion cost of $3m and is based on a book are $38.95 million, while for a movie with identical costs that is not based on a book the estimated net earnings are $31.78 million. In practical terms, the coefficient of the book variable tells us that on average a movie that is based on a book will have $7.17 million higher net earnings than a movie with identical costs that is not based on a book.

1.7.

H0: Residuals follow a normal distribution
Ha: Residuals are not normally distributed

We use the Lilliefors test to check whether we have to reject H0. We fail to reject H0, because the test statistic (0.0927) is lower than the critical value at the 5% significance level (0.1924). We do not use the chi-square test, since our sample is too small (n = 20).

Lilliefors Test Results: Residuals

| Statistic | Value |
|---|---|
| Sample Size | 20 |
| Sample Mean | 0.000 |
| Sample Std Dev | 3.386 |
| Test Statistic | 0.0927 |
| CVal (15% Sig. Level) | 0.1666 |
| CVal (10% Sig. Level) | 0.1760 |
| CVal (5% Sig. Level) | 0.1924 |
| CVal (2.5% Sig. Level) | 0.2053 |
| CVal (1% Sig. Level) | 0.2812 |

QUESTION 2

2.1.
After running three regression models, removing statistically insignificant independent variables one by one (see Models 1 and 2 in the Appendix), we arrived at the following regression equation:

$$y = -0.859 + 0.005x_1 - 0.130x_2 + \varepsilon$$

Where:
y is the dependent variable: return in excess of the risk-free rate
x1 is the independent variable: quality of education, measured as the average SAT at the manager's postgraduate university
x2 is the independent variable: age of the fund manager
ε is the error term

Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | -0.859 | 2.187 | -0.393 | 0.6945 | -5.149 | 3.430 |
| SAT | 0.005 | 0.001 | 3.978 | < 0.0001 | 0.003 | 0.008 |
| Age | -0.130 | 0.038 | -3.421 | 0.0006 | -0.205 | -0.056 |

2.2.

Model significance: In this third regression model all independent variables (SAT and Age) are statistically significant, as:
• their p-Values are below 0.05
• their absolute t-Values are higher than 2
• their confidence intervals do not include 0

Multiple Regression for Return: Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.12 | 0.015 | 0.013 | 8.31 |

Model quality:
• Adjusted R square: 0.013
• Standard error of the estimate: 8.31

Considering that the adjusted R square of 0.013 is extremely low and the standard error of the estimate is very high at 8.31, the model quality is extremely low. The standard error of the estimate tells us that roughly 95% of the observations should fall within plus/minus 16.62 of the fitted line. Given that the average performance of all the analyzed funds is 0.55, this band is far too wide for the predictions to be useful.

In order to calculate a confidence interval, we calculated the expected return using our model for a manager who studied at a university with an average SAT of 1000, since 1000 is close to the mean of our data, and who is 42 years old, since that is close to the average age of the managers in our data:
$$y = -0.859 + 0.005(1000) - 0.130(42) = -1.319$$

$$\text{Confidence Interval} = -1.319 \pm 2 \times 8.31 = [-17.94,\ 15.30]$$

For a fund for which the model predicts a -1.32% return, the confidence interval at the 95% confidence level runs from -17.94% to 15.30%. Clearly, this model is very weak at predicting a fund's performance.

2.3.

$$y = -0.859 + 0.005(710) - 0.130(35) = -1.859$$

Our model predicts that a fund run by a 35-year-old MIAF (not included in the model) who has been with the fund for 4 years (not included in the model) and graduated from a university with an average SAT score of 710 will have a return 1.859 below the risk-free rate. If our model were of good quality, we would not invest in this fund, as the model predicts a return 1.859 below the risk-free rate. However, since the model has such low explanatory power and such a high standard error of the estimate, we would not rely on this model at all and would instead look for a better way of analyzing whether or not to invest in this fund.

QUESTION 3

3.1.

$$y = -128.72 + 0.75x_1 + 34.02x_2 - 86.68x_3 + \varepsilon$$

Where:
y is the dependent variable: time to complete the job in hours
x1 is the independent variable: number of pieces in the job
x2 is the independent variable: number of operations per piece
x3 is the independent categorical variable: whether or not the order is a 'rush': 1 = 'rush', 0 = not a 'rush'
ε is the error term

Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | -128.72 | 89.92 | -1.43 | 0.1715 | -319.34 | 61.90 |
| PIECES | 0.75 | 0.10 | 7.39 | < 0.0001 | 0.53 | 0.96 |
| OPS | 34.02 | 7.08 | 4.81 | 0.0002 | 19.01 | 49.02 |
| RUSH | -86.68 | 42.10 | -2.06 | 0.0562 | -175.92 | 2.57 |

The dummy variable RUSH should be removed from the model, as both the p-Value (not lower than 0.05) and the confidence interval (0 is included in the interval [-175.92, 2.57]) show that it is not statistically significant.
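A side note on the RUSH result: its absolute t-Value (2.06) is above the rule-of-thumb cutoff of 2 used in Question 1, yet the p-Value and the confidence interval both indicate insignificance. One way to reconcile this, sketched below in Python under the assumption that the reported interval is symmetric (coefficient plus/minus critical t times standard error), is to back out the critical t-value implied by the table. The function name is illustrative and not part of the original analysis.

```python
def implied_critical_t(lower, upper, std_error):
    """Half-width of a symmetric interval divided by the standard error."""
    return (upper - lower) / (2.0 * std_error)


# RUSH row of the Question 3.1 regression table.
rush_t = -2.06
rush_se = 42.10
rush_ci = (-175.92, 2.57)

t_crit = implied_critical_t(rush_ci[0], rush_ci[1], rush_se)
print(round(t_crit, 2))      # implied critical value, a little above 2
print(abs(rush_t) > t_crit)  # False: RUSH misses the implied cutoff
```

The implied cutoff is slightly larger than 2, so the rule of thumb is only approximate here, which is consistent with relying on the p-Value and the interval above.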
Multiple Regression for TIME: Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.92 | 0.84 | 0.81 | 88.91 |

Model Quality: The adjusted R square of the model is good at 0.81, which means that 81% of the variance in the dependent variable is explained by this model. However, the standard error of the estimate is quite high at 88.91.

The standard error of the estimate tells us that roughly 95% of the observations should fall within plus/minus 177.82 of the fitted line. In order to calculate a confidence interval, we calculated the expected time using the given model for a non-rush job with 250 pieces, since 250 is close to the mean of our data, and 10 operations per piece, since that is close to the average number of operations per piece in our data.

$$y = -128.72 + 0.75(250) + 34.02(10) - 86.68(0) = 398.98$$

$$\text{Confidence Interval} = 398.98 \pm 2 \times 88.91 = [221.16,\ 576.80]$$

The given model predicts a time of 398.98 hours required to finish the order. At the 95% confidence level the confidence interval runs from 221.16 hours to 576.80 hours. Clearly, this model is very weak at predicting the required time for a job at Erie Steel Ltd. However, considering that the model includes a statistically insignificant variable, we should first build a new model without that variable before assessing the model's quality.

3.2.

H0: the coefficient of RUSH equals 0, i.e. the RUSH variable (IV) does not affect time (DV)
H1: the coefficient of RUSH differs from 0, i.e. the RUSH variable (IV) affects time (DV)

Roger's claim holds at the 5% significance level, in the sense that we cannot reject it. We can show this through the p-Value and the confidence interval test. We hypothesize (H0) that the RUSH variable has no effect on time. We fail to reject this hypothesis because the p-Value (0.056) is higher than 0.05 and the 95% confidence interval [-175.92, 2.57] contains 0. Thus, at the 5% significance level, we do not find support for Pete's claim that 'rush' reduces the time by 50 hours on average.

3.3. What regressions do you run?
First, a regression was run with TIME as the DV and PIECES, OPS, RUSH and the interaction effect between PIECES and OPS (PIECES*OPS) as IVs. In this model (Model 1) all IVs except the interaction effect between PIECES and OPS are statistically not significant (see Appendix 3.3). Next, Model 2 was calculated without the IV RUSH. In this model PIECES and OPS were both shown to be statistically insignificant (see Appendix 3.3). For Model 3 the IV PIECES was removed, and only OPS and the interaction effect between OPS and PIECES were used as IVs. This model shows both OPS and the interaction effect to be statistically significant, as their p-Values are below 0.05 (see Appendix 3.3). Finally, Model 4 was created, in which the IVs were PIECES and the interaction effect between PIECES and OPS. Again both variables were shown to be statistically significant, as their p-Values are below 0.05 (see Appendix 3.3).

Since both Model 3 and Model 4 have only statistically significant IVs and their model qualities are nearly identical, with an adjusted R square of 0.96 and a standard error of the estimate close to 40 (see tables below), we cannot decide on this basis which model we prefer.

Model 3 Quality: Multiple Regression for TIME, Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.98 | 0.96 | 0.96 | 40.27 |

Model 4 Quality: Multiple Regression for TIME, Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.98 | 0.97 | 0.96 | 39.28 |

When checking the PIECES residual plot of Model 4, we can see that there is some kind of pattern, which suggests that the relationship between PIECES and TIME is non-linear. We therefore looked at the scatterplot of TIME against PIECES (see below), which also suggests that the relationship might be non-linear.
Thus we transformed the variable PIECES by taking its square root and ran a 5th model with TIME as the DV and the square root of PIECES and the interaction effect between PIECES and OPS as IVs.

In this 5th model, both IVs are statistically significant, as their p-Values are < 0.05 (see Appendix 3.3).

Model 5 Quality: Multiple Regression for TIME, Summary

| Multiple R | R-Square | Adjusted R-square | Std. Err. of Estimate |
|---|---|---|---|
| 0.99 | 0.98 | 0.98 | 31.46 |

Looking at the model quality, we can see that it is the best model, with an adjusted R square of 0.98 compared to 0.96 for Models 3 and 4, and a standard error reduced from 40.27 and 39.28 respectively to 31.46. This shows that the transformation of PIECES improved the model.

Model 5: The equation for this model is:

$$y = 209.24 + 0.15x_1 - 12.76x_2 + \varepsilon$$

Where:
y is the dependent variable: time to complete the job
x1 is the interaction between PIECES and OPS (PIECES*OPS)
x2 is the transformed independent variable PIECES: square root of PIECES
ε is the error term

This new model, with the interaction effect between OPS and PIECES included as an IV, is of much higher quality than the model in question 3.1. Not only did the adjusted R square increase from 0.81 to 0.98, but the standard error of the estimate also dropped from 88.91 to 31.46.

The standard error of the estimate tells us that roughly 95% of the observations should fall within plus/minus 62.92 of the fitted line. In order to calculate a confidence interval, we calculated the expected time using our model for the given job (500 pieces with 7 operations per piece):

$$y = 209.24 + 0.15(500 \times 7) - 12.76\sqrt{500} = 448.92$$

$$\text{Confidence Interval} = 448.92 \pm 2 \times 31.46 = [386.00,\ 511.84]$$

The model predicts a time of 448.92 hours required to finish the order that Roger promised. At the 95% confidence level the confidence interval runs from 386.00 hours to 511.84 hours.
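The calculation above can be reproduced with a short Python sketch. It assumes the rounded Model 5 coefficients from the equation and the plus/minus 2 standard errors rule used throughout this assignment; the constant and function names are illustrative.

```python
import math

# Rounded Model 5 coefficients as reported in the text.
INTERCEPT = 209.24
B_INTERACTION = 0.15     # coefficient of PIECES * OPS
B_SQRT_PIECES = -12.76   # coefficient of sqrt(PIECES)
STD_ERR_ESTIMATE = 31.46


def predict_time(pieces, ops):
    """Predicted hours for a job, using the Model 5 equation."""
    return (INTERCEPT
            + B_INTERACTION * pieces * ops
            + B_SQRT_PIECES * math.sqrt(pieces))


def approx_interval(prediction, std_err=STD_ERR_ESTIMATE, k=2.0):
    """Approximate 95% interval: prediction plus/minus k standard errors."""
    return prediction - k * std_err, prediction + k * std_err


pred = predict_time(pieces=500, ops=7)
low, high = approx_interval(pred)
print(round(pred, 2), round(low, 2), round(high, 2))
```

Comparing a target such as 360 hours against `low` then shows directly whether it falls inside the interval.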
The time that Roger requires (360 hours) does not lie within the 95% confidence interval: it is below the lower bound of 386.00 hours. Our model therefore predicts with 95% certainty that the order cannot be completed in 360 hours.

No, Roger should not designate the order as rush, since the IV RUSH was found to be statistically not significant in our model.

QUESTION 4

4.1.

Explanatory variables are statistically significant if they fulfil the following conditions:
a) absolute t-Value > 2
b) p-Value < 0.05

• House Size: is statistically significant
  o Absolute t-Value (11.24) > 2
  o p-Value of 0 is < 0.05
• Lot Size: is statistically significant
  o Absolute t-Value (46.07) > 2
  o p-Value of 0 is < 0.05
• Neighborhood: is not statistically significant
  o Absolute t-Value (0.21) < 2
  o p-Value of 0.8308 is > 0.05
• Bedroom: is statistically significant
  o Absolute t-Value (6.85) > 2
  o p-Value of 0 is < 0.05
• House Size*Neighborhood: is statistically significant
  o Absolute t-Value (2.54) > 2
  o p-Value of 0.0115 is < 0.05
• Lot Size*Neighborhood: is not statistically significant
  o Absolute t-Value (0.89) < 2
  o p-Value of 0.3752 is > 0.05

4.2.

The coefficient for the number of bedrooms is 17.08, which means that for every additional bedroom the price of the house increases on average by $17,080, holding the other variables constant. Since the t-Value is the coefficient divided by its standard error, the standard error is calculated as the coefficient divided by the t-Value:

$$\text{Standard Error} = \frac{B}{t\text{-Value}} = \frac{17.08}{6.85} = 2.49$$

4.3.
$$\text{Selling Price} = -35.45 + 0.067(\text{House Size}) + 0.05(\text{Lot Size}) - 2.31(\text{Neighborhood}) + 17.08(\text{Number of Bedrooms}) - 0.015(\text{House Size})(\text{Neighborhood}) + 0.001(\text{Lot Size})(\text{Neighborhood}) + \varepsilon$$

According to this model, the selling price of a 3,000 square foot house on a 10,000 square foot lot in neighborhood 1 with 4 bedrooms is as follows (assuming that the dummy variable Neighborhood is 1 when the house is in neighborhood 1 and 0 when it is in neighborhood 2):

$$\text{Selling Price} = -35.45 + 0.067(3000) + 0.05(10000) - 2.31(1) + 17.08(4) - 0.015(3000)(1) + 0.001(10000)(1) = 696.56$$

Thus the estimated selling price would be $696,560.

However, it has to be considered that some of the model's variables were found to be statistically insignificant, so it would be better to calculate a model with only statistically significant variables before estimating.

4.4.

$$\text{Increase in Selling Price} = 0.067(100) + 17.08(1) - 0.015(100)(1) = 22.28$$

According to the model, the selling price of the house would increase by $22,280. As the addition costs $20,000 to build, the net profit is estimated at $2,280.

QUESTION 5

5.1

Yes, Income is a significant predictor of sales, as all 3 tests show that it is statistically significant:
1. The absolute t-Value is higher than 2 (14.65)
2. The p-Value is lower than 0.05 (0.00)
3. The confidence interval does not include 0 ([0.0045, 0.0059])

The predictive power of the model seems acceptable but not very high, with an adjusted R square of 0.71. Whether this is sufficient depends on the needs of the user of the model. The standard error of 121.24 seems quite high: from the "Income ($) Line Fit Plot" we can see that the observations of the DV Sales ($/Sq Ft) range approximately from 100 to 1100. A standard error of 121.24 gives us a range of ± 242.48, which indicates that the predictive power of the model is not particularly strong.
The equation of the model is:

$$y = 370.38 + 0.0052x_1 + \varepsilon$$

Where:
y is the dependent variable: Sales ($/Sq Ft)
x1 is the independent variable: median household income in the surrounding community
ε is the error term

5.2

The Line Fit Plot shows the actual (blue) and predicted (red) Sales for different levels of income. From this graph we can see that, while the predicted values follow a straight line by construction, the actual values do not show a linear relationship between Sales and Income but rather one in which the marginal increase in sales decreases as income increases.

Furthermore, we can see that the residual plot shows a pattern, reinforcing our expectations from the first graph. If the relationship were linear, the residual plot should not show any kind of pattern.

In order to improve this model, the independent variable income should be transformed by using the square root of income as the independent variable instead, as shown in the graphic below.

After this transformation, the relationship between the IV income and the DV sales should be more linear, which should lead to a higher adjusted R square and a lower standard error of the estimate, thus giving the model higher predictive power. Furthermore, the Line Fit Plot of this new model should show both actual and predicted values following a linear relationship, while the residual plot should no longer show any pattern.

5.3

She is not correct! With three location categories (Suburban, Urban and Rural), the model includes one dummy variable fewer than the number of categories, because the omitted category is represented by setting the other two dummy variables to 0. So in this case Suburban = 0 and Urban = 0 means that the store is in a rural area.

5.4

All variables are significant for this model.
Every single variable has an absolute t-Value higher than 2 and, confirming this, a p-Value lower than 0.05.

This model is much more useful than the previous one. It has a higher adjusted R square (0.91) than the previous model (0.71), which tells us that this model explains 91% of the variability of the dependent variable (Sales). Secondly, the standard error of the estimate dropped from 121.24 in the first model to 67.63 in the second model. Both values indicate that the second model is much better than the first model.

Holding the other variables constant:
Income: for each additional $1 of income, sales per square foot increase by $0.00496.
Population: for each additional 1,000 people, sales per square foot increase by $116.22.
Suburban: if the retail store is located in a suburban area, sales per square foot are $217.29 higher than for a store in a rural area.
Urban: if the retail store is located in an urban area, sales per square foot are $86.78 higher than for a store in a rural area.

5.5

$$\text{Sales} = 213.31 + 0.00496(\text{Income}) + 116.22(\text{Population [in 000s]}) + 217.29(\text{Suburban Dummy}) + 86.78(\text{Urban Dummy})$$

$$\text{Sales} = 213.31 + 0.00496(35{,}000) + 116.22(0.5) + 217.29(0) + 86.78(0) = 445.02$$

The estimated sales are $445.02 per square foot of store space. Since the store is 1,000 square feet, the estimated sales for this store are $445,020.

Appendix

Appendix Q 2.1:

Model 1: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | -1.15 | 2.20 | -0.52 | 0.6012 | -5.453 | 3.158 |
| SAT | 0.01 | 0.00 | 3.96 | < 0.0001 | 0.003 | 0.008 |
| Finance degree | 0.67 | 0.38 | 1.79 | 0.0730 | -0.063 | 1.412 |
| Age | -0.14 | 0.04 | -3.31 | 0.0009 | -0.224 | -0.057 |
| Tenure | 0.08 | 0.18 | 0.47 | 0.6412 | -0.262 | 0.426 |

Model 1 includes all independent variables (SAT, Finance degree, Age and Tenure). However, we can see that the variables Finance degree and Tenure are statistically insignificant, as their p-Values are not below 0.05.
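The elimination procedure behind the Question 2 models (drop the least significant variable, then re-run the regression) can be sketched in Python as follows. This only illustrates the selection step, using the p-Values reported in this appendix and in 2.1; it does not refit any regression. The function name is hypothetical, the constant is left out of the selection, and "< 0.0001" is entered as 0.0001.

```python
def next_variable_to_drop(p_values, alpha=0.05):
    """Return the variable with the largest p-Value at or above alpha, else None."""
    candidates = {name: p for name, p in p_values.items() if p >= alpha}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# p-Values from the regression tables (constant excluded).
model_1 = {"SAT": 0.0001, "Finance degree": 0.0730, "Age": 0.0009, "Tenure": 0.6412}
model_2 = {"SAT": 0.0001, "Finance degree": 0.0748, "Age": 0.0005}
model_3 = {"SAT": 0.0001, "Age": 0.0006}

for label, table in [("Model 1", model_1), ("Model 2", model_2), ("Model 3", model_3)]:
    print(label, "->", next_variable_to_drop(table))
```

Removing one variable at a time matters because p-Values change after each refit, so the remaining variables have to be re-checked in the new model.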
Model 2: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | -1.18 | 2.19 | -0.54 | 0.5897 | -5.485 | 3.119 |
| SAT | 0.01 | 0.00 | 3.97 | < 0.0001 | 0.003 | 0.008 |
| Finance degree | 0.67 | 0.38 | 1.78 | 0.0748 | -0.067 | 1.407 |
| Age | -0.13 | 0.04 | -3.46 | 0.0005 | -0.207 | -0.057 |

For Model 2 the variable Tenure, which was found to be statistically insignificant in Model 1, has been removed from the regression. However, Finance degree is still statistically insignificant in this new model, as its p-Value is not below 0.05. Thus for Model 3 (see main part Q2.1) it has been removed from the model.

Appendix Q 3.3:

Model 1: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 77.98 | 44.82 | 1.74 | 0.102 | -17.56 | 173.51 |
| PIECES | -0.15 | 0.11 | -1.34 | 0.199 | -0.39 | 0.09 |
| OPS | 7.09 | 4.31 | 1.64 | 0.121 | -2.10 | 16.28 |
| RUSH | -25.29 | 19.12 | -1.32 | 0.206 | -66.05 | 15.47 |
| Total machining time (PIECES*OPS) | 0.11 | 0.01 | 8.66 | < 0.0001 | 0.09 | 0.14 |

Model 2: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 72.70 | 45.68 | 1.59 | 0.131 | -24.13 | 169.53 |
| PIECES | -0.18 | 0.11 | -1.65 | 0.118 | -0.42 | 0.05 |
| OPS | 5.78 | 4.29 | 1.35 | 0.197 | -3.32 | 14.89 |
| Total machining time (PIECES*OPS) | 0.12 | 0.01 | 9.63 | < 0.0001 | 0.09 | 0.15 |

Model 3: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 18.04 | 33.04 | 0.55 | 0.592 | -51.66 | 87.75 |
| OPS | 11.02 | 3.04 | 3.63 | 0.002 | 4.61 | 17.43 |
| Total machining time (PIECES*OPS) | 0.10 | 0.00 | 20.88 | < 0.0001 | 0.09 | 0.11 |

Model 4: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 131.68 | 13.31 | 9.90 | < 0.0001 | 103.61 | 159.75 |
| PIECES | -0.30 | 0.08 | -3.83 | 0.001 | -0.46 | -0.13 |
| Total machining time (PIECES*OPS) | 0.13 | 0.01 | 14.55 | < 0.0001 | 0.11 | 0.15 |

Model 5: Regression Table

| Variable | Coefficient | Standard Error | t-Value | p-Value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Constant | 209.24 | 17.89 | 11.69 | < 0.0001 | 171.49 | 246.99 |
| Total machining time (PIECES*OPS) | 0.15 | 0.01 | 17.39 | < 0.0001 | 0.13 | 0.16 |
| Sqrt(Pieces) | -12.76 | 2.24 | -5.69 | < 0.0001 | -17.49 | -8.03 |