# Assignment 2 - Final Version

```
Ognjen Mladenovic, Laura Martinez, Nicolas Rosen, Kalash Tirpude
Assignment 2
QUESTION 1
1.1.
π¦ = 7.84 + 2.85π₯+ + 2.28π₯, + 7.17π₯. + π
Where:
y is the dependent variable: net earnings in millions of dollars
x1 is the independent variable: production cost of the movie in millions of dollars
x2 is the independent variable: promotional cost of the movie in millions of dollars
x3 is the independent categorical variable: whether or not the movie is based on a book: 1 = based on
book 0 = not based on book
π is the standard error
1.2.
The model is useful: the adjusted R-square is 96% and the standard error of estimate is 3.69.
Overall, we can say that this model is very good. R-square measures how much of the variance in our
dependent variable is explained by our model (explained variance / total variance).
Multiple Regression for Earn
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.9832       0.9667     0.9605              3.689501
All three independent variables are statistically significant. We can justify that by using three
different tests: p-Value, t-Test and the 95% confidence interval.
Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           7.84          2.33             3.358     0.0040     2.89           12.78
cost               2.85          0.39             7.258     < 0.0001   2.02           3.68
Prom               2.28          0.25             8.989     < 0.0001   1.74           2.82
Book               7.17          1.82             3.942     0.0012     3.31           11.02
• An explanatory variable is statistically significant if its absolute t-Value is higher than 2.
  This condition is fulfilled for all three independent variables.
• An explanatory variable is statistically significant if its p-Value is lower than 0.05. This
  condition is fulfilled for all three independent variables.
• An explanatory variable is statistically significant if its 95% confidence interval does not
  contain 0. This condition is fulfilled for all three independent variables.
To construct a confidence interval at the 95% confidence level, we use our model to predict the
earnings of a movie that cost 7.5 million to produce, spent 5.5 million on promotion, and was
based on a book.
y = 7.84 + 2.85(7.5) + 2.28(5.5) + 7.17(1) = 48.93
Confidence Interval = 48.93 ± 3.69 × 2 = [41.55, 56.31]
As we can see, the confidence interval at the 95% confidence level ranges from 41.55 to 56.31.
This is a relatively narrow range, showing us that the model is quite good at predicting the
earnings of movies.
1.3.
π¦ = 7.84 + 2.85(0) + 2.28(0) + 7.17(0) = 7.84
The model predicts net earnings of 7.84 million for a movie that has no production or promotion
costs and is not based on a movie. However, one has to note that the y-intercept on its own, when all
IV’s are set to zero, should not be given any meaning as it is an unrealistic scenario, where the movie
doesn’t cost anything to be produced. Furthermore, this data point would be outside of the range of
observed data, which might change the relationship between the variables.
1.4.
For every additional US$ 1 million in production cost, net earnings go up by an estimated US$ 2.85
million, holding the other variables constant.
1.5.
If the underlying value of the coefficient of the PROM (promotions) variable is 1, that would mean
that an increase in promotion cost of US$ 1 million increases net earnings of the movie by US$ 1
million. Since it is net earnings that increase by 1 million for each million invested in promotion,
this would imply that promotions are effective, as the cost is already accounted for in the 1 million
increase in net earnings. If the dependent variable were revenue instead of net earnings, a
coefficient of 1 would mean that promotions are ineffective, as net earnings would not increase with
higher promotional cost.
1.6.
Based on book:      Net Earnings = 7.84 + 2.85(6) + 2.28(3) + 7.17(1) = 38.95
Not based on book:  Net Earnings = 7.84 + 2.85(6) + 2.28(3) + 7.17(0) = 31.78
According to our model, the estimated net earnings of a movie that costs $6m to produce, has a
promotion cost of $3m and is based on a book are $38.95 million, while for a movie with identical
costs that is not based on a book the estimated net earnings are $31.78 million.
In practical terms, the coefficient of the book variable tells us that on average a movie that is based
on a book will have $7.17 million higher net earnings than a movie with identical costs that is not
based on a book.
1.7.
H0: Residuals are normally distributed
Ha: Residuals are not normally distributed
We use the Lilliefors test to check whether we have to reject H0.
We fail to reject H0, because the test statistic (0.0927) is lower than the critical value at the 5%
significance level (0.1924). We do not use the Chi-square test, since our sample is too small (20).
Lilliefors Test Results      Residuals
Sample Size                  20
Sample Mean                  0.000
Sample Std Dev               3.386
Test Statistic               0.0927
CVal (15% Sig. Level)        0.1666
CVal (10% Sig. Level)        0.1760
CVal (5% Sig. Level)         0.1924
CVal (2.5% Sig. Level)       0.2053
CVal (1% Sig. Level)         0.2812
QUESTION 2
2.1.
After running three regression models, removing statistically insignificant independent variables one
by one (See model 1 and 2 in Appendix), we came up with the following regression equation:
π¦ = −0.859 + 0.005π₯+ − 0.130π₯, + π
Where:
y is the dependent variable: return in excess of risk free rate
x1 is the independent variable: Quality of education measured as average SAT at managers postgrad
university
x2 is the independent variable: Age of the fund manager
π is the standard error
Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           -0.859        2.187            -0.393    0.6945     -5.149         3.430
SAT                0.005         0.001            3.978     < 0.0001   0.003          0.008
Age                -0.130        0.038            -3.421    0.0006     -0.205         -0.056
2.2.
Model significance: In this third regression model all independent variables (SAT and Age) are
statistically significant, as:
• Their p-Values are below 0.05
• Their absolute t-Values are higher than 2
• Their confidence intervals don’t include 0
Multiple Regression for Return
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.12         0.015      0.013               8.31
Model quality:
• Adjusted R-square: 0.013
• Standard error of estimate: 8.31
Considering that the adjusted R-square of 0.013 is extremely low and the standard error of estimate
is very high at 8.31, the model quality is extremely low.
The standard error of estimate tells us that roughly 95% of the observations should fall within
plus/minus 16.62 of the fitted line. Considering that the average performance of all the analyzed
funds is 0.55, this band is far too wide for the predictions to be useful.
In order to calculate a confidence interval, we calculated the expected return using our model for a
manager who studied at a university with an average SAT of 1000, since 1000 is close to the mean of
our data, and who is 42 years old, since that is close to the average age of the managers in our data.
y = −0.859 + 0.005(1000) − 0.130(42) = −1.319
Confidence Interval = −1.319 ± 8.31 × 2 = [−17.94, 15.30]
For a fund for which the model predicts a -1.32% return, the confidence interval at the 95%
confidence level is between -17.94% and 15.30%. Clearly, this model is very weak at predicting a
fund’s performance.
2.3.
π¦ = −0.859 + 0.005 710 − 0.130 35 = −1.859
Our model predicts that the performance of a fund run by a 35-year-old MIAF (not included in model)
that has been with the fund for 4 years (not included in the model) and graduated with a SAT avg
score of 710 will have a return of 1.859 lower than the risk free rate. If our model was of good
quality, I would not invest into this fund as the model predicts that the fund will have a return of
1.859 lower than the risk free rate, however since the model has such low explanatory power and
such a high standard error of estimates, I would not pay attention to this model at all and would
instead try and find a better way of analyzing whether to invest into this fund or not.
QUESTION 3
3.1.
π¦ = −128.72 + 0.75π₯+ + 34.02π₯, − 86.68π₯. + π
Where:
y is the dependent variable: time to complete the job in hours
x1 is the independent variable: number of pieces in the job
x2 is the independent variable: number of operations per piece
x3 is the independent categorical variable: whether or not the order is a ‘rush’: 1 = ‘rush’ 0 = not a
‘rush’
π is the standard error
Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           -128.72       89.92            -1.43     0.1715     -319.34        61.90
PIECES             0.75          0.10             7.39      < 0.0001   0.53           0.96
OPS                34.02         7.08             4.81      0.0002     19.01          49.02
RUSH               -86.68        42.10            -2.06     0.0562     -175.92        2.57
The dummy variable RUSH should be removed from the model, as both the p-Value (not lower than
0.05) and the confidence interval (0 is included in the interval [-175.92, 2.57]) show that it is not
statistically significant.
Multiple Regression for TIME
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.92         0.84       0.81                88.91
Model Quality: The adjusted R-square of the model is good at 0.81, which means that 81% of the
variance in the dependent variable is explained by this model; however, the standard error of
estimate is quite high at 88.91.
The standard error of estimates tells us that 95% of the observations should fall within plus/minus
177.82 of the fitted line.
In order to calculate a confidence interval, we calculated the expected time using the given model
for a non-rush job with 250 pieces, since 250 is close to the mean of our data, and 10 operations per
piece, since that is close to the average number of operations per piece in our data.
y = −128.72 + 0.75(250) + 34.02(10) − 86.68(0) = 398.98
Confidence Interval = 398.98 ± 88.91 × 2 = [221.16, 576.80]
The given model predicts a time of 398.98 hours required to finish the order. At the 95% confidence
level the confidence interval is between 221.16 hours and 576.80 hours. Clearly, this model is very
weak at predicting the required time for a process at Erie Steel Ltd.
However, considering that the model includes a variable that is statistically insignificant, we should
first build a new model without the insignificant variable before assessing the model’s quality.
3.2.
H0: coefficient of RUSH = 0, the RUSH variable (IV) does not affect time (DV)
H1: coefficient of RUSH ≠ 0, the RUSH variable (IV) affects time (DV)
Roger’s claim holds at the 5% significance level. We can show this with the p-Value and the
confidence interval test. Our null hypothesis (H0) is that the RUSH variable has no effect on time.
We fail to reject it because the p-Value (0.056) is higher than 0.05 and the 95% confidence interval
[-175.92, 2.57] contains 0. Thus we reject Pete’s claim that ‘rush’ reduces the time by 50 hours on
average at the 5% significance level.
3.3. What regressions do you run?
First, a regression was run with TIME as the DV and PIECES, OPS, RUSH and the interaction effect
between PIECES and OPS (PIECES*OPS) as IVs. In this model (Model 1) all IVs except the interaction
effect between PIECES and OPS are statistically insignificant (see Appendix 3.3). So next, Model 2
was calculated without the IV RUSH. In this model PIECES and OPS were both shown to be statistically
insignificant (see Appendix 3.3). For Model 3 the IV PIECES was removed, and only OPS and the
interaction effect between OPS and PIECES were used as IVs. This model shows both OPS and the
interaction effect to be statistically significant, as their p-Values are below 0.05 (see Appendix 3.3).
Finally, Model 4 was created, in which the IVs were PIECES and the interaction effect between PIECES
and OPS. Again both variables were shown to be statistically significant, as their p-Values are below
0.05 (see Appendix 3.3). Since both Model 3 and Model 4 have only statistically significant IVs and
their model qualities are nearly identical, with an adjusted R-square of 0.96 and a standard error of
estimate close to 40 (see tables below), we cannot decide on this basis which model we prefer.
Model 3 Quality:
Multiple Regression for TIME
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.98         0.96       0.96                40.27
Model 4 Quality:
Multiple Regression for TIME
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.98         0.97       0.96                39.28
When checking the PIECES residual plot of Model 4, we can see that there is some kind of pattern,
which suggests that the relationship between PIECES and TIME is non-linear.
Thus we looked at the scatterplot of TIME against PIECES (see below), which also suggests that
the relationship may be non-linear.
Thus we transformed the variable PIECES by taking its square root and ran a 5th model with
TIME as the DV and the square root of PIECES and the interaction effect between PIECES and OPS as IVs.
In this 5th model, both IVs are statistically significant, as their p-Values are < 0.05 (see Appendix 3.3).
Model 5 Quality:
Multiple Regression for TIME
Summary    Multiple R   R-Square   Adjusted R-Square   Std. Err. of Estimate
           0.99         0.98       0.98                31.46
Looking at the model quality, we can see it is the best model, with an Adjusted R square of 0.98,
compared to model 3 and 4’s 0.96, and the standard error is reduced from 40.27 and 39.28
respectively to 31.46, which shows that the transformation of PIECES improved the model.
Model 5:
The equation for this model is:
π¦ = 209.24 + 0.15π₯+ − 12.76π₯, + π
Where:
y is the dependent variable: time to complete the job
x1 is the interaction between PIECES and OPS (PIECES*OPS)
x2 is the transformed independent variable PIECES: Square root of PIECES
π is the standard error
This new model with the interaction effect between OPS and PIECES included as an IV is of much
higher quality than the model in question 3.1. Not only did the adjusted R square increase from 0.81
to 0.98 but also the standard error of estimates dropped from 88.91 to 31.46.
The standard error of estimates tells us that 95% of the observations should fall within plus/minus
62.92 of the fitted line.
In order to calculate a confidence interval, we calculated the expected time using our model for the
given process (500 pieces, 7 operations per piece):
y = 209.24 + 0.15(500 × 7) − 12.76 × sqrt(500) = 448.92
Confidence Interval = 448.92 ± 31.46 × 2 = [386.00, 511.84]
The model predicts a time of 448.92 hours required to finish the order that Roger promised. At the
95% confidence level the confidence interval is between 386.00 hours and 511.84 hours. The time
that Roger requires (360 hours) does not lie within the 95% confidence interval, so at the 95%
confidence level our model indicates that the order cannot be completed in 360 hours.
No, Roger should not designate the order as rush, since the IV RUSH was found to be statistically
insignificant in our model.
QUESTION 4
4.1.
Explanatory variables are statistically significant if they fulfil the following conditions:
a) absolute t-Value > 2
b) p-Value < 0.05
• House Size: is statistically significant
  o Absolute t value (11.24) > 2
  o P value of 0 is < 0.05
• Lot Size: is statistically significant
  o Absolute t value (46.07) > 2
  o P value of 0 is < 0.05
• Neighborhood: is not statistically significant
  o Absolute t value (0.21) < 2
  o P value of 0.8308 is > 0.05
• Bedroom: is statistically significant
  o Absolute t value (6.85) > 2
  o P value of 0 is < 0.05
• House Size*Neighborhood: is statistically significant
  o Absolute t value (2.54) > 2
  o P value of 0.0115 is < 0.05
• Lot Size*Neighborhood: is not statistically significant
  o Absolute t value (0.89) < 2
  o P value of 0.3752 is > 0.05
4.2.
The coefficient for the number of bedrooms is 17.08, which means that for every additional bedroom
the price of the house increases on average by $17,080.
The standard error is calculated as the coefficient divided by the t-Value:
Standard Error = B / t-Value = 17.08 / 6.85 = 2.49
4.3.
πππππππ πππππ = −35.45 + 0.067 π»ππ’π π π ππ§π + 0.05 πΏππ‘ π ππ§π − 2.31 ππππβπππ’πβπππ
+ 17.08 # ππ ππππππππ  − 0.015 π»ππ’π π π ππ§π ππππβππππβπππ
+ 0.001(πΏππ‘ π ππ§π)(ππππβπππ’πβπππ) + π
According to this model, the selling price of a 3,000 square foot house, on a 10,000 square foot lot, in
neighborhood 1 with 4 bedrooms is: (Assuming that the dummy variable neighborhood is 1 when the
house is in neighborhood 1 and 0 if in neighborhood 2)
πππππππ πππππ = −35.45 + 0.067 3000 + 0.05 10000 − 2.31 1 + 17.08 4 − 0.015 3000 1
+ 0.001 10000 1 = 696.56
Thus the estimated selling price would be \$696,560.
However, some of the model’s variables were found to be statistically insignificant, so it would be
better to fit a model with only statistically significant variables before estimating.
4.4.
πΌππππππ π ππ πππππππ πππππ = 0.067 100 + 17.08 1 − 0.015 100 1 = 22.28
According to the model the selling price of the house would increase by \$22,280. As the addition cost
\$20,000 to build, the net profit is estimated at \$2,280.
QUESTION 5
5.1
Yes, Income is a significant predictor of sales as all 3 tests show that it is statistically
significant:
1. Absolute t-Value is higher than 2 (14.65)
2. P-Value is lower than 0.05 (0.00)
3. Confidence interval does not include 0 ([0.0045:0.0059])
The predictive power of the model seems to be acceptable, but not very high, with an adjusted
R-square of 0.71. Whether this is sufficient depends on the preferences of the user of the
model. The standard error of 121.24 seems quite high, as from the “Income ($) Line Fit Plot”
we can see that the observations of the DV Sales ($/Sq Ft) range approximately from 100
to 1100. A standard error of 121.24 gives us a range of ± 242.48, thus indicating that the
predictive power of the model is not particularly strong.
The equation of the model is:
y = 370.38 + 0.0052 x1 + ε
Where:
y is the dependent variable: Sales ($/Sq Ft)
x1 is the independent variable: median household income in the surrounding community
ε is the error term
5.2
The Line Fit Plot plots the actual (blue) and predicted (red) Sales for different levels of income. From
this graph we can see that while the predicted values clearly show a linear relationship between
Sales and Income, the actual values don’t show a linear relationship but rather a relationship where
the marginal increase in sales decreases as income increases. Furthermore, we can see that the
Residual plot shows a pattern, reinforcing our expectations from the first graph. If the relationship
was linear, the residual plot should not show any kind of pattern. To improve this model, the
independent variable income should be transformed by using the square root of income as
the independent variable instead, as shown in the graphic below.
After this transformation, the relationship between the IV income and the DV sales should be more
linear, which should lead to higher adjusted r square value and lower standard error of estimation,
thus giving the model higher predictive power. Furthermore, the Line Fit Plot of this new model
should show both actual and predicted values to follow a linear relationship, while the residual plot
should not show any pattern anymore.
5.3
She is not correct! When using dummy variables, in this case for the three categories Suburban,
Urban and Rural, the model includes one fewer dummy variable than there are categories, because the
omitted category is represented by setting the other two dummy variables to 0. So in this
case, when Suburban = 0 and Urban = 0, the store is considered to be in a rural area (1).
5.4
All variables are significant in this model. Every single variable has an absolute t-Value higher than
2 and, confirming this, a p-Value lower than 0.05. This model is much more useful than the previous
one. It has a higher adjusted R-square (0.91) than the previous model (0.71), which tells us that this
model can explain 91% of the variability of the dependent variable (Sales). Secondly, the standard
error of estimate dropped from 121.24 in the first model to 67.63 in the second model. Both values
indicate that the second model is much better than the first model.
Income: For each additional $1 of income, sales in dollars per square foot increase by 0.00496.
Population: For each additional 1000 people, sales in dollars per square foot increase by 116.22.
Suburban: If the retail store is located in a suburban area, sales in dollars per square foot increase
by 217.29.
Urban: If the retail store is located in an urban area, sales in dollars per square foot increase by 86.78.
5.5
πππππ  = 213.31 + 0.00496(πΌπππππ) + 116.22(ππππ’πππ‘πππ[ππ 000π ])
+ 217.29(ππ’ππ’ππππ π·π’πππ¦) + 86.78(πππππ π·π’πππ¦)
πππππ  = 213.31 + 0.00496 35,000 + 116.22 0.5 + 217.29 0 + 86.78 0 = πππ. ππ
The estimated sales are \$445.02 per square foot of store space. Since the store is 1000 square foot,
the estimated sales for this store are: \$445,020
Appendix
Appendix Q 2.1:
Model 1:
Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           -1.15         2.20             -0.52     0.6012     -5.453         3.158
SAT                0.01          0.00             3.96      < 0.0001   0.003          0.008
Finance degree     0.67          0.38             1.79      0.0730     -0.063         1.412
Age                -0.14         0.04             -3.31     0.0009     -0.224         -0.057
Tenure             0.08          0.18             0.47      0.6412     -0.262         0.426
Model 1 includes all independent variables (SAT, Finance degree, Age and Tenure); however, we can
see that the variables Finance degree and Tenure are statistically insignificant, as their p-Values
are not below 0.05.
Model 2:
Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           -1.18         2.19             -0.54     0.5897     -5.485         3.119
SAT                0.01          0.00             3.97      < 0.0001   0.003          0.008
Finance degree     0.67          0.38             1.78      0.0748     -0.067         1.407
Age                -0.13         0.04             -3.46     0.0005     -0.207         -0.057
For Model 2 the variable Tenure, which was found to be statistically insignificant in Model 1, has
been removed from the regression. However, Finance degree is still statistically insignificant in this
new model, as its p-Value is not below 0.05; thus for Model 3 (see main part Q2.1) it has been
removed from the model.
Appendix Q 3.3:
Model 1
Regression Table                     Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                             77.98         44.82            1.74      0.102      -17.56         173.51
PIECES                               -0.15         0.11             -1.34     0.199      -0.39          0.09
OPS                                  7.09          4.31             1.64      0.121      -2.10          16.28
RUSH                                 -25.29        19.12            -1.32     0.206      -66.05         15.47
total machining time (PIECES*OPS)    0.11          0.01             8.66      < 0.0001   0.09           0.14
Model 2
Regression Table                     Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                             72.70         45.68            1.59      0.131      -24.13         169.53
PIECES                               -0.18         0.11             -1.65     0.118      -0.42          0.05
OPS                                  5.78          4.29             1.35      0.197      -3.32          14.89
total machining time (PIECES*OPS)    0.12          0.01             9.63      < 0.0001   0.09           0.15
Model 3
Regression Table                     Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                             18.04         33.04            0.55      0.592      -51.66         87.75
OPS                                  11.02         3.04             3.63      0.002      4.61           17.43
total machining time (PIECES*OPS)    0.10          0.00             20.88     < 0.0001   0.09           0.11
Model 4
Regression Table                     Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                             131.68        13.31            9.90      < 0.0001   103.61         159.75
PIECES                               -0.30         0.08             -3.83     0.001      -0.46          -0.13
total machining time (PIECES*OPS)    0.13          0.01             14.55     < 0.0001   0.11           0.15
Model 5
Regression Table                     Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant                             209.24        17.89            11.69     < 0.0001   171.49         246.99
total machining time (PIECES*OPS)    0.15          0.01             17.39     < 0.0001   0.13           0.16
Sqrt(Pieces)                         -12.76        2.24             -5.69     < 0.0001   -17.49         -8.03
```
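
The point predictions and the "prediction ± 2 × standard error of estimate" intervals used throughout the assignment can be checked with a short script. The following is a minimal sketch, not the authors' actual workflow: the helper names `predict` and `approx_interval` are hypothetical, the coefficients are the rounded values reported in the regression tables, and the ± 2 × standard error rule is the approximation applied in the text rather than an exact prediction interval.

```python
import math


def predict(intercept, terms):
    """Point prediction: intercept plus the sum of coefficient * value pairs."""
    return intercept + sum(coef * value for coef, value in terms)


def approx_interval(y_hat, std_err_of_estimate, multiplier=2.0):
    """Rough 95% band as used in the text: prediction +/- 2 * std. error of estimate."""
    half_width = multiplier * std_err_of_estimate
    return y_hat - half_width, y_hat + half_width


# Question 1.2: movie costing 7.5m to produce and 5.5m to promote, based on a book.
movie = predict(7.84, [(2.85, 7.5), (2.28, 5.5), (7.17, 1)])
movie_low, movie_high = approx_interval(movie, 3.69)

# Question 3.3, Model 5: 500 pieces, 7 operations per piece.
# x1 is the interaction PIECES*OPS, x2 is the square root of PIECES.
job = predict(209.24, [(0.15, 500 * 7), (-12.76, math.sqrt(500))])
job_low, job_high = approx_interval(job, 31.46)

# Question 4.2: standard error recovered as coefficient / t-value.
bedroom_std_err = 17.08 / 6.85

print(f"movie earnings: {movie:.2f}  interval: [{movie_low:.2f}, {movie_high:.2f}]")
print(f"job time:       {job:.2f}  interval: [{job_low:.2f}, {job_high:.2f}]")
print(f"bedroom coefficient std. error: {bedroom_std_err:.2f}")
```

Because the inputs are rounded coefficients, the results can differ from the reported figures in the last decimal place.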