MBA PROGRAMME
INTERNAL USE ONLY
UNCERTAINTY, DATA & JUDGMENT
It quantifies the spread or dispersion of residuals, helping you understand how well the model's predictions align
with the actual data.
A smaller standard deviation indicates a better fit, as it implies that residuals are generally closer to the mean, and
the model has less error in its predictions.
Standard deviation is crucial for assessing the reliability of the model's predictions, especially in the context of
applications where prediction accuracy is critical.
EXTRA EXERCISES
SET 6 - SOLUTIONS
Adjusted R-Squared (R²):
Importance: R-squared (R²) measures the proportion of the variance in the dependent variable that is explained by
the independent variables in the model. Adjusted R-squared takes this one step further by penalizing the addition
of unnecessary independent variables to the model.
Significance:
Adjusted R-squared helps to account for the number of independent variables in the model. It adjusts R-squared
downward when irrelevant variables are added, which prevents overfitting.
It provides a better indication of the model's goodness of fit because it considers both explained and unexplained
variance while adjusting for the number of predictors.
A higher adjusted R-squared indicates a better fit, but it also encourages parsimonious models with fewer variables.
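As a minimal sketch of the penalty described above, the standard adjustment can be written as 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The function name and the input values below are illustrative assumptions and are not taken from the exhibits.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (n > k + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs: the same R-squared reached with 3 and with 10 predictors.
few_predictors = adjusted_r_squared(0.90, n=50, k=3)
many_predictors = adjusted_r_squared(0.90, n=50, k=10)
print(round(few_predictors, 4), round(many_predictors, 4))
```

With the same R², the model that uses more predictors gets the lower adjusted value, which is the sense in which the measure favours parsimonious models.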
INSEAD MBA Programme
This document is authorised for use only in MBA - Uncertainty, Data & Judgment (002307)
at INSEAD - Aug 2023 - Feb 2024 – by Professor(s) Spyros Zoumpoulis. Copying, printing
or posting is a copyright infringement.
Regression
1. Food a la carte
Food a la carte, a leader in the French restaurant market, is investigating opportunities for opening
a new restaurant in town. Competition is very high, the market shares are shrinking. Before
deciding whether or not to go into business, Ms. Croquette, operations manager for Food a la
carte, would like to understand which factors make a new restaurant successful.
Hence, Ms. Croquette decides to collect data on several relevant variables that may have an
impact on the profitability of a new restaurant in town:
1. Total profit from operations in Thousands of Euros. (PROFIT)
2. Total area of the store in m2. (SIZE)
3. Number of employees employed by the store. (EMPL)
4. Total population in 3km radius around site. (TOTAL)
5. Average income in town in Thousands of Euros. (INC)
6. Number of competitors in a 1km radius around site. (COMP)
7. Number of restaurants that do not compete directly with Food a la carte. (NCOMP)
8. Number of non-restaurant businesses in 1km radius around site. (NREST)
9. Cost of rent per square meter in Euros. (PRICE)
10. Cost of living index. (CLI)
To begin with, she collects 50 observations for the entire set of variables and starts building a
model to predict total profit (PROFIT).
a)
What can you infer from the Matrix of Simple Correlation (Exhibit 1)?
NCOMP and INC show multicollinearity, and so do SIZE and EMPL, as their
correlation coefficients are higher than 0.7 (in absolute value).
b)
What can you infer from the regression analysis in Exhibit 2?
Both SIZE and EMPL are significant, so we keep them in the model. NCOMP is non-significant
and multicollinear with INC, so it should be the first variable taken out of the model. We
should run the regression again without NCOMP, and then take non-significant variables out
of the model one by one using “backward elimination.”
c)
Ms Croquette then prepares several different models. Which model would you select among
MODELS 1 to 6 in Exhibit 3?
Explain your reasoning. Please be precise and concise.
We select the model with the highest explanatory power. Model 4 has 6 significant variables
and R2 = 0.973, Adj R2 = 0.971, Std.deviation of regression = 51.8.
With respect to these measures model 4 is preferred to models 3 and 2.
Models 1, 5 and 6 have non-significant variables.
d)
An external consultant, Mr. Gourmet, has proposed his best model to predict PROFIT. Exhibit
4 refers to his best model. From studying Exhibits 4(a) – 4(f), what can you conclude about
the assumptions for regression? How would you correct for problems, if any? Do you need to
make any assumptions? Motivate your answers by indicating the appropriate exhibit. Please be
precise and concise.
Exhibit 4(a) Residuals vs Observation Number: to check if the errors are not autocorrelated.
We can clearly see that the errors are not random and conclude that they are autocorrelated.
This might be due to:
- non-linearity between the dependent variable and an independent variable,
- missing one or more independent variables in the model.
Exhibit 4(b) Residuals vs Predicted: to check homoscedasticity. We can see that the
dispersion of the errors is not constant and conclude that there is a problem of
heteroscedasticity. This might be due to non-linearity between the dependent variable and an
independent variable.
Exhibit 4(c) Durbin Watson test: to verify if the errors are random or autocorrelated. We
assume that the data has been ordered (otherwise the test is not valid). The calculated Durbin
Watson statistic falls in the rejection region, so we can conclude that the errors are autocorrelated.
The reasons for that might be non-linearity between the dependent variable and an independent
variable, and/or an important independent variable missing from the model.
Exhibits 4(d) and 4(e) are plots of the dependent variable vs an independent variable to check
linearity. In 4(d) we see that the relationship between Profit and Size is linear. In 4(e) we see
that the relationship between Profit and NREST is not linear. That may cause both
autocorrelation and heteroscedasticity detected in 4(a) and 4(b). To correct for that, we should
transform the variable NREST.
Exhibit 4(f) is the histogram of the Residuals to verify if the errors are normally distributed.
We can accept this assumption.
e)
Based on MODEL 2, estimate the impact on PROFIT of one unit increase in SIZE.
Give a point estimate and a 99% confidence interval.
By increasing SIZE by 1 unit, the PROFIT increases, on average, by 4.52 units, keeping all
the rest constant.
A 99% CI for the regression coefficient for SIZE is
4.52 ± Z0.005 × 0.27, where Z0.005 = 2.57.
3.83 ≤ BSIZE ≤ 5.21
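The interval above can be reproduced with a short sketch. The function name is an illustrative assumption; the coefficient 4.52, the standard error 0.27 and the Z value 2.57 are the figures quoted in the solution.

```python
def coefficient_interval(b: float, se: float, z: float) -> tuple[float, float]:
    """Normal-approximation confidence interval b +/- z * se for a regression coefficient."""
    margin = z * se
    return b - margin, b + margin


# Values quoted in the solution for SIZE in MODEL 2.
low, high = coefficient_interval(b=4.52, se=0.27, z=2.57)
print(f"{low:.2f} <= B_SIZE <= {high:.2f}")
```

Using Z instead of t is the same approximation the solution makes; with few degrees of freedom the t value would give a slightly wider interval.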
f) Based on MODEL 2, provide a 95% prediction interval for PROFIT.
The following values for the independent variables are given:
SIZE=100, EMPL=20, PRICE=50.
The best point estimate for the prediction of PROFIT is
164.44 + 4.52(100) - 7.57(20) + 22.18(50) = 1574.04
An approximate 95% prediction interval is 1574.04 ± 2(94.66).
1385 ≤ PROFIT ≤ 1763
2. Internet Users
A lot of business nowadays involves advertising and direct sales via internet. To predict the number
of internet users, the following data were collected for the year 2000:
Variable | Unit | Description
GDP per Capita | One unit is one US $ per Capita | Gross Domestic Product per Capita, in constant US $
Personal Computers | One unit is one Computer per 1,000 people | Number of Personal Computers per 1,000 people
Mobile Phones | One unit is one Mobile Phone per 1,000 people | Number of Mobile Phones per 1,000 people
Television Sets | One unit is one Television Set per 1,000 people | Number of Television Sets per 1,000 people
Electric Power per Capita | One unit is Kwh per Capita | Electric Power Consumption per capita, in Kwh (kilowatt-hours)
Internet Users | One unit is one Internet User per 1,000 people | Number of Internet Users per 1,000 people
The data were collected for all countries with GDP per Capita exceeding 1,000 US $, and ordered by
GDP per Capita. In all regression models, Internet Users is the dependent variable.
a. What can you infer from the correlation matrix of the variables (Exhibit 1)?
Several pairs of independent variables have a correlation coefficient greater than 0.7
in absolute value, which signals a risk of multicollinearity between:
GDP per Capita and Personal Computers ( ρ = 0.8919 )
GDP per Capita and Television Sets ( ρ = 0.7036 )
GDP per Capita and Mobile Phones ( ρ = 0.8309 )
GDP per Capita and Electric Power per Capita ( ρ = 0.7902 )
Television Sets and Personal Computers ( ρ = 0.7056 )
Mobile Phones and Personal Computers ( ρ = 0.7967 )
Electric Power per Capita and Personal Computers ( ρ = 0.7290 )
b. From Regression Models 1-5. (Exhibit 2), which model is the best? Please justify your answer.
The best model is Regression Model 4. All variables are significant. There is a risk of
multicollinearity between Mobile Phones and Personal Computers and between Electric Power
per Capita and Personal Computers, but the signs of the regression coefficients are positive, which
makes sense. It has the highest adjusted R-squared (0.8471).
c. In Regression Model 4 three important statistics are missing for the intercept:
t-stat = (a - 0) / SEa = -11.9771 / 8.6771 = -1.38
P-value = 2*Prob(t > 1.38)
We can approximate by a Z value:
P-value = 2*Prob(Z > 1.38) = 2*0.0838 = 0.1676
and significance at 0.05 level:
P-value > 0.05, so coefficient a is not significantly different from 0.
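The same calculation can be sketched in a few lines, using the normal approximation made in the solution. The function name is an illustrative assumption. The intercept and standard error are the values used in the t-stat above, and the two-sided tail probability is obtained from the complementary error function in the standard library.

```python
import math


def two_sided_p_value_normal(statistic: float) -> float:
    """Two-sided p-value 2 * P(Z > |statistic|) under a standard normal approximation."""
    return math.erfc(abs(statistic) / math.sqrt(2.0))


intercept, se_intercept = -11.9771, 8.6771
t_stat = intercept / se_intercept  # tests H0: intercept = 0
p_value = two_sided_p_value_normal(t_stat)
significant_at_5_percent = p_value < 0.05
print(round(t_stat, 2), round(p_value, 4), significant_at_5_percent)
```

The p-value may differ from the table-based figure in the last decimal because the sketch does not round the statistic to 1.38 before looking up the tail probability.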
d. Exhibit 3 shows the Analysis of the Residuals (Durbin-Watson test, Residuals vs. Predicted
values and Histogram of the residuals) for Regression Model 4.
Are the regression assumptions satisfied? If not, what could be the reason and what would you
do to improve the model?
The Durbin-Watson statistic is equal to 2, which falls in region A, so we cannot reject the null
hypothesis that the errors are random (not autocorrelated).
The plot Residuals vs Predicted is to check if the errors are homoscedastic: they fit within 2
horizontal parallels, so they have a constant dispersion; this assumption is satisfied.
The histogram of the residuals is to check if the errors are normally distributed with a mean
equal to zero. This assumption is roughly satisfied.
e. Interpret the regression coefficient corresponding to the independent variable “Personal
Computers” in Regression Model 4
Coefficient b for “Personal Computers” is 0.3965
If the number of personal computers per 1000 people increases by 1, the number of internet
users per 1000 people increases on average by 0.3965, assuming that the number of mobile
phones and the electric power per capita do not change.
95% confidence interval for this coefficient:
b ± t(α/2, n-k-1) × SEb ≈ b ± Z(α/2) × SEb = 0.3965 ± 1.96 × 0.0658 ≈ 0.40 ± 0.13 = [0.27 ; 0.53]
f. Use Regression Model 4 to compute a 95% prediction interval for the number of internet users
per 1,000 people in Singapore.
The data for Singapore is as follows:
GDP per Capita: 22,767
Personal Computers: 483
Mobile Phones: 684
Television Sets: 304
Electric Power per Capita: 6,889
The point estimate is Ŷf = -11.97 + 0.3965*483 + 0.1562*684 + 0.0087*6889 = 346.31.
The approximate formula for a 95% prediction interval is
Ŷf ± Z0.025 × Stdev Reg ≈ 346 ± 2*55.75 = 346 ± 111.5 = [234.5 ; 457.5]
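A minimal sketch of this computation follows. The function and variable names are illustrative assumptions. The coefficients, the Singapore values and the standard deviation of the regression (55.75) are those quoted in the solution. GDP per Capita and Television Sets are left out because they do not appear in the point estimate for Regression Model 4.

```python
def predict(intercept: float, coefficients: dict, values: dict) -> float:
    """Point prediction: intercept plus the sum of coefficient * value for each variable."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


def approx_prediction_interval(point: float, stdev_reg: float, z: float = 2.0):
    """Approximate prediction interval: point +/- z * standard deviation of the regression."""
    return point - z * stdev_reg, point + z * stdev_reg


model_4 = {
    "Personal Computers": 0.3965,
    "Mobile Phones": 0.1562,
    "Electric Power per Capita": 0.0087,
}
singapore = {
    "Personal Computers": 483,
    "Mobile Phones": 684,
    "Electric Power per Capita": 6889,
}

point = predict(-11.97, model_4, singapore)
low, high = approx_prediction_interval(point, stdev_reg=55.75)
print(round(point, 2), round(low, 1), round(high, 1))
```

The sketch keeps the unrounded point estimate, so its bounds sit about 0.3 above those obtained when the estimate is first rounded to 346.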
3. TechProducts Sales
Bob Smart is the CEO of TechProducts, a manufacturer and distributor of high tech products,
before they become commodities. TechProducts has signed a number of strategic alliances with
two big high tech firms to license their products, as they are being commoditized. Consequently,
TechProducts is producing and distributing cheaper versions of such products for the medium
and low ends of the market. Bob is concerned about the sales of external memory cards (used in
cameras, PDAs, and hand held computers) that his company is producing. These cards account
for over 25% of its revenues and about 1/3 of its profits. There are several questions that Bob is
not sure about. For instance is it more beneficial to advertise his memory cards in trade
magazines, or spend more money on promotions? In addition, he is not sure about the effect of
price increases/decreases on sales, or the influence of advertising and promotions done by
competitors. To improve his insights concerning these and similar questions, he asked his
assistant, John Timber, to collect as much data as possible and run regressions (remembering
from his days at INSEAD that regression could provide useful information). Bob hopes that this
will clarify his concerns, and help him make more intelligent decisions.
The monthly data John has collected consists of Sales, the dependent variable, and six
independent ones. These are described briefly below:
1. Total monthly sales of memory cards, minus returns, in Thousands of Boxes (each box
contains six memory cards). (SALES) The capacity of memory cards varied from 128K
to 1000K, and with it the price.
2. Total monthly budget in Thousands of Dollars spent on advertising, mostly in trade
journals. (ADV)
3. Total monthly budget in Thousands of Dollars spent on encouraging distributors to
promote TechProducts memory cards by displaying them in prominent places in their
stores, or by selling them cheaper. (PROMOT)
4. Average monthly price of the memory cards shipped during the month, in Dollars.
(PRICE)
5. Total monthly advertising budget spent by TechProducts’ competitors (also mainly used
in trade magazines). (COMP.ADV)
6. Total monthly promotional budget spent by TechProducts’ competitors. Unlike for
competitive advertising, the estimates of promotional spending are not as reliable,
which reduces the trustworthiness of these numbers. (COMP.PROMOT)
7. Occasionally, TechProducts would find itself with a high inventory of memory cards, or
cards of lesser memory capacity than those demanded in the market. In such cases,
TechProducts provides the extra/unwanted cards to big discounters that sell them at
reduced prices ranging between 20% and 40%. The result is that the cards are sold, but
at reduced profit margins that cover costs and a small part of the fixed expenses. During
the months that such deals are provided to Discounters, this independent variable takes
the value 1; otherwise its value is zero. (DISCOUNTERS)
Please answer the following questions in a precise but brief and concise manner by consulting Exhibits
1, 2 and 3.
Question 1 (please refer to Exhibit 1):
(a) Are there any possible problems that you should be aware of by studying Exhibit 1?
Yes, there is the possible risk of multicollinearity as the correlation between “PRICE” and
“DISCOUNTERS” is high in absolute value (i.e., -0.8089) and can create problems.
(b) Which variable exhibits a stronger relationship with SALES: ADV or PROMOT?
The strongest relationship is between “SALES” and “ADV” as the correlation between the two
is 0.5968, much bigger than that between “SALES” and “PROMOT” which is only 0.0433.
(c) What does the correlation coefficient of -0.8089 between PRICE and DISCOUNTERS
indicate?
It indicates that on a scale from 0 to -1 its value is -0.8089. This is close to -1 and points
to a strong negative relationship, i.e., when DISCOUNTERS is equal to 1, PRICE tends to be lower.
(d) What does the correlation coefficient of -0.2704 between ADV and PROMOT indicate?
The correlation of -0.2704 between the two independent variables “ADV” and “PROMOT” is
not as strong as in (c) above and means that as one increases the other decreases, and vice
versa.
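The screening of correlations between independent variables in Question 1 can be sketched as follows. The 0.7 threshold (in absolute value) is the rule of thumb used in the earlier exercises of this set. The function name is an illustrative assumption, and only the two correlations between independent variables quoted above are included.

```python
def flag_collinear_pairs(correlations: dict, threshold: float = 0.7) -> list:
    """Return the variable pairs whose correlation exceeds the threshold in absolute value."""
    return [pair for pair, r in correlations.items() if abs(r) > threshold]


# Correlations between independent variables quoted in the answers above.
quoted = {
    ("PRICE", "DISCOUNTERS"): -0.8089,
    ("ADV", "PROMOT"): -0.2704,
}
flagged = flag_collinear_pairs(quoted)
print(flagged)
```

Taking the absolute value matters here: the PRICE and DISCOUNTERS correlation is negative, yet it is the one that signals a multicollinearity risk.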
Question 2 (please refer to Exhibit 2):
(a) In your view, which is the best Regression Run from the six listed in Exhibit 2? What evidence
can you use to justify your answer (please refer to all evidence)?
The best Regression Run among those listed in Exhibit 2 is Regression Run 2.
The reasons are:
(i) all the t-tests corresponding to the independent variables are significant, i.e. greater (in
absolute value) than about 1.96, or equivalently the p-values are smaller than 0.05.
(ii) the Adjusted R2 of this run is 0.806, the largest of all other Regression Runs whose t-tests
indicate that the coefficients of all independent variables are statistically significant.
(iii) The standard deviation of the regression is 62.11, the smallest of all other Regression Runs
whose t-tests indicate that the coefficients of all independent variables are statistically
significant.
(b) Write down the regression equation you chose in part (a) above, and explain the precise
meaning of the regression coefficients a and bi?
The Regression Run 2 is:
SALES = 639.04 + 3.01ADV + 4.70 PROMOT –4.91PRICE -0.17 COMP.ADV
+ 104.88 DISCOUNTERS
The meaning of the regression coefficients is the following:
a = 639.04: This is the constant term (intercept), it means that if all the independent variables
are equal to zero, the value of SALES would be 639.04 on average.
b1 = 3.01: It tells us that if ADV increases by one unit SALES would increase by 3.01 units on
average, keeping all other variables constant.
b2 = 4.70: It tells us that if PROMOT increases by one unit SALES would increase by 4.70
units on average, keeping all other variables constant.
b3 = -4.91: It tells us that if PRICE increases by one unit SALES would decrease by 4.91 units
on average, keeping all other variables constant.
b4 = -0.17: It tells us that when COMP.ADV increases by one unit SALES would decrease by
0.17 units on average, keeping all other variables constant.
b5 = 104.88: It tells us that during the months that there are sales to DISCOUNTERS, SALES
increase by 104.88 units on average, keeping all other variables constant.
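To illustrate the "keeping all other variables constant" reading, the sketch below evaluates the Run 2 equation twice, changing only the DISCOUNTERS dummy. The coefficients are those of the equation above. The function name and the input values for ADV, PROMOT, PRICE and COMP.ADV are hypothetical and chosen only for illustration.

```python
RUN_2 = {
    "intercept": 639.04,
    "ADV": 3.01,
    "PROMOT": 4.70,
    "PRICE": -4.91,
    "COMP.ADV": -0.17,
    "DISCOUNTERS": 104.88,
}


def predict_sales(inputs: dict) -> float:
    """Evaluate the Run 2 regression equation for one month of inputs."""
    return RUN_2["intercept"] + sum(RUN_2[name] * value for name, value in inputs.items())


# Hypothetical month, identical in every respect except the DISCOUNTERS dummy.
base_month = {"ADV": 100, "PROMOT": 50, "PRICE": 80, "COMP.ADV": 200, "DISCOUNTERS": 0}
discount_month = {**base_month, "DISCOUNTERS": 1}

difference = predict_sales(discount_month) - predict_sales(base_month)
print(round(difference, 2))
```

Whatever values are chosen for the other variables, the difference between the two predictions equals b5, which is what the interpretation of a dummy-variable coefficient states.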
(c) In Run 4, the regression coefficient for ADV is 4.47, while that of PROMOT is 4.23.
Can the marketing manager conclude that the impact of advertising on SALES is greater than
that of promotion?
No conclusion can be drawn regarding regression coefficients because one of the independent
variables in the model is non-significant.
Thus, the regression coefficients indicate the most likely value of ADV and PROMOT. These
coefficients, however, have a range of values that can be found computing say, a 95%
confidence interval. Such intervals for Regression Run 4 are:
For “ADV”: 4.47 ± 1.96(0.74), i.e. 3.02 ≤ ADV ≤ 5.92
For “PROMOT”: 4.23 ± 1.96(1.38), i.e. 1.53 ≤ PROMOT ≤ 6.93
Although the most likely values for the regression coefficients indicate that ADV has a higher
impact than PROMOT, the two 95% confidence intervals overlap, indicating that
the higher impact of ADV on SALES may be due to chance.
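The overlap argument can be checked with a small sketch. The function names are illustrative assumptions; the coefficients and standard errors are those quoted for Regression Run 4.

```python
def confidence_interval(b: float, se: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval b +/- z * se."""
    return b - z * se, b + z * se


def intervals_overlap(first: tuple, second: tuple) -> bool:
    """True when two closed intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


adv = confidence_interval(4.47, 0.74)
promot = confidence_interval(4.23, 1.38)
print(adv, promot, intervals_overlap(adv, promot))
```

Comparing intervals this way mirrors the informal reasoning in the answer; it is not a formal test of the difference between the two coefficients.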
(d) In Run 3, the regression coefficient for ADV is 3.61, while that of PROMOT is 4.47.
In Run 2, the regression coefficient for ADV is 3.01, while that of PROMOT is 4.70.
How can you explain the difference in the values of these regression coefficients between Runs
3 and 2?
The regression coefficients tell us the impact of a specific independent variable on the
dependent if the influence of all the others is kept constant. Thus, the difference in the
regression coefficient of “ADV” between Runs 2 and 3 is explained by the fact that there are
different independent variables in each run. Specifically, Run 2 has an extra variable,
DISCOUNTERS. This variable has positive correlation with ADV and negative correlation
with PROMOT, explaining why the coefficients of the corresponding variables move up,
respectively down as DISCOUNTERS is taken out of the model.
(e) Construct a 99% confidence interval for the values of a and b in Run 6.
for a is: 2416.92 ± 2.58(229.61), i.e. 1824.53 ≤ A ≤ 3009.31
(because d.f. = n-k-1 = 36 > 29 we can approximate t with Z)
for b is: -9.39 ± 2.58(1.65), i.e. -13.65 ≤ B ≤ -5.13
(f) In Regression Run 3, test the hypotheses that the value of the regression coefficient
Bprice = -10, versus the alternative that it is different than -10.
H0: Bprice = -10
HA: Bprice ≠ -10
Zobs = (b - B) / SEb = (-8.19 - (-10)) / 1.09 = 1.66
Zα/2 = 1.96
-Zα/2 ≤ Zobs ≤ Zα/2, so we cannot reject H0
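A short sketch of this critical-value test follows. The function name is an illustrative assumption; the estimate -8.19, the standard error 1.09 and the critical value 1.96 are the figures used in the solution.

```python
def z_statistic(estimate: float, hypothesized: float, se: float) -> float:
    """Standardized distance between the estimated and the hypothesized coefficient."""
    return (estimate - hypothesized) / se


z_obs = z_statistic(estimate=-8.19, hypothesized=-10.0, se=1.09)
z_critical = 1.96  # two-sided test at the 5% level
reject_h0 = abs(z_obs) > z_critical
print(round(z_obs, 2), reject_h0)
```

The hypothesized value is subtracted before dividing by the standard error, so the test is against -10 and not against zero as in the default t-stat reported by regression output.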
Question 3 (please refer to Exhibit 3):
By relating each specific part of Exhibit 3 to the various assumptions of regression, explain if such
assumptions are or are not satisfied. If necessary, specify what other information you may want to
seek to answer this question.
Exhibit 3(a) - Durbin Watson test: to check if the errors are random or autocorrelated. Data should
be ordered (otherwise the test is not valid).
Durbin Watson test calculated falls in region B so we cannot conclude if the errors are
autocorrelated or not.
Exhibit 3(b) – Residuals vs Predicted: to check if the errors are homoscedastic. The dots fit
approximately between 2 parallels so we can conclude that this assumption is satisfied.
Exhibit 3(c) – Histogram of the errors: to check if the errors are normally distributed. The
histogram approximately fits the theoretical normal curve, so we can conclude that this assumption
is satisfied.
Question 4
After having studied the various Regression Runs and having answered the questions above, what
is your best advice for Bob?
Is it more beneficial to advertise his memory cards in trade
magazines or spend more money on promotions? Please be brief and precise.
Regression Run Number 2 is the most appropriate of those given in Exhibit 2. The regression
coefficient for “ADV” is 3.01 while that for “PROMOT” is 4.70. This indicates that the influence
of PROMOT on SALES is greater than that of ADV. At the same time, however, the 95%
confidence interval for ADV goes from 1.52 to 4.5 while that for PROMOT goes from 2.64 to 6.76.
Since the two intervals overlap, our advice to Bob is that he cannot be sure that PROMOT is more
beneficial than ADV. If he wants to be more confident about the impact of PROMOT vs. ADV, he
should collect more data and re-run the regressions to re-estimate the coefficients and reduce the
standard errors of these coefficients.