MBA PROGRAMME
INTERNAL USE ONLY

UNCERTAINTY, DATA & JUDGMENT
EXTRA EXERCISES SET 6 - SOLUTIONS

INSEAD MBA Programme

This document is authorised for use only in MBA - Uncertainty, Data & Judgment (002307) at INSEAD - Aug 2023 - Feb 2024 – by Professor(s) Spyros Zoumpoulis. Copying, printing or posting is a copyright infringement.

The standard deviation quantifies the spread or dispersion of the residuals, helping you understand how well the model's predictions align with the actual data. A smaller standard deviation indicates a better fit: the residuals are generally closer to their mean of zero, so the model has less error in its predictions. Standard deviation is crucial for assessing the reliability of the model's predictions, especially in applications where prediction accuracy is critical.

Adjusted R-Squared (R²):
Importance: R-squared (R²) measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. Adjusted R-squared takes this one step further by penalizing the addition of unnecessary independent variables to the model.
Significance: Adjusted R-squared accounts for the number of independent variables in the model. It adjusts R-squared downward when irrelevant variables are added, which discourages overfitting. It gives a better indication of the model's goodness of fit because it considers both explained and unexplained variance while adjusting for the number of predictors. A higher adjusted R-squared indicates a better fit, and it also encourages parsimonious models with fewer variables.

Regression

1. Food a la carte

Food a la carte, a leader in the French restaurant market, is investigating opportunities for opening a new restaurant in town. Competition is very high and market shares are shrinking. Before deciding whether or not to go into business, Ms. Croquette, operations manager for Food a la carte, would like to understand which factors make a new restaurant successful.
To this end, Ms. Croquette decides to collect data on several relevant variables that may have an impact on the profitability of a new restaurant in town:

1. Total profit from operations in Thousands of Euros. (PROFIT)
2. Total area of the store in m². (SIZE)
3. Number of employees employed by the store. (EMPL)
4. Total population in 3km radius around site. (TOTAL)
5. Average income in town in Thousands of Euros. (INC)
6. Number of competitors in a 1km radius around site. (COMP)
7. Number of restaurants that do not compete directly with Food a la carte. (NCOMP)
8. Number of non-restaurant businesses in 1km radius around site. (NREST)
9. Cost of rent per square meter in Euros. (PRICE)
10. Cost of living index. (CLI)

To begin with, she collects 50 observations for the entire set of variables and starts building a model to predict total profit (PROFIT).

a) What can you infer from the Matrix of Simple Correlation (Exhibit 1)?

NCOMP and INC show multicollinearity, and so do SIZE and EMPL, as their correlation coefficients are higher than 0.7 (in absolute value).

b) What can you infer from the regression analysis in Exhibit 2?

Both SIZE and EMPL are significant, so we keep them in the model. NCOMP is non-significant and multicollinear with INC, so it should be the first variable taken out of the model. We should run the regression again without NCOMP, and then take the remaining non-significant variables out of the model one by one using “backward elimination.”

c) Ms. Croquette then prepares several different models. Which model would you select among MODELS 1 to 6 in Exhibit 3? Explain your reasoning. Please be precise and concise.

We select the model with the highest explanatory power.
Model 4 has 6 significant variables, R² = 0.973, Adj. R² = 0.971 and a standard deviation of regression of 51.8. With respect to these measures, Model 4 is preferred to Models 3 and 2. Models 1, 5 and 6 have non-significant variables.

d) An external consultant, Mr. Gourmet, has proposed his best model to predict PROFIT. Exhibit 4 refers to his best model. From studying Exhibits 4(a) to 4(f), what can you conclude about the assumptions for regression? How would you correct for problems, if any? Do you need to make any assumptions? Motivate your answers by indicating the appropriate exhibit. Please be precise and concise.

Exhibit 4(a) Residuals vs Observation Number: to check that the errors are not autocorrelated. We can clearly see that the errors are not random and conclude that they are autocorrelated. This might be due to non-linearity between the dependent variable and an independent variable, or to one or more independent variables missing from the model.

Exhibit 4(b) Residuals vs Predicted: to check homoscedasticity. We can see that the dispersion of the errors is not constant and conclude that there is a problem of heteroscedasticity. This might be due to non-linearity between the dependent variable and an independent variable.

Exhibit 4(c) Durbin Watson test: to verify whether the errors are random or autocorrelated. We assume that the data has been ordered (otherwise the test is not valid). The calculated Durbin Watson statistic falls in the rejection region, so we can conclude that the errors are autocorrelated. The reasons for that might be non-linearity between the dependent variable and an independent variable, and/or an important independent variable missing from the model.

Exhibits 4(d) and 4(e) are plots of the dependent variable vs an independent variable to check linearity.
In 4(d) we see that the relationship between Profit and Size is linear. In 4(e) we see that the relationship between Profit and NREST is not linear. That may cause both the autocorrelation and the heteroscedasticity detected in 4(a) and 4(b). To correct for that, we should transform the variable NREST.

Exhibit 4(f) is the histogram of the Residuals, to verify whether the errors are normally distributed. We can accept this assumption.

e) Based on MODEL 2, estimate the impact on PROFIT of one unit increase in SIZE. Give a point estimate and a 99% confidence interval.

By increasing SIZE by 1 unit, PROFIT increases, on average, by 4.52 units, keeping all the rest constant.
A 99% CI for the regression coefficient for SIZE is 4.52 ± Z0.005 × 0.27, where Z0.005 = 2.57:
3.83 ≤ BSIZE ≤ 5.21

f) Based on MODEL 2, provide a 95% prediction interval for PROFIT. The following values for the independent variables are given: SIZE=100, EMPL=20, PRICE=50.

The best point estimate for the prediction of PROFIT is
164.44 + 4.52(100) - 7.57(20) + 22.18(50) = 1572.04
A 95% prediction interval is 1572.04 ± 2(94.66):
1383 ≤ PROFIT ≤ 1761

2. Internet Users

A lot of business nowadays involves advertising and direct sales via internet. To predict the number of internet users, the following data were collected for the year 2000:

Variable | Unit | Description
GDP per Capita | One unit is one US $ per Capita | Gross Domestic Product per Capita, in constant US $
Personal Computers | One unit is one Computer per 1,000 people | Number of Personal Computers per 1,000 people
Mobile Phones | One unit is one Mobile Phone per 1,000 people | Number of Mobile Phones per 1,000 people
Television Sets | One unit is one Television Set per 1,000 people | Number of Television Sets per 1,000 people
Electric Power per Capita | One unit is one kWh per Capita | Electric Power Consumption per capita, in kWh (kilowatt-hours)
Internet Users | One unit is one Internet User per 1,000 people | Number of Internet Users per 1,000 people

The data were collected for all countries with GDP per Capita exceeding 1,000 US $, and ordered by GDP per Capita. In all regression models, Internet Users is the dependent variable.

a. What can you infer from the correlation matrix of the variables (Exhibit 1)?

Several pairs of independent variables have a correlation coefficient greater than 0.7 in absolute value, meaning that there is a risk of multicollinearity between:
GDP per Capita and Personal Computers ( ρ = 0.8919 )
GDP per Capita and Television Sets ( ρ = 0.7036 )
GDP per Capita and Mobile Phones ( ρ = 0.8309 )
GDP per Capita and Electric Power per Capita ( ρ = 0.7902 )
Television Sets and Personal Computers ( ρ = 0.7056 )
Mobile Phones and Personal Computers ( ρ = 0.7967 )
Electric Power per Capita and Personal Computers ( ρ = 0.7290 )

b. From Regression Models 1-5 (Exhibit 2), which model is the best? Please justify your answer.

The best model is Regression Model 4. All of its variables are significant. There is a risk of multicollinearity between Mobile Phones and Personal Computers and between Electric Power per Capita and Personal Computers, but the signs of the regression coefficients are positive, which makes sense. It also has the highest adjusted R-squared (0.8471).

c. In Regression Model 4 three important statistics are missing for the intercept:

t-stat = a / SEa = -11.9771 / 8.6771 = -1.38
P-value = 2*Prob(t > 1.38). We can approximate t by a Z value: P-value = 2*Prob(Z > 1.38) = 2*0.0838 = 0.1676
Significance at the 0.05 level: P-value > 0.05, so coefficient a is not significantly different from 0.

d.
Exhibit 3 shows the Analysis of the Residuals (Durbin-Watson test, Residuals vs. Predicted values and Histogram of the residuals) for Regression Model 4. Are the regression assumptions satisfied? If not, what could be the reason and what would you do to improve the model?

The Durbin-Watson statistic is equal to 2, which falls in region A, so we accept the null hypothesis that the errors are random (not autocorrelated).

The plot Residuals vs Predicted is used to check whether the errors are homoscedastic: they fit within 2 horizontal parallels, so they have a constant dispersion and this assumption is satisfied.

The histogram of the residuals is used to check whether the errors are normally distributed with a mean equal to zero. This assumption is roughly satisfied.

e. Interpret the regression coefficient corresponding to the independent variable “Personal Computers” in Regression Model 4.

Coefficient b for “Personal Computers” is 0.3965. If the number of personal computers per 1,000 people increases by 1, the number of internet users per 1,000 people increases on average by 0.3965, assuming that the number of mobile phones and the electric power per capita do not change.

95% confidence interval for this coefficient:
b ± t(α/2, n-k-1) * SEb ≈ b ± Z(α/2) * SEb = 0.3965 ± 1.96*0.0658 ≈ 0.40 ± 0.13 = [0.27 ; 0.53]

f. Use Regression Model 4 to compute a 95% prediction interval for the number of internet users per 1,000 people in Singapore. The data for Singapore is as follows:

GDP per Capita | Personal Computers | Mobile Phones | Television Sets | Electric Power per Capita
22,767 | 483 | 684 | 304 | 6,889

The point estimate is Ŷf = -11.97 + 0.3965*483 + 0.1562*684 + 0.0087*6889 = 346.31.
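The arithmetic of this point estimate can be checked with a short sketch. It uses only the Regression Model 4 coefficients and the Singapore values quoted above; the dictionary keys and the helper name `point_estimate` are illustrative choices, not part of the course material.

```python
# Coefficients of Regression Model 4 as quoted in the solution text.
coefficients = {
    "intercept": -11.97,
    "Personal Computers": 0.3965,
    "Mobile Phones": 0.1562,
    "Electric Power per Capita": 0.0087,
}

# Singapore values for the three predictors that enter Model 4.
singapore = {
    "Personal Computers": 483,
    "Mobile Phones": 684,
    "Electric Power per Capita": 6889,
}


def point_estimate(coefs, values):
    """Intercept plus the sum of coefficient * value over the predictors."""
    return coefs["intercept"] + sum(coefs[name] * values[name] for name in values)


y_hat = point_estimate(coefficients, singapore)
print(f"Point estimate: {y_hat:.2f} internet users per 1,000 people")
```

GDP per Capita and Television Sets are left out because they are not predictors in Regression Model 4.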
The approximate formula for a 95% prediction interval is
Ŷf ± Z0.025 * StdevReg ≈ 346 ± 2*55.75 = 346 ± 111.5 = [234.5 ; 457.5]

3. TechProducts Sales

Bob Smart is the CEO of TechProducts, a manufacturer and distributor of high tech products, before they become commodities. TechProducts has signed a number of strategic alliances with two big high tech firms to license their products as they are being commoditized. Consequently, TechProducts is producing and distributing cheaper versions of such products for the medium and low ends of the market.

Bob is concerned about the sales of external memory cards (used in cameras, PDAs, and hand held computers) that his company is producing. These cards account for over 25% of its revenues and about 1/3 of its profits. There are several questions that Bob is not sure about. For instance, is it more beneficial to advertise his memory cards in trade magazines, or to spend more money on promotions? In addition, he is not sure about the effect of price increases/decreases on sales, or about the influence of advertising and promotions done by competitors.

To improve his insights concerning these and similar questions, he asked his assistant, John Timber, to collect as much data as possible and run regressions (remembering from his days at INSEAD that regression could provide useful information). Bob hopes that this will clarify his concerns and help him make more intelligent decisions.

The monthly data John has collected consists of Sales, the dependent variable, and six independent ones. These are described briefly below:

1. Total monthly sales of memory cards, minus returns, in Thousands of Boxes (each box contains six memory cards). (SALES) The capacity of memory cards varied from 128K to 1000K, and with it the price.
2.
Total monthly budget in Thousands of Dollars spent on advertising, mostly in trade journals. (ADV)
3. Total monthly budget in Thousands of Dollars spent on encouraging distributors to promote TechProducts memory cards by displaying them in prominent places in their stores, or by selling them cheaper. (PROMOT)
4. Average monthly price of the memory cards shipped during the month, in Dollars. (PRICE)
5. Total monthly advertising budget spent by TechProducts’ competitors (also mainly used in trade magazines). (COMP.ADV)
6. Total monthly promotional budget spent by TechProducts’ competitors. Unlike for competitive advertising, the figures for promotional spending are not reliable estimates, which reduces the trustworthiness of the numbers. (COMP.PROMOT)
7. Occasionally, TechProducts would find itself with a high inventory of memory cards, or with cards of lesser memory capacity than those demanded in the market. In such cases, TechProducts provides the extra/unwanted cards to big discounters that sell them at prices reduced by between 20% and 40%. The result is that the cards are sold, but at reduced profit margins that cover costs and a small part of the fixed expenses. During the months that such deals are provided to Discounters, this independent variable takes the value 1; otherwise its value is zero. (DISCOUNTERS)

Please answer the following questions in a precise but brief and concise manner by consulting Exhibits 1, 2 and 3.
Question 1 (please refer to Exhibit 1):

(a) Are there any possible problems that you should be aware of by studying Exhibit 1?

Yes, there is a possible risk of multicollinearity, as the correlation between “PRICE” and “DISCOUNTERS” is high in absolute value (i.e., -0.8089) and can create problems.

(b) Which variable exhibits a stronger relationship with SALES: ADV or PROMOT?

The stronger relationship is between “SALES” and “ADV”, as the correlation between the two is 0.5968, much bigger than that between “SALES” and “PROMOT”, which is only 0.0433.

(c) What does the correlation coefficient of -0.8089 between PRICE and DISCOUNTERS indicate?

It indicates that on a scale from 0 to -1 its value is -0.8089. This is close to -1 and points to a strong negative relationship, i.e., when DISCOUNTERS is equal to 1, PRICE decreases.

(d) What does the correlation coefficient of -0.2704 between ADV and PROMOT indicate?

The correlation of -0.2704 between the two independent variables “ADV” and “PROMOT” is not as strong as in (c) above. It means that as one increases the other tends to decrease, and vice versa.

Question 2 (please refer to Exhibit 2):

(a) In your view, which is the best Regression Run from the six listed in Exhibit 2? What evidence can you use to justify your answer (please refer to all evidence)?

The best Regression Run among those listed in Exhibit 2 is Regression Run 2. The reasons are:
(i) All the t-tests corresponding to the independent variables are significant, i.e. greater (in absolute value) than about 1.96, or equivalently the p-values are smaller than 0.05.
(ii) The Adjusted R² of this run is 0.806, the largest among the Regression Runs whose t-tests indicate that the coefficients of all independent variables are statistically significant.
(iii) The standard deviation of the regression is 62.11, the smallest among the Regression Runs whose t-tests indicate that the coefficients of all independent variables are statistically significant.

(b) Write down the regression equation you chose in part (a) above, and explain the precise meaning of the regression coefficients a and bi.

The Regression Run 2 is:
SALES = 639.04 + 3.01 ADV + 4.70 PROMOT - 4.91 PRICE - 0.17 COMP.ADV + 104.88 DISCOUNTERS

The meaning of the regression coefficients is the following:
a = 639.04: This is the constant term (intercept). It means that if all the independent variables are equal to zero, the value of SALES would be 639.04 on average.
b1 = 3.01: It tells us that if ADV increases by one unit, SALES would increase by 3.01 units on average, keeping all other variables constant.
b2 = 4.70: It tells us that if PROMOT increases by one unit, SALES would increase by 4.70 units on average, keeping all other variables constant.
b3 = -4.91: It tells us that if PRICE increases by one unit, SALES would decrease by 4.91 units on average, keeping all other variables constant.
b4 = -0.17: It tells us that if COMP.ADV increases by one unit, SALES would decrease by 0.17 units on average, keeping all other variables constant.
b5 = 104.88: It tells us that during the months in which there are sales to DISCOUNTERS, SALES increase by 104.88 units on average, keeping all other variables constant.

(c) In Run 4, the regression coefficient for ADV is 4.47, while that of PROMOT is 4.23. Can the marketing manager conclude that the impact of advertising on SALES is greater than that of promotion?

No conclusion can be drawn regarding the regression coefficients, because one of the independent variables in the model is non-significant. Moreover, the regression coefficients indicate only the most likely values of the coefficients of ADV and PROMOT.
These coefficients, however, have a range of values that can be found by computing, say, a 95% confidence interval. Such intervals for Regression Run 4 are:

For “ADV”: 4.47 ± 1.96(0.74), i.e. 3.02 ≤ ADV ≤ 5.92
For “PROMOT”: 4.23 ± 1.96(1.38), i.e. 1.53 ≤ PROMOT ≤ 6.93

Although the most likely values of the regression coefficients indicate that ADV has a higher impact than PROMOT, the two 95% confidence intervals overlap, indicating that the higher impact of ADV on SALES may be due to chance.

(d) In Run 3, the regression coefficient for ADV is 3.61, while that of PROMOT is 4.47. In Run 2, the regression coefficient for ADV is 3.01, while that of PROMOT is 4.70. How can you explain the difference in the values of these regression coefficients between Runs 3 and 2?

A regression coefficient tells us the impact of a specific independent variable on the dependent variable when the influence of all the others is kept constant. Thus, the difference in the regression coefficient of “ADV” between Runs 2 and 3 is explained by the fact that each run contains different independent variables. Specifically, Run 2 has an extra variable, DISCOUNTERS. This variable is positively correlated with ADV and negatively correlated with PROMOT, which explains why the coefficient of ADV moves up and that of PROMOT moves down when DISCOUNTERS is taken out of the model.

(e) Construct a 99% confidence interval for the values of a and b in Run 6.

For a: 2416.92 ± 2.58(229.61), i.e. 1824.53 ≤ A ≤ 3009.31 (because d.f. = n-k-1 = 36 > 29, we can approximate t with Z).
For b: -9.39 ± 2.58(1.65), i.e. -13.65 ≤ B ≤ -5.13

(f) In Regression Run 3, test the hypothesis that the value of the regression coefficient Bprice = -10, versus the alternative that it is different from -10.
H0: Bprice = -10
HA: Bprice ≠ -10
Zobs = (b - B) / SEb = (-8.19 - (-10)) / 1.09 = 1.66
Zα/2 = 1.96
-Zα/2 ≤ Zobs ≤ Zα/2, so we cannot reject H0.

Question 3 (please refer to Exhibit 3):

By relating each specific part of Exhibit 3 to the various assumptions of regression, explain if such assumptions are or are not satisfied. If necessary, specify what other information you may want to seek to answer this question.

Exhibit 3(a) - Durbin Watson test: to check whether the errors are random or autocorrelated. The data should be ordered (otherwise the test is not valid). The calculated Durbin Watson statistic falls in region B, so we cannot conclude whether the errors are autocorrelated or not.

Exhibit 3(b) - Residuals vs Predicted: to check whether the errors are homoscedastic. The dots fit approximately between 2 parallels, so we can conclude that this assumption is satisfied.

Exhibit 3(c) - Histogram of the errors: to check whether the errors are normally distributed. The histogram approximately fits the theoretical normal curve, so we can conclude that this assumption is satisfied.

Question 4

After having studied the various Regression Runs and having answered the questions above, what is your best advice for Bob? Is it more beneficial to advertise his memory cards in trade magazines or spend more money on promotions? Please be brief and precise.

Regression Run Number 2 is the most appropriate of those given in Exhibit 2. The regression coefficient for “ADV” is 3.01, while that for “PROMOT” is 4.70. This indicates that the influence of PROMOT on SALES is greater than that of ADV. At the same time, however, the 95% confidence interval for ADV goes from 1.52 to 4.5, while that for PROMOT goes from 2.64 to 6.76.
Since the two intervals overlap, our advice to Bob is that he cannot be sure that PROMOT is more beneficial than ADV. If he wants to be more confident about the impact of PROMOT vs. ADV, he should collect more data and re-run the regressions to re-estimate the coefficients and reduce their standard errors.
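The overlap argument used in this answer can be illustrated with a minimal sketch. It takes the two 95% confidence intervals for Regression Run 2 exactly as quoted in the answer to Question 4; the function name `intervals_overlap` and the tuple layout are illustrative assumptions, and the check reproduces only the informal overlap criterion used in the solution, not a formal test of the difference between the two coefficients.

```python
def intervals_overlap(first, second):
    """Return True if two closed intervals (low, high) share at least one point."""
    return max(first[0], second[0]) <= min(first[1], second[1])


# 95% confidence intervals for the Run 2 coefficients, as quoted in the answer.
adv_ci = (1.52, 4.50)     # coefficient of ADV
promot_ci = (2.64, 6.76)  # coefficient of PROMOT

if intervals_overlap(adv_ci, promot_ci):
    shared = (max(adv_ci[0], promot_ci[0]), min(adv_ci[1], promot_ci[1]))
    print(f"Intervals overlap on {shared}: cannot rank PROMOT above ADV.")
else:
    print("Intervals do not overlap: the coefficients differ clearly.")
```

With more data the standard errors would shrink and both intervals would narrow, which is why the answer recommends collecting more observations before ranking the two levers.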