March 2011 Midterm Examination
Advanced Business Statistics
MGSC 272
March 11, 9:00 – 11:00 a.m.
Examiner: Prof. Brian Smith

Student Name:
McGill ID:

INSTRUCTIONS:
This is a CLOSED BOOK examination.
Only one hand-written or typed double-sided CRIB SHEET is permitted.
SPACE IS PROVIDED on the examination to answer all questions.
You are permitted translation or regular dictionaries.
Regular calculators are permitted. However, calculators that can store text are not permitted.
The marks for each question appear next to the question number.
The exam consists of a total of 23 pages, including this cover page and 2 pages for rough work at the end of the exam.
This examination paper MUST BE RETURNED.

For Marker Only
Question      Grade
Part 1        /30
Part 2, Q1    /18
Part 2, Q2    /12
Part 2, Q3    /25
Part 2, Q4    /15

Part 1 consists of 15 questions worth 2 marks each for a total of 30 marks. Part 2 consists of four questions for a total of 70 marks.

MGSC 272 Page 1/23

Part 1 (30 marks)

1. For a multiple regression model, when testing a hypothesis for a regression coefficient, consider the following statements:
I. If the model contains multicollinearity, a t-test is more likely to erroneously conclude H1.
II. If the model contains no multicollinearity, the F-test and t-tests are equally effective.
III. The test statistic is calculated by dividing the coefficient by its standard error.
Which of the above statements are true?
A. I and II only
B. II and III only
C. I only
D. III only
E. All of the above statements are true.
Answer: B

The following information pertains to the next two questions:
Below you are given a partial Minitab output based on a sample of 20 observations.

            Coefficient   Standard Error
Constant    145.321       48.682
X1           25.625        9.150
X2           -5.720        3.575
X3            0.823        0.183
X4            1.238        0.421

2. Refer to the information above. At the 1% level of significance, which of the following statements is true?
A. Variables X1 and X2 are significant, but variable X3 is not significant.
B. Variables X1 and X3 are significant, but variable X2 is not significant.
C. Only variable X1 is significant.
D. Only variable X3 is significant.
E. None of the variables are significant.
Answer: t.005;15 = 2.947 => D is the correct answer.

3. A 99% confidence interval estimate for β2 is:
A. -15.02 to 3.58
B. -16.26 to 4.82
C. -16.16 to 4.72
D. -14.04 to 2.60
E. None of the above.
Answer: B

A multiple regression model relating Y (dependent variable) and three independent variables X1, X2 and X3 is based on a sample of 40 observations.

4. The error degrees of freedom is equal to ______.
Answer: n – (k + 1) = 40 – (3 + 1) = 36

5. In question 4, SSR = 1200, SSE = 880. Compute the adjusted R² (R²a).
Answer: $R_a^2 = 1 - \frac{40-1}{40-(3+1)} \cdot \frac{880}{2080} = 1 - .4583 = .5417$

6. Consider the following statements regarding maximum likelihood estimates (MLE).
I. MLEs are sometimes biased.
II. MLEs are preferred to least squares estimates when the error terms in a regression model are normally distributed and exhibit homoscedasticity.
III. MLEs use more information than least squares estimates about the parameters being estimated.
Which of the above statements are true?
A. I only
B. I and III only
C. II and III only
D. III only
E. All of the above statements are true.
Ans: B

7. A marketing consultant wants to find a maximum likelihood estimate for the probability that a consumer will recommend a new product. A random sample of eight consumers is selected and each consumer may give one of two possible replies, “Recommend” or “Don’t Recommend”. This eight-trial experiment is repeated 12 times, and a total of 42 consumers reply “Recommend”. The maximum likelihood estimate for the probability that a consumer recommends the product is: _________________
$\hat{p}_{MLE} = \frac{\sum x}{nk} = \frac{42}{(8)(12)} = .4375$
Ans: 0.4375

8. A statistician wants to estimate the mean value of a normal distribution with a standard deviation of 5.
Based on a random sample of three values, Y1 = 27, Y2 = 42, and Y3 = 36, she has performed the following analysis in Excel. The values in cells C6 – E7 are obtained using the normal distribution function with σ = 8 and the specified value of μ. Which of the following statements are true?
I. The value of the likelihood function at μ = 32 is closer to the maximum likelihood estimate than the value of the likelihood function at μ = 36.
II. The maximum likelihood estimate has to be some value between 27 and 36.
III. The likelihood function for this example is obtained by multiplying the cumulative probabilities associated with the three sample values of the normal distributions for the specified values of μ.
IV. The graph of the likelihood function will have a maximum value for a unique value of μ between 27 and 42.
V. More than one of the statements is true.
Answer: D

9. An economist is studying a lognormal distribution X. The variable ln(X) has a normal distribution with a mean of 3 and a standard deviation of 1.5. The lognormal distribution density function for variable X is given by: f(x) = _________________.
$f(x) = \frac{1}{x\,(1.5)\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{\ln x - 3}{1.5}\right)^2}$

10. An initial stock price is $20. The expected return is 12% per annum, and the volatility is estimated to be 8% per annum. A 95% confidence interval for the value of the stock after 9 months is:
Mean of ln(Price) = ln(20) + (.12 − .08²/2)(.75) = 3.0833
Std. Dev. of ln(Price) = σ√T = 0.08√.75 = 0.06928
3.0833 ± 1.96(.06928) = 3.0833 ± .1358
2.9475 ≤ ln(ST) ≤ 3.2191
$19.06 ≤ ST ≤ $25.01

11. For a multiple regression model, consider the following statements:
I. One use of transformations in regression is to transform what appears to be a nonlinear model into a linear model.
II. A nested F-test may be used to measure the significance of the coefficient of partial determination when a new variable is added to a model.
III.
Significant interaction between two variables implies that as the value of one of the variables increases the rate of change of the Y value with respect to the other variable also increases.
Which of the above statements are true?
A. I and II only
B. II and III only
C. II only
D. III only
E. All of the above statements are true.
Ans: A

The following information pertains to the next three questions:
The CEO of a chain of 38 sporting goods stores wishes to determine a relationship between Monthly Sales (Y) and the demographic variables Age (X1 = Median Age of customer base), Income (X2 = Median Income), and HS (X3 = percentage of customer base with a high school diploma). A partial computer output is shown below.

Regression Analysis: Sales versus Age, Income, HS
The regression equation is
Sales = - 2063712 - 30100 Age + 0.8 Income + 60212 HS

Predictor   Coef        SE Coef    T       P
Constant    -2063712    3445731    -0.60   0.553
Age           -30100      87328    -0.34   0.732
Income          0.82      27.89     0.03   0.977
HS             60212      31223     1.93   0.062

S = 823650   R-Sq = 24.3%   R-Sq(adj) = 17.7%

12. Find the value of R² for this model.
$R^2 = 1 - \frac{n-(k+1)}{n-1}\,(1 - R^2_{adj}) = 1 - \frac{38-4}{38-1}\,(1 - .177) = 0.2437$

13. One store in the sample was located in a suburb that had a median age of 36 years, a median income of $48,000, and 80% of the population had a high school diploma. This store recorded sales of $1.5 million. What is the residual for this store?
Sales = -2063712 - 30100(36) + 0.8(48000) + 60212(80) = 1708048
Residual = 1,500,000 - 1,708,048 = -$208,048

14. For a logistic regression model, consider the following statements:
I. The logit function represents the natural logarithm of the probability that a binary variable will assume the value 1.
II. In a logistic regression the dependent variable assesses the likelihood that a particular outcome will occur.
III. In logistic regression models, the test statistics for goodness of fit tests follow a Wald distribution.
Which of the above statements are true?
A. B. C.
D. E.
I and II only
II and III only
II only
III only
All of the above statements are true.
Answer: C

15. Consider the following statements:
I. Mallows' Cp is a good criterion for selecting the variables in a multiple regression model because it simultaneously increases the adjusted R² while reducing the number of variables in the model.
II. Stepwise regression generally selects a model that reduces multicollinearity.
III. The variance inflation factor is a measure of how strongly the Y variable is correlated with all of the X variables combined.
Which of the above statements are true?
A. I only
B. I and II only
C. II and III only
D. III only
E. All of the above statements are true.
Ans: B

PART 2 (70 marks)

QUESTION 1 (16 marks)
The manager of a large supermarket chain would like to determine the effect of shelf space, and whether the product was placed at the front or back of the aisle, on the sales of pet food. A random sample of 12 equal-sized stores is selected with the following results:
The variable Space represents shelf space measured in meters, and the variable Place = 1 if the item is placed in the front of the aisle and 0 if it is placed in the back. The dependent variable Sales represents weekly sales in thousands of dollars. For example, the first observation involves sales of $1,600 when 5 m of space is allocated and the pet food is placed at the back of the aisle.
The following output has been obtained from Minitab:

(a) Interpret the coefficient of Space in the first model above, Sales vs Space. The store manager currently devotes 15 meters of shelf space to pet food and is thinking about doubling the space. As the manager’s statistical consultant, you are asked to express your opinion on this decision in the space below. Show calculations to justify your conclusion. [4 points]
Write your opinion here
Coeff of Space = .074.
This means that when Space is increased by 1 unit (1 meter), Sales increases, on average, by $74.
If Space = 15 then Sales = 2.56 ($2,560)
If Space = 30 then Sales = 3.67 ($3,670)
WARNING: Extrapolation problem, since Space = 30 is outside the range of observed data of 5 to 20 meters.

(b) Interpret the coefficient of Place in the second model above, Sales vs Place. Based on this model would you conclude that, at the 5% level of significance, sales value is significantly higher when the pet food is placed at the front of the aisle? Show your work. [4 points]
The coefficient of Place is 0.45. This means that, on average, when Place = 1 (front of aisle) sales are higher by 1000 × 0.45 = $450.
H0: β1 ≤ 0
H1: β1 > 0
TS: b1 = .45 => t = .45/.1305 = 3.45
CV: t0.05;10 = 1.812
Conclusion: Reject H0 => Sales are higher at the front of the aisle.

(c) Consider the regression model of Sales vs Space and Place. [4 points]
(i) For this model, estimate R².
s = .213177 => s² = .0454 = MSE
SSE = [n-(k+1)]MSE = [9](.0454) = 0.409
R² = 1 – SSE/SSTO = 1 - .409/3.0025 = 1 - .1362 = 0.8638 => 86.4%
(ii) Discuss the question of multicollinearity in the model. [2 points]
No multicollinearity, because the coefficient of Place does not change when Space is added to the model.
(iii) Construct a 98% confidence interval for the coefficient of Space in this model. Interpret the confidence interval. [2 points]
b1 ± t.01;9 sb1 = .074 ± (2.821)(.01101) = .074 ± .031 => .043 ≤ β1 ≤ .105

QUESTION 2 (20 marks)
The Minister of Education in a province wants to investigate factors affecting the percentage of students who pass a reading proficiency exam in Grade 10.
She has conducted a study of 47 school districts and has collected the following data:
%Passing (Y): Percentage of students passing the proficiency exam
%Attendance (X1): Daily average of percentage of students attending class
Salary (X2): Average teacher salary (dollars)
Spending (X3): Instructional spending per student (dollars)
An extract of the data is shown below:
Some relevant computer output is shown below:

Answer the following questions:
a) Which variables are most likely to cause a multicollinearity problem? Justify your answer. [2 points]
Salary and Spending are highly correlated, so they are the most likely to cause multicollinearity.

b) Determine the coefficient of partial determination when Spending is added to the model containing only %Attendance. Interpret this value. [2 points]
R²y2.1 = (5035.9 – 4702)/5035.9 = .0663 => 6.63%
Adding Spending explains an extra 6.63% of the variability left unexplained by %Attendance.

c) Perform a nested F-test to determine whether it is worth adding Spending to the model containing only %Attendance. [4 points]
Nested F-Test
Ho: β1 = 0
H1: β1 ≠ 0
TS: $F = \frac{(SSE_R - SSE_C)/(k-g)}{SSE_C/[n-(k+1)]} = \frac{(5035.9 - 4702)/(2-1)}{4702/(47-3)} = 3.1245$
CV: F.05; 1,44 ≈ 4.08
Conclusion: Do not reject Ho, i.e. it is not worth adding Spending to the model containing %Attendance.

d) Consider the regression analysis of %Passing on %Attending and Spending. For this model, calculate the standardized regression coefficients for %Attending and Spending. Interpret the results. [4 points]
%Attending: $b^* = \sqrt{\frac{2.121}{275.37}}\,(8.5) \approx .7487$
Spending: $b^* = \sqrt{\frac{209705.1}{275.37}}\,(.00598) = .1650$

e) Consider the regression analysis of %Passing on %Attending and Spending. Perform the appropriate t-tests and recommend whether or not to keep each of the variables in the model.
[4 points]
CV: t.025;44 ≈ 1.96
%Attending: t = 8.501/1.065 = 7.982 > 1.96 => Reject Ho
Spending: t = .005984/.003385 = 1.768 < 1.96 => Do not reject Ho
Conclusion: keep %Attending, drop Spending

f) The regression model of %Passing on %Attendance has been expanded to include a quadratic term (%Attend^2) with the following results:
You have been asked to recommend a model relating %Passing to %Attendance. Would you recommend the linear model or the quadratic model? [1 point]
Quadratic.
If you choose the wrong model your estimate of the percentage of students who pass the proficiency exam will be incorrect. Estimate the percentage error associated with the wrong choice. Will this be an overestimate or an underestimate? [3 points]
%Error = (52.2 – 64.8)/52.2 = -0.24 => 24% error. Overestimate.

QUESTION 3 (14 marks)
An auto club rates 50 cars for mileage per gallon as a function of the cars’ horsepower and weight (in pounds). Regression analyses are run with and without an interaction term as shown below:

a) In the model without interaction, interpret the coefficients of the variables HP and Weight. [2 points]
For each extra unit of horsepower, MPG decreases by .118 miles per gallon, holding weight constant.
For each extra pound of weight, MPG decreases by .00687 miles per gallon, holding horsepower constant.

If car A weighs 2100 pounds and car B weighs 3300 pounds, estimate the difference in miles per gallon obtained by the two cars, assuming both cars have the same horsepower. [2 points]
Weight = 2100: Yhat = 58.2 - .118HP - .00687(2100) = 58.2 - .118HP – 14.427
Weight = 3300: Yhat = 58.2 - .118HP - .00687(3300) = 58.2 - .118HP – 22.671
Difference = 22.671 – 14.427 = 8.244
Conclusion: The lighter car (A) is estimated to get 8.244 more miles per gallon.

Estimate the mileage per gallon obtained by a car with a horsepower rating of 120 and a weight of 2500 pounds.
[2 points]
MPG = 58.2 - .118(120) - .00687(2500) = 26.865

b) In the model with interaction, perform a test of hypothesis, showing the null and alternative hypotheses, with α = .05 to determine if the interaction is significant. [2 points]
H0: β3 = 0
H1: β3 ≠ 0
TS: t = .00010047/.00002971 = 3.38
CV: t.025;46 ≈ 1.96
Conclusion: Reject Ho => the interaction is significant.

Interpret the interaction term in this model. [2 points]
The response of MPG to horsepower differs for cars of different weights. The response of MPG to weight differs for cars of different horsepower.

In this model, estimate the mileage per gallon obtained by a car with a horsepower rating of 120 and a weight of 2500 pounds. [2 points]
MPG = 85.1 - .451(120) - .0152(2500) + .0001(120)(2500) = 22.98

c) Which of the two estimates obtained in parts (a) and (b) for cars with a horsepower rating of 120 and a weight of 2500 pounds would you convey to a friend who has just bought such a car? Explain the consequences of choosing the wrong model. [2 points]
The result of part (b) is more accurate because it takes interaction into account. The wrong model would result in overestimating the mileage by 26.865 – 22.98 ≈ 3.9 miles per gallon.

QUESTION 4 (20 marks)
A beer company executive is tracking sales of several brands of beer and is attempting to establish a relationship between price and the independent variables alcohol content (expressed as a percentage), country of origin, and type of beer. Price is measured in dollars; country of origin is equal to 1 for domestic beers and 0 for imported beers. Beer is classified by 5 types: Lager, Ale, Red, Stout, and Lite. A sample of 62 beers is selected and the regression analysis of Price on %Alcohol Content and Country of Origin is:

a) For this model interpret the coefficients of the two independent variables. [4 points]
A 1% increase in alcohol content increases price by $0.76 on average, assuming country of origin is fixed, i.e.
the 1% increase applies equally to domestic and imported beers.
Price is $1.20 cheaper on average for domestic beers (Country of Origin = 1) than for imported beers, assuming alcohol content is constant.

The next regression model shows Price versus Country and Type, where Type = 5 is the reference type.

Analysis of Variance
Source            DF    SS        MS       F       P
Regression         5    86.193    17.239   24.23   0.000
Residual Error    56    39.848     0.712
Total             61   126.042

b) Which type of beer is the most expensive? Which is the least expensive? [2 points]
Type 2 is the most expensive. Type 4 is the least expensive.

Beer Types 3 and 4 are not significant in the model. Should we drop the variables Type_3 and Type_4 from the model? Justify your answer. [1 point]
You cannot drop a subset of the dummy variables.

Calculate the value of the standard error of the estimate for this regression model. [3 points]
SSE = 39.849; df = 62 – (5+1) = 56 => MSE = 39.849/56 = .7116
S = √(MSE) = √(.7116) = 0.8436

The following output shows the full regression model, involving all of the predictor variables.

c) A beer costs $4.85. It is a domestic beer that has 150 calories, and is classified as Type 1. What is the expected alcohol content of this beer? [4 points]
4.85 = 6.08 + .0006(150) - .139AC - 1.59(1) + .958(1)
AC = (6.08 – 4.85 + .0006×150 - 1.59 + .958)/.139 = .688/.139 = 4.95% alcohol

d) Construct a 99% confidence interval for the difference in price between domestic and imported beers. Interpret the confidence interval in the context of this model. [2 points]
-1.59 ± 2.576(.5274) => -2.95 to -0.23

Some additional Minitab output is shown below:
Explain why the best subsets model with the minimum value of Cp excludes the variable %Alcohol Content. [2 points]
1. Parsimony: Cp favors models with fewer variables.
2. %Alcohol Content and Calories are highly correlated (r = .712). Therefore, if one of them is already in the model, the other will not add much to the value of R²adj.
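The arithmetic in the Question 4 answers above (the standard error of the estimate, the alcohol content solved from the fitted equation, and the 99% interval for the Country coefficient) can be re-derived with a short script. This is only a checking sketch: the function and variable names are illustrative and not part of the exam, and every number is copied from the worked answers rather than from the Minitab output, which is not reproduced here.

```python
import math

def standard_error_of_estimate(sse, n, k):
    """S = sqrt(MSE), where MSE = SSE / (n - (k + 1))."""
    return math.sqrt(sse / (n - (k + 1)))

def solve_alcohol_content(price, calories):
    """Solve the fitted equation for AC, for a domestic Type 1 beer:
    price = 6.08 + .0006*calories - .139*AC - 1.59*(1) + .958*(1)
    """
    return (6.08 - price + 0.0006 * calories - 1.59 + 0.958) / 0.139

def confidence_interval(coef, se, critical_value):
    """Return (lower, upper) = coef -/+ critical_value * se."""
    half_width = critical_value * se
    return coef - half_width, coef + half_width

# Numbers taken from the answers to parts (b), (c) and (d).
s = standard_error_of_estimate(sse=39.849, n=62, k=5)
ac = solve_alcohol_content(price=4.85, calories=150)
lower, upper = confidence_interval(coef=-1.59, se=0.5274, critical_value=2.576)

print(round(s, 4))                       # 0.8436
print(round(ac, 2))                      # 4.95
print(round(lower, 2), round(upper, 2))  # -2.95 -0.23
```

The critical value 2.576 is the normal approximation used in the answer key; an exact t critical value would give a slightly wider interval.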
Estimate the value of R²adj for the model with the minimum Cp value. [2 points]
$R^2_{adj} = 1 - \frac{n-1}{n-(k+1)} \cdot \frac{SSE}{SSTO} = 1 - \frac{61}{59}\,(1 - R^2) = 1 - \frac{61}{59}\,(1 - .256) = .2308$

End of Exam
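The conversion between R² and adjusted R² appears three times in this exam: Part 1 question 5, Part 1 question 12 (in the reverse direction), and the final Question 4 estimate. A minimal sketch for checking those three answers follows; the helper names are hypothetical, and the inputs are the values quoted in the answer key.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (n - 1) / (n - (k + 1)) * (1 - R^2)."""
    return 1 - (n - 1) / (n - (k + 1)) * (1 - r2)

def r2_from_adjusted(r2_adj, n, k):
    """Invert the formula above to recover R^2 from adjusted R^2."""
    return 1 - (n - (k + 1)) / (n - 1) * (1 - r2_adj)

# Part 1, question 5: SSR = 1200, SSE = 880, n = 40, k = 3.
q5 = adjusted_r2(r2=1200 / (1200 + 880), n=40, k=3)

# Part 1, question 12: R-Sq(adj) = 17.7%, n = 38, k = 3.
q12 = r2_from_adjusted(r2_adj=0.177, n=38, k=3)

# Part 2, question 4: minimum-Cp model with R^2 = .256, n = 62, k = 2.
q4 = adjusted_r2(r2=0.256, n=62, k=2)

print(round(q5, 4))   # 0.5417
print(round(q12, 4))  # 0.2437
print(round(q4, 4))   # 0.2308
```

Applying `r2_from_adjusted` to the output of `adjusted_r2` with the same n and k returns the original R², which is a quick way to confirm the two formulas are consistent.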