C22.0103 FINAL EXAM

Name:________________________

Write your answers to the first five questions on the attached sheets, in the spaces provided. Circle the choice which best answers questions 6-15. Do not write anything else on this page (besides your name and the circles). When you are finished, hand in the entire exam (both question sheets and answer sheets). Please do not remove any pages from the exam paper.

There are 15 questions, each worth 5 points. Everyone receives 25 points for free. Good Luck!

 1) WRITTEN                     11) (A) (B) (C) (D) (E)
 2) WRITTEN                     12) (A) (B) (C) (D) (E)
 3) WRITTEN                     13) (A) (B) (C) (D) (E)
 4) WRITTEN                     14) (A) (B) (C) (D) (E)
 5) WRITTEN                     15) (A) (B) (C) (D) (E)
 6) (A) (B) (C) (D) (E)
 7) (A) (B) (C) (D) (E)
 8) (A) (B) (C) (D) (E)
 9) (A) (B) (C) (D) (E)
10) (A) (B) (C) (D) (E)

Answer for Question 1:

Answer for Question 2:

Answer for Question 3:

Answer for Question 4:

Answer for Question 5:

In Questions 1) - 5), we consider the response variable of Hotel and Restaurant Employment for Costa Rica (in thousands of employees) for each year from 1995 to 2008, together with the following three explanatory variables: Tourists Arriving (in thousands), GDP (in millions of US Dollars), and Year.

1) Here is the linear regression output for the simple regression of Hotel and Restaurant Employment on Tourists Arriving.

Regression Analysis: Hotel and Restaurant Employment versus Tourists Arriving

The regression equation is
Hotel and Restaurant Employment = 26.2 + 0.0412 Tourists Arriving

Predictor              Coef   SE Coef     T      P
Constant             26.173     6.845  3.82  0.002
Tourists Arriving  0.041155  0.005095  8.08  0.000

S = 8.07098   R-Sq = 84.5%   R-Sq(adj) = 83.2%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  4249.5  4249.5  65.24  0.000
Residual Error  12   781.7    65.1
Total           13  5031.2

A) Based on this output, discuss the impact of an additional 2000 tourists arriving in Costa Rica in a given year on Hotel and Restaurant Employment.
(2 points).

B) Test the null hypothesis that the true coefficient of Tourists Arriving in this model is .03. Use a two-tailed alternative hypothesis, and a significance level of .05. (3 points).

2) Here is the fitted line plot for the simple regression in Question 1.

[Fitted Line Plot: Hotel and Restaurant Employment (vertical axis) versus Tourists Arriving (horizontal axis). Fitted line: Hotel and Restaurant Employment = 26.17 + 0.04115 Tourists Arriving. S = 8.07098, R-Sq = 84.5%, R-Sq(adj) = 83.2%.]

A) The data point furthest to the right corresponds to the year 2008, and has a leverage of 0.34 and a Cook's D of 0.86. Does this give us cause for concern as to the validity of the regression model? (2 points).

B) Is there anything about the fitted line plot, or the plot of residuals from this regression versus year (see below) that gives us cause for concern as to the validity of the regression model? (3 points).

[Plot: Residuals Versus Year (response is Hotel and Restaurant Employment). Vertical axis: Residual. Horizontal axis: Year.]

3) Here is the regression output using all three explanatory variables.

Regression Analysis: Hotel and Restaurant Employment versus Tourists Arriving, GDP, Year

The regression equation is
Hotel and Restaurant Employment = - 10404 + 0.0162 Tourists Arriving - 0.197 GDP + 5.24 Year

Predictor             Coef  SE Coef      T      P
Constant            -10404     2538  -4.10  0.002
Tourists Arriving  0.01624  0.01952   0.83  0.425
GDP                -0.1967   0.1258  -1.56  0.149
Year                 5.244    1.275   4.11  0.002

S = 5.11627   R-Sq = 94.8%   R-Sq(adj) = 93.2%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       3  4769.5  1589.8  60.74  0.000
Residual Error  10   261.8    26.2
Total           13  5031.2

A) Based on this output, is there evidence of a positive relationship between Tourists Arriving and Hotel and Restaurant Employment? (2 points).
B) Use the output above to compute the p-value in testing the null hypothesis that the true coefficient of GDP is zero versus the alternative hypothesis that the true coefficient is positive. (2 points).

C) Do the F-statistic and its associated p-value indicate that all variables should be included in the regression? (1 point).

4) Next, we omit Year from the regression. For the regression based on Tourists Arriving and GDP, the output is as follows.

Regression Analysis: Hotel and Restaurant Employment versus Tourists Arriving, GDP

The regression equation is
Hotel and Restaurant Employment = 32.2 + 0.0667 Tourists Arriving - 0.216 GDP

Predictor             Coef  SE Coef      T      P
Constant            32.232    8.744   3.69  0.004
Tourists Arriving  0.06667  0.02376   2.81  0.017
GDP                -0.2161   0.1966  -1.10  0.295

S = 8.00209   R-Sq = 86.0%   R-Sq(adj) = 83.5%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       2  4326.8  2163.4  33.79  0.000
Residual Error  11   704.4    64.0
Total           13  5031.2

Is this model preferable to the full model in Question 3? Justify your answer. (5 points).

5) In the regression output in Question 4 above, note that the estimated coefficient for Tourists Arriving is closer to zero than the estimated coefficient for GDP (since |.06667| < |−.2161|). How, then, do you explain the fact that the t-statistic for Tourists Arriving is further from zero than the t-statistic for GDP? (5 points).

Questions 6-15 are general and do not pertain to the regression example above.

6) In a multiple regression context, suppose that we have three available explanatory variables. Suppose that we run three regressions. The first regression uses variables 1 and 2, and produces an R² of .65. The second regression uses variables 2 and 3, and yields an R² of .70. The third regression uses variables 1 and 3, and produces an R² of .75. Which model is preferable, according to AICc?
A) The model with variables 1 and 2.
B) The model with variables 2 and 3.
C) The model with variables 1 and 3.
D) It cannot be determined from the available information.

7) Consider a simple linear regression of y on x, where the y-values are not all the same. Suppose that the residuals all take the same value. Then:
A) R² must be 1.
B) R² may be less than 1.
C) It cannot be determined from the available information.

8) Suppose that a simple linear regression model holds for a data set with n=20. What is the probability that the sample mean of the (unobservable) errors is more than .2969 times the sample standard deviation of the errors?
A) .0918   B) .100   C) .1836   D) .200   E) None of the Above.

9) Suppose we are going to use a t-test to test the null hypothesis H0: μ = 0 versus the alternative hypothesis HA: μ > 0. Assume that the null hypothesis is true and the population is normally distributed. What is the probability that the right-tailed p-value will be less than .01?
A) .005   B) .99   C) .995   D) .01   E) None of the Above.

10) In a sample of size 10 from a normal population, the sample mean is 2 and the sample standard deviation is 3. Construct a 95% confidence interval for the population mean, μ. The interval is:
A) (−.146, 4.146)   B) (.141, 3.859)   C) (−3.88, 7.88)   D) (−.114, 4.114)   E) None of the Above.

11) We will look here at the results of a very large trial of an HIV vaccine. The trial was conducted on 16,400 people in Thailand, all of whom were HIV negative at the start of the trial. Half of the people received a placebo, and half received the vaccine. Both groups were followed for three years afterwards. Of the 8,200 who received the vaccine, 51 developed HIV. Of the 8,200 who received the placebo, 74 developed HIV. Here are the results from Minitab's 2-proportions.
Test and CI for Two Proportions

Sample   X     N  Sample p
1       51  8200  0.006220
2       74  8200  0.009024

Difference = p (1) - p (2)
Estimate for difference:  -0.00280488
95% upper bound for difference:  -0.000571046
Test for difference = 0 (vs < 0):  Z = -2.07  P-Value = 0.019

If the vaccine were actually ineffective, what would be the probability of observing at least as big a reduction as seen here in the HIV rate for the vaccine compared to the placebo?
A) .0095   B) .019   C) .038   D) .981   E) None of the Above.

12) In simple linear regression, if the right-tailed p-value for the coefficient of the explanatory variable is .5, then the R² must be
A) .5   B) .25   C) 1   D) 0   E) None of the Above.

13) Suppose that an automobile manufacturer has been notified by owners that a certain model has a sticky accelerator pedal. To investigate these claims, the company wants to perform their own laboratory tests, based on a random sample of n automobiles of the given model. If 1% of the automobiles in the population have the sticky accelerator pedal, what is the smallest value of n that the company should use for their sample size to guarantee a probability of at least 90% that at least one of the automobiles in the sample has a sticky accelerator pedal?
A) 10   B) 120   C) 230   D) 550   E) None of the Above.

14) Based on a sample of size 10 from a normal distribution, suppose you want to test the null hypothesis that the population mean is zero against a right-tailed alternative hypothesis. The sample mean is 1.0386 and the sample standard deviation is 1.452. Then the p-value is:
A) .0238   B) .05   C) .0119   D) .025   E) None of the Above.

15) Consider a game where a fair coin is tossed four times, independently. If all four tosses are heads, you win $10. Otherwise, you lose $1. If you are going to play this game once, what is your expected profit?
A) $4   B) 31.25 Cents   C) −31.25 Cents   D) −$4   E) None of the Above.