Stat 401B – Final Exam December 16, 2008 Name: ________________________ INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show work in the space provided. Partial credit will not be given if work is not shown. Use the JMP output. It is not necessary to calculate something by hand that JMP has already calculated for you. When asked to explain, describe, or comment, do so within the context of the problem. Be sure to include units when discussing quantitative variables. 1. [25 pts] On the first exam you looked at the alcohol content of beer (%). Consider an indicator variable, Type, that equals 0 if the beer is a light beer and 1 if the beer is a regular beer. Refer to the JMP output “Alcohol Content of Beer.” a) [4] Describe the distribution of % Alcohol. Be sure to comment on shape, center and spread. There is a big mound between 4 and 4.5 and a secondary mound between 5 and 5.5. You might also describe the shape as skewed right. Center: sample median = 4.9 % or sample mean = 4.81% Variability: values from 3.8% to 6.2% or sample standard deviation = 0.628% b) [5] Compute a 95% prediction interval for the % Alcohol of the next beer selected at random from the population. t* = 2.024 ( )√ The 95% prediction interval for the next beer selected at random is 3.52% to 6.10% alcohol by volume. c) [2] Give the equation for predicting % Alcohol from Type. Predicted % Alcohol = 4.17 + 1.02*Type 1 d) [2] How much of the variation in % Alcohol is explained by the linear relationship with Type? R2 = 0.632 so 63.2% of the variation in Alcohol is explained by the linear relationship with Type. e) [3] Give an interpretation of the estimated intercept. Light beers have 4.17% Alcohol, on average. f) [4] Give an interpretation of the estimated slope. Regular beers have 1.02 percentage points more Alcohol, on average, than light beers. g) [5] Is there a statistically significant difference between light beers and regular beers in terms of mean % Alcohol? Support your answer with an appropriate statistical analysis. Look at the t-Ratio for Type. t = 8.07 P-value < 0.0001 Because the P-value is so small (< 0.05) there is a difference in population mean alcohol content for light and regular beers. 2 2. [40 pts] Data on one year old GM vehicles from the 2005 model year are given in the November 2008 issue of Journal of Statistics Education. A random sample of 100 vehicles (Buicks, Cadillacs, Chevrolets, Pontiacs, Saturns and SAABs) was selected. Using these data we wish to predict the Price ($1000) of the vehicles. The explanatory variables are listed below. Refer to the JMP output entitled “Analysis of the Price of Used GM Vehicles”. Mileage – number of miles the vehicle has been driven Cylinder – the number of cylinders in the engine Liter – the size of the engine in liters Doors – the number of doors Cruise = 1 if vehicle has cruise control, = 0 otherwise Sound = 1 if vehicle has upgraded sound system, = 0 otherwise Leather = 1 if vehicle has leather interior, = 0 otherwise Sedan = 1 if vehicle is a sedan, = 0 otherwise Buick = 1 if vehicle is a Buick, = 0 otherwise Cadillac = 1 if vehicle is a Cadillac, = 0 otherwise Chevrolet = 1 if vehicle is a Chevrolet, = 0 otherwise Pontiac = 1 if vehicle is a Pontiac, = 0 otherwise Saturn = 1 if vehicle is a Saturn, = 0 otherwise Note: If a vehicle is not a Buick nor a Cadillac nor a Chevrolet nor a Pontiac nor a Saturn it is a SAAB. a) [7] Consider the Forward selection process with the 13 explanatory variables listed above. i. [4] What will be the first variable entered into the model? How do you know this? Cylinder because it has the largest sum of squares (2159.989), largest F statistic (51.871) and the “Prob > F” = 0.0000 is less than 0.25. ii. [3] What will be the value of R2 once this first variable is entered? b) [5] At the end of the Forward procedure eleven variables have been entered. i. [2] What is the value of R2 for this eleven variable model? 3 ii. [3] Could this be the “best” model for predicting Price? Explain briefly. No. There are two variables (Cruise: F = 0.525, P-value = 0.4711 and Cadillac: F = 0.250, P-value = 0.6185) that do not add significantly to the model (the P-values are greater than 0.05). c) [7] Consider the Backward selection process with the 13 explanatory variables listed above. i. [4] What will be the first variable removed from the model? How do you know this? Doors. This variable adds the least (sum of squares = 0.01859) has the smallest F (0.003) and the P-value = 0.9542 is greater than 0.10. ii. [3] What will be the value of R2 once this first variable is removed? d) [5] At the end of the Backward procedure four variables have been removed. i. [2] What is the value of R2 for the resulting nine variable model? ii. [3] Could this be the “best” model for predicting Price? Explain briefly. Yes. All variables have P-values less than 0.05 so the model with Mileage, Cylinder, Liter, Sound, Buick, Chevrolet, Pontiac, Saturn and Sedan could be the best model. Note: because all of the variables are adding significantly it will be a useful model. 4 e) [8] After six steps of the Mixed procedure, six variables have been entered. i. [2] What are the six variables that have been entered? Cylinder, Cadillac, Cruise, Liter, Chevrolet and Pontiac. ii. [6] What will happen at the next step of the Mixed procedure? Be sure to support your answer by referring to the JMP output. Cylinder is no longer statistically significant (F = 1.523, P-value = 0.2203 > Prob to Leave) so at the next step Cylinder will be removed from the model. f) [8] The “best” model contains nine variables: Mileage, Cylinder, Liter, Sound, Buick, Chevrolet, Pontiac, Saturn, and Sedan. i. [3] What is the value of Cp for this nine variable model? The “best” model is the same as the model in the last step of Backward. Cp = 7.4772. ii. [2] What is the estimate of the error standard deviation for this nine variable model? RMSE = 2.336 iii. [3] Use this “best” model to predict the Price of a 2005 Buick sedan which has a 6 cylinder 3.8-liter engine with 20,000 miles and an upgraded sound system. Predicted Price = 30.075 – 0.000176(20000) – 2.108276(6) + 6.6690734(3.8) – 1.190737(1) – 14.98859(1) – 2.093084(1) = 20.975 ($1000) or $20,975. 5 3. [25] The regression diagnostics for the “best” model that contains the nine variables: Mileage, Cylinder, Liter, Sound, Buick, Chevrolet, Pontiac, Saturn, and Sedan are computed. Refer to the JMP output entitled “Regression Diagnostics for the “Best” Model”. i. [3] According to the distribution of residuals, what values are potential outliers? Give the residual values and the make and model of the vehicles. Residual = 8.8 or 8.76 Residual = 5.5 Chevrolet Corvette SAAB 9_3 HO ii. [6] Are there any vehicles with statistically significant standardized residuals? If so, what vehicles are they and what are the values of their standardized residuals? Be sure to support your answer statistically. Chevrolet Corvette z = 8.76/2.336 = 3.75 Prob >|z| = 0.0002 The P-value (0.0002) is less than 0.05/100 = 0.0005 so this standardized residual is statistically significant. iii. [2] What value of leverage would be considered high? h > 2(9+1/100) = 0.20 iv. [3] Are there any vehicles with high leverage? If so, what vehicles are they and what are the values of their leverage? Yes Chevrolet Corvette SAAB 9_3 Saturn L Series h = 0.224 h = 0.221 h = 0.231 v. [5] Compute the value of the F statistic for the vehicle with the highest leverage. Is this a statistically significant F statistic? Be sure to support your answer statistically. ( ) ( ) Prob > F = 0.00511 which is greater than 0.0005, so this leverage value is not statistically significant. 6 vi. [3] According to Cook’s D are there any highly influential points? Explain briefly. No, the largest value of Cook’s D is 0.52 which is not greater than 1. vii. [3] Are there any vehicles with statistically significant Studentized residuals? If so, what vehicles are they and what are their Studentized residuals? Yes, Chevrolet Corvette has a Studentized residual of 4.26 with a Pvalue = 0.0001 which is less than 0.0005. 4. [35] A company collected data on a random sample of 25 employees, 10 women and 15 men. The data included annual salary ($1000), number of years employed at the company and the gender of the employee. The number of years employed goes from 0 (a new employee) to 25. Refer to the JMP output entitled “Salary Study”. a) [3] Describe the general relationship between the number of years employed at the company and salary. As years increase, salary tends to increase. There is a positive linear relationship. There appear to be two groups. b) [2] Give the equation for predicting salary given only the number of years employed at the company. Predicted Annual Salary = 29.651 + 0.762*Years 7 c) [4] Comment on the plot of residuals for the simple linear regression of salary on number of years. Be sure to indicate what this plot tells you about predicting salaries for men and women. The simple linear regression model under predicts the salaries of men and over predicts the salaries of women. d) [3] Give the equation for predicting salary given number of years with the company, an indicator variable for gender and an interaction term gender*years? The indicator variable gender = 0 if the employee is a man, gender = 1 if the employee is a woman. Predicted Annual Salary = 32.470 + 0.840*Years – 6.348*Gender – 0.252*Years*Gender e) [3] Give an interpretation of the estimated intercept for the equation in d). The predicted (average) starting salary for a man is $32,470. f) [4] Give an interpretation of the estimated slope coefficient for years for the equation in d). The average annual increase in salary for men is $840 per year. g) [4] Give an interpretation of the estimated slope coefficient for gender for the equation in d). Women, on average, have a predicted starting salary $6,348 less than the predicted starting salary of men. 8 h) [5] Does the average yearly change in salary for women differ significantly from the average yearly change in salary for men? Support your answer with an appropriate test of hypothesis. Yes. This is supported by the test of hypothesis for the slope coefficient for the Years*Gender (interaction) term. The t value is – 4.28 with a P-value of 0.0003. The small P-value indicates that there is a statistically significant difference in average change in salary for women compared to men. A woman’s average annual increase in salary is $252 less than that for men. i) [5] Construct a 95% confidence interval for the difference in mean starting salaries for men and women. Use t* = 2.080. Give an interpretation of the confidence interval. ( ) –$8,089 to –$4,607 We are 95% confident that the average starting salary for women is from $4,607 to $8,089 less than that of men. j) [2] After 17 years what are the predicted salaries for a man and a woman at the company? Men: 32.470 + 0.840(17) = 46.75 ($1000) or $46,750 Women: 32.470 + 0.840(17) – 6.348(1) – 0.252(17*1) = 36.118 ( $1000) or $36,188 You can pick up your graded final exam after noon on Friday, December 19 or at the beginning of next semester at 1407 Wilson Hall. 9