Stat 401B Final Exam (slightly revised version)
Fall 2015

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed                                         Date

_________________________________________________________
Name Printed

ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit. Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit. SHOW YOUR WORK/EXPLAIN YOURSELF!

1. Some 1/2 inch finished hex nuts have weights that are normally distributed with mean 17 gm and standard deviation .6 gm.

4 pts  a) What fraction of these nuts have weights above 17.4 gm?

6 pts  b) These nuts are packaged by weight. A package (intended to hold at least 100 of these hex nuts) will be filled with a weight of nuts that is at least 1710 gm. Approximate the probability that 99 nuts have a total weight of at least 1710 gm (so that the actual count of nuts is less than the desired number). (Hint: What would the average weight of these 99 nuts have to be for this to happen?)

2. A data base is segmented as below in terms of "type of record" and "completeness of record." Suppose that one will select a single record at random from this data base.

              Type A         Type B         Type C
Complete      200 records    350 records    450 records
Incomplete     50 records     50 records    100 records

5 pts  a) Evaluate P(Type B | Incomplete).

5 pts  b) Are the events "Type B" and "Incomplete" independent? Say why or why not.

3. Some experimental data on the page "Using Central Composite Design for Process Optimization" on the weibull.com web site concern the tensile strength of welds made on steel. We will use these data in various ways in this problem.

First, $n = 7$ welds made at a standard set of process conditions produced $\bar{y} = 6611$ kgf and $s = 145$ kgf.

5 pts  a) Give 95% confidence limits for the mean strength of steel welds made under standard process conditions.
(Plug in completely, but you need not do arithmetic.)

5 pts  b) Interpret your interval from a). (Say carefully what is meant by the "95%" figure.)

5 pts  c) A weld is made at a non-standard set of process conditions and its strength tested. The value $y = 5830$ kgf is observed. Give 95% confidence limits for the difference in mean strengths for the two sets of welding process conditions. (Plug in completely, but you need not do arithmetic.)

5 pts  d) The non-standard set of process conditions referred to in part c) actually differs from the standard set only in the electrical current applied. Coded values of the current are $x = -1$ for the non-standard conditions and $x = 0$ for the standard conditions. A plot of the data is below. What model assumptions (be complete in stating them) would you make in order to support a prediction interval for the strength of a weld made with coded current $x = -.5$ (and all other process conditions standard)?

5 pts  e) In fact, for the situation of d) $s_{LF} = 145$ kgf, $\bar{x} = -.125$ and $\sum_{i=1}^{8} (x_i - \bar{x})^2 = .875$. Give 95% prediction limits for the next $y$ at $x = -.5$ under your assumptions of d). (The least squares line should be obvious to you from the plot in d) and the information given at the beginning of the problem.) (Plug in completely, but you need not do arithmetic.)

5 pts  f) There was also a weld made at coded electrical current $x = 1$ that had strength $y = 6210$. A plot of this data point, the 7 data points mentioned at the start of the problem, and the one mentioned in part c) is below. If strength is a linear function of current, $(\mu_{1} - \mu_{0}) - (\mu_{0} - \mu_{-1}) = 0$. Give 95% confidence limits for $(\mu_{1} - \mu_{0}) - (\mu_{0} - \mu_{-1})$. Is it plausible that strength is linear over this range of $x$?

The entire data set was used to fit a model for strength, $y$, as a linear function of coded current, $x_1$, coded voltage, $x_2$, coded "stick out", $x_3$, and coded angle, $x_4$. (See the R output beginning on page 7.)
5 pts  g) What fraction of the raw variability in $y$ is accounted for using $x_1$, $x_2$, $x_3$, and $x_4$ as predictor variables?

5 pts  h) What is the meaning of $b_3 = -305.56$? (Interpret this fitted coefficient.)

5 pts  i) There is R output for the fit of a full quadratic model for $y$ in the predictor variables $x_1$, $x_2$, $x_3$, and $x_4$ beginning on page 8. Give the value of an $F$ statistic and degrees of freedom for testing whether the quadratic model is a statistically significant improvement over the linear model in $x_1$, $x_2$, $x_3$, and $x_4$ for explaining $y$.

F = ___________    d.f. = _______ , _______

4. Lab #10 used balanced $2^3$ factorial data of Example 4 of Chapter 8 of Vardeman and Jobe. We will here continue use of that scenario.

5 pts  a) Below is the ANOVA table produced in Lab #10. Suppose that one runs the regsubsets() function from the leaps package using the 7 dummy variables created in the lab. Which model with 4 predictors will be identified as best, and what value of $R^2$ is associated with it?

5 pts  b) Why would it be impossible to answer part a) based on only the table above if the data were not balanced?

5 pts  c) CVSSE values for the best (in terms of $R^2$) models with $k = 1, 2, \ldots, 7$ factorial effects were computed using 8-fold cross validation. A plot of these is below. In light of this plot and the ANOVA table above, what "few effects" model for Power appears best? How does its CVSSE compare to the SSE that would be obtained fitting it?

5 pts  5. Below is a toy data set consisting of $N = 5$ pairs $(x, y)$. Find the LOO cross-validation SSE for 1-nn prediction. (You don't need to do arithmetic, but write out a complete numerical expression.)

y   2.5   4.0   4.5   7.5   8.0
x   1.0   2.0   2.5   3.5   4.0

6. Below are fake regression trees that you may assume come from $B = 3$ bootstrap samples of a large number, $N$, of $(x_{1i}, x_{2i}, y_i)$ data points.

5 pts  a) What is the random forest prediction at $(x_1, x_2) = (.5, .5)$? (Assume that the forest includes only the $B = 3$ trees represented above.)
5 pts  b) Suppose that in fact $(x_1, x_2, y) = (.5, .5, 3)$ was part of the bootstrap samples that were used to produce trees #1 and #2, but not #3. What is this data point's contribution to an OOB error sum of squares?

5 pts  7. Consider the ordinary least squares predictor $\hat{y}_{\mathrm{OLS}}(x)$ (from standard multiple linear regression) and a lasso predictor $\hat{y}_{\mathrm{Lasso}}(x)$ (for some $\lambda$) computed for the same data. Is it true that the SSE for $\hat{y}_{\mathrm{Lasso}}(x)$ is always at least as big as that for $\hat{y}_{\mathrm{OLS}}(x)$? Say why or why not.

R Code and Output

> Welds
   current voltage stickout angle strength
1       -1      -1       -1    -1     4730
2        1      -1       -1    -1     4990
3       -1       1       -1    -1     4240
4        1       1       -1    -1     7320
5       -1      -1        1    -1     7130
6        1      -1        1    -1     4920
7       -1       1        1    -1     4110
8        1       1        1    -1     5020
9       -1      -1       -1     1     5560
10       1      -1       -1     1     4910
11      -1       1       -1     1     5330
12       1       1       -1     1     7490
13      -1      -1        1     1     6820
14       1      -1        1     1     4030
15      -1       1        1     1     3690
16       1       1        1     1     4210
17      -1       0        0     0     5830
18       1       0        0     0     6210
19       0      -1        0     0     6230
20       0       1        0     0     6530
21       0       0       -1     0     6370
22       0       0        1     0     5510
23       0       0        0    -1     6390
24       0       0        0     1     6110
25       0       0        0     0     6550
26       0       0        0     0     6650
27       0       0        0     0     6750
28       0       0        0     0     6610
29       0       0        0     0     6340
30       0       0        0     0     6600
31       0       0        0     0     6780

> summary(lm(strength~.,Welds))

Call:
lm(formula = strength ~ ., data = Welds)

Residuals:
    Min      1Q  Median      3Q     Max
-1740.7 -1023.5   312.6   803.2  1607.1

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5805.16     199.69  29.070   <2e-16 ***
current        92.22     262.06   0.352    0.728
voltage       -76.67     262.06  -0.293    0.772
stickout     -305.56     262.06  -1.166    0.254
angle         -38.89     262.06  -0.148    0.883
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1112 on 26 degrees of freedom
Multiple R-squared:  0.05766,  Adjusted R-squared:  -0.08731
F-statistic: 0.3977 on 4 and 26 DF,  p-value: 0.8084

> anova(lm(strength~.,Welds))
Analysis of Variance Table

Response: strength
          Df   Sum Sq Mean Sq F value Pr(>F)
current    1   153089  153089  0.1238 0.7277
voltage    1   105800  105800  0.0856 0.7722
stickout   1  1680556 1680556  1.3595 0.2542
angle      1    27222   27222  0.0220 0.8832
Residuals 26 32141108 1236196

> summary(lm(strength~.,data.frame(Welds2)))

Call:
lm(formula = strength ~ ., data = data.frame(Welds2))

Residuals:
    Min      1Q  Median      3Q     Max
-262.13 -142.13    5.84  104.91  350.64

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      6544.16      69.59  94.045  < 2e-16 ***
current           190.00     165.87   1.145 0.268853
voltage           150.00     165.87   0.904 0.379236
stickout         -305.56      55.29  -5.526 4.60e-05 ***
angle             -38.89      55.29  -0.703 0.491936
current2         -445.68     145.61  -3.061 0.007469 **
voltage2          -85.68     145.61  -0.588 0.564470
stickout2        -525.68     145.61  -3.610 0.002348 **
angle2           -215.68     145.61  -1.481 0.157979
currentvoltage    753.75      58.64  12.853 7.56e-10 ***
currentstickout  -526.25      58.64  -8.974 1.21e-07 ***
currentangle     -110.00     175.93  -0.625 0.540624
voltagestickout  -628.75      58.64 -10.722 1.03e-08 ***
voltageangle     -255.00     175.93  -1.449 0.166533
stickoutangle    -277.50      58.64  -4.732 0.000226 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 234.6 on 16 degrees of freedom
Multiple R-squared:  0.9742,  Adjusted R-squared:  0.9516
F-statistic: 43.13 on 14 and 16 DF,  p-value: 5.149e-10

> anova(lm(strength~.,data.frame(Welds2)))
Analysis of Variance Table

Response: strength
                Df  Sum Sq Mean Sq  F value    Pr(>F)
current          1  153089  153089   2.7822 0.1147672
voltage          1  105800  105800   1.9228 0.1845697
stickout         1 1680556 1680556  30.5418 4.601e-05 ***
angle            1   27222   27222   0.4947 0.4919356
current2         1 8379097 8379097 152.2787 1.371e-09 ***
voltage2         1  554530  554530  10.0778 0.0058839 **
stickout2        1  990677  990677  18.0042 0.0006200 ***
angle2           1  120721  120721   2.1939 0.1579786
currentvoltage   1 9090225 9090225 165.2025 7.560e-10 ***
currentstickout  1 4431025 4431025  80.5279 1.212e-07 ***
currentangle     1   21511   21511   0.3909 0.5406241
voltagestickout  1 6325225 6325225 114.9524 1.033e-08 ***
voltageangle     1  115600  115600   2.1009 0.1665326
stickoutangle    1 1232100 1232100  22.3917 0.0002256 ***
Residuals       16  880396   55025
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
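
For readers who want to rerun the fits, the following is a minimal R sketch, not the code used to produce the output above. It rebuilds the 31 runs listed under "> Welds" and fits the linear model with the same lm() call shown there. The construction of Welds2 is not shown in the exam, so the quadratic fit below ASSUMES plain squares and pairwise products of the four coded variables; under that assumption its coefficient table need not match the Welds2 output. The object names linear and quadratic are introduced here for illustration only.

```r
# Sketch only: rebuild the 31 runs printed under "> Welds" above.
Welds <- data.frame(
  current  = c(rep(c(-1, 1), times = 8), -1, 1, rep(0, 13)),
  voltage  = c(rep(c(-1, -1, 1, 1), times = 4), 0, 0, -1, 1, rep(0, 11)),
  stickout = c(rep(rep(c(-1, 1), each = 4), times = 2),
               rep(0, 4), -1, 1, rep(0, 9)),
  angle    = c(rep(c(-1, 1), each = 8), rep(0, 6), -1, 1, rep(0, 7)),
  strength = c(4730, 4990, 4240, 7320, 7130, 4920, 4110, 5020,
               5560, 4910, 5330, 7490, 6820, 4030, 3690, 4210,
               5830, 6210, 6230, 6530, 6370, 5510, 6390, 6110,
               6550, 6650, 6750, 6610, 6340, 6600, 6780)
)
stopifnot(nrow(Welds) == 31)  # guard against a transcription slip

# Linear model in the four coded variables (same call as in the output above).
linear <- lm(strength ~ ., data = Welds)

# ASSUMPTION: a full quadratic model built from squares and two-way products.
# The exam's Welds2 columns may have been constructed differently.
quadratic <- lm(
  strength ~ .^2 + I(current^2) + I(voltage^2) + I(stickout^2) + I(angle^2),
  data = Welds
)

summary(linear)

# Partial F comparison of the two nested fits.
anova(linear, quadratic)
```

The two-model form anova(linear, quadratic) reports a single F statistic for all ten added terms at once, which is the kind of nested-model comparison part i) of problem 3 asks about; the sequential tables printed above instead give one line per term.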