STATISTICS 401D Spring 2016
Laboratory Assignment 8

1. Thirteen specimens of Cu-Ni alloys with varying iron content (in percent) were submerged in sea water for 60 days, and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day. In a study to examine the dependency of corrosion (y) on iron content (x), a simple linear regression model was fitted to the data.

Specimen    x, Fe %    y, Weight Loss (mg/dm²/day)
    1         0.01        127.6
    2         0.48        124.0
    3         0.71        110.8
    4         0.95        103.9
    5         1.19        101.5
    6         0.01        130.2
    7         0.48        122.0
    8         1.44         92.3
    9         0.71        113.2
   10         1.96         83.7
   11         0.01        128.0
   12         1.44         91.7
   13         1.96         86.3

Answer the following questions. For parts a) to d), do all computations by hand using these quantities computed from the data:

Σy = 1,415.2, Σy² = 157,447.7, Σx = 11.35, Σx² = 15.6183, Σxy = 1,098.628

a) Plot the data in a y vs. x scatter plot. Does it appear that a simple linear regression model would be a good fit?
b) Use the LS method to fit a simple linear regression model. What is your prediction equation?
c) Construct an analysis of variance table for the regression. Use the F-ratio to perform a test of H0: β1 = 0 vs. Ha: β1 ≠ 0 using α = .05.
d) Compute a lack of fit test for this model and report the results in an anova table where SSLack and SSPexp are shown as a partition of SSE, as demonstrated in the class example. What is your conclusion from this test? Use α = .05.

For the rest of the problem you may use a JMP analysis using the data table corrosion.jmp. Attach the output to your solution.

e) Use JMP to obtain the answers to parts a) to d).
f) Compute the predicted values and residuals. Obtain plots of the residuals against Fe % and the predicted values (ŷ), respectively. Do these two plots suggest any inadequacies of this model? Explain why you reached your conclusion.

2. One task assigned to foresters is to estimate the potential lumber harvest of a forest.
This is typically done by selecting a sample of trees, making some nondestructive measurements of these trees, then using a prediction formula to estimate lumber yield. The prediction formula is obtained from a previous study involving a sample of trees for which actual lumber yields were available from harvesting. The data, shown on page 3, are from such a study and include the volume of lumber in cubic feet (y) and several predictor variables: the diameter of the trunk at breast height (about 4 feet), in inches (x1), the height, in feet (x2), and the diameter of the trunk at 16 feet of height, in inches (x3), measured on a random sample of 20 trees. Use JMP to perform a multiple regression analysis to fit the full model y = β0 + β1 x1 + β2 x2 + β3 x3 + ε to this data.

Part I

The following are the X′X matrix, X′y vector, y′y, and the inverse of the X′X matrix, respectively, computed for this data (results rounded to a manageable number of digits):

X′X =
[   20          310.63       1910.94        276.5    ]
[  310.63      4889.0619    29743.6588     4348.945  ]
[ 1910.94     29743.6588   183029.3358    26486.925  ]
[  276.5       4348.945     26486.925      3873.69   ]

X′y =
[   1237.03   ]
[  19659.1047 ]
[ 118970.1884 ]
[  17516.935  ]

y′y = 80256.5195

(X′X)⁻¹ =
[ 21.782755949   −0.439044462    −0.228898883     0.503209019 ]
[ −0.439044462    0.1620460292    0.0040467803   −0.178259035 ]
[ −0.228898883    0.0040467803    0.0029279638   −0.008225088 ]
[  0.503209019   −0.178259035    −0.008225088     0.2207091285 ]

Use these to perform the following calculations using matrix algebra. You may recalculate (X′X)⁻¹ if you want more accuracy in your answers. The files X.txt, XPX.txt, XPy.txt, and XPXI.txt are available in the Downloads folder at the course website for use with other software. Show work.

(a) Construct the normal equations.
(b) Calculate the estimate β̂ by forming the product of (X′X)⁻¹ and X′y.
(c) Calculate s² using SSE = y′y − β̂′X′y. Use
(d) Calculate the standard errors s_β̂1, s_β̂2, and s_β̂3 of β̂1, β̂2, and β̂3, respectively.
(e) Calculate the predicted values ŷ using the fact that ŷ = Xβ̂.

Part II

Write your answers to the following questions on separate pages using numbers extracted from a JMP analysis of the data performed on the data table lumber.JMP. No hand calculations are needed for this part. Attach the JMP outputs to your answer.

(a) Obtain the correlations and the scatter plot matrix for y, x1, x2, and x3.
(b) Report β̂0, β̂1, β̂2, and β̂3.
(c) Report s², s_β̂0, s_β̂1, s_β̂2, and s_β̂3.
(d) Report 95% confidence intervals for β1, β2, and β3, respectively.
(e) Construct an analysis of variance for the above regression. Report the coefficient of determination.
(f) Use the F-test statistic to test H0: β1 = β2 = β3 = 0 vs. Ha: at least one β is not zero, and report the p-value for the test. State your decision.
(g) Use the t-test statistic to test H0: β2 = 0 vs. Ha: β2 ≠ 0, and report the p-value for the test. State your decision.
(h) Obtain a 95% confidence interval for β2. Use this interval to test H0: β2 = 0 vs. Ha: β2 ≠ 0. What is the α level of this test?
(i) Obtain a 95% confidence interval for the mean lumber volume for a population of trees (of the same variety) with x1 = 15.5, x2 = 90, x3 = 14.1. Describe in words what this interval tells you.
(j) Obtain a 95% prediction interval for the lumber volume y₂₁ of a tree (of the same variety) with x1 = 15.5, x2 = 90, x3 = 14.1. Describe in words what this interval tells you.
(k) Is there evidence in your analysis to indicate any multicollinearity problems in the estimation of the coefficients? Discuss how you determined your answer.
(l) Obtain plots of the residuals vs. predicted values, x1, x2, and x3, respectively. Is any pattern of the types discussed in class observed in these plots? Give your interpretation.
(m) Obtain a normal probability plot of the studentized residuals. State the model assumption that can be examined using this plot. Does this assumption appear to be plausible here?
(n) Fit the regression model y = β0 + β2 x2 + β3 x3 + ε to the above data. Use values from the resulting output and the previous output to construct an F-statistic to test H0: β1 = 0 vs. Ha: β1 ≠ 0 in the full model. Perform the test at α = .05.
(o) Examine to what extent multicollinearity problems affect the fit of the model in part (n) compared to the full model. If the model in part (n) is better, discuss reasons for the improvement.

Data:

Diameter at Breast Height    Height    Diameter at 16 feet    Volume
          x1                   x2              x3                y
        10.20                 89.00            9.3             25.93
        13.72                 90.07           12.1             45.87
        15.43                 95.08           13.3             56.20
        14.37                 98.03           13.4             58.60
        15.00                 99.00           13.5             63.36
        15.02                 91.05           12.8             46.35
        15.12                105.60           14.0             68.99
        15.24                100.80           13.5             62.91
        15.24                 94.00           14.0             58.13
        15.28                 93.09           13.8             59.79
        13.78                 89.00           12.6             56.20
        15.67                102.00           14.0             66.16
        15.67                 99.00           13.7             62.18
        15.98                 89.02           13.9             57.01
        16.50                 95.09           14.9             65.62
        16.87                 95.02           14.9             65.03
        17.26                 91.02           14.3             66.74
        17.28                 98.06           14.3             73.38
        17.87                 96.01           16.9             82.87
        19.13                101.00           17.3             95.71

Due Tuesday, April 19, 2016 (turn in during the first 15 min. of the lab)
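
The matrix calculations asked for in Part I of Problem 2 can be cross-checked numerically. The following is a minimal sketch in Python with NumPy, which is an assumption on our part: the assignment itself calls for hand calculation and JMP, and only mentions that the matrix files may be used with "other software". The sketch rebuilds X from the data table above, forms X′X, X′y, and y′y, solves the normal equations directly instead of multiplying by the rounded inverse, and then computes SSE = y′y − β̂′X′y, s² = SSE/(n − 4), the standard errors, and ŷ = Xβ̂. All variable names are illustrative.

```python
import numpy as np

# Data transcribed from the data table in this assignment (20 trees).
x1 = np.array([10.20, 13.72, 15.43, 14.37, 15.00, 15.02, 15.12, 15.24, 15.24, 15.28,
               13.78, 15.67, 15.67, 15.98, 16.50, 16.87, 17.26, 17.28, 17.87, 19.13])
x2 = np.array([89.00, 90.07, 95.08, 98.03, 99.00, 91.05, 105.60, 100.80, 94.00, 93.09,
               89.00, 102.00, 99.00, 89.02, 95.09, 95.02, 91.02, 98.06, 96.01, 101.00])
x3 = np.array([9.3, 12.1, 13.3, 13.4, 13.5, 12.8, 14.0, 13.5, 14.0, 13.8,
               12.6, 14.0, 13.7, 13.9, 14.9, 14.9, 14.3, 14.3, 16.9, 17.3])
y = np.array([25.93, 45.87, 56.20, 58.60, 63.36, 46.35, 68.99, 62.91, 58.13, 59.79,
              56.20, 66.16, 62.18, 57.01, 65.62, 65.03, 66.74, 73.38, 82.87, 95.71])

# Design matrix: a column of ones for the intercept, then x1, x2, x3.
X = np.column_stack([np.ones_like(y), x1, x2, x3])
n, p = X.shape  # n = 20 observations, p = 4 coefficients

# Building blocks of the normal equations (X'X) b = X'y.
XtX = X.T @ X
Xty = X.T @ y
yty = y @ y

# Solve the normal equations directly; this avoids the rounding
# carried by a printed inverse.
beta_hat = np.linalg.solve(XtX, Xty)

# SSE = y'y - b'X'y, and s^2 = SSE / (n - p).
SSE = yty - beta_hat @ Xty
s2 = SSE / (n - p)

# Standard errors: square roots of s^2 times the diagonal of (X'X)^-1.
XtX_inv = np.linalg.inv(XtX)
se = np.sqrt(s2 * np.diag(XtX_inv))

# Predicted values y_hat = X b.
y_hat = X @ beta_hat

print("beta_hat:", beta_hat)
print("SSE:", SSE, " s^2:", s2)
print("standard errors:", se)
```

Comparing the printed `XtX`, `Xty`, and `yty` with the values given in Part I is a quick way to confirm the data were entered correctly before interpreting the estimates.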