Stat 511 Fall 2005 Midterm 2

Statistics 511 Midterm 2
Nov. 15, 2005

The following rules apply.
1. You may use 3 sheets of paper for any information you need - double-sided, any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. You may not use outside resources, such as the internet. As well, you may not store notes or formulas on your calculator.
5. Failure to comply with item 3 could lead to reduction in your grade, or disciplinary action.

I have read the rules above and agree to comply with them.

Signature ________________________________________________
Name (printed) ___________________________________________

1. (21) In 1929, Edwin Hubble investigated the relationship between distance of a galaxy from the earth and the velocity with which it appears to be receding. The data collected include distances (megaparsecs) from earth to 24 galaxies and their recession velocities (km/sec). Note: 1 megaparsec = 3,260,000 light years. Hubble's theory was that at some time all the galaxies were compressed into a single point in space and that since then they have moved apart at an increasing velocity leading to Hubble's Law:

Recession Velocity = G*Distance

where G is Hubble's constant. Note that some of the galaxies appear to be moving towards us, and hence have negative velocity. Some plots and computer output are in the "Computer Output" handout.

a) (3) The investigator regressed Velocity (Y) on Distance (X). Does the ordinary linear regression model fit the data? Justify your conclusion.

The ordinary linear regression model fits the data.
1. The plot of studentized residuals shows an even spread of the residuals around 0. There is no curvature. There are no outliers. (any 2)
2. The normal probability plot of the residuals is very linear, indicating that the errors are close to normal. (required)

b) (3) Is regression through the origin a suitable model for these data? Give 3 justifications for your answer.
Yes, this is a suitable model.
1. Hubble's Law indicates that the intercept should be 0.
2. There are data near X=0.
3. In the ordinary linear regression model, we do not reject the hypothesis that the intercept is zero.
So, regression through the origin is a reasonable model for these data.

c) (1) What is the estimated value of Hubble's constant using the regression through the origin method?

423.93732 km/sec/Mpc (You do not need the units for full points.)

d) (2) A quasar is receding from us at a speed of 48,000 km/sec. Using the regression through the origin model, how far away do you estimate it to be in megaparsecs?

You can estimate distance either by inverting the regression of velocity on distance or by computing the regression of distance on velocity.

If distance is considered to be a known quantity, then it is most correct to invert the regression equation. This is called the calibration problem. For calibration estimates, confidence intervals for X are computed by inverting confidence intervals for $\hat{Y}$. (There is more on this in the text.) In this case the estimated distance is:
Distance = Speed/423.93732 = 48,000/423.93732 = 113.224 Mpc

If distance is considered to be a random variable, then it is most correct to regress distance on velocity. (See part g.) In this case the estimated distance is:
0.0019218*48,000 = 92.25 Mpc

Note: According to my understanding, we can assign a velocity to every galaxy due to the redshift in its spectrum. The distance is known only for galaxies in which certain types of stars are observed. Hence, the distance to most galaxies is derived from this type of computation.

e) (1) Is the estimate in part d an extrapolation? Justify your answer.

Yes, it is an extrapolation. No matter which equation you used in part d, the quasar is very far outside the range of the galaxy data used to form the prediction.

(The Dept. of Statistics at Penn State has an institute of Astrostatistics. If you are interested in these types of data, talk with Dr. Jogesh Babu.)
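The two distance estimates in part d can be reproduced with a few lines of arithmetic. The following Python sketch is only an illustration: the slopes are the values quoted in the answers above, and the function and variable names are hypothetical.

```python
# Values copied from the answers above; all names here are illustrative.
slope_velocity_on_distance = 423.93732  # km/sec per megaparsec (part c)
slope_distance_on_velocity = 0.0019218  # megaparsecs per km/sec (quoted in part d)
quasar_velocity = 48_000.0              # km/sec


def calibration_distance(velocity, slope):
    """Invert Velocity = slope * Distance to estimate the distance."""
    return velocity / slope


def direct_distance(velocity, slope):
    """Predict distance from a through-the-origin fit of Distance on Velocity."""
    return slope * velocity


inverse_estimate = calibration_distance(quasar_velocity, slope_velocity_on_distance)
direct_estimate = direct_distance(quasar_velocity, slope_distance_on_velocity)

print(f"calibration estimate: {inverse_estimate:.3f} megaparsecs")
print(f"direct estimate:      {direct_estimate:.2f} megaparsecs")
```

The two estimates differ because the slope of Distance on Velocity is not the reciprocal of the slope of Velocity on Distance unless the points lie exactly on a line through the origin.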
f) (2) From the output given, we can compute $\sum_i y_i^2 = 6511425$ and $\sum_i x_i^2 = 29.518$. How can $\sum_i x_i^2 = 29.518$ be computed from the output given?

For regression through the origin, the regression sum of squares is
$\sum_i \hat{y}_i^2 = b_1^2 \sum_i x_i^2$
$b_1$ and SSR are given. So,
$\sum_i x_i^2 = SSR / b_1^2 = 5305022/(423.93732)^2 = 29.518$

g) (3) Suppose that for part d) you regressed Distance on Velocity (i.e. switched the roles of Y and X). What is the slope of the regression of Distance on Velocity?

$\sum_i x_i^2 = 29.518$
In the regression of Velocity on Distance, $b_1 = 423.93732 = \sum_i x_i y_i / \sum_i x_i^2$
So, $\sum_i x_i y_i = (29.518)(423.93732) = 12513.782$
The required slope is $\sum_i x_i y_i / \sum_i y_i^2 = 12513.782/6511425 = 0.0019218$
The units would be Mpc/(km/sec).
(1 point if you just invert the slope.)

h) (3) Hubble's constant, G, is estimated to be about 75. Is this supported by the data? Do a formal test using the regression through the origin model. State the null and alternative hypotheses.

$H_0: \beta_1 = 75$
$H_A: \beta_1 \neq 75$
$t^* = (423.93732 - 75)/42.15414 = 8.278$
p<.0001
We reject the null hypothesis. G is not 75.

i) (3) A galaxy is thought to be 2.1 megaparsecs from earth. Use the regression through the origin model to compute a 90% interval for its recession velocity.

This should be a prediction interval. The interval is:
$\hat{Y} \pm t_{.95,23} \sqrt{MSE(1+h)}$
$h = x^2 / \sum_i x_i^2 = (2.1)^2/29.518 = 0.1494$
$(423.93732)(2.1) \pm 1.714 \sqrt{(52452)(1.1494)} = 890.2677 \pm 420.850 = (469.4177, 1311.1177)$ km/sec

2. Suppose $V_1 \sim N(3,5)$ independent of $V_2 \sim N(2,4)$
$W_1 = V_1 + 2V_2$
$W_2 = V_1 - 2V_2$
$W = \begin{pmatrix} W_1 \\ W_2 \end{pmatrix}$

a) (2) What is E(W)?

$E(W) = \begin{pmatrix} E(W_1) \\ E(W_2) \end{pmatrix} = \begin{pmatrix} E(V_1) + 2E(V_2) \\ E(V_1) - 2E(V_2) \end{pmatrix} = \begin{pmatrix} 3 + 2 \cdot 2 \\ 3 - 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 7 \\ -1 \end{pmatrix}$

b) (3) What is Var(W)?

$Var(W_1) = Var(V_1) + 4Var(V_2) = 5 + 4 \cdot 4 = 21$
$Var(W_2) = Var(W_1)$
$Cov(W_1, W_2) = Var(V_1) - 4Var(V_2) = 5 - 4 \cdot 4 = -11$
$Var(W) = \begin{pmatrix} 21 & -11 \\ -11 & 21 \end{pmatrix}$

3. (4) Suppose that $Y = X\beta + \epsilon$ where $Y$ and $\epsilon$ are n x 1, $\beta$ is p x 1 and $X$ is n x p and $E(\epsilon) = 0$.
Suppose A is n x n and (X'AX) is invertible. Show that $(X'AX)^{-1}X'AY$ is an unbiased estimator of $\beta$.

$E(Y) = X\beta$
So, $E[(X'AX)^{-1}X'AY] = (X'AX)^{-1}X'A\,E(Y) = (X'AX)^{-1}X'AX\beta = (X'AX)^{-1}(X'AX)\beta = \beta$

4. The SENIC data consists of a simple random sample of hospitals which was collected in a study of the risk of patients incurring a new infection when in the hospital for treatment of an unrelated condition. We will focus on predicting RISK, the percentage of patients in the hospital who became infected, from several other variables:

LENGTH   the average length of stay in the hospital (days)
AGE      the average age of patients
NURSES   the number of nurses
BEDS     the number of beds
CENSUS   the average number of occupied beds

Computer output for this problem is in the "Computer Output" handout.

a) (3) What is the theoretical regression model for predicting RISK from the other 5 variables? Define all of your notation and make sure you include the distribution assumptions.

$\text{Risk}_i = \beta_0 + \beta_1 \text{Length}_i + \beta_2 \text{Age}_i + \beta_3 \text{Nurses}_i + \beta_4 \text{Beds}_i + \beta_5 \text{Census}_i + \epsilon_i$
where $\text{Risk}_i$ is the risk of infection for hospital i,
the predictor variables are as described above,
$\beta_0, \ldots, \beta_5$ are unknown regression coefficients,
and $\epsilon_i$ are errors, with $\epsilon_i$ i.i.d. $N(0, \sigma^2)$.

b) (4) (You knew this one was coming!) Fill in the 8 blanks in the ANOVA table.

Analysis of Variance

Source       DF      Sum of Squares   Mean Square   F Value   Pr > F
Model        _5_     _72.27455_       14.45491      _11.98_   <.0001
Error        _107_   _129.10513_      1.20659
Cor. Total   _112_   _201.37968_

R-Square   _0.3589_

c) (5) Below are the plot of RISK versus LENGTH and the partial regression leverage plot of RISK versus LENGTH. Explain the difference between these 2 plots, and 3 things that we might look for on these plots. Also, identify a high leverage point on at least one of the 2 plots.
[Plot: The REG Procedure, Model: MODEL1. Partial Regression Residual Plot of risk (vertical axis, about -3 to 4) versus length (horizontal axis, about -3 to 8).]

[Plot: risk (vertical axis, 1 to 8) versus length (horizontal axis, 6 to 20).]

The plot of risk versus length uses the actual measured values of the variables. The partial regression leverage plot has on the X-axis, the residuals of length after regressing length on age, beds, nurses and census, and on the Y-axis, the residuals of risk after regressing risk on age, beds, nurses and census.

Some of the things we might look for on these plots:

Plot of risk versus length:
The shape of the regression function (curved or linear)
Constancy of spread.
Outliers in risk.
High leverage or outlying values of Length.

Partial regression leverage plot of risk versus length:
The shape of the regression function (curved or linear)
Constancy of spread.
High leverage values.

On either plot we can see that risk has a strong relationship with length. The partial regression leverage plot is a better plot for detecting linearity and leverage. On that plot, we see that the relationship is essentially linear, but that there is a high leverage point (in blue) that might induce curvature or reduce the slope of the line. On both plots we can see that the data are evenly spread around the line, so the equal variance assumption likely holds.
One curious feature of these plots is that there appear to be 2 high leverage points (in blue) on the plot of risk versus length, but only one on the partial regression leverage plot. This would indicate that the other point might be accounted for by one of the other variables in the model.

c) (2) What is the likely effect of the high leverage point on the regression coefficients? Briefly justify your answer.

The high leverage point(s) lie below the line that goes through the remaining points. So, the effect of the point(s) is to reduce the regression coefficient of Length.

Or (since I was not specific that I meant the point identified earlier) in general a high leverage point can change the regression slope so that the line comes closer to the y value for that point.

d) (2) Below is the plot of studentized residuals versus LENGTH. Summarize the important features of this plot for checking the regression assumptions.

[Plot: studentized residuals (vertical axis, -3 to 3) versus length (horizontal axis, 6 to 20).]

There are 2 very outlying values of Length, and both of these have negative residuals.
The other points appear to have mean 0 and constant variance.
There are no outliers.
(Any 2 for full points)
I also accepted variance is not constant (or even curvature) if it was clear that the effect was due to the 2 high leverage points, e.g. by a sketch on the plot, or a verbal description.

e) (1) What is the predicted risk for a hospital with
LENGTH=8.82
AGE=58.2
BEDS=80
NURSES=52
CENSUS=51

predicted risk = 1.63109 + 0.37974*8.82 - 0.02308*58.2 + 0.00506*52 + 0.00041805*80 - 0.00362*51 = 3.7492

f) (1) The leverage for the hospital in part e is 0.027. Is this hospital a high leverage point?

We compare with 2p/n = 2*6/113 = 0.106. This leverage is less than our cut-off, so the hospital is not a high leverage point.
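The prediction and the leverage check can be verified with a short computation. The Python sketch below is only illustrative. It assumes that 0.00506 is the NURSES coefficient and 0.00041805 is the BEDS coefficient, which is the pairing that reproduces the stated prediction of 3.7492; the dictionary and variable names are hypothetical.

```python
# Fitted coefficients as quoted in the answer key.
# ASSUMPTION: 0.00506 belongs to NURSES and 0.00041805 to BEDS.
coefficients = {
    "intercept": 1.63109,
    "length": 0.37974,
    "age": -0.02308,
    "nurses": 0.00506,
    "beds": 0.00041805,
    "census": -0.00362,
}

# Predictor values for the hospital in the question.
hospital = {"length": 8.82, "age": 58.2, "nurses": 52, "beds": 80, "census": 51}

# Predicted risk: intercept plus the sum of coefficient * predictor value.
predicted_risk = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in hospital.items()
)

# Rule-of-thumb leverage cut-off 2p/n, with p = 6 parameters and n = 113 hospitals.
n_parameters = 6
n_hospitals = 113
leverage = 0.027
cutoff = 2 * n_parameters / n_hospitals
is_high_leverage = leverage > cutoff

print(f"predicted risk: {predicted_risk:.4f}")
print(f"leverage cut-off: {cutoff:.3f}, high leverage: {is_high_leverage}")
```

With the rounded coefficients shown, the prediction agrees with the key's 3.7492 to about three decimal places; the small difference presumably comes from rounding in the printed coefficients.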
g) (3) Compute a 95% interval for the mean risk for hospitals with the same values of the predictor variables as the hospital in part e.

This is a confidence interval, since we are interested in mean risk. I ask this type of question every year and I put this on the review list: $Var(\hat{Y}) = \sigma^2 h$, and h was given in part f.

$\hat{Y} \pm t_{.975,107} \sqrt{MSE \cdot h} = 3.7492 \pm 1.98 \sqrt{(1.2066)(0.027)} = 3.7492 \pm 0.3574$
(3.3918, 4.1066)

h) (3) Compute a 95% confidence interval for the regression coefficient of BEDS when the other variables are in the model.

The interval has the form
$b_4 \pm t_{.975,107} \cdot s(b_4) = 0.00042 \pm 1.98(0.00303) = 0.00042 \pm 0.006$
(-0.0056, 0.0064)

i) (2) Is there evidence of high multicollinearity among the predictor variables? Briefly justify your answer.

There are many variance inflation factors higher than 10. So there is high multicollinearity among the predictor variables.

j) (2) Compute the TOLERANCE for BEDS. What does this tell us?

TOLERANCE = 1/VIF = 1/31.77778 = 0.0315
This is much smaller than 0.1, so BEDS is highly collinear with the other variables.
Alternatively, TOLERANCE = $1 - R^2$ for the regression of BEDS on the other predictors. So, the percentage of the variance in BEDS explained by the other variables in the model is 96.85%.

k) (3) Test whether the effect of LENGTH on risk is statistically significant when the other variables are in the model. Be sure to include your null and alternative hypothesis, p-value and conclusion. You may use whatever statistics you can obtain from the computer output, and compute the rest.

$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

This test statistic can usually be found on the computer output, but since it was deleted:
$t^* = (b_1 - \beta_1)/s(b_1) = 0.37974/0.06808 = 5.5778$

Alternatively, you can use $F^* = SSII(\text{length})/MSE = 31.1139 = t^{*2}$

Naturally, the p-value is the same both ways: p<.001

The effect of length is statistically significant when the other variables are in the model.
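The interval and test computations in parts g), h), j) and k) can be checked numerically. The Python sketch below uses only the values quoted in the answers; the critical value 1.98 is taken from the key rather than computed, and the function and variable names are illustrative.

```python
import math

T_CRITICAL = 1.98  # t(.975, 107) as used in the answer key (not recomputed here)


def mean_response_interval(y_hat, leverage, mse, t_crit):
    """Confidence interval for the mean response: y_hat +/- t * sqrt(MSE * h)."""
    half_width = t_crit * math.sqrt(mse * leverage)
    return y_hat - half_width, y_hat + half_width


def coefficient_interval(estimate, std_error, t_crit):
    """Confidence interval for one regression coefficient: b +/- t * s(b)."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# Part g: mean risk at the given predictor values (h = 0.027, MSE = 1.2066).
mean_lower, mean_upper = mean_response_interval(3.7492, 0.027, 1.2066, T_CRITICAL)

# Part h: coefficient of BEDS (b4 = 0.00042, s(b4) = 0.00303).
beds_lower, beds_upper = coefficient_interval(0.00042, 0.00303, T_CRITICAL)

# Part j: tolerance is the reciprocal of the variance inflation factor.
tolerance_beds = 1 / 31.77778

# Part k: t statistic for LENGTH, and its square for comparison with F*.
t_star = 0.37974 / 0.06808
f_from_t = t_star ** 2

print(f"mean risk CI: ({mean_lower:.4f}, {mean_upper:.4f})")
print(f"BEDS coefficient CI: ({beds_lower:.4f}, {beds_upper:.4f})")
print(f"tolerance for BEDS: {tolerance_beds:.4f}")
print(f"t* = {t_star:.4f}, t*^2 = {f_from_t:.3f}")
```

The squared t statistic from the rounded estimate and standard error comes out close to, but not exactly equal to, the quoted F* of 31.1139; the gap is consistent with rounding in the printed values.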
l) (2) The study administrator notes that the p-values for the coefficients of average patient age, number of beds and the census are not statistically significant and concludes none of these variables contributes significantly to the risk. Is this conclusion statistically justified? Briefly explain your answer.

The t-tests can only be used to determine the statistical significance of 1 of the variables when ALL of the other variables are in the model. So, the t-tests cannot be used to conclude that none of the 3 variables contributes significantly to the risk.

m) (3) The biostatistical staff suggest doing a simultaneous F-test to determine whether AGE, BEDS and CENSUS are jointly significant when the other variables are in the model. They compute the F-statistic:

$F^* = \dfrac{[SSI(\text{AGE}) + SSI(\text{BEDS}) + SSI(\text{CENSUS})]/3}{MSE(\text{full})} = \dfrac{(2.07506 + 2.76081 + 1.04570)/3}{1.20659}$

and compare the result with F(3, 107). Is this test statistically justified?

The staff are on the right track, since they do want to use a simultaneous F-test, but the computation is incorrect. The SSI depend on having the variables in the model in the right order. For the test required here, age, beds and census must be the last 3 variables in the model. So this is the wrong test statistic.

n) (2) What is the percent variance explained by BEDS and CENSUS when the other variables are in the model?

The percent variance explained by BEDS and CENSUS when the other variables are in the model is
$\dfrac{SSI(\text{Beds}) + SSI(\text{Census})}{SSTo} = \dfrac{2.76081 + 1.04570}{201.3797} = 0.0189$ (1.89%)

o) (2) What is the percent variance explained by AGE when the other variables are in the model?

The percent variance explained by AGE when the other variables are in the model is
$\dfrac{SSII(\text{Age})}{SSTo} = \dfrac{1.10841}{201.3797} = 0.0055$ (0.55%)
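The two percentages in parts n) and o) are simple ratios of extra sums of squares to the total sum of squares. The Python sketch below reproduces them from the values quoted above; the variable names are illustrative. Like the answer to n), it takes the sequential (Type I) sums of squares for BEDS and CENSUS as the appropriate ones.

```python
# Sums of squares quoted in the answers above.
ss_total = 201.3797          # SSTo
ss_seq_beds = 2.76081        # SSI(Beds), sequential sum of squares
ss_seq_census = 1.04570      # SSI(Census), sequential sum of squares
ss_partial_age = 1.10841     # SSII(Age), partial sum of squares

# Share of total variation explained by BEDS and CENSUS given the other variables.
share_beds_census = (ss_seq_beds + ss_seq_census) / ss_total

# Share of total variation explained by AGE given the other variables.
share_age = ss_partial_age / ss_total

print(f"BEDS and CENSUS: {100 * share_beds_census:.2f}%")
print(f"AGE:             {100 * share_age:.2f}%")
```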