UNIVERSITY OF CAPE TOWN
DEPARTMENT OF STATISTICAL SCIENCES

JANUARY 2022 SUPPLEMENTARY EXAMINATION

STA2005S: Linear Models

INTERNAL EXAMINERS: Y. Junglee, B. Erni
EXTERNAL EXAMINER: J. Harvey (US)

PAGES: 6 + 8 Appendix
AVAILABLE MARKS: 100
MAXIMUM MARKS: 100
TIME ALLOWED: 3 hours

PLEASE ANSWER EACH SECTION IN A SEPARATE BOOK

SECTION A: REGRESSION [50 Marks]

1. Consider the following two linear regression models:

      Model 1:  Y(n×1) = β0 1 + e(n×1)
      Model 2:  Y(n×1) = X(n×k) β(k×1) + e(n×1)

   Define p = k − 1, β̂ as the vector of estimated coefficients of the full regression model (Model 2), ȳ as the sample mean of the response variable, and β̂(1) as the p-dimensional sub-vector of β̂ with β̂(1) = (β̂1, . . . , β̂p)'.

   (a) Show that the residual sum of squares for Model 1 is given by
          SSE1 = Y'Y − nȳ².   (5)
   (b) Show that the residual sum of squares for Model 2 is given by
          SSE2 = Y'Y − β̂'X'Y.   (3)
   (c) State the null and alternative hypothesis to test whether regression should be performed. (1)
   (d) Hence, show that the test statistic for the hypothesis test in (c) is given by
          F = (β̂'X'Y − nȳ²) / (p s2²),
       where s2² is the estimated residual variance of Model 2. Explain your reasoning. (5)
   (e) What is the distribution of the derived test statistic in (d)? (1)
   [15]

2. Consider the following data set with 7 predictor variables X1, . . . , X7, and the response variable Y. There are n = 100 observations. The predictor variables were centered, Zi = Xi − X̄i, and are used in a regression analysis to identify an appropriate model to predict Y, with i.i.d. errors e ∼ N(0, σ²I):

      Y = β0 1 + β1 Z1 + . . . + βp Zp + e

   Use the R output in Appendix A to answer the following questions:

   (a) Calculate β̂0 in MODEL A1. (2)
   (b) Use MODEL A1 to calculate the test statistic to test whether regression should be performed. (4)
   (c) Show how the p-value for the test statistic in (b) was calculated. (2)
   (d) Use MODEL A1 to construct a 95% confidence interval for β7 − 2β3. (3)
   (e) Use MODEL A1 to test the hypothesis that
          3β2 − 2β1 = 15
          β3 + β4 = 5
          β5 + β6 + 2β7 = 10
       by answering the following questions:
       i.   Explain how the test statistic is constructed.
       ii.  Calculate the value of the test statistic.
       iii. What is the distribution of the test statistic? (5)
   (f) Use MODEL A1 to construct a 95% prediction interval for the following observation:

            Z1    Z2    Z3    Z4    Z5    Z6    Z7
          0.13  0.13  0.05  0.92  0.48  0.18  0.23

       (2)
   (g) Briefly explain what Cook's statistic is based on and how it can be used to detect influential and outlying points. (3)
   (h) Identify the variable selection procedure employed at the end of Appendix A. Describe the procedure with reference to the criterion used, how it is used, and the suggested best model. (4)
   [25]
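For question 2(d), the interval for a linear combination of coefficients is built from its point estimate, its standard error and a t critical value. The following is a minimal R sketch of that construction, assuming the objects MODELA1 and Cmat = (X'X)^(-1) from Appendix A are available in the workspace; the weight vector a below is written out for illustration and is not part of the exam output.

    # Sketch: 95% confidence interval for a linear combination a'beta,
    # assuming MODELA1 and Cmat as constructed in Appendix A.
    a     <- c(0, 0, 0, -2, 0, 0, 0, 1)           # coefficient order: intercept, Z1, ..., Z7
    est   <- as.numeric(t(a) %*% coef(MODELA1))   # point estimate of a'beta
    s     <- summary(MODELA1)$sigma               # residual standard error
    se    <- s * sqrt(as.numeric(t(a) %*% Cmat %*% a))
    tcrit <- qt(0.975, df = df.residual(MODELA1))
    c(lower = est - tcrit * se, upper = est + tcrit * se)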
3. Consider the following data set consisting of n = 500 observations, displayed in Figure 1 below.

   Figure 1: Scatterplot of the response variable, Y, against the predictor variable X.

   The R output for this question can be found in Appendix B.

   (a) We fit MODEL B1, yi = β0 + β1 xi + ei, to the data. Why is MODEL B1 inappropriate for this data set? (1)
   (b) Figure 2 (in Appendix B) displays the fitted line obtained from MODEL B1. Construct an approximate scatterplot of the residuals from MODEL B1 against X. (2)
   (c) We now consider the following regression model, MODEL B2: yi = β0 + β1 xi + β2 xi² + β3 xi³ + ei. How would the coefficient of determination for MODEL B2 change from MODEL B1? Why? (2)
   (d) Why is MODEL B2 still considered a linear regression model? (1)
   (e) Calculate the test statistic to test whether at least one of the quadratic and cubic terms should be added to MODEL B1. (2)
   (f) Figure 3 (in Appendix B) displays the residuals obtained from MODEL B2. Construct an approximate line plot of the fitted values from MODEL B2 against X. (2)
   [10]

SECTION B: EXPERIMENTAL DESIGN [50 Marks]

Please answer this section in a separate book.

4. Give two approaches for avoiding confounding between a treatment factor and intrinsic differences between experimental units. Exactly how do these approaches avoid confounding? [3]

5. A placebo is a form of control treatment. True or False? Justify your answer. [2]

6. Suppose you obtain a large p-value (e.g. 0.66) for a hypothesis test, e.g. H0: µ1 = µ2 = µ3. Does this mean that the three means are equal? Explain. [3]

7. Data from an important (completely randomised design) experiment with five treatments have just been made available. Researcher A wants to test her a-priori hypothesis that µ5 = µ1 (her only test). Researcher B looks at all the treatment means, finds that Ȳ1. and Ȳ5. are most different, and now wants a p-value for this comparison. Researchers A and B don't know of each other's research. Should there be a difference in the test used by these two researchers? Should their resulting p-values differ, and if so, how? Justify all of your statements. [3]

8. In a one-way ANOVA (for a completely randomised design experiment) with a treatments (single treatment factor), what is the distribution of F = MStreatments / MSE if the null hypothesis is false, where H0: αi = 0 ∀ i? Assume all αi and σ² are known. Briefly describe how we can calculate the power of an F-test for the above null hypothesis using the above distribution. [3]
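Question 8 concerns the power of the one-way ANOVA F-test: when the null hypothesis is false, the F ratio follows a noncentral F distribution, and the power is the probability that this distribution exceeds the usual critical value. A minimal R sketch is given below; the values of a, n, the αi and σ are illustrative assumptions, not data from this paper.

    # Sketch: power of the one-way ANOVA F-test via the noncentral F distribution.
    a     <- 5                       # number of treatments (illustrative)
    n     <- 6                       # replicates per treatment, balanced design (illustrative)
    alpha <- c(-2, -1, 0, 1, 2)      # assumed treatment effects, summing to zero
    sigma <- 3                       # assumed error standard deviation
    ncp   <- n * sum(alpha^2) / sigma^2          # noncentrality parameter
    df1   <- a - 1
    df2   <- a * (n - 1)
    Fcrit <- qf(0.95, df1, df2)                  # 5% critical value under H0
    1 - pf(Fcrit, df1, df2, ncp = ncp)           # power = P(F > Fcrit when H0 is false)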
9. The Pygmalion effect in psychology refers to a situation where the high expectations of a supervisor or teacher translate into improved performance by employees or students. Ten companies of soldiers in an army training camp were selected for a study investigating the Pygmalion effect. Each company had three platoons, and each platoon had its own platoon leader. Using a random mechanism, one of the three platoons in each company was selected to be the Pygmalion platoon. Prior to assuming command of a platoon, each leader met with an army psychologist. To each of the Pygmalion treatment platoon leaders, the psychologist described a nonexistent battery of test results that had predicted superior performance from his or her platoon. At the conclusion of basic training, soldiers took a battery of real tests. These tests evaluated soldiers on their ability to operate weapons and answer questions about their use. The response is the average score on the test for the soldiers in a platoon.

              Sum Sq  Df  F value  Pr(>F)
   Company    682.52   9     1.75  0.1484
   Treat      338.88   1     7.84  0.0119
   Residuals  778.50  18

   (a) What design was used for this experiment? Make a sketch of the layout of the design, clearly indicating all blocking and treatment factors with their levels. (3)
   (b) Write down the model underlying the above ANOVA table. Include all necessary constraints. (3)
   (c) Write down the row of the design matrix corresponding to an observation from the Pygmalion treatment in company 2, using the model you specified above. (1)
   (d) One of the companies had only 2 platoons. This makes the data unbalanced. Therefore, Type II sums of squares were used to obtain the above ANOVA table. Explain how the sums of squares (Sum Sq) of this table differ (not in value but conceptually) from the SS we would have obtained with the usual aov() command had the data been balanced. (2)
   (e) Summarize what you can learn from the ANOVA table given above about the factors involved in this experiment. Justify your statements. (2)
   (f) Did the experimenters find a Pygmalion effect? Use the following output to construct a confidence interval to estimate the Pygmalion effect, then answer the question by interpreting the confidence interval. (3)

       > emmeans(mod, ~ Treat)
        Treat     emmean   SE df lower.CL upper.CL
        Control     71.5 1.53 18     68.3     74.7
        Pygmalion   78.7 2.08 18     74.3     83.1

       where mod refers to the model fitted above.
   [14]

10. An engineer is studying methods for improving the ability to detect targets on a radar scope. Two factors she considers to be important are the amount of background noise, or ground clutter, on the scope and the type of filter placed over the screen. An experiment is designed using three levels of ground clutter and two filter types. The experiment is performed by randomly selecting a treatment combination (ground clutter level and filter type) and then introducing a signal representing the target into the scope. The intensity of this target is increased until the operator observes it. The intensity level at detection is then measured as the response variable. Because of operator availability, it is convenient to select an operator and keep him or her at the scope until all the necessary runs have been made. Furthermore, operators differ in their skill and ability to use the scope. Four operators are randomly selected. Once an operator is chosen, the order in which the six treatment combinations are run is randomly determined. Appendix C contains summaries of the data, an interaction plot and an ANOVA table relating to this question.

   (a) Identify all blocking and treatment factors, the treatment structure, and the experimental design used. (4)
   (b) State the null hypothesis tested in the 'filter:clutter' line of the ANOVA table, both in statistical notation and in plain English. Also, write down the likelihood for the model corresponding to this null hypothesis. (4)
   (c) How many replicates of each treatment are there? (1)
   (d) How do type of filter and level of clutter affect the intensity at which the target is detected? Refer to the ANOVA table and interaction plot in Appendix C to answer this question. Justify your answers. (4)
   (e) Have the researchers randomised correctly? Justify your answer. (2)
   (f) What is the total number of possible randomizations in this design? (1)
   (g) For the treatment high clutter with filter type 1:
       i.   Calculate the standard error (SE) of the treatment mean estimate. Explain what this SE value is a measure of. (3)
       ii.  Give an expression for the distribution of the treatment mean. (1)
       iii. Estimate the interaction effect corresponding to this treatment. (2)
   [22]
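The ANOVA table and tables of means in Appendix C are consistent with a blocked factorial analysis of the radar data. The sketch below shows one way such output could be produced in R; the data frame radar and its column names are assumptions for illustration and do not appear in the exam output.

    # Sketch: randomised block analysis with a 2 x 3 factorial treatment structure,
    # assuming a data frame 'radar' with columns intensity, operator, filter, clutter.
    radar$operator <- factor(radar$operator)
    radar$filter   <- factor(radar$filter)
    radar$clutter  <- factor(radar$clutter, levels = c("low", "medium", "high"))
    fit <- aov(intensity ~ operator + filter * clutter, data = radar)
    summary(fit)                 # ANOVA table of the form shown in Table 1
    model.tables(fit, "means")   # grand mean and tables of means as in Appendix C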
Appendix A – R output for Question 2

> head(DAT)
        Z1       Z2      Z3        Z4       Z5       Z6      Z7        Y
  -0.00105 -0.03375 -0.0175 -0.230025 -0.08118 -0.06953 -0.0574 2.079442
   0.13395  0.12625  0.0525  0.917975  0.47732  0.17897  0.2276 2.397895
   0.18395  0.11125  0.0575  1.043975  0.57682  0.25347  0.1881 2.302585
  -0.16105 -0.13375 -0.0575 -0.648025 -0.29268 -0.13703 -0.1844 1.609438

> # Calculate the sum of Y
> sum(DAT$Y)
[1] 225.7748

> # Fit a null model
> # MODEL A0
> MODELA0 = lm(Y ~ 1, dat = DAT)
> summary(MODELA0)

Call:
lm(formula = Y ~ 1, data = DAT)

Residuals:
     Min       1Q   Median       3Q      Max
-0.64831 -0.17831 -0.06052  0.14015  0.73798

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        ?    0.02861       ?        ?
---
Residual standard error: ? on ? degrees of freedom

> # Fit a full model
> # MODEL A1
> MODELA1 = lm(Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7, DAT)
> summary(MODELA1)

Call:
lm(formula = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7, data = DAT)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47324 -0.11956 -0.01811  0.10488  0.52420

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        ?          ?       ?        ?
Z1          -1.08773    1.22286  -0.889  0.37606
Z2           2.65735    1.45576   1.825  0.07119
Z3           1.53375    1.40457   1.092  0.27770
Z4          -0.16975    0.49592  -0.342  0.73291
Z5          -0.73600    0.55575  -1.324  0.18867
Z6          -0.64373    0.58752  -1.096  0.27608
Z7           2.27445    0.85895   2.648  0.00953

Residual standard error: 0.1978 on ? degrees of freedom
Multiple R-squared: 0.5559,    Adjusted R-squared: 0.5221
F-statistic: 16.45 on ? and ? DF,  p-value: ?

> # ===================
> Xmat <- model.matrix(Y ~ . , DAT)
> Cmat <- solve(t(Xmat)%*%Xmat)
> L <- matrix(0, ncol = 1, nrow = 8)
> L[4] <- -2
> L[8] <- 1
> t(L)%*%Cmat%*%L
         [,1]
[1,] 247.3741

> Amat <- matrix(0, ncol = 8, nrow = 3)
> Amat[1, c(2, 3)] <- c(-2, 3)
> Amat[2, c(4, 5)] <- c(1, 1)
> Amat[3, c(6, 7, 8)] <- c(1, 1, 2)
> Dmat <- c(15, 5, 10)
> ABhat = Amat%*%MODELA1$coefficients
> ACAt = Amat%*%Cmat%*%t(Amat)
> SS = t(MODELA1$residuals)%*%MODELA1$residuals
> (t(ABhat-Dmat)%*%solve(ACAt)%*%(ABhat-Dmat))/(SS/92)
         [,1]
[1,] 28.34087

> # ===================
> Zf <- c(1, 0.13, 0.13, 0.05, 0.92, 0.48, 0.18, 0.23)
> Zf%*%MODELA1$coefficients
        [,1]
[1,] 2.43629
> t(Zf)%*%Cmat%*%Zf
          [,1]
[1,] 0.1159083

> # ===================
> stepAIC(MODELA0, scope = list(upper = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7,
+                               lower = Y ~ 1), direction = "forward")
Start:  AIC=-249.3
Y ~ 1

       Df Sum of Sq    RSS     AIC
+ Z7    1    3.2199 4.8823 -297.95
+ Z3    1    3.0220 5.0802 -293.98
+ Z2    1    3.0081 5.0941 -293.71
+ Z1    1    2.7303 5.3718 -288.40
+ Z4    1    2.2730 5.8291 -280.23
+ Z6    1    1.8958 6.2063 -273.96
+ Z5    1    1.4783 6.6239 -267.45
<none>              8.1022 -249.30

Step:  AIC=-297.96
Y ~ Z7

       Df Sum of Sq    RSS     AIC
+ Z5    1   0.85622 4.0261 -315.24
+ Z4    1   0.78626 4.0960 -313.51
+ Z6    1   0.30474 4.5776 -302.40
<none>              4.8823 -297.95
+ Z3    1   0.06719 4.8151 -297.34
+ Z2    1   0.04832 4.8340 -296.95
+ Z1    1   0.00042 4.8819 -295.96

Step:  AIC=-315.24
Y ~ Z7 + Z5

       Df Sum of Sq    RSS     AIC
+ Z2    1  0.267578 3.7585 -320.11
+ Z1    1  0.156575 3.8695 -317.20
+ Z3    1  0.102196 3.9239 -315.81
<none>             4.0261 -315.24
+ Z6    1  0.023020 4.0031 -313.81
+ Z4    1  0.007415 4.0187 -313.42

Step:  AIC=-320.11
Y ~ Z7 + Z5 + Z2

       Df Sum of Sq    RSS     AIC
+ Z6    1  0.078203 3.6803 -320.22
<none>             3.7585 -320.11
+ Z1    1  0.044830 3.7137 -319.31
+ Z4    1  0.035898 3.7226 -319.07
+ Z3    1  0.027859 3.7306 -318.86

Step:  AIC=-320.22
Y ~ Z7 + Z5 + Z2 + Z6

       Df Sum of Sq    RSS     AIC
<none>             3.6803 -320.22
+ Z3    1  0.047437 3.6329 -319.51
+ Z1    1  0.032180 3.6481 -319.10
+ Z4    1  0.002997 3.6773 -318.30

Call:
lm(formula = Y ~ Z7 + Z5 + Z2 + Z6, data = DAT)

Coefficients:
(Intercept)       Z7       Z5      Z2       Z6
          ?   2.2837  -0.9773  1.6401  -0.7041
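The Zf calculations in this appendix supply the point prediction Zf'β̂ and the quadratic form Zf'(X'X)^(-1)Zf needed for question 2(f). A minimal sketch of how these pieces are usually combined into a 95% prediction interval is given below, assuming MODELA1, Cmat and Zf as defined above; the combination itself is not shown in the exam output.

    # Sketch: 95% prediction interval for a new observation with covariate vector Zf,
    # assuming MODELA1, Cmat = (X'X)^{-1} and Zf from Appendix A are in the workspace.
    yhat  <- as.numeric(Zf %*% coef(MODELA1))    # point prediction (2.43629 above)
    s     <- summary(MODELA1)$sigma              # residual standard error
    h     <- as.numeric(t(Zf) %*% Cmat %*% Zf)   # Zf' (X'X)^{-1} Zf (0.1159083 above)
    tcrit <- qt(0.975, df = df.residual(MODELA1))
    c(lower = yhat - tcrit * s * sqrt(1 + h),
      upper = yhat + tcrit * s * sqrt(1 + h))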
Appendix B – R output for Question 3

> MODELB1 = lm(Y ~ linear.x, data = df)
> summary(MODELB1)

Call:
lm(formula = Y ~ linear.x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-1.50514 -0.48655 -0.02775  0.48449  1.31462

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.13717    0.05118   41.76   <2e-16
linear.x    -4.29069    0.08860  -48.43   <2e-16

Residual standard error: 0.5731 on ? degrees of freedom
Multiple R-squared: 0.8248,    Adjusted R-squared: 0.8245
F-statistic: ? on ? and ? DF,  p-value: ?

> MODELB2 = lm(Y ~ linear.x + quadratic.x + cubic.x, data = df)
> summary(MODELB2)

Call:
lm(formula = Y ~ linear.x + quadratic.x + cubic.x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.62556 -0.13206  0.00608  0.13285  0.52683

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.72999    0.03543       ?        ?
linear.x     12.64869    0.30715       ?        ?
quadratic.x -42.34337          ?       ?        ?
cubic.x      28.20783          ?       ?        ?

Residual standard error: 0.1996 on ? degrees of freedom
Multiple R-squared: ?,    Adjusted R-squared: ?
F-statistic: ? on ? and ? DF,  p-value: ?

Figure 2: Fitted values from MODEL B1 added to the scatterplot of the response variable Y against the predictor variable X.

Figure 3: Scatterplot of residuals from MODEL B2 against the predictor variable X.

Appendix C – R output for Question 10

Figure 4: Interaction plot for Question 10: mean of intensity against clutter level (low, medium, high), with separate lines for filter types 1 and 2.

Table 1: ANOVA table for Question 10

                Df   Sum Sq  Mean Sq  F value  Pr(>F)
operator         3   402.17   134.06    12.09  0.0003
filter           1  1066.67  1066.67    96.19  0.0000
clutter          2   335.58   167.79    15.13  0.0003
filter:clutter   2    77.08    38.54     3.48  0.0575
Residuals       15   166.33    11.09

Tables of means

Grand mean: 94.91667

 operator
     1     2     3     4
 95.33 96.50 99.50 88.33

 filter
      1      2
 101.58  88.25

 clutter
   low medium   high
 90.13  95.38  99.25

 filter:clutter
         clutter
 filter     low medium   high
      1   94.50 102.25 108.00
      2   85.75  88.50  90.50

Probability Tables

Table 2: t-Distribution – one-sided critical values, i.e. the value of t_df^P such that P = Pr[t_df > t_df^P], where df is the degrees of freedom.

            df = 1  df = 7  df = 8  df = 9  df = 18  df = 91  df = 92  df = 93  df = 99
P = 0.025  12.7062  2.3646  2.3060  2.2622   2.1009   1.9864   1.9861   1.9858   1.9842
P = 0.050   6.3138  1.8946  1.8595  1.8331   1.7341   1.6618   1.6616   1.6614   1.6604