STA2005S Jan2022

UNIVERSITY OF CAPE TOWN
DEPARTMENT OF STATISTICAL SCIENCES
JANUARY 2022 SUPPLEMENTARY EXAMINATION
STA2005S: Linear Models
INTERNAL EXAMINERS: Y. Junglee, B. Erni
EXTERNAL EXAMINER: J. Harvey (US)
PAGES: 6 + 8 Appendix
AVAILABLE MARKS: 100
MAXIMUM MARKS: 100
TIME ALLOWED: 3 hours
PLEASE ANSWER EACH SECTION IN A SEPARATE BOOK
SECTION A: REGRESSION [50 Marks]
1. Consider the following two linear regression models,
Model 1: $Y_{(n \times 1)} = \beta_0 \mathbf{1} + e_{(n \times 1)}$
Model 2: $Y_{(n \times 1)} = X_{(n \times k)} \beta_{(k \times 1)} + e_{(n \times 1)}$
Define $p = k - 1$, $\hat{\beta}$ as the estimated β coefficients of the full regression model (Model 2),
$\bar{y}$ as the sample mean of the response variable and $\hat{\beta}_{(1)}$ as a p-dimensional sub-vector of
$\hat{\beta}$ with $\hat{\beta}_{(1)}' = (\hat{\beta}_1, \dots, \hat{\beta}_p)$.
(a) Show that the residual sum of squares for Model 1 is given by,
$SSE_1 = Y'Y - n\bar{y}^2.$
(5)
(b) Show that the residual sum of squares for Model 2 is given by,
$SSE_2 = Y'Y - \hat{\beta}' X'Y.$
(3)
(c) State the null and alternative hypothesis to test whether regression should be performed.
(1)
(d) Hence, show that the test statistic when undertaking the hypothesis test in (c) is
given by,
$$F = \frac{\hat{\beta}' X'Y - n\bar{y}^2}{p\,s_2^2},$$
where $s_2^2$ is the estimated residual variance of Model 2. Explain your reasoning.
(5)
(e) What is the distribution of the derived test statistic in (d)?
(1)
[15]
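A hedged sketch of one possible route to part (a), not a model answer. It assumes the least squares estimate under Model 1 is $\hat{\beta}_0 = \bar{y}$, so that the fitted vector is $\bar{y}\mathbf{1}$, and uses $\mathbf{1}'Y = n\bar{y}$ and $\mathbf{1}'\mathbf{1} = n$:

```latex
\begin{aligned}
SSE_1 &= (Y - \bar{y}\mathbf{1})'(Y - \bar{y}\mathbf{1}) \\
      &= Y'Y - 2\bar{y}\,\mathbf{1}'Y + \bar{y}^2\,\mathbf{1}'\mathbf{1} \\
      &= Y'Y - 2n\bar{y}^2 + n\bar{y}^2 \\
      &= Y'Y - n\bar{y}^2 .
\end{aligned}
```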
2. Consider the following data set with 7 predictor variables $X_1, \dots, X_7$, and the response
variable Y. There are n = 100 observations. The predictor variables were centered as
$Z_i = X_i - \bar{X}_i$, and are used in a regression analysis to identify an appropriate model to
predict Y with i.i.d. errors, $e \sim N(0, \sigma^2 I)$:
$$Y = \beta_0 \mathbf{1} + \beta_1 Z_1 + \dots + \beta_p Z_p + e$$
Use the R output in Appendix A to answer the following questions:
(a) Calculate $\hat{\beta}_0$ in MODEL A1.
(2)
(b) Use MODEL A1 to calculate the test statistic to test whether regression should be
performed.
(4)
(c) Show how the p-value for the test statistic in (b) was calculated.
(2)
(d) Use MODEL A1 to construct a 95% confidence interval for β7 − 2β3 .
(3)
(e) Use MODEL A1 to test the hypothesis that
$$\begin{pmatrix} 3\beta_2 - 2\beta_1 \\ \beta_3 + \beta_4 \\ \beta_5 + \beta_6 + 2\beta_7 \end{pmatrix} = \begin{pmatrix} 15 \\ 5 \\ 10 \end{pmatrix}$$
by answering the following questions:
i. Explain how the test statistic is constructed.
ii. Calculate the value of the test statistic.
iii. What is the distribution of the test statistic?
(5)
(f) Use MODEL A1 to construct a 95% prediction interval for the following observation:
  Z1    Z2    Z3    Z4    Z5    Z6    Z7
0.13  0.13  0.05  0.92  0.48  0.18  0.23
(2)
(g) Briefly explain what the Cook’s statistic is based on and how it can be used to detect
influential and outlying points.
(3)
(h) Identify what variable selection procedure is employed at the end of APPENDIX A.
Describe the procedure of the selection with reference to what criterion is used, how
it is used and what is the suggested best model.
(4)
[25]
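As a rough numerical check on how the quantities printed in Appendix A combine for parts (a), (b) and (d), the following sketch may help. It is written in Python rather than the R used in the appendix, and every variable name is illustrative only. It assumes n = 100 and p = 7, that centering the predictors makes the intercept estimate equal to the sample mean of Y, and that 247.3741 is the value of L'(X'X)^{-1}L for the contrast β7 − 2β3.

```python
import math

# Values copied from Question 2 and Appendix A
n, p = 100, 7
sum_y = 225.7748            # sum(DAT$Y)
r_squared = 0.5559          # Multiple R-squared, MODEL A1
s = 0.1978                  # residual standard error, MODEL A1
l_c_l = 247.3741            # t(L) %*% Cmat %*% L
b3, b7 = 1.53375, 2.27445   # estimated coefficients of Z3 and Z7
t_crit = 1.9861             # Table 2: P = 0.025, df = 92

# (a) Assumption: with centered predictors, the intercept estimate is the mean of Y
beta0_hat = sum_y / n

# (b) Overall F statistic written in terms of R-squared
df_resid = n - p - 1
f_stat = (r_squared / p) / ((1 - r_squared) / df_resid)

# (d) 95% confidence interval for beta7 - 2 * beta3
estimate = b7 - 2 * b3
se = s * math.sqrt(l_c_l)
ci = (estimate - t_crit * se, estimate + t_crit * se)

print(f"beta0_hat = {beta0_hat:.4f}")
print(f"F = {f_stat:.2f} on {p} and {df_resid} df")
print(f"estimate = {estimate:.4f}, se = {se:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The F value computed this way agrees, up to rounding, with the 16.45 shown in the F-statistic line of the MODEL A1 summary.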
3. Consider the following data set consisting of n = 500 observations displayed in Figure 1
below.
Figure 1: Scatterplot of the response variable, Y, against predictor variable X. (Plot not reproduced; X runs from 0.0 to 1.0 and Y from about −2 to 2.)
The R output for this question can be found in Appendix B.
(a) We fit MODEL B1, $y_i = \beta_0 + \beta_1 x_i + e_i$, to the data. Why is MODEL B1 inappropriate
for this data set?
(1)
(b) Figure 2 (in Appendix B) displays the fitted line obtained from MODEL B1. Construct an approximate scatter plot of the residuals from MODEL B1 against X.
(2)
(c) We now consider the following regression model, MODEL B2:
$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_i.$
How would the coefficient of determination for MODEL B2 change from MODEL
B1? Why?
(2)
(d) Why is MODEL B2 still considered a linear regression model?
(1)
(e) Calculate the test statistic to test whether at least one of the quadratic and cubic
terms should be added to MODEL B1.
(2)
(f) Figure 3 (in Appendix B) displays the residuals obtained from MODEL B2. Construct an approximate line plot of the fitted values from MODEL B2 against X.
(2)
[10]
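One way the nested-model comparison in part (e) could be assembled from Appendix B is sketched below. This is a Python illustration with hypothetical variable names, not the examiners' solution. It assumes each residual sum of squares can be recovered as (residual standard error)² × residual degrees of freedom, with 498 and 496 residual degrees of freedom for MODEL B1 and MODEL B2 respectively (n = 500).

```python
# Values copied from Question 3 and Appendix B
n = 500
s1, k1 = 0.5731, 2   # MODEL B1: residual standard error, number of coefficients
s2, k2 = 0.1996, 4   # MODEL B2: residual standard error, number of coefficients

# Assumption: SSE = s^2 * residual degrees of freedom
df1, df2 = n - k1, n - k2
sse1 = s1 ** 2 * df1
sse2 = s2 ** 2 * df2

# Partial F statistic for adding the quadratic and cubic terms
q = k2 - k1  # number of extra terms tested
f_stat = ((sse1 - sse2) / q) / (sse2 / df2)

print(f"SSE1 = {sse1:.3f}, SSE2 = {sse2:.3f}")
print(f"F = {f_stat:.1f} on {q} and {df2} df")
```

Because the standard errors in the appendix are rounded to four decimals, the result is only approximate.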
SECTION B: EXPERIMENTAL DESIGN [50 Marks]
Please answer this section in a separate book.
4. Give two approaches for avoiding confounding between a treatment factor and intrinsic
differences between experimental units. Exactly how do these approaches avoid confounding?
[3]
5. A placebo is a form of control treatment. True or False? Justify your answer.
[2]
6. Suppose you obtain a large p-value (e.g. 0.66) for a hypothesis test, e.g. H0 : µ1 = µ2 = µ3 .
Does this mean that the three means are equal? Explain.
[3]
7. Data from an important (completely randomised design) experiment with five treatments
have just been made available.
Researcher A wants to test her a-priori hypothesis that µ5 = µ1 (her only test). Researcher
B looks at all the treatment means, finds that Ȳ1. and Ȳ5. are most different and now wants
a p-value for this comparison. Researchers A and B don’t know of each other’s research.
Should there be a difference in the test used by these two researchers? Should their
resulting p-values differ, and if so, how? Justify all of your statements.
[3]
8. In a one-way ANOVA (for a completely randomised design experiment) with $a$ treatments
(single treatment factor), what is the distribution of
$$F = \frac{MS_{\text{treatments}}}{MSE}$$
if the null hypothesis is false, where $H_0: \alpha_i = 0, \ \forall\, i$?
Assume all $\alpha_i$, $\sigma^2$ are known. Briefly describe how we can calculate the power of an F-test
for the above null hypothesis using the above distribution.
[3]
9. The Pygmalion effect in psychology refers to a situation where the high expectations of
a supervisor or teacher translate into improved performance by employees or students.
Ten companies of soldiers in an army training camp were selected for a study investigating
the Pygmalion effect. Each company had three platoons. Each platoon had its own
platoon leader. Using a random mechanism, one of the three platoons was selected to be
the Pygmalion platoon.
Prior to assuming command of a platoon, each leader met with an army psychologist.
To each of the Pygmalion treatment platoon leaders, the psychologist described the results of a nonexistent battery of tests that had predicted superior performance from his or her
platoon.
At the conclusion of basic training, soldiers took a battery of real tests. These tests
evaluated soldiers on their ability to operate weapons and answer questions about their
use. The response is the average score on the test for the soldiers in a platoon.
            Sum Sq  Df  F value  Pr(>F)
Company     682.52   9     1.75  0.1484
Treat       338.88   1     7.84  0.0119
Residuals   778.50  18
(a) What design was used for this experiment? Make a sketch of the layout of the design,
clearly indicating all blocking and treatment factors with their levels.
(3)
(b) Write down the model underlying the above ANOVA table. Include all necessary
constraints.
(3)
(c) Write down the line of the design matrix corresponding to an observation from the
Pygmalion treatment and company 2, and corresponding to the model you specified
above.
(1)
(d) One of the companies had only 2 platoons. This makes the data unbalanced. Therefore, Type II sums of squares were used to obtain the above ANOVA table.
Explain how the sums of squares (Sum Sq) of this table differ (not in value but
conceptually) from the SS we would have obtained with the usual aov() command that we
would have used had the data been balanced.
(2)
(e) Summarize what you can learn from the ANOVA table given above about the factors
involved in this experiment. Justify your statements.
(2)
(f) Did the experimenters find a Pygmalion effect? Use the following output to construct
a confidence interval to estimate the Pygmalion effect. Then answer the question by
interpreting the confidence interval.
(3)
> emmeans(mod, ~ Treat)
 Treat     emmean   SE df lower.CL upper.CL
 Control     71.5 1.53 18     68.3     74.7
 Pygmalion   78.7 2.08 18     74.3     83.1
where mod refers to the model fitted above.
[14]
10. An engineer is studying methods for improving the ability to detect targets on a radar
scope. Two factors she considers to be important are the amount of background noise, or
ground clutter, on the scope and the type of filter placed over the screen. An experiment
is designed using three levels of ground clutter and two filter types. The experiment is
performed by randomly selecting a treatment combination (ground clutter level and filter
type) and then introducing a signal representing the target into the scope. The intensity
of this target is increased until the operator observes it. The intensity level at detection is
then measured as the response variable. Because of operator availability, it is convenient
to select an operator and keep him or her at the scope until all the necessary runs have
been made. Furthermore, operators differ in their skill and ability to use the scope. Four
operators are randomly selected. Once an operator is chosen, the order in which the six
treatment combinations are run is randomly determined.
Appendix C contains summaries of the data, an interaction plot and an ANOVA table
relating to this question.
(a) Identify all blocking and treatment factors, treatment structure, and the experimental design used.
(4)
(b) State the null hypothesis tested in the ‘filter:clutter’ line in the ANOVA table, both
in statistical notation, and in plain English. Also, write down the likelihood for the
model corresponding to this null hypothesis.
(4)
(c) How many replicates of each treatment are there?
(1)
(d) How do type of filter and level of clutter affect the intensity at which the target is
detected? Refer to the ANOVA table and interaction plot in Appendix C to answer
this question. Justify your answers.
(4)
(e) Have the researchers randomised correctly? Justify your answer.
(2)
(f) What is the total number of possible randomizations in this design?
(1)
(g) For the treatment high clutter with filter type 1:
i. Calculate the standard error (SE) of the treatment mean estimate. Explain
what this SE value is a measure of.
(3)
ii. Give an expression for the distribution of the treatment mean.
(1)
iii. Estimate the interaction effect corresponding to this treatment.
(2)
[22]
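The arithmetic behind parts (g)i and (g)iii can be sketched from the Appendix C output as follows. This is a Python illustration with hypothetical variable names, under stated assumptions: each treatment combination is observed once per operator (so r = 4), the standard error of a treatment mean is taken as sqrt(MSE / r), and the interaction effect is estimated under sum-to-zero constraints as cell mean − filter mean − clutter mean + grand mean.

```python
import math

# Values copied from Table 1 and the tables of means in Appendix C
mse = 11.09              # residual mean square
r = 4                    # assumption: one observation per operator per treatment
grand_mean = 94.91667
filter1_mean = 101.58    # filter type 1
high_mean = 99.25        # high clutter
cell_mean = 108.00       # filter type 1 with high clutter

# (g)i. Standard error of the treatment mean (assumed form sqrt(MSE / r))
se_treatment_mean = math.sqrt(mse / r)

# (g)iii. Estimated interaction effect for this treatment combination
interaction_effect = cell_mean - filter1_mean - high_mean + grand_mean

print(f"SE of treatment mean = {se_treatment_mean:.3f}")
print(f"interaction effect = {interaction_effect:.3f}")
```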
Appendix A – R output for Question 2
> head(DAT)
        Z1       Z2      Z3        Z4       Z5       Z6      Z7        Y
  -0.00105 -0.03375 -0.0175 -0.230025 -0.08118 -0.06953 -0.0574 2.079442
   0.13395  0.12625  0.0525  0.917975  0.47732  0.17897  0.2276 2.397895
   0.18395  0.11125  0.0575  1.043975  0.57682  0.25347  0.1881 2.302585
  -0.16105 -0.13375 -0.0575 -0.648025 -0.29268 -0.13703 -0.1844 1.609438
>
>
> # Calculate the sum of Y
> sum(DAT$Y)
[1] 225.7748
>
> # Fit a null model
> # MODEL 0
>
> MODELA0 = lm(Y ~ 1 , dat = DAT)
> summary(MODELA0)
Call:
lm(formula = Y ~ 1, data = DAT)
Residuals:
     Min       1Q   Median       3Q      Max
-0.64831 -0.17831 -0.06052  0.14015  0.73798
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        ?    0.02861       ?        ?
---
Residual standard error: ? on ? degrees of freedom
>
>
> # Fit a full model
> # MODEL A1
> MODELA1 = lm(Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7 , DAT)
> summary(MODELA1)
Call:
lm(formula = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7, data = DAT)
Residuals:
     Min       1Q   Median       3Q      Max
-0.47324 -0.11956 -0.01811  0.10488  0.52420
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        ?          ?       ?        ?
Z1          -1.08773    1.22286  -0.889  0.37606
Z2           2.65735    1.45576   1.825  0.07119
Z3           1.53375    1.40457   1.092  0.27770
Z4          -0.16975    0.49592  -0.342  0.73291
Z5          -0.73600    0.55575  -1.324  0.18867
Z6          -0.64373    0.58752  -1.096  0.27608
Z7           2.27445    0.85895   2.648  0.00953
Residual standard error: 0.1978 on ? degrees of freedom
Multiple R-squared: 0.5559, Adjusted R-squared: 0.5221
F-statistic: 16.45 on ? and ? DF, p-value: ?
>
>
> # ============
> Xmat <- model.matrix(Y ~ . , DAT)
> Cmat <- solve(t(Xmat)%*%Xmat)
> L = matrix(0, ncol = 1, nrow = 8)
> L[4] <- -2
> L[8] <- 1
> t(L)%*%Cmat%*%L
         [,1]
[1,] 247.3741
>
>
> Amat <- matrix(0, ncol = 8, nrow = 3)
> Amat[1, c(2, 3)] <- c(-2, 3)
> Amat[2, c(4, 5)] <- c(1, 1)
> Amat[3, c(6, 7, 8)] <- c(1, 1, 2)
> Dmat <- c(15, 5, 10)
>
>
> ABhat = Amat%*%MODELA1$coefficients
> ACAt = Amat%*%Cmat%*%t(Amat)
> SS = t(MODELA1$residuals)%*%MODELA1$residuals
>
> (t(ABhat-Dmat)%*%solve(ACAt)%*%(ABhat-Dmat))/(SS/92)
[,1]
[1,] 28.34087
>
> ===================
>
> Zf <- c(1, 0.13 , 0.13 , 0.05 , 0.92 , 0.48 , 0.18 , 0.23)
> Zf%*%MODELA1$coefficients
        [,1]
[1,] 2.43629
>
> t(Zf)%*%Cmat%*%Zf
          [,1]
[1,] 0.1159083
>
> ===================
>
> stepAIC(MODELA0, scope = list(upper = Y ~ Z1 + Z2 + Z3 + Z4 +
+     Z5 + Z6 + Z7, lower = Y ~ 1),
+     direction = "forward")
Start:  AIC=-249.3
Y ~ 1

       Df Sum of Sq    RSS     AIC
+ Z7    1    3.2199 4.8823 -297.95
+ Z3    1    3.0220 5.0802 -293.98
+ Z2    1    3.0081 5.0941 -293.71
+ Z1    1    2.7303 5.3718 -288.40
+ Z4    1    2.2730 5.8291 -280.23
+ Z6    1    1.8958 6.2063 -273.96
+ Z5    1    1.4783 6.6239 -267.45
<none>              8.1022 -249.30

Step:  AIC=-297.96
Y ~ Z7

       Df Sum of Sq    RSS     AIC
+ Z5    1   0.85622 4.0261 -315.24
+ Z4    1   0.78626 4.0960 -313.51
+ Z6    1   0.30474 4.5776 -302.40
<none>              4.8823 -297.95
+ Z3    1   0.06719 4.8151 -297.34
+ Z2    1   0.04832 4.8340 -296.95
+ Z1    1   0.00042 4.8819 -295.96

Step:  AIC=-315.24
Y ~ Z7 + Z5

       Df Sum of Sq    RSS     AIC
+ Z2    1  0.267578 3.7585 -320.11
+ Z1    1  0.156575 3.8695 -317.20
+ Z3    1  0.102196 3.9239 -315.81
<none>              4.0261 -315.24
+ Z6    1  0.023020 4.0031 -313.81
+ Z4    1  0.007415 4.0187 -313.42

Step:  AIC=-320.11
Y ~ Z7 + Z5 + Z2

       Df Sum of Sq    RSS     AIC
+ Z6    1  0.078203 3.6803 -320.22
<none>              3.7585 -320.11
+ Z1    1  0.044830 3.7137 -319.31
+ Z4    1  0.035898 3.7226 -319.07
+ Z3    1  0.027859 3.7306 -318.86

Step:  AIC=-320.22
Y ~ Z7 + Z5 + Z2 + Z6

       Df Sum of Sq    RSS     AIC
<none>              3.6803 -320.22
+ Z3    1  0.047437 3.6329 -319.51
+ Z1    1  0.032180 3.6481 -319.10
+ Z4    1  0.002997 3.6773 -318.30

Call:
lm(formula = Y ~ Z7 + Z5 + Z2 + Z6, data = DAT)
Coefficients:
(Intercept)           Z7           Z5           Z2           Z6
          ?       2.2837      -0.9773       1.6401      -0.7041
>
> ===================
>
Appendix B – R output for Question 3
> MODELB1 = lm(Y ~ linear.x, data = df)
> summary(MODELB1)
Call:
lm(formula = Y ~ linear.x, data = df)
Residuals:
     Min       1Q   Median       3Q      Max
-1.50514 -0.48655 -0.02775  0.48449  1.31462
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.13717    0.05118   41.76   <2e-16
linear.x    -4.29069    0.08860  -48.43   <2e-16
Residual standard error: 0.5731 on ? degrees of freedom
Multiple R-squared: 0.8248, Adjusted R-squared: 0.8245
F-statistic: ? on ? and ? DF, p-value: ?
> MODELB2 = lm(Y ~ linear.x + quadratic.x + cubic.x, data = df)
> summary(MODELB2)
Call:
lm(formula = Y ~ linear.x + quadratic.x + cubic.x, data = df)
Residuals:
     Min       1Q   Median       3Q      Max
-0.62556 -0.13206  0.00608  0.13285  0.52683
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.72999    0.03543       ?        ?
linear.x     12.64869    0.30715       ?        ?
quadratic.x -42.34337          ?       ?        ?
cubic.x      28.20783          ?       ?        ?
Residual standard error: 0.1996 on ? degrees of freedom
Multiple R-squared: ? , Adjusted R-squared: ?
F-statistic: ? on ? and ? DF, p-value: ?
Figure 2: Fitted values from MODEL B1 added to the scatterplot of the response variable Y
against predictor variable X. (Plot titled "MODEL B1: Fitted values"; not reproduced here.)
Figure 3: Scatterplot of residuals from MODEL B2 against predictor variable X. (Vertical axis labelled "MODEL B2 − Residuals"; plot not reproduced here.)
Appendix C – R output for Question 10
Figure 4: Interaction plot for Question 10 (mean of intensity against clutter level, low, medium and high, with one line per filter type, 1 and 2; plot not reproduced here).
Table 1: ANOVA table for Question 10
                Df   Sum Sq  Mean Sq  F value  Pr(>F)
operator         3   402.17   134.06    12.09  0.0003
filter           1  1066.67  1066.67    96.19  0.0000
clutter          2   335.58   167.79    15.13  0.0003
filter:clutter   2    77.08    38.54     3.48  0.0575
Residuals       15   166.33    11.09
Tables of means
Grand mean: 94.91667
operator
    1     2     3     4
95.33 96.50 99.50 88.33
filter
     1     2
101.58 88.25
clutter
  low medium  high
90.13  95.38 99.25
filter:clutter
      clutter
filter   low medium   high
     1 94.50 102.25 108.00
     2 85.75  88.50  90.50
Probability Tables
Table 2: t-Distribution - One sided critical values, i.e. the value of $t^P_{df}$ such that
$P = \Pr[t_{df} > t^P_{df}]$, where df is the degrees of freedom.
            df = 1  df = 7  df = 8  df = 9  df = 18  df = 91  df = 92  df = 93  df = 99
P = 0.025  12.7062  2.3646  2.3060  2.2622   2.1009   1.9864   1.9861   1.9858   1.9842
P = 0.050   6.3138  1.8946  1.8595  1.8331   1.7341   1.6618   1.6616   1.6614   1.6604