Regression Analysis – Assignment #2
2017150407 전한림

1.
(a)
> x <- c(1,2,3,4,5,6,7)
> y <- c(2,5,6,9,9,11,14)
> X=cbind(1,x)
> b=solve(t(X)%*%X)%*%t(X)%*%y
> b
       [,1]
  0.7142857
x 1.8214286
> model.1 <- lm(y~x)
> summary(model.1)

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6       7
-0.5357  0.6429 -0.1786  1.0000 -0.8214 -0.6429  0.5357

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7143     0.6662   1.072    0.333
x             1.8214     0.1490  12.226 6.47e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7883 on 5 degrees of freedom
Multiple R-squared:  0.9676,    Adjusted R-squared:  0.9612
F-statistic: 149.5 on 1 and 5 DF,  p-value: 6.474e-05

The LSE is (b0, b1) = (0.7143, 1.8214), which agrees with the lm() output, and the fitted regression equation is then
ŷ = 0.7143 + 1.8214 x

(b)
> fm=lm(y~x)
> rm=lm(y~1)
> anova(rm,fm)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1      6 96.000
2      5  3.107  1    92.893 149.48 6.474e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

F = 149.48 and Pr(>F) = 6.474e-05 is less than 0.05, so at significance level α = 0.05 we can reject H0 : b1 = 0.

(c)
t = (1.8214 - 2)/0.1490 = -1.1987
With 5 degrees of freedom, t(5, 0.005) = 4.032, so the lower critical value is -4.032. Since t = -1.1987 > -4.032, we cannot reject H0 : b1 = 2.

(d)
The 95% confidence interval for b0 is (0.714 ± t(5,0.025) × 0.666) = (0.714 ± 2.571 × 0.666) = (-0.998, 2.426). Since this interval contains 1, we cannot reject the null H0 : b0 = 1.

(e)
R^2 = 0.9676 (Multiple R-squared in the output of 1(a)).
> cor(x,y)
[1] 0.9836839
The sample correlation coefficient is 0.984. In simple linear regression its square equals R^2: 0.9836839^2 = 0.9676.

(f)
> predict(model.1, newdata = data.frame(x=3),interval="confidence")
       fit      lwr      upr
1 6.178571 5.322257 7.034885
Therefore, the 95% confidence interval for b0+3b1 is (5.322, 7.035).

2.
(a)
> concrete<- read.table("concrete.txt",header=T,sep=",")
> attach(concrete)
> summary(fit<-lm(FLOW.cm.~Slag+Fly.ash+Water))

Call:
lm(formula = FLOW.cm. ~ Slag + Fly.ash + Water)

Residuals:
    Min      1Q  Median      3Q     Max
-32.621 -10.941   1.834   9.145  23.894

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.771697  13.966001  -3.564 0.000564 ***
Slag         -0.090818   0.022051  -4.119 7.91e-05 ***
Fly.ash      -0.001257   0.016078  -0.078 0.937853
Water         0.540915   0.064349   8.406 3.21e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.66 on 99 degrees of freedom
Multiple R-squared:  0.4958,    Adjusted R-squared:  0.4806
F-statistic: 32.46 on 3 and 99 DF,  p-value: 1.076e-14

The fitted linear equation is
estimated FLOW.cm. = -49.7717 - 0.0908 × Slag - 0.0013 × Fly.ash + 0.5409 × Water

(b)
When Water increases by 1 unit while Slag and Fly.ash are held fixed, the estimated mean FLOW.cm. increases by 0.5409.

(c)
> FM=lm(FLOW.cm.~Slag+Fly.ash+Water)
> RM=lm(FLOW.cm.~1)
> anova(RM,FM)
Analysis of Variance Table

Model 1: FLOW.cm. ~ 1
Model 2: FLOW.cm. ~ Slag + Fly.ash + Water
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    102 31483
2     99 15872  3     15611 32.456 1.076e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since the F-statistic is large (F = 32.456, p-value = 1.076e-14), we can reject the null hypothesis H0 : b1 = b2 = b3 = 0. Thus we cannot ignore all predictor variables when explaining FLOW.cm.

(d)
See the regression results in 2(a). According to the individual t-statistics, the effects of Slag (p-value = 7.91e-05) and Water (p-value = 3.21e-13) on FLOW.cm. are strongly significant. The Fly.ash effect is not statistically significant (p-value = 0.938), so Fly.ash may be removed from the analysis.

(e)
> summary(fit.e <- lm(FLOW.cm.~Slag+Water))

Call:
lm(formula = FLOW.cm. ~ Slag + Water)

Residuals:
    Min      1Q  Median      3Q     Max
-32.687 -10.746   2.010   9.224  23.927

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.26656   12.38669  -4.058 9.83e-05 ***
Slag         -0.09023    0.02064  -4.372 3.02e-05 ***
Water         0.54224    0.06175   8.781 4.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.6 on 100 degrees of freedom
Multiple R-squared:  0.4958,    Adjusted R-squared:  0.4857
F-statistic: 49.17 on 2 and 100 DF,  p-value: 1.347e-15

The linear regression is refitted with Fly.ash removed, and the fitted linear equation is
estimated FLOW.cm. = -50.2666 - 0.0902 × Slag + 0.5422 × Water

(f)
> median(Slag)
[1] 100
> median(Fly.ash)
[1] 164
> median(Water)
[1] 196

Keeping in mind the order of predictors, let x0 = (1, 100, 164, 196) denote the new observation in the question. We can use the following R code:
> x0=c(intercept=1,Slag=100,Fly.ash=164,Water=196)
> X=cbind(1,Slag,Fly.ash,Water)
> b.hat=fit$coef # LSE
> b.hat
  (Intercept)          Slag       Fly.ash         Water
-49.771697457  -0.090817519  -0.001256759   0.540914648
> sig.hat=summary(fit)$sigma # sigma hat
> mu0=t(x0)%*%b.hat # mean response
> mu0
         [,1]
[1,] 46.95971
> se.mu0=sig.hat*sqrt(t(x0)%*%solve(t(X)%*%X)%*%x0)
> se.mu0
         [,1]
[1,] 1.384796
> cl.mu0=c(mu0-qt(0.025,lower.tail=F,df=99)*se.mu0,mu0+qt(0.025,lower.tail=F,df=99)*se.mu0)
> cl.mu0
[1] 44.21198 49.70745
So, the 95% confidence interval for the mean response is (44.21198, 49.70745).

(g)
> y0=t(x0)%*%b.hat # predicted value
> se.y0=sig.hat*sqrt(1+t(x0)%*%solve(t(X)%*%X)%*%x0)
> pl.mu0=c(mu0-qt(0.005,lower.tail=F,df=99)*se.y0,mu0+qt(0.005,lower.tail=F,df=99)*se.y0)
> pl.mu0
[1] 13.50600 80.41343
So, the 99% prediction interval is (13.50600, 80.41343).

3.
It is assumed that a simple linear regression (SLR) model was fitted to 103 observations, so (i) n = 103, (e),(m) df = 103-2 = 101, and (a) df = 1. Since (a) = 1, the regression mean square equals the regression sum of squares, so (b) = 3371.2.
Since the t-statistic equals Coef/SE, (g) should be 57.02/2.69 = 21.1970, and since Coef = SE × t, (h) should be 0.027*(-3.48) = -0.09396.
In the SLR setting, F = t1^2 and thus (c) F = (-3.48)^2 = 12.1104.
Since (c) = (b)/(f), (f) = (b)/(c), which is 3371.2/12.1104 = 278.3723.
Also, since (f) = (d)/(e), (d) = (f)*(e) = 278.3723*101 = 28115.6.
It follows from the ANOVA table that SSR = 3371.2 and SSE = (d) = 28115.6. Then,
(j) R^2 = SSR/SST = SSR/(SSR+SSE) = 3371.2/31486.8 = 0.1071.
(k) Ra^2 = 1 - {SSE/(n-2)}/{SST/(n-1)} = 1 - (28115.6/101)/(31486.8/102) = 0.0982.
(l) Residual standard error = sqrt(MSE) = sqrt((f)) = sqrt(278.3723) = 16.6845.

In summary:
(a) = 1
(b) = 3371.2
(c) = 12.1104
(d) = 28115.6
(e) = 101
(f) = 278.3723
(g) = 21.1970
(h) = -0.09396
(i) = 103
(j) = 0.1071
(k) = 0.0982
(l) = 16.6845
(m) = 101
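
The arithmetic in Problem 3 can be re-checked with a few lines of R. This is only a sketch: the variable names are made up for illustration, the inputs are the numbers quoted in the solution above, and treating 57.02 and 2.69 as the intercept row is an assumption based on the SLR setting (the other row supplies the slope t value used for F).

```r
# Quantities quoted in the solution to Problem 3 (variable names are illustrative)
n        <- 103      # number of observations, entry (i)
ssr      <- 3371.2   # regression sum of squares / mean square, entry (b)
coef_int <- 57.02    # estimate in the first coefficient row (assumed: intercept)
se_int   <- 2.69     # its standard error
se_slope <- 0.027    # standard error of the slope
t_slope  <- -3.48    # t value of the slope

df_res  <- n - 2                                 # (e), (m): residual df
t_int   <- coef_int / se_int                     # (g): t = Coef / SE
b_slope <- se_slope * t_slope                    # (h): Coef = SE * t
f_stat  <- t_slope^2                             # (c): F = t^2 in SLR
mse     <- ssr / f_stat                          # (f): MSE = MSR / F
sse     <- mse * df_res                          # (d): SSE = MSE * df
sst     <- ssr + sse                             # total sum of squares
r2      <- ssr / sst                             # (j): R^2
r2_adj  <- 1 - (sse / df_res) / (sst / (n - 1))  # (k): adjusted R^2
sigma   <- sqrt(mse)                             # (l): residual standard error

# Print the derived entries, rounded to four decimals
print(round(c(g = t_int, h = b_slope, c = f_stat, f = mse, d = sse,
              j = r2, k = r2_adj, l = sigma), 4))
```

The printed values should agree with the summary list above up to rounding.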