Solutions 5: Regression

Sxx and Syy represent corrected sums of squares for x and y, and Sxy is the corrected sum of products. S² represents the residual mean square, i.e. the residual sum of squares divided by the residual degrees of freedom. For simple linear regression, the residual sum of squares is Syy − Sxy²/Sxx with n − 2 degrees of freedom. The following method of calculating sums of squares and products is usually faster and more accurate than using the direct formulas:

    Sxx = Σx² − (Σx)²/n,    Syy = Σy² − (Σy)²/n,    Sxy = Σxy − (Σx)(Σy)/n

1. Summary statistics:

        n     Σx     Σy     Σx²      Σy²       Σxy      Sxx     Syy     Sxy
        5    275    705    16125    101341    40155    1000    1936    1380

For this example, unusually, there is no loss of accuracy in using the direct formulas Sxx = Σ(x − x̄)², etc.

Regression slope b̂ = 1380/1000 = 1.38, regression SumSq 1380²/1000 = 1904.4 with 1 d.f., residual SumSq 1936 − 1904.4 = 31.6 with 3 d.f. The ANOVA table is

    Source        Df    Sum Sq    Mean Sq    F
    Regression     1    1904.4     1904.4    180.9
    Residual       3      31.6      10.53
    Total          4    1936

The slope estimate b̂ has standard error √(10.53/1000) = 0.1026 and t = 13.45 with 3 d.f. (P < 0.001). Alternatively, use the F statistic. There is strong evidence that blood pressure is increasing with age, and the best estimate of the increase is about 13.8 mm (±1.0 mm) every 10 years. The 95% confidence interval for the slope is 1.38 ± 3.18 × 0.1026, or (1.05, 1.71).

2. From the summary statistics we calculate

        x̄       ȳ       Sxx       Syy      Sxy
        4.32    9.73    13.016    3.381    5.444

Slope b̂ = 5.444/13.016 = 0.418, intercept â = 9.73 − 0.418 × 4.32 = 7.92, so the regression equation is y = 7.92 + 0.418 x. Regression SumSq 5.444²/13.016 = 2.277, residual SumSq 3.381 − 2.277 = 1.104, S² = 1.104/8 = 0.138. Hence b̂ has standard error √(0.138/13.016) = 0.103 and t = 4.06 with 8 d.f. (P < 0.01). The relationship between purity and yield is significant at the 1% level.

3. Following on from the plotting code given as part of this question, calculate both regressions and draw the first regression line (D2 on D1):

    fit1 <- lm(D2 ~ D1, weights = frq, data = shells)
    fit2 <- lm(D1 ~ D2, weights = frq, data = shells)
    abline(fit1, lty = "dashed")

We have to remember that the roles of the x and y axes are interchanged when plotting the second line on the existing graph:

    b0 <- coef(fit2)[1]; b1 <- coef(fit2)[2]
    abline(a = -b0/b1, b = 1/b1, lty = "dotted")

For the second regression, b0 and b1 are the intercept and slope relative to the axes (y, x); −b0/b1 and 1/b1 are the intercept and slope relative to (x, y). The two lines are different because they solve different problems: predicting D1 from the value of D2, and predicting D2 from the value of D1. With a high correlation, as here, the two lines are close together (with small angular separation). With zero correlation, the two lines are at right angles. Low to moderate correlations give something between these two extremes. (Algebraically, the product of the two regression slopes, Sxy/Sxx and Sxy/Syy, is Sxy²/(Sxx Syy) = r², so the lines coincide only when r = ±1; when r = 0 both slopes are zero, and in the (x, y) plot the D2-on-D1 line is horizontal while the D1-on-D2 line is vertical.)
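The short-cut formulas above, and hence the hand calculations of question 1, are easy to check in R from the summary statistics alone. The following is a minimal sketch, not part of the original solutions; the function and argument names are illustrative only.

    ## Sketch: simple linear regression from summary statistics (illustrative names).
    sumstat_reg <- function(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy) {
      Sxx <- sum_x2 - sum_x^2 / n            # corrected sum of squares for x
      Syy <- sum_y2 - sum_y^2 / n            # corrected sum of squares for y
      Sxy <- sum_xy - sum_x * sum_y / n      # corrected sum of products
      b   <- Sxy / Sxx                       # regression slope
      s2  <- (Syy - Sxy^2 / Sxx) / (n - 2)   # residual mean square
      se  <- sqrt(s2 / Sxx)                  # standard error of the slope
      c(Sxx = Sxx, Syy = Syy, Sxy = Sxy, b = b, se = se, t = b / se)
    }
    sumstat_reg(5, 275, 705, 16125, 101341, 40155)
    ## gives Sxx = 1000, Syy = 1936, Sxy = 1380, b = 1.38, se = 0.1026, t = 13.45

Question 2 supplies Sxx, Syy and Sxy directly, so there only the slope, residual mean square and standard error steps are needed.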
4. After fitting a regression, it is good practice to check the assumptions of the model. The first and most useful diagnostic plot is the ‘residuals v fitted values’ plot. This looks satisfactory: there is no obvious relationship between the size of the residuals and the fitted values. The three largest residuals are labelled, but these are not necessarily outliers. Humans and rhesus monkeys have a brain size larger than expected from the relationship with body weight, while the brain of the water opossum is smaller than expected. The Q-Q (quantile-quantile) plot is a check on the distribution of the residuals. Any non-linearity in this plot is evidence of some kind of deviation from normality (e.g. skewness or kurtosis). Here everything is OK. (See ?plot.lm for other types of diagnostic plot.)

5. Sxx = 42, Sxy = 65, Syy = 135.5, b̂ = 1.5476, ȳ = 18.75, x̄ = 3.5, â = 13.333, so the regression equation is y = 13.333 + 1.5476 x.

    Source        d.f.    Sum of Squares    Mean Square    F-ratio
    Regression      1         100.595         100.595       17.29
    Residual        6          34.905           5.818
    Total           7         135.500

The standard error of b̂ is √(S²/Sxx) = 0.3722, and t = 1.5476/0.3722 = 4.16 with 6 d.f. From F tables (1 and 6 d.f.), or t tables (6 d.f.), P < 0.01. There is strong evidence that yield increases with the amount of irrigation (by about 1.5 tonnes/hectare for every additional cm of irrigation).

The estimated variance of â + b̂x is S²[1/n + (x − x̄)²/Sxx]. Setting x = 0 gives intercept 13.33 ± 1.56; setting x = 6 gives prediction 22.62 ± 1.26; and setting x = 8 gives prediction 25.72 ± 1.88 (the ± figures are standard errors). Assumptions: deviations from the regression are normally distributed with constant variance. For the prediction at x = 8 we also assume that the regression line remains appropriate this far beyond the range of the data. In R,

    water <- 0:7
    yield <- c(12, 18, 13, 19, 22, 20, 21, 25)
    lmfit <- lm(yield ~ water)
    summary(lmfit)
    anova(lmfit)
    newdata <- data.frame(water = c(0, 6, 8))
    predict(lmfit, newdata = newdata, se.fit = TRUE)

The mean yields come from question 2 of week 4, so we have analysed these data in two stages, first as an ANOVA and then as a regression. The total sum of squares in the regression analysis is (apart from a factor of 3) the treatment sum of squares in the ANOVA of week 4. Note that the regression mean square has been tested against the ‘deviation from regression’ mean square with 6 d.f. rather than the ‘between plot within treatment’ mean square from week 4.

6. For all three data sets, approximately, Sxx = 110, Sxy = 55.0, Syy = 41.2, b̂ = 0.5, ȳ = 7.5, and the fitted line is y = 3 + 0.5 x. However, the same ANOVA and regression estimates can arise from very different data patterns. Check by plotting the original data, or the residuals against the fitted values. The regression is straightforward for y1. For y2 there is a quadratic relationship, i.e. the points tend to lie on a parabolic curve rather than a straight line. In y3 a strong regression is hidden by an ‘outlier’.

7. The data plot with fitted line suggests that the data for location 12 might be an outlier. The diagnostic plot obtained with

    plot(fit, which = 1, add.smooth = FALSE, id.n = 0)

shows the discrepancy even more clearly. In the original paper this point was marked with the comment ‘evidently a mistake’. Subsequently, various statisticians have used these data as a testbed for methods of outlier detection; to the best of my knowledge, all have decided that the point is an outlier. With very large sets of data, a careful approach to detecting outliers is often abandoned in favour of filtering the data by an arbitrary rule, such as ‘reject all residuals greater than 3 times the residual standard deviation’. This inevitably rejects some genuine observations.
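As an aside, the kind of arbitrary filtering rule just mentioned is easy to code. The following is only a sketch, not part of the original solution, and assumes a fitted model object called fit:

    r <- residuals(fit)          # raw residuals from the fitted model
    s <- summary(fit)$sigma      # residual standard deviation
    which(abs(r) > 3 * s)        # observations such a rule would reject

As noted above, a rule like this inevitably rejects some genuine observations, so for small data sets the diagnostic plots remain the better tool.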
8. The code below gives t = 2.018 with 9 d.f. to test for zero regression of sister’s height on brother’s height (P = 0.07, not significant at the 5% level). ‘Multiple R-squared’ is 0.3114, and the correlation estimate is its square root, 0.558 (with the sign of the regression slope, here positive). Regressing brother’s height on sister’s height produces the same answer. This small data set does not firmly establish a relationship between brother’s and sister’s height.

    brother <- c(71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66)
    sister  <- c(69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62)
    fit <- lm(sister ~ brother)
    summary(fit)
    # alternatively, cor.test(brother, sister)

By hand calculation, with x = brother’s height and y = sister’s height, n = 11, Sxx = 74, Syy = 66, Sxy = 39. Calculate the correlation as r = Sxy/√(Sxx Syy) = 0.558. To check significance, calculate t = r√(n − 2)/√(1 − r²) with 9 d.f. This value of t is identical to that obtained from either of the two possible regressions.

An alternative approach, which gives a slightly different answer, is to analyse the data as a two-way ANOVA (compare problem 6 in week 4):

    heights <- c(brother, sister)
    sex <- gl(2, 11, labels = c("bro", "sis"))
    family <- gl(11, 1, 22)
    fit <- aov(heights ~ sex + family)
    summary(fit)

                Df   Sum Sq   Mean Sq   F value    Pr(>F)
    sex          1    137.5     137.5    44.355  5.64e-05 ***
    family      10    109.0      10.9     3.516    0.0299 *
    Residuals   10     31.0       3.1

If we regard family as a random effect, the residual and family mean squares have expectations σW² and σW² + 2σB², where σB² and σW² are the between- and within-family components of variance. An estimate of the intra-class correlation coefficient is (MB − MW)/(MB + MW) = (10.9 − 3.1)/(10.9 + 3.1) = 0.557, where MB and MW are the family and residual mean squares. This is very close to the ordinary (inter-class) correlation 0.558. The small difference arises because the inter-class version is calculated on the assumption that the brother measurements might have a different variance from the sister measurements, whereas the intra-class correlation assumes the same variance for brothers and sisters.

9. (a) There is an obvious relationship between winning time and distance, approximately linear.

    hills <- read.csv("hills.csv")
    fit <- lm(Time ~ Distance, data = hills)
    with(hills, {
      plot(Distance, fitted(fit), type = "n", las = 1)
      points(Distance, Time)
    })
    abline(fit, lty = "dashed")

(Plotting in the normal way, parts of the fitted line lie outside the plotting region. This way, the first plot call sets the scale on the y axis taking account of the fitted values but draws nothing; then points adds the data points to the graph.)

    plot(fit, which = 1:2, add.smooth = FALSE)

(b) The residuals versus fitted values plot is satisfactory for the less difficult races, but the three largest residuals occur among the four most difficult races (as judged by the fitted values). The same three races (Bens of Jura, Lairig Ghru, and Two Breweries Fell) also look out of place on the Q-Q plot.

(c) Next week we will learn how to use lm to investigate the relationship between a dependent variable like Time and two or more predictors (like Distance and Climb).
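As a preview of the material mentioned in (c), the same lm call accepts more than one predictor on the right-hand side of the formula. A minimal sketch, not part of this week’s solutions, assuming the hills data frame read in above contains the Climb variable mentioned in (c):

    fit2 <- lm(Time ~ Distance + Climb, data = hills)   # two predictors
    summary(fit2)
    plot(fit2, which = 1:2, add.smooth = FALSE)          # same diagnostic plots as before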