Chapter 8: A Closer Look at Assumptions for SLR

The simple linear regression model is based on independence, normality, constant variance, and a linear relationship. We need practice in judging when these assumptions are violated, and we need to know how to handle data that do not fit the assumptions.

8.2 Robustness of Least Squares Inferences: SLR Assumptions

1. Linearity:
• Violations:
  (a) Straight line not adequate (e.g., curvature)
  (b) Straight line appropriate for most of the data, but there are several outliers
• Implications:
  – Estimated means and predictions can be biased
  – Tests and CIs may be based on the wrong SE
  – Severity of consequences depends on severity of the violation
• Remedies:
  – Transformations
  – Including other terms in the model (we will see this in Multiple Linear Regression), for example a quadratic term

2. Constant Variance:
• Violation:
  – Spread around the regression line changes for different values of X
• Implications:
  – Estimates are unbiased, but SEs are inaccurate (same as for one-way ANOVA)
• Remedies:
  – Transformations, weighted regression

3. Normality:
• Violation:
  – Distribution of Y for each value of X is not normally distributed
• Implications:
  – Estimates and SEs of coefficients are robust to nonnormality. Long-tailed distributions with small sample sizes present the only serious situation.
  – Prediction intervals are not robust to nonnormality. Why?
• Remedies:
  – Transformations

4. Independence:
• Violations:
  – Cluster effects
  – Serial effects
• Implications:
  – Standard errors are seriously affected!
• Remedies:
  – Incorporate dependence into the analysis using more sophisticated models

8.3 Graphical Tools for Model Assessment

Scatterplot of Response (Y) vs. Explanatory Variable (X)
• Study Display 8.6

Scatterplots of Residuals vs. Fitted Values
• Better for finding patterns because the linear component of variation in the response has been removed.
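As a sketch of why the residual plot is the sharper tool, the following R snippet uses hypothetical simulated data (not the textbook's): a quadratic trend fit with a straight line leaves curvature that stands out clearly once the linear component is removed.

```r
# Hypothetical simulated data: a quadratic trend fit with a straight line
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + 0.3 * x^2 + rnorm(50, sd = 2)

fit <- lm(y ~ x)

# Residuals vs. fitted values: the curvature left over after removing
# the linear component shows up as a U-shaped pattern
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Re-fitting with a quadratic term (one of the remedies above) would flatten this pattern.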
(see Display 8.7)
• Use to detect:
  – nonlinearity (look for curvature),
  – nonconstant variance (look for a fan shape), and
  – outliers (residuals far from 0)
• The horn-shaped pattern:
  – Poor fit and increasing variability
  – Transformations: logarithm, square root, reciprocal
  – How to choose?
    ∗ Try one, re-fit the regression model, look at the residual plot
    ∗ Think about interpretation

8.4 Interpretation After Log Transformation

It depends on whether the transformation was applied to the response, the explanatory variable, or both.

Logged RESPONSE variable: If µ{log(Y)|X} = β0 + β1 X and the distributions are symmetric, then

  Median{Y|X} = exp(β0) exp(β1 X)  =⇒  Median{Y|(X + 1)} = exp(β1) Median{Y|X}

• Wording: A one-unit increase in X is associated with a multiplicative change in the median of Y of exp(β1), with an associated 95% confidence interval from exp(lower limit of CI for β1) to exp(upper limit of CI for β1).

Logged EXPLANATORY variable: If µ{Y|log(X)} = β0 + β1 log(X), then

  µ{Y|log(2X)} − µ{Y|log(X)} = β0 + β1 log(2) + β1 log(X) − [β0 + β1 log(X)] = β1 log(2)

• Describes the change in the mean of Y for a doubling (or another multiplicative change) of X
• Wording: A doubling of X is associated with a β1 log(2) unit change in mean Y, with an associated 95% confidence interval from log(2) ∗ (lower limit of CI for β1) to log(2) ∗ (upper limit of CI for β1).

Logged RESPONSE and EXPLANATORY variables: If µ{log(Y)|log(X)} = β0 + β1 log(X), then

  Median{Y|X} = exp(β0 + β1 log(X)) = exp(β0) X^β1

• Wording: A doubling of X is associated with a multiplicative change of 2^β1 in the median of Y, with an associated 95% confidence interval from 2^(lower limit of CI for β1) to 2^(upper limit of CI for β1).

8.5 Assessment of Fit Using Analysis of Variance

Scenario: Replicate response values at several explanatory variable values.

Three Models for Population Means
1. Separate-means model:
2. Simple linear regression model:
3. Equal-means model:

8.5.3 The Lack-of-Fit F-Test

• If we want a formal assessment of the goodness of fit of the SLR model, what two models might we compare? (called a lack-of-fit F-test)
• What is necessary in order to even think about comparing the above two models?
• The Lack-of-Fit F-Test is a formal test of the adequacy of the straight-line regression model. In other words, can the variability among group means be adequately explained by the simple linear regression model?
• What numbers do we need to actually do the lack-of-fit F-test?

• Breakdown of insulating fluid example:

insulators <- read.table("data/insulatorBreakdown.txt", head = T)
plot(log(time) ~ voltage, insulators)
insl.linear.fit <- lm(log(time) ~ voltage, insulators)
abline(insl.linear.fit, col = 2)
points(tapply(insulators$voltage, insulators$voltage, mean),
       tapply(log(insulators$time), insulators$voltage, mean),
       col = 4, pch = 10, cex = 1.5)

insl.group.fit <- lm(log(time) ~ factor(voltage), insulators)
anova(insl.group.fit)
#Analysis of Variance Table
#
#Response: log(time)
#                Df Sum Sq Mean Sq F value    Pr(>F)
#factor(voltage)  6 196.48  32.746  13.004 8.871e-10
#Residuals       69 173.75   2.518

anova(insl.linear.fit)
#Analysis of Variance Table
#
#Response: log(time)
#          Df Sum Sq Mean Sq F value   Pr(>F)
#voltage    1 190.15 190.151  78.141 3.34e-13
#Residuals 74 180.07   2.433

[Figure: scatterplot of log(time) vs. voltage (26–38 kV), with the fitted regression line and the group means at each voltage marked.]

• Let's combine the usual ANOVAs and the lack-of-fit test into a single table.
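The numbers needed for the lack-of-fit F-test can be read off the two ANOVA tables above. As a sketch, the statistic can be recomputed in R from the printed sums of squares (values copied from the output above, so there is slight rounding):

```r
# Residual SS and df from the simple linear regression fit
rss_linear <- 180.07; df_linear <- 74
# Residual (pure error) SS and df from the separate-means fit
rss_group  <- 173.75; df_group  <- 69

# Lack-of-fit F = [extra SS / extra df] / (pure-error mean square)
F_lof <- ((rss_linear - rss_group) / (df_linear - df_group)) /
         (rss_group / df_group)
p_lof <- pf(F_lof, df_linear - df_group, df_group, lower.tail = FALSE)
round(c(F = F_lof, p = p_lof), 4)
# close to anova()'s F = 0.5024, p = 0.7734 (rounding in the printed SS)
```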
8.5.4 A Composite ANOVA Table

anova(insl.linear.fit, insl.group.fit)
#Analysis of Variance Table
#
#Model 1: log(time) ~ voltage
#Model 2: log(time) ~ factor(voltage)
#  Res.Df    RSS Df Sum of Sq      F Pr(>F)
#1     74 180.07
#2     69 173.75  5    6.3259 0.5024 0.7734

Source of Variation   d.f.   Sum of Squares   Mean Square   F-stat    p-val
Between Groups          6        196.48         32.746      13.004    8.871e-10
  Regression            1        190.15        190.15       75.5      0.0000
  Lack-of-fit           5          6.33          1.265       0.5024   0.7734
Within Groups          69        173.75          2.518       ——        ——
Total                  75        370.23          ——          ——        ——

8.6 Related Issues

8.6.1 R-Squared: The Proportion of Variation Explained

• R-squared statistic (R² or r²) = the percentage of the total variation in the response explained by the explanatory variable.

  R² = 100 × (Total SS − Residual SS) / (Total SS) %

• Interpretation: "R² percent of the variation in Y was explained by the linear regression on X."
• What does an ESS F-test take into account that R² does not?
• What is a good R²?
• For SLR, R² is equal to the square of the sample correlation coefficient (r).
• R² should not be used to assess the adequacy of a straight-line model. (R² can be quite large even when the SLR model is not adequate.)

8.6.2 SLR or One-Way ANOVA?

• If the simple linear regression model fits, then it is preferred. Why?
• Advantages of SLR:
  1.
  2.
  3.
  4.

8.6.3 Other Residual Plots

• Residuals vs. time order (or location) (Display 8.12)
• Normal probability plots (Display 8.13)

8.6.4 Planning an Experiment: Balance

• Balance = same number of experimental units in each treatment group
• Balance is desirable, but not essential. It plays a more important role when there is more than one factor (more than one grouping variable) because it allows for a straightforward decomposition of the SS.
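The claim in Section 8.6.1 that R² equals the squared sample correlation in SLR can be checked numerically; the sketch below uses hypothetical simulated data.

```r
# Hypothetical data to verify R-squared = (sample correlation)^2 in SLR
set.seed(42)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

fit <- lm(y ~ x)
total_ss    <- sum((y - mean(y))^2)
residual_ss <- sum(resid(fit)^2)

R2 <- (total_ss - residual_ss) / total_ss   # definition from Section 8.6.1
all.equal(R2, cor(x, y)^2)                  # TRUE: R^2 = r^2
all.equal(R2, summary(fit)$r.squared)       # TRUE: matches lm()'s R^2
```

Note the identity holds only for simple linear regression with an intercept; with several explanatory variables, R² generalizes to the squared correlation between Y and the fitted values.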