Chapter 7: The Simple Linear Regression Model

7.1 Case Study: The Big Bang

Q.: Investigate the relationship between a nebula's distance (megaparsec) from the earth and the velocity (km/s) with which it was moving away from the earth. Is the relationship between mean distance and velocity a straight line?

[Figure: scatterplot "Recession Velocity vs. Distance" showing distance (megaparsec) against velocity (km/s).]

7.2 The Simple Linear Regression (SLR) Model

• GOAL: Describe the distribution of values of one variable (the response) as a function of another variable (the explanatory variable).
• Simple linear regression refers to having one explanatory variable.
• Think of one subpopulation of the response for each value of the explanatory variable.
• The regression of the response variable on the explanatory variable is a mathematical relationship between the means of the subpopulations and the explanatory variable.
  – What is the equation of a line? (Think back to high school or earlier!)
• The simple linear regression model:

      µ{Y|X} = β0 + β1X

• Notation:
  – µ{Y|X} = "the mean of Y as a function of X"
  – µ{Y|X = x} = "                                        "
  – β0 =
  – β1 =
  – σ{Y|X} = the standard deviation of Y as a function of X
• Example and diagram of the model:
• Assumptions:
  1. Normality: Each subpopulation has a normal distribution.
  2. Equal SDs (constant spread):
  3. Independence: Each response is drawn independently of all other responses from the same subpopulation, and independently of all responses from other subpopulations.
  4. Linearity: The means of the subpopulations fall on a straight-line function of the explanatory variable.
• How many parameters are in the SLR model?
• Interpolation and extrapolation:
  – interpolation =
  – extrapolation =

7.3 Least Squares Regression Estimation

• The method of least squares is one way to estimate the parameters.
• Hat notation:
  – β̂0 estimates β0
  – β̂1 estimates β1

      µ̂{Y|X} = β̂0 + β̂1X

• Now write it for a particular X in the data set:
  – Fitted values: fitᵢ = µ̂{Yᵢ|Xᵢ} = β̂0 + β̂1Xᵢ
  – Residuals: resᵢ = Yᵢ − fitᵢ
• Diagram of fitted values and residuals:
• How do we obtain a measure of the distance between all responses and their fitted values?
• Least Squares Estimators:
  – How should we estimate β0 and β1?
    ∗ Find the particular values that minimize the residual sum of squares.
    ∗ These turn out to be:

        β̂1 = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)²
        β̂0 = Ȳ − β̂1X̄

• Sampling distributions of β̂0 and β̂1:
  – β̂0 and β̂1 are just statistics! Let's think about their sampling distributions (see Display 7.7). They are normally distributed about the true values (assuming normal errors, or approximately so by the CLT for large n).
  – SD of the sampling distribution of β̂1:

        SD(β̂1) = σ √[ 1 / ((n − 1)sₓ²) ]   or equivalently   σ √[ 1 / Σᵢ (Xᵢ − X̄)² ]

  – SD of the sampling distribution of β̂0:

        SD(β̂0) = σ √[ 1/n + X̄² / ((n − 1)sₓ²) ]

  – What is sₓ²?
  – What else do we need in order to calculate the above SDs?
• Estimation of σ:

      σ̂ = √( Sum of Squared Residuals / d.f. )

  – Calculating the degrees of freedom (d.f.):

      Residual d.f. = (Total # of observations) − (# of parameters in the model for the mean)

• Standard Errors:
  – Just plug σ̂ in place of σ in the SD formulas.
  – Degrees of freedom (d.f.) = n − 2.
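Before looking at R's built-in output, it can help to verify these formulas directly. The following is a minimal R sketch (not part of the original notes) that computes the least squares estimates, σ̂, and SE(β̂1) from the Big Bang data, assuming the same data/BigBangData.csv file (columns velocity and distance) used throughout this chapter:

bigbang <- read.csv("data/BigBangData.csv")
x <- bigbang$velocity
y <- bigbang$distance
n <- length(y)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
res <- y - (b0 + b1 * x)                                         # residuals
sigma.hat <- sqrt(sum(res^2) / (n - 2))                          # residual d.f. = n - 2
se.b1 <- sigma.hat * sqrt(1 / sum((x - mean(x))^2))              # SE of the slope
c(b0 = b0, b1 = b1, sigma.hat = sigma.hat, se.b1 = se.b1)

These values should reproduce the coefficient estimates, residual standard error, and slope SE in the summary() output shown next.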
• Computer Output:

> bigbang <- read.csv("data/BigBangData.csv", header = TRUE)
> bigbang.lm <- lm(distance ~ velocity, data = bigbang)
> summary(bigbang.lm)

Call:
lm(formula = distance ~ velocity)

Residuals:
      Min        1Q    Median        3Q       Max
-0.763250 -0.235212 -0.008798  0.207201  0.914434

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3990982  0.1184697   3.369  0.00277
velocity    0.0013729  0.0002274   6.036 4.48e-06

Residual standard error: 0.405 on 22 degrees of freedom
Multiple R-Squared: 0.6235, Adjusted R-squared: 0.6064
F-statistic: 36.44 on 1 and 22 DF, p-value: 4.477e-06

> plot(distance ~ velocity, data = bigbang)
> abline(bigbang.lm)

[Figure: scatterplot "Recession Velocity vs. Distance" with the fitted least squares line.]

– The extra-sum-of-squares F test for regression versus a single-mean model (H0: no relationship between Y and X, i.e., β1 = 0) uses anova():

> anova(bigbang.lm)
Analysis of Variance Table

Response: distance
          Df Sum Sq Mean Sq F value    Pr(>F)
velocity   1 5.9755  5.9755  36.438 4.477e-06
Residuals 22 3.6078  0.1640

– One way to summarize linear regression output (SEs in parentheses):

      Estimated Mean Distance = 0.3991 + 0.001373 (Velocity)
                                (0.1185)  (0.000227)

7.4 Inferential Tools

• t-ratios:
• What tests are included in the standard output from statistical packages?

7.4.1 Confidence intervals for β0 or β1

• Estimate:
• Standard Error (SE):
• Multiplier:
• Now, make the CI for the true slope.

7.4.2 Describing the distribution of Y at a particular value of X (X = X0)

• At some specified value (X0) of the explanatory variable, the response variable has a distribution. What shape is it?
  – What is the mean of the distribution?
  – What is the SD?
• Ingredients for a confidence interval for the mean (µ{Y|X0}):
  – Estimate:
  – Standard Error:

        SE[µ̂{Y|X0}] = σ̂ √[ 1/n + (X0 − X̄)² / ((n − 1)sₓ²) ],   d.f. = n − 2

  – Multiplier:
• How can we get SE[µ̂{Y|X0}] without directly calculating it?

Option 1: the computer centering trick:
  (a) Create an artificial explanatory variable X∗ = X − X0.
  (b) Fit the simple linear regression of Y on X∗.
  (c) The intercept IS now the mean of Y when X = X0, and thus the SE for the intercept is what you want:

      [intercept for µ{Y|X∗}] = µ{Y|X∗ = 0} = µ{Y|X = X0}

> summary(lm(distance ~ I(velocity - 500), data = bigbang))
## (output truncated)
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.0855663  0.0875540  12.399 2.12e-11
I(velocity - 500) 0.0013729  0.0002274   6.036 4.48e-06

Option 2: use R functions to directly get SEs and CIs for all points:

bigbang.lm <- lm(distance ~ velocity, data = bigbang)
fits.with.SEs <- predict(bigbang.lm, se.fit = TRUE,
                         interval = "confidence", level = 0.95)
fits.with.SEs

fit.500 <- predict(bigbang.lm, newdata = list(velocity = 500), se.fit = TRUE)
unlist(fit.500)
#      fit.1         se.fit             df residual.scale
# 1.08556627     0.08755405    22.00000000     0.40495881

# Calculate the CI "by hand"
fit.500$fit + c(-1, 1) * qt(.975, 22) * fit.500$se.fit
# [1] 0.9039903 1.2671422

• What happens to the standard error for estimating the mean response as X0 gets farther from X̄?
• There is compound uncertainty when estimating the mean response for several values of X. So far, we have only addressed the situation where we are interested in a single X0.
  – What multiplier should we use to make confidence intervals for several X0's? (See the multiplier comparison sketch just below.)
  – What if we want to protect against an unlimited number of comparisons, or make a confidence band around the whole regression line?
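Before the Working–Hotelling band is defined, here is a small comparison sketch (not from the original notes) of three candidate multipliers for the Big Bang fit, which has 22 residual degrees of freedom. The Bonferroni multiplier is one common, conservative answer for a fixed number k of X0's; the Scheffé multiplier, defined next, protects the entire line.

df.res <- 24 - 2                 # n = 24 nebulae, 2 parameters in the model for the mean
qt(0.975, df.res)                # single X0: t multiplier, about 2.07
k <- 4                           # say, k = 4 prespecified values of X0
qt(1 - 0.05 / (2 * k), df.res)   # Bonferroni multiplier, about 2.7
sqrt(2 * qf(0.95, 2, df.res))    # Scheffe band multiplier, about 2.62

The band multiplier exceeds the single-interval multiplier because the band must cover the mean response at every X simultaneously.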
• Creating a confidence band for the regression line using the Working–Hotelling procedure:
  – Want at least 95% of repetitions to produce bands that include the true mean response everywhere, that is, contain the true line.
  – Use the Scheffé multiplier:

        Confidence BAND multiplier = √( 2 F₂,ₙ₋₂(.95) )

  – So, to make the confidence bands:
    1. Calculate the confidence intervals for many different values of X.
    2. Connect the lower limits, and connect the upper limits.

est.mean.ses <- predict(bigbang.lm,
                        newdata = list(velocity = seq(-220, 1090, length = 40)),
                        se.fit = TRUE)
confband.Scheffe.low <- est.mean.ses$fit - sqrt(2 * qf(.95, 2, 22)) * est.mean.ses$se.fit
confband.Scheffe.hi  <- est.mean.ses$fit + sqrt(2 * qf(.95, 2, 22)) * est.mean.ses$se.fit
cbind(confband.Scheffe.low, confband.Scheffe.hi)

plot(distance ~ velocity, data = bigbang,
     ylab = "distance (megaparsec)", xlab = "velocity (km/s)")
lines(seq(-220, 1090, length = 40), est.mean.ses$fit, lty = 1, lwd = 2)               # fitted line
lines(seq(-220, 1090, length = 40), confband.Scheffe.low, lty = 4, lwd = 2, col = 4)  # lower confidence band
lines(seq(-220, 1090, length = 40), confband.Scheffe.hi, lty = 4, lwd = 2, col = 4)   # upper confidence band

[Figure: scatterplot of distance (megaparsec) vs. velocity (km/s) with the fitted line and the Working–Hotelling confidence band.]

7.4.3 Prediction of a Future Response

• A prediction interval indicates likely values for a future observation of the response at a specific value of X (say X0).
• Understand the difference between these two questions:
  1. What is the mean pH 4 hours after slaughter?
  2. What will be the pH of a particular steer carcass 4 hours after slaughter?
• Suppose you know the mean and SD of the response at X0. Is it possible to exactly predict the value of an individual observation?
• Pred{Y|X0} = our best prediction of a future response at X0.
  – What would you use for Pred{Y|X0}?
• A prediction interval has TWO independent sources of uncertainty:
  1. Uncertainty in the location of the subpopulation mean.
  2. Uncertainty about where the future value will be in relation to its mean.

        SE[Pred{Y|X0}] = √( σ̂² + (SE[µ̂{Y|X0}])² )

• How do we usually obtain SE[Pred{Y|X0}] and/or prediction intervals?
  1. Use the computer centering trick, which gives σ̂ and SE[µ̂{Y|X0}] in the output; square and combine them as above.
  2. Direct request from R:

bigbang.lm <- lm(distance ~ velocity, data = bigbang)
Vel.100 <- data.frame(velocity = 100)
pred.100 <- predict(bigbang.lm, newdata = Vel.100, se.fit = TRUE,
                    interval = "prediction", level = 0.95)

new.Vels <- data.frame(velocity = c(-100, 100, 400, 900))
predict(bigbang.lm, newdata = new.Vels, se.fit = TRUE,
        interval = "prediction", level = 0.95)

$fit
        fit         lwr      upr
1 0.2618046 -0.62392239 1.147532
2 0.5363918 -0.33038691 1.403171
3 0.9482727  0.09102778 1.805518
4 1.6347407  0.74228834 2.527193

$se.fit
         1          2          3          4
0.13569379 0.10340198 0.08288756 0.14557933

$df
[1] 22

$residual.scale
[1] 0.4049588

# Check "by hand" at velocity = 100
se.pred.100 <- sqrt(pred.100$residual.scale^2 + pred.100$se.fit^2)
se.pred.100  # 0.4179517
pred.100$fit[1, 1] + c(-1, 1) * qt(.975, 22) * se.pred.100
# [1] -0.3303869  1.4031706   IT MATCHES row 2 (velocity = 100) above

• Prediction intervals:
  – For a single interval (one new response) → use the t-multiplier with d.f. = n − 2.
  – For k multiple new responses → use the Scheffé multiplier √( k Fₖ,ₙ₋₂(1 − α) ), as in the sketch below.
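The Scheffé adjustment for k new responses can be assembled by hand from predict() output. A minimal sketch (not in the original notes), computing simultaneous 95% prediction intervals at the four velocities used above:

pred.k <- predict(bigbang.lm, newdata = new.Vels, se.fit = TRUE)
k <- 4
mult <- sqrt(k * qf(0.95, k, 22))                            # Scheffe multiplier for k = 4 new responses
se.pred <- sqrt(pred.k$residual.scale^2 + pred.k$se.fit^2)   # SE[Pred{Y|X0}] at each velocity
cbind(fit = pred.k$fit,
      lwr = pred.k$fit - mult * se.pred,
      upr = pred.k$fit + mult * se.pred)                     # wider than the one-at-a-time PIs above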
range(bigbang$velocity)  # -220 1090

newX <- data.frame(velocity = seq(-220, 1090, length = 50))  # 50 values between -220 and 1090
est.mean.CIs <- predict(bigbang.lm, newdata = newX, interval = "confidence")
pred.PIs <- predict(bigbang.lm, newdata = newX, interval = "prediction")

## Make a confidence BAND using the Scheffe multiplier
ww <- sqrt(2 * qf(.95, 2, 22))
est.mean.ses <- predict(bigbang.lm, newdata = newX, se.fit = TRUE)
confband.Scheffe.low <- est.mean.ses$fit - ww * est.mean.ses$se.fit
confband.Scheffe.hi <- est.mean.ses$fit + ww * est.mean.ses$se.fit
newX <- cbind(newX, est.mean.ses[1:2], confband.Scheffe.low, confband.Scheffe.hi)

dev.new()
plot(distance ~ velocity, data = bigbang, pch = 16, cex = 0.5,
     ylab = "distance (megaparsec)", xlab = "velocity (km/s)",
     xlim = c(-300, 1200), ylim = c(-0.5, 2.5), xaxp = c(-300, 1200, 5))
abline(bigbang.lm, lwd = 2)                                  # fitted line
lines(fit ~ velocity, newX, lty = 1, lwd = 2)                # another way to add the fitted line (X's are sorted in order)
lines(est.mean.CIs[, 2] ~ newX$velocity, lty = 2, col = 2)   # lower pointwise CI
lines(est.mean.CIs[, 3] ~ newX$velocity, lty = 2, col = 2)   # upper pointwise CI
lines(pred.PIs[, 2] ~ newX$velocity, lty = 3, col = 3)       # lower prediction PI
lines(pred.PIs[, 3] ~ newX$velocity, lty = 3, col = 3)       # upper prediction PI
lines(confband.Scheffe.low ~ velocity, data = newX, lty = 4, col = 4)  # lower confidence band
lines(confband.Scheffe.hi ~ velocity, data = newX, lty = 4, col = 4)   # upper confidence band

[Figure: scatterplot of distance (megaparsec) vs. velocity (km/s) showing the fitted line, pointwise confidence intervals, prediction intervals, and the Scheffé confidence band.]

7.4.4 The Calibration Problem

• Calibration = guessing the value of X that results in Y = Y0.
  – Also called inverse prediction.
  – Why can't we just reverse the response and the explanatory variable?
• Methods for inference (a numerical sketch of methods 1 and 3 follows this list):
  1. Invert the prediction relationship to get:

        Pred{X|Y0} = (Y0 − β̂0) / β̂1

  2. Graphical method:
    (a) Plot the prediction bands for predicting Y from X.
    (b) Draw a horizontal line at Y0.
    (c) From the points where that line crosses the two prediction bands, draw vertical lines down to the X-axis.
    (d) Take the two points where the vertical lines meet the X-axis as the calibration interval.
  3. Direct calculation of an approximate SE(X̂):
    – For predicting the value of X at which the mean of Y is Y0:

          SE(X̂) = SE(µ̂{Y|X̂}) / |β̂1|

    – For predicting the value of X at which the value of Y is Y0:

          SE(X̂) = SE(Pred{Y|X̂}) / |β̂1|
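A small numerical sketch (not in the original notes) of calibration methods 1 and 3 for the Big Bang fit, taking Y0 = 1 megaparsec as an illustrative target and using the mean-of-Y version of the SE formula:

b <- coef(bigbang.lm)
Y0 <- 1                                    # illustrative target distance (megaparsec)
X.hat <- as.numeric((Y0 - b[1]) / b[2])    # method 1: inverted prediction, about 438 km/s
se.mean <- predict(bigbang.lm,
                   newdata = data.frame(velocity = X.hat),
                   se.fit = TRUE)$se.fit   # SE of the estimated mean of Y at X.hat
se.X.hat <- se.mean / abs(b[2])            # method 3: approximate SE(X.hat)
c(X.hat = X.hat, SE = se.X.hat)

For the future-value version, replace se.mean with the prediction standard error, sqrt(0.4049588^2 + se.mean^2), using the residual standard error from the earlier output.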
7.5 Related Issues

The Regression Effect (regression to the mean)

• Occurs in any test–retest situation:
  – Subjects who score high on the first test will, as a group, score closer to the average (lower) on the second test.
  – Subjects who score low on the first test will, as a group, score closer to the average (higher) on the second test.
• There are many more individuals with skill levels near the average than there are with skill levels one SD or more from it. So, by chance alone, more of the individuals appearing in any given strip have skill levels closer to the average (see Display 7.13).
• regression fallacy = the mistake of attaching some broader meaning to the regression effect.

Causation

• It is tempting to think of the explanatory variable as causing different values of the response. But if the data come from an observational study, cause-and-effect statements are NOT justified.
• Get used to using the word association (between the mean response and the value of the explanatory variable)!

Correlation

• The sample correlation coefficient describes the degree of linear association between any two random variables X and Y:

      rXY = [ (1/(n − 1)) Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) ] / (sX sY)

• Correlation does not depend on distinguishing between the response and explanatory variables.
• What are the units of correlation?
• What is the range of values for correlation?
• Only use the correlation for inference if the pairs (X, Y) are randomly selected from a population (typically not the case in regression scenarios).
  – For regression, X does not have to be random.
  – It is a common mistake to base conclusions on correlations when the X's are not random.
• Correlation only measures the degree of LINEAR association!
  – It is possible for there to be an exact relationship between X and Y and still have a sample correlation coefficient of ZERO! (See the handout, and the short demonstration at the end of this section.)

A few questions

• What is wrong with the following formulation of the regression model? Y = β0 + β1X
• At what value of X will there be the most precise estimate of the mean of Y?
• At what value of X will there be the most precise prediction of a future Y?
• Consider the regression of weight on height for a sample of adult males. Suppose the estimated intercept is 5 kg. Does this imply that males of height 0 weigh 5 kg on average? Does this imply that the simple linear regression model is meaningless?
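Finally, a tiny demonstration (toy data, not the handout's example) of the warning above that correlation measures only linear association: here Y is an exact function of X, yet the sample correlation is exactly zero because the relationship is symmetric rather than linear.

x <- -3:3    # values symmetric around zero
y <- x^2     # exact, but nonlinear, relationship
cor(x, y)    # [1] 0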