Objectives

10.1 Simple linear regression
- Statistical model for linear regression
- Estimating the regression parameters
- Confidence interval for regression parameters
- Significance test for the slope
- Confidence interval for $\mu_y$
- Prediction intervals

Statistical model for linear regression

In the population, the linear regression equation is $y = \beta_0 + \beta_1 x + e$, where $e$ is the random deviation (or error) of the response variable from the prediction formula. Usually, we assume that $e$ has a Normal(0, σ) distribution. $\beta_0$ (y-intercept) and $\beta_1$ (slope) are the parameters.

Statistical inference is conducted to draw conclusions about the parameters:
- Confidence interval and hypothesis test for $\beta_1$. We especially want to test whether the slope equals zero.
- Confidence interval for $\beta_0 + \beta_1 x$, given a value for x.
- Prediction interval for a random y, given a value for x.

Estimating the parameters

The population linear regression equation is $y = \beta_0 + \beta_1 x + e$. The sample fitted regression line is $\hat{y} = b_0 + b_1 x$. $b_0$ is the estimate for the intercept $\beta_0$ and $b_1$ is the estimate for the slope $\beta_1$.

We also estimate σ (the standard deviation of $e$), using
$$s_e = \sqrt{\frac{\sum \text{residual}^2}{n - 2}} = \sqrt{\frac{(n - 1)(1 - r^2)}{n - 2}}\; s_y,$$
where $r$ is the sample correlation and $s_y$ is the sample standard deviation of y. $s_e$ is a measure of the typical size of a residual $y - \hat{y}$. We will use $s_e$ to compute the standard errors we need.

Confidence interval for the slope parameter

Before we do inference for the slope parameter $\beta_1$, we need the standard error for the estimate $b_1$:
$$SE_{b_1} = \frac{s_e}{\sqrt{(n - 1)\, s_x^2}}.$$
We use the t distribution, now with $n - 2$ degrees of freedom. A level C confidence interval for the slope, $\beta_1$, is
$$b_1 \pm t^* \, SE_{b_1}.$$
$t^*$ is the table value for the $t(n - 2)$ distribution with area C between $-t^*$ and $t^*$. "Confidence" has the same interpretation as always.

Significance test for the slope parameter

We can test the hypothesis $H_0: \beta_1 = m$ versus either a 1-sided or a 2-sided alternative, using a t-statistic. (The primary case is with $m = 0$.) We calculate
$$t = \frac{b_1 - m}{SE_{b_1}}$$
and use the $t(n - 2)$ distribution to find the P-value of the test.
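The estimates and standard errors defined so far can be computed by hand in a few lines. The following is a minimal sketch, not the course's method: it assumes the usual least-squares formulas for $b_0$ and $b_1$ (which the slides do not spell out), the function names are hypothetical, the five data points are made up for illustration, and $t^* = 3.182$ is the t table value for 95% confidence with df = 3.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares line plus the standard errors used for slope inference."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x; this equals (n - 1) * s_x**2.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # s_e: typical size of a residual, with n - 2 degrees of freedom.
    s_e = math.sqrt(sum(res ** 2 for res in residuals) / (n - 2))
    se_b1 = s_e / math.sqrt(sxx)        # standard error of the slope
    return {"n": n, "b0": b0, "b1": b1, "s_e": s_e, "se_b1": se_b1}

def slope_t_statistic(fit, m=0.0):
    """t statistic for H0: slope = m, to be compared with t(n - 2)."""
    return (fit["b1"] - m) / fit["se_b1"]

def slope_confidence_interval(fit, t_star):
    """b1 +/- t* SE_b1, where t* comes from a t(n - 2) table."""
    margin = t_star * fit["se_b1"]
    return (fit["b1"] - margin, fit["b1"] + margin)

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = fit_simple_regression(x, y)
t = slope_t_statistic(fit)                          # H0: slope = 0
ci = slope_confidence_interval(fit, t_star=3.182)   # 95%, df = 5 - 2 = 3
print(fit)
print("t =", t)
print("95% CI for the slope:", ci)
```

The P-value itself still has to be read from a t table or obtained from software, since the Python standard library has no t distribution.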
Note: Software typically provides two-sided P-values.

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear; choose Hypothesis Test
[StatCrunch output; the slide marks $s_e$ and df = $n - 2$ on it.]
Fitted line: $\hat{y} = 0.0515 + 0.005708x$.

To test $H_0: \beta_1 = 0$ with α = 0.05, we compute
$$t = \frac{b_1 - m}{SE_{b_1}} = \frac{0.0057084 - 0.0}{0.0012452} = 4.584.$$
From the t-table, using df = 28 - 2 = 26, we can see that the P-value is less than 0.0005. Since it is very small, we reject $H_0$ and conclude that the slope is not zero.

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear; choose Confidence Interval
Fitted line: $\hat{y} = 0.0515 + 0.005708x$.

Having decided that the slope is not zero, we next estimate it with a 95% confidence interval:
$$b_1 \pm t^* \, SE_{b_1} = 0.0057084 \pm 2.056 \times 0.0012452 = (0.00315,\ 0.00827).$$

Confidence interval for $\beta_0 + \beta_1 x$

We can also calculate a confidence interval for the regression line itself, at any choice of x. Generally this is sensible as long as x is within the range of the data observed (interpolation). Extrapolation should only be done with a great deal of caution.

The interval is centered on $\hat{y} = b_0 + b_1 x$, but we need a standard error for this particular estimate:
$$SE_{\hat{y}} = s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)\, s_x^2}}.$$
The confidence interval is then calculated in the usual fashion:
$$\hat{y} \pm t^* \, SE_{\hat{y}} = b_0 + b_1 x \pm t^* \, SE_{\hat{y}}.$$
This is an estimate of the point on the line (the expected value of y) for the given value of x.

Prediction interval for a new observation y

It is often of greater interest to predict what the actual y value might be (not just what it is expected to be). Such a prediction interval for an actual (new) observation y must account for both the estimation of the line and the random deviation $e$ away from that line.

The interval is again centered on $\hat{y} = b_0 + b_1 x$, but now we also account for the random deviation. The prediction interval for the actual y, with a given value for x, is
$$\hat{y} \pm t^* \sqrt{s_e^2 + SE_{\hat{y}}^2}.$$
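The two interval formulas above differ only in the quantity under the square root, which a short sketch makes easy to compare. The function names and all summary statistics below are hypothetical (they are not the ozone data); $t^* = 2.056$ is the t table value for 95% confidence with df = 26, as used earlier.

```python
import math

def se_mean_response(x, n, x_bar, s_x, s_e):
    """Standard error of y-hat as an estimate of the mean response at x."""
    return s_e * math.sqrt(1.0 / n + (x - x_bar) ** 2 / ((n - 1) * s_x ** 2))

def mean_confidence_interval(y_hat, t_star, se_yhat):
    """Interval for the expected value of y at x: y-hat +/- t* SE_yhat."""
    margin = t_star * se_yhat
    return (y_hat - margin, y_hat + margin)

def prediction_interval(y_hat, t_star, s_e, se_yhat):
    """Interval for a new observation y at x; adds s_e**2 under the root."""
    margin = t_star * math.sqrt(s_e ** 2 + se_yhat ** 2)
    return (y_hat - margin, y_hat + margin)

# Hypothetical summary statistics, for illustration only.
n, x_bar, s_x, s_e = 28, 10.0, 3.0, 0.5
b0, b1 = 1.0, 0.2
t_star = 2.056  # t table value: 95% confidence, df = 28 - 2 = 26

for x in (10.0, 13.0, 16.0):
    y_hat = b0 + b1 * x
    se = se_mean_response(x, n, x_bar, s_x, s_e)
    ci = mean_confidence_interval(y_hat, t_star, se)
    pi = prediction_interval(y_hat, t_star, s_e, se)
    print(f"x = {x}: SE = {se:.4f}, CI = {ci}, PI = {pi}")
```

In this sketch the standard error is smallest at $x = \bar{x}$ and grows as x moves away from it, which is one way to see why extrapolation calls for caution. The prediction interval is wider than the confidence interval at every x.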
The distinction between a confidence interval and a prediction interval is whether you want to capture the expected value of y or the actual value of y.

Prediction intervals

Unlike a confidence interval, a prediction interval does not keep getting narrower as you increase the sample size. This is because:

The confidence interval is estimating a parameter, such as the mean, the slope, or the regression line. For example, if I am interested in the mean grade on midterm 3 of all people who scored 10 on midterm 2, the CI will get narrower as the sample size grows (because the estimators tend to get better for large sample sizes).

The prediction interval is completely different. Here we are trying to predict the grade of a randomly selected person who scored 10 on midterm 2. There will be a lot of variability, and it does not improve as we increase the sample size: every individual is different. It is like predicting the weight of someone who is 6 feet tall: even if we know the average weight of a 6-footer, there is huge variation within this group, so the prediction interval must be wide for us to be able to capture the weight. In the formula, the $s_e^2$ term under the square root does not shrink as n grows, while $SE_{\hat{y}}^2$ does.

This is a fundamental difference between predicting the measurement of an individual and estimating the mean. The estimate of the mean gets better with sample size; the prediction for an individual does not.

Efficiency of a biofilter, by temperature

In StatCrunch: Stat-Regression-Simple Linear; choose Predict Y for X

Fitted value at temperature = 16:
$$\hat{y} = 97.5 + 0.0757 \times 16 = 98.71.$$
For a 95% confidence interval of the expected efficiency, with temperature = 16, we compute
$$\hat{y} \pm t^* \, SE_{\hat{y}} = 98.71 \pm 2.042 \times 0.0393 = (98.63,\ 98.79).$$
For a 95% prediction interval of the actual efficiency, with temperature = 16, we compute
$$\hat{y} \pm t^* \sqrt{s_e^2 + SE_{\hat{y}}^2} = 98.71 \pm 2.042 \sqrt{0.1552^2 + 0.0393^2} = (98.38,\ 99.04).$$
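The biofilter arithmetic can be checked directly, and the same numbers illustrate the sample-size point made earlier. This is a sketch under stated assumptions: the first part uses only the values quoted on the slide, while the second part is hypothetical. It holds $s_e$ and $t^*$ fixed, takes $x = \bar{x}$ so that $SE_{\hat{y}} = s_e / \sqrt{n}$, and varies n (in reality $t^*$ would also drift slightly toward 1.96).

```python
import math

# Values quoted on the slide.
y_hat = 97.5 + 0.0757 * 16   # fitted value at temperature = 16
t_star = 2.042               # t table value used on the slide
s_e = 0.1552
se_yhat = 0.0393

ci_margin = t_star * se_yhat
pi_margin = t_star * math.sqrt(s_e ** 2 + se_yhat ** 2)

ci = (round(y_hat - ci_margin, 2), round(y_hat + ci_margin, 2))
pi = (round(y_hat - pi_margin, 2), round(y_hat + pi_margin, 2))
print("95% CI for the mean response:", ci)   # (98.63, 98.79)
print("95% prediction interval:", pi)        # (98.38, 99.04)

# Hypothetical what-if: same s_e and t*, at x = x_bar, for growing n.
ci_margins = []
pi_margins = []
for n in (32, 320, 3200):
    se_at_mean = s_e / math.sqrt(n)          # SE of y-hat when x = x_bar
    ci_margins.append(t_star * se_at_mean)
    pi_margins.append(t_star * math.sqrt(s_e ** 2 + se_at_mean ** 2))
    print(f"n = {n}: CI margin = {ci_margins[-1]:.4f}, "
          f"PI margin = {pi_margins[-1]:.4f}")
```

Under these assumptions the confidence interval margin heads toward zero as n grows, while the prediction interval margin levels off just above $t^* s_e \approx 0.317$.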