AP STATISTICS: Chapter 14 Inference for Regression

When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use a least-squares line fitted to the data to predict y (y-hat) for a given value of x. Now the question becomes: "If the relationship is truly linear, what is the equation of the true line?" In other words, how close is our least-squares equation, estimated from a single data set, to the TRUE equation?

EXAMPLE: Infants who cry easily may be more easily stimulated than others, and this may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants four to ten days old and their later IQ test scores. A snap of a rubber band on the sole of the foot caused the infants to cry. The researchers recorded the crying and measured its intensity by the number of peaks in the most active 20 seconds. They later measured the children's IQ at age three years using the Stanford-Binet IQ test. Data for 38 infants are listed as follows.

Make a scatterplot. Does the relationship appear to be roughly linear?

Perform a least-squares fit. The resulting equation is typically used for predicting y for a given value of x. What are the correlation r and the value of r²? Remember that r² describes how well the regression line fits the data: it is the proportion of the observed variation in y that is accounted for by the straight-line relationship of y and x.

Plot the line with the data. Are there any obvious outliers or influential points? Remember that outliers are points that lie far from the overall pattern. Influential points are those that, if removed, would markedly change the fitted line; they are usually points that are far out in the x direction and isolated from other points.
It is dangerous to use a predictive model if influential points are present.

y-hat = 91.27 + 1.493x
r = 0.455 and r² = 0.207

This value of r² indicates that approximately 21% of the variation in IQ scores is (or can be) explained by a linear relationship with crying intensity.

Now for Advanced Statistics

The slope b and intercept a of the least-squares line are statistics. They are estimates, computed from our sample, and would almost certainly change if they were calculated from another data set. The intercept a estimates the unknown parameter α, and the slope b estimates the unknown parameter β. Let's go after the parameters α and β.

Assumptions for Regression Inference

We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.
• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other.
• The mean response μy has a straight-line relationship with x: μy = α + βx. The slope β and intercept α are unknown parameters.
• The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.

This model states that "on the average" there is a straight-line relationship between y and x. The TRUE REGRESSION line μy = α + βx says that the mean response μy moves along a straight line as the explanatory variable x changes. The values of y that we do observe vary about their means according to a normal distribution. If we hold x fixed and take many observations on y, the normal pattern will eventually appear in a stemplot, histogram, or the like.

OK, so here we go.

The first step is to estimate the unknown parameters α, β, and σ. From the least-squares line y-hat = a + bx we have the following:
• The slope b of the least-squares line is an unbiased estimator of the true slope β.
• The intercept a of the least-squares line is an unbiased estimator of the true intercept α.

Note that it is generally the slope of the line that is of the greatest interest. A slope is a rate of change. In the case of the Crying-IQ data, the true slope β says how much the average IQ score changes when the value of crying intensity x is increased by 1.

• σ is the standard deviation that describes the variability of the response y about the true regression line. Since the least-squares line estimates the true regression line, the residuals estimate how much y varies about the true line. Remember that a residual is observed y minus predicted y: residual = y - y-hat.

Because σ is the standard deviation of responses about the true regression line, it is estimated by a sample standard deviation of the residuals. This sample standard deviation is called a standard error. Remember that the sum of the residuals is always zero; hence their mean is always zero.

Standard Error About the Least-Squares Line

The standard error about the line is

s = sqrt( Σ residual² / (n - 2) ) = sqrt( Σ (y - y-hat)² / (n - 2) )

We use s to estimate the unknown σ in the regression model. The standard error about the line is the key measure of the variability of the responses in regression. It is part of the standard error of all the statistics we will use for inference.

Confidence Intervals for the Regression Slope

The slope β of the true regression line is the rate of change of the mean response as the explanatory variable increases. The slope b of the least-squares line is an unbiased estimator of β. We can calculate a confidence interval for β, and it has the familiar form:

estimate ± t* SEestimate

Confidence Interval for the Regression Slope

A level C confidence interval for the slope β of the true regression line is

b ± t* SEb

In this recipe, the standard error of the least-squares slope b is

SEb = s / sqrt( Σ (x - x-bar)² )

and t* is the upper (1 - C)/2 critical value from the t distribution with n - 2 degrees of freedom.
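
The slope interval can be traced step by step in code. What follows is a minimal Python sketch, not part of the original notes. It assumes the usual textbook formulas s = sqrt(Σ residual² / (n - 2)) and SEb = s / sqrt(Σ (x - x-bar)²). The function name slope_confidence_interval and the five data points are made up for illustration, and t* is supplied by hand from a t table (3.182 is the upper 0.025 critical value for n - 2 = 3 degrees of freedom).

```python
import math

def slope_confidence_interval(x, y, t_star):
    """Return (b, s, se_b, (low, high)) for the least-squares slope.

    t_star is the upper (1 - C)/2 critical value of t(n - 2),
    looked up in a t table by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx              # least-squares slope
    a = y_bar - b * x_bar      # least-squares intercept

    # residual = observed y minus predicted y
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # standard error about the line (n - 2 degrees of freedom)
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    se_b = s / math.sqrt(sxx)  # standard error of the slope

    margin = t_star * se_b
    return b, s, se_b, (b - margin, b + margin)

# Hypothetical data, made up for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, s, se_b, (low, high) = slope_confidence_interval(x, y, t_star=3.182)
print(f"b = {b:.3f}, s = {s:.3f}, SEb = {se_b:.3f}")
print(f"95% CI for the slope: ({low:.3f}, {high:.3f})")
```

Passing t* in as an argument mirrors the table-lookup workflow of the notes and keeps the sketch free of third-party libraries; a statistics package would normally compute the critical value itself.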
Shown below is the basic output for the Crying Intensity-IQ data using the regression command in the Minitab software package. Note that Minitab, like most software packages, produces more information than the basic output. Use only what you need.

For a 95% confidence interval for β:

b ± t* SEb

Thus we are 95% confident that mean IQ increases by between about 0.5 and 2.5 points for each additional peak in crying.

A similar calculation can be performed to estimate α, but it is seldom used.

Using the Hypothesis of NO Linear Relationship

The most commonly tested hypothesis about the slope β is

H0: β = 0

A regression line with slope 0 is horizontal. That is, the mean of y does not change at all when x changes.
• So this hypothesis says there is NO true linear relationship between x and y;
• or the straight-line dependence on x is of no value for predicting y;
• or there is no correlation between x and y in the population from which we drew our data.

Significance Tests for Regression Slope

To test the hypothesis H0: β = 0, compute the t statistic

t = b / SEb

In terms of a random variable T having the t(n - 2) distribution, the P-value for a test of H0 against

Ha: β > 0 is P(T > t)
Ha: β < 0 is P(T < t)
Ha: β ≠ 0 is 2P(T > |t|)

The previous example of computer output also gave the t statistic and its associated two-sided P-value. You can always do the calculation on the TI-83 with STAT/TESTS/LinRegTTest.

EXAMPLE: How well does the number of beers a student drinks predict his or her blood alcohol content?
Sixteen student volunteers at the University of Tennessee drank a randomly assigned number of cans of beer. Thirty minutes later, a police officer measured their blood alcohol content (BAC). The data are as follows:

Student  Beers  BAC        Student  Beers  BAC
1        5      0.10       9        3      0.02
2        2      0.03       10       5      0.05
3        9      0.19       11       4      0.07
4        8      0.12       12       6      0.10
5        3      0.04       13       5      0.085
6        7      0.095      14       7      0.09
7        3      0.07       15       1      0.01
8        5      0.06       16       4      0.05

Minitab output for the blood alcohol content data

Scatterplot of students' blood alcohol content against the number of cans of beer consumed. The dotted line is the least-squares line fitted with the possible outlier (9 beers consumed) removed. For this line r² = 77%.

Is there evidence to suggest that the more beers consumed, the higher the BAC? If so, give a 90% confidence interval for the slope of the regression line.

Inference About Predictions

One of the most common reasons to fit a line to data is to predict the response to a particular value of the explanatory variable. That is, substitute a specific value for x and then calculate y-hat. The predictive equation for BAC is

y-hat = -0.0127 + 0.0180x

Now the question becomes what it is that you want to predict. Do you want to estimate μy, the mean response for all observations at that value of x, or are you interested in predicting an individual response y for just one observation at that x? In both cases the method of prediction is the same: the value of x is put into the equation and y-hat is calculated. However, the margin of error is different for the two kinds of prediction. A larger margin of error is needed to bracket the response for one observation than to bracket the mean response for all observations at that value of x.
• We use a confidence interval to estimate the mean response.
• We use a prediction interval to estimate an individual response.

In both cases the form of the interval is

y-hat ± t* SE

For a level C confidence interval for the mean response μy when x takes the value x*, the standard error is

SE(μ-hat) = s sqrt( 1/n + (x* - x-bar)² / Σ (x - x-bar)² )

For a level C prediction interval for a single observation at x = x*, the standard error is

SE(y-hat) = s sqrt( 1 + 1/n + (x* - x-bar)² / Σ (x - x-bar)² )

In both cases, t* is the upper (1 - C)/2 critical value of the t distribution with n - 2 degrees of freedom.

Statistical software calculates these intervals. Minitab would produce the following output for prediction when x = 5 beers:

Predicted Values
Fit       StDev Fit   95% CI                95% PI
0.07712   0.00513     (0.06612, 0.08812)    (0.03192, 0.12232)

Note that StDev Fit is the standard error for the mean response. The key point here is that it is harder to predict one response than to predict a mean response.

A Reminder of the Regression Assumptions
1. The true relationship is linear.
2. The standard deviation of the response about the true line is the same everywhere.
3. The response varies normally about the true regression line.
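
The Minitab prediction output for x = 5 beers can be checked by hand. The following is a minimal Python sketch, not part of the original notes, using the 16 beer/BAC pairs from the example. It assumes the standard textbook formulas SE = s sqrt(1/n + (x* - x-bar)² / Σ (x - x-bar)²) for the mean response and the same expression with an extra 1 under the square root for a single observation. The function name interval_estimates is hypothetical, and t* = 2.145 is the upper 0.025 critical value of t(14) taken from a table.

```python
import math

# Beers and BAC for the 16 students in the example
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

def interval_estimates(x, y, x_star, t_star):
    """Return (fit, se_mean, conf_int, pred_int) at x = x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx              # least-squares slope
    a = y_bar - b * x_bar      # least-squares intercept
    # standard error about the line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    fit = a + b * x_star       # y-hat at x_star
    spread = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(spread)        # SE for the mean response
    se_pred = s * math.sqrt(1 + spread)    # SE for one new observation

    conf_int = (fit - t_star * se_mean, fit + t_star * se_mean)
    pred_int = (fit - t_star * se_pred, fit + t_star * se_pred)
    return fit, se_mean, conf_int, pred_int

fit, se_mean, conf_int, pred_int = interval_estimates(
    beers, bac, x_star=5, t_star=2.145)
print(f"Fit = {fit:.5f}, StDev Fit = {se_mean:.5f}")
print(f"95% CI = ({conf_int[0]:.4f}, {conf_int[1]:.4f})")
print(f"95% PI = ({pred_int[0]:.4f}, {pred_int[1]:.4f})")
```

The printed values should agree with the Minitab output to roughly four decimal places; small differences in the last digit come from rounding t*. The prediction interval is wider than the confidence interval because of the extra 1 under the square root, which reflects the variability of a single response about its mean.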