STAT 101, Module 10: Statistical Inference for Simple Linear Regression (Book: chapter 12)

Uncertainty about slopes

When we fit lines to x-y data, the single most important quantity we estimate is almost always the slope. It tells us what average difference in y to expect for a given difference in x. For example, in the diamond data we expect an average difference of S$372 for a 0.1 carat difference in weight. The question we now raise is: How certain are we about this slope?

The uncertainty in the slope stems from the fact that another dataset of weight-price data would produce a slightly different slope estimate. We should therefore be concerned with the variability of slopes from dataset to dataset, just as we were concerned with the variability of means and differences of means from dataset to dataset. Our next questions are therefore: Can we quantify the dispersion of slopes from dataset to dataset? In other words, are there standard errors for slopes, as there were for means and differences of means? If so, can one derive standard error estimates for slopes that are computable from a single dataset? The happy answer is of course "yes" to both of these questions.

Populations and samples for lines: the simple linear regression model

We need a notion of a population slope which is the target estimated by sample slopes, just as population means (expected values) were the targets estimated by sample means. In solving this problem, we also solve the problem of a population intercept and a population standard deviation around the line. It all comes together in the simple linear regression model (SLRM):

    yi = β0 + β1 xi + εi

This is to be interpreted as follows: for a predictor value xi, the response value yi comes about as the sum of two things:
o a straight-line formula, β0 + β1 xi. The coefficients β0 and β1 are the population intercept and the population slope, respectively. They are unknown and are to be estimated by the sample intercept b0 and the sample slope b1, just as the population mean was to be estimated by the sample mean.
o an "error" εi which represents the variation in yi that cannot be captured by the straight line. This "error" (not a good term) is assumed to be a random variable with zero population mean and standard deviation σε. This implies that the response yi is also a random variable, but that its population mean is β0 + β1 xi (the reason: adding a constant to a random variable adds the same constant to the population mean).

There are some oddities about the SLRM, not the least being that the predictor values xi are not considered random. In other words, if we imagine other datasets just like the diamond data, the SLRM expects them to have the exact same column of weights. Only the prices would be slightly different, because the errors are random.

One other important element of the SLRM should be mentioned: the errors are not only assumed random but also independent. This means that we really can't infer whether the price of a specific diamond is going to be above or below the line from knowing the prices of other diamonds. This assumption makes sense unless we have additional information, such as (hypothetically): diamonds 1 through 5 were cut by an artisan who is more skilled than the other artisans, hence they should command a higher price than the weight alone would tell us.

Even though we may find some aspects of the SLRM strange, let us buy into it and see what the consequences are. For one thing, it allows us to play "as if" games and simulate data.
Simulations are always useful to illustrate statistical concepts and make them concrete. Don't miss the point of what's coming: we are illustrating the concept of an SLRM, not giving instructions for data analysis! In what follows we will simulate data that may look like the diamond data, if they actually followed the SLRM. To this end we have to adopt a "view from above" and choose population values β0, β1, σε, as well as decide exactly what distribution the errors εi should have. Here is what the instructor chose (a code sketch of the simulation follows the plots below):
o The population values shall be: β0 = −260, β1 = 3720. These values were chosen because they are round yet close to the values we saw as sample estimates in the actual data. The simulation can therefore generate prices that look like the real ones, except for random variation.
o The probability distribution of the errors εi shall be normal with με = 0 (as mandated by the SLRM) and σε = 32. This value for the error standard deviation was again chosen to be round yet close to the estimate (the RMSE) seen in the actual data. The choice of the normal distribution is not unreasonable because the residuals in the actual data look pretty normal, as a normal quantile plot revealed (not shown here).
o Recall that we have no choice in picking the weights: they must be the same as those in the actual data. We can only simulate new prices for the given weights.
o Finally, we are not going to allow the simulation to produce prices with 15 digits of precision. To be realistic, we round the prices to whole S$s.

Below are six plots: five show datasets simulated from this model, and one shows the actual data. Can you tell which is the actual data? If you can't, it is a pretty good sign that the model is not so bad: it would seem to show not only the proper kind of linear association but also the proper kind of randomness. If, however, you can tell which is the actual data, and if you can identify features that make it different from the simulated datasets, then this is a sign that the model is flawed. The flaws may not be serious, but the model would not be perfect.

The final message is that we have a way to describe a population for linearly associated data. Correspondingly, we have
o a notion of population slopes β1, intercepts β0, and error standard deviations σε, as well as
o a notion of datasets sampled from the population, and hence
o a notion of dataset-to-dataset variability for estimates of slopes b1, estimates of intercepts b0, and estimates of error standard deviations RMSE = se.

[Figure: six scatterplots of Price versus Weight (carats), labeled Sim1 through Sim6, all on the same axes; five panels show simulated prices and one shows the actual diamond data.]
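To make the simulation concrete, here is a minimal sketch in Python, assuming NumPy is available. The population values are the ones chosen above; the weights array is a made-up stand-in, since the actual column of diamond weights is not listed in these notes.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is reproducible

# Population parameters chosen "from above" in the text:
beta0, beta1, sigma_eps = -260.0, 3720.0, 32.0

# The predictor values are NOT random: the SLRM reuses the same weights in
# every dataset. These values are made-up stand-ins for the real column.
weights = np.array([0.12, 0.15, 0.18, 0.20, 0.23, 0.26, 0.29, 0.33])

# One simulated dataset: line plus independent normal errors,
# rounded to whole S$ as described in the text.
errors = rng.normal(loc=0.0, scale=sigma_eps, size=len(weights))
prices = np.round(beta0 + beta1 * weights + errors)

print(np.column_stack([weights, prices]))
```

Re-running the last block (or changing the seed) produces another dataset from the same population, which is exactly the dataset-to-dataset variability this module is about.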
Population lines versus fitted lines, errors versus residuals

Making this distinction seems to be difficult for many students who learn about the SLRM. Keep in mind that the SLRM describes the population, whereas you work on data drawn from the population. In detail:
o The SLRM is determined by three numbers, the population parameters β0, β1, σε. These population parameters are never known, but they are the targets to which the estimates b0, b1, se should get close. This means that the population has one fixed "true" line determined by β0, β1, while the lines we fit to data are determined by b0, b1, and these estimates vary across datasets drawn from the same population. Hence one has to imagine that the fitted lines vary around the true population line from dataset to dataset.
o Since the population line and the fitted lines are not the same, even though we hope the latter are near the former, it follows that errors and residuals are not the same either:
  errors εi are the vertical offsets of the actual yi (e.g., Price) from the unknown, true population line: εi = yi − (β0 + β1 xi);
  residuals ei are the vertical offsets of the actual yi (e.g., Price) from the estimated/fitted line: ei = yi − (b0 + b1 xi).
It is not a crime to confuse the two, since there is almost no practical consequence, but it shows a lack of understanding of what is going on. To clarify things, keep staring at the six plots above, five of which show data simulated from a hypothetical model. These five datasets produce slightly different fitted lines, even though they come from the same population, and none of these five lines totally agrees with the population line, even though they are all close. How close? That is the question to be answered with standard errors.

Standard errors and statistical inference for slope estimates

Recall the formulas for the estimates from Module 4:
o the slope estimate: b1 = cov(x, y) / s²(x)
o the error SD estimate (= RMSE): RMSE = se = sqrt( RSS / (N−2) ) = sqrt( (e1² + e2² + ... + eN²) / (N−2) )
(We will gladly ignore the intercept estimate.)

The standard error of the slope estimate is given by the following formula (no derivation or even heuristic given here):

    σ(b1) = σε / ( SD(x) · N^(1/2) )

The only thing that matters to us is that it suggests a standard error estimate quite easily, by estimating the error SD σε with the residual SD se (= RMSE):

    stderr(b1) = se / ( SD(x) · N^(1/2) )

This is the magic formula of simple linear regression! It allows us to do statistical inference as we know it:
o construct confidence intervals: CI = ( b1 − 2·stderr(b1), b1 + 2·stderr(b1) ),
o and perform tests of the null hypothesis H0: β1 = 0 with the t-statistic t = b1 / stderr(b1).

The null hypothesis of a zero slope is standard. Its t-statistic and p-value are given in every regression output. Why this obsession with a zero slope? Because a zero slope means that there is no linear association between x and y (recall that a zero slope is equivalent to a zero correlation). If the data are incompatible with the assumption of a zero slope, we can infer that there is a slope, and it makes sense to estimate it. The above CI will then give us a range of plausible values for the true population slope: as usual for a 95% CI, it is a random interval that has a 0.95 probability of containing the true population slope across datasets generated from the SLRM.

Examples of inference for straight line fits

Example 1: the Diamond data. Here is the output that we can interpret from now on:

Parameter Estimates
Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         -259.6259   17.31886    -14.99    <.0001
Weight (carats)   3721.0249   81.78588     45.50    <.0001

The last line has the slope estimate, about S$3720/carat. Its standard error estimate is about 82 (same units as the slope). The null hypothesis of a zero slope is vastly rejected because of the insanely large t-ratio of over 45 and a p-value so small that it is not even reported. The 95% CI is about 3720 ± 164 = (3556, 3884), which contains the plausible values for the population slope β1.
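Here is a minimal sketch, again in Python with NumPy, of how such output comes about from the formulas above. The function name and the arrays x and y are placeholders (Weight and Price in Example 1); on the diamond data it should reproduce the slope row of the output up to rounding, except that software computes the CI with a t-quantile rather than the rounded factor 2.

```python
import numpy as np

def slope_inference(x, y):
    """Slope estimate, standard error, t-ratio, and rough 95% CI,
    computed with this module's formulas."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)

    # b1 = cov(x, y) / s^2(x), both with denominator N - 1:
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = np.mean(y) - b1 * np.mean(x)           # fitted intercept

    # Residuals and RMSE = s_e = sqrt(RSS / (N - 2)):
    e = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(e**2) / (N - 2))

    # stderr(b1) = s_e / (SD(x) * N^(1/2)); with ddof=0,
    # SD(x) * sqrt(N) equals sqrt(sum of squared deviations of x).
    stderr_b1 = se / (np.std(x, ddof=0) * np.sqrt(N))

    t = b1 / stderr_b1                          # t-statistic for H0: beta1 = 0
    ci = (b1 - 2 * stderr_b1, b1 + 2 * stderr_b1)
    return b1, stderr_b1, t, ci
```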
Example 2: the Penn Student survey. We can check whether students are on average still getting taller with age. Here is the regression of Height on Age:

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   65.476114   3.246521    20.17     <.0001
AGE         0.1186769   0.168809     0.70     0.4825

We see that this is not the case: the null hypothesis of a zero slope (a zero average difference in Height for a 1-year difference in Age) cannot be rejected. The t-ratio is 0.70 and hence within the interval ±2. Consequently, the p-value is about 0.5 and very much above 0.05. In general it is quite difficult to find a sensible example where the slope is statistically insignificant. This will change in Statistics 102, where you will see multiple linear regression with more than one predictor.

Some insights from the standard error formula for slopes

We reproduce the formula one more time:

    σ(b1) = σε / ( SD(x) · N^(1/2) )

Here are some insights: the uncertainty about the estimated slope decreases as
o the errors get less dispersed, that is, σε gets smaller,
o the predictor values get more dispersed, that is, SD(x) gets larger,
o the number of observations N gets larger.
All of the above should make sense: less error, better spread-out predictor values, and more data should all stabilize the line, and hence drive down the uncertainty about the line, and hence about the slope.
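These insights can be checked by simulation: generate many datasets from the SLRM, fit a line to each, and compare the dataset-to-dataset dispersion of the slope estimates with what the formula predicts. A minimal sketch, reusing the made-up weights from the earlier simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
beta0, beta1, sigma_eps = -260.0, 3720.0, 32.0
weights = np.array([0.12, 0.15, 0.18, 0.20, 0.23, 0.26, 0.29, 0.33])
N = len(weights)

def fitted_slope(x, y):
    # b1 = cov(x, y) / s^2(x)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Fit a line to each of 10,000 datasets simulated from the same population.
slopes = [fitted_slope(weights,
                       beta0 + beta1 * weights + rng.normal(0.0, sigma_eps, N))
          for _ in range(10_000)]

# Dataset-to-dataset SD of b1 versus the formula sigma(b1):
print("empirical SD of slopes:", np.std(slopes))
print("sigma_eps / (SD(x) * sqrt(N)):",
      sigma_eps / (np.std(weights, ddof=0) * np.sqrt(N)))
```

The two printed numbers should agree closely, and the formula shows how the dispersion shrinks: making σε smaller, spreading the weights out more, or increasing N all drive σ(b1) down; for example, quadrupling N cuts it in half.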