Module 10 - Wharton Statistics Department

STAT 101, Module 10:
Statistical Inference for Simple Linear Regression
(Book: chapter 12)
Uncertainty about slopes
 When we fit lines to x-y data, the single most important quantity we
estimate is almost always the slope. It tells us what average
difference in y to expect for a given difference in x. For example, in
the diamond data we expect an average difference of S$372 for a 0.1
carat difference in weight.
 The question we now raise is: How certain are we about this slope?
The uncertainty in the slope stems from the fact that another dataset
of weight-price data would produce a slightly different slope estimate.
We should therefore be concerned with the variability of slopes from
dataset to dataset just as we were concerned with the variability of
means and differences of means from dataset to dataset.
 Our next question is therefore: Can we quantify the dispersion of
slopes from dataset to dataset? In other words, are there standard
errors for slopes, as there were for means and differences of means?
If so, can one derive standard error estimates for slopes that are
computable from a single dataset?
 The happy answer is of course “yes” to both of these questions.
Populations and samples for lines: the simple linear regression model
 We need a notion of population slope which is the target estimated
by sample slopes, just as population means (expected values) were
the target estimated by sample means. In solving this problem, we
also solve the problem of a population intercept and a population
standard deviation around the line. It all comes together in the
simple linear regression model (SLRM):
yi = β0 + β1 xi + εi
This is to be interpreted as follows: For a predictor value xi the
response value yi comes about as the sum of two things:
o a straight line formula, β0 + β1 xi. The coefficients β0 and β1 are
the population intercept and the population slope, respectively.
They are unknown and to be estimated by the sample intercept
b0 and the sample slope b1, just as the population mean was to
be estimated by the sample mean.
o an “error” εi which represents the variation in yi that cannot be
captured by the straight line. This “error” (not a good term) is
assumed to be a random variable with zero population mean
and standard deviation σε. This implies that the response yi is
also a random variable, but that its population mean is β0 + β1 xi
(the reason: adding a constant to a random variable implies
adding the same constant to the population mean).
There are some oddities about the SLRM, not the least of which is that the
predictor values xi are not considered random. In other words, if we
imagine other datasets just like the diamond data, the SLRM expects
them to have the exact same column of weights. Only the prices
would be slightly different because the errors are random…
One other important element of the SLRM should be mentioned: the
errors are not only assumed random but also independent. This
means that we really can’t infer whether the price of a specific
diamond is going to be above or below the line knowing the prices of
other diamonds.
This assumption makes sense, unless we have additional information, such
as (hypothetically): diamonds 1 through 5 were cut by an artisan who is
more skilled than the other artisans, hence should command a higher price
than the weight alone would tell us.
 Even though we may find some aspects of the SLRM strange, let us
buy into it and see what the consequences are. For one thing, it allows
us to play “as if” games and simulate data. Simulations are always
useful to illustrate statistical concepts and make them concrete.
Don’t miss the point of what’s coming: We are illustrating the concept of
a SLRM, not giving instruction for data analysis!
In what follows we will simulate data that may look like the diamond
data, if they actually followed the SLRM. To this end we have to
adopt a “view from above” and choose population values β0, β1, σε, as
well as decide exactly what distribution the errors εi should have.
Here is what the instructor chose:
o The population values shall be: β0 = –260, β1 = 3720.
These values were chosen because they are round yet close to the
values we saw as sample estimates in the actual data. The
simulation could therefore generate prices that look like the real
ones, except for random variation.
o The probability distribution of the errors εi shall be normal with
με = 0 (as mandated by the SLRM) and σε = 32.
This value for the error standard deviation was again chosen to be
round yet close to the estimate (the RMSE) seen in the actual data.
The choice of the normal distribution is not unreasonable because
the residuals in the actual data look pretty normal as a normal
quantile plot revealed (not shown here).
o Recall that we have no choice in picking the weights: they must
be the same as those in the actual data. We can only simulate
new prices for the given weights.
o Finally, we are not going to allow the simulation to produce
prices with 15 digits precision. To be realistic, we round the
prices to whole S$s.
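The simulation recipe above can be sketched in a few lines of Python (numpy assumed). Since the actual column of diamond weights is not reproduced in these notes, the weights below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in weights: the actual diamond weights are not
# reproduced here, so we make up 48 values in a plausible range.
weights = np.round(rng.uniform(0.12, 0.35, size=48), 2)

# Population values chosen by the instructor
beta0, beta1, sigma_eps = -260.0, 3720.0, 32.0

# SLRM: y_i = beta0 + beta1 * x_i + eps_i, with normal errors
errors = rng.normal(0.0, sigma_eps, size=weights.size)
prices = np.round(beta0 + beta1 * weights + errors)  # round to whole S$s

print(prices[:5])
```

Rerunning this with a fresh stream of errors produces a new price column for the same weights, which is exactly the "other datasets just like the diamond data" that the SLRM imagines.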
Below are six plots. Five of them show datasets simulated from this
model; one is a plot of the actual data. Can you tell which it is?
 If you can’t, it is a pretty good sign that the model is not so bad.
It would seem to show not only the proper kind of linear
association but also the proper kind of randomness.
 If, however, you can tell which is the actual data, and if you can
identify features that make it different from the simulated
datasets, then this is a sign that the model is flawed. The flaws
may not be serious, but the model would not be perfect.
The final message is that we have a way to describe a population for
linearly associated data. Correspondingly, we have
o a notion of population slopes β1, intercepts β0, and error
standard deviations σε, as well as
o a notion of datasets sampled from the population, and hence
o a notion of dataset-to-dataset variability for estimates of slopes
b1, estimates of intercepts b0, and estimates of error standard
deviations RMSE=se.
[Six scatterplots, labeled Price (Sim1) through Price (Sim6), each plotting
Price against Weight (carats); the price axes run from 100 to 1100 and the
weight axes from .1 to .35. Five panels show simulated data and one shows
the actual diamond data.]
 Population lines versus fitted lines, errors versus residuals:
Making this distinction seems to be difficult for many students who
learn about the SLRM. Keep in mind that the SLRM describes the
population, whereas you work on data drawn from the population. In
detail:
o The SLRM is determined by three numbers, the population
parameters β0, β1, σε . These population parameters are never
known, but they are the targets to which estimates b0, b1, se
should get close. This means that the population has one fixed
“true” line determined by β0, β1, while the lines we fit to data
are determined by b0, b1, and these estimates vary across
datasets drawn from the same population. Hence one has to
imagine that the fitted lines vary around the true population line
from dataset to dataset.
o Since the population line and the fitted lines are not the same,
even though we hope the latter are near the former, it follows
that errors and residuals are not the same either:
 errors εi are the vertical offsets of the actual yi (e.g.,
Price) from the unknown, true population line:
εi = yi – (β0 + β1 xi ),
 residuals ei are the vertical offsets of the actual yi (e.g.,
Price) from the estimated/fitted line:
ei = yi – (b0 + b1 xi ).
It is not a crime to confuse the two since there is almost no practical
consequence from it, but it shows a lack of understanding of what is
going on. To clarify things, keep staring at the six plots above, five of
which show data simulated from a hypothetical model. These five
datasets would produce slightly different fitted lines, even though they
come from the same population, and neither of these five lines would
totally agree with the population line, even though they would be
close. How close? That’s the question to be answered with standard
errors.
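The errors-versus-residuals distinction can be made concrete in a small simulation where, by construction, we know the true population line (a hypothetical sketch, reusing the parameter values chosen earlier):

```python
import numpy as np

rng = np.random.default_rng(1)

# "View from above": the true population line is known by construction
beta0, beta1, sigma_eps = -260.0, 3720.0, 32.0
x = rng.uniform(0.12, 0.35, size=100)

# Errors: vertical offsets of y from the TRUE population line
eps = rng.normal(0.0, sigma_eps, size=x.size)
y = beta0 + beta1 * x + eps

# Residuals: vertical offsets of y from the FITTED line
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# eps and resid are close but not identical, because (b0, b1) is only
# an estimate of (beta0, beta1)
```

With real data we only ever see the residuals; the errors require knowing the population line, which this simulation grants us artificially.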
Standard errors and statistical inference for slope estimates
 Recall the formulas for the estimates from Module 4:
o the slope estimate:
  b1 = cov(x, y) / s²(x)
o the error SD estimate (= RMSE):
  RMSE = se = √( RSS / (N − 2) ) = √( (e1² + e2² + … + eN²) / (N − 2) )
(We will gladly ignore the intercept estimate.)
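These two formulas can be checked numerically (a hypothetical five-point dataset for illustration; numpy assumed):

```python
import numpy as np

# Hypothetical five-point dataset (not the actual diamond data)
x = np.array([0.15, 0.20, 0.25, 0.30, 0.35])
y = np.array([300.0, 480.0, 700.0, 840.0, 1040.0])
N = x.size

# Slope estimate: b1 = cov(x, y) / s^2(x)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()  # intercept estimate (ignored in the text)

# Error SD estimate: RMSE = sqrt(RSS / (N - 2))
e = y - (b0 + b1 * x)          # residuals
rmse = np.sqrt(np.sum(e**2) / (N - 2))
```

Note the N − 2 in the denominator: two degrees of freedom are spent on estimating the intercept and the slope.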
 The standard error for the slope estimate is given by the following
formula (no derivation or even heuristic given here):
  σ(b1) = σε / ( SD(x) · N^(1/2) )
The only thing that matters to us is that it suggests a standard error
estimate quite easily by estimating the error SD σε with the residual
SD se (= RMSE):
  stderr(b1) = se / ( SD(x) · N^(1/2) )
 This is the magic formula of simple linear regression! It allows us to
do statistical inference as we know it:
o construct confidence intervals,
CI = (b1 – 2stderr(b1), b1 + 2stderr(b1))
o and perform tests of the null hypothesis H0: β1 = 0:
  t = b1 / stderr(b1)
The null hypothesis of a zero slope is standard. Its t-statistic and p-value are given in every regression output. Why this obsession with a
zero slope? Because a zero slope means that there is no linear
association between x and y (recall a zero slope is equivalent to a zero
correlation). If the data are incompatible with the assumption of a
zero slope, we can infer that there is a slope and it makes sense to
estimate it. The above CI will then give us a range of plausible values
for the true population slope: as usual for a 95% CI, it is a random
interval that has a 0.95 probability of containing the true population
slope across datasets generated from the SLRM.
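Putting the pieces together, the slope, its standard error, the t-statistic, and the rough 95% CI are all computable from a single dataset (again a hypothetical five-point dataset; numpy assumed):

```python
import numpy as np

# Hypothetical five-point dataset (not the actual diamond data)
x = np.array([0.15, 0.20, 0.25, 0.30, 0.35])
y = np.array([300.0, 480.0, 700.0, 840.0, 1040.0])
N = x.size

# Fit the line and compute the residual SD
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se = np.sqrt(np.sum(e**2) / (N - 2))

# stderr(b1) = se / (SD(x) * N^(1/2)); np.std with its default
# denominator N matches the SD convention in this formula
stderr_b1 = se / (np.std(x) * np.sqrt(N))

# Rough 95% CI and t-statistic for H0: beta1 = 0
ci = (b1 - 2 * stderr_b1, b1 + 2 * stderr_b1)
t = b1 / stderr_b1
```

A |t| above 2 signals that a zero slope is implausible, and the CI then gives the range of plausible population slopes.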
Examples of inference for straight line fits
 Example 1: the Diamond data. Here is the output that we can interpret
from now on:
Parameter Estimates
Term              Estimate    Std Error   t Ratio  Prob>|t|
Intercept        -259.6259    17.31886    -14.99   <.0001
Weight (carats)  3721.0249    81.78588     45.50   <.0001
The last line has the slope estimate, about S$3720/carat. Its standard
error estimate is about 82 (same units as the slope). The null
hypothesis of a zero slope is vastly rejected because of the insanely
large t-ratio of over 45 and a p-value that is so small that it is not even
reported. The 95% CI is about 3720±164 = (3556, 3884), which
contains the plausible values for the population slope β1.
 Example 2: the Penn Student survey. We can check whether students
are on average still getting taller with age. Here is the regression of
Height on Age:
Parameter Estimates
Term        Estimate    Std Error   t Ratio  Prob>|t|
Intercept  65.476114    3.246521    20.17    <.0001
AGE        0.1186769    0.168809     0.70    0.4825
We see that this is not the case: the null hypothesis of a zero slope (a
zero average difference in Height for 1-year difference in Age) cannot
be rejected. The t-ratio is 0.7 and hence within the interval ±2.
Consequently, the p-value is about 0.5 and very much above 0.05.
In general it is quite difficult to find a sensible example where the slope is
statistically insignificant. This will change in Statistics 102, where you will
see multiple linear regression with more than one predictor.
Some insights from the standard error formula for slopes
We reproduce the formula one more time:
  σ(b1) = σε / ( SD(x) · N^(1/2) )
Here are some insights:
The uncertainty about the estimated slope decreases as
 the errors get less dispersed, that is, σε gets smaller,
 the predictor values get more dispersed, that is, SD(x) gets larger,
 the number of observations gets larger.
All of the above should make sense: less error, better spread-out predictor
values, and more data should all stabilize the line and hence drive down the
uncertainty about the line and hence about the slope.
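All three effects can be seen directly by plugging into the formula (a sketch; the function below simply evaluates σε / (SD(x)·√N), with the SD using denominator N):

```python
import numpy as np

def slope_stderr(x, sigma_eps):
    # sigma(b1) = sigma_eps / (SD(x) * N^(1/2))
    return sigma_eps / (np.std(x) * np.sqrt(x.size))

x = np.linspace(0.1, 0.35, 50)

print(slope_stderr(x, 32.0))              # baseline
print(slope_stderr(x, 16.0))              # halve the error SD  -> halve the SE
print(slope_stderr(2 * x, 32.0))          # double SD(x)        -> halve the SE
print(slope_stderr(np.tile(x, 4), 32.0))  # 4x the observations -> halve the SE
```

Note the diminishing returns in the last line: because of the square root, quadrupling the sample size only cuts the standard error in half.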