AP STATISTICS: Chapter 14 Inference for Regression
When a scatterplot shows a linear relationship
between a quantitative explanatory variable x and a
quantitative response variable y, one can use a
least-squares line fitted to the data to predict y (y-hat) for a given value of x. Now the question
becomes: “If the relationship is truly linear, what is
the equation for the true line?” In other words, how
close is our least-squares equation, estimated from a
single data set, to the TRUE equation?
EXAMPLE: Infants who cry easily may be more easily
stimulated than others and this may be a sign of
higher IQ. Child development researchers explored
the relationship between the crying of infants four to
ten days old and their later IQ test scores. A snap of
a rubber band on the sole of the foot caused the
infants to cry. The researchers recorded the crying
and measured its intensity by the number of peaks in
the most active 20 seconds. They later measured
the children’s IQ at age three years using the
Stanford-Binet IQ test. Data for 38 infants are listed
as follows.
Make a scatterplot. Does the relationship appear to
be roughly linear?
Perform a least-squares fit. The resulting equation is
typically used for predicting y with a given value of x.
What are the correlation r and the value of r²? Remember that
r² describes how well the regression line fits the data:
it is the proportion of the observed variation in y
that is accounted for by the straight-line relationship of y
and x.
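As a rough sketch of what a least-squares fit computes, the Python below finds the intercept, slope, r and r² from the usual sums of squares. It uses the beer and blood alcohol data tabulated later in this chapter. The helper name least_squares is an illustrative choice, not a function from any particular package.

```python
import math

# Beers and BAC for the 16 students in the blood alcohol example
# that appears later in this chapter.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def least_squares(x, y):
    """Return intercept a, slope b, and correlation r for y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)   # spread of x
    syy = sum((yi - y_bar) ** 2 for yi in y)   # spread of y
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                # slope
    a = y_bar - b * x_bar        # intercept: the line passes through (x-bar, y-bar)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


a, b, r = least_squares(beers, bac)
print(f"y-hat = {a:.4f} + {b:.4f}x")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The fitted line should agree, up to rounding, with the equation y-hat = -0.0127 + 0.0180x quoted for these data in the section on predictions.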
Plot the line with the data. Are there any obvious outliers or
influential points? Remember that outliers are points
that lie far from the overall pattern. Influential
points are those that move the fitted line; they are
usually far out in the x direction and
isolated from other points. It is dangerous to use a
predictive model if influential points are present.
For the crying data, the least-squares line is
y-hat = 91.27 + 1.493x
with r = 0.455 and r² = 0.207. This value of r² indicates
that approximately 21% of the variation in IQ scores
is (or can be) explained by a linear relationship with
crying intensity.
Now for Advanced Statistics
The slope b and intercept a of the least-squares line
are statistics. They are estimates, computed from
our sample, and would almost certainly change if they were
calculated from another data set. The intercept a
estimates the unknown parameter α, and the slope b
estimates the unknown parameter β. Let’s go
after the parameters α and β.
Assumptions for Regression Inference
We have n observations on an explanatory variable x
and a response variable y. Our goal is to study or
predict the behavior of y for given values of x.
• For any fixed value of x, the response y varies
according to a normal distribution. Repeated
responses y are independent of each other.
• The mean response μy has a straight-line
relationship with x:
μy = α + βx
The slope β and intercept α are unknown
parameters.
• The standard deviation of y (call it σ) is the same
for all values of x. The value of σ is unknown.
This model states that “on the average” there is a
straight-line relationship between y and x. The TRUE
REGRESSION line μy = α + βx says that the mean
response μy moves along a straight line as the
explanatory variable x changes. The values of y that
we do observe vary about their means according to a
normal distribution. If we hold x fixed and take many
observations on y, the normal pattern will eventually
appear in a stemplot, histogram or the like.
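To make the model concrete, here is a small simulation sketch. It holds x fixed, draws many responses y from a normal distribution centered on the true line, and checks that their mean and spread come out near α + βx and σ. The values of ALPHA, BETA and SIGMA are hypothetical, picked only for illustration; in practice these parameters are unknown.

```python
import random
import statistics

# Hypothetical "true" parameters, chosen only for illustration.
ALPHA = 90.0   # true intercept (assumed)
BETA = 1.5     # true slope (assumed)
SIGMA = 15.0   # standard deviation of y about the true line (assumed)


def simulate_responses(x, n_obs, rng):
    """Draw n_obs independent responses y at one fixed value of x.

    Each y is normal with mean ALPHA + BETA * x and standard deviation SIGMA.
    """
    mean_y = ALPHA + BETA * x
    return [rng.gauss(mean_y, SIGMA) for _ in range(n_obs)]


rng = random.Random(14)   # fixed seed so the run is repeatable
ys = simulate_responses(x=10, n_obs=10_000, rng=rng)

sample_mean = statistics.mean(ys)
sample_sd = statistics.stdev(ys)
print(f"true mean response at x = 10: {ALPHA + BETA * 10:.1f}")
print(f"sample mean of simulated y:   {sample_mean:.1f}")
print(f"sample SD of simulated y:     {sample_sd:.1f}")
```

A histogram of ys would show the normal pattern described above; changing x shifts the center along the line but leaves the spread at SIGMA.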
OK so here we go
The first step is to estimate the unknown parameters
α, β and σ. From the least-squares line y-hat = a +
bx we have the following:
• The slope b of the least-squares line is an
unbiased estimator of the true slope β.
• The intercept a of the least-squares line is an
unbiased estimator of the true intercept α.
Note that it is generally the slope of the line which is
of the greatest interest. A slope is a rate of change.
In the case of the Crying-IQ data the true slope β
says how much the average IQ Score changes when
the value of crying intensity x is increased by 1.
• σ is the standard deviation that describes the
variability of the response y about the true
regression line. Since the least-squares line
estimates the true regression line, the residuals
estimate how much y varies about the true line.
Remember that residuals are observed y minus
predicted y. Because σ is the standard deviation
of responses about the true regression line, it is
estimated by a sample standard deviation of the
residuals. This sample standard deviation is
referred to as a standard error. Remember that
the sum of the residuals is always zero; hence,
their mean is always zero.
Standard Error About the Least-Squares Line
The standard error about the line is the key measure
of the variability of the responses in regression. It is
part of the standard error of all the statistics we will
use for inference.
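The usual textbook definition of the standard error about the line is s = √( Σ residual² / (n − 2) ), dividing by n − 2 because two parameters (a and b) were estimated. The sketch below applies that definition to the beer and BAC data tabulated later in this chapter. The function names are illustrative only.

```python
import math

# Beers and BAC for the 16 students in the blood alcohol example
# that appears later in this chapter.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def fit_line(x, y):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def standard_error_about_line(x, y):
    """Return s and the list of residuals (observed y minus predicted y)."""
    a, b = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    n = len(x)
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return s, residuals


s, residuals = standard_error_about_line(beers, bac)
print(f"sum of residuals = {sum(residuals):.1e}")   # zero up to rounding error
print(f"s = {s:.4f}")
```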
Confidence Intervals for the Regression Slope
The slope is the rate of change of the mean response
as the explanatory variable increases. The slope b
of the least-squares line is an unbiased estimator of
β. We can calculate a confidence interval for β and it
has the familiar form:
estimate ± t* SE(estimate)
Confidence Interval for the Regression Slope
A level C confidence interval for the slope β of the
true regression line is
b ± t*SEb
In this recipe, the standard error of the least-squares
slope b is
SEb = s / √( Σ(x − x-bar)² )
where s is the standard error about the least-squares line and
x-bar is the mean of the observed x values,
and t* is the upper (1-C)/2 critical value from the t
distribution with n-2 degrees of freedom.
Shown below is the basic output for the Crying
Intensity-IQ data using the regression command in
the Minitab software package.
Note that Minitab, like most software packages,
produces more information than the basic output.
Use only what you need.
For a 95% confidence interval for β
b ± t*SEb
Thus we are 95% confident that mean IQ increases
by between about 0.5 and 2.5 points for each
additional peak in crying.
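The interval quoted above can be roughly reproduced from numbers already given for the crying data (n = 38, b = 1.493, r = 0.455). SEb would normally be read straight from the software output; this sketch instead backs it out using the identity t = b / SEb = r√(n − 2) / √(1 − r²), which holds for simple linear regression. The critical value t* = 2.028 for 36 degrees of freedom is an assumed table value, and because b and r are rounded the results are approximate.

```python
import math

n = 38           # infants in the crying study
b = 1.493        # least-squares slope reported for the crying data
r = 0.455        # correlation reported for the crying data
t_star = 2.028   # assumed table value: upper 0.025 critical value of t(36)

# For simple linear regression, t = b / SEb also equals
# r * sqrt(n - 2) / sqrt(1 - r^2), so SEb can be recovered from b and r.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
se_b = b / t_stat

margin = t_star * se_b
lower, upper = b - margin, b + margin
print(f"SEb = {se_b:.3f}")
print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
```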
A similar calculation can be performed to estimate α,
but it is seldom used.
Using the Hypothesis of NO Linear Relationship
One of the most commonly tested hypotheses about the
value of the slope β is
H0 : β = 0
A regression line with slope 0 is horizontal. That is,
the mean of y does not change at all when x
changes.
• So this hypothesis says there is NO true linear
relationship between x and y;
• or the straight-line dependence on x is of no
value for predicting y;
• or there is no correlation between x and y in the
population from which we drew our data.
Significance Tests for Regression Slope
To test the hypothesis H0: β = 0, compute the t
statistic
t = b / SEb
In terms of a random variable T having the t(n-2)
distribution, the P-value for a test of H0 against
Ha: β > 0 is P(T > t)
Ha: β < 0 is P(T < t)
Ha: β ≠ 0 is 2P(T > |t|)
The previous example of computer output also gave
the t statistic and its associated two-sided P-value.
You can always do the calculation on the TI-83 with
STAT/TESTS/LinRegTTest.
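For readers without a calculator or Minitab at hand, the sketch below shows one way to get the three P-values for the crying data using only the Python standard library. The t statistic is computed from the reported r = 0.455 and n = 38 via t = r√(n − 2) / √(1 − r²), which equals b / SEb in simple linear regression. The tail area comes from numerically integrating the t density with Simpson's rule; the helper names t_pdf and t_upper_tail are illustrative, and a statistics package would normally do this step.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=10_000):
    """P(T > t) for T ~ t(df), using Simpson's rule on the density from 0 to |t|."""
    width = abs(t)
    if width == 0:
        return 0.5
    h = width / steps                      # steps is even, as Simpson's rule requires
    total = t_pdf(0, df) + t_pdf(width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                   # P(0 < T < |t|)
    tail = 0.5 - area                      # P(T > |t|)
    return tail if t > 0 else 1 - tail


n, r = 38, 0.455                           # values reported for the crying data
df = n - 2
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

p_greater = t_upper_tail(t_stat, df)               # Ha: beta > 0
p_less = 1 - p_greater                             # Ha: beta < 0
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)    # Ha: beta != 0

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print(f"P(T > t) = {p_greater:.4f}")
print(f"P(T < t) = {p_less:.4f}")
print(f"2P(T > |t|) = {p_two_sided:.4f}")
```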
EXAMPLE: How well does the number of beers a
student drinks predict his or her blood alcohol
content? Sixteen student volunteers at the
University of Tennessee drank a randomly assigned
number of cans of beer. Thirty minutes later, a
police officer measured their blood alcohol content
(BAC). The data are as follows:
Student  Beers  BAC      Student  Beers  BAC
1        5      0.10     9        3      0.02
2        2      0.03     10       5      0.05
3        9      0.19     11       4      0.07
4        8      0.12     12       6      0.10
5        3      0.04     13       5      0.085
6        7      0.095    14       7      0.09
7        3      0.07     15       1      0.01
8        5      0.06     16       4      0.05
Minitab output for the blood alcohol content data
Scatterplot of students’ blood alcohol content
against the number of cans of beer consumed. The
dotted line is the least-squares line with the possible
outlier (9 beers consumed) removed. For this line r² = 77%.
Is there evidence to suggest that the more beers
consumed, the higher the BAC?
If so, give a 90% confidence interval for the
slope of the regression line.
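One way to work through both questions by hand is sketched below, using the 16 observations tabulated above. It tests H0: β = 0 against Ha: β > 0 and then builds the 90% interval b ± t*SEb, with SEb = s / √( Σ(x − x-bar)² ). The two critical values are assumed to come from a standard t table with n − 2 = 14 degrees of freedom.

```python
import math

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

n = len(beers)
x_bar = sum(beers) / n
y_bar = sum(bac) / n
sxx = sum((x - x_bar) ** 2 for x in beers)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))

b = sxy / sxx                 # least-squares slope
a = y_bar - b * x_bar         # least-squares intercept

# Standard error about the line, then standard error of the slope.
residuals = [y - (a + b * x) for x, y in zip(beers, bac)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
se_b = s / math.sqrt(sxx)

# Test statistic for H0: beta = 0 against Ha: beta > 0.
t_stat = b / se_b

# Assumed table values for the t distribution with 14 degrees of freedom.
T_CRIT_0005 = 4.140   # upper 0.0005 critical value
T_STAR_90 = 1.761     # upper 0.05 critical value, used for 90% confidence

p_below_0005 = t_stat > T_CRIT_0005   # True means the one-sided P-value < 0.0005
lower = b - T_STAR_90 * se_b
upper = b + T_STAR_90 * se_b

print(f"b = {b:.4f}, SEb = {se_b:.4f}, t = {t_stat:.2f}")
print(f"one-sided P-value below 0.0005: {p_below_0005}")
print(f"90% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

If t exceeds the 0.0005 critical value, the data give very strong evidence that mean BAC rises with the number of beers, and the interval estimates how much it rises per additional beer.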
Inference About Predictions
One of the most common reasons to fit a line to data
is to predict the response to a particular value of the
explanatory variable. That is, substitute a specific
value for x and then calculate y-hat. The predictive
equation for BAC is
y-hat = -0.0127 + 0.0180x
Now the question becomes: what is it that you want
to calculate? Do you want to calculate μy, the mean
response for the value of x, or are you interested in
calculating an individual response y for just one
observation at that x? In both cases the method of
prediction is the same: put the value of x into the
equation and calculate y-hat.
However, the margin of error is different for the two
kinds of prediction. A larger margin of error is
needed to bracket the response for one observation
than to bracket the mean response
for all observations at that value of x.
• We use a confidence interval to estimate the
mean response.
• We use a prediction interval to estimate the
individual response.
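In both cases the form of the interval is
y-hat ± t*SE
where y-hat is computed at the chosen value x* of the explanatory variable.
For a level C confidence interval for the mean
response, the standard error is
SE = s √( 1/n + (x* − x-bar)² / Σ(x − x-bar)² )
For a level C prediction interval for a single
observation, the standard error is
SE = s √( 1 + 1/n + (x* − x-bar)² / Σ(x − x-bar)² )
Here s is the standard error about the least-squares line and
x-bar is the mean of the observed x values.
In both cases, t* is the upper (1-C)/2 critical value of
the t distribution with n-2 degrees of freedom.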
Statistical software calculates these intervals.
Minitab would produce the following output for
prediction when x = 5 beers
Predicted Values
Fit       StDev Fit   95% CI               95% PI
0.07712   0.00513     (0.06612, 0.08812)   (0.03192, 0.12232)
Note that StDev Fit is the standard error for the mean
response. The key point here is that it is harder to
predict one response than to predict a mean
response.
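The Minitab figures above can be checked by hand. The sketch below uses the standard textbook standard errors, s√(1/n + (x* − x-bar)²/Σ(x − x-bar)²) for the mean response and s√(1 + 1/n + (x* − x-bar)²/Σ(x − x-bar)²) for a single observation, with x* = 5 beers. The function name and the critical value t* = 2.1448 (t with 14 degrees of freedom, 95% confidence) are assumptions of this example rather than part of the Minitab output.

```python
import math

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

T_STAR_95 = 2.1448   # assumed: upper 0.025 critical value of t(14)


def prediction_intervals(x, y, x_star, t_star):
    """Return fit, SE of the mean response, confidence interval, prediction interval."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

    fit = a + b * x_star
    distance = (x_star - x_bar) ** 2 / sxx      # grows as x* moves away from x-bar
    se_mean = s * math.sqrt(1 / n + distance)   # for the mean response
    se_pred = s * math.sqrt(1 + 1 / n + distance)   # for one new observation

    ci = (fit - t_star * se_mean, fit + t_star * se_mean)
    pi = (fit - t_star * se_pred, fit + t_star * se_pred)
    return fit, se_mean, ci, pi


fit, se_mean, ci, pi = prediction_intervals(beers, bac, x_star=5, t_star=T_STAR_95)
print(f"Fit = {fit:.5f}, StDev Fit = {se_mean:.5f}")
print(f"95% CI = ({ci[0]:.5f}, {ci[1]:.5f})")
print(f"95% PI = ({pi[0]:.5f}, {pi[1]:.5f})")
```

The extra 1 under the square root is what makes the prediction interval so much wider: it accounts for the variation of an individual response about the line, on top of the uncertainty in the line itself.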
A Reminder of the Regression Assumptions
1. The true relationship is linear.
2. The standard deviation of the response about
the true line is the same everywhere.
3. The response varies normally about the true
regression line.
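One common way to check these assumptions is to look at the residuals. As a minimal sketch using the beer and BAC data from the example above, the code below lists each residual alongside x and reports the largest one in units of s. Dividing by s is only a rough scaling chosen for this illustration, not a formal diagnostic.

```python
import math

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

n = len(beers)
x_bar = sum(beers) / n
y_bar = sum(bac) / n
sxx = sum((x - x_bar) ** 2 for x in beers)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(beers, bac)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# List residuals in order of x to look for curvature or changing spread.
for x, e in sorted(zip(beers, residuals)):
    print(f"beers = {x}  residual = {e:+.4f}  residual/s = {e / s:+.2f}")

# The observation whose residual is largest in absolute value.
worst_x, worst_resid = max(zip(beers, residuals), key=lambda pair: abs(pair[1]))
print(f"largest residual is at x = {worst_x} beers ({worst_resid / s:+.2f} s)")
```

A curved pattern in the residuals would cast doubt on linearity, a fan shape on constant standard deviation, and a strongly skewed set of residuals on normality. The largest residual here belongs to the student who drank 9 beers, the possible outlier noted with the scatterplot.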