Chapter 7: The Simple Linear Regression Model

7.1 Case Study: The Big Bang
[Figure: scatterplot titled "Recession Velocity vs. Distance", plotting distance (megaparsec) against velocity (km/s)]
Q: Investigate the relationship between a nebula's distance (megaparsecs) from the earth and the velocity (km/s) with which it was moving away from the earth. Is the relationship between mean distance and velocity a straight line?
7.2 The Simple Linear Regression (SLR) Model
• GOAL: Describe the distribution of values of one variable (the response) as a function
of another variable (the explanatory variable)
• Simple linear regression refers to having one explanatory variable.
• Think of one subpopulation of the response for each value of the explanatory variable.
• The regression of the response variable on the explanatory variable is a mathematical
relationship between the means of the subpopulations and the explanatory variable.
– What is the equation of a line? (think back to high school or earlier!)
• The simple linear regression model:
µ{Y |X} = β0 + β1 X
• Notation:
* µ{Y |X} = “the mean of Y as a function of X”
* µ{Y |X = x} = "the mean of Y when X takes the specific value x"
* β0 = the intercept: the mean of Y when X = 0
* β1 = the slope: the change in the mean of Y per one-unit increase in X
* σ{Y |X} = the standard deviation of Y as a function of X
• Example and Diagram of the model:
• Assumptions:
1. Normality: Each subpopulation has a normal distribution.
2. Equal SD's (constant spread): σ{Y |X} is the same for every value of X.
3. Independence: Each response is drawn independently of all other responses from
the same population, and independently of all responses from other subpopulations.
4. Linearity: The means of the subpopulations fall on a straight-line function of the
explanatory variable.
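To make these assumptions concrete, here is a minimal simulation sketch of the SLR model in R; the values of β0, β1, and σ are invented purely for illustration:

# Simulate one sample from the SLR model: normal, equal-SD, independent
# responses whose subpopulation means fall on a straight line
set.seed(1)
n     <- 50
x     <- runif(n, 0, 10)                  # explanatory variable values
beta0 <- 2;  beta1 <- 0.5;  sigma <- 1    # assumed parameter values
y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
plot(y ~ x); abline(beta0, beta1)         # points scatter evenly around the true line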
• How many parameters are in the SLR model?
• Interpolation and Extrapolation:
– interpolation = estimating the mean response at an X value inside the range of the observed X's
– extrapolation = estimating the mean response at an X value outside the range of the observed X's (risky: the linearity assumption cannot be checked there)
7.3 Least Squares Regression Estimation
• The method of least squares is one way to estimate the parameters.
• Hat notation:
– β̂0 estimates β0
– β̂1 estimates β1
µ̂{Y |X} = β̂0 + β̂1 X
• Now write it for a particular X in the data set:
– Fitted values:
fitᵢ = µ̂{Yᵢ |Xᵢ } = β̂0 + β̂1 Xᵢ
– Residuals:
resᵢ = Yᵢ − fitᵢ
• Diagram of fitted values and residuals:
• How do we obtain a measure of the distance between all responses and their fitted
values?
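In R, the fitted values, residuals, and their sum of squares come straight from the fitted model object; a sketch, assuming the Big Bang data frame bigbang has been read in as in the computer output later in this section:

bigbang.lm <- lm(distance ~ velocity, data = bigbang)
fits <- fitted(bigbang.lm)   # fit_i = beta0.hat + beta1.hat * X_i
res  <- resid(bigbang.lm)    # res_i = Y_i - fit_i
sum(res^2)                   # residual sum of squares: one overall distance measure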
• Least Squares Estimators:
– How should we estimate β0 and β1 ?
∗ Find the particular values that minimize the residual sum of squares
∗ These turn out to be:
β̂1 = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

β̂0 = Ȳ − β̂1 X̄
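These formulas are easy to verify numerically; a sketch, again assuming bigbang is loaded:

x  <- bigbang$velocity;  y <- bigbang$distance
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)   # matches coef(lm(distance ~ velocity, data = bigbang))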
• Sampling Distributions of β̂0 and β̂1 :
– β̂0 and β̂1 are just statistics! Let's think about their sampling distributions (see Display 7.7): normally distributed about the true values (assuming normal errors, or approximately so by the CLT for large n).
– SD of the sampling distribution of β̂1 :
SD(β̂1) = σ √( 1 / ((n − 1)s²x) )   or equivalently   σ √( 1 / Σ(Xᵢ − X̄)² )
– SD of the sampling distribution of β̂0 :
SD(β̂0) = σ √( 1/n + X̄² / ((n − 1)s²x) )
– What is s²x ?
– What else do we need in order to calculate the above SDs?
• Estimation of σ:
σ̂ = √( Sum of Squared Residuals / d.f. )
– Calculating the degrees of freedom (d.f.):
Residual d.f. = (Total # of observations) − (# parameters in model for mean)
• Standard Errors:
– Just plug in σ̂ in place of σ in the SD formulas.
– Degrees of freedom (d.f.) = n − 2
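A sketch of σ̂ and both standard errors computed by hand (assuming bigbang.lm has been fit as in the computer output below):

n         <- nrow(bigbang)
x         <- bigbang$velocity
sigma.hat <- sqrt(sum(resid(bigbang.lm)^2) / (n - 2))   # residual SE, d.f. = n - 2
se.b1     <- sigma.hat * sqrt(1 / ((n - 1) * var(x)))   # var(x) is s_x^2
se.b0     <- sigma.hat * sqrt(1/n + mean(x)^2 / ((n - 1) * var(x)))
c(sigma.hat, se.b0, se.b1)  # compare with the Std. Error column of summary(bigbang.lm)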
• Computer Output:
> bigbang <- read.csv("data/BigBangData.csv", header = TRUE)
> bigbang.lm <- lm(distance ~ velocity, data = bigbang)
> summary(bigbang.lm)

Call:
lm(formula = distance ~ velocity, data = bigbang)

Residuals:
      Min        1Q    Median        3Q       Max
-0.763250 -0.235212 -0.008798  0.207201  0.914434

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3990982  0.1184697   3.369  0.00277
velocity    0.0013729  0.0002274   6.036 4.48e-06

Residual standard error: 0.405 on 22 degrees of freedom
Multiple R-Squared: 0.6235,     Adjusted R-squared: 0.6064
F-statistic: 36.44 on 1 and 22 DF,  p-value: 4.477e-06
> plot(distance ~ velocity, data = bigbang)
> abline(bigbang.lm)
– Extra SS F test for regression versus a single mean (H0 : no relationship between
y and x, or β1 = 0) uses anova.
> anova(bigbang.lm)
Analysis of Variance Table

Response: distance
          Df Sum Sq Mean Sq F value    Pr(>F)
velocity   1 5.9755  5.9755  36.438 4.477e-06
Residuals 22 3.6078  0.1640

[Figure: scatterplot "Recession Velocity vs. Distance", distance (megaparsec) against velocity (km/s), with the fitted least squares line]

– One way to summarize linear regression output:

Estimated Mean Distance = 0.3991 + 0.001373 (Velocity)
                         (0.1185)  (0.000227)
7.4 Inferential Tools
• t-ratios: t = estimate/SE, with d.f. = n − 2
• What tests are included in the standard output from statistical packages?
7.4.1 Confidence intervals for β0 or β1 :
• Estimate: β̂1 (or β̂0)
• Standard Error (SE): SE(β̂1), from the regression output
• Multiplier: t-multiplier with d.f. = n − 2
• Now, make the CI for the true slope: β̂1 ± t-multiplier × SE(β̂1)
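A short R check of this recipe for the slope (confint() is the built-in shortcut):

est <- coef(summary(bigbang.lm))["velocity", "Estimate"]
se  <- coef(summary(bigbang.lm))["velocity", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 22) * se        # estimate +/- multiplier * SE
confint(bigbang.lm, "velocity", level = 0.95)   # the same interval, directly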
7.4.2 Describing the distribution of Y at a particular value of X
(X = X0 )
• At some specified value (X0 ) of the explanatory variable, the response variable has a
distribution. What shape is it?
– What is the mean of the distribution?
– What is the SD?
• Ingredients for a confidence interval for the mean (µ{Y |X0 }):
– Estimate: µ̂{Y |X0 } = β̂0 + β̂1 X0
– Standard Error:
SE[µ̂{Y |X0 }] = σ̂ √( 1/n + (X0 − X̄)² / ((n − 1)s²x) ),   d.f. = n − 2
– Multiplier: t-multiplier with d.f. = n − 2
• How can we get SE[µ̂{Y |X0 }] without directly calculating it?
Option 1 Computer centering trick:
(a) Create an artificial explanatory variable as X ∗ = X − X0 .
(b) Fit the simple linear regression of Y on X ∗ .
(c) The intercept IS now the mean of Y when X = X0 , and thus the SE for the
intercept is what you want.
[intercept for µ{Y |X ∗ }] = µ{Y |X ∗ = 0} = µ{Y |X = X0 }
> summary(lm(distance ~ I(velocity - 500), data = bigbang))
## (output truncated)
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.0855663  0.0875540  12.399 2.12e-11
I(velocity - 500) 0.0013729  0.0002274   6.036 4.48e-06
Option 2 Use R functions to directly get SEs and CIs for all points:
bigbang.lm <- lm(distance ~ velocity, data = bigbang)
fits.with.SEs <- predict(bigbang.lm,se.fit=TRUE, interval="confidence", level=0.95)
fits.with.SEs
fit.500 <- predict(bigbang.lm, newdata=list(velocity=500), se.fit=TRUE)
unlist(fit.500)
         fit.1         se.fit             df residual.scale
    1.08556627     0.08755405    22.00000000     0.40495881

# Calculate the CI "by hand"
fit.500$fit + c(-1,1) * qt(.975,22) * fit.500$se.fit
[1] 0.9039903 1.2671422
• What happens to the standard error for estimating the mean response as X0 gets farther from X̄?
• There is compound uncertainty when estimating the mean response for several
values of X. So far, we have only addressed the situation where we are interested in
only a single X0 .
– What multiplier should we use to make confidence intervals for several X0 ’s?
– What if we want to protect against an unlimited number of comparisons or make
a confidence band around the regression line?
• Creating a confidence band for the regression line using the Working-Hotelling
procedure:
– Want at least 95% of repetitions to produce bands that include the true mean
response everywhere, that is, contain the true line.
– Use Scheffé multiplier
Confidence BAND multiplier = √( 2 F₂,ₙ₋₂(.95) )
– So, to make the confidence bands:
1. Calculate the confidence intervals for many different values of X
2. Connect the lower limits and connect the upper limits
est.mean.ses <- predict(bigbang.lm, newdata=list(velocity =seq(-220,1090,length=40)), se.fit=TRUE)
confband.Scheffe.low <- est.mean.ses$fit - (sqrt(2*qf(.95,2,22)))*est.mean.ses$se.fit
confband.Scheffe.hi <- est.mean.ses$fit + (sqrt(2*qf(.95,2,22)))*est.mean.ses$se.fit
cbind(confband.Scheffe.low, confband.Scheffe.hi)
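Note how much wider the band multiplier is than the pointwise t multiplier:

sqrt(2 * qf(.95, 2, 22))   # Working-Hotelling (Scheffe) multiplier, about 2.62
qt(.975, 22)               # single-X0 t multiplier, about 2.07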
plot(distance~velocity, data=bigbang, ylab="distance (megaparsec)", xlab="velocity (km/s)")
lines(seq(-220,1090,length=40), est.mean.ses$fit, lty=1, lwd=2) #fitted line
lines(seq(-220,1090,length=40), confband.Scheffe.low, lty=4, lwd=2, col=4) #lower confidence band
lines(seq(-220,1090,length=40), confband.Scheffe.hi, lty=4, lwd=2, col=4) #upper confidence band
[Figure: scatterplot of distance (megaparsec) vs. velocity (km/s) with the fitted line and the Working-Hotelling confidence band]
7.4.3 Prediction of a Future Response
• A prediction interval indicates likely values for a future observation of the response at a specific
value of X (say X0 ).
• Understand the difference between these two questions:
1. What is the mean pH 4 hours after slaughter?
2. What will be the pH of a particular steer carcass 4 hours after slaughter?
• Suppose you know the mean and SD of the response at X0 . Is it possible to exactly predict the value
of an individual observation?
• Pred{Y |X0 } = our best prediction of a future response at X0
– What would you use for Pred{Y |X0 }?
• A Prediction Interval has TWO independent sources of uncertainty!
1. Uncertainty in the location of the subpopulation mean
2. Uncertainty about where the future value will be in relation to its mean
SE[Pred{Y |X0 }] = √( σ̂² + (SE[µ̂{Y |X0 }])² )
• How do we usually obtain SE[Pred{Y |X0 }] and/or prediction intervals?
1. Use the computer centering trick: the output gives σ̂ (the residual standard error) and SE[µ̂{Y |X0 }] (the intercept's SE)
2. Direct request from R:
bigbang.lm <- lm(distance ~ velocity, data = bigbang)
Vel.100 <- data.frame(velocity=100)
predict(bigbang.lm, newdata=Vel.100, se.fit=TRUE, interval="prediction", level=0.95)
new.Vels <- data.frame(velocity=c(-100, 100, 400, 900))
predict(bigbang.lm, newdata=new.Vels, se.fit=TRUE, interval="prediction", level=0.95)
$fit
        fit         lwr      upr
1 0.2618046 -0.62392239 1.147532
2 0.5363918 -0.33038691 1.403171
3 0.9482727  0.09102778 1.805518
4 1.6347407  0.74228834 2.527193

$se.fit
         1          2          3          4
0.13569379 0.10340198 0.08288756 0.14557933

$df
[1] 22

$residual.scale
[1] 0.4049588
# Check "by hand" (pred.100 is the single-velocity prediction; defined
# explicitly here since the check needs its components)
pred.100 <- predict(bigbang.lm, newdata=data.frame(velocity=100), se.fit=TRUE,
                    interval="prediction", level=0.95)
se.pred.100 <- sqrt(pred.100$residual.scale^2 + pred.100$se.fit^2)
se.pred.100 #0.4179517
pred.100$fit[1,1] + c(-1,1)*qt(.975,22)*se.pred.100
#[1] -0.3303869 1.4031706 IT MATCHES
• Prediction intervals:
– For a single interval (one new response) −→ use t-multiplier with d.f. = n − 2
– For k multiple new responses −→ Scheffé multiplier √( k Fₖ,ₙ₋₂(1 − α) )
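For concreteness, the multiplier for, say, k = 4 new responses (k chosen arbitrarily here):

k <- 4
sqrt(k * qf(0.95, k, 22))   # wider than qt(0.975, 22), which covers only one new response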
range(bigbang$velocity) #-220 1090
newX <- data.frame(velocity = seq(-220, 1090, length=50)) #50 values between -220 and 1090
est.mean.CIs <- predict(bigbang.lm, newdata=newX, interval="confidence")
pred.PIs <- predict(bigbang.lm, newdata=newX, interval="prediction")
## Make a confidence BAND using the Scheffe multiplier
ww <- sqrt(2*qf(.95,2,22))
est.mean.ses <- predict(bigbang.lm, newdata=newX, se.fit=TRUE)
confband.Scheffe.low <- est.mean.ses$fit - ww * est.mean.ses$se.fit
confband.Scheffe.hi <- est.mean.ses$fit + ww * est.mean.ses$se.fit
newX <- cbind(newX, est.mean.ses[1:2],confband.Scheffe.low,confband.Scheffe.hi )
dev.new()
plot(distance ~ velocity, data=bigbang, pch=16, cex=0.5, ylab="distance (megaparsec)",
xlab="velocity (km/s)",xlim=c(-300,1200), ylim=c(-0.5,2.5),xaxp=c(-300,1200,5))
abline(bigbang.lm, lwd=2) #fitted line
lines(fit ~ velocity, newX, lty=1, lwd=2) # another way to add the fitted line (X's are sorted in order)
lines( est.mean.CIs[,2]~ newX$velocity, lty=2, col=2) #lower pointwise CI
lines( est.mean.CIs[,3] ~ newX$velocity, lty=2, col=2) #upper pointwise CI
lines( pred.PIs[,2] ~ newX$velocity, lty=3, col=3) #lower prediction PI
lines( pred.PIs[,3] ~ new$velocity, lty=3, col=3) #upper prediction PI
lines( confband.Scheffe.low ~ velocity,data=newX, lty=4, col=4) #lower confidence band
lines( confband.Scheffe.hi ~ velocity,data=newX, lty=4, col=4) #upper confidence band
7.4.4 The Calibration Problem
• Calibration = estimating the value of X that results in Y = Y0
– Also called inverse prediction
– Why can't we just reverse the response and the explanatory variable?
• Methods for inference:
1. Invert the prediction relationship to get (see the R sketch after this list):
Pred{X|Y0 } = (Y0 − β̂0 ) / β̂1
2. Graphical method:
(a) Plot the prediction bands for predicting Y from X
(b) Draw a horizontal line at Y0
(c) Draw vertical lines down to the X-axis from the points where the horizontal line crosses the prediction bands
(d) Take the two points where the vertical lines meet the X-axis to be the calibration interval
3. Direct calculation of approximate SE(X̂):
– For predicting the value of X at which the mean of Y is Y0 :
SE(X̂) = SE(µ̂{Y |X̂}) / |β̂1 |
– For predicting the value of X at which the value of Y is Y0 :
SE(X̂) = SE(Pred{Y |X̂}) / |β̂1 |
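A sketch of method 1 in R, for an illustrative target distance Y0 = 1 megaparsec:

Y0   <- 1                       # assumed target value, for illustration only
b    <- coef(bigbang.lm)        # b[1] = beta0.hat, b[2] = beta1.hat
Xhat <- (Y0 - b[1]) / b[2]      # Pred{X | Y0}
Xhat                            # about 438 km/s for the Big Bang fit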
7.5 Related Issues
The Regression Effect (regression to the mean)
• Occurs in any test-retest situation:
– Subjects who score high on the first test will, as a group, score closer to the average (lower) on
the second test
– Subjects who score low on the first test will, as a group, score closer to the average (higher) on
the second test
• There are many more individuals with skill levels near the average than there are with skill levels more than one SD away from it. So, by chance, more of the individuals appearing in a given strip have true skill levels closer to the average (see Display 7.13).
• regression fallacy = the mistake of attaching some broader meaning to the regression effect.
Causation
• It is tempting to think of the explanatory variable as causing different values of the response. But, if
it is an observational study, cause and effect statements are NOT justified.
• Get used to using the word association! (between the mean response and the value of the explanatory
variable)
Correlation
• The sample correlation coefficient describes the degree of linear association between any two
random variables X and Y .
rXY = [ (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) ] / (sX sY)
• Correlation does not depend on distinguishing between the response and explanatory variables.
• What are the units of correlation?
• What is the range of values for correlation?
• Only use the correlation for inference if the pairs (X, Y ) are randomly selected from a population
(typically not the case in regression scenarios)
– For regression, X does not have to be random
– It is a common mistake to base conclusions on correlations when the X’s are not random.
• Correlation only measures the degree of linear association!
– It is possible for there to be an exact relationship between X and Y and have a sample correlation
coefficient of ZERO!
– See handout.
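For instance, a minimal example of an exact but nonlinear relationship with zero sample correlation:

x <- -3:3
y <- x^2     # Y is an exact (quadratic) function of X
cor(x, y)    # exactly 0: correlation sees no LINEAR association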
A few questions
• What is wrong with the following formulation of the regression model? Y = β0 + β1 X
• At what value of X will there be the most precise estimate of the mean of Y ?
• At what value of X will there be the most precise prediction of a future Y ?
• Consider the regression of weight on height for a sample of adult males. Suppose the estimated intercept is 5 kg. Does this imply that males of height 0 weigh 5 kg on average? Does this imply that the simple linear regression model is meaningless?