Chapter 5

Multiple linear regression
• In this chapter we consider multiple linear regression problems, which involve modeling the relationship between a dependent variable, Y, and two or more predictor variables, X1, X2, X3, etc.
5.1 Polynomial regression
• The predictors are a single variable, x, and its polynomial powers (x², x³, etc.).
• Because everything is a function of the single predictor x, in polynomial regression we can display the result of our multiple regression on a single two-dimensional graph.
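A minimal sketch of such a display, using simulated data (all names and values here are illustrative, not from the chapter's data sets):

set.seed(1)
x <- runif(60, 0, 10)                    # single predictor
y <- 1 + 2 * x - 0.15 * x^2 + rnorm(60)  # quadratic relationship plus noise
m <- lm(y ~ x + I(x^2))                  # polynomial (quadratic) regression
plot(x, y)                               # data on one two-dimensional plot
curve(predict(m, newdata = data.frame(x = x)), add = TRUE)  # fitted curve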
Example: modeling salary from years of experience
• We want to develop a regression equation to model the relationship between Y, salary (in thousands of dollars), and x, the number of years of experience, and to find a 95% prediction interval for Y when x = 10.
Example: modeling salary from years of experience
[Figure: scatter plot and residual plot when a simple linear regression is fit.]
[Figure: after fitting a polynomial regression model, the random pattern in the standardized residuals indicates it is a valid model.]
[Figure: leverage point.]
[Figure: diagnostic plots.]
Example: modeling salary from years of experience, 95% prediction interval for 10 years of experience:
> m2 <- lm(Salary ~ Experience + I(Experience^2))
> predict(m2, newdata = data.frame(Experience = 10), interval = "prediction", level = 0.95)
       fit      lwr      upr
1 58.11164 52.50481 63.71847
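A sketch of how the fits and residual checks behind the plots above might be produced, assuming the variables Salary and Experience are available in the workspace (as in the call above):

m1 <- lm(Salary ~ Experience)                    # simple linear fit
plot(Experience, rstandard(m1))                  # curved pattern: model inadequate
m2 <- lm(Salary ~ Experience + I(Experience^2))  # quadratic fit
plot(Experience, rstandard(m2))                  # random pattern: valid model
predict(m2, newdata = data.frame(Experience = 10),
        interval = "prediction", level = 0.95)   # 95% prediction interval at x = 10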
5.2 Estimation and inference in
multiple linear regression
• The response random variable, Y, is predicted from p predictor (explanatory) variables X1, X2, …, Xp, and the relationship between Y and X1, X2, …, Xp is linear in the parameters β0, β1, β2, …, βp. The ei's are random errors.
E(Y | X1 = x1, X2 = x2, …, Xp = xp) = β0 + β1x1 + … + βpxp
Yi = β0 + β1x1i + … + βpxpi + ei
Least squares estimates
• The least squares estimates of β0, β1, β2, …, βp are the values b0, b1, b2, …, bp for which the sum of the squared residuals,
RSS = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − b0 − b1x1i − … − bpxpi)²
is minimal. Here xi is the vector xi = (x1i, x2i, …, xpi).
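The minimizing values have the closed form b = (XᵀX)⁻¹Xᵀy, where X is the design matrix with a leading column of ones. A minimal numerical sketch with simulated data (illustrative only):

set.seed(2)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 2 + 3 * x1 - x2 + rnorm(n)
X <- cbind(1, x1, x2)               # design matrix
b <- solve(t(X) %*% X, t(X) %*% y)  # least squares estimates
cbind(b, coef(lm(y ~ x1 + x2)))     # the two columns agree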
Residual sum of squares
RSS = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1x1i − … − β̂pxpi)²
Testing whether there is a linear association between Y and a subset/all of the predictors
• H0 : β1 = β2 = … = βp = 0
• HA : at least one of the βi ≠ 0.
• The total (corrected) sum of squares, SST or SYY:
SST = Σ_{i=1}^n (yi − ȳ)²
• The residual sum of squares, RSS:
RSS = Σ_{i=1}^n (yi − ŷi)²
• The regression sum of squares, SSreg:
SSreg = Σ_{i=1}^n (ŷi − ȳ)²
• SST = SSreg + RSS. If there is a strong linear relationship (i.e., H0 is false), then SSreg should be close to SST; if H0 is true, SSreg should be close to 0.
F test
F = (SSreg / p) / (RSS / (n − p − 1))
Reject H0 if F > F(α; p, n − p − 1), or if the p-value < α.
The F test is always used first, to test for the existence of a linear association between Y and ANY of the p x-variables.
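The decomposition and the F statistic can be checked numerically; a sketch with simulated data (p = 2 predictors, names illustrative):

set.seed(3)
n <- 40; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)
m <- lm(y ~ x1 + x2)
SST   <- sum((y - mean(y))^2)                # total corrected sum of squares
RSS   <- sum(resid(m)^2)                     # residual sum of squares
SSreg <- sum((fitted(m) - mean(y))^2)        # regression sum of squares
all.equal(SST, SSreg + RSS)                  # TRUE: SST = SSreg + RSS
Fstat <- (SSreg / p) / (RSS / (n - p - 1))   # overall F statistic
pf(Fstat, p, n - p - 1, lower.tail = FALSE)  # p-value; matches summary(m)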
• If the F test is significant, then a natural question to ask is: for which of the p x-variables is there evidence of a linear association with Y?
• H0 : β1 = β2 = … = βk = 0, where k < p,
i.e., the reduced model Yi = β0 + βk+1 x(k+1)i + … + βp xpi + ei
• HA : H0 is not true,
i.e., the full model Yi = β0 + β1x1i + … + βpxpi + ei
F test: reduced model versus full model
• F = [(RSS(reduced) − RSS(full)) / k] / [RSS(full) / (n − p − 1)]
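A sketch of the reduced-versus-full comparison with simulated data, where the k = 2 dropped predictors are truly irrelevant (all names illustrative):

set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)          # x2 and x3 contribute nothing
full    <- lm(y ~ x1 + x2 + x3)     # p = 3 predictors
reduced <- lm(y ~ x1)
k <- 2; p <- 3
Fstat <- ((deviance(reduced) - deviance(full)) / k) /
         (deviance(full) / (n - p - 1))      # deviance() of an lm fit is its RSS
pf(Fstat, k, n - p - 1, lower.tail = FALSE)  # matches anova(reduced, full)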
5.3 Analysis of Covariance
• Consider the situation in which we want to model a response variable, Y, based on a continuous predictor, x, and a dummy variable, d.
• There are four possibilities (the corresponding R model formulas are sketched after this list):
• Coincident regression lines: whether d = 0 or d = 1,
Y = β0 + β1x + e
• Parallel regression lines:
d = 0: Y = β0 + β1x + e
d = 1: Y = (β0 + β2) + β1x + e
• Same intercepts but different slopes:
d = 0: Y = β0 + β1x + e
d = 1: Y = β0 + (β1 + β3)x + e
• Different intercepts and different slopes, called unrelated regression lines:
d = 0: Y = β0 + β1x + e
d = 1: Y = (β0 + β2) + (β1 + β3)x + e
Here β2 is the additive change in Y due to the dummy variable, and β3 is the change in the size of the effect of x on Y due to the dummy variable.
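In R the four possibilities correspond to four model formulas; a minimal sketch with simulated data (y, x, and d are illustrative names):

set.seed(5)
n <- 60
x <- runif(n, 0, 10)
d <- rep(0:1, each = n / 2)                        # 0/1 dummy variable
y <- 2 + 1.5 * x + 3 * d - 0.5 * d * x + rnorm(n)  # unrelated-lines truth
m.coincident <- lm(y ~ x)            # one common line
m.parallel   <- lm(y ~ x + d)        # different intercepts, same slope
m.intercept  <- lm(y ~ x + x:d)      # same intercept, different slopes
m.unrelated  <- lm(y ~ x + d + x:d)  # both differ; equivalent to y ~ x * d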
Example: amount spent on travel
• Data are available on 925 customers: 466 purchased an adventure tour and 459 purchased a cultural tour.
• Y, amount of money spent in the last twelve months
• x, age
• C, dummy variable, 1 if the customer purchased a cultural tour.
Example: amount spent on travel, the unrelated regression lines
Y = β0 + β2C + (β1 + β3C)x + e
C = 0: Y = β0 + β1x + e
C = 1: Y = (β0 + β2) + (β1 + β3)x + e
Example: amount spent on travel
> mfull <- lm(Amount ~ Age + C + C:Age)
> summary(mfull)

Call:
lm(formula = Amount ~ Age + C + C:Age)

Residuals:
       Min         1Q     Median         3Q        Max
-143.29750  -30.54140   -0.03431   31.10816  130.74317

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1814.5445     8.6011   211.0   <2e-16 ***
Age           -20.3175     0.1878  -108.2   <2e-16 ***
C           -1821.2337    12.5736  -144.8   <2e-16 ***
Age:C          40.4461     0.2724   148.5   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47.63 on 921 degrees of freedom
Multiple R-squared: 0.9601, Adjusted R-squared: 0.9599
F-statistic: 7379 on 3 and 921 DF, p-value: < 2.2e-16
• H0 : β2 = β3 = 0,
i.e., the reduced model Yi = β0 + β1xi + ei
• HA : H0 is not true,
i.e., the full model Y = β0 + β2C + (β1 + β3C)x + e
> mreduced <- lm(Amount ~ Age)
> summary(mreduced)

Call:
lm(formula = Amount ~ Age)

Residuals:
     Min       1Q   Median       3Q      Max
-545.059 -199.033    6.336  198.739  497.389

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 957.9103    31.3056  30.599   <2e-16 ***
Age          -1.1140     0.6784  -1.642    0.101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 237.7 on 923 degrees of freedom
Multiple R-squared: 0.002913, Adjusted R-squared: 0.001833
F-statistic: 2.697 on 1 and 923 DF, p-value: 0.1009
Example: amount spent on travel, analysis of variance; pick the model with the best fit.
> anova(mreduced, mfull)
Analysis of Variance Table

Model 1: Amount ~ Age
Model 2: Amount ~ Age + C + C:Age
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)
1    923 52158945
2    921  2089377  2  50069568 11035 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There is very strong evidence against the reduced model in favor of the full model. Thus we prefer the unrelated regression lines model to the coincident lines model.
Exercises: Menu pricing in a new Italian restaurant in New York
• Actual data from 168 Italian restaurants in the target area. Develop a regression model to predict the price of a dinner.
• Y, Price: price of dinner
X1, Food: customer rating of food
X2, Décor: customer rating of décor
X3, Service: customer rating of service
X4, East: dummy variable, 1 if east of Fifth Avenue
• The full model:
Y = β0 + β1x1 + β2x2 + β3x3 + β4East + β5x1·East + β6x2·East + β7x3·East + e
• H0 : β3 = β5 = β6 = β7 = 0,
i.e., the reduced model Y = β0 + β1x1 + β2x2 + β4East + e
• HA : H0 is not true,
i.e., the full model
1. Use the F test to pick the best model (a sketch of the R commands appears below).
2. If the aim is to choose the location of the restaurant so that the price achieved for dinner is maximized, should the new restaurant be on the east or west of Fifth Avenue?
3. Does it seem possible to achieve a price premium for “setting a new standard for high-quality service in Manhattan” for Italian restaurants?
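A sketch of the exercise-1 comparison, assuming the data sit in a data frame called nyc with columns Price, Food, Decor, Service, and East (the data-frame name is an assumption):

full    <- lm(Price ~ Food + Decor + Service + East +
              Food:East + Decor:East + Service:East, data = nyc)
reduced <- lm(Price ~ Food + Decor + East, data = nyc)  # β3 = β5 = β6 = β7 = 0
anova(reduced, full)                                    # partial F test of H0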