Stat 112: Lecture 10 Notes
• Fitting Curvilinear Relationships
– Polynomial Regression (Ch. 5.2.1)
– Transformations (Ch. 5.2.2-5.2.4)
• Schedule:
– Homework 3 due on Thursday.
– Quiz 2 next
Curvilinear Relationship
• Reconsider the simple regression problem of
estimating the conditional mean of y given x, E(y | x).
• For many problems, E(y | x) is not linear.
• The linear regression model makes the restrictive
assumption that the increase in the mean of y | x for a
one-unit increase in x equals $\beta_1$ for every x.
• Curvilinear relationship: E(y | x) is a curve, not a
straight line; the increase in the mean of y | x is not the
same for all x.
Example 1: How does rainfall
affect yield?
• Data on average corn yield and rainfall in
six U.S. states (1890-1927), cornyield.JMP
[Figure: scatterplot "Bivariate Fit of YIELD By RAINFALL" (x-axis: RAINFALL, y-axis: YIELD)]
Example 2: How do people’s
incomes change as they age
• Weekly wages and age of 200 randomly
chosen males between ages 18 and 70
from the 1998 March Current Population
Survey.
[Figure: scatterplot "Bivariate Fit of wage By age" (x-axis: age, y-axis: wage)]
Example 3: Display.JMP
• A large chain of liquor stores would like to
know how much display space in its stores
to devote to a new wine. It collects sales
and display space data from 47 of its
stores.
[Figure: scatterplot "Bivariate Fit of Sales By DisplayFeet" (x-axis: DisplayFeet, y-axis: Sales)]
Polynomial Regression
• Add powers of x as additional explanatory
variables in a multiple regression model.
$E(Y \mid X) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_K x^K$
• Often $(x - \bar{x})$ is used in place of $x$ in the higher-order terms:
$E(Y \mid X) = \beta_0 + \beta_1 x + \beta_2 (x - \bar{x})^2 + \cdots + \beta_K (x - \bar{x})^K$
This does not affect the $\hat{y}$ that is obtained
from the multiple regression model.
• Quadratic model (K=2) is often sufficient.
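One way to see why centering leaves the fitted values unchanged: a centered quadratic can be expanded algebraically into an uncentered one, so both forms describe the same curve. The sketch below illustrates this with the corn yield quadratic that JMP reports later in these notes. The function names are illustrative only and are not part of JMP.

```python
def centered_to_raw(b0, b1, b2, x_bar):
    """Rewrite b0 + b1*x + b2*(x - x_bar)**2 as a0 + a1*x + a2*x**2."""
    a0 = b0 + b2 * x_bar ** 2      # constant picks up b2 * x_bar^2
    a1 = b1 - 2.0 * b2 * x_bar     # linear term picks up -2 * b2 * x_bar
    a2 = b2                        # quadratic coefficient is unchanged
    return a0, a1, a2


# Coefficients of the degree-2 corn yield fit reported in these notes.
B0, B1, B2, X_BAR = 21.660175, 1.0572654, -0.2293639, 10.7842
A0, A1, A2 = centered_to_raw(B0, B1, B2, X_BAR)


def predict_centered(x):
    """Predicted mean yield from the centered form."""
    return B0 + B1 * x + B2 * (x - X_BAR) ** 2


def predict_raw(x):
    """Predicted mean yield from the expanded (uncentered) form."""
    return A0 + A1 * x + A2 * x ** 2


for rainfall in (6, 10, 17):
    print(rainfall, predict_centered(rainfall), predict_raw(rainfall))
```

The two columns agree up to floating-point rounding. Only the intercept and the coefficient on x change between the two forms.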
Polynomial Regression in JMP
• Two ways to fit model:
– Create variables $(x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$. Use
Fit Model with variables $x, (x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$.
– Use Fit Y by X. Click on red triangle next to
Bivariate Analysis … and click Fit Polynomial
instead of the usual Fit Line . This method
produces nicer plots.
Bivariate Fit of YIELD By RAINFALL
[Figure: scatterplot of YIELD vs. RAINFALL with Linear Fit and Polynomial Fit Degree=2 curves]
Linear Fit
YIELD = 23.552103 + 0.7755493 RAINFALL
Summary of Fit
RSquare                      0.16211
RSquare Adj                  0.138835
Root Mean Square Error       4.049471
Mean of Response             31.91579
Observations (or Sum Wgts)   38
Polynomial Fit Degree=2
YIELD = 21.660175 + 1.0572654 RAINFALL - 0.2293639 (RAINFALL-10.7842)^2
Summary of Fit
RSquare                      0.296674
RSquare Adj                  0.256484
Root Mean Square Error       3.762707
Mean of Response             31.91579
Observations (or Sum Wgts)   38
Parameter Estimates
Term                   Estimate    Std Error  t Ratio  Prob>|t|
Intercept              21.660175   3.094868   7.00     <.0001
RAINFALL               1.0572654   0.293956   3.60     0.0010
(RAINFALL-10.7842)^2   -0.229364   0.088635   -2.59    0.0140
Bivariate Fit of wage By age
[Figure: scatterplot of wage vs. age with Linear Fit and Polynomial Fit Degree=2 curves]
Linear Fit
wage = 407.72321 + 6.5370642 age
Summary of Fit
RSquare                      0.049778
RSquare Adj                  0.044979
Root Mean Square Error       345.4422
Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2
Summary of Fit
RSquare                      0.095328
RSquare Adj                  0.086143
Root Mean Square Error       337.9155
Mean of Response             657.5698
Observations (or Sum Wgts)   200
Parameter Estimates
Term            Estimate    Std Error  t Ratio  Prob>|t|
Intercept       356.39651   81.21184   4.39     <.0001
age             9.6873755   2.223264   4.36     <.0001
(age-38.22)^2   -0.476988   0.151453   -3.15    0.0019
Interpretation of coefficients in
polynomial regression
• The usual interpretation of multiple
regression coefficients doesn’t make
sense in polynomial regression.
$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2$
• We can’t hold $X$ fixed and change $(X - \bar{X})^2$.
• The effect of increasing $X$ by one unit
depends on the starting value $X = X^*$:
$E(Y \mid X = X^* + 1) - E(Y \mid X = X^*) = [\beta_0 + \beta_1 (X^* + 1) + \beta_2 (X^* + 1 - \bar{X})^2] - [\beta_0 + \beta_1 X^* + \beta_2 (X^* - \bar{X})^2] = \beta_1 + \beta_2 + \beta_2 [2X^* - 2\bar{X}]$
Interpretation of coefficients in
wage data
Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2
Parameter Estimates
Term            Estimate    Std Error  t Ratio  Prob>|t|
Intercept       356.39651   81.21184   4.39     <.0001
age             9.6873755   2.223264   4.36     <.0001
(age-38.22)^2   -0.476988   0.151453   -3.15    0.0019
Change in Mean Wage Associated with One Year Increase in Age
Age change      Change in Mean Wage
From 29 to 30   18.00
From 39 to 40   8.47
From 49 to 50   -1.07
From 59 to 60   -10.61
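The changes in mean wage can be reproduced from the formula $\beta_1 + \beta_2 + \beta_2 [2X^* - 2\bar{X}]$ and the fitted coefficients. A minimal sketch, with a function name chosen here for illustration:

```python
# Coefficients from the degree-2 wage fit reported in these notes.
B1 = 9.6873755       # coefficient on age
B2 = -0.4769883      # coefficient on (age - 38.22)^2
AGE_BAR = 38.22      # centering value used by JMP


def change_in_mean_wage(start_age):
    """Estimated change in mean wage when age goes from start_age to start_age + 1."""
    return B1 + B2 + B2 * (2 * start_age - 2 * AGE_BAR)


for start_age in (29, 39, 49, 59):
    change = change_in_mean_wage(start_age)
    print(f"From {start_age} to {start_age + 1}: {change:.2f}")
```

The results agree with the table to within rounding of the displayed coefficients. The change shrinks by $2\beta_2 \approx -0.95$ for each additional year of starting age, so it turns negative in the late forties.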
Choosing the order in
polynomial regression
• Is it necessary to include a kth order term $(X - \bar{X})^k$?
$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2 + \cdots + \beta_k (X - \bar{X})^k$
• Test $H_0: \beta_k = 0$ vs. $H_a: \beta_k \neq 0$.
• Choose the largest k for which the test still rejects $H_0$ (at
the 0.05 level).
• If we use $(X - \bar{X})^k$, always keep the lower order
terms in the model.
• For corn yield data, use K=2 polynomial
regression model.
• For income data, use K=2 polynomial regression
model.
Bivariate Fit of YIELD By RAINFALL
[Figure: scatterplot of YIELD vs. RAINFALL with Linear Fit, Polynomial Fit Degree=2, and Polynomial Fit Degree=3 curves]
Parameter Estimates (Polynomial Fit Degree=3)
Term                   Estimate    Std Error  t Ratio  Prob>|t|
Intercept              29.281281   5.625537   5.21     <.0001
RAINFALL               0.376709    0.511817   0.74     0.4668
(RAINFALL-10.7842)^2   -0.349335   0.114401   -3.05    0.0044
(RAINFALL-10.7842)^3   0.0517568   0.032202   1.61     0.1172
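The order-selection rule can be sketched as a small function: for each candidate degree k, take the p-value of the highest-order term in the degree-k model and keep the largest k whose test rejects at the 0.05 level. The function name and the dictionary layout are assumptions made for this illustration. The p-values are the ones reported for the corn yield fits in these notes (the degree-1 p-value is not shown in the notes, so it is omitted).

```python
def choose_order(p_values, alpha=0.05):
    """Return the largest degree k whose highest-order term is significant.

    p_values maps degree k -> p-value for H0: beta_k = 0 in the degree-k model.
    Returns None if no degree rejects H0.
    """
    rejected = [k for k, p in p_values.items() if p < alpha]
    return max(rejected) if rejected else None


# Highest-order-term p-values from the corn yield output above.
corn_p_values = {2: 0.0140, 3: 0.1172}

print(choose_order(corn_p_values))
```

The cubic term is not significant (p = 0.1172) while the quadratic term is (p = 0.0140), which is why the K=2 model is used for the corn yield data.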
Transformations
• Curvilinear relationship: E(Y|X) is not a straight
line.
• Another approach to fitting curvilinear
relationships is to transform Y or x.
• Transformations: Perhaps E(f(Y)|g(X)) is a
straight line, where f(Y) and g(X) are
transformations of Y and X, and a simple linear
regression model holds for the response
variable f(Y) and explanatory variable g(X).
Curvilinear Relationship
[Figure: scatterplot "Bivariate Fit of Life Expectancy By Per Capita GDP" with linear fit, and a plot of Residual vs. Per Capita GDP]
Y = Life Expectancy in 1999
X = Per Capita GDP (in US Dollars) in 1999
Data in gdplife.JMP
The linearity assumption of simple
linear regression is clearly violated.
The increase in mean life
expectancy for each additional dollar
of GDP is smaller for large GDPs than
for small GDPs: decreasing returns to
increases in GDP.
Bivariate Fit of Life Expectancy By log Per Capita GDP
[Figure: scatterplot of Life Expectancy vs. log Per Capita GDP with linear fit, and a plot of Residual vs. log Per Capita GDP]
Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
The mean of Life Expectancy given log Per Capita GDP appears to be approximately
a straight line.
How do we use the transformation?
• Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
Parameter Estimates
Term                 Estimate   Std Error  t Ratio  Prob>|t|
Intercept            -7.97718   3.943378   -2.02    0.0454
log Per Capita GDP   8.729051   0.474257   18.41    <.0001
• Testing for association between Y and X: If the simple linear
regression model holds for f(Y) and g(X), then Y and X are
associated if and only if the slope in the regression of f(Y) and g(X)
does not equal zero. P-value for test that slope is zero is <.0001:
Strong evidence that per capita GDP and life expectancy are
associated.
• Prediction and mean response: What would you predict the life
expectancy to be for a country with a per capita GDP of $20,000?
$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$
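The prediction can be checked numerically. The value 9.9035 shows that log here is the natural logarithm, so the sketch below uses `math.log`. The function name is illustrative.

```python
import math

# Coefficients from the fit of Life Expectancy on log Per Capita GDP in these notes.
INTERCEPT = -7.97718
SLOPE = 8.729051


def predicted_life_expectancy(gdp_per_capita):
    """Estimated mean life expectancy for a given per capita GDP (US dollars)."""
    return INTERCEPT + SLOPE * math.log(gdp_per_capita)  # natural log


print(round(math.log(20_000), 4))
print(round(predicted_life_expectancy(20_000), 2))
```

Note that the transformation is applied to X before plugging into the fitted line. Because Y itself was not transformed, no back-transformation of the prediction is needed.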
How do we choose a
transformation?
• Tukey’s Bulging Rule.
• See Handout.
• Match curvature in data to the shape of
one of the curves drawn in the four
quadrants of the figure in the handout.
Then use the associated transformations,
selecting one for either X, Y or both.
Transformations in JMP
1. Use Tukey’s Bulging rule (see handout) to determine
transformations which might help.
2. After Fit Y by X, click red triangle next to Bivariate Fit and
click Fit Special. Experiment with transformations
suggested by Tukey’s Bulging rule.
3. Make residual plots of the residuals for transformed
model vs. the original X by clicking red triangle next to
Transformed Fit to … and clicking plot residuals.
Choose transformations which make the residual plot
have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for
transformation with smallest root mean square error on
original y-scale. If using a transformation that involves
transforming y, look at root mean square error for fit
measured on original scale.
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy vs. Per Capita GDP with four fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square]
Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
Summary of Fit
RSquare                      0.515026
RSquare Adj                  0.510734
Root Mean Square Error       8.353485
Mean of Response             63.86957
Observations (or Sum Wgts)   115
Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
Summary of Fit
RSquare                      0.749874
RSquare Adj                  0.74766
Root Mean Square Error       5.999128
Mean of Response             63.86957
Observations (or Sum Wgts)   115
Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
Summary of Fit
RSquare                      0.636551
RSquare Adj                  0.633335
Root Mean Square Error       7.231524
Mean of Response             63.86957
Observations (or Sum Wgts)   115
Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
Fit Measured on Original Scale
Sum of Squared Error         7597.7156
Root Mean Square Error       8.1997818
RSquare                      0.5327083
Sum of Residuals             -70.29942
By looking at the root mean square error on the original y-scale, we see that
all of the transformations improve upon the untransformed model and that the
transformation to log x is by far the best.
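When Y is transformed, the comparison has to be made on the original y-scale: predictions are back-transformed before residuals are computed. A minimal sketch for the Square(Life Expectancy) fit follows. The helper names are illustrative, and the n - 2 divisor is an assumption that is consistent with the "Fit Measured on Original Scale" numbers reported in these notes (SSE 7597.7156, 115 observations, RMSE 8.1997818).

```python
import math


def rmse_from_sse(sse, n, n_params=2):
    """Root mean square error: sqrt(SSE / (n - n_params))."""
    return math.sqrt(sse / (n - n_params))


def sse_original_scale(y, x, intercept, slope):
    """SSE on the original y-scale for a fit of y**2 on x.

    Each prediction of y**2 is back-transformed with a square root
    before the residual is taken.
    """
    total = 0.0
    for y_i, x_i in zip(y, x):
        y_hat = math.sqrt(intercept + slope * x_i)  # back-transform to y-scale
        total += (y_i - y_hat) ** 2
    return total


# Check against the reported original-scale fit for the square transformation.
print(rmse_from_sse(7597.7156, 115))
```

The printed value matches the reported 8.1997818 to rounding, which is the number that should be compared with the RMSEs of the fits that leave Y untransformed.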
[Figure: four residual plots (Residual vs. Per Capita GDP): Linear Fit, Transformation to √X, Transformation to Log X, Transformation to Y²]
The transformation to Log X appears to have mostly removed the trend in the mean
of the residuals. This means that $E(Y \mid X) \approx \beta_0 + \beta_1 \log X$. There is still a
problem of nonconstant variance.
Comparing models for
curvilinear relationships
• In comparing two transformations, use the transformation
with the lower RMSE. If y was transformed, use the RMSE of
the fit measured on the original y-scale. [This is
equivalent to choosing the transformation with the higher $R^2$
or $R^2_{adj}$.]
• In comparing transformations to polynomial regression
models, compare the $R^2_{adj}$ of the best transformation to that of the best
polynomial regression model (selected using the criterion
on slide 10).
• If the transformation’s $R^2_{adj}$ is close to (e.g., within .01) but
not as high as the polynomial regression’s, it is still
reasonable to use the transformation on the grounds of
parsimony.
$R^2_{adj}$ (Section 4.3.1)
• Problem with $R^2$: it never decreases even if we
add useless variables.
• $R^2_{adj} = 1 - \dfrac{SSE/(n - K - 1)}{SST/(n - 1)}$. This can decrease if
useless variables are added.
• Useful for comparing regression models with
different numbers of variables. No longer
represents proportion of variation in y explained
by multiple regression line.
• Found under Summary of Fit in JMP.
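Because $R^2 = 1 - SSE/SST$, the adjusted value can be computed from $R^2$, n, and K alone: $R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - K - 1)$. A short sketch (the function name is illustrative), checked against the corn yield output reported earlier in these notes:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a regression with k explanatory variables and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Corn yield fits from these notes: n = 38 observations.
print(adjusted_r2(0.16211, 38, 1))    # linear fit, K = 1
print(adjusted_r2(0.296674, 38, 2))   # quadratic fit, K = 2
```

Both values agree with the RSquare Adj lines in the JMP output (0.138835 and 0.256484) to rounding.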
Transformations and Polynomial Regression for
Display.JMP
Fourth order polynomial is the best polynomial regression model
using the criterion on slide 10.
Model                R^2     R^2_adj
Linear               0.712   0.706
log x                0.815   0.811
1/x                  0.826   0.823
sqrt(x)              0.771   0.766
Fourth order poly.   0.856   0.842
Fourth order polynomial is the best model: it has the highest $R^2_{adj}$.
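The comparison rule, including the parsimony allowance, can be written as a small decision function. This is only a sketch of the rule of thumb stated in these notes. The function name and the default tolerance argument are choices made here, and the inputs are the $R^2_{adj}$ values for the best transformation (1/x) and the fourth order polynomial from the Display.JMP comparison.

```python
def prefer_transformation(transform_adj_r2, poly_adj_r2, tol=0.01):
    """Return True if the transformation should be used instead of the polynomial.

    The transformation wins if its adjusted R^2 is at least as high, or if it
    is lower by no more than tol (parsimony).
    """
    if transform_adj_r2 >= poly_adj_r2:
        return True
    return (poly_adj_r2 - transform_adj_r2) <= tol


# Display.JMP: best transformation (1/x) vs. fourth order polynomial.
print(prefer_transformation(0.823, 0.842))
```

Here the gap is 0.019, which is larger than the .01 allowance, so the fourth order polynomial is kept.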
Summary
• Two methods for fitting regression models
for curvilinear relationships:
– Polynomial Regression
– Transformations