Stat 112 Notes 10

• Today:
  – Fitting Curvilinear Relationships (Chapter 5)
• Homework 3 due Thursday.

Curvilinear Relationship

• Reconsider the simple regression problem of estimating the conditional mean of y given x, $E(y \mid x)$.
• For many problems, $E(y \mid x)$ is not linear.
• The linear regression model makes the restrictive assumption that the increase in the mean of y|x for a one unit increase in x equals the same constant, $\beta_1$, at every x.
• Curvilinear relationship: $E(y \mid x)$ is a curve, not a straight line; the increase in the mean of y|x is not the same for all x.
• When the relationship is curvilinear, the residual plot from a simple linear regression will show a violation of linearity: there will be ranges of x for which the mean of the residuals is not approximately zero.

Example 1: How does rainfall affect yield?

• Data on average corn yield and rainfall in six U.S. states (1890-1927), cornyield.JMP

[Figure: Bivariate Fit of YIELD By RAINFALL. Scatterplot of YIELD against RAINFALL, with a plot of the residuals against RAINFALL below it.]

The residual plot indicates a violation of linearity: the mean of the residuals is above zero for rainfall between about 10 and 12 and below zero for rainfall from about 13 to 17.

Example 2: How do people's incomes change as they age?

• Weekly wages and age of 200 randomly chosen males between ages 18 and 70 from the 1998 March Current Population Survey

[Figure: Bivariate Fit of wage By age. Scatterplot of wage against age.]

Example 3: Display.JMP

• A large chain of liquor stores would like to know how much display space in its stores to devote to a new wine. It collects sales and display space data from 47 of its stores.

[Figure: Bivariate Fit of Sales By DisplayFeet. Scatterplot of Sales against DisplayFeet.]

Polynomial Regression

• Add polynomial terms in x as additional explanatory variables in a multiple regression model.

$$E(Y \mid X) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_K x^K$$

• In JMP, $(x - \bar{x})$ is used in place of x in the higher-order terms:
$$E(Y \mid X) = \beta_0 + \beta_1 x + \beta_2 (x - \bar{x})^2 + \cdots + \beta_K (x - \bar{x})^K$$

This does not affect the $\hat{y}$ that is obtained from the multiple regression model.
• Quadratic model (K=2) is often sufficient.

Polynomial Regression in JMP

• Two ways to fit the model:
  – Create variables $(x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$. Use Fit Model with variables $x, (x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$ (we will illustrate this method when we apply polynomial regression when there is more than one explanatory variable).
  – Use Fit Y by X. Click on the red triangle next to Bivariate Analysis … and click Fit Polynomial instead of the usual Fit Line. This method produces nicer plots.

[Figure: Bivariate Fit of YIELD By RAINFALL. Scatterplot of YIELD against RAINFALL with the Linear Fit and Polynomial Fit Degree=2 curves.]

Linear Fit
YIELD = 23.552103 + 0.7755493 RAINFALL

Summary of Fit
  RSquare                       0.16211
  RSquare Adj                   0.138835
  Root Mean Square Error        4.049471
  Mean of Response              31.91579
  Observations (or Sum Wgts)    38

Polynomial Fit Degree=2
YIELD = 21.660175 + 1.0572654 RAINFALL - 0.2293639 (RAINFALL-10.7842)^2

Summary of Fit
  RSquare                       0.296674
  RSquare Adj                   0.256484
  Root Mean Square Error        3.762707
  Mean of Response              31.91579
  Observations (or Sum Wgts)    38

Parameter Estimates
  Term                    Estimate     Std Error   t Ratio   Prob>|t|
  Intercept               21.660175    3.094868     7.00     <.0001
  RAINFALL                1.0572654    0.293956     3.60     0.0010
  (RAINFALL-10.7842)^2    -0.229364    0.088635    -2.59     0.0140

[Figure: Bivariate Fit of wage By age. Scatterplot of wage against age with the Linear Fit and Polynomial Fit Degree=2 curves.]

Linear Fit
wage = 407.72321 + 6.5370642 age

Summary of Fit
  RSquare                       0.049778
  RSquare Adj                   0.044979
  Root Mean Square Error        345.4422

Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2

Summary of Fit
  RSquare                       0.095328
  RSquare Adj                   0.086143
  Root Mean Square Error        337.9155
  Mean of Response              657.5698
  Observations (or Sum Wgts)    200

Parameter Estimates
  Term             Estimate     Std Error   t Ratio   Prob>|t|
  Intercept        356.39651    81.21184     4.39     <.0001
  age              9.6873755    2.223264     4.36     <.0001
  (age-38.22)^2    -0.476988    0.151453    -3.15     0.0019
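The claim that centering the higher-order terms does not change the fitted values can be checked numerically. The following is a minimal sketch in Python with NumPy (an assumption; the notes themselves use JMP). The data are synthetic and are not the cornyield.JMP data, and `fit_quadratic` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Synthetic data for illustration only (NOT the cornyield.JMP data).
rng = np.random.default_rng(0)
x = rng.uniform(6, 17, size=38)
y = 20 + 2.0 * x - 0.1 * (x - 11) ** 2 + rng.normal(0, 3, size=38)


def fit_quadratic(x, y, center):
    """Least-squares fit of y on the columns [1, x, (x - center)^2].

    Returns the coefficient vector and the fitted values.
    """
    X = np.column_stack([np.ones_like(x), x, (x - center) ** 2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef


# Uncentered quadratic term x^2 versus centered term (x - xbar)^2.
coef_raw, fitted_raw = fit_quadratic(x, y, center=0.0)
coef_ctr, fitted_ctr = fit_quadratic(x, y, center=x.mean())

# Both designs span the same column space, so the fitted values agree.
same_fitted = np.allclose(fitted_raw, fitted_ctr)
print("Fitted values agree:", same_fitted)
print("Quadratic coefficient (raw, centered):", coef_raw[2], coef_ctr[2])
print("Linear coefficient    (raw, centered):", coef_raw[1], coef_ctr[1])
```

In this sketch the quadratic coefficient is the same under both parameterizations, while the intercept and the linear coefficient change, which is why the centered and uncentered outputs look different even though they describe the same curve.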
Interpretation of coefficients in polynomial regression

• The usual interpretation of multiple regression coefficients doesn't make sense in polynomial regression.

$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2$$

• We can't hold x fixed and change $(X - \bar{X})^2$.
• The effect of increasing x by one unit depends on the starting value $x = x^*$:

$$\begin{aligned}
E(Y \mid X = X^* + 1) - E(Y \mid X = X^*) &= [\beta_0 + \beta_1 (X^* + 1) + \beta_2 (X^* + 1 - \bar{X})^2] - [\beta_0 + \beta_1 X^* + \beta_2 (X^* - \bar{X})^2] \\
&= \beta_1 + \beta_2 + \beta_2 [2X^* - 2\bar{X}]
\end{aligned}$$

Interpretation of coefficients in wage data

Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2

Parameter Estimates
  Term             Estimate     Std Error   t Ratio   Prob>|t|
  Intercept        356.39651    81.21184     4.39     <.0001
  age              9.6873755    2.223264     4.36     <.0001
  (age-38.22)^2    -0.476988    0.151453    -3.15     0.0019

Change in Mean Wage Associated with One Year Increase in Age
  Age change         Change in Mean Wage
  From 29 to 30       18.00
  From 39 to 40        8.47
  From 49 to 50       -1.07
  From 59 to 60      -10.61

For example, from 29 to 30 the change is 9.6873755 - 0.4769883 - 0.4769883 × (2 × 29 - 2 × 38.22) ≈ 18.0.

Choosing the order in polynomial regression

• Is it necessary to include a kth order term $(X - \bar{X})^k$?

$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2 + \cdots + \beta_k (X - \bar{X})^k$$

• Test $H_0: \beta_k = 0$ vs. $H_a: \beta_k \neq 0$.
• Choose the largest k for which the test still rejects $H_0$ (at the 0.05 level).
• If we use $(X - \bar{X})^k$, always keep the lower order terms in the model.
• For corn yield data, use the K=2 polynomial regression model.
• For income data, use the K=2 polynomial regression model.

[Figure: Bivariate Fit of YIELD By RAINFALL. Scatterplot of YIELD against RAINFALL with the Linear Fit, Polynomial Fit Degree=2 and Polynomial Fit Degree=3 curves.]

Polynomial Fit Degree=3
Parameter Estimates
  Term                    Estimate     Std Error   t Ratio   Prob>|t|
  Intercept               29.281281    5.625537     5.21     <.0001
  RAINFALL                0.376709     0.511817     0.74     0.4668
  (RAINFALL-10.7842)^2    -0.349335    0.114401    -3.05     0.0044
  (RAINFALL-10.7842)^3    0.0517568    0.032202     1.61     0.1172

The cubic term's p-value (0.1172) is above 0.05, whereas the quadratic term in the degree 2 fit is significant (p = 0.0140), so K=2 is chosen for the corn yield data.

Transformations

• Curvilinear relationship: E(Y|X) is not a straight line.
• Another approach to fitting curvilinear relationships is to transform Y or X.
• Transformations: Perhaps E(f(Y)|g(X)) is a straight line, where f(Y) and g(X) are transformations of Y and X, and a simple linear regression model holds for the response variable f(Y) and explanatory variable g(X).

Curvilinear Relationship

Y = Life Expectancy in 1999
X = Per Capita GDP (in US Dollars) in 1999
Data in gdplife.JMP

[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP. Scatterplot of Life Expectancy against Per Capita GDP, with a plot of the residuals against Per Capita GDP below it.]

The linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is smaller for large GDPs than for small GDPs: there are decreasing returns to increases in GDP.

[Figure: Bivariate Fit of Life Expectancy By log Per Capita GDP. Scatterplot of Life Expectancy against log Per Capita GDP with the Linear Fit line, and a plot of the residuals against log Per Capita GDP.]

Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

The mean of Life Expectancy | log Per Capita GDP appears to be approximately a straight line.

How do we use the transformation?

• Linear Fit
  Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

Parameter Estimates
  Term                  Estimate    Std Error   t Ratio   Prob>|t|
  Intercept             -7.97718    3.943378    -2.02     0.0454
  log Per Capita GDP    8.729051    0.474257    18.41     <.0001

• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000? Here log is the natural logarithm.

$$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$$

How do we choose a transformation?

• Tukey's Bulging Rule.
• See Handout.
• Match curvature in data to the shape of one of the curves drawn in the four quadrants of the figure in the handout. Then use the associated transformations, selecting one for either X, Y or both.

Transformations in JMP

1. Use Tukey's Bulging rule (see handout) to determine transformations which might help.
2. After Fit Y by X, click the red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey's Bulging rule.
3. Make residual plots of the residuals for the transformed model vs. the original X by clicking the red triangle next to Transformed Fit to … and clicking Plot Residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for the transformation with the smallest root mean square error on the original y-scale. If using a transformation that involves transforming y, look at the root mean square error for the fit measured on the original scale.

[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP. Scatterplot of Life Expectancy against Per Capita GDP with four fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square.]

Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP

Summary of Fit
  RSquare                       0.515026
  RSquare Adj                   0.510734
  Root Mean Square Error        8.353485
  Mean of Response              63.86957
  Observations (or Sum Wgts)    115

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)

Summary of Fit
  RSquare                       0.749874
  RSquare Adj                   0.74766
  Root Mean Square Error        5.999128
  Mean of Response              63.86957
  Observations (or Sum Wgts)    115

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)

Summary of Fit
  RSquare                       0.636551
  RSquare Adj                   0.633335
  Root Mean Square Error        7.231524
  Mean of Response              63.86957
  Observations (or Sum Wgts)    115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP

Fit Measured on Original Scale
  Sum of Squared Error          7597.7156
  Root Mean Square Error        8.1997818
  RSquare                       0.5327083
  Sum of Residuals              -70.29942

By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model (RMSE 8.35) and that the transformation to log x (RMSE 6.00) is by far the best; the square root of x gives 7.23 and the square of y gives 8.20.

[Figure: Residual plots against Per Capita GDP in four panels: Linear Fit, Transformation to √X, Transformation to Log X, Transformation to Y².]

The transformation to Log X appears to have mostly removed the trend in the mean of the residuals. This means that $E(Y \mid X) \approx \beta_0 + \beta_1 \log X$. There is still a problem of nonconstant variance.

Comparing models for curvilinear relationships

• In comparing two transformations, use the transformation with the lower RMSE on the original y-scale, using the fit measured on the original scale if y was transformed.
• In comparing transformations to polynomial regression models, compare the RMSE of the best transformation to that of the best polynomial regression model.
• If the transformation's RMSE is larger than the polynomial regression's RMSE but is within 1% of it, then it is still a good idea to use the transformation on the grounds of parsimony.

Transformations and Polynomial Regression for Display.JMP

The fourth order polynomial is the best polynomial regression model using the criterion for choosing the order given earlier.

  Model                 RMSE
  Linear                51.59
  log x                 41.31
  1/x                   40.04
  √x                    46.02
  Fourth order poly.    37.79

The fourth order polynomial is the best model: it has the smallest RMSE by a considerable amount (more than a 1% advantage over the best transformation, 1/x).
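The comparison rule for Display.JMP can be written out as a short calculation. This is a minimal sketch in Python (an assumption; the notes use JMP), using the RMSE values reported above for the two lowest-RMSE transformations and for the fourth order polynomial. The function name `prefer_transformation` and its `tolerance` parameter are hypothetical names introduced only for this illustration of the 1% rule.

```python
# RMSE values copied from the Display.JMP comparison in the notes.
transformation_rmse = {"log x": 41.31, "1/x": 40.04}
polynomial_rmse = 37.79  # fourth order polynomial


def prefer_transformation(best_transform_rmse, poly_rmse, tolerance=0.01):
    """Return True if the transformation's RMSE is within `tolerance`
    (1% by default) of the polynomial's RMSE, in which case the notes
    recommend the transformation on the grounds of parsimony."""
    return best_transform_rmse <= poly_rmse * (1 + tolerance)


# Best transformation = the one with the smallest RMSE.
best_name = min(transformation_rmse, key=transformation_rmse.get)
best_rmse = transformation_rmse[best_name]

# Relative amount by which the best transformation's RMSE exceeds the polynomial's.
excess = best_rmse / polynomial_rmse - 1
use_transformation = prefer_transformation(best_rmse, polynomial_rmse)

print(f"Best transformation: {best_name} (RMSE {best_rmse})")
print(f"Excess over polynomial RMSE: {excess:.1%}")
print("Use transformation:", use_transformation)
```

With these numbers the 1/x transformation's RMSE is roughly 6% above the polynomial's, well outside the 1% band, which agrees with the notes' choice of the fourth order polynomial.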