Handout #15: Polynomial Terms in a Linear Regression Model

Section 15.1: Fitting a Linear Regression Model with Polynomial Terms

Consider the situation of modeling the temperature profile across the United States.

[Figure: Temperature Profile of United States, plotted over Latitude and Longitude]

Standard Linear Regression Setup
o Response Variable: Temperature
o Predictor Variables: Latitude and Longitude

Initially assume the following structure for the mean and variance functions:
o $E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $Var(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Understanding the Effects in this Model

[Figure: Model Effects compared with Reality, shown for Longitude and for Latitude]

A linear term, i.e. a constant rate of change, for Longitude does not appear to match reality.

Possible Fix: Include a quadratic term in the mean function.

\[ E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2 \]

Comments

Consider the proposed (updated) mean function

\[ E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2 \]

1. This model is said to have two predictors, but three terms in addition to the intercept. Generally speaking, a mean function is constructed using terms. Terms may be simple predictors, combinations of predictor variables, or functions of the predictor variables.

o Predictors: Latitude, Longitude
o Terms: Intercept, Latitude, Longitude, $\text{Longitude}^2$

2. This model is said to be a linear model even though it includes a quadratic term. The notion of linear here means linear in its coefficients. That is, the derivative of the mean function with respect to each coefficient is free of the coefficients.

o $\dfrac{\partial E(\text{Temperature} \mid \text{Latitude}, \text{Longitude})}{\partial \beta_0} = 1$
o $\dfrac{\partial E(\text{Temperature} \mid \text{Latitude}, \text{Longitude})}{\partial \beta_1} = \text{Latitude}$
o $\dfrac{\partial E(\text{Temperature} \mid \text{Latitude}, \text{Longitude})}{\partial \beta_2} = \text{Longitude}$
o $\dfrac{\partial E(\text{Temperature} \mid \text{Latitude}, \text{Longitude})}{\partial \beta_3} = \text{Longitude}^2$

An example of a non-linear model: consider the Michaelis-Menten model for enzyme kinetics.
In this model, $v$ = reaction rate and $x$ = concentration of substrate. Notice that each partial derivative depends on the coefficients: the derivative with respect to $\beta_1$ involves $\beta_2$, and the derivative with respect to $\beta_2$ involves both coefficients.

\[ E(v \mid x) = \frac{\beta_1 x}{\beta_2 + x} \]

Derivatives
o $\dfrac{\partial E(v \mid x)}{\partial \beta_1} = \dfrac{x}{\beta_2 + x}$
o $\dfrac{\partial E(v \mid x)}{\partial \beta_2} = \dfrac{-\beta_1 x}{(\beta_2 + x)^2}$

[Figure: Visual of the Michaelis-Menten mean function]

The estimation of model coefficients for a linear model is more straightforward than for a non-linear model. Consider the construction of the X matrix used by software to estimate model coefficients. In a linear regression model, the X matrix is built from the data alone, so estimation is straightforward; however, for a non-linear model the elements of the X matrix depend on the estimated coefficients. Thus, estimation must be done iteratively, i.e. obtain initial estimates for the coefficients, update the X matrix, re-estimate the coefficients, update the X matrix, re-estimate the coefficients, etc. This is known as iterative least squares estimation and is repeated until the coefficients do not change much from one iteration to the next.

Linear Model:

\[ \hat{\boldsymbol{\beta}} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} \]

[Figure: the X matrix for the Linear Model and for the Non-Linear Model]

Section 15.2: Predicting January Temperature in continental United States

Example 15.2.1 For this example, consider the US City Weather dataset on our course website. A snippet of the dataset is provided here.

[Figure: snippet of the US City Weather dataset]

Note: Albuquerque, NM will be removed from consideration in our analysis. This city is an extreme outlier. The effect of this city on the analysis will be considered after fitting an appropriate model.

To begin, consider a standard linear model setup. In Section 15.1, we learned that this is likely an inappropriate model, as a quadratic term for Longitude is probably necessary. The inadequate form of this model with respect to Longitude will be apparent when plotting the residuals from this model against Longitude.
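The residual check described above can be sketched in R before turning to the real data. This is a minimal illustration only: the data frame `sim` and every number in it are simulated stand-ins (an assumption, not the course dataset), built so that the true mean is curved in Longitude.

```r
# Simulated stand-in for the course dataset (hypothetical values, illustration only)
set.seed(15)
n <- 95
Latitude  <- runif(n, 25, 48)
Longitude <- runif(n, 67, 123)
# The true mean is curved in Longitude, so a straight-line term is misspecified
JanTemp <- 100 - 2.4 * Latitude + 0.02 * (Longitude - 95)^2 + rnorm(n, sd = 3)
sim <- data.frame(Latitude, Longitude, JanTemp)

# Model with linear terms only
fit_linear <- lm(JanTemp ~ Latitude + Longitude, data = sim)

# Residuals against Longitude, with a smoother to show the trend
plot(sim$Longitude, resid(fit_linear), xlab = "Longitude", ylab = "Residual")
lines(lowess(sim$Longitude, resid(fit_linear)), col = "red")
abline(h = 0, lty = 2)

# A quadratic fit to the residuals quantifies the leftover curvature
resid_check <- lm(resid(fit_linear) ~ Longitude + I(Longitude^2), data = sim)
summary(resid_check)$coefficients
```

A clearly non-zero coefficient on the squared term in `resid_check` is the numerical counterpart of the curved smoother in the plot.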
Model Setup (without the use of a quadratic term for Longitude)
o Response Variable: Jan Temp
o Predictor Variables: Latitude and Longitude

Begin with the standard mean and variance functions:
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Output from JMP for the model specified above.

[Figure: Standard Regression Output and Residual Plot]

Note: Potential problems with residuals. Further investigation is warranted.

The residuals from the initial model are plotted against each predictor, Latitude and Longitude respectively. The anticipated lack-of-fit due to not incorporating a quadratic term for Longitude is apparent in the plot to the right.

[Figure: residuals plotted against Latitude (left, weak quadratic trend) and against Longitude (right, much stronger quadratic trend, as suggested by the accompanying output)]

Some may consider the model fitting done above to be overkill. If the goal is simply to see the trend in the residuals, a kernel smoother is likely sufficient. The usual Analyze > Fit Y by X platform can be used in JMP; however, the Graph Builder framework is somewhat quicker and easier. This can be done by selecting Graph > Graph Builder.

Updated Model Setup
o Response Variable: Jan Temp
o Predictor Variables: Latitude and Longitude
o Terms: Intercept, Latitude, Longitude, $\text{Longitude}^2$

Updated structure for the mean function; continue with the constant variance function:
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Fitting this model in JMP

[Figure: fitting this model in JMP]

The regression output from the model that includes a quadratic term for Longitude is provided here. The accompanying residual plot is provided as well.

[Figure: Standard Regression Output and Residual Plot]

Note: In addition to the plot provided by JMP, the residuals should be plotted against each predictor to ensure correct functional form.

Comment: The standard form for a quadratic function is given by $y = a \cdot x^2 + b \cdot x + c$. The effect of each coefficient is shown graphically here.

[Figure: effect of each coefficient of the quadratic function]
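The role of each coefficient in the standard form can also be seen algebraically by completing the square. The derivation below is a standard identity, added as a sketch; the symbol $\bar{x}$ for the mean of a centered predictor is notation introduced here.

```latex
\[
y = a x^2 + b x + c
  = a\left(x + \frac{b}{2a}\right)^2 + \left(c - \frac{b^2}{4a}\right),
  \qquad a \neq 0 .
\]
% Vertex of the parabola:
\[
x_{\text{vertex}} = -\frac{b}{2a},
\qquad
y_{\text{vertex}} = c - \frac{b^2}{4a}.
\]
% If the predictor is centered first, x is replaced by (x - \bar{x}):
\[
y = a (x - \bar{x})^2 + b (x - \bar{x}) + c
\quad\Longrightarrow\quad
x_{\text{vertex}} = \bar{x} - \frac{b}{2a}.
\]
```

So $a$ sets the direction and width of the parabola, $c$ shifts it vertically, and $b$ (together with $a$) sets how far the vertex sits from 0, or from $\bar{x}$ when the predictor is centered. With $b = 0$ the vertex sits exactly at 0, or at $\bar{x}$ in the centered case.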
In my opinion, the fact that the coefficient for Longitude is *not* statistically different from 0 is irrelevant. It implies only that the horizontal location of the vertex of the parabola is not statistically different from the average Longitude (the average, rather than 0, because JMP applies a horizontal shift to this term). Identifying a parsimonious model, i.e. a model with only significant terms, is of utmost importance when modeling. However, in this case, the $\text{Longitude}^2$ term is statistically important, and keeping a lower-order term of the polynomial, i.e. Longitude, in the model does not really increase the complexity of the model.

Unfortunately, and a bit to my surprise, a model that includes a quadratic term did not fix the apparent lack-of-fit in the Longitude direction.

Final Model Setup
o Response Variable: Jan Temp
o Predictor Variables: Latitude and Longitude
o Terms: Intercept, Latitude, Longitude, $\text{Longitude}^2$, $\text{Longitude}^3$

Updated structure for the mean function; continue with the constant variance function:
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2 + \beta_4 \cdot \text{Longitude}^3$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Consider the rationale for including a cubic term for Longitude. For a quadratic function, i.e. a 2nd degree polynomial, the rate of decrease to the left of the vertex is the same as the rate of increase to the right. This is not the case for a 3rd degree polynomial. This functional form may in fact more closely match reality.

Output from Final Model

[Figure: JMP output from the final model]

Observations from this final model
o The residual plots provided below suggest a correct functional form for the mean function.
o The model appears to be doing well, with a high $R^2$ value of 95%.
o The Root Mean Square Error is fairly small, with a value of 2.9. Recall, an approximate 95% prediction interval is given by the predicted value ± 2 * RMSE. This suggests that predictions for Jan Temp in most locations across the US can be made to within about 5.8 °F, i.e. 2 * 2.9 = 5.8.
o The statistical importance of all model terms is evident through the very small p-values provided under the Prob > |t| column.
o A plot of the predicted values against actual Jan Temp values suggests possible over-prediction for lower temps and under-prediction for warmer temps.

Residual Plots from Final Model

[Figure: Overall residual plot from final model]
[Figure: Checking normality in the residuals]
[Figure: A list of outliers, i.e. observations whose residuals exceed 2*RMSE in magnitude]

Over-prediction appears to be occurring in northern California (and Reno, which borders California) and under-prediction is occurring in some cities in the northwest.

Viewing the Fitted Model in JMP

The Surface Profiler functionality allows us to investigate the surface in JMP. Select Factor Profiling > Surface Profiler from the red drop-down menu in JMP. The fitted model can be seen here.

[Figure: fitted surface, with a profile view of Longitude and a profile view of Latitude]

Note: The Contour Profiler produces a 2-dimensional display of this surface.

Investigating the Effect of Albuquerque, New Mexico

Albuquerque, NM was removed from consideration in the above analyses. Albuquerque, NM is a city with a small Latitude (southern city) and a large Longitude (western city), but has an unusually cold temp due to its much higher altitude (in the mountains).

Compare the summary of fit output from the model with and without Albuquerque, NM.

[Figure: Model excluding Albuquerque, NM]

A model including Albuquerque, NM clearly flags this city as an extreme outlier.

[Figure: Model including Albuquerque, NM]

An investigation of the hat values suggests that Albuquerque, NM does *not* appear to have high leverage, but is simply an outlier. Recall, Cook's Distance combines information regarding the degree to which Albuquerque, NM is an outlier with its leverage. Albuquerque's Cook's Distance is substantially higher than the others.
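Cook's Distance can be computed directly from a studentized residual and a hat value. The sketch below does so for the values this handout reports for Albuquerque (studentized residual -7.32, hat value 0.055, k + 1 = 5 coefficients). The helper name `cooks_d` is made up for this illustration, and the check against R's built-in `cooks.distance()` uses the `cars` dataset that ships with R, assuming the studentized residual meant here is the internally studentized one that `rstandard()` returns.

```r
# Cook's Distance from a studentized residual r, a hat value h,
# and the number of estimated coefficients n_coef (= k + 1)
cooks_d <- function(r, h, n_coef) {
  (r^2 / n_coef) * (h / (1 - h))
}

# Values reported in this handout for Albuquerque, NM
d_abq <- cooks_d(r = -7.32, h = 0.055, n_coef = 5)
round(d_abq, 2)   # 0.62

# Sanity check of the formula against cooks.distance() on a built-in dataset
fit_cars <- lm(dist ~ speed, data = cars)
by_hand  <- cooks_d(rstandard(fit_cars), hatvalues(fit_cars), length(coef(fit_cars)))
all.equal(unname(by_hand), unname(cooks.distance(fit_cars)))
```

Because the hat value is small, nearly all of the Cook's Distance here comes from the size of the residual, which agrees with the reading that the city is an outlier rather than a high-leverage point.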
\[ \text{Cook's } D = \frac{(\text{student } r_i)^2}{(k+1)} \cdot \frac{h_i}{(1 - h_i)} = \frac{(-7.32)^2}{5} \cdot \frac{0.055}{(1 - 0.055)} = 0.62 \]

[Figure: Plot of Cook's Distance]

Section 15.3: Fitting this model in R (with visualization of surface)

> # Attach the dataset to be used
> attach(USCity_WeatherData)

> # Getting the names of the variables in this data
> names(USCity_WeatherData)
 [1] "City"                "Latitude"            "Longitude"
 [4] "JanTemp"             "AprilTemp"           "JulyTemp"
 [7] "OctTemp"             "Precipitation.in."   "Percipitation.days."
[10] "Snowfall.in."

> # Creating the additional terms needed for the final model
> Longitude2=Longitude^2
> Longitude3=Longitude^3

> # Reconstruct a data frame with all necessary terms
> # This step is not necessary, but makes it much easier to remove Albuquerque, NM from model consideration
> mydata=data.frame(City,Latitude,Longitude,Longitude2,Longitude3,JanTemp)

> # Using the head() function to see the top portion of the newly created data
> head(mydata)
               City Latitude Longitude Longitude2 Longitude3 JanTemp
1        Albany, NY    42.67     73.75   5439.062   401130.9    22.2
2   Albuquerque, NM    35.08    106.65  11374.223  1213060.8     5.7
3     Asheville, NC    35.35     82.33   6778.229   558051.6    35.8
4       Atlanta, GA    33.75     84.38   7119.984   600784.3    42.7
5 Atlantic City, NJ    39.21     74.25   5513.062   409344.9    32.1
6        Austin, TX    30.27     97.73   9551.153   933434.2    50.2

> # Fitting the linear model with Latitude and up to a 3rd degree polynomial for Longitude.
> # data=mydata[-2,] removes the 2nd observation (Albuquerque, NM) from consideration.
> # If no observations are to be removed, this would simply read data=mydata.
> myfit=lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,data=mydata[-2,])

> # The summary() function can be used to obtain the standard regression output
> summary(myfit)

Call:
lm(formula = JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,
    data = mydata[-2, ])

Residuals:
     Min       1Q   Median       3Q      Max
-12.7138  -1.0795   0.1146   1.4033   8.4928

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.640e+02  8.551e+01  -3.087  0.00269 **
Latitude    -2.417e+00  6.152e-02 -39.289  < 2e-16 ***
Longitude    1.433e+01  2.733e+00   5.244 1.05e-06 ***
Longitude2  -1.720e-01  2.889e-02  -5.952 5.15e-08 ***
Longitude3   6.728e-04  1.004e-04   6.703 1.81e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.904 on 89 degrees of freedom
Multiple R-squared:  0.953,	Adjusted R-squared:  0.9508
F-statistic: 450.8 on 4 and 89 DF,  p-value: < 2.2e-16

Note: As it should, the model output here matches the output obtained from JMP.

Using R to create 3D plot of surface being fit

Step 1: Create a grid pattern for Latitude and Longitude.

[Figure: Creating a grid so that predictions can be made across continental US]

> # Get smallest Latitude
> min(Latitude)
[1] 25.77
> # Get largest Latitude
> max(Latitude)
[1] 47.67
> # Create a sequence of x values to make predictions from
> xgrid=seq(from=25,to=48,by=0.1)

Now, do the same for Longitude.

> min(Longitude)
[1] 67.59
> max(Longitude)
[1] 122.68
> ygrid=seq(from=67,to=123,by=0.1)

Step 2: Create a function from which predictions can be made. Note: The coefficients for each term are determined from the final regression model.

> # x = Latitude, y = Longitude; coefficients taken from the final regression model
> mypredict=function(x,y){-263.988 - 2.417*x + 14.333*y - 0.172*y^2 + 0.0006728*y^3}

Step 3: Obtain predictions across the grid and create the 3D plot.

> # Use the outer function to obtain predictions across the entire grid, save the results into a matrix named z
> z=outer(xgrid,ygrid,mypredict)

The graph will be made using an R package named rgl{}. This package must first be downloaded onto your machine and loaded in your current R session.
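Before plotting, note that the prediction surface in Steps 2 and 3 can also be computed without retyping coefficients by hand. Retyping is sensitive to rounding here: at Longitude = 123 the cubic term is multiplied by 123^3 = 1,860,867, so a change of 0.000001 in the cubic coefficient moves the prediction by almost 2 degrees. The sketch below is one possible alternative, not the handout's method. It uses simulated stand-in data (the data frame `sim` and its values are assumptions), writes the polynomial terms with `I()` so that `predict()` can rebuild them, and draws the surface with base R's `persp()`.

```r
# Simulated stand-in for mydata (hypothetical values, illustration only)
set.seed(153)
n <- 95
sim <- data.frame(Latitude = runif(n, 25, 48), Longitude = runif(n, 67, 123))
sim$JanTemp <- 100 - 2.4 * sim$Latitude +
  0.02 * (sim$Longitude - 95)^2 + rnorm(n, sd = 3)

# I() builds the polynomial terms inside the formula
fit <- lm(JanTemp ~ Latitude + Longitude + I(Longitude^2) + I(Longitude^3),
          data = sim)

# A coarser grid than in the handout, to keep the example quick
xgrid <- seq(from = 25, to = 48, by = 1)
ygrid <- seq(from = 67, to = 123, by = 1)

# Prediction function that reads the unrounded coefficients from the fit
predict_surface <- function(x, y) {
  predict(fit, newdata = data.frame(Latitude = x, Longitude = y))
}
z <- outer(xgrid, ygrid, predict_surface)

# Base-graphics perspective plot (no extra package needed)
persp(xgrid, ygrid, z, xlab = "Latitude", ylab = "Longitude", zlab = "JanTemp",
      theta = 40, phi = 25)
```

The matrix `z` has one row per Latitude value and one column per Longitude value, the same layout that `outer()` produces in Step 3, so it could be passed to `persp3d()` in the same way.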
> # First, download (install) the rgl package in R
> # Next, load this package into your current R session
> library(rgl)

Once the rgl{} package has been loaded into your current R session, use the persp3d() function to create the surface.

> # Use the persp3d function to create a 3d perspective plot
> persp3d(xgrid,ygrid,z,xlab="Latitude",ylab="Longitude",zlab="JanTemp")

A plot is produced in the rgl device window. This plot can be rotated to see the surface from different angles.

[Figure: A view of the surface in the Latitude direction]
[Figure: A view of the surface in the Longitude direction]

Section 15.4: Fitting the More Appropriate Geo-Spatial Model in R

A geo-spatial model is a modeling strategy that utilizes the correlation structure due to the proximity of the observations. For example, when making a prediction for a particular city (called kriging in geo-spatial models), the cities in close proximity are emphasized more than other cities. Model specifications determine the degree of closeness that is appropriate.

[Figure: Sampling locations in our dataset]

Observations close in distance have more impact than others when making predictions in geo-spatial models.

The gstat{} package in R will be used here to make geo-spatial predictions for Jan Temp.

> # Download and load the gstat{} package
> library(gstat)

> # Getting initial estimates for model parameters
> init.model=gstat(id="JanTemp",formula=JanTemp~1,locations = ~ Latitude + Longitude,data=USCity_WeatherData)

A variogram function is used to identify appropriate parameters for the geo-spatial model.

> # Use a variogram function to update model parameters
> plot(variogram(init.model))

[Figure: Variogram Plot]

Variogram Characteristics
o Sill = 275
o Nugget = 5
o Partial Sill = 270, i.e. Sill - Nugget
o Range = 20
o A Gaussian model will be used, i.e. "Gau"

Using the variogram characteristics to obtain the geo-spatial model.
> # Fitting the variogram from the initial model
> variogram.fit=fit.variogram(variogram(init.model), vgm(psill=270,model="Gau",range=20,nugget=5))

Creating a grid so that predictions can be made across locations in the US.

> # Creating the grid so that predictions can be made
> mygrid = expand.grid(Latitude=seq(from=25,to=48,by=0.1),Longitude = seq(from=67,to=123,by=0.1))

Using the krige() function to make predictions using a geo-spatial model.

> # Using the krige function to make predictions
> mypredict = krige(id="JanTemp",JanTemp ~ 1, locations = ~ Latitude + Longitude, model=variogram.fit,data=USCity_WeatherData,newdata=mygrid)

The plot3d() function (from the rgl{} package) is used to create the plot on the left. The axis3d() function is used here to clean up the axes a bit. The plot to the lower right was created using levelplot() in the lattice{} package.

> # Creating a visualization using the plot3d() function
> plot3d(mypredict$Latitude,mypredict$Longitude,mypredict$JanTemp.pred, xlab="",ylab="",zlab="",axes=FALSE)
> axis3d("x",at=c(25,48),labels=c("South","North"))
> axis3d("y",at=c(67,90,123),labels=c("East","Midwest","West"))
> axis3d("z")

[Figure: The plot using the above code]
[Figure: Created using levelplot() in the lattice{} package]
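The handout does not show the levelplot() call itself. The sketch below is one way such a plot might be produced, not a reconstruction of the original code. The data frame `pred` is a hypothetical stand-in for the kriging output (made-up predictions on a coarse grid), and the column `West` is introduced here to flip the horizontal axis, since the dataset stores longitude as positive degrees west.

```r
library(lattice)

# Hypothetical stand-in for the kriging output: a coarse grid with made-up predictions
pred <- expand.grid(Latitude = seq(25, 48, by = 1),
                    Longitude = seq(67, 123, by = 1))
pred$JanTemp.pred <- 100 - 2.4 * pred$Latitude + 0.02 * (pred$Longitude - 95)^2

# Negate Longitude so that west appears on the left of the plot
pred$West <- -pred$Longitude

# Level plot of the predicted surface: one colored cell per grid point
lp <- levelplot(JanTemp.pred ~ West * Latitude, data = pred,
                xlab = "Longitude (degrees west, negated)", ylab = "Latitude",
                col.regions = heat.colors(100))
print(lp)
```

With the real kriging output, the same call would presumably use the predictions data frame in place of `pred`, after adding the negated longitude column.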