Polynomial Terms in Models

Handout #15: Polynomial Terms in a Linear Regression Model
Section 15.1: Fitting a Linear Regression Model with Polynomial Terms
Consider the situation of modeling the temperature profile across the United States.
[Figure: Temperature Profile of United States, with axes Latitude and Longitude]
Standard Linear Regression Setup
• Response Variable: Temperature
• Predictor Variables: Latitude and Longitude
• Initially assume the following structure for mean and variance functions
o $E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $Var(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$
Understanding the Effects in this model
[Figure: model effects compared with reality, shown for Longitude and for Latitude]
A linear term, i.e. a constant rate of change, for Longitude does not appear to match reality.
Possible Fix: Include a quadratic term for Longitude in the mean function.
$E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$
Comments
Consider the proposed (updated) mean function
$E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$
1. This model is said to have two predictors, but three terms in addition to the intercept.
Generally speaking, a mean function is constructed using terms. Terms may be simple
predictors, combinations of predictor variables, or functions of the predictor variables.
Predictors: Latitude, Longitude
Terms: Intercept, Latitude, Longitude, Longitude^2
2. This model is said to be a linear model even though it includes a quadratic term. The
notion of linear here means linear in its coefficients. That is, the derivative of the mean
function with respect to each coefficient does not depend on any of the coefficients.
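o $\frac{\partial}{\partial \beta_0} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = 1$
o $\frac{\partial}{\partial \beta_1} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Latitude}$
o $\frac{\partial}{\partial \beta_2} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Longitude}$
o $\frac{\partial}{\partial \beta_3} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Longitude}^2$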
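Because the mean function is linear in its coefficients, the quadratic term can be handed to ordinary least squares like any other term. The following is a minimal R sketch on made-up, noise-free data; the data values and the coefficient values 5, -2, 3 and -0.02 are assumptions chosen for illustration and are not taken from the course dataset.

```r
# Hypothetical, noise-free data; NOT the US City Weather dataset
Latitude  <- c(30, 35, 40, 45, 33, 38, 42, 47, 28, 36)
Longitude <- c(70, 75, 80, 90, 95, 100, 105, 110, 118, 122)
Temperature <- 5 - 2 * Latitude + 3 * Longitude - 0.02 * Longitude^2

# I() protects ^ so that it means arithmetic squaring inside a formula
fit <- lm(Temperature ~ Latitude + Longitude + I(Longitude^2))

# Each column of the model matrix is the derivative of the mean
# function with respect to one coefficient: 1, Latitude, Longitude, Longitude^2
X <- model.matrix(fit)
head(X, 3)

# The assumed coefficients are recovered by least squares
coef(fit)
```

The columns of the model matrix contain only data values, which is exactly what "linear in its coefficients" means.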
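An example of a non-linear model: consider the Michaelis-Menten model for enzyme kinetics. In
this model $v$ = reaction rate and $x$ = concentration of substrate. Notice that the partial
derivatives are not free of the coefficients: the derivative with respect to $\beta_1$ involves
$\beta_2$, and the derivative with respect to $\beta_2$ involves both $\beta_1$ and $\beta_2$.
$E(v \mid x) = \frac{\beta_1 x}{\beta_2 + x}$
Derivatives
o $\frac{\partial}{\partial \beta_1} E(v \mid x) = \frac{x}{\beta_2 + x}$
o $\frac{\partial}{\partial \beta_2} E(v \mid x) = \frac{-\beta_1 x}{(\beta_2 + x)^2}$
[Figure: visual of the Michaelis-Menten mean function]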
The estimation of model coefficients for a linear model is more straightforward than for a non-linear
model. Consider the construction of the X matrix used by software to estimate model
coefficients. In a linear regression model, the X matrix is built from the data alone, so the
coefficients can be estimated in a single step; however, for a non-linear
model the elements of the X matrix depend on the estimated coefficients. Thus, estimation must be
done iteratively, i.e. obtain initial estimates for coefficients, update X matrix, re-estimate
coefficients, update X matrix, re-estimate coefficients, etc. This is known as iterative least squares
estimation and is repeated until the coefficients do not change much from one iteration to the
next.
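Linear Model: $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$
Non-Linear Model: no closed-form solution; the coefficients are obtained iteratively, as described above.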
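The iterative scheme described above can be sketched in R with a Gauss-Newton update, which is one common way of carrying out iterative least squares; the handout does not state which algorithm the software uses, so treat this as an illustration only. The substrate concentrations, the "true" coefficients (10 and 3) and the starting values (8 and 2) are all made up.

```r
# Hypothetical, noise-free Michaelis-Menten data
x <- c(0.5, 1, 2, 4, 8, 16)
beta_true <- c(10, 3)
v <- beta_true[1] * x / (beta_true[2] + x)

beta <- c(8, 2)   # initial estimates for (beta1, beta2)
for (iter in 1:50) {
  fitted_v <- beta[1] * x / (beta[2] + x)

  # "X matrix": partial derivatives of the mean function, evaluated at
  # the CURRENT estimates, so it must be rebuilt at every iteration
  X <- cbind(x / (beta[2] + x),
             -beta[1] * x / (beta[2] + x)^2)

  # Least squares step applied to the current residuals
  delta <- solve(t(X) %*% X, t(X) %*% (v - fitted_v))
  beta  <- beta + as.vector(delta)

  # Stop when the coefficients no longer change much
  if (max(abs(delta)) < 1e-10) break
}
beta
```

In practice R's built-in nls() function automates this kind of iteration, so the loop above is only meant to make the "update X matrix, re-estimate coefficients" cycle visible.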
Section 15.2: Predicting January Temperature in continental United States
Example 15.2.1 For this example, consider the US City Weather dataset on our course website.
A snippet of the dataset is provided here.
Note: Albuquerque, NM will be removed from consideration in our analysis. This city is an
extreme outlier. The effect of this city on the analysis will be considered after fitting an
appropriate model.
To begin, consider a standard linear model setup. In Section 15.1, we learned that this is likely an
inappropriate model as a quadratic term for Longitude is probably necessary. The inadequate
form of this model with respect to Longitude will be apparent when plotting the residuals from
this model against Longitude.
Model Setup (without the use of a quadratic term for Longitude)
• Response Variable: Jan Temp
• Predictor Variables: Latitude and Longitude
• Begin with the standard mean and variance function
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$
Output from JMP for the model specified above.
[Figure: Standard Regression Output and Residual Plot]
Note: Potential problems with residuals. Further investigation is warranted.
The residuals from the initial model are plotted against each predictor, Latitude and Longitude
respectively. The anticipated lack-of-fit due to not incorporating a quadratic term for Longitude
is apparent in the plot to the right.
[Figure: residuals against Latitude (weak quadratic trend) and residuals against Longitude (much stronger quadratic trend, as suggested by the accompanying output)]
Some may consider the model fitting done above as overkill. If the goal is simply to see the trend
in the residuals, a kernel smoother is likely sufficient. The usual Analyze > Fit Y by X platform can be
used in JMP; however, the Graph Builder framework is somewhat quicker and easier. This can
be done by selecting Graph > Graph Builder.
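For readers working in R rather than JMP, the same idea can be sketched with ksmooth(), a kernel regression smoother in base R. The data below are made up (a deterministic Latitude scatter and a quadratic Longitude effect) and the bandwidth of 15 is an arbitrary choice, so this only illustrates the technique and does not reproduce the JMP output.

```r
# Hypothetical data with a quadratic Longitude effect (not the course data)
n <- 50
Longitude <- seq(67, 123, length.out = n)
Latitude  <- 25 + 23 * ((7 * (1:n)) %% n) / n    # deterministic scatter
JanTemp   <- 100 - 2 * Latitude + 0.02 * (Longitude - 95)^2

# Fit the model WITHOUT a quadratic term and keep its residuals
fit0 <- lm(JanTemp ~ Latitude + Longitude)
res0 <- resid(fit0)

# Kernel smoother of the residuals against Longitude
sm <- ksmooth(Longitude, res0, kernel = "normal", bandwidth = 15)

# A curved smooth signals a missing polynomial term
plot(Longitude, res0, ylab = "Residual")
lines(sm, col = "red")
abline(h = 0, lty = 2)
```

A smooth that bends away from the zero line, as it does for these made-up data, points to lack-of-fit in that predictor.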
Updated Model Setup
• Response Variable: Jan Temp
• Predictor Variables: Latitude and Longitude
• Terms: Intercept, Latitude, Longitude, Longitude^2
• Updated structure for mean function; Continue with constant variance function
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$
Fitting this model in JMP
The regression output from the model that includes a quadratic term for Longitude is provided
here. The accompanying residual plot is provided as well.
[Figure: Standard Regression Output and Residual Plot]
Note: In addition to the plot provided by JMP,
the residuals should be plotted against each
predictor to ensure correct functional form.
Comment:
• The standard form for a quadratic function is given by $y = a x^2 + b x + c$. The
effect of each coefficient is shown graphically here.
• In my opinion, the fact that the coefficient for Longitude is *not* statistically different
from 0 is irrelevant. This implies only that the horizontal shift in the vertex of the parabola is
not statistically different from the average Longitude (the average Longitude appears here
because JMP invokes a horizontal shift for this term).
• Identifying a parsimonious model, i.e. a model with only significant terms, is of utmost
importance when modeling. However, in this case, the Longitude^2 term is statistically
important, and keeping a lower order term of the polynomial, i.e. Longitude, in the
model is not really increasing the complexity of the model.
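The remark about the horizontal shift can be illustrated in R. The sketch below assumes that the shift means the squared term is computed as (x - average of x)^2 instead of x^2; the data are made up. Under that assumption the two parametrizations describe the same parabola, and only the intercept and the linear coefficient change their values and interpretation.

```r
# Hypothetical data; the numbers are made up for illustration
x <- c(68, 72, 77, 81, 86, 90, 95, 99, 104, 110, 115, 121)
y <- 40 - 0.9 * (x - 95) + 0.015 * (x - 95)^2 + 0.5 * sin(1:12)

# Raw quadratic term
fit_raw <- lm(y ~ x + I(x^2))

# Quadratic term centered at the average of x
xbar <- mean(x)
fit_ctr <- lm(y ~ x + I((x - xbar)^2))

# Same fitted values and the same quadratic coefficient
all.equal(fitted(fit_raw), fitted(fit_ctr))
all.equal(unname(coef(fit_raw)[3]), unname(coef(fit_ctr)[3]))

# Intercept and linear coefficient differ between the two fits
cbind(raw = unname(coef(fit_raw)), centered = unname(coef(fit_ctr)))
```

This is why a test of the linear coefficient depends on where the quadratic term is centered, while the overall fit does not.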
Unfortunately and a bit to my surprise, a model that includes a quadratic term did not fix the
apparent lack-of-fit in the Longitude direction.
Final Model Setup
• Response Variable: Jan Temp
• Predictor Variables: Latitude and Longitude
• Terms: Intercept, Latitude, Longitude, Longitude^2, Longitude^3
• Updated structure for mean function; Continue with constant variance function
o $E(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2 + \beta_4 \cdot \text{Longitude}^3$
o $Var(\text{Jan Temp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$
Consider the rationale for including a cubic term for Longitude.
For a quadratic function, i.e. a 2nd degree polynomial, the rate of decrease to the left of the vertex
is the same as the rate of increase to the right. This is not the case for a 3rd degree polynomial.
This functional form may in fact more closely match reality.
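The symmetry argument can be written out with derivatives. The symbols $a$, $c$, $h$ (location of the vertex or turning point), $d$ (distance from it) and $\alpha$ (cubic coefficient) are notation introduced here for illustration and do not appear in the handout.

```latex
\begin{aligned}
\text{Quadratic: } f(x) &= a(x-h)^2 + c
  &&\Rightarrow\; f'(h+d) + f'(h-d) = 2ad - 2ad = 0 \\
\text{Cubic: } g(x) &= \alpha x^3 + \dots,\quad g'(h) = 0
  &&\Rightarrow\; g'(h+d) + g'(h-d) = g'''(h)\,d^2 = 6\alpha d^2 \neq 0
  \quad (\alpha \neq 0,\ d \neq 0)
\end{aligned}
```

For the quadratic, the slopes at equal distances on either side of the vertex cancel exactly; for the cubic they do not, so the curve can fall and rise at different rates on the two sides of a turning point.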
Output from Final Model
Observations from this final model
• The residual plots provided below suggest a correct functional form for the mean
function.
• The model appears to be doing well, with a high R^2 value of about 95%.
• The Root Mean Square Error is fairly small, with a value of 2.9. Recall, an approximate
95% prediction interval is given by the prediction ± 2 * RMSE. This suggests that predictions for Jan
Temp in most locations across the US can be made to within about 5.8 °F, i.e. 2 * 2.9 = 5.8.
• The statistical importance of all model terms is evident through very small p-values, as
provided under the Prob > |t| column.
• A plot of the predicted values against actual Jan Temp values suggests possible over-prediction for lower temps and under-prediction for warmer temps.
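The prediction interval arithmetic in the observations above can be written as a few lines of R. The RMSE of 2.9 is the value reported for the final model; the predicted temperature of 30 is a made-up number used only to show the calculation.

```r
rmse <- 2.9                  # Root Mean Square Error reported for the final model
half_width <- 2 * rmse       # approximate 95% margin

# Hypothetical predicted Jan Temp, used only to show the arithmetic
pred <- 30
interval <- c(lower = pred - half_width, upper = pred + half_width)
interval
```

The rule of thumb treats the margin as the same everywhere; an exact prediction interval would be somewhat wider for locations far from the bulk of the data.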
Residual Plots from Final Model
[Figure: Overall residual plot from final model]
[Figure: Checking normality in the residuals]
A list of outliers, i.e. observations whose residuals exceed 2*RMSE in magnitude, is given here.
Over-prediction appears to be occurring in northern California (and Reno, which borders California)
and under-prediction is occurring in some cities in the northwest.
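The 2*RMSE rule for listing outliers can be sketched in R as follows. The data are made up, with one outlier planted at observation 10, so the result says nothing about the cities in the course dataset.

```r
# Hypothetical data with one planted outlier (observation 10)
x <- 1:20
y <- 2 + 3 * x
y[10] <- y[10] + 15

fit  <- lm(y ~ x)
rmse <- summary(fit)$sigma            # residual standard error, i.e. RMSE

# Observations whose residual exceeds 2 * RMSE in magnitude
flag <- unname(which(abs(resid(fit)) > 2 * rmse))
flag

# Sign of the residual: positive means the model under-predicts
sign(resid(fit)[flag])
```

Since residual = actual minus predicted, a negative residual corresponds to over-prediction and a positive residual to under-prediction.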
Viewing the Fitted Model in JMP
The Surface Profiler functionality allows us to investigate the surface in JMP. Select Factor
Profiling > Surface Profiler from the red drop-down menu in JMP.
The fitted model can be seen here.
[Figure: fitted surface, with a profile view of Longitude and a profile view of Latitude]
Note: The Contour Profiler produces a 2-dimensional display of this surface.
Investigating the Effect of Albuquerque, New Mexico
Albuquerque, NM was removed from consideration in the above analyses. Albuquerque, NM is
a city with a small Latitude (southern city) and a large Longitude (western city), but has an
unusually cold temp due to its much higher altitude (in the mountains).
Comparing the summary of fit output from the model with and without Albuquerque, NM.
[Figure: summary of fit for the model excluding Albuquerque, NM and for the model including Albuquerque, NM]
A model including Albuquerque, NM clearly identifies this city as an extreme outlier.
An investigation of the hat values suggests that Albuquerque, NM does *not* appear to have
high leverage, but is simply an outlier.
Recall, Cook's Distance combines information regarding the degree to which Albuquerque, NM
is an outlier with its leverage. Albuquerque's Cook's Distance is substantially higher than
others.
$\text{Cook's } D = \frac{(\text{student } r_i)^2}{(k+1)} \cdot \frac{h_i}{(1 - h_i)} = \frac{(-7.32)^2}{5} \cdot \frac{0.055}{(1 - 0.055)} = 0.62$
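The Cook's Distance arithmetic can be checked in R. The first part uses the numbers quoted for Albuquerque, NM. The second part compares the same formula with R's cooks.distance() on made-up data, under the assumption that "student r" is the internally studentized residual, which is what rstandard() returns in R.

```r
# Arithmetic from the handout: studentized residual -7.32,
# hat value 0.055, and k + 1 = 5 estimated coefficients
r_i <- -7.32
h_i <- 0.055
p   <- 5
D_abq <- (r_i^2 / p) * (h_i / (1 - h_i))
round(D_abq, 2)

# Check of the formula against cooks.distance() on hypothetical data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 9.0, 20.2)
fit <- lm(y ~ x)
h <- hatvalues(fit)
D_manual <- rstandard(fit)^2 / length(coef(fit)) * h / (1 - h)
all.equal(D_manual, cooks.distance(fit))
```

A small hat value keeps the leverage factor h/(1 - h) small, so a large Cook's Distance such as this one is driven almost entirely by the size of the residual.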
[Figure: Plot of Cook's Distance]
Section 15.3: Fitting this model in R (with visualization of surface)
> # attach dataset to be used
>
> attach(USCity_WeatherData)
> # Getting the names of the variables in this data
>
> names(USCity_WeatherData)
 [1] "City"                "Latitude"            "Longitude"
 [4] "JanTemp"             "AprilTemp"           "JulyTemp"
 [7] "OctTemp"             "Precipitation.in."   "Percipitation.days."
[10] "Snowfall.in."
> #Creating the additional terms needed for the final model
>
> Longitude2=Longitude^2
> Longitude3=Longitude^3
> #Reconstruct a data frame with all necessary terms
> #This step is not necessary, but makes it much easier to remove Albuquerque,
> #NM from model consideration
>
> mydata=data.frame(City,Latitude,Longitude,Longitude2,Longitude3,JanTemp)
> #Using the head() function to see top portion of newly created data
>
> head(mydata)
               City Latitude Longitude Longitude2 Longitude3 JanTemp
1        Albany, NY    42.67     73.75   5439.062   401130.9    22.2
2   Albuquerque, NM    35.08    106.65  11374.223  1213060.8     5.7
3     Asheville, NC    35.35     82.33   6778.229   558051.6    35.8
4       Atlanta, GA    33.75     84.38   7119.984   600784.3    42.7
5 Atlantic City, NJ    39.21     74.25   5513.062   409344.9    32.1
6        Austin, TX    30.27     97.73   9551.153   933434.2    50.2
> # Fitting the linear model with Latitude and up to 3rd degree polynomial for
> # Longitude. The data=mydata[-2,] is used to remove the 2nd observation
> # (Albuquerque, NM) from consideration. If no observations are to be removed,
> # this would simply read data=mydata.
>
> myfit=lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,data=mydata[-2,])
> # The summary() function can be used to obtain the standard regression output
>
> summary(myfit)
Call:
lm(formula = JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,
data = mydata[-2, ])
Residuals:
     Min       1Q   Median       3Q      Max
-12.7138  -1.0795   0.1146   1.4033   8.4928

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.640e+02  8.551e+01  -3.087  0.00269 **
Latitude    -2.417e+00  6.152e-02 -39.289  < 2e-16 ***
Longitude    1.433e+01  2.733e+00   5.244 1.05e-06 ***
Longitude2  -1.720e-01  2.889e-02  -5.952 5.15e-08 ***
Longitude3   6.728e-04  1.004e-04   6.703 1.81e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.904 on 89 degrees of freedom
Multiple R-squared: 0.953,     Adjusted R-squared: 0.9508
F-statistic: 450.8 on 4 and 89 DF, p-value: < 2.2e-16
Note: As it should, the model output here matches the output obtained from JMP.
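As an aside, the polynomial terms do not have to be stored as separate columns; they can be built inside the model formula with I(). The sketch below compares the two approaches on made-up data, since the course dataset is not reproduced here. The data frame name demo and all of its values are hypothetical.

```r
# Hypothetical stand-in for mydata (deterministic, made-up values)
n <- 30
demo <- data.frame(
  Latitude  = 25 + 23 * ((11 * (1:n)) %% n) / n,
  Longitude = seq(67, 123, length.out = n)
)
demo$JanTemp <- 60 - 2 * (demo$Latitude - 35) +
  0.01 * (demo$Longitude - 95)^2 + sin(1:n)

# Approach used in the handout: create the polynomial columns first
demo$Longitude2 <- demo$Longitude^2
demo$Longitude3 <- demo$Longitude^3
fit_cols <- lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,
               data = demo)

# Alternative: build the terms inside the formula with I()
fit_I <- lm(JanTemp ~ Latitude + Longitude + I(Longitude^2) + I(Longitude^3),
            data = demo)

# Both specifications give the same coefficient estimates
all.equal(unname(coef(fit_cols)), unname(coef(fit_I)))
```

With the I() form, predict() only needs Latitude and Longitude in the new data, which removes one place where the squared and cubed columns could get out of sync.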
Using R to create 3D plot of surface being fit
Step 1: Create a grid pattern for Latitude and Longitude.
Creating a grid so that predictions can be made across continental US
> #Get smallest Latitude
> min(Latitude)
[1] 25.77
> #Get largest Latitude
> max(Latitude)
[1] 47.67
> #Create a sequence of x values to make predictions from
> xgrid=seq(from=25,to=48,by=0.1)
Now, do the same for Longitude
> min(Longitude)
[1] 67.59
> max(Longitude)
[1] 122.68
> ygrid=seq(from=67,to=123,by=0.1)
Step 2: Create a function from which predictions can be made.
Note: The coefficients for each term are determined from the final regression model.
> mypredict=function(x,y){-263.988 - 2.417*x + 14.333*y - 0.172*y^2 + 0.00067*y^3}
Step 3: Obtain predictions across grid and create the 3d plot
> # Use the outer function to obtain predictions across the entire grid,
> # save the results into a matrix named z
>
> z=outer(xgrid,ygrid,mypredict)
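A caution on the hand-typed coefficients: taking the printed output as the reference, 0.00067 differs from the estimate 6.728e-04 by 2.8e-06, and because Longitude^3 is about 1.86 million at Longitude = 123, that rounding alone shifts the prediction there by roughly 5 degrees. One way to avoid this is to let predict() use the stored coefficients. The sketch below does so on made-up, noise-free stand-in data (the data frame demo and its values are hypothetical); with the course data, the fitted object myfit would take the place of fit.

```r
# Hypothetical stand-in data; values are made up for illustration
n <- 30
demo <- data.frame(
  Latitude  = 25 + 23 * ((11 * (1:n)) %% n) / n,
  Longitude = seq(67, 123, length.out = n)
)
demo$JanTemp <- 60 - 2 * (demo$Latitude - 35) + 0.01 * (demo$Longitude - 95)^2
demo$Longitude2 <- demo$Longitude^2
demo$Longitude3 <- demo$Longitude^3
fit <- lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3, data = demo)

xgrid <- seq(from = 25, to = 48, by = 0.1)
ygrid <- seq(from = 67, to = 123, by = 0.1)

# expand.grid() varies its first argument fastest, which matches the
# column-major layout of the matrix produced by outer(xgrid, ygrid, ...)
grid <- expand.grid(Latitude = xgrid, Longitude = ygrid)
grid$Longitude2 <- grid$Longitude^2
grid$Longitude3 <- grid$Longitude^3

# predict() uses the stored, full-precision coefficients
z <- matrix(predict(fit, newdata = grid),
            nrow = length(xgrid), ncol = length(ygrid))
dim(z)
```

The resulting matrix z has the same layout as the one produced by outer(), so it can be passed to persp3d() in the same way.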
The graph will be made using an R package named rgl{}. This package must first be
downloaded onto your machine and loaded in your current R session.
> #First, download the rgl package in R
> #Next, load this package into your current R session
> library(rgl)
Once the rgl{} package has been loaded into your current R session, use the persp3d()
function to create the surface.
> #Use the persp3d function to create a 3d perspective plot
> persp3d(xgrid,ygrid,z,xlab="Latitude",ylab="Longitude",zlab="JanTemp")
A plot is produced in the rgl device window. This plot can be spun around to see the
surface from different angles.
[Figure: A view of the surface in the Latitude direction]
[Figure: A view of the surface in the Longitude direction]
Section 15.4: Fitting the More Appropriate Geo-Spatial Model in R
A geo-spatial model is a modeling strategy that utilizes the correlation structure due to the
proximity of the observations. For example, when making a prediction for a particular city
(called kriging in geo-spatial models), the cities in close proximity are emphasized more than
other cities. Model specifications determine the degree of closeness that is appropriate.
[Figure: Sampling locations in our dataset]
Observations close in distance have more impact than others when making
predictions in geo-spatial models.
The gstat{} package in R will be used here to make geo-spatial predictions for Jan Temp.
> #Download and load the gstat{} package
> library(gstat)
> # Getting initial estimates for model parameters
> init.model=gstat(id="JanTemp",formula=JanTemp~1,locations = ~
Latitude + Longitude,data=USCity_WeatherData)
A variogram function is used to identify appropriate parameters for the geo-spatial model.
> # Use a variogram function to update model parameters
> plot(variogram(init.model))
Variogram Characteristics
• Sill = 275
• Nugget = 5
• Partial Sill = 270
• Range = 20
• A Gaussian model will be used, i.e. “Gau”
[Figure: Variogram Plot]
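To see how these characteristics fit together, the sketch below evaluates a Gaussian variogram curve in base R. The functional form used here is one common parametrization and is an assumption; the gstat documentation should be consulted for the exact convention behind "Gau". The function name gau_variogram is made up.

```r
# Values read from the variogram plot in the handout
nugget <- 5
psill  <- 270
rng    <- 20

# Assumed Gaussian form: nugget + partial sill * (1 - exp(-(h / range)^2)),
# with the variogram defined as 0 at distance exactly 0
gau_variogram <- function(h) {
  ifelse(h == 0, 0, nugget + psill * (1 - exp(-(h / rng)^2)))
}

# Semivariance at a few distances
gau_variogram(c(0, 5, 20, 100))

# The curve starts near the nugget and levels off at nugget + partial sill
nugget + psill
```

Under this form the curve rises from the nugget (5) just beyond distance 0 and flattens at the sill (5 + 270 = 275), with the range controlling how quickly it gets there.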
Using the variogram characteristics to obtain the geo-spatial model.
> #Fitting the variogram from the initial model
> variogram.fit=fit.variogram(variogram(init.model),
vgm(psill=270,model="Gau",range=20,nugget=5))
Creating a grid so that predictions can be made across locations in US.
> # Creating the grid so that predictions can be made
> mygrid = expand.grid(Latitude=seq(from=25,to=48,by=0.1),Longitude =
seq(from=67,to=123,by=0.1))
Using the krige() function to make predictions using a geo-spatial model.
> # Using the krige function to make predictions
> mypredict = krige(id="JanTemp",JanTemp ~ 1, locations = ~ Latitude +
Longitude, model=variogram.fit,data=USCity_WeatherData,newdata=mygrid)
Use the plot3d() function to create a plot on the left. The axis3d() function is used here to clean
up the axes a bit. The plot to the lower right was created using the levelplot() in the lattice{}
package.
> # Creating a visualization using plot3d() function
> plot3d(mypredict$Latitude,mypredict$Longitude,mypredict$JanTemp.pred,
xlab="",ylab="",zlab="",axes=F)
> axis3d("x",at=c(25,48),labels=c("South","North"))
> axis3d("y",at=c(67,90,123),labels=c("East","Midwest","West"))
> axis3d("z")
[Figure: the plot produced by the above code, and a plot created using levelplot() in the lattice{} package]