Nonlinear Regression (Part 1)

MATH 075
SPRING 2015
Minitab Module 4: Nonlinear Regression (Part 1)
Introduction
Many data sets can be modeled well by a straight line (weight vs. height, cigarette smoking vs.
cancer risk), but many others are not adequately described by a linear form. For this latter
group, we will now study some introductory techniques of nonlinear regression.
Finding a curve of best fit for curvilinear data is an adventure that requires care and, at times,
sophistication. The first two nonlinear functions we will study in this section are the
exponential and logarithmic curves. These two curves are intrinsically linear, which simply
means we can transform them to a linear form and then use the methods previously
explored in Module 3 for linear regression. As with linear regression, we
will use values like r², se, and, just as importantly, the residual plots to assess the “goodness of
fit.” However, with curvilinear data, we first need to determine which of the many possible
functions would provide the best fit. With this one exception, the process follows the same
guidelines as linear regression in Module 3. For our introductory course, we will limit the
options of curvilinear functions to three types: exponential, logarithmic, and quadratic.
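To make "intrinsically linear" concrete, the short derivation below sketches how each of the two curves becomes a straight line after a log10 transformation. The symbols a, b, and u are illustrative names chosen here; they do not come from Minitab.

```latex
% Transforming y: an exponential curve becomes linear in x
y = a \cdot b^{x}
\quad\Longrightarrow\quad
\log_{10} y = \log_{10} a + (\log_{10} b)\, x

% Transforming x: a logarithmic curve is already linear in u = \log_{10} x
y = a + b \log_{10} x
\quad\Longrightarrow\quad
y = a + b\, u, \qquad u = \log_{10} x
```

In both cases the right-hand side has the familiar "intercept plus slope times predictor" form, so the linear regression methods from Module 3 apply to the transformed variable.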
Review of Conditions for Linear Regression (from Module 3, Topic 3.3)
When assessing how well the linear regression model fits the data, we examine the following criteria:
1) The linear regression model must have two quantitative variables.
2) The scatterplot does not contain any overly influential outliers.
3) The form of the scatterplot is linear.
The Conditions for Nonlinear Regression are nearly identical, except for Criterion 3, where we
replace “linear” with “nonlinear.”
College of the Canyons
J Gerda/K Kubo v.4
As in Module 3, we can use Minitab diagnostics to check that our statistical model does
not violate Criteria 2 and 3. A scatterplot (Graph > Scatterplot) shows visually whether the
data follow a linear or nonlinear trend (Criterion 3). The Residual Plot (from Topic 3.3)
provides evidence about Criteria 2 and 3 (Stat > Regression > Fitted Line Plot > Graphs >
Residuals versus the variables). If there is a discernible trend or pattern in the residuals,
it suggests that the model is missing some structure in the data and that we may have the wrong
model. A Residual Plot with no trend or pattern indicates a good candidate for the optimal model.
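As a rough illustration of what a patterned residual plot looks like numerically, here is a minimal Python sketch. The data are made up (y = x² for x = 1 to 7, not course data), and the helper name `fit_line` is hypothetical; Minitab performs an equivalent least-squares computation for you.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up curved data (NOT course data): y = x^2
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)

# residual = observed y - predicted y
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(r, 2) for r in residuals])
```

The residuals come out as 5, 0, -3, -4, -3, 0, 5: positive at both ends and negative in the middle. That U-shaped pattern is the kind of trend that signals a straight line is the wrong shape for the data.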
As a caution, there are serious consequences if we choose the wrong shape (Criterion 3), if one or
two points are overly influential (outliers that are far from the average x-value), if the data are
not independent, or if we extrapolate (base a prediction on an x-value outside the range of the data).
Failure to meet the other conditions can be remedied through advanced techniques that we
will not consider in this course.
The first step we perform in the analysis of two quantitative variables is to make the
scatterplot of the dataset. We will expand our current modeling beyond linear with some of
the following curves:
1. exponential growth
2. exponential decay
3. logarithmic growth
4. one of the quadratic functions (concave up or concave down)
In many cases, you will need to investigate two or more functions to find the model that fits the
data best.
For the first three curvilinear functions above, statisticians linearize the function by using the
inverse function (recall that exponential and logarithmic functions are inverses of each other).
A quadratic form can be handled directly without transforming.
For exponential growth, we will take the log10 of each of the y-values.
For exponential decay and logarithmic growth, we will take the log10 of each of the x-values. (In
both of these cases the fitted equation has the form y = a + b log10(x); the trend is decreasing
when b is negative and increasing when b is positive.)
(Minitab makes this process straightforward, as you will see later in the worksheet.)
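The two transformations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up so that they follow each curve exactly, and `fit_line` is a hypothetical helper standing in for Minitab's fitting step.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data (NOT course data).
xs = [1, 2, 3, 4, 5]

# Case 1: transform y. These values follow y = 5 * 2^x exactly.
ys_exp = [5 * 2 ** x for x in xs]
b0, b1 = fit_line(xs, [math.log10(y) for y in ys_exp])
a, b = 10 ** b0, 10 ** b1  # back-transform to y = a * b^x

# Case 2: transform x. These values follow y = 3 + 4*log10(x) exactly.
ys_log = [3 + 4 * math.log10(x) for x in xs]
c0, c1 = fit_line([math.log10(x) for x in xs], ys_log)

print(round(a, 3), round(b, 3), round(c0, 3), round(c1, 3))
```

Because the made-up data lie exactly on each curve, the straight-line fit on the transformed variable recovers the original constants (approximately 5 and 2 in the first case, 3 and 4 in the second). Real data will scatter around the curve, which is why we still need r², se, and the residual plot.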
TUKEY CHART (will be presented in class)
Sometimes it will be difficult to identify the curve of best fit using only the scatterplot (for
example, to choose between a logarithmic curve and a quadratic opening downward, or between an
exponential decay/exponential growth curve and a quadratic opening upward). In these cases, we will
need to find the best fit for each class of functions, compare the values of r² and se across the
classes, and assess the residual plots. Remember that just because a
specific model is the best fit, it does NOT automatically follow that it is a valid fit.
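For reference, the sketch below shows one way the two summary values could be computed by hand for a straight-line fit with one predictor: r² = 1 - SSE/SST and s = sqrt(SSE/(n - 2)). The function name and the data are illustrative assumptions (y = x² with its least-squares line y = -12 + 8x), not course data.

```python
import math

def r_squared_and_s(observed, fitted):
    """Return (r^2, s) for a straight-line fit with one predictor.

    s = sqrt(SSE / (n - 2)) is the standard error of the regression.
    """
    n = len(observed)
    y_bar = sum(observed) / n
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in observed)              # total variation
    return 1 - sse / sst, math.sqrt(sse / (n - 2))

# Made-up example (NOT course data): y = x^2 with the least-squares line y = -12 + 8x
xs = [1, 2, 3, 4, 5, 6, 7]
observed = [x ** 2 for x in xs]
fitted = [-12 + 8 * x for x in xs]

r2, s = r_squared_and_s(observed, fitted)
print(f"r^2 = {r2:.1%}, s = {s:.2f}")
```

Here r² is about 95.5% even though the data are perfectly curved, which illustrates why a high r² alone does not make a model a valid fit; the residual plot still has to be checked.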
A famous quote from 20th century statistician George E. P. Box:
“Essentially all models are wrong, but some are useful.”
Now we will look more closely at three examples of how to construct models for the three
different curvilinear functions and how to assess their fit.
Part 1:
Case 1: Investigating fatalities due to drunk drivers
From the Executive Summary of Statistical Analysis of Alcohol-Related Driving Trends,
1982-2005 (http://www-nrd.nhtsa.dot.gov/Pubs/810942.pdf):
“The number of fatal crashes that involved drivers who had been drinking at
the time of the crash has decreased during the past two decades. The
proportion of crash fatalities that are alcohol-related – that occurred in
crashes where at least one of the drivers and/or nonoccupants involved had a
blood alcohol concentration (BAC) of .08 or above – decreased at a steady
rate from 53 percent in 1982 to 34 percent in 1997. It leveled off for two years
and then increased by 1 percent in 2000 and remained at that level for two
more years before it decreased to 33 percent in 2005. The proportion of
drivers involved in fatal crashes who had BAC of .08 or above decreased from
35 percent in 1982 to 20 percent in 1997 and leveled off thereafter.”
We want to study the data to determine the model that fits the data. As outlined
above, the first step is to look at the scatterplot created from the data on the left.
[Figure: Scatterplot of Drunk Driver Fatal Accidents vs Year. Vertical axis: Drunk Driver Fatal Accidents (11000 to 20000); horizontal axis: Year (1980 to 2005).]
Go to my website and open the “Module 4: Drunk Driver Fatal Accidents Data” Excel file.
Copy and paste all columns from Excel into Minitab and create the Minitab scatterplot that matches the
above scatterplot.
1. Based on the scatterplot, which curvilinear models do you think we should examine? (Answer on
worksheet provided)
___________________________________
You will now learn how to create the Minitab linear regression model plots we used in Topic 3.3.
We will examine how well the linear model fits first.
Linear Model
Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot
and choose “Number of Drunk Drivers in Fatal Crashes” as the response Y and “Number
of Years Since 1980” as the predictor X
For the Type of Regression Model, choose Linear
Then to create the Residual Plot,

Click Graphs
Residuals for Plots should be automatically selected as Regular
Then in the Residuals versus the variables box, choose the same variable you chose as
your predictor (X)
Click OK twice
Copy and paste these graphs into a new Word document that you will save for reference.
Questions: (Tips for your Module project) Discuss with your partner.
a. For the explanatory variable, why didn’t we use “Year”?
b. For the explanatory variable, why did we rewrite it as “Number of Years since 1980” instead of
“Number of Years since 1982”?
[Figure: Fitted Line Plot of Number of Drunk Drivers in Fatal Accidents versus Number of Years Since 1980.
Fitted equation: Number of Drunk Drivers in Fatal Accidents = 19153 - 358.6 (Number of Years Since 1980)
S = 1219.79, R-Sq = 81.9%, R-Sq(adj) = 81.0%]
(The above graphs are provided so that you can check them against the ones you have created with Minitab.)
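As a quick sketch of how the fitted linear equation is used for prediction, the Python snippet below plugs x-values into the equation reported on the Fitted Line Plot (19153 - 358.6 x). The function name is an illustrative choice, and the year 2040 is included only to show what extrapolation can do.

```python
def predicted_drivers_linear(years_since_1980):
    """Linear fit reported on the Fitted Line Plot: 19153 - 358.6 x."""
    return 19153 - 358.6 * years_since_1980

# 1990 is inside the range of the data; 2040 is far outside it.
print(round(predicted_drivers_linear(10)))
print(round(predicted_drivers_linear(60)))
```

For 1990 (x = 10) the model predicts about 15567 drivers. For 2040 (x = 60) it predicts a negative number of drivers, which is impossible and shows why we do not extrapolate beyond the range of the x-values.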
Please record the following information from your Fitted Line Plot on your worksheet:
2. The equation of the linear regression model is: _______________________________
3. Complete the following table.
Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3) | r² | se
Linear | ________ | ________ | ________
4. What did you notice about the Versus Fit Plot (the residual plot) for the linear regression
model?
What is a second regression model we should consider?
We will now examine how well the exponential decay model fits.
Exponential Decay Model
In order to fit the data to this model using Minitab, we need to take the log10 of each x-value.
This is easily accomplished using the following directions.
Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot
and select the appropriate variable for the response Y and predictor X
For the Type of Regression Model, choose Linear
Then to create the Residual Plot

Choose Graphs
Residuals for Plots should be automatically selected as Regular
Then in the Residuals versus the variables box, choose the same variable you chose as
your predictor (X)
Click OK
To create the Exponential Decay regression model,

Choose Options
Under Transformations, check Log 10 of X (the 2nd box in the left column)
Click OK twice
Please copy and paste your graphs into the same Word document you created earlier below
the first two.
[Figure: Fitted Line Plot of Number of Drunk Drivers in Fatal Accidents versus Number of Years Since 1980.
Fitted equation: Number of Drunk Drivers in Fatal Accidents = 23406 - 8663 log10(Number of Years Since 1980)
S = 1033.24, R-Sq = 87.0%, R-Sq(adj) = 86.4%]
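The fitted equation with the Log 10 of X transformation (23406 - 8663 log10(x)) can be evaluated the same way as the linear one. The sketch below is illustrative only; the function name is hypothetical, and the explicit check on x is added here because log10(x) is defined only for positive x.

```python
import math

def predicted_drivers_log10x(years_since_1980):
    """Log10-of-X fit reported on the Fitted Line Plot: 23406 - 8663 log10(x)."""
    if years_since_1980 <= 0:
        raise ValueError("log10(x) is only defined for x > 0")
    return 23406 - 8663 * math.log10(years_since_1980)

print(round(predicted_drivers_log10x(2)))   # 1982
print(round(predicted_drivers_log10x(10)))  # 1990
```

At x = 10 the logarithm equals 1, so the prediction for 1990 is simply 23406 - 8663 = 14743 drivers. The requirement x > 0 is worth keeping in mind whenever you choose how to code a year variable before applying a log transformation.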
Notice the difference if we repeat the above directions, except now:

Choose Options
Under Transformations, check Log 10 of X and also check Display logscale for X-variable
Click OK twice
Please copy and paste these two graphs into the same Word document you created earlier.
[Figure: Fitted Line Plot of Number of Drunk Drivers in Fatal Accidents versus Number of Years Since 1980, with the X axis displayed on a log scale (1 to 10).
Fitted equation: Number of Drunk Drivers in Fatal Accidents = 23406 - 8663 log10(No of Yrs Since 1980)
S = 1033.24, R-Sq = 87.0%, R-Sq(adj) = 86.4%]
Question: What similarities and differences do you notice about the output from both of the
exponential decay models?
>>> WARNING <<<
When using Minitab to create different models (linear, exponential, quadratic,
logarithmic), be sure you double check the buttons you’ve selected, so that you
don’t inadvertently combine several models together.
Please record the following information from your Fitted Line Plot
5. The equation of the Exponential Decay regression model is:_________________________
6. Complete the following table.
Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3) | r² | se
Exponential Decay | ________ | ________ | ________
7. Complete the following.
The exponential decay model is an improvement over the linear model because the se decreased from
_______ to ________. In the context of this model, that means that, on average, the exponential
decay model's prediction for the number of drivers involved in fatal accidents will be off by
approximately ±____________ drivers, an error reduction of _______% compared to the linear model,
which can be computed as [(amount of decrease) / (se of linear model)] x 100.
The r² increased from ________% to ________%, which in the context of the problem means that the
number of years since 1980 explains _______% of the total variation in the number of drunk driver fatal
accidents, a _________% increase over the linear model, computed as [(amount of increase) / (r² of
linear model)] x 100. Also, the residuals are (circle one) more/less normally distributed for the
exponential decay model.
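The two percent-change formulas in the paragraph above can be checked with a short calculation. The sketch below is one possible way to do it; the function name is hypothetical, and the input values are the S and R-Sq values printed on the two Fitted Line Plots in Case 1.

```python
def percent_change(old, new):
    """Percent change relative to the old value (negative means a decrease)."""
    return (new - old) / old * 100

# Values read from the two Case 1 Fitted Line Plots (linear, then Log 10 of X).
s_linear, s_log10x = 1219.79, 1033.24
r2_linear, r2_log10x = 81.9, 87.0

print(f"s changed by {percent_change(s_linear, s_log10x):.1f}%")
print(f"r^2 changed by {percent_change(r2_linear, r2_log10x):.1f}%")
```

A single function covers both formulas because each one divides the amount of change by the linear model's value; the sign of the result tells you whether the value went down (se) or up (r²).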
Part 2
Case 2: Eagles
During the mid-20th century, the population of bald eagles in the lower 48 states declined
substantially. A highly toxic pesticide, DDT, was the main cause of the decline. DDT causes
damage to bird egg shells. By 1963, bald eagles were in danger of complete extinction. Only 417
pairs of bald eagles remained. In 1967, the bald eagle became an official endangered species.
Then in 1972, the EPA banned the use of DDT in the United States. The impact of the ban was a
dramatic turnaround in the fate of the bald eagle.
Note that in the table of data below, we defined our explanatory variable t to be Years after
1950. The response variable is the number of bald eagle pairs that are mating.
The data appear to be a strong candidate for the exponential model because the values are increasing
more rapidly as each year passes. However, we first check the linear model and obtain the
Fitted Line Plot using the directions from Case 1 and the data set (from my website) called
“Module 4: Eagles and Bears Data.” Copy and paste all columns from Excel into Minitab.
Linear Model
Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot
and select the appropriate variable for the response Y and predictor X
For the Type of Regression Model, choose Linear
Then to create the Residual Plot

Choose Graphs
Residuals for Plots should be automatically selected as Regular
Then in the Residuals versus the variables box, choose the same variable you chose as
your predictor (X)
Click OK
Please copy and paste your Minitab output into the same Word document you created earlier.
[Figure: Fitted Line Plot of Eagle Pairs versus Number of Years After 1950.
Fitted equation: Eagle Pairs = -3873 + 185.4 (Number of Years After 1950)
S = 765.968, R-Sq = 83.5%, R-Sq(adj) = 82.6%]
Please record the following information from your Fitted Line Plot
8. The equation of the Linear regression model is: _______________________
9. Complete the following table for the linear model only:
Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3) | r² | se ++
Linear** | ________ | ________ | ________
Exponential | ________ | ________ | ________
Important: **According to Occam’s Razor, the linear model is considered the preferred model unless
one of the nonlinear models is significantly better (over a 4% increase in the value of r²).
++ If one of your best fit options is the exponential growth model (when using the log10 y
transformation), se is no longer meaningful and cannot be used to compare models. You can
still compare values of r².
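The Occam's Razor rule above can be written as a small decision function. This is a sketch under an assumption: the "4% increase" is interpreted here as a relative increase, [(amount of increase) / (r² of linear model)] x 100, to match the formula used in Case 1. The function name and the second pair of values are hypothetical.

```python
def prefer_nonlinear(r2_linear, r2_nonlinear, threshold_pct=4.0):
    """True if r^2 rises by more than threshold_pct percent relative to the linear model.

    Assumption: the threshold is a relative increase, not percentage points.
    """
    increase_pct = (r2_nonlinear - r2_linear) / r2_linear * 100
    return increase_pct > threshold_pct

print(prefer_nonlinear(81.9, 87.0))  # Case 1 values from the Fitted Line Plots
print(prefer_nonlinear(83.5, 84.0))  # hypothetical values: a very small improvement
```

The first call returns True and the second returns False. Note that the rule compares only r²; when the response has been transformed with log10 y, se is on a different scale and is left out of the comparison entirely.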
Questions (discuss with your partner):
Is the linear regression model appropriate for the eagle pairs data? Why or why not?
If not, which regression model do you think would be a better fit?
Let’s try the exponential growth model:
Exponential Growth Model
The process we use to produce the exponential growth regression model is similar to the one
we used for the exponential decay regression model, except now we choose Log 10 of Y
in the (Stat > Regression > Fitted Line Plot > Options) window.
Create the Fitted Line Plot with Residual Plot

Stat > Regression > Fitted Line Plot
and select the appropriate variable for the response Y and predictor X
For the Type of Regression Model, choose Linear
Then to create the Residual Plot

Choose Graphs
Residuals for Plots should be automatically selected as Regular
Then in the Residuals versus the variables box, choose the same variable you chose as
your predictor (X)
Click OK
To create the Exponential Growth regression model,

Choose Options
Under Transformations, check Log 10 of Y (the upper left box)
Click OK twice
Now use Minitab to create the nonlinear regression model and related Residual Plot.
Please copy and paste your Minitab output into the same Word document you created earlier.
[Figure: Fitted Line Plot of Eagle Pairs versus Number of Years After 1950.
Fitted equation: log10(Eagle Pairs) = 2.083 + 0.03479 (Number of Years After 1950)
S = 0.0375746, R-Sq = 98.7%, R-Sq(adj) = 98.6%]
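Because Minitab reports this model on the log10 scale, it can help to back-transform it to the original scale: log10(y) = 2.083 + 0.03479 t is equivalent to y = 10^2.083 · (10^0.03479)^t. The Python sketch below does that conversion and compares predictions with the linear eagle model (Eagle Pairs = -3873 + 185.4 t). The function and variable names are illustrative choices, not Minitab output.

```python
def eagle_pairs_exponential(t):
    """Back-transform log10(pairs) = 2.083 + 0.03479 t to the original scale."""
    return 10 ** (2.083 + 0.03479 * t)

def eagle_pairs_linear(t):
    """Linear fit for the same data: pairs = -3873 + 185.4 t."""
    return -3873 + 185.4 * t

initial = 10 ** 2.083          # fitted value at t = 0
growth_factor = 10 ** 0.03479  # multiplier for each additional year

print(f"y = {initial:.1f} * {growth_factor:.4f}^t")
for t in (13, 50):  # 1963 and 2000
    print(t, round(eagle_pairs_linear(t)), round(eagle_pairs_exponential(t)))
```

The back-transformed model starts at roughly 121 pairs at t = 0 and grows by a factor of about 1.083 (about 8.3%) per year. For t = 13 the linear model predicts a negative number of eagle pairs, while the exponential model stays positive for every t.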
Please record the following information from your Fitted Line Plot
10. The equation of the Exponential Growth regression model is:________________
11. Then enter the information for the Exponential Growth regression model into the previous
table (below the Linear information).
12. Now complete the paragraph below.
The Exponential Growth regression model is an improvement over the linear regression model;
the r² increased from ________% to _______%, which is a _________% increase
( ________ / ________ ) over the linear model.
We CANNOT say anything about the se because we transformed the response (y) value, so never
compare the se in exponential growth models. (Note that if we transform only the x-value, then we
can still use the se.) There is still some concern based on the appearance of the Versus Fits
residual plot.
In the next part, we will examine another potential model (the quadratic function) and discuss
how to write an analysis in which we choose the best fit model among several choices.