MATH 075 SPRING 2015

Minitab Module 4: Nonlinear Regression (Part 1)

Introduction

Many data sets can be modeled well by a straight line (weight v. height, cigarette smoking v. cancer risk); there are also a large number that are not adequately described by the linear form. For this latter group, we will now study some introductory techniques of nonlinear regression. Finding a curve of best fit for curvilinear data is an adventure that requires care and, at times, sophistication.

The first two nonlinear functions we will study in this section are the exponential and logarithmic curves. These two curves are intrinsically linear, which simply means we can transform them to a linear form and then use the methods previously explored in Module 3 for linear regression.

Similar to our methods for linear regression, we will be using values like r², se, and, just as importantly, the residual plots to assess the “goodness of fit.” However, with curvilinear data, we first need to determine which of the many possible functions would provide the best fit. With this one exception, the process follows the same guidelines as linear regression in Module 3. For our introductory course, we will limit the options of curvilinear functions to three types: exponential, logarithmic, and quadratic.

Review of Conditions for Linear Regression (from Module 3, Topic 3.3)

When assessing how well the linear regression model fits the data, we examine the following criteria:

1) The linear regression model must have two quantitative variables.
2) The scatterplot does not contain any overly influential outliers.
3) The form of the scatterplot is linear.

The Conditions for Nonlinear Regression are nearly identical, except for Criteria 3, where we replace “linear” with “nonlinear.” As in Module 3, one can use Minitab to provide diagnostics to verify that our statistical model is not violating Criteria 2 - 3.
A scatterplot (Graph > Scatterplot) provides a visual check of Criteria 3: whether the data follow a linear or nonlinear trend.

The Residual Plot (from Topic 3.3) provides evidence about Criteria 2 and 3 (Stat > Regression > Fitted Line Plot > Graphs > Residuals versus the variables). If there is a discernible trend or pattern in the residuals, it suggests that we are missing some interaction and may have the wrong model. A model whose Residual Plot shows no trend or pattern is a good candidate for the optimal model.

College of the Canyons J Gerda/K Kubo v.4

As a caution, the consequences for the model are fatal if we choose the wrong shape (Criteria 3), if one or two points are overly influential (outliers that are far from the average x-value), if the data are not independent, or if you extrapolate (base a prediction outside the range of the x-values). Failure to meet the other conditions can be remedied through advanced techniques that we will not consider in this course.

The first step we perform in the analysis of two quantitative variables is to make the scatterplot of the dataset. We will expand our current modeling beyond linear with some of the following curves:

1. exponential growth
2. exponential decay
3. logarithmic growth
4. one of the quadratic functions (concave up or concave down)

In many cases, you will need to investigate two or more functions to find the model that fits the data best.

For the first three curvilinear functions above, statisticians linearize the function by using the inverse function (recall that exponential and logarithmic functions are inverses of each other). A quadratic form can be handled directly without transforming.

For exponential growth, we will take the log10 of each of the y-values.
For exponential decay and logarithmic growth, we will take the log10 of the x-values.
(Minitab makes this process straightforward, as you will see later in the worksheet.)
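The two transformations described above can be sketched outside Minitab as follows. This is only an illustration, assuming Python with NumPy is available; the data values are invented for the example and are not from any data set in this module. Each curve is built exactly from a known formula, so the straight-line fit on the transformed scale recovers the numbers used to build it.

```python
import numpy as np

# Hypothetical data, invented for illustration only (not from the handout).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_growth = 3.0 * 10 ** (0.2 * x)      # exponential growth: y = 3 * 10^(0.2 x)
y_log = 5.0 + 4.0 * np.log10(x)       # logarithmic curve:  y = 5 + 4 log10(x)

# Transform y: log10(y) = log10(3) + 0.2 x is a straight line in x.
slope_g, intercept_g = np.polyfit(x, np.log10(y_growth), 1)

# Transform x: y = 5 + 4 log10(x) is a straight line in log10(x).
slope_l, intercept_l = np.polyfit(np.log10(x), y_log, 1)

print(f"log10(y) = {intercept_g:.3f} + {slope_g:.3f} x")
print(f"y = {intercept_l:.3f} + {slope_l:.3f} log10(x)")
```

In both cases the regression itself is an ordinary linear fit; only the variable being fed into it has changed.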
TUKEY CHART (will be presented in class)

Sometimes it will be difficult to distinguish the Curve-of-Best-Fit using only the scatterplot (for example, between a logarithmic curve and a quadratic opening downward, or between an exponential decay/exponential growth curve and a quadratic opening upward). In these cases, we will need to find the best fit for each class of functions, compare the values of r² and se across the classes, and assess the residual plots.

Remember that just because a specific model is the best fit, it does NOT automatically follow that it is a valid fit. A famous quote from 20th century statistician George E. P. Box: “Essentially all models are wrong, but some are useful.”

Now we will look more closely at three examples of how to construct models for the three different curvilinear functions and how to assess their fit.

Part 1, Case 1: Investigating fatalities due to drunk drivers

From the Executive Summary of Statistical Analysis of Alcohol-Related Driving Trends, 1982-2005 (http://www-nrd.nhtsa.dot.gov/Pubs/810942.pdf):

“The number of fatal crashes that involved drivers who had been drinking at the time of the crash has decreased during the past two decades. The proportion of crash fatalities that are alcohol-related – that occurred in crashes where at least one of the drivers and/or nonoccupants involved had a blood alcohol concentration (BAC) of .08 or above – decreased at a steady rate from 53 percent in 1982 to 34 percent in 1997. It leveled off for two years and then increased by 1 percent in 2000 and remained at that level for two more years before it decreased to 33 percent in 2005. The proportion of drivers involved in fatal crashes who had BAC of .08 or above decreased from 35 percent in 1982 to 20 percent in 1997 and leveled off thereafter.”

We want to study the data to determine the model that fits the data.
As outlined above, the first step is to look at the scatterplot created from the data on the left.

[Figure: Scatterplot of Drunk Driver Fatal Accidents vs Year]

Go to my website and open the “Module 4: Drunk Driver Fatal Accidents Data” Excel file. Copy and paste all columns from Excel into Minitab and create the Minitab scatterplot that matches the above scatterplot.

1. Based on the scatterplot, which curvilinear models do you think we should examine? (Answer on worksheet provided) ___________________________________

You will now learn how to create the Minitab linear regression model plots we used in Topic 3.3. We will examine how well the linear model fits first.

Linear Model

Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot, and choose “Number of Drunk Drivers in Fatal Crashes” as the response (Y) and “Number of Years Since 1980” as the predictor (X).
For the Type of Regression Model, choose Linear.

Then, to create the Residual Plot:
Click Graphs.
Residuals for Plots should be automatically selected as Regular.
Then, in the Residuals versus the variables box, choose the same variable you chose as your predictor (X).
Click OK twice.

Copy and paste these graphs into a new Word document that you will save for reference.

Questions: (Tips for your Module project) Discuss with your partner.
a. For the explanatory variable, why didn’t we use “Year?”
b.
For the explanatory variable, why did we rewrite it as “Number of Years since 1980” instead of “Number of Years since 1982?”

[Figure: Fitted Line Plot, Number of Drunk Drivers in Fatal Accidents vs Number of Years Since 1980]
Number of Drunk Drivers in Fatal Accidents = 19153 - 358.6 Number of Years Since 1980
S = 1219.79, R-Sq = 81.9%, R-Sq(adj) = 81.0%

(The above graphs are provided so that you can check them against the ones you have created with Minitab.)

Please record the following information from your Fitted Line Plot on your worksheet:

2. The equation of the linear regression model is: _______________________________

3. Complete the following table.

Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3) | r² | se
Linear | | |

4. What did you notice about the Versus Fit Plot (the residual plot) for the linear regression model? What is a second regression model we should consider?

We will now examine how well the exponential decay model fits.

Exponential Decay Model

In order to fit the data to this model using Minitab, we need to take the log10 of each x-value. This is easily accomplished using the following directions.
Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot, and select the appropriate variable for the response (Y) and predictor (X).
For the Type of Regression Model, choose Linear.

Then, to create the Residual Plot:
Choose Graphs.
Residuals for Plots should be automatically selected as Regular.
Then, in the Residuals versus the variables box, choose the same variable you chose as your predictor (X).
Click OK.

To create the Exponential Decay regression model:
Choose Options.
Under Transformations, check Log 10 of X (the 2nd box in the left column).
Click OK twice.

Please copy and paste your graphs into the same Word document you created earlier, below the first two.

[Figure: Fitted Line Plot, Number of Drunk Drivers in Fatal Accidents vs Number of Years Since 1980]
Number of Drunk Drivers in Fatal Accidents = 23406 - 8663 log10(Number of Yrs Since 1980)
S = 1033.24, R-Sq = 87.0%, R-Sq(adj) = 86.4%

Notice the difference if we repeat the above directions, except now:
Choose Options.
Under Transformations, check Log 10 of X and also check Display logscale for X-variable.
Click OK twice.

Please copy and paste these two graphs into the same Word document you created earlier.

[Figure: Fitted Line Plot with log-scale X axis, Number of Drunk Drivers in Fatal Accidents vs Number of Years Since 1980]
Number of Drunk Drivers in Fatal Accidents = 23406 - 8663 log10(No of Yrs Since 1980)
S = 1033.24, R-Sq = 87.0%, R-Sq(adj) = 86.4%

Question: What similarities and differences do you notice about the output from both of the exponential decay models?

>>> WARNING <<< When using Minitab to create different models (linear, exponential, quadratic, logarithmic), be sure you double check the buttons you’ve selected, so that you don’t inadvertently combine several models together.
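As a rough check on the two fitted equations reported above for the drunk driver data, the sketch below evaluates each one at a few x-values. It assumes Python is available; the function names are made up for this illustration, and the coefficients are copied from the Fitted Line Plot output shown in this worksheet. The x-values are kept inside the range of the data (1982 to 2005, that is, 2 to 25 years since 1980), in keeping with the caution about extrapolation.

```python
import math

def linear_fit(years_since_1980):
    # Equation printed on the linear Fitted Line Plot.
    return 19153 - 358.6 * years_since_1980

def log10_x_fit(years_since_1980):
    # Equation printed on the Fitted Line Plot with Log 10 of X checked.
    return 23406 - 8663 * math.log10(years_since_1980)

# Compare the two predictions at the start, middle, and end of the data range.
for t in (2, 10, 25):
    print(t, round(linear_fit(t)), round(log10_x_fit(t)))
```

Checking “Display logscale for X-variable” changes only how the x-axis is drawn, which is why both log10 plots above report the same equation, S, and R-Sq.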
Please record the following information from your Fitted Line Plot

5. The equation of the Exponential Decay regression model is: _________________________

6. Complete the following table.

Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3, 4) | r² | se
Exponential Decay | | |

7. Complete the following.

The exponential decay model is an improvement over the linear model because the se decreased from _______ to ________. In the context of this model, that means that, on average, the exponential decay model's prediction for the number of drivers involved in fatal accidents will be off by approximately ±____________ drivers, an error reduction of _______% compared to the linear model, computed as [(amount of decrease / se of linear model) x 100]. The r² increased from ________% to ________%, which in the context of the problem means that the number of years since 1980 explains _______% of the total variation in the number of drunk driver fatal accidents, a _________% increase from the linear model, computed as [(amount of increase / r² of linear model) x 100]. Also, the residuals are (circle one) more/less normally distributed for the exponential decay model.

Part 2, Case 2: Eagles

During the mid-20th century, the population of bald eagles in the lower 48 states declined substantially. A highly toxic pesticide, DDT, was the main cause of the decline. DDT causes damage to bird egg shells. By 1963, bald eagles were in danger of complete extinction. Only 417 pairs of bald eagles remained. In 1967, the bald eagle became an official endangered species. Then in 1972, the EPA banned the use of DDT in the United States. The impact of the ban was a dramatic turnaround in the fate of the bald eagle.

Note that in the table of data below, we defined our explanatory variable t to be Years after 1950. The response variable is the number of bald eagle pairs that are mating.
The eagle data appear to be a strong candidate for the exponential model because the values increase more rapidly as each year passes. However, we can first check the linear model and obtain the Fitted Line Plot using the directions from Case 1 and the data set (from my website) called “Module 4: Eagles and Bears Data.” Copy and paste all columns from Excel into Minitab.

Linear Model

Create the Fitted Line Plot and Residual Plot

Stat > Regression > Fitted Line Plot, and select the appropriate variable for the response (Y) and predictor (X).
For the Type of Regression Model, choose Linear.

Then, to create the Residual Plot:
Choose Graphs.
Residuals for Plots should be automatically selected as Regular.
Then, in the Residuals versus the variables box, choose the same variable you chose as your predictor (X).
Click OK.

Please copy and paste your Minitab output into the same Word document you created earlier.

[Figure: Fitted Line Plot, Eagle Pairs vs Number of Years After 1950]
Eagle Pairs = -3873 + 185.4 Number of Years After 1950
S = 765.968, R-Sq = 83.5%, R-Sq(adj) = 82.6%

Please record the following information from your Fitted Line Plot

8. The equation of the Linear regression model is: _______________________

9. Complete the following table for the linear model only:

Model | Residual Plot: What do you see? (oval, band, fan shape, curvilinear pattern, influential outliers) (Addresses Criteria 2, 3) | r² | se ++
Linear** | | |
Exponential | | |

Important:
** According to Occam’s Razor, the linear model is considered the preferred model unless one of the nonlinear models is significantly better (over a 4% increase in the value of r²).
++ If one of your best fit options is the exponential growth model (when using log10 y transformations), se is no longer meaningful and cannot be used to compare models. You can compare values of r².
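The Occam's Razor note above can be written as a small decision rule. The sketch below is one possible reading, not a definitive one: it assumes the “4% increase” is measured the same way this worksheet measures percent increase elsewhere, as [(amount of increase / r² of linear model) x 100]. The function names and the r² values are hypothetical and chosen only for illustration.

```python
def percent_increase(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

def prefer_nonlinear(r2_linear, r2_nonlinear, threshold_pct=4.0):
    """True when r-squared improves on the linear model by more than the threshold.

    Assumption: the threshold is a relative percent increase, not percentage points.
    """
    return percent_increase(r2_linear, r2_nonlinear) > threshold_pct

# Hypothetical r-squared values (in percent), invented for illustration.
print(prefer_nonlinear(80.0, 82.0))   # a 2.5% increase: keep the linear model
print(prefer_nonlinear(80.0, 90.0))   # a 12.5% increase: consider the nonlinear model
```

Even when the rule favors the nonlinear model, the residual plot still has to be checked before the model is accepted.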
Questions (discuss with your partner): Is the linear regression model appropriate for the eagle pairs data? Why or why not? If not, which regression model do you think would be a better fit?

Let’s try the exponential growth model:

Exponential Growth Model

The process we use to produce the exponential growth regression model is similar to how Minitab generates the exponential decay regression model, except now we choose Log 10 of Y in the (Stat > Regression > Fitted Line Plot > Options) window.

Create the Fitted Line Plot with Residual Plot

Stat > Regression > Fitted Line Plot, and select the appropriate variable for the response (Y) and predictor (X).
For the Type of Regression Model, choose Linear.

Then, to create the Residual Plot:
Choose Graphs.
Residuals for Plots should be automatically selected as Regular.
Then, in the Residuals versus the variables box, choose the same variable you chose as your predictor (X).
Click OK.

To create the Exponential Growth regression model:
Choose Options.
Under Transformations, check Log 10 of Y (the upper left box).
Click OK twice.

Now use Minitab to create the nonlinear regression model and related Residual Plot. Please copy and paste your Minitab output into the same Word document you created earlier.

[Figure: Fitted Line Plot, Eagle Pairs vs Number of Years After 1950]
log10(Eagle Pairs) = 2.083 + 0.03479 Number of Years After 1950
S = 0.0375746, R-Sq = 98.7%, R-Sq(adj) = 98.6%

Please record the following information from your Fitted Line Plot

10. The equation of the Exponential Growth regression model is: ________________

11. Then enter the information for the Exponential Growth regression model into the previous table (below the Linear information).

12. Now complete the paragraph below.
The Exponential Growth regression model is an improvement over the linear regression model; the r² increased from ________% to _______%, which is a _________% increase ( ______ / ______ ) over the linear model. We CANNOT say anything about the se because we transformed the response (y) value, so never compare the se in exponential growth models. (Note that if we transform the x-value, then we can still use the se.) There is still some concern based on the appearance of the Versus Fits residual plot.

In the next part, we will examine another potential model (the quadratic function) and discuss how to write an analysis in which we choose the best fit model among several choices.
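To see why the se values of the two eagle models cannot be compared, it helps to put both fitted equations side by side. The sketch below assumes Python is available; the function names are made up for this illustration, and the coefficients and S values are copied from the Minitab output shown in this worksheet. The exponential growth equation predicts log10(Eagle Pairs), so it is back-transformed with a power of 10 to get a prediction in pairs.

```python
def eagle_pairs_linear(years_after_1950):
    # Linear Fitted Line Plot equation; its S (765.968) is in eagle pairs.
    return -3873 + 185.4 * years_after_1950

def eagle_pairs_exponential(years_after_1950):
    # Log 10 of Y equation; undo the log10 to return to eagle pairs.
    log10_pairs = 2.083 + 0.03479 * years_after_1950
    return 10 ** log10_pairs

for t in (15, 30, 50):
    print(t, round(eagle_pairs_linear(t)), round(eagle_pairs_exponential(t)))

# S for the exponential growth model is in log10 units, so on the original
# scale it acts as a multiplicative factor rather than a number of pairs.
s_log10 = 0.0375746
print(round(10 ** s_log10, 3))
```

The linear S is measured in eagle pairs while the exponential growth S is measured in log10 units, so the two numbers are on different scales. The linear equation also gives a negative number of pairs for the early years, which a count of eagle pairs cannot be.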