AP Statistics Section 12.2: Transforming to Achieve Linearity

In Chapter 3, we learned how to analyze relationships between two quantitative variables that showed a linear pattern. When two-variable data show a curved relationship, we must develop new techniques for finding an appropriate model. This section describes several simple transformations of data that can straighten a nonlinear pattern. Once the data have been transformed to achieve linearity, we can use least-squares regression to generate a useful model for making predictions. And if the conditions for regression inference are met, we can estimate or test a claim about the slope of the population (true) regression line using the transformed data.

Applying a function such as the logarithm or square root to a quantitative variable is called transforming the data. We will see in this section that understanding how simple functions work helps us choose and use transformations to straighten nonlinear patterns.

Likewise, we could also calculate the reciprocal of the income per person and graph 1/(income per person) versus child mortality rate. Again we get a linear relationship!

[Scatterplot: RecipIncome (vertical axis, 0.0000 to 0.0016) versus Under5_Mortality_Rate (horizontal axis, 0 to 180)]

Transforming Relationships
• We know how to handle and assess linear data. How do we handle curved data? There are so many different curves.
• Your calculator has numerous regression models.

What we would like to do:
What we need to do:

Transforming Data

Sometimes the relationship between y and x is based on repeated multiplication by a constant factor. That is, each time x increases by 1 unit, the value of y is multiplied by b. An exponential model of the form $y = ab^x$ describes such multiplicative growth. Taking the logarithm of both sides and applying the properties of logarithms gives $\log y = \log(ab^x) = \log a + \log(b^x) = \log a + x \log b$. We can rearrange the final equation as $\log y = \log a + (\log b)x$. Notice that $\log a$ and $\log b$ are constants because a and b are constants.
So the equation gives a linear model relating the explanatory variable x to the transformed variable log y. Thus, if the relationship between two variables follows an exponential model, and we plot the logarithm (base 10 or base e) of y against x, we should observe a straight-line pattern in the transformed data.

Example: Exponential Growth - Bacteria

Example: Fishing Tournament (refer to the example on p. 773)

A student opened a bag of M&M’s, dumped them out, and ate all the ones with the M on top. When he finished, he put the remaining 30 M&M’s back in the bag and repeated the same process over and over until all the M&M’s were gone. Here is a table and scatterplot showing the number of M&M’s remaining at the end of each “course”.

Course             1    2    3    4    5    6    7
M&M’s remaining   30   13   10    3    2    1    0

Since the number of M&M’s should be cut in half after each course, an exponential model should describe the relationship between the variables.

Problem:
(a) A scatterplot of the natural log of the number of M&M’s remaining versus course number is shown below. The last observation in the table is not included since ln(0) is undefined. Explain why it would be reasonable to use an exponential model to describe the relationship between the number of M&M’s remaining and the course number.

Solution: If there is an exponential relationship between two variables x and y, we expect a scatterplot of (x, ln y) to be roughly linear. Since the scatterplot of ln(remaining) versus course number is roughly linear, an exponential model seems appropriate here.

(b) Minitab output from a linear regression analysis on the transformed data is shown below. Give the equation of the least-squares regression line, defining any variables you use.
Regression Analysis: LnRemaining versus Course

Predictor   Coef       SE Coef    T        P
Constant    4.0593     0.1852     21.92    0.000
Course      -0.68073   0.04755    -14.32   0.000

S = 0.198897   R-Sq = 98.1%   R-Sq(adj) = 97.6%

Solution: $\widehat{\ln y} = 4.0593 - 0.68073x$, where $\widehat{\ln y}$ is the predicted value of the natural log of the number of M&M’s remaining and x = course number.

(c) Use your model from part (b) to predict the original number of M&M’s in the bag.

Solution: To estimate the original number of M&M’s, we need to predict the amount remaining after course 0: $\widehat{\ln y} = 4.0593 - 0.68073(0) = 4.0593$. Thus, $\hat{y} = e^{4.0593} = 57.93$ M&M’s.

(d) A residual plot of the linear regression in part (b) is shown below. Discuss what this graph tells you about the appropriateness of the model.

In an earlier alternate example, we transformed the variables Under-5 child mortality and Income per person using the reciprocal function for a random sample of 14 countries. We can also try transforming these variables using the natural logarithm function. Here is a scatterplot of the data.

Problem: The graphs below show the results of two different transformations of the data.

(a) Explain why a power model would provide a more appropriate description of the relationship between income per person and under-5 mortality rate than an exponential model.

Solution: The scatterplot of ln(Income) versus ln(Under5) is more linear than the scatterplot of ln(Income) versus Under5.

(b) Minitab output for a linear regression analysis using y = ln(Income) and x = ln(Under5) is shown below. Give the equation of the least-squares regression line, defining any variables you use.

Regression Analysis: ln(Income) versus ln(Under5)

Predictor    Coef       SE Coef    T        P
Constant     11.4950    0.2680     42.90    0.000
ln(Under5)   -0.93740   0.07579    -12.37   0.000

S = 0.343501   R-Sq = 92.7%   R-Sq(adj) = 92.1%

Solution: $\widehat{\ln y} = 11.495 - 0.9374\ln(x)$, where $\widehat{\ln y}$ is the predicted value of ln(income) and x = under-5 mortality rate.
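A fitted line that is linear in both logarithms corresponds to a power model on the original scale. The short derivation below is a sketch of that back-transformation. The symbols $b_0$ and $b_1$ for a generic intercept and slope are introduced here for illustration and do not appear in the original notes.

```latex
\begin{align*}
\widehat{\ln y} &= b_0 + b_1 \ln x \\
\hat{y} &= e^{\,b_0 + b_1 \ln x} = e^{b_0}\, e^{\ln\left(x^{b_1}\right)} = e^{b_0}\, x^{b_1} \\[4pt]
\text{With } b_0 = 11.495,\ b_1 = -0.9374: \qquad
\hat{y} &= e^{11.495}\, x^{-0.9374}
\end{align*}
```

Because the exponent on x is negative, predicted income falls as the under-5 mortality rate rises, which agrees with the negative slope in the regression output.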
(c) Use your model to predict the income per person for Turkey, with an under-5 mortality rate of 20.3.

Solution: $\widehat{\ln y} = 11.495 - 0.9374\ln(20.3) = 8.6728$, so $\hat{y} = e^{8.6728} \approx 5842$ dollars.

(d) A residual plot for the linear regression in part (b) is shown below. What does the plot tell you about the appropriateness of the power model?

Solution: Since there is no leftover curvature in the residual plot, the power model is appropriate.
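The arithmetic in both worked examples can be checked with a few lines of code. The sketch below refits the M&M’s line from the table of courses 1 to 6 and then back-transforms the two predictions. It uses only the Python standard library. The helper names `fit_line` and `predict_income` are hypothetical, chosen for illustration, and the power-model coefficients are copied from the Minitab output because the raw country data are not given in these notes.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# M&M's example: course 7 is dropped because ln(0) is undefined.
course = [1, 2, 3, 4, 5, 6]
remaining = [30, 13, 10, 3, 2, 1]

# Exponential model: regress ln(remaining) on course.
ln_remaining = [math.log(r) for r in remaining]
intercept, slope = fit_line(course, ln_remaining)

# Predict at course 0 on the log scale, then undo the log with exp.
predicted_start = math.exp(intercept + slope * 0)


def predict_income(under5, b0=11.4950, b1=-0.93740):
    """Power model: ln(income) = b0 + b1 * ln(under5), back-transformed."""
    return math.exp(b0 + b1 * math.log(under5))


# Turkey has an under-5 mortality rate of 20.3.
turkey_income = predict_income(20.3)

print(f"ln(remaining) = {intercept:.4f} + ({slope:.5f}) * course")
print(f"Predicted M&M's at course 0: {predicted_start:.2f}")
print(f"Predicted income for Turkey: {turkey_income:.0f}")
```

The printed intercept, slope, and predictions should agree with the Minitab coefficients and the answers to part (c) of each example, up to rounding.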