AP Statistics Section 12.2: Transforming to Achieve Linearity
In Chapter 3, we learned how to analyze relationships between two quantitative variables that showed a linear
pattern.
When two-variable data show a curved relationship, we must develop new techniques for finding an appropriate
model. This section describes several simple transformations of data that can straighten a nonlinear pattern.
Once the data have been transformed to achieve linearity, we can use least-squares regression to generate a useful
model for making predictions.
And if the conditions for regression inference are met, we can estimate or test a claim about the slope of the
population (true) regression line using the transformed data.
Applying a function such as the logarithm or square root to a quantitative variable is called
transforming the data. We will see in this section that understanding how simple functions
work helps us choose and use transformations to straighten nonlinear patterns.
Likewise, we could also calculate the reciprocal of the income per person and graph 1/(income per person)
versus child mortality rate. Again we get a linear relationship!
[Scatterplot: RecipIncome (vertical axis, 0.0000 to 0.0016) versus Under5_Mortality_Rate (horizontal axis, 0 to 180)]
Transforming Relationships
• We know how to handle and assess linear data. Now you will learn how to handle curved data,
which can follow many different kinds of curves.
• Your calculator has numerous regression models.
What we would like to do:
What we need to do:
Transforming Data
Sometimes the relationship between y and x is based on repeated multiplication by a constant
factor. That is, each time x increases by 1 unit, the value of y is multiplied by b. An exponential
model of the form y = ab^x describes such multiplicative growth.
Taking the logarithm of both sides of y = ab^x gives log y = log a + x log b, which we can
rearrange as log y = log a + (log b)x. Notice that log a and log b are constants because a and b
are constants.
So the equation gives a linear model relating the explanatory variable x to the transformed
variable log y.
Thus, if the relationship between two variables follows an exponential model, and we plot the
logarithm (base 10 or base e) of y against x, we should observe a straight-line pattern in the
transformed data.
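The following is a minimal sketch of this idea, not part of the original notes. The constants a = 3 and b = 2 are hypothetical values chosen only for illustration. It generates points from y = ab^x, takes the base-10 logarithm of y, and checks that log y rises by the same amount (log b) for every 1-unit increase in x, which is what a straight line requires.

```python
import math

# Hypothetical constants for illustration only; they do not come from the notes.
a, b = 3.0, 2.0

xs = [0, 1, 2, 3, 4, 5]
ys = [a * b ** x for x in xs]            # exponential model y = a * b^x
log_ys = [math.log10(y) for y in ys]     # transformed response, log y

# On a straight line, consecutive differences in log y are constant.
# Here each difference should equal the slope, log b.
steps = [log_ys[i + 1] - log_ys[i] for i in range(len(xs) - 1)]

print("log a =", math.log10(a), "| value of log y at x = 0:", log_ys[0])
print("log b =", math.log10(b), "| steps in log y:", steps)
```

The same check works with natural logs (math.log); only the numerical values of the intercept and slope change, not the straight-line pattern.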
Example: Exponential Growth – Bacteria
Example: Fishing Tournament:
Refer to the example on p. 773:
A student opened a bag of M&M’s, dumped them out, and ate all the ones with the M on top. When he finished, he put
the remaining 30 M&M’s back in the bag and repeated the same process over and over until all the M&M’s were gone.
Here is a table and scatterplot showing the number of M&M’s remaining at the end of each “course”.
Course             1    2    3    4    5    6    7
M&M’s remaining   30   13   10    3    2    1    0
Since the number of M&M’s should be cut in half after each course, an exponential model should describe the
relationship between the variables.
Problem:
(a) A scatterplot of the natural log of the number of M&M’s remaining versus course number is shown below. The last
observation in the table is not included since ln(0) is undefined. Explain why it would be reasonable to use an
exponential model to describe the relationship between the number of M&M’s remaining and the course number.
Solution: If there is an exponential relationship between two variables x and y, we
expect a scatterplot of (x, ln y) to be roughly linear. Since the scatterplot of
ln(remaining) versus course number is roughly linear, an exponential model
seems appropriate here.
(b) Minitab output from a linear regression analysis on the transformed data is shown below. Give the equation of
the least-squares regression line, defining any variables you use.
Regression Analysis: LnRemaining versus Course
Predictor   Coef       SE Coef    T        P
Constant    4.0593     0.1852     21.92    0.000
Course      -0.68073   0.04755    -14.32   0.000
S = 0.198897   R-Sq = 98.1%   R-Sq(adj) = 97.6%
Solution: ln ŷ = 4.0593 - 0.68073x, where ln ŷ = the predicted value of the natural log of the
number of M&M’s remaining and x = course number.
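As a check on part (b), here is a short sketch (not from the original handout) that recomputes the least-squares line from the table of courses and M&M’s remaining. The helper name least_squares is an illustrative assumption. It uses the standard formulas slope = Sxy / Sxx and intercept = ȳ - slope · x̄, and the result should agree closely with the Minitab coefficients.

```python
import math

# Data from the table: course number and M&M's remaining.
# Course 7 (0 remaining) is dropped because ln(0) is undefined.
course = [1, 2, 3, 4, 5, 6]
remaining = [30, 13, 10, 3, 2, 1]
ln_remaining = [math.log(r) for r in remaining]   # natural log transform


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = least_squares(course, ln_remaining)
print(f"ln(remaining) = {intercept:.4f} + ({slope:.5f}) * course")
```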
(c) Use your model from part (b) to predict the original number of M&M’s in the bag.
To estimate the original number of M&M’s, we need to predict the amount remaining after
course 0. ln ŷ = 4.0593 - 0.68073(0) = 4.0593. Thus, ŷ = e^4.0593 = 57.93 M&M’s.
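The back-transformation in part (c) can be sketched as follows, using the coefficients from the Minitab output. The function name predict_remaining is an illustrative assumption. The sketch also rewrites the fitted line in the exponential form ŷ = ab^x, with a = e^4.0593 and b = e^(-0.68073), so the per-course multiplier b can be compared with the one-half suggested earlier in the example.

```python
import math

# Coefficients from the Minitab output for ln(remaining) versus course.
intercept, slope = 4.0593, -0.68073


def predict_remaining(course):
    """Predict M&M's remaining by undoing the natural log with exp."""
    return math.exp(intercept + slope * course)


a = math.exp(intercept)   # predicted count at course 0 (the original bag)
b = math.exp(slope)       # factor the count is multiplied by at each course

print(f"predicted original number of M&M's: {a:.2f}")
print(f"per-course multiplier b: {b:.3f}")
```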
(d) A residual plot of the linear regression in part (b) is shown below. Discuss what this graph tells you about the
appropriateness of the model.
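Since the residual plot itself is not reproduced here, the sketch below (an addition, not part of the original handout) computes the residuals for the transformed data from the table and the Minitab coefficients. In a residual plot one looks for any leftover curved pattern. The sketch also computes the standard deviation of the residuals with n - 2 degrees of freedom, which should be close to the S value in the Minitab output.

```python
import math

# Transformed data from the table (course 7 omitted because ln(0) is undefined).
course = [1, 2, 3, 4, 5, 6]
remaining = [30, 13, 10, 3, 2, 1]
ln_remaining = [math.log(r) for r in remaining]

# Coefficients from the Minitab output.
intercept, slope = 4.0593, -0.68073

# Residual = observed ln(remaining) minus predicted ln(remaining).
residuals = [y - (intercept + slope * x) for x, y in zip(course, ln_remaining)]
for x, r in zip(course, residuals):
    print(f"course {x}: residual = {r:+.3f}")

# Standard deviation of the residuals, using n - 2 degrees of freedom.
s = math.sqrt(sum(r ** 2 for r in residuals) / (len(course) - 2))
print(f"s = {s:.4f}")
```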
In an earlier alternate example, we transformed the variables Under-5 child mortality and Income per person using the
reciprocal function for a random sample of 14 countries. We can also try transforming these variables using the natural
logarithm function. Here is a scatterplot of the data.
Problem: The graphs below show the results of two different transformations of the data.
(a) Explain why a power model would provide a more appropriate description of the relationship between income per
person and under-5 mortality rate than an exponential model.
The scatterplot of ln(Income) versus ln(Under5) is more linear than the scatterplot of
ln(Income) versus Under5.
(b) Minitab output for a linear regression analysis using y = ln(Income) and x = ln(Under5) is shown below. Give the
equation of the least-squares regression line, defining any variables you use.
Regression Analysis: ln(Income) versus ln(Under5)
Predictor    Coef       SE Coef    T        P
Constant     11.4950    0.2680     42.90    0.000
ln(Under5)   -0.93740   0.07579    -12.37   0.000
S = 0.343501   R-Sq = 92.7%   R-Sq(adj) = 92.1%
ln ŷ = 11.495 - 0.9374 ln(x), where ln ŷ is the predicted value of ln(income) and x = Under5
mortality rate.
(c) Use your model to predict the income per person for Turkey, with an under-5 mortality rate of 20.3.
ln ŷ = 11.495 - 0.9374 ln(20.3) = 8.6728
ŷ = e^8.6728 = $5842.
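The prediction in part (c) can be reproduced with the short sketch below, using the coefficients from the Minitab output. The function names are illustrative assumptions. It also shows the equivalent power-model form: exponentiating both sides of ln ŷ = 11.495 - 0.9374 ln(x) gives ŷ = e^11.495 · x^(-0.9374), so both functions should return the same prediction.

```python
import math

# Coefficients from the Minitab output for ln(Income) versus ln(Under5).
intercept, slope = 11.495, -0.9374


def predict_income(under5):
    """Predict income per person from the line fitted on the log-log scale."""
    ln_income = intercept + slope * math.log(under5)
    return math.exp(ln_income)   # undo the natural log


def predict_income_power(under5):
    """Same prediction written as a power model: income = a * under5^slope."""
    a = math.exp(intercept)
    return a * under5 ** slope


turkey = predict_income(20.3)    # Turkey's under-5 mortality rate is 20.3
print(f"predicted income per person: ${turkey:.0f}")
```

Carrying full precision instead of rounding ln ŷ to 8.6728 changes the prediction only by a fraction of a dollar.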
(d) A residual plot for the linear regression in part (b) is shown below. What does the plot tell you about the
appropriateness of the power model?
Since there is no leftover curvature in the residual plot,
the power model is appropriate.