
Notes

Apr. 19th Mon.
Chapter 3.4 NONLINEAR RELATIONSHIPS
Scatterplots of (bivariate) data frequently show curvature rather than a linear pattern. (An example of
such curvature was shown in Exercise 27, Data Set #2; see p.16 of Apr. 18 Sun.pdf.) In Chapter 3.4, we
discuss different ways to fit a curve (rather than a straight line) to such scatterplots of (bivariate) data.
Power Transformations
(For review, a list of algebraic power transformations is on the left-hand side of p.133 of your text.)
In this section, you are expected to know WHEN a power transformation is to be employed. We are not
going to get you into the business of memorizing which power transformation is more likely to work in
one set of circumstances or is otherwise recommended (but not guaranteed); rather, we want you to be
most comfortable and confident with the question of WHEN to use a power transformation.
The answer to WHEN to use a power transformation is WHEN the general pattern in the scatterplot (or
data) is monotonic, but a linear fit (i.e., a straight line) won’t do for the original data as is.
Recall the definition of monotonic from calculus or otherwise: a monotonic graph is one in which the
plotted points are either strictly increasing or strictly decreasing.
In this course, we will say that WHEN a scatterplot is generally (i.e., for the vast majority of points)
monotonic, then we should attempt a power transformation to model our data.
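If the data are stored in a program, this "generally monotonic" screening can be sketched as a quick check on the y-values once the points are sorted by x. The helper below is my own illustration, not a textbook procedure; it is applied to the datasets from Examples 3.12 and 3.13 that appear later in these notes:

```python
def is_monotonic(ys):
    """True if the sequence never increases or never decreases."""
    pairs = list(zip(ys, ys[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Moisture content from Example 3.12 (steadily decreasing in frying time):
print(is_monotonic([16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]))  # True

# Glucose from Example 3.13 (falls, then rises, so NOT monotonic):
print(is_monotonic([74, 54, 52, 51, 52, 53, 58, 71]))  # False
```

A strictly increasing or strictly decreasing result points toward a power transformation; a fall-then-rise (or rise-then-fall) pattern points toward the polynomial fit discussed later in this section.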
Example 3.12 Power Transformation p. 134-5
Consider the following dataset with respect to x = frying time (sec) and y = moisture content (%):
x:    5    10    15    20    25    30    45    60
y: 16.3   9.7   8.1   4.2   3.4   2.9   1.9   1.3
Cal Poly Learn By Doing
Note that the general pattern of this scatterplot is monotonic (note the definition given above); this
prompts us to attempt a power transformation to model our data.
Before doing so, however, I want you to note the R Studio output below with respect to attempting to
fit a linear model (i.e., straight line) and its accompanying r-squared value:
Example 3.12: Untransformed

Call:
lm(formula = moisture ~ fry, data = fm)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1762 -2.3895 -0.1576  0.8192  5.5610

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.85995    2.07897   5.705  0.00125 **
fry         -0.22419    0.06616  -3.389  0.01470 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.233 on 6 degrees of freedom
Multiple R-squared:  0.6568,    Adjusted R-squared:  0.5996
F-statistic: 11.48 on 1 and 6 DF,  p-value: 0.0147
With respect to the several transformations that we could attempt (and there are also several statistical
tests that can be employed beyond using Figure 3.16 in your text as a guide), we will provide the
transformation (or several possible transformations) for you. Then, based on the computer output, we
will leave it to you to determine which model is best. In this introductory statistics course, which
transformation is best will be judged by which model (among those we attempt based on our
established procedures) has the highest r2 value, and this general rule also holds up well in more
advanced modeling found beyond the scope of your textbook.
We display the power transformation of taking the natural log of both variables x and y below:
Note how a visual inspection now reveals a linear pattern in the scatterplot above.
The accompanying computer output is:
Example 3.12: Power Transformation (Ln x, Ln y)

Call:
lm(formula = lnmoisture ~ lnfry, data = lnfm)

Residuals:
     Min       1Q   Median       3Q      Max
-0.15863 -0.06523 -0.02129  0.01043  0.29473

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.6384    0.21105   21.98 5.80e-07 ***
lnfry        -1.0492    0.06786  -15.46 4.63e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1449 on 6 degrees of freedom
Multiple R-squared:  0.9755,    Adjusted R-squared:  0.9714
F-statistic: 239.1 on 1 and 6 DF,  p-value: 4.629e-06
Our regression equation now is:
ln 𝑦̂ = 4.6384 − 1.0492(ln 𝑥)
Note that the r2 value has improved to 0.9755 from 0.6568 in the previous (original) plot.
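Both R fits above can be reproduced from the ordinary least-squares formulas by hand. The sketch below (plain Python, no packages; the variable and function names are my own, not from the textbook) fits the raw data and the ln–ln data and compares the two r2 values:

```python
from math import log

def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for a simple linear fit."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxy ** 2 / (sxx * syy)

fry = [5, 10, 15, 20, 25, 30, 45, 60]
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]

# Untransformed fit versus the ln-ln (power transformation) fit:
a0, b0, r2_raw = least_squares(fry, moisture)
a1, b1, r2_ln = least_squares([log(x) for x in fry],
                              [log(y) for y in moisture])

print(round(a1, 4), round(b1, 4))         # ~4.6384 and ~-1.0492
print(round(r2_raw, 4), round(r2_ln, 4))  # ~0.6568 versus ~0.9755
```

The intercept and slope match the R coefficient table, and the jump in r2 from about 0.6568 to about 0.9755 is exactly the improvement noted above.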
All else equal, we again judge that a prediction model does a BETTER job of predicting if it has a
higher r2 value than any other examined model.
This transformed model predicts that when frying time, x, is 20 seconds, the predicted moisture
content, y, is 4.46%. This is found by:
ln 𝑦̂ = 4.6384 − 1.0492(ln 𝑥)
ln 𝑦̂ = 4.6384 − 1.0492(ln 20)
ln 𝑦̂ = 4.6384 − 1.0492(2.995732274)
ln 𝑦̂ = 4.6384 − 3.143122301
ln 𝑦̂ = 1.495277699
𝑒^(ln 𝑦̂) = 𝑒^1.495277699
𝑦̂ = 4.4606
Can you verify that when frying time is x = 27.5 seconds, the model predicts moisture content
percentage y to be 3.1936%?
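The back-transformation worked out above is easy to script. A minimal check of both predictions, using the rounded coefficients from the R output (so tiny rounding differences are expected), with a helper name of my own choosing:

```python
from math import exp, log

def predict_moisture(x, a=4.6384, b=-1.0492):
    """Invert the ln-ln model: ln(yhat) = a + b*ln(x), so yhat = e^(a + b*ln(x))."""
    return exp(a + b * log(x))

print(round(predict_moisture(20), 4))    # ~4.4606
print(round(predict_moisture(27.5), 4))  # ~3.1936
```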
--- End Example 3.12 ---
Fitting a Polynomial Function
If viewing the scatterplot of your original dataset, if there is both a LACK of linear fit (i.e., a straight line
would not do), and a LACK of monotonicity, then it is reasonable to transform your data by fitting a
quadratic function (i.e., polynomial) with the general form:
𝑦̂ = a + b1x + b2x2
The signs of the coefficients (a, b1, and b2) are given to us by our computer output (either R Studio or
Minitab, whichever is presented), in the same fashion that non-polynomial functions were.
Example 3.13 p.136 Lack of Monotonicity
Consider the following dataset with respect to x = fermentation (days) and y = glucose (g/L):
x:  1   2   3   4   5   6   7   8
y: 74  54  52  51  52  53  58  71
We note from our scatterplot that the data is NOT linear and is also NOT monotonic; therefore, we are
well served to attempt to fit a polynomial.
Before doing so, however, I want you to note the R Studio output below with respect to attempting to
fit a linear (i.e., straight line) model and this original data’s r-squared value:
Example 3.13: Untransformed

Call:
lm(formula = gc ~ ft, data = ftgc)

Residuals:
   Min     1Q Median     3Q    Max
-7.107 -6.089 -4.607  3.027 16.000

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.96429    7.70589   7.522 0.000286 ***
ft           0.03571    1.52599   0.023 0.982087
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.89 on 6 degrees of freedom
Multiple R-squared:  9.128e-05,    Adjusted R-squared:  -0.1666
F-statistic: 0.0005477 on 1 and 6 DF,  p-value: 0.9821
We note the r2 value is extremely low (r2 = 0.00009128, or < 0.001), another indication that a linear
model is not a good fit. (Again, all else equal, we want r2 to be as close to 1.0 as possible.)
OK – let’s fit a polynomial, for glucose concentration, gc, and fermenting time, time:
We note the fit appears to be pretty good (i.e., most points appear to fall on the quadratic curve, shown
in green). Let’s check our output to see if the value of r2 has improved.
Example 3.13: Transformation to Polynomial

Call:
lm(formula = gc ~ time + timesqd, data = TS)

Residuals:
      1       2       3       4       5       6       7       8
 3.6250 -5.8036 -0.7679  1.7321  2.6964  0.1250 -1.9821  0.3750

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.4821     4.9036  17.229 1.21e-05 ***
time        -15.8750     2.5001  -6.350  0.00143 **
timesqd       1.7679     0.2712   6.519  0.00127 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.515 on 5 degrees of freedom
Multiple R-squared:  0.8948,    Adjusted R-squared:  0.8527
F-statistic: 21.25 on 2 and 5 DF,  p-value: 0.003594
Our general form for a quadratic is:
𝑦̂ = a + b1x + b2x²
𝑦̂ = 84.482 − 15.875𝑥 + 1.7679𝑥²
Note that the r2 value has improved to 0.8948 from < 0.001 in the original (untransformed) fit.
Again, all else equal, we judge that a prediction model does a BETTER job of predicting if it has a
higher r2 value than any other examined model.
The model predicts that when time, x, is 4 days*, the glucose concentration, y, is 49.27; can you verify
this?
*The textbook has a slight typo; this should again read as ‘days’, as it correctly does on p. 113.
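R estimates the quadratic coefficients above by least squares. As a sketch of what lm is doing behind the scenes, the normal equations for 𝑦̂ = a + b1x + b2x² can be set up and solved directly; the code below is plain Python with my own variable names, and it also checks the prediction at x = 4 days:

```python
def solve3(M, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [row[:] + [b] for row, b in zip(M, v)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (A[r][3] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return x

time = [1, 2, 3, 4, 5, 6, 7, 8]
gc = [74, 54, 52, 51, 52, 53, 58, 71]

# Sums needed for the normal equations of yhat = a + b1*x + b2*x^2:
n = len(time)
sx = sum(time)
sx2 = sum(x ** 2 for x in time)
sx3 = sum(x ** 3 for x in time)
sx4 = sum(x ** 4 for x in time)
sy = sum(gc)
sxy = sum(x * y for x, y in zip(time, gc))
sx2y = sum(x * x * y for x, y in zip(time, gc))

a, b1, b2 = solve3([[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]],
                   [sy, sxy, sx2y])
print(round(a, 4), round(b1, 4), round(b2, 4))  # ~84.4821 -15.875 1.7679

# Predicted glucose concentration at x = 4 days:
print(round(a + b1 * 4 + b2 * 16, 2))  # ~49.27
```

The recovered coefficients match the R output, and the prediction at 4 days matches the 49.27 quoted in the notes.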
--- End Example 3.13 ---
Please be comfortable with, or otherwise PRACTICE:
Section 3.4
Exercises: 29a, 29b, 31b, 31c, 33a, 33b
Note also that the pages below have the scatterplots and computer output necessary to answer these
exercises. Such output will, of course, continue to be provided for you.
On scientific notation: 9.96e-03 is equivalent to: 0.00996 (i.e., you move the decimal to the left 3
places if you have e-03)
On scientific notation: 1.452e+01 is equivalent to 14.52 (i.e., you move the decimal to the right 1 place
if you have e+01)
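These conversions can be confirmed directly in any language that accepts scientific notation; in Python, for example:

```python
# e-03 moves the decimal three places to the left:
print(9.96e-03 == 0.00996)  # True

# e+01 moves the decimal one place to the right:
print(1.452e+01 == 14.52)   # True
```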
(For a greater sense of variety, I have changed the colors and plotting characters of these scatterplots;
we do want to be comfortable with the idea that bivariate data can be, and is, plotted using multiple
colors and multiple plotting symbols to represent datapoints. Furthermore, we want to be comfortable
dealing with scientific notation when it presents itself.)
Exercise 29: y versus x

Call:
lm(formula = yield ~ flow, data = fy)

Residuals:
    Min      1Q  Median      3Q     Max
-154.77 -128.99  -33.52    8.68  586.23

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -397.9      389.6  -1.021    0.334
flow        136496.6    90168.2   1.514    0.164

Residual standard error: 224.6 on 9 degrees of freedom
Multiple R-squared:  0.2029,    Adjusted R-squared:  0.1144
F-statistic: 2.292 on 1 and 9 DF,  p-value: 0.1644
Exercise 29: 1/y versus x

Call:
lm(formula = yield1 ~ flow, data = fy1)

Residuals:
      Min        1Q    Median        3Q       Max
-0.013273 -0.002082  0.001456  0.003474  0.010298

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.10454    0.01182   8.844 9.85e-06 ***
flow        -21.02141    2.73617  -7.683 3.05e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.006817 on 9 degrees of freedom
Multiple R-squared:  0.8677,    Adjusted R-squared:  0.853
F-statistic: 59.03 on 1 and 9 DF,  p-value: 3.054e-05
Exercise 31: y versus x

Call:
lm(formula = stramp ~ cycfail, data = CS)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0038061 -0.0020579 -0.0009746  0.0019876  0.0082848

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.906e-03  9.132e-04  10.848 4.64e-09 ***
cycfail     -4.934e-08  3.101e-08  -1.591     0.13
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003509 on 17 degrees of freedom
Multiple R-squared:  0.1297,    Adjusted R-squared:  0.07846
F-statistic: 2.532 on 1 and 17 DF,  p-value: 0.13
Exercise 31: y versus ln(x)

Call:
lm(formula = stramp ~ cycfaillnx, data = CS)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0032929 -0.0024824 -0.0003629  0.0021192  0.0044476

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0197092  0.0026331   7.485 8.92e-07 ***
cycfaillnx  -0.0012805  0.0003126  -4.096 0.000753 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.002668 on 17 degrees of freedom
Multiple R-squared:  0.4967,    Adjusted R-squared:  0.4671
F-statistic: 16.78 on 1 and 17 DF,  p-value: 0.0007534
Exercise 31: ln(y) versus ln(x)

Call:
lm(formula = stramplnx ~ cycfaillnx, data = CS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.37052 -0.23765 -0.01776  0.20089  0.43123

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.73722    0.26946 -13.869 1.07e-10 ***
cycfaillnx  -0.12395    0.03199  -3.874  0.00122 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2731 on 17 degrees of freedom
Multiple R-squared:  0.4689,    Adjusted R-squared:  0.4376
F-statistic: 15.01 on 1 and 17 DF,  p-value: 0.001218
Exercise 31: 1/y versus 1/x

Call:
lm(formula = istrampl ~ icycfail, data = CS)

Residuals:
    Min      1Q  Median      3Q     Max
-60.745 -17.602  -2.295  37.733  44.957

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   129.259      8.803  14.684 4.34e-11 ***
icycfail    -2154.336    946.543  -2.276   0.0361 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36.35 on 17 degrees of freedom
Multiple R-squared:  0.2336,    Adjusted R-squared:  0.1885
F-statistic: 5.18 on 1 and 17 DF,  p-value: 0.03607
Exercise 33

Call:
lm(formula = strength ~ thickness, data = TS)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8620 -2.2765  0.5266  2.4411  5.7708

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.294856   2.260001  12.962 1.44e-10 ***
thickness   -0.022422   0.004019  -5.579 2.70e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.098 on 18 degrees of freedom
Multiple R-squared:  0.6336,    Adjusted R-squared:  0.6132
F-statistic: 31.12 on 1 and 18 DF,  p-value: 2.701e-05
Exercise 33: Polynomial Function

Call:
lm(formula = strength ~ thickness + thicknesssqd, data = TS)

Residuals:
    Min      1Q  Median      3Q     Max
-5.6278 -2.2024  0.2857  2.4414  4.8790

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.452e+01  4.754e+00   3.055  0.00717 **
thickness     4.323e-02  1.981e-02   2.183  0.04337 *
thicknesssqd -6.001e-05  1.786e-05  -3.359  0.00372 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.269 on 17 degrees of freedom
Multiple R-squared:  0.7797,    Adjusted R-squared:  0.7538
F-statistic: 30.09 on 2 and 17 DF,  p-value: 2.599e-06
Our quadratic equation used to predict strength is:
𝑦̂ = 14.52 + 0.0432x − 0.00006x²
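To use this equation, plug a thickness value into the quadratic. The thickness of 250 below is purely an illustrative value of my own choosing, not from the exercise, and the coefficients are the rounded ones quoted above:

```python
def predicted_strength(x):
    """Exercise 33 quadratic: yhat = 14.52 + 0.0432x - 0.00006x^2 (rounded coefficients)."""
    return 14.52 + 0.0432 * x - 0.00006 * x ** 2

# Predicted strength at a hypothetical thickness of 250:
print(round(predicted_strength(250), 2))  # 21.57
```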
(As per Section 3.4, where Residual Plots were cited as optional, we are skipping this portion.)