Correlation and Simple Linear Regression in JMP

19 - Transformations in Simple Linear Regression
Example - Polychlorinated Biphenyl (PCB) Concentration and Age of
Rainbow Trout in Lake Cayuga (New York).
Data File: PCBtrout.JMP
In this experiment we are studying the relationship between age of trout and the PCB
concentration found in their tissues. The rainbow trout were all sampled from Lake
Cayuga in New York.
Sample correlation (r) between Age and PCB: r = 0.7364
We begin by examining the correlation between age and PCB concentration as well as a
scatterplot matrix. The correlation is r = 0.7364, which is a moderate positive correlation.
However, examination of the scatterplot matrix suggests that the relationship is not linear,
so the correlation is not an appropriate measure of the association between PCB
concentration and age. One approach to dealing with nonlinearity is to transform Y
and/or X using power transformations (e.g. square root, logarithm, etc.) in an attempt to
strengthen the linear association between Y and X. The Bulging Rule shown below
provides guidance on the types of transformations to use for the two variables.
Bulging Rule:
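The rule is easy to apply as a lookup, which can be sketched in Python as follows. This is only an illustration: the corner labels and the name `bulging_rule` are assumptions made for this sketch and are not JMP features. Each key names the corner of the scatterplot toward which the curve bulges.

```python
# Mosteller and Tukey's bulging rule written as a lookup table.
# "up" means raise the power of the variable (e.g. square, cube);
# "down" means lower it (e.g. square root, log, reciprocal).
BULGE_TO_MOVES = {
    "upper-left":  {"X": "down", "Y": "up"},
    "upper-right": {"X": "up",   "Y": "up"},
    "lower-left":  {"X": "down", "Y": "down"},
    "lower-right": {"X": "up",   "Y": "down"},
}

def bulging_rule(bulge):
    """Describe the suggested moves on the ladder of powers for X and Y."""
    moves = BULGE_TO_MOVES[bulge]
    return f"move X {moves['X']} and/or move Y {moves['Y']} the ladder of powers"

# A curve that rises ever more steeply bulges toward the lower right.
print(bulging_rule("lower-right"))
```

Either move alone may be enough to straighten the relationship, so it is usually worth trying one transformation at a time and re-checking the scatterplot.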
Here we see that the Bulging Rule suggests lowering the power on Y and/or raising the
power on X. Another consideration is the distribution of each variable, shown
in the histograms added to the borders of the scatter plot below (select Bivariate Fit … >
Histogram Borders). We can see that the PCB distribution is markedly skewed to the right,
so a log transformation should make it more nearly normal. When X and Y are both normally
distributed, it is often the case that the relationship between them is linear.
Thus we begin by taking the log base 10 of the PCB concentration. To do this, create a
new column and double-click at the top of the column. Select Formula from the New
Property pull-down menu and click Edit Formula. Then select Transcendental from
the right-hand menu (because the logarithm is a transcendental function) and click on
Log10 from the right menu. Finally, select the variable you wish to take the log of, which in
this case is PCB concentration. When you are finished, the expression in the calculator
window should read
Log10(PCB)
The relationship between age and log10(PCB) is summarized below.
Sample correlation (r) between Age and log10(PCB): r = 0.8552
We can see that the correlation has increased after transformation.
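To see why a log transformation can raise the correlation, the following Python sketch computes r before and after taking log base 10 of a response that grows exponentially. The ages and responses are made up for illustration and are not the values in PCBtrout.JMP; the helper `pearson_r` is likewise an assumption of this sketch.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data following an exact exponential trend (NOT the trout data).
age = list(range(1, 13))
y = [0.5 * 10 ** (0.12 * a) for a in age]

# Plays the same role as the Log10(PCB) formula column.
log10_y = [math.log10(v) for v in y]

r_raw = pearson_r(age, y)
r_log = pearson_r(age, log10_y)
print(f"r(age, y)       = {r_raw:.4f}")
print(f"r(age, log10 y) = {r_log:.4f}")
```

Because these artificial data are exactly exponential, the log scale makes them exactly linear. Real data, such as the trout data, will generally show a smaller improvement.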
Some curvature still appears to be present, however. The bend is such that the
Bulging Rule suggests either raising the power on Y or lowering the power on X. We
could consider raising the power on Y, which would imply that the log transformation of Y may
have been too strong. In this case, however, we will instead lower the power on X
by using the square root of age. Again we need to add a column that will
contain the results of a formula. To take the square root of a variable, simply select the
square root button and put the variable Age under the radical. The result should look like
this:
$\sqrt{\text{Age}}$
Again examination of the correlation and scatter plot matrix shows improvement in terms
of linearity.
Sample correlation (r) between log10(PCB) and sqrt(Age): r = 0.8866
At this point we should feel comfortable building a simple linear regression model for
these data. Select Analyze > Fit Y by X and put log10(PCB) in the Y box and
sqrt(Age) in the X box. To fit the regression line, select Fit Line from the Bivariate Fit
pull-down menu located above the scatter plot. The results are shown below.
The regression equation is:
$\hat{E}(\log_{10}(\text{PCB}) \mid \text{Age}) = \hat{\beta}_0 + \hat{\beta}_1 \sqrt{\text{Age}} = -0.519 + 0.521 \sqrt{\text{Age}}$
To use this equation to predict the PCB concentration for a fish that is, for example, 5 years old, we
take the square root of 5 and plug it into the regression equation. The predicted
log10 PCB concentration is:
-0.519 + 0.521*2.236 = 0.646 log10(ppm)
which corresponds to a PCB concentration of $10^{0.646} \approx 4.43$ ppm.
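The same calculation can be sketched in Python. The coefficients are the rounded values from the regression equation above, and the function names are assumptions made for this sketch. Because the coefficients are rounded, the last digit of the result may differ slightly from JMP's own predictions.

```python
import math

# Rounded coefficients of the fitted line log10(PCB) = B0 + B1 * sqrt(Age).
B0 = -0.519
B1 = 0.521

def predict_log10_pcb(age_years):
    """Predicted log10 PCB concentration (log10 ppm) at a given age."""
    return B0 + B1 * math.sqrt(age_years)

def predict_pcb_ppm(age_years):
    """Back-transform the log10-scale prediction to ppm."""
    return 10 ** predict_log10_pcb(age_years)

print(f"log10 scale: {predict_log10_pcb(5):.3f}")
print(f"ppm scale:   {predict_pcb_ppm(5):.2f}")
```

Note that the square root is applied to age before the coefficients are used, and the power of 10 is applied last, mirroring the order of the two transformations.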
The $R^2$ = 0.786, which implies that 78.6% of the variation in the log10 PCB concentration is
explained by the regression on the square root of the age of the trout.
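As a quick check, in simple linear regression $R^2$ equals the square of the sample correlation between the response and the predictor, so this value follows from the correlation reported earlier for log10(PCB) and sqrt(Age):

```latex
R^2 = r^2 = (0.8866)^2 \approx 0.786
```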
To examine a plot of the residuals versus $\sqrt{\text{Age}}$, select Plot Residuals from the Linear
Fit pull-down menu located beneath the scatter plot. The resulting plot is shown below.
No assumption violations are evident from this plot. To assess normality of the residuals,
first save the residuals from the fit by selecting Save Residuals from the Linear Fit pull-down
menu. This will save the residual values to the original spreadsheet. Then select
Analyze > Distribution to examine the distribution of the residuals and obtain a normal
quantile plot. The results are shown below.
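For readers who want to see what these two steps compute, here is a minimal Python sketch that fits a least-squares line, saves the residuals, and pairs the ordered residuals with standard normal quantiles. The data are made up for illustration (they are not the PCBtrout.JMP values), and the plotting positions (i - 0.5)/n are one common convention assumed here; JMP may use a different one.

```python
import math
from statistics import NormalDist, mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Made-up illustration data (NOT the trout data).
age = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
log10_pcb = [0.05, 0.20, 0.41, 0.50, 0.68, 0.72, 0.90, 0.93, 1.08, 1.10]
x = [math.sqrt(a) for a in age]

b0, b1 = fit_line(x, log10_pcb)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, log10_pcb)]

# Normal quantile plot coordinates: normal quantile vs. ordered residual.
n = len(residuals)
normal_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
for q, r in zip(normal_q, sorted(residuals)):
    print(f"{q:7.3f} {r:8.4f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted.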
The residuals appear to be slightly kurtotic, but the departure from normality is not serious. To obtain prediction and
confidence intervals we need to fit the regression model using the Fit Model option from
the Analyze menu. Put log10(PCB) in the Y box and sqrt(Age) in the Model Effects
box. From the Fit Model results select Save Columns > Prediction Formula, Mean
Confidence Interval (CI for E(Y|X)), and Indiv Confidence Interval (CI for the Y value
of an individual with X = x). To obtain these CIs in the original scale (ppm), add three
columns to the spreadsheet, each of which will hold the result of a formula. Then use the JMP
calculator and the function $10^x$ to create the following formulas:
$10^{\text{Pred Formula log10(PCB)}}$, $10^{\text{Lower 95\% Mean log10(PCB)}}$, and $10^{\text{Upper 95\% Mean log10(PCB)}}$
This will transform the endpoints of the confidence interval for the mean back to the
original ppm scale. Similar formulas could be used to convert the endpoints of the
confidence interval for prediction of a single individual to the original scale.
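The back-transformation itself is a single power of 10 applied to each saved value, as the following Python sketch illustrates. The three log10-scale numbers are hypothetical placeholders, not JMP output, and `back_transform` is a name chosen for this sketch.

```python
def back_transform(log10_value):
    """Convert a value from the log10(ppm) scale back to ppm."""
    return 10 ** log10_value

# Hypothetical log10-scale results for one fish (placeholders only).
pred_log10 = 0.65
lower_log10 = 0.58
upper_log10 = 0.72

pred_ppm = back_transform(pred_log10)
lower_ppm = back_transform(lower_log10)
upper_ppm = back_transform(upper_log10)

print(f"estimate: {pred_ppm:.2f} ppm")
print(f"interval: ({lower_ppm:.2f}, {upper_ppm:.2f}) ppm")
```

Because $10^x$ is an increasing function, the endpoints keep their order after conversion. The interval is symmetric about the estimate on the log10 scale but not on the ppm scale, where the upper endpoint lies farther from the estimate than the lower one.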
A portion of the data spread sheet containing these additional columns is shown below.
Plot of PCB vs. Age with Estimated Mean and CI’s Added
You can use Graph > Overlay Plot to graph the original data points (PCB), the
predicted values (Pred Orig), the lower confidence limit (Mean Lower), and the upper
confidence limit (Mean Upper) in the same plot by placing these quantities in the Y box
and either sqrt(Age) or Age in the X box. The plot above shows the results using Age for
the X-axis. The Connect Points option has been selected and the Show Points option
has been unselected for the predicted values and the confidence bands (right-click on the
legend name for each of these quantities to obtain a menu from which these options can
be specified).
Example 2 ~ Mercury Levels Found in Sand Point Walleyes
Data File: Walleyes Sand Point
The variables in this data file are:
LTGHIN = length of walleye (in.)
HGPPM = mercury concentration found in fillet (ppm)
Log10(Hg) = log base 10 of the mercury concentration (log 10 ppm)
We begin by constructing a plot of mercury level (ppm) vs. length (in.) and adding a
smoothing spline as a preliminary estimate of E(Hg|Length). Clearly
E(Hg|Length) cannot be adequately modeled using a linear function of X.
Applying the Bulging Rule:
After log transforming mercury level we have the following…
Fitting the regression model with log transformed response we obtain these results.
INTERPRETATION OF THE MODEL WITH THE LOG RESPONSE
Exponential model:
$E(Y \mid X) = \beta_0 \, 10^{\beta_1 X}$. Here we have used a base of 10, although the choice of base is arbitrary.
Taking the log base 10 of the response gives
$E(\log_{10} Y \mid X) = \log_{10}(\beta_0) + \beta_1 X = \alpha + \beta_1 X$, where $\alpha = \log_{10}(\beta_0)$.
If the distribution of $\log_{10} Y$ is symmetric (e.g. normal), then we can also view this as an
approximate relationship for the median, i.e.
$\text{Med}(\log_{10} Y \mid X) \approx \alpha + \beta_1 X$,
which corresponds to $\text{Med}(Y \mid X) \approx 10^{\alpha + \beta_1 X}$ in the original scale.
How are the coefficients, $\alpha$ and $\beta_1$, in this model interpreted? As always, the y-intercept
($\alpha$) is the mean value of the (log-transformed) response when X = 0. The slope, however, has a more interesting
interpretation.
With exponential trends, if we change X by 1 unit, the resulting change in Y is
interpreted as a percentage change in the median of Y. The percentage change in the
response is the same for all values of the explanatory variable X.
To see this consider the following:
$\log_{10} Y = \alpha + \beta_1 X \;\Rightarrow\; Y = 10^{\alpha + \beta_1 X} = 10^{\alpha} \, 10^{\beta_1 X} = k \cdot 10^{\beta_1 X}$, where $k = 10^{\alpha}$.
Now if we increase X by 1 unit we have
$Y = k \cdot 10^{\beta_1 (X + 1)} = k \cdot 10^{\beta_1 X} \, 10^{\beta_1}$
Thus for exponential models where Y has been transformed to the log base 10 scale we
have that…
• if X increases by 1 unit, Y gets multiplied by $10^{\beta_1}$, a $100 \times (10^{\beta_1} - 1)$% increase
• if X increases by w units, Y gets multiplied by $10^{\beta_1 w}$, a $100 \times (10^{\beta_1 w} - 1)$% increase
• Again, if $\log_{10} Y$ is approximately symmetric then these can be thought of as
percent increases in the median of Y given X in the original scale.
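The multiplier and percent-increase rules above can be checked numerically with a short Python sketch. The coefficients here are hypothetical, chosen so that Y doubles with every unit of X, and the function names are assumptions of this sketch.

```python
import math

def multiplier(beta1, w=1.0):
    """Factor by which the median of Y is multiplied when X increases by w units."""
    return 10 ** (beta1 * w)

def percent_increase(beta1, w=1.0):
    """Percent change in the median of Y for a w-unit increase in X."""
    return 100 * (multiplier(beta1, w) - 1)

# Hypothetical coefficients: a slope of log10(2) doubles Y per unit of X.
alpha = 0.3
beta1 = math.log10(2)

def median_y(x):
    """Back-transformed fitted value k * 10**(beta1 * x), with k = 10**alpha."""
    return 10 ** alpha * 10 ** (beta1 * x)

# The ratio for a one-unit step is the same wherever the step starts.
ratios = [median_y(x + 1) / median_y(x) for x in (0, 5, 20)]
print(ratios)
print(percent_increase(beta1))       # one-unit increase
print(percent_increase(beta1, w=3))  # three-unit increase
```

The constant ratio is what distinguishes the exponential model from a straight-line model, in which a one-unit step adds a constant amount rather than multiplying by a constant factor.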
EXAMPLE 2: Mercury Levels in Sand Point Walleyes and Length (cont’d)
Here the estimated regression is given by:
$\hat{E}(\log_{10} Y \mid X) = \hat{\alpha} + \hat{\beta}_1 X = -1.066535 + 0.0556892 X$
For a one inch increase in walleye length we estimate that the mean log10(Hg) level will
increase by 0.0556892 log10 ppm. Converting to the original scale, we have a
multiplicative increase of $10^{0.0556892} \approx 1.1368$, i.e. the median mercury level is multiplied by an
estimated 1.1368 per inch, a 13.68% increase for each one inch increase in
length.
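For a 5 inch increase we estimate a multiplicative increase of $10^{5 \times 0.0556892} = 10^{0.278446} \approx 1.899$, i.e. an 89.9% increase in the median mercury level.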
Confidence Interval for $\beta_1$
We can find a 95% CI for the percent increase by first finding a 95% CI for $\beta_1$ and then
converting the endpoints to obtain a confidence interval for $10^{\beta_1}$.
A 95% CI for $\beta_1$ is given by
$\hat{\beta}_1 \pm (t\text{-table value}) \cdot SE(\hat{\beta}_1)$, where the df for the t-table value = n - 2.
Estimate and SE’s from JMP
t-Quantile Calculator from JMP
$0.0556892 \pm (2.056)(0.00379) = (0.0479, 0.0635)$; applying $10^x$ to each endpoint gives $(1.1166, 1.1574)$.
Thus we estimate with 95% confidence the percent increase in the median mercury level
associated with a one inch increase in length is between 11.66% and 15.74%.
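The interval calculation can be reproduced with the short Python sketch below. The slope, standard error, and t-table value are the ones quoted in the text; the variable and function names are assumptions made for this sketch.

```python
slope = 0.0556892    # estimated slope from the JMP output
se_slope = 0.00379   # its standard error
t_value = 2.056      # t-table value quoted in the text (df = n - 2)

def to_percent(b):
    """Convert a log10-scale slope into a percent increase per unit of X."""
    return 100 * (10 ** b - 1)

# 95% CI for the slope on the log10 scale.
lower = slope - t_value * se_slope
upper = slope + t_value * se_slope

print(f"95% CI for slope:            ({lower:.4f}, {upper:.4f})")
print(f"95% CI for multiplier:       ({10 ** lower:.4f}, {10 ** upper:.4f})")
print(f"95% CI for percent increase: ({to_percent(lower):.2f}%, {to_percent(upper):.2f}%)")
```

The endpoints are converted only after the interval has been formed on the log10 scale, which is why the resulting percent interval is not symmetric about the point estimate.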
Using the Fit Model approach we can again save predicted values, confidence intervals,
and prediction intervals to the spreadsheet and convert them back to the original scale
(again using $10^x$).
The raw data, Median(Y|X), and the CI for Median(Y|X) can then be visualized using
Graph > Overlay Plot, as in the PCB/trout example discussed above
(see pg. 161).