19 - Transformations in Simple Linear Regression

Example 1 ~ Polychlorinated Biphenyl (PCB) Concentration and Age of Rainbow Trout in Lake Cayuga (New York)

Data File: PCBtrout.JMP

In this experiment we study the relationship between the age of trout and the PCB concentration found in their tissues. The rainbow trout were all sampled from Lake Cayuga in New York.

Sample Correlation (r), Age and PCB: r = 0.7364

We begin by examining the correlation between age and PCB concentration as well as a scatterplot matrix. The correlation is r = 0.7364, a moderate positive correlation. However, the scatterplot matrix suggests that the relationship is not linear, so the correlation is not an appropriate measure of the association between PCB concentration and age.

One approach to dealing with nonlinearity is to transform Y and/or X using power transformations (e.g. square root, logarithm) in an attempt to strengthen the linear association between Y and X. The Bulging Rule shown below provides guidance on which transformations to try for the two variables.

[Figure: Bulging Rule]

Here the Bulging Rule suggests lowering the power on Y and/or raising the power on X. Another consideration is the distribution of each variable, shown in the histograms added to the borders of the scatterplot below (select Bivariate Fit … > Histogram Borders). The PCB distribution is markedly skewed to the right, so a log transformation would improve normality. When X and Y are both normally distributed, it is often the case that the relationship between them is linear. Thus we begin by taking the log base 10 of the PCB concentration.

To do this, create a new column and double-click at the top of the column. Select Formula from the New Property pull-down menu and click Edit Formula. Then select Transcendental from the right-hand menu (the logarithm is a transcendental function) and click on Log 10.
Finally, select the variable you wish to take the log of, which in this case is the PCB concentration. When you are finished, the expression in the calculator window should read Log10(PCB).

The relationship between age and log10(PCB) is summarized below.

Sample Correlation (r), Age and log10(PCB): r = 0.8552

We can see that the correlation has increased after the transformation. However, some curvature still appears to be present. The bend is such that the Bulging Rule suggests either raising the power on Y or lowering the power on X. We could raise the power on Y, which would imply that the log transformation of Y was too strong. In this case, however, we will lower the power on X and consider the square root of age. Again we need to add a column that will contain the results of a formula. To take the square root of a variable, simply select the square root button and put the variable Age under the radical. The result should look like this: $\sqrt{Age}$

Again, examination of the correlation and scatterplot matrix shows improvement in terms of linearity.

Sample Correlation (r), log10(PCB) and sqrt(Age): r = 0.8866

At this point we should feel comfortable building a simple linear regression model for these data. Select Analyze > Fit Y by X and put log10(PCB) in the Y box and sqrt(Age) in the X box. To fit the regression line, select Fit Line from the Bivariate Fit pull-down menu located above the scatterplot. The results are shown below.

The regression equation is:

$\hat{E}(\log_{10}(PCB) \mid Age) = \hat{\beta}_0 + \hat{\beta}_1 \sqrt{Age} = -0.519 + 0.521\sqrt{Age}$

To use this equation to predict the PCB concentration for a fish that is, e.g., 5 years old, we take the square root of 5 and plug it into the regression equation. The predicted log10 PCB concentration is

$-0.519 + 0.521(2.236) = 0.646$ log10(ppm),

which corresponds to a PCB concentration of $10^{0.646} \approx 4.43$ ppm.
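The worked prediction above can be checked with a few lines of code. The following is a minimal Python sketch and is not part of the JMP workflow: the function names are hypothetical, and the coefficients are the rounded values quoted in the text, so the result agrees with the hand calculation only up to rounding.

```python
import math

# Fitted coefficients as quoted in the text (rounded to three decimals).
B0 = -0.519  # intercept on the log10(PCB) scale
B1 = 0.521   # slope for sqrt(Age)


def predict_log10_pcb(age):
    """Predicted mean log10(PCB) for a trout of the given age (years)."""
    return B0 + B1 * math.sqrt(age)


def predict_pcb_ppm(age):
    """Back-transform the log10-scale prediction to ppm using 10**x."""
    return 10 ** predict_log10_pcb(age)


age = 5
print(round(predict_log10_pcb(age), 3))  # about 0.646 log10(ppm)
print(round(predict_pcb_ppm(age), 2))    # about 4.43 ppm
```

Note that the two transformations are applied in opposite directions: the square root is applied to the input (age) before using the equation, and 10**x is applied to the output to return to ppm.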
$R^2 = 0.786$, which implies that 78.6% of the variation in the log10 PCB concentration is explained by the regression on the square root of the age of the trout.

To examine a plot of the residuals versus $\sqrt{Age}$, select Plot Residuals from the Linear Fit pull-down menu located beneath the scatterplot. The resulting plot is shown below. No assumption violations are evident from this plot.

To assess normality of the residuals, first save the residuals from the fit by selecting Save Residuals from the Linear Fit pull-down menu. This will save the residual values to the original spreadsheet. Then select Analyze > Distribution to examine the distribution of the residuals and obtain a normal quantile plot. The results are shown below. The residuals appear to be slightly kurtotic, but not seriously so.

To obtain prediction and confidence intervals we need to fit the regression model using the Fit Model option from the Analyze menu. Put log10(PCB) in the Y box and sqrt(Age) in the Model Effects box. From the Fit Model results select Save Columns > Prediction Formula, Mean Confidence Interval (CI for E(Y|X)), and Indiv Confidence Interval (CI for the Y value of an individual with X = x).

To obtain these CI's in the original scale (ppm), add three columns to the spreadsheet, each of which will hold the result of a formula. Then use the JMP calculator and the function $10^x$ to create the following formulas:

$10^{\text{Pred Formula log10(PCB)}}$, $10^{\text{Lower 95\% Mean log10(PCB)}}$, and $10^{\text{Upper 95\% Mean log10(PCB)}}$

This will transform the predicted value and the endpoints of the confidence interval for the mean back to the original ppm scale. Similar formulas can be used to convert the endpoints of the interval for prediction of a single individual to the original scale. A portion of the data spreadsheet containing these additional columns is shown below.

Plot of PCB vs.
Age with Estimated Mean and CI's Added

You can use Graph > Overlay Plot to graph the original data points (PCB), the predicted values (Pred Orig), the lower confidence limit (Mean Lower), and the upper confidence limit (Mean Upper) in the same plot by placing these quantities in the Y box and either sqrt(Age) or Age in the X box. The plot above shows the results using Age for the X-axis. The Connect Points option has been selected and the Show Points option has been unselected for the predicted values and the confidence bands (right-click on the legend name for each of these quantities to obtain a menu from which these options can be specified).

Example 2 ~ Mercury Levels Found in Sand Point Walleyes

Data File: Walleyes Sand Point

The variables in this data file are:
LTGHIN = length of walleye (in.)
HGPPM = mercury concentration found in fillet (ppm)
Log10(Hg) = log base 10 of the mercury concentration (log10 ppm)

We begin by constructing a plot of mercury level (ppm) vs. length (in.) and adding a smoothing spline as a preliminary estimate of E(Hg|Length). Clearly E(Hg|Length) could not be adequately modeled using a linear function of X.

Applying the Bulging Rule: after log-transforming the mercury level we have the following…

Fitting the regression model with the log-transformed response we obtain these results.

INTERPRETATION OF THE MODEL WITH THE LOG RESPONSE

Exponential model: $E(Y \mid X) = \beta_0 \, 10^{\beta_1 X}$

Here we have used a base of 10, although the base used is arbitrary. Taking the log base 10 of the response gives

$E(\log_{10} Y \mid X) = \log_{10}(\beta_0) + \beta_1 X = \beta_0' + \beta_1 X$, where $\beta_0' = \log_{10}(\beta_0)$.

If the distribution of $\log_{10} Y$ is symmetric (e.g. normal), then we can also view this as an approximate relationship for the median in the original scale, i.e.

$Med(\log_{10} Y \mid X) = \beta_0' + \beta_1 X$, so that $Med(Y \mid X) = 10^{\beta_0' + \beta_1 X}$.

How are the coefficients, $\beta_0'$ and $\beta_1$, in this model interpreted? As always, the y-intercept ($\beta_0'$) is the mean value of the response (here $\log_{10} Y$) when X = 0. The slope, however, has a more interesting interpretation.
With exponential trends, a 1-unit change in X produces a change in Y that is interpreted as a percentage change in the median of Y. The percentage change in the response is the same for all values of the explanatory variable X. To see this, consider the following, where $\beta_0'$ denotes the intercept on the log10 scale:

$\log_{10} Y = \beta_0' + \beta_1 X \;\Rightarrow\; Y = 10^{\beta_0' + \beta_1 X} = 10^{\beta_0'} \, 10^{\beta_1 X} = k \cdot 10^{\beta_1 X}$

Now if we increase X by 1 unit we have

$Y = k \cdot 10^{\beta_1 (X + 1)} = k \cdot 10^{\beta_1 X} \cdot 10^{\beta_1}$

Thus for exponential models where Y has been transformed to the log base 10 scale we have that:

• if X increases by 1 unit, Y gets multiplied by $10^{\beta_1}$, a $100 \times (10^{\beta_1} - 1)$% increase
• if X increases by w units, Y gets multiplied by $10^{\beta_1 w}$, a $100 \times (10^{\beta_1 w} - 1)$% increase
• again, if $\log_{10} Y$ is approximately symmetric, these can be thought of as percent increases in the median of Y given X in the original scale.

EXAMPLE 2: Mercury Levels in Sand Point Walleyes and Length (cont'd)

Here the estimated regression is given by:

$\hat{E}(\log_{10} Y \mid X) = \hat{\beta}_0' + \hat{\beta}_1 X = -1.066535 + 0.0556892 X$

For a one-inch increase in walleye length we estimate that the mean log10(Hg) level will increase by 0.0556892 log10 ppm. Converting to the original scale, we have a multiplicative increase of $10^{0.0556892} = 1.1368$, i.e. the median mercury level is multiplied by an estimated 1.1368 per inch, a 13.68% increase for each one-inch increase in length. For a 5 inch increase we estimate a multiplicative increase of $10^{5(0.0556892)} = 10^{0.278446} \approx 1.899$, i.e. an 89.9% increase.

Confidence Interval for the Percent Increase

We can find a 95% CI for the percent increase by first finding a 95% CI for $\beta_1$ and then converting the endpoints to obtain a confidence interval for $10^{\beta_1}$. The 95% CI for $\beta_1$ is given by

$\hat{\beta}_1 \pm (t\text{-table value}) \cdot SE(\hat{\beta}_1)$

where the df for the t-table value = n - 2.

[Figure: estimates and SE's from JMP]
[Figure: t-quantile calculator from JMP]

$0.0556892 \pm (2.056)(0.00379) = (0.0479, 0.0635)$

Applying $10^x$ to the endpoints gives (1.1166, 1.1574).

Thus we estimate with 95% confidence that the percent increase in the median mercury level associated with a one-inch increase in length is between 11.66% and 15.74%.
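The interval calculation above can be reproduced numerically. The following Python sketch is only an illustration: the function names are hypothetical, and the slope, standard error, and t-table value are simply the numbers quoted in the text. The point estimate comes out at roughly a 13.7% increase per inch.

```python
# Values quoted in the text for the walleye example.
B1 = 0.0556892   # estimated slope (log10 ppm per inch)
SE_B1 = 0.00379  # standard error of the slope
T_CRIT = 2.056   # t-table value used in the text (df = n - 2)


def multiplier(b1, w=1.0):
    """Factor by which the median of Y is multiplied when X increases by w."""
    return 10 ** (b1 * w)


def percent_increase(b1, w=1.0):
    """Percent change in the median of Y for a w-unit increase in X."""
    return 100 * (multiplier(b1, w) - 1)


def slope_ci(b1, se, t_crit):
    """Confidence interval for the slope: estimate +/- t * SE."""
    half_width = t_crit * se
    return b1 - half_width, b1 + half_width


lo, hi = slope_ci(B1, SE_B1, T_CRIT)
print(round(lo, 4), round(hi, 4))                    # 0.0479 0.0635
print(round(multiplier(lo), 4), round(multiplier(hi), 4))  # 1.1166 1.1574
print(round(percent_increase(lo), 2), round(percent_increase(hi), 2))  # 11.66 15.74
```

Because $10^x$ is an increasing function, transforming the two endpoints of the interval for the slope gives a valid interval for the multiplier; the same functions accept a different `w` for an increase of more than one unit.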
Using the Fit Model approach we can again save predicted values, confidence intervals, and prediction intervals to the spreadsheet and convert them back to the original scale (again using $10^x$). The raw data, Median(Y|X), and the CI for Median(Y|X) can then be visualized using Graph > Overlay Plot, as in the PCB/trout example discussed above.
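For readers working outside JMP, the save-and-back-transform step can be sketched in Python. Everything in this block is an assumption made for illustration: the data are made-up numbers (not the walleye data file), the function and variable names are hypothetical, and the t-table value 2.447 is the standard 95% value for df = 6, which matches the eight made-up points. The formulas are the usual simple linear regression intervals for the mean and for an individual, computed on the log10 scale and then converted with 10**x.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y on x; returns the quantities needed for intervals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    return {"b0": b0, "b1": b1, "s": s, "n": n, "xbar": xbar, "sxx": sxx}


def intervals_log_scale(fit, x0, t_crit):
    """Prediction, CI for the mean, and interval for an individual at x0."""
    pred = fit["b0"] + fit["b1"] * x0
    lever = 1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"]
    half_mean = t_crit * fit["s"] * math.sqrt(lever)
    half_indiv = t_crit * fit["s"] * math.sqrt(1 + lever)
    mean_ci = (pred - half_mean, pred + half_mean)
    indiv_ci = (pred - half_indiv, pred + half_indiv)
    return pred, mean_ci, indiv_ci


def back_transform(value):
    """Convert a log10-scale value back to the original scale."""
    return 10 ** value


# Made-up illustrative data: lengths (in.) and mercury levels (ppm).
length_in = [10, 12, 14, 16, 18, 20, 22, 24]
hg_ppm = [0.30, 0.42, 0.50, 0.71, 0.85, 1.15, 1.40, 1.95]
log_hg = [math.log10(v) for v in hg_ppm]

fit = fit_line(length_in, log_hg)
T_CRIT = 2.447  # 95% t-table value for df = n - 2 = 6
pred, mean_ci, indiv_ci = intervals_log_scale(fit, 17, T_CRIT)

print("estimated median (ppm):", back_transform(pred))
print("CI for median (ppm):", tuple(back_transform(v) for v in mean_ci))
print("interval for an individual (ppm):", tuple(back_transform(v) for v in indiv_ci))
```

The back-transformed intervals are not symmetric about the estimate in ppm, which is expected: they are symmetric on the log10 scale, where the model was fitted.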