Transforming the data
Modified from: Gotelli and Ellison 2004, Chapter 8; Sokal and Rohlf 2000, Chapter 13

What is a transformation?
It is a mathematical function that is applied to all the observations of a given variable:

Y* = f(Y)

• Y represents the original variable, Y* is the transformed variable, and f is a mathematical function that is applied to the data

Most transformations are monotonic:
• Monotonic functions do not change the rank order of the data, but they do change their relative spacing, and therefore affect the variance and shape of the probability distribution

There are two legitimate reasons to transform your data before analysis:
• The patterns in the transformed data may be easier to understand and communicate than patterns in the raw data
• The transformation may be necessary for the analysis to be valid

Transformations are often useful for converting curves into straight lines. The logarithmic function is very useful when two variables are related to each other by multiplicative or exponential functions.

Logarithmic (X):  Y = β₀ + β₁ log(X)

[Figure: y = ln(x) plotted against x (a curve) and against log(x) (a straight line)]

Example: Asi's growth (50% each year)

Year  Weight (g)
  1      10.0
  2      15.0
  3      22.5
  4      33.8
  5      50.6
  6      75.9
  7     113.9
  8     170.9
  9     256.3
 10     384.4
 11     576.7
 12     865.0

Exponential:  Y = β₀e^(β₁X), so ln(Y) = ln(β₀) + β₁X (see the R sketch at the end of this section)

[Figure: weight (g) against year on a linear scale, with fitted curve y = 6.6667e^(0.4055x), and on a logarithmic scale, where the relationship is a straight line]

Example: Species richness in the Galapagos Islands

Power:  Y = β₀X^(β₁), so log(Y) = log(β₀) + β₁ log(X)

[Figure: number of species against island area on linear axes (a curve) and on log-log axes (a straight line), with fitted power functions]

Statistics and transformation
Data to be analyzed using analysis of variance must meet two assumptions:
• The data must be homoscedastic: the variances of the treatment groups need to be approximately equal
• The residuals, or deviations from the mean, must be normal random variables

Let's look at an example
• A single variate of the simplest type of ANOVA (completely randomized, single classification) decomposes as follows:

Yᵢⱼ = μ + αᵢ + εᵢⱼ

• In this model the components are additive, with the error term εᵢⱼ distributed normally

However…
• We might encounter a situation in which the components are multiplicative in effect, where

Yᵢⱼ = μ αᵢ εᵢⱼ

• If we fitted a standard ANOVA model, the observed deviations from the group means would lack normality and homoscedasticity

The logarithmic transformation
• We can correct this situation by transforming our model into logarithms:

Y* = log(Y)

• Wherever the mean is positively correlated with the variance, the logarithmic transformation is likely to remedy the situation and make the variance independent of the mean
• We would obtain

log(Yᵢⱼ) = log(μ) + log(αᵢ) + log(εᵢⱼ)

which is additive and homoscedastic

The square root transformation
• It is used most frequently with count data. Such data are likely to be Poisson distributed rather than normally distributed. In the Poisson distribution the variance is the same as the mean. Transforming the variates to square roots generally makes the variances independent of the means for these types of data. When the counts include zero values, it is desirable to code all variates by adding 0.5, i.e., Y* = √(Y + 0.5)
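To make these transformations concrete, here is a minimal R sketch using the growth table above; the object names (asi, fit, counts) are illustrative and not from the original slides.

asi <- data.frame(
  year   = 1:12,
  weight = c(10.0, 15.0, 22.5, 33.8, 50.6, 75.9,
             113.9, 170.9, 256.3, 384.4, 576.7, 865.0)
)

# Exponential growth becomes linear on the log scale:
# ln(Y) = ln(beta0) + beta1 * X
fit <- lm(log(weight) ~ year, data = asi)
exp(coef(fit)[1])  # estimate of beta0, close to 6.6667
coef(fit)[2]       # estimate of beta1, close to ln(1.5) = 0.4055

# For counts that include zeros, the square root transformation
# described above is applied as sqrt(Y + 0.5):
counts <- c(0, 1, 3, 8, 20)
sqrt(counts + 0.5)

The fitted intercept and slope recover the parameters of the curve y = 6.6667e^(0.4055x) shown in the figure above.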
The Box-Cox transformation
• Often one does not have an a priori reason for selecting a specific transformation
• Box and Cox (1964) developed a procedure for estimating the best transformation to normality within the family of power transformations:

Y* = (Y^λ − 1)/λ   for λ ≠ 0
Y* = log(Y)        for λ = 0

• The value of λ that maximizes the log-likelihood function

L = −(ν/2) ln(s²_T) + (λ − 1)(ν/n) Σ ln(Y)

yields the best transformation to normality within this family of transformations. Here s²_T is the variance of the transformed values (based on ν degrees of freedom), and the second term involves the sum of the natural logarithms of the untransformed values.

Box-Cox in R (for a vector of data Y):

> library(MASS)
> lamb <- seq(0, 2.5, 0.5)
> boxcox(Y ~ 1, lambda = lamb, plotit = TRUE)
> library(car)
> transform_Y <- box.cox(Y, lamb)   # in newer versions of car, box.cox() has been replaced by bcPower()

[Figure: profile of the log-likelihood against λ, with the 95% confidence interval for λ marked]
What do you conclude from this plot?

Read more in Sokal and Rohlf 2000, page 417.

The arcsine transformation
• Also known as the angular transformation
• It is especially appropriate for percentages:

Y* = arcsin(√Y)

• It is appropriate only for data expressed as proportions

[Figure: transformed data against the original proportions, comparing the linear (identity) and arcsine transformations]

Since the transformations discussed are NON-LINEAR, confidence limits computed in the transformed scale and changed back to the original scale will be asymmetrical.

Evaluating Ecological Responses to Hydrologic Changes in a Payment-for-Environmental-Services Program on Florida Ranchlands
Patrick Bohlen, Elizabeth Boughton, John Fauth, David Jenkins, Pedro Quintana-Ascencio, Sanjay Shukla and Hilary Swain

[Photo G08K10487: Palaez Ranch wetland water retention]

[Figure: histogram of pointdata$mosqct (mosquito counts, dominated by zeros) and histogram of pointdata$mosqct[pointdata$mosqct > 0] (the nonzero counts, strongly right-skewed)]

Model assuming Gaussian errors

Call:
glm(formula = mosqct ~ depth + depth^2, data = pointdata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
  -28.1    -27.4    -25.5    -19.5   6388.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.40214   42.98439   0.451    0.652
depth         1.11973    4.07803   0.275    0.784
depth^2      -0.03597    0.07907  -0.455    0.649

(Dispersion parameter for gaussian family taken to be 117582.7)

    Null deviance: 77787101  on 663  degrees of freedom
Residual deviance: 77722147  on 661  degrees of freedom
AIC: 9641.5

Number of Fisher Scoring iterations: 2

[Figure: pointdata$mosqct against pointdata$depth under the model assuming Gaussian errors, and the residuals of the model against the fitted values (Y_hats)]
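A minimal sketch of this Gaussian fit, assuming a data frame pointdata with columns mosqct and depth as in the output above; note that in R's formula syntax a quadratic term must be written I(depth^2) (or supplied as a precomputed column), and the object name fit_gauss is ours.

# Quadratic Gaussian model for the mosquito counts
fit_gauss <- glm(mosqct ~ depth + I(depth^2),
                 family = gaussian, data = pointdata)
summary(fit_gauss)

# Residuals against fitted values: for zero-heavy, right-skewed counts,
# the Gaussian assumption produces skewed, heteroscedastic residuals
plot(fitted(fit_gauss), residuals(fit_gauss),
     xlab = "Y_hats", ylab = "residuals")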
Zero-inflated models

Call:
zeroinfl(formula = mosqct ~ depth + depth^2, data = pointdata, dist = "poisson", EM = TRUE)

Pearson residuals:
       Min         1Q     Median         3Q        Max
-6.765e-01 -5.630e-01 -5.316e-01 -4.768e-01  9.393e+05

Count model coefficients (poisson with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.8550639  0.0498470   37.22   <2e-16 ***
depth        0.4364981  0.0064139   68.06   <2e-16 ***
depth^2     -0.0139134  0.0001914  -72.68   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.6400910  0.3274469   1.955  0.05061 .
depth        0.0846763  0.0371673   2.278  0.02271 *
depth^2     -0.0027798  0.0009356  -2.971  0.00297 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 1
Log-likelihood: -4.728e+04 on 6 Df

[Figure: pointdata$mosqct against pointdata$depth under the zero-inflated Poisson model, and the residuals of the model assuming Poisson errors for counts (residuals_p) against the fitted values (Y_hats_poisson)]
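A minimal sketch of how such a zero-inflated fit can be reproduced with the pscl package, under the same assumptions as before (data frame pointdata with columns mosqct and depth; quadratic term written I(depth^2); object name fit_zip is ours).

library(pscl)

# Count component: Poisson with log link;
# zero-inflation component: binomial with logit link
fit_zip <- zeroinfl(mosqct ~ depth + I(depth^2),
                    data = pointdata, dist = "poisson", EM = TRUE)
summary(fit_zip)

logLik(fit_zip)  # compare against the Gaussian GLM above
AIC(fit_zip)

The count part models the log of the expected mosquito count, while the zero-inflation part models the probability of a structural (excess) zero, which is why this model handles the zero-heavy histogram far better than the Gaussian GLM.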