Transforming the data
Modified from:
Gotelli and Ellison 2004, Chapter 8;
Sokal and Rohlf 2000, Chapter 13
What is a transformation?
It is a mathematical function that is applied to all the observations of a given variable:

Y* = f(Y)

• Y represents the original variable, Y* is the transformed variable, and f is a mathematical function that is applied to the data
Most are monotonic:
• Monotonic functions do not change the rank order of the data, but they do change their relative spacing, and therefore affect the variance and shape of the probability distribution
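A quick check in R (a minimal sketch with made-up numbers) illustrates the point: a log transformation leaves the ranks untouched but evens out the spacing and shrinks the variance.

y <- c(1, 10, 100, 1000)   # hypothetical data with widening gaps
rank(y)                    # 1 2 3 4
rank(log10(y))             # 1 2 3 4: rank order is unchanged
diff(y)                    # 9 90 900: spacing grows tenfold each step
diff(log10(y))             # 1 1 1: spacing is even after transformation
var(y)                     # about 233840
var(log10(y))              # about 1.67: the variance shrinks dramatically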
There are two legitimate reasons to transform your data before analysis:
• The patterns in the transformed data may be easier to understand and communicate than patterns in the raw data.
• The transformation may be necessary for the analysis to be valid.
Transformations are often useful for converting curves into straight lines:
• The logarithmic function is very useful when two variables are related to each other by multiplicative or exponential functions:

Y = β0 + β1·log(X)
Logarithmic (X):

[Figure: y = ln(x) plotted against x (up to 200,000) is a curve; plotted against log(x) it is a straight line, Y = β0 + β1·log(X)]
Example: Asi's growth (50% each year)

Year    Weight (g)
  1       10.0
  2       15.0
  3       22.5
  4       33.8
  5       50.6
  6       75.9
  7      113.9
  8      170.9
  9      256.3
 10      384.4
 11      576.7
 12      865.0
Exponential:

Y = β0·e^(β1·X)

[Figure: weight (g) vs. year; on a linear scale the data follow the curve y = 6.6667·e^(0.4055x); on a logarithmic scale they fall on a straight line]

ln(Y) = ln(β0) + β1·X
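This growth series can be fitted in R with an ordinary linear regression after taking logs (a minimal sketch; it reproduces the fitted curve y = 6.6667·e^(0.4055x), since ln(1.5) ≈ 0.4055):

year   <- 1:12
weight <- 10 * 1.5^(year - 1)    # Asi's weight: 50% growth per year from 10 g

fit <- lm(log(weight) ~ year)    # linear on the log scale
coef(fit)                        # intercept = ln(β0) ≈ 1.897, slope = β1 ≈ 0.4055
exp(coef(fit)["(Intercept)"])    # β0 ≈ 6.667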
Example: Species richness in the Galapagos Islands

Power:

Y = β0·X^β1

[Figure: Nspecies vs. Area with a power fit; curved on linear axes (richness to 400, area to 8000), a straight line on log-log axes]

log(Y) = log(β0) + β1·log(X)
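On the log-log scale the power function becomes a straight line, so β1 can be estimated by ordinary regression. A minimal sketch with simulated data (not the actual Galapagos counts):

set.seed(1)
area     <- c(10, 50, 100, 500, 1000, 5000)              # simulated island areas
richness <- round(5 * area^0.3 * exp(rnorm(6, 0, 0.2)))  # power law plus noise

fit <- lm(log(richness) ~ log(area))
coef(fit)    # intercept estimates log(β0), slope estimates β1 (near 0.3 here)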
Statistics and transformation
Data to be analyzed using analysis of variance must meet two assumptions:
• The data must be homoscedastic: variances of treatment groups need to be approximately equal
• The residuals, or deviations from the mean, must be normal random variables
Let's look at an example
• A single variate of the simplest type of ANOVA (completely randomized, single classification) decomposes as follows:

Yij = μ + αi + εij

• In this model the components are additive, with the error term εij distributed normally
However…
• We might encounter a situation in which the components are multiplicative in effect, where

Yij = μ · αi · εij

• If we fitted a standard ANOVA model, the observed deviations from the group means would lack normality and homoscedasticity
The logarithmic transformation
• We can correct this situation by transforming our model into logarithms:

Y* = log(Y)

Wherever the mean is positively correlated with the variance, the logarithmic transformation is likely to remedy the situation and make the variance independent of the mean.

We would obtain

log(Yij) = log(μ) + log(αi) + log(εij)

• which is additive and homoscedastic
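A small simulation in R (hypothetical group effects, not from the references) shows the remedy at work: under the multiplicative model the group variances track the group means, and taking logs removes that dependence.

set.seed(42)
mu    <- 10
alpha <- c(1, 3, 9)                 # hypothetical multiplicative group effects
group <- rep(1:3, each = 50)
eps   <- exp(rnorm(150, 0, 0.3))    # multiplicative log-normal error
y     <- mu * alpha[group] * eps

tapply(y, group, var)         # variance increases with the group mean
tapply(log(y), group, var)    # roughly equal: additive and homoscedastic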
The square root transformation
• It is used most frequently with count data; such distributions are likely to be Poisson distributed rather than normally distributed.
• In the Poisson distribution the variance is the same as the mean.
• Transforming the variates to square roots generally makes the variances independent of the means for these types of data.
• When counts include zero values, it is desirable to code all variates by adding 0.5, i.e., use √(Y + 0.5), as in the sketch below.
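A minimal sketch with simulated Poisson counts (three hypothetical treatments): on the raw scale the variance equals the mean, while √(Y + 0.5) makes the variances roughly constant.

set.seed(7)
means  <- c(2, 8, 32)                     # hypothetical treatment means
counts <- sapply(means, function(m) rpois(100, m))

apply(counts, 2, var)               # close to 2, 8, 32: variance tracks the mean
apply(sqrt(counts + 0.5), 2, var)   # close to 0.25 in each group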
The Box-Cox transformation
• Often one does not have an a priori reason for selecting a specific transformation.
• Box and Cox (1964) developed a procedure for estimating the best transformation to normality within the family of power transformations:

Y* = (Y^λ − 1)/λ   for λ ≠ 0
Y* = log(Y)        for λ = 0
The Box-Cox transformation
• The value of λ that maximizes the log-likelihood function

L = −(ν/2)·ln(s²T) + (λ − 1)·(ν/n)·Σ ln(Y)

yields the best transformation to normality within the family of transformations.
• s²T is the variance of the transformed values (based on ν degrees of freedom). The second term involves the sum of the ln of the untransformed values.
Box-Cox in R (for a vector of data Y):

library(MASS)
lamb <- seq(0, 2.5, 0.5)
boxcox(Y ~ 1, lambda = lamb, plotit = TRUE)   # profile the log-likelihood over λ

library(car)
transform_Y <- bcPower(Y, lambda = 0.5)   # apply a chosen λ; bcPower() replaces the older box.cox()

[Figure: log-likelihood profile against λ (−2 to 2) from boxcox(), with the 95% confidence interval marked; the log-likelihood ranges from about −24.0 to −22.5]

What do you conclude from this plot?

Read more in Sokal and Rohlf 2000, page 417.
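To pull the best λ out of the profile without reading it off the plot, note that boxcox() invisibly returns the λ grid and its log-likelihoods (a minimal sketch continuing from the code above):

bc <- boxcox(Y ~ 1, lambda = seq(0, 2.5, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]    # λ that maximizes the log-likelihood
transform_Y <- bcPower(Y, lambda_hat)  # apply it with car::bcPower()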
The arcsine transformation
• Also known as the angular transformation
• It is especially appropriate for percentages
The arcsine transformation

Y* = arcsin(√Y)

[Figure: arcsine-transformed data plotted against the original proportions (0 to 1), compared with the linear (identity) line]

It is appropriate only for data expressed as proportions.
Since the transformations discussed are NON-LINEAR, confidence limits computed on the transformed scale and changed back to the original scale would be asymmetrical.
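A minimal sketch in R (made-up proportions) showing the transformation and the asymmetry of a back-transformed confidence interval:

p <- c(0.04, 0.12, 0.25, 0.40, 0.61)   # hypothetical proportions
p_star <- asin(sqrt(p))                # angular transformation (radians)

# 95% CI on the transformed scale, back-transformed with sin(x)^2
m  <- mean(p_star)
se <- sd(p_star) / sqrt(length(p_star))
ci <- sin(c(m - 1.96 * se, m + 1.96 * se))^2
ci                                     # limits are not symmetric around sin(m)^2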
Evaluating Ecological Responses to Hydrologic Changes in a Payment-for-Environmental-Services Program on Florida Ranchlands
Patrick Bohlen, Elizabeth Boughton, John Fauth, David Jenkins, Pedro Quintana-Ascencio, Sanjay Shukla and Hilary Swain
G08K10487

Palaez Ranch Wetland Water Retention
[Figure: histograms of pointdata$mosqct and of the positive counts only, pointdata$mosqct[pointdata$mosqct > 0]; both distributions are strongly right-skewed, with counts extending to about 6000]
Call:
glm(formula = mosqct ~ depth + depth^2, data = pointdata)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
 -28.1   -27.4   -25.5   -19.5  6388.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.40214   42.98439   0.451    0.652
depth         1.11973    4.07803   0.275    0.784
depth^2      -0.03597    0.07907  -0.455    0.649

(Dispersion parameter for gaussian family taken to be 117582.7)

    Null deviance: 77787101  on 663  degrees of freedom
Residual deviance: 77722147  on 661  degrees of freedom
AIC: 9641.5

Number of Fisher Scoring iterations: 2
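One caution when reproducing this call: inside an R formula, depth^2 is a crossing operator rather than a square, so a quadratic term is normally written with I() or poly(). A minimal sketch of the Gaussian fit (assuming pointdata as above):

m_gauss <- glm(mosqct ~ depth + I(depth^2),
               family = gaussian, data = pointdata)
summary(m_gauss)
plot(fitted(m_gauss), residuals(m_gauss))   # residual check, as in the next figure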
Model assuming Gaussian errors

[Figure: Residuals of the model. Left: pointdata$mosqct vs. pointdata$depth. Right: residuals vs. the fitted values Y_hats]
Call:
zeroinfl(formula = mosqct ~ depth + depth^2, data = pointdata,
         dist = "poisson", EM = TRUE)

Pearson residuals:
        Min         1Q     Median         3Q        Max
 -6.765e-01 -5.630e-01 -5.316e-01 -4.768e-01  9.393e+05

Count model coefficients (poisson with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.8550639  0.0498470   37.22   <2e-16 ***
depth        0.4364981  0.0064139   68.06   <2e-16 ***
depth^2     -0.0139134  0.0001914  -72.68   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.6400910  0.3274469   1.955  0.05061 .
depth        0.0846763  0.0371673   2.278  0.02271 *
depth^2     -0.0027798  0.0009356  -2.971  0.00297 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 1
Log-likelihood: -4.728e+04 on 6 Df
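zeroinfl() comes from the pscl package; a minimal sketch of the call above, again writing the quadratic term with I():

library(pscl)

# Zero-inflated Poisson: a log-link count model plus a logit model for excess zeros
m_zip <- zeroinfl(mosqct ~ depth + I(depth^2),
                  data = pointdata, dist = "poisson", EM = TRUE)
summary(m_zip)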
Zero inflated models

[Figure: Residuals of the model assuming Poisson errors for counts. Left: pointdata$mosqct vs. pointdata$depth. Right: residuals_p vs. the fitted values Y_hats_poisson]