CHAPTER 8
Managing and Curating Data
The Second Step
Storing and Curating Data
Storage: Temporary and Archival
Permanent archives
• The only medium acceptable as truly archival is acid-free paper
Electronic storage
• Do not expect electronic media to last more than 5-10 years
• Should be used primarily for working copies
• If used, copy datasets onto newer electronic media on a regular basis
Curating Data
• Most ecological and environmental data are collected by researchers using funds obtained through grants and contracts
• These data are technically owned by the granting agency, and they need to be made widely available (e.g., on the Internet)
• Unfortunately, when budgets are cut, data management and curation costs are often the first items to be dropped
The Final Step
Transforming the Data
Transformation
• A mathematical function that is applied to all of the observations of a given variable: Y* = f(Y)
• Most are fairly simple algebraic functions; any function can be used as long as it is continuous and monotonic
• Transformations DO NOT change the rank order of the data
• Transformations DO change the relative spacing of the data
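A minimal Python sketch of these two properties. The four values and the choice of a log10 transformation are hypothetical, picked only to make the effect easy to see.

```python
import math

def rank_order(values):
    """Indices that would sort the values from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

def gaps(values):
    """Differences between consecutive values after sorting."""
    ordered = sorted(values)
    return [b - a for a, b in zip(ordered, ordered[1:])]

# Hypothetical observations of a variable Y
y = [100.0, 1.0, 1000.0, 10.0]

# Y* = f(Y), here with f = log10 (a continuous monotonic function)
y_star = [math.log10(v) for v in y]

# The rank order is unchanged by the transformation
same_ranks = rank_order(y) == rank_order(y_star)

# The spacing between neighbouring values is not
raw_gaps = gaps(y)        # unequal gaps on the raw scale
log_gaps = gaps(y_star)   # roughly equal gaps on the log scale
```

On the raw scale the gaps between neighbouring values grow by a factor of ten each time; after the transformation they are about equal, while the ordering of the observations stays the same.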
Why Transform Data?
(1) Patterns in the transformed data may be easier to understand and communicate than patterns in the raw data
Example: converting curves into straight lines
(2) Transformation may be necessary for the analysis to be valid (“meeting the assumptions”)
The Species-Area Relationship
A classic example
If we plot the number of species against the area of the island, the data often follow a simple power function, S = cA^z, where
S = number of species
A = island area
c and z are constants fitted to the data
Island         Area (km^2)  No. of species  Log10(Area)  Log10(Species)
Albermarle     5824.9       325             3.765        2.512
Charles        165.8        319             2.220        2.504
Chatham        505.1        306             2.703        2.486
James          525.8        224             2.721        2.350
Indefatigable  1007.5       193             3.003        2.286
Abingdon       51.8         119             1.714        2.076
Duncan         18.4         103             1.265        2.013
Narborough     634.6        80              2.803        1.903
Hood           46.6         79              1.668        1.898
Seymour        2.6          52              0.415        1.716
Barrington     19.4         48              1.288        1.681
Gardner        0.5          48              -0.301       1.681
Bindloe        116.6        47              2.067        1.672
Jervis         4.8          42              0.681        1.623
Tower          11.4         22              1.057        1.342
Wenman         47           14              1.672        1.146
Culpepper      2.3          7               0.362        0.845
[Figure: number of species (axis 0 to 400) plotted against island area in km^2 (axis 0 to 7000), untransformed]
The Species-Area Relationship
If species richness and island area are related by a power function, we can transform this equation into a straight line by taking logarithms of both sides
log(S) = log(cA^z)
log(S) = log(c) + z log(A)
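As an illustration of the straight-line form, the sketch below fits log10(S) = log10(c) + z log10(A) by ordinary least squares to the 17 island values from the table. The helper names (`fit_line`, `predicted_species`) are made up for this example, and least squares is only one reasonable way to estimate c and z.

```python
import math

# (area in km^2, number of species) for the 17 islands in the table
islands = [
    (5824.9, 325), (165.8, 319), (505.1, 306), (525.8, 224),
    (1007.5, 193), (51.8, 119), (18.4, 103), (634.6, 80),
    (46.6, 79), (2.6, 52), (19.4, 48), (0.5, 48), (116.6, 47),
    (4.8, 42), (11.4, 22), (47.0, 14), (2.3, 7),
]

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Transform both variables, then fit the straight line
log_area = [math.log10(area) for area, _ in islands]
log_species = [math.log10(species) for _, species in islands]
z, log_c = fit_line(log_area, log_species)

# Back on the original scale the fitted line is the power function S = c * A^z
c = 10 ** log_c

def predicted_species(area):
    """Species number predicted by the fitted power function."""
    return c * area ** z
```

The slope of the fitted line is the exponent z, and the intercept is log10(c), so the constants of the power function come directly from a linear fit on the transformed data.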
[Figure: log10(number of species) plotted against log10(island area)]
Other Transformations
Cube-Root Transformation (Y^(1/3))
Used for measures of mass or volume (Y^3) that are allometrically related to linear measures of body size or length (Y)
Logarithmic Transformation
Used to examine relationships between two measures of mass or volume (Y^3); both X and Y are transformed
Why Transform Data?
Statistics demands it
All statistical tests require data to fit certain mathematical assumptions
Examples
Analysis of Variance
(1) variances must be homoscedastic
(2) residuals must be normal random variables
Regression
(1) residuals must be normally distributed and uncorrelated with the independent variable
Five Common Transformations
(1) Logarithmic Transformation
(2) Square-root Transformation
(3) Angular (or arcsine) Transformation
(4) Reciprocal Transformation
(5) Box-Cox Transformation
Logarithmic Transformation
Replaces each observation with its logarithm
Y* = log(Y)
Often equalizes variances for data in which the mean and variance are positively correlated; such data also tend to have outliers and positively skewed residuals
The logarithm of 0 is not defined; if the data include zeros, add 1 to each observation before transforming: Y* = log(Y + 1)
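A small sketch of both points, with hypothetical data: two groups whose spread grows with their mean end up with equal variances on the log scale, and zeros are handled by adding 1 first. The function name `log_transform` and the choice of base 10 are assumptions made for this example.

```python
import math
from statistics import variance

def log_transform(values, add_one=False):
    """Return log10 of each value; with add_one=True apply log10(Y + 1) instead."""
    shift = 1.0 if add_one else 0.0
    return [math.log10(v + shift) for v in values]

# Hypothetical groups: group_b is group_a multiplied by 10,
# so its mean and its variance are both larger
group_a = [1.0, 2.0, 4.0, 8.0]
group_b = [10.0, 20.0, 40.0, 80.0]

# On the raw scale the variances differ by a factor of about 100
raw_variance_ratio = variance(group_b) / variance(group_a)

# On the log scale the two variances are essentially the same
log_var_a = variance(log_transform(group_a))
log_var_b = variance(log_transform(group_b))

# Data containing a zero: log10(0) is undefined, so transform Y + 1
counts_with_zero = [0, 9, 99]
shifted = log_transform(counts_with_zero, add_one=True)
```

Calling `log_transform(counts_with_zero)` without `add_one=True` raises a `ValueError`, which is the practical form of "the logarithm of 0 is not defined".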
Square-root Transformation
Replaces each observation with its square root
Y* = SQRT(Y)
Used most frequently for count data, which often follow a Poisson distribution
Yields a variance that is independent of the mean
Leaves data values equal to 0 unchanged (SQRT(0) = 0); add some small number to each observation before transforming
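To see why the variance becomes independent of the mean, the sketch below computes the variance of Y and of SQRT(Y) directly from the Poisson probabilities for three hypothetical means. The function names and the truncation point of the sums are choices made for this illustration.

```python
import math

def poisson_probabilities(mean, k_max):
    """P(Y = k) for k = 0..k_max, built up term by term."""
    probs = [math.exp(-mean)]
    for k in range(1, k_max + 1):
        probs.append(probs[-1] * mean / k)
    return probs

def variance_raw_and_sqrt(mean):
    """Variance of Y and of SQRT(Y) when Y is Poisson with the given mean."""
    # Truncate the sums far out in the upper tail, where probabilities are negligible
    k_max = int(mean + 12 * math.sqrt(mean) + 20)
    probs = poisson_probabilities(mean, k_max)
    e_y = sum(p * k for k, p in enumerate(probs))
    e_y2 = sum(p * k * k for k, p in enumerate(probs))
    e_sqrt = sum(p * math.sqrt(k) for k, p in enumerate(probs))
    var_raw = e_y2 - e_y ** 2
    var_sqrt = e_y - e_sqrt ** 2   # because E[(SQRT(Y))^2] = E[Y]
    return var_raw, var_sqrt

# Hypothetical means spanning a 16-fold range
results = {mean: variance_raw_and_sqrt(mean) for mean in (10, 40, 160)}
```

For each mean the raw variance equals the mean, as expected for a Poisson variable, while the variance of the square roots stays close to 1/4 regardless of the mean.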
Arcsine Transformation
Also called the arcsine-square root or angular transformation
Replaces each observation with the arcsine of the square root of the value
Y* = arcsine(SQRT(Y))
Principally used for proportions
Removes the dependence of the variance on the mean
Gives transformed data in units of radians, not degrees
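A short sketch of the transformation applied to a hypothetical set of proportions. The function names are illustrative; `math.asin` returns radians, and `math.degrees` is shown only to make the unit explicit.

```python
import math

def arcsine_sqrt(p):
    """Y* = arcsine(SQRT(Y)) for a proportion Y between 0 and 1, in radians."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(p))

def back_transform(y_star):
    """Recover the proportion: the square of the sine of Y*."""
    return math.sin(y_star) ** 2

# Hypothetical proportions
proportions = [0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0]

transformed = [arcsine_sqrt(p) for p in proportions]      # radians, from 0 to pi/2
in_degrees = [math.degrees(t) for t in transformed]       # same values, from 0 to 90
recovered = [back_transform(t) for t in transformed]
```

The transformed values run from 0 to pi/2 radians (0 to 90 degrees), and values near 0 and 1 are stretched apart relative to values near 0.5.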
Reciprocal Transformation
Replaces each value with its reciprocal
Y* = 1/Y
Commonly used for data that record rates, which often appear as hyperbolic curves
Box-Cox Transformation
A family of transformations
Y* = (Y^lambda - 1)/lambda    (for lambda not equal to 0)
Y* = log_e(Y)                 (for lambda = 0)
L = -(v/2) log_e(s_T^2) + (lambda - 1)(v/n) sum(log_e(Y))
v = degrees of freedom
n = sample size
s_T^2 = variance of the transformed values of Y
sum(...) = sum over all n observations
The value of lambda that maximizes the equation for L is used in one of the first two equations to provide the closest fit of the transformed data to a normal distribution
The equation for L must be solved iteratively (trying different lambda values until L is maximized) using computer software
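The iterative search can be sketched as a simple grid search in Python. This is an illustration under stated assumptions, not a reference implementation: the degrees of freedom are taken as n - 1, the search range of -2 to 2 and the 401 grid points are arbitrary choices, and the names `box_cox`, `log_likelihood`, and `best_lambda` are invented for the example.

```python
import math
from statistics import variance

def box_cox(y, lam):
    """Apply the Box-Cox transformation for one value of lambda."""
    if abs(lam) < 1e-8:
        return [math.log(v) for v in y]            # Y* = log_e(Y) when lambda = 0
    return [(v ** lam - 1.0) / lam for v in y]     # Y* = (Y^lambda - 1)/lambda otherwise

def log_likelihood(y, lam):
    """L = -(v/2) log_e(s_T^2) + (lambda - 1)(v/n) sum(log_e(Y))."""
    n = len(y)
    v = n - 1                                      # assumption: degrees of freedom = n - 1
    s2_t = variance(box_cox(y, lam))               # sample variance of the transformed values
    sum_log_y = sum(math.log(value) for value in y)
    return -(v / 2) * math.log(s2_t) + (lam - 1) * (v / n) * sum_log_y

def best_lambda(y, low=-2.0, high=2.0, steps=401):
    """Try evenly spaced lambda values and keep the one with the largest L."""
    grid = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: log_likelihood(y, lam))

# Hypothetical positive data whose logarithms are evenly spaced and symmetric,
# so a logarithmic transformation (lambda near 0) should fit best
sample = [math.exp(i / 10) for i in range(-20, 21)]
lam_hat = best_lambda(sample)
transformed = box_cox(sample, lam_hat)
```

All observations must be positive for the logarithms to exist. A finer grid or a numerical optimizer could refine the estimate, but the logic of trying lambda values until L stops improving is the same.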
• When lambda = 1, equation 1 results in a linear transformation
• When lambda = 1/2, a square-root transformation
• When lambda = -1, a reciprocal transformation
• When lambda = 0, equation 2 results in a natural logarithmic transformation
• ALWAYS try using simple arithmetic transformations FIRST
If data are right-skewed, try using familiar transformations from the series 1/SQRT(Y), SQRT(Y), ln(Y), 1/Y
If left-skewed, try Y^2, Y^3, etc.
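One way to compare candidate transformations is to compute the skewness of the data before and after each one. The sketch below does this for a hypothetical right-skewed sample and for two members of the series, SQRT(Y) and ln(Y); the moment-based skewness formula and the sample itself are assumptions of this example.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment divided by the variance to the power 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: exponentials of evenly spaced values
y = [math.exp(i / 10) for i in range(-20, 21)]

candidates = {
    "Y": y,
    "SQRT(Y)": [math.sqrt(v) for v in y],
    "ln(Y)": [math.log(v) for v in y],
}

# Skewness close to 0 suggests a roughly symmetric distribution
skews = {name: skewness(values) for name, values in candidates.items()}
```

For this sample the skewness shrinks from the raw data to the square roots and is essentially zero after the natural logarithm, so the logarithm would be the transformation to carry forward.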
[Figure: curves labeled Original, Logarithmic, Square Root, Arcsine, and Reciprocal; both axes run from 0 to 1]
Reporting Results
• You should report results in the original units, which requires back-transforming the transformed values
• The back-transformed mean can be very different from the arithmetic mean of the untransformed data
• Also, back-transformations will normally result in asymmetrical confidence intervals
Back-Transformations
Logarithmic – antilog(Y*): 10^(Y*) for log10, e^(Y*) for natural logarithms
Square Root – (Y*)^2
Arcsine – (sin(Y*))^2
Reciprocal – 1/(Y*)
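The sketch below checks each back-transformation by a round trip and then shows, for a hypothetical log10-transformed sample, how the back-transformed mean differs from the arithmetic mean and how an interval becomes asymmetrical. The interval of the mean plus or minus two standard errors is a rough stand-in for a confidence interval, used only for illustration.

```python
import math
from statistics import mean, stdev

# Forward transformation and matching back-transformation for each case
pairs = {
    "logarithmic (log10)": (math.log10, lambda s: 10 ** s),
    "square root": (math.sqrt, lambda s: s ** 2),
    "arcsine": (lambda v: math.asin(math.sqrt(v)), lambda s: math.sin(s) ** 2),
    "reciprocal": (lambda v: 1.0 / v, lambda s: 1.0 / s),
}

# Round trip on one hypothetical value (a valid proportion, so every pair applies)
original = 0.36
round_trip = {name: back(forward(original)) for name, (forward, back) in pairs.items()}

# Hypothetical sample analysed on the log10 scale
y = [1.0, 10.0, 100.0, 1000.0]
y_star = [math.log10(v) for v in y]

mean_star = mean(y_star)
se_star = stdev(y_star) / math.sqrt(len(y_star))

# Back-transform the mean and the interval limits to the original units
back_mean = 10 ** mean_star
lower = 10 ** (mean_star - 2 * se_star)   # rough interval: mean minus 2 standard errors
upper = 10 ** (mean_star + 2 * se_star)   # rough interval: mean plus 2 standard errors

arithmetic_mean = mean(y)
```

Here the back-transformed mean is the geometric mean of the sample and is much smaller than the arithmetic mean, and the upper limit lies much farther from it than the lower limit does.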
Reporting Results
• Lastly, transforming data should be added to your audit trail (documented in the metadata)
• Create a new spreadsheet for the transformed data and store it on permanent media