CHAPTER 8
Managing and Curating Data

The Second Step: Storing and Curating Data

Storage: Temporary and Archival
  Permanent archives
    - The only medium acceptable as truly archival is acid-free paper
  Electronic storage
    - Do not expect electronic media to last more than 5-10 years
    - Should be used primarily for working copies
    - If used, copy datasets onto newer electronic media on a regular basis

Curating Data
  - Most ecological and environmental data are collected by researchers using funds obtained through grants and contracts
  - The data are technically owned by the granting agency, and they need to be made widely available (e.g., on the Internet)
  - Unfortunately, when budgets are cut, data management and curation costs are often the first items to be dropped

The Final Step: Transforming the Data

Transformation
  - A mathematical function that is applied to all of the observations of a given variable: $Y^* = f(Y)$
  - Most are fairly simple algebraic functions
  - As long as they are continuous monotonic functions, transformations:
    - DO NOT change the rank order of the data (a decreasing function such as the reciprocal reverses the order but keeps it consistent)
    - DO change the relative spacing of the data

Why Transform Data?
  (1) Patterns in the transformed data may be easier to understand and communicate than patterns in the raw data
    - Converting curves into straight lines
  (2) Necessary for the analysis to be valid: "meeting the assumptions"

The Species-Area Relationship
  A classic example
  - If we plot the number of species against the area of the island, the data often follow a simple power function, $S = cA^z$, where
    - S = number of species
    - A = island area
    - c and z are constants fitted to the data

  Island          Area (km²)   No. of species   log10(Area)   log10(Species)
  Albemarle           5824.9              325         3.765            2.512
  Charles              165.8              319         2.220            2.504
  Chatham              505.1              306         2.703            2.486
  James                525.8              224         2.721            2.350
  Indefatigable       1007.5              193         3.003            2.286
  Abingdon              51.8              119         1.714            2.076
  Duncan                18.4              103         1.265            2.013
  Narborough           634.6               80         2.803            1.903
  Hood                  46.6               79         1.668            1.898
  Seymour                2.6               52         0.415            1.716
  Barrington            19.4               48         1.288            1.681
  Gardner                0.5               48        -0.301            1.681
  Bindloe              116.6               47         2.067            1.672
  Jervis                 4.8               42         0.681            1.623
  Tower                 11.4               22         1.057            1.342
  Wenman                47                 14         1.672            1.146
  Culpepper              2.3                7         0.362            0.845

  [Figure: number of species (y-axis, 0 to 400) plotted against island area in km² (x-axis, 0 to 7000), untransformed]

  - If species richness and island area are related by a power function, we can make the relationship linear by taking logarithms of both sides:
      $\log(S) = \log(cA^z)$
      $\log(S) = \log(c) + z\log(A)$

  [Figure: log10(number of species) (y-axis) plotted against log10(island area) (x-axis)]

Other Transformations
  Cube-root transformation, $Y^* = Y^{1/3}$
    - Used for measures of mass or volume that are allometrically related to linear measures of body size or length
  Logarithmic transformation of both X and Y
    - Used to examine relationships between two measures of mass or volume ($Y^3$)

Why Transform Data?
  Statistics demands it
  - All statistical tests require the data to fit certain mathematical assumptions
  Examples
    Analysis of variance
      (1) Variances must be homoscedastic (equal among groups)
      (2) Residuals must be normal random variables
    Regression
      (1) Residuals must be normally distributed and uncorrelated with the independent variable

Five Common Transformations
  (1) Logarithmic transformation
  (2) Square-root transformation
  (3) Angular (or arcsine) transformation
  (4) Reciprocal transformation
  (5) Box-Cox transformation

Logarithmic Transformation
  - Replaces each observation with its logarithm: $Y^* = \log(Y)$
  - Often equalizes variances for data in which the mean and variance are positively correlated; such data also tend to have outliers and positively skewed residuals
  - The logarithm of 0 is not defined; if the data contain zeros, add 1 to each observation before transforming

Square-root Transformation
  - Replaces each observation with its square root: $Y^* = \sqrt{Y}$
  - Used most frequently for count data, which often follow a Poisson distribution
  - Yields a variance that is independent of the mean
  - Leaves values equal to 0 unchanged ($\sqrt{0} = 0$); add some small number to the observations before transforming

Arcsine Transformation
  - Also called the arcsine-square root or angular transformation
  - Replaces each observation with the arcsine of the square root of the value: $Y^* = \arcsin(\sqrt{Y})$
  - Principally used for proportions
  - Removes the dependence of the variance on the mean
  - Gives transformed data in units of radians, not degrees

Reciprocal Transformation
  - Replaces each value with its reciprocal: $Y^* = 1/Y$
  - Commonly used for data that record rates, which often appear hyperbolic when plotted

Box-Cox Transformation
  A family of transformations
    (1) $Y^* = (Y^{\lambda} - 1)/\lambda$   (for $\lambda \neq 0$)
    (2) $Y^* = \log_e(Y)$   (for $\lambda = 0$)
    (3) $L = -\frac{v}{2}\log_e(s_T^2) + (\lambda - 1)\frac{v}{n}\sum \log_e(Y)$
  where
    - v = degrees of freedom
    - n = sample size
    - $s_T^2$ = variance of the transformed values of Y
  - The value of $\lambda$ that maximizes equation 3 is used in equation 1 or 2 to provide the closest fit of the transformed data to a normal distribution
  - Equation 3 must be solved iteratively (trying different $\lambda$ values until L is maximized) using computer software
  - Special cases:
    - When $\lambda = 1$, equation 1 results in a linear transformation
    - When $\lambda = 1/2$, a square-root transformation
    - When $\lambda = -1$, a reciprocal transformation
    - When $\lambda = 0$, equation 2 results in a natural logarithmic transformation
  - ALWAYS try using simple arithmetic transformations FIRST
    - If the data are right-skewed, try familiar transformations from the series $1/\sqrt{Y}$, $\sqrt{Y}$, $\ln(Y)$, $1/Y$
    - If the data are left-skewed, try $Y^2$, $Y^3$, etc.

  [Figure: transformation curves compared over the range 0 to 1 on both axes; legend: Original, Logarithmic, Square Root, Arcsine, Reciprocal]

Reporting Results
  - You should report results in the original units, which requires back-transforming the transformed values
  - The back-transformed mean can be very different from the arithmetic mean of the raw data
  - Back-transformations will normally result in asymmetrical confidence intervals

Back-Transformations
  - Logarithmic: antilog($Y^*$), i.e. $e^{Y^*}$ for natural logarithms
  - Square root: $(Y^*)^2$
  - Arcsine: $\sin^2(Y^*)$
  - Reciprocal: $1/Y^*$

Reporting Results
  - Lastly, any transformation of the data should be added to your audit trail (documented in the metadata)
  - Create a new spreadsheet for the transformed data and store it on permanent media
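
The points under Reporting Results can be made concrete with a small worked sketch. It uses the species counts from the Galápagos table, takes base-10 logarithms, computes a mean and a 95% confidence interval on the log scale, and back-transforms all three values. The variable names and the critical value 2.120 (Student's t for 16 degrees of freedom, taken from a standard t table) are assumptions introduced for illustration, not part of the original slides.

```python
import math
import statistics

# Species counts for the 17 islands in the species-area table.
species = [325, 319, 306, 224, 193, 119, 103, 80, 79,
           52, 48, 48, 47, 42, 22, 14, 7]

# Transform: Y* = log10(Y)
log_species = [math.log10(s) for s in species]

n = len(log_species)
mean_log = statistics.mean(log_species)
se_log = statistics.stdev(log_species) / math.sqrt(n)

# Assumed critical value: two-tailed 95% Student's t with n - 1 = 16 df.
T_CRIT = 2.120

lower_log = mean_log - T_CRIT * se_log
upper_log = mean_log + T_CRIT * se_log

# Back-transform: antilog, i.e. 10 ** Y* for base-10 logarithms.
back_mean = 10 ** mean_log
back_lower = 10 ** lower_log
back_upper = 10 ** upper_log

arith_mean = statistics.mean(species)

print(f"arithmetic mean of raw counts: {arith_mean:.1f}")
print(f"back-transformed mean:         {back_mean:.1f}")
print(f"back-transformed 95% CI:       {back_lower:.1f} to {back_upper:.1f}")
print(f"distance below the mean:       {back_mean - back_lower:.1f}")
print(f"distance above the mean:       {back_upper - back_mean:.1f}")
```

The interval is symmetric on the log scale but not after back-transformation, because the antilog stretches the upper half more than the lower half. The back-transformed mean of log-transformed data is the geometric mean, which is why it falls below the arithmetic mean here.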
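
As a sketch of the species-area example, the log-transformed relationship $\log(S) = \log(c) + z\log(A)$ can be fitted as a straight line by ordinary least squares, using the areas and species counts from the table. The helper `fit_line` and the variable names are hypothetical, and ordinary least squares is an assumption here: the slides do not say which fitting method was used.

```python
import math

# Island areas (km^2) and species counts, in the order given in the table.
area = [5824.9, 165.8, 505.1, 525.8, 1007.5, 51.8, 18.4, 634.6, 46.6,
        2.6, 19.4, 0.5, 116.6, 4.8, 11.4, 47.0, 2.3]
species = [325, 319, 306, 224, 193, 119, 103, 80, 79,
           52, 48, 48, 47, 42, 22, 14, 7]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Transform both variables, then fit log10(S) = log10(c) + z * log10(A).
log_area = [math.log10(a) for a in area]
log_species = [math.log10(s) for s in species]
z, log_c = fit_line(log_area, log_species)

# Back-transform the intercept to recover c in S = c * A**z.
c = 10 ** log_c

print(f"z (slope)        = {z:.3f}")
print(f"log10(c)         = {log_c:.3f}")
print(f"c                = {c:.1f}")
print(f"predicted S at A = 100 km^2: {c * 100 ** z:.0f}")
```

The slope of the fitted line is the exponent z and the antilog of the intercept is the constant c, so the straight-line fit on the transformed scale gives both parameters of the power function.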
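
The iterative search described under Box-Cox Transformation can be sketched as a simple grid search: evaluate L for a range of candidate lambda values and keep the one that gives the largest L. The function names, the grid from -2 to 2 in steps of 0.01, and the choice v = n - 1 for the degrees of freedom are assumptions made for illustration; statistical software typically uses a numerical optimizer rather than a fixed grid.

```python
import math
import statistics

# Species counts from the species-area table, used as example data.
species = [325, 319, 306, 224, 193, 119, 103, 80, 79,
           52, 48, 48, 47, 42, 22, 14, 7]


def box_cox(y, lam):
    """Apply the Box-Cox transformation for a given lambda (y must be > 0)."""
    if lam == 0:
        return [math.log(value) for value in y]
    return [(value ** lam - 1) / lam for value in y]


def log_likelihood(y, lam):
    """L = -(v/2) ln(s2_T) + (lambda - 1)(v/n) sum(ln Y)."""
    n = len(y)
    v = n - 1  # assumed: degrees of freedom of the sample variance
    s2_t = statistics.variance(box_cox(y, lam))
    sum_log_y = sum(math.log(value) for value in y)
    return -(v / 2) * math.log(s2_t) + (lam - 1) * (v / n) * sum_log_y


# Candidate lambda values: -2.00, -1.99, ..., 2.00 (assumed search range).
lambdas = [i / 100 for i in range(-200, 201)]
best_lambda = max(lambdas, key=lambda lam: log_likelihood(species, lam))

print(f"lambda that maximizes L on the grid: {best_lambda:.2f}")
print(f"L at that lambda: {log_likelihood(species, best_lambda):.3f}")
print(f"L at lambda = 1 (no change in shape): {log_likelihood(species, 1.0):.3f}")
print(f"L at lambda = 0 (natural log): {log_likelihood(species, 0.0):.3f}")
```

If the best lambda on the grid lies close to one of the special cases (1, 1/2, 0, -1), the corresponding simple transformation is usually easier to explain and report than the exact Box-Cox value.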