A Metabolomic Data Uncertainty Budget for the Plant Arabidopsis thaliana

Philip M. Dixon and Geng Ding

In “Statistics and Metabolomics,” David Banks discusses five places for collaboration between statisticians and biologists collecting and interpreting metabolomic data. Here, we illustrate the first of those: the construction of an uncertainty budget. Our example comes from plant metabolomics.

In plant metabolomics, the measurements are the same as in human metabolomics: the concentrations of cellular metabolites, usually with a molecular weight less than 500. Although it could be used in the same way the human metabolome is used (as a fingerprint for rapid identification of disease), the primary motivation for studying the plant metabolome is its usefulness to basic science. The metabolome is the intermediary between enzyme activity, which ultimately is a consequence of the plant genome, and the phenotype, the observable characteristics of individual plants.

The metabolome provides a tool for understanding the function of genes, even if a gene has minimal or no effect on the phenotype. Using reverse genetic techniques, it is possible to create a knockout mutant, in which the DNA sequence for a specific gene is changed and the gene product is disabled. Plants carrying the knockout mutation are then compared to wild-type controls. The knockout sometimes kills the plant, sometimes changes the visible phenotype, and sometimes produces plants that look identical to the wild-type. When the knockout is not lethal, comparing metabolomes of knockout and wild-type individuals provides a way to discover whether the gene of interest has a function and to understand the metabolic origin of phenotypic changes.

12 VOL. 20, NO. 1, 2007

The Data Set

The data used here are part of Geng Ding’s investigation of knockout mutants for an enzyme that degrades an amino acid. The plants being studied are Arabidopsis thaliana, a model organism widely used in plant science.
For each of two mutants, Ding has plants of three genotypes, differing in the number of copies (zero, one, or two) of the knockout DNA. These genotypes have subtle differences in the phenotype, but the differences are tiny during vegetative growth. Ding’s biological goal is to compare mean concentrations of each of 18 amino acids among the six combinations of two mutants and three genotypes, called “id’s” henceforth.

Two plants of each id were grown in a homogeneous environment. At harvest, tissue from each of the 12 plants was split into two containers, yielding 24 samples. Because it wasn’t feasible to extract amino acids from all 24 samples at the same time, extraction was done in two batches. The first 12 containers, from one plant of each of the six id’s, were extracted in one batch. Then, the remaining 12 samples (from the other six plants) were extracted. Each of the 24 samples (6 id’s × 2 plants per id × 2 extracts per plant) was then measured.

Amino acid concentrations were measured by a gas chromatograph with a flame ionization detector (GC-FID). This detector measured the amount of carbon-containing compounds coming out of the GC every few seconds. Specific amino acids were identified by comparing their retention time in the GC to known standards. Each amino acid was quantified by integrating the signal from a distinct peak, normalizing by an internal standard, and using a calibration curve to determine the amino acid concentration. Concentrations were expressed as micromoles of amino acid per gram of plant fresh weight.

This data collection scheme, a complete block design, provides a measure of variability between extraction batches, a measure of the biological variability between plants, and measures of the differences among id’s. The variability between the two samples from the same plant includes the within-plant variability, the variability between extractions, and the variability between measurements.
The entire study was then repeated to give a total of 48 samples. The variability between repetitions provides a measure of the repeatability of the results, including long-term drift in the measuring process and the growth environment. One extract was measured twice. The two measurements for this extract provide an estimate of the variability between measurements. Because only one extract was measured twice, the second measurement is omitted from most of the analyses described here.

[Figure 1: dot plot. Vertical axis: THR concentration (nm/mg); columns: Replicate, Plants, Measurements.]

Figure 1. Variability between replicates, plants within replicates, and measurements within plants for one id. The two dots labeled “Replicate” are the averages for replicate 1 and replicate 2. The four dots labeled “Plants” are the averages for the two plants from each replicate, sorted by replicate average and labeled by the replicate number. The eight dots labeled “Measurements” are the measurements from each plant, sorted by plant average and labeled by replicate and plant number. Each column average is indicated by –.

Components of the Uncertainty Budget

Our goal here is to quantify components of the uncertainty budget. Each level of replication (two repetitions of the study, two extraction batches, two plants per repetition and treatment, and two extractions/measurements per plant) is a component of the uncertainty budget that can be quantified by estimating variance components. Extraction batches are nested within repetitions, id’s are crossed with extraction batches, plants are nested within extraction batches, and extractions/measurements are nested within plants.

Although Banks indicates many specific reasons for sampling and measurement variability in his article, most are confounded in this study and cannot be separated. Ding’s sampling design provides an estimate of the biological variability between plants, which is crucial for the comparison of genotypes and mutants.
It also provides an estimate of the variability between extracts. The single extract measured twice provides an estimate of the variability between measurements.

Many designs to estimate variance components use only nested sampling. One example would be a design that grows plants of one genotype in three pots. Three plants are individually harvested and extracted from each pot, and then each extract is measured twice. Ding’s design introduces crossed effects because of the blocking by replicate and extraction batch. This blocking provides more precise estimates of differences between genotypes and mutants, but it complicates the analysis of variance components.

We will first illustrate a typical analysis of nested effects by considering data from only one id. Then, we will illustrate the analysis of the entire data set. Both analyses use data for Threonine, one of the 18 amino acids measured by Ding.

A Model for Nested Random Effects

The data for a single id include only one plant per extraction batch, so there are three nested random effects: between repetitions, between plants in a repetition, and between confounded extracts and measurements. One commonly used model for nested random effects can be written as:

$$Y_{ijk} = \mu + \alpha_i + \beta_{ij} + \gamma_{ijk} \qquad (1)$$

The concentration of Threonine in replicate $i$, plant $j$, and measurement $k$ is denoted $Y_{ijk}$. The overall mean Threonine concentration is denoted $\mu$. The deviation from the mean associated with replicate $i$ is $\alpha_i$. The deviation from the mean of replicate $i$ associated with plant $ij$ is $\beta_{ij}$. Within plant $ij$, the deviation of measurement $ijk$ about the plant mean is $\gamma_{ijk}$. The terms $\alpha_i$, $\beta_{ij}$, and $\gamma_{ijk}$ are considered random effects when the goal of the analysis is to estimate the magnitude of their variability.
It is common to assume all random effects are independent normal random variables with constant variance:

$$\alpha_i \sim N(0, \sigma^2_{rep}), \qquad \beta_{ij} \sim N(0, \sigma^2_{plant}), \qquad \gamma_{ijk} \sim N(0, \sigma^2_{meas}).$$

The variance between observations from a randomly chosen replicate, plant, and measurement is the sum of the three variance components,

$$\mathrm{Var}\, Y_{ijk} = \sigma^2_{rep} + \sigma^2_{plant} + \sigma^2_{meas}.$$

In this sense, the variance components partition the random variation among observations into components associated with each source of uncertainty.

The data for one id are shown in Figure 1. The two replicate averages are similar, the four plant averages are quite different (even when compared within the replicate), and the measurements from the same plant are very similar. The pooled variance between measurements on the same plant estimates $\sigma^2_{meas}$, but the pooled variance between plant averages ($\bar{Y}_{ij\cdot}$), using dots as subscripts to indicate averaging (i.e., $\bar{Y}_{ij\cdot} = (Y_{ij1} + Y_{ij2})/2$), overestimates $\sigma^2_{plant}$. This is because, within a replicate (i.e., conditional on $\alpha_i$), the variance between plant averages, $\mathrm{Var}\, \bar{Y}_{ij\cdot}$, is $\sigma^2_{plant} + \sigma^2_{meas}/2$, which is larger than $\sigma^2_{plant}$ if $\sigma^2_{meas} > 0$. Similarly, the variance between replicate averages, $\mathrm{Var}\, \bar{Y}_{i\cdot\cdot} = \sigma^2_{rep} + \sigma^2_{plant}/2 + \sigma^2_{meas}/4$, overestimates $\sigma^2_{rep}$.

CHANCE 13

[Figure 2: scatter plot. Horizontal axis: Average of measurements; vertical axis: SD of measurements.]

Figure 2. Plot of the standard deviation (s.d.) and average of the two measurements per plant. Both X and Y axes are log scaled.

[Figure 3: scatter plot. Horizontal axis: Average of log transf. meas.; vertical axis: SD of log transf. meas.]

Figure 3. Plot of the standard deviation and average of the log-transformed measurements per plant. The y-axis is log scaled; the x-axis is not because some averages are less than zero.

Estimators of Variance Components

Although there are many estimators of the variance components ($\sigma^2_{rep}$, $\sigma^2_{plant}$, and $\sigma^2_{meas}$), the two most commonly used are the ANOVA and REML estimators. The ANOVA, or method-of-moments, estimator starts with an ANOVA table quantifying the observed variability for each component. The variance components are estimated by equating the observed mean squares to their expected values (the expected mean squares) and solving for the variance components. For the Threonine data in Figure 1, the ANOVA table and expected mean squares are given in Table 1.

Table 1. ANOVA Table and Expected Mean Squares for Data From a Single Id

Source                        d.f.   Sum of Squares   Mean Square   Expected Mean Square
Replicates                    1      0.004857         0.004857      $\sigma^2_{meas} + 2\sigma^2_{plant} + 4\sigma^2_{rep}$
Plants(Reps)                  2      0.109532         0.054766      $\sigma^2_{meas} + 2\sigma^2_{plant}$
Measurements(Plants, Reps)    4      0.001480         0.000370      $\sigma^2_{meas}$
Corrected Total               7      0.115869

14 VOL. 21, NO. 2, 2008

The estimated variance components are $\hat\sigma^2_{meas} = 0.00037$, $\hat\sigma^2_{plant} = 0.027$, and $\hat\sigma^2_{rep} = -0.012$. The variance component for plants is much larger than that for measurements, consistent with the pattern in Figure 1. The negative estimate for replicates is disconcerting, since a variance must be non-negative. Negative ANOVA estimates often occur when the parameter is close to zero, when the degrees of freedom for the effect are small, when there are outliers, or when the model is wrong. However, ANOVA estimates are unbiased when the model is correct and robust to the assumption of normality because they are computed from variances only.

REML (restricted maximum likelihood) estimates are always non-negative because the estimates are constrained to lie within the parameter space for a variance. REML differs from standard maximum likelihood (ML) in correctly accounting for the estimation of any fixed effects. As a simple example, if $Y_1, \ldots, Y_n$ are independent $N(\mu, \sigma^2)$, the ML estimator of the variance of a single sample, $\sum (Y_i - \bar{Y})^2 / n$, is biased. The REML estimator, $\sum (Y_i - \bar{Y})^2 / (n - 1)$, is the usual unbiased variance estimator. However, when data have multiple levels of variation, REML estimates of variance components are often biased.
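The method-of-moments calculation described for Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name and argument names are hypothetical, and the coefficients (2 measurements per plant, 2 plants per replicate) are taken from the balanced single-id design described in the text.

```python
# Method-of-moments (ANOVA) estimates for the balanced nested design in Table 1.
# Mean squares are copied from Table 1; the function name is illustrative only.

def anova_nested_estimates(ms_rep, ms_plant, ms_meas, n_meas=2, n_plant=2):
    """Solve the expected-mean-square equations
         E[MS_meas]  = s2_meas
         E[MS_plant] = s2_meas + n_meas * s2_plant
         E[MS_rep]   = s2_meas + n_meas * s2_plant + n_meas * n_plant * s2_rep
    for the three variance components (no truncation at zero is applied)."""
    s2_meas = ms_meas
    s2_plant = (ms_plant - ms_meas) / n_meas
    s2_rep = (ms_rep - ms_plant) / (n_meas * n_plant)
    return {"rep": s2_rep, "plant": s2_plant, "meas": s2_meas}


estimates = anova_nested_estimates(ms_rep=0.004857,
                                   ms_plant=0.054766,
                                   ms_meas=0.000370)

for name, value in estimates.items():
    print(f"sigma^2_{name} = {value:.5f}")
```

Because the equations are solved without any constraint, the replicate component comes out negative whenever the replicate mean square is smaller than the plant mean square, as it is for these data.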
The bias arises for two reasons: the constraint that an estimate is non-negative and the adjustment to other variance components that occurs when a negative ANOVA estimate is shifted to zero. For example, the REML estimates for the data in Figure 1 are $\hat\sigma^2_{meas} = 0.00037$, $\hat\sigma^2_{plant} = 0.019$, and $\hat\sigma^2_{rep} = 0$. The replicate variance is estimated as zero, but that forces a shift in the plant-plant variance component (from 0.027 to 0.019). However, the replicate variance is estimated from only two replicates (one degree of freedom), so one should expect a poor estimate.

There is no consensus among statisticians as to which estimator is better. I prefer the ANOVA estimates because they are less dependent on a model and because estimates at one level are not adjusted because of insufficient data at another level. Others prefer REML estimators.

The previous analysis uses only one-sixth of the data: four plants and eight measurements. The entire data set includes 24 plants and 48 measurements. Pooled estimates of variance components using all the data will be more precise, which may eliminate the problem of a negative estimated variance component, provided it is reasonable to assume variance components are the same for all id’s. We will separately consider the measurement variance and the plant-plant variance.

Characteristics of the Measurement Variance

The assumption of equal measurement variance is easy to assess using a plot of the standard deviation of the two measurements per plant against their average (Figure 2). There is a lot of variability because each standard deviation is computed from two measurements, but it is clear the measurement standard deviation tends to increase with the average. When this happens, using log Y instead of Y often equalizes the variances.
As Banks indicates in his article, metabolomic data are usually log-transformed because of the biological focus on ratios naturally expressed on a log scale. The Threonine data illustrate another reason for a transformation: to equalize variances. A useful characteristic of a random variable with a log-normal distribution is that the coefficient of variation is a function of the log-scale variance. If $\log Y \sim N(\mu, \sigma^2)$, then the mean and the variance of the untransformed Y are

$$E\,Y = e^{\mu + \sigma^2/2}, \qquad \mathrm{Var}\,Y = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2},$$

so the coefficient of variation is

$$\mathrm{c.v.}\;Y = \sqrt{\frac{\mathrm{Var}\,Y}{(E\,Y)^2}} = \sqrt{e^{\sigma^2} - 1}.$$

Hence, assuming a constant variance on the log scale is equivalent to assuming a constant coefficient of variation for the untransformed values.

After using a transformation, one should check that it worked as intended. This can be done by plotting the average and standard deviation of the two log-transformed measurements per plant (Figure 3). While there is much less pattern after the transformation, there is still a tendency for the standard deviation to increase with the mean. A stronger transformation in the Box-Cox family, perhaps 1/Y, would do a better job of equalizing the measurement variances for this specific data set. However, a transformation of Y affects all aspects of the model. Before making a final choice, it would be good to assess the characteristics of the plant-plant variability.

Characteristics of Plant-Plant Variation

[Figure 4: scatter plot. Horizontal axis: Plant average, log(Y); vertical axis: Plant s.d., log(Y).]

Figure 4. Plot of the plant-plant standard deviation (s.d.) and average, after log transforming the measurements

[Figure 5: scatter plot. Horizontal axis: Plant average, 1/Y; vertical axis: Plant s.d., 1/Y.]

Figure 5. Plot of the plant-plant standard deviation (s.d.) and average, after using a 1/Y transformation of each measurement

It is harder to assess the characteristics of plant-plant variability (or any variability other than the residual variation) because the plant-plant variation is not directly observed. The only direct information about characteristics of the plant-plant variation comes from averages of the two measurements for each plant. Because these are averages of measurements, characteristics of the plant-plant variation are confounded with those of the measurement variation.

Two approaches can be used to investigate plant-plant variation. One is to assume a model and, based on that model, calculate the best linear unbiased predictor (BLUP) of the random effect, $\beta_{ij}$, associated with each plant $ij$ (i.e., predict the random effect). The other is to ignore the measurement variability and use traditional diagnostics to evaluate the averages for each plant. The second approach is reasonable when the contribution of the measurement variance is approximately the same for all plants. This is the case here for log-transformed data, so we use plant averages to investigate the plant-plant variability.

If observations are log transformed, the standard deviation (s.d.) between plant averages is approximately constant (Figure 4). But, if observations are transformed using the stronger 1/Y transformation, the s.d. between plant averages is clearly not constant (Figure 5). Hence, the analysis will use a log transformation because it provides an approximately constant measurement variance and a constant plant-plant variance.

A Model for All Observations

The model for all 48 observations is then:

$$\log Y_{ijkl} = \mu + \alpha_i + \theta_{ij} + \delta_k + \beta_{ijk} + \gamma_{ijkl} \qquad (2)$$

The Threonine concentration in replicate $i$, extraction batch $j$, id $k$, and measurement $l$ is $Y_{ijkl}$. The overall mean, on the log scale, is denoted $\mu$. The deviation from the mean associated with replicate $i$ is $\alpha_i$. The deviation from the mean of replicate $i$ associated with extraction batch $ij$ is $\theta_{ij}$. The deviation from the mean associated with id $k$ is $\delta_k$. The deviation associated with the plant of id $k$ in extraction batch $ij$ is $\beta_{ijk}$. Within each plant $ijk$, $\gamma_{ijkl}$ is the deviation of observation $ijkl$ from the plant mean. The variability described by the $\gamma_{ijkl}$ includes the variability among extracts and the variability among measurements because there is only one measurement per extract in the 48-observation data set.

All the random effects are assumed to be independent and normally distributed. Each source of variation has its own variance component: $\alpha_i \sim N(0, \sigma^2_{rep})$, $\theta_{ij} \sim N(0, \sigma^2_{batch})$, $\beta_{ijk} \sim N(0, \sigma^2_{plant})$, and $\gamma_{ijkl} \sim N(0, \sigma^2_{meas})$.

Fitting model (2) to the Threonine data gives the ANOVA table in Table 2. The estimated variance components are $\hat\sigma^2_{rep} = 0.16$, $\hat\sigma^2_{batch} = 0.0017$, $\hat\sigma^2_{plant} = 0.086$, and $\hat\sigma^2_{meas} = 0.022$. The REML estimates of the variance components, in this case, are exactly the same because the data are balanced and all estimated variance components are positive.

Table 2. ANOVA Table and Expected Mean Squares for Data From All Six Id’s

Source            d.f.   Sum of Squares   Mean Square   Expected Mean Square
Replicates        1      4.077            4.0770        $\sigma^2_{meas} + 2\sigma^2_{plant} + 12\sigma^2_{batch} + 24\sigma^2_{rep}$
Extraction        2      0.428            0.2138        $\sigma^2_{meas} + 2\sigma^2_{plant} + 12\sigma^2_{batch}$
Id                5      1.922            0.3842        $\sigma^2_{meas} + 2\sigma^2_{plant} + 8\sum \delta_k^2 / 5$
Plants            15     2.908            0.1939        $\sigma^2_{meas} + 2\sigma^2_{plant}$
Measurements      24     0.540            0.0225        $\sigma^2_{meas}$
Corrected Total   47

Table 3. Standard Error (s.e.) of the Difference of Two Treatment Means for Different Choices of Sample Size, Assuming $\sigma^2_{plant} = 0.086$, $\sigma^2_{extract} = 0.022$, and $\sigma^2_{tech} = 0.00034$

Number of:
Plants   Extracts per Plant   Measurements per Extract   s.e. of Difference
4        2                    1                          0.220
4        2                    10                         0.220
4        4                    1                          0.214
8        2                    1                          0.156

Estimating the Variability Between Measurements of the Same Extract

Ding re-measured one of the 48 extracts used in the above analysis.
The two measurements are 2.244 and 2.187. The variance of these values estimates the technical measurement variance (i.e., the variability between measurements made on the same extract). Using log-transformed values, this is $\hat\sigma^2_{tech} = 0.00034$, which is roughly two orders of magnitude less than the combination of measurement and extraction variability. Given an estimate of the technical measurement variance, it is possible to estimate the contribution to the error due to extraction. Because each of the 48 extracts in the original data set was measured once, $\sigma^2_{meas} = \sigma^2_{tech} + \sigma^2_{extract}$, where $\sigma^2_{extract}$ is the variance component between extracts of the same plant. The estimated variance component is $\hat\sigma^2_{extract} = \hat\sigma^2_{meas} - \hat\sigma^2_{tech} = 0.022 - 0.00034 \approx 0.022$. Although $\hat\sigma^2_{tech}$ is not precise because it is a one degree of freedom (d.f.) estimate, it is clear that essentially all the variability between measurements is due to variability between different extractions of a single plant. Almost none of the variability comes from the instrument measurement.

The Uncertainty Budget

Consistent with the earlier results for one id, the biological variance between plants is ca. four times larger than the variance between extractions and two orders of magnitude larger than the technical variability between measurements. The variability between extraction batches is small, but the variability between the two replicates of the study is surprisingly large. The data indicate why the replicate variance component is so large. The average Threonine concentrations are 0.64 and 0.76 nm/mg for the two extraction batches in the first replicate and 1.23 and 1.52 nm/mg for the two extraction batches in the second replicate. The large variance component between replicates makes sense, but the biological reasons for such a large variation are as yet unknown.
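The estimates quoted in this section can be retraced from the Table 2 mean squares and the two repeated measurements. The sketch below is an illustration rather than the authors' code: the variable names are hypothetical, the divisors (2, 12, 24) are read off the expected mean squares in Table 2, and natural logarithms are assumed for the log transformation.

```python
import math

# Mean squares copied from Table 2 (log-transformed Threonine data).
MS = {"rep": 4.0770, "batch": 0.2138, "plant": 0.1939, "meas": 0.0225}

# Equate each mean square to its expected value and solve from the bottom up.
s2_meas = MS["meas"]
s2_plant = (MS["plant"] - MS["meas"]) / 2
s2_batch = (MS["batch"] - MS["plant"]) / 12
s2_rep = (MS["rep"] - MS["batch"]) / 24

# Technical variance: sample variance of the two re-measurements of one
# extract, on the log scale (natural log is an assumption here).
logs = [math.log(2.244), math.log(2.187)]
mean_log = sum(logs) / len(logs)
s2_tech = sum((x - mean_log) ** 2 for x in logs) / (len(logs) - 1)

# Each extract was measured once, so s2_meas = s2_tech + s2_extract.
s2_extract = s2_meas - s2_tech

budget = {
    "replicate": s2_rep,
    "batch": s2_batch,
    "plant": s2_plant,
    "extract": s2_extract,
    "technical": s2_tech,
}

for source, s2 in budget.items():
    print(f"{source:10s} {s2:.5f}")
```

With the rounded measurements printed in the text, the technical variance computed this way is about 0.00033, close to the reported 0.00034; the small difference presumably reflects rounding of the reported values.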
Since the model assumes log-transformed values are normally distributed, the variance components can be converted into coefficients of variation for each component of error, as described previously. The technical measurement c.v. is $\sqrt{\exp(0.00034) - 1} = 1.8\%$, the extraction c.v. is 15.1%, the plant-plant c.v. is 29.9%, the batch c.v. is 4.1%, and the replicate c.v. is 42%.

The uncertainty budget and estimated variance components provide useful information for designing subsequent studies. The goal of Ding’s work is to compare metabolite concentrations among genotypes and mutants. Blocking by extraction and replicate (i.e., measuring all id’s [combinations of genotypes and mutants] in the same extraction and same replicate) increases the precision of comparisons among id’s. When the average metabolite concentration is calculated from r replicates, b extraction batches per replicate (so rb plants per id), e extractions per plant, and m measurements per extract, the variance of the average difference between two id’s is:

$$\mathrm{Var}\left(\bar{Y}_{\cdot\cdot 1\cdot} - \bar{Y}_{\cdot\cdot 2\cdot}\right) = 2\left(\frac{\sigma^2_{plant}}{rb} + \frac{\sigma^2_{extract}}{rbe} + \frac{\sigma^2_{tech}}{rbem}\right).$$

When comparisons are made within blocks, neither the replicate nor batch variances contribute to the variance of the difference. The only variance components that matter are those for plants, extracts, and measurements. Increasing the number of plants, by increasing either the number of replicates, r, or the number of extraction batches, b, decreases the contribution of all three variance components, $\sigma^2_{plant}$, $\sigma^2_{extract}$, and $\sigma^2_{tech}$. This effect is sometimes called hidden replication because increasing the number of plants also increases the numbers of extracts and measurements. An alternative is to retain the same number of plants, but increase the number of extracts or measurements per plant. Assuming the variance components estimated from these data apply to a new study, the expected precision can be calculated for various combinations of the number of plants, the number of extractions per plant, and the number of measurements per extract (Table 3).
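The two calculations in this passage, converting a log-scale variance to a coefficient of variation and computing the standard error of a difference for a proposed design, can be sketched as follows. The function names are hypothetical. The default variance components are the rounded values listed in the caption of Table 3, while the values passed to the c.v. conversion are less-rounded estimates derived from the Table 2 mean squares (an assumption about how the percentages in the text were obtained).

```python
import math


def cv_from_log_variance(s2):
    """c.v. of Y when log Y is normal with variance s2: sqrt(exp(s2) - 1)."""
    return math.sqrt(math.exp(s2) - 1.0)


def se_difference(n_plants, extracts_per_plant, meas_per_extract,
                  s2_plant=0.086, s2_extract=0.022, s2_tech=0.00034):
    """s.e. of the difference between two id means (log scale), where
    n_plants = r * b is the number of plants per id."""
    n_extracts = n_plants * extracts_per_plant
    n_meas = n_extracts * meas_per_extract
    var_of_mean = (s2_plant / n_plants
                   + s2_extract / n_extracts
                   + s2_tech / n_meas)
    return math.sqrt(2.0 * var_of_mean)


# Less-rounded log-scale variances, assumed to be (MS differences) / divisor
# from Table 2; the "extraction" entry uses the measurement mean square.
log_variances = {"technical": 0.00034, "extraction": 0.0225,
                 "plant": 0.0857, "batch": 0.00166, "replicate": 0.161}
cvs = {name: cv_from_log_variance(s2) for name, s2 in log_variances.items()}

# (plants, extracts per plant, measurements per extract), as in Table 3.
designs = [(4, 2, 1), (4, 2, 10), (4, 4, 1), (8, 2, 1)]
se_table = {d: se_difference(*d) for d in designs}

for name, cv in cvs.items():
    print(f"{name:10s} c.v. = {100 * cv:.1f}%")
for design, se in se_table.items():
    print(design, f"s.e. = {se:.3f}")
```

Expressing the design through the number of plants (rather than r and b separately) mirrors the layout of Table 3 and makes the hidden replication visible: the plant count appears in the denominator of every term.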
Because the technical measurement variance is so small relative to the other sources of variability, increasing the number of measurements per extract tenfold has essentially no effect on the precision. Doubling the number of extracts per plant leads to a small increase in precision, but doubling the number of plants markedly increases the precision of the difference. The general advice for designing a study with multiple sources of error would be to replicate “as high up as possible.” In this study, that would be to increase the number of plants, which carry the largest of the variance components that affect comparisons among id’s.

Final Thoughts

Plant metabolomics has given us new biological data for studying the relationship between genotype and phenotype, thereby learning about basic scientific processes. Using data from one metabolite, we have explored the characteristics of measurement and plant-plant variability, constructed an uncertainty budget, and used the estimated variance components to evaluate design choices. We found that the biological variability between plants is larger than the variability between extractions, and considerably larger than the variability between measurements of the same extract.

Similar sorts of evaluations are possible whenever there are replicated observations for each important source of variability, but the details of the statistical model will depend on the experimental design (i.e., whether random effects are crossed or nested). Estimating variance components and identifying the important parts of the uncertainty budget help design more precise and cost-effective studies.

Further Reading

Variance components analysis is described in many intermediate-level applied statistics books. Two of many good chapter-length treatments are in Angela M. Dean and Daniel Voss’ Design and Analysis of Experiments and George E. P. Box, J. Stuart Hunter, and William G. Hunter’s Statistics for Experimenters. Details and many extensions of what has been described here are presented in Shayle R.
Searle, George Casella, and Charles E. McCulloch’s book, Variance Components, and D. R. Cox and P. J. Solomon’s book, Components of Variance.