A Metabolomic Data Uncertainty Budget for the
Plant Arabidopsis thaliana
Philip M. Dixon and Geng Ding
In “Statistics and Metabolomics,” David Banks discusses five
places for collaboration between
statisticians and biologists collecting and interpreting metabolomic
data. Here, we illustrate the first of
those: the construction of an uncertainty budget.
Our example comes from
plant metabolomics. In plant
metabolomics, the measurements are the same as in
human metabolomics: the
concentrations of cellular metabolites usually with a molecular
weight less than 500.
Although the plant metabolome could be used in
the same way as the human metabolome (as a fingerprint for rapid identification of disease), the primary motivation for studying it is its usefulness to
basic science. The metabolome is the intermediary between
enzyme activity, which ultimately is a consequence of the plant
genome, and the phenotype, the observable characteristics of
individual plants.
The metabolome provides a tool for understanding the
function of a gene, even if that gene has minimal or no effect
on the phenotype. Using reverse genetic techniques, it is
possible to create a knockout mutant, in which the DNA
sequence for a specific gene is changed and the gene product is
disabled. Plants with the knockout mutant are then compared
to wild-type controls. The knockout sometimes kills the plant,
sometimes changes the visible phenotype, and sometimes produces plants that look identical to the wild-type.
When the knockout is not lethal, comparing
metabolomes of knockout and wild-type
individuals provides a way to discover
whether the gene of interest has a function
and to understand the metabolic origin
of phenotypic changes.
VOL. 20, NO. 1, 2007
The Data Set
The data used here are part of Geng Ding’s investigation of
knockout mutants for an enzyme that degrades an amino acid.
The plants being studied are Arabidopsis thaliana, a model
organism widely used in plant science. For each of two mutants,
Ding has plants of three genotypes, differing in the number
of copies (zero, one, or two) of the knockout DNA. These
genotypes have subtle differences in the phenotype, but the
differences are tiny during vegetative growth. Ding’s biological
goal is to compare mean concentrations of each of 18 amino
acids among the six combinations of two mutants and three
genotypes, called “id’s” henceforth. Two plants of each id were
grown in a homogeneous environment. At harvest, tissue from
each of the 12 plants was split into two containers, yielding
24 samples. Because it wasn’t feasible to extract amino acids
from all 24 samples at the same time, extraction was done in
two batches. The first 12 containers, from one plant of each
of the six id’s, were extracted in one batch. Then, the remaining 12 samples (from the other six plants) were extracted in a second batch. Each of the
24 samples (6 id’s × 2 plants per id × 2 extracts per plant) was
then measured.
Amino acid concentrations were measured by a gas
chromatograph with a flame ionization detector (GC-FID).
This detector measured the amount of carbon-containing
compounds coming out of the GC every few seconds. Specific
amino acids were identified by comparing their retention time
in the GC to known standards. Each amino acid was quantified
by integrating the signal from a distinct peak, normalizing by
an internal standard, and using a calibration curve to determine
the amino acid concentration. Concentrations were expressed
as micromoles of amino acid per gram of plant fresh weight.
This data collection scheme, a complete block design,
provides a measure of variability between extraction batches,
a measure of the biological variability between plants, and
measures of the differences among id’s. The variability between
the two samples from the same plant includes the within-plant variability, the variability between extractions, and the
variability between measurements. The entire study was then
repeated to give a total of 48 samples. The variability between
repetitions provides a measure of the repeatability of the
results, including long-term drift in the measuring process and
the growth environment. One extract was measured twice. The
two measurements for this extract provide an estimate of the
variability between measurements. Because only one extract
was measured twice, the second measurement is omitted from
most of the analyses described here.
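The sampling scheme described above can be written out as a small enumeration. This is only an illustrative sketch: the id labels and the dictionary layout are hypothetical, and the single re-measured extract is left out.

```python
from itertools import product

replicates = [1, 2]   # the whole study was run twice
batches = [1, 2]      # extraction batches within each replicate
# Hypothetical labels for the six id's (two mutants x 0, 1, or 2 knockout copies).
ids = [f"mutant{m}_copies{c}" for m in (1, 2) for c in (0, 1, 2)]
extracts = [1, 2]     # containers (extracts) per plant

# One plant per id in each extraction batch; each plant is split into two extracts.
samples = [
    {"rep": r, "batch": b, "id": i, "plant": (r, b, i), "extract": e}
    for r, b, i, e in product(replicates, batches, ids, extracts)
]

n_plants = len({s["plant"] for s in samples})
print(len(samples), "samples from", n_plants, "plants")
```

Counting the rows this way gives the totals used in the rest of the article: 48 samples from 24 plants, with every id present in every extraction batch.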
[Figure 1: dot plot with three columns labeled Replicate, Plants, and Measurements; y-axis: THR concentration (nm/mg).]
Figure 1. Variability between replicates, plants within replicates,
and measurements within plants for one id. The two dots labeled
“Replicate” are the averages for replicate 1 and replicate 2. The
four dots labeled “Plants” are the averages for the two plants
from each replicate, sorted by replicate average and labeled by
the replicate number. The eight dots labeled “Measurements” are
the measurements from each plant, sorted by plant average and
labeled by replicate and plant number. Each column average is
indicated by –.
Components of the Uncertainty Budget
Our goal here is to quantify components of the uncertainty
budget. Each level of replication (two repetitions of the study,
two extraction batches, two plants per repetition and treatment,
and two extractions/measurements per plant) is a component
of the uncertainty budget that can be quantified by estimating
variance components. Extraction batches are nested within
repetitions, id’s are crossed with extraction batches, plants are
nested within extraction batches, and extractions/measurements
are nested within plants. Although Banks indicates many
specific reasons for sampling and measurement variability
in his article, most are confounded in this study and cannot
be separated. Ding’s sampling design provides an estimate of
the biological variability between plants, which is crucial for
the comparison of genotypes and mutants. It also provides
an estimate of the variability between extracts. The single
extract measured twice provides an estimate of the variability
between measurements.
Many designs to estimate variance components use
only nested sampling. One example would be a design that
grows plants of one genotype in three pots. Three plants are
individually harvested and extracted from each pot, and then
each extract is measured twice. Ding’s design introduces crossed
effects because of the blocking by replicate and extraction
batch. This blocking provides more precise estimates of
differences between genotypes and mutants, but it complicates
the analysis of variance components. We will first illustrate a
typical analysis of nested effects by considering data from only
one id. Then, we will illustrate the analysis of the entire data
set. These are illustrated using data for one of the 18 amino
acids measured by Ding, Threonine.
A Model for Nested Random Effects
The data for a single id include only one plant per extraction
batch, so there are three nested random effects: between
repetitions, between plants in a repetition, and between
confounded extracts and measurements. One commonly used
model for nested random effects can be written as:
$$Y_{ijk} = \mu + \alpha_i + \beta_{ij} + \gamma_{ijk} \qquad (1)$$
The concentration of Threonine in replicate i, plant j, and
measurement k is denoted $Y_{ijk}$. The overall mean Threonine
concentration is denoted $\mu$. The deviation from the mean
associated with replicate i is $\alpha_i$. The deviation from the mean
of replicate i associated with plant ij is $\beta_{ij}$. Within plant ij,
the deviation of measurement ijk about the plant mean is
$\gamma_{ijk}$. The terms $\alpha_i$, $\beta_{ij}$, and $\gamma_{ijk}$ are considered random effects
when the goal of the analysis is to estimate the magnitude of
their variability. It is common to assume all random effects
are independent normal random variables with constant
variance: $\alpha_i \sim N(0, \sigma^2_{rep})$, $\beta_{ij} \sim N(0, \sigma^2_{plant})$, and $\gamma_{ijk} \sim N(0, \sigma^2_{meas})$.
The variance between observations from a randomly chosen
replicate, plant, and measurement is the sum of the three variance components,
$\mathrm{Var}\,Y_{ijk} = \sigma^2_{rep} + \sigma^2_{plant} + \sigma^2_{meas}$. In this sense, the
variance components partition the random variation among
observations into components associated with each source
of uncertainty.
Table 1. ANOVA Table and Expected Mean Squares for Data From a Single Id
Source                        d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates                    1      0.004857         0.004857      $\sigma^2_{meas} + 2\sigma^2_{plant} + 4\sigma^2_{rep}$
Plants(Reps)                  2      0.109532         0.054766      $\sigma^2_{meas} + 2\sigma^2_{plant}$
Measurements (Plants, Reps)   4      0.001480         0.000370      $\sigma^2_{meas}$
Corrected Total               7      0.115869
CHANCE
[Figure 2: scatter plot; x-axis: average of measurements; y-axis: SD of measurements.]
Figure 2. Plot of the standard deviation (s.d.) and average of the two
measurements per plant. Both X and Y axes are log scaled.
[Figure 3: scatter plot; x-axis: average of log transf. meas.; y-axis: SD of log transf. meas.]
Figure 3. Plot of the standard deviation and average of the log-transformed measurements per plant. The y-axis is log scaled; the x-axis
is not because some averages are less than zero.
The data for one id are shown in Figure 1. The two
replicate averages are similar, the four plant averages are quite
different (even when compared within the replicate), and the
measurements from the same plant are very similar. The pooled
variance between measurements on the same plant estimates
$\sigma^2_{meas}$, but the pooled variance between plant averages ($\bar{Y}_{ij\cdot}$),
using dots as subscripts to indicate averaging (i.e., $\bar{Y}_{ij\cdot} = (Y_{ij1} + Y_{ij2})/2$),
overestimates $\sigma^2_{plant}$. This is because, within a replicate
(i.e., conditional on $\alpha_i$), the variance between plant averages,
$\mathrm{Var}\,\bar{Y}_{ij\cdot}$, is $\sigma^2_{plant} + \sigma^2_{meas}/2$, which is larger than $\sigma^2_{plant}$ if $\sigma^2_{meas} > 0$.
Similarly, the variance between replicate averages,
$\mathrm{Var}\,\bar{Y}_{i\cdot\cdot} = \sigma^2_{rep} + \sigma^2_{plant}/2 + \sigma^2_{meas}/4$, overestimates $\sigma^2_{rep}$.
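The two overestimation results can be checked numerically. The following is a minimal simulation sketch of model (1), not part of the original analysis: the variance values, the overall mean, the seed, and the function name `simulate_replicate` are all assumptions chosen for illustration (the replicate variance in particular is hypothetical).

```python
import random
import statistics

# Hypothetical variance components and mean, chosen only for illustration.
VAR_REP, VAR_PLANT, VAR_MEAS = 0.010, 0.027, 0.00037
MU = 0.6

def simulate_replicate(rng):
    """Return two plants x two measurements generated from model (1)."""
    alpha = rng.gauss(0.0, VAR_REP ** 0.5)        # replicate effect
    plants = []
    for _ in range(2):
        beta = rng.gauss(0.0, VAR_PLANT ** 0.5)   # plant-within-replicate effect
        plants.append([MU + alpha + beta + rng.gauss(0.0, VAR_MEAS ** 0.5)
                       for _ in range(2)])        # measurement deviations
    return plants

rng = random.Random(1)
n_reps = 20000
within_rep_var = []   # variance between the two plant averages, per replicate
rep_means = []        # replicate averages
for _ in range(n_reps):
    plant_means = [statistics.fmean(p) for p in simulate_replicate(rng)]
    within_rep_var.append(statistics.variance(plant_means))
    rep_means.append(statistics.fmean(plant_means))

var_plant_avg = statistics.fmean(within_rep_var)   # pooled within-replicate variance
var_rep_avg = statistics.variance(rep_means)

theory_plant_avg = VAR_PLANT + VAR_MEAS / 2
theory_rep_avg = VAR_REP + VAR_PLANT / 2 + VAR_MEAS / 4
print(f"Var of plant averages:     {var_plant_avg:.5f} (theory {theory_plant_avg:.5f})")
print(f"Var of replicate averages: {var_rep_avg:.5f} (theory {theory_rep_avg:.5f})")
```

With many simulated replicates, both empirical variances should settle near the theoretical sums rather than near $\sigma^2_{plant}$ or $\sigma^2_{rep}$ alone, which is why the raw variances of averages cannot be used directly as variance component estimates.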
Estimators of Variance Components
Although there are many estimators of the variance
components ($\sigma^2_{rep}$, $\sigma^2_{plant}$, and $\sigma^2_{meas}$), the two most commonly
used are the ANOVA and REML estimators. The ANOVA, or
method-of-moments estimator, starts with an ANOVA table
quantifying the observed variability for each component. The
variance components are estimated by equating the observed
mean squares to their expected values—the expected mean
squares—and solving for the variance components. For the
Threonine data in Figure 1, the ANOVA table and expected
mean squares are given in Table 1.
The estimated variance components are $\hat{\sigma}^2_{meas} = 0.00037$,
$\hat{\sigma}^2_{plant} = 0.027$, and $\hat{\sigma}^2_{rep} = -0.012$. The variance component for
plants is much larger than that for measurements, consistent
with the pattern in Figure 1. The negative estimate for replicates
is disconcerting, since a variance must be non-negative.
Negative ANOVA estimates often occur when the parameter
is close to zero, when the degrees of freedom for the effect are
small, when there are outliers, or when the model is wrong.
However, ANOVA estimates are unbiased when the model
is correct and robust to the assumption of normality because
they are computed from variances only.
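The method-of-moments step amounts to a few lines of arithmetic. The sketch below solves the expected mean squares of Table 1 for the three components, using the mean squares printed in that table; the variable names are ours.

```python
# Mean squares from Table 1 (Threonine, one id).
ms_rep, ms_plant, ms_meas = 0.004857, 0.054766, 0.000370

# Expected mean squares for this balanced design
# (2 measurements per plant, 2 plants per replicate):
#   E[MS_meas]  = s2_meas
#   E[MS_plant] = s2_meas + 2 * s2_plant
#   E[MS_rep]   = s2_meas + 2 * s2_plant + 4 * s2_rep
s2_meas = ms_meas
s2_plant = (ms_plant - ms_meas) / 2
s2_rep = (ms_rep - ms_plant) / 4

print(f"s2_meas  = {s2_meas:.5f}")
print(f"s2_plant = {s2_plant:.3f}")
print(f"s2_rep   = {s2_rep:.3f}")
```

The replicate estimate is negative simply because the replicate mean square is smaller than the plant mean square; nothing in the arithmetic prevents that.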
REML, restricted maximum likelihood, estimates are always
non-negative because the estimates are constrained to lie
within the parameter space for a variance. REML differs from
standard maximum likelihood (ML) in correctly accounting
for the estimation of any fixed effects. As a simple example, if
$Y_1, \ldots, Y_n$ are independent $N(\mu, \sigma^2)$, the ML estimator of the variance of
a single sample, $\sum (Y_i - \bar{Y})^2 / n$, is biased. The REML estimator,
$\sum (Y_i - \bar{Y})^2 / (n - 1)$, is the usual unbiased variance estimator.
However, when data have multiple levels of variation, REML
estimates of variance components are often biased. The bias
arises for two reasons: the constraint that an estimate is non-negative and the adjustment to other variance components
that occurs when a negative ANOVA estimate is shifted to
zero. For example, the REML estimates for the data in Figure
1 are $\hat{\sigma}^2_{meas} = 0.00037$, $\hat{\sigma}^2_{plant} = 0.019$, and $\hat{\sigma}^2_{rep} = 0$. The replicate
variance is estimated as zero, but that forces a shift in the plant-plant variance component (from 0.027 to 0.019). However, the
replicate variance is estimated from only two replicates (one
degree of freedom), so one should expect a poor estimate.
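One way to see where the shifted value 0.019 comes from: in a balanced nested design, holding the replicate variance at zero means replicates and plants within replicates share the same expected mean square, so their sums of squares can be pooled (3 d.f.) before solving for the plant variance. The sketch below assumes this pooling rule describes the constrained solution for these data. It reproduces the reported value from the Table 1 entries, but it is not a general REML routine.

```python
# Sums of squares and degrees of freedom from Table 1.
ss_rep, df_rep = 0.004857, 1
ss_plant, df_plant = 0.109532, 2
ms_meas = 0.000370

# Unconstrained ANOVA estimate of the replicate variance.
s2_rep_anova = (ss_rep / df_rep - ss_plant / df_plant) / 4

if s2_rep_anova < 0:
    # Hold the replicate variance at zero and pool replicates with plants.
    s2_rep = 0.0
    ms_pooled = (ss_rep + ss_plant) / (df_rep + df_plant)
    s2_plant = (ms_pooled - ms_meas) / 2
else:
    # No constraint is active, so the ANOVA estimates are used as they are.
    s2_rep = s2_rep_anova
    s2_plant = (ss_plant / df_plant - ms_meas) / 2

print(f"s2_rep = {s2_rep:.3f}, s2_plant = {s2_plant:.3f}")
```

The small replicate sum of squares pulls the pooled mean square below the plant mean square, which is what lowers the plant estimate.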
There is no consensus among statisticians as to which
estimator is better. I prefer the ANOVA estimates because they
are less dependent on a model and because estimates at one
level are not adjusted because of insufficient data at another
level. Others prefer REML estimators.
The previous analysis uses only one-sixth of the data:
four plants and eight measurements.
The entire data set includes 24 plants and 48 measurements.
Pooled estimates of variance components using all the data
will be more precise, which may eliminate the problem of
a negative estimated variance component if it is reasonable
to assume variance components are the same for all id’s. We
will separately consider the measurement variance and the
plant-plant variance.
Characteristics of the Measurement Variance
The assumption of equal measurement variance is easy to assess
using a plot of the average of the two measurements per plant
against the standard deviation of those two measurements
(Figure 2). There is a lot of variability because each standard
deviation is computed from two measurements, but it is clear
the measurement standard deviation tends to increase with
the average. When this happens, using log Y instead of Y
often equalizes the variances. As Banks indicates in his article,
metabolomic data are usually log-transformed because of the
biological focus on ratios naturally expressed on a log scale. The
Threonine data illustrate another reason for a transformation:
to equalize variances.
A useful characteristic of a random variable with a log
normal distribution is that the coefficient of variation is a
function of the log scale variance. If $\log Y \sim N(\mu, \sigma^2)$, then
the mean and the variance of the untransformed Y are
$E\,Y = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}\,Y = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}$, so the coefficient of variation is
$\mathrm{c.v.}\,Y = \sqrt{\mathrm{Var}\,Y / (E\,Y)^2} = \sqrt{e^{\sigma^2} - 1}$. Hence, assuming a constant variance
on the log scale is equivalent to assuming a constant coefficient
of variation for the untransformed values.
After using a transformation, one should check that it worked
as intended. This can be done by plotting the average and
standard deviation of the two log-transformed measurements
per plant (Figure 3). While there is much less pattern after
the transformation, there is still a tendency for the standard
deviation to increase with the mean. A stronger transformation
in the Box-Cox family, perhaps 1/Y, would do a better job of
equalizing the measurement variances for this specific data set.
However, a transformation of Y affects all aspects of the model.
Before making a final choice, it would be good to assess the
characteristics of the plant-plant variability.
Table 2. ANOVA Table and Expected Mean Squares for Data From All Six Id’s
Source            d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates        1      4.077            4.0770        $\sigma^2_{meas} + 2\sigma^2_{plant} + 12\sigma^2_{batch} + 24\sigma^2_{rep}$
Extraction        2      0.428            0.2138        $\sigma^2_{meas} + 2\sigma^2_{plant} + 12\sigma^2_{batch}$
Id                5      1.922            0.3842        $\sigma^2_{meas} + 2\sigma^2_{plant} + 8 \sum \delta_k^2 / 5$
Plants            15     2.908            0.1939        $\sigma^2_{meas} + 2\sigma^2_{plant}$
Measurements      24     0.540            0.0225        $\sigma^2_{meas}$
Corrected Total   47
[Figure 4: scatter plot; x-axis: plant average, log(Y); y-axis: plant s.d., log(Y).]
Figure 4. Plot of the plant-plant standard deviation (s.d.) and average,
after log transforming the measurements
[Figure 5: scatter plot; x-axis: plant average, 1/Y; y-axis: plant s.d., 1/Y.]
Figure 5. Plot of the plant-plant standard deviation (s.d.) and average,
after using a 1/Y transformation of each measurement
Characteristics of Plant-Plant Variation
It is harder to assess the characteristics of plant-plant variability (or any variability other than the residual variation)
because the plant-plant variation is not directly observed.
The only direct information about characteristics of the
plant-plant variation comes from averages of the two
measurements for each plant. Because these are averages
of measurements, characteristics of the plant-plant variation are
confounded with those of the measurement variation.
Two approaches can be used to investigate plant-plant
variation. One is to assume a model and, based on that
model, calculate the best linear unbiased predictor (BLUP)
of each random effect (i.e., predict the random effect), $\beta_{ij}$,
associated with plant ij. The other is to ignore the measurement variability and use traditional diagnostics to evaluate
the averages for each plant. The second approach is reasonable when the contribution of the measurement variance is
approximately the same for all plants. This is the case here for
log-transformed data, so we use plant averages to investigate
the plant-plant variability.
If observations are log transformed, the standard deviation
(s.d.) between plant averages is approximately constant (Figure 4). But, if observations are transformed using the stronger
1/Y transformation, the s.d. between plant averages is clearly
not constant (Figure 5). Hence, the analysis will use a log
transformation because it provides an approximately constant
measurement variance and a constant plant-plant variance.
A Model for All Observations
The model for all 48 observations is then:
$$\log Y_{ijkl} = \mu + \alpha_i + \theta_{ij} + \delta_k + \beta_{ijk} + \gamma_{ijkl} \qquad (2)$$
The Threonine concentration in replicate i, extraction batch
j, id k, and measurement l is $Y_{ijkl}$. The overall mean (on the log scale) is
denoted $\mu$. The deviation from the mean
associated with replicate i is $\alpha_i$. The deviation from the mean
of replicate i associated with extraction batch ij is $\theta_{ij}$. The
deviation from the mean associated with id k is $\delta_k$. The deviation associated with each plant ijk for each id k in extraction
batch ij is $\beta_{ijk}$. Within each plant ijk, $\gamma_{ijkl}$ is the deviation of
the observation ijkl from the plant mean. The variability
described by the $\gamma_{ijkl}$ includes the variability among extracts
and variability among measurements because there is only one
measurement per extract in the 48-observation data set. All the
random effects are assumed to be independent and normally
distributed. Each source of variation has its own variance
component: $\alpha_i \sim N(0, \sigma^2_{rep})$, $\theta_{ij} \sim N(0, \sigma^2_{batch})$, $\beta_{ijk} \sim N(0, \sigma^2_{plant})$, and
$\gamma_{ijkl} \sim N(0, \sigma^2_{meas})$.
Fitting model (2) to the Threonine data gives the ANOVA
table in Table 2. The estimated variance components are
$\hat{\sigma}^2_{rep} = 0.16$, $\hat{\sigma}^2_{batch} = 0.0017$, $\hat{\sigma}^2_{plant} = 0.086$, and $\hat{\sigma}^2_{meas} = 0.022$.
The REML estimates of the variance components, in this
case, are exactly the same because the data are balanced and
all estimated variance components are positive.
Table 3. Standard Error (s.e.) of the Difference of
Two Treatment Means for Different Choices of Sample
Size, Assuming $\sigma^2_{plant} = 0.086$, $\sigma^2_{extract} = 0.022$, and
$\sigma^2_{tech} = 0.00034$
Number of:
Plants   Extracts per Plant   Measurements per Extract   s.e. of Difference
4        2                    1                          0.220
4        2                    10                         0.220
4        4                    1                          0.214
8        2                    1                          0.156
VOL. 21, NO. 2, 2008
Estimating the Variability Between
Measurements of the Same Extract
Ding re-measured one of the 48 extracts used in the above analysis. The two measurements are 2.244 and 2.187. The variance
of these values estimates the technical measurement variance
(i.e., the variability between measurements made on the same
extract). Using log-transformed values, this is $\hat{\sigma}^2_{tech} = 0.00034$,
which is two orders of magnitude less than the combination of
measurement and extraction variability. Given an estimate of
the technical measurement variance, it is possible to estimate
the contribution to the error due to extraction. Because each
of the 48 extracts in the original data set was measured once,
$\sigma^2_{meas} = \sigma^2_{tech} + \sigma^2_{extract}$, where $\sigma^2_{extract}$ is the variance component
between extracts of the same plant. The estimated variance
component is $\hat{\sigma}^2_{extract} = \hat{\sigma}^2_{meas} - \hat{\sigma}^2_{tech} = 0.022 - 0.00034 = 0.022$.
Although $\hat{\sigma}^2_{tech}$ is not precise because it is a one degree of freedom (d.f.) estimate, it is clear that essentially all the variability
between measurements is due to variability between different
extractions of a single plant. Almost none of the variability
comes from the instrument measurement.
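The subtraction can be reproduced from the two re-measured values. This sketch assumes natural logarithms and uses the measurement mean square printed in Table 2. Because the two measurements are quoted to only three decimals, the computed technical variance comes out near, rather than exactly at, the reported 0.00034.

```python
import math
import statistics

remeasured = [2.244, 2.187]   # two measurements of the same extract (from the text)

# Sample variance of the log-transformed pair: a one d.f. estimate.
s2_tech = statistics.variance([math.log(y) for y in remeasured])

s2_meas = 0.0225                  # measurement mean square from Table 2
s2_extract = s2_meas - s2_tech    # because s2_meas = s2_tech + s2_extract

print(f"s2_tech    = {s2_tech:.5f}")
print(f"s2_extract = {s2_extract:.3f}")
print(f"share of s2_meas due to extraction = {s2_extract / s2_meas:.1%}")
```

Whichever of the two technical values is used, the extraction component still rounds to 0.022, which is the point of the paragraph above.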
The Uncertainty Budget
Consistent with the earlier results for one id, the biological
variance between plants is ca. four times larger than the variance
between extractions and two orders of magnitude larger than
the technical variability between measurements. The variability
between different extraction batches is small, but the variability between
the two replicates of the study is surprisingly large. The data
indicate why the replicate variance component is so large. The
average Threonine concentrations are 0.64 and 0.76 nm/mg
for the two extraction batches in the first replicate and 1.23 and 1.52
nm/mg for the two extraction batches in the second replicate. The
large variance component between replicates makes sense,
but the biological reasons for such a large variation are, as of
yet, unknown.
Since the model assumes log-transformed values are
normally distributed, the variance components can be converted
into coefficients of variation for each component of error,
as described previously. The technical measurement c.v. is
$\sqrt{\exp(0.00034) - 1} = 1.8\%$, the extraction c.v. is 15.1%, the
plant-plant c.v. is 29.9%, the batch c.v. is 4.1%, and the
replicate c.v. is 42%.
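These conversions can be checked from the printed tables. The sketch below first recovers the variance components from the Table 2 mean squares by the method of moments, then applies the lognormal relation c.v. = sqrt(exp(variance) - 1). The helper name `cv_from_log_variance` is ours, and because the inputs are rounded table entries the last digit may differ slightly from the values quoted above.

```python
import math

# Mean squares from Table 2 (log-transformed Threonine, all six id's).
ms = {"rep": 4.0770, "batch": 0.2138, "plant": 0.1939, "meas": 0.0225}

# ANOVA (method-of-moments) estimates from the expected mean squares in Table 2.
s2 = {
    "meas": ms["meas"],
    "plant": (ms["plant"] - ms["meas"]) / 2,
    "batch": (ms["batch"] - ms["plant"]) / 12,
    "rep": (ms["rep"] - ms["batch"]) / 24,
}
s2["tech"] = 0.00034                      # from the re-measured extract
s2["extract"] = s2["meas"] - s2["tech"]   # s2_meas = s2_tech + s2_extract

def cv_from_log_variance(var_log):
    """Coefficient of variation of Y when log Y is normal with variance var_log."""
    return math.sqrt(math.exp(var_log) - 1.0)

cv = {name: cv_from_log_variance(s2[name])
      for name in ("tech", "extract", "plant", "batch", "rep")}

for name, value in cv.items():
    print(f"{name:8s} variance = {s2[name]:.5f}   c.v. = {value:.1%}")
```

For small variances the c.v. is close to the log-scale standard deviation, which is why the technical c.v. of about 1.8% is nearly sqrt(0.00034).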
The uncertainty budget and estimated variance components
provide useful information for designing subsequent
studies. The goal of Ding’s work is to compare metabolite
concentrations among genotypes and mutants. Blocking by
extraction and replicate (i.e., measuring all id’s [combinations
of genotypes and mutants] in the same extraction and same
replicate) increases the precision of comparisons among id’s.
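When the average metabolite concentration is calculated from
r replicates, b batches per replicate, e extractions per plant, and m measurements per
extract, the variance of the average difference between two
id’s is:
$$\mathrm{Var}(\bar{Y}_{\cdot\cdot 1\cdot} - \bar{Y}_{\cdot\cdot 2\cdot}) = 2\left(\frac{\sigma^2_{plant}}{rb} + \frac{\sigma^2_{extract}}{rbe} + \frac{\sigma^2_{tech}}{rbem}\right).$$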
When comparisons are made within blocks, neither the
replicate nor batch variances contribute to the variance of
the difference. The only variance components that matter are
those for plants, extracts, and measurements.
Increasing the number of plants (by increasing either the
number of replicates, r, or the number of extraction batches,
b) decreases the contribution of all three variance components,
$\sigma^2_{plant}$, $\sigma^2_{extract}$, and $\sigma^2_{tech}$. This effect is sometimes called hidden
replication because increasing the number of plants also
increases the numbers of extracts and measurements. An
alternative is to retain the same number of plants, but increase
the number of extracts or measurements per plant. Assuming
the variance components estimated from these data apply to a
new study, the expected precision can be calculated for various
combinations of # of plants, # of extractions per plant, and #
measurements per extract (Table 3).
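The entries of Table 3 follow directly from the variance formula for a within-block difference. The sketch below is an illustration with our own function name, `se_difference`. It takes the number of plants per id (r times b in the notation above) as a single argument and uses the variance components assumed in the table caption.

```python
import math

# Variance components assumed in Table 3 (log scale).
S2_PLANT, S2_EXTRACT, S2_TECH = 0.086, 0.022, 0.00034

def se_difference(plants, extracts_per_plant, meas_per_extract):
    """s.e. of the difference between two id means compared within blocks.

    `plants` is the number of plants per id (r * b in the text's notation).
    """
    var_mean = (S2_PLANT / plants
                + S2_EXTRACT / (plants * extracts_per_plant)
                + S2_TECH / (plants * extracts_per_plant * meas_per_extract))
    return math.sqrt(2.0 * var_mean)

designs = [(4, 2, 1), (4, 2, 10), (4, 4, 1), (8, 2, 1)]
results = {d: se_difference(*d) for d in designs}

for design, se in results.items():
    print(design, f"{se:.3f}")
```

Other candidate designs can be compared by adding tuples to `designs`, provided the same variance components are assumed to carry over to the new study.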
Because the technical measurement variance is so
small, relative to the other
sources of variability,
increasing the number of
measurements per extract
tenfold has essentially
no effect on the precision.
Doubling the number of extracts
per plant leads to a small increase
in precision, but doubling the number of plants markedly increases the
precision of the difference. The general advice for designing a
study with multiple sources of error would be to replicate “as
high up as possible.” In this study, that would be to increase the
number of plants, the level with the largest of the variance components that affect comparisons among id’s.
Final Thoughts
Plant metabolomics has given us new biological data for studying the relationship between genotype and phenotype, thereby
learning about basic scientific processes. Using data from one
metabolite, we have explored the characteristics of measurement and plant-plant variability, constructed an uncertainty
budget, and used the estimated variance components to
evaluate design choices. We found that the biological variability between plants is larger than the variability between
extractions, and considerably larger than the variability
between measurements of the same extract. Similar sorts of
evaluations are possible whenever there are replicated observations for each important source of variability, but the details
of the statistical model will depend on the experimental design
(i.e., whether random effects are crossed or nested). Estimating
variance components and identifying the important parts of
the uncertainty budget help design more precise and cost-effective studies.
Further Reading
Variance components analysis is described in many intermediate-level applied statistics books. Two of many good
chapter-length treatments are in Angela M. Dean and Daniel
Voss’ Design and Analysis of Experiments and George E. P. Box, J.
Stuart Hunter, and William G. Hunter’s Statistics for Experimenters.
Details and many extensions of what has been described here
are presented in Shayle R. Searle, George Casella, and Charles
E. McCulloch’s book, Variance Components, and D. R. Cox and P.
J. Solomon’s book, Components of Variance.