Anal. Chem. 2004, 76, 1738-1745
A Strategy for Identifying Differences in Large
Series of Metabolomic Samples Analyzed by GC/MS
Pär Jonsson,†,‡ Jonas Gullberg,‡,§ Anders Nordström,§ Miyako Kusano,§ Mariusz Kowalczyk,§
Michael Sjöström,† and Thomas Moritz*,§
Umeå Plant Science Center, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural
Sciences, SE-901 87 Umeå, Sweden, and Research Group for Chemometrics, Organic Chemistry, Department of
Chemistry, Umeå University, SE-901 87 Umeå, Sweden
In metabolomics, the purpose is to identify and quantify
all the metabolites in a biological system. Combined gas
chromatography and mass spectrometry (GC/MS) is one
of the most commonly used techniques in metabolomics
together with 1H NMR, and it has been shown that more
than 300 compounds can be distinguished with GC/MS
after deconvolution of overlapping peaks. To avoid having
to deconvolute all analyzed samples prior to multivariate
analysis of the data, we have developed a strategy for rapid
comparison of nonprocessed MS data files. The method
includes baseline correction, alignment, time window
determinations, alternating regression, PLS-DA, and identification of retention time windows in the chromatograms
that explain the differences between the samples. Use of
alternating regression also gives interpretable loadings,
which retain the information provided by m/z values that
vary between the samples in each retention time window.
The method has been applied to plant extracts derived
from leaves of different developmental stages and plants
subjected to small changes in day length. The data show
that the new method can detect differences between the
samples and that it gives results comparable to those
obtained when deconvolution is applied prior to the
multivariate analysis. We suggest that this method can be
used for rapid comparison of large sets of GC/MS data,
thereby applying time-consuming deconvolution only to
parts of the chromatograms that contribute to explain the
differences between the samples.
In biology there is intense interest in large-scale analyses of
gene transcript levels (microarray analysis), proteins (proteomics),
and metabolites (metabolomics). The general aim is to obtain
information that can explain and identify the differences between
certain sets of organisms (e.g., differences in genotypes) or
between people affected by various diseases and disease-free
controls or elucidate factors that influence biochemical events.
For example, in proteomics, samples analyzed by 2D-gel electrophoresis are compared and protein spots that differ between them
can be identified by appropriate methods. In metabolomics, the
metabolic profiles of different samples are compared and
contrasted, usually either by gas chromatography/mass spectrometry (GC/MS) (for a review, see Fiehn1) or by NMR.2 One
of the advantages of GC/MS compared to NMR is that large
numbers of substances can be sensitively determined in a single
analysis of a very complex sample. In some metabolomics studies,
more than 300 compounds have been identified and quantified in
a single GC/MS chromatogram.3 However, in such analyses, a
large number of the compounds coelute or are not completely
chromatographically resolved, so mathematical curve resolution
procedures, often called deconvolution, need to be applied to
obtain accurate mass spectra and to resolve the chromatographic
peaks.
* Corresponding author. E-mail: thomas.moritz@genfys.slu.se.
† Umeå University.
‡ P.J. and J.G. contributed equally to the work.
§ Swedish University of Agricultural Sciences.
1738 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
Multivariate methods have been developed and used to clarify
chromatographic and spectral profiles from overlapping chromatographic peaks obtained using various types of hyphenated chromatography systems, including GC/MS, HPLC/DAD, and HPLC/
MS. The multivariate curve resolution methods can be divided
into iterative, noniterative, and hybrid approaches.4 The iterative
methods all define start profiles, but the procedures for selecting
initial estimates and resolution differ among them. Examples of
iterative methods include iterative target transformation factor
analysis5 and alternating regression (AR).6 The disadvantage is,
generally, that high-quality results require a good choice of
starting vectors. The noniterative methods, such as orthogonal
projections7 and heuristic evolving latent projections,8 involve rank
analysis of evolving matrices. The main disadvantage with these
methods is that they are very difficult to automate, due to the
need to define analyte elution windows by local rank analysis.
Hybrid methods, such as automatic window factor analysis9 and
Gentle,10 start from a set of key spectra from which concentration
and spectral profiles are estimated. Hybrid and iterative methods
are the only real options for automatic curve resolution. Recently,
Gentle has been shown to be successful for automatic data processing
of LC/MS analyses.11 Although automatic curve resolution of GC/MS
chromatograms can also be applied with a high degree of
success, e.g., by using the freely available AMDIS software,12 it
is still a fairly slow process. All samples have to be resolved
separately, and the estimated spectral profiles of all samples must
be carefully checked in order to obtain reliable mass spectra and
peak areas. Therefore, comparing large numbers of samples, and
thus increasing throughput, requires pattern recognition routines
that can compare samples quickly but still retain information that
describes the differences between the samples. In metabolomic
studies, this includes data that enable the identification and
quantification of metabolites that differ significantly in concentration between
samples.
(1) Fiehn, O. Plant Mol. Biol. 2002, 48, 155-171.
(2) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug
Discovery 2002, 1, 153-161.
(3) Fiehn, O.; Kopka, J.; Dormann, P.; Altmann, T.; Trethewey, R. N.; Willmitzer,
L. Nat. Biotechnol. 2000, 18, 1157-1161.
(4) Liang, Y. Z.; Kvalheim, O. M. Fresenius J. Anal. Chem. 2001, 370, 694-704.
(5) Gemperline, P. J. J. Chem. Inf. Comput. Sci. 1984, 24, 206-212.
(6) Karjalainen, E. J. Chemom. Intell. Lab. Syst. 1989, 7, 31-38.
(7) Liang, Y. Z.; Kvalheim, O. M. Anal. Chim. Acta 1994, 292, 5-15.
(8) Kvalheim, O. M.; Liang, Y. Z. Anal. Chem. 1992, 64, 936-946.
(9) Malinowski, E. R. J. Chemom. 1996, 10, 273-279.
(10) Manne, R.; Grande, B. V. Chemom. Intell. Lab. Syst. 2000, 50, 35-46.
10.1021/ac0352427 CCC: $27.50
© 2004 American Chemical Society
Published on Web 02/11/2004
In chromatography, fingerprinting based on pattern recognition
methods has been used in various analytical areas, e.g., forensic13
and medical14,15 applications, chemical environment on dairy
farmers,16 hydrocarbon pollution in soils,17 and mutagenicity
studies.18 However, nonprocessed GC/MS data have not been
used previously for the comparison of large series of complex
samples. Although Demir and Brereton19 showed that multivariate
calibration using partial least squares to latent structures (PLS)
is superior to univariate calibration in a study of 21 mixtures of
known composition, their study was based on only two closely
eluting organic compounds.
In MS-based metabolomics studies, the identification of differences between samples has normally involved different multivariate tools such as principal component analysis (PCA),
hierarchical cluster analysis (HCA),20 discriminant analysis,21 or
correlative network analysis.22 However, to our knowledge, in all
published metabolomic studies based on GC/MS, the starting
point of the data analysis has been the deconvolution of chromatograms, followed by peak matching, and then identification of the
differences between samples, often using multivariate tools. In
this study, we demonstrate, using metabolomic samples from the
model plant species hybrid aspen, that GC/MS data can be
compared using supervised multivariate techniques on nonprocessed GC/MS data sets. Using this methodology, hundreds of
samples can be compared, and the regions of the GC/MS
chromatogram that contribute most strongly to differences
between the groups of samples can be recognized. Deconvolution
of these regions can then be performed, and thus, the amount of
time spent on curve resolution can be minimized and more time
can be spent on identifying the metabolites that explain the
differences between the samples.
(11) Idborg-Björkman, H.; Edlund, P. O.; Kvalheim, O. M.; Schuppe-Koistinen,
I.; Jacobsson, S. P. Anal. Chem. 2003, 75, 4784-4792.
(12) Halket, J. M.; Przyborowska, A.; Stein, S. E.; Mallard, W. G.; Down, S.;
Chalmers, R. A. Rapid Commun. Mass Spectrom. 1999, 13, 279-284.
(13) Stout, S. A.; Uhler, A. D.; McCarthy, K. J.; Emsbo-Mattingly, S. D. Environ.
Forensics 2002, 3, 9-11.
(14) Jellum, E.; Harboe, M.; Bjune, G.; Wold, S. J. Pharm. Biomed. Anal. 1991,
9, 663-669.
(15) Kimura, H.; Yamamoto, T.; Seiji, Y. Tohoku J. Exp. Med. 1999, 188, 317-334.
(16) Sunesson, A. L.; Gullberg, J.; Blomquist, G. J. Environ. Monit. 2001, 3,
210-216.
(17) Pavon, J. L. P.; Sanchez, M. D.; Pinto, C. G.; Laespada, M. E. F.; Cordero,
B. M.; Pena, A. G. Anal. Chem. 2003, 75, 2034-2041.
(18) Eide, I.; Neverdal, G.; Thorvaldsen, B.; Shen, H. L.; Grung, B.; Kvalheim,
O. Environ. Sci. Technol. 2001, 35, 2314-2318.
(19) Demir, C.; Brereton, R. G. Analyst 1997, 122, 631-638.
(20) Sumner, L. W.; Mendes, P.; Dixon, R. A. Phytochemistry 2003, 62, 817-836.
(21) Allen, J.; Davey, H. M.; Broadhurst, D.; Heald, J. K.; Rowland, J. J.; Oliver,
S. G.; Kell, D. B. Nat. Biotechnol. 2003, 21, 692-696.
(22) Steuer, R.; Kurths, J.; Fiehn, O.; Weckwerth, W. Bioinformatics 2003, 19,
1019-1026.
EXPERIMENTAL SECTION
Sampling, Extraction, and Derivatization. Twenty-eight
hybrid aspen (Populus tremula × Populus tremuloides) plants were
grown in a growth chamber under 18-h photoperiods (long day;
LD), as previously described.23 After 15 weeks in the growth
chamber, seven plants were sampled, and after a further 2 days,
seven more were sampled (designated LD0 and LD2, respectively).
For the remaining 14 plants, the photoperiod was changed to 12
h (short photoperiod; SD). After 2 days, seven plants were
sampled, and after a further 4 days, the last seven were sampled
(designated SD2 and SD6, respectively). In each case, leaves 2,
10, and 20, counting from the top of the plant, were taken. All
samples were frozen immediately in liquid nitrogen and stored at
-80 °C until analysis. Each leaf sample was analyzed separately.
Each sample was first homogenized in liquid nitrogen with a
mortar and pestle. Ten milligrams of sample and 1 mL of
extraction medium (chloroform/MeOH/H2O; 2:6:2) including
stable isotope reference compounds ([2H4]-succinic acid, [13C5,15N]-glutamic acid, [2H7]-cholesterol, [13C3]-myristic acid, [13C4]-α-ketoglutarate, [13C12]-sucrose, [13C4]-hexadecanoic acid, [2H4]-1,4-butanediamine, [2H6]-2-hydroxybenzoic acid, and [13C6]-glucose)
were added to an Eppendorf tube. The extraction was performed
using an MM 301 vibration mill (Retsch GmbH & Co. KG, Haan,
Germany) at a frequency of 30 Hz for 3 min after adding 3-mm
tungsten carbide beads (Retsch GmbH & Co. KG) to each tube
to increase the extraction efficiency. After extraction, 200 µL of
the extract was evaporated to dryness in a Speed-vac concentrator
(Savant Instruments, Farmingdale, NY). Then, 30 µL of methoxyamine hydrochloride (15 mg mL-1) in pyridine was added to
the sample. After 16 h of derivatization at room temperature, the
sample was trimethylsilylated for 1 h at room temperature by
adding 30 µL of MSTFA with 1% TMCS (Pierce, Rockford, IL).
After silylation, 30 µL of heptane was added.
GC/MS. One microliter of the derivatized sample was injected
splitless by an Agilent 7683 autosampler (Agilent, Atlanta, GA)
into an Agilent 6890 gas chromatograph equipped with a 10 m ×
0.18 mm i.d. fused-silica capillary column with a chemically bonded
0.18-µm DB 5-MS stationary phase (J&W Scientific, Folsom, CA).
The injector temperature was 270 °C, the purge flow was 20 mL
min-1, and the purge was turned on after 60 s. The gas flow rate
through the column was 1 mL min-1, and the column temperature
was held at 70 °C for 2 min, then increased by 40 °C min-1 to 320
°C, and held there for 1 min. The column effluent was introduced
into the ion source of a Pegasus III time-of-flight mass spectrometer, GC/TOFMS (Leco Corp., St Joseph, MI). The transfer line
and the ion source temperatures were 250 and 200 °C, respectively. Ions were generated by a 70-eV electron beam at an
ionization current of 2.0 mA, and 30 spectra s-1 were recorded in
the mass range 60-800 m/z. The acceleration voltage was turned
on after a solvent delay of 170 s. The detector voltage was 1500
V.
(23) Eriksson, M. E.; Moritz, T. Planta 2002, 214, 920-930.
Figure 1. Total ion current chromatogram (TIC) from a typical
analysis of a methoxime- and trimethylsilyl-derivatized extract from
Populus.
Analysis of GC/MS Data. All data were processed by
ChromaTOF (1.00) software (Leco Corp.). Automatic peak detection and mass spectrum deconvolution were performed using a
peak width set to 2.0 s. Peaks with signal-to-noise (S/N) values
lower than 10 were rejected. The S/N values were based on the
masses chosen by the software for quantification. Peak area
calculation and sample comparison were performed according to
the following procedure. The sample with the most detected
components was used as a master sample. The mass spectra from
the master sample were visually inspected, and spectra consisting
mainly of low-intensity noise were removed. Then all the other
samples’ deconvoluted mass spectra were matched with the
master mass spectra. Both mass spectra and retention times were
used to match samples. As the samples were very similar, the
probability of obtaining errors due to the choice of master sample
was regarded as low. To obtain accurate peak areas for the
deconvoluted components, the unique masses for each component
were specified and the samples were reprocessed.
Nonprocessed MS files from GC/TOFMS analysis were
exported in CSV format to MATLAB software 6.5 (Mathworks,
Natick, MA), where all data pretreatment procedures, such as
baseline correction and chromatogram alignment, were performed
using custom scripts. Multivariate analysis was performed with
SIMCA-P+ 10.0.4.0 software (Umetrics AB, Umeå, Sweden).
RESULTS AND DISCUSSION
GC/MS System. Since the first GC/TOFMS instruments
became commercially available a few years ago, intense interest
has developed in using them to analyze complex mixtures.24 The
main advantage of these systems is that they enable spectra to
be accumulated rapidly, thereby increasing the speed of GC/MS
analyses and making the deconvolution of chromatograms more
accurate. The chosen spectra accumulation speed (30 spectra s-1)
resulted in 20-40 data points per peak. With the rapid GC temperature programming, each GC/MS analysis took only 15
min, and more than 200 compounds could be detected in each
analysis after deconvolution. A typical GC/TOFMS total ion
current chromatogram obtained from a complex plant extract is
shown in Figure 1.
(24) Veriotti, T.; Sacks, R. Anal. Chem. 2001, 73, 4395-4402.
Figure 2. Summary of the new strategy for rapid comparison of
GC/MS data sets. White boxes show the steps that are done with all
samples individually, gray boxes show steps involving all samples simultaneously, and black boxes show steps that involve modified data
simultaneously.
Strategy for Comparison of Samples. A new strategy for
rapid comparison of GC/MS data sets was developed and is
summarized in Figure 2. It involves smoothing and correction of
baselines, alignment, window setting, AR, and multivariate
modeling. All steps, including importing raw data into the processing
software, were automated, except for the window setting,
which is done manually. The strategy is described in detail below.
Preprocessing of Data. All MS files (nonprocessed; CSV
format) were exported into MATLAB software for further processing. Ideally, the nonprocessed MS files would be subjected directly
to multivariate analysis, but certain problems make preprocessing
of raw data before multivariate analysis essential, including
baseline problems, retention time drifts, variations in peak shape,
and differences in recovery between the analyzed samples. An
important step in the preprocessing procedure is to reduce the
noise levels for every m/z channel. We have chosen to smooth
the data with moving averages (each of seven time points). This
reduces the noise and makes it easier to observe the start and
end of the peaks. To reduce the background, the minimum
response value for each m/z channel is subtracted from all the
other response values in the same m/z channel, which is thus
set to a zero level. Liang et al.25 suggested that zero component
regions can be used for background correction. In accordance
with this approach, we have linearly interpolated between zero
component regions to remove systematic noise from the data (see
the Time Windows section).
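The smoothing and minimum-subtraction steps described above can be sketched for a single m/z channel as follows. This is a minimal illustration in Python rather than the authors' MATLAB scripts; the function names and the handling of the first and last points (a shrinking window) are assumptions, and the interpolation between zero component regions is not included.

```python
def moving_average(signal, width=7):
    """Centered moving average over `width` time points.

    Assumption: near the edges the window simply shrinks to the
    points that exist; the source does not say how edges are handled.
    """
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def subtract_minimum(signal):
    """Subtract the channel minimum so the lowest response becomes zero."""
    base = min(signal)
    return [value - base for value in signal]


def preprocess_channel(signal, width=7):
    """Smooth one m/z channel, then set its minimum to the zero level."""
    return subtract_minimum(moving_average(signal, width))
```

In use, `preprocess_channel` would be applied to every m/z channel of every sample before alignment.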
Alignment. Another important step for comparing chromatograms by multivariate modeling is to align the chromatograms
derived from each sample. Although GC/MS analysis is considered to be very reproducible under optimized injection and GC
conditions, small differences in retention between different injections will always occur (Figure 3A). Such differences may arise
from small variations in gas flow, solvent composition, temperature, or other unknown factors that influence retention time
reproducibility. Several methods have been developed for correcting retention drifts.26-28 The best method for correcting
large metabolomic or environmental data sets depends on a
number of criteria. One is that it must be simple, since the purpose
of the multivariate modeling with raw data is to make the data
analysis faster than is usually possible when modeling processed
GC/MS data. The method must also be valid for the alignment
of very complex samples, since they are likely to contain large
numbers of compounds (more than 200 in this investigation, for
instance). The first step in the method published by Malmquist
and Danielsson29 meets these criteria. This method aligns chromatograms by finding the maximal covariance between the
chromatograms (see Supporting Information I). This is especially
applicable when samples have large similarities to each other. This
means that the largest peaks in the chromatogram must be found
in all measurements, because they contribute most to the
covariance. In metabolomic studies, very similar samples are
usually compared, and most of the major peaks in the chromatograms occur in all of the samples, so there is only a small risk
that the peak alignment (Figure 3B) cannot be done automatically.
It must also be emphasized that the major purpose of the
alignment is to allow the retention time windows to be set, rather
than to correct for differences in peak shapes in the different
chromatograms.
(25) Liang, Y. Z.; Kvalheim, O. M.; Rahmani, A.; Brereton, R. G. Chemom. Intell.
Lab. Syst. 1993, 18, 265-279.
(26) Nielsen, N. P. V.; Carstensen, J. M.; Smedsgaard, J. J. Chromatogr., A 1998,
805, 17-35.
(27) Fraga, C. G.; Prazen, B. J.; Synovec, R. E. Anal. Chem. 2001, 73, 5833-5840.
(28) Johnson, K. J.; Wright, B. W.; Jarman, K. H.; Synovec, R. E. J. Chromatogr.,
A 2003, 996, 141-155.
(29) Malmquist, G.; Danielsson, R. J. Chromatogr., A 1994, 687, 71-88.
Figure 3. Differences in retention times between different injections
(A) and alignment of chromatograms (B) for correction of retention
shifts prior to multivariate comparisons of samples. All chromatograms
are aligned so that the covariance of each TIC with a master TIC is
maximized.
Figure 4. Division of each chromatogram into time windows, each
of which can contain several compounds, to overcome small variations in retention shifts over the whole chromatogram. The figure
shows only 3% of the total analysis. Vertical lines represent individual
time windows.
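The covariance-based alignment described in the Alignment section can be illustrated with a small Python sketch. It is not the authors' script or the full Malmquist and Danielsson procedure: it assumes a single rigid shift per chromatogram, a hypothetical search range `max_lag`, and padding with the edge value when a TIC is shifted.

```python
def covariance(a, b):
    """Sample covariance of two equally long signals."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def shift(signal, lag):
    """Shift right by `lag` points (left if negative), padding with edge values."""
    n = len(signal)
    if lag > 0:
        return [signal[0]] * lag + signal[:n - lag]
    if lag < 0:
        return signal[-lag:] + [signal[-1]] * (-lag)
    return list(signal)


def best_lag(master_tic, sample_tic, max_lag=20):
    """Lag that maximizes the covariance between the shifted sample TIC
    and the master TIC (max_lag is an assumed search range)."""
    return max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: covariance(master_tic, shift(sample_tic, lag)),
    )
```

The lag found for the TIC would then be applied to every m/z channel of that sample, so that all channels stay synchronized.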
Time Windows. A problem that can often affect chromatogram alignment is that retention drifts may not remain constant
throughout the course of each chromatographic analysis. In such
cases, the alignment procedure (as described above, for instance)
will not be satisfactory if the alignment is based solely on a single
large peak in each chromatogram. We have overcome this
problem by dividing each MS file into small time windows (Figure
4) and summarizing the total intensity of each m/z channel for
each time window. This gives a total mass spectrum for each time
window and reduces the resolution, but the signal for different
compounds in each time window can be found in the summarized
mass spectra, and thereby, the information is not lost. The main
advantage with this method is that it increases the accuracy of
the multivariate modeling. Even if the chromatogram alignment
is not perfect, the differences in the time window will be detected,
as the summarized mass spectrum in each time window is
compared between samples. A time window is defined as a short
retention span that starts and ends in a region of the chromatogram that does not contain any compound (a zero component
region). A time window of appropriate size can be visually selected
for each investigation by plotting all aligned chromatograms at
the same time. This stage of the processing (Figure 2) is done
simultaneously for all samples, and it is the only step that is done
manually since all other steps are fully computer automated.
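Summing the intensity of each m/z channel within a time window, as described above, can be sketched as follows. The data layout (a list of scans, each a list of m/z intensities) and the name `summarize_windows` are assumptions made for illustration; the window boundaries are taken as given, since in the paper they are set manually in zero component regions.

```python
def summarize_windows(data, boundaries):
    """Sum each m/z channel over every time window.

    data[t][m] is the intensity at scan t in m/z channel m.
    boundaries is a sorted list of scan indices [b0, b1, ..., bk];
    window i covers scans b_i up to, but not including, b_(i+1).
    Returns one summarized mass spectrum per window.
    """
    n_channels = len(data[0])
    spectra = []
    for start, stop in zip(boundaries, boundaries[1:]):
        spectrum = [0.0] * n_channels
        for scan in data[start:stop]:
            for m, value in enumerate(scan):
                spectrum[m] += value
        spectra.append(spectrum)
    return spectra
```

For one window, stacking the summarized spectra of all samples as rows gives the matrix that is passed on to the next step.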
Here we divided each chromatogram into 51 windows. Consequently, in many cases, there are several compounds in each
time window. However, that is not a problem since the multivariate
modeling of the summarized mass spectrum for each time window
will still recognize the differences between samples. Another
advantage of the time window approach is that the multivariate
modeling will be based on smaller data sets, which increases
the speed of the process.
Figure 5. Visualization of the summarization and compression of
the 3D-data structure in one time window. An approximation of the
compound intensities and corresponding mass profiles can be
generated using alternating regression. The intensity matrix for all
the components in the windows is used for the multivariate analysis.
Extracting Information from Each Time Window Using
Alternating Regression. The first stage in the comparison of
samples is a projection step that begins by chemically ranking
the time window matrices (i.e., the summarized mass spectra of
all samples in the respective windows; Figure 5). The chemical
rank, i.e., the number of components, is found for one time-window
at a time by examining the eigenvalues from a principal component
analysis (PCA).30 The chemical rank is then set to the number of
eigenvalues above the noise level. Using the AR method,6
approximations of the compounds’ intensity and their corresponding mass profiles can then be derived.
$$C = X \cdot S \cdot (S^{T} \cdot S)^{-1} \qquad (1)$$
$$S = X^{T} \cdot C \cdot (C^{T} \cdot C)^{-1} \qquad (2)$$
where C is the concentration, S is the mass spectrum, X is the
data matrix, and the superscript T denotes transposition and -1 the inverse
of a matrix.
The summarized mass spectra from each sample are rows in
the X matrix. Both C and S have nonnegative constraints, meaning
that the concentration and the mass spectra can never include
negative values. Negative values in the spectrum profile are set
to zero, and negative concentration values are set to the lowest
of all positive concentrations.
(30) Wold, S.; Esbensen, K.; Geladi, P. Chemom. Intell. Lab. Syst. 1987, 2, 37-52.
A random number is used as a starting estimate of S. If the
AR algorithm does not find a solution, or the correlation between
spectral profiles is larger than 95%, the chemical rank is reduced
by one. For example, as the aldehyde and keto groups are
converted into oximes during the derivatization, some metabolites
will form two tautomeric forms that can appear very close in the
chromatogram, with identical mass spectra. From the AR analysis
the concentrations, C, are used to describe the concentration of
the components in every sample in each time window. C values
from all time windows are used as an X matrix (246 variables
generated from the 51 time windows) in further multivariate
modeling. C contains only information about compounds that differ
between samples, because all samples have been compared at
the same time. Each of the values for C has corresponding m/z
values that can be used for interpretation later. Using this method,
the information from each time window is explained by a small
number of variables instead of the entire summarized mass
spectrum. This results in a more easily interpretable PLS model.
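The alternating regression loop of eqs 1 and 2, with the nonnegativity rules and the random starting estimate of S, might look like the following pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: the fixed iteration count, the seed, and the helper names are hypothetical, and the rank reduction triggered by highly correlated spectral profiles is left out.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Gauss-Jordan inverse of a small square matrix (partial pivoting)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def alternating_regression(x, rank, n_iter=50, seed=0):
    """x: samples (rows) by m/z channels (columns). Returns (C, S)."""
    rng = random.Random(seed)
    s = [[rng.random() for _ in range(rank)] for _ in range(len(x[0]))]
    xt = transpose(x)
    for _ in range(n_iter):
        # Eq 1: C = X S (S^T S)^-1
        c = matmul(matmul(x, s), inverse(matmul(transpose(s), s)))
        # Negative concentrations are set to the lowest positive concentration.
        positives = [v for row in c for v in row if v > 0]
        floor = min(positives) if positives else 0.0
        c = [[v if v >= 0 else floor for v in row] for row in c]
        # Eq 2: S = X^T C (C^T C)^-1, with negative spectral values set to zero.
        s = matmul(matmul(xt, c), inverse(matmul(transpose(c), c)))
        s = [[max(v, 0.0) for v in row] for row in s]
    return c, s
```

A singular matrix raised by `inverse` would correspond to the case where no solution is found, which in the paper leads to reducing the chemical rank by one.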
Multivariate Analysis of Data. Because a large amount of
information can be found in each analysis, a number of different
multivariate modeling techniques can be applied. The general
approach (in the few metabolomic studies that have been
published, for instance) has been to use PCA or HCA. However,
we have chosen to find the differences between samples by using
PLS analysis.31 The main advantage of PLS is that it is a supervised
calibration method that allows the information from the X matrix
to be correlated with response data, the Y matrix, rather than
simply describing the variation in the X matrix, as PCA does. PLS
models can be used not only for predicting Y values in new
samples but also to evaluate the ability to predict the different
groups, for example, using cross-validation.32 PLS discriminant
analysis (PLS-DA) is often used to locate differences between
members of different logical groups. PLS-DA searches for structural information that can discriminate between sample groups.
The variables used to describe each sample are the “concentrations” derived from AR analysis, and the “concentrations” form
the X matrix. The Y matrix is the description of the properties of
the samples; here a “dummy matrix” is used to describe group
membership. A dummy matrix has the same number of columns
as the number of classes and the same number of rows as the
number of samples. For a column representing a specific class,
the row values are zeros except for samples belonging to the class
that corresponds to the column. For these samples, values are
set to 1.
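The dummy matrix described above can be built in a few lines. This Python sketch is only illustrative; the function name and the alphabetical ordering of the class columns are assumptions.

```python
def dummy_matrix(labels):
    """Build a class-membership (dummy) Y matrix.

    One row per sample and one column per class; an entry is 1 when the
    sample belongs to the class of that column and 0 otherwise.
    Assumption: columns follow the sorted order of the class names.
    """
    classes = sorted(set(labels))
    matrix = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return classes, matrix
```

For example, samples labeled `["LD2", "SD6", "LD2"]` give the columns `["LD2", "SD6"]` and the rows `[1, 0]`, `[0, 1]`, `[1, 0]`.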
Each sample is normalized before the multivariate analysis.
Normalization is essential if the samples are not identical, e.g., if
there are differences in sample weight or volume, or a purification
or derivatization step is involved that might result in variations in
recovery. Our approach is to divide the response values by the
sample weight and the intensity of one or more internal standards.
In GC/MS analysis of complex, derivatized samples, the diversity
of the compound classes will inevitably cause differences in
recovery for different compounds. For more accurate compensation of differences in recovery, more than one internal standard
should be used, and the internal standards should represent a
wide range of compounds and chromatographic behavior, covering
the whole range of the chromatogram. In our methodology for
multivariate modeling, such an approach can be used to normalize
separate time windows with different internal standards.
(31) Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58,
109-130.
(32) Wold, S. Technometrics 1978, 20, 397-405.
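The normalization described above, dividing the response values by the sample weight and by an internal standard intensity that may differ between time windows, can be sketched as follows. The data layout and all names are hypothetical; how internal standards are assigned to windows is not specified in the text and is assumed to be given.

```python
def normalize_sample(window_responses, sample_weight, standard_by_window):
    """Normalize one sample's responses, window by window.

    window_responses: one list of response values per time window.
    sample_weight: weight of the extracted sample (e.g., in mg).
    standard_by_window: intensity of the internal standard assigned to
        each window (assumption: exactly one standard per window).
    """
    return [
        [value / (sample_weight * standard) for value in values]
        for values, standard in zip(window_responses, standard_by_window)
    ]
```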
Before calculating a PLS model, the variables in the X matrices
are log-transformed before they are centered and scaled to unit
variance. Log transformation makes it easier to fulfill the normal
distribution requirement. The Y matrix is centered and scaled to
unit variance (SIMCA-P default). After the PLS model has been
calculated, the ability of the model to estimate Y from X is
assessed. This can be done by using the model to make
predictions concerning new and known samples. This is the point
at which a decision can be made as to whether further data
analysis of the sample set would be worthwhile or if there appear
to be no differences between the samples according to the
multivariate modeling. If the data analysis is to be continued, the
most important variables in X that are correlated to Y will be
sought. This can be done by interpreting the loadings from the
PLS model using jackknifing33,34 and identifying the time window(s) in which significant variables occur. Furthermore, by using
AR as described above, the m/z values that differ in each time
window between the samples are also identified.
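The pretreatment of the X variables mentioned above (log transformation, centering, and scaling to unit variance) can be illustrated for a single variable. This is a sketch, not the SIMCA-P implementation: base-10 logarithms and the sample standard deviation are assumptions, and the values are assumed to be strictly positive with nonzero spread.

```python
import math
import statistics


def log_autoscale(column):
    """Log-transform, mean-center, and scale one variable to unit variance.

    Assumptions: all values are strictly positive, base-10 logarithms are
    used, and scaling uses the sample standard deviation (n - 1).
    """
    logged = [math.log10(value) for value in column]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(value - mean) / sd for value in logged]
```

Applied to the values 1, 10, and 100, the logarithms are 0, 1, and 2, so the scaled variable is approximately -1, 0, and 1.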
Comparison of the New Method with the Curve Resolution
Data Approach. The traditional way of comparing GC/MS data
sets is to resolve chromatographic overlaps in the MS files, to
then calculate the relative amounts of each compound, and finally
subject the data to multivariate modeling. To test whether our
new approach is able to extract the same information as the
traditional method, we compared the two methods by analyzing
GC/MS data obtained from 84 aspen samples. These samples
were taken from trees included in a project designed to identify
molecular and biochemical signals associated with the induction
of growth cessation in woody species: a process that is poorly
understood. The samples consisted of three leaves (2, 10, and
20) collected from plants growing under either LD or SD
photoperiods. For more information see the Experimental Section.
Automatic identification and quantification with the ChromaTOF
software resulted in different quantification masses being suggested for the same compound in different samples, which could
have resulted in errors during the comparison of samples.
Therefore, specific quantification masses were selected for each
selected compound in the master sample. This deconvolution of
54 samples (only leaves 10 and 20 were analyzed this way; of note
is that two samples were lost during extraction) using the
ChromaTOF software took more than 2 days and resulted in 241
quantified compounds. The concentrations of the compounds were
used as X variables in the multivariate data analysis. All variables
were log-transformed, centered, and scaled to unit variance. The
data were also normalized against an internal standard and sample
weight. PCA analysis was applied first, to compare the two
methods. Both methods showed very similar results, and a clear
separation between groups corresponding to leaf 10 and leaf 20
could be seen (Figure 6A and B). The usefulness of the method
was also observed when the analysis of leaf 2 was added, which
resulted in clear separation of the three groups (Figure 6C).
(33) Efron, B. Ann. Stat. 1986, 14, 1301-1304.
(34) Martens, H.; Martens, M. Food Quality Preference 2000, 11, 5-16.
Figure 6. PCA score plots from the analysis of leaf 10 (○) and 20
(▽) from four different classes of plants. (A) shows a score plot from
processed MS files and (B) a score plot from nonprocessed MS files
derived by the described method. (C) shows a score plot from
nonprocessed MS files derived by the described method with data
from leaf 2 (□), leaf 10 (○), and leaf 20 (▽).
This was not surprising as the leaves 2, 10, and 20 are leaves at
completely different developmental stages (sink to source), and
thereby the metabolic differences are very large (See Supporting
Information where the new method has been used with independent metabolomics data published on the Web.). However, to
further explore the usefulness of the new method, four groups of
samples, representing plants subjected to four different photoperiodic treatments, were investigated. Instead of using PCA, we
used a supervised method for this, since the aim in metabolomic
studies is often to identify variables that statistically explain the
differences between groups of samples, and PCA would not detect
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
1743
Figure 8. PLS loading plots (w[1]) of limit regions for (A) processed
MS files and (B) nonprocessed MS files derived by the described
method. Positive values are associated with LD and negative values
with SD. The confidence interval was calculated by jackknifing (refs 33 and 34).
Two retention time windows are shown. In window 16, the two
methods show variables that are significantly correlated with SD. In
window 17, both methods show variables that are significantly
correlated with LD and SD. After processing of time windows of
interest, the statistical significance was checked with the Student
t-test.
Figure 7. PCA and PLS-DA score plots from the analysis of leaf
10 from four different classes of plants (9, LD0; *, LD2; O, SD2; 2,
SD6). (A) shows a PCA score plot from processed MS files, (B) shows
a PLS-DA score plot from processed MS files, and (C) shows a PLS-DA score plot from nonprocessed MS files derived by the described
method.
the small differences between the groups of samples that were
analyzed in the present investigation. In metabolomics, as in
transcriptomics, for example, analysis of small changes over time
is very common, and the methods used must therefore be able
to identify small differences as well. This was evident in the present
investigation, where PCA of the leaf 10 LD and SD samples
could not separate the different groups (Figure 7A). PLS-DA was
therefore used to investigate the differences between two
of these groups (LD2 and SD6; Y matrix). The PLS-DA model with
the X matrix generated with deconvolution (metabolite concentrations) shows two significant components according to cross-validation (ref 29). The explained variation in the X matrix (R2X) is 0.36,
the explained variation in the Y matrix (R2Y) is 0.93, and the
predictive ability according to 7-fold cross-validation (Q2Y) is 0.55.
The PLS-DA model with the X matrix generated with our new
method shows one significant component; R2X is 0.19, R2Y is 0.82,
and Q2Y is 0.49.
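How R2X, R2Y, and Q2Y are obtained for a two-class PLS-DA model can be sketched as follows. This is a minimal NIPALS PLS1 implementation in Python, assuming a single dummy y variable coded -1/+1 and a simple cumulative Q2 (1 - PRESS/SS). Commercial chemometrics software may compute Q2 component by component and assign cross-validation segments differently, so the sketch is not expected to reproduce the values reported here. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """NIPALS PLS1 for column-centered X (n x p) and centered y (n,)."""
    Xr, yr = X.copy(), y.astype(float)
    W, P, T, c = [], [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)        # weight vector w[a]
        t = Xr @ w                    # score vector
        tt = t @ t
        p = Xr.T @ t / tt             # X loading
        q = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - q * t               # deflate y
        W.append(w); P.append(p); T.append(t); c.append(q)
    W, P, T, c = np.array(W).T, np.array(P).T, np.array(T).T, np.array(c)
    B = W @ np.linalg.solve(P.T @ W, c)   # regression coefficients
    return W, P, T, c, B

def r2(X, y, T, P, c):
    """Explained variation in centered X and y."""
    r2x = 1.0 - np.sum((X - T @ P.T) ** 2) / np.sum(X ** 2)
    r2y = 1.0 - np.sum((y - T @ c) ** 2) / np.sum(y ** 2)
    return r2x, r2y

def q2_cv(X, y, n_comp, n_folds=7, seed=0):
    """Cumulative Q2 from k-fold cross-validation (7-fold by default)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        xm, ym = X[train].mean(axis=0), y[train].mean()
        B = pls1_fit(X[train] - xm, y[train] - ym, n_comp)[-1]
        pred = (X[fold] - xm) @ B + ym
        press += np.sum((y[fold] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic two-class example: 13 samples per class, 40 variables.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 13)
X = rng.normal(size=(26, 40))
X[:, :5] += 1.5 * y[:, None]          # five class-dependent variables
Xc, yc = X - X.mean(axis=0), y - y.mean()
W, P, T, c, B = pls1_fit(Xc, yc, n_comp=1)
print(r2(Xc, yc, T, P, c), q2_cv(X, y, n_comp=1))
```

Samples from classes that were left out of the model can be predicted in the same way as the cross-validation segments, i.e., by centering them with the training means and multiplying by `B`.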
To validate the models, the two groups (LD0 and SD2) that
were not included in the model were used to test its predictive
quality (Figure 7B and C). Both methods predicted that LD0 (9)
would be similar to LD2 (*) and that SD2 (O) would be intermediate between LD2 and SD6 (2). This is consistent with the
hypothesis that samples grown 2 days more in LD (LD2) should
group with the LD0 samples. In samples exposed to only 2 days
in short photoperiods, only very small metabolic changes will have
occurred compared to those that will have occurred after 6 SDs.
Therefore, the prediction that SD2 is an intermediate between LD2
and SD6 fits well with the hypothesis. The two methods were
further studied by comparing the loadings from the PLS models.
We used the first loading vector (w) (Figure 8), which is the
loading vector that should be interpreted in the PLS model when
the rank of Y is 1 (ref 35). A variable is considered significant
if its confidence interval (from jackknifing) does not include 0. In 7 of
the 51 time windows, neither of the methods found significant
variables, in 30 of the windows both methods found significant
variables, in 5 of the windows the traditional method found
significant variables, but not the new method, and in 9 of the
windows only the new method found significant variables (in 4 of
these windows the traditional method did not find any peaks at
all). Thus, the two methods give similar results;
for instance, differences in amino acid and carbohydrate metabolism (e.g.,
(35) Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119-128.
norvaline, glutamic acid, aspartic acid, fumaric acid, α-ketoglutaric
acid, and mannose phosphate; a complete list of metabolic
differences identified in the samples will be published elsewhere)
were observed with both methods. The small differences that
remained can be explained by the fact that neither of the
methods is 100% accurate; e.g., automatic and accurate deconvolution of 54 complex samples is very difficult. However, with
the new method, the deconvolution is more rapid. A significant
variable from the PLS-DA model corresponds to one time window
and an m/z profile in that time window (see Supporting Information
III for an example). It is then possible to use unique masses to
detect the compound(s) that explain the differences in that specific
time window. This speeds up the deconvolution significantly, and
thereby, the total time spent on deconvolution is decreased.
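The significance criterion applied to the loadings, a jackknife confidence interval that does not include 0, can be sketched as below. The sketch uses the generic delete-a-group jackknife variance over seven segments and a fixed critical value of 2.447 (the two-sided 95% Student t value for 6 degrees of freedom). Whether this matches the exact jackknife variant of refs 33 and 34, or the implementation used in this study, is an assumption. Function names and data are hypothetical.

```python
import numpy as np

def first_weight(X, y):
    """First PLS weight vector w[1] for a single y: normalized X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)

def jackknife_w(X, y, n_segments=7, t_crit=2.447, seed=0):
    """Jackknife confidence interval for each element of w[1].

    Samples are split into n_segments groups; w[1] is re-estimated with
    each group left out, and the spread of these estimates gives the
    standard error. t_crit must match n_segments - 1 degrees of freedom.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    w_full = first_weight(X, y)
    w_sub = np.array([
        first_weight(np.delete(X, seg, axis=0), np.delete(y, seg))
        for seg in np.array_split(idx, n_segments)
    ])
    m = n_segments
    se = np.sqrt((m - 1) / m * np.sum((w_sub - w_sub.mean(axis=0)) ** 2, axis=0))
    lower, upper = w_full - t_crit * se, w_full + t_crit * se
    significant = (lower > 0) | (upper < 0)   # interval excludes 0
    return w_full, lower, upper, significant

# Synthetic two-class example with five truly class-dependent variables.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 13)
X = rng.normal(size=(26, 40))
X[:, :5] += 2.0 * y[:, None]
w, lower, upper, significant = jackknife_w(X, y)
print(np.flatnonzero(significant))  # indices of variables flagged as significant
```

With this coding of y, a positive significant weight indicates a variable that is higher in the class coded +1 and a negative one a variable that is higher in the class coded -1, analogous to the LD/SD interpretation in Figure 8.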
CONCLUSIONS
Development of the new strategy for comparing GC/MS data
has been driven by the desire to be able to rapidly compare large
numbers of samples. To our knowledge, no multivariate method
has been developed prior to this that uses nonprocessed GC/MS
data for comparison of samples. This is in contrast to comparisons
of 1H NMR samples, where two-dimensional raw data (unlike GC/
MS data, which are three-dimensional) have been used for
multivariate modeling in a number of cases (ref 2). Every step in the
new method is semiautomated and therefore less labor-intensive
than the more traditional methods, and more than 50 samples
can easily be compared within 5 h. Although mathematical curve
resolution procedures must be applied to the retention time
windows that explain the differences between samples, it is
advantageous to decrease the number of sections of the chromatograms that need to be processed in this way. By using AR,
knowledge about the masses of interesting compounds is also
acquired, which further simplifies the deconvolution. Most
importantly, the method provides a rapid means for screening
for differences between large sample sets, allowing the experimenter to decide, at an early stage, the value of continuing to
characterize differences between the analyzed samples.
ACKNOWLEDGMENT
We thank Wallenberg Consortium North (WCN), EU-strategic
funding, and the Swedish Research Council for financial support,
Krister Lundgren for help with the GC/MS analysis, and Dr.
Anders Berglund for comments on the manuscript.
SUPPORTING INFORMATION AVAILABLE
Additional information as noted in text. This material is
available free of charge via the Internet at http://pubs.acs.org.
Received for review October 21, 2003. Accepted January
9, 2004.
AC0352427