Anal. Chem. 2004, 76, 1738-1745

A Strategy for Identifying Differences in Large Series of Metabolomic Samples Analyzed by GC/MS

Pär Jonsson,†,‡ Jonas Gullberg,‡,§ Anders Nordström,§ Miyako Kusano,§ Mariusz Kowalczyk,§ Michael Sjöström,† and Thomas Moritz*,§

Umeå Plant Science Center, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 87 Umeå, Sweden, and Research Group for Chemometrics, Organic Chemistry, Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden

In metabolomics, the purpose is to identify and quantify all the metabolites in a biological system. Combined gas chromatography and mass spectrometry (GC/MS) is one of the most commonly used techniques in metabolomics together with ¹H NMR, and it has been shown that more than 300 compounds can be distinguished with GC/MS after deconvolution of overlapping peaks. To avoid having to deconvolute all analyzed samples prior to multivariate analysis of the data, we have developed a strategy for rapid comparison of nonprocessed MS data files. The method includes baseline correction, alignment, time window determinations, alternating regression, PLS-DA, and identification of retention time windows in the chromatograms that explain the differences between the samples. Use of alternating regression also gives interpretable loadings, which retain the information provided by m/z values that vary between the samples in each retention time window. The method has been applied to plant extracts derived from leaves of different developmental stages and plants subjected to small changes in day length. The data show that the new method can detect differences between the samples and that it gives results comparable to those obtained when deconvolution is applied prior to the multivariate analysis.
We suggest that this method can be used for rapid comparison of large sets of GC/MS data, so that time-consuming deconvolution need be applied only to those parts of the chromatograms that help explain the differences between the samples.

* Corresponding author. E-mail: thomas.moritz@genfys.slu.se.
† Umeå University.
‡ P.J. and J.G. contributed equally to the work.
§ Swedish University of Agricultural Sciences.
1738 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004

In biology there is intense interest in large-scale analyses of gene transcript levels (microarray analysis), proteins (proteomics), and metabolites (metabolomics). The general aim is to obtain information that can explain and identify the differences between certain sets of organisms (e.g., differences in genotypes) or between people affected by various diseases and disease-free controls, or to elucidate factors that influence biochemical events. For example, in proteomics, samples analyzed by 2D-gel electrophoresis are compared, and protein spots that differ between them can be identified by appropriate methods. In metabolomics, the metabolic profiles of different samples are compared and contrasted, usually either by gas chromatography/mass spectrometry (GC/MS) (for a review, see Fiehn (1)) or by NMR (2). One of the advantages of GC/MS over NMR is that large numbers of substances can be sensitively determined in a single analysis of a very complex sample. In some metabolomics studies, more than 300 compounds have been identified and quantified in a single GC/MS chromatogram (3). However, in such analyses, many of the compounds coelute or are not completely chromatographically resolved, so mathematical curve resolution procedures, often called deconvolution, need to be applied to obtain accurate mass spectra and to resolve the chromatographic peaks.
Multivariate methods have been developed and used to clarify chromatographic and spectral profiles from overlapping chromatographic peaks obtained using various types of hyphenated chromatography systems, including GC/MS, HPLC/DAD, and HPLC/MS. The multivariate curve resolution methods can be divided into iterative, noniterative, and hybrid approaches (4). The iterative methods all define start profiles, but the procedures for selecting initial estimates and for resolution differ among them. Examples of iterative methods include iterative target transformation factor analysis (5) and alternating regression (AR) (6). Their general disadvantage is that high-quality results require a good choice of starting vectors. The noniterative methods, such as orthogonal projections (7) and heuristic evolving latent projections (8), involve rank analysis of evolving matrices. The main disadvantage of these methods is that they are very difficult to automate, because analyte elution windows must be defined by local rank analysis. Hybrid methods, such as automatic window factor analysis (9) and Gentle (10), start from a set of key spectra from which concentration and spectral profiles are estimated. Hybrid and iterative methods are the only real options for automatic curve resolution. Recently, Gentle has been shown to be successful for automatic data processing of LC/MS analyses (11).

Although automatic curve resolution of GC/MS chromatograms can also be applied with a high degree of success, e.g., by using the freely available AMDIS software (12), it is still a fairly slow process. All samples have to be resolved separately, and the estimated spectral profiles of all samples must be carefully checked in order to obtain reliable mass spectra and peak areas. Therefore, comparing large numbers of samples, and thus increasing throughput, requires pattern recognition routines that can compare samples quickly but still retain the information that describes the differences between the samples. In metabolomic studies, this includes data that enable the metabolites that differ significantly in concentration between samples to be identified and quantified.

In chromatography, fingerprinting based on pattern recognition methods has been used in various analytical areas, e.g., forensic (13) and medical (14, 15) applications, the chemical environment of dairy farmers (16), hydrocarbon pollution in soils (17), and mutagenicity studies (18). However, nonprocessed GC/MS data have not been used previously for the comparison of large series of complex samples. Although Demir and Brereton (19) showed that multivariate calibration using partial least-squares projections to latent structures (PLS) is superior to univariate calibration in a study of 21 mixtures of known composition, their study was based on only two closely eluting organic compounds.

10.1021/ac0352427 CCC: $27.50 © 2004 American Chemical Society. Published on Web 02/11/2004.

(1) Fiehn, O. Plant Mol. Biol. 2002, 48, 155-171.
(2) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discovery 2002, 1, 153-161.
(3) Fiehn, O.; Kopka, J.; Dormann, P.; Altmann, T.; Trethewey, R. N.; Willmitzer, L. Nat. Biotechnol. 2000, 18, 1157-1161.
(4) Liang, Y. Z.; Kvalheim, O. M. Fresenius J. Anal. Chem. 2001, 370, 694-704.
(5) Gemperline, P. J. J. Chem. Inf. Comput. Sci. 1984, 24, 206-212.
(6) Karjalainen, E. J. Chemom. Intell. Lab. Syst. 1989, 7, 31-38.
(7) Liang, Y. Z.; Kvalheim, O. M. Anal. Chim. Acta 1994, 292, 5-15.
(8) Kvalheim, O. M.; Liang, Y. Z. Anal. Chem. 1992, 64, 936-946.
(9) Malinowski, E. R. J. Chemom. 1996, 10, 273-279.
(10) Manne, R.; Grande, B. V. Chemom. Intell. Lab. Syst. 2000, 50, 35-46.
In MS-based metabolomics studies, the identification of differences between samples has normally involved multivariate tools such as principal component analysis (PCA), hierarchical cluster analysis (HCA) (20), discriminant analysis (21), or correlative network analysis (22). However, to our knowledge, in all published metabolomic studies based on GC/MS, the starting point of the data analysis has been the deconvolution of chromatograms, followed by peak matching and then identification of the differences between samples, often using multivariate tools. In this study, we demonstrate, using metabolomic samples from the model plant species hybrid aspen, that supervised multivariate techniques can be applied directly to nonprocessed GC/MS data sets. Using this methodology, hundreds of samples can be compared, and the regions of the GC/MS chromatogram that contribute most strongly to differences between the groups of samples can be recognized. Deconvolution of these regions can then be performed; thus, the amount of time spent on curve resolution can be minimized, and more time can be spent on identifying the metabolites that explain the differences between the samples.

(11) Idborg-Björkman, H.; Edlund, P. O.; Kvalheim, O. M.; Schuppe-Koistinen, I.; Jacobsson, S. P. Anal. Chem. 2003, 75, 4784-4792.
(12) Halket, J. M.; Przyborowska, A.; Stein, S. E.; Mallard, W. G.; Down, S.; Chalmers, R. A. Rapid Commun. Mass Spectrom. 1999, 13, 279-284.
(13) Stout, S. A.; Uhler, A. D.; McCarthy, K. J.; Emsbo-Mattingly, S. D. Environ. Forensics 2002, 3, 9-11.
(14) Jellum, E.; Harboe, M.; Bjune, G.; Wold, S. J. Pharm. Biomed. Anal. 1991, 9, 663-669.
(15) Kimura, H.; Yamamoto, T.; Seiji, Y. Tohoku J. Exp. Med. 1999, 188, 317-334.
(16) Sunesson, A. L.; Gullberg, J.; Blomquist, G. J. Environ. Monit. 2001, 3, 210-216.
(17) Pavon, J. L. P.; Sanchez, M. D.; Pinto, C. G.; Laespada, M. E. F.; Cordero, B. M.; Pena, A. G. Anal. Chem. 2003, 75, 2034-2041.
(18) Eide, I.; Neverdal, G.; Thorvaldsen, B.; Shen, H. L.; Grung, B.; Kvalheim, O. Environ. Sci. Technol. 2001, 35, 2314-2318.
(19) Demir, C.; Brereton, R. G. Analyst 1997, 122, 631-638.
(20) Sumner, L. W.; Mendes, P.; Dixon, R. A. Phytochemistry 2003, 62, 817-836.
(21) Allen, J.; Davey, H. M.; Broadhurst, D.; Heald, J. K.; Rowland, J. J.; Oliver, S. G.; Kell, D. B. Nat. Biotechnol. 2003, 21, 692-696.
(22) Steuer, R.; Kurths, J.; Fiehn, O.; Weckwerth, W. Bioinformatics 2003, 19, 1019-1026.

EXPERIMENTAL SECTION

Sampling, Extraction, and Derivatization. Twenty-eight hybrid aspen (Populus tremula × Populus tremuloides) plants were grown in a growth chamber under 18-h photoperiods (long day; LD), as previously described (23). After 15 weeks in the growth chamber, seven plants were sampled, and after a further 2 days, seven more were sampled (designated LD0 and LD2, respectively). For the remaining 14 plants, the photoperiod was changed to 12 h (short photoperiod; SD). After 2 days, seven plants were sampled, and after a further 4 days, the last seven were sampled (designated SD2 and SD6, respectively). In each case, leaves 2, 10, and 20, counting from the top of the plant, were taken. All samples were frozen immediately in liquid nitrogen and stored at -80 °C until analysis. Each leaf sample was analyzed separately.

Each sample was first homogenized in liquid nitrogen with a mortar and pestle. Ten milligrams of sample and 1 mL of extraction medium (chloroform/MeOH/H₂O; 2:6:2) including stable isotope reference compounds ([²H₄]-succinic acid, [¹³C₅,¹⁵N]-glutamic acid, [²H₇]-cholesterol, [¹³C₃]-myristic acid, [¹³C₄]-α-ketoglutarate, [¹³C₁₂]-sucrose, [¹³C₄]-hexadecanoic acid, [²H₄]-1,4-butanediamine, [²H₆]-2-hydroxybenzoic acid, and [¹³C₆]-glucose) were added to an Eppendorf tube. The extraction was performed using an MM 301 vibration mill (Retsch GmbH & Co. KG, Haan, Germany) at a frequency of 30 Hz for 3 min after adding 3-mm tungsten carbide beads (Retsch GmbH & Co.
KG) to each tube to increase the extraction efficiency. After extraction, 200 µL of the extract was evaporated to dryness in a Speed-vac concentrator (Savant Instrument, Farmingdale, NY). Then, 30 µL of methoxyamine hydrochloride (15 mg mL⁻¹) in pyridine was added to the sample. After 16 h of derivatization at room temperature, the sample was trimethylsilylated for 1 h at room temperature by adding 30 µL of MSTFA with 1% TMCS (Pierce, Rockford, IL). After silylation, 30 µL of heptane was added.

GC/MS. One microliter of the derivatized sample was injected splitless by an Agilent 7683 autosampler (Agilent, Atlanta, GA) into an Agilent 6890 gas chromatograph equipped with a 10 m × 0.18 mm i.d. fused-silica capillary column with a chemically bonded 0.18-µm DB 5-MS stationary phase (J&W Scientific, Folsom, CA). The injector temperature was 270 °C, the purge flow was 20 mL min⁻¹, and the purge was turned on after 60 s. The gas flow rate through the column was 1 mL min⁻¹, and the column temperature was held at 70 °C for 2 min, then increased by 40 °C min⁻¹ to 320 °C, and held there for 1 min. The column effluent was introduced into the ion source of a Pegasus III time-of-flight mass spectrometer, GC/TOFMS (Leco Corp., St. Joseph, MI). The transfer line and the ion source temperatures were 250 and 200 °C, respectively. Ions were generated by a 70-eV electron beam at an ionization current of 2.0 mA, and 30 spectra s⁻¹ were recorded in the mass range m/z 60-800. The acceleration voltage was turned on after a solvent delay of 170 s. The detector voltage was 1500 V.

(23) Eriksson, M. E.; Moritz, T. Planta 2002, 214, 920-930.

Figure 1. Total ion current chromatogram (TIC) from a typical analysis of a methoxime- and trimethylsilyl-derivatized extract from Populus.

Analysis of GC/MS Data. All data were processed by ChromaTOF (1.00) software (Leco Corp.).
Automatic peak detection and mass spectrum deconvolution were performed using a peak width set to 2.0 s. Peaks with signal-to-noise (S/N) values lower than 10 were rejected. The S/N values were based on the masses chosen by the software for quantification. Peak area calculation and sample comparison were performed according to the following procedure. The sample with the most detected components was used as a master sample. The mass spectra from the master sample were visually inspected, and spectra consisting mainly of low-intensity noise were removed. Then the deconvoluted mass spectra of all the other samples were matched with the master mass spectra. Both mass spectra and retention times were used to match samples. As the samples were very similar, the probability of obtaining errors due to the choice of master sample was regarded as low. To obtain accurate peak areas for the deconvoluted components, the unique masses for each component were specified and the samples were reprocessed.

Nonprocessed MS files from GC/TOFMS analysis were exported in CSV format to MATLAB software 6.5 (Mathworks, Natick, MA), where all data pretreatment procedures, such as baseline correction and chromatogram alignment, were performed using custom scripts. Multivariate analysis was performed with SIMCA-P+ 10.0.4.0 software (Umetrics AB, Umeå, Sweden).

RESULTS AND DISCUSSION

GC/MS System. Since the first GC/TOFMS instruments became commercially available a few years ago, intense interest has developed in using them to analyze complex mixtures (24). The main advantage of these systems is that they enable spectra to be accumulated rapidly, thereby increasing the speed of GC/MS analyses and making the deconvolution of chromatograms more accurate. The chosen spectra accumulation speed (30 spectra s⁻¹) resulted in 20-40 data points per peak.
With the rapid GC temperature programming, each GC/MS analysis took only 15 min, and more than 200 compounds could be detected in each analysis after deconvolution. A typical GC/TOFMS total ion current chromatogram obtained from a complex plant extract is shown in Figure 1.

(24) Veriotti, T.; Sacks, R. Anal. Chem. 2001, 73, 4395-4402.

Figure 2. Summary of the new strategy for rapid comparison of GC/MS data sets. White boxes show the steps that are done with all samples individually, gray boxes steps involving all samples simultaneously, and black boxes steps that involve modified data simultaneously.

Strategy for Comparison of Samples. A new strategy for rapid comparison of GC/MS data sets was developed and is summarized in Figure 2. It involves smoothing and correction of baselines, alignment, window setting, AR, and multivariate modeling. All steps, including importing raw data into the processing software, were computer automated, except for the window setting, which is done manually. The strategy is described in detail below.

Preprocessing of Data. All MS files (nonprocessed; CSV format) were exported into MATLAB software for further processing. Ideally, the nonprocessed MS files would be subjected directly to multivariate analysis, but certain problems make preprocessing of the raw data essential: baseline problems, retention time drifts, variations in peak shape, and differences in recovery between the analyzed samples. An important step in the preprocessing procedure is to reduce the noise level in every m/z channel. We have chosen to smooth the data with moving averages (each of seven time points). This reduces the noise and makes it easier to observe the start and end of the peaks. To reduce the background, the minimum response value in each m/z channel is subtracted from all the other response values in the same m/z channel, so that the lowest point of the channel is set to zero.
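The smoothing and background-subtraction steps just described can be sketched in a few lines. This is a minimal illustration in Python, not the authors' MATLAB scripts: the function names, the dictionary layout (m/z mapped to a list of responses), and the way the seven-point window is shortened at the ends of a trace are all assumptions made for the example.

```python
def moving_average(signal, width=7):
    """Centered moving average over `width` time points.

    Assumption: near the ends of the trace the window is simply truncated.
    """
    half = width // 2
    n = len(signal)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = signal[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def subtract_minimum(signal):
    """Shift one m/z channel so that its lowest response becomes zero."""
    floor = min(signal)
    return [value - floor for value in signal]


def preprocess(chromatogram, width=7):
    """Smooth and baseline-shift every m/z channel.

    chromatogram: dict mapping an m/z value to its list of responses over time
    (a hypothetical layout chosen for this sketch).
    """
    return {mz: subtract_minimum(moving_average(trace, width))
            for mz, trace in chromatogram.items()}
```

Each channel is treated independently, so the same function applies unchanged to any number of m/z channels and scans.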
Liang et al. (25) suggested that zero component regions can be used for background correction. In accordance with this approach, we have linearly interpolated between zero component regions to remove systematic noise from the data (see the Time Windows section).

(25) Liang, Y. Z.; Kvalheim, O. M.; Rahmani, A.; Brereton, R. G. Chemom. Intell. Lab. Syst. 1993, 18, 265-279.

Figure 3. Differences in retention times between different injections (A) and alignment of chromatograms (B) for correction of retention shifts prior to multivariate comparisons of samples. All chromatograms are aligned so that the covariance of each TIC with a master TIC is maximized.

Figure 4. Division of each chromatogram into time windows, each of which can contain several compounds, to overcome small variations in retention shifts over the whole chromatogram. The figure shows only 3% of the total analysis. Vertical lines represent individual time windows.

Alignment. Another important step for comparing chromatograms by multivariate modeling is to align the chromatograms derived from each sample. Although GC/MS analysis is considered to be very reproducible under optimized injection and GC conditions, small differences in retention between different injections will always occur (Figure 3A). Such differences may arise from small variations in gas flow, solvent composition, temperature, or other unknown factors that influence retention time reproducibility. Several methods have been developed for correcting retention drifts (26-28). The best method for correcting large metabolomic or environmental data sets depends on a number of criteria. One is that it must be simple, since the purpose of the multivariate modeling with raw data is to make the data analysis faster than is usually possible when modeling processed GC/MS data.
The method must also be valid for the alignment of very complex samples, since they are likely to contain large numbers of compounds (more than 200 in this investigation, for instance). The first step in the method published by Malmquist and Danielsson (29) meets these criteria. This method aligns chromatograms by finding the maximal covariance between the chromatograms (see Supporting Information I). It is especially applicable when the samples are very similar to each other. This means that the largest peaks in the chromatogram must be found in all measurements, because they contribute most to the covariance. In metabolomic studies, very similar samples are usually compared, and most of the major peaks in the chromatograms occur in all of the samples, so there is only a small risk that the peak alignment (Figure 3B) cannot be done automatically. It must also be emphasized that the major purpose of the alignment is to allow the retention time windows to be set, rather than to correct for differences in peak shapes in the different chromatograms.

(26) Nielsen, N. P. V.; Carstensen, J. M.; Smedsgaard, J. J. Chromatogr., A 1998, 805, 17-35.
(27) Fraga, C. G.; Prazen, B. J.; Synovec, R. E. Anal. Chem. 2001, 73, 5833-5840.
(28) Johnson, K. J.; Wright, B. W.; Jarman, K. H.; Synovec, R. E. J. Chromatogr., A 2003, 996, 141-155.
(29) Malmquist, G.; Danielsson, R. J. Chromatogr., A 1994, 687, 71-88.

Time Windows. A problem that can often affect chromatogram alignment is that retention drifts may not remain constant throughout the course of each chromatographic analysis. In such cases, an alignment procedure such as the one described above will not be satisfactory if the alignment is based solely on a single large peak in each chromatogram. We have overcome this problem by dividing each MS file into small time windows (Figure 4) and summing the intensity of each m/z channel within each time window.
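The alignment and window-summation steps can be illustrated as follows. This is a sketch under stated assumptions rather than the published procedure (the details are in the paper's Supporting Information, which is not reproduced here): it assumes a rigid shift by a whole number of scans, pads the shifted trace with its edge value, and uses hypothetical names (`best_lag`, `window_spectra`) and a hypothetical `max_lag` search limit.

```python
def covariance(a, b):
    """Sample covariance of two equally long traces."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def shift(trace, lag):
    """Shift a trace by `lag` scans (positive = later), padding with the edge value."""
    n = len(trace)
    if lag > 0:
        return [trace[0]] * lag + list(trace[:n - lag])
    if lag < 0:
        return list(trace[-lag:]) + [trace[-1]] * (-lag)
    return list(trace)


def best_lag(tic, master_tic, max_lag=20):
    """Return the integer lag that maximizes covariance with the master TIC."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: covariance(shift(tic, lag), master_tic))


def window_spectra(chromatogram, boundaries):
    """Sum every m/z channel inside each time window.

    chromatogram: dict mapping m/z to an aligned trace (list of responses).
    boundaries:   sorted scan indices; window k covers boundaries[k]:boundaries[k+1].
    Returns one summed mass spectrum (dict m/z -> total intensity) per window.
    """
    spectra = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        spectra.append({mz: sum(trace[start:end])
                        for mz, trace in chromatogram.items()})
    return spectra
```

In use, the lag found from the TIC would be applied with `shift` to every m/z channel of that sample before `window_spectra` is called with the manually chosen window boundaries.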
Summing the intensities in this way gives a total mass spectrum for each time window. It reduces the chromatographic resolution, but the signals from the different compounds in each time window can still be found in the summed mass spectra, so the information is not lost. The main advantage of this approach is that it increases the accuracy of the multivariate modeling. Even if the chromatogram alignment is not perfect, the differences within a time window will be detected, because the summed mass spectrum of each time window is compared between samples. A time window is defined as a short retention span that starts and ends in a region of the chromatogram that does not contain any compound (a zero component region). A time window of appropriate size can be selected visually for each investigation by plotting all aligned chromatograms at the same time. This stage of the processing (Figure 2) is done simultaneously for all samples, and it is the only step that is done manually; all other steps are fully computer automated. Here we divided each chromatogram into 51 windows. Consequently, in many cases, there are several compounds in each time window. However, that is not a problem, since the multivariate modeling of the summed mass spectrum for each time window will still recognize the differences between samples. Another advantage of the time window approach is that the multivariate modeling is based on smaller data sets, which increases the speed of the process.

Figure 5. Visualization of the summarization and compression of the 3D-data structure in one time window. An approximation of the compound intensities and corresponding mass profiles can be generated using alternating regression. The intensity matrix for all the components in the windows is used for the multivariate analysis.

Extracting Information from Each Time Window Using Alternating Regression.
The first stage in the comparison of samples is a projection step that begins by determining the chemical rank of each time window matrix (i.e., the summed mass spectra of all samples in the respective window; Figure 5). The chemical rank, i.e., the number of components, is found for one time window at a time by examining the eigenvalues from a principal component analysis (PCA) (30). The chemical rank is then set to the number of eigenvalues above the noise level. Using the AR method (6), approximations of the compounds' intensities and their corresponding mass profiles can then be derived by alternating between

$$C = X S (S^{T} S)^{-1} \qquad (1)$$

$$S = X^{T} C (C^{T} C)^{-1} \qquad (2)$$

where C is the concentration matrix, S is the mass spectrum matrix, X is the data matrix, the superscript T denotes transposition, and the superscript -1 denotes the inverse of a matrix. The summed mass spectra from each sample are rows in the X matrix. Both C and S are subject to nonnegativity constraints, meaning that the concentrations and the mass spectra can never include negative values. Negative values in the spectrum profile are set to zero, and negative concentration values are set to the lowest of all positive concentrations. A random number is used as a starting estimate of S. If the AR algorithm does not find a solution, or the correlation between spectral profiles is larger than 95%, the chemical rank is reduced by one. Highly correlated profiles can arise, for example, because the aldehyde and keto groups are converted into oximes during the derivatization, so that some metabolites form two tautomeric forms that appear very close together in the chromatogram, with identical mass spectra.

From the AR analysis, the concentrations, C, are used to describe the concentration of the components in every sample in each time window. C values from all time windows are used as an X matrix (246 variables generated from the 51 time windows) in further multivariate modeling.

(30) Wold, S.; Esbensen, K.; Geladi, P. Chemom. Intell. Lab. Syst. 1987, 2, 37-52.
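The alternation between eqs 1 and 2, with the nonnegativity rules stated above, can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation: the fixed iteration count, the seed, and the helper names are assumptions, and the sketch omits both the PCA-based rank estimate and the rule that lowers the rank when spectral profiles correlate above 95%.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Gauss-Jordan inverse of a small square matrix; raises ValueError if singular."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def least_squares(a, b):
    """Return a . b . (b^T b)^-1, the form shared by eqs 1 and 2."""
    return matmul(matmul(a, b), inverse(matmul(transpose(b), b)))


def alternating_regression(x, rank, iterations=50, seed=0):
    """x: samples-by-m/z matrix of summed spectra for one time window.

    Returns (c, s): sample-by-component concentrations and m/z-by-component spectra.
    """
    rng = random.Random(seed)
    n_mz = len(x[0])
    # Random starting estimate of S, as in the text.
    s = [[rng.random() for _ in range(rank)] for _ in range(n_mz)]
    for _ in range(iterations):
        c = least_squares(x, s)                                  # eq 1
        positive = [v for row in c for v in row if v > 0]
        floor = min(positive) if positive else 0.0
        # Negative concentrations are set to the lowest positive concentration.
        c = [[v if v >= 0 else floor for v in row] for row in c]
        s = least_squares(transpose(x), c)                       # eq 2
        # Negative spectral values are set to zero.
        s = [[max(v, 0.0) for v in row] for row in s]
    return c, s
```

A `ValueError` from `inverse` corresponds to the case where no solution is found; in the strategy described above, the chemical rank would then be reduced by one and the window reprocessed.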
C contains only information about compounds that differ between samples, because all samples have been compared at the same time. Each of the values for C has corresponding m/z values that can be used for interpretation later. Using this method, the information from each time window is explained by a small number of variables instead of the entire summed mass spectrum. This results in a more easily interpretable PLS model.

Multivariate Analysis of Data. Because a large amount of information can be found in each analysis, a number of different multivariate modeling techniques can be applied. The general approach (in the few metabolomic studies that have been published, for instance) has been to use PCA or HCA. However, we have chosen to find the differences between samples by using PLS analysis (31). The main advantage of PLS is that it is a supervised calibration method that allows the information from the X matrix to be correlated with response data, the Y matrix, rather than simply describing the variation in the X matrix, as PCA does. PLS models can be used not only for predicting Y values in new samples but also to evaluate the ability to predict the different groups, for example, using cross-validation (32). PLS discriminant analysis (PLS-DA) is often used to locate differences between members of different logical groups. PLS-DA searches for structural information that can discriminate between sample groups. The variables used to describe each sample are the "concentrations" derived from AR analysis, and the "concentrations" form the X matrix. The Y matrix is the description of the properties of the samples; here a "dummy matrix" is used to describe group membership. A dummy matrix has the same number of columns as the number of classes and the same number of rows as the number of samples. For a column representing a specific class, the row values are zeros except for samples belonging to the class that corresponds to the column.
For these samples, values are set to 1.

Each sample is normalized before the multivariate analysis. Normalization is essential if the samples are not identical, e.g., if there are differences in sample weight or volume, or if a purification or derivatization step is involved that might result in variations in recovery. Our approach is to divide the response values by the sample weight and the intensity of one or more internal standards. In GC/MS analysis of complex, derivatized samples, the diversity of the compound classes will inevitably cause differences in recovery for different compounds. For more accurate compensation of differences in recovery, more than one internal standard should be used, and the internal standards should represent a wide range of compounds and chromatographic behavior, covering the whole range of the chromatogram. In our methodology for multivariate modeling, such an approach can be used to normalize separate time windows with different internal standards.

Before a PLS model is calculated, the variables in the X matrices are log-transformed and then centered and scaled to unit variance. Log transformation makes it easier to fulfill the normal distribution requirement. The Y matrix is centered and scaled to unit variance (SIMCA-P default). After the PLS model has been calculated, the ability of the model to estimate Y from X is assessed. This can be done by using the model to make predictions concerning new and known samples. This is the point at which a decision can be made as to whether further data analysis of the sample set would be worthwhile or whether there appear to be no differences between the samples according to the multivariate modeling. If the data analysis is to be continued, the most important variables in X that are correlated to Y will be sought.

(31) Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109-130.
(32) Wold, S. Technometrics 1978, 20, 397-405.
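The construction of the dummy Y matrix and the pretreatment of the X variables can be sketched as below. This is an illustration only (the study itself used SIMCA-P for these steps): the function names are hypothetical, the base-10 logarithm is an assumption since the text does not state the base, and every value is assumed to be strictly positive, with each variable varying between samples.

```python
import math


def dummy_matrix(labels, classes):
    """One column per class; 1 where the sample belongs to that class, else 0."""
    return [[1 if label == cls else 0 for cls in classes] for label in labels]


def normalize(response, sample_weight, internal_standard):
    """Divide a response by the sample weight and an internal standard intensity."""
    return response / (sample_weight * internal_standard)


def log_autoscale(column):
    """Log-transform one X variable, then center it and scale it to unit variance.

    Assumptions: base-10 logarithm, strictly positive values, and a variable
    that is not constant (otherwise the standard deviation would be zero).
    """
    logged = [math.log10(v) for v in column]
    n = len(logged)
    mean = sum(logged) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / (n - 1))
    return [(v - mean) / sd for v in logged]
```

For a two-class comparison such as LD2 versus SD6, `dummy_matrix(labels, ["LD2", "SD6"])` gives two complementary columns, so the Y matrix effectively has rank 1.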
These variables can be identified by interpreting the loadings from the PLS model using jackknifing (33, 34) and locating the time window(s) in which significant variables occur. Furthermore, by using AR as described above, the m/z values that differ between the samples in each time window are also identified.

Comparison of the New Method with the Curve Resolution Data Approach. The traditional way of comparing GC/MS data sets is to resolve chromatographic overlaps in the MS files, then calculate the relative amounts of each compound, and finally subject the data to multivariate modeling. To test whether our new approach is able to extract the same information as the traditional method, we compared the two methods by analyzing GC/MS data obtained from 84 aspen samples. These samples were taken from trees included in a project designed to identify molecular and biochemical signals associated with the induction of growth cessation in woody species, a process that is poorly understood. The samples consisted of three leaves (2, 10, and 20) collected from plants growing under either LD or SD photoperiods. For more information, see the Experimental Section.

Automatic identification and quantification with the ChromaTOF software resulted in different quantification masses being suggested for the same compound in different samples, which could have resulted in errors during the comparison of samples. Therefore, specific quantification masses were selected for each selected compound in the master sample. This deconvolution of 54 samples (only leaves 10 and 20 were analyzed this way; note that two samples were lost during extraction) using the ChromaTOF software took more than 2 days and resulted in 241 quantified compounds. The concentrations of the compounds were used as X variables in the multivariate data analysis. All variables were log-transformed, centered, and scaled to unit variance. The data were also normalized against an internal standard and sample weight.
PCA was applied first, to compare the two methods. Both methods gave very similar results, and a clear separation between the groups corresponding to leaf 10 and leaf 20 could be seen (Figure 6A and B). The usefulness of the method was also apparent when the analysis of leaf 2 was added, which resulted in a clear separation of the three groups (Figure 6C). This was not surprising, as leaves 2, 10, and 20 are at completely different developmental stages (sink to source), and the metabolic differences are therefore very large. (See Supporting Information, where the new method has been used with independent metabolomics data published on the Web.)
Figure 6. PCA score plots from the analysis of leaf 10 (O) and 20 (3) from four different classes of plants. (A) shows a score plot from processed MS files and (B) a score plot from nonprocessed MS files derived by the described method. (C) shows a score plot from nonprocessed MS files derived by the described method with data from leaf 2 (0), leaf 10 (O), and leaf 20 (3).
Figure 7. PCA and PLS-DA score plots from the analysis of leaf 10 from four different classes of plants (9, LD0; *, LD2; O, SD2; 2, SD6). (A) shows a PCA score plot from processed MS files, (B) shows a PLS-DA score plot from processed MS files, and (C) shows a PLS-DA score plot from nonprocessed MS files derived by the described method.
Figure 8. PLS loading plots (w[1]) of limit regions for (A) processed MS files and (B) nonprocessed MS files derived by the described method. Positive values are associated with LD and negative values with SD. The confidence interval was calculated with jackknifing.33,34 Two retention time windows are shown. In window 16, the two methods show variables that are significantly correlated with SD. In window 17, both methods show variables that are significantly correlated with LD and SD. After processing of time windows of interest, the statistical significance was checked with the Student t-test.
(33) Efron, B. Ann. Stat. 1986, 14, 1301-1304.
(34) Martens, H.; Martens, M. Food Quality Preference 2000, 11, 5-16.
However, to further explore the usefulness of the new method, four groups of samples, representing plants subjected to four different photoperiodic treatments, were investigated. Instead of using PCA, we used a supervised method for this, since the aim in metabolomic studies is often to identify variables that statistically explain the differences between groups of samples, and PCA would not detect the small differences between the groups of samples that were analyzed in the present investigation. In metabolomics and also, for example, transcriptomics, analysis of small changes over time is very common, and the methods used must therefore also be able to identify small differences. This was observed in the present investigation, where PCA of the leaf 10 LD and SD samples could not separate the different groups (Figure 7A). PLS-DA was therefore used to investigate the differences between two of these groups (LD2 and SD6; Y matrix). The PLS-DA model with the X matrix generated with deconvolution (metabolite concentrations) shows two significant components according to cross-validation.29 The explained variation in the X matrix (R2X) is 0.36, the explained variation in the Y matrix (R2Y) is 0.93, and the predictive ability according to 7-fold cross-validation (Q2Y) is 0.55. The PLS-DA model with the X matrix generated with our new method shows one significant component; R2X is 0.19, R2Y is 0.82, and Q2Y is 0.49. To validate the models, the two groups (LD0 and SD2) that were not included in the model were used to test its predictive quality (Figure 7B and C).
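The model statistics quoted above (R2X, R2Y, and Q2Y from 7-fold cross-validation) can be illustrated with a one-component PLS model written directly in NumPy. This is a sketch under assumptions rather than the software used in the study: the function names, the random assignment of samples to folds, the 0/1 dummy coding of the two classes, and the common definition Q2Y = 1 - PRESS/SS(Y) are all choices made here for illustration, and the exact definitions used by the authors are not given in this section.

```python
import numpy as np

def fit_pls1(X, y):
    """Fit a one-component PLS model for a single response vector y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)      # first weight vector, unit length
    t = Xc @ w                     # scores
    p = Xc.T @ t / (t @ t)         # X loadings
    q = (t @ yc) / (t @ t)         # regression of y on the scores
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "p": p, "q": q}

def predict_pls1(model, X):
    """Predict the response for new rows of X."""
    t = (X - model["x_mean"]) @ model["w"]
    return model["y_mean"] + t * model["q"]

def model_statistics(X, y, n_folds=7, seed=0):
    """Return (R2X, R2Y, Q2Y) for a one-component model."""
    model = fit_pls1(X, y)
    Xc = X - model["x_mean"]
    t = Xc @ model["w"]
    ss_x = np.sum(Xc ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)
    r2x = 1.0 - np.sum((Xc - np.outer(t, model["p"])) ** 2) / ss_x
    r2y = 1.0 - np.sum((y - predict_pls1(model, X)) ** 2) / ss_y

    # Cross-validation: refit without each fold and predict the left-out rows.
    order = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)
        cv_model = fit_pls1(X[train], y[train])
        press += np.sum((y[fold] - predict_pls1(cv_model, X[fold])) ** 2)
    q2y = 1.0 - press / ss_y
    return r2x, r2y, q2y
```

Because Q2Y is computed only from samples left out of the fit, it is normally lower than R2Y, as in the values reported above (0.93 versus 0.55, and 0.82 versus 0.49).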
Both methods predicted that LD0 (9) would be similar to LD2 (*) and that SD2 (O) would be intermediate between LD2 and SD6 (2). This is consistent with the hypothesis that samples grown for 2 more days in LD (LD2) should group with the LD0 samples. In samples exposed to only 2 days of short photoperiods, only very small metabolic changes will have occurred compared to those that will have occurred after 6 SDs. Therefore, the prediction that SD2 is intermediate between LD2 and SD6 fits well with the hypothesis.
(35) Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119-128.
The two methods were further studied by comparing the loadings from the PLS models. We have used the first loading vector (w) (Figure 8). This is the loading vector that should be interpreted in the PLS model when the rank of Y is 1 (ref 35). If the confidence interval (from jackknifing) does not include 0, a variable is considered significant. In 7 of the 51 time windows, neither of the methods found significant variables; in 30 of the windows, both methods found significant variables; in 5 of the windows, the traditional method found significant variables but the new method did not; and in 9 of the windows, only the new method found significant variables (in 4 of these windows the traditional method did not find any peaks at all). The results show that the two methods give similar results; e.g., differences in amino acid and carbohydrate metabolism (e.g., norvaline, glutamic acid, aspartic acid, fumaric acid, α-ketoglutaric acid, and mannose phosphate; a complete list of metabolic differences identified in the samples will be published elsewhere) were observed with both methods. The small differences that still were observed can be explained by the fact that neither method is 100% accurate; e.g., automatic and accurate deconvolution of 54 complex samples is very difficult. However, with the new method, the deconvolution is more rapid.
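The significance rule described above (a variable is significant when the jackknife confidence interval of its loading does not include 0) can be sketched as follows. The sketch assumes a leave-one-out jackknife on the first PLS weight vector and a normal-approximation multiplier `z = 1.96`; both are illustrative assumptions, the function names are hypothetical, and the cited jackknife procedure may differ in detail.

```python
import numpy as np

def first_pls_weight(X, y):
    """First PLS weight vector w for a single response (unit length)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)

def jackknife_significance(X, y, z=1.96):
    """Return w, the interval bounds, and a mask of significant variables."""
    n = len(y)
    w_full = first_pls_weight(X, y)

    # Recompute w with each sample left out in turn.
    w_loo = np.array([
        first_pls_weight(np.delete(X, i, axis=0), np.delete(y, i))
        for i in range(n)
    ])

    # Standard jackknife estimate of the standard error of each weight.
    spread = np.sum((w_loo - w_loo.mean(axis=0)) ** 2, axis=0)
    se = np.sqrt((n - 1) / n * spread)

    lower, upper = w_full - z * se, w_full + z * se
    significant = (lower > 0) | (upper < 0)  # interval excludes 0
    return w_full, lower, upper, significant
```

With classes coded so that LD is the higher value of y, a significant positive weight would point toward LD and a significant negative weight toward SD, matching the sign convention given in the caption of Figure 8.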
A significant variable from the PLS-DA model corresponds to one time window and an m/z profile in that time window (see Supporting Information III for an example). It is then possible to use unique masses to detect the compound(s) that explain the differences in that specific time window. This speeds up the deconvolution significantly and thereby decreases the total time spent on deconvolution.
CONCLUSIONS
Development of the new strategy for comparing GC/MS data has been driven by the desire to be able to rapidly compare large numbers of samples. To our knowledge, no multivariate method that uses nonprocessed GC/MS data for comparison of samples has been developed previously. This is in contrast to comparisons of 1H NMR samples, where two-dimensional raw data (unlike GC/MS data, which are three-dimensional) have been used for multivariate modeling in a number of cases.2 Every step in the new method is semiautomated and therefore less labor-intensive than the more traditional methods, and more than 50 samples can easily be compared within 5 h. Although mathematical curve resolution procedures must still be applied to the retention time windows that explain the differences between samples, it is advantageous to decrease the number of sections of the chromatograms that need to be processed in this way. By using AR, knowledge about the masses of interesting compounds is also acquired, which further simplifies the deconvolution. Most importantly, the method provides a rapid means of screening for differences between large sample sets, allowing the experimenter to decide, at an early stage, the value of continuing to characterize differences between the analyzed samples.
ACKNOWLEDGMENT
We thank Wallenberg Consortium North (WCN), EU-strategic funding, and the Swedish Research Council for financial support, Krister Lundgren for help with the GC/MS analysis, and Dr. Anders Berglund for comments on the manuscript.
SUPPORTING INFORMATION AVAILABLE
Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.
Received for review October 21, 2003. Accepted January 9, 2004.
AC0352427