INVESTIGACIÓN - digital

PREPRINT DOCUMENT
Information of the Journal in which the present paper is published:
 Actualidad Analítica, Boletín de la SEQA, 48: 30-32, 2015
INVESTIGACIÓN
CHALLENGES IN IDENTIFICATION OF MS DATA IN -OMICS: PROFILE/CENTROID ACQUISITION AND THE BENEFIT
OF CHEMOMETRICS
E. Gorrochategui Matas (1), Y. Wang (2), S. Lacorte (1), C. Porte (1), R. Tauler (1)
(1) Department of Environmental Chemistry, IDAEA-CSIC, Barcelona, Spain; (2) Cerno Bioscience, Norwalk, United States
e-mail: egmqam@cid.csic.es
In the present work, protocols for data conversion, compression,
processing and identification are presented. In most cases, they
are described in the MATLAB programming language. Data conversion
protocols are defined for Waters, Thermo and Agilent
instrumentation. For data compression, two distinct approaches
are compared: the commonly used “binning” procedure and a
recently developed procedure based on searching for regions of
significant mass traces, referred to as “regions of interest” or
ROIs [3]. For data processing, MCR-ALS is presented as a valid
method for the proper resolution of chromatographic -omic
profiles without the need for peak alignment or peak shaping.
Example results are presented for an LC-MS lipidomic analysis of
human placental choriocarcinoma cells (JEG-3) exposed to
contaminants [4].
Figure 1 shows an overview of the data processing workflows for
both the targeted and untargeted approaches. The procedures for
file conversion, data compression and data processing in the
untargeted approach are described in detail in the next section.
1. State-of-the-art and objectives
[Figure 1 panel labels. Targeted omics: direct search in databases (LipidMaps, MassBank, Metlin, Proteomics, others). Untargeted omics: acquisition (LC-MS); raw data; file conversion with a software file converter (Waters) to netCDF, TXT or mzXML; data compression by binning (bin size) or ROI (mass traces); reorganization into an m/z x tr matrix.]
Liquid chromatography coupled to mass spectrometry (LC-MS) has
evolved into a powerful analytical methodology widely used in
several -omic platforms, such as metabolomics, which includes
lipidomics among others. However, the interpretation of LC-MS
-omic profiles is challenging for researchers, since the data
obtained contain thousands to millions of features to analyze.
Therefore, there is an urgent need to develop fast, automated and
untargeted data processing methods to replace the traditional,
time-consuming targeted approaches.
Progress in building novel untargeted methods has focused on new
ways of data compression and feature detection. Compression of
LC-MS data is not a simple procedure, since it must ensure that
no relevant information, e.g. spectral resolution, is lost. In
addition, correct feature detection is not straightforward, even
when using high-resolution MS, which is characterized by its high
mass accuracy. Recent studies on spectral accuracy have shown
that acquiring MS data in profile mode, rather than in the common
centroid mode, provides information on other isotope clusters,
allowing a better identification of “unknowns” [1]. Moreover,
resolving the profiling problem, e.g. solving chromatographic
coelutions, is a crucial step in the identification process. For
the profiling problem, chemometric methods such as multivariate
curve resolution-alternating least squares (MCR-ALS) [2] are a
powerful but still little explored approach.
Despite the rapid advance of bioinformatic tools, a single valid
and commonly accepted untargeted method for LC-MS data processing
is still being pursued. Therefore, the main objectives of our
research are, on the one hand, to evaluate the distinct data
conversion, compression, processing and identification methods
for LC-MS -omic studies and, on the other hand, to demonstrate
the multiple advantages of using chemometric methods, such as
MCR-ALS, in -omics.
[Figure 1 panel labels, data processing: compressed data (no bin size); solving coelutions; commercial and open-source frameworks (MarkerLynx, MetAlign, XCMS, MZmine, etc.); chemometrics (PCA, MCR-ALS).]
Figure 1. Overview of the distinct data processing strategies used
for targeted and untargeted -omic studies.
2. Procedures
2.1. Data conversion
Raw data in vendor format need to be converted into a format
readable by MATLAB, such as text (.txt), netCDF (.cdf) or mzXML.
Step-by-step procedures for data conversion are described below
for the three most important manufacturers of LC-MS
instrumentation (Waters, Thermo and Agilent Technologies) and
their respective software platforms (MassLynx, Xcalibur and
MassHunter).
(c) Agilent Technologies (MassHunter)
This software requires an external program, ProteoWizard, to
carry out the data file conversion.
1| Install ProteoWizard as described on its website.
2| Go to the MSConvert options.
3| Click ‘Browse’ and select the source folder of the raw data files
(.d) to convert. Multiple files can be selected at once.
4| Click the ‘Add’ button.
5| Select the output directory.
6| Select the output format (.mzXML or .txt).
7| Click ‘Start’ to begin file conversion.
2.2. Data compression
Raw files from high-resolution LC-MS instrumentation contain
large amounts of data that are very difficult to process. The
case presented in Figure 2 corresponds to a 20-minute LC-TOF-MS
chromatogram, which implies about 1800 time points when acquired
at 1.5 scans/s and 7000000 m/z values for a resolution of 0.0001
amu over an MS range of 700 amu. Thus, the full matrix for such a
chromatogram has dimensions (1800 x 7000000) and would require
about 0.1 terabytes of storage (1800 x 7000000 x 8 bytes).
Moreover, this is only for a single LC-MS chromatogram: a set of
10 samples, for instance, would need about 1 terabyte, which is
not currently feasible for standard laboratory computers.
For this reason, one of the first and most crucial steps in data
processing is compression, which needs to be reliable, i.e. with
no loss of spectral information. Here we describe two distinct
strategies for MS data compression.
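The storage estimate given above can be checked with a few lines of arithmetic. The sketch below is written in Python rather than MATLAB, and the function name and its arguments are illustrative only; it assumes, as the text does, that every matrix element is stored as an 8-byte value.

```python
def dense_matrix_size(minutes, scans_per_s, mz_range, mz_step, bytes_per_value=8):
    """Return (number of scans, number of m/z values, bytes) for a full matrix."""
    n_scans = round(minutes * 60 * scans_per_s)
    n_mz = round(mz_range / mz_step)
    return n_scans, n_mz, n_scans * n_mz * bytes_per_value


# Values taken from the text: 20 min at 1.5 scans/s, 700 amu at 0.0001 amu.
n_scans, n_mz, n_bytes = dense_matrix_size(20, 1.5, 700, 0.0001)
terabytes = n_bytes / 1e12          # about 0.1 TB for one chromatogram
terabytes_10_samples = 10 * terabytes  # about 1 TB for ten samples

print(n_scans, n_mz, round(terabytes, 4), round(terabytes_10_samples, 3))
```

The same function gives the size of the binned matrix discussed below when the m/z step is set to the bin size.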
(a) Binning
Binning is the most widely used procedure for the compression of
raw LC-MS data. The term can be defined as the grouping of mass
values into a small number of bins, each containing the data
within a particular m/z range [3]. In high-resolution mass
spectrometry, this procedure turns the small m/z acquisition
intervals (0.0001-0.0009 units) into broader m/z chunks, whose
width is set precisely by the bin size. Figure 2a shows an
example of the binning procedure.
(b) ROI
In contrast, searching for regions of interest (ROIs) in the
LC-MS chromatograms allows the original LC-MS data to be
compressed with no loss of spectral resolution. Regions of
interest contain data from relevant mass traces, i.e. values with
significant intensity, higher than a fixed signal-to-noise ratio
threshold (SNRThr). For this reason, ROIs are also defined as
“high density data point regions”. Moreover, these high-density
regions must contain a minimum number of consecutive data points
(ρmin ≥ 3) within a specific mass deviation, typically set to a
generous multiple of the mass accuracy (µ, given in ppm) of the
mass spectrometer. As shown in Figure 2b, ROIs are searched scan
by scan, and mass traces of different lengths are obtained in
each case. Density-based regions common to several scans are then
combined to obtain the final number of ROIs. For each ROI, the
m/z value is calculated as the mean of the m/z values of its
series of data points. In the same way, the intensity value
associated with a ROI is calculated as the sum of the intensities
of its series of data points. Matrix dimensions are thus also
reduced in the m/z direction, but with no loss of spectral
resolution (Figure 2b): the m/z dimension corresponds to the
number of relevant mass traces found in the LC-TOF-MS
chromatogram.
Unlike binning, the ROI strategy yields compressed data with
non-equidistant m/z intervals, which requires an additional
reorganization step to finally obtain a data matrix.
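The ROI search and the reorganization step described above can be sketched as follows. This is a simplified Python illustration, not the published ROI routine: the function names are hypothetical, a plain intensity threshold stands in for SNRThr, each point is assigned to the first ROI whose running mean m/z lies within the ppm tolerance, and intensities are summed per scan when the matrix is built.

```python
def longest_run(scan_indices):
    """Length of the longest run of consecutive scan indices."""
    idx = sorted(set(scan_indices))
    best = run = 1
    for a, b in zip(idx, idx[1:]):
        run = run + 1 if b == a + 1 else 1
        best = max(best, run)
    return best


def find_rois(scans, intensity_thr, ppm_tol, min_points=3):
    """scans: list of (mz_list, intensity_list), one pair per retention time."""
    rois = []  # each ROI keeps the m/z, scan index and intensity of its points
    for scan_idx, (mzs, ints) in enumerate(scans):
        for mz, inten in zip(mzs, ints):
            if inten < intensity_thr:
                continue  # below the threshold: treated as noise
            for roi in rois:
                mean_mz = sum(roi["mz"]) / len(roi["mz"])
                if abs(mz - mean_mz) <= mean_mz * ppm_tol * 1e-6:
                    break  # the point belongs to this existing mass trace
            else:
                roi = {"mz": [], "scan": [], "int": []}  # start a new trace
                rois.append(roi)
            roi["mz"].append(mz)
            roi["scan"].append(scan_idx)
            roi["int"].append(inten)
    # Keep only traces with enough consecutive data points.
    return [r for r in rois if longest_run(r["scan"]) >= min_points]


def rois_to_matrix(rois, n_scans):
    """Reorganize ROIs into a (scans x ROIs) matrix plus its m/z axis."""
    rois = sorted(rois, key=lambda r: sum(r["mz"]) / len(r["mz"]))
    mz_axis = [sum(r["mz"]) / len(r["mz"]) for r in rois]  # mean m/z per ROI
    matrix = [[0.0] * len(rois) for _ in range(n_scans)]
    for col, r in enumerate(rois):
        for scan_idx, inten in zip(r["scan"], r["int"]):
            matrix[scan_idx][col] += inten
    return mz_axis, matrix


# Four hypothetical scans with two close mass traces and some noise.
scans = [
    ([100.0312, 100.0338, 250.1], [120, 80, 5]),
    ([100.0313, 100.0339], [300, 150]),
    ([100.0311, 100.0337, 400.2], [200, 90, 60]),
    ([100.0338], [40]),
]
rois = find_rois(scans, intensity_thr=50, ppm_tol=5, min_points=3)
mz_axis, matrix = rois_to_matrix(rois, n_scans=len(scans))
print(mz_axis, matrix)
```

In this toy input the two traces near m/z 100.031 and 100.034 remain separate columns, whereas the single point at m/z 400.2 is discarded because it does not span three consecutive scans.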
[Figure 2, input data: TIC (TOF MS ES+) of the LC-TOF-MS chromatogram of a human cell lipid sample, stored as a high-resolution MS data matrix of tr (1800) x m/z (7000000).]
(b) Thermo (Xcalibur)
1| Go to ‘Tools > File Converter’.
2| Specify the source data type.
3| Click ‘Browse’ and select the source folder of the raw data files
(.raw) to convert.
4| Select the desired files to convert. Multiple files can be
selected at once, and all files are selected automatically by
clicking the ‘Select All’ button.
5| Click the ‘Add Job(s)’ button.
6| Select the destination path and data type: ‘ANDI Files’ for .cdf
format or ‘Text Files’ for .txt format.
7| Click ‘Convert’ to begin file conversion.
In the binning example of Figure 2a, a bin size of 0.1 m/z units
is applied to the LC-TOF-MS chromatogram mentioned before. As
shown in the figure, the initial matrix of dimensions
(1800 x 7000000) turns into a matrix of size (1800 x 7000). Thus,
while this procedure reduces the data 1000 times in the m/z
dimension, it also causes a loss of spectral resolution, from
0.0001 to 0.1 amu. With binning, therefore, the final m/z
dimension depends directly on the bin size.
In general terms, binning organizes raw data with non-equidistant
m/z intervals into a matrix representation with regular m/z
chunks. The procedure is general-purpose and allows fast data
processing. However, its application always results in a loss of
spectral resolution.
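A minimal Python sketch of binning is given here to make the loss of resolution concrete. The function names and the example values are hypothetical, and summing the intensities that fall in the same bin is an assumption; other choices, such as keeping the maximum, are equally possible.

```python
import math


def bin_scan(mz_values, intensities, mz_min, mz_max, bin_size):
    """Sum the intensities of one scan into equally spaced m/z bins."""
    n_bins = math.ceil(round((mz_max - mz_min) / bin_size, 9))
    row = [0.0] * n_bins
    for mz, inten in zip(mz_values, intensities):
        if mz_min <= mz < mz_max:
            idx = min(int((mz - mz_min) / bin_size), n_bins - 1)
            row[idx] += inten
    return row


def bin_scans(scans, mz_min, mz_max, bin_size):
    """Turn scans with non-equidistant m/z values into a regular matrix."""
    return [bin_scan(mz, inten, mz_min, mz_max, bin_size) for mz, inten in scans]


# Two hypothetical scans; 100.0312 and 100.0338 are distinct masses.
scans = [
    ([100.0312, 100.0338, 100.2541], [10, 5, 7]),
    ([100.0315, 100.4127], [4, 2]),
]
binned = bin_scans(scans, mz_min=100.0, mz_max=100.5, bin_size=0.1)
print(binned)
```

With a bin size of 0.1, the two masses at 100.0312 and 100.0338 end up in the same column and can no longer be distinguished, which is the loss of spectral resolution discussed above.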
(a) Waters (MassLynx)
1| Open the Databridge interface of the MassLynx file converter.
2| Click ‘Select’ and browse to the raw data file (.raw) to convert.
Only one file can be selected at a time.
3| Click ‘Options’ and specify the source of the raw files
(MassLynx) and the target output format, which must be ‘netCDF’
for .cdf files or ‘ASCII’ for .txt files.
4| Indicate the output directory and the name of the file.
5| Click ‘Convert’ to begin file conversion.
[Figure 2 panel labels: LC-TOF-MS chromatogram (relative abundance in % versus retention time in min); high-resolution MS data, tr (1800) x m/z, ~terabytes of storage; low-resolution MS data after binning, tr (1800) x m/z (7000), bin size = 0.1; MS data after ROI search and reorganization, tr (1800) x m/z; ~megabytes of storage after compression.]
Figure 2. Scheme of the steps involved in data compression by two
distinct strategies: a] binning and b] ROI.
2.3. Data processing by MCR-ALS
[Figure 5 panel labels: sample types (Controls, Treat. A, Treat. B, Treat. C) on one axis and MCR-ALS components on the other.]
Figure 5. Correlation map of sample types and components
considering the areas calculated by MCR-ALS analysis. The results
are shown for only some components.
2.4. Data identification
The identification of the compounds resolved by MCR-ALS is
difficult, even when high-resolution mass information is
available. Recently, some studies have shown that acquiring MS
spectra in continuum or profile mode allows a better
identification of the “unknowns”, since it provides information
on other relevant isotopic clusters [1].
As can be observed in Figure 6, acquisition in profile mode
provides a numerical representation of the ion dispersion or
aberration in the mass spectrometer, including spatial and
velocity dispersion inside the ion source. Thus, mass readings
are represented as Gaussian curves, whereas in centroid mode they
appear as discrete lines.
[Figure 6 panel labels: continuum spectra (M, M+1 and M+2 peaks along m/z), no information loss, bigger data file size; stick spectra (M, M+1 and M+2 lines), smaller data file size, information loss.]
Figure 6. Schematic representation of a mass spectrum acquired
in profile and centroid modes.
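The difference between the two acquisition modes can be illustrated by collapsing a simulated profile spectrum into sticks. The Python sketch below is a crude stand-in for vendor centroiding, whose actual algorithms are not described in the text: the function names, the height threshold, the peak width and the m/z values are all hypothetical, and `min_height` is assumed to be positive.

```python
import math


def gaussian_profile(mz_axis, peaks, sigma):
    """Profile-mode spectrum: a sum of Gaussian curves, one per (m/z, height)."""
    return [sum(h * math.exp(-0.5 * ((m - c) / sigma) ** 2) for c, h in peaks)
            for m in mz_axis]


def centroid_peaks(mz, intensity, min_height):
    """Collapse each profile peak into one (centroid m/z, summed intensity) stick."""
    sticks = []
    i, n = 0, len(mz)
    while i < n:
        if intensity[i] < min_height:
            i += 1
            continue
        j = i
        while j < n and intensity[j] >= min_height:
            j += 1  # extend the region while the signal stays above threshold
        total = sum(intensity[i:j])
        center = sum(m * y for m, y in zip(mz[i:j], intensity[i:j])) / total
        sticks.append((center, total))
        i = j
    return sticks


# Hypothetical isotope cluster M, M+1, M+2 sampled every 0.002 m/z units.
mz_axis = [760.5 + 0.002 * k for k in range(1101)]
peaks = [(760.585, 100.0), (761.588, 45.0), (762.592, 12.0)]
profile = gaussian_profile(mz_axis, peaks, sigma=0.01)
sticks = centroid_peaks(mz_axis, profile, min_height=1.0)
print(len(profile), sticks)
```

The 1101 profile points are reduced to three sticks: the peak positions survive, but the peak shape that the profile mode records is no longer available.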
High-resolution LC-MS data always contain a huge number of
chromatographic peaks with multiple coelutions, especially in
complex samples such as the lipid sample used here as an example.
Resolving such coelutions makes identification easier and more
feasible.
In this context, MCR-ALS methods are powerful tools to solve the
“profiling problem”. Multivariate curve resolution methods are
based on the same bilinear decomposition of the original data
sets used by PCA, but under completely different constraints and
with a different goal. The bilinear model used by MCR, shown in
the equation of Figure 4, is D = C S^T + E.
In this equation, matrix D (I x J) represents the data output of
a second-order instrument. For LC-MS data, D contains the MS
spectra at all retention times (i=1,…I) in its rows and the
chromatograms at all m/z channels (j=1,…J) in its columns. This
matrix is decomposed into the product of two small factor
matrices, C and S^T. The columns of C (I x N) are the elution
profiles of the N (n=1,…N) pure components of D, and the rows of
S^T (N x J) are the spectra of these N pure components. The part
of D that is not explained by the model forms the residual
matrix, E (I x J).
MCR-ALS methods assume that the variation measured in all samples
of the original data set can be described by a combination of a
small number of chemically meaningful profiles. For LC-MS data
sets, the information in the data table can be reproduced by
combining a small number of pure mass spectra (row profiles in
S^T), each weighted by its concentration along the elution
direction (the related chromatographic elution peaks, column
profiles in C).
Figure 4 shows an example of the application of MCR-ALS to a
specific time region of the previously mentioned LC-TOF-MS
chromatogram of a lipid sample. In this case, 4 components are
selected, one of them explaining background noise. As can be
seen, the coelutions are resolved and the areas and masses of
each compound are finally obtained.
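The alternating least squares idea behind the bilinear model D = C S^T + E can be sketched in plain Python. This is not the authors' MATLAB implementation: the helper names are hypothetical, non-negativity is imposed by simply setting negative values to zero after each least-squares step (published MCR-ALS software typically offers more rigorous constraints), and the data are simulated, noise-free elution profiles and spectra.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[p] = aug[p], aug[i]
        pivot = aug[i][i]
        aug[i] = [v / pivot for v in aug[i]]
        for r in range(n):
            if r != i:
                f = aug[r][i]
                aug[r] = [vr - f * vi for vr, vi in zip(aug[r], aug[i])]
    return [row[n:] for row in aug]


def clip(a):
    """Crude non-negativity constraint: negative values are set to zero."""
    return [[max(v, 0.0) for v in row] for row in a]


def mcr_als(d, st_init, n_iter=20):
    """Alternate least-squares updates of C (I x N) and S^T (N x J)."""
    st = [list(r) for r in st_init]
    dt = transpose(d)
    for _ in range(n_iter):
        ct = clip(solve(matmul(st, transpose(st)), matmul(st, dt)))  # C^T given S^T
        c = transpose(ct)
        st = clip(solve(matmul(ct, c), matmul(ct, d)))               # S^T given C
    return c, st


def lack_of_fit(d, c, st):
    """Lack of fit in percent: 100 * ||D - C S^T|| / ||D||."""
    fit = matmul(c, st)
    num = sum((x - y) ** 2 for rd, rf in zip(d, fit) for x, y in zip(rd, rf))
    den = sum(x ** 2 for rd in d for x in rd)
    return 100.0 * (num / den) ** 0.5


# Two simulated coeluting components: 40 retention times, 6 m/z channels.
c_true = [[math.exp(-0.5 * ((t - 15) / 4) ** 2),
           math.exp(-0.5 * ((t - 25) / 4) ** 2)] for t in range(40)]
st_true = [[1.0, 0.6, 0.1, 0.0, 0.3, 0.0],
           [0.0, 0.2, 0.9, 1.0, 0.0, 0.4]]
d = matmul(c_true, st_true)
st_init = [d[3], d[37]]  # initial spectra taken from the purest scans
c, st = mcr_als(d, st_init)
lof = lack_of_fit(d, c, st)
areas = [sum(col) for col in transpose(c)]  # peak area of each component
print(round(lof, 6), areas)
```

The initial spectral estimates are taken from scans where one component clearly dominates, which is one common way to start the iterations. The resolved profiles are only defined up to a scale factor, so the areas are meaningful when compared between samples rather than as absolute values.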
3. Conclusions
LC-MS data from -omic studies contain thousands to millions of
features to analyze. Before further analysis, the data must be
reduced in the m/z dimension, either by binning or by ROI; the
latter is preferable because it causes no loss of spectral
resolution. Identification is facilitated by prior analysis of
the data with MCR-ALS, which has been shown to resolve the
profiling problem, and by acquiring in profile or continuum mode.
However, further research is still necessary to find a single
valid untargeted data processing methodology for -omics.
[Figure 4 panel labels: D, compressed data using ROI/binning/interpolation; results for a 4-component analysis; C, concentration profiles (components a, b, c) with their peak areas; S^T, spectra profiles.]
Figure 4. Example of the application of the MCR-ALS method to a
specific region of a lipidomic LC-TOF-MS chromatogram.
As can be seen in Figure 5, plotting some of the peak areas
obtained by MCR-ALS for each sample type (controls versus treated
samples) reveals the effects produced by the contaminants in the
cells.
Acknowledgements. The research leading to these results has
received funding from the European Research Council under the
European Union’s Seventh Framework Programme
(FP/2007-2013)/ERC Grant Agreement no. 320737. The first author
acknowledges the Spanish Government (Ministerio de Educación,
Cultura y Deporte) for a predoctoral FPU scholarship.
4. References
[1] Wang, Y. et al. (2010) The Concept of Spectral Accuracy for MS. Anal. Chem.
[2] Tauler, R. (1995) Multivariate curve resolution applied to second order data. Chemom. Intell. Lab. Syst.
[3] Tautenhahn, R. et al. (2008) Highly sensitive feature detection for high resolution LC/MS. BMC Bioinf.
[4] Gorrochategui, E. et al. (2014) Characterization of complex lipid mixtures in contaminant exposed JEG-3 cells using liquid chromatography and high-resolution mass spectrometry. Environ. Sci. Pollut. Res.