PREPRINT DOCUMENT
Information of the journal in which the present paper is published: Actualidad Analítica, Boletín de la SEQA, 48: 30-32, 2015

INVESTIGACIÓN

CHALLENGES IN IDENTIFICATION OF MS DATA IN -OMICS: PROFILE/CENTROID ACQUISITION AND THE BENEFIT OF CHEMOMETRICS

E. Gorrochategui Matas (1), Y. Wang (2), S. Lacorte (1), C. Porte (1), R. Tauler (1)
(1) Department of Environmental Chemistry, IDAEA-CSIC, Barcelona, Spain; (2) Cerno Bioscience, Norwalk, United States
e-mail: egmqam@cid.csic.es

In the present work, protocols for data conversion, compression, processing and identification are presented. In most cases, they are described in the MATLAB programming language. Data conversion protocols are defined for Waters, Thermo and Agilent vendor instrumentation. For data compression, two distinct approaches are compared: the commonly used procedure of “binning” and a recently developed procedure based on searching for regions of significant mass traces, referred to as “regions of interest” or ROIs [3]. For data processing, MCR-ALS is presented as a valid method for resolving chromatographic -omic profiles without the need for peak alignment or shaping. Example results are presented for an LC-MS lipidomic analysis of human placental choriocarcinoma cells (JEG-3) exposed to contaminants [4]. Figure 1 shows an overview of the data processing workflows for both the targeted and untargeted approaches. The procedures for file conversion, data compression and data processing in the untargeted approach are described in detail in Section 2.

1. State-of-the-art and objectives

Liquid chromatography coupled to mass spectrometry (LC-MS) has evolved into a powerful analytical methodology widely used in several -omic platforms, such as metabolomics, which includes lipidomics among others. However, interpreting LC-MS -omic profiles is challenging for researchers, since the data obtained contain thousands to millions of features to analyze. Therefore, there is an urgent need to develop fast, automated and untargeted data processing methods to replace the traditional, time-consuming targeted approaches.

Progress in building novel untargeted methods has focused on new ways of data compression and feature detection. Compressing LC-MS data is not a simple procedure, since it must ensure no loss of relevant information, e.g. spectral resolution. In addition, correct feature detection is not straightforward, even when using high-resolution MS, which is characterized by its high mass accuracy. Recent studies on spectral accuracy have shown that acquiring MS data in profile mode, rather than in the common centroid mode, provides information on other isotope clusters, allowing a better identification of “unknowns” [1]. Moreover, resolving the profiling problem, e.g. solving chromatographic coelutions, is a crucial step in the identification process. For the profiling problem, chemometric methods such as multivariate curve resolution alternating least squares (MCR-ALS) [2] appear to be a powerful but still little explored approach.
Despite the rapid advance of bioinformatic tools, a single valid and commonly accepted untargeted method for LC-MS data processing is still being pursued. Therefore, the main objectives of our research are, on the one hand, to evaluate the distinct data conversion, compression, processing and identification methods for LC-MS -omic studies and, on the other hand, to demonstrate the multiple advantages of using chemometric methods, such as MCR-ALS, in -omics.

[Figure 1: workflow diagram. Targeted omics: direct search in databases (LipidMaps, MassBank, Metlin, proteomics databases, others). Untargeted omics: acquisition (LC-MS), raw data, file conversion with a software file converter (netCDF, TXT, mzXML), reorganization of m/z and retention time (tr) values into a matrix, data compression (binning, with a bin size, or ROI mass traces, with no bin size), and processing of the compressed data either with commercial and open-source frameworks (MarkerLynx, MetAlign, XCMS, MZmine, etc.) or with chemometrics (PCA, MCR-ALS, solving coelutions).]
Figure 1. Overview of distinct data processing strategies used for targeted and untargeted -omic studies.

2. Procedures

2.1. Data conversion

Raw data in vendor format need to be converted into a format readable by MATLAB software, such as text (.txt), netCDF (.cdf) or mzXML. Step-by-step procedures for data conversion are described below for three major manufacturers of LC-MS instrumentation, Waters, Thermo and Agilent Technologies, and their respective MassLynx, Xcalibur and MassHunter software platforms.

(a) Waters Technologies (MassLynx)
1| Open the Databridge interface of the MassLynx file converter.
2| Click ‘Select’ and browse to the raw data files (.raw) to convert. Only one file can be selected at a time.
3| Click ‘Options’ and specify the source of the raw files (MassLynx) and the target output format, which must be ‘netCDF’ for .cdf files or ‘ASCII’ for .txt files.
4| Indicate the output directory and the name of the file.
5| Click ‘Convert’ to begin file conversion.

(b) Thermo Technologies (Xcalibur)
1| Go to ‘Tools > File Converter’.
2| Specify the source data type.
3| Click ‘Browse’ and select the source folder of the raw data files (.raw) to convert.
4| Select the files to convert. Multiple files can be selected at once, and all files can be selected automatically by clicking the ‘Select All’ button.
5| Click the ‘Add Job(s)’ button.
6| Select the destination path and data type: ‘ANDI Files’ for .cdf format or ‘Text Files’ for .txt format.
7| Click ‘Convert’ to begin file conversion.

(c) Agilent Technologies (MassHunter)
This software requires an external program called ProteoWizard to perform the data file conversion.
1| Install the ProteoWizard software as described on its website.
2| Go to the MSConvert options.
3| Click ‘Browse’ and select the source folder of the raw data files (.d) to convert. Multiple files can be selected at once.
4| Click the ‘Add’ button.
5| Select the output directory.
6| Select the output format (.mzXML or .txt).
7| Click ‘Start’ to begin file conversion.

2.2. Data compression

Raw files from high-resolution LC-MS instrumentation contain large amounts of data that are very difficult to process. The case presented in Figure 2 corresponds to a 20-minute LC-TOF-MS chromatogram, which implies about 1800 time points when acquired at 1.5 scans/s (20 min x 60 s/min x 1.5 scans/s) and 7000000 m/z values for a resolution of 0.0001 amu over an MS range of 700 amu (700 / 0.0001). Thus, the final matrix dimensions of such a chromatogram are (1800 x 7000000), which would require about 0.1 terabytes of storage (1800 x 7000000 x 8 bytes). Moreover, this is for a single LC-MS chromatogram. For a set of 10 samples, for instance, the storage needed would increase to about 1 terabyte, which is not currently feasible for standard laboratory computers. For this reason, one of the first and most crucial steps in data processing is compression, which needs to be reliable, with no loss of spectral information. Here we describe two distinct strategies for MS data compression.

(a) Binning
Binning is the most widely used procedure for the compression of raw LC-MS data. Binning can be defined as the grouping of mass values into a small number of bins, each containing data within a particular m/z range [3]. In high-resolution mass spectrometry, the small m/z acquisition intervals (0.0001-0.0009 units) are merged into broader m/z chunks by this procedure, and the width of these chunks is set precisely by fixing the bin size. Figure 2a shows an example of the binning procedure applied to the LC-TOF-MS chromatogram mentioned above with a bin size of 0.1 m/z units. As shown in the figure, the initial matrix of dimensions (1800 x 7000000) becomes a matrix of size (1800 x 7000). Thus, while this procedure reduces the data 1000-fold in the m/z dimension, spectral resolution is lost, going from 0.0001 to 0.1 amu. With binning, therefore, the final m/z dimension depends directly on the bin size. In general terms, binning organizes raw data with non-equidistant m/z intervals into a matrix representation with regular m/z chunks. The procedure is general-purpose and allows fast data processing. However, its application always results in a loss of spectral resolution.

(b) ROI
On the other hand, searching for regions of interest (ROIs) in the LC-MS chromatograms allows the original LC-MS data to be compressed with no loss of spectral resolution. Regions of interest contain data from interesting mass traces, i.e. values with significant intensity, higher than a fixed signal-to-noise ratio threshold (SNRThr). For this reason, ROIs are also defined as “high density data point regions”. Moreover, these high-density regions must contain a minimum number of consecutive data points (ρmin ≥ 3) within a specific mass deviation, typically set to a generous multiple of the mass accuracy (µ, given in ppm) of the mass spectrometer. As shown in Figure 2b, ROIs are searched scan by scan, and mass traces of different lengths are obtained in each case. Density-based regions common to several scans are then combined to obtain the final set of ROIs. For each ROI, the m/z value is calculated as the mean of the m/z values of its series of data points. In the same way, the intensity value associated with one ROI is calculated as the sum of the intensities of the series of data points. In this case, the matrix dimensions are also reduced in the m/z direction, but with no loss of spectral resolution (Figure 2b); the m/z dimension corresponds to the number of relevant mass traces found in the LC-TOF-MS chromatogram. Unlike binning, the ROI strategy yields compressed data with non-equidistant m/z intervals, which requires an additional reorganization step to finally obtain a data matrix.

[Figure 2: scheme. An LC-TOF-MS chromatogram (TOF MS ES+ TIC; relative abundance (%) versus retention time (min)) of a lipid extract from human cells gives a high-resolution MS data matrix of tr (1800) x m/z (7000000), on the order of terabytes of storage. (a) Binning with a bin size of 0.1 gives a low-resolution matrix of tr (1800) x m/z (7000), on the order of megabytes of storage. (b) ROI search followed by reorganization gives a high-resolution MS data matrix with 1800 rows (tr).]
Figure 2. Scheme of the steps involved in data compression by two distinct strategies: (a) binning and (b) ROI.
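The storage estimate and the two compression strategies of Section 2.2 can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the authors' MATLAB implementation. The function names (`storage_bytes`, `bin_scans`, `find_rois`), the representation of a scan as a list of (m/z, intensity) pairs, and the use of a plain intensity threshold in place of SNRThr are all assumptions made for this example. ROI membership is decided here by a ppm tolerance around the running mean m/z of each region, which is a simplification of the scan-by-scan search described above.

```python
from collections import defaultdict


def storage_bytes(n_scans, n_mz, bytes_per_value=8):
    """Size of a dense (n_scans x n_mz) matrix; 8 bytes per value as in the text."""
    return n_scans * n_mz * bytes_per_value


def bin_scans(scans, mz_min, mz_max, bin_size):
    """Binning: sum intensities into equidistant m/z bins of width bin_size.

    scans is a list of scans; each scan is a list of (mz, intensity) pairs.
    Returns one row per scan, all rows with the same number of columns.
    """
    n_bins = int(round((mz_max - mz_min) / bin_size))
    matrix = []
    for scan in scans:
        row = [0.0] * n_bins
        for mz, intensity in scan:
            if not (mz_min <= mz < mz_max):
                continue  # outside the acquired mass range
            idx = min(int((mz - mz_min) / bin_size), n_bins - 1)
            row[idx] += intensity
        matrix.append(row)
    return matrix


def longest_consecutive_run(scan_indices):
    """Length of the longest run of consecutive scan indices."""
    best = run = 0
    previous = None
    for i in sorted(scan_indices):
        run = run + 1 if previous is not None and i == previous + 1 else 1
        best = max(best, run)
        previous = i
    return best


def find_rois(scans, snr_thr, ppm_tol, min_points=3):
    """Simplified ROI search (hypothetical helper, see assumptions above).

    Keeps data points with intensity above snr_thr, groups them by m/z
    within ppm_tol of the running mean of a region, and retains regions
    present in at least min_points consecutive scans. Returns the ROI
    m/z axis (mean m/z per ROI) and the reorganized scans x ROIs matrix
    (summed intensity per scan).
    """
    rois = []
    for t, scan in enumerate(scans):
        for mz, intensity in scan:
            if intensity <= snr_thr:
                continue  # below the significance threshold
            for roi in rois:
                mean_mz = sum(roi["mzs"]) / len(roi["mzs"])
                if abs(mz - mean_mz) <= mean_mz * ppm_tol * 1e-6:
                    roi["mzs"].append(mz)
                    roi["trace"][t] += intensity
                    break
            else:
                trace = defaultdict(float)
                trace[t] = intensity
                rois.append({"mzs": [mz], "trace": trace})
    kept = [r for r in rois if longest_consecutive_run(r["trace"]) >= min_points]
    kept.sort(key=lambda r: sum(r["mzs"]) / len(r["mzs"]))
    mz_axis = [sum(r["mzs"]) / len(r["mzs"]) for r in kept]
    matrix = [[r["trace"].get(t, 0.0) for r in kept] for t in range(len(scans))]
    return mz_axis, matrix


# Toy data: three scans, invented values for illustration only.
scans = [
    [(400.1001, 50.0), (400.1003, 30.0), (523.4002, 5.0)],
    [(400.1002, 80.0), (523.4001, 4.0)],
    [(400.1000, 60.0), (610.2500, 40.0)],
]
print(storage_bytes(1800, 7000000) / 1e12, "TB for the full-resolution matrix")
print(storage_bytes(1800, 7000) / 1e6, "MB after binning at 0.1 m/z")
print(find_rois(scans, snr_thr=10.0, ppm_tol=5.0))
```

In this toy example, binning always produces as many columns as there are bins in the mass range, whereas the ROI search produces one column per retained mass trace and keeps the measured m/z of that trace, which is the trade-off discussed in the text.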
2.3. Data processing by MCR-ALS

High-resolution LC-MS data always contain a huge number of chromatographic peaks with multiple coelutions, especially in complex samples such as the lipid sample used here as an example. Resolving such coelutions makes identification easier and more feasible. In this context, MCR-ALS methods emerge as powerful tools to solve the “profiling problem”. Multivariate curve resolution methods are based on the same bilinear decomposition of the original data sets used by PCA, but under completely different constraints and with a different goal. The mathematical basis of the bilinear model used by MCR is given by the equation of Figure 4:

D = C S^T + E

In this equation, matrix D (I x J) represents the data output of a second-order instrument. In the case of LC-MS data, the D matrix contains the MS spectra at all retention times (i = 1,…,I) in its rows, and the chromatograms at all m/z channels (j = 1,…,J) in its columns. This matrix is decomposed into the product of two small factor matrices, C and S^T. The C (I x N) matrix contains column vectors which correspond to the elution profiles of the N (n = 1,…,N) pure components of matrix D. In the S^T (N x J) matrix, the row vectors correspond to the spectra of the N pure components. The part of D that is not explained by the model forms the residual matrix, E (I x J).

MCR-ALS methods assume that the variation measured in all samples in the original data set can be described by a combination of a small number of chemically meaningful profiles. For LC-MS data sets, the information in the data table can be reproduced by combining a small number of pure mass spectra (row profiles in the S^T matrix), each weighted by its concentration along the elution direction (the related chromatographic elution peaks, column profiles in C). Figure 4 shows an example of the application of MCR-ALS to a specific time region of the previously mentioned LC-TOF-MS chromatogram of a lipid sample. In this case, four components were selected, one of them explaining background noise. As can be seen, the coelutions are resolved and the peak areas and masses of each compound are finally obtained.

[Figure 4: panels D (compressed data using ROI/binning/interpolation), C (concentration profiles, with peak areas) and S^T (spectra profiles); results for a 4-component analysis, with resolved components labeled a, b and c.]
Figure 4. Example of the application of the MCR-ALS method to a specific region of a lipidomic LC-TOF-MS chromatogram.

As can be seen in Figure 5, representing some of the areas obtained by MCR-ALS for each sample type (controls versus treated samples) reveals the effects produced by the contaminants in the cells.

[Figure 5: map of sample types (Controls, Treat. A, Treat. B, Treat. C) versus MCR-ALS components.]
Figure 5. Correlation map of sample types and components considering the areas calculated by MCR-ALS analysis. Results are shown for only some components.

2.4. Data identification

Identifying the compounds resolved by MCR-ALS is difficult, even when high-resolution mass information is available. Recently, some studies have shown that acquiring MS spectra in continuum or profile mode allows a better identification of the “unknowns”, since it provides information on other relevant isotopic clusters. As can be observed in Figure 6, acquisition in profile mode provides a numerical representation of the ion dispersion or aberration in the mass spectrometer, including spatial and velocity dispersion inside the ion source. Thus, mass readings are represented as Gaussian curves, whereas in centroid mode they appear as discrete lines.

[Figure 6: two schematic spectra showing the M, M+1 and M+2 peaks along the m/z axis. Continuum spectra: no information loss, bigger data file size. Stick spectra: smaller data file size, information loss.]
Figure 6. Schematic representation of a mass spectrum acquired in profile and centroid modes.

3. Conclusions

LC-MS data from -omic studies contain thousands to millions of features to analyze. Prior to further analysis, the data must be reduced in the m/z dimension, either by binning or by ROI, the latter being preferable because it entails no loss of spectral resolution. The identification process is facilitated by prior analysis of the data with MCR-ALS, which has been shown to resolve the profiling problem, and by acquiring in profile (continuum) mode. However, further research is still necessary to find a single valid untargeted data processing methodology for -omics.

Acknowledgements. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013)/ERC Grant Agreement no. 320737. The first author acknowledges the Spanish Government (Ministerio de Educación, Cultura y Deporte) for a predoctoral FPU scholarship.

4. References

[1] Wang, Y. et al. (2010) The Concept of Spectral Accuracy for MS. Anal. Chem.
[2] Tauler, R. (1995) Multivariate curve resolution applied to second order data. Chemom. Intell. Lab. Syst.
[3] Tautenhahn, R. et al. (2008) Highly sensitive feature detection for high resolution LC/MS. BMC Bioinf.
[4] Gorrochategui, E. et al. (2014) Characterization of complex lipid mixtures in contaminant exposed JEG-3 cells using liquid chromatography and high-resolution mass spectrometry. Environ. Sci. Pollut. Res.