MS Metabolomics Core - BCM 2014
Metabolomics Tier 1 Data Analysis

Targeted Mass Spectrometry: For MRM data acquired using the Agilent 6430 or 6490 QQQ, data handling will be performed using both the Qualitative and Quantitative Analysis software packages from Agilent Technologies to ascertain retention times, verify MRM transitions, and calculate absolute levels of metabolites using calibration curves.

Low-level analysis: The initial low-level analysis of metabolomic data will use a series of steps, including preliminary review of the data, visualization for spotting patterns, data cleaning, imputation, and normalization. Data cleaning and basic statistical analyses will include identifying potential outliers, checking for normality, and examining the proportion and/or variance of each variable. As samples will be obtained from multiple data sources (Bruker vs. Varian NMRs) or acquired in batches, this descriptive analysis will first be conducted stratified by data source and batch to determine whether there are any important site differences that will need to be adjusted for during the data normalization steps.

For mass spectrometry data, the mode of acquisition (targeted vs. unbiased) will determine the imputation procedure employed. For targeted acquisition, the data are not expected to have any missing values. For unbiased acquisition, missing values are a common feature; in that case, only metabolites with fewer than 50% missing values will be considered for studies involving a relatively large number of tissues/biofluids, and only those with fewer than 20% missing values in the case of cell lines. Missing values will be imputed either at the minimum detection level or through a k = 5 nearest-neighbor procedure (pamr package in the R programming language; see the imputation sketch below).

Depending on the study design, several different normalization approaches will be available, ranging from simple median centering, to centering and scaling based on the values of internally spiked standards, to more advanced fixed-effects analysis-of-variance procedures that use factors such as data platform, batch information, and ionization mode. For samples run on days that are far apart, batch effects occur that need to be corrected. To correct such batch effects, we either use analysis-of-variance techniques or, when the focus is on unsupervised techniques such as clustering and principal components analysis, the function "removeBatchEffect" available in the R package "limma" (see the batch-correction sketch below).

Standardization of Isotopic Enrichment Data and Determination of Steady State: For every substrate or metabolite, a standard curve of known versus measured isotopic enrichments is first constructed. A regression analysis is performed, and the raw isotopic data are corrected based on the slope and intercept of the curve when the fit achieves R² = 0.99. To calculate flux from the corrected data, the steady-state isotopic enrichment (plateau) has to be determined by performing a regression analysis of enrichment versus time to ascertain whether the slope differs from zero. If the slope is greater than zero, a non-linear regression is performed to estimate the asymptote (plateau); a sketch of this procedure appears below.

Normalization of Biolog Metabolic Phenotyping Data: First, the data from these metabolic phenotyping assays will be corrected for background signal, obtained from the median reading of the three empty wells on the plate. Following this, the data will be scaled by dividing each measurement on the plate by the inter-quartile range of the data (see the normalization sketch below). In our experience, this simple normalization has worked well even for samples that have been run on different days or belong to different conditions (control vs. treatment).
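To illustrate the missing-value handling just described, the following is a minimal R sketch assuming a metabolite-by-sample matrix with NA entries for undetected features; the matrix, the threshold choice, and the data are all hypothetical, and pamr.knnimpute is the pamr routine for k-nearest-neighbor imputation.

```r
library(pamr)

# Hypothetical metabolite (rows) x sample (columns) matrix with missing values
set.seed(1)
mat <- matrix(rnorm(200), nrow = 20)
mat[sample(length(mat), 15)] <- NA

# Keep metabolites below the allowed missingness fraction
# (50% for tissue/biofluid studies, 20% for cell lines, per the text)
max.missing <- 0.5
mat <- mat[rowMeans(is.na(mat)) < max.missing, , drop = FALSE]

# Option 1: impute at the minimum detected level of each metabolite
mat.min <- t(apply(mat, 1, function(x) { x[is.na(x)] <- min(x, na.rm = TRUE); x }))

# Option 2: k = 5 nearest-neighbor imputation via pamr
# (pamr.knnimpute takes the data as a list with component x)
mat.knn <- pamr.knnimpute(list(x = mat), k = 5)$x
```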
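For the batch-correction step, a minimal sketch of the limma route follows, assuming log-scale abundances in expr and a per-sample batch factor (both invented here); the adjusted matrix is intended for unsupervised analyses such as clustering and PCA.

```r
library(limma)

# Hypothetical metabolite x sample matrix of log-scale abundances and run batches
expr  <- matrix(rnorm(120), nrow = 10)
batch <- factor(rep(c("day1", "day2"), each = 6))

# Remove additive batch effects prior to clustering or principal components analysis
expr.adj <- removeBatchEffect(expr, batch = batch)
```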
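The steady-state determination can be sketched as follows. The time-course values are invented, and the exponential rise-to-plateau model used in the non-linear step is an assumption for illustration; the text does not commit to a particular functional form.

```r
# Hypothetical tracer time-course: enrichment measured at successive times
time       <- c(0, 10, 20, 30, 45, 60, 90, 120)
enrichment <- c(0, 1.8, 2.9, 3.6, 4.1, 4.4, 4.6, 4.7)

# Linear regression of enrichment vs. time: is the slope different from zero?
fit.lin <- lm(enrichment ~ time)
slope.p <- summary(fit.lin)$coefficients["time", "Pr(>|t|)"]

if (slope.p < 0.05 && coef(fit.lin)["time"] > 0) {
  # Slope greater than zero: estimate the asymptote (plateau) by non-linear
  # regression; the exponential form below is one common, assumed choice
  fit.nls <- nls(enrichment ~ A * (1 - exp(-k * time)),
                 start = list(A = max(enrichment), k = 0.05))
  plateau <- coef(fit.nls)["A"]
} else {
  # Slope indistinguishable from zero: enrichment is already at plateau
  plateau <- mean(enrichment)
}
```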
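A minimal sketch of the Biolog normalization, assuming plate readings in an 8 x 12 matrix and hypothetical positions for the three empty wells:

```r
# Hypothetical 96-well Biolog plate readings and (assumed) empty-well positions
set.seed(2)
plate       <- matrix(runif(96, 0.1, 1.5), nrow = 8)
empty.wells <- c(1, 48, 96)

# Subtract the background, estimated as the median reading of the empty wells
plate.bg <- plate - median(plate[empty.wells])

# Scale by dividing each measurement by the inter-quartile range of the data
plate.norm <- plate.bg / IQR(plate.bg)
```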
Mid-level analysis (Figure 1): These analyses involve identification of differentially expressed metabolites, model building for classification or survival analysis, dimension reduction for extracting broad patterns from the data, and identification of groups of samples and/or metabolites.

Specifically, differentially expressed compounds across two classes will be identified using parametric (t-tests) and non-parametric (rank-sum) tests, while for multiple classes we will employ analysis-of-variance models (see the testing sketch below). The latter models, in addition to the key treatment factors being tested, also allow for the incorporation of key covariate information, such as clinical variables (stage of disease, indices of physiological impairment), as well as demographics and health habits (e.g., age, race, gender, education, and smoking and drinking status in the case of humans; strain, gender, and housing conditions in the case of mice). Given the large number of markers that are likely to be identified as significantly different between groups, as well as the number of conditions and comparisons in any given experiment, we recognize that these multiple comparisons increase the possibility of Type I error (false positives). Hence, family-wise error rate (FWER) and false discovery rate (FDR) methods will be employed to reduce or eliminate false positives.

In many studies, classificatory and/or prognostic models will be built. Such models are important for delineating metabolomic signatures associated with clinical outcomes, including disease/normal status, recurrence times, stage progression, etc. For categorical outcomes (e.g., disease/normal status), several standard models from the machine learning literature will be employed, including logistic regression, random forests, and support vector machines, whereas for outcomes capturing event times (e.g., disease recurrence, survival) Cox proportional hazards models will be used. An important aspect of this modeling will be to enforce sparsity through penalization (e.g., lasso or group lasso penalties), which leads to more parsimonious models with good theoretical properties in terms of inference and predictive ability (see the penalized-model sketch below). Especially in the case of structured penalties (e.g., the group lasso implemented in the R package grplasso), one can impose a priori biological information, such as pathway structure. The performance of these classificatory and prognostic models will be assessed through K-fold cross-validated error rates and receiver operating characteristic (ROC) curves, with the area under the curve (AUC) used as an overall measure of model fit. The significance of the AUC for each fitted model will be assessed through the Mann-Whitney U-statistic, and the AUC will also be used to select between competing models.
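A minimal sketch of the two-class testing with FDR control; the data are invented, and the Benjamini-Hochberg procedure is used as one standard FDR method (the text does not name a specific one).

```r
# Hypothetical metabolite (rows) x sample (columns) matrix and two-class labels
set.seed(3)
mat   <- matrix(rnorm(300), nrow = 30)
group <- factor(rep(c("control", "treated"), each = 5))

# Parametric (t-test) and non-parametric (rank-sum) p-values per metabolite
p.t    <- apply(mat, 1, function(x) t.test(x ~ group)$p.value)
p.rank <- apply(mat, 1, function(x) wilcox.test(x ~ group)$p.value)

# Control the false discovery rate (Benjamini-Hochberg, one standard choice)
q.t <- p.adjust(p.t, method = "BH")
sig <- which(q.t < 0.05)
```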
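As an illustration of a sparsity-enforcing classificatory model with ROC/AUC assessment, the sketch below fits a lasso-penalized logistic regression with the glmnet package (used here for the plain lasso; the grplasso package named in the text covers the group lasso variant) and computes a ROC curve with the pROC package. Data, dimensions, and variable names are hypothetical.

```r
library(glmnet)
library(pROC)

# Hypothetical sample (rows) x metabolite (columns) matrix and binary outcome
set.seed(4)
X <- matrix(rnorm(100 * 30), nrow = 100)
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))

# Lasso-penalized logistic regression with 10-fold cross-validation;
# cv.fit$cvm holds the cross-validated AUC at each penalty value
cv.fit <- cv.glmnet(X, y, family = "binomial", nfolds = 10, type.measure = "auc")

# ROC curve and AUC at the selected penalty (in-sample here, for illustration;
# performance claims should rely on the cross-validated estimates above)
pred    <- predict(cv.fit, newx = X, s = "lambda.min", type = "response")
roc.obj <- roc(y, as.numeric(pred))
auc(roc.obj)
```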
Finally, depending on project needs, the core will also provide several other standard analyses to help users understand and gain insight into their data. These include dimension reduction techniques, such as principal components analysis and its penalized (for sparsity) variants, for obtaining more robust low-dimensional representations of the samples and/or metabolites, as well as clustering of samples and/or metabolites into groups using a wide range of algorithms (hierarchical; model-based; partition-based, such as k-means and robust variants; and graph-based, such as normalized cuts); a brief sketch follows. In addition, the core will provide enhanced visualization capabilities by mapping results onto pathways (see also the section on pathway mapping).
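A brief sketch of the dimension reduction and clustering analyses using base R implementations (prcomp, hclust, kmeans) on an invented samples-by-metabolites matrix; the penalized PCA and graph-based variants named above would substitute for these calls.

```r
# Hypothetical sample (rows) x metabolite (columns) matrix
set.seed(5)
mat <- matrix(rnorm(50 * 20), nrow = 50)

# Principal components analysis for a low-dimensional view of the samples
pca <- prcomp(mat, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2")

# Hierarchical clustering of samples, with k-means as a partition-based alternative
hc        <- hclust(dist(mat), method = "ward.D2")
groups.hc <- cutree(hc, k = 3)
groups.km <- kmeans(mat, centers = 3)$cluster
```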