Bayesian Alignment Model for Analysis of LC-MS-based Omic Data

Tsung-Heng Tsai

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy
in
Electrical Engineering

Yue Wang, Chair
Luiz A. DaSilva
Seong K. Mun
Habtom W. Ressom
Jianhua Xuan
Guoqiang Yu

April 18, 2014
Arlington, Virginia

Keywords: alignment, Bayesian inference, biomarker discovery, liquid chromatography-mass spectrometry (LC-MS), Markov chain Monte Carlo (MCMC)

Copyright 2014, Tsung-Heng Tsai

Abstract

Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, in order to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information of various sources.

From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research.

Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from each LC-MS run.
The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS).

Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process.

Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.

Acknowledgments

First of all, I thank my advisor, Professor Yue Wang for his support and guidance on my study, research and life. I am very grateful for his contributions of time, knowledge and expertise to my research during my PhD study.

Much of my research was conducted in Professor Habtom W. Ressom’s laboratory. I would like to express my sincere gratitude to him, for all the help, support and insights he has provided to me throughout these years. In addition, his kindness, enthusiasm and positive energy have made it a pleasure to work with him.

I want to thank the other members of my advisory committee: Professors Luiz A. DaSilva, Seong K.
Mun, Jianhua Xuan, and Guoqiang Yu, for their invaluable feedback and constructive suggestions to improve the present work.

I have been very fortunate to have several mentors who helped and supported me throughout these years. In particular, I would like to thank Professor Mahlet Tadesse, for guiding me along and providing feedback on my work. Discussions with her have been a constant source of inspiration. Her innovative thinking and high standard on research work have had a significant influence on how I approach a research problem. I thank Dr. Da-Wei Wang, my former supervisor who brought me into the area of bioinformatics seven years ago. Over the years, he has extended his support and continued to provide me very helpful suggestions.

I enjoyed working with my colleagues in CBIL and Ressom Lab. Especially, I thank Cristina Di Poto, Minkun Wang, and Yi Zhao, for their generous help and selfless contributions in our collaborative projects.

I would like to give special thanks to Sherry Hwang and Jeff Hwang, who have provided me tremendous support and encouragement, starting from the very first day I landed in the Dulles Airport. I cannot imagine getting through the past few years without their support.

Most importantly, I would like to thank my parents, Hai-Shu Tsai and Pi-O Lin, my sisters Tsai-Chen Tsai and Wen-Fang Tsai, my brother Yi-Chan Tsai, my brother-in-law Hsiang-Chih Hsiao, and my sister-in-law Ching-Hui Yang. They have always motivated, encouraged, and supported me. This dissertation would not have been possible without their unconditional love.

Contents

1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 Principles of LC-MS
    1.2.2 LC-MS data analysis
  1.3 Retention time alignment of LC-MS data
    1.3.1 Feature-based approaches
    1.3.2 Profile-based approaches
  1.4 Research topics
  1.5 List of relevant publications
  1.6 Organization of the dissertation
2 Single-profile Bayesian alignment model
  2.1 Preliminaries
  2.2 Generative model
  2.3 Posterior inference
    2.3.1 Full conditionals of model parameters
    2.3.2 Block Metropolis-Hastings algorithm
  2.4 Number and position of knots
  2.5 Experiments
    2.5.1 Simulated data set
    2.5.2 LC-MS spike-in data set
    2.5.3 LC-MS metabolomic data sets
  2.6 Alternative formulation
    2.6.1 Jupp transformation
    2.6.2 Hamiltonian Monte Carlo
    2.6.3 Single-profile alignment using Hamiltonian Monte Carlo
  2.7 Summary
3 Multi-profile Bayesian alignment model
  3.1 Gaussian process prior
  3.2 Multi-profile alignment
  3.3 Chromatographic clustering
  3.4 Analyzed data sets
  3.5 Analysis of LC-MS proteomic data set
  3.6 Analysis of LC-MS glycomic data set
  3.7 Summary
4 Application of Bayesian alignment model to biomarker discovery
  4.1 Background
  4.2 Experimental methods
  4.3 Global profiling analysis
  4.4 Multiple reaction monitoring quantification
  4.5 Discussion
  4.6 Summary
5 Conclusion
  5.1 Summary of original contributions
  5.2 Future work
  5.3 Conclusion
Bibliography
A Proteomic ground-truth data
B Glycomic ground-truth data

List of Figures

1.1 Example of an LC-MS run.
1.2 Profile-based approach is composed of two components: prototype function and mapping functions. (© 2013 IEEE)
2.1 An illustrative example showing the functionalities of the prototype function m(t) and the mapping function u_i(t). For example, the intensity of the sample at time t = 1 corresponds to the intensity of the prototype function at time u_i(1) = 2, which is given by m(2) = 3. (© 2013 IEEE)
2.2 Directed acyclic graph of the single-profile alignment model. (© 2013 IEEE)
2.3 One realization of simulated data with different noise levels: (a) no noise, (b) SNR 40, (c) SNR 35 and (d) SNR 30. (e)–(h) are the aligned data using BAM with SSVS.
2.4 (a) Base peak chromatograms of the original LC-MS data. (b) zooms in the retention time range 100–250 for the chromatograms in (a). (© 2013 IEEE)
2.5 Aligned chromatograms by (a) DTW, (b) CPM, (c) BHCR, and (d) BAM. (e), (f), (g), and (h) zoom in the retention time range 100–250 for the chromatograms in (a), (b), (c), and (d), respectively. Misalignments by DTW and BHCR are observed in (e) and (g). (© 2013 IEEE)
2.6 (a) Trace plot of the number of knots in the models visited at each MCMC iteration for the chromatogram from the seventh replicate of the second serum aliquot. (b) Box plot of the number of knots visited by the MCMC sampler for each chromatogram. (© 2013 IEEE)
2.7 Chromatogram of the seventh replicate from the second serum aliquot and generated profiles based on the sampled model parameters during the initial 200 MCMC iterations for (a) BAM and (b) BHCR. The region where the MCMC sampler for BHCR gets stuck at inaccurate retention time points is highlighted. (© 2013 IEEE)
2.8 Difference between the identity function and estimated mapping function obtained from the posterior median by BAM for each of the 14 chromatograms. The filled region corresponds to the 90% credible interval. (© 2013 IEEE)
2.9 Original extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and serum with spiked-in peptides (right). (© 2013 IEEE)
2.10 Aligned extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and serum with spiked-in peptides (right). (© 2013 IEEE)
2.11 Chromatograms in the metabolomic data sets, M1 and M2, before and after alignment by BAM. The inset is a zoomed part in the middle retention time range of the chromatograms. (© 2013 IEEE)
2.12 Trajectories for a bivariate normal distribution, simulated using 20 leapfrog steps (ε = 0.25) with an initial position at the lower-left side of the distribution. Contours of equal probability ratio to the highest (0.1, 0.2, ..., 0.9) are depicted. Different values of initial momentum are considered.
2.13 The Hamiltonians along the trajectories in Figure 2.12, under different initial conditions.
2.14 Base peak chromatograms of the LC-MS data. Alignment is performed using the HMC model.
3.1 Three main components of the multi-profile Bayesian alignment model: Gaussian process prior, chromatographic clustering, and profile-based alignment.
3.2 Directed acyclic graph of the multi-profile alignment model.
3.3 Binned chromatograms in a portion of one LC-MS run. Similar m/z values do not imply similar chromatographic profiles.
3.4 Base peak chromatograms in the two analyzed data sets.
3.5 Histograms of the logarithm of peak intensities in the two analyzed data sets.
3.6 Scatter plots of the detected peaks in the two analyzed data sets. The intensity is log-transformed and color-coded.
3.7 Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the proteomic data set. The sufficient number of clusters is four.
3.8 Base peak chromatograms in the proteomic data set, before and after alignment. The inset is a zoomed part in the middle retention time range of the chromatograms.
3.9 Trace plots of 1/σ_ε^2 estimated based on single-profile alignment without using Gaussian process prior (SP) and with using Gaussian process prior (GPSP). (b) zooms in the precision range 21–23.5 for the last 2500 iterations in (a).
3.10 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_1–τ_5.
3.11 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_6–τ_10.
3.12 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_11–τ_15.
3.13 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_16–τ_20.
3.14 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_21–τ_23.
3.15 Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 20 LC-MS runs in the proteomic data set. The filled region corresponds to the 90% credible interval.
3.16 Measures of precision and recall in the proteomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP (), GPSP (△), and GPMP (♦).
3.17 Base peak chromatograms of the 23 LC-MS runs in the glycomic data set.
3.18 Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the glycomic data set. The sufficient number of clusters is four.
3.19 Clustered ion chromatograms in the glycomic data set. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. The inset is a zoomed part in the middle retention time range of the chromatograms.
3.20 Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 23 LC-MS runs in the glycomic data set. The filled region corresponds to the 90% credible interval.
3.21 Measures of precision and recall in the glycomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP (), GPSP (△), and GPMP (♦).
3.22 Measures of precision and recall in the glycomic data set, where multi-profile alignment is considered. The numbers of chromatograms are: (a) two, (b) three, (c) four, and (d) five. Five cases are compared with peak lists input to SIMA: raw (∗), adjusted by multi-profile alignment using binning (), adjusted by multi-profile alignment using chromatographic clustering (), adjusted by multi-profile alignment using binning with Gaussian process prior (△), and adjusted by multi-profile alignment using chromatographic clustering with Gaussian process prior (♦).
4.1 Workflow for the LC-MS-based analysis of N-glycans in sera.
4.2 Characteristics of a peak in LC-MS raw data.
4.3 Clustered ion chromatograms in E1. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.4 Clustered ion chromatograms in E2. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.5 Clustered ion chromatograms in E3. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.6 Clustered ion chromatograms in E4. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.7 Distribution of the RT differences (in seconds) across LC-MS runs of consensus peaks identified by SIMA with different parameters of RT tolerance.
4.8 Histograms of peak intensities before (a) and after (b) logarithmic transformation.
4.9 Quantification results of eight candidate N-glycan biomarkers in sera of HCC cases and cirrhotic controls by the MRM analysis. (a)-(c) Down-regulated bisected GlcNAc glycans. (d)-(e) Up-regulated beta-1,6-GlcNAc branching glycans. (f)-(h) Up-regulated tetra-antennary glycans.
4.10 Base peak chromatograms in four batches.
4.11 Three clusters of the identified N-glycan candidate biomarkers and their fold change directions (HCC versus cirrhosis).

List of Tables

2.1 Summary of full conditionals of model parameters.
2.2 Pairwise correlation coefficient and cross-correlation with the underlying pattern for the simulated LC-MS data, before alignment (original) and after alignment by DTW, CPM, BHCR and BAM (with fixed knots and with SSVS). Means (standard deviations) are reported for the simulated data based on 200 realizations.
2.3 Correlation coefficients for the LC-MS spike-in data, before alignment (original) and after alignment by DTW, CPM, BHCR (with knot density of 0.2) and BAM (with SSVS). (© 2013 IEEE)
2.4 Comparison of the peak matching results by using OpenMS alone (raw) and using three profile-based alignment models (DTW, CPM and BAM) for retention time correction prior to applying OpenMS on the metabolomic data sets. (© 2013 IEEE)
3.1 Summary of full conditionals of model parameters in the multi-profile Bayesian alignment model.
3.2 Summary of the analyzed data sets.
3.3 Peptide sequences of the internal standard.
3.4 Mass and retention time of each of the internal standard peaks in the proteomic data set.
3.5 Performance comparison in the LC-MS proteomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).
3.6 Mass and retention time of each of the internal standard peaks in the glycomic data set.
3.7 Performance comparison in the LC-MS glycomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).
3.8 Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived by binning along m/z dimension.
3.9 Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived using the chromatographic clustering procedure.
4.1 Characteristics of the study cohort.
4.2 Two-way analysis of variance of the glycomic data.
4.3 N-glycan candidate biomarkers identified by the LC-MS global profiling.
4.4 N-glycan candidate biomarkers identified by the MRM targeted quantification.
4.5 Number of missing values associated with the 12 significant N-glycans caused by the peak matching step.
A.1 Ground-truth peaks in the proteomic data set.
B.1 Ground-truth peaks in the glycomic data set.

Chapter 1

Introduction

The development and application of high-throughput omic technologies have dramatically increased the capacity to describe various aspects of biology in unprecedented detail. This, at the same time, has necessitated an increased reliance on computational techniques to extract knowledge from the vast amount of biological data.
Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used for profiling expression levels of biomolecules in a variety of omic studies. This dissertation focuses on retention time alignment, a crucial preprocessing step in the analysis of LC-MS-based omic data. A Bayesian alignment model is proposed to address the alignment task, and the problem is investigated from methodology development to practical application.

1.1 Motivation

LC-MS has been an indispensable tool in various omic studies including proteomics, glycomics and metabolomics [1, 96, 143]. Each LC-MS run generates data consisting of thousands of ion intensities characterized by their specific retention time (RT) and mass-to-charge ratio (m/z) values, thus enabling comprehensive profiling of a variety of biomolecules. This high-throughput technique is widely applied to identify candidate markers whose expression levels change between groups of distinct biological conditions [5, 51, 80]. In order to ensure an unbiased comparison of the ion intensities, several preprocessing steps including peak detection, retention time alignment, peak matching, normalization, and charge state deconvolution need to be appropriately handled [67]. Typically, these preprocessing steps generate a list of detected peaks with their RT, m/z values and intensities, which are subsequently analyzed using statistical tests to identify significant differences in ion intensities.

One of the crucial preprocessing steps is the correct matching of consensus peaks across multiple LC-MS runs. With the advances in mass spectrometry technology, it is now possible to achieve highly precise and accurate mass measurement (low- to sub-ppm) [81]. However, controlling the chromatographic variability is still a challenging task.
This often results in substantial variation in retention time across multiple LC-MS runs, raising significant challenges in the preprocessing pipeline. Without appropriate correction of retention time, the peak matching step is error-prone and the subsequent analysis may yield misleading results. Therefore, retention time alignment is a prerequisite for the quantitative analysis of LC-MS data and is the focus of this dissertation.

1.2 Background

The sequencing of the human genome, i.e., determining the order of approximately 3.2 billion DNA base pairs of adenine (A), thymine (T), cytosine (C) and guanine (G) residing in the 23 human chromosomes, is largely complete (> 99%) and publicly available [58]. This has raised new challenges in life science research, to discover and characterize the association between the DNA sequences and their downstream biological activities [43, 69]. In particular, an area of increased interest is systems biology, which focuses on studying biological entities and their interactions to characterize the emergent properties of complex biological systems [3, 56]. Systems biology approaches consist of integrated and interactive analyses of diverse data representing the biological activities at multiple levels, to gain insight into molecular and cellular networks. The ability to comprehensively measure key biological entities in a high-throughput fashion is a prerequisite in this endeavor, and consequently, a variety of omic technologies have been developed [54].

The flow of biological information from the genome to its downstream phenotypes involves several complex and interactive processes. The central dogma of biology states that DNA (genomics) is transcribed to mRNA (transcriptomics) and then translated into protein (proteomics), influenced by several regulatory factors including epigenetic modifications (epigenomics).
Proteins can catalyze reactions that regulate and produce a variety of biomolecules including metabolites (metabolomics), glycans (glycomics) and lipids (lipidomics). While the genome of an organism is generally considered static, its expression as gene products is continuously changing due to the influence of biological suppressors at different levels. Thus, delineation of the gene products (e.g., mRNAs and proteins) is crucial to understanding the biological system. Significant efforts have been made in transcriptomic studies, and technologies to measure mRNA abundance levels have become reliable in routine use [84, 138]. Investigation of the transcriptome appears essential because it links the DNA sequences to proteins. Unfortunately, expression levels of proteins and their downstream products cannot be simply inferred from mRNA levels, and a substantial discrepancy between the two is observed [46, 57, 78]. With current analytical methods, data from different omic studies reveal complementary aspects of the biological system, and integration of these data may lead to a more comprehensive understanding of the underlying mechanisms [63].

The human genome contains approximately 21,000 protein-coding genes, which can be expressed into about one million proteins [62, 134]. Systematic investigation of proteins and their downstream products provides important insight into the mechanism of post-translational processes [62, 134]. Due to the close proximity of these biomolecules to biological phenotypes, their expression levels reflect a rapid and observable response to environmental perturbations. This may potentially reveal the underlying mechanisms involved in human diseases, and aid in the development of effective treatments for these diseases. In this regard, there is a broad interest in identifying biomolecules, including proteins, glycans and metabolites, as potential biomarkers for clinical applications.
Such effort depends not only on reliable high-throughput techniques but also on rigorous computational pipelines to extract relevant information from the vast amount of data.

With recent advances in mass spectrometry and separation methods, LC-MS has become one of the essential analytical tools in biomedical research. LC-MS provides sensitive qualitative and quantitative analyses of a variety of biomolecules in a high-throughput fashion, and there has been enormous progress in systems biology and biomarker discovery using LC-MS-based omics [2, 24, 47]. Basic principles of LC-MS, preprocessing pipelines for LC-MS data analysis, and associated challenges are introduced in this section. For more details about the LC-MS technique, we refer interested readers to the literature [1, 25, 31].

1.2.1 Principles of LC-MS

Liquid chromatography. Liquid chromatography (LC) is a chromatographic technique used to separate a mixture of compounds that are dissolved in a solvent. Reversed phase high-performance liquid chromatography (RP-HPLC) is the most commonly used method in LC-MS applications. In RP-HPLC, the mixture is dissolved in a mobile phase, composed of water and organic solvents. With a high-pressure pump, the mixture solution is directed into an RP-HPLC column (the stationary phase), using a solvent gradient with increasing organic concentration. The stationary phase is typically hydrophobic or non-polar, while the mobile phase is moderately polar. The choices of column material, type of solvent, and solvent gradient all play a role in chromatographic separation. Different compounds in the mixture pass through the column at different rates due to the differences in their hydrophobicity and polarity. In RP-HPLC, hydrophilic compounds elute from the column earlier than hydrophobic compounds, and the time at which a compound elutes from the column is called its elution time or retention time.
In most LC-MS applications, a liquid chromatograph is coupled on-line to a mass spectrometer. Alternatively, compounds eluting from the column can be collected in aliquots and analyzed by the mass spectrometer afterwards.

Mass spectrometry. Mass spectrometry (MS) is an analytical technique that measures the mass-to-charge ratio (m/z) of charged molecules. The mass spectrometric analysis generates a mass spectrum summarizing the abundance of detected ions, distinguished by their m/z values. A mass spectrometer consists of three basic components: 1) an ion source that converts sample molecules into charged ions, 2) a mass analyzer that distinguishes the charged ions on the basis of their m/z values, and 3) a detector that counts the number of ions at each m/z value. Depending on the implementation of these components, there are different types of MS instruments. Major instrument configurations include: 1) electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) for the ion source; 2) quadrupole (Q), ion-trap, time-of-flight (TOF), Fourier transform ion cyclotron resonance and Orbitrap for the mass analyzer; and 3) electron multiplier for the detector. The compatibility of different analyzers with different ionization methods varies. For example, while all the analyzers listed above can be used in conjunction with an ESI ion source, MALDI is most commonly coupled to a TOF analyzer (MALDI-TOF). In addition, different configurations can be combined to achieve better performance or specific goals, e.g., quadrupole time-of-flight (Q-TOF) and triple quadrupole (QqQ) mass spectrometry. The list above is by no means exhaustive. For more details, please refer to the literature [1, 25, 31].

Tandem mass spectrometry. Tandem mass spectrometry (also called MS/MS) combines two steps of MS analysis with some form of fragmentation in between.
Tandem mass spectrometry is a key technique for identification of biomolecules, by selectively examining the fragmentation pattern of particular ions. Typically, in the analysis by the first mass spectrometer, several ions are selected based on either their intensities (data-dependent acquisition, DDA) or targeted m/z values of interest (data-independent acquisition, DIA). The selected ions (called precursor ions) are isolated for further analysis through fragmentation by collision-induced dissociation (CID) or electron-transfer dissociation (ETD). The resulting fragment ions (product ions) are then analyzed by the second mass spectrometer, which produces an MS/MS spectrum, presenting the detailed chemical makeup of the analyte. The fragment ions are produced following certain rules of ion dissociation [120]. In proteomic studies with CID fragmentation, for example, when the amide bonds of a peptide backbone break and the charge is retained on the C-terminus, y-ions are produced, while b-ions are produced when the charge is retained on the N-terminus. Given the m/z values and intensities of the b- and y-ions along with the m/z value of their precursor ion, the peptide sequence can be deduced through either database search [32, 93] or de novo sequencing [19, 35, 79]. While the tandem MS spectrum provides valuable information for identification of an analyte, MS/MS data acquisition suffers from the undersampling issue. Generally, only a few of the precursor ions of high intensity can be selected for fragmentation using DDA strategies. If high-abundance molecules are not removed, they may dominate MS/MS features and obscure less abundant molecules of interest.

Liquid chromatography-mass spectrometry. Mass spectrometers are often coupled with separation methods such as gas chromatography or liquid chromatography, to reduce the chance of analyzing coincident molecules and to increase the overall dynamic range of detection.
Liquid chromatography-mass spectrometry (LC-MS) is one of the most commonly used techniques, where a liquid chromatographic process is employed prior to injection of a sample into the mass spectrometer. Through LC-MS, fewer ions are analyzed simultaneously by the mass spectrometer (compared to the whole sample injected at once). This reduces the ion suppression effect [6]. In addition, molecules with the same molecular weight but different hydrophobicity, e.g., isomers, may elute from the column and enter the mass spectrometer at different times, thus reducing ambiguity in differentiating these molecules. An LC-MS run produces a set of MS spectra, acquired at multiple scans of different retention times. The MS/MS and LC-MS techniques can be naturally combined into a unified analytical method, known as LC-MS/MS. Typically in an LC-MS/MS experiment using DDA, the first mass spectrometer acquires a precursor scan, which measures all ions associated with the eluted molecules at a given retention time; a subset of the ions is selected based on their intensities, sequentially isolated and fragmented, prior to the second MS analysis. This produces several MS/MS scans following the precursor scan. The alternating process is repeated to automatically acquire both MS and MS/MS spectra throughout the LC gradient. While the MS/MS spectrum presents the fragmentation pattern of an analyte, its experimental mass and charge state are obtained from the measurement of the precursor ion in the MS spectrum.

Multiple reaction monitoring. LC-MS/MS using DDA is generally biased towards analysis of the most abundant and observable molecules. Biologically relevant molecular responses, however, are often less discernible in that analysis. Targeted quantification by multiple reaction monitoring (MRM) using triple quadrupole (QqQ) mass spectrometers has been introduced to overcome the limitations of the DDA analysis [97].
Essentially, the MRM method organizes the analysis of a specific list of targeted molecules, characterized by the m/z values of their precursor and fragment ions. The precursor-fragment ion pairs are called transitions, which are highly specific and unique for the targeted molecules. A specific ion is selected in the first quadrupole (Q1) on the basis of its precursor m/z value. The ion is fragmented by collision-induced dissociation (CID) in the second quadrupole. Only the relevant ions produced by the fragmentation are selected in the third quadrupole (Q3). The resulting transitions are then used for quantification. As the data acquisition is highly specific with less interference from irrelevant ions, the MRM analysis can yield more sensitive and accurate quantification results. Moreover, the retention time (RT) information can be used for scheduling the detection of a specific transition, to increase the number of molecules that can be quantified. This process is often referred to as scheduled MRM.

1.2.2 LC-MS data analysis

LC-MS methods can be used for extraction of quantitative information and detection of differential abundance. This requires that a rigorous analysis workflow be implemented. In addition to analytical considerations, crucial steps include: 1) an experimental design that avoids introducing bias during data acquisition and enables effective utilization of available resources [95], 2) a data preprocessing pipeline that extracts meaningful features [67], and 3) a statistical test that identifies significant changes based on the experimental design [66]. Coordination of these three steps is key to a reliable LC-MS analysis. Good experimental design provides an opportunity to process and compare samples in an unbiased manner. It helps identify true differences in the presence of variability from various sources.
This benefit can diminish if the data analysts fail to appropriately process the LC-MS data and conduct the subsequent statistical tests in accordance with the experimental design. This section provides a high-level overview of the LC-MS preprocessing pipelines and highlights associated challenges. Interested readers are referred to the literature for further information [67, 73].

An LC-MS run contains retention time information in a chromatogram, m/z values in MS spectra, and the relative ion abundance for each particular ion. MS signals of all ions throughout the chromatographic separation are formatted in a three-dimensional map that defines the LC-MS data, as shown in Figure 1.1. LC-MS can profile thousands of biomolecules in a single run, which necessitates an automatic and reliable preprocessing pipeline to extract meaningful features. In order to ensure an unbiased comparison of the ion intensities, several preprocessing steps including noise filtering, deisotoping, peak detection, retention time alignment, peak matching and normalization need to be appropriately handled. Typically, these preprocessing steps generate a list of detected peaks characterized by their retention times, m/z values and ion intensities. Subsequent statistical analysis is used to identify significant differences in ion intensities across distinct groups.

In LC-MS data, the peak representing a compound is characterized by its isotopic pattern, resulting from common isotopes such as ¹²C and ¹³C, in a set of mass spectra within its elution duration, superimposed with noise signals. Adequate consideration of such characteristics is crucial for LC-MS data analysis. Although several software tools have been developed (e.g., OpenMS [121], msInspect [8], MZmine 2 [100] and XCMS [119]), very few studies have systematically evaluated and compared their performance [133]. As a result, determining the most appropriate computational pipeline remains challenging.
In the following, we introduce the main preprocessing steps. As the way to characterize a peak is not universal, the order of the preprocessing steps may vary in different software tools.

Noise filtering. LC-MS data are subject to electronic and chemical noise due to contaminants present in the column solvent or instrumental interference. Appropriate noise filtering can increase the signal-to-noise ratio (SNR) and facilitate the subsequent peak detection step. Some software tools (e.g., XCMS [119] and MZmine 2 [100]) integrate the noise filtering into the peak detection step. Smoothing filters such as the Gaussian filter and the Savitzky-Golay filter [116] are commonly applied to reduce the effects of noise. Due to the differences of LC-MS platforms in resolution and detection limit, parameters for the smoothing filters need to be adaptively selected, preferably through a pilot experiment with a similar experimental setup.

[Figure 1.1: Example of an LC-MS run (ion count as a function of retention time and m/z).]

Deisotoping. Most chemical elements have naturally occurring isotopes. For example, ¹²C and ¹³C are two stable isotopes of the element carbon with mass numbers 12 and 13, respectively. As a result, each analyte gives rise to more than one ion peak in the LC-MS data, where the peak arising solely from the most common isotope is called the monoisotopic peak. In proteomics, for example, each peptide is characterized by an envelope of ion peaks due to its constituent amino acids. ¹³C is the most abundant isotope of the elements that make up amino acids, constituting about 1.11% of the carbon species. The approximately one dalton (Da) mass difference between ¹³C and ¹²C results in a 1/z difference between adjacent ion peaks in the isotopic envelope, where z is the charge state of the charged peptide.
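The 1/z spacing rule can be illustrated with a short Python sketch. It is not the procedure used by DeconTools or any other tool cited here; the function names, the cap on the charge state, and the assumption of positive-mode protonated ions are illustrative choices made for this example only.

```python
# Illustrative sketch only: infer the charge state from the spacing of
# adjacent isotopic peaks, assuming the spacing is about (13C - 12C) / z.

C13_C12_MASS_DIFF = 1.003355  # Da, mass difference between 13C and 12C
PROTON_MASS = 1.007276        # Da, assumes positive-mode protonated ions


def charge_from_isotope_spacing(mz_peaks, max_charge=6):
    """Estimate z from the mean m/z spacing of an isotopic envelope.

    mz_peaks: m/z values of adjacent isotopic peaks, sorted ascending.
    max_charge: hypothetical upper bound, chosen for illustration.
    """
    spacings = [b - a for a, b in zip(mz_peaks, mz_peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    z = round(C13_C12_MASS_DIFF / mean_spacing)
    return max(1, min(z, max_charge))


def neutral_monoisotopic_mass(monoisotopic_mz, z):
    """Convert the monoisotopic m/z of an [M + zH]^z+ ion to neutral mass."""
    return (monoisotopic_mz - PROTON_MASS) * z


envelope = [500.00, 500.50, 501.00]  # spacing of about 0.5 suggests z = 2
z = charge_from_isotope_spacing(envelope)
mass = neutral_monoisotopic_mass(envelope[0], z)
```

In practice the spacing estimate is only one ingredient; the observed envelope is usually also compared against a theoretical isotopic distribution before a charge state is accepted.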
The deisotoping step integrates sibling ion peaks originating from the same analyte and summarizes them by the monoisotopic mass. This facilitates the interpretation of LC-MS data, and helps ensure the validity of the independence assumption made in many statistical tests. DeconTools [60] is widely used to deisotope MS spectra, which involves: 1) identification of the isotopic distribution, 2) prediction of the charge state based on the distance between the ion peaks in the isotopic distribution, and 3) comparison between the observed isotopic distribution and a theoretical distribution generated based on an average residue.

Peak detection. Peak detection is a procedure to determine the existence of a peak in a specific range of retention times and m/z values, and to quantify its intensity. It is a prerequisite for the subsequent analysis of LC-MS data. Most LC-MS peak detection approaches [100, 119, 121, 142] are adapted from previous advances in MALDI-TOF data analysis [18, 26]. These methods perform peak detection via a pattern matching process with a pre-defined pattern, followed by a filtering step based on quantified peak characteristics. To better capture the characteristics of elution profiles, asymmetric patterns are often considered [142]. A critical issue is that the elution profiles may vary across different retention times [120]. As a result, the use of a single pattern throughout the whole retention time range in the current approaches may lead to inaccurate estimates of peak characteristics and of the signal-to-noise ratio (SNR). The latter is commonly employed as a filtering criterion. Usually peak detection is performed on each LC-MS run separately, without using information from other runs within the same experiment. Utilization of multiscale information from multiple runs has been proposed for the analysis of MALDI-TOF data [144], which may lead to more reliable peak detection results.
The same concept may be applied to LC-MS data analysis, where the peak matching step to be introduced plays a significant role.

Normalization. In LC-MS-based omics, one challenge is to detect true biological differences in the presence of various sources of variability. This requires appropriate normalization of intensity measurements to remove systematic biases and eliminate the effect of obscuring variability. Current normalization approaches proceed by identifying a reference for ion intensities and using that reference to adjust the LC-MS data. Clearly, identification of a reliable reference is key to the success of the normalization process. Most current methods assume that each of the LC-MS runs in the same experiment should have an equal concentration of molecules on average [68]. With this assumption, various measures including summation, median, and quantile of the ion intensities are used as the reference for normalization. Unfortunately, the validity of this assumption is questionable, as an increase of concentration in a specific group of molecules is not necessarily compensated by a decrease in other groups [123]. More rigorous approaches using regression methods based on a set of matched peaks [16] or spiked-in internal standards [123] have been proposed. However, it is unclear whether neighboring ions (in terms of RT, m/z value, or intensity) would necessarily share a similar drifting trend along the analysis order. At present, the use of quality control (QC) runs to assess and correct variability in LC-MS data appears to be the most reliable approach [28], in which QC runs can be collected from a reference sample or a mixture pooled from the analyzed samples. This idea has been successfully implemented for large-scale metabolomic studies, where variability along the analysis order is estimated for each of the detected peaks through assessment of the QC runs [28].
This circumvents the need to select the unknown reference, at the cost of additional experimental challenges in assuring appropriate coverage and reproducible detection of ions in the QC runs.

Peak matching. The peak matching step groups consensus peaks across LC-MS runs prior to applying statistical analysis to identify significant differences in ion intensities. This preprocessing step ensures that measurements from multiple LC-MS runs are comparable. It is also crucial for potential extensions of the peak detection and normalization steps, in order to integrate information from multiple runs. The main challenge in peak matching results from the variability of retention time and m/z values among the LC-MS runs. With the advances in mass spectrometry technology, it is now possible to achieve highly precise and accurate mass measurement (low- to sub-ppm) [81]. However, controlling the chromatographic variability is still a challenging task. Most current LC-MS preprocessing pipelines combine the estimation of retention time variability with the peak matching step, in order to correct that variability (called retention time alignment) and achieve reliable identification of consensus peaks. Although a number of algorithms [8, 100, 119, 121] have been proposed to address the retention time alignment problem, it is still one of the most challenging tasks in the LC-MS preprocessing pipeline due to the following issues:

1. The retention time shift across LC-MS runs is non-linear [101].
2. Retention time alignment relies on correct identification of consensus peaks or profiles. Performing this process based on misaligned data can be ambiguous.
3. A peak may be absent in some LC-MS runs due to either analytical or computational issues [136].

These issues are further elaborated in the next section, where we review related studies on retention time alignment. Furthermore, we discuss how we address the alignment problem in consideration of these issues.
1.3 Retention time alignment of LC-MS data

As discussed in Section 1.2.2, retention time alignment of LC-MS data is crucial for the peak matching step. Based on the type of inputs, alignment approaches can be categorized as: 1) feature-based approaches and 2) profile-based approaches [135]. The feature-based approaches perform the alignment task based on relevant signals (features, usually referred to as peaks), which are distinguished from irrelevant parts in the peak detection step. The profile-based approaches, on the other hand, make use of chromatographic profiles to estimate the variability along retention time and adjust the LC-MS runs accordingly. In addition to the formulation of the alignment problem and the required input data, these two types of alignment approaches differ in their coverage of the preprocessing pipeline. The profile-based approaches address the alignment problem, while the feature-based approaches usually deal with both the alignment and peak matching processes simultaneously.

1.3.1 Feature-based approaches

Most current LC-MS preprocessing pipelines (e.g., OpenMS [121], msInspect [8], MZmine 2 [100] and XCMS [119]) employ feature-based retention time alignment based on a set of identified peaks. The feature-based approaches rely on the correct identification of a set of consensus peaks across LC-MS runs. With this matching information, retention time correction can then be carried out naturally. The main distinction among these approaches is the way they identify the consensus peaks, which greatly affects the alignment results. Several feature-based approaches choose a reference from the analyzed peak lists based on some heuristic measure such as the number of peaks [4, 8, 145]. The rest of the peak lists are aligned to the reference list in a pairwise manner, where the consensus peaks are identified based on some pre-defined tolerance ranges.
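As a rough illustration of tolerance-based pairwise matching against a reference list, consider the Python sketch below. It is a generic greedy nearest-neighbor scheme, not the matching rule of any of the cited tools; the function name, the tuple layout of a peak, and the default tolerances are assumptions made for this example.

```python
# Illustrative sketch only: greedy matching of a peak list to a reference
# list using pre-defined retention time and m/z tolerance windows.

def match_to_reference(reference, peaks, rt_tol=0.5, ppm_tol=10.0):
    """Return (reference_index, peak_index) pairs of candidate consensus peaks.

    reference, peaks: lists of (retention_time, mz) tuples.
    rt_tol: retention time tolerance, in the same unit as retention_time.
    ppm_tol: m/z tolerance in parts per million.
    Both tolerances are hypothetical defaults and are platform dependent.
    """
    matches = []
    used = set()  # each reference peak is matched at most once
    for i, (rt, mz) in enumerate(peaks):
        best_j, best_drt = None, None
        for j, (ref_rt, ref_mz) in enumerate(reference):
            if j in used:
                continue
            ppm_error = abs(mz - ref_mz) / ref_mz * 1e6
            drt = abs(rt - ref_rt)
            if ppm_error <= ppm_tol and drt <= rt_tol:
                # Among admissible candidates, keep the closest in RT.
                if best_j is None or drt < best_drt:
                    best_j, best_drt = j, drt
        if best_j is not None:
            used.add(best_j)
            matches.append((best_j, i))
    return matches


reference = [(10.0, 500.000), (20.0, 600.000), (30.0, 700.000)]
run = [(10.3, 500.002), (20.4, 600.001), (31.0, 700.000), (10.2, 650.000)]
pairs = match_to_reference(reference, run)
```

Here the third peak of the run falls outside the retention time window and the fourth has no reference peak within the m/z window, so neither is matched. This also shows why a fixed tolerance becomes problematic when the retention time shift is large or non-linear.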
The retention time alignment can be subsequently performed using regression methods. Progressive adjustment has been proposed [4, 145], using reliable consensus peaks for an initial (coarse) retention time correction, followed by a more accurate alignment using refined consensus peaks. If there is at least one comprehensive peak list of good quality, this approach can perform reasonably well. However, the alignment performance degrades significantly when there is a lack of reproducibility among the LC-MS runs being considered. In order to eliminate the need for the objective selection of a reference list, clustering methods are applied in a number of feature-based approaches [70, 72, 88, 101, 119]. These approaches cluster peaks across LC-MS runs and specify a more complete reference based on the identified consensus peaks. A similar idea has also been proposed using kernel density estimation of all the detected peaks [136]. From a different perspective, a variable selection approach [137] has been proposed to identify the consensus peaks using the elastic net method [148]. To reduce the chance of using erroneously identified consensus peaks, which can deteriorate the alignment result, these approaches either include a module to assess the quality of the consensus peaks [70, 101, 119], or apply a robust method for the regression [101, 137]. Pairwise retention time correction against the reference list is employed in all the aforementioned approaches, with the exception of the simultaneous multiple alignment (SIMA) model [136]. In SIMA, kernel density estimation is used to derive a multi-dimensional retention time ridge that represents the retention time variation among all the analyzed LC-MS runs. Iterative refinement of the estimation, alternating between the identification of consensus peaks and the regression, has also been proposed [100, 119]. Incorporation of MS/MS identification can greatly reduce the ambiguity in matching the consensus peaks, as proposed in [34, 59].
Using only MS/MS identification [34], however, requires reproducible acquisition of MS/MS spectra and good coverage of the identified peaks in retention time, which is hardly possible in experiments using the data-dependent acquisition strategy. In addition, its application is limited to LC-MS/MS experiments with sufficient MS/MS spectra acquired. The PEPPeR platform [59] integrates both peak lists and MS/MS identification to perform the retention time alignment, in order to overcome this limitation. However, the integration is implemented through an ad-hoc approach, partly due to the lack of uncertainty measures.

Performance of the feature-based alignment approaches is highly dependent on three factors: 1) the peak detection result, 2) the reliability of consensus peaks, and 3) the coverage of consensus peaks in retention time. The latter two factors present a trade-off in the identification of the consensus peaks, which is key to good alignment performance. To address the trade-off, sophisticated clustering methods have been proposed to explore more possible consensus peaks, followed by prioritization of these peaks based on their quality. However, a fundamental issue is that the consensus peaks usually cannot be adequately determined based on unaligned data. Moreover, estimation of retention time variation by the feature-based approaches is limited to only a subset of time points, which is usually not as accurate as considering the whole chromatograms, as done in the profile-based approaches.

1.3.2 Profile-based approaches

The profile-based approaches utilize chromatograms of the LC-MS runs to estimate the variability along retention time and adjust the LC-MS runs accordingly. It is assumed that there exists a pattern underlying multiple chromatograms from the same biological group and that the profile variability is relatively small compared to distortions caused by misalignment.
Compared to a set of retention time points used by the feature-based approaches, the chromatographic profiles provide more comprehensive information about the variation throughout the whole retention time range. Appropriate utilization of the whole chromatogram allows improved estimation of the retention time variation characterized by the mapping functions. Figure 1.2 presents the concept of a profile-based approach. The algorithm estimates 1) a prototype function that represents the underlying pattern across the observed data, and 2) a set of mapping functions that characterize the relationship between the prototype function and the observation. In LC-MS data alignment, the monotonicity constraint is commonly applied on the mapping function, to retain the elution order of LC process. The goal of the profile-based alignment is to estimate the underlying prototype and mapping functions most likely to have generated the observed data. The majority of the profile-based alignment approaches [15, 17, 29, 103, 104, 129] are based on two standard warping algorithms: dynamic time warping (DTW) [114] and correlation optimized warping (COW) [94]. DTW was initially proposed for processing time-series data in the context of speech recognition. It uses a dynamic programming algorithm [9] to search for a mapping function between a pair of time-series traces. Essentially one trace is considered as the reference, to which the other is aligned using warping operations to stretch or shrink its profile. The warping operations are characterized by the mapping function, in order to make the warped trace as similar as possible to the reference. To avoid overfitting, regularization of DTW is often considered using constraints of: 1) minimum/maximum slope in the mapping function, and 2) maximum allowable deviation of the mapping function from the identity function. COW [94] uses the same functionality as in DTW, with additional regularization. 
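For readers unfamiliar with the dynamic programming recursion behind DTW, a minimal Python sketch is given below. It is the textbook unconstrained form with a squared-difference local cost; the slope and band constraints mentioned above are deliberately omitted, and the function name and return format are assumptions made for this example rather than the implementation used in any cited work.

```python
# Illustrative sketch only: unconstrained dynamic time warping between a
# reference trace and a second trace, with backtracking of the warping path.

def dtw(reference, trace):
    """Return (total_cost, path), where path is a list of (i, j) index pairs
    mapping reference[i] to trace[j]."""
    n, m = len(reference), len(trace)
    inf = float("inf")
    # cost[i][j]: minimal cumulative cost of aligning the first i reference
    # points with the first j trace points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (reference[i - 1] - trace[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both traces
                cost[i - 1][j],      # stretch the trace point
                cost[i][j - 1],      # stretch the reference point
            )
    # Backtrack from the end to recover a monotone mapping.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


reference = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]  # same peak, eluting one scan earlier
total_cost, warping_path = dtw(reference, shifted)
```

Because the recursion only allows the three moves listed in the comments, the recovered path is monotone, which corresponds to preserving the elution order. Without additional constraints, however, the path is free to collapse or stretch long segments, which is the overfitting behavior that the slope and deviation constraints are meant to limit.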
In COW, the mapping function is constrained to be piecewise linear, where the warping operations are applied only at a subset of time points (knots), with linear interpolation between the knots. In order to align a set of chromatograms, one reference must be specified as the prototype function, to which each of the chromatograms is aligned based on a defined distance (e.g., Euclidean distance). Several distance measurements are considered in [103, 104], while the main challenge in the DTW- and COW-based approaches lies in the choice of the unknown prototype function. In these approaches, heuristic measurements (e.g., the correlation between chromatograms) are commonly used to determine a prototype function, and no further adjustment of the prototype function is allowed during the estimation.

[Figure 1.2: Profile-based approach is composed of two components: prototype function and mapping functions. (© 2013 IEEE)]

To address this concern, probabilistic models have been proposed in order to estimate a prototype function from the observations in a more principled way. In the statistics community, a similar problem called curve registration [106] has been studied in the context of functional data analysis [107]. The prototype and mapping functions are estimated by a Procrustes fitting procedure, where the estimate is iteratively updated. The mapping function in the Procrustes analysis is estimated via a measure of relative curvature, and regularization can be applied to penalize large curvature values. The continuous profile model (CPM) [74] is another effective approach to perform the profile-based alignment. CPM formulates the alignment problem as a hidden Markov model (HMM) [105], where each chromatogram represents a noisy transformation of the latent trace and the time points are indexed as hidden states in the HMM.
Both the prototype function (latent trace) and the mapping functions (hidden states) are estimated from the observations using an expectation-maximization (EM) algorithm [22, 74]. An extension of the CPM model has been proposed to handle multiple binned chromatograms, with a moderate number of bins [76]. However, the strategy of mapping the data onto a higher-dimensional space may limit its applicability to high-resolution LC-MS data generated by the current technologies.

A common issue of the current alignment approaches (both feature-based and profile-based) is the lack of uncertainty assessment. A measure of uncertainty is desired to provide a confidence level in the alignment results and assist in making decisions for subsequent analyses or data integration. In spike-in experiments or studies utilizing MS/MS identification results, the integration of this complementary information has been found to lead to better alignment results, e.g., in [59, 61]. However, the integration is often implemented through ad-hoc approaches, partly due to the lack of uncertainty measures. In combining the information from various sources, accounting for the uncertainty in the alignment from each source and performing the integration in a principled way may lead to improved results.

1.4 Research topics

In this dissertation, we propose a Bayesian alignment model (BAM), which integrates complementary information embedded in the LC-MS data to address the aforementioned challenges. Specifically, the model offers two major attractive features: 1) it estimates the retention time variability along with uncertainty measures, and 2) it integrates multiple sources of information including internal standards and clustered chromatograms, through weighing the uncertainty measures.
We investigate the alignment problem through three research topics: 1) development of a single-profile Bayesian alignment model, 2) development of a multi-profile Bayesian alignment model, and 3) application to biomarker discovery research.

Development of single-profile Bayesian alignment model. We first study the profile-based Bayesian alignment using a single chromatogram, e.g., the base peak chromatogram from each LC-MS run. Relevant work includes a Bayesian hierarchical model for curve registration (BHCR) [127], which uses a Markov chain Monte Carlo (MCMC) method for parameter inference. For the alignment of LC-MS data, which consist of many chromatographic peaks, more flexible and effective MCMC algorithms are desired. We observe that the elementwise Metropolis-Hastings algorithm used in BHCR is prone to overfitting due to inefficient MCMC moves. To overcome this problem, we propose a block Metropolis-Hastings algorithm using a mixture of block transition moves [110] for more flexible and effective updates. In addition, a stochastic search technique is built into BAM to enable adaptive knot specification for the mapping functions. For LC-MS data where chromatographic peaks are not homogeneously present along retention time, a uniformly distributed knot specification is not desirable. Instead of fixing the knots upfront, we propose using stochastic search variable selection (SSVS) [40] to determine the number and positions of knots. A possible extension using the Hamiltonian Monte Carlo method [27, 92] is also investigated.

Development of multi-profile Bayesian alignment model. The single-profile alignment model is further extended to handle multiple representative chromatograms simultaneously. The use of multiple chromatograms is considered in a few studies, by either binning the LC-MS data [75] or using all the extracted ion chromatograms with acceptable quality [17].
However, a suitable procedure to utilize multiple representative chromatograms while retaining computational feasibility is currently not available. We propose a clustering approach to identify multiple representative chromatograms from each LC-MS run. These chromatograms are simultaneously considered in the profile-based alignment to facilitate the estimation of the prototype and mapping functions. Moreover, we incorporate Gaussian process regression [108] to estimate the retention time variation, based on the information from internal standards. The use of internal standards enables a high-confidence estimation of retention time variation, which avoids the ambiguity in identifying consensus peaks encountered in the feature-based approaches. Our proposed method allows us to infer a predictive distribution over the entire retention time range. The inferred information can then be used as the prior of the mapping function for the profile-based alignment.

Application of Bayesian alignment model to biomarker discovery. Cancer treatment is generally more effective when the disease is diagnosed early. Defining clinically relevant biomarkers for early detection of cancer has potentially far-reaching implications for disease management and patient health. LC-MS has been widely used in various omic studies for cancer biomarker discovery. We investigate the applicability of the proposed Bayesian alignment model in a biomarker discovery study using LC-MS-based glycomics. The glycomic study consists of two complementary analyses: 1) global profiling using LC-MS, and 2) targeted quantification using MRM. The alignment model is integrated into a preprocessing pipeline for LC-MS data analysis. Through the developed pipeline, we identify candidate biomarkers from the global profiling analysis and confirm the result with that obtained by targeted quantification.

1.5 List of relevant publications

Journal papers

1. T.-H. Tsai, M.G. Tadesse, C. Di Poto, L.K. Pannell, Y.
Mechref, Y. Wang, H.W. Ressom. Multi-profile Bayesian alignment model for LC-MS data analysis with integration of internal standards. Bioinformatics, 29(21):2774–2780, 2013.
2. T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Profile-based LC-MS data alignment — A Bayesian approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2):494–503, 2013.
3. J.F. Xiao, R.S. Varghese, B. Zhou, M.R. Ranjbar, Y. Zhao, T.-H. Tsai, C. Di Poto, J. Wang, D. Goerlitz, Y. Luo, A.K. Cheema, N. Sarhan, H. Soliman, M.G. Tadesse, D.H. Ziada, H.W. Ressom. LC-MS based serum metabolomics for identification of hepatocellular carcinoma biomarkers in Egyptian cohort. Journal of Proteome Research, 11(12):5914–5923, 2012.
4. H.W. Ressom, J.F. Xiao, L. Tuli, R.S. Varghese, B. Zhou, T.-H. Tsai, M.R. Ranjbar, Y. Zhao, J. Wang, C. Di Poto, A.K. Cheema, M.G. Tadesse, R. Goldman, K. Shetty. Utilization of metabolomics to identify serum biomarkers for hepatocellular carcinoma in patients with liver cirrhosis. Analytica Chimica Acta, 743:90–100, 2012.
5. L. Tuli, T.-H. Tsai, R.S. Varghese, J.F. Xiao, A.K. Cheema, H.W. Ressom. Using a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science, 10(13), 2012.
6. G.K. Befekadu, M.G. Tadesse, T.-H. Tsai, H.W. Ressom. Probabilistic mixture regression models for alignment of LC-MS data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1417–1424, 2011.

Conference papers

1. Y. Zhao, T.-H. Tsai, C. Di Poto, L. Pannell, M.G. Tadesse, H.W. Ressom. Variability assessment of LC-MS experiments and its application to experimental design and difference detection. Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), Washington, DC, December 2012.
2. T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Bayesian alignment model for LC-MS data.
Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Atlanta, GA, November 2011.
3. L. Tuli, T.-H. Tsai, R.S. Varghese, A.K. Cheema, H.W. Ressom. Using a spike-in experiment to evaluate analysis of LC-MS data. Proceedings of the IEEE International Workshop on Computational Proteomics, Hong Kong, December 2010.

Manuscripts submitted/in preparation

1. Y. Zhao, T.-H. Tsai, C. Di Poto, L.K. Pannell, M.G. Tadesse, H.W. Ressom. Mixed effects model for variability assessment and difference detection in LC-MS data. Statistics and Its Interface, in revision.
2. T.-H. Tsai, M. Wang, C. Di Poto, Y. Hu, S. Zhou, Y. Zhao, R.S. Varghese, Y. Luo, M.G. Tadesse, D.H. Ziada, C.S. Desai, K. Shetty, Y. Mechref, H.W. Ressom. LC-MS profiling of N-glycans derived from human serum samples for biomarker discovery in hepatocellular carcinoma. Journal of Proteome Research, in preparation.
3. T.-H. Tsai, E. Song, C. Di Poto, M. Wang, Y. Luo, R.S. Varghese, M.G. Tadesse, D.H. Ziada, C.S. Desai, K. Shetty, Y. Mechref, H.W. Ressom. LC-MS/MS based serum proteomics for identification of hepatocellular carcinoma (HCC) biomarkers. Proteomics, in preparation.

1.6 Organization of the dissertation

The remainder of this dissertation is organized as follows. Chapter 2 introduces the single-profile Bayesian alignment model. The hierarchical model, the block Metropolis-Hastings algorithm, the posterior inference for the model parameters, and the SSVS procedure for knot specification are described in that chapter. Chapter 3 presents the extended multi-profile alignment model, including the chromatographic clustering approach to derive representative chromatograms and the Gaussian process prior that uses information from internal standards. Chapter 4 demonstrates the applicability of the proposed alignment model in a cancer biomarker discovery study.
Finally, Chapter 5 concludes this dissertation with a summary of contributions and possible extensions in future work.

Chapter 2 Single-profile Bayesian alignment model

This chapter¹ introduces the profile-based alignment using a single chromatogram for each LC-MS run. The chromatograms are obtained based on total ion count or base peak intensity. We address the alignment problem within a Bayesian framework. The inference for the model parameters is based on their posterior distributions, which are estimated using Markov chain Monte Carlo (MCMC) methods.

¹ Part of this chapter has been published in an earlier work [132]: T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Profile-based LC-MS data alignment — A Bayesian approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2):494–503, 2013. (© 2013 IEEE)

2.1 Preliminaries

We use a variety of MCMC methods to obtain the posterior distribution of the parameters in the proposed Bayesian alignment model. This section gives a brief introduction to the MCMC methods used in this dissertation. We refer interested readers to [14, 89, 90] for more detailed and rigorous expositions of this topic.

For simple problems where a conjugate prior is available, i.e., posterior and prior distributions are in the same family, the posterior distribution can be used directly to infer properties associated with the model parameters. However, in most complex problems, including the profile-based alignment, it is not possible to obtain a closed-form expression for the posterior distribution. A general approach to estimate the posterior distribution in such problems is to use MCMC methods, where a large number of (correlated) samples from a Markov chain are used to make Monte Carlo estimates. A Markov chain is a sequence of states in which each state depends only on its preceding state, governed by a transition probability $T(\theta' \leftarrow \theta)$, i.e., the probability of a state change from $\theta$ to $\theta'$. Based on the law of large numbers, the Monte Carlo estimates (asymptotically) converge to the true values as the number of samples goes to infinity, provided that the samples are obtained from the target distribution.

To ensure the validity of the Monte Carlo estimates, the Markov chain must converge to the target distribution $p(\theta)$, where $\theta$ is the parameter of interest. The convergence requires two fundamental conditions:

1. The Markov chain must be ergodic, that is, it is possible to pass from any of the states to another.
2. The transition $T(\theta' \leftarrow \theta)$ leaves $p(\theta)$ invariant, that is,

\[
p(\theta') = \sum_{\theta} p(\theta)\, T(\theta' \leftarrow \theta), \quad \text{for all } \theta'. \tag{2.1}
\]

The second condition of invariance ensures that if the Markov chain reaches the target distribution $p(\theta)$ at some point, then the subsequently sampled states follow that distribution as well. In many MCMC methods, this is accomplished through the detailed balance condition (also known as reversibility):

\[
p(\theta)\, T(\theta' \leftarrow \theta) = p(\theta')\, T(\theta \leftarrow \theta'), \quad \text{for all } \theta \text{ and } \theta'. \tag{2.2}
\]

Detailed balance ensures that $\theta$ and $\theta'$ are reversible in the Markov chain, and it is a sufficient condition for invariance, as shown below:

\[
\sum_{\theta} p(\theta)\, T(\theta' \leftarrow \theta) = \sum_{\theta} p(\theta')\, T(\theta \leftarrow \theta') = p(\theta') \sum_{\theta} T(\theta \leftarrow \theta') = p(\theta'). \tag{2.3}
\]

The Metropolis-Hastings algorithm [50, 85] and Gibbs sampling [38, 39] are two common MCMC methods that satisfy detailed balance. A combination of both methods is often used to construct a Markov chain transition for a target distribution that is too complex to sample from directly. We describe both methods in the following.

Metropolis-Hastings algorithm. The Metropolis algorithm was first introduced in the seminal paper published by Metropolis et al. [85], and further generalized by Hastings [50].
Algorithm 1 outlines the procedure to update $\theta$ in a Markov chain using the Metropolis-Hastings algorithm. A change of state $(\theta' \leftarrow \theta)$ is proposed based on a proposal distribution $q(\theta' \leftarrow \theta)$, and the acceptance probability $r_A$ of the proposed change is computed from the ratio under the target distribution, $p(\theta')/p(\theta)$, and the transition ratio, $q(\theta \leftarrow \theta')/q(\theta' \leftarrow \theta)$, which adjusts for non-symmetric proposal distributions. When a symmetric proposal distribution is employed, i.e., $q(\theta \leftarrow \theta') = q(\theta' \leftarrow \theta)$, this updating scheme is called the Metropolis algorithm. In that case the acceptance probability $r_A$ becomes $\min\{1, p(\theta')/p(\theta)\}$. If the proposed change is rejected, then the succeeding sample of the Markov chain stays at the current state $\theta$.

Algorithm 1 Metropolis-Hastings algorithm
  Propose $\theta' \sim q(\theta' \leftarrow \theta)$
  Compute acceptance probability $r_A = \min\left\{1, \dfrac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)}\right\}$
  Set $\theta = \theta'$ with probability $r_A$

Transitions defined by the Metropolis-Hastings algorithm leave the target distribution $p(\theta)$ invariant as they satisfy detailed balance:

\[
\begin{aligned}
T(\theta' \leftarrow \theta)\, p(\theta)
&= \min\left\{1, \frac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)}\right\} q(\theta' \leftarrow \theta)\, p(\theta) \\
&= \min\left\{p(\theta)\, q(\theta' \leftarrow \theta),\; p(\theta')\, q(\theta \leftarrow \theta')\right\} \\
&= \min\left\{p(\theta')\, q(\theta \leftarrow \theta'),\; p(\theta)\, q(\theta' \leftarrow \theta)\right\} \\
&= \min\left\{1, \frac{p(\theta)\, q(\theta' \leftarrow \theta)}{p(\theta')\, q(\theta \leftarrow \theta')}\right\} q(\theta \leftarrow \theta')\, p(\theta') \\
&= T(\theta \leftarrow \theta')\, p(\theta').
\end{aligned} \tag{2.4}
\]

Since the acceptance probability is based on a probability ratio, the algorithm does not require a direct evaluation of the target distribution $p(\theta)$; evaluating a function that is proportional to $p(\theta)$ suffices.

Gibbs sampling. Gibbs sampling was initially proposed in the context of image restoration [39] and has been applied to a variety of Bayesian inference problems since the early nineties [38]. The algorithm updates each component of the state $(\theta_i' \leftarrow \theta_i)$ by sampling from its full conditional, i.e., the conditional distribution given the other components, $p(\theta_i \mid \theta_{\setminus i})$.
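As a minimal sketch of the two update schemes described in this section, the following Python code runs a random-walk Metropolis sampler and a Gibbs sampler on a toy target. The target (a zero-mean bivariate normal with unit variances and correlation `rho`), the function names, and the step size are assumptions chosen for illustration only; they are not part of the alignment model.

```python
import numpy as np

def log_target(theta, rho):
    """Unnormalized log-density of a zero-mean bivariate normal (unit variances)."""
    x, y = theta
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho ** 2))

def metropolis(n_samples, rho, step=1.0, seed=0):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so the
    acceptance probability reduces to min(1, p(theta') / p(theta))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    samples = np.empty((n_samples, 2))
    for m in range(n_samples):
        proposal = theta + rng.normal(scale=step, size=2)
        log_ratio = log_target(proposal, rho) - log_target(theta, rho)
        if np.log(rng.uniform()) < log_ratio:  # accept with probability r_A
            theta = proposal
        samples[m] = theta                     # a rejected move repeats the state
    return samples

def gibbs(n_samples, rho, seed=0):
    """Gibbs sampling: each component is drawn from its full conditional,
    which for this target is N(rho * other, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_samples, 2))
    for m in range(n_samples):
        x = rng.normal(rho * y, sd)
        y = rng.normal(rho * x, sd)
        samples[m] = (x, y)
    return samples
```

Only the unnormalized log-density is needed by `metropolis`, whereas `gibbs` requires the full conditionals in closed form but has no proposal to tune.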
Gibbs sampling can be shown to be a special case of the Metropolis-Hastings algorithm, with the proposal distribution defined by

\[
q(\theta' \leftarrow \theta) = p(\theta_i' \mid \theta_{\setminus i})\, I\!\left(\theta'_{\setminus i} = \theta_{\setminus i}\right), \tag{2.5}
\]

where $I(\theta'_{\setminus i} = \theta_{\setminus i})$ is an indicator function ensuring that all components other than $\theta_i$ stay fixed. The acceptance probability of Gibbs sampling is one:

\[
\begin{aligned}
r_A &= \frac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)} \\
&= \frac{p(\theta')\, p(\theta_i \mid \theta'_{\setminus i})\, I(\theta_{\setminus i} = \theta'_{\setminus i})}{p(\theta)\, p(\theta_i' \mid \theta_{\setminus i})\, I(\theta'_{\setminus i} = \theta_{\setminus i})} \\
&= \frac{p(\theta_i', \theta'_{\setminus i})\, p(\theta_i \mid \theta'_{\setminus i})\, I(\theta_{\setminus i} = \theta'_{\setminus i})}{p(\theta_i, \theta_{\setminus i})\, p(\theta_i' \mid \theta_{\setminus i})\, I(\theta'_{\setminus i} = \theta_{\setminus i})} \\
&= \frac{p(\theta_i' \mid \theta'_{\setminus i})\, p(\theta'_{\setminus i})\, p(\theta_i \mid \theta'_{\setminus i})\, I(\theta_{\setminus i} = \theta'_{\setminus i})}{p(\theta_i \mid \theta_{\setminus i})\, p(\theta_{\setminus i})\, p(\theta_i' \mid \theta_{\setminus i})\, I(\theta'_{\setminus i} = \theta_{\setminus i})} \\
&= 1.
\end{aligned} \tag{2.6}
\]

Gibbs sampling eliminates the need to define and tune a proposal distribution, which is required in the Metropolis-Hastings algorithm. It is useful when the full conditional is tractable and can be sampled efficiently.

2.2 Generative model

We introduce a generative model to formulate the alignment problem. The observed chromatograms from $N$ replicates, $y_i(t)$, $i = 1, \ldots, N$, $t = t_1, \ldots, t_T$, are assumed to share a similar profile characterized by the prototype function $m(t)$. We use a piecewise linear function to model the nonlinear variability [101] along retention time. For the $i$-th chromatogram at retention time $t$, the intensity value is referred to the prototype function indexed by the mapping function $u_i(t)$, i.e., $m(u_i(t))$. Figure 2.1 illustrates the relationship between the sample and the prototype function through the mapping function. The intensity of the sample at time 1, for example, is referred to the intensity of the prototype function at time $u_i(1) = 2$, that is, $m(2) = 3$. By incorporating the variability of intensity using an affine transformation, each chromatogram is modeled as:

\[
y_i(t) = c_i + a_i \cdot m(u_i(t)) + \varepsilon_i(t), \quad i = 1, 2, \ldots, N, \tag{2.7}
\]

where $a_i$ and $c_i$ are scaling and translation parameters, and the errors $\varepsilon_i(t)$ are independent and identically distributed normal random variables, $\varepsilon_i(t) \overset{\text{iid}}{\sim} N(0, \sigma_\varepsilon^2)$. These parameters characterize the individual variability of each chromatogram. Conjugate normal prior distributions are chosen for $a_i$ and $c_i$, i.e., $a_i \sim N(a_0, \sigma_a^2)$ and $c_i \sim N(c_0, \sigma_c^2)$. The prototype function is modeled with B-spline regression [20]:

\[
m = B_m \psi, \tag{2.8}
\]

where $m = (m(t_1), \ldots, m(t_T))^\top \in \mathbb{R}^{T \times 1}$, $B_m \in \mathbb{R}^{T \times L}$, and $\psi \in \mathbb{R}^{L \times 1}$. The regression coefficients for the prototype function, $\psi$, are specified by a first-order random walk: $\psi_l \sim N(\psi_{l-1}, \sigma_\psi^2)$, where $\psi_0 = 0$.

[Figure 2.1 (three panels): the prototype function $m(t)$ (intensity versus time of prototype), the mapping function $u_i(t)$ (time of prototype versus time of sample), and the sample $m(u_i(t))$ (intensity versus time of sample).]

Figure 2.1: An illustrative example showing the functionalities of the prototype function $m(t)$ and the mapping function $u_i(t)$. For example, the intensity of the sample at time $t = 1$ corresponds to the intensity of the prototype function at time $u_i(1) = 2$, which is given by $m(2) = 3$. (© 2013 IEEE)

The mapping function $u_i(t)$ is a piecewise linear function characterized by a set of knots $\tau = (\tau_0, \tau_1, \ldots, \tau_{K+1})$ and their corresponding mapping indices $\phi_i = (\phi_{i,0}, \phi_{i,1}, \ldots, \phi_{i,K+1})$, where $\tau_0 = t_1$ and $\tau_{K+1} = t_T$. The mapping function is defined in terms of $\tau$ and $\phi_i$:

\[
u_i(t) =
\begin{cases}
\phi_{i,j}, & \text{for } t = \tau_j, \\[4pt]
\dfrac{\tau_{j+1} - t}{\tau_{j+1} - \tau_j}\, \phi_{i,j} + \dfrac{t - \tau_j}{\tau_{j+1} - \tau_j}\, \phi_{i,j+1}, & \text{for } \tau_j < t < \tau_{j+1}.
\end{cases} \tag{2.9}
\]

To keep the elution order of the LC process, the monotonicity constraint $\phi_{i,0} < \cdots < \phi_{i,K+1}$ needs to be satisfied. The prior of $\phi_i$ is specified via slope values $\omega_i = (\omega_{i,1}, \ldots, \omega_{i,K+1})$, where $\omega_{i,j}$ is defined as

\[
\omega_{i,j} = \frac{\phi_{i,j} - \phi_{i,j-1}}{\tau_j - \tau_{j-1}}, \tag{2.10}
\]

and is assumed to follow a normal distribution with mean $\omega_{i,j-1}$ and variance $\sigma_\omega^2$, truncated below by 0 to ensure monotonicity of $\phi_i$. The prior of $\phi_i$ is therefore given by

\[
p(\phi_i) = \prod_{j=1}^{K+1} p_{TN}\!\left(\omega_{i,j} \mid \omega_{i,j-1}, \sigma_\omega^2\right), \tag{2.11}
\]

where $\omega_{i,0} = 1$ and $p_{TN}(\cdot)$ denotes the truncated normal density function. Finally, we specify the priors for the other model parameters to complete the hierarchy:

\[
\begin{aligned}
a_0 &\sim N(\mu_a, \sigma_{a_0}^2), \\
c_0 &\sim N(\mu_c, \sigma_{c_0}^2), \\
1/\sigma_a^2 &\sim G(\alpha_a, \beta_a), \\
1/\sigma_c^2 &\sim G(\alpha_c, \beta_c), \\
1/\sigma_\varepsilon^2 &\sim G(\alpha_\varepsilon, \beta_\varepsilon), \\
1/\sigma_\psi^2 &\sim G(\alpha_\psi, \beta_\psi).
\end{aligned}
\]

These priors are chosen to be conjugate to the likelihood function. Figure 2.2 presents the directed acyclic graph of the single-profile alignment model, where the model parameters are represented by open circles, the hyperparameters by solid dots, and the observations by filled circles.

Figure 2.2: Directed acyclic graph of the single-profile alignment model. (© 2013 IEEE)

2.3 Posterior inference

Based on the generative model introduced in Section 2.2, the alignment problem is translated into an inference task: given the chromatograms $y = \{y_1, y_2, \ldots, y_N\}$, we need to estimate the model parameters $a, c, \psi, \phi, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_\varepsilon^2, \sigma_\psi^2$. Once the inference is complete, the alignment can be carried out by applying an inverse mapping function to each chromatogram, i.e., $\hat{y}_i(t) = y_i(\hat{u}_i^{-1}(t))$.

The parameter inference is drawn using MCMC methods. For the parameters whose full conditionals have closed forms, $\theta = \{a, c, \psi, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_\varepsilon^2, \sigma_\psi^2\}$, we use the Gibbs sampler to update their values. The remaining parameters, i.e., the mapping function coefficients $\phi = \{\phi_1, \ldots, \phi_N\}$, are updated using a Metropolis-Hastings algorithm with a uniform proposal density that reflects the constraints on the boundaries.
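The generative model and the inverse-mapping alignment can be illustrated with a short Python sketch. It assumes that the prototype is available as an arbitrary callable (the B-spline representation of Equation 2.8 is not reproduced here), and all function names are hypothetical. The piecewise linear mapping of Equation 2.9 is evaluated with `numpy.interp`, and because the mapping is strictly increasing, its inverse is obtained by swapping the roles of the knots and the mapping indices.

```python
import numpy as np

def mapping_function(t, knots, phi):
    """Piecewise linear u_i(t) of Equation 2.9: phi at the knots, linear in between."""
    if np.any(np.diff(phi) <= 0):
        raise ValueError("phi must be strictly increasing to keep the elution order")
    return np.interp(t, knots, phi)

def simulate_chromatogram(t, prototype, knots, phi, a=1.0, c=0.0, sigma=0.0, rng=None):
    """Draw y_i(t) = c_i + a_i * m(u_i(t)) + eps_i(t), as in Equation 2.7."""
    rng = np.random.default_rng() if rng is None else rng
    u = mapping_function(t, knots, phi)
    return c + a * prototype(u) + rng.normal(scale=sigma, size=len(t))

def align(t, y, knots, phi):
    """Apply the inverse mapping: y_hat_i(t) = y_i(u_i^{-1}(t)).

    The chromatogram is resampled by linear interpolation on the grid t,
    which is an approximation introduced by this sketch.
    """
    u_inv = np.interp(t, phi, knots)  # inverse of a strictly increasing mapping
    return np.interp(u_inv, t, y)
```

In a noise-free setting with $a_i = 1$ and $c_i = 0$, `align` recovers the prototype up to interpolation error, which gives a quick sanity check of an estimated mapping function.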
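2.3.1 Full conditionals of model parameters

The derivation of the full conditional for each of the parameters, $a, c, \psi, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_\varepsilon^2, \sigma_\psi^2$, is shown below. Table 2.1 summarizes all the full conditionals used in the single-profile model.

Full conditional of $a_0$

\[
\begin{aligned}
p(a_0 \mid y, \theta_{\setminus a_0}, \phi)
&\propto p(a_0 \mid \mu_a, \sigma_{a_0}^2) \prod_{i=1}^{N} p(a_i \mid a_0, \sigma_a^2) \\
&\propto \exp\left(-\frac{(a_0 - \mu_a)^2}{2\sigma_{a_0}^2}\right) \times \exp\left(-\sum_{i=1}^{N} \frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \\
&\propto \exp\left(-\frac{a_0^2 - 2a_0\left(\sigma_a^2 \mu_a + \sigma_{a_0}^2 \sum_{i=1}^{N} a_i\right) / \left(\sigma_a^2 + N\sigma_{a_0}^2\right)}{2\sigma_{a_0}^2 \sigma_a^2 / \left(\sigma_a^2 + N\sigma_{a_0}^2\right)}\right),
\end{aligned} \tag{2.12}
\]

which implies the full conditional of $a_0$ is a normal distribution $N(\hat{a}_0, \hat{\sigma}_{a_0}^2)$, where $\hat{\sigma}_{a_0}^2 = \left(1/\sigma_{a_0}^2 + N/\sigma_a^2\right)^{-1}$ and $\hat{a}_0 = \hat{\sigma}_{a_0}^2 \left(\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2\right)$.

Full conditional of $c_0$

\[
\begin{aligned}
p(c_0 \mid y, \theta_{\setminus c_0}, \phi)
&\propto p(c_0 \mid \mu_c, \sigma_{c_0}^2) \prod_{i=1}^{N} p(c_i \mid c_0, \sigma_c^2) \\
&\propto \exp\left(-\frac{(c_0 - \mu_c)^2}{2\sigma_{c_0}^2}\right) \times \exp\left(-\sum_{i=1}^{N} \frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \\
&\propto \exp\left(-\frac{c_0^2 - 2c_0\left(\sigma_c^2 \mu_c + \sigma_{c_0}^2 \sum_{i=1}^{N} c_i\right) / \left(\sigma_c^2 + N\sigma_{c_0}^2\right)}{2\sigma_{c_0}^2 \sigma_c^2 / \left(\sigma_c^2 + N\sigma_{c_0}^2\right)}\right),
\end{aligned} \tag{2.13}
\]

which implies the full conditional of $c_0$ is a normal distribution $N(\hat{c}_0, \hat{\sigma}_{c_0}^2)$, where $\hat{\sigma}_{c_0}^2 = \left(1/\sigma_{c_0}^2 + N/\sigma_c^2\right)^{-1}$ and $\hat{c}_0 = \hat{\sigma}_{c_0}^2 \left(\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2\right)$.

Full conditional of $(a_i, c_i)$

\[
\begin{aligned}
p(a_i, c_i \mid y, \theta_{\setminus (a_i, c_i)}, \phi)
&\propto p(a_i \mid a_0, \sigma_a^2)\, p(c_i \mid c_0, \sigma_c^2)\, p(y_i \mid \hat{y}_i, \sigma_\varepsilon^2) \\
&\propto \exp\left(-\frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \exp\left(-\frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \exp\left(-\frac{\|y_i - \hat{y}_i\|^2}{2\sigma_\varepsilon^2}\right),
\end{aligned} \tag{2.14}
\]

which implies the full conditional of $(a_i, c_i)$ is a bivariate normal distribution $N(\hat{\mu}_i, \hat{\Sigma}_i)$, where $\hat{\Sigma}_i = \left(\Sigma_{ac}^{-1} + W^\top W/\sigma_\varepsilon^2\right)^{-1}$ and $\hat{\mu}_i = \hat{\Sigma}_i \left(\Sigma_{ac}^{-1} (a_0, c_0)^\top + W^\top y_i/\sigma_\varepsilon^2\right)$, with $\Sigma_{ac} = \mathrm{diag}(\sigma_a^2, \sigma_c^2)$ and $W = \begin{pmatrix} B_m(u_i)\psi & \mathbf{1} \end{pmatrix} \in \mathbb{R}^{T \times 2}$.

Full conditional of $1/\sigma_a^2$

\[
\begin{aligned}
p(1/\sigma_a^2 \mid y, \theta_{\setminus \sigma_a^2}, \phi)
&\propto p(1/\sigma_a^2 \mid \alpha_a, \beta_a) \prod_{i=1}^{N} p(a_i \mid a_0, \sigma_a^2) \\
&\propto \left(\frac{1}{\sigma_a^2}\right)^{\alpha_a - 1} \exp\left(-\frac{\beta_a}{\sigma_a^2}\right) \times \left(\frac{1}{\sigma_a^2}\right)^{N/2} \exp\left(-\sum_{i=1}^{N} \frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \\
&= \left(\frac{1}{\sigma_a^2}\right)^{\alpha_a + N/2 - 1} \exp\left(-\frac{1}{\sigma_a^2}\left(\beta_a + \frac{1}{2}\sum_{i=1}^{N} (a_i - a_0)^2\right)\right),
\end{aligned} \tag{2.15}
\]

which implies the full conditional of $1/\sigma_a^2$ is a gamma distribution $G(\hat{\alpha}_a, \hat{\beta}_a)$, where $\hat{\alpha}_a = \alpha_a + N/2$ and $\hat{\beta}_a = \beta_a + \sum_{i=1}^{N} (a_i - a_0)^2/2$.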
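The closed-form conditionals derived so far translate directly into Gibbs updates. The sketch below covers only $a_0$ and $1/\sigma_a^2$ (Equations 2.12 and 2.15); the function names are hypothetical, and the gamma distribution is assumed to be parameterized by shape and rate, as the exponent $-\beta_a/\sigma_a^2$ in Equation 2.15 suggests. NumPy's gamma sampler takes a scale argument, so the rate is inverted.

```python
import numpy as np

def a0_conditional(a, sigma2_a, mu_a, sigma2_a0):
    """Mean and variance of the normal full conditional of a_0 (Equation 2.12)."""
    n = len(a)
    var_hat = 1.0 / (1.0 / sigma2_a0 + n / sigma2_a)
    mean_hat = var_hat * (mu_a / sigma2_a0 + np.sum(a) / sigma2_a)
    return mean_hat, var_hat

def precision_a_conditional(a, a0, alpha_a, beta_a):
    """Shape and rate of the gamma full conditional of 1/sigma_a^2 (Equation 2.15)."""
    alpha_hat = alpha_a + len(a) / 2.0
    beta_hat = beta_a + 0.5 * np.sum((a - a0) ** 2)
    return alpha_hat, beta_hat

def gibbs_step(a, sigma2_a, mu_a, sigma2_a0, alpha_a, beta_a, rng):
    """One Gibbs update of (a_0, sigma_a^2) given the current scaling factors a."""
    mean_hat, var_hat = a0_conditional(a, sigma2_a, mu_a, sigma2_a0)
    a0 = rng.normal(mean_hat, np.sqrt(var_hat))
    alpha_hat, beta_hat = precision_a_conditional(a, a0, alpha_a, beta_a)
    precision = rng.gamma(shape=alpha_hat, scale=1.0 / beta_hat)  # scale = 1 / rate
    return a0, 1.0 / precision
```

The updates for $c_0$ and $1/\sigma_c^2$ have the same form with $c$ in place of $a$.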
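Full conditional of $1/\sigma_c^2$

\[
\begin{aligned}
p(1/\sigma_c^2 \mid y, \theta_{\setminus \sigma_c^2}, \phi)
&\propto p(1/\sigma_c^2 \mid \alpha_c, \beta_c) \prod_{i=1}^{N} p(c_i \mid c_0, \sigma_c^2) \\
&\propto \left(\frac{1}{\sigma_c^2}\right)^{\alpha_c - 1} \exp\left(-\frac{\beta_c}{\sigma_c^2}\right) \times \left(\frac{1}{\sigma_c^2}\right)^{N/2} \exp\left(-\sum_{i=1}^{N} \frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \\
&= \left(\frac{1}{\sigma_c^2}\right)^{\alpha_c + N/2 - 1} \exp\left(-\frac{1}{\sigma_c^2}\left(\beta_c + \frac{1}{2}\sum_{i=1}^{N} (c_i - c_0)^2\right)\right),
\end{aligned} \tag{2.16}
\]

which implies the full conditional of $1/\sigma_c^2$ is a gamma distribution $G(\hat{\alpha}_c, \hat{\beta}_c)$, where $\hat{\alpha}_c = \alpha_c + N/2$ and $\hat{\beta}_c = \beta_c + \sum_{i=1}^{N} (c_i - c_0)^2/2$.

Full conditional of $1/\sigma_\varepsilon^2$

\[
\begin{aligned}
p(1/\sigma_\varepsilon^2 \mid y, \theta_{\setminus \sigma_\varepsilon^2}, \phi)
&\propto p(1/\sigma_\varepsilon^2 \mid \alpha_\varepsilon, \beta_\varepsilon) \prod_{i=1}^{N} p(y_i \mid \hat{y}_i, \sigma_\varepsilon^2) \\
&\propto \left(\frac{1}{\sigma_\varepsilon^2}\right)^{\alpha_\varepsilon - 1} \exp\left(-\frac{\beta_\varepsilon}{\sigma_\varepsilon^2}\right) \times \left(\frac{1}{\sigma_\varepsilon^2}\right)^{NT/2} \exp\left(-\sum_{i=1}^{N} \frac{\|y_i - \hat{y}_i\|^2}{2\sigma_\varepsilon^2}\right) \\
&= \left(\frac{1}{\sigma_\varepsilon^2}\right)^{\alpha_\varepsilon + NT/2 - 1} \exp\left(-\frac{1}{\sigma_\varepsilon^2}\left(\beta_\varepsilon + \frac{1}{2}\sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2\right)\right),
\end{aligned} \tag{2.17}
\]

which implies the full conditional of $1/\sigma_\varepsilon^2$ is a gamma distribution $G(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$, where $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + NT/2$ and $\hat{\beta}_\varepsilon = \beta_\varepsilon + \sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2/2$.

Full conditional of $1/\sigma_\psi^2$

\[
\begin{aligned}
p(1/\sigma_\psi^2 \mid y, \theta_{\setminus \sigma_\psi^2}, \phi)
&\propto p(1/\sigma_\psi^2 \mid \alpha_\psi, \beta_\psi)\, p(\psi \mid \sigma_\psi^2) \\
&\propto \left(\frac{1}{\sigma_\psi^2}\right)^{\alpha_\psi - 1} \exp\left(-\frac{\beta_\psi}{\sigma_\psi^2}\right) \times \left|\sigma_\psi^2 \Omega^{-1}\right|^{-1/2} \exp\left(-\frac{1}{2}\psi^\top \left(\sigma_\psi^2 \Omega^{-1}\right)^{-1} \psi\right) \\
&\propto \left(\frac{1}{\sigma_\psi^2}\right)^{\alpha_\psi + L/2 - 1} \exp\left(-\frac{1}{\sigma_\psi^2}\left(\beta_\psi + \frac{1}{2}\psi^\top \Omega \psi\right)\right),
\end{aligned} \tag{2.18}
\]

which implies the full conditional of $1/\sigma_\psi^2$ is a gamma distribution $G(\hat{\alpha}_\psi, \hat{\beta}_\psi)$, where $\hat{\alpha}_\psi = \alpha_\psi + L/2$ and $\hat{\beta}_\psi = \beta_\psi + \psi^\top \Omega \psi/2$, and $\Omega \in \mathbb{R}^{L \times L}$ is a tridiagonal matrix:

\[
\Omega =
\begin{pmatrix}
2 & -1 & & 0 \\
-1 & 2 & \ddots & \\
 & \ddots & \ddots & -1 \\
0 & & -1 & 1
\end{pmatrix}.
\]

Full conditional of $\psi$

\[
\begin{aligned}
p(\psi \mid y, \theta_{\setminus \psi}, \phi)
&\propto p(\psi \mid \sigma_\psi^2) \prod_{i=1}^{N} p(y_i \mid \hat{y}_i, \sigma_\varepsilon^2) \\
&\propto \exp\left(-\frac{1}{2}\psi^\top \left(\sigma_\psi^2 \Omega^{-1}\right)^{-1} \psi\right) \prod_{i=1}^{N} \exp\left(-\frac{\|y_i - (c_i \mathbf{1} + a_i B_m(u_i)\psi)\|^2}{2\sigma_\varepsilon^2}\right),
\end{aligned} \tag{2.19}
\]

which implies the full conditional of $\psi$ is a multivariate normal distribution $N(\hat{\mu}_\psi, \hat{\Sigma}_\psi)$, where $\hat{\Sigma}_\psi = \left(\Omega/\sigma_\psi^2 + X^\top X/\sigma_\varepsilon^2\right)^{-1}$ and $\hat{\mu}_\psi = \hat{\Sigma}_\psi X^\top (Y - C)/\sigma_\varepsilon^2$, and

\[
X = \begin{pmatrix} a_1 B_m(u_1) \\ a_2 B_m(u_2) \\ \vdots \\ a_N B_m(u_N) \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \quad \text{and} \quad
C = \begin{pmatrix} c_1 \mathbf{1} \\ c_2 \mathbf{1} \\ \vdots \\ c_N \mathbf{1} \end{pmatrix}.
\]

Table 2.1: Summary of full conditionals of model parameters.

Parameter                  Distribution                                        Updated parameters
$a_0$                      $N(\hat{a}_0, \hat{\sigma}_{a_0}^2)$                $\hat{\sigma}_{a_0}^2 = (1/\sigma_{a_0}^2 + N/\sigma_a^2)^{-1}$;  $\hat{a}_0 = \hat{\sigma}_{a_0}^2 (\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2)$
$c_0$                      $N(\hat{c}_0, \hat{\sigma}_{c_0}^2)$                $\hat{\sigma}_{c_0}^2 = (1/\sigma_{c_0}^2 + N/\sigma_c^2)^{-1}$;  $\hat{c}_0 = \hat{\sigma}_{c_0}^2 (\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2)$
$(a_i, c_i)$               $N(\hat{\mu}_i, \hat{\Sigma}_i)$                    $\hat{\Sigma}_i = (\Sigma_{ac}^{-1} + W^\top W/\sigma_\varepsilon^2)^{-1}$;  $\hat{\mu}_i = \hat{\Sigma}_i (\Sigma_{ac}^{-1} (a_0, c_0)^\top + W^\top y_i/\sigma_\varepsilon^2)$
$1/\sigma_a^2$             $G(\hat{\alpha}_a, \hat{\beta}_a)$                  $\hat{\alpha}_a = \alpha_a + N/2$;  $\hat{\beta}_a = \beta_a + \sum_{i=1}^{N} (a_i - a_0)^2/2$
$1/\sigma_c^2$             $G(\hat{\alpha}_c, \hat{\beta}_c)$                  $\hat{\alpha}_c = \alpha_c + N/2$;  $\hat{\beta}_c = \beta_c + \sum_{i=1}^{N} (c_i - c_0)^2/2$
$1/\sigma_\varepsilon^2$   $G(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$  $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + NT/2$;  $\hat{\beta}_\varepsilon = \beta_\varepsilon + \sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2/2$
$1/\sigma_\psi^2$          $G(\hat{\alpha}_\psi, \hat{\beta}_\psi)$            $\hat{\alpha}_\psi = \alpha_\psi + L/2$;  $\hat{\beta}_\psi = \beta_\psi + \psi^\top \Omega \psi/2$
$\psi$                     $N(\hat{\mu}_\psi, \hat{\Sigma}_\psi)$              $\hat{\Sigma}_\psi = (\Omega/\sigma_\psi^2 + X^\top X/\sigma_\varepsilon^2)^{-1}$;  $\hat{\mu}_\psi = \hat{\Sigma}_\psi X^\top (Y - C)/\sigma_\varepsilon^2$

2.3.2 Block Metropolis-Hastings algorithm

Algorithm 2 outlines one iteration of the MCMC procedure in the Bayesian hierarchical curve registration (BHCR) model [127]. The transition ratio $r_T$ for the proposal density is one, while the likelihood ratio $r_L$ and prior ratio $r_P$ in the Metropolis-Hastings acceptance probability, $r_A$, for updating $\phi_{i,j}^{(m)}$ $(\phi_{i,j}' \leftarrow \phi_{i,j}^{(m)})$ are given by:

\[
r_L = \frac{p\left(y_i \mid \phi_{i,j}', \phi_{i,\setminus j}^{(m+1)}, \theta^{(m+1)}\right)}{p\left(y_i \mid \phi_{i,j}^{(m)}, \phi_{i,\setminus j}^{(m+1)}, \theta^{(m+1)}\right)}, \tag{2.20}
\]

\[
r_P = \frac{p_{TN}\left(\omega_{i,j}' \mid \omega_{i,j-1}^{(m+1)}, \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j}^{(m)} \mid \omega_{i,j-1}^{(m+1)}, \sigma_\omega^2\right)} \times \frac{p_{TN}\left(\omega_{i,j+1}' \mid \omega_{i,j}', \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j+1}^{(m)} \mid \omega_{i,j}^{(m)}, \sigma_\omega^2\right)} \times \frac{p_{TN}\left(\omega_{i,j+2}^{(m)} \mid \omega_{i,j+1}', \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j+2}^{(m)} \mid \omega_{i,j+1}^{(m)}, \sigma_\omega^2\right)}, \tag{2.21}
\]

where $\phi_{i,\setminus j}^{(m+1)}$ denotes the set of coefficients $\phi_i$ at iteration $m+1$ with $\phi_{i,j}$ excluded, i.e., $\phi_{i,\setminus j}^{(m+1)} = \left(\phi_{i,0}^{(m+1)}, \ldots, \phi_{i,j-1}^{(m+1)}, \phi_{i,j+1}^{(m)}, \ldots, \phi_{i,K+1}^{(m)}\right)$. The MCMC move for updating $\phi_{i,j}$ changes the slopes $\omega_{i,j}$ and $\omega_{i,j+1}$, and consequently the involved densities.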
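To make the prior ratio concrete, the following Python sketch evaluates the log of the prior in Equation 2.11 from the slopes of Equation 2.10; a prior ratio such as $r_P$ is then the exponential of a difference of two such values. The function names are hypothetical, and the truncated normal density is written out with the standard-library error function rather than taken from a statistics package.

```python
import math
import numpy as np

def log_tn_pdf(x, mean, sigma):
    """Log-density of N(mean, sigma^2) truncated below at 0, evaluated at x."""
    if x <= 0:
        return -math.inf
    z = (x - mean) / sigma
    log_normal = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Probability mass above 0 of the untruncated normal: Phi(mean / sigma)
    log_mass = math.log(0.5 * math.erfc(-mean / (sigma * math.sqrt(2.0))))
    return log_normal - log_mass

def log_prior_phi(phi, knots, sigma_omega):
    """Log of Equation 2.11 for one chromatogram, with omega_{i,0} = 1."""
    omega = np.diff(phi) / np.diff(knots)          # slopes, Equation 2.10
    if np.any(omega <= 0):
        return -math.inf                           # monotonicity violated
    previous = np.concatenate(([1.0], omega[:-1]))
    return sum(log_tn_pdf(w, p, sigma_omega) for w, p in zip(omega, previous))
```

Because a single-coefficient move only changes two slopes, all but three factors cancel in the ratio, which is what Equation 2.21 exploits; recomputing the full sum as done here is simpler but less efficient.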
Algorithm 2 MCMC update of $\{\theta^{(m)}, \phi^{(m)}\}$ in the BHCR model
  Update $\theta^{(m+1)} \leftarrow \theta^{(m)}$ using Gibbs sampling
  for all $\phi_{i,j}$ do
    Propose $\phi_{i,j}' \sim U\left(\phi_{i,j}^{(m)} - \delta, \phi_{i,j}^{(m)} + \delta\right)$
    Compute the likelihood ratio $r_L$ using Equation 2.20
    Compute the prior ratio $r_P$ using Equation 2.21
    Compute the acceptance probability $r_A = \min\{1, r_L \times r_P\}$
    Set $\phi_{i,j}^{(m+1)} = \phi_{i,j}'$ with probability $r_A$ ($\phi_{i,j}^{(m+1)} = \phi_{i,j}^{(m)}$ if the proposal is rejected)
  end for

When the misalignment involves a translation shift along retention time, the element-wise Metropolis-Hastings move for $\phi_i$ requires a series of successive proposals in the same direction to be accepted sequentially. This incremental update hinders the mixing of the MCMC sampler. In addition, the monotonicity constraint inhibits the flexibility of the proposal to consider relatively large changes for each of the mapping function coefficients. To address this issue, we consider block proposal moves [110] to allow a set of successive coefficients to be adjusted simultaneously. Rather than updating each coefficient $\phi_{i,j}$ sequentially, the $\phi_{i,j}$'s are first grouped into several non-overlapping blocks, which consist of successive coefficients along the retention time, and proposals are made to update each block. The block move offers a more efficient way to update the coefficients, which improves the mixing of the MCMC sampler.

We introduce binary indicator variables $b_j \in \{0, 1\}$, $j = 1, \ldots, K$, to identify the block boundaries, where $b_j = 1$ if $\tau_j$ is at the boundary of a block and $b_0 = b_{K+1} = 1$. This indicator variable follows a Bernoulli distribution with $p(b_j = 1) = r_{\text{block}}$. Based on the boundary configuration, coefficients within the same block, $\phi_{i,j:j+B_j-1} = (\phi_{i,j}, \phi_{i,j+1}, \ldots, \phi_{i,j+B_j-1})$, are proposed to be moved in the same direction, where $b_j = b_{j+B_j} = 1$ and $b_{j+1} = \cdots = b_{j+B_j-1} = 0$. The element-wise move can be viewed as a special case of the block move where $r_{\text{block}} = 1$ and each block only contains a single coefficient.
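The block construction can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not stated in the text: only the interior coefficients $j = 1, \ldots, K$ are moved, the first interior index always opens a block, every coefficient in a block receives the same uniform shift, and the log-posterior is supplied by the caller as a hypothetical callable `log_post`. Proposals that violate the monotonicity constraint have zero prior density and are rejected outright.

```python
import numpy as np

def sample_blocks(K, r_block, rng):
    """Group interior indices 1..K into blocks of successive coefficients.

    A new block starts at index j whenever b_j = 1, with b_j ~ Bernoulli(r_block).
    """
    starts = rng.uniform(size=K) < r_block
    starts[0] = True                       # assumption: index 1 always opens a block
    edges = np.flatnonzero(starts)
    ends = np.append(edges[1:], K)
    return [np.arange(s, e) + 1 for s, e in zip(edges, ends)]

def block_mh_sweep(phi, log_post, blocks, delta, rng):
    """One sweep of block Metropolis-Hastings moves over a single phi_i."""
    phi = phi.copy()
    for block in blocks:
        proposal = phi.copy()
        proposal[block] += rng.uniform(-delta, delta)  # common shift for the block
        if np.any(np.diff(proposal) <= 0):
            continue                                   # breaks the elution order
        log_ratio = log_post(proposal) - log_post(phi) # symmetric proposal
        if np.log(rng.uniform()) < log_ratio:
            phi = proposal
    return phi
```

With `r_block = 1.0` every block contains a single coefficient, so the sweep reduces to the element-wise update of Algorithm 2.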
We consider a mixture of transitions where $r_{\text{block}}$ is randomly selected from $\{1, 1/2, 1/4\}$ at each iteration. The configuration of blocks is therefore variable within a Markov chain. We summarize the procedure for the block Metropolis-Hastings technique in Algorithm 3.

Algorithm 3 Block Metropolis-Hastings algorithm
  Sample $r_{\text{block}} \sim U\{1, \tfrac{1}{2}, \tfrac{1}{4}\}$
  Sample $\delta \sim \tfrac{1}{2} \cdot U(0, \delta_{\text{small}}) + \tfrac{1}{2} \cdot U(0, \delta_{\text{large}})$
  Sample $b_j \sim \text{Bernoulli}(r_{\text{block}})$, for $j = 1, \ldots, K$
  for all blocks $\phi_{i,j:j+B_j-1}$ do
    Propose $\phi_{i,j:j+B_j-1}' \sim U\left(\phi_{i,j:j+B_j-1}^{(m)} - \delta, \phi_{i,j:j+B_j-1}^{(m)} + \delta\right)$
    Compute the likelihood ratio $r_L$
    Compute the prior ratio $r_P$
    Compute the acceptance probability $r_A = \min\{1, r_L \times r_P\}$
    Set $\phi_{i,j:j+B_j-1}^{(m+1)} = \phi_{i,j:j+B_j-1}'$ with probability $r_A$
  end for

For MCMC updates in the alignment model, most computational effort is spent in evaluating proposals by the Metropolis-Hastings algorithm. In each MCMC iteration, the element-wise Metropolis-Hastings move requires $N \times K$ proposals and evaluations for all the mapping function coefficients $\phi_{i,j}$, whereas the proposed block move reduces the expected number by a factor of $1/r_{\text{block}}$. With the considered mixture of transitions $\{1, 1/2, 1/4\}$, the expected factor of reduction is given by

\[
\frac{1}{3} \times 1 + \frac{1}{3} \times 2 + \frac{1}{3} \times 4 = \frac{7}{3}.
\]

The calculation is based on an unoptimized implementation of the Metropolis-Hastings algorithm, in which the computation involved in evaluating each proposed change is identical.

2.4 Number and position of knots

Knot specification for the mapping functions is crucial to the alignment result. Although accurate alignment requires sufficiently dense knots to enable precise adjustments, an overly dense knot specification restricts the transition flexibility and is prone to overfitting. We address the knot specification issue using stochastic search variable selection (SSVS) [40].
At each iteration, along with the update of all the model parameters $\{\theta, \phi\}$, a change is proposed for the knot specification using one of the following transition moves:

1. knot inclusion: adding a knot into the current knot list; or
2. knot exclusion: removing a knot from the current knot list.

In the alignment model, the first and the last time points are set as fixed knots, i.e., unchanged throughout the Markov chain, to control the span of the mapping function. For the middle $T - 2$ time points, $t_2, \ldots, t_{T-1}$, $K$ of them are determined as knots and their placement is $(\tau_1, \ldots, \tau_K)$. A binary indicator variable $\gamma_t$ associated with each time point is introduced to denote whether a knot is present at time $t$, that is, $\gamma_t = 1$ if $t$ belongs to $(\tau_1, \ldots, \tau_K)$ and $\gamma_t = 0$ otherwise. This binary indicator is assumed to follow a Bernoulli distribution with $p(\gamma_t = 1) = r_{\text{knot}}$, and thus the probability of a valid knot specification is given by

\[
p(\tau_1, \ldots, \tau_K) = r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}. \tag{2.22}
\]

We estimate the knot specification for each chromatogram separately, i.e., each mapping function $u_i$ is defined by its own set of knots $\tau_i$ and mapping function coefficients $\phi_i$. To keep the following discussion uncluttered, we make a slight abuse of notation by dropping the chromatogram index $i$. At each iteration, after the update of $\{\theta, \phi\}$, one of the middle $T - 2$ time points is randomly sampled from a uniform distribution, i.e., $t \sim U\{t_2, \ldots, t_{T-1}\}$, and its corresponding $\gamma_t$ is proposed to be updated to $\gamma_t'$ such that $\gamma_t' = 1 - \gamma_t$. The procedure for each proposal is summarized as follows.

Knot inclusion. When $\gamma_t = 0$ and $\tau_{k-1} < t < \tau_k$, $t$ is added into the knot list $(\gamma_t' = 1 \leftarrow \gamma_t = 0)$ such that the new set of knots becomes

\[
\tau' = (\tau_0', \ldots, \tau_{k-1}', \tau_k', \tau_{k+1}', \ldots, \tau_{K+2}') = (\tau_0, \ldots, \tau_{k-1}, t, \tau_k, \ldots, \tau_{K+1}).
\]

The mapping function coefficient for the newly added knot, $\phi_k'$, is sampled from a normal distribution truncated below by $\phi_{k-1}$ and above by $\phi_k$, $\nu \sim TN(\mu, \sigma^2)$, such that the new set of mapping function coefficients becomes

\[
\phi' = (\phi_0', \ldots, \phi_{k-1}', \phi_k', \phi_{k+1}', \ldots, \phi_{K+2}') = (\phi_0, \ldots, \phi_{k-1}, \nu, \phi_k, \ldots, \phi_{K+1}),
\]

where

\[
\mu = \frac{\tau_k - t}{\tau_k - \tau_{k-1}}\, \phi_{k-1} + \frac{t - \tau_{k-1}}{\tau_k - \tau_{k-1}}\, \phi_k, \tag{2.23}
\]

and

\[
\sigma = \min\{\phi_k - \mu,\; \mu - \phi_{k-1}\}/4. \tag{2.24}
\]

The acceptance probability for the proposal $(\gamma', \tau', \phi' \leftarrow \gamma, \tau, \phi)$ is calculated as

\[
r_A = \underbrace{\frac{p(y \mid \theta, \gamma', \tau', \phi')}{p(y \mid \theta, \gamma, \tau, \phi)}}_{\text{Likelihood ratio}} \times \underbrace{\frac{p(\gamma', \tau', \phi')}{p(\gamma, \tau, \phi)}}_{\text{Prior ratio}} \times \underbrace{\frac{q(\gamma, \tau, \phi \leftarrow \gamma', \tau', \phi')}{q(\gamma', \tau', \phi' \leftarrow \gamma, \tau, \phi)}}_{\text{Transition ratio}},
\]

where the likelihood ratio, the prior ratio and the transition ratio are considered. The likelihood function is given by

\[
p(y \mid \theta, \gamma, \tau, \phi) = \prod_{i=1}^{N} p\left(y_i \mid \hat{y}_i, \sigma_\varepsilon^2 I\right), \tag{2.25}
\]

which is a product of $N$ multivariate normal densities, where $\hat{y}_i = c_i \mathbf{1} + a_i \cdot m(u_i)$. Based on the priors of $\tau$ (Equation 2.22) and $\phi$ (Equation 2.11), the prior ratio is given by

\[
\begin{aligned}
\frac{p(\gamma', \tau', \phi')}{p(\gamma, \tau, \phi)}
&= \frac{r_{\text{knot}}^{K+1} (1 - r_{\text{knot}})^{T-3-K}}{r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}} \times \frac{\prod_{j=1}^{K+2} p_{TN}(\omega_j' \mid \omega_{j-1}', \sigma_\omega^2)}{\prod_{j=1}^{K+1} p_{TN}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)} \\
&= \frac{r_{\text{knot}}}{1 - r_{\text{knot}}} \times \frac{p_{TN}(\omega_k' \mid \omega_{k-1}', \sigma_\omega^2)\, p_{TN}(\omega_{k+1}' \mid \omega_k', \sigma_\omega^2)\, p_{TN}(\omega_{k+2}' \mid \omega_{k+1}', \sigma_\omega^2)}{p_{TN}(\omega_k \mid \omega_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega_{k+1} \mid \omega_k, \sigma_\omega^2)},
\end{aligned} \tag{2.26}
\]

where $\omega_{k-1}' = \omega_{k-1}$, $\omega_{k+2}' = \omega_{k+1}$,

\[
\omega_k' = \frac{\phi_k' - \phi_{k-1}'}{\tau_k' - \tau_{k-1}'} = \frac{\nu - \phi_{k-1}}{t - \tau_{k-1}}, \quad \text{and} \quad \omega_{k+1}' = \frac{\phi_{k+1}' - \phi_k'}{\tau_{k+1}' - \tau_k'} = \frac{\phi_k - \nu}{\tau_k - t}.
\]

The transition ratio is the remaining component, which is calculated by

\[
\frac{q(\gamma, \tau, \phi \leftarrow \gamma', \tau', \phi')}{q(\gamma', \tau', \phi' \leftarrow \gamma, \tau, \phi)} = \frac{1/(T-2)}{1/(T-2)} \times \frac{1}{p_{TN}(\phi_k' \mid \mu, \sigma^2)} \times \left|\frac{\partial(\tau', \phi')}{\partial(\tau, t, \phi, \nu)}\right| = \frac{1}{p_{TN}(\phi_k' \mid \mu, \sigma^2)}, \tag{2.27}
\]

where the Jacobian term is needed to account for the change in dimension of $\tau$ and $\phi$, and its determinant is equal to one because of the one-to-one deterministic mapping. The product of the three ratios is the acceptance probability for the proposal of knot inclusion.

Knot exclusion. When $\gamma_t = 1$ and $t = \tau_k$, the knot $\tau_k$ is removed from the knot list $(\gamma_t' = 0 \leftarrow \gamma_t = 1)$, such that the new set of knots and the mapping function coefficients become

\[
\tau' = (\tau_0', \ldots, \tau_{k-1}', \tau_k', \tau_{k+1}', \ldots, \tau_K') = (\tau_0, \ldots, \tau_{k-1}, \tau_{k+1}, \tau_{k+2}, \ldots, \tau_{K+1}),
\]

and

\[
\phi' = (\phi_0', \ldots, \phi_{k-1}', \phi_k', \phi_{k+1}', \ldots, \phi_K') = (\phi_0, \ldots, \phi_{k-1}, \phi_{k+1}, \phi_{k+2}, \ldots, \phi_{K+1}).
\]

Similar to the knot inclusion, the acceptance probability for $(\gamma', \tau', \phi' \leftarrow \gamma, \tau, \phi)$ involves the likelihood ratio, the prior ratio and the transition ratio, where the latter two are provided as follows. The prior ratio for knot exclusion is given by

\[
\begin{aligned}
\frac{p(\gamma', \tau', \phi')}{p(\gamma, \tau, \phi)}
&= \frac{r_{\text{knot}}^{K-1} (1 - r_{\text{knot}})^{T-1-K}}{r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}} \times \frac{\prod_{j=1}^{K} p_{TN}(\omega_j' \mid \omega_{j-1}', \sigma_\omega^2)}{\prod_{j=1}^{K+1} p_{TN}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)} \\
&= \frac{1 - r_{\text{knot}}}{r_{\text{knot}}} \times \frac{p_{TN}(\omega_k' \mid \omega_{k-1}', \sigma_\omega^2)\, p_{TN}(\omega_{k+1}' \mid \omega_k', \sigma_\omega^2)}{p_{TN}(\omega_k \mid \omega_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega_{k+1} \mid \omega_k, \sigma_\omega^2)\, p_{TN}(\omega_{k+2} \mid \omega_{k+1}, \sigma_\omega^2)},
\end{aligned} \tag{2.28}
\]

where $\omega_{k-1}' = \omega_{k-1}$, $\omega_{k+1}' = \omega_{k+2}$, and

\[
\omega_k' = \frac{\phi_k' - \phi_{k-1}'}{\tau_k' - \tau_{k-1}'} = \frac{\phi_{k+1} - \phi_{k-1}}{\tau_{k+1} - \tau_{k-1}}.
\]

To calculate the transition ratio, an imaginary move of knot inclusion that adds back the knot $\tau_k = t$ and samples the mapping function coefficient $\phi_k = \nu'$ needs to be considered.
With the imaginary knot inclusion, the transition ratio is given by

\[
\frac{q(\gamma, \tau, \phi \leftarrow \gamma', \tau', \phi')}{q(\gamma', \tau', \phi' \leftarrow \gamma, \tau, \phi)} = \frac{1/(T-2)}{1/(T-2)} \times p_{TN}(\phi_k \mid \mu, \sigma^2) \times \left|\frac{\partial(\tau', t, \phi', \nu')}{\partial(\tau, \phi)}\right| = p_{TN}(\phi_k \mid \mu, \sigma^2). \tag{2.29}
\]

Similar to the knot inclusion, the Jacobian determinant is one due to the deterministic one-to-one mapping.

2.5 Experiments

We applied the Bayesian alignment model (BAM) to a simulated data set, an LC-MS proteomic data set [76] and two LC-MS metabolomic data sets [71]. The performance of BAM was compared with that of the Bayesian hierarchical curve registration (BHCR) model [127], the dynamic time warping (DTW) model [129], and the continuous profile model (CPM) [74]. The advantage of applying appropriate retention time correction prior to performing a feature-based approach is demonstrated through the LC-MS metabolomic data sets.

2.5.1 Simulated data set

We generated a profile pattern composed of three Gaussian peaks with the same standard deviation but distinct mean values,

\[
m(t) = 10 \cdot \exp\left(-\frac{(t - \mu_1)^2}{2\sigma^2}\right) + 10 \cdot \exp\left(-\frac{(t - \mu_2)^2}{2\sigma^2}\right) + 10 \cdot \exp\left(-\frac{(t - \mu_3)^2}{2\sigma^2}\right),
\]

where $\mu_1 = 10$, $\mu_2 = 20$, $\mu_3 = 40$, $\sigma = 1.25$, and $t = 0.5, 1, \ldots, 50$ (100 time points). The mapping function $u_i$ was generated through the even-numbered order statistics as described in [44] to model the retention time variability. For the number of time points $T$ and the interval $(0, R)$ considered for the profile, $u_i$ is generated through the following steps:

1. Generate $2T + 1$ samples uniformly on the interval $(0, R)$.
2. Sort the $2T + 1$ samples in ascending order.
3. Pick the even-numbered samples from the sorted samples and assign them to $u_i(1), u_i(2), \ldots, u_i(T)$.

The density of $u_i(1), u_i(2), \ldots, u_i(T)$ is given by

\[
p(u_i) = \frac{(2T+1)!}{R^{2T+1}} \times \left(u_i(1) - 0\right) \times \left(u_i(2) - u_i(1)\right) \times \cdots \times \left(u_i(T) - u_i(T-1)\right) \times \left(R - u_i(T)\right), \tag{2.30}
\]

where $T = 100$ and $R = 50$ were chosen in the simulation.
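The three steps above can be sketched in a few lines of Python. The function names are hypothetical and the random seed is left to the caller; the sketch produces noise-free replicates $m(u_i(t))$ only.

```python
import numpy as np

def prototype(t, centers=(10.0, 20.0, 40.0), sigma=1.25, height=10.0):
    """Profile pattern m(t): three Gaussian peaks with a common standard deviation."""
    t = np.asarray(t, dtype=float)
    return sum(height * np.exp(-(t - mu) ** 2 / (2.0 * sigma ** 2)) for mu in centers)

def sample_mapping(T, R, rng):
    """Mapping function from the even-numbered order statistics of 2T + 1 uniforms."""
    draws = np.sort(rng.uniform(0.0, R, size=2 * T + 1))  # steps 1 and 2
    return draws[1::2]                                    # step 3: 2nd, 4th, ..., 2T-th

def simulate_replicates(n, T, R, rng):
    """Noise-free replicates m(u_i(t)), one row per replicate."""
    return np.stack([prototype(sample_mapping(T, R, rng)) for _ in range(n)])
```

Each sampled mapping is strictly increasing by construction, so the elution order of the peaks is preserved in every replicate.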
Ten replicate profiles were generated by the formulation:

\[
y_i(t) = m(u_i(t)) + \varepsilon_i(t), \quad i = 1, \ldots, 10,
\]

where random noise was produced based on the signal-to-noise ratio (SNR). Three values of SNR (30, 35 and 40 dB) were considered. Figure 2.3 depicts one realization of the simulated data with different noise levels. The data set simulates the scenario where peaks are not uniformly distributed along the time range, and different choices of knot specification may lead to distinct results.

[Figure 2.3 (eight panels of intensity versus time): (a) no noise, (b) SNR 40, (c) SNR 35, (d) SNR 30, (e) no noise, (f) SNR 40, (g) SNR 35, (h) SNR 30.]

Figure 2.3: One realization of simulated data with different noise levels: (a) no noise, (b) SNR 40, (c) SNR 35 and (d) SNR 30. (e)–(h) are the aligned data using BAM with SSVS.

Alignment performance was assessed based on two measurements: 1) the correlation coefficient between pairs of replicate data, and 2) the cross-correlation between the profile pattern and the replicate data. It should be emphasized that the pairwise correlation coefficients between data can exaggerate the true alignment performance, as the values do not reflect potential losses of chromatographic information. Thus, to ensure that the peak patterns are not significantly distorted during alignment, we calculated the correlation coefficients between the profile pattern and the aligned data as well.

We compared the performance of four alignment methods on the simulated data set: DTW, CPM, BHCR and BAM (with fixed knots and with SSVS).
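The noise model and the two performance measurements can be sketched as follows. The text does not define how the SNR is computed, so this sketch assumes SNR in dB equals ten times the base-10 logarithm of the mean signal power over the noise variance; it also assumes that both measurements are Pearson correlation coefficients averaged over pairs or over replicates. The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target SNR (assumed power-ratio definition)."""
    noise_var = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(scale=np.sqrt(noise_var), size=signal.shape)

def mean_pairwise_correlation(Y):
    """Average Pearson correlation over all pairs of replicates (rows of Y)."""
    pairs = combinations(range(len(Y)), 2)
    return float(np.mean([np.corrcoef(Y[i], Y[j])[0, 1] for i, j in pairs]))

def mean_correlation_with_pattern(Y, pattern):
    """Average Pearson correlation between each replicate and the profile pattern."""
    return float(np.mean([np.corrcoef(y, pattern)[0, 1] for y in Y]))
```

Reporting both numbers together guards against the failure mode noted above, in which replicates are warped to resemble each other while drifting away from the underlying pattern.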
Three values of knot density (0.05, 0.2 and 0.4) with equally spaced knots were considered for BHCR and BAM with fixed knots. The fixed knot specification for BAM is mainly for comparison purposes. In addition, BAM with SSVS, which can automatically handle the knot specification, was applied to the data. Table 2.2 summarizes the performance measurements before alignment and after alignment using each of the four methods. The results for the simulated data are based on 200 realizations.

From Table 2.2, significant improvements are observed after alignment by all the methods. DTW and CPM yield good performance in terms of pairwise correlation coefficient. However, their cross-correlation results suggest that the profiles are overly distorted in the effort to make them resemble each other. This phenomenon is undesirable since meaningful information is lost during the alignment. Overall, BAM with SSVS yields the best result in terms of both pairwise correlation coefficient and cross-correlation with the underlying pattern. The difference between BAM with fixed knots and BHCR is due to the ability of BAM to overcome the mixing problem of standard MCMC methods in multimodal models by using block Metropolis-Hastings updates, whereas BHCR is prone to getting stuck at local modes by relying on incremental updates. The distinction becomes significant with a knot density of 0.4, where BHCR leads to the worst performance in terms of both measurements.

As discussed in Section 2.4, accurate alignment requires sufficiently dense knot placement, while naively increasing the number of knots may lead to overfitting. This is primarily due to the monotonicity constraint for the parameter $\phi$ being controlled by the positions of the knots. Selecting a good knot specification is even more involved in practical applications, since there is no ground truth available on which to calibrate.
This problem is circumvented by the adaptive selection of knots in BAM with SSVS, which provides an automatic approach to place knots adaptively, according to the complexity of the underlying profile.

Table 2.2: Pairwise correlation coefficient and cross-correlation with the underlying pattern for the simulated LC-MS data, before alignment (original) and after alignment by DTW, CPM, BHCR and BAM (with fixed knots and with SSVS). Means (standard deviations) are reported for the simulated data based on 200 realizations. Values in parentheses after BHCR and BAM denote the knot density; SNR ∞ denotes the noise-free case.

Pairwise correlation
Method         SNR ∞           SNR 40          SNR 35          SNR 30
Original       0.366 (0.081)   0.356 (0.075)   0.352 (0.075)   0.312 (0.077)
DTW            0.836 (0.083)   0.826 (0.091)   0.817 (0.090)   0.790 (0.073)
CPM            0.871 (0.094)   0.870 (0.090)   0.879 (0.085)   0.850 (0.088)
BHCR (0.05)    0.837 (0.108)   0.834 (0.115)   0.809 (0.111)   0.765 (0.087)
BHCR (0.2)     0.866 (0.170)   0.849 (0.160)   0.812 (0.176)   0.765 (0.128)
BHCR (0.4)     0.461 (0.122)   0.448 (0.116)   0.437 (0.126)   0.403 (0.106)
BAM (0.05)     0.840 (0.111)   0.844 (0.108)   0.825 (0.096)   0.776 (0.081)
BAM (0.2)      0.896 (0.153)   0.880 (0.144)   0.845 (0.167)   0.812 (0.116)
BAM (0.4)      0.768 (0.200)   0.763 (0.190)   0.703 (0.191)   0.628 (0.156)
BAM (SSVS)     0.938 (0.116)   0.933 (0.109)   0.900 (0.112)   0.840 (0.091)

Cross-correlation
Method         SNR ∞           SNR 40          SNR 35          SNR 30
Original       0.822 (0.034)   0.816 (0.036)   0.808 (0.036)   0.778 (0.035)
DTW            0.881 (0.057)   0.876 (0.059)   0.871 (0.061)   0.848 (0.056)
CPM            0.893 (0.067)   0.889 (0.069)   0.894 (0.061)   0.869 (0.066)
BHCR (0.05)    0.918 (0.040)   0.919 (0.034)   0.910 (0.035)   0.888 (0.029)
BHCR (0.2)     0.934 (0.054)   0.935 (0.043)   0.924 (0.044)   0.896 (0.041)
BHCR (0.4)     0.848 (0.039)   0.841 (0.041)   0.836 (0.042)   0.809 (0.039)
BAM (0.05)     0.917 (0.041)   0.921 (0.033)   0.910 (0.032)   0.891 (0.027)
BAM (0.2)      0.946 (0.043)   0.940 (0.041)   0.932 (0.040)   0.909 (0.034)
BAM (0.4)      0.916 (0.053)   0.912 (0.053)   0.896 (0.053)   0.863 (0.048)
BAM (SSVS)     0.947 (0.044)   0.945 (0.045)   0.939 (0.035)   0.911 (0.032)

2.5.2 LC-MS spike-in data set

The spike-in data set by Listgarten et al. [76] consists of two aliquots of the same human serum sample where the second aliquot has three known peptides spiked in. Seven replicate LC-MS runs from each aliquot were acquired using a capillary-scale LC coupled to an ion-trap mass spectrometer. Each run is preprocessed and represented by a 501 (RT points) × 2401 (m/z bins) data matrix. In order to know the true differences between the two aliquots in the LC-MS data, eight ground-truth runs from a mixture of spiked-in peptides (without serum) were acquired and 32 experimentally detected m/z values of ground-truth were reported. The spike-in experiment can help to evaluate alignment results based on the true differences spiked in the sample. Detailed experimental information can be found in [76].

Figure 2.4 depicts the base peak chromatograms of the 14 LC-MS runs, where significant shifts are observed along the RT points. We applied four models, DTW, CPM, BHCR and BAM, to align the chromatograms (Figure 2.5). The alignment result by DTW is shown in Figure 2.5a, where DTW is prone to overly distorting the profiles and the estimator often gets stuck in local optima. CPM yields the best performance in this data set in terms of both visual assessment (Figure 2.5b) and correlation coefficients as shown in Table 2.3. In general, it works quite well on problems with moderate dimension (less than 1000).

Figure 2.4: (a) Base peak chromatograms of the original LC-MS data. (b) zooms in the retention time range 100–250 for the chromatograms in (a). (© 2013 IEEE) (Legend: serum; serum + spiked-in peptides. Axes: retention time vs. ion count.)

The results by the two Bayesian models, BHCR and BAM, are shown in Figures 2.5c and 2.5d, respectively. In the result by BHCR, several peaks from two replicates of the second aliquot are not correctly aligned to the majority of peaks in RT range 100–250. Instead, they are erroneously aligned to other tiny peaks around their original retention times. A similar issue is observed in the DTW result. As mentioned in Section 2.3.2, the element-wise Metropolis-Hastings move utilized in BHCR is prone to getting stuck at local modes. It is particularly hard to get away from the trap if the true parameter values are far from the current values and lie beyond the range of values that can be proposed by the Markov chain transition moves.

In contrast to BHCR, this trapping effect is overcome by BAM as shown in Figure 2.5d. The inference is based on 15,000 MCMC iterations obtained after discarding the initial 5,000 iterations as burn-in. The knot specification is automatically handled with the SSVS procedure. Figure 2.6a shows the trace plot of the number of knots selected at each MCMC iteration for the chromatogram from the seventh replicate of the second serum aliquot; we see that it stabilizes around 60 knots. Figure 2.6b gives a summary of the numbers of knots across the models visited by the MCMC sampler for each of the 14 chromatograms.

Figure 2.7a depicts the generated profiles by BAM during the initial 200 MCMC iterations for the chromatogram corresponding to the seventh replicate of the serum aliquot with spiked-in peptides.
The block move effectively corrects the significant misalignments at the beginning of the Markov chain, whereas the MCMC sampler for BHCR gets stuck at inaccurate retention time points (Figure 2.7b).

Table 2.3: Correlation coefficients for the LC-MS spike-in data, before alignment (original) and after alignment by DTW, CPM, BHCR (with knot density of 0.2) and BAM (with SSVS). (© 2013 IEEE)

           Original   DTW    CPM    BHCR   BAM
Group 1    0.35       0.86   0.95   0.92   0.92
Group 2    0.28       0.88   0.94   0.77   0.91

Figure 2.8 shows the posterior difference between the estimated mapping function and the identity function with the 90% credible interval for all the chromatograms. It is worth pointing out that the nonlinear variability of the LC process, as shown in the figure, necessitates a nonlinear modeling of the mapping function. RT ranges where significant peaks are present yield tighter credible intervals, and thus higher confidence in the alignment result.

Finally, according to the ground-truth of 32 m/z values reported in [76], we observe a clear contrast in the extracted ion chromatograms (EICs) of 16 m/z values (433.63, 513.50, 524.48, 535.00, 601.03, 615.60, 647.50, 649.11, 674.93, 699.03, 784.50, 811.00, 1047.12, 1297.43, 1348.65, and 1575.09) between two sets of LC-MS runs corresponding to the “presence” and “absence” of spiked-in peptides. We use these EICs to demonstrate the alignment performance. Figures 2.9 and 2.10 depict the 16 EICs before alignment and after alignment by BAM, respectively. As shown in Figure 2.10, the retention time shifts observed in the original EICs are effectively corrected by applying BAM. In addition, compared to the EICs prior to the alignment, the aligned EICs exhibit more distinct and specific differences between the two sets of LC-MS runs in terms of the spiked-in peptides. This will facilitate the subsequent peak matching step, as elaborated in Section 2.5.3.

Figure 2.5: Aligned chromatograms by (a) DTW, (b) CPM, (c) BHCR, and (d) BAM. (e), (f), (g), and (h) zoom in the retention time range 100–250 for the chromatograms in (a), (b), (c), and (d), respectively. Misalignments by DTW and BHCR are observed in (e) and (g). (© 2013 IEEE) (Legend: serum; serum + spiked-in peptides. Axes: retention time vs. ion count.)

2.5.3 LC-MS metabolomic data sets

Lange et al. [71] compared a set of feature-based alignment models on four publicly available data sets (two proteomic and two metabolomic data sets)². In this benchmark study, peak detection was performed on the raw data and the resulting peak list was stored in a .featureXML file (format of OpenMS [121]) for each LC-MS run.
In addition to the peak lists, the two metabolomic data sets, designated as M1 and M2, have raw data available in .mzData and .netCDF formats, respectively. To evaluate the alignment result, ground-truth data were generated based on ion annotation [126], correlation of chromatographic profile, and consistency of peak. Comparison was carried out by measuring recall and precision by the alignment models against the ground-truth data. For the details, we refer interested readers to the paper [71].

² Available at http://msbi.ipb-halle.de/msbi/caap

Figure 2.6: (a) Trace plot of the number of knots in the models visited at each MCMC iteration for the chromatogram from the seventh replicate of the second serum aliquot. (b) Box plot of the number of knots visited by the MCMC sampler for each chromatogram. (© 2013 IEEE)

As discussed in Section 1.3, correct identification of the consensus peaks is crucial for the feature-based approaches. We presume that appropriate retention time correction can facilitate this process and subsequently lead to improved performance. To confirm this idea, we applied DTW, CPM and BAM to both M1 and M2 data sets based on the chromatograms extracted from the raw data, where M1 and M2 consist of 44 and 24 LC-MS runs, respectively. According to the estimated mapping functions, we modified the RT values in the peak lists, i.e., replacing t by u_i(t), and applied a feature-based alignment model, OpenMS [121], on the adjusted lists using the same m/z tolerances as in [71]. For each LC-MS run, binning along the m/z dimension was performed to obtain 100 binned chromatograms of identical total ion counts.
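The retention time adjustment described above, i.e., replacing t by u_i(t) in a peak list, can be sketched as follows. This is a minimal illustration that assumes the estimated mapping function is available as values on a grid of RT points and is evaluated at peak positions by linear interpolation; the function name and the toy numbers are hypothetical and do not come from the M1 or M2 data.

```python
import numpy as np

def adjust_retention_times(peak_rts, grid, mapping):
    """Replace each peak retention time t by u_i(t), interpolating linearly on the grid."""
    return np.interp(peak_rts, grid, mapping)

# Hypothetical mapping function u_i evaluated on a coarse RT grid.
grid = np.array([0.0, 100.0, 200.0, 300.0])
mapping = np.array([0.0, 95.0, 205.0, 300.0])

# Hypothetical retention times taken from a peak list.
peaks = np.array([50.0, 150.0, 250.0])
adjusted = adjust_retention_times(peaks, grid, mapping)  # 47.5, 150.0, 252.5
```

Because the mapping function is monotone, the order of peaks within a run is preserved after the adjustment, so the adjusted list can be passed to the feature-based model unchanged in structure.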
Chromatogram quality was evaluated using the mass chromatographic quality (MCQ) value accounting for potential baseline and noise [139]. Those chromatograms with MCQ value less than 0.85 were screened out and the sums of the remaining ones were used as input to the profile-based models. Figure 2.11 depicts the chromatograms in data sets M1 and M2, before and after alignment by BAM. By adjusting the RT values based on the BAM alignment result, the identification of the consensus peaks using the feature-based models is reduced to a simpler task.

Table 2.4 presents the performance measurements, in terms of recall, precision, and F-measure, when using the feature-based model alone (raw) and when coupling the profile-based models (DTW, CPM, and BAM) to the feature-based model. We note that applying first a profile-based alignment model can lead to improved results, although this is not always the case. If the mapping functions are not correctly estimated, this procedure can deteriorate the result by making the generation of consensus peaks even more difficult. Using DTW for retention time correction yields improved performance on M2 but lower recall on M1. This may be due to the low SNR in the latter.

Figure 2.7: Chromatogram of the seventh replicate from the second serum aliquot and generated profiles based on the sampled model parameters during the initial 200 MCMC iterations for (a) BAM and (b) BHCR. The region where the MCMC sampler for BHCR gets stuck at inaccurate retention time points is highlighted. (© 2013 IEEE) (Legend: chromatogram; iterations 1–50; iterations 51–100; iterations 101–200. Axes: retention time vs. ion count.)
We calculate the SNR for these data as

$$\widehat{\mathrm{SNR}} = \mathrm{E}\left[ \frac{1}{T} \sum_{t \in \{t_1, \ldots, t_T\}} \frac{m^2(t)}{\sigma_\varepsilon^2} \right],$$

based on the posterior distribution estimated by BAM. The estimates of SNR are 18.58 and 41.07 in M1 and M2, respectively. The higher SNR in M2 suggests that better profiles are available for mapping function estimation and a more accurate estimation result is expected. On the other hand, it is more challenging to align the noisy chromatograms in M1. As also demonstrated in Sections 2.5.1 and 2.5.2, DTW is prone to overfitting the data, which may overly distort the profile and fail to estimate the correct mapping function.

The other considered profile-based model, CPM, shows good performance in the simulated data and the LC-MS proteomic data. Unfortunately, we have been unable to effectively use CPM for retention time correction in the metabolomic data sets, primarily due to the higher-dimensional data (1525 and 2397 RT points in M1 and M2, respectively). The current version of CPM does not correctly estimate the mapping functions in M1 and fails to process the data in M2 due to numerical problems³. As CPM maps the data onto a higher-dimensional space, further effort is needed to apply the model to high-resolution LC-MS data. We note that using BAM for retention time correction prior to applying the feature-based model leads to improved performance on both data sets, M1 and M2.

³ The program was run on a PC with an Intel Core 2 Duo 64 bit 2.66 GHz CPU and 8 GB RAM. This implementation issue was also noted by the author at http://www.cs.toronto.edu/~jenn/CPM/README.txt

Table 2.4: Comparison of the peak matching results by using OpenMS alone (raw) and using three profile-based alignment models (DTW, CPM and BAM) for retention time correction prior to applying OpenMS on the metabolomic data sets. (© 2013 IEEE)

Data set   Measure      Raw    DTW    CPM    BAM
M1         Recall       0.87   0.85   0.34   0.88
M1         Precision    0.69   0.74   0.68   0.74
M1         F-measure    0.77   0.79   0.45   0.80
M2         Recall       0.93   0.97   –      0.97
M2         Precision    0.79   0.83   –      0.83
M2         F-measure    0.85   0.89   –      0.89

2.6 Alternative formulation

The monotonicity constraint on the mapping function can hinder efficient estimation of the mapping function coefficients $\phi$ since it is not straightforward to make effective proposals on a constrained space. This issue can be circumvented by mapping the constrained coefficients $\phi$ onto a constraint-free space through the Jupp transformation, where efficient MCMC methods such as the Hamiltonian Monte Carlo algorithm can be applied to estimate the transformed coefficients. This section presents an alternative formulation to further investigate the profile-based alignment.

2.6.1 Jupp transformation

The Jupp transformation and the inverse transformation are given below⁴ [64]:

• Jupp transformation: $\vartheta = \mathrm{Jupp}(\phi)$

$$\vartheta_j = \begin{cases} \phi_j, & j = 0, K+1, \\ \log \dfrac{\phi_{j+1} - \phi_j}{\phi_j - \phi_{j-1}}, & j = 1, \ldots, K. \end{cases} \tag{2.31}$$

• Inverse Jupp transformation: $\phi = \mathrm{Jupp}^{-1}(\vartheta)$

$$\phi_j = \vartheta_0 + (\vartheta_{K+1} - \vartheta_0) \times \frac{1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{j-1})}{1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K)}, \tag{2.32}$$

for $j = 1, \ldots, K$, where $\phi_0 = \vartheta_0$ and $\phi_{K+1} = \vartheta_{K+1}$.

⁴ The index of sample i is dropped in this section to keep the notation uncluttered.

The inverse Jupp transformation ensures the monotonicity in $\phi$ since (positive) exponential terms are accumulated sequentially. In the following, we introduce fundamental properties of $\vartheta$ that are used in the Hamiltonian Monte Carlo algorithm.

Prior of $\vartheta$. As described in Section 2.2, the prior of the mapping function coefficient $\phi$ is specified via the slope of each segment: $\omega_j \sim \mathcal{TN}(\omega_{j-1}, \sigma_\omega^2)$.
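Assuming $\sigma_\omega^2$ is small enough such that $p_{\mathcal{TN}}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2) \approx p_{\mathcal{N}}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)$ for $j = 1, \ldots, K+1$, the prior for $\omega_j$ can be presented as

$$\omega_j \sim \mathcal{N}(\omega_0, j\sigma_\omega^2). \tag{2.33}$$

Furthermore, if the knots are equally-spaced along retention time, i.e., $D_\tau = \tau_j - \tau_{j-1}$, for $j = 1, \ldots, K+1$, the prior for the difference between adjacent mapping function coefficients, $\phi_j - \phi_{j-1}$, is given by

$$(\phi_j - \phi_{j-1}) \sim \mathcal{N}(D_\tau \omega_0, jD_\tau^2 \sigma_\omega^2) = \mathcal{N}(D_\tau, jD_\tau^2 \sigma_\omega^2), \tag{2.34}$$

and equivalently,

$$\phi_j - \phi_{j-1} = D_\tau + \varepsilon_0 + \varepsilon_1 + \cdots + \varepsilon_{j-1}, \tag{2.35}$$

where $\varepsilon_j \overset{\text{iid}}{\sim} \mathcal{N}(0, D_\tau^2 \sigma_\omega^2)$, and $j = 1, \ldots, K$. Based on the definition of the Jupp transformation, $\vartheta_j$ can be rewritten as

$$\vartheta_j = \log \frac{\phi_{j+1} - \phi_j}{\phi_j - \phi_{j-1}} = \log \frac{\phi_j - \phi_{j-1} + \varepsilon_j}{\phi_j - \phi_{j-1}} = \log \left( 1 + \frac{\varepsilon_j}{D_\tau + \varepsilon_0 + \cdots + \varepsilon_{j-1}} \right). \tag{2.36}$$

Using the delta method, the variance of $\vartheta_j$ can be approximated as

$$\mathrm{Var}(\vartheta_j) \approx \sum_{l=0}^{j} \left( \frac{\partial \vartheta_j}{\partial \varepsilon_l} \right)^2 \mathrm{Var}(\varepsilon_l) + \sum_{l=0}^{j} \sum_{m \neq l} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_j}{\partial \varepsilon_m} \mathrm{Cov}(\varepsilon_l, \varepsilon_m) = \sigma_\omega^2. \tag{2.37}$$

Similarly, the covariance $\mathrm{Cov}(\vartheta_j, \vartheta_k)$ can be approximated as

$$\mathrm{Cov}(\vartheta_j, \vartheta_k) \approx \sum_{l=0}^{\max(j,k)} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_k}{\partial \varepsilon_l} \mathrm{Var}(\varepsilon_l) + \sum_{l=0}^{\max(j,k)} \sum_{m \neq l} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_k}{\partial \varepsilon_m} \mathrm{Cov}(\varepsilon_l, \varepsilon_m) = 0. \tag{2.38}$$

Based on Equations 2.37 and 2.38, the prior of $\vartheta$ is given by

$$\vartheta_j \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\omega^2), \quad j = 1, \ldots, K. \tag{2.39}$$

Partial derivative. As the Jupp transformation is a non-linear mapping between $\phi$ and $\vartheta$, it is important to know how a change in the transformed coefficients $\vartheta$ affects the mapping function coefficients $\phi$, and consequently, the mapping function $u(t)$.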
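To make this dependence concrete, the following sketch implements Equations 2.31 and 2.32 and perturbs a single transformed coefficient numerically. The function names and the finite-difference step are illustrative assumptions, not part of the implementation used in this work.

```python
import numpy as np

def jupp(phi):
    """Equation 2.31: map increasing coefficients phi_0..phi_{K+1} to theta."""
    phi = np.asarray(phi, dtype=float)
    diffs = np.diff(phi)                       # phi_{j+1} - phi_j for j = 0..K
    theta = phi.copy()                         # the two end points are kept as they are
    theta[1:-1] = np.log(diffs[1:] / diffs[:-1])
    return theta

def jupp_inverse(theta):
    """Equation 2.32: map unconstrained theta back to increasing phi."""
    theta = np.asarray(theta, dtype=float)
    # s[l] = exp(theta_1 + ... + theta_l), with s[0] = 1
    s = np.exp(np.concatenate(([0.0], np.cumsum(theta[1:-1]))))
    c = np.cumsum(s)
    frac = np.concatenate(([0.0], c)) / c[-1]  # ratio in Eq. 2.32 for j = 0..K+1
    return theta[0] + (theta[-1] - theta[0]) * frac

def sensitivity(theta, k, h=1e-6):
    """Finite-difference change of every phi_j when theta_k is increased by h."""
    bumped = np.array(theta, dtype=float)
    bumped[k] += h
    return (jupp_inverse(bumped) - jupp_inverse(theta)) / h

phi = np.array([0.0, 1.0, 3.0, 4.0, 10.0])     # toy coefficients with K = 3
theta = jupp(phi)
recovered = jupp_inverse(theta)                # equals phi up to rounding
column = sensitivity(theta, k=2)               # effect of theta_2 on all phi_j
```

Any real-valued interior ϑ maps back to strictly increasing φ, which is what allows unconstrained proposals. In this toy example a change in one ϑ_k moves every interior φ_j while leaving the two end points fixed.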
This can be characterized by the partial derivative $\partial \phi_j / \partial \vartheta_k$, where two cases are distinguished:

$$\frac{\partial \phi_j}{\partial \vartheta_k} = \frac{-(\vartheta_{K+1} - \vartheta_0)}{\left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]^2} \times \left[ \exp(\vartheta_1 + \cdots + \vartheta_k) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right] \times \left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{j-1}) \right], \tag{2.40}$$

for $k \geq j$, and

$$\frac{\partial \phi_j}{\partial \vartheta_k} = \frac{-(\vartheta_{K+1} - \vartheta_0)}{\left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]^2} \times \left[ \exp(\vartheta_1 + \cdots + \vartheta_j) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right] \times \left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{k-1}) \right], \tag{2.41}$$

for $k < j$. The partial derivative can be integrated into MCMC sampling schemes to explore the distribution of interest in a more effective manner. This is accomplished by using the Hamiltonian Monte Carlo algorithm.

2.6.2 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is an efficient MCMC algorithm [27,92]. It generates samples through simulating appropriate Hamiltonian dynamics, where a proposal is determined in consideration of an auxiliary momentum variable. In the estimation of $\vartheta$, the logarithm of the desired density is denoted by $\mathcal{L}(\vartheta)$. We introduce an independent auxiliary variable $p \sim \mathcal{N}(0, M)$, with the same dimension as $\vartheta$, and the negative joint log-density of $(\vartheta, p)$ is given by

$$H(\vartheta, p) = -\mathcal{L}(\vartheta) + \frac{1}{2} \log \left( (2\pi)^K |M| \right) + \frac{1}{2} p^\top M^{-1} p. \tag{2.42}$$

This equation has a physical analogy to a Hamiltonian, which is the sum of a potential energy function $-\mathcal{L}(\vartheta)$ defined by the position variable $\vartheta$, and a kinetic energy $p^\top M^{-1} p / 2$. The variables $p$ and $M$ are therefore interpreted as the momentum and the mass matrix, respectively. In addition, the dynamics of $\vartheta$ and $p$ are determined by Hamilton's equations:

$$\frac{d\vartheta}{dt_H} = \frac{\partial H}{\partial p} = M^{-1} p, \tag{2.43}$$

$$\frac{dp}{dt_H} = -\frac{\partial H}{\partial \vartheta} = \nabla_\vartheta \mathcal{L}(\vartheta). \tag{2.44}$$

Based on the equations, a Markov chain can be generated via the Hamiltonian dynamics. Specifically, at each iteration, a position trajectory of $\vartheta$ is simulated and the ending point of the trajectory is proposed as a new state to be evaluated by the Metropolis algorithm, where the acceptance probability is based on the energy difference between the starting and ending points of the trajectory.

For numerical implementation of non-trivial problems, however, Hamilton's equations can only be approximated by discretizing the dynamics with some small stepsize $\epsilon$. This is typically performed using the “leapfrog” integrator as formulated below:

$$p(t_H + \epsilon/2) = p(t_H) + \epsilon \nabla_\vartheta \mathcal{L}(\vartheta(t_H)) / 2,$$
$$\vartheta(t_H + \epsilon) = \vartheta(t_H) + \epsilon M^{-1} p(t_H + \epsilon/2),$$
$$p(t_H + \epsilon) = p(t_H + \epsilon/2) + \epsilon \nabla_\vartheta \mathcal{L}(\vartheta(t_H + \epsilon)) / 2.$$

With appropriate stepsize $\epsilon$ and number of steps $L$, the Hamiltonian Monte Carlo algorithm proceeds with $L$ steps of the leapfrog integrator, followed by an evaluation using the Metropolis algorithm as shown in Algorithm 4.

Algorithm 4 Hamiltonian Monte Carlo
  Given ϑ^(m), ϵ, L
  Sample a new momentum p ∼ N(0, M)
  Compute the Hamiltonian H based on (ϑ^(m), p) using Equation 2.42
  Set ϑ′ ← ϑ^(m), p′ ← p
  for l = 1 to L do
    Set (ϑ′, p′) ← Leapfrog(ϑ′, p′, ϵ)
  end for
  Compute the Hamiltonian H′ based on (ϑ′, p′) using Equation 2.42
  Set ϑ^(m+1) ← ϑ′ with probability min{1, exp(−H′ + H)}, otherwise set ϑ^(m+1) ← ϑ^(m)

We demonstrate the advantage of using the HMC algorithm through generating samples from a bivariate normal distribution (example adapted from [92]). The Hamiltonian is defined by

$$H(\vartheta, p) = \frac{1}{2} \vartheta^\top \Sigma^{-1} \vartheta + \frac{1}{2} p^\top p, \quad \text{with } \Sigma = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix},$$

where the two components of $\vartheta$ are highly correlated. Figure 2.12 shows the simulated trajectories using 20 leapfrog steps, with an initial position at the lower-left side of the distribution. Two different initial conditions of the momentum $p$ are considered. As shown in the figure, random-walk behavior is eliminated in the trajectories. Instead, both trajectories track the distribution in a systematic manner, assisted by the information from the auxiliary momentum.
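The leapfrog update and the Metropolis evaluation of Algorithm 4 can be sketched for this bivariate normal example as follows. This is a minimal illustration under stated assumptions: an identity mass matrix, the constant term of Equation 2.42 dropped because it cancels in the energy difference, and function names, random seed and chain length chosen for illustration only. It is not the code used to produce Figures 2.12 and 2.13.

```python
import numpy as np

SIGMA = np.array([[1.0, 0.95], [0.95, 1.0]])
SIGMA_INV = np.linalg.inv(SIGMA)

def log_density(theta):
    """Log of the target density up to an additive constant."""
    return -0.5 * theta @ SIGMA_INV @ theta

def grad_log_density(theta):
    return -SIGMA_INV @ theta

def hamiltonian(theta, p):
    """Potential plus kinetic energy with identity mass matrix (constant omitted)."""
    return -log_density(theta) + 0.5 * p @ p

def leapfrog(theta, p, eps):
    """One leapfrog step: half momentum step, full position step, half momentum step."""
    p = p + 0.5 * eps * grad_log_density(theta)
    theta = theta + eps * p
    p = p + 0.5 * eps * grad_log_density(theta)
    return theta, p

def hmc_step(theta, eps, n_steps, rng):
    """One HMC iteration; returns the next state and whether the proposal was accepted."""
    p = rng.standard_normal(theta.shape)
    h_start = hamiltonian(theta, p)
    proposal = theta
    for _ in range(n_steps):
        proposal, p = leapfrog(proposal, p, eps)
    h_end = hamiltonian(proposal, p)
    if rng.random() < np.exp(min(0.0, h_start - h_end)):
        return proposal, True
    return theta, False

rng = np.random.default_rng(1)
theta = np.array([-1.5, -1.5])
samples, accepted = [], 0
for _ in range(2000):
    theta, ok = hmc_step(theta, eps=0.25, n_steps=20, rng=rng)
    samples.append(theta)
    accepted += ok
samples = np.array(samples)
acceptance_rate = accepted / len(samples)
```

The stepsize and number of leapfrog steps match the values quoted for Figure 2.12. The empirical correlation of `samples` can be compared with the target value of 0.95 as a quick sanity check.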
With the simulated Hamiltonian shown in Figure 2.13, the acceptance probability $p_A$ of the Metropolis update is determined based on the energy difference between the starting and ending points of the trajectory. That is, $p_A = \exp(-0.44) = 0.64$ for the first proposal in Figure 2.12a, and $p_A = \exp(-0.037) = 0.96$ for the second proposal in Figure 2.12b. This example demonstrates the capability of using the HMC algorithm to effectively explore the distribution of interest. In Section 2.6.3, we introduce the application of the HMC algorithm to perform inference of the Jupp transformed coefficients in the profile-based alignment problem.

2.6.3 Single-profile alignment using Hamiltonian Monte Carlo

We use the HMC algorithm to estimate the Jupp transformed coefficients $\vartheta$ in the single-profile alignment model, where $\mathcal{L}(\vartheta)$ is defined as

$$\mathcal{L}(\vartheta) = \log p(y \mid \vartheta) + \log p(\vartheta) = -\sum_{t \in \{t_1, \ldots, t_T\}} \frac{\left( y(t) - \hat{y}(t) \right)^2}{2\sigma_\varepsilon^2} - \sum_{j=1}^{K} \frac{\vartheta_j^2}{2\sigma_\omega^2} + \text{constant}. \tag{2.45}$$

The generated profile $\hat{y}(t) = c + a \cdot m(u(t))$ is determined based on the translation and scaling variables $c$, $a$, the prototype function $m(t)$ and the mapping function $u(t)$. The gradient required by the leapfrog integrator is given by

$$\frac{\partial \mathcal{L}}{\partial \vartheta_k} = \frac{-1}{\sigma_\varepsilon^2} \sum_{t \in \{t_1, \ldots, t_T\}} \left( y(t) - \hat{y}(t) \right) \left( -a\, m'(u(t)) \right) \times \left( \frac{\partial u}{\partial \phi_j} \frac{\partial \phi_j}{\partial \vartheta_k} + \frac{\partial u}{\partial \phi_{j+1}} \frac{\partial \phi_{j+1}}{\partial \vartheta_k} \right) - \frac{\vartheta_k}{\sigma_\omega^2}$$

$$= \frac{a}{\sigma_\varepsilon^2} \sum_{t \in \{t_1, \ldots, t_T\}} \left( y(t) - \hat{y}(t) \right) m'(u(t)) \times \left( \frac{\tau_{j+1} - t}{\tau_{j+1} - \tau_j} \times \frac{\partial \phi_j}{\partial \vartheta_k} + \frac{t - \tau_j}{\tau_{j+1} - \tau_j} \times \frac{\partial \phi_{j+1}}{\partial \vartheta_k} \right) - \frac{\vartheta_k}{\sigma_\omega^2}, \tag{2.46}$$

where $\partial \phi_j / \partial \vartheta_k$ is given in Equations 2.40 and 2.41 and, for each $t$, $j$ denotes the index of the knot interval $[\tau_j, \tau_{j+1}]$ containing $t$.

We applied the HMC alignment model to the LC-MS spike-in data set discussed in Section 2.5.2. In the HMC model, the trajectory is simulated using a mixture of two stepsizes ($4 \times 10^{-6}$ and $8 \times 10^{-6}$) and 200 leapfrog steps. With this setting, the acceptance rates (among 1000 iterations) are 0.39 and 0.35 for $\epsilon = 4 \times 10^{-6}$ and $\epsilon = 8 \times 10^{-6}$, respectively.
Figure 2.14 depicts the base peak chromatograms of the 14 LC-MS runs, before alignment and after alignment using the HMC model. As shown in the figure, significant misalignments in the original chromatograms are effectively corrected within the initial 10 iterations, and further improvement at both ends of the chromatograms can be achieved with additional HMC iterations.

While the preliminary result appears encouraging, the main issue of using the HMC algorithm is to select an appropriate stepsize ϵ and number of steps L for simulating a trajectory such that effective MCMC moves can be proposed with high acceptance probabilities. The current experimental setting was obtained through trial and error, following guidelines introduced in [92]. The tuning of the parameters requires time and experience and may limit the model's practical applicability. Incorporation of adaptive HMC samplers, e.g., the recently developed “No-U-Turn Sampler” (NUTS) [53], may deserve further investigation.

2.7 Summary

This chapter introduces a Bayesian alignment model for LC-MS data using single chromatograms from multiple LC-MS runs. The single-profile model improves on existing Bayesian methods by 1) using an efficient MCMC sampler, and 2) adaptively selecting knots for the mapping function. Due to the mathematical intractability of the mapping function and the monotonicity constraint imposed on it, designing an effective updating scheme is crucial to ensure good mixing of the MCMC sampler. We propose a block Metropolis-Hastings algorithm that enables flexible transition and prevents the sampler from getting trapped in local modes of the posterior distribution. Moreover, an extension using SSVS is provided for adaptive knot specification.
For the profile-based alignment, evaluation on both simulated and real data sets shows improved alignment results in terms of pairwise correlation coefficients, cross-correlation with the underlying pattern, as well as visual assessment relative to the ground-truth. In addition, alignment by the proposed model is demonstrated to facilitate the subsequent peak matching process using two metabolomic benchmark data sets. Unresolved issues include the following: 1) lack of integration of informative prior knowledge, e.g., the information of internal standards, and 2) implicit assumption of the existence of an underlying pattern based on a single ion chromatogram. Identifying representative chromatograms can potentially improve the alignment accuracy by providing more comprehensive information about the retention time variability across LC-MS runs. The strategy is computationally feasible since the alignment model can be extended to handle multiple chromatograms by introducing associated prototype functions accordingly. An important practical issue is how to extract the informative chromatograms from an LC-MS run, which is discussed in the next chapter. 45 Chapter 2. 
Single-profile Bayesian alignment model Chromatogram 1 (first aliquot) Chromatogram 1 (second aliquot) 20 Difference Difference 20 10 0 −10 −20 0 100 200 300 400 10 0 −10 −20 500 0 100 Retention time Chromatogram 2 (first aliquot) Difference Difference 0 −10 0 100 200 300 400 0 −10 −20 500 0 100 Chromatogram 3 (first aliquot) Difference Difference −10 0 100 200 300 400 0 −10 −20 500 0 100 Chromatogram 4 (first aliquot) 400 500 Chromatogram 4 (second aliquot) Difference Difference 300 20 10 0 −10 0 100 200 300 400 10 0 −10 −20 500 0 100 Retention time 200 300 400 500 Retention time Chromatogram 5 (first aliquot) Chromatogram 5 (second aliquot) 20 Difference 20 Difference 200 Retention time 20 10 0 −10 0 100 200 300 400 10 0 −10 −20 500 0 100 Retention time 200 300 400 500 Retention time Chromatogram 6 (first aliquot) Chromatogram 6 (second aliquot) 20 Difference 20 Difference 500 10 Retention time 10 0 −10 0 100 200 300 400 10 0 −10 −20 500 0 100 Retention time 200 300 400 500 Retention time Chromatogram 7 (first aliquot) Chromatogram 7 (second aliquot) 20 Difference 20 Difference 400 Chromatogram 3 (second aliquot) 0 10 0 −10 −20 300 20 10 −20 200 Retention time 20 −20 500 10 Retention time −20 400 20 10 −20 300 Chromatogram 2 (second aliquot) 20 −20 200 Retention time 0 100 200 300 Retention time 400 500 10 0 −10 −20 0 100 200 300 400 500 Retention time Figure 2.8: Difference between the identity function and estimated mapping function obtained from the posterior median by BAM for each of the 14 chromatograms. The filled c 2013 IEEE) region corresponds to the 90% credible interval. ( 46 Chapter 2. 
Single-profile Bayesian alignment model 6 6 x 10 10 0 350 6 x 10 400 450 10 0 350 6 x 10 400 450 10 5 0 350 x 10 400 (a) m/z 433.63 7 7 x 10 5 400 x 10 2 1 0 350 450 x 10 2 1 0 300 450 x 10 4 2 0 350 5 0 300 350 0 400 300 (c) m/z 524.48 6 x 10 15 10 5 0 350 350 450 x 10 15 10 5 0 350 400 400 6 350 7 6 x 10 5 0 350 400 7 5 0 450 350 (g) m/z 647.50 400 400 x 10 4 2 0 450 350 (i) m/z 674.93 7 350 400 450 400 350 7 10 0 400 350 400 450 350 400 400 450 400 450 450 500 7 x 10 15 10 5 0 300 5 0 350 400 x 10 10 5 0 350 400 10 0 400 (o) m/z 1348.65 0 450 350 (l) m/z 811.00 7 350 7 400 450 x 10 10 5 0 350 (n) m/z 1297.43 7 7 500 400 x 10 400 7 x 10 x 10 450 2 1 0 400 300 (j) m/z 699.03 7 (m) m/z 1047.12 x 10 350 7 x 10 7 350 450 x 10 5 2 1 0 400 300 (k) m/z 784.50 7 x 10 15 10 5 0 300 400 2 1 0 300 x 10 2 1 0 300 400 7 x 10 4 2 0 450 350 (h) m/z 649.11 x 10 7 x 10 x 10 2 1 0 400 300 (f) m/z 615.60 7 7 x 10 4 2 0 350 x 10 2 1 0 450 350 (d) m/z 535.00 7 (e) m/z 601.03 x 10 450 7 7 6 400 400 (b) m/z 513.50 7 x 10 450 10 5 0 350 x 10 5 450 500 0 400 5 450 0 500 400 (p) m/z 1575.09 Figure 2.9: Original extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and c 2013 IEEE) serum with spiked-in peptides (right). ( 47 Chapter 2. 
Single-profile Bayesian alignment model 6 6 x 10 10 0 350 6 x 10 400 450 10 0 350 6 x 10 400 450 10 5 0 350 x 10 400 (a) m/z 433.63 7 7 x 10 5 400 x 10 2 1 0 350 450 x 10 2 1 0 300 450 x 10 4 2 0 350 5 0 300 350 0 400 300 (c) m/z 524.48 6 x 10 15 10 5 0 350 350 450 x 10 15 10 5 0 350 400 400 6 350 7 6 x 10 5 0 350 400 7 5 0 450 350 (g) m/z 647.50 400 400 x 10 4 2 0 450 350 (i) m/z 674.93 7 350 400 450 400 350 7 10 0 400 350 400 450 350 400 400 450 400 450 450 500 7 x 10 15 10 5 0 300 5 0 350 400 x 10 10 5 0 350 400 10 0 400 (o) m/z 1348.65 0 450 350 (l) m/z 811.00 7 350 7 400 450 x 10 10 5 0 350 (n) m/z 1297.43 7 7 500 400 x 10 400 7 x 10 x 10 450 2 1 0 400 300 (j) m/z 699.03 7 (m) m/z 1047.12 x 10 350 7 x 10 7 350 450 x 10 5 2 1 0 400 300 (k) m/z 784.50 7 x 10 15 10 5 0 300 400 2 1 0 300 x 10 2 1 0 300 400 7 x 10 4 2 0 450 350 (h) m/z 649.11 x 10 7 x 10 x 10 2 1 0 400 300 (f) m/z 615.60 7 7 x 10 4 2 0 350 x 10 2 1 0 450 350 (d) m/z 535.00 7 (e) m/z 601.03 x 10 450 7 7 6 400 400 (b) m/z 513.50 7 x 10 450 10 5 0 350 x 10 5 450 500 0 400 5 450 0 500 400 (p) m/z 1575.09 Figure 2.10: Aligned extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and c 2013 IEEE) serum with spiked-in peptides (right). ( 48 Chapter 2. 
Single-profile Bayesian alignment model 4 4 x 10 6 6 6 1150 1200 1250 3 1 1 1500 2000 2500 0 1100 0 0 3000 500 1000 Retention time (sec) 5 1500 2000 2500 3000 5 9 5 10 x 10 5 x 10 10 8 x 10 8 7 7 5 5 6 0 1350 5 1400 1450 4 Ion count 6 Ion count 1250 (b) Aligned chromatograms in M1 data set x 10 2 2 1 1 1000 1500 2000 2500 3000 3500 Retention time (sec) (c) Original chromatograms in M2 data set 1400 1450 4 3 500 0 1350 5 3 0 0 1200 Retention time (sec) (a) Original chromatograms in M1 data set 9 1150 3 2 1000 2 4 2 500 x 10 4 5 0 1100 0 0 4 6 2 4 x 10 x 10 4 5 Ion count 7 4 Ion count 7 0 0 500 1000 1500 2000 2500 3000 3500 Retention time (sec) (d) Aligned chromatograms in M2 data set Figure 2.11: Chromatograms in the metabolomic data sets, M1 and M2, before and after alignment by BAM. The inset is a zoomed part in the middle retention time range of the c 2013 IEEE) chromatograms. ( 49 Chapter 2. Single-profile Bayesian alignment model Momentum trace 1 1 1 1 −1 −1 −2 −2 −1 0 1 θ −2 −2 2 0 θ p −1 0 p2 2 2 2 0 θ Position trace Momentum trace 2 2 2 Position trace 2 −1 0 1 2 −1 −2 −2 p 1 0 −1 0 1 θ −2 −2 2 −1 (a) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, 1)⊤ 0 1 2 p 1 1 1 (b) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, −1.5)⊤ Figure 2.12: Trajectories for a bivariate normal distribution, simulated using 20 leapfrog steps (ǫ = 0.25) with an initial position at the lower-left side of the distribution. Contours of equal probability ratio to the highest (0.1, 0.2, . . . , 0.9) are depicted. Different values of initial momentum are considered. Hamiltonian 3.1 2.6 3 2.5 2.9 H(θ,p) H(θ,p) Hamiltonian 2.7 2.4 2.8 2.3 2.7 2.2 2.6 2.1 0 5 10 15 Leapfrog step (a) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, 1)⊤ 20 2.5 0 5 10 15 20 Leapfrog step (b) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, −1.5)⊤ Figure 2.13: The Hamiltonians along the trajectories in Figure 2.12, under different initial conditions. 50 Chapter 2. 
Figure 2.14: Base peak chromatograms of the LC-MS data. Alignment is performed using the HMC model. Panels: (a) original, (b) iteration 10 of HMC, (c) iteration 100 of HMC, (d) iteration 1000 of HMC. Each panel shows serum and serum + spiked-in peptides; axes: retention time versus ion count.

Chapter 3 Multi-profile Bayesian alignment model

The generic task of retention time alignment is to estimate a set of mapping functions in $N$ LC-MS runs, $u_i(t)$, $i = 1, \ldots, N$, $t = t_1, \ldots, t_T$, that characterize the mapping relationship between the observed retention times in each LC-MS run and a consensus reference. This chapter extends the single-profile alignment model introduced in Chapter 2 to handle multiple representative chromatograms simultaneously. Moreover, we use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions, which is then integrated into the profile-based alignment model. Figure 3.1 presents the three main components of the multi-profile Bayesian alignment model, which are elaborated in the following sections.

3.1 Gaussian process prior

During the sample preparation of an LC-MS experiment, an internal standard mixture is often spiked into the sample for quality assessment of the experiment. The mixture is composed of compounds whose behavior is well characterized, so that a set of peaks can be easily detected in the LC-MS data and associated with these compounds. It is therefore possible to identify the internal standard peaks and their retention times in each LC-MS run.
With this information, the retention time of each internal standard peak can be adjusted. The adjustment can be extended to other time points by using Gaussian process regression to estimate the mapping function for each run.

Figure 3.1: Three main components of the multi-profile Bayesian alignment model: Gaussian process prior, chromatographic clustering, and profile-based alignment.

For each LC-MS run, we have the mapping relationship $\{\mathbf{s}, \mathbf{r}\}$, where $\mathbf{s} = (s_1, \ldots, s_R)^\top$ is the vector of original retention times for the $R$ internal standard peaks, and $\mathbf{r} = (r_1, \ldots, r_R)^\top$ is the corresponding vector of reference times, each estimated as the average retention time of that standard peak across multiple runs. A Gaussian process prior is defined over a latent mapping function $u_i(t)$ of the observation $\{\mathbf{s}, \mathbf{r}\}$, that is

$$u_i(\mathbf{s}) \mid \mathbf{s} \sim \mathcal{N}(\boldsymbol{\mu}_u, \boldsymbol{\Sigma}_u), \tag{3.1}$$

where the mean function is an identity function, i.e., $\boldsymbol{\mu}_u = \mathbf{s}$, and the $R \times R$ covariance matrix $\boldsymbol{\Sigma}_u$ is defined via the covariance function $\kappa$

$$\mathrm{Cov}\big(u_i(s), u_i(s')\big) = \kappa(s, s') = \sigma_u^2 \exp\left(-\frac{(s - s')^2}{2\sigma_s^2}\right), \tag{3.2}$$

such that

$$\boldsymbol{\Sigma}_u =
\begin{bmatrix}
\mathrm{Cov}\big(u_i(s_1), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_1), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_1), u_i(s_R)\big) \\
\mathrm{Cov}\big(u_i(s_2), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_2), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_2), u_i(s_R)\big) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}\big(u_i(s_R), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_R), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_R), u_i(s_R)\big)
\end{bmatrix}
=
\begin{bmatrix}
\kappa(s_1, s_1) & \kappa(s_1, s_2) & \cdots & \kappa(s_1, s_R) \\
\kappa(s_2, s_1) & \kappa(s_2, s_2) & \cdots & \kappa(s_2, s_R) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(s_R, s_1) & \kappa(s_R, s_2) & \cdots & \kappa(s_R, s_R)
\end{bmatrix}. \tag{3.3}$$

The covariance function reflects greater dependence between neighboring time points than between distant points, and the parameters $\sigma_s^2$ and $\sigma_u^2$ define how closely and how significantly neighboring time points affect each other, respectively. The likelihood function is defined as

$$\mathbf{r} \mid u_i(\mathbf{s}) \sim \mathcal{N}\big(u_i(\mathbf{s}), \sigma_n^2 \mathbf{I}\big). \tag{3.4}$$

Based on the defined likelihood function and the Gaussian process, the joint distribution of $u_i(t)$ and $\mathbf{r}$ is a multivariate normal distribution:

$$\begin{bmatrix} u_i(t) \\ \mathbf{r} \end{bmatrix} \,\Big|\, t, \mathbf{s} \sim \mathcal{N}\left( \begin{bmatrix} t \\ \mathbf{s} \end{bmatrix}, \begin{bmatrix} \kappa(t, t) & \kappa(t, \mathbf{s}^\top) \\ \kappa(\mathbf{s}, t) & \boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I} \end{bmatrix} \right), \tag{3.5}$$

where $\kappa(t, \mathbf{s}^\top) = \big(\kappa(t, s_1), \kappa(t, s_2), \ldots, \kappa(t, s_R)\big)$ and $\kappa(\mathbf{s}, t) = \big(\kappa(s_1, t), \kappa(s_2, t), \ldots, \kappa(s_R, t)\big)^\top$. Given $\{\mathbf{s}, \mathbf{r}\}$, the predictive distribution of the mapping function $u_i(t)$ at time $t$ can be inferred based on the conditional distribution of $u_i(t)$

$$u_i(t) \mid \mathbf{s}, \mathbf{r}, t \sim \mathcal{N}\big(\mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big), \tag{3.6}$$

where the mean and variance are

$$\mathrm{E}\big[u_i(t) \mid t, \mathbf{r}, \mathbf{s}\big] = t + \kappa(t, \mathbf{s}^\top)\big(\boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I}\big)^{-1}(\mathbf{r} - \mathbf{s}), \tag{3.7}$$

$$\mathrm{Var}\big[u_i(t) \mid t, \mathbf{r}, \mathbf{s}\big] = \kappa(t, t) - \kappa(t, \mathbf{s}^\top)\big(\boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I}\big)^{-1}\kappa(\mathbf{s}, t). \tag{3.8}$$

This provides an effective way to infer the mapping functions. However, the estimation depends on the number of standard peaks that can be reliably used and on how well these peaks cover the retention time range. We utilize more comprehensive chromatographic information in our profile-based approach, in which the Gaussian process is incorporated by using the predictive distribution of the mapping function as the prior for subsequent estimation.

3.2 Multi-profile alignment

For complex biological samples, collapsing the three-dimensional data into a two-dimensional chromatogram may blur originally distinct patterns. In such cases, the lack of a consistent pattern can hinder the estimation of mapping functions. To retain better chromatographic profiles, we propose to identify multiple representative chromatograms, and perform the alignment by considering these chromatograms simultaneously.
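Returning briefly to the Gaussian process prior of Section 3.1, the predictive mean and variance in Equations 3.7 and 3.8 can be sketched numerically as follows. This is a minimal illustration rather than the implementation used in this work: the function names `sq_exp_kernel` and `gp_predict` are hypothetical, and the hyperparameters are assumed to be supplied by the caller.

```python
import numpy as np


def sq_exp_kernel(a, b, sigma_u, sigma_s):
    """Squared-exponential covariance of Equation 3.2 between two time vectors."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return sigma_u**2 * np.exp(-((a - b) ** 2) / (2.0 * sigma_s**2))


def gp_predict(t, s, r, sigma_u, sigma_s, sigma_n):
    """Predictive mean and variance of u_i(t) (Equations 3.7 and 3.8).

    t: query retention times; s: observed retention times of the internal
    standard peaks; r: their assigned reference times.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    k_ts = sq_exp_kernel(t, s, sigma_u, sigma_s)      # kappa(t, s^T), T x R
    k_ss = sq_exp_kernel(s, s, sigma_u, sigma_s)      # Sigma_u, R x R
    a = k_ss + sigma_n**2 * np.eye(len(s))            # Sigma_u + sigma_n^2 I
    # Identity prior mean: the correction is applied to t itself.
    mean = t + k_ts @ np.linalg.solve(a, r - s)
    # Pointwise variance: kappa(t, t) - kappa(t, s^T) A^{-1} kappa(s, t)
    var = sigma_u**2 - np.sum(k_ts * np.linalg.solve(a, k_ts.T).T, axis=1)
    return mean, var
```

Far from any internal standard peak the kernel terms vanish, so the predictive mean falls back to the identity mapping and the variance returns to $\sigma_u^2$, which is why the coverage of the retention time range by the standards matters.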
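The generative model can be extended to handle multiple chromatograms by introducing a prototype function for each representative chromatogram. That is,

$$y_i^{(g)}(t) = c_i + a_i \cdot m_g\big(u_i(t)\big) + \varepsilon_i^{(g)}(t), \tag{3.9}$$

where the sample index is $i = 1, \ldots, N$ and the chromatogram index is $g = 1, \ldots, G$. The prototype function associated with the $g$-th representative chromatogram is modeled with B-spline regression:

$$\mathbf{m}_g = \mathbf{B}_m \boldsymbol{\psi}_g, \tag{3.10}$$

and the likelihood function is the product of the $G$ likelihood functions,

$$p(\mathbf{y} \mid \boldsymbol{\theta}) = \prod_{g=1}^{G} \prod_{i=1}^{N} \mathcal{N}\big(\mathbf{y}_i^{(g)} \mid \hat{\mathbf{y}}_i^{(g)}, \sigma_\varepsilon^2 \mathbf{I}\big), \tag{3.11}$$

where $\hat{\mathbf{y}}_i^{(g)}$ is given by $c_i \cdot \mathbf{1} + a_i \cdot \mathbf{m}_g(\mathbf{u}_i)$. Figure 3.2 presents the directed acyclic graph of the multi-profile alignment model, where the model parameters are represented by open circles, the hyperparameters by solid dots, and the observations by filled circles.

Figure 3.2: Directed acyclic graph of the multi-profile alignment model.

Table 3.1: Summary of full conditionals of model parameters in the multi-profile Bayesian alignment model.

| Parameter | Distribution | Updated hyperparameters |
|---|---|---|
| $a_0$ | $\mathcal{N}(\hat{a}_0, \hat{\sigma}_{a_0}^2)$ | $\hat{\sigma}_{a_0}^2 = \big(1/\sigma_{a_0}^2 + N/\sigma_a^2\big)^{-1}$; $\hat{a}_0 = \hat{\sigma}_{a_0}^2 \big(\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2\big)$ |
| $c_0$ | $\mathcal{N}(\hat{c}_0, \hat{\sigma}_{c_0}^2)$ | $\hat{\sigma}_{c_0}^2 = \big(1/\sigma_{c_0}^2 + N/\sigma_c^2\big)^{-1}$; $\hat{c}_0 = \hat{\sigma}_{c_0}^2 \big(\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2\big)$ |
| $(a_i, c_i)$ | $\mathcal{N}(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$ | $\hat{\boldsymbol{\Sigma}}_i = \big(\boldsymbol{\Sigma}_{ac}^{-1} + \sum_{g=1}^{G} \mathbf{W}_g^\top \mathbf{W}_g/\sigma_\varepsilon^2\big)^{-1}$; $\hat{\boldsymbol{\mu}}_i = \hat{\boldsymbol{\Sigma}}_i \big(\boldsymbol{\Sigma}_{ac}^{-1} (a_0, c_0)^\top + \sum_{g=1}^{G} \mathbf{W}_g^\top \mathbf{y}_i^{(g)}/\sigma_\varepsilon^2\big)$ |
| $1/\sigma_a^2$ | $\mathcal{G}(\hat{\alpha}_a, \hat{\beta}_a)$ | $\hat{\alpha}_a = \alpha_a + N/2$; $\hat{\beta}_a = \beta_a + \tfrac{1}{2}\sum_{i=1}^{N} (a_i - a_0)^2$ |
| $1/\sigma_c^2$ | $\mathcal{G}(\hat{\alpha}_c, \hat{\beta}_c)$ | $\hat{\alpha}_c = \alpha_c + N/2$; $\hat{\beta}_c = \beta_c + \tfrac{1}{2}\sum_{i=1}^{N} (c_i - c_0)^2$ |
| $1/\sigma_\varepsilon^2$ | $\mathcal{G}(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$ | $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + GNT/2$; $\hat{\beta}_\varepsilon = \beta_\varepsilon + \tfrac{1}{2}\sum_{g=1}^{G}\sum_{i=1}^{N} \lVert \mathbf{y}_i^{(g)} - \hat{\mathbf{y}}_i^{(g)} \rVert^2$ |
| $1/\sigma_\psi^2$ | $\mathcal{G}(\hat{\alpha}_\psi, \hat{\beta}_\psi)$ | $\hat{\alpha}_\psi = \alpha_\psi + GL/2$; $\hat{\beta}_\psi = \beta_\psi + \tfrac{1}{2}\sum_{g=1}^{G} \boldsymbol{\psi}_g^\top \boldsymbol{\Omega} \boldsymbol{\psi}_g$ |
| $\boldsymbol{\psi}$ | $\mathcal{N}(\hat{\boldsymbol{\mu}}_\psi, \hat{\boldsymbol{\Sigma}}_\psi)$ | $\hat{\boldsymbol{\Sigma}}_\psi = \big(\boldsymbol{\Omega}/\sigma_\psi^2 + \mathbf{X}^\top \mathbf{X}/\sigma_\varepsilon^2\big)^{-1}$; $\hat{\boldsymbol{\mu}}_\psi = \hat{\boldsymbol{\Sigma}}_\psi \mathbf{X}^\top (\mathbf{y} - \mathbf{C})/\sigma_\varepsilon^2$ |

Algorithm 5 outlines one iteration of the MCMC procedure in the multi-profile alignment. For parameters $\boldsymbol{\theta}$ whose full conditionals have closed forms, as summarized in Table 3.1, we use Gibbs sampling to update their values. The remaining parameters, i.e., the mapping function coefficients $\boldsymbol{\phi}_i$, are updated using the block Metropolis-Hastings algorithm described in Section 2.3.2. That is, the $\phi_{i,j}$'s are first grouped into several non-overlapping blocks, each consisting of successive coefficients along the retention time, and proposals are made to update each block. We introduce binary indicator variables $b_j \in \{0, 1\}$, $j = 1, \ldots, K$, to identify the block boundaries, where $b_j = 1$ if $\tau_j$ is at the boundary of a block and $b_0 = b_{K+1} = 1$. Each indicator variable follows a Bernoulli distribution with $p(b_j = 1) = r_{\text{block}}$. Based on the boundary configuration, coefficients within the same block $\boldsymbol{\phi}_{i,j:j+B_j-1} = (\phi_{i,j}, \phi_{i,j+1}, \ldots, \phi_{i,j+B_j-1})$ are proposed to be moved in the same direction, where $b_j = b_{j+B_j} = 1$ and $b_{j+1} = \cdots = b_{j+B_j-1} = 0$. We consider a mixture of transitions where $r_{\text{block}}$ is randomly selected from $\{1, 1/2, 1/4\}$ at each iteration. The configuration of blocks is therefore variable within a Markov chain. The acceptance probability, $r_A$, for updating $\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}$ to the proposal $\boldsymbol{\phi}'_{i,j:j+B_j-1}$ is determined by the product of the prior ratio $r_P$, the likelihood ratio $r_L$, and the transition ratio $r_T$.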
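The construction of blocks from the boundary indicators can be sketched as follows. This is an illustrative sketch only: the function name `sample_blocks` is hypothetical, and the convention that a block runs from one boundary up to, but not including, the next boundary is an assumption made for the example.

```python
import random


def sample_blocks(num_knots, r_block, rng=random):
    """Partition coefficient indices 0..num_knots into contiguous blocks.

    Boundary indicators b_1..b_K are drawn as Bernoulli(r_block); b_0 and
    b_{K+1} are fixed to 1. Each block starts at a boundary and runs up to,
    but not including, the next boundary (an assumption of this sketch).
    """
    b = [1] + [1 if rng.random() < r_block else 0 for _ in range(num_knots)] + [1]
    boundaries = [j for j, flag in enumerate(b) if flag == 1]
    return [list(range(lo, hi)) for lo, hi in zip(boundaries[:-1], boundaries[1:])]


rng = random.Random(0)
r_block = rng.choice([1.0, 0.5, 0.25])   # mixture of transitions
blocks = sample_blocks(num_knots=23, r_block=r_block, rng=rng)
```

With `r_block = 1.0` every coefficient forms its own block, whereas smaller values yield longer blocks on average, so that neighboring coefficients are proposed to move together.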
For the block Metropolis-Hastings algorithm, the transition ratio $r_T$ for the proposal density is one, while the likelihood ratio $r_L$ and prior ratio $r_P$ are given by

$$r_L = \prod_{g=1}^{G} \frac{p\big(\mathbf{y}_i^{(g)} \mid \boldsymbol{\phi}'_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}, \boldsymbol{\theta}^{(m+1)}\big)}{p\big(\mathbf{y}_i^{(g)} \mid \boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}, \boldsymbol{\theta}^{(m+1)}\big)}, \tag{3.12}$$

and

$$r_P = \prod_{t \in \{\tau_{j-1} : \tau_{j+B_j}\}} \frac{\mathcal{N}\big(u'_i(t) \mid \mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big)}{\mathcal{N}\big(u_i^{(m)}(t) \mid \mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big)}, \tag{3.13}$$

where $\boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}$ denotes the set of coefficients $\boldsymbol{\phi}_i$ at iteration $m + 1$ with $\boldsymbol{\phi}_{i,j:j+B_j-1}$ excluded, and $u'_i(t)$ and $u_i^{(m)}(t)$ are determined based on $\{\boldsymbol{\phi}'_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}\}$ and $\{\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}\}$, respectively. $\mathrm{E}[u_i(t)]$ and $\mathrm{Var}[u_i(t)]$ are derived via Gaussian process regression as described in Section 3.1.

Algorithm 5: MCMC update of $\{\boldsymbol{\theta}^{(m)}, \boldsymbol{\phi}^{(m)}\}$ in the multi-profile alignment model

Update $\boldsymbol{\theta}^{(m+1)} \leftarrow \boldsymbol{\theta}^{(m)}$ using Gibbs sampling

$r_{\text{block}} \sim \mathcal{U}\{1, \tfrac{1}{2}, \tfrac{1}{4}\}$

$\delta \sim \tfrac{1}{2} \cdot \mathcal{U}(0, \delta_{\text{small}}) + \tfrac{1}{2} \cdot \mathcal{U}(0, \delta_{\text{large}})$

$b_j \sim \text{Bernoulli}(r_{\text{block}})$, for $j = 1, \ldots, K$

for all blocks $\boldsymbol{\phi}_{i,j:j+B_j-1}$ do

    $\boldsymbol{\phi}'_{i,j:j+B_j-1} \sim \mathcal{U}\big(\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1} - \delta, \boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1} + \delta\big)$

    Compute the likelihood ratio $r_L$ using Equation 3.12

    Compute the prior ratio $r_P$ using Equation 3.13

    Compute the acceptance probability $r_A = \min(1, r_L \times r_P)$

    Set $\boldsymbol{\phi}^{(m+1)}_{i,j:j+B_j-1} = \boldsymbol{\phi}'_{i,j:j+B_j-1}$ with probability $r_A$

end for

3.3 Chromatographic clustering

A critical issue in the multi-profile modeling is the identification of representative chromatograms from the LC-MS runs, where a trade-off between computational efficiency (fewer chromatograms) and information retention (more chromatograms) needs to be considered. The use of multiple chromatograms has been considered in a few studies, by either binning the LC-MS data [75] or using all the extracted ion chromatograms with acceptable quality [17].
However, a suitable procedure to utilize multiple representative chromatograms while retaining computational feasibility is currently not available. Naïvely binning along the m/z dimension is not desirable since chromatograms with similar m/z values do not necessarily resemble each other, as shown in Figure 3.3, and this would inevitably blur the chromatographic profiles.

To address this gap, we propose a clustering approach to identify multiple representative chromatograms from each LC-MS run. The chromatograms are simultaneously considered in the profile-based alignment to facilitate the estimation of the prototype and mapping functions. With an initial set $\mathcal{B}$ of binned chromatograms $x_i^{(b)}$ at a resolution of 0.5 Da/bin, $b \in \mathcal{B}$, we propose a clustering procedure consisting of screening of unqualified chromatograms, identification of exemplars, and agglomerative clustering as follows.

Figure 3.3: Binned chromatograms in a portion of one LC-MS run. Similar m/z values do not imply similar chromatographic profiles. Axes: retention time (min), m/z, and TIC.

Screening of unqualified chromatograms. The quality of each binned chromatogram is assessed by the mass chromatogram quality ($\mathrm{MCQ}_b$) and the normalized cross-correlation across LC-MS runs ($\mathrm{XC}_b$). The value of $\mathrm{MCQ}_b$ is computed using the component detection algorithm (CODA) by [139] to identify binned chromatograms contaminated by baseline or spike noise in any of the LC-MS runs,

$$\mathrm{MCQ}_b = \min\left\{ \mathrm{CODA}\big(x_i^{(b)}(t)\big) \,\middle|\, i = 1, \ldots, N \right\}, \tag{3.14}$$

and the averaged cross-correlation gauges the consistency of the chromatographic pattern across the runs,

$$\mathrm{XC}_b = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{i' > i} \max\left\{ \big(x_i^{(b)} \star x_{i'}^{(b)}\big)(t) \right\}, \tag{3.15}$$

where the normalized cross-correlation is defined as

$$\big(x_i^{(b)} \star x_{i'}^{(b)}\big)(t) = \frac{1}{\sqrt{\lVert x_i^{(b)} \rVert^2 \cdot \lVert x_{i'}^{(b)} \rVert^2}} \int_{-\infty}^{\infty} x_i^{(b)}(\tau)\, x_{i'}^{(b)}(\tau + t)\, d\tau. \tag{3.16}$$

The chromatograms are screened based on their quality.
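The averaged cross-correlation in Equations 3.15 and 3.16 can be sketched for discretely sampled chromatograms as follows. This is an illustrative approximation, not the implementation used in this work: the integral is replaced by a sum over scans, all lags are searched, and the function names `normalized_xcorr_max` and `averaged_xcorr` are hypothetical.

```python
import numpy as np


def normalized_xcorr_max(x, y):
    """Maximum over lags of the normalized cross-correlation (Equation 3.16)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm = np.sqrt(np.sum(x**2) * np.sum(y**2))
    if norm == 0.0:
        return 0.0  # an all-zero chromatogram carries no pattern
    return float(np.max(np.correlate(x, y, mode="full")) / norm)


def averaged_xcorr(chromatograms):
    """XC_b of Equation 3.15: mean of the pairwise maxima over all run pairs."""
    n = len(chromatograms)
    total = 0.0
    for i in range(n):
        for k in range(i + 1, n):
            total += normalized_xcorr_max(chromatograms[i], chromatograms[k])
    return 2.0 * total / (n * (n - 1))
```

Because the maximum is taken over all lags, two runs whose profiles differ only by a retention time shift still score close to one, which is the property that makes this statistic suitable for judging pattern consistency before alignment.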
Only chromatograms satisfying the specified criterion, e.g., $\mathrm{MCQ}_b \geq 0.9$ and $\mathrm{XC}_b \geq 0.85$, are retained for further processing. That is, $\mathcal{B}_s = \mathcal{B} \setminus \mathcal{B}_d$, where $\mathcal{B}_d = \{b \mid \mathrm{MCQ}_b < 0.9\} \cup \{b \mid \mathrm{XC}_b < 0.85\}$.

Identification of exemplars. We apply the affinity propagation algorithm [36] to identify exemplars that best represent the whole chromatographic profiles, $\mathcal{B}_e \leftarrow \mathrm{AP}(\rho_{b,b'}, \bar{\rho})$, $b, b' \in \mathcal{B}_s$, based on the similarity measure of Pearson correlation coefficient

$$\rho_{b,b'} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Corr}\big(x_i^{(b)}, x_i^{(b')}\big), \tag{3.17}$$

and the average of all the similarity measures, $\bar{\rho}$, is assigned as the exemplar preference in the algorithm. The sum of the correlation coefficients between each chromatogram $b$ and its exemplar $\pi(b)$,

$$\sum_{b \in \mathcal{B}_s} \rho_{b,\pi(b)},$$

is maximized, where $\pi(b) \in \mathcal{B}_s$ and $\mathcal{B}_e = \{\pi(b) \mid b \in \mathcal{B}_s\}$. To ensure a valid configuration, if $\pi(b) = b'$, then $\pi(b')$ must be $b'$.

Agglomerative clustering. Based on the set of identified exemplars, we cluster the exemplars using hierarchical agglomerative clustering, which is a bottom-up approach. Initially each exemplar forms a singleton cluster, and the two closest clusters are iteratively merged. At each level, the clustered chromatogram $y_i^{(g)}(t)$ is summarized by

$$y_i^{(g)}(t) = \sum_{b \in \mathcal{B}_g} x_i^{(b)}(t), \tag{3.18}$$

where $\mathcal{B}_g$ denotes the set of chromatograms in the $g$-th cluster. The distance between two clusters is defined based on the overlapping level between two clustered chromatograms

$$d(\mathcal{B}_g, \mathcal{B}_{g'}) = \sum_{i=1}^{N} \sum_{t=t_1}^{t_T} \min\left\{ y_i^{(g)}(t), y_i^{(g')}(t) \right\}. \tag{3.19}$$

Our goal is to cluster together chromatograms with less overlap, i.e., to agglomerate fairly distinct chromatographic profiles, so that the chromatographic profiles are better retained. The procedure continues until all the exemplars are merged into a single cluster. Once the hierarchy is built, the number of clusters is determined using the L-method [115].
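The overlap-based distance of Equation 3.19 and a single merge step can be sketched as follows. This is a simplified illustration under stated assumptions: each clustered chromatogram is stored as a list of per-run intensity lists, the function names are hypothetical, and the two closest clusters are taken to be the pair with the smallest overlap.

```python
def overlap_distance(y_g, y_h):
    """Equation 3.19: summed pointwise minimum over runs i and time points t."""
    return sum(
        min(a, b)
        for run_g, run_h in zip(y_g, y_h)
        for a, b in zip(run_g, run_h)
    )


def merge(y_g, y_h):
    """Equation 3.18 for a merged cluster: sum the member chromatograms."""
    return [[a + b for a, b in zip(run_g, run_h)] for run_g, run_h in zip(y_g, y_h)]


def merge_closest(clusters):
    """Merge the two clusters with the smallest overlap; return the new list."""
    pairs = [(g, h) for g in range(len(clusters)) for h in range(g + 1, len(clusters))]
    g, h = min(pairs, key=lambda p: overlap_distance(clusters[p[0]], clusters[p[1]]))
    rest = [c for k, c in enumerate(clusters) if k not in (g, h)]
    return rest + [merge(clusters[g], clusters[h])]
```

Calling `merge_closest` repeatedly until one cluster remains produces the hierarchy; recording the overlap at each level gives the curve on which the number of clusters is later chosen.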
On the plot of overlapping level against the number of clusters, the overlapping level decreases incrementally, and the L-method searches for the knee of this curve, where the benefit of adding another cluster starts to diminish. Each candidate knee splits the curve into two segments, each fitted by a straight line, and the sum of squared errors of the two fits is computed. The point minimizing the fitted sum of squared errors is chosen as the number of clusters.

3.4 Analyzed data sets

We applied the multi-profile alignment model to two LC-MS data sets from proteomic and glycomic studies. Both data sets were generated from human serum samples with spiked-in internal standards. Table 3.2 gives a summary of the analyzed data sets. The base peak chromatograms of the LC-MS runs are shown in Figure 3.4.

Table 3.2: Summary of the analyzed data sets.

| | Proteomics | Glycomics |
|---|---|---|
| Sample | Human serum | Human serum |
| Internal standard | Tryptic peptides | Galactose |
| Liquid chromatography | Agilent 1200 | Dionex 3000 |
| Mass spectrometer | LTQ-Orbitrap | LTQ-Orbitrap Velos |
| Number of LC-MS runs | 20 | 23 |
| Number of MS scans | 3792 – 3367 | 7809 – 8643 |
| Time range (min.) | 10 – 115 | 10 – 60 |
| m/z range (Da) | 400 – 2000 | 500 – 2000 |
| Peak detection | DifProWare | In-house tool |
| Ground-truth | Mascot results | Serum glycans |

Figure 3.4: Base peak chromatograms in the two analyzed data sets. Panels: (a) proteomic data set, (b) glycomic data set. Axes: retention time (min) versus ion count.

Proteomic data set. The proteomic experiment was designed for evaluating the MARS Hu-14 column (Agilent Technologies) for depletion of high-abundance proteins in human serum. The tryptic peptides are a mixture of the following five non-human proteins (Bruker-Michrom): Alcohol dehydrogenase (yeast), Carbonic anhydrase (bovine), Cytochrome c (equine), Enolase (yeast), and Myoglobin (equine).
Serum samples from five healthy individuals were analyzed. LC-MS/MS analysis of the serum samples was performed on an Agilent 1200 nano-LC coupled to an LTQ-Orbitrap mass spectrometer, where data were acquired with double injections from two groups, with two different concentrations of the spiked-in tryptic peptides. LC-MS/MS data of the internal standard mixture were also acquired in duplicate right before the data acquisition of the serum samples. The mass spectrometer was scanned approximately every second using a 60,000 resolution setting. For each scan, up to five ions were automatically selected based on their intensities for the MS/MS analysis in the LTQ. We used the DifProWare platform (available at mciproteomics.usouthal.edu/difproware/) to perform LC-MS data preprocessing including deisotoping of mass spectra, peak detection, and charge state deconvolution. Each LC-MS run was preprocessed separately. Peak detection was performed on the basis of LC-MS data without using the MS/MS spectra.

Glycomic data set. The glycomic data set is from an untargeted LC-MS study aimed at identifying glycomic disease biomarkers. We analyzed human serum samples representing two distinct biological groups (cases and controls). The data set was generated from the serum samples of 11 cases and 12 controls. Sample preparation consists of release, purification, reduction, and permethylation of N-linked glycans. Following the sample preparation, LC-MS data were acquired using a Dionex 3000 Ultimate nano-LC system interfaced to an LTQ-Orbitrap Velos mass spectrometer in positive mode. An internal standard mixture of galactose was added to the samples prior to the LC-MS data acquisition. We performed LC-MS data preprocessing including deisotoping using DeconTools [60] and peak detection through deconvolution of chromatographic profiles.
Each LC-MS run was preprocessed separately, and the sample labeling information was not used in any analysis conducted here.

Preprocessed peak lists. Figure 3.5 presents the histogram of the logarithm of peak intensities, where there are 61,637 and 2,933 consensus peaks in the proteomic and glycomic data sets, respectively. The scatter plots of the detected peaks are shown in Figure 3.6. For visualization purposes, only the common peaks (present in more than 50% of the LC-MS runs) are depicted. Multiply charged ions were identified and the corresponding masses were recorded. Please note that the mass ranges are different in the two analyzed data sets.

Figure 3.5: Histograms of the logarithm of peak intensities in the two analyzed data sets. Panels: (a) proteomic data set (total number of peaks: 61,637), (b) glycomic data set (total number of peaks: 2,933). Axes: log intensity versus number of peaks.

We evaluate the alignment results based on the consensus list of the ground-truth data. Specifically, we compare the retention time (RT) difference across LC-MS runs, the coefficient of variation (CV) of extracted ion chromatograms, and the peak matching performance. We use the simultaneous multiple alignment (SIMA) model [136], in which a feature-based alignment module is embedded, to perform the peak matching process. SIMA has shown outstanding performance in the four benchmark data sets by [71]. The RT difference measures the difference between the largest and smallest retention times for a consensus peak. The CV evaluates the variability across chromatograms. The peak matching performance is evaluated through precision and recall of the peak matching results against the ground-truth data. Precision and recall are defined by [71] and provided as follows.
With a slight abuse of notation, we denote the consensus peaks in the ground-truth by $gt_i$, $i = 1, \ldots, N$, and the consensus peaks from the peak matching result by $pm_j$, $j = 1, \ldots, M$, where singleton peaks are discarded. For each consensus peak in the ground-truth $gt_i$, an associated set $M_i$ is defined as

$$M_i = \big\{ j \,\big|\, |gt_i \cap pm_j| > 0 \big\}, \tag{3.20}$$

where $|M_i|$ is the number of unique elements in the set $M_i$, representing the number of consensus peaks (in the peak matching result) that are split from a single peak in the ground-truth. The set of peaks relevant to the ground-truth $gt_i$ can therefore be given by

$$\widetilde{pm}_i = \bigcup_{j \in M_i} pm_j, \tag{3.21}$$

and the matching precision and recall are defined as

$$\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap \widetilde{pm}_i|}{|\widetilde{pm}_i|}, \tag{3.22}$$

and

$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap \widetilde{pm}_i|}{|M_i| \cdot |gt_i|}, \tag{3.23}$$

respectively.

Figure 3.6: Scatter plots of the detected peaks in the two analyzed data sets. The intensity is log-transformed and color-coded. Panels: (a) proteomic data set, (b) glycomic data set. Axes: retention time (min) versus mass.

3.5 Analysis of LC-MS proteomic data set

The LC-MS proteomic data set consists of 20 LC-MS runs from serum samples with a spiked-in internal standard mixture. To identify peaks corresponding to the spiked-in internal standard, MS/MS spectra of the internal standard mixture were searched with Mascot. Precursor peaks of the identified peptide sequences were assigned based on their masses and retention times, and the resulting list consists of 22 internal standard peaks.

Table 3.3: Peptide sequences of the internal standard.

| Peak | Protein | Peptide sequence | Mass |
|---|---|---|---|
| IS1-1 | Alcohol dehydrogenase | ANELLINVK | 1012.5925 |
| IS1-2 | Alcohol dehydrogenase | ATDGGAHGVINVSVSEAAIEASTR | 2311.146 |
| IS1-3 | Alcohol dehydrogenase | LPLVGGHEGAGVVVGMGENVK | 2018.0659 |
| IS1-4 | Alcohol dehydrogenase | SISIVGSYVGNR | 1250.6645 |
| IS1-5 | Alcohol dehydrogenase | VVGLSTLPEIYEK | 1446.8006 |
| IS2-1 | Carbonic anhydrase | AVVQDPALKPLALVYGEATSR | 2197.2161 |
| IS2-2 | Carbonic anhydrase | VLDALDSIK | 972.5497 |
| IS3-1 | Cytochrome c | EETLMEYLENPK | 1494.6948 |
| IS3-2 | Cytochrome c | GITWKEETLMEYLENPK | 2080.0233 |
| IS3-3 | Cytochrome c | TGQAPGFTYTDANK | 1469.682 |
| IS4-1 | Enolase | AVDDFLISLDGTANK | 1577.7972 |
| IS4-2 | Enolase | DGKYDLDFKNPNSDK | 1754.8143 |
| IS4-3 | Enolase | GNPTVEVELTTEK | 1415.7175 |
| IS4-4 | Enolase | IEEELGDNAVFAGENFHHGDKL | 2440.1345 |
| IS4-5 | Enolase | SGETEDTFIADLVVGLR | 1820.9233 |
| IS4-6 | Enolase | TAGIQIVADDLTVTNPK | 1754.9463 |
| IS4-7 | Enolase | VNQIGTLSESIK | 1287.7054 |
| IS4-8 | Enolase | YDLDFKNPNSDK | 1454.6706 |
| IS4-9 | Enolase | YGASAGNVGDEGGVAPNIQTAEEALDLIVDAIK | 3256.6206 |
| IS5-1 | Myoglobin | GLSDGEWQQVLNVWGK | 1814.898 |
| IS5-2 | Myoglobin | HGTVVLTALGGILK | 1377.8373 |
| IS5-3 | Myoglobin | VEADIAGHGQEVLIR | 1605.8496 |

Peptide sequences of the 22 peaks and their retention times are given in Tables 3.3 and 3.4, respectively. NA in Table 3.4 means that an internal standard is not present. The ground-truth data were generated based on the Mascot search result. A list of MS/MS spectra with identification score > 60 and present in at least 10 out of 20 LC-MS runs was compiled. Each peptide sequence was assigned to a peak detected by DifProWare based on its mass and retention time, which resulted in a list of consensus peaks (with the same identity). Putative matching was also performed for the runs without a qualified identification sequence. The list was further refined based on visual inspection of the extracted ion chromatogram of each consensus peak, where erroneous assignments were removed. The resulting ground-truth data consist of 273 unique peptide sequences from 70 unique proteins (Appendix A).
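Given such a ground-truth list, the precision and recall measures of Equations 3.20 to 3.23 can be sketched as follows. This is a minimal illustration under assumptions that are not part of the original evaluation: each consensus peak is represented as a set of hypothetical (run, peak) identifiers, singleton peaks are assumed to have been removed by the caller, and a ground-truth peak with no matching result is counted as contributing zero to both measures.

```python
def matching_precision_recall(ground_truth, matched):
    """Precision and recall of Equations 3.22 and 3.23.

    ground_truth, matched: lists of sets; each set holds the member peaks
    of one consensus peak (gt_i or pm_j).
    """
    precision_sum = 0.0
    recall_sum = 0.0
    for gt in ground_truth:
        # M_i: matched consensus peaks sharing at least one peak with gt_i
        m_i = [pm for pm in matched if gt & pm]
        if not m_i:
            continue  # assumption: unmatched ground truth contributes zero
        relevant = set().union(*m_i)          # Equation 3.21
        hits = len(gt & relevant)
        precision_sum += hits / len(relevant)
        recall_sum += hits / (len(m_i) * len(gt))
    n = len(ground_truth)
    return precision_sum / n, recall_sum / n
```

The factor $|M_i|$ in the recall penalizes a ground-truth peak that is split across several matched consensus peaks, while the precision penalizes matched peaks that pull in members of other ground-truth peaks.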
We compared the following procedures: no alignment performed to adjust the peak lists (raw), alignment performed using a Gaussian process regression as defined in Equation 3.7 (GP), single-profile alignment performed with no information about internal standards (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment performed with a Gaussian process prior (GPMP). For the multi-profile alignment, representative chromatograms are first identified as discussed in Section 3.3. We use the L-method to determine the sufficient number of clusters as demonstrated in Figure 3.7. As shown in the figure, the sum of squared errors is minimized when the number of clusters is chosen as four.

Figure 3.7: Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the proteomic data set. The sufficient number of clusters is four. Horizontal axis in both panels: number of clusters.

Most current LC-MS preprocessing pipelines do not adjust the peak lists detected upfront (raw) and directly apply a feature-based alignment, e.g., SIMA. Single-profile alignment (SP) represents the current profile-based models, including our alignment model introduced in Chapter 2, where a single chromatogram is considered without utilizing any information about internal standards. Table 3.5 summarizes the results of the five approaches. Performance is compared based on the retention time (RT) difference in seconds across runs for consensus peaks, the coefficient of variation (CV) of the extracted ion chromatograms of consensus peaks, precision, and recall. For RT difference and CV, means (standard deviations) are reported based on the 273 consensus peaks.
For precision and recall, means (standard deviations) are reported based on 72 pairs of tolerance parameters of m/z $\in \{0.05, 0.1, 0.25\}$ and RT $\in \{5, 10, \ldots, 120\}$ in SIMA.

As the chromatographic patterns are well captured by the base peak chromatogram in this data set, the single-profile approach (SP) yields a reasonable alignment result (Figure 3.8). Figure 3.9 depicts the trace plots of model precision ($1/\sigma_\varepsilon^2$) estimated based on single-profile alignment with (GPSP) and without (SP) the Gaussian process prior. As shown in the figure, incorporating the Gaussian process prior leads to a shorter burn-in period as well as a higher precision in fitting the observed chromatograms. Based on the estimation by GPSP, Figures 3.10 to 3.14 show the trace plots of differences between the mapping functions $u_i(t)$ and $u_{i+1}(t)$ at the knot points $\tau_j$, $j = 1, \ldots, 23$. From these figures, less uncertainty of the estimation is observed in the middle retention time range of the chromatograms. Also, the difference is not consistent throughout the whole retention time range, which suggests that a linear modeling of the mapping function is not appropriate. As shown in Table 3.5, integration of internal standards (GPSP) and multiple chromatograms (GPMP) can lead to further improvement. Based on the estimation by GPMP, Figure 3.15 presents the posterior difference between the estimated mapping function and the identity function with the 90% credible interval for the 20 LC-MS runs. For the peak matching performance, Figure 3.16 shows the measures of precision and recall of the five considered approaches, based on 72 pairs of tolerance parameters in SIMA, where GPMP yields the best performance, with the least variability across the choices of parameters. This indicates that appropriate retention time alignment by GPMP makes the subsequent peak matching process more robust to the selection of parameters.

Figure 3.8: Base peak chromatograms in the proteomic data set, before and after alignment. The inset is a zoomed part in the middle retention time range of the chromatograms. Panels: (a) original chromatograms, (b) aligned chromatograms. Axes: retention time (min) versus ion count.

Figure 3.9: Trace plots of $1/\sigma_\varepsilon^2$ estimated based on single-profile alignment without using the Gaussian process prior (SP) and with using the Gaussian process prior (GPSP). (b) zooms in on the precision range 21 to 23.5 for the last 2500 iterations in (a).

Table 3.4: Mass and retention time of each of the internal standard peaks in the proteomic data set.
IS1-1 IS1-2 IS1-3 IS1-4 IS1-5 IS2-1 IS2-2 IS3-1 IS3-2 IS3-3 IS4-1 IS4-2 IS4-3 IS4-4 IS4-5 IS4-6 IS4-7 IS4-8 IS4-9 IS5-1 IS5-2 IS5-3 Mass 1012.5925 2311.146 2018.0659 1250.6645 1446.8006 2197.2161 972.5497 1494.6948 2080.0233 1469.682 1577.7972 1754.8143 1415.7175 2440.1345 1820.9233 1754.9463 1287.7054 1454.6706 3256.6206 1814.898 1377.8373 1605.8496 1 NA 40.86 40.86 34.15 46.17 54.39 35.81 52.81 54.76 NA NA 24.99 30.08 39.89 67.68 45.63 NA NA 87.05 58.32 52.9 30.42 2 35.29 40.72 40.63 34.15 45.98 54.15 35.73 52.76 54.72 23.44 52.02 25.03 30.01 39.75 67.53 45.52 32.29 27.28 87 58.23 52.76 30.36 IS1-1 IS1-2 IS1-3 IS1-4 IS1-5 IS2-1 IS2-2 IS3-1 IS3-2 IS3-3 IS4-1 IS4-2 IS4-3 IS4-4 IS4-5 IS4-6 IS4-7 IS4-8 IS4-9 IS5-1 IS5-2 IS5-3 Mass 1012.5925 2311.146 2018.0659 1250.6645 1446.8006 2197.2161 972.5497 1494.6948 2080.0233 1469.682 1577.7972 1754.8143 1415.7175 2440.1345 1820.9233 1754.9463 1287.7054 1454.6706 3256.6206 1814.898 1377.8373 1605.8496 11 36.24 41.71 41.54 34.92 46.77 55.01 36.6 53.57 55.42 24.04 52.74 25.77 30.86 40.56 68.16 46.36 33.15 27.99 87.52 58.99 53.57 31.2 12 36.11 41.47 41.29 34.84 46.72 54.79 36.46 53.34 55.39 24 52.61 25.63 30.77 40.41 68.01 46.29 33.03 27.88 87.61 58.88 53.44 31.11 Retention time (min.) 
in each LC-MS run
3      4      5      6      7      8      9      10
35.22  NA     35.24  NA     36.58  36.7   36.21  36.15
40.68  41.01  40.67  40.64  41.96  42.07  41.66  41.57
40.5   40.92  40.67  NA     41.96  41.98  41.57  41.48
33.99  34.42  34.5   33.97  35.36  35.48  35     34.92
45.91  46.17  45.93  45.86  47.16  47.34  46.86  46.85
54.14  54.42  54.23  54.16  55.33  55.37  54.99  54.98
35.66  36.03  35.6   35.64  36.93  37.14  36.66  36.6
52.66  52.98  52.75  52.7   53.86  53.99  53.62  53.51
54.61  54.85  54.76  54.72  55.8   55.92  55.54  55.43
23.21  NA     23.3   NA     24.54  24.59  24.05  24.04
51.9   NA     51.92  NA     53.03  53.27  52.8   52.69
24.8   25.21  24.91  24.89  26.27  26.33  25.74  25.71
29.93  30.28  30.07  30     31.4   31.45  30.95  30.85
39.63  39.99  39.7   39.59  40.99  41.02  40.61  40.51
67.49  67.6   67.43  67.51  68.58  68.62  68.23  68.18
45.45  45.78  45.47  45.5   46.71  46.8   46.41  46.31
32.22  NA     32.2   32.12  33.6   33.73  33.23  33.05
27.12  NA     27.26  NA     28.6   28.58  27.99  27.99
86.88  87.2   86.95  86.93  88.06  88.04  87.77  87.81
58.18  58.42  58.21  58.27  59.39  59.44  59.1   58.88
52.66  52.98  52.85  52.79  53.86  53.99  53.62  53.51
30.27  30.62  30.42  30.34  31.65  31.79  31.29  31.2

Retention time (min.) in each LC-MS run
13     14     15     16     17     18     19     20
36.05  35.98  NA     36.52  NA     36.13  35.92  NA
41.45  41.46  41.64  41.75  41.64  41.58  41.4   41.47
41.28  41.19  41.55  41.66  NA     41.22  41.32  41.47
34.77  34.67  35.12  35.74  34.95  34.82  34.6   34.87
46.55  46.61  46.92  46.96  46.79  46.68  46.59  46.63
54.79  54.78  55.09  55.14  55.1   54.77  54.85  54.9
36.4   36.33  36.75  36.88  36.64  36.48  36.35  36.46
53.45  53.38  53.52  53.68  53.62  53.45  53.38  53.35
55.32  55.34  55.61  55.57  55.51  55.33  55.31  55.36
23.93  23.92  NA     24.2   NA     24.05  23.85  23.9
52.59  52.56  NA     52.85  NA     52.72  52.57  NA
25.7   25.51  25.81  25.98  25.85  25.64  25.5   25.64
30.73  30.72  31.02  31.13  30.95  30.79  30.59  30.85
40.41  40.32  40.61  40.69  40.6   40.44  40.36  40.35
68.04  68.06  68.4   68.28  68.29  68.23  68.06  68.13
46.2   46.19  46.47  46.6   46.45  46.23  46.15  46.23
33.01  32.92  NA     33.34  NA     32.98  32.86  NA
27.88  27.84  NA     28.3   NA     27.91  27.73  27.97
87.57  87.54  87.92  87.71  87.63  87.62  87.64  87.65
58.86  58.8   59.14  59.19  59.02  58.89  58.7   58.85
53.45  53.38  53.69  53.68  53.71  53.45  53.38  53.44
31.08  30.98  31.45  31.49  31.3   31.13  30.93  31.1

Table 3.5: Performance comparison in the LC-MS proteomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).

        RT              CV              Precision       Recall
Raw     83.08 (20.84)   1.634 (0.256)   0.937 (0.026)   0.638 (0.241)
GP      18.02 (22.74)   1.118 (0.372)   0.983 (0.007)   0.933 (0.089)
SP      19.37 (28.56)   1.150 (0.415)   0.985 (0.004)   0.952 (0.043)
GPSP    11.74 (19.61)   1.060 (0.380)   0.988 (0.004)   0.962 (0.050)
GPMP    10.70 (20.67)   1.052 (0.380)   0.990 (0.003)   0.970 (0.027)
[Figure 3.10: Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_1 to τ_5. Panels group the runs as i = 1–6, i = 7–12, and i = 13–19; x-axis: Iteration (0 to 15000); y-axis: u_i(τ) − u_{i+1}(τ).]

[Figure 3.11: Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_6 to τ_10. Same panel layout and axes as Figure 3.10.]

[Figure 3.12: Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_11 to τ_15. Same panel layout and axes as Figure 3.10.]

[Figure 3.13: Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_16 to τ_20. Same panel layout and axes as Figure 3.10.]

[Figure 3.14: Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_21 to τ_23. Same panel layout and axes as Figure 3.10.]

3.6 Analysis of LC-MS glycomic data set

The glycomic data set is from an untargeted LC-MS study aimed at identifying N-glycan disease biomarkers. Human serum samples representing two distinct biological groups (cases and controls) were analyzed in this study. The data set was generated from serum samples of 11 cases and 12 controls. An internal standard mixture was added to the serum samples prior to the LC-MS data acquisition. We performed LC-MS data preprocessing including deisotoping using DeconTools [60] and peak detection through deconvolution of chromatographic profiles. In this study, the average residue composition was set to C10H18N0.5O5 for the deisotoping step. Each LC-MS run was preprocessed separately, and the labeling information was not used in any analysis conducted in this study.

Peaks of the internal standard in the glycomic data set were identified by comparing measured mass values with theoretical molecular weights of galactose units. Table 3.6 presents the masses of the five internal standard peaks of different galactose units (galactose 3–7) and their retention times in each LC-MS run. The ground-truth data were generated based on a list of human serum glycans characterized by the number of monosaccharides: HexNAc, Hexose, Deoxyhexose and NeuAc. The putative compositions were assigned by comparison of measured mass values with theoretical values, in consideration of hydrogen adducts. The resulting ground-truth data consist of 106 peaks. The complete list can be found in Appendix B.

Table 3.6: Mass and retention time of each of the internal standard peaks in the glycomic data set.
                   Retention time (min.) in each LC-MS run
      Mass         1      2      3      4      5      6      7      8
Gal3  674.3726     27.19  26.67  26.42  26.29  26.28  26.12  26.02  26.03
Gal4  878.4724     30.51  30.49  29.68  29.61  29.5   29.39  29.31  29.29
Gal5  1082.5722    34.05  33.56  33.24  33.05  32.93  32.86  32.8   32.71
Gal6  1286.672     37.54  37.13  36.76  36.6   36.48  36.38  36.33  36.26
Gal7  1490.7718    40.99  40.61  40.24  40.03  39.88  39.81  39.8   39.7

                   Retention time (min.) in each LC-MS run
      Mass         9      10     11     12     13     14     15     16
Gal3  674.3726     26.02  25.86  25.79  25.72  25.56  25.47  25.33  25.04
Gal4  878.4724     29.28  29.09  29.05  28.98  28.76  28.69  28.51  28.29
Gal5  1082.5722    32.74  32.58  32.52  32.4   32.25  32.2   31.97  31.72
Gal6  1286.672     36.29  36.08  36.04  35.92  35.71  35.66  35.45  35.24
Gal7  1490.7718    39.77  39.5   39.44  39.32  39.13  39.09  38.87  38.7

                   Retention time (min.) in each LC-MS run
      Mass         17     18     19     20     21     22     23
Gal3  674.3726     24.96  24.83  24.69  24.59  24.65  24.6   24.65
Gal4  878.4724     28.22  28.07  27.83  27.85  27.85  27.87  27.88
Gal5  1082.5722    31.64  31.45  31.4   31.27  31.28  31.32  31.21
Gal6  1286.672     35.16  35.02  34.97  34.81  34.78  34.8   34.81
Gal7  1490.7718    38.66  38.52  38.44  38.3   38.2   38.28  38.26

We performed the same comparison as in Section 3.5, where five approaches were compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment performed with a Gaussian process prior (GPMP). As in the analysis of the proteomic data set, we use the L-method to determine the sufficient number of clusters, as demonstrated in Figure 3.18. The sufficient number of clusters to capture the chromatographic patterns was found to be four. Table 3.7 presents the performance comparison in the glycomic data set, which turns out to be the most challenging case in our study due to the lack of a consistent pattern in the base peak chromatogram.
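As an aside on the L-method used above to choose the number of clusters, the idea can be illustrated with a short sketch. It follows the L-method as it is commonly described (two straight lines are fitted to the evaluation curve, and the split with the smallest point-weighted fitting error marks the knee); it is not the implementation used in this study. The function names `fit_line_rmse` and `l_method_knee` and the synthetic curve are hypothetical.

```python
from math import sqrt


def fit_line_rmse(xs, ys):
    """Fit a least-squares line to (xs, ys) and return its root-mean-square residual."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / n)


def l_method_knee(xs, ys):
    """Return the x value (number of clusters) at the knee of the evaluation curve.

    xs must be sorted and contain at least four points, so that each of the
    two fitted lines is supported by at least two points.
    """
    n = len(xs)
    best_x, best_cost = None, float("inf")
    for split in range(2, n - 1):  # left part: first `split` points; right part: the rest
        left = fit_line_rmse(xs[:split], ys[:split])
        right = fit_line_rmse(xs[split:], ys[split:])
        # Weight each fitting error by the share of points it covers (an assumed weighting).
        cost = (split * left + (n - split) * right) / n
        if cost < best_cost:
            best_x, best_cost = xs[split - 1], cost
    return best_x


# Synthetic evaluation curve: steep decrease up to four clusters, nearly flat afterwards.
numbers_of_clusters = list(range(1, 11))
errors = [10.0, 7.0, 4.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(l_method_knee(numbers_of_clusters, errors))
```

On this synthetic curve the two-line fit is exact when the first line covers one to four clusters, so the sketch reports four as the knee.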
Figure 3.17 depicts each base peak chromatogram of the 23 LC-MS runs. As shown in the figure, most of the chromatographic profiles are concentrated in the range between 20 and 35 minutes. Unfortunately, there is no consistent chromatographic pattern in the base peak chromatograms. Consequently, single-profile alignment does not yield as much improvement as in the proteomic data set, as shown in Table 3.7. Gaussian process regression (GP, with only five internal standard peaks) performs comparably to both single-profile alignment approaches (SP and GPSP). Utilization of the internal standard and multiple chromatograms (GPMP) is particularly advantageous in this data set. Figure 3.19 depicts the clustered ion chromatograms, before and after alignment by GPMP. As shown in the figure, chromatographic patterns are better retained and the multi-profile alignment can therefore be effectively performed. As a result, GPMP outperforms the other four approaches in terms of all the measurements in Table 3.7. Based on the estimation by GPMP, Figure 3.20 shows the posterior difference between the estimated mapping function and the identity function with the 90% credible interval for the 23 LC-MS runs. Figure 3.21 shows the measures of precision and recall of the five considered approaches, based on 72 pairs of tolerance parameters in SIMA. As in the proteomic data set, GPMP yields the best performance, with the least sensitivity to the choice of parameters.

Table 3.7: Performance comparison in the LC-MS glycomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).

        RT               CV              Precision       Recall
Raw     103.20 (32.63)   1.194 (0.346)   0.943 (0.008)   0.612 (0.215)
GP      42.36 (29.76)    0.931 (0.284)   0.965 (0.005)   0.819 (0.136)
SP      67.45 (42.39)    1.090 (0.399)   0.967 (0.008)   0.773 (0.166)
GPSP    44.65 (31.35)    0.978 (0.366)   0.976 (0.003)   0.829 (0.167)
GPMP    24.85 (27.21)    0.821 (0.292)   0.980 (0.002)   0.907 (0.095)

For the multi-profile alignment, a set of representative chromatograms is first identified as discussed in Section 3.3. We compared cases where chromatograms are derived either by binning along m/z or using the chromatographic clustering (Figure 3.22). Tables 3.8 and 3.9 summarize the peak matching performance using multi-profile alignment with the number of chromatograms varying between two and five. As shown in the tables, the chromatographic clustering procedure outperforms the binning approach. Incorporation of the Gaussian process prior shows significant improvement in the cases of binning and using two clustered chromatograms. This indicates that using the informative prior is beneficial for the profile-based alignment, especially when a consistent chromatographic pattern is unavailable. For the other cases of chromatographic clustering (G = 3, ..., 5), the results with or without integrating the internal standards are similar. This is because the chromatographic patterns have already been well captured. As a result, the use of prior information did not improve the alignment results further. We believe that further improvement can be achieved with the addition of more internal standards that allow better coverage of the retention time.

Table 3.8: Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived by binning along m/z dimension.

Binning without GP prior
# of bins    2               3               4               5
Precision    0.957 (0.007)   0.960 (0.008)   0.964 (0.007)   0.963 (0.006)
Recall       0.766 (0.138)   0.783 (0.122)   0.770 (0.147)   0.780 (0.154)

Binning with GP prior
# of bins    2               3               4               5
Precision    0.976 (0.003)   0.977 (0.003)   0.978 (0.004)   0.974 (0.006)
Recall       0.848 (0.151)   0.856 (0.141)   0.830 (0.160)   0.837 (0.156)

Table 3.9: Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived using the chromatographic clustering procedure.

Clustering without GP prior
# of clusters    2               3               4               5
Precision        0.964 (0.007)   0.980 (0.004)   0.979 (0.004)   0.980 (0.002)
Recall           0.796 (0.138)   0.906 (0.094)   0.910 (0.099)   0.913 (0.092)

Clustering with GP prior
# of clusters    2               3               4               5
Precision        0.980 (0.003)   0.979 (0.003)   0.980 (0.002)   0.980 (0.003)
Recall           0.904 (0.098)   0.904 (0.099)   0.907 (0.095)   0.908 (0.096)

3.7 Summary

This chapter extends the single-profile alignment model to handle multiple chromatograms simultaneously, and incorporates the Gaussian process prior for the mapping function. The extended model uses multi-profile modeling with representative chromatograms identified by a clustering approach. Through comprehensive evaluation on LC-MS data sets from proteomic and glycomic studies, the multi-profile Bayesian alignment model has shown significant benefits for the subsequent peak matching step, which is crucial when comparing thousands of features across multiple LC-MS runs. Although our discussion focuses on using internal standards to derive the Gaussian process prior, it is also possible to specify the Gaussian process prior for the mapping relationship based on the identification of MS/MS spectra or targeted compounds.
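The posterior summaries plotted in Figures 3.15 and 3.20 (a posterior median with a 90% credible interval for the difference between the mapping function and the identity function) can be illustrated with a minimal sketch. It assumes that retained MCMC draws of the mapping function are available on a common time grid, that the difference is taken as u(t) − t, and that the interval is equal-tailed; none of these details is confirmed by the text, and the names `percentile` and `summarize_shift` are hypothetical.

```python
def percentile(sorted_vals, q):
    """Percentile q (0 to 100) of an already sorted list, with linear interpolation."""
    if not sorted_vals:
        raise ValueError("no samples")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_shift(samples, times, level=0.90):
    """Summarize u(t) - t at each time point.

    samples[k][j] is the k-th retained MCMC draw of the mapping function
    evaluated at times[j]. Returns one (median, lower, upper) tuple per time
    point, where lower and upper bound an equal-tailed credible interval
    (an assumed interval type).
    """
    tail = (1.0 - level) / 2.0 * 100.0
    summary = []
    for j, t in enumerate(times):
        diffs = sorted(draw[j] - t for draw in samples)
        summary.append((percentile(diffs, 50.0),
                        percentile(diffs, tail),
                        percentile(diffs, 100.0 - tail)))
    return summary


# Toy draws: each draw shifts the identity function by a constant 0, 1, ..., 100 seconds.
times = [1000.0, 2000.0]
samples = [[t + shift for t in times] for shift in range(101)]
for median, lower, upper in summarize_shift(samples, times):
    print(round(median, 3), round(lower, 3), round(upper, 3))
```

In practice the draws would come from the sampler after burn-in; the toy draws here only exercise the arithmetic.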
[Figure 3.15: Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 20 LC-MS runs in the proteomic data set. The filled region corresponds to the 90% credible interval. One panel per run (Run 1 to Run 20); x-axis: Time (sec); y-axis: Difference.]

[Figure 3.16: Measures of precision and recall in the proteomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are raw (∗), GP, SP, GPSP (△), and GPMP (♦). x-axis: Recall; y-axis: Precision.]

[Figure 3.17: Base peak chromatograms of the 23 LC-MS runs in the glycomic data set. One panel per run; x-axis: Time (min); y-axis: Ion count.]

[Figure 3.18: Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the glycomic data set. The sufficient number of clusters is four. x-axis: Number of clusters.]

[Figure 3.19: Clustered ion chromatograms in the glycomic data set. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. The inset is a zoomed part in the middle retention time range of the chromatograms. x-axis: Retention time (min); y-axis: Intensity (A.U.).]

[Figure 3.20: Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 23 LC-MS runs in the glycomic data set. The filled region corresponds to the 90% credible interval. One panel per run (Run 1 to Run 23); x-axis: Time (sec); y-axis: Difference.]

[Figure 3.21: Measures of precision and recall in the glycomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are raw (∗), GP, SP, GPSP (△), and GPMP (♦). x-axis: Recall; y-axis: Precision.]

[Figure 3.22: Measures of precision and recall in the glycomic data set, where multi-profile alignment is considered. The number of chromatograms is: (a) two, (b) three, (c) four, and (d) five. Five cases are compared with peak lists input to SIMA: raw (∗), adjusted by multi-profile alignment using binning (B), adjusted by multi-profile alignment using chromatographic clustering (C), adjusted by multi-profile alignment using binning with Gaussian process prior (GPB, △), and adjusted by multi-profile alignment using chromatographic clustering with Gaussian process prior (GPC, ♦). x-axis: Recall; y-axis: Precision.]

Chapter 4

Application of Bayesian alignment model to biomarker discovery

LC-MS has been widely used for profiling a variety of biomolecules, characterizing their expression levels, and associating relevant patterns with distinct biological conditions of interest. In this chapter, we integrate the proposed Bayesian alignment model into a preprocessing pipeline for LC-MS data analysis. We demonstrate the applicability of the alignment model in practical biomedical problems by applying the preprocessing pipeline in a large-scale glycomic study for cancer biomarker discovery.
The LC-MS-based glycomic study consists of two complementary analyses: 1) global profiling using LC-MS, and 2) targeted quantification by multiple reaction monitoring (MRM). Through our developed pipeline, we identify candidate biomarkers from the global profiling analysis and confirm the results with those obtained by targeted quantification.

4.1 Background

Hepatocellular carcinoma. Hepatocellular carcinoma (HCC) is the third leading cause of cancer mortality worldwide, with five-year relative survival rates of less than 15% [7, 33]. In the US, while the combined cancer mortality rate has been declining for two decades, incidence and mortality rates of HCC are still increasing [118]. The resistance of HCC to existing treatments and the lack of reliable biomarkers for early detection make it one of the hardest cancers to treat. Most of the risk factors for HCC, including chronic infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), lead to the development of liver cirrhosis, which is considered a precursor of HCC and is present in 80–90% of HCC patients [30]. The malignant conversion of cirrhosis to HCC is often fatal, in part because adequate biomarkers are not available for diagnosis during the progression stages of HCC. Survival rates of patients with HCC can be significantly improved if the diagnosis is made at earlier stages, when treatment is more effective [12]. Alpha-fetoprotein (AFP), the serologic biomarker for HCC in current use, is not effective for early diagnosis due to its low sensitivity [48, 130]. Therefore, more potent biomarkers for early-stage HCC are needed.

LC-MS-based glycomics. Glycosylation is one of the most common post-translational modifications of proteins. Altered patterns of glycosylation have been associated with various diseases, and many currently used cancer biomarkers, including AFP, are glycoproteins [13, 37].
The analysis of glycosylation is particularly relevant to liver pathology because of the major influence of this organ on the homeostasis of blood glycoproteins. However, quantitative analysis of glycoproteins remains challenging in large-scale studies due to the dynamic nature of glycosylation [13]. An effective alternative is to analyze glycans released from proteins and associate the glycomic changes with pathological conditions of interest. N-glycans are of particular interest as their involvement in major biological processes, including cell-cell interactions and intracellular signaling, has important implications in disease progression [37]. Also, several enzymes that allow efficient release of this type of glycans have been made available [82]. Through appropriate analytical methods that yield broad coverage of the glycome, characterizing glycomic patterns in serum/plasma of patients with cancer has proven a promising strategy to discover biomarkers for early diagnosis of cancer [82,113]. In particular, mass spectrometry is an enabling technology for analysis of glycans in cancer biomarker discovery [82]. The use of matrix-assisted laser desorption/ionization (MALDI) mass spectrometry to identify N-glycan biomarkers for HCC has been widely applied and discussed [42,65,109,125]. With recent advances in mass spectrometry and separation methods, LC-MS is capable of profiling hundreds of glycans including isomeric glycoforms [23,55]. Higher sensitivity of LC-MS over MALDI in detecting permethylated N-glycans derived from serum has been demonstrated in a comparative study [55]. However, to date the glycomic profiling using LC-MS has not been fully exploited for large-scale biomarker discovery studies and there is still a lack of appropriate computational tools [113]. We apply LC-MS-based serum glycomics for HCC biomarker discovery in patients with liver cirrhosis. Workflow of the proposed glycomic analysis is shown in Figure 4.1. 
Sera were collected from participants recruited at the Tanta University Hospital in Egypt. We utilize two complementary platforms to perform global profiling and targeted quantification of N-glycans, and identify candidate biomarkers that distinguish HCC cases from cirrhotic controls. Global profiling is performed using a high-resolution mass spectrometer (LTQ-Orbitrap Velos), while targeted quantification is performed using a triple quadrupole (QqQ) mass spectrometer in multiple reaction monitoring (MRM) mode [97, 98]. The integrative workflow consisting of global profiling and targeted quantification is widely applied in LC-MS-based proteomic studies but, to the best of our knowledge, has not yet been exploited in glycomics. In this analysis, we identify candidate biomarkers from global profiling using the developed pipeline and confirm the result with that obtained by targeted quantification.

Figure 4.1: Workflow for the LC-MS-based analysis of N-glycans in sera.

4.2 Experimental methods

Study cohort. The samples in this study were obtained from participants recruited in Egypt. The study cohort consists of adult patients with HCC or liver cirrhosis recruited from the outpatient clinics and inpatient wards of the Tanta University Hospital, Egypt. The participants consist of 89 subjects (40 HCC cases and 49 patients with liver cirrhosis). Detailed characteristics of the patient population are provided in Table 4.1.

Table 4.1: Characteristics of the study cohort.
Characteristic | HCC (n = 40) | Cirrhosis (n = 49) | p-value
Age, mean (SD) | 53.2 (3.9) | 53.8 (7.6) | 0.3530
BMI, mean (SD) | 24.9 (3.1) | 24.5 (4.4) | 0.6513
Gender, male | 77.5% | 67.3% | 0.3474
HCV serology, HCV Ab+ | 100.0% | 100.0% | 1.0000
HBV serology, HBsAg+ | 0.0% | 6.1% | 0.2492
MELD*, mean (SD) | 18.6 (7.7) | 18.9 (7.1) | 0.1328
MELD* ≤ 10 | 20.0% | 12.2% | 0.3863
AFP, median (IQR) | 275.9 (1244.3) | |
HCC stage I | 72.5% | |
HCC stage II | 15.0% | |
HCC stage III | 5.0% | |
HCC stage unknown | 7.5% | |
*MELD: model for end-stage liver disease

Experimental design. We analyzed the collected sera in four batches (designated as E1, E2, E3 and E4). Each batch consists of approximately 24 samples, balanced between HCC cases and cirrhotic controls in terms of age, race, gender, smoking, alcohol and BMI. Samples within the same batch were prepared together, and LC-MS analysis was performed in a randomized order to avoid systematic biases. The same sample preparation procedure, consisting of release, purification, reduction and permethylation of N-glycans, was applied for both global profiling and targeted quantification. All the samples were analyzed through global profiling and targeted quantification.¹

¹ Laboratory methods: Permethylated N-glycans were separated by an UltiMate 3000 nano-LC system (Dionex, Sunnyvale, CA) with an Acclaim PepMap C18 column at 55 °C to promote efficient separation. The flow rate of the nanopump was set to 350 nl/min. Mobile phase A consisted of 2% ACN and 98% water with 0.1% formic acid, while mobile phase B consisted of ACN with 0.1% formic acid. The gradient program started at 20% mobile phase B over 10 min, which was ramped to 38% at 11 min and linearly increased to 60% in the following 32 min. Then, mobile phase B was increased to 90% in 3 min and the percentage was kept for 4 min. Finally, mobile phase B was decreased to 20% in 1 min and the percentage was kept for 9 min to equilibrate the column. The nano-LC system was interfaced to an LTQ-Orbitrap Velos (Thermo Scientific, San Jose, CA) hybrid mass spectrometer. The mass spectrometer was operated in data-dependent acquisition (DDA) mode, where each MS full scan (m/z range 500–2000) was followed by five MS/MS scans of the most intense ions. Targeted quantification of 117 N-glycans including isomers was performed by MRM using a QqQ mass spectrometer (TSQ Vantage) with Q1 and Q3 operated at unit resolution. The chromatographic condition was as described above for the global profiling analysis, utilizing an UltiMate 3000 nano-LC system with an identical gradient setup. The dwell time was 2.7 sec on average. Sample preparation and data acquisition were performed in the laboratory of Dr. Yehia Mechref at Texas Tech University.

Figure 4.2: Characteristics of a peak in LC-MS raw data. (Axes: retention time (min), m/z, ion count.)

4.3 Global profiling analysis

Data preprocessing. The LC-MS raw data were analyzed using a preprocessing pipeline consisting of in-house-developed algorithms and open-source software tools. The pipeline converts the raw data into a peak list. Each LC-MS run contains thousands of peaks along with ubiquitous noise; a representative peak in the raw data is shown in Figure 4.2. As introduced in Section 1.2.2, an isotopic pattern is present in mass spectra throughout the elution of the profiled compound. In consideration of such characteristics, we developed a data preprocessing pipeline to perform deisotoping of mass spectra, peak detection, retention time alignment and peak matching. We performed the deisotoping of mass spectra using DeconTools [60], where the monoisotopic mass and charge state were deduced. DeconTools allows us to specify an appropriate average residue composition for the calculation of isotopic distribution.
The average composition for the monosaccharides (C10H18N0.43O5S0) was determined based on the permethylated N-glycans commonly found in our previous glycomic studies [42,109,125]. After the deisotoping step, peak detection was performed using an in-house-developed algorithm. Specifically, deisotoped ions with the same molecular weight (within a 10 ppm tolerance) were linked along scans to generate a chromatographic trace. Low-quality traces were screened out according to the following criteria: 1) a minimum of 20 scans to define a trace, 2) a minimum total ion count of 50,000, 3) a minimum density of 0.3 within a trace, and 4) a maximum of 20 missing values between adjacent scans. After the screening step, missing values within the remaining traces were interpolated using their corresponding extracted ion chromatograms from the raw data. The interpolated trace was further processed through successive convolution with a Savitzky-Golay smoothing filter and a first-order derivative of a Gaussian kernel, to identify the position and boundary of the chromatographic peak at the zero-crossing and its enclosing local extrema, respectively. In the peak list of each LC-MS run, the properties used to characterize a peak were: monoisotopic mass, charge state, intensity (area under the curve within the boundary) and retention time (RT).

Prior to matching the detected peaks across LC-MS runs, we applied our developed Bayesian alignment model (BAM) to estimate the mapping function for each LC-MS run and modified the RT values in the peak lists, i.e., replacing t by u_i(t). Figures 4.3–4.6 depict the clustered ion chromatograms before and after alignment by BAM in the four batches. The adjusted peaks in multiple LC-MS runs were then matched using the simultaneous multiple alignment (SIMA) model [136]. The resulting consensus peak list of the LC-MS runs was further refined such that only the peaks detected in over half of the runs were retained.
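The trace-screening criteria listed above can be illustrated with a minimal Python sketch. It is not the in-house algorithm: the `Trace` container, the function names, and in particular the way "density" (observed scans divided by the scan span of the trace) and "missing values between adjacent scans" (the largest run of unobserved scans between two consecutive observations) are computed are assumptions made for illustration. Only the numerical thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the text; how each quantity is computed is an assumption.
PPM_TOLERANCE = 10.0
MIN_SCANS = 20
MIN_TOTAL_ION_COUNT = 50_000
MIN_DENSITY = 0.3
MAX_MISSING_BETWEEN_SCANS = 20


def same_mass(mass_a: float, mass_b: float, tol_ppm: float = PPM_TOLERANCE) -> bool:
    """Return True if two monoisotopic masses agree within tol_ppm of mass_a."""
    return abs(mass_a - mass_b) <= tol_ppm * 1e-6 * mass_a


@dataclass
class Trace:
    """Hypothetical container for one chromatographic trace."""
    scans: List[int]          # ascending scan indices where the ion was observed
    ion_counts: List[float]   # ion count observed at each of those scans


def keep_trace(trace: Trace) -> bool:
    """Apply the four screening criteria to a single trace."""
    n_observed = len(trace.scans)
    if n_observed < MIN_SCANS:                        # criterion 1
        return False
    if sum(trace.ion_counts) < MIN_TOTAL_ION_COUNT:   # criterion 2
        return False
    span = trace.scans[-1] - trace.scans[0] + 1
    if n_observed / span < MIN_DENSITY:               # criterion 3 (assumed definition)
        return False
    # Criterion 4: largest number of unobserved scans between two observations.
    largest_gap = max(b - a - 1 for a, b in zip(trace.scans, trace.scans[1:]))
    return largest_gap <= MAX_MISSING_BETWEEN_SCANS
```

Under these assumed definitions, a trace observed in 30 consecutive scans with 2,000 counts per scan is retained, whereas a trace observed only in every fifth scan fails the density criterion.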
The alignment process can greatly reduce the ambiguity in the peak matching step. Figure 4.7 presents the RT differences across runs of these refined peaks, with different RT tolerance parameters used in SIMA. The proportions of peaks with different RT variations (in one-second bins) are summarized. As shown in this figure, applying BAM led to more consistent RT values across runs: it reduced the mode from 10 seconds to less than five. Moreover, the peak matching results became more robust to the selection of the parameters. After the peak matching, a normalization step was applied to ensure that the summed intensity of detected peaks is identical in all the runs from the same batch. Finally, missing values owing to either peak detection or alignment were interpolated using the corresponding extracted ion chromatograms. In this study, peaks outside the expected RT range of glycans (15–50 min) were excluded from subsequent analysis. The preprocessing pipeline resulted in 2132, 2620, 2412, and 2392 consensus peaks in E1, E2, E3, and E4, respectively.

Figure 4.3: Clustered ion chromatograms in E1. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. (Axes: retention time (min); intensity (A.U.).)

Figure 4.4: Clustered ion chromatograms in E2. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. (Axes: retention time (min); intensity (A.U.).)

Figure 4.5: Clustered ion chromatograms in E3. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. (Axes: retention time (min); intensity (A.U.).)

Figure 4.6: Clustered ion chromatograms in E4. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. (Axes: retention time (min); intensity (A.U.).)

Figure 4.7: Distribution of the RT differences (in seconds) across LC-MS runs of consensus peaks identified by SIMA with different parameters of RT tolerance: (a) RT tolerance of 10, (b) RT tolerance of 20, (c) RT tolerance of 30, (d) RT tolerance of 40. Each panel compares SIMA with BAM+SIMA. (Axes: RT difference; proportion.)

Statistical analysis. Following the data preprocessing, the most relevant peaks with differential abundance between HCC cases and cirrhotic controls were selected using a two-way analysis of variance (ANOVA) model. Peaks from the four batches were matched upfront, and peak intensity was modeled in terms of group effect (HCC versus cirrhosis), batch effect and interaction between group and batch. Specifically, the peak intensity for replicate $k$ ($k = 1, \ldots, n_{ij}$) from group $i$ ($i = 1, 2$) in batch $j$ ($j = 1, 2, 3, 4$) is modeled as

\[
Y_{ijk} = \mu + G_i + B_j + (G \times B)_{ij} + \epsilon_{ijk}, \tag{4.1}
\]

where $\mu$ is the overall mean of the samples, the $G_i$'s are the group effects with

\[
\sum_i G_i = 0,
\]

the $B_j$'s are the batch effects with

\[
\sum_j B_j = 0,
\]

the $(G \times B)_{ij}$'s are the interactions between group and batch with

\[
\sum_i (G \times B)_{ij} = 0, \ \forall j = 1, 2, 3, 4, \qquad \sum_j (G \times B)_{ij} = 0, \ \forall i = 1, 2,
\]

and the $\epsilon_{ijk}$'s are random errors from a zero-mean normal distribution. Table 4.2 summarizes the properties of the ANOVA model, where the sum of squares (SS) and degrees of freedom (df) corresponding to each source are given below:

\begin{align*}
\mathrm{SS}_G &= \sum_{ijk} (\bar{Y}_{i..} - \bar{Y}_{...})^2, & \mathrm{df}_G &= \#\,\text{Groups} - 1, \\
\mathrm{SS}_B &= \sum_{ijk} (\bar{Y}_{.j.} - \bar{Y}_{...})^2, & \mathrm{df}_B &= \#\,\text{Batches} - 1, \\
\mathrm{SS}_{GB} &= \sum_{ijk} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2, & \mathrm{df}_{GB} &= \mathrm{df}_G \times \mathrm{df}_B, \\
\mathrm{SS}_E &= \sum_{ijk} (Y_{ijk} - \bar{Y}_{ij.})^2, & \mathrm{df}_E &= \sum_{ij} (n_{ij} - 1), \\
\mathrm{SS}_T &= \sum_{ijk} (Y_{ijk} - \bar{Y}_{...})^2, & \mathrm{df}_T &= \sum_{ij} n_{ij} - 1,
\end{align*}

and the estimates associated with the considered factors at different levels are

\begin{align*}
\bar{Y}_{...} &= \frac{\sum_{ijk} Y_{ijk}}{\sum_{ij} n_{ij}}, \\
\bar{Y}_{i..} &= \frac{\sum_{jk} Y_{ijk}}{\sum_{j} n_{ij}}, \quad i = 1, 2, \\
\bar{Y}_{.j.} &= \frac{\sum_{ik} Y_{ijk}}{\sum_{i} n_{ij}}, \quad j = 1, 2, 3, 4, \\
\bar{Y}_{ij.} &= \frac{\sum_{k} Y_{ijk}}{n_{ij}}, \quad i = 1, 2 \text{ and } j = 1, 2, 3, 4.
\end{align*}

We calculated p-values with the null hypothesis that the group means within each batch are the same. Peaks with a p-value < 0.05 and a consistent direction of fold change (FC) between groups in all four batches were selected as statistically significant. Prior to the statistical analysis, a logarithmic transformation was applied to ensure validity of the normal distribution assumption, as shown in Figure 4.8.

Table 4.2: Two-way analysis of variance of the glycomic data.
Source | df | SS | MS | F-statistic
Group effect | dfG | SSG | MSG = SSG/dfG | MSG/MSE
Batch effect | dfB | SSB | MSB = SSB/dfB | MSB/MSE
Interaction effect | dfGB | SSGB | MSGB = SSGB/dfGB | MSGB/MSE
Error | dfE | SSE | MSE = SSE/dfE |
Total | dfT | SST | MST = SST/dfT |

Figure 4.8: Histograms of peak intensities before (a) and after (b) logarithmic transformation. (Axes: intensity or log(intensity); count.)

Candidate N-glycan biomarkers through LC-MS global profiling. The statistical analysis based on the ANOVA model revealed 78 peaks that are statistically significant (p-value < 0.05 and consistent FC). Putative glycan structures were assigned to the selected peaks by matching experimentally measured mass values with theoretical values of human serum N-glycans, which were previously characterized by their numbers of five monosaccharides: N-acetylglucosamine (GlcNAc), mannose, galactose, fucose, and N-acetylneuraminic acid (NeuNAc). The matching (with a tolerance of 2 ppm) resulted in 10 significant N-glycans (Table 4.3). Potential isomers were observed for two galactosylated beta-1,6-GlcNAc branching glycans, [5-3-3-0-1] and [5-3-3-0-2]. All of the significant glycans belong to the complex type.

Table 4.3: N-glycan candidate biomarkers identified by the LC-MS global profiling.

N-glycan | RT | Charge | p-value | Fold change
[5-3-0-0-0] | 27.6 | 2 | 0.004 | ↓1.76
[5-3-0-0-0] | 27.6 | 3 | 0.007 | ↓1.77
[5-3-1-0-0] | 29.0 | 2 | 0.028 | ↓1.33
[5-3-1-0-0] | 29.0 | 3 | 0.025 | ↓1.34
[5-3-1-0-1] | 31.8 | 2 | 0.040 | ↓1.50
[5-3-3-0-1] | 30.9 | 3 | 0.036 | ↑1.32
[5-3-3-0-1] | 32.5 | 3 | 0.019 | ↑1.34
[5-3-3-0-2] | 33.0 | 3 | 0.016 | ↑1.46
[5-3-3-0-2] | 34.6 | 3 | 0.004 | ↑1.54
[5-3-3-0-2] | 34.6 | 4 | 0.040 | ↑1.46
[5-3-3-0-3] | 34.8 | 3 | 0.030 | ↑1.51
[5-3-3-0-3] | 34.8 | 4 | 0.042 | ↑1.46
[6-3-4-0-1] | 33.7 | 3 | 0.028 | ↑1.33
[6-3-4-0-2] | 35.7 | 3 | 0.018 | ↑1.66
[6-3-4-0-2] | 35.7 | 4 | 0.030 | ↑1.63
[6-3-4-0-3] | 37.7 | 3 | 0.034 | ↑1.83
[6-3-4-0-3] | 37.7 | 4 | 0.027 | ↑1.78
[6-3-4-0-4] | 39.1 | 4 | 0.027 | ↑1.74

4.4 Multiple reaction monitoring quantification

Targeted quantification of 117 N-glycans including isomers was performed by MRM using a QqQ mass spectrometer. These targets include 1) N-glycans that were detected on the Orbitrap or QqQ instrument in our previous studies, 2) N-glycans evaluated as potential HCC biomarkers in previous studies [21, 42, 77, 109, 124, 125], and 3) N-glycans involved in the Golgi apparatus, retrieved from the KEGG GLYCAN database [49]. The 117 N-glycans were represented by 213 channels (three transitions in each) consisting of their different adduct forms and charge states. Curation of the transitions was performed to eliminate channels with unfavorable chromatographic profiles or significant noise, and to determine appropriate RT windows for quantification. Owing to the unit resolution in Q1 and Q3, interferences may appear across channels with close m/z values in their transitions. The observed elution order of N-glycans on the Orbitrap system was used to resolve some ambiguous cases in the MRM analysis. Among the 213 transition channels, 93 channels representing 124 potential isomers of 65 N-glycans were detected consistently and quantified for subsequent analysis.

As in the global profiling analysis, peak intensities were log-transformed prior to the statistical analysis. A normalization step was also applied to ensure that the mean of the log-transformed peak intensities is identical in all the MRM runs from the same batch. The two-way ANOVA model used in the global profiling was applied to identify candidate N-glycan biomarkers. Through the targeted quantification, we identified 10 N-glycans that are statistically significant (p-value < 0.05 and consistent FC in four batches), as shown in Table 4.4. Four of these significant glycans have a p-value < 0.01: [5-3-0-0-0] and [5-3-1-0-1], which are down-regulated in HCC; and [5-3-3-0-2] and [5-3-3-0-3], which are up-regulated in HCC. Most of the significant glycans were also identified by the global profiling analysis, i.e., [5-3-0-0-0], [5-3-1-0-0], [5-3-1-0-1], [5-3-3-0-2], [5-3-3-0-3], [6-3-4-0-2], [6-3-4-0-3] and [6-3-4-0-4]. Their MRM quantification results are shown in Figure 4.9.

Table 4.4: N-glycan candidate biomarkers identified by the MRM targeted quantification.

N-glycan | RT | Charge | p-value | Fold change
[5-3-0-0-0] | 29.5 | 2 | 0.0009 | ↓1.81
[5-3-1-0-0] | 30.3 | 2 | 0.018 | ↓1.34
[5-3-1-0-0] | 30.3 | 3 | 0.020 | ↓1.34
[5-3-1-0-1] | 33.8 | 2 | 0.003 | ↓1.39
[5-3-1-1-1] | 36.0 | 2 | 0.027 | ↓1.39
[5-3-3-0-2] | 34.3 | 3 | 0.003 | ↑1.45
[5-3-3-0-3] | 36.5 | 3 | 0.010 | ↑1.36
[5-3-3-0-3] | 36.5 | 4 | 0.029 | ↑1.36
[5-3-3-2-1] | 37.0 | 3 | 0.014 | ↑1.37
[6-3-4-0-2] | 38.3 | 3 | 0.024 | ↑1.29
[6-3-4-0-2] | 38.3 | 4 | 0.020 | ↑1.37
[6-3-4-0-3] | 39.5 | 4 | 0.011 | ↑1.50
[6-3-4-0-4] | 36.5 | 4 | 0.012 | ↑1.44
[6-3-4-0-4] | 40.8 | 4 | 0.017 | ↑1.73

4.5 Discussion

We analyzed N-glycans in sera from HCC cases and cirrhotic controls, where N-glycans were enzymatically removed from serum proteins and permethylated, allowing relative quantification of hundreds of oligosaccharides. Candidate N-glycan biomarkers were identified through LC-MS-based global profiling and targeted quantification. The most relevant glycans in distinguishing HCC cases from cirrhotic controls were selected using a two-way ANOVA model. We identified 10 statistically significant N-glycans through each of the quantification approaches (12 in total).
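The two-way ANOVA used in both the global profiling and the MRM analysis can be sketched in a few lines of Python for a single peak. This is an illustrative sketch rather than the code used in the study: the function names and the dictionary layout (log-transformed intensities keyed by `(group, batch)`) are assumptions. It computes the sums of squares and F-statistics exactly as laid out in Table 4.2, together with the check for a consistent fold-change direction across batches. Converting an F-statistic to a p-value additionally requires the F-distribution survival function (for example `scipy.stats.f.sf`), which is deliberately left out to keep the sketch dependency-free.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (group i, batch j); this keying is a hypothetical layout


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def two_way_anova(data: Dict[Cell, List[float]]) -> Dict[str, float]:
    """F-statistics for group, batch and interaction, following Table 4.2."""
    groups = sorted({i for i, _ in data})
    batches = sorted({j for _, j in data})
    n_total = sum(len(v) for v in data.values())

    grand = sum(sum(v) for v in data.values()) / n_total
    group_mean = {i: _mean([y for (g, _), v in data.items() if g == i for y in v])
                  for i in groups}
    batch_mean = {j: _mean([y for (_, b), v in data.items() if b == j for y in v])
                  for j in batches}
    cell_mean = {cell: _mean(v) for cell, v in data.items()}

    ss_g = ss_b = ss_gb = ss_e = 0.0
    for (i, j), values in data.items():
        n = len(values)  # summing over k contributes n_ij identical terms
        ss_g += n * (group_mean[i] - grand) ** 2
        ss_b += n * (batch_mean[j] - grand) ** 2
        ss_gb += n * (cell_mean[(i, j)] - group_mean[i] - batch_mean[j] + grand) ** 2
        ss_e += sum((y - cell_mean[(i, j)]) ** 2 for y in values)

    df_g = len(groups) - 1
    df_b = len(batches) - 1
    df_gb = df_g * df_b
    df_e = n_total - len(data)  # sum over cells of (n_ij - 1)
    mse = ss_e / df_e
    return {
        "F_group": (ss_g / df_g) / mse,
        "F_batch": (ss_b / df_b) / mse,
        "F_interaction": (ss_gb / df_gb) / mse,
    }


def consistent_direction(data: Dict[Cell, List[float]],
                         group_a: int = 1, group_b: int = 2) -> bool:
    """True if the difference in mean log-intensity has the same sign in every batch."""
    batches = sorted({j for _, j in data})
    diffs = [_mean(data[(group_a, j)]) - _mean(data[(group_b, j)]) for j in batches]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
```

A peak would then be retained when the p-value derived from the relevant F-statistic is below 0.05 and `consistent_direction` returns `True`. On the log scale, the sign of the difference in means corresponds to the direction of the fold change.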
Although none of these glycans had an adjusted p-value < 0.05 after multiple testing correction using the method of Benjamini and Hochberg [10], our integrative analysis revealed a good overlap of the significant glycans identified by both quantification approaches. There are eight candidate biomarkers overlapping between the two complementary platforms: [5-3-0-0-0], [5-3-1-0-0], [5-3-1-0-1], [5-3-3-0-2], [5-3-3-0-3], [6-3-4-0-2], [6-3-4-0-3], and [6-3-4-0-4]. Six of these candidate biomarkers are sialylated glycans, which were often excluded from investigation in previous studies [21, 77, 83, 124].

Figure 4.9: Quantification results of eight candidate N-glycan biomarkers in sera of HCC cases and cirrhotic controls by the MRM analysis. (a)-(c) Down-regulated bisected GlcNAc glycans. (d)-(e) Up-regulated beta-1,6-GlcNAc branching glycans. (f)-(h) Up-regulated tetra-antennary glycans. (Panel titles: [5-3-0-0-0], RT 29.5, FC ↓1.81; [5-3-1-0-0], RT 30.3, FC ↓1.34; [5-3-1-0-1], RT 33.8, FC ↓1.39; [5-3-3-0-2], RT 34.3, FC ↑1.45; [5-3-3-0-3], RT 36.5, FC ↑1.36; [6-3-4-0-2], RT 38.3, FC ↑1.37; [6-3-4-0-3], RT 39.5, FC ↑1.5; [6-3-4-0-4], RT 40.8, FC ↑1.73. Axes: group (HCC, cirrhosis); log2(intensity).)

We used the 12 significant N-glycans to further evaluate the performance of the proposed Bayesian alignment model (BAM).
Specifically, the consensus peaks corresponding to the 12 glycans were searched against the peak detection results, and three procedures were compared: no alignment performed to adjust individual peak lists (SIMA), alignment performed using BAM (BAM+SIMA), and alignment performed based on single chromatograms using DTW (DTW+SIMA). We compared the number of missing values caused by the peak matching step with different parameters in SIMA (RT tolerances of 10, 20, 30 and 40). As shown in Table 4.5, applying BAM to adjust the retention time variation facilitated the subsequent peak matching step and made the peak matching result more robust to the selection of parameters, as also discussed in Chapter 3. A small RT parameter in SIMA may lack sufficient coverage to capture the RT variation, while a large RT parameter induces possible interference between peaks. Without appropriate retention time alignment, choosing the right parameter for the peak matching step can be ambiguous. Alignment by DTW did not eliminate such ambiguity. As discussed in Section 2.5.3, DTW is prone to overfitting the data, especially when the chromatographic profiles captured by base peak chromatograms are not consistent across runs (Figure 4.10). As a result, this procedure introduced additional variability in the peak matching process. Note that the comparison presented here is not comprehensive, as there is no ground-truth information that can be used for a thorough evaluation as conducted in Section 3.6.

Biosynthesis of N-glycans in the Golgi apparatus involves trimming of mannose residues and stepwise addition of monosaccharides, resulting in three groups of N-glycans: high-mannose, complex and hybrid types. Some biomarker discovery studies for other types of cancer have shown that structurally related glycans are likely to have correlated changes in their levels due to the shared biosynthesis process they are involved in [113].
Our analysis exhibited a similar phenomenon, where many of the candidate biomarkers for HCC are closely related in their structures. These glycans can be grouped into three clusters, and within each cluster the glycans show consistent changes in their levels. Specifically, there are glycans with 1) bisected GlcNAc structure, 2) beta-1,6-GlcNAc branching structure, and 3) tetra-antennary structure, as shown in Figure 4.11. Further elucidation of the relationship between the identified complex N-glycans can be obtained by referring to their biosynthesis process. N-acetylglucosaminyltransferase III (GnT-III) and N-acetylglucosaminyltransferase V (GnT-V) are glycosyltransferases that are known to play a key role in the formation of N-glycan branches [99,147]. GnT-III and GnT-V lead to two distinct branching structures in N-glycans with contrasting implications for cancer metastasis. GnT-III catalyzes the formation of a bisected GlcNAc linkage in N-glycans, which has been associated with inhibition of cancer metastasis, whereas GnT-V catalyzes the addition of beta-1,6-GlcNAc branching of N-glycans, and has been considered a promoter of metastasis. Their implications for HCC have also been discussed [77,83,141]. In this study, decreased levels in HCC were found in GnT-III’s downstream products (bisected GlcNAc glycans), while the opposite alteration was found in GnT-V’s downstream products (beta-1,6-GlcNAc branching glycans). This observation matches the implied roles of GnT-III and GnT-V in cancer metastasis, consistently representing the progression from cirrhosis to HCC. In addition, increased levels of beta-galactoside alpha-2,6-sialyltransferase (ST6GalI), which transfers a sialic acid residue in alpha-2,6 linkage to a terminal galactose, have been associated with progression and poor prognosis in HCC [13, 52, 102, 146]. Consistent findings were obtained in this study, where increased levels in HCC were found in a number of sialylated GlcNAc branching N-glycans (downstream products of ST6GalI).

Figure 4.10: Base peak chromatograms in four batches. (a) Batch 1, (b) Batch 2, (c) Batch 3, (d) Batch 4. (Axes: retention time (min); ion count.)

Table 4.5: Number of missing values associated with the 12 significant N-glycans caused by the peak matching step. Column headers 10, 20, 30 and 40 denote the RT tolerance used in SIMA.

N-glycan | RT | Charge | SIMA 10 | SIMA 20 | SIMA 30 | SIMA 40 | DTW+SIMA 10 | DTW+SIMA 20 | DTW+SIMA 30 | DTW+SIMA 40 | BAM+SIMA 10 | BAM+SIMA 20 | BAM+SIMA 30 | BAM+SIMA 40
[5-3-0-0-0] | 27.6 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
[5-3-0-0-0] | 27.6 | 3 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
[5-3-1-0-0] | 29.0 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
[5-3-1-0-0] | 29.0 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
[5-3-1-0-1] | 31.8 | 2 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[5-3-1-1-1] | 34.0 | 2 | 15 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[5-3-3-0-1] | 30.9 | 3 | 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[5-3-3-0-1] | 32.5 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[5-3-3-0-2] | 33.0 | 3 | 11 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0
[5-3-3-0-2] | 33.0 | 4 | 8 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 1 | 0 | 0 | 0
[5-3-3-0-2] | 34.6 | 3 | 3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0
[5-3-3-0-3] | 34.8 | 3 | 7 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
[5-3-3-0-3] | 34.8 | 4 | 8 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
[5-3-3-2-1] | 37.1 | 2 | 8 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0
[6-3-4-0-1] | 33.7 | 3 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[6-3-4-0-2] | 35.7 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[6-3-4-0-2] | 35.7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[6-3-4-0-3] | 37.7 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[6-3-4-0-3] | 37.7 | 4 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0
[6-3-4-0-4] | 39.1 | 4 | 13 | 0 | 0 | 0 | 10 | 5 | 5 | 5 | 11 | 1 | 0 | 0

Figure 4.11: Three clusters of the identified N-glycan candidate biomarkers and their fold change directions (HCC versus cirrhosis).

4.6 Summary

This chapter presents an LC-MS-based glycomic study to identify N-glycan biomarkers for HCC. We demonstrate the capability of the developed Bayesian alignment model (BAM) to enhance the reliability of LC-MS data preprocessing. We integrate the BAM into a preprocessing pipeline for LC-MS global profiling analysis in the glycomic study. Through the preprocessing pipeline, we identified 10 candidate biomarkers from global profiling. Confirmation was based on the result of targeted quantification, performed on a complementary platform using MRM. Eight of these candidate N-glycan biomarkers were identified by both quantification approaches and match closely with the implications of important glycosyltransferases in cancer progression and metastasis.

The results of this study illustrate the power of the integrative approach combining LC-MS-based global profiling and targeted quantification for a comprehensive serum glycomic analysis, to investigate changes in N-glycan levels between HCC cases and patients with liver cirrhosis. In addition, preprocessing of the LC-MS data is crucial for the global profiling analysis, and the BAM greatly facilitates a reliable analysis in the discovery phase.

Through the appropriate analysis workflow, this study revealed that glycomic changes during cancer progression represent systematic alteration. Enrichment analysis could potentially increase the statistical power to detect the glycomic changes on a systems level and yield biologically relevant results, as demonstrated in genomic analysis [122]. Comprehensive coverage of the glycome is a prerequisite, and the presented LC-MS-based workflow is expected to serve as a primary approach for this type of analysis. An associated challenge is to develop reliable pipelines for glycan identification, as in other omic analyses, most notably metabolomics [140]. In order to allow a rigorous integration of changes in multiple glycans, defining appropriate categories of ontological/topological information is critical. This strategy may be further enhanced by integrating additional characteristics through other omic analyses, such as proteomics.

Investigation of these candidate biomarkers in a larger population may allow specific stratification of the subjects on the basis of etiology and disease stage. Our current work in this regard involves the development of more rigorous subsequent analysis using multivariate statistical analysis or enrichment analysis to identify a panel of HCC biomarkers, and to evaluate whether the observed glycomic changes can be reliably used for early detection of HCC in the high-risk population of cirrhotic patients. Furthermore, we will incorporate additional omic measurements on the same subjects for a more comprehensive characterization.

Chapter 5 Conclusion

This dissertation addresses retention time alignment for LC-MS data analysis. While LC-MS has gained significant attention in systems biology and biomarker discovery, chromatographic methods often cannot generate reproducible measurements of retention time, limiting the ability to apply this analytical technique to large-scale omic studies. As a result, there is a pressing need to develop powerful computational tools that address the issue of retention time alignment. We approached this problem from fundamental methodology development to rigorous integration of complementary information in LC-MS data. This chapter summarizes the original contributions of the research work and discusses possible future work.

5.1 Summary of original contributions

In this dissertation, we proposed and developed a Bayesian alignment model (BAM), which integrates complementary information embedded in the LC-MS data through a mathematically rigorous framework. The BAM belongs to the category of profile-based approaches, which are composed of two major components: a prototype function and a set of mapping functions.
The profile-based approaches make use of chromatograms covering the whole retention time range. This is in contrast to the commonly applied feature-based alignment approaches, which rely on a subset of the time points. In addition, the profile-based approaches avoid the heavy dependency of feature-based approaches on preceding processes, including peak detection for each LC-MS run and identification of consensus peaks across runs. Appropriate estimation of the prototype and mapping functions is crucial for good alignment results in the profile-based approaches. In contrast to time-warping approaches, in which the alignment is performed in a pairwise manner, the proposed BAM simultaneously leverages all the LC-MS runs to perform appropriate retention time alignment. The BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters, and estimates the retention time variability along with uncertainty measures. This enables a natural framework to integrate complementary information from various sources, through weighing the uncertainty measures. The original contributions of this research work are summarized as follows.

Development of single-profile Bayesian alignment model. We have developed a single-profile Bayesian alignment model for LC-MS data analysis. The alignment model improves on existing Bayesian methods by 1) using an efficient MCMC sampler based on the proposed block Metropolis-Hastings algorithm, and 2) adaptively selecting knots for the mapping function using stochastic search variable selection (SSVS). Due to the mathematical intractability of the mapping function and the monotonicity constraint imposed on it, designing an effective updating scheme is crucial to ensure good mixing of the MCMC sampler. The proposed block Metropolis-Hastings algorithm enables flexible transition and prevents the sampler from getting trapped in local modes of the posterior distribution.
Moreover, an extension using SSVS has been developed for adaptive determination of the number and positions of knots. The developed methodology has been evaluated through comparison with competing approaches, based on both simulated and real LC-MS data. The evaluation shows that our alignment model yields better performance, which is accomplished through improved estimation and modeling of the mapping functions. Furthermore, a possible extension through formulation with the Jupp transformation, which enables use of the efficient Hamiltonian Monte Carlo (HMC) algorithm, has been investigated and discussed.

Development of multi-profile Bayesian alignment model. We have further extended the single-profile alignment model to handle multiple representative chromatograms simultaneously. Along with our developed MCMC sampling schemes, the multi-profile alignment model offers two major attractive features: 1) multi-profile modeling with representative chromatograms identified by a clustering approach, and 2) a Gaussian process prior based on internal standards to guide the alignment process. Conventional approaches that derive chromatograms by binning along the m/z dimension are not satisfactory, as the chromatographic profiles would inevitably be blurred. We have developed a novel clustering approach to identify multiple representative chromatograms from each LC-MS run. This approach takes into account chromatographic quality and reproducibility across runs, and searches for a clustering configuration with less overlap between chromatograms. The resulting chromatograms are simultaneously considered in the profile-based alignment to improve the estimation of prototype and mapping functions. Moreover, a novel use of internal standards as landmarks in the alignment process has been developed through Gaussian process regression. Our developed model enables a rigorous integration of information from various sources.
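As a rough illustration of how internal standards could inform a prior through Gaussian process regression, the sketch below computes the posterior mean and variance of a retention time shift at arbitrary time points, given shifts observed at a few internal standards. It is a generic textbook formulation and not the model of Section 3.1: the squared-exponential kernel, the hyperparameters `amp`, `length` and `noise`, the function names, and the numerical values are assumptions chosen for illustration.

```python
import math


def sq_exp_kernel(a, b, amp=1.0, length=10.0):
    """Squared-exponential covariance between two retention times."""
    return amp ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(t_std, shift_std, t_query, noise=0.1, amp=1.0, length=10.0):
    """Posterior mean and variance of the shift at each query time."""
    n = len(t_std)
    K = [[sq_exp_kernel(t_std[i], t_std[j], amp, length)
          + (noise ** 2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, shift_std))
    means, variances = [], []
    for t in t_query:
        k_star = [sq_exp_kernel(t, s, amp, length) for s in t_std]
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        v = solve_lower(L, k_star)
        variances.append(sq_exp_kernel(t, t, amp, length) - sum(x * x for x in v))
    return means, variances


# Hypothetical internal standards: reference times (min) and observed shifts (min).
t_std = [10.0, 25.0, 40.0, 55.0]
shift_std = [0.5, 0.8, 0.3, -0.2]
means, variances = gp_posterior(t_std, shift_std, [10.0, 32.5, 90.0])
```

Near an internal standard the posterior variance is small, so the prior constrains the mapping tightly there. Far from any standard the variance returns to the prior level, leaving the chromatographic profiles to determine the alignment.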
Comprehensive evaluation of the model has been performed on LC-MS data from proteomic and glycomic studies. The evaluation demonstrates that the proposed alignment model substantially reduces the experimental variability in retention time measurement and facilitates the subsequent peak matching process, which is key to the analysis of large-scale LC-MS-based omic studies.

Application of Bayesian alignment model to biomarker discovery. We have integrated the proposed Bayesian alignment model (BAM) into a preprocessing pipeline for analysis of LC-MS data, and applied the pipeline to a large-scale LC-MS-based glycomic study for cancer biomarker discovery. The glycomic study consists of two complementary analyses: 1) global profiling using LC-MS, and 2) targeted quantification by MRM. Through the developed pipeline, we identified reliable candidate biomarkers from the global profiling analysis and confirmed the results by targeted quantification. Preprocessing of the LC-MS data is crucial for the global profiling analysis, and the BAM ensures a reliable analysis in this phase. Through cross-platform confirmation, the results of this study illustrate the power of the integrative approach for a comprehensive LC-MS-based omic analysis.

5.2 Future work

There are several remaining topics that can be further explored. We discuss some of the possible extensions in this section. The discussion presented here can be viewed as a starting point for future research.

Computational efficiency. A current bottleneck of the developed Bayesian alignment model (BAM) is the required computational time. This issue could be mitigated by developing more efficient sampling schemes for parameter inference. The Hamiltonian Monte Carlo (HMC) model introduced in Section 2.6 is an initial effort in this direction. To further improve the HMC model and broaden its application scope, the main focus is to develop adaptive ways to tune the HMC parameters.
Incorporation of adaptive HMC samplers, e.g., the recently developed “No-U-Turn Sampler” (NUTS) [53], may deserve further investigation. Besides MCMC methods, an interesting alternative is the use of deterministic approximation methods such as integrated nested Laplace approximations (INLA) [112], which may further improve the efficiency of parameter inference. In devising powerful sampling schemes, some sort of transformation may be desirable, since it is often challenging to perform efficient parameter inference in a constrained space, as in the BAM. However, this might complicate the integration of an informative prior, which can be naturally performed in the original space of the BAM through Gaussian process regression as described in Section 3.1. We are currently working on sampling approaches that improve the efficiency of sampling while retaining the convenience of incorporating an informative prior. In addition to methodology development, practical approaches that rearrange the alignment process may further reduce the computational burden. One possible approach is to employ a coarse-to-fine alignment procedure, where an approximate yet fast estimate is first derived based on down-sampled data. This estimate is then used to initialize a more precise estimation based on the complete data. In the BAM, this can be accomplished by scheduling appropriate block moves, i.e., using smaller values of rblock to create larger blocks in initial MCMC iterations. Devising adaptive ways to choose appropriate values of rblock while ensuring a valid MCMC sampler (also known as adaptive MCMC [111]) would be an interesting topic to pursue in this direction.

Methodology extension. The main assumption of the BAM is that there must exist a consistent pattern that is representative of the considered LC-MS runs.
One concern with the single-group assumption is due to possible outliers, where some LC-MS runs may exhibit a significantly distinct pattern from the rest of the runs. In such a case, irrelevant measurements from the outlying runs would disturb the estimation of the prototype and mapping functions. A possible extension to address this issue is to introduce a mixture distribution of inliers and outliers, which defines the attribute of each observed chromatogram and is updated through assessing the consistency between the prototype function and the chromatogram. For an MCMC update, if an observed chromatogram is identified as an outlier, it would be excluded from the current estimation of model parameters. This extension is expected to improve estimation of model parameters in the BAM. Moreover, based on the posterior probability of the attribute, the outlying runs can also be detected in a principled way. When samples arise from different biological subgroups, the model needs to be further extended to account for the heterogeneity across these subgroups. In addition to incorporating grouping information, e.g., by using a Dirichlet process mixture [91], a module that can distinguish common chromatographic profiles from those unique to specific groups needs to be developed. This would help identify and prioritize representative profiles for the alignment process. We believe that simultaneous alignment of samples from multiple groups will ensure coherence in the preprocessing step and data comparability, which may facilitate downstream analyses, such as difference detection. More interestingly, this may reveal heterogeneity and novel patterns within a pre-defined biological group.

Integrated model for LC-MS data preprocessing. Appropriate utilization of the rich information embedded in the LC-MS data is crucial in data preprocessing, as demonstrated in Chapter 3.
A natural extension of the current work is to develop an integrated model that handles peak detection and retention time alignment simultaneously. Current preprocessing pipelines perform the two steps sequentially, and peak detection is often carried out for each LC-MS run separately, without leveraging information from multiple runs. The two preprocessing steps are closely related, and the benefit of combining them is twofold. On the one hand, performing peak detection with information from other replicate runs could potentially reduce the associated uncertainty and lead to more reliable results [87], if retention time alignment across runs is appropriately handled. On the other hand, identification of consistent patterns across LC-MS runs is key to the success of retention time alignment, and the peak detection step may reveal good candidates for such patterns. Characterization of the chromatographic profile is involved in both steps, and the BAM offers a natural framework to construct the integrated model. In addition to improving the two preprocessing steps, the integrated model will lead to a coherent peak matching process. Furthermore, this may allow adequate normalization of peak intensity based on hydrophobicity, chemical class and other relevant properties.

Potential applications. Although this dissertation is focused on retention time alignment of LC-MS data, some of the developed methodology can be used beyond LC-MS data analysis. There is a broad interest in functional data analysis [107], and alignment of important features of curves (called curve registration [106]) is crucial for appropriate interpretation and analysis of functional data, which arise in many different areas including economics, chemistry and biology. In particular, curve registration has been applied to analysis of biological data acquired by a variety of technologies including microarray [128], two-dimensional gel electrophoresis [45], and electrocardiography [131].
In terms of complexity and size of data, we do not see major issues in applying our developed alignment model to these studies.

Recent advances in immunoprecipitation, affinity purification-mass spectrometry (AP-MS), and other technologies have enabled large-scale characterization of protein-protein, protein-DNA, and other molecular interactions [41]. Analysis of interaction networks provides important insights for systems biology research [86]. At the same time, it has raised a number of interesting computational questions. By using appropriate computational methods to analyze the interaction networks, it is possible to identify key pathways and/or complexes that regulate specific biological processes. In particular, comparative analysis of interaction networks, e.g., alignment and comparison of interaction networks across species, may shed light on basic cellular processes and phenotypic evolution. A fundamental challenge lies in the alignment of interaction networks [11,117]. With additional efforts to translate and further characterize the network alignment problem, some ideas and methodology developed in this dissertation may be useful in this field, e.g., the considered MCMC samplers and stochastic search strategy.

5.3 Conclusion

Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups in LC-MS-based omic studies. Retention time alignment is one of the most important yet challenging preprocessing steps. In this dissertation, we investigate the alignment problem from methodology development to practical application through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. The proposed Bayesian alignment model has been evaluated and compared with competing models, based on LC-MS data sets from proteomic, metabolomic and glycomic studies.
Experimental results show improved performance by the proposed model and demonstrate its applicability in LC-MS-based omic studies, where the model greatly reduces the experimental variability in retention time measurement and facilitates the peak matching process through appropriate integration of complementary information. Finally, several related tasks are proposed for future work.

Bibliography

[1] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422(6928):198–207, 2003.
[2] C.H. Ahrens, E. Brunner, E. Qeli, K. Basler, and R. Aebersold. Generating and navigating proteome maps using mass spectrometry. Nature Reviews Molecular Cell Biology, 11(11):789–801, 2010.
[3] U. Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC, 2006.
[4] A.H. America, J.H. Cordewener, M.H. van Geffen, A. Lommen, J.P. Vissers, R.J. Bino, and R.D. Hall. Alignment and statistical difference analysis of complex peptide data sets generated by multidimensional LC-MS. Proteomics, 6(2):641–653, 2006.
[5] H.J. An, S.R. Kronewitter, M.L. de Leoz, and C.B. Lebrilla. Glycomics and disease markers. Current Opinion in Chemical Biology, 13(5-6):601–607, 2009.
[6] T.M. Annesley. Ion suppression in mass spectrometry. Clinical Chemistry, 49(7):1041–1044, 2003.
[7] A. Arzumanyan, H.M. Reis, and M.A. Feitelson. Pathogenic mechanisms in HBV- and HCV-associated hepatocellular carcinoma. Nature Reviews Cancer, 13:123–135, 2013.
[8] M. Bellew, M. Coram, M. Fitzgibbon, M. Igra, T. Randolph, P. Wang, D. May, J. Eng, R. Fang, C. Lin, J. Chen, D. Goodlett, J. Whiteaker, A. Paulovich, and M. McIntosh. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics, 22(15):1902–1909, 2006.
[9] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60:503–516, 1954.
[10] Y. Benjamini and Y. Hochberg.
Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.
[11] J. Berg and M. Lassig. Cross-species analysis of biological networks by Bayesian alignment. Proceedings of the National Academy of Sciences of the United States of America, 103(29):10967–10972, 2006.
[12] E.S. Bialecki and A.M. Di Bisceglie. Diagnosis of hepatocellular carcinoma. HPB: The Official Journal of the International Hepato Pancreato Biliary Association, 7(1):26–34, 2005.
[13] B. Blomme, C. Van Steenkiste, N. Callewaert, and H. Van Vlierberghe. Alteration of protein glycosylation in liver diseases. Journal of Hepatology, 50(3):592–603, 2009.
[14] S. Brooks, A. Gelman, G.L. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, 2011.
[15] D. Bylund, R. Danielsson, G. Malmquist, and K.E. Markides. Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography-mass spectrometry data. Journal of Chromatography A, 961(2):237–244, 2002.
[16] S.J. Callister, R.C. Barry, J.N. Adkins, E.T. Johnson, W.J. Qian, B.J. Webb-Robertson, R.D. Smith, and M.S. Lipton. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. Journal of Proteome Research, 5(2):277–286, 2006.
[17] C. Christin, H.C. Hoefsloot, A.K. Smilde, F. Suits, R. Bischoff, and P.L. Horvatovich. Time alignment algorithms based on selected mass traces for complex LC-MS data. Journal of Proteome Research, 9(3):1483–1495, 2010.
[18] K.R. Coombes, S. Tsavachidis, J.S. Morris, K.A. Baggerly, M.C. Hung, and H.M. Kuerer. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5(16):4107–4117, 2005.
[19] V. Dancik, T.A.
Addona, K.R. Clauser, J.E. Vath, and P.A. Pevzner. De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology, 6(3-4):327–342, 1999.
[20] C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York, 1978.
[21] E.N. Debruyne, D. Vanderschaeghe, H. Van Vlierberghe, A. Vanhecke, N. Callewaert, and J.R. Delanghe. Diagnostic value of the hemopexin N-glycan profile in hepatocellular carcinoma patients. Clinical Chemistry, 56(5):823–831, 2010.
[22] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[23] J.L. Desantos-Garcia, S.I. Khalil, A. Hussein, Y. Hu, and Y. Mechref. Enhanced sensitivity of LC-MS analysis of permethylated N-glycans through online purification. Electrophoresis, 32(24):3516–3525, 2011.
[24] E.P. Diamandis. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Molecular and Cellular Proteomics, 3(4):367–378, 2004.
[25] B. Domon and R. Aebersold. Mass spectrometry and protein analysis. Science, 312(5771):212–217, 2006.
[26] P. Du, W.A. Kibbe, and S.M. Lin. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17):2059–2065, 2006.
[27] S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[28] W.B. Dunn, D. Broadhurst, P. Begley, E. Zelena, S. Francis-McIntyre, N. Anderson, M. Brown, J.D. Knowles, A. Halsall, J.N. Haselden, A.W. Nicholls, I.D. Wilson, D.B. Kell, R. Goodacre, and Human Serum Metabolome (HUSERMET) Consortium. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nature Protocols, 6(7):1060–1083, 2011.
[29] P.H. Eilers. Parametric time warping.
Analytical Chemistry, 76(2):404–411, 2004.
[30] H.B. El-Serag. Hepatocellular carcinoma. New England Journal of Medicine, 365(12):1118–1127, 2011.
[31] J.E. Elias, W. Haas, B.K. Faherty, and S.P. Gygi. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods, 2(9):667–675, 2005.
[32] J.K. Eng, B.C. Searle, K.R. Clauser, and D.L. Tabb. A face in the crowd: recognizing peptides through database search. Molecular and Cellular Proteomics, 10(11):R111.009522, 2011.
[33] J. Ferlay, H.R. Shin, F. Bray, D. Forman, C. Mathers, and D.M. Parkin. Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008. International Journal of Cancer, 127(12):2893–2917, 2010.
[34] B. Fischer, J. Grossmann, V. Roth, W. Gruissem, S. Baginsky, and J.M. Buhmann. Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics, 22(14):e132–e140, 2006.
[35] A.M. Frank, M.M. Savitski, M.L. Nielsen, R.A. Zubarev, and P.A. Pevzner. De novo peptide sequencing and identification with precision mass spectrometry. Journal of Proteome Research, 6(1):114–123, 2007.
[36] B.J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
[37] M.M. Fuster and J.D. Esko. The sweet and sour of cancer: glycans as novel therapeutic targets. Nature Reviews Cancer, 5:526–542, 2005.
[38] A.E. Gelfand and A.F. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[39] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[40] E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[41] A.C. Gingras, M. Gstaiger, B. Raught, and R. Aebersold.
Analysis of protein complexes using mass spectrometry. Nature Reviews Molecular Cell Biology, 8(8):645–654, 2007.
[42] R. Goldman, H.W. Ressom, R.S. Varghese, L. Goldman, G. Bascug, C.A. Loffredo, M. Abdel-Hamid, I. Gouda, S. Ezzat, Z. Kyselova, Y. Mechref, and M.V. Novotny. Detection of hepatocellular carcinoma using glycomic analysis. Clinical Cancer Research, 15(5):1808–1813, 2009.
[43] C. Gonzaga-Jauregui, J.R. Lupski, and R.A. Gibbs. Human genome sequencing in health and disease. Annual Review of Medicine, 63:35–61, 2012.
[44] P.J. Green. Reversible-jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
[45] P.J. Green and K.V. Mardia. Bayesian alignment using hierarchical models, with applications in protein bioinformatics. Biometrika, 93(2):235–254, 2006.
[46] D. Greenbaum, C. Colangelo, K. Williams, and M. Gerstein. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology, 4(9):117, 2003.
[47] M. Gstaiger and R. Aebersold. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nature Reviews Genetics, 10(9):617–627, 2009.
[48] S. Gupta, S. Bent, and J. Kohlwes. Test characteristics of alpha-fetoprotein for detecting hepatocellular carcinoma in patients with hepatitis C. A systematic review and critical analysis. Annals of Internal Medicine, 139(1):46–50, 2003.
[49] K. Hashimoto, S. Goto, S. Kawano, K.F. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa. KEGG as a glycome informatics resource. Glycobiology, 16(5):63R–70R, 2006.
[50] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[51] A.M. Hawkridge and D.C. Muddiman. Mass spectrometry-based biomarker discovery: toward a global proteome index of individuality. Annual Review of Analytical Chemistry, 2:265–277, 2009.
[52] M. Hedlund, E. Ng, A. Varki, and N.M. Varki.
alpha 2-6-linked sialic acids on N-glycans modulate carcinoma differentiation in vivo. Cancer Research, 68(2):388–394, 2008.
[53] M.D. Hoffman and A. Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, in press.
[54] L. Hood, J.R. Heath, M.E. Phelps, and B. Lin. Systems biology and new technologies enable predictive and preventative medicine. Science, 306(5696):640–643, 2004.
[55] Y. Hu and Y. Mechref. Comparing MALDI-MS, RP-LC-MALDI-MS and RP-LC-ESI-MS glycomic profiles of permethylated N-glycans derived from model glycoproteins and human blood serum. Electrophoresis, 33(12):1768–1777, 2012.
[56] T. Ideker, T. Galitski, and L. Hood. A new approach to decoding life: systems biology. Annual Review of Genomics and Human Genetics, 2:343–372, 2001.
[57] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, and L. Hood. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929–934, 2001.
[58] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, 2004.
[59] J.D. Jaffe, D.R. Mani, K.C. Leptos, G.M. Church, M.A. Gillette, and S.A. Carr. PEPPeR, a platform for experimental proteomic pattern recognition. Molecular and Cellular Proteomics, 5(10):1927–1941, 2006.
[60] N. Jaitly, A. Mayampurath, K. Littlefield, J.N. Adkins, G.A. Anderson, and R.D. Smith. Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics, 10:87, 2009.
[61] N. Jaitly, M.E. Monroe, V.A. Petyuk, T.R. Clauss, J.N. Adkins, and R.D. Smith. Robust algorithm for alignment of liquid chromatography-mass spectrometry analyses in an accurate mass and time tag data analysis pipeline. Analytical Chemistry, 78(21):7397–7409, 2006.
[62] O.N. Jensen.
Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Current Opinion in Chemical Biology, 8(1):33–41, 2004.
[63] A.R. Joyce and B.Ø. Palsson. The model organism as a system: integrating ‘omics’ data sets. Nature Reviews Molecular Cell Biology, 7(3):198–210, 2006.
[64] D.L.B. Jupp. Approximation to data by splines with free knots. SIAM Journal on Numerical Analysis, 15(2):328–343, 1978.
[65] T. Kamiyama, H. Yokoo, J. Furukawa, M. Kurogochi, T. Togashi, N. Miura, K. Nakanishi, H. Kamachi, T. Kakisaka, Y. Tsuruga, M. Fujiyoshi, A. Taketomi, S. Nishimura, and S. Todo. Identification of novel serum biomarkers of hepatocellular carcinoma using glycomic analysis. Hepatology, 57(6):2314–2325, 2013.
[66] Y. Karpievitch, J. Stanley, T. Taverner, J. Huang, J.N. Adkins, C. Ansong, F. Heffron, T.O. Metz, W.J. Qian, H. Yoon, R.D. Smith, and A.R. Dabney. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics, 25(16):2028–2034, 2009.
[67] Y.V. Karpievitch, A.D. Polpitiya, G.A. Anderson, R.D. Smith, and A.R. Dabney. Liquid chromatography mass spectrometry-based proteomics: Biological and technological aspects. The Annals of Applied Statistics, 4(4):1797–1823, 2010.
[68] K. Kultima, A. Nilsson, B. Scholz, U.L. Rossbach, M. Falth, and P.E. Andren. Development and evaluation of normalization methods for label-free relative quantification of endogenous peptides. Molecular and Cellular Proteomics, 8(10):2285–2295, 2009.
[69] E.S. Lander. Initial impact of the sequencing of the human genome. Nature, 470(7333):187–197, 2011.
[70] E. Lange, C. Gropl, O. Schulz-Trieglaff, A. Leinenbach, C. Huber, and K. Reinert. A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics, 23(13):i273–i281, 2007.
[71] E. Lange, R. Tautenhahn, S. Neumann, and C. Gropl. Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements.
BMC Bioinformatics, 9:375, 2008.
[72] X.J. Li, E.C. Yi, C.J. Kemp, H. Zhang, and R. Aebersold. A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Molecular and Cellular Proteomics, 4(9):1328–1340, 2005.
[73] J. Listgarten and A. Emili. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Molecular and Cellular Proteomics, 4(4):419–434, 2005.
[74] J. Listgarten, R.M. Neal, S.T. Roweis, and A. Emili. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems, pages 817–824. MIT Press, 2005.
[75] J. Listgarten, R.M. Neal, S.T. Roweis, R. Puckrin, and S. Cutler. Bayesian detection of infrequent differences in sets of time series with shared structure. In Advances in Neural Information Processing Systems, pages 905–912. MIT Press, 2007.
[76] J. Listgarten, R.M. Neal, S.T. Roweis, P. Wong, and A. Emili. Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics, 23(2):e198–e204, 2007.
[77] X.E. Liu, L. Desmyter, C.F. Gao, W. Laroy, S. Dewaele, V. Vanhooren, L. Wang, H. Zhuang, N. Callewaert, C. Libert, R. Contreras, and C. Chen. N-glycomic changes in hepatocellular carcinoma patients with liver cirrhosis induced by hepatitis B virus. Hepatology, 46(5):1426–1435, 2007.
[78] P. Lu, C. Vogel, R. Wang, X. Yao, and E.M. Marcotte. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology, 25(1):117–124, 2007.
[79] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 17(20):2337–2342, 2003.
[80] R. Madsen, T. Lundstedt, and J. Trygg. Chemometrics in metabolomics–a review in human disease diagnosis.
Analytica Chimica Acta, 659(1-2):23–33, 2010.
[81] M. Mann and N.L. Kelleher. Precision proteomics: the case for high resolution and high mass accuracy. Proceedings of the National Academy of Sciences of the United States of America, 105(47):18132–18138, 2008.
[82] Y. Mechref, Y. Hu, A. Garcia, and A. Hussein. Identifying cancer biomarkers by mass spectrometry-based glycomics. Electrophoresis, 33(12):1755–1767, 2012.
[83] A. Mehta, P. Norton, H. Liang, M.A. Comunale, M. Wang, L. Rodemich-Betesh, A. Koszycki, K. Noda, E. Miyoshi, and T. Block. Increased levels of tetra-antennary N-linked glycan but not core fucosylation are associated with hepatocellular carcinoma tissue. Cancer Epidemiology, Biomarkers & Prevention, 21(6):925–933, 2012.
[84] Members of the Toxicogenomics Research Consortium. Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods, 2(5):351–356, 2005.
[85] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087–1092, 1953.
[86] K. Mitra, A.R. Carvunis, S.K. Ramesh, and T. Ideker. Integrative approaches for finding modular structure in biological networks. Nature Reviews Genetics, 14(10):719–732, 2013.
[87] J.S. Morris, K.R. Coombes, J. Koomen, K.A. Baggerly, and R. Kobayashi. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics, 21(9):1764–1775, 2005.
[88] L.N. Mueller, O. Rinner, A. Schmidt, S. Letarte, B. Bodenmiller, M.Y. Brusniak, O. Vitek, R. Aebersold, and M. Muller. Superhirn - a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics, 7(19):3470–3480, 2007.
[89] I. Murray. Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2007.
[90] R.M. Neal.
Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[91] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[92] R.M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G.L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall/CRC, 2011.
[93] A.I. Nesvizhskii. Protein identification by tandem mass spectrometry and sequence database searching. Methods in Molecular Biology, 367:87–119, 2007.
[94] N.V. Nielsen, J.M. Carstensen, and J. Smedsgaard. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. Journal of Chromatography A, 805(1-2):17–35, 1998.
[95] A.L. Oberg and O. Vitek. Statistical design of quantitative mass spectrometry-based proteomic experiments. Journal of Proteome Research, 8(5):2144–2156, 2009.
[96] G.J. Patti, O. Yanes, and G. Siuzdak. Innovation: Metabolomics: the apogee of the omics trilogy. Nature Reviews Molecular Cell Biology, 13(4):263–269, 2012.
[97] P. Picotti and R. Aebersold. Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nature Methods, 9(6):555–566, 2012.
[98] P. Picotti, O. Rinner, R. Stallmach, F. Dautel, T. Farrah, B. Domon, H. Wenschuh, and R. Aebersold. High-throughput generation of selected reaction-monitoring assays for proteins and proteomes. Nature Methods, 7(1):43–46, 2010.
[99] S.S. Pinho, C.A. Reis, J. Paredes, A.M. Magalhaes, A.C. Ferreira, J. Figueiredo, W. Xiaogang, F. Carneiro, F. Gartner, and R. Seruca. The role of N-acetylglucosaminyltransferase III and V in the post-transcriptional modifications of E-cadherin. Human Molecular Genetics, 18(14):2599–2608, 2009.
[100] T. Pluskal, S. Castillo, A. Villar-Briones, and M.
Oresic. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics, 11:395, 2010.
[101] K. Podwojski, A. Fritsch, D.C. Chamrad, W. Paul, B. Sitek, K. Stuhler, P. Mutzel, C. Stephan, H.E. Meyer, W. Urfer, K. Ickstadt, and J. Rahnenfuhrer. Retention time alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics, 25(6):758–764, 2009.
[102] D. Pousset, V. Piller, N. Bureaud, M. Monsigny, and F. Piller. Increased alpha2,6 sialylation of N-glycans in a transgenic mouse model of hepatocellular carcinoma. Cancer Research, 57(19):4249–4256, 1997.
[103] A. Prakash, P. Mallick, J. Whiteaker, H. Zhang, A. Paulovich, M. Flory, H. Lee, R. Aebersold, and B. Schwikowski. Signal maps for mass spectrometry-based comparative proteomics. Molecular and Cellular Proteomics, 5(3):423–32, 2006.
[104] J.T. Prince and E.M. Marcotte. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical Chemistry, 78(17):6140–6152, 2006.
[105] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[106] J.O. Ramsay and X. Li. Curve registration. Journal of the Royal Statistical Society, Series B, 60(2):351–363, 1998.
[107] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer, second edition, 2005.
[108] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[109] H.W. Ressom, R.S. Varghese, L. Goldman, Y. An, C.A. Loffredo, M. Abdel-Hamid, Z. Kyselova, Y. Mechref, M. Novotny, S.K. Drake, and R. Goldman. Analysis of MALDI-TOF mass spectrometry data for discovery of peptide and glycan biomarkers of hepatocellular carcinoma. Journal of Proteome Research, 7(2):603–610, 2008.
[110] G.O. Roberts and S.K. Sahu.
Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. Journal of the Royal Statistical Society, Series B, 59(2):291–317, 1997. [111] J.S. Rosenthal. Optimal proposal distributions and adaptive MCMC. In S. Brooks, A. Gelman, G.L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 93–111. Chapman & Hall/CRC, 2011. [112] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71(2):319–392, 2009. [113] L.R. Ruhaak, S. Miyamoto, and C.B. Lebrilla. Developments in the identification of glycan biomarkers for the detection of cancer. Molecular and Cellular Proteomics, 12(4):846–855, 2013. [114] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978. [115] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on, pages 576–584, Nov 2004. [116] A. Savitzky and M.J.E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8):1627–1639, 1964. [117] R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. Proceedings of the National Academy of Sciences of the United States of America, 102(6):1974–1979, 2005. [118] R. Siegel, J. Ma, Z. Zou, and A. Jemal. Cancer statistics, 2014. CA: A Cancer Journal for Clinicians, 64(1):9–29, 2014. [119] C.A. Smith, E.J. Want, G. O’Maille, R. Abagyan, and G. Siuzdak. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. 
Analytical Chemistry, 78(3):779–787, 2006. [120] H. Steen and M. Mann. The ABC’s (and XYZ’s) of peptide sequencing. Nature Reviews Molecular Cell Biology, 5(9):699–711, 2004. Bibliography 118 [121] M. Sturm, A. Bertsch, C. Gropl, A. Hildebrandt, R. Hussong, E. Lange, N. Pfeifer, O. Schulz-Trieglaff, A. Zerck, K. Reinert, and O. Kohlbacher. OpenMS - an opensource software framework for mass spectrometry. BMC Bioinformatics, 9:163, 2008. [122] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, and J.P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005. [123] M. Sysi-Aho, M. Katajamaa, L. Yetukuri, and M. Oresic. Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics, 8:93, 2007. [124] K. Tanabe, A. Deguchi, M. Higashi, H. Usuki, Y. Suzuki, Y. Uchimura, S. Kuriyama, and K. Ikenaka. Outer arm fucosylation of N-glycans increases in sera of hepatocellular carcinoma patients. Biochemical and Biophysical Research Communications, 374(2):219–225, 2008. [125] Z. Tang, R.S. Varghese, S. Bekesova, C.A. Loffredo, M.A. Hamid, Z. Kyselova, Y. Mechref, M. Novotny, R. Goldman, and H.W. Ressom. Identification of N-glycan serum markers associated with hepatocellular carcinoma from mass spectrometry data. Journal of Proteome Research, 9(1):104–112, 2010. [126] R. Tautenhahn, C. Bottcher, and S. Neumann. Annotation of LC/ESI-MS mass signals. In Sepp Hochreiter and Roland Wagner, editors, Bioinformatics Research and Development, volume 4414 of Lecture Notes in Computer Science, pages 371–380. Springer Berlin Heidelberg, 2007. [127] D. Telesca and L.Y. Inoue. Bayesian hierarchical curve registration. Journal of the American Statistical Association, 103(481):328–339, 2008. 
[128] D. Telesca, L.Y. Inoue, M. Neira, R. Etzioni, M. Gleave, and C. Nelson. Differential expression and network inferences through functional data modeling. Biometrics, 65(3):793–804, 2009. [129] G. Tomasi, F. van den Berg, and C. Andersson. Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. Journal of Chemometrics, 18(5):231–241, 2004. [130] F. Trevisani, P.E. D’Intino, A.M. Morselli-Labate, G. Mazzella, E. Accogli, P. Caraceni, M. Domenicali, S. De Notariis, E. Roda, and M. Bernardi. Serum alpha-fetoprotein for diagnosis of hepatocellular carcinoma in patients with chronic liver disease: influence of HBsAg and anti-HCV status. Journal of Hepatology, 34(4):570–575, 2001. Bibliography 119 [131] T. Trigano, U. Isserles, and Y. Ritov. Semiparametric curve alignment and shift density estimation for biological data. IEEE Transactions on Signal Processing, 59(5):1970– 1984, 2011. [132] T.-H. Tsai, M.G. Tadesse, Y. Wang, and H.W. Ressom. Profile-based LC-MS data alignment — a Bayesian approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2):494–503, 2013. [133] L. Tuli, T.-H. Tsai, R.S. Varghese, J.F. Xiao, A.K. Cheema, and H.W. Ressom. Using a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science, 10:13, 2012. [134] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422(6928):193–197, 2003. [135] M. Vandenbogaert, S. Li-Thiao-Te, H.M. Kaltenbach, R. Zhang, T. Aittokallio, and B. Schwikowski. Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics, 8(4):650–672, 2008. [136] B. Voss, M. Hanselmann, B.Y. Renard, M.S. Lindner, U. Kothe, M. Kirchner, and F.A. Hamprecht. SIMA: simultaneous multiple alignment of LC/MS peak lists. Bioinformatics, 27(7):987–993, 2011. [137] P. Wang, H. Tang, M.P. Fitzgibbon, M. McIntosh, M. Coram, H. Zhang, E. Yi, and R. Aebersold. 
A statistical method for chromatographic alignment of LC-MS data. Biostatistics, 8(2):357–367, 2007. [138] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009. [139] W. Windig, J.M. Phalp, and A.W. Payne. A noise and background reduction method for component detection in liquid chromatography/mass spectrometry. Analytical Chemistry, 68(20):3602–3606, 1996. [140] J.F. Xiao, B. Zhou, and H.W. Ressom. Metabolite identification and quantitation in LC-MS/MS-based metabolomics. Trends in Analytical Chemistry, 32:1–14, 2012. [141] M. Yao, D.P. Zhou, S.M. Jiang, Q.H. Wang, X.D. Zhou, Z.Y. Tang, and J.X. Gu. Elevated activity of N-acetylglucosaminyltransferase V in human hepatocellular carcinoma. Journal of Cancer Research and Clinical Oncology, 124(1):27–30, 1998. [142] T. Yu, Y. Park, J.M. Johnson, and D.P. Jones. apLCMS–adaptive processing of highresolution LC/MS data. Bioinformatics, 25(15):1930–1936, 2009. [143] J. Zaia. Mass spectrometry and glycomics. OMICS, 14(4):401–418, 2010. Bibliography 120 [144] P. Zhang, H. Li, H. Wang, S.T. Wong, and X. Zhou. Peak tree: a new tool for multiscale hierarchical representation and peak detection of mass spectrometry data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(4):1054–1066, 2011. [145] X. Zhang, J.M. Asara, J. Adamec, M. Ouzzani, and A.K. Elmagarmid. Data preprocessing in liquid chromatography-mass spectrometry-based proteomics. Bioinformatics, 21(21):4054–4059, 2005. [146] Y. Zhao, Y. Li, H. Ma, W. Dong, H. Zhou, X. Song, J. Zhang, and L. Jia. Modification of sialylation mediates the invasive properties and chemosensitivity of human hepatocellular carcinoma. Molecular and Cellular Proteomics, 13(2):520–536, 2014. [147] Y. Zhao, Y. Sato, T. Isaji, T. Fukuda, A. Matsumoto, E. Miyoshi, J. Gu, and N. Taniguchi. Branched N-glycans regulate the biological functions of integrins and cadherins. 
FEBS Journal, 275(9):1939–1948, 2008. [148] H. Zou and T. Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005. Appendix A Proteomic ground-truth data The ground-truth data were generated based on the Mascot search result. A list of MS/MS spectra with identification score > 60 and present in at least 10 out of 20 LC-MS/MS runs was compiled. Each peptide sequence was assigned to a peak detected by DifProWare based on its mass and retention time, which resulted in a list of consensus peaks (with the same identity). Putative matching was also performed to the runs without a qualified identification sequence. The list was further refined based on visual inspection of the extracted ion chromatogram of each consensus peak, where erroneous assignments were removed. Table A.1 presents the resulting ground-truth data, which consists of 273 unique peptide sequences from 70 unique proteins. Table A.1: Ground-truth peaks in the proteomic data set. Protein Peptide sequence Mascot score Mass (Da) Time (min.) transferrin [Homo sapiens] SASDLTWDNLK HSTIFENLANK EGYYGYTGAFR SVIPSDGPSVACVK MYLGYEYVTAIR SKEFQLFSSPHGK FDEFFSEGCAPGSK TAGWNIPMGLLYNK IECVSAETTEDCIAK NLNEKDYELLCLDGTR SAGWNIPIGLLYCDLPEPR AIAANEADAVTLDAGLVYDAYLAPNNLKPVVAEFYGSK IMNGEADAMSLDGGFVYIAGK SMGGKEDLIWELLNQAQEHFGKDK DSGFQMNQLR 77.7926 70.2421 69.3411 65.636 98.3025 73.9056 91.877 66.8031 115.0145 87.7071 97.2836 160.9273 104.5931 77.5 66.3677 1248.6029 1272.6483 1282.5659 1357.6949 1477.7339 1490.753 1519.6346 1576.818 1610.7215 1894.917 2113.0801 3953.0353 2158.0202 2772.3731 1194.5473 37.1555 32.4386 33.6282 32.0186 48.2243 31.2152 39.8019 58.3531 31.461 43.9851 70.5105 70.4842 53.0688 65.2475 30.9991 121 Appendix A. 
complement component 4A preproprotein [Homo sapiens]
  VGDTLNLNLR 94.4817 1113.6169 37.1683
  GLEEELQFSLGSK 86.64 1435.726 48.2367
  VLSLAQEQVGGSPEK 107.7717 1540.814 28.8422
  TTNIQGINLLFSSR 93.603 1562.8495 53.0759
  VTASDPLDTLGSEGALSPGGVASLLR 113.287 2482.3079 63.152
ceruloplasmin precursor [Homo sapiens]
  EVGPTNADPVCLAK 68.039 1412.7009 30.347
  QSEDSTFYLGER 71.9963 1430.6352 31.5729
  ALYLQYTDETFR 81.6025 1518.7414 43.22
  NNEGTYYSPNYNPQSR 70.3669 1902.8175 22.6015
  MYYSAVDPTKDIFTGLIGPMK 99.4777 2346.179 59.9307
  HYYIGIIETTWDYASDHGEK 66.3108 2397.1056 53.8928
  KAEEEHLGILGPQLHADVGDKVK 80.3927 2482.324 33.2863
  GPEEEHLGILGPVIWAEVGDTIR 69.6812 2486.2963 67.5311
  GVYSSDVFDIFPGTYQTLEMFPR 86.0423 2668.2703 72.9248
  ERGPEEEHLGILGPVIWAEVGDTIR 96.578 2771.4424 64.3123
  ADDKVYPGEQYTYMLLATEEQSPGEGDGNCVTR 106.7565 3635.6218 51.5539
  IDTINLFPATLFDAYMVAQNPGEWMLSCQNLNHLK 103.6335 4006.9624 84.3879
  GAYPLSIEPIGVR 76.9083 1370.7614 45.7183
  HIDREFVVMFSVVDENFSWYLEDNIK 78.0573 3230.5561 77.8872
complement factor H isoform a precursor [Homo sapiens]
  SLGNVIMVCR 68.355 1090.5647 41.9534
  TGESVEFVCK 72.1533 1097.5083 28.7364
  DGWSAQPTCIK 71.0445 1204.5571 29.2732
  KGEWVALNPLR 69.1553 1281.7192 41.5403
  SPDVINGSPISQK 65.0494 1340.6963 25.6546
  EIMENYNIALR 66.1253 1364.6808 38.7847
  SSNLIILEEHLK 65.2694 1394.7779 43.4492
  WSSPPQCEGLPCK 69.148 1430.6366 34.0066
  AGEQVTYTCATYYK 89.246 1596.7165 29.3064
  NTEILTGSWSDQTYPEGTQAIYK 95.8119 2601.2364 44.6598
apolipoprotein A-I preproprotein [Homo sapiens] >gi|55637005|ref|XP 508770.1| PREDICTED: similar to preproapolipoprotein AI isoform 5 [Pan troglodytes] >gi|114640451|ref|XP 001153269.1| PREDICTED: similar to preproapolipoprotein AI isoform 1 [Pan tr
  DLATVYVDVLK 83.8683 1234.6863 52.0847
  VSFLSALEEYTK 86.8663 1385.7174 61.7605
  DYVSQFEGSALGK 109.4789 1399.6676 40.2577
  VKDLATVYVDVLK 76.1879 1461.8492 46.2894
  VSFLSALEEYTKK 67.3931 1513.8113 56.674
  LLDNWDSVTSTFSK 100.5315 1611.7876 49.3435
  DSGRDYVSQFEGSALGK 101.3459 1814.8499 37.7129
  VKDLATVYVDVLKDSGR 86.4525 1877.0299 47.2567
  LREQLGPVTQEFWDNLEK 77.43 2201.1273 55.3091
  LREQLGPVTQEFWDNLEKETEGLR 73.0821 2886.4692 65.5815
  DLATVYVDVLKDSGRDYVSQFEGSALGK 67.7955 3031.5346 64.1003
inter-alpha (globulin) inhibitor H4 [Homo sapiens]
  LALDNGGLAR 64.6264 998.5513 27.5272
  AEAQAQYSAAVAK 85.3582 1306.6529 18.4805
  ITFELVYEELLK 99.0484 1495.8269 68.463
  NMEQFQVSVSVAPNAK 79.4064 1747.8622 38.2361
  NPLVWVHASPEHVVVTR 74.2842 1939.047 38.5479
  QGPVNLLSDPEQGVEVTGQYER 87.5953 2414.1836 44.6506
  FSSHVGGTLGQFYQEVLWGSPAASDDGRR 84.0694 3123.5033 57.7378
  DQFNLIVFSTEATQWRPSLVPASAENVNK 106.5981 3260.6664 65.8521
  AGFSWIEVTFK 67.8206 1283.6607 59.7794
  SPEQQETVLDGNLIIR 101.1672 1810.9506 44.7905
  RLDYQEGPPGVEISCWSVEL 99.9945 2276.0945 61.8115
  QGPVNLLSDPEQGVEVTGQYEREK 91.8931 2671.3161 42.3906
  DTDRFSSHVGGTLGQFYQEVLWGSPAASDDGRR 92.635 3610.7076 57.4664
serine proteinase inhibitor, clade A, member 1 [Homo sapiens] >gi|50363219|ref|NP 001002236.1| serine proteinase inhibitor, clade A, member 1 [Homo sapiens] >gi|50363221|ref|NP 001002235.1| serine proteinase inhibitor, clade A, member 1 [Homo sapien
  SVLGQLGITK 72.9017 1014.6097 40.9374
  ITPNLAEFAFSLYR 93.9995 1640.8659 66.9569
  VFSNGADLSGVTEEAPLK 120.7722 1832.9214 38.8215
  FNKPFVFLMIEQNTK 76.248 1854.982 58.6251
  ELDRDTVFALVNYIFFK 67.648 2089.1019 82.852
  VFSNGADLSGVTEEAPLKLSK 84.7684 2161.1347 41.9904
  GTEAAGAMFLEAIPMSIPPEVK 80.6075 2258.148 65.1551
  LVDKFLEDVKK 64.844 1332.7679 31.642
  LYHSEAFTVNFGDTEEAKK 67.1459 2185.0388 36.4992
  TLNQPDSQLQLTTGNGLFLSEGLK 106.2309 2573.3522 54.3709
  IVDLVKELDRDTVFALVNYIFFK 84.9686 2756.5326 89.2975
  KLYHSEAFTVNFGDTEEAKK 73.8193 2313.1361 33.4614
complement factor B preproprotein [Homo sapiens]
  VKDISEVVTPR 74.7225 1241.6958 24.1633
  VSEADSSNADWVTK 98.1875 1507.682 23.9549
  LLQEGQALEYVCPSGFYPYPVQTR 75.8527 2757.3679 56.8651
  NPREDYLDVYVFGVGPLVNQVNINALASK 118.9275 3203.6807 66.7554
  KNPREDYLDVYVFGVGPLVNQVNINALASK 79.3271 3331.7774 63.7146
  DMENLEDVFYQMIDESQSLSLCGMVWEHR 152.7023 3503.5341 89.0342
  YGLVTYATYPK 71.6214 1274.6584 35.7583
serpin peptidase inhibitor, clade G, member 1 precursor [Homo sapiens] >gi|73858570|ref|NP 001027466.1| serpin peptidase inhibitor, clade G, member 1 precursor [Homo sapiens]
  LLDSLPSDTR 67.2247 1115.5841 28.7276
  LVLLNAIYLSAK 82.0444 1316.8148 58.3024
  VTTSQDMLSIMEK 97.2353 1481.7159 43.8197
  LEDMEQALSPSVFK 93.6837 1592.7833 47.9032
  GVTSVSQIFHSPDLAIR 68.8577 1825.9764 49.5992
  HRLEDMEQALSPSVFK 84.7535 1885.9402 42.3374
  TLLVFEVQQPFLFVLWDQQHK 70.352 2614.4093 77.4467
serpin peptidase inhibitor, clade C, member 1 [Homo sapiens]
  VAEGTQVLELPFK 72.1855 1429.7882 49.6683
  ADGESCSASMMYQEGK 89.5863 1692.6472 24.3281
  AFLEVNEEGSEAAASTAVVIAGR 117.64 2290.1548 47.7842
  ITDVIPSEAINELTVLVLVNTIYFK 103.078 2803.59 91.7838
  ELTPEVLQEWLDELEEMMLVVHMPR 68.6858 3065.5194 91.2884
  VEKELTPEVLQEWLDELEEMMLVVHMPR 80.3672 3421.72 89.1198
  VAEGTQVLELPFKGDDITMVLILPKPEK 73.4173 3079.7119 64.6367
inter-alpha globulin inhibitor H2 polypeptide [Homo sapiens]
  SSALDMENFR 71.8625 1168.5211 34.176
  IQPSGGTNINEALLR 103.036 1581.8533 35.4084
  LWAYLTINQLLAER 85.5421 1702.9512 67.8821
  VVNNSPQPQNVVFDVQIPK 86.9511 2121.1312 47.4299
  SILQMSLDHHIVTPLTSLVIENEAGDER 87.946 3116.6023 60.9536
  NVKENIQDNISLFSLGMGFDVDYDFLKR 89.3794 3276.6337 69.7602
  FDPAKLDQIESVITATSANTQLVLETLAQMDDLQDFLSK 102.3133 4308.2102 91.668
  ETAVDGELVVLYDVK 71.5467 1648.8645 51.1432
serpin peptidase inhibitor, clade A, member 3 precursor [Homo sapiens]
  ITLLSALVETR 77.8327 1214.7308 57.4577
  LYGSEAFATDFQDSAAAK 109.1178 1890.8708 41.5059
  GTHVDLGLASANVDFAFSLYK 87.933 2224.1298 59.6488
  DYNLNDILLQLGIEEAFTSK 119.9825 2295.1781 76.4287
  GKITDLIKDLDSQTMMVLVNYIFFK 88.4409 2931.5746 92.8801
inter-alpha (globulin) inhibitor H1 [Homo sapiens]
  QAVDTAVDGVFIR 80.1283 1389.7298 40.6757
  QYYEGSEIVVAGR 82.7705 1469.7199 29.5462
  LWAYLTIQELLAK 87.5894 1560.9016 72.1131
  ILGDMQPGDYFDLVLFGTR 90.7527 2156.0759 72.5335
  GMADQDGLKPTIDKPSEDSPPLEMLGPR 73.5464 2993.4595 43.2395
  GIEILNQVQESLPELSNHASILIMLTDGDPTEGVTDR 135.0546 4004.0306 74.5964
  GFSLDEATNLNGGLLR 94.6844 1675.8619 50.9258
  GSLVQASEANLQAAQDFVR 118.83 2003.0164 51.5598
apolipoprotein B precursor [Homo sapiens]
  SVSDGIAALDLNAVANK 100.6533 1656.8768 45.6344
  HSITNPLAVLCEFISQSIK 83.2593 2099.1239 91.3152
  VPSYTLILPSLELPVLHVPR 77.5581 2242.3232 70.0301
  LTISEQNIQR 68.74 1200.6494 25.3496
plasminogen [Homo sapiens]
  FVTWIEGVMR 72.8823 1236.6396 54.7743
  QLGAGSIEECAAK 72.862 1275.6154 27.6392
  NPDGDVGGPWCYTTNPR 120.2067 1847.7967 35.8907
  VILGAHQEVNLEPHVQEIEVSR 78.3774 2495.3215 39.0943
  TPENYPNAGLTMNYCR 101.8809 1842.8145 35.9689
apolipoprotein A-IV precursor [Homo sapiens]
  ALVQQMEQLR 73.9028 1214.6479 36.9434
  LGEVNTYAGDLQK 78.927 1406.7076 28.5615
  KLVPFATELHER 65.2054 1438.7936 32.8886
  LKEEIGKELEELR 66.2947 1584.8759 36.5651
  SELTQQLNALFQDK 105.6494 1633.842 58.1552
  SLAELGGHLDQQVEEFR 95.9063 1926.9509 45.6714
  SLAELGGHLDQQVEEFRR 66.7573 2083.0512 43.2074
  LGPHAGDVEGHLSFLEK 69.5583 1804.9119 38.3486
angiotensinogen preproprotein [Homo sapiens]
  FMQAVTGWK 64.7875 1066.5296 36.4695
  ALQDQLVLVAAK 82.6965 1267.7541 40.9959
  ADSQAQLLLSTVVGVFTAPGLHLK 75.0783 2464.3831 82.3134
  TIHLTMPQLVLQGSYDLQDLLAQAELPAILHTELNLQK 129.8611 4267.3052 88.6288
alpha 1B-glycoprotein precursor [Homo sapiens]
  ATWSGAVLAGR 81.357 1087.5797 33.0551
  SGLSTGWTQLSK 73.404 1263.6503 36.346
  SLPAPWLSMAPVSWITPGLK 79.306 2150.1745 74.5875
  TPGAAANLELIFVGPQHAGNYR 109.3616 2295.1889 52.8175
  SWVPHTFESELSDPVELLVAES 81.7562 2470.2058 69.8822
  LHDNQNGWSGDSAPVELILSDETLPAPEFSPEPESGR 99.1515 3989.8813 60.9303
vitamin D-binding protein precursor [Homo sapiens]
  HLSLLTTLSNR 67.6305 1253.7105 40.161
  VMDKYTFELSR 69.152 1387.6809 34.0999
  FPSGTFEQVSQLVK 81.16 1565.8181 49.626
  KFPSGTFEQVSQLVK 76.8736 1693.9103 46.6934
  VPTADLEDVLPLAEDITNILSK 84.1179 2365.2762 85.2436
  EYANQFMWEYSTNYGQAPLSLLVSYTK 78.6581 3202.5168 69.9468
  LAQKVPTADLEDVLPLAEDITNILSK 82.7457 2805.5578 80.8053
  EFSHLGKEDFTSLSLVLYSR 95.878 2327.1944 54.0283
afamin precursor [Homo sapiens]
  ESLLNHFLYEVAR 69.7321 1589.8281 60.3031
  IAPQLSTEELVSLGEK 100.5375 1712.9275 46.4254
  RNPFVFAPTLLTVAVHFEEVAK 85.5395 2484.3678 74.4662
  LKHELTDEELQSLFTNFANVVDK 80.4831 2689.3755 66.7315
gelsolin isoform a precursor [Homo sapiens]
  AGALNSNDAFVLK 72.2257 1318.6925 37.6181
  TPSAAYLWVGTGASEAEK 117.8583 1836.8954 42.8101
  AQPVQVAEGSEPDGFWEALGGK 77.2736 2271.0944 53.307
  TPSAAYLWVGTGASEAEKTGAQELLR 91.0018 2705.3847 54.0121
alpha-2-HS-glycoprotein [Homo sapiens]
  EHAVEGDCDFQLLK 70.8065 1602.7391 39.9996
  TVVQPSVGAAAGPVVPPCPGR 93.1778 1958.0475 34.7402
  HTFMGVVSLGSPSGEVSHPR 100.509 2080.0233 37.5547
  AQLVPLPPSTYVEFTVSGTDCVAK 74.3933 2521.2942 57.8151
alpha-2-plasmin inhibitor [Homo sapiens]
  IQEFLSGLPEDTVLLLLNAIHFQGFWR 75.7836 3155.7082 92.6985
  MSLSSFSVNRPFLFFIFEDTTGLPLFVGSVR 79.1313 3509.8274 89.5949
hemopexin precursor [Homo sapiens]
  SGAQATWTELPWPHEK 70.3061 1836.8872 46.183
  SGAQATWTELPWPHEKVDGALCMEK 82.272 2783.3191 50.7631
  DGWHSWPIAHQWPQGPSAVDAAFSWEEK 72.103 3218.4878 58.9136
  DGWHSWPIAHQWPQGPSAVDAAFSWEEKLYLVQGTQVYVFLTK 72.925 4971.4787 76.7352
  GGYTLVSGYPK 70.037 1140.5837 30.9613
  YYCFQGNQFLR 63.224 1437.6563 44.9582
  SLGPNSCSANGPGLYLIHGPNLYCYSDVEK 87.3833 3167.4891 53.4818
  EVGTPHGIILDSVDAAFICPGSSR 91.2262 2440.2264 56.0383
kininogen 1 isoform 2 [Homo sapiens]
  IGEIKEETTSHLR 79.652 1511.7946 20.3646
  DIPTNSPELEETLTHTITK 90.0725 2138.0847 45.9273
  IASFSQNCDIYPGKDFVQPPTK 69.7439 2454.1988 43.0719
fibronectin 1 isoform 3 preproprotein [Homo sapiens]
  DLQFVEVTDVK 78.3227 1291.6697 42.2396
  SSPVVIDASTAIDAPSNLR 91.4045 1911.9969 42.1224
  RPGGEPSPEGTTGQSYNQYSQR 87.2975 2395.0829 19.235
  TEIDKPSQMQVTDVQDNSISVK 73.1918 2461.2083 32.0907
  VPGTSTSATLTGLTR 86.435 1460.7873 32.0147
coagulation factor II preproprotein [Homo sapiens]
  SPQELLCGASLISDR 97.0471 1587.7997 48.7403
  SEGSSVNLSPPLEQCVPDR 75.202 2012.9552 41.1497
  NPDSSTTGPWCYTTDPTVR 85.5528 2096.9187 37.0768
  IVEGSDAEIGMSPWQVMLFR 89.399 2264.1106 66.9951
  SEGSSVNLSPPLEQCVPDRGQQYQGR 65.8008 2830.3404 38.2878
  ELLESYIDGR 70.2144 1193.5965 39.624
C-type lectin domain family 3, member B [Homo sapiens]
  SRLDTLAQEVALLK 65.0436 1555.8989 51.7843
  GGTLGTPQTGSENDALYEYLR 109.7033 2241.065 46.8582
keratin 1 [Homo sapiens]
  FLEQQNQVLQTK 76.1 1474.7827 28.0465
inter-alpha (globulin) inhibitor H3 preproprotein [Homo sapiens]
  LWAYLTIEQLLEK 91.3425 1618.9074 72.8909
serum amyloid P component precursor [Homo sapiens]
  VGEYSLYIGR 74.0706 1155.5952 35.8244
  QGYFVEAQPK 75.6519 1165.5793 25.3918
  AYSLFSYNTQGR 85.7937 1405.6675 39.3873
  IVLGQEQDSYGGKFDR 82.3956 1810.8886 28.703
complement component 6 precursor [Homo sapiens] >gi|189242612|ref|NP 001108603.2| complement component 6 precursor [Homo sapiens]
  IFDDFGTHYFTSGSLGGVYDLLYQFSSEELK 90.5336 3534.6703 81.5984
complement component 4 binding protein, alpha chain precursor [Homo sapiens]
  LSLEIEQLELQR 69.34 1469.8169 49.3183
  FSAICQGDGTWSPR 75.7408 1523.6875 36.2249
  KPDVSHGEMVSGFGPIYNYK 73.1646 2224.0678 38.765
  KPDVSHGEMVSGFGPIYNYKDTIVFK 93.1407 2927.4645 48.6949
heparin cofactor II precursor [Homo sapiens]
  FTVDRPFLFLIYEHR 68.176 1952.039 60.3305
alpha-2-glycoprotein 1, zinc [Homo sapiens]
  YSLTYIYTGLSK 97.8745 1407.7346 47.596
  NILDRQDPPSVVVTSHQAPGEK 79.4367 2386.2297 29.8159
  HVEDVPAFQALGSLNDLQFFR 82.2165 2402.2176 67.6579
apolipoprotein E precursor [Homo sapiens]
  AATVGSLAGQPLQER 73.6921 1496.7986 29.4961
serine (or cysteine) proteinase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 1 [Homo sapiens]
  TVQAVLTVPK 66.3931 1054.6408 32.2106
  LAAAVSNFGYDLYR 78.6684 1558.7848 45.8008
  IAQLPLTGSMSIIFFLPLK 90.3446 2088.217 82.2969
peptidoglycan recognition protein 2 precursor [Homo sapiens]
  HTASAWLMSAPNSGPHNR 71.894 1932.9048 29.4364
  DGSPDVTTADIGANTPDATK 106.6953 1944.8967 25.9227
  AGLLRPDYALLGHR 65.826 1550.8713 39.2893
apolipoprotein A-II preproprotein [Homo sapiens]
  AGTELVNFLSYFVELGTQPATQ 69.158 2384.2065 86.029
  KAGTELVNFLSYFVELGTQPATQ 87.4082 2512.302 81.0994
  EPCVESLVSQYFQTVTDYGKDLMEK 80.145 2908.3726 80.8426
complement component 1, r subcomponent [Homo sapiens]
  TLDEFTIIQNLQPQYQFR 71.05 2253.1568 60.3064
  LFGEVTSPLFPKPYPNNFETTTVITVPTGYR 76.72 3484.813 60.2166
retinol-binding protein 4, plasma precursor [Homo sapiens] >gi|113865843|ref|NP 001038960.1| retinol-binding protein 4, plasma [Pan troglodytes]
  LLNNWDVCADMVGTFTDTEDPAK 110.5675 2554.1516 66.0509
  KDPEGLFLQDNIVAEFSVDETGQMSATAK 100.4329 3139.5233 60.9867
vitronectin precursor [Homo sapiens]
  SIAQYWLGCPAPGHL 74.7238 1611.7979 56.8878
  DVWGIEGPIDAAFTR 69.6887 1645.822 58.8065
corticosteroid binding globulin precursor [Homo sapiens]
  IVDLFSGLDSPAILVLVNYIFFK 134.4473 2582.46 94.6978
clusterin isoform 1 [Homo sapiens]
  ELDESLQVAER 65.0356 1287.6338 27.8548
  VTTVASHTSDSDVPSGVTEVVVK 74.75 2313.1772 32.1829
  LFDSDPITVTVPVEVSR 88.712 1872.9918 50.3583
protein S, alpha preproprotein [Homo sapiens]
  TYDSEGVILYAESIDHSAWLLIALR 89.1809 2834.4632 79.8842
complement component 8, gamma polypeptide [Homo sapiens]
  SLPVSDSVLSGFEQR 94.8744 1619.8243 47.4286
histidine-rich glycoprotein precursor [Homo sapiens]
  GGEGTGYFVDFSVR 86.3563 1489.6902 46.5406
  DSPVLIDFFEDTER 91.1205 1681.7938 63.0791
orosomucoid 1 precursor [Homo sapiens]
  EQLGEFYEALDCLR 85.8347 1684.7879 61.8452
  YVGGQEHFAHLLILR 72.82 1751.9501 44.5235
insulin-like growth factor binding protein, acid labile subunit isoform 2 precursor [Homo sapiens]
  LAELPADALGPLQR 74.3792 1462.8203 45.1268
  VAGLLEDTFPGLLGLR 81.6047 1669.9516 68.7936
apolipoprotein H precursor [Homo sapiens]
  VCPFAGILENGAVR 83.5184 1444.7578 53.9545
serine (or cysteine) proteinase inhibitor, clade A, member 7 [Homo sapiens]
  NALALFVLPK 70.99 1084.6702 57.5629
  EGQMESVEAAMSSK 82.3525 1482.638 31.6632
  FSISATYDLGATLLK 79.4854 1598.865 59.9171
  SFMLLILER 62.064 1120.637 63.3066
alpha-1-microglobulin/bikunin preproprotein [Homo sapiens]
  AFIQLWAFDAVK 90.4755 1407.7642 64.3939
coagulation factor XII precursor [Homo sapiens]
  VVGGLVALR 76.3167 882.5645 35.0509
complement factor H-related 1 [Homo sapiens] >gi|239758113|ref|XP 002346300.1| PREDICTED: similar to complement factor H-related 1 isoform 1 [Homo sapiens]
  EIMENYNIALR 66.21 1364.6808 38.7847
  STDTSCVNPPTVQNAHILSR 71.744 2139.0435 31.8693
apolipoprotein C-III precursor [Homo sapiens]
  DALSSVQESQVAQQAR 116.276 1715.8487 26.7761
pro-platelet basic protein precursor [Homo sapiens]
  GKEESLDSDLYAELR 111.1358 1723.8316 41.2275
complement component 8, alpha polypeptide precursor [Homo sapiens]
  LGSLGAACEQTQTEGAK 92.2314 1662.7941 24.8774
paraoxonase 1 precursor [Homo sapiens]
  ILLMDLNEEDPTVLELGITGSK 94.6882 2399.2661 63.7875
alpha-2-macroglobulin precursor [Homo sapiens]
  FEVQVTVPK 66.354 1045.5833 35.7438
  VGFYESDVMGR 81.3462 1258.5725 34.8086
leucine-rich alpha-2-glycoprotein 1 [Homo sapiens]
  VAAGAFQGLR 82.2481 988.546 29.2645
apolipoprotein C-II precursor [Homo sapiens]
  STAAMSTYTGIFTDQVLSVLKGEE 87.0789 2547.2585 74.422
plasma kallikrein B1 precursor [Homo sapiens]
  IAYGTQGSSGYSLR 85.4547 1458.7144 26.1314
ubiquitin and ribosomal protein S27a precursor [Homo sapiens] ribosomal protein S27a [Bos taurus] >gi|27807503|ref|NP 777203.1| >gi|62859181|ref|NP 001016172.1| hypothetical protein LOC548926 [Xenopus (Silurana) tropicalis] >gi|148222699|ref|NP 0010
  TITLEVEPSDTIENVK 82.5073 1786.9272 41.5314
complement component 1, s subcomponent precursor [Homo sapiens] >gi|41393602|ref|NP 958850.1| complement component 1, s subcomponent precursor [Homo sapiens]
  GFQVVVTLR 71.1854 1017.5919 42.253
  TNFDNDIALVR 69.764 1276.6445 38.3258
albumin preproprotein [Homo sapiens] >gi|197098046|ref|NP 001127106.1| albumin [Pongo abelii]
  ALVLIAFAQYLQQCPFEDHVK 75.852 2432.2704 77.0452
  RHPDYSVVLLLR 70.1975 1466.8393 44.7459
complement factor I preproprotein [Homo sapiens]
  VFSLQWGEVK 71.9273 1191.6329 48.0329
  GLETSLAECTFTK 84.56 1398.6752 43.7576
complement component 4B preproprotein [Homo sapiens]
  VGDTLNLNLR 94.1364 1113.6169 37.1683
  LNMGITDLQGLR 75.185 1329.7129 46.4189
  GLEEELQFSLGSK 85.9882 1435.726 48.2367
  VLSLAQEQVGGSPEK 103.8555 1540.814 28.8422
  TTNIQGINLLFSSR 93.4373 1562.8495 53.0759
  MRPSTDTITVMVENSHGLR 64.412 2143.0568 39.6596
  VTASDPLDTLGSEGALSPGGVASLLR 118.43 2482.3079 63.152
carboxypeptidase N, polypeptide 1 precursor [Homo sapiens]
  IHILPSMNPDGYEVAAAQGPNKPGYLVGR 87.647 3063.5734 45.2203
beta globin [Homo sapiens] >gi|55635219|ref|XP 508242.1| PREDICTED: hypothetical protein [Pan troglodytes]
  VNVDEVGGEALGR 73.0646 1313.6622 30.1181
apolipoprotein D precursor [Homo sapiens]
  MTVTDQVNCPK 72.927 1234.5702 23.4879

Appendix B Glycomic ground-truth data

The ground-truth data were generated based on a list of human serum N-glycans characterized by the number of monosaccharides: N-acetylglucosamine (GlcNAc), hexose, fucose, N-acetylneuraminic acid (NeuNAc). The putative compositions were assigned by comparison of measured mass values with theoretical values, in consideration of hydrogen adducts. Erroneous assignments were removed based on visual inspection of the extracted ion chromatogram of each glycan. Table B.1 presents all the 106 ground-truth peaks considered in this study.

Table B.1: Ground-truth peaks in the glycomic data set.
Monosaccharide composition (GlcNAc hexose fucose NeuNAc)  Mass (Da)  Charge  Time (min.)
  3 3 0 0 1409.7515 1 22.796
  3 3 0 0 1409.7515 2 23.543
  3 3 0 0 1409.7515 2 23.9518
  2 5 0 0 1572.8248 1 26.3688
  2 5 0 0 1572.8248 2 26.3784
  3 3 1 0 1583.8407 2 24.6975
  3 4 0 0 1613.8513 2 23.9812
  3 4 0 0 1613.8513 2 25.7926
  3 4 1 0 1787.9405 2 25.1786
  3 4 1 0 1787.9405 2 27.1283
  4 3 0 0 1654.8778 1 23.8111
  4 3 0 0 1654.8778 2 24.0786
  2 6 0 0 1776.9246 1 28.4588
  2 6 0 0 1776.9246 2 28.4733
  3 5 0 0 1817.9511 2 26.9831
  4 3 1 0 1828.967 1 25.2808
  4 3 1 0 1828.967 2 25.2493
  4 4 0 0 1858.9776 1 24.3177
  4 4 0 0 1858.9776 2 24.546
  5 3 0 0 1900.0041 1 26.6363
  5 3 0 0 1900.0041 2 26.7748
  5 3 0 0 1900.0041 3 26.7692
  2 7 0 0 1981.0244 1 30.5505
  2 7 0 0 1981.0244 2 30.7067
  3 6 0 0 2022.0509 2 27.2873
  4 4 1 0 2033.0668 2 25.81
  4 4 1 0 2033.0668 3 25.852
  4 5 0 0 2063.0774 2 25.0111
  4 5 0 0 2063.0774 2 29.7737
  5 3 1 0 2074.0933 2 27.9739
  5 3 1 0 2074.0933 3 27.7071
  5 4 0 0 2104.1039 2 27.0821
  5 4 0 0 2104.1039 3 27.1319
  2 8 0 0 2185.1242 2 32.5645
  2 8 0 0 2185.1242 2 35.3194
  3 5 0 1 2179.1248 2 28.1398
  3 5 0 1 2179.1248 3 27.8983
  4 5 1 0 2237.1666 2 26.259
  4 4 0 1 2220.1513 2 26.2445
  4 4 0 1 2220.1513 3 26.2604
  5 4 1 0 2278.1931 2 28.3308
  5 4 1 0 2278.1931 3 28.3576
  5 5 0 0 2308.2037 2 27.0912
  5 5 0 0 2308.2037 3 27.0925
  2 9 0 0 2389.224 2 34.3466
  2 9 0 0 2389.224 2 36.979
  2 10 0 0 2593.3238 2 39.3056
  4 4 1 1 2394.2405 2 27.0948
  4 4 1 1 2394.2405 3 27.3372
  4 6 1 0 2441.2664 2 26.493
  4 6 1 0 2441.2664 3 26.6756
  4 5 0 1 2424.2511 2 26.6974
  4 5 0 1 2424.2511 3 26.7318
  5 5 1 0 2482.2929 2 28.2663
  5 5 1 0 2482.2929 3 28.2549
  4 5 1 1 2598.3403 2 27.9223
  4 5 1 1 2598.3403 3 27.8264
  5 4 1 1 2639.3668 2 29.9275
  5 4 1 1 2639.3668 3 30.0366
  5 6 1 0 2686.3927 2 28.8021
  5 5 0 1 2669.3774 2 28.8598
  5 5 0 1 2669.3774 3 28.8853
  6 6 0 0 2757.4298 2 20.8819
  6 6 0 0 2757.4298 3 20.9064
  6 6 0 0 2757.4298 3 21.7782
  4 6 1 1 2802.4401 2 28.1754
  4 6 1 1 2802.4401 3 28.3406
  5 5 1 1 2843.4666 2 29.9053
  4 5 0 2 2785.4248 2 28.2106
  4 5 0 2 2785.4248 3 28.2033
  4 7 2 0 2819.4554 3 27.8066
  5 6 0 1 2873.4772 3 28.0849
  5 6 0 1 2873.4772 3 30.495
  4 5 1 2 2959.514 2 29.4338
  4 5 1 2 2959.514 3 29.3898
  4 5 1 2 2959.514 4 29.7546
  4 6 0 2 2989.5246 3 29.8786
  5 6 2 0 2860.4819 2 29.9152
  5 6 2 0 2860.4819 3 30.0363
  5 6 1 1 3047.5664 2 31.6121
  5 5 1 2 3204.6403 2 31.3248
  5 5 1 2 3204.6403 4 31.3569
  5 6 2 1 3221.6556 2 31.3066
  5 6 2 1 3221.6556 3 31.3643
  5 6 0 2 3234.6509 3 29.042
  5 6 1 2 3408.7401 2 29.8497
  5 6 1 2 3408.7401 4 30.0652
  5 6 1 2 3408.7401 4 32.952
  5 6 0 3 3595.8246 2 29.7408
  5 6 0 3 3595.8246 3 29.5267
  6 7 0 2 3683.877 3 30.0214
  6 7 0 2 3683.877 4 30.0263
  5 6 1 3 3769.9138 2 30.7326
  5 6 2 3 3944.003 3 32.041
  5 6 2 3 3944.003 4 32.0927
  6 7 2 2 4032.0554 3 32.8241
  6 7 1 1 3496.7925 3 29.8974
  6 7 1 2 3857.9662 3 31.2025
  6 7 1 3 4219.1399 3 32.3782
  3 5 1 0 1992.0403 2 27.24
  3 5 1 0 1992.0403 2 28.3142
  3 5 1 1 2353.214 2 29.1873
  4 6 1 2 3163.6138 3 37.2819
  5 7 1 2 3612.8399 3 29.6059
  5 7 1 2 3612.8399 4 29.7611
  7 7 0 1 3567.8296 4 26.9435
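
The composition assignment described in Appendix B, in which measured mass values are compared with theoretical values in consideration of hydrogen adducts, can be illustrated with a minimal sketch. It assumes that the "Mass (Da)" column of Table B.1 holds neutral masses and that each charge corresponds to one added proton; neither assumption is confirmed by the text. The function names, the 10 ppm tolerance, and the two-entry candidate list are hypothetical and are not the procedure actually used in this study.

```python
# Illustrative sketch only: names, tolerance, and candidate list are assumptions.
PROTON_MASS = 1.007276  # Da, mass added per hydrogen adduct (proton)


def theoretical_mz(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def ppm_error(measured_mz, expected_mz):
    """Signed relative deviation of a measured m/z, in parts per million."""
    return (measured_mz - expected_mz) / expected_mz * 1e6


def match_composition(measured_mz, charge, candidates, tol_ppm=10.0):
    """Return (composition, ppm error) of the closest candidate within tolerance.

    `candidates` maps (GlcNAc, hexose, fucose, NeuNAc) tuples to neutral masses.
    Returns None when no candidate lies within `tol_ppm`.
    """
    best = None
    for composition, mass in candidates.items():
        err = ppm_error(measured_mz, theoretical_mz(mass, charge))
        if abs(err) <= tol_ppm and (best is None or abs(err) < abs(best[1])):
            best = (composition, err)
    return best


# Two compositions taken from Table B.1, used here as a toy candidate list.
candidates = {
    (3, 3, 0, 0): 1409.7515,
    (2, 5, 0, 0): 1572.8248,
}

hit = match_composition(705.8830, 2, candidates)
miss = match_composition(800.0, 2, candidates)
```

Under these assumptions, a doubly charged ion measured near m/z 705.883 is assigned to the first composition, while an ion at m/z 800.0 matches neither candidate. In practice a putative match of this kind would still be subject to the visual inspection of extracted ion chromatograms described above.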