Bayesian Alignment Model for Analysis of LC-MS-based Omic Data
Tsung-Heng Tsai
Dissertation submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Electrical Engineering
Yue Wang, Chair
Luiz A. DaSilva
Seong K. Mun
Habtom W. Ressom
Jianhua Xuan
Guoqiang Yu
April 18, 2014
Arlington, Virginia
Keywords: alignment, Bayesian inference, biomarker discovery, liquid
chromatography-mass spectrometry (LC-MS), Markov chain Monte Carlo (MCMC)
Copyright 2014, Tsung-Heng Tsai
Bayesian Alignment Model for Analysis of LC-MS-based Omic Data
Tsung-Heng Tsai
(ABSTRACT)
Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in
various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps
are needed to detect true differences between biological groups. Retention time alignment,
which ensures that ion intensity measurements across multiple LC-MS runs are comparable,
is one of the most important yet challenging preprocessing steps. In this dissertation,
we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses
Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters
and provides estimates of the retention time variability along with uncertainty measures,
providing a natural framework for integrating information from various sources. From methodology
development to practical application, we investigate the alignment problem through three
research topics: 1) development of single-profile Bayesian alignment model, 2) development of
multi-profile Bayesian alignment model, and 3) application to biomarker discovery research.
Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram,
e.g., base peak chromatogram from each LC-MS run. The single-profile alignment model
improves on existing MCMC-based alignment methods through 1) the implementation of an
efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive
mechanism for knot specification using stochastic search variable selection (SSVS).
Chapter 3 extends the model to integrate complementary information that better captures
the variability in chromatographic separation. We use Gaussian process regression on the
internal standards to derive a prior distribution for the mapping functions. In addition, a
clustering approach is proposed to identify multiple representative chromatograms for each
LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously
considered in the profile-based alignment, which greatly improves the model estimation and
facilitates the subsequent peak matching process.
Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to
biomarker discovery research. We integrate the proposed Bayesian alignment model into
a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and
confirmed on a complementary platform.
Acknowledgments
First of all, I thank my advisor, Professor Yue Wang for his support and guidance on my
study, research and life. I am very grateful for his contributions of time, knowledge and
expertise to my research during my PhD study.
Much of my research was conducted in Professor Habtom W. Ressom’s laboratory. I would
like to express my sincere gratitude to him, for all the help, support and insights he has
provided to me throughout these years. In addition, his kindness, enthusiasm and positive
energy have made it a pleasure to work with him.
I want to thank the other members of my advisory committee: Professors Luiz A. DaSilva,
Seong K. Mun, Jianhua Xuan, and Guoqiang Yu, for their invaluable feedback and constructive suggestions to improve the present work.
I have been very fortunate to have several mentors who helped and supported me throughout
these years. In particular, I would like to thank Professor Mahlet Tadesse, for guiding me
along and providing feedback on my work. Discussions with her have been a constant source
of inspiration. Her innovative thinking and high standard on research work have had a
significant influence on how I approach a research problem. I thank Dr. Da-Wei Wang, my
former supervisor who brought me into the area of bioinformatics seven years ago. Over the
years, he has extended his support and continued to provide me very helpful suggestions.
I enjoyed working with my colleagues in CBIL and Ressom Lab. Especially, I thank Cristina
Di Poto, Minkun Wang, and Yi Zhao, for their generous help and selfless contributions in
our collaborative projects.
I would like to give special thanks to Sherry Hwang and Jeff Hwang, who have provided
me tremendous support and encouragement, starting from the very first day I landed in the
Dulles Airport. I cannot imagine getting through the past few years without their support.
Most importantly, I would like to thank my parents, Hai-Shu Tsai and Pi-O Lin, my sisters
Tsai-Chen Tsai and Wen-Fang Tsai, my brother Yi-Chan Tsai, my brother-in-law Hsiang-Chih Hsiao, and my sister-in-law Ching-Hui Yang. They have always motivated, encouraged,
and supported me. This dissertation would not have been possible without their unconditional love.
Contents
1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 Principles of LC-MS
    1.2.2 LC-MS data analysis
  1.3 Retention time alignment of LC-MS data
    1.3.1 Feature-based approaches
    1.3.2 Profile-based approaches
  1.4 Research topics
  1.5 List of relevant publications
  1.6 Organization of the dissertation
2 Single-profile Bayesian alignment model
  2.1 Preliminaries
  2.2 Generative model
  2.3 Posterior inference
    2.3.1 Full conditionals of model parameters
    2.3.2 Block Metropolis-Hastings algorithm
  2.4 Number and position of knots
  2.5 Experiments
    2.5.1 Simulated data set
    2.5.2 LC-MS spike-in data set
    2.5.3 LC-MS metabolomic data sets
  2.6 Alternative formulation
    2.6.1 Jupp transformation
    2.6.2 Hamiltonian Monte Carlo
    2.6.3 Single-profile alignment using Hamiltonian Monte Carlo
  2.7 Summary
3 Multi-profile Bayesian alignment model
  3.1 Gaussian process prior
  3.2 Multi-profile alignment
  3.3 Chromatographic clustering
  3.4 Analyzed data sets
  3.5 Analysis of LC-MS proteomic data set
  3.6 Analysis of LC-MS glycomic data set
  3.7 Summary
4 Application of Bayesian alignment model to biomarker discovery
  4.1 Background
  4.2 Experimental methods
  4.3 Global profiling analysis
  4.4 Multiple reaction monitoring quantification
  4.5 Discussion
  4.6 Summary
5 Conclusion
  5.1 Summary of original contributions
  5.2 Future work
  5.3 Conclusion
Bibliography
A Proteomic ground-truth data
B Glycomic ground-truth data
List of Figures
1.1 Example of an LC-MS run.
1.2 Profile-based approach is composed of two components: prototype function and mapping functions. (© 2013 IEEE)
2.1 An illustrative example showing the functionalities of the prototype function m(t) and the mapping function u_i(t). For example, the intensity of the sample at time t = 1 corresponds to the intensity of the prototype function at time u_i(1) = 2, which is given by m(2) = 3. (© 2013 IEEE)
2.2 Directed acyclic graph of the single-profile alignment model. (© 2013 IEEE)
2.3 One realization of simulated data with different noise levels: (a) no noise, (b) SNR 40, (c) SNR 35 and (d) SNR 30. (e)–(h) are the aligned data using BAM with SSVS.
2.4 (a) Base peak chromatograms of the original LC-MS data. (b) zooms in on the retention time range 100–250 for the chromatograms in (a). (© 2013 IEEE)
2.5 Aligned chromatograms by (a) DTW, (b) CPM, (c) BHCR, and (d) BAM. (e), (f), (g), and (h) zoom in on the retention time range 100–250 for the chromatograms in (a), (b), (c), and (d), respectively. Misalignments by DTW and BHCR are observed in (e) and (g). (© 2013 IEEE)
2.6 (a) Trace plot of the number of knots in the models visited at each MCMC iteration for the chromatogram from the seventh replicate of the second serum aliquot. (b) Box plot of the number of knots visited by the MCMC sampler for each chromatogram. (© 2013 IEEE)
2.7 Chromatogram of the seventh replicate from the second serum aliquot and generated profiles based on the sampled model parameters during the initial 200 MCMC iterations for (a) BAM and (b) BHCR. The region where the MCMC sampler for BHCR gets stuck at inaccurate retention time points is highlighted. (© 2013 IEEE)
2.8 Difference between the identity function and estimated mapping function obtained from the posterior median by BAM for each of the 14 chromatograms. The filled region corresponds to the 90% credible interval. (© 2013 IEEE)
2.9 Original extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and serum with spiked-in peptides (right). (© 2013 IEEE)
2.10 Aligned extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and serum with spiked-in peptides (right). (© 2013 IEEE)
2.11 Chromatograms in the metabolomic data sets, M1 and M2, before and after alignment by BAM. The inset is a zoomed part in the middle retention time range of the chromatograms. (© 2013 IEEE)
2.12 Trajectories for a bivariate normal distribution, simulated using 20 leapfrog steps (ε = 0.25) with an initial position at the lower-left side of the distribution. Contours of equal probability ratio to the highest (0.1, 0.2, . . . , 0.9) are depicted. Different values of initial momentum are considered.
2.13 The Hamiltonians along the trajectories in Figure 2.12, under different initial conditions.
2.14 Base peak chromatograms of the LC-MS data. Alignment is performed using the HMC model.
3.1 Three main components of the multi-profile Bayesian alignment model: Gaussian process prior, chromatographic clustering, and profile-based alignment.
3.2 Directed acyclic graph of the multi-profile alignment model.
3.3 Binned chromatograms in a portion of one LC-MS run. Similar m/z values do not imply similar chromatographic profiles.
3.4 Base peak chromatograms in the two analyzed data sets.
3.5 Histograms of the logarithm of peak intensities in the two analyzed data sets.
3.6 Scatter plots of the detected peaks in the two analyzed data sets. The intensity is log-transformed and color-coded.
3.7 Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the proteomic data set. The sufficient number of clusters is four.
3.8 Base peak chromatograms in the proteomic data set, before and after alignment. The inset is a zoomed part in the middle retention time range of the chromatograms.
3.9 Trace plots of 1/σ_ε^2 estimated based on single-profile alignment without using Gaussian process prior (SP) and with using Gaussian process prior (GPSP). (b) zooms in on the precision range 21–23.5 for the last 2500 iterations in (a).
3.10 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_1–τ_5.
3.11 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_6–τ_10.
3.12 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_11–τ_15.
3.13 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_16–τ_20.
3.14 Trace plots of u_i(t) − u_{i+1}(t) at the knot points τ_21–τ_23.
3.15 Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 20 LC-MS runs in the proteomic data set. The filled region corresponds to the 90% credible interval.
3.16 Measures of precision and recall in the proteomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP (), GPSP (△), and GPMP (♦).
3.17 Base peak chromatograms of the 23 LC-MS runs in the glycomic data set.
3.18 Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the glycomic data set. The sufficient number of clusters is four.
3.19 Clustered ion chromatograms in the glycomic data set. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms. The inset is a zoomed part in the middle retention time range of the chromatograms.
3.20 Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 23 LC-MS runs in the glycomic data set. The filled region corresponds to the 90% credible interval.
3.21 Measures of precision and recall in the glycomic data set, based on 72 pairs of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP (), GPSP (△), and GPMP (♦).
3.22 Measures of precision and recall in the glycomic data set, where multi-profile alignment is considered. The numbers of chromatograms are: (a) two, (b) three, (c) four, and (d) five. Five cases are compared with peak lists input to SIMA: raw (∗), adjusted by multi-profile alignment using binning (), adjusted by multi-profile alignment using chromatographic clustering (), adjusted by multi-profile alignment using binning with Gaussian process prior (△), and adjusted by multi-profile alignment using chromatographic clustering with Gaussian process prior (♦).
4.1 Workflow for the LC-MS-based analysis of N-glycans in sera.
4.2 Characteristics of a peak in LC-MS raw data.
4.3 Clustered ion chromatograms in E1. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.4 Clustered ion chromatograms in E2. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.5 Clustered ion chromatograms in E3. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.6 Clustered ion chromatograms in E4. (a)-(d) are the unaligned chromatograms and (e)-(h) are their corresponding aligned chromatograms.
4.7 Distribution of the RT differences (in seconds) across LC-MS runs of consensus peaks identified by SIMA with different parameters of RT tolerance.
4.8 Histograms of peak intensities before (a) and after (b) logarithmic transformation.
4.9 Quantification results of eight candidate N-glycan biomarkers in sera of HCC cases and cirrhotic controls by the MRM analysis. (a)-(c) Down-regulated bisected GlcNAc glycans. (d)-(e) Up-regulated beta-1,6-GlcNAc branching glycans. (f)-(h) Up-regulated tetra-antennary glycans.
4.10 Base peak chromatograms in four batches.
4.11 Three clusters of the identified N-glycan candidate biomarkers and their fold change directions (HCC versus cirrhosis).
List of Tables
2.1 Summary of full conditionals of model parameters.
2.2 Pairwise correlation coefficient and cross-correlation with the underlying pattern for the simulated LC-MS data, before alignment (original) and after alignment by DTW, CPM, BHCR and BAM (with fixed knots and with SSVS). Means (standard deviations) are reported for the simulated data based on 200 realizations.
2.3 Correlation coefficients for the LC-MS spike-in data, before alignment (original) and after alignment by DTW, CPM, BHCR (with knot density of 0.2) and BAM (with SSVS). (© 2013 IEEE)
2.4 Comparison of the peak matching results by using OpenMS alone (raw) and using three profile-based alignment models (DTW, CPM and BAM) for retention time correction prior to applying OpenMS on the metabolomic data sets. (© 2013 IEEE)
3.1 Summary of full conditionals of model parameters in the multi-profile Bayesian alignment model.
3.2 Summary of the analyzed data sets.
3.3 Peptide sequences of the internal standard.
3.4 Mass and retention time of each of the internal standard peaks in the proteomic data set.
3.5 Performance comparison in the LC-MS proteomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).
3.6 Mass and retention time of each of the internal standard peaks in the glycomic data set.
3.7 Performance comparison in the LC-MS glycomic data set. Five approaches are compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior (SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).
3.8 Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived by binning along the m/z dimension.
3.9 Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived using the chromatographic clustering procedure.
4.1 Characteristics of the study cohort.
4.2 Two-way analysis of variance of the glycomic data.
4.3 N-glycan candidate biomarkers identified by the LC-MS global profiling.
4.4 N-glycan candidate biomarkers identified by the MRM targeted quantification.
4.5 Number of missing values associated with the 12 significant N-glycans caused by the peak matching step.
A.1 Ground-truth peaks in the proteomic data set.
B.1 Ground-truth peaks in the glycomic data set.
Chapter 1
Introduction
The development and application of the high-throughput omic technology have dramatically
increased the capacity to describe various aspects of biology in unprecedented detail. This,
at the same time, has necessitated an increased reliance on computational techniques to
extract knowledge from the vast amount of biological data. Liquid chromatography coupled
with mass spectrometry (LC-MS) has been widely used for profiling expression levels of
biomolecules in a variety of omic studies. This dissertation is focused on retention time
alignment, which is a crucial preprocessing step in the analysis of LC-MS-based omic data.
A Bayesian alignment model is proposed to address the alignment task. We investigate this
problem from methodology development to practical application in this dissertation.
1.1
Motivation
LC-MS has been an indispensable tool in various omic studies including proteomics, glycomics
and metabolomics [1, 96, 143]. Each LC-MS run generates data consisting of thousands of
ion intensities characterized by their specific retention time (RT) and mass-to-charge ratio (m/z) values, thus enabling comprehensive profiling of a variety of biomolecules. This
high-throughput technique is widely applied to identify candidate markers whose expression
levels change between groups of distinct biological conditions [5, 51, 80]. In order to ensure
an unbiased comparison of the ion intensities, several preprocessing steps including peak
detection, retention time alignment, peak matching, normalization, and charge state deconvolution need to be appropriately handled [67]. Typically, these preprocessing steps generate
a list of detected peaks with their RT, m/z values and intensities, which are subsequently
analyzed using statistical tests to identify significant differences in ion intensities.
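As a rough illustration of the peak list described above, the following Python sketch stores each consensus peak with its RT, m/z value, and per-run intensities, and computes Welch's t-statistic on log-transformed intensities. The peak values, the group labels `case` and `control`, and the choice of Welch's statistic are assumptions made for this example only; they are not taken from the dissertation's pipeline.

```python
import math
from statistics import mean, variance

# Hypothetical peak table: each consensus peak has an RT (seconds), an m/z
# value, and one intensity per LC-MS run, grouped by biological condition.
peaks = [
    {"rt": 612.4, "mz": 785.8421,
     "case": [2.1e5, 2.6e5, 2.4e5], "control": [1.0e5, 1.2e5, 0.9e5]},
    {"rt": 1340.9, "mz": 1021.5530,
     "case": [5.2e4, 4.8e4, 5.5e4], "control": [5.0e4, 5.3e4, 4.9e4]},
]


def log2_intensities(values):
    """Log-transform raw ion intensities."""
    return [math.log2(v) for v in values]


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Attach a test statistic to every peak in the list.
for peak in peaks:
    peak["t"] = welch_t(log2_intensities(peak["case"]),
                        log2_intensities(peak["control"]))
```

In this toy table, the first peak shows a clear intensity difference between the two groups and the second does not. Such a comparison is only meaningful when each row really collects the same ion from every run.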
One of the crucial preprocessing steps is the correct matching of consensus peaks across
multiple LC-MS runs. With the advances in mass spectrometry technology, it is now possible
to achieve highly precise and accurate mass measurement (low- to sub-ppm) [81]. However,
controlling the chromatographic variability is still a challenging task. This often results
in substantial variation in retention time across multiple LC-MS runs, raising significant
challenges in the preprocessing pipeline. Without appropriate correction of retention time,
the peak matching step is error-prone and the subsequent analysis may yield misleading
results. Therefore, retention time alignment is a prerequisite for the quantitative analysis of
LC-MS data and is the focus of this dissertation.
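The effect of retention time drift on peak matching can be sketched with a toy example. The code below is a minimal illustration, assuming a greedy matcher with a mass tolerance in ppm and a fixed RT tolerance in seconds. The function names, tolerances, peak values, and drift values are all hypothetical, and the "correction" simply subtracts the known simulated drift, whereas in practice the drift is unknown and must be estimated by an alignment method.

```python
def ppm_error(mz_a, mz_b):
    """Relative m/z difference in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def match_peaks(run_a, run_b, ppm_tol=5.0, rt_tol=10.0):
    """Greedily pair each peak in run_a with the closest-RT unused peak in
    run_b that lies within both the m/z (ppm) and RT (seconds) tolerances."""
    matches = []
    used = set()
    for i, (rt_a, mz_a) in enumerate(run_a):
        best = None
        for j, (rt_b, mz_b) in enumerate(run_b):
            if j in used:
                continue
            within_mz = ppm_error(mz_a, mz_b) <= ppm_tol
            within_rt = abs(rt_a - rt_b) <= rt_tol
            if within_mz and within_rt:
                if best is None or abs(rt_a - rt_b) < abs(rt_a - run_b[best][0]):
                    best = j
        if best is not None:
            used.add(best)
            matches.append((i, best))
    return matches


# Peaks are (RT in seconds, m/z). Run B contains the same ions with a 1 ppm
# mass error and a hypothetical RT drift that grows along the gradient.
run_a = [(600.0, 785.8421), (905.0, 1021.5530), (1310.0, 650.3312)]
drift = [4.0, 18.0, 35.0]
run_b = [(rt + d, mz * (1 + 1e-6)) for (rt, mz), d in zip(run_a, drift)]

matched_raw = match_peaks(run_a, run_b)

# Oracle correction: subtract the simulated drift (unknown in real data).
run_b_corrected = [(rt - d, mz) for (rt, mz), d in zip(run_b, drift)]
matched_aligned = match_peaks(run_a, run_b_corrected)
```

With these made-up numbers, only the first peak is matched before correction, even though all three masses agree to within 1 ppm. All three peaks are matched once the drift is removed.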
1.2
Background
The sequencing of the human genome, i.e., determining the order of approximately 3.2 billion
DNA base pairs of adenine (A), thymine (T), cytosine (C) and guanine (G) residing in the 23
human chromosomes, is largely complete (> 99%) and publicly available [58]. This has raised
new challenges in life science research, to discover and characterize the association between
the DNA sequences and their downstream biological activities [43,69]. In particular, an area
of increased interest is systems biology, which focuses on studying biological entities and their
interactions to characterize the emergent properties of complex biological systems [3, 56].
Systems biology approaches consist of integrated and interactive analyses of diverse data
representing the biological activities at multiple levels, to gain insight into molecular and
cellular networks. The ability to comprehensively measure key biological entities in a high-throughput fashion is a prerequisite in this endeavor, and consequently, a variety of omic
technologies have been developed [54].
The flow of biological information from the genome to its downstream phenotypes involves
several complex and interactive processes. The central dogma of biology states that DNA
(genomics) is transcribed to mRNA (transcriptomics) and then translated into protein (proteomics), influenced by several regulatory factors including epigenetic modifications (epigenomics). Proteins can catalyze reactions that regulate and produce a variety of biomolecules
including metabolites (metabolomics), glycans (glycomics) and lipids (lipidomics). While
the genome of an organism is generally considered static, its expression as gene products is
continuously changing due to the influence of biological suppressors at different levels. Thus,
delineation of the gene products (e.g., mRNAs and proteins) is crucial to understanding the biological system. Significant efforts have been made in transcriptomic studies, and technologies
to measure mRNA abundance levels have become reliable in routine use [84, 138]. Investigating the transcriptome is essential because it links DNA sequences to proteins.
Unfortunately, expression levels of proteins and their downstream products cannot be simply
inferred from mRNA levels, and substantial discrepancies have been observed [46, 57, 78]. With
current analytical methods, data from different omic studies reveal complementary aspects
of the biological system, and integration of these data may lead to a more comprehensive
understanding of the underlying mechanisms [63].
The human genome contains approximately 21,000 protein-coding genes, which can be expressed into about one million proteins [62, 134]. Systematic investigation of proteins and
their downstream products provides important insight into the mechanisms of post-translational
processes [62,134]. Due to the close proximity of these biomolecules to biological phenotypes,
their expression levels reflect a rapid and observable response to environmental perturbations. This may reveal the underlying mechanisms involved in human diseases,
and aid in the development of effective treatments for these diseases. In this regard, there is
a broad interest in identifying the biomolecules including proteins, glycans and metabolites
as potential biomarkers for clinical applications. Such effort depends on not only reliable
high-throughput techniques but also rigorous computational pipelines to extract relevant
information from the vast amount of data.
With recent advances in mass spectrometry and separation methods, LC-MS has become one
of the essential analytical tools in biomedical research. LC-MS provides sensitive qualitative
and quantitative analyses of a variety of biomolecules in a high-throughput fashion, and there
has been enormous progress in systems biology and biomarker discovery using LC-MS-based
omics [2,24,47]. Basic principles of LC-MS, preprocessing pipelines for LC-MS data analysis,
and associated challenges are introduced in this section. For more details about the LC-MS
technique, we refer interested readers to the literature [1, 25, 31].
1.2.1
Principles of LC-MS
Liquid chromatography. Liquid chromatography (LC) is a chromatographic technique
used to separate a mixture of compounds that are dissolved in a solvent. Reversed phase
high-performance liquid chromatography (RP-HPLC) is the most commonly used method in
LC-MS applications. In RP-HPLC, the mixture is dissolved in a mobile phase, composed of
water and organic solvents. With a high-pressure pump, the mixture solution is directed into
a RP-HPLC column (the stationary phase), using a solvent gradient with increasing organic
concentration. The stationary phase is typically hydrophobic or non-polar, while the mobile
phase is moderately polar. The choices of column material, type of solvent, and the solvent
gradient all play a role in chromatographic separation. Different compounds in the mixture
pass through the column at different rates due to the differences in their hydrophobicity
and polarity. In RP-HPLC, hydrophilic compounds elute from the column earlier than
hydrophobic compounds, and the time at which a compound elutes from the column is called
its elution time or retention time. In most LC-MS applications, the liquid chromatography system is
coupled on-line to a mass spectrometer. Alternatively, compounds eluting from the column
can be collected in aliquots and analyzed by the mass spectrometer afterwards.
Mass spectrometry. Mass spectrometry (MS) is an analytical technique that measures
the mass-to-charge ratio (m/z) of charged molecules. The mass spectrometric analysis generates a mass spectrum summarizing the abundance of detected ions, distinguished by their
m/z values. A mass spectrometer consists of three basic components: 1) an ion source
that converts sample molecules into charged ions, 2) a mass analyzer that distinguishes the
charged ions on the basis of their m/z values, and 3) a detector that counts the number of
ions at each m/z value. Depending on the implementation of these components, there are
different types of MS instruments. Major instrument configurations include: 1) electrospray
ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) for the ion source;
2) quadrupole (Q), ion-trap, time-of-flight (TOF), Fourier transform ion cyclotron resonance
and Orbitrap for the mass analyzer; and 3) electron multiplier for the detector. The compatibility of different analyzers with different ionization methods varies. For example, while all
the analyzers listed above can be used in conjunction with an ESI ion source, MALDI is most
commonly coupled to a TOF analyzer (MALDI-TOF). In addition, different configurations
can be combined to achieve better performance or specific goals, e.g., quadrupole-time-of-flight (Q-TOF) and triple quadrupole (QqQ) mass spectrometry. The list above is by no
means exhaustive. For more details, please refer to the literature [1, 25, 31].
Tandem mass spectrometry. Tandem mass spectrometry (also called MS/MS) combines two steps of MS analysis with some form of fragmentation in between. Tandem mass
spectrometry is a key technique for identification of biomolecules, through examining the
fragmentation pattern of particular ions selectively. Typically, in the analysis by the first
mass spectrometer, several ions are selected based on either their intensities (data-dependent
acquisition, DDA) or pre-specified m/z values or windows of interest (data-independent acquisition, DIA).
The selected ions (called precursor ions) are isolated for further analysis through fragmentation by collision-induced dissociation (CID) or electron-transfer dissociation (ETD). The
resulting fragment ions (product ions) are then analyzed by the second mass spectrometer,
which produces an MS/MS spectrum, presenting detailed chemical makeup of the analyte.
The fragment ions are produced following certain rules of ion dissociation [120]. In proteomic studies with CID fragmentation, for example, when the amide bonds of a peptide
backbone break and the charge is retained on the C-terminus, y-ions are produced, while
b-ions are produced when the charge is retained on the N-terminus. Given the m/z values and intensities of the b- and y-ions along with the m/z value of their precursor ion,
the peptide sequence can be deduced through either database search [32, 93] or de novo
sequencing [19, 35, 79]. While the tandem MS spectrum provides valuable information for
identification of an analyte, MS/MS data acquisition suffers from undersampling.
Generally, only a few of the high-intensity precursor ions can be selected for fragmentation
using DDA strategies. If high-abundance molecules are not removed, they may dominate
MS/MS features and obscure less abundant molecules of interest.
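As a minimal illustration of how b- and y-ion m/z values follow from a peptide sequence, the sketch below sums monoisotopic residue masses for singly charged fragments. The function name `fragment_mz`, the small residue table, and the example peptide `GAK` are assumptions made for illustration only; modifications, higher charge states, and other ion series are ignored.

```python
# Monoisotopic residue masses (Da) for a few amino acids (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056   # mass of H2O in Da
PROTON = 1.00728   # mass of a proton in Da


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z lists for a peptide string."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b_i: first i residues (charge kept on the N-terminal fragment).
        b_ions.append(sum(masses[:i]) + PROTON)
        # y_i: last i residues plus water (charge kept on the C-terminal fragment).
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_mz("GAK")  # hypothetical tripeptide
print("b ions:", [round(v, 4) for v in b])
print("y ions:", [round(v, 4) for v in y])
```

Database search and de novo sequencing work in the opposite direction: they look for a sequence whose predicted fragment masses best explain the observed MS/MS spectrum.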
Liquid chromatography-mass spectrometry. Mass spectrometers are often coupled
with separation methods such as gas chromatography or liquid chromatography, to reduce
the chance of analyzing coincident molecules and to increase the overall dynamic range of detection. Liquid chromatography-mass spectrometry (LC-MS) is one of the most commonly used
techniques, where a liquid chromatographic process is employed prior to injection of a sample
into the mass spectrometer. Through LC-MS, fewer ions are analyzed simultaneously by the
mass spectrometer (compared to the whole sample injected at once). This reduces the ion
suppression effect [6]. In addition, molecules with the same molecular weight but different
hydrophobicity, e.g., isomers, may elute from the column and enter the mass spectrometer
at different times, thus reducing ambiguity in differentiating these molecules. An LC-MS
run produces a set of MS spectra, acquired at multiple scans of different retention times.
The MS/MS and LC-MS techniques can be naturally combined into a unified analytical
method, known as LC-MS/MS. Typically in an LC-MS/MS experiment using DDA, the first
mass spectrometer acquires a precursor scan, which measures all ions associated with the eluted
molecules at a given retention time; a subset of the ions is selected based on their intensities,
sequentially isolated and fragmented, prior to the second MS analysis. This produces several
MS/MS scans following the precursor scan. This alternating process is repeated to
automatically acquire both MS and MS/MS spectra throughout the LC gradient. While the
MS/MS spectrum presents the fragmentation pattern of an analyte, its experimental mass
and charge state are obtained from the measurement of the precursor ion in the MS spectrum.
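The intensity-based selection in DDA can be sketched as a simple top-N rule applied to each precursor scan. The function `select_precursors`, its parameters, and the example scan are hypothetical; real instruments add further rules such as dynamic exclusion and charge-state filtering.

```python
def select_precursors(ms1_scan, top_n=3, min_intensity=0.0):
    """Pick up to top_n most intense ions from a scan of (m/z, intensity) pairs."""
    candidates = [peak for peak in ms1_scan if peak[1] >= min_intensity]
    # Sort by intensity, most intense first.
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]


# Hypothetical precursor scan at one retention time.
scan = [(400.2, 1500.0), (512.3, 9000.0), (623.8, 300.0), (745.4, 4200.0)]
selected = select_precursors(scan, top_n=2)
print(selected)  # [512.3, 745.4]
```

The low-intensity ion at m/z 623.8 is never fragmented under this rule, which is the undersampling behaviour described above.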
Multiple reaction monitoring. LC-MS/MS using DDA is generally biased towards analysis of the most abundant and observable molecules. Biologically relevant molecular responses, however, are often less discernible in that analysis. Targeted quantification by
multiple reaction monitoring (MRM) using triple quadrupole (QqQ) mass spectrometers
has been introduced to overcome the limitations of the DDA analysis [97]. Essentially, the
MRM method organizes the analysis of a specific list of targeted molecules, characterized
by the m/z values of their precursor and fragment ions. The precursor-fragment ion pairs
are called transitions, which are highly specific and unique for the targeted molecules. A
specific ion is selected in the first quadrupole (Q1) on the basis of its precursor m/z value.
The ion gets fragmented by collision-induced dissociation (CID) in the second quadrupole.
Only the relevant ions produced by the fragmentation are selected in the third quadrupole
(Q3). The resulting transitions are then used for quantification. As the data acquisition is
highly specific with less interference from irrelevant ions, the MRM analysis can yield more
sensitive and accurate quantification results. Moreover, the RT information can be used for
scheduling the detection of a specific transition, to increase the number of molecules that can be
quantified. This process is often referred to as scheduled MRM.
1.2.2 LC-MS data analysis
LC-MS methods can be used for extraction of quantitative information and detection of
differential abundance. This requires that a rigorous analysis workflow be implemented.
In addition to analytical considerations, crucial steps include: 1) an experimental design that
avoids introducing bias during data acquisition and enables effective utilization of available
resources [95], 2) a data preprocessing pipeline that extracts meaningful features [67], and
3) a statistical test that identifies significant changes based on the experimental design [66].
Coordination of these three steps is key to a reliable LC-MS analysis. Good experimental
design provides an opportunity to process and compare samples in an unbiased manner. It
helps identify true differences in the presence of variability from various sources. This benefit
can diminish if the data analysts fail to appropriately process the LC-MS data and conduct
the subsequent statistical tests in accordance with the experimental design. This section
provides a high-level overview of the LC-MS preprocessing pipelines and highlights associated
challenges. Interested readers are referred to the literature for further information [67, 73].
An LC-MS run contains retention time information in the chromatogram, m/z values in the MS
spectra, and the relative ion abundance of each ion. MS signals of all ions throughout the chromatographic separation are organized in a three-dimensional map that defines
the LC-MS data, as shown in Figure 1.1. LC-MS can profile thousands of biomolecules in
a single run, which necessitates an automatic and reliable preprocessing pipeline to extract
meaningful features. In order to ensure an unbiased comparison of the ion intensities, several preprocessing steps including noise filtering, deisotoping, peak detection, retention time
alignment, peak matching and normalization need to be appropriately handled. Typically,
these preprocessing steps generate a list of detected peaks characterized by their retention
times, m/z values and ion intensities. Subsequent statistical analysis is used to identify
significant differences in ion intensities across distinct groups.
In LC-MS data, the peak representing a compound is characterized by its isotopic pattern,
which results from common isotopes such as ¹²C and ¹³C and appears in a set of mass spectra over its elution
duration, superimposed on noise signals. Adequate consideration of such characteristics
is crucial for LC-MS data analysis. Although several software tools have been developed
(e.g., OpenMS [121], msInspect [8], MZmine 2 [100] and XCMS [119]), very few studies have
systematically evaluated and compared their performance [133]. As a result, determining the
most appropriate computational pipeline remains challenging. In the following, we introduce
the main preprocessing steps. As the way to characterize a peak is not universal, the order
of the preprocessing steps may vary in different software tools.
Noise filtering. LC-MS data are subject to electronic and chemical noise due to contaminants present in the column solvent or instrumental interference. Appropriate noise filtering
can increase the signal-to-noise ratio (SNR) and facilitate the subsequent peak detection
step. Some software tools (e.g., XCMS [119] and MZmine 2 [100]) integrate noise filtering into the peak detection step. Smoothing filters such as the Gaussian filter and the Savitzky-Golay
filter [116] are commonly applied to reduce the effects of noise. Because LC-MS platforms differ
in resolution and detection limit, parameters for the smoothing filters need
to be adaptively selected, preferably through a pilot experiment with similar experimental
setup.
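A Gaussian smoothing filter of the kind mentioned above can be sketched in a few lines. This is an illustrative implementation only: the function names, the edge handling (clamping), and the values of `sigma` and `radius` are assumptions, and in practice these parameters would be tuned to the platform.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma=1.0, radius=3):
    """Convolve a 1-D intensity trace with a Gaussian kernel (edges clamped)."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat the edge values
            acc += w * signal[j]
        out.append(acc)
    return out


noisy = [0, 1, 0, 2, 9, 3, 0, 1, 0, 0]  # hypothetical intensity trace
smoothed = smooth(noisy, sigma=1.0, radius=2)
print([round(v, 2) for v in smoothed])
```

A wider kernel suppresses more noise but also broadens and lowers genuine peaks, which is why the parameters need to match the resolution of the instrument.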
Figure 1.1: Example of an LC-MS run. [Three-dimensional plot of ion count versus retention time and m/z.]
Deisotoping. Most chemical elements have naturally occurring isotopes. For example,
¹²C and ¹³C are two stable isotopes of the element carbon with mass numbers 12 and 13,
respectively. As a result, each analyte gives rise to more than one ion peak in the LC-MS
data, where the peak arising solely from the most common isotopes is called the monoisotopic
peak. In proteomics, for example, each peptide is characterized by an envelope of ion peaks
due to its constituent amino acids. ¹³C is the most abundant heavy isotope among the elements that
make up amino acids, constituting about 1.11% of the carbon species. The approximately
one dalton (Da) mass difference between ¹³C and ¹²C results in an m/z difference of about 1/z between
adjacent ion peaks in the isotopic envelope, where z is the charge state of the charged
peptide. The deisotoping step groups the sibling ion peaks originating from the same
analyte and summarizes them by the monoisotopic mass. This facilitates the interpretation of
LC-MS data, and assures the validity of the independence assumption considered in many
statistical tests. DeconTools [60] is widely used to deisotope MS spectra, which involves: 1)
identification of isotopic distribution, 2) prediction of the charge state based on the distance
between the ion peaks in the isotopic distribution, and 3) comparison between the observed
isotopic distribution and a theoretical distribution generated based on an average residue.
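The 1/z spacing rule can be turned into a small charge-state check. The sketch below is not the DeconTools procedure; it only illustrates step 2 (charge prediction from peak spacing) under simplifying assumptions: the envelope is already isolated, its lowest-m/z peak is the monoisotopic peak, and the function names, tolerance, and example values are hypothetical.

```python
ISOTOPE_SPACING = 1.00336  # approximate 13C - 12C mass difference in Da
PROTON = 1.00728           # mass of a proton in Da


def infer_charge(mz_peaks, max_charge=4, tol=0.01):
    """Guess the charge state z from the mean spacing of isotopic peaks (about 1/z)."""
    peaks = sorted(mz_peaks)
    gaps = [hi - lo for lo, hi in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    for z in range(1, max_charge + 1):
        if abs(mean_gap - ISOTOPE_SPACING / z) <= tol:
            return z
    return None  # spacing does not match any charge state considered


def monoisotopic_mass(mz_peaks, z):
    """Neutral monoisotopic mass, taking the lowest-m/z peak as monoisotopic."""
    return (min(mz_peaks) - PROTON) * z


envelope = [500.250, 500.752, 501.253]  # hypothetical isotopic envelope
z = infer_charge(envelope)
mass = monoisotopic_mass(envelope, z)
print(z, round(mass, 4))
```

Here the peaks are about 0.5 m/z apart, so the envelope is assigned charge state 2 and collapses to a single neutral mass.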
Peak detection. Peak detection is a procedure to determine the existence of a peak in
a specific range of retention times and m/z values, and to quantify its intensity. It is
a prerequisite for the subsequent analysis of LC-MS data. Most LC-MS peak detection
approaches [100, 119, 121, 142] are adapted from the previous advances in MALDI-TOF data
analysis [18, 26]. These methods proceed with the peak detection via a pattern matching
process with a pre-defined pattern, followed by a filtering step based on quantified peak
characteristics. To better capture the characteristics of elution profiles, asymmetric patterns
are often considered [142]. A critical issue is that the elution profiles may vary across
different retention times [120]. As a result, the use of a single pattern throughout the
whole retention time range in the current approaches may lead to inaccurate estimates of
peak characteristics and SNR. The latter is commonly employed as
a filtering criterion. Usually peak detection is performed on each LC-MS run separately,
without using information from other runs within the same experiment. Utilization of multiscale information from multiple runs has been proposed for the analysis of MALDI-TOF
data [144], which may lead to more reliable peak detection results. The same concept may
be applied to LC-MS data analysis, where the peak matching step to be introduced plays a
significant role.
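The two stages described above (finding candidate peaks, then filtering by SNR) can be illustrated with a deliberately simple detector. This sketch replaces pattern matching with a local-maximum test and estimates noise by the median absolute deviation; both choices, along with the function name and threshold, are assumptions for illustration and do not reproduce any of the cited tools.

```python
from statistics import median


def detect_peaks(intensities, snr_min=3.0):
    """Return indices of local maxima whose SNR is at least snr_min."""
    baseline = median(intensities)
    # Median absolute deviation as a robust noise estimate (illustrative choice).
    noise = median(abs(x - baseline) for x in intensities) or 1.0
    peaks = []
    for i in range(1, len(intensities) - 1):
        is_local_max = intensities[i - 1] < intensities[i] >= intensities[i + 1]
        if is_local_max and (intensities[i] - baseline) / noise >= snr_min:
            peaks.append(i)
    return peaks


# Hypothetical extracted ion chromatogram with two eluting compounds.
chromatogram = [2, 3, 2, 4, 30, 55, 28, 5, 3, 2, 4, 3, 18, 40, 17, 3, 2]
peak_indices = detect_peaks(chromatogram)
print(peak_indices)  # [5, 13]
```

Small local maxima in the baseline are rejected by the SNR criterion, while the two elution peaks are retained.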
Normalization. In LC-MS-based omics, one challenge is to detect true biological differences in the presence of various sources of variability. This requires appropriate normalization
of intensity measurements to remove systematic biases and eliminate the effect of obscuring
variability. Current normalization approaches identify
a reference for ion intensities and use that reference to adjust the LC-MS data. Identifying a reliable reference is therefore key to the success of the normalization process.
Most current methods assume that each of the LC-MS runs in the same experiment should
have an equal concentration of molecules on average [68]. With this assumption, various
measures including summation, median, and quantile of the ion intensities are used as the
reference for normalization. Unfortunately, the validity of this assumption is questionable as
an increase of concentration in a specific group of molecules is not necessarily compensated by
a decrease in other groups [123]. More rigorous approaches using regression methods based
on a set of matched peaks [16] or spiked-in internal standards [123] have been proposed.
However, it is unclear whether neighboring ions (in terms of RT, m/z value, or intensity)
would necessarily share a similar drifting trend along the analysis order. At present, the use
of quality control (QC) runs to assess and correct variability in LC-MS data appears to be
the most reliable approach [28], in which QC runs can be collected from a reference sample
or a mixture pooled from the analyzed samples. This idea has been successfully implemented
for large-scale metabolomic studies, where variability along the analysis order is estimated
for each of the detected peaks through assessment of the QC runs [28]. This circumvents
the need to select the unknown reference, at the cost of additional experimental challenges in ensuring
appropriate coverage and reproducible detection of ions in the QC runs.
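For concreteness, the reference-based normalization criticized above can be sketched with the median as the reference. The function `median_normalize` and the example intensities are hypothetical; the sketch relies on exactly the equal-average-concentration assumption that the text questions, so it is a baseline rather than a recommended method.

```python
from statistics import median


def median_normalize(runs):
    """Scale each run so its median intensity equals the median of the run medians."""
    run_medians = [median(run) for run in runs]
    target = median(run_medians)
    return [[x * target / m for x in run] for run, m in zip(runs, run_medians)]


# Hypothetical peak intensities from three LC-MS runs.
runs = [[100.0, 200.0, 300.0], [220.0, 410.0, 640.0], [90.0, 150.0, 330.0]]
normalized = median_normalize(runs)
for run in normalized:
    print([round(x, 1) for x in run])
```

After scaling, every run has the same median intensity, so any genuine global shift in concentration between runs is removed along with the systematic bias.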
Peak matching. The peak matching step groups consensus peaks across LC-MS runs
prior to applying statistical analysis to identify significant differences in ion intensities. This
preprocessing step ensures that measurements from multiple LC-MS runs are comparable.
It is also crucial for potential extensions of peak detection and normalization steps, in order
to integrate information from multiple runs. The main challenge in peak matching results
from the presence of variability of retention time and m/z values among the LC-MS runs.
With the advances in mass spectrometry technology, it is now possible to achieve highly
precise and accurate mass measurement (low- to sub-ppm) [81]. However, controlling the
chromatographic variability is still a challenging task. Most current LC-MS preprocessing
pipelines combine the estimation of retention time variability with the peak matching step,
in order to correct that variability (called retention time alignment) and achieve reliable
identification of consensus peaks. Although a number of algorithms [8, 100, 119, 121] have
been proposed to address the retention time alignment problem, it is still one of the most
challenging tasks in the LC-MS preprocessing pipeline due to the following issues:
1. The retention time shift across LC-MS runs is non-linear [101].
2. Retention time alignment relies on correct identification of consensus peaks or profiles.
Performing this process based on misaligned data can be ambiguous.
3. A peak may be absent in some LC-MS runs because of either analytical or computational issues [136].
These issues are further elaborated in the next section, where we review related studies on
retention time alignment. Furthermore, we discuss how we address the alignment problem
in consideration of the issues.
1.3 Retention time alignment of LC-MS data
As discussed in Section 1.2.2, retention time alignment of LC-MS data is crucial for the
peak matching step. Based on the type of inputs, alignment approaches can be categorized
as: 1) feature-based approaches and 2) profile-based approaches [135]. The feature-based
approaches perform the alignment task based on relevant signals (features, usually referred
to as peaks), which are distinguished from irrelevant parts in the peak detection step. The
profile-based approaches, on the other hand, make use of chromatographic profiles to estimate
the variability along retention time and adjust the LC-MS runs accordingly. In addition to
the formulation of alignment problem and required input data, these two types of alignment
approaches differ in the coverage of the preprocessing pipeline. The profile-based approaches
address the alignment problem, while the feature-based approaches usually deal with both
the alignment and peak matching processes simultaneously.
1.3.1 Feature-based approaches
Most current LC-MS preprocessing pipelines (e.g., OpenMS [121], msInspect [8], MZmine 2
[100] and XCMS [119]) employ the feature-based retention time alignment based on a set
of identified peaks. The feature-based approaches rely on the correct identification of a set
of consensus peaks across LC-MS runs. With this matching information, retention time
correction can then be carried out naturally. The main distinction among these approaches
is the way they identify the consensus peaks, which greatly affects the alignment results.
Several feature-based approaches choose a reference from the analyzed peak lists based on
some heuristic measure such as the number of peaks [4, 8, 145]. The rest of the peak lists are
aligned to the reference list in a pairwise manner, where the consensus peaks are identified
based on some pre-defined tolerance ranges. The retention time alignment can be subsequently performed using regression methods. Progressive adjustment is proposed [4, 145] by
using reliable consensus peaks for an initial (coarse) retention time correction, followed by a
more accurate alignment using refined consensus peaks. If there is at least one comprehensive peak list with good quality, this approach can perform reasonably well. However, the
alignment performance degrades significantly when there is a lack of reproducibility among
LC-MS runs being considered.
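The pairwise scheme described above (tolerance-based matching to a reference list, followed by regression) can be sketched as follows. All names, tolerances, and peak values are hypothetical, and a straight-line fit is used only for brevity; as noted earlier, retention time shifts are generally non-linear, so practical tools use more flexible regression.

```python
def match_peaks(reference, run, rt_tol=0.5, ppm_tol=10.0):
    """Match each reference peak to the closest-RT unused peak within tolerances.

    Peaks are (retention_time, mz) tuples; returns (ref_index, run_index) pairs.
    """
    pairs, used = [], set()
    for i, (rt_ref, mz_ref) in enumerate(reference):
        best, best_dist = None, None
        for j, (rt, mz) in enumerate(run):
            if j in used:
                continue
            ppm = abs(mz - mz_ref) / mz_ref * 1e6
            dist = abs(rt - rt_ref)
            if dist <= rt_tol and ppm <= ppm_tol:
                if best_dist is None or dist < best_dist:
                    best, best_dist = j, dist
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs


def fit_linear_shift(reference, run, pairs):
    """Least-squares line rt_ref ~ a + b * rt_run from the matched pairs."""
    xs = [run[j][0] for _, j in pairs]
    ys = [reference[i][0] for i, _ in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


reference = [(10.0, 400.2000), (20.0, 512.3000), (30.0, 745.4000)]
run = [(10.3, 400.2010), (20.4, 512.3020), (30.5, 745.3990), (25.0, 900.0)]
pairs = match_peaks(reference, run)
a, b = fit_linear_shift(reference, run, pairs)
corrected = [round(a + b * rt, 3) for rt, _ in run]
print(pairs, corrected)
```

The result depends entirely on the consensus peaks found within the tolerance windows; with larger shifts or poorly reproducible runs, the matching step fails before any regression can help.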
In order to eliminate the need to select a reference list, clustering methods are applied in a number of feature-based approaches [70, 72, 88, 101, 119]. These approaches cluster peaks across LC-MS runs and specify a more complete reference based on
the identified consensus peaks. A similar idea has also been proposed using kernel density
estimation of all the detected peaks [136]. From a different perspective, a variable selection
approach [137] is proposed to identify the consensus peaks using the elastic net method [148].
To reduce the chance of using erroneously identified consensus peaks, which can deteriorate
the alignment result, these approaches either include a module to assess the quality of the
consensus peaks [70, 101, 119], or apply a robust method for the regression [101, 137]. Pairwise
retention time correction versus the reference list is employed in all the aforementioned approaches with an exception of the simultaneous multiple alignment (SIMA) model [136]. In
SIMA, a kernel density estimation is used to derive a multi-dimensional retention time ridge
that represents the retention time variation among all the analyzed LC-MS runs. Iterative
refinement of the estimation through alternating the identification of consensus peaks and
the regression has also been proposed [100, 119].
Incorporation of MS/MS identification can greatly reduce the ambiguity in matching the
consensus peaks, as proposed in [34, 59]. Using only MS/MS identification [34], however,
requires reproducible acquisition of MS/MS spectra and good coverage of the identified
peaks in retention time, which is rarely achievable in experiments using the data-dependent
acquisition strategy. In addition, its application is limited to LC-MS/MS experiments with
sufficient MS/MS spectra acquired. The PEPPeR platform [59] integrates both peak lists
and MS/MS identification to perform the retention time alignment, in order to overcome
the limitation. However, the integration is implemented through an ad-hoc approach, partly
due to the lack of uncertainty measures.
Performance of the feature-based alignment approaches is highly dependent on three factors:
1) peak detection result, 2) reliability of consensus peaks, and 3) coverage of consensus
peaks in retention time. The latter two factors present a trade-off in identification of the
consensus peaks, which is key to good alignment performance. To address the trade-off,
sophisticated clustering methods are proposed to explore more possible consensus peaks,
followed by prioritization of these peaks based on their qualities. However, a fundamental
issue is that the consensus peaks usually cannot be adequately determined based on unaligned
data. Moreover, estimation of retention time variation by the feature-based approaches is
limited to only a subset of time points, which is usually not as accurate as considering the
whole chromatograms, as done in the profile-based approaches.
1.3.2 Profile-based approaches
The profile-based approaches utilize chromatograms of the LC-MS runs to estimate the variability along retention time and adjust the LC-MS runs accordingly. It is assumed that there
exists a pattern underlying multiple chromatograms from the same biological group and the
profile variability is relatively small compared to distortions caused by misalignment. Compared to a set of retention time points used by the feature-based approaches, the chromatographic profiles provide more comprehensive information about the variation throughout the
whole retention time range. Appropriate utilization of the whole chromatogram allows improved estimation of the retention time variation characterized by the mapping functions.
Figure 1.2 presents the concept of a profile-based approach. The algorithm estimates 1) a
prototype function that represents the underlying pattern across the observed data, and 2)
a set of mapping functions that characterize the relationship between the prototype function
and the observation. In LC-MS data alignment, the monotonicity constraint is commonly
applied to the mapping function to retain the elution order of the LC process. The goal of the
profile-based alignment is to estimate the underlying prototype and mapping functions most
likely to have generated the observed data.
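The relationship between a prototype function, a monotone mapping function, and an observed chromatogram can be made concrete with a small sketch. The piecewise-linear form, the function names, and the triangular prototype are assumptions chosen for illustration; amplitude scaling and noise are omitted.

```python
def make_mapping(knots_in, knots_out):
    """Piecewise-linear mapping u(t) passing through (knots_in[k], knots_out[k])."""
    if any(b <= a for a, b in zip(knots_in, knots_in[1:])):
        raise ValueError("input knots must be strictly increasing")
    if any(b <= a for a, b in zip(knots_out, knots_out[1:])):
        raise ValueError("mapping must be increasing to preserve elution order")

    def u(t):
        # Clamp outside the knot range, interpolate linearly inside it.
        if t <= knots_in[0]:
            return knots_out[0]
        if t >= knots_in[-1]:
            return knots_out[-1]
        for k in range(len(knots_in) - 1):
            t0, t1 = knots_in[k], knots_in[k + 1]
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                return knots_out[k] + w * (knots_out[k + 1] - knots_out[k])

    return u


def prototype(t):
    """Hypothetical prototype: a single triangular peak centred at t = 5."""
    return max(0.0, 1.0 - abs(t - 5.0))


# Observed time 4 maps to prototype time 5, so the peak appears one unit early.
u = make_mapping([0.0, 4.0, 10.0], [0.0, 5.0, 10.0])
observed = [prototype(u(t)) for t in range(11)]
print([round(v, 3) for v in observed])
```

Alignment is the inverse problem: given only the observed traces, estimate both the prototype and the mapping of each run.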
The majority of the profile-based alignment approaches [15, 17, 29, 103, 104, 129] are based
on two standard warping algorithms: dynamic time warping (DTW) [114] and correlation
optimized warping (COW) [94]. DTW was initially proposed for processing time-series data
in the context of speech recognition. It uses a dynamic programming algorithm [9] to search
for a mapping function between a pair of time-series traces. Essentially one trace is considered as the reference, to which the other is aligned using warping operations to stretch
or shrink its profile. The warping operations are characterized by the mapping function, in
order to make the warped trace as similar as possible to the reference. To avoid overfitting,
regularization of DTW is often considered using constraints of: 1) minimum/maximum slope
in the mapping function, and 2) maximum allowable deviation of the mapping function from
the identity function. COW [94] follows the same principle as DTW, with additional
regularization. In COW, the mapping function is constrained to be piecewise linear, where
only a subset of time points (knots) are allowed to apply the warping operations with linear
interpolation between the knots. In order to align a set of chromatograms, one reference
must be specified as the prototype function, to which each of the chromatograms is aligned
based on a defined distance (e.g., Euclidean distance). Several distance measures are
considered in [103, 104], but the main challenge in the DTW- and COW-based approaches
lies in the choice of the unknown prototype function. In these approaches, heuristic
measures (e.g., the correlation between chromatograms) are commonly used to determine a prototype function, and no further adjustment of the prototype function is allowed
during the estimation.
Figure 1.2: Profile-based approach is composed of two components: prototype function and
mapping functions. (© 2013 IEEE) [Diagram labels: prototype function, mapping functions, synthetic data, similarity, observation, adjust.]
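The dynamic programming search at the core of DTW can be sketched as below, with the reference trace fixed in advance as in the approaches just described. This is a textbook formulation with squared-difference cost and an optional band constraint on deviation from the identity mapping; the function name, the `max_dev` parameter, and the example traces are illustrative assumptions, and slope constraints are omitted.

```python
def dtw(reference, trace, max_dev=None):
    """Align `trace` to `reference`; return (total cost, warping path).

    `max_dev` optionally bounds |i - j|, i.e. deviation from the identity mapping.
    """
    n, m = len(reference), len(trace)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if max_dev is not None and abs(i - j) > max_dev:
                continue  # cell lies outside the allowed band
            d = (reference[i - 1] - trace[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


reference = [0, 0, 1, 5, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 5, 1, 0, 0]  # same peak, delayed by one time point
total_cost, path = dtw(reference, shifted, max_dev=2)
print(total_cost, path)
```

The path pairs the peak apex at reference index 3 with index 4 in the shifted trace. If the band is too narrow for the two lengths, the returned cost is infinite and the path is not meaningful.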
To address this concern, probabilistic models are proposed in order to estimate a prototype
function from the observations in a more principled way. In the statistics community, a
similar problem called curve registration [106] has been studied in the context of functional
data analysis [107]. The prototype and mapping functions are estimated by a Procrustes
fitting procedure, where the estimate is iteratively updated. The mapping function in the
Procrustes analysis is estimated via the measure of relative curvature and regularization can
be applied to penalize large curvature values. The continuous profile model (CPM) [74]
is another effective approach to perform the profile-based alignment. CPM formulates the
alignment problem as a hidden Markov model [105], where each chromatogram represents a
noisy transformation of the latent trace and the time points are indexed as hidden states in
the HMM. Both the prototype function (latent trace) and the mapping functions (hidden
states) are estimated from the observations using an expectation-maximization (EM) algorithm [22, 74]. Extension of the CPM model has been proposed to handle multiple binned
chromatograms, with a moderate number of bins [76]. However, the strategy of mapping the
data onto a higher-dimensional space may limit its applicability to high-resolution LC-MS
data generated by the current technologies.
A common issue of the current alignment approaches (both feature-based and profile-based)
is the lack of uncertainty assessment. A measure of uncertainty is desired to provide a confidence level in the alignment results and assist in making decisions for subsequent analyses or
data integration. In spike-in experiments or studies utilizing MS/MS identification results,
the integration of this complementary information has been found to lead to better alignment results, e.g., in [59, 61]. However, the integration is often implemented through ad-hoc
approaches, partly due to the lack of uncertainty measures. In combining the information
from various sources, accounting for the uncertainty in the alignment from each source and
performing the integration in a principled way may lead to improved results.
1.4 Research topics
In this dissertation, we propose a Bayesian alignment model (BAM), which integrates complementary information embedded in the LC-MS data to address the aforementioned challenges.
Specifically, the model offers two major attractive features: 1) it estimates the retention time
variability along with uncertainty measures, and 2) it integrates multiple sources of information including internal standards and clustered chromatograms, through weighing the
uncertainty measures. We investigate the alignment problem through three research topics:
1) development of single-profile Bayesian alignment model, 2) development of multi-profile
Bayesian alignment model, and 3) application to biomarker discovery research.
Development of single-profile Bayesian alignment model. We first study the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from
each LC-MS run. Relevant work includes a Bayesian hierarchical model for curve registration (BHCR) [127], which uses a Markov chain Monte Carlo (MCMC) method for parameter
inference. For the alignment of LC-MS data, which consist of many chromatographic peaks,
more flexible and effective MCMC algorithms are desired. We observe that the element-wise Metropolis-Hastings algorithm used in BHCR is prone to overfitting due to inefficient
MCMC moves. To overcome this problem, we propose a block Metropolis-Hastings algorithm using a mixture of block transition moves [110] for more flexible and effective updates.
In addition, a stochastic search technique is built into BAM to enable adaptive knot specification for the mapping functions. For LC-MS data where chromatographic peaks are not
homogeneously present along retention time, a uniformly distributed knot specification is not
desirable. Instead of fixing the knots up front, we propose using stochastic search variable
selection (SSVS) [40] to determine the number and positions of knots. A possible extension
using the Hamiltonian Monte Carlo method [27, 92] is also investigated.
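For readers unfamiliar with Metropolis-Hastings, the basic accept/reject step can be sketched for a single scalar parameter. This is a generic random-walk sampler, not the block Metropolis-Hastings algorithm proposed in this dissertation nor the algorithm used in BHCR; the target density, step size, and all names are assumptions for illustration.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a scalar parameter."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, p(new) / p(old)).
        accept_prob = math.exp(min(0.0, logp_new - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_new
        samples.append(x)  # a rejected proposal repeats the current value
    return samples


def log_normal(x, mean=2.0, sd=0.5):
    """Unnormalized log density of a hypothetical Gaussian target."""
    return -0.5 * ((x - mean) / sd) ** 2


draws = metropolis_hastings(log_normal, x0=0.0, n_samples=20000, step=0.8)
kept = draws[2000:]  # discard burn-in
post_mean = sum(kept) / len(kept)
post_sd = math.sqrt(sum((d - post_mean) ** 2 for d in kept) / len(kept))
print(round(post_mean, 2), round(post_sd, 2))
```

The spread of the retained draws is what provides an uncertainty measure alongside the point estimate. Updating one parameter at a time in this way becomes inefficient when many correlated parameters must move together, which motivates block updates.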
Development of multi-profile Bayesian alignment model. The single-profile alignment model is further extended to handle multiple representative chromatograms simultaneously. The use of multiple chromatograms is considered in a few studies, by either binning
the LC-MS data [75] or using all the extracted ion chromatograms with acceptable quality [17]. However, a suitable procedure to utilize multiple representative chromatograms
while retaining computational feasibility is currently not available. We propose a clustering
approach to identify multiple representative chromatograms from each LC-MS run. These
chromatograms are simultaneously considered in the profile-based alignment to facilitate the
estimation of the prototype and mapping functions. Moreover, we incorporate the Gaussian
process regression [108] to estimate the retention time variation, based on the information
of internal standards. The use of internal standards enables a high-confidence estimation of
retention time variation, which avoids the ambiguity in identifying consensus peaks encountered in the feature-based approaches. Our proposed method allows us to infer a predictive
distribution over the entire retention time range. The inferred information can then be used
as the prior of the mapping function for the profile-based alignment.
Application of Bayesian alignment model to biomarker discovery. Cancer treatment is generally more effective when the disease is diagnosed early. Defining clinically
relevant biomarkers for early detection of cancer has potentially far-reaching implications
for disease management and patient health. LC-MS has been widely used in various omic
studies for cancer biomarker discovery. We investigate the applicability of the proposed
Bayesian alignment model in a biomarker discovery study using LC-MS-based glycomics.
The glycomic study consists of two complementary analyses: 1) global profiling using LC-MS, and 2) targeted quantification using MRM. The alignment model is integrated into a
preprocessing pipeline for LC-MS data analysis. Through the developed pipeline, we identify candidate biomarkers from the global profiling analysis and confirm the results with those obtained by
targeted quantification.
1.5 List of relevant publications
Journal papers
1. T.-H. Tsai, M.G. Tadesse, C. Di Poto, L.K. Pannell, Y. Mechref, Y. Wang, H.W. Ressom. Multi-profile Bayesian alignment model for LC-MS data analysis with integration
of internal standards. Bioinformatics, 29(21):2774–2780, 2013.
2. T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Profile-based LC-MS data alignment — A Bayesian approach. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 10(2):494–503, 2013.
3. J.F. Xiao, R.S. Varghese, B. Zhou, M.R. Ranjbar, Y. Zhao, T.-H. Tsai, C. Di Poto, J.
Wang, D. Goerlitz, Y. Luo, A.K. Cheema, N. Sarhan, H. Soliman, M.G. Tadesse, D.H.
Ziada, H.W. Ressom. LC-MS based serum metabolomics for identification of hepatocellular carcinoma biomarkers in Egyptian cohort. Journal of Proteome Research,
11(12):5914-23, 2012.
4. H.W. Ressom, J.F. Xiao, L. Tuli, R.S. Varghese, B. Zhou, T.-H. Tsai, M.R. Ranjbar,
Y. Zhao, J. Wang, C. Di Poto, A.K. Cheema, M.G. Tadesse, R. Goldman, K. Shetty.
Utilization of metabolomics to identify serum biomarkers for hepatocellular carcinoma
in patients with liver cirrhosis. Analytica Chimica Acta, 743:90-100, 2012.
5. L. Tuli, T.-H. Tsai, R.S. Varghese, J.F. Xiao, A.K. Cheema, H.W. Ressom. Using
a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science, 10(13),
2012.
6. G.K. Befekadu, M.G. Tadesse, T.-H. Tsai, H.W. Ressom. Probabilistic mixture regression models for alignment of LC-MS data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1417-24, 2011.
Conference papers
1. Y. Zhao, T.-H. Tsai, C. Di Poto, L. Pannell, M.G. Tadesse, H.W. Ressom. Variability assessment of LC-MS experiments and its application to experimental design and
difference detection. Proceedings of the IEEE International Workshop on Genomic
Signal Processing and Statistics (GENSIPS), Washington, DC, December 2012.
2. T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Bayesian alignment model for
LC-MS data. Proceedings of IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), Atlanta, GA, November 2011.
3. L. Tuli, T.-H. Tsai, R.S. Varghese, A.K. Cheema, H.W. Ressom. Using a spike-in
experiment to evaluate analysis of LC-MS data. Proceedings of the IEEE International
Workshop on Computational Proteomics, Hong Kong, December 2010.
Manuscripts submitted/in preparation
1. Y. Zhao, T.-H. Tsai, C. Di Poto, L.K. Pannell, M.G. Tadesse, H.W. Ressom. Mixed
effects model for variability assessment and difference detection in LC-MS data. Statistics and Its Interface, in revision.
2. T.-H. Tsai, M. Wang, C. Di Poto, Y. Hu, S. Zhou, Y. Zhao, R.S. Varghese, Y. Luo,
M.G. Tadesse, D.H. Ziada, C.S. Desai, K. Shetty, Y. Mechref, H.W. Ressom. LC-MS
profiling of N-glycans derived from human serum samples for biomarker discovery in
hepatocellular carcinoma. Journal of Proteome Research, in preparation.
3. T.-H. Tsai, E. Song, C. Di Poto, M. Wang, Y. Luo, R.S. Varghese, M.G. Tadesse,
D.H. Ziada, C.S. Desai, K. Shetty, Y. Mechref, H.W. Ressom. LC-MS/MS based
serum proteomics for identification of hepatocellular carcinoma (HCC) biomarkers.
Proteomics, in preparation.
1.6 Organization of the dissertation
The remainder of this dissertation is organized as follows. Chapter 2 introduces the single-profile Bayesian alignment model. The hierarchical model, the block Metropolis-Hastings
algorithm, the posterior inference for the model parameters, and the SSVS procedure for
knot specification are described in the chapter. Chapter 3 presents the extended multi-profile alignment model, including the chromatographic clustering approach to derive representative chromatograms and the Gaussian process prior that uses information from internal
standards. Chapter 4 demonstrates the applicability of the proposed alignment model in
a cancer biomarker discovery study. Finally, Chapter 5 concludes this dissertation with a
summary of contributions and possible extensions in future work.
Chapter 2
Single-profile Bayesian alignment model
This chapter¹ introduces the profile-based alignment using a single chromatogram for each
LC-MS run. The chromatograms are obtained based on total ion count or base peak intensity.
We address the alignment problem within a Bayesian framework. The inference for the model
parameters is based on their posterior distributions, which are estimated using Markov chain
Monte Carlo (MCMC) methods.
2.1 Preliminaries
We use a variety of Markov chain Monte Carlo (MCMC) methods in order to obtain the
posterior distribution of parameters in the proposed Bayesian alignment model. This section
gives a brief introduction of the MCMC methods that are used in this dissertation. We refer
interested readers to [14, 89, 90] for more detailed and rigorous expositions of this topic.
For simple problems where a conjugate prior is available, i.e., posterior and prior distributions
are in the same family, the posterior distribution can be used directly to infer properties
associated with the model parameters. However, in most complex problems including the
profile-based alignment, it is not possible to obtain a closed-form expression for the posterior
distribution. A general approach to estimate the posterior distribution in such problems is
to use Markov chain Monte Carlo (MCMC) methods, where a large number of (correlated)
samples from a Markov chain are used to make Monte Carlo estimates. A Markov chain refers
to a sequence of states in which each sample depends only on its preceding sample,
governed by a transition probability T(θ′ ← θ), i.e., the probability of a state change from θ to
θ′. Based on the law of large numbers, the Monte Carlo estimates would (asymptotically)
converge to the true values with an infinite number of samples, if the samples are obtained
from the target distribution. To ensure the validity of the Monte Carlo estimates, the Markov
chain must converge to the target distribution p(θ), where θ is the parameter of interest.
¹ Part of this chapter has been published in an earlier work [132]: T.-H. Tsai, M.G. Tadesse, Y. Wang, H.W. Ressom. Profile-based LC-MS data alignment — A Bayesian approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2):494–503, 2013. (© 2013 IEEE)
The convergence requires two fundamental conditions:
1. The Markov chain must be ergodic — it is possible to pass from any of the states to
another.
2. The transition T(θ′ ← θ) leaves p(θ) invariant, that is,
$$
p(\theta') = \sum_{\theta} p(\theta)\, T(\theta' \leftarrow \theta), \quad \text{for all } \theta'. \tag{2.1}
$$
The second condition of invariance ensures that if the Markov chain reaches the targeted
distribution p(θ) at some point, then the subsequently sampled states would follow that
distribution as well. In many MCMC methods, this is usually accomplished through the
detailed balance condition (also known as reversibility):
$$
p(\theta)\, T(\theta' \leftarrow \theta) = p(\theta')\, T(\theta \leftarrow \theta'), \quad \text{for all } \theta \text{ and } \theta'. \tag{2.2}
$$
Detailed balance ensures that θ and θ ′ are reversible in the Markov chain, and it is a sufficient
condition of invariance as shown below:
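$$
\sum_{\theta} p(\theta)\, T(\theta' \leftarrow \theta)
= \sum_{\theta} p(\theta')\, T(\theta \leftarrow \theta')
= p(\theta') \sum_{\theta} T(\theta \leftarrow \theta')
= p(\theta'). \tag{2.3}
$$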
The Metropolis-Hastings algorithm [50, 85] and Gibbs sampling [38, 39] are two common MCMC
methods that satisfy detailed balance. A combination of both methods is often used to
construct a Markov chain transition for a target distribution that is too complex to
sample from directly. We describe both methods in the following.
Metropolis-Hastings algorithm. The Metropolis algorithm was first introduced in the seminal paper published by Metropolis et al. [85], and further generalized by Hastings [50]. Algorithm 1 outlines the procedure to update θ in a Markov chain using the Metropolis-Hastings algorithm. A change of state (θ′ ← θ) is proposed based on a proposal distribution q(θ′ ← θ),
and the acceptance probability $r_A$ of the proposed change is computed in consideration of the
ratio under the target distribution p(θ′)/p(θ) and the transition ratio q(θ ← θ′)/q(θ′ ← θ)
that adjusts for non-symmetric proposal distributions. When a symmetric proposal distribution is employed, i.e., q(θ ← θ′) = q(θ′ ← θ), this updating scheme is called the Metropolis
algorithm. In such a case the acceptance probability $r_A$ becomes $\min\{1,\, p(\theta')/p(\theta)\}$. If the
proposed change is rejected, then the succeeding sample of the Markov chain stays at the
current state θ.
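Algorithm 1 Metropolis-Hastings algorithm
  Propose $\theta' \sim q(\theta' \leftarrow \theta)$
  Compute acceptance probability $r_A = \min\left\{1,\ \dfrac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)}\right\}$
  Set $\theta = \theta'$ with probability $r_A$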
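The update in Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the function names, the standard normal toy target, and the Gaussian random-walk proposal are all assumptions chosen for the example.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_q, theta0, n_samples, rng):
    """Draw correlated samples with the Metropolis-Hastings update.

    log_target(theta)  : log of any function proportional to p(theta)
    propose(theta, rng): draws theta' from q(theta' <- theta)
    log_q(new, old)    : log q(new <- old)
    """
    theta = theta0
    samples = []
    for _ in range(n_samples):
        theta_new = propose(theta, rng)
        # log of p(theta') q(theta <- theta') / (p(theta) q(theta' <- theta))
        log_ratio = (log_target(theta_new) + log_q(theta, theta_new)
                     - log_target(theta) - log_q(theta_new, theta))
        # Accept with probability r_A = min(1, ratio); otherwise stay put.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta = theta_new
        samples.append(theta)
    return samples


# Hypothetical toy target: a standard normal known only up to a constant.
def log_target(x):
    return -0.5 * x * x


# Symmetric Gaussian random walk, so the q terms cancel (Metropolis case).
def propose(x, rng):
    return x + rng.gauss(0.0, 1.0)


def log_q(new, old):
    return -0.5 * (new - old) ** 2


samples = metropolis_hastings(log_target, propose, log_q, 0.0, 20000,
                              random.Random(0))
```

Working with log densities avoids numerical underflow, and only an unnormalized target is needed because the normalizing constant cancels in the ratio.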
Transitions defined by Metropolis-Hastings algorithm leave the target distribution p(θ) invariant as they satisfy detailed balance:
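$$
\begin{aligned}
T(\theta' \leftarrow \theta)\, p(\theta)
&= \min\left\{1,\ \frac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)}\right\} q(\theta' \leftarrow \theta)\, p(\theta) \\
&= \min\left\{p(\theta)\, q(\theta' \leftarrow \theta),\ p(\theta')\, q(\theta \leftarrow \theta')\right\} \\
&= \min\left\{p(\theta')\, q(\theta \leftarrow \theta'),\ p(\theta)\, q(\theta' \leftarrow \theta)\right\} \\
&= \min\left\{1,\ \frac{p(\theta)\, q(\theta' \leftarrow \theta)}{p(\theta')\, q(\theta \leftarrow \theta')}\right\} q(\theta \leftarrow \theta')\, p(\theta') \\
&= T(\theta \leftarrow \theta')\, p(\theta').
\end{aligned}
\tag{2.4}
$$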
Since the acceptance probability is based on the probability ratio, the algorithm does not
require a direct evaluation of the target distribution p(θ), i.e., evaluation of some function
that is proportional to p(θ) suffices.
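Detailed balance and invariance can be checked numerically on a small discrete state space. The sketch below builds the Metropolis-Hastings transition matrix for a three-state target; the target `p` and the non-symmetric proposal `q` are arbitrary values assumed for illustration. Here `T[a][b]` is the probability of moving from state `a` to state `b`, i.e., T(b ← a) in the notation above.

```python
def mh_transition_matrix(p, q):
    """Build T[a][b], the probability of moving from state a to state b.

    p[a]    : target probability of state a (may be unnormalized)
    q[a][b] : proposal probability of b given the current state a
    """
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b and q[a][b] > 0:
                accept = min(1.0, (p[b] * q[b][a]) / (p[a] * q[a][b]))
                T[a][b] = q[a][b] * accept
        # Rejected proposals (and self-proposals) keep the chain at a.
        T[a][a] = 1.0 - sum(T[a][b] for b in range(n) if b != a)
    return T


# Hypothetical target and non-symmetric proposal on three states.
p = [0.2, 0.3, 0.5]
q = [[0.10, 0.60, 0.30],
     [0.40, 0.20, 0.40],
     [0.50, 0.25, 0.25]]
T = mh_transition_matrix(p, q)
```

For this matrix, `p[a] * T[a][b]` equals `p[b] * T[b][a]` for every pair of states (Equation 2.2), and summing `p[a] * T[a][b]` over `a` returns `p[b]` (Equation 2.1).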
Gibbs sampling. Gibbs sampling was initially proposed in the context of image restoration [39] and has been applied to a variety of Bayesian inference problems since
the early nineties [38]. The algorithm updates each component of the state ($\theta_i' \leftarrow \theta_i$) by
sampling from its full conditional, i.e., the conditional distribution given the other components,
$p(\theta_i \mid \theta_{\setminus i})$. Gibbs sampling can be shown to be a special case of the Metropolis-Hastings algorithm,
with the proposal distribution defined by
$$
q(\theta' \leftarrow \theta) = p(\theta_i' \mid \theta_{\setminus i})\, I\left(\theta'_{\setminus i} = \theta_{\setminus i}\right), \tag{2.5}
$$
where $I\left(\theta'_{\setminus i} = \theta_{\setminus i}\right)$ is an indicator function ensuring all components other than $\theta_i$ stay fixed.
The acceptance probability of Gibbs sampling is one:
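$$
\begin{aligned}
r_A &= \frac{p(\theta')\, q(\theta \leftarrow \theta')}{p(\theta)\, q(\theta' \leftarrow \theta)} \\
&= \frac{p(\theta')\, p(\theta_i \mid \theta'_{\setminus i})\, I\left(\theta_{\setminus i} = \theta'_{\setminus i}\right)}{p(\theta)\, p(\theta_i' \mid \theta_{\setminus i})\, I\left(\theta'_{\setminus i} = \theta_{\setminus i}\right)} \\
&= \frac{p(\theta_i', \theta'_{\setminus i})\, p(\theta_i \mid \theta'_{\setminus i})\, I\left(\theta_{\setminus i} = \theta'_{\setminus i}\right)}{p(\theta_i, \theta_{\setminus i})\, p(\theta_i' \mid \theta_{\setminus i})\, I\left(\theta'_{\setminus i} = \theta_{\setminus i}\right)} \\
&= \frac{p(\theta_i' \mid \theta'_{\setminus i})\, p(\theta'_{\setminus i})\, p(\theta_i \mid \theta'_{\setminus i})\, I\left(\theta_{\setminus i} = \theta'_{\setminus i}\right)}{p(\theta_i \mid \theta_{\setminus i})\, p(\theta_{\setminus i})\, p(\theta_i' \mid \theta_{\setminus i})\, I\left(\theta'_{\setminus i} = \theta_{\setminus i}\right)} \\
&= 1.
\end{aligned}
\tag{2.6}
$$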
Gibbs sampling eliminates the need to define and tune a proposal distribution, which is
required in the Metropolis-Hastings algorithm. It is useful when the full conditional is tractable
and can be sampled efficiently.
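As a minimal illustration of Gibbs sampling, consider a toy target that is not part of the alignment model: a bivariate normal with zero means, unit variances and correlation ρ. Both full conditionals are univariate normals, so each component can be drawn directly and every draw is accepted. The function name and the choice ρ = 0.8 are assumptions made for this sketch.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y.
    """
    x, y = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)
    draws = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)  # update x given the current y
        y = rng.gauss(rho * x, sd)  # update y given the new x
        draws.append((x, y))
    return draws


draws = gibbs_bivariate_normal(0.8, 20000, random.Random(1))
```

The empirical correlation of `draws` should be close to 0.8; stronger correlation between components slows the mixing of the component-wise updates.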
2.2 Generative model
We introduce a generative model to formulate the alignment problem. The observed chromatograms from N replicates, $y_i(t)$, $i = 1, \ldots, N$, $t = t_1, \ldots, t_T$, are assumed to share a
similar profile characterized by the prototype function $m(t)$. We use a piecewise linear function to model the nonlinear variability [101] along retention time. For the i-th chromatogram
at retention time t, the intensity value is referred to the prototype function indexed by
the mapping function $u_i(t)$, i.e., $m(u_i(t))$. Figure 2.1 illustrates the relationship between
the sample and the prototype function through the mapping function. The intensity of the
sample at time 1, for example, corresponds to the intensity of the prototype function at
time $u_i(1) = 2$, that is, $m(2) = 3$.
By incorporating the variability of intensity using an affine transformation, each chromatogram
is modeled as:
$$
y_i(t) = c_i + a_i \cdot m(u_i(t)) + \varepsilon_i(t), \quad i = 1, 2, \ldots, N, \tag{2.7}
$$
where $a_i$ and $c_i$ are scaling and translation parameters, and the errors $\varepsilon_i(t)$ are independent and identically distributed normal random variables, $\varepsilon_i(t) \overset{\text{iid}}{\sim} N(0, \sigma_\varepsilon^2)$. These parameters characterize the individual variability of each chromatogram. Conjugate normal prior
distributions are chosen for $a_i$ and $c_i$, i.e., $a_i \sim N(a_0, \sigma_a^2)$ and $c_i \sim N(c_0, \sigma_c^2)$.
The prototype function is modeled with B-spline regression [20]:
$$
\mathbf{m} = \mathbf{B}_m \boldsymbol{\psi}, \tag{2.8}
$$
where $\mathbf{m} = (m(t_1), \ldots, m(t_T))^\top \in \mathbb{R}^{T \times 1}$, $\mathbf{B}_m \in \mathbb{R}^{T \times L}$, and $\boldsymbol{\psi} \in \mathbb{R}^{L \times 1}$. The regression
coefficients for the prototype function, $\boldsymbol{\psi}$, are specified by a first-order random walk: $\psi_l \sim N(\psi_{l-1}, \sigma_\psi^2)$, where $\psi_0 = 0$.
[Figure 2.1 panels: $m(u_i(t))$ (intensity versus time of sample), $u_i(t)$ (time of prototype versus time of sample), and $m(t)$ (intensity versus time of prototype).]
Figure 2.1: An illustrative example showing the functionalities of the prototype function
$m(t)$ and the mapping function $u_i(t)$. For example, the intensity of the sample at time t = 1
corresponds to the intensity of the prototype function at time $u_i(1) = 2$, which is given by
$m(2) = 3$. (© 2013 IEEE)
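The mapping function $u_i(t)$ is a piecewise linear function characterized by a set of knots
$\boldsymbol{\tau} = (\tau_0, \tau_1, \ldots, \tau_{K+1})$ and their corresponding mapping indices $\boldsymbol{\phi}_i = (\phi_{i,0}, \phi_{i,1}, \ldots, \phi_{i,K+1})$,
where $\tau_0 = t_1$ and $\tau_{K+1} = t_T$. The mapping function is defined in terms of $\boldsymbol{\tau}$ and $\boldsymbol{\phi}_i$,
$$
u_i(t) =
\begin{cases}
\phi_{i,j}, & \text{for } t = \tau_j \\[4pt]
\dfrac{\tau_{j+1} - t}{\tau_{j+1} - \tau_j}\, \phi_{i,j} + \dfrac{t - \tau_j}{\tau_{j+1} - \tau_j}\, \phi_{i,j+1}, & \text{for } \tau_j < t < \tau_{j+1}
\end{cases}
\tag{2.9}
$$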
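The piecewise linear mapping in Equation 2.9 amounts to linear interpolation between the knots. The following sketch evaluates it for a single time point; the function name and the knot and coefficient values are hypothetical, chosen so that $u_i(1) = 2$ as in the example of Figure 2.1.

```python
from bisect import bisect_right


def mapping_function(t, knots, phi):
    """Evaluate the piecewise linear mapping u_i(t) of Equation 2.9.

    knots: increasing knot positions (tau_0, ..., tau_{K+1})
    phi  : mapping indices (phi_0, ..., phi_{K+1}) at those knots
    """
    if not knots[0] <= t <= knots[-1]:
        raise ValueError("t lies outside the knot range")
    # Index j of the segment [tau_j, tau_{j+1}] that contains t.
    j = min(bisect_right(knots, t) - 1, len(knots) - 2)
    w = (knots[j + 1] - t) / (knots[j + 1] - knots[j])
    return w * phi[j] + (1.0 - w) * phi[j + 1]


# Hypothetical knots and coefficients for illustration.
knots = [1.0, 2.0, 4.0]
phi = [2.0, 2.5, 4.0]
u_at_1 = mapping_function(1.0, knots, phi)  # at a knot: returns phi_0
u_at_3 = mapping_function(3.0, knots, phi)  # halfway between 2.5 and 4.0
```

At a knot the weight `w` is exactly one or zero, so the function returns the corresponding coefficient, matching the first case of Equation 2.9.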
To preserve the elution order of the LC process, the monotonicity constraint $\phi_{i,0} < \cdots < \phi_{i,K+1}$
needs to be satisfied. The prior of $\boldsymbol{\phi}_i$ is specified via slope values $\boldsymbol{\omega}_i = (\omega_{i,1}, \ldots, \omega_{i,K+1})$,
where $\omega_{i,j}$ is assumed to follow a normal distribution with mean $\omega_{i,j-1}$ and variance $\sigma_\omega^2$,
truncated below by 0 to ensure monotonicity of $\boldsymbol{\phi}_i$, and is defined as
$$
\omega_{i,j} = \frac{\phi_{i,j} - \phi_{i,j-1}}{\tau_j - \tau_{j-1}}. \tag{2.10}
$$
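The prior of $\boldsymbol{\phi}_i$ is therefore given by
$$
p(\boldsymbol{\phi}_i) = \prod_{j=1}^{K+1} p_{TN}\left(\omega_{i,j} \mid \omega_{i,j-1}, \sigma_\omega^2\right), \tag{2.11}
$$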
where $\omega_{i,0} = 1$ and $p_{TN}(\cdot)$ corresponds to the truncated normal density function. Finally, we
specify the priors for the other model parameters to complete the hierarchy:
$$
\begin{aligned}
a_0 &\sim N(\mu_a, \sigma_{a_0}^2), \\
c_0 &\sim N(\mu_c, \sigma_{c_0}^2), \\
1/\sigma_a^2 &\sim G(\alpha_a, \beta_a), \\
1/\sigma_c^2 &\sim G(\alpha_c, \beta_c), \\
1/\sigma_\varepsilon^2 &\sim G(\alpha_\varepsilon, \beta_\varepsilon), \\
1/\sigma_\psi^2 &\sim G(\alpha_\psi, \beta_\psi).
\end{aligned}
$$
These priors are chosen to be conjugate to the likelihood function. Figure 2.2 presents the
directed acyclic graph of the single-profile alignment model where the model parameters are
represented by open circles, the hyperparameters by solid dots, and the observations by filled
circles.
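To make the prior on the mapping coefficients concrete, the sketch below computes the slopes of Equation 2.10 and the log of the prior in Equation 2.11 for one chromatogram, using a normal density truncated below at zero. The function names and the example values are assumptions for illustration; a non-monotone coefficient vector receives zero prior probability (a log prior of minus infinity).

```python
import math


def slopes(knots, phi):
    """Slopes omega_j = (phi_j - phi_{j-1}) / (tau_j - tau_{j-1}), Eq. 2.10."""
    return [(phi[j] - phi[j - 1]) / (knots[j] - knots[j - 1])
            for j in range(1, len(knots))]


def log_truncnorm_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) truncated below at 0, evaluated at x > 0."""
    z = (x - mean) / sd
    log_pdf = -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))
    # Probability mass of the untruncated normal above zero: Phi(mean / sd).
    mass = 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))
    return log_pdf - math.log(mass)


def log_prior_phi(knots, phi, sigma_omega):
    """Log of Equation 2.11 with omega_0 = 1."""
    total, prev = 0.0, 1.0
    for w in slopes(knots, phi):
        if w <= 0.0:  # monotonicity violated: zero prior probability
            return -math.inf
        total += log_truncnorm_pdf(w, prev, sigma_omega)
        prev = w
    return total


# Hypothetical example: the identity mapping versus a distorted mapping.
knots = [0.0, 1.0, 2.0, 3.0]
lp_identity = log_prior_phi(knots, [0.0, 1.0, 2.0, 3.0], 0.2)
lp_distorted = log_prior_phi(knots, [0.0, 0.5, 2.0, 3.0], 0.2)
```

With these values the identity mapping (all slopes equal to one) has the higher prior density, reflecting the random-walk prior centered at $\omega_{i,0} = 1$.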
Figure 2.2: Directed acyclic graph of the single-profile alignment model. (© 2013 IEEE)
2.3 Posterior inference
Based on the generative model introduced in Section 2.2, the alignment problem is translated
into an inference task: given the chromatograms $\mathbf{y} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}$, we need to estimate
the model parameters $\mathbf{a}, \mathbf{c}, \boldsymbol{\psi}, \boldsymbol{\phi}, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_\varepsilon^2, \sigma_\psi^2$. Once the inference is complete, the
alignment can be carried out by applying an inverse mapping function to each chromatogram,
i.e., $\hat{y}_i(t) = y_i\left(\hat{u}_i^{-1}(t)\right)$. Parameter inference is carried out using MCMC methods. For
the parameters whose full conditionals have closed forms, $\boldsymbol{\theta} = \{\mathbf{a}, \mathbf{c}, \boldsymbol{\psi}, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_\varepsilon^2, \sigma_\psi^2\}$, we
use the Gibbs sampler to update their values. The remaining parameters, i.e., the mapping
function coefficients, $\boldsymbol{\phi} = \{\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_N\}$, are updated using a Metropolis-Hastings algorithm
with a uniform proposal density that reflects the constraints on the boundaries.
2.3.1 Full conditionals of model parameters
Derivation of the full conditional for each of the parameters, a, c, ψ, a0 , c0 , σa2 , σc2 , σε2 , σψ2 , is
shown below. Table 2.1 summarizes all the full conditionals used in the single-profile model.
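Full conditional of $a_0$
$$
\begin{aligned}
p\left(a_0 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus a_0}, \boldsymbol{\phi}\right)
&\propto p\left(a_0 \mid \mu_a, \sigma_{a_0}^2\right) \prod_{i=1}^{N} p\left(a_i \mid a_0, \sigma_a^2\right) \\
&\propto \exp\left(-\frac{(a_0 - \mu_a)^2}{2\sigma_{a_0}^2}\right) \times \exp\left(-\sum_{i=1}^{N} \frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \\
&\propto \exp\left(-\frac{a_0^2 - 2a_0\left(\sigma_a^2 \mu_a + \sigma_{a_0}^2 \sum_{i=1}^{N} a_i\right) / \left(\sigma_a^2 + N\sigma_{a_0}^2\right)}{2\sigma_{a_0}^2 \sigma_a^2 / \left(\sigma_a^2 + N\sigma_{a_0}^2\right)}\right),
\end{aligned}
\tag{2.12}
$$
which implies the full conditional of $a_0$ is a normal distribution $N(\hat{a}_0, \hat{\sigma}_{a_0}^2)$, where $\hat{\sigma}_{a_0}^2 = \left(1/\sigma_{a_0}^2 + N/\sigma_a^2\right)^{-1}$ and $\hat{a}_0 = \hat{\sigma}_{a_0}^2 \left(\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2\right)$.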
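The resulting Gibbs update for $a_0$ is a draw from a normal distribution whose mean and variance follow directly from Equation 2.12. A minimal sketch is given below; the function names and the numerical values are hypothetical and serve only to illustrate the formulas.

```python
import math
import random


def a0_full_conditional(a, sigma2_a, mu_a, sigma2_a0):
    """Mean and variance of the normal full conditional of a_0 (Eq. 2.12)."""
    n = len(a)
    var_hat = 1.0 / (1.0 / sigma2_a0 + n / sigma2_a)
    mean_hat = var_hat * (mu_a / sigma2_a0 + sum(a) / sigma2_a)
    return mean_hat, var_hat


def sample_a0(a, sigma2_a, mu_a, sigma2_a0, rng):
    """One Gibbs draw of a_0 given the current a_i and variance parameters."""
    mean_hat, var_hat = a0_full_conditional(a, sigma2_a, mu_a, sigma2_a0)
    return rng.gauss(mean_hat, math.sqrt(var_hat))


# Hypothetical current state of the chain.
mean_hat, var_hat = a0_full_conditional([1.0, 1.2, 0.8], sigma2_a=0.5,
                                        mu_a=1.0, sigma2_a0=2.0)
a0_draw = sample_a0([1.0, 1.2, 0.8], 0.5, 1.0, 2.0, random.Random(4))
```

The same function applies to $c_0$ after substituting the $c_i$, $\sigma_c^2$, $\mu_c$ and $\sigma_{c_0}^2$, since the two full conditionals have identical form.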
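Full conditional of $c_0$
$$
\begin{aligned}
p\left(c_0 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus c_0}, \boldsymbol{\phi}\right)
&\propto p\left(c_0 \mid \mu_c, \sigma_{c_0}^2\right) \prod_{i=1}^{N} p\left(c_i \mid c_0, \sigma_c^2\right) \\
&\propto \exp\left(-\frac{(c_0 - \mu_c)^2}{2\sigma_{c_0}^2}\right) \times \exp\left(-\sum_{i=1}^{N} \frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \\
&\propto \exp\left(-\frac{c_0^2 - 2c_0\left(\sigma_c^2 \mu_c + \sigma_{c_0}^2 \sum_{i=1}^{N} c_i\right) / \left(\sigma_c^2 + N\sigma_{c_0}^2\right)}{2\sigma_{c_0}^2 \sigma_c^2 / \left(\sigma_c^2 + N\sigma_{c_0}^2\right)}\right),
\end{aligned}
\tag{2.13}
$$
which implies the full conditional of $c_0$ is a normal distribution $N(\hat{c}_0, \hat{\sigma}_{c_0}^2)$, where $\hat{\sigma}_{c_0}^2 = \left(1/\sigma_{c_0}^2 + N/\sigma_c^2\right)^{-1}$ and $\hat{c}_0 = \hat{\sigma}_{c_0}^2 \left(\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2\right)$.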
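Full conditional of $(a_i, c_i)$
$$
\begin{aligned}
p\left(a_i, c_i \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus (a_i, c_i)}, \boldsymbol{\phi}\right)
&\propto p\left(a_i \mid a_0, \sigma_a^2\right) p\left(c_i \mid c_0, \sigma_c^2\right) p\left(\mathbf{y}_i \mid \hat{\mathbf{y}}_i, \sigma_\varepsilon^2\right) \\
&\propto \exp\left(-\frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \exp\left(-\frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \exp\left(-\frac{\lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2}{2\sigma_\varepsilon^2}\right),
\end{aligned}
\tag{2.14}
$$
which implies the full conditional of $(a_i, c_i)$ is a bivariate normal distribution $N(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$,
where $\hat{\boldsymbol{\Sigma}}_i = \left(\boldsymbol{\Sigma}_{ac}^{-1} + \mathbf{W}^\top \mathbf{W}/\sigma_\varepsilon^2\right)^{-1}$ and $\hat{\boldsymbol{\mu}}_i = \hat{\boldsymbol{\Sigma}}_i \left(\boldsymbol{\Sigma}_{ac}^{-1} \begin{pmatrix} a_0 & c_0 \end{pmatrix}^\top + \mathbf{W}^\top \mathbf{y}_i/\sigma_\varepsilon^2\right)$, with $\boldsymbol{\Sigma}_{ac} = \mathrm{diag}(\sigma_a^2, \sigma_c^2)$ and $\mathbf{W} = \begin{pmatrix} \mathbf{B}_m(u_i)\boldsymbol{\psi} & \mathbf{1} \end{pmatrix} \in \mathbb{R}^{T \times 2}$.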
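Full conditional of $1/\sigma_a^2$
$$
\begin{aligned}
p\left(1/\sigma_a^2 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus \sigma_a^2}, \boldsymbol{\phi}\right)
&\propto p\left(1/\sigma_a^2 \mid \alpha_a, \beta_a\right) \prod_{i=1}^{N} p\left(a_i \mid a_0, \sigma_a^2\right) \\
&\propto \left(\frac{1}{\sigma_a^2}\right)^{\alpha_a - 1} \exp\left(-\frac{\beta_a}{\sigma_a^2}\right) \times \left(\frac{1}{\sigma_a^2}\right)^{N/2} \exp\left(-\sum_{i=1}^{N} \frac{(a_i - a_0)^2}{2\sigma_a^2}\right) \\
&= \left(\frac{1}{\sigma_a^2}\right)^{\alpha_a + N/2 - 1} \exp\left(-\frac{1}{\sigma_a^2}\left(\beta_a + \frac{1}{2}\sum_{i=1}^{N} (a_i - a_0)^2\right)\right),
\end{aligned}
\tag{2.15}
$$
which implies the full conditional of $1/\sigma_a^2$ is a gamma distribution $G(\hat{\alpha}_a, \hat{\beta}_a)$, where $\hat{\alpha}_a = \alpha_a + N/2$ and $\hat{\beta}_a = \beta_a + \sum_{i=1}^{N} (a_i - a_0)^2/2$.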
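The gamma update for a precision parameter can be sketched as follows. Equation 2.15 uses the shape and rate parameterization, with density proportional to $x^{\alpha - 1} e^{-\beta x}$, whereas Python's `random.gammavariate` takes a shape and a scale, so the rate is inverted before sampling. The function names and numerical values are assumptions for illustration.

```python
import random


def precision_full_conditional(values, center, alpha, beta):
    """Shape and rate of the gamma full conditional of a precision (Eq. 2.15)."""
    shape = alpha + len(values) / 2.0
    rate = beta + 0.5 * sum((v - center) ** 2 for v in values)
    return shape, rate


def sample_precision(values, center, alpha, beta, rng):
    """One Gibbs draw of 1/sigma^2; gammavariate expects (shape, scale)."""
    shape, rate = precision_full_conditional(values, center, alpha, beta)
    return rng.gammavariate(shape, 1.0 / rate)


# Hypothetical current values of a_i and a_0, with prior G(1, 1).
shape, rate = precision_full_conditional([1.0, 2.0, 3.0], 2.0, 1.0, 1.0)
rng = random.Random(5)
precision_draws = [sample_precision([1.0, 2.0, 3.0], 2.0, 1.0, 1.0, rng)
                   for _ in range(20000)]
```

The same update applies to $1/\sigma_c^2$, and to $1/\sigma_\varepsilon^2$ and $1/\sigma_\psi^2$ once the shape increment and the sum of squares are replaced by the corresponding terms of their full conditionals.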
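Full conditional of $1/\sigma_c^2$
$$
\begin{aligned}
p\left(1/\sigma_c^2 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus \sigma_c^2}, \boldsymbol{\phi}\right)
&\propto p\left(1/\sigma_c^2 \mid \alpha_c, \beta_c\right) \prod_{i=1}^{N} p\left(c_i \mid c_0, \sigma_c^2\right) \\
&\propto \left(\frac{1}{\sigma_c^2}\right)^{\alpha_c - 1} \exp\left(-\frac{\beta_c}{\sigma_c^2}\right) \times \left(\frac{1}{\sigma_c^2}\right)^{N/2} \exp\left(-\sum_{i=1}^{N} \frac{(c_i - c_0)^2}{2\sigma_c^2}\right) \\
&= \left(\frac{1}{\sigma_c^2}\right)^{\alpha_c + N/2 - 1} \exp\left(-\frac{1}{\sigma_c^2}\left(\beta_c + \frac{1}{2}\sum_{i=1}^{N} (c_i - c_0)^2\right)\right),
\end{aligned}
\tag{2.16}
$$
which implies the full conditional of $1/\sigma_c^2$ is a gamma distribution $G(\hat{\alpha}_c, \hat{\beta}_c)$, where $\hat{\alpha}_c = \alpha_c + N/2$ and $\hat{\beta}_c = \beta_c + \sum_{i=1}^{N} (c_i - c_0)^2/2$.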
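Full conditional of $1/\sigma_\varepsilon^2$
$$
\begin{aligned}
p\left(1/\sigma_\varepsilon^2 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus \sigma_\varepsilon^2}, \boldsymbol{\phi}\right)
&\propto p\left(1/\sigma_\varepsilon^2 \mid \alpha_\varepsilon, \beta_\varepsilon\right) \prod_{i=1}^{N} p\left(\mathbf{y}_i \mid \hat{\mathbf{y}}_i, \sigma_\varepsilon^2\right) \\
&\propto \left(\frac{1}{\sigma_\varepsilon^2}\right)^{\alpha_\varepsilon - 1} \exp\left(-\frac{\beta_\varepsilon}{\sigma_\varepsilon^2}\right) \times \left(\frac{1}{\sigma_\varepsilon^2}\right)^{NT/2} \exp\left(-\sum_{i=1}^{N} \frac{\lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2}{2\sigma_\varepsilon^2}\right) \\
&= \left(\frac{1}{\sigma_\varepsilon^2}\right)^{\alpha_\varepsilon + NT/2 - 1} \exp\left(-\frac{1}{\sigma_\varepsilon^2}\left(\beta_\varepsilon + \frac{1}{2}\sum_{i=1}^{N} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2\right)\right),
\end{aligned}
\tag{2.17}
$$
which implies the full conditional of $1/\sigma_\varepsilon^2$ is a gamma distribution $G(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$, where $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + NT/2$ and $\hat{\beta}_\varepsilon = \beta_\varepsilon + \sum_{i=1}^{N} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2/2$.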
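Full conditional of $1/\sigma_\psi^2$
$$
\begin{aligned}
p\left(1/\sigma_\psi^2 \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus \sigma_\psi^2}, \boldsymbol{\phi}\right)
&\propto p\left(1/\sigma_\psi^2 \mid \alpha_\psi, \beta_\psi\right) p\left(\boldsymbol{\psi} \mid \sigma_\psi^2\right) \\
&\propto \left(\frac{1}{\sigma_\psi^2}\right)^{\alpha_\psi - 1} \exp\left(-\frac{\beta_\psi}{\sigma_\psi^2}\right) \times \left|\sigma_\psi^2 \boldsymbol{\Omega}^{-1}\right|^{-1/2} \exp\left(-\frac{1}{2} \boldsymbol{\psi}^\top \left(\sigma_\psi^2 \boldsymbol{\Omega}^{-1}\right)^{-1} \boldsymbol{\psi}\right) \\
&= \left(\frac{1}{\sigma_\psi^2}\right)^{\alpha_\psi + L/2 - 1} \exp\left(-\frac{1}{\sigma_\psi^2}\left(\beta_\psi + \frac{1}{2} \boldsymbol{\psi}^\top \boldsymbol{\Omega} \boldsymbol{\psi}\right)\right),
\end{aligned}
\tag{2.18}
$$
which implies the full conditional of $1/\sigma_\psi^2$ is a gamma distribution $G(\hat{\alpha}_\psi, \hat{\beta}_\psi)$, where $\hat{\alpha}_\psi = \alpha_\psi + L/2$ and $\hat{\beta}_\psi = \beta_\psi + \boldsymbol{\psi}^\top \boldsymbol{\Omega} \boldsymbol{\psi}/2$, and $\boldsymbol{\Omega} \in \mathbb{R}^{L \times L}$ is a tridiagonal matrix:
$$
\boldsymbol{\Omega} =
\begin{pmatrix}
2 & -1 & & & 0 \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
0 & & & -1 & 1
\end{pmatrix}.
$$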
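Full conditional of $\boldsymbol{\psi}$
$$
\begin{aligned}
p\left(\boldsymbol{\psi} \mid \mathbf{y}, \boldsymbol{\theta}_{\setminus \boldsymbol{\psi}}, \boldsymbol{\phi}\right)
&\propto p\left(\boldsymbol{\psi} \mid \sigma_\psi^2\right) \prod_{i=1}^{N} p\left(\mathbf{y}_i \mid \hat{\mathbf{y}}_i, \sigma_\varepsilon^2\right) \\
&\propto \exp\left(-\frac{1}{2} \boldsymbol{\psi}^\top \left(\sigma_\psi^2 \boldsymbol{\Omega}^{-1}\right)^{-1} \boldsymbol{\psi}\right) \prod_{i=1}^{N} \exp\left(-\frac{\lVert \mathbf{y}_i - \left(c_i \mathbf{1} + a_i \mathbf{B}_m(u_i) \boldsymbol{\psi}\right) \rVert^2}{2\sigma_\varepsilon^2}\right),
\end{aligned}
\tag{2.19}
$$
which implies the full conditional of $\boldsymbol{\psi}$ is a multivariate normal distribution $N(\hat{\boldsymbol{\mu}}_\psi, \hat{\boldsymbol{\Sigma}}_\psi)$,
where $\hat{\boldsymbol{\Sigma}}_\psi = \left(\boldsymbol{\Omega}/\sigma_\psi^2 + \mathbf{X}^\top \mathbf{X}/\sigma_\varepsilon^2\right)^{-1}$ and $\hat{\boldsymbol{\mu}}_\psi = \hat{\boldsymbol{\Sigma}}_\psi \mathbf{X}^\top (\mathbf{Y} - \mathbf{C})/\sigma_\varepsilon^2$, and
$$
\mathbf{X} = \begin{pmatrix} a_1 \mathbf{B}_m(u_1) \\ a_2 \mathbf{B}_m(u_2) \\ \vdots \\ a_N \mathbf{B}_m(u_N) \end{pmatrix}, \quad
\mathbf{Y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_N \end{pmatrix}, \quad \text{and} \quad
\mathbf{C} = \begin{pmatrix} c_1 \mathbf{1} \\ c_2 \mathbf{1} \\ \vdots \\ c_N \mathbf{1} \end{pmatrix}.
$$
Table 2.1: Summary of full conditionals of model parameters.
Parameter | Distribution | Updated parameters
$a_0$ | $N(\hat{a}_0, \hat{\sigma}_{a_0}^2)$ | $\hat{\sigma}_{a_0}^2 = \left(1/\sigma_{a_0}^2 + N/\sigma_a^2\right)^{-1}$; $\hat{a}_0 = \hat{\sigma}_{a_0}^2 \left(\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2\right)$
$c_0$ | $N(\hat{c}_0, \hat{\sigma}_{c_0}^2)$ | $\hat{\sigma}_{c_0}^2 = \left(1/\sigma_{c_0}^2 + N/\sigma_c^2\right)^{-1}$; $\hat{c}_0 = \hat{\sigma}_{c_0}^2 \left(\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2\right)$
$(a_i, c_i)$ | $N(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$ | $\hat{\boldsymbol{\Sigma}}_i = \left(\boldsymbol{\Sigma}_{ac}^{-1} + \mathbf{W}^\top \mathbf{W}/\sigma_\varepsilon^2\right)^{-1}$; $\hat{\boldsymbol{\mu}}_i = \hat{\boldsymbol{\Sigma}}_i \left(\boldsymbol{\Sigma}_{ac}^{-1} \begin{pmatrix} a_0 & c_0 \end{pmatrix}^\top + \mathbf{W}^\top \mathbf{y}_i/\sigma_\varepsilon^2\right)$
$1/\sigma_a^2$ | $G(\hat{\alpha}_a, \hat{\beta}_a)$ | $\hat{\alpha}_a = \alpha_a + N/2$; $\hat{\beta}_a = \beta_a + \sum_{i=1}^{N} (a_i - a_0)^2/2$
$1/\sigma_c^2$ | $G(\hat{\alpha}_c, \hat{\beta}_c)$ | $\hat{\alpha}_c = \alpha_c + N/2$; $\hat{\beta}_c = \beta_c + \sum_{i=1}^{N} (c_i - c_0)^2/2$
$1/\sigma_\varepsilon^2$ | $G(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$ | $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + NT/2$; $\hat{\beta}_\varepsilon = \beta_\varepsilon + \sum_{i=1}^{N} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert^2/2$
$1/\sigma_\psi^2$ | $G(\hat{\alpha}_\psi, \hat{\beta}_\psi)$ | $\hat{\alpha}_\psi = \alpha_\psi + L/2$; $\hat{\beta}_\psi = \beta_\psi + \boldsymbol{\psi}^\top \boldsymbol{\Omega} \boldsymbol{\psi}/2$
$\boldsymbol{\psi}$ | $N(\hat{\boldsymbol{\mu}}_\psi, \hat{\boldsymbol{\Sigma}}_\psi)$ | $\hat{\boldsymbol{\Sigma}}_\psi = \left(\boldsymbol{\Omega}/\sigma_\psi^2 + \mathbf{X}^\top \mathbf{X}/\sigma_\varepsilon^2\right)^{-1}$; $\hat{\boldsymbol{\mu}}_\psi = \hat{\boldsymbol{\Sigma}}_\psi \mathbf{X}^\top (\mathbf{Y} - \mathbf{C})/\sigma_\varepsilon^2$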
2.3.2 Block Metropolis-Hastings algorithm
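Algorithm 2 outlines one iteration of the MCMC procedure in the Bayesian hierarchical
curve registration (BHCR) model [127]. The transition ratio $r_T$ for the proposal density is
one, while the likelihood ratio $r_L$ and prior ratio $r_P$ in the Metropolis-Hastings acceptance
probability, $r_A$, for updating $\phi_{i,j}$ ($\phi'_{i,j} \leftarrow \phi_{i,j}^{(m)}$) are given by:
$$
r_L = \frac{p\left(\mathbf{y}_i \mid \phi'_{i,j}, \boldsymbol{\phi}_{i,\setminus j}^{(m+1)}, \boldsymbol{\theta}^{(m+1)}\right)}{p\left(\mathbf{y}_i \mid \phi_{i,j}^{(m)}, \boldsymbol{\phi}_{i,\setminus j}^{(m+1)}, \boldsymbol{\theta}^{(m+1)}\right)}, \tag{2.20}
$$
$$
r_P = \frac{p_{TN}\left(\omega'_{i,j} \mid \omega_{i,j-1}^{(m+1)}, \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j}^{(m)} \mid \omega_{i,j-1}^{(m+1)}, \sigma_\omega^2\right)}
\times \frac{p_{TN}\left(\omega'_{i,j+1} \mid \omega'_{i,j}, \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j+1}^{(m)} \mid \omega_{i,j}^{(m)}, \sigma_\omega^2\right)}
\times \frac{p_{TN}\left(\omega_{i,j+2}^{(m)} \mid \omega'_{i,j+1}, \sigma_\omega^2\right)}{p_{TN}\left(\omega_{i,j+2}^{(m)} \mid \omega_{i,j+1}^{(m)}, \sigma_\omega^2\right)}, \tag{2.21}
$$
where $\boldsymbol{\phi}_{i,\setminus j}^{(m+1)}$ denotes the set of coefficients $\boldsymbol{\phi}_i$ at iteration $m + 1$ with $\phi_{i,j}$ excluded, i.e.,
$\boldsymbol{\phi}_{i,\setminus j}^{(m+1)} = \left(\phi_{i,0}^{(m+1)}, \ldots, \phi_{i,j-1}^{(m+1)}, \phi_{i,j+1}^{(m)}, \ldots, \phi_{i,K+1}^{(m)}\right)$. The MCMC move for updating $\phi_{i,j}$ changes
the slopes $\omega_{i,j}$ and $\omega_{i,j+1}$, and consequently the involved densities.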
Algorithm 2 MCMC update of $\{\boldsymbol{\theta}^{(m)}, \boldsymbol{\phi}^{(m)}\}$ in the BHCR model
  Update $\boldsymbol{\theta}^{(m+1)} \leftarrow \boldsymbol{\theta}^{(m)}$ using Gibbs sampling
  for all $\phi_{i,j}$ do
    Propose $\phi'_{i,j} \sim U\left(\phi_{i,j}^{(m)} - \delta,\ \phi_{i,j}^{(m)} + \delta\right)$
    Compute the likelihood ratio $r_L$ using Equation 2.20
    Compute the prior ratio $r_P$ using Equation 2.21
    Compute the acceptance probability $r_A = \min\{1,\ r_L \times r_P\}$
    Set $\phi_{i,j}^{(m+1)} = \phi'_{i,j}$ with probability $r_A$ ($\phi_{i,j}^{(m+1)} = \phi_{i,j}^{(m)}$ if the proposal is rejected)
  end for
When the misalignment involves translation shift along retention time, the element-wise
Metropolis-Hastings move for φi requires a series of successive proposals in the same direction
to be accepted sequentially. This incremental update hinders the mixing of the MCMC
sampler. In addition, the monotonicity constraint inhibits the flexibility of the proposal to
consider relatively large changes for each of the mapping function coefficients. To address
this issue, we consider block proposal moves [110] to allow a set of successive coefficients to
be adjusted simultaneously. Rather than updating each coefficient φi,j sequentially, the φi,j ’s
are first grouped into several non-overlapping blocks, which consist of successive coefficients
along the retention time, and proposals are made to update each block. The block move
offers a more efficient way to update the coefficients, which improves the mixing of the
MCMC sampler.
We introduce binary indicator variables $b_j \in \{0, 1\}$, $j = 1, \ldots, K$, to identify the block
boundaries, where $b_j = 1$ if $\tau_j$ is at the boundary of a block and $b_0 = b_{K+1} = 1$. This indicator
variable follows a Bernoulli distribution with $p(b_j = 1) = r_{\text{block}}$. Based on the boundary
configuration, coefficients within the same block $\boldsymbol{\phi}_{i,j:j+B_j-1} = (\phi_{i,j}, \phi_{i,j+1}, \ldots, \phi_{i,j+B_j-1})$ are
proposed to be moved in the same direction, where $b_j = b_{j+B_j} = 1$ and $b_{j+1} = \cdots = b_{j+B_j-1} = 0$.
The element-wise move can be viewed as a special case of the block move where $r_{\text{block}} = 1$
and each block only contains a single coefficient. We consider a mixture of transitions where
$r_{\text{block}}$ is randomly selected from $\{1, 1/2, 1/4\}$ at each iteration. The configuration of blocks
is therefore variable within a Markov chain. We summarize the procedure for the block
Metropolis-Hastings technique in Algorithm 3.
Algorithm 3 Block Metropolis-Hastings algorithm
  Sample $r_{\text{block}} \sim U\{1, \tfrac{1}{2}, \tfrac{1}{4}\}$
  Sample $\delta \sim \tfrac{1}{2} \cdot U(0, \delta_{\text{small}}) + \tfrac{1}{2} \cdot U(0, \delta_{\text{large}})$
  Sample $b_j \sim \text{Bernoulli}(r_{\text{block}})$, for $j = 1, \ldots, K$
  for all blocks $\boldsymbol{\phi}_{i,j:j+B_j-1}$ do
    Propose $\boldsymbol{\phi}'_{i,j:j+B_j-1} \sim U\left(\boldsymbol{\phi}_{i,j:j+B_j-1}^{(m)} - \delta,\ \boldsymbol{\phi}_{i,j:j+B_j-1}^{(m)} + \delta\right)$
    Compute the likelihood ratio $r_L$
    Compute the prior ratio $r_P$
    Compute the acceptance probability $r_A = \min\{1,\ r_L \times r_P\}$
    Set $\boldsymbol{\phi}_{i,j:j+B_j-1}^{(m+1)} = \boldsymbol{\phi}'_{i,j:j+B_j-1}$ with probability $r_A$
  end for
For MCMC updates in the alignment model, most computational effort is spent in evaluating
proposals by the Metropolis-Hastings algorithm. In each MCMC iteration, the element-wise
Metropolis-Hastings move requires $N \times K$ proposals and evaluations for all the
mapping function coefficients $\phi_{i,j}$, whereas the proposed block move reduces this number by
an expected factor of $1/r_{\text{block}}$. With the considered mixture of transitions, $r_{\text{block}} \in \{1, 1/2, 1/4\}$, the expected
factor of reduction is given by
$$
\frac{1}{3} \times 1 + \frac{1}{3} \times 2 + \frac{1}{3} \times 4 = \frac{7}{3}.
$$
The calculation is based on an unoptimized implementation of the Metropolis-Hastings algorithm, in which the computation involved in evaluating each proposed change is identical.
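The block configuration and the expected reduction factor can be illustrated with a short sketch. The partition below follows one literal reading of the boundary definition, which is an assumption of this example: index 0 always opens a block because $b_0 = 1$, each later index $j$ with $b_j = 1$ opens a new block, and the blocks cover indices 0 to K. The function names are hypothetical.

```python
from fractions import Fraction
import random


def sample_blocks(K, r_block, rng):
    """Partition coefficient indices 0..K into blocks of consecutive indices."""
    blocks = [[0]]  # b_0 = 1: index 0 always opens a block
    for j in range(1, K + 1):
        if rng.random() < r_block:  # b_j = 1: tau_j is a block boundary
            blocks.append([j])
        else:                       # b_j = 0: j joins the current block
            blocks[-1].append(j)
    return blocks


def expected_reduction(r_values):
    """Expected value of 1 / r_block when r_block is uniform over r_values."""
    return sum(Fraction(1, len(r_values)) / r for r in r_values)


blocks = sample_blocks(20, 0.25, random.Random(3))
factor = expected_reduction([Fraction(1), Fraction(1, 2), Fraction(1, 4)])
```

With `r_block = 1` every index forms its own block, which recovers the element-wise move, and `factor` evaluates to the fraction 7/3 stated above.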
2.4 Number and position of knots
Knot specification for the mapping functions is crucial to the alignment result. Although
accurate alignment requires sufficiently dense knots to enable precise adjustments, an overly
dense knot specification restricts the transition flexibility and is prone to overfitting. We
address the knot specification issue using stochastic search variable selection (SSVS) [40].
At each iteration, along with the update of all the model parameters {θ, φ}, a change is
proposed for the knot specification using one of the following transition moves:
1. knot inclusion – adding a knot into the current knot list; or
2. knot exclusion – removing a knot from the current knot list.
In the alignment model, the first and the last time points are set as fixed knots, i.e., unchanged throughout the Markov chain, to control the span of the mapping function. For
the middle $T - 2$ time points, $t_2, \ldots, t_{T-1}$, K of them are selected as knots and their
placement is $(\tau_1, \ldots, \tau_K)$. A binary indicator variable $\gamma_t$ associated with each time point is
introduced to denote whether a knot is present at time t, that is, $\gamma_t = 1$ if t belongs to $(\tau_1, \ldots, \tau_K)$
and $\gamma_t = 0$ otherwise. This binary indicator is assumed to follow a Bernoulli distribution
with $p(\gamma_t = 1) = r_{\text{knot}}$, and thus the probability of a valid knot specification is given
by
$$
p(\tau_1, \ldots, \tau_K) = r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}. \tag{2.22}
$$
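Equation 2.22 is easiest to handle on the log scale. The sketch below evaluates the log prior of a knot specification and shows that adding one knot changes it by $\log\left(r_{\text{knot}}/(1 - r_{\text{knot}})\right)$; the function name and the values T = 100 and $r_{\text{knot}} = 0.2$ are assumptions chosen for illustration.

```python
import math


def log_knot_prior(K, T, r_knot):
    """Log of Equation 2.22: K knots among the T - 2 interior time points."""
    return K * math.log(r_knot) + (T - 2 - K) * math.log(1.0 - r_knot)


# Hypothetical setting: 100 time points, prior inclusion probability 0.2.
log_prior_5 = log_knot_prior(5, 100, 0.2)
log_prior_6 = log_knot_prior(6, 100, 0.2)
change = log_prior_6 - log_prior_5  # equals log(0.2 / 0.8)
```

A value of $r_{\text{knot}}$ below one half therefore penalizes every additional knot, which discourages overly dense knot specifications.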
We estimate the knot specification for each chromatogram separately, i.e., each mapping
function ui is defined by its own set of knots τ i and mapping function coefficients φi . To
keep the following discussion uncluttered, we make a slight abuse of notation by dropping
the index of chromatogram i. At each iteration, after the update of {θ, φ}, one of the middle
$T - 2$ time points is randomly sampled from a uniform distribution, i.e., $t \sim U\{t_2, \ldots, t_{T-1}\}$,
and its corresponding $\gamma_t$ is proposed to be updated to $\gamma'_t$ such that $\gamma'_t = 1 - \gamma_t$. The procedure
for each proposal is summarized as follows:
Knot inclusion. When $\gamma_t = 0$ and $\tau_{k-1} < t < \tau_k$, t is added into the knot list ($\gamma'_t = 1 \leftarrow \gamma_t = 0$) such that the new set of knots becomes
$$
\boldsymbol{\tau}' = (\tau'_0, \ldots, \tau'_{k-1}, \tau'_k, \tau'_{k+1}, \ldots, \tau'_{K+2}) = (\tau_0, \ldots, \tau_{k-1}, t, \tau_k, \ldots, \tau_{K+1}).
$$
The mapping function coefficient for the newly added knot, $\phi'_k$, is sampled from a normal
distribution truncated below by $\phi_{k-1}$ and above by $\phi_k$, $\nu \sim TN(\mu, \sigma^2)$, such that the new
set of mapping function coefficients becomes
$$
\boldsymbol{\phi}' = (\phi'_0, \ldots, \phi'_{k-1}, \phi'_k, \phi'_{k+1}, \ldots, \phi'_{K+2}) = (\phi_0, \ldots, \phi_{k-1}, \nu, \phi_k, \ldots, \phi_{K+1}),
$$
where
$$
\mu = \frac{\tau_k - t}{\tau_k - \tau_{k-1}}\, \phi_{k-1} + \frac{t - \tau_{k-1}}{\tau_k - \tau_{k-1}}\, \phi_k, \tag{2.23}
$$
and
$$
\sigma = \min\{\phi_k - \mu,\ \mu - \phi_{k-1}\}/4. \tag{2.24}
$$
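The proposal for the new coefficient can be sketched as follows: the mean in Equation 2.23 is the value of the current mapping function at t, and the standard deviation in Equation 2.24 places both truncation bounds at least four standard deviations from the mean. The rejection loop used to draw from the truncated normal is an assumption of this sketch, not necessarily the sampling scheme used in this work, and the function names and values are hypothetical.

```python
import random


def inclusion_proposal_params(t, tau_prev, tau_next, phi_prev, phi_next):
    """Mean (Eq. 2.23) and standard deviation (Eq. 2.24) of the proposal."""
    w = (tau_next - t) / (tau_next - tau_prev)
    mu = w * phi_prev + (1.0 - w) * phi_next
    sigma = min(phi_next - mu, mu - phi_prev) / 4.0
    return mu, sigma


def sample_new_coefficient(t, tau_prev, tau_next, phi_prev, phi_next, rng):
    """Draw nu from N(mu, sigma^2) truncated to (phi_prev, phi_next)."""
    mu, sigma = inclusion_proposal_params(t, tau_prev, tau_next,
                                          phi_prev, phi_next)
    while True:  # rejection sampling; rejections are rare at +/- 4 sigma
        nu = rng.gauss(mu, sigma)
        if phi_prev < nu < phi_next:
            return nu


# Hypothetical neighbouring knots at 10 and 20, with a new knot at t = 15.
mu, sigma = inclusion_proposal_params(15.0, 10.0, 20.0, 11.0, 19.0)
rng = random.Random(6)
nus = [sample_new_coefficient(15.0, 10.0, 20.0, 11.0, 19.0, rng)
       for _ in range(1000)]
```

Because every draw lies strictly between the neighbouring coefficients, the proposed coefficient vector always satisfies the monotonicity constraint.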
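The acceptance probability for the proposal $(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}' \leftarrow \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})$ is calculated as
$$
r_A = \underbrace{\frac{p\left(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}'\right)}{p\left(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi}\right)}}_{\text{Likelihood ratio}}
\times \underbrace{\frac{p(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{p(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}}_{\text{Prior ratio}}
\times \underbrace{\frac{q(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi};\ \boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{q(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}';\ \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}}_{\text{Transition ratio}},
$$
where the likelihood ratio, the prior ratio and the transition ratio are considered. The
likelihood function is given by
$$
p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi}) = \prod_{i=1}^{N} p\left(\mathbf{y}_i \mid \hat{\mathbf{y}}_i, \sigma_\varepsilon^2 \mathbf{I}\right), \tag{2.25}
$$
which is a product of N multivariate normal densities, where $\hat{\mathbf{y}}_i = c_i \mathbf{1} + a_i \cdot \mathbf{m}(u_i)$. Based
on the priors of $\boldsymbol{\tau}$ (Equation 2.22) and $\boldsymbol{\phi}$ (Equation 2.11), the prior ratio is given by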
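$$
\begin{aligned}
\frac{p(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{p(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}
&= \frac{r_{\text{knot}}^{K+1} (1 - r_{\text{knot}})^{T-3-K}}{r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}} \times \frac{\prod_{j=1}^{K+2} p_{TN}(\omega'_j \mid \omega'_{j-1}, \sigma_\omega^2)}{\prod_{j=1}^{K+1} p_{TN}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)} \\
&= \frac{r_{\text{knot}}}{1 - r_{\text{knot}}} \times \frac{p_{TN}(\omega'_k \mid \omega'_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega'_{k+1} \mid \omega'_k, \sigma_\omega^2)\, p_{TN}(\omega'_{k+2} \mid \omega'_{k+1}, \sigma_\omega^2)}{p_{TN}(\omega_k \mid \omega_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega_{k+1} \mid \omega_k, \sigma_\omega^2)},
\end{aligned}
\tag{2.26}
$$
where $\omega'_{k-1} = \omega_{k-1}$, $\omega'_{k+2} = \omega_{k+1}$,
$$
\omega'_k = \frac{\phi'_k - \phi'_{k-1}}{\tau'_k - \tau'_{k-1}} = \frac{\nu - \phi_{k-1}}{t - \tau_{k-1}},
$$
and
$$
\omega'_{k+1} = \frac{\phi'_{k+1} - \phi'_k}{\tau'_{k+1} - \tau'_k} = \frac{\phi_k - \nu}{\tau_k - t}.
$$
The transition ratio is the remaining component, which is calculated by
$$
\begin{aligned}
\frac{q(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi} \leftarrow \boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{q(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}' \leftarrow \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}
&= \frac{1/(T-2)}{1/(T-2)} \times \frac{1}{p_{TN}(\phi'_k \mid \mu, \sigma^2)} \times \left|\frac{\partial(\boldsymbol{\tau}', \boldsymbol{\phi}')}{\partial(\boldsymbol{\tau}, t, \boldsymbol{\phi}, \nu)}\right| \\
&= \frac{1}{p_{TN}(\phi'_k \mid \mu, \sigma^2)},
\end{aligned}
\tag{2.27}
$$
where the Jacobian term is needed to account for the change in dimension of $\boldsymbol{\tau}$ and $\boldsymbol{\phi}$ and its
determinant is equal to one because of the one-to-one deterministic mapping. The product
of the three ratios is the acceptance probability for the proposal of knot inclusion.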
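Knot exclusion. When $\gamma_t = 1$ and $t = \tau_k$, the knot $\tau_k$ is removed from the knot list
($\gamma'_t = 0 \leftarrow \gamma_t = 1$), such that the new set of knots and the mapping function coefficients
become
$$
\boldsymbol{\tau}' = (\tau'_0, \ldots, \tau'_{k-1}, \tau'_k, \tau'_{k+1}, \ldots, \tau'_K) = (\tau_0, \ldots, \tau_{k-1}, \tau_{k+1}, \tau_{k+2}, \ldots, \tau_{K+1}),
$$
and
$$
\boldsymbol{\phi}' = (\phi'_0, \ldots, \phi'_{k-1}, \phi'_k, \phi'_{k+1}, \ldots, \phi'_K) = (\phi_0, \ldots, \phi_{k-1}, \phi_{k+1}, \phi_{k+2}, \ldots, \phi_{K+1}).
$$
Similar to the knot inclusion, the acceptance probability for $(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}' \leftarrow \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})$ involves
the likelihood ratio, the prior ratio and the transition ratio, where the latter two are provided
as follows. The prior ratio for knot exclusion is given by
$$
\begin{aligned}
\frac{p(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{p(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}
&= \frac{r_{\text{knot}}^{K-1} (1 - r_{\text{knot}})^{T-1-K}}{r_{\text{knot}}^{K} (1 - r_{\text{knot}})^{T-2-K}} \times \frac{\prod_{j=1}^{K} p_{TN}(\omega'_j \mid \omega'_{j-1}, \sigma_\omega^2)}{\prod_{j=1}^{K+1} p_{TN}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)} \\
&= \frac{1 - r_{\text{knot}}}{r_{\text{knot}}} \times \frac{p_{TN}(\omega'_k \mid \omega'_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega'_{k+1} \mid \omega'_k, \sigma_\omega^2)}{p_{TN}(\omega_k \mid \omega_{k-1}, \sigma_\omega^2)\, p_{TN}(\omega_{k+1} \mid \omega_k, \sigma_\omega^2)\, p_{TN}(\omega_{k+2} \mid \omega_{k+1}, \sigma_\omega^2)},
\end{aligned}
\tag{2.28}
$$
where $\omega'_{k-1} = \omega_{k-1}$, $\omega'_{k+1} = \omega_{k+2}$, and
$$
\omega'_k = \frac{\phi'_k - \phi'_{k-1}}{\tau'_k - \tau'_{k-1}} = \frac{\phi_{k+1} - \phi_{k-1}}{\tau_{k+1} - \tau_{k-1}}.
$$
To calculate the transition ratio, an imaginary move of knot inclusion that adds back the
knot $\tau_k = t$ and samples the mapping function coefficient $\phi_k = \nu'$ needs to be considered.
With the imaginary knot inclusion, the transition ratio is given by
$$
\begin{aligned}
\frac{q(\boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi} \leftarrow \boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}')}{q(\boldsymbol{\gamma}', \boldsymbol{\tau}', \boldsymbol{\phi}' \leftarrow \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\phi})}
&= \frac{1/(T-2) \times p_{TN}(\phi_k \mid \mu, \sigma^2)}{1/(T-2)} \times \left|\frac{\partial(\boldsymbol{\tau}', t, \boldsymbol{\phi}', \nu')}{\partial(\boldsymbol{\tau}, \boldsymbol{\phi})}\right| \\
&= p_{TN}(\phi_k \mid \mu, \sigma^2).
\end{aligned}
\tag{2.29}
$$
Similar to the knot inclusion, the Jacobian determinant is one due to the deterministic
one-to-one mapping.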
2.5 Experiments
We applied the Bayesian alignment model (BAM) to a simulated data set, an LC-MS proteomic data set [76] and two LC-MS metabolomic data sets [71]. The performance of BAM
was compared with that of the Bayesian hierarchical curve registration (BHCR) model [127],
the dynamic time warping (DTW) model [129], and the continuous profile model (CPM) [74].
The advantage of applying appropriate retention time correction prior to performing a
feature-based approach is demonstrated through the LC-MS metabolomic data sets.
2.5.1 Simulated data set
We generated a profile pattern composed of three Gaussian peaks with the same standard
deviation but distinct mean values,
$$
m(t) = 10 \cdot \exp\left(-\frac{(t - \mu_1)^2}{2\sigma^2}\right) + 10 \cdot \exp\left(-\frac{(t - \mu_2)^2}{2\sigma^2}\right) + 10 \cdot \exp\left(-\frac{(t - \mu_3)^2}{2\sigma^2}\right),
$$
where $\mu_1 = 10$, $\mu_2 = 20$, $\mu_3 = 40$, $\sigma = 1.25$, and $t = 0.5, 1, \ldots, 50$ (100 time points). The
mapping function $u_i$ was generated through the even-numbered order statistics as described
in [44] to model the retention time variability. For the number of time points T and the
interval (0, R) considered for the profile, the implementation of $u_i$ is given through the
following steps:
1. Generate 2T + 1 samples uniformly on the interval (0, R).
2. Sort the 2T + 1 samples in ascending order.
3. Pick the even-numbered samples from the sorted samples and assign each to ui (1),
ui (2), . . . , ui (T ).
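The three steps above, together with the profile pattern, can be sketched as follows. The function names and the random seed are assumptions for illustration, and the sketch produces only the noise-free warped profile $m(u_i(t))$.

```python
import math
import random


def prototype(t, centers=(10.0, 20.0, 40.0), sd=1.25):
    """Profile pattern m(t): three Gaussian peaks of height 10."""
    return sum(10.0 * math.exp(-(t - c) ** 2 / (2.0 * sd * sd))
               for c in centers)


def simulate_mapping(T, R, rng):
    """Mapping u_i(1), ..., u_i(T) from even-numbered order statistics."""
    draws = sorted(rng.uniform(0.0, R) for _ in range(2 * T + 1))  # steps 1-2
    return draws[1::2]  # step 3: the 2nd, 4th, ..., 2T-th sorted samples


rng = random.Random(2)
u = simulate_mapping(100, 50.0, rng)    # T = 100, R = 50
warped = [prototype(ut) for ut in u]    # noise-free m(u_i(t))
```

Taking every second order statistic yields a monotone mapping whose increments vary smoothly, which mimics nonlinear retention time drift while preserving the elution order.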
The density of ui (1), ui (2), . . . , ui (T ) is given by
p(ui ) =
(2T + 1)!
×
u
(1)−0
×
u
(2)−u
(1)
×·
·
·×
u
(T
)−u
(T
−1)
×
R−u
(T
)
, (2.30)
i
i
i
i
i
i
R2T +1
where T = 100 and R = 50 were chosen in the simulation. Ten replicate data were generated
by the formulation:
yi (t) = m ui (t) + εi (t), i = 1, . . . , 10,
12
12
10
10
10
8
8
8
8
6
6
6
6
4
4
2
2
0
0
0
10
20
30
40
−2
0
50
10
20
Time
30
40
−2
0
50
4
2
0
10
20
Time
(a) No noise
30
40
−2
0
50
(c) SNR 35
12
10
10
8
8
8
8
6
6
6
6
4
2
2
2
0
0
0
10
20
30
Time
(e) No noise
40
50
−2
0
10
20
30
Time
(f) SNR 40
40
50
Intensity
12
10
Intensity
12
4
−2
0
30
40
50
40
50
(d) SNR 30
10
−2
0
20
Time
12
4
10
Time
(b) SNR 40
Intensity
Intensity
4
2
−2
0
Intensity
12
10
Intensity
12
Intensity
Intensity
where random noises were produced based on signal-to-noise ratio (SNR). Three values of
SNR (30, 35 and 40 dB) were considered. Figure 2.3 depicts one realization of the simulated
data with different noise levels. The data set simulates the scenario where peaks are not
uniformly distributed along the time range, and different choices of knot specification may
lead to distinct results.
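The data-generating procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used in the dissertation: the function names are hypothetical, and the conversion from SNR in dB to a noise variance (mean signal power divided by 10^(SNR/10)) is an assumption, since the text does not state how the noise level was derived from the SNR.

```python
import numpy as np

def prototype(t, mus=(10.0, 20.0, 40.0), sigma=1.25, height=10.0):
    """Three-peak Gaussian profile m(t) with the parameters given in the text."""
    t = np.asarray(t, dtype=float)[..., None]
    return (height * np.exp(-(t - np.asarray(mus)) ** 2 / (2 * sigma ** 2))).sum(axis=-1)

def sample_mapping(T, R, rng):
    """Even-numbered order statistics of 2T+1 uniform draws on (0, R)."""
    draws = np.sort(rng.uniform(0.0, R, size=2 * T + 1))
    return draws[1::2]  # 2nd, 4th, ..., 2T-th smallest values (T values in total)

def simulate_replicates(n_rep=10, T=100, R=50.0, snr_db=None, seed=0):
    """Generate y_i(t) = m(u_i(t)) + eps_i(t) for i = 1..n_rep."""
    rng = np.random.default_rng(seed)
    data = np.empty((n_rep, T))
    for i in range(n_rep):
        signal = prototype(sample_mapping(T, R, rng))
        if snr_db is not None:
            # Assumption: SNR (dB) = 10 * log10(mean signal power / noise variance)
            noise_var = np.mean(signal ** 2) / 10 ** (snr_db / 10)
            signal = signal + rng.normal(0.0, np.sqrt(noise_var), size=T)
        data[i] = signal
    return data
```

Calling `simulate_replicates(snr_db=30)` returns a 10 × 100 array, one row per replicate; passing `snr_db=None` gives the noise-free case.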
Figure 2.3: One realization of simulated data (intensity versus time) with different noise levels: (a) no noise, (b)
SNR 40, (c) SNR 35 and (d) SNR 30. (e)–(h) are the aligned data using BAM with SSVS.
Alignment performance was assessed based on two measurements: 1) correlation coefficient
between pairs of replicate data, and 2) cross-correlation between the profile pattern and
replicate data. It should be emphasized that the pairwise correlation coefficients between
data can exaggerate the true alignment performance as the values do not reflect potential
losses of chromatographic information. Thus, to ensure the peak patterns are not significantly
distorted during alignment, we calculated the correlation coefficients between the profile
pattern and the aligned data as well.
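The two measurements can be sketched as follows. The helper names are hypothetical, and the second function assumes a zero-lag Pearson correlation against the pattern; the exact definition of the cross-correlation used for Table 2.2 is not given in this section, so the sketch should be read as one plausible variant.

```python
import numpy as np

def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all distinct pairs of replicate profiles."""
    profiles = np.asarray(profiles, dtype=float)
    corr = np.corrcoef(profiles)                      # rows are replicates
    upper = np.triu_indices(profiles.shape[0], k=1)   # each pair counted once
    return float(corr[upper].mean())

def mean_pattern_correlation(profiles, pattern):
    """Average Pearson correlation between each replicate and the profile pattern.

    Assumption: zero-lag correlation is used as the agreement measure.
    """
    profiles = np.asarray(profiles, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    return float(np.mean([np.corrcoef(row, pattern)[0, 1] for row in profiles]))
```

A method that warps all replicates toward each other can score well on the first measure while scoring poorly on the second, which is why both are reported.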
We compared the performance of four alignment methods on the simulated data set: DTW,
CPM, BHCR and BAM (with fixed knots and with SSVS). Three values of knot density
(0.05, 0.2 and 0.4) with equally-spaced knots were considered for BHCR and BAM with fixed
knots. The fixed knot specification for BAM is mainly for comparison purpose. In addition,
BAM with SSVS, which can automatically handle the knot specification, was applied to
the data. Table 2.2 summarizes the performance measurements before alignment and after
alignment using each of the four methods. The results for the simulated data are based on
200 realizations.
From Table 2.2, significant improvements are observed after alignment by all the methods.
DTW and CPM yield good performance in terms of pairwise correlation coefficient. However, their cross-correlation results suggest that the profiles are overly distorted in the effort
to make them resemble each other. This phenomenon is undesirable since meaningful information is lost during the alignment. Overall, BAM with SSVS yields the best result
in terms of both pairwise correlation coefficient and cross-correlation with the underlying
pattern. The difference between BAM with fixed knots and BHCR is due to the ability of
BAM to overcome the mixing problem of standard MCMC methods in multimodal models
by using block Metropolis-Hastings updates, whereas BHCR is prone to getting stuck at local modes by relying on incremental updates. The distinction becomes significant with knot
density of 0.4, where BHCR leads to the worst performance in terms of both measurements.
As discussed in Section 2.4, accurate alignment requires sufficiently dense knot placement
while naïvely increasing the number of knots may lead to overfitting. This is primarily due to
the monotonicity constraint for the parameter φ being controlled by the positions of the knots.
Selecting a good knot specification setting is even more involved in practical applications
since there is no ground-truth available on which to calibrate. This problem is circumvented
by the adaptive selection of knots in BAM with SSVS, which provides an automatic approach
to place knots adaptively, according to the complexity of the underlying profile.
2.5.2
LC-MS spike-in data set
The spike-in data set by Listgarten et al. [76] consists of two aliquots of the same human
serum sample where the second aliquot has three known peptides spiked in. Seven replicate
LC-MS runs from each aliquot were acquired using a capillary-scale LC coupled to an ion-trap mass spectrometer. Each run is preprocessed and represented by a 501 (RT points)
× 2401 (m/z bins) data matrix. In order to know the true differences between the two
aliquots in the LC-MS data, eight ground-truth runs from a mixture of spiked-in peptides
(without serum) were acquired and 32 experimentally detected m/z values of ground-truth
were reported. The spike-in experiment can help to evaluate alignment results based on
the true differences spiked in the sample. Detailed experimental information can be found
in [76].
Table 2.2: Pairwise correlation coefficient and cross-correlation with the underlying pattern
for the simulated LC-MS data, before alignment (original) and after alignment by DTW,
CPM, BHCR and BAM (with fixed knots and with SSVS). Means (standard deviations) are
reported for the simulated data based on 200 realizations.

Pairwise correlation
Method                      SNR ∞           SNR 40          SNR 35          SNR 30
Original                    0.366 (0.081)   0.356 (0.075)   0.352 (0.075)   0.312 (0.077)
DTW                         0.836 (0.083)   0.826 (0.091)   0.817 (0.090)   0.790 (0.073)
CPM                         0.871 (0.094)   0.870 (0.090)   0.879 (0.085)   0.850 (0.088)
BHCR, knot density 0.05     0.837 (0.108)   0.834 (0.115)   0.809 (0.111)   0.765 (0.087)
BHCR, knot density 0.2      0.866 (0.170)   0.849 (0.160)   0.812 (0.176)   0.765 (0.128)
BHCR, knot density 0.4      0.461 (0.122)   0.448 (0.116)   0.437 (0.126)   0.403 (0.106)
BAM, knot density 0.05      0.840 (0.111)   0.844 (0.108)   0.825 (0.096)   0.776 (0.081)
BAM, knot density 0.2       0.896 (0.153)   0.880 (0.144)   0.845 (0.167)   0.812 (0.116)
BAM, knot density 0.4       0.768 (0.200)   0.763 (0.190)   0.703 (0.191)   0.628 (0.156)
BAM, SSVS                   0.938 (0.116)   0.933 (0.109)   0.900 (0.112)   0.840 (0.091)

Cross-correlation
Method                      SNR ∞           SNR 40          SNR 35          SNR 30
Original                    0.822 (0.034)   0.816 (0.036)   0.808 (0.036)   0.778 (0.035)
DTW                         0.881 (0.057)   0.876 (0.059)   0.871 (0.061)   0.848 (0.056)
CPM                         0.893 (0.067)   0.889 (0.069)   0.894 (0.061)   0.869 (0.066)
BHCR, knot density 0.05     0.918 (0.040)   0.919 (0.034)   0.910 (0.035)   0.888 (0.029)
BHCR, knot density 0.2      0.934 (0.054)   0.935 (0.043)   0.924 (0.044)   0.896 (0.041)
BHCR, knot density 0.4      0.848 (0.039)   0.841 (0.041)   0.836 (0.042)   0.809 (0.039)
BAM, knot density 0.05      0.917 (0.041)   0.921 (0.033)   0.910 (0.032)   0.891 (0.027)
BAM, knot density 0.2       0.946 (0.043)   0.940 (0.041)   0.932 (0.040)   0.909 (0.034)
BAM, knot density 0.4       0.916 (0.053)   0.912 (0.053)   0.896 (0.053)   0.863 (0.048)
BAM, SSVS                   0.947 (0.044)   0.945 (0.045)   0.939 (0.035)   0.911 (0.032)

Figure 2.4 depicts the base peak chromatograms of the 14 LC-MS runs, where significant
shifts are observed along the RT points. We applied four models, DTW, CPM, BHCR and
BAM to align the chromatograms (Figure 2.5). The alignment result by DTW is shown in
Figure 2.5a, where DTW is prone to overly distorting the profiles and the estimator often
gets stuck in local optima. CPM yields the best performance in this data set in terms of
both visual assessment (Figure 2.5b) and correlation coefficients as shown in Table 2.3. In
general, it works quite well on problems with moderate dimension (less than 1000). The
results by the two Bayesian models, BHCR and BAM are shown in Figures 2.5c and 2.5d,
respectively. In the result by BHCR, several peaks from two replicates of the second aliquot
are not correctly aligned to the majority of peaks in RT range 100–250. Instead, they are
erroneously aligned to other tiny peaks around their original retention times. A similar issue
is observed in the DTW result. As mentioned in Section 2.3.2, the element-wise Metropolis-Hastings move utilized in BHCR is prone to getting stuck at local modes. It is particularly
hard to get away from the trap if the true parameter values are far from the current values and
lie beyond the range of values that can be proposed by the Markov chain transition moves.
In contrast to BHCR, this trapping effect is overcome by BAM as shown in Figure 2.5d.
The inference is based on 15,000 MCMC iterations obtained after discarding the initial
5000 iterations as burn-in. The knot specification is automatically handled with the SSVS
procedure. Figure 2.6a shows the trace plot of the number of knots selected at each MCMC
iteration for the chromatogram from the seventh replicate of the second serum aliquot; we
see that it stabilizes around 60 knots. Figure 2.6b gives a summary of the numbers of
knots across the models visited by the MCMC sampler for each of the 14 chromatograms.
Figure 2.4: (a) Base peak chromatograms (ion count versus retention time) of the original LC-MS data for the serum and serum + spiked-in peptides aliquots. (b) zooms in on the
retention time range 100–250 for the chromatograms in (a). (© 2013 IEEE)
Figure 2.7a depicts the generated profiles by BAM during the initial 200 MCMC iterations for
the chromatogram corresponding to the seventh replicate of the serum aliquot with spiked-in
peptides. The block move effectively corrects the significant misalignments at the beginning
of the Markov chain whereas the MCMC sampler for BHCR gets stuck at inaccurate retention
time points (Figure 2.7b).
Table 2.3: Correlation coefficients for the LC-MS spike-in data, before alignment (original)
and after alignment by DTW, CPM, BHCR (with knot density of 0.2) and BAM (with
SSVS). (© 2013 IEEE)

           Original   DTW    CPM    BHCR   BAM
Group 1    0.35       0.86   0.95   0.92   0.92
Group 2    0.28       0.88   0.94   0.77   0.91
Figure 2.8 shows the posterior difference between the estimated mapping function and the
identity function with the 90% credible interval for all the chromatograms. It deserves pointing out that the nonlinear variability of the LC process as shown in the figure necessitates a
nonlinear modeling of the mapping function. RT ranges where significant peaks are present
yield tighter credible intervals, indicating higher confidence in the alignment result.
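A posterior summary of this kind can be computed directly from the retained MCMC draws of the mapping function. The sketch below is illustrative only: `u_samples` is a hypothetical array holding one sampled mapping function per row, and the sign convention (mapping minus identity) is an assumption.

```python
import numpy as np

def mapping_difference_summary(u_samples, t, level=0.90):
    """Pointwise posterior median and equal-tailed credible band of u(t) - t.

    u_samples : array of shape (n_samples, T), one sampled mapping per row
    t         : array of shape (T,), the retention time grid
    """
    diff = np.asarray(u_samples, dtype=float) - np.asarray(t, dtype=float)
    tail = 50.0 * (1.0 - level)                      # e.g. 5 for a 90% interval
    lower, median, upper = np.percentile(diff, [tail, 50.0, 100.0 - tail], axis=0)
    return lower, median, upper
```

The width `upper - lower` at each retention time point gives a direct measure of the alignment uncertainty at that point.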
Finally, according to the ground-truth of 32 m/z values reported in [76], we observe a clear
contrast in the extracted ion chromatograms (EICs) of 16 m/z values (433.63, 513.50, 524.48,
535.00, 601.03, 615.60, 647.50, 649.11, 674.93, 699.03, 784.50, 811.00, 1047.12, 1297.43,
1348.65, and 1575.09) between two sets of LC-MS runs corresponding to the “presence”
and “absence” of spiked-in peptides. We use these EICs to demonstrate the alignment
performance. Figures 2.9 and 2.10 depict the 16 EICs before alignment and after alignment
by BAM, respectively. As shown in Figure 2.10, the retention time shifts observed in the
Figure 2.5: Aligned chromatograms (ion count versus retention time) by (a) DTW, (b) CPM, (c) BHCR, and (d) BAM. (e),
(f), (g), and (h) zoom in on the retention time range 100–250 for the chromatograms in (a),
(b), (c), and (d), respectively. Misalignments by DTW and BHCR are observed in (e) and
(g). (© 2013 IEEE)
original EICs are effectively corrected by applying BAM. In addition, compared to the EICs
prior to the alignment, the aligned EICs exhibit more distinct and specific differences between
the two sets of LC-MS runs in terms of the spiked-in peptides. This will facilitate the
subsequent peak matching step, as elaborated in Section 2.5.3.
2.5.3
LC-MS metabolomic data sets
Lange et al. [71] compared a set of feature-based alignment models on four publicly available data sets (two proteomic and two metabolomic data sets).² In this benchmark study,
peak detection was performed on the raw data and the resulting peak list was stored in a
.featureXML file (format of OpenMS [121]) for each LC-MS run. In addition to the peak
lists, the two metabolomic data sets, designated as M1 and M2 have raw data available in
.mzData and .netCDF formats, respectively. To evaluate the alignment result, ground-truth
data were generated based on ion annotation [126], correlation of chromatographic profile,
and consistency of peak. Comparison was carried out by measuring recall and precision by
the alignment models against the ground-truth data. For the details, we refer interested
readers to the paper [71].
As discussed in Section 1.3, correct identification of the consensus peaks is crucial for the
² Available at http://msbi.ipb-halle.de/msbi/caap
Figure 2.6: (a) Trace plot of the number of knots in the models visited at each MCMC
iteration for the chromatogram from the seventh replicate of the second serum aliquot. (b)
Box plot of the number of knots visited by the MCMC sampler for each chromatogram.
(© 2013 IEEE)
feature-based approaches. We presume that appropriate retention time correction can facilitate this process and subsequently lead to improved performance. To confirm this idea, we
applied DTW, CPM and BAM to both M1 and M2 data sets based on the chromatograms
extracted from the raw data, where M1 and M2 consist of 44 and 24 LC-MS runs, respectively. According to the estimated mapping functions, we modified the RT values in the peak
lists, i.e., replacing t by ui (t), and applied a feature-based alignment model, OpenMS [121]
on the adjusted lists using the same m/z tolerances as in [71].
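The retention time adjustment of a peak list can be sketched as below. This is a minimal illustration, assuming the estimated mapping function is available on the chromatogram's retention time grid and that linear interpolation is acceptable for peak retention times falling between grid points; the function and argument names are hypothetical.

```python
import numpy as np

def adjust_retention_times(peak_rt, grid_rt, mapped_rt):
    """Replace each peak retention time t by u_i(t).

    peak_rt   : retention times of the detected peaks in run i
    grid_rt   : increasing retention time grid on which u_i was estimated
    mapped_rt : estimated values u_i(grid_rt), e.g. the posterior median
    """
    # Linear interpolation of the mapping between grid points (an assumption)
    return np.interp(peak_rt, grid_rt, mapped_rt)
```

Only the RT column of the peak list changes; m/z values and intensities are passed unchanged to the feature-based model.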
For each LC-MS run, binning along the m/z dimension was performed to obtain 100 binned
chromatograms of identical total ion counts. Chromatogram quality was evaluated using the
mass chromatographic quality (MCQ) value accounting for potential baseline and noise [139].
Those chromatograms with MCQ value less than 0.85 were screened out and the sums of
the remaining ones were used as input to the profile-based models. Figure 2.11 depicts the
chromatograms in data sets M1 and M2, before and after alignment by BAM. By adjusting the RT values based on the BAM alignment result, the identification of the consensus
peaks using the feature-based models is reduced to a simpler task. Table 2.4 presents the
performance measurements, in terms of recall, precision, and F -measure, when using the
feature-based model alone (raw) and when coupling the profile-based models (DTW, CPM,
and BAM) to the feature-based model. We note that applying first a profile-based alignment
model can lead to improved results, although this is not always the case. If the mapping
functions are not correctly estimated, this procedure can deteriorate the result by making the
generation of consensus peaks even more difficult. Using DTW for retention time correction
Figure 2.7: Chromatogram of the seventh replicate from the second serum aliquot and
generated profiles based on the sampled model parameters during the initial 200 MCMC
iterations (iterations 1–50, 51–100 and 101–200) for (a) BAM and (b) BHCR. The region where the MCMC sampler for BHCR
gets stuck at inaccurate retention time points is highlighted. (© 2013 IEEE)
yields improved performance on M2 but lower recall on M1. This may be due to the low
SNR in the latter. We calculate the SNR for these data as
\[
\widehat{\mathrm{SNR}} = \mathrm{E}\left[ \frac{1}{T} \sum_{t \in \{t_1, \dots, t_T\}} \frac{m^2(t)}{\sigma_\varepsilon^2} \right],
\]
based on the posterior distribution estimated by BAM. The estimates of SNR are 18.58 and
41.07 in M1 and M2, respectively. The higher SNR in M2 suggests that better profiles are
available for mapping function estimation and more accurate estimation result is expected.
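Given posterior draws of the prototype function and of the noise variance, an estimate of this kind reduces to a simple average. The sketch below assumes the estimator is the posterior expectation of the mean squared prototype divided by the noise variance; the argument names are hypothetical and the draws would come from the BAM sampler.

```python
import numpy as np

def estimate_snr(m_samples, sigma2_samples):
    """Posterior mean of (1/T) * sum_t m(t)^2 / sigma_eps^2.

    m_samples      : array (n_samples, T), sampled prototype values m(t)
    sigma2_samples : array (n_samples,), sampled noise variances sigma_eps^2
    """
    m_samples = np.asarray(m_samples, dtype=float)
    sigma2_samples = np.asarray(sigma2_samples, dtype=float)
    per_draw = np.mean(m_samples ** 2, axis=1) / sigma2_samples
    return float(per_draw.mean())
```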
On the other hand, it is more challenging to align the noisy chromatograms in M1. As
also demonstrated in Sections 2.5.1 and 2.5.2, DTW is prone to overfitting the data, which
may overly distort the profile and fail to estimate the correct mapping function. The other
considered profile-based model, CPM shows good performance in the simulated data and
the LC-MS proteomic data. Unfortunately, we have been unable to effectively use CPM
for retention time correction in the metabolomic data sets, primarily due to the higher-dimensional data (1525 and 2397 RT points in M1 and M2, respectively). The current version
of CPM does not correctly estimate the mapping functions in M1 and fails to process the
data in M2 due to numerical problems.³ As CPM maps the data onto a higher-dimensional
space, further effort is needed to apply the model to high-resolution LC-MS data.
We note that using BAM for retention time correction prior to performing the feature-based
model leads to improved performance on both data sets, M1 and M2.
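For reference, the F-measure values in Table 2.4 are consistent with the harmonic mean of recall and precision. The short check below assumes that definition (the text does not spell it out) and uses the recall and precision values reported in the table.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (assumed definition of the F-measure)."""
    return 2.0 * recall * precision / (recall + precision)

# Recall and precision as reported in Table 2.4
m1_raw = f_measure(0.87, 0.69)   # OpenMS alone on M1
m1_bam = f_measure(0.88, 0.74)   # BAM followed by OpenMS on M1
m2_bam = f_measure(0.97, 0.83)   # BAM followed by OpenMS on M2
```

Rounded to two decimals, these reproduce the tabulated F-measures for the corresponding entries.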
³ The program was performed on a PC with an Intel Core 2 Duo 64 bit 2.66 GHz
Table 2.4: Comparison of the peak matching results by using OpenMS alone (raw) and using
three profile-based alignment models (DTW, CPM and BAM) for retention time correction
prior to applying OpenMS on the metabolomic data sets. (© 2013 IEEE)

                          OpenMS
Data set   Measure        Raw    DTW    CPM    BAM
M1         Recall         0.87   0.85   0.34   0.88
M1         Precision      0.69   0.74   0.68   0.74
M1         F-measure      0.77   0.79   0.45   0.80
M2         Recall         0.93   0.97   –      0.97
M2         Precision      0.79   0.83   –      0.83
M2         F-measure      0.85   0.89   –      0.89

2.6
Alternative formulation
The monotonicity constraint on the mapping function can hinder efficient estimation of the
mapping function coefficients φ since it is not straightforward to make effective proposals
on a constrained space. This issue can be circumvented by mapping the constrained coefficients φ onto a constraint-free space through the Jupp transformation, where efficient
MCMC methods such as the Hamiltonian Monte Carlo algorithm can be applied to estimate
the transformed coefficients. This section presents an alternative formulation to further
investigate the profile-based alignment.
2.6.1
Jupp transformation
The Jupp transformation and the inverse transformation are given below⁴ [64]:
• Jupp transformation: ϑ = Jupp(φ)
\[
\vartheta_j =
\begin{cases}
\phi_j, & j = 0, K+1, \\
\log \dfrac{\phi_{j+1} - \phi_j}{\phi_j - \phi_{j-1}}, & j = 1, \dots, K.
\end{cases}
\tag{2.31}
\]
• Inverse Jupp transformation: φ = Jupp⁻¹(ϑ)
\[
\phi_j = \vartheta_0 + (\vartheta_{K+1} - \vartheta_0) \times
\frac{1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{j-1})}
     {1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K)},
\tag{2.32}
\]
for j = 1, . . . , K, where φ0 = ϑ0 and φK+1 = ϑK+1.
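The two transformations can be written compactly with cumulative sums. The following is a minimal sketch following Equations 2.31 and 2.32, with hypothetical function names; coefficient vectors are stored as arrays of length K + 2 with the end points in the first and last positions.

```python
import numpy as np

def jupp(phi):
    """Map increasing coefficients phi_0 < ... < phi_{K+1} to unconstrained theta."""
    phi = np.asarray(phi, dtype=float)
    d = np.diff(phi)                      # d[j] = phi_{j+1} - phi_j, all positive
    theta = np.empty_like(phi)
    theta[0], theta[-1] = phi[0], phi[-1]
    theta[1:-1] = np.log(d[1:] / d[:-1])  # log ratio of adjacent increments
    return theta

def inverse_jupp(theta):
    """Map theta back to increasing coefficients phi (Equation 2.32)."""
    theta = np.asarray(theta, dtype=float)
    # e[m] = exp(theta_1 + ... + theta_m), with e[0] = 1 for the empty sum
    e = np.exp(np.concatenate(([0.0], np.cumsum(theta[1:-1]))))
    s = np.cumsum(e)                      # s[j-1] = 1 + e_1 + ... + e_{j-1}
    phi = np.empty_like(theta)
    phi[0], phi[-1] = theta[0], theta[-1]
    phi[1:-1] = theta[0] + (theta[-1] - theta[0]) * s[:-1] / s[-1]
    return phi
```

Any real-valued interior ϑ yields an increasing φ as long as the last end point exceeds the first, which is what allows unconstrained proposals on ϑ.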
CPU and 8 GB RAM. This implementation issue was also noted by the author at
http://www.cs.toronto.edu/~jenn/CPM/README.txt
⁴ The index of sample i is dropped in this section to keep the notation uncluttered.
The inverse Jupp transformation ensures the monotonicity in φ as a series of (positive) exponential elements are accumulated sequentially. In the following, we introduce fundamental
properties of ϑ that are used in the Hamiltonian Monte Carlo algorithm.
Prior of ϑ. As described in Section 2.2, the prior of the mapping function coefficient φ is
specified via the slope of each segment:
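\[
\omega_j \sim \mathcal{TN}(\omega_{j-1}, \sigma_\omega^2).
\]
Assuming $\sigma_\omega^2$ is small enough such that $p_{\mathcal{TN}}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2) \approx p_{\mathcal{N}}(\omega_j \mid \omega_{j-1}, \sigma_\omega^2)$ for j = 1, . . . , K + 1, the prior for $\omega_j$ can be presented as
\[
\omega_j \sim \mathcal{N}(\omega_0, j\sigma_\omega^2).
\tag{2.33}
\]
Furthermore, if the knots are equally-spaced along retention time, i.e., $D_\tau = \tau_j - \tau_{j-1}$, for
j = 1, . . . , K + 1, the prior for the difference between adjacent mapping function coefficients,
$\phi_j - \phi_{j-1}$, is given by
\[
(\phi_j - \phi_{j-1}) \sim \mathcal{N}(D_\tau \omega_0, jD_\tau^2\sigma_\omega^2) = \mathcal{N}(D_\tau, jD_\tau^2\sigma_\omega^2),
\tag{2.34}
\]
and equivalently,
\[
\phi_j - \phi_{j-1} = D_\tau + \varepsilon_0 + \varepsilon_1 + \cdots + \varepsilon_{j-1},
\tag{2.35}
\]
where $\varepsilon_j \overset{\text{iid}}{\sim} \mathcal{N}(0, D_\tau^2\sigma_\omega^2)$, and j = 1, . . . , K. Based on the definition of the Jupp transformation, $\vartheta_j$ can be rewritten as
\[
\begin{aligned}
\vartheta_j &= \log \frac{\phi_{j+1} - \phi_j}{\phi_j - \phi_{j-1}} \\
            &= \log \frac{\phi_j - \phi_{j-1} + \varepsilon_j}{\phi_j - \phi_{j-1}} \\
            &= \log \left( 1 + \frac{\varepsilon_j}{D_\tau + \varepsilon_0 + \cdots + \varepsilon_{j-1}} \right).
\end{aligned}
\tag{2.36}
\]
Using the delta method, the variance of $\vartheta_j$ can be approximated as
\[
\begin{aligned}
\mathrm{Var}(\vartheta_j) &\approx \sum_{l=0}^{j} \left( \frac{\partial \vartheta_j}{\partial \varepsilon_l} \right)^2 \mathrm{Var}(\varepsilon_l)
 + \sum_{l=0}^{j} \sum_{m \neq l} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_j}{\partial \varepsilon_m} \mathrm{Cov}(\varepsilon_l, \varepsilon_m) \\
 &= \sigma_\omega^2.
\end{aligned}
\tag{2.37}
\]
Similarly, the covariance $\mathrm{Cov}(\vartheta_j, \vartheta_k)$ can be approximated as
\[
\begin{aligned}
\mathrm{Cov}(\vartheta_j, \vartheta_k) &\approx \sum_{l=0}^{\max(j,k)} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_k}{\partial \varepsilon_l} \mathrm{Var}(\varepsilon_l)
 + \sum_{l=0}^{\max(j,k)} \sum_{\substack{m=0 \\ m \neq l}}^{\max(j,k)} \frac{\partial \vartheta_j}{\partial \varepsilon_l} \frac{\partial \vartheta_k}{\partial \varepsilon_m} \mathrm{Cov}(\varepsilon_l, \varepsilon_m) \\
 &= 0.
\end{aligned}
\tag{2.38}
\]
Based on Equations 2.37 and 2.38, the prior of ϑ is given by
\[
\vartheta_j \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\omega^2), \quad j = 1, \dots, K.
\tag{2.39}
\]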
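The delta-method approximation can be checked numerically by simulating the increments of Equation 2.35 and forming ϑ through Equation 2.36. The sketch below is a Monte Carlo sanity check under illustrative settings (K, Dτ, σω and the number of draws are arbitrary choices, not values from the dissertation).

```python
import numpy as np

def simulate_theta(K=10, D_tau=5.0, sigma_omega=0.01, n_draws=100_000, seed=0):
    """Draw theta_1..theta_K implied by the random-walk prior on the increments."""
    rng = np.random.default_rng(seed)
    # eps_0, ..., eps_K ~ N(0, D_tau^2 * sigma_omega^2), independent
    eps = rng.normal(0.0, D_tau * sigma_omega, size=(n_draws, K + 1))
    # Column j-1 holds phi_j - phi_{j-1} = D_tau + eps_0 + ... + eps_{j-1}, j = 1..K+1
    increments = D_tau + np.cumsum(eps, axis=1)
    # theta_j = log of the ratio of consecutive increments, j = 1..K
    return np.log(increments[:, 1:] / increments[:, :-1])

theta_draws = simulate_theta()
empirical_var = theta_draws.var(axis=0)       # compare with sigma_omega^2
empirical_corr = np.corrcoef(theta_draws.T)   # off-diagonal entries should be near 0
```

For small σω the empirical variances sit close to σω² and the off-diagonal correlations are close to zero, in line with Equations 2.37 to 2.39; the agreement degrades as σω grows, since the approximation is first order.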
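Partial derivative. As the Jupp transformation is a nonlinear mapping between φ and
ϑ, it is important to know how a change on the transformed coefficients ϑ affects the mapping function coefficients φ, and consequently, the mapping function u(t). This can be
characterized by the partial derivative ∂φj /∂ϑk , where two cases are distinguished:
\[
\frac{\partial \phi_j}{\partial \vartheta_k} =
\frac{-(\vartheta_{K+1} - \vartheta_0)}{\left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]^2}
\times \left[ \exp(\vartheta_1 + \cdots + \vartheta_k) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]
\times \left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{j-1}) \right],
\tag{2.40}
\]
for k ≥ j, and
\[
\frac{\partial \phi_j}{\partial \vartheta_k} =
\frac{-(\vartheta_{K+1} - \vartheta_0)}{\left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]^2}
\times \left[ \exp(\vartheta_1 + \cdots + \vartheta_j) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_K) \right]
\times \left[ 1 + \exp(\vartheta_1) + \cdots + \exp(\vartheta_1 + \cdots + \vartheta_{k-1}) \right],
\tag{2.41}
\]
for k < j.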
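Both cases share the same structure: a tail sum starting at max(j, k) multiplied by a head sum ending at min(j, k) − 1. The sketch below exploits this to build the full K × K matrix of partial derivatives and is intended for checking against finite differences; the function names are hypothetical.

```python
import numpy as np

def inverse_jupp(theta):
    """Equation 2.32: map theta (length K+2) to increasing coefficients phi."""
    theta = np.asarray(theta, dtype=float)
    e = np.exp(np.concatenate(([0.0], np.cumsum(theta[1:-1]))))
    s = np.cumsum(e)
    phi = np.empty_like(theta)
    phi[0], phi[-1] = theta[0], theta[-1]
    phi[1:-1] = theta[0] + (theta[-1] - theta[0]) * s[:-1] / s[-1]
    return phi

def jupp_jacobian(theta):
    """J[j-1, k-1] = d phi_j / d theta_k for j, k = 1..K (Equations 2.40 and 2.41)."""
    theta = np.asarray(theta, dtype=float)
    K = theta.size - 2
    e = np.exp(np.concatenate(([0.0], np.cumsum(theta[1:-1]))))  # e[0..K]
    head = np.cumsum(e)               # head[j-1] = 1 + e_1 + ... + e_{j-1}
    tail = np.cumsum(e[::-1])[::-1]   # tail[m]   = e_m + ... + e_K
    scale = -(theta[-1] - theta[0]) / head[-1] ** 2
    J = np.empty((K, K))
    for j in range(1, K + 1):
        for k in range(1, K + 1):
            hi, lo = max(j, k), min(j, k)
            J[j - 1, k - 1] = scale * tail[hi] * head[lo - 1]
    return J
```

One consequence visible in this form is that the matrix is symmetric in j and k, and every entry is negative whenever ϑK+1 > ϑ0: increasing any ϑk pushes mass toward later increments and therefore lowers every interior φj.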
The partial derivative can be integrated into MCMC sampling schemes, to explore the distribution of interest in a more effective manner. This is accomplished by using the Hamiltonian
Monte Carlo algorithm.
2.6.2
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is an efficient MCMC algorithm [27,92]. It generates samples through simulating appropriate Hamiltonian dynamics, where a proposal is determined
in consideration of an auxiliary momentum variable. In the estimation of ϑ, the logarithm
of the desired density is denoted by L(ϑ). We introduce an independent auxiliary variable
p ∼ N (0, M), with the same dimension as ϑ, and the negative joint log-density of (ϑ, p) is
given by
\[
H(\vartheta, p) = -L(\vartheta) + \frac{1}{2} \log\left( (2\pi)^K |\mathbf{M}| \right) + \frac{1}{2} p^\top \mathbf{M}^{-1} p.
\tag{2.42}
\]
This equation has a physical analogy to a Hamiltonian, which is the sum of a potential
energy function −L(ϑ) defined by the position variable ϑ, and a kinetic energy p⊤M⁻¹p/2.
The variables p and M are therefore interpreted as the momentum and the mass matrix,
respectively. In addition, the dynamics of ϑ and p are determined by Hamilton's
equations:
\[
\frac{d\vartheta}{dt_H} = \frac{\partial H}{\partial p} = \mathbf{M}^{-1} p,
\tag{2.43}
\]
\[
\frac{dp}{dt_H} = -\frac{\partial H}{\partial \vartheta} = \nabla_\vartheta L(\vartheta).
\tag{2.44}
\]
Based on the equations, a Markov chain can be generated via the Hamiltonian dynamics.
Specifically, at each iteration, a position trajectory of ϑ is simulated and the ending point of
the trajectory is proposed as a new state to be evaluated by the Metropolis algorithm, where
the acceptance probability is based on the energy difference between the starting and ending
points of the trajectory. For numerical implementation of non-trivial problems, however, the
Hamilton's equations can only be approximated by discretizing the dynamics with some
small stepsize ϵ. This is typically performed using the “leapfrog” integrator as formulated
below:
\[
\begin{aligned}
p(t_H + \epsilon/2) &= p(t_H) + \epsilon \nabla_\vartheta L(\vartheta(t_H))/2, \\
\vartheta(t_H + \epsilon) &= \vartheta(t_H) + \epsilon \mathbf{M}^{-1} p(t_H + \epsilon/2), \\
p(t_H + \epsilon) &= p(t_H + \epsilon/2) + \epsilon \nabla_\vartheta L(\vartheta(t_H + \epsilon))/2.
\end{aligned}
\]
With appropriate stepsize ϵ and number of steps L, the Hamiltonian Monte Carlo algorithm
proceeds with L steps of the leapfrog integrator, followed by an evaluation using the Metropolis
algorithm as shown in Algorithm 4.
Algorithm 4 Hamiltonian Monte Carlo
  Given ϑ(m), ϵ, L
  Sample a new momentum p ∼ N (0, M)
  Compute the Hamiltonian H based on (ϑ(m), p) using Equation 2.42
  Set ϑ′ ← ϑ(m), p′ ← p
  for l = 1 to L do
    Set (ϑ′, p′) ← Leapfrog(ϑ′, p′, ϵ)
  end for
  Compute the Hamiltonian H′ based on (ϑ′, p′) using Equation 2.42
  Set ϑ(m+1) ← ϑ′ with probability min{1, exp(−H′ + H)}, otherwise set ϑ(m+1) ← ϑ(m)
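Algorithm 4 translates almost line by line into code. The following is a generic sketch and not the implementation used for the alignment model: the function names are hypothetical, the constant term of Equation 2.42 is dropped because it cancels in H − H′, and the target is a highly correlated bivariate normal chosen purely for illustration, with a stepsize of 0.25 and 20 leapfrog steps as assumed settings.

```python
import numpy as np

def leapfrog(theta, p, step, grad_log_density, M_inv):
    """One leapfrog step: half momentum update, full position update, half momentum update."""
    p = p + 0.5 * step * grad_log_density(theta)
    theta = theta + step * (M_inv @ p)
    p = p + 0.5 * step * grad_log_density(theta)
    return theta, p

def hamiltonian(theta, p, log_density, M_inv):
    """H(theta, p) up to the additive constant of Equation 2.42."""
    return -log_density(theta) + 0.5 * p @ M_inv @ p

def hmc(log_density, grad_log_density, theta0, step, n_leapfrog, n_iter, M=None, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    M = np.eye(theta.size) if M is None else np.asarray(M, dtype=float)
    M_inv, chol = np.linalg.inv(M), np.linalg.cholesky(M)
    samples, n_accept = np.empty((n_iter, theta.size)), 0
    for it in range(n_iter):
        p = chol @ rng.standard_normal(theta.size)          # p ~ N(0, M)
        H = hamiltonian(theta, p, log_density, M_inv)
        theta_new, p_new = theta, p
        for _ in range(n_leapfrog):
            theta_new, p_new = leapfrog(theta_new, p_new, step, grad_log_density, M_inv)
        H_new = hamiltonian(theta_new, p_new, log_density, M_inv)
        # Metropolis step: accept with probability min(1, exp(-H' + H))
        if np.log(rng.uniform()) < H - H_new:
            theta, n_accept = theta_new, n_accept + 1
        samples[it] = theta
    return samples, n_accept / n_iter

# Illustrative target: zero-mean bivariate normal with correlation 0.95
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.95], [0.95, 1.0]]))
samples, rate = hmc(lambda th: -0.5 * th @ Sigma_inv @ th,
                    lambda th: -Sigma_inv @ th,
                    theta0=[-1.5, -1.55], step=0.25, n_leapfrog=20, n_iter=2000)
```

The stepsize must stay below the stability limit set by the narrowest direction of the target; too large a value makes the energy error, and hence the rejection rate, grow quickly.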
We demonstrate the advantage of using the HMC algorithm through generating samples from
a bivariate normal distribution (example adapted from [92]). The Hamiltonian is defined by
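\[
H(\vartheta, p) = \frac{1}{2} \vartheta^\top \Sigma^{-1} \vartheta + \frac{1}{2} p^\top p, \quad \text{with } \Sigma = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix},
\]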
where the two components of ϑ are highly correlated. Figure 2.12 shows the simulated
trajectories using 20 leapfrog steps, with an initial position at the lower-left side of the distribution. Two different initial conditions of the momentum p are considered. As shown in
the figure, random-walk behavior is eliminated in the trajectories. Instead, both trajectories
track the distribution in a systematic manner, assisted by the information from the auxiliary
momentum. With the simulated Hamiltonian shown in Figure 2.13, the acceptance probability pA of the Metropolis update is determined based on the energy difference between
the starting and ending points of the trajectory. That is, pA = exp(−0.44) = 0.64 for the
first proposal in Figure 2.12a, and pA = exp(−0.037) = 0.96 for the second proposal in
Figure 2.12b. This example demonstrates the capability of using the HMC algorithm to
effectively explore the distribution of interest. In Section 2.6.3, we introduce the application of the HMC algorithm to perform inference of the Jupp transformed coefficients in the
profile-based alignment problem.
2.6.3
Single-profile alignment using Hamiltonian Monte Carlo
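We use the HMC algorithm to estimate the Jupp transformed coefficients ϑ in the single-profile alignment model, where L(ϑ) is defined as
\[
\begin{aligned}
L(\vartheta) &= \log p(y \mid \vartheta) + \log p(\vartheta) \\
 &= -\sum_{t \in \{t_1, \dots, t_T\}} \frac{\big( y(t) - \hat{y}(t) \big)^2}{2\sigma_\varepsilon^2} - \sum_{j=1}^{K} \frac{\vartheta_j^2}{2\sigma_\omega^2} + \text{constant}.
\end{aligned}
\tag{2.45}
\]
The generated profile ŷ(t) = c + a · m (u(t)) is determined based on the translation and
scaling variables c, a, the prototype function m(t) and the mapping function u(t). The
gradient required by the leapfrog integrator is given by
\[
\begin{aligned}
\frac{\partial L}{\partial \vartheta_k}
 &= \frac{-1}{\sigma_\varepsilon^2} \sum_{t \in \{t_1, \dots, t_T\}} \big( y(t) - \hat{y}(t) \big) \big( -a\, m'(u(t)) \big) \times \left( \frac{\partial u}{\partial \phi_j} \frac{\partial \phi_j}{\partial \vartheta_k} + \frac{\partial u}{\partial \phi_{j+1}} \frac{\partial \phi_{j+1}}{\partial \vartheta_k} \right) - \frac{\vartheta_k}{\sigma_\omega^2} \\
 &= \frac{a}{\sigma_\varepsilon^2} \sum_{t \in \{t_1, \dots, t_T\}} \big( y(t) - \hat{y}(t) \big)\, m'(u(t)) \times \left( \frac{\tau_{j+1} - t}{\tau_{j+1} - \tau_j} \times \frac{\partial \phi_j}{\partial \vartheta_k} + \frac{t - \tau_j}{\tau_{j+1} - \tau_j} \times \frac{\partial \phi_{j+1}}{\partial \vartheta_k} \right) - \frac{\vartheta_k}{\sigma_\omega^2},
\end{aligned}
\tag{2.46}
\]
where ∂φj /∂ϑk is given in Equations 2.40 and 2.41.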
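To make the chain rule concrete, the sketch below evaluates L(ϑ) and its gradient for a toy problem. Several parts are assumptions made for illustration: the mapping u(t) is taken to be piecewise linear between knots (as the interpolation weights in Equation 2.46 suggest), the prototype m is a fixed three-peak Gaussian profile rather than an estimated function, and the knots, data and parameter values are hypothetical.

```python
import numpy as np

MU, SIGMA = np.array([10.0, 20.0, 40.0]), 1.25   # illustrative prototype parameters

def m(u):
    """Stand-in prototype function: three Gaussian peaks."""
    return (10.0 * np.exp(-(u[:, None] - MU) ** 2 / (2 * SIGMA ** 2))).sum(axis=1)

def dm(u):
    """Derivative m'(u) of the stand-in prototype."""
    d = u[:, None] - MU
    return (-10.0 * d / SIGMA ** 2 * np.exp(-d ** 2 / (2 * SIGMA ** 2))).sum(axis=1)

def inverse_jupp_with_jacobian(theta):
    """phi from Equation 2.32 and d phi_j / d theta_k from Equations 2.40 and 2.41."""
    K = theta.size - 2
    e = np.exp(np.concatenate(([0.0], np.cumsum(theta[1:-1]))))
    head, tail = np.cumsum(e), np.cumsum(e[::-1])[::-1]
    span = theta[-1] - theta[0]
    phi = np.concatenate(([theta[0]], theta[0] + span * head[:-1] / head[-1], [theta[-1]]))
    idx = np.arange(1, K + 1)
    hi, lo = np.maximum.outer(idx, idx), np.minimum.outer(idx, idx)
    J = np.zeros((K + 2, K))                 # rows for phi_0 and phi_{K+1} stay zero
    J[1:-1] = -span / head[-1] ** 2 * tail[hi] * head[lo - 1]
    return phi, J

def log_posterior_and_grad(theta, t, y, tau, a, c, sigma_eps, sigma_omega):
    phi, J = inverse_jupp_with_jacobian(theta)
    u = np.interp(t, tau, phi)               # piecewise-linear mapping (assumption)
    resid = y - (c + a * m(u))
    L = (-np.sum(resid ** 2) / (2 * sigma_eps ** 2)
         - np.sum(theta[1:-1] ** 2) / (2 * sigma_omega ** 2))
    # Index j of the knot interval [tau_j, tau_{j+1}] containing each t
    j = np.clip(np.searchsorted(tau, t, side="right") - 1, 0, tau.size - 2)
    w = (t - tau[j]) / (tau[j + 1] - tau[j])
    du_dtheta = (1 - w)[:, None] * J[j] + w[:, None] * J[j + 1]
    grad = a / sigma_eps ** 2 * (resid * dm(u)) @ du_dtheta - theta[1:-1] / sigma_omega ** 2
    return L, grad

# Toy setting (hypothetical values): K = 5 interior knots on (0, 51)
t = np.arange(0.5, 50.5, 0.5)
tau = np.linspace(0.0, 51.0, 7)
theta = np.array([0.0, 0.2, -0.1, 0.3, 0.0, -0.2, 51.0])
y = 1.0 + 2.0 * m(t) + 0.1 * np.sin(t)
args = (t, y, tau, 2.0, 1.0, 0.5, 0.5)       # a, c, sigma_eps, sigma_omega
L_val, grad = log_posterior_and_grad(theta, *args)
```

Comparing `grad` against central finite differences of `L_val` is a cheap way to catch sign or indexing errors before the gradient is handed to the leapfrog integrator.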
We applied the HMC alignment model to the LC-MS spike-in data set discussed in Section 2.5.2. In the HMC model, the trajectory is simulated using a mixture of two stepsizes
(4 × 10⁻⁶ and 8 × 10⁻⁶) and 200 leapfrog steps. With this setting, the acceptance rates
(among 1000 iterations) are 0.39 and 0.35 for ϵ = 4 × 10⁻⁶ and ϵ = 8 × 10⁻⁶, respectively.
Figure 2.14 depicts the base peak chromatograms of the 14 LC-MS runs, before alignment
and after alignment using the HMC model. As shown in the figure, significant misalignments
in the original chromatograms are effectively corrected within the initial 10 iterations, and further improvement at both ends of the chromatograms can be achieved with additional HMC
iterations.
While the preliminary result appears encouraging, the main difficulty in using the HMC algorithm
is selecting an appropriate stepsize ϵ and number of steps L for simulating a trajectory such
that effective MCMC moves can be proposed with high acceptance probabilities. The current
experimental setting was obtained through trial and error, following the guidelines introduced in
[92]. Tuning these parameters requires time and experience and may limit the model’s
practical applicability. Incorporation of adaptive HMC samplers, e.g., the recently developed
“No-U-Turn Sampler” (NUTS) [53], may deserve further investigation.
2.7
Summary
This chapter introduces a Bayesian alignment model for LC-MS data using single chromatograms from multiple LC-MS runs. The single-profile model improves on existing Bayesian
methods by 1) using an efficient MCMC sampler, and 2) adaptively selecting knots for the
mapping function. Due to the mathematical intractability of the mapping function and the
monotonicity constraint imposed on it, designing an effective updating scheme is crucial to
ensure good mixing of the MCMC sampler. We propose a block Metropolis-Hastings algorithm that enables flexible transition and prevents the sampler from getting trapped in local
modes of the posterior distribution. Moreover, an extension using SSVS is provided for adaptive knot specification. For the profile-based alignment, evaluation on both simulated and
real data sets shows improved alignment results in terms of pairwise correlation coefficients,
cross-correlation with the underlying pattern, as well as visual assessment relative to the
ground-truth. In addition, alignment by the proposed model is demonstrated to facilitate
the subsequent peak matching process using two metabolomic benchmark data sets.
Unresolved issues include the following: 1) lack of integration of informative prior knowledge, e.g., the information of internal standards, and 2) implicit assumption of the existence
of an underlying pattern based on a single ion chromatogram. Identifying representative
chromatograms can potentially improve the alignment accuracy by providing more comprehensive information about the retention time variability across LC-MS runs. The strategy
is computationally feasible since the alignment model can be extended to handle multiple
chromatograms by introducing associated prototype functions accordingly. An important
practical issue is how to extract the informative chromatograms from an LC-MS run, which
is discussed in the next chapter.
Figure 2.8: Difference between the identity function and estimated mapping function obtained from the posterior median by BAM for each of the 14 chromatograms (difference versus retention time; chromatograms 1–7 of the first and second aliquots). The filled
region corresponds to the 90% credible interval. (© 2013 IEEE)
Figure 2.9: Original extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms
of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and
serum with spiked-in peptides (right). Panels (a)–(p) correspond to m/z 433.63, 513.50, 524.48, 535.00, 601.03, 615.60, 647.50, 649.11, 674.93, 699.03, 784.50, 811.00, 1047.12, 1297.43, 1348.65 and 1575.09, respectively. (© 2013 IEEE)
47
Chapter 2. Single-profile Bayesian alignment model
[Figure 2.10: sixteen pairs of panels: (a) m/z 433.63, (b) m/z 513.50, (c) m/z 524.48, (d) m/z 535.00, (e) m/z 601.03, (f) m/z 615.60, (g) m/z 647.50, (h) m/z 649.11, (i) m/z 674.93, (j) m/z 699.03, (k) m/z 784.50, (l) m/z 811.00, (m) m/z 1047.12, (n) m/z 1297.43, (o) m/z 1348.65, (p) m/z 1575.09.]
Figure 2.10: Aligned extracted ion chromatograms for each of the 16 m/z values corresponding to the spiked-in peptides. For each m/z value, two plots showing the chromatograms of all seven replicates are depicted: chromatograms for aliquots with serum alone (left) and serum with spiked-in peptides (right). (© 2013 IEEE)
[Figure 2.11: four panels: (a) Original chromatograms in M1 data set, (b) Aligned chromatograms in M1 data set, (c) Original chromatograms in M2 data set, (d) Aligned chromatograms in M2 data set; x-axis: Retention time (sec); y-axis: Ion count.]
Figure 2.11: Chromatograms in the metabolomic data sets, M1 and M2, before and after alignment by BAM. The inset is a zoomed part in the middle retention time range of the chromatograms. (© 2013 IEEE)
[Figure 2.12: position trace and momentum trace for two initial conditions: (a) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, 1)⊤; (b) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, −1.5)⊤.]
Figure 2.12: Trajectories for a bivariate normal distribution, simulated using 20 leapfrog steps (ε = 0.25) with an initial position at the lower-left side of the distribution. Contours of equal probability ratio to the highest (0.1, 0.2, . . . , 0.9) are depicted. Different values of initial momentum are considered.
[Figure 2.13: two panels: (a) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, 1)⊤; (b) ϑ(0) = (−1.5, −1.5)⊤, p(0) = (−1, −1.5)⊤; x-axis: Leapfrog step; y-axis: Hamiltonian H(θ, p).]
Figure 2.13: The Hamiltonians along the trajectories in Figure 2.12, under different initial conditions.
[Figure 2.14: four panels: (a) Original, (b) Iteration 10 of HMC, (c) Iteration 100 of HMC, (d) Iteration 1000 of HMC; each panel shows "Serum" and "Serum + spiked-in peptides" chromatograms; x-axis: Retention time; y-axis: Ion count.]
Figure 2.14: Base peak chromatograms of the LC-MS data. Alignment is performed using the HMC model.
Chapter 3
Multi-profile Bayesian alignment model
The generic task of retention time alignment is to estimate a set of mapping functions in
$N$ LC-MS runs, $u_i(t)$, $i = 1, \ldots, N$, $t = t_1, \ldots, t_T$, that characterizes the mapping relationship between observed retention times in each LC-MS run and a consensus reference.
This chapter extends the single-profile alignment model introduced in Chapter 2 to handle
multiple representative chromatograms simultaneously. Moreover, we use Gaussian process
regression on the internal standards to derive a prior distribution for the mapping functions,
which is then integrated into the profile-based alignment model. Figure 3.1 presents the
three main components of the multi-profile Bayesian alignment model, which are elaborated
in the following sections.
3.1 Gaussian process prior
During the sample preparation of an LC-MS experiment, an internal standard mixture is
often spiked into the sample for the purpose of quality assessment of the experiment. The
mixture is composed of compounds whose behavior is well characterized, so that a set of
peaks can be easily detected in the LC-MS data and associated with these compounds. It
is therefore possible to identify the peaks of the internal standards and their retention times in
each LC-MS run. With this information, an adjustment can be made for each internal standard
peak. Furthermore, this can be extended to other time points by conducting a Gaussian
process regression to estimate the mapping function for each run.
[Figure 3.1: schematic with three panels: Gaussian Process Prior, Chromatographic Clustering, and Profile-based Alignment; labels in the last panel: prototype function, mapping function, synthetic data, similarity, observation.]
Figure 3.1: Three main components of the multi-profile Bayesian alignment model: Gaussian process prior, chromatographic clustering, and profile-based alignment.
For each LC-MS run, we have the mapping relationship $\{\mathbf{s}, \mathbf{r}\}$, where $\mathbf{s} = (s_1, \ldots, s_R)^\top$ is the
vector of original retention times for the $R$ internal standard peaks, and $\mathbf{r} = (r_1, \ldots, r_R)^\top$
is the corresponding assigned vector of reference times estimated by the average of each
standard peak across multiple runs. A Gaussian process prior is defined over a latent mapping
function $u_i(t)$ of the observation $\{\mathbf{s}, \mathbf{r}\}$, that is
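$$u_i(\mathbf{s}) \mid \mathbf{s} \sim \mathcal{N}(\boldsymbol{\mu}_u, \boldsymbol{\Sigma}_u), \tag{3.1}$$
where the mean function is an identity function, i.e., $\boldsymbol{\mu}_u = \mathbf{s}$, and the $R \times R$ covariance
matrix $\boldsymbol{\Sigma}_u$ is defined via the covariance function $\kappa$
$$\mathrm{Cov}\big(u_i(s), u_i(s')\big) = \kappa(s, s') = \sigma_u^2 \exp\left(-\frac{(s - s')^2}{2\sigma_s^2}\right), \tag{3.2}$$
such that
$$\boldsymbol{\Sigma}_u =
\begin{bmatrix}
\mathrm{Cov}\big(u_i(s_1), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_1), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_1), u_i(s_R)\big) \\
\mathrm{Cov}\big(u_i(s_2), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_2), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_2), u_i(s_R)\big) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}\big(u_i(s_R), u_i(s_1)\big) & \mathrm{Cov}\big(u_i(s_R), u_i(s_2)\big) & \cdots & \mathrm{Cov}\big(u_i(s_R), u_i(s_R)\big)
\end{bmatrix}
=
\begin{bmatrix}
\kappa(s_1, s_1) & \kappa(s_1, s_2) & \cdots & \kappa(s_1, s_R) \\
\kappa(s_2, s_1) & \kappa(s_2, s_2) & \cdots & \kappa(s_2, s_R) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(s_R, s_1) & \kappa(s_R, s_2) & \cdots & \kappa(s_R, s_R)
\end{bmatrix}. \tag{3.3}$$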
The covariance function reflects greater dependence between neighboring time points than
distant points, and the parameters $\sigma_s^2$ and $\sigma_u^2$ define how closely and how significantly neighboring time points affect each other, respectively. The likelihood function is defined as
$$\mathbf{r} \mid u_i(\mathbf{s}) \sim \mathcal{N}\big(u_i(\mathbf{s}), \sigma_n^2 \mathbf{I}\big). \tag{3.4}$$
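Based on the defined likelihood function and the Gaussian process, the joint distribution of
$u_i(t)$ and $\mathbf{r}$ is a multivariate normal distribution:
$$\begin{bmatrix} u_i(t) \\ \mathbf{r} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} t \\ \mathbf{s} \end{bmatrix}, \begin{bmatrix} \kappa(t, t) & \kappa(t, \mathbf{s}^\top) \\ \kappa(\mathbf{s}, t) & \boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I} \end{bmatrix} \right), \tag{3.5}$$
where $\kappa(t, \mathbf{s}^\top) = \big(\kappa(t, s_1), \kappa(t, s_2), \ldots, \kappa(t, s_R)\big)$ and $\kappa(\mathbf{s}, t) = \big(\kappa(s_1, t), \kappa(s_2, t), \ldots, \kappa(s_R, t)\big)^\top$.
Given $\{\mathbf{s}, \mathbf{r}\}$, the predictive distribution of the mapping function $u_i(t)$ at time $t$ can be inferred based on the conditional distribution of $u_i(t)$
$$u_i(t) \mid \mathbf{s}, \mathbf{r}, t \sim \mathcal{N}\big(\mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big), \tag{3.6}$$
where the mean and variance are
$$\mathrm{E}\big[u_i(t) \mid t, \mathbf{r}, \mathbf{s}\big] = t + \kappa(t, \mathbf{s}^\top) \big(\boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I}\big)^{-1} (\mathbf{r} - \mathbf{s}), \tag{3.7}$$
$$\mathrm{Var}\big[u_i(t) \mid t, \mathbf{r}, \mathbf{s}\big] = \kappa(t, t) - \kappa(t, \mathbf{s}^\top) \big(\boldsymbol{\Sigma}_u + \sigma_n^2 \mathbf{I}\big)^{-1} \kappa(\mathbf{s}, t). \tag{3.8}$$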
This provides an effective way to infer the mapping functions. However, the estimation
depends on the number of standard peaks that can be reliably used and the coverage of
retention time by these peaks. We utilize more comprehensive chromatographic information
in our profile-based approach, in which the Gaussian process can be incorporated using the
predictive distribution of the mapping function as the prior for subsequent estimation.
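As an illustration of Equations 3.2, 3.7, and 3.8, the following minimal NumPy sketch computes the predictive mean and variance of the mapping function for one run. The function names and the default hyperparameter values (`sigma_u`, `sigma_s`, `sigma_n`) are assumptions made for this example only and are not values used in this work.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma_u, sigma_s):
    """Squared-exponential covariance (Equation 3.2) between time vectors a and b."""
    d = a[:, None] - b[None, :]
    return sigma_u**2 * np.exp(-d**2 / (2.0 * sigma_s**2))

def gp_mapping_prior(t, s, r, sigma_u=1.0, sigma_s=10.0, sigma_n=0.1):
    """Predictive mean and variance of u_i(t) (Equations 3.7 and 3.8).

    t: query retention times; s: observed retention times of the internal
    standard peaks; r: their reference times. Hyperparameter defaults are
    illustrative assumptions.
    """
    t, s, r = (np.asarray(v, dtype=float) for v in (t, s, r))
    K_ss = sq_exp_kernel(s, s, sigma_u, sigma_s) + sigma_n**2 * np.eye(len(s))
    K_ts = sq_exp_kernel(t, s, sigma_u, sigma_s)
    # Solve linear systems rather than forming the matrix inverse explicitly.
    mean = t + K_ts @ np.linalg.solve(K_ss, r - s)
    var = sigma_u**2 - np.einsum("ij,ji->i", K_ts, np.linalg.solve(K_ss, K_ts.T))
    return mean, var

# Toy example: four internal standard peaks with small retention time shifts.
s = np.array([10.0, 30.0, 50.0, 70.0])
r = s + np.array([0.5, 0.8, 0.2, -0.4])
t = np.linspace(0.0, 80.0, 9)
mean, var = gp_mapping_prior(t, s, r)
```

The variance is smallest near the internal standard peaks and grows toward the prior variance away from them, which reflects the dependence on peak coverage noted above.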
3.2 Multi-profile alignment
For complex biological samples, collapsing the three-dimensional data into a two-dimensional
chromatogram may blur originally distinct patterns. In such cases, the lack of a consistent
pattern can hinder the estimation of mapping functions. To retain better chromatographic
profiles, we propose to identify multiple representative chromatograms, and perform the
alignment by considering these chromatograms simultaneously. Extension of the generative
model to handle multiple chromatograms can be made by introducing associated prototype
functions of the representative chromatograms. That is,
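$$y_i^{(g)}(t) = c_i + a_i \cdot m_g\big(u_i(t)\big) + \varepsilon_i^{(g)}(t), \tag{3.9}$$
where sample index $i = 1, \ldots, N$, and chromatogram index $g = 1, \ldots, G$. The prototype function associated with the $g$-th representative chromatogram is modeled with B-spline
regression:
$$\mathbf{m}_g = \mathbf{B}_m \boldsymbol{\psi}_g, \tag{3.10}$$
and the likelihood function is the product of the $G$ likelihood functions,
$$p(\mathbf{y} \mid \theta) = \prod_{g=1}^{G} \prod_{i=1}^{N} \mathcal{N}\big(\mathbf{y}_i^{(g)} \mid \hat{\mathbf{y}}_i^{(g)}, \sigma_\varepsilon^2 \mathbf{I}\big), \tag{3.11}$$
where $\hat{\mathbf{y}}_i^{(g)}$ is given by $c_i \cdot \mathbf{1} + a_i \cdot \mathbf{m}_g(\mathbf{u}_i)$. Figure 3.2 presents the directed acyclic graph of the
multi-profile alignment model where the model parameters are represented by open circles,
the hyperparameters by solid dots, and the observations by filled circles.
[Figure 3.2: directed acyclic graph with hyperparameters $(\mu_c, \sigma_{c_0}^2)$, $(\mu_a, \sigma_{a_0}^2)$, $(\alpha_c, \beta_c)$, $(\alpha_a, \beta_a)$, $(\alpha_\psi, \beta_\psi)$, $(\alpha_\varepsilon, \beta_\varepsilon)$; parameters $c_0$, $a_0$, $\sigma_c^2$, $\sigma_a^2$, $\sigma_\psi^2$, $\sigma_\varepsilon^2$, $\{\boldsymbol{\psi}_g\}$, $c_i$, $a_i$, $u_i$ (with GP prior), $\varepsilon_i$; observations $\{y_i^{(g)}\}$; plate over $N$.]
Figure 3.2: Directed acyclic graph of the multi-profile alignment model.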
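The likelihood in Equation 3.11 can be evaluated on the log scale as a sum of Gaussian log-densities over chromatograms, runs, and time points. The sketch below assumes that the prototype functions have already been evaluated at the mapped times; the function name and argument layout are hypothetical choices for this illustration.

```python
import numpy as np

def multi_profile_log_likelihood(y, c, a, m_at_u, sigma_eps):
    """Logarithm of Equation 3.11.

    y         : (G, N, T) observed representative chromatograms y_i^(g)
    c, a      : (N,) offsets c_i and scaling factors a_i
    m_at_u    : (G, N, T) prototype m_g evaluated at the mapped times u_i(t);
                how m_g is evaluated (e.g., by B-splines) is left to the caller
    sigma_eps : noise standard deviation
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    a = np.asarray(a, dtype=float)
    # Broadcast c_i and a_i over chromatograms g and time points t.
    y_hat = c[None, :, None] + a[None, :, None] * np.asarray(m_at_u, dtype=float)
    resid = y - y_hat
    return float(-0.5 * resid.size * np.log(2.0 * np.pi * sigma_eps**2)
                 - 0.5 * np.sum(resid**2) / sigma_eps**2)
```

Working with the logarithm avoids numerical underflow when the product runs over many time points, and differences of this quantity give the log of the likelihood ratio used in Metropolis-Hastings updates.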
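Table 3.1: Summary of full conditionals of model parameters in the multi-profile Bayesian alignment model.

| Parameter | Distribution | |
|---|---|---|
| $a_0$ | $\mathcal{N}(\hat{a}_0, \hat{\sigma}_{a_0}^2)$ | $\hat{\sigma}_{a_0}^2 = \big(1/\sigma_{a_0}^2 + N/\sigma_a^2\big)^{-1}$; $\hat{a}_0 = \hat{\sigma}_{a_0}^2 \big(\mu_a/\sigma_{a_0}^2 + \sum_{i=1}^{N} a_i/\sigma_a^2\big)$ |
| $c_0$ | $\mathcal{N}(\hat{c}_0, \hat{\sigma}_{c_0}^2)$ | $\hat{\sigma}_{c_0}^2 = \big(1/\sigma_{c_0}^2 + N/\sigma_c^2\big)^{-1}$; $\hat{c}_0 = \hat{\sigma}_{c_0}^2 \big(\mu_c/\sigma_{c_0}^2 + \sum_{i=1}^{N} c_i/\sigma_c^2\big)$ |
| $(a_i, c_i)$ | $\mathcal{N}(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$ | $\hat{\boldsymbol{\Sigma}}_i = \big(\boldsymbol{\Sigma}_{ac}^{-1} + \sum_{g=1}^{G} \mathbf{W}_g^\top \mathbf{W}_g/\sigma_\varepsilon^2\big)^{-1}$; $\hat{\boldsymbol{\mu}}_i = \hat{\boldsymbol{\Sigma}}_i \big(\boldsymbol{\Sigma}_{ac}^{-1} (a_0, c_0)^\top + \sum_{g=1}^{G} \mathbf{W}_g^\top \mathbf{y}_i^{(g)}/\sigma_\varepsilon^2\big)$ |
| $1/\sigma_a^2$ | $\mathcal{G}(\hat{\alpha}_a, \hat{\beta}_a)$ | $\hat{\alpha}_a = \alpha_a + N/2$; $\hat{\beta}_a = \beta_a + \sum_{i=1}^{N} (a_i - a_0)^2/2$ |
| $1/\sigma_c^2$ | $\mathcal{G}(\hat{\alpha}_c, \hat{\beta}_c)$ | $\hat{\alpha}_c = \alpha_c + N/2$; $\hat{\beta}_c = \beta_c + \sum_{i=1}^{N} (c_i - c_0)^2/2$ |
| $1/\sigma_\varepsilon^2$ | $\mathcal{G}(\hat{\alpha}_\varepsilon, \hat{\beta}_\varepsilon)$ | $\hat{\alpha}_\varepsilon = \alpha_\varepsilon + GNT/2$; $\hat{\beta}_\varepsilon = \beta_\varepsilon + \sum_{g=1}^{G} \sum_{i=1}^{N} \|\mathbf{y}_i^{(g)} - \hat{\mathbf{y}}_i^{(g)}\|^2/2$ |
| $1/\sigma_\psi^2$ | $\mathcal{G}(\hat{\alpha}_\psi, \hat{\beta}_\psi)$ | $\hat{\alpha}_\psi = \alpha_\psi + GL/2$; $\hat{\beta}_\psi = \beta_\psi + \sum_{g=1}^{G} \boldsymbol{\psi}_g^\top \boldsymbol{\Omega} \boldsymbol{\psi}_g/2$ |
| $\boldsymbol{\psi}$ | $\mathcal{N}(\hat{\boldsymbol{\mu}}_\psi, \hat{\boldsymbol{\Sigma}}_\psi)$ | $\hat{\boldsymbol{\Sigma}}_\psi = \big(\boldsymbol{\Omega}/\sigma_\psi^2 + \mathbf{X}^\top \mathbf{X}/\sigma_\varepsilon^2\big)^{-1}$; $\hat{\boldsymbol{\mu}}_\psi = \hat{\boldsymbol{\Sigma}}_\psi \mathbf{X}^\top (\mathbf{y} - \mathbf{C})/\sigma_\varepsilon^2$ |

Algorithm 5 outlines one iteration of the MCMC procedure in the multi-profile alignment.
For parameters $\theta$ whose full conditionals have closed forms as summarized in Table 3.1, we
use Gibbs sampling to update their values. The remaining parameters, i.e., the mapping
function coefficients $\boldsymbol{\phi}_i$, are updated using the block Metropolis-Hastings algorithm as described in Section 2.3.2. That is, the $\phi_{i,j}$'s are first grouped into several non-overlapping
blocks, which consist of successive coefficients along the retention time, and proposals are
made to update each block. We introduce binary indicator variables $b_j \in \{0, 1\}$, $j = 1, \ldots, K$, to identify the block boundaries, where $b_j = 1$ if $\tau_j$ is at the boundary of a
block and $b_0 = b_{K+1} = 1$. This indicator variable follows a Bernoulli distribution with
$p(b_j = 1) = r_{\text{block}}$. Based on the boundary configuration, coefficients within the same block
$\boldsymbol{\phi}_{i,j:j+B_j-1} = (\phi_{i,j}, \phi_{i,j+1}, \ldots, \phi_{i,j+B_j-1})$ are proposed to be moved in the same direction,
where $b_j = b_{j+B_j} = 1$ and $b_{j+1} = \cdots = b_{j+B_j-1} = 0$. We consider a mixture of transitions
where $r_{\text{block}}$ is randomly selected from $\{1, 1/2, 1/4\}$ at each iteration. The configuration of
blocks is therefore variable within a Markov chain. The acceptance probability, $r_A$, for updating
$\boldsymbol{\phi}_{i,j:j+B_j-1}$ ($\boldsymbol{\phi}'_{i,j:j+B_j-1} \leftarrow \boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}$), is determined by the product of the prior ratio
$r_P$, the likelihood ratio $r_L$, and the transition ratio $r_T$. For the block Metropolis-Hastings
algorithm, the transition ratio $r_T$ for the proposal density is one, while the likelihood ratio
$r_L$ and prior ratio $r_P$ are given by: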
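$$r_L = \prod_{g=1}^{G} \frac{p\big(\mathbf{y}_i^{(g)} \mid \boldsymbol{\phi}'_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}, \theta^{(m+1)}\big)}{p\big(\mathbf{y}_i^{(g)} \mid \boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}, \theta^{(m+1)}\big)}, \tag{3.12}$$
and
$$r_P = \prod_{t \in \{\tau_{j-1}:\tau_{j+B_j}\}} \frac{\mathcal{N}\big(u'_i(t) \mid \mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big)}{\mathcal{N}\big(u_i^{(m+1)}(t) \mid \mathrm{E}[u_i(t)], \mathrm{Var}[u_i(t)]\big)}, \tag{3.13}$$
where $\boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}$ denotes the set of coefficients $\boldsymbol{\phi}_i$ at iteration $m + 1$ with $\boldsymbol{\phi}_{i,j:j+B_j-1}$
excluded, and $u'_i(t)$ and $u_i^{(m+1)}(t)$ are determined based on $\{\boldsymbol{\phi}'_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}\}$ and
$\{\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}, \boldsymbol{\phi}^{(m+1)}_{i,\backslash j:j+B_j-1}\}$, respectively. $\mathrm{E}[u_i(t)]$ and $\mathrm{Var}[u_i(t)]$ are derived via Gaussian process regression as described in Section 3.1.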
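Algorithm 5 MCMC update of $\{\theta^{(m)}, \boldsymbol{\phi}^{(m)}\}$ in multi-profile alignment model
  Update $\theta^{(m+1)} \leftarrow \theta^{(m)}$ using Gibbs sampling
  $r_{\text{block}} \sim \mathcal{U}\{1, \tfrac{1}{2}, \tfrac{1}{4}\}$
  $\delta \sim \tfrac{1}{2} \cdot \mathcal{U}(0, \delta_{\text{small}}) + \tfrac{1}{2} \cdot \mathcal{U}(0, \delta_{\text{large}})$
  $b_j \sim \mathrm{Bernoulli}(r_{\text{block}})$, for $j = 1, \ldots, K$
  for all blocks $\boldsymbol{\phi}_{i,j:j+B_j-1}$ do
    $\boldsymbol{\phi}'_{i,j:j+B_j-1} \sim \mathcal{U}\big(\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1} - \delta, \boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1} + \delta\big)$
    Compute the likelihood ratio $r_L$ using Equation 3.12
    Compute the prior ratio $r_P$ using Equation 3.13
    Compute the acceptance probability $r_A = \min(1, r_L \times r_P)$
    Set $\boldsymbol{\phi}^{(m+1)}_{i,j:j+B_j-1} = \boldsymbol{\phi}'_{i,j:j+B_j-1}$ with probability $r_A$ (otherwise keep $\boldsymbol{\phi}^{(m)}_{i,j:j+B_j-1}$)
  end for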
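The block update in Algorithm 5 can be sketched for the coefficients of a single run as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: the log-likelihood and log-prior are passed in as generic callables (hypothetical arguments `log_lik` and `log_prior`), the first coefficient is always treated as the start of a block, and "moved in the same direction" is interpreted as adding one common shift drawn from $\mathcal{U}(-\delta, \delta)$ to every coefficient in the block. The default step sizes are arbitrary.

```python
import numpy as np

def block_mh_sweep(phi, log_lik, log_prior, rng, delta_small=0.1, delta_large=1.0):
    """One block Metropolis-Hastings sweep over the coefficients of one run."""
    phi = np.array(phi, dtype=float)
    K = phi.size
    # Mixture of transitions: block rate and step size are redrawn every sweep.
    r_block = rng.choice([1.0, 0.5, 0.25])
    delta_max = delta_small if rng.random() < 0.5 else delta_large
    delta = rng.uniform(0.0, delta_max)
    # Bernoulli boundary indicators; position 0 always opens a block (assumption).
    starts = rng.random(K) < r_block
    starts[0] = True
    edges = np.append(np.flatnonzero(starts), K)
    n_accept = 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        proposal = phi.copy()
        proposal[lo:hi] += rng.uniform(-delta, delta)  # common shift for the block
        # Symmetric proposal, so the transition ratio is one.
        log_ratio = (log_lik(proposal) - log_lik(phi)
                     + log_prior(proposal) - log_prior(phi))
        if np.log(rng.random()) < min(0.0, log_ratio):
            phi = proposal
            n_accept += 1
    return phi, n_accept

# Toy usage with a Gaussian target in place of the chromatogram likelihood.
rng = np.random.default_rng(0)
target = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
toy_log_lik = lambda p: -0.5 * np.sum((p - target) ** 2) / 0.1**2
toy_log_prior = lambda p: -0.5 * np.sum(p**2) / 10.0**2
phi = np.zeros(6)
draws = []
for _ in range(3000):
    phi, _ = block_mh_sweep(phi, toy_log_lik, toy_log_prior, rng)
    draws.append(phi)
```

In the alignment model, `log_lik` would correspond to the logarithm of the numerator or denominator in Equation 3.12 and `log_prior` to the Gaussian process terms in Equation 3.13.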
3.3 Chromatographic clustering
A critical issue involved in the multi-profile modeling is the identification of representative
chromatograms from the LC-MS runs, where a trade-off between computational efficiency
(fewer chromatograms) and information retention (more chromatograms) needs to be considered. The use of multiple chromatograms is considered in a few studies, by either binning
the LC-MS data [75] or using all the extracted ion chromatograms with acceptable quality [17]. However, a suitable procedure to utilize multiple representative chromatograms
while retaining computational feasibility is currently not available. Naïvely binning along
the m/z dimension is not desirable since chromatograms with similar m/z values do not
necessarily resemble each other as shown in Figure 3.3, and this would inevitably blur the
chromatographic profiles. To address this gap, we propose a clustering approach to identify multiple representative chromatograms from each LC-MS run. The chromatograms are
simultaneously considered in the profile-based alignment to facilitate the estimation of the
prototype and mapping functions. With an initial set $\mathcal{B}$ of binned chromatograms $x_i^{(b)}$ at a
resolution of 0.5 Da/bin, $b \in \mathcal{B}$, we propose a clustering procedure consisting of screening
of unqualified chromatograms, identification of exemplars, and agglomerative clustering as
follows.
[Figure 3.3: three-dimensional plot of binned chromatograms; axes: Retention time (min), m/z, and TIC.]
Figure 3.3: Binned chromatograms in a portion of one LC-MS run. Similar m/z values do
not imply similar chromatographic profiles.
Screening of unqualified chromatograms. Quality of each binned chromatogram is
assessed by the mass chromatogram quality ($\mathrm{MCQ}_b$) and normalized cross-correlation across
LC-MS runs ($\mathrm{XC}_b$), where the value of $\mathrm{MCQ}_b$ is computed using the component detection
algorithm (CODA) of [139] to identify binned chromatograms contaminated by baseline or
spike noise in any of the LC-MS runs,
$$\mathrm{MCQ}_b = \min\big\{ \mathrm{CODA}\big(x_i^{(b)}(t)\big) \mid i = 1, \ldots, N \big\}, \tag{3.14}$$
and the averaged cross-correlation is to gauge the consistency of the chromatographic pattern
across the runs,
$$\mathrm{XC}_b = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{i' > i} \max\big\{ \big(x_i^{(b)} \star x_{i'}^{(b)}\big)(t) \big\}, \tag{3.15}$$
where the normalized cross-correlation is defined as
$$\big(x_i^{(b)} \star x_{i'}^{(b)}\big)(t) = \frac{1}{\sqrt{\|x_i^{(b)}\|^2 \cdot \|x_{i'}^{(b)}\|^2}} \int_{-\infty}^{\infty} x_i^{(b)}(\tau)\, x_{i'}^{(b)}(\tau + t)\, d\tau. \tag{3.16}$$
The chromatograms are screened based on their quality. Only those satisfying the specified
criterion, e.g., $\mathrm{MCQ}_b \geq 0.9$ and $\mathrm{XC}_b \geq 0.85$, are retained for further processing. That is,
$\mathcal{B}_s = \mathcal{B} \setminus \mathcal{B}_d$, where
$$\mathcal{B}_d = \big\{ b \mid \mathrm{MCQ}_b < 0.9 \,\cup\, \mathrm{XC}_b < 0.85 \big\}.$$
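The cross-correlation part of the screening step (Equations 3.15 and 3.16) can be sketched with NumPy as below, using a discrete sum in place of the integral and taking the maximum over all lags. The function names and the dictionary-based interface are hypothetical; the $\mathrm{MCQ}_b$ values are assumed to be computed elsewhere (e.g., by a CODA implementation) and are simply passed in.

```python
import numpy as np

def max_normalized_xcorr(x, y):
    """Maximum over lags of the normalized cross-correlation (Equation 3.16)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm = np.sqrt(np.sum(x**2) * np.sum(y**2))
    if norm == 0.0:
        return 0.0  # an all-zero chromatogram carries no pattern
    return float(np.max(np.correlate(x, y, mode="full")) / norm)

def averaged_xcorr(X):
    """XC_b of Equation 3.15 for one bin; X has shape (N runs, T time points)."""
    N = X.shape[0]
    total = sum(max_normalized_xcorr(X[i], X[k])
                for i in range(N) for k in range(i + 1, N))
    return 2.0 * total / (N * (N - 1))

def screen_bins(chroms, mcq, mcq_min=0.9, xc_min=0.85):
    """Keep bins whose MCQ_b and XC_b both reach the thresholds.

    chroms: dict mapping bin label -> (N, T) array of binned chromatograms
    mcq   : dict mapping bin label -> precomputed MCQ_b (assumed given)
    """
    return [b for b, X in chroms.items()
            if mcq[b] >= mcq_min and averaged_xcorr(X) >= xc_min]
```

Because the maximum is taken over lags, a bin whose pattern is merely shifted between runs still scores close to one, which is the desired behavior before alignment.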
Identification of exemplars. We apply the affinity propagation algorithm [36] to identify
exemplars that best represent the whole chromatographic profiles,
$$\mathcal{B}_e \leftarrow \mathrm{AP}(\rho_{b,b'}, \bar{\rho}), \quad b, b' \in \mathcal{B}_s,$$
based on the similarity measure of Pearson correlation coefficient
$$\rho_{b,b'} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Corr}\big(x_i^{(b)}, x_i^{(b')}\big), \tag{3.17}$$
and the average of all the similarity measures, $\bar{\rho}$, is assigned as the exemplar preference in
the algorithm. The sum of the correlation coefficients between each chromatogram $b$ and its
exemplar $\pi(b)$,
$$\sum_{b \in \mathcal{B}_s} \rho_{b,\pi(b)},$$
is maximized, where $\pi(b) \in \mathcal{B}_s$ and $\mathcal{B}_e = \{\pi(b) \mid b \in \mathcal{B}_s\}$. To ensure a valid configuration, if
$\pi(b) = b'$, then $\pi(b')$ must be $b'$.
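The similarity input to affinity propagation (Equation 3.17) and the exemplar preference can be computed as in the sketch below. The function names are hypothetical, and taking $\bar{\rho}$ as the mean of the off-diagonal similarities is an assumption of this example. The resulting matrix and preference could then be handed to any affinity propagation implementation that accepts precomputed similarities.

```python
import numpy as np

def mean_pearson_similarity(chroms):
    """Similarity rho_{b,b'} of Equation 3.17.

    chroms: array of shape (B bins, N runs, T time points).
    Returns a (B, B) matrix of Pearson correlations averaged over runs.
    """
    chroms = np.asarray(chroms, dtype=float)
    B, N, _ = chroms.shape
    rho = np.zeros((B, B))
    for i in range(N):
        # Correlation between all pairs of bins within run i.
        rho += np.corrcoef(chroms[:, i, :])
    return rho / N

def exemplar_preference(rho):
    """Average pairwise similarity (off-diagonal entries only, by assumption)."""
    off_diagonal = ~np.eye(rho.shape[0], dtype=bool)
    return float(rho[off_diagonal].mean())
```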
Agglomerative clustering. Based on the set of identified exemplars, we cluster the
exemplars with a hierarchical agglomerative approach, which is a bottom-up approach.
Initially each exemplar forms a singleton cluster, and the two closest clusters are iteratively
merged. At each level, the clustered chromatogram $y_i^{(g)}(t)$ is summarized by
$$y_i^{(g)}(t) = \sum_{b \in \mathcal{B}_g} x_i^{(b)}(t), \tag{3.18}$$
where $\mathcal{B}_g$ denotes the set of chromatograms in the $g$-th cluster. The distance between two
clusters is defined based on the overlapping level between two clustered chromatograms
$$d(\mathcal{B}_g, \mathcal{B}_{g'}) = \sum_{i=1}^{N} \sum_{t=t_1}^{t_T} \min\big\{ y_i^{(g)}(t), y_i^{(g')}(t) \big\}. \tag{3.19}$$
Our goal is to cluster together chromatograms with less overlap, i.e., agglomeration of
fairly distinct chromatographic profiles, to better retain the chromatographic profiles. The
procedure continues until all the exemplars are merged into a single cluster. Once the
hierarchy is built, the number of clusters is determined using the L-method [115]. On the
plot of overlapping level against the number of clusters, there is an incremental decrease of
the overlapping level and the L-method searches for the knee of the overlapping curve, where
the benefit of adding an additional cluster starts decreasing. Pairs of line segments
fitted to the overlapping curve are considered together with their sum of squared errors. The point
minimizing the fitted sum of squared errors is chosen as the number of clusters.
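The agglomeration with the overlap distance (Equations 3.18 and 3.19) and the knee search can be sketched as follows. Several details are assumptions of this example rather than statements about the actual procedure: the overlapping level recorded for $k$ clusters is taken to be the overlap of the merge that produced $k$ clusters, each fitted segment is required to contain at least two points, and the knee is reported as the last point of the left segment. All function names are hypothetical.

```python
import numpy as np

def overlap(y_g, y_h):
    """Overlapping level (Equation 3.19) between two clustered chromatograms (N, T)."""
    return float(np.minimum(y_g, y_h).sum())

def agglomerate(exemplars):
    """Merge the two least-overlapping clusters until a single cluster remains.

    exemplars: array of shape (E, N, T). Returns a dict mapping the number of
    clusters k (E-1 down to 1) to the overlap of the merge that produced k clusters.
    """
    clusters = [np.asarray(x, dtype=float) for x in exemplars]
    levels = {}
    while len(clusters) > 1:
        pairs = [(overlap(clusters[g], clusters[h]), g, h)
                 for g in range(len(clusters))
                 for h in range(g + 1, len(clusters))]
        d, g, h = min(pairs)
        merged = clusters[g] + clusters[h]  # Equation 3.18
        clusters = [c for k, c in enumerate(clusters) if k not in (g, h)] + [merged]
        levels[len(clusters)] = d
    return levels

def l_method(x, y):
    """Return the x value at the knee: the split minimizing the total SSE of two line fits."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_sse, knee = np.inf, None
    for c in range(2, len(x) - 1):  # at least two points in each segment
        sse = 0.0
        for xs, ys in ((x[:c], y[:c]), (x[c:], y[c:])):
            coef = np.polyfit(xs, ys, 1)
            sse += float(np.sum((np.polyval(coef, xs) - ys) ** 2))
        if sse < best_sse:
            best_sse, knee = sse, x[c - 1]
    return knee
```

A typical use would sort the recorded levels by the number of clusters and pass the two sequences to `l_method`; the exhaustive pair search in `agglomerate` is quadratic per merge, which is acceptable only for a modest number of exemplars.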
3.4 Analyzed data sets
We applied the multi-profile alignment model to two LC-MS data sets from proteomic and
glycomic studies. Both data sets were generated from human serum samples with spiked-in
internal standards. Table 3.2 gives a summary of the analyzed data sets. The base peak
chromatograms of the LC-MS runs are shown in Figure 3.4.
Table 3.2: Summary of the analyzed data sets.

| | Proteomics | Glycomics |
|---|---|---|
| Sample | Human serum | Human serum |
| Internal standard | Tryptic peptides | Galactose |
| Liquid chromatography | Agilent 1200 | Dionex 3000 |
| Mass spectrometer | LTQ-Orbitrap | LTQ-Orbitrap Velos |
| Number of LC-MS runs | 20 | 23 |
| Number of MS scans | 3792 − 3367 | 7809 − 8643 |
| Time range (min.) | 10 − 115 | 10 − 60 |
| m/z range (Da) | 400 − 2000 | 500 − 2000 |
| Peak detection | DifProWare | In-house tool |
| Ground-truth | Mascot results | Serum glycans |
[Figure 3.4: two panels: (a) Proteomic data set, (b) Glycomic data set; x-axis: Retention time (min); y-axis: Ion count.]
Figure 3.4: Base peak chromatograms in the two analyzed data sets.
Proteomic data set. The proteomic experiment was designed for evaluating the MARS
Hu-14 column (Agilent Technologies) for depletion of high-abundance proteins in human
serum. The tryptic peptides are a mixture of the following five non-human proteins (Bruker-Michrom): Alcohol dehydrogenase (yeast), Carbonic anhydrase (bovine), Cytochrome c
(equine), Enolase (yeast), and Myoglobin (equine). Serum samples from five healthy individuals were analyzed. LC-MS/MS analysis of the serum samples was performed on an Agilent
1200 nano-LC coupled to an LTQ-Orbitrap mass spectrometer, where data were acquired
with double injections from two groups, with two different concentrations of the spiked-in
tryptic peptides. LC-MS/MS data of the internal standard mixture were also acquired in
duplicate right before the data acquisition of the serum samples. The mass spectrometer was
scanned approximately every second using a 60,000 resolution setting. For each scan, up to
five ions were automatically selected based on their intensities for the MS/MS analysis in the
LTQ. We used the DifProWare platform (available at mciproteomics.usouthal.edu/difproware/) to perform LC-MS data preprocessing including
deisotoping of mass spectra, peak detection, and charge state deconvolution. Each LC-MS
run was preprocessed separately. Peak detection was performed on the basis of LC-MS data
without using the MS/MS spectra.
Glycomic data set. The glycomic data set is from an untargeted LC-MS study aimed
at identifying glycomic disease biomarkers. We analyzed human serum samples representing
two distinct biological groups (cases and controls). The data set was generated from the
serum samples of 11 cases and 12 controls. Sample preparation consists of release, purification, reduction, and permethylation of N-linked glycans. Following the sample preparation,
LC-MS data were acquired using a Dionex 3000 Ultimate nano-LC system interfaced to an
LTQ-Orbitrap Velos mass spectrometer in positive mode. An internal standard mixture of
galactose was added to the samples prior to the LC-MS data acquisition. We performed
LC-MS data preprocessing including deisotoping using DeconTools [60] and peak detection
through deconvolution of chromatographic profiles. Each LC-MS run was preprocessed separately, and the sample labeling information was not used in any analysis conducted
here.
Preprocessed peak lists. Figure 3.5 presents the histogram of the logarithm of peak
intensities, where there are 61,637 and 2,933 consensus peaks in the proteomic and glycomic
data sets, respectively. The scatter plots of the detected peaks are shown in Figure 3.6. For
visualization purposes, only the common peaks (present in more than 50% of the LC-MS
runs) are depicted. Multiply charged ions were identified and the corresponding masses were
recorded. Please note that the mass ranges are different in the two analyzed data sets.
[Figure 3.5: two histograms: (a) Proteomic data set (total number of peaks: 61637), (b) Glycomic data set (total number of peaks: 2933); x-axis: Log intensity; y-axis: Number of peaks.]
Figure 3.5: Histograms of the logarithm of peak intensities in the two analyzed data sets.
We evaluate the alignment results based on the consensus list of the ground-truth data.
Specifically, we compare the retention time (RT) difference across LC-MS runs, the coefficient of variation (CV) of extracted ion chromatograms, and the peak matching performance.
We use the simultaneous multiple alignment (SIMA) model [136], in which a feature-based
alignment module is embedded, to perform the peak matching process. SIMA has shown
outstanding performance in the four benchmark data sets by [71]. The RT difference measures the difference between the largest and smallest retention times for a consensus peak.
The CV evaluates the variability across chromatograms. The peak matching performance is
evaluated through precision and recall of the peak matching results against the ground-truth
data. Precision and recall are defined by [71] and provided as follows.
With a slight abuse of notation, we denote the consensus peak in the ground-truth by $gt_i$,
$i = 1, \ldots, N$, and the consensus peak from the peak matching result by $pm_j$, $j = 1, \ldots, M$,
where singleton peaks are discarded. For each consensus peak in the ground-truth $gt_i$, an associated
set $M_i$ is defined as
$$M_i = \big\{ j \mid |gt_i \cap pm_j| > 0 \big\}, \tag{3.20}$$
where $|M_i|$ is the number of unique elements in the set $M_i$, representing the number of
consensus peaks (in the peak matching result) that are split from a single peak in the ground-truth. A set of peaks relevant to the ground-truth $gt_i$ can therefore be given by
$$\widetilde{pm}_i = \bigcup_{j \in M_i} pm_j, \tag{3.21}$$
and the matching precision and recall are defined as
$$\mathrm{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap \widetilde{pm}_i|}{|\widetilde{pm}_i|}, \tag{3.22}$$
and
$$\mathrm{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap \widetilde{pm}_i|}{|M_i| \cdot |gt_i|}, \tag{3.23}$$
respectively.
[Figure 3.6: two scatter plots: (a) Proteomic data set, (b) Glycomic data set; x-axis: Retention time (min); y-axis: Mass; color scale: log intensity.]
Figure 3.6: Scatter plots of the detected peaks in the two analyzed data sets. The intensity
is log-transformed and color-coded.
3.5 Analysis of LC-MS proteomic data set
The LC-MS proteomic data set consists of 20 LC-MS runs from serum samples with an
internal standard mixture spiked in. To identify peaks corresponding to the spiked-in internal standard, MS/MS spectra of the internal standard mixture were searched with
Mascot. Precursor peaks of the identified peptide sequences were assigned based on their
masses and retention times, and the resulting list consists of 22 peaks of internal standard.
Table 3.3: Peptide sequences of the internal standard.

| | Protein | Peptide sequence | Mass |
|---|---|---|---|
| IS1-1 | Alcohol dehydrogenase | ANELLINVK | 1012.5925 |
| IS1-2 | Alcohol dehydrogenase | ATDGGAHGVINVSVSEAAIEASTR | 2311.146 |
| IS1-3 | Alcohol dehydrogenase | LPLVGGHEGAGVVVGMGENVK | 2018.0659 |
| IS1-4 | Alcohol dehydrogenase | SISIVGSYVGNR | 1250.6645 |
| IS1-5 | Alcohol dehydrogenase | VVGLSTLPEIYEK | 1446.8006 |
| IS2-1 | Carbonic anhydrase | AVVQDPALKPLALVYGEATSR | 2197.2161 |
| IS2-2 | Carbonic anhydrase | VLDALDSIK | 972.5497 |
| IS3-1 | Cytochrome c | EETLMEYLENPK | 1494.6948 |
| IS3-2 | Cytochrome c | GITWKEETLMEYLENPK | 2080.0233 |
| IS3-3 | Cytochrome c | TGQAPGFTYTDANK | 1469.682 |
| IS4-1 | Enolase | AVDDFLISLDGTANK | 1577.7972 |
| IS4-2 | Enolase | DGKYDLDFKNPNSDK | 1754.8143 |
| IS4-3 | Enolase | GNPTVEVELTTEK | 1415.7175 |
| IS4-4 | Enolase | IEEELGDNAVFAGENFHHGDKL | 2440.1345 |
| IS4-5 | Enolase | SGETEDTFIADLVVGLR | 1820.9233 |
| IS4-6 | Enolase | TAGIQIVADDLTVTNPK | 1754.9463 |
| IS4-7 | Enolase | VNQIGTLSESIK | 1287.7054 |
| IS4-8 | Enolase | YDLDFKNPNSDK | 1454.6706 |
| IS4-9 | Enolase | YGASAGNVGDEGGVAPNIQTAEEALDLIVDAIK | 3256.6206 |
| IS5-1 | Myoglobin | GLSDGEWQQVLNVWGK | 1814.898 |
| IS5-2 | Myoglobin | HGTVVLTALGGILK | 1377.8373 |
| IS5-3 | Myoglobin | VEADIAGHGQEVLIR | 1605.8496 |
Peptide sequences of the 22 peaks and their retention times are given in Tables 3.3 and 3.4,
respectively. NA in Table 3.4 means that an internal standard is not present.
The ground-truth data were generated based on the Mascot search result. A list of MS/MS
spectra with identification score > 60 and present in at least 10 out of 20 LC-MS runs was
compiled. Each peptide sequence was assigned to a peak detected by DifProWare based on its
mass and retention time, which resulted in a list of consensus peaks (with the same identity).
Putative matching was also performed for the runs without a qualified identification sequence.
The list was further refined based on visual inspection of the extracted ion chromatogram
of each consensus peak, where erroneous assignments were removed. The resulting ground-truth data consist of 273 unique peptide sequences from 70 unique proteins (Appendix A).
We compared the following procedures: no alignment performed to adjust the peak lists
(raw), alignment performed using a Gaussian process regression as defined in Equation 3.7
(GP), single-profile alignment performed with no information about internal standards (SP),
single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile
alignment performed with a Gaussian process prior (GPMP). For the multi-profile alignment, representative chromatograms are first identified as discussed in Section 3.3. We use
the L-method to determine the sufficient number of clusters as demonstrated in Figure 3.7.
As shown in the figure, the sum of squared errors is minimized when the number of clusters is
chosen as four. Most current LC-MS preprocessing pipelines do not adjust the peak lists detected upfront (raw) and directly apply a feature-based alignment, e.g., SIMA. Single-profile
alignment (SP) represents the current profile-based models including our alignment model
introduced in Chapter 2, where a single chromatogram is considered without utilizing any information about internal standards. Table 3.5 summarizes the results of the five approaches.
The performance comparison is based on the retention time (RT) difference in seconds across runs for
consensus peaks, the coefficient of variation (CV) of the extracted ion chromatograms of consensus peaks, precision, and recall. For RT difference and CV, means (standard deviations)
are reported based on the 273 consensus peaks. For precision and recall, means (standard
deviations) are reported based on 72 pairs of tolerance parameters of m/z ∈ {0.05, 0.1, 0.25}
and RT ∈ {5, 10, . . . , 120} in SIMA.
[Figure 3.7: two panels: (a) Normalized overlapping level, (b) Sum of squared errors (SSE); x-axis: Number of clusters.]
Figure 3.7: Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the proteomic data set. The sufficient number of clusters is four.
As the chromatographic patterns are well captured by the base peak chromatogram in this
data set, the single-profile approach (SP) yields a reasonable alignment result (Figure 3.8).
Figure 3.9 depicts the trace plots of model precision (1/σε2 ) estimated based on single-profile
alignment with (GPSP) and without (SP) Gaussian process prior. As shown in the figure,
incorporating the Gaussian process prior leads to a shorter burn-in period as well as higher
precision to fit the observed chromatograms. Based on the estimation by GPSP, Figures 3.10–
3.14 show the trace plots of differences between the mapping functions ui (t) and ui+1(t), at
the knot points τj , j = 1, . . . , 23. From these figures, less uncertainty of the estimation is
observed in the middle retention time range of the chromatograms. Also, the difference is not
consistent throughout the whole retention time range, which suggests that a linear modeling
of the mapping function is not appropriate. As shown in Table 3.5, integration of internal
65
Chapter 3. Multi-profile Bayesian alignment model
7
18
7
x 10
18
8
2
x 10
8
2
16
x 10
16
1
12
12
0
58
60
62
10
8
4
4
2
2
40
60
80
60
62
8
6
20
0
58
10
6
0
1
14
Ion count
14
Ion count
x 10
0
100
20
40
Retention time (min)
60
80
100
Retention time (min)
(a) Original chromatograms
(b) Aligned chromatograms
Figure 3.8: Base peak chromatograms in the proteomic data set, before and after alignment.
The inset is a zoomed part in the middle retention time range of the chromatograms.
[Figure 3.9 panels: (a) and (b). Axes: iteration versus $1/\sigma_\varepsilon^2$. Traces: SP and GPSP.]
Figure 3.9: Trace plots of $1/\sigma_\varepsilon^2$ estimated by single-profile alignment without the
Gaussian process prior (SP) and with the Gaussian process prior (GPSP). (b) zooms in on
the precision range 21–23.5 for the last 2500 iterations in (a).
standards (GPSP) and multiple chromatograms (GPMP) can lead to further improvement.
Based on the estimation by GPMP, Figure 3.15 presents the posterior difference between
the estimated mapping function and the identity function with the 90% credible interval for
the 20 LC-MS runs. For peak matching performance, Figure 3.16 shows the measures
of precision and recall of the five considered approaches, based on 72 pairs of tolerance
parameters in SIMA, where GPMP yields the best performance, with the least sensitivity to
the choice of parameters. This indicates that appropriate retention time alignment by GPMP
makes the subsequent peak matching process more robust to the selection of parameters.
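A posterior summary of the kind plotted in Figure 3.15 can be obtained from retained MCMC draws of the mapping function. The sketch below is illustrative only: it assumes an equal-tailed interval (the text does not state which type of credible interval is used), the function name posterior_band is hypothetical, and the draws are synthetic.

```python
import numpy as np


def posterior_band(samples, level=0.90):
    """Pointwise posterior median and equal-tailed credible band.

    samples: array of shape (n_draws, n_time) holding u(t) - t for each
    retained MCMC draw (hypothetical layout).
    """
    samples = np.asarray(samples, dtype=float)
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(samples, alpha, axis=0)
    median = np.median(samples, axis=0)
    upper = np.quantile(samples, 1.0 - alpha, axis=0)
    return lower, median, upper


# Synthetic draws: a shift of about -10 s with noise, at six time points.
rng = np.random.default_rng(0)
t = np.linspace(1000.0, 6000.0, 6)
draws = -10.0 + rng.normal(0.0, 3.0, size=(2000, t.size))
lower, median, upper = posterior_band(draws)
```

The band is computed independently at each time point, so it describes pointwise rather than simultaneous uncertainty.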
Table 3.4: Mass and retention time of each of the internal standard peaks in the proteomic
data set.

                  Retention time (min.) in each LC-MS run
Peak   Mass       1      2      3      4      5      6      7      8      9      10
IS1-1  1012.5925  NA     35.29  35.22  NA     35.24  NA     36.58  36.7   36.21  36.15
IS1-2  2311.146   40.86  40.72  40.68  41.01  40.67  40.64  41.96  42.07  41.66  41.57
IS1-3  2018.0659  40.86  40.63  40.5   40.92  40.67  NA     41.96  41.98  41.57  41.48
IS1-4  1250.6645  34.15  34.15  33.99  34.42  34.5   33.97  35.36  35.48  35     34.92
IS1-5  1446.8006  46.17  45.98  45.91  46.17  45.93  45.86  47.16  47.34  46.86  46.85
IS2-1  2197.2161  54.39  54.15  54.14  54.42  54.23  54.16  55.33  55.37  54.99  54.98
IS2-2  972.5497   35.81  35.73  35.66  36.03  35.6   35.64  36.93  37.14  36.66  36.6
IS3-1  1494.6948  52.81  52.76  52.66  52.98  52.75  52.7   53.86  53.99  53.62  53.51
IS3-2  2080.0233  54.76  54.72  54.61  54.85  54.76  54.72  55.8   55.92  55.54  55.43
IS3-3  1469.682   NA     23.44  23.21  NA     23.3   NA     24.54  24.59  24.05  24.04
IS4-1  1577.7972  NA     52.02  51.9   NA     51.92  NA     53.03  53.27  52.8   52.69
IS4-2  1754.8143  24.99  25.03  24.8   25.21  24.91  24.89  26.27  26.33  25.74  25.71
IS4-3  1415.7175  30.08  30.01  29.93  30.28  30.07  30     31.4   31.45  30.95  30.85
IS4-4  2440.1345  39.89  39.75  39.63  39.99  39.7   39.59  40.99  41.02  40.61  40.51
IS4-5  1820.9233  67.68  67.53  67.49  67.6   67.43  67.51  68.58  68.62  68.23  68.18
IS4-6  1754.9463  45.63  45.52  45.45  45.78  45.47  45.5   46.71  46.8   46.41  46.31
IS4-7  1287.7054  NA     32.29  32.22  NA     32.2   32.12  33.6   33.73  33.23  33.05
IS4-8  1454.6706  NA     27.28  27.12  NA     27.26  NA     28.6   28.58  27.99  27.99
IS4-9  3256.6206  87.05  87     86.88  87.2   86.95  86.93  88.06  88.04  87.77  87.81
IS5-1  1814.898   58.32  58.23  58.18  58.42  58.21  58.27  59.39  59.44  59.1   58.88
IS5-2  1377.8373  52.9   52.76  52.66  52.98  52.85  52.79  53.86  53.99  53.62  53.51
IS5-3  1605.8496  30.42  30.36  30.27  30.62  30.42  30.34  31.65  31.79  31.29  31.2

                  Retention time (min.) in each LC-MS run
Peak   Mass       11     12     13     14     15     16     17     18     19     20
IS1-1  1012.5925  36.24  36.11  36.05  35.98  NA     36.52  NA     36.13  35.92  NA
IS1-2  2311.146   41.71  41.47  41.45  41.46  41.64  41.75  41.64  41.58  41.4   41.47
IS1-3  2018.0659  41.54  41.29  41.28  41.19  41.55  41.66  NA     41.22  41.32  41.47
IS1-4  1250.6645  34.92  34.84  34.77  34.67  35.12  35.74  34.95  34.82  34.6   34.87
IS1-5  1446.8006  46.77  46.72  46.55  46.61  46.92  46.96  46.79  46.68  46.59  46.63
IS2-1  2197.2161  55.01  54.79  54.79  54.78  55.09  55.14  55.1   54.77  54.85  54.9
IS2-2  972.5497   36.6   36.46  36.4   36.33  36.75  36.88  36.64  36.48  36.35  36.46
IS3-1  1494.6948  53.57  53.34  53.45  53.38  53.52  53.68  53.62  53.45  53.38  53.35
IS3-2  2080.0233  55.42  55.39  55.32  55.34  55.61  55.57  55.51  55.33  55.31  55.36
IS3-3  1469.682   24.04  24     23.93  23.92  NA     24.2   NA     24.05  23.85  23.9
IS4-1  1577.7972  52.74  52.61  52.59  52.56  NA     52.85  NA     52.72  52.57  NA
IS4-2  1754.8143  25.77  25.63  25.7   25.51  25.81  25.98  25.85  25.64  25.5   25.64
IS4-3  1415.7175  30.86  30.77  30.73  30.72  31.02  31.13  30.95  30.79  30.59  30.85
IS4-4  2440.1345  40.56  40.41  40.41  40.32  40.61  40.69  40.6   40.44  40.36  40.35
IS4-5  1820.9233  68.16  68.01  68.04  68.06  68.4   68.28  68.29  68.23  68.06  68.13
IS4-6  1754.9463  46.36  46.29  46.2   46.19  46.47  46.6   46.45  46.23  46.15  46.23
IS4-7  1287.7054  33.15  33.03  33.01  32.92  NA     33.34  NA     32.98  32.86  NA
IS4-8  1454.6706  27.99  27.88  27.88  27.84  NA     28.3   NA     27.91  27.73  27.97
IS4-9  3256.6206  87.52  87.61  87.57  87.54  87.92  87.71  87.63  87.62  87.64  87.65
IS5-1  1814.898   58.99  58.88  58.86  58.8   59.14  59.19  59.02  58.89  58.7   58.85
IS5-2  1377.8373  53.57  53.44  53.45  53.38  53.69  53.68  53.71  53.45  53.38  53.44
IS5-3  1605.8496  31.2   31.11  31.08  30.98  31.45  31.49  31.3   31.13  30.93  31.1
Table 3.5: Performance comparison in the LC-MS proteomic data set. Five approaches are
compared: no alignment performed (raw), alignment performed using a Gaussian process
regression (GP), single-profile alignment performed without using a Gaussian process prior
(SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).

        RT (sec)        CV             Precision      Recall
Raw     83.08 (20.84)   1.634 (0.256)  0.937 (0.026)  0.638 (0.241)
GP      18.02 (22.74)   1.118 (0.372)  0.983 (0.007)  0.933 (0.089)
SP      19.37 (28.56)   1.150 (0.415)  0.985 (0.004)  0.952 (0.043)
GPSP    11.74 (19.61)   1.060 (0.380)  0.988 (0.004)  0.962 (0.050)
GPMP    10.70 (20.67)   1.052 (0.380)  0.990 (0.003)  0.970 (0.027)
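To make the first two columns of the table concrete, the sketch below computes a retention time spread and a coefficient of variation for a single peak. The exact definitions used in this study are not restated in this section, so two assumptions are made: the RT difference is taken as the range (largest minus smallest) across runs, and the CV as the sample standard deviation divided by the mean. The function names are hypothetical.

```python
import math


def rt_range_seconds(rt_minutes):
    """Largest minus smallest retention time, in seconds; None marks a missing peak."""
    observed = [rt for rt in rt_minutes if rt is not None]
    return (max(observed) - min(observed)) * 60.0


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd / mean


# Unaligned retention times (min) of IS1-2 in runs 1-10, from Table 3.4.
is1_2 = [40.86, 40.72, 40.68, 41.01, 40.67, 40.64, 41.96, 42.07, 41.66, 41.57]
print(round(rt_range_seconds(is1_2), 1))  # 85.8
```

After alignment, the same functions would be applied to the adjusted retention times and to the intensities of the corresponding extracted ion chromatograms.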
[Figure 3.10 panels: trace plots for i = 1–6, i = 7–12, and i = 13–19 at each knot point. Axes: iteration versus $u_i(\tau_j) - u_{i+1}(\tau_j)$.]
Figure 3.10: Trace plots of $u_i(t) - u_{i+1}(t)$ at the knot points $\tau_1$–$\tau_5$.
[Figure 3.11 panels: trace plots for i = 1–6, i = 7–12, and i = 13–19 at each knot point. Axes: iteration versus $u_i(\tau_j) - u_{i+1}(\tau_j)$.]
Figure 3.11: Trace plots of $u_i(t) - u_{i+1}(t)$ at the knot points $\tau_6$–$\tau_{10}$.
[Figure 3.12 panels: trace plots for i = 1–6, i = 7–12, and i = 13–19 at each knot point. Axes: iteration versus $u_i(\tau_j) - u_{i+1}(\tau_j)$.]
Figure 3.12: Trace plots of $u_i(t) - u_{i+1}(t)$ at the knot points $\tau_{11}$–$\tau_{15}$.
[Figure 3.13 panels: trace plots for i = 1–6, i = 7–12, and i = 13–19 at each knot point. Axes: iteration versus $u_i(\tau_j) - u_{i+1}(\tau_j)$.]
Figure 3.13: Trace plots of $u_i(t) - u_{i+1}(t)$ at the knot points $\tau_{16}$–$\tau_{20}$.
[Figure 3.14 panels: trace plots for i = 1–6, i = 7–12, and i = 13–19 at each knot point. Axes: iteration versus $u_i(\tau_j) - u_{i+1}(\tau_j)$.]
Figure 3.14: Trace plots of $u_i(t) - u_{i+1}(t)$ at the knot points $\tau_{21}$–$\tau_{23}$.
3.6 Analysis of LC-MS glycomic data set
The glycomic data set is from an untargeted LC-MS study aimed at identifying N-glycan
disease biomarkers. Human serum samples representing two distinct biological groups,
11 cases and 12 controls, were analyzed in this study. An internal standard mixture was added to the serum samples
prior to the LC-MS data acquisition. We performed LC-MS data preprocessing including
deisotoping using DeconTools [60] and peak detection through deconvolution of chromatographic profiles. In this study, the average residue composition was set to $\mathrm{C_{10}H_{18}N_{0.5}O_{5}}$
for the deisotoping step. Each LC-MS run was preprocessed separately, and the group labels
(case or control) were not used in any analysis conducted in this study.
Peaks of the internal standard in the glycomic data set were identified by comparing measured mass values with theoretical molecular weights of galactose units. Table 3.6 presents
the masses of the five internal standard peaks of different galactose units (galactose 3–7) and
their retention times in each LC-MS run. The ground-truth data were generated based on
a list of human serum glycans characterized by the number of monosaccharides: HexNAc,
Hexose, Deoxyhexose and NeuAc. The putative compositions were assigned by comparison of measured mass values with theoretical values, in consideration of hydrogen adducts.
The resulting ground-truth data consist of 106 peaks. The complete list can be found in
Appendix B.
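The mass-based assignment described above amounts to a tolerance search against a list of theoretical masses. The following is a minimal sketch under stated assumptions: the 10 ppm tolerance, the function name match_by_mass, and the measured values are hypothetical, and adduct handling is omitted. Only the reference masses are taken from Table 3.6.

```python
def match_by_mass(measured, theoretical, tol_ppm=10.0):
    """Assign each measured mass to the closest reference mass within tol_ppm, else None."""
    assignments = []
    for m in measured:
        best_name, best_err = None, None
        for name, mass in theoretical.items():
            err_ppm = abs(m - mass) / mass * 1e6
            if err_ppm <= tol_ppm and (best_err is None or err_ppm < best_err):
                best_name, best_err = name, err_ppm
        assignments.append(best_name)
    return assignments


# Reference masses of the internal standard peaks listed in Table 3.6.
standards = {
    "Gal3": 674.3726,
    "Gal4": 878.4724,
    "Gal5": 1082.5722,
    "Gal6": 1286.672,
    "Gal7": 1490.7718,
}
measured = [674.3750, 1082.5700, 900.0000]  # hypothetical measured masses
print(match_by_mass(measured, standards))  # ['Gal3', 'Gal5', None]
```

A relative (ppm) tolerance is used here because mass error typically scales with mass; an absolute tolerance would work the same way with a different error expression.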
Table 3.6: Mass and retention time of each of the internal standard peaks in the glycomic
data set.

                 Retention time (min.) in each LC-MS run
Peak  Mass       1      2      3      4      5      6      7
Gal3  674.3726   27.19  26.67  26.42  26.29  26.28  26.12  26.02
Gal4  878.4724   30.51  30.49  29.68  29.61  29.5   29.39  29.31
Gal5  1082.5722  34.05  33.56  33.24  33.05  32.93  32.86  32.8
Gal6  1286.672   37.54  37.13  36.76  36.6   36.48  36.38  36.33
Gal7  1490.7718  40.99  40.61  40.24  40.03  39.88  39.81  39.8

                 Retention time (min.) in each LC-MS run
Peak  Mass       8      9      10     11     12     13     14     15
Gal3  674.3726   26.03  26.02  25.86  25.79  25.72  25.56  25.47  25.33
Gal4  878.4724   29.29  29.28  29.09  29.05  28.98  28.76  28.69  28.51
Gal5  1082.5722  32.71  32.74  32.58  32.52  32.4   32.25  32.2   31.97
Gal6  1286.672   36.26  36.29  36.08  36.04  35.92  35.71  35.66  35.45
Gal7  1490.7718  39.7   39.77  39.5   39.44  39.32  39.13  39.09  38.87

                 Retention time (min.) in each LC-MS run
Peak  Mass       16     17     18     19     20     21     22     23
Gal3  674.3726   25.04  24.96  24.83  24.69  24.59  24.65  24.6   24.65
Gal4  878.4724   28.29  28.22  28.07  27.83  27.85  27.85  27.87  27.88
Gal5  1082.5722  31.72  31.64  31.45  31.4   31.27  31.28  31.32  31.21
Gal6  1286.672   35.24  35.16  35.02  34.97  34.81  34.78  34.8   34.81
Gal7  1490.7718  38.7   38.66  38.52  38.44  38.3   38.2   38.28  38.26
We performed the same comparison as in Section 3.5, where five approaches were compared: no alignment performed (raw), alignment performed using a Gaussian process regression (GP), single-profile alignment performed without using a Gaussian process prior
(SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment performed with a Gaussian process prior (GPMP). As in the analysis of
the proteomic data set, we use the L-method to determine the sufficient number of clusters,
as demonstrated in Figure 3.18. The sufficient number of clusters to capture the chromatographic patterns was found to be four. Table 3.7 presents the performance comparison in the
glycomic data set, which turns out to be the most challenging case in our study due to the
lack of a consistent pattern in the base peak chromatograms. Figure 3.17 depicts the base peak
chromatograms of the 23 LC-MS runs. As shown in the figure, most of the chromatographic
profiles are concentrated in the range between 20 and 35 minutes. Unfortunately, there is
no consistent chromatographic pattern in the base peak chromatograms. Consequently,
single-profile alignment does not yield a significant improvement as it does in the proteomic
data set, as shown in Table 3.7. Gaussian process regression (GP, with only five internal
standard peaks) performs comparably to both single-profile alignment approaches (SP and
GPSP). Utilization of the internal standard and multiple chromatograms (GPMP) is particularly advantageous in this data set. Figure 3.19 depicts the clustered ion chromatograms,
before and after alignment by GPMP. As shown in the figure, chromatographic patterns are
better retained and the multi-profile alignment can therefore be effectively performed. As
a result, GPMP outperforms the other four approaches on all the measures in
Table 3.7. Based on the estimation by GPMP, Figure 3.20 shows the posterior difference
between the estimated mapping function and the identity function with the 90% credible
interval for the 23 LC-MS runs. Figure 3.21 shows the measures of precision and recall of
the five considered approaches, based on 72 pairs of tolerance parameters in SIMA. As in
the proteomic data set, GPMP yields the best performance, with the least sensitivity to the
choice of parameters.
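For readers unfamiliar with the two peak matching measures, the sketch below computes precision and recall from sets of matched peak pairs. The pairwise formulation, the function name, and the peak identifiers are assumptions made for illustration; the definition used in the evaluation is given elsewhere in this dissertation and may differ in detail.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Precision and recall of predicted peak matches against ground-truth matches."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical peak identifiers of the form (peak in run 1, peak in run 2).
truth = [("r1p1", "r2p1"), ("r1p2", "r2p2"), ("r1p3", "r2p3"), ("r1p4", "r2p4")]
predicted = [("r1p1", "r2p1"), ("r1p2", "r2p2"), ("r1p3", "r2p4")]
precision, recall = precision_recall(predicted, truth)
# Two of three predicted matches are correct; two of four true matches are found.
```

Loose tolerance parameters tend to raise recall at the cost of precision, which is why the measures are reported over a grid of tolerance pairs rather than at a single setting.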
Table 3.7: Performance comparison in the LC-MS glycomic data set. Five approaches are
compared: no alignment performed (raw), alignment performed using a Gaussian process
regression (GP), single-profile alignment performed without using a Gaussian process prior
(SP), single-profile alignment performed with a Gaussian process prior (GPSP), and multi-profile alignment (G = 4) performed with a Gaussian process prior (GPMP).

        RT (sec)         CV             Precision      Recall
Raw     103.20 (32.63)   1.194 (0.346)  0.943 (0.008)  0.612 (0.215)
GP      42.36 (29.76)    0.931 (0.284)  0.965 (0.005)  0.819 (0.136)
SP      67.45 (42.39)    1.090 (0.399)  0.967 (0.008)  0.773 (0.166)
GPSP    44.65 (31.35)    0.978 (0.366)  0.976 (0.003)  0.829 (0.167)
GPMP    24.85 (27.21)    0.821 (0.292)  0.980 (0.002)  0.907 (0.095)
For the multi-profile alignment, a set of representative chromatograms is first identified as
discussed in Section 3.3. We compared cases where chromatograms are derived either by
binning along m/z or by chromatographic clustering (Figure 3.22). Tables 3.8 and 3.9
summarize the peak matching performance using multi-profile alignment with the number
of chromatograms varying between two and five. As shown in the tables, the chromatographic clustering procedure outperforms the binning approach. Incorporation of the
Gaussian process prior shows significant improvement in the cases of binning and of two
clustered chromatograms. This indicates that using the informative prior is beneficial for the
profile-based alignment, especially when a consistent chromatographic pattern is unavailable.
For the other cases of chromatographic clustering (G = 3, ..., 5), the results with and without
integrating the internal standards are similar. This is because the chromatographic patterns
have already been well captured. As a result, the use of prior information did not improve
the alignment results further. We believe that further improvement can be achieved with
the addition of more internal standards that allow better coverage of the retention time.
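The binning alternative mentioned above can be sketched as follows. This is an illustrative simplification, not the procedure of Section 3.3: it assumes equal-width m/z bins and a flat list of (m/z, scan index, intensity) measurements, and the function name and input layout are hypothetical.

```python
import numpy as np


def binned_chromatograms(mz, scan_index, intensity, n_scans, n_bins):
    """Sum ion intensities into n_bins equal-width m/z bins for every scan.

    Returns an array of shape (n_bins, n_scans); each row is one chromatogram.
    """
    mz = np.asarray(mz, dtype=float)
    edges = np.linspace(mz.min(), mz.max(), n_bins + 1)
    # Using only the interior edges maps every value to a bin id in 0..n_bins-1.
    bin_id = np.digitize(mz, edges[1:-1])
    chrom = np.zeros((n_bins, n_scans))
    # np.add.at accumulates correctly when several ions share a (bin, scan) cell.
    np.add.at(chrom, (bin_id, np.asarray(scan_index)), np.asarray(intensity, dtype=float))
    return chrom


# Four hypothetical ions in two scans, split into two m/z bins.
chrom = binned_chromatograms(
    mz=[400.0, 500.0, 900.0, 1000.0],
    scan_index=[0, 0, 1, 1],
    intensity=[1.0, 2.0, 3.0, 4.0],
    n_scans=2,
    n_bins=2,
)
print(chrom.tolist())  # [[3.0, 0.0], [0.0, 7.0]]
```

Binning groups ions by mass alone, whereas clustering groups them by the similarity of their elution profiles, which is one plausible reason for the difference between Tables 3.8 and 3.9.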
Table 3.8: Multi-profile alignment of the glycomic data set with and without using a Gaussian
process (GP) prior. Chromatograms are derived by binning along m/z dimension.

             Binning without GP prior            Binning with GP prior
# of bins    2        3        4        5        2        3        4        5
Precision    0.957    0.960    0.964    0.963    0.976    0.977    0.978    0.974
             (0.007)  (0.008)  (0.007)  (0.006)  (0.003)  (0.003)  (0.004)  (0.006)
Recall       0.766    0.783    0.770    0.780    0.848    0.856    0.830    0.837
             (0.138)  (0.122)  (0.147)  (0.154)  (0.151)  (0.141)  (0.160)  (0.156)
Table 3.9: Multi-profile alignment of the glycomic data set with and without using a Gaussian process (GP) prior. Chromatograms are derived using the chromatographic clustering
procedure.

                Clustering without GP prior         Clustering with GP prior
# of clusters   2        3        4        5        2        3        4        5
Precision       0.964    0.980    0.979    0.980    0.980    0.979    0.980    0.980
                (0.007)  (0.004)  (0.004)  (0.002)  (0.003)  (0.003)  (0.002)  (0.003)
Recall          0.796    0.906    0.910    0.913    0.904    0.904    0.907    0.908
                (0.138)  (0.094)  (0.099)  (0.092)  (0.098)  (0.099)  (0.095)  (0.096)

3.7 Summary
This chapter extends the single-profile alignment model to handle multiple chromatograms
simultaneously, and incorporates the Gaussian process prior for the mapping function. The
extended model uses multi-profile modeling with representative chromatograms identified
by a clustering approach. Through comprehensive evaluation on LC-MS data sets from
proteomic and glycomic studies, the multi-profile Bayesian alignment model has shown significant benefits for the subsequent peak matching step, which is crucial when comparing
thousands of features across multiple LC-MS runs. Although our discussion focuses on using
internal standards to derive the Gaussian process prior, it is also possible to specify the
Gaussian process prior for the mapping relationship based on the identification of MS/MS
spectra or targeted compounds.
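To illustrate how landmark peaks such as internal standards can inform the mapping function, the sketch below fits a Gaussian process regression to the retention time shift between two runs of Table 3.6. It is not the prior specification used in this dissertation: the squared exponential kernel, the hyperparameter values, the centring of the shifts, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def sq_exp_kernel(a, b, signal_sd, length_scale):
    """Squared exponential covariance between two sets of time points."""
    diff = a[:, None] - b[None, :]
    return signal_sd ** 2 * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_predict(x_train, y_train, x_new, signal_sd=1.0, length_scale=10.0, noise_sd=0.1):
    """Posterior mean and standard deviation of a GP regression at x_new."""
    x_train, y_train, x_new = (np.asarray(v, dtype=float) for v in (x_train, y_train, x_new))
    y_mean = y_train.mean()  # centre the data so a zero-mean GP is reasonable
    K = sq_exp_kernel(x_train, x_train, signal_sd, length_scale)
    K += noise_sd ** 2 * np.eye(x_train.size)
    K_s = sq_exp_kernel(x_new, x_train, signal_sd, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = y_mean + K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = signal_sd ** 2 - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Retention times (min) of Gal3-Gal7 in runs 23 and 1, from Table 3.6.
rt_run = [24.65, 27.88, 31.21, 34.81, 38.26]
rt_ref = [27.19, 30.51, 34.05, 37.54, 40.99]
shift = np.array(rt_ref) - np.array(rt_run)

query = np.array([20.0, 31.21, 60.0])
mean, sd = gp_predict(rt_run, shift, query)
```

The posterior standard deviation grows away from the landmark peaks, which is consistent with the remark above that additional internal standards with better retention time coverage should help.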
[Figure 3.15 panels: Run 1 to Run 20. Axes: time (sec) versus difference.]
Figure 3.15: Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 20 LC-MS runs in the proteomic
data set. The filled region corresponds to the 90% credible interval.
[Figure 3.16 plot. Axes: recall versus precision.]
Figure 3.16: Measures of precision and recall in the proteomic data set, based on 72 pairs
of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP
(), GPSP (△), and GPMP (♦).
[Figure 3.17 panels: one base peak chromatogram per LC-MS run. Axes: time (min) versus ion count.]
Figure 3.17: Base peak chromatograms of the 23 LC-MS runs in the glycomic data set.
[Figure 3.18 panels: (a) Normalized overlapping level; (b) Sum of squared errors (SSE). Horizontal axis: number of clusters.]
Figure 3.18: Normalized overlapping level (a) and sum of squared errors (b) using the L-method in the glycomic data set. The sufficient number of clusters is four.
[Figure 3.19 panels: (a) to (h). Axes: retention time (min) versus intensity (A.U.).]
Figure 3.19: Clustered ion chromatograms in the glycomic data set. (a)-(d) are the unaligned
chromatograms and (e)-(h) are their corresponding aligned chromatograms. The inset is a
zoomed part in the middle retention time range of the chromatograms.
[Figure 3.20 panels: Run 1 to Run 23. Axes: time (sec) versus difference.]
Figure 3.20: Difference between the identity function and estimated mapping function obtained from the posterior median by GPMP for each of the 23 LC-MS runs in the glycomic
data set. The filled region corresponds to the 90% credible interval.
[Figure 3.21 plot. Axes: recall versus precision.]
Figure 3.21: Measures of precision and recall in the glycomic data set, based on 72 pairs
of tolerance parameters in SIMA. The five procedures compared are: raw (∗), GP (), SP
(), GPSP (△), and GPMP (♦).
[Figure 3.22 plots: four panels, (a) G = 2, (b) G = 3, (c) G = 4, (d) G = 5; x-axis: Recall; y-axis: Precision; legend entries in panel G: Raw, BG, CG, GPBG, GPCG (e.g., B2, C2, GPB2, GPC2 for G = 2).]
Figure 3.22: Measures of precision and recall in the glycomic data set, where multi-profile
alignment is considered. The number of chromatograms is: (a) two, (b) three, (c) four,
and (d) five. Five cases are compared with peak lists input to SIMA: raw (∗), adjusted by
multi-profile alignment using binning (), adjusted by multi-profile alignment using chromatographic clustering (), adjusted by multi-profile alignment using binning with Gaussian
process prior (△), and adjusted by multi-profile alignment using chromatographic clustering
with Gaussian process prior (♦).
Chapter 4
Application of Bayesian alignment
model to biomarker discovery
LC-MS has been widely used for profiling a variety of biomolecules, characterizing their
expression levels, and associating relevant patterns with distinct biological conditions of
interest. In this chapter, we integrate the proposed Bayesian alignment model into a preprocessing
pipeline for LC-MS data analysis. We demonstrate the applicability of the alignment
model to practical biomedical problems by applying the preprocessing pipeline in a large-scale
glycomic study for cancer biomarker discovery. The LC-MS-based glycomic study
consists of two complementary analyses: 1) global profiling using LC-MS, and 2) targeted
quantification by multiple reaction monitoring (MRM). Through our developed pipeline, we
identify candidate biomarkers from the global profiling analysis and confirm the results against
those obtained by targeted quantification.
4.1 Background
Hepatocellular carcinoma. Hepatocellular carcinoma (HCC) is the third leading cause
of cancer mortality worldwide, with five-year relative survival rates of less than 15% [7, 33].
In the US, while the combined cancer mortality rate has been declining for two decades,
incidence and mortality rates of HCC are still increasing [118]. The resistance of HCC to
existing treatments and the lack of reliable biomarkers for early detection make it one of the
hardest cancers to treat. Most of the risk factors for HCC, including chronic infection with
hepatitis B virus (HBV) or hepatitis C virus (HCV), lead to the development of liver cirrhosis,
which is considered a precursor of HCC and is present in 80–90% of HCC patients [30]. The
malignant conversion of cirrhosis to HCC is often fatal, in part because adequate biomarkers
are not available for diagnosis during the progression stages of HCC. Survival rates of patients
with HCC can be significantly improved if the diagnosis is made at earlier stages, when
treatment is more effective [12]. Alpha-fetoprotein (AFP), the serologic biomarker for HCC
in current use, is not effective for early diagnosis due to its low sensitivity [48, 130]. Therefore,
more potent biomarkers for early-stage HCC are needed.
LC-MS-based glycomics. Glycosylation is one of the most common post-translational
modifications of proteins. Altered patterns of glycosylation have been associated with various diseases and many currently used cancer biomarkers, including AFP, are glycoproteins [13,37]. The analysis of glycosylation is particularly relevant to liver pathology because
of the major influence of this organ on the homeostasis of blood glycoproteins. However,
quantitative analysis of glycoproteins remains challenging in large-scale studies due to the
dynamic nature of glycosylation [13]. An effective alternative is to analyze glycans released
from proteins and associate the glycomic changes with pathological conditions of interest.
N-glycans are of particular interest as their involvement in major biological processes, including cell-cell interactions and intracellular signaling, has important implications in disease
progression [37]. Also, several enzymes that allow efficient release of this type of glycans have
been made available [82]. Through appropriate analytical methods that yield broad coverage
of the glycome, characterizing glycomic patterns in serum/plasma of patients with cancer
has proven a promising strategy to discover biomarkers for early diagnosis of cancer [82,113].
In particular, mass spectrometry is an enabling technology for analysis of glycans in cancer
biomarker discovery [82]. The use of matrix-assisted laser desorption/ionization (MALDI)
mass spectrometry to identify N-glycan biomarkers for HCC has been widely applied and
discussed [42,65,109,125]. With recent advances in mass spectrometry and separation methods, LC-MS is capable of profiling hundreds of glycans including isomeric glycoforms [23,55].
Higher sensitivity of LC-MS over MALDI in detecting permethylated N-glycans derived from
serum has been demonstrated in a comparative study [55]. However, to date the glycomic
profiling using LC-MS has not been fully exploited for large-scale biomarker discovery studies
and there is still a lack of appropriate computational tools [113].
We apply LC-MS-based serum glycomics for HCC biomarker discovery in patients with liver
cirrhosis. The workflow of the proposed glycomic analysis is shown in Figure 4.1. Sera were
collected from participants recruited at the Tanta University Hospital in Egypt. We utilize
two complementary platforms to perform global profiling and targeted quantification
of N-glycans, and identify candidate biomarkers that distinguish HCC cases from cirrhotic
controls. Global profiling is performed using a high-resolution mass spectrometer (LTQ-Orbitrap
Velos), while targeted quantification is performed using a triple quadrupole (QqQ)
mass spectrometer in multiple reaction monitoring (MRM) mode [97, 98]. The integrative
workflow consisting of global profiling and targeted quantification is widely applied in
LC-MS-based proteomic studies but, to the best of our knowledge, has not yet been exploited in
glycomics. In this analysis, we identify candidate biomarkers from global profiling using the
developed pipeline and confirm the results against those obtained by targeted quantification.
Figure 4.1: Workflow for the LC-MS-based analysis of N-glycans in sera.
4.2 Experimental methods
Study cohort. The samples in this study were obtained from participants recruited in
Egypt. The study cohort consists of adult patients with HCC or liver cirrhosis recruited
from the outpatient clinics and inpatient wards of the Tanta University Hospital, Egypt.
The participants consist of 89 subjects (40 HCC cases and 49 patients with liver cirrhosis).
Detailed characteristics of the patient population are provided in Table 4.1.
Table 4.1: Characteristics of the study cohort.
                               HCC (n = 40)     Cirrhosis (n = 49)   p-value
Age            Mean (SD)       53.2 (3.9)       53.8 (7.6)           0.3530
BMI            Mean (SD)       24.9 (3.1)       24.5 (4.4)           0.6513
Gender         Male            77.5%            67.3%                0.3474
HCV serology   HCV Ab+         100.0%           100.0%               1.0000
HBV serology   HBsAg+          0.0%             6.1%                 0.2492
MELD*          Mean (SD)       18.6 (7.7)       18.9 (7.1)           0.1328
               MELD ≤ 10       20.0%            12.2%                0.3863
AFP            Median (IQR)    275.9 (1244.3)
HCC stage      Stage I         72.5%
               Stage II        15.0%
               Stage III       5.0%
               Unknown         7.5%
*MELD: model for end-stage liver disease
Experimental design. We analyzed the collected sera in four batches (designated as E1,
E2, E3 and E4). Each batch consists of approximately 24 samples, balanced between HCC
cases and cirrhotic controls in terms of age, race, gender, smoking, alcohol and BMI. Samples
within the same batch were prepared together and LC-MS analysis was performed following
a randomized order to avoid systematic biases. The same procedure of sample preparation
was applied for the analysis of global profiling and targeted quantification, which consists
of release, purification, reduction and permethylation of N-glycans. All the samples were
analyzed through global profiling and targeted quantification (see footnote 1).
Footnote 1. Laboratory methods: Permethylated N-glycans were separated by an UltiMate 3000 nano-LC system
(Dionex, Sunnyvale, CA) with an Acclaim PepMap C18 column at 55 °C to promote efficient separation. The
flow rate of the nanopump was set to 350 nl/min. Mobile phase A consisted of 2% ACN and 98% water with 0.1%
formic acid, while mobile phase B consisted of ACN with 0.1% formic acid. The gradient program started at
20% mobile phase B over 10 min, which was ramped to 38% at 11 min and linearly increased to 60% in the
following 32 min. Then, mobile phase B was increased to 90% in 3 min and the percentage was kept for 4 min.
Finally, mobile phase B was decreased to 20% in 1 min and the percentage was kept for 9 min to equilibrate
the column. The nano-LC system was interfaced to an LTQ-Orbitrap Velos (Thermo Scientific, San Jose,
CA) hybrid mass spectrometer. The mass spectrometer was operated in data-dependent acquisition (DDA)
mode, where each MS full scan (m/z range 500–2000) was followed by five MS/MS scans of the most intense
ions. Targeted quantification of 117 N-glycans including isomers was performed by MRM using a QqQ mass
spectrometer (TSQ Vantage) with Q1 and Q3 operated at unit resolution. The chromatographic conditions
were as described above for the global profiling analysis, utilizing an UltiMate 3000 nano-LC system with
an identical gradient setup. The dwell time was 2.7 sec on average. Sample preparation and data acquisition
were performed in the laboratory of Dr. Yehia Mechref at Texas Tech University.
[Figure 4.2 plot: three-dimensional view of a single peak; axes: Retention time (min), m/z, Ion count (x 10^5).]
Figure 4.2: Characteristics of a peak in LC-MS raw data.
4.3 Global profiling analysis
Data preprocessing. The LC-MS raw data were analyzed using a preprocessing pipeline
consisting of in-house-developed algorithms and open-source software tools. The pipeline
converts the raw data into a peak list. Each LC-MS run contains thousands of peaks and
ubiquitous noise; a representative peak in LC-MS raw data is shown in Figure 4.2.
As introduced in Section 1.2.2, an isotopic pattern is present in mass spectra throughout the
elution of the profiled compound. In consideration of such characteristics, we developed a
data preprocessing pipeline to perform deisotoping of mass spectra, peak detection, retention
time alignment and peak matching.
We performed the deisotoping of mass spectra using DeconTools [60], where the monoisotopic
mass and charge state were deduced. DeconTools allows us to specify an appropriate average
residue composition for the calculation of isotopic distribution. The average composition
for the monosaccharides (C10H18N0.43O5S0) was determined based on the permethylated N-glycans
commonly found in our previous glycomic studies [42,109,125]. After the deisotoping
step, peak detection was performed using an in-house-developed algorithm. Specifically,
deisotoped ions with the same molecular weight (within a 10 ppm tolerance) were linked along
scans to generate a chromatographic trace. Low-quality traces were screened out according
to the following criteria: 1) a minimum of 20 scans to define a trace, 2) a minimum total ion count
of 50,000, 3) a minimum density of 0.3 in a trace, and 4) a maximum of 20 missing values
between adjacent scans. After the screening step, missing values within the remaining traces were
interpolated using their corresponding extracted ion chromatograms from raw data. The
interpolated trace was further processed through successive convolution with a Savitzky-Golay
smoothing filter and a first-order derivative of a Gaussian kernel, to identify the
position and boundary of the chromatographic peak at the zero-crossing and its enclosing local
extrema, respectively. In the peak list of each LC-MS run, the properties used to characterize a
peak were: monoisotopic mass, charge state, intensity (area under the curve within the boundary)
and retention time (RT).
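The four screening criteria above can be illustrated with a short Python sketch. This is not the in-house algorithm: the representation of a trace as (scan index, ion count) pairs, the function name `passes_screening`, the reading of "density" as the fraction of observed scans within the span of the trace, and the reading of criterion 4 as the longest run of missing scans between adjacent observations are all assumptions made here for illustration.

```python
def passes_screening(trace, min_scans=20, min_total=50_000.0,
                     min_density=0.3, max_gap=20):
    """Return True if a chromatographic trace passes the four criteria.

    trace: list of (scan_index, ion_count) pairs sorted by scan index
           (an assumed representation, not the original data structure).
    """
    scans = [s for s, _ in trace]
    # 1) minimum number of scans needed to define a trace
    if len(scans) < min_scans:
        return False
    # 2) minimum total ion count
    if sum(c for _, c in trace) < min_total:
        return False
    # 3) density, taken here as observed scans / scans spanned by the trace
    span = scans[-1] - scans[0] + 1
    if len(scans) / span < min_density:
        return False
    # 4) longest run of missing scans between adjacent observations
    gaps = [b - a - 1 for a, b in zip(scans, scans[1:])]
    return max(gaps, default=0) <= max_gap


# Hypothetical traces: one dense, one with a 25-scan hole in the middle.
dense = [(s, 3000.0) for s in range(100, 125)]
gappy = ([(s, 3000.0) for s in range(0, 10)]
         + [(s, 3000.0) for s in range(35, 45)])
print(passes_screening(dense), passes_screening(gappy))
```

Traces that pass would then proceed to interpolation and smoothing as described in the text.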
Prior to matching the detected peaks across LC-MS runs, we applied our developed Bayesian
alignment model (BAM) to estimate the mapping function for each LC-MS run and modified
the RT values in the peak lists, i.e., replacing t by u_i(t). Figures 4.3–4.6 depict the clustered
ion chromatograms, before and after alignment by BAM in the four batches. The adjusted
peaks in multiple LC-MS runs were then matched using the simultaneous multiple alignment
(SIMA) model [136]. The resulting consensus peak list of the LC-MS runs was further refined
such that only the peaks detected in over half of the runs were retained. The alignment
process can greatly reduce the ambiguity during the peak matching step. Figure 4.7 presents
the RT differences across runs of these refined peaks, with different parameters used in SIMA.
The proportions of peaks at each level of RT variation (in one-second bins) are summarized.
As shown in this figure, applying BAM led to more consistent RT values across runs: it
reduced the mode from 10 seconds to less than five. Moreover, the peak matching results
became more robust to the selection of the parameters.
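The two bookkeeping steps described above, replacing t by u_i(t) in a peak list and retaining only consensus peaks detected in over half of the runs, can be sketched as follows. The sketch assumes that the estimated mapping function of a run is available as sampled values on a time grid and that linear interpolation between samples is acceptable; the names `map_rt` and `keep_majority` and the dictionary layout of a consensus peak are hypothetical and not part of BAM or SIMA.

```python
from bisect import bisect_right


def map_rt(t, grid, mapped):
    """Evaluate u_i(t) by linear interpolation between sampled values.

    grid:   increasing retention times at which u_i was evaluated
    mapped: u_i at those times (e.g., a posterior median)
    Times outside the grid are extrapolated from the nearest segment.
    """
    j = min(max(bisect_right(grid, t), 1), len(grid) - 1)
    t0, t1 = grid[j - 1], grid[j]
    u0, u1 = mapped[j - 1], mapped[j]
    return u0 + (u1 - u0) * (t - t0) / (t1 - t0)


def keep_majority(consensus, n_runs):
    """Keep consensus peaks observed in more than half of the runs.

    consensus: list of dicts mapping run id -> adjusted RT of the
               matched peak (runs without a match are simply absent).
    """
    return [peak for peak in consensus if len(peak) > n_runs / 2]


# Hypothetical mapping for one run and a toy consensus list for four runs.
grid, mapped = [0.0, 100.0, 200.0], [0.0, 110.0, 200.0]
adjusted = [map_rt(t, grid, mapped) for t in (50.0, 150.0)]
kept = keep_majority([{1: 10.0, 2: 10.5, 3: 9.8}, {1: 20.0, 2: 20.2}], 4)
print(adjusted, len(kept))
```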
After the peak matching, a normalization step was applied to ensure that the summed intensity
of detected peaks is identical in all the runs from the same batch. Finally, missing values
owing to either peak detection or alignment were interpolated using corresponding extracted
ion chromatograms. In this study, peaks out of the expected RT range of glycans (15–50 min)
were excluded from subsequent analysis. The preprocessing pipeline resulted in 2132, 2620,
2412, and 2392 consensus peaks in E1, E2, E3, and E4, respectively.
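As a small illustration of the normalization step, the following Python sketch rescales each run in a batch so that all runs share the same summed intensity. The common target (the mean of the run totals) is an assumption; the text only states that the summed intensities are made identical. The function name `normalize_total` is hypothetical.

```python
def normalize_total(runs):
    """Scale each run so that every run has the same summed intensity.

    runs: list of lists of peak intensities, one inner list per run.
    The common target is the mean of the run totals (an assumed choice).
    """
    totals = [sum(run) for run in runs]
    target = sum(totals) / len(totals)
    return [[x * target / total for x in run]
            for run, total in zip(runs, totals)]


# Two hypothetical runs from one batch with totals 4 and 8.
batch = [[1.0, 3.0], [2.0, 6.0]]
print([sum(run) for run in normalize_total(batch)])
```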
Statistical analysis. Following the data preprocessing, the most relevant peaks with differential abundance between HCC cases and cirrhotic controls were selected using a two-way
analysis of variance (ANOVA) model. Peaks from the four batches were matched upfront,
[Figure 4.3 plots: eight panels (a)-(h); x-axis: Retention time (min); y-axis: Intensity (A.U.).]
Figure 4.3: Clustered ion chromatograms in E1. (a)-(d) are the unaligned chromatograms
and (e)-(h) are their corresponding aligned chromatograms.
[Figure 4.4 plots: eight panels (a)-(h); x-axis: Retention time (min); y-axis: Intensity (A.U.).]
Figure 4.4: Clustered ion chromatograms in E2. (a)-(d) are the unaligned chromatograms
and (e)-(h) are their corresponding aligned chromatograms.
[Figure 4.5 plots: eight panels (a)-(h); x-axis: Retention time (min); y-axis: Intensity (A.U.).]
Figure 4.5: Clustered ion chromatograms in E3. (a)-(d) are the unaligned chromatograms
and (e)-(h) are their corresponding aligned chromatograms.
[Figure 4.6 plots: eight panels (a)-(h); x-axis: Retention time (min); y-axis: Intensity (A.U.).]
Figure 4.6: Clustered ion chromatograms in E4. (a)-(d) are the unaligned chromatograms
and (e)-(h) are their corresponding aligned chromatograms.
[Figure 4.7 plots: (a) RT tolerance of 10, (b) RT tolerance of 20, (c) RT tolerance of 30, (d) RT tolerance of 40; x-axis: RT difference; y-axis: Proportion; legend: SIMA, BAM+SIMA.]
Figure 4.7: Distribution of the RT differences (in seconds) across LC-MS runs of consensus
peaks identified by SIMA with different parameters of RT tolerance.
and peak intensity was modeled in terms of group effect (HCC versus cirrhosis), batch effect
and interaction between group and batch. Specifically, the peak intensity for replicate $k$
($k = 1, \ldots, n_{ij}$) from group $i$ ($i = 1, 2$) in batch $j$ ($j = 1, 2, 3, 4$) is modeled as
\[
Y_{ijk} = \mu + G_i + B_j + (G \times B)_{ij} + \epsilon_{ijk}, \tag{4.1}
\]
where $\mu$ is the overall mean of the samples, $G_i$'s are the group effects with
\[
\sum_i G_i = 0,
\]
$B_j$'s are the batch effects with
\[
\sum_j B_j = 0,
\]
$(G \times B)_{ij}$'s are the interactions between group and batch with
\[
\sum_i (G \times B)_{ij} = 0, \quad \forall j = 1, 2, 3, 4, \qquad
\sum_j (G \times B)_{ij} = 0, \quad \forall i = 1, 2,
\]
and $\epsilon_{ijk}$'s are the random errors from a zero-mean normal distribution.
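Table 4.2 summarizes the properties of the ANOVA model, where the sum of squares (SS) and
degrees of freedom (df) corresponding to each source are given below:
\begin{align*}
\mathrm{SSG} &= \sum_{ijk} (\bar{Y}_{i..} - \bar{Y}_{...})^2, & \mathrm{dfG} &= \#\,\text{Groups} - 1, \\
\mathrm{SSB} &= \sum_{ijk} (\bar{Y}_{.j.} - \bar{Y}_{...})^2, & \mathrm{dfB} &= \#\,\text{Batches} - 1, \\
\mathrm{SSGB} &= \sum_{ijk} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2, & \mathrm{dfGB} &= \mathrm{dfG} \times \mathrm{dfB}, \\
\mathrm{SSE} &= \sum_{ijk} (Y_{ijk} - \bar{Y}_{ij.})^2, & \mathrm{dfE} &= \sum_{ij} (n_{ij} - 1), \\
\mathrm{SST} &= \sum_{ijk} (Y_{ijk} - \bar{Y}_{...})^2, & \mathrm{dfT} &= \sum_{ij} n_{ij} - 1,
\end{align*}
and estimates associated with the considered factors at different levels are
\begin{align*}
\bar{Y}_{...} &= \frac{\sum_{ijk} Y_{ijk}}{\sum_{ij} n_{ij}}, \\
\bar{Y}_{i..} &= \frac{\sum_{jk} Y_{ijk}}{\sum_{j} n_{ij}}, \quad i = 1, 2, \\
\bar{Y}_{.j.} &= \frac{\sum_{ik} Y_{ijk}}{\sum_{i} n_{ij}}, \quad j = 1, 2, 3, 4, \\
\bar{Y}_{ij.} &= \frac{\sum_{k} Y_{ijk}}{n_{ij}}, \quad i = 1, 2 \text{ and } j = 1, 2, 3, 4.
\end{align*}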
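To make the quantities above concrete, the following Python sketch evaluates the sums of squares, degrees of freedom and F-statistics for a single peak directly from the formulas in this section. It is an illustration under stated assumptions rather than the code used in the study: the function name `two_way_anova` and the dictionary layout (keys are (group, batch) pairs, values are the replicate intensities) are hypothetical, every group-batch cell is assumed to be non-empty, and the conversion of F-statistics to p-values is left out because it requires the F distribution.

```python
def two_way_anova(cells):
    """Two-way ANOVA quantities for one peak.

    cells: dict mapping (group i, batch j) -> list of values Y_ijk.
    """
    groups = sorted({i for i, _ in cells})
    batches = sorted({j for _, j in cells})
    n = {ij: len(y) for ij, y in cells.items()}
    total_n = sum(n.values())

    # Means at the overall, group, batch and cell levels.
    grand = sum(sum(y) for y in cells.values()) / total_n
    g_mean = {i: sum(sum(cells[i, j]) for j in batches)
              / sum(n[i, j] for j in batches) for i in groups}
    b_mean = {j: sum(sum(cells[i, j]) for i in groups)
              / sum(n[i, j] for i in groups) for j in batches}
    c_mean = {ij: sum(y) / n[ij] for ij, y in cells.items()}

    # Sums run over i, j and k, so cell-level terms are weighted by n_ij.
    ss_g = sum(n[i, j] * (g_mean[i] - grand) ** 2 for i, j in cells)
    ss_b = sum(n[i, j] * (b_mean[j] - grand) ** 2 for i, j in cells)
    ss_gb = sum(n[i, j] * (c_mean[i, j] - g_mean[i] - b_mean[j] + grand) ** 2
                for i, j in cells)
    ss_e = sum((y - c_mean[ij]) ** 2 for ij, ys in cells.items() for y in ys)

    df_g, df_b = len(groups) - 1, len(batches) - 1
    df_gb, df_e = df_g * df_b, total_n - len(cells)
    mse = ss_e / df_e
    return {
        "SSG": ss_g, "SSB": ss_b, "SSGB": ss_gb, "SSE": ss_e,
        "F_group": (ss_g / df_g) / mse,
        "F_batch": (ss_b / df_b) / mse,
        "F_interaction": (ss_gb / df_gb) / mse,
    }


# Hypothetical balanced toy data: 2 groups x 2 batches x 2 replicates.
toy = {(1, 1): [1.0, 2.0], (1, 2): [2.0, 3.0],
       (2, 1): [3.0, 4.0], (2, 2): [5.0, 6.0]}
print(two_way_anova(toy))
```

If SciPy is available, a p-value can be obtained from an F-statistic with the survival function of the F distribution, `scipy.stats.f.sf(F, dfn, dfd)`.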
We calculated p-values with the null hypothesis that the group means within each batch
are the same. Peaks with a p-value < 0.05 and having a consistent direction of fold change
(FC) between groups in all four batches were selected as statistically significant. Prior to
the statistical analysis, a logarithmic transformation was applied to ensure validity of the
normal distribution assumption as shown in Figure 4.8.
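The selection rule stated above (p-value < 0.05 together with a consistent fold-change direction in all batches) can be sketched in Python as follows. How the per-batch fold change was computed is not specified in the text; the sketch assumes the difference between the mean log2 intensities of the two groups within each batch. The function names and the dictionary layout (batch label to list of raw intensities) are hypothetical.

```python
from math import log2
from statistics import fmean


def consistent_direction(hcc, cirrhosis):
    """Check that the HCC-vs-cirrhosis change has the same sign in every batch.

    hcc, cirrhosis: dicts mapping batch label -> list of raw intensities
                    of one peak (an assumed layout).
    """
    diffs = [fmean([log2(x) for x in hcc[b]])
             - fmean([log2(x) for x in cirrhosis[b]]) for b in hcc]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)


def is_selected(p_value, hcc, cirrhosis, alpha=0.05):
    """Apply both criteria: significance and consistent direction."""
    return p_value < alpha and consistent_direction(hcc, cirrhosis)


# Hypothetical intensities for one peak in two batches.
hcc = {"E1": [8.0, 16.0], "E2": [4.0, 8.0]}
cir = {"E1": [2.0, 4.0], "E2": [2.0, 2.0]}
print(is_selected(0.01, hcc, cir))
```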
Table 4.2: Two-way analysis of variance of the glycomic data.
Source               df     SS     MS                    F-statistic
Group effect         dfG    SSG    MSG = SSG/dfG         MSG/MSE
Batch effect         dfB    SSB    MSB = SSB/dfB         MSB/MSE
Interaction effect   dfGB   SSGB   MSGB = SSGB/dfGB      MSGB/MSE
Error                dfE    SSE    MSE = SSE/dfE
Total                dfT    SST    MST = SST/dfT
[Figure 4.8 plots: two histograms with y-axis: Count; (a) x-axis: Intensity; (b) x-axis: Log(Intensity).]
Figure 4.8: Histograms of peak intensities before (a) and after (b) logarithmic transformation.
Candidate N-glycan biomarkers through LC-MS global profiling. The statistical analysis based on the ANOVA model revealed 78 peaks that are statistically significant (p-value < 0.05 and consistent FC). Putative glycan structures were assigned to the
selected peaks by matching experimentally measured mass values with theoretical values
of human serum N-glycans that were previously characterized according to the numbers of
five monosaccharides: N-acetylglucosamine (GlcNAc), mannose, galactose, fucose, and N-acetylneuraminic acid (NeuNAc). The matching (with a tolerance of 2 ppm) resulted in 10
significant N-glycans (Table 4.3). Potential isomers were observed for two galactosylated
beta-1,6-GlcNAc branching glycans [5-3-3-0-1] and [5-3-3-0-2]. All of the significant glycans
belong to the complex type.
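The mass matching step relies on a relative tolerance expressed in parts per million (ppm). A minimal Python sketch of such a match is given below; the function names, the library layout, and the masses in the example are hypothetical placeholders rather than actual glycan masses from the study.

```python
def ppm_error(measured, theoretical):
    """Signed mass error in parts per million relative to the theoretical mass."""
    return (measured - theoretical) / theoretical * 1e6


def match_mass(measured, library, tol_ppm=2.0):
    """Return the library entries whose theoretical mass lies within tol_ppm.

    library: dict mapping a label to its theoretical monoisotopic mass.
    """
    return [name for name, mass in library.items()
            if abs(ppm_error(measured, mass)) <= tol_ppm]


# Placeholder masses chosen only to exercise the tolerance check.
library = {"glycan_A": 2000.000, "glycan_B": 2000.010}
print(match_mass(2000.003, library))
```

At a mass of about 2000 Da, a 2 ppm tolerance corresponds to roughly 0.004 Da, which is why the second placeholder entry, 0.007 Da away, is not matched.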
Table 4.3: N-glycan candidate biomarkers identified by the LC-MS global profiling.
N-glycan      RT     Charge   p-value   Fold change
[5-3-0-0-0]   27.6   2        0.004     ↓1.76
                     3        0.007     ↓1.77
[5-3-1-0-0]   29.0   2        0.028     ↓1.33
                     3        0.025     ↓1.34
[5-3-1-0-1]   31.8   2        0.040     ↓1.50
[5-3-3-0-1]   30.9   3        0.036     ↑1.32
              32.5   3        0.019     ↑1.34
[5-3-3-0-2]   33.0   3        0.016     ↑1.46
              34.6   3        0.004     ↑1.54
                     4        0.040     ↑1.46
[5-3-3-0-3]   34.8   3        0.030     ↑1.51
                     4        0.042     ↑1.46
[6-3-4-0-1]   33.7   3        0.028     ↑1.33
[6-3-4-0-2]   35.7   3        0.018     ↑1.66
                     4        0.030     ↑1.63
[6-3-4-0-3]   37.7   3        0.034     ↑1.83
                     4        0.027     ↑1.78
[6-3-4-0-4]   39.1   4        0.027     ↑1.74
4.4 Multiple reaction monitoring quantification
Targeted quantification of 117 N-glycans including isomers was performed by MRM using
a QqQ mass spectrometer. These targets include 1) N-glycans that were detected on the
Orbitrap or QqQ instrument in our previous studies, 2) N-glycans evaluated as potential
HCC biomarkers in previous studies [21, 42, 77, 109, 124, 125], and 3) N-glycans involved in
Golgi apparatus retrieved from KEGG GLYCAN database [49]. The 117 N-glycans were
represented by 213 channels (three transitions in each) consisting of their different adduct
forms and charge states.
Curation of the transitions was performed to eliminate channels with unfavorable chromatographic profiles or significant noises, and to determine appropriate RT windows for quantification. Owing to the unit resolution in Q1 and Q3, interferences may appear across channels
with close m/z values in their transitions. The observed elution order of N-glycans on the
Orbitrap system was used to elucidate some ambiguous cases in the MRM analysis. Among
the 213 transition channels, 93 channels representing 124 potential isomers of 65 N-glycans
were detected consistently and quantified for subsequent analysis. As in the global profiling
analysis, peak intensities were log-transformed prior to the statistical analysis. A normalization step was also applied to ensure the mean of the log-transformed peak intensities is
identical in all the MRM runs from the same batch. The two-way ANOVA model used in
the global profiling was applied to identify candidate N-glycan biomarkers. Through the
targeted quantification, we identified 10 N-glycans that are statistically significant (p-value
< 0.05 and consistent FC in four batches) as shown in Table 4.4.
Four of these significant glycans have a p-value < 0.01: [5-3-0-0-0] and [5-3-1-0-1], which are
down-regulated in HCC; [5-3-3-0-2] and [5-3-3-0-3], which are up-regulated in HCC. Most of
the significant glycans were also identified by the global profiling analysis, i.e., [5-3-0-0-0],
[5-3-1-0-0], [5-3-1-0-1], [5-3-3-0-2], [5-3-3-0-3], [6-3-4-0-2], [6-3-4-0-3] and [6-3-4-0-4]. Their
MRM quantification results are shown in Figure 4.9.
Table 4.4: N-glycan candidate biomarkers identified by the MRM targeted quantification.
N-glycan      RT     Charge   p-value   Fold change
[5-3-0-0-0]   29.5   2        0.0009    ↓1.81
[5-3-1-0-0]   30.3   2        0.018     ↓1.34
                     3        0.020     ↓1.34
[5-3-1-0-1]   33.8   2        0.003     ↓1.39
[5-3-1-1-1]   36.0   2        0.027     ↓1.39
[5-3-3-0-2]   34.3   3        0.003     ↑1.45
[5-3-3-0-3]   36.5   3        0.010     ↑1.36
                     4        0.029     ↑1.36
[5-3-3-2-1]   37.0   3        0.014     ↑1.37
[6-3-4-0-2]   38.3   3        0.024     ↑1.29
                     4        0.020     ↑1.37
[6-3-4-0-3]   39.5   4        0.011     ↑1.50
[6-3-4-0-4]   36.5   4        0.012     ↑1.44
              40.8   4        0.017     ↑1.73
4.5 Discussion
We analyzed N-glycans in sera from HCC cases and cirrhotic controls, where N-glycans were
enzymatically removed from serum proteins and permethylated, allowing relative quantification of hundreds of oligosaccharides. Candidate N-glycan biomarkers were identified through
LC-MS-based global profiling and targeted quantification. The most relevant glycans in distinguishing HCC cases from cirrhotic controls were selected using a two-way ANOVA model.
We identified 10 statistically significant N-glycans through each of the quantification approaches (12 in total). Although none of these glycans had an adjusted p-value < 0.05 in consideration of multiple testing correction using the method by Benjamini and Hochberg [10],
our integrative analysis revealed a good overlap of the significant glycans identified by both
quantification approaches. There are eight candidate biomarkers overlapping between the
two complementary platforms: [5-3-0-0-0], [5-3-1-0-0], [5-3-1-0-1], [5-3-3-0-2], [5-3-3-0-3],
[6-3-4-0-2], [6-3-4-0-3], and [6-3-4-0-4]. Six of these candidate biomarkers are sialylated glycans,
which were often excluded from investigation in previous studies [21, 77, 83, 124].
[Figure 4.9 plots: eight panels (a)-(h), each comparing HCC and Cirrhosis; y-axis: Log2 (intensity). Panel titles: [5-3-0-0-0] (RT: 29.5, FC: ↓1.81); [5-3-1-0-0] (RT: 30.3, FC: ↓1.34); [5-3-1-0-1] (RT: 33.8, FC: ↓1.39); [5-3-3-0-2] (RT: 34.3, FC: ↑1.45); [5-3-3-0-3] (RT: 36.5, FC: ↑1.36); [6-3-4-0-2] (RT: 38.3, FC: ↑1.37); [6-3-4-0-3] (RT: 39.5, FC: ↑1.5); [6-3-4-0-4] (RT: 40.8, FC: ↑1.73).]
Figure 4.9: Quantification results of eight candidate N-glycan biomarkers in sera of HCC
cases and cirrhotic controls by the MRM analysis. (a)-(c) Down-regulated bisected GlcNAc
glycans. (d)-(e) Up-regulated beta-1,6-GlcNAc branching glycans. (f)-(h) Up-regulated
tetra-antennary glycans.
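For reference, the multiple testing correction by Benjamini and Hochberg mentioned in this discussion can be sketched in a few lines of Python. This is a generic illustration of the step-up adjustment, with a hypothetical function name and made-up p-values; it does not reproduce the p-values or the number of tests in this study.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration only.
print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```

The adjustment scales each p-value by the number of tests divided by its rank, so a peak with a raw p-value below 0.05 can have an adjusted value above 0.05 when many peaks are tested.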
We used the 12 significant N-glycans to further evaluate the performance of the proposed
Bayesian alignment model (BAM). Specifically, the consensus peaks corresponding to the 12
glycans were searched against the peak detection results, and three procedures were compared: no alignment performed to adjust individual peak lists (SIMA), alignment performed
using BAM (BAM+SIMA), and alignment performed based on single chromatograms using DTW (DTW+SIMA). We compared the number of missing values caused by the peak
matching step with different parameters in SIMA (RT tolerances of 10, 20, 30 and 40). As
shown in Table 4.5, applying BAM to adjust the retention time variation facilitated the
subsequent peak matching step, and it made the peak matching result more robust to the
selection of parameters, as also discussed in Chapter 3. A small RT parameter in SIMA may
lack sufficient coverage to capture the RT variation, while a large RT parameter induces possible interference between peaks. Without appropriate retention time alignment, choosing a
right parameter for the peak matching step can be ambiguous. Alignment by DTW did not
eliminate such ambiguity. As discussed in Section 2.5.3, DTW is prone to overfitting the
data, especially when the chromatographic profiles captured by base peak chromatograms
are not consistent across runs (Figure 4.10). As a result, this procedure introduced additional variability in the peak matching process. It is noted that the comparison presented
here is not comprehensive as there is no ground-truth information that can be used for a
thorough evaluation as conducted in Section 3.6.
Biosynthesis of N-glycans in the Golgi apparatus involves trimming of mannose residues
and stepwise addition of monosaccharides, resulting in three groups of N-glycans: highmannose, complex and hybrid types. Some biomarker discovery studies for other types of
cancer have shown that structurally-related glycans are likely to have correlated changes of
levels due to the same biosynthesis process they are involved in [113]. Our analysis exhibited a similar phenomenon, where many of the candidate biomarkers for HCC are closely
related in their structures. These glycans can be grouped into three clusters, and within
each cluster the glycans show consistent changes in their levels. Specifically, there are glycans with 1) bisected GlcNAc structure, 2) beta-1,6-GlcNAc branching structure, and 3)
tetra-antennary structure as shown in Figure 4.11. Further elucidation of the relationship
between the identified complex N-glycans can be obtained by referring to their biosynthesis
process. N-acetylglucosaminyltransferase III (GnT-III) and N-acetylglucosaminyltransferase
V (GnT-V) are glycosyltransferases that have been known to play a key role in the formation
of N-glycan branches [99,147]. GnT-III and GnT-V lead to two distinct branching structures
in N-glycans with contrasting implications for cancer metastasis. GnT-III catalyzes the formation of a bisected GlcNAc linkage in N-glycans, which has been associated with inhibition
of cancer metastasis, whereas GnT-V catalyzes the addition of beta-1,6-GlcNAc branching
of N-glycans, and has been considered as a promoter of metastasis. Their implications on
HCC have also been discussed [77,83,141]. In this study, decreased levels in HCC were found
in GnT-III’s downstream products (bisected GlcNAc glycans), while opposite alteration was
[Figure 4.10 plots: (a) Batch 1, (b) Batch 2, (c) Batch 3, (d) Batch 4; x-axis: Retention time (min); y-axis: Ion count (x 10^9).]
Figure 4.10: Base peak chromatograms in four batches.
Table 4.5: Number of missing values associated with the 12 significant N-glycans caused by
the peak matching step.
                                   SIMA              DTW+SIMA            BAM+SIMA
N-glycan      RT     Charge    10   20   30   40    10   20   30   40    10   20   30   40
[5-3-0-0-0]   27.6   2          2    0    0    0     1    1    1    0     0    0    0    0
                     3          2    0    0    0     1    1    1    0     0    0    0    0
[5-3-1-0-0]   29.0   2          1    0    0    0     1    1    1    0     0    0    0    0
                     3          1    0    0    0     1    1    1    0     0    0    0    0
[5-3-1-0-1]   31.8   2         18    0    0    0     0    0    0    0     0    0    0    0
[5-3-1-1-1]   34.0   2         15    0    0    0     1    0    0    0     0    0    0    0
[5-3-3-0-1]   30.9   3          8    0    0    0     1    0    0    0     0    0    0    0
              32.5   3          8    0    0    0     0    0    0    0     0    0    0    0
[5-3-3-0-2]   33.0   3         11    0    0    0     7    0    0    0     1    0    0    0
                     4          8    0    0    0    12    0    0    0     1    0    0    0
              34.6   3          3    1    0    0     1    1    0    0     1    0    0    0
[5-3-3-0-3]   34.8   3          7    0    0    0     1    1    1    1     0    0    0    0
                     4          8    0    0    0     1    1    1    1     0    0    0    0
[5-3-3-2-1]   37.1   2          8    0    0    0     2    1    1    1     0    0    0    0
[6-3-4-0-1]   33.7   3          9    0    0    0     0    0    0    0     0    0    0    0
[6-3-4-0-2]   35.7   3          0    0    0    0     0    0    0    0     0    0    0    0
                     4          0    0    0    0     0    0    0    0     0    0    0    0
[6-3-4-0-3]   37.7   3          0    0    0    0     1    0    0    0     0    0    0    0
                     4          0    0    0    0     2    0    0    0     0    0    0    0
[6-3-4-0-4]   39.1   4         13    0    0    0    10    5    5    5    11    1    0    0
found in GnT-V’s downstream products (beta-1,6-GlcNAc branching glycans). This observation is consistent with the reported roles of GnT-III and GnT-V in cancer metastasis
and with the progression from cirrhosis to HCC. In addition, increased levels
of beta-galactoside alpha-2,6-sialyltransferase (ST6GalI), which transfers a sialic acid residue
in alpha-2,6 linkage to a terminal galactose, have been associated with progression and poor
prognosis in HCC [13, 52, 102, 146]. Consistent findings were obtained in this study, where
increased levels in HCC were found in a number of sialylated GlcNAc branching N-glycans
(downstream products of ST6GalI).
Figure 4.11: Three clusters of the identified N-glycan candidate biomarkers and their fold
change directions (HCC versus cirrhosis).
4.6 Summary
This chapter presents an LC-MS-based glycomic study to identify N-glycan biomarkers for
HCC. We demonstrate the capability of the developed Bayesian alignment model (BAM)
to enhance the reliability of LC-MS data preprocessing. We integrate the BAM into a
preprocessing pipeline for LC-MS global profiling analysis in the glycomic study. Through the
preprocessing pipeline, we identified 10 candidate biomarkers from global profiling. Confirmation was based on the results of targeted quantification, performed on a complementary
platform using MRM. Eight of these candidate N-glycan biomarkers were identified by both
quantification approaches and match closely with the implications of important glycosyltransferases in cancer progression and metastasis. The results of this study illustrate the
power of the integrative approach combining LC-MS-based global profiling and targeted
quantification for a comprehensive serum glycomic analysis, to investigate changes in N-glycan levels between HCC cases and patients with liver cirrhosis. In addition, preprocessing
of the LC-MS data is crucial for the global profiling analysis, and the BAM greatly facilitates
a reliable analysis in the discovery phase.
Through an appropriate analysis workflow, this study revealed that glycomic changes during cancer progression represent systematic alterations. Enrichment analysis could potentially
increase the statistical power to detect the glycomic changes on a systems level and yield
biologically relevant results, as demonstrated in the genomic analysis [122]. Comprehensive
coverage of the glycome is a prerequisite, and the presented LC-MS-based workflow is expected to serve as a primary approach for this type of analysis. An associated challenge is
to develop reliable pipelines for glycan identification as in other omic analysis, most notably
metabolomics [140]. In order to allow a rigorous integration of changes in multiple glycans,
defining appropriate categories of ontological/topological information is critical. This strategy may be further enhanced by integrating additional characteristics through other omic
analysis, such as proteomics. Investigation of these candidate biomarkers on a larger population may allow specific stratification of the subjects on the basis of etiology and disease
stage. Our current work in this regard involves the development of more rigorous subsequent
analysis using multivariate statistical analysis or enrichment analysis to identify a panel of
HCC biomarkers, and to evaluate if the observed glycomic changes can be reliably used for
early detection of HCC in the high-risk population of cirrhotic patients. Furthermore, we will
incorporate additional omic measurements on the same subjects for a more comprehensive
characterization.
Chapter 5
Conclusion
This dissertation addresses retention time alignment for LC-MS data analysis. While LC-MS
has gained significant attention in systems biology and biomarker discovery, chromatographic
methods often cannot generate reproducible measurements in retention time, limiting the
ability to apply this analytical technique to large-scale omic studies. As a result, there is
a pressing need to develop powerful computational tools to address the issue of retention
time alignment. We approached this problem from fundamental methodology development to
rigorous integration of complementary information in LC-MS data. This chapter summarizes
original contributions of the research work and discusses possible future work.
5.1 Summary of original contributions
In this dissertation, we proposed and developed a Bayesian alignment model (BAM), which
integrates complementary information embedded in the LC-MS data through a mathematically rigorous framework. The BAM belongs to the category of profile-based approaches,
which are composed of two major components: a prototype function and a set of mapping
functions. The profile-based approaches make use of chromatograms covering the whole retention time range. This is in contrast to the commonly applied feature-based alignment
approaches, which rely on a subset of the time points. In addition, the profile-based approaches avoid heavy dependency on preceding processes including peak detection for each
LC-MS run and identification of consensus peaks across runs, as in the feature-based approaches.
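The two components of a profile-based approach can be illustrated with a minimal simulation sketch. The Gaussian-peak prototype, the piecewise-linear mapping functions, the scaling factor, and all names used here (prototype, mapping, simulate_run) are illustrative assumptions; they are not the model specification of the BAM given in Chapter 2.

```python
import numpy as np

def prototype(t):
    """Hypothetical prototype function: two Gaussian peaks on a common time axis."""
    return (np.exp(-0.5 * ((t - 20.0) / 1.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((t - 35.0) / 1.5) ** 2))

def mapping(t, knots, values):
    """Monotone piecewise-linear mapping from observed time to prototype time."""
    if np.any(np.diff(values) <= 0):
        raise ValueError("mapping values must be strictly increasing")
    return np.interp(t, knots, values)

def simulate_run(t, knots, values, scale=1.0, noise_sd=0.002, seed=0):
    """One chromatogram: scaled prototype evaluated at mapped times, plus noise."""
    rng = np.random.default_rng(seed)
    clean = scale * prototype(mapping(t, knots, values))
    return clean + rng.normal(0.0, noise_sd, size=t.shape)

t = np.linspace(10.0, 50.0, 401)                      # retention time grid (min)
knots = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
identity = knots.copy()                               # run without distortion
shifted = np.array([10.0, 19.0, 29.5, 40.5, 50.0])    # run eluting late near 20 min

run_a = simulate_run(t, knots, identity, seed=1)
run_b = simulate_run(t, knots, shifted, scale=1.2, seed=2)
apex_a = t[np.argmax(run_a)]   # apex of the main peak in the undistorted run
apex_b = t[np.argmax(run_b)]   # the same peak, observed later in the distorted run
```

In this toy setting, alignment amounts to recovering the mapping values of each run so that all runs agree with the shared prototype; a feature-based method would instead work only with the detected apex times.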
Appropriate estimation of the prototype and mapping functions is crucial for good alignment
results in the profile-based approaches. In contrast to time-warping approaches, in which the
alignment is performed in a pairwise manner, the proposed BAM simultaneously leverages all
the LC-MS runs to perform appropriate retention time alignment. The BAM uses Markov
chain Monte Carlo (MCMC) methods to draw inference on the model parameters, yielding
estimates of the retention time variability along with uncertainty measures. This enables a
natural framework to integrate complementary information from various sources by
weighing the uncertainty measures. The original contributions of this research work are
summarized as follows.
Development of single-profile Bayesian alignment model. We have developed a
single-profile Bayesian alignment model for LC-MS data analysis. The alignment model
improves on existing Bayesian methods by 1) using an efficient MCMC sampler based on
the proposed block Metropolis-Hastings algorithm, and 2) adaptively selecting knots for the
mapping function using stochastic search variable selection (SSVS). Due to the mathematical
intractability of the mapping function and the monotonicity constraint imposed on it, designing an effective updating scheme is crucial to ensure good mixing of the MCMC sampler.
The proposed block Metropolis-Hastings algorithm enables flexible transition and prevents
the sampler from getting trapped in local modes of the posterior distribution. Moreover, an
extension using SSVS has been developed for adaptive determination of the number and positions of knots. The developed methodology has been evaluated through comparison with
competitive approaches, based on both simulated and real LC-MS data. The evaluation
shows that our alignment model yields better performance, which is accomplished through
improved estimation and modeling of the mapping functions. Furthermore, a possible extension through formulation with the Jupp transformation, which enables use of the efficient
Hamiltonian Monte Carlo (HMC) algorithm, has been investigated and discussed.
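For readers unfamiliar with the Jupp transformation [64], the following sketch shows its usual form: an ordered sequence inside a fixed interval is replaced by the logarithms of the ratios of consecutive gaps, which are unconstrained real numbers. The function names and the example values are assumptions made for illustration, and the way Section 2.6 applies the transformation to the mapping function is not reproduced here.

```python
import numpy as np

def jupp_forward(knots, lower, upper):
    """Map strictly increasing values in (lower, upper) to unconstrained reals."""
    padded = np.concatenate(([lower], np.asarray(knots, dtype=float), [upper]))
    gaps = np.diff(padded)
    if np.any(gaps <= 0):
        raise ValueError("knots must be strictly increasing inside (lower, upper)")
    # One log-ratio of consecutive gaps per interior knot.
    return np.log(gaps[1:] / gaps[:-1])

def jupp_inverse(kappa, lower, upper):
    """Recover the ordered values from the unconstrained log-ratios."""
    log_gaps = np.concatenate(([0.0], np.cumsum(kappa)))   # gaps up to a common factor
    gaps = np.exp(log_gaps - log_gaps.max())               # subtract max for stability
    gaps *= (upper - lower) / gaps.sum()                   # gaps must fill the interval
    return lower + np.cumsum(gaps)[:-1]

example_knots = np.array([12.0, 20.0, 31.0, 45.0])
kappa = jupp_forward(example_knots, 10.0, 50.0)
recovered = jupp_inverse(kappa, 10.0, 50.0)
```

Because any real vector maps back to a valid ordered sequence, a sampler such as HMC can move freely in the transformed space without checking the ordering constraint at every step.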
Development of multi-profile Bayesian alignment model. We have further extended
the single-profile alignment model to handle multiple representative chromatograms simultaneously. Along with our developed MCMC sampling schemes, the multi-profile alignment
model offers two major attractive features: 1) multi-profile modeling with representative chromatograms identified by a clustering approach, and 2) use of a Gaussian process
prior based on internal standards to guide the alignment process. Conventional approaches
that derive chromatograms by binning along the m/z dimension are not satisfactory, as the
chromatographic profiles would inevitably be blurred. We have developed a novel clustering
approach to identify multiple representative chromatograms from each LC-MS run. This
approach takes into account chromatographic quality and reproducibility across runs, and
searches for a clustering configuration with less overlap between chromatograms. The resulting chromatograms are simultaneously considered in the profile-based alignment to improve
the estimation of prototype and mapping functions. Moreover, a novel use of internal standards as landmarks in the alignment process has been developed through Gaussian process
regression. Our developed model enables a rigorous integration of information from various
sources. Comprehensive evaluation of the model has been performed on LC-MS data from
proteomic and glycomic studies. The evaluation demonstrates that the proposed alignment
model substantially reduces the experimental variability in retention time measurement
and facilitates the subsequent peak matching process, which is key to the analysis of large-scale LC-MS-based omic studies.
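The use of internal standards as landmarks can be sketched with a plain Gaussian process regression. The example below is a generic illustration under stated assumptions: the squared-exponential kernel, its hyperparameters, the retention times of the standards, and the observed shifts are all made up, and the construction of the prior in Section 3.1 is not reproduced here.

```python
import numpy as np

def sq_exp_kernel(x1, x2, amplitude, length_scale):
    """Squared-exponential covariance between two sets of time points."""
    d = x1[:, None] - x2[None, :]
    return amplitude ** 2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test,
                 amplitude=1.0, length_scale=8.0, noise_sd=0.05):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    k_tt = sq_exp_kernel(x_train, x_train, amplitude, length_scale)
    k_tt = k_tt + noise_sd ** 2 * np.eye(len(x_train))
    k_st = sq_exp_kernel(x_test, x_train, amplitude, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = amplitude ** 2 - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical internal standards: nominal retention time (min) and observed shift (min).
standards_rt = np.array([15.0, 22.0, 30.0, 38.0, 46.0])
observed_shift = np.array([0.2, 0.5, 0.8, 0.6, 0.3])

grid = np.linspace(10.0, 50.0, 81)   # step of 0.5 min, so every standard lies on the grid
shift_mean, shift_sd = gp_posterior(standards_rt, observed_shift, grid)
```

The posterior standard deviation is small near the standards and grows away from them, which is the kind of uncertainty measure that lets landmark information be weighed against the chromatographic profiles.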
Application of Bayesian alignment model to biomarker discovery. We have integrated the proposed Bayesian alignment model (BAM) into a preprocessing pipeline for
analysis of LC-MS data, and applied the pipeline to a large-scale LC-MS-based glycomic
study for cancer biomarker discovery. The glycomic study consists of two complementary
analyses: 1) global profiling using LC-MS, and 2) targeted quantification by MRM. Through
the developed pipeline, we identified reliable candidate biomarkers from global profiling analysis and confirmed the results by targeted quantification. Preprocessing of the LC-MS
data is crucial for the global profiling analysis and the BAM ensures a reliable analysis in this
phase. Through cross-platform confirmation, the results of this study illustrate the power of
the integrative approach for a comprehensive LC-MS-based omic analysis.
5.2 Future work
There are several remaining topics that can be further explored. We discuss some of the
possible extensions in this section. The discussion presented here can be viewed as a starting
point for future research.
Computational efficiency. A current bottleneck of the developed Bayesian alignment
model (BAM) is the required computational time. This issue could be circumvented through
developing more efficient sampling schemes for parameter inference. The Hamiltonian Monte
Carlo (HMC) model introduced in Section 2.6 is an initial effort in this direction. To further
improve the HMC model and broaden its application scope, the main focus is to develop
adaptive ways to tune the HMC parameters. Incorporation of adaptive HMC samplers, e.g.,
the recently developed “No-U-Turn Sampler” (NUTS) [53], may deserve further investigation.
Besides MCMC methods, an interesting alternative is the use of deterministic approximation methods such
as the integrated nested Laplace approximations (INLA) [112], which may further improve
the efficiency of parameter inference. In devising powerful sampling schemes,
some form of transformation may be desirable, since it is often challenging to perform efficient
parameter inference in a constrained space, as in the BAM. However, this might complicate
the integration of informative priors, which can be naturally performed in the original space
of the BAM through Gaussian process regression as described in Section 3.1. We are currently
working on developing sampling approaches that can improve the efficiency of sampling while
retaining the convenience of incorporating informative priors. In addition to methodology development, practical approaches that rearrange the alignment process may further
reduce computational burden. One possible approach is to employ a coarse-to-fine alignment
procedure, where an approximate yet fast estimate is first derived based on down-sampled
data. This estimate is then used to initialize a more precise estimation based on the complete data. In the BAM, this can be accomplished by scheduling appropriate block moves,
i.e., using smaller values of rblock to create larger blocks in initial MCMC iterations. Devising adaptive ways to choose appropriate values of rblock while ensuring a valid MCMC
sampler (also known as adaptive MCMC [111]) would be an interesting topic to pursue in
this direction.
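A possible form of such a block schedule is sketched below. The sketch rests on an assumption that should be stated plainly: rblock is treated here as the probability that a block boundary falls between two neighbouring time points, so that small values give a few large blocks. The actual definition is the one given with the block Metropolis-Hastings algorithm earlier in the dissertation, and the function names and the linear ramp are hypothetical.

```python
import numpy as np

def random_blocks(n_points, r_block, rng):
    """Partition indices 0..n_points-1 into contiguous blocks.

    Assumption: each gap between neighbouring points becomes a block
    boundary independently with probability r_block.
    """
    cuts = np.flatnonzero(rng.random(n_points - 1) < r_block) + 1
    return np.split(np.arange(n_points), cuts)

def r_block_schedule(iteration, n_ramp, r_start=0.05, r_end=0.5):
    """Fixed linear ramp from r_start to r_end over the first n_ramp iterations."""
    frac = min(iteration / n_ramp, 1.0)
    return r_start + frac * (r_end - r_start)

rng = np.random.default_rng(0)
early_blocks = random_blocks(50, r_block_schedule(0, 100), rng)     # coarse moves
late_blocks = random_blocks(50, r_block_schedule(100, 100), rng)    # finer moves
```

A ramp fixed in advance does not depend on the history of the chain, so each iteration still uses an ordinary Metropolis-Hastings kernel. Choosing rblock from past samples would instead fall under adaptive MCMC [111], where additional conditions are needed for the sampler to remain valid.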
Methodology extension. The main assumption of the BAM is that there must exist a
consistent pattern that is representative of considered LC-MS runs. One concern with the
single-group assumption is due to possible outliers, where some LC-MS runs may exhibit a
significantly distinct pattern from the rest of the runs. In such a case, irrelevant measurements
from the outlying runs would disturb the estimation of the prototype and mapping functions.
A possible extension to address this issue is to introduce a mixture distribution of inlier
and outlier, which defines the attribute of each observed chromatogram and is updated
through assessing the consistency between the prototype function and the chromatogram.
For an MCMC update, if an observed chromatogram is identified as an outlier, it would
be excluded from the current estimation of model parameters. This extension is expected
to improve estimation of model parameters in the BAM. Moreover, based on the posterior
probability of the attribute, the outlying runs can also be detected in a principled way. When
samples arise from different biological subgroups, the model needs to be further extended to
account for the heterogeneity across these subgroups. In addition to incorporating grouping
information, e.g., by using a Dirichlet process mixture [91], a module that can distinguish
common chromatographic profiles from those unique to specific groups needs to be developed.
This would help identify and prioritize representative profiles for the alignment process. We
believe that simultaneous alignment of samples from multiple groups will ensure coherence in
the preprocessing step and data comparability, which may facilitate downstream analyses,
such as difference detection. More interestingly, this may reveal heterogeneity and novel
patterns within a pre-defined biological group.
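The inlier/outlier attribute discussed above could be updated along the following lines. This is a minimal sketch under assumptions that are not part of the dissertation: residuals of an inlying run are taken to be Gaussian with a small standard deviation, those of an outlying run Gaussian with a much larger one, and the prior outlier probability is fixed. All names and numerical values are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Elementwise log density of a normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2

def outlier_probability(chromatogram, fitted, inlier_sd, outlier_sd, prior_outlier=0.1):
    """Posterior probability that a run is an outlier, given the current fit."""
    resid = chromatogram - fitted
    log_in = np.log1p(-prior_outlier) + np.sum(log_normal_pdf(resid, 0.0, inlier_sd))
    log_out = np.log(prior_outlier) + np.sum(log_normal_pdf(resid, 0.0, outlier_sd))
    # Normalize in log space to avoid overflow for long chromatograms.
    return float(np.exp(log_out - np.logaddexp(log_in, log_out)))

t = np.linspace(10.0, 50.0, 201)
fitted = np.exp(-0.5 * (t - 20.0) ** 2)              # current prototype-based fit
consistent_run = fitted + 0.01 * np.sin(t)           # small deviations from the fit
distinct_run = np.exp(-0.5 * (t - 30.0) ** 2)        # peak in a different place

p_consistent = outlier_probability(consistent_run, fitted, inlier_sd=0.02, outlier_sd=0.5)
p_distinct = outlier_probability(distinct_run, fitted, inlier_sd=0.02, outlier_sd=0.5)
```

Within an MCMC update, the attribute of each run would be drawn as a Bernoulli variable with this probability, and runs currently flagged as outliers would be left out of the update of the prototype and mapping functions. Averaging the draws over iterations gives the posterior outlier probability of each run.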
Integrated model for LC-MS data preprocessing. Appropriate utilization of rich information embedded in the LC-MS data is crucial in the data preprocessing, as demonstrated
in Chapter 3. A natural extension of the current work is to develop an integrated model
that handles peak detection and retention time alignment simultaneously. Current preprocessing pipelines perform the two steps sequentially and peak detection is often carried out
for each LC-MS run separately, without leveraging information from multiple runs. The two
preprocessing steps are closely related and the benefit of combining them is twofold. On
the one hand, performing peak detection with information from other replicate runs could
potentially reduce associated uncertainty and lead to more reliable results [87], if retention
time alignment across runs is appropriately handled. On the other hand, identification of
consistent patterns across LC-MS runs is key to the success of retention time alignment,
and the peak detection step may reveal good candidates of such patterns. Characterization of
the chromatographic profile is involved in both steps, and the BAM offers a natural framework
to construct the integrated model. In addition to improving the two preprocessing steps,
the integrated model will lead to a coherent peak matching process. Furthermore, this may
allow adequate normalization of peak intensity based on hydrophobicity, chemical class and
other relevant properties.
Potential applications. Although this dissertation is focused on retention time alignment
of LC-MS data, some of the developed methodology can be used beyond LC-MS data analysis. There is a broad interest in functional data analysis [107], and alignment of important
features of curves (called curve registration [106]) is crucial for appropriate interpretation and
analysis of functional data, which arise in many different areas including economics, chemistry and biology. In particular, curve registration has been applied to analysis of biological
data acquired by a variety of technologies including microarray [128], two-dimensional gel
electrophoresis [45], and electrocardiography [131]. In terms of complexity and size of data,
we do not see major issues in applying our developed alignment model to these studies. Recent advances in immunoprecipitation, affinity purification-mass spectrometry (AP-MS), and
other technologies have enabled large-scale characterization of protein-protein, protein-DNA,
and other molecular interactions [41]. Analysis of interaction networks provides important
insights for systems biology research [86]. At the same time, it has raised a number of interesting computational questions. By using appropriate computational methods to analyze the
interaction networks, it is possible to identify key pathways and/or complexes that regulate
specific biological processes. In particular, comparative analysis of interaction networks, e.g.,
alignment and comparison of interaction networks across species, may shed light on basic
cellular processes and phenotypic evolution. A fundamental challenge lies in the alignment
of interaction networks [11,117]. With additional efforts to translate and further characterize
the network alignment problem, some ideas and methodology developed in this dissertation
may be useful in this field, e.g., the considered MCMC samplers and stochastic search
strategy.
5.3 Conclusion
Appropriate LC-MS data preprocessing steps are needed to detect true differences between
biological groups in LC-MS-based omic studies. Retention time alignment is one of the
most important yet challenging preprocessing steps. In this dissertation, we investigate the
alignment problem from methodology development to practical application through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of
multi-profile Bayesian alignment model, and 3) application to biomarker discovery research.
The proposed Bayesian alignment model has been evaluated and compared with competing models, based on LC-MS data sets from proteomic, metabolomic and glycomic studies.
Experimental results show improved performance by the proposed model and demonstrate
its applicability in LC-MS-based omic studies, where the model greatly reduces the experimental variability in retention time measurement and facilitates the peak matching process
through appropriate integration of complementary information. Finally, several related tasks
are proposed for future work.
Bibliography
[1] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature,
422(6928):198–207, 2003.
[2] C.H. Ahrens, E. Brunner, E. Qeli, K. Basler, and R. Aebersold. Generating and
navigating proteome maps using mass spectrometry. Nature Reviews Molecular Cell
Biology, 11(11):789–801, 2010.
[3] U. Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits.
Chapman and Hall/CRC, 2006.
[4] A.H. America, J.H. Cordewener, M.H. van Geffen, A. Lommen, J.P. Vissers, R.J. Bino,
and R.D. Hall. Alignment and statistical difference analysis of complex peptide data
sets generated by multidimensional LC-MS. Proteomics, 6(2):641–653, 2006.
[5] H.J. An, S.R. Kronewitter, M.L. de Leoz, and C.B. Lebrilla. Glycomics and disease
markers. Current Opinion in Chemical Biology, 13(5-6):601–607, 2009.
[6] T.M. Annesley. Ion suppression in mass spectrometry. Clinical Chemistry, 49(7):1041–
1044, 2003.
[7] A. Arzumanyan, H.M. Reis, and M.A. Feitelson. Pathogenic mechanisms in HBV- and
HCV-associated hepatocellular carcinoma. Nature Reviews Cancer, 13:123–135, 2013.
[8] M. Bellew, M. Coram, M. Fitzgibbon, M. Igra, T. Randolph, P. Wang, D. May, J. Eng,
R. Fang, C. Lin, J. Chen, D. Goodlett, J. Whiteaker, A. Paulovich, and M. McIntosh.
A suite of algorithms for the comprehensive analysis of complex protein mixtures using
high-resolution LC-MS. Bioinformatics, 22(15):1902–1909, 2006.
[9] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60:503–516, 1954.
[10] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B,
57(1):289–300, 1995.
[11] J. Berg and M. Lassig. Cross-species analysis of biological networks by Bayesian alignment. Proceedings of the National Academy of Sciences of the United States of America,
103(29):10967–10972, 2006.
[12] E.S. Bialecki and A.M. Di Bisceglie. Diagnosis of hepatocellular carcinoma. HPB : The
Official Journal of the International Hepato Pancreato Biliary Association, 7(1):26–34,
2005.
[13] B. Blomme, C. Van Steenkiste, N. Callewaert, and H. Van Vlierberghe. Alteration of
protein glycosylation in liver diseases. Journal of Hepatology, 50(3):592–603, 2009.
[14] S. Brooks, A. Gelman, G.L. Jones, and X.-L. Meng. Handbook of Markov Chain Monte
Carlo. Chapman & Hall/CRC, 2011.
[15] D. Bylund, R. Danielsson, G. Malmquist, and K.E. Markides. Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC
modelling of liquid chromatography-mass spectrometry data. Journal of Chromatography A, 961(2):237–244, 2002.
[16] S.J. Callister, R.C. Barry, J.N. Adkins, E.T. Johnson, W.J. Qian, B.J. Webb-Robertson, R.D. Smith, and M.S. Lipton. Normalization approaches for removing
systematic biases associated with mass spectrometry and label-free proteomics. Journal of Proteome Research, 5(2):277–286, 2006.
[17] C. Christin, H.C. Hoefsloot, A.K. Smilde, F. Suits, R. Bischoff, and P.L. Horvatovich.
Time alignment algorithms based on selected mass traces for complex LC-MS data.
Journal of Proteome Research, 9(3):1483–1495, 2010.
[18] K.R. Coombes, S. Tsavachidis, J.S. Morris, K.A. Baggerly, M.C. Hung, and H.M.
Kuerer. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with
the undecimated discrete wavelet transform. Proteomics, 5(16):4107–4117, 2005.
[19] V. Dancik, T.A. Addona, K.R. Clauser, J.E. Vath, and P.A. Pevzner. De novo peptide
sequencing via tandem mass spectrometry. Journal of Computational Biology, 6(3-4):327–342, 1999.
[20] C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York, 1978.
[21] E.N. Debruyne, D. Vanderschaeghe, H. Van Vlierberghe, A. Vanhecke, N. Callewaert,
and J.R. Delanghe. Diagnostic value of the hemopexin N-glycan profile in hepatocellular carcinoma patients. Clinical Chemistry, 56(5):823–831, 2010.
[22] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–
38, 1977.
[23] J.L. Desantos-Garcia, S.I. Khalil, A. Hussein, Y. Hu, and Y. Mechref. Enhanced
sensitivity of LC-MS analysis of permethylated N-glycans through online purification.
Electrophoresis, 32(24):3516–3525, 2011.
[24] E.P. Diamandis. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Molecular and Cellular Proteomics,
3(4):367–378, 2004.
[25] B. Domon and R. Aebersold. Mass spectrometry and protein analysis. Science,
312(5771):212–217, 2006.
[26] P. Du, W.A. Kibbe, and S.M. Lin. Improved peak detection in mass spectrum by
incorporating continuous wavelet transform-based pattern matching. Bioinformatics,
22(17):2059–2065, 2006.
[27] S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics
Letters B, 195(2):216–222, 1987.
[28] W.B. Dunn, D. Broadhurst, P. Begley, E. Zelena, S. Francis-McIntyre, N. Anderson,
M. Brown, J.D. Knowles, A. Halsall, J.N. Haselden, A.W. Nicholls, I.D. Wilson, D.B.
Kell, R. Goodacre, and Human Serum Metabolome (HUSERMET) Consortium. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nature Protocols,
6(7):1060–1083, 2011.
[29] P.H. Eilers. Parametric time warping. Analytical Chemistry, 76(2):404–411, 2004.
[30] H.B. El-Serag. Hepatocellular carcinoma. New England Journal of Medicine,
365(12):1118–1127, 2011.
[31] J.E. Elias, W. Haas, B.K. Faherty, and S.P. Gygi. Comparative evaluation of mass
spectrometry platforms used in large-scale proteomics investigations. Nature Methods,
2(9):667–675, 2005.
[32] J.K. Eng, B.C. Searle, K.R. Clauser, and D.L. Tabb. A face in the crowd: recognizing peptides through database search. Molecular and Cellular Proteomics,
10(11):R111.009522, 2011.
[33] J. Ferlay, H.R. Shin, F. Bray, D. Forman, C. Mathers, and D.M. Parkin. Estimates
of worldwide burden of cancer in 2008: GLOBOCAN 2008. International Journal of
Cancer, 127(12):2893–2917, 2010.
[34] B. Fischer, J. Grossmann, V. Roth, W. Gruissem, S. Baginsky, and J.M. Buhmann. Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics,
22(14):e132–e140, 2006.
[35] A.M. Frank, M.M. Savitski, M.L. Nielsen, R.A. Zubarev, and P.A. Pevzner. De novo
peptide sequencing and identification with precision mass spectrometry. Journal of
Proteome Research, 6(1):114–123, 2007.
[36] B.J. Frey and D. Dueck. Clustering by passing messages between data points. Science,
315(5814):972–976, 2007.
[37] M.M. Fuster and J.D. Esko. The sweet and sour of cancer: glycans as novel therapeutic
targets. Nature Reviews Cancer, 5:526–542, 2005.
[38] A.E. Gelfand and A.F. Smith. Sampling-based approaches to calculating marginal
densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[39] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[40] E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. Journal of
the American Statistical Association, 88(423):881–889, 1993.
[41] A.C. Gingras, M. Gstaiger, B. Raught, and R. Aebersold. Analysis of protein complexes
using mass spectrometry. Nature Reviews Molecular Cell Biology, 8(8):645–654, 2007.
[42] R. Goldman, H.W. Ressom, R.S. Varghese, L. Goldman, G. Bascug, C.A. Loffredo,
M. Abdel-Hamid, I. Gouda, S. Ezzat, Z. Kyselova, Y. Mechref, and M.V. Novotny. Detection of hepatocellular carcinoma using glycomic analysis. Clinical Cancer Research,
15(5):1808–1813, 2009.
[43] C. Gonzaga-Jauregui, J.R. Lupski, and R.A. Gibbs. Human genome sequencing in
health and disease. Annual Review of Medicine, 63:35–61, 2012.
[44] P.J. Green. Reversible-jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika, 82(4):711–732, 1995.
[45] P.J. Green and K.V. Mardia. Bayesian alignment using hierarchical models, with
applications in protein bioinformatics. Biometrika, 93(2):235–254, 2006.
[46] D. Greenbaum, C. Colangelo, K. Williams, and M. Gerstein. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology, 4(9):117, 2003.
[47] M. Gstaiger and R. Aebersold. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nature Reviews Genetics, 10(9):617–627, 2009.
[48] S. Gupta, S. Bent, and J. Kohlwes. Test characteristics of alpha-fetoprotein for detecting hepatocellular carcinoma in patients with hepatitis C. A systematic review and
critical analysis. Annals of Internal Medicine, 139(1):46–50, 2003.
[49] K. Hashimoto, S. Goto, S. Kawano, K.F. Aoki-Kinoshita, N. Ueda, M. Hamajima,
T. Kawasaki, and M. Kanehisa. KEGG as a glycome informatics resource. Glycobiology,
16(5):63R–70R, 2006.
[50] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[51] A.M. Hawkridge and D.C. Muddiman. Mass spectrometry-based biomarker discovery: toward a global proteome index of individuality. Annual Review of Analytical
Chemistry, 2:265–277, 2009.
[52] M. Hedlund, E. Ng, A. Varki, and N.M. Varki. alpha 2-6-linked sialic acids on N-glycans modulate carcinoma differentiation in vivo. Cancer Research, 68(2):388–394,
2008.
[53] M.D. Hoffman and A. Gelman. The No-U-Turn Sampler: adaptively setting path
lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, in press.
[54] L. Hood, J.R. Heath, M.E. Phelps, and B. Lin. Systems biology and new technologies
enable predictive and preventative medicine. Science, 306(5696):640–643, 2004.
[55] Y. Hu and Y. Mechref. Comparing MALDI-MS, RP-LC-MALDI-MS and RP-LC-ESI-MS glycomic profiles of permethylated N-glycans derived from model glycoproteins
and human blood serum. Electrophoresis, 33(12):1768–1777, 2012.
[56] T. Ideker, T. Galitski, and L. Hood. A new approach to decoding life: systems biology.
Annual Review of Genomics and Human Genetics, 2:343–372, 2001.
[57] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner,
D.R. Goodlett, R. Aebersold, and L. Hood. Integrated genomic and proteomic analyses
of a systematically perturbed metabolic network. Science, 292(5518):929–934, 2001.
[58] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, 2004.
[59] J.D. Jaffe, D.R. Mani, K.C. Leptos, G.M. Church, M.A. Gillette, and S.A. Carr.
PEPPeR, a platform for experimental proteomic pattern recognition. Molecular and
Cellular Proteomics, 5(10):1927–1941, 2006.
[60] N. Jaitly, A. Mayampurath, K. Littlefield, J.N. Adkins, G.A. Anderson, and R.D.
Smith. Decon2LS: an open-source software package for automated processing and
visualization of high resolution mass spectrometry data. BMC Bioinformatics, 10:87,
2009.
[61] N. Jaitly, M.E. Monroe, V.A. Petyuk, T.R. Clauss, J.N. Adkins, and R.D. Smith.
Robust algorithm for alignment of liquid chromatography-mass spectrometry analyses in an accurate mass and time tag data analysis pipeline. Analytical Chemistry,
78(21):7397–7409, 2006.
[62] O.N. Jensen. Modification-specific proteomics: characterization of post-translational
modifications by mass spectrometry. Current Opinion in Chemical Biology, 8(1):33–41,
2004.
[63] A.R. Joyce and B.Ø. Palsson. The model organism as a system: integrating ‘omics’
data sets. Nature Reviews Molecular Cell Biology, 7(3):198–210, 2006.
[64] D.L.B. Jupp. Approximation to data by splines with free knots. SIAM Journal on
Numerical Analysis, 15(2):328–343, 1978.
[65] T. Kamiyama, H. Yokoo, J. Furukawa, M. Kurogochi, T. Togashi, N. Miura, K. Nakanishi, H. Kamachi, T. Kakisaka, Y. Tsuruga, M. Fujiyoshi, A. Taketomi, S. Nishimura,
and S. Todo. Identification of novel serum biomarkers of hepatocellular carcinoma
using glycomic analysis. Hepatology, 57(6):2314–2325, 2013.
[66] Y. Karpievitch, J. Stanley, T. Taverner, J. Huang, J.N. Adkins, C. Ansong, F. Heffron,
T.O. Metz, W.J. Qian, H. Yoon, R.D. Smith, and A.R. Dabney. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics,
25(16):2028–2034, 2009.
[67] Y.V. Karpievitch, A.D. Polpitiya, G.A. Anderson, R.D. Smith, and A.R. Dabney. Liquid chromatography mass spectrometry-based proteomics: Biological and technological
aspects. The Annals of Applied Statistics, 4(4):1797–1823, 2010.
[68] K. Kultima, A. Nilsson, B. Scholz, U.L. Rossbach, M. Falth, and P.E. Andren. Development and evaluation of normalization methods for label-free relative quantification
of endogenous peptides. Molecular and Cellular Proteomics, 8(10):2285–2295, 2009.
[69] E.S. Lander. Initial impact of the sequencing of the human genome. Nature,
470(7333):187–197, 2011.
[70] E. Lange, C. Gropl, O. Schulz-Trieglaff, A. Leinenbach, C. Huber, and K. Reinert.
A geometric approach for the alignment of liquid chromatography-mass spectrometry
data. Bioinformatics, 23(13):i273–i281, 2007.
[71] E. Lange, R. Tautenhahn, S. Neumann, and C. Gropl. Critical assessment of alignment
procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics, 9:375, 2008.
[72] X.J. Li, E.C. Yi, C.J. Kemp, H. Zhang, and R. Aebersold. A software suite for the
generation and comparison of peptide arrays from sets of data collected by liquid
chromatography-mass spectrometry. Molecular and Cellular Proteomics, 4(9):1328–
1340, 2005.
[73] J. Listgarten and A. Emili. Statistical and computational methods for comparative
proteomic profiling using liquid chromatography-tandem mass spectrometry. Molecular
and Cellular Proteomics, 4(4):419–434, 2005.
[74] J. Listgarten, R.M. Neal, S.T. Roweis, and A. Emili. Multiple alignment of continuous
time series. In Advances in Neural Information Processing Systems, pages 817–824.
MIT Press, 2005.
[75] J. Listgarten, R.M. Neal, S.T. Roweis, R. Puckrin, and S. Cutler. Bayesian detection
of infrequent differences in sets of time series with shared structure. In Advances in
Neural Information Processing Systems, pages 905–912. MIT Press, 2007.
[76] J. Listgarten, R.M. Neal, S.T. Roweis, P. Wong, and A. Emili. Difference detection in
LC-MS data for protein biomarker discovery. Bioinformatics, 23(2):e198–e204, 2007.
[77] X.E. Liu, L. Desmyter, C.F. Gao, W. Laroy, S. Dewaele, V. Vanhooren, L. Wang,
H. Zhuang, N. Callewaert, C. Libert, R. Contreras, and C. Chen. N-glycomic changes
in hepatocellular carcinoma patients with liver cirrhosis induced by hepatitis B virus.
Hepatology, 46(5):1426–1435, 2007.
[78] P. Lu, C. Vogel, R. Wang, X. Yao, and E.M. Marcotte. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational
regulation. Nature Biotechnology, 25(1):117–124, 2007.
[79] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie.
PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 17(20):2337–2342, 2003.
[80] R. Madsen, T. Lundstedt, and J. Trygg. Chemometrics in metabolomics–a review in
human disease diagnosis. Analytica Chimica Acta, 659(1-2):23–33, 2010.
[81] M. Mann and N.L. Kelleher. Precision proteomics: the case for high resolution and
high mass accuracy. Proceedings of the National Academy of Sciences of the United
States of America, 105(47):18132–18138, 2008.
[82] Y. Mechref, Y. Hu, A. Garcia, and A. Hussein. Identifying cancer biomarkers by mass
spectrometry-based glycomics. Electrophoresis, 33(12):1755–1767, 2012.
[83] A. Mehta, P. Norton, H. Liang, M.A. Comunale, M. Wang, L. Rodemich-Betesh,
A. Koszycki, K. Noda, E. Miyoshi, and T. Block. Increased levels of tetra-antennary
N-linked glycan but not core fucosylation are associated with hepatocellular carcinoma
tissue. Cancer Epidemiology, Biomarkers & Prevention, 21(6):925–933, 2012.
[84] Members of the Toxicogenomics Research Consortium. Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods, 2(5):351–
356, 2005.
[85] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics,
21(6):1087–1092, 1953.
[86] K. Mitra, A.R. Carvunis, S.K. Ramesh, and T. Ideker. Integrative approaches for
finding modular structure in biological networks. Nature Reviews Genetics, 14(10):719–
732, 2013.
[87] J.S. Morris, K.R. Coombes, J. Koomen, K.A. Baggerly, and R. Kobayashi. Feature
extraction and quantification for mass spectrometry in biomedical applications using
the mean spectrum. Bioinformatics, 21(9):1764–1775, 2005.
[88] L.N. Mueller, O. Rinner, A. Schmidt, S. Letarte, B. Bodenmiller, M.Y. Brusniak,
O. Vitek, R. Aebersold, and M. Muller. Superhirn - a novel tool for high resolution
LC-MS-based peptide/protein profiling. Proteomics, 7(19):3470–3480, 2007.
[89] I. Murray. Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby
Computational Neuroscience Unit, University College London, 2007.
[90] R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical
Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[91] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[92] R.M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G.L.
Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–
162. Chapman & Hall/CRC, 2011.
[93] A.I. Nesvizhskii. Protein identification by tandem mass spectrometry and sequence
database searching. Methods in Molecular Biology, 367:87–119, 2007.
[94] N.V. Nielsen, J.M. Carstensen, and J. Smedsgaard. Aligning of single and multiple
wavelength chromatographic profiles for chemometric data analysis using correlation
optimised warping. Journal of Chromatography A, 805(1-2):17–35, 1998.
[95] A.L. Oberg and O. Vitek. Statistical design of quantitative mass spectrometry-based
proteomic experiments. Journal of Proteome Research, 8(5):2144–2156, 2009.
[96] G.J. Patti, O. Yanes, and G. Siuzdak. Innovation: Metabolomics: the apogee of the
omics trilogy. Nature Reviews Molecular Cell Biology, 13(4):263–269, 2012.
[97] P. Picotti and R. Aebersold. Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nature Methods, 9(6):555–566, 2012.
[98] P. Picotti, O. Rinner, R. Stallmach, F. Dautel, T. Farrah, B. Domon, H. Wenschuh,
and R. Aebersold. High-throughput generation of selected reaction-monitoring assays
for proteins and proteomes. Nature Methods, 7(1):43–46, 2010.
[99] S.S. Pinho, C.A. Reis, J. Paredes, A.M. Magalhaes, A.C. Ferreira, J. Figueiredo,
W. Xiaogang, F. Carneiro, F. Gartner, and R. Seruca. The role of N-acetylglucosaminyltransferase III and V in the post-transcriptional modifications of
E-cadherin. Human Molecular Genetics, 18(14):2599–2608, 2009.
[100] T. Pluskal, S. Castillo, A. Villar-Briones, and M. Oresic. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular
profile data. BMC Bioinformatics, 11:395, 2010.
[101] K. Podwojski, A. Fritsch, D.C. Chamrad, W. Paul, B. Sitek, K. Stuhler, P. Mutzel,
C. Stephan, H.E. Meyer, W. Urfer, K. Ickstadt, and J. Rahnenfuhrer. Retention time
alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics,
25(6):758–764, 2009.
[102] D. Pousset, V. Piller, N. Bureaud, M. Monsigny, and F. Piller. Increased alpha2,6 sialylation of N-glycans in a transgenic mouse model of hepatocellular carcinoma. Cancer
Research, 57(19):4249–4256, 1997.
[103] A. Prakash, P. Mallick, J. Whiteaker, H. Zhang, A. Paulovich, M. Flory, H. Lee, R. Aebersold, and B. Schwikowski. Signal maps for mass spectrometry-based comparative
proteomics. Molecular and Cellular Proteomics, 5(3):423–432, 2006.
[104] J.T. Prince and E.M. Marcotte. Chromatographic alignment of ESI-LC-MS proteomics
data sets by ordered bijective interpolated warping. Analytical Chemistry, 78(17):6140–
6152, 2006.
[105] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[106] J.O. Ramsay and X. Li. Curve registration. Journal of the Royal Statistical Society,
Series B, 60(2):351–363, 1998.
[107] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer, second edition, 2005.
[108] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The
MIT Press, 2006.
[109] H.W. Ressom, R.S. Varghese, L. Goldman, Y. An, C.A. Loffredo, M. Abdel-Hamid,
Z. Kyselova, Y. Mechref, M. Novotny, S.K. Drake, and R. Goldman. Analysis of
MALDI-TOF mass spectrometry data for discovery of peptide and glycan biomarkers
of hepatocellular carcinoma. Journal of Proteome Research, 7(2):603–610, 2008.
[110] G.O. Roberts and S.K. Sahu. Updating schemes, correlation structure, blocking and
parameterisation for the Gibbs sampler. Journal of the Royal Statistical Society, Series
B, 59(2):291–317, 1997.
[111] J.S. Rosenthal. Optimal proposal distributions and adaptive MCMC. In S. Brooks,
A. Gelman, G.L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte
Carlo, pages 93–111. Chapman & Hall/CRC, 2011.
[112] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal
of the Royal Statistical Society, Series B, 71(2):319–392, 2009.
[113] L.R. Ruhaak, S. Miyamoto, and C.B. Lebrilla. Developments in the identification of
glycan biomarkers for the detection of cancer. Molecular and Cellular Proteomics,
12(4):846–855, 2013.
[114] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken
word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing,
26(1):43–49, 1978.
[115] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical
clustering/segmentation algorithms. In Tools with Artificial Intelligence, 2004. ICTAI
2004. 16th IEEE International Conference on, pages 576–584, Nov 2004.
[116] A. Savitzky and M.J.E. Golay. Smoothing and differentiation of data by simplified
least squares procedures. Analytical Chemistry, 36(8):1627–1639, 1964.
[117] R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M.
Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species.
Proceedings of the National Academy of Sciences of the United States of America,
102(6):1974–1979, 2005.
[118] R. Siegel, J. Ma, Z. Zou, and A. Jemal. Cancer statistics, 2014. CA: A Cancer Journal
for Clinicians, 64(1):9–29, 2014.
[119] C.A. Smith, E.J. Want, G. O’Maille, R. Abagyan, and G. Siuzdak. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment,
matching, and identification. Analytical Chemistry, 78(3):779–787, 2006.
[120] H. Steen and M. Mann. The ABC’s (and XYZ’s) of peptide sequencing. Nature Reviews
Molecular Cell Biology, 5(9):699–711, 2004.
[121] M. Sturm, A. Bertsch, C. Gropl, A. Hildebrandt, R. Hussong, E. Lange, N. Pfeifer,
O. Schulz-Trieglaff, A. Zerck, K. Reinert, and O. Kohlbacher. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics, 9:163, 2008.
[122] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette,
A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, and J.P. Mesirov. Gene set
enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States
of America, 102(43):15545–15550, 2005.
[123] M. Sysi-Aho, M. Katajamaa, L. Yetukuri, and M. Oresic. Normalization method
for metabolomics data using optimal selection of multiple internal standards. BMC
Bioinformatics, 8:93, 2007.
[124] K. Tanabe, A. Deguchi, M. Higashi, H. Usuki, Y. Suzuki, Y. Uchimura, S. Kuriyama,
and K. Ikenaka. Outer arm fucosylation of N-glycans increases in sera of hepatocellular carcinoma patients. Biochemical and Biophysical Research Communications,
374(2):219–225, 2008.
[125] Z. Tang, R.S. Varghese, S. Bekesova, C.A. Loffredo, M.A. Hamid, Z. Kyselova,
Y. Mechref, M. Novotny, R. Goldman, and H.W. Ressom. Identification of N-glycan
serum markers associated with hepatocellular carcinoma from mass spectrometry data.
Journal of Proteome Research, 9(1):104–112, 2010.
[126] R. Tautenhahn, C. Bottcher, and S. Neumann. Annotation of LC/ESI-MS mass signals.
In Sepp Hochreiter and Roland Wagner, editors, Bioinformatics Research and Development, volume 4414 of Lecture Notes in Computer Science, pages 371–380. Springer
Berlin Heidelberg, 2007.
[127] D. Telesca and L.Y. Inoue. Bayesian hierarchical curve registration. Journal of the
American Statistical Association, 103(481):328–339, 2008.
[128] D. Telesca, L.Y. Inoue, M. Neira, R. Etzioni, M. Gleave, and C. Nelson. Differential expression and network inferences through functional data modeling. Biometrics,
65(3):793–804, 2009.
[129] G. Tomasi, F. van den Berg, and C. Andersson. Correlation optimized warping and
dynamic time warping as preprocessing methods for chromatographic data. Journal
of Chemometrics, 18(5):231–241, 2004.
[130] F. Trevisani, P.E. D’Intino, A.M. Morselli-Labate, G. Mazzella, E. Accogli, P. Caraceni,
M. Domenicali, S. De Notariis, E. Roda, and M. Bernardi. Serum alpha-fetoprotein for
diagnosis of hepatocellular carcinoma in patients with chronic liver disease: influence
of HBsAg and anti-HCV status. Journal of Hepatology, 34(4):570–575, 2001.
[131] T. Trigano, U. Isserles, and Y. Ritov. Semiparametric curve alignment and shift density
estimation for biological data. IEEE Transactions on Signal Processing, 59(5):1970–
1984, 2011.
[132] T.-H. Tsai, M.G. Tadesse, Y. Wang, and H.W. Ressom. Profile-based LC-MS data
alignment — a Bayesian approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2):494–503, 2013.
[133] L. Tuli, T.-H. Tsai, R.S. Varghese, J.F. Xiao, A.K. Cheema, and H.W. Ressom. Using
a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science, 10:13,
2012.
[134] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422(6928):193–197,
2003.
[135] M. Vandenbogaert, S. Li-Thiao-Te, H.M. Kaltenbach, R. Zhang, T. Aittokallio, and
B. Schwikowski. Alignment of LC-MS images, with applications to biomarker discovery
and protein identification. Proteomics, 8(4):650–672, 2008.
[136] B. Voss, M. Hanselmann, B.Y. Renard, M.S. Lindner, U. Kothe, M. Kirchner, and
F.A. Hamprecht. SIMA: simultaneous multiple alignment of LC/MS peak lists. Bioinformatics, 27(7):987–993, 2011.
[137] P. Wang, H. Tang, M.P. Fitzgibbon, M. McIntosh, M. Coram, H. Zhang, E. Yi, and
R. Aebersold. A statistical method for chromatographic alignment of LC-MS data.
Biostatistics, 8(2):357–367, 2007.
[138] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
[139] W. Windig, J.M. Phalp, and A.W. Payne. A noise and background reduction method
for component detection in liquid chromatography/mass spectrometry. Analytical
Chemistry, 68(20):3602–3606, 1996.
[140] J.F. Xiao, B. Zhou, and H.W. Ressom. Metabolite identification and quantitation in
LC-MS/MS-based metabolomics. Trends in Analytical Chemistry, 32:1–14, 2012.
[141] M. Yao, D.P. Zhou, S.M. Jiang, Q.H. Wang, X.D. Zhou, Z.Y. Tang, and J.X. Gu.
Elevated activity of N-acetylglucosaminyltransferase V in human hepatocellular carcinoma. Journal of Cancer Research and Clinical Oncology, 124(1):27–30, 1998.
[142] T. Yu, Y. Park, J.M. Johnson, and D.P. Jones. apLCMS–adaptive processing of high-resolution LC/MS data. Bioinformatics, 25(15):1930–1936, 2009.
[143] J. Zaia. Mass spectrometry and glycomics. OMICS, 14(4):401–418, 2010.
[144] P. Zhang, H. Li, H. Wang, S.T. Wong, and X. Zhou. Peak tree: a new tool for multiscale
hierarchical representation and peak detection of mass spectrometry data. IEEE/ACM
Transactions on Computational Biology and Bioinformatics, 8(4):1054–1066, 2011.
[145] X. Zhang, J.M. Asara, J. Adamec, M. Ouzzani, and A.K. Elmagarmid. Data preprocessing in liquid chromatography-mass spectrometry-based proteomics. Bioinformatics, 21(21):4054–4059, 2005.
[146] Y. Zhao, Y. Li, H. Ma, W. Dong, H. Zhou, X. Song, J. Zhang, and L. Jia. Modification of sialylation mediates the invasive properties and chemosensitivity of human
hepatocellular carcinoma. Molecular and Cellular Proteomics, 13(2):520–536, 2014.
[147] Y. Zhao, Y. Sato, T. Isaji, T. Fukuda, A. Matsumoto, E. Miyoshi, J. Gu, and
N. Taniguchi. Branched N-glycans regulate the biological functions of integrins and
cadherins. FEBS Journal, 275(9):1939–1948, 2008.
[148] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67(2):301–320, 2005.
Appendix A
Proteomic ground-truth data
The ground-truth data were generated from the Mascot search result. A list of MS/MS
spectra with identification score > 60 that were present in at least 10 of the 20 LC-MS/MS
runs was compiled. Each peptide sequence was assigned to a peak detected by DifProWare based
on its mass and retention time, which resulted in a list of consensus peaks (peaks sharing
the same identity across runs). Putative matching was also performed for the runs that
lacked a qualifying identification of the sequence. The list was further refined by visual
inspection of the extracted ion chromatogram of each consensus peak, and erroneous
assignments were removed. Table A.1 presents the resulting ground-truth data, which consist
of 273 unique peptide sequences from 70 unique proteins.
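The first filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the (peptide, run, score) tuple layout and the function name consensus_peptides are hypothetical and are not part of Mascot or DifProWare, and "present in at least 10 out of 20 runs" is read here as "identified above the score threshold in at least 10 distinct runs".

```python
from collections import defaultdict


def consensus_peptides(identifications, min_score=60.0, min_runs=10):
    """Return peptides identified with score > min_score in >= min_runs distinct runs.

    identifications: iterable of (peptide_sequence, run_id, score) tuples.
    This record layout is an assumption made for illustration.
    """
    runs_by_peptide = defaultdict(set)
    for peptide, run_id, score in identifications:
        # Keep only identifications that exceed the score threshold.
        if score > min_score:
            runs_by_peptide[peptide].add(run_id)
    # Count distinct runs, so repeated spectra within one run count once.
    return sorted(
        peptide
        for peptide, runs in runs_by_peptide.items()
        if len(runs) >= min_runs
    )
```

The later steps (peak assignment by mass and retention time, putative matching, and visual inspection) depend on the peak lists and on manual review, so they are not sketched here.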
Table A.1: Ground-truth peaks in the proteomic data set.
Protein
Peptide sequence  Mascot score  Mass (Da)  Time (min.)
transferrin [Homo sapiens]
SASDLTWDNLK  77.7926  1248.6029  37.1555
HSTIFENLANK  70.2421  1272.6483  32.4386
EGYYGYTGAFR  69.3411  1282.5659  33.6282
SVIPSDGPSVACVK  65.636  1357.6949  32.0186
MYLGYEYVTAIR  98.3025  1477.7339  48.2243
SKEFQLFSSPHGK  73.9056  1490.753  31.2152
FDEFFSEGCAPGSK  91.877  1519.6346  39.8019
TAGWNIPMGLLYNK  66.8031  1576.818  58.3531
IECVSAETTEDCIAK  115.0145  1610.7215  31.461
NLNEKDYELLCLDGTR  87.7071  1894.917  43.9851
SAGWNIPIGLLYCDLPEPR  97.2836  2113.0801  70.5105
AIAANEADAVTLDAGLVYDAYLAPNNLKPVVAEFYGSK  160.9273  3953.0353  70.4842
IMNGEADAMSLDGGFVYIAGK  104.5931  2158.0202  53.0688
SMGGKEDLIWELLNQAQEHFGKDK  77.5  2772.3731  65.2475
DSGFQMNQLR  66.3677  1194.5473  30.9991
complement component 4A preproprotein [Homo sapiens]
VGDTLNLNLR  94.4817  1113.6169  37.1683
GLEEELQFSLGSK  86.64  1435.726  48.2367
VLSLAQEQVGGSPEK  107.7717  1540.814  28.8422
TTNIQGINLLFSSR  93.603  1562.8495  53.0759
VTASDPLDTLGSEGALSPGGVASLLR  113.287  2482.3079  63.152
ceruloplasmin precursor [Homo sapiens]
EVGPTNADPVCLAK  68.039  1412.7009  30.347
QSEDSTFYLGER  71.9963  1430.6352  31.5729
ALYLQYTDETFR  81.6025  1518.7414  43.22
NNEGTYYSPNYNPQSR  70.3669  1902.8175  22.6015
MYYSAVDPTKDIFTGLIGPMK  99.4777  2346.179  59.9307
HYYIGIIETTWDYASDHGEK  66.3108  2397.1056  53.8928
KAEEEHLGILGPQLHADVGDKVK  80.3927  2482.324  33.2863
GPEEEHLGILGPVIWAEVGDTIR  69.6812  2486.2963  67.5311
GVYSSDVFDIFPGTYQTLEMFPR  86.0423  2668.2703  72.9248
ERGPEEEHLGILGPVIWAEVGDTIR  96.578  2771.4424  64.3123
ADDKVYPGEQYTYMLLATEEQSPGEGDGNCVTR  106.7565  3635.6218  51.5539
IDTINLFPATLFDAYMVAQNPGEWMLSCQNLNHLK  103.6335  4006.9624  84.3879
GAYPLSIEPIGVR  76.9083  1370.7614  45.7183
HIDREFVVMFSVVDENFSWYLEDNIK  78.0573  3230.5561  77.8872
complement factor H isoform a precursor [Homo sapiens]
SLGNVIMVCR  68.355  1090.5647  41.9534
TGESVEFVCK  72.1533  1097.5083  28.7364
DGWSAQPTCIK  71.0445  1204.5571  29.2732
KGEWVALNPLR  69.1553  1281.7192  41.5403
SPDVINGSPISQK  65.0494  1340.6963  25.6546
EIMENYNIALR  66.1253  1364.6808  38.7847
SSNLIILEEHLK  65.2694  1394.7779  43.4492
WSSPPQCEGLPCK  69.148  1430.6366  34.0066
AGEQVTYTCATYYK  89.246  1596.7165  29.3064
NTEILTGSWSDQTYPEGTQAIYK  95.8119  2601.2364  44.6598
apolipoprotein A-I preproprotein [Homo sapiens] >gi|55637005|ref|XP 508770.1| PREDICTED: similar to preproapolipoprotein AI isoform 5 [Pan troglodytes] >gi|114640451|ref|XP 001153269.1| PREDICTED: similar to preproapolipoprotein AI isoform 1 [Pan tr
DLATVYVDVLK  83.8683  1234.6863  52.0847
VSFLSALEEYTK  86.8663  1385.7174  61.7605
DYVSQFEGSALGK  109.4789  1399.6676  40.2577
VKDLATVYVDVLK  76.1879  1461.8492  46.2894
VSFLSALEEYTKK  67.3931  1513.8113  56.674
LLDNWDSVTSTFSK  100.5315  1611.7876  49.3435
DSGRDYVSQFEGSALGK  101.3459  1814.8499  37.7129
VKDLATVYVDVLKDSGR  86.4525  1877.0299  47.2567
LREQLGPVTQEFWDNLEK  77.43  2201.1273  55.3091
LREQLGPVTQEFWDNLEKETEGLR  73.0821  2886.4692  65.5815
DLATVYVDVLKDSGRDYVSQFEGSALGK  67.7955  3031.5346  64.1003
inter-alpha (globulin) inhibitor H4 [Homo sapiens]
LALDNGGLAR  64.6264  998.5513  27.5272
AEAQAQYSAAVAK  85.3582  1306.6529  18.4805
ITFELVYEELLK  99.0484  1495.8269  68.463
NMEQFQVSVSVAPNAK  79.4064  1747.8622  38.2361
NPLVWVHASPEHVVVTR  74.2842  1939.047  38.5479
QGPVNLLSDPEQGVEVTGQYER  87.5953  2414.1836  44.6506
FSSHVGGTLGQFYQEVLWGSPAASDDGRR  84.0694  3123.5033  57.7378
DQFNLIVFSTEATQWRPSLVPASAENVNK  106.5981  3260.6664  65.8521
AGFSWIEVTFK  67.8206  1283.6607  59.7794
SPEQQETVLDGNLIIR  101.1672  1810.9506  44.7905
RLDYQEGPPGVEISCWSVEL  99.9945  2276.0945  61.8115
QGPVNLLSDPEQGVEVTGQYEREK  91.8931  2671.3161  42.3906
DTDRFSSHVGGTLGQFYQEVLWGSPAASDDGRR  92.635  3610.7076  57.4664
serine proteinase inhibitor, clade A, member 1 [Homo sapiens] >gi|50363219|ref|NP 001002236.1| serine proteinase inhibitor, clade A, member 1 [Homo sapiens] >gi|50363221|ref|NP 001002235.1| serine proteinase inhibitor, clade A, member 1 [Homo sapien
SVLGQLGITK  72.9017  1014.6097  40.9374
ITPNLAEFAFSLYR  93.9995  1640.8659  66.9569
VFSNGADLSGVTEEAPLK  120.7722  1832.9214  38.8215
FNKPFVFLMIEQNTK  76.248  1854.982  58.6251
ELDRDTVFALVNYIFFK  67.648  2089.1019  82.852
VFSNGADLSGVTEEAPLKLSK  84.7684  2161.1347  41.9904
GTEAAGAMFLEAIPMSIPPEVK  80.6075  2258.148  65.1551
LVDKFLEDVKK  64.844  1332.7679  31.642
LYHSEAFTVNFGDTEEAKK  67.1459  2185.0388  36.4992
TLNQPDSQLQLTTGNGLFLSEGLK  106.2309  2573.3522  54.3709
IVDLVKELDRDTVFALVNYIFFK  84.9686  2756.5326  89.2975
KLYHSEAFTVNFGDTEEAKK  73.8193  2313.1361  33.4614
complement factor B preproprotein [Homo sapiens]
VKDISEVVTPR  74.7225  1241.6958  24.1633
VSEADSSNADWVTK  98.1875  1507.682  23.9549
LLQEGQALEYVCPSGFYPYPVQTR  75.8527  2757.3679  56.8651
NPREDYLDVYVFGVGPLVNQVNINALASK  118.9275  3203.6807  66.7554
KNPREDYLDVYVFGVGPLVNQVNINALASK  79.3271  3331.7774  63.7146
DMENLEDVFYQMIDESQSLSLCGMVWEHR  152.7023  3503.5341  89.0342
YGLVTYATYPK  71.6214  1274.6584  35.7583
serpin peptidase inhibitor, clade G, member 1 precursor [Homo sapiens] >gi|73858570|ref|NP 001027466.1| serpin peptidase inhibitor, clade G, member 1 precursor [Homo sapiens]
LLDSLPSDTR  67.2247  1115.5841  28.7276
LVLLNAIYLSAK  82.0444  1316.8148  58.3024
VTTSQDMLSIMEK  97.2353  1481.7159  43.8197
LEDMEQALSPSVFK  93.6837  1592.7833  47.9032
GVTSVSQIFHSPDLAIR  68.8577  1825.9764  49.5992
HRLEDMEQALSPSVFK  84.7535  1885.9402  42.3374
TLLVFEVQQPFLFVLWDQQHK  70.352  2614.4093  77.4467
serpin peptidase inhibitor, clade C, member 1 [Homo sapiens]
VAEGTQVLELPFK  72.1855  1429.7882  49.6683
ADGESCSASMMYQEGK  89.5863  1692.6472  24.3281
AFLEVNEEGSEAAASTAVVIAGR  117.64  2290.1548  47.7842
ITDVIPSEAINELTVLVLVNTIYFK  103.078  2803.59  91.7838
ELTPEVLQEWLDELEEMMLVVHMPR  68.6858  3065.5194  91.2884
VEKELTPEVLQEWLDELEEMMLVVHMPR  80.3672  3421.72  89.1198
VAEGTQVLELPFKGDDITMVLILPKPEK  73.4173  3079.7119  64.6367
inter-alpha globulin inhibitor H2 polypeptide [Homo sapiens]
SSALDMENFR  71.8625  1168.5211  34.176
IQPSGGTNINEALLR  103.036  1581.8533  35.4084
LWAYLTINQLLAER  85.5421  1702.9512  67.8821
VVNNSPQPQNVVFDVQIPK  86.9511  2121.1312  47.4299
SILQMSLDHHIVTPLTSLVIENEAGDER  87.946  3116.6023  60.9536
NVKENIQDNISLFSLGMGFDVDYDFLKR  89.3794  3276.6337  69.7602
FDPAKLDQIESVITATSANTQLVLETLAQMDDLQDFLSK  102.3133  4308.2102  91.668
ETAVDGELVVLYDVK  71.5467  1648.8645  51.1432
serpin peptidase inhibitor, clade A, member 3 precursor [Homo sapiens]
ITLLSALVETR  77.8327  1214.7308  57.4577
LYGSEAFATDFQDSAAAK  109.1178  1890.8708  41.5059
GTHVDLGLASANVDFAFSLYK  87.933  2224.1298  59.6488
DYNLNDILLQLGIEEAFTSK  119.9825  2295.1781  76.4287
GKITDLIKDLDSQTMMVLVNYIFFK  88.4409  2931.5746  92.8801
inter-alpha (globulin) inhibitor H1 [Homo sapiens]
QAVDTAVDGVFIR  80.1283  1389.7298  40.6757
QYYEGSEIVVAGR  82.7705  1469.7199  29.5462
LWAYLTIQELLAK  87.5894  1560.9016  72.1131
ILGDMQPGDYFDLVLFGTR  90.7527  2156.0759  72.5335
GMADQDGLKPTIDKPSEDSPPLEMLGPR  73.5464  2993.4595  43.2395
GIEILNQVQESLPELSNHASILIMLTDGDPTEGVTDR  135.0546  4004.0306  74.5964
GFSLDEATNLNGGLLR  94.6844  1675.8619  50.9258
GSLVQASEANLQAAQDFVR  118.83  2003.0164  51.5598
apolipoprotein B precursor [Homo sapiens]
SVSDGIAALDLNAVANK  100.6533  1656.8768  45.6344
HSITNPLAVLCEFISQSIK  83.2593  2099.1239  91.3152
VPSYTLILPSLELPVLHVPR  77.5581  2242.3232  70.0301
LTISEQNIQR  68.74  1200.6494  25.3496
plasminogen [Homo sapiens]
FVTWIEGVMR  72.8823  1236.6396  54.7743
QLGAGSIEECAAK  72.862  1275.6154  27.6392
NPDGDVGGPWCYTTNPR  120.2067  1847.7967  35.8907
VILGAHQEVNLEPHVQEIEVSR  78.3774  2495.3215  39.0943
TPENYPNAGLTMNYCR  101.8809  1842.8145  35.9689
apolipoprotein A-IV precursor [Homo sapiens]
ALVQQMEQLR  73.9028  1214.6479  36.9434
LGEVNTYAGDLQK  78.927  1406.7076  28.5615
KLVPFATELHER  65.2054  1438.7936  32.8886
LKEEIGKELEELR  66.2947  1584.8759  36.5651
SELTQQLNALFQDK  105.6494  1633.842  58.1552
SLAELGGHLDQQVEEFR  95.9063  1926.9509  45.6714
SLAELGGHLDQQVEEFRR  66.7573  2083.0512  43.2074
LGPHAGDVEGHLSFLEK  69.5583  1804.9119  38.3486
angiotensinogen preproprotein [Homo sapiens]
FMQAVTGWK  64.7875  1066.5296  36.4695
ALQDQLVLVAAK  82.6965  1267.7541  40.9959
ADSQAQLLLSTVVGVFTAPGLHLK  75.0783  2464.3831  82.3134
TIHLTMPQLVLQGSYDLQDLLAQAELPAILHTELNLQK  129.8611  4267.3052  88.6288
alpha 1B-glycoprotein precursor [Homo sapiens]
ATWSGAVLAGR  81.357  1087.5797  33.0551
SGLSTGWTQLSK  73.404  1263.6503  36.346
SLPAPWLSMAPVSWITPGLK  79.306  2150.1745  74.5875
TPGAAANLELIFVGPQHAGNYR  109.3616  2295.1889  52.8175
SWVPHTFESELSDPVELLVAES  81.7562  2470.2058  69.8822
LHDNQNGWSGDSAPVELILSDETLPAPEFSPEPESGR  99.1515  3989.8813  60.9303
vitamin D-binding protein precursor [Homo sapiens]
HLSLLTTLSNR  67.6305  1253.7105  40.161
VMDKYTFELSR  69.152  1387.6809  34.0999
FPSGTFEQVSQLVK  81.16  1565.8181  49.626
KFPSGTFEQVSQLVK  76.8736  1693.9103  46.6934
VPTADLEDVLPLAEDITNILSK  84.1179  2365.2762  85.2436
EYANQFMWEYSTNYGQAPLSLLVSYTK  78.6581  3202.5168  69.9468
LAQKVPTADLEDVLPLAEDITNILSK  82.7457  2805.5578  80.8053
EFSHLGKEDFTSLSLVLYSR  95.878  2327.1944  54.0283
afamin precursor [Homo sapiens]
ESLLNHFLYEVAR  69.7321  1589.8281  60.3031
IAPQLSTEELVSLGEK  100.5375  1712.9275  46.4254
RNPFVFAPTLLTVAVHFEEVAK  85.5395  2484.3678  74.4662
LKHELTDEELQSLFTNFANVVDK  80.4831  2689.3755  66.7315
gelsolin isoform a precursor [Homo sapiens]
AGALNSNDAFVLK  72.2257  1318.6925  37.6181
TPSAAYLWVGTGASEAEK  117.8583  1836.8954  42.8101
AQPVQVAEGSEPDGFWEALGGK  77.2736  2271.0944  53.307
TPSAAYLWVGTGASEAEKTGAQELLR  91.0018  2705.3847  54.0121
alpha-2-HS-glycoprotein [Homo sapiens]
EHAVEGDCDFQLLK  70.8065  1602.7391  39.9996
TVVQPSVGAAAGPVVPPCPGR  93.1778  1958.0475  34.7402
HTFMGVVSLGSPSGEVSHPR  100.509  2080.0233  37.5547
AQLVPLPPSTYVEFTVSGTDCVAK  74.3933  2521.2942  57.8151
alpha-2-plasmin inhibitor [Homo sapiens]
IQEFLSGLPEDTVLLLLNAIHFQGFWR  75.7836  3155.7082  92.6985
MSLSSFSVNRPFLFFIFEDTTGLPLFVGSVR  79.1313  3509.8274  89.5949
hemopexin precursor [Homo sapiens]
SGAQATWTELPWPHEK  70.3061  1836.8872  46.183
SGAQATWTELPWPHEKVDGALCMEK  82.272  2783.3191  50.7631
DGWHSWPIAHQWPQGPSAVDAAFSWEEK  72.103  3218.4878  58.9136
DGWHSWPIAHQWPQGPSAVDAAFSWEEKLYLVQGTQVYVFLTK  72.925  4971.4787  76.7352
GGYTLVSGYPK  70.037  1140.5837  30.9613
YYCFQGNQFLR  63.224  1437.6563  44.9582
SLGPNSCSANGPGLYLIHGPNLYCYSDVEK  87.3833  3167.4891  53.4818
EVGTPHGIILDSVDAAFICPGSSR  91.2262  2440.2264  56.0383
kininogen 1 isoform 2 [Homo sapiens]
IGEIKEETTSHLR  79.652  1511.7946  20.3646
DIPTNSPELEETLTHTITK  90.0725  2138.0847  45.9273
IASFSQNCDIYPGKDFVQPPTK  69.7439  2454.1988  43.0719
fibronectin 1 isoform 3 preproprotein [Homo sapiens]
DLQFVEVTDVK  78.3227  1291.6697  42.2396
SSPVVIDASTAIDAPSNLR  91.4045  1911.9969  42.1224
RPGGEPSPEGTTGQSYNQYSQR  87.2975  2395.0829  19.235
TEIDKPSQMQVTDVQDNSISVK  73.1918  2461.2083  32.0907
VPGTSTSATLTGLTR  86.435  1460.7873  32.0147
coagulation factor II preproprotein [Homo sapiens]
SPQELLCGASLISDR  97.0471  1587.7997  48.7403
SEGSSVNLSPPLEQCVPDR  75.202  2012.9552  41.1497
NPDSSTTGPWCYTTDPTVR  85.5528  2096.9187  37.0768
IVEGSDAEIGMSPWQVMLFR  89.399  2264.1106  66.9951
SEGSSVNLSPPLEQCVPDRGQQYQGR  65.8008  2830.3404  38.2878
ELLESYIDGR  70.2144  1193.5965  39.624
C-type lectin domain family 3, member B [Homo sapiens]
SRLDTLAQEVALLK  65.0436  1555.8989  51.7843
GGTLGTPQTGSENDALYEYLR  109.7033  2241.065  46.8582
keratin 1 [Homo sapiens]
FLEQQNQVLQTK  76.1  1474.7827  28.0465
inter-alpha (globulin) inhibitor H3 preproprotein [Homo sapiens]
LWAYLTIEQLLEK  91.3425  1618.9074  72.8909
serum amyloid P component precursor [Homo sapiens]
VGEYSLYIGR  74.0706  1155.5952  35.8244
QGYFVEAQPK  75.6519  1165.5793  25.3918
AYSLFSYNTQGR  85.7937  1405.6675  39.3873
IVLGQEQDSYGGKFDR  82.3956  1810.8886  28.703
complement component 6 precursor [Homo sapiens] >gi|189242612|ref|NP 001108603.2| complement component 6 precursor [Homo sapiens]
IFDDFGTHYFTSGSLGGVYDLLYQFSSEELK  90.5336  3534.6703  81.5984
complement component 4 binding protein, alpha chain precursor [Homo sapiens]
LSLEIEQLELQR  69.34  1469.8169  49.3183
FSAICQGDGTWSPR  75.7408  1523.6875  36.2249
KPDVSHGEMVSGFGPIYNYK  73.1646  2224.0678  38.765
KPDVSHGEMVSGFGPIYNYKDTIVFK  93.1407  2927.4645  48.6949
heparin cofactor II precursor [Homo sapiens]
FTVDRPFLFLIYEHR  68.176  1952.039  60.3305
alpha-2-glycoprotein 1, zinc [Homo sapiens]
YSLTYIYTGLSK  97.8745  1407.7346  47.596
NILDRQDPPSVVVTSHQAPGEK  79.4367  2386.2297  29.8159
HVEDVPAFQALGSLNDLQFFR  82.2165  2402.2176  67.6579
apolipoprotein E precursor [Homo sapiens]
AATVGSLAGQPLQER  73.6921  1496.7986  29.4961
serine (or cysteine) proteinase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 1 [Homo sapiens]
TVQAVLTVPK  66.3931  1054.6408  32.2106
LAAAVSNFGYDLYR  78.6684  1558.7848  45.8008
IAQLPLTGSMSIIFFLPLK  90.3446  2088.217  82.2969
peptidoglycan recognition protein 2 precursor [Homo sapiens]
HTASAWLMSAPNSGPHNR  71.894  1932.9048  29.4364
DGSPDVTTADIGANTPDATK  106.6953  1944.8967  25.9227
AGLLRPDYALLGHR  65.826  1550.8713  39.2893
apolipoprotein A-II preproprotein [Homo sapiens]
AGTELVNFLSYFVELGTQPATQ  69.158  2384.2065  86.029
KAGTELVNFLSYFVELGTQPATQ  87.4082  2512.302  81.0994
EPCVESLVSQYFQTVTDYGKDLMEK  80.145  2908.3726  80.8426
complement component 1, r subcomponent [Homo sapiens]
TLDEFTIIQNLQPQYQFR  71.05  2253.1568  60.3064
LFGEVTSPLFPKPYPNNFETTTVITVPTGYR  76.72  3484.813  60.2166
retinol-binding protein 4, plasma precursor [Homo sapiens] >gi|113865843|ref|NP 001038960.1| retinol-binding protein 4, plasma [Pan troglodytes]
LLNNWDVCADMVGTFTDTEDPAK  110.5675  2554.1516  66.0509
KDPEGLFLQDNIVAEFSVDETGQMSATAK  100.4329  3139.5233  60.9867
vitronectin precursor [Homo sapiens]
SIAQYWLGCPAPGHL  74.7238  1611.7979  56.8878
DVWGIEGPIDAAFTR  69.6887  1645.822  58.8065
corticosteroid binding globulin precursor [Homo sapiens]
IVDLFSGLDSPAILVLVNYIFFK  134.4473  2582.46  94.6978
clusterin isoform 1 [Homo sapiens]
ELDESLQVAER
65.0356
1287.6338
27.8548
VTTVASHTSDSDVPSGVTEVVVK
74.75
2313.1772
32.1829
LFDSDPITVTVPVEVSR
88.712
1872.9918
50.3583
protein S, alpha preproprotein [Homo sapiens]
TYDSEGVILYAESIDHSAWLLIALR
89.1809
2834.4632
79.8842
complement component 8, gamma polypeptide [Homo sapiens]
SLPVSDSVLSGFEQR
94.8744
1619.8243
47.4286
histidine-rich glycoprotein precursor [Homo sapiens]
GGEGTGYFVDFSVR
86.3563
1489.6902
46.5406
DSPVLIDFFEDTER
91.1205
1681.7938
63.0791
orosomucoid 1 precursor [Homo sapiens]
EQLGEFYEALDCLR
85.8347
1684.7879
61.8452
YVGGQEHFAHLLILR
72.82
1751.9501
44.5235
insulin-like growth factor binding protein, acid labile subunit isoform 2 precursor
[Homo sapiens]
LAELPADALGPLQR
74.3792
1462.8203
45.1268
VAGLLEDTFPGLLGLR
81.6047
1669.9516
68.7936
apolipoprotein H precursor [Homo sapiens]
VCPFAGILENGAVR
83.5184
1444.7578
53.9545
serine (or cysteine) proteinase inhibitor, clade A, member 7 [Homo sapiens]
NALALFVLPK
70.99
1084.6702
57.5629
EGQMESVEAAMSSK
82.3525
1482.638
31.6632
FSISATYDLGATLLK
79.4854
1598.865
59.9171
SFMLLILER
62.064
1120.637
63.3066
alpha-1-microglobulin/bikunin preproprotein [Homo sapiens]
AFIQLWAFDAVK
90.4755
1407.7642
64.3939
coagulation factor XII precursor [Homo sapiens]
VVGGLVALR
76.3167
882.5645
35.0509
complement factor H-related 1 [Homo sapiens] >gi|239758113|ref|XP 002346300.1|
PREDICTED: similar to complement factor H-related 1 isoform 1 [Homo sapiens]
EIMENYNIALR
66.21
1364.6808
38.7847
STDTSCVNPPTVQNAHILSR
71.744
2139.0435
31.8693
apolipoprotein C-III precursor [Homo sapiens]
DALSSVQESQVAQQAR
116.276
1715.8487
26.7761
pro-platelet basic protein precursor [Homo sapiens]
GKEESLDSDLYAELR
111.1358
1723.8316
41.2275
complement component 8, alpha polypeptide precursor [Homo sapiens]
LGSLGAACEQTQTEGAK
92.2314
1662.7941
24.8774
paraoxonase 1 precursor [Homo sapiens]
ILLMDLNEEDPTVLELGITGSK
94.6882
2399.2661
63.7875
alpha-2-macroglobulin precursor [Homo sapiens]
FEVQVTVPK
66.354
1045.5833
35.7438
VGFYESDVMGR
81.3462
1258.5725
34.8086
leucine-rich alpha-2-glycoprotein 1 [Homo sapiens]
    VAAGAFQGLR  82.2481  988.546  29.2645
apolipoprotein C-II precursor [Homo sapiens]
    STAAMSTYTGIFTDQVLSVLKGEE  87.0789  2547.2585  74.422
plasma kallikrein B1 precursor [Homo sapiens]
    IAYGTQGSSGYSLR  85.4547  1458.7144  26.1314
ubiquitin and ribosomal protein S27a precursor [Homo sapiens] >gi|27807503|ref|NP_777203.1| ribosomal protein S27a [Bos taurus] >gi|62859181|ref|NP_001016172.1| hypothetical protein LOC548926 [Xenopus (Silurana) tropicalis] >gi|148222699|ref|NP_0010
    TITLEVEPSDTIENVK  82.5073  1786.9272  41.5314
complement component 1, s subcomponent precursor [Homo sapiens] >gi|41393602|ref|NP_958850.1| complement component 1, s subcomponent precursor [Homo sapiens]
    GFQVVVTLR  71.1854  1017.5919  42.253
    TNFDNDIALVR  69.764  1276.6445  38.3258
albumin preproprotein [Homo sapiens] >gi|197098046|ref|NP_001127106.1| albumin [Pongo abelii]
    ALVLIAFAQYLQQCPFEDHVK  75.852  2432.2704  77.0452
    RHPDYSVVLLLR  70.1975  1466.8393  44.7459
complement factor I preproprotein [Homo sapiens]
    VFSLQWGEVK  71.9273  1191.6329  48.0329
    GLETSLAECTFTK  84.56  1398.6752  43.7576
complement component 4B preproprotein [Homo sapiens]
    VGDTLNLNLR  94.1364  1113.6169  37.1683
    LNMGITDLQGLR  75.185  1329.7129  46.4189
    GLEEELQFSLGSK  85.9882  1435.726  48.2367
    VLSLAQEQVGGSPEK  103.8555  1540.814  28.8422
    TTNIQGINLLFSSR  93.4373  1562.8495  53.0759
    MRPSTDTITVMVENSHGLR  64.412  2143.0568  39.6596
    VTASDPLDTLGSEGALSPGGVASLLR  118.43  2482.3079  63.152
carboxypeptidase N, polypeptide 1 precursor [Homo sapiens]
    IHILPSMNPDGYEVAAAQGPNKPGYLVGR  87.647  3063.5734  45.2203
beta globin [Homo sapiens] >gi|55635219|ref|XP_508242.1| PREDICTED: hypothetical protein [Pan troglodytes]
    VNVDEVGGEALGR  73.0646  1313.6622  30.1181
apolipoprotein D precursor [Homo sapiens]
    MTVTDQVNCPK  72.927  1234.5702  23.4879
Appendix B
Glycomic ground-truth data
The ground-truth data were generated from a list of human serum N-glycans, each characterized by its number of monosaccharides: N-acetylglucosamine (GlcNAc), hexose, fucose, and
N-acetylneuraminic acid (NeuNAc). Putative compositions were assigned by comparing measured mass values with theoretical values, taking hydrogen adducts into account.
Erroneous assignments were removed by visual inspection of the extracted ion chromatogram of each glycan. Table B.1 presents all 106 ground-truth peaks considered in
this study.
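The mass comparison described above can be illustrated with a minimal sketch. The residue increments and the end-group increment below are assumptions: they are the values for reduced, permethylated glycans, chosen because they reproduce the masses listed in Table B.1, although the derivatization is not stated in this appendix. The function names and the 10 ppm tolerance are likewise hypothetical and serve only as an illustration, not as the procedure actually used in this study.

```python
# ASSUMED monoisotopic residue increments (Da) for reduced, permethylated
# N-glycans; they are not stated in the text.
RESIDUE_MASS = {
    "GlcNAc": 245.1263,
    "hexose": 204.0998,
    "fucose": 174.0892,
    "NeuNAc": 361.1737,
}
REDUCED_END = 62.0732  # ASSUMED end-group increment (Da)
PROTON = 1.007276      # mass of one hydrogen (proton) adduct (Da)


def theoretical_mass(glcnac, hexose, fucose, neunac):
    """Neutral mass (Da) of a glycan given its monosaccharide counts."""
    counts = {"GlcNAc": glcnac, "hexose": hexose,
              "fucose": fucose, "NeuNAc": neunac}
    return REDUCED_END + sum(RESIDUE_MASS[k] * n for k, n in counts.items())


def mz_of(mass, charge):
    """m/z of an ion carrying `charge` hydrogen adducts."""
    return (mass + charge * PROTON) / charge


def assign_composition(measured_mz, charge, candidates, tol_ppm=10.0):
    """Return the candidate compositions whose theoretical m/z lies within
    tol_ppm of the measured m/z (tolerance is a hypothetical default)."""
    hits = []
    for comp in candidates:
        expected = mz_of(theoretical_mass(*comp), charge)
        ppm = abs(measured_mz - expected) / expected * 1e6
        if ppm <= tol_ppm:
            hits.append(comp)
    return hits


if __name__ == "__main__":
    # Composition (GlcNAc, hexose, fucose, NeuNAc) = (3, 3, 0, 0)
    print(round(theoretical_mass(3, 3, 0, 0), 4))
    print(assign_composition(705.8830, 2, [(3, 3, 0, 0), (2, 5, 0, 0)]))
```

Under these assumptions, a measured doubly charged ion is matched against each candidate composition, and only candidates within the tolerance are kept; the visual inspection step described above would still be needed to remove erroneous assignments.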
Table B.1: Ground-truth peaks in the glycomic data set.
Monosaccharide composition (columns 1-4)
GlcNAc  hexose  fucose  NeuNAc  Mass (Da)   Charge  Time (min.)
3       3       0       0       1409.7515   1       22.796
3       3       0       0       1409.7515   2       23.543
3       3       0       0       1409.7515   2       23.9518
2       5       0       0       1572.8248   1       26.3688
2       5       0       0       1572.8248   2       26.3784
3       3       1       0       1583.8407   2       24.6975
3       4       0       0       1613.8513   2       23.9812
3       4       0       0       1613.8513   2       25.7926
3       4       1       0       1787.9405   2       25.1786
3       4       1       0       1787.9405   2       27.1283
4       3       0       0       1654.8778   1       23.8111
4       3       0       0       1654.8778   2       24.0786
2       6       0       0       1776.9246   1       28.4588
2       6       0       0       1776.9246   2       28.4733
3       5       0       0       1817.9511   2       26.9831
4       3       1       0       1828.967    1       25.2808
4       3       1       0       1828.967    2       25.2493
4       4       0       0       1858.9776   1       24.3177
4       4       0       0       1858.9776   2       24.546
5       3       0       0       1900.0041   1       26.6363
5       3       0       0       1900.0041   2       26.7748
5       3       0       0       1900.0041   3       26.7692
2       7       0       0       1981.0244   1       30.5505
2       7       0       0       1981.0244   2       30.7067
3       6       0       0       2022.0509   2       27.2873
4       4       1       0       2033.0668   2       25.81
4       4       1       0       2033.0668   3       25.852
4       5       0       0       2063.0774   2       25.0111
4       5       0       0       2063.0774   2       29.7737
5       3       1       0       2074.0933   2       27.9739
5       3       1       0       2074.0933   3       27.7071
5       4       0       0       2104.1039   2       27.0821
5       4       0       0       2104.1039   3       27.1319
2       8       0       0       2185.1242   2       32.5645
2       8       0       0       2185.1242   2       35.3194
3       5       0       1       2179.1248   2       28.1398
3       5       0       1       2179.1248   3       27.8983
4       5       1       0       2237.1666   2       26.259
4       4       0       1       2220.1513   2       26.2445
4       4       0       1       2220.1513   3       26.2604
5       4       1       0       2278.1931   2       28.3308
5       4       1       0       2278.1931   3       28.3576
5       5       0       0       2308.2037   2       27.0912
5       5       0       0       2308.2037   3       27.0925
2       9       0       0       2389.224    2       34.3466
2       9       0       0       2389.224    2       36.979
2       10      0       0       2593.3238   2       39.3056
4       4       1       1       2394.2405   2       27.0948
4       4       1       1       2394.2405   3       27.3372
4       6       1       0       2441.2664   2       26.493
4       6       1       0       2441.2664   3       26.6756
4       5       0       1       2424.2511   2       26.6974
4       5       0       1       2424.2511   3       26.7318
5       5       1       0       2482.2929   2       28.2663
5       5       1       0       2482.2929   3       28.2549
4       5       1       1       2598.3403   2       27.9223
4       5       1       1       2598.3403   3       27.8264
5       4       1       1       2639.3668   2       29.9275
5       4       1       1       2639.3668   3       30.0366
5       6       1       0       2686.3927   2       28.8021
5       5       0       1       2669.3774   2       28.8598
5       5       0       1       2669.3774   3       28.8853
6       6       0       0       2757.4298   2       20.8819
6       6       0       0       2757.4298   3       20.9064
6       6       0       0       2757.4298   3       21.7782
4       6       1       1       2802.4401   2       28.1754
4       6       1       1       2802.4401   3       28.3406
5       5       1       1       2843.4666   2       29.9053
4       5       0       2       2785.4248   2       28.2106
4       5       0       2       2785.4248   3       28.2033
4       7       2       0       2819.4554   3       27.8066
5       6       0       1       2873.4772   3       28.0849
5       6       0       1       2873.4772   3       30.495
4       5       1       2       2959.514    2       29.4338
4       5       1       2       2959.514    3       29.3898
4       5       1       2       2959.514    4       29.7546
4       6       0       2       2989.5246   3       29.8786
5       6       2       0       2860.4819   2       29.9152
5       6       2       0       2860.4819   3       30.0363
5       6       1       1       3047.5664   2       31.6121
5       5       1       2       3204.6403   2       31.3248
5       5       1       2       3204.6403   4       31.3569
5       6       2       1       3221.6556   2       31.3066
5       6       2       1       3221.6556   3       31.3643
5       6       0       2       3234.6509   3       29.042
5       6       1       2       3408.7401   2       29.8497
5       6       1       2       3408.7401   4       30.0652
5       6       1       2       3408.7401   4       32.952
5       6       0       3       3595.8246   2       29.7408
5       6       0       3       3595.8246   3       29.5267
6       7       0       2       3683.877    3       30.0214
6       7       0       2       3683.877    4       30.0263
5       6       1       3       3769.9138   2       30.7326
5       6       2       3       3944.003    3       32.041
5       6       2       3       3944.003    4       32.0927
6       7       2       2       4032.0554   3       32.8241
6       7       1       1       3496.7925   3       29.8974
6       7       1       2       3857.9662   3       31.2025
6       7       1       3       4219.1399   3       32.3782
3       5       1       0       1992.0403   2       27.24
3       5       1       0       1992.0403   2       28.3142
3       5       1       1       2353.214    2       29.1873
4       6       1       2       3163.6138   3       37.2819
5       7       1       2       3612.8399   3       29.6059
5       7       1       2       3612.8399   4       29.7611
7       7       0       1       3567.8296   4       26.9435