Supplemental Information
Contents
Supplemental Methods
Brief descriptions and rationales of the framework
Discussion on COPD data sets used in the current study
Supplemental Figures
Supplemental Tables
References
1. Supplemental Methods
1.1. Animal model
The animal model used in this study was the adenosine deaminase (ADA)-deficient mouse, an established model of rapidly progressing pulmonary disease [1]. Mice genetically deficient in ADA spontaneously develop features of chronic lung diseases [2, 3]. At birth, litters from ADA+/- × ADA+/- matings were screened for ADA enzymatic activity using zymogram analysis. The ADA-/- mice were injected with pegylated ADA (PEG-ADA) enzyme every 4 days from birth through day 25 to allow for normal lung development. On day 26, PEG-ADA injections were discontinued to allow adenosine levels to build up and pulmonary phenotypes to develop in the ADA-/- animals. Bronchial secretions (bronchoalveolar lavage fluid, BALF) and blood plasma were collected from the ADA-/- and ADA+/- animals (3 animals per group per time point) on postnatal days 26, 30, 34, 38, and 42. These data represent the responses to enzyme withdrawal in the transgenic mice and the relevant controls at each time point.
1.2. Mouse plasma and BALF sample collection and preparation
The mouse plasma samples were depleted of their seven most abundant proteins (serum
albumin, IgG, fibrinogen, α1-antitrypsin, transferrin, haptoglobin and IgM) using a Seppro
mouse IgY7 LC10 column (Genway Biotech, San Diego, CA) following the manufacturer’s
protocols. Depleted proteins were precipitated with 10% trichloroacetic acid (TCA) and
subsequently denatured by the addition of urea to 8 M, thiourea to 2 M, dithiothreitol (DTT)
to 5 mM, and heated to 60 °C for 30 min. The samples were then diluted fourfold with
50 mM ammonium bicarbonate, and calcium chloride was added to 1 mM.
BALF samples were collected from the ADA+/- and ADA-/- mice as described previously [1]. To concentrate the proteins in the BALF samples prior to trypsin digestion, ice-cold TCA was added to a final concentration of 10%. All samples were incubated at 4 °C overnight and then centrifuged at 14,000 RCF for 5 min. The pellet was washed once with cold acetone and allowed to dry at room temperature for 5 min. The protein pellet was resuspended in 25 μL of denaturing buffer (100 mM ammonium bicarbonate, 8 M urea, 2 M thiourea, and 5 mM DTT) and heated to 60 °C for 30 min. Following denaturation, the samples (plasma and BALF) were diluted fourfold with 50 mM ammonium bicarbonate, pH 7.8, and calcium chloride was added to 1 mM. All samples were digested with methylated, sequencing-grade trypsin (Promega, Madison, WI) at a substrate-to-enzyme ratio of 50:1 (mass:mass) and incubated at 37 °C for 15 hours. Samples were cleaned up using a 1-mL SPE C18 column (Supelco, Bellefonte, PA). The peptides were eluted from each column with 1 mL of methanol and concentrated via SpeedVac. The samples were reconstituted to 1 μg/μL with 25 mM ammonium bicarbonate and frozen at -20 °C until analyzed. These experiments required a minimum of 65 μL of mouse plasma or BALF, yielding at least 200 μg of protein following immunoaffinity depletion of the most abundant plasma proteins.
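The dilution and digestion quantities in this section follow from simple arithmetic, which the short Python sketch below works through. The function names are illustrative only, and the 200 μg protein input is taken from the stated minimum yield rather than from any measured sample.

```python
def diluted_concentration(stock, fold):
    """Concentration after a `fold`-fold dilution (same units as `stock`)."""
    return stock / fold


def trypsin_mass_ug(protein_ug, substrate_to_enzyme=50.0):
    """Trypsin mass (ug) for a given substrate-to-enzyme mass ratio."""
    return protein_ug / substrate_to_enzyme


# Denaturing buffer components after the fourfold dilution described above.
urea_after = diluted_concentration(8.0, 4)      # 8 M urea     -> 2.0 M
thiourea_after = diluted_concentration(2.0, 4)  # 2 M thiourea -> 0.5 M
dtt_after = diluted_concentration(5.0, 4)       # 5 mM DTT     -> 1.25 mM

# Enzyme needed at 50:1 (mass:mass) for the minimum protein yield of 200 ug.
trypsin_needed = trypsin_mass_ug(200.0)         # -> 4.0 ug
```

The fourfold dilution brings urea down to 2 M before trypsin is added, and a 200 μg sample calls for 4 μg of trypsin at the stated ratio.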
1.3. Human plasma sample
A subset of plasma samples originated from representative participants in a large cohort
(n=467) from the Genetics of Addiction program (University of Utah Medical School). All
subjects were recruited and samples were collected under institutional review board-approved protocols at the University of Utah. All applicable requirements of the federal and
state regulations were complied with, and informed consent from each subject was
obtained before the study began. These protocols were reviewed by the Institutional
Review Board of the Pacific Northwest National Laboratory before transfer and analysis of
the samples. Selected plasma samples were from current smokers or never smokers with
low body mass index (BMI) values (< 25). Never smokers were subjects who had smoked
less than one cigarette in their lifetime. The two groups analyzed include pooled plasma
from 7 low BMI never smokers and 7 low BMI smokers with COPD. Additional details
regarding study participants have been described previously [4].
1.4. Human plasma depletion and protein digestion
The individual human plasma samples in each group were pooled for protein digestion. The plasma samples were first subjected to separation of 12 high-abundance proteins using a ProteomeLab™ IgY12 LC10 affinity LC column (12.7 × 79.0 mm; Beckman Coulter, Fullerton, CA), with a column capacity of 250 μL of plasma, on an Agilent 1100 series HPLC system. The protein samples from IgY12 bound fractions were denatured and reduced in 50 mM NH4HCO3 buffer, pH 8.0, 8 M urea, 10 mM DTT for 1 h at 37 °C. The resulting protein mixture was diluted 6-fold with 50 mM NH4HCO3, pH 8.0, before sequencing-grade modified trypsin (Promega, Madison, WI) was added at a trypsin:protein ratio of 1:50 (w/w). The sample was incubated at 37 °C for 3 h. The tryptic digest was then loaded onto a 1-mL SPE C18 column (Supelco, Bellefonte, PA) and washed with 4 mL of 0.1% TFA, 5% acetonitrile. Peptides were eluted from the SPE column with 1 mL of 0.1% TFA, 80% acetonitrile and lyophilized. The final peptide concentration was determined by BCA protein assay (Pierce). Peptide samples were stored at -80 °C until further analysis.
1.5. Strong cation exchange (SCX) fractionation for the accurate mass and time (AMT) databases
An AMT database [5, 6] was established for each sample type using the corresponding pooled samples: two tryptic peptide pools from plasma samples of the ADA+/- and ADA-/- mice at the early (days 26, 30 and 34) and late (days 38 and 42) time points in disease progression, two from the mouse BALF samples at the early and late time points, one from the low BMI never smokers, and one from the low BMI smokers. The pooled peptide samples were fractionated by SCX high-performance liquid chromatography (HPLC) as described previously [5, 7]. Briefly, the peptides were resuspended in mobile phase A, and 900 µL was injected onto a Polysulfoethyl A column (200 × 2.1 mm, 5 µm, 300 Å; PolyLC, Inc., Columbia, MD) and separated using an Agilent 1100 HPLC system (Agilent, Palo Alto, CA). The autosampler and automated fraction collector were cooled to 4 °C using Peltier coolers. The mobile phases consisted of 10 mM ammonium formate, 25% acetonitrile, pH 3.0 (mobile phase A) and 500 mM ammonium formate, 25% acetonitrile, pH 6.8 (mobile phase B). Mobile phase A was maintained at 100% for the first 10 min; mobile phase B was then increased from 0 to 50% over the next 40 min and from 50 to 100% over the following 10 min, and was held at 100% for a final 10 min. A flow rate of 0.2 mL/min was maintained throughout the gradient. Spectra were obtained at 280 nm. A total of 24 to 26 fractions were collected, lyophilized and stored at -80 °C prior to the reversed-phase LC tandem mass spectrometry (MS/MS) analyses.
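The SCX gradient described above is piecewise linear, so the mobile phase composition at any time can be interpolated from its breakpoints. The Python sketch below encodes the stated program; the function names are illustrative, and the ammonium formate estimate assumes ideal linear mixing of mobile phases A and B, which is a simplification.

```python
# (time in min, % mobile phase B) breakpoints taken from the gradient above.
GRADIENT = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0), (60.0, 100.0), (70.0, 100.0)]


def percent_b(t_min):
    """Percent mobile phase B at time t_min, by linear interpolation."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-70 min gradient program")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


def ammonium_formate_mm(t_min, conc_a=10.0, conc_b=500.0):
    """Ammonium formate concentration (mM), assuming ideal mixing of A and B."""
    return conc_a + (conc_b - conc_a) * percent_b(t_min) / 100.0
```

For example, `percent_b(30)` gives 25.0 (% B), which under the mixing assumption corresponds to `ammonium_formate_mm(30)` = 132.5 mM ammonium formate.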
1.6. Reversed-phase capillary LC-MS analyses for AMT databases
Peptide samples obtained from the individual SCX fractions were analyzed using an automated, in-house-designed, high-resolution reversed-phase capillary LC system [8]. This LC system was interfaced to an LTQ ion trap mass spectrometer (Thermo Scientific, San Jose, CA) with electrospray ionization (ESI). The mass spectrometer was operated in a data-dependent MS/MS mode over a full m/z range (400–2000) and a series of seven smaller segmented m/z ranges (400–700, 700–900, 900–1100, 1100–1300, 1300–1500, 1500–1700, and 1700–2000) for each sample. For each cycle, the ten most abundant ions from each LC-MS scan were selected for MS/MS analysis using a collision energy of 35%.
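As a small illustration of the acquisition scheme above, the following Python sketch records the full and segmented m/z ranges, checks that the segments tile the full range, and counts the analyses per sample. The helper names are hypothetical, and treating a boundary value such as m/z 700 as belonging to both adjacent segments is an assumption made here for simplicity.

```python
FULL_RANGE = (400, 2000)
SEGMENTS = [(400, 700), (700, 900), (900, 1100), (1100, 1300),
            (1300, 1500), (1500, 1700), (1700, 2000)]


def segments_covering(mz):
    """Segmented ranges that contain a precursor m/z (boundaries inclusive)."""
    return [seg for seg in SEGMENTS if seg[0] <= mz <= seg[1]]


def segments_tile_full_range():
    """True if the segments are contiguous and span the full m/z range."""
    contiguous = all(a[1] == b[0] for a, b in zip(SEGMENTS, SEGMENTS[1:]))
    return (contiguous and SEGMENTS[0][0] == FULL_RANGE[0]
            and SEGMENTS[-1][1] == FULL_RANGE[1])


# One full-range analysis plus seven segmented analyses per sample.
analyses_per_sample = 1 + len(SEGMENTS)
```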
1.7. Generation of peptide AMT Tag databases
The MS/MS measurements from the previous step were used to construct a peptide AMT database for each sample [6]. The raw data from the LC-MS/MS analyses were converted into .dta files using in-house software, DeconMSn (version 2.1.4.1), which accurately calculates the parent monoisotopic mass for each spectrum from the parent isotopic distribution using a modified THRASH algorithm [9]. For the mouse samples, the MS-Generating Function software [10] was used to search the MS/MS spectral data against the mouse UniProt FASTA file containing 16,383 proteins. Porcine trypsin was added to the database as an expected contaminant. For the human samples, the MS/MS data were searched against the human International Protein Index (IPI) database, which contained 75,419 protein entries, using X!Tandem 2 software with the reversed-sequence decoy database searching option (for assessing the false positive rate). To reduce protein mapping redundancy, all peptide sequences were subsequently mapped to protein entries in the human UniProt database. No cleavage specificity was defined in the database searches. Peptide identifications in the AMT database were further refined by controlling the spectral false discovery rate (FDR) at < 1% [5].
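The decoy-based error control mentioned above can be illustrated with a minimal sketch. It assumes the common target-decoy estimate, FDR ≈ (decoy hits) / (target hits) above a score cutoff, and a hypothetical convention in which higher scores are better; the scoring and filtering actually used to build the AMT databases may differ. All names and the example scores are illustrative.

```python
def fdr_at_threshold(psms, threshold):
    """Estimated FDR among matches scoring >= threshold.

    psms: list of (score, is_decoy) pairs; higher score = better (assumed).
    """
    targets = sum(1 for s, decoy in psms if s >= threshold and not decoy)
    decoys = sum(1 for s, decoy in psms if s >= threshold and decoy)
    return decoys / targets if targets else 0.0


def loosest_threshold(psms, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR is <= max_fdr (None if none)."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best


# Hypothetical spectrum matches: (score, matched a reversed decoy sequence?)
example = [(0.99, False), (0.95, False), (0.90, False), (0.80, True), (0.70, False)]
cutoff = loosest_threshold(example)  # the decoy at 0.80 forces the cutoff to 0.90
```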
1.8. Reversed-phase capillary LC-LTQ-Orbitrap analyses for individual samples
Peptide samples from plasma and BALF (from Supplemental Methods 1.2) obtained on postnatal days 26, 30, 34, 38 and 42 were individually analyzed using an LTQ-Orbitrap™ mass spectrometer (Thermo Scientific, San Jose, CA) coupled through an in-house-manufactured ESI interface. The reversed-phase capillary column was prepared by slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a 65-cm-long, 75-μm-inner-diameter fused silica capillary (Polymicro Technologies, Phoenix, AZ). The mobile phases consisted of 0.1% formic acid in water (solvent A) and 0.1% formic acid in acetonitrile (solvent B). After loading 5 µg (1 μg/μL) of peptides onto the column, the mobile phase was held at 100% solvent A for 50 min. Exponential gradient elution was performed by increasing the mobile phase composition from 0 to 55% solvent B over 100 min. Orbitrap™ spectra were collected from 400 to 2000 m/z at a resolution of 100 k [11]. The ten most abundant ions from the MS analysis were selected for MS/MS analysis using a normalized collision energy setting of 35%. A dynamic exclusion of 1 min was used to avoid repetitive analysis of the same abundant precursor ion. All samples were analyzed in triplicate. The heated capillary temperature and spray voltage were maintained at 200 °C and 2.2 kV, respectively. For the human samples, slightly different mobile phases were used: mobile phase A consisted of 0.2% acetic acid and 0.05% TFA in water, and mobile phase B was 0.1% TFA in 90% acetonitrile. The gradient of solvent B was increased to 60% for the human samples. The rest of the procedures were identical for the mouse and human samples.
1.9. LC-LTQ-Orbitrap data analysis
The Orbitrap spectra were analyzed using the AMT tag approach [12, 13]. Briefly, high-resolution LC-MS features were deconvoluted using Decon2LS (version 1.0.2, using default parameters) and aligned to the AMT tag database in VIPER (version 3.45, using default parameters) using the theoretical mass and observed normalized elution time of each peptide [14]. This approach to proteomics research is enabled by a number of published [14-18] and unpublished tools developed in-house, which are freely available for download at http://omics.pnl.gov. The peptide alignment scores were filtered to control the FDR at < 10% and to require a uniqueness probability score > 0.5. A minimum of two unique peptides was required per protein identification. Peak intensity values (i.e., abundances) were available for the final identified peptides.
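The two post-alignment criteria stated above (uniqueness probability > 0.5 and at least two unique peptides per protein) can be sketched as a simple filter. This is an illustration only, not the VIPER implementation: the record layout, function name, and example entries are hypothetical, "unique peptides" is interpreted here as distinct peptide sequences, and the FDR filter is not modeled.

```python
from collections import defaultdict


def identified_proteins(matches, min_uniqueness=0.5, min_peptides=2):
    """Keep proteins supported by enough confidently matched peptides.

    matches: iterable of (protein, peptide, uniqueness_probability, abundance).
    Returns {protein: sorted list of retained peptide sequences}.
    """
    peptides = defaultdict(set)
    for protein, peptide, uniqueness, _abundance in matches:
        if uniqueness > min_uniqueness:
            peptides[protein].add(peptide)
    return {prot: sorted(seqs) for prot, seqs in peptides.items()
            if len(seqs) >= min_peptides}


# Hypothetical matched features (placeholder protein and peptide names).
example_matches = [
    ("PROT_A", "PEPTIDEA", 0.90, 1.0e6),
    ("PROT_A", "PEPTIDEB", 0.70, 2.0e6),
    ("PROT_B", "PEPTIDEC", 0.80, 5.0e5),
    ("PROT_B", "PEPTIDED", 0.30, 4.0e5),  # fails the uniqueness cutoff
    ("PROT_C", "PEPTIDEE", 0.95, 3.0e5),  # only one peptide
]
kept = identified_proteins(example_matches)
```

In this toy input only `PROT_A` is retained, because it is the only protein with two peptides passing the uniqueness cutoff.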
2. Brief descriptions and rationales of the framework
2.1. Data reduction
Disease marker identification often starts with a list of genes, proteins, or metabolites that are differentially expressed in diseased conditions relative to their controls [19, 20]. Although some rules of thumb are available for different types of experimental measurements, determining a list of differentially expressed features is frequently constrained by various biological and technical limitations [21, 22]. Some case-specific effort is therefore inevitable in order to address the specific limitations of most disease marker studies, and the exact approaches implemented in the data reduction component are often decided by the scientists who designed or conducted the studies.
2.2. Distance-based hierarchical clustering
Clustering is a common approach in knowledge discovery. It aims to group data in such a way that patterns in the same group are more similar to each other than to those in other groups [23]. The list of features differentially expressed between conditions, determined in the previous step, is hierarchically clustered into several subsets based on a specified distance criterion. The distance criterion proposed here is based on dissimilarity and can be derived from feature expression profiles, functional annotations of the features, or a combination of both, which integrates data-driven and knowledge-driven information. Our speculation is that this integrated distance may help group the features, such as genes or proteins, into several clusters that, ideally, carry mutually orthogonal information. If this is feasible, the individual subsets should contain less noise than the entire data set, and the marker candidates selected within individual subsets could be more robust than those extracted from the full data set.
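To make the idea of an integrated distance concrete, the sketch below combines a correlation-based distance between expression profiles with a Jaccard distance between annotation sets and feeds the result to a small average-linkage clustering routine. The specific distance functions, the equal weighting, the clustering routine, and the toy data are all assumptions chosen for illustration; they are not necessarily the choices made in the framework.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation (0 = identical trend, 2 = opposite trend).

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two annotation sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def combined_distance(p, q, profiles, annotations, weight=0.5):
    """Weighted mix of data-driven and knowledge-driven dissimilarity."""
    d_data = pearson_distance(profiles[p], profiles[q]) / 2.0  # rescale to [0, 1]
    d_know = jaccard_distance(annotations[p], annotations[q])
    return weight * d_data + (1.0 - weight) * d_know


def average_linkage(names, dist, n_clusters):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters


# Toy abundance profiles over five time points and toy annotation sets.
profiles = {"P1": [1, 2, 3, 4, 5], "P2": [2, 4, 6, 8, 11],
            "P3": [5, 4, 3, 2, 1], "P4": [9, 7, 6, 3, 2]}
annotations = {"P1": {"term_a", "term_b"}, "P2": {"term_a"},
               "P3": {"term_c"}, "P4": {"term_c", "term_d"}}

clusters = average_linkage(
    list(profiles),
    lambda a, b: combined_distance(a, b, profiles, annotations),
    n_clusters=2,
)
```

With these toy inputs the two rising profiles that share an annotation (P1, P2) fall into one cluster and the two falling profiles (P3, P4) into the other.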
2.3. Expert knowledge-driven disease-model-related functional selection
In addition to the semi-automated clustering approach, we include an expert-knowledge-driven, disease-model-related functional selection in the pipeline. This approach identifies the biological processes that contain significantly changed proteins, which may be important for the disease of interest. This selection may also serve as a means to validate the results from the distance-based clustering approach.
2.4. Bayesian integration and classification
Bayesian fusion analyses are implemented to capture and integrate the information derived from the individual subsets in order to determine which sub data sets provide the best performance. The performance of individual clusters or sets of clusters is measured numerically by CA, our defined evaluation metric. CA is a flexible measurement that can be used in studies not only with binary responses, such as diseased vs. healthy, but also with multi-categorical variables, such as cases with more than two disease stages.
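The following sketch illustrates one way that class probabilities from several clusters could be fused and scored. It rests on two assumptions that go beyond the text: CA is treated as a plain classification accuracy (the fraction of samples assigned to the correct class), and the fusion uses the standard naive-Bayes product rule, which assumes the clusters are conditionally independent given the class. Neither is asserted to be the exact formulation used in the framework, and all names and numbers are illustrative.

```python
def fuse_posteriors(posteriors, prior):
    """Combine per-cluster class posteriors with the naive-Bayes product rule.

    posteriors: list of {class: P(class | data from one cluster)}.
    prior:      {class: P(class)}.
    P(class | all clusters) is proportional to
    P(class)^(1 - k) * product of the k per-cluster posteriors.
    """
    k = len(posteriors)
    unnormalized = {}
    for cls, p_cls in prior.items():
        score = p_cls ** (1 - k)
        for post in posteriors:
            score *= post[cls]
        unnormalized[cls] = score
    total = sum(unnormalized.values())
    return {cls: value / total for cls, value in unnormalized.items()}


def classification_accuracy(predicted, actual):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Two hypothetical clusters voting on one sample (D = diseased, C = control).
prior = {"D": 0.5, "C": 0.5}
fused = fuse_posteriors([{"D": 0.8, "C": 0.2}, {"D": 0.6, "C": 0.4}], prior)

# Accuracy over four hypothetical samples; works unchanged for >2 classes.
accuracy = classification_accuracy(["D", "C", "D", "D"], ["D", "C", "C", "D"])
```

Because both functions operate on arbitrary class labels, the same sketch applies to multi-categorical responses such as several disease stages.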
2.5. Selection of biomarker candidates and validation
Marker candidates at the cluster and individual levels are extracted, and their validation on an independent human sample data set is highly desirable whenever possible. Note that validation can also be performed at the methodological level, i.e., evaluating a specific approach for biomarker identification, in addition to a detailed assessment of a list of biomarker candidates.
3. Discussion on COPD data sets used in the current study
3.1. Biological significance of selected individual protein candidates
In the demonstration data set, it is striking that the four biomarker candidate proteins from the COPD-related functional selection convey as much information about the presence of COPD-like lung destruction as the longer lists of proteins identified solely by the clustering approach. None of the four candidates from the expert-driven functional selection is specific to lung function, however; they instead reflect biological processes that are more indicative of the generalized tissue destruction seen in COPD. Interestingly, all of them have been reported to show statistical linkage with COPD and/or other lung diseases. Specifically, prothrombin (THRB) and complement C3 (CO3) would both naturally be increased during wound healing and inflammation. The former is cleaved during the clotting process to produce thrombin, which converts fibrinogen to fibrin [24], and the latter plays a central role in the activation of the complement system [25]. Vitamin D-binding protein (VTDB) has been shown to influence respiratory function both by determining vitamin D bioavailability and by direct effects on innate cell function. An emerging hypothesis suggests that VTDB may have a direct role in the pathogenesis of COPD [26, 27] as well as an indirect role in macrophage activation in the airway as part of the innate immune response [28]. Evidence of an association between VTDB and COPD was reported by the Metcalf and Robbins groups in the early 1990s [29, 30]. The last marker candidate, adiponectin (ADIPO), is a unique adipokine with multiple salutary effects, such as antiapoptotic, anti-inflammatory, and anti-oxidative activities, in many organs and cells [31]. Recent studies have suggested, though inconclusively, that adiponectin plays a role in signaling activity in the lung and can be associated with inflammatory pulmonary diseases such as COPD and asthma. Novel cross talk between lung and adipose tissues is currently under investigation [32].
3.2. Biomarker feasibility
Granted, many distinctions exist between mice and humans in respiratory physiology and anatomy as well as in innate and adaptive immune responses. However, the biological pathways shared between the two provide some justification for using the ADA-deficient mouse model to study human COPD.
3.3. Some interesting points associated with the demonstration data sets
The time course information available from the different types of sample materials provides several valuable insights, not only for COPD but also for biomarker identification schemes in general.
First, the selection of appropriate specimens in which the biomarkers will be measured is an essential issue. Ideally, the selected sample materials need to be 1) easily accessed from patients, e.g., saliva, urine, plasma or serum, 2) reliably measured in routine clinical settings, and 3) able to provide accurate information that distinguishes the disease state in patients. BALF is an example of a proximal yet inconvenient-to-collect specimen. Its location potentially enables it to contain more concentrated disease-related biomolecules and thus provide more direct biological and pathological information. In contrast, in easily accessible sample materials that are distal to the disease site, such as plasma, the disease-related biological information carried by the biomolecules can be diluted during transport from the disease site and is also more likely to be modulated by morbidities other than the specific disease of interest. Although the first type of sample may provide more accurate biological information, the easy accessibility of the second type, such as plasma, serum and urine, and the low cost of collection are practical issues that cannot be overlooked in clinical applications.
For COPD, the subject of the demonstration data sets, a variety of biological specimens have been suggested and used in biomarker identification because of the heterogeneous nature and diverse pathological components of the disease, including exhaled breath, sputum, BALF, lung biopsies, serum, and plasma [33]. BALF has also been suggested as a promising sample fluid in which to evaluate cell profiles, lymphocyte phenotypes, and cell functions in different lung diseases with the aim of characterizing pathological mechanisms [34]. In our results, the optimal CAs derived in BALF also consistently outperform those in plasma (Table 1), which lends some support to the assumption that BALF can be a more suitable material than plasma for providing accurate biological information on COPD [35, 36]. However, the standardization of procedures for the collection of BALF is still an ongoing effort, which limits its clinical utility. To weigh the aforementioned pros and cons of the distinct types of sample materials, we think that it is important, at the early marker discovery stage, to evaluate the signature molecules in both types of sample materials in order to understand their individual pathophysiological impacts on the disease of interest.
Second, a second-step test for the early diagnosis of COPD could be valuable to develop. Once robust marker candidates in BALF (or other proximal yet hard-to-collect specimens) are identified, with a comprehensive understanding of their biological roles in the disease, a second-step test becomes possible to develop. In current clinical practice, a panel of biomarkers in plasma (or other common samples and measurements) with moderate diagnostic power is routinely applied as a screening test. Patients whom the screening test indicates are likely to have COPD can then be further tested on a group of biomarkers with higher discriminating power in BALF, the second-step test. This type of two-step diagnostic approach has been widely applied in medical diagnosis, for example in tuberculin skin testing, Lyme disease testing, and cervical cancer testing. Moreover, with the foreseeable improvements in standardizing procedures for the collection of BALF, it may eventually be possible to acquire BALF samples from patients at a reasonable cost in routine clinical settings as well [37].
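The trade-off behind such a two-step scheme can be shown with a short calculation. The sketch below assumes that a patient is called positive only when both the screening test and the second-step test are positive, and that the two tests err independently given the true disease status; the sensitivity, specificity, and prevalence values are purely hypothetical and are not estimates from this study.

```python
def serial_test(sens1, spec1, sens2, spec2):
    """Overall sensitivity and specificity of a screen followed by a confirmation.

    A case is called positive only if both tests are positive; the tests are
    assumed to be conditionally independent given the true disease status.
    """
    sensitivity = sens1 * sens2
    specificity = 1.0 - (1.0 - spec1) * (1.0 - spec2)
    return sensitivity, specificity


def fraction_referred(prevalence, sens1, spec1):
    """Share of screened patients who go on to the second-step test."""
    return prevalence * sens1 + (1.0 - prevalence) * (1.0 - spec1)


# Hypothetical plasma screen (moderate power) and BALF second step (higher power).
overall_sens, overall_spec = serial_test(0.90, 0.70, 0.95, 0.95)
referred = fraction_referred(0.10, 0.90, 0.70)
```

With these illustrative numbers the combined specificity rises to 0.985 while sensitivity drops to 0.855, and only about 36% of screened patients would need the harder-to-collect specimen.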
Lastly, the 42-day time course in the ADA-deficient model of COPD provides some relevant information on disease onset in the mouse model, which is particularly valuable for determining the appropriate timing for the early diagnosis of COPD. In Fig. 3, a similar pattern was observed in the cumulative optimal CAs from both sample types (the blue lines), but not in the optimal CAs at the individual time points (the green lines). The greater differences in the individual optimal CAs were observed after day 30. We speculate that some discrepancies in COPD-associated proteins (and other biomolecules) between BALF and plasma may start to develop around day 30 and keep increasing until the differences between the two phenotypes become measurable around day 34. Because these observations were obtained with a very small sample size (three mice in each group at each time point), further study is certainly required before any conclusion can be drawn on the estimation of disease onset. Nevertheless, our results provide a valuable piece of information on this matter.
4. Supplemental Figures
[Figure S1: bar graph titled "The numbers of proteins changed at different time points". Horizontal axis: Time Point (Day 26, Day 30, Day 34, Day 38, Day 42). Series: up-regulated and down-regulated proteins in BALF and in plasma, each by t-test and by G-test.]
Fig. S1. Bar graph of the numbers of significantly changed proteins in the ADA-/- group relative to the ADA+/- group. The bars for up-regulation are in blue and those for down-regulation are in red, for BALF (solid color) and plasma (shaded color), on days 26 to 42.
[Figure S2: hierarchical clustering heat maps in two panels, A and B, each with columns labeled Time Point 1 to 5.]
Fig. S2. Hierarchical clustering with (A) 396 proteins in BALF and (B) 150 proteins in
plasma differentially changed in their abundances in the ADA-deficient mice (indicated as D)
and their time-matched controls (indicated as C). Dendrograms from hierarchical
clustering are shown at left and used to display the patterns of protein expression profiles
at five different time points (on days 26 to 42) in BALF and plasma.
[Figure S3: ROC curves in four panels. A) Human data based on mouse BALF clustering; B) Human data based on mouse plasma clustering; C) Mouse BALF; D) Mouse plasma.]
Fig. S3. The ROC curves and the AUC of the validation data set based on the clustering
defined by A) the mouse BALF data and B) the mouse plasma data, and the best performing
clusters in C) the mouse BALF and D) the mouse plasma.
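For readers unfamiliar with the AUC reported in Fig. S3, the sketch below computes it from classifier scores using the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen diseased sample scores higher than a randomly chosen control, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the validation data set.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for diseased, 0 for control; higher score = more disease-like.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for three diseased and two control samples.
auc = roc_auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0])  # 5 of 6 pairs ordered correctly
```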
5. Supplemental Tables
Table S1. Optimal individual CAs of the clusters resulting from the distance-based clustering approach.
Each entry gives the optimal individual CA* (the number of proteins in the cluster; the optimal algorithm: K - fuzzy k-nearest neighbor (KNN); L - linear discriminant analysis (LDA); M - multinomial logistic regression (MLR); N - Naïve Bayes (NB)).

Data expression profiles (data-driven)
  No. of clusters: 1
    BALF:   0.83 (396; K)
    Plasma: 0.66 (150; N)
  No. of clusters: 6
    BALF:   0.90 (95; K)*; 0.66 (55; K); 0.79 (33; K); 0.72 (19; N); 0.72 (33; N); 0.83 (161; K)*
    Plasma: 0.53 (32; N); 0.70 (19; L); 0.47 (17; K); 0.67 (36; N); 0.53 (40; K); 0.77 (6; L)
  No. of clusters: 12
    BALF:   0.69 (35; K); 0.69 (37; K)*; 0.86 (60; N); 0.69 (17; N); 0.69 (13; L); 0.83 (10; L); 0.76 (23; K); 0.86 (77; K)*; 0.66 (16; K); 0.79 (84; K)*; 0.66 (6; K); 0.66 (18; N)

Functional relationships (ontology-driven)
  No. of clusters: 6
    BALF:   0.90 (201; K)*; 0.69 (35; K); 0.90 (85; K); 0.62 (22; N); 0.90 (39; N)*; 0.66 (14; L)
    Plasma: 0.60 (68; K); 0.50 (38; L); 0.63 (19; K); 0.79 (10; N)*; 0.70 (11; L); 0.50 (4; K)
  No. of clusters: 12
    BALF:   0.72 (32; N); 0.69 (35; K); 0.83 (79; K); 0.86 (40; K); 0.79 (14; K); 0.83 (31; K); 0.66 (27; N); 0.62 (22; N); 0.86 (39; N)*; 0.90 (59; K)*; 0.62 (4; L); 0.66 (14; L)

A combination of the other two
  No. of clusters: 6
    BALF:   0.79 (91; K); 0.93 (185; K)*; 0.72 (47; K); 0.79 (24; N); 0.83 (30; K)*; 0.62 (19; N)
    Plasma: 0.54 (33; N); 0.58 (60; N); 0.56 (24; K); 0.83 (13; L)*; 0.63 (13; K); 0.53 (7; L)
  No. of clusters: 12
    BALF:   0.72 (50; K); 0.90 (84; K)*; 0.79 (56; K); 0.83 (33; K); 0.79 (24; N); 0.69 (14; N); 0.79 (45; K)*; 0.66 (19; K); 0.79 (41; K); 0.62 (4; L); 0.69 (7; K); 0.62 (19; N)

* This cluster is included in the optimal integrated CA in this analysis.
Table S2. Optimal individual CAs of functional clusters resulting from the expert-driven disease-model-related functional selection.
Each entry gives the optimal individual CA* (the number of proteins in the cluster; the optimal algorithm).

All proteins
  No. of clusters: 1
    BALF:   0.81 (317; K)
    Plasma: 0.57 (113; K)
  No. of clusters: 12
    BALF:   0.76 (56; K); 0.83 (183; K); 0.79 (217; K); 0.76 (86; N); 0.79 (39; K); 0.76 (13; K); 0.83 (42; K); 0.83 (42; N); 0.72 (59; K); 0.83 (57; K); 0.79 (230; K)*; 0.90 (115; K)*
    Plasma: 0.50 (16; L); 0.57 (69; K); 0.50 (94; L); 0.53 (26; N); 0.73 (14; L)*; 0.57 (2; K); 0.63 (9; N); 0.57 (11; L); 0.43 (22; N); 0.57 (18; L); 0.50 (96; L); 0.50 (38; L)

Top 3 proteins
  No. of clusters: 1
    BALF:   0.88 (35; K)
    Plasma: 0.59 (41; N)
  No. of clusters: 12
    BALF:   0.90 (6; K); 0.86 (16; N); 0.83 (10; N); 0.83 (10; L); 0.86 (5; K); 0.72 (3; K); 0.93 (6; L); 1.00 (4; N)*; 0.79 (7; N); 0.90 (4; K); 0.86 (10; N); 0.69 (4; K)
    Plasma: 0.60 (5; L)*; 0.70 (20; L)*; 0.70 (11; N)*; 0.83 (10; L)*; 0.77 (5; L); 0.57 (2; K); 0.70 (4; L)*; 0.73 (4; K); 0.60 (7; L); 0.63 (4; N)*; 0.60 (9; N); 0.63 (5; L)

* This cluster is included in the optimal integrated CA in this analysis.
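Table S3. Top five optimal CA integrations and their corresponding biological process clusters, with the most differentially changed protein members from the disease-model-related functional enrichment selection.

Optimal integrated CA: 0.99
  Expert-selected disease-model-related cluster for the integration: #8, Carbohydrate derivative metabolic process
  Proteins in the cluster:
    CO3_MOUSE     Complement C3
    THRB_MOUSE    Prothrombin
    VTDB_MOUSE    Vitamin D-binding protein
    VATL_MOUSE    V-type proton ATPase 16 kDa proteolipid subunit

Optimal integrated CA: 0.97 (1)
  Expert-selected disease-model-related cluster for the integration: #51, Oxoacid metabolic process
  Proteins in the cluster:
    ADIPO_MOUSE   Adiponectin
    DESM_MOUSE    Desmin
    FAS_MOUSE     Fatty acid synthase
    MDHM_MOUSE    Malate dehydrogenase, mitochondrial
    VTDB_MOUSE    Vitamin D-binding protein

Optimal integrated CA: 0.97 (1)
  Expert-selected disease-model-related cluster for the integration: #71, Nucleotide metabolic process
  Proteins in the cluster:
    CO3_MOUSE     Complement C3
    DCXR_MOUSE    L-xylulose reductase
    MDHM_MOUSE    Malate dehydrogenase, mitochondrial
    THRB_MOUSE    Prothrombin
    VATL_MOUSE    V-type proton ATPase 16 kDa proteolipid subunit
    VTDB_MOUSE    Vitamin D-binding protein

Optimal integrated CA: 0.97 (1)
  Expert-selected disease-model-related cluster for the integration: #101, Regulation of localization
  Proteins in the cluster:
    ACTN2_MOUSE   Alpha-actinin-2
    ADIPO_MOUSE   Adiponectin
    CO3_MOUSE     Complement C3
    DACT1_MOUSE   Dapper homolog 1
    EGFR_MOUSE    Epidermal growth factor receptor
    THIO_MOUSE    Thioredoxin
    THRB_MOUSE    Prothrombin

Optimal integrated CA: 0.97 (1)
  Expert-selected disease-model-related cluster for the integration: #11, Regulation of programmed cell death
  Proteins in the cluster:
    ADIPO_MOUSE   Adiponectin
    FABPL_MOUSE   Fatty acid-binding protein, liver
    THRB_MOUSE    Prothrombin
    VTDB_MOUSE    Vitamin D-binding protein

(1) The 2nd to 5th optimal integrations each result from two clusters: one cluster is listed here, and the other is cluster 8 (Carbohydrate derivative metabolic process).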
Table S4. Top ten best-performing individual proteins in mouse and human plasma and their corresponding CAs. All CA values are optimal individual CAs.

CA rank   Protein ranked    CA in    CA in    Protein ranked    CA in    CA in
          by CA in mouse    mouse    human    by CA in human    human    mouse
1         CO8G              0.70     0.64     CFAI              1.00     0.60
2         GELS              0.70     0.64     PRDX2             0.93     0.53
3         APOC3             0.67     0.86     VTNC              0.93     0.53
4         CO3               0.67     0.79     APOA4             0.86     0.63
5         IGHM              0.67     0.71     APOC3             0.86     0.67
6         LUM               0.67     0.86     CERU              0.86     0.63
7         PEDF              0.67     0.50     COMP              0.86     0.53
8         APOA4             0.63     0.86     HEMO              0.86     0.57
9         CERU              0.63     0.86     LUM               0.86     0.67
10        CO8B              0.63     0.14     PLMN              0.86     0.63
References
[1] M.R. Blackburn, S.K. Datta and R.E. Kellems, Adenosine deaminase-deficient mice generated using a two-stage genetic engineering strategy exhibit a combined immunodeficiency. J Biol Chem, 1998. 273(9): p. 5093-5100.
[2] M.R. Blackburn, J.B. Volmer, J.L. Thrasher, et al., Metabolic consequences of adenosine deaminase deficiency in mice are associated with defects in alveogenesis, pulmonary inflammation, and airway obstruction. J Exp Med, 2000. 192(2): p. 159-170.
[3] Y. Zhou, D.J. Schneider and M.R. Blackburn, Adenosine signaling and the regulation of chronic lung disease. Pharmacol Ther, 2009. 123(1): p. 105-116.
[4] H. Jin, B.J. Webb-Robertson, E.S. Peterson, et al., Smoking, COPD, and 3-nitrotyrosine levels of plasma proteins. Environ Health Perspect, 2011. 119(9): p. 1314-1320.
[5] J.N. Adkins, S.M. Varnum, K.J. Auberry, et al., Toward a human blood serum proteome: analysis by multidimensional separation coupled with mass spectrometry. Mol Cell Proteomics, 2002. 1(12): p. 947-955.
[6] R.D. Smith, G.A. Anderson, M.S. Lipton, et al., An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics, 2002. 2(5): p. 513-523.
[7] W.J. Qian, J.M. Jacobs, D.G. Camp, 2nd, et al., Comparative proteome analyses of human plasma following in vivo lipopolysaccharide administration using multidimensional separations coupled with tandem mass spectrometry. Proteomics, 2005. 5(2): p. 572-584.
[8] E.A. Livesay, K. Tang, B.K. Taylor, et al., Fully automated four-column capillary LC-MS system for maximizing throughput in proteomic analyses. Anal Chem, 2008. 80(1): p. 294-302.
[9] D.M. Horn, R.A. Zubarev and F.W. McLafferty, Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom, 2000. 11(4): p. 320-332.
[10] S. Kim, N. Mischerikow, N. Bandeira, et al., The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol Cell Proteomics, 2010. 9(12): p. 2840-2852.
[11] R.T. Kelly, J.S. Page, Q. Luo, et al., Chemically etched open tubular and monolithic emitters for nanoelectrospray ionization mass spectrometry. Anal Chem, 2006. 78(22): p. 7796-7801.
[12] J.S. Zimmer, M.E. Monroe, W.J. Qian, et al., Advances in proteomics data analysis and display using an accurate mass and time tag approach. Mass Spectrom Rev, 2006. 25(3): p. 450-482.
[13] T. Liu, M.E. Belov, N. Jaitly, et al., Accurate mass measurements in proteomics. Chem Rev, 2007. 107(8): p. 3621-3653.
[14] M.E. Monroe, N. Tolic, N. Jaitly, et al., VIPER: an advanced software package to support high-throughput LC-MS peptide identification. Bioinformatics, 2007. 23(15): p. 2021-2023.
[15] N. Jaitly, M.E. Monroe, V.A. Petyuk, et al., Robust algorithm for alignment of liquid chromatography-mass spectrometry analyses in an accurate mass and time tag data analysis pipeline. Anal Chem, 2006. 78(21): p. 7397-7409.
[16] G.R. Kiebel, K.J. Auberry, N. Jaitly, et al., PRISM: a data management system for high-throughput proteomics. Proteomics, 2006. 6(6): p. 1783-1790.
[17] M.E. Monroe, J.L. Shaw, D.S. Daly, et al., MASIC: a software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC-MS(/MS) features. Comput Biol Chem, 2008. 32(3): p. 215-217.
[18] K. Petritis, L.J. Kangas, B. Yan, et al., Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information. Anal Chem, 2006. 78(14): p. 5026-5039.
[19] T. Wei, B. Liao, B.L. Ackermann, et al., Data-driven analysis approach for biomarker discovery using molecular-profiling technologies. Biomarkers, 2005. 10(2-3): p. 153-172.
[20] B.P. Bradley, Finding biomarkers is getting easier. Ecotoxicology, 2012. 21(3): p. 631-636.
[21] Z. Feng, R. Prentice and S. Srivastava, Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective. Pharmacogenomics, 2004. 5(6): p. 709-719.
[22] J.E. McDermott, J. Wang, H.D. Mitchell, et al., Challenges in Biomarker Discovery: Combining Expert Insights with Statistical Analysis of Complex Omics Data. Expert Opin Med Diagn, 2013. 7(1): p. 37-51.
[23] R. Nugent and M. Meila, An overview of clustering applied to molecular biology. Methods Mol Biol, 2010. 620: p. 369-404.
[24] K.C. Glenn, G.H. Frost, J.S. Bergmann, et al., Synthetic peptides bind to high-affinity thrombin receptors and modulate thrombin mitogenesis. Pept Res, 1988. 1(2): p. 65-73.
[25] M. Maslowska, H. Legakis, F. Assadi, et al., Targeting the signaling pathway of acylation stimulating protein. J Lipid Res, 2006. 47(3): p. 643-652.
[26] S. Dimeloe and C. Hawrylowicz, A direct role for vitamin D-binding protein in the pathogenesis of COPD? Thorax, 2011. 66(3): p. 189-190.
[27] L.H. Shen, X.M. Zhang, D.J. Su, et al., Association of vitamin D binding protein variants with susceptibility to chronic obstructive pulmonary disease. J Int Med Res, 2010. 38(3): p. 1093-1098.
[28] A.M. Wood, C. Bassford, D. Webster, et al., Vitamin D-binding protein contributes to COPD by activation of alveolar macrophages. Thorax, 2011. 66(3): p. 205-210.
[29] R.A. Robbins, G.L. Gossman, K.J. Nelson, et al., Inactivation of chemotactic factor inactivator by cigarette smoke. A potential mechanism of modulating neutrophil recruitment to the lung. Am Rev Respir Dis, 1990. 142(4): p. 763-768.
[30] J.P. Metcalf, A.B. Thompson, G.L. Gossman, et al., Gcglobulin functions as a cochemotaxin in the lower respiratory tract. A potential mechanism for lung neutrophil recruitment in cigarette smokers. Am Rev Respir Dis, 1991. 143(4 Pt 1): p. 844-849.
[31] P. Garcia and A. Sood, Adiponectin in pulmonary disease and critically ill patients. Curr Med Chem, 2012. 19(32): p. 5493-5500.
[32] Y. Takeda, K. Nakanishi, I. Tachibana, et al., Adiponectin: a novel link between adipocytes and COPD. Vitam Horm, 2012. 90: p. 419-435.
[33] M. Dahl and B.G. Nordestgaard, Markers of early disease and prognosis in COPD. Int J Chron Obstruct Pulmon Dis, 2009. 4: p. 157-167.
[34] B. Magi, E. Bargagli, L. Bini, et al., Proteome analysis of bronchoalveolar lavage in lung diseases. Proteomics, 2006. 6(23): p. 6354-6369.
[35] K.C. Meyer, The role of bronchoalveolar lavage in interstitial lung disease. Clin Chest Med, 2004. 25(4): p. 637-649.
[36] H. Chen, D. Wang, C. Bai, et al., Proteomics-based biomarkers in chronic obstructive pulmonary disease. J Proteome Res, 2010. 9(6): p. 2798-2808.
[37] F. Sampsonas, D.P. Kontoyiannis, B.F. Dickey, et al., Performance of a standardized bronchoalveolar lavage protocol in a comprehensive cancer center: a prospective 2-year study. Cancer, 2011. 117(15): p. 3424-3433.