Supplemental Information

Contents
Supplemental Methods
Brief descriptions and rationales of the framework
Discussion on COPD data sets used in the current study
Supplemental Figures
Supplemental Tables
References

1. Supplemental Methods

1.1. Animal model
The animal model used in this study was the adenosine deaminase (ADA)-deficient mouse, an established model of rapidly progressing pulmonary disease [1]. Mice genetically deficient in ADA spontaneously develop features of chronic lung diseases [2, 3]. At birth, the litters bred from ADA+/- × ADA+/- matings were screened for ADA enzymatic activity using zymogram analysis. The ADA-/- mice were injected with pegylated ADA (PEG-ADA) enzyme every 4 days from birth through day 25 to allow for normal lung development. On day 26, PEG-ADA injections were discontinued to allow adenosine levels to build up and pulmonary phenotypes to develop in the ADA-/- animals. Bronchial secretions and blood plasma were collected from the ADA-/- and ADA+/- animals (3 animals per group per time point) on postnatal days 26, 30, 34, 38, and 42. These data represent the responses to withdrawal of the enzyme in the transgenic mice and the relevant controls at each time point.

1.2. Mouse plasma and BALF sample collections and preparation
The mouse plasma samples were depleted of their seven most abundant proteins (serum albumin, IgG, fibrinogen, α1-antitrypsin, transferrin, haptoglobin and IgM) using a Seppro mouse IgY7 LC10 column (Genway Biotech, San Diego, CA) following the manufacturer’s protocols. Depleted proteins were precipitated with 10% trichloroacetic acid (TCA) and subsequently denatured by the addition of urea to 8 M, thiourea to 2 M, and dithiothreitol (DTT) to 5 mM, followed by heating to 60 °C for 30 min. The samples were then diluted fourfold with 50 mM ammonium bicarbonate, and calcium chloride was added to 1 mM.
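As a small worked check of the dilution step in Section 1.2, the sketch below computes the denaturant concentrations after the fourfold dilution. It assumes that "fourfold" means the final volume is four times the initial volume and it ignores the small volume of the calcium chloride addition; the function and variable names are illustrative only.

```python
def dilute(concentration, fold):
    """Concentration after a `fold`-fold dilution (final volume = fold x initial volume)."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return concentration / fold

# Denaturant concentrations before dilution, as stated in Section 1.2.
before = {"urea_M": 8.0, "thiourea_M": 2.0, "DTT_mM": 5.0}

# Assumed interpretation: a fourfold dilution divides every concentration by 4.
after = {name: dilute(value, 4) for name, value in before.items()}
print(after)  # {'urea_M': 2.0, 'thiourea_M': 0.5, 'DTT_mM': 1.25}
```

Under this assumption the urea concentration falls from 8 M to 2 M before the trypsin digestion.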
Bronchoalveolar lavage fluid (BALF) samples were collected from the ADA+/- and ADA-/- mice as described previously [1]. To concentrate the proteins in the BALF samples prior to trypsin digestion, ice-cold TCA was added to the samples to a final concentration of 10%. All samples were incubated at 4 °C overnight and then centrifuged at 14,000 × g for 5 min. The pellet was washed once with cold acetone and allowed to dry at room temperature for 5 min. The protein pellet was resuspended in 25 µL of denaturing buffer (100 mM ammonium bicarbonate, 8 M urea, 2 M thiourea, and 5 mM DTT) and heated to 60 °C for 30 min. Following denaturation, the samples (plasma and BALF) were diluted fourfold with 50 mM ammonium bicarbonate, pH 7.8, and calcium chloride was added to 1 mM. All samples were digested using methylated, sequencing-grade trypsin (Promega, Madison, WI) at a substrate-to-enzyme ratio of 50:1 (mass:mass) and incubated at 37 °C for 15 hours. The samples were cleaned up using a 1-mL SPE C18 column (Supelco, Bellefonte, PA). The peptides were eluted from each column with 1 mL of methanol and concentrated via SpeedVac. The samples were reconstituted to 1 µg/µL with 25 mM ammonium bicarbonate and frozen at -20 °C until analyzed. The minimum requirement for these experiments was 65 µL of mouse plasma or BALF yielding at least 200 µg of protein following immunoaffinity depletion of the most abundant plasma proteins.

1.3. Human plasma sample
A subset of plasma samples originated from representative participants in a large cohort (n=467) from the Genetics of Addiction program (University of Utah Medical School). All subjects were recruited and samples were collected under institutional review board-approved protocols at the University of Utah. All applicable requirements of the federal and state regulations were complied with, and informed consent was obtained from each subject before the study began.
These protocols were reviewed by the Institutional Review Board of the Pacific Northwest National Laboratory before transfer and analysis of the samples. Selected plasma samples were from current smokers or never smokers with low body mass index (BMI) values (< 25). Never smokers were subjects who had smoked less than one cigarette in their lifetime. The two groups analyzed were pooled plasma from 7 low-BMI never smokers and pooled plasma from 7 low-BMI smokers with COPD. Additional details regarding the study participants have been described previously [4].

1.4. Human plasma depletion and protein digestion
The individual human plasma samples in each group were pooled for protein digestion. Twelve high-abundance proteins were first separated from the plasma samples using a ProteomeLab™ 12.7 × 79.0-mm IgY12 LC10 affinity LC column (Beckman Coulter, Fullerton, CA), which has a column capacity of 250 µL of plasma, on an Agilent 1100 series HPLC system. The protein samples from the IgY12 bound fractions were denatured and reduced in 50 mM NH4HCO3 buffer, pH 8.0, containing 8 M urea and 10 mM DTT for 1 h at 37 °C. The resulting protein mixture was diluted 6-fold with 50 mM NH4HCO3, pH 8.0, before sequencing-grade modified trypsin (Promega, Madison, WI) was added at a trypsin:protein ratio of 1:50 (w/w). The sample was incubated at 37 °C for 3 h. The tryptically digested sample was then loaded onto a 1-mL SPE C18 column (Supelco, Bellefonte, PA) and washed with 4 mL of 0.1% TFA, 5% acetonitrile. Peptides were eluted from the SPE column with 1 mL of 0.1% TFA, 80% acetonitrile and lyophilized. The final peptide concentration was determined by BCA protein assay (Pierce). Peptide samples were stored at -80 °C until further analysis.

1.5.
Strong cation exchange (SCX) fractionation for the accurate mass and time (AMT) databases
An AMT database [5, 6] was established for each sample type using the corresponding pooled samples: two tryptic peptide pools from the plasma samples of the ADA+/- and ADA-/- mice at the early (days 26, 30 and 34) and the late (days 38 and 42) time points in disease progression, two from the mouse BALF samples at the early and late time points, one from the low-BMI never smokers, and one from the low-BMI smokers. The pooled peptide samples were fractionated by SCX high-performance liquid chromatography (HPLC) as described previously [5, 7]. Briefly, the peptides were resuspended in mobile phase A, and 900 µL were injected onto a Polysulfoethyl A column (200 × 2.1 mm, 5 µm, 300 Å; PolyLC, Inc., Columbia, MD) and separated using an Agilent 1100 HPLC system (Agilent, Palo Alto, CA). The autosampler and automated fraction collector were cooled to 4 °C using Peltier coolers. The mobile phases consisted of 10 mM ammonium formate, 25% acetonitrile, pH 3.0 (mobile phase A) and 500 mM ammonium formate, 25% acetonitrile, pH 6.8 (mobile phase B). Mobile phase A was maintained at 100% for the first 10 min; mobile phase B was then increased from 0 to 50% over the next 40 min and from 50 to 100% over the following 10 min, and was held at 100% for a final 10 min. A flow rate of 0.2 mL/min was maintained throughout the gradient. Spectra were obtained at 280 nm. A total of 24 to 26 fractions were collected, lyophilized and stored at -80 °C prior to the reversed-phase LC tandem mass spectrometry (MS/MS) analyses.

1.6. Reversed-phase capillary LC-MS analyses for AMT databases
Peptide samples obtained from the individual SCX fractions were analyzed using an automated, in-house-designed, high-resolution reversed-phase capillary LC system [8].
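The SCX gradient program of Section 1.5 can be written as a piecewise function of time. The sketch below is only an illustration: it assumes that each ramp is linear and that the two mobile phases mix ideally, and the names GRADIENT, percent_b and ammonium_formate_mM are hypothetical helpers, not part of any instrument software.

```python
# Breakpoints (time in min, % mobile phase B) read from the gradient description.
GRADIENT = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0), (60.0, 100.0), (70.0, 100.0)]

def percent_b(t_min):
    """Percent mobile phase B at time t_min, assuming linear ramps between breakpoints."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-70 min program")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)

def ammonium_formate_mM(t_min):
    """Ammonium formate concentration of the mixed eluent (A = 10 mM, B = 500 mM)."""
    fraction_b = percent_b(t_min) / 100
    return 10 * (1 - fraction_b) + 500 * fraction_b

print(percent_b(30))            # 25.0, halfway through the 0-50% ramp
print(ammonium_formate_mM(30))  # 132.5
```

Under these assumptions the salt concentration rises from 10 mM at the start of the run to 500 mM from minute 60 onward.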
This LC system was interfaced to an LTQ ion trap mass spectrometer (Thermo Scientific, San Jose, CA) with electrospray ionization (ESI). The mass spectrometer was operated in a data-dependent MS/MS mode over a full m/z range (400–2000) and a series of seven smaller segmented m/z ranges (400–700, 700–900, 900–1100, 1100–1300, 1300–1500, 1500–1700, and 1700–2000) for each sample. In each cycle, the ten most abundant ions from each LC-MS scan were selected for MS/MS analysis using a collision energy of 35%.

1.7. Generation of peptide AMT Tag databases
The MS/MS measurements from the previous step were used to construct a peptide AMT database for each sample [6]. The raw data from the LC-MS/MS analyses were converted into .dta files using the in-house software DeconMSn (version 2.1.4.1), which accurately calculates the parent monoisotopic mass for each spectrum from the parent isotopic distribution using a modified THRASH algorithm [9]. For the mouse samples, the MS-Generating Function software [10] was used to search the MS/MS spectral data against the mouse UniProt FASTA file containing 16,383 proteins. Porcine trypsin was added to the database as an expected contaminant. For the human samples, the MS/MS data were searched against the human International Protein Index (IPI) database, containing a total of 75,419 protein entries, using X!Tandem 2 software with the reversed-sequence decoy database searching option (for assessing the false positive rate). To reduce protein mapping redundancy, all peptide sequences were subsequently mapped to protein entries in the human UniProt database. No cleavage specificity was defined in the database searching. Peptide identifications in the AMT database were further refined by controlling the spectral FDR at < 1% [5].

1.8.
Reversed-phase capillary LC-LTQ-Orbitrap analyses for individual samples
Peptide samples from plasma and BALF (from Supplemental Methods 1.2) obtained on postnatal days 26, 30, 34, 38 and 42 were individually analyzed using an LTQ-Orbitrap™ mass spectrometer (Thermo Scientific, San Jose, CA) coupled through an in-house-manufactured ESI interface. The reversed-phase capillary column was prepared by slurry packing 3-µm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a 65-cm-long, 75-µm-inner-diameter fused silica capillary (Polymicro Technologies, Phoenix, AZ). The mobile phases consisted of 0.1% formic acid in water (solvent A) and 0.1% formic acid in acetonitrile (solvent B). After loading 5 µg of peptides (1 µg/µL) onto the column, the mobile phase was held at 100% solvent A for 50 min. Exponential gradient elution was performed by increasing the mobile phase composition from 0 to 55% solvent B over 100 min. Orbitrap™ spectra were collected from 400 to 2000 m/z at a resolution of 100k [11]. The ten most abundant ions from the MS analysis were selected for MS/MS analysis using a normalized collision energy setting of 35%. A dynamic exclusion of 1 min was used to avoid repetitive analysis of the same abundant precursor ion. All samples were analyzed in triplicate. The heated capillary temperature and spray voltage were maintained at 200 °C and 2.2 kV, respectively. For the human samples, slightly different mobile phases were used: mobile phase A consisted of 0.2% acetic acid and 0.05% TFA in water, and mobile phase B was 0.1% TFA in 90% acetonitrile. The gradient was run to 60% solvent B for the human samples. All other procedures were identical for the mouse and human samples.

1.9. LC-LTQ-Orbitrap data analysis
The Orbitrap spectra were analyzed using the AMT tag approach [12, 13].
Briefly, high-resolution LC-MS features were deconvoluted using Decon2LS (version 1.0.2, default parameters) and aligned to the AMT tag database in VIPER (version 3.45, default parameters) using the theoretical mass and observed normalized elution time of each peptide [14]. This approach to proteomics research is enabled by a number of published [14-18] and unpublished tools developed in-house, which are freely available for download at http://omics.pnl.gov. The peptide alignment scores were filtered to control the FDR at < 10% and to require a uniqueness probability score > 0.5. A minimum of two unique peptides was required for each protein identification. Peak intensity values (i.e., abundances) were available for the final identified peptides.

2. Brief descriptions and rationales of the framework

2.1. Data reduction
Disease marker identification often starts with a list of genes, proteins, or metabolites that are differentially expressed in diseased conditions relative to their controls [19, 20]. Although some rules of thumb are available for different types of experimental measurements, determining a list of differentially expressed features is frequently constrained by various biological and technical limitations [21, 22]. Some case-specific effort is inevitable to properly address these limitations in the majority of disease marker studies. Therefore, the exact approaches implemented in the data reduction component are often decided by the scientists who designed or conducted the studies.

2.2. Distance-based hierarchical clustering
Clustering is a common approach in knowledge discovery. It aims to group data in such a way that patterns in the same group are more similar to each other than to those in other groups [23].
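To make the idea of this section concrete, the sketch below clusters a handful of features with a distance that blends an expression-profile dissimilarity and a functional-annotation dissimilarity. Everything specific in it is an assumption made for illustration: the Pearson-based and Jaccard-based dissimilarities, the equal weighting, the average linkage, and the toy proteins P1 to P4 are not taken from the study.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def expression_distance(x, y):
    """Data-driven part: map the correlation to a dissimilarity in [0, 1]."""
    return (1 - pearson(x, y)) / 2

def annotation_distance(a, b):
    """Knowledge-driven part: Jaccard distance between annotation-term sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def integrated_distance(p, q, profiles, terms, weight=0.5):
    """Weighted blend of the two dissimilarities (the weight is an assumption)."""
    return (weight * expression_distance(profiles[p], profiles[q])
            + (1 - weight) * annotation_distance(terms[p], terms[q]))

def cluster(names, dist, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > k:
        def linkage(pair):
            a, b = pair
            return mean(dist(p, q) for p in clusters[a] for q in clusters[b])
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

# Hypothetical abundance profiles over five time points and annotation terms.
profiles = {"P1": [1, 2, 3, 4, 5], "P2": [2, 4, 6, 8, 11],
            "P3": [5, 4, 3, 2, 1], "P4": [9, 7, 6, 4, 2]}
terms = {"P1": {"inflammation", "coagulation"}, "P2": {"inflammation"},
         "P3": {"lipid metabolism"}, "P4": {"lipid metabolism", "transport"}}

result = cluster(list(profiles),
                 lambda p, q: integrated_distance(p, q, profiles, terms), k=2)
print(result)  # [['P1', 'P2'], ['P3', 'P4']]
```

Setting weight to 1 or 0 in this sketch gives a purely data-driven or purely annotation-driven clustering, respectively.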
The list of features that are differentially expressed between conditions, determined in the previous step, is hierarchically clustered into several subsets based on a specified distance criterion. The distance criterion proposed here is based on dissimilarity and can be derived from feature expression profiles, from functional annotations shared between the features, or from a combination of both, which integrates data-driven and knowledge-driven information. Our speculation is that this integrated distance may help to group the features, such as genes or proteins, into several clusters that ideally carry mutually orthogonal information. If this is feasible, the individual subsets should contain less noise than the entire data set, and the marker candidates selected within individual subsets could be more robust than those extracted from the full data set.

2.3. Expert knowledge-driven disease-model-related functional selection
In addition to the semi-automated clustering approach, we also include an expert-knowledge-driven, disease-model-related functional selection in the pipeline. This approach identifies the biological processes that contain significantly changed proteins and that may therefore be important for the disease of interest. This selection may also serve as a means to validate the results from the distance-based clustering approach.

2.4. Bayesian integration and classification
Bayesian fusion analyses are implemented to capture and integrate the information derived from the individual subsets in order to determine the sub data sets that provide the best performance. The performance of individual clusters or sets of clusters is measured numerically by CA, our defined evaluation metric. CA is a flexible measurement that can be used in studies not only with binary responses, such as diseased vs.
healthy, but also with multi-categorical variables, such as cases with more than two disease stages.

2.5. Selection of biomarker candidates and validation
Marker candidates are extracted at the cluster and individual levels, and their validation on an independent human sample data set is highly desirable whenever possible. Note that validation can also be performed at the methodological level, i.e., by evaluating a specific approach for biomarker identification, in addition to a detailed assessment of a list of biomarker candidates.

3. Discussion on COPD data sets used in the current study

3.1. Biological significance of selected individual protein candidates
In the demonstration data set, it is a striking observation that the four biomarker candidate proteins from the COPD-related functional selection convey as much information about the presence of COPD-like lung destruction as the longer lists of proteins identified solely by the clustering approach. However, none of the four candidates from the expert-driven functional selection is specific to lung function; instead, they reflect biological processes that are more indicative of the generalized tissue destruction seen in COPD. Interestingly, all of them have been reported to show a statistical association with COPD and/or other lung diseases. Specifically, prothrombin (THRB) and complement C3 (CO3) would both naturally be increased during wound healing and inflammation. The former is cleaved during the clotting process to produce thrombin, which converts fibrinogen to fibrin [24], and the latter plays a central role in the activation of the complement system [25]. Vitamin D binding protein (VTDB) has been shown to influence respiratory function both by determining vitamin D bioavailability and by direct effects on innate cell function.
An emerging hypothesis suggests that VTDB may have a direct role in the pathogenesis of COPD [26, 27] as well as an indirect role in macrophage activation in the airway as part of the innate immune response [28]. Evidence for the association between VTDB and COPD was reported by the Metcalf and Robbins groups in the early 1990s [29, 30]. The last, but no less intriguing, marker candidate is adiponectin (ADIPO), a unique adipokine with multiple salutary effects, such as antiapoptotic, anti-inflammatory, and anti-oxidative activities, in many organs and cells [31]. Recent studies have suggested, though inconclusively, that adiponectin plays a role in signaling activity in the lung and can be associated with inflammatory pulmonary diseases such as COPD and asthma. Novel cross talk between lung and adipose tissues is currently under investigation [32].

3.2. Biomarker feasibility
Granted, many distinctions exist between mice and humans in respiratory physiology and anatomy as well as in the innate and adaptive immune responses. However, the biological pathways shared between the two provide some justification for using the ADA-deficient mouse model to study human COPD.

3.3. Some interesting points associated with the demonstration data sets
The time course information available from the different types of sample materials provides several valuable insights, not only for COPD but also for biomarker identification schemes in general. First, the selection of appropriate specimens in which the biomarkers will be measured is an essential issue. Ideally, the selected sample materials need to be 1) easily accessed from patients, e.g., saliva, urine, plasma or serum, 2) reliably measured in routine clinical settings, and 3) able to provide accurate information that distinguishes the disease state in patients. BALF is an example of a proximal yet inconvenient-to-collect specimen.
Its location potentially enables it to contain more concentrated disease-related biomolecules and thus to provide more direct biological and pathological information. In contrast, in easily accessible sample materials that are distal to the disease site, such as plasma, the disease-related biological information carried by the biomolecules can be diluted during transport from the disease site and is also more likely to be modulated by morbidities other than the disease of interest. Although the first type of sample may provide more accurate biological information, the easy accessibility of the second type, such as plasma, serum and urine, and the low cost of their collection are practical issues that cannot be overlooked in clinical applications. For COPD, the disease in our demonstration data sets, the heterogeneous nature and diverse pathological components of the disease have led to a variety of biological specimens being suggested and used in the identification of biomarkers, including exhaled breath, sputum, BALF, lung biopsies, serum, and plasma [33]. BALF has also been suggested as a promising sample fluid in which to evaluate cell profiles, lymphocyte phenotypes, and cell functions in different lung diseases with the aim of characterizing pathological mechanisms [34]. In our results, the optimal CAs derived in BALF consistently outperform those in plasma (Table 1), which lends some support to the assumption that BALF can be a more suitable material than plasma for providing accurate biological information on COPD [35, 36]. However, the standardization of procedures for the collection of BALF is still an ongoing effort, which limits its clinical utility.
To weigh the aforementioned pros and cons of the distinct types of sample materials, we think that it is important, at the early marker discovery stage, to evaluate the signature molecules in both types of sample materials in order to understand their individual pathophysiological impacts on the disease of interest.

Secondly, a two-step test for the early diagnosis of COPD can be valuable to develop. Once robust marker candidates in BALF (or other proximal yet hard-to-collect specimens) are identified and their biological roles in the disease are comprehensively understood, a second-step test becomes possible to develop. In current clinical practice, a panel of biomarkers in plasma (or other common samples and measurements) with moderate diagnostic power is routinely applied as a screening test. After this, the patients whom the screening test indicates are likely to have COPD can be further tested on a group of biomarkers with higher discriminating power in BALF, the second-step test. This type of two-step diagnostic approach has been widely applied in many medical diagnoses, such as the tuberculin skin test, the Lyme disease test, and the cervical cancer test. Moreover, with the foreseeable improvements in standardizing the procedures for the collection of BALF, it may eventually be possible to acquire BALF samples from patients at a reasonable cost in routine clinical settings as well [37].

Lastly, the 42-day time course in the ADA-deficient model of COPD gives us some relevant information on the disease onset in the mouse model, which is particularly valuable for determining the appropriate timing for the early diagnosis of COPD. In Fig. 3, a similar pattern was observed in the cumulative optimal CAs from both samples (the lines in blue), but not in the optimal CAs at the individual time points (the lines in green). The greater differences in the individual optimal CAs were observed after day 30.
We speculate that some discrepancies in COPD-associated proteins (and other biomolecules) between BALF and plasma may start to develop around day 30 and keep increasing until the differences between the two phenotypes become measurable around day 34. Because these observations were obtained with a very small sample size (three mice in each group at each time point), we understand that further study is certainly required before any conclusion can be drawn on the estimation of disease onset. Nevertheless, our results provide a valuable piece of information in regard to this matter.

4. Supplemental Figures

[Figure S1: bar graph titled "The numbers of proteins changed at different time points". Horizontal axis: Time Point (Day 26, Day 30, Day 34, Day 38, Day 42). Series: up-regulated and down-regulated proteins in BALF (t-test), in BALF (G-test), in plasma (t-test), and in plasma (G-test).]

Fig. S1. Bar graph of the numbers of significantly changed proteins in the ADA-/- group relative to the ADA+/- group. The bars for up-regulation are in blue and those for down-regulation are in red, for BALF (solid color) and plasma (shaded color), on days 26 to 42.

[Figure S2: two clustered heat maps, panels A and B, with columns labeled Time Point 1 to 5.]

Fig. S2. Hierarchical clustering of (A) 396 proteins in BALF and (B) 150 proteins in plasma whose abundances changed differentially between the ADA-deficient mice (indicated as D) and their time-matched controls (indicated as C). Dendrograms from the hierarchical clustering are shown at left and are used to display the patterns of protein expression profiles at five different time points (days 26 to 42) in BALF and plasma.
[Figure S3: four panels of ROC curves. A) Human data based on mouse BALF clustering; B) Human data based on mouse plasma clustering; C) Mouse BALF; D) Mouse plasma.]

Fig. S3. The ROC curves and the AUC of the validation data set based on the clustering defined by A) the mouse BALF data and B) the mouse plasma data, and of the best-performing clusters in C) the mouse BALF and D) the mouse plasma.

5. Supplemental Tables

Table S1. Optimal individual CAs of the clusters resulting from the distance-based clustering approach. Each entry gives the optimal individual CA followed by (the number of proteins in the cluster; the optimal algorithm), where K = fuzzy k-nearest neighbor (KNN), L = linear discriminant analysis (LDA), M = multinomial logistic regression (MLR), and N = Naïve Bayes (NB).

No. of clusters: 1
  BALF: 0.83 (396; K)
  Plasma: 0.66 (150; N)

Data expression profiles (data-driven)
  No. of clusters: 6
    BALF: 0.90 (95; K)*; 0.66 (55; K); 0.79 (33; K); 0.72 (19; N); 0.72 (33; N); 0.83 (161; K)*
    Plasma: 0.53 (32; N); 0.70 (19; L); 0.47 (17; K); 0.67 (36; N); 0.53 (40; K); 0.77 (6; L)
  No. of clusters: 12
    BALF: 0.69 (35; K); 0.69 (37; K)*; 0.86 (60; N); 0.69 (17; N); 0.69 (13; L); 0.83 (10; L); 0.76 (23; K); 0.86 (77; K)*; 0.66 (16; K); 0.79 (84; K)*; 0.66 (6; K); 0.66 (18; N)

Functional relationships (ontology-driven)
  No. of clusters: 6
    BALF: 0.90 (201; K)*; 0.69 (35; K); 0.90 (85; K); 0.62 (22; N); 0.90 (39; N)*; 0.66 (14; L)
    Plasma: 0.60 (68; K); 0.50 (38; L); 0.63 (19; K); 0.79 (10; N)*; 0.70 (11; L); 0.50 (4; K)
  No. of clusters: 12
    BALF: 0.72 (32; N); 0.69 (35; K); 0.83 (79; K); 0.86 (40; K); 0.79 (14; K); 0.83 (31; K); 0.66 (27; N); 0.62 (22; N); 0.86 (39; N)*; 0.90 (59; K)*; 0.62 (4; L); 0.66 (14; L)

A combination of the other two
  No. of clusters: 6
    BALF: 0.79 (91; K); 0.93 (185; K)*; 0.72 (47; K); 0.79 (24; N); 0.83 (30; K)*; 0.62 (19; N)
    Plasma: 0.54 (33; N); 0.58 (60; N); 0.56 (24; K); 0.83 (13; L)*; 0.63 (13; K); 0.53 (7; L)
  No. of clusters: 12
    BALF: 0.72 (50; K); 0.90 (84; K)*; 0.79 (56; K); 0.83 (33; K); 0.79 (24; N); 0.69 (14; N); 0.79 (45; K)*; 0.66 (19; K); 0.79 (41; K); 0.62 (4; L); 0.69 (7; K); 0.62 (19; N)

* This cluster is included in the optimal integrated CA in this analysis.

Table S2.
Optimal individual CAs of the functional clusters resulting from the expert-driven disease-model-related functional selection. Each entry gives the optimal individual CA followed by (the number of proteins in the cluster; the optimal algorithm).

All proteins
  No. of clusters: 1
    BALF: 0.81 (317; K)
    Plasma: 0.57 (113; K)
  No. of clusters: 12
    BALF: 0.76 (56; K); 0.83 (183; K); 0.79 (217; K); 0.76 (86; N); 0.79 (39; K); 0.76 (13; K); 0.83 (42; K); 0.83 (42; N); 0.72 (59; K); 0.83 (57; K); 0.79 (230; K)*; 0.90 (115; K)*
    Plasma: 0.50 (16; L); 0.57 (69; K); 0.50 (94; L); 0.53 (26; N); 0.73 (14; L)*; 0.57 (2; K); 0.63 (9; N); 0.57 (11; L); 0.43 (22; N); 0.57 (18; L); 0.50 (96; L); 0.50 (38; L)

Top 3 proteins
  No. of clusters: 1
    BALF: 0.88 (35; K)
    Plasma: 0.59 (41; N)
  No. of clusters: 12
    BALF: 0.90 (6; K); 0.86 (16; N); 0.83 (10; N); 0.83 (10; L); 0.86 (5; K); 0.72 (3; K); 0.93 (6; L); 1.00 (4; N)*; 0.79 (7; N); 0.90 (4; K); 0.86 (10; N); 0.69 (4; K)
    Plasma: 0.60 (5; L)*; 0.70 (20; L)*; 0.70 (11; N)*; 0.83 (10; L)*; 0.77 (5; L); 0.57 (2; K); 0.70 (4; L)*; 0.73 (4; K); 0.60 (7; L); 0.63 (4; N)*; 0.60 (9; N); 0.63 (5; L)

* This cluster is included in the optimal integrated CA in this analysis.

Table S3. Top five optimal CA integrations and their corresponding biological process clusters, with the most differentially changed protein members, from the disease-model-related functional enrichment selection.
Columns: optimal integrated CA; expert-selected disease-model-related cluster for the integration; proteins in the cluster.

0.99; #8 Carbohydrate derivative metabolic process; CO3_MOUSE (Complement C3), THRB_MOUSE (Prothrombin), VTDB_MOUSE (Vitamin D-binding protein), VATL_MOUSE (V-type proton ATPase 16 kDa proteolipid subunit)
0.97; #5¹ Oxoacid metabolic process; ADIPO_MOUSE (Adiponectin), DESM_MOUSE (Desmin), FAS_MOUSE (Fatty acid synthase), MDHM_MOUSE (Malate dehydrogenase, mitochondrial), VTDB_MOUSE (Vitamin D-binding protein)
0.97; #7¹ Nucleotide metabolic process; CO3_MOUSE (Complement C3), DCXR_MOUSE (L-xylulose reductase), MDHM_MOUSE (Malate dehydrogenase, mitochondrial), THRB_MOUSE (Prothrombin), VATL_MOUSE (V-type proton ATPase 16 kDa proteolipid subunit), VTDB_MOUSE (Vitamin D-binding protein)
0.97; #10¹ Regulation of localization; ACTN2_MOUSE (Alpha-actinin-2), ADIPO_MOUSE (Adiponectin), CO3_MOUSE (Complement C3), DACT1_MOUSE (Dapper homolog 1), EGFR_MOUSE (Epidermal growth factor receptor), MDHM_MOUSE (Malate dehydrogenase, mitochondrial), THIO_MOUSE (Thioredoxin), THRB_MOUSE (Prothrombin)
0.97; #1¹ Regulation of programmed cell death; ADIPO_MOUSE (Adiponectin), FABPL_MOUSE (Fatty acid-binding protein, liver), THRB_MOUSE (Prothrombin), VTDB_MOUSE (Vitamin D-binding protein)

¹ The 2nd to 5th optimal integrations each result from two clusters: one cluster is listed here, and the other is cluster 8 (Carbohydrate derivative metabolic process).

Table S4. Top ten best-performing individual proteins in mouse and human plasma and their corresponding optimal individual CAs.

         Ranked by CA in mouse               Ranked by CA in human
CA rank  Protein  CA in mouse  CA in human   Protein  CA in human  CA in mouse
1        CO8G     0.70         0.64          CFAI     1.00         0.60
2        GELS     0.70         0.64          PRDX2    0.93         0.53
3        APOC3    0.67         0.86          VTNC     0.93         0.53
4        CO3      0.67         0.79          APOA4    0.86         0.63
5        IGHM     0.67         0.71          APOC3    0.86         0.67
6        LUM      0.67         0.86          CERU     0.86         0.63
7        PEDF     0.67         0.50          COMP     0.86         0.53
8        APOA4    0.63         0.86          HEMO     0.86         0.57
9        CERU     0.63         0.86          LUM      0.86         0.67
10       CO8B     0.63         0.14          PLMN     0.86         0.63

References
[1] M.R. Blackburn, S.K. Datta and R.E. Kellems, Adenosine deaminase-deficient mice generated using a two-stage genetic engineering strategy exhibit a combined immunodeficiency. J Biol Chem, 1998. 273(9): p. 5093-5100.
[2] M.R. Blackburn, J.B. Volmer, J.L.
Thrasher, et al., Metabolic consequences of adenosine deaminase deficiency in mice are associated with defects in alveogenesis, pulmonary inflammation, and airway obstruction. J Exp Med, 2000. 192(2): p. 159-170.
[3] Y. Zhou, D.J. Schneider and M.R. Blackburn, Adenosine signaling and the regulation of chronic lung disease. Pharmacol Ther, 2009. 123(1): p. 105-116.
[4] H. Jin, B.J. Webb-Robertson, E.S. Peterson, et al., Smoking, COPD, and 3-nitrotyrosine levels of plasma proteins. Environ Health Perspect, 2011. 119(9): p. 1314-1320.
[5] J.N. Adkins, S.M. Varnum, K.J. Auberry, et al., Toward a human blood serum proteome: analysis by multidimensional separation coupled with mass spectrometry. Mol Cell Proteomics, 2002. 1(12): p. 947-955.
[6] R.D. Smith, G.A. Anderson, M.S. Lipton, et al., An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics, 2002. 2(5): p. 513-523.
[7] W.J. Qian, J.M. Jacobs, D.G. Camp, 2nd, et al., Comparative proteome analyses of human plasma following in vivo lipopolysaccharide administration using multidimensional separations coupled with tandem mass spectrometry. Proteomics, 2005. 5(2): p. 572-584.
[8] E.A. Livesay, K. Tang, B.K. Taylor, et al., Fully automated four-column capillary LC-MS system for maximizing throughput in proteomic analyses. Anal Chem, 2008. 80(1): p. 294-302.
[9] D.M. Horn, R.A. Zubarev and F.W. McLafferty, Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom, 2000. 11(4): p. 320-332.
[10] S. Kim, N. Mischerikow, N. Bandeira, et al., The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol Cell Proteomics, 2010. 9(12): p. 2840-2852.
[11] R.T. Kelly, J.S. Page, Q. Luo, et al., Chemically etched open tubular and monolithic emitters for nanoelectrospray ionization mass spectrometry. Anal Chem, 2006. 78(22): p. 7796-7801.
[12] J.S. Zimmer, M.E.
Monroe, W.J. Qian, et al., Advances in proteomics data analysis and display using an accurate mass and time tag approach. Mass Spectrom Rev, 2006. 25(3): p. 450-482.
[13] T. Liu, M.E. Belov, N. Jaitly, et al., Accurate mass measurements in proteomics. Chem Rev, 2007. 107(8): p. 3621-3653.
[14] M.E. Monroe, N. Tolic, N. Jaitly, et al., VIPER: an advanced software package to support high-throughput LC-MS peptide identification. Bioinformatics, 2007. 23(15): p. 2021-2023.
[15] N. Jaitly, M.E. Monroe, V.A. Petyuk, et al., Robust algorithm for alignment of liquid chromatography-mass spectrometry analyses in an accurate mass and time tag data analysis pipeline. Anal Chem, 2006. 78(21): p. 7397-7409.
[16] G.R. Kiebel, K.J. Auberry, N. Jaitly, et al., PRISM: a data management system for high-throughput proteomics. Proteomics, 2006. 6(6): p. 1783-1790.
[17] M.E. Monroe, J.L. Shaw, D.S. Daly, et al., MASIC: a software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC-MS(/MS) features. Comput Biol Chem, 2008. 32(3): p. 215-217.
[18] K. Petritis, L.J. Kangas, B. Yan, et al., Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information. Anal Chem, 2006. 78(14): p. 5026-5039.
[19] T. Wei, B. Liao, B.L. Ackermann, et al., Data-driven analysis approach for biomarker discovery using molecular-profiling technologies. Biomarkers, 2005. 10(2-3): p. 153-172.
[20] B.P. Bradley, Finding biomarkers is getting easier. Ecotoxicology, 2012. 21(3): p. 631-636.
[21] Z. Feng, R. Prentice and S. Srivastava, Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective. Pharmacogenomics, 2004. 5(6): p. 709-719.
[22] J.E. McDermott, J. Wang, H.D. Mitchell, et al., Challenges in Biomarker Discovery: Combining Expert Insights with Statistical Analysis of Complex Omics Data. Expert Opin Med Diagn, 2013. 7(1): p. 37-51.
[23] R. Nugent and M. Meila, An overview of clustering applied to molecular biology. Methods Mol Biol, 2010. 620: p. 369-404.
[24] K.C. Glenn, G.H. Frost, J.S. Bergmann, et al., Synthetic peptides bind to high-affinity thrombin receptors and modulate thrombin mitogenesis. Pept Res, 1988. 1(2): p. 65-73.
[25] M. Maslowska, H. Legakis, F. Assadi, et al., Targeting the signaling pathway of acylation stimulating protein. J Lipid Res, 2006. 47(3): p. 643-652.
[26] S. Dimeloe and C. Hawrylowicz, A direct role for vitamin D-binding protein in the pathogenesis of COPD? Thorax, 2011. 66(3): p. 189-190.
[27] L.H. Shen, X.M. Zhang, D.J. Su, et al., Association of vitamin D binding protein variants with susceptibility to chronic obstructive pulmonary disease. J Int Med Res, 2010. 38(3): p. 1093-1098.
[28] A.M. Wood, C. Bassford, D. Webster, et al., Vitamin D-binding protein contributes to COPD by activation of alveolar macrophages. Thorax, 2011. 66(3): p. 205-210.
[29] R.A. Robbins, G.L. Gossman, K.J. Nelson, et al., Inactivation of chemotactic factor inactivator by cigarette smoke. A potential mechanism of modulating neutrophil recruitment to the lung. Am Rev Respir Dis, 1990. 142(4): p. 763-768.
[30] J.P. Metcalf, A.B. Thompson, G.L. Gossman, et al., Gc-globulin functions as a cochemotaxin in the lower respiratory tract. A potential mechanism for lung neutrophil recruitment in cigarette smokers. Am Rev Respir Dis, 1991. 143(4 Pt 1): p. 844-849.
[31] P. Garcia and A. Sood, Adiponectin in pulmonary disease and critically ill patients. Curr Med Chem, 2012. 19(32): p. 5493-5500.
[32] Y. Takeda, K. Nakanishi, I. Tachibana, et al., Adiponectin: a novel link between adipocytes and COPD. Vitam Horm, 2012. 90: p. 419-435.
[33] M. Dahl and B.G. Nordestgaard, Markers of early disease and prognosis in COPD. Int J Chron Obstruct Pulmon Dis, 2009. 4: p. 157-167.
[34] B. Magi, E. Bargagli, L. Bini, et al., Proteome analysis of bronchoalveolar lavage in lung diseases. Proteomics, 2006.
6(23): p. 6354-6369.
[35] K.C. Meyer, The role of bronchoalveolar lavage in interstitial lung disease. Clin Chest Med, 2004. 25(4): p. 637-649.
[36] H. Chen, D. Wang, C. Bai, et al., Proteomics-based biomarkers in chronic obstructive pulmonary disease. J Proteome Res, 2010. 9(6): p. 2798-2808.
[37] F. Sampsonas, D.P. Kontoyiannis, B.F. Dickey, et al., Performance of a standardized bronchoalveolar lavage protocol in a comprehensive cancer center: a prospective 2-year study. Cancer, 2011. 117(15): p. 3424-3433.