Supporting Information

Materials and Methods

Datasets

D1, a complex sample dataset derived from human liver, was used to evaluate the sensitivity of peptide identification. The detailed generation process is described in reference [1]. In brief, the sample was digested with trypsin, separated by strong cation exchange chromatography, and analyzed on an LTQ-FT mass spectrometer (Thermo Scientific, San Jose, CA). The raw data were converted to peak lists using iPE-MMR[2]. iPE-MMR is a combined method for precursor mass refinement that integrates DeconMSn[3], PE-MMR[4], and DtaRefinery[5] into one analysis pipeline to correct the monoisotopic mass error and the systematic mass error of tandem mass spectrometric data. The software (version 1.2) was downloaded from http://omics.pnl.gov/software/PEMMR.php. We used the default parameter settings of iPE-MMR for the conversion, which ultimately yielded 24,302 spectra. After calibration, the precursor ion mass error distribution improved from 7.48±3.37 ppm to 0.00±1.72 ppm.

D2, a protein standard dataset derived from a set of 48 human proteins (Sigma, Universal Proteomics Standard Set UPS1), was previously used to demonstrate the accuracy of MASCOT Percolator (MP)[6]. The raw data, generated on an LTQ-FT mass spectrometer, were converted into peak lists with BioWorks 3.2 (Thermo Scientific). The preprocessing parameters were described in detail by Brosch et al.[6]. Note that systematic mass errors of the precursor ions had already been eliminated[6]. The MS/MS spectra (8191 spectra) were downloaded from ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/.

Database construction

The human IPI database (version 3.63, 84,229 sequences) including common external contaminants from the common Repository of Adventitious Proteins (cRAP, http://www.thegpm.org/crap/index.html) was used as the target database for D1.
Another human IPI database (68,322 sequences) including the 48 standard protein sequences and common external contaminants from cRAP was used as the target database for D2[6]. This database was provided by Brosch et al.[6] and was downloaded from ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/. The decoy databases used for D1 and D2 were constructed by randomizing the protein sequences while maintaining the average amino acid composition (RND). They were generated with the Perl script decoy.pl, which is provided by Matrix Science (http://www.matrixscience.com/help/decoy_help.html). For D2, three decoy databases were generated to eliminate the statistical bias of q-value estimation.

MS/MS database searching

D1 and D2 were searched with MASCOT (version 2.2) using the following parameters: precursor mass tolerance, 20 ppm; monoisotopic mass for precursor ions; product ion mass tolerance, 0.5 Da; variable modifications, carbamidomethylation of cysteine, deamidation of asparagine or glutamine, and oxidation of methionine; maximum missed cleavages, 2. Each dataset was searched against the target and decoy databases separately, under both the trypsin and semi-trypsin enzyme settings.

QC methods

Two QC methods built on the same Percolator technology[7] were compared in this study: MP[6-7] and an in-house developed tool, PepDistiller.

MP: MASCOT Percolator (version 1.09, packaged with Percolator version 1.12)[6-7] was used for the comparison. MP was specially designed to maximize the number of target hits passing a user-specified FDR threshold (e.g., 1%) through an iterative support vector machine (SVM) classifier, i.e. Percolator[7].
The iterative procedure selects a subset of high-confidence target PSMs from the previous iteration as a positive training set and half of the decoy PSMs as a negative training set, trains an SVM on them, and re-ranks the entire set of PSMs in the next iteration. After several iterations, the number of target PSMs above the FDR threshold converges. The final SVM is then applied to the entire set of target PSMs and the remaining half of the decoy PSMs to obtain unbiased q-value estimates. MP extracts PSM features from MASCOT dat files as the input feature vector to Percolator. The semi-supervised nature of Percolator also makes it adaptive to datasets from different experimental conditions, such as different samples and mass spectrometers. The features applied in MP can be found in the 'config.properties' file in the software package, and the detailed definitions of the features are described in ref. [6].

PepDistiller: PepDistiller was designed as a quality control method to distill high-confidence peptide identifications. It includes all of the features used in MP and executes Percolator (version 1.12) to discriminate between correct and incorrect matches. However, there are two major improvements: (1) in addition to the feature set used in MASCOT Percolator, the number of tryptic termini (NTT) was added to PepDistiller to improve the performance of peptide identification from semi-tryptic search results (Table S1); (2) the refined FDR estimation method proposed by Navarro et al.[8] was integrated into PepDistiller to accurately determine the confidence of peptide identifications. PepDistiller is written in Perl and can be downloaded from http://bioinfo.hupo.org.cn/tools/PepDistiller.

Method of FDR calculation

Based on the target-decoy strategy, several methods have been used to calculate the FDR[7-11].
Because Percolator is specially designed for separate searches, two FDR estimation methods designed for separate searches were considered in this study. One is the method applied in Percolator, which we term the PIT-fixed FDR estimation and which is described as follows. Denote the scores of target PSMs as $t_1, t_2, \ldots, t_{m_t}$ and the scores of decoy PSMs as $d_1, d_2, \ldots, d_{m_d}$. Here, $m_t$ is the number of target PSMs, and $m_d$ is the number of decoy PSMs. For a given threshold $s$ in a separate search (SS), the FDR is calculated as follows[6-7]:

\[
E\{\mathrm{FDR}_{\mathrm{PIT}}(s)\} = \pi_0 \, \frac{m_t}{m_d} \, \frac{|\{d_i \ge s,\ i = 1, 2, \ldots, m_d\}|}{|\{t_i \ge s,\ i = 1, 2, \ldots, m_t\}|} \tag{1}
\]

where $\pi_0$ (PIT) is the estimated proportion of target PSMs that are incorrect, which can also be calculated by Qvality[12].

The other method, named the refined FDR estimation, was proposed by Navarro et al.[8] and integrated into PepDistiller. The refined FDR is simply calculated as follows:

\[
E\{\mathrm{FDR}_{\mathrm{Refined}}(s)\} = \frac{d_o + 2 d_b}{d_b + t_b + t_o} \tag{2}
\]

where $d_o$ (decoy only) is the number of PSMs with scores above the threshold $s$ in the decoy database (DDB) but not in the target database (TDB). Analogously, $t_o$ (target only) is the number of PSMs with scores above the threshold $s$ in TDB but not in DDB. $d_b$ (decoy better) is the number of PSMs with scores above $s$ in both TDB and DDB but with better scores in DDB, and $t_b$ (target better) is the number of PSMs with scores above $s$ in both TDB and DDB but with better scores in TDB.

After the FDRs were estimated, the q-value associated with a PSM of score $s$ was calculated using formula (3):

\[
q(s) = \min_{x \le s} E\{\mathrm{FDR}(x)\} \tag{3}
\]

where FDR could be either the PIT-fixed FDR or the refined FDR.

Given a threshold $s$, the actual FDR is calculated as FP/(FP+TP), where FP is the number of false positive hits with scores above $s$, and TP is the number of true positive hits with scores above $s$ that belong to the standard proteins in the sample.
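The two FDR estimators and the q-value definition above can be illustrated with a minimal Python sketch. This is not the PepDistiller or Percolator code. The function names, the representation of a separate search as one (target score, decoy score) pair per spectrum, the reading of "above the threshold" as "greater than or equal to", the counting of ties as target-better, and the return value of 0 when no target passes the threshold are all assumptions made for illustration. The value of pi0 is supplied by the caller (for example, from Qvality).

```python
from typing import Callable, Dict, Sequence, Tuple


def pit_fixed_fdr(targets: Sequence[float], decoys: Sequence[float],
                  s: float, pi0: float = 1.0) -> float:
    """Formula (1): pi0 * (m_t / m_d) * #{d_i >= s} / #{t_i >= s}."""
    n_t = sum(t >= s for t in targets)
    n_d = sum(d >= s for d in decoys)
    if n_t == 0:
        return 0.0  # assumption: no accepted targets means no false discoveries
    return pi0 * (len(targets) / len(decoys)) * n_d / n_t


def refined_fdr(pairs: Sequence[Tuple[float, float]], s: float) -> float:
    """Formula (2): (do + 2*db) / (db + tb + to).

    Each pair holds the (target score, decoy score) of one spectrum.
    """
    do = to = db = tb = 0
    for t, d in pairs:
        if t >= s and d >= s:
            if d > t:
                db += 1  # both pass, decoy match scores better
            else:
                tb += 1  # both pass, target better (ties counted here: assumption)
        elif d >= s:
            do += 1      # only the decoy match passes
        elif t >= s:
            to += 1      # only the target match passes
    denom = db + tb + to
    return (do + 2 * db) / denom if denom else 0.0


def q_values(thresholds: Sequence[float],
             fdr: Callable[[float], float]) -> Dict[float, float]:
    """Formula (3): q(s) = min over all thresholds x <= s of FDR(x)."""
    q: Dict[float, float] = {}
    best = float("inf")
    for x in sorted(set(thresholds)):  # ascending, so 'best' covers every x <= s
        best = min(best, fdr(x))
        q[x] = best
    return q


if __name__ == "__main__":
    # Hypothetical scores for six spectra searched against TDB and DDB.
    targets = [50, 40, 30, 20, 45, 8]
    decoys = [10, 30, 5, 35, 48, 12]
    pairs = list(zip(targets, decoys))
    print(pit_fixed_fdr(targets, decoys, 25))
    print(refined_fdr(pairs, 25))
    print(q_values(targets, lambda x: refined_fdr(pairs, x)))
```

Because the q-value is a running minimum over increasingly permissive thresholds, it never increases as the score increases, even where the raw FDR estimate fluctuates between neighbouring thresholds.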
After the actual FDR was calculated for each threshold, the actual q-value was calculated in the same manner as formula (3). For D2, after each target-decoy database search, the estimated q-values were plotted against the actual q-values to reveal the accuracy of the estimated q-values. The three scatter plots were then smoothed with the local regression method LOESS[13].

Table S1. Features used in PepDistiller to represent PSMs. In total, 17 features are applied in PepDistiller as the input feature vector to Percolator, of which 16 are identical to the features used in MP. These features fall into three groups: 1–5 relate to the mass error of parent and product ions; 6–10 relate to the properties of identified peptides; and 11–17 relate to the quality of PSMs.

ID    Feature                      Definition
1     DeltaM                       Difference between the calculated and observed peptide mass (in Dalton and ppm)
2     AbsDeltaM                    Absolute value of DeltaM (in Dalton and ppm)
3     IsoDeltaM                    Isotopic error corrected DeltaM (in Dalton and ppm)
4     FragDeltaM_median            Median of fragment ions deltaM (in Dalton and ppm)
5     FragDeltaM_iqr               Interquartile range (IQR) of fragment ions deltaM (in Dalton and ppm)
6     MrCalc                       Calculated monoisotopic mass of identified peptide
7     Charge                       Charge state of parent ion
8     MC                           Number of missed tryptic cleavages
9a    NTT                          Number of tryptic termini (equals 0 for fully tryptic searches)
10    VarMods                      Number of modified sites / number of modifiable sites
11    IonScore                     MASCOT ion score
12    dIonsScore                   Difference between ion scores of the best and second-best non-isobaric match
13    FractionsIonsMatchedB1-Y2    Fractions of matched ions (per ion series)
14    TotInt                       The sum of all ion intensities (log)
15    IntMatchedTot                The sum of all matched ion intensities (log)
16    RelIntMatchedTot             IntMatchedTot / TotInt
17    RelIntMatchedB1-Y2           Relative intensity matched (per ion series)

a NTT is not included in the feature set of MP but is incorporated into
PepDistiller.

Table S2. Homology FP matches with q-values lower than 0.01 generated from the semi-tryptic search results of the standard dataset (D2). The matches are grouped by the homologous sequence of the standard protein.

Spectrum                                 Peptide Sequence               Ion Score   ppm

Homology Sequence: DVTVLQNTDGNNNEAWAK; Standard Protein: TRFL_HUMAN
FTHPS_2007Sept07-01.6149.6149.2.dta      DVTVLQNTDGNNNDAWAK             109.47      1.51
FTHPS_2007Sept07-01.1562.1562.2.dta      QNTDGNNNDAWAK                  76.57       1.61
FTHPS_2007Sept07-01.6320.6320.2.dta      DVTVLQNTDGNNNDAWAK             71.08       -1.15

Homology Sequence: GTTAEEAGIGDTPSLEDEAAGHVTQEP; Standard Protein: TAU_HUMAN
FTHPS_2007Sept07-01.4711.4711.2.dta      DTPSLEDEAAGHVTQAR              62.76       3.66
FTHPS_2007Sept07-01.6152.6152.3.dta      TTAEEAGIGDTPSLEDEAAGHVTQAR     52.43       2.13
FTHPS_2007Sept07-01.5625.5625.3.dta      GIGDTPSLEDEAAGHVTQAR           39.32       0.19
FTHPS_2007Sept07-01.5759.5759.3.dta      GIGDTPSLEDEAAGHVTQAR           38.54       -0.18
FTHPS_2007Sept07-01.6186.6186.3.dta      GTTAEEAGIGDTPSLEDEAAGHVTQAR    32.96       1.69
FTHPS_2007Sept07-01.5147.5147.3.dta      GIGDTPSLEDEAAGHVTQAR           32.14       0.27

Homology Sequence: GGELVDTLQFVCGDR; Standard Protein: IGF2_HUMAN
FTHPS_2007Sept07-01.11299.11299.2.dta    GAELVDALQFVCGDR                88.86       2.93
FTHPS_2007Sept07-01.10877.10877.2.dta    AELVDALQFVCGDR                 84.53       1.44
FTHPS_2007Sept07-01.10699.10699.2.dta    GAELVDALQFVCGDR                75.50       0.12
FTHPS_2007Sept07-01.10727.10727.2.dta    GAELVDALQFVCGDR                51.87       1.97

Table S3. Root mean square error (RMSE) between the estimated and actual q-values in the interval [0, 0.06] generated by two different FDR calculation methods and six different decoy designs. The q-value RMSE is given in units of 1e-2.

FDR Type     Decoy Design    Mean    σ
PIT-fixed    RND             0.78    0.02
PIT-fixed    SHF             0.76    0.05
PIT-fixed    REV             0.67    \
PIT-fixed    RNDTP           0.73    0.01
PIT-fixed    SHFTP           0.72    0.05
PIT-fixed    REVTP           0.72    \
Refined      RND             0.73    0.06
Refined      SHF             0.68    0.02
Refined      REV             0.56    \
Refined      RNDTP           0.60    0.01
Refined      SHFTP           0.65    0.02
Refined      REVTP           0.65    \

[Figure S1 image. Panel A: number of target PSMs versus q-value for MP; PD, FDRPIT; and PD, FDRRefined. Panel B: actual q-value versus estimated q-value for FDRRefined (RND, NoEnzy), FDRPIT (RND, NoEnzy), and the line Y=X.]

Figure S1.
(A) Comparison of the sensitivity of MASCOT Percolator (MP) and PepDistiller (PD) for MASCOT non-enzymatic search results. Two FDR estimation methods were applied in PepDistiller: the PIT-fixed FDR (FDRPIT) and the refined FDR (FDRRefined). The number of target PSMs was plotted against each q-value threshold. The D1 dataset was used to test the sensitivity. (B) Evaluation of the accuracy of the FDR estimates generated by the refined and PIT-fixed methods. The non-enzymatic search results of the standard dataset D2 were used. PepDistiller was used for filtering. The dataset was searched against three RND decoy databases to eliminate the statistical bias of q-value estimation. The curves were smoothed with the local regression method LOESS.

[Figure S2 image. Panels A1-E1: number of target PSMs versus q-value for MP; PD, FDRPIT; and PD, FDRRefined. Panels A2-E2: actual q-value versus estimated q-value for FDRRefined, FDRPIT, and the line Y=X. Decoy design per panel pair: A, RNDTP; B, SHF; C, SHFTP; D, REV; E, REVTP.]

Figure S2. (A1-E1) Comparison of the sensitivity of MASCOT Percolator (MP) and PepDistiller (PD) for MASCOT semi-tryptic search results using five kinds of decoy databases. Two FDR estimation methods were applied in PepDistiller: the PIT-fixed FDR (FDRPIT) and the refined FDR (FDRRefined). The number of target PSMs was plotted against each q-value threshold. The D1 dataset was used to test the sensitivity. (A2-E2) Evaluation of the accuracy of the FDR estimates generated by the refined and PIT-fixed methods. The semi-tryptic search results of the standard dataset D2 were used. Five kinds of decoy databases were applied. PepDistiller was used for filtering. The five decoy design methods are: RNDTP, SHF, SHFTP, REV, and REVTP.
(1) RNDTP randomizes the amino acids of in silico tryptic peptides, except the tryptic cleavage sites (K and R), using a uniform distribution random number generator, while preserving the average amino acid composition, protein length, and tryptic cleavage sites and positions of the target proteins; (2) SHF uses the Fisher-Yates shuffle algorithm[14] to uniformly shuffle the amino acids without introducing skewness[15]; (3) SHFTP shuffles in silico tryptic peptides as the SHF method does, while preserving the average amino acid composition, protein length, and tryptic cleavage sites (K and R) and positions[16]; (4) REV reverses the target protein sequences[17]; (5) REVTP reverses in silico tryptic peptides while preserving the average amino acid composition, protein length, and tryptic cleavage sites (K and R) and positions. REV was generated with the Perl script decoy.pl provided by Matrix Science. The others were generated with in-house developed Perl scripts, which can be downloaded from http://bioinfo.hupo.org.cn/tools/PepDistiller. When using RNDTP, SHF and SHFTP, the dataset was searched against three decoy databases generated by each method to eliminate the statistical bias of q-value estimation. The curves were smoothed with the local regression method LOESS.

Figure S3. Comparison of the processing time consumed by PepDistiller (PD) and Mascot Percolator (MP) on dataset D2.

References

[1] Zhang, J., Li, J., Liu, X., Xie, H., et al., A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinformatics 2008, 9, 29.
[2] Jung, H.-J., Purvine, S. O., Kim, H., Petyuk, V. A., et al., Integrated Post-Experiment Monoisotopic Mass Refinement: An Integrated Approach to Accurately Assign Monoisotopic Precursor Masses to Tandem Mass Spectrometric Data. Analytical Chemistry 2010, 82, 8510-8518.
[3] Mayampurath, A. M., Jaitly, N., Purvine, S. O., Monroe, M. E., et al., DeconMSn: a software tool for accurate parent ion monoisotopic mass determination for tandem mass spectra. Bioinformatics 2008, 24, 1021-1023.
[4] Shin, B., Jung, H.-J., Hyung, S.-W., Kim, H., et al., Postexperiment Monoisotopic Mass Filtering and Refinement (PE-MMR) of Tandem Mass Spectrometric Data Increases Accuracy of Peptide Identification in LC/MS/MS. Mol Cell Proteomics 2008, 7, 1124-1134.
[5] Petyuk, V. A., Mayampurath, A. M., Monroe, M. E., Polpitiya, A. D., et al., DtaRefinery, a software tool for elimination of systematic errors from parent ion mass measurements in tandem mass spectra data sets. Mol Cell Proteomics 2010, 9, 486-496.
[6] Brosch, M., Yu, L., Hubbard, T., Choudhary, J., Accurate and sensitive peptide identification with Mascot Percolator. J Proteome Res 2009, 8, 3176-3181.
[7] Kall, L., Canterbury, J. D., Weston, J., Noble, W. S., MacCoss, M. J., Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 2007, 4, 923-925.
[8] Navarro, P., Vazquez, J., A refined method to calculate false discovery rates for peptide identification using decoy databases. J Proteome Res 2009, 8, 1792-1796.
[9] Elias, J. E., Gygi, S. P., Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007, 4, 207-214.
[10] Kall, L., Storey, J. D., MacCoss, M. J., Noble, W. S., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 2008, 7, 29-34.
[11] Hather, G., Higdon, R., Bauman, A., von Haller, P. D., Kolker, E., Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010, 10, 2369-2376.
[12] Kall, L., Storey, J. D., Noble, W. S., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. Bioinformatics 2008, 24, i42-48.
[13] Cleveland, W. S., Devlin, S. J., Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 1988, 83, 596-610.
[14] Fisher, R., Yates, F., Statistical tables for biological, agricultural and medical research, Oliver & Boyd, London 1948, 26-27.
[15] Klammer, A. A., MacCoss, M. J., Effects of modified digestion schemes on the identification of proteins from complex mixtures. J Proteome Res 2006, 5, 695-700.
[16] Zhang, J., Li, J., Xie, H., Zhu, Y., He, F., A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2007, 7, 4036-4044.
[17] Moore, R. E., Young, M. K., Lee, T. D., Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom 2002, 13, 378-386.