Supporting Information

Materials and Methods

1. Datasets

Dataset 1 (UPS1): This dataset is a protein standard derived from a set of 48 human proteins (Sigma, Universal Proteomics Standard UPS1). It was previously used to validate the accuracy of MascotPercolator [1]. The MS/MS spectra (8191 spectra) were downloaded from ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/.

Dataset 2 (iPS-201B7): This dataset is from a large-scale proteome analysis of human induced pluripotent stem cells (201B7-P32), analyzed on an AB Sciex TripleTOF 5600 System. The generation process is described in detail in reference [2]. The raw data were downloaded from the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the dataset identifier PXD000071.

Datasets 3 and 4 (Yeast-ETD and Yeast-ETcaD): The yeast dataset was collected from 12 SCX fractions analyzed over a 40 min gradient on a modified hybrid linear ion trap-Orbitrap (Thermo Scientific). The raw data were downloaded from PeptideAtlas [3] (PAe001453) and are described in reference [4]. We used two of the available datasets from PeptideAtlas: ETD (Trial 2, BioRep 2) (Yeast-ETD) and ETcaD (Single Trial) (Yeast-ETcaD).

Datasets 5 and 6 (Ecoli-CID and Ecoli-ETcaD): The E. coli dataset is from the previous MascotPercolator study by Wright et al. and was analyzed by LC-MS/MS using a dual-pressure linear ion trap Orbitrap instrument capable of both CID and ETcaD fragmentation [5]. The data are available for download from PRIDE (http://www.ebi.ac.uk/pride) with accessions 18990 and 19002.

2. MS/MS database searching

Dataset 1 (UPS1): Peak lists (8190 spectra) from the UPS1 experiments were searched against a bipartite database [6] and a decoy database using the Open Mass Spectrometry Search Algorithm (OMSSA 2.1.9) and Mascot 2.3.02 (Matrix Science, London, UK).
The bipartite database contained the UniProt sequences for the 48 standard proteins, plus common contaminants and entrapment protein sequences from the H. influenzae proteome database (UniProtKB/Swiss-Prot, downloaded on 7 July 2013, 1709 target sequences). The decoy database was generated by reversing the full bipartite target database. Mascot used the following parameters: enzyme = Trypsin; maximum missed cleavages = 3; fixed modifications = Carbamidomethyl (C); variable modifications = Oxidation (M), Deamidated (NQ), Acetyl (Protein N-term); precursor mass tolerance = 20 ppm; fragment mass tolerance = 0.5 Da; instrument = Default; decoy = 0. The majority of OMSSA settings were kept the same as for Mascot: “-to 0.5 -te 20 -teppm -i 1,4 -mf 3 -mv 1,4,10 -e 0 -nt 8 -v 3 -zh 4 -zl 2 -zcc 1 -w -he 1000” (the OMSSA search terms are described in Table S2). The resulting PSMs from the bipartite database were filtered, and hits to the entrapment proteins were used to estimate false positives over a range of OMSSAPercolator q-values.

Dataset 2 (iPS-201B7): The raw MS data files were processed and converted into MGF format using ProteoWizard 3.0.4472 (http://proteowizard.sourceforge.net/) [7]. The MS/MS spectra were then searched by OMSSA 2.1.9 and Mascot 2.3.02 against the human UniProt database (UniProtKB/Swiss-Prot, downloaded on 7 July 2013, 88354 target sequences) concatenated with a decoy database generated by reversing the full target database. Mascot used the following parameters: enzyme = Trypsin; maximum missed cleavages = 2; fixed modifications = Carbamidomethyl (C); variable modifications = Oxidation (M), Deamidated (NQ), Acetyl (Protein N-term); precursor mass tolerance = 50 ppm; fragment mass tolerance = 0.1 Da; instrument = Default; decoy = 0; peptide isotope error = 1.
The majority of OMSSA settings were kept the same as for Mascot: “-to 0.1 -te 50 -teppm -i 1,4 -mf 3 -mv 1,4,10 -e 0 -nt 8 -v 2 -zcc 1 -zh 4 -zl 2 -cp 1 -tem 4 -ti 1 -w -he 1000” (the OMSSA search terms are described in Table S2).

Datasets 3 and 4 (Yeast-ETD and Yeast-ETcaD): The raw MS data were processed as described in the MascotPercolator study [5]. The MS/MS spectra were searched by OMSSA 2.1.9 and Mascot 2.3.02 against the translated Saccharomyces Genome Database (SGD [8], http://www.yeastgenome.org/) concatenated with a decoy database generated by reversing the full target database. Mascot used the following parameters: enzyme = Lys-C; maximum missed cleavages = 3; fixed modifications = Carbamidomethyl (C); variable modifications = Oxidation (M), Deamidated (NQ), Acetyl (Protein N-term); precursor mass tolerance = 50 ppm; fragment mass tolerance = 0.5 Da; instrument = ETD-TRAP; peptide isotope error = 1; decoy = 0. The majority of OMSSA settings were kept the same as for Mascot: “-w -he 1000 -to 0.5 -te 50 -teppm -i 2,4,5 -mv 1,4,10 -mf 3 -e 5 -v 3 -zh 7 -nt 8 -zcc 1 -hl 3 -h1 3 -h2 3 -cp 1 -tem 4 -ti 1” (the OMSSA search terms are described in Table S2).

Datasets 5 and 6 (Ecoli-CID and Ecoli-ETcaD): Peak lists were searched by OMSSA 2.1.9 and Mascot 2.3.02 against the same database. Both the peak lists and the database were generated as described by Wright et al. [5]. For the CID dataset, Mascot used the following parameters: enzyme = Trypsin; maximum missed cleavages = 3; fixed modifications = Carbamidomethyl (C); variable modifications = Oxidation (M), Deamidated (NQ); precursor mass tolerance = 50 ppm; fragment mass tolerance = 1.5 Da; instrument = Default; decoy = 0.
The majority of OMSSA settings were kept the same as for Mascot: “-w -he 1000 -to 1.5 -te 50 -teppm -i 1,4 -mv 1,4 -mf 3 -e 0 -v 3 -zh 4 -nt 8 -zcc 1 -cp 1 -hl 3 -h1 3 -h2 3” (the OMSSA search terms are described in Table S2). For the ETD dataset, Mascot used the following parameters: enzyme = Trypsin; maximum missed cleavages = 3; fixed modifications = Carbamidomethyl (C); variable modifications = Oxidation (M), Deamidated (NQ); precursor mass tolerance = 50 ppm; fragment mass tolerance = 1.5 Da; instrument = ETD-TRAP; decoy = 0. The majority of OMSSA settings were kept the same as for Mascot: “-w -he 1000 -to 1.5 -te 50 -teppm -i 2,4,5 -mv 1,4 -mf 3 -e 0 -v 3 -nt 8 -zh 7 -zcc 1 -cp 1 -hl 3 -h1 3 -h2 3” (the OMSSA search terms are described in Table S2).

The decoy databases were generated by the Perl script decoy.pl, which is provided by Matrix Science (http://www.matrixscience.com/help/decoy_help.html). Mascot Percolator v2.02 [1] was downloaded from http://www.sanger.ac.uk/resources/software/mascotpercolator/ and Percolator v2.04 was downloaded from http://per-colator.com/.

Table S1. Features used in OMSSAPercolator. In total, 28 features are passed from OMSSAPercolator to Percolator as the input feature vector.

| Index | Features | Description |
|---|---|---|
| 1 | Log10Evalue | negative log10-value of E-value |
| 2 | Mass | calculated peptide mass in Da |
| 3 | Charge | peptide charge |
| 4-5 | DeltaMass, DeltaMassPPM | calculated minus observed peptide mass (in Dalton and ppm) |
| 6-7 | absDM, absDMppm | absolute value of calculated minus observed peptide mass (in Dalton and ppm) |
| 8-9 | isoDM, isoDMppm | calculated minus observed peptide mass, isotope error corrected (in Dalton and ppm) |
| 10 | VarModRatio | number of sites with variable modifications divided by the number of sites with potential variable modifications |
| 11 | TotalIntensity | total intensity, natural logarithm transformed |
| 12 | MatchedIonInt | total intensity of matched ions, natural logarithm transformed |
| 13 | relTotMatchedIonInt | total intensity of all matched ions divided by the total intensity of the spectrum |
| 14 | MaxMatchedIonInt | maximum intensity of matched fragment ions |
| 15 | FragError | mean mass error of matched fragment ions (in Dalton) |
| 16-17 | FragDeltaM_Med, FragDeltaM_MedPPM | median mass error of matched fragment ions (in Dalton and ppm) |
| 18-19 | FragDeltaM_Iqr, FragDeltaM_IqrPPM | interquartile range of mass errors of matched fragment ions (in Dalton and ppm) |
| 20 | Qmatch | number of peptide matches for which an MS/MS match was attempted (peptide to query, 1:n) |
| 21 | Longest | longest matched fragment ion series |
| 22 | EnzTryC | C-terminal enzymatic (tryptic) site, boolean |
| 23 | EnzTryN | N-terminal enzymatic (tryptic) site, boolean |
| 24 | PepLen | length of peptide sequence |
| 25 | Log10Pvalue | positive log10-value of P-value |
| 26 | fracIonSeries | fraction of calculated ions matched, reported separately for each ion series |
| 27 | relMatchedIonInt | relative ion intensity of each ion series |
| 28 | EnzN | number of enzymatic sites excluding terminal sites |

Table S2. Descriptions of the search terms used in OMSSA.
| Parameter terms | Description |
|---|---|
| -w | include spectra and search params in search results |
| -he | the maximum evalue allowed in the hit list |
| -to | product ion m/z tolerance in Da |
| -te | precursor ion m/z tolerance in Da (or ppm if -teppm flag set) |
| -teppm | search precursor masses in units of ppm |
| -i | id numbers of ions to search (comma delimited, no spaces) |
| -mf | comma delimited (no spaces) list of id numbers for fixed modifications |
| -mv | comma delimited (no spaces) list of id numbers for variable modifications |
| -e | id number of enzyme to use |
| -v | number of missed cleavages allowed |
| -nt | number of search threads to use |
| -zh | maximum precursor charge to search when not 1+ |
| -zcc | how should precursor charges be determined? (1 = believe the input file, 2 = use a range) |
| -zl | minimum precursor charge to search when not 1+ |
| -cp | eliminate charge reduced precursors in spectra (0 = no, 1 = yes) |
| -hl | maximum number of hits retained per precursor charge state per spectrum |
| -h1 | number of peaks allowed in single charge window (0 = number of ion species) |
| -h2 | number of peaks allowed in double charge window (0 = number of ion species) |
| -tem | precursor ion search type (0 = mono, 1 = avg, 2 = N15, 3 = exact, 4 = multiisotope) |
| -ti | when doing multiisotope search, number of isotopic peaks to search (0 = monoisotopic peak only) |

Figure S1. Performance comparison between OMSSAPercolator (OP), OMSSA, Mascot Percolator (MP) and Mascot at different empirical PSM-level q-values on (A) UPS1, (B) iPS-201B7, (C) Yeast-ETcaD and (D) Ecoli-ETcaD. The number of target PSMs was plotted against each q-value threshold.

Figure S2. Performance comparison between OMSSAPercolator (OP), OMSSA, Mascot Percolator (MP) and Mascot at different empirical peptide-level q-values on (A) UPS1, (B) iPS-201B7, (C) Yeast-ETcaD, (D) Ecoli-ETcaD, (E) Yeast-ETD and (F) Ecoli-CID. The number of peptides was plotted against each q-value threshold.
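
The curves in Figures S1 and S2 count accepted target identifications at each q-value threshold. The counting step can be illustrated with a minimal Python sketch. It assumes a plain target-decoy estimate (decoys divided by targets above a score cutoff) and is not Percolator's own q-value estimation; the function names and the toy scores are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

Psm = Tuple[float, bool]  # (score, is_decoy); higher score = better match


def target_decoy_qvalues(psms: List[Psm]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    The FDR at each rank is estimated as decoys / targets among all PSMs
    at or above that rank. The q-value of a PSM is the smallest FDR over
    all cutoffs that still accept it.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # Walk from the worst score upwards so q-values never decrease
    # as the score gets worse.
    qvals = [0.0] * len(ranked)
    running_min = 1.0
    for i in range(len(ranked) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvals[i] = running_min
    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]


def count_targets_at(qvalued: List[Tuple[float, bool, float]], threshold: float) -> int:
    """Number of target PSMs with q-value <= threshold (one point on a curve)."""
    return sum(1 for _, is_decoy, q in qvalued if not is_decoy and q <= threshold)


if __name__ == "__main__":
    # Toy scores, purely illustrative.
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False),
           (6.0, False), (5.0, True), (4.0, False)]
    scored = target_decoy_qvalues(toy)
    for t in (0.01, 0.3, 0.5):
        print(t, count_targets_at(scored, t))
```

Evaluating `count_targets_at` over a grid of thresholds gives one curve per method; the same counting applies at the peptide level once each peptide is represented by a single best-scoring PSM.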

References

[1] Brosch, M., Yu, L., Hubbard, T., Choudhary, J., Accurate and sensitive peptide identification with Mascot Percolator. Journal of proteome research 2009, 8, 3176-3181.
[2] Yamana, R., Iwasaki, M., Wakabayashi, M., Nakagawa, M., et al., Rapid and deep profiling of human induced pluripotent stem cell proteome by one-shot NanoLC-MS/MS analysis with meter-scale monolithic silica columns. Journal of proteome research 2013, 12, 214-221.
[3] Desiere, F., Deutsch, E. W., King, N. L., Nesvizhskii, A. I., et al., The PeptideAtlas project. Nucleic acids research 2006, 34, D655-658.
[4] Swaney, D. L., McAlister, G. C., Coon, J. J., Decision tree-driven tandem mass spectrometry for shotgun proteomics. Nature methods 2008, 5, 959-964.
[5] Wright, J. C., Collins, M. O., Yu, L., Kall, L., et al., Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator. Molecular & cellular proteomics: MCP 2012, 11, 478-491.
[6] Klimek, J., Eddes, J. S., Hohmann, L., Jackson, J., et al., The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools. Journal of proteome research 2008, 7, 96-103.
[7] Kessner, D., Chambers, M., Burke, R., Agus, D., Mallick, P., ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534-2536.
[8] Cherry, J. M., Adler, C., Ball, C., Chervitz, S. A., et al., SGD: Saccharomyces Genome Database. Nucleic acids research 1998, 26, 73-79.