Supplementary Methods

Sample processing

Two experimental data sets were used in this study for illustrative purposes. The larger, human sample, representing a typical proteome-scale analytical sample, was derived from a fraction of an SDS-PAGE-separated Jurkat lysate. The second, yeast-derived data set, in which correct identifications were present in a much higher proportion, was acquired as part of a previous study¹. For each protein source, whole-cell lysate was alkylated with iodoacetamide and fractionated by SDS-PAGE (4-12% gradient Tris-glycine gel, Invitrogen). Proteins were then digested in-gel with trypsin, extracted, and cleaned by off-line desalting with Sep-Pak C18 solid-phase resin (Waters, Milford, MA). Lyophilized samples were redissolved in 7.5% acetonitrile/5% formic acid to a final concentration of approximately 1 µg/µL.

Liquid chromatography and tandem mass spectrometry (LC-MS/MS)

LC-MS/MS experiments were performed on an LTQ FT mass spectrometer (Thermo Electron, San Jose, CA) equipped with an Agilent 1100 high-performance liquid chromatography (HPLC) pump (Agilent Technologies, Palo Alto, CA) and a Famos autosampler (LC Packings, San Francisco, CA). Peptide mixtures were introduced into the mass spectrometer via a fused-silica microcapillary column (internal diameter = 125 µm) ending in an in-house pulled needle tip (internal diameter ~ 5 µm). Columns were packed to a length of 18 cm with a C18 reversed-phase resin (Magic C18AQ, Michrom Bioresources, Auburn, CA). Approximately 2 µg of sample solution was loaded onto the column.
Peptides were eluted into the electrospray ionization source of the mass spectrometer via a linear gradient of 12 to 33% buffer B (2.5% water and 0.1% formic acid in acetonitrile (v/v)) in buffer A (2.5% acetonitrile and 0.1% formic acid in water (v/v)) over 96 minutes (Jurkat sample) or 3 to 37% buffer B over 90 minutes (yeast sample), followed by a high-organic wash (100% buffer B, 6 minutes) and a column-reconditioning wash (100% buffer A, 50 minutes). Eluting peptides were measured by the LTQ FT mass spectrometer operating in data-dependent mode. For the Jurkat sample, eight ion-trap MS/MS spectra were acquired per data-dependent cycle from a high-resolution (R set at 100,000) FTICR master spectrum (mass range = 350–1700 m/z). The yeast sample was analyzed with a SIM3 method as described¹.

Data processing

Resulting MS/MS spectra were searched with the SEQUEST² or Mascot³ algorithms (where noted) against a composite sequence database. This database consisted of sample-appropriate protein sequences, either downloaded from the Saccharomyces Genome Database (SGD, Stanford University, CA) for the yeast sample or taken from the minimally redundant human sequences stored at the International Protein Index⁴ (downloaded February 2006), together with common contaminant protein sequences and reversed versions of all of these sequences. Alternatively, where indicated, these data were searched separately against either the downloaded sequences (target) or the reversed sequences (decoy). Pseudo-reversed sequences were generated on the fly by the implementation of SEQUEST on the SEQUEST Sorcerer platform (Sage-N Research, San Jose, CA). Random and Markov chain-modeled decoy sequence databases were constructed from amino acid frequencies in the target database using an in-house algorithm written in the Perl programming language. For the Markov database, new residues were selected based on the preceding four residues. All SEQUEST searches were performed on the SEQUEST Sorcerer platform.
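The three kinds of decoy sequence described above (reversed, random, and Markov chain-modeled) can be illustrated with a minimal Python sketch. This is not the in-house Perl algorithm, whose details are not given here. All function names, the choice to match decoy length to target length, the seeding of the Markov chain with frequency-weighted residues, and the fallback for unseen four-residue contexts are assumptions made for illustration. The plain reversal shown is also not the pseudo-reversal performed by the Sorcerer platform.

```python
import random
from collections import Counter, defaultdict

ORDER = 4  # preceding residues used as Markov context (value taken from the text)

def reversed_decoy(sequence):
    """Return the sequence read from C-terminus to N-terminus (plain reversal)."""
    return sequence[::-1]

def residue_frequencies(targets):
    """Count how often each amino acid occurs across all target sequences."""
    counts = Counter()
    for seq in targets:
        counts.update(seq)
    return counts

def random_decoy(length, counts, rng):
    """Draw residues independently, weighted by target-database frequencies."""
    residues = sorted(counts)
    weights = [counts[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))

def markov_model(targets, order=ORDER):
    """Tabulate which residues follow each run of `order` residues in the targets."""
    model = defaultdict(Counter)
    for seq in targets:
        for i in range(order, len(seq)):
            model[seq[i - order:i]][seq[i]] += 1
    return model

def markov_decoy(length, model, counts, rng, order=ORDER):
    """Grow a decoy one residue at a time, conditioned on the preceding `order` residues."""
    # Assumption: the first `order` residues are seeded from overall frequencies.
    decoy = random_decoy(min(order, length), counts, rng)
    while len(decoy) < length:
        followers = model.get(decoy[-order:])
        if followers:
            residues = sorted(followers)
            weights = [followers[r] for r in residues]
            decoy += rng.choices(residues, weights=weights, k=1)[0]
        else:
            # Assumption: contexts never seen in the targets fall back to overall frequencies.
            decoy += random_decoy(1, counts, rng)
    return decoy

# Hypothetical toy "database" of two made-up sequences, for demonstration only.
targets = ["MKTAYIAKQRQISFVKSHFSRQ", "MSDNGELEDKPPAPPVRMSSTIFSTGG"]
rng = random.Random(0)
counts = residue_frequencies(targets)
model = markov_model(targets)
decoys = {
    "reversed": [reversed_decoy(t) for t in targets],
    "random": [random_decoy(len(t), counts, rng) for t in targets],
    "markov": [markov_decoy(len(t), model, counts, rng) for t in targets],
}
```

With a real target database, the Markov model would be built once from all target sequences and then sampled to produce one decoy per target entry.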
All Mascot searches were performed on an in-house dual-processor Linux server. Searches against the human databases were performed using the following parameters: at least one tryptic terminus for all considered peptides, a mass tolerance of ±50 ppm, variable oxidation of methionine residues (+15.99491 Da), and static modification of cysteine residues with iodoacetamide (+57.02146 Da). Fragment-ion mass tolerances for SEQUEST and Mascot searches were left at their default values.

Modeling the error associated with measured FP rates was performed as follows. Software was written to simulate the effect that filtering criteria have on FP estimations derived from set numbers of correct and incorrect PSMs. This program exploited the target-decoy principle and therefore relied on the same assumptions explained previously, namely that all decoy hits are incorrect and that there are equal numbers of incorrect target and decoy hits. The program took as input the total number of hits to consider and the proportion of them that were actually correct (i.e., precision, Table 1). These correct hits were assigned a “target” state. For each of the remaining incorrect hits, the program randomly assigned a “target” or “decoy” state. Once all hits were assigned a state, the FP rate was estimated by doubling the number of decoy hits and dividing this by the total number of hits, and the estimated precision was taken as one minus this value. The estimated precision was then subtracted from the predetermined precision rate to give the deviation between the actual and estimated precision rates (estimation error). This process was repeated 100,000 times to create a distribution of estimation error from which a standard deviation was derived. Such standard deviation measurements were made for many combinations of input total hits and precision (Fig. 5a).

For estimating the frequencies of incorrect identifications and establishing PSM selection criteria, redundant PSMs were first removed, keeping the top-scoring PSM as a single representative.
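As a rough illustration of the estimation-error simulation described above, the following Python sketch repeats the random target/decoy assignment and reports the standard deviation of the resulting error. It is a minimal sketch, not the original software. The function name, the use of the population standard deviation, and the rounding of the number of correct hits are assumptions. The sketch also interprets twice the decoy count divided by the total as the estimated FP rate, with the estimated precision taken as one minus that value.

```python
import random
import statistics

def estimation_error_sd(total_hits, precision, trials=100_000, seed=None):
    """Standard deviation of (actual - estimated) precision over repeated trials."""
    rng = random.Random(seed)
    n_correct = round(total_hits * precision)   # correct hits are always "target"
    n_incorrect = total_hits - n_correct        # incorrect hits get a random state
    actual_precision = n_correct / total_hits
    errors = []
    for _ in range(trials):
        # Each incorrect hit becomes a decoy with probability 1/2; counting the
        # set bits of n_incorrect random bits performs all assignments at once.
        if n_incorrect > 0:
            decoys = bin(rng.getrandbits(n_incorrect)).count("1")
        else:
            decoys = 0
        estimated_fp_rate = 2 * decoys / total_hits
        estimated_precision = 1 - estimated_fp_rate
        errors.append(actual_precision - estimated_precision)
    return statistics.pstdev(errors)

# Example: 1,000 hits of which 90% are correct (fewer trials than in the text, for speed).
sd = estimation_error_sd(1000, 0.90, trials=10_000, seed=1)
print(f"SD of estimation error: {sd:.4f}")
```

Under the stated assumptions the decoy count follows a binomial distribution, so the simulated standard deviation should approach sqrt(n_incorrect) / total_hits, which is 0.01 for this example. This offers a quick sanity check on the simulation.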
Confidently assigned peptide hits were selected by an in-house program, similar in principle to one previously described⁵, that took into account charge, tryptic and missed-cleavage states, and SEQUEST's XCorr and ΔCn scores or Mascot's Ion Score and homology factor⁶. With this program, low-confidence peptides, such as those with one tryptic terminus, multiple missed cleavages, and low scores, were automatically excluded from subsequent analyses.

Additional computational experiments were conducted using a combination of software written in the programming languages Perl and PHP, with additional support from Microsoft Excel and a MySQL database.

References

1. Haas, W., Faherty, B.K., Gerber, S.A., Elias, J.E., Beausoleil, S.A., Bakalarski, C.E., Li, X., Villen, J. & Gygi, S.P. Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Mol Cell Proteomics (2006).
2. Eng, J.K., McCormack, A.L. & Yates, J.R., 3rd. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5, 976-989 (1994).
3. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551-3567 (1999).
4. Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E. & Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985-1988 (2004).
5. Kislinger, T., Rahman, K., Radulovic, D., Cox, B., Rossant, J. & Emili, A. PRISM, a generic large scale proteomic investigation strategy for mammals. Mol Cell Proteomics 2, 96-106 (2003).
6. Elias, J.E., Haas, W., Faherty, B.K. & Gygi, S.P. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods 2, 667-675 (2005).