Jagtap et al. Optimal and robust analysis of high-mass-accuracy Orbitrap datasets.

Index for Supplementary Files: Supplement_S1toS12.doc

S1: Additional descriptive statistics from salivary dataset.
S2: Representative spectra from MaxQuant processed and ReAdW4Mascot2 processed ProteinPilot searches from small salivary dataset.
S3: Summary of fractions in salivary dataset.
S4: Workflow for analyzing the human salivary dataset and results from ProteinPilot analysis.
S5: Effect of ProteoMiner treatment and mass accuracy on predicted modifications on identified peptides.
S6: Scaffold results for normalized spectral counts.
S7: Gene Ontology (GO) analysis of the whole salivary proteome.
S8: Descriptive statistics for iPRG Phosphoproteome dataset. Identification at protein level, peptide level and spectral level for phosphoproteome dataset. Mass Accuracy plots, Cumulative Mass Accuracy plots and Distribution of peptide scores for Phosphoproteome dataset.
S9: Descriptive statistics for Rat SILAC dataset. Identification at protein level, peptide level and spectral level for Rat SILAC dataset. Mass Accuracy plots, Cumulative Mass Accuracy plots and Distribution of peptide scores for Rat SILAC dataset.
S10: Tranche hyperlinks for the data.
S11: Materials and Methods.
S12: Protocol for converting .RAW files from LTQ/Orbitrap to high mass accuracy .MGF files for ProteinPilot search.

Supplement Text S1 to S11:

S1. Additional descriptive statistics from salivary dataset.

A) Summary of dataset: Mascot generic format (.MGF) peaklists were generated from .RAW files by using ReAdW4Mascot2 [Ref S1] (see 1A in the 'Dataset Key' column). In an alternative workflow, the "Quant" module from MaxQuant was used to generate .MSM files, which were then converted to .MGF format (see 1B in the 'Dataset Key' column).
The .MGF files generated by these conversion tools were searched using ProteinPilot v 4.0 against the Human IPI database (Dataset 1). The dataset was generated using an LTQ/Orbitrap mass spectrometer [Ref S2].

Dataset # (a) | Dataset Key | Description | Sample preparation | # of raw files | MS acquisition mode | Number of spectra | Peaklist generation (e)
Dataset 1 | 1A | Human whole saliva (b) | 2D fractionated and ProteoMiner treated | 20 | Centroid | 88,308 | ReAdW4Mascot2
Dataset 1 | 1B | Human whole saliva (b) | 2D fractionated and ProteoMiner treated | 20 | Centroid | 88,308 | MaxQuant "Quant" module

a All searches were conducted using ProteinPilot.
b Subset of data from Bandhakavi et al 2009 [Ref S2].
e MaxQuant "Quant" processed peaklists reflect high mass accuracy.

S1 B) Identification at peptide level and spectral level for salivary dataset. The small salivary dataset (Dataset 1) was processed with ReAdW or MaxQuant and then searched with sub-ppm instrument settings using ProteinPilot. Identifications were at 5% local FDR threshold at distinct peptide level (a) and spectral level (b).

[Figure S1 B, panels a and b]

S1 C) Cumulative Mass Accuracy plots from ProteinPilot searches of MaxQuant processed and ReAdW processed peaklists. The cumulative distribution of the percent of precursors identified by ProteinPilot is plotted against precursor Delta ppm. Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line. Spectra identified from ProteinPilot searches using the ReAdW processed peaklist are represented with a grey line.

[Figure S1 C, panel a]

Supplement S2. Representative spectra from MaxQuant processed and ReAdW4Mascot2 processed ProteinPilot searches from the small salivary dataset (Dataset 1).

The Spectral tab from ProteinPilot was used to represent fragmentation evidence for MaxQuant processed and unprocessed spectra.
In the following pages, the top half of each page shows the spectrum generated from ReAdW and the bottom half shows the spectrum generated from MaxQuant processing. The text panel also shows the spectrum number (from the .group file of the dataset), theoretical m/z value (in Da), precursor m/z value (in Da), charge state, delta mass (in Da), best peptide sequence and its annotation, modification (if any), Peptide Conf and Sc values, and the Protein Rank in the ProteinPilot .group file.

Fig S2a
Fig S2b
Fig S2c
Fig S2d
Fig S2e
Fig S2f
Fig S2g
Fig S2h

---------------------------------------------------------------------

Supplement S3: Summary of whole salivary dataset.

A) Datasets were generated using an LTQ/Orbitrap mass spectrometer. MSM peaklists were generated from .RAW files by using the "Quant" module from MaxQuant and were then converted to .MGF format. The .MGF files were searched using ProteinPilot v 4.0 against the Human database. Note that Dataset 1 from Fig 1B is a subset of this dataset.

Dataset # (a) | Description | Sample preparation | # of raw files | MS acquisition mode | Number of spectra | Peaklist generation (c)
Dataset | Human whole saliva (b) | Native sample, 2D fractionated, 3D fractionated or ProteoMiner treated | 200 | Centroid | 988,974 | MaxQuant "Quant" module

a All searches were conducted using ProteinPilot.
b Data from Bandhakavi et al 2009 [Ref S2] along with additional fractions.
c MaxQuant "Quant" processed peaklists reflect high mass accuracy.

B) Summary of whole salivary dataset fractions.

Sample fractionation | ProteoMiner treatment | Number of fractions | MS/MS spectra
2D (a) | No | 20 | 75271
2D (a) | Yes (Library-1) (d) | 20 | 87469
3D (b) | No | 41 | 250553
3D (b) | Yes (Library-1) (d) | 57 | 224079
2D (c) | Yes (Library-2) (e) | 42 | 235072
3D (b) | Yes (Library-2) (e) | 20 | 116530

a Salt fractionated
b IEF separated and salt-fractionated
c IEF separated
d ProteoMiner ™ Library-1
e ProteoMiner ™ Library-2

---------------------------------------------------------------------

Supplement S4.

A) Workflow for analyzing the human salivary dataset

Peaklists in MSM format from the LTQ/Orbitrap human salivary dataset were generated from 200 .RAW files by using the "Quant" module from MaxQuant. In the workflow that used ProteinPilot, MSM format peaklists were further converted to .mgf format and searched using ProteinPilot v 4.0 against the Human IPI database. The PDST tool was used to analyze outputs from representative fraction searches to compare the effect of ProteoMiner treatment on predicted modifications. In a subsequent workflow that used MaxQuant, MSM files were searched against the Human IPI database using Mascot v2.2. The Mascot search .dat files were then used to generate a proteingroups.txt file using the MaxQuant "Identify" module. This proteingroups.txt output was used to parse out information about Gene Ontology categories.
Representative fraction searches were used to compare the effect of ProteoMiner treatment on protein abundance.

S4 B) Human whole salivary dataset. Summary of ProteinPilot and MaxQuant results for the large (988,974 MS/MS spectra) salivary dataset (Dataset 2 in Table S3 A).

Search workflow | MS/MS identified | Peptide sequences identified | Total proteins identified (a)
ProteinPilot | 82980 | 13162 | 2224
MaxQuant | 55337 | 10757 | 2131

a. Proteins identified at 1% FDR with at least 1 peptide at 1% FDR

ProteinPilot results for whole salivary dataset: 988,974 spectra from a MaxQuant .MSM peaklist input file were searched using ProteinPilot. Identification statistics from the FDR Single Table Summary showed 82,980 spectral matches at 1% global FDR and 2224 protein identifications at 1% global FDR (S4B). Supporting FDR reports contain plots of error rates at all thresholds for both global and local error rate calculations, and ROC plots, which show absolute numbers of correct versus incorrect answers (S4C). A robust and comprehensive list of proteins (Supplement S18) was generated after analysis of whole saliva by combining MaxQuant's ability to accurately process acquired peaks with ProteinPilot's ability to search multiple modifications and perform robust protein reporting.

MaxQuant results for whole salivary dataset: The whole salivary dataset was processed with the MaxQuant workflow (with Mascot) (Supplement S4A) and the results were compared to the ProteinPilot results. The results were processed by MaxQuant's "Identify" module, which generated a list of proteins (Proteingroups.txt; Supplement S19) at 1% protein and peptide FDR thresholds. From a total of 988,974 spectra, 55,337 spectra were matched at 1% global FDR and 2131 proteins were identified at 1% global FDR with a minimum of one peptide at 1% global FDR (Table S4B). The average ppm error for the dataset was 0.56 with an SD of 0.86 ppm.
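
As a hedged illustration of the mass-accuracy statistic quoted above, the following Python sketch computes the precursor mass error in ppm and its mean and standard deviation. The function name ppm_error and the m/z pairs are hypothetical and are not taken from the dataset. The text does not state whether the reported average uses signed or absolute errors, so the sketch assumes absolute errors.

```python
import statistics

def ppm_error(observed_mz, theoretical_mz):
    """Signed precursor mass error in parts per million (hypothetical helper)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

# Hypothetical (observed, theoretical) precursor m/z pairs, for illustration only.
pairs = [
    (500.0005, 500.0000),
    (750.0000, 750.0003),
    (1000.0010, 1000.0000),
]

errors = [ppm_error(obs, theo) for obs, theo in pairs]

# Assumption: the summary statistic is taken over absolute errors.
abs_errors = [abs(e) for e in errors]
mean_ppm = statistics.mean(abs_errors)
sd_ppm = statistics.stdev(abs_errors)  # sample standard deviation

print(f"mean |ppm error| = {mean_ppm:.2f}, SD = {sd_ppm:.2f}")
```
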
From the 2131 proteins inferred from the MaxQuant workflow, 1956 (91.8%) were also inferred from the ProteinPilot workflow. Proteins were grouped into cellular component, biological process and molecular function categories upon Gene Ontology (GO) analysis. MaxQuant grouped proteins by molecular weight (Supplement S7).

---------------------------------------------------------------------

S4 C) Single Table Summary of the FDR analysis output from ProteinPilot search.

---------------------------------------------------------------------

Supplement S4 D) Identification statistics of the FDR analysis output from ProteinPilot search. Protein (a), distinct peptide (b) and spectral level (c).

ProteinPilot generates a post-search FDR analysis output using the 'Proteomics System Performance Evaluation Pipeline' (PSPEP). FDR results are tabulated and plotted at the spectral, distinct peptide and protein levels. The PSPEP method uses non-linear fitting to calculate a local (instantaneous) FDR, which measures the error rate of the last protein in the list, as opposed to the global FDR, which estimates the error rate for the entire protein list.

[Figure S4 D, panels a, b and c]

Supplement S5. Effect of ProteoMiner treatment on spectral utilization and predicted modifications on identified peptides.
The whole saliva was treated with hexapeptide libraries (ProteoMiner ™), which reduced the relative amounts of highly abundant proteins while increasing the relative amounts of low-abundance proteins. Although the mechanism of action is still under debate, sample treatment with hexapeptide beads has resulted in detection of a much larger number of proteins from various biological samples, as compared to non-treated samples. In order to analyze the effect of dynamic range compression (DRC) on the relative ranking of the most frequently predicted modifications, ProteinPilot's ability to predict multiple peptide modifications was used in conjunction with the post-search PDST tool. The effect of ProteoMiner treatment on enrichment of modified peptides was evaluated using ProteinPilot. Results from 20 SCX HPLC fractions (untreated) were compared against 20 SCX HPLC fractions of the ProteoMiner treated, salt-fractionated sample.

Table S5 A) ProteinPilot results from twenty fractions of either ProteoMiner treated or untreated whole saliva were compared. MaxQuant processed and ReAdW processed peaklists were used to compare the effect of high mass accuracy on PTM identification. The ProteinPilot result outputs were used for subsequent ProteinPilot Descriptive Statistics Template (PDST) analysis.

Statistic | MaxQuant processed, untreated | MaxQuant processed, ProteoMiner treated | ReAdW processed, untreated | ReAdW processed, ProteoMiner treated
Number of MS/MS spectra | 75271 | 87469 | 75271 | 87469
Spectral level 5% local FDR | 5869 | 10557 | 5389 | 10249
Protein level 5% local FDR | 267 | 716 | 218 | 562
Percent of spectra with specified protein and peptide confidence | 6.80% | 10.80% | 6.00% | 9.90%
% Modified peptides | 37.30% | 37.50% | 38.60% | 37.10%

An increase in spectral utilization in the ProteoMiner treated sample was observed (Bandhakavi et al 2009) when compared to the untreated sample (Table S5A). The percent of modified peptides (37.5%) from the treated sample was similar to that of the untreated sample (37.3%). The type and ranking of the predicted modifications in the untreated sample were assessed with the PDST output and compared to results from the treated sample. PDST generated a list of the twenty most frequent modifications (predicted amino acid features) in the dataset (Table S5B).

Table S5B) Peptides identified at 5% local FDR were compared for their most frequent single features using PDST.

Feature | Exact Delta | MaxQuant processed, untreated | Rank | MaxQuant processed, ProteoMiner treated | Rank | ReAdW processed, untreated | Rank | ReAdW processed, ProteoMiner treated | Rank
Methyl(E) | 14.0157 | 386 | 1 | 896 | 2 | 316 | 2 | 883 | 2
Oxidation(M) | 15.9949 | 384 | 2 | 1136 | 1 | 362 | 1 | 1054 | 1
Deamidated(N) | 0.984 | 239 | 3 | 175 | 6 | 171 | 3 | 134 | 7
Gln->pyroGlu@N-term | -17.027 | 136 | 4 | 153 | 7 | 120 | 4 | 145 | 6
Methyl(K) | 14.0157 | 100 | 5 | 315 | 3 | 98 | 5 | 291 | 3
Methyl(D) | 14.0157 | 94 | 6 | 218 | 5 | 85 | 6 | 226 | 4
Methyl(R) | 14.0157 | 89 | 7 | 226 | 4 | 74 | 7 | 181 | 5
Dioxidation(C) | 31.9898 | 67 | 9 | 20 | 19 | 66 | 9 | 24 | 14
Cys->Dha(C) | -33.988 | 58 | 10 | 20 | 18 | 59 | 10 | 23 | 15
Protein Terminal Acetyl@N-term | 42.0106 | 29 | 14 | 91 | 8 | 18 | 16 | 60 | 9
Dehydrated(D) | -18.011 | 26 | 15 | 87 | 9 | 28 | 13 | 72 | 8
Delta:H(4)C(2)(K) | 28.0313 | 17 | 19 | 52 | 10 | 15 | 18 | 51 | 10
Oxidation(Y) | 15.9949 | 26 | 16 | 35 | 13 | 18 | 17 | 43 | 11
Deamidated(Q) | 0.984 | 51 | 11 | 40 | 11 | 36 | 11 | 23 | 16
Delta:H(2)C(2)(H) | 26.0157 | 36 | 13 | 13 | 24 | 36 | 12 | 15 | 21

Rows with entries in only some columns (values in source order):
Oxidation(W) | 15.9949 | 78 8 67 8
Cation:Na(E) | 21.9819 | 10 25
Pro->pyro-Glu(P) | 13.9793 | 11 24
Cation:Na(D) | 21.9819 | 12 22
Ammonia-loss(N) | -17.027 | 24 17 14 22 19 15
Dioxidation(W) | 31.9898 | 42 12 24 14 14 23
Methyl(H) 14.0157 Dehydrated(T) Glu->pyroGlu@N-term -18.011 17 -18.011 Formyl(K) 9 25 20 20 12 23 26 13 18 13 23 13 21 11 24 13 22 28 14 14 20 27.9949 14 21 15 19 15 22

The modification list was ranked by the number of features in the untreated sample. Glutamate methylation and methionine oxidation were top-ranked in both untreated and treated samples.
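
The ranking comparison described above can be sketched in Python as follows. This is a minimal illustration and not the PDST implementation: it uses the feature counts of the seven most frequent features in the untreated sample from Table S5B (MaxQuant processed peaklists), ranks are computed within these seven features only, and the helper name rank_by_count is hypothetical.

```python
# Feature counts for the seven most frequent features in the untreated sample
# (Table S5B, MaxQuant processed): feature -> (untreated, ProteoMiner treated).
counts = {
    "Methyl(E)": (386, 896),
    "Oxidation(M)": (384, 1136),
    "Deamidated(N)": (239, 175),
    "Gln->pyroGlu@N-term": (136, 153),
    "Methyl(K)": (100, 315),
    "Methyl(D)": (94, 218),
    "Methyl(R)": (89, 226),
}

def rank_by_count(values):
    """Return {feature: rank}, with rank 1 = most frequent (hypothetical helper)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {feature: i + 1 for i, feature in enumerate(ordered)}

untreated_rank = rank_by_count({f: c[0] for f, c in counts.items()})
treated_rank = rank_by_count({f: c[1] for f, c in counts.items()})

# A positive shift means the feature moved up the list after treatment.
for feature in counts:
    shift = untreated_rank[feature] - treated_rank[feature]
    print(f"{feature:22s} untreated #{untreated_rank[feature]} "
          f"treated #{treated_rank[feature]} (shift {shift:+d})")
```

Within these seven features the computed ranks agree with those listed in the table; for example, Deamidated(N) moves from rank 3 in the untreated sample to rank 6 in the treated sample.
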
Comparison between the datasets showed that the ranking of the intermediate and lower-ranked modifications changed, starting at the 8th-ranked modification. The relative ranking of most predicted modifications changed after ProteoMiner treatment. This observation is noteworthy, even for the study of native salivary samples, and makes a case for using both treated and untreated samples to identify more PTMs.

Effect of mass accuracy on PTM identification: MaxQuant processed and ReAdW processed peaklists were used to compare the effect of high mass accuracy on PTM identification. The MaxQuant processed peaklist yields more peptide identifications, so it is not surprising that the number of modifications increases in proportion to spectral identifications. In other words, the average number of PTM identifications increases by about as much (11-12%) as the improvement in protein and spectral identifications due to high mass accuracy. There are a few exceptions to this observation, such as Deamidation, Gln->pyroGlu@N-term, Oxidation(W), Dioxidation(C) and Cys->Dha(C). The possible reasons for these exceptions will have to be investigated further.

Supplement S6. Scaffold results for normalized spectral counts.

In order to compare the relative abundance of proteins in untreated samples to that in the ProteoMiner treated sample, we used the Mascot generated .dat files from the MaxQuant workflow in Scaffold analysis. Scaffold was used to normalize the spectral counts in these samples. For this analysis, we used results from all fractions of the untreated sample (61 fractions) and compared them against all fractions of the ProteoMiner treated sample (75 fractions). The normalized spectral counts were used to rank proteins by relative abundance in the untreated samples. The corresponding normalized counts in the ProteoMiner treated sample showed proteins that are enriched after treatment (Figure a).
Alpha-actinin-1, Nucleobindin-2, Carbonic anhydrase VI and stratifin are some of the proteins that are enriched after ProteoMiner treatment (Figure a). The complete list of proteins and their normalized spectral quantitative values can be found in Supplement S23. As a measure of depletion of abundant proteins due to ProteoMiner treatment, out of the 25 most abundant proteins in untreated samples, only 4 proteins are also in the list of the 25 most abundant proteins in the ProteoMiner treated sample (Supplement S23). In other words, 21 out of the 25 most abundant proteins from the untreated sample are depleted due to ProteoMiner treatment.

When normalized spectral counts were used to rank the proteins in the ProteoMiner treated sample, the corresponding normalized counts in the untreated sample showed proteins that are depleted after treatment (Figure b). Proteins such as amylase alpha 1A, Immunoglobulin kappa constant, Mucin-5B, Lipocalin-1, Zinc-alpha-2-glycoprotein and cystatin A are depleted after ProteoMiner treatment (Figure b). As a measure of enrichment of low-abundance proteins due to ProteoMiner treatment, out of 353 least abundant proteins (normalized spectral count = 1) in untreated samples, only 17 proteins are observed to have a lower ranking in the ProteoMiner treated sample (Supplement Sheet S23). In other words, 336 out of 353 least abundant proteins from the untreated sample are enriched due to ProteoMiner treatment.

Figure (panels a and b): Scaffold results for normalized spectral counts for ProteoMiner treated and untreated fractions. a. Protein identifications from the untreated dataset were ranked according to their abundance (spectral counts). The corresponding normalized counts in the ProteoMiner treated sample show proteins (representative peaks denoted by gene symbols) that are enriched after treatment.
The list of proteins with their corresponding quantitative values is available in Supplementary section SY. b. Protein identifications from the ProteoMiner treated dataset were ranked according to their abundance (spectral counts). The corresponding normalized counts in the untreated sample show proteins (representative peaks denoted by gene symbols) that are greatly reduced after treatment. The list of proteins with their corresponding quantitative values is available in Supplement S23.

---------------------------------------------------------------------

Supplement S7. Gene Ontology (GO) analysis of the salivary proteome.

The proteins of the salivary proteome were analyzed by Gene Ontology (GO) category, namely cellular component (a), biological process (b) and molecular function (c), and by molecular weight (d). The proteingroups.txt output from the MaxQuant search was used for this analysis.

The whole saliva dataset was also analyzed using MaxQuant. When the search is run against the human IPI database, MaxQuant's output, in the form of the proteingroups.txt file, can be used for parsing information about the biological content of identified proteins. The columns in the text file contain Gene Ontology terms (biological processes, molecular functions and cellular component), biological pathways (KEGG) and PTMs (with associated localization scores).

[Figure S7 panel titles: a, Biological Processes; b, Cellular Components; c, Molecular functions; d, Molecular Weights]

Supplement S8. Descriptive statistics for iPRG Phosphoproteome dataset.

A) Summary of dataset: Mascot generic format (.MGF) peaklists were generated from .RAW files by using ProteoWizard [Ref S3] (see 2A in the 'Dataset Key' column).
In an alternative workflow, the "Quant" module from MaxQuant was used to generate .MSM files, which were then converted to .MGF format (see 2B in the 'Dataset Key' column). The .MGF files generated by these conversion tools were searched using ProteinPilot v 4.0 against the Human IPI database. The dataset was generated using an LTQ/Orbitrap mass spectrometer [Ref S4].

Dataset # (a) | Dataset Key | Description | Sample preparation (c) | # of raw files | MS acquisition mode | Number of spectra | Peaklist generation (e)
Dataset 2 | 2A | Human phosphoproteome | IMAC enriched | 3 | Profile | 28,448 | ProteoWizard
Dataset 2 | 2B | Human phosphoproteome | IMAC enriched | 3 | Profile | 28,448 | MaxQuant "Quant" module

a All searches were conducted using ProteinPilot.
c ABRF 2010 iPRG study
e MaxQuant "Quant" processed peaklists reflect high mass accuracy.

S8 B) Identification at protein level, peptide level and spectral level for phosphoproteome dataset. The phosphoproteome dataset (Dataset 2) was processed with ProteoWizard or MaxQuant and then searched with sub-ppm instrument settings in ProteinPilot. Identifications were at 5% local FDR threshold at protein level (a), distinct peptide level (b) and spectral level (c).

[Figure S8 B, panels a, b and c]

Supplement S8 C) Mass Accuracy plots from ProteinPilot searches of MaxQuant processed and ProteoWizard processed peaklists. The distribution of the frequency of spectra identified by ProteinPilot is plotted against precursor Delta ppm. Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line. Spectra identified from ProteinPilot searches using the ProteoWizard processed peaklist for the Phosphoproteome dataset (Dataset 2) are represented with a grey line.

[Figure S8 C, panel a]

Supplement S8 D) Cumulative Mass Accuracy plots from ProteinPilot searches of MaxQuant processed and ProteoWizard processed peaklists.
The cumulative distribution of the percent of precursors identified by ProteinPilot is plotted against precursor Delta ppm. Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line. Spectra identified from ProteinPilot searches using the ProteoWizard processed peaklist (Phosphoproteome dataset) are represented with a grey line.

[Figure S8 D, panel a]

Supplement S8 E) Distribution of peptide scores of confident identifications from ProteinPilot search. The distribution of the frequency of spectra identified by ProteinPilot at 5% local FDR is plotted against Peptide Score (Sc). Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line, and spectra identified from ProteinPilot searches using the ProteoWizard processed peaklist from Dataset 2 (Phosphoproteome dataset) are represented with a grey line.

Supplement S9. Descriptive statistics for Rat SILAC dataset.

A) Summary of dataset: Mascot generic format (.MGF) peaklists were generated from .RAW files by using ReAdW4Mascot2 [Ref S1] (see 3A in the 'Dataset Key' column). In an alternative workflow, the "Quant" module from MaxQuant was used to generate .MSM files, which were then converted to .MGF format (see 3B in the 'Dataset Key' column). The .MGF files generated by these conversion tools were searched using ProteinPilot v 4.0 against the Rat IPI database (Dataset 3). The dataset was generated using an LTQ/Orbitrap mass spectrometer.

Dataset # (a) | Dataset Key | Description | Sample preparation | # of raw files | MS acquisition mode | Number of spectra | Peaklist generation (e)
Dataset 3 | 3A | Rat L6 cell line | SILAC (K+8, R+10) | 9 | Profile | 52,164 | ReAdW4Mascot2
Dataset 3 | 3B | Rat L6 cell line | SILAC (K+8, R+10) | 9 | Profile | 52,164 | MaxQuant "Quant" module

a All searches were conducted using ProteinPilot.
e MaxQuant "Quant" processed peaklists reflect high mass accuracy.

Supplement S9 B) Identification at protein level, peptide level and spectral level for Rat SILAC dataset. The Rat SILAC dataset (Dataset 3) was processed with ReAdW or MaxQuant and then searched with sub-ppm instrument settings using ProteinPilot. Identifications were at 5% local FDR threshold at protein level (a), distinct peptide level (b) and spectral level (c).

[Figure S9 B, panels a and b]

Supplement S9 C) Mass Accuracy plots from ProteinPilot searches of MaxQuant processed and ReAdW processed peaklists. The distribution of the frequency of spectra identified by ProteinPilot is plotted against precursor Delta ppm. Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line. Spectra identified from ProteinPilot searches using the ReAdW processed peaklist from Dataset 3 (SILAC dataset) are represented with a grey line.

Supplement S9 D) Cumulative Mass Accuracy plots from ProteinPilot searches of MaxQuant processed and ProteoWizard / ReAdW processed peaklists. The cumulative distribution of the percent of precursors identified by ProteinPilot is plotted against precursor Delta ppm. Spectra identified from ProteinPilot searches using the MaxQuant processed peaklist are represented with a dark line. Spectra identified from ProteinPilot searches using the ReAdW processed peaklist (SILAC dataset) are represented with a grey line.

[Figure S9 D, panel c]

Supplement S9 E) Distribution of peptide scores of confident identifications from ProteinPilot search. The distribution of the frequency of spectra identified by ProteinPilot at 5% local FDR is plotted against Peptide Score (Sc).
Spectra identified from ProteinPilot searches using MaxQuant processed peaklist are represented with a dark line and spectra identified from ProteinPilot searches using ReAdW processed peaklist from dataset 3 (for SILAC dataset) are represented with a grey line.

Supplement S10. Tranche hyperlinks for the data presented in Jagtap et al.:

The individual components of the data associated with this manuscript can be downloaded from ProteomeCommons.org Tranche by using the hyperlinks or using associated hash code and the passphrase mentioned below.

Passphrase to access data: maxquantproteinpilot

The hash may be used to prove exactly what files were published as part of this manuscript's data set, and the hash may also be used to check that the data has not changed since publication.

PEAKLISTS :
FmlX6MPmEeKG152QgcCVU/spcWHiTiR+sJahjzKYUj0XUXXJkhInFqin0pdvgLHlhwQPtyh 7+kjQwUHSlRV3fhq4VyEAAAAAAABcIg==
a) Salivary dataset – peaklist (Peak list for salivary proteome fractions in MGF format)
b) Phosphoproteome dataset – peaklist (Peak list for phosphoproteome fractions in MGF format)
c) Rat SILAC dataset – peaklist (Peak list for RAT SILAC fractions in MGF format)

PROTEINPILOT SEARCH RESULTS :
pTyZ5Eo2DSH0U3h3Teq3U95txifLGStPmPsTUe8esa0I3AhN7mXRxV9DQuAFEbSxnudrxOJ 14R83Nwkc6EoY8vgG99gAAAAAAAAljg==
ProteinPilot Search results and False Positive Rate Analysis (ProteinPilot .group files, PDST outputs, FDR Analysis Results including results for large salivary dataset)

MAXQUANT SEARCH RESULTS :
fJhRymxnWQwYdjKtEAtiDDSWdWCL0wtdyAlKRgBz/gthCWvh8PE67OgEcX1Dv29K+CSr YOToj2T7eJbwEUwHIqKOMHoAAAAAAAALcw==
MaxQuant Search results for large salivary dataset.
(MaxQuant files)

MASCOT-SCAFFOLD-WORKFLOW RESULTS :
q2SqBKZd9usRNx3Z4KhMOVMYcD1nq1cAAMmYCGZt14pJ8eVoKCzZpNLEqluKeJPKim wgkNYhwEtXqHdbl9RpnwgBABsAAAAAAAAG3g==
Mascot-Scaffold-Workflow for measuring relative protein abundance. (Mascot search results (.dat files) and Scaffold results (.sf3 file))

Supplement S11. MATERIALS AND METHODS

Sample datasets

Three LTQ Orbitrap datasets were processed and analyzed. The effect of MaxQuant processing on protein and peptide identifications and dataset-specific metrics were analyzed with Datasets 1, 2, and 3. Dataset 1, a subset of the large whole salivary dataset, was a set of twenty .RAW files generated from 3D-separated human whole saliva [Ref S2]. Dataset 2 was the iPRG 2010 data, generated from a phosphopeptide-enriched sample [Ref S4]. Dataset 3 was generated from a SILAC-labeled rat nuclear cell culture preparation. The datasets varied in MS acquisition mode (centroid for Dataset 1, profile for Datasets 2 and 3) and peaklist generation software (ReAdW4Mascot2 for Datasets 1A and 3A and ProteoWizard for Dataset 2A; MaxQuant "Quant" module for Datasets 1B, 2B and 3B).

The whole saliva dataset was generated as described in Bandhakavi et al 2009. Briefly, the whole saliva sample was processed as 'untreated saliva' or treated with hexapeptide libraries (ProteoMiner™; Bio-Rad Laboratories) for protein dynamic range compression (DRC). Protein samples were trypsinized and fractionated by preparative IEF (OFFGEL) based on their isoelectric points (pH 3-10). Peptide fractions were analyzed directly by C18 RP-LC-MS (2D LC-MS fractionation) or fractionated by SCX prior to C18 LC-MS (3D LC-MS fractionation). For the sample treatment and fractionation scheme see Supplement S1.

Preprocessing of datasets and Protein and Peptide Detection

Orbitrap datasets were searched with ProteinPilot v 4.0 (ProteinPilot Software 4.0.8085; Revision: 148085; Paragon Algorithm: 4.0.0.0. 1458083). The .RAW files were converted to .MSM files with MaxQuant's (v 1.0.13.13) "Quant" module. The .MSM files are .MGF files with high precursor mass accuracy and limited product ion 'noise' peaks. After converting the file extension from .MSM to .MGF format, the files were searched using Paragon. ProteinPilot search parameters common to all datasets (1 - 3) were: Instrument: LTQ/Orbitrap subppm; Digestion: Trypsin; ID Focus: Biological Modifications; Search effort: Thorough; Protein identification threshold: 10% Conf.

Datasets 1, 2 and 4 were searched against the Human IPI (hIPI) database (v3.52, Nov 2008) plus contaminant proteins (from the MaxQuant installation) with 148372 forward plus reversed sequences. Cys alkylation was defined as None (Datasets 1 and 4) and Iodoacetamide (Datasets 2 and 3); Sample Type was set to Identification for Datasets 1, 2 and 4. For Dataset 2, the Special factors setting was Phosphorylation emphasis. Dataset 3 was searched against the Rat IPI v3.52 database (November 2008; 80336 forward plus reversed sequences); Sample Type was defined as SILAC (K+8, R+10) and Cys alkylation was set to Iodoacetamide.

For comparisons of results as a function of input file (i.e., MaxQuant "Quant" module vs. alternate peaklist generation methods), .MGF files were created from .RAW files with ReAdW4Mascot2 [Ref S1] (for Datasets 1 and 3) or with ProteoWizard [Ref S3] (for Dataset 2) and analyzed with ProteinPilot. The search parameters were identical to those described above for Datasets 1, 2 and 3.

Dataset 4 was searched against the "target-decoy" version of hIPI v3.52 with ProteinPilot v 4.0 and MaxQuant v1.0.13.13. For the ProteinPilot search, .RAW files were converted to .MSM files with MaxQuant's "Quant" module as described above. Additional ProteinPilot parameters were: Sample Type: Identification; Cys alkylation: None.
The Mascot search parameters for the MaxQuant search were: Fragment tolerance: 0.50 Da (Monoisotopic); Precursor tolerance: Adjusted individually using Quant; Variable Modifications: Met Oxidation; Digestion Enzyme: Trypsin; Maximum Missed Cleavages: 2. In the “Identify” module, the protein identification threshold setting was 1% FDR and the peptide threshold was one peptide minimum with 1% FDR.

Post-search data analysis

For each ProteinPilot search, Protein and Peptide Summaries were exported and FDR reports were generated. The FDR reports included the numbers of proteins identified, distinct peptides matched and spectra matched at local and global FDR thresholds. The FDR reports also provided Numeric ROC plots, Estimated FDRs and non-linear fitting curves at the spectral, peptide and protein levels. The Protein Descriptive Statistics Template (PDST, v3.61), an Excel-based tool developed by AB SCIEX, was used for post-processing. The PDST tool generates extensive, dataset-specific metrics from ProteinPilot output (peptide and protein exports and FDR). Using PDST, the effects of dynamic range compression on predicted modifications and spectral utilization were determined for the whole salivary dataset. The effect of dynamic range compression on protein abundance for Dataset 4 was estimated after analysis of Mascot results for untreated and treated samples in Scaffold Q+ v3.0.

Supplement S12: Protocol for converting .RAW files from LTQ/Orbitrap to High mass accuracy .MGF files for ProteinPilot search.

RAW data conversion using MaxQuant :

Requirements : XCalibur v2.1; MaxQuant v1.0.13.13; 32-bit Windows PC.

1) Install MaxQuant v1.0.13.13 onto your 32-bit Windows machine. The latest MaxQuant version uses Andromeda as a search engine and does not produce the desired .MSM files.
Please contact the MaxQuant Google group (http://groups.google.com/group/maxquant-list) for the earlier version (v1.0.13.13) of MaxQuant.
2) Once installed, please visit http://mediamill.cla.umn.edu/mediamill/display/61837 for a webinar on the use of MaxQuant and the requirements for setting it up. Ensure that the conf folder has been replaced with the appropriate file from the Mascot ‘config’ folder.
3) Transfer your .RAW files into an accessible folder on the Windows machine (C:\ drive or an external drive connected to the Windows machine).
4) Once the files are transferred, click on Quant.exe to start MaxQuant’s “Quant” module.
5) In the “Quant” module, click on “Select Files” to select RAW files from your folder.
6) Alternatively, click on “Select Folder” to select a folder containing RAW files.
7) Once the RAW files are loaded, click on the Parameters tab. Also adjust the number of threads to the maximum available on your PC.
8) In the Parameters tab, set the SILAC type to singlets for a non-quantitative, identification-only study. Add or remove variable and fixed modifications from the list according to the sample preparation used for the dataset. Choose the correct database and enzyme for the search.
9) Click on Raw files and press “Start” to begin data processing.
10) The bottom tab shows the status of data processing. Wait until the tab shows “Done”. This typically takes 30 minutes per RAW file.
11) Once the Quant module displays the message “Done”, go to the RAW files folder and look for .MSM files in the individual folders for each RAW file.
12) For example, in the processed folder for A1, sort by file type, select the .MSM files with the extensions sil0 and peaks, and copy them into a separate folder. Repeat the same procedure for all files.
13) Change the .msm extension of the files in the MSM files folder to .mgf.
14) These files can now be used for a ProteinPilot search. Note that you can search the files with the “Orbi/FT MS (sub-ppm), LTQ MS/MS” Instrument setting because of the peaklist’s high precursor mass accuracy.

For webinar modules on the use of ProteinPilot, please visit : http://tinyurl.com/proteinpilotin

Supplementary References :

S1: http://chemdata.nist.gov/mass-spc/ftp/download/peptide_library/software/current_releases/ReAdw4Mascot2
S2: Bandhakavi S, Stone MD, Onsongo G, Van Riper SK et al. (2009) A dynamic range compression and three-dimensional peptide fractionation analysis platform expands proteome coverage and the diagnostic potential of whole saliva. J Proteome Res. 8(12): 5590-5600.
S3: Kessner D, Chambers M, Burke R, Agus D et al. (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 24(21): 2534-2536.
S4: Rudnick P, Askenazi M, Clauser K, Lane W et al. (2010) ABRF iPRG2010 Study: Informatic Evaluation of Phosphopeptide Identification and Phosphosite Localization Results from Multiple Proteomics Laboratories. Proceedings of the 58th ASMS Conference on Mass Spectrometry and Allied Topics, Salt Lake City, UT.
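
Note on steps 11 to 13 of the Supplement S12 protocol: collecting the .MSM peaklists from the per-RAW-file folders and changing their extension to .mgf can also be scripted instead of done by hand. The following Python sketch is one possible way to do this and is not part of the original protocol. It assumes that the wanted files end in .msm and carry "sil0" or "peaks" in their file names, as suggested by step 12; the function name collect_msm_as_mgf and the folder paths are hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: the peaklists to keep contain one of these tokens in the file
# name (step 12 mentions the "sil0" and "peaks" extensions).
KEEP_TOKENS = ("sil0", "peaks")


def collect_msm_as_mgf(raw_root, out_dir, tokens=KEEP_TOKENS):
    """Copy selected .msm files found under raw_root into out_dir as .mgf.

    The original .msm files are left untouched; an existing .mgf file with
    the same name in out_dir is overwritten.
    """
    raw_root = Path(raw_root).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() materializes the listing before any file is copied.
    for src in sorted(raw_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() != ".msm":
            continue
        if not any(token in src.name.lower() for token in tokens):
            continue
        if out_dir in src.parents:
            continue  # skip files already sitting in the output folder
        dst = out_dir / (src.stem + ".mgf")  # A1.sil0.msm -> A1.sil0.mgf
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied

# Example call with hypothetical Windows paths:
# collect_msm_as_mgf(r"C:\RAW_files", r"C:\RAW_files_mgf")
```

Copying rather than renaming in place keeps the MaxQuant output folders intact, so the conversion can be repeated if a file is missed.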