How to submit MS proteomics data to ProteomeXchange via the PRIDE database

Supplementary Material

1. Definitions of data types and files

A variety of proteomics data types can be submitted to ProteomeXchange (PX)/PRIDE. The definitions and tags below are used throughout the main text of the manuscript.

a) Mass spectrometer output files: the original data and metadata generated by mass spectrometers. The data may be the original profile mode scans or may already have had some basic processing, such as centroiding, applied. They may be:
   i) raw data (see below).
   ii) peak list spectra in a standardized format (see below), but they cannot be ‘processed peak lists’ (see below).
   Mass spectrometer output files are labelled as ‘RAW’ throughout the manuscript.
b) Raw data: the binary, vendor-specific output files directly created by the instrument software. These files are typically large (potentially several gigabytes) and require specialized software in order to be read.
c) Standardized MS data formats: There are currently three widely known open standard mass spectrometry data formats in proteomics: mzXML [1], mzData, and the successor to both, mzML [2] (currently v1.1, http://www.psidev.info/mzml), developed by the PSI (Proteomics Standards Initiative). In addition to the mass spectra, they contain detailed metadata that provide context to the measurements.
d) Processed peak lists: a heavily processed form of mass spectrometry data, usually derived from the raw data files through various (semi-)automatic steps, e.g. centroiding, deisotoping, and charge deconvolution. These files are plain text, in formats such as dta, pkl, ms2 or mgf. They usually contain only a subset of the MS2 scans (MS1 scans are excluded), and lack a significant amount of the metadata that was present in the source format. Processed peak list files are labelled as ‘PEAK’ throughout the manuscript.
e) Protein/peptide identifications: Proteomics mass spectra can be matched to peptides or proteins, resulting in identifications for those spectra. In the case of fragmentation spectra, the initial identification will consist of a peptide sequence; subsequent steps will derive a list of proteins from the identified peptides. This information can be represented by a variety of data formats called ‘search engine output files’ (see below).
f) Protein/peptide quantification: Protein/peptide expression values can also be obtained from mass spectra. The high diversity of approaches has resulted in very heterogeneous software and data analysis pipelines. Some search engines are able to perform both identification and quantification, and produce ‘search engine output files’ containing both types of data. However, when software performs only the quantification part of the analysis, the generated data are represented as ‘quantification software output files’ (see below).
g) Search engine output files: the data and metadata generated by the search engines used for performing the identification, and often the quantification, of peptides and proteins. Each search engine has its own specific output format, typically plain text or XML (e.g. Mascot .dat, X!Tandem .xml). In addition to these specific formats, a standard data format called mzIdentML (currently v1.1, http://www.psidev.info/mzIdentML) [3] has been developed by the PSI. Some search engine output files can also represent quantification results, but mzIdentML cannot. mzIdentML can now be exported from a variety of tools (see Table 1 in the main manuscript and http://www.psidev.info/tools-implementing-mzIdentML). Additionally, PRIDE XML is the original PRIDE internal data format.
It can represent both mass spectra (it contains the mzData format) and protein/peptide identifications and, for some basic use cases, quantification information as well. ‘Search engine output files’ can be converted to PRIDE XML using PRIDE Converter 2 [4] and a few other available tools (see Table 2 in the main manuscript). Search engine output files are labelled either as ‘RESULT’ (for mzIdentML or PRIDE XML) or ‘SEARCH’ (any other output format) throughout the manuscript.
h) Quantification software output files: the data and metadata generated by software used exclusively for the quantification analysis of peptides and proteins. In addition to the specific format of each software tool, a standard data format called mzQuantML (currently v1.0, http://www.psidev.info/mzquantml) has been released by the PSI [5]. Quantification software output files are labelled as ‘QUANT’ throughout the manuscript.
i) Sequence database: the sequence database file (usually in FASTA format) that was used to perform the mass spectral search. Sequence database files (both protein and DNA) are labelled as ‘FASTA’ throughout the manuscript.
j) Spectral library: the spectral library file that was used for performing the mass spectral search. These files are labelled as ‘SP_LIBRARY’ throughout the manuscript.
k) Metadata: Related biological or technological metadata provide context to the spectra and/or the identification and quantification data. Mass spectrometer, search engine, and quantification software output files typically accommodate this information. The ‘PX summary file’ (see below) includes the main metadata related to any submission to PX via PRIDE.
l) PX summary file: a tab-delimited file that is generated by the PX submission tool (see next section) and used by the PRIDE internal submission pipeline.
It is a wrapper file listing all the files and data types included in a submission, plus the relevant metadata. The format is described at http://www.proteomexchange.org/sites/proteomexchange.org/files/documents/proteomexchange_submission_summary_file_format.pdf.

2. Information about the biological dataset used in the tutorial

The title of the example data set used to demonstrate the process is “Discovery of new cerebrospinal fluid biomarkers for meningitis in children”. The objective of this study was to define specific protein signatures in cerebrospinal fluid associated with Streptococcus pneumoniae infection. The data set consists of 12 runs: four of them are controls (labelled as C133, C134, C135 and C145) and the other eight are the actual samples (labelled as P5, P7, P10, P55, P60, P79, P319 and P340). It was deposited in PRIDE/PX as a ‘complete’ submission (accession number PXD000764, DOI 10.6019/PXD000764). Private access is enabled with the following reviewer account: reviewer41356@ebi.ac.uk, password: f92UEXHE. Overall, the data set contains 12 raw files (‘RAW’), 12 mzIdentML (‘RESULT’) files containing the peptide/protein identification results, and the corresponding 12 mgf (Mascot Generic File) files containing the mass spectra (‘PEAK’ files). Finally, it also contains one mzQuantML file (‘QUANT’), containing the corresponding peptide/protein quantification results.

The following describes how the different files were generated. The raw data acquired for each run were converted into a single mgf file containing the peak list by Proteome Discoverer 1.1 (Thermo Fisher Scientific) using default parameters. Independent mgf files for each sample were searched against a merged database composed of reviewed entries of the human UniProt database (version 20120711; 20,225 entries) and S.
pneumoniae reference strain ATCC BAA255/R6 (version 20120711; 2,029 entries), with the Mascot search engine (version 2.4.0, Matrix Science). The search used trypsin as the enzyme, carbamidomethylation of cysteine as a fixed modification, methionine oxidation as a variable modification, one allowed missed cleavage, and mass tolerances of 10 ppm for precursors and 0.6 Da for fragment ions. The false discovery rate (FDR) was calculated using the decoy database tool in Mascot. The outputs from Mascot were exported in the mzIdentML version 1.1 format.

The peptides and proteins were quantified using ‘Progenesis LCMS’ software (version 4.0; Nonlinear Dynamics). Software default thresholds were used. Samples were grouped as control and infected. Features with charge states between +2 and +5 and three or more isotopic peaks were taken forward for identification. A merged peak list generated by Progenesis LCMS was searched against the composite database using Mascot with the same parameters as above. The Mascot results were imported into Progenesis LCMS, and a cut-off score of 20 was applied after manually evaluating the quality of the lowest-scoring peptides. Similar proteins were grouped and only non-conflicting features were used for quantification. The peptide and protein .csv files output by Progenesis were imported into Progenesis Post-Processor version 1.0.4-beta (http://progenesis-post-processor.googlecode.com/svn/maven/release/progenesis-postprocessor/1.0.4-beta/progenesis-post-processor-1.0.4-beta.zip) for conversion into a single mzQuantML file.

3. Submission via the command line using the Aspera file transfer functionality

This option is available for submitters with bioinformatics support who prefer not to use the PX submission tool because of the manual work involved (e.g. if the submission contains a large number of files).
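For submissions with many files, the bookkeeping of file types can be scripted before the upload. The following is a minimal sketch and not part of the PX tooling: it assigns the tags defined in section 1 (‘RAW’, ‘RESULT’, ‘PEAK’, ‘QUANT’, ‘SEARCH’, ‘FASTA’) from file extensions and counts the files per tag. The extension mapping, the ‘UNKNOWN’ placeholder, the function names and the example file names are assumptions made for illustration only.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical extension-to-tag mapping. The tags come from section 1;
# the extensions chosen here are assumptions, not a PX/PRIDE rule.
EXTENSION_TO_TAG = {
    ".raw": "RAW",
    ".mzid": "RESULT",
    ".mgf": "PEAK",
    ".mzq": "QUANT",
    ".dat": "SEARCH",
    ".fasta": "FASTA",
}


def tag_for(filename: str) -> str:
    """Return the file-type tag for a file name, or 'UNKNOWN' (placeholder)."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_TAG.get(suffix, "UNKNOWN")


def inventory(filenames):
    """Count how many files carry each tag."""
    return Counter(tag_for(name) for name in filenames)


# Example modelled on the tutorial data set: 12 runs, each with a raw file,
# an mzIdentML file and an mgf file, plus one mzQuantML file.
# The file names themselves are invented for this sketch.
runs = ["C133", "C134", "C135", "C145",
        "P5", "P7", "P10", "P55", "P60", "P79", "P319", "P340"]
files = [f"{run}{ext}" for run in runs for ext in (".raw", ".mzid", ".mgf")]
files.append("quantification.mzq")

counts = inventory(files)
for tag, n in sorted(counts.items()):
    print(f"{tag}\t{n}")
```

A check of this kind makes it easy to spot a run whose ‘RESULT’ or ‘PEAK’ file is missing before the files are mapped in the ‘PX summary file’.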
A command-line based submission can also be performed using the Aspera file transfer functionality with the freely available ‘Aspera Connect’ web browser plug-in. Aspera (http://asperasoft.com/) uses a patented transfer protocol that can offer good transfer speeds of up to 500 Mbit/s, even across large geographical distances. The plug-in has to be installed first, and the user needs to contact the PRIDE team in advance to obtain an account and a target directory on the private PRIDE Aspera server. All the relevant data files and the corresponding ‘PX summary file’ must be uploaded. Both complete and partial submissions are supported. More details about the procedure can be found at http://www.ebi.ac.uk/pride/help/archive/aspera.

Step 1 – Data preparation and generation of the ‘PX summary file’

As mentioned in the previous sections, the files to be submitted need to be at hand first. In addition, a ‘PX summary file’ needs to be created and included in the submission. This file can be generated in two ways:
A) Using the PX submission tool. In this case, the submitter goes through the different steps as explained in the main manuscript. The metadata and the file mappings are thus provided using the tool. In the ‘summary screen’ (Step 8 in the main manuscript), the PX submission tool provides an ‘Export Summary’ functionality. The ‘PX summary file’ can then be saved locally (usually with the extension .px, Supplemental Figure S3).
B) Generating the file by scripting. Details about the tab-delimited PX summary format can be found at http://www.proteomexchange.org/sites/proteomexchange.org/files/documents/proteomexchange_submission_summary_file_format.pdf.

Step 2 – Installation of the ‘Aspera Connect’ web browser plug-in

The ‘Aspera Connect’ web browser plug-in can be downloaded from http://downloads.asperasoft.com/. It is listed there as ‘Client’ software.
The local installation directory depends on the operating system. It should be noted together with the local directory containing the data set to be uploaded, the target destination directory, and the Aspera server password (the latter two must be obtained in advance from PRIDE staff).

Step 3 – Aspera command-line file upload

The exact Aspera command differs slightly between operating systems. From within the ‘Aspera Connect’ tool’s directory, the execution is as follows:

Macintosh:
    ./ascp -QT -l500m --file-manifest=text -k 2 <path-to-folder-to-be-uploaded> pride-drop-006@ah01.ebi.ac.uk:<name-of-target-dir-specified-by-PRIDE>
Windows:
    ascp.exe -QT -l500m --file-manifest=text -k 2 <path-to-folder-to-be-uploaded> pride-drop-006@ah01.ebi.ac.uk:<name-of-target-dir-specified-by-PRIDE>

The user will then be prompted for the server password. A transfer report will also be generated, summarizing the files transferred successfully. The command can be modified to use a lower transfer speed limit in order to reduce connection time-outs. For instance, replacing “-l500m” with “-l250m” would specify a limit of 250 Mbits per second. Finally, the submitter needs to notify the PRIDE team that an Aspera-based upload has taken place. PRIDE staff will then continue to process the submission.

4. How to perform bulk submissions

Step 1 – Generation of the ‘PX summary file’

As explained in section 3, step 1 above.

Step 2 – File transfer

This can be done through the ‘Bulk submission’ option in the PX submission tool (available in Step 1; Figure 2, panel 1, main manuscript), or via the command line, as explained in the previous section.

5. How to access data privately in PRIDE

Private data set files can be accessed in two ways: via the PRIDE Archive web site or via PRIDE Inspector.
(i) The PRIDE Archive web site.
Similarly to what is explained in the main manuscript (section ‘How to make a data set public or add the corresponding reference’), personal or reviewer accounts can be used to access and download the individual data sets.
(ii) In the PRIDE Inspector tool, select ‘Review Project’ and enter the account details (username and password). PRIDE Inspector can be used to visualize PRIDE XML or mzIdentML files (‘RESULT’ files in ‘complete’ submissions), or just to download all the files to a given directory (in the case of ‘partial’ submissions).

Abbreviations

CSV: Comma Separated Values
CV: Controlled Vocabulary
DOI: Digital Object Identifier
EBI: European Bioinformatics Institute
FDR: False Discovery Rate
GUI: Graphical User Interface
MGF: Mascot Generic File
MS: Mass Spectrometry
PRIDE: PRoteomics IDEntifications (database)
PSI: Proteomics Standards Initiative
PX: ProteomeXchange

References

[1] Pedrioli, P. G., Eng, J. K., Hubley, R., Vogelzang, M., et al., A common open representation of mass spectrometry data and its application to proteomics research. Nature Biotechnol. 2004, 22, 1459-1466.
[2] Martens, L., Chambers, M., Sturm, M., Kessner, D., et al., mzML - a community standard for mass spectrometry data. Mol. Cell. Proteomics 2011, 10, R110.000133.
[3] Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Mol. Cell. Proteomics 2012, 11, M111.014381.
[4] Cote, R. G., Griss, J., Dianes, J. A., Wang, R., et al., The PRoteomics IDEntification (PRIDE) Converter 2 framework: an improved suite of tools to facilitate data submission to the PRIDE database and the ProteomeXchange consortium. Mol. Cell. Proteomics 2012, 11, 1682-1689.
[5] Walzer, M., Qi, D., Mayer, G., Uszkoreit, J., et al., The mzQuantML data standard for mass spectrometry-based quantitative studies in proteomics. Mol. Cell. Proteomics 2013, 12, 2332-2340.
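
Supplementary sketch for section 3, Step 3. Submitters who script their uploads may prefer to assemble the Aspera command programmatically, so that the rate limit can be lowered without retyping the command. The sketch below is one possible way to do this, under stated assumptions: it uses only the options shown in Step 3 (-QT, -l, --file-manifest=text, -k 2) and the user and host from the Step 3 commands; the function name, its parameters and the example paths are hypothetical. It only builds the argument list and does not start a transfer.

```python
def build_ascp_command(ascp_path, source_dir, target_dir, rate_mbit=500,
                       user_host="pride-drop-006@ah01.ebi.ac.uk"):
    """Build the ascp argument list described in section 3, Step 3.

    ascp_path  -- './ascp' on Macintosh or 'ascp.exe' on Windows
    source_dir -- local folder to be uploaded
    target_dir -- target directory name provided by PRIDE staff
    rate_mbit  -- transfer rate limit in Mbit/s (the '-l500m' option)
    """
    if rate_mbit <= 0:
        raise ValueError("rate_mbit must be a positive number of Mbit/s")
    return [
        ascp_path,
        "-QT",
        f"-l{rate_mbit}m",
        "--file-manifest=text",
        "-k", "2",
        source_dir,
        f"{user_host}:{target_dir}",
    ]


# Example with placeholder paths and a lowered limit of 250 Mbit/s.
command = build_ascp_command("./ascp", "/data/my_submission",
                             "target-dir-from-pride", rate_mbit=250)
print(" ".join(command))

# To actually start the upload, the list could be passed to
# subprocess.run(command, check=True); ascp then prompts for the password.
```

Keeping the command as a list rather than a single string avoids quoting problems when the local folder name contains spaces.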