How to submit MS proteomics data to ProteomeXchange via the PRIDE database
Supplementary Material
1. Definitions of data types and files
There are a variety of data types in proteomics that can be submitted to ProteomeXchange (PX)/
PRIDE. These are the definitions and tags that are used throughout the main text of the manuscript.
a) Mass spectrometer output files: the original data and metadata generated by mass
spectrometers. The data may be the original profile mode scans or may already have had some
basic processing like centroiding applied. They may be:
   i) raw data (see below).
   ii) peak list spectra in a standardized format (see below), but they cannot be ‘processed
   peak lists’ (see below).
Mass spectrometer output files are labelled as ‘RAW’ throughout the manuscript.
b) Raw data: the binary, vendor-specific output files directly created by the instrument software.
These files are typically large (potentially several gigabytes) and require specialized software in
order to be read.
c) Standardized MS data formats: There are currently three widely known open standard mass
spectrometry data formats in proteomics: mzXML [1], mzData and the successor to both of the
above: mzML [2] (currently v1.1, http://www.psidev.info/mzml), developed by the PSI
(Proteomics Standards Initiative). In addition to the mass spectra, they contain detailed metadata
that provide context to the measurements.
d) Processed peak lists: Heavily processed form of mass spectrometry data, usually derived from
the raw data files through various (semi-)automatic steps, e.g. centroiding, deisotoping, and
charge deconvolution. These files are formatted in plain text, with typical formats like dta, pkl,
ms2 or mgf. They usually contain only a subset of only the MS2 scans (MS1 scans are excluded),
and are missing significant amounts of metadata that were present in the source format.
Processed peak list files are labelled as ‘PEAK’ throughout the manuscript.
e) Protein/peptide identifications: Proteomics mass spectra can be matched to peptides or
proteins, resulting in identifications for those spectra. In the case of fragmentation spectra, the
initial identification will consist of a peptide sequence; subsequent steps will derive a list of
proteins from the identified peptides. This information can be represented by a variety of data
formats called ‘search engine output files’ (see below).
f) Protein/peptide quantification: Protein/peptide expression values can also be obtained from
mass spectra. The wide diversity of quantification approaches has led to very heterogeneous
software and data analysis pipelines. Some search engines are able to perform both identification
and quantification, and produce ‘search engine output files’ containing both types of data.
However, when software performs only the quantification part of the analysis, the generated data
are represented as ‘quantification software output files’ (see below).
g) Search engine output files: They contain the data and metadata generated by the search engines
used for performing the identification and often the quantification of peptides and proteins. Each
search engine has its own specific output file, typically in either plain text or XML (e.g. Mascot
.dat, X!Tandem .xml). In addition to these tool-specific formats, a standard data format called
mzIdentML (currently v1.1, http://www.psidev.info/mzIdentML) [3] has been developed by the PSI.
Some search engine output files can also represent quantification results, but this is not the
case for mzIdentML. mzIdentML can now be exported from a variety of tools (see Table 1 in the main
manuscript and http://www.psidev.info/tools-implementing-mzIdentML).
Additionally, PRIDE XML is the original PRIDE internal data format. It can represent both mass
spectra data (it contains the mzData format) and protein/peptide identifications, and for some
basic use cases, quantification information as well. ‘Search engine output files’ can be converted
to PRIDE XML using PRIDE Converter 2 [4] and a few other available tools (see Table 2 in the main
manuscript). Search engine output files are labelled either as ‘RESULT’ (for mzIdentML or PRIDE
XML) or ‘SEARCH’ (any other output format) throughout the manuscript.
h) Quantification software output files: The data and metadata generated by the software used for
performing exclusively the quantification analysis of peptides and proteins. In addition to each
specific format from each software tool, a data standard format called mzQuantML (currently
v1.0, http://www.psidev.info/mzquantml) has been released by the PSI [5]. Quantification
software output files are labelled as ‘QUANT’ throughout the manuscript.
i) Sequence database: Sequence database file (usually in FASTA format) that was used to perform
the mass spectral search. Sequence database files (both protein and DNA) are labelled as ‘FASTA’
throughout the manuscript.
j) Spectral library: Spectral library file that was used for performing the mass spectral search. The
files are labelled as ‘SP_LIBRARY’ throughout the manuscript.
k) Metadata: Related biological or technological metadata provide context to the spectra and/or
the identification and quantification data. Mass spectrometer, search engine, and quantification
software output files typically accommodate this information. The ‘PX summary file’ (see below)
includes the main metadata information related to any submission to PX via PRIDE.
l) PX summary file: Tab-delimited file format that is generated by the PX submission tool (see next
section) and used by the PRIDE internal submission pipeline. It is a wrapper file containing all the
files and data types included in each submission plus the relevant metadata. The format is described at
http://www.proteomexchange.org/sites/proteomexchange.org/files/documents/proteomexchange_submission_summary_file_format.pdf.
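The labels defined above can be assigned to files programmatically when a submission contains many files. The following Python sketch is only an illustration: the labels come from the definitions in this section, but the mapping from file extensions to labels (for example `.mzid` for mzIdentML or `.mzq` for mzQuantML) and the function name `label_for` are assumptions made for this example, not part of the PX tooling.

```python
from pathlib import Path

# Hypothetical extension-to-label map. The labels are those defined in this
# section; the choice of extensions is an assumption for illustration.
LABEL_BY_EXTENSION = {
    ".raw": "RAW",      # vendor-specific binary output
    ".mzml": "RAW",     # standardized MS data formats also count as 'RAW'
    ".mzxml": "RAW",
    ".mgf": "PEAK",     # processed peak lists
    ".dta": "PEAK",
    ".pkl": "PEAK",
    ".ms2": "PEAK",
    ".mzid": "RESULT",  # mzIdentML
    ".dat": "SEARCH",   # e.g. Mascot .dat
    ".mzq": "QUANT",    # mzQuantML
    ".fasta": "FASTA",
}


def label_for(filename: str) -> str:
    """Return the submission label for a file name, or 'UNKNOWN'."""
    suffix = Path(filename).suffix.lower()
    return LABEL_BY_EXTENSION.get(suffix, "UNKNOWN")
```

Files ending in `.xml` are deliberately left as 'UNKNOWN' in this sketch, because the extension alone cannot distinguish PRIDE XML (‘RESULT’) from, for example, X!Tandem output (‘SEARCH’).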
2. Information about the biological dataset used in the tutorial
The title of the example data set used to demonstrate the process is “Discovery of new cerebrospinal
fluid biomarkers for meningitis in children”. The objective of this study was to define specific protein
signatures in cerebrospinal fluid associated with Streptococcus pneumoniae infection. The data set
consists of 12 runs: four of them are controls (labelled as C133, C134, C135 and C145) and the other
eight are the actual samples (labelled as P5, P7, P10, P55, P60, P79, P319 and P340). It was deposited
in PRIDE/PX as a ‘complete’ submission (accession number PXD000764, DOI 10.6019/PXD000764).
Private access is enabled with the following reviewer account: reviewer41356@ebi.ac.uk (password:
f92UEXHE). Overall, the data set contains 12 raw files (‘RAW’), 12 mzIdentML (‘RESULT’)
files containing the peptide/protein identification results, and the corresponding 12 mgf (Mascot
Generic File) files containing the mass spectra (‘PEAK’ files). Finally, it also contains one mzQuantML
file (‘QUANT’), containing the corresponding peptide/protein quantification results. The following
paragraphs describe how the different files were generated.
The raw data acquired were converted into a single mgf file containing the peak list by Proteome
Discoverer 1.1 (Thermo Fisher Scientific) using default parameters. Independent mgf files for each
sample were searched against a merged database composed of reviewed entries of the human
UniProt database (version 20120711; 20,225 entries) and S. pneumoniae reference strain ATCC
BAA-255/R6 (version 20120711; 2,029 entries), with the Mascot search engine (version 2.4.0, Matrix
Science). The search parameters were: trypsin as the enzyme, carbamidomethylation of cysteine as
fixed modification, methionine oxidation as variable modification, one trypsin missed cleavage
allowed, and a mass tolerance of 10 ppm for precursors and 0.6 Da for fragment ions. The false
discovery rate (FDR) was
calculated using the decoy database tool in Mascot. The outputs from Mascot were exported in the
mzIdentML version 1.1 format.
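To make the two tolerance settings concrete, the short Python sketch below computes the accepted m/z window for a relative (ppm) precursor tolerance and an absolute (Da) fragment tolerance. It only illustrates the arithmetic; how Mascot applies these tolerances internally is not described here, and the function names are hypothetical.

```python
def precursor_window(mz: float, tol_ppm: float = 10.0) -> tuple[float, float]:
    """Return (low, high) m/z bounds for a relative tolerance given in ppm."""
    delta = mz * tol_ppm / 1e6  # ppm is parts per million of the m/z value
    return mz - delta, mz + delta


def fragment_window(mz: float, tol_da: float = 0.6) -> tuple[float, float]:
    """Return (low, high) m/z bounds for an absolute tolerance given in Da."""
    return mz - tol_da, mz + tol_da
```

For a precursor at m/z 800, a 10 ppm tolerance corresponds to a window of ±0.008, whereas the 0.6 Da fragment tolerance is the same width at every m/z.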
The peptides and proteins were quantified using ‘Progenesis LCMS’ software (version 4.0; Nonlinear
Dynamics). Software default thresholds were used. Samples were grouped as control and infected,
accordingly. Features with positive charge states between +2 and +5, and three or more isotopic
peaks were taken forward for identification. A merged peak list generated by Progenesis LCMS was
searched against the composite database, using Mascot with the same parameters as above. The
Mascot results were imported into Progenesis LCMS and a cut-off score of 20 was applied after
manually evaluating the quality of the lowest-scoring peptides. Similar proteins were grouped and
only non-conflicting features were used for quantification. The output Progenesis peptide and
protein .csv files were imported into Progenesis Post-Processor version 1.0.4-beta
(http://progenesis-post-processor.googlecode.com/svn/maven/release/progenesis-post-processor/1.0.4-beta/progenesis-post-processor-1.0.4-beta.zip)
for conversion into a single mzQuantML file.
3. Submission via the command line using the Aspera file transfer functionality
This option is available for submitters with bioinformatics support who prefer not to use the PX
submission tool, due to the manual work involved (e.g. if the submission contains a large number of
files). A command-line based submission can also be performed using the Aspera file transfer
functionality with the freely available ‘Aspera Connect’ web browser plug-in. Aspera
(http://asperasoft.com/) uses a patented transfer protocol which can offer good transfer speeds of
up to 500 MB/s, even across large geographical distances. The plug-in has to be installed first and the
user needs to contact the PRIDE team in advance since they will need to have an account and a
target directory to access the private PRIDE Aspera server. All the relevant data files and the
corresponding ‘PX summary file’ must be uploaded. Both complete and partial submissions are
supported. More details about the procedure can be found at
http://www.ebi.ac.uk/pride/help/archive/aspera.
Step 1 – Data preparation and generation of the ‘PX summary file’
As mentioned in the previous sections, the files to be submitted must be available first. In
addition, a ‘PX summary file’ needs to be created and included in the submission. This file can be
generated in two ways:
A) Using the PX Submission tool. In this case, the submitter will go through the different steps
as explained in the main manuscript. The metadata and the file mappings are thus provided
using the tool. In the ‘summary screen’ (Step 8 in the main manuscript), the PX submission
tool provides an ‘Export Summary’ functionality. The ‘PX summary file’ can then be saved
locally (usually with the extension .px, Supplemental Figure S3).
B) Generating the file by scripting. Details about the tab-delimited PX summary format can be
found at
http://www.proteomexchange.org/sites/proteomexchange.org/files/documents/proteomexchange_submission_summary_file_format.pdf.
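As an illustration of option B, the Python sketch below assembles a tab-delimited summary from a metadata dictionary and a list of file entries. The record prefixes (`MTD`, `FMH`, `FME`), the column names and the example file names are assumptions made for this sketch; the authoritative layout is the one given in the format document linked above, and any generated file should be checked against it.

```python
def build_summary(metadata: dict, files: list) -> str:
    """Build a tab-delimited summary as a single string.

    Assumed layout (verify against the format document):
      MTD <key> <value>                       one line per metadata item
      FMH file_id file_type file_path file_mapping   header of the file table
      FME <id> <type> <path> <mapped ids>     one line per submitted file
    """
    lines = []
    for key, value in metadata.items():
        lines.append("\t".join(["MTD", key, str(value)]))
    lines.append("")  # blank line between the metadata and the file table
    lines.append("\t".join(["FMH", "file_id", "file_type", "file_path", "file_mapping"]))
    for entry in files:
        # Related files (e.g. the RAW and PEAK files behind a RESULT file)
        # are referenced by their ids, joined with commas.
        mapping = ",".join(str(i) for i in entry.get("mapping", []))
        lines.append("\t".join(
            ["FME", str(entry["id"]), entry["type"], entry["path"], mapping]
        ))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names for one run of the example data set.
    summary = build_summary(
        {"submission_type": "COMPLETE"},
        [
            {"id": 1, "type": "RESULT", "path": "P5.mzid", "mapping": [2, 3]},
            {"id": 2, "type": "RAW", "path": "P5.raw"},
            {"id": 3, "type": "PEAK", "path": "P5.mgf"},
        ],
    )
    print(summary)
```

Building the text in memory first makes it easy to inspect or validate the content before writing it to a .px file.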
Step 2 – Installation of the ‘Aspera Connect’ web browser plug-in
The ‘Aspera Connect’ web browser plug-in can be downloaded from
http://downloads.asperasoft.com/. It is listed there as ‘Client’ software. The local installation
directory will depend on the operating system used, and it should be noted together with the local
directory (containing the data set to be uploaded), the target destination directory and the Aspera
server password (both must be obtained in advance from PRIDE staff).
Step 3 – Aspera command-line file upload
The exact Aspera command used differs slightly between operating systems. From within the ‘Aspera
Connect’ tool’s directory, the execution is as follows:

Macintosh: ./ascp -QT -l500m --file-manifest=text -k 2 <path-to-folder-to-be-uploaded> pride-drop-006@ah01.ebi.ac.uk:<name-of-target-dir-specified-by-PRIDE>

Windows: ascp.exe -QT -l500m --file-manifest=text -k 2 <path-to-folder-to-be-uploaded> pride-drop-006@ah01.ebi.ac.uk:<name-of-target-dir-specified-by-PRIDE>
The user will then be prompted for the server password. A transfer report will also be generated,
summarizing the files transferred successfully. The command can be modified to use a lower transfer
speed limit in order to reduce connection time-outs. For instance, replacing “-l500m” with “-l250m”
specifies a limit of 250 Mbits per second.
Finally, the submitter needs to notify the PRIDE team that an Aspera-based upload has taken place.
PRIDE staff will then continue to process the submission.
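For scripted submissions, the upload command shown in this step can be assembled in code so that the speed limit and paths are easy to change. The Python sketch below is a minimal example that reuses only the options from the command above; the function names, default arguments and example paths are hypothetical, and the target directory and password must still be obtained from PRIDE staff.

```python
import subprocess


def build_ascp_command(local_dir, target_dir, rate_mbps=500, ascp="./ascp",
                       user_host="pride-drop-006@ah01.ebi.ac.uk"):
    """Return the ascp argument list used in Step 3.

    rate_mbps sets the '-l<N>m' limit, e.g. 250 gives '-l250m'.
    On Windows, pass ascp="ascp.exe".
    """
    return [
        ascp,
        "-QT",
        f"-l{rate_mbps}m",
        "--file-manifest=text",
        "-k", "2",
        local_dir,
        f"{user_host}:{target_dir}",
    ]


def upload(local_dir, target_dir, **options):
    """Run the transfer from within the Aspera Connect tool's directory.

    ascp prompts for the server password on the terminal.
    """
    subprocess.run(build_ascp_command(local_dir, target_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the local folder name contains spaces. A call such as `upload("my_dataset", "<name-of-target-dir-specified-by-PRIDE>", rate_mbps=250)` would run the transfer with the reduced limit.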
4. How to perform bulk submissions
Step 1 – Generation of the ‘PX summary file’
As explained in section 3, step 1 above.
Step 2 – File transfer
This can be done through the ‘Bulk submission’ option in the PX submission tool (available in Step 1)
(Figure 2, panel 1, main manuscript), or following the command line option, as explained in the
previous section.
5. How to access data privately in PRIDE
Private data set files can be accessed in two ways: via the PRIDE Archive web site or via PRIDE Inspector.
(i) The PRIDE Archive web site. As explained in the main manuscript (section ‘How to
make a data set public or add the corresponding reference’), personal or reviewer accounts can be
used to access and download the individual data sets.
(ii) In the PRIDE Inspector tool, select ‘Review Project’ and enter the account details (username and
password). PRIDE Inspector can be used to visualize PRIDE XML or mzIdentML files (‘RESULT’ files in
‘complete’ submissions), or just to download all the files to a given directory (in the case of ‘partial’
submissions).
Abbreviations
CSV: Comma Separated Values
CV: Controlled Vocabulary
DOI: Digital Object Identifier
EBI: European Bioinformatics Institute
FDR: False Discovery Rate
GUI: Graphical User Interface
MGF: Mascot Generic File
MS: Mass Spectrometry
PRIDE: PRoteomics IDEntifications (database)
PSI: Proteomics Standards Initiative
PX: ProteomeXchange
References
[1] Pedrioli, P. G., Eng, J. K., Hubley, R., Vogelzang, M., et al., A common open representation of mass
spectrometry data and its application to proteomics research. Nature Biotechnol. 2004, 22, 1459-1466.
[2] Martens, L., Chambers, M., Sturm, M., Kessner, D., et al., mzML--a community standard for mass
spectrometry data. Mol. Cell. Proteomics 2011, 10, R110.000133.
[3] Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for
mass spectrometry-based proteomics results. Mol. Cell. Proteomics 2012, 11, M111.014381.
[4] Cote, R. G., Griss, J., Dianes, J. A., Wang, R., et al., The PRoteomics IDEntification (PRIDE)
Converter 2 framework: an improved suite of tools to facilitate data submission to the PRIDE
database and the ProteomeXchange consortium. Mol. Cell. Proteomics 2012, 11, 1682-1689.
[5] Walzer, M., Qi, D., Mayer, G., Uszkoreit, J., et al., The mzQuantML data standard for mass
spectrometry-based quantitative studies in proteomics. Mol. Cell. Proteomics 2013, 12, 2332-2340.