PROCESS
Proteomics data Collection, Software and Standards to support open access
and long term management of data
Andy Jones,
Institute of Integrative Biology
University of Liverpool
Overview
• PROCESS outline
• 1 minute introduction to standards (mzML,
mzIdentML, mzQuantML, mzTab and MIAPE)
• PROCESS objectives
• Addressing future challenges in proteomics
data representation
PROCESS
• ProteomeXchange comes to an end in 2014
– No obvious EU-FP7 routes for a follow-up grant
• Grant application to UK’s BBSRC
Bioinformatics and Biological Resources Fund
– PIs: Jones, Hermjakob and Vizcaino
– 2 staff members to work on PSI standards &
software
– Financial support for PSI meetings in 2014-2016
– Advisory board made up of industry and academics
Background - PSI standards
• mzML
– XML-based standard for MS data (raw data and peak lists used e.g. in search engines)
– Data stored in base64 binary within XML tags
– Stable at version 1.1
• mzIdentML
– XML-based standard for peptide and protein identification data e.g. output of a search engine
– Stable at version 1.1
• mzQuantML
– XML-based standard for quantification data (feature-level, peptide-level, protein-level) e.g. output of quantification software
– Version 1.0 released Feb 2013
• mzTab
– Flat file (tab-separated) format, capturing summary of quantification data (and ident data) for viewing in spreadsheet software or statistical analysis
– Close to version 1.0 release
• MIAPE documents
– Individual modules stating the metadata e.g. essential protocol info / software parameters, to be reported about a given technique (mass spec, identification software, quantification)
Standards all need some level of routine maintenance, response to new issues etc.
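To illustrate what "base64 binary within XML tags" means for mzML, here is a minimal sketch that round-trips a small m/z array. It assumes 64-bit little-endian floats with optional zlib compression, which are encodings mzML permits. The function names are hypothetical and are not taken from any PSI library.

```python
import base64
import struct
import zlib


def encode_binary_array(values, compress=False):
    """Pack floats as little-endian 64-bit values, optionally zlib-compress, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_array(text, compressed=False):
    """Reverse encode_binary_array: base64-decode, optionally inflate, unpack doubles."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack("<%dd" % count, raw))


# Hypothetical m/z values, chosen only for illustration
mz_values = [100.0, 200.5, 300.25]
encoded = encode_binary_array(mz_values, compress=True)
decoded = decode_binary_array(encoded, compressed=True)
```

In a real file the reader would take the precision and compression settings from the CV terms attached to each binary data array rather than from function arguments.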
Tools from our groups and collaborators
• Java programming interfaces on the standards
– jmzML, jmzIdentML, jmzQuantML, jmzTab
– Build index over file (xxIndex) for random access even for very large files
• PRIDE Converter / PRIDE Inspector
• mzidLibrary
– Processing routines for mzIdentML (FDR calculation, protein inference, converters for OMSSA and X!Tandem, emPAI etc.)
• ProteoIDViewer (viewer for mzIdentML)
• Java-based validators for the four formats
– Check structure of file and correct use of terminology
– Can check MIAPE compliance
Integration, maintenance and new features needed
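The idea of building an index over a file for random access can be sketched as follows. This is only an illustration of the general technique in Python. It is not how xxIndex or the Java APIs are implemented, and the helper names and sample markup are assumptions.

```python
import io
import re

# Match the opening tag of a <spectrum> element and capture its id attribute
SPECTRUM_START = re.compile(rb'<spectrum\b[^>]*\bid="([^"]+)"')
SPECTRUM_END = b"</spectrum>"


def build_offset_index(handle):
    """Map each spectrum id to the byte offset of its opening tag."""
    data = handle.read()  # a real indexer would scan the file in chunks
    return {m.group(1).decode(): m.start() for m in SPECTRUM_START.finditer(data)}


def read_spectrum(handle, index, spectrum_id):
    """Seek straight to one spectrum and return its XML without parsing the rest."""
    handle.seek(index[spectrum_id])
    chunk = b""
    while SPECTRUM_END not in chunk:
        block = handle.read(4096)
        if not block:
            raise ValueError("unterminated spectrum element")
        chunk += block
    end = chunk.index(SPECTRUM_END) + len(SPECTRUM_END)
    return chunk[:end].decode()


# Tiny hypothetical mzML-like fragment standing in for a very large file
sample = (
    b'<spectrumList count="2">'
    b'<spectrum index="0" id="scan=1"><cvParam name="ms level" value="1"/></spectrum>'
    b'<spectrum index="1" id="scan=2"><cvParam name="ms level" value="2"/></spectrum>'
    b'</spectrumList>'
)
handle = io.BytesIO(sample)
index = build_offset_index(handle)
second = read_spectrum(handle, index, "scan=2")
```

Once the index exists, fetching any one spectrum costs a seek and a short read, independent of file size.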
PROCESS objectives
1. On-going maintenance and development of standards and CVs
2. Work directly with software vendors to provide export capabilities to PSI standards
3. Integrated converter for all open source formats into the standards
4. Maintain and update Java programming interfaces to the standards
5. Evolution of standards e.g. PTM localization, protein grouping, DIA, top-down
6. Promote re-use of PSI standards for metabolomics
7. New functionality in PRIDE Inspector
8. Develop standards and software for effective compression of data
9. Updates to PSI validation software
Commercial support for PSI standards – current status
Tool | Vendor | mzML | mzIdentML | mzQuantML | mzTab
Mascot & Distiller | Matrix Science | IMPORT | EXPORT | - | -
Progenesis LC-MS
IMPORT
-
-
-
PLGS
Nonlinear
Dynamics
Waters
-
-
Proteome Discoverer
Thermo
limited support for IMPORT and EXPORT
IMPORT
-
-
-
PEAKS, PEAKSQ etc.
Bioinf. Solutions IMPORT
Inc.
Scaffold,
Scaffold Proteome
PTM, Scaffold Q
Software Inc
EXPORT
-
-
IMPORT
EXPORT
and -
-
Phenyx
GeneBio
IMPORT
-
Byonic
Protein Metrics
-
ProteinPilot
AB Sciex
??
EXPORT via add-on
EXPORT
under dev.
-
SpectrumMill
Agilent
-
• Standards support in commercial tools is patchy at best...
• New quant formats (unsurprisingly) not implemented yet
• Having external tools and converters is okay, but better to have direct implementation
Commercial support for PSI standards – aim in PROCESS
Tool | Vendor
Mascot & Distiller | Matrix Science
Progenesis LC-MS | Nonlinear Dynamics
PLGS | Waters
Proteome Discoverer | Thermo
PEAKS, PEAKSQ etc. | Bioinf. Solutions Inc.
Scaffold, Scaffold PTM, Scaffold Q | Proteome Software Inc
Phenyx | GeneBio
Byonic | Protein Metrics
ProteinPilot | AB Sciex
SpectrumMill | Agilent
Format columns: mzML, mzIdentML, mzQuantML, mzTab
Of course not all packages need to export all formats,
e.g. the two quant formats serve different needs and we
could map automatically from mzq → mzTab
Improving commercial buy-in
• Q. Why is it important to get good uptake of standards?
– Still a lot of fragmentation across proteomics tools
• No package can import data from every instrument type and every search engine
– Very difficult to do benchmarking or QC without standards, cannot compare one step in isolation of all other factors, e.g.
• Instrument performance; peak picking; search engine performance; protein inference; quantification accuracy etc.
– General principle of “open-access” science
• Software/instrument vendors will only invest time if their users request it or to link up with beneficial tools
• Public data deposition still the exception
• Example of a good “carrot”:
– If you wish to publish in MCP – long list of guidelines to complete.
– Our goal should be to make it “one-click” data deposition and fulfilment of metadata requirements – through the standards and software
Improving commercial buy-in
• For vendor raw formats, ProteoWizard has been successful
– Embeds vendors’ own libraries for file reading
– Writes out to mzML (MSConvert and MSConvert GUI tool)
– Pretty much universal mass spec format converter
– http://proteowizard.sourceforge.net/
• Should we adopt a similar model for ident / quant formats?
– Would be preferable to work with vendors to get them to
export formats directly
– Vendors have seen lots of formats come and go...
• Still exploring best model to make this work!
Developments to facilitate data production, sharing and deposition
• Various format converters in existence
– ProteoWizard
– PRIDE Converter
– Lennart Martens’ group toolkits/APIs
– mzidLibrary and mzqLibrary
• Needs unification, “one-stop-shop” for all formats in proteomics – achievable?
• Single validator for all MS-related PSI standards?
Developments to facilitate data analysis and re-use
• Raw file repository at EBI
• Aim – store data in PSI standards, unified
toolkit for accessing all data
New features to be supported in standards
• Protein grouping
– Still an unsolved problem...
– Q. “How many proteins did you identify?”
– A. “Well it depends how you count them”
– Working group’s aim is to capture results of protein inference in a standard way (grouping, sameset, subset relationships etc)
– Standardise method of counting so different approaches can be compared
• Modification ambiguity
– Scores associated with localisation of modification site
– Correct communication of ambiguity e.g.
• PEPTYYDER + Phospho@[5|6] => P=0.8
• PEPTYYDER + Phospho@[4] => P=0.2
• Alternative / new MS techniques:
– Ion mobility
– Data independent acquisition
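The sameset and subset relationships mentioned under protein grouping can be illustrated with a small sketch. It shows one simple way to group proteins by their peptide evidence, so that "how many proteins" becomes a count of groups. It is not the working group's agreed method, and the accessions and peptide sequences are made up.

```python
from collections import defaultdict


def group_proteins(protein_peptides):
    """Return (sameset_groups, subset_of) from a protein -> peptide-set mapping."""
    # Proteins supported by exactly the same peptides fall into one sameset group
    by_peptides = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)

    groups = {tuple(sorted(members)): peptides for peptides, members in by_peptides.items()}

    # A group is a subset group if its peptides are a strict subset of another group's
    subset_of = {}
    for group, peptides in groups.items():
        supersets = [other for other, other_peps in groups.items() if peptides < other_peps]
        if supersets:
            subset_of[group] = sorted(supersets)
    return sorted(groups), subset_of


# Hypothetical evidence, for illustration only
evidence = {
    "P1": {"AAK", "CCR", "DDK"},
    "P2": {"AAK", "CCR", "DDK"},  # same peptide set as P1
    "P3": {"AAK"},                # strict subset of P1/P2
    "P4": {"EEK"},
}
groups, subset_of = group_proteins(evidence)
independent_groups = [g for g in groups if g not in subset_of]
```

Here four accessions collapse to three groups, of which two have no superset. Which of those numbers is reported as "proteins identified" is exactly the counting question a standard would need to settle.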
Handling the data explosion
• Size of data files is/will be a major issue for labs, databases and core facilities
– Storage cost, data transfer, slow file access
– Encouraging labs to use standard formats will make the problem worse
– XML is very verbose – several times larger than a flat-file, could be 10x bigger than binary
• mzML
– Hybrid format (XML storing base64 binary data)
– Generally much larger than vendor raw files; gzip saves ~30%, but can be slow to decompress
• mzIdentML
– All data stored as text within XML tags
– Files can get very large if search engine exports ~10 PSMs per spectrum (+ fragmentation), e.g. 100s MB for a large search
– gzip mzid achieves ~90% compression – gunzipping 500MB mzid file takes 1-2s
• What are the alternatives to pure XML?
– Binary XML e.g. FastInfoset
– HDF e.g. mz5 from Steen lab
– Standard adoption of XML compression practices
Note: All current Microsoft formats docx, xlsx etc are simply zipped XML; IBM encourages use of zipped XML for
many Big Data applications
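A quick way to see why text-only XML compresses so well is to gzip a repetitive fragment. The sketch below uses a synthetic, mzIdentML-like snippet that is not schema-valid. Because it is far more repetitive than a real search result, its ratio will not match the figures quoted above.

```python
import gzip

# One synthetic PSM-like element, repeated with a changing id (illustrative only)
psm = (
    '<SpectrumIdentificationItem id="SII_{n}" rank="1" chargeState="2" '
    'peptide_ref="PEPTYYDER" passThreshold="true"/>\n'
)
xml_bytes = "".join(psm.format(n=n) for n in range(10000)).encode("utf-8")

compressed = gzip.compress(xml_bytes)
ratio = len(compressed) / len(xml_bytes)
restored = gzip.decompress(compressed)

print(f"original {len(xml_bytes)} bytes, gzipped {len(compressed)} bytes, ratio {ratio:.3f}")
```

Running the same few lines on a real mzIdentML or mzML file is a cheap way for a lab to check what gzip would save on its own data.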
Some options for file compression
File type | Pros | Cons | mzML file size | mzML read time* | mzIdentML file size | mzIdentML read time†
Pure XML | Text based, industry standard, strong tool support | Verbose | 1.0 | 1.0 | 1.0 | 1.0
Gzip XML | Industry standard, free tools, easy to implement | No direct access from API, thus slower than XML | 0.7 | 1.3 | 0.1 | 1.05
HDF5 | Fast, small | Currently dependent on single C++ lib, limited Java support | 0.5 | 0.25 | No current code | No current code
Gzip Fast Infoset | Faster than XML; trivial to convert Java XML tools to FI | No free C++ libs; inaccessible to Perl/Python developers | 0.7 | 0.65 | 0.1 | 0.4
Pros, cons and average performance of methods for compressing mzML and mzIdentML. We estimate mzQuantML
performance to be somewhere between mzML and mzIdentML.
mz5
• Mapping of mzML into HDF5, using PSI-MS terminology
• HDF5 is a binary file format specification, designed for
efficient file sizes and fast access of multi-dimensional data
• Implemented in ProteoWizard
• No current support for mz5 outside ProteoWizard (C++)
• Limited support for HDF outside of C or C++
Wilhelm, M., Kirchner, M., Steen, J. A. J., and Steen, H. (2012) mz5:
Space- and Time-efficient Storage of Mass Spectrometry Data Sets.
Molecular & Cellular Proteomics 11.
Linear read/write times and storage space requirements
Basic finding:
~50% space of mzML
~4 times faster reading
Some caveats:
Is mzML reading optimal? Not obvious why mzML would be so much worse than mzXML.
“In addition, mz5 removes zero intensity scans and encodes m/z measurements in a
delta mass representation, storing distances between consecutive m/z observations.
The latter two measures yield a storage requirement reduction of 55%.”
©2012 by American Society for Biochemistry and Molecular Biology
Wilhelm M et al. Mol Cell Proteomics 2012;11:O111.011379
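The "delta mass representation" in the quote can be sketched in a few lines. This is a generic illustration of delta encoding, not the mz5 implementation. The m/z values are hypothetical and were chosen to be exactly representable as floats, since a floating-point round trip is not guaranteed to be bit-exact in general.

```python
def delta_encode(mz_values):
    """Store the first m/z value followed by distances between consecutive values."""
    if not mz_values:
        return []
    deltas = [mz_values[0]]
    for previous, current in zip(mz_values, mz_values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the stored distances."""
    values = []
    total = 0.0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical sorted m/z values from one spectrum
mz = [100.0, 100.5, 101.25, 103.0]
encoded = delta_encode(mz)
decoded = delta_decode(encoded)
```

Because m/z values within a spectrum are sorted, the distances are small and similar to one another, which is what tends to make them compress better than the absolute values.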
Do we want (new) compressed standards?
Pros:
• Cost saving of data storage / transfer
• Better user experience with software with faster read times
• Some instruments (e.g. Ion mobility) now produce raw files > 10GB and
upwards
• Move towards cloud-based processing/storage – data transfer time could
make this prohibitive
Cons:
• mzML and mzIdentML now considered stable standards
• mzQuantML release 1.0 - hope will be stable long term
• Major downside to making yet another standard
• Difficult enough getting commercial implementations of what we have
• General principle – the more complex the compression technique, the fewer
people understand it or can implement it
Horses for courses
• Long term archival
– Want smallest format, so long as it can be used or converted to something
useful when needed
• Data visualisation
– Fast, random access – a processed, lossy format may be okay for some tasks?
• Analysis software development
– Typically raw data is needed (esp. for quant)
– Some tasks take many CPU hours, so spending 1 minute or 5 minutes reading
the file is not an issue
– Standard must be accessible to developers in most modern languages
• Can one format be optimal for all use cases – should we aim for this?
Summary
• PROCESS will be focussed on some key
outcomes:
– Improving vendor support for PSI standards
– Releasing tools to help informatics groups work
with the standards
• universal converters, validators
– Data compression techniques for proteomics
– Ensuring standards evolve with new techniques in
proteomics
Acknowledgements
• Henning Hermjakob and Juan Antonio Vizcaino
• Support from:
– Ian Morns (Nonlinear Dynamics), David Creasy (Matrix Science), Hans
Vissers (Waters), Lian Yang (Bioinformatics Solutions Inc.), Brian Searle
(Proteome Software), Sean Seymour (AB Sciex), Juergen Cox (Max Planck), Conrad Bessant (Queen Mary), Hanno Steen (Harvard), David
Matthews (Bristol), Lennart Martens (Ghent), Parag Mallick (Stanford),
Kathryn Lilley (Cambridge), Eric Deutsch (ISB, Seattle), Nuno Bandeira
(UCSD), Alexey Nesvizhskii (Michigan), Robert Chalkley (UCSF), Andrew
Dowsey (Manchester), Oliver Kohlbacher (Tuebingen), Steffen
Neumann (Leibniz), Chris Steinbeck (EBI), Jyoti Choudhary (Sanger)
• Data compression prelim. work: Andy Dowsey and Faviel Gonzalez
• Lennart Martens – group developed jmzML and various APIs used in
tools