PROCESS: Proteomics data Collection, Software and Standards to support open access and long term management of data
Andy Jones, Institute of Integrative Biology, University of Liverpool

Overview
• PROCESS outline
• 1 minute introduction to standards (mzML, mzIdentML, mzQuantML, mzTab and MIAPE)
• PROCESS objectives
• Addressing future challenges in proteomics data representation

PROCESS
• ProteomeXchange comes to an end in 2014
  – No obvious EU-FP7 routes for a follow-up grant
• Grant application to UK’s BBSRC Bioinformatics and Biological resource fund
  – PIs: Jones, Hermjakob and Vizcaino
  – 2 staff members to work on PSI standards & software
  – Financial support for PSI meetings in 2014-2016
  – Advisory board made up of industry and academics

Background - PSI standards
• mzML
  – XML-based standard for MS data (raw data and peak lists used e.g. in search engines)
  – Data stored in base64 binary within XML tags
  – Stable at version 1.1
• mzIdentML
  – XML-based standard for peptide and protein identification data e.g. output of a search engine
  – Stable at version 1.1
• mzQuantML
  – XML-based standard for quantification data (feature-level, peptide-level, protein-level) e.g. output of quantification software
  – Version 1.0 released Feb 2013
• mzTab
  – Flat file (tab-separated) format, capturing summary of quantification data (and ident data) for viewing in spreadsheet software or statistical analysis
  – Close to version 1.0 release
• MIAPE documents
  – Individual modules stating the metadata e.g. essential protocol info / software parameters, to be reported about a given technique (mass spec, identification software, quantification)
Standards all need some level of routine maintenance, response to new issues etc.
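The "base64 binary within XML tags" point can be made concrete with a minimal sketch. It assumes uncompressed, little-endian 64-bit floats, which is one of the encodings mzML permits. The function names and the sample m/z values are illustrative only and are not taken from any PSI library.

```python
import base64
import struct


def encode_array(values):
    """Pack floats as little-endian 64-bit doubles, then base64-encode them."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text):
    """Reverse of encode_array: base64-decode, then unpack the doubles."""
    raw = base64.b64decode(text)
    count = len(raw) // 8  # 8 bytes per 64-bit double
    return list(struct.unpack("<%dd" % count, raw))


# Illustrative m/z values (made up for this sketch).
mz_values = [100.0, 200.5, 300.25]
encoded = encode_array(mz_values)   # text that would sit inside an XML tag
decoded = decode_array(encoded)     # round-trips to the original values
```

Base64 turns every 3 bytes into 4 characters, so the encoded text is about a third larger than the binary it carries, before any XML markup is added.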

Tools from our groups and collaborators
• Java programming interfaces on the standards
  – jmzML, jmzIdentML, jmzQuantML, jmzTab
  – Build index over file (xxIndex) for random access even for very large files
• PRIDE Converter / PRIDE Inspector
• mzidLibrary
  – Processing routines for mzIdentML (FDR calculation, protein inference, converters for OMSSA and X!Tandem, emPAI etc.)
• ProteoIDViewer (viewer for mzIdentML)
• Java-based validators for the four formats
  – Check structure of file and correct use of terminology
  – Can check MIAPE compliance
Integration, maintenance and new features needed

PROCESS objectives
1. On-going maintenance and development of standards and CVs
2. Work directly with software vendors to provide export capabilities to PSI standards
3. Integrated converter for all open source formats into the standards
4. Maintain and update Java programming interfaces to the standards
5. Evolution of standards e.g. PTM localization, protein grouping, DIA, top-down
6. Promote re-use of PSI standards for metabolomics
7. New functionality in PRIDE Inspector
8. Develop standards and software for effective compression of data
9. Updates to PSI validation software

Commercial support for PSI standards – current status
Tool Vendor mzML mzIdentML mzQuantML mzTab Mascot & Distiller Matrix Science IMPORT EXPORT - - Progenesis LC-MS IMPORT - - - PLGS Nonlinear Dynamics Waters - - Proteome Discoverer Thermo limited support for IMPORT and EXPORT IMPORT - - - PEAKS, PEAKSQ etc. Bioinf. Solutions IMPORT Inc. Scaffold, Scaffold Proteome PTM, Scaffold Q Software Inc EXPORT - - IMPORT EXPORT and - - Phenyx GeneBio IMPORT - Byonic Protein Metrics - ProteinPilot AB Sciex ?? EXPORT via add-on EXPORT under dev. - SpectrumMill Agilent -
• Standards support in commercial tools is patchy at best...
• New quant formats (unsurprisingly) not implemented yet
• Having external tools and converters is okay, but better to have direct implementation

Commercial support for PSI standards – aim in PROCESS
Tool | Vendor | mzML | mzIdentML | mzQuantML | mzTab
Mascot & Distiller | Matrix Science
Progenesis LC-MS | Nonlinear Dynamics
PLGS | Waters
Proteome Discoverer | Thermo
PEAKS, PEAKSQ etc. | Bioinf. Solutions Inc.
Scaffold, Scaffold PTM, Scaffold Q | Proteome Software Inc
Phenyx | GeneBio
Byonic | Protein Metrics
ProteinPilot | AB Sciex
SpectrumMill | Agilent
Of course, not all packages need to export all formats, e.g. the two quant formats serve different needs and we could map automatically from mzq to mzTab

Improving commercial buy-in
• Q. Why is it important to get good uptake of standards?
  – Still a lot of fragmentation across proteomics tools
    • No package can import data from every instrument type and every search engine
  – Very difficult to do benchmarking or QC without standards: one step cannot be compared in isolation from all other factors, e.g.
    • Instrument performance; peak picking; search engine performance; protein inference; quantification accuracy etc.
  – General principle of “open-access” science
• Software/instrument vendors will only invest time if their users request it or to link up with beneficial tools
• Public data deposition still the exception
• Example of a good “carrot”:
  – If you wish to publish in MCP – long list of guidelines to complete.
  – Our goal should be to make data deposition and fulfilment of metadata requirements “one-click”, through the standards and software

Improving commercial buy-in
• For vendor raw formats, ProteoWizard has been successful
  – Embeds vendors’ own libraries for file reading
  – Writes out to mzML (MSConvert and MSConvert GUI tool)
  – Pretty much universal mass spec format converter
  – http://proteowizard.sourceforge.net/
• Should we adopt a similar model for ident / quant formats?
  – Would be preferable to work with vendors to get them to export formats directly
  – Vendors have seen lots of formats come and go...
• Still exploring best model to make this work!

Developments to facilitate data production, sharing and deposition
• Various format converters in existence
  – ProteoWizard
  – PRIDE Converter
  – Lennart Martens’ group toolkits/APIs
  – mzidLibrary and mzqLibrary
• Needs unification, “one-stop-shop” for all formats in proteomics – achievable?
• Single validator for all MS-related PSI standards?

Developments to facilitate data analysis and re-use
• Raw file repository at EBI
• Aim – store data in PSI standards, unified toolkit for accessing all data

New features to be supported in standards
• Protein grouping
  – Still an unsolved problem...
  – Q. “How many proteins did you identify?”
  – A. “Well it depends how you count them”
  – Working group’s aim is to capture results of protein inference in a standard way (grouping, sameset, subset relationships etc.)
  – Standardise method of counting so different approaches can be compared
• Modification ambiguity
  – Scores associated with localisation of modification site
  – Correct communication of ambiguity e.g.
    • PEPTYYDER + Phospho@[5|6] => P=0.8
    • PEPTYYDER + Phospho@[4] => P=0.2
• Alternative / new MS techniques:
  – Ion mobility
  – Data independent acquisition

Handling the data explosion
• Size of data files is/will be a major issue for labs, databases and core facilities
  – Storage cost, data transfer, slow file access
  – Encouraging labs to use standard formats will make the problem worse
  – XML is very verbose – several times larger than a flat-file, could be 10x bigger than binary
• mzML
  – Hybrid format (XML storing base64 binary data)
  – Generally much larger than vendor raw files; gzip saves ~30%, but can be slow to decompress
• mzIdentML
  – All data stored as text within XML tags
  – Files can get very large if search engine exports ~10 PSMs per spectrum (+ fragmentation), e.g. 100s MB for a large search
  – gzip mzid achieves ~90% compression – gunzipping 500MB mzid file takes 1-2s
• What are the alternatives to pure XML?
  – Binary XML e.g. FastInfoset
  – HDF e.g. mz5 from Steen lab
  – Standard adoption of XML compression practices
Note: All current Microsoft formats docx, xlsx etc. are simply zipped XML; IBM encourages use of zipped XML for many Big Data applications

Some options for file compression
File type | Pros | Cons | mzML file size | mzML read time* | mzIdentML file size | mzIdentML read time†
Pure XML | Text based, industry standard, strong tool support | Verbose | 1.0 | 1.0 | 1.0 | 1.0
Gzip XML | industry standard, free tools, easy to implement | no direct access from API, thus slower than XML | 0.7 | 1.3 | 0.1 | 1.05
Gzip Fast Infoset | faster than XML; trivial to convert Java XML tools to FI | No free C++ libs; inaccessible to Perl/python developers | 0.7 | 0.65 | 0.1 | 0.4
HDF5 | fast, small | currently dependent on single C++ lib, limited Java support | 0.5 | 0.25 | No current code | No current code
Pros, cons and average performance of methods for compressing mzML and mzIdentML. We estimate mzQuantML performance to be somewhere between mzML and mzIdentML.
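The "Gzip XML" option can be tried out with a minimal sketch using only the Python standard library. The XML below is synthetic: the element and attribute names are made up for this illustration and are not the mzIdentML schema, so the ratio it yields only shows why repetitive tag-heavy text compresses well and should not be read as a measurement for real files.

```python
import gzip


def compression_ratio(xml_text):
    """Return compressed size divided by original size for a text document."""
    raw = xml_text.encode("utf-8")
    packed = gzip.compress(raw)
    return len(packed) / len(raw)


def synthetic_results(n):
    """Build a repetitive XML document with n rows (hypothetical element names)."""
    rows = [
        '<Item id="psm_%d" rank="1" chargeState="2" passThreshold="true"/>' % i
        for i in range(n)
    ]
    return "<Results>\n" + "\n".join(rows) + "\n</Results>"


xml_text = synthetic_results(10000)
ratio = compression_ratio(xml_text)

# Gzip is lossless: decompressing returns exactly the original text.
restored = gzip.decompress(gzip.compress(xml_text.encode("utf-8"))).decode("utf-8")
```

The same two calls, `gzip.compress` and `gzip.decompress`, are all a reader or writer needs, which is the "easy to implement" advantage; the cost is that the file has to be decompressed before an index can give random access to it.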

mz5
• Mapping of mzML into HDF5, using PSI-MS terminology
• HDF5 is a binary file format specification, designed for efficient file sizes and fast access of multi-dimensional data
• Implemented in ProteoWizard
• No current support for mz5 outside ProteoWizard (C++)
• Limited support for HDF outside of C or C++
Wilhelm, M., Kirchner, M., Steen, J. A. J., and Steen, H. (2012) mz5: Space- and Time-efficient Storage of Mass Spectrometry Data Sets. Molecular & Cellular Proteomics 11.

Linear read/write times and storage space requirements
Basic finding:
• ~50% space of mzML
• ~4 times faster reading
Some caveats:
• Is mzML reading optimal? Not obvious why mzML would be so much worse than mzXML?
“In addition, mz5 removes zero intensity scans and encodes m/z measurements in a delta mass representation, storing distances between consecutive m/z observations. The latter two measures yield a storage requirement reduction of 55%.”
Wilhelm M et al. Mol Cell Proteomics 2012;11:O111.011379. ©2012 by American Society for Biochemistry and Molecular Biology

Do we want (new) compressed standards?
Pros:
• Cost saving of data storage / transfer
• Better user experience with software with faster read times
• Some instruments (e.g. ion mobility) now produce raw files of 10GB and upwards
• Move towards cloud-based processing/storage – data transfer time could make this prohibitive
Cons:
• mzML and mzIdentML now considered stable standards
• mzQuantML release 1.0 – hope will be stable long term
• Major downside to making yet another standard
• Difficult enough getting commercial implementations of what we have
• General principle – the more complex the compression technique, the fewer people understand it or can implement it

Horses for courses
• Long term archival
  – Want smallest format, so long as it can be used or converted to something useful when needed
• Data visualisation
  – Fast, random access – a processed, lossy format may be okay for some tasks?
• Analysis software development
  – Typically raw data is needed (esp. for quant)
  – Some tasks take many CPU hours, so spending 1 minute or 5 minutes reading the file is not an issue
  – Standard must be accessible to developers in most modern languages
• Can one format be optimal for all use cases – should we aim for this?

Summary
• PROCESS will be focussed on some key outcomes:
  – Improving vendor support for PSI standards
  – Releasing tools to help informatics groups work with the standards
    • universal converters, validators
  – Data compression techniques for proteomics
  – Ensuring standards evolve with new techniques in proteomics

Acknowledgements
• Henning Hermjakob and Juan Antonio Vizcaino
• Support from:
  – Ian Morns (Nonlinear Dynamics), David Creasy (Matrix Science), Hans Vissers (Waters), Lian Yang (Bioinformatics Solutions Inc.), Brian Searle (Proteome Software), Sean Seymour (AB Sciex), Juergen Cox (Max Planck), Conrad Bessant (Queen Mary), Hanno Steen (Harvard), David Matthews (Bristol), Lennart Martens (Ghent), Parag Mallick (Stanford), Kathryn Lilley (Cambridge), Eric Deutsch (ISB, Seattle), Nuno Bandeira (UCSD), Alexey Nesvizhskii (Michigan), Robert Chalkley (UCSF), Andrew Dowsey (Manchester), Oliver Kohlbacher (Tuebingen), Steffen Neumann (Leibniz), Chris Steinbeck (EBI), Jyoti Choudhary (Sanger)
• Data compression prelim. work: Andy Dowsey and Faviel Gonzalez
• Lennart Martens – group developed jmzML and various APIs used in tools