10. Standards in Proteomics
MS bioinformatics analysis for proteomics

Salvador Martínez de Bartolomé
smartinez@proteored.org
Bioinformatics support – ProteoRed
Proteomics Facility, National Center for Biotechnology, Madrid

Index
• Need for standards in Proteomics
• HUPO-PSI
  – Organization
  – Standard data formats
  – CVs
  – MIAPEs
• PEFF: A Common Sequence Database Format in Proteomics
• PRIDE
• Standard data format converters

Need for standards in Proteomics
Proteomics data are often made available only as arbitrarily formatted PDF tables, which carries important limitations:
• Source data (mass spectra) are not made available
• No peer-review validation is possible
• Very little raw material is available for testing innovative in silico techniques
• Automated (re-)processing of the identifications is impossible (which rules out objective comparison of techniques)

Thoughts on Standards
• Bradshaw RA, Burlingame AL, Carr S, Aebersold R. Reporting protein identification data: the next generation of guidelines. Mol Cell Proteomics. 2006 May;5(5):787-8.
• Wilkins et al. Guidelines for the next 10 years of proteomics. Proteomics. 2006 Jan;6(1):4-8.
• Nature Biotechnology 2006, Nov:
  – Editorial: Standards Operating Procedures
  – Burgoon LD. The need for standards, not guidelines, in biological data reporting and sharing.
  – Ball C. Are we stuck in standards?
• Nature Biotechnology: planned focus issue and community consultation on standards: http://www.nature.com/nbt/consult/index.html

Need for standards in Proteomics
• Proteomics: no standardized reporting, no standard database submission
• Proteomics data are generated at a high rate, and lost at a high rate
• Experiments are repeated unnecessarily; the field advances more slowly than necessary

Need for standards in Proteomics
• Standards are needed to:
  – Store data
  – Review data
  – Reproduce results
  – Compare data
  – Exchange data

HUPO PSI: Proteomics Standards Initiative
http://www.psidev.info

The Proteomics Standards Initiative (PSI) aims to define community standards for data representation in proteomics and to facilitate data comparison, exchange and verification.
Orchard, S., Hermjakob, H., Apweiler, R. The proteomics standards initiative. Proteomics 2003, 3 (7).
• Open community initiative
• Develop data format standards
• Data representation and annotation standards
• Involve data producers, database providers, software producers, publishers

HUPO PSI structure
Main unit is the workgroup:
• Gel Electrophoresis
• Molecular Interactions
• Sample Processing
• Mass Spectrometry
• Proteomic Informatics (MS oriented)
• Protein Modifications
Transversal activities:
• One Steering Group
• Controlled vocabulary
• MIAPE guidelines

HUPO PSI structure
• No permanent funding; active members work in their "spare time"
• Annual workshop, activity reported at the annual HUPO meeting, conference calls, dedicated workshops
• Website (http://psidev.info) and mailing lists
• PSI document process
  – Vizcaino, J.A., Martens, L., Hermjakob, H., Julian, R.K. and Paton, N.W. (2007) The PSI formal document process and its implementation on the PSI website. Proteomics 7: 2355-2357.

HUPO PSI document process
[Figure: flowchart of the PSI document process. Actors: PSI working group (PSI-WG), PSI Editor, PSI Steering Group (PSI-SG). Each review step ends in Pass or Revise. Document stages posted and announced by the PSI Editor: PSI Working Draft Proposal (PWD-R.P), PSI Final Document Proposal (PFD-R.P), PSI Final Document (PFD-R). Review periods: 15-day PSI-SG comment, 30-day public comment, 60-day formal external review and public comment.]
Community consultation at: http://www.nature.com/nbt/consult/

HUPO-PSI
[Figure: project status.]

PSI deliverables
• Data formats (XML schema, instance docs, specification docs); representation and annotation standards:
  – MIML
  – mzML
  – AnalysisXML
  – gelML
  – giML
  – spML
• Controlled vocabularies
• MIAPE docs: minimal reporting requirements
  – One parent document: The minimum information about a proteomics experiment (MIAPE), Nature Biotechnology 25, 887-893 (2007)
  – MIAPE MI, MS, MSI, GE, GI, CC, CE, SP

Standard data formats
Standard data formats exist for:
• Experimental data: spectra, acquisition parameters, acquisition equipment, ...
• Analyzed data: identifications, quantitations, data analysis software, ...

Standard data formats
Experimental data: spectra, acquisition parameters, acquisition equipment, ...
• mzML: a data format capturing peak list information
• Its aim is to unite the large number of current formats (pkl's, dta's, mgf's, ...) into one
• It is NOT a substitute for the raw file formats of the instrument vendors. Some vendors, if not all, will provide software transforming their raw files to that standard
[Figure: format lineage. Seattle Proteome Center at the Institute for Systems Biology: mzXML 2.0, mzXML 3.0, mzXML 4.0. HUPO-PSI: mzData 1.05. Both lines lead to mzML 1.0.]
mzML: released on June 1st, 2008
[Figure: sample instance document, mzML 1.0.]

Standard data formats
Analyzed data: identifications, quantitations, data analysis software, ...
• AnalysisXML describes the results of identification and quantitation processes for proteins, peptides and protein modifications from mass spectrometry
[Figure: format lineage. Seattle Proteome Center at the Institute for Systems Biology: pepXML, protXML. HUPO-PSI: AnalysisXML.]
AnalysisXML: v1.0 candidate (Dec 08)
[Figure: sample instance document, AnalysisXML (beta).]

Standard data formats
Other data:
XML data format    MIAPE
GelML              MIAPE GE
GelInfoML          MIAPE GI
miXML              MIAPE MIMIX
spML               MIAPE SP

Standard data formats
[Figure: data flow. Mass spectrometers A and B produce proprietary formats; a converter turns them into mzML; search engines A and B read mzML and write analysisXML; both mzML and analysisXML go to a public repository.]

Controlled Vocabularies
The Controlled Vocabularies (CVs) of the Proteomics Standards Initiative (PSI) provide a consensus annotation system to standardize the meaning, syntax and formalism of terms used across proteomics, as required by the PSI working groups.
Each PSI working group develops the CVs required by the technology or data type it aims to standardize, following common recommendations for development and maintenance.
At the PSI meeting in Washington (Sept 06), it was decided that all PSI working groups should adopt the same CVs for standardizing some overlapping concepts (units and resources).

Controlled Vocabularies
What is a CV?
• Synonyms: TOF, time-of-flight, T.O.F., time of flight
• Term: 100173

Controlled Vocabularies
• PSI CVs are composed of two documents:
  – a description of the design principles
  – the implementation of the CVs in OBO (Open Biomedical Ontologies)
• Developing CVs is a process of collecting and, if necessary, defining terms.
• Every effort must be made to adopt and re-use existing ontologies or CVs where they exist, to avoid "re-inventing the wheel".
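The idea of a CV term with several synonyms, stored in OBO, can be sketched in a few lines of Python. This is only a minimal illustration, not the PSI tooling: the stanza below is a made-up example, the identifier `EX:0000001` is a hypothetical placeholder rather than a real PSI accession, and the helper names `parse_obo_terms` and `build_lookup` are assumptions introduced here. The parser handles only the `id`, `name` and `synonym` tags of `[Term]` stanzas.

```python
import re

# Hypothetical OBO fragment: the identifier EX:0000001 is a placeholder,
# not an accession taken from a real PSI controlled vocabulary.
EXAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: time-of-flight
synonym: "TOF" EXACT []
synonym: "T.O.F." EXACT []
synonym: "time of flight" EXACT []
"""


def parse_obo_terms(text):
    """Collect id, name and synonyms from every [Term] stanza."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = None
            if line == "[Term]":
                current = {"id": None, "name": None, "synonyms": []}
                terms.append(current)
            continue
        if current is None or not line or line.startswith("!"):
            continue  # file header, blank line or comment
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
        elif tag == "name":
            current["name"] = value
        elif tag == "synonym":
            # The synonym text is the quoted string at the start of the value.
            match = re.match(r'"((?:[^"\\]|\\.)*)"', value)
            if match:
                current["synonyms"].append(match.group(1))
    return terms


def build_lookup(terms):
    """Map every name and synonym (lower-cased) to its term identifier."""
    lookup = {}
    for term in terms:
        for label in [term["name"], *term["synonyms"]]:
            if label:
                lookup[label.lower()] = term["id"]
    return lookup


terms = parse_obo_terms(EXAMPLE_OBO)
lookup = build_lookup(terms)
```

With such a lookup, the spellings "TOF", "T.O.F." and "time of flight" all resolve to the same identifier, which is the point of annotating data with CV terms instead of free text.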
Ontology Lookup Service
http://www.ebi.ac.uk/ontology-lookup/
• The OLS provides a web service interface to query multiple ontologies from a single location with a unified output format.

MIAPE: Minimum Information About a Proteomics Experiment
Taylor, C.F., Paton, N.W., Lilley, K.S., Binz, P.A., Julian, R.K., Jr., Jones, A.R., Zhu, W., Apweiler, R., Aebersold, R., Deutsch, E.W., Dunn, M.J., Heck, A.J., Leitner, A., Macht, M., Mann, M., Martens, L., Neubert, T.A., Patterson, S.D., Ping, P., Seymour, S.L., Souda, P., Tsugita, A., Vandekerckhove, J., Vondriska, T.M., Whitelegge, J.P., Wilkins, M.R., Xenarios, I., Yates, J.R., 3rd and Hermjakob, H. (2007) The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25: 887-893.

Sufficiency and practicability
• Unambiguous description of the experimental context
• Allow understanding of the results and their interpretation
• Sufficient to permit a critical evaluation
• In principle, allow recreation of the work

MIAPE guidelines
• It is:
  – Describing a list of information and data to provide when an experiment is reported (it is a content descriptor)
    • Peptide sequence, scores, modifications, mass errors, etc.
  – Helping to assess quality control
    • Number of replicates, expected error rate

MIAPE guidelines
• It is not:
  – Describing the way to run an experiment
    • It does not specify the use of any particular search engine
    • It does not force the use of one protocol
  – Describing the data representation
    • e.g. "Use Excel to create a table with the following five columns: ..."
  – Including any quality judgment
    • e.g. "Need 30% sequence coverage to identify a protein"
    • e.g. "The absence of thorough validation of both analytical and biological results, including error analysis, should result in rejection"
    • e.g. "Authors should justify the use of a very small database or a database that excludes common contaminants, since this may generate misleading assignments"

MIAPE guidelines
• MIAPE Gel Electrophoresis (GE) v1.4
• MIAPE Gel Informatics (GI) v0.5
• MIAPE Mass Spectrometry (MS) v2.22
• MIAPE Mass Spectrometry Informatics (MSI) v0.8
• MIAPE Column Chromatography (CC) v1.0
• MIAPE Capillary Electrophoresis (CE) v0.7
• MIAPE Sample Preparation and handling (SP) v0.2
• MIAPE Molecular Interactions (MI) v1.1.2

A MIAPE generator tool
Online tool to generate and store MIAPE documents: http://www.proteored.org
• Either fill in all the minimal information by hand
• Or fill in only the changes or new items by hand, and automatically add static information from previous MIAPE documents (ProteoRed server)
[Screenshots: the MIAPE generator tool showing a HUPO-PSI MIAPE Gel Electrophoresis v1.2 document, with the actions Edit document, Delete document, Generate report and Generate XML.]

MIAPE Reports
[Screenshots: MIAPE reports produced with Generate report.]

PEFF: PSI Extended FASTA Format
A Common Sequence Database Format in Proteomics
• P-A Binz, S Seymour, J Shofstahl, D Creasy, E Kapp
• Problem: interpretation of the current FASTA format by search engines:
  – Protein identifiers
  – Description
  – Taxonomy
  – Other annotation (PTMs, sequence variants, etc.)
• Proposal: an alternative to the heterogeneous FASTA format, ideally generated by the database providers or, alternatively, via an accepted converter, so that one single source sequence database can be submitted to various search engines
• SwissProt and EBI have already agreed on the principle
• A format proposal has been reached (not only for MS; flexible, extensible)

PEFF: PSI Extended FASTA Format
• A unified format for protein and nucleotide sequence databases to be used by sequence search engines and other associated tools (spectral library search tools, sequence alignment software, data repositories, etc.).
• Enables consistent extraction, display and processing of information such as the protein/nucleotide sequence database entry identifier, description, taxonomy, etc. across software platforms.
• Allows the representation of structural annotations such as post-translational modifications, mutations and other processing events.
• Flat file that includes a header of metadata describing relevant information about the database(s) from which the sequences have been obtained (e.g., name, version).
• Sequence database providers are encouraged to generate this format as part of their release policy or to provide appropriate converters that can be incorporated into processing tools.
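The structure described above (a metadata header followed by annotated sequence entries) can be illustrated with a small Python sketch. The slides do not give the concrete syntax, so the layout used here is an assumption borrowed from the conventions of the later published PEFF specification: header lines starting with `#`, and description-line annotations written as `\Key=Value`. The database name, the entry, the sequence and the function name `parse_peff` are all hypothetical.

```python
# Hypothetical PEFF-like text. The "#" header lines and "\Key=Value"
# annotations are an assumed layout; names and values are invented.
EXAMPLE_PEFF = r"""# PEFF 1.0
# DbName=ExampleDB
# DbVersion=2008_12
# //
>ex:P00001 \PName=Example protein \TaxName=Homo sapiens \NcbiTaxId=9606
MKWVTFISLL
LLFSSAYS
"""


def parse_peff(text):
    """Return (header, entries) from PEFF-like text.

    header  : dict of Key=Value pairs found in the "#" lines
    entries : list of dicts with "id", "annotations" and "sequence"
    """
    header = {}
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line: keep only the Key=Value pairs.
            body = line[1:].strip()
            if "=" in body:
                key, _, value = body.partition("=")
                header[key.strip()] = value.strip()
            continue
        if line.startswith(">"):
            # Description line: identifier first, then "\Key=Value" items.
            parts = line[1:].split(" \\")
            current = {"id": parts[0].strip(), "annotations": {}, "sequence": ""}
            for part in parts[1:]:
                key, _, value = part.partition("=")
                current["annotations"][key.strip()] = value.strip()
            entries.append(current)
            continue
        if current is not None:
            # Sequence lines are concatenated until the next entry.
            current["sequence"] += line.strip()
    return header, entries


header, entries = parse_peff(EXAMPLE_PEFF)
```

Because identifier, description and taxonomy sit in named fields, every search engine reading the file would extract the same values, instead of each one guessing how to split a free-form FASTA description line.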
PRIDE – PRoteomics IDEntifications database
• Turns publicly available data into publicly accessible data
• Protein identifications
• Experimental detail
• Peak lists
• Linkout to raw data
• Fully open source
• Fully open data
• Implementation of PSI standards as they are released

PRIDE
[Figure: data flow with PRIDE as the public repository. Mass spectrometers A and B produce proprietary formats; a converter turns them into mzML; search engines A and B read mzML and write analysisXML; both are submitted to PRIDE.]
More on PRIDE tomorrow, with Alberto Medina.

Standard data format converters
• msconvert (ProteoWizard):
  – From: mzML, mzXML, Thermo RAW, MGF
  – To: mzML, mzXML
  – Vendor format reading restrictions: Thermo RAW requires Windows with the XCalibur XDK installed
• ReAdW version 4.0.2:
  – From:
    • Thermo RAW
  – Exports to:
    • mzXML
    • mzML (not yet updated to the final mzML 1.0 standard; try msconvert)
  – Requires a valid installation of the Thermo XCalibur software system, as it relies on the XCalibur libraries
• CompassXport 1.3.6:
  – From:
    • analysis.baf (instrument families: APEX, micrOTOF, micrOTOF-Q, ...)
    • analysis.yep (esquire/HCT instrument family)
    • AutoXecute run for LCMaldi (instrument families: autoFlex, ultraFlex, ...)
    • fid files (flex instrument family)
  – Exports to:
    • mzXML version 2.1
    • mzData version 1.05
    • mzML (in progress)
  – Does not require Bruker proprietary software to be installed
  – Replaces mzBruker
• massWolf 4.0.2 (1st July 08):
  – From:
    • MassLynx native acquisition files
  – Exports to:
    • mzXML
  – Requires installation of the MassLynx software on the same computer
  – You must select the massWolf download that matches the version of your MassLynx software (4.0 or 4.1)
• mzWiff 4.0.2 (1st July 08):
  – From:
    • Analyst native acquisition (.wiff) files
  – Exports to:
    • mzXML
  – Requires installation of the Analyst software
• T2DExtractor (Dec 07):
  – From:
    • data from SCIEX/ABI 4000 series MALDI TOF/TOF instruments
  – Exports to:
    • mzXML
• Trapper 4.1.0 (17th July 08):
  – From:
    • Agilent MassHunter format (.d directories)
  – Exports to:
    • mzXML
  – Requires Agilent's MHDAC software to be installed
  – This software will be included in the upcoming 4.1.0 TPP distribution
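As a rough illustration of how one of the converters listed above might be scripted, the following Python sketch builds an msconvert command line for converting a file to mzML or mzXML. It assumes that the `msconvert` executable from ProteoWizard is on the PATH and uses its `--mzML`, `--mzXML` and `-o` (output directory) options; the function names and the file names in the usage note are hypothetical, and the vendor-library restrictions listed above (for example for Thermo RAW) still apply.

```python
import subprocess


def build_msconvert_command(input_file, output_dir, output_format="mzML",
                            executable="msconvert"):
    """Build the argument list for an msconvert run.

    output_format is restricted to the two targets listed for msconvert
    in the slides: "mzML" or "mzXML".
    """
    if output_format not in ("mzML", "mzXML"):
        raise ValueError("output_format must be 'mzML' or 'mzXML'")
    return [executable, str(input_file), f"--{output_format}",
            "-o", str(output_dir)]


def convert(input_file, output_dir, output_format="mzML"):
    """Run msconvert; raises CalledProcessError if the conversion fails.

    Assumes msconvert is installed and reachable through the PATH.
    """
    command = build_msconvert_command(input_file, output_dir, output_format)
    subprocess.run(command, check=True)


# Building the command does not require msconvert to be installed.
example_command = build_msconvert_command("run01.mzXML", "converted")
```

Calling `convert("run01.mzXML", "converted")` would then write the mzML file into the `converted` directory, provided msconvert is available on that machine.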