Need of standards in Proteomics

10. Standards in Proteomics
MS bioinformatics analysis for proteomics
Salvador Martínez de Bartolomé
smartinez@proteored.org
Bioinformatics support – ProteoRed
Proteomics Facility, National Center for
Biotechnology, Madrid
Index
• Need of standards in Proteomics
• HUPO-PSI
– Organization
– Standard data formats
– CVs
– MIAPEs
• PEFF: A Common Sequence Database Format in
Proteomics
• PRIDE
• Standard data format converters
Need of standards in Proteomics
Proteomics data is often only made available as arbitrarily
formatted PDF tables, carrying important limitations:
• Source data (mass spectra) are not made available
• No peer review validation possible
• Very little raw material is available for testing innovative in silico
techniques
• Automated (re-)processing of the identifications is
impossible (eliminating objective technique comparison)
Thoughts on Standards
• Bradshaw RA, Burlingame AL, Carr S, Aebersold R.
Reporting protein identification data: the next generation of guidelines.
Mol Cell Proteomics. 2006 May;5(5):787-8.
• Wilkins et al.
Guidelines for the next 10 years of proteomics.
Proteomics. 2006 Jan;6(1):4-8.
• Nature Biotechnology 2006, Nov:
– Editorial: Standard operating procedures
– Burgoon LD. The need for standards, not guidelines, in biological
data reporting and sharing.
– Ball C. Are we stuck in the standards?
• Nature Biotechnology: Planned focus issue and Community Consultation
on Standards: http://www.nature.com/nbt/consult/index.html
Need of standards in Proteomics
• Proteomics: no standardized reporting, no standard database
submission
• Proteomics data is generated at a high rate, and lost at a high
rate
• Experiments are repeated unnecessarily, and the field advances
more slowly than necessary
Need of standards in Proteomics
• Standards are needed to:
– Store data
– Review data
– Reproduce results
– Compare data
– Exchange data
HUPO PSI
Proteomics Standards Initiative
http://www.psidev.info
HUPO PSI
Meetings
http://www.psidev.info
HUPO PSI
The Proteomics Standards Initiative (PSI) aims to define community
standards for data representation in proteomics and to facilitate
data comparison, exchange and verification.
Orchard, S., Hermjakob, H., Apweiler, R.
The proteomics standards initiative.
Proteomics 2003, 3 (7).
• Open community initiative
• Develop data format standards
• Data representation and annotation standards
• Involve data producers, database providers, software producers,
publishers
HUPO PSI structure
The main unit is the working group
• Gel Electrophoresis
• Molecular Interactions
• Sample Processing
• Mass spectrometry
• Proteomic Informatics (MS oriented)
• Protein Modifications
Transversal activities
• One Steering Group
• Controlled vocabulary
• MIAPE guidelines
HUPO PSI structure
• No permanent funding, active members work in their “spare
time”
• Annual workshop, reporting activity at annual HUPO,
conference calls, dedicated workshops
• Website (http://psidev.info) and mailing-lists
• PSI Document process
• Vizcaino, J.A., Martens, L., Hermjakob, H., Julian, R.K. and Paton, N.W. (2007)
The PSI formal document process and its implementation on the PSI website.
Proteomics 7: 2355-2357.
HUPO PSI document process
[Flowchart of the PSI document process. Document stages: PSI Working
Draft Proposal (PWD-R.P), PSI Final Document Proposal (PFD-R.P) and
PSI Final Document (PFD-R), each posted and announced by the PSI
Editor. Review steps: PSI Editor review of the draft, 15-day PSI-SG
comment, 30-day public comment, and a 60-day formal external review
and public comment conducted by the PSI-SG and the PSI Editor. Each
review ends in Pass or Revise; the PSI-WG submits the PFD-R.P with
supporting documents (tutorials, etc.) to the PSI-SG when requesting
PFD-R status.]
Community consultation at:
http://www.nature.com/nbt/consult/
HUPO-PSI
• Project status
HUPO-PSI
PSI deliverables
• Data formats (XML schema, instance docs, specification docs)
– MIML
– mzML
– AnalysisXML
– gelML
– giML
– spML
• Controlled Vocabularies
• MIAPE docs (representation and annotation standards)
• MIAPE minimal reporting requirements
– One parent document: The minimum information about a proteomics
experiment (MIAPE), Nature Biotechnology 25, 887-893 (2007)
– MIAPE MI, MS, MSI, GE, GI, CC, CE, SP
Standard data formats for
Experimental data: spectra, acquisition parameters,
acquisition equipment, ...
Analyzed data: identifications, quantitations, data analysis
software ...
Standard data formats
Experimental data: spectra, acquisition parameters,
acquisition equipment, ...
• mzML: data format capturing peak list information.
• Its aim is to unite the large number of current formats (pkl's,
dta's, mgf's, .....) into one
• It is NOT a substitute for the raw file formats of the instrument
vendors. Some vendors, if not all, will provide software
transforming their raw files to that standard
[Diagram: the mzXML versions from the Seattle Proteome Center at the
Institute for Systems Biology (mzXML 2.0, 3.0, 4.0) and the HUPO-PSI
format versions (1.05, 2.0) lead to mzML 1.0]
mzML: Released on June 1st, 2008
Sample instance document mzML 1.0
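To make the shape of such an instance document concrete, here is a minimal Python sketch that builds and then reads a tiny mzML-style fragment. It is an illustration under assumptions, not the released schema: the fragment is hand-written and not schema-valid, the namespace URI and the accession numbers (MS:1000511 for the MS level, MS:1000514 and MS:1000515 for the m/z and intensity arrays) are taken from later mzML releases, the arrays are assumed to be uncompressed little-endian 64-bit floats, and `encode_array`, `decode_array`, `accessions` and `read_spectra` are hypothetical helper names.

```python
import base64
import struct
import xml.etree.ElementTree as ET

# Assumed namespace; the mzML 1.0 release may use a different URI.
NS = "{http://psi.hupo.org/ms/mzml}"


def encode_array(values):
    """Pack floats as little-endian 64-bit doubles, then base64-encode."""
    packed = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(packed).decode("ascii")


def decode_array(text):
    """Reverse of encode_array: base64 text -> list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


# Hand-written, simplified fragment (not schema-valid).
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
 <run id="demo_run">
  <spectrumList count="1">
   <spectrum id="scan=1" index="0">
    <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
    <binaryDataArrayList count="2">
     <binaryDataArray>
      <cvParam cvRef="MS" accession="MS:1000514" name="m/z array"/>
      <binary>%s</binary>
     </binaryDataArray>
     <binaryDataArray>
      <cvParam cvRef="MS" accession="MS:1000515" name="intensity array"/>
      <binary>%s</binary>
     </binaryDataArray>
    </binaryDataArrayList>
   </spectrum>
  </spectrumList>
 </run>
</mzML>""" % (encode_array([100.5, 200.25]), encode_array([10.0, 20.0]))


def accessions(element):
    """Map CV accession -> value for the cvParam children of an element."""
    return {cv.get("accession"): cv.get("value")
            for cv in element.findall(NS + "cvParam")}


def read_spectra(xml_text):
    """Yield one dict per spectrum with its MS level and decoded arrays."""
    root = ET.fromstring(xml_text)
    for spectrum in root.iter(NS + "spectrum"):
        record = {"id": spectrum.get("id"),
                  "ms_level": int(accessions(spectrum)["MS:1000511"])}
        for array in spectrum.iter(NS + "binaryDataArray"):
            data = decode_array(array.find(NS + "binary").text)
            params = accessions(array)
            if "MS:1000514" in params:
                record["mz"] = data
            elif "MS:1000515" in params:
                record["intensity"] = data
        yield record


spectra = list(read_spectra(SAMPLE))
print(spectra)
```

The point of the sketch is that every value is labelled by a controlled-vocabulary accession rather than by a free-text name, which is what lets different tools read the same file consistently.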
Standard data formats
Analyzed data: identifications, quantitations, data analysis
software ...
• AnalysisXML: describes the results of identification and
quantitation processes for proteins, peptides and protein
modifications from mass spectrometry
[Diagram: pepXML and protXML (Seattle Proteome Center at the
Institute for Systems Biology) alongside AnalysisXML (HUPO-PSI)]
AnalysisXML: v1.0 candidate (Dec 08)
Sample instance document AnalysisXML (beta)
Standard data formats
Other data:
XML data format    MIAPE
GelML              MIAPE GE
GelInfoML          MIAPE GI
miXML              MIAPE MIMIX
spML               MIAPE SP
Standard data formats
[Diagram: mass spectrometer A / mass spectrometer B → proprietary
format → converter → mzML → search engine A / search engine B →
analysisXML → public repository]
Controlled Vocabularies
The Controlled Vocabularies (CVs) of the Proteomics Standards
Initiative (PSI) provide a consensus annotation system to
standardize the meaning, syntax and formalism of terms used
across proteomics, as required by the PSI Working Groups.
Each PSI working group develops the CVs required by the
technology or data type it aims to standardize, following common
recommendations for development and maintenance.
At the PSI meeting in Washington (Sept 06), it was decided that all
PSI working groups should adopt the same CVs for standardizing
some overlapping concepts (units and resources).
Controlled Vocabularies
What is a CV?
Term: time of flight (100173)
Synonyms: TOF, time-of-flight, T.O.F.
Controlled Vocabularies
• PSI CVs are composed of two documents:
• a design principle description
• the implementation of the CVs in OBO (Open Biomedical
Ontologies) format
• Developing CVs is a process of collecting and, if necessary,
defining terms.
• Every effort must be made to adopt and re-use existing
ontologies or CVs where they exist, to avoid “re-inventing the
wheel”.
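As a rough illustration of what a CV implemented in OBO offers, the Python sketch below parses one hand-written term stanza and resolves several spellings to a single identifier. Everything in it is an assumption made for illustration: the identifier `EX:0000001` and the definition text are hypothetical, the tag layout follows the general OBO flat-file style rather than any specific PSI file, and `parse_terms` and `build_lookup` are made-up helper names.

```python
import re

# Hand-written stanza; identifier and definition are hypothetical.
OBO_TEXT = '''
[Term]
id: EX:0000001
name: time-of-flight
def: "Hypothetical definition, for illustration only." []
synonym: "TOF" EXACT []
synonym: "T.O.F." EXACT []
synonym: "time of flight" EXACT []
'''


def parse_terms(text):
    """Collect id, name and synonyms from every [Term] stanza."""
    terms = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"synonyms": []}
            terms.append(current)
        elif line.startswith("["):
            current = None  # ignore other stanza types
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "synonym":
                match = re.match(r'"([^"]*)"', value)
                if match:
                    current["synonyms"].append(match.group(1))
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


def build_lookup(terms):
    """Map every name and synonym (lower-cased) to its term identifier."""
    lookup = {}
    for term in terms:
        for label in [term["name"]] + term["synonyms"]:
            lookup[label.lower()] = term["id"]
    return lookup


LOOKUP = build_lookup(parse_terms(OBO_TEXT))
print(LOOKUP["tof"], LOOKUP["t.o.f."], LOOKUP["time of flight"])
```

All three spellings resolve to the same identifier, which is the practical benefit of annotating data with CV terms instead of free text.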
Ontology Lookup Service
http://www.ebi.ac.uk/ontology-lookup/
• The OLS provides a web service interface to query multiple
ontologies from a single location with a unified output format.
MIAPE: Minimum Information About a
Proteomics Experiment
Taylor, C.F., Paton, N.W., Lilley, K.S., Binz, P.A., Julian, R.K., Jr., Jones, A.R., Zhu, W., Apweiler, R.,
Aebersold, R., Deutsch, E.W., Dunn, M.J., Heck, A.J., Leitner, A., Macht, M., Mann, M., Martens, L., Neubert,
T.A., Patterson, S.D., Ping, P., Seymour, S.L., Souda, P., Tsugita, A., Vandekerckhove, J., Vondriska, T.M.,
Whitelegge, J.P., Wilkins, M.R., Xenarios, I., Yates, J.R., 3rd and Hermjakob, H. (2007)
The minimum information about a proteomics experiment (MIAPE).
Nat Biotechnol 25: 887-893.
Sufficiency and practicability
• Unambiguous description of the experimental context
• Allow understanding of the results and their
interpretation
• Sufficient to permit a critical evaluation
• In principle allow recreation of the work
MIAPE guidelines
• It is:
– Describing a list of information and data to provide
when an experiment is reported (it is a content
descriptor)
• Peptide sequence, scores, modifications, mass errors, etc.
– Helping to assess quality control
• Number of replicates, expected error rate
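Because MIAPE describes content rather than representation, compliance can be pictured as a checklist applied to whatever data structure holds the report. The Python sketch below is only an illustration of that idea: the item names are hypothetical and loosely follow the examples on this slide, they are not taken from any MIAPE module, and `missing_items` is a made-up helper.

```python
# Hypothetical checklist, loosely based on the examples named on the slide.
REQUIRED_ITEMS = (
    "peptide_sequence",
    "score",
    "modifications",
    "mass_error",
    "replicates",
    "expected_error_rate",
)


def missing_items(report, required=REQUIRED_ITEMS):
    """Return the required items that are absent (or None) in a report.

    Only presence is checked; no threshold or quality judgment is applied.
    """
    return [item for item in required
            if item not in report or report[item] is None]


report = {
    "peptide_sequence": "PEPTIDEK",
    "score": 42.0,
    "modifications": [],   # an empty list still counts as reported
    "mass_error": 0.01,
    "replicates": 3,
}
print(missing_items(report))
```

The check reports what is missing and nothing more, mirroring the way MIAPE asks for information without prescribing acceptable values.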
MIAPE guidelines
• It is not:
– Describing the way to run an experiment
• does not specify the use of a search engine in particular
• does not force the use of one protocol
– Describing the data representation
• Use Excel to create a table with the following five
columns:…
– Including any quality judgment
• need 30% sequence coverage to identify a protein
• “The absence of thorough validation of both analytical and
biological results, including error analysis should result in
rejection”
• “Authors should justify the use of a very small database or
database that excludes common contaminants, since this may
generate misleading assignments”
MIAPE guidelines
• MIAPE Gel Electrophoresis (GE) v1.4
• MIAPE Gel Informatics (GI) v0.5
• MIAPE Mass Spectrometry (MS) v2.22
• MIAPE Mass Spectrometry Informatics (MSI) v0.8
• MIAPE Column Chromatography (CC) v1.0
• MIAPE Capillary Electrophoresis (CE) v0.7
• MIAPE Sample Preparation and handling (SP) v0.2
• MIAPE Molecular Interactions (MI) v1.1.2
Online tool to generate and store MIAPE
documents
http://www.proteored.org
A MIAPE generator tool
• Either fill in all minimal information by hand and store it on the
ProteoRed server
• Or fill in only the changes or new items by hand, and let the tool
add the static information automatically from previous MIAPE
documents
Example document: HUPO-PSI MIAPE Gel Electrophoresis v1.2
Available actions: Edit document, Delete document, Generate report,
Generate XML
MIAPE Reports
PEFF: PSI Extended FASTA Format
A Common Sequence Database Format in
Proteomics
• P-A Binz, S Seymour, J Shofstahl, D Creasy, E Kapp
• Problem: search engines interpret the current FASTA format
inconsistently:
– Protein identifiers
– Description
– Taxonomy
– Other annotation (PTMs, sequence variants, etc.)
• Proposal: an alternative to the heterogeneous FASTA format, ideally
generated by the database providers, or alternatively via an accepted
converter, so that one single source sequence database can be submitted
to various search engines
– SwissProt and EBI have already agreed in principle
• A format proposal has been reached (not only for MS; flexible, extensible)
PEFF: PSI Extended FASTA Format
• A unified format for protein and nucleotide sequence databases to be used by
sequence search engines and other associated tools (spectra library search tools,
sequence alignment software, data repositories, etc).
• Enables consistent extraction, display and processing of information such as
protein/nucleotide sequence database entry identifier, description, taxonomy, etc.
across software platforms.
• Allows the representation of structural annotations such as post-translational
modifications, mutations and other processing events.
• Flat file that includes a header of metadata describing relevant information
about the database(s) from which the sequences were obtained (e.g., name,
version, etc.).
• Sequence database providers are encouraged to generate this format as
part of their release policy or to provide appropriate converters that can be
incorporated into processing tools.
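The points above (a metadata header plus consistently extractable identifier, description and taxonomy) can be sketched with a small parser. This is a hedged illustration, not the proposal's specification: the `# Key=Value` header lines, the `>prefix:accession \Key=Value` entry lines and the key names `DbName`, `DbVersion`, `PName` and `NcbiTaxId` are assumptions modelled on the later published PEFF syntax, the database name and sequences are invented, and `parse_peff` is a hypothetical helper.

```python
# Invented example content; syntax is an assumption (see lead-in).
PEFF_TEXT = r"""# PEFF 1.0
# DbName=ExampleDB
# DbVersion=2009_01
# //
>ex:P00001 \PName=Example protein \NcbiTaxId=9606
MKWVTFISLLLLFSSAYS
>ex:P00002 \PName=Another example \NcbiTaxId=10090
MSTNPKPQRK
"""


def parse_peff(text):
    """Return (header metadata, list of sequence entries)."""
    meta, entries, current = {}, [], None
    for line in text.splitlines():
        if line.startswith("#"):
            body = line[1:].strip()
            if "=" in body:                      # skip "PEFF 1.0" and "//"
                key, value = body.split("=", 1)
                meta[key] = value
        elif line.startswith(">"):
            fields = line[1:].split(" \\")       # " \Key=Value" separators
            prefix, accession = fields[0].split(":", 1)
            current = {"prefix": prefix, "accession": accession,
                       "sequence": ""}
            for field in fields[1:]:
                key, value = field.split("=", 1)
                current[key] = value.strip()
            entries.append(current)
        elif line.strip() and current is not None:
            current["sequence"] += line.strip()  # sequence may span lines
    return meta, entries


meta, entries = parse_peff(PEFF_TEXT)
print(meta)
for entry in entries:
    print(entry["accession"], entry["PName"], entry["NcbiTaxId"])
```

With explicit keys, every tool extracts the same identifier, description and taxonomy from an entry instead of guessing at a free-form FASTA description line.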
PRIDE – PRoteomics IDEntifications database
• Turns publicly available data into publicly accessible
data
• Protein identifications
• Experimental detail
• Peak lists
• Linkout to raw data
• Fully open source
• Fully open data
• Implementation of PSI standards as they are released
PRIDE
[Diagram: mass spectrometer A / mass spectrometer B → proprietary
format → converter → mzML → search engine A / search engine B →
analysisXML → public repository, with PRIDE marked on the data flow]
PRIDE – PRoteomics IDEntifications database
...Tomorrow with Alberto Medina
Standard data format converters
• msconvert (ProteoWizard):
– From: mzML, mzXML, Thermo RAW, MGF
– To: mzML, mzXML
– Vendor format reading restrictions: Thermo RAW: Windows
with XCalibur XDK installed
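As a usage sketch, the Python snippet below assembles an msconvert command line for one of the two output formats listed above. It only builds and prints the command; nothing is executed. The flags `--mzML`, `--mzXML` and `-o` reflect my understanding of the ProteoWizard command line and should be checked against `msconvert --help` for the installed version; `msconvert_command` and the file names are hypothetical.

```python
import shlex

# Output formats listed on the slide for msconvert.
SUPPORTED_OUTPUTS = ("mzML", "mzXML")


def msconvert_command(input_file, out_dir, out_format="mzML"):
    """Build (but do not run) an msconvert call as an argument list."""
    if out_format not in SUPPORTED_OUTPUTS:
        raise ValueError("unsupported output format: %s" % out_format)
    # Assumed flags: --mzML / --mzXML select the format, -o the output dir.
    return ["msconvert", input_file, "--" + out_format, "-o", out_dir]


cmd = msconvert_command("sample.RAW", "converted")
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine where msconvert is installed (and, for Thermo RAW input, the vendor libraries mentioned above), the list can be handed to `subprocess.run(cmd, check=True)`.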
Standard data format converters
• ReAdW version 4.0.2:
– From:
• Thermo RAW
– Exports to:
• mzXML
• mzML (not yet updated to final mzML 1.0 standard; try msconvert)
– Requires a valid installation of the Thermo XCalibur
software system, as it relies on the XCalibur libraries.
Standard data format converters
• CompassXport 1.3.6:
– From:
• analysis.baf (instrument families: APEX, micrOTOF, micrOTOF-Q, ...)
• analysis.yep (esquire/HCT instrument family)
• AutoXecute run for LCMaldi (instrument family: autoFlex, ultraFlex,
...)
• fid files (flex instrument family)
– Exports to:
• mzXML version 2.1
• mzData, version 1.05
• mzML (in progress)
– Does not require Bruker proprietary software to be installed
– Replaces mzBruker
Standard data format converters
• massWolf 4.0.2 (1st July 08):
– From:
• MassLynx native acquisition files
– Exports to:
• mzXML
– Requires installation of MassLynx software on the same
computer
– You must select the appropriate massWolf download to match
the version of your MassLynx software (4.0 or 4.1).
Standard data format converters
• mzWiff 4.0.2 (1st July 08):
– From:
• Analyst native acquisition (.wiff) files
– Exports to:
• mzXML
– Requires installation of Analyst software
Standard data format converters
• T2DExtractor (Dec 07):
– From:
• Data from SCIEX/ABI 4000 series MALDI TOF/TOF instruments
– Exports to:
• mzXML
Standard data format converters
• Trapper 4.1.0 (17th July 08):
– From:
• Agilent MassHunter format (.d directories)
– Exports to:
• mzXML
– Requires Agilent's MHDAC software installed
– This software will be included in the upcoming 4.1.0 TPP
distribution
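The converter slides can be condensed into a small lookup table. The Python sketch below records, for each tool named in this section, the input and output formats listed on its slide and answers the question "which tool converts X to Y?". The format labels are shortened by hand and `find_converters` is a hypothetical helper; version numbers and installation requirements are left out.

```python
# Input/output formats as listed on the converter slides (labels shortened).
CONVERTERS = {
    "msconvert": {"reads": {"mzML", "mzXML", "Thermo RAW", "MGF"},
                  "writes": {"mzML", "mzXML"}},
    # ReAdW's mzML export is noted as not yet updated to final mzML 1.0.
    "ReAdW": {"reads": {"Thermo RAW"}, "writes": {"mzXML", "mzML"}},
    "CompassXport": {"reads": {"Bruker baf", "Bruker yep", "Bruker fid"},
                     "writes": {"mzXML", "mzData"}},
    "massWolf": {"reads": {"MassLynx"}, "writes": {"mzXML"}},
    "mzWiff": {"reads": {"Analyst wiff"}, "writes": {"mzXML"}},
    "T2DExtractor": {"reads": {"ABI 4000 series"}, "writes": {"mzXML"}},
    "Trapper": {"reads": {"Agilent MassHunter"}, "writes": {"mzXML"}},
}


def find_converters(source, target):
    """Return the tools that read `source` and write `target`."""
    names = [name for name, spec in CONVERTERS.items()
             if source in spec["reads"] and target in spec["writes"]]
    return sorted(names, key=str.lower)


print(find_converters("Thermo RAW", "mzML"))
print(find_converters("Agilent MassHunter", "mzXML"))
print(find_converters("MassLynx", "mzML"))
```

The last query returns an empty list, which reflects the situation described in the slides: most vendor-specific converters export mzXML only, and mzML support is still arriving.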