Protein grouping in mzIdentML
ProteinDetectionList
  ProteinAmbiguityGroup id=“PAG1”
    ProteinDetectionHypothesis id=“PDH1” dbseq_ref=“dbseq_Q05421|CP2E1_MOUSE”
      anchor protein
    ProteinDetectionHypothesis id=“PDH2” dbseq_ref=“dbseq_Q05423|CP2E2_MOUSE”
      sequence same-set
    ProteinDetectionHypothesis id=“PDH3” dbseq_ref=“dbseq_Q05312|CP2F1_MOUSE”
      sequence subset
  ProteinAmbiguityGroup id=“PAG2”
    ....
ProteinAmbiguityGroup and ProteinDetectionHypothesis
Existing CV terms for ProteinDetectionHypothesis
id: MS:1001591
name: anchor protein
def: "A representative protein selected from a set of sequence same-set or spectrum same-set proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001592
name: family member protein
def: "A protein with significant homology to another protein, but some distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001593
name: group member with undefined relationship OR ortholog protein
def: "TO ENDETAIL: a really generic relationship OR ortholog protein." [PSI:MS]
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001594
name: sequence same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to an identical set of peptide sequences." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001595
name: spectrum same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
Existing CV terms for ProteinDetectionHypothesis (cont.)
id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set of the peptide sequence matches for another protein, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001597
name: spectrum sub-set protein
def: "A protein with a sub-set of the matched spectra for another protein, where the matches cannot be distinguished using the evidence in the mass spectra, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001598
name: sequence subsumable protein
def: "A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
id: MS:1001599
name: spectrum subsumable protein
def: "A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
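The sequence-based definitions above can be read as set relations on the peptide sequences matched to each protein. The following is a minimal sketch of that reading, not part of the CV or of any tool: the function name `classify_group`, the fallback label and the peptide placeholders (`pep1`, `pep2`, ...) are assumptions for illustration, while the accessions are those of the example group. The "subsumable" branch encodes one interpretation of "matches are distributed across two or more proteins".

```python
def classify_group(peptides_by_protein, anchor):
    """Label each protein in a group relative to the anchor protein.

    peptides_by_protein maps an accession to the set of peptide sequences
    matched to it; anchor is the accession chosen as group representative.
    """
    anchor_peptides = peptides_by_protein[anchor]
    labels = {}
    for protein, peptides in peptides_by_protein.items():
        # Union of the peptides matched to every other protein in the group.
        others = set()
        for other, other_peptides in peptides_by_protein.items():
            if other != protein:
                others |= other_peptides
        if protein == anchor:
            labels[protein] = "anchor protein"
        elif peptides == anchor_peptides:
            labels[protein] = "sequence same-set protein"
        elif peptides < anchor_peptides:
            labels[protein] = "sequence sub-set protein"
        elif peptides <= others:
            # Assumed reading: explained only by several proteins together.
            labels[protein] = "sequence subsumable protein"
        else:
            # Hypothetical fallback label, not a CV term.
            labels[protein] = "has distinguishing peptide matches"
    return labels


# Hypothetical peptide assignments for the three proteins of PAG1.
group = {
    "Q05421": {"pep1", "pep2", "pep3"},
    "Q05423": {"pep1", "pep2", "pep3"},
    "Q05312": {"pep1", "pep2"},
}
labels = classify_group(group, "Q05421")
```

With these placeholder peptides the labels match the example group: Q05423 is a sequence same-set protein and Q05312 a sequence sub-set protein of the anchor. The spectrum-based terms would need spectrum-level evidence that this sketch does not model.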
Problems
• No requirement for any exporter to use the terms (they are only “MAY”)
• “anchor protein” doesn’t capture the intended role and isn’t used consistently
id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set ...." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
• No definition of what should be put in the value slot of CV terms:
– Could be the PDH identifier, accession or DBSequence identifier of the group representative, or of any other protein that is a super-set of this protein
– Or anything else, for that matter
• What does passThreshold = “true” on PDH mean?
• Unclear how to count the number of identified proteins in an mzIdentML file
– Count PAGs or count PDHs?
• No terms for protocol describing how inference has been done or how to interpret results
Proposed work group outcomes
• Attach CV terms to <ProteinDetectionProtocol> describing how protein inference has been done
– Still under discussion, since these effectively describe parts of the algorithm used
• Exactly one mandatory “representative protein” (new name for “anchor protein”) MUST be present on a PDH in each group
– To be checked by the semantic validator
• ProteinDetectionList MUST have a CV term “number of identified proteins” (count PAGs that have a “representative protein” PDH with passThreshold=“true”)
• Each PDH SHOULD be flagged with one term from a group stating whether it is “representative protein”, “sequence|spectrum same-set”, “sequence|spectrum subset”, “sequence|spectrum subsumed” or “marginally distinguished” (i.e. not strictly any of these, but without enough evidence to be a group representative)
– The value slot of these terms SHOULD contain a comma-separated list of super-set or same-set (as appropriate) PDH IDs
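As an illustration of the two MUST rules above, here is a minimal sketch of how a validator might check for exactly one representative per group and derive the “number of identified proteins”. It rests on several assumptions: the term is matched by the cvParam name “representative protein” (the proposal gives no accession), the helper names are hypothetical, and the XML fragment is a simplified, un-namespaced stand-in for a real mzIdentML file, which declares an XML namespace.

```python
import xml.etree.ElementTree as ET

# Assumed cvParam name for the proposed term; no accession is given.
REPRESENTATIVE = "representative protein"


def representative_pdhs(pag):
    """Return the PDH elements of one PAG that carry the representative term."""
    return [
        pdh
        for pdh in pag.findall("ProteinDetectionHypothesis")
        if any(cv.get("name") == REPRESENTATIVE for cv in pdh.findall("cvParam"))
    ]


def check_and_count(protein_detection_list):
    """Return (number of identified proteins, list of rule violations)."""
    errors = []
    count = 0
    for pag in protein_detection_list.findall("ProteinAmbiguityGroup"):
        reps = representative_pdhs(pag)
        if len(reps) != 1:
            errors.append(
                f"{pag.get('id')}: expected exactly one representative protein, "
                f"found {len(reps)}"
            )
        elif reps[0].get("passThreshold") == "true":
            # Count the PAG only if its representative passed the threshold.
            count += 1
    return count, errors


# Simplified fragment for illustration only.
EXAMPLE = """
<ProteinDetectionList id="PDL1">
  <ProteinAmbiguityGroup id="PAG1">
    <ProteinDetectionHypothesis id="PDH1" passThreshold="true">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH2" passThreshold="true">
      <cvParam name="sequence same-set protein" value="PDH1"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG2">
    <ProteinDetectionHypothesis id="PDH3" passThreshold="false">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
"""

count, errors = check_and_count(ET.fromstring(EXAMPLE.strip()))
```

Here PAG1 is counted and PAG2 is not, because its representative has passThreshold=“false”. Counting PDHs instead would give a different number, which is the ambiguity the proposal is meant to remove.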
mzIdentML context: ProteinDetectionProtocol
CV terms: “No parsimony”, “Strict parsimony”, “Parsimony with additional considerations” (parent term: “Parsimony usage”)
Values: xsd:string (to allow free text description)
Requirement level: SHOULD
Description: “No parsimony” means that no parsimony approach has been applied in generating the protein list. “Strict parsimony” should be indicated if parsimony is the only consideration used to report proteins. “Parsimony with additional considerations” should be indicated if additional information, such as quantitation information, is used to influence which proteins are reported, or if some additional proteins are reported for other reasons, such as a desire to report one protein from each gene to which any matched peptide maps.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No intact protein separation for protein inference”, “Partial isolation for protein inference”, “Nearly complete isolation for protein inference” (parent term: “Role of intact protein separation in protein inference”)
Values: xsd:string (to allow free text description)
Requirement level: SHOULD
Description: In workflows where proteins are not separated to any degree, or in which protein separation information is not used in the protein inference, this will have a value of “No intact protein separation for protein inference”, as will be the case in strictly bottom-up proteomics. At the other limit, “Nearly complete isolation for protein inference” should be indicated when separation of intact proteins is conducted and relied upon for protein inference, as is common in multi-dimensional gel-based work. “Partial isolation for protein inference” should be specified for cases where some level of protein isolation is used, for example if a sizing column is used to separate intact proteins into fractions, or in the common GeLC-MS workflow where 1D gel separation is followed by bottom-up analysis of the gel slices.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Attempted isoform differentiation”, “Prevented isoform differentiation” (parent term: “Isoform Differentiation”)
Values: -
Requirement level: SHOULD
Description: In the context of a parsimony approach, an inference tool can either attempt to report multiple protein forms by determining whether there is adequate evidence to support the detection of more than one isoform in a cluster (most common), or prevent this differentiation process and maximally group instead.

mzIdentML context: ProteinDetectionProtocol
CV terms: Accession Ambiguity is Reported
Values: “true”, “false”
Requirement level: SHOULD
Description: Used for reporting whether ambiguity is reported, i.e. if true, PAGs may contain one or more PDHs; if false, each PAG must contain only one PDH (no attempt to report ambiguity).
Table 1 – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
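The caption's mapping from requirement level to validator message can be sketched as follows. This is an assumed simplification rather than the actual semantic validator: the names `SEVERITY`, `TABLE_1_RULES` and `report_missing` are hypothetical, and each rule is identified only by the parent term or term name listed in Table 1.

```python
# Requirement level -> message severity, as stated in the table caption.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}

# (term or parent term, requirement level) pairs taken from Table 1.
TABLE_1_RULES = [
    ("Parsimony usage", "SHOULD"),
    ("Role of intact protein separation in protein inference", "SHOULD"),
    ("Isoform Differentiation", "SHOULD"),
    ("Accession Ambiguity is Reported", "SHOULD"),
]


def report_missing(rules, present_terms):
    """Return one (severity, message) pair for every rule whose term is absent."""
    return [
        (SEVERITY[level], f"missing term: {term}")
        for term, level in rules
        if term not in present_terms
    ]


# A file that reports only its parsimony usage triggers three warnings.
messages = report_missing(TABLE_1_RULES, {"Parsimony usage"})
```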
mzIdentML context: ProteinDetectionProtocol
CV terms: Threshold applied to Peptides
Values: “true”, “false”
Requirement level: SHOULD
Description: Set to true if thresholds are applied at the PSM or peptide level prior to protein inference. If thresholds have been applied, these should be reported under ProteinDetectionProtocol->Threshold using appropriate CV terms.

mzIdentML context: ProteinDetectionProtocol
CV terms: Multiple matches per spectrum are considered
Values: “true”, “false”
Requirement level: SHOULD
Description: This should be set to false for protein inference approaches that consider only a single top-ranking peptide per spectrum during protein inference; true should be set for approaches that preserve multiple answers per spectrum and provide all of these to the protein inference algorithm.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Spectrum-centric parsimony minimization”, “Sequence-centric parsimony minimization”, “Sequence-centric parsimony minimization with additional rules”, “No parsimony minimization” (parent term: Parsimony Minimization Method)
Values: -
Requirement level: SHOULD
Description: “Sequence-centric parsimony minimization” means that the inference method has sought to find the minimal set of proteins that explains all the peptide sequences observed, while “Spectrum-centric parsimony minimization” means that the inference approach has sought to find the minimal set of proteins that explains the collection of observed spectra. “Sequence-centric parsimony minimization with additional rules” would apply if a sequence-centric approach is used together with additional rules, for example if allowances are made to compensate for limitations of this approach such as I/L and deamidation ambiguities. “No parsimony minimization” should be indicated only if the Parsimony usage field is set to “No parsimony”.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Exhaustive list ambiguity modeling”, “Limited list ambiguity modeling” (parent term: Ambiguity Modeling Approach)
Values: -
Requirement level: SHOULD
Description: In modelling a PAG, one approach is for an algorithm to list all known intersection relationships, including accessions that have very limited overlap with the representative protein in the group. Alternatively, an algorithm can limit the scope of the accessions that are included. For example, one could list only accessions that have at least some minimal level of intersection with the representative protein in the group. This CV term simply captures whether the group modelling is limited in some way or is exhaustive in listing accessions.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Protein Quality Threshold: MinimumNumSequencesRequired
Values: Integer
Requirement level: SHOULD
Description: An integer value representing the number of identified peptide sequences required for creating a PDH.

mzIdentML context: ProteinDetectionProtocol
CV terms: TaxonomyBasedPreference
Values: “true”, “false”
Requirement level: SHOULD
Description: In some workflows, one might map identified peptides to a multi-species protein sequence database, but prefer matches to sequences from a particular species.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Other thresholding terms?
Table 1 cont. – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
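To make “Sequence-centric parsimony minimization” concrete, the sketch below selects proteins until every observed peptide sequence is explained. It uses a greedy set-cover heuristic, which is a swapped-in simplification: it is not guaranteed to find the true minimal set and is not the method of any particular inference tool. The function name and the toy protein and peptide names are assumptions.

```python
def greedy_parsimony(peptides_by_protein):
    """Greedily pick proteins until all observed peptide sequences are explained.

    peptides_by_protein maps a protein accession to the set of peptide
    sequences matched to it. Ties are broken by accession order.
    """
    uncovered = set().union(*peptides_by_protein.values())
    chosen = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(
            sorted(peptides_by_protein),
            key=lambda protein: len(peptides_by_protein[protein] & uncovered),
        )
        chosen.append(best)
        uncovered -= peptides_by_protein[best]
    return chosen


# Toy input: P2 is a sequence subset of P1, P3 contributes peptide "d".
proteins = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b"},
    "P3": {"c", "d"},
}
selected = greedy_parsimony(proteins)
```

For this input the sketch selects P1 and then P3, leaving P2 unreported. A spectrum-centric variant would run the same procedure over spectrum identifiers instead of peptide sequences.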
mzIdentML context: ProteinDetectionList
CV term: number of identified proteins
Values: Integer
Requirement level: MUST
Description: The value reported should equal the number of PAGs containing a PDH flagged as Representative protein and with passThreshold=“true”.

mzIdentML context: ProteinAmbiguityGroup
CV term: Protein cluster identifier
Values: String. A within-file unique identifier
Requirement level: MAY
Description: A common identifier allows multiple PAGs to be linked, for example to indicate that some peptides are shared between different PAGs.

mzIdentML context: ProteinAmbiguityGroup
CV term: NumberDistinctProteinSequences
Values: Integer
Requirement level: SHOULD
Description: The number of distinct protein sequences among the PDHs in the group. For example, if there are two PDHs with different identifiers that have identical full-length sequences, the NumberDistinctProteinSequences would be one.

mzIdentML context: ProteinDetectionHypothesis
CV term: Representative protein
Values: -
Requirement level: MUST (be present on one PDH per PAG that is counted)
Description: The Representative protein will generally have a likelihood greater than or equal to that of the other proteins in the ProteinAmbiguityGroup, but this is not required. Exactly one PDH within a PAG must be assigned this label to serve as the representative for the putatively detected protein. A PDH labelled as the Representative protein can have passThreshold=“true|false”, i.e. it need not have passed the threshold reported in the ProteinDetectionProtocol.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Same-Set Protein
Values: xsd:string, a comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to another protein in the group, having matches to an identical set of peptide sequences.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Same-Set Protein
Values: xsd:string, a comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to the Representative protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra.
Table 2 – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
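Two of the Table 2 values are simple enough to sketch: the comma-separated list of PDH IDs carried in the value slot of the same-set terms, and NumberDistinctProteinSequences. The helper names and the toy sequences below are assumptions for illustration only.

```python
def parse_pdh_refs(value):
    """Split a comma-separated value slot into a list of PDH IDs."""
    return [ref.strip() for ref in value.split(",") if ref.strip()]


def unresolved_refs(value, pdh_ids_in_file):
    """Return the referenced PDH IDs that do not exist in the file."""
    return [ref for ref in parse_pdh_refs(value) if ref not in pdh_ids_in_file]


def number_distinct_protein_sequences(sequence_by_pdh):
    """Count distinct full-length sequences among the PDHs of one group."""
    return len(set(sequence_by_pdh.values()))


refs = parse_pdh_refs("PDH1, PDH4")
missing = unresolved_refs("PDH1, PDH4", {"PDH1", "PDH2", "PDH3"})
# Two PDHs with different identifiers but identical toy sequences count once.
distinct = number_distinct_protein_sequences({"PDH1": "MKTAYIAK", "PDH2": "MKTAYIAK"})
```

A check like `unresolved_refs` is the kind of rule a semantic validator could apply to the value slot, although the slides do not specify one.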
mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Subset Protein
Values: xsd:string, a comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the peptide sequence matches for the Representative protein, and no distinguishing peptide matches.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Subset Protein
Values: xsd:string, a comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the matched spectra for the Representative protein, where the matches cannot be distinguished using the evidence in the mass spectra.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Multiply Subsumable Protein
Values: xsd:string, a comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Multiply Subsumable Protein
Values: xsd:string, a comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Marginally distinguished protein
Values: -
Requirement level: MAY
Description: Assigned to a PDH that has some evidence to support its presence in addition to the representative protein, i.e. it has a unique peptide, but not sufficient evidence to be promoted to Representative protein in a PAG.

mzIdentML context: ProteinDetectionHypothesis
CV term: Covering Set Protein
Requirement level: MAY
Description: A member of a minimal set of proteins sufficient to explain all matched peptides/spectra via a parsimony approach. This provides an alternative means of reporting a parsimonious protein list when ParsimonyUsage=“Parsimony with additional considerations”. A PAG can contain zero, one, or multiple PDHs bearing this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Identical
Values: xsd:string, a comma-separated list of native accession(s) of proteins with identical protein sequence
Requirement level: MAY
Description: The full-length protein sequence is identical to that of the protein specified in the value attribute of this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Subsequence
Values: xsd:string, the native accession of the protein with the “super”-sequence
Requirement level: MAY
Description: The full-length protein sequence is a subsequence of the protein specified in the value attribute of this term.
Table 2 cont. – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
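The two DBSequence terms in the table describe relationships between full-length sequences, which can be sketched as a string comparison. The function name and the toy sequences are hypothetical, and “subsequence” is read here as a contiguous substring, which is an assumption the slides do not settle.

```python
def sequence_relationship(sequence, other_sequence):
    """Return the DBSequence term relating `sequence` to `other_sequence`, if any."""
    if sequence == other_sequence:
        return "Protein Sequence Identical"
    if sequence in other_sequence:
        # Assumed reading: contiguous substring of the longer sequence.
        return "Protein Sequence Subsequence"
    return None


identical = sequence_relationship("MKTAYIAK", "MKTAYIAK")
fragment = sequence_relationship("TAYI", "MKTAYIAK")
unrelated = sequence_relationship("MKTW", "MKTAYIAK")
```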
Unresolved issues
• Are the protocol terms necessary / sensible / overkill?
• Is there general consensus on the idea that the number of identified proteins MUST be reported?
– and must equal the count of PAGs with PDH passThreshold=“true”
• Is it sensible to have SHOULD rules on all subset/same-sets?
• Extra terms for relationships between protein sequences
– Probably these will be removed
• Mechanism for updating the mzIdentML specifications and validation software
– Minor update + submission to shortened PSI process?