R3_Answers_Reviewers_PSI_mzTab_final_R2_ARJ

ANSWERS TO THE COMMENTS FROM THE REVIEWERS
Since the last round of review (March-April 2013), quite extensive changes have been made to the mzTab
format specification, based on the feedback received from the reviewers of the PSI document process and
from the reviewers of the manuscript submitted to Molecular and Cellular Proteomics (MCP).
As a result, we think that overall the format is now more complex, but we also expect that it is more
reliable and less ambiguous. There are now two types of mzTab files: ‘Identification’ and ‘Quantification’.
This is specified in the mandatory metadata field ‘mzTab-type’. ‘Identification’ files can be used to report
peptide, protein, and small molecule identifications. ‘Quantification’ files can be used for quantification
results, which optionally may contain identification results about the quantified proteins, peptides or small
molecules.
In addition, there are two levels of detail (called ‘mode’) of reporting data in mzTab files: ‘Summary’ and
‘Complete’. The ‘Summary’ mode can be used to report the final results, putting together data coming from
e.g. different replicates. On the other hand, the ‘Complete’ mode is used if detailed information for each
individual assay/replicate is provided. The ‘mode’ is specified in the mandatory metadata field ‘mzTab-mode’.
A new section called 'PSM' (peptide spectrum match) has been included as a replacement to 'Peptides',
which is now recommended only for ‘Quantification’ files. In addition, to improve data reporting,
‘Units’ are no longer present; instead, it is now possible to model the experimental design at a
high level of detail (Section 5.3 in the specification document, https://code.google.com/p/mztab/). In
practical terms there are four possible variants of mzTab files, by combining the ‘Identification’ and
‘Quantification’ ‘type’ with the ‘Complete’ and ‘Summary’ ‘mode’. The requirements (e.g. number of
mandatory fields in the different sections) are different depending on the mzTab ‘type’ and ‘mode’. Tables
2-6 in the specification document outline this.
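The four variants can be recognised from the two mandatory metadata fields alone. The following is a minimal Python sketch, not part of the specification: it assumes that metadata lines are tab-separated and start with an `MTD` prefix (an assumption here, since this letter only names the fields), and `read_variant` is a hypothetical helper name.

```python
from typing import Iterable, Tuple

# Values named in this letter for the two mandatory fields.
VALID_TYPES = {"Identification", "Quantification"}
VALID_MODES = {"Summary", "Complete"}


def read_variant(lines: Iterable[str]) -> Tuple[str, str]:
    """Return (type, mode) read from 'mzTab-type' and 'mzTab-mode'."""
    meta = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Assumed layout: MTD <tab> key <tab> value
        if len(parts) >= 3 and parts[0] == "MTD":
            meta[parts[1]] = parts[2]
    try:
        file_type, mode = meta["mzTab-type"], meta["mzTab-mode"]
    except KeyError as missing:
        raise ValueError(f"mandatory metadata field missing: {missing}")
    if file_type not in VALID_TYPES or mode not in VALID_MODES:
        raise ValueError(f"unknown variant: {file_type}/{mode}")
    return file_type, mode
```

A consumer could use the returned pair to decide which columns to require, since the mandatory fields differ by ‘type’ and ‘mode’.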
In addition, many small refinements have been made, since the specification document has been
thoroughly reviewed. For instance, the order of the columns in the different sections is now recommended
but not mandatory (a few reviewers asked for this change). New example files have now been generated
both from proteomics and metabolomics approaches (see https://code.google.com/p/mztab/wiki/ExampleFiles).
Finally, we want to thank all the reviewers for their constructive feedback. Below, we reply to the
comments raised.
Invited Reviewer 1
mzTab: exchange format for proteomics and metabolomics results
Review of specification documentation in DocProc, RC2
General thoughts:
Most of the comments have been clarified and I agree to the authors.
But there are some cases, which I think would not lead to a wider implementation of the proposed standard.
My main concern is still, that many attributes in the each layer are mandatory, but not easily to get (URI,
GO-annotation...) and for most cases not even necessary, even less for a co-worker without deeper
knowledge in proteomic, who should receive an mzTab file. For this case, which is obviously one of the
main goals of mzTab, the files will be bloated with "null"-columns. In 5.8 it is even stated, the results
SHOULD have a reliabilty score, which will inevitably lead to implementations with null-columns. Also
having search engine and search_engine_scores is very redundant.
I think, it is good to have a file format which gives the ability to give a small report of proteomic
experiments with a set of standardized attributes. But making too many of them mandatory ruins the beauty
and compactness of this format, in my opinion.
Answer: In the new specification some of the fields have been made optional. For instance, one of
them is ‘reliability_score’. See Tables 2-6 and section 6 in the Specification document for all the
details. However, we think that it is necessary to have both ‘search_engine’ and
‘search_engine_score’, since it is not always obvious to non-experts which scores correspond to each software.
Furthermore, there is still the problem with the ambiguous assignment of the peptides to proteins. I think it
may be much easier to allow for a peptide to have several accessions, than bloating the file with redundant
lines with exactly the same information but the accession, if someone wants to report the peptides on a PSM
level (e.g. for later spectral count processing).
Answer: These multiple peptide-protein assignments can still be reported using the attribute
“ambiguity_group” in the same row. One of the principles of the format was to report one “main”
protein identification (preferred peptide-protein mapping, if needed).
Specific comments:
Detailed definitions for "num_peptides" and "num_peptides_unambiguous" are not given and in 5.7 it is
stated, they are rather loosely and implementation dependant. In contrast, "num_peptides_distinct" is defined
by sequence+modifications, but does not allow for defining a peptide as distinct by sequence only, which is
often used, or sequence+modifications+charge.
Answer: Definitions are now included in sections 5.10 and 6 of the specification document.
Unfortunately the link to the OBO-file of the PSI Protein modifications workgroup seems to be broken.
Answer: The URL has been updated to http://psidev.cvs.sourceforge.net/psidev/psi/mod/data/PSIMOD.obo.
The small example in 5.5 states to encode the not given values with "-", shouln't these be "null" or "NaN"?
Answer: It has been corrected.
In 5.10.6 are spaces in the optional column name.
Answer: It has been corrected.
6.4 The Tab is written with uppercase T only here.
Answer: Now corrected.
6.4.2 I see no reason for giving an unassigned peptide a "null" as accession, only because it does not occur in
the protein section.
Answer: The “null” is provided if no protein identification is associated to a peptide identification
(for instance in peptidomics experiments). In this case, the “Protein” section would not be present in
the file. We think this is consistent with the use of “null” in the other attributes present in an mzTab
file. To prevent certain parsing errors, we decided that mzTab files must not contain empty cells but
must use the string “null” to represent missing values.
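The “no empty cells” rule can be illustrated with a short Python sketch. It assumes tab-separated rows; `parse_cells` is a hypothetical helper, not something defined by the specification.

```python
from typing import List, Optional


def parse_cells(line: str) -> List[Optional[str]]:
    """Split a tab-separated row, mapping the string 'null' to None.

    Empty cells are rejected, since missing values are expected to be
    written explicitly as 'null'.
    """
    cells = line.rstrip("\n").split("\t")
    if any(cell == "" for cell in cells):
        raise ValueError("empty cell found; missing values must be written as 'null'")
    return [None if cell == "null" else cell for cell in cells]
```

With this convention a reader can tell a deliberately missing value apart from a truncated or malformed row.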
6.4.11 and 6.4.13: if multiple RTs may be defined, the m/z values will probably differ as well, and why then
not also include different charges? In my opinion, either all or none should be allowed, this seems very
inconsistent.
Answer: The definitions and semantics of this attribute have been substantially improved in the new
version. In the peptide table a ‘retention time window’ is defined to allow for capture of elution
ranges over which quantitation values are calculated. If multiple points are given it is expected that
these relate to all scans from which quantitative values were generated.
SC Reviewer 1:
Global comments:
Much has been corrected in the R2 submission. Two major concerns that are coming from many reviewers
and not addressed / rejected are:
- Why forcing a high number of mandatory columns that will be empty in many cases? This list is
  somewhat arbitrary and not fully justified
- Why forcing an order of the mandatory columns, and at the same time have possible optional
  columns appearing as intercalated? As the columns have CVed title names this is not necessary, and
  in addition this will allow to intercalate optional columns, which will make parser break.
PSI Editor ME: From my point of view the balance between optional and mandatory is indeed arbitrary;
although this will probably influence the compliance of the standard within the community, there is no
editorial necessity to change something.
Answer: As mentioned at the beginning of this document, in the new version of the specification
document, some of the columns have been made optional depending on the mzTab ‘mode’ and ‘type’.
In addition, now the order of the columns is not mandatory, but recommended. See Tables 2-6 and
Section 6 of the specification document for more details.
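Because the column order is only recommended, a reader is safer looking columns up by header name than by position. A minimal Python sketch under stated assumptions: rows are tab-separated, and the `PSH`/`PSM` line prefixes in the usage example are illustrative assumptions; `index_columns` and `cell` are hypothetical helper names.

```python
from typing import Dict


def index_columns(header_line: str) -> Dict[str, int]:
    """Map each column name in a header line to its position."""
    names = header_line.rstrip("\n").split("\t")
    return {name: position for position, name in enumerate(names)}


def cell(row_line: str, columns: Dict[str, int], name: str) -> str:
    """Return the value of the named column, whatever its position."""
    cells = row_line.rstrip("\n").split("\t")
    return cells[columns[name]]
```

For example, `cell("PSM\tAAK\t2", index_columns("PSH\tsequence\tcharge"), "charge")` and `cell("PSM\t2\tAAK", index_columns("PSH\tcharge\tsequence"), "charge")` both return `"2"`, so intercalated optional columns do not break the lookup.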
Detailed comments on answers of invited reviewer 1:
Page 24 "uri", "go_terms" and "protein_coverage":
They should be optional, as they are not easily found after a peptide search (and not even given by many
search engines).
Answer: While it is true that these fields may be unavailable in some cases, we tried to keep the
number of optional columns as low as possible. This ensures that all mzTab files have a similar
structure and are therefore easier to use for inexperienced users.
SC Reviewer 1: If mzTAB is meant to be light and not heavy, this statement anc choice is
counterproductive. On one side mzTAB requires a minimal number of columns, which is for some
experiment not useful and makes the file bigger than necessary; on another side, there are some
optional columns; therefore the choice of keeping some obligatory is arbitrary. This must be
corrected for consistency reasons.
PSI Editor ME: same comment as above (From my point of view the balance between optional and
mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within
the community, there is no editorial necessity to change something.)
Answer: See our previous response. Take also into account that now the format has been made much
more flexible, by introducing the mzTab ‘type’ (‘Identification’ and ‘Quantification’) and ‘mode’
(‘Complete’ and ‘Summary’).
Page 25:
Why must the peptide section follow the (optional) protein section? I don't see any reason for this, except a
possible accession-parsing. Though, the accessions in the peptide don't have to match any accessions in the
protein section, neither must there be any protein section at all.
Answer: As mentioned before, we believe that a fixed order of sections and thereby a more “stable”
format makes the format easier to use. At the same time, we cannot see any disadvantage in
enforcing a certain order. People developing software that is able to generate mzTab files will on
average be more experienced than people “consuming” the files. For them, it should not make a
difference whether a certain order of sections is enforced while the latter group might find it helpful
to know “where” to find what type of information.
SC Reviewer 1: I do not agree: having a non absolutely fixed format is prone to parsing error. If a
less experienced person as you mention is not designing the count of the columns in a smart way, two
perfectly valid mzTAB files with differing number of columns will generate odd results…
PSI Editor ME: same comment as above (From my point of view the balance between optional and
mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within
the community, there is no editorial necessity to change something.)
Answer: See our previous responses. Also, the order of the columns is now not mandatory, only
recommended.
Page 26:
The "unique" column may be very useful, but as mzTab should be an easy generated format, it should be
optional and not every search engine reports this value.
Answer: See previous responses. If this information is not available, it is possible to just use ‘null’.
SC Reviewer 1: Again, it’s an overkill. Particularly here, in addition, the definition of unique might
vary from one tool to another, which makes the interpretation of this field non homogeneous
PSI Editor ME: same comment as above; additionally: definition of unique is given in 5.7 and seems
conclusive to me (i.e. “database-unique”).
Answer: We think this is an essential piece of information for clarifying the protein inference, one of
the reasons why interpretation of proteomics results is difficult for non-proteomics researchers. This
is the main reason why it was added to the mzTab specification.
Page 27:
Also "retention_time" is neither given by every search engine, nor by every method in MS relevant (e.g.
direct injection), and thus should be optional.
Answer: We believe that the retention time of a peptide or a small molecule is a vital piece of
information that is used and usable in many workflows. Therefore, it should be possible to report it
in mzTab. To minimize the number of optional columns this column was defined as non-optional.
SC Reviewer 1: Retention times are useful only if comparing highly similar chromatographic
conditions. This will only be valid if one compare and exchange results that fulfil these criteria. Even
if I agree that this information can be very useful in some area, it is not the case for many
straightforward identification and quantitation jobs.
PSI Editor ME: same comment as above (From my point of view the balance between optional and
mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within
the community, there is no editorial necessity to change something.)
Answer: We thought this attribute needed to be mandatory. If it is not available, in the mzTab file it
can be reported as “null”.
Page 28:
In the description of the metadata it is said, that the {UNIT_ID}_ms-file is an optional value (as all metadata
is). But the mandatory field "spectra_ref" uses these fields information. If it is not given in the metadata,
there is no use of this mandatory reference format for the "spectra_ref". So either this field should be
optional or rather the ms-file in the metadata mandatory.
Answer: While the protein / peptide / small molecule sections are table based, the metadata section
is key-value based. To enable the easy concatenation of the table based sections we have decided to
minimize the number of optional columns. At the same time, the metadata section was defined as an
“all optional” section. We believe, that while this specific constellation is not ideal these more
general rules are easier to understand.
SC Reviewer 1: This comment should address the problem addressed by this reviewer. Either add a
constrain to the meta data or allow that mzTAB files are inconsistent.
PSI Editor ME: I think this is a difference between syntax and semantics; the syntax allows ms_file to be
missing, but then – by semantics – the file is automatically not valid, because spectra_ref is mandatory.
There is no disadvantage to include this implicit “mandatoriness” into the specification document; so please
add it.
Answer: We have changed the specification document, as requested by the reviewer and the editor
and we think it is a good idea to clarify this. Now the element “ms_run[1-n]-location” is mandatory
in the Metadata section.
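The semantic constraint discussed here (every “spectra_ref” must point to a declared run) can be checked mechanically. The Python sketch below is illustrative only: it assumes that a reference starts with `ms_run[n]:` followed by a spectrum identifier, which is an assumption about the reference syntax and is not stated in this letter; `undeclared_runs` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable, List

# Metadata key named in this letter: ms_run[1-n]-location
MS_RUN_LOCATION = re.compile(r"^ms_run\[(\d+)\]-location$")
# Assumed reference syntax: ms_run[n]:<spectrum identifier>
SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:")


def undeclared_runs(metadata: Dict[str, str], spectra_refs: Iterable[str]) -> List[str]:
    """Return the references whose ms_run has no declared location."""
    declared = set()
    for key in metadata:
        match = MS_RUN_LOCATION.match(key)
        if match:
            declared.add(int(match.group(1)))
    problems = []
    for ref in spectra_refs:
        match = SPECTRA_REF.match(ref)
        if match is None or int(match.group(1)) not in declared:
            problems.append(ref)
    return problems
```

An empty result means every reference resolves to a run declared in the Metadata section.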
Detailed comments on answers of SC reviewer 1:
in 6 "Every line in an mzTab file must start"
=> capitalize MUST
Answer: This was updated in the specification document.
in text (under Params) "Any field that is not available should be left empty"
=> should'nt that be MUST be left empty ??
Answer: This was changed to MUST.
=> how are space and comma characters constrained?
SC Reviewer 1: No answer?
Answer: Sorry, we missed this point in the previous round of review. We think the reviewer is
referring to the situation where the name of the CV term contains a comma. We did not consider this
case before so thanks for pointing this out. The solution in this case is to use the same convention
used in CSV files, adding quotes (“) to the CV param name:
[label, accession, “first part of the param name , second part of the name”, value].
[MOD, MOD:00648, “N,O-diacetylated L-serine”,]
This information has been added to the specification document (section 6, “Format specification”).
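Since the convention is borrowed from CSV, a CSV reader can split such a parameter. The Python sketch below uses the standard `csv` module and assumes plain ASCII double quotes (") rather than typographic ones; `parse_cv_param` is a hypothetical helper, not part of the specification.

```python
import csv
from typing import List


def parse_cv_param(text: str) -> List[str]:
    """Split '[label, accession, name, value]' into its four fields.

    A name containing a comma is expected to be wrapped in double quotes,
    following the CSV convention.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("a CV param must be enclosed in square brackets")
    inner = text[1:-1]
    # skipinitialspace lets a quote that follows ', ' open a quoted field.
    fields = next(csv.reader([inner], skipinitialspace=True), [])
    fields = [field.strip() for field in fields]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}")
    return fields
```

Applied to the example above, `parse_cv_param('[MOD, MOD:00648, "N,O-diacetylated L-serine",]')` yields the label, the accession, the unsplit name, and an empty value.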
6.4.1 sequence
=> how to encode sequence ambiguity (I/L), others, and results from sequence tags experiments?
Answer: See Section 5.10.5 in the new version of the specification document. I/L can be represented as ‘J’.
SC Reviewer 1: And what about K/E, and sequence tags with {A1A2} where A1 and A2 are two residues, for
which we do not know the order in the sequence?
PSI Editor ME: Q/E = Z and N/D = B are already mentioned in 5.10.5. Regarding {A1A2} at the moment
one has to write XX, if I understand correctly. This is a loss of information, but from the editorial point of
view no necessity to change. @Authors: please decide whether you want to improve that now... Add that
special case / work-around into the specification document!
Answer: Sequence tag approaches are not properly supported by mzTab. It was not one of the initial
aims to have support for this type of approaches. This is one of the possible future developments.
This has been clarified in the specification document (Section 7).
Other question: why not using PSI-MS CV for terms? no relationships to mzIdentML terms? just
independant terms?
Answer: We do not quite understand the reviewer’s comment. The mzTab format specification does
not exclude any CVs from being used, and in fact recommends using the PSI-MS CV.
SC Reviewer 1: It appears that the column names do not follow the PSI-MS CV terms (some
examples below):
“Charge” is “charge state” in PSI-MS : MS:1000041
“Search_engine_score” is “search engine specific score for peptides” in PSI-MS: MS:1001143
“sequence” is “unmodified peptide sequence” in PSI-MS: MS:10000888
Would be appropriate to provide in the documentation a list of PSI-MS CVs and/or element names in
mzIdentML/mzQuantML to make sure that one can map the terms correctly.
PSI Editor ME: As I understand it, CV terms can be used but the above mentioned column names are “own
special labels”. Unfortunately there are also attributes in mzIdentML doubling CV terms (like
“SpectrumIdentificationItem/chargeState”). No editorial necessity to add such a list.
Answer: It was never our aim to use exactly the same naming for the name of the attributes in mzTab
and the PSI-MS CV terms. The definition of the attributes is present in the specification document. In
addition, for those terms for which definition is not included in the document (optional columns) it is
necessary to use names of PSI-MS terms. So, we think that following this approach there should not
be any naming ambiguity related issues.
Public commenter 2
I found the following, which are generally instances of commas or use of the definitive article.
Page 2
Plural experiments
This document addresses the systematic description of peptide, protein and small molecule identification and
quantification data retrieved from a mass spectrometry-based experiments
Page 3
Missing “a”
T1. Export of results to external software, that is not able to parse proteomics/metabolomics specific data
formats but can handle simple tab-delimited file formats. As a guideline the file format is designed to
be viewable by programs such as Microsoft Excel® and Open Office Spreadsheet.
be vs. being
T2. Allow the concatenation of results, and thus be able to combine results from multiple experiments
but also multiple entries from local LIMS databases or MS proteomics repositories.
Page 6
the core
There is a difficult issue with respect to how software should encode CV terms, such that changes to the core
can be accommodated.
Page 7
Comma to separate main clause
mzTab is designed to only hold experimental results, which in proteomics experiments can be very complex.
Page 9
Commas with furthermore
Several of the available techniques, furthermore, allow/require multiple similar samples to be multiplexed
and analyzed in a single MS run. When several biological samples are multiplexed these samples are referred
to as “subsamples” in mzTab. Subsamples MUST, furthermore, be linked to the used labels in the metadata
section of the mzTab file (see example below). In case a quantification method is used that does not lead to
multiplexed biological samples, the generated quantification values MUST be reported as subsample 1.
Replicates in experimental…
Answer: All the minor changes suggested by the reviewer have been done in the new version of the
specification document.
Public commenter 3:
Minor comments from me. Overall the specifications look fine and I don’t want to hold up the process of
finalising version 1.0 but a few additional clarifications will help with implementation, as noted below in
italic:
retention_time section should specify what to do for peptides analysed in multiple runs (subsamples) –
report a single value of RT in the “master run” or use the | to separate out values from different replicates.
I’m sure we discussed this issue but I don’t see that this made it into the spec doc.
Answer: As mentioned before, the definitions and semantics of this attribute have been substantially
improved in the new version. In the peptide table a ‘retention_time_window’ is defined to allow for
capture of elution ranges over which quantitation values are calculated. If multiple points are given
it is expected that these relate to all scans from which quantitative values were generated.
Mass_to_charge as above – is this a “master peptide” m/z value?
Answer: This has now been clarified in the specification document. It is assumed that the reported
mass to charge (m/z) value is for a given “master” peptide from one assay only (and the unlabeled
peptide in label-based approaches). If the exporter wishes to export values for all assays, this can be
done using optional columns.
Section 5.3 undistinguishable should be indistinguishable
Answer: Now fixed.
Section 5.4 “...mzTab is designed to be a simple data format. Therefore, the reporting of results from such
experimental designs is poorly supported in mzTab.”
Consider replacing “poorly” with “supported to only a limited extent”
Answer: Now changed. In fact, as highlighted above, now the reporting of the experimental design is
fully supported.
Section 5.9
{position}{Parameter}-{Modification or Substitution identifier}|{neutral loss}
...
{Parameter} is optional. It MAY be used to report a quantity e.g. a probability score
associated with the modification or location.
Reporting the first two possible sites for the phosphorylation with given probability score
Here only the modification field is given:
3[MS,MS:1001876, modification probability, 0.8]|4[MS,MS:1001876, modification probability, 0.2]-MOD:00412, 8-MOD:00412
I think an extra example is needed to show whether the following is also allowed:
(3|4)[MS,MS:1001876, modification probability, 0.8]-MOD:00412
Answer: We prefer not to allow this second option. The preferred option is the one listed above. This
clarification has been added to the specification document.
i.e. A probability for grouped positions – I recall this was desired by some groups. Or an even more complex
example:
(3|4)[MS,MS:1001876, modification probability, 0.8]|7[MS,MS:1001876, modification probability, 0.2]MOD:00412
As such the following needs to be specified:
- Are grouped positions allowed at all?
- If so, are round brackets allowed around grouped positions – otherwise impossible to tell if a
  cvParam is associated with the group or the last element of the group only
As a general note, writing code to read these possibilities looks pretty nightmarish. As a check, these would
be the parsing rules:
- Split different mods by comma
- Recognise the start of the modification CV param by “-“
- Recognise the end of the modification by comma or | for neutral loss
- Positions are separated by bars. Need decision on grouping and use of brackets?
- cvParam associated with a given position is surrounded by square brackets
All correct?
PSI Editor ME: It seems, that grouped positions are suggested as additional possibility by the reviewer. If
the authors want to implement it now, they are free to do; if not, that could be left out until next release from
the editorial point of view, but please add a hint to the specification document, that grouped positions are not
possible at the moment.
Answer: We agree that not all possible scenarios for ambiguity in modification position are
supported. However, we believe that the main use cases are now covered and more work can be put
in the future in order to support new use cases (like complex scenarios dealing with grouped
modifications). This clarification has been added to the specification document (Section 7).
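The parsing rules listed by the reviewer, restricted to the supported (non-grouped) form, can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation; the helper names are hypothetical, and grouped positions such as `(3|4)` are deliberately rejected.

```python
import re
from typing import Dict, List


def split_outside_brackets(text: str, sep: str) -> List[str]:
    """Split on sep, ignoring separators inside square brackets."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


# A position is an integer, optionally followed by a bracketed cvParam.
POSITION = re.compile(r"^(\d+)(?:\[(.*)\])?$")


def parse_modifications(field: str) -> List[Dict]:
    """Parse '{position}{Parameter}-{identifier}|{neutral loss}' entries."""
    result = []
    for entry in split_outside_brackets(field, ","):
        # The first '-' outside brackets starts the modification identifier.
        pieces = split_outside_brackets(entry.strip(), "-")
        if len(pieces) < 2:
            raise ValueError(f"no '-' separator in {entry!r}")
        head, tail = pieces[0], "-".join(pieces[1:])
        positions = []
        for item in split_outside_brackets(head, "|"):
            match = POSITION.match(item.strip())
            if match is None:
                raise ValueError(f"unsupported position {item!r}")
            positions.append((int(match.group(1)), match.group(2)))
        identifier, _, loss = tail.partition("|")
        result.append({"positions": positions,
                       "identifier": identifier.strip(),
                       "neutral_loss": loss.strip() or None})
    return result
```

Tracking bracket depth is what keeps the commas and bars inside a cvParam from being mistaken for separators between modifications or positions.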
comments from invited reviewers:
-------------------------------
invited reviewer 1:
see R2_mzTab_comments_invited_reviewer_1_anon.docx
Answer: Comments addressed before.
invited reviewer 2:
"all my questions were answered and i have no further comments"
public comments from original commenters:
----------------------------------------
SC reviewer 1:
see R2_mzTab_comments_SC_Reviewer_1_anon.docx
Answer: Comments addressed before.
public commenter 1:
"The specification is looking good.
I found one misspelling, in file R2_The_ten_minute_guide_to_mzTab.pdf:
'separate table or section but only exit through their Unit_IDs' should have 'exist' instead of 'exit'"
Answer: Now fixed.
public commenter 2:
"I have read through the mzTab document and the changes and it is fine. An impressive amount of detail
and flexibility. I have attached a list of a few instances where, in my opinion, there are slight errors
(punctuation, plurals etc)." => see R2_mzTab_comments_public_commenter_2.docx
Answer: Comments addressed before.
comments from additional commenters:
-----------------------------------
public commenter 3:
see R2_mzTab_comments_public_commenter_3.docx
Answer: Comments addressed before.
public commenter 4:
"I have several concerns about mzTab format while coding a converter from mzQuantML to mzTab and do
not know how to add them on http://www.psidev.info/mzTab-in-docproc. So email you instead.
1.
In the official document, "mzTab is intended as a lightweight supplement to the already existing
standard file formats mzIdentML and mzQuantML, providing a summary of the final results of a MS-based
proteomics experiment." However in my opinion, it would be more accurate to phrase as a subset of the final
result due to the data can be contained within one mzTab file. In mzQuantML format, it is common to have
more than one quant layer, e.g. normalized abundance and original abundance, or ratios (more common in
labelled method), or more than one column from globalQuantLayer which may have different units and no
need to mention the combination of these quant layers. All of these cases cannot be handled within one
single mzTab file due to there is only one quantitation unit can be specified."
PSI Editor ME: From my point of view "summary of the final results" is sufficient (it contains "subset of
final results").
Answer: mzTab can be used to report the ‘final’ results that the data producer needs/wants to
communicate. If there is the need to communicate the results from the same experiment in a different
way (for instance, with a different level of detail), different mzTab files could be produced by the
data providers.
"2.
For a peptide entry in the Peptide section, m/z value sometimes is hard to be determined as the
precursors from same peptide can be detected at several very close m/z values. It is not clear which value
will be used, the theoretical, or the average, or the median value"
PSI Editor ME: @Authors: Please clarify, which m/z. I assume that is determined by the search engine or
part of the spectrum annotation.
Answer: As mentioned before, the m/z value is the one coming from the search engine (the peptide or
small molecule identification reported). This has been clarified in the specification document.
"One additional question: is the unit concept in mzTab similar to RawFilesGroup or MzQuantML element in
the mzQuantML format?"
PSI Editor ME: @Authors: As I understand, a unit is at first a protein, but could be finer-grained structures.
Please try to improve the description with regard to this comment.
Answer: ‘Units’ are not included in the new version of the specification document. Instead, a proper
reporting of the experimental design is now possible (see Section 5.3 of the specification document).
Also as a result, now it is possible to make a much more direct comparison/correlation between
mzQuantML and mzTab.