ANSWERS TO THE COMMENTS FROM THE REVIEWERS

Since the last round of review (March-April 2013), quite extensive changes have been made to the mzTab format specification, based on the feedback received from the reviewers of the PSI document process and also from the reviewers of the manuscript submitted to Molecular and Cellular Proteomics (MCP). As a result, we think that overall the format is now more complex, but we also expect it to be more reliable and less ambiguous.

There are now two types of mzTab files: ‘Identification’ and ‘Quantification’. This is specified in the mandatory metadata field ‘mzTab-type’. ‘Identification’ files can be used to report peptide, protein, and small molecule identifications. ‘Quantification’ files can be used for quantification results, which optionally may contain identification results about the quantified proteins, peptides or small molecules.

In addition, there are two levels of detail (called ‘mode’) for reporting data in mzTab files: ‘Summary’ and ‘Complete’. The ‘Summary’ mode can be used to report the final results, putting together data coming from e.g. different replicates. The ‘Complete’ mode, on the other hand, is used if detailed information for each individual assay/replicate is provided. The ‘mode’ is specified in the mandatory metadata field ‘mzTab-mode’.

A new section called 'PSM' (peptide spectrum match) has been included as a replacement for 'Peptides', which is now recommended only for ‘Quantification’ files. In addition, to improve the data reporting, ‘Units’ are no longer present; instead, it is now possible to model the experimental design in considerable detail (Section 5.3 in the specification document, https://code.google.com/p/mztab/).

In practical terms there are four possible variants of mzTab files, obtained by combining the ‘Identification’ and ‘Quantification’ ‘type’ with the ‘Complete’ and ‘Summary’ ‘mode’. The requirements (e.g.
number of mandatory fields in the different sections) differ depending on the mzTab ‘type’ and ‘mode’. Tables 2-6 in the specification document outline this.

In addition, many small refinements have been made, since the specification document has been thoroughly reviewed. For instance, the order of the columns in the different sections is now recommended but not mandatory (a few reviewers asked for this change). New example files have been generated from both proteomics and metabolomics approaches (see https://code.google.com/p/mztab/wiki/ExampleFiles).

Finally, we want to thank all the reviewers for their constructive feedback. Below, we reply to the comments raised.

Invited Reviewer 1

mzTab: exchange format for proteomics and metabolomics results
Review of specification documentation in DocProc, RC2

General thoughts:

Most of the comments have been clarified and I agree with the authors. But there are some cases, which I think would not lead to a wider implementation of the proposed standard. My main concern is still that many attributes in each layer are mandatory, but not easy to get (URI, GO-annotation...) and for most cases not even necessary, even less for a co-worker without deeper knowledge in proteomics, who should receive an mzTab file. For this case, which is obviously one of the main goals of mzTab, the files will be bloated with "null"-columns. In 5.8 it is even stated that the results SHOULD have a reliability score, which will inevitably lead to implementations with null-columns. Also having search engine and search_engine_scores is very redundant. I think it is good to have a file format which gives the ability to give a small report of proteomic experiments with a set of standardized attributes. But making too many of them mandatory ruins the beauty and compactness of this format, in my opinion.

Answer: In the new specification some of the fields have been made optional. For instance, one of them is ‘reliability_score’.
See Tables 2-6 and Section 6 of the specification document for all the details. However, we think that it is necessary to have both ‘search_engine’ and ‘search_engine_score’, since it is not always obvious to non-experts which scores correspond to which software.

Furthermore, there is still the problem with the ambiguous assignment of the peptides to proteins. I think it may be much easier to allow for a peptide to have several accessions, than bloating the file with redundant lines with exactly the same information but the accession, if someone wants to report the peptides on a PSM level (e.g. for later spectral count processing).

Answer: These multiple peptide-protein assignments can still be reported using the attribute “ambiguity_group” in the same row. One of the principles of the format was to report one “main” protein identification (preferred peptide-protein mapping, if needed).

Specific comments:

Detailed definitions for "num_peptides" and "num_peptides_unambiguous" are not given and in 5.7 it is stated that they are rather loose and implementation dependent. In contrast, "num_peptides_distinct" is defined by sequence+modifications, but does not allow for defining a peptide as distinct by sequence only, which is often used, or sequence+modifications+charge.

Answer: Definitions are now included in sections 5.10 and 6 of the specification document.

Unfortunately the link to the OBO-file of the PSI Protein modifications workgroup seems to be broken.

Answer: The URL has been updated to http://psidev.cvs.sourceforge.net/psidev/psi/mod/data/PSIMOD.obo.

The small example in 5.5 states to encode the not given values with "-", shouldn't these be "null" or "NaN"?

Answer: It has been corrected.

In 5.10.6 there are spaces in the optional column name.

Answer: It has been corrected.

6.4 The Tab is written with uppercase T only here.

Answer: Now corrected.
6.4.2 I see no reason for giving an unassigned peptide a "null" as accession, only because it does not occur in the protein section.

Answer: The “null” is provided if no protein identification is associated with a peptide identification (for instance in peptidomics experiments). In this case, the “Protein” section would not be present in the file. We think this is consistent with the use of “null” in the other attributes present in an mzTab file. To prevent certain parsing errors, we decided that mzTab files must not contain empty cells but must use the string “null” to represent missing values.

6.4.11 and 6.4.13: if multiple RTs may be defined, the m/z values will probably differ as well, and why then not also include different charges? In my opinion, either all or none should be allowed, this seems very inconsistent.

Answer: The definitions and semantics of this attribute have been substantially improved in the new version. In the peptide table a ‘retention_time_window’ is defined to allow for capture of elution ranges over which quantitation values are calculated. If multiple points are given it is expected that these relate to all scans from which quantitative values were generated.

SC Reviewer 1:

Global comments:

Much has been corrected in the R2 submission. Two major concerns that are coming from many reviewers and not addressed / rejected are:
- Why forcing a high number of mandatory columns that will be empty in many cases? This list is somewhat arbitrary and not fully justified.
- Why forcing an order of the mandatory columns, and at the same time have possible optional columns appearing as intercalated? As the columns have CVed title names this is not necessary, and in addition this will allow to intercalate optional columns, which will make parsers break.
PSI Editor ME: From my point of view the balance between optional and mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within the community, there is no editorial necessity to change something.

Answer: As mentioned at the beginning of this document, in the new version of the specification document some of the columns have been made optional depending on the mzTab ‘mode’ and ‘type’. In addition, the order of the columns is now recommended rather than mandatory. See Tables 2-6 and Section 6 of the specification document for more details.

Detailed comments on answers of invited reviewer 1:

Page 24 "uri", "go_terms" and "protein_coverage": They should be optional, as they are not easily found after a peptide search (and not even given by many search engines).

Answer: While it is true that these fields may sometimes be unavailable, we tried to keep the number of optional columns as low as possible. This ensures that all mzTab files have a similar structure and are therefore easier to use for inexperienced users.

SC Reviewer 1: If mzTAB is meant to be light and not heavy, this statement and choice is counterproductive. On one side mzTAB requires a minimal number of columns, which is for some experiments not useful and makes the file bigger than necessary; on another side, there are some optional columns; therefore the choice of keeping some obligatory is arbitrary. This must be corrected for consistency reasons.

PSI Editor ME: same comment as above (From my point of view the balance between optional and mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within the community, there is no editorial necessity to change something.)

Answer: See our previous response.
Note also that the format has now been made much more flexible by introducing the mzTab ‘type’ (‘Identification’ and ‘Quantification’) and ‘mode’ (‘Complete’ and ‘Summary’).

Page 25: Why must the peptide section follow the (optional) protein section? I don't see any reason for this, except a possible accession-parsing. Though, the accessions in the peptide don't have to match any accessions in the protein section, neither must there be any protein section at all.

Answer: As mentioned before, we believe that a fixed order of sections and thereby a more “stable” format makes the format easier to use. At the same time, we cannot see any disadvantage in enforcing a certain order. People developing software that is able to generate mzTab files will on average be more experienced than people “consuming” the files. For the former, it should not make a difference whether a certain order of sections is enforced, while the latter group might find it helpful to know “where” to find what type of information.

SC Reviewer 1: I do not agree: having a non absolutely fixed format is prone to parsing error. If a less experienced person as you mention is not designing the count of the columns in a smart way, two perfectly valid mzTAB files with differing number of columns will generate odd results…

PSI Editor ME: same comment as above (From my point of view the balance between optional and mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within the community, there is no editorial necessity to change something.)

Answer: See our previous responses. Also, the order of the columns is now not mandatory, only recommended.

Page 26: The "unique" column may be very useful, but as mzTab should be an easily generated format, it should be optional and not every search engine reports this value.

Answer: See previous responses. If this information is not available, it is possible to just use ‘null’.
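Because the column order is now only recommended and missing values are written as ‘null’, a consumer is safest reading a table section by column name rather than by position. The following Python lines are a minimal sketch of that idea, not part of the specification: the function name read_section is hypothetical, the line prefixes "PSH"/"PSM" are assumed from the mzTab specification rather than taken from this response, and the sample rows are invented for illustration.

```python
def read_section(text, header_prefix, row_prefix):
    """Return the rows of one table section as dicts keyed by column name.

    header_prefix / row_prefix are the first tab-separated field of the
    header line and of each data line (assumed here: "PSH" and "PSM").
    """
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == header_prefix:
            header = fields[1:]
        elif fields[0] == row_prefix and header is not None:
            # The string "null" marks a missing value; map it to None.
            values = [None if v == "null" else v for v in fields[1:]]
            rows.append(dict(zip(header, values)))
    return rows


# Two invented files with the same content but a different column order.
sample_a = ("PSH\tsequence\taccession\tunique\tretention_time\n"
            "PSM\tPEPTIDEK\tP12345\t1\tnull\n")
sample_b = ("PSH\tretention_time\tunique\tsequence\taccession\n"
            "PSM\tnull\t1\tPEPTIDEK\tP12345\n")

rows_a = read_section(sample_a, "PSH", "PSM")
rows_b = read_section(sample_b, "PSH", "PSM")
```

Both inputs yield the same dictionaries, so code that looks up rows_a[0]["accession"] does not depend on where the column sits in the file.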
SC Reviewer 1: Again, it’s an overkill. Particularly here, in addition, the definition of unique might vary from one tool to another, which makes the interpretation of this field non homogeneous.

PSI Editor ME: same comment as above; additionally: definition of unique is given in 5.7 and seems conclusive to me (i.e. “database-unique”).

Answer: We think this is an essential piece of information for clarifying the protein inference, which is one of the reasons why the interpretation of proteomics results is difficult for non-proteomics researchers. This is the main reason why it was added to the mzTab specification.

Page 27: Also "retention_time" is neither given by every search engine, nor relevant for every method in MS (e.g. direct injection), and thus should be optional.

Answer: We believe that the retention time of a peptide or a small molecule is a vital piece of information that is used and usable in many workflows. Therefore, it should be possible to report it in mzTab. To minimize the number of optional columns this column was defined as non-optional.

SC Reviewer 1: Retention times are useful only if comparing highly similar chromatographic conditions. This will only be valid if one compares and exchanges results that fulfil these criteria. Even if I agree that this information can be very useful in some areas, it is not the case for many straightforward identification and quantitation jobs.

PSI Editor ME: same comment as above (From my point of view the balance between optional and mandatory is indeed arbitrary; although this will probably influence the compliance of the standard within the community, there is no editorial necessity to change something.)

Answer: We thought this attribute needed to be mandatory. If it is not available, it can be reported as “null” in the mzTab file.

Page 28: In the description of the metadata it is said that the {UNIT_ID}_ms-file is an optional value (as all metadata is). But the mandatory field "spectra_ref" uses the information in this field.
If it is not given in the metadata, there is no use of this mandatory reference format for the "spectra_ref". So either this field should be optional or rather the ms-file in the metadata mandatory.

Answer: While the protein / peptide / small molecule sections are table based, the metadata section is key-value based. To enable the easy concatenation of the table based sections we have decided to minimize the number of optional columns. At the same time, the metadata section was defined as an “all optional” section. We believe that, while this specific constellation is not ideal, these more general rules are easier to understand.

SC Reviewer 1: This comment should address the problem raised by this reviewer. Either add a constraint to the metadata or allow that mzTAB files are inconsistent.

PSI Editor ME: I think this is a difference between syntax and semantics; the syntax allows ms_file to be missing, but then – by semantics – the file is automatically not valid, because spectra_ref is mandatory. There is no disadvantage to include this implicit “mandatoriness” into the specification document; so please add it.

Answer: We have changed the specification document as requested by the reviewer and the editor, and we think it is a good idea to clarify this. The element “ms_run[1-n]-location” is now mandatory in the Metadata section.

Detailed comments on answers of SC reviewer 1:

in 6 "Every line in an mzTab file must start" => capitalize MUST

Answer: This was updated in the specification document.

in text (under Params) "Any field that is not available should be left empty" => shouldn't that be MUST be left empty ??

Answer: This was changed to MUST.

=> how are space and comma characters constrained?

SC Reviewer 1: No answer?

Answer: Sorry, we missed this point in the previous round of review. We think the reviewer is referring to the situation where the name of the CV term contains a comma. We did not consider this case before, so thanks for pointing this out.
The solution in this case is to use the same convention used in CSV files, adding quotes (“) to the CV param name:

[label, accession, “first part of the param name , second part of the name”, value]
[MOD, MOD:00648, “N,O-diacetylated L-serine”,]

This information has been added to the specification document (section 6, “Format specification”).

6.4.1 sequence => how to encode sequence ambiguity (I/L), others, and results from sequence tags experiments?

Answer: See section 5.10.5 of the new version. I/L can be represented as ‘J’.

SC Reviewer 1: And what about K/E, and sequence tags with {A1A2} where A1 and A2 are two residues, for which we do not know the order in the sequence?

PSI Editor ME: Q/E = Z and N/D = B are already mentioned in 5.10.5. Regarding {A1A2} at the moment one has to write XX, if I understand correctly. This is a loss of information, but from the editorial point of view no necessity to change. @Authors: please decide whether you want to improve that now... Add that special case / work-around into the specification document!

Answer: Sequence tag approaches are not properly supported by mzTab. Supporting this type of approach was not one of the initial aims of the format; it is one of the possible future developments. This has been clarified in the specification document (Section 7).

Other question: why not using PSI-MS CV for terms? no relationships to mzIdentML terms? just independent terms?

Answer: We do not quite understand the reviewer's comment. The mzTab format specification does not exclude any CVs from being used, and actually recommends using the PSI-MS CV.
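The CSV-style quoting of CV parameter names that contain a comma, described in the answer above on space and comma characters, can be illustrated with a short Python sketch. It is an illustration only, under the following assumptions: the function name parse_cv_param is hypothetical, real files are assumed to use the plain double-quote character ("), and Python's standard csv module is used as a stand-in for "the same convention used in CSV files".

```python
import csv


def parse_cv_param(text):
    """Split '[label, accession, name, value]' into its four parts.

    A name containing a comma is expected to be wrapped in double quotes,
    following the CSV convention. Empty parts are returned as None.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("CV param must be enclosed in square brackets")
    # skipinitialspace lets a quote that follows ', ' open a quoted field.
    fields = next(csv.reader([text[1:-1]], skipinitialspace=True))
    if len(fields) != 4:
        raise ValueError("expected 4 comma-separated parts, got %d" % len(fields))
    label, accession, name, value = (f.strip() or None for f in fields)
    return {"label": label, "accession": accession, "name": name, "value": value}


quoted = parse_cv_param('[MOD, MOD:00648, "N,O-diacetylated L-serine",]')
plain = parse_cv_param("[MS, MS:1001876, modification probability, 0.8]")
```

Without the quotes, the first example would split into five parts and be rejected, which is the ambiguity the reviewer pointed out.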
SC Reviewer 1: It appears that the column names do not follow the PSI-MS CV terms (some examples below):

“Charge” is “charge state” in PSI-MS: MS:1000041
“Search_engine_score” is “search engine specific score for peptides” in PSI-MS: MS:1001143
“sequence” is “unmodified peptide sequence” in PSI-MS: MS:1000888

It would be appropriate to provide in the documentation a list of PSI-MS CVs and/or element names in mzIdentML/mzQuantML to make sure that one can map the terms correctly.

PSI Editor ME: As I understand it, CV terms can be used but the above mentioned column names are “own special labels”. Unfortunately there are also attributes in mzIdentML doubling CV terms (like “SpectrumIdentificationItem/chargeState”). No editorial necessity to add such a list.

Answer: It was never our aim to use exactly the same names for the attributes in mzTab and the PSI-MS CV terms. The definitions of the attributes are given in the specification document. In addition, for those terms whose definition is not included in the document (optional columns), it is necessary to use the names of PSI-MS terms. We therefore think that, following this approach, there should not be any issues related to naming ambiguity.

Public commenter 2

I found the following, which are generally instances of commas or use of the definite article.

Page 2
Plural experiments
This document addresses the systematic description of peptide, protein and small molecule identification and quantification data retrieved from a mass spectrometry-based experiments

Page 3
Missing “a”
T1. Export of results to external software, that is not able to parse proteomics/metabolomics specific data formats but can handle simple tab-delimited file formats. As a guideline the file format is designed to be viewable by programs such as Microsoft Excel® and Open Office Spreadsheet.
be vs. being
T2.
Allow the concatenation of results, and thus be able to combine results from multiple experiments but also multiple entries from local LIMS databases or MS proteomics repositories.

Page 6
the core
There is a difficult issue with respect to how software should encode CV terms, such that changes to the core can be accommodated.

Page 7
Comma to separate main clause
mzTab is designed to only hold experimental results, which in proteomics experiments can be very complex.

Page 9
Commas with furthermore
Several of the available techniques, furthermore, allow/require multiple similar samples to be multiplexed and analyzed in a single MS run. When several biological samples are multiplexed these samples are referred to as “subsamples” in mzTab. Subsamples MUST, furthermore, be linked to the used labels in the metadata section of the mzTab file (see example below). In case a quantification method is used that does not lead to multiplexed biological samples, the generated quantification values MUST be reported as subsample 1.
Replicates in experimental…

Answer: All the minor changes suggested by the reviewer have been made in the new version of the specification document.

Public commenter 3:

Minor comments from me. Overall the specifications look fine and I don’t want to hold up the process of finalising version 1.0 but a few additional clarifications will help with implementation, as noted below in italic:

retention_time section should specify what to do for peptides analysed in multiple runs (subsamples) – report a single value of RT in the “master run” or use the | to separate out values from different replicates. I’m sure we discussed this issue but I don’t see that this made it into the spec doc.

Answer: As mentioned before, the definitions and semantics of this attribute have been substantially improved in the new version. In the peptide table a ‘retention_time_window’ is defined to allow for capture of elution ranges over which quantitation values are calculated.
If multiple points are given it is expected that these relate to all scans from which quantitative values were generated.

Mass_to_charge as above – is this a “master peptide” m/z value?

Answer: This has now been clarified in the specification document. It is assumed that the reported mass to charge (m/z) value is for a given “master” peptide from one assay only (and the unlabeled peptide in label-based approaches). If the exporter wishes to export values for all assays, this can be done using optional columns.

Section 5.3 undistinguishable should be indistinguishable

Answer: Now fixed.

Section 5.4 “...mzTab is designed to be a simple data format. Therefore, the reporting of results from such experimental designs is poorly supported in mzTab.” Consider replacing “poorly” with “supported to only a limited extent”

Answer: Now changed. In fact, as highlighted above, the reporting of the experimental design is now fully supported.

Section 5.9
{position}{Parameter}-{Modification or Substitution identifier}|{neutral loss}
...
{Parameter} is optional. It MAY be used to report a quantity e.g. a probability score associated with the modification or location.

Reporting the first two possible sites for the phosphorylation with given probability score. Here only the modification field is given:

3[MS,MS:1001876, modification probability, 0.8]|4[MS,MS:1001876, modification probability, 0.2]MOD:00412, 8-MOD:00412

I think an extra example is needed to show whether the following is also allowed:

(3|4)[MS,MS:1001876, modification probability, 0.8]-MOD:00412

Answer: We prefer not to allow this second option. The preferred option is the one listed above. This clarification has been added to the specification document.

i.e. A probability for grouped positions – I recall this was desired by some groups.
Or an even more complex example:

(3|4)[MS,MS:1001876, modification probability, 0.8]|7[MS,MS:1001876, modification probability, 0.2]MOD:00412

As such the following needs to be specified: Are grouped positions allowed at all? If so, are round brackets allowed around grouped positions – otherwise impossible to tell if a cvParam is associated with the group or the last element of the group only.

As a general note, writing code to read these possibilities looks pretty nightmarish. As a check, these would be the parsing rules:
- Split different mods by comma
- Recognise the start of the modification CV param by “-“
- Recognise the end of the modification by comma or | for neutral loss
- Positions are separated by bars. Need decision on grouping and use of brackets?
- cvParam associated with a given position is surrounded by square brackets
All correct?

PSI Editor ME: It seems that grouped positions are suggested as an additional possibility by the reviewer. If the authors want to implement it now, they are free to do so; if not, that could be left out until the next release from the editorial point of view, but please add a hint to the specification document that grouped positions are not possible at the moment.

Answer: We agree that not all possible scenarios for ambiguity in modification position are supported. However, we believe that the main use cases are now covered, and more work can be done in the future to support new use cases (such as complex scenarios dealing with grouped modifications). This clarification has been added to the specification document (Section 7).

comments from invited reviewers:
-------------------------------

invited reviewer 1: see R2_mzTab_comments_invited_reviewer_1_anon.docx

Answer: Comments addressed before.
invited reviewer 2: "all my questions were answered and i have no further comments"

public comments from original commenters:
----------------------------------------

SC reviewer 1: see R2_mzTab_comments_SC_Reviewer_1_anon.docx

Answer: Comments addressed before.

public commenter 1: "The specification is looking good. I found one misspelling, in file R2_The_ten_minute_guide_to_mzTab.pdf: 'separate table or section but only exit through their Unit_IDs' should have 'exist' instead of 'exit'"

Answer: Now fixed.

public commenter 2: "I have read through the mzTab document and the changes and it is fine. An impressive amount of detail and flexibility. I have attached a list of a few instances where, in my opinion, there are slight errors (punctuation, plurals etc)." => see R2_mzTab_comments_public_commenter_2.docx

Answer: Comments addressed before.

comments from additional commenters:
-----------------------------------

public commenter 3: see R2_mzTab_comments_public_commenter_3.docx

Answer: Comments addressed before.

public commenter 4: "I have several concerns about mzTab format while coding a converter from mzQuantML to mzTab and do not know how to add them on http://www.psidev.info/mzTab-in-docproc. So email you instead.

1. In the official document, "mzTab is intended as a lightweight supplement to the already existing standard file formats mzIdentML and mzQuantML, providing a summary of the final results of a MS-based proteomics experiment." However in my opinion, it would be more accurate to phrase as a subset of the final result due to the data can be contained within one mzTab file. In mzQuantML format, it is common to have more than one quant layer, e.g. normalized abundance and original abundance, or ratios (more common in labelled method), or more than one column from globalQuantLayer which may have different units and no need to mention the combination of these quant layers.
All of these cases cannot be handled within one single mzTab file due to there is only one quantitation unit can be specified."

PSI Editor ME: From my point of view "summary of the final results" is sufficient (it contains "subset of final results").

Answer: mzTab can be used to report the ‘final’ results that the data producer needs/wants to communicate. If there is a need to communicate the results from the same experiment in a different way (for instance, with a different level of detail), the data providers can produce different mzTab files.

"2. For a peptide entry in the Peptide section, m/z value sometimes is hard to be determined as the precursors from same peptide can be detected at several very close m/z values. It is not clear which value will be used, the theoretical, or the average, or the median value"

PSI Editor ME: @Authors: Please clarify, which m/z. I assume that is determined by the search engine or part of the spectrum annotation.

Answer: As mentioned before, the charge is the one coming from the search engine (the peptide or small molecule identification reported). This has been clarified in the specification document.

"One additional question: is the unit concept in mzTab similar to RawFilesGroup or MzQuantML element in the mzQuantML format?"

PSI Editor ME: @Authors: As I understand, a unit is at first a protein, but could be finer-grained structures. Please try to improve the description with regard to this comment.

Answer: ‘Units’ are not included in the new version of the specification document. Instead, a proper reporting of the experimental design is now possible (see Section 5.3 of the specification document). As a result, it is now also possible to make a much more direct comparison/correlation between mzQuantML and mzTab.
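As an illustration of the parsing rules for the modification field listed by public commenter 3 (Section 5.9 discussion above), the following Python lines give a minimal sketch of a reader that follows the template {position}{Parameter}-{Modification or Substitution identifier}|{neutral loss} and rejects grouped positions, which the answers above state are not supported. The function names are hypothetical, and the sketch assumes that the identifier is always introduced by a hyphen outside any square brackets; it is not a reference implementation of the specification.

```python
def split_top_level(text, sep, maxsplit=-1):
    """Split text on sep, ignoring separators inside square brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == sep and depth == 0 and maxsplit != 0:
            parts.append(text[start:i])
            start = i + 1
            maxsplit -= 1
    parts.append(text[start:])
    return parts


def parse_modifications(cell):
    """Parse a modification cell into a list of dicts, one per modification."""
    mods = []
    for entry in split_top_level(cell, ","):       # rule: mods split by comma
        pieces = split_top_level(entry.strip(), "-", 1)
        if len(pieces) != 2:
            raise ValueError("no '-' before the identifier in %r" % entry)
        head, tail = pieces
        ident_parts = split_top_level(tail, "|", 1)  # rule: '|' starts neutral loss
        sites = []
        for site in split_top_level(head, "|"):     # rule: positions split by bars
            site = site.strip()
            if site.startswith("("):
                raise ValueError("grouped positions are not supported")
            pos, _, param = site.partition("[")     # rule: cvParam in brackets
            sites.append((int(pos), "[" + param if param else None))
        mods.append({
            "sites": sites,
            "identifier": ident_parts[0].strip(),
            "neutral_loss": ident_parts[1].strip() if len(ident_parts) > 1 else None,
        })
    return mods


example = ("3[MS,MS:1001876, modification probability, 0.8]"
           "|4[MS,MS:1001876, modification probability, 0.2]-MOD:00412, 8-MOD:00412")
parsed = parse_modifications(example)
```

Splitting only at bracket depth zero is the design choice that keeps the commas inside a cvParam from being mistaken for the separators between modifications.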