Replies to Reviewers’ comments
We thank the reviewers for their valuable comments and have done our best to respond to
them. Our point-by-point response is given below.
Reviewer 1 Comments for the Author...
This is an interesting and generally well-written paper. There is emerging evidence that
many small genes have escaped annotation, not only in the Arabidopsis genome, but
also in the genomes of other plant species. The methods described here should be
generally applicable. Overall an interesting and useful paper.
Reply: We thank Reviewer 1 for the positive evaluation of this manuscript. Our replies
to each comment are given below.
(Comments 1)
Specific Comments
The manuscript copy that I downloaded lacks page numbers, making it a bit more
difficult to point to specific places in the document. I will number counting the cover
page as #1.
Reply: In response to this comment, we have added page and line numbers to the manuscript.
(Comments 2)
Page 5, second para: Is the CI calculated over all reading frames or just the optimal one?
Reply: The CI was calculated for the optimal frame, because we constructed all
possible open reading frames in the intergenic regions before calculating the CI. The
method for calculating the CI is described in the Methods. To make this point clearer,
we added the following sentence to this part: “The CI was calculated for an optimal
frame of an sORF because we first constructed the possible open reading frames in
intergenic regions.”
(Please see lines 8-10 on page 5)
(Comments 3)
Page 6, first para. I would not consider a false negative rate of 14% to be particularly
"low".
Reply: Following the comment, we removed “low” from the sentence. We now state only
that the CI method showed at least 86% specificity, because the specificity was
88.8%, 95.0% and 86.0% for annotated exons, known sORFs in A. thaliana and known
sORFs in yeast, respectively.
(Please see lines 10-11 on page 6)
(Comments 4)
Page 7: Many ORFs that do not experience purifying selection appear to be expressed.
Can the authors comment on and suggest possible explanations for this.
Reply: Following the comment, we discussed this point as follows. Nekrutenko et al.
reported that tests of purifying selection often fail to detect purifying selection in
shorter sequences. When the nucleotide sequence is shorter, the numbers of synonymous
and nonsynonymous substitutions are also smaller. This is a simple statistical problem:
with small counts, it is harder to reach statistical significance than with longer
sequences.
(Please see lines 17-18 on page 10)
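The statistical point can be illustrated with a toy calculation. The sketch below is not the purifying-selection test used in the manuscript; it is a simple one-sided binomial test, and the neutral expectation of 75% nonsynonymous substitutions and the substitution counts are hypothetical values chosen only for illustration. It shows that the same observed proportion is non-significant with few substitutions and significant with many.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical neutral expectation (assumption, not from the manuscript):
# 75% of substitutions are nonsynonymous when there is no selection.
NEUTRAL_NONSYN_FRACTION = 0.75

# Same observed proportion (50% nonsynonymous) in a short and a long sequence.
p_short = binom_lower_tail(4, 8, NEUTRAL_NONSYN_FRACTION)    # 4 of 8 substitutions
p_long = binom_lower_tail(40, 80, NEUTRAL_NONSYN_FRACTION)   # 40 of 80 substitutions

print(f"short sequence: p = {p_short:.3f}")
print(f"long sequence:  p = {p_long:.2e}")
```

With these made-up counts, the deficit of nonsynonymous substitutions is detected only in the longer sequence, although the proportion is identical in both.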
(Comments 5)
Page 8, first sentence: It is not clear in what sample this "marginally significant
enrichment" is found. Please clarify.
Reply: Following the comment, we revised the sentence as follows:
“The number of sORFs undergoing purifying selection of expressed sORFs is
significantly larger than that of non-expressed sORFs”.
(Please see lines 1-3 on page 8)
(Comments 6)
Page 8, second para: The 80% identity criterion applied here is unusually low. Most
genome annotators use criteria of 95% identity over 90% of the length for matches of
ESTs to cognate regions of the genome. Please comment, justify or modify.
Reply: As the comment points out, these sentences may mislead readers. For the
assignment of an sORF to an EST, we used strict criteria (>97% identity and a match
over more than 90% of the length). The detailed description is given in “Comparing
sORFs and ESTs” in the Materials and Methods section. The 80% identity criterion was
used for the assignment between ESTs matched to an sORF and the neighboring annotated
genes of that sORF. In this assignment, we examined the possibility that the identified
sORFs are parts of annotated genes. To estimate the error rate of our annotation, we
used a relaxed criterion (80% identity) in this part. The description has been revised
in the text.
(Please see lines 8-10 on page 8)
(Comments 7)
Page 8, second para: The meaning of the sentence "Since the 16 sORFs ….. any
annotated genes." is not clear and needs to be re-phrased.
Reply: This comment is essentially the same as Comment 6 of Reviewer 1. Please
see our reply to Comment 6 of Reviewer 1.
(Comments 8)
Pages 8-9: The authors refer to the work by Ayele et al on sequence conservation with
Brassica. It would be very interesting and worthwhile to determine the extent of
intersection between the sORFs predicted in this paper and Ayele's "conserved
intergenic sequences" just as it is of interest to know the intersection between the sORFs
and the transcribed intergenic regions detected by the various tiling arrays.
Reply: Following the comment, we showed the proportion of each sORF that overlaps
with highly conserved regions.
(Please see Supplement D)
Reviewer 2 Comments for the Author...
The mss includes useful analyses that address an important problem in genomic
analysis, the identification of small genes that are missed by current annotation
approaches. It should be of interest to Genome Research readers. However there are a
number of issues that need to be addressed.
Reply: We appreciate the many useful comments on this manuscript. Please
see our replies to each of Prof. Phil Green’s comments below.
(Comments 1-1)
1-1. There needs to be a clearer analysis and discussion of the issue of false positives.
The authors choose a CI threshold corresponding to a false positive rate of 1% on
simulated sequence, but they find it to be ~3.5% on real intronic sequence. Why the
difference -- is it because introns include some 'real' short genes, or because the model
used in the simulations doesn't capture important features of real sequence, or is it due
to some other reason? Since there are 133,091 sORFs meeting the filtering criteria,
using the intronic false positive rate of 3.5% implies there could be ~4700 false
positives among the 133,091, i.e. more than the 3274 ORFs actually found! So it
would seem a large fraction of the 3274 are expected to be false positives. This needs to
be discussed.
Reply: In response to the comment, we discussed the difference in false positive
rate between simulated data (1%) and observed data (3.5%). Moreover, we ran 1000
simulations to examine the effect of the false positive rate. As a result,
approximately 114 sORFs could have been falsely identified as coding sequences.
(Please see lines 9-14 on page 9)
(Please see lines 21-25 on page 6 and Supplement G)
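For reference, the reviewer's back-of-envelope estimate can be reproduced with the short sketch below. It only multiplies the number of candidate sORFs by each quoted false positive rate; it is not the 1000-replicate simulation reported in the manuscript, and the variable names are illustrative.

```python
# Figures quoted in the reviewer's comment.
candidate_sorfs = 133_091        # sORFs meeting the filtering criteria
sorfs_above_threshold = 3_274    # sORFs exceeding the CI threshold
intronic_fp_rate = 0.035         # ~3.5% false positive rate on intronic sequence
simulated_fp_rate = 0.01         # 1% false positive rate on simulated sequence

# Naive expectation: every candidate is an independent trial at the given rate.
expected_fp_intronic = candidate_sorfs * intronic_fp_rate
expected_fp_simulated = candidate_sorfs * simulated_fp_rate

print(f"expected false positives at 3.5%: {expected_fp_intronic:.0f}")
print(f"expected false positives at 1%:   {expected_fp_simulated:.0f}")
print(f"exceeds observed count at 3.5%:   {expected_fp_intronic > sorfs_above_threshold}")
```

Under this naive expectation the intronic rate alone would predict more false positives than the 3274 sORFs found, which is the concern raised in the comment.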
(Comments 1-2)
1-2. Likewise, as the authors point out there are false positive rates that apply to both
the microarray and the purifying selection analyses, however when summarizing their
results (e.g. in the abstract) the contribution of false positives seems to be ignored. It
would be helpful to have a table giving 'corrected' estimates (that remove the expected
numbers of false positives) for numbers of coding-like sORFs, sORFs with transcription
evidence, and sORFs with purifying selection evidence.
Reply: (Should we estimate the false positive rates for the microarray and purifying
selection analyses? Expression and purifying selection are features of coding sequences,
but some real coding sequences are not expressed and are not subject to purifying
selection. Therefore, it may be hard to calculate false positive rates for the
microarray and purifying selection analyses. Reviewer 2 clearly wants these false
positive rates estimated. He raises the point here and in Comments 11 and 17.)
(Comments 2)
2. There also needs to be more discussion of false negative rates. The authors exclude
ORFs with substantial low-complexity regions for finding candidate sORFs, but not (as
emerges only in the Discussion) for assessing false negative rates using known short
proteins, because many known short proteins would otherwise be excluded. Thus it is
somewhat misleading to both suggest that their approach is effective at finding short
protein-coding genes (e.g. 'our method has low false negative rates for predicting small
protein genes', p. 4), and that it also has a low false positive rate, because these claims
are based on different analysis criteria. At a minimum the authors should clarify this
issue and present false-positive and false negative rates both with, and without, the
low-complexity filter, so that the reader can assess this.
Reply: (I do not know how to estimate false positive and false negative rates both with
and without the low-complexity filter. Can I estimate the rates as follows?)
          With low complexity            Without low complexity
Exon      1 minus Positive rate by CI    1 minus Positive rate by CI
          (False positive rate)          (False positive rate)
Intron    Positive rate by CI            Positive rate by CI
          (False negative rate)          (False negative rate)
(To estimate the above, should I prepare exon sequences with and without low complexity,
and intron sequences with and without low complexity?)
(Comments 3)
3. I could not find any discussion about how (or whether) overlapping ORFs were
handled. Do the numbers 3274 and 1589 (of total and transcribed sORFs respectively
meeting the CI threshold) include cases of multiple overlapping ORFs (on the same, or
opposite) strands? If so, the overlaps need to be eliminated for these counts.
Reply: (Hi Shinhan, how were the open reading frames constructed in intergenic regions?
If ORFs in different frames were created at a locus, how did you deal with the
overlapping ORFs?)
(Comments 4)
4. There should be some further analyses (in addition to the EST analysis described by
the authors) of potential clustering of the novel sORFs into (known or novel) genes. For
example, how many of the sORFs in the different categories (with/without transcription
evidence, with/without purifying selection evidence) lie within a few hundred bp of a
known gene or another sORF, suggesting the possibility that they are part of the same
gene? How many are far enough away (> 5kb) that they are highly likely to represent
distinct genes?
Reply: In response to the comment, we note that sORFs with similarity to known genes
were excluded from the candidate set, because many fragments of known genes remain in
the genome as pseudogenes. The identified sORFs are therefore not related to known
genes.
(Comments 5)
5. It wasn't clear whether novel sORFs were compared to each other to see whether any
of them fall into homology families. This would be useful.
Reply: Following the comment, we clustered the sORFs using the MCL method. The results
are shown in Supplement H.
(Please see lines 26-29 on page 6 and Supplement H)
(Comments 6)
6. There is another important caveat to interpreting sORFs meeting the CI threshold as
being likely protein coding sequences. Namely, the authors' Bayesian model only
includes the categories 'coding' (in 6 frames), and 'noncoding', with the 'noncoding'
model being estimated entirely from intronic sequences. However, there are many types
of noncoding sequence that are likely to be enriched in intergenic regions relative to
intronic regions -- for example RNA genes, promoter regions, 5' and 3' UTRs. If their
composition is significantly different from intronic sequence -- e.g. more GC rich --
then they may well get identified by the model as having a high pp of coding, even
though they aren't coding. So (even apart from the false positive issue) it is not correct
to assume that anything having a strong pp for being coding is likely to be coding; it
may just be another type of sequence not explicitly allowed for by the model. In
particular, I don't think you can necessarily conclude that a sORF with evidence of
transcription is likely to be a protein coding gene -- it could be an RNA gene with a
composition that is not intron-like. Some of the claims in the abstract and mss need to
be changed to reflect this caveat.
Reply: In response to the comment, we examined the coding potential of
sORFs using several kinds of noncoding sequences, such as UTRs, intergenic
regions and introns, based on annotated data. The sets of identified sORFs
were essentially the same. Therefore, we do not think it necessary to emphasize
that the identified sORFs are unlikely to be intron sequences.
(Comments 7)
7. Although the analyses to derive the distribution of CIs and pps in simulated sequence
(e.g. Figure 1) are useful, simulated sequences do not fully capture the features of actual
sequences (as the authors acknowledge in the discussion). It would therefore be useful
to see similar distribution histograms for intronic, intergenic, and known coding
sequences. This might also illuminate whether the distribution appears bimodal in the
intergenic case (as one expects).
Reply: As mentioned in the comment, we tried using actual sequences
instead of simulated sequences to examine the difference in CI between coding
and non-coding sequences. However, actual sequences pose a technical problem:
it is not possible to prepare many independent coding and non-coding
sequences of different lengths. Simulation, on the other hand, can generate
many independent coding and non-coding sequences of different lengths.
That is why we used simulated sequences in this manuscript.
(Comments 8)
8. I found the mss difficult to understand in places, partly due to numerous grammatical
and typographical errors (the mss could use careful editing by a native English speaker),
and partly because important details are sometimes omitted (the authors should include
enough information to allow the analyses to be reproduced by another investigator).
Moreover, the terminology needs to be more clearly defined and more accurately and
consistently used. "sORF" should mean "short ORF", which implies nothing about
functionality or characteristics of the sequence other than the length of the potential
reading frame. However the authors sometimes seem to use it to instead mean a short
ORF meeting their CI threshold, and sometimes to mean a short ORF that encodes an
actual protein. The definitions need to be clearly stated. At several points the authors use
language that confuses the ORF with the encoded protein (e.g. they refer to the ORF
starting with a methionine, rather than with a methionine codon).
Reply: Following the comment, the manuscript was edited by a native English speaker.
We hope that the revised manuscript is improved. Moreover, we define the
terminology as follows. (1) sORF: a region that starts with a methionine codon and ends
with a termination codon. (2) Potential protein: the region has coding potential. (3)
Transcribed protein: the region is transcribed. (4) Selected protein: the region is
subject to purifying selection. (5) Functional protein: the region is transcribed or is
subject to purifying selection. (Is this OK? I have not changed the terms yet.)
(Comments 9)
Additional points (-- since there are no page numbers in my copy (!), the page numbers
below assume the first page after the abstract is number 1):
Reply: In response to this comment, we have added page and line numbers to the manuscript.
(Comments 10)
Abstract: "1661 sORFS ... likely belong to novel protein coding genes": this sentence
needs to be rephrased, as it is contradicted two sentences later by "1275 sORFs ... likely
belong to novel genes". Also, all these numbers should be corrected for false positive
rates, as discussed above.
Reply: In response to the comment, we revised the sentences following the terminology
given in our reply to Comment 8. The latter comment is essentially the same as
Comment 2 of Reviewer 2. Please see our reply to Comment 2 of Reviewer 2.
(Comments 11)
p. 1: First paragraph: "These studies demonstrates the presence of genic sequences" --
the claim that transcribed regions necessarily represent protein-coding or RNA genes is
controversial. The fact that a region is transcribed does not in itself imply that it is
functional. "Some studies assumed ..." -- give references.
Reply: Following the comment, we changed “demonstrate” to “may indicate” in the
sentence. Moreover, we added references at the relevant points.
(Please see lines 7 and 11 on page 3)
(Comments 12)
"small proteins (referred to as sORF)" -- no, a protein is not an ORF!! Please use
accurate terminology. Proteins, ORFs, predicted genes, and genes are all distinct entities
that should not be confused with each other. "sORFs include mating pheromones ..." --
same issue.
Reply: Following the comment, we changed “small proteins (referred to as sORF)” to
“small proteins translated from sORFs”. Moreover, we changed “sORFs include mating
pheromones…” to “proteins include mating pheromones…”.
(Please see lines 25-26 on page 3)
(Comments 13)
p. 2: "hexamer composition bias, which has been established as the best measure for
distinguishing CDS from NCDS" -- this is an overstatement. "known sORF" -- sORFs
are always 'known' simply by computational analysis, because that is how they are
defined"! You should say "known genes" or "known sORF genes". "predicted sORF" --
should be "predicted genes".
Reply: Following the comment, we changed “the best measure” to “a general measure”.
Moreover, we changed “known sORFs” to “known sORF genes” and “predicted sORF”
to “predicted genes”.
(Please see lines 11 and 19 on page 4)
(Comments 14)
p. 3: "only 3.54% of the introns" (and also Table 1) -- does this mean you are analyzing
the entire intron, as a unit? If so this is not particularly illuminating regarding false
positive rates – better would be to analyze short ORFs within introns.
Reply: (I do not know what kind of data you used for introns and exons. I guess that
you used short ORFs within introns.)
(Comments 15)
p.4: "starting with methionine" => "starting with ATG"
Reply: Following the comment, we changed “starting with methionine” to “starting with ATG”.
(Please see line 19 on page 6)
(Comments 16)
p. 5: As indicated above, you should discuss more thoroughly the issue of false positive
rates, for both sORFs and purifying selection, and how this affects predicted numbers.
Reply: This comment is essentially the same as Comment 2 of Reviewer 2. Please
see our reply to Comment 2 of Reviewer 2.
(Comments 17)
p. 6: "sORFs that are transcribed tend to be subject to purifying selection" -- this is an
overstatement, since the enrichment appears to be rather modest. "more than 80%
identity" -- what is the reason for this threshold? It should be something more like 95%
identity over a region of some minimal size. Also, there seems to be an unstated
assumption that if the sORF is part of a known gene, and has an EST match, then there
will necessarily be enough EST evidence to overlap other ESTs associated to the known
gene. This assumption should be stated explicitly, and some justification for it given (e.g.
based on patterns of EST overlap for known genes).
Reply: Following the comment, we changed “sORFs that are transcribed tend to be
subject to purifying selection” to “sORFs that are transcribed tend to have purifying
selection”. Moreover, the EST analysis was carried out with a 95% identity threshold.
We now state the assumption mentioned by Reviewer 2 explicitly in the text.
(Hi Shinhan, I don’t know how to show the justification.)
(Please see lines 15-18 on page 8)
(Comments 18)
p. 7: "regions with low amino acid complexity ... different codon usage" -- give
reference. "the number of sORFs we identified should be treated as a minimum
estimate" -- you can only take it as a minimum estimate if you know it doesn't include
any false positives, but it almost certainly does.
Reply: Following the comment, we added a reference for “regions with low amino
acid complexity ... different codon usage”. Moreover, we changed the sentence
“Therefore, the number of sORFs we identified should be treated as a minimum
estimate in this regard.” to “It indicates that many sORFs are still hiding in the
annotated genome.”.
(Hi Shinhan, I cannot find references on the codon usage bias of
transmembrane regions and signal peptides. I think that the amino acid composition
of these regions differs from that of other regions, but the codon usage
seems to be similar. I am not sure.)
(Please see lines 12-14 on page 10)
(Comments 19)
pp. 7-8: "random matches can only account for ... is only 3.5%" I don't think this
argument is correct, since (if I'm reading the Methods correctly) you only applied the CI
filter to one of the two genomes (namely AT) -- so you don't gain any confidence
regarding the matches from the false positive rate.
Reply: (I do not understand this comment. Can you deal with it?)
(Comments 20)
p. 10: did CDSs include stop codon, or not? justify prior probability of coding of .4 for
Arabidopsis – this seems substantially higher than proportion of genome thought to
encode protein. Also, need to give prior prob of coding in each frame – was that just .4/6?
Reply: The CDSs in the training data set do not include the stop codon. In the
annotation data, the proportion of coding sequences is approximately 0.4. The prior
probability in each frame is 0.1333 (=0.3/6) because there are 6 frames in each segment.
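As a quick arithmetic check, the sketch below splits a total coding prior evenly across the six reading frames. It assumes the per-frame prior is simply the total divided by six, as the reviewer suggests; the manuscript may use a different split, and the variable names are illustrative. Under this assumption a total of 0.4 gives about 0.067 per frame, so the per-frame figure quoted in the reply should be checked against the value actually used.

```python
# Assumption: the total coding prior is shared equally among the 6 reading frames.
prior_coding_total = 0.4     # proportion of coding sequence quoted in the reply
n_frames = 6                 # 3 frames on each of the 2 strands

prior_per_frame = prior_coding_total / n_frames
prior_noncoding = 1.0 - prior_coding_total

print(f"prior per coding frame: {prior_per_frame:.4f}")
print(f"prior noncoding:        {prior_noncoding:.4f}")

# The seven categories (6 coding frames + noncoding) must sum to 1.
total = prior_per_frame * n_frames + prior_noncoding
print(f"sum of all priors:      {total:.4f}")
```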
(Comments 21)
p. 11: methionine is not same as a methionine codon "longest "sub-ORF"" -- what does
this mean? "matched a known gene with > 40% identity ..." does % identity refer to the
DNA or the protein sequence? "with BLAST" -- using what BLAST parameters and
thresholds?
Reply: “methionine” was changed to “methionine codon”. The sentences including
“longest” and “sub-ORF” will be revised for clarity. Moreover, we revised the
description of the similarity search.
(Please see page 13, line 31 to page 14, line 3)
(Comments 22)
p. 13: "maximal gap of one", "minimal identity of 80% between adjacent HSPs" -- what
does this mean? "the alignment covers the start position of the ORF" -- why is this
condition imposed? sORFs that correspond to short internal exons would be excluded
by this criterion, since the initial ATG of the ORF could be in the upstream intron rather
than the exon itself.
Reply: (Hi Shinhan, can you reply to these comments? I do not know exactly how the
analyses were done.)
Phil Green
Reviewer 3 Comments for the Author...
Kousuke HANADA et al., A large number of novel small open reading frames (sORFs)
in the intergenic regions of the Arabidopsis thaliana genome are transcribed or under
purifying selection
With the work presented Hanada and colleagues address the question of the presence of
thus far undetected small ORFs in the genome of Arabidopsis thaliana. This topic is of
special importance as there is increasing evidence that a large fraction of thus far
undetected sORFs escaped the initial, “classical” genome analysis albeit they encode
small proteins. Well-established example cases already demonstrated the biological
importance of this class of proteins. Thus a genome scale survey for them is required.
The authors scanned the intergenic space using a simplified method using the Coding
Index and applying evaluated and stringent thresholds for candidate sORFs. Evaluation
and further constriction of candidate sORFs is undertaken using whole genome tiling
arrays as well as evolutionary conservation and purifying selection criteria. Overall the
manuscript is well written and the methodological description as well as the
supplementary material provided is exemplary. However, there are a couple of points
that are not sufficiently discussed/addressed in the current version of the manuscript:
Reply: We thank Reviewer 3 for the positive evaluation of this manuscript. Our
replies to each comment are given below.
(Comments 1)
- The data flow and the analytical flow, esp the numbers of intermediate candidate
sORFs observed and retained is often difficult to follow and in some instances it is
unclear to which total number(s) the percentages given refer to. E.g.
paragraph “Conservation across species and signatures of purifying selection”
Reply: Following the comment, we made a flow chart showing the process of
identifying sORFs in the Arabidopsis genome. Moreover, we revised the section
“Conservation across species and signatures of purifying selection” for clarity.
(Please see Supplement I)
(Please see the section “Conservation across species and signatures of purifying
selection”)
(Comments 2)
- The conservation measure (=>30% identity; AA ID I guess?) against the other plant
genomes (namely Oryza, Poplar, Medicago, Lotus): The percentage of sORFs with
homologous counterpart in the other species is pretty low. The authors discuss four
potential reasons for this observation:
o False calls due to the large number of ORFs examined; although I can grasp what is
probably meant the phrasing is misleading as the percentage of false calls is certainly
not depending on the sheer number of calls
o Non-functional background translation of proteins. Unfortunately totally unclear to
me
o Incomplete plant genome sequences used (for comparison);
I agree that the available sequences for Medicago and Lotus are currently incomplete.
Poplar however has recently been published as close to complete. Unfortunately only a
fairly incomplete dataset for poplar was available for quite some time, so the authors did
not really have the opportunity to compare against the complete genome. Should be
pretty straightforward to update the comparison against poplar against the complete
dataset that became available along with the publication. Most importantly a
comparison against the Brassica shotgun reads might be highly informative due to the
close evolutionary relationship.
Reply: (Hi Shinhan, I did not reply to the first comment. Could you deal with it?) We
revised “non-functional background translation of proteins” to “accumulation of many
nonsynonymous substitutions because of non-functional constraint of the related
sequences”.
(Comments 3)
- The filtering of initially detected sORF (candidates) to get rid of genes, pseudogenes,
transposons: I assume this has been performed using the TIGR annotation as a data
basis (?). Especially for the transposable and repetitive elements I’m not too sure on
whether this is a solid data basis. A filtering of sORF elements against an up to date
repeat/TE data resource might be a good idea.
Reply: In response to the comment, we note that we filtered the sORFs with a BLAST
similarity search against the annotated genes rather than relying on the TIGR annotation.
(Comments 4)
- The 13.2% of sORFs that are possibly missing exons mentioned within the abstract?
This is really hard to trace in the results section. Again the numbers and percentages
given in the last paragraph within the results section are confusing and hard to
follow (percentages of what?). Again a flowchart would really help.
Reply: In the “Conservation across species and signatures of purifying selection”
section, we showed the possibility of missing exons. A figure was added to clarify this point.
(Comments 5)
- Numerous sections in Material and Methods are really complicated to read and
understand. Eg. the identification of introns. Long winded and complicated. Some
restructuring might help.
Reply: Following the comment, we made figures to illustrate the CI and EI methods.
(Comments 6)
- The explanation of the Expression Index (EI) as well as the definition of the (EI)
thresholds used is really hard to follow and understand. Going through the supplied
supplementary info I found it remarkable that most of the sORFs that meet the
criteria are nevertheless close to the threshold.
Reply: Following the comment, we added a figure illustrating the methods to the
supplemental material.
(Comments 7)
- The generation of the coding index (CI) and delineation of exon vs. intron CI does
not take into account UTRS and promoters that potentially contain different CIs and
thus might lead to the definition of erroneous thresholds. At least for UTRs a lot of
info can be derived from cDNA information and should also be considered to be
used for the threshold definition. Not considering UTRS and promoters. Thus the
selection is exon vs intron but is not taking into account UTRs and promoters. A
coding vs. non-coding selection schema including at least UTRs might be more
adequate.
Reply: This comment is essentially the same as Comment 6 of Reviewer 2. Please
see our reply to Comment 6 of Reviewer 2.
(Comments 8)
- Are there data regarding the CI indexes for small cDNA supported exons? Are the
values obtained similar or different to those for long exons?
Reply: Yes, we examined the CI values for cDNA-supported exons. The CI values of
small cDNA-supported exons are lower than those of long exons because the CI tends to
increase with exon length. That is why we provided separate thresholds for defining
coding potential for different sORF sizes.
(Comments 9)
- The size classes used for the CI definition is unclear; => some more details in the
methods section would be helpful
Reply: Following the comment, we added a figure illustrating the methods to the
supplemental material.
(Comments 10)
- The false positive rate obtained for introns is 3,54%. From appr. 133.000
hypothetical sORFs 3274 were found to exceed the CI threshold. Well this is close
to false positive rate discovered in introns and asks for some discussion
Reply: This comment is essentially the same as Comment 1-1 of Reviewer 2. Please
see our reply to Comment 1-1 of Reviewer 2.
(Comments 11)
- Table2: “IQR” should be defined
Reply: IQR is the abbreviation of interquartile range. The IQR is a robust estimate of
the spread of the data: it is the range from the 25th percentile (first quartile) to the
75th percentile (third quartile).
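A minimal sketch of the IQR calculation, using Python's standard library, is given below. The example values are made up, and the "inclusive" quantile method is one of several common conventions; the manuscript's statistics package may use another.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return Q3 - Q1, the spread of the middle 50% of the data."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up example data, not taken from Table 2.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(interquartile_range(example))
```

Because the IQR ignores the lowest and highest quarters of the data, a single extreme value changes it far less than it changes the full range.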
As stated above the manuscript is well written and focuses towards a topic which
certainly is of high interest for the readership of Genome Research. However a revision
of the manuscript considering the points mentioned above might help to improve the
manuscript.