Replies to review comments

Summary Statement
First, we would like to thank the reviewers for their extremely useful feedback and suggestions. Based on the
suggestions, we have repeated the analyses, and this revision reports a different number of
predictions from the previous version of the manuscript. We have indicated in the manuscript the
correspondence between the reviewers’ comments and our revisions. The major differences are summarized as
follows:
1. Following the comments of Reviewer 2, we have completely reanalyzed our data after removing low
complexity sequences from both the CDS and NCDS training data. This change affects the number of
sequences predicted as coding. In addition, the low complexity filter is now also applied to the sequences we
used to estimate false negative and false positive rates (Arabidopsis exon and intron sequences,
Arabidopsis known sORF genes, and yeast known sORF genes). Therefore, the numbers in Table 1 and
Supplements A, B, and C differ from the original ones.
2. During the revision process, we found an error that resulted in underestimation of the number of
sORFs with coding potential in our earlier analysis. The CI values of intergenic sORFs
were determined with the stop codon included, whereas terminal codons are not used in the training data for
either CDS or NCDS. In the revised manuscript, we removed the terminal codons from the intergenic
sORFs before estimating the CI values. The presence of terminal codons significantly reduced the CI values.
Therefore, the number of sORFs with coding potential is larger than the original estimate. Consequently,
the number of coding sORFs with evidence of transcription or purifying selection has also
increased.
3. The number of sequences related to coding sORFs is much larger than the original estimate. There
are two reasons for the difference. First, we modified our search procedure by
searching the genome sequences with translated BLAST instead of searching the translated ORFs from
the plant genomes analyzed. Second, and more importantly, based on the suggestions of Reviewers 1 and
3, we added Brassica oleracea, which is closely related to Arabidopsis thaliana, to our analyses.
Reviewer 1 Comments for the Author...
1. This is an interesting and generally well-written paper. There is emerging evidence that many
small genes have escaped annotation, not only in the Arabidopsis genome, but also in the genomes of
other plant species. The methods described here should be generally applicable. Overall an
interesting and useful paper. The manuscript copy that I downloaded lacks page numbers, making it
a bit more difficult to point to specific places in the document. I will number counting the cover page
as #1.
Reply: We appreciate the reviewer’s valuable comments and have added page and line numbers to the
revised manuscript.
2. Page 5, second para: Is the CI calculated over all reading frames or just the optimal one?
Reply: The CI was calculated for each ORF that we defined from exon, intron, or intergenic regions, so the
calculation is always done in the frame that defines the ORF in question.
3. Page 6, first para. I would not consider a false negative rate of 14% to be particularly "low".
Reply: Following this comment, we have toned down the sentence on p.6. In our previous
analysis, the exonic and intronic sequences for the training models were not subject to low complexity
filtering. The benchmark set (Arabidopsis and yeast small proteins) was not filtered either. Therefore, the
false positive and false negative rate estimates were all based on unfiltered sequences.
4. Page 7: Many ORFs that do not experience purifying selection appear to be expressed. Can the
authors comment on and suggest possible explanations for this.
Reply: The number of coding sORFs experiencing purifying selection has increased substantially due to
the inclusion of Brassica oleracea sequences. In addition, Nekrutenko et al. (2002 Genome Research)
reported that the test of purifying selection is less sensitive for shorter sequences. For shorter sequences,
the numbers of synonymous and non-synonymous substitutions are smaller and the rate estimates have
correspondingly larger standard errors. Therefore, the likelihood ratio tests we used have lower
sensitivity for shorter sequences. Finally, some of the coding sORFs we predicted may be false positives that are
in fact RNA genes. These possibilities are discussed on p.11 and p.12.
5. Page8, first sentence: It is not clear in what sample this "marginally significant enrichment" is
found. Please clarify.
Reply: We found that the number of coding sORFs undergoing purifying selection is
significantly larger among expressed sORFs than among non-expressed sORFs. We have modified the manuscript on p.8-9 to
explain this.
6. Page 8, second para: The 80% identity criterion applied here is unusually low. Most genome
annotators use criteria of 95% identity over 90% of the length for matches of ESTs to cognate
regions of the genome. Please comment, justify or modify.
Reply: We use the EST information in two different ways. The first is to identify the cognate EST(s) of
a coding sORF, for which we did use strict criteria (>97% identity over more than 90% of the length).
The detailed description can be found in the Method section “Comparing sORFs and ESTs”. The second
use of ESTs is to identify coding sORFs that are parts of genes already annotated. The 80% identity
criterion was used for this purpose. With this relaxed criterion, we err on the side of assigning some truly novel coding sORFs as
exons of annotated genes. We have modified the manuscript to clarify this (p.9).
7. Page 8, second para: The meaning of the sentence "Since the 16 sORFs ….. any annotated genes."
Is not clear and needs to be re-phrased.
Reply: We have revised this sentence on p.9.
8. Pages 8-9: The authors refer to the work by Ayele et al on sequence conservation with Brassica. It
would be very interesting and worthwhile to determine the extent of intersection between the sORFs
predicted in this paper and Ayele's "conserved intergenic sequences" just as it is of interest to know
the intersection between the sORFs and the transcribed intergenic regions detected by the various
tiling arrays.
Reply: We show in the revised Supplement C-1 the proportion of each sORF that overlaps highly
conserved regions and have discussed our findings briefly on p.10.
Reviewer 2 Comments for the Author...
The mss includes useful analyses that address an important problem in genomic analysis, the
identification of small genes that are missed by current annotation approaches. It should be of
interest to Genome Research readers. However there are a number of issues that need to be
addressed.
Reply: First of all, we appreciate the many useful comments from Dr. Green. Our detailed replies to the
comments are listed below.
1a. There needs to be a clearer analysis and discussion of the issue of false positives. The authors
choose a CI threshold corresponding to a false positive rate of 1% on simulated sequence, but they
find it to be ~3.5% on real intronic sequence. Why the difference -- is it because introns include some
'real' short genes, or because the model used in the simulations doesn't capture important features of
real sequence, or is it due to some other reason? Since there are 133,091 sORFs meeting the filtering
criteria, using the intronic false positive rate of 3.5% implies there could be ~4700 false positives
among the 133,091, i.e. more than the 3274 ORFs actually found! So it would seem a large fraction of
the 3274 are expected to be false positives. This needs to be discussed.
Reply: Dr. Green has raised an important question regarding the potential presence of ~4000 false positives
among the 133k ORFs and the possibility that many of the predicted coding sORFs are therefore false. The
3.5% is in fact derived from annotated introns, which are not necessarily real introns. Therefore, some false
positive introns are most likely CDSs mis-annotated as introns. In addition to mis-annotation, some real
introns contain CDSs due to alternative splicing. In Arabidopsis, 3161 (~22%) annotated genes have more
than one splice variant based on available cDNA/EST data for 13,019 genes with a total of ~56,000
introns. Therefore, nearly 6% of introns in Arabidopsis will contain coding sequences that contribute to the
elevated false positive rate.
To illustrate the influence of alternative splicing, we examined the CI of putative
alternatively spliced introns. Introns defined based on cDNA evidence were used to search the Arabidopsis
protein annotation. Any intron whose translation matches an annotated protein with ≥ 80% amino acid identity over ≥ 80% of the intron
length is regarded as alternatively spliced. We found 480 alternatively spliced introns that meet these
stringent criteria, and 318 (66%) of these introns have above-threshold CI. This finding reinforces the notion
that many introns that can be alternatively spliced will be identified as coding sequences. We have included
the above discussion on p.11-12.
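As an illustration only, the arithmetic behind the figures quoted in comment 1a and in this reply can be reproduced with a few lines of Python. The variable names are hypothetical, and the step that treats each alternatively spliced gene as contributing one CDS-containing intron is an assumption made here to arrive at the "nearly 6%" figure; the manuscript may derive it differently.

```python
# Figures quoted in the reviewer comment and in the reply above.
n_filtered_sorfs = 133_091       # sORFs meeting the filtering criteria
intron_fp_rate = 0.035           # ~3.5% false positive rate on annotated introns

# Expected number of false positives if the intronic rate applied to all sORFs.
expected_false_positives = n_filtered_sorfs * intron_fp_rate

n_alt_spliced_genes = 3_161      # genes with more than one splice variant
n_introns = 56_000               # approximate total number of introns

# Assumption: one CDS-containing intron per alternatively spliced gene.
frac_introns_with_cds = n_alt_spliced_genes / n_introns

n_alt_spliced_introns = 480      # introns passing the 80%/80% criteria
n_above_threshold = 318          # of those, introns with above-threshold CI
frac_above_threshold = n_above_threshold / n_alt_spliced_introns

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"introns containing CDS:   {frac_introns_with_cds:.1%}")
print(f"above-threshold introns:  {frac_above_threshold:.1%}")
```

The first value corresponds to the reviewer's "~4700" estimate, and the last two correspond to the "nearly 6%" and "66%" figures in the reply.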
1b. Likewise, as the authors point out there are false positive rates that apply to both the microarray
and the purifying selection analyses, however when summarizing their results (e.g. in the abstract)
the contribution of false positives seems to be ignored. It would be helpful to have a table giving
'corrected' estimates (that remove the expected numbers of false positives) for numbers of coding-like
sORFs, sORFs with transcription evidence, and sORFs with purifying selection evidence.
Reply: We have modified the abstract to include the false positive rate estimates for the microarray and purifying
selection analyses, rather than corrected numbers, since providing corrected numbers may lead to some confusion. We have also generated
a new Figure 1 to illustrate our analysis pipeline, the number of sequences qualifying at each stage, and the false
positive rate estimates where available, to help clarify the issue the reviewer has raised.
2. There also needs to be more discussion of false negative rates. The authors exclude ORFs with
substantial low-complexity regions for finding candidate sORFs, but not (as emerges only in the
Discussion) for assessing false negative rates using known short proteins, because many known short
proteins would otherwise be excluded. Thus it is somewhat misleading to both suggest that their
approach is effective at finding short protein-coding genes (e.g. 'our method has low false negative
rates for predicting small protein genes', p. 4), and that it also has a low false positive rate, because
these claims are based on different analysis criteria. At a minimum the authors should clarify this
issue and present false-positive and false negative rates both with, and without, the low-complexity
filter, so that the reader can assess this.
Reply: We agree with the reviewer that the sentence “our method has low false negative rates…” is
somewhat misleading, as pointed out by Reviewer 1 as well. We have toned down the sentence
(p.6). In our previous analysis, the exonic and intronic sequences for the training models were not subject
to low complexity filtering. The benchmark set (Arabidopsis and yeast small proteins) was not filtered
either. Therefore, the false positive and false negative rate estimates were all based on unfiltered sequences.
In response to the reviewer’s comment, we have completely redone our analyses based on filtered
sequences. We found that the CI values based on filtered sequences are slightly higher than those based on
non-filtered sequences. We have modified the manuscript so that all numbers are based on filtered sequences.
3. I could not find any discussion about how (or whether) overlapping ORFs were handled. Do the
numbers 3274 and 1589 (of total and transcribed sORFs respectively meeting the CI threshold)
include cases of multiple overlapping ORFs (on the same, or opposite) strands? If so, the overlaps
need to be eliminated for these counts.
Reply: The previous version of the manuscript did not discuss how overlapping entries were dealt with. We count
overlapping sORFs as one and use the sORF with the highest CI value in each cluster as the representative.
We have modified our manuscript on p.6 to explain this in more detail and included this information in Figure 1.
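A minimal sketch of the overlap handling described in this reply is given below, in Python. It assumes half-open coordinates, groups sORFs that overlap on the same chromosome regardless of strand, and chains overlaps transitively into one cluster; the `SORF` record and the `pick_representatives` function are hypothetical names, and the manuscript's actual clustering rule may differ in these details.

```python
from typing import NamedTuple

class SORF(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    ci: float    # coding index

def pick_representatives(sorfs):
    """Cluster overlapping sORFs per chromosome; keep the highest-CI member."""
    representatives = []
    cluster, cluster_chrom, cluster_end = [], None, None
    for s in sorted(sorfs, key=lambda x: (x.chrom, x.start, x.end)):
        if cluster and s.chrom == cluster_chrom and s.start < cluster_end:
            # Overlaps the current cluster: extend it.
            cluster.append(s)
            cluster_end = max(cluster_end, s.end)
        else:
            # Close the previous cluster (if any) and start a new one.
            if cluster:
                representatives.append(max(cluster, key=lambda x: x.ci))
            cluster, cluster_chrom, cluster_end = [s], s.chrom, s.end
    if cluster:
        representatives.append(max(cluster, key=lambda x: x.ci))
    return representatives

example = [
    SORF("chr1", 100, 250, 1.2),
    SORF("chr1", 200, 380, 2.5),   # overlaps the first entry
    SORF("chr1", 500, 650, 0.9),
    SORF("chr2", 100, 250, 1.0),
]
reps = pick_representatives(example)
```

In this toy input the two overlapping chr1 entries collapse into one cluster represented by the entry with CI 2.5, so three representatives remain.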
4. There should be some further analyses (in addition to the EST analysis described by the authors)
of potential clustering of the novel sORFs into (known or novel) genes. For example, how many of the
sORFs in the different categories (with/without transcription evidence, with/without purifying
selection evidence) lie within a few hundred bp of a known gene or another sORF, suggesting the
possibility that they are part of the same gene? How many are far enough away (> 5kb) that they are
highly likely to represent distinct genes?
Reply: Arabidopsis has relatively short intergenic sequences, with a median size of 1041 bp. Therefore, it is
difficult to gauge whether sequences belong to the same transcriptional unit using the suggested distance of 5 kb.
Instead, we determined the intron size distribution and found that 95% of introns are smaller than
850 bp. Using 850 bp as the threshold for calling coding sORFs that likely belong to novel genes, 2,341
sORFs qualify. We have discussed this estimate on p.10.
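The distance criterion in this reply can be sketched as follows, in Python. This is an illustration under stated assumptions: the threshold is taken as the 95th percentile of intron lengths, and a coding sORF is treated as a candidate member of a novel gene when its gap to the nearest annotated gene on the same chromosome exceeds that threshold. The function names and the toy coordinates are hypothetical.

```python
import statistics

def percentile_95(values):
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is the 95th."""
    return statistics.quantiles(values, n=100, method="inclusive")[94]

def distance_to_nearest_gene(sorf_start, sorf_end, genes):
    """Gap in bp to the nearest gene (half-open coordinates); 0 if overlapping.

    Returns None when no genes are supplied.
    """
    best = None
    for g_start, g_end in genes:
        if sorf_start < g_end and g_start < sorf_end:
            return 0
        gap = g_start - sorf_end if g_start >= sorf_end else sorf_start - g_end
        best = gap if best is None else min(best, gap)
    return best

# Toy data: in practice the intron lengths would come from the annotation.
toy_intron_lengths = list(range(1, 101))
toy_threshold = percentile_95(toy_intron_lengths)

threshold_bp = 850                          # value quoted in the reply
genes = [(1000, 2000), (5000, 6000)]        # hypothetical annotated genes
near = distance_to_nearest_gene(2500, 2700, genes)
far = distance_to_nearest_gene(3000, 3200, genes)
near_is_candidate = near > threshold_bp
far_is_candidate = far > threshold_bp
```

With the toy genes, the first sORF lies 500 bp from the nearest gene and would not qualify, while the second lies 1000 bp away and would.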
5. It wasn't clear whether novel sORFs were compared to each other to see whether any of them fall
into homology families. This would be useful.
Reply: This is an interesting point that we had overlooked. We have generated similarity clusters using
Markov clustering to determine the presence of gene families. The results are shown on p.9 and in Supplement
C-2.
6. There is another important caveat to interpreting sORFs meeting the CI threshold as being likely
protein coding sequences. Namely, the authors' Bayesian model only includes the categories 'coding'
(in 6 frames), and 'noncoding', with the 'noncoding' model being estimated entirely from intronic
sequences. However, there are many types of noncoding sequence that are likely to be enriched in
intergenic regions relative to intronic regions -- for example RNA genes, promoter regions, 5' and 3'
UTRs. If their composition is significantly different from intronic sequence -- e.g. more GC rich --
then they may well get identified by the model as having a high pp of coding, even though they aren't
coding. So (even apart from the false positive issue) it is not correct to assume that anything having a
strong pp for being coding is likely to be coding; it may just be another type of sequence not explicitly
allowed for by the model. In particular, I don't think you can necessarily conclude that a sORF with
evidence of transcription is likely to be a protein coding gene -- it could be an RNA gene with a
composition that is not intron-like. Some of the claims in the abstract and mss need to be changed to
reflect this caveat.
Reply: We have in fact examined the coding potential of sORFs using several kinds of noncoding sequences,
such as UTRs, intergenic regions, and introns, based on annotated data. Compared to introns, far fewer
and shorter UTR sequences were available. We therefore ran into problems when we tried to extract UTR
ORFs for training purposes, since most UTR ORFs are very short. Since we want to identify coding
sequences in intergenic regions, we were wary of using intergenic regions as training data. We also experimented
with RNA genes, but we could find only a very limited number of RNA genes to use as a training set. Intron
data, on the other hand, are abundant. In addition, we are able to verify the authenticity of introns based on
full-length cDNAs, although we still run into problems with alternative splicing.
The reviewer also pointed out that a coding sORF with transcription evidence is not necessarily a
protein coding gene. We argue that an expressed region with a higher CI is more likely to be a coding
sequence than an expressed region with a lower CI. In addition, we explicitly incorporated exon coding
sequence composition information into the training model, so the coding sORFs have to be dissimilar from introns
AND similar to exons. The transcribed coding sORFs are therefore unlikely to be RNA genes that merely have a non-intron-like composition, as
the reviewer has suggested. We thus feel it is justifiable to say that sORFs with above-threshold CIs and
with evidence of transcription “likely” belong to novel protein coding genes. However, we are aware that there
is some level of uncertainty involved and have added cautionary notes to the manuscript regarding the possibility
that they could be RNA genes.
7. Although the analyses to derive the distribution of CIs and pps in simulated sequence (e.g. Figure
1) are useful, simulated sequences do not fully capture the features of actual sequences (as the
authors acknowledge in the discussion). It would therefore be useful to see similar distribution
histograms for intronic, intergenic, and known coding sequences. This might also illuminate whether
the distribution appears bimodal in the intergenic case (as one expects).
Reply: We agree with Dr. Green and have included the distributions as requested in revised Figure 4.
8. I found the mss difficult to understand in places, partly due to numerous grammatical and
typographical errors (the mss could use careful editing by a native English speaker), and partly
because important details are sometimes omitted (the authors should include enough information to
allow the analyses to be reproduced by another investigator). Moreover, the terminology needs to be
more clearly defined and more accurately and consistently used. "sORF" should mean "short ORF",
which implies nothing about functionality or characteristics of the sequence other than the length of
the potential reading frame. However the authors sometimes seem to use it to instead mean a short
ORF meeting their CI threshold, and sometimes to mean a short ORF that encodes an actual protein.
The definitions need to be clearly stated. At several points the authors use language that confuses the
ORF with the encoded protein (e.g. they refer to the ORF starting with a methionine, rather than
with a methionine codon).
Reply: We apologize for the grammatical errors and typos and have tried to correct them as best we can.
We concur that in the previous version sORFs and sORFs with different kinds of evidence were not rigorously defined.
We have modified our manuscript to use the following definitions and included the same terminology in
Figure 1 to further improve readability:
(1) sORF: short open reading frame starting with ATG that is 90-300 bp long,
(2) coding sORF: sORF with above threshold CI,
(3) transcribed coding sORF: coding sORF with above threshold tiling array feature intensity, and
(4) constrained coding sORF: coding sORF subject to purifying selection/functional constraint.
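To make definition (1) concrete, the following Python sketch enumerates sORFs on one strand of a sequence. It is an illustration under assumptions not stated in this reply: each ORF runs from the first ATG after the previous in-frame stop to the next in-frame stop, the 90-300 bp length excludes the stop codon, and ORFs with no terminating stop are skipped. The name `find_sorfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_sorfs(seq, min_len=90, max_len=300):
    """Return (start, end, frame) tuples for ATG-initiated ORFs on this strand.

    start is 0-based; end is exclusive and does not include the stop codon.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == "ATG":
                    orf_start = i          # first ATG after the last stop
            elif codon in STOP_CODONS:
                length = i - orf_start     # length without the stop codon
                if min_len <= length <= max_len:
                    hits.append((orf_start, i, frame))
                orf_start = None
    return hits

# Toy sequence: a 93 bp ORF (ATG + 30 GCT codons) in frame 2, then a stop.
toy_seq = "CC" + "ATG" + "GCT" * 30 + "TAA" + "GG"
toy_hits = find_sorfs(toy_seq)
```

Scanning the reverse complement in the same way would cover the opposite strand; definitions (2) to (4) then apply successive filters (CI, tiling array intensity, purifying selection) to this set.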
9. Additional points (-- since there are no page numbers in my copy (!), the page numbers below
assume the first page after the abstract is number 1):
Reply: We apologize for this and have included page and line numbers in the revised manuscript.
10. Abstract: "1661 sORFS ... likely belong to novel protein coding genes": this sentence needs to be
rephrased, as it is contradicted two sentences later by "1275 sORFs ... likely belong to novel genes".
Also, all these numbers should be corrected for false positive rates, as discussed above.
Reply: We have revised the sentences in the abstract by including the false positive rate information.
11. p. 1: First paragraph: "These studies demonstates the presence of genic sequences" -- the claim
that transcribed regions necessarily represent protein-coding or RNA genes is controversial. The fact
that a region is transcribed does not in itself imply that it is functional. "Some studies assumed ..." --
give references.
Reply: We have changed “demonstrate” to “suggest” and provide references as requested on p.3.
12. "small proteins (referred to as sORF)" -- no, a protein is not an ORF!! Please use accurate
terminology. Proteins, ORFs, predicted genes, and genes are all distinct entities that should not be
confused with each other. "sORFs include mating pheromones ..." -- same issue.
Reply: We have modified our manuscript to use consistent definitions of the sORFs with different
evidence, as described in the reply to Reviewer 2 comment 8. We have also revised the parts where ORFs,
proteins, and genes were misused, as indicated by Dr. Green.
13. p. 2: "hexamer composition bias, which has been established as the best measure for
distinguishing CDS from NCDS" -- this is an overstatement. "known sORF" -- sORFs are always
'known' simply by computational analysis, because that is how they are defined"! You should say
"known genes" or "known sORF genes". "predicted sORF' -- should be "predicted genes".
Reply: We changed “the best measure” to “a general measure” and incorporated suggested changes
throughout the manuscript.
14. p. 3: "only 3.54% of the introns" (and also Table 1) -- does this mean you are analyzing the entire
intron, as a unit? If so this is not a particularly illuminating regarding false positive rates – better
would be to analyze short ORFs within introns.
Reply: In fact, we used sORFs in introns for the analysis instead of the whole introns. We have modified
the manuscript on p.5 and Table 1 to emphasize this.
15. p.4: "starting with methionine" => "starting with ATG"
Reply: We have incorporated the suggested change.
16. p. 5: As indicated above, you should discuss more thoroughly the issue of false positive rates, for
both sORFs and purifying selection, and how this affects predicted numbers.
Reply: We have revised our manuscript extensively based on the suggestion.
17. p. 6: "sORFs that are transcribed tend to be subject to purifying selection" -- this is an
overstatement, since the enrichment appears to be rather modest. "more than 80% identity" -- what
is the reason for this threshold? It should be something more like 95% identity over a region of some
minimal size. Also, there seems to be an unstated assumption that if the sORF is part of a known
gene, and has an EST match, then there will necessarily be enough EST evidence to overlap other
ESTs associated to the known gene. This assumption should be stated explicitly, and some
justification for it given (e.g. based on patterns of EST overlap for known genes).
Reply: We simply want to point out that the statistical test for enrichment is significant at the 1% level
(originally at 5%). We have included a sentence indicating that the significance is not as robust as we had
expected. The 80% threshold was chosen not for identifying the cognate ESTs of predicted genes but as a
relaxed criterion for determining whether ESTs matching sORFs are related to annotated genes. The reduced identity
requirement is intended to eliminate sORFs that are likely missing exons of annotated genes. Please note that
EST-sORF matches were established with a 97% identity threshold.
The assumption we are making is as follows: if an sORF S has an EST match X and X also matches
a known gene G, then S and G likely belong to the same gene. However, we did not attempt to “overlap
other ESTs associated to the known gene”. We have stated the above assumption explicitly on p.9-10 to
clarify this.
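The S, X, G assumption stated above amounts to a two-step join, sketched here in Python. The hit tuples, identifiers, and function name are hypothetical; only the two identity thresholds (97% for sORF-EST matches and 80% for EST-gene matches) come from these replies, and the length-coverage requirement is omitted for brevity.

```python
SORF_EST_MIN_IDENTITY = 97.0   # percent; cognate EST criterion from the replies
EST_GENE_MIN_IDENTITY = 80.0   # percent; relaxed criterion from the replies

def link_sorfs_to_genes(sorf_est_hits, est_gene_hits):
    """Map each sORF to the known genes it shares an EST with.

    Each hit is a (query_id, subject_id, percent_identity) tuple.
    """
    est_to_genes = {}
    for est, gene, identity in est_gene_hits:
        if identity > EST_GENE_MIN_IDENTITY:
            est_to_genes.setdefault(est, set()).add(gene)

    linked = {}
    for sorf, est, identity in sorf_est_hits:
        if identity > SORF_EST_MIN_IDENTITY and est in est_to_genes:
            linked.setdefault(sorf, set()).update(est_to_genes[est])
    return linked

# Hypothetical hits: S1 and S3 match EST X1, S2 matches EST X2.
sorf_est_hits = [("S1", "X1", 99.0), ("S2", "X2", 98.5), ("S3", "X1", 90.0)]
est_gene_hits = [("X1", "G1", 85.0), ("X2", "G2", 70.0)]
linked = link_sorfs_to_genes(sorf_est_hits, est_gene_hits)
```

Here only S1 is assigned to a known gene (G1): S3 fails the 97% sORF-EST criterion and X2 fails the 80% EST-gene criterion.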
18. p. 7: "regions with low amino acid complexity ... different codon usage" -- give reference. "the
number of sORFs we identified should be treated as a minimum estimate" -- you can only take it as a
minimum estimate if you know it doesn't include any false positives, but it almost certainly does.
Reply: The statement the reviewer highlighted is based on our own observation (e.g. in Figure 2, the first 20+
amino acids tend to have lower posterior probabilities and turn out to be a signal peptide), and we could not find
a reference for this finding. In addition, it is in fact not codon “usage” but codon “composition” that
is different, a mistake we have corrected in the revised manuscript. Finally, we have modified the sentence
regarding the “minimum estimate” on p.10.
19. pp. 7-8: "random matches can only account for ... is only 3.5%" I don't think this argument is
correct, since (if I'm reading the Methods correctly) you only applied the CI filter to one of the two
genomes (namely AT) -- so you don't gain any confidence regarding the matches from the false
positive rate.
Reply: We have eliminated this sentence.
20. p. 10: did CDSs include stop codon, or not? justify prior probability of coding of .4 for
Arabidopsis – this seems substantially higher than proportion of genome thought to encode protein.
Also, need to give prior prob of coding in each frame – was that just .4/ 6
Reply: CDSs should not include the stop codon. However, prompted by the reviewer’s comment, we went back to
check our code and found that stop codons were erroneously included in the intergenic sORFs. The training data
sets, on the other hand, did not include stops. We have re-analyzed our dataset excluding stops. The
presence of the stop codon in the last window significantly reduced the CI in many cases; therefore, the
number of predicted coding sORFs is substantially higher than what we reported in the previous manuscript version.
In the annotation data, the proportion of coding sequences is approximately 0.3. The prior probability
in each frame is, as the reviewer indicated, 0.3/6. We have modified the manuscript and explained these points on
p.14.
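The two corrections described in this reply can be illustrated with a short Python sketch. The helper `strip_terminal_stop` is a hypothetical name, not the code used in the analysis, and the noncoding prior of 1 - 0.3 is an inference from the model having only 'coding' (six frames) and 'noncoding' categories.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def strip_terminal_stop(orf):
    """Remove a trailing in-frame stop codon so the ORF matches the training data."""
    orf = orf.upper()
    if len(orf) >= 3 and len(orf) % 3 == 0 and orf[-3:] in STOP_CODONS:
        return orf[:-3]
    return orf

PRIOR_CODING = 0.3                      # proportion of coding sequence in the annotation
N_FRAMES = 6                            # three frames on each strand
prior_per_frame = PRIOR_CODING / N_FRAMES
prior_noncoding = 1.0 - PRIOR_CODING    # assumption: the remaining prior mass

trimmed = strip_terminal_stop("ATGGCTGCTTAA")
```

Trimming before scoring keeps the last window of an intergenic sORF comparable to the stop-free CDS and NCDS training sequences.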
21. p. 11: methionine is not same as a methionine codon "longest "sub-ORF"" -- what does this
mean? "matched a known gene with > 40% identity ..." does % identity refer to the DNA or the
protein sequence? "with BLAST" -- using what BLAST parameters and thresholds?
Reply: “Methionine” has been changed to “methionine codon”. The sentences including “longest” and “sub-ORF”
have been revised to make them clearer. The >40% identity refers to amino acid sequence identity. We have also included
information on the BLAST search parameters.
22. p. 13: "maximal gap of one", "minimal identity of 80% between adjacent HSPs" -- what does this
mean? "the alignment covers the start position of the ORF" -- why is this condition imposed? sORFs
that correspond to short internal exons would be excluded by this criterion, since the initial ATG of
the ORF could be in the upstream intron rather than the exon itself.
Reply: We realize that the writing in the previous version was a bit confusing and have replaced it on p.18 of the
revised version. The intention of our study is to uncover novel coding genes in the Arabidopsis
genome rather than missing exons; therefore, we deliberately focus on ORFs that start with ATG.
Reviewer 3 Comments for the Author...
With the work presented Hanada and colleagues address the question of the presence of thus far
undetected small ORFs in the genome of Arabidopsis thaliana. This topic is of special importance as
there is increasing evidence that a large fraction of thus far undetected sORFs escaped the initial,
“classical” genome analysis albeit they encode small proteins. Well-established example cases already
demonstrated the biological importance of this class of proteins. Thus a genome scale survey for
them is required. The authors scanned the intergenic space using a simplified method using the
Coding Index and applying evaluated and stringent thresholds for candidate sORFs. Evaluation and
further constriction of candidate sORFs is undertaken using whole genome tiling arrays as well as
evolutionary conservation and purifying selection criteria. Overall the manuscript is well written and
the methodological description as well as the supplementary material provided is exemplary.
However, there are a couple of points that are not sufficiently discussed/addressed in the current
version of the manuscript:
Reply: We would like to thank Reviewer 3 for providing valuable comments on this manuscript.
1. The data flow and the analytical flow, esp the numbers of intermediate candidate sORFs observed
and retained is often difficult to follow and in some instances it is unclear to which total number(s)
the percentages given refer to. E.g. paragraph ”Conservation across species and signatures of
purifying selection”
Reply: We have generated a procedure flow chart (Figure 1) that includes the number of
sORFs identified in the Arabidopsis genome at each stage of the analysis. We have also included information
on how the percentages were derived in various parts of the manuscript, and some percentages have been deleted for clarity.
2. The conservation measure (=>30% identity; AA ID I guess?) against the other plant genomes
(namely Oryza, Poplar, Medicago, Lotus): The percentage of sORFs with homologous counterpart in
the other species is pretty low. The authors discuss four potential reasons for this observation:
• False calls due to the large number of ORFs examined; although I can grasp what is probably
meant the phrasing is misleading as the percentage of false calls is certainly not depending on
the sheer number of call
• Non-functional background translation of proteins. Unfortunately totally unclear to me
• Incomplete plant genome sequences used (for comparison);
I agree that the available sequences for Medicago and Lotus are currently incomplete. Poplar
however has recently been published as close to complete. Unfortunately only a fairly incomplete
dataset for poplar was available for quite some time, so the authors did not really have the
opportunity to compare against the complete genome. Should be pretty straightforward to update
the comparison against poplar against the complete dataset that became available along with the
publication. Most importantly a comparison against the Brassica shotgun reads might be highly
informative due to the close evolutionary relationship.
Reply: We agree with the reviewer that the sentence regarding “false calls” should be clearer and have
modified it accordingly. By background translation we meant the fortuitous translation of regions
with no functional significance. Finally, regarding the incomplete genomes, the poplar genome we analyzed
is the publication release. We appreciate the suggestion and have analyzed the Brassica oleracea genome. The
inclusion of Brassica sequences increases the number of cross-genome matches. The results are presented in
the revised version.
3. The filtering of initially detected sORF (candidates) to get rid of genes, pseudogenes, transposons:
I assume this has been performed using the TIGR annotation as a data basis (?). Especially for the
transposable and repetitive elements I’m not too sure on whether this is a solid data basis. A filtering
of sORF elements against an up to date repeat/TE data resource might be a good idea.
Reply: We used the TIGR annotated protein sequences to conduct a similarity search of the Arabidopsis genome
to remove any regions that are similar to annotated genes, but we did not apply RepeatMasker with up-to-date
repeat/TE data. We have now compared our coding sORFs against the up-to-date repeat/TE data from
Repbase and have incorporated the findings in the manuscript (p.8).
4. The 13.2% of sORFs that are possibly missing exons mentioned within the abstract? This is really
hard to trace in the results section. Again the numbers and percentages given in the last paragraph
within the results section are confusing and hard to follow (percentages of what?). Again a flowchart
would really help.
Reply: As suggested, a figure has been created (Figure 1) to clarify the various numbers and percentages in
the manuscript. We have also included information on how the percentages were generated in the relevant parts of the
manuscript.
5. Numerous sections in Material and Methods are really complicated to read and understand. Eg.
the identification of introns. Long winded and complicated. Some restructuring might help.
Reply: We have rewritten the section on intron identification more clearly (p.13). In addition, we
have modified various parts of the Method section to improve readability. We concur that the CI and EI
parts are rather complicated and have provided Supplements A-1 and A-3 to illustrate how they were
determined.
6. The explanation of the Expression Index (EI) as well as the definition of the (EI) thresholds used is
really hard to follow and understand. Going through the supplied supplementary info I found it
remarkable that most of the sORFs that meet the criteria are nevertheless close to the threshold.
Reply: We have generated Supplement A-3 to explain the EI method. Regarding the observation that the EI
values of the coding sORFs we regarded as transcribed are fairly close to the threshold: closeness in
absolute terms is not necessarily informative, because the distribution of EI values is fairly narrow. A
relatively small absolute difference therefore does not reflect the level of significance.
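The point about narrow distributions can be illustrated with a minimal sketch. It assumes that significance is judged relative to the spread of a background distribution, expressed here as a standardized distance; this is an illustration only and not the EI procedure described in Supplement A-3. The function name and all numerical values are hypothetical and are not taken from the manuscript.

```python
from statistics import mean, stdev


def standardized_distance(value, background):
    """Return how many background standard deviations `value` lies above the background mean."""
    return (value - mean(background)) / stdev(background)


# Hypothetical EI-like background values (illustrative only, not from the manuscript).
narrow_background = [0.98, 0.99, 1.00, 1.01, 1.02]
wide_background = [0.80, 0.90, 1.00, 1.10, 1.20]

# The candidate lies only 0.05 above the background mean in both cases.
candidate = 1.05

z_narrow = standardized_distance(candidate, narrow_background)
z_wide = standardized_distance(candidate, wide_background)

print(f"narrow background: z = {z_narrow:.2f}")
print(f"wide background:   z = {z_wide:.2f}")
```

The same absolute difference corresponds to a much larger standardized distance when the background is narrow, which is why proximity to the threshold alone says little about significance.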
7. The generation of the coding index (CI) and delineation of exon vs. intron CI does not take into
account UTRS and promoters that potentially contain different CIs and thus might lead to the
definition of erroneous thresholds. At least for UTRs a lot of info can be derived from cDNA
information and should also be considered to be used for the threshold definition. Not considering
UTRS and promoters. Thus the selection is exon vs intron but is not taking into account UTRs and
promoters. A coding vs. non-coding selection schema including at least UTRs might be more
adequate.
Reply: Reviewer 2 has a very similar concern. We did in fact use UTRs and promoters as training
sequences in earlier analyses that were not shown. Although there are quite a few full-length cDNAs from
Arabidopsis, the caveats in using UTRs are that many of them are in fact not full length and that the
numbers and lengths of UTRs are much smaller than those of introns. We had trouble obtaining sufficient
numbers of UTR ORFs of length 90-300 bp for our simulation studies. We also experimented with promoter
sequences. Here the problem is that many sequences we defined as promoters are in fact expressed
(based on our tiling array data), even for putative promoter sequences defined according to full-length
cDNAs. In addition, promoters are not easily defined, and most studies have defined promoters rather
arbitrarily. Given these caveats, we decided to focus on introns. We agree that combining intron and
UTR information may improve the model. However, given that our major point is to illustrate the presence
of many potential coding genes not yet identified in the Arabidopsis genome, and given the difficulty in
obtaining sufficiently long UTR ORFs, we feel the improvement would be very limited.
8. Are there data regarding the CI indexes for small cDNA supported exons? Are the values obtained
similar or different to those for long exons?
Reply: Yes, we have examined the CI values for cDNA-supported exons. The CI values of small exons are
lower than those of long exons of the same composition. That is why we conducted simulation studies
to uncover the CI distributions at different sequence lengths for defining CI thresholds (see Supplement
A-1 and A-2).
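The general idea of length-specific thresholds can be sketched as follows. This is a minimal illustration under several assumptions and does not reproduce the CI or the simulations in Supplement A-1 and A-2. The score is a toy stand-in (a sum of hypothetical per-codon log-odds), the simulated sequences use uniform base composition, and the 95th percentile is an arbitrary choice of cutoff. All names (`toy_score`, `simulate_scores`, `empirical_threshold`) are hypothetical.

```python
import itertools
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
# Hypothetical per-codon log-odds (coding vs. non-coding); a stand-in for the real CI.
LOG_ODDS = {codon: rng.uniform(-1.0, 1.0) for codon in CODONS}


def toy_score(seq):
    """Sum the hypothetical log-odds over the in-frame codons of `seq`."""
    return sum(LOG_ODDS[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def simulate_scores(length, n_sim=500):
    """Score `n_sim` random sequences of the given length (uniform base composition)."""
    return [
        toy_score("".join(rng.choice("ACGT") for _ in range(length)))
        for _ in range(n_sim)
    ]


def empirical_threshold(scores, quantile=0.95):
    """Return the empirical quantile of the simulated scores as the threshold."""
    ordered = sorted(scores)
    return ordered[int(quantile * (len(ordered) - 1))]


# One threshold per length class, since the score distribution depends on length.
thresholds = {
    length: empirical_threshold(simulate_scores(length))
    for length in (90, 150, 300)
}
for length, threshold in thresholds.items():
    print(f"length {length} bp: threshold = {threshold:.2f}")
```

Because the score distribution shifts with sequence length, a single threshold applied to all lengths would be too strict for some length classes and too lenient for others; estimating one threshold per length class avoids this.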
9. The size classes used for the CI definition is unclear; => some more details in the methods section
would be helpful
Reply: We have modified the manuscript to explain the size classes more clearly (p.14-15). In addition, we have
made a figure to clarify the CI method and procedures in Supplement A-3 and A-3.
10. The false positive rate obtained for introns is 3,54%. From appr. 133.000 hypothetical sORFs
3274 were found to exceed the CI threshold. Well this is close to false positive rate discovered in
introns and asks for some discussion
Reply: Reviewer 2 has a similar concern (comment 2). In Arabidopsis, 3161 (~12%) annotated genes have
more than one splice variant based on available cDNA/EST data. The true number is likely higher. As a
result, some introns will contain coding sequences, which increases the number of introns predicted as
coding beyond what is expected by chance. We have included discussion of this point on p.6-7 and p.11. In
addition, we found a mistake in our original analysis: stop codons were included when calculating the CI
values for intergenic sORFs. After correction, the number of coding sORFs is significantly higher than the
number of potential false positives.
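The comparison raised in comment 10 can be written out using only the figures quoted there, which refer to the original analysis before the stop-codon correction. The sketch assumes that the intron-derived false positive rate applies uniformly to all hypothetical sORFs; the function name is hypothetical, and the corrected counts are not reproduced here.

```python
def expected_false_positives(n_candidates, false_positive_rate):
    """Number of candidates expected to exceed the threshold by chance alone."""
    return n_candidates * false_positive_rate


n_sorfs = 133_000        # approximate number of hypothetical sORFs (reviewer's figure)
n_passing = 3274         # sORFs exceeding the CI threshold in the original analysis
intron_fp_rate = 0.0354  # false positive rate estimated from introns

observed_fraction = n_passing / n_sorfs
expected_fp = expected_false_positives(n_sorfs, intron_fp_rate)

print(f"observed fraction passing: {observed_fraction:.2%}")
print(f"expected false positives:  {expected_fp:.0f}")
```

With the original figures, the observed fraction of sORFs passing the threshold is below the intron-derived false positive rate, which is the concern the reviewer raised.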
11. Table2: “IQR” should be defined
Reply: IQR is the abbreviation of Inter-Quartile Range and we have included this information in Table 2.