Replies to Reviewers’ comments

We thank the reviewers for their valuable comments and have done our best to respond to each of them. Our point-by-point response is given below.

Reviewer 1
Comments for the Author...
This is an interesting and generally well-written paper. There is emerging evidence that many small genes have escaped annotation, not only in the Arabidopsis genome, but also in the genomes of other plant species. The methods described here should be generally applicable. Overall an interesting and useful paper.
Reply: First of all, we would like to thank Reviewer 1 for the positive evaluation of this manuscript. Our replies to each comment follow.

(Comments 1) Specific Comments The manuscript copy that I downloaded lacks page numbers, making it a bit more difficult to point to specific places in the document. I will number counting the cover page as #1.
Reply: In response to the comment, we added page and line numbers to the manuscript.

(Comments 2) Page 5, second para: Is the CI calculated over all reading frames or just the optimal one?
Reply: The CI was calculated for the optimal frame, because we constructed all possible open reading frames in the intergenic regions before calculating the CI. The method for calculating the CI is described in the Methods. To make this point easier to understand, we added the following sentence to this part: “The CI was calculated for an optimal frame of an sORF because we first constructed the possible open reading frames in intergenic regions.” (Please see lines 8-10 on page 5)

(Comments 3) Page 6, first para. I would not consider a false negative rate of 14% to be particularly "low".
Reply: Following the comment, we removed “low” from the sentence. We now state only that the CI method showed at least 86% specificity, because the specificity is 88.8%, 95.0% and 86.0% for annotated exons, known sORFs in A. thaliana and known sORFs in yeast, respectively.
(Please see lines 10-11 on page 6)

(Comments 4) Page 7: Many ORFs that do not experience purifying selection appear to be expressed. Can the authors comment on and suggest possible explanations for this.
Reply: Following the comment, we discuss the point as follows. Nekrutenko et al. reported that tests of purifying selection sometimes fail to detect purifying selection in shorter sequences. The shorter the nucleotide sequence, the smaller the numbers of synonymous and nonsynonymous substitutions. This is a simple statistical problem: with small counts, it is harder to reach significance than with longer sequences. (Please see lines 17-18 on page 10)

(Comments 5) Page 8, first sentence: It is not clear in what sample this "marginally significant enrichment" is found. Please clarify.
Reply: Following the comment, we revised the sentence as follows: “The number of sORFs undergoing purifying selection of expressed sORFs is significantly larger than that of non-expressed sORFs”. (Please see lines 1-3 on page 8)

(Comments 6) Page 8, second para: The 80% identity criterion applied here is unusually low. Most genome annotators use criteria of 95% identity over 90% of the length for matches of ESTs to cognate regions of the genome. Please comment, justify or modify.
Reply: As the comment points out, the sentences may mislead readers. For the assignment of an sORF to an EST, we used strict criteria (>97% identity over more than 90% of the length). The details are described in “Comparing sORFs and ESTs” in the Materials and Methods section. The 80% identity criterion was used for the assignment between the ESTs matched to an sORF and the annotated genes neighboring that sORF. In this assignment, we assessed the possibility that identified sORFs are parts of annotated genes. To estimate the error rate of our annotation, we used the relaxed criterion (80% identity) in this part.
The detailed description was revised in the text. (Please see lines 8-10 on page 8)

(Comments 7) Page 8, second para: The meaning of the sentence "Since the 16 sORFs ….. any annotated genes." is not clear and needs to be re-phrased.
Reply: This comment is essentially the same as Comment 6 of Reviewer 1. Please see our reply to that comment.

(Comments 8) Pages 8-9: The authors refer to the work by Ayele et al on sequence conservation with Brassica. It would be very interesting and worthwhile to determine the extent of intersection between the sORFs predicted in this paper and Ayele's "conserved intergenic sequences" just as it is of interest to know the intersection between the sORFs and the transcribed intergenic regions detected by the various tiling arrays.
Reply: Following the comment, we show the proportion of each sORF that overlaps highly conserved regions. (Please see Supplement D)

Reviewer 2
Comments for the Author...
The mss includes useful analyses that address an important problem in genomic analysis, the identification of small genes that are missed by current annotation approaches. It should be of interest to Genome Research readers. However there are a number of issues that need to be addressed.
Reply: First of all, we appreciate the many useful comments on this manuscript. Our replies to each of Prof. Phil Green’s comments follow.

(Comments 1-1) 1-1. There needs to be a clearer analysis and discussion of the issue of false positives. The authors choose a CI threshold corresponding to a false positive rate of 1% on simulated sequence, but they find it to be ~3.5% on real intronic sequence. Why the difference -- is it because introns include some 'real' short genes, or because the model used in the simulations doesn't capture important features of real sequence, or is it due to some other reason?
Since there are 133,091 sORFs meeting the filtering criteria, using the intronic false positive rate of 3.5% implies there could be ~4700 false positives among the 133,091, i.e. more than the 3274 ORFs actually found! So it would seem a large fraction of the 3274 are expected to be false positives. This needs to be discussed.
Reply: In response to the comment, we discuss the difference in false positive rate between the simulated data (1%) and the observed data (3.5%). Moreover, we ran 1000 simulations to examine the effect of the false positive rate. As a result, approximately 114 sORFs could have been falsely identified as coding sequences. (Please see lines 9-14 on page 9) (Please see lines 21-25 on page 6 and Supplement G)

(Comments 1-2) 1-2. Likewise, as the authors point out there are false positive rates that apply to both the microarray and the purifying selection analyses, however when summarizing their results (e.g. in the abstract) the contribution of false positives seems to be ignored. It would be helpful to have a table giving 'corrected' estimates (that remove the expected numbers of false positives) for numbers of coding-like sORFs, sORFs with transcription evidence, and sORFs with purifying selection evidence.
Reply: (Should we estimate false positive rates for the microarray and purifying selection analyses? Expression and purifying selection are features of coding sequences, but some real coding sequences are not expressed and are not subject to purifying selection. Therefore, it may be hard to calculate false positive rates for the microarray and purifying selection analyses. Reviewer 2 really wants these false positive rates estimated. He raised the point here and in Comments 10 and 16.)

(Comments 2) 2. There also needs to be more discussion of false negative rates.
The authors exclude ORFs with substantial low-complexity regions for finding candidate sORFs, but not (as emerges only in the Discussion) for assessing false negative rates using known short proteins, because many known short proteins would otherwise be excluded. Thus it is somewhat misleading to both suggest that their approach is effective at finding short protein-coding genes (e.g. 'our method has low false negative rates for predicting small protein genes', p. 4), and that it also has a low false positive rate, because these claims are based on different analysis criteria. At a minimum the authors should clarify this issue and present false-positive and false negative rates both with, and without, the low-complexity filter, so that the reader can assess this.
Reply: (I do not know how to estimate false positive and false negative rates both with and without the low-complexity filter. Can I estimate the rates as follows?)

            With low complexity             Without low complexity
Exon        1 minus positive rate by CI     1 minus positive rate by CI
            (false negative rate)           (false negative rate)
Intron      Positive rate by CI             Positive rate by CI
            (false positive rate)           (false positive rate)

(To estimate the above, should I prepare exon sequences with and without low complexity, and intron sequences with and without low complexity?)

(Comments 3) 3. I could not find any discussion about how (or whether) overlapping ORFs were handled. Do the numbers 3274 and 1589 (of total and transcribed sORFs respectively meeting the CI threshold) include cases of multiple overlapping ORFs (on the same, or opposite) strands? If so, the overlaps need to be eliminated for these counts.
Reply: (Hi Shinhan, how did you construct the open reading frames in the intergenic regions? If ORFs in different frames were created at one locus, how did you deal with the overlapping ORFs?)

(Comments 4) 4.
There should be some further analyses (in addition to the EST analysis described by the authors) of potential clustering of the novel sORFs into (known or novel) genes. For example, how many of the sORFs in the different categories (with/without transcription evidence, with/without purifying selection evidence) lie within a few hundred bp of a known gene or another sORF, suggesting the possibility that they are part of the same gene? How many are far enough away (> 5kb) that they are highly likely to represent distinct genes?
Reply: In response to the comment, we did not consider sORFs with similarity to known genes, because many fragments of known genes remain in the genome as pseudogenes. The identified sORFs are therefore not related to known genes.

(Comments 5) 5. It wasn't clear whether novel sORFs were compared to each other to see whether any of them fall into homology families. This would be useful.
Reply: Following the comment, we clustered the sORFs with the MCL method. The results are shown in Supplement H. (Please see lines 26-29 on page 6 and Supplement H)

(Comments 6) 6. There is another important caveat to interpreting sORFs meeting the CI threshold as being likely protein coding sequences. Namely, the authors' Bayesian model only includes the categories 'coding' (in 6 frames), and 'noncoding', with the 'noncoding' model being estimated entirely from intronic sequences. However, there are many types of noncoding sequence that are likely to be enriched in intergenic regions relative to intronic regions -- for example RNA genes, promoter regions, 5' and 3' UTRs. If their composition is significantly different from intronic sequence -- e.g. more GC rich -- then they may well get identified by the model as having a high pp of coding, even though they aren't coding.
So (even apart from the false positive issue) it is not correct to assume that anything having a strong pp for being coding is likely to be coding; it may just be another type of sequence not explicitly allowed for by the model. In particular, I don't think you can necessarily conclude that a sORF with evidence of transcription is likely to be a protein coding gene -- it could be an RNA gene with a composition that is not intron-like. Some of the claims in the abstract and mss need to be changed to reflect this caveat.
Reply: In response to the comment, we examined the coding potential of the sORFs using several kinds of noncoding sequences from the annotation data, namely UTRs, intergenic regions and introns. The sets of identified sORFs are essentially the same. Therefore, we do not think it is necessary to emphasize that the identified sORFs are unlikely to be intron sequences.

(Comments 7) 7. Although the analyses to derive the distribution of CIs and pps in simulated sequence (e.g. Figure 1) are useful, simulated sequences do not fully capture the features of actual sequences (as the authors acknowledge in the discussion). It would therefore be useful to see similar distribution histograms for intronic, intergenic, and known coding sequences. This might also illuminate whether the distribution appears bimodal in the intergenic case (as one expects).
Reply: As the comment suggests, we had tried to use actual sequences instead of simulated sequences to see the difference in CI between coding and non-coding sequences. However, actual sequences pose a technical problem: we cannot prepare many independent coding and non-coding sequences of each length from them. With simulated sequences, on the other hand, we can generate many independent coding and non-coding sequences of different lengths. That is why we used simulated sequences in this manuscript.

(Comments 8) 8.
I found the mss difficult to understand in places, partly due to numerous grammatical and typographical errors (the mss could use careful editing by a native English speaker), and partly because important details are sometimes omitted (the authors should include enough information to allow the analyses to be reproduced by another investigator). Moreover, the terminology needs to be more clearly defined and more accurately and consistently used. "sORF" should mean "short ORF", which implies nothing about functionality or characteristics of the sequence other than the length of the potential reading frame. However the authors sometimes seem to use it to instead mean a short ORF meeting their CI threshold, and sometimes to mean a short ORF that encodes an actual protein. The definitions need to be clearly stated. At several points the authors use language that confuses the ORF with the encoded protein (e.g. they refer to the ORF starting with a methionine, rather than with a methionine codon).
Reply: Following the comment, the manuscript was edited by a native English speaker. We hope that the revised manuscript is improved. Moreover, we define the terminology as follows.
(1) sORF: a region that starts with a methionine codon and ends with a stop codon.
(2) Potential protein: the region has coding potential.
(3) Transcribed protein: the region is transcribed.
(4) Selected protein: the region is subject to purifying selection.
(5) Functional protein: the region is transcribed or is subject to purifying selection.
(Is this OK? I have not changed the terms yet.)

(Comments 9) Additional points (-- since there are no page numbers in my copy (!), the page numbers below assume the first page after the abstract is number 1):
Reply: In response to the comment, we added page and line numbers to the manuscript.

(Comments 10) Abstract: "1661 sORFS ... likely belong to novel protein coding genes": this sentence needs to be rephrased, as it is contradicted two sentences later by "1275 sORFs ...
likely belong to novel genes". Also, all these numbers should be corrected for false positive rates, as discussed above.
Reply: In response to the comment, we revised the sentences according to the terminology given in our reply to Comment 8. The latter point is essentially the same as Comment 1-2 of Reviewer 2. Please see our reply to that comment.

(Comments 11) p. 1: First paragraph: "These studies demonstates the presence of genic sequences" -- the claim that transcribed regions necessarily represent protein-coding or RNA genes is controversial. The fact that a region is transcribed does not in itself imply that it is functional. "Some studies assumed ..." -- give references.
Reply: Following the comment, we changed “demonstrate” to “may indicate” in the sentence. Moreover, we added references for these parts. (Please see lines 7 and 11 on page 3)

(Comments 12) "small proteins (referred to as sORF)" -- no, a protein is not an ORF!! Please use accurate terminology. Proteins, ORFs, predicted genes, and genes are all distinct entities that should not be confused with each other. "sORFs include mating pheromones ..." -- same issue.
Reply: Following the comment, we changed “small proteins (referred to as sORF)” to “small proteins translated from sORFs”. Moreover, we changed “sORFs include mating pheromones…” to “proteins include mating pheromones…”. (Please see lines 25-26 on page 3)

(Comments 13) p. 2: "hexamer composition bias, which has been established as the best measure for distinguishing CDS from NCDS" -- this is an overstatement. "known sORF" -- sORFs are always 'known' simply by computational analysis, because that is how they are defined! You should say "known genes" or "known sORF genes". "predicted sORF" -- should be "predicted genes".
Reply: Following the comment, we changed “the best measure” to “a general measure”. Moreover, we changed “known sORFs” to “known sORF genes” and “predicted sORF” to “predicted genes”.
(Please see lines 11 and 19 on page 4)

(Comments 14) p. 3: "only 3.54% of the introns" (and also Table 1) -- does this mean you are analyzing the entire intron, as a unit? If so this is not a particularly illuminating regarding false positive rates – better would be to analyze short ORFs within introns.
Reply: (I do not know what kind of data you used for introns and exons. I guess that you used short ORFs within introns.)

(Comments 15) p. 4: "starting with methionine" => "starting with ATG"
Reply: Following the comment, we changed “starting with methionine” to “starting with ATG”. (Please see line 19 on page 6)

(Comments 16) p. 5: As indicated above, you should discuss more thoroughly the issue of false positive rates, for both sORFs and purifying selection, and how this affects predicted numbers.
Reply: This comment is essentially the same as Comments 1-1 and 1-2 of Reviewer 2. Please see our replies to those comments.

(Comments 17) p. 6: "sORFs that are transcribed tend to be subject to purifying selection" -- this is an overstatement, since the enrichment appears to be rather modest. "more than 80% identity" -- what is the reason for this threshold? It should be something more like 95% identity over a region of some minimal size. Also, there seems to be an unstated assumption that if the sORF is part of a known gene, and has an EST match, then there will necessarily be enough EST evidence to overlap other ESTs associated to the known gene. This assumption should be stated explicitly, and some justification for it given (e.g. based on patterns of EST overlap for known genes).
Reply: Following the comment, we changed “sORFs that are transcribed tend to be subject to purifying selection” to “sORFs that are transcribed tend to have purifying selection”. Moreover, the EST analysis had been done with a 95% threshold. Regarding the assumption mentioned by Reviewer 2, we now state it in the text. (Hi Shinhan, I don’t know how to show the justification.)
(Please see lines 15-18 on page 8)

(Comments 18) p. 7: "regions with low amino acid complexity ... different codon usage" -- give reference. "the number of sORFs we identified should be treated as a minimum estimate" -- you can only take it as a minimum estimate if you know it doesn't include any false positives, but it almost certainly does.
Reply: Following the comment, we gave a reference for “regions with low amino acid complexity ... different codon usage”. Moreover, we changed the sentence “Therefore, the number of sORFs we identified should be treated as a minimum estimate in this regard.” to “It indicates that many sORFs are still hiding in the annotated genome.”. (Hi Shinhan, I cannot find references on codon usage bias in transmembrane regions and signal peptides. I think that the amino acid composition of these regions differs from that of other regions. However, the codon usage seems to be similar. But I don’t know.) (Please see lines 12-14 on page 10)

(Comments 19) pp. 7-8: "random matches can only account for ... is only 3.5%" I don't think this argument is correct, since (if I'm reading the Methods correctly) you only applied the CI filter to one of the two genomes (namely AT) -- so you don't gain any confidence regarding the matches from the false positive rate.
Reply: (I do not understand this comment. Can you deal with it?)

(Comments 20) p. 10: did CDSs include stop codon, or not? justify prior probability of coding of .4 for Arabidopsis – this seems substantially higher than proportion of genome thought to encode protein. Also, need to give prior prob of coding in each frame – was that just .4/ 6
Reply: The CDSs in the training data set do not include the stop codon. In the annotation data, the proportion of coding sequence is approximately 0.4. The prior probability for each frame is 0.1333 (=0.3/6) because there are 6 frames in each segment.

(Comments 21) p.
11: methionine is not same as a methionine codon "longest "sub-ORF"" -- what does this mean? "matched a known gene with > 40% identity ..." does % identity refer to the DNA or the protein sequence? "with BLAST" -- using what BLAST parameters and thresholds?
Reply: “methionine” was changed to “methionine codon”. The sentences including “longest” and “sub-ORF” will be revised to make them clear. Moreover, we revised the description of the similarity search. (Please see line 31 on page 13 to line 3 on page 14)

(Comments 22) p. 13: "maximal gap of one", "minimal identity of 80% between adjacent HSPs" -- what does this mean? "the alignment covers the start position of the ORF" -- why is this condition imposed? sORFs that correspond to short internal exons would be excluded by this criterion, since the initial ATG of the ORF could be in the upstream intron rather than the exon itself.
Reply: (Hi Shinhan, can you reply to this comment? I do not know exactly how the analyses were done.)

Phil Green

Reviewer 3
Comments for the Author...
Kousuke HANADA et al., A large number of novel small open reading frames (sORFs) in the intergenic regions of the Arabidopsis thaliana genome are transcribed or under purifying selection

With the work presented Hanada and colleagues address the question of the presence of thus far undetected small ORFs in the genome of Arabidopsis thaliana. This topic is of special importance as there is increasing evidence that a large fraction of thus far undetected sORFs escaped the initial, “classical” genome analysis albeit they encode small proteins. Well-established example cases already demonstrated the biological importance of this class of proteins. Thus a genome scale survey for them is required. The authors scanned the intergenic space using a simplified method using the Coding Index and applying evaluated and stringent thresholds for candidate sORFs.
Evaluation and further constriction of candidate sORFs is undertaken using whole genome tiling arrays as well as evolutionary conservation and purifying selection criteria. Overall the manuscript is well written and the methodological description as well as the supplementary material provided is exemplary. However, there are a couple of points that are not sufficiently discussed/addressed in the current version of the manuscript:
Reply: We would like to thank Reviewer 3 for the positive evaluation of this manuscript. Our replies to each comment follow.

(Comments 1) - The data flow and the analytical flow, esp the numbers of intermediate candidate sORFs observed and retained is often difficult to follow and in some instances it is unclear to which total number(s) the percentages given refer to. E.g. paragraph ”Conservation across species and signatures of purifying selection”
Reply: Following the comment, we made a flow chart showing the process of identifying sORFs in the Arabidopsis genome. Moreover, we revised the section “Conservation across species and signatures of purifying selection” to make it clearer. (Please see Supplement I) (Please see the section “Conservation across species and signatures of purifying selection”)

(Comments 2) - The conservation measure (=>30% identity; AA ID I guess?) against the other plant genomes (namely Oryza, Poplar, Medicago, Lotus): The percentage of sORFs with homologous counterpart in the other species is pretty low. The authors discuss four potential reasons for this observation:
o False calls due to the large number of ORFs examined; although I can grasp what is probably meant the phrasing is misleading as the percentage of false calls is certainly not depending on the sheer number of calls
o Non-functional background translation of proteins.
Unfortunately totally unclear to me
o Incomplete plant genome sequences used (for comparison); I agree that the available sequences for Medicago and Lotus are currently incomplete. Poplar however has recently been published as close to complete. Unfortunately only a fairly incomplete dataset for poplar was available for quite some time, so the authors did not really have the opportunity to compare against the complete genome. Should be pretty straightforward to update the comparison against poplar against the complete dataset that became available along with the publication. Most importantly a comparison against the Brassica shotgun reads might be highly informative due to the close evolutionary relationship.
Reply: (Hi Shinhan, I did not reply to the first point. Could you deal with it?) We revised “non-functional background translation of proteins” to “accumulation of many nonsynonymous substitutions because of non-functional constraint of the related sequences”.

(Comments 3) - The filtering of initially detected sORF (candidates) to get rid of genes, pseudogenes, transposons: I assume this has been performed using the TIGR annotation as a data basis (?). Especially for the transposable and repetitive elements I’m not too sure on whether this is a solid data basis. A filtering of sORF elements against an up to date repeat/TE data resource might be a good idea.
Reply: In response to the comment, we used a BLAST similarity search of the sORFs against the annotated genes rather than following the TIGR annotation.

(Comments 4) - The 13.2% of sORFs that are possibly missing exons mentioned within the abstract? This is really hard to trace in the results section. Again the numbers and percentages given in the last paragraph within the results section are confusing and hard to follow (percentages of what?). Again a flowchart would really help.
Reply: In the “Conservation across species and signature of purifying selection” section, we show the possibility of missing exons.
A figure was created to clarify this point.

(Comments 5) - Numerous sections in Material and Methods are really complicated to read and understand. Eg. the identification of introns. Long winded and complicated. Some restructuring might help.
Reply: Following the comment, we made figures to illustrate the CI and EI methods.

(Comments 6) - The explanation of the Expression Index (EI) as well as the definition of the (EI) thresholds used is really hard to follow and understand. Going through the supplied supplementary info I found it remarkable that most of the sORFs that meet the criteria are nevertheless close to the threshold.
Reply: Following the comment, we made a figure illustrating the method and included it in the supplemental material.

(Comments 7) - The generation of the coding index (CI) and delineation of exon vs. intron CI does not take into account UTRS and promoters that potentially contain different CIs and thus might lead to the definition of erroneous thresholds. At least for UTRs a lot of info can be derived from cDNA information and should also be considered to be used for the threshold definition. Not considering UTRS and promoters. Thus the selection is exon vs intron but is not taking into account UTRs and promoters. A coding vs. non-coding selection schema including at least UTRs might be more adequate.
Reply: This comment is essentially the same as Comment 6 of Reviewer 2. Please see our reply to that comment.

(Comments 8) - Are there data regarding the CI indexes for small cDNA supported exons? Are the values obtained similar or different to those for long exons?
Reply: Yes, we examined the CI for cDNA-supported exons. The CI of small cDNA-supported exons is lower than that of long exons, because the CI tends to increase with exon length. That is why we set thresholds for coding potential that depend on sORF size.

(Comments 9) - The size classes used for the CI definition is unclear; => some more details in the methods section would be helpful
Reply: Following the comment, we made a figure illustrating the method and included it in the supplemental material.

(Comments 10) - The false positive rate obtained for introns is 3,54%. From appr. 133.000 hypothetical sORFs 3274 were found to exceed the CI threshold. Well this is close to false positive rate discovered in introns and asks for some discussion
Reply: This comment is essentially the same as Comment 1-1 of Reviewer 2. Please see our reply to that comment.

(Comments 11) - Table2: “IQR” should be defined
Reply: IQR is the abbreviation of interquartile range. The IQR is a robust estimate of the spread of the data: it is the range between the 25th and 75th percentiles.

As stated above the manuscript is well written and focuses towards a topic which certainly is of high interest for the readership of Genome Research. However a revision of the manuscript considering the points mentioned above might help to improve the manuscript.
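
Note on the false positive arithmetic: the calculation behind Comment 10 of Reviewer 3 and Comment 1-1 of Reviewer 2 can be sketched as follows. This is only a minimal illustration using the counts and rates quoted in the reviewers' comments. It assumes, as a worst case, that every candidate sORF is non-coding and that the false positive rate is the same for all candidates; the function and variable names are hypothetical, and this is not the 1000-replicate simulation mentioned in the reply to Comment 1-1.

```python
# Illustrative arithmetic only; counts and rates are those quoted in the reviewers' comments.
candidate_sorfs = 133_091   # sORFs meeting the filtering criteria
above_threshold = 3_274     # sORFs exceeding the CI threshold


def expected_false_positives(n_candidates: int, fp_rate: float) -> float:
    """Expected number of candidates passing the CI threshold by chance,
    under the worst-case assumption that all candidates are non-coding."""
    return n_candidates * fp_rate


# False positive rates reported for simulated and for real intronic sequence.
for label, rate in [("simulated sequence", 0.01), ("intronic sequence", 0.0354)]:
    expected = expected_false_positives(candidate_sorfs, rate)
    print(f"{label}: rate {rate:.2%} -> about {expected:.0f} expected false positives "
          f"(sORFs above threshold: {above_threshold})")
```

Under these assumptions the expected count is about 1,331 at the 1% rate and about 4,711 at the 3.54% rate, which is why both reviewers ask for a discussion of the 3274 sORFs above the threshold.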