Summary Statement

First, we would like to thank the reviewers for their extremely useful feedback and suggestions. Based on the suggestions, we have repeated the analyses, and this revision reports different numbers of predictions from the previous version of the manuscript. We have indicated in the manuscript the correspondence between the reviewers’ comments and our revisions. The major differences are summarized as follows:

1. Following the comments of Reviewer 2, we have completely reanalyzed our data after removing low complexity sequences from both the CDS and NCDS training data. This change affects the number of sequences predicted as coding. In addition, the low complexity filter is now also applied to the sequences we used to estimate false negative and false positive rates (Arabidopsis exon and intron sequences, Arabidopsis known sORF genes and yeast known sORF genes). Therefore, the numbers in Table 1 and Supplements A, B and C differ from the original ones.

2. During the revision process, we found an error that resulted in an underestimate of the number of sORFs with coding potential in our earlier analysis. The CI values of intergenic sORFs had been determined with the stop codon included, whereas terminal codons were not used in either the CDS or the NCDS training data. In the revised manuscript, we removed the terminal codons from the intergenic sORFs and re-estimated the CI values. The presence of terminal codons had significantly reduced the CI values. Therefore, the number of sORFs with coding potential is larger than the original estimate. Consequently, the number of coding sORFs with evidence of transcription or purifying selection has also increased.

3. The number of sequences related to coding sORFs is much larger than the original estimate. There are two explanations for the difference.
The first is that we modified our search procedure: we now search the genome sequences with translated BLAST instead of searching the translated ORFs from the plant genomes analyzed. Second, and more importantly, following the suggestion of Reviewers 1 and 3, we added Brassica oleracea, which is closely related to Arabidopsis thaliana, to our analyses.

Reviewer 1 Comments for the Author...

1. This is an interesting and generally well-written paper. There is emerging evidence that many small genes have escaped annotation, not only in the Arabidopsis genome, but also in the genomes of other plant species. The methods described here should be generally applicable. Overall an interesting and useful paper. The manuscript copy that I downloaded lacks page numbers, making it a bit more difficult to point to specific places in the document. I will number counting the cover page as #1.

Reply: We appreciate the reviewer's valuable comments and have added page and line numbers to the revised manuscript.

2. Page 5, second para: Is the CI calculated over all reading frames or just the optimal one?

Reply: The CI was calculated for each ORF that we defined from exon, intron, or intergenic regions, so the calculation is always done in the frame that defines the ORF in question.

3. Page 6, first para. I would not consider a false negative rate of 14% to be particularly "low".

Reply: Following this comment, we have toned down the sentence on p.6. In our previous analysis, the exonic and intronic sequences used for the training models were not subject to low complexity filtering, and neither was the benchmark set (Arabidopsis and yeast small proteins). Therefore, the false positive and false negative rate estimates were all based on unfiltered sequences.

4. Page 7: Many ORFs that do not experience purifying selection appear to be expressed. Can the authors comment on and suggest possible explanations for this.
Reply: The number of coding sORFs experiencing purifying selection has increased substantially due to the inclusion of Brassica oleracea sequences. In addition, Nekrutenko et al. (2002, Genome Research) reported that the test of purifying selection is less sensitive for shorter sequences. For shorter sequences, the numbers of synonymous and non-synonymous substitutions are smaller and the rate estimates have correspondingly larger standard errors. Therefore, the likelihood ratio tests we used have lower sensitivity for shorter sequences. Finally, some of the coding sORFs we predicted may be false positives that are in fact RNA genes. These possibilities are discussed on p.11 and p.12.

5. Page8, first sentence: It is not clear in what sample this "marginally significant enrichment" is found. Please clarify.

Reply: We found that the number of coding sORFs undergoing purifying selection is significantly larger among expressed sORFs than among non-expressed sORFs. We have modified the manuscript on p.8-9 to explain this.

6. Page 8, second para: The 80% identity criterion applied here is unusually low. Most genome annotators use criteria of 95% identity over 90% of the length for matches of ESTs to cognate regions of the genome. Please comment, justify or modify.

Reply: We use the EST information in two different ways. The first is to identify the cognate EST(s) of a coding sORF, and for this we did use a strict criterion (>97% identity with a match over 90% of the length). The detailed description can be found in the Method section “Comparing sORFs and ESTs”. The second use of ESTs is to identify coding sORFs that are parts of genes already annotated. The 80% identity criterion was used for this purpose. In doing so, we err on the side of assigning some truly novel coding sORFs as exons of annotated genes. We have modified the manuscript to clarify this (p.9).

7. Page 8, second para: The meaning of the sentence "Since the 16 sORFs ….. any annotated genes."
Is not clear and needs to be re-phrased.

Reply: We have revised this sentence on p.9.

8. Pages 8-9: The authors refer to the work by Ayele et al on sequence conservation with Brassica. It would be very interesting and worthwhile to determine the extent of intersection between the sORFs predicted in this paper and Ayele's "conserved intergenic sequences" just as it is of interest to know the intersection between the sORFs and the transcribed intergenic regions detected by the various tiling arrays.

Reply: We show in the revised Supplement C-1 the proportion of each sORF that overlaps highly conserved regions, and we discuss our findings briefly on p.10.

Reviewer 2 Comments for the Author...

The mss includes useful analyses that address an important problem in genomic analysis, the identification of small genes that are missed by current annotation approaches. It should be of interest to Genome Research readers. However there are a number of issues that need to be addressed.

Reply: First of all, we appreciate the many useful comments from Dr. Green. Our detailed replies to the comments are listed below.

1a. There needs to be a clearer analysis and discussion of the issue of false positives. The authors choose a CI threshold corresponding to a false positive rate of 1% on simulated sequence, but they find it to be ~3.5% on real intronic sequence. Why the difference -- is it because introns include some 'real' short genes, or because the model used in the simulations doesn't capture important features of real sequence, or is it due to some other reason? Since there are 133,091 sORFs meeting the filtering criteria, using the intronic false positive rate of 3.5% implies there could be ~4700 false positives among the 133,091, i.e. more than the 3274 ORFs actually found! So it would seem a large fraction of the 3274 are expected to be false positives. This needs to be discussed.

Reply: Dr.
Green has raised an important question regarding the potential presence of ~4000 false positives among the 133k ORFs and the possibility that many of the predicted coding sORFs are therefore false. The 3.5% is in fact derived from annotated introns rather than the "real" introns mentioned above. Therefore, some false positive introns are most likely CDS mis-annotated as introns. In addition to mis-annotation, some real introns contain CDSs due to alternative splicing. In Arabidopsis, 3161 (~22%) annotated genes have more than one splice variant based on available cDNA/EST data for 13,019 genes with a total of ~56,000 introns. Therefore, nearly 6% of introns in Arabidopsis will contain coding sequences that contribute to the elevated false positive identification. To illustrate the influence of alternative splicing, we have examined the CI of putative alternatively spliced introns. Introns defined based on cDNA evidence were used to search the Arabidopsis protein annotation. Any intron whose translation matched an annotated protein with ≥ 80% amino acid identity over ≥ 80% of the intron length was regarded as alternatively spliced. We found 480 alternatively spliced introns that meet these stringent criteria, and 318 (66%) of these introns have above-threshold CI. This finding reinforces the notion that many introns that can be alternatively spliced will be identified as coding sequences. We have included the above discussion on p.11-12.

1b. Likewise, as the authors point out there are false positive rates that apply to both the microarray and the purifying selection analyses, however when summarizing their results (e.g. in the abstract) the contribution of false positives seems to be ignored. It would be helpful to have a table giving 'corrected' estimates (that remove the expected numbers of false positives) for numbers of codinglike sORFs, sORFs with transcription evidence, and sORFs with purifying selection evidence.
Reply: We have modified the abstract to include the false positive rate estimates for the microarray and purifying selection analyses; we did not report corrected numbers because they may lead to some confusion. We have also generated a new Figure 1 that illustrates our analysis pipeline, the number of sequences qualifying at each stage, and the false positive rate estimates where available, to help clarify the issue the reviewer has raised.

2. There also needs to be more discussion of false negative rates. The authors exclude ORFs with substantial low-complexity regions for finding candidate sORFs, but not (as emerges only in the Discussion) for assessing false negative rates using known short proteins, because many known short proteins would otherwise be excluded. Thus it is somewhat misleading to both suggest that their approach is effective at finding short protein-coding genes (e.g. 'our method has low false negative rates for predicting small protein genes', p. 4), and that it also has a low false positive rate, because these claims are based on different analysis criteria. At a minimum the authors should clarify this issue and present false-positive and false negative rates both with, and without, the low-complexity filter, so that the reader can assess this.

Reply: We agree with the reviewer that the sentence “our method has low false negative rates…” is somewhat misleading, as Reviewer 1 also pointed out. We have toned down the sentence (p.6). In our previous analysis, the exonic and intronic sequences used for the training models were not subject to low complexity filtering, and neither was the benchmark set (Arabidopsis and yeast small proteins). Therefore, the false positive and false negative rate estimates were all based on unfiltered sequences. In response to the reviewer's comment, we have completely redone our analyses based on filtered sequences. We found that the CI values based on filtered sequences are slightly higher than those based on non-filtered sequences.
We have modified the manuscript so that all numbers are based on filtered sequences.

3. I could not find any discussion about how (or whether) overlapping ORFs were handled. Do the numbers 3274 and 1589 (of total and transcribed sORFs respectively meeting the CI threshold) include cases of multiple overlapping ORFs (on the same, or opposite) strands? If so, the overlaps need to be eliminated for these counts.

Reply: We did not discuss in the manuscript how overlapping entries were handled. We counted overlapping sORFs as one and used the sORF with the highest CI value in each cluster as the representative. We have modified the manuscript on p.6 to explain this in more detail and included this information in Figure 1.

4. There should be some further analyses (in addition to the EST analysis described by the authors) of potential clustering of the novel sORFs into (known or novel) genes. For example, how many of the sORFs in the different categories (with/without transcription evidence, with/without purifying selection evidence) lie within a few hundred bp of a known gene or another sORF, suggesting the possibility that they are part of the same gene? How many are far enough away (> 5kb) that they are highly likely to represent distinct genes?

Reply: Arabidopsis has relatively short intergenic sequences, with a median size of 1041 bp. Therefore, it is difficult to gauge whether sequences belong to the same transcriptional unit using the suggested distance of 5 kb. Instead, we determined the intron size distribution and found that 95% of introns are smaller than 850 bp. Using 850 bp as the threshold for calling coding sORFs that likely belong to novel genes, 2,341 sORFs qualify. We discuss this estimate on p.10.

5. It wasn't clear whether novel sORFs were compared to each other to see whether any of them fall into homology families. This would be useful.

Reply: This is an interesting point that we had overlooked.
We have generated similarity clusters using Markov clustering to determine the presence of gene families. The results are shown on p.9 and in Supplement C-2.

6. There is another important caveat to interpreting sORFs meeting the CI threshold as being likely protein coding sequences. Namely, the authors' Bayesian model only includes the categories 'coding' (in 6 frames), and 'noncoding', with the 'noncoding' model being estimated entirely from intronic sequences. However, there are many types of noncoding sequence that are likely to be enriched in intergenic regions relative to intronic regions -- for example RNA genes, promoter regions, 5' and 3' UTRs. If their composition is significantly different from intronic sequence -- e.g. more GC rich -then they may well get identified by the model as having a high pp of coding, even though they aren't coding. So (even apart from the false positive issue) it is not correct to assume that anything having a strong pp for being coding is likely to be coding; it may just be another type of sequence not explicitly allowed for by the model. In particular, I don't think you can necessarily conclude that a sORF with evidence of transcription is likely to be a protein coding gene -- it could be an RNA gene with a composition that is not intron-like. Some of the claims in the abstract and mss need to be changed to reflect this caveat.

Reply: We have in fact examined the coding potential of sORFs using several kinds of noncoding sequences, such as UTRs, intergenic regions and introns, based on annotated data. Compared to introns, far fewer UTR sequences were available, and they were shorter. We therefore ran into problems when we tried to extract UTR ORFs for training, since most UTR ORFs are very short. Since we want to identify coding sequences in intergenic regions, we were wary of using intergenic regions as training data. We also experimented with RNA genes, but we could find only a very limited number of RNA genes to use as a training set.
Intron data, on the other hand, are abundant. In addition, we are able to verify the authenticity of introns based on full-length cDNAs, although we still run into problems with alternative splicing. The reviewer also pointed out that a coding sORF with transcription evidence is not necessarily a protein coding gene. We argue that an expressed region with a higher CI is more likely to be a coding sequence than an expressed region with a lower CI. In addition, we explicitly incorporated exon coding sequence composition information into the training model, so the coding sORFs have to be dissimilar from introns AND similar to exons. The transcribed coding sORFs are therefore unlikely to be RNA genes with a non-intron-like composition, as the reviewer suggested they might be. We thus feel it is justifiable to say that sORFs with above-threshold CIs and with evidence of transcription “likely” belong to novel protein coding genes. However, we are aware that some level of uncertainty is involved and have added cautionary notes to the manuscript regarding the possibility that they could be RNA genes.

7. Although the analyses to derive the distribution of CIs and pps in simulated sequence (e.g. Figure 1) are useful, simulated sequences do not fully capture the features of actual sequences (as the authors acknowledge in the discussion). It would therefore be useful to see similar distribution histograms for intronic, intergenic, and known coding sequences. This might also illuminate whether the distribution appears bimodal in the intergenic case (as one expects).

Reply: We agree with Dr. Green and have included the distributions as requested in the revised Figure 4.

8. I found the mss difficult to understand in places, partly due to numerous grammatical and typographical errors (the mss could use careful editing by a native English speaker), and partly because important details are sometimes omitted (the authors should include enough information to allow the analyses to be reproduced by another investigator).
Moreover, the terminology needs to be more clearly defined and more accurately and consistently used. "sORF" should mean "short ORF", which implies nothing about functionality or characteristics of the sequence other than the length of the potential reading frame. However the authors sometimes seem to use it to instead mean a short ORF meeting their CI threshold, and sometimes to mean a short ORF that encodes an actual protein. The definitions need to be clearly stated. At several points the authors use language that confuses the ORF with the encoded protein (e.g. they refer to the ORF starting with a methioninine, rather than with a methionine codon).

Reply: We apologize for the grammatical errors and typos and have tried to correct them as best we can. We concur that in the previous version sORFs and sORFs with different kinds of evidence were not stringently defined. We have modified the manuscript to use the following definitions and included the same terminology in Figure 1 to further improve readability:
(1) sORF: short open reading frame, starting with ATG, that is 90-300 bp long;
(2) coding sORF: sORF with above-threshold CI;
(3) transcribed coding sORF: coding sORF with above-threshold tiling array feature intensity;
(4) constrained coding sORF: coding sORF subject to purifying selection/functional constraint.

9. Additional points (-- since there are no page numbers in my copy (!), the page numbers below assume the first page after the abstract is number 1):

Reply: We apologize for this and have included page and line numbers in the revised manuscript.

10. Abstract: "1661 sORFS ... likely belong to novel protein coding genes": this sentence needs to be rephrased, as it is contradicted two sentences later by "1275 sORFs ... likely belong to novel genes". Also, all these numbers should be corrected for false positive rates, as discussed above.

Reply: We have revised the sentences in the abstract to include the false positive rate information.

11. p.
1: First paragraph: "These studies demonstates the presence of genic sequences" -- the claim that transcribed regions necessarily represent protein-coding or RNA genes is controversial. The fact that a region is transcribed does not in itself imply that it is functional. "Some studies assumed ..." -give references.

Reply: We have changed “demonstrate” to “suggest” and provided references as requested on p.3.

12. "small proteins (referred to as sORF)" -- no, a protein is not an ORF!! Please use accurate terminology. Proteins, ORFs, predicted genes, and genes are all distinct entities that should not be confused with each other. "sORFs include mating pheromones ..." -- same issue.

Reply: We have modified the manuscript to use consistent definitions of sORFs with different kinds of evidence, as described in the reply to Reviewer 2 comment 8. We have also revised the passages where ORFs, proteins, and genes were misused, as indicated by Dr. Green.

13. p. 2: "hexamer composition bias, which has been established as the best measure for distinguishing CDS from NCDS" -- this is an overstatement. "known sORF" -- sORFs are always 'known' simply by computational analysis, because that is how they are defined"! You should say "known genes" or "known sORF genes". "predicted sORF' -- should be "predicted genes".

Reply: We changed “the best measure” to “a general measure” and incorporated the suggested changes throughout the manuscript.

14. p. 3: "only 3.54% of the introns" (and also Table 1) -- does this mean you are analyzing the entire intron, as a unit? If so this is not a particularly illuminating regarding false positive rates – better would be to analyze short ORFs within introns.

Reply: In fact, we used sORFs within introns for the analysis rather than whole introns. We have modified the manuscript on p.5 and Table 1 to emphasize this.

15. p.4: "starting with methionine" => "starting with ATG"

Reply: We have incorporated the suggested change.

16. p.
5: As indicated above, you should discuss more thoroughly the issue of false positive rates, for both sORFs and purifying selection, and how this affects predicted numbers.

Reply: We have revised our manuscript extensively based on the suggestion.

17. p. 6: "sORFs that are transcribed tend to be subject to purifying selection" -- this is an overstatement, since the enrichment appears to be rather modest. "more than 80% identity" -- what is the reason for this threshold? It should be something more like 95% identity over a region of some minimal size. Also, there seems to be an unstated assumption that if the sORF is part of a known gene, and has an EST match, then there will necessarily be enough EST evidence to overlap other ESTs associated to the known gene. This assumption should be stated explicitly, and some justification for it given (e.g. based on patterns of EST overlap for known genes).

Reply: We simply want to point out that the statistical test for enrichment is significant at the 1% level (originally at 5%). We have included a sentence indicating that the significance is not as robust as we had expected. The 80% threshold was chosen not for identifying the cognate ESTs of predicted genes but as a relaxed criterion for determining whether ESTs matching sORFs are related to annotated genes. The reduced identity requirement is for eliminating sORFs that are likely missing exons of annotated genes. Please note that EST-sORF matches were established with a 97% identity threshold. The assumption we are making is that if an sORF S has an EST match X and X also matches a known gene G, then S and G likely belong to the same gene. However, we did not attempt to “overlap other ESTs associated to the known gene”. We have stated the above assumption explicitly on p.9-10 to clarify this.

18. p. 7: "regions with low amino acid complexity ... different codon usage" -- give reference. "the number of sORFs we identified should be treated as a minimum estimate" -- you can only take it as a minimum estimate if you know it doesn't include any false positives, but it almost certainly does.

Reply: The statement the reviewer highlighted is based on our own observation (e.g. Figure 2: the first 20+ amino acids tend to have lower posterior probabilities and turn out to be a signal peptide), and we could not find a reference for this finding. In addition, it is in fact codon “composition”, not codon “usage”, that is different, a mistake we have corrected in the revised manuscript. Finally, we have modified the sentence regarding “minimum estimate” on p.10.

19. pp. 7-8: "random matches can only account for ... is only 3.5%" I don't think this argument is correct, since (if I'm reading the Methods correctly) you only applied the CI filter to one of the two genomes (namely AT) -- so you don't gain any confidence regarding the matches from the false positive rate.

Reply: We have eliminated this sentence.

20. p. 10: did CDSs include stop codon, or not? justify prior probability of coding of .4 for Arabidopsis – this seems substantially higher than proportion of genome thought to encode protein. Also, need to give prior prob of coding in each frame – was that just .4/ 6

Reply: CDSs should not include the stop codon. However, prompted by the reviewer’s comment, we went back to check our code and found that stop codons were erroneously included in the intergenic sORFs. The training data sets, on the other hand, did not include stop codons. We have re-analyzed our dataset excluding stop codons. The presence of the stop codon in the last window significantly reduced the CI in many cases; therefore, the number of coding sORFs predicted is substantially higher than what we reported in the previous version of the manuscript. In the annotation data, the proportion of coding sequences is approximately 0.3. Also, the prior probability in each frame is the total divided by six, as the reviewer indicated (0.3/6).
We have modified the manuscript and explained these points on p.14.

21. p. 11: methionine is not same as a methionine codon "longest "sub-ORF"" -- what does this mean? "matched a known gene with > 40% identity ..." does % identity refer to the DNA or the protein sequence? "with BLAST" -- using what BLAST parameters and thresholds?

Reply: “Methionine” has been changed to “methionine codon”. The sentences including “longest” and “sub-ORF” have been revised to make them clearer. The >40% identity is amino acid sequence identity. We have also included information on the BLAST search parameters.

22. p. 13: "maximal gap of one", "minimal identity of 80% between adjacent HSPs" -- what does this mean? "the alignment covers the start position of the ORF" -- why is this condition imposed? sORFs that correspond to short internal exons would be excluded by this criterion, since the initial ATG of the ORF could be in the upstream intron rather than the exon itself.

Reply: We realize that the writing in the previous version was a bit confusing and have replaced it on p.18 of the current version. The intention of our study is to uncover novel coding genes in the Arabidopsis genome rather than missing exons; therefore, we deliberately focus on ORFs that start with ATG.

Reviewer 3 Comments for the Author...

With the work presented Hanada and colleagues address the question of the presence of thus far undetected small ORFs in the genome of Arabidopsis thaliana. This topic is of special importance as there is increasing evidence that a large fraction of thus far undetected sORFs escaped the initial, “classical” genome analysis albeit they encode small proteins. Well-established example cases already demonstrated the biological importance of this class of proteins. Thus a genome scale survey for them is required. The authors scaned the intergenic space using a simplified method using the Coding Index and applying evaluated and stringent thresholds for candidate sORFs.
Evaluation and further constriction of candidate sORFs is undertaken using whole genome tiling arrays as well as evolutionary conservation and purifying selection criteria. Overall the manuscript is well written and the methodological description as well as the supplementary material provided is exemplary. However, there are a couple of points that are not sufficiently discussed/addressed in the current version of the manuscript:

Reply: We would like to thank Reviewer 3 for providing valuable comments on this manuscript.

1. The data flow and the analytical flow, esp the numbers of intermediate candidate sORFs observed and retained is often difficult to follow and in some instances it is unclear to which total number(s) the percentages given refer to. E.g. paragraph ”Conservation across species and signatures of purifying selection”

Reply: We have generated a procedure flow chart (Figure 1) that includes the number of sORFs identified in the Arabidopsis genome at each stage of the analysis. We have also explained how the percentages were derived in various parts of the manuscript and deleted some percentages for clarity.

2. The conservation measure (=>30% identity; AA ID I guess?) against the other plant genomes (namely Oryza, Poplar, Medicago, Lotus): The percentage of sORFs with homologous counterpart in the other species is pretty low. The authors discuss four potential reasons for this observation:
False calls due to the large number of ORFs examined; although I can grasp what is probably meant the phrasing is misleading as the percentage of false calls is certainly not depending on the sheer number of call
Non-functional background translation of proteins. Unfortunately totally unclear to me
Incomplete plant genome sequences used (for comparison); I agree that the available sequences for Medicago and Lotus are currently incomplete. Poplar however has recently been published as close to complete.
Unfortunately only a fairly incomplete dataset for poplar was available for quit some time, so the authors did not really have the opportunity to compare against the complete genome. Should be pretty straightforward to update the comparison against poplar against the complete dataset that became available along with the publication. Most importantly a comparison against the Brassica shotgun reads might be highly informative due to the close evolutionary relationship.

Reply: We agree with the reviewer that the sentence regarding “false call” should be clearer and have modified it accordingly. By background translation we meant the fortuitous translation of regions with no functional significance. Regarding the incomplete genomes, the poplar genome we analyzed is the publication release. We appreciate the suggestion and have analyzed the Brassica oleracea genome. The inclusion of Brassica sequences increases the number of cross-genome matches. The results are presented in the revised version.

3. The filtering of initially detected sORF (candidates) to get rid of genes, pseudogenes, transposons: I assume this has been performed using the TIGR annotation as a data basis (?). Especially for the transposable and repetitive elements I’m not too sure on whether this is a solid data basis. A filtering of sORF elements against an up to date repeat/TE data resource might be a good idea.

Reply: We used the TIGR annotated protein sequences to conduct a similarity search of the Arabidopsis genome and remove any regions that are similar to annotated genes. However, we had not applied repeat masking using up-to-date repeat/TE data. We have now compared our coding sORFs against the up-to-date repeat/TE data from Repbase and have incorporated the findings in the manuscript (p.8).

4. The 13.2% of sORFs that are possibly missing exons mentioned within the abstract? This is really hard to trace in the results section.
Again the numbers and percentages given in the last paragraph within the results section are confusing and hard to follow (percentages of what?). Again a flowchart would really help.

Reply: As suggested, we have created a figure (Figure 1) to clarify the various numbers and percentages in the manuscript. We have also explained how the percentages were generated in the relevant parts of the manuscript.

5. Numerous sections in Material and Methods are really complicated to read and understand. Eg. the identification of introns. Long winded and complicated. Some restructuring might help.

Reply: We have rewritten the section on intron identification more clearly (p.13). In addition, we have modified various parts of the Method section to improve readability. We concur that the CI and EI parts are rather complicated, and we have provided Supplement A-1 and A-3 to illustrate how they were determined.

6. The explanation of the Expression Index (EI) as well as the definition of the (EI) thresholds used is really hard to follow and understand. Going through the supplied supplementary info I found it remarkable that most of the sORFs that meet the criteria are nevertheless close to the threshold.

Reply: We have generated Supplement A-3 to explain the EI method. Regarding the observation that the EI values of the coding sORFs we regarded as transcribed are fairly close to the threshold: this closeness is not necessarily meaningful, since the distribution of EI values is fairly narrow. A relatively small absolute difference therefore does not reflect the level of significance.

7. The generation of the coding index (CI) and delineation of exon vs. intron CI does not take into account UTRS and promoters that potentially contain different CIs and thus might lead to the definition of erroneous thresholds. At least for UTRs a lot of info can be derived from cDNA information and should also be considered to be used for the threshold definition. Not considering UTRS and promoters.
Thus the selection is exon vs intron but is not taking into account UTRs and promoters. A coding vs. non-coding selection schema including at least UTRs might be more adequate. Reply: Reviewer 2 has a very similar concern. We did in fact use UTRs and promoters as training sequences in earlier analyses that were not shown. Although there are quite a few full-length cDNAs from Arabidopsis, the caveats in using UTRs are that many of them are in fact not full length and that UTRs are far fewer and shorter than introns. We had trouble obtaining sufficient numbers of UTR ORFs of length 90-300bp for our simulation studies. We also experimented with promoter sequences. Here the problem is that many sequences we defined as promoters are in fact expressed (based on our tiling array data), even for putative promoter sequences defined according to full-length cDNAs. In addition, promoters are not easily defined and most studies have defined promoters rather arbitrarily. With these caveats, we decided to focus on introns. We agree that combining intron and UTR information may improve the model. However, given that our major point is to illustrate the presence of many potential coding genes not yet identified in the Arabidopsis genome, and given the difficulty in obtaining sufficiently long UTR ORFs, we feel the improvement would be very limited. 8. Are there data regarding the CI indexes for small cDNA supported exons? Are the values obtained similar or different to those for long exons? Reply: Yes, we have examined the CI values for cDNA-supported exons. The CI values of small exons are lower than those of long exons of the same composition. That is why we conducted simulation studies to uncover the CI distributions at different sequence lengths for defining CI thresholds (see Supplement A-1 and A-2). 9.
The size classes used for the CI definition are unclear; => some more details in the methods section would be helpful. Reply: We have modified the manuscript to explain the size classes better (p.14-15). In addition, we have made a figure to clarify the CI method and procedures in Supplement A-3 and A-3. 10. The false positive rate obtained for introns is 3.54%. From appr. 133,000 hypothetical sORFs 3274 were found to exceed the CI threshold. Well this is close to the false positive rate discovered in introns and asks for some discussion. Reply: Reviewer 2 has a similar concern (comment 2). In Arabidopsis, 3161 (~12%) annotated genes have more than one splice variant based on available cDNA/EST data. The true number is likely higher. As a result, some introns will contain coding sequences, which contributes to a larger number of introns being predicted as coding than expected. We have included discussion of this point on p.6-7 and p.11. In addition, we found a mistake in our original analysis: stop codons were included in calculating the CI values for intergenic sORFs. After correction, the number of coding sORFs is significantly higher than the number of potential false positives. 11. Table 2: “IQR” should be defined. Reply: IQR stands for interquartile range and we have included this information in Table 2.
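As a rough check on the concern raised in comment 10, the short sketch below compares the number of sORFs expected to pass the CI threshold by chance with the number reported as passing. It uses only the figures quoted in that comment, which refer to the original submission and not to the revised counts. It also assumes, purely for illustration, that the intron-derived false positive rate applies unchanged to intergenic sORFs. The function and variable names are hypothetical.

```python
# Figures quoted in reviewer comment 10 (original submission, not the revised counts).
INTRON_FALSE_POSITIVE_RATE = 0.0354  # 3.54% false positive rate estimated from introns
N_INTERGENIC_SORFS = 133_000         # approximate number of hypothetical sORFs
N_ABOVE_THRESHOLD = 3_274            # sORFs reported to exceed the CI threshold


def expected_false_positives(n_candidates: int, fp_rate: float) -> float:
    """Number of candidates expected to pass the threshold by chance alone.

    Assumes the false positive rate applies uniformly to all candidates.
    """
    return n_candidates * fp_rate


def observed_fraction(n_pass: int, n_candidates: int) -> float:
    """Fraction of candidates that actually passed the threshold."""
    return n_pass / n_candidates


expected = expected_false_positives(N_INTERGENIC_SORFS, INTRON_FALSE_POSITIVE_RATE)
fraction = observed_fraction(N_ABOVE_THRESHOLD, N_INTERGENIC_SORFS)

print(f"expected false positives: {expected:.0f}")
print(f"observed fraction above threshold: {fraction:.2%}")
```

With the original figures the observed count does not exceed the chance expectation, which is the situation the reviewer pointed out; the reply above states that this no longer holds once stop codons are excluded from the CI calculation.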