Supplementary Methods

Sequence information

The IRGSP genome sequence build 3 assembled as of July 31, 2004 was used (International Rice Genome Sequencing Project 2005). An artificial duplication of a segment of ~300 kb found in chromosome 11 (Matsumoto, pers. comm.) was discarded, and therefore the total number of nucleotides (Table 1) is smaller than that reported by the IRGSP. All the full-length rice cDNAs used were available on October 1, 2004, and mRNAs/ESTs of rice, wheat, barley, maize, sorghum, sugarcane, and A. thaliana as of September 1, 2004 (Supplementary Table 1) were downloaded from the DDBJ/EMBL/GenBank DNA databanks. Flanking sequences of 18,056 Tos17 insertion lines (Miyao et al. 2003), 7,583 T-DNA insertion lines (Chen et al. 2003; Sallaud et al. 2004) and 1,072 Ds insertion lines (Kim et al. 2004) were obtained from the DNA databanks. Amino acid sequences from the Rice Proteome Database (http://gene64.dna.affrc.go.jp/RPD/main_en.html) determined directly by Edman sequencing were used to identify ORFs (Komatsu et al. 2004; Komatsu and Tanaka 2005). The A. thaliana genome sequence as of August 13, 2004 was downloaded from NCBI's ftp site (ftp://ftp.ncbi.nih.gov/genomes/). The ORFs of A. thaliana were retrieved from MAtDB (http://mips.gsf.de/proj/thal/db/index.html) as of July 9, 2004 (Schoof et al. 2004).

Repeat-masking

All the repeat sequences in the genome, cDNAs and flanking sequences of insertional mutants were masked by RepeatMasker (http://www.repeatmasker.org/) and the TIGR (The Institute for Genomic Research) Plant Repeat Databases ver. 2 (http://www.tigr.org/tdb/e2k1/plant.repeats/index.shtml) (Ouyang and Buell 2004). We also used RepBase ver. 8.12 (Jurka 2000) for sugarcane, because the data for its repeat sequences were not included in TIGR's databases. Options for RepeatMasker were as follows: -nolow and -xsmall. Vector sequences detected were trimmed.
We discarded 172 contaminants including bacterial DNA, which were found in the rice full-length cDNA dataset (Kikuchi, pers. comm.). Any poly-adenine tails or 5'-poly-thymine tracts (10 or more consecutive adenines or thymines) of the cDNAs were removed by using a custom-made Perl script. Any cDNA sequences with <30 non-repeat nucleotides in total length were not employed for further analyses. In this way, 75 O. sativa and 25 A. thaliana mRNAs were excluded from the dataset, and as a result, 34,640 and 59,709 sequences were used for O. sativa and A. thaliana, respectively.

cDNA mapping to the genome

Positions of the mRNAs on the genome were initially determined by BLASTN 2.2.9 (options: -p blastn -F 'm D' -U T -e 0.01) (Altschul et al. 1997), when nucleotide identity was >60%, E-value <0.01 and coverage ≥40% for non-repetitive regions. The genomic sequences of 40-kb 5'- and 3'-flanking regions as well as the region aligned with an mRNA by BLASTN were selected and then re-aligned with the mRNA by using est2genome from the EMBOSS package ver. 2.7.1 (options: -gappenalty 8 -mismatch 6) (Rice et al. 2000). The alignments were employed when the nucleotide identity was ≥95% and the mRNA coverage against the genome was ≥90%. One of the aligned regions was selected if multiple hits were listed (for details, see Imanishi et al. 2004). However, because 314 mRNAs that could be mapped to multiple positions turned out to be possible chimeras by visual inspection, the ORFs were not predicted in these mRNAs. We did not include 34 mRNAs that mapped to the mitochondrial genome (accession numbers AB076665 and AB076666), and 24 mRNAs that mapped to the chloroplast genome (X15901) in our annotations. For further analyses we used extracted genomic sequences instead of the mRNAs themselves, because sequencing of cDNAs is in general a more error-prone process than sequencing genomic DNA.
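The two-stage acceptance thresholds stated above for cDNA mapping can be written as simple predicates. This is only an illustrative sketch, not the pipeline actually used: the `Alignment` record and both function names are hypothetical, and identity and coverage are assumed to have already been computed as fractions (0 to 1) from the BLASTN and est2genome output.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one mRNA-to-genome alignment."""
    mrna_id: str
    identity: float      # fraction of identical aligned nucleotides (0..1)
    coverage: float      # fraction of the mRNA covered by the alignment (0..1)
    evalue: float = 0.0  # BLASTN E-value (only used by the first stage)


def passes_blastn_prefilter(a: Alignment) -> bool:
    # Stage 1 (BLASTN): identity >60%, E-value <0.01, coverage >=40%.
    return a.identity > 0.60 and a.evalue < 0.01 and a.coverage >= 0.40


def passes_est2genome_filter(a: Alignment) -> bool:
    # Stage 2 (est2genome re-alignment): identity >=95%, coverage >=90%.
    return a.identity >= 0.95 and a.coverage >= 0.90
```

Keeping the two stages as separate predicates mirrors the text: a permissive first pass locates candidate regions, and only the strict second pass decides whether an mRNA counts as mapped.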
Mapping of short DNA sequences (ESTs and flanking sequences of insertional mutants)

For fast mapping of a vast number of DNAs that may contain large gaps or introns, their positions on the genome were determined solely by BLASTN. First, the aligned query region of the top hit was mapped. Second, the next hit was examined, and if a query region reported did not overlap with that of the top hit, it was also mapped. This procedure was repeated in the order of their BLAST scores until all the nucleotides of the query were mapped with no conflicts or no more BLAST hits remained.

Gene prediction

Protein-coding genes were predicted in the genome by using four ab initio prediction methods: Fgenesh trained by monocot data (Salamov and Solovyev 2000), GENSCAN with A. thaliana and maize matrices (Burge and Karlin 1997), and GLocate that was developed for rice gene finding (Numa, pers. comm.). One of the gene structures predicted was selected by a modified version of Combiner ver. 1 (Allen et al. 2004). If equivalent predictions were obtained by multiple programs, then results were selected in the following order of preference: Fgenesh, GENSCAN (A. thaliana), GENSCAN (maize) and GLocate. We took Fgenesh results first because it had been reported in previous work that this software gave slightly better results than others (Yao et al. 2005). Predicted genes were included only if the region was covered by any mRNAs or ESTs so that all the genes in our final dataset had physical clone support.

The tRNA genes in the O. sativa genome were predicted by tRNAscan-SE ver. 1.23 (Lowe and Eddy 1997). The numbers of the tRNA genes of A. thaliana were obtained from the Genomic tRNA Database (http://lowelab.ucsc.edu/GtRNAdb/). The rDNAs were detected by RepeatMasker with the TIGR Repeat Databases.

cDNA clustering

If exons predicted or identified for different transcripts shared the same genomic region on the same strand, then they were placed in the same cluster (locus).
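The exon-sharing rule for clustering mapped transcripts can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "shared the same genomic region" is interpreted here as an overlap of at least one base between exons on the same chromosome and strand, coordinates are taken as 1-based inclusive, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def cluster_transcripts(transcripts):
    """Group transcripts whose exons overlap on the same chromosome and strand.

    transcripts: dict mapping transcript id -> (chrom, strand, [(start, end), ...])
    Returns a sorted list of clusters, each a sorted list of transcript ids.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Flatten to one record per exon and sort by chromosome, strand, start.
    exons = sorted(
        (chrom, strand, start, end, tid)
        for tid, (chrom, strand, exon_list) in transcripts.items()
        for start, end in exon_list
    )

    # Sweep: extend the current run of overlapping exons, or start a new run.
    run_key, run_end, run_tid = None, None, None
    for chrom, strand, start, end, tid in exons:
        if run_key == (chrom, strand) and start <= run_end:
            union(run_tid, tid)
            run_end = max(run_end, end)
        else:
            run_key, run_end, run_tid = (chrom, strand), end, tid

    clusters = defaultdict(set)
    for tid in transcripts:
        clusters[find(tid)].add(tid)
    return sorted(sorted(members) for members in clusters.values())
```

Because only exons are compared, a transcript lying entirely within an intron of another transcript, or on the opposite strand, stays in its own cluster under this interpretation.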
Unmapped mRNAs were compared using BLASTN, and they were clustered when the E-value = 0. For details about locus IDs, see the following URL: http://rapdb.lab.nig.ac.jp/note.html#nomenclature

Evaluation of unmapped clusters

Of the rice mRNAs, 93% could be mapped to the genome. This figure was unexpectedly low when compared with figures for other cDNA-based annotations (cf. Imanishi et al. 2004). We applied the same analysis pipeline to A. thaliana using the latest genome build and 59,734 mRNAs. We found that, using our criteria, 97% of the A. thaliana mRNAs could be mapped to the genome. The lower proportion of cDNAs that could be mapped to the rice genome does not seem to be due to an erroneous mapping pipeline. The average identity of all rice mRNAs mapped to the genome was 99.9%, suggesting that most of the mRNAs were in their correct positions on the genome. Since about 5% of the rice genome still remains to be sequenced (IRGSP 2005), it is possible that a proportion of sequences could be transcribed from within these as yet unsequenced genomic regions (Nagaki et al. 2004; Wu et al. 2004).

To further check whether the unmapped mRNAs were derived from unsequenced portions of the IRGSP genome, those mRNAs were compared with the japonica and indica rice genome contigs determined and assembled by other groups independently of IRGSP (Goff et al. 2002; Yu et al. 2005). The project accession numbers of the contigs are: AACV00000000 (versions AACV01000001.1-AACV01035047.1) and AAAA00000000 (versions AAAA02000001.1-AAAA02050231.1). Repetitive regions were masked in these sequences by the aforementioned method. Of 2,102 representatives of unmapped cDNA clusters, 285 could be mapped to multiple positions in the IRGSP genome and we used the remaining 1,817 sequences for further analyses. These mRNAs were aligned to the contigs by BLASTN. As a result, 160 could be mapped to the japonica contigs with ≥95% identity and ≥90% coverage, and 152 were mapped to the indica contigs. Since the contigs were relatively short, the mRNAs could only partially be mapped. In fact, we found that 729 were aligned to japonica and 715 were aligned to indica, with ≥95% identity. The majority of the unmapped mRNAs did not seem to be due to contaminations but to be derived from unsequenced regions in the IRGSP genome.

ORF prediction

Transcripts identified by mRNAs or predicted by ab initio methods were BLASTX searched against the UniProtKB/Swiss-Prot (release 44.6) and UniProtKB/TrEMBL (release 27.6) databases, reviewed rice RefSeq proteins as of September 16, 2004, and the Rice Proteome Database as of October 20, 2004. If the deduced amino acid sequence of an ORF was ≥50% identical to protein(s) in these databases, this predicted amino acid sequence of the frame was assigned as a known or homologous protein. If no known or homologous proteins for a given mRNA were detected by BLASTX, the ORF was predicted using GeneMark (Borodovsky and McIninch 1993). For this prediction, a training dataset of 3rd order Markov models was prepared using the 1,906 annotated rice mRNAs in the DNA databanks. If no ORF was suggested by GeneMark, we selected the longest ORF with greater than 80 amino acids (a.a.). Since the start codon (ATG) was subsequently inferred, the resultant ORF could be smaller than 80 a.a. The remaining 725 loci in which no appropriate coding frames were detected became non-protein-coding RNA candidates. The most upstream ATG was taken as the start codon unless the ATG codon was located inside any regions aligned to homologs reported by BLASTX. The FLcDNAs may contain introns due to incomplete splicing. We detected and eliminated the unspliced introns using the same method as Imanishi et al. (2004).

All the ORFs predicted were subjected to InterProScan (ver.
3.3) searches (Zdobnov and Apweiler 2001; Quevillon et al. 2005) with the InterPro database 8.1 (Apweiler et al. 2001) to detect motifs/domains/families, but 'frequent hitters' (Supplementary Table 8) that tend to give false predictions were not used for curation and ORF categorization (see below). Gene Ontology (GO) IDs were assigned by using the InterProScan results, and the ORFs were classified according to the GO hierarchy (Ashburner et al. 2000).

IRGSP and RAP annotation standards

The ORFs had originally been classified into four classes according to the IRGSP Annotation Standard (http://demeter.bio.bnl.gov/Annotation.html): 'known', 'similar', 'unknown', or 'hypothetical' protein. Although we essentially followed this original standard, because a large number of cDNAs including rice full-length cDNAs were available at the time of curation, the standard was modified as follows. The 'known' and 'similar' classes correspond to Categories I and II, respectively, although criteria of significant similarity differed slightly. The 'unknown' proteins are further classified into Categories III, IV and V. The original 'hypothetical' proteins, those predicted by ab initio prediction methods only, are not included in the current dataset.

Analysis and annotation of non-protein-coding (np) RNAs

In the RAP dataset we could identify 1,168 transcripts that lacked an ORF or encoded a short putative peptide (≤80 amino acids). These transcripts were first mapped to the rice genome and their exon structures were examined to verify proper locus mapping (>95% sequence identity and ≥90% coverage). Then, features such as the genomic context (presence of upstream and/or downstream genes within 5 kb), canonical polyadenylation signals (AATAAA or ATTAAA), polyadenosine tails, support by ESTs, and antisense transcripts were inspected (Imanishi et al. 2004). These transcripts were classified into four categories as described in the text. All npRNAs were subjected to sequence homology search against 23,996 known plant and animal RNA genes (Imanishi et al. 2004). However, no significant hits were identified, which suggested that all 131 putative npRNAs are either undescribed among other plant species or unique. Orthologous plant RNA genes were also investigated using BLASTN searches and no homologous genes were identified. This may be due to the fact that there have only been a limited number of npRNA genes reported for any plant species (MacIntosh et al. 2001; Schoof and Karlowski 2003).

References

Allen, J.E., Pertea, M., and Salzberg, S.L. 2004. Computational gene prediction using multiple sources of evidence. Genome Res. 14: 142-148.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D. et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29: 37-40.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Borodovsky, M. and McIninch, J. 1993. GeneMark: Parallel Gene Recognition for both DNA Strands. Computers & Chemistry 17: 123-133.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Chen, S., Jin, W., Wang, M., Zhang, F., Zhou, J., Jia, Q., Wu, Y., Liu, F., and Wu, P. 2003. Distribution and characterization of over 1000 T-DNA tags in rice genome. Plant J. 36: 105-113.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H. et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100.
Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K.O., Barrero, R.A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M. et al. 2004. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2: 0859-0875.
International Rice Genome Sequencing Project. 2005. The map-based sequence of the rice genome. Nature 436: 793-800.
Jurka, J. 2000. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16: 418-420.
Kim, C.M., Piao, H.L., Park, S.J., Chon, N.S., Je, B.I., Sun, B., Park, S.H., Park, J.Y., Lee, E.J., Kim, M.J. et al. 2004. Rapid, large-scale generation of Ds transposant lines and analysis of the Ds insertion sites in rice. Plant J. 39: 252-263.
Komatsu, S., Kojima, K., Suzuki, K., Ozaki, K., and Higo, K. 2004. Rice Proteome Database based on two-dimensional polyacrylamide gel electrophoresis: its status in 2003. Nucleic Acids Res. 32: D388-392.
Komatsu, S. and Tanaka, N. 2005. Rice proteome analysis: a step toward functional analysis of the rice genome. Proteomics 5: 938-949.
Lowe, T.M. and Eddy, S.R. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25: 955-964.
MacIntosh, G.C., Wilkerson, C., and Green, P.J. 2001. Identification and analysis of Arabidopsis expressed sequence tags characteristic of non-coding RNAs. Plant Physiol. 127: 765-776.
Miyao, A., Tanaka, K., Murata, K., Sawaki, H., Takeda, S., Abe, K., Shinozuka, Y., Onosato, K., and Hirochika, H. 2003. Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15: 1771-1780.
Nagaki, K., Cheng, Z., Ouyang, S., Talbert, P.B., Kim, M., Jones, K.M., Henikoff, S., Buell, C.R., and Jiang, J. 2004. Sequencing of a rice centromere uncovers active genes. Nat. Genet. 36: 138-145.
Ouyang, S. and Buell, C.R. 2004. The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 32: D360-363.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and Lopez, R. 2005. InterProScan: protein domains identifier. Nucleic Acids Res. 33: W116-120.
Rice, P., Longden, I., and Bleasby, A. 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16: 276-277.
Salamov, A.A. and Solovyev, V.V. 2000. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.
Sallaud, C., Gay, C., Larmande, P., Bes, M., Piffanelli, P., Piegu, B., Droc, G., Regad, F., Bourgeois, E., Meynard, D. et al. 2004. High throughput T-DNA insertion mutagenesis in rice: a first step towards in silico reverse genetics. Plant J. 39: 450-464.
Schoof, H., Ernst, R., Nazarov, V., Pfeifer, L., Mewes, H.W., and Mayer, K.F. 2004. MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res. 32: D373-376.
Schoof, H. and Karlowski, W.M. 2003. Comparison of rice and Arabidopsis annotation. Curr. Opin. Plant Biol. 6: 106-112.
Wu, J., Yamagata, H., Hayashi-Tsugane, M., Hijishita, S., Fujisawa, M., Shibata, M., Ito, Y., Nakamura, M., Sakaguchi, M., Yoshihara, R. et al. 2004. Composition and structure of the centromeric region of rice chromosome 8. Plant Cell 16: 967-976.
Yao, H., Guo, L., Fu, Y., Borsuk, L.A., Wen, T.J., Skibbe, D.S., Cui, X., Scheffler, B.E., Cao, J., Emrich, S.J. et al. 2005. Evaluation of five ab initio gene prediction programs for the discovery of maize genes. Plant Mol. Biol. 57: 445-460.
Yu, J., Wang, J., Lin, W., Li, S., Li, H., Zhou, J., Ni, P., Dong, W., Hu, S., Zeng, C. et al. 2005. The Genomes of Oryza sativa: a history of duplications. PLoS Biol. 3: 0266-0281.
Zdobnov, E.M. and Apweiler, R. 2001. InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847-848.