Supplementary Methods

Sequence information
The IRGSP genome sequence build 3 assembled as of July 31, 2004 was used
(International Rice Genome Sequencing Project 2005).
An artificial duplication of a
segment of ~300 kb found in chromosome 11 (Matsumoto, pers. comm.) was discarded, and
therefore the total number of nucleotides (Table 1) is smaller than that reported by the IRGSP.
All the full-length rice cDNAs used were available on October 1, 2004, and mRNAs/ESTs of
rice, wheat, barley, maize, sorghum, sugarcane, and A. thaliana as of September 1, 2004
(Supplementary Table 1) were downloaded from the DDBJ/EMBL/GenBank DNA databanks.
Flanking sequences of 18,056 Tos17 insertion lines (Miyao et al. 2003), 7,583 T-DNA
insertion lines (Chen et al. 2003; Sallaud et al. 2004) and 1,072 Ds insertion lines (Kim et al.
2004) were obtained from the DNA databanks.
Amino acid sequences from the Rice
Proteome Database (http://gene64.dna.affrc.go.jp/RPD/main_en.html) determined directly by
Edman sequencing were used to identify ORFs (Komatsu et al. 2004; Komatsu and Tanaka
2005).
The A. thaliana genome sequence as of August 13, 2004 was downloaded from NCBI's
ftp site (ftp://ftp.ncbi.nih.gov/genomes/). The ORFs of A. thaliana were retrieved from
MAtDB (http://mips.gsf.de/proj/thal/db/index.html) as of July 9, 2004 (Schoof et al. 2004).
Repeat-masking
All the repeat sequences in the genome, cDNAs and flanking sequences of insertional
mutants were masked by RepeatMasker (http://www.repeatmasker.org/) and the TIGR (The
Institute for Genomic Research) Plant Repeat Databases ver. 2
(http://www.tigr.org/tdb/e2k1/plant.repeats/index.shtml) (Ouyang and Buell 2004). We also
used RepBase ver. 8.12 (Jurka 2000) for sugarcane, because the data for its repeat sequences
were not included in TIGR's databases.
Options for RepeatMasker were as follows: -nolow and -xsmall.
Vector sequences detected were trimmed.
We discarded 172 contaminants
including bacterial DNA, which were found in the rice full-length cDNA dataset (Kikuchi,
pers. comm.). Any poly-adenine tails or 5'-poly-thymine tracts (10 or more consecutive
adenines or thymines) of the cDNAs were removed by using a custom-made Perl script.
Any cDNA sequences with <30 non-repeat nucleotides in total length were not employed for
further analyses.
In this way, 75 O. sativa and 25 A. thaliana mRNAs were excluded from
the dataset, and as a result, 34,640 and 59,709 sequences were used for O. sativa and A.
thaliana, respectively.
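The custom Perl script is not reproduced here. Purely as an illustration, the tail trimming and the 30-nucleotide filter might look like the following Python sketch. It assumes that repeats are soft-masked to lowercase (as the -xsmall option does) and that trimming precedes the length check; the function names are hypothetical.

```python
import re

MIN_RUN = 10         # "10 or more consecutive adenines or thymines"
MIN_NON_REPEAT = 30  # sequences with fewer non-repeat bases are dropped


def trim_poly_tracts(seq: str) -> str:
    """Remove a 3' poly-A tail and a 5' poly-T tract of at least MIN_RUN bases."""
    seq = re.sub(r"[Aa]{%d,}$" % MIN_RUN, "", seq)
    seq = re.sub(r"^[Tt]{%d,}" % MIN_RUN, "", seq)
    return seq


def non_repeat_length(seq: str) -> int:
    """Count unmasked bases, assuming soft-masked repeats are lowercase."""
    return sum(1 for base in seq if base in "ACGT")


def keep_cdna(seq: str) -> bool:
    """Apply the trimming, then the minimum non-repeat length filter."""
    return non_repeat_length(trim_poly_tracts(seq)) >= MIN_NON_REPEAT
```

Ambiguous bases (N) are not counted as non-repeat nucleotides in this sketch; that choice is an assumption.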
cDNA mapping to the genome
Positions of the mRNAs on the genome were initially determined by BLASTN 2.2.9
(options: -p blastn -F 'm D' -U T -e 0.01) (Altschul et al. 1997). Hits were accepted when the
nucleotide identity was >60%, the E-value was <0.01 and the coverage was ≥40% for
non-repetitive regions. The genomic region aligned with an mRNA by BLASTN, together with
its 40-kb 5'- and 3'-flanking regions, was then extracted and re-aligned with the mRNA by
using est2genome from the EMBOSS package ver. 2.7.1 (options: -gappenalty 8 -mismatch 6)
(Rice et al. 2000).
The
alignments were employed when the nucleotide identity was ≥95% and the mRNA coverage
against the genome was ≥90%.
One of the aligned regions was selected if multiple hits were
listed (for details, see Imanishi et al. 2004).
However, 314 mRNAs that could be mapped to multiple positions appeared on visual
inspection to be possible chimeras, so ORFs were not predicted for these mRNAs.
We did not include 34 mRNAs that mapped to
the mitochondrial genome (accession numbers AB076665 and AB076666), and 24 mRNAs
that mapped to the chloroplast genome (X15901) in our annotations. For further analyses
we used extracted genomic sequences instead of the mRNAs themselves, because sequencing
of cDNAs is in general a more error-prone process than sequencing genomic DNA.
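The two acceptance stages described in this section can be summarized as simple threshold checks. The sketch below is illustrative only: the Alignment record and the function names are hypothetical, and the clamping of the 40-kb flanks to the chromosome ends is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    identity: float      # percent nucleotide identity
    coverage: float      # percent of the mRNA covered by the alignment
    evalue: float = 0.0  # only meaningful for the BLASTN stage


def passes_blastn_stage(a: Alignment) -> bool:
    """Initial BLASTN criteria: identity >60%, E-value <0.01, coverage >=40%."""
    return a.identity > 60.0 and a.evalue < 0.01 and a.coverage >= 40.0


def passes_est2genome_stage(a: Alignment) -> bool:
    """Final est2genome criteria: identity >=95% and mRNA coverage >=90%."""
    return a.identity >= 95.0 and a.coverage >= 90.0


def flanked_region(hit_start: int, hit_end: int, chrom_len: int,
                   flank: int = 40_000) -> tuple:
    """Extend a 1-based BLASTN hit by the flank size, clamped to the chromosome."""
    return max(1, hit_start - flank), min(chrom_len, hit_end + flank)
```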
Mapping of short DNA sequences (ESTs and flanking sequences of insertional mutants)
For fast mapping of a vast number of DNAs that may contain large gaps or introns,
their positions on the genome were determined solely by BLASTN.
First, the aligned query
region of the top hit was mapped. Second, the next hit was examined, and if a query region
reported did not overlap with that of the top hit, it was also mapped. This procedure was
repeated in the order of their BLAST scores until all the nucleotides of the query were
mapped with no conflicts or no more BLAST hits remained.
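The iterative procedure above amounts to a greedy selection of non-overlapping query regions in descending order of BLAST score. A minimal Python sketch follows; the Hit record and its field names are hypothetical, and parsing of the BLAST output is omitted.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    q_start: int   # 1-based, inclusive query coordinates
    q_end: int
    score: float   # BLAST score
    subject: str   # genomic sequence that was hit


def select_hits(hits: List[Hit], query_len: int) -> List[Hit]:
    """Accept hits by descending score, skipping any that overlap an accepted one."""
    accepted: List[Hit] = []
    covered = 0
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        no_conflict = all(hit.q_end < a.q_start or hit.q_start > a.q_end
                          for a in accepted)
        if no_conflict:
            accepted.append(hit)
            covered += hit.q_end - hit.q_start + 1
            if covered >= query_len:  # every query nucleotide is mapped
                break
    return accepted
```

For example, with hits covering query positions 1-100 (score 200), 90-150 (score 120) and 101-180 (score 100), the second hit is skipped because it overlaps the top hit, and the first and third are mapped.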
Gene prediction
Protein-coding genes were predicted in the genome by using four ab initio predictions:
Fgenesh trained on monocot data (Salamov and Solovyev 2000), GENSCAN with the A.
thaliana matrix and GENSCAN with the maize matrix (Burge and Karlin 1997), and GLocate,
which was developed for rice gene finding (Numa, pers. comm.). One of the gene structures
predicted was selected by a modified version of Combiner ver. 1 (Allen et al. 2004).
If equivalent predictions were
obtained by multiple programs, then results were selected in the following order of
preference: Fgenesh, GENSCAN (A. thaliana), GENSCAN (maize) and GLocate.
We took
Fgenesh results first because it had been reported in previous work that this software gave
slightly better results than others (Yao et al. 2005). Predicted genes were included only if the
region was covered by any mRNAs or ESTs so that all the genes in our final dataset had
physical clone support.
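The tie-breaking order and the clone-support requirement can be illustrated as follows. This sketch does not reproduce the modified Combiner itself; it assumes that "covered by any mRNAs or ESTs" means a simple coordinate overlap on the same sequence, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Order of preference stated in the text for equivalent predictions.
PREFERENCE: List[str] = [
    "Fgenesh",
    "GENSCAN (A. thaliana)",
    "GENSCAN (maize)",
    "GLocate",
]


def pick_equivalent(programs: Iterable[str]) -> str:
    """Among programs giving equivalent predictions, return the preferred one."""
    return min(programs, key=PREFERENCE.index)


def has_clone_support(gene: Tuple[int, int],
                      evidence: Iterable[Tuple[int, int]]) -> bool:
    """True if any mapped mRNA/EST interval overlaps the predicted gene region."""
    g_start, g_end = gene
    return any(e_start <= g_end and g_start <= e_end
               for e_start, e_end in evidence)
```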
The tRNA genes in the O. sativa genome were predicted by tRNAscan-SE ver. 1.23
(Lowe and Eddy 1997).
The numbers of the tRNA genes of A. thaliana were obtained from
the Genomic tRNA Database (http://lowelab.ucsc.edu/GtRNAdb/). The rDNAs were
detected by RepeatMasker with the TIGR Repeat Databases.
cDNA clustering
If exons predicted or identified for different transcripts shared the same genomic region
on the same strand, then they were placed in the same cluster (locus).
Unmapped mRNAs
were compared with each other using BLASTN, and they were clustered when the E-value was 0.
For details
about locus IDs, see the following URL:
http://rapdb.lab.nig.ac.jp/note.html#nomenclature
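The clustering rule for mapped transcripts can be sketched as single-linkage grouping of transcripts whose exons overlap on the same strand. The code below is an illustration under that reading, not the pipeline actually used; the input tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_transcripts(exons):
    """Group transcripts whose exons overlap on the same sequence and strand.

    exons: iterable of (transcript_id, chrom, strand, start, end), inclusive.
    Returns a list of sets of transcript IDs (one set per locus).
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    by_strand = defaultdict(list)
    for tid, chrom, strand, start, end in exons:
        parent.setdefault(tid, tid)
        by_strand[(chrom, strand)].append((start, end, tid))

    for intervals in by_strand.values():
        intervals.sort()
        run_end, run_tid = None, None
        for start, end, tid in intervals:
            if run_end is not None and start <= run_end:
                parent[find(tid)] = find(run_tid)  # exon overlaps current run
                run_end = max(run_end, end)
            else:
                run_end, run_tid = end, tid        # start a new run

    clusters = defaultdict(set)
    for tid in parent:
        clusters[find(tid)].add(tid)
    return list(clusters.values())
```

Transcripts on opposite strands are never merged here, even when their exons overlap in coordinates.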
Evaluation of unmapped clusters
Of the rice mRNAs, 93% could be mapped to the genome. This figure was
unexpectedly low when compared with figures for other cDNA-based annotations (cf.
Imanishi et al. 2004).
We applied the same analysis pipeline to A. thaliana using the latest
genome build and 59,734 mRNAs. We found that, using our criteria, 97% of the A. thaliana
mRNAs could be mapped to the genome.
The lower proportion of cDNAs that could be
mapped to the rice genome does not seem to be due to an erroneous mapping pipeline. The
average identity of all rice mRNAs mapped to the genome was 99.9%, suggesting that most
of the mRNAs were in their correct positions on the genome.
Since about 5% of the rice
genome still remains to be sequenced (IRGSP 2005), it is possible that a proportion of
sequences could be transcribed from within these as yet unsequenced genomic regions
(Nagaki et al. 2004; Wu et al. 2004).
To further check whether the unmapped mRNAs were derived from unsequenced
portions of the IRGSP genome, those mRNAs were compared with the japonica and indica
rice genome contigs determined and assembled by other groups independently of IRGSP
(Goff et al. 2002; Yu et al. 2005).
The project accession numbers of the contigs are:
AACV00000000 (versions AACV01000001.1-AACV01035047.1) and AAAA00000000
(versions AAAA02000001.1-AAAA02050231.1).
Repetitive regions were masked in these sequences by the aforementioned method.
Of 2,102 representatives of unmapped cDNA
clusters, 285 could be mapped to multiple positions in the IRGSP genome and we used the
remaining 1,817 sequences for further analyses.
These mRNAs were aligned to the contigs by BLASTN.
As a result, 160 could be mapped to the japonica contigs with ≥95% identity
and ≥90% coverage, and 152 were mapped to the indica contigs.
Since the contigs were relatively short, the mRNAs could only partially be mapped.
In fact, we found that 729 were
aligned to japonica and 715 were aligned to indica, with ≥95% identity.
The majority of the
unmapped mRNAs therefore seemed to be derived from unsequenced regions of the IRGSP
genome rather than from contamination.
ORF prediction
Transcripts identified by mRNAs or predicted by ab initio methods were BLASTX
searched against the UniProtKB/Swiss-Prot (release 44.6) and UniProtKB/TrEMBL (release
27.6) databases, reviewed rice RefSeq proteins as of September 16, 2004, and the Rice
Proteome Database as of October 20, 2004.
If the deduced amino acid sequence of an ORF
was ≥50% identical to protein(s) in these databases, this predicted amino acid sequence of the
frame was assigned as a known or homologous protein.
If no known or homologous
proteins for a given mRNA were detected by BLASTX, the ORF was predicted using
GeneMark (Borodovsky and McIninch 1993).
For this prediction, a training dataset of 3rd
order Markov models was prepared using the 1,906 annotated rice mRNAs in the DNA
databanks.
If no ORF was suggested by GeneMark, we selected the longest ORF of
more than 80 amino acids (a.a.).
Since the start codon (ATG) was subsequently inferred,
the resultant ORF could be shorter than 80 a.a.
The remaining 725 loci in which no
appropriate coding frames were detected became non-protein-coding RNA candidates.
The
most upstream ATG was taken as the start codon unless the ATG codon was located inside any
regions aligned to homologs reported by BLASTX.
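The fallback step (longest ORF, then start-codon inference) can be illustrated as below. This is a sketch under several assumptions: only the three forward frames are scanned, an "ORF" is taken to be a stop-free stretch before the ATG is inferred, and the exception for ATGs inside BLASTX-aligned regions is not implemented. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_open_frame(seq, min_codons=81):
    """Longest stop-free stretch over the three forward frames.

    Returns (start, end) as 0-based half-open nucleotide coordinates, or None
    if the longest stretch is not longer than 80 codons.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        seg_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if best is None or pos - seg_start > best[1] - best[0]:
                    best = (seg_start, pos)
                seg_start = pos + 3
        # Stretch that runs to the 3' end without meeting a stop codon.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end > seg_start and (best is None or end - seg_start > best[1] - best[0]):
            best = (seg_start, end)
    if best is None or (best[1] - best[0]) // 3 < min_codons:
        return None
    return best


def trim_to_first_atg(seq, span):
    """Move the start of the open frame to its most upstream in-frame ATG."""
    start, end = span
    seq = seq.upper()
    for pos in range(start, end, 3):
        if seq[pos:pos + 3] == "ATG":
            return pos, end
    return None
```

Because the ATG is located only after the open frame has been chosen, the trimmed ORF can end up shorter than the 80-a.a. threshold, as noted in the text.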
The FLcDNAs may contain introns due to incomplete splicing. We detected and
eliminated the unspliced introns using the same method as Imanishi et al. (2004).
All the ORFs predicted were subjected to InterProScan (ver. 3.3) searches (Zdobnov
and Apweiler 2001; Quevillon et al. 2005) with the InterPro database 8.1 (Apweiler et al.
2001) to detect motifs/domains/families, but 'frequent hitters' (Supplementary Table 8) that
tend to give false predictions were not used for curation and ORF categorization (see below).
Gene Ontology (GO) IDs were assigned by using the InterProScan results, and the ORFs were
classified according to the GO hierarchy (Ashburner et al. 2000).
IRGSP and RAP annotation standards
The ORFs had originally been classified into four classes according to the IRGSP
Annotation Standard (http://demeter.bio.bnl.gov/Annotation.html): 'known', 'similar',
'unknown', or 'hypothetical' protein.
Although we essentially followed this original standard,
because a large number of cDNAs including rice full-length cDNAs were available at the time
of curation, the standard was modified as follows.
The 'known' and 'similar' classes
correspond to Categories I and II, respectively, although criteria of significant similarity
differed slightly. The 'unknown' proteins are further classified into Categories III, IV and V.
The original 'hypothetical' proteins, those predicted by ab initio prediction methods only, are
not included in the current dataset.
Analysis and annotation of non-protein-coding (np) RNAs
In the RAP dataset we could identify 1,168 transcripts that lacked an ORF or encoded a
short putative peptide (≤80 amino acids).
These transcripts were first mapped to the rice
genome and their exon structures were examined to verify proper locus mapping (>95%
sequence identity and ≥90% coverage).
Then, features such as the genomic context
(presence of upstream and/or downstream genes within 5 kb), canonical polyadenylation
signals (AATAAA or ATTAAA), polyadenosine tails, support by ESTs, and antisense
transcripts were inspected (Imanishi et al. 2004).
These transcripts were classified into four categories as described in the text.
All npRNAs were subjected to sequence homology
search against 23,996 known plant and animal RNA genes (Imanishi et al. 2004).
However,
no significant hits were identified, which suggested that all 131 putative npRNAs are either
undescribed among other plant species or unique.
Orthologous plant RNA genes were also
investigated using BLASTN searches and no homologous genes were identified.
This may
be due to the fact that there have only been a limited number of npRNA genes reported for
any plant species (MacIntosh et al. 2001; Schoof and Karlowski 2003).
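One of the features inspected above, the presence of a canonical polyadenylation signal, could be checked with a simple scan such as the following. The 50-nt search window near the 3' end is an assumption made for this sketch (no window size is stated in the text), and the function name is hypothetical.

```python
import re

# Canonical polyadenylation signals named in the text.
POLYA_SIGNAL = re.compile(r"AATAAA|ATTAAA")


def find_polya_signals(transcript, window=50):
    """Return (distance_from_3prime_end, hexamer) for each signal that starts
    within the last `window` nucleotides of the transcript."""
    seq = transcript.upper()
    tail_start = max(0, len(seq) - window)
    return [(len(seq) - m.start(), m.group(0))
            for m in POLYA_SIGNAL.finditer(seq, tail_start)]
```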
References
Allen, J.E., Pertea, M., and Salzberg, S.L. 2004. Computational gene prediction using
multiple sources of evidence. Genome Res. 14: 142-148.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman,
D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res. 25: 3389-3402.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P.,
Cerutti, L., Corpet, F., Croning, M.D. et al. 2001. The InterPro database, an integrated
documentation resource for protein families, domains and functional sites. Nucleic
Acids Res. 29: 37-40.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T. et al. 2000. Gene ontology: tool for the
unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Borodovsky, M. and McIninch, J. 1993. GeneMark: Parallel Gene Recognition for both DNA
Strands. Computers & Chemistry 17: 123-133.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic
DNA. J. Mol. Biol. 268: 78-94.
Chen, S., Jin, W., Wang, M., Zhang, F., Zhou, J., Jia, Q., Wu, Y., Liu, F., and Wu, P. 2003.
Distribution and characterization of over 1000 T-DNA tags in rice genome. Plant J.
36: 105-113.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions,
A., Oeller, P., Varma, H. et al. 2002. A draft sequence of the rice genome (Oryza sativa
L. ssp. japonica). Science 296: 92-100.
Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K.O., Barrero, R.A.,
Tamura, T., Yamaguchi-Kabata, Y., Tanino, M. et al. 2004. Integrative annotation of
21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2: 0859-0875.
International Rice Genome Sequencing Project. 2005. The map-based sequence of the rice
genome. Nature 436: 793-800.
Jurka, J. 2000. Repbase update: a database and an electronic journal of repetitive elements.
Trends Genet. 16: 418-420.
Kim, C.M., Piao, H.L., Park, S.J., Chon, N.S., Je, B.I., Sun, B., Park, S.H., Park, J.Y., Lee,
E.J., Kim, M.J. et al. 2004. Rapid, large-scale generation of Ds transposant lines and
analysis of the Ds insertion sites in rice. Plant J. 39: 252-263.
Komatsu, S., Kojima, K., Suzuki, K., Ozaki, K., and Higo, K. 2004. Rice Proteome Database
based on two-dimensional polyacrylamide gel electrophoresis: its status in 2003.
Nucleic Acids Res. 32: D388-392.
Komatsu, S. and Tanaka, N. 2005. Rice proteome analysis: a step toward functional analysis
of the rice genome. Proteomics 5: 938-949.
Lowe, T.M. and Eddy, S.R. 1997. tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Res. 25: 955-964.
MacIntosh, G.C., Wilkerson, C., and Green, P.J. 2001. Identification and analysis of
Arabidopsis expressed sequence tags characteristic of non-coding RNAs. Plant
Physiol. 127: 765-776.
Miyao, A., Tanaka, K., Murata, K., Sawaki, H., Takeda, S., Abe, K., Shinozuka, Y., Onosato,
K., and Hirochika, H. 2003. Target site specificity of the Tos17 retrotransposon shows
a preference for insertion within genes and against insertion in retrotransposon-rich
regions of the genome. Plant Cell 15: 1771-1780.
Nagaki, K., Cheng, Z., Ouyang, S., Talbert, P.B., Kim, M., Jones, K.M., Henikoff, S., Buell,
C.R., and Jiang, J. 2004. Sequencing of a rice centromere uncovers active genes. Nat.
Genet. 36: 138-145.
Ouyang, S. and Buell, C.R. 2004. The TIGR Plant Repeat Databases: a collective resource for
the identification of repetitive sequences in plants. Nucleic Acids Res. 32: D360-363.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and Lopez, R.
2005. InterProScan: protein domains identifier. Nucleic Acids Res. 33: W116-120.
Rice, P., Longden, I., and Bleasby, A. 2000. EMBOSS: the European Molecular Biology Open
Software Suite. Trends Genet. 16: 276-277.
Salamov, A.A. and Solovyev, V.V. 2000. Ab initio gene finding in Drosophila genomic DNA.
Genome Res. 10: 516-522.
Sallaud, C., Gay, C., Larmande, P., Bes, M., Piffanelli, P., Piegu, B., Droc, G., Regad, F.,
Bourgeois, E., Meynard, D. et al. 2004. High throughput T-DNA insertion mutagenesis
in rice: a first step towards in silico reverse genetics. Plant J. 39: 450-464.
Schoof, H., Ernst, R., Nazarov, V., Pfeifer, L., Mewes, H.W., and Mayer, K.F. 2004. MIPS
Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource
for plant genomics. Nucleic Acids Res. 32: D373-376.
Schoof, H. and Karlowski, W.M. 2003. Comparison of rice and Arabidopsis annotation. Curr.
Opin. Plant Biol. 6: 106-112.
Wu, J., Yamagata, H., Hayashi-Tsugane, M., Hijishita, S., Fujisawa, M., Shibata, M., Ito, Y.,
Nakamura, M., Sakaguchi, M., Yoshihara, R. et al. 2004. Composition and structure of
the centromeric region of rice chromosome 8. Plant Cell 16: 967-976.
Yao, H., Guo, L., Fu, Y., Borsuk, L.A., Wen, T.J., Skibbe, D.S., Cui, X., Scheffler, B.E., Cao,
J., Emrich, S.J. et al. 2005. Evaluation of five ab initio gene prediction programs for
the discovery of maize genes. Plant Mol. Biol. 57: 445-460.
Yu, J., Wang, J., Lin, W., Li, S., Li, H., Zhou, J., Ni, P., Dong, W., Hu, S., Zeng, C. et al. 2005.
The Genomes of Oryza sativa: a history of duplications. PLoS Biol. 3: 0266-0281.
Zdobnov, E.M. and Apweiler, R. 2001. InterProScan - an integration platform for the
signature-recognition methods in InterPro. Bioinformatics 17: 847-848.