Annotation Sequence 202

Sequence 202 Annotation
Comparison of Sequence 202 to Maize Chromosome 10
Sequence 202 is known to be a portion of maize chromosome 10 (base pairs 21163386 to
21556916). Because of the existence of a fragmented gene (202-10), it seemed possible that some
section of the DNA was missing from sequence 202. A BLAST search of sequence 202 against maize
chromosome 10 showed that the entirety of sequence 202 matched chromosome 10. A dotplot
of sequence 202 against the corresponding section of chromosome 10 showed the somewhat surprising
result that a section of sequence 202 is reversed relative to the maize reference sequence
(maizesequence.org) version of chromosome 10.
Results of comparing sequence 202 to maize chromosome 10.
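To illustrate how a dotplot exposes a reversed section, the following is a minimal sketch in Python, not the tool used for the figure. It records exact k-mer matches on the forward strand and on the reverse complement; an inverted segment shows up as hits in the reverse-complement list. The function names, the word size k and the toy sequences are assumptions made for illustration only.

```python
# Hypothetical helper names; toy data only, not sequence 202.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_hits(query, ref, k=8):
    """Return (forward_hits, reverse_hits) as lists of (query_pos, ref_pos)."""
    # Index every k-mer of the reference by its start position.
    index = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j + k], []).append(j)
    forward, reverse = [], []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        if "N" in word:  # skip repeat-masked positions
            continue
        forward.extend((i, j) for j in index.get(word, []))
        reverse.extend((i, j) for j in index.get(revcomp(word), []))
    return forward, reverse

# Toy reference and a query in which bases 10..29 are inverted.
ref = "ATGGCGTACCTTGAGCATCGGATTACAGCTTAACCGGTCA"
query = ref[:10] + revcomp(ref[10:30]) + ref[30:]
fwd, rev = kmer_hits(query, ref, k=8)
```

Plotting the forward hits and the reverse hits in two colours would give a diagonal that flips direction across the inverted block, which is the pattern described above.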
Gene Prediction
Sequence 202 is a 400,000 bp segment from maize chromosome 10. It contained 85.06% repetitive
elements, which were masked by RepeatMasker (replaced with “N”). The overall GC content was
47.51%, with observed/expected ratios of 0.6944017 for CG and 0.8954022 for GC dinucleotides. All of
these values are within the expected range for genomic maize sequence (see supplemental information).
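The GC content and the observed/expected dinucleotide ratios quoted above can be reproduced with a short calculation. The sketch below is one common way to do it and is an assumption about the method, since the report does not state the formula: expected counts are taken as (count of first base × count of second base) / number of unambiguous bases, and masked “N” positions are ignored. The function name gc_stats is hypothetical.

```python
def gc_stats(seq):
    """Return (gc_percent, obs_exp_CG, obs_exp_GC) for a DNA string.

    Assumed formula: expected dinucleotide count = n(X) * n(Y) / N,
    where N counts only A, C, G and T (masked "N" is ignored).
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    gc_percent = 100.0 * (counts["G"] + counts["C"]) / total

    def obs_exp(dinuc):
        # Count overlapping occurrences of the dinucleotide.
        observed = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)
        expected = counts[dinuc[0]] * counts[dinuc[1]] / total
        return observed / expected if expected else 0.0

    return gc_percent, obs_exp("CG"), obs_exp("GC")

# Toy input, not sequence 202.
example = gc_stats("ACGTACGTNNNN")
```

A ratio below 1 for CG, as reported for sequence 202, means the dinucleotide occurs less often than base composition alone would predict.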
The repeat-masked sequence was submitted to the gene prediction programs GeneMark and FGENESH.
GeneMark predicted 14 genes and FGENESH predicted 5. As all but one of the FGENESH predictions
overlapped with a GeneMark prediction, the results were combined into a total of 15 predicted genes,
referred to as 202-1 through 202-15. For 9 of the predicted proteins, no BLASTP matches were
obtained. For 3 more, only matches with very poor E values were obtained (see Excel
spreadsheet). For the final 3 predicted proteins, 202-6, 202-10, and 202-14, matches with strong E
values were obtained. For each of these, the GeneMark and FGENESH predictions overlapped at least
partially.
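The way two prediction sets are combined into one list of loci can be sketched as follows. This is an illustrative reconstruction, not the procedure actually used: predictions are reduced to (start, end) coordinate pairs, strand is ignored, and the function names and coordinates are hypothetical. Every prediction from the first program defines a locus, and a prediction from the second program adds a new locus only if it overlaps none from the first.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def combine_predictions(primary, secondary):
    """Merge two prediction sets into a sorted list of loci.

    Each locus is (interval, supporting_secondary_intervals).
    """
    loci = [(p, [s for s in secondary if overlaps(p, s)]) for p in primary]
    # Secondary predictions with no primary overlap become their own loci.
    loci.extend((s, []) for s in secondary
                if not any(overlaps(s, p) for p in primary))
    return sorted(loci, key=lambda locus: locus[0][0])

# Hypothetical coordinates: three primary and two secondary predictions.
primary = [(100, 500), (1000, 1500), (3000, 3600)]
secondary = [(450, 700), (2000, 2400)]
loci = combine_predictions(primary, secondary)
```

Applied to the counts in the text, 14 primary predictions plus the single non-overlapping secondary prediction give the 15 loci reported.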
Predicted Gene 202-6
For predicted gene 202-6 there were overlapping GeneMark and FGENESH predictions. The FGENESH
prediction had a group of strong BLASTP hits, but the much shorter GeneMark prediction had no
hits. The FGENESH predicted gene was a very strong BLASTP match to a long
list of predicted, putative and hypothetical genes in many plant species, including Sorghum bicolor,
Oryza sativa, Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Ricinus communis. Although
accession ABG21939.1 is described as an expressed protein, the reference for this is a rice sequence
paper (BMC Biol. 2005; 3: 20) and the genes described there were based on FGENESH prediction. As all of
these genes are likely predictions based on sequence annotation, additional information about this gene
was difficult to obtain. A search of the NCBI conserved domain database detected a match with the DUF936
superfamily, which consists of hypothetical Arabidopsis and rice genes of unknown function. An
InterProScan search for conserved domains returned no hits. Because it is highly conserved in so many
plant species, this is likely to be a real gene, but no information about gene function is available.
A search of the EST database using the DNA sequence of 202-6 returned three maize mRNA hits:
DN212513.1, CF038585.1, and CF037762.1 (CF038585.1 and CF037762.1 are the same sequence). All of
these mapped to chromosome 10 in the maize sequence, in the same region as our sequence.
Interestingly, although DN212513.1 was a perfect match to our sequence, it had the opposite
orientation (see figure “mRNA 1 vs 202-6” in wiki). The presence of these ESTs supports the hypothesis
that 202-6 is a true expressed gene of unknown function.
Predicted Gene 202-10
The predictions for gene 202-10 from GeneMark and FGENESH differed by 3 amino acids. Both had
strong BLASTP hits to 24-methylenesterol C-methyltransferase genes in maize and
S-adenosyl-L-methionine:delta24-sterol-C-methyltransferase genes in dicots. At least one of these corn
genes, NP_001149131.1, is from a paper based on cDNA (Insights into corn genes derived from large-scale
cDNA sequencing, Plant Mol. Biol. 69 (1-2), 179-194 (2009)) and as such is likely an expressed gene
in maize. However, our predicted gene was either 117 amino acids (GeneMark) or 120 amino acids
(FGENESH) in length, roughly one third the length of the best BLAST hits, which were proteins of 360 to
370 amino acids. Part of predicted gene 202-10 was closely aligned to several of the methyltransferase
genes (amino acids 24-62 of the FGENESH prediction), but the highly conserved portion of the
methyltransferase genes was much longer. Several attempts were made to find a longer section of this
gene to determine whether it was a real gene that only partially matched the methyltransferase genes.
Genscan gave a longer predicted protein (242 amino acids), but the BLAST matches were similar.
When the FGENESH prediction was split into two exons, the first exon was the source of the
methyltransferase matches, but the second exon had no strong matches. Adding a longer section of
DNA to the second exon did not result in better matches. As a result, it seems unlikely that 202-10 is a
true protein that has a methyltransferase-like domain attached to another domain that is not similar to
methyltransferase.
When the EST database was searched using the 202-10 DNA sequence, there were many hits that seemed to
correspond to the gene segment described above. These ESTs did not map back to chromosome 10. The hits
are probably to the mRNA of several different full-length methyltransferase genes and do not reflect
transcription of 202-10.
Predicted gene 202-10 may be a pseudogene, the partial gene may have been inserted in a clone, or the
contig may have been incorrectly assembled. As it seems unlikely that the same mis-assembly or partial
gene insertion would have occurred in separate experiments, the agreement between sequence 202 and the
reference chromosome 10 lends credence to the hypothesis that 202-10 is a pseudogene. A pseudogene
may be the result of retrotransposon activity or gene duplication, or may be a gene that has lost
function.
Predicted Gene 202-14
The GeneMark and FGENESH predictions were the same for 202-14. As with 202-6, this predicted
protein had strong BLASTP hits to several hypothetical proteins in sorghum and rice (along with a hit to a
hypothetical maize protein). A hit to an “expressed protein” in rice, ABA97195.1, referred to the same
paper (BMC Biol. 2005; 3: 20). No result was returned when the NCBI conserved domain database was
searched. Like 202-6, this gene is highly conserved among plant species and seems likely to be a real gene.
When this gene was searched against the EST database, there were several hits. The best BLAST EST hit
maps to a gene on chromosome 10. This supports the hypothesis that 202-14 is a true expressed gene
of unknown function.