Seq5part2(50244-10000)

1. NCBI Blast

EST database

- EST1019 Zea mays embryo sac cDNA library Zea mays cDNA clone ES2376 5', mRNA sequence (GenBank: CF972490.1) “Transcriptome of Zea mays embryo sac”
- zmrww005_0B20-004-d01.s0 zmrww005 Zea mays cDNA 5', mRNA sequence (GenBank: CK371462.1) “Functional Genomics of Root Growth and Root Signaling Under Drought”
- zmrww00_0B20-004-d01.s2 zmrww00 Zea mays cDNA 3', mRNA sequence (GenBank: CF637313.1) “Functional Genomics of Root Growth and Root Signaling Under Drought”
- 25273824 CERES-504 Zea mays cDNA clone 1585441 3', mRNA sequence (GenBank: FL474732.1) “Insights into corn genes derived from large-scale cDNA sequencing” Tissue: root and shoot
- ZM_BFb0268O02.f ZM_BFb Zea mays cDNA 3', mRNA sequence (GenBank: DY536200.1) “Maize Full-length cDNA Project”
- Zm03_02c05_A Zm03_AAFC_ECORC_cold_stressed_maize_seedlings Zea mays cDNA clone Zm03_02c05, mRNA sequence (GenBank: BG319847.1) “Expressed Sequence Tags from Cold-Stressed Maize Seedlings Grown Under High Light Intensity” Tissue: leaf
- E0081 Zea mays egg cell cDNA library Zea mays cDNA clone 117 5' similar to retroposon, mRNA sequence (GenBank: DR452005.1) “Transcriptome of Zea mays egg cell”
- EST5083 Zea mays sperm cell cDNA library Zea mays cDNA clone Zmsp12591 5', mRNA sequence (GenBank: CK700981.1) “Sperm cells of Zea mays have a complex complement of mRNAs”
- EL01N0306E03.b Endosperm_3 Zea mays cDNA, mRNA sequence (GenBank: CD433258.1) “Characterization of the maize endosperm transcriptome and its comparison to the rice genome”

I then used these sequences as queries to search the corn Nucleotide database on NCBI.
All of the sequences hit chromosome 8 (Genomic sequence for Zea mays clone ZMMBBb0614J24, from chromosome 8, complete sequence; GenBank: AC157487.1), Zea mays cultivar inbred line B73 teosinte glume architecture 1 (tga1) gene, complete cds (GenBank: AY883559.2), and Zea mays cytochrome P450 monooxygenase CYP71C3v2 gene, complete cds (GenBank: AY072299.1).

Nucleotide database

I used the full-length (50000bp) sequence as the query to search the corn Nucleotide database. The hits with the highest scores are:
1) Zea mays B transcriptional activator (b1) gene, b1-B' allele, exons 1 through 3 and partial cds (GenBank: AY078063.2)
2) Genomic sequence for Zea mays clone ZMMBBb0614J24, from chromosome 8, complete sequence (GenBank: AC157487.1)
3) Zea mays cultivar inbred line B73 teosinte glume architecture 1 (tga1) gene, complete cds (GenBank: AY883559.2)

There are also many Mu transposons in the sequences.

[Figure: dotplot of the query sequence against the b1 gene]

I then searched the Nucleotide database of all organisms and found a hit named “Zea mays gypsy retrotransposon huck, and copia retrotransposon ji, complete sequence; and helitron Mo17_14594, complete sequence” (GenBank: DQ002408.1). This result is consistent with the RepeatMasker report, in which 83.18% of the sequence consists of Copia or Gypsy elements.

[Figure: dotplot of the query sequence against the transposon sequences]

2. Gene Prediction

The next step is to predict the genes in this sequence. The programs I chose are FGENESH and Augustus (http://bioinf.uni-greifswald.de/augustus/submission).

FGENESH

FGENESH predicted 7 genes. The predicted genes were translated into peptides, and these peptides were used as queries to run BLASTP against the Swiss-Prot database. 3 of them had significant hits.

Segment 1: 62665 - 64698
1 CDSf: 62665 - 63676 1011bp
2 CDSl: 63853 - 64698 846bp
Retrotrans_gag[pfam03732], Retrotransposon gag protein; Gag or Capsid-like proteins from LTR retrotransposons.
>ATGGCGACCGACAACTCGCCCGCCGGCGGCGGAATCGACGACGTCTTCCCCGCGCGGTGGAAGAACAAC ATTCGAGCTTGCCTCGTCCCCTCCCCCGCCGACGGAGGAGGAGGCGGGGCAACCCAAGGCCAAGCAGGA GGCGGCACCTCGTCGGCTGTCGAGCGAGTCGACGGTGCCAGCGCCCCAATGGGGGGCACGTCGGGCATC GACCTCGCGTCTGAGACGAAGACGAGCGCCGTCTCCCCGCAACACGTCAACCCCAAGCAAACGGACGACG CCAACACGCTCGCAAGGGACTTGCTGGGCGTCACCCTCGTACCTGAGACGGCGGTGCAGTCTACCCCTGAC GTGACTTCGTCACCGCCCGTCGACCAAGAGGTACCGACCGATTCCCATCTCGCGCCTTTTGGATTCAGCCTC AACCCCCCAAGCGACTTCGCTTTGGTGGACGCTCTCATAGAGGCGAGTCCAAACCCTCTGGGGTATCGTATG CGGTCACCATGGGACCGGCTGACGGCCGTCTCAACCTACGGGCCCTTAGGGTCCGAGGAAGATGACGAGC CCGACTTTAGTTGGGATTTCTCTGGACTTGGTAACCCCAGTGCCATGCGGGACTTTATGACCGCGTGCGACT ACTGCCTTTCCGACTGTTCCGACGGTAGCCGCAGCCTCGGCGACAAGGACTGCGGCCCAAGTCGTGAATGT TTTCACGTCGATCTAGGGGGTCCCGACGAAGGCAACCATCTTGGTATGCCAGAGAATGGTGACCTTCCTAG GCCTGTGCCTCACGTTGACATCCTTCGGGAGCTAGCTGTGGTCCCCGTTCCGGCAGGGGGTCATGACCCAC AACTCGAGCAAATCCGCGAGATGCAGGCCAGGCTCGACGAGGGAGCAGGAACACTTGAGCCGTTCCGCC GGGACAATAGGCAGGAATGGGCGGGCCAACCTCTGGCCGGAGAAGTGCGTCATCTACCCCAGGGCATCCA GCACCGCGTCGCCGACGATGTCAGGgtaaggccgccaccggtttccagtggggtcggccagaacctggctgcagcggcaatact tctccgcgcgatgccggagccatcaaccaccgaggggcggcgtatccagggagagctcaagaacctcctggaggacgccgcggtctgacg ggccgaaagctccgcctcccgaaggcagGGGTACCCCTCGGAACATCGCGCCGCGACTTCCCGATTCATGCGGGAAG CCTCGGTCCACACCGGCCGCATGCGTAACATAGCGCATGCGGCCCCGGGTCGCCTCGGCAACGAGCACCAT CACCATAACTGTTGGGCCCACCTCGACGAGAGGGTGCGCCGAGGCTACCACCCCAGGCGTGGGGGACGCT ACGACAGCGGGGAGGATCGGAGTCCCTCGCCCAAACCACCTGGTCCGCAGGCTTTCAACCGCGCCATACG ACGGGCGCCGTTCCCGACCCGGTTCCGAACCCCGACTACTATCACAAAGTACTCGGGGGAGACGAGACCG GAACTGTGGCTCGCAGACTACCGGCTGGCCTGCCAGCTGGGTGGAACGGACGATGACAACCTCATCATCTG CAACCTCCCCCTGTTCCTTTCCGACACCGCTCGCGCCTGGCTGGAGCACCTGCCTCCGGGGCAGATCTCCAA CTGGGACGACCTGGTCCAAGCCTTCGCCGGTAATTTCCAGGGCACGTACGTGCGCCCTGGAAACTCCTGGG ATCTCCGAAGCTGCCGCCAGCAGCCGGGGGGGTCTCTCCGGGACTACATCCGGCGATTCTCGAAGCAGCG CACCGAGCTGCCCAACATCGCCGATTCGGATGTCATCGGCGCGTTCCTCGCCGGCACCACCTGCCGTGACCT GGTGAGCAAGCTGGGTCGCAAGACCCCCACCAGGGCGAGCGAGCTGATGGACATCGCCACCAAGTTCGCC 
TCTGGCCAGGAGGCGGTTGAGGCCATCTTCCGGAAGGACAAGCAGCCCCAGGGCCGCCCACCGGAAGAT GTCCCCGAGGCGTCAACTTAG Protein sequences: MATDNSPAGGGIDDVFPARWKNNIRACLVPSPADGGGGGATQGQAGGGTSSAVERVDGASAPMGGTSGID LASETKTSAVSPQHVNPKQTDDANTLARDLLGVTLVPETAVQSTPDVTSSPPVDQEVPTDSHLAPFGFSLNPPS DFALVDALIEASPNPLGYRMRSPWDRLTAVSTYGPLGSEEDDEPDFSWDFSGLGNPSAMRDFMTACDYCLSDC SDGSRSLGDKDCGPSRECFHVDLGGPDEGNHLGMPENGDLPRPVPHVDILRELAVVPVPAGGHDPQLEQIRE MQARLDEGAGTLEPFRRDNRQEWAGQPLAGEVRHLPQGIQHRVADDVRGYPSEHRAATSRFMREASVHTG RMRNIAHAAPGRLGNEHHHHNCWAHLDERVRRGYHPRRGGRYDSGEDRSPSPKPPGPQAFNRAIRRAPFPT RFRTPTTITKYSGETRPELWLADYRLACQLGGTDDDNLIICNLPLFLSDTARAWLEHLPPGQISNWDDLVQAFAG NFQGTYVRPGNSWDLRSCRQQPGGSLRDYIRRFSKQRTELPNIADSDVIGAFLAGTTCRDLVSKLGRKTPTRAS ELMDIATKFASGQEAVEAIFRKDKQPQGRPPEDVPEAST Segment 2: 66287 - 69085 3 exons: 1 CDSf 66287 - 67405 1119bp 2 CDSi 67439 - 67615 177bp 3 CDSl 68270 - 69085 816bp RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI; RT_LTR[cd01647], RT_LTR: Reverse transcriptases (RTs) from retrotransposons and retroviruses which have long terminal repeats (LTRs) in their DNA copies but not in their RNA template. rve[pfam00665], Integrase core domain RVT_3[pfam13456], Reverse transcriptase-like; This domain is found in plants and appears to be part of a retrotransposon. RNase_HI_RT_Ty3[cd09274], Ty3/Gypsy family of RNase HI in long-term repeat retroelements; RNase_H[cd06222], RNase H is an endonuclease that cleaves the RNA strand of an RNA/DNA hybrid in a sequence non-specific manner RNase_H[pfam00075], RNase H; RNase H digests the RNA strand of an RNA/DNA hybrid. Important enzyme in retroviral replication cycle. 
RVT_1[pfam00078], Reverse transcriptase (RNA-dependent DNA polymerase) PRK07238[PRK07238], bifunctional RNase H/acid phosphatase PRK07708[PRK07708], hypothetical protein; Validated >ATGCCATTCAGTTTGAGGAATGCGGGTGCAACGTACCAACGGTGCATGAACCACATGTTCGGCGAACACA TTGGCCGAACGGTCGAGGCCTACGTCGATGACATCGTAGTCAAGACGAGGAAAGCCTCCGACCTCCTTTCC GACCTTGAAGCGACATTCCGATGTCTCAAGGCGAAAGGCGTGAAGCTCAATCCCGAGAAATGTGTCTTCGG GGTTCCACGAGGCATGCTCTTGGGGTTCATCGTCTCCGAGCGGGGCATCGAGGCCAACCCGGAGAAGATC GCGGCCAACACCAGCATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGAGTCACAGGATGCCTTGCGG CTCTGAGCCGTTTCATCTCGCGCCTCGGCGAAAGAGGCCTACCTCTGTACCGCCTCTTAAGGAAGGCCGAGT GCTTCACTTGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCAT CTTGGTGCCCCCCGCTGCCGGAGAAGCCCTCTTGATCTACGTCACCACGACCACTCAGGTGGTTAGCGCCG CGATTGTGGTTGAGAGACGAGAAGAGGGGCATGCATTGCCCGTACAGAGGCCAGTCTACTTCATCAGTGAG GTACTGTCCGAGACCAAGATCCGCTACCCACAAATTCAGAAGCTGCTGTACGCAGTGATCCTGACACGACGG AAGTTGCGACACTACTTCAAGTCTCATCCGGTGACTGTGGTGTCATCCTTCCCCCTGGGGGAGATCATCCAG TGCCGAGAGGCCTCGGCTAGAATTGCAAAGTGGGCGGTGGAAATCATGGGCGAGACGATCTCGTTCGCCC CTCGGAAGGCCATCAAGTCCCAGGTCTTGGCGGACTTTGTGGCTGAATGGGTCGACACCCAGCTCCCAACA GCTCCGATCCAACCGGAACTCTGGACCATGTTTTTCGACGGGTCACTGATGAAGACAGGAGCAGGCGCAG GCCTGCTCTTGATCTCGCCCCTCAAGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAA TGTGGCTAAGTACGAGGCTCTAGTCAACGGGTTGCGCATCGCCATCGAGCTGGGGgtctgacgcctcgacgctcgt ggtgactcgcagCTCGTCATCGACCAAGTCATGAAGAACTCCCACTGCCACGACCCGAAGATGGAGGCCTACTG CGATGAGGTTCGGCGCCTGGAAGACAAGTTCTACGGGCTCGAGCTCAACCACATCGCCCGACGCCACAAC GAGACTGCGGACGAGCTGGCTAAAATAGCCTCGGGGCGAACAACGgttcccccagacgtcttctcccgagacctgcat caaccctccgtcaagaccgacgacacgcccgagcccgagacaccctcggcttagtccgaggcaccctcggctcagtccgaggcgccatcgg ctcggcccgaggcaccctcggctcaacccgaggcaccctcggcccccgagggtgaggcactgcgcatcgaggaggagcggagaggggtc atgcctaatcgaaactggcagaccccgtacctgcaatatctccgccgaggagagctacccctcgaccaagccgaagcttggcggttggcgc ggcgcgccaagtcgttcgtcttgctgggagacgagaaggagctctaccaccgcagcccctcgggcatcctccagcgatgcatttccatcgcc 
gaaggccaggagctcctacaagagatacactcgggggcttgtggccatcacgcagcacctcgagcccttgttggaaacgccttccgacaag gtttctactggccgacggcggtggccgacaccactagaattgtccgcacctgcgaagggtgtcagttctacacaaggcagacccacctaccc gcttaggccctgcagaccatacccatcacctggtcatttgttgtgtggggtctggacctagttggccccttgcagAAGGCACCCGGGGG CTACACGCATCTGTTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGGTCCGACCCCTAAACAGCATCAG GTCCGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGGTCCCGAACTCCATCATCACCGA CAACGGCACCCAGTTCACCGGCAGAAAGTTCCTGGACTTCTGCGAGGATCACCACATCTGGGTGGACTGG GCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGGCATGATTCTACAAGGACTCAA GCCTCGAATCTACAACGACCTCAACAAGTTCGGCAAGCGGTGGATGAAGGAACTCCCCTCGGTGGTCTGGA GTCTGAGGACGACGCTGAGCCGGGCCACGGGCTTCACACCGTTCTTTCTAGTCTATGGGGCCGAGACCGTC TTGCCCATAGACTTAGAATACGGTTCCCCGAGGACGAGGGCCTACGACGACCAAAGCAATCGAGCTAATCG AGAAGACTCACCGGACCAGCTGGAAGAGGCTCGGGACATGGCCTTACTACACTCGGCGCGGTACCAGCAG TCCTTGCGACGCTACCACGCCCGAGGGGTTCGGTCCCGAGACCTCCAGGTGGGCGACCTGGTGCTTCGGCT GCGACAAGACGCCCGAGGGCGGCACAAGCTCATGCCTCCCTGGGAAGGGTCGTTCGTCATCGCCAAAGTT CTGAAGCCTGGGACGTACAAGCTGGCCAACAGTCAAGGCGAGGTCTACAGCAACGCTTGGAACATCCGAC AGCTACGTCGCTTCTACCCTTAA Protein sequence: MPFSLRNAGATYQRCMNHMFGEHIGRTVEAYVDDIVVKTRKASDLLSDLEATFRCLKAKGVKLNPEKCVFGVP RGMLLGFIVSERGIEANPEKIAANTSMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEA EEALGNLKALLTNAPILVPPAAGEALLIYVTTTTQVVSAAIVVERREEGHALPVQRPVYFISEVLSETKIRYPQIQKLL YAVILTRRKLRHYFKSHPVTVVSSFPLGEIIQCREASARIAKWAVEIMGETISFAPRKAIKSQVLADFVAEWVDTQL PTAPIQPELWTMFFDGSLMKTGAGAGLLLISPLKKHLRYVLRLHFPASNNVAKYEALVNGLRIAIELGLVIDQVM KNSHCHDPKMEAYCDEVRRLEDKFYGLELNHIARRHNETADELAKIASGRTTKAPGGYTHLLVAIDKFSKWIEVR PLNSIRSEQAVAFFTNIIHRFGVPNSIITDNGTQFTGRKFLDFCEDHHIWVDWAAVAHPMTNGQVERANGMIL QGLKPRIYNDLNKFGKRWMKELPSVVWSLRTTLSRATGFTPFFLVYGAETVLPIDLEYGSPRTRAYDDQSNRAN REDSPDQLEEARDMALLHSARYQQSLRRYHARGVRSRDLQVGDLVLRLRQDARGRHKLMPPWEGSFVIAKVL KPGTYKLANSQGEVYSNAWNIRQLRRFYP Segment 3: 82383 - 88664 7 exons 1 CDSf 2 CDSi 3 CDSi 4 CDSi 5 CDSi 6 CDSi 7 CDSl 82383 - 83722 84124 - 84298 84369 - 85018 85130 - 85433 85920 - 86500 86862 - 87035 87327 - 88664 
1338bp 174bp 684bp 303bp 579bp 174bp 1338bp RT_LTR[cd01647], RT_LTR: Reverse transcriptases (RTs) from retrotransposons and retroviruses. RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI; rve[pfam00665], Integrase core domain; DUF4370[pfam14290], Domain of unknown function (DUF4370); RT_DIRS1[cd03714], RT_DIRS1: Reverse transcriptases (RTs) occurring in the DIRS1 group of retransposons. RVT_1[pfam00078], Reverse transcriptase (RNA-dependent DNA polymerase); A reverse transcriptase gene is usually indicative of a mobile element such as a retrotransposon or retrovirus. PRK12829[PRK12829], short chain dehydrogenase; Provisional PHA03307[PHA03307], transcriptional regulator ICP4; Provisional >ATGGCGGCCGACAACCCGCCCGCCGGCGGCGGAATCGATGACGTCTTCCCCACGTGGCGGAAGAACGAC ATTCGGGCTTGTCCCGTCCCCTCCCCCGTCGACGGAGGAGGAGGCGGGGCAACCAAGGCCAAGCAGGAG GCGGCACCTCGTCGGCTATCGAGCGAGTCGACGGCGCCGGTGCCCCCAACGAGGGGCGCGATGGGCATCG ACATCGCGTCTGAGACGAAGACGAGCGCCGTCTCCCCGCAACACGCCAACTCCAAGCAAACGGACGACGC CAGCACGCTCGCAAAAGACTTGTTGGGCGTCACCCTCGTACCTGAGACGACGGTGCAGTCTACCCCTGACG TGACTTCGTCACCGCCCGTCGACCAAGACGTACCGACCGATTCCCATCTCGCGCCTTTTGGATTCAGCCTCG ACCCACCAAGCGACTTCGCTTTGGTGGACGCTTTCATAGAGGCGAGTCCAAACCCTCCGGGGTATCGTGTG CGGTCACCCTGGGACCGGCTGACAGCCGTCTCGACCTACGGGCCCTCGGGTTCCGAGGAAGATGACGAGC CCGACTTTTGTTGGGATTTCTCTGGACTTGGTAACCCCAGTGCCATGCGGGACTTCATGACCACATGCGACT ACTGCCTTTCCGACTGTTCCGACGGTAGCCGCAGCCTCGGCGACGAGGACTATGGCCCAAGTCGTGAATGT TTCCACGTCGACCTAGGGGGTCCCGGCGAAGGAAACCATCCTGGTATACCGGAAAATGGTGATCCCCCTAG GCCTGCGCCTCGCGTTGACATCCTACGGGAGCTAGCTGTGGTCCCAGTCCCTGCGGGGGTCAGGACTCACA GCTCGAGCAAATCTGCGAGATGCAGGCCAGGCTCGACGAGGGAGCAGGAACACTTGAGCCGTTCCGCCG GGACATCGGGCAGGAATGGGCAGGCCAACCTCCGGCCGGAGAAGCGCGCCATCTACCCCAGGGCATCCAA CACCGCATCGCCGACGATGTCAGGGCAAGGCCGCCACCGGCCTCCAGTGGGGTCGGCCAGAACCTGGCTG CAGCGGCAATACTTCTCCGCGCGATGCCGGAGCCATCTACCACCGAGGGGCGGCGTATCCAGGGAGAGCTC AAGAATCTCCTGGAGGATGTCGCGGTCCGACGGGCCGAAAGCTCCGCCTCCCGAAGGCAGGGGTACCCCT 
CGGAACATCGCGCCGCGACTTCCCAATTCATGCGGAAAGCCTCGGTCCACACCGGGCGCACGCGCAACACA GCGCCTGCGGCCCTGGGTCGCCTCGGCAACGAACACCCTCACCGCAACCGTCGAACCCACCTCGACGAGA gggtgcgccgaggctaccaccccaggcgtgggggacgctacgacagcggggaggattggagtccctcgcccgaaccacccggtccgcag gctttcagccgggccatacgacgggcgccgttcccgacccggttccgaaccccgactactatcacaaagtactcgggggagacgagaccgg aactgtggctcgcggactaccggctagcctgccacctgggtggaacagacgatgacaatctcatcatccggaacctccccctgttcctctccg acaccgctcgagcctggctggagcacctgcctccggggcagatctccaactaggacgacctggtccaagccttcgccggcaacttccagggt acgtatgtgtgccctgggaactcctgggatctccaaaGCTGCCGCCAGCAGCCGGGGGAGTCTCTCTGGGACTACATCC GGCAATTCTCGAAGCAGCGCACCGAGTTGCCCAATGTCACCGACTCGGATGTCATCGGCGCGTTCCTCGCC GACACCACTTGCCGCGACCTGGTTAGCAAGCTGGGTCGCAAGACCCCCACCAGGGCGAGTGaggtgatggac atcgccaccaagttcgcctctggctaGGATGCGGTTGAGGCCATCTTCCGGAAGGACAAGCAGCCCCAGGGCCGCC CACCGGAAGATGTCCCCGAGGCGTCAACTCAGCGCGGCATCAAGAAGAAAGGCAAGAAGAAGTCGCAAG CAAAACGCGACGCCGCCGATGCGAACTTTGTCGCCGCCGCCGAGTACAAGAACCCTCGGAAACCTCCTGG AGGTGCCAATCTCTTCGACAAGATGCTCAAGGAGCCGTGCCCCTGTCATCAGGGGCCCGTCAAGCACACCC TTGAGGAGTGCGCCATGCTTCGGCGCCACTTTCACAAAGCCGGGCCACCTGCGGAGGGTGGCCGGGCCCG CGACGACGATAAGAAGGAGGATCACAAGGCAGGAGAGTTCCCCGAGGTCCACGACTGCTTCATGATCTAC GGTGGGCAAGTGGCGAACGCCTCGGCTCGGCACCACAAGCAAGAGCGTCGGGAGGTCTGCTCGGTAAAG GTGGCGGCGCCAGTCTACCTAGACTGGTCCGACAAGCCCATCACCTTCGACCAGGGCGACCACCCCGACCG CGTGCCGAGCCTGGGGAAGTACCCGCTCGTTGTCGACCCCGTCATCGGCAACGTCAGGCTCACCAAGGTCC TCATGGACGGAGGCAGCAGCCTCAACGTCATCTACGCCAAGACCCTCGGGCTCCTGCGGATCGATCTGTCCT Cggtacgggcaggagctgcgccttttcacgggatcatccctgggaagcgcgtccagcccctcggacaactcgatctacccgtctgctttggg acaccctccaacttctgaaagGAGACCCTCACGTTCGAGGTGGTCGGGTTTCGAGGAACCTACCACGCAGTGCTG AGGAGGCCATGCTACGCCAAGTTCATGGTCGTCCCCAACTACACCTACCACAAGCTAAAGATGCCAGGCCCC AACGGGGTCATCACCGTCGGCCCCACGTACCGACACGCGTACGAATGCGACGTGGAGTGCATGGAGTACGC CGAGGCCCTCGCCAAATCCGAGGCCCTCATCGCCGACCTGGAGAGCCTCTCCAAGGAGGCGCCAGACGTG AAGCGCCACACCAGCAACTTCGAGCCAACGGAGATGggtaagttcgtccctctcaacaccagcaacgatacctccaagctg 
atccggatcgggctccgagctcgaccccaaataggaagcagtctcgtcgactttctccgtgcaaacaccgatgtttttgcatggaatccctcgg acatgcccggcataccgagggatgtcgccgagcactcgctggatatccgagctagagcccgacccgtgaagcagcctctgcgccggttcga cgaagaaaagcgcagagccataggcgaggagatccacaagctaatggcggtagggttcatcaaagaggtattccatcccgagtggcttgc caaccctgtgcttgtgagaaagaaaggagggaaatggcgtatgtgtgtagactacactggtctaaacaaagcatgtccaaaagttccctacc ctctgcctcgcatcgatcaaatcgtggattccactgctgggtgcgaaaccctgtctttcctcgatgcctactcagGGTATCGCCAAATCA GGATGAAAGAGTCCGACCAGCTCGCGACTTCTTTCATCACACCTTTCGGCATGTACTGCTATGTTACCATGTC GTTTGGTTTGAGGAATGCGGGTGCGACATACCAAAGGTGCATGAACCACGTGTTCGGCGAACACATTGGTC GAACGGTCGAGGCTTACATCGATGACATCGTAGTCAAGACGAGGAAAGCCTCTGACCTCCTTTCCGACCTTG AAACGACATTCTGGTGTCTCAAGGCGAAAGGTGTAAAGCTCAATCCCGAGAAGTGCGTCTTCGGGGTCCCC CAAGGCTTGCTCTTGGGGTTTATCGTCTCCGAGCGGGGCATCGAGGCCAACCCAGAGAAAATCGTGGCCAT CACCAACATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGGGTCACGGGGTGCCTTGCGGCTCTGAGC CGTTTCATCTCACGCCTCGGCGAAAGAGGCCTGCCTCTGTACCGCCTCTTAAGGAAGGCCGAGTGCTTCACT TGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCATCTtggtgccc ccgcggccggagaagccctcttgatctacgtcgccgctaccactcaggtggtcagcgccgcgatcgtggttgagagacgagaagagggaca tgcattgcctgtccagaggccagtctacttcgtcagtgaggtactgtccgagaccaagatccgctacccacaaattccgagtctcatccggtga ctgtggtgtcatctttccccctgggggagatcatccagtgccgagaggcctcgggtaggattgcaaagtgggcggtggaaatcatgggcgag acaatctcgttcgccactcgtaaggccataaagtcccaagtcttggcggactttgtggctgaatgggtcgatacccaGCTCCCGACAGC TCCGATCCAACCGGAACTCTGGACCATGTTTTTTGACGGGTCGCTGATGAAGACAGGGGCAGGCGCGGGC CTGCTCTTCATCTCGCCCCTCGGGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAATG TGGCCGAGTACGAGGCTCTggtcaacgggttgcgcgtcgccatcgagctagggatccgacgtctcgacgctcgcggtgactcgtagc tcgtcattgactaagtcatgaagaactcccacttctgcgactcgaagatggaagcctactgcgatgaggttcggcgcctggaggacaagttct atgggctcgagttcaaccacatcgcccgacgctacaacgagactgcggacaagctggctaagatagcctcggggcaaacaacggttccccc ggacgtcttctcctgagacctgcatcaaccctccgtcaagACCGACGACACGCCCGAGCCCGAGAAGGCCTCGGCCCAGC CCGAGGCACCCTCGGCCCCCGAGGATGAGGCACTGCGTGTCGAGGAGGAGCGGAGCGGGGTCACGCCTA 
ATCGAAACTGGCAGACCCCGAACCTGCAATATCTCCACCGAGGAGAGCTACCCCTCGACCGAGCCGAAGCT CGGCGGTTGGCGCGGCGTGCCAAGTCGTTCGTCTTGCTGGGGGACGGGAAGGAGCTCTACCATCGCAGCC CCTCAGGCATCCTCCAGCAATGCATATCCATCACCGAAGGCCAGGAGCTCTTACAAGAAATACACTCGGGGG CTTGCGGGCATCACGCGGCGCCCCGAGCCCTTGTTGGGAACGCCTTCCGACAAGGTTTCTACTGGCCAACC GCGGTGGCCGACGCCACTAGAATTGTTCGCACCTGCCAGGGGTGTCAATTCTACGCAAGGCAGACTCACCT TCCCGCCCAGGCTCTACAGACCATACCCATCACCTGGTCGTTTGCTGTGTGGGGTCTGGACCTCGTCGGCAC CTTGCAGAAGGCACCCGGGGGCTACACGCACCTGCTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGG TCCGACCCCTAAACAGCATCAGGTCTGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGG TCCCGAACTCCATCATCACCGACAACGACACCCAGTTCACCGACAGAAAGTTCCTGGACTTCTGCGAGGATC ACCACATCCGGGTGGACTGGGCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGG CATGATCCTGCAAGGACTCAAGCCGTGGATCTACAACAACCTTAACAAGTTCGGCAAGCGATGGATGAAGG AGCTCCCCTCGGTGGTCTGGAGTCTGAGGACAACGCCGAGCCGAGCCACGGGCTTCACACCGTTCTTTCTA GTCTATGGGGCCGAGGCCATCTTGCCCATAGACTTAGAATACGGTTCCCCAAGGACGAGGGCCTACAACGA CCAAAGCAATCGAGCTAACCGAGAAGACTCACTGGACCAGCTGGAAGAGGCTCGGAACATGGCCTTCCTA CACTCGGCGCGGTATCAGCAGTCCCTGCGACGCTACCACGCCCGAAGGGTTCGGTCCCGAGACCTCCAGGT GGGCGACTTGGTGCTTCGGCTGCGACAAGACGCCCGAGGGCGGCACAAGCTCACGCCTCCCTGGGAAGG GTCGTTCGTCATCGCCAAGGTTCTGAAGCCCGGGACGTATAAGCTGGCCAACAGTCAAGGCGAGGTCTACA ACAACGCTTGGAACATCCGATAG Protein sequence: MAADNPPAGGGIDDVFPTWRKNDIRACPVPSPVDGGGGGATKAKQEAAPRRLSSESTAPVPPTRGAMGIDIA SETKTSAVSPQHANSKQTDDASTLAKDLLGVTLVPETTVQSTPDVTSSPPVDQDVPTDSHLAPFGFSLDPPSDFA LVDAFIEASPNPPGYRVRSPWDRLTAVSTYGPSGSEEDDEPDFCWDFSGLGNPSAMRDFMTTCDYCLSDCSDG SRSLGDEDYGPSRECFHVDLGGPGEGNHPGIPENGDPPRPAPRVDILRELAVVPVPAGVRTHSSSKSARCRPGS TREQEHLSRSAGTSGRNGQANLRPEKRAIYPRASNTASPTMSGQGRHRPPVGSARTWLQRQYFSARCRSHLP PRGGVSRESSRISWRMSRSDGPKAPPPEGRGTPRNIAPRLPNSCGKPRSTPGARATQRLRPWVASATNTLTATV EPTSTRGCRQQPGESLWDYIRQFSKQRTELPNVTDSDVIGAFLADTTCRDLVSKLGRKTPTRASEDAVEAIFRKD KQPQGRPPEDVPEASTQRGIKKKGKKKSQAKRDAADANFVAAAEYKNPRKPPGGANLFDKMLKEPCPCHQG PVKHTLEECAMLRRHFHKAGPPAEGGRARDDDKKEDHKAGEFPEVHDCFMIYGGQVANASARHHKQERREV 
CSVKVAAPVYLDWSDKPITFDQGDHPDRVPSLGKYPLVVDPVIGNVRLTKVLMDGGSSLNVIYAKTLGLLRIDLS SETLTFEVVGFRGTYHAVLRRPCYAKFMVVPNYTYHKLKMPGPNGVITVGPTYRHAYECDVECMEYAEALAKSE ALIADLESLSKEAPDVKRHTSNFEPTEMGYRQIRMKESDQLATSFITPFGMYCYVTMSFGLRNAGATYQRCMN HVFGEHIGRTVEAYIDDIVVKTRKASDLLSDLETTFWCLKAKGVKLNPEKCVFGVPQGLLLGFIVSERGIEANPEKI VAITNMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEAEEALGNLKALLTNAPILLPTAPI QPELWTMFFDGSLMKTGAGAGLLFISPLGKHLRYVLRLHFPASNNVAEYEALTDDTPEPEKASAQPEAPSAPED EALRVEEERSGVTPNRNWQTPNLQYLHRGELPLDRAEARRLARRAKSFVLLGDGKELYHRSPSGILQQCISITEG QELLQEIHSGACGHHAAPRALVGNAFRQGFYWPTAVADATRIVRTCQGCQFYARQTHLPAQALQTIPITWSFA VWGLDLVGTLQKAPGGYTHLLVAIDKFSKWIEVRPLNSIRSEQAVAFFTNIIHRFGVPNSIITDNDTQFTDRKFLD FCEDHHIRVDWAAVAHPMTNGQVERANGMILQGLKPWIYNNLNKFGKRWMKELPSVVWSLRTTPSRATGF TPFFLVYGAEAILPIDLEYGSPRTRAYNDQSNRANREDSLDQLEEARNMAFLHSARYQQSLRRYHARRVRSRDL QVGDLVLRLRQDARGRHKLTPPWEGSFVIAKVLKPGTYKLANSQGEVYNNAWNIR

Augustus gene prediction

Augustus predicted 13 genes. The predicted genes were translated into peptides, and these peptides were used as queries to run BLASTP against the Swiss-Prot database. Only 2 of them had significant hits: one belongs to the reverse transcriptase (RT) superfamily, the other to the RNase H superfamily.

Segment 1: 65858 - 67411
CDS 65858 - 67411 1553bp
RT_LTR[cd01647]: Reverse transcriptases (RTs) from retrotransposons and retroviruses which have long terminal repeats (LTRs) in their DNA copies but not in their RNA template.
RT_Rtv[cd01645]: Reverse transcriptases (RTs) from retroviruses (Rtvs).
RT_ZFREV_like[cd03715]: A subfamily of reverse transcriptases (RTs) found in sequences similar to the intact endogenous retrovirus ZFERV from zebrafish and to Moloney murine leukemia virus RT.
>ATGCCCGGCATACCGAGGGATGTCGCCGAGCACTCGCTGGATATCCGAGCTGGAGCCCGACCCGTGAAGC AGCCTTTGCGCCGATTCGACGAAGAAAAGCGCAGAGCCATAGGCGAGGAGATCCACAAGCTAATGGCGGC AGGGTTCATCAAAGAGGTATTCCACCCCGAATGGCTTGCCAACCCTGTGCTTGTGAGAAAGAAAGGAGGG AAATGGCGGATGTGTGTAGACTACACTGGTCTAAACAAAGCATGTCCGAAAGTTCCCTACCCTCTACCTCGCA TCGATCAAATCGTGGATTCCACTGCTGGGTGCGAAACCCTATCTTTCCTTGATGCCTACTCGGGGTATCACCA GATCAGGATGAAAGAGTCCGACCAGCTCGCGACTTCTTTCATCACACCCTTCGGCATGTACTGTTATGTTACC ATGCCATTCAGTTTGAGGAATGCGGGTGCAACGTACCAACGGTGCATGAACCACATGTTCGGCGAACACATT GGCCGAACGGTCGAGGCCTACGTCGATGACATCGTAGTCAAGACGAGGAAAGCCTCCGACCTCCTTTCCGA CCTTGAAGCGACATTCCGATGTCTCAAGGCGAAAGGCGTGAAGCTCAATCCCGAGAAATGTGTCTTCGGGG TTCCACGAGGCATGCTCTTGGGGTTCATCGTCTCCGAGCGGGGCATCGAGGCCAACCCGGAGAAGATCGC GGCCAACACCAGCATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGAGTCACAGGATGCCTTGCGGCT CTGAGCCGTTTCATCTCGCGCCTCGGCGAAAGAGGCCTACCTCTGTACCGCCTCTTAAGGAAGGCCGAGTG CTTCACTTGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCATCT TGGTGCCCCCCGCTGCCGGAGAAGCCCTCTTGATCTACGTCACCACGACCACTCAGGTGGTTAGCGCCGCG ATTGTGGTTGAGAGACGAGAAGAGGGGCATGCATTGCCCGTACAGAGGCCAGTCTACTTCATCAGTGAGGT ACTGTCCGAGACCAAGATCCGCTACCCACAAATTCAGAAGCTGCTGTACGCAGTGATCCTGACACGACGGA AGTTGCGACACTACTTCAAGTCTCATCCGGTGACTGTGGTGTCATCCTTCCCCCTGGGGGAGATCATCCAGT GCCGAGAGGCCTCGGCTAGAATTGCAAAGTGGGCGGTGGAAATCATGGGCGAGACGATCTCGTTCGCCCC TCGGAAGGCCATCAAGTCCCAGGTCTTGGCGGACTTTGTGGCTGAATGGGTCGACACCCAGCTCCCAACAG CTCCGATCCAACCGGAACTCTGGACCATGTTTTTCGACGGGTCACTGATGAAGACAGGAGCAGGCGCAGG CCTGCTCTTGATCTCGCCCCTCAAGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAAT GTGGCTAAGTACGAGGCTCTAGTCAACGGGTTGCGCATCGCCATCGAGCTGGGGGTCTGA Protein sequence: MPGIPRDVAEHSLDIRAGARPVKQPLRRFDEEKRRAIGEEIHKLMAAGFIKEVFHPEWLANPVLVRKKGGKWR MCVDYTGLNKACPKVPYPLPRIDQIVDSTAGCETLSFLDAYSGYHQIRMKESDQLATSFITPFGMYCYVTMPFSL RNAGATYQRCMNHMFGEHIGRTVEAYVDDIVVKTRKASDLLSDLEATFRCLKAKGVKLNPEKCVFGVPRGMLL GFIVSERGIEANPEKIAANTSMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEAEEALG NLKALLTNAPILVPPAAGEALLIYVTTTTQVVSAAIVVERREEGHALPVQRPVYFISEVLSETKIRYPQIQKLLYAVILT 
RRKLRHYFKSHPVTVVSSFPLGEIIQCREASARIAKWAVEIMGETISFAPRKAIKSQVLADFVAEWVDTQLPTAPI QPELWTMFFDGSLMKTGAGAGLLLISPLKKHLRYVLRLHFPASNNVAKYEALVNGLRIAIELG

Segment 2: 86898 - 88664
2 exons:
1 CDS 86898 - 87090 192bp
2 CDS 87304 - 88664 1360bp
RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI
RVT_3[pfam13456], Reverse transcriptase-like; This domain is found in plants and appears to be part of a retrotransposon.
RNase_H[cd06222], RNase H is an endonuclease that cleaves the RNA strand of an RNA/DNA hybrid in a sequence non-specific manner
RnhA[COG0328], Ribonuclease HI [DNA replication, recombination, and repair]
PRK07238[PRK07238], bifunctional RNase H/acid phosphatase; Provisional
PRK07708[PRK07708], hypothetical protein; Validated
>ATGTTTTTTGACGGGTCGCTGATGAAGACAGGGGCAGGCGCGGGCCTGCTCTTCATCTCGCCCCTCGGGA AGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAATGTGGCCGAGTACGAGGCTCTGGTCA ACGGGTTGCGCGTCGCCATCGAGCTAGGGATCCGACGTCTCGACGCTCGCggtgactcgtagctcgtcattgactaag tcatgaagaactcccacttctgcgactcgaagatggaagcctactgcgatgaggttcggcgcctggaggacaagttctatgggctcgagttca accacatcgcccgacgctacaacgagactgcggacaagctggctaagatagcctcggggcaaacaacggttcccccggacgtcttctcctg agaCCTGCATCAACCCTCCGTCAAGACCGACGACACGCCCGAGCCCGAGAAGGCCTCGGCCCAGCCCGAGG CACCCTCGGCCCCCGAGGATGAGGCACTGCGTGTCGAGGAGGAGCGGAGCGGGGTCACGCCTAATCGAA ACTGGCAGACCCCGAACCTGCAATATCTCCACCGAGGAGAGCTACCCCTCGACCGAGCCGAAGCTCGGCGG TTGGCGCGGCGTGCCAAGTCGTTCGTCTTGCTGGGGGACGGGAAGGAGCTCTACCATCGCAGCCCCTCAG GCATCCTCCAGCAATGCATATCCATCACCGAAGGCCAGGAGCTCTTACAAGAAATACACTCGGGGGCTTGCG GGCATCACGCGGCGCCCCGAGCCCTTGTTGGGAACGCCTTCCGACAAGGTTTCTACTGGCCAACCGCGGTG GCCGACGCCACTAGAATTGTTCGCACCTGCCAGGGGTGTCAATTCTACGCAAGGCAGACTCACCTTCCCGCC CAGGCTCTACAGACCATACCCATCACCTGGTCGTTTGCTGTGTGGGGTCTGGACCTCGTCGGCACCTTGCAG AAGGCACCCGGGGGCTACACGCACCTGCTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGGTCCGACC CCTAAACAGCATCAGGTCTGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGGTCCCGAA CTCCATCATCACCGACAACGACACCCAGTTCACCGACAGAAAGTTCCTGGACTTCTGCGAGGATCACCACAT
CCGGGTGGACTGGGCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGGCATGATC CTGCAAGGACTCAAGCCGTGGATCTACAACAACCTTAACAAGTTCGGCAAGCGATGGATGAAGGAGCTCCC CTCGGTGGTCTGGAGTCTGAGGACAACGCCGAGCCGAGCCACGGGCTTCACACCGTTCTTTCTAGTCTATG GGGCCGAGGCCATCTTGCCCATAGACTTAGAATACGGTTCCCCAAGGACGAGGGCCTACAACGACCAAAGC AATCGAGCTAACCGAGAAGACTCACTGGACCAGCTGGAAGAGGCTCGGAACATGGCCTTCCTACACTCGGC GCGGTATCAGCAGTCCCTGCGACGCTACCACGCCCGAAGGGTTCGGTCCCGAGACCTCCAGGTGGGCGAC TTGGTGCTTCGGCTGCGACAAGACGCCCGAGGGCGGCACAAGCTCACGCCTCCCTGGGAAGGGTCGTTCG TCATCGCCAAGGTTCTGAAGCCCGGGACGTATAAGCTGGCCAACAGTCAAGGCGAGGTCTACAACAACGCT TGGAACATCCGATAG protein sequence: MFFDGSLMKTGAGAGLLFISPLGKHLRYVLRLHFPASNNVAEYEALVNGLRVAIELGIRRLDARDLHQPSVKTDD TPEPEKASAQPEAPSAPEDEALRVEEERSGVTPNRNWQTPNLQYLHRGELPLDRAEARRLARRAKSFVLLGDG KELYHRSPSGILQQCISITEGQELLQEIHSGACGHHAAPRALVGNAFRQGFYWPTAVADATRIVRTCQGCQFYAR QTHLPAQALQTIPITWSFAVWGLDLVGTLQKAPGGYTHLLVAIDKFSKWIEVRPLNSIRSEQAVAFFTNIIHRFGV PNSIITDNDTQFTDRKFLDFCEDHHIRVDWAAVAHPMTNGQVERANGMILQGLKPWIYNNLNKFGKRWMK ELPSVVWSLRTTPSRATGFTPFFLVYGAEAILPIDLEYGSPRTRAYNDQSNRANREDSLDQLEEARNMAFLHSAR YQQSLRRYHARRVRSRDLQVGDLVLRLRQDARGRHKLTPPWEGSFVIAKVLKPGTYKLANSQGEVYNNAWNI R
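
In the listings above, the predicted coding exons appear to be written in uppercase and the introns in lowercase, and each peptide is the translation of the joined exons. As a rough cross-check, this step can be sketched in Python. The function names (`spliced_cds`, `translate`, `exon_length`) and the toy record are illustrative assumptions, not part of the FGENESH or Augustus output, and the sketch assumes the case boundaries of a listing coincide exactly with the exon boundaries, which may not hold for every listing.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def spliced_cds(annotated: str) -> str:
    """Keep uppercase A/C/G/T only, dropping lowercase bases, '>' and whitespace."""
    return "".join(ch for ch in annotated if ch in "ACGT")


def translate(cds: str) -> str:
    """Translate codon by codon in frame 1, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def exon_length(start: int, end: int) -> int:
    """Length of an exon given 1-based inclusive coordinates."""
    return end - start + 1


# First six codons of the FGENESH Segment 1 sequence, copied from the listing.
print(translate("ATGGCGACCGACAACTCG"))  # MATDNS

# Hypothetical toy record: two short exon fragments around an invented intron.
toy = ">ATGGATGTCAGGgtaaggcag GGGTACCCCTAG"
print(translate(spliced_cds(toy)))  # MDVRGYP

# FGENESH Segment 2 exon coordinates as reported in the listing.
segment2 = [(66287, 67405), (67439, 67615), (68270, 69085)]
lengths = [exon_length(s, e) for s, e in segment2]
print(lengths, sum(lengths) % 3 == 0)  # [1119, 177, 816] True
```

Pasting a full listing into `spliced_cds` and comparing the translation with the reported peptide is a quick way to spot a frame shift or a mis-cased base. The inclusive-length helper reproduces the exon lengths reported for FGENESH Segment 2.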