Guide to Genomic Sequence Annotation

The final objective is to produce a Sequin annotated file for submission to GenBank. To gather all the information needed to prepare the Sequin file, a useful intermediate step is a Word document in which all the genes and repetitive elements are marked in the sequence (using highlights, letter colors, letter formats, and text boxes or comments). This is what we want to see from your lab, homework, and exam.

For the gene annotation we will mark the start and stop codons and the splicing sites (let us use bold red letters to have a common code), and the exons (let us use colored letters).

TATAAAAATCGTTGTGGGGGCGGGATTGGGCGGCCATGGACACCTCTACTAGGGCGGCCAAGATCCCCTCCCTCCCCCAACAGACGGAG
ATCAACTGGGACAAGTAAGCGCCCGCCCGCCCGCCCGCCTCCTCTTTCCTCAGCCGTTCCCCGGCTTATTTTTCTATTTTTTCCTCTAG
GTTAATATCAACAAGCCAAAAGCCAGTGAAGTAGTTAGAAGATTGATCGCATGGATGGAAAGGTTTCTAAGTATGCCCATCCTTTCCGA
GTCTTATTTCCTTCCGCTTCTTCCCAGAGCGTGTATGTGCTAAAGTTGAAGAGGCCCTGAGTCTATAATATTCTCATCAATAAA

For the repetitive elements (let us use different highlight colors), we want to find the start and end of the element, mark the Host Direct Repeats (different highlight), and find the direct or inverted repeats (let us use bold letters).

Where to start

1) Use Dotter to align the sequence against itself to find inverted and direct repeats. Write down the coordinates of the repeats and mark them as bold letters in the sequence in your Word document (use Word Count to identify the coordinates). Repeats frequently flank repetitive elements!

2) Do a BLASTN of the whole sequence against nr to identify the major features in the sequence. Explore the significant matches: do they carry the name of a gene or a repeat? Do they match multiple different genomic locations (likely a repetitive element)? Do they match a region you previously found to be flanked by direct or inverted repeats (likely a repetitive element)?
Take note of the approximate locations of the different elements in the sequence.

Repetitive elements

3) First identify the repetitive sequences by searching your sequence against a repeat database (TREP). If most of the matches with low E values (<e-10, >75% similar) have the same name, your element is likely a member of the family with that name. Be aware of the possibility of repetitive elements within repetitive elements: do the borders of the repeat match one element while something in the middle matches another one? Do the locations of your repetitive elements agree with the direct and inverted repeats you identified with Dotter?

4) Determine the structure of your repetitive element. Do a BLASTN (using BLAST 2 Sequences) between the regions you identified as direct or inverted repeats. Include a little extra sequence at both ends to be sure you cover the complete repeat. The alignment will help you find the borders of the repeats. Mark the repeat regions with bold letters and the complete element, including the long terminal repeats, with a particular highlight.

Look at the external borders of the repetitive element. Can you find a short (usually 4-8 bp) sequence that is repeated in direct orientation at the borders (at both the beginning and the end of the complete repetitive element)? This is the Host Direct Repeat, which marks the exact site of insertion and the precise border of the repetitive element. Mark both copies with the same highlight color, different from the repetitive element highlight (they are not part of the repetitive element).

Look at the external borders of the long terminal repeats. These repeats are themselves flanked by short inverted repeats, at both the beginning and the end of each long terminal repeat of the repetitive element. Mark them with bold letters.

Genes

In a separate file, eliminate the repetitive elements from your sequence and use this 'masked' sequence to search for genes.
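The masking step can also be done outside Word once you have the repeat coordinates. The following is a minimal sketch, not part of the original protocol: the function name `mask_repeats` and the example coordinates are hypothetical, and it assumes 1-based inclusive coordinates (as you would read them from Dotter or BLAST). It replaces the repeats with N instead of deleting them, an assumption made here so that the remaining coordinates still agree with your annotated Word document.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Return a copy of `sequence` with each repeat interval masked.

    `repeats` is a list of (start, end) pairs, 1-based and inclusive.
    Masking (rather than deleting) keeps all coordinates unchanged.
    """
    masked = list(sequence)
    for start, end in repeats:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Convert the 1-based inclusive interval to a Python slice.
        masked[start - 1:end] = mask_char * (end - start + 1)
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical toy sequence with one repeat at positions 3-5.
    print(mask_repeats("ACGTACGTAC", [(3, 5)]))
```

BLAST ignores runs of N when seeding alignments, so the masked file can be submitted directly to the gene searches described next.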
First, do a new BLASTN with your masked sequence against nr to see if you can find genes. They usually appear as similar (aligned) segments (exons) separated by non-similar ones (introns). If you do not find anything with BLASTN, you can use BLASTX, which translates your DNA and searches the protein databases and is therefore more sensitive for distantly related sequences.

If the gene is in inverted orientation (plus/minus matches), first reverse complement the region (using GeneTool) in a separate file, and then work with the sequence in the correct orientation (it is a lot easier for our brains!).

Do a BLASTN against the EST database to see if there is a cDNA that corresponds to your gene. If you find a very similar one (>90% identity), it will help you a lot to identify the borders of the exons. GeneSeqer is a very useful tool to integrate ESTs and splicing sites: http://bioinformatics.iastate.edu/cgi-bin/gs.cgi

Use a couple of gene prediction programs to see how they predict the structure of the gene you are working with. Remember that the alignment to the corresponding cDNA is the gold standard for the annotation of the exon structure. Unfortunately, sometimes you do not have the exact EST…

As a final validation of your annotation, copy your gene, eliminate your predicted introns and splicing sites, and then translate the predicted cDNA into a protein. If you get premature stop codons, your splicing sites are not well annotated. Once you have your predicted protein, do a BLASTP search against nr. If your protein aligns along its complete length to a known protein, you are done. If you have large gaps or you lose similarity after a splicing site, you are probably using the incorrect splicing site… Try another one!

Names

An additional important component of the annotation is to determine the most likely NAME of each gene and repeat discovered in the sequence, based on similarity to previously annotated sequences.
Write the names in text boxes at the side of your sequence. To find the most likely name:

Genes: first determine the predicted protein and then do a BLASTP search in nr or RefSeq.

Repetitive elements: search the specialized repeat databases.

REPBASE: animals and a little from plants
http://www.girinst.org/repbase/index.html

REPEATMASKER: animals and a little from plants
http://www.repeatmasker.org/

TIGR: grasses, mainly rice
http://tigrblast.tigr.org/euk-blast/index.cgi?project=osa1

TREP: wheat and barley
http://wheat.pw.usda.gov/ITMI/Repeats/index.shtml
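The final validation described in the Genes section (remove the predicted introns, translate the predicted cDNA, and look for premature stop codons) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy sequence are hypothetical, exon coordinates are taken as 1-based and inclusive, the standard genetic code is used, and the gene is assumed to be already in the correct orientation (a `reverse_complement` helper is included for plus/minus matches).

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement a DNA sequence (for genes in plus/minus orientation)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def splice_exons(genomic, exons):
    """Join the exons, given as 1-based inclusive (start, end) pairs in order."""
    return "".join(genomic[start - 1:end] for start, end in exons).upper()


def translate(cds):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def premature_stops(protein):
    """Return 1-based positions of stop codons before the last residue."""
    return [i + 1 for i, aa in enumerate(protein[:-1]) if aa == "*"]


if __name__ == "__main__":
    # Hypothetical toy gene: exon 1 (1-6), intron GT...AG (7-18), exon 2 (19-24).
    genomic = "ATGGCCGTAAGTTTTCAGAAATAA"
    protein = translate(splice_exons(genomic, [(1, 6), (19, 24)]))
    print(protein, premature_stops(protein))
```

An empty list from `premature_stops` is consistent with the annotated splicing sites but does not prove them; the BLASTP comparison against a known protein remains the decisive check.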