Guide to Genomic Sequence Annotation

The final objective is to produce a Sequin-annotated file for submission to GenBank.
A useful intermediate step, which gathers all the information needed to prepare the
Sequin file, is a Word document in which all the genes and repetitive elements have been
marked in the sequence (using highlights, letter colors, letter formats, and text boxes
or comments). This is what we want to see in your lab work, homework, and exam.
For the gene annotation, we will mark the start and stop codons and the splicing sites
(let us use bold red letters so that we all share a common code), and the exons (let us
use colored letters).
TATAAAAATCGTTGTGGGGGCGGGATTGGGCGGCCATGGACACCTCTACTAGGGCGGCCAAGATCCCCTCCCTCCCCCAACAGACGGAG
ATCAACTGGGACAAGTAAGCGCCCGCCCGCCCGCCCGCCTCCTCTTTCCTCAGCCGTTCCCCGGCTTATTTTTCTATTTTTTCCTCTAG
GTTAATATCAACAAGCCAAAAGCCAGTGAAGTAGTTAGAAGATTGATCGCATGGATGGAAAGGTTTCTAAGTATGCCCATCCTTTCCGA
GTCTTATTTCCTTCCGCTTCTTCCCAGAGCGTGTATGTGCTAAAGTTGAAGAGGCCCTGAGTCTATAATATTCTCATCAATAAA
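Before marking anything by hand, it can help to list where the candidate signals are. The following is a minimal sketch, not part of the course tools: it assumes the canonical case in which introns begin with GT and end with AG, and it only reports positions (1-based, as in the Word document). Most of the GT and AG hits will not be real splicing sites; the function names are hypothetical.

```python
def find_all(seq, motif):
    """Return the 1-based start positions of every occurrence of motif (overlaps allowed)."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i + 1)          # convert 0-based index to 1-based coordinate
        i = seq.find(motif, i + 1)
    return positions


def candidate_sites(seq):
    """List candidate start codons, stop codons, and GT/AG dinucleotides on the given strand."""
    stops = sorted(p for codon in ("TAA", "TAG", "TGA") for p in find_all(seq, codon))
    return {
        "start_codon": find_all(seq, "ATG"),
        "stop_codon": stops,
        "donor_GT": find_all(seq, "GT"),      # possible intron starts
        "acceptor_AG": find_all(seq, "AG"),   # possible intron ends
    }
```

The output is only a list of candidates; the EST alignment and protein similarity steps described later decide which ones are real.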
For the repetitive elements (let us use different highlight colors), we want to find the
start and end of the element, mark the Host direct repeats (different highlight), and find
the direct or inverted repeats (let us use bold letters).
Where to start
1) Use Dotter to align the sequence against itself to find inverted and direct repeats.
Write down the coordinates of the repeats and mark them as bold letters in the sequence
in your Word document (use ‘Word Count’ to identify the coordinates). Repeats frequently
flank repetitive elements!
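To see what a self-comparison is looking for, here is a rough sketch that finds exact shared words of length k within one sequence. It is not Dotter and does not replace it: Dotter draws a dot plot and tolerates mismatches, while this toy only reports exact matches. The function names and the default word length are assumptions chosen for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_repeats(seq, k=12):
    """Return (direct, inverted) lists of 1-based position pairs sharing an exact k-mer.

    direct:   the same word occurs at both positions.
    inverted: the word at the first position is the reverse complement
              of the word at the second position.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)

    direct, inverted = [], []
    for kmer, positions in index.items():
        for n, a in enumerate(positions):
            for b in positions[n + 1:]:
                direct.append((a, b))
        for a in positions:
            for b in index.get(reverse_complement(kmer), []):
                if a < b:
                    inverted.append((a, b))
    return sorted(direct), sorted(inverted)
```

Runs of consecutive pairs, such as (1, 12) followed by (2, 13), correspond to one longer repeat, which is what appears as a diagonal line in a dot plot.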
2) Run a BLASTN search of the complete sequence against nr to identify the major
features in the sequence.

Explore the significant matches: do they carry the name of a gene or a repeat?

Do they match multiple different genomic locations (likely a repetitive element)?

Do they match a region you previously found was flanked by direct or inverted
repeats (likely a repetitive element)?

Take note of the approximate locations of the different elements in the sequence.
Repetitive elements
3) First identify the repetitive sequences by searching your sequence against a repeat
database (TREP). If most of the strong matches (low E values, <e-10, and >75% similar)
have the same name, your element is likely a member of the family with that name.

Be aware of the possibility of repetitive elements within repetitive elements: do
the borders of the repeats match one element and something in the middle
matches another one?

Do the locations of your repetitive elements agree with your Dotter identification
of direct and inverted repeats?
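The filtering rule in step 3 can be sketched as a small tally. This assumes the search was run locally and saved in BLAST's 12-column tabular format (subject ID in column 2, percent identity in column 3, E value in column 11); it also uses percent identity as a stand-in for "similar". How a family name is derived from a subject ID depends on the database, so family_of is a hypothetical hook that you would supply.

```python
from collections import Counter


def tally_strong_hits(tabular_text, max_evalue=1e-10, min_identity=75.0,
                      family_of=lambda subject_id: subject_id):
    """Count strong hits per family name in BLAST tabular (12-column) output.

    A hit is kept when its E value is below max_evalue and its percent
    identity is above min_identity. family_of maps a subject ID to a family
    name (by default the ID itself is used).
    """
    counts = Counter()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue < max_evalue and identity > min_identity:
            counts[family_of(subject_id)] += 1
    return counts
```

If one name dominates the tally, that is the family to write in the text box; a second name concentrated in the middle of the region may point to a nested element.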
4) Determine the structure of your repetitive element.

Do a BLASTN (using BLAST 2 Sequences) between the regions you identified
as direct or inverted repeats. Include a little more at both ends to be sure you
cover the complete repeat. The resulting alignment will help you find the
borders of the repeats. Mark the repeat regions with bold letters, and mark the
complete element, including the long terminal repeats, with a particular
highlight.

Look at the external borders of the repetitive element. Can you find a short
(usually 4-8 bp) sequence that is repeated in direct orientation at both borders
(the beginning and the end of the complete repetitive element)? This is the
Host Direct Repeat, which marks the exact site of insertion and the precise border
of the repetitive element. Mark both copies with the same highlight color,
different from the repetitive element highlight (they are not part of the
repetitive element).

Look at the external borders of the long terminal repeats. Each long terminal
repeat of the repetitive element is itself flanked by short inverted repeats, one
at its beginning and one at its end. Mark them with bold letters.
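A quick way to check the short inverted repeats is to compare the first bases of a long terminal repeat with the reverse complement of its last bases. The sketch below assumes you have already extracted one long terminal repeat as a string and that the inverted repeats match exactly; the function name and the maximum length are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat(ltr, max_len=10):
    """Return (left, right): the longest exact inverted repeat at the two ends of ltr.

    left is the prefix and right is the suffix of the long terminal repeat;
    right read as a reverse complement equals left. Both are empty strings
    if the ends do not match at all.
    """
    ltr = ltr.upper()
    best = 0
    for n in range(1, min(max_len, len(ltr) // 2) + 1):
        if ltr[:n] == reverse_complement(ltr[-n:]):
            best = n
        else:
            break                 # a longer match is impossible once one fails
    return ltr[:best], ltr[len(ltr) - best:]
```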
Genes
In a separate file, eliminate the repetitive elements from your sequence and use this
‘masked’ sequence to search for genes.
First, do a new BLASTN with your masked sequence against nr to see if you can find
genes. They usually appear as similar (aligned) segments (exons) separated by
non-similar ones (introns).
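Producing the masked copy can be sketched as follows, assuming you have the element coordinates from the previous steps (1-based, inclusive). By default the sketch replaces each element with N so that the remaining coordinates still match your Word document; passing remove=True deletes the elements instead, which shifts every downstream coordinate. The function name and parameters are hypothetical.

```python
def mask_repeats(seq, intervals, remove=False, mask_char="N"):
    """Mask (or delete) the given 1-based inclusive intervals in seq.

    Overlapping or nested intervals are handled by masking their union.
    """
    pieces = []
    pos = 0                                   # 0-based index of the next unprocessed base
    for start, end in sorted(intervals):
        if end <= pos:
            continue                          # interval already covered by an earlier one
        begin = max(start - 1, pos)
        pieces.append(seq[pos:begin])         # keep the sequence before the element
        if not remove:
            pieces.append(mask_char * (end - begin))
        pos = end
    pieces.append(seq[pos:])                  # keep the tail after the last element
    return "".join(pieces)
```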
If you do not find anything with BLASTN, you can use BLASTX, which translates your
DNA and searches the protein databases. This search is more sensitive for distantly
related sequences.
If the gene is in inverted orientation (plus/minus matches), first reverse complement
the region (using GeneTool) in a separate file, and then work with the sequence in the
correct orientation (this is a lot easier for our brains!).
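Once the region is reverse complemented, the coordinates you noted on the original strand no longer apply. The sketch below is an illustration only (GeneTool does the sequence part for you): it shows the reverse complement and the coordinate conversion, with 1-based inclusive coordinates and hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_reverse_coords(start, end, seq_len):
    """Convert a 1-based inclusive interval to its position on the reverse complement.

    seq_len is the length of the region that was reverse complemented.
    """
    return seq_len - end + 1, seq_len - start + 1
```

The same conversion, applied again, takes you back to the original coordinates, which you will need when you mark the features in the Word document.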

Do a BLASTN against the EST database to see if there is a cDNA that
corresponds to your gene. If you find a very similar one (>90% identity), it will
help you a lot to identify the borders of the exons.

GeneSeqer is a very useful tool to integrate ESTs and splicing sites:
http://bioinformatics.iastate.edu/cgi-bin/gs.cgi

Use a couple of gene prediction programs to see how they predict the gene
structure you are working with. Remember that the alignment to the
corresponding cDNA is the gold standard for the annotation of the exon structure.
Unfortunately, sometimes you do not have the exact EST…

As a final validation of your annotation, you need to copy your gene, eliminate
your predicted introns and splicing sites, and then translate the predicted cDNA
into a protein. If you are getting premature stop codons, your splicing sites are
not well annotated. Once you have your predicted protein, do a BLASTP search
against nr. If your protein aligns along its complete length to a known protein, you
are done. If you have large gaps or you lose similarity after a splicing site, you
are probably using the incorrect splicing site… Try another one!
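The first part of this validation (remove introns, translate, look for premature stops) can be sketched as below. It assumes the gene is already in the correct orientation, that exon coordinates are 1-based and inclusive and cover only the coding region from start to stop codon, and that the standard genetic code applies. The function names are hypothetical, and the BLASTP step still has to be done separately.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order; "*" marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(seq, exons):
    """Join the exons (1-based inclusive coordinates) into the predicted coding sequence."""
    return "".join(seq[start - 1:end] for start, end in sorted(exons))


def translate(cds):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def check_cds(seq, exons):
    """Return (protein, problems) for the coding sequence implied by the exon list."""
    cds = splice(seq, exons).upper()
    protein = translate(cds)
    problems = []
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(cds) % 3:
        problems.append("length is not a multiple of 3")
    if not protein.endswith("*"):
        problems.append("no stop codon at the end")
    if "*" in protein[:-1]:
        problems.append("premature stop codon")
    return protein, problems
```

An empty problem list only means the reading frame is intact; a wrong splicing site that keeps the frame will pass this check and is caught only by the BLASTP alignment.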
Names
An additional important component of the annotation is to determine the most likely
NAME of each gene and repeat discovered in the sequence based on similarity to
previously annotated sequences. Write the names in text boxes at the side of your
sequence.
To find the most likely name:
Genes: first determine the predicted protein sequence and then do a BLASTP search
against nr or RefSeq.
Repetitive elements: search the specialized repeat databases
REPBASE: animals, with a little from plants http://www.girinst.org/repbase/index.html
REPEATMASKER: animals, with a little from plants http://www.repeatmasker.org/
TIGR: grasses, mainly rice http://tigrblast.tigr.org/euk-blast/index.cgi?project=osa1
TREP: wheat and barley http://wheat.pw.usda.gov/ITMI/Repeats/index.shtml