Exercise: Eukaryotic Gene Prediction

BSc Bioinformatics
Spring Term 2004
This exercise should take you no more than about 1 to 1½ hours during the
practical session on 11 February. It is not assessed in any way.
You will try out some of the public web servers available for predicting the location
and structure of eukaryotic genes, using the DNA sequence of the contig containing most
of the human PAX6 gene (the subject of your assessed exercise).
Firstly, search the EMBL database for the contig concerned, using the SRS server at the
EBI (http://srs.ebi.ac.uk). The easiest way to locate the entry is to type the accession
number (Z83307) in the “Quick Text Search” box. Display the sequence in FASTA
format using the “Display Options” at the bottom of the web page.
Now investigate some of the following web-server-based gene prediction programs,
following the instructions in each case [1]. You may be required to select the species or type
of organism and/or the type of DNA (genomic), and you should also note whether the
sequence is expected to be in raw or FASTA format. If raw sequence (DNA characters
only) is required, you simply use the FASTA sequence without the top (title) line.
Genscan:   http://genes.mit.edu/GENSCAN.html
HMMgene:   http://www.cbs.dtu.dk/services/HMMgene/
GrailEXP:  http://grail.lsd.ornl.gov/grailexp/
Genie:     http://www.fruitfly.org/seq_tools/genie.html
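If you would rather not edit the sequence by hand, the FASTA-to-raw conversion described above can be sketched in a few lines of Python. This is only an illustration: the function name fasta_to_raw and the short example record are made up for the sketch, and it assumes the text holds a single FASTA record.

```python
def fasta_to_raw(fasta_text):
    """Return the sequence of a single-record FASTA string as raw DNA characters."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA title line starting with '>'")
    # Drop the title line, then join the sequence lines without any whitespace.
    return "".join(line.strip() for line in lines[1:])


# Hypothetical toy record, not the real Z83307 entry.
example = ">demo_entry hypothetical title line\nACGTACGT\nTTGACC\n"
print(fasta_to_raw(example))  # ACGTACGTTTGACC
```

For the real exercise you would paste in the FASTA text saved from SRS in place of the toy record.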
In each case, examine the output you get. What exactly is predicted: exons? Complete
genes? Promoters, polyadenylation (polyA) sites and other upstream/downstream signals? How many exons
are predicted in each case? How easy is the output to understand?
Finally, go to the Ensembl database and find the entry for PAX6 on chromosome 11, as
you did in the first exercise. Look at Ensembl’s latest prediction for the structure of the
PAX6 gene, and compare its gene structure with that predicted by each of the individual
programs you used, remembering that the contig sequence you used does not contain the
complete gene. Assuming that Ensembl’s prediction is accurate (which will not always be
the case – why?), which of these programs did best, and which worst? Is this what you
would expect?
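One way to make the "best" and "worst" comparison concrete is to count how many predicted exons match the reference exons exactly. The Python sketch below is a minimal illustration under stated assumptions: the function name exon_level_accuracy is hypothetical, the coordinates are made up and are not real PAX6 exons, and both exon lists are assumed to have been converted to the same coordinate system and strand (Ensembl reports chromosome coordinates, whereas the prediction servers report positions within the contig).

```python
def exon_level_accuracy(predicted, reference):
    """Compare two collections of (start, end) exons by exact coordinate match."""
    pred, ref = set(predicted), set(reference)
    exact = pred & ref  # exons with both boundaries predicted correctly
    # Sensitivity: fraction of reference exons that were predicted exactly.
    sensitivity = len(exact) / len(ref) if ref else 0.0
    # Fraction of predicted exons that are correct, often called
    # specificity in the gene prediction literature.
    specificity = len(exact) / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up coordinates for illustration only.
reference = [(100, 200), (400, 550), (900, 1000)]
predicted = [(100, 200), (400, 560), (900, 1000), (1500, 1600)]

sn, sp = exon_level_accuracy(predicted, reference)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")  # sensitivity=0.67 specificity=0.50
```

Exact matching is deliberately strict: an exon with one boundary off by a few bases counts as wrong. You may also want to note by eye which predictions overlap a reference exon without matching it exactly.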
[1] If you think you will run out of time, miss out one or two of the prediction programs rather than skipping
the comparison with Ensembl.