BSc Bioinformatics Spring Term 2004

Exercise: Eukaryotic Gene Prediction

This exercise should take you no more than about 1 to 1½ hours, during the practical session on 11 February. It is not assessed in any way. You will try out some of the public web servers available for predicting the location and structure of eukaryotic genes, using the DNA sequence of the contig containing most of the human PAX6 gene (the subject of your assessed exercise).

Firstly, search the EMBL database for the contig concerned, using the SRS server at the EBI (http://srs.ebi.ac.uk). The easiest way to locate the entry is to type the accession number (Z83307) in the “Quick Text Search” box. Display the sequence in FASTA format using the “Display Options” at the bottom of the web page.

Now investigate some of the following web-server-based gene prediction programs, following the instructions in each case [1]. You may be required to select the species or type of organism and/or the type of DNA (genomic), and you should also note whether the sequence is expected to be in raw or FASTA format. If raw sequence (DNA characters only) is required, simply use the FASTA sequence without the top (title) line.

Genscan    http://genes.mit.edu/GENSCAN.html
HMMgene    http://www.cbs.dtu.dk/services/HMMgene/
GrailEXP   http://grail.lsd.ornl.gov/grailexp/
Genie      http://www.fruitfly.org/seq_tools/genie.html

In each case, examine the output you get. What exactly is predicted: exons? Complete genes? Promoters, polyA signals and other upstream/downstream features? How many exons are predicted in each case? How easy is the output to understand?

Finally, go to the Ensembl database and find the entry for PAX6 on chromosome 11, as you did in the first exercise. Look at Ensembl’s latest prediction for the structure of the PAX6 gene, and compare it with the gene structure predicted by each of the individual programs you used, remembering that the contig sequence you used does not contain the complete gene.
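One way to make this comparison systematic is to write down each program's predicted exons as (start, end) pairs and check them against the Ensembl exons that lie within the contig. The following is only a minimal sketch and is not part of the original exercise: the function names `exons_within` and `compare_exons` are hypothetical, the coordinates are invented, and it assumes you have already converted every exon to the same coordinate system (for example, positions on the contig).

```python
def exons_within(exons, region_start, region_end):
    """Keep only exons lying entirely inside a region (e.g. the contig)."""
    return [(s, e) for (s, e) in exons if s >= region_start and e <= region_end]


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_exons(predicted, reference):
    """Classify predicted exons against a reference set.

    Returns (exact, partial, missed):
      exact   - predicted exons whose two boundaries both match a reference exon
      partial - predicted exons that overlap a reference exon without matching it
      missed  - reference exons that no predicted exon overlaps
    """
    ref_set = set(reference)
    exact = [p for p in predicted if p in ref_set]
    partial = [
        p for p in predicted
        if p not in ref_set and any(overlaps(p, r) for r in reference)
    ]
    missed = [r for r in reference if not any(overlaps(p, r) for p in predicted)]
    return exact, partial, missed


# Invented coordinates for illustration only; NOT real PAX6 data.
# The last reference exon falls outside the (hypothetical) contig and is dropped.
reference = exons_within([(100, 200), (500, 650), (900, 1000), (5000, 5100)], 1, 2000)
predicted = [(100, 200), (480, 650), (1500, 1600)]

exact, partial, missed = compare_exons(predicted, reference)
print("exact:", exact)
print("partial:", partial)
print("missed:", missed)
```

Restricting the reference exons to the contig first keeps a program from being penalised for exons it could never have seen.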
Assuming that Ensembl’s prediction is accurate (which will not always be the case; why?), which of these programs did best, and which worst? Is this what you would expect?

[1] If you think you will run out of time, miss out one or two of the prediction programs rather than skipping the comparison with Ensembl.
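Returning to the raw versus FASTA input formats mentioned earlier: removing the title line by hand is perfectly adequate, but the same step can be done with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `fasta_to_raw` is hypothetical, the example record is made up, and the input is assumed to be a single-record FASTA file.

```python
def fasta_to_raw(fasta_text):
    """Return the sequence from a single-record FASTA string as raw characters.

    Lines starting with '>' (the title line) are dropped, and all whitespace
    is removed, so only the sequence characters remain.
    """
    lines = fasta_text.strip().splitlines()
    sequence_lines = [line for line in lines if not line.startswith(">")]
    # split() with no arguments removes spaces, tabs and stray line endings
    return "".join("".join(line.split()) for line in sequence_lines)


# Made-up example record; this is NOT the real Z83307 sequence.
example = """>EXAMPLE hypothetical title line
ACGTACGTAC
GGTTAACC
"""

raw = fasta_to_raw(example)
print(raw)  # ACGTACGTACGGTTAACC
```

Some servers accept line breaks in raw input and some do not, so check each server's instructions before pasting.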