Exercises for the Vertebrate Genomics section of IBIOS598B

Assignment 7: Finding protein-coding genes
The purpose of this exercise is to illustrate some of the concepts in the lectures and
readings by using web servers to annotate genes. As with all my assignments, if your
interests lead you in a different direction, you are free to follow that direction as long as
it deals with gene annotation. You may do the assignment on genomic regions from
ANY organism (including bacteria, plants, and fungi) but you will probably have to do
more independent investigation than if you choose to use the assigned sequence. Of
course, please tell me what you did. The report from this exercise should be around two
to four pages, including figures. Quantitative answers are preferable to qualitative ones.
Describe your observations in your own words, and cite your sources for information.
Pick a genetic locus (single gene or multiple genes) that you are interested in. You can
choose the locus from any organism. The following description of the assignment is
based on a gene that almost everyone is interested in at some level, TP53. This gene
encodes a transcription factor, “tumor protein p53”, that regulates several aspects of cell
growth. It is also frequently mutated in many cancers. If you have no better preference,
then work on TP53. It and some adjacent genes are located at chr17:7,550,001-7,608,000
in the GRCh37/hg19 assembly of the human genome. This 58 kb sequence
(in fastA format) is at the Angel course site.
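Before submitting the sequence to any server, it can be worth confirming that the file you downloaded matches the stated interval. The following is a minimal sketch, assuming the interval is chr17:7,550,001-7,608,000 and that browser positions are 1-based and inclusive. The function names are made up for illustration and are not part of any course material.

```python
import re

def parse_position(pos):
    """Parse a browser-style position such as 'chr17:7,550,001-7,608,000'."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", pos)
    if match is None:
        raise ValueError(f"unrecognized position: {pos!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return chrom, start, end

def region_length(pos):
    """Length in bp, treating both ends of the interval as included."""
    _, start, end = parse_position(pos)
    return end - start + 1

def fasta_length(lines):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))

print(region_length("chr17:7,550,001-7,608,000"))  # 58000
```

To check a downloaded file, pass an open file handle to `fasta_length` and compare the result with `region_length`.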
(1) Run the sequence through Genscan to find the predicted genes. Genscan and the
associated server were developed by Chris Burge (now at MIT), and they are still supported
there:
http://genes.mit.edu/GENSCAN.html
If you are working with a bacterial sequence, try Glimmer (Salzberg lab); you can use
the server at NCBI:
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Briefly state how the gene predictions were produced, and describe the results of the
gene predictions.
(2) Now compare these results to (a) evidence of transcription and (b) gene models built
by a comprehensive pipeline, such as “UCSC genes” or “GENCODE”. A good way to do
this is to examine tracks in the UCSC Genome Browser for
- A comprehensive pipeline, such as “UCSC genes” or “GENCODE”
- mRNA data
- Genscan predictions
- results of RNA-seq
Describe the gene annotations from these different sources. What similarities and
differences do you see? What is the basis for the differences? (This is asking about the
power and limitations, the good points and not-so-good points, of the different
methods.)
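Since quantitative answers are preferred, one way to summarize the agreement between a set of predicted exons and a set of annotated exons is to count shared bases and exactly matching exons. The sketch below is only one possible approach. The exon coordinates are invented for illustration, not taken from TP53 or from any Genscan run, and the function names are hypothetical. "Precision" here is the fraction of predicted bases that are also annotated, a quantity often called specificity in the gene-prediction literature.

```python
def merge(intervals):
    """Merge overlapping or abutting 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered_bp(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))

def shared_bp(a, b):
    """Number of bases covered by both interval sets."""
    total = 0
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                total += hi - lo + 1
    return total

def compare(predicted, annotated):
    """Base-level sensitivity and precision, plus exact exon matches."""
    shared = shared_bp(predicted, annotated)
    return {
        "sensitivity_bp": shared / covered_bp(annotated),
        "precision_bp": shared / covered_bp(predicted),
        "exact_exons": len(set(predicted) & set(annotated)),
    }

# Invented coordinates, for illustration only.
predicted = [(100, 200), (300, 400), (500, 600)]
annotated = [(100, 200), (320, 400), (700, 750)]
print(compare(predicted, annotated))
```

With real data, the exon coordinates could be copied from the Genscan output and from the details page of a gene track, taking care that both use the same coordinate system.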
To help in getting started, I have shared a “browser session” with you. Copying and
pasting the following URL into your internet browser will open a view at the UCSC
Genome Browser.
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=rosshardison&hgS_otherUserSessionName=TP53andFlanks
This is a good starting point, but I encourage you to explore these tracks, change the
settings, open other tracks, etc. This is an opportunity to delve more deeply into the
material we covered.