Exercises for the Mammalian Genomics section of IBIOS598B

CSH Computational Genomics Course
Nov 2-8, 2005
Workshop VI (?): Comparative Genomics Tools
Purpose and General Instructions:
This workshop gives you experience with several servers and analysis platforms for
comparative genomics. The illustrative examples will be from the ENCODE project, a
consortium supported by NHGRI/NIH. The ENCODE consortium is charged with
determining all the functional sequences in the human genome, and data are available
from an initial pilot phase focused on 1% of the human genome, i.e. about 30 Mb
(Science 306:636-640). A set of about 40 segments has been chosen, ranging from
gene-rich to gene-poor, GC-rich to GC-poor, etc. I’ve selected a gene-dense region and
a gene desert for you to analyze.
ENCODE region ENm008 (alpha-globin gene complex) chr16:1-500,000
ENCODE region ENr313 chr16:60,833,950-61,333,949
Coordinates are for the human May 2004 (hg17) assembly.
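As a quick sanity check on the coordinates above, here is a minimal Python sketch that parses the two region strings and reports their lengths. It assumes the coordinates are 1-based and inclusive, as they are displayed in the browser; the function names are illustrative only.

```python
import re

def parse_region(region):
    """Split 'chr16:60,833,950-61,333,949' into (chrom, start, end).

    Assumes 1-based, inclusive coordinates with optional thousands commas.
    """
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"not a valid region: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))

def region_length(region):
    """Number of bases covered by an inclusive region string."""
    _, start, end = parse_region(region)
    return end - start + 1

if __name__ == "__main__":
    for name, region in [("ENm008", "chr16:1-500,000"),
                         ("ENr313", "chr16:60,833,950-61,333,949")]:
        print(name, region_length(region))
```

Both regions come out at 500,000 bases, so the two together cover 1 Mb of the roughly 30 Mb pilot.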
URLs:
zPicture and Mulan        http://www.dcode.org/
PipMaker/MultiPipMaker    http://www.bx.psu.edu/
UCSC Genome Browser       http://genome.ucsc.edu/
UCSC Table Browser        http://genome.ucsc.edu/
Galaxy                    http://www.bx.psu.edu/
1. Compute local pairwise alignments of submitted sequences, using blastZ. The
instructions are for the zPicture and Mulan servers at http://www.dcode.org. Equivalent
functions are available at the PipMaker/MultiPipMaker site.
At zPicture, use ENCODE region ENm008 (alpha-globin gene complex) chr16:1-500,000
as sequence 1 and choose RefSeq as the annotation.
For sequence 2, enter “NT_165211.1” as the NCBI accession number. This is a recently
released BAC sequence of a rabbit DNA fragment that is homologous to about the first
half of the human ENm008 region. Submit the sequences to the server. The output will
be provided on a page with dynamic visualization, a dot-plot, and links to several files. If
things are not going smoothly, you can submit the request ID (RID) 11040714416514
until it is deleted.
1a. Click on the dot-plot icon, and choose colors that you like. What does this tell you
about the genes in the rabbit BAC? What do the simple diagonals mean, and what do
the complex diagonals mean?
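To build intuition for how diagonals arise in a dot-plot, the following toy Python sketch may help. It is not what the server does: zPicture uses blastZ alignments, whereas this sketch only marks exact k-mer matches, and the sequences are invented for illustration. Sequence 1 carries two copies of a made-up "gene" and sequence 2 carries one, so the matches fall on two separate diagonals.

```python
from collections import Counter

def kmer_matches(seq1, seq2, k=6):
    """Return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k] (exact matches only)."""
    index = {}
    for i in range(len(seq1) - k + 1):
        index.setdefault(seq1[i:i + k], []).append(i)
    matches = []
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], []):
            matches.append((i, j))
    return matches

def diagonal_counts(matches):
    """Count matches per diagonal, where a diagonal is identified by the offset i - j."""
    return Counter(i - j for i, j in matches)

# Invented toy sequences: a duplicated segment in seq1, a single copy in seq2.
GENE = "ACGTTGCAAGCTTAGC"
seq1 = GENE + "T" * 10 + GENE
seq2 = GENE

if __name__ == "__main__":
    for offset, count in sorted(diagonal_counts(kmer_matches(seq1, seq2)).items()):
        print(f"diagonal offset {offset}: {count} matching k-mers")
```

One copy in each sequence would give a single diagonal. Here the duplication in sequence 1 gives one diagonal per copy, which is the kind of pattern to look for around gene families.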
1b. Click on the dynamic visualization. Notice the complex patterns around the globin
genes (HBZ, HBM, HBA2, HBA1, and HBQ), with many lines in the pip for many of the
regions. In contrast, the other genes tend to have only one line in the pip for each
segment. Relate this to the patterns in the dot-plot (question 1a).
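In a PipMaker-style pip, each gap-free aligning segment is drawn as a horizontal line at its percent identity, positioned along sequence 1. The sketch below shows that calculation on a small invented pairwise alignment. The alignment strings and function name are assumptions for illustration, and the servers' own smoothing and display rules may differ.

```python
def gapfree_segments(aln1, aln2):
    """Return (start_in_seq1, length, percent_identity) for each gap-free block.

    aln1 and aln2 are the two rows of a pairwise alignment, with '-' for gaps.
    start_in_seq1 is a 0-based position in the ungapped first sequence.
    """
    if len(aln1) != len(aln2):
        raise ValueError("alignment rows must have equal length")
    segments = []
    pos1 = 0            # current position in ungapped sequence 1
    block_start = None  # where the current gap-free block began
    matches = length = 0
    for a, b in zip(aln1, aln2):
        if a == "-" or b == "-":
            if length:  # a gap closes the current block
                segments.append((block_start, length, 100.0 * matches / length))
            matches = length = 0
        else:
            if length == 0:
                block_start = pos1
            length += 1
            matches += (a == b)
        if a != "-":
            pos1 += 1
    if length:
        segments.append((block_start, length, 100.0 * matches / length))
    return segments

if __name__ == "__main__":
    # Invented toy alignment, not real globin sequence.
    for start, length, pid in gapfree_segments("ACGTAC--GTTA", "ACGAACTTG-TA"):
        print(f"start={start} length={length} identity={pid:.1f}%")
```

When one stretch of sequence 1 aligns to several paralogous copies in sequence 2, each alignment contributes its own segments over the same coordinates, which is why several lines can stack up in the pip.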
1c. There are several controls you can play with. Try choosing “Smooth graph” and then
hit “refresh”. Do you see many exons that do not have a signal in the 50-100% identity
range? What about introns - do you see any that have striking conservation? Look at
C16orf35 in particular.
Remember your RID (request ID), so you can get back to this for the next workshop.
2. Now compute multiple alignments of submitted sequences, using TBA on
the Mulan server at http://www.dcode.org.
Sequence 1: human May 2004 chr16:1-500,000
Sequence 2: mouse May 2004 chr11:32,070,001-32,200,000
Sequence 3: dog July 2004 chr6:41,651,001-42,047,000
At an intermediate step, the server will ask you if the tree (seq3 (seq1 seq2)) is
acceptable, and it is, so click “continue”.
If you run into too many problems, you can get the output from Request ID:
m11040714464852
2a. How do you interpret the dot-plot for sequence 1 vs sequence 3 (human vs dog)?
2b. With the dynamic visualization, you can compare the human-mouse and human-dog
alignments. The mouse sequence does not extend through all the homologs of genes in
human ENm008 because of an interchromosomal rearrangement. Other than that, do
the patterns of conservation between the two pairs of sequences look similar or
different? A well-known enhancer of alpha-globin genes is in the fifth intron of C16orf35.
Which species pair has more conservation here? Why might that be the case?
2c. Take a look at the TBA alignment by clicking on the brownish-red rectangle (an
evolutionarily conserved region or ECR) above the peak. Note the high quality of the
alignment.
Again, note your RID.
3. Now compare these to the precomputed alignments on the UCSC Genome
Browser, which are made with multiZ.
Open the “Conservation” track in full, and zoom in to chr16:103,301-103,800. This is the
same enhancer in C16orf35 mentioned above. Click on the graph and it will take you to
a page with the alignments.
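If you prefer to jump straight to a position rather than typing it into the browser, a link can be assembled from the assembly name and coordinates. The sketch below uses the browser's hgTracks page with its db and position parameters. Whether the hg17 assembly is still served from the main site is an assumption, and the helper name is illustrative.

```python
from urllib.parse import urlencode

def browser_url(db, chrom, start, end,
                base="http://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Build a UCSC Genome Browser link for an assembly and a 1-based region."""
    position = f"{chrom}:{start}-{end}"
    # urlencode percent-encodes the colon in the position string.
    return f"{base}?{urlencode({'db': db, 'position': position})}"

if __name__ == "__main__":
    # The C16orf35 enhancer window from this exercise, on hg17.
    print(browser_url("hg17", "chr16", 103301, 103800))
```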
The conservation track not only provides alignments, but also the phastCons scores
(plotted at the top). What are these scores telling you in general? For this enhancer,
what do you think the peaks correspond to?
4. Look at the RefSeq and Conservation tracks on the UCSC browser for the gene
desert ENr313 (chr16:60,833,950-61,333,949). Why do you think they call it a gene
desert? How far out on the phylogenetic tree are some regions conserved? Hypothesize a
function for these noncoding, stringently constrained elements.
5. If time permits, use the Galaxy platform to get data from the UCSC Table Browser
and use some of the operations. For instance, you could import all the coding exons
from ENm008 and all the segments with phastCons scores above 0.5. Then you could
subtract the former from the latter to calculate the fraction of highly constrained regions
that are coding exons.
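The arithmetic behind that subtraction can be sketched in Python on toy intervals. The coordinates below are invented, not real ENm008 data, and the intervals are treated as half-open (start, end) pairs, as in BED files. The sketch subtracts at the base level; check which behavior Galaxy's own subtract operation is set to, since it may instead drop whole overlapping intervals.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def subtract(keep, remove):
    """Return the parts of `keep` that are not covered by `remove`."""
    remove = merge(remove)
    result = []
    for start, end in merge(keep):
        cursor = start
        for r_start, r_end in remove:
            if r_end <= cursor:
                continue          # removal interval lies entirely to the left
            if r_start >= end:
                break             # removal interval lies entirely to the right
            if r_start > cursor:
                result.append((cursor, r_start))
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result

def total_bases(intervals):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start for start, end in merge(intervals))

def coding_fraction(conserved, exons):
    """Fraction of conserved bases that fall inside coding exons."""
    noncoding = total_bases(subtract(conserved, exons))
    return 1.0 - noncoding / total_bases(conserved)

if __name__ == "__main__":
    # Invented toy data standing in for phastCons > 0.5 segments and coding exons.
    conserved = [(100, 200), (300, 350), (400, 500)]
    exons = [(150, 250), (400, 450)]
    print(subtract(conserved, exons))
    print(coding_fraction(conserved, exons))
```

The same quantity can be computed directly from the bases removed by the subtraction: conserved bases minus remaining bases, divided by conserved bases.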