CSH Computational Genomics Course
Nov 2-8, 2005
Workshop VI (?): Comparative Genomics Tools

Purpose and General Instructions:

This workshop gives you experience with several servers and analysis platforms for comparative genomics. The illustrative examples will be from the ENCODE project, a consortium supported by NHGRI/NIH. The ENCODE consortium is challenged with determining all the functional sequences in the human genome, and data are available from an initial pilot phase focused on 1% of the human genome, i.e. about 30 Mb (Science 306: 636-640). A set of about 40 segments has been chosen, ranging from gene-rich to gene-poor, GC-rich to GC-poor, etc. I’ve selected a gene-dense region and a gene desert for you to analyze.

ENCODE region ENm008 (alpha-globin gene complex)   chr16:1-500,000
ENCODE region ENr313                               chr16:60,833,950-61,333,949

Coordinates are for the human May 2004 (hg17) assembly.

URLs:
zPicture and Mulan         http://www.dcode.org/
PipMaker/MultiPipMaker     http://www.bx.psu.edu/
UCSC Genome Browser        http://genome.ucsc.edu/
UCSC Table Browser         http://genome.ucsc.edu/
Galaxy                     http://www.bx.psu.edu/

1. Compute local pairwise alignments of submitted sequences, using blastZ. The instructions are for the zPicture and Mulan servers at http://www.dcode.org. Equivalent functions are available at the PipMaker/MultiPipMaker site.

At zPicture, use ENCODE region ENm008 (alpha-globin gene complex) chr16:1-500,000 as sequence 1, choose RefSeq as the annotation, and submit them to the server. For sequence 2, enter “NT_165211.1” as the NCBI accession number. This is a recently released BAC sequence of a rabbit DNA fragment that is homologous to about the first half of the human ENm008 region. Submit the sequences to the server.

The output will be provided on a page with dynamic visualization, a dot-plot, and links to several files. If things are not going smoothly, you can submit the request ID (RID) 11040714416514 until it is deleted.

1a.
Click on the dot-plot icon, and choose colors that you like. What does this tell you about the genes in the rabbit BAC? What do the simple diagonals mean, and what do the complex diagonals mean?

1b. Click on the dynamic visualization. Notice the complex patterns around the globin genes (HBZ, HBM, HBA2, HBA1, and HBQ), with many lines in the pip for many of the regions. In contrast, the other genes tend to have only one line in the pip for each segment. Relate this to the patterns in the dot-plot (question 1a).

1c. There are several controls you can play with. Try choosing “Smooth graph” and then hit “refresh”. Do you see many exons that do not have a signal in the 50-100% identity range? What about introns: do you see any that have striking conservation? Look at C16orf35 in particular.

Remember your RID (request ID), so you can get back to this for the next workshop.

2. Now compute local multiple alignments of submitted sequences, using TBA on the Mulan servers at http://www.dcode.org.

Sequence 1: human May 2004 chr16:1-500,000
Sequence 2: mouse May 2004 chr11:32,070,001-32,200,000
Sequence 3: dog July 2004 chr6:41,651,001-42,047,000

At an intermediate step, the server will ask you if the tree (seq3 (seq1 seq2)) is acceptable, and it is, so click “continue”. If you run into too many problems, you can get the output from Request ID: m11040714464852

2a. How do you interpret the dot-plot for sequence 1 vs sequence 3 (human vs dog)?

2b. With the dynamic visualization, you can compare the human-mouse and human-dog alignments. The mouse sequence does not extend through all the homologs of genes in human ENm008 because of an interchromosomal rearrangement. Other than that, do the patterns of conservation between the two pairs of sequences look similar or different? A well-known enhancer of alpha-globin genes is in the fifth intron of C16orf35. Which species pair has more conservation here? Why might that be the case?

2c.
Take a look at the TBA alignment by clicking on the brownish-red rectangle (an evolutionarily conserved region or ECR) above the peak. Note the high quality of the alignment.

Again, note your RID.

3. Now compare these to the precomputed alignments on the UCSC Genome Browser, which are made with multiZ. Open the “Conservation” track in full, and zoom in to chr16:103,301-103,800. This is the same enhancer in C16orf35 mentioned above. Click on the graph and it will take you to a page with the alignments. The conservation track not only provides alignments, but also the phastCons scores (plotted at the top). What are these scores telling you in general? For this enhancer, what do you think the peaks correspond to?

4. Look at the RefSeq and Conservation tracks on the UCSC browser for the gene desert ENr313 (chr16:60,833,950-61,333,949). Why do you think they call it a gene desert? How far out in the phylogenetic tree are some regions conserved? Hypothesize a function for these noncoding, stringently constrained elements.

5. If time permits, use the Galaxy platform to get data from the UCSC Table Browser and use some of the operations. For instance, you could import all the coding exons from ENm008 and all the segments with phastCons scores above 0.5. Then you could subtract the former from the latter to calculate the fraction of highly constrained regions that are coding exons.
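
If you want to see what the subtraction in step 5 amounts to, here is a minimal Python sketch of the same idea. It is not Galaxy's implementation, only an illustration under stated assumptions: intervals are half-open (start, end) pairs on a single chromosome, the function names (merge, subtract, coding_fraction) are made up for this example, and the coordinates in the usage lines are invented toy values, not real ENm008 data.

```python
from typing import List, Tuple

# Assumption: half-open [start, end) intervals on one chromosome.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of a that are not covered by any interval in b."""
    b_merged = merge(b)
    result: List[Interval] = []
    for start, end in merge(a):
        pos = start  # leftmost base of this interval not yet handled
        for b_start, b_end in b_merged:
            if b_end <= pos:
                continue  # b interval lies entirely to the left
            if b_start >= end:
                break  # b interval lies entirely to the right
            if b_start > pos:
                result.append((pos, b_start))  # uncovered gap before b
            pos = max(pos, b_end)
            if pos >= end:
                break
        if pos < end:
            result.append((pos, end))  # uncovered tail
    return result


def total_length(intervals: List[Interval]) -> int:
    """Number of bases covered, counting overlapping bases once."""
    return sum(end - start for start, end in merge(intervals))


def coding_fraction(constrained: List[Interval], exons: List[Interval]) -> float:
    """Fraction of constrained bases that fall inside coding exons."""
    total = total_length(constrained)
    if total == 0:
        return 0.0
    noncoding = total_length(subtract(constrained, exons))
    return (total - noncoding) / total


# Toy values for illustration only (hypothetical, not real annotations).
constrained = [(100, 200), (300, 400), (500, 600)]  # e.g. phastCons > 0.5
exons = [(150, 250), (300, 400)]                    # e.g. coding exons

print(subtract(constrained, exons))
print(coding_fraction(constrained, exons))
```

The sketch measures the fraction in bases. Counting whole regions instead (how many constrained segments overlap any exon) is an equally reasonable reading of the question and may give a different number, so decide which one you mean before comparing results with your neighbors.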