DTC bioinformatics module
Genome bioinformatics practical
Gil McVean

Divide into five groups. These will use, and (where possible) develop, bioinformatics methods for the analysis of different aspects of genetic information. We will focus on human chromosome 20, which at 63 Mb is one of the smallest in our genome but one that has been extensively studied [1-3]. We will look at five areas:

i) Gene structure

Describe the distribution of gene structures (exons, introns, coding sequences, regulatory signals) within the chromosome. Devise a simple parametric model for exon structure and estimate the parameters from the data. Are genes and gene structures randomly distributed along the chromosome? Do established and predicted genes have similar properties?

Use the complete list of genes and gene structures available from
http://www.ncbi.nlm.nih.gov/genome/guide/human/
See also
http://www.sanger.ac.uk/HGP/Chr20/
for more information and additional downloads.

ii) Base composition

Describe how the base composition varies along the chromosome in terms of nucleotide frequencies and simple words. Develop a simple HMM to look for CpG islands. What is the relationship between these and genes?

Use the complete DNA sequence available from
http://genome.ucsc.edu/goldenPath/hg16/chromosomes/chr20.fa.zip

iii) Molecular evolution

Describe the extent and nature of the divergence between humans and chimps. Do all types of mutation occur at the same rate? To what extent is divergence influenced by local base composition? Do genes and non-coding sequences evolve at similar rates?

Use the set of human-chimp alignments available from
http://genome.ucsc.edu/goldenPath/panTro1/vsHg16/axtNet/chr20.axt.gz
You could also use the set of 7645 aligned human-chimp-mouse genes sequenced by Celera (before they got bought out). Data available at
http://www.sciencemag.org/cgi/content/full/302/5652/1960/DC1
(see [4]).
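As a possible starting point for the molecular evolution questions, the sketch below counts transitions (purine to purine, or pyrimidine to pyrimidine) and transversions between two aligned sequences. It assumes the alignment file follows the UCSC axt layout of three-line blocks (a summary line followed by the two aligned sequences). The function names and the toy alignment are illustrative only, and this is not a prescribed method for the practical.

```python
import io
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def read_axt_pairs(handle):
    """Yield (primary, aligned) sequence pairs from axt-style text.

    Assumes each block is a summary line followed by two sequence
    lines; blank lines and lines starting with '#' are skipped.
    """
    block = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        block.append(line)
        if len(block) == 3:
            yield block[1], block[2]
            block = []


def classify_substitutions(human, chimp):
    """Count compared sites, transitions and transversions."""
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for h, c in zip(human.upper(), chimp.upper()):
        if h not in "ACGT" or c not in "ACGT":
            continue  # skip gaps and ambiguous bases
        counts["sites"] += 1
        if h == c:
            continue
        # A change within purines or within pyrimidines is a transition.
        same_class = {h, c} <= PURINES or {h, c} <= PYRIMIDINES
        counts["transitions" if same_class else "transversions"] += 1
    return counts


if __name__ == "__main__":
    # Toy block with a made-up summary line, not real chromosome 20 data.
    toy = io.StringIO("0 chr20 1 10 chr21 1 10 + 100\nACGTACGT-A\nATGTCCGTAA\n\n")
    totals = Counter()
    for primary, aligned in read_axt_pairs(toy):
        totals.update(classify_substitutions(primary, aligned))
    print(dict(totals))
```

For the real data, the same loop could be run over the downloaded file opened with `gzip.open(path, "rt")`. Counting in windows along the chromosome, rather than over the whole file, would allow divergence to be compared with local base composition.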
iv) Population structure

Describe the extent of population differentiation among humans using the SNP genotype information collected in a 10 Mb region of chromosome 20. If you didn't know where genotyped individuals came from, would you be able to classify people into different groups? How do these groups compare with the geographical labels?

Use the genotype information available from
www.stats.ox.ac.uk/~mcvean/DTC/SNP
and the program STRUCTURE 2.1 available from
http://pritch.bsd.uchicago.edu/software.html
See [1,5] for a description of the data, [6] for a discussion of human structure and [7,8] for a discussion of the STRUCTURE algorithm.

v) Recombination and linkage disequilibrium

How do recombination rates vary along the chromosome? Does recombination rate correlate with underlying genomic features such as gene location and GC content?

Use pedigree-based estimates of the recombination rate available from [9]
http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v31/n3/full/ng917.html
the program RecMin written by Simon Myers [10]
http://www.stats.ox.ac.uk/~myers/RecMin.html
and the LDhat package to estimate recombination-rate variation from population genetic data [5]. Use the genotype data from unrelated UK Caucasians available from
www.stats.ox.ac.uk/~mcvean/DTC/SNP
(software will be made available on Friday).

Reference List

1 Ke, X. et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet. 13, 577-588
2 Bentley, D.R. et al. (2001) The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature 409, 942-943
3 Lander, E.S. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921
4 Clark, A.G. et al. (2003) Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302, 1960-1963
5 McVean, G.A. et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304, 581-584
6 Rosenberg, N.A. et al. (2002) Genetic structure of human populations. Science 298, 2381-2385
7 Falush, D. et al. (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567-1587
8 Pritchard, J.K. et al. (2000) Inference of population structure using multilocus genotype data. Genetics 155, 945-959
9 Kong, A. et al. (2002) A high-resolution recombination map of the human genome. Nat. Genet. 31, 241-247
10 Myers, S.R. and Griffiths, R.C. (2003) Bounds on the minimum number of recombination events in a sample history. Genetics 163, 375-394