C20

DTC bioinformatics module
Genome bioinformatics practical
Gil McVean
Divide into five groups. Each group will use and, where possible, develop bioinformatics
methods for the analysis of different aspects of genetic information. We will focus on human
chromosome 20, which at 63 Mb is one of the smallest in our genome but has been
extensively studied [1-3]. We will look at five areas:
i) Gene structure
Describe the distribution of gene structures (exons, introns, coding sequences, regulatory
signals) within the chromosome. Devise a simple parametric model for exon structure and
estimate the parameters from the data. Are genes and gene structures randomly distributed
along the chromosome? Do established and predicted genes have similar properties? Use the
complete list of genes and gene structures available from
http://www.ncbi.nlm.nih.gov/genome/guide/human/
See also
http://www.sanger.ac.uk/HGP/Chr20/
for more information and additional downloads.
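As a starting point for the parametric model, here is a minimal sketch in Python. It assumes (this is not prescribed by the practical) that exon lengths are log-normally distributed and that the number of exons per gene is geometric on 1, 2, 3, ...; both are illustrative choices that you should check against the data. The function names and the example numbers are made up.

```python
import math

def fit_lognormal(lengths):
    """Maximum-likelihood (mu, sigma) for log-normally distributed lengths."""
    logs = [math.log(x) for x in lengths]
    n = len(logs)
    mu = sum(logs) / n
    # The MLE of the variance divides by n, not n - 1.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def fit_geometric(counts):
    """Maximum-likelihood p for a geometric distribution on 1, 2, 3, ..."""
    return len(counts) / sum(counts)

# Made-up values for illustration only; replace with the real annotation.
exon_lengths = [120, 95, 210, 150, 88, 175]
exons_per_gene = [1, 4, 7, 3, 10, 5]

mu, sigma = fit_lognormal(exon_lengths)
p = fit_geometric(exons_per_gene)
print(f"exon length: mu={mu:.3f}, sigma={sigma:.3f}; exons per gene: p={p:.3f}")
```

Comparing the fitted distributions with histograms of the observed values is a quick way to see where such a simple model fails.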
ii) Base composition
Describe how the base composition varies along the chromosome in terms of nucleotide
frequencies and simple words. Develop a simple HMM to look for CpG islands. What is the
relationship between these and genes? Use the complete DNA sequence available from
http://genome.ucsc.edu/goldenPath/hg16/chromosomes/chr20.fa.zip
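One possible shape for the HMM is sketched below in Python: two hidden states ("background" and "island") emitting single nucleotides, decoded with the Viterbi algorithm. All probabilities are hypothetical placeholders, not estimates from chromosome 20. Note that this two-state model only detects GC-rich stretches; capturing the CpG dinucleotide itself requires emissions or states that depend on the previous base.

```python
import math

STATES = ("background", "island")

# Hypothetical parameters for illustration; estimate your own from data.
EMIT = {
    "background": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "island":     {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
TRANS = {
    "background": {"background": 0.99, "island": 0.01},
    "island":     {"background": 0.02, "island": 0.98},
}
START = {"background": 0.5, "island": 0.5}

def log_emit(state, base):
    """Log emission probability; unknown bases (e.g. N) are uninformative."""
    prob = EMIT[state].get(base)
    return math.log(prob) if prob is not None else 0.0

def viterbi(seq):
    """Return the most probable state path for a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + log_emit(s, seq[0]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + log_emit(s, base))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

toy = "AT" * 50 + "CG" * 50 + "AT" * 50
print(viterbi(toy).count("island"), "positions called as island")
```

For the full chromosome you would run this in chunks (or vectorise it), since a pure-Python loop over 63 Mb is slow.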
iii) Molecular evolution
Describe the extent and nature of the divergence between humans and chimps. Do all types
of mutation occur at the same rate? To what extent is divergence influenced by local base
composition? Do genes and non-coding sequences evolve at similar rates? Use the set of
human-chimp alignments available from
http://genome.ucsc.edu/goldenPath/panTro1/vsHg16/axtNet/chr20.axt.gz
You could also use the set of 7645 aligned human-chimp-mouse genes sequenced by Celera
(before they got bought out). Data available at
http://www.sciencemag.org/cgi/content/full/302/5652/1960/DC1 (see [4]).
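To get started on the alignments, the Python sketch below reads AXT blocks (a summary line followed by the two aligned sequences, with blocks separated by blank lines) and tallies matches, transitions, transversions and gaps. The function names and the toy block are invented for illustration; check the parsing against the real file before relying on the counts.

```python
from collections import Counter

PURINES = {"A", "G"}

def read_axt(lines):
    """Yield (summary, seq1, seq2) for each alignment block."""
    block = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank separators and comment lines
        block.append(line)
        if len(block) == 3:
            yield tuple(block)
            block = []

def classify(a, b):
    """Classify one alignment column."""
    a, b = a.upper(), b.upper()
    if a == "-" or b == "-":
        return "gap"
    if a not in "ACGT" or b not in "ACGT":
        return "other"  # e.g. N
    if a == b:
        return "match"
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if (a in PURINES) == (b in PURINES):
        return "transition"
    return "transversion"

def count_differences(lines):
    """Tally column classes over all blocks."""
    counts = Counter()
    for _summary, seq1, seq2 in read_axt(lines):
        counts.update(classify(a, b) for a, b in zip(seq1, seq2))
    return counts

# Toy block with made-up coordinates and sequences.
toy = [
    "0 chrA 1 8 chrB 1 7 + 100",
    "ACGTACGT",
    "ACATCC-T",
    "",
]
print(count_differences(toy))

# For the real data, something like:
#   import gzip
#   with gzip.open("chr20.axt.gz", "rt") as fh:
#       counts = count_differences(fh)
```

Splitting the tallies further by the flanking bases would let you ask whether divergence depends on local base composition, for example at CpG sites.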
iv) Population structure
Describe the extent of population differentiation among humans using the SNP genotype
information collected in a 10 Mb region of chromosome 20. If you didn’t know where
genotyped individuals came from, would you be able to classify people into different groups?
How do these groups compare with the geographical labels? Use the genotype information
available from
www.stats.ox.ac.uk/~mcvean/DTC/SNP
and the program STRUCTURE 2.1, available from
http://pritch.bsd.uchicago.edu/software.html
See [1,5] for a description of the data, [6] for a discussion of human structure and [7,8] for a
discussion of the STRUCTURE algorithm.
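Before running STRUCTURE, it can help to summarise differentiation between the labelled groups directly. The Python sketch below computes a per-SNP F_ST from allele frequencies as (H_T - H_S) / H_T for two equally weighted populations. This is one common summary statistic, not the method used by STRUCTURE, and the 0/1/2 genotype coding and function names are assumptions made for illustration.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele, from genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))

def fst(p1, p2):
    """F_ST for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    return (h_total - h_within) / h_total

# Made-up genotypes (copies of one allele per individual) at a single SNP.
pop_a = [0, 0, 1, 0, 1, 0, 0, 1]
pop_b = [2, 1, 2, 2, 1, 2, 1, 2]

p_a, p_b = allele_frequency(pop_a), allele_frequency(pop_b)
print(f"p_a={p_a:.3f}, p_b={p_b:.3f}, F_ST={fst(p_a, p_b):.3f}")
```

Averaging over many SNPs is usually done as a ratio of summed numerators to summed denominators rather than as a mean of per-SNP values.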
v) Recombination and linkage disequilibrium
How does the recombination rate vary along the chromosome? Does recombination rate
correlate with underlying genomic features such as gene location and GC content? Use
the pedigree-based estimates of the recombination rate available from [9]
http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v31/n3/full/ng917.html
the program RecMin, written by Simon Myers [10],
http://www.stats.ox.ac.uk/~myers/RecMin.html
and the LDhat package, which estimates recombination-rate variation from population genetic
data [5]. Use the genotype data from unrelated UK Caucasians available from
www.stats.ox.ac.uk/~mcvean/DTC/SNP
(software will be made available on Friday)
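While waiting for the software, pairwise linkage disequilibrium can be explored with a few lines of Python. The sketch below computes r^2 between two biallelic SNPs and applies the four-gamete test, which under the infinite-sites assumption indicates at least one recombination event between the sites. It assumes phased haplotypes coded 0/1; the genotype data are unphased, so they would first need phasing or a genotype-based estimator. The function names and toy data are made up.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites, from haplotypes coded 0/1."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of haplotypes carrying allele 1 at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom

def four_gametes(site_a, site_b):
    """True if all four haplotypes (00, 01, 10, 11) are observed."""
    return len(set(zip(site_a, site_b))) == 4

# Toy haplotypes: each list holds one site, one entry per chromosome.
snp1 = [0, 0, 1, 1]
snp2 = [0, 0, 1, 1]
snp3 = [0, 1, 0, 1]

print(r_squared(snp1, snp2), four_gametes(snp1, snp2))
print(r_squared(snp1, snp3), four_gametes(snp1, snp3))
```

Plotting r^2 against the physical distance between SNP pairs gives a first picture of how quickly LD decays along the region.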
Reference List
1 Ke,X. et al. (2004) The impact of SNP density on fine-scale patterns of linkage
disequilibrium. Hum. Mol. Genet. 13, 577-588
2 Bentley,D.R. et al. (2001) The physical maps for sequencing human chromosomes 1, 6, 9,
10, 13, 20 and X. Nature 409, 942-943
3 Lander,E.S. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409,
860-921
4 Clark,A.G. et al. (2003) Inferring nonneutral evolution from human-chimp-mouse
orthologous gene trios. Science 302, 1960-1963
5 McVean,G.A. et al. (2004) The fine-scale structure of recombination rate variation in the
human genome. Science 304, 581-584
6 Rosenberg,N.A. et al. (2002) Genetic structure of human populations. Science 298, 2381-2385
7 Falush,D. et al. (2003) Inference of population structure using multilocus genotype data:
linked loci and correlated allele frequencies. Genetics 164, 1567-1587
8 Pritchard,J.K. et al. (2000) Inference of population structure using multilocus genotype
data. Genetics 155, 945-959
9 Kong,A. et al. (2002) A high-resolution recombination map of the human genome. Nat.
Genet. 31, 241-247
10 Myers,S.R. and Griffiths,R.C. (2003) Bounds on the minimum number of recombination
events in a sample history. Genetics 163, 375-394