Recombination and haplotype structure
Simon Myers, Gil McVean
Department of Statistics, Oxford
The starting point
•
We have a genome’s worth of data on genetic variation
•
We wish to understand why the haplotype structure looks the way it does
– Differences between regions, populations
Where do haplotypes come from?
•
In the absence of recombination, the most natural way to think about
haplotypes is in terms of the genealogical tree representing the history of
the chromosomes
•
Tree affects mutation patterns
•
Mutation patterns give information on tree
What determines the shape of the tree?
[Figure: ancestry of the current population, and ancestry of a sample, each traced back from the present day]
The coalescent: a model of genealogies
[Figure: a coalescent genealogy with a time axis running back from the present day; labels: ancestral lineages, coalescence, most recent common ancestor (MRCA)]
Simulating histories with the coalescent
Simulating data with the coalescent
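A minimal sketch of how a history and then data can be simulated under the standard coalescent without recombination. It assumes time is measured in units of 2N generations, that theta is the scaled mutation rate for the whole region, and that mutations follow the infinite-sites model. The function names (`simulate_coalescent`, `to_haplotypes`, `poisson`) are illustrative and not taken from any particular package.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson count by counting rate-1 arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_coalescent(n, theta, seed=None):
    """Simulate a genealogy for n chromosomes and drop mutations on it.

    Assumptions (not from the slides): time in units of 2N generations,
    mutations at rate theta/2 per lineage per unit time, infinite sites.
    Returns (time to MRCA, list of SNPs), each SNP being the set of
    samples that carry the derived allele.
    """
    rng = random.Random(seed)
    # Each ancestral lineage is stored as the set of samples below it.
    lineages = [frozenset([i]) for i in range(n)]
    snps = []
    tmrca = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages, the wait to the next coalescence is
        # exponential with rate k(k-1)/2.
        t = rng.expovariate(k * (k - 1) / 2)
        tmrca += t
        # Mutations fall on every lineage that exists during this interval.
        for lineage in lineages:
            for _ in range(poisson(rng, theta / 2 * t)):
                snps.append(lineage)
        # Coalescence: merge two lineages chosen uniformly at random.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [l for i, l in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
    return tmrca, snps


def to_haplotypes(n, snps):
    """Rows are chromosomes, columns are SNPs (1 = derived allele)."""
    return [[1 if i in snp else 0 for snp in snps] for i in range(n)]


tmrca, snps = simulate_coalescent(n=5, theta=4.0, seed=1)
haplotypes = to_haplotypes(5, snps)
```

Mutations are ordered here by the time interval in which they arose, not by chromosomal position; without recombination every position shares the same tree, so the order of columns carries no information.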
Haplotype structure in the absence of recombination
•
In the absence of recombination, the shape of the tree and where
mutations fall on it determine patterns of haplotype structure
•
Two mutations on the same branch will be in complete association;
mutations on different branches will have lower, and often low, association
r^2 = 1
r^2 = 0.04
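The r^2 values above measure association between two biallelic SNPs. A small sketch of the usual definition, r^2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA pB, assuming each site is coded as a list of 0/1 alleles with one entry per haplotype (the coding and the function name are assumptions for illustration):

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites coded as 0/1 per haplotype.

    Both sites must be polymorphic in the sample, otherwise the
    denominator is zero.
    """
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of haplotypes carrying the derived allele at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Two mutations on the same branch mark exactly the same chromosomes.
same_branch = r_squared([1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0])
# Two mutations on different branches mark disjoint sets of chromosomes.
different_branches = r_squared([1, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0])
```

In this toy sample the same-branch pair gives r^2 = 1 and the different-branch pair gives r^2 = 0.1.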
Haplotypes when there is recombination
•
When there is no recombination, haplotype structure reflects the age
distribution of mutations and the shape of the underlying tree
•
When there is some recombination, every nucleotide position has a tree,
but the tree changes along the chromosome at a rate determined by the
local recombination landscape
•
By using SNP information to inform us about the trees, we can learn about
how quickly the tree changes
– This relates to the recombination rate
A bit of recombination ‘shuffles’ genetic variation
Lots of recombination does lots of shuffling
Recombination and haplotype diversity
•
Without recombination, a new mutation can create at most one new
haplotype
– Any two mutations delineate at most 3 haplotypes in total (ancestral, plus two
new types)
•
With recombination, this mutation can spread onto every existing haplotype
background, creating the potential for more haplotypes
•
For a given number of SNPs, a region with recombination will tend to have,
in comparison to a region with no recombination:
– More haplotypes
– Less variance in the pairwise differences between haplotypes
– Less skewed haplotype frequencies
The ancestral recombination graph
•
The combined history of recombination, mutation and coalescence is
described by the ancestral recombination graph
[Figure: an ancestral recombination graph annotated with its events: Coalescence, Mutation, Coalescence, Coalescence, Mutation, Coalescence, Recombination]
In humans, recombination is not uniformly distributed
•
Most recombination occurs in recombination hotspots – short (1-2kb)
regions every 50-100kb that occupy at most 3% of the genome but
probably account for 90% or more of the recombination
•
This means that haplotype structure in humans is an interesting hybrid
between the no-recombination and lots-of-recombination situations
Learning about recombination
•
Just as there is a true genealogy underlying a sample of sequences
without recombination, there is a true ARG underlying samples of
sequences with recombination
•
We can consider nonparametric and parametric ways of learning about
recombination
•
There are useful nonparametric ways of learning about recombination
which we will consider first
– These really only apply to species, such as humans, where we can be fairly
sure that most SNPs are the result of a single ancestral mutation event
The signal of recombination?
[Figure: ancestral chromosome recombines; panels: Recurrent mutation, Recombination]
Detecting recombination from DNA sequence data
•
Look for all pairs of “incompatible” sites
•
Find the minimum number of intervals in which recombination events must
have occurred (Hudson and Kaplan 1985): Rm
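A sketch of these two steps, assuming haplotypes are equal-length strings of 0/1 alleles ordered along the chromosome and that each SNP arose by a single mutation. A pair of sites is "incompatible" when all four gametes 00, 01, 10 and 11 are present. Rm is then computed as the largest number of incompatible intervals that do not overlap, using a greedy scan by right endpoint. This is one way to arrive at the Hudson and Kaplan value, not a transcription of their published procedure, and the function names are illustrative.

```python
from itertools import combinations


def incompatible(haps, i, j):
    """Four-gamete test: True if sites i and j show all of 00, 01, 10, 11."""
    return len({(h[i], h[j]) for h in haps}) == 4


def hudson_kaplan_rm(haps):
    """Minimum number of recombination events implied by incompatible pairs."""
    n_sites = len(haps[0])
    # Every incompatible pair (i, j) needs a recombination somewhere
    # between site i and site j.
    intervals = [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if incompatible(haps, i, j)
    ]
    # Greedy choice of non-overlapping intervals, earliest right end first.
    # Intervals that only share an endpoint site do not overlap.
    intervals.sort(key=lambda interval: interval[1])
    rm, last_end = 0, -1
    for start, end in intervals:
        if start >= last_end:
            rm += 1
            last_end = end
    return rm


tree_like = ["00", "01", "11"]             # only three gametes: no signal
one_event = ["000", "010", "100", "110"]   # sites 0 and 1 are incompatible
```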
Improving the detection algorithm
•
Rm greatly underestimates the amount of recombination in the history of a
set of sequences
•
Myers and Griffiths (2003) developed an improved way of detecting
recombination events
– Without recombination, every new mutation can create only a single new
haplotype
– With recombination, mutations can be shuffled between haplotype backgrounds,
generating haplotype diversity
– Each recombination makes at most one new haplotype
– If I see H haplotypes with S segregating sites, at least H-S-1 recombination
events must have occurred
•
This offers potential to identify many more recombination events
– Carefully combine bounds from different collections of sites
– Dynamic programming algorithm makes computation extremely fast
– Better (sometimes slower) algorithms developed recently
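A sketch of the haplotype bound H - S - 1 described above, assuming haplotypes are equal-length strings of 0/1 alleles. The brute-force search over subsets of sites is for illustration only: it is exponential in the number of sites and is not the dynamic programming algorithm mentioned on this slide. Function names are illustrative.

```python
from itertools import combinations


def haplotype_bound(haps, sites):
    """H - S - 1 for the chosen sites, floored at zero."""
    sub = [tuple(h[s] for s in sites) for h in haps]
    n_haplotypes = len(set(sub))
    # A site is segregating if more than one allele is seen at it.
    n_segregating = sum(1 for column in zip(*sub) if len(set(column)) > 1)
    return max(0, n_haplotypes - n_segregating - 1)


def best_subset_bound(haps):
    """Largest haplotype bound over all subsets of two or more sites."""
    n_sites = len(haps[0])
    best = 0
    for size in range(2, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            best = max(best, haplotype_bound(haps, sites))
    return best


# All eight haplotypes at three sites: H = 8, S = 3, so at least 4 events.
all_eight = [format(x, "03b") for x in range(8)]
bound = best_subset_bound(all_eight)
```

For the eight-haplotype example the bound is 4, whereas counting non-overlapping incompatible intervals in the same data gives only 2.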
Problems with ‘counting’ recombination events
[Figure: a tree-pair where we could see recombination events, but don’t]
[Figure: tree-pairs where we cannot see recombination events]
Modelling recombination
•
Model-based approaches to learning about recombination allow us to ask
more detailed questions than nonparametric approaches
– What is the rate of recombination (as opposed to just the number of events)?
– Does gene A have a higher recombination rate than gene B?
– Is the rate of recombination across a region constant?
– Where are the recombination hotspots?
•
We can use coalescent-based approaches (approximations) to calculate
the likelihood of arbitrary recombination maps given observed data
Fitting a variable recombination rate
•
Use a reversible-jump MCMC approach (Green 1995)
[Figure: cold and hot blocks of recombination rate along the SNP positions]
Moves:
– Split blocks
– Merge blocks
– Change block size
– Change block rate
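Acceptance rates
\[
\alpha(\Theta, \Theta') = \min\left\{ 1,\;
\frac{L_C(\Theta')}{L_C(\Theta)} \cdot
\frac{\pi(\Theta')}{\pi(\Theta)} \cdot
\frac{q(\Theta', \Theta)}{q(\Theta, \Theta')} \cdot
\left| \frac{\partial(\Theta', u')}{\partial(\Theta, u)} \right|
\right\}
\]
Here \Theta is the current rate map, \Theta' the proposed one, and u, u' the sampled random numbers. The four factors are, in order:
– Composite likelihood ratio
– Ratio of priors
– Hastings ratio
– Jacobian of partial derivatives relating changes in dimension to sampled random numbers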
•
Include a prior on the number of change points that encourages smoothing
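A sketch of how the acceptance step can be evaluated once the four factors (composite likelihood ratio, ratio of priors, Hastings ratio, Jacobian) have been computed for a proposed move. Working on the log scale is an implementation choice assumed here to avoid numerical underflow. The function names are illustrative, and the code does not construct the split, merge or rate-change proposals themselves.

```python
import math
import random


def log_acceptance(log_lik_ratio, log_prior_ratio,
                   log_hastings_ratio, log_jacobian=0.0):
    """log of min(1, likelihood ratio x prior ratio x Hastings ratio x Jacobian).

    For moves that keep the number of blocks fixed, the Jacobian term
    is typically 1, so its log defaults to 0.
    """
    total = log_lik_ratio + log_prior_ratio + log_hastings_ratio + log_jacobian
    return min(0.0, total)


def accept(log_alpha, rng=random):
    """Accept the proposed move with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)


# A proposal that lowers the composite likelihood but is favoured by the prior.
log_alpha = log_acceptance(log_lik_ratio=-2.0, log_prior_ratio=0.5,
                           log_hastings_ratio=0.5)
```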
Strong concordance between fine-scale rate estimates from sperm and genetic variation
[Figure: rates estimated from genetic variation (McVean et al. 2004) alongside rates estimated from sperm (Jeffreys et al. 2001)]
Inferring hotspots
•
We perform a statistical test for hotspot presence
•
Based on an approximation to the coalescent similar to that used for rate
estimation
•
All previously identified hotspots are 1-2kb in size
– At a position in the genome, consider where a 2kb hotspot might be present
– Fit a model with a hotspot
– Fit a model without one
– Compare the two using an (approximate) likelihood ratio test
– Evaluate significance via simulation
– When the p-value is below a threshold, declare a hotspot
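A sketch of the final three steps, assuming the two models have already been fitted and that the same statistic has been recomputed on data sets simulated without a hotspot. The add-one correction in the p-value is a common Monte Carlo convention assumed here, not something stated on the slide, and the function names and numbers are illustrative.

```python
def likelihood_ratio_stat(loglik_hotspot, loglik_no_hotspot):
    """Twice the gain in (approximate) log likelihood from adding a hotspot."""
    return 2.0 * (loglik_hotspot - loglik_no_hotspot)


def monte_carlo_p_value(observed, simulated_null):
    """Share of null simulations at least as extreme as the observed value.

    The +1 in numerator and denominator counts the observed data set
    as one draw, so the p-value is never exactly zero.
    """
    exceed = sum(1 for s in simulated_null if s >= observed)
    return (exceed + 1) / (len(simulated_null) + 1)


def declare_hotspot(p_value, threshold):
    return p_value < threshold


# Hypothetical numbers, for illustration only.
observed = likelihood_ratio_stat(-100.0, -105.0)
p = monte_carlo_p_value(observed, [1.0, 2.0, 3.0, 12.0])
```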
Rates and hotspots across the human genome
[Figure: hotspots throughout the human genome (35,000 identified). From Myers et al. (2005)]
Applications of recombination approaches to real data
•
Rates and hotspots across the human genome (Myers et al. 2005)
– Previously, no understanding of why hotspots localise where they do
– Can 35,000 hotspots, accounting for >50% of human recombination, help?
•
Comparison of recombination rates (Winckler et al. 2005, Ptak et al. 2005)
– Between humans and chimpanzees
– At individual recombination hotspots
•
Understanding genomic rearrangements (Myers et al., submitted!)
– Cause a number of “genomic disorders”
– Relationship to recombination hotspots
32,996 Phase II HapMap hotspots
Estimated 50-70% of all human recombination
Hotspots on all chromosomes, including X
THE1B (LTR of retrotransposon)
~20,000 hotspots localised to within 5kb
THE1B: Found in 1196 hotspots versus 606 coldspots (p << 10^-20)
AluY: Found in 3635 hotspots versus 3262 coldspots (p = 7 x 10^-5)
THE1 consensus:
...CTTCCGCCATGATTGTGAGGCCTCCCCAGCCATGTGGAACTGTGAGTCCATT...
CCTCCCTAGCCAC (n=165)
CCNCCNTNNCCNC (n=263)
CCTCCCCNNCCAT (n=10,690)
~3-4% of hotspots
L2 consensus:
...TGTCACCTCCTCAGAGAGGCCTTCCCTGACCACCCTATCTAAAATWGCACACC...
CCTCCCTGACCAC (n=157)
CCNCCNTNNCCNC (n=6,901)
CTTCCCTNNCCAC (n=1,211)
~3-4% of hotspots
AluY, AluSc, AluSg consensus:
...CTCCTGACCTCGTGATCCGCCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAG...
CCGCCTTGGCCTC (n=14,028)
CCNCCNTNNCCNC (n=15,706)
CCGCCTCNNCCTC (n=55,916)
~3-4% of hotspots, including DNA3
Human hotspot motifs
•
In humans, specific words produce recombination hotspot activity
•
Hotspot motif CCTCCCTNNCCAC (p < 10^-33)
– Raises probability of a hotspot across genetic backgrounds
– Degenerate versions CCNCCNTNNCCNC and truncated CCTCCCT also raise probability, to a lesser extent
– Motif explains ~40% of human hotspots
– Operates in both sexes
– We don’t know, very clearly, which hotspots
– On THE1 background, hotspot 70-80% of time!
•
Biology not clearly understood
•
We identified a second, different hotspot motif (the best 9bp motif), CCCCACCCC,
also by comparison of hot and cold regions of the genome
Variation in individual hotspots
Sequence variation affects recombination at DNA2 (Jeffreys and Neumann, Nature Genetics 2002)
SNPs disrupting hotspots disrupt motifs
•
DNA2: Jeffreys and Neumann (Nature Genetics 2002, Hum Mol Genet 2005)
Hot: AAAAGACAGCCTCCCTGTTGCTGC
Cold: AAAAGACAGCCCCCCTGTTGCTGC
Disruption of CCTCCCT, best 7bp motif
•
NID1:
Hot: CACCCCCCACCCCACCCCAACATA
Cold: CACCTCCCACCCCACCCCAACATA
Disruption of CCCCACCCC, best 9bp motif
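A sketch of how motif occurrences can be located in the hot and cold alleles above. It assumes sequences over A, C, G, T and motifs in which N stands for any base. A lookahead pattern is used so that overlapping copies are all reported. The function names are illustrative.

```python
import re

# Map each motif symbol to a regular-expression fragment (N = any base).
SYMBOLS = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def motif_positions(seq, motif):
    """0-based start positions of all matches, overlapping ones included."""
    pattern = "(?=" + "".join(SYMBOLS[base] for base in motif) + ")"
    return [m.start() for m in re.finditer(pattern, seq)]


dna2_hot = "AAAAGACAGCCTCCCTGTTGCTGC"
dna2_cold = "AAAAGACAGCCCCCCTGTTGCTGC"
nid1_hot = "CACCCCCCACCCCACCCCAACATA"
nid1_cold = "CACCTCCCACCCCACCCCAACATA"

dna2 = (motif_positions(dna2_hot, "CCTCCCT"),
        motif_positions(dna2_cold, "CCTCCCT"))
nid1 = (motif_positions(nid1_hot, "CCCCACCCC"),
        motif_positions(nid1_cold, "CCCCACCCC"))
```

On these fragments the DNA2 hot allele carries one copy of CCTCCCT and the cold allele none, while the NID1 hot allele carries two overlapping copies of CCCCACCCC and the cold allele one. To scan the opposite strand as well, pass `reverse_complement(seq)` to the same function.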
Role of motif in X-linked ichthyosis
VCX2
1/5000 births
Deletion breakpoint hotspot (Van Esch et al. 2005)
• The 1kb deletion hotspot contains 25 repeats of
CCTCCCTNNCCAC
• Highest motif density in any LCR in entire genome
• Strongly implicates motif in producing hotspot
• Points to a link between deletion-causing and “normal”
recombination
A more general link?
•
Many other diseases are caused by recombination-mediated deletions and duplications (NAHR)
– Smith-Magenis syndrome (hotspot)
– CMT1A (hotspot)
– NF1 microdeletion syndrome (hotspot)
– DiGeorge syndrome...
•
Two recent studies suggest normal hotspots and hotspots of disease-causing deletion may coincide
– de Raedt, Stephens et al. (Nature Genetics, 2006): two NF1 deletion hotspots both likely to coincide with crossover hotspots
– Lindsay et al. (ASHG, 2006): CMT1A deletion hotspot associated with crossover hotspot
Other “major” NAHR hotspots
CCNCCNTNNCCNC overrepresented in hotspots (p=0.0006)
Evolution of recombination – human vs. chimps
[Figure: LDhot hotspots and LDhat rate estimates for human and chimp]
No significant correlation in hotspot positions between species (Winckler
et al. Science 2005, Ptak et al. Nature Genetics 2005)
Reading
•
Haplotype structure and recombination
– The International HapMap Consortium: A haplotype map of the human
genome. Nature 2005, 437:1299-1320.
– McVean G, Spencer CCA, Chaix R: Perspectives on human genetic variation
from the International HapMap Project. PLoS Genetics 2005, 1:e54.
– Myers S, Bottolo L, Freeman C, McVean G, Donnelly P: A fine-scale map of
recombination rates and hotspots across the human genome.
Science 2005, 310:321-324.
•
The coalescent
– Nordborg M: Coalescent Theory. In The Handbook of Statistical Genetics
(eds Balding, Bishop and Cannings), 2001. Wiley & Sons.
– Hudson RR: Gene genealogies and the coalescent process. In Oxford
Surveys in Evolutionary Biology (eds Futuyma and Antonovics) 1990, 7:1-44.
Oxford University Press.
Selected references
- Jeffreys, A.J., L. Kauppi, and R. Neumann. 2001. Intensely punctate meiotic recombination in the
class II region of the major histocompatibility complex. Nat Genet 29: 217-222.
- Jeffreys, A.J. and R. Neumann. 2002. Reciprocal crossover asymmetry and meiotic drive in a
human recombination hot spot. Nat Genet 31: 267-271.
- Jeffreys, A.J. and R. Neumann. 2005. Factors influencing recombination frequency and
distribution in a human meiotic crossover hotspot. Hum Mol Genet 14: 2277-2287.
- Myers, S., L. Bottolo, C. Freeman, G. McVean, and P. Donnelly. 2005. A fine-scale map of
recombination rates and hotspots across the human genome. Science 310: 321-324.
- Ptak, S.E., D.A. Hinds, K. Koehler, B. Nickel, N. Patil, D.G. Ballinger, M. Przeworski, K.A. Frazer,
and S. Paabo. 2005. Fine-scale recombination patterns differ between chimpanzees and humans.
Nat Genet 37: 429-434.
- The International HapMap Consortium. 2005. A haplotype map of the human genome. Nature
437: 1299-1320.
- The International HapMap Consortium. 2007. The Phase II HapMap. Nature
- Winckler, W., S.R. Myers, D.J. Richter, R.C. Onofrio, G.J. McDonald, R.E. Bontrop, G.A. McVean,
S.B. Gabriel, D. Reich, P. Donnelly et al. 2005. Comparison of fine-scale recombination rates in
humans and chimpanzees. Science 308: 107-111.