Fine Scale Mapping and the Coalescent
•The Fundamental Problem
•The Data
•Genotype to Phenotype Functions
•Types of Mapping
•Population Set-up & Measures of Dependency
•The Calculations
•Practical Considerations
Genotype and Phenotype Covariation: Gene Mapping
Sampling Genotypes and Phenotypes
Decay of local dependency
Time
Reich et al. (2001)
Genotype --> Phenotype Function
Result: The Mapping Function
Dominant/Recessive.
Penetrance
A set of characters.
Binary decision (0,1).
Spurious Occurrence
Quantitative Character.
Heterogeneity
genotype
Genotype  Phenotype
phenotype
Pedigree Analysis & Association Mapping
Pedigree Analysis:
Pedigree known
Few meioses (max 100s)
Resolution: cMorgans (Mbases)
Association Mapping:
Pedigree unknown
Many meioses (>10^4)
2N generations
Resolution: 10^-5 Morgans (Kbases)
[Figure: loci D and M at recombination distance r, drawn for each of the two designs]
Adapted from McVean and others
Causes of linkage disequilibrium
[Figure: loci D and M at time t ago and now]
Creates LD:
Drift
Selection
Admixture
Breaks down LD:
Recombination
Gene conversion
Significance of a Single Association
[Figure: disease locus and marker locus]
Test for independence in a 2 x 2 contingency table:

            A        a        Total
B           XA,B     Xa,B     X.,B
b           XA,b     Xa,b     X.,b
Total       XA,.     Xa,.     X.,.
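A minimal sketch of the test for independence on such a 2 x 2 table, assuming the Pearson chi-square statistic with one degree of freedom is the intended test (the slide does not name one). The function name and the example counts are hypothetical.

```python
import math

def chi_square_2x2(x_AB, x_aB, x_Ab, x_ab):
    """Pearson chi-square test of independence for a 2 x 2 count table.

    Returns (chi2, p_value); the p-value uses the 1-df chi-square tail,
    which equals erfc(sqrt(chi2 / 2)).
    """
    total = x_AB + x_aB + x_Ab + x_ab              # X.,.
    row_B, row_b = x_AB + x_aB, x_Ab + x_ab        # X.,B and X.,b
    col_A, col_a = x_AB + x_Ab, x_aB + x_ab        # XA,. and Xa,.
    observed = [x_AB, x_aB, x_Ab, x_ab]
    expected = [row_B * col_A / total, row_B * col_a / total,
                row_b * col_A / total, row_b * col_a / total]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: strong association versus none at all.
chi2_assoc, p_assoc = chi_square_2x2(30, 10, 10, 30)
chi2_null, p_null = chi_square_2x2(20, 20, 20, 20)
```

With small expected counts an exact test would be the safer choice; this sketch does not cover that case.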
Measuring Linkage Disequilibrium between 2 Loci with 2 Alleles
Remade from McVean
$D_{A,B} = f_{A,B} - f_A f_B = -D_{a,B} = -D_{A,b} = D_{a,b}$
Correlation coefficient measure [0,1], Hill & Robertson (1968):
$r^2_{AB} = \dfrac{D_{AB}^2}{f_A f_a f_B f_b}$
Range constrained by allele frequencies [0,1], Lewontin (1964):
$D'_{AB} = \dfrac{D_{AB}}{\min(f_A f_b,\; f_a f_B)}$ if $D_{AB} \ge 0$, else $D'_{AB} = \dfrac{-D_{AB}}{\min(f_A f_B,\; f_a f_b)}$
Odds-ratio formulation, Devlin & Risch (1995): a measure defined from $D_{AB}$, $f_{AB}$, $f_A$ and $f_B$.
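A small sketch of how D, r^2 and Lewontin's D' can be computed from the haplotype frequency f_AB and the allele frequencies f_A and f_B. The function name and the example frequencies are hypothetical; D' is returned as an absolute value in [0,1], matching the range stated above.

```python
def ld_measures(f_AB, f_A, f_B):
    """Return (D, r2, D_prime) for two biallelic loci.

    f_AB is the frequency of the A-B haplotype; f_A and f_B are the
    allele frequencies, both assumed to lie strictly between 0 and 1.
    """
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    D = f_AB - f_A * f_B
    r2 = D * D / (f_A * f_a * f_B * f_b)
    if D >= 0:
        d_prime = D / min(f_A * f_b, f_a * f_B)
    else:
        d_prime = -D / min(f_A * f_B, f_a * f_b)
    return D, r2, d_prime

# Hypothetical examples: complete LD, no LD, and negative D.
complete = ld_measures(0.5, 0.5, 0.5)
none = ld_measures(0.3, 0.6, 0.5)
negative = ld_measures(0.1, 0.5, 0.5)
```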
Examples of Associations: Pairwise, Triple,...
Disease locus
Marker loci
Combine Single (Pairwise) to Multiple Tests
Bonferroni
Sharper bounds using linkage information.
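A minimal sketch of the Bonferroni step for combining single-marker tests: each of the m p-values is compared with alpha / m. The function name and the p-values are hypothetical. Because neighbouring markers are in LD, the tests are not independent and this bound is conservative, which is the motivation for sharper bounds that use linkage information.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag the tests that stay significant after Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m          # per-test significance level
    return [p <= threshold for p in p_values]

# Hypothetical single-marker p-values for six markers.
marker_p_values = [0.001, 0.02, 0.04, 0.5, 0.9, 0.3]
significant = bonferroni(marker_p_values)
```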
ApoE and Alzheimers Syndrome
Causative SNP
6 markers with low association
Martin et al 2000
The coalescent with recombination or gene conversion
Adapted from Hudson 1990
Recombination:
Gene Conversion:
Local trees for recombination and gene conversion
[Figure: local trees along the sequence for four sampled sequences (1 to 4), in two panels labelled Recombination and Gene conversion. One panel shows Tree 1, Tree 2, Tree 3; the other shows Tree 1, Tree 2, Tree 1.]
Measures of tree similarity
Target tree
Region with no recombination
Same tree as target
Same topology as target
Same MRCA as target
[Figure: example local trees on leaves 1 to 5, labelled "Same tree", "Same topology" and "Same MRCA"]
Local trees of the target and other positions
Sample size = 20
Only recombination, r=2.
Also gene conversion g/4
From Mikkel Schierup
Probability that the largest segment does not include the target
Recombination/gene conversion rate    R=2, G=0    R=2, G=8
#segments with same tree              1.02        1.8
P(target segment not largest)         0.2%        14%
#segments same topology               1.02        2.1
P(target segment not largest)         0.3%        20%
#segments same TMRCA                  1.1         2.9
P(target segment not largest)         1.5%        25%
From Mikkel Schierup
Quantifying the mosaicism caused by Gene Conversion
A and B are the most distant markers in significant LD with the target (marker order: A, Target, B)
What is the proportion of markers between these also in significant LD?
Rho=4, G=0: 56%
Rho=4, G=16: 33%
From Mikkel Schierup
Development of multi-locus association methods
Single Marker Methods
•Kaplan et al. (1995), Rannala & Slatkin (1998)
Problem: Difficult to combine markers.
Haplotype methods with star-shaped genealogies
•Terwilliger (1995), Graham & Thompson (1998), McPeek & Strahs (1999), Morris et al. (2000)
Problem: wrong genealogy, gives overconfidence in result.
Haplotype methods based on the coalescent
•Rannala & Reeve (2001), Morris et al. (2002), Larribe et al. (2003).
Problem: computationally intensive
Based on Morris et al. 2002
Probability of Data I:
3 step approach:
I Probability of Data given topology and branch lengths
Felsenstein81 for each column
Multiply for all columns
Example alignment: GCAGGTT, TCAGCCT, TCAGCAT
II Integrate over branch lengths
III Sum over topologies
$$P(Data) = \sum_{Topo} \int_0^\infty \int_0^\infty \cdots \int_0^\infty P(Data \mid Topo, t_k, t_{k-1}, \ldots, t_2)\; e^{-\binom{k}{2} t_k/\theta}\, dt_k\; e^{-\binom{k-1}{2} t_{k-1}/\theta}\, dt_{k-1} \cdots dt_2$$
Number of ranked topologies in the sum for n sequences (k = n above): $\prod_{j=3}^{n} \binom{j}{2}$
Conclusion: Exact Calculation Computationally Intractable!!
Probability of Data II:
Griffiths & Tavaré TPB 46.2.131-159
$$q(n) = \sum_{n' \to n} q(n')\, f(n, n')$$
Example: n = {ACCTAGGAT, TCCTAGGAT, TCCTAGGAT}
Predecessor states n': (1,2) coalescence, 3*9*3 mutations
q(n''): determined by equilibrium distribution.
Griffiths-Ethier-Tavare Recursions
$$p_a(T, n) = \sum_{k:\, n_k \ge 2} \frac{n_k (n_k - 1)}{n (n - 1 + \theta)}\; p_a(T, n - e_k) \;+\; \text{mutation terms in } p_a(T', n')$$
Example of a coalescence step: n=(3,1,2) goes to n=(2,1,2)
[Figure: gene trees with leaves 1, 2, 3 before and after the step]
Griffiths-Marjoram (1996) included recombination in the equations.
Example: Solving Linear System
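$$q(x) = \sum_{y \in A} r(x, y)\, q(y) + \sum_{z \in B} r(x, z)\, q(z), \quad \text{for } x \in B$$
q(x) is known when $x \in A$ and unknown when $x \in B$.
[Figure: graph of states with known values q( ) on A, unknown values (??) on B, and edge weights r( , )]
Substituting the equation into itself repeatedly:
$$q(x) = \sum_{y \in A} r(x, y)\, q(y) + \sum_{y_1 \in B} \sum_{y \in A} r(x, y_1)\, r(y_1, y)\, q(y) + \sum_{y_1 \in B} \sum_{y_2 \in B} \sum_{y \in A} r(x, y_1)\, r(y_1, y_2)\, r(y_2, y)\, q(y) + \ldots$$
$$= \sum_{k=0}^{\infty} \Big\{ \sum_{y_1 \in B} \cdots \sum_{y_k \in B} \sum_{y \in A} r(x, y_1) \cdots r(y_k, y)\, q(y) \Big\}$$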
Example: Solving Linear System
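Construct Markov transition function, A(x,y), with following properties:
i) A(x,y) > 0 when r(x,y) > 0
ii) The chain visits A with certainty.
With $X_0 = x_0$ and $\tau$ the first time the chain enters A:
$$q(x_0) = E_{x_0}\Big\{ q(X_\tau) \prod_{k=1}^{\tau} \frac{r(X_{k-1}, X_k)}{A(X_{k-1}, X_k)} \Big\}$$
Estimate from m independent runs $X^j$ of the chain:
$$\hat{q} = \frac{1}{m} \sum_{j=1}^{m} q(X^j_{\tau_j}) \prod_{k=1}^{\tau_j} \frac{r(X^j_{k-1}, X^j_k)}{A(X^j_{k-1}, X^j_k)}$$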
•Introduced in coalescence theory by Griffiths & Tavare (1994)
•Griffiths & Marjoram (1996) included recombination
•Donnelly-Stephens-Fearnhead (2000-) accelerated these algorithms
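The Monte Carlo solution of the linear system can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: four states 0..3, known values q(0)=0 and q(3)=1 on the set A = {0, 3}, unknown values on B = {1, 2}, coefficients r equal to 0.5 for each neighbour, and a deliberately different proposal chain (0.4 down, 0.6 up). The exact solution of this toy system is q(1) = 1/3 and q(2) = 2/3.

```python
import random

def estimate_q(x0, r, proposal, q_known, m, rng):
    """Estimate q(x0) by averaging weighted runs of the proposal chain.

    r[x][y]        coefficients of the linear system
    proposal[x][y] transition probabilities of the simulated chain
    q_known        values of q on the set A where the chain stops
    """
    total = 0.0
    for _ in range(m):
        x, weight = x0, 1.0
        while x not in q_known:                      # run until A is hit
            targets = list(proposal[x])
            probs = [proposal[x][y] for y in targets]
            y = rng.choices(targets, weights=probs)[0]
            weight *= r[x].get(y, 0.0) / proposal[x][y]
            x = y
        total += weight * q_known[x]
    return total / m

# Hypothetical toy system.
r = {1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}}
proposal = {1: {0: 0.4, 2: 0.6}, 2: {1: 0.4, 3: 0.6}}
q_known = {0: 0.0, 3: 1.0}

rng = random.Random(0)
q1_hat = estimate_q(1, r, proposal, q_known, 20000, rng)
q2_hat = estimate_q(2, r, proposal, q_known, 20000, rng)
```

The choice of proposal chain only affects the variance of the estimate, not its expectation, as long as properties i) and ii) hold.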
The position of the marker locus is missing data
Larribe and Lessard (2002)
Data: haplotype, phenotype, multiplicity
Multiplicities shown: 15, 3, 6, 2, 1, 2, 1
Where is the disease-causing locus?
Likelihood as function of disease locus position
Bayesian approach to LD mapping
Continuous version of Bayes formula
$$f(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\, f(\text{parameters})}{P(\text{data})}$$
f(parameters) = prior distribution of parameters
P(data|parameters) = L(parameters) = likelihood function
f(parameters|data) = posterior distribution of parameters given data
The evolutionary parameter (e.g. disease location) is considered to have a prior distribution (any prior knowledge we may have), and we learn about the parameters through data.
Advantage: f(parameters|data) is the full distribution of the parameters of interest given data, from which e.g. credible intervals follow.
The basic equation
$$f(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\, f(\text{parameters})}{P(\text{data})}$$
Marginal posterior distribution of disease position:
$$P(\text{disease position } x \mid \text{data}) = \int_{\text{parameters except } x} P(\text{parameters} \mid \text{data})\; dP_1 \ldots dP_n$$
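A toy sketch of the marginal posterior of the disease position, computed on a grid instead of by sampling. The likelihood, the single nuisance parameter s, the grid and the uniform prior are all hypothetical choices for illustration; they are not the model of any of the cited methods.

```python
import math

def toy_likelihood(x, s):
    # Hypothetical likelihood: peaked at position 0.3 with width s.
    return math.exp(-0.5 * ((x - 0.3) / s) ** 2) / s

positions = [i / 100 for i in range(101)]   # candidate disease positions x
widths = [0.05, 0.1, 0.2]                   # nuisance parameter values s

# Uniform prior, so the joint posterior is proportional to the likelihood.
joint = {(x, s): toy_likelihood(x, s) for x in positions for s in widths}
evidence = sum(joint.values())              # plays the role of P(data)

# Marginal posterior of x: sum the joint posterior over all parameters except x.
marginal = {x: sum(joint[(x, s)] for s in widths) / evidence for x in positions}
best_position = max(marginal, key=marginal.get)
```

With many parameters the grid grows exponentially, which is why sampling methods are used in practice.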
Parameters in Shattered Coalescent Model
Morris, Whittaker and Balding (2001, 2003, 2004, ...)
P(x,h,W,T,z,N,ρ | A,U) ~ L(A,U | x,h,W,T,z,N) p(W,T,z | ρ) p(ρ)
p(ρ) = 2, p(W,T,z | ρ): prior distribution of genealogies (coalescent-like)
x       Location of disease locus
h       Population marker-haplotype proportions
W       branch lengths of genealogical tree
T       topology (branching pattern)
z       Parental-status
N       effective population size
ρ       shattering parameter
A, U    cases, controls
Probability of Haplotypes associated Mutant
At recombination, markers are incorporated from the population distribution.
Morris et al: The Shattered Coalescent
Advantages: Allows for multiple origins of the disease mutant
+ sporadic occurrences of the disease without the mutation
Coalescent tree
Morris, Whittaker & Balding,2002
Monte-Carlo (Metropolis) sampling and integration
Metropolis et al.(1953)
$$P(\text{disease position } x \mid \text{data}) = \int_{\text{parameters except } x} P(\text{parameters} \mid \text{data})\; dP_1 \ldots dP_n$$
•Evaluate the function in the current point p, f(p)=x
•Suggest a new point, p'
•Evaluate the function in this point f(p') = y
•If x < y, go to point p'
•If x > y, go to point p' with the probability y/x
Due to Jesper Nymann
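The Metropolis steps listed above can be sketched as follows. The two-parameter target density, the Gaussian random-walk proposal, the step size and the burn-in length are hypothetical choices for illustration. Averaging the first coordinate of the samples corresponds to projecting onto one axis, that is, integrating over the remaining parameter.

```python
import math
import random

def metropolis(f, start, n_steps, step_sd, rng):
    """Random-walk Metropolis sampler for an unnormalised density f."""
    p = list(start)
    x = f(p)                                    # function value at current point
    samples = []
    for _ in range(n_steps):
        p_new = [c + rng.gauss(0.0, step_sd) for c in p]   # suggest p'
        y = f(p_new)
        # Always move if y >= x; otherwise move with probability y/x.
        if y >= x or rng.random() < y / x:
            p, x = p_new, y
        samples.append(tuple(p))
    return samples

def toy_posterior(p):
    # Hypothetical unnormalised posterior: position centred at 1.0,
    # nuisance parameter centred at -2.0.
    return math.exp(-0.5 * ((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2))

rng = random.Random(1)
samples = metropolis(toy_posterior, (0.0, 0.0), 20000, 1.0, rng)
kept = samples[1000:]                           # discard a burn-in period
mean_position = sum(s[0] for s in kept) / len(kept)
```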
Monte-Carlo (Metropolis)
Projection on one axis is equivalent to integration over the remaining parameters
Due to Jesper Nymann
Example 1 - Cystic fibrosis
Morris et al. (2002).
Due to Jesper Nymann
Example 2 - BRCA2
Iceland Genomics Corporation:
1132 Cases, 54 with known mutation
758 Controls
Due to Jesper Nymann
Example 2 - BRCA2 continued
True Location
[Figure: Multipoint calculation for the full BRCA2 dataset; axis labelled 1 to 15]
[Figure: Multipoint calculation where the 54 known mutation cases have been removed; axis labelled 1 to 15]
Due to Jesper Nymann
The Basic Setup
Simulation Parameters:
Recombination rate = 50
Number of leaf nodes = 1000
Number of markers = 10
Diseased haplotype fraction: 0.08 – 0.12
No Heterogeneity
Simulated under the assumption of constant population size
Diplotypes (phase known)
Type of simulation      50% quantile
Basic (red curve)       0.044
Due to Jesper Nymann
The effect of marker density
Type of simulation                                        50% quantile
19 markers (blue curve)                                   0.0292
19 markers and recombination rate = 100 (yellow curve)    0.02321
Basic (red curve)                                         0.044
Due to Jesper Nymann
The effect of knowing phase
[Figure: haplotypes with known phase (alleles 0 and 1) beside genotype data in which some sites are shown as 0/1]
Type of simulation                 50% quantile
With Genotype data (blue curve)    0.05857
Basic (red curve)                  0.044
Due to Jesper Nymann
The Effect of knowing gene genealogy
Type of simulation                   50% quantile
With known genealogy (blue curve)    0.03516
Basic (red curve)                    0.044
Due to Jesper Nymann
The effect of disease fraction
Type of simulation                           50% quantile
Disease fraction 12% - 14% (blue curve)      0.0353
Disease fraction 18% - 22% (yellow curve)    0.03229
Basic (red curve)                            0.044
Due to Jesper Nymann
The effect of Heterogeneity
Type of simulation                 50% quantile
With Heterogeneity (blue curve)    0.065587
Basic (red curve)                  0.044
Due to Jesper Nymann
The effect of Impurity of cases and controls
[Figure: cases and controls]
33% of cases are moved to the controls and a similar number of controls are moved to the cases
Type of simulation                        50% quantile
With mixed cases/controls (blue curve)    0.1518
Basic (red curve)                         0.044
Due to Jesper Nymann
LD in background population
No LD in background:
Haplotype:    0 1 1 0 0 1 1 0 1 0
Probability:  P(0) P(1) P(1) P(0) P(0) P(1) P(1) P(0) P(1) P(0)
LD in background:
Haplotype:    0 1 1 0 0 1 1 0 1 0
Probability:  P(0) P(1|0) P(1|1) P(0|1) P(0|0) P(1|0) P(1|1) P(0|1) P(1|0) P(0|1)
[Figure: gene pool]
Type of simulation               50% quantile
LD in background (blue curve)    0.0419
Basic (red curve)                0.044
Due to Jesper Nymann
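A minimal sketch of the two background models, assuming biallelic markers coded 0/1 and, for the LD case, a first-order Markov chain along the haplotype as the conditional probabilities on the slide suggest. The allele frequency 0.5 and the conditional probabilities are hypothetical values.

```python
def haplotype_probability(hap, p1, cond_p1=None):
    """Probability of a 0/1 haplotype under the background model.

    p1       marginal probability of allele 1 at a marker
    cond_p1  optional dict {previous_allele: P(1 | previous_allele)};
             if None, markers are independent (no LD in background)
    """
    prob = p1 if hap[0] == 1 else 1.0 - p1
    for prev, cur in zip(hap, hap[1:]):
        q1 = p1 if cond_p1 is None else cond_p1[prev]
        prob *= q1 if cur == 1 else 1.0 - q1
    return prob

hap = [0, 1, 1, 0, 0, 1, 1, 0, 1, 0]          # haplotype shown on the slide
p_no_ld = haplotype_probability(hap, 0.5)
p_ld = haplotype_probability(hap, 0.5, {0: 0.2, 1: 0.8})
```

With these hypothetical values the haplotype is less likely under background LD, because it switches allele between neighbouring markers more often than the chain favours.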
Comparing the different scenarios
Simulation Type        Mean     50% Quantile   70% Quantile   95% Quantile
Basic                  0.059    0.044          0.070          0.193
19 markers rho=100     0.044    0.023          0.043          0.142
19 markers             0.053    0.029          0.047          0.176
18% - 22% cases        0.046    0.032          0.052          0.146
12% - 14% cases        0.048    0.035          0.050          0.136
Fixed topology         0.047    0.035          0.058          0.111
LD in background       0.078    0.042          0.072          0.273
Genotype Data          0.087    0.059          0.099          0.305
Heterogeneity          0.088    0.066          0.092          0.246
33% impure             0.173    0.152          0.217          0.452
Random                 0.303    0.273          0.407          0.696
Due to Jesper Nymann
Summary
The Fundamental Problem
The Data
Genotype to Phenotype Functions
Types of Mapping
Population Set-up & Measures of Dependency
Methods:
Pure Coalescent Based
The Shattered Coalescent
Factors influencing mapping error.
Articles I
M. A. Beaumont and B. Rannala (2004) The Bayesian Revolution in genetics, Nature Reviews, Genetics vol. 5. 251
Botstein D, Risch N. (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 33 Suppl:228-237.
Cardon, L. and J. Bell (2001) “Association Study Designs for Complex Diseases” Nature Reviews Genetics
Daly, M. J., Rioux, J. D., Schaner, S. F., Hudson, T. J. & Lander, E. S. (2001), High-resolution haplotype structure in the human genome, Nat
Genet 29(2), 229-232.
Devlin, B. & Roeder, K. (1999), Genomic control for association studies, Biometrics 55(4), 997-1004.
Frisse, L et al.(2001) Gene Conversion and Different Population Histories May Explain the Contrast between Polymorphisms and LD Levels.
AJHG 69..?-?
Gabriel, S. B. et al. (2002), The structure of haplotype blocks in the human genome, Science 296(5576), 2225-2229.
Griffiths, R. & S. Tavaré (1994) “Simulating probability distributions in the coalescent” Theor.Pop.Biol. 46.2.131-159
Griffiths, R. and P. Marjoram (1996) “Ancestral inference from samples of DNA sequences with recombination” J.Comput.Biol.
Hudson, R. R. (1990). Gene genealogies and the coalescent process, “Oxford Surveys in Evolutionary Biology” (D. Futuyma and J. Antonovics,
Eds.) Vol 7, pp. 1-44, Oxford Univ. Press, Oxford, UK
B. Kerem, J. M. Rommens, J. A. Buchanan D. Markiewicz, T. K. Cox, A. Chakravarti, M. Buchwald and L. C. Tsui Identification of the Cystic
Fibrosis Gene: Genetic Analysis Science 245: 1073-1080, 1989
Kong A, et al. (2002) A high-resolution recombination map of the human genome. Nat Genet. 31,241-7.
Laitinen et al. (2004) Characterization of a common susceptibility locus for asthma-related traits. Science 304, 300-304.
Martin, E. R., et al. (2000), SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer
disease, Am J Hum Genet 67, 383-394.
Larribe, M, S. Lessard and Schork (2002) “Gene Mapping via the Ancestral Recombination Graph”. Theor. Pop.Biol. 62.215-229.
Liu, J. et al. (2001) “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping” Genome Research 11.1716-24.
Metropolis N, AW Rosenbluth, MN Rosenbluth, AH Teller, E Teller (1953) Equation of state calculations by fast computing machines, J. Chem. Phys. 21:1087-1092
McVean,G.(2002) “A Genealogical Interpretation of Linkage Disequilibrium” Genetics 162.987-991
Morris, A., JC Whittaker and D. Balding (2002) “Fine-Scale Mapping of Disease Loci via Shattered Coalescent Modeling of Genealogies” AJHG 70.686-707.
Morris, J. C. Whittaker, and D. J. Balding (2004) Little loss of information due to unknown phase for fine-scale LD mapping with SNP
genotype data, AJHG . 74: 945-953, 2004
Andrew P. Morris, John C. Whittaker, Chun-Fang Xu, Louise K. Hosking, and David J. Balding Multipoint linkage-disequilibrium mapping
narrows location interval and identifies mutation heterogeneity, PNAS November 11, 2003, Vol. 100, 13442-13446
Articles II
McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. (2004) The fine-scale structure of recombination rate variation
in the human genome. Science 304:581-584.
Patil, N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science
294: 1719-1723.
Reich, D. E. et al. (2001), Linkage disequilibrium in the human genome, Nature 411(6834), 199-204.
Reich D. E. and Lander, E. On the allelic spectrum of human diseases. Trends in Genetics 19, 502-510.
Reich, D. E. et al. (2002), Human genome sequence variation and the influence of gene history, mutation and recombination, Nat
Genet 32(1), 135-142.
Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. (2000), Association mapping in structured populations, Am J Hum
Genet 67(1), 170-181.
Stefansson, H. et al. (2003), Association of neuregulin 1 with schizophrenia confirmed in a Scottish population, Am J Hum Genet
72(1), 83-87.
Stephens JC et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science.;293(5529):489-93.
Strachan, T. & Read, A. P. (2003) Human Molecular Genetics 3, BIOS Scientific Publishers Ltd, Wiley, New York.
Spielman R S and W J Ewens (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am. J. Hum. Genet. 59:983-989
The International HapMap Consortium (2003) The International HapMap Project. Nature 426, 789-795.
Weiss, KM and Clark, AG (2002) Linkage disequilibrium and the mapping of complex human traits. Trends in Genetics 18:19-24.
Pritchard, J and M. Przeworski (2001) Linkage Disequilibrium in Humans: Models and Data AJHG 69.1-14.
Pritchard and Cox (2002) “The allelic architecture of human disease genes: common disease-common variant … or not” Human Molecular Genetics 11.20.2417-2
Rannala, B and JP Reeve (2001) High-Resolution Multipoint Linkage-Disequilibrium Mapping in the Context of a Human Genome Sequence AJHG 69.159-178.
Tabor, Risch and Myers (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations Nature
Reviews Genetics 3.May.1-7
Terwilliger, JD et al (2002) A bias-ed assessment of the use of SNPs in human complex traits. Curr.Opin. Genetics & Development 12.726-34
Weiss, K and Terwilliger, J (2000) “How many diseases does it take to map a gene with SNPs?” Nature Genetics vol. 26 Oct.
Books & Www-sites
Books
Encyclopedia of the Human Genome (2003) Nature Publishing Group
Liu, J. (2001) “Monte Carlo Strategies in Scientific Computing” Springer Verlag
Ott, J. (1999) Analysis of Human Genetic Linkage 3rd edition Publisher: Johns Hopkins
Strachan & Read (2004)
Human Molecular Genetics III
Publisher: Biosciences
Weiss,K.(1993) “Genetic Variation and Human Disease” Cambridge University Press.
Web-sites
www.stats.ox.ac.uk/mcvean
Jeff Reeve and Bruce Rannala A multipoint linkage disequilibrium disease mapping program (DMLE+) that allows genotype data to be used
directly and allows estimation of allele ages.
http://dmle.org/
Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B. and N. Risch (Version upgraded by Xin Lu, June/9/2002) This is the software for the Bayesian haplotype analysis method developed by Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B. and N. Risch in the article Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping. Genome Research 11:1716, 2001
http://www.people.fas.harvard.edu/~junliu/TechRept/03folder/bladev2.tar
J. N. Madsen, M.H. Schierup, C. Storm, L. Schauser and T. Mailund. CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under the assumption of exponential population growth
http://www.birc.dk/Software/CoaSim/