
Lecture 15: Linkage Analysis VII
Date: 10/14/02
• Correction: power calculation
• Lander-Green Algorithm
• (Titles on updated or added slides highlighted)
Sample Size Calculation
• What sample size is needed to achieve a particular statistical power for a test?
• We shall assume the relevant test statistic follows a chi-square distribution.
Sample Size Calculation (cont.)
$$\gamma = 1 - P\left(\chi^2_{df,c} \le \chi^2_\alpha\right)$$
• $\gamma$ is the statistical power.
• $\chi^2_\alpha$ is the critical value to reject $H_0$ with significance level $\alpha$.
• $c$ is the non-centrality parameter, usually the expectation of the log-likelihood ratio test statistic under a particular $H_A$ and experimental conditions.
• $df$ is the degrees of freedom.
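Sample Size Calculation (cont.)
$$c = \chi^{-1}_{\gamma,\,df,\,\chi^2_\alpha}, \qquad c = n\,E(G_0)$$
Here $\chi^{-1}_{\gamma,\,df,\,\chi^2_\alpha}$ is the non-centrality parameter at which a non-central chi-square with $df$ degrees of freedom exceeds the critical value $\chi^2_\alpha$ with probability $\gamma$, and $n$ is the sample size.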
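The calculation above can be sketched numerically. This is a minimal illustration, not the course's own code: it assumes df = 1 (so the non-central chi-square can be written through the standard normal distribution), and it treats E(G_0) as a known expected contribution per sampled unit. The function names and the value 0.1 are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_df1(c, alpha):
    """Power of a 1-df chi-square test with non-centrality parameter c at level alpha."""
    root_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # square root of the critical value
    root_c = sqrt(c)
    # A non-central chi-square with 1 df is (Z + sqrt(c))**2 for standard normal Z,
    # so it exceeds the critical value when Z + sqrt(c) falls outside +/- root_crit.
    upper_tail = 1.0 - STD_NORMAL.cdf(root_crit - root_c)
    lower_tail = STD_NORMAL.cdf(-root_crit - root_c)
    return upper_tail + lower_tail

def noncentrality_for_power(gamma, alpha, upper=1000.0, tol=1e-10):
    """Solve power_df1(c, alpha) = gamma for c by bisection (power increases with c)."""
    lower = 0.0
    while upper - lower > tol:
        middle = (lower + upper) / 2.0
        if power_df1(middle, alpha) < gamma:
            lower = middle
        else:
            upper = middle
    return upper

def sample_size(gamma, alpha, expected_stat_per_unit):
    """Smallest n with n * E(G_0) >= c, using the relation c = n * E(G_0)."""
    return ceil(noncentrality_for_power(gamma, alpha) / expected_stat_per_unit)

c = noncentrality_for_power(gamma=0.8, alpha=0.05)
print(round(c, 2))
print(sample_size(0.8, 0.05, expected_stat_per_unit=0.1))  # 0.1 is a made-up E(G_0)
```

With no effect (c = 0) the power reduces to the significance level, which is a quick sanity check on the power function.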
Modeling
• Test your modeling skills: propose a model for the following family ascertainment situation.
• Suppose probands are detected independently and with the same probability in each family, except that all secondary probands (second, third, etc., all to the same degree) are more easily detected than the first proband in a family.
• The model formulation and calculation of pr probabilities for families with 3 affected are now posted to the website.
Lander-Green Algorithm
• Like the Elston-Stewart algorithm, the Lander-Green algorithm models the pedigree and data as a Hidden Markov Model (HMM), except that the hidden states are the so-called inheritance vectors.
• Like the Elston-Stewart algorithm, the Lander-Green algorithm assumes that there is no interference.
LG – (Dis)Advantages
• The Lander-Green algorithm is linear in the number of loci and exponential in the number of members in the pedigree.
• Recall that the Elston-Stewart algorithm is complementary: linear in the number of members, but exponential in the number of loci.
• Simulation methods (MCMC in particular) are used to deal with pedigrees that have both many members and many loci.
LG – Inheritance Vector
• An inheritance vector is defined for each locus i in the dataset.
• It is a binary vector with two components for each non-founder individual in the pedigree. Thus, it is of length $2(n - f)$.
• The entry in the inheritance vector is 0 if the individual's allele at that position is grandmaternal. If grandpaternal, it is 1. There are $2^{2(n-f)}$ possible inheritance vectors for each locus.
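To make the count concrete, the following sketch enumerates every inheritance vector for a small pedigree. The function name and the example sizes are illustrative assumptions, not part of the lecture.

```python
from itertools import product

def inheritance_vectors(n_individuals, n_founders):
    """Return every binary inheritance vector of length 2(n - f)."""
    length = 2 * (n_individuals - n_founders)
    return list(product((0, 1), repeat=length))

# Hypothetical pedigree: 4 individuals, 2 of them founders.
vectors = inheritance_vectors(n_individuals=4, n_founders=2)
print(len(vectors))   # 2 ** (2 * (4 - 2)) = 16
print(vectors[:3])    # [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)]
```

The list doubles with every added meiosis, which is why the algorithm is exponential in the number of non-founders.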
LG – Inheritance Vector (cont)
• The inheritance vector holds information about the number of crossovers that occurred to produce each non-founder in the pedigree.
• Thus, it is appropriate for estimating recombination fractions, which is our goal here with the LG algorithm.
LG – Inheritance Vector Example
[Figure: pedigree of nine individuals (1–9) labeled with marker genotypes over alleles A and a; the non-founders are 4, 5, 7, 8, and 9.]

Gamete   v
4M       0|1
4P       0|1
5M       0|1
5P       0|1
7M       1
7P       0|1
8M       0
8P       0|1
9M       0
9P       0|1
LG – Simplification by Conditioning
• Fortunately, conditional on the inheritance vectors, the genotypes of each offspring are independent.
• Of course, conditional on the genotype, the phenotype probabilities are independent.
• Thus, we can calculate the probability for each individual in the pedigree independently of the others once we condition on the inheritance vectors.
LG – Hidden States
• The inheritance vector constitutes the unknown hidden state at each locus. We must define transition probabilities among the hidden states (from locus to locus).
• Begin by considering the transition probability between loci within a single individual, where the inheritance vector is of length 2.
• In that case, the hidden state at each locus is a binary vector of length 2.
LG – Initial State
• We must define the initial state of the first marker locus.
• Prior to viewing the genotypes, all inheritance vectors are equally likely.
• Assume the initial state of the inheritance vector at marker 1 is uniform over {(0,0), (1,0), (0,1), (1,1)}, where we list the maternal status first. In other words, marker 1 has probability ¼ of being in each of these possible states.
LG – Pairwise Transition Probabilities
• Because of the assumption of no interference, the transition probabilities from the state at locus i (rows) to the state at locus i+1 (columns) are given by:
$$
\begin{array}{c|cccc}
 & (0,0) & (1,0) & (0,1) & (1,1) \\ \hline
(0,0) & (1-\theta_i)^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & \theta_i^2 \\
(1,0) & \theta_i(1-\theta_i) & (1-\theta_i)^2 & \theta_i^2 & \theta_i(1-\theta_i) \\
(0,1) & \theta_i(1-\theta_i) & \theta_i^2 & (1-\theta_i)^2 & \theta_i(1-\theta_i) \\
(1,1) & \theta_i^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & (1-\theta_i)^2
\end{array}
$$
where $\theta_i$ is the recombination fraction between locus i and locus i+1.
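Each entry of the table is a product of two factors, one for the maternal meiosis and one for the paternal meiosis. A small sketch, with hypothetical function names and an arbitrary value of 0.1 for the recombination fraction, builds the matrix that way:

```python
STATES = [(0, 0), (1, 0), (0, 1), (1, 1)]  # maternal status listed first

def meiosis_prob(a, b, theta):
    """Probability that one meiosis indicator goes from a to b between adjacent loci."""
    return theta if a != b else 1.0 - theta

def transition_matrix(theta):
    """4x4 matrix; rows index the state at locus i, columns the state at locus i+1."""
    return [
        [meiosis_prob(v[0], w[0], theta) * meiosis_prob(v[1], w[1], theta) for w in STATES]
        for v in STATES
    ]

matrix = transition_matrix(0.1)  # 0.1 is an arbitrary illustrative value
for state, row in zip(STATES, matrix):
    print(state, [round(p, 2) for p in row])
```

Every row sums to 1, as it must for a transition matrix, and the matrix is symmetric because a recombination is equally likely in either direction.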
LG – Switch in Notation
• From this point on, assume there are n nonfounders (rather than n – f).
• The reason for this change is simplification of the equations.
LG – Inheritance Vector Transition Probabilities
• The transition probabilities between inheritance vectors defined on full pedigrees with n relevant members are given by
$$P(w \mid v) = \theta_i^{\,d(v,w)}\,(1-\theta_i)^{\,2n-d(v,w)}$$
where $d(v,w)$ is the Hamming distance between inheritance vectors v and w, i.e. the number of discordances between them.
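As a quick check that this defines a proper transition distribution, note that exactly $\binom{2n}{d}$ vectors $w$ lie at Hamming distance $d$ from a given $v$, so the binomial theorem gives

$$\sum_{w} P(w \mid v) = \sum_{d=0}^{2n} \binom{2n}{d}\,\theta_i^{\,d}\,(1-\theta_i)^{\,2n-d} = \bigl(\theta_i + (1-\theta_i)\bigr)^{2n} = 1 .$$

For $n = 1$ the three cases $d = 0, 1, 2$ give $(1-\theta_i)^2$, $\theta_i(1-\theta_i)$, and $\theta_i^2$, matching the single-individual table.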
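LG – Forward Variable
$$\alpha_i(b) = P(O_1,\ldots,O_i,\, v_i = b \mid \theta)$$
$$\alpha_1(b) = \tfrac{1}{4}\,P(O_1 \mid v_1 = b)$$
$$
\begin{aligned}
\alpha_{i+1}(b) &= \sum_{v_i} P(O_1,\ldots,O_{i+1},\, v_{i+1} = b,\, v_i \mid \theta) \\
&= \sum_{v_i} P(O_{i+1},\, v_{i+1} = b \mid \theta,\, O_1,\ldots,O_i,\, v_i)\; P(O_1,\ldots,O_i,\, v_i \mid \theta) \\
&= \sum_{v_i} \alpha_i(v_i)\, P(v_{i+1} = b \mid \theta,\, v_i)\; P(O_{i+1} \mid \theta,\, v_{i+1} = b) \\
&= P(O_{i+1} \mid \theta,\, v_{i+1} = b) \sum_{v_i} \alpha_i(v_i)\, P(v_{i+1} = b \mid \theta,\, v_i)
\end{aligned}
$$
The factor 1/4 is the uniform initial probability for a single non-founder; with n nonfounders it is $2^{-2n}$.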
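LG – Backward Variable
$$\beta_i(b) = P(O_{i+1},\ldots,O_l \mid v_i = b,\, \theta)$$
$$\beta_l(b) = 1$$
$$
\begin{aligned}
\beta_i(b) &= \sum_{v_{i+1}} P(O_{i+1},\ldots,O_l,\, v_{i+1} \mid v_i = b,\, \theta) \\
&= \sum_{v_{i+1}} P(O_{i+1},\ldots,O_l \mid v_i = b,\, v_{i+1},\, \theta)\; P(v_{i+1} \mid v_i = b,\, \theta) \\
&= \sum_{v_{i+1}} P(O_{i+2},\ldots,O_l \mid v_i = b,\, v_{i+1},\, O_{i+1},\, \theta)\; P(O_{i+1} \mid v_i = b,\, v_{i+1},\, \theta)\; P(v_{i+1} \mid v_i = b,\, \theta) \\
&= \sum_{v_{i+1}} \beta_{i+1}(v_{i+1})\; P(O_{i+1} \mid v_{i+1})\; P(v_{i+1} \mid v_i = b,\, \theta)
\end{aligned}
$$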
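LG – xi(v,w)
$$
\begin{aligned}
\xi_i(v,w) &= P(v_i = v,\, v_{i+1} = w \mid O,\, \theta) \\
&= \frac{P(v_i = v,\, v_{i+1} = w,\, O \mid \theta)}{P(O \mid \theta)} \\
&= \frac{\alpha_i(v)\, P(w \mid v, \theta)\, P(O_{i+1} \mid w)\, \beta_{i+1}(w)}{\sum_{v,w} \alpha_i(v)\, P(w \mid v, \theta)\, P(O_{i+1} \mid w)\, \beta_{i+1}(w)}
\end{aligned}
$$
Here $P(w \mid v, \theta)$ is the transition probability and $P(O_{i+1} \mid w)$ depends on the penetrance parameter.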
LG – Baum’s Lemma
• Baum’s Lemma: Let
$$Q(\theta, \theta') = \sum_v P(v, O \mid \theta)\, \log P(v, O \mid \theta')$$
If
$$Q(\theta, \theta') \ge Q(\theta, \theta)$$
then
$$P(O \mid \theta') \ge P(O \mid \theta)$$
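LG – Proof of Baum’s Lemma
$$\frac{P(O \mid \theta')}{P(O \mid \theta)} = \sum_v \frac{P(v, O \mid \theta)}{P(O \mid \theta)} \cdot \frac{P(v, O \mid \theta')}{P(v, O \mid \theta)}, \qquad \sum_v \frac{P(v, O \mid \theta)}{P(O \mid \theta)} = 1$$
$$\log \frac{P(O \mid \theta')}{P(O \mid \theta)} \ge \sum_v \frac{P(v, O \mid \theta)}{P(O \mid \theta)}\, \log \frac{P(v, O \mid \theta')}{P(v, O \mid \theta)} = \frac{Q(\theta, \theta') - Q(\theta, \theta)}{P(O \mid \theta)}$$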
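LG – Jensen’s Inequality
$$\log \frac{P(O \mid \theta')}{P(O \mid \theta)} \ge \sum_v \frac{P(v, O \mid \theta)}{P(O \mid \theta)}\, \log \frac{P(v, O \mid \theta')}{P(v, O \mid \theta)}$$
$$\log E_v\!\left[\frac{P(v, O \mid \theta')}{P(v, O \mid \theta)}\right] \ge E_v\!\left[\log \frac{P(v, O \mid \theta')}{P(v, O \mid \theta)}\right]$$
$$E[f(x)] \le f(E[x]) \quad \text{when } f(x) \text{ is a concave function}$$
Here $E_v$ is the expectation over $v$ with weights $P(v, O \mid \theta)/P(O \mid \theta)$.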
LG – EM Algorithm
• We maximize $Q(\theta, \theta')$ over $\theta'$ to increase the likelihood: by Baum’s lemma the maximizing $\theta'$ satisfies $P(O \mid \theta') \ge P(O \mid \theta)$, where $\theta$ is the current parameter estimate.
• This may sound familiar. It is the M step of the EM algorithm, and the EM algorithm is how we maximize the likelihood over $\theta$ for a pedigree.
• Details are shown below. Maximization is the difficult step. We show it first.
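LG - Maximization
$$Q(\theta, \theta') = \sum_v P(v, O \mid \theta)\, \log P(v, O \mid \theta')$$
Key step: by conditional independence, the probability $P(v, O \mid \theta')$ becomes a product of conditional probabilities.
$$Q(\theta, \theta') = \sum_v P(v, O \mid \theta)\, \log\bigl[P(v_1)\, P(v_2 \mid v_1, \theta_1') \cdots P(O_1 \mid v_1)\, P(O_2 \mid v_2) \cdots\bigr]$$
$$
\begin{aligned}
\frac{\partial Q(\theta, \theta')}{\partial \theta_i'} &= \sum_v P(v, O \mid \theta)\, \frac{\partial}{\partial \theta_i'} \log\bigl[P(v_1)\, P(v_2 \mid v_1, \theta_1') \cdots P(O_1 \mid v_1)\, P(O_2 \mid v_2) \cdots\bigr] \\
&= \sum_v P(v, O \mid \theta)\, \frac{\partial}{\partial \theta_i'} \log P(v_{i+1} \mid v_i, \theta_i')
\end{aligned}
$$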
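LG - Maximization (cont.)
$$
\begin{aligned}
\frac{\partial Q(\theta, \theta')}{\partial \theta_i'} &= \sum_v P(v, O \mid \theta)\, \frac{\partial}{\partial \theta_i'} \log P(v_{i+1} \mid v_i, \theta_i') \\
&= \sum_v P(v, O \mid \theta)\, \frac{\partial}{\partial \theta_i'} \log\bigl[\theta_i'^{\,d_i}\,(1-\theta_i')^{\,2n-d_i}\bigr] \\
&= \sum_v P(v, O \mid \theta) \left[\frac{d_i}{\theta_i'} - \frac{2n - d_i}{1-\theta_i'}\right] \\
&= \sum_v P(v, O \mid \theta)\, \frac{d_i - 2n\,\theta_i'}{\theta_i'\,(1-\theta_i')}
\end{aligned}
$$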
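LG – EM Algorithm (M Step)
$$\frac{\partial Q(\theta, \hat\theta')}{\partial \theta_i'} = \sum_v P(v, O \mid \theta)\, \frac{d_i - 2n\,\hat\theta_i'}{\hat\theta_i'\,(1-\hat\theta_i')} = 0$$
$$\hat\theta_i' = \frac{\sum_v P(v, O \mid \theta)\, d_i}{\sum_v P(v, O \mid \theta)\, 2n} = \frac{\sum_v P(v \mid O, \theta)\, P(O \mid \theta)\, d_i}{\sum_v P(v \mid O, \theta)\, P(O \mid \theta)\, 2n} = \frac{E[d_i \mid O, \theta]}{2n}$$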
LG – EM Algorithm (E Step)
$$E[d_i(v,w) \mid O, \theta] = \sum_{v,w} d_i(v,w)\, P(v_i = v,\, v_{i+1} = w \mid O, \theta) = \sum_{v_i,\, v_{i+1}} d_i(v,w)\, \xi_i(v,w)$$
The sum runs over all pairs of inheritance vectors, and the $\xi_i(v,w)$ are the usual conditional probabilities needed to calculate the expectation.
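The forward, backward, E, and M computations can be put together in a short sketch of one EM iteration. This is an illustration under stated assumptions rather than the lecture's own implementation: `emission[i][b]` stands for $P(O_i \mid v_i = b)$ and is supplied as made-up numbers, the function names are hypothetical, and the example uses a single non-founder.

```python
from itertools import product

def hamming(v, w):
    """Number of discordant components between two inheritance vectors."""
    return sum(a != b for a, b in zip(v, w))

def transition(v, w, theta):
    """P(w | v) = theta^d (1 - theta)^(2n - d), where 2n = len(v)."""
    d = hamming(v, w)
    return theta ** d * (1.0 - theta) ** (len(v) - d)

def em_step(emission, thetas, n_nonfounders=1):
    """One EM iteration. Returns (updated thetas, likelihood at the input thetas)."""
    states = list(product((0, 1), repeat=2 * n_nonfounders))
    n_loci = len(emission)

    # Forward pass: alpha[i][b] = P(O_1..O_i, v_i = b | theta), uniform initial state.
    alpha = [{b: emission[0][b] / len(states) for b in states}]
    for i in range(1, n_loci):
        prev = alpha[-1]
        alpha.append({
            b: emission[i][b] * sum(prev[v] * transition(v, b, thetas[i - 1]) for v in states)
            for b in states
        })
    likelihood = sum(alpha[-1].values())

    # Backward pass: beta[i][b] = P(O_{i+1}..O_l | v_i = b, theta).
    beta = [None] * n_loci
    beta[-1] = {b: 1.0 for b in states}
    for i in range(n_loci - 2, -1, -1):
        beta[i] = {
            b: sum(beta[i + 1][w] * emission[i + 1][w] * transition(b, w, thetas[i])
                   for w in states)
            for b in states
        }

    # E step (expected Hamming distance via xi) followed by the M step (divide by 2n).
    updated = []
    for i in range(n_loci - 1):
        expected_d = sum(
            hamming(v, w) * alpha[i][v] * transition(v, w, thetas[i])
            * emission[i + 1][w] * beta[i + 1][w]
            for v in states for w in states
        ) / likelihood
        updated.append(expected_d / (2 * n_nonfounders))
    return updated, likelihood

def locus(p00, p01, p10, p11):
    """Hypothetical P(O_i | v_i) for the four states of one non-founder."""
    return {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}

# Made-up emission values for three loci; they are not taken from any real pedigree.
emission = [locus(0.4, 0.4, 0.1, 0.1), locus(0.35, 0.35, 0.15, 0.15), locus(0.1, 0.1, 0.4, 0.4)]
thetas = [0.3, 0.3]
for _ in range(5):
    thetas, likelihood = em_step(emission, thetas)
    print(likelihood, thetas)
```

By Baum's lemma the printed likelihood should never decrease from one iteration to the next. The cost of each pass grows with the square of the number of inheritance vectors, which is where the exponential dependence on pedigree size shows up.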
Heterogeneity in Recombination Fraction
• Allow for two recombination fraction parameters in each interval.
• Allow for one recombination fraction in each interval and a universal constant relating male and female recombination fractions.
• Use nested models to test for evidence of sex-based differences.
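As a sketch of the first option, assume each of the n nonfounders contributes one maternal and one paternal meiosis. The symbols $\theta_{f,i}$, $\theta_{m,i}$, $d_f$, and $d_m$ are introduced here for illustration and do not appear in the slides: they denote the female and male recombination fractions in interval i and the numbers of discordances among the maternal and paternal components of the inheritance vectors. The transition probability then factors by sex:

$$P(w \mid v) = \theta_{f,i}^{\,d_f(v,w)}\,(1-\theta_{f,i})^{\,n-d_f(v,w)}\;\theta_{m,i}^{\,d_m(v,w)}\,(1-\theta_{m,i})^{\,n-d_m(v,w)}$$

Setting $\theta_{f,i} = \theta_{m,i} = \theta_i$ recovers $\theta_i^{\,d}(1-\theta_i)^{\,2n-d}$ with $d = d_f + d_m$, so the single-parameter model is nested in the sex-specific one.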
Model Misspecification
• Penetrance parameters and allele frequencies may be incorrectly specified.
• The method is robust in the sense that the false positive rate for linkage is unaffected by misspecification of these parameters.
Model Misspecification and Ascertainment
• When ascertainment is made independent of disease state and marker loci, the method remains robust to misspecification in both.
• When ascertainment is made with respect to disease state, then the method is robust to misspecification of the disease parameters.
Effects on Power
• Power in two-point linkage analysis is largely unaffected as long as the dominance is specified correctly.
• Multipoint linkage analysis is much more sensitive to misspecification of the model. However, there is more information when model parameters are jointly estimated along with position.