QTL detection in Heterogeneous Stocks by Dynamic Programming
Richard Mott, Jonathan Flint
Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford OX3 7BN
UK
Problem statement. Equal numbers of animals from $S$ inbred strains (i.e. each strain is
completely homozygous) are introduced to each other and allowed to interbreed
randomly over a large number $G$ of generations. At the final generation the animals are
phenotyped for a quantitative trait and genotyped across a set of markers. We wish to
find the location(s) of QTLs for the trait.
Solution. Since the number of possible ancestral haplotype reconstructions increases
exponentially with the number of markers, it is infeasible to calculate the probability
of each haplotype separately. However, a dynamic programming (DP) algorithm greatly
reduces the complexity. Our analysis is two-stage: haplotype probability reconstruction
using dynamic programming, followed by hypothesis testing using linear regression.
We assume that at a QTL, a chromosome originating from the progenitor strain $s$
contributes an unknown additive amount $T_s$ to the phenotype, so that the expected
genetic effect for a diploid individual with ancestral alleles labelled $s, t$ at the trait
locus is $T_s + T_t$; a test for a QTL is equivalent to testing for differences between the
$T_s$'s. The DP method computes the probability $F_n(s,t)$ that a given individual has the
ancestral alleles $s, t$ at locus $n$. Then the expected phenotype is
$$2 \sum_s T_s \Big( \sum_t F_n(s,t) \Big),$$
and the $T_s$'s are estimated by a linear regression of the observed phenotypes on these
expected values across all individuals, followed by an analysis of variance to test
whether the progenitor estimates differ significantly.
The method’s effectiveness
depends on the ability to distinguish ancestral haplotypes across the interval; clearly the
power will be lower where all markers have the same non-informative allele
distribution, but markers share information where there is a mixture.
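The regression step described above can be illustrated with a small sketch. It is not taken from HAPPY; the function names, the 3-strain probability table and the strain effects are hypothetical. It shows how one individual's table of ancestral-allele probabilities collapses to an expected strain dosage (one row of the regression design matrix), and checks that weighting the dosages by the strain effects gives the same expected phenotype as summing over all pairs $s, t$.

```python
def strain_dosage(F):
    """Expected number of chromosomes (between 0 and 2) descended from each strain.

    F[s][t] is the probability that the individual carries ancestral
    alleles s, t at the locus; the entries are assumed to sum to 1.
    """
    S = len(F)
    return [sum(F[s][t] + F[t][s] for t in range(S)) for s in range(S)]


def expected_phenotype(F, T):
    """Expected genetic effect: sum over s, t of F[s][t] * (T[s] + T[t])."""
    S = len(F)
    return sum(F[s][t] * (T[s] + T[t]) for s in range(S) for t in range(S))


# Hypothetical 3-strain example (values invented for illustration).
F = [[0.10, 0.30, 0.05],
     [0.30, 0.10, 0.05],
     [0.05, 0.05, 0.00]]
T = [1.0, -0.5, 2.0]   # made-up additive strain effects

x = strain_dosage(F)   # one row of the regression design matrix
via_dosage = sum(T[s] * x[s] for s in range(len(T)))
direct = expected_phenotype(F, T)
```

Because the dosages of one individual always sum to 2, the columns of the design matrix are collinear with an intercept, so a regression built this way would need one strain effect constrained (an implementation detail the text does not specify).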
We test for a QTL in the intervals between adjacent markers rather than at each marker
locus. It is possible to generate pointwise logP values but they do not differ
significantly from the interval-wise values.
Dynamic programming algorithm. The dynamic-programming formulation of this
problem can be thought of as a Hidden Markov model, where the hidden states are the
progenitor haplotypes and the observed data are the genotypes. The breeding strategy for
the HS mice implies that after $G$ generations each chromosome will be a mosaic of the
progenitor haplotypes. Because most markers are not completely informative, it is not
possible to deduce the ancestral haplotypes at a marker locus without using information
from other markers to help resolve ambiguities. Define $P_n(s,t)$ to be the probability
that, for a certain individual, the progenitor haplotypes are $s, t$ at marker $n$, given (i) the
genotypes for the ordered markers numbered 1 through $n$, (ii) the founder strain
haplotypes, expressed as the probability $\lambda_n(s \mid a)$ that the ancestral state at marker $n$ on
a particular chromosome is $s$ given that the allele observed at that locus is $a$, and (iii) the
genetic distances $d_n$ between markers $n, n+1$. Ignoring interference effects, the
number of recombinants between markers $n, n+1$ is distributed as a Poisson random
variable with mean $G d_n$, so the probability of no recombination is $e^{-G d_n}$.
Consequently the prior probability that on a certain chromosome locus $n+1$ is in state
$s'$ given locus $n$ is in state $s$ is
$$r_n(s' \mid s) = \begin{cases} e^{-G d_n} + (1 - e^{-G d_n})/S, & s = s' \\ (1 - e^{-G d_n})/S, & s \ne s', \end{cases}$$
where $S$ is the number of strains. The prior probability of each progenitor strain is the
same at any locus, and missing data are treated as an allele with equal probability in the
founder strains.
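As a minimal sketch of the transition probability just defined, the function below evaluates it for one pair of states. The function name and the example values are hypothetical, and the marker distance is assumed to be expressed in Morgans so that the product of generations and distance is the expected number of recombinants.

```python
import math


def transition_prob(s_next, s, G, d, S):
    """Prior probability that the next locus is in state s_next given state s.

    G: number of generations, d: distance to the next marker (assumed Morgans),
    S: number of founder strains.
    """
    stay = math.exp(-G * d)      # no recombination in the interval
    jump = (1.0 - stay) / S      # recombined, then any strain with equal chance
    return stay + jump if s_next == s else jump


# Hypothetical values: 8 strains, 30 generations, markers 0.01 Morgans apart.
row = [transition_prob(u, 0, G=30, d=0.01, S=8) for u in range(8)]
```

Each row of the resulting transition matrix sums to one, and the matrix is symmetric because the off-diagonal value does not depend on which strains are involved.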
Conditional upon the ancestral haplotypes at $n$ being $s', t'$, the
probability that the haplotypes at $n+1$ are $s, t$ is
$$f_n(s,t \mid s',t') = C\, r_n(s \mid s')\, r_n(t \mid t')\, [\lambda_n(s \mid a)\lambda_n(t \mid b) + \lambda_n(s \mid b)\lambda_n(t \mid a)]/2,$$
where $a, b$ are the alleles observed at $n+1$. As the phase of the genotypes is unknown we must
consider both possibilities. $C$ is a normalising constant chosen so that
$\sum_{s',t'} f_n(s,t \mid s',t') = 1$. Therefore the total probability that the haplotypes at $n+1$ are
$s, t$ can be expressed as a dynamic-programming recurrence relation
$$P_{n+1}(s,t) = \sum_{s',t'} f_n(s,t \mid s',t')\, P_n(s',t'),$$
summed over all possible haplotypes $s', t'$ at $n$.
$P_n(s,t)$ is computed iteratively across the chromosome, starting at the first marker.
Similarly, we can find $Q_{n+1}(s,t)$, the probability that locus $n+1$ is in state $s, t$ given all
information from markers $n+1$ through $M$, by running the algorithm backwards from the
terminal marker. Analysis of $N$ individuals, $M$ markers and $S$ strains requires space
proportional to $NMS^2$ and time proportional to $NMS^4$.
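The forward pass can be sketched as follows. This is an illustration under stated assumptions, not the HAPPY implementation: the data layout (`genotypes`, `lam`, `dists`) and all example values are hypothetical, the first marker is initialised from a uniform prior over strain pairs, and each table is rescaled to sum to one, which here stands in for the normalising constant. The loop over all previous pairs is written naively so that the cost per marker visibly grows as the fourth power of the number of strains.

```python
import math


def emission(lam_n, a, b, s, t):
    """Phase-averaged founder term for unphased alleles a, b and states s, t."""
    return 0.5 * (lam_n[a][s] * lam_n[b][t] + lam_n[b][s] * lam_n[a][t])


def normalise(table):
    total = sum(map(sum, table))
    return [[v / total for v in row] for row in table]


def forward(genotypes, lam, dists, G, S):
    """Return P[n][s][t] for every marker n, working left to right.

    genotypes[n] = (a, b), the unphased alleles at marker n
    lam[n][a][s] = probability of ancestral strain s given allele a at marker n
    dists[n]     = distance between markers n and n+1 (assumed Morgans)
    """
    a, b = genotypes[0]
    # Assumption: uniform prior over strain pairs at the first marker.
    P = [normalise([[emission(lam[0], a, b, s, t) for t in range(S)]
                    for s in range(S)])]
    for n in range(1, len(genotypes)):
        stay = math.exp(-G * dists[n - 1])
        jump = (1.0 - stay) / S
        r = [[stay + jump if u == v else jump for v in range(S)]
             for u in range(S)]
        a, b = genotypes[n]
        prev = P[-1]
        nxt = [[emission(lam[n], a, b, s, t)
                * sum(r[s][s1] * r[t][t1] * prev[s1][t1]
                      for s1 in range(S) for t1 in range(S))
                for t in range(S)] for s in range(S)]
        P.append(normalise(nxt))   # rescaling plays the role of the constant
    return P


# Hypothetical two-strain example with alleles "A" and "B" at three markers.
lam = [{"A": [0.9, 0.1], "B": [0.1, 0.9]}] * 3
genotypes = [("A", "A"), ("A", "B"), ("B", "B")]
dists = [0.01, 0.02]
P = forward(genotypes, lam, dists, G=30, S=2)
```

The backward tables can be obtained from the same function by reversing the marker order, and because the transition factorises over the two chromosomes, an implementation could reduce the inner sum to two passes of lower cost; the naive form is kept here only to mirror the recurrence.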
QTL Detection. Suppose a QTL is between markers $n, n+1$ at a distance $c d_n$ from $n$
$(0 \le c \le 1)$. The probability $F_n(s,t,c)$ that the haplotypes are $s, t$ at the locus will
depend on the flanking marker distributions $P_n, Q_{n+1}$ and the pattern of recombination
in the interval (Figure 2). Fixing on one chromosome, the locus must either be linked to
Both markers, or just the Left marker, or just the Right, or be Unlinked, with respective
probabilities
$$p_B = e^{-G d_n}, \quad p_L = e^{-G c d_n} - e^{-G d_n}, \quad p_R = e^{-G(1-c) d_n} - e^{-G d_n}, \quad p_U = 1 - p_B - p_L - p_R$$
(the capitals in Both, Left, Right and Unlinked refer to the corresponding subscripts).
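A short sketch of the four linkage probabilities for a single chromosome follows. The function name and the example values are hypothetical, and the interval length is assumed to be in Morgans.

```python
import math


def linkage_probs(G, d, c):
    """Return (pB, pL, pR, pU) for a locus at fraction c of an interval of length d."""
    pB = math.exp(-G * d)                    # no recombination on either side
    pL = math.exp(-G * c * d) - pB           # linked to the left marker only
    pR = math.exp(-G * (1.0 - c) * d) - pB   # linked to the right marker only
    pU = 1.0 - pB - pL - pR                  # recombination on both sides
    return pB, pL, pR, pU


# Hypothetical values: 30 generations, 0.05 Morgan interval, locus at its midpoint.
probs = linkage_probs(G=30, d=0.05, c=0.5)
at_left_marker = linkage_probs(G=30, d=0.05, c=0.0)
```

The unlinked probability equals the product of the two one-sided recombination probabilities, so it is never negative, and at the midpoint the left-only and right-only probabilities coincide.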
An individual's two chromosomes need not be linked the same way, so we must sum
over all possible ways the states $s, t$ can occur at the locus; we find that, dropping the
subscripts for clarity,
$$
\begin{aligned}
F_n(s,t,c) ={} & P(s,t)Q(s,t)\,p_B p_B + P(s,t)Q(*,t)\,p_L p_B + P(*,t)Q(*,t)\,p_U p_B/S + P(*,t)Q(s,t)\,p_R p_B \\
{}+{} & P(s,t)Q(s,*)\,p_B p_L + P(s,t)\,p_L p_L + P(*,t)\,p_U p_L/S + P(*,t)Q(s,*)\,p_R p_L \\
{}+{} & P(s,*)Q(s,*)\,p_B p_U/S + P(s,*)\,p_L p_U/S + p_U p_U/S^2 + Q(s,*)\,p_R p_U/S \\
{}+{} & P(s,*)Q(s,t)\,p_B p_R + P(s,*)Q(*,t)\,p_L p_R + Q(*,t)\,p_U p_R/S + Q(s,t)\,p_R p_R
\end{aligned}
$$
(using the fact that the probability an unlinked locus is in a given state $= 1/S$), and
e.g. $P(*,t) = \sum_s P_n(s,t)$ (Figure 2). We then integrate over $c$ to obtain the interval-wide
probability.
We found that the greatest sensitivity to detect a QTL occurs when the number of
generations $G$ is set substantially higher than the true number. Likely reasons for this
phenomenon are that the distances between nearby markers may be inaccurate, and that
erroneous genotypes create false recombinant events. On a 450 MHz
Pentium III running RedHat Linux 2.2, 750 mice, 45 markers and 8 strains can be
analysed (i.e. dynamic programming plus linear regression) in 73 CPU seconds using a
program, HAPPY, written in C.
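For readers who want to see the sixteen-term sum for the locus probability spelled out, here is a hedged sketch. It is an independent illustration rather than the HAPPY source; the function name, the argument layout and the example tables are hypothetical. The tuple `p` is assumed to hold the Both, Left, Right and Unlinked probabilities in that order, and the result is returned exactly as the expression gives it, without any rescaling.

```python
def locus_prob(P, Q, s, t, p, S):
    """Probability-like weight that the locus carries ancestral alleles s, t.

    P, Q: forward and backward tables at the flanking markers (S x S lists)
    p:    (pB, pL, pR, pU) linkage probabilities for one chromosome
    """
    pB, pL, pR, pU = p
    Ps = sum(P[s])                          # P(s,*)
    Pt = sum(P[u][t] for u in range(S))     # P(*,t)
    Qs = sum(Q[s])                          # Q(s,*)
    Qt = sum(Q[u][t] for u in range(S))     # Q(*,t)
    return (
        # second chromosome linked to Both markers
        P[s][t] * Q[s][t] * pB * pB + P[s][t] * Qt * pL * pB
        + Pt * Qt * pU * pB / S + Pt * Q[s][t] * pR * pB
        # second chromosome linked to the Left marker only
        + P[s][t] * Qs * pB * pL + P[s][t] * pL * pL
        + Pt * pU * pL / S + Pt * Qs * pR * pL
        # second chromosome Unlinked
        + Ps * Qs * pB * pU / S + Ps * pL * pU / S
        + pU * pU / S ** 2 + Qs * pR * pU / S
        # second chromosome linked to the Right marker only
        + Ps * Q[s][t] * pB * pR + Ps * Qt * pL * pR
        + Qt * pU * pR / S + Q[s][t] * pR * pR
    )


# Hypothetical two-strain flanking tables (each sums to one).
P = [[0.70, 0.10], [0.10, 0.10]]
Q = [[0.20, 0.30], [0.30, 0.20]]

left_only = locus_prob(P, Q, 0, 1, (0.0, 1.0, 0.0, 0.0), 2)
right_only = locus_prob(P, Q, 0, 1, (0.0, 0.0, 1.0, 0.0), 2)
unlinked = locus_prob(P, Q, 0, 1, (0.0, 0.0, 0.0, 1.0), 2)
```

The three limiting cases act as a sanity check: with both chromosomes linked only to the left marker the result is the forward table entry, with both linked only to the right marker it is the backward table entry, and with both unlinked it is the uniform value over strain pairs. Integrating over the position within the interval could then be approximated by averaging this function over a grid of positions, which is one possible numerical choice and not something the text prescribes.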