QTL detection in Heterogeneous Stocks by Dynamic Programming

Richard Mott, Jonathan Flint
Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK

Problem statement. Equal numbers of animals from $S$ inbred strains (i.e. each strain is completely homozygous) are brought together and allowed to interbreed randomly over a large number $G$ of generations. At the final generation the animals are phenotyped for a quantitative trait and genotyped across a set of markers. We wish to find the location(s) of QTLs for the trait.

Solution. Since the number of possible ancestral haplotype reconstructions increases exponentially with the number of markers, it is infeasible to calculate the probability of each haplotype separately. However, a dynamic programming (DP) algorithm greatly reduces the complexity. Our analysis has two stages: haplotype probability reconstruction using dynamic programming, followed by hypothesis testing using linear regression.

We assume that at a QTL, a chromosome originating from the progenitor strain $s$ contributes an unknown additive amount $T_s$ to the phenotype, so that the expected genetic effect for a diploid individual with ancestral alleles labelled $s, t$ at the trait locus is $T_s + T_t$; a test for a QTL is equivalent to testing for differences between the $T_s$'s. The DP method computes the probability $F_n(s,t)$ that a given individual has the ancestral alleles $s, t$ at locus $n$. Then the expected phenotype is

\[ 2 \sum_s T_s \sum_t F_n(s,t), \]

and the $T_s$'s are estimated by a linear regression of the observed phenotypes on these expected values across all individuals, followed by an analysis of variance to test whether the progenitor estimates differ significantly. The method's effectiveness depends on the ability to distinguish ancestral haplotypes across the interval; clearly the power will be lower where all markers have the same non-informative allele distribution, but markers share information where there is a mixture.
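As an illustration of the expected-phenotype formula above, the following minimal sketch checks numerically that the direct double sum of $(T_s + T_t) F_n(s,t)$ agrees with the form $2 \sum_s T_s \sum_t F_n(s,t)$ when $F_n$ is symmetric in $s, t$. The function names and all numbers are hypothetical, and treating $2 \sum_t F_n(s,t)$ as the per-strain regression covariate is one plausible reading of the text, not a description of the HAPPY program.

```python
def strain_covariates(F):
    """Per-strain covariates 2 * sum_t F[s][t] for one individual (assumed reading)."""
    return [2.0 * sum(row) for row in F]


def expected_phenotype(F, T):
    """Direct double sum of (T[s] + T[t]) * F[s][t] over all state pairs."""
    S = len(T)
    return sum((T[s] + T[t]) * F[s][t] for s in range(S) for t in range(S))


# Made-up numbers for S = 3 strains; F is symmetric and sums to 1.
F = [[0.10, 0.05, 0.05],
     [0.05, 0.40, 0.10],
     [0.05, 0.10, 0.10]]
T = [1.0, -0.5, 2.0]

x = strain_covariates(F)
via_covariates = sum(xs * ts for xs, ts in zip(x, T))
direct = expected_phenotype(F, T)
```

With these values both routes give the same expected phenotype, and the covariates sum to 2 because each individual carries two chromosomes.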
We test for a QTL in the intervals between adjacent markers rather than at each marker locus. It is possible to generate pointwise logP values, but they do not differ significantly from the interval-wise values.

Dynamic programming algorithm. The dynamic-programming formulation of this problem can be thought of as a Hidden Markov model, where the hidden states are the progenitor haplotypes and the observed data are the genotypes. The breeding strategy for the HS mice implies that after $G$ generations each chromosome will be a mosaic of the progenitor haplotypes. Because most markers are not completely informative, it is not possible to deduce the ancestral haplotypes at a marker locus without using information from other markers to help resolve ambiguities.

Define $P_n(s,t)$ to be the probability that, for a given individual, the progenitor haplotypes are $s, t$ at marker $n$, given (i) the genotypes for the ordered markers numbered 1 through $n$; (ii) the founder strain haplotypes, expressed as the probability $\phi_n(s \mid a)$ that the ancestral state at marker $n$ on a particular chromosome is $s$ given that the allele observed at that locus is $a$; and (iii) the genetic distances $d_n$ between markers $n, n+1$.

Ignoring interference effects, the number of recombinants between markers $n, n+1$ is distributed as a Poisson random variable with mean $G d_n$. Consequently the prior probability that on a given chromosome locus $n+1$ is in state $s'$ given that locus $n$ is in state $s$ is

\[
r_n(s' \mid s) =
\begin{cases}
e^{-G d_n} + (1 - e^{-G d_n})/S, & s = s' \\
(1 - e^{-G d_n})/S, & s \neq s',
\end{cases}
\]

where $S$ is the number of strains. The prior probability of each progenitor strain is the same at any locus, and missing data are treated as an allele with equal probability in the founder strains.

Conditional upon the ancestral haplotypes at $n$ being $s', t'$, the probability that the haplotypes at $n+1$ are $s, t$ is

\[
f_n(s,t \mid s',t') = C \, r_n(s \mid s') \, r_n(t \mid t') \, \bigl[ \phi_{n+1}(s \mid a)\,\phi_{n+1}(t \mid b) + \phi_{n+1}(s \mid b)\,\phi_{n+1}(t \mid a) \bigr] / 2,
\]

where $a, b$ are the genotypes observed at $n+1$.
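The transition prior $r_n$ and the conditional term $f_n$ defined above can be combined into a single update step: each new state pair $(s,t)$ accumulates $f_n(s,t \mid s',t')$ weighted by the current probability of $(s',t')$. The sketch below is a hedged illustration, not the HAPPY implementation. The names `transition_prior` and `forward_step`, the table layout `phi[s][allele]`, and the choice to rescale the updated table so that it sums to one (standing in for the constant $C$) are all assumptions, and distances are assumed to be in Morgans.

```python
import math


def transition_prior(S, G, d):
    """r[sp][s]: prior probability of state s at marker n+1 given state sp at marker n."""
    stay = math.exp(-G * d)      # Poisson probability of zero recombinants (mean G*d)
    jump = (1.0 - stay) / S      # otherwise every strain is equally likely
    return [[stay + jump if s == sp else jump for s in range(S)]
            for sp in range(S)]


def forward_step(P_prev, r, phi, a, b):
    """One DP update from marker n to n+1 for genotype (a, b) of unknown phase.

    P_prev[sp][tp] is the state table at marker n; phi[s][allele] is the
    probability of ancestral state s given the allele at marker n+1 (assumed layout).
    """
    S = len(P_prev)
    P_next = [[0.0] * S for _ in range(S)]
    for s in range(S):
        for t in range(S):
            # Average over both assignments of alleles a, b to the two chromosomes.
            geno = (phi[s][a] * phi[t][b] + phi[s][b] * phi[t][a]) / 2.0
            carried = sum(r[sp][s] * r[tp][t] * P_prev[sp][tp]
                          for sp in range(S) for tp in range(S))
            P_next[s][t] = geno * carried
    total = sum(sum(row) for row in P_next)  # assumed rescaling; fails if total is 0
    return [[v / total for v in row] for row in P_next]


# Hypothetical two-strain example with a fully informative marker.
r = transition_prior(S=2, G=10, d=0.01)
phi = [[1.0, 0.0], [0.0, 1.0]]
P0 = [[0.25, 0.25], [0.25, 0.25]]
P1 = forward_step(P0, r, phi, a=0, b=1)
```

The four nested loops over $s, t, s', t'$ make the $S^4$ cost per marker and individual visible directly in the code.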
Because the phase of the genotypes is unknown, $f_n$ averages over both assignments of the observed alleles $a, b$ to the two chromosomes. $C$ is a normalising constant chosen so that

\[ \sum_{s',t'} f_n(s,t \mid s',t') = 1. \]

Therefore the total probability that the haplotypes at $n+1$ are $s, t$ can be expressed as a dynamic-programming recurrence relation

\[ P_{n+1}(s,t) = \sum_{s',t'} f_n(s,t \mid s',t') \, P_n(s',t'), \]

summed over all possible haplotypes $s', t'$ at $n$. $P_n(s,t)$ is computed iteratively across the chromosome, starting at the first marker. Similarly, we can find $Q_{n+1}(s,t)$, the probability that locus $n+1$ is in state $s, t$ given all information from markers $n+1$ through $M$, by running the algorithm backwards from the terminal marker. Analysis of $N$ individuals, $M$ markers and $S$ strains requires space proportional to $NMS^2$ and time proportional to $NMS^4$.

QTL Detection

Suppose a QTL lies between markers $n, n+1$ at a distance $c\,d_n$ from $n$ ($0 \le c \le 1$). The probability $F_n(s,t,c)$ that the haplotypes are $s, t$ at the locus will depend on the flanking marker distributions $P_n$, $Q_{n+1}$ and the pattern of recombination in the interval (Figure 2). Fixing on one chromosome, the locus must either be linked to Both markers, or just the Left marker, or just the Right, or be Unlinked, with respective probabilities

\[
p_B = e^{-Gd}, \qquad
p_L = e^{-Gcd} - e^{-Gd}, \qquad
p_R = e^{-G(1-c)d} - e^{-Gd}, \qquad
p_U = 1 - p_B - p_L - p_R
\]

(the capitals B, L, R, U refer to the corresponding subscripts).
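The four linkage probabilities can be evaluated directly from $G$, $d$ and $c$. The following minimal sketch uses a hypothetical function name and made-up parameter values, and assumes $d$ is measured in Morgans. It also checks the identity $p_U = (1 - e^{-Gcd})(1 - e^{-G(1-c)d})$, which follows by expanding the definitions and shows that $p_U$ is never negative.

```python
import math


def linkage_probs(G, d, c):
    """Return (pB, pL, pR, pU) for a locus at fraction c of an interval of length d."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    pB = math.exp(-G * d)                       # no recombinant anywhere in the interval
    pL = math.exp(-G * c * d) - pB              # none on the left part, some on the right
    pR = math.exp(-G * (1.0 - c) * d) - pB      # none on the right part, some on the left
    pU = 1.0 - pB - pL - pR                     # recombinants on both sides of the locus
    return pB, pL, pR, pU


# Made-up values: 60 generations, a 2 cM interval, locus 30% of the way along.
pB, pL, pR, pU = linkage_probs(G=60, d=0.02, c=0.3)
pU_product = (1.0 - math.exp(-60 * 0.3 * 0.02)) * (1.0 - math.exp(-60 * 0.7 * 0.02))
```

At $c = 0$ the locus coincides with the left marker, so $p_R$ and $p_U$ vanish and $p_B + p_L = 1$.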
An individual's two chromosomes need not be linked the same way, so we must sum over all possible ways the states $s, t$ can occur at the locus. Dropping the subscripts for clarity, we find that

\[
\begin{aligned}
F_n(s,t,c) ={}& P(s,t)Q(s,t)\,p_B p_B + P(s,t)Q(*,t)\,p_L p_B + P(*,t)Q(*,t)\,p_U p_B/S + P(*,t)Q(s,t)\,p_R p_B \\
&+ P(s,t)Q(s,*)\,p_B p_L + P(s,t)\,p_L p_L + P(*,t)\,p_U p_L/S + P(*,t)Q(s,*)\,p_R p_L \\
&+ P(s,*)Q(s,*)\,p_B p_U/S + P(s,*)\,p_L p_U/S + p_U p_U/S^2 + Q(s,*)\,p_R p_U/S \\
&+ P(s,*)Q(s,t)\,p_B p_R + P(s,*)Q(*,t)\,p_L p_R + Q(*,t)\,p_U p_R/S + Q(s,t)\,p_R p_R,
\end{aligned}
\]

using the fact that the probability that an unlinked locus is in a given state is $1/S$, and where e.g. $P(*,t) = \sum_s P_n(s,t)$ (Figure 2). We then integrate over $c$ to obtain the interval-wide probability.

We found that the greatest sensitivity to detect a QTL occurs when the number of generations $G$ is set substantially higher than the true number. Likely reasons for this phenomenon are that the distances between nearby markers may be inaccurate, and the presence of erroneous genotypes that create false recombinant events.

On a 450 MHz Pentium III running RedHat Linux 2.2, 750 mice, 45 markers and 8 strains can be analysed (i.e. dynamic programming plus linear regression) in 73 CPU seconds using a program, HAPPY, written in C.
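The sixteen-term sum for $F_n(s,t,c)$ in the QTL Detection section follows a regular pattern: for each chromosome, the left-hand table $P$ constrains the state only when that chromosome is linked to Both or Left, the right-hand table $Q$ only when it is linked to Both or Right, and an Unlinked chromosome contributes a factor $1/S$. The sketch below implements that pattern as a loop over the 4 x 4 linkage combinations. The function name, the table layout and the example values are hypothetical, and the integration over $c$ is not shown.

```python
def interval_prob(P, Q, s, t, pB, pL, pR, pU):
    """Sum the 16 linkage combinations for state (s, t) at a fixed position c."""
    S = len(P)
    weights = {"B": pB, "L": pL, "R": pR, "U": pU}

    def marginal(table, fix_s, fix_t):
        # Sum out an index (the '*' of the text) whenever it is not fixed.
        rows = [s] if fix_s else range(S)
        cols = [t] if fix_t else range(S)
        return sum(table[i][j] for i in rows for j in cols)

    total = 0.0
    for x, px in weights.items():        # linkage of the chromosome carrying s
        for y, py in weights.items():    # linkage of the chromosome carrying t
            term = px * py
            term *= marginal(P, x in "BL", y in "BL")
            term *= marginal(Q, x in "BR", y in "BR")
            if x == "U":
                term /= S
            if y == "U":
                term /= S
            total += term
    return total


# Hypothetical check with S = 3: both flanking tables are certain of state (0, 1).
S = 3
point = [[1.0 if (i, j) == (0, 1) else 0.0 for j in range(S)] for i in range(S)]
probs = dict(pB=0.3, pL=0.4, pR=0.1, pU=0.2)
F = [[interval_prob(point, point, i, j, **probs) for j in range(S)] for i in range(S)]
```

In this degenerate case each chromosome keeps its flanking state unless it is unlinked, so `F[0][1]` equals $(1 - p_U + p_U/S)^2$ and the table sums to one.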