An HMM-based Boundary-flexible Model of Human Haplotype Variation

by

Jonathan Sheffi

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2004, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology.

(c) Massachusetts Institute of Technology 2004. All rights reserved.

Thesis Supervisors: Mark J. Daly, Fellow, Whitehead Institute; David M. Altshuler, Investigator, Broad Institute
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Abstract

The construction of a meaningful and detailed description of haplotype variation holds the promise of more powerful genetic association studies. The segmentation of the human genome into blocks of limited haplotype diversity has been successfully employed by models that describe common variation. Some computational models of haplotype variation are flawed, however: they arbitrarily sever all haplotypes at block boundaries and assume that block boundaries are areas of free recombination. In reality, haplotypes break up when they recombine, and many past recombination events are predicted to occur at sites of occasional recombination. Thus, the genuine unit of shared genetic variation should often cross block boundaries, or sometimes end between them.
This work seeks the truer mosaic structure of human haplotypes through flexible haplotype boundaries. This thesis introduces an HMM-based boundary-flexible model, and demonstrates via the Minimum Description Length (MDL) criterion that this model is superior to a blockwise description.

Thesis Supervisor: Mark J. Daly
Title: Fellow, Whitehead Institute

Thesis Supervisor: David M. Altshuler
Title: Investigator, Broad Institute

Acknowledgments

Behind every thesis stands not only its author, but also many others without whom the work would not be possible. I would like to recognize the people who made this work a reality.

Mark and David: The smartest and most supportive supervisors I could have possibly imagined. Thank you for this amazing opportunity.

Itsik: My smart, kind, and altogether wise partner in haplotype analysis crime, without whom this project and this thesis would have been nowhere near as fun or as good. You have been an utter joy to work with, and I plan to lobby the Nobel committee to create a prize in computational biology just so they can give it to you. Thank you.

Jeff, Shaun, Claire and Andy: Labmates who are so chill, we could safely keep penguins under the desks if we had to.

Eric, Rafi and Derek: I've always got your back, and I know you've always got mine. You're the best friends a guy could ask for.

Ellie: I never would have gotten this far without your encouragement, support, and faith in me. Thank you for your patience, your sense of humor, and your wonderful heart.

Mom, Dad, and Karen: Thanks for everything. I love you so much!

Contents

1 Introduction 17
2 Human Genetics Background 19
  2.1 DNA 19
  2.2 Single Nucleotide Polymorphisms (SNPs) 20
  2.3 Linkage Disequilibrium 21
  2.4 Haplotype blocks 21
3 Computational Background 25
  3.1 Hidden Markov Models (HMMs) 25
    3.1.1 Definition of an HMM 26
    3.1.2 Likelihood of a set of sequences of observed symbols under an HMM: the Forward-Backward Algorithm 29
    3.1.3 Learning the HMM: Baum-Welch Algorithm 31
  3.2 Extensions to Standard HMM Theory 34
    3.2.1 Incomplete data 34
    3.2.2 Products of HMMs 34
  3.3 Minimum Description Length Principle 35
4 Related Work 37
5 A Flexible HMM for Haplotype Data 39
  5.1 Model Components 39
    5.1.1 General Parameters 40
    5.1.2 Ancestral Segments 40
    5.1.3 Ancestral Segment Tilings 41
    5.1.4 Transition Matrices 42
    5.1.5 Formal Model Definition 42
  5.2 Modeling haplotypes as an HMM 42
    5.2.1 States 43
    5.2.2 Initial State Distribution 43
    5.2.3 Emission Probabilities 45
    5.2.4 Transition Probabilities 45
    5.2.5 Improvement over previous models 46
  5.3 Extension to diploid and trio data 46
  5.4 Minimum Description Length 47
    5.4.1 Encoding a Model 47
    5.4.2 Encoding an ancestral segment 49
    5.4.3 Encoding a transition matrix 49
    5.4.4 Decoding a Model 50
  5.5 Optimizing the Model 52
    5.5.1 Initialization 52
    5.5.2 Optimizing the HMM 52
    5.5.3 Topology Optimization Strategy 53
    5.5.4 Candidate Topology Changes 53
6 Empirical Results 59
  6.1 Improvement in 5q31 59
  6.2 Data 61
  6.3 Improvement in likelihood and MDL 61
  6.4 Preserved boundaries 63
7 Conclusions 65
  7.1 Summary of Contribution 65
    7.1.1 Computational Results 65
    7.1.2 Biological Results 66
  7.2 Future Work 67
    7.2.1 Improve model optimization 67
    7.2.2 Optimization of general parameters 67
    7.2.3 Tag SNPs 68
    7.2.4 Prior distribution that assumes coalescence 69

List of Figures

3-1 A simple HMM. 27
4-1 The strict block model. 38
4-2 The block-free model. 38
5-1 A flexible model and the HMM that represents it. 44
5-2 A hypothetical flexible model. 48
5-3 The encoding order of units for the model shown in Figure 5-2. 48
5-4 A Horizontal Merge, before and after. 54
5-5 A Vertical Split, before and after. 55
5-6 A Prefix Match, before and after. 56
6-1 SNPs from chromosomal loci 433467 to 520521 in 5q31 under the block model. 60
6-2 SNPs from chromosomal loci 433467 to 520521 in 5q31 under the flexible model. 60
6-3 Improvement by the flexible model over the strict block model in both likelihood and description length. 62
6-4 Number of block boundaries found in the blockwise description that are traversed or preserved in the flexible model. 63
7-1 Example of historical haplotype structure. 70

List of Tables

6.1 Details of the six chromosomal regions used in validating the flexible model. 61

List of Algorithms

1 Decoding a Model 51
2 Horizontal Merge 54
3 Vertical Split 55
4 Prefix Match 56
5 Suffix Match 57

Chapter 1

Introduction

Genetic differences among individuals affect medically important traits. Associating these differences with such traits has remained the defining challenge of medical genetics since its inception. Once the genetic factors affecting a disease are understood, medical researchers can use that knowledge to develop more rational therapies for diseases that have genetic components, and can also gain an understanding of which subgroups of patients may or may not benefit from existing therapies.
As variation in the human genome becomes better documented, we are able to develop improved bioinformatics methods to discover the links between genomic data and the causes of human disease. This field of computational genetics has recently been revolutionized by the concept of haplotype blocks, which partition the genome into genomic regions of high linkage disequilibrium. The discovery of haplotype blocks as a ubiquitous feature of the human genome suggests the feasibility of much more powerful and accurate genetic association studies.

Some computational models of the genome treat haplotype blocks in a simplistic fashion, assuming that all haplotypes must be broken at all block boundaries, and only at those block boundaries. Both biological theory and empirical observation indicate that a strict block model of the genomic landscape is inaccurate. For example, linkage disequilibrium often persists between blocks, and some haplotypes therefore naturally cross block boundaries without any breakdown. These observations suggest a mosaic structure of interleaving common haplotypes, each of which is the manifestation of chromosomal segments that existed long ago. The prospect of a catalog of all such segregating ancestral haplotypes holds great promise as a tool for population geneticists.

This work seeks to probabilistically model these chromosomal segments and to show that they can be described more accurately by a flexible model than by a strict block model. We use a Hidden Markov Model (HMM) to determine the likelihood of observed sequences under a given model, and by introducing flexibility at the block boundaries, we show that it is possible to model ancestral chromosomes more succinctly than before with little or no loss in accuracy.

Chapters 2 and 3 lay the biological and computational groundwork for the major work of this thesis. Chapter 4 surveys related work, including other HMM approaches to haplotype modeling.
Chapter 5 explains the design of the flexible model. Chapter 6 documents the effectiveness of the flexible model over a strict block model. Chapter 7 summarizes the contributions of this thesis to the field, and outlines future work in this area.

Chapter 2

Human Genetics Background

As an aid to those who want to better understand the biological problem behind haplotype mapping, this chapter provides relevant background material related to human genetics. An overview of DNA and SNPs provides an understanding of the underlying data, and haplotype blocks are discussed as the motivating factor for this study.

2.1 DNA

Deoxyribonucleic acid (DNA) molecules encode the genetic information that dictates many cellular functions at the molecular level and thus affects many of the observed traits of a living organism. Abstractly, DNA may be thought of as a linear polymer of building blocks called nucleotides or bases. The four types of nucleotides are Adenine, Cytosine, Guanine and Thymine. The nucleotide sequence (a 3 billion letter string in humans) carries the genetic information of an individual. Each specific location along the sequence, measured in bases, kilobases (kb) or megabases (Mb), is called a site.

A human cell has 46 DNA molecules, called chromosomes, which store essentially all genetic information for an individual. Higher organisms have a diploid genome, meaning that each of the chromosomes is paired with another, resulting in 23 pairs of chromosomes for humans. Twenty-two of these pairs are each composed of two near-identical copies of one another; these pairs are numbered from 1 to 22. The twenty-third pair of chromosomes are the sex chromosomes. Females possess two copies of the X sex chromosome per cell, and males possess one copy of the X sex chromosome and one copy of the Y sex chromosome.

The cellular process that forms the basis of sexual reproduction is meiosis. Meiosis produces each parent's genetic contribution to an offspring.
This contribution is haploid, i.e., it includes exactly one copy of each type of chromosome: one copy of chromosomes 1 through 22 plus one sex chromosome. Within each parent, each pair of chromosomes crosses over, or recombines, to create one haploid daughter chromosome with alternating non-overlapping segments from the chromosome pair. A diploid offspring is then formed by the union of both parents' haploid contributions. Thus, an offspring inherits portions of her DNA from each parent, carrying some of the traits of each parent.

The process of accurate DNA replication is central to the transmission of functional pieces of hereditary information from generation to generation. On rare occasions, this process suffers from imperfections, or replication errors, called mutations. These accumulate over many generations and give rise to less-than-perfect agreement among chromosome copies in present-day individuals. The processes of recombination and mutation have led to the modification of DNA molecules and the resulting genetic diversity among living organisms.

2.2 Single Nucleotide Polymorphisms (SNPs)

The DNA content among humans is very similar from individual to individual. In fact, two humans differ in only one of every 1,200 bases on average, at points where ancestral mutations have occurred. These small variations cause most of the observed, heritable differences among individuals. The most common type of mutation is a replication error in which one nucleotide is substituted for another. This duality is referred to as a single nucleotide polymorphism (SNP) [30], and each of the possible nucleotides at that site is called an allele. There are about ten million SNPs along the genome for which both possible alleles are common in the human population [28]. The defining role of SNPs in the observed variation among members of our species makes them the focus of many studies that seek out genetic factors that contribute to disease.
Population geneticists often try to associate a particular allele or set of alleles with a disease state, in what is known as an association study. Genotyping, or the reading of SNP alleles in particular chromosome copies, is therefore very important to these studies. SNPs lend themselves well to high-throughput and cost-effective genotyping, and thus provide a useful manner in which to categorize an individual's genotype.

An individual who has the same allele for both chromosomal copies of a given SNP is said to be homozygous for that SNP. Similarly, an individual who has different allele values is said to be heterozygous for that SNP. Each SNP is genotyped independently of other SNPs, but is read for both chromosomes without distinction between the chromosomes. Thus, an individual who is heterozygous for a SNP is said to be of ambiguous phase: it is unclear which allele was derived from the mother and which was derived from the father. If the parental genotypes are also known, the phase can sometimes be determined, but there may still be some uncertainty.

2.3 Linkage Disequilibrium

A genomic region is said to be in linkage disequilibrium (LD) if alleles in that region have not yet recombined enough times to erase traces of their shared ancestral chromosome copies. LD regions are valuable because they allow geneticists to observe genetic sequences without genotyping every SNP in a region. Instead, scientists sample only a few selected SNPs, allowing for association studies that do not directly examine the SNP in question [39].

2.4 Haplotype blocks

Recent research on haplotype blocks forms the primary motivation for this thesis. Haplotype blocks are genomic regions where almost complete LD is observed with high significance across almost all SNPs in the region. That is, the alleles in a long region show evidence of very little recombination, and each of the observed sequences of allele values in the region is referred to as a haplotype.
Within a haplotype block, common variation is therefore due only to mutation and divergence history. The origin of haplotype blocks is explained as regional variation in recombination rates [48] or as the result of random crossovers combined with genetic drift [52]. The organization of the genome into orderly regions of little or no recombination means that human variation is much more limited and much less random than previously thought. Recent analyses of the human genome confirm that it contains regions of low haplotype diversity [35, 15]. Consequently, it is hoped that one can concisely represent the genomic content of an individual with an order of magnitude fewer SNPs than previously thought, allowing more powerful and cost-effective genetic studies [16, 42]. Because SNP genotyping experiments are expensive, reducing the number of SNPs required to identify the haplotype in a region makes large-scale disease studies feasible, with little loss of information. In the long run, as SNP genotyping becomes arbitrarily inexpensive, a realistic model of haplotype variation will be required to interpret the resulting data.

A public effort has been launched to catalog variation across the genome as a resource for attempts to determine the genetic factors that contribute to common diseases. This haplotype map, or "HapMap," is becoming more refined, and provides the data that forms the basis for the research described in this thesis [10, 9].

Haplotype blocks are a common feature of the human genome, though empirical attempts to characterize them in terms of length and frequency [21, 15, 36, 46, 47] have differed due to both the analytic methods employed and the type of data collected. Population simulations that assume a uniform recombination rate estimated LD to extend only to a distance of roughly 3 kb [26].
In contrast, empirical data [38] shows that LD extends an order of magnitude farther than that, due to variation in recombination rates [31]. Regional recombination rates in fact vary widely across the genome, from under 0.1 cM/Mb to more than 3 cM/Mb [24]. (Recombination rates are usually measured in centimorgans (cM); one centimorgan corresponds to a 1% chance that a SNP at one site will be separated from a SNP at a second site due to crossing over in a single generation.) Jeffreys et al. [19] and other studies have shown evidence of recombination hotspots in humans and other organisms [2, 20, 50, 51], at which more recombination events happen than at other sites. High-resolution mappings have shown the existence of these hotspots, which are about 1 to 2 kb in length; the great majority (about 94%) of crossover events lie within hotspots [19]. Very recent SNP surveys in European and African populations find evidence for extreme local recombination rate variation spanning four orders of magnitude, in which 50% of recombination events take place in less than 10% of the genome, and 80% of recombination events take place in 25% of the genome. The same surveys also confirm that recombination hotspots are a ubiquitous feature of the human genome [31].

Chapter 3

Computational Background

To provide a basic understanding of the computational work described in this thesis, this chapter provides relevant background material related to the computational aspects of this haplotype modeling project. Hidden Markov Models and Minimum Description Length are explained in detail here because these concepts are central to this project.

3.1 Hidden Markov Models (HMMs)

A Hidden Markov Model (HMM) is a probabilistic description of a class of sequences. It depicts a finite automaton, or "theoretical machine," which consists of a set of given states, with a prescribed set of probabilities governing possible transitions between states through time.
The current state of the HMM at a specific time frame depends on the previous state and the probabilistic transitions between states. At each time frame, the current HMM state probabilistically emits a symbol from the alphabet of symbols used by the class of sequences. Only the emitted symbols, not the states, are visible to external observers, so we call the states "hidden," hence the name of the model. HMMs are used to model processes that give rise to sequences of symbols, both by evaluating the likelihood of observed sequences of symbols under the model and by estimating the parameters which yield the model that best describes the data.

HMMs have been used in many fields, such as computer vision [6] and speech recognition [8], as well as in bioinformatics [25]. In bioinformatics, the role of the time frame is often assumed by the position along a linear biopolymer such as DNA. For example, an HMM can simulate a DNA sequence by emitting a sequence of its monomers (i.e., nucleotides). The stochastic processes of recombination and mutation make biological data appear especially well-suited to a stochastic model rather than a deterministic model. For example, the GENEHUNTER software uses recombination along the chromosome in place of the HMM's time axis to estimate familial inheritance patterns from incomplete marker data [27].

3.1.1 Definition of an HMM

To define a particular model, the following general conventions are used (see, e.g., [29, 13]).

Definition. A given Hidden Markov Model H is a triplet (Q_X(H), Q_V(H), Θ(H)), where:

- Q_X(H) represents the finite set of possible states {q_1, ..., q_N}.
- Q_V(H) represents the finite set of possible observed symbols {v_1, ..., v_M}.
- Θ(H) denotes the free parameters (with fixed Q_X and Q_V) for a given HMM. Θ(H) is a triplet (A(H), B(H), P(H)), where:
  - A(H) = [a_{i,j}(H)] is an N × N stochastic matrix that describes the probabilistic transitions between states.
  - B(H) = [b_{i,m}(H)] is an N × M stochastic matrix that describes the probabilities of each possible symbol being emitted given that the HMM is in some state.
  - P(H) = (p_i(H)) is a probability vector of dimension N that describes the probabilities of the HMM starting in each particular state.

H is omitted whenever it is clear from context. An example of an HMM is provided in Figure 3-1.

[Figure 3-1: A simple HMM. q_1 through q_5 are states of the HMM. The probabilities specified in a_{i,j} describe transitions between states, and the probabilities specified in b_{i,m} describe the probabilities of emitting symbol v_m in state q_i. One path through the HMM is shaded diagonally, with the emitted symbols shaded in crosshatch. The HMM probability of this path is p_1 × a_{1,4} × a_{4,5}. The conditional HMM probability given this path and the observed symbols is b_{1,2} × b_{4,1} × b_{5,3}.]

Definition. A path is a sequence of T states (q_{i_1}, ..., q_{i_T}) ∈ (Q_X(H))^T.

We now describe how an HMM defines probability spaces on possible paths, assigning a probabilistic meaning to the free parameters P and A:

Definition. An HMM H defines a sequence of state random variables X_1, ..., X_T whose distribution is determined by the following probabilistic interpretation of H's free parameters P and A:

- P is the initial state probability vector, i.e., it determines the distribution of the first state random variable: Pr(X_1 = q_i) = p_i.
- A is the transition matrix, i.e., it determines the distribution of the next state random variable given the current one: Pr(X_{t+1} = q_j | X_t = q_i) = a_{i,j}.

We denote a single sequence of actual observed symbols by ψ = (v_{o_1}, ..., v_{o_T}) ∈ (Q_V(H))^T, and a set of such sequences by Ψ = {ψ^1, ..., ψ^U}. We will next describe how an HMM defines probability spaces on possible output sequences, assigning a probabilistic meaning to the free parameter B.
Definition. An HMM H defines a sequence of output random variables O_1, ..., O_T whose distribution is determined by the probabilistic interpretation of H's free parameter B. B is the emission matrix, i.e., it determines the probability distribution of the current output symbol given the current state: Pr(O_t = v_m | X_t = q_i) = b_{i,m}.

These concepts allow us to associate a probability with a given path and a given output sequence. For a path Q = (q_{i_1}, ..., q_{i_T}) in an HMM H, its HMM probability is:

    Pr(Q | H) = Pr(X_1 = q_{i_1}, ..., X_T = q_{i_T} | H) = p_{i_1} × ∏_{t=1}^{T-1} a_{i_t, i_{t+1}}

For a path Q = (q_{i_1}, ..., q_{i_T}) in an HMM H, and a sequence of observed symbols ψ = (v_{o_1}, ..., v_{o_T}), the conditional HMM probability is:

    Pr(ψ | Q) = ∏_{t=1}^{T} Pr(O_t = v_{o_t} | X_t = q_{i_t}) = ∏_{t=1}^{T} b_{i_t, o_t}

3.1.2 Likelihood of a set of sequences of observed symbols under an HMM: the Forward-Backward Algorithm

One of the most important applications of an HMM H = (Q_X, Q_V, Θ) is the calculation of Pr(Ψ | H), the probability of observing the set of sequences of observed symbols given a particular model. Pr(Ψ | H) is also called the likelihood of Ψ. This value allows us to choose, from a set of models, the model that best describes the data. Because Q_X and Q_V are fixed, the likelihood of Ψ is often denoted Pr(Ψ | Θ). A common method of determining Pr(Ψ | Θ) is the Forward-Backward Algorithm (see, e.g., [29, 13]). The forward-backward algorithm computes forward probabilities α_t(i) and backward probabilities β_t(i), which require a forward and a backward pass through the data, respectively. Specifically, α_t(i) represents the probability of having seen some sequence of symbols (O_1, ..., O_t) and reaching state i at the t-th time frame, while β_t(i) represents the probability of traversing the remaining symbols in the sequence (O_{t+1}, ..., O_T) given that we are already in state i at the t-th time frame.
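As a concrete illustration of these forward and backward quantities (whose formal recursions are given below), here is a minimal sketch in Python for a single observation sequence. The states, symbols, and probability values are hypothetical toy numbers, not data from this thesis:

```python
# A minimal sketch of the forward and backward probabilities just described,
# for a single observation sequence. All parameter values are illustrative.

def forward_backward(P, A, B, obs):
    """Return (alpha, beta, likelihood) for one observed symbol sequence."""
    N, T = len(P), len(obs)
    # Forward pass: alpha_t(i) = Pr(O_1, ..., O_t, X_t = q_i)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = P[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Backward pass: beta_t(i) = Pr(O_{t+1}, ..., O_T | X_t = q_i)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    # Likelihood: Pr(psi | Theta) = sum_i alpha_t(i) * beta_t(i), for any t.
    likelihood = sum(alpha[T - 1][i] for i in range(N))
    return alpha, beta, likelihood

# Toy model: N = 2 states, M = 2 symbols.
P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha, beta, likelihood = forward_backward(P, A, B, [0, 1, 0])
```

Note that Σ_i α_t(i) · β_t(i) yields the same likelihood at every position t, which is a convenient sanity check on any implementation.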
In addition to their role in computing Pr(ψ | Θ), these probabilities are used in the Baum-Welch algorithm for parameter reestimation, outlined in the next section. We begin by defining the α and β values:

Definition. For a position t ∈ 1, ..., T along an output sequence, we define two vectors of probabilities:

    α_t(i) = Pr(O_1 = v_{o_1}, ..., O_t = v_{o_t}, X_t = q_i | Θ)
    β_t(i) = Pr(O_{t+1} = v_{o_{t+1}}, ..., O_T = v_{o_T} | X_t = q_i, Θ)

for each i ∈ 1, ..., N.

Next, we describe the recursive algorithm that computes the α and β values:

1. Recursively compute the forward (α) values:
   (a) Initialization (t = 1): for each i ∈ 1, ..., N, α_1(i) ← p_i · b_{i,o_1}
   (b) For each t ∈ 1, ..., T−1 and for each j ∈ 1, ..., N, compute:
       α_{t+1}(j) ← Pr(O_1, ..., O_{t+1}, X_{t+1} = q_j)
                  = Σ_{i=1}^{N} Pr(O_1, ..., O_t, X_t = q_i) · Pr(X_{t+1} = q_j | X_t = q_i) · Pr(O_{t+1} | X_{t+1} = q_j)
                  = Σ_{i=1}^{N} α_t(i) · a_{i,j} · b_{j,o_{t+1}}

2. Recursively compute the backward (β) values:
   (a) Initialization (t = T): for each i ∈ 1, ..., N, β_T(i) ← 1
   (b) For each t ∈ (T−1), ..., 1 and for each i ∈ 1, ..., N, compute:
       β_t(i) ← Pr(O_{t+1}, ..., O_T | X_t = q_i)
              = Σ_{j=1}^{N} Pr(X_{t+1} = q_j | X_t = q_i) · Pr(O_{t+1} | X_{t+1} = q_j) · Pr(O_{t+2}, ..., O_T | X_{t+1} = q_j)
              = Σ_{j=1}^{N} a_{i,j} · b_{j,o_{t+1}} · β_{t+1}(j)

3. Given the α and β values, we can determine Pr(ψ | Θ):

       Pr(ψ | Θ) = Σ_{i=1}^{N} Pr(O_1, ..., O_T, X_T = q_i | Θ) = Σ_{i=1}^{N} α_t(i) · β_t(i)  for any t

The computational complexity of the forward-backward algorithm is O(T · N²).

3.1.3 Learning the HMM: Baum-Welch Algorithm

Often when dealing with an HMM, we have some data (called a training set) but no model of the phenomenon that has produced the data. The standard procedure in this case is to construct a basic model and refine it iteratively until no further improvement to the data fitting is possible. Expectation maximization (EM) [12] is the common paradigm for such a convergent optimization process. EM adjusts the free parameters of the HMM based on the forward and backward probabilities previously computed.
This adjustment uses the Baum-Welch algorithm (see, e.g., [29, 13]). Each EM iteration has two steps: an Expectation (E) step and a Maximization (M) step. During the E-step, summary measures of the data are evaluated. These measures are expected counts of events, such as state transitions. In the M-step, these measures are interpreted as empirical frequencies of these events and substituted into the model for the old free parameters. This substitution is justified by the fact that the new free parameters (the summary measures) maximize the approximated likelihood of the data. EM is guaranteed to improve the probability of Ψ being observed from the model (Pr(Ψ | Θ)) in each iteration until some limiting probability is reached. By repeating the forward-backward and Baum-Welch algorithms, the parameters are guaranteed to converge to a local maximum of the likelihood, but not necessarily to the global maximum.

We first outline the overall algorithm, and then explain the rationale for each new parameter reestimate:

1. Choose the initial parameters, Θ = (A, B, P), arbitrarily.
2. Calculate the α and β values, as well as Pr(Ψ | Θ).
3. Reestimate p̂_i, â_{i,j}, and b̂_{i,m} as described below, for all values of i, j, and m.
4. Let Â = [â_{i,j}], B̂ = [b̂_{i,m}], P̂ = (p̂_i).
5. Let Θ̂ = (Â, B̂, P̂).
6. If Θ̂ = Θ then quit; else set Θ to Θ̂ and return to Step 2.

We now describe the details of each reestimate, p̂_i, â_{i,j}, b̂_{i,m}. To reestimate P, we calculate the expected number of times that the initial state is q_i under the observations:

    E(# times X_1 = q_i | Ψ) = Σ_{u=1}^{U} Pr(X_1 = q_i | ψ^u)
                             = Σ_{u=1}^{U} Pr(ψ^u, X_1 = q_i) / Pr(ψ^u | Θ)
                             = Σ_{u=1}^{U} α_1^u(i) · β_1^u(i) / Pr(ψ^u | Θ)    (3.1)

This last quantity, computed on the basis of the current parameter estimates, provides p̂_i, the improved estimate of p_i.
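Putting the pieces together, one full EM iteration (this initial-state update combined with the emission and transition reestimates derived next in the text) can be sketched as follows for a single training sequence (U = 1). All parameter values below are illustrative toy numbers, and the helper names are hypothetical:

```python
# A sketch of one Baum-Welch (EM) iteration for a single training sequence
# (U = 1). Toy parameter values; not an implementation from this thesis.

def forward_backward(P, A, B, obs):
    """Forward/backward probabilities and likelihood for one sequence."""
    N, T = len(P), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = P[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return alpha, beta, sum(alpha[T - 1])

def baum_welch_step(P, A, B, obs):
    """Return reestimated (P, A, B) after one EM iteration."""
    N, M, T = len(P), len(B[0]), len(obs)
    alpha, beta, like = forward_backward(P, A, B, obs)
    # gamma_t(i) = Pr(X_t = q_i | psi) = alpha_t(i) * beta_t(i) / Pr(psi)
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(N)] for t in range(T)]
    # Initial-state update (Eq. 3.1 with U = 1): p_i <- gamma_1(i).
    new_P = gamma[0][:]
    # Transition update: expected i -> j transitions / expected exits from i.
    new_A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        exits = sum(gamma[t][i] for t in range(T - 1))
        for j in range(N):
            trans = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                        for t in range(T - 1)) / like
            new_A[i][j] = trans / exits
    # Emission update: occurrences of symbol m in state i / occurrences of i.
    new_B = [[0.0] * M for _ in range(N)]
    for i in range(N):
        occupancy = sum(gamma[t][i] for t in range(T))
        for m in range(M):
            new_B[i][m] = sum(gamma[t][i] for t in range(T) if obs[t] == m) / occupancy
    return new_P, new_A, new_B

P = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 0]
new_P, new_A, new_B = baum_welch_step(P, A, B, obs)
```

Iterating this step until the parameters stop changing gives the local-maximum behavior described above: the likelihood is guaranteed not to decrease from one iteration to the next.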
To reestimate B, we first calculate the expected number of occurrences of state q_i:

    E(# times X_t = q_i | Ψ) = Σ_{u=1}^{U} Σ_{t=1}^{T} Pr(X_t = q_i | ψ^u)
                             = Σ_{u=1}^{U} Σ_{t=1}^{T} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)    (3.2)

The expected number of such occurrences for which O_t^u = m is:

    E(# times O_t^u = m, X_t = q_i | Ψ) = Σ_{u=1}^{U} Σ_{t=1}^{T} Pr(O_t^u = m, X_t = q_i | ψ^u)
                                        = Σ_{u=1}^{U} Σ_{t: O_t^u = m} Pr(X_t = q_i | ψ^u)
                                        = Σ_{u=1}^{U} Σ_{t: O_t^u = m} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)    (3.3)

We divide Equation 3.3 by Equation 3.2 to evaluate the proportion of the occurrences of state q_i for which the corresponding observation is m:

    b̂_{i,m} = [Σ_{u=1}^{U} Σ_{t: O_t^u = m} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)] / [Σ_{u=1}^{U} Σ_{t=1}^{T} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)]    (3.4)

This is the new estimate of b_{i,m}. The reestimate of A is calculated in a similar fashion. The expected number of transitions out of state q_i, given the observations, is:

    E(# transitions out of q_i | Ψ) = Σ_{u=1}^{U} Σ_{t=1}^{T-1} Pr(X_t = q_i | ψ^u) = Σ_{u=1}^{U} Σ_{t=1}^{T-1} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)    (3.5)

and the expected number of transitions from i to j is:

    E(# times X_t = q_i, X_{t+1} = q_j | Ψ) = Σ_{u=1}^{U} Σ_{t=1}^{T-1} Pr(X_t = q_i, X_{t+1} = q_j | ψ^u)    (3.6)
                                            = Σ_{u=1}^{U} Σ_{t=1}^{T-1} Pr(ψ^u | X_t = q_i, X_{t+1} = q_j) · Pr(X_t = q_i) · a_{i,j} / Pr(ψ^u | Θ)    (3.7)

We also know that:

    Pr(ψ^u | X_t = q_i, X_{t+1} = q_j) = Pr(O_1^u, ..., O_t^u | X_t = q_i) × Pr(O_{t+1}^u | X_{t+1} = q_j) × Pr(O_{t+2}^u, ..., O_T^u | X_{t+1} = q_j)
                                       = (α_t^u(i) / Pr(X_t = q_i)) · b_{j,o_{t+1}^u} · β_{t+1}^u(j)    (3.8)

Substituting Equation 3.8 into Equation 3.7, the expected number of transitions from i to j is then:

    Σ_{u=1}^{U} Σ_{t=1}^{T-1} Pr(X_t = q_i, X_{t+1} = q_j | ψ^u) = Σ_{u=1}^{U} Σ_{t=1}^{T-1} α_t^u(i) · b_{j,o_{t+1}^u} · β_{t+1}^u(j) · a_{i,j} / Pr(ψ^u | Θ)    (3.9)

Finally, dividing Equation 3.9 by Equation 3.5, the new estimate of a_{i,j} is:

    â_{i,j} = [Σ_{u=1}^{U} Σ_{t=1}^{T-1} α_t^u(i) · b_{j,o_{t+1}^u} · β_{t+1}^u(j) · a_{i,j} / Pr(ψ^u | Θ)] / [Σ_{u=1}^{U} Σ_{t=1}^{T-1} α_t^u(i) · β_t^u(i) / Pr(ψ^u | Θ)]    (3.10)

3.2 Extensions to Standard HMM Theory

3.2.1 Incomplete data

The definition of an HMM in Section 3.1.1 can be generalized to handle incomplete data. In such a setting, instead of each data point representing a single HMM output symbol v_m ∈ Q_V, it represents a subset of the output alphabet: O_t ⊆ Q_V.
For example, if the output alphabet is \(Q_V = \{A, B, C, D, E\}\), then a data point may represent \(O_t = \{B, D\}\) instead of just \(O_t = B\). The Forward-Backward and Baum-Welch algorithms can be modified to compute the likelihood of the observations under the model and to optimize a model to fit a data set, even with incomplete data. The same principle of per-state forward and backward probabilities can be applied by summing over the possible emitted symbols in a given subset \(O_t\):

\[
\Pr(O_t \mid X_t = q_i, \mathcal{H}) = \sum_{v \in O_t} \Pr(v \mid X_t = q_i, \mathcal{H})
\]

We use this result in place of our original emission probability \(b_{i,m}\) in the recursive computation of the \(\alpha\) and \(\beta\) values. The computational complexity of the forward-backward algorithm under this extension then becomes \(O(T \cdot N^2 + N \cdot \sum_t |O_t|) = O(T \cdot N^2 + T \cdot N \cdot M)\).

3.2.2 Products of HMMs

If we want to observe multiple sequences of observed symbols at once, we can do so by extending an existing HMM into multiple dimensions. We therefore extend the definition of a single HMM (Section 3.1.1) to sets of HMMs.

Definition Let \(\{\mathcal{H}^{(1)}, \ldots, \mathcal{H}^{(D)}\}\) be a set of D HMMs. We define their D-dimensional product HMM \(\mathcal{H} = \mathcal{H}^{(1)} \times \cdots \times \mathcal{H}^{(D)}\) to be the HMM \((Q_X(\mathcal{H}), Q_V(\mathcal{H}), \Theta(\mathcal{H}))\) such that:

- \(Q_X(\mathcal{H}) = Q_X(\mathcal{H}^{(1)}) \times \cdots \times Q_X(\mathcal{H}^{(D)})\)
- \(Q_V(\mathcal{H}) = Q_V(\mathcal{H}^{(1)}) \times \cdots \times Q_V(\mathcal{H}^{(D)})\)
- \(\Theta(\mathcal{H})\) is a triplet \((A(\mathcal{H}), B(\mathcal{H}), P(\mathcal{H}))\) of two matrices and a vector such that:

\[
a_{(i_1,\ldots,i_D),(j_1,\ldots,j_D)}(\mathcal{H}) = \prod_d a_{i_d,j_d}(\mathcal{H}^{(d)}), \quad
b_{(i_1,\ldots,i_D),(m_1,\ldots,m_D)}(\mathcal{H}) = \prod_d b_{i_d,m_d}(\mathcal{H}^{(d)}), \quad
p_{(i_1,\ldots,i_D)}(\mathcal{H}) = \prod_d p_{i_d}(\mathcal{H}^{(d)})
\]

This is a bona fide HMM, and the algorithms described in Sections 3.1.2 and 3.1.3 can be applied to compute the likelihood of samples under a model and to fit a model to existing data. A D-dimensional HMM can also handle incomplete data as described in Section 3.2.1. The only caveat is computational complexity: adding more states and increasing alphabet sizes leads to a significant increase in the computation time to produce an optimized model, which is already of quadratic order.
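The incomplete-data emission term above lends itself to a direct implementation. A minimal sketch, assuming observations arrive as Python sets of symbol indices (the model values in the example are invented):

```python
# A sketch of the incomplete-data extension of Section 3.2.1: each
# observation O_t is a subset of the output alphabet, and the emission
# term is the sum of b_{i,v} over the symbols v in that subset.

def emission(B, i, obs_set):
    """Pr(O_t | X_t = q_i) when O_t is a subset of the alphabet."""
    return sum(B[i][v] for v in obs_set)

def forward_incomplete(obs_sets, A, B, pi):
    """Forward pass where each obs_sets[t] is a set of symbol indices;
    returns Pr(observations | model)."""
    N = len(A)
    alpha = [pi[i] * emission(B, i, obs_sets[0]) for i in range(N)]
    for t in range(1, len(obs_sets)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N))
                 * emission(B, j, obs_sets[t]) for j in range(N)]
    return sum(alpha)
```

A completely unknown site is represented by the full alphabet, whose emission term sums to 1, so it contributes no information to the likelihood.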
Although theoretically polynomial, a 4-dimensional HMM \(\mathcal{H} \times \mathcal{H} \times \mathcal{H} \times \mathcal{H}\) involves computation time of \(O(T \cdot (N^8 + N^4 \cdot M^4))\), which is impractical for N > 10.

3.3 Minimum Description Length Principle

In order to optimize our model, we employ the Minimum Description Length (MDL) principle. The MDL principle was introduced by Rissanen [40] as a statistical Occam's Razor: "The best model/model class among a collection of tentatively suggested ones is the one that gives the smallest stochastic complexity to the given data" [41, 40]. It is intuitive that the less random behavior a model assumes, the more likely that model is to arise by chance. The MDL principle has already been widely applied to many problems of model complexity [18].

In order to apply the MDL principle, the greatest challenge is the calculation of a model's stochastic complexity. One accepted measure of this quantity is the length of an ideal coding of the model [41, 4]. The number of bits required to describe the parameters of a model is informationally equivalent to the negative logarithm of the probability of the model arising by chance: \(-\log_2(\Pr(\text{model}))\). Determining this log probability therefore reduces to decomposing the model into its constituent parts and devising an ideal coding scheme to describe the model parameters. We already know how to calculate the probability of observations under a model from Section 3.1.2, so we can construct an overall figure of merit \(\Phi\) for any given model: the minimal length of the data description through this model.

\[
\Phi = \mathrm{MDL}(\text{model}, \text{observations})
= \text{model code length} - \log_2(\text{likelihood})
= -\log_2(\Pr(\text{model})) - \log_2(\Pr(\text{observations} \mid \text{model}))
= -\log_2(\Pr(\mathcal{H})) - \log_2(\Pr(\Psi \mid \mathcal{H}))
\]

The first term of \(\Phi\) increases as model complexity increases, whereas the second term decreases as model accuracy increases. By minimizing \(\Phi\), we achieve a succinct and accurate model.

Chapter 4

Related Work

The introduction by Daly et al.
[11] of a high-resolution block model of haplotype recombination, including inter-block recombination (see Figure 4-1), laid the foundation for future models of haplotype variation. Their work provided a high-resolution map of haplotype block structure in a particular region using an HMM with haplotype states along the genome. Gabriel et al. [15] later standardized the block definition and showed that haplotype blocks are found throughout the genome. The Gabriel et al. model has been implemented in the popular Haploview program [3] and is also the basis for more recent research into the optimization of the block model.

Distinct implementations of the HMM paradigm applied to the problem of haplotype variation vary in the meaning assigned to the model haplotypes, in the manner by which haplotypes are manifested as samples, and in the assumption of haplotype blockiness of the data. Such HMM-based models are becoming more and more complex, starting from a model that simply seeks to list common haplotypes within each block [23] and a model that attempts to minimize entropy of inter-block transitions [1]. These models optimize the strict block model using the MDL criterion, but do not account for mutations or the fact that haplotypes in the same block are likely to be similar to one another. Other models do attempt to assign a realistic meaning to emission probabilities as chances to observe an ancestral or mutated nucleotide [17, 22], but even these models still assume that haplotypes are broken up at every block boundary and only at block boundaries, consequently limiting the model haplotypes to single-block fragments. Such an assumption is arbitrary and does not reflect the genome structure of haplotype recombination.

Fig. 4-1: The strict block model.
Areas of limited haplotype diversity (rectangles) are related to one another through probabilistic transitions (arrows) that represent the frequency of a sequence containing the haplotypes on each end of the arrow.

Stephens et al. [43] use a block-free model (see Figure 4-2), in which model haplotypes are the actual haplotypes observed in the sample data. The model incorporates both mutation and recombination, but fails to recognize the strong preference of recombination events to occur at hotspots. Furthermore, the model does not explicitly define common ancestral haplotypes, any one of which may have undergone mutation or recombination events that affect many present-day samples.

Fig. 4-2: The block-free model. All observed sequences are included in the model (rectangles), and putative recombinations (light arrows) are used to refine the model.

Chapter 5

A Flexible HMM for Haplotype Data

The structure of haplotype recombination in the human genome is neither strictly block-like (i.e., characterized by recombination at hotspots alone), nor strictly block-free (i.e., characterized by uniform recombination throughout the genome). Appropriately, we devise a general HMM approach to the description of haplotype variation that tries to mimic real biological phenomena and entities with components of the model. Specifically, we explicitly treat each haplotype in the model as a segment of an ancestral chromosome. Each such chromosome came about at a specific time in history as a result of an ancient event of divergence, mutation, or recombination, and was present in some portion of the population. As time passed, the ancestral chromosomal segments underwent more recent events of divergence, mutation, and recombination, all of which are explicit in our model.

5.1 Model Components

There are two entities that are central to the composition of our flexible model: ancestral segments and transition matrices.
Informally, a flexible model is composed of some general fixed parameters and free parameters, which include a set of ancestral segments and a set of transition matrices linking these segments. We first define these components of a flexible model, followed by the formal definition of a model.

5.1.1 General Parameters

We define a list \(\mathcal{G}\) of some parameters of general use in the model. These parameters are assumed to be fixed, and we do not fit them to the data.

Definition The general parameters \(\mathcal{G} = (\mathrm{Site}, P_A, P_B, \mu)\) of the model are:

Chromosome position list \(\mathrm{Site}(1), \ldots, \mathrm{Site}(T)\) is an ordered list of chromosomal positions of SNPs in the data set. Throughout, we refer to SNP sites by their index in this list.

Average recombination rate \(P_A\) is the average genome-wide rate of recombination. We fix this parameter at \(10^{-8}\) events per base (about 1 cM/Mb) [24].

Background recombination rate \(P_B\) is the average background rate of recombination, i.e., the rate at which recent recombination has occurred at non-hotspot sites. Although \(P_B\) according to this definition has not been explicitly estimated, studies [31] indicate that it is an order of magnitude lower than \(P_A\). Therefore, we set \(P_B = 10^{-9}\) events per base.

Mutation rate \(\mu = 2 \times 10^{-8}\) events per base, the average mutation rate throughout the genome, is set as a uniform rate [7].

5.1.2 Ancestral Segments

Each ancestral segment entity corresponds to a segment of DNA thought to segregate unbroken in the modern population. Formally:

Definition An ancestral segment is a triplet \(S = (L(S), C(S), f(S))\), where:

Left endpoint \(L(S)\) is an integer (\(1 \le L(S) \le T\)) that denotes the index of the leftmost SNP of S.

Alleles \(C(S) \in \{0, 1\}^{|C(S)|}\) is a binary vector listing the alleles of the \(|C(S)|\) SNPs in S.

Population frequency \(f(S)\) is a probability that represents the population frequency of S.
We also calculate some other important properties of S:

Right endpoint \(R(S) = L(S) + |C(S)| - 1\) is an integer that denotes the index of the rightmost SNP of S.

Age \(\tau(S) = -\dfrac{1}{P_A \cdot (\mathrm{Site}(R(S)) - \mathrm{Site}(L(S)))}\) is a negative real number representing the putative time to the most recent common ancestor that corresponds to S.

The argument (S) is omitted from notation when it is clear from context.

5.1.3 Ancestral Segment Tilings

Definition Let \(\mathcal{S} = \{S^1, \ldots, S^{|\mathcal{S}|}\}\) be some set of ancestral segments, where \(S^i = (L(S^i), C(S^i), f(S^i))\). We formally define \(R(\mathcal{S})\) and \(L(\mathcal{S})\), the sets of right and left ancestral segment endpoints that do not include chromosomal endpoints, as:

- \(R(\mathcal{S}) = \{R(S^i) \mid S^i \in \mathcal{S}\} \setminus \{T\}\)
- \(L(\mathcal{S}) = \{L(S^i) \mid S^i \in \mathcal{S}\} \setminus \{1\}\)

We can now introduce an ordering of the ancestral segments which begin or end at a specific site:

Definition For each t, let \(L^{\mathcal{S}}_t = (i_1 < \cdots < i_m)\) be the sequence of indices such that \(L(S^{i_1}) = \cdots = L(S^{i_m}) = t\). We define the left index of \(i_k\) as its ordinal in \(L^{\mathcal{S}}_t\): \(l(i_k) = k\).

The right index is symmetrically defined:

Definition For each t, let \(R^{\mathcal{S}}_t = (i_1 < \cdots < i_m)\) be the sequence of indices such that \(R(S^{i_1}) = \cdots = R(S^{i_m}) = t\). We define the right index of \(i_k\) as its ordinal in \(R^{\mathcal{S}}_t\): \(r(i_k) = k\).

An ancestral segment tiling is a set of ancestral segments that tiles the region under study. Formally:

Definition \(\mathcal{S}\) is an ancestral segment tiling if:

- \(\min_i L(S^i) = 1\)
- \(\max_i R(S^i) = T\)
- \(t \in R(\mathcal{S}) \iff t + 1 \in L(\mathcal{S})\)

5.1.4 Transition Matrices

Transition matrices define the interconnections between ancestral segments. They describe the probability that some ancestral segment will follow another ancestral segment along an individual's single chromosome copy. Formally:

Definition An \(n \times m\) stochastic matrix \(M = [m_{i,j}]\) is a transition matrix associated with a site t for a given ancestral segment tiling \(\mathcal{S} = \{(L(S^i), C(S^i), f(S^i))\}_{i=1}^{|\mathcal{S}|}\) if \(n = |R^{\mathcal{S}}_t|\) and \(m = |L^{\mathcal{S}}_{t+1}|\). We also define the notation \(L(M) = t\), \(R(M) = t + 1\).
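The derived properties of an ancestral segment can be sketched in a few lines of Python; the Site list, the segment values, and the constant are invented for illustration, and the age computation assumes the form of the \(\tau(S)\) definition above.

```python
# A sketch of an ancestral segment's derived properties. A segment is a
# (L, C, f) triplet; Site positions are 1-indexed as in the text.

P_A = 1e-8  # average recombination rate, events per base

def right_endpoint(L, C):
    """R(S) = L(S) + |C(S)| - 1."""
    return L + len(C) - 1

def age(L, C, site):
    """tau(S): putative (negative) time to the MRCA of S; segments with a
    larger physical span get ages of smaller magnitude (they are younger,
    since recombination has had less time to break them up)."""
    span = site[right_endpoint(L, C) - 1] - site[L - 1]
    return -1.0 / (P_A * span)

site = [1000, 6000, 11000, 16000, 21000]  # hypothetical SNP positions
segment = (2, [0, 1, 1], 0.35)            # (L(S), C(S), f(S))
```

The sign convention keeps every \(\tau(S)\) negative, so expressions such as \(-\tau(S)\) in the emission and background-recombination probabilities come out positive.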
5.1.5 Formal Model Definition

Definition A haplotype flexible model is a triplet \((\mathcal{G}, \mathcal{S}, \mathcal{M})\) where \(\mathcal{G}\) is a list of general parameters, \(\mathcal{S} = \{S^1, \ldots, S^{|\mathcal{S}|}\}\) is an ancestral segment tiling, and \(\mathcal{M} = \{M^1, \ldots, M^{|\mathcal{M}|}\}\) is a set of transition matrices for \(\mathcal{S}\) such that for each \(t \in L(\mathcal{S})\) there exists \(M \in \mathcal{M}\) with \(R(M) = t\).

5.2 Modeling haplotypes as an HMM

This section relates the problem of haplotype variation to the HMM framework and the related algorithms detailed in Section 3.1. The reader is referred to Section 3.1.1, which explains the common HMM conventions that will be used frequently in this section. We first describe how each component of an HMM is implemented by the flexible model, showing how it can generate a sequence of observations that correspond to the sequence of alleles along a haploid chromosome. In this setting, each HMM state corresponds to a single hidden ancestral segment at a single site. A diagram of how a particular flexible model corresponds to an HMM is shown in Figure 5-1. We begin by defining the input (training set):

- T = length of each chromosome in the training set
- \(Q_V = \{v_m\} = \{0, 1\}\)
- \(O_t\) is an output random variable which represents the emitted allele at site t. We model unknown alleles as incomplete data (see Section 3.2.1). In those cases, \(O_t = \{0, 1\}\) rather than exactly one of the values in \(Q_V\).
- \(\psi^u = (v_{o_1}, \ldots, v_{o_T})\) is a single chromosome
- \(\Psi = \{\psi^1, \ldots, \psi^U\}\) is the set of chromosomes across all individuals in the training set

5.2.1 States

Each state in our HMM corresponds to a single ancestral segment \(S^i\) and a single site t along that segment. Therefore, we construct \(R(S^i) - L(S^i) + 1\) states \(q_{i,t} = (S^i, t)\) for each \(S^i\). Notice that the states are now defined by two indices rather than just one index, as in Section 3.1.1. Thus, we can define the HMM states:

- \(Q_X = \{q_{i,t}\} = \{(S^i, t) \mid L(S^i) \le t \le R(S^i)\}\)
- \(X_t\) is a state random variable denoting the state at site t.
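The state construction above is mechanical: one state per ancestral segment per site it covers. A minimal sketch, with an invented four-segment tiling over T = 5 sites:

```python
# Build Q_X = {(i, t) | L(S^i) <= t <= R(S^i)}: one HMM state per
# ancestral segment per covered site. Segments are (L, C, f) triplets.

def build_states(segments):
    states = []
    for i, (L, C, f) in enumerate(segments):
        R = L + len(C) - 1  # right endpoint
        states.extend((i, t) for t in range(L, R + 1))
    return states

tiling = [(1, [0, 1, 0], 0.6),  # covers sites 1..3
          (1, [1, 1, 1], 0.4),  # covers sites 1..3
          (4, [0, 0], 0.7),     # covers sites 4..5
          (4, [1, 0], 0.3)]     # covers sites 4..5
```

Each segment contributes exactly \(R(S^i) - L(S^i) + 1\) states, so this toy tiling yields ten states in total.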
5.2.2 Initial State Distribution

We use each ancestral segment's frequency \(f(S^i)\) for the initial state distribution along the ancestral segment:

\[
p_{i,t} = f(S^i)
\]

Fig. 5-1: A flexible model and the HMM that represents it. Solid black arrows correspond to the almost deterministic transitions along an ancestral segment. Unfilled arrows correspond to very infrequent recombination between ancestral segments. Dashed arrows correspond to transitions across transition matrices. For visual simplicity, unfilled arrows are only shown for transitions between the first two loci. Each state's symbol is emitted with very high probability \((1 - \mu \cdot (-\tau(S)))\).

5.2.3 Emission Probabilities

At \(q_{i,t}\), \(C(S^i)[t]\) is emitted unless mutation occurs:

\[
\Pr(O_t = C(S^i)[t] \mid X_t = q_{i,t}) = 1 - \mu \cdot (-\tau(S^i))
\]

This gives us the emission probabilities:

\[
b_{(i,t),m} =
\begin{cases}
1 - \mu \cdot (-\tau(S^i)) & \text{if } v_m = C(S^i)[t] \\
\mu \cdot (-\tau(S^i)) & \text{if } v_m \neq C(S^i)[t]
\end{cases}
\]

5.2.4 Transition Probabilities

Each transition probability in the haplotype flexible model describes the probability of moving between ancestral segments as t is incremented to t + 1. These transitions can occur along a single ancestral segment, or from one segment to another, and they can occur either within the segment or at its endpoint. Within an ancestral segment (\(L(S^i) \le t < R(S^i)\)), the probability of a transition from \(S^i\) to another ancestral segment \(S^j\) is based on the possibility of background recombination:

\[
\mathrm{Background}(i, j, t) = \Pr(X_{t+1} = q_{j,t+1} \mid X_t = q_{i,t}) = P_B \cdot f(S^j) \cdot (\mathrm{Site}(t+1) - \mathrm{Site}(t)) \cdot (-\max(\tau(S^i), \tau(S^j)))
\]

Between ancestral segments, each entry in the transition matrix describes the conditional probability of getting from one ancestral segment to another. Let \(S^i\), \(S^j\) be ancestral segments such that \(R(S^i) = L(S^j) - 1 = t\). Let M be the transition matrix such that \(L(M) = t\).
Recall that \(r(i)\) and \(l(j)\) are the right index of i and the left index of j, respectively, as defined in Section 5.1.3.

\[
\mathrm{InterSegment}(i, j, t) = \Pr(X_{t+1} = q_{j,t+1} \mid X_t = q_{i,t}) = M[r(i), l(j)]
\]

We complete the model with the matrix of transition probabilities. For any pair of states \(q_{i,t}\) and \(q_{j,t'}\), the transition between them is formally defined by:

\[
a_{(i,t),(j,t')} =
\begin{cases}
0 & \text{if } t' \neq t + 1 \\
\mathrm{InterSegment}(i, j, t) & \text{if } t = R(S^i) = L(S^j) - 1 \\
\mathrm{Background}(i, j, t) & \text{if } t' = t + 1;\ t \neq R(S^i);\ t' \neq L(S^j);\ i \neq j \\
1 - \sum_{j' \neq i} a_{(i,t),(j',t+1)} & \text{otherwise}
\end{cases}
\]

5.2.5 Improvement over previous models

This model differs from previous work because the endpoints of each haplotype are attributes of that haplotype rather than of the system as a whole. Thus, the endpoints can be altered: two haplotypes whose endpoints meet may be merged into a longer haplotype, or a haplotype may be severed at some SNP, creating a new transition matrix between the two new ancestral segments if one does not already exist at that SNP.

5.3 Extension to diploid and trio data

The HMM outlined in the previous section is only defined for complete haploid data. That is, each sequence of observed symbols corresponds to a single chromosome where all SNP values are known. Unfortunately, haplotype data may be available only for inbred model organisms [45], obtaining such data may require cost-prohibitive technologies [35], and missing data is simply a fact of life in science. Typical human data sets include diploid data, which may be partially resolved into haplotypes using family information. Specifically, trio families (two parents, one offspring) are the norm in many studies. While the practical problem of missing data has been addressed by standard HMM theory (see Section 3.2.1), it is up to us to extend the flexible model beyond the single-chromosome HMM defined in Section 5.2 to diploid or trio data.
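One way to connect unphased genotype data to pairs of haploid model outputs is to treat heterozygous and missing sites as incomplete data over ordered allele pairs. The sketch below is illustrative only; the symbol names, including the missing-data symbol "?", are our own convention, not the thesis's.

```python
# Map a diploid genotype symbol to the set of ordered allele pairs it may
# represent: a homozygote fixes both chromosomes, a heterozygote leaves
# the phase ambiguous, and a missing value allows everything.

def genotype_to_pairs(g):
    if g == "0":   # homozygous 0
        return {(0, 0)}
    if g == "1":   # homozygous 1
        return {(1, 1)}
    if g == "h":   # heterozygous: phase unknown
        return {(0, 1), (1, 0)}
    if g == "?":   # missing data (hypothetical convention)
        return {(0, 0), (0, 1), (1, 0), (1, 1)}
    raise ValueError(g)
```

The resulting subsets plug directly into the incomplete-data machinery of Section 3.2.1, applied to a product of two haploid models.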
For diploid data, each individual in the data set represents two independent chromosomal sequences, each of which is an output sequence of a single haplotype flexible model. Each observation therefore corresponds to a pair of SNP values across the observed chromosomes. At homozygous sites, both chromosomes are observed, but at heterozygous sites, the ambiguous phase implies that a subset of pairs of flexible model output values is possible: (0,1) or (1,0). We define a diplotype HMM \(\mathcal{H} = \mathcal{H}^1 \times \mathcal{H}^2\), where \(\mathcal{H}^1\) and \(\mathcal{H}^2\) are identical haplotype flexible models, and \(O_t \in \{0 = (0,0),\ 1 = (1,1),\ h = \{(0,1), (1,0)\}\}\).

For trio data, we observe four independent chromosomes: maternal/paternal × transmitted/untransmitted. The offspring data often resolves the phase ambiguity of these four parental chromosomes. If all three individuals in the trio are heterozygous, or if data is missing, then there is still some possibility for ambiguity. We define a trio HMM \(\mathcal{H} = \mathcal{H}^1 \times \mathcal{H}^2 \times \mathcal{H}^3 \times \mathcal{H}^4\), where each \(\mathcal{H}^i\) is an identical haplotype flexible model. Thus, this model accommodates diploid (trio) data with missing information by mimicking two (four) identical independent haplotype HMMs, whose outputs are observed only after they are merged into diploid (trio) samples of unrelated individuals.

5.4 Minimum Description Length

In this section, we outline and explain the algorithms that encode and decode any given model. This ideal code length is substituted for the stochastic complexity formula in the MDL criterion, as explained in Section 3.3.

5.4.1 Encoding a Model

In order to encode a model, we devise a scheme that orders descriptions of each unit of the model (\(S^i\) or \(M^j\)) in such a way that the topology of the model and the locations of its components can be easily reconstructed from the order of the components and the lengths of the ancestral segments. Intuitively, units are sorted by left endpoint.
Each unit is assigned a pair of integers to enable sorting, and ties are broken in favor of (shorter) ancestral segments. Formally:

1. Assign a pair of integers to each ancestral segment and each transition matrix.
   - For an ancestral segment \(S^i\), the integer pair is \((L(S^i), R(S^i) + 1)\).
   - For a transition matrix M, the integer pair is \((L(M), R(M))\).
2. Sort all units by the two integers, lexicographically.
3. Encode \(|L^{\mathcal{S}}_1|\).
4. Encode each unit according to Sections 5.4.2 and 5.4.3 below.

For a model with the topology in Figure 5-2, the units in the model will be ordered as in Figure 5-3.

Fig. 5-2: A hypothetical flexible model. Each rectangle represents a single ancestral segment, and transition matrices are shown by arrows.

Fig. 5-3: The encoding order of units for the model shown in Figure 5-2 using the algorithm described above.

5.4.2 Encoding an ancestral segment

To encode each ancestral segment, we only need enough bits to encode the length of the ancestral segment, its frequency in the model, and its alleles. The rest of the relevant attributes can be inferred from the coding scheme, as we will show in Section 5.4.4. The ancestral segment's frequency \(f(S^i)\) is a real number, but it can be encoded with finite precision due to a standard trick: this frequency is only measured up to 1/N accuracy (where N is the number of chromosomal samples) as the fraction of chromosomal samples that descended from \(S^i\). This is a ratio between an estimated integer and N, so at most \(\log_2(N)\) bits are required to encode \(f(S^i)\). Given an ancestral segment \(S^i\) and N chromosomal samples, its minimum encoding is:

length: \(\log_2(R(S^i) - L(S^i) + 1)\) bits
frequency: \(\log_2(N)\) bits
alleles: \(R(S^i) - L(S^i) + 1\) bits

5.4.3 Encoding a transition matrix

For a given \(n \times m\) transition matrix M, we need to encode \(n \times m\) probabilities, to 1/N accuracy. While n can be deduced from the number of ancestral segments encoded previous to M, m will need to be encoded explicitly.
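The unit ordering and the ancestral segment code length can be sketched in a few lines; the unit tuples and sample values below are invented, and the third tie-break component encodes the rule that segments precede transition matrices on equal integer pairs.

```python
import math

def sort_key(unit):
    """Integer pair per unit, compared lexicographically; on ties,
    ancestral segments come before transition matrices."""
    kind, L, R = unit
    return (L, R + 1, 0) if kind == "segment" else (L, R, 1)

def segment_bits(L, R, N):
    """Minimum encoding of an ancestral segment with N chromosomal
    samples: length + frequency + alleles."""
    length = R - L + 1
    return math.log2(length) + math.log2(N) + length
```

Because a segment's pair is \((L, R + 1)\), two segments sharing a left endpoint are automatically ordered shortest-first by the lexicographic comparison.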
There are two possibilities for the density of M:

- M is sparse, i.e., it has few nonzero entries.
- M is dense, i.e., it has many nonzero entries.

Let \(n \times m\) be the dimensions of M and let r be the number of nonzero entries in M. If r is small relative to \(n \cdot m\), then we can encode the transitions individually:

number of transitions: \(\log_2(r)\) bits
topology of the r transitions: \(\log_2 \binom{nm}{r}\) bits
values of the r transitions: \(\log_2 \binom{N + r - 1}{r - 1}\) bits

If r is large relative to \(n \cdot m\), then it is more efficient to code every one of the nm transitions explicitly instead:

values of the nm transitions: \(\log_2 \binom{N + nm - 1}{nm - 1}\) bits

For each transition matrix, we choose the encoding that produces the minimum bit length and add a single bit to signal which encoding of the transitions has been chosen. Thus, given an \(n \times m\) transition matrix M with r nonzero probabilities and N chromosomal samples, its minimum encoding is:

second dimension: \(\log_2(m)\) bits
transition protocol bit: 1 bit
transitions: \(\min\left(\log_2(r) + \log_2 \binom{nm}{r} + \log_2 \binom{N + r - 1}{r - 1},\ \log_2 \binom{N + nm - 1}{nm - 1}\right)\) bits

5.4.4 Decoding a Model

Because the topology of a model is implied by the order of the ancestral segments and transition matrices, we can reconstruct the model from its coded representation very easily. We traverse the model from left to right, assigning all of the units to the current location until a transition matrix is read. Because of the unit ordering, each transition matrix's location must correspond to the leftmost right endpoint among all ancestral segments not yet capped by a transition matrix. The left endpoints of the next batch of ancestral segments must correspond to the location of the transition matrix most recently placed. As the second dimension of each transition matrix is encoded (i.e., \(|L^{\mathcal{S}}_{t+1}|\) is explicitly encoded), the number of ancestral segment units between this transition matrix and the next one is always known. The details of the decoding algorithm are shown in Algorithm 1.
Algorithm 1 Decoding a Model
  Uncapped ← ∅
  LOC ← 1
  i ← 1
  j ← 1
  Read in m ← \(|L^{\mathcal{S}}_1|\)
  while units left do
    if m > 0 then                                        ▷ Read in an ancestral segment
      Read in length, f, and C
      Construct \(S^i\) ← (LOC, f, C)
      Uncapped ← Uncapped ∪ {\(S^i\)}
      i ← i + 1
      m ← m − 1
    else                                                 ▷ Read in a transition matrix
      LOC ← \((\min_{S^i \in \text{Uncapped}} R(S^i)) + 1\)
      n ← \(|R^{\mathcal{S}}_{\text{LOC}-1}|\)
      Read in m
      Read in the n × m matrix \(M^j\)
      \((L(M^j), R(M^j))\) ← (LOC − 1, LOC)
      Uncapped ← Uncapped \ \(R^{\mathcal{S}}_{\text{LOC}-1}\)
      j ← j + 1
    end if
  end while

5.5 Optimizing the Model

5.5.1 Initialization

Given a strict block model, the computational challenge is to infer a flexible model for the available data that better represents the boundaries of ancestral chromosomes. We start by initializing a flexible model with a blockwise model, which we then improve iteratively. Specifically, the model is initialized with ancestral segments and transition matrices provided by Haploview [3], which uses the Gabriel et al. [15] definition of haplotype blocks.

5.5.2 Optimizing the HMM

The iterative HMM optimization paradigm alternates between two stages: improving the topology of the model, and improving the probabilistic parameters of the HMM under that topology. We first describe the latter, simpler task. Optimization of the probabilistic parameters of a flexible model is simply the inference of the HMM free parameters from the data. Given the states and transitions defined in Section 5.2, the model likelihood is computed by the Forward-Backward algorithm explained in Section 3.1.2, and the free parameters of the HMM (A, B, P) are improved to convergence by the Baum-Welch algorithm explained in Section 3.1.3. Note that when following transitions during the forward-backward traversals of the data, we do not actually loop over the entire set of states for each \(1 \le t \le T\). Instead, we take advantage of the sparse structure of our HMM: only transitions from t − 1 to t are allowed (recall Figure 5-1).
Therefore, if we define w(t) as the number of ancestral segments present at site t (formally, \(w(t) = |\{q_{i,t} \in Q_X(\mathcal{H})\}|\)), then the computation time for each forward-backward scan is \(O(\sum_{t=1}^{T-1} w(t) \cdot w(t+1))\). When we analyze single chromosomes (\(\mathcal{H}\) is a haplotype HMM), the magnitude of w(t) is bounded in practice by the number \(N_H\) of common haplotypes in a region due to limited haplotype diversity (Section 2.4). When diploid and trio data are analyzed (\(\mathcal{H}\) is a diplotype or trio HMM), w(t) is bounded by \((N_H)^2\) or \((N_H)^4\), respectively. Typical values of \(N_H\) are between two and eight.

5.5.3 Topology Optimization Strategy

After HMM convergence, the flexible model employs a "greedy" optimization strategy to improve the model topology in terms of its MDL score. That is, the strategy always chooses the best short-term topology change without regard for opportunities for longer-term gains. To implement this strategy, a list of candidate topology changes is first generated from the current model. The four types of possible topology changes are explained in Section 5.5.4 below. Next, each topology change is scored for its impact on \(\Phi\), as described in Section 3.3 above. We exploited the locality-of-reference property of the dynamic programming paradigm in our implementation of the Forward-Backward algorithm to allow changes in likelihood to be computed only locally, saving computation time. The topology change that generates the greatest decrease in \(\Phi\) is chosen and applied to the model. The algorithm terminates if no topology changes that decrease \(\Phi\) remain.

5.5.4 Candidate Topology Changes

We now detail the four types of topology-change steps considered by the greedy optimizer. Each such step operates locally on a handful of model components that satisfy step-specific criteria, as explained below.

Horizontal Merges

If a pair of adjacent ancestral segments almost always extend one another, they are merged into a single ancestral segment.
Formally, the criterion for such a merge is the existence of \(S^i\), \(S^j\) such that:

- \(\Pr(X_{L(S^j)} = q_{j,L(S^j)} \mid X_{R(S^i)} = q_{i,R(S^i)}) = 1\), and
- \(\Pr(X_{R(S^i)} = q_{i,R(S^i)} \mid X_{L(S^j)} = q_{j,L(S^j)}) = 1\)

Algorithm 2 formally outlines the horizontal merge algorithm and Figure 5-4 shows the results.

Algorithm 2 Horizontal Merge
  Construct \(S^k\) ← \(\left(L(S^i),\ C(S^i) + C(S^j),\ \frac{f(S^i) + f(S^j)}{2}\right)\)
  Delete \(S^i\) and \(S^j\)

Fig. 5-4: A Horizontal Merge, before and after. The ancestral segments in bold only connect to each other in the model and therefore should be treated as a single ancestral segment.

Vertical Splits

If a single ancestral segment is linked to two other ancestral segments exclusively in a transition matrix, we clone it, splitting its associated probabilities with some perturbations, in the hope that the resulting ancestral segments will converge towards parallel links that the model can subsequently merge horizontally, creating a simpler model. Formally, the criterion for such a split is the existence of \(S^i\), \(S^j\), \(S^k\) such that:

- \(R(S^i) = R(S^j) = L(S^k) - 1\), and
- \(\Pr(X_{L(S^k)} = q_{k,L(S^k)} \mid X_{R(S^i)} = q_{i,R(S^i)}) = 1\), and
- \(\Pr(X_{L(S^k)} = q_{k,L(S^k)} \mid X_{R(S^j)} = q_{j,R(S^j)}) = 1\), and
- \(\Pr(X_{R(S^i)} = q_{i,R(S^i)} \mid X_{L(S^k)} = q_{k,L(S^k)}) + \Pr(X_{R(S^j)} = q_{j,R(S^j)} \mid X_{L(S^k)} = q_{k,L(S^k)}) = 1\)

or, symmetrically:

- \(L(S^i) = L(S^j) = R(S^k) + 1\), and
- \(\Pr(X_{R(S^k)} = q_{k,R(S^k)} \mid X_{L(S^i)} = q_{i,L(S^i)}) = 1\), and
- \(\Pr(X_{R(S^k)} = q_{k,R(S^k)} \mid X_{L(S^j)} = q_{j,L(S^j)}) = 1\), and
- \(\Pr(X_{L(S^i)} = q_{i,L(S^i)} \mid X_{R(S^k)} = q_{k,R(S^k)}) + \Pr(X_{L(S^j)} = q_{j,L(S^j)} \mid X_{R(S^k)} = q_{k,R(S^k)}) = 1\)

Algorithm 3 formally outlines the vertical split algorithm and Figure 5-5 shows the results.
Algorithm 3 Vertical Split
  Create ancestral segment \(S^{k'} = S^k\)
  \((f(S^k), f(S^{k'}))\) ← PerturbHalf\((f(S^k))\)
  Let M be the transition matrix such that \(L(M) = R(S^k)\)
  for all segments x such that \(M[r(k), l(x)] > 0\) do
    \((M[r(k), l(x)], M[r(k'), l(x)])\) ← PerturbHalf\((M[r(k), l(x)])\)
  end for
  Let M be the transition matrix such that \(R(M) = L(S^k)\)
  for all segments x such that \(M[r(x), l(k)] > 0\) do
    \((M[r(x), l(k)], M[r(x), l(k')])\) ← PerturbHalf\((M[r(x), l(k)])\)
  end for
  procedure PerturbHalf(p)
    \(\epsilon \sim \mathrm{Uniform}(0,\ 0.1 \times p)\)
    return \((\frac{p}{2} + \epsilon,\ \frac{p}{2} - \epsilon)\)
  end procedure

Fig. 5-5: A Vertical Split, before and after. The ancestral segment in bold on the left is duplicated in the hope that it will lead to two Horizontal Merges later on.

Prefix Matching

If two ancestral segments start with the same string, or prefix, we can merge those parts of the ancestral segments and reduce the coding complexity. Formally, the criterion for such a change is the existence of \(S^i\), \(S^j\) and site t such that:

- \(L(S^i) = L(S^j)\), and
- \(\forall x \in [L(S^i), t],\ C(S^i)[x] = C(S^j)[x]\)

Algorithm 4 formally outlines the prefix matching algorithm, and Figure 5-6 shows the results.

Algorithm 4 Prefix Match
  Construct \(S^k\) ← \((L(S^i),\ C(S^i)[L(S^i), t],\ f(S^i) + f(S^j))\)
  \(C(S^i)\) ← \(C(S^i)[t + 1, R(S^i)]\)
  \(C(S^j)\) ← \(C(S^j)[t + 1, R(S^j)]\)
  \(L(S^i), L(S^j)\) ← t + 1
  if ∄M such that \(L(M) = t\) then
    Create a new transition matrix M associated with t
  end if
  \(M[r(k), l(i)]\) ← \(f(S^i)\)
  \(M[r(k), l(j)]\) ← \(f(S^j)\)

Fig. 5-6: A Prefix Match, before and after. The bold parts of the ancestral segments match each other and can therefore be compressed into a single ancestral segment.

Suffix Matching

Similarly to prefix matching, if two ancestral segments end with the same string, or suffix, we can merge those parts of the ancestral segments and save some space.
Formally, the criterion for such a change is the existence of \(S^i\), \(S^j\) and site t such that:

- \(R(S^i) = R(S^j)\), and
- \(\forall x \in [t, R(S^i)],\ C(S^i)[x] = C(S^j)[x]\)

Algorithm 5 formally outlines the suffix matching algorithm. A Suffix Match figure is omitted without loss of generality.

Algorithm 5 Suffix Match
  Construct \(S^k\) ← \((t,\ C(S^i)[t, R(S^i)],\ f(S^i) + f(S^j))\)
  \(C(S^i)\) ← \(C(S^i)[L(S^i), t - 1]\)
  \(C(S^j)\) ← \(C(S^j)[L(S^j), t - 1]\)
  \(R(S^i), R(S^j)\) ← t − 1
  if ∄M such that \(R(M) = t\) then
    Create a new transition matrix M associated with t − 1
  end if
  \(M[r(i), l(k)]\) ← \(f(S^i)\)
  \(M[r(j), l(k)]\) ← \(f(S^j)\)

Chapter 6

Empirical Results

This chapter presents experimental results obtained using the flexible model on sample genotype data. We analyzed data from the original region of the genome used to develop the block theory [11], as well as recent data from the HapMap ENCODE project [9], which includes regions genotyped at the projected SNP density of the final HapMap [34]. We first present a specific example of the success of the flexible model, and then demonstrate the validity of the flexible model by showing that it is superior to the strict block model by the MDL criterion across six chromosomal regions. We then present a simple application of the model that evaluates rigid boundaries against flexible boundaries observed in the analyzed regions. These boundaries correspond biologically to haplotype recombination characterized solely by hotspots versus sporadic haplotype break-up.

6.1 Improvement in 5q31

To demonstrate the flexible model on well-known data, we present one subregion of the 5q31 data under the block model (Figure 6-1) and after optimization under the flexible model (Figure 6-2). Notice that the flexible model is more compact and that the flexible model not only agrees with the hotspot predicted by the block model, but also reveals sites of less frequent recombination, as anticipated.
Fig. 6-1: SNPs from chromosomal loci 433467 to 520521 in 5q31 under the block model.

Fig. 6-2: SNPs from chromosomal loci 433467 to 520521 in 5q31 under the flexible model.

    Data Set          Density  Depth      Chr. Region  Interval (Mbp)  SNPs
    Daly et al. [11]  1:5kb    129 trios  5q31         0.27 - 0.89     103
    ENCODE [9]        1:1kb    30 trios   2p16         51.6 - 52.1     515
                                          2q37         234.8 - 235.3   573
                                          4q26         118.7 - 119.2   480
                                          7q21         89.4 - 89.9     379
                                          7q31         126.1 - 126.6   463

Table 6.1: Details of the six chromosomal regions used in validating the flexible model.

6.2 Data

In addition to the 5q31 region, we ran the flexible model on five ENCODE regions. A summary of the data sets used in our analysis of the flexible model is shown in Table 6.1. Due to computational constraints, we avoided running a trio-based model, despite the samples being trios. Instead, we inferred parental phasing as far as possible from the offspring data and treated each partially phased parent as an unrelated diploid in a diplotype flexible model.

6.3 Improvement in likelihood and MDL

The implementation of the flexible model algorithm outlined in Chapter 5 is computationally intensive, so each ENCODE region was subdivided into a small number of subregions and run piecewise. Even though the program could not improve the model across the boundaries of these subregions, the flexible model still improved significantly upon the strict block model in both likelihood and description length (see Figure 6-3). The lower MDL of the flexible model with respect to the block model in each case shows that the flexible model not only describes the data as well as or better than a strict block model, but does so with a significantly shorter description.
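The two-part MDL comparison used here sums the bits needed to encode the model with the bits needed to encode the data given the model. A minimal sketch with made-up numbers (the model sizes and log-likelihoods below are illustrative only, not the thesis's measurements):

```python
import math

def description_length(model_bits: float, log_likelihood: float) -> float:
    """Two-part MDL: bits to encode the model plus bits to encode the
    data given the model, i.e. -log2 P(data | model).
    `log_likelihood` is the natural-log likelihood of the samples."""
    data_bits = -log_likelihood / math.log(2.0)
    return model_bits + data_bits

# Illustrative comparison: a flexible model with a smaller parameter code
# AND a higher data likelihood wins under MDL.
block = description_length(model_bits=3200.0, log_likelihood=-5400.0)
flexible = description_length(model_bits=2900.0, log_likelihood=-5300.0)
assert flexible < block  # the model with the smaller total is preferred
```

Note that a model can also win by trading a slightly lower likelihood for a much shorter model code; MDL judges only the total.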
Fig. 6-3: Improvement by the flexible model over the strict block model in both likelihood and description length. The lower MDL in every case indicates that the flexible model describes the data more concisely and accurately than the strict block model does. The left bar of each pair represents the data description length of a strict block model, whereas the right bar represents the data description length of a flexible model. The bottom part of every bar represents the code length required to describe the model, and the top part of every bar represents the likelihood of the samples under the model.

6.4 Preserved boundaries

The flexible model bears out the hypothesis that while recombination hotspots are an important characteristic of haplotype variation, not all ancestral segments recombined at each hotspot. Figure 6-4 shows that the flexible model preserves some, but not all, of the original boundaries that correspond with the blockwise description. Boundaries that are not preserved by the flexible model, but are instead crossed by one or more ancestral segments, are referred to as traversed boundaries. Such boundaries are presumed to correspond with sites of infrequent ancestral recombination. In contrast, sites that are preserved are presumed to correspond with recombination hotspots.

Fig. 6-4: Number of block boundaries found in blockwise description that are traversed or preserved in the flexible model.

Chapter 7  Conclusions

The ability to accurately model haplotype variation in a clear and concise manner via the flexible model described here will provide researchers with an unprecedented framework for genetic studies.
As genomics continues to expand into larger-scale studies, a model that reflects the true structure of haplotype variation will be indispensable for proposing new association studies, and for analyzing the underlying natural forces that shape our genetic information. This chapter reviews the contributions of this work and surveys some opportunities for possible improvements and extensions.

7.1 Summary of Contribution

7.1.1 Computational Results

Model development  We have developed a novel computational model to describe haplotype variation. The model fits within the HMM framework, with state layers per time frame, sharing of parameters between analogous states across layers, almost deterministic transitions between states corresponding to the same segment, and stochastic transitions between points of interest. The model is extended to handle missing and multi-chromosome data, as generated by current technologies. We define an MDL measure for a flexible model and develop algorithms to optimize the model.

Implementation  The model was implemented and adapted to fit the practical requirements of real data, such as incompleteness. To this end, we introduced efficiency improvements to the HMM itself and to the inference procedure.

Application to real data  We have applied the flexible model to classical and state-of-the-art data sets. The flexible model optimizes existing human genome data, generating a significantly better picture of haplotype variation than a blockwise model of the same data.

7.1.2 Biological Results

A truer model of haplotype variation  The flexible model synthesizes important aspects of existing block and non-block theories of haplotype variation. It recognizes that while hotspots are a critical part of our understanding of the structure of haplotype variation, not all variation can be explained by them. On one hand, the flexible model explicitly acknowledges that ancestral recombination affecting much of the population is rare.
Therefore, the flexible model predicts LD that is attributed to common ancestors of large fractions of the population, whose haplotypes are commonly observed intact in today's samples. Such LD is evident in dense data sets, where regions with no ancestral recombination are ubiquitous. On the other hand, the flexible model explicitly acknowledges that recombination hotspots are even rarer than sites of occasional ancestral recombination. As a result, it predicts that ancestral haplotypes will overlap in a mosaic pattern.

Concise description of data  The flexible model is simpler and more succinct than a blockwise model description, and the likelihood of samples under the flexible model is similar to that of a strict block model. This achievement has the potential to reduce the number of tag SNPs required to characterize a region, as well as the number of haplotype association hypotheses that must be checked, improving the power of genetic studies.

Characterization of haplotype boundaries  The flexible model incorporates opportunities for both hotspot and non-hotspot recombination, setting haplotype boundaries where they make the most sense, rather than at every block boundary. It also allows some insight into whether these boundaries are genuine hotspots or merely sites of occasional recombination.

7.2 Future Work

As shown in Chapter 6, a flexible model can succinctly and accurately describe patterns of haplotype variation for haploid, diploid, or trio samples. There are, however, several opportunities for improvement and extension. This section outlines future projects based on this work.

7.2.1 Improve model optimization

A major practical restriction of the method presented in this thesis involves computational limitations. The task of optimizing such a complex model is computationally intensive and does not scale well for long sequences. We would like to improve the running time as much as possible to allow large data sets to be run more quickly.
We would also like to implement optimization algorithms more elaborate than the greedy scheme, such as Markov chain Monte Carlo (MCMC), to achieve topologies that outperform those developed so far.

7.2.2 Optimization of general parameters

The selection of the general parameters in the flexible model is just as important as the optimization of the free parameters. Currently, the model treats the background recombination rate (ρB) and the background mutation rate (μ) as fixed. We would like to investigate opportunities for computational optimization of μ and ρB to produce a higher likelihood (L). For example, the Newton-Raphson method (see, e.g., [49]) for finding the roots of a function may be used to determine the value of ρB at which L'(ρB) is zero, and therefore at which L(ρB) is a local maximum. While derivatives of the model's likelihood function can be computed analytically, it is easier and as satisfactory to use numerical derivation:

1. Select ε ≪ ρB0, where ρB0 is the current estimate of ρB.
2. Compute:
   • L0 = likelihood(ρB = ρB0)
   • L+ = likelihood(ρB = ρB0 + ε)
   • L− = likelihood(ρB = ρB0 − ε)
3. Estimate the first and second derivatives of L with respect to ρB:
   ∂L/∂ρB ≈ (L+ − L−) / 2ε
   ∂²L/∂ρB² ≈ (L+ − 2L0 + L−) / ε²
4. Compute the Newton-Raphson iteration:
   ρB1 = ρB0 − (∂L/∂ρB) / (∂²L/∂ρB²)
5. If ρB0 ≠ ρB1, then let ρB0 = ρB1 and go to Step 1.

7.2.3 Tag SNPs

Cost-effective designs for large-scale association studies rely on preliminary mapping of the variation in the region of interest in order to select a small set of SNPs to be typed in the large association study cohort. Selection of such effective and economical sets of tag SNPs is therefore a key task for medical genetics. Intuitively, the deeper our understanding of the haplotype structure in a region, the better our tags will be, and MDL is related to the description of a region by a minimal number of tag SNPs. More specifically, selection of a SNP with an allele particular to a haplotype tags that haplotype and renders all other such SNPs redundant.
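The five-step Newton-Raphson recipe of Section 7.2.2 can be sketched directly. The code below is a minimal illustration with central finite differences; the function name, parameters, and the toy likelihood are stand-ins, not the thesis's actual L(ρB):

```python
def newton_raphson_max(likelihood, rho0: float, eps_frac: float = 1e-4,
                       tol: float = 1e-8, max_iter: int = 50) -> float:
    """Find a stationary point of `likelihood` (a local maximum of L) by
    Newton-Raphson on L', estimating derivatives numerically."""
    rho = rho0
    for _ in range(max_iter):
        eps = eps_frac * rho                    # step 1: epsilon << rho
        l0 = likelihood(rho)                    # step 2: L0, L+, L-
        lp = likelihood(rho + eps)
        lm = likelihood(rho - eps)
        d1 = (lp - lm) / (2.0 * eps)            # step 3: dL/drho
        d2 = (lp - 2.0 * l0 + lm) / eps ** 2    #         d2L/drho2
        rho_new = rho - d1 / d2                 # step 4: Newton iteration
        if abs(rho_new - rho) < tol:            # step 5: repeat until fixed
            return rho_new
        rho = rho_new
    return rho

# Sanity check on a toy concave likelihood peaking at rho = 0.03:
est = newton_raphson_max(lambda r: -(r - 0.03) ** 2, 0.05)
assert abs(est - 0.03) < 1e-6
```

For a quadratic toy function the finite differences are exact and the iteration lands on the maximum in one step; for a real likelihood the loop simply repeats until the estimate stabilizes.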
A block model suggests that haplotypes be tagged on a per-block basis. The flexible model can improve upon this strategy. Each boundary-spanning haplotype identified by our model needs to be tagged only once; therefore, merging ancestral segments saves tag SNPs. On the other hand, creating more haplotypes by prefix or suffix matching does not require additional tags, as the merged haplotype is automatically tagged if its two variants are. We predict that implementation of a tagging scheme based on the flexible model will significantly improve the efficiency of tagging.

7.2.4 Prior distribution that assumes coalescence

Samples of data are related, by mutation, recombination, and laboratory error, to a hidden structure of ancestral relation or genealogy. For each (arbitrarily small) non-recombinant region, coalescent theory depicts the underlying tree-like genealogy of that region, and the putative occurrence of mutation along its branches (lineages). Moving along the chromosome, the genealogy changes across sites of ancestral recombination (see Figure 7-1). To determine the a priori probability of some genealogy giving rise to S, a putative set of ancestral haplotypes, we propose:

    • Inferring evident properties of the genealogies by methods similar to those used in phylogeny [37, 14].
    • Employing tools such as the Matrix Tree Theorem [32, 5, 44] to analytically average over the entire space of putative tree structures, and Particle Filtering [33] to enhance accuracy.

Fig. 7-1: Example of historical haplotype structure. Current haplotypes evolved (top) from one another by mutation (H2, H3, H4 and H6) and recombination (H5). A causative mutation renders some haplotypes (H4, H5 and H6) as predisposing to the disease.
Each of the coalescent trees (bottom) for the non-recombinant ancestral segments is due to mutations only. The succession of these trees throughout the genome is a Hidden Markov Model - a flexible, yet complete representation of the haplotype structure, reflected by the data.

Bibliography

[1] E. C. Anderson and J. Novembre. Finding haplotype block boundaries by using the minimum-description-length principle. American Journal of Human Genetics, 73(2):336-354, August 2003.

[2] R. M. Badge, J. Yardley, A. J. Jeffreys, and J. A. Armour. Crossover breakpoint mapping identifies a subtelomeric hotspot for male meiotic recombination. Human Molecular Genetics, 9(8):1239-1244, May 2000.

[3] J. C. Barrett and M. J. Daly. Haploview. http://www.broad.mit.edu/personal/jcbarret/haploview/.

[4] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Information Theory, 44(6):2743-2760, 1998.

[5] C. Berge. The Theory of Graphs and its Applications. Wiley, New York, NY, 1964.

[6] H. Bunke and T. Caelli. Hidden Markov Models: Applications in Computer Vision. World Scientific, River Edge, NJ, 2001.

[7] M. Cargill, D. Altshuler, J. Ireland, P. Sklar, K. Ardlie, N. Patil, N. Shaw, C. R. Lane, E. P. Lim, N. Kalyanaraman, J. Nemesh, L. Ziaugra, L. Friedland, A. Rolfe, J. Warrington, R. Lipshutz, G. Q. Daley, and E. S. Lander. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 22(3):231-8, July 1999.

[8] E. Charniak. Statistical Language Learning. MIT Press, Cambridge, MA, 1993.

[9] The International HapMap Consortium. The International HapMap Project. http://hapmap.org.

[10] The International HapMap Consortium. The International HapMap Project. Nature, 426:789-96, December 2003.

[11] M. J. Daly, J. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29(2):229-32, October 2001.

[12] A. P.
Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[13] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

[14] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for phylogenetic inference. Journal of Computational Biology, 9(2):331-53, 2002.

[15] S. B. Gabriel, S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler. The structure of haplotype blocks in the human genome. Science, 296:2225-2229, June 2002.

[16] D. B. Goldstein. Islands of linkage disequilibrium. Nature Genetics, 29(2):109-11, October 2001.

[17] S. Greenspan and D. Geiger. Model-based inference of haplotype block variation. In Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology, RECOMB 03, pages 131-137, Berlin, 2003. ACM.

[18] M. H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96:746-774, 2001.

[19] A. J. Jeffreys, L. Kauppi, and R. Neumann. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Genetics, 29(2):217-22, October 2001.

[20] L. Kauppi, A. Sajantila, and A. J. Jeffreys. Recombination hotspots rather than population history dominate linkage disequilibrium in the MHC class II region. Human Molecular Genetics, 12(1):33-40, January 2003.

[21] X. Ke, S. Hunt, W. Tapper, R. Lawrence, G. Stavrides, J. Ghori, P. Whittaker, A. Collins, A. P. Morris, D. Bentley, L. R. Cardon, and P. Deloukas. The impact of SNP density on fine-scale patterns of linkage disequilibrium.
Human Molecular Genetics, 13(6):577-88, March 2004.

[22] G. Kimmel and R. Shamir. Maximum likelihood resolution of multi-block genotypes. In Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology, RECOMB 04, pages 2-9, San Diego, 2004. ACM.

[23] M. Koivisto, M. Perola, T. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen, E. Ukkonen, and H. Mannila. An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In Proceedings of the Eighth Pacific Symposium on Biocomputing, PSB 2003, pages 502-513, Lihue, Hawaii, 2003. ACM.

[24] A. Kong, D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonsson, B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S. T. Palsson, M. L. Frigge, T. E. Thorgeirsson, J. R. Gulcher, and K. Stefansson. A high-resolution recombination map of the human genome. Nature Genetics, 31(3):225-6, July 2002.

[25] T. Koski. Hidden Markov Models in Bioinformatics. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.

[26] L. Kruglyak. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics, 22(2):139-44, June 1999.

[27] L. Kruglyak, M. J. Daly, M. P. Reeve-Daly, and E. S. Lander. Parametric and nonparametric linkage analysis: a unified multipoint approach. American Journal of Human Genetics, 58(6):1347-63, June 1996.

[28] L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics, 27(3):234-6, March 2001.

[29] I. L. MacDonald and W. Zucchini. Hidden Markov and Other Models for Discrete-Valued Time Series. Chapman & Hall, London, UK, first edition, 1997.

[30] G. Marth, R. Yeh, M. Minton, R. Donaldson, Q. Li, S. Duan, R. Davenport, R. D. Miller, and P. Y. Kwok. Single-nucleotide polymorphisms in the public domain: how useful are they? Nature Genetics, 27(4):371-2, April 2001.

[31] G. A. T. McVean, S. R. Myers, S. Hunt, P. Deloukas, D. R.
Bentley, and P. Donnelly. The fine-scale structure of recombination rate variation in the human genome. Science, 304:581-584, April 2004.

[32] M. Meila and T. Jaakkola. Tractable Bayesian learning of tree belief networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 380-8, San Francisco, CA, 2000. Morgan Kaufmann Publishers Inc.

[33] B. Ng, L. Peshkin, and A. Pfeffer. Factored particles for scalable monitoring. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Edmonton, Canada, 2002.

[34] National Institutes of Health. Additional genotyping for the human haplotype map. http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-005.html.

[35] N. Patil, A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi, C. R. Hacker, C. R. Kautzer, D. H. Lee, C. Marjoribanks, D. P. McDonough, B. T. N. Nguyen, M. C. Norris, J. B. Sheehan, N. Shen, D. Stern, R. P. Stokowski, D. J. Thomas, M. O. Trulson, K. R. Vyas, K. A. Frazer, S. P. A. Fodor, and D. R. Cox. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719-1723, November 2001.

[36] M. S. Phillips, R. Lawrence, R. Sachidanandam, A. P. Morris, D. J. Balding, M. A. Donaldson, J. F. Studebaker, W. M. Ankener, S. V. Alfisi, F. S. Kuo, A. L. Camisa, V. Pazorov, K. E. Scott, B. J. Carey, J. Faith, G. Katari, H. A. Bhatti, J. M. Cyr, V. Derohannessian, C. Elosua, A. M. Forman, N. M. Grecco, C. R. Hock, J. M. Kuebler, J. A. Lathrop, M. A. Mockler, E. P. Nachtman, S. L. Restine, S. A. Varde, M. J. Hozza, C. A. Gelfand, J. Broxholme, G. R. Abecasis, M. T. Boyce-Jacino, and L. R. Cardon. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nature Genetics, 33(3):382-7, March 2003.

[37] T. Pupko, I. Pe'er, R. Shamir, and D. Graur. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution, 17(6):890-6, June 2000.
[38] D. E. Reich, M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti, D. J. Richter, T. Lavery, R. Kouyoumjian, S. F. Farhadian, R. Ward, and E. S. Lander. Linkage disequilibrium in the human genome. Nature, 411:199-204, May 2001.

[39] N. Risch and K. Merikangas. The future of genetic studies of complex human diseases. Science, 273:1516-7, September 1996.

[40] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

[41] J. Rissanen. Lectures on statistical modeling theory. Lecture Notes for University of London, Royal Holloway, UK, 2001. Contact: jorma.rissanen@mdlresearch.org.

[42] J. C. Stephens, J. A. Schneider, D. A. Tanguay, J. Choi, T. Acharya, S. E. Stanley, R. Jiang, C. J. Messer, A. Chew, J. H. Han, J. Duan, J. L. Carr, M. S. Lee, B. Koshy, A. M. Kumar, G. Zhang, W. R. Newell, A. Windemuth, C. Xu, T. S. Kalbfleisch, S. L. Shaner, K. Arnold, V. Schulz, C. M. Drysdale, K. Nandabalan, R. S. Judson, G. Ruano, and G. F. Vovis. Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293:489-93, July 2001.

[43] M. Stephens, N. J. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68(4):978-989, April 2001.

[44] H. Trent. A note on the enumeration and listing of all possible trees in a connected linear graph. Proceedings of the National Academy of Sciences, 40:1004-1007, 1954.

[45] C. M. Wade, E. J. Kulbokas, A. W. Kirby, M. C. Zody, J. C. Mullikin, E. S. Lander, K. Lindblad-Toh, and M. J. Daly. The mosaic structure of variation in the laboratory mouse genome. Nature, 420:574-8, December 2002.

[46] J. D. Wall and J. K. Pritchard. Assessing the performance of the haplotype block model of linkage disequilibrium. American Journal of Human Genetics, 73(3):502-515, September 2003.

[47] J. D. Wall and J. K. Pritchard. Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews Genetics, 4(8):587-97, August 2003.

[48] N. Wang, J. M.
Akey, K. Zhang, R. Chakraborty, and L. Jin. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. American Journal of Human Genetics, 71(5):1227-34, November 2002.

[49] Eric W. Weisstein. Newton's method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html.

[50] C. L. Yauk, P. R. Bois, and A. J. Jeffreys. High-resolution sperm typing of meiotic recombination in the mouse MHC Ebeta gene. The EMBO Journal, 22(6):1389-1397, March 2003.

[51] S. P. Yip, J. U. Lovegrove, N. A. Rana, D. A. Hopkinson, and D. B. Whitehouse. Mapping recombination hotspots in human phosphoglucomutase (PGM1). Human Molecular Genetics, 8(9):1699-706, September 1999.

[52] K. Zhang, J. M. Akey, N. Wang, M. Xiong, R. Chakraborty, and L. Jin. Randomly distributed crossovers may generate block-like patterns of linkage disequilibrium: an act of genetic drift. Human Genetics, 113(1):51-9, July 2003.