An HMM-based Boundary-flexible Model of
Human Haplotype Variation
by
Jonathan Sheffi
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2004
© Massachusetts Institute of Technology 2004. All rights reserved.
Author: Department of Electrical Engineering and Computer Science
May 20, 2004
Certified by:
Mark J. Daly
Fellow, Whitehead Institute
Thesis Supervisor
Certified by:
David M. Altshuler
Investigator, Broad Institute
Thesis Supervisor
Accepted by:
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
An HMM-based Boundary-flexible Model of Human
Haplotype Variation
by
Jonathan Sheffi
Submitted to the Department of Electrical Engineering and Computer Science
on May 20, 2004, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The construction of a meaningful and detailed description of haplotype variation
holds the promise of more powerful genetic association studies. The segmentation
of the human genome into blocks of limited haplotype diversity has been successfully
employed by models that describe common variation. Some computational models
of haplotype variation are flawed, however: they arbitrarily sever all haplotypes at
block boundaries and assume that block boundaries are areas of free recombination.
In reality, haplotypes break up when they recombine, and many past recombination
events are predicted to occur at sites of occasional recombination. Thus, the genuine
unit of shared genetic variation should often cross block boundaries, or sometimes
end between them.
This work seeks the truer mosaic structure of human haplotypes through flexible haplotype boundaries. This thesis introduces an HMM-based boundary-flexible
model, and proves that this model is superior to a blockwise description via the
Minimum Description Length (MDL) criterion.
Thesis Supervisor: Mark J. Daly
Title: Fellow, Whitehead Institute
Thesis Supervisor: David M. Altshuler
Title: Investigator, Broad Institute
Acknowledgments
Behind every thesis stands not only its author, but also many others without whom
the work would not be possible. I would like to recognize the people who made this
work a reality.
Mark and David The smartest and most supportive supervisors I could have possibly imagined. Thank you for this amazing opportunity.
Itsik My smart, kind, and altogether wise partner in haplotype analysis crime, without whom this project and this thesis would have been nowhere near as fun or
as good. You have been an utter joy to work with, and I plan to lobby the
Nobel committee to create a prize in computational biology just so they can
give it to you. Thank you.
Jeff, Shaun, Claire and Andy Labmates who are so chill, we could safely keep
penguins under the desks if we had to.
Eric, Rafi and Derek I've always got your back, and I know you've always got
mine. You're the best friends a guy could ask for.
Ellie I never would have gotten this far without your encouragement, support, and
faith in me. Thank you for your patience, your sense of humor, and your
wonderful heart.
Mom, Dad, and Karen Thanks for everything. I love you so much!
Contents

1  Introduction                                                        17

2  Human Genetics Background                                           19
   2.1  DNA                                                            19
   2.2  Single Nucleotide Polymorphisms (SNPs)                         20
   2.3  Linkage Disequilibrium                                         21
   2.4  Haplotype blocks                                               21

3  Computational Background                                            25
   3.1  Hidden Markov Models (HMMs)                                    25
        3.1.1  Definition of an HMM                                    26
        3.1.2  Likelihood of a set of sequences of observed symbols
               under an HMM: the Forward-Backward Algorithm            29
        3.1.3  Learning the HMM: Baum-Welch Algorithm                  31
   3.2  Extensions to Standard HMM Theory                              34
        3.2.1  Incomplete data                                         34
        3.2.2  Products of HMMs                                        34
   3.3  Minimum Description Length Principle                           35

4  Related Work                                                        37

5  A Flexible HMM for Haplotype Data                                   39
   5.1  Model Components                                               39
        5.1.1  General Parameters                                      40
        5.1.2  Ancestral Segments                                      40
        5.1.3  Ancestral Segment Tilings                               41
        5.1.4  Transition Matrices                                     42
        5.1.5  Formal Model Definition                                 42
   5.2  Modeling haplotypes as an HMM                                  42
        5.2.1  States                                                  43
        5.2.2  Initial State Distribution                              43
        5.2.3  Emission Probabilities                                  45
        5.2.4  Transition Probabilities                                45
        5.2.5  Improvement over previous models                        46
   5.3  Extension to diploid and trio data                             46
   5.4  Minimum Description Length                                     47
        5.4.1  Encoding a Model                                        47
        5.4.2  Encoding an ancestral segment                           49
        5.4.3  Encoding a transition matrix                            49
        5.4.4  Decoding a Model                                        50
   5.5  Optimizing the Model                                           52
        5.5.1  Initialization                                          52
        5.5.2  Optimizing the HMM                                      52
        5.5.3  Topology Optimization Strategy                          53
        5.5.4  Candidate Topology Changes                              53

6  Empirical Results                                                   59
   6.1  Improvement in 5q31                                            59
   6.2  Data                                                           61
   6.3  Improvement in likelihood and MDL                              61
   6.4  Preserved boundaries                                           63

7  Conclusions                                                         65
   7.1  Summary of Contribution                                        65
        7.1.1  Computational Results                                   65
        7.1.2  Biological Results                                      66
   7.2  Future Work                                                    67
        7.2.1  Improve model optimization                              67
        7.2.2  Optimization of general parameters                      67
        7.2.3  Tag SNPs                                                68
        7.2.4  Prior distribution that assumes coalescence             69
List of Figures

3-1  A simple HMM.                                                     27
4-1  The strict block model.                                           38
4-2  The block-free model.                                             38
5-1  A flexible model and the HMM that represents it.                  44
5-2  A hypothetical flexible model.                                    48
5-3  The encoding order of units for the model shown in Figure 5-2.    48
5-4  A Horizontal Merge, before and after.                             54
5-5  A Vertical Split, before and after.                               55
5-6  A Prefix Match, before and after.                                 56
6-1  SNPs from chromosomal loci 433467 to 520521 in 5q31 under the
     block model.                                                      60
6-2  SNPs from chromosomal loci 433467 to 520521 in 5q31 under the
     flexible model.                                                   60
6-3  Improvement by the flexible model over the strict block model in
     both likelihood and description length.                           62
6-4  Number of block boundaries found in the blockwise description that
     are traversed or preserved in the flexible model.                 63
7-1  Example of historical haplotype structure.                        70
List of Tables

6.1  Details of the six chromosomal regions used in validating the
     flexible model.                                                   61
List of Algorithms

1  Decoding a Model                                                    51
2  Horizontal Merge                                                    54
3  Vertical Split                                                      55
4  Prefix Match                                                        56
5  Suffix Match                                                        57
Chapter 1
Introduction
Genetic differences among individuals affect medically important traits. Association
of these differences with such traits has remained the defining challenge for medical genetics since its inception. Once the genetic factors affecting a disease are understood,
medical researchers can use that knowledge to develop more rational therapies for
diseases that have genetic components, and also gain an understanding of subgroups
of patients that may or may not benefit from existing therapies.
As variation in the human genome becomes better documented, we are able to
develop improved bioinformatics methods to discover the links between genomic data
and the causes of human disease. This field of computational genetics has recently been revolutionized by the concept of haplotype blocks, which partition the genome
into genomic regions of high linkage disequilibrium. The discovery of haplotype blocks
as a ubiquitous feature of the human genome suggests the feasibility of much more
powerful and accurate genetic association studies.
Some computational models of the genome treat haplotype blocks in a simplistic
fashion, assuming that all haplotypes must be broken at all block boundaries and
only at those block boundaries. Both biological theory and empirical observation indicate that a strict block model of the genomic landscape is inaccurate. For example,
linkage disequilibrium often persists between blocks, and some haplotypes therefore
naturally cross block boundaries without any breakdown. These observations suggest
a mosaic structure of interleaving common haplotypes, each of which is the manifestation of chromosomal segments that existed long ago. The prospect of a catalog of
all such segregating ancestral haplotypes holds great promise as a tool for population
geneticists.
This work seeks to probabilistically model these chromosomal segments and show
that they can be described more accurately by a flexible model than by a strict
block model. It uses a Hidden Markov Model (HMM) to determine the likelihood
of observed sequences under a given model; by introducing flexibility at the
block boundaries, we show that it is possible to model ancestral chromosomes more
succinctly than before with little or no loss in accuracy.
Chapters 2 and 3 lay the biological and computational groundwork for the major
work of this thesis. Chapter 4 surveys related work, including other HMM approaches
to haplotype modeling. Chapter 5 explains the design of the flexible model. Chapter 6
documents the effectiveness of the flexible model over a strict block model. Chapter 7
summarizes the contributions by this thesis to the field, and outlines future work in
this area.
Chapter 2
Human Genetics Background
As an aid to those who want to better understand the biological problem behind
haplotype mapping, this chapter provides relevant background material related to
human genetics. An overview of DNA and SNPs provides an understanding of the
underlying data, and haplotype blocks are discussed as the motivating factor for this
study.
2.1 DNA
Deoxyribonucleic acid (DNA) molecules encode the genetic information that dictates
many cellular functions at the molecular level and thus affects many of the observed
traits of a living organism. Abstractly, DNA may be thought of as a linear polymer of
building blocks called nucleotides or bases. The four types of nucleotides are Adenine,
Cytosine, Guanine and Thymine. The nucleotide sequence (a 3 billion letter string in
humans) carries the genetic information of an individual. Each specific location along
the sequence, measured in bases, kilobases (kb) or megabases (Mb), is called a site.
A human cell has 46 DNA molecules, called chromosomes, which store essentially all
genetic information for an individual.
Higher organisms have a diploid genome, meaning that each of the chromosomes
is paired with another, resulting in 23 pairs of chromosomes for humans. Twenty-two
of these pairs are each composed of two near-identical copies of one another. These
pairs are numbered from 1 to 22. The twenty-third pair of chromosomes are the sex
chromosomes. Females possess two copies of the X sex chromosome per cell, and males
possess one copy of the X sex chromosome and one copy of the Y sex chromosome.
The cellular process that forms the basis of sexual reproduction is meiosis. Meiosis
produces each parent's genetic contribution to an offspring. This contribution is
haploid, i.e., it includes exactly one copy of each type of chromosome: one copy of
chromosomes 1 through 22 plus one sex chromosome. Within each parent, each pair of
chromosomes crosses over, or recombines, to create one haploid daughter chromosome
with alternating non-overlapping segments from the chromosome pair. A diploid
offspring is then formed by the union of both parents' haploid contributions. Thus,
an offspring inherits portions of her DNA from each parent, carrying some of the
traits of each parent.
The process of accurate DNA replication is central to the transmission of functional pieces of hereditary information from generation to generation. On rare occasions, this process suffers from imperfections, or replication errors called mutations.
These accumulate over many generations and give rise to less-than-perfect agreement
among chromosome copies in present-day individuals. The processes of recombination and mutation have led to the modification of DNA molecules and the resulting
genetic diversity among living organisms.
2.2 Single Nucleotide Polymorphisms (SNPs)
The DNA content among humans is very similar from individual to individual. In
fact, two humans differ in only one of every 1,200 bases on average, at points where
ancestral mutations have occurred. These small variations cause most of the observed,
heritable differences among individuals. The most common type of mutation is a
replication error in which one nucleotide is substituted for another. The resulting
two-nucleotide variability at a site is referred to as a single nucleotide polymorphism
(SNP) [30], and each of the possible nucleotides at that site is called an allele.
There are about ten million SNPs along
the genome for which both possible alleles are common in the human population [28].
The defining role of SNPs in the observed variation among members of our species
makes them the focus of many studies that seek out genetic factors that contribute
to disease.
Population geneticists often try to associate a particular allele or set of alleles with
a disease state, in what is known as an association study. Genotyping, or the reading
of SNP alleles in particular chromosome copies, is therefore very important to these
studies. SNPs lend themselves well to high-throughput and cost-effective genotyping,
and thus provide a useful manner in which to categorize an individual's genotype.
An individual who has the same allele for both chromosomal copies of a given SNP
is said to be homozygous for that SNP. Similarly, an individual who has different allele
values is said to be heterozygous for that SNP. Each SNP is genotyped independently
of other SNPs, but is read for both chromosomes without distinction between the
chromosomes. Thus, an individual who is heterozygous for a SNP is said to be of
ambiguous phase: it is unclear which allele was derived from the mother and which
was derived from the father. If the parental genotypes are also known, sometimes the
phase can be determined, but there may still be some uncertainty.
2.3 Linkage Disequilibrium
A genomic region is said to be in linkage disequilibrium (LD) if alleles in that region
have not yet recombined enough times to erase traces of their shared ancestral chromosome copies. LD regions are valuable because they allow geneticists to observe
genetic sequences without genotyping every SNP in a region. Instead, scientists sample only a few selected SNPs, allowing for association studies that do not directly
examine the SNP in question [39].
2.4 Haplotype blocks
Recent research on haplotype blocks forms the primary motivation for this thesis.
Haplotype blocks are genomic regions where almost complete LD is observed with
high significance across almost all SNPs in the region. That is, the alleles in a long
region show evidence of very little recombination, and each of the observed sequences
of allele values in the region is referred to as a haplotype. Within a haplotype block,
common variation is therefore due only to mutation and divergence history. The origin
of haplotype blocks is explained as regional variation in recombination rates [48] or
as the result of random crossovers combined with genetic drift [52].
The organization of the genome into orderly regions of little or no recombination means that human variation is much more limited and much less random than
previously thought. Recent analyses of the human genome confirm that it contains
regions of low haplotype diversity [35, 15]. Consequently, it is hoped that one can
concisely represent the genomic content of an individual with an order of magnitude
fewer SNPs than previously thought, allowing more powerful and cost-effective genetic
studies [16, 42].
Because SNP genotyping experiments are expensive, the reduced number of SNPs
required to identify the haplotype in a region makes large-scale disease studies feasible
by decreasing the number of SNPs genotyped with little loss of information. In the
long run, as SNP genotyping becomes arbitrarily inexpensive, a realistic model of
haplotype variation will be required to interpret the resulting data. A public effort
has been launched to catalog variation across the genome as a resource for attempts
to determine the genetic factors that contribute to common diseases. This haplotype
map, or "HapMap," is becoming more refined, and provides the data that forms the
basis for the research described in this thesis [10, 9].
Haplotype blocks are a common feature of the human genome, though empirical
attempts to characterize them in terms of length and frequency [21, 15, 36, 46, 47] have
differed due to both the analytic methods employed and the type of data collected.
Population simulations that assume a uniform recombination rate estimated LD to
extend only to a distance of roughly 3 kb [26].
In contrast, empirical data [38] shows that LD extends an order of magnitude farther
than that, owing to variation in recombination rates [31].
Regional recombination rates in fact vary widely across the genome, from under 0.1
cM¹/Mb to more than 3 cM/Mb [24]. Jeffreys et al. [19] and other studies have shown
evidence of recombination hotspots in humans and other organisms [2, 20, 50, 51], at
which more recombination events happen than at other sites. High-resolution mappings have shown the existence of these hotspots, which are about 1 to 2 kb in length,
and the great majority (about 94%) of crossover events lie within hotspots [19]. Very
recent SNP surveys in European and African populations find evidence for extreme
local recombination rate variation spanning four orders of magnitude, in which 50%
of recombination events take place in less than 10% of the genome, and 80% of recombination events take place in 25% of the genome. The same surveys also confirm
that recombination hotspots are a ubiquitous feature of the human genome [31].
¹Recombination rates are usually measured in centimorgans (cM). One centimorgan is equal to
a 1% chance that a SNP at one site will be separated from a SNP at a second site due to crossing
over in a single generation.
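As an illustrative calculation based on this definition (toy numbers, not from the thesis), the chance that two sites remain together decays geometrically with the number of generations:

```python
# Probability that two sites d centimorgans apart are NOT separated
# by a crossover after g generations, assuming independent meioses and
# a constant per-generation recombination fraction of d/100.
def prob_intact(d_cM: float, generations: int) -> float:
    per_gen = d_cM / 100.0  # 1 cM = 1% chance of separation per generation
    return (1.0 - per_gen) ** generations

# Two sites 1 cM apart, after 100 generations:
print(round(prob_intact(1.0, 100), 3))  # 0.366
```

This geometric decay is why LD between distant sites erodes over many generations while tightly linked sites retain traces of their shared ancestry.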
Chapter 3
Computational Background
To provide a basic understanding of the computational work described in this thesis,
this chapter presents relevant background material on the computational aspects of
this haplotype modeling project. Hidden Markov Models and the Minimum Description
Length principle are explained in detail here because these concepts are central to
this project.
3.1 Hidden Markov Models (HMMs)
A Hidden Markov Model (HMM) is a probabilistic description of a class of sequences.
It depicts a finite automaton, or "theoretical machine," which consists of a set of given
states, with a prescribed set of probabilities governing possible transitions between
states through time. The current state of the HMM at a specific time frame depends
on the previous state and the probabilistic transitions between states. At each time
frame, the current HMM state probabilistically emits a symbol from the alphabet of
symbols used by the class of sequences. Only the emitted symbols, not the states, are
visible to external observers, so we call the states "hidden," hence the name of the
model. HMMs are used to model processes that give rise to sequences of symbols by
evaluating the likelihood of certain observed sequences of symbols under the model,
and to estimate the parameters which yield the model that best describes the data.
HMMs have been used in many fields, such as computer vision [6] and speech
recognition [8], as well as in bioinformatics [25]. In bioinformatics, the role of the
time frame is often assumed by the position along a linear biopolymer such as DNA.
For example, an HMM can simulate a DNA sequence by emitting a sequence of its
monomers (i.e., nucleotides). The stochastic processes of recombination and mutation
make biological data appear especially well-suited to a stochastic model rather than a
deterministic model. For example, the GENEHUNTER software uses recombination
along the chromosome in place of the HMM's time axis to estimate familial inheritance
patterns from incomplete marker data [27].
3.1.1 Definition of an HMM
To define a particular model, the following general conventions are used (see, e.g. [29,
13]).
Definition A given Hidden Markov Model $\mathcal{H}$ is a triplet $(Q_X(\mathcal{H}), Q_V(\mathcal{H}), \Theta(\mathcal{H}))$, where:

* $Q_X(\mathcal{H})$ represents the finite set of possible states $\{q_1, \ldots, q_N\}$.

* $Q_V(\mathcal{H})$ represents the finite set of possible observed symbols $\{v_1, \ldots, v_M\}$.

* $\Theta(\mathcal{H})$ denotes the free parameters (with fixed $Q_X$ and $Q_V$) for a given HMM. $\Theta(\mathcal{H})$ is a triplet $(A(\mathcal{H}), B(\mathcal{H}), P(\mathcal{H}))$, where:

  - $A(\mathcal{H}) = [a_{i,j}(\mathcal{H})]$ is an $N \times N$ stochastic matrix that describes the probabilistic transitions between states.

  - $B(\mathcal{H}) = [b_{i,m}(\mathcal{H})]$ is an $N \times M$ stochastic matrix that describes the probabilities of each possible symbol being emitted given that the HMM is in some state.

  - $P(\mathcal{H}) = (p_i(\mathcal{H}))$ is a probability vector of dimension $N$ that describes the probabilities of the HMM starting in each particular state.

$\mathcal{H}$ is omitted whenever it is clear from context. An example of an HMM is provided
in Figure 3-1.
26
-2 4
,5
3
1,3
a1 ,
t t t t . 3tt
f
a 4,,
3
2
b1 3
3
= V21
CWII LI1
1
0
3
=v
Fig. 3-1: A simple HMM. qi through q5 are states of the HMM. The probabilities
specified in ajj describe transitions between states, and the probabilities specified in
bi,m describe the probabilities of emitting symbol vm in state qi. One path through
the HMM is shaded diagonally, with the emitted symbols shaded in crosshatch. The
HMM probability of this path is pi x a 4 ,1 x a4 ,5 . The conditional HMM probability
given this path and the observed symbols is bl, 2 x b4,1 x b5,3-
27
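The triplet $(Q_X, Q_V, \Theta)$ can be made concrete as a small data structure. The sketch below (an illustrative class and toy numbers, not from the thesis) stores $A$, $B$, and $P$ as numpy arrays and checks the stochastic constraints the definition imposes:

```python
import numpy as np

class HMM:
    """A minimal HMM (Q_X, Q_V, Theta) with Theta = (A, B, P).

    A: N x N transition matrix, A[i, j] = Pr(X_{t+1}=q_j | X_t=q_i)
    B: N x M emission matrix,   B[i, m] = Pr(O_t=v_m  | X_t=q_i)
    P: length-N initial state distribution, P[i] = Pr(X_1=q_i)
    """

    def __init__(self, A, B, P):
        self.A = np.asarray(A, dtype=float)
        self.B = np.asarray(B, dtype=float)
        self.P = np.asarray(P, dtype=float)
        N, M = self.B.shape
        # A must be N x N; every row of A and B, and the vector P, sums to 1.
        assert self.A.shape == (N, N)
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.P.sum(), 1.0)

# A two-state, two-symbol toy example.
toy = HMM(A=[[0.9, 0.1], [0.2, 0.8]],
          B=[[0.7, 0.3], [0.4, 0.6]],
          P=[0.5, 0.5])
```

The row-stochastic checks mirror the requirement that each row of $A$ and $B$ be a probability distribution.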
Definition A path is a sequence of $T$ states $(q_{i_1}, \ldots, q_{i_T}) \in (Q_X(\mathcal{H}))^T$.
We now describe how an HMM defines probability spaces on possible paths, assigning
a probabilistic meaning to the free parameters P and A:
Definition An HMM $\mathcal{H}$ defines a sequence of state random variables $X_1, \ldots, X_T$
whose distribution is determined by the following probabilistic interpretation of $\mathcal{H}$'s
free parameters $P$ and $A$:

* $P$ is the initial state probability vector, i.e., it determines the distribution of the first state random variable: $\Pr(X_1 = q_i) = p_i$.

* $A$ is the transition matrix, i.e., it determines the distribution of the next state random variable given the current one: $\Pr(X_{t+1} = q_j \mid X_t = q_i) = a_{i,j}$.
We denote a single sequence of actual observed symbols by $\mathcal{O} = (v_{o_1}, \ldots, v_{o_T}) \in (Q_V(\mathcal{H}))^T$, and a set of such sequences by $\Psi = \{\mathcal{O}^1, \ldots, \mathcal{O}^U\}$. We will next describe
how an HMM defines probability spaces on possible output sequences, assigning a
probabilistic meaning to the free parameter $B$.
Definition An HMM $\mathcal{H}$ defines a sequence of output random variables $O_1, \ldots, O_T$
whose distribution is determined by the probabilistic interpretation of $\mathcal{H}$'s free parameter $B$. $B$ is the emission matrix, i.e., it determines the probability distribution of
the current output symbol given the current state: $\Pr(O_t = v_m \mid X_t = q_i) = b_{i,m}$.
These concepts allow us to associate a probability with a given path and a given
output sequence. For a path $Q = (q_{i_1}, \ldots, q_{i_T})$ in an HMM $\mathcal{H}$, its HMM probability
is:

$$\Pr(Q \mid \mathcal{H}) = \Pr(X_1 = q_{i_1}, \ldots, X_T = q_{i_T} \mid \mathcal{H}) = p_{i_1} \times \prod_{t=1}^{T-1} a_{i_t, i_{t+1}}$$

For a path $Q = (q_{i_1}, \ldots, q_{i_T})$ in an HMM $\mathcal{H}$, and a sequence of observed symbols
$\mathcal{O} = (v_{o_1}, \ldots, v_{o_T})$, the conditional HMM probability is:

$$\Pr(\mathcal{O} \mid Q) = \prod_{t=1}^{T} \Pr(O_t = v_{o_t} \mid X_t = q_{i_t}) = \prod_{t=1}^{T} b_{i_t, o_t}$$
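Both product formulas can be evaluated directly for a toy model (all numbers and names illustrative, not from the thesis; states and symbols are 0-indexed in code):

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.2, 0.8]])   # a_{i,j}: transition matrix
B = np.array([[0.7, 0.3], [0.4, 0.6]])   # b_{i,m}: emission matrix
P = np.array([0.5, 0.5])                 # p_i: initial state distribution

def path_prob(path, A, P):
    """Pr(Q | H) = p_{i_1} * prod_{t=1}^{T-1} a_{i_t, i_{t+1}}."""
    prob = P[path[0]]
    for t in range(len(path) - 1):
        prob *= A[path[t], path[t + 1]]
    return prob

def conditional_prob(obs, path, B):
    """Pr(O | Q) = prod_{t=1}^{T} b_{i_t, o_t}."""
    prob = 1.0
    for state, symbol in zip(path, obs):
        prob *= B[state, symbol]
    return prob

path = [0, 0, 1]   # states q_1 -> q_1 -> q_2 (0-indexed)
obs = [0, 1, 1]    # symbols v_1, v_2, v_2
p_path = path_prob(path, A, P)          # 0.5 * 0.9 * 0.1
p_obs = conditional_prob(obs, path, B)  # 0.7 * 0.3 * 0.6
```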
3.1.2 Likelihood of a set of sequences of observed symbols under an HMM: the Forward-Backward Algorithm
One of the most important applications of an HMM $\mathcal{H} = (Q_X, Q_V, \Theta)$ is the calculation of $\Pr(\Psi \mid \mathcal{H})$, the probability of observing the set of sequences of observed
symbols given a particular model. $\Pr(\Psi \mid \mathcal{H})$ is also called the likelihood of $\Psi$. This
value allows us to choose the model, from a set of models, that best describes the
data. Because $Q_X$ and $Q_V$ are fixed, the likelihood of $\Psi$ is often denoted as $\Pr(\Psi \mid \Theta)$.

A common method of determining $\Pr(\Psi \mid \Theta)$ is the Forward-Backward Algorithm
(see, e.g., [29, 13]). The forward-backward algorithm computes forward probabilities
$\alpha_t(i)$ and backward probabilities $\beta_t(i)$, which require a forward and a backward pass
through the data, respectively. Specifically, $\alpha_t(i)$ represents the probability of having
seen some sequence of symbols $(O_1, \ldots, O_t)$ that reaches state $i$ at the $t$-th time frame,
while $\beta_t(i)$ represents the probability of traversing the remaining symbols in the sequence $(O_{t+1}, \ldots, O_T)$ given that we are already in state $i$ at the $t$-th time frame.
In addition to their role in computing $\Pr(\Psi \mid \Theta)$, these probabilities are used in the
Baum-Welch algorithm for parameter reestimation, outlined in the next section.
We begin by defining the $\alpha$ and $\beta$ values:

Definition For a position $t \in 1, \ldots, T$ along an output sequence, we define two
vectors of probabilities:

$$\alpha_t(i) = \Pr(O_1 = v_{o_1}, \ldots, O_t = v_{o_t}, X_t = q_i \mid \Theta)$$
$$\beta_t(i) = \Pr(O_{t+1} = v_{o_{t+1}}, \ldots, O_T = v_{o_T} \mid X_t = q_i, \Theta)$$

for each $i \in 1, \ldots, N$.

Next, we describe the recursive algorithm that computes the $\alpha$ and $\beta$ values:

1. Recursively compute the forward ($\alpha$) values:

   (a) Initialization ($t = 1$): for each $i \in 1, \ldots, N$, $\alpha_1(i) \leftarrow p_i \cdot b_{i,o_1}$

   (b) For each $t \in 1, \ldots, T-1$ and for each $j \in 1, \ldots, N$, compute:

   $$\alpha_{t+1}(j) \leftarrow \Pr(O_1, \ldots, O_{t+1}, X_{t+1} = q_j)
   = \sum_{i=1}^{N} \Pr(O_1, \ldots, O_t, X_t = q_i) \cdot \Pr(X_{t+1} = q_j \mid X_t = q_i) \cdot \Pr(O_{t+1} \mid X_{t+1} = q_j)
   = \sum_{i=1}^{N} \alpha_t(i) \cdot a_{i,j} \cdot b_{j,o_{t+1}}$$

2. Recursively compute the backward ($\beta$) values:

   (a) Initialization ($t = T$): for each $i \in 1, \ldots, N$, $\beta_T(i) \leftarrow 1$

   (b) For each $t \in (T-1), \ldots, 1$ and for each $i \in 1, \ldots, N$, compute:

   $$\beta_t(i) \leftarrow \sum_{j=1}^{N} \Pr(O_{t+1}, \ldots, O_T, X_{t+1} = q_j \mid X_t = q_i)
   = \sum_{j=1}^{N} \Pr(O_{t+1} \mid X_{t+1} = q_j) \cdot \Pr(O_{t+2}, \ldots, O_T \mid X_{t+1} = q_j) \cdot \Pr(X_{t+1} = q_j \mid X_t = q_i)
   = \sum_{j=1}^{N} b_{j,o_{t+1}} \cdot \beta_{t+1}(j) \cdot a_{i,j}$$

3. Given the $\alpha$ and $\beta$ values, we can determine $\Pr(\mathcal{O} \mid \Theta)$:

$$\Pr(\mathcal{O} \mid \Theta) = \sum_{i=1}^{N} \Pr(O_1, \ldots, O_T, X_T = q_i \mid \Theta) = \sum_{i=1}^{N} \alpha_t(i) \cdot \beta_t(i) \quad \text{for any } t$$

The computational complexity of the forward-backward algorithm is $O(T \cdot N^2)$.
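The recursions above map almost line-for-line onto code. The following sketch (an illustrative numpy implementation, not the thesis's code; probabilities are left unscaled, so very long sequences would need log-space or scaled arithmetic, which is not discussed here) computes $\alpha$, $\beta$, and the likelihood of one sequence:

```python
import numpy as np

def forward_backward(obs, A, B, P):
    """Compute alpha, beta, and the sequence likelihood Pr(O | Theta).

    obs is a list of symbol indices; A, B, P are as in Section 3.1.1.
    Runs in O(T * N^2) time, matching the stated complexity.
    """
    T, N = len(obs), len(P)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    # Forward pass: alpha_1(i) = p_i * b_{i,o_1}, then the recursion.
    alpha[0] = P * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Backward pass: beta_T(i) = 1, then the recursion.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Likelihood: sum_i alpha_t(i) * beta_t(i), the same for any t.
    likelihood = alpha[-1].sum()
    return alpha, beta, likelihood
```

A useful sanity check is that $\sum_i \alpha_t(i)\,\beta_t(i)$ gives the same likelihood at every position $t$, as step 3 states.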
3.1.3 Learning the HMM: Baum-Welch Algorithm
Often when dealing with an HMM, we have some data (called a training set) but
no model of the phenomenon that has produced the data. The standard procedure
in this case is to construct a basic model and refine it iteratively until no further
improvement to the data fitting is possible.
Expectation maximization (EM) [12] is the common paradigm for such a convergent
optimization process. EM adjusts the free parameters of the HMM based on the
forward and backward probabilities previously computed. This adjustment uses the
Baum-Welch algorithm (see, e.g., [29, 13]). Each EM iteration has two steps: an
Expectation (E) step and a Maximization (M) step. During the E-step, summary
measures of the data are evaluated. These measures are expected counts of events,
such as state transitions. In the M-step, these measures are interpreted as empirical
frequencies of these events and substituted into the model for the old free parameters.
This substitution is justified by the fact that the new free parameters (the summary
measures) maximize the approximated likelihood of the data.

EM is guaranteed to improve the probability of $\Psi$ being observed from the model
($\Pr(\Psi \mid \Theta)$) in each iteration until some limiting probability is reached. By repeating
the forward-backward and Baum-Welch algorithms, the parameters are guaranteed
to converge to some local maximum likelihood, but not necessarily to the global
maximum likelihood. We first outline the overall algorithm, and then explain the
rationale for each new parameter reestimate:

1. Choose the initial parameters, $\Theta = (A, B, P)$, arbitrarily.

2. Calculate the $\alpha$ and $\beta$ values, as well as $\Pr(\Psi \mid \Theta)$.

3. Reestimate $\hat{p}_i$, $\hat{a}_{i,j}$, and $\hat{b}_{i,m}$ as described below, for all values of $i$, $j$, and $m$.

4. Let $\hat{A} = [\hat{a}_{i,j}]$, $\hat{B} = [\hat{b}_{i,m}]$, $\hat{P} = (\hat{p}_i)$.

5. Let $\hat{\Theta} = (\hat{A}, \hat{B}, \hat{P})$.

6. If $\Theta = \hat{\Theta}$ then quit; otherwise set $\Theta$ to $\hat{\Theta}$ and return to Step 2.
We now describe the details of each reestimate, p, di,j, bi,m. To reestimate P, we
calculate the expected number of times that the initial state is qj under the observations:
U
E (# times X 1 = qjIT)
Pr(Xi = qj |I )
=
u=1
Y..
Pr(Vu, X,
=
qi)
Pr ('u IE))
E
U
Ea(i)- #"(i)
(3.1)
u=1=1Pr(buIE))
This last quantity, computed on the basis of the current parameter estimates, provides
Aj, the improved estimate of pi.
To reestimate B, we first calculate the expected number of occurrences of state i:
U
E(# times Xt = qjjI|)
T
E Pr(Xt = qjj|u)
=
u=1 t=1
U
T
P()
U
,3
U=1 t=1Pr
(3.2)
uIE)
The expected number of such occurrences for which Ou = m is:
U
E(# times O" = m, Xt =
qi4')
=
T
ZZPr(t
=
m,Xt
=
u=1 t=1
U
=
Pr(Xt = q ,0u)
u=1 t:OU=m
au(i)#3t"(i)
~
(3.3)
u=1 t:OU=m
We divide Equation 3.3 by Equation 3.2 to evaluate the proportion of the occurrences
of state qi for which the corresponding observation is m:
'm
U=1 Pr(OuIe)
Ue
ZU_1[
32
t:Og=m
ET_
1
au"(i)3t"(i)]
This is the new estimate of bi,m.
The reestimate of A is calculated in a similar fashion. The expected number of
transitions out of state q_i, given the observations, is:

Σ_{u=1}^{U} Σ_{t=1}^{T−1} Pr(X_t = q_i | ψ^u) = Σ_{u=1}^{U} Σ_{t=1}^{T−1} α_t^u(i)·β_t^u(i) / Pr(ψ^u | Θ)   (3.5)

and the expected number from i to j is:

E(# times X_t = q_i, X_{t+1} = q_j) = Σ_{u=1}^{U} Σ_{t=1}^{T−1} Pr(X_t = q_i, X_{t+1} = q_j | ψ^u)   (3.6)
                                    = Σ_{u=1}^{U} Σ_{t=1}^{T−1} Pr(ψ^u | X_t = q_i, X_{t+1} = q_j) · Pr(X_t = q_i) · a_{i,j} / Pr(ψ^u | Θ)   (3.7)
We also know that:

Pr(ψ^u | X_t = q_i, X_{t+1} = q_j) = Pr(O_1^u, ..., O_t^u | X_t = q_i) × Pr(O_{t+1}^u, ..., O_T^u | X_{t+1} = q_j)
                                   = Pr(O_1^u, ..., O_t^u | X_t = q_i) × Pr(O_{t+1}^u | X_{t+1} = q_j) × Pr(O_{t+2}^u, ..., O_T^u | X_{t+1} = q_j)
                                   = (α_t^u(i) / Pr(X_t = q_i)) · b_{j,O_{t+1}^u} · β_{t+1}^u(j)   (3.8)
Substituting Equation 3.8 into Equation 3.7, the expected number of transitions from
i to j is then:

Σ_{u=1}^{U} Σ_{t=1}^{T−1} Pr(X_t = q_i, X_{t+1} = q_j | ψ^u) = Σ_{u=1}^{U} Σ_{t=1}^{T−1} α_t^u(i) · a_{i,j} · b_{j,O_{t+1}^u} · β_{t+1}^u(j) / Pr(ψ^u | Θ)   (3.9)
Finally, dividing Equation 3.9 by Equation 3.5, the new estimate of a_{i,j} is:

â_{i,j} = [Σ_{u=1}^{U} (1/Pr(ψ^u | Θ)) Σ_{t=1}^{T−1} α_t^u(i) · a_{i,j} · b_{j,O_{t+1}^u} · β_{t+1}^u(j)]
        / [Σ_{u=1}^{U} (1/Pr(ψ^u | Θ)) Σ_{t=1}^{T−1} α_t^u(i)·β_t^u(i)]   (3.10)
3.2 Extensions to Standard HMM Theory

3.2.1 Incomplete data
The definition of an HMM in Section 3.1.1 can be generalized to handle incomplete
data. In such a setting, instead of each data point representing a single HMM output
symbol v_m ∈ Ω_V, it represents a subset of the output alphabet: O_t ⊆ Ω_V. For
example, if the output alphabet is Ω_V = {A, B, C, D, E}, then a data point may
represent O_t = {B, D} instead of just O_t = B.

The Forward-Backward and Baum-Welch algorithms can be modified to compute
the likelihood of the observations under the model and to optimize a model to fit a
data set, even with incomplete data. The same principle of per-state forward and
backward probabilities can be applied by summing over the possible emitted symbols
in a given subset O_t:

Pr(O_t | X_t = q_i, Θ) = Σ_{v ∈ O_t} Pr(v | X_t = q_i, Θ)

We use this result in place of our original emission probability b_{i,m} in the recursive
computation of the α and β values. The computational complexity of the forward-backward
algorithm under this extension then becomes O(T·N² + N·Σ_t |O_t|) = O(T·N² + T·N·M).
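A minimal sketch of this substitution for a toy emission matrix (names illustrative): the per-state emission factor becomes a sum of emission matrix columns over the observed subset, so a completely unknown value contributes a factor of one for every state:

```python
import numpy as np

def emission_factor(B, obs_subset):
    """Pr(O_t | X_t = q_i, Theta) for every state i at once: sum the
    emission probabilities b_{i,v} over the symbols v in the subset O_t."""
    return B[:, sorted(obs_subset)].sum(axis=1)

B = np.array([[0.9, 0.1],     # emission matrix of a toy 2-state, 2-symbol HMM
              [0.2, 0.8]])
full = emission_factor(B, {0})        # fully observed symbol v_0
missing = emission_factor(B, {0, 1})  # unknown value: the whole alphabet
```

This vector replaces the single column `B[:, m]` inside the α and β recursions, leaving the rest of the forward-backward code unchanged.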
3.2.2 Products of HMMs
If we want to observe multiple sequences of observed symbols at once, we can do so
by extending an existing HMM into multiple dimensions. We therefore extend the
definition of a single HMM (Section 3.1.1) to sets of HMMs.

Definition Let {H^(1), ..., H^(D)} be a set of D HMMs. We define their D-dimensional
product HMM H = H^(1) × ... × H^(D) to be the HMM (Ω_X(H), Ω_V(H), Θ(H)) such that:

* Ω_X(H) = Ω_X(H^(1)) × ... × Ω_X(H^(D))
* Ω_V(H) = Ω_V(H^(1)) × ... × Ω_V(H^(D))
* Θ(H) is a triplet (A(H), B(H), P(H)) of two matrices and a vector such that:
  - a_{(i_1,...,i_D),(j_1,...,j_D)}(H) = Π_d a_{i_d,j_d}(H^(d))
  - b_{(i_1,...,i_D),(m_1,...,m_D)}(H) = Π_d b_{i_d,m_d}(H^(d))
  - p_{(i_1,...,i_D)}(H) = Π_d p_{i_d}(H^(d))
This is a bona fide HMM, and the algorithms described in Sections 3.1.2 and 3.1.3
can be applied to compute the likelihood of samples under a model and to fit a
model to existing data. A D-dimensional HMM can also handle incomplete data
as described in 3.2.1. The only caveat is computational complexity: Adding more
states and increasing alphabet sizes leads to a significant increase in the computation
time to produce an optimized model, which is already of quadratic order. Although
theoretically polynomial, a 4-dimensional HMM H × H × H × H involves computation
time of O(T · (N⁸ + N⁴ · M⁴)), which is impractical for N > 10.
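When the parameters are stored as flat matrices over the flattened state space, the componentwise products above are exactly Kronecker products; a sketch with an illustrative dict-based representation:

```python
from functools import reduce
import numpy as np

def product_hmm(hmms):
    """Combine D HMMs, each a dict with keys 'A', 'B', 'p', into their product.
    Flattening the state tuple (i_1, ..., i_D) into one index turns the
    componentwise products of parameters into Kronecker products."""
    return {key: reduce(np.kron, [h[key] for h in hmms]) for key in ("A", "B", "p")}

h = {"A": np.array([[0.7, 0.3], [0.4, 0.6]]),
     "B": np.array([[0.9, 0.1], [0.2, 0.8]]),
     "p": np.array([0.5, 0.5])}
H4 = product_hmm([h, h, h, h])   # a 4-dimensional product: N^4 = 16 states
```

The blow-up is visible immediately: with N = 2 the product already has 16 states and a 16 × 16 transition matrix, which is why large D is impractical.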
3.3 Minimum Description Length Principle

In order to optimize our model, we employ the Minimum Description Length (MDL)
principle. The MDL Principle was introduced by Rissanen [40] as a statistical Occam's
Razor: "The best model/model class among a collection of tentatively suggested ones
is the one that gives the smallest stochastic complexity to the given data" [41, 40]. It
seems intuitive that the less random behavior a model assumes, the more likely
that model is to arise by chance. The MDL principle has already been widely applied
to many problems of model complexity [18].
In order to apply the MDL principle, the greatest challenge is the calculation of
a model's stochastic complexity. One accepted measure of this quantity is the length
of an ideal coding of the model [41, 4].
The number of bits required to describe the parameters of a model is informationally
equivalent to the negative logarithm of the probability of the model arising by chance:
−log₂(Pr(model)). This log probability is therefore determined by decomposing the
model into its constituent parts, encoding them, and devising an ideal coding scheme
to describe the model parameters.
We already know how to calculate the probability of observations under a model
from Section 3.1.2, so we can construct an overall figure of merit Φ for any given
model: the minimal length of the data description through this model.

Φ = MDL(model, observations)
  = model code length − log₂(likelihood)
  = −log₂(Pr(model)) − log₂(Pr(observations | model))
  = −log₂(Pr(H)) − log₂(Pr(Ψ | H))

The first term of Φ increases as model complexity increases, whereas the second term
decreases as model accuracy increases. By minimizing Φ, we achieve a succinct and
accurate model.
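A sketch of the figure of merit, with the two terms supplied externally (the model code length comes from an encoding such as that of Section 5.4, the likelihood from the forward algorithm; the numbers below are made up for illustration):

```python
import math

def phi(model_code_length_bits, log2_likelihood):
    """Phi = model code length - log2(likelihood): complexity plus badness of fit."""
    return model_code_length_bits - log2_likelihood

# A richer model must buy enough extra likelihood to offset its longer code:
simple_model = phi(120.0, math.log2(1e-40))
rich_model = phi(450.0, math.log2(1e-25))
preferred = min(("simple", simple_model), ("rich", rich_model), key=lambda x: x[1])
```

Here the richer model improves the likelihood by 15 orders of magnitude (about 50 bits) but costs 330 extra bits of description, so the simpler model wins under Φ.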
Chapter 4
Related Work
The introduction by Daly et al. [11] of a high-resolution block model of haplotype
recombination, including inter-block recombination (see Figure 4-1), laid the foundation for future models of haplotype variation. Their work provided a high-resolution
map of haplotype block structure in a particular region using an HMM with haplotype
states along the genome.
Gabriel et al. [15] later standardized the block definition and showed that haplotype blocks are found throughout the genome. The Gabriel et al. model has been
implemented in the popular Haploview program [3] and is also the basis for more
recent research into the optimization of the block model. Distinct implementations
of the HMM paradigm applied to the problem of haplotype variation vary in the
meaning assigned to the model haplotypes, in the manner by which haplotypes are
manifested as samples, and in the assumption of haplotype blockiness of the data.
Such HMM-based models are becoming more and more complex, starting from
a model that simply seeks to list common haplotypes within each block [23] and a
model that attempts to minimize entropy of inter-block transitions [1]. These models
optimize the strict block model using the MDL criterion, but do not account for
mutations or the fact that haplotypes in the same block are likely to be similar to
one another. Other models do attempt to assign a realistic meaning to emission
probabilities as chances to observe an ancestral or mutated nucleotide [17, 22], but
even these models still assume that haplotypes are broken up at every block boundary
and only at block boundaries, consequently limiting the model haplotypes to single-block
fragments. Such an assumption is arbitrary and does not reflect the genome
structure of haplotype recombination.
Fig. 4-1: The strict block model. Areas of limited haplotype diversity (rectangles)
are related to one another through probabilistic transitions (arrow) that represent the
frequency of a sequence containing the haplotypes on each end of the arrow.
Stephens et al. [43] use a block-free model (see Figure 4-2), in which model haplotypes
are the actual haplotypes observed in the sample data. The model incorporates
both mutation and recombination, but fails to recognize the strong preference of
recombination events to occur at hotspots. Furthermore, the model does not explicitly
define common ancestral haplotypes, any one of which may have undergone mutation
or recombination events that affect many present-day samples.
Fig. 4-2: The block-free model. All observed sequences are included in the model
(rectangles), and putative recombinations (light arrows) are used to refine the model.
Chapter 5
A Flexible HMM for Haplotype
Data
The structure of haplotype recombination in the human genome is neither strictly
block-like (i.e., characterized by recombination at hotspots alone) nor strictly
block-free (i.e., characterized by uniform recombination throughout the genome).
Appropriately, we devise a general HMM approach to the description of haplotype variation
that tries to mimic real biological phenomena and entities with components of the
model. Specifically, we explicitly treat each haplotype in the model as a segment
of an ancestral chromosome. Each such chromosome came about at a specific time in
history as a result of an ancient event of divergence, mutation, or recombination,
and was present in some portion of the population. As time passed, the ancestral
chromosomal segments underwent more recent events of divergence, mutation, and
recombination, all of which are explicit in our model.
5.1 Model Components
There are two entities that are central to the composition of our flexible model:
ancestral segments and transition matrices. Informally, a flexible model is composed
of some general fixed parameters, and free parameters, which include a set of ancestral
segments and a set of transition matrices linking these segments. We first define these
components of a flexible model, followed by the formal definition of a model.
5.1.1 General Parameters
We define a list G of some parameters of general use in the model. These parameters
are assumed to be fixed and we do not fit them to the data.

Definition The general parameters G = (Site, P_A, P_B, μ) of the model are:

Chromosome position list Site(1), ..., Site(T) is an ordered list of chromosomal
positions of SNPs in the data set. Throughout, we refer to SNP sites by their
index in this list.

Average recombination rate P_A is the average genome-wide rate of recombination.
We fix this parameter at 10^−8 events per base (1 cM/Mb) [24].

Background recombination rate P_B is the average background rate of recombination,
i.e., the rate at which recent recombination has occurred at non-hotspot
sites. Although P_B according to this definition has not been explicitly estimated,
studies [31] indicate that it is an order of magnitude lower than P_A. Therefore,
we set P_B = 10^−9 events per base.

Mutation rate μ = 2 × 10^−8 events per base, the average mutation rate
throughout the genome, is set as a uniform rate [7].
5.1.2 Ancestral Segments
Each ancestral segment entity corresponds to a segment of DNA thought to segregate
unbroken in the modern population. Formally:
Definition An ancestral segment is a triplet S = (L(S), C(S), f(S)), where:

Left endpoint L(S) is an integer (1 ≤ L(S) ≤ T) that denotes the index of the
leftmost SNP of S.

Alleles C(S) ∈ {0, 1}^|C(S)| is a binary vector listing the alleles of the |C(S)| SNPs in S.
Population frequency f(S) is a probability that represents the population frequency of S.
We also calculate some other important properties of S:
Right endpoint R(S) = L(S) + |C(S)| − 1 is an integer that denotes the index of
the rightmost SNP of S.

Age τ(S) = −1 / (P_A · (Site(R(S)) − Site(L(S)))) is a real negative number representing
the putative time to the most recent common ancestor that corresponds to S.

The argument (S) is omitted from the notation when it is clear from context.
5.1.3 Ancestral Segment Tilings
Definition Let S = {S^1, ..., S^|S|} be some set of ancestral segments, where S^i =
(L(S^i), C(S^i), f(S^i)). We formally define R(S) and L(S), the sets of right and left
ancestral segment endpoints that do not include chromosomal endpoints, as:

* R(S) = {R(S^i) | S^i ∈ S} \ {T}
* L(S) = {L(S^i) | S^i ∈ S} \ {1}
We can now introduce an ordering of the ancestral segments which begin or end at a
specific site:

Definition For each t, let L^S_t = (i_1 < ... < i_m) be the sequence of indices such
that L(S^{i_1}) = ... = L(S^{i_m}) = t. We define the left index of i_k as its ordinal in L^S_t:
l(i_k) = k.

The right index is symmetrically defined:

Definition For each t, let R^S_t = (i_1 < ... < i_m) be the sequence of indices such that
R(S^{i_1}) = ... = R(S^{i_m}) = t. We define the right index of i_k as its ordinal in R^S_t:
r(i_k) = k.
An ancestral segment tiling is a set of ancestral segments that tiles the region under
study. Formally:
Definition S is an ancestral segment tiling if:

* min_i L(S^i) = 1
* max_i R(S^i) = T
* t ∈ R(S) ⟺ t + 1 ∈ L(S)

5.1.4 Transition Matrices
Transition matrices define the interconnections between ancestral segments. They
describe the probability that some ancestral segment will follow another ancestral
segment along an individual's single chromosome copy. Formally:
Definition An n × m stochastic matrix M = [m_{i,j}] is a transition matrix associated
with a site t for a given ancestral segment tiling S = {(L(S^i), C(S^i), f(S^i))}_{i=1}^{|S|} if
n = |R^S_t| and m = |L^S_{t+1}|. We also define the notation L(M) = t, R(M) = t + 1.
5.1.5 Formal Model Definition

Definition A haplotype flexible model is a triplet (G, S, M) where G is a list of
general parameters, S = {S^1, ..., S^|S|} is an ancestral segment tiling, and M =
{M^1, ..., M^|M|} is a set of transition matrices for S such that for each t ∈ L(S)
there exists M ∈ M with R(M) = t.
5.2 Modeling haplotypes as an HMM
This section relates the problem of haplotype variation to the HMM framework and
the related algorithms detailed in Section 3.1. The reader is referred to Section 3.1.1,
which explains the common HMM conventions that will be used frequently in this
section. We first describe how each component of an HMM is implemented by the
flexible model, showing how it can generate a sequence of observations that correspond
to the sequence of alleles along a haploid chromosome. In this setting, each HMM
state corresponds to a single hidden ancestral segment at a single site. A diagram of
how a particular flexible model corresponds to an HMM is shown in Figure 5-1.
We begin by defining the input (training set):
* T = length of each chromosome in the training set
" QV = {Vm} = {0, 1}
" Ot
=
is an output random variable which represents the emitted allele at site
t. We model unknown alleles as incomplete data (see Section 3.2.1). In those
cases, O
0
=
=
{0, 1} rather than exactly one of the values in Qv.
(vol,...
, VOT)
is a single chromosome
T = {p,... , 4 U} is the set of chromosomes across all individuals in the training
set
5.2.1 States

Each state in our HMM corresponds to a single ancestral segment S^i and a single site
t along that segment. Therefore, we construct R(S^i) − L(S^i) + 1 states q_{i,t} = (S^i, t)
for each S^i. Notice that the states are now defined by two indices rather than just
one index, as in Section 3.1.1. Thus, we can define the HMM states:

* Ω_X = {q_{i,t}} = {(S^i, t) | L(S^i) ≤ t ≤ R(S^i)}
* X_t is a state random variable denoting the state at site t.
5.2.2 Initial State Distribution

We use each ancestral segment's frequency f(S^i) for the initial state distribution
along the ancestral segment:

* p_{i,t} = f(S^i)
Fig. 5-1: A flexible model and the HMM that represents it. Solid black arrows
correspond to the almost deterministic transitions along an ancestral segment. Unfilled
arrows correspond to very infrequent recombination between ancestral segments.
Dashed arrows correspond to transitions across transition matrices. For visual simplicity,
unfilled arrows are only shown for transitions between the first two loci. Each
state's symbol is emitted with very high probability (1 − μ·τ(S)).
5.2.3 Emission Probabilities

At q_{i,t}, C(S^i)[t] is emitted unless mutation occurs:

Pr(O_t = C(S^i)[t] | X_t = q_{i,t}) = 1 − μ·τ(S^i)

This gives us the emission probabilities:

b_{(i,t),m} = 1 − μ·τ(S^i),  if O_t = C(S^i)[t]
b_{(i,t),m} = μ·τ(S^i),      if O_t ≠ C(S^i)[t]

5.2.4 Transition Probabilities
Each transition probability in the haplotype flexible model describes the probability
of moving between ancestral segments as t is incremented to t + 1. These transitions
can occur along a single ancestral segment, or from one segment to another, and they
can occur either within the segment or at its endpoint.
Within an ancestral segment (L(S^i) ≤ t < R(S^i)), the probability of a transition
from S^i to another ancestral segment S^j is based on the possibility of background
recombination:

Background(i, j, t) = Pr(X_{t+1} = q_{j,t+1} | X_t = q_{i,t})
                    = P_B · f(S^j) · (Site(t + 1) − Site(t)) · (−max(τ(S^i), τ(S^j)))
Between ancestral segments, each entry in the transition matrix describes the conditional
probability of getting from one ancestral segment to another. Let S^i, S^j be
ancestral segments such that R(S^i) = L(S^j) − 1 = t. Let M be the transition matrix
such that L(M) = t. Recall that r(i) and l(j) are the right index of i and the left
index of j, respectively, as defined in Section 5.1.3.

InterSegment(i, j, t) = Pr(X_{t+1} = q_{j,t+1} | X_t = q_{i,t}) = M[r(i), l(j)]
We complete the model with the matrix of transition probabilities. For any pair of
states q_{i,t} and q_{j,t'}, the transition between them is formally defined by:

a_{(i,t),(j,t')} = 0,                              if t ≠ t' − 1
a_{(i,t),(j,t')} = InterSegment(i, j, t),          if t = R(S^i) = L(S^j) − 1
a_{(i,t),(j,t')} = Background(i, j, t),            if t = t' − 1; t ≠ R(S^i); t' ≠ L(S^j); i ≠ j
a_{(i,t),(j,t')} = 1 − Σ_{j≠i} a_{(i,t),(j,t+1)},  otherwise

5.2.5 Improvement over previous models
This model differs from previous work because the endpoints of each haplotype are
attributes of that haplotype rather than of the system as a whole. Thus, the endpoints
can be altered: two haplotypes whose endpoints meet may be merged into a longer
haplotype, or a haplotype may be severed at some SNP, creating a new transition
matrix between the two new ancestral segments if one does not already exist at that
SNP.
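To make the within-segment case of Section 5.2.4 concrete, here is a minimal Python sketch of Background(i, j, t) and the complementary self-transition; the dictionary representation of segments, the helper names, and the parameter values are illustrative rather than taken from the thesis implementation:

```python
P_B = 1e-9   # background recombination rate (events per base)

def background(si, sj, site, t):
    """Background(i, j, t): sporadic recombination from segment si into sj
    between sites t and t+1 (tau values are negative, hence the -max)."""
    return P_B * sj["f"] * (site[t + 1] - site[t]) * (-max(si["tau"], sj["tau"]))

def stay(si, others, site, t):
    """Self-transition along si: whatever probability the background jumps leave over."""
    return 1.0 - sum(background(si, sk, site, t) for sk in others)

site = [0, 5000, 12000]                # chromosomal positions of three SNPs
s1 = {"f": 0.6, "tau": -1.0}           # hypothetical ancestral segments
s2 = {"f": 0.4, "tau": -2.0}
jump = background(s1, s2, site, 0)     # tiny: sporadic break-up is rare
keep = stay(s1, [s2], site, 0)
```

The tiny magnitude of `jump` reflects the biology the model encodes: within-segment break-up is possible everywhere, but it is orders of magnitude rarer than transitions at segment boundaries.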
5.3 Extension to diploid and trio data
The HMM outlined in the previous section is only defined for complete haploid data.
That is, each sequence of observed symbols corresponds to a single chromosome where
all SNP values are known. Unfortunately, haplotype data may be available only
for inbred model organisms [45], obtaining such data may require cost-prohibitive
technologies [35], and missing data is simply a fact of life in science. Typical human
data sets include diploid data, which may be partially resolved into haplotypes using
family information. Specifically, trio families (two parents, one offspring) are the
norm in many studies.
While the practical problem of missing data has been addressed by standard HMM
theory (see Section 3.2.1), it is up to us to extend the flexible model beyond the
single-chromosome HMM defined in Section 5.2 to diploid or trio data. For diploid data, each
individual in the data set represents two independent chromosomal sequences, each
of which is an output sequence of a single haplotype flexible model. Each observation
therefore corresponds to a pair of SNP values across the observed chromosomes. At
homozygous sites, both chromosomes are observed, but at heterozygous sites, the
ambiguous phase implies that a subset of pairs of flexible model output values is
possible: (0, 1) or (1, 0). We define a diplotype HMM H = H^1 × H^2, where H^1 and
H^2 are identical haplotype flexible models, and {O_t} ∈ {0 = (0, 0), 1 = (1, 1), h =
{(0, 1), (1, 0)}}.
For trio data, we observe four independent chromosomes: maternal/paternal x
transmitted/untransmitted. The offspring data often resolves the phase ambiguity of
these four parental chromosomes. If all three individuals in the trio are heterozygous
or data is missing, then there is still some possibility for ambiguity. We define a trio
HMM H = H^1 × H^2 × H^3 × H^4, where each H^d is an identical haplotype flexible
model.
Thus, this model accommodates diploid (trio) data with missing information by
mimicking two (four) identical independent haplotype HMMs, whose outputs are observed only after they are merged into diploid (trio) samples of unrelated individuals.
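The merged diploid observations are then subsets of ordered allele pairs, in the spirit of Section 3.2.1; a sketch with illustrative genotype codes:

```python
from itertools import product

def genotype_to_pairs(genotype):
    """Map one diploid genotype code to the subset of ordered haplotype pairs
    it allows: '0'/'1' homozygous, 'h' heterozygous (unknown phase), '?' missing."""
    table = {
        "0": {(0, 0)},
        "1": {(1, 1)},
        "h": {(0, 1), (1, 0)},                # ambiguous phase
        "?": set(product((0, 1), repeat=2)),  # missing: all four ordered pairs
    }
    return table[genotype]

observed = [genotype_to_pairs(g) for g in "01h?"]
```

Feeding these subsets to the incomplete-data forward-backward machinery lets the diplotype HMM sum over both phasings of every heterozygous site; the trio case is analogous with 4-tuples.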
5.4 Minimum Description Length
In this section, we outline and explain the algorithms that encode and decode any
given model.
This ideal code length is substituted for the stochastic complexity
formula in the MDL criterion, as explained in Section 3.3.
5.4.1 Encoding a Model
In order to encode a model, we devise a scheme that orders descriptions of each unit
of the model (S or M) in such a way that the topology of the model and the locations
of its components can be easily reconstructed from the order of the components and
the length of the ancestral segments. Intuitively, units are sorted by left endpoint.
Each unit is assigned a pair of integers to enable sorting, and ties are broken in favor
of (shorter) ancestral segments. Formally:
1. Assign a pair of integers to each ancestral segment and each transition matrix.
   * For an ancestral segment S^i, the integer pair is (L(S^i), R(S^i) + 1).
   * For a transition matrix M, the integer pair is (L(M), R(M)).
2. Sort all units by the two integers, lexicographically.
3. Encode |L^S_1|.
4. Encode each unit according to Sections 5.4.2 and 5.4.3 below.
For a model with the topology in Figure 5-2, the units in the model will be ordered
as in Figure 5-3.
Fig. 5-2: A hypothetical flexible model. Each rectangle represents a single ancestral
segment, and transition matrices are shown by arrows.
Fig. 5-3: The encoding order of units for the model shown in Figure 5-2 using the
algorithm described above.
5.4.2 Encoding an ancestral segment
To encode each ancestral segment, we only need enough bits to encode the length
of the ancestral segment, its frequency in the model, and its alleles. The rest of
the relevant attributes can be inferred from the coding scheme, as we will show in
Section 5.4.4. The ancestral segment's frequency f(S^i) is a real number, but it can be
encoded with finite precision due to a standard trick: this frequency is only measured
up to 1/N accuracy (where N is the number of chromosomal samples) as the fraction of
chromosomal samples that descended from S^i. This is a ratio between an estimated
integer and N, so at most log₂(N) bits are required to encode f(S^i).

Given an ancestral segment S^i and N chromosomal samples, its minimum encoding is:

length: log₂(R(S^i) − L(S^i) + 1) bits
frequency: log₂(N) bits
alleles: (R(S^i) − L(S^i) + 1) bits
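A one-line sketch of this cost (the function name is illustrative):

```python
import math

def segment_bits(left, right, n_samples):
    """Minimum encoding of one ancestral segment, in bits: its length,
    its frequency to 1/N accuracy, and one bit per allele."""
    length = right - left + 1
    return math.log2(length) + math.log2(n_samples) + length

cost = segment_bits(10, 16, 240)   # a 7-SNP segment, 240 chromosomal samples
```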
5.4.3 Encoding a transition matrix
For a given n × m transition matrix M, we need to encode n × m probabilities, to
1/N accuracy. While n can be deduced from the number of ancestral segments encoded
previous to M, m will need to be encoded explicitly.
There are two possibilities for the density of M:
* M is sparse, i.e., it has few nonzero entries.
* M is dense, i.e., it has many nonzero entries.
Let n × m be the dimensions of M and let r be the number of nonzero entries in M.
If r is small relative to n·m, then we can encode the transitions individually:

number of transitions: log₂(r) bits
topology of the r transitions: log₂(C(nm, r)) bits
values of the r transitions: log₂(C(N + r − 1, r)) bits

If r is large relative to n·m, then it is more efficient to code every one of the nm
transitions explicitly instead:

values of nm transitions: log₂(C(N + nm − 1, nm)) bits
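Both candidate code lengths can be computed directly with binomial coefficients; a sketch, assuming the C(N + r − 1, r) form of the value codes above (function names are illustrative):

```python
import math

def log2_comb(n, k):
    return math.log2(math.comb(n, k))

def sparse_bits(n, m, r, N):
    """Code r transitions individually: count, topology, then values to 1/N."""
    return math.log2(r) + log2_comb(n * m, r) + log2_comb(N + r - 1, r)

def dense_bits(n, m, N):
    """Code all nm transition values explicitly."""
    return log2_comb(N + n * m - 1, n * m)

# For a 4 x 4 matrix over 240 samples, sparsity decides which coding is shorter:
nearly_empty = min(sparse_bits(4, 4, 3, 240), dense_bits(4, 4, 240))
nearly_full = min(sparse_bits(4, 4, 15, 240), dense_bits(4, 4, 240))
```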
For each transition matrix, we choose the encoding that produces the minimum bit
length and add a single bit to signal which encoding of the transitions has been chosen.
Thus, given an n x m transition matrix M with r nonzero probabilities and N
chromosomal samples, its minimum encoding is:
second dimension: log₂(m) bits
transition protocol bit: 1 bit
transitions: min(log₂(r) + log₂(C(nm, r)) + log₂(C(N + r − 1, r)), log₂(C(N + nm − 1, nm))) bits

5.4.4 Decoding a Model
Because the topology of a model is implied by the order of the ancestral segments
and transition matrices, we can reconstruct the model from its coded representation
very easily. We traverse the model from left to right, assigning all of the units to the
current location until a transition matrix is read. Because of the unit ordering, each
transition matrix's location must correspond to the leftmost right endpoint among all
ancestral segments not yet capped by a transition matrix. The left endpoints of the
next batch of ancestral segments must correspond to the location of the transition
matrix most recently placed. As the second dimension of each transition matrix is
encoded, the number of ancestral segment units between this transition matrix and
the next one is always known, i.e., each |L^S_t| is specifically encoded. The details of the
decoding algorithm are shown in Algorithm 1.
Algorithm 1 Decoding a Model
Uncapped ← ∅
LOC ← 1
i ← 1
j ← 1
Read in m                                        ▷ m = |L^S_1|
while units left do
    if m > 0 then                                ▷ Read in an ancestral segment
        Read in length, f, and C
        Construct S^i ← (LOC, C, f)
        Uncapped ← Uncapped ∪ {S^i}
        i ← i + 1
        m ← m − 1
    else                                         ▷ Read in a transition matrix
        LOC ← (min_{S^i ∈ Uncapped} R(S^i)) + 1
        n ← |R^S_{LOC−1}|
        Read in m
        Read in the n × m matrix M^j
        (L(M^j), R(M^j)) ← (LOC − 1, LOC)
        Uncapped ← Uncapped \ {S^i : R(S^i) = LOC − 1}
        j ← j + 1
    end if
end while
5.5 Optimizing the Model

5.5.1 Initialization
Given a strict block model, the computational challenge is to infer a flexible model for
the available data that better represents the boundaries of ancestral chromosomes. We
start by initializing a flexible model with a blockwise model, which we then improve
iteratively. Specifically, the model is initialized with ancestral segments and transition
matrices provided by Haploview [3], which uses the Gabriel et al. [15] definition of
haplotype blocks.
5.5.2 Optimizing the HMM
The iterative HMM optimization paradigm alternates between two stages: Improving
the topology of the model, and improving the probabilistic parameters of the HMM
under the topology. We first describe the latter, simpler task.
Optimization of the probabilistic parameters of a flexible model is simply the
inference of the HMM free parameters from the data. Given the states and transitions
defined in Section 5.2, the model likelihood is computed by the Forward-Backward
algorithm explained in Section 3.1.2 and the free parameters of the HMM (A, B, P)
are improved to convergence by the Baum-Welch algorithm explained in Section 3.1.3.
Note that when following transitions during the forward-backward traversals of
the data, we do not actually loop over the entire set of states for each 1 ≤ t ≤ T.
Instead, we take advantage of the sparse structure of our HMM: only transitions from
t − 1 to t are allowed (recall Figure 5-1). Therefore, if we define w(t) as the number
of ancestral segments present at site t (formally, w(t) = |{q_{i,t} ∈ Ω_X}|), then the
computation time for each forward-backward scan is O(Σ_{t=1}^{T−1} w(t)·w(t+1)). When
we analyze single chromosomes (H is a haplotype HMM), the magnitude of w(t) is
bounded in practice by the number N_H of common haplotypes in a region due to
limited haplotype diversity (Section 2.4). When diploid and trio data are analyzed
(H is a diplotype or trio HMM), w(t) is bounded by (N_H)² or (N_H)⁴, respectively.
Typical values of N_H are between two and eight.
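A sketch of a forward scan that exploits this sparsity, touching only the states present at each site; the dictionary-based representation and all names are illustrative:

```python
def sparse_forward(states_at, trans, emis, init, obs):
    """Forward scan touching only the w(t) states present at each site.

    states_at[t] lists the state ids at site t; trans[(s, s2)] holds the
    nonzero transitions between adjacent sites; emis[(s, o)] and init[s]
    are emission and initial probabilities."""
    alpha = {s: init[s] * emis[(s, obs[0])] for s in states_at[0]}
    for t in range(1, len(obs)):
        alpha = {
            s2: sum(alpha[s] * trans.get((s, s2), 0.0) for s in alpha)
                * emis[(s2, obs[t])]
            for s2 in states_at[t]
        }  # cost of this step: w(t-1) * w(t) multiplications
    return sum(alpha.values())   # Pr(psi | model)

states_at = [["a", "b"], ["a", "b"]]
init = {"a": 0.6, "b": 0.4}
trans = {("a", "a"): 0.99, ("a", "b"): 0.01, ("b", "b"): 0.99, ("b", "a"): 0.01}
emis = {("a", 0): 0.9, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.8}
like = sparse_forward(states_at, trans, emis, init, [0, 1])
```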
5.5.3 Topology Optimization Strategy

After HMM convergence, the flexible model employs a "greedy" optimization strategy
to improve the model topology in terms of its MDL score. That is, the strategy
always chooses the best short-term topology change without regard for opportunities
for longer-term gains. To implement this strategy, a list of candidate topology
changes is first generated from the current model. The four types of possible topology
changes are explained in Section 5.5.4 below. Next, each topology change is scored
for its impact on Φ, as described in Section 3.3 above. We exploited the locality-of-reference
property of the dynamic programming paradigm in our implementation of
the Forward-Backward algorithm to allow changes in likelihood to be computed only
locally, saving computation time. The topology change that generates the greatest
decrease in Φ is chosen and applied to the model. The algorithm terminates if no
topology changes that decrease Φ remain.
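The greedy loop itself is independent of the particular change types; a minimal sketch, with the candidate generator and the Φ scorer supplied as functions (the toy instance at the end is purely illustrative):

```python
def greedy_optimize(model, generate_candidates, score):
    """Repeatedly apply the candidate change that most decreases Phi;
    terminate when no candidate decreases it."""
    best_phi = score(model)
    while True:
        candidates = [(score(m), m) for m in generate_candidates(model)]
        if not candidates:
            return model
        phi, best = min(candidates, key=lambda c: c[0])
        if phi >= best_phi:          # no change decreases Phi: terminate
            return model
        model, best_phi = best, phi

# Toy instance: "models" are integers, Phi is |x - 7|, candidate moves are +/-1.
result = greedy_optimize(0, lambda x: [x - 1, x + 1], lambda x: abs(x - 7))
```

The toy run walks monotonically toward the Φ minimum and stops there, mirroring how the real optimizer halts when every remaining merge, split, or match would increase the description length.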
5.5.4 Candidate Topology Changes
We now detail the four types of topology-change steps considered by the greedy
optimizer. Each such step operates locally on a handful of model components that
satisfy step-specific criteria, as explained below.
Horizontal Merges
If a pair of adjacent ancestral segments almost always extend one another, they are
merged into a single ancestral segment. Formally, the criterion for such a merge is
the existence of S^i, S^j such that:

* Pr(X_{L(S^j)} = q_{j,L(S^j)} | X_{R(S^i)} = q_{i,R(S^i)}) = 1, and
* Pr(X_{R(S^i)} = q_{i,R(S^i)} | X_{L(S^j)} = q_{j,L(S^j)}) = 1
Algorithm 2 formally outlines the horizontal merge algorithm and Figure 5-4 shows
the results.
Algorithm 2 Horizontal Merge
Construct S^k ← (L(S^i), C(S^i) + C(S^j), (f(S^i) + f(S^j))/2)
Delete S^i and S^j
Fig. 5-4: A Horizontal Merge, before and after. The ancestral segments in bold
only connect to each other in the model and therefore should be treated as a single
ancestral segment.
Vertical Splits
If a single ancestral segment is linked to two other ancestral segments exclusively
in a transition matrix, we clone it, splitting its associated probabilities with some
perturbations, in the hope that the resulting ancestral segments will converge towards
parallel links that the model can subsequently merge horizontally, creating a simpler
model. Formally, the criterion for such a split is the existence of S^i, S^j, S^k such that
either:

* R(S^i) = R(S^j) = L(S^k) − 1, and
* Pr(X_{L(S^k)} = q_{k,L(S^k)} | X_{R(S^i)} = q_{i,R(S^i)}) = 1, and
* Pr(X_{L(S^k)} = q_{k,L(S^k)} | X_{R(S^j)} = q_{j,R(S^j)}) = 1, and
* Pr(X_{R(S^i)} = q_{i,R(S^i)} | X_{L(S^k)} = q_{k,L(S^k)}) + Pr(X_{R(S^j)} = q_{j,R(S^j)} | X_{L(S^k)} = q_{k,L(S^k)}) = 1

or:

* L(S^i) = L(S^j) = R(S^k) + 1, and
* Pr(X_{R(S^k)} = q_{k,R(S^k)} | X_{L(S^i)} = q_{i,L(S^i)}) = 1, and
* Pr(X_{R(S^k)} = q_{k,R(S^k)} | X_{L(S^j)} = q_{j,L(S^j)}) = 1, and
* Pr(X_{L(S^i)} = q_{i,L(S^i)} | X_{R(S^k)} = q_{k,R(S^k)}) + Pr(X_{L(S^j)} = q_{j,L(S^j)} | X_{R(S^k)} = q_{k,R(S^k)}) = 1
Algorithm 3 formally outlines the vertical split algorithm and Figure 5-5 shows
the results.
Algorithm 3 Vertical Split
Create ancestral segment S^{k'} = S^k
(f(S^k), f(S^{k'})) ← PerturbHalf(f(S^k))
Let M be the transition matrix such that L(M) = R(S^k)
for all segments x such that M[r(k), l(x)] > 0 do
    (M[r(k), l(x)], M[r(k'), l(x)]) ← PerturbHalf(M[r(k), l(x)])
end for
Let M be the transition matrix such that R(M) = L(S^k)
for all segments x such that M[r(x), l(k)] > 0 do
    (M[r(x), l(k)], M[r(x), l(k')]) ← PerturbHalf(M[r(x), l(k)])
end for

procedure PerturbHalf(p)
    ε ~ Uniform(0, 0.1 × p)
    return (p/2 + ε, p/2 − ε)
end procedure
Fig. 5-5: A Vertical Split, before and after. The ancestral segment in bold on the left
is duplicated in the hope that it will lead to two Horizontal Merges later on.
Prefix Matching
If two ancestral segments start with the same string, or prefix, we can merge those
parts of the ancestral segments and reduce the coding complexity. Formally, the
criterion for such a change is the existence of S^i, S^j and site t such that:

* L(S^i) = L(S^j), and
* ∀x ∈ [L(S^i), t], C(S^i)[x] = C(S^j)[x]
Algorithm 4 formally outlines the prefix matching algorithm, and Figure 5-6 shows the results.
Algorithm 4 Prefix Match
Construct S^k ← (L(S^i), C(S^i)[L(S^i), t], f(S^i) + f(S^j))
C(S^i) ← C(S^i)[t + 1, R(S^i)]
C(S^j) ← C(S^j)[t + 1, R(S^j)]
L(S^i), L(S^j) ← t + 1
if ∄ M such that L(M) = t then
    Create a new transition matrix M associated with t.
end if
M[r(k), l(i)] ← f(S^i)
M[r(k), l(j)] ← f(S^j)
Fig. 5-6: A Prefix Match, before and after. The bold parts of the ancestral segments
match each other and can therefore be compressed into a single ancestral segment.
Suffix Matching
Similarly to prefix matching, if two ancestral segments end with the same string,
or suffix, we can merge those parts of the ancestral segments and save some space.
Formally, the criterion for such a change is the existence of S^i, S^j and site t such that:

* R(S^i) = R(S^j), and
* ∀x ∈ [t, R(S^i)], C(S^i)[x] = C(S^j)[x]
Algorithm 5 formally outlines the suffix matching algorithm. A Suffix Match
figure is omitted without loss of generality.
Algorithm 5 Suffix Match
Construct S^k ← (t, C(S^i)[t, R(S^i)], f(S^i) + f(S^j))
C(S^i) ← C(S^i)[L(S^i), t − 1]
C(S^j) ← C(S^j)[L(S^j), t − 1]
R(S^i), R(S^j) ← t − 1
if ∄ M such that R(M) = t then
    Create a new transition matrix M associated with t − 1.
end if
M[r(i), l(k)] ← f(S^i)
M[r(j), l(k)] ← f(S^j)
Chapter 6
Empirical Results
This chapter presents experimental results obtained using the flexible model on sample
genotype data. We analyzed data from the original region of the genome used to
develop the block theory [11], as well as recent data from the HapMap ENCODE
project [9], which includes regions genotyped at the projected SNP density of the
final HapMap [34].
We first present a specific example of the success of the flexible model, and then
demonstrate the validity of the flexible model by showing that it is superior to the
strict block model by the MDL criterion across six chromosomal regions. We then
present a simple application of the model that evaluates rigid boundaries against
flexible boundaries observed in the analyzed regions. These boundaries correspond
biologically to haplotype recombination characterized solely by hotspots versus sporadic haplotype break-up.
6.1 Improvement in 5q31
To demonstrate the flexible model on well-known data, we present one subregion of
the 5q31 data under the block model (Figure 6-1) and after optimization under the
flexible model (Figure 6-2). Notice that the flexible model is more compact and that
the flexible model not only agrees with the hotspot predicted by the block model, but
also reveals sites of less frequent recombination, as anticipated.
Fig. 6-1: SNPs from chromosomal loci 433467 to 520521 in 5q31 under the block
model.
Fig. 6-2: SNPs from chromosomal loci 433467 to 520521 in 5q31 under the flexible
model.
Data Set          Density  Depth      Chr. Region  Interval (Mbp)  SNPs
Daly et al. [11]  1:5kb    129 trios  5q31         0.27 - 0.89      103
ENCODE [9]        1:1kb    30 trios   2p16         51.6 - 52.1      515
                                      2q37         234.8 - 235.3    573
                                      4q26         118.7 - 119.2    480
                                      7q21         89.4 - 89.9      379
                                      7q31         126.1 - 126.6    463
Table 6.1: Details of the six chromosomal regions used in validating the flexible model.
6.2 Data
In addition to the 5q31 region, we ran the flexible model on five ENCODE regions.
A summary of the data sets used in our analysis of the flexible model is shown in
Table 6.1. Due to computational constraints, we avoided running a trio-based model,
despite the samples being trios. Instead, we inferred parental phasing as much as
possible via the offspring data and considered each partially phased parent as an
unrelated diploid in a diplotype flexible model.
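The phase-inference step can be illustrated at a single biallelic SNP: a parent's transmitted allele is determined whenever the trio genotypes admit only one consistent assignment. The encoding and function name below are illustrative, not the thesis implementation:

```python
def transmitted_allele(parent, other, child):
    """Return the allele `parent` transmitted to `child` at one SNP,
    or None when the trio is ambiguous (all three heterozygous).
    Genotypes are unordered pairs of allele characters."""
    candidates = set()
    for i in (0, 1):
        from_parent, from_other = child[i], child[1 - i]
        # An assignment is consistent if each child allele could have
        # come from the corresponding parent.
        if from_parent in parent and from_other in other:
            candidates.add(from_parent)
    return candidates.pop() if len(candidates) == 1 else None
```

For example, a heterozygous parent ("A", "a") paired with a homozygous spouse ("a", "a") and a heterozygous child must have transmitted "A"; the fully heterozygous trio remains unphased, which is why the parents enter the model only partially phased.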
6.3 Improvement in likelihood and MDL
The implementation of the flexible model algorithm outlined in Chapter 5 is computationally intensive, so each ENCODE region was subdivided into a small number of
subregions and run piecewise. Despite the fact that the program could not improve
the model across the boundaries of these subregions, the flexible model still improved
significantly upon the strict block model in both likelihood and description length
(see Figure 6-3).
The lower MDL of the flexible model with respect to the block model in each case shows that the flexible model not only has a higher prior probability (a shorter model code) but also describes the data as well as or better than a strict block model.
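The comparison can be phrased as a two-part code length. The numbers below are hypothetical, chosen only to mirror the shape of the comparison in Figure 6-3:

```python
import math

def description_length(model_bits, data_log_likelihood):
    """Two-part MDL: bits to encode the model itself, plus
    -log2 P(data | model) bits to encode the samples given the model."""
    return model_bits - data_log_likelihood / math.log(2)

# Hypothetical values: the flexible model spends fewer bits on the model
# and assigns the data a higher likelihood, so its total MDL is lower.
block_mdl = description_length(model_bits=2500.0,
                               data_log_likelihood=-6000.0 * math.log(2))
flexible_mdl = description_length(model_bits=1800.0,
                                  data_log_likelihood=-5600.0 * math.log(2))
assert flexible_mdl < block_mdl
```

A model wins under MDL only if any extra model complexity is repaid by a shorter encoding of the data; here both parts shrink, so the comparison is unambiguous.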
[Stacked bar chart over regions 5q31, 2p16, 2q37, 4q26, 7q21 and 7q31 (x-axis: Chromosomal Region); for each region, paired bars stack log(P(Data|Block model)) on log(P(Block model)) and log(P(Data|Flexible model)) on log(P(Flexible model)).]
Fig. 6-3: Improvement by the flexible model over the strict block model in both
likelihood and description length. The lower MDL in every case indicates that the
flexible model describes the data more concisely and accurately than the strict block
model does. The left bar of each pair represents the data description length of a
strict block model, whereas the right bar represents the data description length of
a flexible model. The bottom part of every bar represents the code length required
to describe the model, and the top part of every bar represents the likelihood of the
samples under the model.
6.4 Preserved boundaries
The flexible model bears out the hypothesis that while recombination hotspots are an
important characteristic of haplotype variation, not all ancestral segments recombined
at each hotspot. Figure 6-4 shows that the flexible model preserves some, but not
all of the original boundaries that correspond with blockwise description. Boundaries
that are not preserved by the flexible model, but are instead crossed by one or more
ancestral segments, are referred to as traversed boundaries.
Such boundaries are
presumed to correspond with sites of infrequent ancestral recombination. In contrast,
sites that are preserved are presumed to correspond with recombination hotspots.
[Stacked bar chart over regions 5q31, 2p16, 2q37, 4q26, 7q21 and 7q31 (x-axis: Chromosomal Region); bars stack the number of boundaries preserved on the number of boundaries traversed.]
Fig. 6-4: Number of block boundaries found in blockwise description that are traversed or preserved in the flexible model.
Chapter 7
Conclusions
The ability to accurately model haplotype variation in a clear and concise manner
via the flexible model described here will provide researchers with an unprecedented
framework for genetic studies. As genomics continues to expand into larger-scale
studies, a model that reflects the true structure of haplotype variation will be indispensable for proposing new association studies, and for analyzing the underlying
natural forces that shape our genetic information. This chapter reviews the contributions of this work and outlines opportunities for improvements and
extensions.
7.1 Summary of Contributions
7.1.1 Computational Results
Model development We have developed a novel computational model to describe
haplotype variation. The model fits within the HMM framework, with state layers per time frame, sharing of parameters between analogous states across layers, almost deterministic transitions between states corresponding to the same
segment, and stochastic transitions between points of interest. The model is
extended to handle missing and multi-chromosome data, as generated by current technologies. We define an MDL measure for a flexible model and develop
algorithms to optimize the model.
Implementation The model was implemented and adapted to fit the practical requirements of real data, such as incompleteness. To this end, we introduced
efficiency improvements to the HMM itself and to the inference procedure.
Application to real data We have applied the flexible model to classical and state-of-the-art data sets. The flexible model optimizes existing human genome data,
generating a significantly better picture of haplotype variation than a blockwise
model of the same data.
7.1.2 Biological Results
A truer model of haplotype variation The flexible model synthesizes important
aspects of existing block and non-block theories of haplotype variation. It recognizes that while hotspots are a critical part of our understanding of the structure
of haplotype variation, not all variation can be explained by them. On one hand,
the flexible model explicitly acknowledges as rare the ancestral recombination
that affects much of the population. Therefore, the flexible model predicts LD
that is attributed to common ancestors of large fractions of the population,
whose haplotypes are commonly observed intact in today's samples. Such LD
is evident in dense data sets, where regions with no ancestral recombination
are ubiquitous. On the other hand, the flexible model explicitly acknowledges
that recombination hotspots are even rarer than sites of occasional ancestral
recombination. As a result, it predicts that ancestral haplotypes will overlap in
a mosaic pattern.
Concise description of data The flexible model is simpler and more succinct than
a blockwise model description, and the likelihood of samples under the flexible
model is similar to that of a strict block model.
This achievement has the potential to reduce the number of tag SNPs required to characterize a region, as well as the number of haplotype-association hypotheses to be tested, improving the power of genetic studies.
Characterization of haplotype boundaries The flexible model incorporates opportunities for both hotspot and non-hotspot recombination, setting haplotype
boundaries where they make most sense, rather than at every block boundary.
It also allows some insight into whether these boundaries are genuine hotspots
or merely sites of occasional recombination.
7.2 Future Work
As shown in Chapter 6, a flexible model can succinctly and accurately describe patterns of haplotype variation for haploid, diploid, or trio samples. There are, however,
several opportunities for improvement and extension. This section outlines future
projects based on this work.
7.2.1 Improve model optimization
A major practical restriction of the method presented in this thesis involves computational limitations. The task of optimizing such a complex model is computationally
intensive and does not scale well for long sequences. We would like to improve on
the running time as much as possible to allow large data sets to be run more quickly.
We would also like to implement optimization algorithms more elaborate than the greedy scheme, such as Markov chain Monte Carlo (MCMC), to achieve topologies that outperform those developed so far.
7.2.2 Optimization of general parameters
The selection of the general parameters in the flexible model is just as important as
the optimization of the free parameters. Currently, the model treats the background recombination rate (ρB) and the background mutation rate (μ) as fixed. We would like to investigate opportunities for computational optimization of μ and ρB to produce a higher likelihood (L). For example, the Newton-Raphson method (see, e.g., [49]) for finding the roots of a function may be used to determine the value of ρB at which L'(ρB) is zero, and therefore at which L(ρB) is a local maximum. While derivatives of the model's likelihood function can be computed analytically, it is easier and as satisfactory to use numerical differentiation:
1. Select ε << ρB0.
2. Compute:
* L0 = likelihood(ρB = ρB0)
* L+ = likelihood(ρB = ρB0 + ε)
* L- = likelihood(ρB = ρB0 - ε)
3. Compute the first and second derivatives of L with respect to ρB:
∂L/∂ρB ≈ (L+ - L-) / (2ε)
∂²L/∂ρB² ≈ (L+ - 2L0 + L-) / ε²
4. Compute the Newton-Raphson iteration:
ρB1 = ρB0 - (∂L/∂ρB) / (∂²L/∂ρB²)
5. If ρB0 ≠ ρB1, then let ρB0 = ρB1 and go to Step 1.
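The steps above can be sketched directly; the `likelihood` argument stands for any smooth scalar function, and the parameter names are illustrative:

```python
def newton_maximize(likelihood, rho0, eps=1e-5, tol=1e-9, max_iter=100):
    """Newton-Raphson on the numerical first derivative of `likelihood`:
    central differences supply L' and L'', as in Steps 1-5."""
    rho = rho0
    for _ in range(max_iter):
        l0 = likelihood(rho)
        lp = likelihood(rho + eps)  # L+
        lm = likelihood(rho - eps)  # L-
        d1 = (lp - lm) / (2 * eps)           # dL/drho (central difference)
        d2 = (lp - 2 * l0 + lm) / eps ** 2   # d2L/drho2
        rho_new = rho - d1 / d2              # Newton-Raphson step
        if abs(rho_new - rho) < tol:
            return rho_new
        rho = rho_new
    return rho
```

For a quadratic log-likelihood the iteration lands on the maximum essentially in one step; for the model's actual likelihood surface convergence is only local, so a sensible starting value for ρB matters.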
7.2.3 Tag SNPs
Cost-effective designs for large-scale association studies rely on preliminary mapping
of the variation in the region of interest in order to select a small set of SNPs to be
typed in the large association study cohort. Selection of such effective and economical
sets of tag SNPs is therefore a key task for medical genetics. Intuitively, the deeper
we understand the haplotype structure in a region, the better our tags will be, and
MDL is related to the description of a region by a minimal number of tag SNPs.
More specifically, selection of a SNP with an allele particular to a haplotype tags that
haplotype and renders all other such SNPs redundant. A block model suggests that
haplotypes be tagged on a per-block basis. The flexible model can improve upon this
strategy.
Each boundary-spanning haplotype identified by our model needs to be tagged
only once. Therefore, merging ancestral segments saves tag SNPs. On the other hand,
creating more haplotypes by prefix or suffix matching does not require additional tags,
as the merged haplotype is automatically tagged if its two variants are. We predict
that implementation of a tagging scheme based on the flexible model will significantly
improve the efficiency of tagging.
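To make the per-haplotype tagging intuition concrete, here is a greedy split-based selector. It is a hypothetical sketch under the simplifying assumption that tagging a region means distinguishing all of its distinct haplotype strings:

```python
def greedy_tags(haplotypes):
    """Greedily pick SNP (column) indices until every haplotype string
    is distinguished from every other; returns the chosen indices."""
    n_sites = len(haplotypes[0])
    tags = []
    groups = [set(range(len(haplotypes)))]  # not-yet-separated haplotypes
    while any(len(g) > 1 for g in groups):
        # Choose the site that splits the current groups most finely.
        best = max(range(n_sites), key=lambda s: sum(
            len({haplotypes[h][s] for h in g}) for g in groups))
        new_groups = []
        for g in groups:
            by_allele = {}
            for h in g:
                by_allele.setdefault(haplotypes[h][best], set()).add(h)
            new_groups.extend(by_allele.values())
        if len(new_groups) == len(groups):  # no progress: duplicates remain
            break
        tags.append(best)
        groups = new_groups
    return tags
```

Under the flexible model the inputs would be boundary-spanning ancestral segments rather than per-block haplotypes, which is exactly where the savings described above would come from.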
7.2.4 Prior distribution that assumes coalescence
Samples of data are related, by mutation, recombination, and laboratory error, to
a hidden structure of ancestral relation or genealogy. For each (arbitrarily small)
non-recombinant region, coalescent theory depicts the underlying tree-like genealogy
of that region, and the putative occurrence of mutation along its branches (lineages).
Moving along the chromosome, the genealogy changes across sites of ancestral recombination (see Figure 7-1).
To determine the a priori probability of some genealogy giving rise to S, a putative set of ancestral haplotypes, we propose:
* Inferring evident properties of the genealogies by methods similar to those used in phylogeny [37, 14].
* Employing tools such as the Matrix Tree Theorem [32, 5, 44] to analytically average over the entire space of putative tree structures, and Particle Filtering [33] to enhance accuracy.
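The Matrix Tree Theorem mentioned above counts (and, in weighted form, sums over) spanning trees via a cofactor of the graph Laplacian. A minimal unweighted sketch:

```python
from fractions import Fraction

def spanning_tree_count(adj):
    """Matrix Tree Theorem: the number of spanning trees of an undirected
    graph equals any cofactor of its Laplacian L = D - A. `adj` is a
    symmetric 0/1 adjacency matrix with a zero diagonal."""
    n = len(adj)
    lap = [[Fraction(sum(adj[i]) if i == j else -adj[i][j])
            for j in range(n)] for i in range(n)]
    # Delete row 0 and column 0, then take the determinant by exact
    # Gaussian elimination (Fractions avoid floating-point error).
    m = [row[1:] for row in lap[1:]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        piv = next((r for r in range(col, size) if m[r][col] != 0), None)
        if piv is None:
            return 0  # graph is disconnected
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return int(det)
```

In the proposed application, the 0/1 entries would be replaced by branch probabilities, so the same determinant analytically averages over the entire space of tree topologies instead of merely counting them.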
Fig. 7-1: Example of historical haplotype structure. Current haplotypes evolved (top) from one another by mutation (H2, H3, H4 and H6), and recombination (H5). A causative mutation renders some haplotypes (H4, H5 and H6) as predisposing to the disease. Each of the coalescent trees (bottom) for the non-recombinant ancestral segments is due to mutations only. The succession of these trees throughout the genome is a Hidden Markov Model - a flexible, yet complete representation of the haplotype structure, reflected by the data.
Bibliography
[1] E. C. Anderson and J. Novembre. Finding haplotype block boundaries by using
the minimum-description-length principle. American Journal of Human Genetics, 73(2):336-354, August 2003.
[2] R. M. Badge, J. Yardley, A. J. Jeffreys, and J. A. Armour. Crossover breakpoint mapping identifies a subtelomeric hotspot for male meiotic recombination.
Human Molecular Genetics, 9(8):1239-1244, May 2000.
[3] J. C. Barrett and M. J. Daly. Haploview. http://www.broad.mit.edu/personal/jcbarret/haploview/.
[4] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in
coding and modeling. IEEE Trans. Information Theory, 44(6):2743-2760, 1998.
[5] C. Berge. The Theory of Graphs and its Applications. Wiley, New York, NY,
1964.
[6] H. Bunke and T. Caelli. Hidden Markov Models: Applications in Computer Vision. World Scientific, River Edge, NJ, 2001.
[7] M. Cargill, D. Altshuler, J. Ireland, P. Sklar, K. Ardlie, N. Patil, N. Shaw, C. R. Lane, E. P. Lim, N. Kalyanaraman, J. Nemesh, L. Ziaugra, L. Friedland, A. Rolfe, J. Warrington, R. Lipshutz, G. Q. Daley, and E. S. Lander. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 22(3):231-8, July 1999.
[8] E. Charniak. Statistical Language Learning. MIT Press, Cambridge, MA, 1993.
[9] The International HapMap Consortium. The International HapMap Project.
http://hapmap.org.
[10] The International HapMap Consortium. The International HapMap Project.
Nature, 426:789-96, December 2003.
[11] M. J. Daly, J. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29(2):229-32, October 2001.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Series B, 39:1-38, 1977.
[13] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis:
Probabilistic models of proteins and nucleic acids. Cambridge University Press,
1998.
[14] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for
phylogenetic inference. Journal of Computational Biology, 9(2):331-53, 2002.
[15] S. B. Gabriel, S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel,
J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi,
A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler.
The structure of haplotype blocks in the human genome. Science, 296:2225-2229,
June 2002.
[16] D. B. Goldstein. Islands of linkage disequilibrium. Nature Genetics, 29(2):109-11, October 2001.
[17] S. Greenspan and D. Geiger. Model-based inference of haplotype block variation.
In Proceedings of the Seventh Annual International Conference on Research in
Computational Molecular Biology, RECOMB 03, pages 131-137, Berlin, 2003.
ACM.
[18] M. H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96:746-774,
2001.
[19] A. J. Jeffreys, L. Kauppi, and R. Neumann. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature
Genetics, 29(2):217-22, October 2001.
[20] L. Kauppi, A. Sajantila, and A. J. Jeffreys. Recombination hotspots rather than
population history dominate linkage disequilibrium in the MHC class II region.
Human Molecular Genetics, 12(1):33-40, January 2003.
[21] X. Ke, S. Hunt, W. Tapper, R. Lawrence, G. Stavrides, J. Ghori, P. Whittaker,
A. Collins, A. P. Morris, D. Bentley, L. R. Cardon, and P. Deloukas. The impact
of SNP density on fine-scale patterns of linkage disequilibrium. Human Molecular Genetics, 13(6):577-88, March 2004.
[22] G. Kimmel and R. Shamir. Maximum likelihood resolution of multi-block genotypes. In Proceedings of the Eighth Annual International Conference on Research
in Computational Molecular Biology, RECOMB 04, pages 2-9, San Diego, 2004.
ACM.
[23] M. Koivisto, M. Perola, T. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen,
E. Ukkonen, and H. Mannila. An MDL method for finding haplotype blocks and
for estimating the strength of haplotype block boundaries. In Proceedings of the
Eighth Pacific Symposium on Biocomputing, PSB 2003, pages 502-513, Lihue,
Hawaii, 2003. ACM.
[24] A. Kong, D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonsson,
B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien,
S.T. Palsson, M. L. Frigge, T. E. Thorgeirsson, J.R. Gulcher, and K. Stefansson.
A high-resolution recombination map of the human genome. Nature Genetics,
31(3):225-6, July 2002.
[25] T. Koski. Hidden Markov Models in Bioinformatics. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
[26] L. Kruglyak. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics, 22(2):139-44, June 1999.
[27] L. Kruglyak, M. J. Daly, M. P. Reeve-Daly, and E. S. Lander. Parametric and
nonparametric linkage analysis: a unified multipoint approach. American Journal of Human Genetics, 58(6):1347-63, June 1996.
[28] L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics,
27(3):234-6, March 2001.
[29] I. L. MacDonald and W. Zucchini. Hidden Markov and Other Models for Discrete-Valued Time Series. Chapman & Hall, London, UK, first edition, 1997.
[30] G. Marth, R. Yeh, M. Minton, R. Donaldson, Q. Li, S. Duan, R. Davenport, R. D. Miller, and P. Y. Kwok. Single-nucleotide polymorphisms in the public domain: how useful are they? Nature Genetics, 27(4):371-2, April 2001.
[31] G. A. T. McVean, S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley, and P. Donnelly. The fine-scale structure of recombination rate variation in the human
genome. Science, 304:581-584, April 2004.
[32] M. Meila and T. Jaakkola. Tractable Bayesian learning of tree belief networks.
In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence,
pages 380-8, San Francisco, CA, 2000. Morgan Kaufmann Publishers Inc.
[33] B. Ng, L. Peshkin, and A. Pfeffer. Factored particles for scalable monitoring.
In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence,
Edmonton, Canada, 2002.
[34] National Institutes of Health. Additional genotyping for the human haplotype map. http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-005.html.
[35] N. Patil, A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi, C. R. Hacker,
C. R. Kautzer, D. H. Lee, C. Marjoribanks, D. P. McDonough, B. T. N. Nguyen,
M. C. Norris, J. B. Sheehan, N. Shen, D. Stern, R. P. Stokowski, D. J. Thomas,
M. O. Trulson, K. R. Vyas, K. A. Frazer, S. P. A. Fodor, and D. R. Cox. Blocks
of limited haplotype diversity revealed by high-resolution scanning of human
chromosome 21. Science, 294:1719-1723, November 2001.
[36] M. S. Phillips, R. Lawrence, R. Sachidanandam, A. P. Morris, D. J. Balding,
M. A. Donaldson, J. F. Studebaker, W. M. Ankener, S. V. Alfisi, F. S. Kuo,
A. L. Camisa, V. Pazorov, K. E. Scott, B. J. Carey, J. Faith, G. Katari, H. A.
Bhatti, J. M. Cyr, V. Derohannessian, C. Elosua, A. M. Forman, N. M. Grecco,
C. R. Hock, J. M. Kuebler, J. A. Lathrop, M. A. Mockler, E. P. Nachtman,
S. L. Restine, S. A. Varde, M. J. Hozza, C. A. Gelfand, J. Broxholme, G. R.
Abecasis, M. T. Boyce-Jacino, and L. R. Cardon. Chromosome-wide distribution
of haplotype blocks and the role of recombination hot spots. Nature Genetics,
33(3):382-7, March 2003.
[37] T. Pupko, I. Pe'er, R. Shamir, and D. Graur. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution,
17(6):890-6, June 2000.
[38] D. E. Reich, M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti, D. J. Richter, T. Lavery, R. Kouyoumjian, S. F. Farhadian, R. Ward, and E. S. Lander. Linkage
disequilibrium in the human genome. Nature, 411:199-204, May 2001.
[39] N. Risch and K. Merikangas. The future of genetic studies of complex human
diseases. Science, 273:1516-7, September 1996.
[40] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471,
1978.
[41] J. Rissanen. Lectures on statistical modeling theory. Lecture notes, Royal Holloway, University of London, UK, 2001. Contact: jorma.rissanen@mdlresearch.org.
[42] J. C. Stephens, J. A. Schneider, D. A. Tanguay, J. Choi, T. Acharya, S. E.
Stanley, R. Jiang, C. J. Messer, A. Chew, J. H. Han, J. Duan, J. L. Carr,
M. S. Lee, B. Koshy, A. M. Kumar, G. Zhang, W. R. Newell, A. Windemuth,
C. Xu, T. S. Kalbfleisch, S. L. Shaner, K. Arnold, V. Schulz, C. M. Drysdale,
K. Nandabalan, R. S. Judson, G. Ruano, and G. F. Vovis. Haplotype variation
and linkage disequilibrium in 313 human genes. Science, 293:489-93, July 2001.
[43] M. Stephens, N. J. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics,
68(4):978-989, April 2001.
[44] H. Trent. A note on the enumeration and listing of all possible trees in a connected linear graph. Proceedings of the National Academy of Sciences, 40:1004-1007, 1954.
[45] C. M. Wade, E.J. Kulbokas, A. W. Kirby, M. C. Zody, J. C. Mullikin, E. S.
Lander, K. Lindblad-Toh, and M. J. Daly. The mosaic structure of variation in
the laboratory mouse genome. Nature, 420:574-8, December 2002.
[46] J. D. Wall and J. K. Pritchard. Assessing the performance of the haplotype
block model of linkage disequilibrium. American Journal of Human Genetics,
73(3):502-515, September 2003.
[47] J. D. Wall and J. K. Pritchard. Haplotype blocks and linkage disequilibrium in
the human genome. Nature Reviews Genetics, 4(8):587-97, August 2003.
[48] N. Wang, J. M. Akey, K. Zhang, R. Chakraborty, and L. Jin. Distribution of
recombination crossovers and the origin of haplotype blocks: the interplay of
population history, recombination, and mutation. American Journal of Human
Genetics, 71(5):1227-34, November 2002.
[49] Eric W. Weisstein. Newton's method. From MathWorld-A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html.
[50] C. L. Yauk, P. R. Bois, and A. J. Jeffreys. High-resolution sperm typing of meiotic recombination in the mouse MHC Ebeta gene. The EMBO Journal, 22(6):1389-1397, March 2003.
[51] S. P. Yip, J. U. Lovegrove, N. A. Rana, D. A. Hopkinson, and D. B. Whitehouse.
Mapping recombination hotspots in human phosphoglucomutase (PGM1). Human Molecular Genetics, 8(9):1699-706, September 1999.
[52] K. Zhang, J. M. Akey, N. Wang, M. Xiong, R. Chakraborty, and L. Jin. Randomly distributed crossovers may generate block-like patterns of linkage disequilibrium: an act of genetic drift. Human Genetics, 113(1):51-9, July 2003.