DNA Segmentation

Presented by Ming-Te Cheng
IFT 6299 - Algorithmique de l’ADN
Autumn 2004
November 15, 2004
Overview
 Introduction
 Segmentation Models
 Segmentation Methods
 Discussion
Introduction
 Statistical analysis of DNA sequences is
motivated by three areas of exploration
 DNA sequence data offer an extremely fine view to which
traditional methods of variation analysis can be
extended
 DNA sequence data allow the fine-tuning and organization
of genetic processes to be studied
 Comparison of sequences between species demands
methods for determining similarities in evolution or
function
Introduction
 Large chunks of the genome have been sequenced, but
the function of many sequences is unknown
 Scientists rely on homology (similarity) to analyze
unknown sequences by comparing them with small,
previously well-studied sequences
 Methods are needed for describing and assessing
sequences in ways that provide useful characterizations
 Ex: Segmentation models
Segmentation Models
 Assumption: Sequences can be partitioned
into a number of segments
 Each segment has a certain degree of
internal homogeneity (or similarity)
 Ex: Isochores
 Large segments (> 300 kb) of DNA belonging to a
number of classes defined by different [G+C] levels
and by fairly homogeneous base compositions
Segmentation Methods
 Common techniques
 Moving Window
 Maximum-Likelihood Estimation
 Hidden Markov Models
 Recursive Segmentation
Segmentation Methods
Moving Window
 Most commonly used algorithm in the biology
community
 Straightforward implementation
 Calculate the density of a sequence feature of
interest (e.g. G+C content) within a window
 Move the window along the sequence
 Recalculate the density at each new position
Segmentation Methods
Moving Window
 Drawbacks
 Arbitrary choice of window size and moving
distance
 If window size is too large, local fluctuations that
contain significant biological information may be
averaged out
 If moving distance is too long, one domain can be
split between two windows and its distinctive
feature may be lost
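The moving-window procedure and its two tuning parameters can be sketched as follows. This is a minimal illustration, not code from the presentation: the function name `window_density`, the default `feature="GC"`, and the default sizes are all assumptions chosen for the example.

```python
def window_density(seq, feature="GC", window=100, step=50):
    """Return (start, density) pairs for a window sliding along seq.

    feature : the set of letters whose density is measured (assumed: G+C)
    window  : window size in bases
    step    : moving distance between consecutive windows
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    targets = set(feature.upper())
    seq = seq.upper()
    densities = []
    # Only full windows are evaluated; a trailing partial window is ignored.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(1 for base in chunk if base in targets)
        densities.append((start, hits / window))
    return densities
```

Calling `window_density("AAAAGGGG", window=8, step=1)` yields a single density of 0.5, which hides the sharp A-to-G boundary that `window=4, step=4` reveals. This is the averaging-out drawback described above.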
Segmentation Methods
Maximum-Likelihood Estimation
 Algorithm that computes the maximum
likelihood estimate for the number of
changed segments
Segmentation Methods
Maximum-Likelihood Estimation
 Let $X_1, \ldots, X_n$ represent a sequence of independent
random letters from an alphabet $\mathcal{A} = \{a_1, \ldots, a_r\}$
 Let every $X_i$ follow one of two known distributions,
specified by the probabilities $p(j)$ and $q(j)$
 A changed segment is a segment $[a,b]$ of indices
where $P\{X_i = j\} = q(j)$ for all $i \in [a,b]$
 An unchanged segment is a segment $[a,b]$ of indices
where $P\{X_i = j\} = p(j)$ for all $i \in [a,b]$
Segmentation Methods
Maximum-Likelihood Estimation
 Let $x_i$ be the observed value of $X_i$ in the sequence
 Let $C$ be a set of non-intersecting hypothetical
changed segments
 Let $z = (z_1, \ldots, z_n)$ be the indicator vector for $C$,
with $z_i = 1$ if position $i$ lies in a segment of $C$ and
$z_i = 0$ otherwise
Segmentation Methods
Maximum-Likelihood Estimation
 The likelihood function can be written as
$$L = f(x \mid z, p, q) = \prod_{i=1}^{n} \big(p(x_i)\big)^{1-z_i} \big(q(x_i)\big)^{z_i}$$
 The log-likelihood can be written as
$$\log L = \log \prod_{i=1}^{n} \big(p(x_i)\big)^{1-z_i} \big(q(x_i)\big)^{z_i}
= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} z_i \log\frac{q(x_i)}{p(x_i)}
= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} w_i z_i$$
where $w_i = \log\frac{q(x_i)}{p(x_i)}$ is the score of position $i$
 First term represents log-likelihood of null hypothesis that there are
no changed segments
 Second term represents log-likelihood ratio of the alternative
hypothesis
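The decomposition above suggests scoring each position by its log-likelihood ratio and then searching for the segments with the largest total score. The sketch below makes a simplifying assumption that is not on the slide: at most one changed segment is allowed, so the search reduces to a maximum-sum run (Kadane's algorithm). The function names and the dictionary representation of p and q are hypothetical.

```python
import math

def llr_scores(x, p, q):
    """Per-position scores w_i = log(q(x_i) / p(x_i))."""
    return [math.log(q[s] / p[s]) for s in x]

def log_likelihood(x, z, p, q):
    """log L = sum_i log p(x_i) + sum_i w_i * z_i for an indicator vector z."""
    null_term = sum(math.log(p[s]) for s in x)
    ratio_term = sum(w * zi for w, zi in zip(llr_scores(x, p, q), z))
    return null_term + ratio_term

def best_single_segment(scores):
    """Segment (a, b), 0-based inclusive, with the largest total score.

    Assumes at most one changed segment (a restriction added for this
    sketch). Returns None when no segment has a positive total score,
    i.e. the null hypothesis of no change is preferred.
    """
    best_sum, best_seg = 0.0, None
    run_sum, run_start = 0.0, 0
    for i, w in enumerate(scores):
        if run_sum <= 0:
            run_sum, run_start = w, i   # start a fresh run at position i
        else:
            run_sum += w                # extend the current run
        if run_sum > best_sum:
            best_sum, best_seg = run_sum, (run_start, i)
    return best_seg
```

Finding the best set of several non-intersecting segments needs a more elaborate search than this single-segment case.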
Segmentation Methods
Hidden Markov Models
 Example of Markov Model
Segmentation Methods
Hidden Markov Models
 Example of Hidden Markov Model
Segmentation Methods
Hidden Markov Models
 Assumes that different segments can be
classified into a finite set of states, where the
nucleotide data in each state follow a
state-specific probability distribution
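Segmentation Methods
Hidden Markov Models
 Let $S_i$ denote the hidden state underlying
observation $i$, taking one of a finite number $r$ of states
 Let the states follow a Markov process
with transition matrix $\lambda_{jk}$
 The system of equations for the hidden chain
can be written as
$$P[S_i = s_i \mid S_{i-1} = s_{i-1}] = \prod_{j=1}^{r} \prod_{k=1}^{r} \lambda_{jk}^{\,s_{i,j}\, s_{i-1,k}}$$
where $s_i = (s_{i,1}, \ldots, s_{i,r})$ is the indicator vector of the
state at position $i$, so that only the transition probability
between the states occupied at positions $i-1$ and $i$
survives in the product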
Segmentation Methods
Hidden Markov Models
 Likewise, the system of equations for the
observations can be written as
$$P[Y_i = y_i \mid Y_{i-1} = y_{i-1}, S_i = s] = \prod_{j=1}^{m} \prod_{k=1}^{m} p_{s,jk}^{\,y_{i-1,j}\, y_{i,k}}$$
where $y_i = (y_{i,1}, \ldots, y_{i,m})$ represents the vector of $m$
possible observed outcomes, and where
each observation is associated with one of
the states
Segmentation Methods
Hidden Markov Models
 With the systems of equations for the hidden chain
and the observations, the smoothing equations
$$P[S_i = s \mid Y_1, \ldots, Y_n]$$
can be derived and used to plot the homogeneous
regions in the sequence
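The smoothing probabilities can be computed with the standard forward-backward recursion. The sketch below is an illustration under simplifying assumptions that differ from the slides: each observation depends only on the current state (not also on the previous observation), and `trans[j][k]` is taken as the probability of moving from state j to state k. The function name `smooth` and the argument layout are hypothetical.

```python
def smooth(obs, init, trans, emit):
    """Posterior state probabilities P[S_i = s | Y_1..Y_n] for each position.

    obs         : non-empty sequence of observed symbols
    init[s]     : P[S_1 = s]
    trans[j][k] : P[S_i = k | S_{i-1} = j]   (convention assumed here)
    emit[s][y]  : P[Y_i = y | S_i = s]       (zeroth-order emissions, assumed)
    """
    n, r = len(obs), len(init)
    # Forward pass, rescaled at every step to avoid numerical underflow.
    fwd = []
    current = [init[s] * emit[s][obs[0]] for s in range(r)]
    for i in range(n):
        if i > 0:
            current = [
                emit[k][obs[i]] * sum(fwd[-1][j] * trans[j][k] for j in range(r))
                for k in range(r)
            ]
        total = sum(current)
        fwd.append([v / total for v in current])
    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * r for _ in range(n)]
    for i in range(n - 2, -1, -1):
        raw = [
            sum(trans[j][k] * emit[k][obs[i + 1]] * bwd[i + 1][k] for k in range(r))
            for j in range(r)
        ]
        total = sum(raw)
        bwd[i] = [v / total for v in raw]
    # Combine both passes and normalise so each position sums to one.
    posterior = []
    for f, b in zip(fwd, bwd):
        prod = [fv * bv for fv, bv in zip(f, b)]
        total = sum(prod)
        posterior.append([v / total for v in prod])
    return posterior
```

Plotting `posterior[i][s]` against `i` for each state `s` gives the kind of picture used to read off homogeneous regions: positions where one state's probability stays close to 1 form a segment.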
Segmentation Methods
Recursive Segmentation
 Assumes that sequences exhibit hierarchical
patterns (possibility of subdomains)
 It is possible to apply a filter to convert the
original four-base DNA sequence into a k-symbol sequence
 Ex: S (strong) = {C,G} and W (weak) = {A,T}
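Such a filter is just a recoding of the alphabet. A minimal sketch of the strong/weak example, with hypothetical names `SW_FILTER` and `apply_filter`, and with the assumption that letters outside the mapping (such as the ambiguity code N) are dropped:

```python
# Two-symbol filter from the example above: S (strong) = {C, G}, W (weak) = {A, T}.
SW_FILTER = {"C": "S", "G": "S", "A": "W", "T": "W"}

def apply_filter(seq, mapping=SW_FILTER):
    """Recode a DNA string symbol by symbol.

    Letters missing from the mapping are skipped; that choice is an
    assumption of this sketch, not something the slides specify.
    """
    return "".join(mapping[base] for base in seq.upper() if base in mapping)
```

Passing a different dictionary gives other k-symbol recodings, for instance purine/pyrimidine with `{"A": "R", "G": "R", "C": "Y", "T": "Y"}`.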
Segmentation Methods
Recursive Segmentation
 A divide-and-conquer approach is applied
 For a k-symbol sequence of length N,
calculate, for each candidate partition point i (0 < i < N), the
entropy H of the whole sequence, the entropy
Hl of the subsequence on the left side of the
partition point, and the entropy Hr of the
subsequence on the right side.
Segmentation Methods
Recursive Segmentation
 Entropy equations as defined by (Shannon
1948)
$$\hat{H} = -\sum_{j=1}^{k} \frac{N_j}{N} \log\frac{N_j}{N}$$
$$\hat{H}_l = -\sum_{j=1}^{k} \frac{N_{j,l}}{i} \log\frac{N_{j,l}}{i}$$
$$\hat{H}_r = -\sum_{j=1}^{k} \frac{N_{j,r}}{N-i} \log\frac{N_{j,r}}{N-i}$$
where $N_j$, $N_{j,l}$, and $N_{j,r}$ are the counts of
symbol $j$ in the whole, left, and right
sequence, respectively
Segmentation Methods
Recursive Segmentation
 Maximized Jensen-Shannon divergence was
chosen to measure the heterogeneity of the
sequence
$$\hat{D}_{JS} = \max_i \hat{D}_{JS}(i) = \max_i \left[ \hat{H} - \frac{i}{N}\hat{H}_l - \frac{N-i}{N}\hat{H}_r \right]$$
 If the divergence is large enough, the sequence is
heterogeneous and should be segmented
 The procedure is applied recursively to both the left
and the right subsequence, as long as the
calculated divergence value stays above the given
threshold (similar to constructing a binary tree)
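The entropy, divergence, and recursion steps can be put together in a few lines. The following is a sketch, not the published algorithm's reference code: it assumes natural logarithms, a fixed user-supplied `threshold`, and hypothetical function names (`entropy`, `best_split`, `segment`).

```python
import math
from collections import Counter

def entropy(counts, total):
    """Plug-in Shannon entropy (natural log) from symbol counts."""
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def best_split(seq):
    """Return (i, D_JS(i)) for the partition point 0 < i < N that
    maximises the Jensen-Shannon divergence."""
    n = len(seq)
    right = Counter(seq)
    left = Counter()
    h_all = entropy(right, n)       # entropy of the whole sequence
    best_i, best_d = None, -1.0
    for i in range(1, n):
        sym = seq[i - 1]            # move one symbol from the right to the left part
        left[sym] += 1
        right[sym] -= 1
        d = (h_all
             - (i / n) * entropy(left, i)
             - ((n - i) / n) * entropy(right, n - i))
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d

def segment(seq, threshold, offset=0):
    """Recursively split seq while the maximised divergence exceeds
    threshold. Returns cut positions in original-sequence coordinates,
    in increasing order."""
    if len(seq) < 2:
        return []
    i, d = best_split(seq)
    if d <= threshold:
        return []
    return (segment(seq[:i], threshold, offset)
            + [offset + i]
            + segment(seq[i:], threshold, offset + i))
```

Each call to `segment` corresponds to one node of the binary tree mentioned above: a node either stops (returns no cut) or contributes one cut and two child calls.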
Segmentation Methods
Recursive Segmentation
 An alternate approach to determining the stopping
criterion involves finding a model at the border
between underfitting models (those that do not fit
the data well) and overfitting models (those that fit
the data too well by using too many parameters)
 Bayesian Information Criterion (BIC) was used to
balance the “goodness-of-fit” of the model to the data
against the number of parameters
$$\mathrm{BIC} = -2\log(\hat{L}) + \log(N)\,K$$
where $\hat{L}$ is the maximized likelihood of the model, $K$ the
number of free parameters, and $N$ the sample size
Segmentation Methods
Recursive Segmentation
 Two models can be compared:
 Modelling the sequence as one single random sequence
 Modelling it as two random subsequences with
different base compositions
 In order for recursive segmentation to continue,
the following must hold
$$2N\hat{D}_{JS} > \log(N)\,k$$
where $k$ is the number of different symbols in the
sequence
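One way to see where this condition comes from is to compare the BIC of the two models. The short derivation below rests on assumptions the slide does not spell out: natural logarithms in the entropies, $k-1$ free composition parameters per segment, and the cut position counted as one additional free parameter. The subscripts 1 and 2 (one-segment and two-segment model) are notation introduced here.

```latex
\begin{align*}
\log\hat{L}_1 &= -N\hat{H}, \qquad
\log\hat{L}_2 = -i\,\hat{H}_l - (N-i)\,\hat{H}_r
\;\Rightarrow\;
\log\hat{L}_2 - \log\hat{L}_1 = N\hat{D}_{JS} \\
K_1 &= k-1, \qquad K_2 = 2(k-1)+1
\;\Rightarrow\; K_2 - K_1 = k \\
\mathrm{BIC}_2 - \mathrm{BIC}_1 &= -2N\hat{D}_{JS} + \log(N)\,k < 0
\;\Longleftrightarrow\;
2N\hat{D}_{JS} > \log(N)\,k
\end{align*}
```

Under these assumptions the two-segment model is preferred exactly when the gain in log-likelihood outweighs the penalty for the $k$ extra parameters.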
Segmentation Methods
Recursive Segmentation
 Alternatively, the recursive segmentation
condition can be used to define the
segmentation strength s, i.e.
$$s = \frac{2N\hat{D}_{JS} - \log(N)\,k}{\log(N)\,k}$$
 The recursive segmentation process can be
continued as long as $s > s_0$, where $s_0$ is
predefined by the user
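As a small illustration, the strength can be computed directly from N, the maximised divergence, and k. The function names and the default `s0=0.0` are assumptions of this sketch; with that default the test coincides with the BIC-based condition, and larger values of `s0` demand stronger evidence before a cut is accepted.

```python
import math

def segmentation_strength(n, d_js, k):
    """s = (2*N*D_JS - log(N)*k) / (log(N)*k), using natural logarithms."""
    if n < 2 or k < 1:
        raise ValueError("need n >= 2 and k >= 1")
    penalty = math.log(n) * k
    return (2 * n * d_js - penalty) / penalty

def should_split(n, d_js, k, s0=0.0):
    """Continue segmenting only while the strength exceeds the user's s0."""
    return segmentation_strength(n, d_js, k) > s0
```

For example, a sequence of length 100 over two symbols with a maximised divergence of log 2 has a strength of about 14, while a divergence of zero gives exactly -1.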
Discussion
 DNA sequences can be assumed to consist of segments, each with a
degree of internal homogeneity
 A number of statistical methods can be used to identify and analyse
these segments
 Isochores
 CpG islands
 Replication origin and terminus
 Complex patterns in telomeres
 Coding-noncoding borders
 Other statistical methods for analysing DNA segmentation do exist,
each with varying degrees of success
 Bayesian approach
 Walking Markov
 Change-point methods
References
 Braun J.V., Müller H.-G. “Statistical methods for DNA sequence
segmentation,” Statistical Science, 13:142-162, 1998.
 Duda R.O., Hart P.E., Stork D.G. (2001) Pattern Classification, New
York: John Wiley & Sons, Inc.
 Churchill, G.A. “Stochastic models for heterogeneous DNA
sequences,” Bulletin of Mathematical Biology, 51:79-94, 1989.
 Csürös M. “Algorithms for finding maximal-scoring segment sets,”
Proc. WABI, 2004.
 Li W., et al. “Applications of recursive segmentation to the analysis of
DNA sequences,” Computational Chemistry, 26:491-510, 2002.