DNA Segmentation
Presented by Ming-Te Cheng
IFT 6299 - Algorithmique de l’ADN
Autumn 2004
November 15, 2004

Overview
- Introduction
- Segmentation Models
- Segmentation Methods
- Discussion

Introduction
- Statistical analysis of DNA sequences is motivated by three areas of exploration:
  - DNA sequence data offer an extremely fine view to which traditional methods of variation analysis can be extended
  - DNA sequence data allow the study of the fine-tuning and organization of genetic processes
  - Comparison of sequences between species demands methods for determining similarities in evolution or function
- Large chunks of the genome are sequenced, but the function of many sequences is unknown
- Scientists rely on homology (similarity) to analyze unknown sequences by comparing them with small, previously well-studied sequences
- Methods are needed for describing and assessing sequences in ways that provide useful characterizations
  - Ex: Segmentation models

Segmentation Models
- Assumption: sequences can be partitioned into a number of segments
- Each segment has a certain degree of internal homogeneity (or similarity)
- Ex: Isochores
  - Large segments (> 300 kb) of DNA belonging to a number of classes defined by different [G+C] levels and by fairly homogeneous base compositions

Segmentation Methods
- Common techniques:
  - Moving Window
  - Maximum-Likelihood Estimation
  - Hidden Markov Models
  - Recursive Segmentation

Segmentation Methods: Moving Window
- Most commonly used algorithm in the biology community
- Straightforward implementation:
  - Calculate the density of a sequence feature of interest within a window
  - Move the window along the sequence
  - Recalculate the density
- Drawbacks:
  - The choice of window size and moving distance is arbitrary
  - If the window size is too large, local fluctuations that contain significant biological information may be averaged out
  - If the moving distance is too long, one domain can be split between two windows and its distinctive feature may be lost

Segmentation Methods: Maximum-Likelihood Estimation
- Algorithm that computes the maximum-likelihood estimate for the number of changed segments
- Let $X_1, \dots, X_n$ represent a sequence of independent random letters from an alphabet $\{\alpha_1, \dots, \alpha_r\}$
- Let every $X_i$ follow one of two known distributions, specified by the probabilities $p(\alpha_j)$ and $q(\alpha_j)$
- A changed segment is a segment $[a, b]$ of indices where $P\{X_i = \alpha_j\} = q(\alpha_j)$ for all $i \in [a, b]$
- An unchanged segment is a segment $[a, b]$ of indices where $P\{X_i = \alpha_j\} = p(\alpha_j)$ for all $i \in [a, b]$
- Let $x_i$ be the observed value of $X_i$ in the sequence
- Let $C$ be a set of non-intersecting hypothetical changed segments
- Let $z = (z_1, \dots, z_n)$ be the indicator vector for $C$, i.e. $z_i = 1$ if position $i$ lies in a changed segment and $z_i = 0$ otherwise
- The likelihood function can be written as

$$L = f(x \mid z, p, q) = \prod_{i=1}^{n} \bigl(p(x_i)\bigr)^{1 - z_i} \bigl(q(x_i)\bigr)^{z_i}$$

- The log-likelihood can be written as

$$\log L = \log \prod_{i=1}^{n} \bigl(p(x_i)\bigr)^{1 - z_i} \bigl(q(x_i)\bigr)^{z_i} = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} z_i \log \frac{q(x_i)}{p(x_i)} = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} w_i z_i, \qquad w_i = \log \frac{q(x_i)}{p(x_i)}$$

- The first term represents the log-likelihood of the null hypothesis that there are no changed segments
- The second term represents the log-likelihood ratio of the alternative hypothesis

Segmentation Methods: Hidden Markov Models
- Example of Markov Model (figure not reproduced)
- Example of Hidden Markov Model (figure not reproduced)
- Assumes that different segments can be classified into a finite set of states, where the nucleotide data in each state follow a probability distribution
- Let $S_i$ denote the state underlying observation $i$, taking one of a finite number $r$ of states
- Let the states follow a Markov process with transition matrix $\lambda_{jk}$
- The system of equations for the hidden chain can be written as

$$P[S_i = s_i \mid S_{i-1} = s_{i-1}] = \prod_{j=1}^{r} \prod_{k=1}^{r} \lambda_{jk}^{\, s_{i-1,j}\, s_{i,k}}$$

  where $s_i = (s_{i,1}, \dots, s_{i,r})$ is the indicator vector of the state at position $i$
- Likewise, the system of equations for the observations can be written as

$$P[Y_i = y_i \mid Y_{i-1} = y_{i-1}, S_i = s] = \prod_{j=1}^{m} \prod_{k=1}^{m} p_{s,jk}^{\, y_{i-1,j}\, y_{i,k}}$$

  where $y_i = (y_{i,1}, \dots, y_{i,m})$ represents the vector of $m$ possible observed outcomes, and where each observation is associated with one of the states
- With the systems of equations for the hidden chain and the observations, the smoothing equations $P[S_i = s \mid Y_1, \dots, Y_n]$ can be derived and used to plot the homogeneous regions in the sequence

Segmentation Methods: Recursive Segmentation
- Assumes that sequences exhibit hierarchical patterns (possibility of subdomains)
- A filter can be applied to convert the original four-base DNA sequence into a k-symbol sequence
  - Ex: S (strong) = {C, G} and W (weak) = {A, T}
- A divide-and-conquer approach is applied
- For a k-symbol sequence of length N, calculate for each position i (0 < i < N) the entropy H of the whole sequence, the entropy Hl of the subsequence on the left side of the partition point, and the entropy Hr of the subsequence on the right side
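The filtering step and the three entropies at a single partition point can be sketched as follows. This is only an illustration under stated assumptions: the function names `to_sw`, `entropy`, and `split_entropies` are hypothetical, natural logarithms are assumed, and any symbol other than C or G is mapped to W for simplicity.

```python
import math
from collections import Counter

def to_sw(seq):
    """Hypothetical helper: apply the S/W filter (S = {C, G}, W = everything else)."""
    return "".join("S" if base in "CG" else "W" for base in seq.upper())

def entropy(seq):
    """Shannon entropy in nats of the symbol frequencies in seq.

    -p*log(p) is written as p*log(1/p), which is the same quantity.
    """
    n = len(seq)
    return sum((c / n) * math.log(n / c) for c in Counter(seq).values())

def split_entropies(seq, i):
    """Entropies of the whole sequence and of its left and right parts at point i."""
    if not 0 < i < len(seq):
        raise ValueError("partition point must satisfy 0 < i < N")
    return entropy(seq), entropy(seq[:i]), entropy(seq[i:])

sw = to_sw("GGCCGCATTATA")
h, h_left, h_right = split_entropies(sw, 6)
print(sw, h, h_left, h_right)
```

In this toy sequence the left half is pure S and the right half pure W, so both partial entropies are zero while the whole-sequence entropy equals log 2.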
Segmentation Methods: Recursive Segmentation
- Entropy equations as defined by (Shannon 1948):

$$\hat{H} = -\sum_{j=1}^{k} \frac{N_j}{N} \log \frac{N_j}{N}$$

$$\hat{H}_l = -\sum_{j=1}^{k} \frac{N_{j,l}}{i} \log \frac{N_{j,l}}{i}$$

$$\hat{H}_r = -\sum_{j=1}^{k} \frac{N_{j,r}}{N - i} \log \frac{N_{j,r}}{N - i}$$

  where $N_j$, $N_{j,l}$, and $N_{j,r}$ are the counts of symbol j in the whole, left, and right sequence, respectively
- The maximized Jensen-Shannon divergence is used to measure the heterogeneity of the sequence:

$$\hat{D}_{JS} = \max_i \hat{D}_{JS}(i) = \max_i \left[ \hat{H} - \frac{i}{N} \hat{H}_l - \frac{N - i}{N} \hat{H}_r \right]$$

- If the divergence is large enough, the sequence is heterogeneous and should be segmented
- The equation is applied recursively to both the left and the right subsequence, as long as the calculated divergence stays above the given threshold (similar to constructing a binary tree)
- An alternate approach to determining the stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters)
- The Bayesian Information Criterion (BIC) is used to balance the goodness-of-fit of the model against the number of parameters it uses:

$$\mathrm{BIC} = -2 \log(\hat{L}) + \log(N)\, K$$

  where $\hat{L}$ is the likelihood of the model, K the number of free parameters, and N the sample size
- Two models can be compared:
  - Modelling the sequence as one single random sequence
  - Modelling it as two random subsequences with different base compositions
- For recursive segmentation to continue, the two-subsequence model must have the lower BIC, i.e. the following must hold:

$$2 N \hat{D}_{JS} > \log(N)\, k$$

  where k is the number of different symbols in the sequence
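The recursion with the BIC-based stopping rule can be sketched as below. This is a minimal brute-force illustration, not the implementation from the cited work: the names `entropy`, `best_split`, and `segment` are hypothetical, natural logarithms are assumed, and every split point is re-evaluated from scratch, which is quadratic in the sequence length.

```python
import math
from collections import Counter

def entropy(seq):
    """Shannon entropy in nats of the symbol frequencies in seq."""
    n = len(seq)
    return sum((c / n) * math.log(n / c) for c in Counter(seq).values())

def best_split(seq):
    """Return (D_JS, i) maximizing H - (i/N)*H_l - ((N-i)/N)*H_r over 0 < i < N.

    i is None when no split point gives a positive divergence.
    """
    n = len(seq)
    h = entropy(seq)
    best_d, best_i = 0.0, None
    for i in range(1, n):
        d = h - (i / n) * entropy(seq[:i]) - ((n - i) / n) * entropy(seq[i:])
        if d > best_d:
            best_d, best_i = d, i
    return best_d, best_i

def segment(seq, k, offset=0):
    """Split recursively while 2*N*D_JS > log(N)*k; return sorted cut positions."""
    n = len(seq)
    if n < 2:
        return []
    d, i = best_split(seq)
    if i is None or 2 * n * d <= math.log(n) * k:
        return []  # the single-sequence model is preferred: stop here
    left = segment(seq[:i], k, offset)
    right = segment(seq[i:], k, offset + i)
    return left + [offset + i] + right

print(segment("S" * 40 + "W" * 40, k=2))  # one cut, at position 40
print(segment("SW" * 20, k=2))            # no cut: the sequence is homogeneous
```

The first call finds a single boundary between the S-rich and W-rich halves; the second returns no boundary because no split of the alternating sequence raises the likelihood enough to pay the BIC penalty.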
Segmentation Methods: Recursive Segmentation
- The BIC-based condition for continuing the segmentation can be rewritten to define the segmentation strength s:

$$s = \frac{2 N \hat{D}_{JS} - \log(N)\, k}{\log(N)\, k}$$

- The recursive segmentation process continues as long as $s > s_0$, where $s_0$ is predefined by the user

Discussion
- DNA sequences can be assumed to consist of segments, each with a degree of homogeneity
- A number of statistical methods can be used to identify and analyse these segments:
  - Isochores
  - CpG islands
  - Replication origin and terminus
  - Complex patterns in telomeres
  - Coding-noncoding borders
- Other statistical methods for analysing DNA segmentation exist, with varying degrees of success:
  - Bayesian approach
  - Walking Markov
  - Change-point methods

References
- Braun J.V., Müller H.-G. “Statistical methods for DNA sequence segmentation,” Statistical Science, 13:142-162, 1998.
- Duda R.O., Hart P.E., Stork D.G. Pattern Classification, New York: John Wiley & Sons, Inc., 2001.
- Churchill G.A. “Stochastic models for heterogeneous DNA sequences,” Bulletin of Mathematical Biology, 51:79-94, 1989.
- Csürös M. “Algorithms for finding maximal-scoring segment sets,” Proc. WABI, 2004.
- Li W., et al. “Applications of recursive segmentation to the analysis of DNA sequences,” Computers & Chemistry, 26:491-510, 2002.