DNA Segmentation

Presented by Ming-Te Cheng
IFT 6299 - Algorithmique de l'ADN, Autumn 2004
November 15, 2004

Overview

- Introduction
- Segmentation Models
- Segmentation Methods
- Discussion

Introduction

- Statistical analysis of DNA sequences is motivated by three areas of exploration:
  - DNA sequence data offer an extremely fine-grained view to which traditional methods of variation analysis can be extended
  - DNA sequence data allow fine-tuning and organization of genetic processes
  - Comparing sequences between species demands methods for determining similarities in evolution or function
- Large chunks of the genome have been sequenced, but the function of many sequences is unknown
- Scientists rely on homology (similarity) to analyze unknown sequences against small, previously well-studied sequences
- Methods are needed for describing and assessing sequences in ways that provide useful characterizations
  - Ex: Segmentation models

Segmentation Models

- Assumption: Sequences can be partitioned into a number of segments
- Each segment has a certain degree of internal homogeneity (or similarity)
- Ex: Isochores
  - Large segments (> 300 kb) of DNA belonging to a number of classes defined by different [G+C] levels and by fairly homogeneous base compositions

Segmentation Methods

- Common techniques
  - Moving Window
  - Maximum-Likelihood Estimation
  - Hidden Markov Models
  - Recursive Segmentation

Moving Window

- Most commonly used algorithm in the biology community
- Straightforward implementation
  - Calculate the density of a sequence feature of interest within a window
  - Move the window along the sequence
  - Recalculate the density
- Drawbacks
  - Arbitrary choice of window size and moving distance
  - If the window size is too large, local fluctuations that contain significant biological information may be averaged out
  - If the moving distance is too long, one domain can be split between two windows and its distinctive feature may be lost
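The moving-window procedure described above can be sketched as follows. This is a minimal illustration, not code from the presentation: the function name `window_density`, the choice of [G+C] content as the feature of interest, and the default window size and step are all assumptions made for the example.

```python
def window_density(seq, feature="GC", window=8, step=4):
    """Return (start, density) pairs for fixed-size windows along seq.

    feature : characters counted as the "feature of interest"
              (G and C here, an assumption chosen for illustration)
    window  : window size in bases (arbitrary, as the drawbacks note)
    step    : moving distance between successive windows
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    results = []
    # Slide the window along the sequence; a trailing partial window is skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        count = sum(1 for base in chunk if base in feature)
        results.append((start, count / window))
    return results
```

Calling `window_density("ATATATATGCGCGCGCATATATAT")` returns one density per window start. Rerunning it with a different `window` or `step` shows how much the resulting profile depends on those two arbitrary choices.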
Maximum-Likelihood Estimation

- Algorithm that computes the maximum likelihood estimate for the number of changed segments
- Let $X_1, \ldots, X_n$ be a sequence of independent random letters from an alphabet $\mathcal{A} = \{a_1, \ldots, a_r\}$
- Each $X_i$ is drawn from one of two known distributions, specified by the probabilities $p(a_j)$ and $q(a_j)$
- A changed segment is a segment $[a, b]$ of indices where $P\{X_i = a_j\} = q(a_j)$ for all $i \in [a, b]$
- An unchanged segment is a segment $[a, b]$ of indices where $P\{X_i = a_j\} = p(a_j)$ for all $i \in [a, b]$
- Let $x_i$ be the observed value of $X_i$ in the sequence
- Let $C$ be a set of non-intersecting hypothetical changed segments
- Let $z = (z_1, \ldots, z_n)$ be the indicator vector for $C$ ($z_i = 1$ if position $i$ lies in a segment of $C$, and $z_i = 0$ otherwise)
- The likelihood function can be written as

  $$L = f(x \mid z, p, q) = \prod_{i=1}^{n} p(x_i)^{1 - z_i} \, q(x_i)^{z_i}$$

- The log-likelihood can be written as

  $$\log L = \log \prod_{i=1}^{n} p(x_i)^{1 - z_i} \, q(x_i)^{z_i} = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} z_i \log \frac{q(x_i)}{p(x_i)} = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} w_i z_i$$

  where $w_i = \log \frac{q(x_i)}{p(x_i)}$ is the log-likelihood ratio at position $i$
- The first term represents the log-likelihood of the null hypothesis that there are no changed segments
- The second term represents the log-likelihood ratio of the alternative hypothesis

Hidden Markov Models

- [Figure: example of a Markov model]
- [Figure: example of a hidden Markov model]
- Assumes that different segments can be classified into a finite set of states, where the nucleotide data in each state follow a probability distribution
- Let $S_i$ denote the state underlying observation $i$, taking one of $r$ possible values, with $s_i = (s_{i,1}, \ldots, s_{i,r})$ the indicator vector of that state
- Let the states follow a Markov process with transition matrix $\lambda_{jk}$
- The system of equations for the hidden chain can be written as

  $$P[S_i = s_i \mid S_{i-1} = s_{i-1}] = \prod_{j=1}^{r} \prod_{k=1}^{r} \lambda_{jk}^{\, s_{i-1,j} \, s_{i,k}}$$

- Likewise, the system of equations for the observations can be written as

  $$P[Y_i = y_i \mid Y_{i-1} = y_{i-1}, S_i = s] = \prod_{j=1}^{m} \prod_{k=1}^{m} p_{s,jk}^{\, y_{i-1,j} \, y_{i,k}}$$

  where $y_i = (y_{i,1}, \ldots, y_{i,m})$ is the indicator vector over the $m$ possible observed outcomes, and each observation is associated with one of the states
- With the system equations for the hidden chain and the observations, the smoothing equations $P[S_i = s \mid Y_1, \ldots, Y_n]$ can be derived and used to plot the homogeneous regions in the sequence

Recursive Segmentation

- Assumes that sequences exhibit hierarchical patterns (possibility of subdomains)
- A filter can be applied to convert the original four-base DNA sequence into a $k$-symbol sequence
  - Ex: S (strong) = {C, G} and W (weak) = {A, T}
- A divide-and-conquer approach is applied
- For a $k$-symbol sequence of length $N$, calculate at each position $i$ ($0 < i < N$) the entropy $H$ of the whole sequence, the entropy $H_l$ of the subsequence to the left of the partition point, and the entropy $H_r$ of the subsequence to the right
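The filtering step and the scan over partition points can be sketched as below. This is an illustrative sketch under stated assumptions, not the presenter's implementation: the names `SW_FILTER`, `apply_filter`, `entropy`, and `split_entropies` are hypothetical, entropy is taken to be Shannon entropy of the empirical symbol frequencies, and the natural logarithm is an assumed choice of base.

```python
import math
from collections import Counter

# Hypothetical filter: map the four bases onto the two-symbol S/W alphabet.
SW_FILTER = {"C": "S", "G": "S", "A": "W", "T": "W"}


def apply_filter(seq, mapping=SW_FILTER):
    """Convert a four-base DNA string into a k-symbol string (k = 2 here)."""
    return "".join(mapping[base] for base in seq.upper())


def entropy(symbols):
    """Shannon entropy of the empirical symbol frequencies (natural log)."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def split_entropies(symbols):
    """For each partition point i (0 < i < N), return (i, H, H_l, H_r).

    Recounting symbols at every i costs O(N^2) overall; running counts
    would bring this down to O(N), but the direct form is easier to read.
    """
    whole = entropy(symbols)
    return [
        (i, whole, entropy(symbols[:i]), entropy(symbols[i:]))
        for i in range(1, len(symbols))
    ]
```

For example, `split_entropies(apply_filter("GGCCAATT"))` gives a left and right entropy of zero at `i = 4`, the point where the strong half meets the weak half, while the entropy of the whole sequence stays at `log 2`.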
- Entropy equations as defined by Shannon (1948):

  $$\hat{H} = -\sum_{j=1}^{k} \frac{N_j}{N} \log \frac{N_j}{N}, \qquad \hat{H}_l = -\sum_{j=1}^{k} \frac{N_{j,l}}{i} \log \frac{N_{j,l}}{i}, \qquad \hat{H}_r = -\sum_{j=1}^{k} \frac{N_{j,r}}{N - i} \log \frac{N_{j,r}}{N - i}$$

  where $N_j$, $N_{j,l}$, and $N_{j,r}$ are the counts of symbol $j$ in the whole, left, and right sequence, respectively
- The maximized Jensen-Shannon divergence was chosen to measure the heterogeneity of the sequence:

  $$\hat{D}_{JS} = \max_i \hat{D}_{JS}(i) = \max_i \left[ \hat{H} - \frac{i}{N} \hat{H}_l - \frac{N - i}{N} \hat{H}_r \right]$$

- If the divergence is large enough, the sequence is heterogeneous and should be segmented
- The same equation is applied recursively to both the left and the right subsequence, as long as the calculated divergence stays above the given threshold (similar to constructing a binary tree)
- An alternate approach to determining the stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters)
- The Bayesian Information Criterion (BIC) was used to balance the goodness-of-fit of the model to the data:

  $$\mathrm{BIC} = -2 \log(\hat{L}) + \log(N) \, K$$

  where $\hat{L}$ is the likelihood of the model, $K$ the number of free parameters, and $N$ the sample size
- Two models can be compared:
  - Modelling the sequence as one single random sequence
  - Modelling it as two random subsequences with different base compositions
- In order for recursive segmentation to continue, the following must apply:

  $$2 N \hat{D}_{JS} > \log(N) \, k$$

  where $k$ is the number of different symbols in the sequence
- This condition can alternatively be used to define the segmentation strength $s$:

  $$s = \frac{2 N \hat{D}_{JS} - \log(N) \, k}{\log(N) \, k}$$

- The recursive segmentation process can be continued as long as $s > s_0$, where $s_0$ is predefined by the user ($s_0 = 0$ recovers the BIC condition)

Discussion

- DNA sequences can be assumed to consist of segments, each with a degree of homogeneity
- A number of statistical methods can be used to identify and analyse these segments
  - Isochores
  - CpG islands
  - Replication origin and terminus
  - Complex patterns in telomeres
  - Coding-noncoding borders
- Other statistical methods for analysing DNA segmentation exist, each with varying degrees of success
  - Bayesian approach
  - Walking Markov
  - Change-point methods

References

- Braun J.V., Müller H.-G. "Statistical methods for DNA sequence segmentation," Statistical Science, 13:142-162, 1998.
- Duda R.O., Hart P.E., Stork D.G. Pattern Classification, New York: John Wiley & Sons, Inc., 2001.
- Churchill G.A. "Stochastic models for heterogeneous DNA sequences," Bulletin of Mathematical Biology, 51:79-94, 1989.
- Csürös M. "Algorithms for finding maximal-scoring segment sets," Proc. WABI, 2004.
- Li W., et al. "Applications of recursive segmentation to the analysis of DNA sequences," Computers & Chemistry, 26:491-510, 2002.
