A statistical approach to distinguish between different DNA functional parts

I. ABNIZOVA, M. SCHILSTRA, R. TE BOEKHORST, C. L. NEHANIV
STRIC, Computer Science Department, University of Hertfordshire, College Lane, Hatfield Campus, AL10 9AB, United Kingdom

Abstract

Motivation: Over the last decade, a vast amount of new genomic data has become available, and all of these sequences need to be annotated. Conventional annotation techniques are based on comparison of the new sequences with reference sets, or with genomic data from evolutionarily close species. This rather narrow spectrum of techniques is in dire need of complementary approaches to make genome annotation more reliable. Here, we present a computational Bayesian approach that allows detection of the genomic regions that are the most likely to contain coding or regulatory sequences.

Results: We present a computational statistical, content-based approach to the genome-wide search for coding and regulatory regions in eukaryotic DNA. Our method performs an unsupervised search, without using reference sets or cross-genome comparison. Although in this study we have restricted our analyses to the sea urchin Otx and the mouse HoxD genes, we are confident that the technique is widely applicable. The outcome of these preliminary investigations revealed the potential of our approach as a powerful DNA sequence characterization and annotation tool. Two features distinguish our approach from others in this area: (i) few, if any, attempts have been made to search for regulatory regions solely on the basis of their statistical properties, without prior description or evolutionary comparison; (ii) although we use standard statistical segmentation methods, we introduce a new optimization technique that avoids averaging, and whose outcome is independent of the size of the sliding window. The technique, which takes account of the heterogeneity in the DNA sequence, reliably identifies the borders of the regions of interest.

Availability: The software is available from the authors on request.

Key-Words: Computational methods, genomic sequence, heterogeneity, regulatory region, coding regions, long range correlations, entropy

1 Introduction

Statistical analysis of DNA sequences is important for understanding the structure, function and evolution of genomes. Statistical dependences between nucleotides have been analysed for decades in various contexts. The detection of long range correlations (LRCs) has attracted much attention from physicists ([1], [2], [3], [4], [5]), and correlations ranging from a few base pairs up to 1000 bp have been analysed using mutual information functions [6], spectra [2], statistical linguistics [7] and random walk analyses [8]. It is still an open question whether LRCs are present in both coding and non-coding DNA. Several authors have claimed that non-coding DNA and coding DNA have distinguishable statistical properties ([1], [9]). These authors applied rescaled and detrended fluctuation analysis, a modification of the standard root mean square fluctuation analysis of a random walk, and concluded that there are LRCs between successive base pairs in non-coding DNA, which exhibit a specific power-law form; such power-law correlations are not present in coding DNA. Similar observations were reported by [1], who applied standard Fourier analysis to a sample consisting of several genes.
Voss [2] confirmed the long-range power-law correlations in genomic DNA, but failed to distinguish between the statistical properties of coding and non-coding DNA. The spatial heterogeneity of the base composition [10] and the long-range correlations largely shape the complexity of the whole sequence. W. Li [6] used the Jensen-Shannon distance to partition heterogeneous DNA into relatively homogeneous parts. Non-coding regulatory and non-coding non-regulatory DNA regions have different rates of evolutionary micro-change; therefore, it may be assumed that they exhibit different long-range correlations.

We are developing a computational tool that exploits the variation of statistical properties along a given DNA sequence to classify the different regions of the sequence. To this aim, we use a Bayesian approach to integrate the results of the following computational statistical methods for DNA segmentation: random walk [8], entropy measurement [3], and Jensen-Shannon divergence segmentation [6]. When these techniques are applied in the conventional way, the results they produce are highly dependent on the size of the sliding window, and have insufficient sensitivity when averaging over large stretches of DNA. To get round the problems introduced by fixed-size sliding windows, we introduce a method that optimizes the size of the windows, using the computational Probability Grid technique [11]. This technique splits the DNA sequence into segments of equal length, and fills bins with posterior probabilities (i.e. probabilities conditional on the results of the different computational methods). The segments for which the maximum a posteriori probability (MAP) estimate is maximal are the most probable locations for coding and regulatory regions. Thus, we exploit the well-known asymptotic Gaussian behaviour of the MAP-MLE (maximum likelihood estimator, in the case of uninformed priors) estimate [12]. Remarkably, only a few independent samples (five, sometimes even three; in our case, a 'sample' represents the result of a different independent computational method) suffice to approximate the asymptotic MLE estimate, owing to its fast convergence [12].

We apply the technique for optimizing window size to each of the three DNA segmentation methods mentioned above. In doing so, we simultaneously locate the 'hot' areas (coding and regulatory regions), and define their borders with an accuracy given by the size of our Probability Grid bins. We propose that this combination of established computational statistical methods, augmented with our sliding window optimization technique, creates a powerful tool in the search for differences in statistical properties between non-coding and regulatory non-coding DNA.

2 Methods

2.1 Random walk

Brownian motion is a random process with the following properties: it is stationary, it has independent increments, and its standard deviation is finite. Fractional Brownian motion (fBm) has normally distributed increments that are no longer independent; it also exhibits self-affinity. We consider a DNA sequence as if it were generated by a stochastic random walk process, and then analyse the dependence of its increments. Various 'occurrences' (comparable to events in time-dependent random walks) may be considered as states for the random walk: the occurrence of CG bonds, the occurrence of the particular A, C, G, T base pairs, purine-pyrimidine occurrences, etc. The most common approach uses purine-pyrimidine occurrences.
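To make this mapping concrete, here is a minimal sketch in Python of the purine-pyrimidine walk used throughout this section. It is our own illustration, not code from the paper: the function name and the toy sequence are assumptions, and any character other than T or C is treated as a purine for brevity.

```python
import numpy as np

def purine_pyrimidine_walk(seq):
    """Map a DNA string onto a random-walk 'landscape': the walker
    steps up (+1) at each pyrimidine (T or C) and down (-1) at each
    purine (A or G).  Returns the cumulative walk position."""
    steps = np.array([1 if base in "TC" else -1 for base in seq.upper()])
    return np.cumsum(steps)

# Toy example on a short, made-up stretch of sequence.
print(purine_pyrimidine_walk("ATTGCCTTACGGTTAACCGT"))
```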
As a first approach, we performed a rescaled range analysis to study the long-range behaviour of the root mean square fluctuation function. To obtain the characteristic size of the power-law parameter (the scaling exponent), one should first map the DNA sequence onto a random walk 'landscape'. To do this, the walker moves up each time it encounters a pyrimidine (T or C), and moves down at each purine (A or G). As a result, one obtains a fractal-shaped landscape similar to the one shown in Figure 1a, in which the cyIIIa regulatory region is mapped, and in which long stretches of mainly purine alternate with stretches that contain mostly pyrimidine. Such landscapes provide us with clues about areas in which LRCs are found (fBm), and areas in which the purine-pyrimidine sequence is totally random and independent (Bm).

In general, rescaled range analysis can be applied to any time or space series. We define the values $x_k = +1$ if position $k$ holds T or C, and $x_k = -1$ if it holds A or G. The sequence $\{x_k\}$, $k = 1, \ldots, N$, can be treated as a fractal record in time or space, and for any $2 \le n \le N$ one can define:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad X(i,n) = \sum_{m=1}^{i} \left( x_m - \bar{x}_n \right),$$

$$R(n) = \max_{1 \le i \le n} X(i,n) - \min_{1 \le i \le n} X(i,n), \qquad S(n) = \left[ \frac{1}{n}\sum_{i=1}^{n} \left( x_i - \bar{x}_n \right)^2 \right]^{1/2}.$$

[Fig. 1a: random-walk landscape of the cyIIIa regulatory region. Fig. 1b: log-log plot of R(n)/S(n) for the same segment.]

For scale-free data, $R(n)/S(n) \sim (n/2)^H$, in which H is called the Hurst exponent. As n grows, we obtain growing values of R(n)/S(n), and the Hurst exponent is estimated from the least squares fit of log[R(n)/S(n)] against log[n]. As an example, we show the log-log plot of the R(n)/S(n) analysis for a cyIIIa [13] regulatory region segment, for which H = 0.79 (Figure 1b). Applied to fractional Brownian motion, a system is called 'persistent' when H > 0.5, which means that if at any time the motion is in one direction, it is more likely to continue in that direction. Systems with H < 0.5 are called anti-persistent, and the opposite holds. For H = 0.5, the system displays Brownian motion and does not have LRCs.

Peng et al. [1] claim that only the non-coding areas of DNA exhibit long-range power-law correlations, and that coding DNA does not. This observation formed the basis of an algorithm that distinguishes coding and non-coding regions [8], which works satisfactorily as long as the coding region is more than 1000 bp in length. We could not confirm the observation that H = 0.5 for coding DNA. However, there still was a significant difference between the different functional parts of DNA: the Hurst exponent seems to have the greatest value for non-coding non-regulatory DNA, decreases for regulatory segments, and is smallest (but still sometimes not equal to 0.5) for exons in the sea urchin and mouse genes. We already mentioned that the above algorithm is highly dependent on window size. In view of the fact that the average length of a human exon is 146 bp, the data presented above, which were obtained with a 1000 bp window size, will probably not tell us a lot about exon locations. Therefore, instead of scanning the DNA sequence with fixed windows, we set out to retrieve the areas with the most pronounced LRCs, on the assumption that these areas are likely candidates for non-coding non-regulatory regions. We also searched for areas with very weak LRCs (i.e. areas of Brownian motion, with a Hurst exponent close to 0.5), assuming that those are probably coding regions. The regions with intermediate values of the Hurst exponent are hypothesised to be regulatory regions, because of the intermediate properties of their evolutionary patterns (E. Davidson, private communication).
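The rescaled range estimate of H defined above can be sketched as follows. This is a minimal illustration under our own assumptions: the set of block sizes and the averaging of R(n)/S(n) over non-overlapping blocks are implementation choices, not necessarily those used in the paper.

```python
import numpy as np

def rs_statistic(x, n):
    """Average R(n)/S(n) over the non-overlapping length-n blocks of x,
    following the definitions of R(n) and S(n) given above."""
    ratios = []
    for start in range(0, len(x) - n + 1, n):
        block = x[start:start + n]
        dev = block - block.mean()           # x_i - xbar_n
        profile = np.cumsum(dev)             # X(i, n)
        r = profile.max() - profile.min()    # R(n): range of cumulative deviations
        s = block.std()                      # S(n): population standard deviation
        if s > 0:
            ratios.append(r / s)
    return float(np.mean(ratios))

def hurst_exponent(x, block_sizes=(8, 16, 32, 64, 128, 256)):
    """H = slope of the least-squares fit of log[R(n)/S(n)] against log[n]."""
    x = np.asarray(x, dtype=float)
    ns = [n for n in block_sizes if n <= len(x)]
    rs = [rs_statistic(x, n) for n in ns]
    slope, _intercept = np.polyfit(np.log(ns), np.log(rs), 1)
    return float(slope)

# Applied to the +/-1 step series of a purine-pyrimidine walk,
# H close to 0.5 suggests no LRCs; H > 0.5 suggests persistence.
```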
To locate these regions, we applied the technique outlined in the Introduction. The optimal window length and location are defined by the area with the maximal (or minimal) Hurst exponent:

$$wind^{*} = \arg\max_{wind} H(wind).$$

Using this simple and transparent idea, we simultaneously define the borders of the coding and regulatory regions on the basis of the variation of the Hurst exponent along the sequence. The behaviour of the Hurst exponent confirmed our hypotheses in our test systems: typically, H is minimal in the (known) coding areas, maximal for non-coding non-regulatory DNA, and has intermediate values in the regulatory regions.

Figure 2a contains a plot of the values of the Hurst exponent along a 3 kb context around the conserved sequence seqA in the mouse HoxD gene (annotated sequences kindly provided by Dr. D. Yap, MRC, Cambridge). Application of the algorithm thus leads to the conclusion that seqA is likely to be a regulatory, or even a coding, sequence. The area upstream of seqA, starting approximately at position 550, is most likely to be a coding region. To obtain these results, we started with fixed window sizes close to the real sizes of the stretches of interest. In real-life problems, of course, we do not know the lengths of the objects of interest, so the optimal window size should be calculated. The plot in Figure 2b was obtained by calculating the Hurst exponent successively in each varying (optimal) sliding window, with a step size of 5 bp, along the most conserved parts of the non-coding DNA of the otx gene sequence [14]. In Figure 2a, H reaches its minimum approximately in the same area where the entropy is maximal (see below), 600-1300, whereas the next minimal area contains the seqA sequence. In Figure 2b, the presumed regulatory segment starts at position 350 and is 350 bp long; exons 1 and 2 start from position 1270, where H decreases even below 0.5.

[Fig. 2a: Hurst exponent along a 3 kb context around the conserved sequence seqA in the mouse HoxD gene. Fig. 2b: Hurst exponent, in optimal sliding windows, along the most conserved non-coding parts of the otx gene sequence.]

2.2 Entropy measurement

Firstly, we briefly describe the conventional procedure for measuring DNA entropy, and point out our reasons for not applying it. The conventional procedure [4] typically calculates a frequency vector describing the nucleotide composition of a sufficiently large, but subjectively defined, area, and then submits it to the well-known entropy formula:

$$H(seq) = -\sum_{i=1}^{M} P_i \log(P_i).$$

It is assumed that the most conserved areas (coding regions) are also the most homogeneous, and thus have the highest entropy. However, for stretches longer than some 700 bp, an average entropy is calculated, whose value may no longer be correlated with DNA homogeneity. The stretch of DNA under consideration may well be highly heterogeneous and have a low entropy, but owing to the effect of averaging it is not possible to distinguish between homogeneous and heterogeneous areas. Therefore, we should again only consider local areas along the sequence, and optimize their lengths and locations. The parameter of interest in this case is the entropy level. As a more robust measure, we also introduce the 'entropy density': the entropy of one segment divided by the length of the segment, so as to avoid any dependence on the length. However, because the entropy increases with the length of a regular structure (the longer a homogeneous segment, the higher its entropy), the entropy density measure is of minor importance in our analysis.
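As an illustration of windowed entropy measurement, the sketch below evaluates the entropy formula over single-nucleotide frequencies and scans both window positions and window lengths. It is a crude, brute-force stand-in for our Probability Grid optimization; the function names and default window lengths are our own assumptions.

```python
import numpy as np
from collections import Counter

def shannon_entropy(window):
    """H = -sum_i P_i log(P_i) over the symbol frequencies P_i of the
    window (here: single-nucleotide composition, M <= 4 symbols)."""
    counts = Counter(window)
    p = np.array(list(counts.values()), dtype=float) / len(window)
    return float(-(p * np.log(p)).sum())

def most_entropic_window(seq, min_len=100, max_len=800, step=5):
    """Scan window positions *and* lengths; return (entropy, start,
    length) of the maximum-entropy window -- a brute-force stand-in
    for the window-size optimization described above."""
    best = (-1.0, 0, 0)
    for n in range(min_len, min(max_len, len(seq)) + 1, step):
        for start in range(0, len(seq) - n + 1, step):
            h = shannon_entropy(seq[start:start + n])
            if h > best[0]:
                best = (h, start, n)
    return best
```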
2.2.1 Numerical tests for entropy

To measure information entropy along a sequence, transition information matrices (or some other frequency representation) may be calculated. Transition information matrices are defined similarly to the transition matrices of first-order Markov models, but are instead normalized to the total matrix sum. Thus, the matrix values are equal to the probabilities of finding any given adjacent pair in the sequence. If we represent these probabilities as the vector $\{P_k\}_{k=1}^{M} = \{P_{ij} : P_{ij} \neq 0\}$, in which M is the number of non-zero probabilities, we can calculate the information entropy of any part of the sequence by substituting the estimated values of $P_k$ into the entropy formula (see above).

After calculating the information entropy in a sliding optimal window along the sequence, we get plots similar to the one in Figure 3. The maximum entropy point is the equilibrium state, or minimum complexity point. The areas with maximum entropy correspond to the experimentally observed exon and the most locally conserved seqA sequences; they are approximately 700 and 200 bp in length, and were again identified with the optimization procedure described above. This analysis reveals that the regions 500-1300 and 2000-2200 (local coordinates) are 'hot': the first is probably coding, the second coding or regulatory. After we had performed our entropy measurement analysis, our conclusions were confirmed by cross-genomic comparison and laboratory experiments.

[Fig. 3: information entropy along the sequence, calculated in optimal sliding windows.]

2.3 Segmentation using the Jensen-Shannon divergence

The Jensen-Shannon distance [6] is commonly used as a tool for partitioning heterogeneous DNA into relatively homogeneous parts. The difference in base composition between two concatenated sequences of lengths $n_1$ and $n_2$ is effectively measured with the Jensen-Shannon divergence formula:

$$J(n_1, n_2) = H - \frac{n_1}{N} H_1 - \frac{n_2}{N} H_2,$$

where H is the entropy of the whole sequence, $H_1$ is the entropy of the left sequence, $H_2$ is the entropy of the right sequence, and $N = n_1 + n_2$. In order to distinguish areas with a maximum difference in DNA base pair composition, we must first define the regions with more or less stationary (constant) composition. This, again, can only be done reliably by optimizing the length of the local windows. To this aim, we move a pointer along the DNA sequence, and find the maximum difference in base composition to the right and to the left of this pointer (a change point) in the optimal windows. We look for the most constant compositions on both sides of the pointer. The maximal difference in DNA base pair composition is reached when we encounter the most homogeneous (high entropy) region on one side, and the most non-homogeneous, but still well-mixed, constant stretch (micro-satellites are the most likely candidates) on the other side of the change point. In Figure 4, the central peak of the divergence function indicates the start of the conserved sequence seqA upstream of the mouse HoxD gene.

[Fig. 4: Jensen-Shannon divergence along the sequence; the central peak marks the start of seqA.]
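A minimal sketch of this change-point search follows. For simplicity it measures single-nucleotide composition and compares the entire left and right segments around the pointer, rather than optimal windows on each side; the function names and the margin parameter are our own assumptions.

```python
import numpy as np
from collections import Counter

def composition_entropy(seq):
    """Shannon entropy of the nucleotide composition of seq."""
    p = np.array([c / len(seq) for c in Counter(seq).values()])
    return float(-(p * np.log(p)).sum())

def jensen_shannon(left, right):
    """J(n1, n2) = H - (n1/N) H1 - (n2/N) H2, with H the composition
    entropy of the concatenated sequence and N = n1 + n2."""
    n1, n2 = len(left), len(right)
    h = composition_entropy(left + right)
    return h - (n1 * composition_entropy(left)
                + n2 * composition_entropy(right)) / (n1 + n2)

def best_change_point(seq, margin=50):
    """Move a pointer along the sequence; return the position where
    the left/right compositional difference is maximal."""
    best_j, best_i = -1.0, margin
    for i in range(margin, len(seq) - margin):
        j = jensen_shannon(seq[:i], seq[i:])
        if j > best_j:
            best_j, best_i = j, i
    return best_i, best_j
```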
3 Bayesian integration

For a reliable position estimate, the information from consecutive runs of the different methods must be integrated. Suppose that $P(j \mid Inf_1, \ldots, Inf_{n-1})$ is the posterior probability that nucleotide bin j belongs to a coding/regulatory region, given independent pieces of information in the form of the results of the algorithms, $Inf_1, \ldots, Inf_{n-1}$. The results are the collections of bins within the given genomic sequence predicted to be coding/regulatory by the corresponding algorithm. To update the probability for bin j, given the results of the new algorithm, $Inf_n$, we use the following update formula [11]:

$$P(j \mid Inf_n) = \alpha_n \, P(j \mid Inf_{n-1}) \, P(Inf_n \mid j).$$

In our case, the information values $Inf_1, \ldots, Inf_n$ are the outputs of the different independent search algorithms that we use to update our prior belief about the location of coding/regulatory regions. The $P(Inf_n \mid j)$ term (also denoted $L(j \mid Inf_n)$) is the likelihood of obtaining the results $Inf_n$ under the assumption that bin j is a coding/regulatory location. In our case, these likelihoods are discrete values, depending on whether bin j is detected as coding/regulatory or not, assuming that the algorithm works fairly. The constant $\alpha_n$ simply normalizes the sum of the position bin probabilities over all j to 1.

To initialise this recursive updating formula, we need a prior belief that bin j is the actual location of a coding/regulatory region. To reflect a total lack of knowledge, all bins j are taken to be uniformly distributed, $P(j) = 1/L$, which means that all bins in our genomic DNA are equally likely to be coding/regulatory regions; L is the length of the genomic sequence. The first step is then to update our belief with the results of the first algorithm (Hurst values), $Inf_1$:

$$P(j \mid Inf_1) = \alpha_1 \, P(j) \, L(j \mid Inf_1).$$

Then $Inf_2$ must be integrated:

$$P(j \mid Inf_2) = \alpha_2 \, P(j \mid Inf_1) \, L(j \mid Inf_2),$$

and so on, until we have integrated all available independent information about the bin locations in the sequence. In our case, we have three independent methods: rescaled range analysis, information entropy analysis, and Jensen-Shannon divergence segmentation. Thus, we must define the likelihoods for all bins j for all three methods. At the end of the recursive integration process, all nucleotide position bins in the genomic sequence are scored with posterior probabilities of being coding/regulatory regions. We pick those that are maximal with respect to the constructed MAP:

$$j^{*} = \arg\max_j \sum_{i=1}^{3} L(j \mid Inf_i),$$

where $j^{*}$ is the set of most probable position bins, and is not a unique value. Note that we use the sum of the likelihoods, rather than their product. Thus, we have constructed the MAP estimate, which in the case of an uninformed prior reduces to an MLE (maximum likelihood estimator). It is known [12] that an asymptotic MLE approaches a normal curve, whose mean provides the best posterior estimate of the parameter. In practice, 3-5 independent sources of information are sufficient to approach it. In our case, we are interested in approaching a number of local maxima (equal to the number of actual coding/regulatory regions); their means are the locations of the actual coding/regulatory regions. Theoretically, the more algorithm results we have, the more robust the outcome. We have shown that integrating the results of only three independent algorithms was sufficient to improve the performance of the search algorithms.
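To illustrate the recursive update, the sketch below integrates the bin detections of several methods, starting from the uniform prior P(j) = 1/L. The likelihood values for detected and undetected bins (p_hit, p_miss) are illustrative placeholders, and the sketch implements the product-form recursive update given above rather than the sum-of-likelihoods scoring used for the final MAP ranking.

```python
import numpy as np

def integrate_methods(num_bins, detections, p_hit=0.9, p_miss=0.1):
    """Recursive Bayesian update over position bins:
    P(j|Inf_n) = alpha_n * P(j|Inf_{n-1}) * L(j|Inf_n).
    `detections` holds one boolean array per method (True where that
    method flagged bin j as coding/regulatory)."""
    posterior = np.full(num_bins, 1.0 / num_bins)   # uniform prior P(j) = 1/L
    for flagged in detections:
        likelihood = np.where(flagged, p_hit, p_miss)  # L(j | Inf_n)
        posterior *= likelihood
        posterior /= posterior.sum()                   # alpha_n normalization
    return posterior

# Three made-up method outputs over 12 bins; bins flagged by all
# three methods receive the highest posterior probability.
flags = [np.array([i in hits for i in range(12)])
         for hits in ({3, 4, 5}, {4, 5, 6}, {4, 5, 9})]
posterior = integrate_methods(12, flags)
print(posterior.round(3), posterior.argmax())
```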
4 Discussion

To perform a reliable segmentation of a stretch of DNA into coding, non-coding regulatory, and non-coding non-regulatory regions, we must define the regions with more or less stationary (constant) composition. It is impossible to accomplish this task without a clear understanding of the nature of the processes or events that cause DNA to be heterogeneous or homogeneous.

Arguably, the main cause of DNA heterogeneity is the evolutionary process of DNA duplication and mutation at all scale levels [15]. Different regions have different duplication and mutation rates because of their different selective constraints (E. Davidson, personal communication). In coding regions, only selected single nucleotide polymorphisms (SNPs) are allowed between evolutionary micro-states. This makes coding regions the most homogeneous in the genome and, correspondingly, the most entropic. Micro-satellites seem to play a special role in the transcription factor recruitment process. These mysterious entities are eschewed and masked as 'weeds' by many bioinformatics repeat-masking tools, even though in population genetics micro-satellites are regarded as very important markers [16]. From a computational point of view, micro-satellites can be the main cause of local DNA heterogeneity.

Let us mention possible biological reasons for DNA complexity and long-range correlations. From a molecular biological point of view, long-range correlations (LRCs) and DNA complexity are not surprising, since the complex organization of genomes involves many different scales. For example, there is a well-known LRC of GC content, and it has been pointed out that the mosaic structure of genomes is presumably responsible for LRCs [3]. The organization of the genome is very complex: eukaryotic genes usually consist of several protein-coding segments (exons), interrupted by intervening sequences (introns). There are also regulatory elements such as promoters, splice sites, enhancers and silencers, which are sometimes up to thousands of base pairs away from the exons. The genomes of higher eukaryotes also comprise long stretches of DNA without any obvious biological function, containing, e.g., pseudogenes and various types of repeats.

We selected the sea urchin otx and mouse HoxD genes as our test beds. These sequences were selected from cross-genome comparison, and there may be significant uncertainty about the function and location of the conserved regions. This, however, is commonly the case in bioinformatics, and must be taken into account.

5 Conclusion

Statistical significance of the regions detected with our integrated approach does not necessarily imply biological significance. Moreover, the observations on which the approach is founded are working hypotheses, not incontrovertible facts. However, the fact that application of the approach to data from two genes in two species predicted actual, known regulatory and coding regions suggests that the approach may indeed be a useful aid to regulatory and coding region prediction, especially since it requires no information other than the genomic DNA sequence of interest. Additionally, our optimization technique could be of help in various DNA segmentation methods.

References:
[1] Peng, C. K., Buldyrev, S. V., Havlin, S., Simons, M., Stanley, H. E. and Goldberger, A. L., Mosaic organization of DNA nucleotides, Physical Review E, Vol. 49, 1994, pp. 1685-1689.
[2] Voss, R., Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Physical Review Letters, Vol. 68, 1992, pp. 3805-3808.
[3] Herzel, H. and Große, I., Correlations in DNA sequences: the role of protein coding segments, Physical Review E, Vol. 55, 1997, pp. 800-810.
[4] Bernaola-Galván, P., Oliver, J. L. and Román-Roldán, R., Decomposition of DNA sequence complexity, Physical Review Letters, Vol. 83, 1999, pp. 3336-3339.
[5] Azbel, M. Y., Universality in a DNA statistical structure, Physical Review Letters, Vol. 75, 1995, pp. 168-171.
[6] Li, W., The complexity of DNA, Complexity, Vol. 3, 1997, pp. 33-37.
[7] Mantegna, R. N., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C. K., Simons, M. and Stanley, H. E., Linguistic features of noncoding DNA sequences, Physical Review Letters, Vol. 73, 1994, pp. 3169-3172.
[8] Ossadnik, S., Buldyrev, S., Goldberger, A., Havlin, S., Mantegna, R., Peng, C., Simons, M. and Stanley, H., Correlation approach to identify coding regions in DNA sequences, Biophysical Journal, Vol. 67, 1994, pp. 64-70.
[9] Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C. K., Simons, M., Sciortino, F. and Stanley, H. E., Long-range fractal correlations in DNA (comment on the letter by R. F. Voss in PRL 68, 3805), Physical Review Letters, Vol. 71, 1993, p. 1776.
[10] Li, W., Marr, T. G. and Kaneko, K., Understanding long range correlations in DNA sequences, Physica D, Vol. 75, 1994, pp. 392-416.
[11] Pearl, J., Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 1988.
[12] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, 1991.
[13] Kirchhamer, C. V. and Davidson, E. H., Spatial and temporal information processing in the sea urchin embryo: modular and intramodular organization of the CyIIIa gene cis-regulatory system, Development, Vol. 122, 1996, pp. 333-348.
[14] Yuh, C., Brown, C. T., Livi, C., Rowen, L., Clarke, P. and Davidson, E., Patchy interspecific sequence similarities efficiently identify positive cis-regulatory elements in the sea urchin, Developmental Biology, Vol. 246, 2002, pp. 148-161.
[15] Ohno, S., Evolution by Gene Duplication, Springer, Berlin, 1970.
[16] Pritchard, J. K., Stephens, M. and Donnelly, P., Inference of population structure using multilocus genotype data, Genetics, Vol. 155, 2000, pp. 945-959.