A statistical approach to distinguish between different DNA
functional parts
I. ABNIZOVA, M. SCHILSTRA, R. TE BOEKHORST, C.L. NEHANIV
STRIC, Computer Science Department
University of Hertfordshire
College Lane, Hatfield Campus, AL10 9AB
United Kingdom
Abstract
Motivation: Over the last decade, a vast amount of new genomic data has become available, and all of these sequences need to be annotated. Conventional annotation techniques are based on comparison of the new sequences with reference sets or genomic data from evolutionarily close species. This rather narrow spectrum of techniques is in dire need of complementary approaches to make genome annotation more reliable. Here, we present a computational Bayesian approach that allows detection of the genomic regions that are the most likely to contain coding or regulatory sequences.
Results: We present a computational statistical, content-based approach to the genome-wide search for coding and regulatory regions in eukaryotic DNA. Our method performs an unsupervised search, without using reference sets or cross-genome comparison. Although in this study we have restricted our analyses to the sea urchin Otx and the mouse HoxD genes, we are confident that the technique is widely applicable. The outcome of these preliminary investigations revealed the potential of our approach as a powerful DNA sequence characterization and annotation tool.
What distinguishes our approach from the others in this area: (i) Few, if any, attempts
have been made to search for regulatory regions solely on the basis of their statistical
properties, without prior description or evolutionary comparison. (ii) Although we use
standard statistical segmentation methods, we introduce a new optimization technique that
avoids averaging, and whose outcome is independent of the size of the sliding window. The
technique, which takes account of the heterogeneity in the DNA sequence, reliably identifies
the borders of the regions of interest.
Availability: The software is available from the authors on request.
Key-Words: Computational methods, genomic sequence, heterogeneity, regulatory region,
coding regions, long range correlations, entropy
1 Introduction
Statistical analysis of DNA sequences is important for understanding the structure, function and evolution of genomes. Statistical dependences between nucleotides have been analysed for decades in various contexts. The detection of long range correlations (LRCs) has attracted much attention from physicists ([1], [2], [3], [4], [5]), and correlations ranging from a few base pairs up to 1000 bp have been analysed using mutual information functions [6], spectra [2], statistical linguistics [7] and random walk analyses [8].
It is still an open question whether LRCs are present in both coding and non-coding DNA. Several authors have claimed that non-coding DNA and coding DNA have distinguishable statistical properties ([1], [9]). These authors applied rescaled and detrended fluctuation analysis, which is a modification of a standard root mean square fluctuation analysis of a random walk. They concluded that there are LRCs between successive base pairs in non-coding DNA, which exhibit a specific power-law form. Such power-law correlations are not present in coding DNA. Similar observations were reported by [1], who applied standard Fourier analysis to a sample consisting of several genes. Voss [2] confirmed the long-range power-law correlation in genomic DNA, but failed to distinguish between the statistical properties of coding and non-coding DNA.
The spatial heterogeneity of the base composition [10] and the long-range correlations largely shape the complexity of the whole sequence. W. Li [6] used the Jensen-Shannon distance to partition heterogeneous DNA into relatively homogeneous parts.
Non-coding regulatory and non-regulatory, non-coding DNA regions have different rates of evolutionary micro-changes. Therefore, it may be assumed that they exhibit different long-range correlations.
We are developing a computational tool that exploits the variation of statistical properties along a given DNA sequence to classify the different regions on the sequence. To this aim, we use a Bayesian approach to integrate the results of the following computational statistical methods for DNA segmentation: random walk [8], entropy measurement [3], and Jensen-Shannon divergence segmentation [6].
When these techniques are applied in the
conventional way, the results they produce
are highly dependent on the size of the
sliding window, and have insufficient
sensitivity when averaging over large
stretches of DNA.
To get round the problems introduced by fixed-size sliding windows, we introduce a method that optimizes the size of the windows, using the computational Probability Grid technique [11]. This technique splits the DNA sequence into segments of equal length, and fills bins with posterior probabilities (i.e. probabilities conditional on the results of different computational methods). The segments for which the maximum a posteriori probability (MAP) estimate is maximal are the most probable locations for coding and regulatory regions. Thus, we exploit the well-known asymptotic Gaussian behaviour of the MAP-MLE (maximum likelihood estimator in the case of uninformed priors) estimate [12].
Remarkably, only a few independent samples (5, sometimes even 3; in our case, a 'sample' represents the results of different independent computational methods) suffice to approximate the asymptotic MLE estimate, owing to its fast convergence [12]. We apply the technique for optimizing window size to each of the three DNA segmentation methods mentioned above. In doing so, we simultaneously locate the 'hot' areas (coding and regulatory regions), and define their borders with an accuracy given by the size of our Probability Grid bins.
We propose that this combination of established computational statistical methods, augmented with our sliding window optimization technique, creates a powerful tool in the search for differences in statistical properties between non-coding and regulatory non-coding DNA.
2 Methods
2.1 Random walk
Brownian motion (Bm) is a random process with the following properties: it is stationary, it has independent increments, and its increments have a finite standard deviation. Fractional Brownian motion (fBm) has normally distributed increments that are no longer independent. Fractional Brownian motion also exhibits the self-affinity property.
We consider a DNA sequence as if it were generated by a stochastic random walk process, and then analyse the dependence of its increments. Various 'occurrences' (comparable to events in time-dependent random walks) may be considered as states for the random walk: the occurrence of CG bonds, the occurrence of particular A, C, G, T base pairs, purine-pyrimidine occurrences, etc. The most common approach uses purine-pyrimidine occurrences.
As a first approach, we performed a 'rescaled analysis' to study the long-range behaviour of the root mean square fluctuation function. To obtain the characteristic size of the power-law parameter (scaling exponent), one should first map the DNA sequence onto the random walk 'landscape'. To do this, the walker moves up each time it meets a pyrimidine (T or C), and moves down when it meets a purine (A or G). As a result, one obtains a fractal-shaped landscape similar to the one shown in Figure 1a, in which the cyiiia regulatory region is mapped, and in which long stretches of mainly purine alternate with stretches that contain mostly pyrimidine. Such landscapes provide us with clues about areas in which LRCs are found (fBm), and areas in which the purine-pyrimidine sequence is totally random and independent (Bm).
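The mapping itself is simple enough to sketch in a few lines of Python; this fragment is an illustration, not the authors' software, and the toy sequence is invented:

```python
# Sketch of the purine-pyrimidine walk: the walker steps up on a
# pyrimidine (T or C) and down on a purine (A or G), producing the
# 'landscape' of cumulative heights described in the text.
# (Ambiguity codes are not handled in this sketch.)

def walk_landscape(seq):
    """Return the list of cumulative walk heights for a DNA string."""
    heights, h = [], 0
    for base in seq.upper():
        h += 1 if base in "TC" else -1  # +1 pyrimidine, -1 purine
        heights.append(h)
    return heights

print(walk_landscape("TTCAGG"))  # [1, 2, 3, 2, 1, 0]
```

Plotting such heights against position yields landscapes of the kind shown in Figure 1a.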
In general, rescaled analysis can be applied to any time or space series. We define the values x_k = +1 if the k-th base is T or C, and x_k = -1 if it is A or G. The sequence {x_k}, k = 1, …, N, can be treated as a fractal record in time or space, and for any 2 ≤ n ≤ N one can define:

⟨x⟩_n = (1/n) Σ_{i=1}^{n} x_i,   X(i, n) = Σ_{m=1}^{i} (x_m − ⟨x⟩_n),

R(n) = max_{1≤i≤n} X(i, n) − min_{1≤i≤n} X(i, n),

S(n) = [ (1/n) Σ_{i=1}^{n} (x_i − ⟨x⟩_n)² ]^{1/2}.
Fig. 1a; Fig. 1b
For scale-free data, R(n)/S(n) ~ (n/2)^H, in which H is called the Hurst exponent. As n grows, we obtain growing values of R(n)/S(n). The Hurst exponent is estimated from the least squares fit of log[R(n)/S(n)] against log[n]. As an example, we show the log-log plot of the R(n)/S(n) analysis for the cyiiia [13] regulatory region segment, for which H = 0.79 (Figure 1b).
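The estimation procedure can be sketched from the standard rescaled-range recipe given above (this is not the authors' code, and the grid of window sizes is an illustrative assumption):

```python
import math, random

def rescaled_range(x, n):
    """R(n)/S(n) averaged over non-overlapping sub-series of length n."""
    ratios = []
    for start in range(0, len(x) - n + 1, n):
        seg = x[start:start + n]
        mean = sum(seg) / n
        cum, dev = 0.0, []
        for v in seg:            # X(i, n): cumulative deviation from the mean
            cum += v - mean
            dev.append(cum)
        r = max(dev) - min(dev)  # R(n)
        s = math.sqrt(sum((v - mean) ** 2 for v in seg) / n)  # S(n)
        if s > 0:
            ratios.append(r / s)
    return sum(ratios) / len(ratios)

def hurst(x, sizes=(8, 16, 32, 64, 128)):
    """Least-squares slope of log[R(n)/S(n)] against log[n]."""
    pts = [(math.log(n), math.log(rescaled_range(x, n))) for n in sizes]
    mx = sum(a for a, _ in pts) / len(pts)
    my = sum(b for _, b in pts) / len(pts)
    return (sum((a - mx) * (b - my) for a, b in pts)
            / sum((a - mx) ** 2 for a, _ in pts))

random.seed(1)
steps = [random.choice((-1, 1)) for _ in range(4096)]  # uncorrelated increments
# Expect a value not far from 0.5; the classical small-sample bias of
# R/S analysis tends to push the estimate slightly above 0.5.
print(round(hurst(steps), 2))
```

Feeding in the ±1 purine-pyrimidine values of a real sequence, instead of the random steps above, gives the H estimates discussed in the text.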
Applied to fractional Brownian motion, a system is called 'persistent' when H > 0.5, which means that if at any time the motion is in one direction, it is more likely that the motion will continue in this particular direction. Systems with H < 0.5 are called anti-persistent, and the opposite holds. For H = 0.5, the system displays Brownian motion and does not have LRCs.
Peng et al. [1] claim that only the non-coding areas of DNA exhibit long-range power-law correlations, and that coding DNA does not. This observation formed the basis of an algorithm that distinguishes coding and non-coding regions [8], which works satisfactorily as long as the coding region is above 1000 bp in length.
We could not confirm the observation that
H = 0.5 for coding DNA. However, there
still was a significant difference between
different functional parts of DNA, as the
Hurst exponent seems to have the greatest
value for non-coding non-regulatory
DNA, decreases for regulatory segments,
and is smallest (but still sometimes not
equal to 0.5) for exons in sea urchin and
mouse genes.
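This qualitative ordering (exons lowest, regulatory segments intermediate, non-coding non-regulatory DNA highest) could be turned into a rough three-way labelling. The cut-off values below are purely hypothetical illustrations, since the text reports an ordering rather than fixed thresholds:

```python
# Illustrative only: the paper gives a qualitative ordering of Hurst
# exponents, not fixed cut-offs; lo and hi are invented for this sketch.

def classify_by_hurst(h, lo=0.55, hi=0.7):
    """Map an estimated Hurst exponent to a (hypothetical) region label."""
    if h < lo:
        return "candidate coding"
    if h < hi:
        return "candidate regulatory"
    return "candidate non-coding non-regulatory"

print(classify_by_hurst(0.79))  # candidate non-coding non-regulatory
```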
We already mentioned that the above algorithm is highly dependent on window size. In view of the fact that the average length of a human exon is 146 bp, the data presented above, which were obtained with a 1000 bp window size, will probably not tell us a lot about exon locations. Therefore, instead of scanning the DNA sequence with fixed windows, we set out to retrieve the areas with the most pronounced LRCs, on the assumption that these areas are likely candidates for non-coding non-regulatory regions. We also searched for areas with very weak LRCs (i.e. areas of Brownian motion, with a Hurst exponent close to 0.5), assuming that those are probably coding regions. The regions with intermediate values of the Hurst exponent are hypothesised to be regulatory regions because of the intermediate properties of their evolutionary patterns (E. Davidson, private communication).
To this aim, we applied the technique outlined in the Introduction. The optimal window length and location is defined by the area that has a maximum (or minimum) Hurst exponent:

wind* = arg max_wind H(wind).

Using this simple and transparent idea, we simultaneously define the borders of the coding and regulatory regions on the basis of the variation of the Hurst exponent along the sequence.
The behaviour of the Hurst exponent confirmed our hypotheses in our test systems. Typically, H is minimal in the (known) coding areas, maximal for non-coding non-regulatory DNA, and has intermediate values in the regulatory regions.
Figure 2a contains a plot of the values of the Hurst exponent along a 3 Kb context around the conserved sequence seqA in the HoxD mouse gene (the annotated sequences were kindly provided by Dr. D. Yap, MRC Cambridge). Thus, application of the algorithm leads to the conclusion that seqA is likely to be a regulatory or even a coding sequence. The area upstream of seqA, starting approximately at position 550, is most likely to be a coding region.
To obtain the results, we started with fixed window sizes close to the real sizes of the stretches of interest. In real-life problems, of course, we do not know the lengths of the objects of interest, so the optimal window size should be calculated.
The plot in Figure 2b is obtained by calculating the Hurst exponents successively in each varying (optimal) sliding window, with a step size of 5 bp, along the most conserved parts of the non-coding DNA of the otx gene sequence [14]. In Figure 2a, H reaches its minimum approximately in the same area where the entropy is maximal (see below), 600-1300, whereas the next minimal area contains the seqA sequence. In Figure 2b, the presumed regulatory segment starts at position 350 and is 350 bp long. Notice that exon 1,2 starts at position 1270, where H decreases even below 0.5.

Fig. 2a

Fig. 2b

2.2 Entropy measurement
Firstly, we briefly describe the conventional procedure for measuring DNA entropy, and point out our reasons for not applying it.
The conventional procedure for measuring DNA entropy [4] typically calculates a frequency vector describing the area's nucleotide composition for a sufficiently large, but subjectively defined, area, and then submits it to the well-known Entropy formula:

H(seq) = − Σ_{i=1}^{M} P_i · log(P_i).

It is assumed that the most conserved areas (coding regions) are also the most homogeneous, and thus have the highest entropy. However, for stretches longer than some 700 bp, an average entropy is calculated, whose value may no longer be correlated with DNA homogeneity. The stretch of DNA under consideration may well be highly heterogeneous and have a low entropy, but owing to the effect of averaging it is not possible to distinguish between homogeneous and heterogeneous areas.
Therefore, we should again only consider local areas along the sequence, and optimize their length and locations. The parameter of interest in this case is the entropy level. As a more robust measure, we also introduce the 'entropy density': the entropy of one segment divided by the length of the segment, so as to avoid any dependence on the length. However, because the entropy increases with the length of a regular structure (the longer a homogeneous segment, the higher its entropy), the entropy density measure is of minor importance in our analysis.
2.2.1 Numerical tests for entropy
To measure information entropy along a sequence, transition information matrices (or some other frequency representation) may be calculated. Transition information matrices are defined similarly to transition matrices for first-order Markov models, but are instead normalized to the total matrix sum. Thus, the matrix values are equal to the probabilities of finding any given adjacent pair in the sequence. If we represent these probabilities as the vector {P_k}_{k=1}^{M} = {P_ij : P_ij > 0}, in which M is the number of non-zero probabilities, we can calculate the information entropy of any part of the sequence, where the values of P_k are substituted into the Entropy formula (see above).
After calculation of the information entropy in a sliding optimal window along the sequence, we get plots similar to the one in Figure 3.

Fig. 3

The maximum entropy point is the equilibrium state, or minimum complexity point. The areas with maximum entropy correspond to the experimentally observed exon and the most locally conserved seqA sequences; they are approximately 700 and 200 bp in length, and were again identified with the optimization procedure described above. This analysis reveals that the regions 500-1300 and 2000-2200 (local coordinates) are 'hot': the first is probably coding, the second is coding or regulatory. After we had performed our entropy measurement analysis, our conclusions were confirmed by cross-genomic comparison and laboratory experiments.
2.3 Segmentation using the Jensen-Shannon divergence formula
The Jensen-Shannon distance [6] is commonly used as a tool for partitioning heterogeneous DNA into relatively homogeneous parts. The difference in the base composition of two concatenated sequences of lengths n1 and n2 is effectively measured using the Jensen-Shannon divergence formula:

J(n1, n2) = N · [H − ((n1/N) · H1 + (n2/N) · H2)].

Here H is the entropy of the whole sequence, H1 is the entropy of the left sequence, H2 is the entropy of the right sequence, and N = n1 + n2.
In order to distinguish areas with
maximum difference in DNA base pair
composition, we must first define the
regions with more or less stationary
(constant) composition. This, again, can
only be done reliably by optimizing the
length of the local windows. To this aim,
we move the pointer along the DNA
sequence, and find the maximum
difference in base composition to the right
and to the left of this pointer (a change
point) in the optimal windows. We look
for the most constant compositions at both
sides of the pointer. The maximal
difference in the DNA base pair
composition is reached when we
encounter the most homogeneous (high
entropy) region at one side, and the most
non-homogeneous, but still well-mixed
constant stretch (micro-satellites are the
most likely candidates) at the other side of
the change point.
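A minimal sketch of this change-point search, assuming the divergence formula of Section 2.3 with entropies computed from mononucleotide frequencies (the margin parameter and the toy sequence are illustrative assumptions):

```python
from math import log
from collections import Counter

def entropy(seq):
    """Shannon entropy of the nucleotide frequencies in seq."""
    n = len(seq)
    return -sum((c / n) * log(c / n, 2) for c in Counter(seq).values())

def js_divergence(left, right):
    """Jensen-Shannon measure for two concatenated segments:
    H(whole) - (n1/N) H(left) - (n2/N) H(right)."""
    n1, n2 = len(left), len(right)
    N = n1 + n2
    return (entropy(left + right)
            - (n1 / N) * entropy(left) - (n2 / N) * entropy(right))

def change_point(seq, margin=10):
    """Pointer position maximizing the divergence between its two sides.
    Since H(whole) is fixed, this amounts to minimizing the weighted
    entropies of the two parts."""
    return max(range(margin, len(seq) - margin),
               key=lambda i: js_divergence(seq[:i], seq[i:]))

# Homopolymer on the left, well-mixed stretch on the right.
seq = "A" * 20 + "ACGTACGTACGTACGTACGT"
print(change_point(seq))  # near position 20, where the composition changes
```

Applying this pointer recursively to each resulting part gives the usual Jensen-Shannon segmentation of a long sequence.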
In Figure 4, the central peak of the divergence function indicates the start of the conserved sequence seqA upstream of the HoxD mouse gene.

Fig. 4
3 Bayesian integration
For a reliable position estimate, the information of consecutive runs of different methods must be integrated. Suppose that P(j | Inf_1 ∩ … ∩ Inf_{n−1}) is the posterior probability that nucleotide bin j refers to the coding/regulatory region, given independent pieces of information in the form of the results of the algorithms Inf_1, …, Inf_{n−1}. The results are the collections of bins within the given genomic sequence predicted to be a coding/regulatory region by the corresponding algorithm. To update the probability for bin j, given the results of the new algorithm, Inf_n, we use the following update formula [11]:

P(j | Inf_n) = α_n · P(j | Inf_{n−1}) · P(Inf_n | j).

In our case the information values, Inf_1, …, Inf_n, are the outputs of the different independent search algorithms that we use to update our prior belief about the location of coding/regulatory regions. The term P(Inf_n | j) (also denoted as L(j | Inf_n)) is the likelihood of obtaining the results Inf_n assuming that bin j refers to the coding/regulatory location. In our case, these likelihoods are discrete values, depending on whether bin j is detected as a coding/regulatory one or not, assuming that the algorithm works fairly. The constant α_n simply normalizes the sum of the position bin probabilities over all j to 1.
To initialise this recursive updating formula, we need a prior belief that bin j is the actual location of a coding/regulatory region. To reflect a total lack of knowledge, all bins j are uniformly distributed: P(j) = 1/L, which means that all bins in our genomic DNA are equally probable to be a coding/regulatory region; L is the length of the genomic sequence. The first step is to update our belief with the information given by the results of the first algorithm (Hurst values), Inf_1:

P(j | Inf_1) = α_1 · P(j) · L(j | Inf_1).

Then Inf_2 must be integrated:

P(j | Inf_2) = α_2 · P(j | Inf_1) · L(j | Inf_2),

and so on, until we have integrated all available independent information about the bin locations in the sequence.
In our case, we have three independent methods: rescaled analysis, information entropy analysis, and Jensen-Shannon divergence segmentation. Thus, we must define the likelihoods for all bins j for all three methods.
At the end of the recursive integration process, all nucleotide position bins in the genomic sequence are scored with posterior probabilities of being coding/regulatory regions. We pick those that are maximal with respect to the constructed MAP:

j* = arg max_j Σ_{i=1}^{3} L(j | Inf_i),

where j* is the set of the most probable position bins, and is not a unique value.
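The recursive update with discrete likelihoods can be sketched as follows. The bin count, the detected-bin sets, and the 0.1 likelihood assigned to undetected bins are illustrative assumptions; note also that this sketch applies the product-form recursive update, whereas the final selection formula above sums the likelihoods:

```python
# Sketch of the recursive Bayesian update over position bins.
# All numbers here are invented for illustration.

L = 10                          # number of position bins
posterior = [1.0 / L] * L       # uniform prior: total lack of knowledge

def update(posterior, likelihood):
    """One step of P(j | Inf_n) = a_n * P(j | Inf_{n-1}) * L(j | Inf_n),
    with a_n chosen so the probabilities sum to 1."""
    unnorm = [p * l for p, l in zip(posterior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three independent methods each flag a set of bins (likelihood 1.0 for a
# detected bin, a small 0.1 floor otherwise so probabilities never vanish).
hits = [{3, 4, 5}, {4, 5, 6}, {3, 4}]
for detected in hits:
    likelihood = [1.0 if j in detected else 0.1 for j in range(L)]
    posterior = update(posterior, likelihood)

print(posterior.index(max(posterior)))  # 4 -- flagged by all three methods
```

Bins flagged consistently across methods accumulate probability mass, which is the intended behaviour of the integration step.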
Note that we use the sum of the likelihoods, rather than their product. Thus, we have constructed the MAP estimate, which in the case of an uninformed prior reduces to an MLE (maximum likelihood estimator). It is known [12] that an asymptotic MLE approaches a normal curve, whose mean provides the best posterior estimate of the parameter. In practice, 3-5 independent sources of information are sufficient to approach it. In our case we are interested in approaching a number of local maxima (equal to the number of actual coding/regulatory regions); the means are the locations of the actual coding/regulatory regions. Theoretically, the more algorithm results we have, the more robust the results are. We have shown that integrating the results of only three independent algorithms was sufficient to improve the performance of the search algorithms.
4 Discussion
To perform a reliable segmentation of a
stretch of DNA into coding, non-coding
regulatory, and non-coding non-regulatory
regions, we must define the regions with
more or less stationary (constant)
composition. It is impossible to
accomplish this task without a clear
understanding of the nature of the
processes or events that cause DNA to be
heterogeneous or homogeneous.
Arguably, the main cause of DNA heterogeneity is the evolutionary process of DNA duplication and mutation at all scale levels [15]. Different regions have different duplication and mutation rates because of different selective constraints (E. Davidson, personal communication). In coding regions, only selected single nucleotide polymorphisms (SNPs) are allowed between evolutionary micro-states. This makes coding regions the most homogeneous in the genome and, correspondingly, the most entropic.
Micro-satellites seem to play a special role in the transcription factor recruitment process. These mysterious entities are eschewed and masked as 'weeds' by many bioinformatics repeat-masking tools, even though in population genetics micro-satellites are regarded as very important pointers [16]. From a computational point of view, micro-satellites can be the main cause of local DNA heterogeneity.
Let us mention possible biological reasons for DNA complexity and long-range correlations. From a molecular biological point of view, long-range correlations (LRCs) and DNA complexity are not surprising, since the complex organization of genomes involves many different scales. For example, there is a well-known LRC of GC content. It has been pointed out that the mosaic structure of genomes is presumably responsible for LRCs [3]. The organization of the genome is very complex: eukaryotic genes usually consist of several protein coding segments (exons), interrupted by intervening sequences (introns). There are also regulatory elements such as promoters, splice sites, enhancers and silencers, which are sometimes up to thousands of base pairs away from exons. Genomes of higher eukaryotes also comprise long stretches of DNA without any obvious biological function, containing, e.g., pseudogenes and various types of repeats.
We selected the sea urchin otx and mouse hoxD genes as our test beds. These sequences were selected from cross-genome comparison, and there may be significant uncertainty about the function and location of the conserved regions. This, however, is commonly the case in bioinformatics, and must be taken into account.
5 Conclusion
Statistical significance of regions detected with our integrated approach does not necessarily imply biological significance. Moreover, the observations on which the approach is founded are working hypotheses, not incontrovertible facts. However, the fact that application of the approach to data from two genes in two species predicted actual known regulatory and coding regions suggests that the approach may indeed be a useful aid to regulatory and coding region prediction, especially since it requires no information other than the genomic DNA sequence of interest. Additionally, our optimization technique could be of help in various DNA segmentation methods.
References:
[1] Peng, C. K., Buldyrev, S. V., Havlin, S., Simons, M., Stanley, H. E. and Goldberger, A., Mosaic Organization of DNA Nucleotides, Physical Review E, Vol. 49, 1994, pp. 1685-1689.
[2] Voss, R., Evolution of Long-Range Fractal Correlations and 1/f Noise in DNA Base Sequences, Physical Review Letters, Vol. 68, 1992, pp. 3805-3808.
[3]Herzel, H. and Große, I., Correlations
in DNA sequences: The role of
protein coding segments, Physical
Review E , Vol. 55, 1997, 800-810.
[4]Bernaola-Galván, P., Oliver, J. L. and
Román-Roldán, R., Decomposition
of DNA Sequence Complexity,
Physical Review Letters, Vol. 83, 1999,
pp. 3336-3339.
[5]Azbel, Y. M., Universality in a DNA
statistical structure, Physical Review
Letters, Vol. 75, 1995, pp. 68-171.
[6]Li, W., The complexity of DNA,
Complexity, Vol.3, 1997, pp.33-37
[7]Mantegna, R. N., Buldyrev, S. V.,
Goldberger, A. L., Havlin, S., Peng, C.
K., Simons, M. and H. E. Stanley, 1994,
Linguistic features of noncoding
DNA sequences, Physical Review
Letters, Vol. 73, 1994, pp. 3169-3172.
[8] Ossadnik, S., Buldyrev, S., Goldberger, A., Havlin, S., Mantegna, R., Peng, C., Simons, M. and Stanley, H., Correlation Approach to Identify Coding Regions in DNA Sequences, Biophysical Journal, Vol. 67, 1994, pp. 64-70.
[9]Buldyrev, S. V., Goldberger, A. L. ,
Havlin, S., Peng, C. K., Simons, M.,
Sciortino, F. and Stanley, H. E., Long
range fractal correlations in DNA
(Comment on the letter by R. F. Voss
in PRL, 68, 3805), Physical Review
Letters, Vol. 71, 1993, p. 1776.
[10]Li, W., Marr, K., and Kaneko, K.
Understanding long range correlations
in DNA sequences, Physica D, Vol.
75, 1994, pp. 392-416.
[11]Pearl, J., Probabilistic reasoning in
intelligent systems, Morgan
Kaufmann, 1988.
[12]Papoulis, A., Probability, Random
Variables, and Stochastic Processes,
McGraw-Hill, Inc, 1991
[13]Kirchammer, C. and Davidson, E.,
Title? Development, Vol. 122, 1996,
pp. 333-348.
[14] Yuh, C., Brown, C. T., Livi, C., Rowen, L., Clarke, P. and Davidson, E., Patchy Interspecific Sequence Similarities Efficiently Identify Positive cis-Regulatory Elements in Sea Urchin, Developmental Biology, Vol. 246, 2002, pp. 148-161.
[15]Ohno, S. Evolution by gene
duplication, Springer, Berlin
Heidelberg NY, 1970.
[16] Pritchard, J. K., Stephens, M. and Donnelly, P., Inference of Population Structure Using Multilocus Genotype Data, Genetics, Vol. 155, 2000, pp. 945-959.