ANALYSIS OF DNA SEQUENCE IN THE FRACTAL PERSPECTIVE:
THE CHAOS GAME AND FRACTIONAL BROWNIAN MOTION
Yoon-jung Choi
MAT335H Term Project
Professor: Randall Pyke
Submission: May 20, 2003
University of Toronto, Department of Mathematics
ABSTRACT
Various mathematical methods have been applied to investigate the nature of
DNA sequences. The chaos game representation of DNA sequences has been reported to
produce a unique pattern consistently over different parts of the genome of an organism.
From the image generated from the chaos game, characteristics of a DNA sequence can
be studied, such as finding association between two letters. The concept of fractional
Brownian motion has been also applied to DNA sequence leading to the discovery of the
long-range correlation in DNA sequence. However, in order to fully understand the
pattern in a DNA sequence, application of more than one method is desired. In this report,
the upstream region (31,375bp) of human serotonin receptor 2A gene (HTR2A) was
analyzed by using both the chaos game and fractional Brownian motion. The two
methods complement each other, and here I suggest a novel perspective for interpreting the
data obtained from the chaos game and fractional Brownian motion.
INTRODUCTION
DNA, deoxyribonucleic acid, is composed of an extremely long array of
nucleotides. Each nucleotide contains one of four bases: adenine (A) and guanine (G),
which are purines (double-ring structure), and cytosine (C) and thymine (T), which are
pyrimidines (single ring). An example of a DNA sequence is
…GTGATAGGGTCTCACTCTGT…
This letter sequence can be converted to a quaternary number sequence by
changing T into 1, A into 2, C into 3, and G into 4, giving
…41421244413132313141…
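The letter-to-digit conversion can be sketched in a few lines of Python. This is only an illustration of the mapping stated above; the helper name `to_quaternary` is hypothetical and is not part of any program used in this report.

```python
# Mapping stated in the text: T -> 1, A -> 2, C -> 3, G -> 4.
BASE_TO_DIGIT = {"T": "1", "A": "2", "C": "3", "G": "4"}


def to_quaternary(sequence: str) -> str:
    """Return the quaternary digit string for a DNA letter sequence."""
    return "".join(BASE_TO_DIGIT[base] for base in sequence.upper())


print(to_quaternary("GTGATAGGGTCTCACTCTGT"))  # 41421244413132313141
```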
The length of the DNA sequence in a genome varies between species. For
example, the genomic DNA sequence of a fruit fly is approximately 120Mbp
(120×10^6 letters) long, that of a mouse is about 2900Mbp, and that of a human is about
3213Mbp. This genomic sequence is what is contained in the whole
set of chromosomes in the nucleus of a single cell. It is remarkable that the
DNA sequence contained in a cell dictates the development of a complete, mature organism
from one single cell. Scientists have attempted to decipher the structure and meaning of
DNA sequences; however, consensus has not been reached and opinions diverge.
A prevalent method for DNA analysis is related to random walk or Brownian
motion which led to the discovery of long-range correlation in DNA sequences (Peng et
al., 1992; Voss, 1992; Chatzidimitriou-Dreismann et al., 1994). Another more recent
approach is applying the chaos game (Jeffrey, 1990; Deschavanne et al., 1999; Almeida
et al., 2001). However, these two methods appear to run in parallel without a focal point.
In the first part of this report, the application of the chaos game to a selected DNA
sequence will be described. The chaos game representation of DNA sequence led to the
application of fractional Brownian motion, which will be explained in the second part.
Lastly in the third part, the procedure of coordination between the data from the two
independent methods will be elaborated, and the relevance of the data coordination in
DNA analysis will be emphasized in the discussion section.
The sequence studied in this report was the 5’ region (31,375bp) of the human
serotonin receptor 2A gene (HTR2A), extracted from genomic sequence available at
GenBank (see reference). Various medical studies have proposed that HTR2A is associated
with schizophrenia, bipolar disorder, seasonal affective disorder, and suicidal behaviors
(see reference-data source).
I. APPLICATION OF CHAOS GAME TO ANALYSIS OF A DNA SEQUENCE
Drawing the Sierpinski Triangle by the Chaos Game
The Sierpinski triangle is a fractal structure, that is, a part of the structure resembles the
entire structure, so it is self-similar (Fig.1). One way to draw a fractal structure is by
playing the chaos game. First, an equilateral triangle with the numbers 1, 2, and 3 written at
its corners is used as a game board, and an arbitrary point inside or outside the
triangle is used as the initial game point z0 (Fig.1). Then a number 1, 2, or 3 is randomly
chosen, for example, by rolling a die with each number written twice. Suppose 1 is
chosen. Then we move z0 to the midpoint between corner 1 and z0, generating the next game
point z1. Similarly, z2 is obtained by moving z1 to the midpoint between the next number
chosen, say 3, and the previous point, z1. Repeating this procedure eventually produces
the image of the Sierpinski triangle. The principle of the chaos game is that the current point
is determined from the previous point by a function w_i:
$$z_{k+1} = w_i(z_k),$$
where z_k denotes the game point at step k, and w_i is the iterated function that describes the
movement of game points. Each game point
$$z_k = w_k(w_{k-1}(\cdots(w_2(w_1(z_0)))\cdots))$$
is assigned an address s_k s_{k-1} … s_2 s_1, where s_n is the number chosen from the die at the nth step. Thus,
playing k steps of the chaos game generates an address k numbers long, which is
a ternary number sequence. As mentioned in the introduction, a DNA sequence can be
represented by a quaternary number sequence. The chaos game can be modified in such a
way that four numbers are used to generate an image for a DNA sequence.
[Figure 1: chaos game board and resulting images. Panel notes: z_k has an address of length k, e.g. 21321323312….213213213 (a ternary sequence); z_{k+1} = w_i(z_k); example sequence: 113213.]
Fig.1. The chaos game for the Sierpinski triangle (the third image).
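The midpoint rule described above can be sketched as follows. This is a minimal illustration, not the applet used in the course: the corner coordinates, the starting point, and the function name `chaos_game` are assumptions chosen for the example.

```python
import random

# Corner coordinates of an equilateral triangle (an assumption for this sketch).
CORNERS = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.5, 3 ** 0.5 / 2)}


def chaos_game(steps, z0=(0.3, 0.3), rng=None):
    """Play the triangle chaos game; return the game points and the chosen numbers."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    x, y = z0
    points, chosen = [], []
    for _ in range(steps):
        s = rng.choice((1, 2, 3))          # the "die roll"
        cx, cy = CORNERS[s]
        x, y = (x + cx) / 2, (y + cy) / 2  # move to the midpoint
        points.append((x, y))
        chosen.append(s)
    return points, chosen


points, chosen = chaos_game(10000)
# The address of the last point is the list of chosen numbers read backwards.
address = chosen[::-1]
```

Plotting `points` with any plotting tool should show the triangle pattern emerging after the first few iterations.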
Chaos Game Representation of a DNA Sequence
Although various methods can be developed for the chaos game representation of
DNA sequences, the following is the most prevalent method reported in the literature. The
corners of a 1-by-1 square are labelled T at (0,0), A at (0,1), C at (1,1), and G at (1,0), and
the initial point z0 is drawn at (0.5, 0.5) (Almeida et al., 2001). Unlike the chaos game for the
Sierpinski triangle, where the number is randomly selected, the chaos game for a DNA
sequence is played according to the given DNA sequence. For example, for a sequence
ATGCGAGT…, the first game point z1 is obtained by moving z0 (0.5, 0.5) to the
midpoint between z0 and A(0,1) (Fig.2). Likewise, z2 is drawn at the midpoint between
T(0,0) and z1. In general,
$$z_{k+1} = w_i(z_k) = z_k - \tfrac{1}{2}\,[z_k - q_{k+1}],$$
where q_{k+1} is the position of one of the four letters at step k+1. For example, q5 would be
G(1,0) in this case. As described earlier, zk is assigned an address s_k s_{k-1} … s_2 s_1; thus, in
the case of the sequence above, z8 would have the address TGAGCGTA.
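[Figure 2: the unit square with corners T, A, C, G and the first game points for the sequence ATGCGAGT…. Panel notes: z_k is the game point at step k; q_k is the fixed corner point corresponding to s_k, for the sequence s_1 s_2 … s_{k-1} s_k; z_{k+1} = z_k − ½[z_k − q_{k+1}].]
Fig.2. The chaos game for DNA sequence (Almeida et al., 2001).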
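As a small check of the midpoint rule, the game points for the example sequence can be computed directly. The sketch below assumes the corner assignment given in the text (T at (0,0), A at (0,1), C at (1,1), G at (1,0)); the function name `cgr_points` is hypothetical and unrelated to the Dnacgr program.

```python
# Corner positions as given in the text.
CORNER = {"T": (0.0, 0.0), "A": (0.0, 1.0), "C": (1.0, 1.0), "G": (1.0, 0.0)}


def cgr_points(sequence, z0=(0.5, 0.5)):
    """Return the chaos game points z1, z2, ... for a DNA sequence."""
    x, y = z0
    points = []
    for base in sequence.upper():
        qx, qy = CORNER[base]
        # Move halfway from the current point towards the corner of this letter.
        x, y = x + 0.5 * (qx - x), y + 0.5 * (qy - y)
        points.append((x, y))
    return points


pts = cgr_points("ATGCGAGT")
print(pts[0])  # (0.25, 0.75): midpoint between z0 and A(0,1)
print(pts[1])  # (0.125, 0.375): midpoint between z1 and T(0,0)
```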
Fig.3a shows the result of the chaos game for 31,375bp of the serotonin receptor 2A
gene, obtained with the Dnacgr (Chaos Game Representation of DNA sequence) program (see
reference). This image is remarkably similar to the ones reported in the literature. Chaos
game images of the human β-globin region (73,357bp) (Jeffrey, 1990), human intron sequences
(Solovyev, 1993), and a randomly selected human DNA sequence 100Kbp long
(Deschavanne et al., 1999) resemble the image shown in Fig.3a. Interestingly,
Deschavanne et al. (1999) found in other species that the images obtained from
parts of a genome presented the same structure as that of the whole genome. They also showed
that different organisms exhibited different patterns in the images.
As mentioned by Deschavanne et al. (1999), a closer look at Fig.3a reveals
two major features of the DNA sequence. First, the empty patches indicate that the areas
which have GC in their addresses have notably low density. Because an address lists the
letters in reverse order, this means that the
probability of CG occurring along the sequence is very low. The particular shape of the
empty patches is repeated at different scales. In particular, the sub-quadrants T, A, and C
resemble the entire structure. Yet, the image as a whole is not a fractal in a precise sense
because sub-quadrant G is not self-similar. In order for an image to be a fractal, any part
of the image should regenerate the entire structure when rescaled. Second, the diagonal
lines imply the prevalence of AA, AG, GA, GG, TT, CT, TC, and CC (Fig.3a). Yet, these
features do not directly tell us whether the sequence is random or not, even though they
indirectly suggest non-randomness of the sequence. To answer this question more clearly,
and to provide explicit evidence, further experiments were performed.
Interpretation of the Image from the Chaos Game for the DNA sequence
The probability of each letter can also be calculated by using the Dnacgr program.
The input sequence had probability 0.3127 for A, 0.1887 for C, 0.2925 for T, and 0.2061
for G. In order to test the role of these different probabilities in generating the distinct
pattern, the sequence was shuffled and used as an input for the program. Note that
shuffling does not alter the probability of each letter, but it only breaks the order of letters
in the sequence. Fig.3b shows the image generated from the shuffled sequence, which is
very different from the original image; the empty patches and diagonal lines disappeared.
The gradient observed is due to the different probabilities. Since disrupting the order of
letters destroys the original image, the sequence must have certain patterns in its order,
and is thus not random. If the original sequence were random, the image from the shuffled
sequence would have been the same as that from the original sequence.
In fact, when the shuffled sequence was shuffled again, the resultant image was
similar to the image obtained from the shuffled sequence, which implies that the order of
letters in the shuffled sequence was indeed random (image not shown). To further ensure
that the shuffled sequence had random order of letters, the known probabilities were
entered into the full square chaos game applet (see reference), which uses random
numbers to generate game points (Fig.4). The image was similar to the previous two
images obtained from the shuffled and reshuffled sequences. Since random order of letters with
the fixed probabilities abolishes the original pattern, the factor primarily responsible for
the original image must be the order rather than the probability of each letter.
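The shuffling argument can be illustrated on a toy sequence: shuffling keeps the letter counts but changes the counts of adjacent pairs. The sketch below is not the Shuffle DNA program listed in the references; the toy sequence is invented for the example and is not the HTR2A data, and the helper names are hypothetical.

```python
import random
from collections import Counter


def shuffled(sequence, seed=0):
    """Return a random permutation of the letters (composition is unchanged)."""
    letters = list(sequence)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def dinucleotide_counts(sequence):
    """Count every pair of adjacent letters along the sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


toy = "AAGGAAGGTTCCTTCC" * 50  # toy sequence with no CG pair at all
mixed = shuffled(toy)

print(Counter(toy) == Counter(mixed))   # True: same letter probabilities
print(dinucleotide_counts(toy)["CG"])   # 0 in the ordered toy sequence
print(dinucleotide_counts(mixed)["CG"])  # typically well above 0 after shuffling
```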
Fig.3. The chaos game image of the human serotonin receptor 2A region (31,375bp). a: original sequence, b: shuffled sequence. Note the corner positions A(0,1), C(1,1), T(0,0), G(1,0).
Fig.4. Full square chaos game with random numbers, with the probabilities fixed as in the
original sequence. Corner labels: 1(T), 2(A), 3(C), 4(G). Probability of T: 0.2998, A: 0.3004, C: 0.1968, and G: 0.2030.
Then how can we test (instead of assuming) that the empty patches and diagonal
lines result from biased association between two letters? Using a modified chaos
game applet (see reference), ‘34’, representing ‘CG’, was entered as the substring so that
CG would be eliminated from the chaos game. This applet uses random numbers from 1
to 4, with the probabilities adjusted as earlier. As a result, the modified chaos game
imitated the pattern of empty patches in the original image. Thus, the empty patches can
indeed be characterized by CG depletion in the sequence. However, the diagonal lines
were still not present in the image from this simulation. Demonstrating the reason for the
diagonal lines is more difficult with the chaos game, and an alternative method will be
introduced in the next section.
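A simulation in the spirit of the modified chaos game can be sketched as below. How the applet actually suppresses the substring is not documented here, so the rejection step (redraw a letter whenever it would complete the forbidden pair) is an assumption, as is the function name `random_sequence_without`. The probabilities are those listed in the caption of Fig.4.

```python
import random

# Letter probabilities as listed for Fig.4.
PROB = {"T": 0.2998, "A": 0.3004, "C": 0.1968, "G": 0.2030}


def random_sequence_without(length, forbidden="CG", seed=0):
    """Draw random letters with fixed probabilities, never completing `forbidden`."""
    rng = random.Random(seed)
    letters, weights = list(PROB), list(PROB.values())
    out = []
    while len(out) < length:
        base = rng.choices(letters, weights)[0]
        # Assumed rule: reject a letter that would create the forbidden pair.
        if out and out[-1] + base == forbidden:
            continue
        out.append(base)
    return "".join(out)


simulated = random_sequence_without(5000)
print("CG" in simulated)  # False
```

Feeding `simulated` to a chaos game routine for DNA should reproduce empty patches at the addresses containing GC, while leaving the rest of the square filled.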
The chaos game can provide a quick overview of the characteristics of a given
sequence; however, it has a limitation in interpreting a sequence. Essentially, the image
from the chaos game does not tell us the order of the game points. For example, when the
sequence from the second half of the serotonin receptor 2A gene (31376bp) was run through the chaos
game, the image was indistinguishable from the original image, which used the first half
of 5HT2R (data not shown). The two sequences were found to be completely different
when tested by Clustal W (see reference), a program that aligns different sequences in
parallel and matches the common letters. Thus, having the same image does not mean the
sequences are also similar, although the sequences must share certain characteristics.
Therefore, a more rigorous approach was required to trace the order of letters in the
sequence, such as plotting the DNA sequence in a time-series format (Peng et al., 1992).
Fig.5. Modified chaos game (full square) with the fixed probability and substring 34. Corner labels: 1(T), 2(A), 3(C), 4(G).
II. APPLICATION OF FRACTIONAL BROWNIAN MOTION TO ANALYSIS OF A
DNA SEQUENCE
DNA walk
The motion of a Brownian particle consists of steps of a characteristic
length in a random direction; thus, it is also called a random walk (Feder, 1988). Suppose
the particle moves on the x-axis by jumping +ξ or −ξ every τ seconds; then its movement
can be plotted as time proceeds. Likewise, a DNA sequence can be plotted in the form of a
time series, with the x-axis representing position along the DNA sequence instead of time (Peng et
al., 1992). This way, the order of letters is preserved along the sequence, unlike in
the chaos game.
The DNA Walker program (see reference) was used to generate the plot of the ‘random’
walk of the given sequence (Fig.6). For the one-dimensional DNA walk, the purine-pyrimidine
skew scale was used, in which G,A (purines) were moved +1, and C,T (pyrimidines)
were moved -1. The original sequence was plotted first, followed by the shuffled sequence,
and the two plots were then placed together for easier comparison. The shuffled sequence exhibited a
tendency to oscillate closer to the x-axis than the original sequence, which drifted further
down and then moved back towards the x-axis. The tendency to move constantly up or
constantly down suggests that purines tend to be associated with purines, and pyrimidines
with pyrimidines. This agrees with the game points that are concentrated on
the diagonal lines in the chaos game. The regions of the diagonal lines contain the addresses AA,
GG, AG, GA (purine pairs) or CC, TT, TC, CT (pyrimidine pairs).
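The one-dimensional DNA walk is a running sum of +1 and -1 steps. A minimal sketch of the purine-pyrimidine rule follows; it is not the DNA Walker program, and the function name `dna_walk` is hypothetical.

```python
from itertools import accumulate

# Purine-pyrimidine rule from the text: A, G move +1; C, T move -1.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(sequence):
    """Return the cumulative position after each letter of the sequence."""
    return list(accumulate(STEP[base] for base in sequence.upper()))


walk = dna_walk("GGTAG")
print(walk)                    # [1, 2, 1, 2, 3]
print(max(walk) - min(walk))   # 2, the range covered by this short walk
```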
Then, how can this difference between the original and shuffled sequence be
numerically represented? Borovik et al. (1994) showed the presence of long-range
correlation in DNA sequences by R/S analysis of DNA walk.
[Figure 6: DNA walk plot. Vertical axis: X(l) (×10^2); horizontal axis: l (×10^4) (nucleotide distance); curves: shuffled sequence and original sequence.]
Fig.6. The plots of the DNA walk of the original sequence (bottom, red) and the shuffled sequence
(top, blue) (31,375bp) show cumulative movements as the sequence proceeds: +1 for A,G
(purines) and -1 for T,C (pyrimidines). For the original sequence, the maximum point is
at 225 and the minimum point is at -520 along the y-axis. For the shuffled sequence,
max: 250, min: -50.
R/S Analysis and Hurst Exponent
R/S analysis, or rescaled range analysis, was invented by Hurst, who spent
his lifetime studying the Nile River and water storage (Feder, 1988). Let ξ(t) be the
annual discharge of water from a dam in year t, and let X(t, τ) be the accumulated departure of
ξ(t) from the mean ⟨ξ⟩_τ over a period of τ years (τ = t2 – t1):
$$X(t,\tau) = \sum_{u=1}^{t}\{\xi(u) - \langle\xi\rangle_\tau\}. \qquad (1)$$
Then, the range R is the difference between the maximum and minimum amounts of
water contained in a sufficiently large dam that never empties or overflows (Fig.7). This
can be written as
$$R = \max_{1 \le t \le \tau} X(t,\tau) - \min_{1 \le t \le \tau} X(t,\tau). \qquad (2)$$
On the other hand, S, the standard deviation, is written as
$$S = \left[\frac{1}{\tau}\sum_{u=1}^{\tau}\{\xi(u) - \langle\xi\rangle_\tau\}^2\right]^{1/2}. \qquad (3)$$
Hurst found empirically, from data on natural phenomena such as river discharges,
lake levels, and rainfall, that there was a relation between the rescaled range, R/S, and an
exponent K, now called the Hurst exponent, H, such that
$$R/S = (\tau/2)^{H}. \qquad (4)$$
The data that Hurst collected from natural phenomena produced ~0.73 for the mean H. On
the other hand, data generated by a statistically independent process produced H=0.5.
Fig.7. a: Sketch of a reservoir with an influx of ξ(t). The range, R, is the difference
between the max. and the min. contents of the reservoir. b: Lake Albert annual discharge
ξ(t) (dotted line), and accumulated departure from the mean discharge, X(t) (solid line).
The range is indicated by R (Feder, 1988).
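The definitions of the range R and the standard deviation S can be turned into a short routine. The sketch below follows the accumulated-departure definition given above; the function name `rescaled_range` and the toy input are assumptions made for illustration.

```python
from math import sqrt


def rescaled_range(xi):
    """Return (R, S) for a series xi over its whole length tau = len(xi)."""
    tau = len(xi)
    mean = sum(xi) / tau
    # X(t, tau): accumulated departure from the mean, for t = 1..tau.
    departures, running = [], 0.0
    for value in xi:
        running += value - mean
        departures.append(running)
    r = max(departures) - min(departures)
    s = sqrt(sum((value - mean) ** 2 for value in xi) / tau)
    return r, s


print(rescaled_range([1, 1, 1, -1, -1, -1]))  # (3.0, 1.0)
```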
Fractional Brownian Motion
Introduced by Mandelbrot, fractional Brownian motion is a generalization of X(t)
obtained by changing H=1/2 to 0<H<1, where X(t), the position of a Brownian particle, is a
random function of time t (Feder, 1988). For
$$X(t) - X(t_0) \sim \xi\,|t - t_0|^{H},$$
H = ½ corresponds to ordinary Brownian motion, in which the displacement of the particle is
independent of previous displacements and is thus a random process. On the other hand, when
H > 1/2, the displacement of the particle is positively influenced by the displacement in the
past. That is, if the particle moved +ξ at step i, it tends to move +ξ as well at step i+1. If
the particle moved −ξ previously, it is likely to move −ξ in the next step. This type of
behavior is called persistence. For H < ½, we have antipersistence, where the particle
tends to move in the opposite direction from the previous displacement (Table 1).
Hurst exponent | character of particle movement | particle displacement at step i | particle displacement at step i+1 (average)
H > 1/2 | persistence | positive | positive
H > 1/2 | persistence | negative | negative
H = 1/2 | independence | no correlation (Brownian motion) | no correlation (Brownian motion)
H < 1/2 | antipersistence | positive | negative
H < 1/2 | antipersistence | negative | positive
Table 1. Properties of particle movement according to the value of Hurst exponent.
R/S Analysis of the DNA Sequence: Estimation of Hurst Exponent
Applying this concept to a DNA sequence, the Hurst exponent can be calculated from
the DNA walk. The same principle introduced earlier is applied to the DNA sequence. From (1),
$$X(s,l) = \sum_{u=1}^{s}\{\xi(u) - \langle\xi\rangle_l\},$$
where s is the position of a letter on a sequence l letters long, and
$$\langle\xi\rangle_l = \frac{1}{l}\sum_{s=1}^{l}\xi(s).$$
Calculated from Table 2, the sum of movements for the entire sequence of length l would
be
$$\sum_{s=1}^{l}\xi(s) = 9424 + 6371 - 9405 - 6175 = 215.$$
nucleotide | probability | number of occurrence (bp) | movement
A | 0.3004 | 9424 | +1
G | 0.2030 | 6371 | +1
T | 0.2998 | 9405 | -1
C | 0.1968 | 6175 | -1
Total | 1 | 31375 |
Table 2. Probability, number of occurrence (bp), and movement of each
nucleotide. Each probability is the number of occurrence divided by 31375.
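Then,
$$\langle\xi\rangle_l = \frac{1}{l}\sum_{s=1}^{l}\xi(s) = \frac{215}{31375} = 0.0068526 \approx 0; \qquad (5)$$
thus,
$$X(s,l) = \sum_{u=1}^{s}\{\xi(u) - \langle\xi\rangle_l\} \approx \sum_{u=1}^{s}\{\xi(u) - 0\} = \sum_{u=1}^{s}\xi(u). \qquad (6)$$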
From (2) and (3), with the corresponding change of symbols,
$$R(l) = \max_{1 \le s \le l} X(s,l) - \min_{1 \le s \le l} X(s,l), \qquad S = \left[\frac{1}{l}\sum_{s=1}^{l}\{\xi(s) - \langle\xi\rangle_l\}^2\right]^{1/2}. \qquad (7)$$
From (6),
$$R(l) = \max_{s}\sum_{u=1}^{s}\xi(u) - \min_{s}\sum_{u=1}^{s}\xi(u).$$
Since $\sum_{u=1}^{s}\xi(u)$ is the position of letter s along the y-axis, R(l) is equivalent to the
difference between the maximum point and the minimum point on the DNA walk; thus,
from Fig.6,
R(l) = 225-(-520) = 745.
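From (5) and (7), and because ξ(s)^2 = 1 for every letter,
$$S \approx \left[\frac{1}{l}\sum_{s=1}^{l}\{\xi(s) - 0\}^2\right]^{1/2} = \left[\frac{1}{l}\sum_{s=1}^{l}\xi(s)^2\right]^{1/2} = \left[\frac{1}{l}\times 31375\right]^{1/2} = \left[\frac{1}{31375}\times 31375\right]^{1/2} = 1. \qquad (8)$$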
From (4) and (8),
$$R/S = R(l) = \left(\frac{l}{2}\right)^{H}.$$
Consequently,
$$H = \frac{\log R(l)}{\log(l/2)} = \frac{\log 745}{\log(31375/2)} = 0.685.$$
For the shuffled sequence (Fig.6),
R(l)=250-(-50)=300.
Thus,
$$H = \frac{\log R(l)}{\log(l/2)} = \frac{\log 300}{\log(31375/2)} = 0.590.$$
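The two naive estimates can be reproduced with a one-line formula. The sketch below only re-does the arithmetic above with S = 1; the function name `naive_hurst` is hypothetical.

```python
from math import log


def naive_hurst(r, length):
    """Naive estimate H = log R / log(l/2), valid here because S = 1."""
    return log(r) / log(length / 2)


print(f"{naive_hurst(745, 31375):.3f}")  # 0.685 (original sequence)
print(f"{naive_hurst(300, 31375):.3f}")  # 0.590 (shuffled sequence)
```

The base of the logarithm does not matter, since it cancels in the ratio.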
In fact, estimation of the Hurst exponent by R/S analysis is more laborious than this
naïve estimation. The R/S value is calculated for segments of length l, l/2, l/4, …, l/2^n, and for each division
of l, the average R/S over the segments is calculated. Then a linear regression line is obtained by
plotting log(R/S) versus log l. The slope of this line is the estimated Hurst
exponent (Kaplan, 2003).
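The halving procedure can be sketched in plain Python. This is an illustration of the general idea only, not the algorithm implemented in SELFIS or described by Kaplan; the minimum block length of 8, the handling of constant blocks, and the function names are assumptions.

```python
import random
from math import log, sqrt


def rs_value(xi):
    """R/S of one block; returns 0.0 when the block has no variation."""
    n = len(xi)
    mean = sum(xi) / n
    cumulative, total = [], 0.0
    for value in xi:
        total += value - mean
        cumulative.append(total)
    s = sqrt(sum((value - mean) ** 2 for value in xi) / n)
    return (max(cumulative) - min(cumulative)) / s if s > 0 else 0.0


def hurst_rs(xi, min_len=8):
    """Slope of log(mean R/S) versus log(block length) over halved block sizes."""
    n, size = len(xi), len(xi)
    xs, ys = [], []
    while size >= min_len:
        blocks = [xi[i:i + size] for i in range(0, n - size + 1, size)]
        values = [v for v in map(rs_value, blocks) if v > 0]
        if values:
            xs.append(log(size))
            ys.append(log(sum(values) / len(values)))
        size //= 2
    # Ordinary least-squares slope (needs at least two block sizes).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


rng = random.Random(1)
steps = [rng.choice((-1, 1)) for _ in range(4096)]  # uncorrelated +1/-1 steps
print(hurst_rs(steps))  # expected to lie in the neighbourhood of 0.5
```

For short blocks, R/S estimates of uncorrelated data tend to come out somewhat above 0.5, which is one reason a shuffled sequence need not give exactly H = 1/2.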
Instead of calculating every step manually, Hurst exponent was automatically
estimated by SELFIS (SELF-similarity analysis) program (see reference). The input data
was modified from letter sequence to a number sequence where purines (A,G) were
converted to 1, and pyrimidines (T,C) were converted to -1. Consequently, the original
sequence produced H=0.639 whereas the shuffled sequence had H = 0.553, which shows
that the naïve estimates were somewhat too high, yet of similar magnitude (Fig.8).
Fig.8. Estimation of Hurst exponent for 31375bp DNA sequence of serotonin receptor 2A
region by SELFIS. a: H = 0.639 for the original sequence, b: H = 0.553 for the shuffled
sequence.
These values suggest that persistence exists in the original sequence at a
greater level than in the shuffled sequence. H of the shuffled sequence is closer to the
theoretical value H=1/2 for ordinary Brownian motion. This observation is relevant to
both the DNA walk and the chaos game. Constant downward or upward displacement over a long
range of the sequence in the DNA walk and the diagonal lines in the chaos game can be
explained by persistence, which is strongly supported by the numerical value of H>1/2. Also,
these values are comparable with the published H values for DNA sequences (Table 3).
Sequences of random characters show H closer to ½.
sequence | H | sequence in comparison | H | references
human beta-cardiac myosin heavy chain gene | 0.67 | human beta-cardiac myosin heavy chain cDNA | 0.49 | Peng et al., 1992
human beta globin purine-pyrimidine representation | 0.708 | human beta globin (A,C)-(G,T) representation | 0.515 | Borovik et al., 1994
synthetic model sequence | 0.655 | random noncorrelated sequence | 0.517 | Borovik et al., 1994
human serotonin receptor 2 gene | 0.639 | human serotonin receptor 2 gene, shuffled | 0.553 | this report
Table 3. Comparison between H values published for various sequences and H value
measured in this report.
III. COORDINATING THE DATA FROM CHAOS GAME AND DNA WALK
Expression of Game Points in the Chaos Game
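Iterative functions for the DNA chaos game can be written separately for each
nucleotide:
$$T:\; w_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
$$A:\; w_2 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 0 \\ 1/2 \end{pmatrix}$$
$$C:\; w_3 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$$
$$G:\; w_4 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 1/2 \\ 0 \end{pmatrix}$$
In general,
$$w_i = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_i \\ b_i \end{pmatrix},$$
where
$$\begin{pmatrix} a_i \\ b_i \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \text{ for } i = 1\ (T), \quad \begin{pmatrix} 0 \\ 1 \end{pmatrix} \text{ for } i = 2\ (A), \quad \begin{pmatrix} 1 \\ 1 \end{pmatrix} \text{ for } i = 3\ (C), \quad \begin{pmatrix} 1 \\ 0 \end{pmatrix} \text{ for } i = 4\ (G).$$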
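At the kth step, the game point zk can be expressed as
$$\begin{pmatrix} x_k \\ y_k \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_{k-1} \\ y_{k-1} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_k} \\ b_{i_k} \end{pmatrix}.$$
Hence,
$$z_k = w_k(w_{k-1}(\cdots(w_2(w_1(z_0)))\cdots)).$$
For k = 1,
$$\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_0 \\ y_0 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix}, \quad \text{where } \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}.$$
Thus,
$$\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{pmatrix} 1/2^2 \\ 1/2^2 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix}. \qquad (9)$$
For k = 2,
$$\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\left[\begin{pmatrix} 1/2^2 \\ 1/2^2 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix}\right] + \frac{1}{2}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix} = \begin{pmatrix} 1/2^3 \\ 1/2^3 \end{pmatrix} + \frac{1}{2^2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix}.$$
For k = n,
$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} 1/2^{n+1} \\ 1/2^{n+1} \end{pmatrix} + \frac{1}{2^n}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix} + \frac{1}{2^{n-1}}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix} + \cdots + \frac{1}{2^2}\begin{pmatrix} a_{i_{n-1}} \\ b_{i_{n-1}} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_n} \\ b_{i_n} \end{pmatrix}$$
$$= \begin{pmatrix} \dfrac{1}{2^{n+1}} + \dfrac{1}{2^n}a_{i_1} + \dfrac{1}{2^{n-1}}a_{i_2} + \cdots + \dfrac{1}{2^2}a_{i_{n-1}} + \dfrac{1}{2}a_{i_n} \\[2ex] \dfrac{1}{2^{n+1}} + \dfrac{1}{2^n}b_{i_1} + \dfrac{1}{2^{n-1}}b_{i_2} + \cdots + \dfrac{1}{2^2}b_{i_{n-1}} + \dfrac{1}{2}b_{i_n} \end{pmatrix} = \begin{pmatrix} 0.a_{i_n}a_{i_{n-1}}\ldots a_{i_2}a_{i_1}1 \\ 0.b_{i_n}b_{i_{n-1}}\ldots b_{i_2}b_{i_1}1 \end{pmatrix}, \qquad (10)$$
where the last expression is written in binary, $a_{i_k}, b_{i_k} \in \{0,1\}$, and i = 1, 2, 3, 4.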
Therefore, a game point zk can be represented by a binary expansion. This means
that the (x,y) coordinates of a game point can be converted into binary expansions,
which is useful in finding the corresponding location of the game point on the DNA walk
(see below). Expression (10) is also compatible with s_k s_{k-1} … s_2 s_1, the address
assigned to zk. Each s_n can be represented by the pair (a_{i_n}, b_{i_n}), which parallels the expression
in (10) when read from n down to 1.
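The binary-expansion form can be checked numerically with exact fractions. The sketch below compares the iterated midpoint rule with the digit string of (10) for one example sequence; the function names are hypothetical and the check is an illustration, not a proof.

```python
from fractions import Fraction

# (a_i, b_i) pairs for each letter, as defined for the functions w_i.
BITS = {"T": (0, 0), "A": (0, 1), "C": (1, 1), "G": (1, 0)}


def cgr_point(sequence):
    """Iterate the midpoint rule exactly, starting from z0 = (1/2, 1/2)."""
    x = y = Fraction(1, 2)
    for base in sequence:
        a, b = BITS[base]
        x, y = (x + a) / 2, (y + b) / 2
    return x, y


def binary_expansion_point(sequence):
    """Build 0.a_n a_(n-1) ... a_1 1 and 0.b_n ... b_1 1 as exact fractions."""
    def value(digits):
        return sum(Fraction(d, 2 ** (k + 1)) for k, d in enumerate(digits))
    # Most recent letter first, then the trailing 1 contributed by z0.
    a_digits = [BITS[base][0] for base in reversed(sequence)] + [1]
    b_digits = [BITS[base][1] for base in reversed(sequence)] + [1]
    return value(a_digits), value(b_digits)


seq = "ATGCGAGT"
print(cgr_point(seq) == binary_expansion_point(seq))  # True
```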
Expression of Position of DNA Walk
Recalling (9), ξ(s), the displacement at s, can be written as
$$\xi^*(s) = |a_i - b_i|^* = \begin{cases} |a_i - b_i| - 1 = -1, & \text{if } a_i = b_i \\ |a_i - b_i| = 1, & \text{if } a_i \ne b_i. \end{cases}$$
Thus, the position of s is equivalent to
$$X(s_k) = \sum_{u=1}^{s}\xi^*(u) = |a_i - b_i|^*_1 + |a_i - b_i|^*_2 + \cdots + |a_i - b_i|^*_k = n(a_i \ne b_i) - n(a_i = b_i), \qquad (11)$$
where n(Y) denotes the number of cases of Y.
Implications
Since a game point can be represented by x,y coordinates in binary expansion,
each point gives a value for X(s_k) by counting n(a_i = b_i) and n(a_i ≠ b_i). This
integer value can then be looked up on the plot of the DNA walk to find the corresponding s.
Unfortunately, this process is not straightforward because there can be more than one, in
fact many, positions s that have the same X(s_k). One-to-one projection from a single point
on the chaos game to a single s on the DNA walk is difficult because the chaos game is two-dimensional while the DNA walk is one-dimensional.
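The ambiguity can be seen on a short toy sequence: the displacement rule based on (a_i, b_i) reproduces the DNA walk, and the same walk value is reached at several positions. The sequence and helper names below are invented for illustration.

```python
# (a_i, b_i) pairs for each letter.
BITS = {"T": (0, 0), "A": (0, 1), "C": (1, 1), "G": (1, 0)}


def displacement(base):
    """+1 when a_i != b_i (purines A, G), -1 when a_i == b_i (pyrimidines T, C)."""
    a, b = BITS[base]
    return 1 if a != b else -1


def walk(sequence):
    """Cumulative position X after each letter."""
    total, positions = 0, []
    for base in sequence:
        total += displacement(base)
        positions.append(total)
    return positions


def positions_with_value(sequence, value):
    """All 1-based positions s whose walk value equals `value`."""
    return [s for s, x in enumerate(walk(sequence), start=1) if x == value]


print(walk("GGTAGTCAG"))                     # [1, 2, 1, 2, 3, 2, 1, 2, 3]
print(positions_with_value("GGTAGTCAG", 1))  # [1, 3, 7]
```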
The reasoning that the constant downward or upward displacement of letters on
the DNA walk is associated with the diagonal lines from the chaos game can be
shown explicitly by the following approach, even though it seems obvious. First, the points
concentrated on the diagonal lines are converted to binary expansions and X(s_k) is
computed accordingly. Intuitively, the values would be largely negative
or positive since the points are near the lines y = x or y = -x+1. For example, a game
point positioned at (0.101010…, 0.101010…) lies on y = x and has n(a_i = b_i) → ∞ while
n(a_i ≠ b_i) = 0, so X(s_k) is largely
negative (11). A game point at (0.101010…, 0.010101…) lies on y = -x+1 and has n(a_i ≠ b_i) → ∞, so
n(a_i ≠ b_i) ≫ n(a_i = b_i) and X(s_k) is largely positive (11). The larger |X(s_k)| is, the
fewer corresponding s will be found on the DNA walk, simply because a
position X(s_k) farther from the x-axis is less likely to be reached at other s's. For instance,
from Fig.6, there is only one s for the maximum X(s_k) and one for the minimum X(s_k).
Consequently, the average |X(s_k)| obtained from points on the chaos game for the
shuffled sequence would be lower than that for the original sequence. Although this is
already implied by the R/S analysis, relating the binary expansion and X(s_k) provides
another perspective for viewing the different methods as a whole.
DISCUSSION
The 1x1 square is divided N times, resulting in 4^N sub-squares, which we call quadrants
q_ij. Suppose the square is divided 5 times. There would be 4^5 squares and 2^5
subsections each on the x-axis and y-axis. We choose q_ij with i = 20, j = 9, for example (Fig.9).
This quadrant has the address GATGG, which is s_k s_{k-1} s_{k-2} s_{k-3} s_{k-4}. Thus, the game points
located within q_ij (i = 20, j = 9) have addresses GATGG s_{k-5} … s_2 s_1, where 5 ≤ k ≤ 31375 in the case of
this report. In an alternative view, the points can be located along the sequence where a
fragment of the sequence ends with GGTAG, the reverse of the address (Fig.9). Therefore,
if there are n points in the quadrant q_ij (i = 20, j = 9), there will also be n segments of the sequence
which end with GGTAG.
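The address of a sub-square can be read off from the binary digits of its indices. The sketch below assumes that section i covers the interval from (i-1)/2^N to i/2^N along each axis, which reproduces the GATGG example; the function name `address_of` is hypothetical.

```python
# Letter for each (a, b) digit pair: x digit a, y digit b.
LETTER = {(0, 0): "T", (0, 1): "A", (1, 1): "C", (1, 0): "G"}


def address_of(i, j, divisions):
    """Address (most recent letter first) of sub-square q_ij, 1-based indices."""
    x_bits = format(i - 1, f"0{divisions}b")  # binary digits of the x section
    y_bits = format(j - 1, f"0{divisions}b")  # binary digits of the y section
    return "".join(LETTER[(int(a), int(b))] for a, b in zip(x_bits, y_bits))


address = address_of(20, 9, 5)
print(address)        # GATGG
print(address[::-1])  # GGTAG, the fragment as it appears along the sequence
```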
[Figure 9: the unit square with corners A, C, T, G and grid marks at i=16, i=24 and j=16, showing sub-square q_ij (i = 20, j = 9); beside it, a sequence of length l starting at s1 with fragments …GGTAG marked along it.]
Fig.9. The square is subdivided 5 times resulting in 4^5 sub-squares. The x, y-axis can be
divided into 32 sections: i = 1~32 for x-axis, and j = 1~32 for y-axis. q_ij (i = 20, j = 9) has address
GATGG…, which is GGTAG along the sequence. The location of n game points
positioned in q_ij (i = 20, j = 9) can be found n times along the sequence. The length of address for
each game point in q_ij (i = 20, j = 9) varies.
Accordingly, GGTAG gives +1,+1,-1,+1,+1 displacement on the DNA walk.
However, a major confounding problem is that there are 2^5 different combinations of
letters which result in the same displacement. This might be solved by applying a two-dimensional DNA walk, which is a reasonable candidate for future study. If a two-dimensional DNA walk were used, the position of GGTAG could be located along the
sequence, presumably without the confounding factor. The positions may reveal
periodicity of certain fragments in the DNA sequence. The higher the density of q_ij, the
higher the frequency of the sequence fragment specific to that q_ij. It is worth
recalling that the value of the chaos game is that it enables easy comparison of the frequency
of every combination of the four letters. For instance, for N=5, where N is the number of
divisions, the frequencies of 4^5 different fragments 5 letters long can be obtained
simultaneously.
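Counting the points in every sub-square is equivalent to counting every fragment of N letters, and the positions where a given fragment ends can be listed directly from the sequence. The sketch below uses an invented toy sequence; the helper names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Frequency of every fragment of k letters (the density of each q_ij)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def fragment_end_positions(sequence, fragment):
    """1-based positions s at which `fragment` ends along the sequence."""
    k = len(fragment)
    return [i + k for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] == fragment]


toy = "TTGGTAGCCGGTAGA"  # toy sequence, not the HTR2A data
print(kmer_counts(toy, 5)["GGTAG"])          # 2
print(fragment_end_positions(toy, "GGTAG"))  # [7, 14]
```

The spacing between successive end positions (here 7 letters) is the kind of information that could reveal periodicity of a fragment.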
Furthermore, the smaller the quadrant, the longer the fragments of DNA sequence
that can be located along the given sequence. Tracing the letters on the plot of the DNA walk
corresponding to q_ij on the chaos game can be repeated for every ij, and for every
N=1,2,…,k. This can provide not only a gross but also a detailed look into the profile of a
DNA sequence, such as answering what fragments occur where and how frequently. This
approach, which comes from merging the chaos game and the DNA walk, offers
an insight into developing an algorithm for software that can detect unknown
nucleotide repeats in an input sequence (of course, locating known nucleotide repeats is
easy!). The ability to search for nucleotide repeats of many different lengths might be relevant in
biological and medical studies, such as finding new transposable elements.
CONCLUSIONS
The serotonin receptor 2A gene (31375bp) was analyzed by using the chaos game and
fractional Brownian motion, or DNA walk. As a result, CG depletion and purine-purine,
pyrimidine-pyrimidine association were observed and explained through computer
simulation or mathematical reasoning. Explanation of the patterns observed in the chaos
game and the DNA walk was facilitated by understanding both methods together. The
one-dimensional DNA walk created a time-series plot of the DNA sequence and enabled
estimation of the Hurst exponent, which led to the finding of persistence in the sequence.
However, coordinating the data from the chaos game and the one-dimensional DNA walk is
limited, mainly because the two methods have different dimensions. Application of a two-dimensional DNA walk might solve this problem, and is recommended for a future study.
ACKNOWLEDGEMENTS
I would like to thank Dr. Randall Pyke for encouragement and helpful discussions,
Joseph Mocanu for solving technical problems regarding computer programs, and
Thomas Karagiannis for guide to his SELFIS program.
REFERENCES
Software
Chaos game, modified chaos game applets http://www.math.toronto.edu/courses/335/
Clustal W http://clustalw.genome.ad.jp/
Dnacgr 2.0 (Chaos Game Representation of DNA): Indraneel Majumdar, 2000
bioinformatics.org/cgi-bin/cvsweb.cgi/dnacgr/
DNA Walker: Department of Biochemistry and Microbiology, University of Victoria,
2003 http://athena.bioc.uvic.ca/pbr/walk
SELFIS (SELF similarity analysIS): Thomas Karagiannis, University of California at
Riverside, 2001
http://www.google.ca/search?q=cache:kcabiNQzLLkC:www.cs.ucr.edu/~tkarag/Selfis/Se
lfis.html+hurst+exponent+download+download+-benoit+-order+filetype:pdf&hl=en&ie=UTF-8
Shuffle DNA http://www.gchelpdesk.ualberta.ca/downloads/shuffle_dna.html
Data source
Serotonin receptor 2A gene (HTR2A) sequence: GenBank
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NT_024524.12&from=15982005&t
o=16044755&txt=on&view=fasta
Sequence information
http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=3356
Literature Cited
Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., and Fletcher, M. (2001). Analysis
of genomic sequence by Chaos Game Representation. Bioinformatics, 17: 429-437
Borovik, A.S., Grosberg, A.Y., and Frank-Kamenetskii, M.D.(1994). Fractality of DNA
texts. J. Biomol. Structure & Dynamics, 12: 655-669
Chatzidimitriou-Dreismann, C.A., Friedrich Streffer, R.M., and Larhammar, D. (1994).
Variations in base pair composition and associated long-range correlations in DNA
sequences – computer simulation results. Biochimica et Biophysica Acta, 1217: 181-187
Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., and Fertil, B.(1999). Genomic
signature: characterization and classification of species assessed by chaos game
representation of sequences. Mol.Biol.Evol., 16(10):1391-1399
Feder, J. (1988). Fractals. Plenum Press, NY & London
Jeffrey, H.J. (1990). Chaos game representation of gene structure. Nuc.Acids Res.,
18:2163-2170
Kaplan, I. (2003) http://www.bearcave.com/misl/misl_tech/wavelets/hurst/
Mandelbrot, B.B. (1982). The Fractal Geometry of Nature, Freeman & Co., New York
Peng, C.K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., and
Stanley, H.E. (1992). Long-range correlations in nucleotide sequences. Nature, 356: 168-170
Solovyev, V.V. (1993). Fractal graphical representation and analysis of DNA and protein
sequences. Bio.Systems, 30: 137-160
Voss, R.F. (1992). Evolution of long-range fractal correlation and 1/f noise in DNA
sequences. Phy.Rev.Let., 68: 3805-3808