An algorithm for creating images of genomic sequences using
symmetrical characteristics of DNA sequences and non-exact
distances.
Leontiev A.Yu., Tarasov D.S.
Keywords: Applications, Science, genomic texts, artificial intelligence.
CONTENTS.
ABSTRACT
INTRODUCTION
ALGORITHM’S BLOCKS DESCRIPTION
CONCLUSIONS
REFERENCES
Abstract
A new method for creating images of DNA functional sites is presented. It takes
into consideration the presence of symmetries, non-exact distances between functional
sites, and possible substitutions in synonymous fragments. The suggested algorithm is
based on a simulation of human thinking.
The image is intended for the recognition of functional sites and can be used for
analyzing functional site structure.
Introduction
Recognition of functional sites in DNA texts is one of the important problems of
contemporary molecular biology. Since the first DNA sequences became available, many
algorithms have been suggested to analyze and recognize functional sites and coding
sequences. These include methods based on statistics [2], [3], mathematical
linguistics [4], and others.
A new method, based on symmetrical characteristics of DNA sequences, was suggested in
1992 by Leontiev [1]. Although the biological significance of some types of symmetry
is still unknown, this method was successfully used for analyzing prokaryotic
replication origins [5]. It was shown that symmetrical structures in genomic sequences
are significant for their function [1]. Images of genomic sequences formed using
symmetries are more compact and suitable for automatic recognition.
In this work, we present an algorithm which automatically creates images of
DNA fragments. It uses symmetrical characteristics of genetic texts, combined with
words that are not part of symmetrical structures. It also takes into consideration
distances between words and possible substitutions in symmetrical fragments.
The image is created as a sequence of words with additional information (positions,
number of occurrences, symmetry, etc.). This type of organization helps us to combine
the advantages of the symmetrical method, which is context-free and non-statistical,
with trivial comparison of sequences. Such a system allows us to create the fullest
images.
The basic scheme of the algorithm is shown in Picture 1. The algorithm consists of
several independent blocks, which can be easily modified according to specific tasks.
The first blocks compare a number of sequences in order to find similarities. The final
blocks analyze these similarities and construct the image.
Picture 1. Basic scheme of the algorithm. Blocks shown: searching for similar words of
fixed length; restoration of original words; image creation block.
Algorithm’s blocks description.
The first block is called “The block for searching shared words with fixed length”. It
compares a number of genomic sequences and finds all similar words. The level of
similarity of two words is computed as follows:
$$S = \frac{2n}{l_1 + l_2} \cdot k$$
where S is the level of word similarity, n is the number of shared letters, $l_1$ is the
length of the first word, $l_2$ is the length of the second word, and k is the symmetry
weight.
The symmetry weight is needed when two symmetrical words are compared. A
different symmetry weight is assigned to each type of symmetry in order to alter the
similarity level.
Words are considered similar if their similarity level is higher than some
constant threshold.
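The similarity level can be illustrated with a minimal Python sketch. It assumes that n counts the positions at which the two words carry the same letter when aligned at their first base; the paper does not state how shared letters are counted. The function name `similarity` and the threshold value are hypothetical.

```python
def similarity(word1: str, word2: str, k: float = 1.0) -> float:
    """Similarity level S = 2*n / (l1 + l2) * k.

    Assumption: n is the number of positions where both words have the
    same letter when aligned at their first base.
    """
    if not word1 or not word2:
        return 0.0
    n = sum(a == b for a, b in zip(word1.upper(), word2.upper()))
    return 2 * n / (len(word1) + len(word2)) * k


# Hypothetical threshold; the paper only says "some constant threshold".
THRESHOLD = 0.8

s = similarity("ACGTAC", "ACGTTC")  # five of six positions agree
print(round(s, 3), s > THRESHOLD)
```

With k = 1 and words of equal length, S reduces to the fraction of matching positions, so a threshold of 0.8 on hexanucleotides tolerates one substitution.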
Searching for similar words of various lengths with substitutions and symmetrical
conversions can be a time-consuming task. To improve speed, we use pre-generated
words of fixed length (for example, 6 bases). Because the DNA alphabet has only four
letters, there are only a few such words (4^6 = 4096 in the case of hexanucleotides),
so comparing these words with the words in the sequences reduces comparison time. The
output of the block is the positions of similar words.
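A rough sketch of this block in Python is given below. It pre-generates all words of a fixed length and records where each of them has a similar window in every sequence. Symmetrical conversions are left out and the symmetry weight is taken as k = 1; the function names and the default threshold are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def all_words(length: int = 6) -> list[str]:
    """Pre-generate every word of the given length (4**length words)."""
    return ["".join(p) for p in product(ALPHABET, repeat=length)]


def similar_positions(sequence: str, word: str, threshold: float) -> list[int]:
    """Start positions of windows similar to `word`.

    For equal lengths and k = 1, S = 2n / (l1 + l2) reduces to n / l.
    """
    size = len(word)
    hits = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        n = sum(a == b for a, b in zip(window, word))
        if n / size > threshold:
            hits.append(start)
    return hits


def shared_word_positions(sequences, length=6, threshold=0.8):
    """Map each pre-generated word to its hit positions per sequence,
    keeping only words that have at least one hit in every sequence."""
    result = {}
    for word in all_words(length):
        per_sequence = [similar_positions(s, word, threshold) for s in sequences]
        if all(per_sequence):
            result[word] = per_sequence
    return result


shared = shared_word_positions(["TTACGTACTT", "GGACGTACGG"])
print(shared["ACGTAC"])
```

The loop over 4096 words is fixed in size, so the cost grows only with the total length of the sequences.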
The second block is called “words restoration”. It uses the positions generated by
block 1 to restore the longer words. This way of processing is chosen because it is
faster than successively searching for words of all lengths.
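The paper does not describe how the longer words are restored. One plausible reading, shown here purely as an assumption, is that overlapping or adjacent fixed-length hits are merged into a single longer word. The name `restore_words` is hypothetical.

```python
def restore_words(sequence: str, positions: list[int], size: int = 6) -> list[tuple[int, str]]:
    """Merge overlapping or adjacent fixed-length hits into longer words.

    Assumption: a longer word is the union of hits whose windows touch
    or overlap; the paper does not specify the restoration rule.
    Returns (start, word) pairs in order of position.
    """
    restored = []
    start = end = None
    for pos in sorted(positions):
        if start is None:
            start, end = pos, pos + size
        elif pos <= end:
            # This hit touches or overlaps the current word: extend it.
            end = max(end, pos + size)
        else:
            restored.append((start, sequence[start:end]))
            start, end = pos, pos + size
    if start is not None:
        restored.append((start, sequence[start:end]))
    return restored


print(restore_words("AACGTACGTTTTTTTTGGGCCC", [1, 2, 3, 16]))
```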
The third block computes the words which are put into the image. One word should be
generated from all similar words, and this word should be the closest to all of them.
This means that the word which is put into the image should require the minimum number
of substitutions to produce any word from the list of similar words. Such a word may be
absent from this list and must then be generated. The words generated at this step are
placed into the shared dictionary. Using words of this type in the image reduces the
image size.
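For words of equal length, one simple way to build such a word is a per-position majority vote, which minimizes the total number of substitutions summed over the list. The sketch below assumes that reading of "closest to all of them"; minimizing the worst-case distance instead would need a different procedure. The function names are hypothetical.

```python
from collections import Counter


def consensus_word(words: list[str]) -> str:
    """Per-position majority vote over equal-length words.

    Assumption: "closest" means the smallest total number of
    substitutions over all words in the list.
    """
    if not words:
        raise ValueError("need at least one word")
    if len({len(w) for w in words}) != 1:
        raise ValueError("this sketch only handles words of equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*words))


def substitutions(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


similar = ["AAGT", "ACGA", "TCCT"]
word = consensus_word(similar)
print(word, word in similar, [substitutions(word, w) for w in similar])
```

In this example the generated word is not present in the input list, which matches the remark that the word may have to be generated.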
The last block creates the image itself. This block simulates human thinking. In
order to create the image, we need to define relations between the locations of all
words from the shared dictionary. This can be done in two ways.
In the first case, a counting point can be located in the sequence. Using information
about it reduces the number of calculations. All words can then be classified by their
distance from that point. Only the words that are located at similar positions in all
sequences can be placed into the image. Therefore, we compute the level of distance
similarity from the counting point to the word:
$$C = \frac{\max |R_a - R|}{R_a} \cdot 100\%$$
where C is the level of distance similarity, $R_a$ is the average distance of the similar
words from the counting point, and R is the distance of the word under consideration.
If C is not higher than some constant threshold, the word is placed into the image
with the average distance.
The image then takes the following form: the words $Wrd_i$ have a similarity level
with a word in the sequence of not less than M and are located at the distance
$R_i \pm RDEV_i$ from the counting point, where $RDEV = \max |R_a - R|$.
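The counting-point case can be sketched as follows. The code computes the average distance, the largest deviation from it, and the distance similarity level as a percentage, then builds one image entry. It assumes the deviation is taken as an absolute value and that each word has one distance per sequence; the threshold value and the dictionary layout of an entry are hypothetical.

```python
# Hypothetical threshold in percent; the paper only says "some constant threshold".
C_THRESHOLD = 10.0


def distance_similarity(distances: list[float]) -> tuple[float, float, float]:
    """Return (Ra, RDEV, C) for one word.

    `distances` holds the distance from the counting point to the word,
    one value per sequence. Assumption: the deviation is absolute.
    """
    ra = sum(distances) / len(distances)
    if ra == 0:
        raise ValueError("average distance is zero; C is undefined")
    rdev = max(abs(ra - r) for r in distances)
    return ra, rdev, rdev / ra * 100.0


def image_entry(word: str, distances: list[float], min_similarity: float):
    """Build an image entry, or return None if the positions vary too much."""
    ra, rdev, c = distance_similarity(distances)
    if c > C_THRESHOLD:
        return None
    return {"word": word, "M": min_similarity, "R": ra, "RDEV": rdev}


print(image_entry("ACGTAC", [96, 100, 104], 0.8))
print(image_entry("ACGTAC", [50, 100, 150], 0.8))
```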
In the second case, there may be no counting point in the sequences, so relative
distances should be calculated and placed into the image. A pair of words with similar
distances between them in all sequences is placed into the image. The image then takes
the following form: $Wrd_i$ has a similarity level with a word in the sequence of not
less than M and is located at the distance $R_i \pm RDEV_i$ from $Wrd_j$.
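A sketch of the second case follows, under the assumption that each dictionary word has exactly one position per sequence and that the same percentage criterion is applied to the distances between two words. The function name, the input layout and the threshold are hypothetical.

```python
from itertools import combinations


def pair_entries(positions: dict[str, list[int]], c_threshold: float = 10.0):
    """Find word pairs whose mutual distance is stable across sequences.

    `positions` maps each dictionary word to its position in every
    sequence (one position per sequence, an assumption of this sketch).
    Returns tuples (Wrd_i, Wrd_j, R, RDEV), where R is the signed average
    distance from Wrd_j to Wrd_i.
    """
    entries = []
    for wi, wj in combinations(sorted(positions), 2):
        gaps = [pi - pj for pi, pj in zip(positions[wi], positions[wj])]
        ra = sum(gaps) / len(gaps)
        if ra == 0:
            continue  # the relative deviation is undefined
        rdev = max(abs(ra - g) for g in gaps)
        if rdev / abs(ra) * 100.0 <= c_threshold:
            entries.append((wi, wj, ra, rdev))
    return entries


example = {
    "ACGTAC": [10, 12, 11],
    "GGGCCC": [50, 52, 51],
    "TTTAAA": [5, 40, 90],
}
print(pair_entries(example))
```

Only the pair whose spacing is the same in all three sequences survives; the third word drifts too much relative to either of the others.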
All types of images created by this algorithm can be easily used for the recognition
of any type of functional region in genetic texts. The only thing we need to do is to
check whether the conditions of the image correspond to the sample genomic sequence.
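The check itself can be sketched for the counting-point form of the image. The code assumes that an image is a list of entries with a word, a minimum similarity M, a distance R and a deviation RDEV, that distances are measured to the right of the counting point, and that the symmetry weight is 1. All names are hypothetical.

```python
import math


def window_similarity(window: str, word: str) -> float:
    """S = 2n / (l1 + l2) with symmetry weight k = 1 (an assumption)."""
    n = sum(a == b for a, b in zip(window, word))
    return 2 * n / (len(window) + len(word))


def matches_image(sequence: str, counting_point: int, image: list[dict]) -> bool:
    """True if every image entry is satisfied by the sample sequence.

    An entry is satisfied when some window starting within R +/- RDEV of
    the counting point has similarity of at least M to the entry's word.
    """
    for entry in image:
        word = entry["word"]
        lo = max(math.ceil(counting_point + entry["R"] - entry["RDEV"]), 0)
        hi = min(math.floor(counting_point + entry["R"] + entry["RDEV"]),
                 len(sequence) - len(word))
        found = any(
            window_similarity(sequence[start:start + len(word)], word) >= entry["M"]
            for start in range(lo, hi + 1)
        )
        if not found:
            return False
    return True


image = [
    {"word": "ACGTAC", "M": 0.8, "R": 4, "RDEV": 1},
    {"word": "GGGCCC", "M": 0.8, "R": 14, "RDEV": 2},
]
print(matches_image("TTTTACGTACTTTTGGGCCCTT", 0, image))
```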
Conclusions
The suggested algorithm can automatically create images of DNA text fragments.
The images created by this algorithm are more compact and can unite a wider range of
objects than manually created images. Using symmetries reduces the number of
sequences required for image creation; it also reduces the image size. The described
algorithm provides correct processing of inexact distances between functional fragments
and substitutions in synonymous genetic texts.
References
1. Leontiev A.Yu. Symmetry of single chain DNA molecules // Biophysics. 1992. Vol. 37,
N 5. P. 771-774.
2. Shulman M.J., Steinberg C.M., Westmoreland N. The coding function of nucleotide
sequences can be described by statistical analysis // J. Theor. Biol. 1981. Vol. 88.
P. 409-420.
3. Fichant G., Gautier C. Statistical method for predicting protein coding regions in
nucleic acid sequences // Comput. Appl. Biosci. 1987. Vol. 3. P. 287-295.
4. Brendel V., Trifonov E.N. A computer algorithm for testing potential prokaryotic
terminators // Nucl. Acids Res. 1984. Vol. 12, N 10. P. 4411-4427.
5. Akberova N.I., Leontiev A.Yu. Symmetrical structure in genetic texts of prokaryotes
DNA replication origins // NetSci. 1996. Vol. 2, March (on-line).
http://www.awod.com/netsci/Issues/Oct95/feature4.html