An algorithm for creating images of genomic sequences using symmetrical characteristics of DNA sequences and non-exact distances

Leontiev A.Yu., Tarasov D.S.

Keywords: Applications, Science, genomic texts, artificial intelligence.

Contents
Abstract
Introduction
Algorithm's blocks description
Conclusions
References

Abstract

A new method for creating images of DNA functional sites is presented. It takes into account the presence of symmetries, non-exact distances between functional sites, and possible substitutions in synonymous fragments. The suggested algorithm is based on a simulation of human thinking. The image is intended for recognition of functional sites and can be used for analyzing functional site structure.

Introduction

Recognition of functional sites in DNA texts is one of the important problems of contemporary molecular biology. Since the first DNA sequences became available, many algorithms have been suggested to analyze and recognize functional sites and coding sequences. These include methods based on statistics [2], [3], mathematical linguistics [4], and others.
A new method based on symmetrical characteristics of DNA sequences was suggested in 1992 by Leontiev [1]. Although the biological significance of some types of symmetries is still unknown, this method was successfully used for analyzing prokaryotic replication origins [5]. It was shown that symmetrical structures in genomic sequences are significant for their function [1]. The images of genomic sequences formed using symmetries are more compact and suitable for automatic recognition.

In this work, we present an algorithm that automatically creates images of DNA fragments. It uses symmetrical characteristics of genetic texts, combined with words that are not part of symmetrical structures. It also takes into account distances between words and possible substitutions in symmetrical fragments. The image is created as a sequence of words with additional information (positions, number of occurrences, symmetry, etc.). This type of organization helps us to combine the advantages of the symmetrical method, which is context-free and non-statistical, with trivial comparison of sequences. Such a system allows us to create the most complete images.

The basic scheme of the algorithm is shown in Picture 1. The algorithm consists of several independent blocks, which can be easily modified according to specific tasks. The first blocks compare a number of sequences in order to find similarities. The final blocks analyze these similarities and construct the image.

[Picture 1. Scheme of the algorithm: searching similar words with fixed length; restoration of original words; image creation block.]

Algorithm's blocks description

The first block is called "The block for searching shared words with fixed length". It compares a number of genomic sequences and finds all similar words. The level of similarity of two words is computed as follows:

$$S = \frac{2 n k}{l_1 + l_2}$$

where $S$ is the level of word similarity, $n$ is the number of shared letters, $l_1$ is the length of the first word, $l_2$ is the length of the second word, and $k$ is the symmetry weight.
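The similarity level defined above can be illustrated with a minimal sketch. It reads the formula as S = 2nk / (l1 + l2). The function names `similarity` and `is_similar`, the default threshold of 0.8, and the way shared letters are counted (position by position, with both words aligned at their first letter) are assumptions made for illustration only; the text does not fix them.

```python
def similarity(word1: str, word2: str, k: float = 1.0) -> float:
    """Similarity level S = 2 * n * k / (l1 + l2).

    Assumption: n counts positions where the two words carry the same
    letter when aligned at their first letter. k is the symmetry weight
    (1.0 here stands for "no adjustment").
    """
    if not word1 and not word2:
        raise ValueError("at least one word must be non-empty")
    n = sum(a == b for a, b in zip(word1, word2))
    return 2 * n * k / (len(word1) + len(word2))


def is_similar(word1: str, word2: str, threshold: float = 0.8, k: float = 1.0) -> bool:
    """Words count as similar if S is higher than a constant threshold.

    The value 0.8 is a hypothetical threshold chosen for this example.
    """
    return similarity(word1, word2, k) > threshold


# Two hexanucleotides that differ in one position: S = 2 * 5 / 12.
print(similarity("ACGTAC", "ACGTTC"))
print(is_similar("ACGTAC", "ACGTTC"))
```

With a symmetry weight below 1, even identical words receive a reduced similarity level, which is one way a weight could lower the score of matches found through a symmetrical conversion.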
The symmetry weight is necessary when two symmetrical words are compared. A different symmetry weight is assigned to each type of symmetry in order to alter the similarity level. Words are considered similar if their similarity level is higher than some constant threshold.

Searching for similar words of various lengths with substitutions and symmetrical conversions can be a time-consuming task. To improve speed, we use pre-generated words of fixed length (for example, 6 bases). Because the DNA alphabet has four letters, only a small number of such words exist (4096 in the case of hexanucleotides), so comparing these words with the words in the sequences reduces comparison time. The output of the block is the positions of similar words.

The second block is called "words restoration". It uses the positions generated by block 1 to restore the longer words. This way of processing was chosen because it is faster than successively searching for words of all lengths.

The third block computes the words that are put into the image. One word should be generated from all similar words, and this word should be the closest to all of them. This means that the word put into the image should require the minimum number of substitutions to produce any word from the list of similar words. Such a word may be absent from this list and must then be generated. The words generated at this step are placed into the shared dictionary. Using words of this type in the image reduces the image size.

The last block creates the image itself. This block simulates human thinking. In order to create the image, we need to define the relations between the locations of all words from the shared dictionary. This can be done in two ways.

In the first case, a counting point can be located in the sequence. Using information about it reduces the number of calculations. All words can then be classified by their distance from that point. Only the words that are located at similar positions in all sequences can be placed into the image.
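The word generation step of the third block can be sketched as follows. This is only an illustration under two assumptions that the text does not state: all similar words have the same length, and "closest to all of them" is read as minimizing the total number of substitutions over the list, which a column-wise majority vote achieves. The names `consensus_word` and `total_substitutions` are hypothetical.

```python
from collections import Counter


def consensus_word(words: list[str]) -> str:
    """Build the word with the fewest total substitutions to all inputs.

    Assumption: all words have equal length, so the most frequent letter
    in each column minimizes the summed number of substitutions.
    Ties are broken alphabetically to keep the result deterministic.
    """
    if not words:
        raise ValueError("need at least one word")
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all words must have the same length")
    letters = []
    for column in zip(*words):
        counts = Counter(column)
        letters.append(min(counts, key=lambda c: (-counts[c], c)))
    return "".join(letters)


def total_substitutions(candidate: str, words: list[str]) -> int:
    """Sum of position-wise differences between candidate and each word."""
    return sum(sum(a != b for a, b in zip(candidate, w)) for w in words)


similar = ["AAGT", "ACGA", "TCGT"]
word = consensus_word(similar)
print(word, total_substitutions(word, similar))  # prints: ACGT 3
```

In this example the generated word ACGT is not present in the input list, which matches the remark that the word placed into the shared dictionary may have to be generated rather than selected.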
For each word, we therefore compute the level of distance similarity from the counting point to the word:

$$C = \frac{\max |R_a - R|}{R_a} \cdot 100\%$$

where $C$ is the level of distance similarity, $R_a$ is the average distance of the similar words from the counting point, and $R$ is the distance of the word under consideration. If $C$ is not higher than some constant threshold, the word is placed into the image with the average distance. The image then takes the following form: the words $Wrd_i$ have a similarity level with a word in the sequence of not less than $M$ and are located at the distance $R_i \pm RDEV_i$ from the counting point, where $RDEV = \max |R_a - R|$.

In the second case, there may be no counting point in the sequences, so relative distances should be calculated and placed into the image. A pair of words with similar distances between them in all sequences is placed into the image. The image then takes the following form: $Wrd_i$ has a similarity level with a word in the sequence of not less than $M$ and is located at the distance $R_i \pm RDEV_i$ from $Wrd_j$.

All types of images created by this algorithm can easily be used for recognition of any type of functional region in genetic texts. The only thing we need to do is to check whether the conditions of the image are met by the sample genomic sequence.

Conclusions

The suggested algorithm can automatically create images of DNA text fragments. The images created by this algorithm are more compact and can unite a wider range of objects than manually created images. Using symmetries reduces the number of sequences required for image creation and also reduces the image size. The described algorithm correctly handles inexact distances between functional fragments and substitutions in synonymous genetic texts.

References

1. Leontiev A.Yu. Symmetry of single chain DNA molecules // Biophysics. 1992. Vol. 37, No. 5. P. 771-774.
2. Shulman M.J., Steinberg C.M., Westmoreland N. The coding function of nucleotide sequences can be described by statistical analysis // J. Theor. Biol. 1981. Vol. 88. P. 409-420.
3. Fichant G., Gautier C. Statistical method for predicting protein coding regions in nucleic acid sequences // Comput. Appl. Biosci. 1987. Vol. 3. P. 287-295.
4. Brendel V., Trifonov E.N. A computer algorithm for testing potential prokaryotic terminators // Nucl. Acids Res. 1984. Vol. 12, No. 10. P. 4411-4427.
5. Akberova N.I., Leontiev A.Yu. Symmetrical structure in genetic texts of prokaryotes DNA replication origins // NetSci. 1996. Vol. 2, March. Online: http://www.awod.com/netsci/Issues/Oct95/feature4.html