Bee Algorithms for Solving DNA Fragment Assembly Problem with Noisy and Noiseless data Jesun Sahariar Firoz ∗ M. Sohel Rahman Tanay Kumar Saha A`EDA group Department of CSE, BUET Dhaka, Bangladesh A`EDA group Department of CSE, BUET Dhaka, Bangladesh A`EDA group Department of CSE, BUET Dhaka, Bangladesh jesunsahariar@gmail.com msrahman@cse.buet.ac.bd ABSTRACT tanay@cse.jnu.ac.bd best represents the DNA sequence. This problem finds its motivation from the limitation of current technology, which enables us to read only several hundreds of base pairs (bps) of a single DNA at a time. Consequently, we need to read fragments of the DNA but not the whole sequence at a time. During this process base pairs may be removed, misread or inserted. DNA fragment assembly problem is one of the crucial challenges faced by computational biologists where, given a set of DNA fragments, we have to construct a complete DNA sequence from them. As it is an NP-hard problem, accurate DNA sequence is hard to find. Moreover, due to experimental limitations, the fragments considered for assembly are exposed to additional errors while reading the fragments. In such scenarios, meta-heuristic based algorithms can come in handy. We analyze the performance of two swarm intelligence based algorithms namely Artificial Bee Colony (ABC) algorithm and Queen Bee Evolution Based on Genetic Algorithm (QEGA) to solve the fragment assembly problem and report quite promising results. Our main focus is to design meta-heuristic based techniques to efficiently handle DNA fragment assembly problem for noisy and noiseless data. To read the DNA fragments, a process named shotgun sequencing [31] is used. In this method, multiple copies of the DNA sequence are first generated through a process called amplification. Then we cut these sequences at random points keeping in mind that we can only directly read a sequence of several hundreds bps long. With these fragments, we try to reconstruct the overlapping DNA sequence as accurately as possible. This is called shotgun sequencing. A lot of tools have been invented to automate DNA sequencing. Among these tools, PHRAP [12], TIGR assembler [32], STROLL [7], CAP3 [13], Celera assembler [24] and EULER [26] may be cited. All of these tools focus on coping with different problems faced during fragment assembly. Categories and Subject Descriptors F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—Sequencing To the best of our knowledge, the first work on solving the fragment assembly problem with meta-heuristics can be found in [20]. In this paper, approaches based on Ant Colony System (ACS) are proposed to solve FAP. Later, in [18], the authors presented several methods - a canonical genetic algorithm, a CHC (Cross generational elitist selection, Heterogeneous recombination and Cataclysmic mutation) method, a scatter search algorithm and a simulated annealing method - to solve accurately some problem instances that are 77K base pairs long. Since the work of [18], the problem of DNA fragment assembly has been tackled with different other heuristics and meta-heuristics in the literature [6, 8, 23]. However, the quest for new more accurate and faster techniques still continues. General Terms Algorithm Keywords Bioinformatics, Genetic algorithms, DNA computing, Metaheuristics, Combinatorial optimization 1. INTRODUCTION In DNA (Deoxyribonucleic Acid) fragment assembly problem (FAP), we are given a set of large number of DNA fragments with errors and are asked to find the correct sequence of the DNA by finding the permutation of fragments that ∗ This work is part of the M.Sc. Engg. thesis work of Firoz under the supervision of Rahman at the Department of CSE, BUET. Classical assemblers use fitness functions favoring solutions with strong overlaps between adjacent fragments in the layouts. But we also need to obtain an order of the fragments that minimizes the number of contigs, with the goal of reaching one single contig, i.e., a complete DNA sequence composed of all the overlapping fragments. PALS [6] (Problem Aware Local Search) is a simple, fast and accurate heuristic solution that is recently proposed with the objective of achieving this criteria. In PALS, the number of contigs is used as a high-level criterion to judge the whole quality of the results since it is difficult to capture the dy- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA. Copyright 2012 ACM 978-1-4503-1177-9/12/07 ...$10.00. 201 namics of the problem into other mathematical functions. However, the calculation of the number of contigs is quite time-consuming, and this fact definitely precludes any algorithm to use such calculation. A solution to this problem was introduced in [6] as the utilization of a method which should not need to know the exact number of contigs and thus be computationally light. The key contribution of PALS [6] is to indirectly estimate the number of contigs by measuring the actual number of contigs that are created or destroyed when tentative solutions are manipulated. The authors in [6] used a variation of a heuristic algorithm for TSP for the DNA field, which does not only use the overlaps among the fragments, but also takes into account (in an intelligent manner) the number of contigs that has been created or destroyed. tical nucleotides in a sequence. During sequencing, the four DNA composing nucleotides are periodically flowed over the inserts to be sequenced. Within each flow, the intensity of the signal emitted reflects the number of nucleotides incorporated and thus the length of the homopolymer under consideration. For chemical and technical reasons, this signal is subject to fluctuations that lead to sequencing errors. After generating realistic dataset with errors, we have implemented Artificial Bee Colony (ABC) algorithm and Queen Bee Evolution Based on Genetic Algorithm (QEGA) to solve the DNA fragment assembly problem as accurately as possible and compared the relative fitness achieved in each case. As has been mentioned before, in most of the related works it was assumed that, the fragments being read are correct. We have eliminated this assumption by incorporating error models in generating artificial fragments to be sequenced and exploit the probabilistic behavior of ABC or QEGA so that worse individuals in the population are taken into consideration. In this way, the chance of completely eliminating a probably misread fragment is minimized and we are bound to find more realistic solutions. Simultaneously, we also analyze the behavior of these two algorithms for noiseless cases. Also, using forward and backward read technique of MetaSim [29], we have been able to emulate a more realistic input sequence mimicking experimental output. These facts give us a promising insight about the feasibility of using meta-heuristics to solve DNA fragment assembly problem. In [8], the authors proposed a new method combining a general purpose meta-heuristic with a local search method specifically designed for this problem, namely, PALS of [6]. They presented a new approach of cellular genetic algorithms (cGA) that regulates the intensity on the search while solving a problem, outperforming the compared cGAs with static populations. Another paper [17] proposed evolutionarybased iterative optimization method called Prototype Optimization with Evolved Improvement Steps (POEMS) to solve the DNA fragment assembly problem. POEMS is an iterative algorithm that seeks for the best modification of the current solution in each iteration. The modifications are evolved by means of an evolutionary algorithm. All of the above mentioned previous works consider operations on noiseless data. Recently, in [23], performance of various meta-heuristic based algorithms were discussed where a uniform random error model was introduced in the fitness score matrix.Very recently in [11] several hybrid metaheuristics were proposed to tackle the fragment assembly problem with noisy data taking into consideration different error models (e.g. Sanger,454 etc). The rest of the paper is organized as follows: In Section 2, we give an overview of DNA fragment assembly problem and error models used in fragment generation. In Section 3, we present the details of Artificial Bee Colony (ABC) algorithm and Queen Bee Evaluation based on genetic algorithm (QEGA) and the adaptation of these two algorithms to solve DNA FAP. Section 4 presents the experimental results achieved by executing our algorithms on several standard instances of DNA that are widely used in the literature. We briefly conclude in Section 5. 1.1 Our Contribution In this paper, we focus on solving FAP for noisy and noiseless instances of DNA sequences. We have used fragments generated by GenFrag [9] to simulate noiseless instances. For noisy case, we first observe an important drawback of the recent work of [23]. The drawback of this technique is that no particular sequencing model was taken into consideration to incorporate the corresponding error model. To overcome this drawback and for the construction of a realistic read data set, here we use a sequencing simulator MetaSim [29]. In this simulator, the user is able to choose from different (adaptable) error models of current sequencing technologies (e.g. Sanger [21, 22], Rochei’s 454 [19] and Illumina (former Solexa) [28]). In Sanger sequencing, the error model is based on the following two assumptions: 2. BACKGROUND In this section, we present some background and preliminaries. Most of the definitions and procedures presented in this section follow from [18]. The input of the DNA fragment assembly problem is a set of fragments that are randomly cut from a DNA sequence. The DNA is a double helix of two anti-parallel and complementary nucleotide sequences. One strand is read from 50 to 30 and the other from 30 to 50 . There are four kinds of nucleotides in any DNA sequence, namely, Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). 1. the probability of an error occurring at position i of a read increases linearly with i, and To further understand the problem, we need to know the following basic terminology: 2. if an error occurs at position i, then with some fixed probabilities, it is either a substitution, a deletion or an insertion. • Fragment: A short sequence of DNA with length up to 1000 bps. In 454 sequencing, the intensity of emitted light is used to estimate the length of homopolymers, i.e. , runs of iden- • Shotgun data: A set of fragments. 202 • Prefix: A substring comprising the first n characters of a fragment f . 2.3 454 Sequencing Error Model For 454 sequencing [1] error model, let r denote the length of a given homopolymer. The emitted light intensity can be modeled by a normal distribution√N (µ, σ), with mean µ = σ and standard deviation σ = k ∗ r, where k is a fixed proportionality factor (k ≈ 0.15). Although basic statistics imply that the standard deviation should grow with the square root of r, for the light intensity emitted during 454sequencing, it is reported to be σ = k ∗ r. In 454 model, a negative flow is a flow of nucleotides in which the sequence to synthesize is not elongated. Light intensities of negative flows follow a lognormal distribution, with mean µ = 0.23, and standard deviation σ = 0.15. A random variable X is said to be lognormally distributed, if the random variable ln X is normally distributed. Let µ and σ be the mean and standard deviation of X and let m and s be the mean and standard deviation of ln(X). Then, r µ2 σ m = ln( p ), s = ln(( )2 + 1) (2) 2 2 µ σ +µ • Suffix: A substring comprising the last n characters of a fragment f . • Overlap: Common sequence between the suffix of one fragment and the prefix of another fragment. • Layout: An alignment of a collection of fragments based on the overlap order, i.e., the fragment order in which the fragments must be joined. • Contig: A layout consisting of contiguous overlapping fragments, i.e., a sequence in which the overlap between adjacent fragments is greater than a predefined threshold. • Consensus: A sequence or string derived from the layout by taking the majority vote for each column of the layout. To measure the quality of a consensus, we can look at the distribution of the coverage. Coverage at a base position is defined as the number of fragments at that position. It is a measure of the redundancy of the fragment data. It denotes the number of fragments, on average, in which a given nucleotide in the target DNA is expected to appear and is computed as follows [30]: Pg i=1 length of the fragment i , (1) Coverage = target sequence length The base-calling intensities of negative flows can be based on this and the misinterpretation of a base can be modeled as homopolymers of length = 1. A simulator take the order of the sequencing flows into account. The nucleotides are cyclicly flowed in the order T, A, C, G and thus only two specific negative flows in a specific order are allowed after any given base. Now comes the base-calling process. As the details of the base-calling algorithm are not known in detail, and are presumably subject to constant improvements, there may occur a loss of accuracy here. First, the intersections of the density functions of the normal distributions for different homopolymer lengths r1 and r2 are calculated and stored in an intersection matrix M . Then, these values are used to decide which homopolymer length is called. where g denotes the no. of fragments. 2.1 DNA Sequencing Process To determine the function of specific genes, scientists read the sequence of nucleotides comprising a DNA sequence in a process called DNA sequencing. The fragment assembly starts with breaking the given DNA sequence into small fragments. To do that, multiple exact copies of the original DNA sequence are made. Each copy is then cut into short fragments at random positions (duplicate and sonicate phases). Then this biological material is “converted” to computer data (sequence and call bases phases). These steps take place in the laboratory. After the fragment set is obtained, traditional assembly approach is followed in the following phases in the given order: overlap, layout, and then consensus. In overlap phase, the best or longest match between the suffix of one sequence and the prefix of another are found. Layout phase consists of finding the order of fragments based on the computed similarity score. Consensus phase consists of deriving the DNA sequence from the layout. The most common technique used in this phase is to apply the majority rule in building the consensus. 2.4 Exact Error Model The Exact Error Model is suitable for sampling reads without modifying any bases, i.e. in contrast to the other provided error models, no substitution, insertion or deletion is applied to the read sequences. But orientations may be changed during reading. 3. BEE ALGORITHMS The Bee Colony meta-heuristic algorithms are natureinspired algorithms based on various biological and natural processes observed in honeybee. Some of these algorithms are inspired by the foraging behavior of honeybee. One of them, Artificial Bee Colony (ABC) Algorithm, is a metaheuristic designed upon the concept of the intelligent behavior of honey bee swarm. Honey bees use techniques like waggle dance that can be used to develop new intelligent search algorithms. Another class of Bee Colony algorithms is inspired by the queen bee evolution process and has used it to enhance the optimization capability of genetic algorithms. The queen-bee evolution makes it possible for genetic algorithms to quickly reach the global optimum as well as decreasing the probability of premature convergence. 2.2 Sanger Sequencing Error Model In Sanger Sequencing [1] the following parameters concerning different probability values, P must be set because of the two assumptions mentioned earlier: 1. P (error at i) for positions i = 1 and i = 1000 2. P (deletion | error) 3.1 3. P (insertion | error) Artificial Bee Colony (ABC) Algorithm In the ABC algorithm [16], the colony of artificial bees contains three groups of bees: employed bees, onlookers and scouts. A bee which waits at the dance area for making de- 4. P (subtitution | error) = 1 - P (deletion | error) - P (insertion | error) 203 cision to choose a food source depending on waggle dance of employed bee, is called an onlooker and a bee going to the food source visited by itself previously is named an employed bee. A bee which performs random search for food is called a scout. In the ABC algorithm (Algorithm 1), first half of the colony consists of employed artificial bees and the second half constitutes the onlookers. For every food source around the hive, there is only one employed bee. The employed bee whose food source is exhausted by the employed and onlooker bees becomes a scout. all food sources. If the nectar amount of a food source increases, the probability with which that food source is chosen by an onlooker increases, too. The dance of employed bees carrying higher nectar has higher probability of being selected by onlookers for exploitation. Algorithm 1 ABC Algorithm 1: Initialize potential food sources for employed bees. 2: while Requirements are not met do 3: Each employed bee goes to a food source in her memory and determines a neighbour source, then evaluates its nectar amount and dances in the hive 4: Each onlooker watches the dance of employed bees and chooses one of their sources depending on the dances, and then goes to that source. After choosing a neighbour around that, she evaluates its nectar amount. 5: Abandoned food sources are determined and are replaced with the new food sources discovered by scouts. 6: The best food source found so far is registered 7: end while For our DNA FAP, given a set of fragments, our target is to generate a permutation of fragments that minimizes the number of contigs and maximizes the fitness. The solution achieving these two objectives best represents the DNA sequence instance. In our ABC algorithm, each of the food sources represents a permutation of fragments of a DNA sequence. The position of a food source represents a possible solution and the nectar amount of a food source corresponds to the quality (fitness) of the associated solution. Initially we generate the food source positions (i.e. permutations) randomly. We do not assume that the initial seeds are taken from the best of previous execution of any other algorithm. This is a more realistic assumption compared to [23] where the authors use seeding strategies to find good solution beforehand to exploit promising regions, for both noisy and noiseless data. Our algorithm is robust in the sense that, although we do not make any such preprocessing, our algorithm converges to almost same fitness compared to them in reasonable time. 3.2 Conventional genetic algorithm sometimes become unsuccessful to find a globally optimal solution within a limited no. of evolutions. To overcome this disadvantage, a queen-bee evolution based on genetic algorithm (QEGA) ( [15], [27]) is employed for solving the DNA FAP. After the initialization, the food sources are repeatedly searched by employed bees, onlooker bees and scout bees. A employed or onlooker bee produces a modification of the solution for finding a new food source and tests the nectar amount (fitness value) of the new solution. For the modification, we have used problem aware local search (PALS) [6] on the permutation µ under consideration. We have done so because, besides considering fitness value, we also take no. of contigs of a solution into consideration. In all cases, we have used a fitness function (calculation of the amount of nectar) that sums the overlap score for adjacent fragments (f [i] and f [i + 1]) in a given solution. Let us denote the overlap score by w(f [i], f [i + 1]) and fitness function by Fµ . Then, Fµ = g−2 X w(f [i], f [i + 1]), Queen-bee Evaluation Based On Genetic Algorithm (QEGA) Algorithm 2 QEGA Algorithm Input: time t, population size k, populations P , normal mutation rate σ, normal mutation probability pm , strong mutation probability p0m , a queen-bee Iq , selected bees Im Output: best fitness solution 1: Initialize P (t) 2: Evaluate P (t) 3: while Condition not met do 4: t←t+1 5: Select P (t) from P (t − 1). So, P (t) = {Iq (t − 1), Im (t − 1). 6: Recombine P (t) 7: Crossover 8: for i = 0 to k do 9: if i ≤ (σ ∗ k) then 10: Mutate with pm 11: else 12: Mutate with p0m 13: end if 14: end for 15: Evaluate P (t) 16: end while (3) i=0 where g is the no. of fragments in the solution. If the nectar amount of the new modified source is higher than that of the previous one, the bee memorizes the new solution and forgets the old one. Otherwise it keeps the position of the previous one. Onlooker bees choose the food source for modification probabilistically, depending on the nectar amount of the food source. The probability, pµ is calculated as follows: The population of QEGA consists of several permutations of fragments of a DNA sequence. To determine the fitness value of each individual in the population, we have used Equation 3 as the fitness calculating function in Steps 2 and 15 of QEGA (Algorithm 2). Fµ , (4) globalmax where globalmax denotes the maximum fitness value among pµ = 204 There are two major differences between conventional genetic algorithm (CGA) and QEGA. Firstly, the parents P (t) in CGA are composed of the k individuals selected by a selection algorithm such as tournament selection. On the other hand, parents P (t) in QEGA consist of the k2 copies of a queen-bee Iq (t − 1), where q = arg max {Fµ , 1 ≤ µ ≤ k} and k2 copies of bees Im (t − 1) selected by a selection algorithm, where 1 ≤ m ≤ k2 . Secondly, all individuals in conventional genetic algorithm are mutated with small mutation probability pm , while in QEGA only a part of the individuals are mutated with normal mutation probability pm and the others are mutated with strong mutation probability p0m . The ratio between pm and p0m is denoted by σ in QEGA. Generally, pm is less than 0.1 and p0m is greater than pm . Table 1: Information of datasets. Accession numbers are used as the name of the instances Instances (abbreviated form) acin1 (a 1) acin2 (a 2) acin3 (a 3) acin5 (a 5) acin7 (a 7) acin9 (a 9) x60189 4 (x 1) x60189 5 (x 2) x60189 6 (x 3) x60189 7 (x 4) m15421 5 (m 1) m15421 6 (m 2) m15421 7 (m 3) j02459 7 (j 1) BX842596 4 (b 1) BX842596 7 (b 2) So, In QEGA, the fittest individual in a generation, crossbreeds with the bees selected as parents by means of a selection algorithm. This feature of queen-bee evolution reinforces the exploitation of genetic algorithms. That is, fitness of the offsprings mainly rely on the crossover operation and the fittest individual. Consequently, it also increases the probability of premature convergence. Nonetheless, the second feature of QEGA helps genetic algorithms search new space, i.e., it increases the exploration of genetic algorithms through strong mutation. These two features enable genetic algorithms to evolve quickly and simultaneously maintain good solutions. In other words, the queen-bee evolution makes it possible for genetic algorithms to quickly approach the global optimum as well as decreasing the probability of premature convergence. CoverMean age fragment length Number of fragments 26 3 3 2 2 7 4 5 6 6 5 6 7 7 4 182 1002 1001 1003 1003 1003 395 386 343 387 398 350 383 700 708 307 451 601 751 901 1049 39 48 66 68 127 173 177 352 442 5 703 773 Original sequence length (in bps) 2170 147200 200741 329958 426840 156305 3835 10089 48502 77292 cussed as follows: We have consulted different DNA sequence simulators like FASIM [14], GenFrag [9] and Celsim [25]. In [14], the authors have showed various problems associated with Celsim and GenFrag version 1.0 and 2.1. Mainly, these programs are designed on the assumption that fragments are equally distributed on genome. That is, clone sampling is uniformly carried out during fragment generation. Thus, these programs do not sufficiently reflect actual conditions of WGSS (Whole Genome Shotgun Sequencing). For FASIM, current release of the software is only available upon request. On the other hand, MetaSim is a freely available sequencing simulator for genomic and metagenomics. It can be utilized to simulate fragments of real read experiments by incorporating errors which occur at the Layout phase, i.e., unknown orientation, base call errors, incomplete coverage, repeated regions, chimeras and contamination. Here we have used three error models, namely, 454, Sanger and exact error models considering configuration of all parameters. These configurations are presented in Table 2. Ordered two-point crossover has been used as crossover operator in Step 7 of Algorithm 2. In this scheme, given two parents, two random crossover points are selected partitioning them into a left, middle and right portion. Then ordered crossover is carried out in the following way: Child 1 inherits its left and right sections from Parent 1, and its middle section is determined by the fragments in the middle section of Parent 1 in the order these appear in Parent 2. A similar process is applied to determine Child 2. Notably, the Ordered crossover operator is a pure recombination operator [33]. For strong mutation, we have used problem aware local search (PALS) [6]. For normal mutation, we have done random mutation on the offsprings. 4. EXPERIMENTS 4.2 In this section, we present our experimental setup and the results obtained by our algorithms. The experiments were conducted in Ubuntu 11.04 running on an Intel core i3 Processor with 2GB RAM. As the exact orientation of the generated fragments are not predictable, to tackle the problem of unknown orientation, we have checked fragment overlapping in both forward and backward orientation during calculation of the score matrix. For example, let us assume that we want to calculate overlap between Fragment 1 and Fragment 2. We first calculate normal overlap between these two segments (suffixprefix) and then compare the obtained value with the overlap value calculated from the reverse direction of Fragment 2 (suffix-suffix). We take the best value from the above two options. We have used GenFrag and MetaSim for generating noiseless and noisy artificial fragments respectively. For this, we collected the DNA sequences from NCBI [2, 3, 4, 5]. We give a summary on the different features of the datasets in Table 1. 4.1 Noisy Input Data Generation 4.3 We have used MetaSim [29] for noisy input data generation. The motivation behind using Metasim is briefly dis- Score Matrix Calculation ABC Control Parameters ABC algorithm has a few control parameters. We set 205 Table 2: Configuration of MetaSim for different error models and no. of generations used by the algorithms Parameter Considered Error Model Number of reads or Mate pair 454, Sanger, Exact -do-do-do-do454 -do-do- DNA clone size distribution type a Mean Second Parameter Mate pair Probability Expected read length Mate pair read length Number of flow cycles Mean negative flow cyclesb Std. deviation for negative flow cyclec Signal std. deviation multiplier Read length distribution type Error rate at read start Error rate at end of read Insertion error rate Deletion error rate Insertion Deletion Substitutions No. of base pair processed No. of base pair generated -do-do-doSanger -do-do-do-do454 Sanger 454 Sanger Sanger 454 Sanger Exact 454 Sanger Exact acin1 307 acin2 451 acin3 601 acin5 751 190 10 1020 30 1001 1003 190 1020 1007 1007 75 400 395 395 1330 146 369 162 532 57857 56414 55915 58818 56398 55915 10460 1433 2758 1370 4095 458964 452401 461288 466666 452464 461288 13651 1643 3687 1679 5094 594676 567768 596036 604640 567732 596036 17146 2147 4438 2131 6409 738514 710617 755338 751222 710633 755338 Instances acin7 acin9 901 1049 x60189 6 68 m15421 7 177 j02459 7 352 387 387 700 388 388 700 152 152 275 604 71 125 70 216 23719 23145 23719 24198 23146 24198 1571 157 388 165 537 63688 61276 69562 64871 61268 69562 5694 666 1469 621 1988 233127 223168 243163 237352 223213 243163 Normal 1003 1003 100 0 1005 1005 20 394 394 0.23 0.15 0.15 Normal 0.01 0.02 0.2 0.2 20882 23872 2480 2860 5396 6373 2467 2965 7642 8969 889923 1034189 852423 997708 903193 1050247 905409 1051688 852436 997603 903193 1050247 a A Clone in MetaSim is a DNA fragment that is randomly extracted from the source genome sequence for read/mate-pair sampling. b A negative flow is a flow of nucleotides in which the sequence to synthesize is not elongated. Light intensities of negative flows follow a lognormal distribution. The default value (µ = 0.23) is taken from [10]. c The default value (0.15) is taken from [10]. Maximum number of cycles (MCN), i.e., maximum number of generation to 4000 and the colony size, i.e., population size to 256. The percentage of onlooker bees was set to 50% of the colony, the employed bees were 50% of the colony and the number of scout bees was selected as one. Notably, an increase in the number of scouts encourages the exploration whereas that in onlookers of a food source increases the exploitation. Table 3: Best final contig number and fitness for noiseless data Ins. x1 x2 x3 x4 m 1 m 2 m 3 j1 b 1 b 2 a1 a2 a3 a5 a7 a9 4.4 Results Now, we present the results obtained by our ABC algorithm and QEGA for solving different DNA instances. We measure the quality of the solution by considering the fitness value achieved as well as the number of contigs. The significance of contig calculation is to ensure that the solution best represents a continuous assembled sequence. We have tested our algorithms for both noisy and noiseless fragments. For noiseless instances, we set a cutoff value, i.e., required overlap between two adjacent fragments, to thirty. Table 3 shows the results obtained for noiseless instances by our algorithms as well as by simulated annealing (SA), Problem aware local search (PALS) and Genetic algorithm (GA). In Table 3, A, Q, S, P and G denotes ABC, QEGA, SA, PALS and GA respectively. In this case we have found that our implementation of ABC and QEGA perform very competitively with the best algorithm in the literature which is based on simulated annealing (SA) for noiseless instances [23]. Similar behavior is observed in comparison to PALS [6], a defacto standard to compare genome assemblers, in all cases. These experiments validate our algorithms’ good performance. Moreover, QEGA algorithm performs better than ABC algorithm for most of the instances. Notably, QEGA produces better results for longer instances. In terms of no. of con- 206 ABC 11478 14016 18339 21184 38423 47515 54607 114774 225431 440187 45403 142539 155413 144339 155182 321397 Best Fitness QEGA SA PALS 11476 11478 11204 14027 14027 12898 18266 18301 16992 21208 21271 20424 38578 38583 36540 47882 48048 45773 55020 55048 51454 116222 116257 108744 227252 226538 211792 443600 436739 412374 47115 46955 44656 144133 144705 138423 156138 156630 151928 144541 146607 143835 155322 157984 155089 322768 324559 310235 GA 11478 13502 17688 20884 37714 46949 52695 111103 220558 416414 45565 143444 154947 145332 155873 313203 A 1 1 1 1 3 2 2 3 13 12 8 237 362 556 726 552 Best Q 1 1 1 1 1 1 1 1 8 3 4 233 358 552 722 552 Contig S P 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 G 1 1 1 1 1 2 2 1 6 3 5 236 358 552 722 552 tigs, QEGA does a better job at contig reduction than the ABC algorithm. Next, We execute ABC and QEGA for noisy instances. It is to be noted that, in noisy instances, a lot of insertions, deletions or substitutions can occur. For instance, if we refer to Table 2, there are 5694 insertions, 1469 deletions and 1988 substitutions on j02459 7 instance in 454 error model. So, empirically we set the cutoff value for noisy instances to lower values but ensure overlapping adjacent fragments. The results obtained by our algorithms (ABC and QEGA) are presented in Table 4 along with three other algorithms from [11] namely Genetic Algorithms (GA), Genetic Algorithm with Simulated Annealing (GA+SA)and Genetic Algorithm with Hill Climbing(GA+HC). We have given the comparison of our result with [11] because the authors of [11] have used the same error models and parameters as ours to present results. More importantly, there is no way to compare our results with [23] because the authors of [23] have only told that they have introduced random error but there was no indication of how much error was being introduced. As opposed to them, we have clearly mentioned the parameters of our noisy dataset generation in Table 2. Instead of introducing random error in score matrix [23], we generated fragments with three error models (454, Sanger and Exact). We execute our algorithms on these instances. We also take into consideration that DNA FAP is an off-line problem. So, instead of emphasizing on execution time of the algorithms, we concern ourself with assessment of the solution quality, based on number of contigs and fitness value. Table 4: Best fitness obtained by the algorithms for noisy data Instances acin1 acin2 acin3 acin5 acin7 acin9 As can be seen from Table 4, for most of the instances, QEGA outperforms all previously proposed algorithms as well as the ABC algorithm because of the elitist selection method imposed by QEGA. Our implementation of QEGA breeds fittest parent (i.e. queen bee) with other parents. The strong mutation in Step 12 of Algorithm 2 employs PALS to get the fittest mutants from the offspring. So we perform exploitation in the vicinity of the best solution of the current cycle. On the other hand, although ABC algorithm performs analogous action of mutation, but it does not do so with the fittest ones. Exploitation can sometime leads to local minima and QEGA countermeasures this by random mutation. All the fitness values for QEGA and ABC algorithm shown in Table 4 are obtained for a single contig i.e. one continuous sequence with some exceptions. These exceptions can not reach single contig value and the final contig value for each of these instances are shown in brackets just beside their fitness in Table 4. For GA, GA+HC and GA+SA, the authors in [11] did not mention the final contigs obtained by these algorithms. So we are unable to comment about these three algorithms in terms of contig value. x60189 6 m15421 7 j02459 7 Figure 1 and 2 show the progressive improvement of fitness values for the instance j02459 7 for noiseless and noisy data by applying QEGA. As can be seen from the figures, final fitness values for the noisy cases is significantly smaller than that of noiseless cases. Figure 1 also shows that QEGA is able to overcome any local minima as is evident from the transition of lower fitness value at almost 1100-th iteration to higher fitness value at almost 1200-th iteration. Error Model 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact 454 Sanger Exact ABC 4323 7758 17354 2271 2981 20570 2960 3281 26611 3759 4223 14694 3759(2) 3964 15052(4) 4671 5204 31914(2) 353 609 2292 978 1429 7028 1670 2466 13353 Best QEGA 4631 8512 160007(4) 2418 3142 20884 3170 3661 10605(237) 3994 4533 16182 4792 5177 18864 6315 7881 48154(107) 363 623 2303 1029 1467 7181 1776 2563 13607 Fitness GA 162 178 214 260 264 267 326 344 374 405 430 431 490 516 522 608 623 625 41 36 36 88 97 93 183 188 179 GA+SA 3738 6744 14157 1778 2413 19140 1972 2199 12865 2220 2384 7069 2528 2651 8212 2838 2982 11816 358 622 2289 939 1380 6840 1429 2095 12659 GA+HC 4611 8394 17914 2208 2924 20677 2657 3095 13694 3113 3540 15001 3487 3817 16029 3828 4735 38262 362 624 2302 1026 1455 7170 1673 2443 13526 Figure 1: Iteration vs Fitness graph for j02459 7 noiseless data using QEGA optimal fitness for noisy instances. We also wish to implement the parallel versions of our algorithms. 6. 5. CONCLUSION REFERENCES [1] http://ab.inf.uni-tuebingen.de/teaching/ss07/ albi2/script/readsim_2July2007.pdf. [Online; accessed 16-November-2011]. [2] http://www.ncbi.nlm.nih.gov/nuccore/178817? report=fasta(M15421.1). [Online; accessed 20-November-2011]. [3] http://www.ncbi.nlm.nih.gov/nuccore/34645? report=fasta(X60189.1. [Online; accessed 20-November-2011]. [4] http://www.ncbi.nlm.nih.gov/nuccore/215104? report=fasta(J02459.1. [Online; accessed 20-November-2011]. [5] http://0-www.ncbi.nlm.nih.gov.ilsprod.lib.neu. edu/Traces/wgs/?hide\_master\\*=\&val= ACIN02000001\#9(ACIN\_ALL). [Online; accessed 20-November-2011]. In this paper, we have used Artificial Bee Colony (ABC) algorithm and Queen-bee Evaluation based on Genetic Algorithm (QEGA) to tackle the DNA fragment assembly problem for noisy and noiseless instances. In particular, we have recorded the results obtained by our algorithm, taking into consideration various standard error models normally used by DNA sequencing experiments. Previous works on DNA FAP report their results for noiseless cases mostly. We have observed that, although our algorithms perform very well for noiseless data just like others, for noisy instances their obtained fitness value decrease significantly for different error models. We have pointed out the fact that errors in DNA fragments significantly impact the performance of standard algorithms. Future works should take this into account. In future, we intend to design algorithms to achieve near- 207 [18] [19] [20] Figure 2: Iteration vs Fitness graph for j02459 7 noisy data(454 error model) using QEGA [21] [22] [6] E. Alba and G. Luque. A new local search algorithm for the DNA fragment assembly problem. In Evolutionary Computation in Combinatorial Optimization, EvoCOP’07, Lecture Notes in Computer Science 4446, pages 1–12, Valencia, Spain, April 2007. Springer. [7] T. Chen and S. S. Skiena. Trie-based data structures for sequence assembly. In The Eighth Symposium on Combinatorial Pattern Matching, pages 206–223. Springer-Verlag, 1997. [8] B. Dorronsoro, E. Alba, G. Luque, and P. Bouvry. A self-adaptive cellular memetic algorithm for the DNA fragment assembly problem. In IEEE Congress on Evolutionary Computation, pages 2651–2658, 2008. [9] M. L. Engle and C. Burks. Artificially generated data sets for testing DNA sequence assembly algorithms. Genomics, 16(1):286–8, 1993. [10] M. M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–80, 2005. [11] J. S. Firoz, M. S. Rahman, and T. k. Saha. Analyzing memetic algorithm behavior in DNA fragment assembly problem for noisy data. In 4th International Conference on Bioinformatics and Computational Biology (BICoB),Las Vegas, Nevada, USA, 2012. [12] P. Green. Phrap. http://www.phrap.org/. [Online; accessed 20-November-2011]. [13] X. Huang and A. Madan. Cap3: A DNA sequence assembly program. Genome Research, 9:868–877, 1999. [14] C. G. Hur. Fasim: Fragments assembly simulation using biased-sampling model and assembly simulation for microbial genome shotgun sequencing. [15] S. H. Jung. Queen-bee evolution for genetic algorithms. Electronics Letters, 39:575–576, March 2003. [16] D. Karaboga and B. Basturk. A powerful and efficient algorithm for numerical function optimization: artificial bee colony (abc) algorithm. J. of Global Optimization, 39:459–471, November 2007. [17] J. Kubalik, P. Buryan, and L. Wagner. Solving the dna fragment assembly problem efficiently using iterative optimization with evolved hypermutations. In Proceedings of the 12th annual conference on Genetic [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] 208 and evolutionary computation, GECCO ’10, pages 213–214, New York, NY, USA, 2010. ACM. G. Luque and E. Alba. Metaheuristics for the DNA Fragment Assembly Problem. International Journal of Computational Intelligence Research, 1(1-2):98–108, 2005. M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y.-J. Chen, Z. Chen, and et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005. P. Meksangsouy and N. Chaiyaratana. DNA fragment assembly using an ant colony system algorithm. In Proceedings of Congress on Evolutionary Computation, volume 3, pages 1756–1763, 2003. D. Meldrum. Automation for genomics, part one: Preparation for sequencing. Genome Research, 10(8):1081–1092, 2000. D. Meldrum. Automation for genomics, part two: Sequencers, microarrays, and future trends. Genome Research, 10(9):1288–1303, 2000. G. Minetti and E. Alba. Metaheuristic assemblers of DNA strands: Noiseless and noisy cases. In IEEE Congress on Evolutionary Computation’10, pages 1–8, 2010. E. W. Myers. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2:275–290, 1995. G. Myers. A dataset generator for whole genome shotgun sequencing. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 202–210. AAAI Press, 1999. P. Pevzner. Computational molecular biology: An algorithmic approach. MIT Press, 2000. L. Qin, Q. Jiang, Z. Zou, and Y. Cao. A queen-bee evolution based on genetic algorithm for economic power dispatch. In 39th International Universities Power Engineering Conference, volume 1, pages 453–456, September 2004. D. R and Bentley. Whole-genome re-sequencing. Current Opinion in Genetics and Development, 16(6):545 – 552, 2006. D. C. Richter, F. Ott, A. F. Auch, R. Schmid, and D. H. Huson. Metasim:a sequencing simulator for genomics and metagenomics. PLoS ONE, 3(10):e3373+, 2008. J. Setubal and J. Meidanis. Fragment assembly of DNA. Introduction to Computational Molecular Biology, pages 105–139, 1997. J. C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997. G. Sutton, W. O., A. M.D., and K. A.R. Tigr assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Tech, pages 9–19, 1995. E. G. Talbi. Metaheuristics from design to implementation. WILEY, 2009.