Bee Algorithms for Solving DNA Fragment Assembly Problem

advertisement
Bee Algorithms for Solving DNA Fragment Assembly
Problem with Noisy and Noiseless data
Jesun Sahariar Firoz ∗
M. Sohel Rahman
Tanay Kumar Saha
A`EDA group
Department of CSE, BUET
Dhaka, Bangladesh
A`EDA group
Department of CSE, BUET
Dhaka, Bangladesh
A`EDA group
Department of CSE, BUET
Dhaka, Bangladesh
jesunsahariar@gmail.com msrahman@cse.buet.ac.bd
ABSTRACT
tanay@cse.jnu.ac.bd
best represents the DNA sequence. This problem finds its
motivation from the limitation of current technology, which
enables us to read only several hundreds of base pairs (bps)
of a single DNA at a time. Consequently, we need to read
fragments of the DNA but not the whole sequence at a time.
During this process base pairs may be removed, misread or
inserted.
DNA fragment assembly problem is one of the crucial challenges faced by computational biologists where, given a set
of DNA fragments, we have to construct a complete DNA
sequence from them. As it is an NP-hard problem, accurate
DNA sequence is hard to find. Moreover, due to experimental limitations, the fragments considered for assembly
are exposed to additional errors while reading the fragments.
In such scenarios, meta-heuristic based algorithms can come
in handy. We analyze the performance of two swarm intelligence based algorithms namely Artificial Bee Colony (ABC)
algorithm and Queen Bee Evolution Based on Genetic Algorithm (QEGA) to solve the fragment assembly problem and
report quite promising results. Our main focus is to design
meta-heuristic based techniques to efficiently handle DNA
fragment assembly problem for noisy and noiseless data.
To read the DNA fragments, a process named shotgun sequencing [31] is used. In this method, multiple copies of
the DNA sequence are first generated through a process
called amplification. Then we cut these sequences at random points keeping in mind that we can only directly read
a sequence of several hundreds bps long. With these fragments, we try to reconstruct the overlapping DNA sequence
as accurately as possible. This is called shotgun sequencing. A lot of tools have been invented to automate DNA
sequencing. Among these tools, PHRAP [12], TIGR assembler [32], STROLL [7], CAP3 [13], Celera assembler [24] and
EULER [26] may be cited. All of these tools focus on coping
with different problems faced during fragment assembly.
Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—Sequencing
To the best of our knowledge, the first work on solving
the fragment assembly problem with meta-heuristics can
be found in [20]. In this paper, approaches based on Ant
Colony System (ACS) are proposed to solve FAP. Later,
in [18], the authors presented several methods - a canonical
genetic algorithm, a CHC (Cross generational elitist selection, Heterogeneous recombination and Cataclysmic mutation) method, a scatter search algorithm and a simulated
annealing method - to solve accurately some problem instances that are 77K base pairs long. Since the work of [18],
the problem of DNA fragment assembly has been tackled
with different other heuristics and meta-heuristics in the literature [6, 8, 23]. However, the quest for new more accurate
and faster techniques still continues.
General Terms
Algorithm
Keywords
Bioinformatics, Genetic algorithms, DNA computing, Metaheuristics, Combinatorial optimization
1. INTRODUCTION
In DNA (Deoxyribonucleic Acid) fragment assembly problem (FAP), we are given a set of large number of DNA fragments with errors and are asked to find the correct sequence
of the DNA by finding the permutation of fragments that
∗
This work is part of the M.Sc. Engg. thesis work of Firoz
under the supervision of Rahman at the Department of CSE,
BUET.
Classical assemblers use fitness functions favoring solutions with strong overlaps between adjacent fragments in
the layouts. But we also need to obtain an order of the
fragments that minimizes the number of contigs, with the
goal of reaching one single contig, i.e., a complete DNA sequence composed of all the overlapping fragments. PALS [6]
(Problem Aware Local Search) is a simple, fast and accurate heuristic solution that is recently proposed with the
objective of achieving this criteria. In PALS, the number of
contigs is used as a high-level criterion to judge the whole
quality of the results since it is difficult to capture the dy-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
GECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA.
Copyright 2012 ACM 978-1-4503-1177-9/12/07 ...$10.00.
201
namics of the problem into other mathematical functions.
However, the calculation of the number of contigs is quite
time-consuming, and this fact definitely precludes any algorithm to use such calculation. A solution to this problem was introduced in [6] as the utilization of a method
which should not need to know the exact number of contigs
and thus be computationally light. The key contribution of
PALS [6] is to indirectly estimate the number of contigs by
measuring the actual number of contigs that are created or
destroyed when tentative solutions are manipulated. The
authors in [6] used a variation of a heuristic algorithm for
TSP for the DNA field, which does not only use the overlaps among the fragments, but also takes into account (in
an intelligent manner) the number of contigs that has been
created or destroyed.
tical nucleotides in a sequence. During sequencing, the four
DNA composing nucleotides are periodically flowed over the
inserts to be sequenced. Within each flow, the intensity
of the signal emitted reflects the number of nucleotides incorporated and thus the length of the homopolymer under
consideration. For chemical and technical reasons, this signal is subject to fluctuations that lead to sequencing errors.
After generating realistic dataset with errors, we have
implemented Artificial Bee Colony (ABC) algorithm and
Queen Bee Evolution Based on Genetic Algorithm (QEGA)
to solve the DNA fragment assembly problem as accurately
as possible and compared the relative fitness achieved in
each case.
As has been mentioned before, in most of the related works
it was assumed that, the fragments being read are correct.
We have eliminated this assumption by incorporating error
models in generating artificial fragments to be sequenced
and exploit the probabilistic behavior of ABC or QEGA so
that worse individuals in the population are taken into consideration. In this way, the chance of completely eliminating a probably misread fragment is minimized and we are
bound to find more realistic solutions. Simultaneously, we
also analyze the behavior of these two algorithms for noiseless cases. Also, using forward and backward read technique
of MetaSim [29], we have been able to emulate a more realistic input sequence mimicking experimental output. These
facts give us a promising insight about the feasibility of using
meta-heuristics to solve DNA fragment assembly problem.
In [8], the authors proposed a new method combining a
general purpose meta-heuristic with a local search method
specifically designed for this problem, namely, PALS of [6].
They presented a new approach of cellular genetic algorithms (cGA) that regulates the intensity on the search while
solving a problem, outperforming the compared cGAs with
static populations. Another paper [17] proposed evolutionarybased iterative optimization method called Prototype Optimization with Evolved Improvement Steps (POEMS) to
solve the DNA fragment assembly problem. POEMS is an
iterative algorithm that seeks for the best modification of
the current solution in each iteration. The modifications are
evolved by means of an evolutionary algorithm.
All of the above mentioned previous works consider operations on noiseless data. Recently, in [23], performance of various meta-heuristic based algorithms were discussed where
a uniform random error model was introduced in the fitness score matrix.Very recently in [11] several hybrid metaheuristics were proposed to tackle the fragment assembly
problem with noisy data taking into consideration different
error models (e.g. Sanger,454 etc).
The rest of the paper is organized as follows: In Section
2, we give an overview of DNA fragment assembly problem
and error models used in fragment generation. In Section
3, we present the details of Artificial Bee Colony (ABC)
algorithm and Queen Bee Evaluation based on genetic algorithm (QEGA) and the adaptation of these two algorithms
to solve DNA FAP. Section 4 presents the experimental results achieved by executing our algorithms on several standard instances of DNA that are widely used in the literature.
We briefly conclude in Section 5.
1.1 Our Contribution
In this paper, we focus on solving FAP for noisy and noiseless instances of DNA sequences. We have used fragments
generated by GenFrag [9] to simulate noiseless instances. For
noisy case, we first observe an important drawback of the recent work of [23]. The drawback of this technique is that no
particular sequencing model was taken into consideration
to incorporate the corresponding error model. To overcome
this drawback and for the construction of a realistic read
data set, here we use a sequencing simulator MetaSim [29].
In this simulator, the user is able to choose from different
(adaptable) error models of current sequencing technologies
(e.g. Sanger [21, 22], Rochei’s 454 [19] and Illumina (former
Solexa) [28]). In Sanger sequencing, the error model is based
on the following two assumptions:
2.
BACKGROUND
In this section, we present some background and preliminaries. Most of the definitions and procedures presented in
this section follow from [18].
The input of the DNA fragment assembly problem is a set
of fragments that are randomly cut from a DNA sequence.
The DNA is a double helix of two anti-parallel and complementary nucleotide sequences. One strand is read from
50 to 30 and the other from 30 to 50 . There are four kinds
of nucleotides in any DNA sequence, namely, Adenine (A),
Thymine (T), Guanine (G), and Cytosine (C).
1. the probability of an error occurring at position i of a
read increases linearly with i, and
To further understand the problem, we need to know the
following basic terminology:
2. if an error occurs at position i, then with some fixed
probabilities, it is either a substitution, a deletion or
an insertion.
• Fragment: A short sequence of DNA with length up
to 1000 bps.
In 454 sequencing, the intensity of emitted light is used
to estimate the length of homopolymers, i.e. , runs of iden-
• Shotgun data: A set of fragments.
202
• Prefix: A substring comprising the first n characters
of a fragment f .
2.3
454 Sequencing Error Model
For 454 sequencing [1] error model, let r denote the length
of a given homopolymer. The emitted light intensity can
be modeled by a normal distribution√N (µ, σ), with mean
µ = σ and standard deviation σ = k ∗ r, where k is a fixed
proportionality factor (k ≈ 0.15). Although basic statistics imply that the standard deviation should grow with the
square root of r, for the light intensity emitted during 454sequencing, it is reported to be σ = k ∗ r. In 454 model, a
negative flow is a flow of nucleotides in which the sequence
to synthesize is not elongated. Light intensities of negative
flows follow a lognormal distribution, with mean µ = 0.23,
and standard deviation σ = 0.15. A random variable X is
said to be lognormally distributed, if the random variable
ln X is normally distributed. Let µ and σ be the mean and
standard deviation of X and let m and s be the mean and
standard deviation of ln(X). Then,
r
µ2
σ
m = ln( p
), s = ln(( )2 + 1)
(2)
2
2
µ
σ +µ
• Suffix: A substring comprising the last n characters of
a fragment f .
• Overlap: Common sequence between the suffix of one
fragment and the prefix of another fragment.
• Layout: An alignment of a collection of fragments
based on the overlap order, i.e., the fragment order
in which the fragments must be joined.
• Contig: A layout consisting of contiguous overlapping
fragments, i.e., a sequence in which the overlap between adjacent fragments is greater than a predefined
threshold.
• Consensus: A sequence or string derived from the layout by taking the majority vote for each column of the
layout.
To measure the quality of a consensus, we can look at the
distribution of the coverage. Coverage at a base position
is defined as the number of fragments at that position. It
is a measure of the redundancy of the fragment data. It
denotes the number of fragments, on average, in which a
given nucleotide in the target DNA is expected to appear
and is computed as follows [30]:
Pg
i=1 length of the fragment i
,
(1)
Coverage =
target sequence length
The base-calling intensities of negative flows can be based
on this and the misinterpretation of a base can be modeled
as homopolymers of length = 1. A simulator take the order
of the sequencing flows into account. The nucleotides are
cyclicly flowed in the order T, A, C, G and thus only two
specific negative flows in a specific order are allowed after
any given base. Now comes the base-calling process. As the
details of the base-calling algorithm are not known in detail,
and are presumably subject to constant improvements, there
may occur a loss of accuracy here. First, the intersections of
the density functions of the normal distributions for different
homopolymer lengths r1 and r2 are calculated and stored in
an intersection matrix M . Then, these values are used to
decide which homopolymer length is called.
where g denotes the no. of fragments.
2.1 DNA Sequencing Process
To determine the function of specific genes, scientists read
the sequence of nucleotides comprising a DNA sequence in
a process called DNA sequencing. The fragment assembly
starts with breaking the given DNA sequence into small fragments. To do that, multiple exact copies of the original DNA
sequence are made. Each copy is then cut into short fragments at random positions (duplicate and sonicate phases).
Then this biological material is “converted” to computer data
(sequence and call bases phases). These steps take place in
the laboratory. After the fragment set is obtained, traditional assembly approach is followed in the following phases
in the given order: overlap, layout, and then consensus. In
overlap phase, the best or longest match between the suffix
of one sequence and the prefix of another are found. Layout
phase consists of finding the order of fragments based on
the computed similarity score. Consensus phase consists of
deriving the DNA sequence from the layout. The most common technique used in this phase is to apply the majority
rule in building the consensus.
2.4
Exact Error Model
The Exact Error Model is suitable for sampling reads
without modifying any bases, i.e. in contrast to the other
provided error models, no substitution, insertion or deletion
is applied to the read sequences. But orientations may be
changed during reading.
3.
BEE ALGORITHMS
The Bee Colony meta-heuristic algorithms are natureinspired algorithms based on various biological and natural
processes observed in honeybee. Some of these algorithms
are inspired by the foraging behavior of honeybee. One of
them, Artificial Bee Colony (ABC) Algorithm, is a metaheuristic designed upon the concept of the intelligent behavior of honey bee swarm. Honey bees use techniques like
waggle dance that can be used to develop new intelligent
search algorithms. Another class of Bee Colony algorithms
is inspired by the queen bee evolution process and has used
it to enhance the optimization capability of genetic algorithms. The queen-bee evolution makes it possible for genetic algorithms to quickly reach the global optimum as well
as decreasing the probability of premature convergence.
2.2 Sanger Sequencing Error Model
In Sanger Sequencing [1] the following parameters concerning different probability values, P must be set because
of the two assumptions mentioned earlier:
1. P (error at i) for positions i = 1 and i = 1000
2. P (deletion | error)
3.1
3. P (insertion | error)
Artificial Bee Colony (ABC) Algorithm
In the ABC algorithm [16], the colony of artificial bees
contains three groups of bees: employed bees, onlookers and
scouts. A bee which waits at the dance area for making de-
4. P (subtitution | error) = 1 - P (deletion | error) - P
(insertion | error)
203
cision to choose a food source depending on waggle dance
of employed bee, is called an onlooker and a bee going to
the food source visited by itself previously is named an employed bee. A bee which performs random search for food
is called a scout. In the ABC algorithm (Algorithm 1), first
half of the colony consists of employed artificial bees and the
second half constitutes the onlookers. For every food source
around the hive, there is only one employed bee. The employed bee whose food source is exhausted by the employed
and onlooker bees becomes a scout.
all food sources. If the nectar amount of a food source increases, the probability with which that food source is chosen by an onlooker increases, too. The dance of employed
bees carrying higher nectar has higher probability of being
selected by onlookers for exploitation.
Algorithm 1 ABC Algorithm
1: Initialize potential food sources for employed bees.
2: while Requirements are not met do
3:
Each employed bee goes to a food source in her memory and determines a neighbour source, then evaluates
its nectar amount and dances in the hive
4:
Each onlooker watches the dance of employed bees
and chooses one of their sources depending on the
dances, and then goes to that source. After choosing a neighbour around that, she evaluates its nectar
amount.
5:
Abandoned food sources are determined and are replaced with the new food sources discovered by scouts.
6:
The best food source found so far is registered
7: end while
For our DNA FAP, given a set of fragments, our target is
to generate a permutation of fragments that minimizes the
number of contigs and maximizes the fitness. The solution
achieving these two objectives best represents the DNA sequence instance.
In our ABC algorithm, each of the food sources represents
a permutation of fragments of a DNA sequence. The position of a food source represents a possible solution and the
nectar amount of a food source corresponds to the quality
(fitness) of the associated solution. Initially we generate the
food source positions (i.e. permutations) randomly. We do
not assume that the initial seeds are taken from the best of
previous execution of any other algorithm. This is a more
realistic assumption compared to [23] where the authors use
seeding strategies to find good solution beforehand to exploit promising regions, for both noisy and noiseless data.
Our algorithm is robust in the sense that, although we do
not make any such preprocessing, our algorithm converges
to almost same fitness compared to them in reasonable time.
3.2
Conventional genetic algorithm sometimes become unsuccessful to find a globally optimal solution within a limited no.
of evolutions. To overcome this disadvantage, a queen-bee
evolution based on genetic algorithm (QEGA) ( [15], [27]) is
employed for solving the DNA FAP.
After the initialization, the food sources are repeatedly
searched by employed bees, onlooker bees and scout bees.
A employed or onlooker bee produces a modification of the
solution for finding a new food source and tests the nectar
amount (fitness value) of the new solution. For the modification, we have used problem aware local search (PALS) [6]
on the permutation µ under consideration. We have done so
because, besides considering fitness value, we also take no.
of contigs of a solution into consideration. In all cases, we
have used a fitness function (calculation of the amount of
nectar) that sums the overlap score for adjacent fragments
(f [i] and f [i + 1]) in a given solution. Let us denote the
overlap score by w(f [i], f [i + 1]) and fitness function by Fµ .
Then,
Fµ =
g−2
X
w(f [i], f [i + 1]),
Queen-bee Evaluation Based On Genetic
Algorithm (QEGA)
Algorithm 2 QEGA Algorithm
Input: time t, population size k, populations P , normal
mutation rate σ, normal mutation probability pm , strong
mutation probability p0m , a queen-bee Iq , selected bees
Im
Output: best fitness solution
1: Initialize P (t)
2: Evaluate P (t)
3: while Condition not met do
4:
t←t+1
5:
Select P (t) from P (t − 1). So, P (t) = {Iq (t − 1),
Im (t − 1).
6:
Recombine P (t)
7:
Crossover
8:
for i = 0 to k do
9:
if i ≤ (σ ∗ k) then
10:
Mutate with pm
11:
else
12:
Mutate with p0m
13:
end if
14:
end for
15:
Evaluate P (t)
16: end while
(3)
i=0
where g is the no. of fragments in the solution.
If the nectar amount of the new modified source is higher
than that of the previous one, the bee memorizes the new
solution and forgets the old one. Otherwise it keeps the
position of the previous one. Onlooker bees choose the food
source for modification probabilistically, depending on the
nectar amount of the food source. The probability, pµ is
calculated as follows:
The population of QEGA consists of several permutations
of fragments of a DNA sequence. To determine the fitness
value of each individual in the population, we have used
Equation 3 as the fitness calculating function in Steps 2
and 15 of QEGA (Algorithm 2).
Fµ
,
(4)
globalmax
where globalmax denotes the maximum fitness value among
pµ =
204
There are two major differences between conventional genetic algorithm (CGA) and QEGA. Firstly, the parents P (t)
in CGA are composed of the k individuals selected by a selection algorithm such as tournament selection. On the other
hand, parents P (t) in QEGA consist of the k2 copies of a
queen-bee Iq (t − 1), where q = arg max {Fµ , 1 ≤ µ ≤ k}
and k2 copies of bees Im (t − 1) selected by a selection algorithm, where 1 ≤ m ≤ k2 . Secondly, all individuals in
conventional genetic algorithm are mutated with small mutation probability pm , while in QEGA only a part of the
individuals are mutated with normal mutation probability
pm and the others are mutated with strong mutation probability p0m . The ratio between pm and p0m is denoted by σ
in QEGA. Generally, pm is less than 0.1 and p0m is greater
than pm .
Table 1: Information of datasets. Accession numbers are used as the name of the instances
Instances
(abbreviated
form)
acin1 (a 1)
acin2 (a 2)
acin3 (a 3)
acin5 (a 5)
acin7 (a 7)
acin9 (a 9)
x60189 4 (x 1)
x60189 5 (x 2)
x60189 6 (x 3)
x60189 7 (x 4)
m15421 5 (m 1)
m15421 6 (m 2)
m15421 7 (m 3)
j02459 7 (j 1)
BX842596 4
(b 1)
BX842596 7
(b 2)
So, In QEGA, the fittest individual in a generation, crossbreeds with the bees selected as parents by means of a selection algorithm. This feature of queen-bee evolution reinforces the exploitation of genetic algorithms. That is, fitness
of the offsprings mainly rely on the crossover operation and
the fittest individual. Consequently, it also increases the
probability of premature convergence. Nonetheless, the second feature of QEGA helps genetic algorithms search new
space, i.e., it increases the exploration of genetic algorithms
through strong mutation. These two features enable genetic
algorithms to evolve quickly and simultaneously maintain
good solutions. In other words, the queen-bee evolution
makes it possible for genetic algorithms to quickly approach
the global optimum as well as decreasing the probability of
premature convergence.
CoverMean
age
fragment
length
Number
of fragments
26
3
3
2
2
7
4
5
6
6
5
6
7
7
4
182
1002
1001
1003
1003
1003
395
386
343
387
398
350
383
700
708
307
451
601
751
901
1049
39
48
66
68
127
173
177
352
442
5
703
773
Original
sequence
length
(in bps)
2170
147200
200741
329958
426840
156305
3835
10089
48502
77292
cussed as follows:
We have consulted different DNA sequence simulators like
FASIM [14], GenFrag [9] and Celsim [25]. In [14], the authors have showed various problems associated with Celsim
and GenFrag version 1.0 and 2.1. Mainly, these programs
are designed on the assumption that fragments are equally
distributed on genome. That is, clone sampling is uniformly
carried out during fragment generation. Thus, these programs do not sufficiently reflect actual conditions of WGSS
(Whole Genome Shotgun Sequencing). For FASIM, current
release of the software is only available upon request. On the
other hand, MetaSim is a freely available sequencing simulator for genomic and metagenomics. It can be utilized to
simulate fragments of real read experiments by incorporating errors which occur at the Layout phase, i.e., unknown
orientation, base call errors, incomplete coverage, repeated
regions, chimeras and contamination. Here we have used
three error models, namely, 454, Sanger and exact error
models considering configuration of all parameters. These
configurations are presented in Table 2.
Ordered two-point crossover has been used as crossover
operator in Step 7 of Algorithm 2. In this scheme, given
two parents, two random crossover points are selected partitioning them into a left, middle and right portion. Then
ordered crossover is carried out in the following way: Child
1 inherits its left and right sections from Parent 1, and its
middle section is determined by the fragments in the middle
section of Parent 1 in the order these appear in Parent 2. A
similar process is applied to determine Child 2. Notably, the
Ordered crossover operator is a pure recombination operator [33]. For strong mutation, we have used problem aware
local search (PALS) [6]. For normal mutation, we have done
random mutation on the offsprings.
4. EXPERIMENTS
4.2
In this section, we present our experimental setup and the
results obtained by our algorithms. The experiments were
conducted in Ubuntu 11.04 running on an Intel core i3 Processor with 2GB RAM.
As the exact orientation of the generated fragments are
not predictable, to tackle the problem of unknown orientation, we have checked fragment overlapping in both forward
and backward orientation during calculation of the score matrix. For example, let us assume that we want to calculate
overlap between Fragment 1 and Fragment 2. We first calculate normal overlap between these two segments (suffixprefix) and then compare the obtained value with the overlap value calculated from the reverse direction of Fragment
2 (suffix-suffix). We take the best value from the above two
options.
We have used GenFrag and MetaSim for generating noiseless and noisy artificial fragments respectively. For this, we
collected the DNA sequences from NCBI [2, 3, 4, 5]. We
give a summary on the different features of the datasets in
Table 1.
4.1 Noisy Input Data Generation
4.3
We have used MetaSim [29] for noisy input data generation. The motivation behind using Metasim is briefly dis-
Score Matrix Calculation
ABC Control Parameters
ABC algorithm has a few control parameters. We set
205
Table 2: Configuration of MetaSim for different error models and no. of generations used by the algorithms
Parameter Considered
Error Model
Number of reads or Mate pair
454, Sanger,
Exact
-do-do-do-do454
-do-do-
DNA clone size distribution type a
Mean
Second Parameter
Mate pair Probability
Expected read length
Mate pair read length
Number of flow cycles
Mean negative flow cyclesb
Std. deviation for negative flow cyclec
Signal std. deviation multiplier
Read length distribution type
Error rate at read start
Error rate at end of read
Insertion error rate
Deletion error rate
Insertion
Deletion
Substitutions
No. of base pair processed
No. of base pair generated
-do-do-doSanger
-do-do-do-do454
Sanger
454
Sanger
Sanger
454
Sanger
Exact
454
Sanger
Exact
acin1
307
acin2
451
acin3
601
acin5
751
190
10
1020
30
1001
1003
190
1020
1007
1007
75
400
395
395
1330
146
369
162
532
57857
56414
55915
58818
56398
55915
10460
1433
2758
1370
4095
458964
452401
461288
466666
452464
461288
13651
1643
3687
1679
5094
594676
567768
596036
604640
567732
596036
17146
2147
4438
2131
6409
738514
710617
755338
751222
710633
755338
Instances
acin7
acin9
901
1049
x60189 6
68
m15421 7
177
j02459 7
352
387
387
700
388
388
700
152
152
275
604
71
125
70
216
23719
23145
23719
24198
23146
24198
1571
157
388
165
537
63688
61276
69562
64871
61268
69562
5694
666
1469
621
1988
233127
223168
243163
237352
223213
243163
Normal
1003
1003
100
0
1005
1005
20
394
394
0.23
0.15
0.15
Normal
0.01
0.02
0.2
0.2
20882
23872
2480
2860
5396
6373
2467
2965
7642
8969
889923
1034189
852423
997708
903193
1050247
905409
1051688
852436
997603
903193
1050247
a
A Clone in MetaSim is a DNA fragment that is randomly extracted from the source genome sequence for read/mate-pair
sampling.
b
A negative flow is a flow of nucleotides in which the sequence to synthesize is not elongated. Light intensities of negative
flows follow a lognormal distribution. The default value (µ = 0.23) is taken from [10].
c
The default value (0.15) is taken from [10].
Maximum number of cycles (MCN), i.e., maximum number
of generation to 4000 and the colony size, i.e., population
size to 256. The percentage of onlooker bees was set to 50%
of the colony, the employed bees were 50% of the colony
and the number of scout bees was selected as one. Notably,
an increase in the number of scouts encourages the exploration whereas that in onlookers of a food source increases
the exploitation.
Table 3: Best final contig number and fitness for
noiseless data
Ins.
x1
x2
x3
x4
m 1
m 2
m 3
j1
b 1
b 2
a1
a2
a3
a5
a7
a9
4.4 Results
Now, we present the results obtained by our ABC algorithm and QEGA for solving different DNA instances. We
measure the quality of the solution by considering the fitness
value achieved as well as the number of contigs. The significance of contig calculation is to ensure that the solution
best represents a continuous assembled sequence. We have
tested our algorithms for both noisy and noiseless fragments.
For noiseless instances, we set a cutoff value, i.e., required
overlap between two adjacent fragments, to thirty. Table 3
shows the results obtained for noiseless instances by our algorithms as well as by simulated annealing (SA), Problem
aware local search (PALS) and Genetic algorithm (GA). In
Table 3, A, Q, S, P and G denotes ABC, QEGA, SA, PALS
and GA respectively. In this case we have found that our
implementation of ABC and QEGA perform very competitively with the best algorithm in the literature which is based
on simulated annealing (SA) for noiseless instances [23]. Similar behavior is observed in comparison to PALS [6], a defacto standard to compare genome assemblers, in all cases.
These experiments validate our algorithms’ good performance.
Moreover, QEGA algorithm performs better than ABC algorithm for most of the instances. Notably, QEGA produces
better results for longer instances. In terms of no. of con-
206
ABC
11478
14016
18339
21184
38423
47515
54607
114774
225431
440187
45403
142539
155413
144339
155182
321397
Best Fitness
QEGA
SA
PALS
11476
11478
11204
14027
14027
12898
18266
18301
16992
21208
21271
20424
38578
38583
36540
47882
48048
45773
55020
55048
51454
116222
116257
108744
227252
226538
211792
443600
436739
412374
47115
46955
44656
144133
144705
138423
156138
156630
151928
144541
146607
143835
155322
157984
155089
322768
324559
310235
GA
11478
13502
17688
20884
37714
46949
52695
111103
220558
416414
45565
143444
154947
145332
155873
313203
A
1
1
1
1
3
2
2
3
13
12
8
237
362
556
726
552
Best
Q
1
1
1
1
1
1
1
1
8
3
4
233
358
552
722
552
Contig
S
P
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
G
1
1
1
1
1
2
2
1
6
3
5
236
358
552
722
552
tigs, QEGA does a better job at contig reduction than the
ABC algorithm.
Next, We execute ABC and QEGA for noisy instances. It
is to be noted that, in noisy instances, a lot of insertions,
deletions or substitutions can occur. For instance, if we refer to Table 2, there are 5694 insertions, 1469 deletions and
1988 substitutions on j02459 7 instance in 454 error model.
So, empirically we set the cutoff value for noisy instances
to lower values but ensure overlapping adjacent fragments.
The results obtained by our algorithms (ABC and QEGA)
are presented in Table 4 along with three other algorithms
from [11] namely Genetic Algorithms (GA), Genetic Algorithm with Simulated Annealing (GA+SA)and Genetic Algorithm with Hill Climbing(GA+HC). We have given the
comparison of our result with [11] because the authors of [11]
have used the same error models and parameters as ours to
present results. More importantly, there is no way to compare our results with [23] because the authors of [23] have
only told that they have introduced random error but there
was no indication of how much error was being introduced.
As opposed to them, we have clearly mentioned the parameters of our noisy dataset generation in Table 2. Instead of
introducing random error in score matrix [23], we generated
fragments with three error models (454, Sanger and Exact).
We execute our algorithms on these instances. We also take
into consideration that DNA FAP is an off-line problem. So,
instead of emphasizing on execution time of the algorithms,
we concern ourself with assessment of the solution quality,
based on number of contigs and fitness value.
Table 4: Best fitness obtained by the algorithms for
noisy data
Instances
acin1
acin2
acin3
acin5
acin7
acin9
As can be seen from Table 4, for most of the instances,
QEGA outperforms all previously proposed algorithms as
well as the ABC algorithm because of the elitist selection
method imposed by QEGA. Our implementation of QEGA
breeds fittest parent (i.e. queen bee) with other parents.
The strong mutation in Step 12 of Algorithm 2 employs
PALS to get the fittest mutants from the offspring. So we
perform exploitation in the vicinity of the best solution of
the current cycle. On the other hand, although ABC algorithm performs analogous action of mutation, but it does
not do so with the fittest ones. Exploitation can sometime
leads to local minima and QEGA countermeasures this by
random mutation. All the fitness values for QEGA and ABC
algorithm shown in Table 4 are obtained for a single contig
i.e. one continuous sequence with some exceptions. These
exceptions can not reach single contig value and the final
contig value for each of these instances are shown in brackets just beside their fitness in Table 4. For GA, GA+HC
and GA+SA, the authors in [11] did not mention the final
contigs obtained by these algorithms. So we are unable to
comment about these three algorithms in terms of contig
value.
x60189 6
m15421 7
j02459 7
Figure 1 and 2 show the progressive improvement of fitness
values for the instance j02459 7 for noiseless and noisy data
by applying QEGA. As can be seen from the figures, final
fitness values for the noisy cases is significantly smaller than
that of noiseless cases. Figure 1 also shows that QEGA is
able to overcome any local minima as is evident from the
transition of lower fitness value at almost 1100-th iteration
to higher fitness value at almost 1200-th iteration.
Error
Model
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
454
Sanger
Exact
ABC
4323
7758
17354
2271
2981
20570
2960
3281
26611
3759
4223
14694
3759(2)
3964
15052(4)
4671
5204
31914(2)
353
609
2292
978
1429
7028
1670
2466
13353
Best
QEGA
4631
8512
160007(4)
2418
3142
20884
3170
3661
10605(237)
3994
4533
16182
4792
5177
18864
6315
7881
48154(107)
363
623
2303
1029
1467
7181
1776
2563
13607
Fitness
GA
162
178
214
260
264
267
326
344
374
405
430
431
490
516
522
608
623
625
41
36
36
88
97
93
183
188
179
GA+SA
3738
6744
14157
1778
2413
19140
1972
2199
12865
2220
2384
7069
2528
2651
8212
2838
2982
11816
358
622
2289
939
1380
6840
1429
2095
12659
GA+HC
4611
8394
17914
2208
2924
20677
2657
3095
13694
3113
3540
15001
3487
3817
16029
3828
4735
38262
362
624
2302
1026
1455
7170
1673
2443
13526
Figure 1: Iteration vs Fitness graph for j02459 7
noiseless data using QEGA
optimal fitness for noisy instances. We also wish to implement the parallel versions of our algorithms.
6.
5. CONCLUSION
REFERENCES
[1] http://ab.inf.uni-tuebingen.de/teaching/ss07/
albi2/script/readsim_2July2007.pdf. [Online;
accessed 16-November-2011].
[2] http://www.ncbi.nlm.nih.gov/nuccore/178817?
report=fasta(M15421.1). [Online; accessed
20-November-2011].
[3] http://www.ncbi.nlm.nih.gov/nuccore/34645?
report=fasta(X60189.1. [Online; accessed
20-November-2011].
[4] http://www.ncbi.nlm.nih.gov/nuccore/215104?
report=fasta(J02459.1. [Online; accessed
20-November-2011].
[5] http://0-www.ncbi.nlm.nih.gov.ilsprod.lib.neu.
edu/Traces/wgs/?hide\_master\\*=\&val=
ACIN02000001\#9(ACIN\_ALL). [Online; accessed
20-November-2011].
In this paper, we have used Artificial Bee Colony (ABC)
algorithm and Queen-bee Evaluation based on Genetic Algorithm (QEGA) to tackle the DNA fragment assembly problem for noisy and noiseless instances. In particular, we have
recorded the results obtained by our algorithm, taking into
consideration various standard error models normally used
by DNA sequencing experiments. Previous works on DNA
FAP report their results for noiseless cases mostly. We have
observed that, although our algorithms perform very well for
noiseless data just like others, for noisy instances their obtained fitness value decrease significantly for different error
models. We have pointed out the fact that errors in DNA
fragments significantly impact the performance of standard
algorithms. Future works should take this into account.
In future, we intend to design algorithms to achieve near-
207
[18]
[19]
[20]
Figure 2: Iteration vs Fitness graph for j02459 7
noisy data(454 error model) using QEGA
[21]
[22]
[6] E. Alba and G. Luque. A new local search algorithm
for the DNA fragment assembly problem. In
Evolutionary Computation in Combinatorial
Optimization, EvoCOP’07, Lecture Notes in
Computer Science 4446, pages 1–12, Valencia, Spain,
April 2007. Springer.
[7] T. Chen and S. S. Skiena. Trie-based data structures
for sequence assembly. In The Eighth Symposium on
Combinatorial Pattern Matching, pages 206–223.
Springer-Verlag, 1997.
[8] B. Dorronsoro, E. Alba, G. Luque, and P. Bouvry. A
self-adaptive cellular memetic algorithm for the DNA
fragment assembly problem. In IEEE Congress on
Evolutionary Computation, pages 2651–2658, 2008.
[9] M. L. Engle and C. Burks. Artificially generated data
sets for testing DNA sequence assembly algorithms.
Genomics, 16(1):286–8, 1993.
[10] M. M. et al. Genome sequencing in microfabricated
high-density picolitre reactors. Nature,
437(7057):376–80, 2005.
[11] J. S. Firoz, M. S. Rahman, and T. k. Saha. Analyzing
memetic algorithm behavior in DNA fragment
assembly problem for noisy data. In 4th International
Conference on Bioinformatics and Computational
Biology (BICoB),Las Vegas, Nevada, USA, 2012.
[12] P. Green. Phrap. http://www.phrap.org/. [Online;
accessed 20-November-2011].
[13] X. Huang and A. Madan. Cap3: A DNA sequence
assembly program. Genome Research, 9:868–877, 1999.
[14] C. G. Hur. Fasim: Fragments assembly simulation
using biased-sampling model and assembly simulation
for microbial genome shotgun sequencing.
[15] S. H. Jung. Queen-bee evolution for genetic
algorithms. Electronics Letters, 39:575–576, March
2003.
[16] D. Karaboga and B. Basturk. A powerful and efficient
algorithm for numerical function optimization:
artificial bee colony (abc) algorithm. J. of Global
Optimization, 39:459–471, November 2007.
[17] J. Kubalik, P. Buryan, and L. Wagner. Solving the
dna fragment assembly problem efficiently using
iterative optimization with evolved hypermutations. In
Proceedings of the 12th annual conference on Genetic
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
208
and evolutionary computation, GECCO ’10, pages
213–214, New York, NY, USA, 2010. ACM.
G. Luque and E. Alba. Metaheuristics for the DNA
Fragment Assembly Problem. International Journal of
Computational Intelligence Research, 1(1-2):98–108,
2005.
M. Margulies, M. Egholm, W. E. Altman, S. Attiya,
J. S. Bader, L. A. Bemben, J. Berka, M. S.
Braverman, Y.-J. Chen, Z. Chen, and et al. Genome
sequencing in microfabricated high-density picolitre
reactors. Nature, 437(7057):376–380, 2005.
P. Meksangsouy and N. Chaiyaratana. DNA fragment
assembly using an ant colony system algorithm. In
Proceedings of Congress on Evolutionary
Computation, volume 3, pages 1756–1763, 2003.
D. Meldrum. Automation for genomics, part one:
Preparation for sequencing. Genome Research,
10(8):1081–1092, 2000.
D. Meldrum. Automation for genomics, part two:
Sequencers, microarrays, and future trends. Genome
Research, 10(9):1288–1303, 2000.
G. Minetti and E. Alba. Metaheuristic assemblers of
DNA strands: Noiseless and noisy cases. In IEEE
Congress on Evolutionary Computation’10, pages 1–8,
2010.
E. W. Myers. Toward simplifying and accurately
formulating fragment assembly. Journal of
Computational Biology, 2:275–290, 1995.
G. Myers. A dataset generator for whole genome
shotgun sequencing. In Proceedings of the Seventh
International Conference on Intelligent Systems for
Molecular Biology, pages 202–210. AAAI Press, 1999.
P. Pevzner. Computational molecular biology: An
algorithmic approach. MIT Press, 2000.
L. Qin, Q. Jiang, Z. Zou, and Y. Cao. A queen-bee
evolution based on genetic algorithm for economic
power dispatch. In 39th International Universities
Power Engineering Conference, volume 1, pages
453–456, September 2004.
D. R and Bentley. Whole-genome re-sequencing.
Current Opinion in Genetics and Development,
16(6):545 – 552, 2006.
D. C. Richter, F. Ott, A. F. Auch, R. Schmid, and
D. H. Huson. Metasim:a sequencing simulator for
genomics and metagenomics. PLoS ONE,
3(10):e3373+, 2008.
J. Setubal and J. Meidanis. Fragment assembly of
DNA. Introduction to Computational Molecular
Biology, pages 105–139, 1997.
J. C. Setubal and J. Meidanis. Introduction to
Computational Molecular Biology. PWS Publishing
Company, 1997.
G. Sutton, W. O., A. M.D., and K. A.R. Tigr
assembler: A new tool for assembling large shotgun
sequencing projects. Genome Science and Tech, pages
9–19, 1995.
E. G. Talbi. Metaheuristics from design to
implementation. WILEY, 2009.
Download