PARALLEL IMPLEMENTATIONS OF
GENETIC ALGORITHMS FOR PARAMETERS
ESTIMATE OF METABOLIC NETWORKS
Giuseppe Aprea1, Grazia Licciardello2, Vittorio Rosato3
1 ENEA,
Portici Research Center, CRESCO Project, Via del Vecchio Macello, 80055 Portici (Naples), Italy. Email:
giuseppe.aprea@gmail.com
2 Science and Technology Park of Sicily, Stradale V. Lancia 57, z.i. Blocco Palma I, 95121 Catania, Italy. Email:
gralicci@unict.it
3 ENEA, Casaccia Research Center, Computing and Modelling Unit, Via Anguillarese 301, 00123 S.Maria di
Galeria, Italy. Email: rosato@casaccia.enea.it
Introduction
Advances in -omic sciences produce large amounts of data that need analysis and
interpretation. Reliable explanations of how processes are regulated require an accurate
modeling approach at the systems level. In this paper we focus on metabolic network
models, which reproduce the time evolution of all the metabolites. Quite often these models
rely on several unknown parameters which have to be estimated from experimental data.
This task amounts to solving an inverse problem, which requires an
efficient optimization algorithm. The genetic algorithm (GA)
[1] is a widely known optimum search method which yields reliable values for model
parameters, at the price of a large computational demand. Our aim is to develop a parallel
implementation of GA-based parameter estimation that takes full advantage of large
computational facilities such as the one set up at the ENEA Portici site.
The genetic algorithm: Basics
In the GA analogy applied to a biochemical network, a set K = {k_i, i = 1, ..., m} of unknown
parameters is called a genome. Each such array of missing constants gives rise to a
different behavior of the network, that is, to different time evolutions of the metabolites'
concentrations. The network with its genome and its behavior together constitute an
individual, and a group of individuals is a population. As in the case of populations of
organisms in nature, GA populations undergo selection, with a good fit to the experimental
data as the selection criterion. After every selection stage, a new population is
created; the current generation is over and a new one is ready. After a large number of
generations, GA is expected to yield the individuals which best fit the experimental data.
The genetic algorithm: Computational scheme
GA is implemented through the following steps:
1. a number I of genomes is generated at random, each differing from the others in
the values of the m unknown constants. This is the starting population.
2. for each individual, the time evolution of the metabolites' concentration is calculated
with the E-Cell tool[2][3].
3. the metabolites' concentrations are used to evaluate, for each individual, the corresponding
value of the cost function: F_k = … x_i(τ_{j+1}) … x_i(τ_j) …, (k = 1, ..., I); a good fit to the data implies
a small F_k and vice versa.
4. mating is performed by selecting individuals of the current generation according to
the cost function F_k: the smaller the cost, the more likely an individual is to be selected.
Mutation is also added to yield the final offspring. After mating and mutation, a new generation is
ready and the cycle restarts from point (2) unless one of the following
conditions is met, in which case the procedure ends:
• the best value of the cost function becomes smaller than a fixed threshold;
• a pre-defined maximum number of generations has been reached.
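The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the E-Cell simulation is replaced by a toy analytic time course, the cost function is assumed to be a sum of squared deviations from the data, and all names and numerical settings (simulate, TRUE_GENOME, BOUNDS, the mutation settings, the elitism step) are hypothetical.

```python
import math
import random

TIMES = [0.5 * j for j in range(1, 21)]   # hypothetical sampling times
TRUE_GENOME = (2.0, 0.7)                  # hypothetical "true" constants
BOUNDS = [(0.1, 5.0), (0.05, 2.0)]        # hypothetical search ranges


def simulate(genome):
    """Toy stand-in for the E-Cell run: a saturating time course."""
    amplitude, rate = genome
    return [amplitude * (1.0 - math.exp(-rate * t)) for t in TIMES]


DATA = simulate(TRUE_GENOME)              # synthetic "experimental" data


def cost(genome):
    """Assumed cost: sum of squared deviations from DATA (smaller is better)."""
    return sum((x - d) ** 2 for x, d in zip(simulate(genome), DATA))


def random_genome(rng):
    return tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS)


def select(population, costs, rng):
    """Pick one parent; a smaller cost gives a larger selection weight."""
    weights = [1.0 / (c + 1e-12) for c in costs]
    return rng.choices(population, weights=weights, k=1)[0]


def mate(a, b, rng):
    """Uniform crossover: each constant is copied from one of the parents."""
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))


def mutate(genome, rng, prob=0.2, scale=0.05):
    """Perturb each constant with probability prob, clipped to its bounds."""
    out = []
    for value, (lo, hi) in zip(genome, BOUNDS):
        if rng.random() < prob:
            value += rng.gauss(0.0, scale * (hi - lo))
        out.append(min(hi, max(lo, value)))
    return tuple(out)


def run_ga(size=40, max_generations=200, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(size)]    # step 1
    for generation in range(1, max_generations + 1):
        costs = [cost(g) for g in population]                 # steps 2 and 3
        best_cost, best = min(zip(costs, population))
        if best_cost < threshold:                             # first stop condition
            break
        offspring = [best]          # elitism: an added choice, not from the text
        while len(offspring) < size:                          # step 4
            child = mate(select(population, costs, rng),
                         select(population, costs, rng), rng)
            offspring.append(mutate(child, rng))
        population = offspring
    # Leaving the loop without a break is the second stop condition.
    return best, best_cost, generation


best, best_cost, generations = run_ga()
print(best, best_cost, generations)
```

Keeping the best individual unchanged in each new generation guarantees that the best cost never increases; it is a common design choice, but the text does not state whether it was adopted here.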
GA and specific computational architectures
The efficiency of the GA implementation stems from the choice of a number of parameters:
the number of individuals in a population, the number of generations and the specific
values of the parameters used for selection, mating and mutation processes.
Another important way to reduce the overall execution time is parallelizing the code, for
example the computation of the fitness of individuals belonging to the same generation. In
this work we suggest efficient implementations of GA tailored to specific computing
architectures:
• a single cluster architecture (SCA), characterized by many computational nodes
tightly interconnected through a low-latency, high-bandwidth network;
• a GRID (or multicluster architecture), characterized by a number of SCAs linked
through a Wide Area Network.
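The parallel evaluation of the fitness within one generation can be illustrated with a short Python sketch. It is given under explicit assumptions: a thread pool stands in for the computing nodes of a cluster, the cost function is a hypothetical placeholder for one simulation run, and the function names (split_evenly, evaluate_chunk, evaluate_generation) are introduced here for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def cost(genome):
    """Hypothetical placeholder for one fitness evaluation (one simulation)."""
    return sum((k - 1.0) ** 2 for k in genome)


def split_evenly(population, n_nodes):
    """Master side: allot the same number of genomes to every node."""
    if len(population) % n_nodes:
        raise ValueError("population size must be a multiple of n_nodes")
    share = len(population) // n_nodes
    return [population[i * share:(i + 1) * share] for i in range(n_nodes)]


def evaluate_chunk(chunk):
    """Worker side: evaluate the cost of every genome in the allotted chunk."""
    return [cost(g) for g in chunk]


def evaluate_generation(population, n_nodes):
    """Scatter the chunks to the workers, then gather the costs in order."""
    chunks = split_evenly(population, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = list(pool.map(evaluate_chunk, chunks))  # map keeps the order
    return [c for chunk_costs in results for c in chunk_costs]


population = [(0.1 * i, 2.0 - 0.1 * i) for i in range(12)]
costs = evaluate_generation(population, n_nodes=4)
print(costs)
```

Threads are used here only to keep the sketch self-contained; on a real cluster the same scatter and gather pattern would be carried by separate processes on separate nodes.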
The computational problem has thus been mapped onto each computing architecture by
adopting a different implementation of the method:
• on a SCA we propose a single-stage parallelism approach. A master node
generates an instance of a population of N individuals and then allots the same
number of fitness evaluations to each of the computing nodes. When the fitness
of all the genomes has been evaluated, the master
node gathers all data from the other computing nodes and performs selection,
mating and mutation to provide new offspring, which is subsequently re-allotted, as
before, to the computing nodes. The procedure is repeated for n_G generations.
• in the GRID case, the computing architecture suggests a different
implementation of the method. Here the single larger generation of
SCA is replaced by a number β of smaller "secondary generations" (SG), each
composed of N/β genomes. The master node allots the SGs to different
secondary master nodes, each of which carries out the procedure described above for
SCA. This procedure can be seen as the definition of many "islands" where subsets
of a generation are allowed to evolve. Each "island" evolves new generations
independently, on the basis of its initial genomes. After a number of generations
n_G, the islands are allowed to exchange their resulting genomes. The master node
gathers all data from the SGs, ranks their best genomes and sends back those with
the lowest cost function. Then each SG generates a new seed population, taking
half of it from the best-fitting genomes received from the master and generating
the other half at random. This procedure is iterated n_S times, so that the total
number of SG generations is N_T = n_G · n_S. In this way two main results are
achieved: the first is to allow cooperation among SGs, which can exchange their best
individuals, and the second is to keep diversity, to avoid being trapped in local
minima.
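The island scheme described for the GRID case can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code used in this work: the islands are evolved one after another in a single process, the cost function is a toy quadratic with a hypothetical optimum (TARGET), the master ranks all gathered genomes rather than only the best of each island, and the selection, crossover and mutation rules inside evolve are arbitrary choices.

```python
import random

BOUNDS = [(-5.0, 5.0)] * 3      # hypothetical search range for each constant
TARGET = (1.0, -2.0, 0.5)       # hypothetical optimum of the toy cost


def cost(genome):
    """Toy cost with a single minimum at TARGET (stand-in for the data misfit)."""
    return sum((k - t) ** 2 for k, t in zip(genome, TARGET))


def random_genome(rng):
    return tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS)


def evolve(population, n_generations, rng):
    """Evolve one island independently for n_generations."""
    size = len(population)
    for _ in range(n_generations):
        ranked = sorted(population, key=cost)
        parents = ranked[: max(2, size // 2)]       # truncation selection
        children = [ranked[0]]                      # keep the island's best
        while len(children) < size:
            a, b = rng.sample(parents, 2)
            child = tuple(rng.choice(pair) + rng.gauss(0.0, 0.1)
                          for pair in zip(a, b))    # crossover plus mutation
            children.append(child)
        population = children
    return population


def island_ga(n_total=40, beta=4, n_g=10, n_s=5, seed=0):
    rng = random.Random(seed)
    size = n_total // beta                          # N / beta genomes per island
    islands = [[random_genome(rng) for _ in range(size)] for _ in range(beta)]
    history = []
    for _ in range(n_s):
        islands = [evolve(pop, n_g, rng) for pop in islands]
        # Master: gather the genomes of all islands and rank them by cost.
        gathered = sorted((g for pop in islands for g in pop), key=cost)
        elite = gathered[: size // 2]
        history.append(cost(elite[0]))
        # Each island reseeds: half from the elite, half random newcomers.
        islands = [elite + [random_genome(rng) for _ in range(size - len(elite))]
                   for _ in islands]
    return gathered[0], history, n_g * n_s          # N_T = n_G * n_S


best, history, n_total_generations = island_ga()
print(best, history, n_total_generations)
```

With the default settings each island runs n_G = 10 generations between exchanges and there are n_S = 5 exchanges, so N_T = 50. The random half of each reseeded island is what keeps diversity in the search.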
Details of the models used for computations
We have considered different networks whose topology and kinetic constants are known.
We have generated the time courses of a number of reaction products, which are then used
as if they were experimental results. We have then "hidden" some of the kinetic constants
and asked the method to recover their values. As for the networks used, the first describes
the circadian rhythm in Drosophila [4], while the second is related to PHA
(polyhydroxyalkanoate) production in
Pseudomonas. The two cases have been chosen as paradigmatic of different behaviors: in
the first case, the method should be able to recover the frequency and amplitudes of the
experimental observations; in the second, the large initial production rate and the
subsequent time scale for reaching the asymptotic flat behavior.
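The test protocol described above (generate data from known constants, hide them, score candidate values against the data) can be illustrated with a small Python sketch. The two toy time courses below are hypothetical stand-ins for the two networks, chosen only to mimic a periodic and a saturating behavior; the sum-of-squares misfit and all numerical values are assumptions made for the example.

```python
import math

TIMES = [0.25 * j for j in range(41)]     # hypothetical sampling times


def oscillating(amplitude, frequency):
    """Toy periodic time course (stand-in for the circadian-rhythm network)."""
    return [amplitude * math.sin(2.0 * math.pi * frequency * t) for t in TIMES]


def saturating(plateau, rate):
    """Toy fast rise followed by a plateau (stand-in for PHA production)."""
    return [plateau * (1.0 - math.exp(-rate * t)) for t in TIMES]


def misfit(model, guess, data):
    """Assumed cost: sum of squared differences between model and data."""
    return sum((x - d) ** 2 for x, d in zip(model(*guess), data))


# 1. Generate the "experimental" data from known constants.
true_osc, true_sat = (1.5, 0.2), (3.0, 0.8)
data_osc, data_sat = oscillating(*true_osc), saturating(*true_sat)

# 2. "Hide" the constants and score candidate values against the data.
for guess in [(1.5, 0.2), (1.5, 0.25), (1.0, 0.2)]:
    print("oscillating", guess, misfit(oscillating, guess, data_osc))
for guess in [(3.0, 0.8), (3.0, 0.4), (2.0, 0.8)]:
    print("saturating ", guess, misfit(saturating, guess, data_sat))
```

Because the data are generated by the model itself, the misfit vanishes at the "true" constants, which gives an unambiguous target against which the recovered values can be checked.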
Results and discussion
In this section we report the results obtained with the two implementations of the method for the
two metabolic networks. The numerical results of our GA implementations are given in
Table 1.
                                           Solutions from the different approaches
Model             Parameter [Solution]     SCA          GRID          GRID          GRID
                                                        (N/β = 20)    (N/β = 40)    (N/β = 60)
PHA metabolism    k1pha [10.0 sec^-1]      9.9998       9.6165        10.0118       10.0000
                  k2pha [1.21e-05 M]       1.1986e-05   1.2494e-05    1.2100e-05    1.2099e-05
                  k3pha [7.2e-04 M]        27.1001e-04  16.1585e-04   7.1986e-04    6.4814e-04
Circadian rhythm  k1period [200 M^-1]      199.9998     199.9978      199.9888      200.0003
                  k2period [10 sec^-1]     10.0000      10.0000       10.0001       9.99999
Table 1: Numerical results for the hidden parameters. In the second column, within
brackets, we report the "true" value of the parameter.
The results highlight the following points:
• both the SCA and the GRID implementations estimate numerical values of
the kinetic constants very close to the "true" ones, and these values reproduce the
"experimental" data to a very good approximation. The k3pha parameter is the only
one for which the approaches produce results that differ both from one another and
from the "true" value. Further investigation has shown the existence of an (almost) flat optimum for
this parameter, i.e. the resulting solutions are substantially independent
of the absolute value of this parameter;
• when compared at an equal number of genomes processed, the SCA implementation
performed better than the GRID one, both in minimizing the cost
function and in the number of genomes needed to reach the minimum. This is not
very surprising, because GA is known to work better with larger populations;
• the amplitude of the maximum fluctuation of the cost function away from the minimum
of the best genome of each generation in the SCA approach is about 1.5 to 2 times
larger than that of the best genome from the set of parallel generations of the GRID
implementation. This also follows from the ability of a larger population to
span the parameter space efficiently.
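An (almost) flat optimum of the kind found for one of the parameters can be illustrated with a one-dimensional scan of the cost function. The Python sketch below is not the investigation carried out in this work: it uses a hypothetical saturable rate law in which one constant (km) has little influence on the output, an assumed sum-of-squares cost, and arbitrary numerical values, and it compares the cost along km with the cost along a well-determined constant (vmax).

```python
SUBSTRATE = [50.0, 100.0, 200.0, 400.0]   # hypothetical, all much larger than km
TRUE = {"vmax": 10.0, "km": 0.5}          # hypothetical "true" constants


def model(vmax, km):
    """Toy saturable rate law; nearly independent of km when s >> km."""
    return [vmax * s / (km + s) for s in SUBSTRATE]


DATA = model(**TRUE)                      # synthetic "experimental" data


def cost(vmax, km):
    """Assumed cost: sum of squared deviations from DATA."""
    return sum((x - d) ** 2 for x, d in zip(model(vmax, km), DATA))


def scan(name, factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Cost along one parameter, with the other held at its true value."""
    values = []
    for f in factors:
        params = dict(TRUE)
        params[name] = TRUE[name] * f
        values.append(cost(**params))
    return values


km_scan, vmax_scan = scan("km"), scan("vmax")
print("km scan:  ", km_scan)
print("vmax scan:", vmax_scan)
```

In this toy case the cost changes very little when km is varied by a factor of 16, while it grows rapidly along vmax, so a search driven by the cost alone can return very different values of km for almost the same quality of fit.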
Conclusions
In this work, we have attempted to show that GA is a powerful tool for providing good estimates
of unknown kinetic parameters in metabolic networks. It has been used in combination
with the E-Cell tool, which is used in the evaluation of the cost function. Although GA is
computationally expensive, its workflow lends itself to parallelization on multi-processor
computing architectures. We have
• provided evidence of its ability to retrieve correct estimates when used to
discover unknown kinetic constants in metabolic networks; evidence has been
provided for different cases, in which the time behavior of the products in the
networks ranges from exponential decay, as in the case of PHA metabolism in
Pseudomonas, to oscillating behavior, as in the case of the circadian rhythm in
Drosophila;
• designed two different parallel implementations of the method, to be used on
different classes of computing architectures: large homogeneous clusters with fast
interconnection (SCA), and GRID architectures, characterized by loosely coupled
computing elements.
The SCA implementation relies on fast internode communication, which allows a
rapid and efficient flow of information between a master node and the computation nodes.
The GRID implementation, in turn, introduces a double level of parallelism: each
computational node handles the evolution of a subset of the whole population, as the loose
interconnection does not allow every node to communicate with a single master at
the end of each generation. The different population subsets exchange data only at a
given frequency.
We have tested both the SCA and the GRID implementations on two different cases: in
the first, we wanted to reproduce a periodic time behavior of the metabolites' concentrations,
while in the second we faced a typical slowly-decaying behavior, i.e. an initial increase
followed by a slow saturation. In both cases the parallel procedures were able to provide
fully satisfactory results from both a qualitative and, more importantly, a quantitative
point of view. The SCA implementation, using a single, large population, is faster than the
GRID one in reaching its optimum, even in the case where we compare the performance
of the latter with a SCA implementation with a smaller population.
We are working on a further implementation of the method which combines features of both
the SCA and the GRID approaches. In this approach, we will force successive generations to
reach higher population densities in the parameter space, by allowing them to explore
just a subset of the whole parameter space. At programmable intervals of generations, a
number of subsets with the worst results are discarded and further generations are
constrained to make a fine-grained exploration of the region(s) of the parameter space
where better solutions have been obtained. We expect this approach to be better at
approaching the correct solution and to overcome the instability problems, due to the
possible presence of more than one stable minimum, which plague the SCA approach.
Moreover, we will apply this method to further cases of technological relevance. The
proposed method can be run in conjunction with methods that allow the network to be
designed [5], in order to ultimately provide reliable computational models of metabolic
pathways to be used for metabolomic studies and to provide in-silico derived insights for
bacterial metabolic engineering.
References
[1] Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley Longman Publishing Company, Boston (MA) USA 1989.
[2] E-cell [http://www.e-cell.org/ecell].
[3] Tomita M, Hashimoto K, Takahashi K, Shimizu T, Matsuzaki Y, Miyoshi F, Saito K,
Tanida S, Yugi K, Venter J, Hutchison C: Bioinformatics 1999, 15:72-84.
[4] Tyson J, Hong C, Thron C, Novak B: Biophys J. 1999, 77(5):2411.
[5] DREAM project [http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project]