PARALLEL IMPLEMENTATIONS OF GENETIC ALGORITHMS FOR PARAMETERS ESTIMATE OF METABOLIC NETWORKS

Giuseppe Aprea1, Grazia Licciardello2, Vittorio Rosato3

1 ENEA, Portici Research Center, CRESCO Project, Via del Vecchio Macello, 80055 Portici (Naples), Italy. Email: giuseppe.aprea@gmail.com
2 Science and Technology Park of Sicily, Stradale V. Lancia 57, z.i. Blocco Palma I, 95121 Catania, Italy. Email: gralicci@unict.it
3 ENEA, Casaccia Research Center, Computing and Modelling Unit, Via Anguillarese 301, 00123 S.Maria di Galeria, Italy. Email: rosato@casaccia.enea.it

Introduction

Advances in -omic sciences produce large amounts of data that need analysis and interpretation. Reliable explanations of how processes are regulated require an accurate modeling approach at the systems level. In this paper we focus on metabolic network models that reproduce the time evolution of all the metabolites. Quite often these models rely on several unknown parameters which have to be estimated from experimental data. This task amounts to solving an inverse problem, which requires an efficient optimization algorithm. The genetic algorithm (GA) [1] is a widely known optimum search method which yields reliable values for model parameters, at the price of a large computational demand. Our aim is to develop a parallel implementation of GA-based parameter estimation that takes full advantage of large computational facilities such as the one set up at the ENEA Portici site.

The genetic algorithm: Basics

In the GA analogy, the set K = {k_i, i = 1, ..., m} of unknown parameters of a biochemical network is called a genome. Each choice of this array of missing constants gives rise to a different behavior of the network, that is, to different time evolutions of the metabolites' concentrations. The network with its genome and its behavior together constitute an individual, and a group of individuals is a population.
As in the case of populations of organisms in nature, GA populations undergo a selection in which a good fit to the experimental data is the selection criterion. After every selection stage a new population is created: the current generation is over and a new one is ready. After a large number of generations, GA is expected to yield the individuals which best fit the experimental data.

The genetic algorithm: Computational scheme

GA is implemented following these steps:

1. A number I of random genomes is chosen, each differing from the others in the values of the m unknown constants. This is the starting population.
2. For each individual, the time evolution of the metabolites' concentrations is calculated with the E-Cell tool [2][3].
3. The metabolites' concentrations are used to evaluate, for each individual, the corresponding value of the cost function F_k (k = 1, ..., I), which is built from the concentrations x_i(τ_j) and x_i(τ_{j+1}) at the times τ_j and τ_{j+1}; a good fit to the data implies a small F_k, and vice versa.
4. Mating is performed by selecting the individuals of the current generation according to the cost function F_k: the smaller F_k, the more likely the individual is to be selected. Mutation is also applied to yield the final offspring.

After mating and mutation, a new generation is ready and the cycle restarts from step 2, unless one of the following conditions is met, in which case the procedure ends:

- the best value of the cost function becomes smaller than a fixed threshold;
- a pre-defined maximum number of generations has been reached.

GA and specific computational architectures

The efficiency of a GA implementation stems from the choice of a number of parameters: the number of individuals in a population, the number of generations, and the specific values of the parameters used for the selection, mating and mutation processes. Another important way to reduce the overall execution time is to parallelize the code, for example the computation of the fitness of the individuals belonging to the same generation.
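
The computational scheme above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the toy decay model `simulate` stands in for an E-Cell run, the least-squares `cost` is an assumed form of the cost function, and the tournament selection, uniform crossover, Gaussian mutation, elitism and all numerical settings are hypothetical choices. A thread pool stands in for the computing nodes that receive the fitness evaluations.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

TIMES = [0.5 * j for j in range(1, 11)]  # assumed sampling times tau_j
TRUE_K = (1.5, 0.4)                      # "hidden" constants of the toy model


def simulate(k):
    """Toy stand-in for an E-Cell run: x(t) = k[0] * exp(-k[1] * t)."""
    return [k[0] * math.exp(-k[1] * t) for t in TIMES]


DATA = simulate(TRUE_K)                  # synthetic "experimental" time course


def cost(k):
    """Assumed least-squares misfit between simulated and reference data."""
    return sum((x - d) ** 2 for x, d in zip(simulate(k), DATA))


def select(population, costs, rng):
    """Binary tournament: of two random genomes, the lower-cost one wins."""
    a, b = rng.randrange(len(population)), rng.randrange(len(population))
    return population[a] if costs[a] <= costs[b] else population[b]


def run_ga(n_individuals=40, n_generations=60, threshold=1e-6,
           mutation_scale=0.1, workers=4, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting population of genomes.
    population = [(rng.uniform(0, 3), rng.uniform(0, 1))
                  for _ in range(n_individuals)]
    best, best_cost = None, float("inf")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(n_generations):   # stop: generation limit reached
            # Steps 2-3: evaluations are independent, so they are farmed out.
            costs = list(pool.map(cost, population))
            i_best = min(range(n_individuals), key=costs.__getitem__)
            if costs[i_best] < best_cost:
                best, best_cost = population[i_best], costs[i_best]
            if best_cost < threshold:    # stop: cost below the threshold
                break
            # Step 4: mating (uniform crossover) followed by mutation.
            offspring = [best]           # elitism, an assumption of this sketch
            while len(offspring) < n_individuals:
                p1 = select(population, costs, rng)
                p2 = select(population, costs, rng)
                child = tuple(rng.choice(pair) + rng.gauss(0, mutation_scale)
                              for pair in zip(p1, p2))
                offspring.append(child)
            population = offspring
    return best, best_cost
```

Calling `run_ga()` returns the best genome found and its cost. Because Python threads do not run CPU-bound code concurrently, a real speed-up would need a process pool or a message-passing layer; the thread pool is used here only to show where the parallel step sits in the loop.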
In this work we suggest efficient implementations of GA for two specific computing architectures:

- a single cluster architecture (SCA), characterized by many computational nodes tightly interconnected through a low-latency, high-bandwidth network;
- a GRID (or multicluster architecture), characterized by a number of SCAs linked through a Wide Area Network.

The computational problem has thus been mapped onto each computing architecture by adopting a different implementation of the method.

On an SCA we propose a single-stage parallelism approach. A master node generates a population of N individuals and allots the same number of fitness evaluations to each of the computing nodes. When the fitness of all the genomes has been evaluated, the master node gathers all data from the computing nodes and performs selection, mating and mutation to produce new offspring, which is then re-allotted, as before, to the computing nodes. The procedure is repeated for nG generations.

In the GRID case, the computing architecture suggests a different implementation of the method. Here the single large generation of the SCA is replaced by a number β of smaller "secondary generations" (SG), each composed of N/β genomes. The master node allots the SGs to different secondary master nodes, each of which carries out the procedure described above for the SCA. This procedure can be seen as the definition of many "islands" where subsets of a generation are allowed to evolve. Each "island" evolves new generations independently, starting from its own initial genomes. After nG generations, the islands are allowed to exchange their resulting genomes: the master node gathers all data from the SGs, ranks their best genomes and sends back those with the lowest cost function.
Then each SG generates a new seed population, taking half of its individuals from the best-fitting genomes received from the master and generating the other half at random. This procedure is iterated nS times, so that the total number of SG generations is NT = nG · nS. In this way two main results are achieved: the first is cooperation among SGs, which can exchange their best individuals, and the second is the preservation of diversity, which helps to avoid being trapped in local minima.

Details of the models used for computations

We have considered different networks whose topology and kinetic constants are known. We have generated the time courses of a number of reaction products, which are then used as if they were experimental results. We have then "hidden" some of the kinetic constants and asked the method to recover their values. As for the networks used, the first describes the circadian rhythm in Drosophila [4], while the second is related to PHA (polyhydroxyalkanoate) production in Pseudomonas. The two cases have been chosen as paradigmatic of different behaviors: in the first case, the method should be able to recover the frequency and amplitudes of the experimental observations; in the second, the large initial production rate and the subsequent time scale for reaching the asymptotic flat behavior.

Results and discussion

In this section we report the results obtained with the two implementations of the method for the two metabolic networks. In Table 1 we report the numerical results of our GA implementations.

Solutions from the different approaches

Model             Parameter [Solution]    SCA          GRID (N/β = 20)  GRID (N/β = 40)  GRID (N/β = 60)
PHA metabolism    k1pha [10.0 sec^-1]     9.9998       9.6165           10.0118          10.0000
                  k2pha [1.21e-05 M]      1.1986e-05   1.2494e-05       1.2100e-05       1.2099e-05
                  k3pha [7.2e-04 M]       27.1001e-04  16.1585e-04      7.1986e-04       6.4814e-04
Circadian rhythm  k1period [200 M^-1]     199.9998     199.9978         199.9888         200.0003
                  k2period [10 sec^-1]    10.0000      10.0000          10.0001          9.99999

Table 1: Numerical results for the hidden parameters. In the second column, within brackets, we report the "true" value of the parameter.

The results highlight the following points:

- Both the SCA and the GRID implementations estimate numerical values for the kinetic constants that are very close to the "true" ones and that reproduce the "experimental" data to a very good approximation. The k3pha parameter is the only one for which the two approaches produce results that differ from each other and from the "true" value. Further investigation has shown the existence of an (almost) flat optimum for this parameter, i.e. the resulting solutions are substantially independent of its absolute value.
- When compared at an equal number of genomes processed, the SCA implementation performed better than the GRID one, both in minimizing the cost function and in the number of genomes needed to reach the minimum. This is not surprising, because GA is known to work better with larger populations.
- The amplitude of the maximum fluctuation of the cost function away from the minimum, for the best genome of each generation in the SCA approach, is about 1.5 to 2 times larger than that of the best genome from the set of parallel generations of the GRID implementation. This also follows from the ability of a larger population to span the parameter space efficiently.

Conclusions

In this work, we have attempted to show that GA is a powerful tool for providing good guesses of unknown kinetic parameters in metabolic networks.
It has been used in combination with the E-Cell tool, which provides the evaluation of the cost function. Although it is computationally expensive, its workflow lends itself to parallelization on multi-processor computing architectures. We have:

- provided evidence of its effective ability to retrieve correct guesses when used to discover unknown kinetic constants in metabolic networks; evidence has been provided for different cases, where the time behavior of the products in the networks varies from exponential decay, as in the case of PHA metabolism in Pseudomonas, to an oscillating behavior, as in the case of the circadian rhythm in Drosophila;
- designed two different parallel implementations of the method, to be used on different classes of computing architectures: large homogeneous clusters with fast interconnection (SCA), and GRID architectures, characterized by loosely coupled computing elements.

The SCA implementation relies on fast inter-node communication, which allows a rapid and efficient flow of information between a master node and the computation nodes. The GRID implementation, in turn, introduces a double level of parallelism: each computational node evolves a subset of the whole population, because the loose interconnection does not allow every node to communicate with a single master at the end of each generation. The different population subsets exchange data only at a given frequency.

We have tested both the SCA and the GRID implementations on two different cases: in the first, we wanted to reproduce a periodic time behavior of the metabolites' concentrations, while in the second we faced a typical slowly-decaying behavior, i.e. an initial increase followed by a slow saturation. In both cases the parallel procedures were able to provide fully satisfactory results, both from a qualitative and, more importantly, from a quantitative point of view.
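
The two-level GRID scheme recalled above (independent "islands" that exchange their best genomes only at intervals) can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the code used in this work: the quadratic `cost` is a toy stand-in for the data misfit, the per-island evolution (truncation selection, uniform crossover, Gaussian mutation, elitism) and all numerical settings are hypothetical, and the islands are evolved one after another instead of on separate clusters.

```python
import random


def cost(genome):
    """Toy cost with its minimum at (1.0, 2.0); stands in for the data misfit."""
    return (genome[0] - 1.0) ** 2 + (genome[1] - 2.0) ** 2


def random_genome(rng):
    return (rng.uniform(-5, 5), rng.uniform(-5, 5))


def evolve(population, n_g, rng, sigma=0.2):
    """Evolve one island for n_g generations and return it sorted by cost."""
    size = len(population)
    for _ in range(n_g):
        ranked = sorted(population, key=cost)
        parents = ranked[: max(2, size // 2)]      # truncation selection
        children = []
        while len(children) < size - 1:
            p1, p2 = rng.choice(parents), rng.choice(parents)
            # Uniform crossover followed by Gaussian mutation.
            children.append(tuple(rng.choice(pair) + rng.gauss(0, sigma)
                                  for pair in zip(p1, p2)))
        population = [ranked[0]] + children        # keep the island's best
    return sorted(population, key=cost)


def island_ga(n_total=40, beta=4, n_g=10, n_s=5, seed=1):
    rng = random.Random(seed)
    size = n_total // beta                         # N / beta genomes per island
    islands = [[random_genome(rng) for _ in range(size)] for _ in range(beta)]
    best = None
    for _ in range(n_s):
        # Each secondary generation evolves on its own for n_g generations.
        islands = [evolve(pop, n_g, rng) for pop in islands]
        # Master: gather the best genomes of every island and rank them.
        gathered = sorted((g for pop in islands for g in pop[: size // 2]),
                          key=cost)
        elite = gathered[: size // 2]
        best = elite[0]
        # New seed populations: half elite genomes, half fresh random ones.
        islands = [elite + [random_genome(rng)
                            for _ in range(size - len(elite))]
                   for _ in islands]
    return best, cost(best), n_g * n_s             # N_T = n_G * n_S


best, best_cost, total_generations = island_ga()
```

With the default settings each island runs nG = 10 generations between exchanges and the exchange is repeated nS = 5 times, so every island goes through NT = 50 generations in total. Reseeding half of each island at random is what keeps diversity in the population after each exchange.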
The SCA implementation, using a single, large population, is faster than the GRID one in reaching its optimum, even when the GRID implementation is compared with an SCA implementation that uses a smaller population.

We are working on a further implementation of the method which combines features of both the SCA and the GRID approaches. In this approach, we will force successive generations to reach higher population densities in the parameter space by allowing them to explore just a subset of the whole parameter space. At programmable intervals of generations, a number of subsets with the worst results are discarded, and further generations are constrained to make a fine-grained exploration of the region(s) of the parameter space where better solutions have been obtained. We expect this approach to be better at approaching the correct solution and to overcome the instability problems, due to the possible presence of more than one stable minimum, which plague the SCA approach. Moreover, we will apply this method to further cases of technological relevance.

The proposed method can be run in conjunction with methods that allow the network to be designed [5], in order to ultimately provide reliable computational models of metabolic pathways to be used for metabolomic studies and to provide in-silico derived insights for bacterial metabolic engineering.

References

[1] Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Company, Boston (MA) USA 1989.
[2] E-cell [http://www.e-cell.org/ecell].
[3] Tomita M, Hashimoto K, Takahashi K, Shimizu T, Matsuzaki Y, Miyoshi F, Saito K, Tanida S, Yugi K, Venter J, Hutchison C: Bioinformatics 1999, 15:72-84.
[4] Tyson J, Hong C, Thron C, Novak B: Biophys J. 1999, 77(5):2411.
[5] DREAM project [http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project]