FINAL WORKSHOP OF GRID PROJECTS “PON RICERCA 2000-2006, AVVISO 1575”

PSEUDOBIORES: A NEW COMPREHENSIVE DATABASE FOR PSEUDOMONAS

Grazia Licciardello1,2, Vittoria Catara2, Anshuma Mangtani3, Rocco Casilli4 and Vittorio Rosato3

1 Science and Technology Park of Sicily, Italy, g.licciardello@unict.it
2 Dipartimento di Scienze e Tecnologie Fitosanitarie, Università di Catania, Italy, vcatara@unict.it
3 ENEA, Casaccia Research Centre, Italy
4 Ylichron S.r.l. Roma, Italy, rosato@casaccia.enea.it

Abstract—Pseudomonas-related data are dispersed among many bioinformatics resources, and acquiring data for species whose genomes have not yet been sequenced is the most important limitation for researchers. Data search and extraction are currently possible only through cross-searches of different web sources. To overcome this issue we designed a database, designated “PseudoBioRes”, that aims to provide an integrated resource collecting information on genes and/or proteins on the basis of their potential applications. The database has been assembled by linking the data to their original sources. As a model, in this first version, a section dedicated to polyhydroxyalkanoates (PHA) has been developed. We collected data for 6 specific enzymes involved in the PHA metabolic pathway for the 15 species studied, providing a total of 75 gene accessions and 276 protein sequences as well as the genomic context for those species whose genomes have been completely sequenced. Through a user-friendly interface the user can browse PHA gene or protein sequences among all Pseudomonas spp. and perform multi-gene/protein comparisons. The database is open source in order to maintain consistency with new findings and can also be used as a guideline to create other sections for other relevant metabolites.
In the near future, thanks to the storage resources and computing capability of the GRID, we aim to improve the data analysis possibilities and share them on the web with other laboratories. At the moment, the database is accessible at the URL: www.ylichron.it/PHA_pseudomonas_DB.

Index Terms—Pseudomonas, polyhydroxyalkanoates, database.

I. INTRODUCTION

Pseudomonas Migula 1894 includes bacterial species of relevant interest in medicine, plant pathology and biotechnology, as confirmed by the 35 genome projects on 10 different Pseudomonas spp. submitted and the large number (19) in progress or in draft assembly status. For the most important species, e.g. P. aeruginosa, P. putida, P. fluorescens and P. syringae, the genome of more than one strain has been sequenced. In the last few years a huge amount of data has been generated, and more is expected in the coming years. This offers an unprecedented opportunity to use comparative analysis in studies of evolution and functional genomics, shedding light on the molecular mechanisms regulating different metabolic pathways. In this context, the problem of optimally extracting representative datasets of genomic and proteomic data assumes crucial importance.

Genome annotations are accessible directly from GenBank (http://ncbi.nlm.nih.gov) and from specialized web sites. The Pseudomonas Genome Database v2 (PGDv2) is a database specialized in P. aeruginosa in which the genome annotation is continually updated, along with the database content and functionality. Since 2005, this database has also provided annotations of other Pseudomonas genomes and acts as a valuable comparative resource. Data about P. syringae, a plant-pathogenic bacterium whose strains are characterised by their diverse plant-specific interactions, are collected in the Pseudomonas-Plant Interaction web site (http://pseudomonas-syringae.org/home.html).
Recently, an integrated bioinformatics platform for a Pseudomonas systems biology approach to infection and biotechnology has been established. The database, called SYSTOMONAS (SYSTems biology of pseudOMONAS) and accessible at http://www.systomonas.de, encourages the Pseudomonas community to elucidate cellular processes of interest [1].

On the other hand, little information is available on strains whose genomes have not yet been sequenced. Pseudomonas-related data (gene and protein sequences and metabolic pathways), despite being available for a large number of strains, are in fact dispersed among many sources. Information can be extracted only through cross-searches of either GenBank or different web sources dedicated to specific classes of enzymes or bacterial species.

We faced this problem while studying polyhydroxyalkanoate (PHA) production by Pseudomonas species. Most of the bacteria in this genus are able to produce granules of medium-chain-length poly(3-hydroxyalkanoates) (mcl-PHAs) as energy storage compounds [2]. Once extracted from cells, these molecules show properties similar to those of common plastics; moreover, they are degraded by microbial depolymerases. The mcl-PHA genetic locus in Pseudomonas spp. [2] consists of two PHA synthases (PhaC1 and PhaC2) [3] separated by the intracellular PHA depolymerase (PhaZ), essential for polymer utilization and biodegradability [4]; a proposed structural protein belonging to the TetR family of regulators (PhaD); and two PHA granule-associated proteins (PhaF and PhaI) [5, 6]. The integration of classical experimental data on P. putida KT2442 with genomic and high-throughput data stimulated the reconstruction of three different metabolic models aimed at improving PHA production, demonstrating the interest of the bioinformatics community in this metabolic pathway [7, 8, 9].
PHA gene sequences from many Pseudomonas strains, in addition to those derived from genome sequencing projects, are numerous and dispersed among many databases and bioinformatic resources. Until now, there has been no tool enabling simple and rapid extraction of Pseudomonas-related data from a single comprehensive database. In the following sections we describe the construction and content of PseudoBioRes, a database which aims to partially fill this void, together with its graphical interface and usefulness [10].

1) Construction and content

PseudoBioRes aims to be a specialized Pseudomonas resource that complements the available databases in their biological utility and application. It provides comprehensive information on the Pseudomonas-related gene and protein sequences available worldwide, clustered on the basis of the metabolic pathway in which they are involved. In its current release, all the proteins involved in PHA metabolism, whether isolated or deduced from genome sequencing projects in species belonging to the genus Pseudomonas, were collected. The proteins, clustered on the basis of their sequence similarity and class (PhaC1, PhaC2, PhaZ, PhaD, PhaI, PhaF, PhaG and PhaJ), were also interconnected with genomic data when available. We collected data for 6 specific enzymes involved in the PHA metabolic pathway for the 15 species studied, providing a total of 75 gene accessions and 276 protein sequences as well as the genomic context for the completely sequenced species. Completing this section required formulating detailed query terms and manually entering the results for each species. We also collected the sequence data of 15 genomes. The database consolidates information from external sources, manually annotated, into a relational database.
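As an illustration of the relational organization described above, the following minimal Python sketch stores species, protein classes and protein sequences in an in-memory SQLite database, retrieves all entries of one class across species and exports them as multi-FASTA. It is not the PseudoBioRes implementation: the table and column names, the function names and the `DEMO_` accessions with their short sequences are hypothetical placeholders chosen only for the example.

```python
import sqlite3

# Hypothetical three-table layout: species, protein classes, protein sequences.
SCHEMA = """
CREATE TABLE species (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE protein_class (
    id     INTEGER PRIMARY KEY,
    symbol TEXT NOT NULL UNIQUE
);
CREATE TABLE protein (
    id         INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    class_id   INTEGER NOT NULL REFERENCES protein_class(id),
    accession  TEXT NOT NULL UNIQUE,
    sequence   TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder records."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO species (id, name) VALUES (?, ?)", [
        (1, "Pseudomonas putida"),
        (2, "Pseudomonas aeruginosa"),
    ])
    con.executemany("INSERT INTO protein_class (id, symbol) VALUES (?, ?)", [
        (1, "PhaC1"),
        (2, "PhaZ"),
    ])
    # Accessions and sequences are invented toy values, not real entries.
    con.executemany(
        "INSERT INTO protein (species_id, class_id, accession, sequence) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, 1, "DEMO_0001", "MSNKNNDELQ"),
            (2, 1, "DEMO_0002", "MSQKNNNELP"),
            (1, 2, "DEMO_0003", "MPQPYIFRTV"),
        ],
    )
    con.commit()
    return con


def proteins_by_class(con, symbol):
    """Return (species, accession, sequence) rows for one protein class."""
    query = """
        SELECT s.name, p.accession, p.sequence
        FROM protein AS p
        JOIN species AS s ON s.id = p.species_id
        JOIN protein_class AS c ON c.id = p.class_id
        WHERE c.symbol = ?
        ORDER BY s.name, p.accession
    """
    return con.execute(query, (symbol,)).fetchall()


def to_fasta(rows):
    """Format rows from proteins_by_class as a multi-FASTA string."""
    return "".join(f">{acc} {species}\n{seq}\n" for species, acc, seq in rows)


if __name__ == "__main__":
    con = build_demo_db()
    print(to_fasta(proteins_by_class(con, "PhaC1")), end="")
```

In this layout the protein class lives in its own table, so a section for another metabolite would, under the same assumptions, only add rows rather than change the schema.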
A search tool that allows the query and retrieval of a class of protein in all the Pseudomonas species in which it has been sequenced will be developed. Protein and gene sequences can then be extracted and exported simultaneously for all the Pseudomonas species, ready to be used for in silico analysis. This way of retrieving and extracting sequences will allow the user to overcome the obstacles encountered in the integrated use of different bioinformatic resources.

2) Data sources

PseudoBioRes is built from experimental data provided by different research groups or retrieved from external sources. Among them are the Pseudomonas Genome Database v2 (PGDv2, http://www.pseudomonas.com/), GenBank (http://www.ncbi.nlm.nih.gov/), KEGG, the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), and the List of Prokaryotic Names with Standing in Nomenclature (LPSN, http://www.bacterio.cict.fr/).

Data on genome sequences were extracted from the GBrowse section of PGDv2, which stores and integrates data from the Pseudomonas Genome Project and from PseudoCAP (Pseudomonas aeruginosa Community Annotation Project). GenBank was used as the source of gene and protein data, through the search engine of NCBI (National Center for Biotechnology Information). The dedicated PHA section of the database was completed by formulating detailed query terms and manually entering the results for each species. This was time-consuming, but all data were included.

For metabolic pathways and enzyme classes we used KEGG, part of the Japanese GenomeNet service, which integrates metabolic pathways (data on metabolic pathways and complexes), genes (data on functional genes and their protein products) and ligands (chemical compounds, drugs, glycans and reactions). From it we extracted the Pseudomonas PHA metabolic pathway.

The occurrence of many DNA sequences obtained from “unknown” strains without any further characterization pointed to a gap between environmental studies and Pseudomonas taxonomy.
We therefore provide a list of Pseudomonas species, with links, as retrieved from the LPSN, which records the nomenclature of prokaryotes and its changes as cited in the Approved Lists of Bacterial Names or published in the International Journal of Systematic Bacteriology (IJSB) or, later, in the International Journal of Systematic and Evolutionary Microbiology (IJSEM). Genes not attributed to a species are referred to as Pseudomonas spp., and the strain name is reported as in the NCBI Taxonomy database.

3) Structure of the database

The PseudoBioRes database has a tree structure, with an introduction page that reports the main goals of the database. The content is built on three main interconnected blocks dealing with species, genomes and genes.

From the “Species” section the user is sent to an alphabetical list of the 175 species of Pseudomonas, taken from LPSN, each with a link to the correct nomenclature; from this list a particular species can be selected to get its relevant data. The links provided lead to the genome sequencing projects (complete and in progress) in the GBrowse section of the Pseudomonas Genome Database v2 web site. A comprehensive list of the ongoing and completed genome sequencing projects for the various Pseudomonas species is provided by the website www.pseudomonas.com.

From the page dedicated to a particular Pseudomonas species it is possible to access 5 fields: general description, NCBI Taxonomy browser, relevant papers, genome sequence (if available) and genes and proteins involved in a specific pathway (in this version only PHA). The general description has been prepared from various literature sources, focusing on general characteristics of the species and on PHA production.
The general description gives a brief account of the species and of its biological relevance and role in different fields, such as clinical, agricultural and environmental. The link to the NCBI Taxonomy browser gives more information on taxonomy and, in turn, links to other NCBI databases such as Gene, Protein, Genome, Nucleotide, Genome Projects and Structure. The scientific literature used to compile the text pages and related to the particular species is listed in the “relevant papers” page.

The “complete genome” link leads to the GBrowse tool, which shows the complete genome map along with the position of the genes. For some species, the exact location of specific genes in the genome map can be found using the search engine of that site. In the future we plan to replace this link with a better source.

The “PHA related genes and proteins” link shows the various genes and proteins involved in Pseudomonas PHA production. From there, further pages can be reached in which the genes and proteins related to PHA biosynthesis are collected in two sets of data.

The set “PHA related genes” contains all the genes derived from the Pseudomonas genome sequencing projects, when available. From this page it is possible to reach the corresponding Entrez Gene ID page of the NCBI web site, which provides the genomic context, the genomic region, the transcript and product, and links to other databases (Conserved Domains, PubMed, KEGG, Taxonomy, TIGR, etc.). It allows the nucleotide sequence to be downloaded directly in FASTA format and, through the link to the KEGG database, gives information about the metabolic pathway in which the gene is involved.

The set “PHA related proteins” contains all the protein sequences derived either from the genome sequence context or directly from cloned genes. In the latter case, sequence information was extracted after gene isolation and sequencing and related to PHA yield data.
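Sequences downloaded in FASTA format, as described above, are plain text and can be processed directly before any comparison. The Python sketch below parses multi-FASTA text and groups the records by protein class. The header layout (accession, then class symbol, then species), the function names and the `DEMO_` records are assumptions made for the example; real GenBank headers are laid out differently and would need their own parsing rule.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_by_class(records):
    """Group (accession, length) pairs by the class symbol in the header.

    Assumes the hypothetical layout '<accession> <class> <species...>'.
    """
    groups = defaultdict(list)
    for header, sequence in records:
        tokens = header.split()
        if not tokens:
            continue  # skip records with an empty header
        label = tokens[1] if len(tokens) > 1 else "unclassified"
        groups[label].append((tokens[0], len(sequence)))
    return dict(groups)


# Invented toy records, not real database entries.
DEMO = """>DEMO_0001 PhaC1 Pseudomonas putida
MSNKNNDELQ
RQASENTLGL
>DEMO_0002 PhaC1 Pseudomonas aeruginosa
MSQKNNNELP
>DEMO_0003 PhaZ Pseudomonas putida
MPQPYIFRTV
"""

if __name__ == "__main__":
    for label, members in group_by_class(parse_fasta(DEMO)).items():
        print(label, members)
```

Grouping by class in this way gives, for each enzyme, the set of sequences that a multi-species comparison would start from.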
As for the genes, the protein sequence can be downloaded in FASTA format, together with information about the metabolic pathway.

Through the “Gene” resource it is possible to access specific sections dedicated to classes of genes of particular interest. Gene chromosome location, sequence and structural information are extracted from the NCBI Taxonomy database, which is also used as the reference for information on the biological sources of the sequenced proteins and provides links to the main biological databases (KEGG). This section is still in progress.

4) The web interface

The web interface has been developed using Microsoft ASP.NET technology on the .NET Framework 2.0. Care has been taken to allow simple and rapid updating of the database with new entries. The database is open source in order to maintain consistency with new findings and can also be used as a guideline to create other sections for other relevant metabolites. At the moment, it is accessible at the following URL: www.ylichron.it/PHA_pseudomonas_DB.

II. CONCLUSION

The tool described here has been developed to help laboratory scientists and bioinformaticians obtain information and data about Pseudomonas species, targeting sequences of the most important classes of compounds of biotechnological interest. The way it provides for the retrieval and extraction of sequences allows the user to overcome the obstacles encountered in the integrated use of different bioinformatic resources. At the same time, the completeness of the sequence collection allows intra- and interspecies comparison at different biological levels (genes, transcripts and proteins).
ACKNOWLEDGMENTS

This work has been performed in the frame of the project “CRESCO” (Computational Research Center for Complex Systems), co-funded by ENEA and the Italian Ministry of University and Research in the frame of “Programma Operativo Nazionale 2000-2006 Ricerca Scientifica, Sviluppo Tecnologico, Alta Formazione, Misura II.2: Società della Informazione per il Sistema Scientifico Meridionale, Azione a: Sistemi di calcolo e simulazione ad alte prestazioni”.

REFERENCES

[1] Choi C, Münch R, Leupold S, Klein J, Siegel I, Thielen B, Benkert B, Kucklick M, Schobert M, Barthelmes J, Ebeling C, Haddad I, Scheer M, Grote A, Hiller K, Bunk B, Schreiber K, Retter I, Schomburg D, Jahn D. (2007) SYSTOMONAS — an integrated database for systems biology analysis of Pseudomonas. Nucleic Acids Res 35: 533–537.
[2] Madison L, Huisman GW. (1999) Metabolic engineering of poly(3-hydroxyalkanoates): from DNA to plastic. Microbiol Molec Biol Reviews 63(1): 21–53.
[3] Rehm BHA, Steinbuchel A. (1999) Biochemical and genetic analysis of PHA synthases and other proteins required for PHA synthesis. Int J Biol Macromol 25: 3–19.
[4] de Eugenio LI, Garcia P, Luengo JM, Sanz JM, Roman JS, Garcia JL, Prieto MA. (2007) Biochemical evidence that phaZ gene encodes a specific intracellular medium chain length polyhydroxyalkanoate depolymerase in Pseudomonas putida KT2442: characterization of a paradigmatic enzyme. J Biol Chem. 16, 4951–4962.
[5] Hoffmann N, Rehm BHA. (2004) Regulation of polyhydroxyalkanoate biosynthesis in Pseudomonas putida and Pseudomonas aeruginosa. FEMS Microbiol Lett 237: 1–7.
[6] Hoffmann N, Rehm BHA. (2005) Nitrogen-dependent regulation of medium-chain length polyhydroxyalkanoate biosynthesis genes in pseudomonads. Biotechnol Lett 27: 279–282.
[7] Dias JML, Oehmen A, Serafim LS, Lemos PC, Reis MAM, Oliveira R. (2008) Metabolic modelling of polyhydroxyalkanoate copolymers production by mixed microbial cultures.
BMC Syst Biol 2: 59.
[8] Nogales J, Palsson B, Thiele I. (2008) A genome-scale metabolic reconstruction of Pseudomonas putida KT2440: iJN746 as a cell factory. BMC Syst Biol 2: 79.
[9] Puchaka J, Oberhardt MA, Godinho M, Bielecka A, Regenhardt D, Timmis KN, Papin JA, Martins dos Santos VAP. (2008) Genome-Scale Reconstruction and Analysis of the Pseudomonas putida KT2440 Metabolic Network Facilitates Applications in Biotechnology. PLoS Comput Biol 4(10).
[10] Licciardello G, Catara V, Mangtani A, Casilli R, Rosato V. (2008) PseudoBioRes: una risorsa bioinformatica per il genere Pseudomonas. Conferenza Nazionale Italiani E-Science 2008, Book of abstract 126.

Grazia Licciardello was born in 1978. She is a molecular biologist with expertise in plant pathology and biotechnology, gained through a second-level postgraduate Master in “Biotechnology for sustainable protection of crops and agrifood” and a PhD in “Phytosanitary technologies” at the University of Catania. Since 2004 she has worked as a researcher at the Science and Technology Park of Sicily. She has participated in the following projects: the PON project “Utilization of waste material to develop biodegradable polymers (PHA) for agriculture and agroindustry” and the MIUR project “CRESCO, Computational Centre for research on Complex Systems”. Her main areas of research are genetic manipulation for biotechnological purposes, the detection of microbial phytopathogens by molecular methods and the study of genes involved in genetic regulation. She is the author of about 30 scientific papers published in international refereed journals and presented at national and international congresses.

Vittoria Catara is Associate Professor at the University of Catania.
Since 1990 she has taken part in the research activity of Di.S.Te.F, University of Catania, Italy. She has cooperated in a number of projects and has been responsible for 3 projects of the University of Catania and for a project for young researchers of Catania University, and coordinator of a British programme funded by CRUI and the British Council. She is involved in phytobacteriology studies and in the molecular aspects of fungal diagnosis and characterization. She has contributed to 120 works, including scientific publications in technical and scientific journals and conference presentations published in proceedings, on the following subjects: plant diseases; molecular techniques for the diagnosis of plant pathogens; phenotypic and genomic characterization of P. corrugata; evaluation of resistance to biotic and abiotic factors; characterization and application of biocontrol agents; analysis of bacterial populations; evaluation of bacteria for polyhydroxyalkanoate production; regulation of polyhydroxyalkanoate genes; and quorum sensing in Pseudomonas spp. She has described new diseases caused by known pathogens and a new bacterial species, P. mediterranea Catara et al. (2002). She is co-author of a patent.