ProGenGrid: a Grid-enabled platform for Bioinformatics
G. Aloisio, M. Cafaro, S. Fiore, M. Mirto
CACT/ISUFI and SPACI Consortium, University of Lecce, Italy
HealthGrid 2005, 7th-9th April, Oxford

OUTLINE
• Bioinformatics: some issues
• Why Bioinformatics Grid?
• The Proteomics and Genomics Grid (ProGenGrid) project: a Grid framework for Bioinformatics
• Data management services
• Conclusions and future work

Bioinformatics Issues
• Large amounts of data and many applications;
• High heterogeneity: different types, forms, algorithms, implementations, communities, and service providers;
• High complexity and inter-relations;
• Exploitation of large computing power for supporting “in silico” experiments.

Why Bioinformatics Grid?
• A deployment, distribution, and management system for the needed software components;
• Harmonized, standard integration of the various software layers and services;
• Powerful, flexible policy definition, control, and negotiation mechanisms for a collaborative grid environment;
• The Life Science Grid Research Group, established under the Global Grid Forum, underlines how a Grid framework (offering services and standards) satisfies bioinformatics requirements.

ProGenGrid Project
The aim of the ProGenGrid project is the creation of a distributed and ubiquitous grid environment for supporting “in silico” experiments in bioinformatics.
Using such an environment, which can be considered a virtual laboratory, e-scientists will access:
• analysis tools (e.g. EMBOSS, Blast),
• biological databases (e.g. GenBank, Protein Data Bank),
• visualization tools (e.g. Rasmol).
These tools will be available as Web/Grid Services according to a Service Oriented Architecture and accessible through a Web Portal.

Service Oriented Architecture
• Web Service
  − XML, SOAP, WSDL, UDDI
• Web/Grid Service
  − Grid WSDL
  − OGSA & WSRF (Open Grid Services Architecture & Web Services Resource Framework)
[Figure: Service Consumer, Service, and UDDI Registry. Labels: SOAP XML-based messaging; search; redirects to service; redirect to service description.]
Allows building enhanced services independently of platform, programming language, tools, and network infrastructure.

Services-layered Architecture
[Figure: four-level layered architecture, with a "Main Focus" label.]
• Level 4: Application (WorkFlow)
• Level 3: Semantic Grid Services (XML, RDF, RDF Schema, Ontology) and Data Grid Services (GridFTP, SRM, DAI, …)
• Level 2: Generic Grid Services (GRAM, GSI, MDS, …)
• Level 1: Genome database, Protein database, Disease database, Clinical database

Level 1: legacy data sources
• Several data sources
• Heterogeneity of data sources
• Poor level of integration
• Legacy catalogue
Framework for supporting bioinformatics research.

Level 2: Generic Grid Services
• Job submission: GRAM
• Security: GSI
• Information Service: MDS, iGrid
• Efficient data transport: GridFTP

Level 3: Semantic Grid Services
The ontologies provide additional information that bridges the syntactic and semantic gaps between the individual data sources and the user.
Several formats connected with the ontologies are:
• XML
• RDF
• …
This level provides services supporting data integration.

Ontology
An ontology defines a common vocabulary for the information in a specific domain. It includes definitions of the basic concepts in the domain and the relations among them, which should be interpretable by both machines and humans.
• Use of ontologies at two levels:
  − Workflow validation during the composition of tasks, without knowing application details, and conversion of input data, if needed.
  − Data access:
      − semantic integration of different data sources;
      − analysis of stored data coming from different experiments.

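As a rough illustration of the workflow-validation use of an ontology, the sketch below checks, for a linear chain of activities, that the data type produced by each activity is accepted by the next one, and flags where an input conversion would be needed. The component names, data types, and the ACCEPTS, PRODUCES, and CONVERTERS tables are hypothetical; they are not taken from the ProGenGrid ontology, and the real validation logic may differ.

```python
# Hypothetical mini-ontology (NOT the ProGenGrid ontology): the input types
# each component accepts and the output type it produces.
ACCEPTS = {
    "blast_search": {"fasta_sequence"},
    "multiple_alignment": {"sequence_set"},
    "structure_viewer": {"pdb_structure"},
}
PRODUCES = {
    "blast_search": "sequence_set",
    "multiple_alignment": "alignment",
    "structure_viewer": "image",
}
# Hypothetical (source type, target type) pairs for which a converter exists.
CONVERTERS = {("alignment", "sequence_set")}


def validate_chain(activities, initial_type):
    """Check a linear chain of activities against the mini-ontology.

    Returns one (activity, status) pair per inspected activity, where status
    is "ok", "ok after conversion", "type mismatch", or "unknown component".
    """
    report = []
    current = initial_type
    for name in activities:
        accepted = ACCEPTS.get(name)
        if accepted is None:
            # The component is not described in the ontology: stop here.
            report.append((name, "unknown component"))
            break
        if current in accepted:
            report.append((name, "ok"))
        elif any((current, target) in CONVERTERS for target in accepted):
            report.append((name, "ok after conversion"))
        else:
            report.append((name, "type mismatch"))
        # The next activity receives this activity's output type.
        current = PRODUCES[name]
    return report


if __name__ == "__main__":
    chain = ["blast_search", "multiple_alignment", "structure_viewer"]
    for activity, status in validate_chain(chain, "fasta_sequence"):
        print(activity, "->", status)
```

The check relies only on the declared input and output types, which is what lets an editor validate a composition without knowing the details of the applications behind each activity.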
Ontology of software for ProGenGrid WF
Classification of ProGenGrid software components into data banks, bioinformatics algorithms, graphics tools, drug design tools, and input data types.
This first ontology, written in DAML+OIL, has been stored in a relational database.
[Figure: entity-relationship diagram of the ontology database. Entities: INPUT TYPE, CLASS, WORKFLOW, ACTIVITY INSTANCE. Relationships: accept, belongs, composed by, and father/child roles. Attribute labels: type, display, id_class, name, description, filename, conditiontype, conditionvalue, id_workflow, id_activity.]

Advantages of using the ontology
• To keep track of:
  − the input data that a given component can receive;
  − the relations between the input and output data of the components, for determining rules and establishing the correct flow of data.
• To associate a description with the logical name of the activities.

Level 3: Data Grid Services
One of the main goals of grid technology is to provide efficient access to data.
Our scenario involves:
• many distributed and heterogeneous data sources;
• huge amounts of data;
• intensive computations.
Bioinformatics needs efficient data grid services for data integration.

Data Grid Services
Main focus: data integration.

Data Integration
Accessing data from a legacy system may be difficult for several reasons. The legacy system may:
• have been developed for a different hardware or software platform;
• use a different data model;
• use a different DBMS;
• use different data definitions;
• use a different data format.
All of these make it difficult to integrate and share data.

Data Integration
[Figure: mediator architecture. Labels: Client/User, RDF, Mediator (Mediator Engine, Data Source Ontology, Information Integrator, Web Services Mapper), RDF Schema Ontology, and wrappers over Flat File, Relational DB, and XML sources.]

Standard Database Access Interface 2.0
Std Data source Access Interface. Features:
• Standard access to data sources
• Plug-in architecture based on dynamic libraries
• Wrapper extensions for bioinformatics data sources

Level 4: WorkFlow Mng System (WFMS)
The WorkFlow Management Coalition (WFMC) defines
workflow as: “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.”
WFMC
• founded in 1993; 24 countries, 170 members
• terminology, standard interfaces, promotion

Workflow phases
• Stage 1 - Component discovery: the system discovers the available bioinformatics tools, data banks, and graphics tools, modeled through the ontology;
• Stage 2 - Workflow editing: the discovered components are made available to a semantic editor that allows the design of an experiment (Abstract Workflow); the activities are modeled using UML;
• Stage 3 - Execution plan: the abstract workflow is translated into an “execution plan” (Concrete Workflow) containing the order of the activities and the logical names of the resources (needed for their discovery in a Grid environment);
• Stage 4 - Application execution: the ProGenGrid scheduler schedules the concrete workflow on a computational grid;
• Stage 5 - Application monitoring: whenever workflow activities start or finish, the system visualizes the progress of the workflow execution using a graphical utility.

Workflow architecture
[Figure: workflow architecture. Labels: ProGenGrid Editor, Discovery components, Available components, MOR Validation, MOR Result, Translates, Abstract Workflow, Execution Plan (Concrete Workflow), Is sent, Enactment Service, Query, Select, Generates, WorkList Activities, Transforms, Web Service Invocation, Executes, Grid Resources, Resource Discovery & Selector, Select Resource, Information Service.]

Workflow GUI
• Toolbar for inserting special states, fork, join, and condition tasks.
• UML graph related to the current workflow.
• Classification of the activities available on a computational grid.

Graphical WorkFlow Monitoring
Activity and workflow status, with the related application error messages, represented as UML activities.

[Figure: UML class diagram of the workflow editor; class and member names are in Italian in the original.]
• Classes: workflowUML, workflowView, attivitaUML, startUML, endUML, forkUML, joinUML, conditionUML, frecciaUML; java.util.Observer also appears in the diagram.
• Notes in the diagram: an abstract class represents both the leaves and the composite element, and provides the interface and the default behavior of all the classes; a concrete class is used to contain and manage the graphical elements of the workflow (elementiGrafici[0..*] : workflowView).
• Operations shared by the graphical elements include disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), and getPuntoCollegamento(Line2D).
• The container operations include aggiungi(workflowView), elimina(workflowView), collega(origine, destinazione), collegaCond(origine, destinazione, espressione), disconnetti(origine, destinazione), addStartUML, addEndUML, addForkUML, addJoinUML, addConditionUML, and update(Observable o, Object arg).

Drug Discovery: Development Life Cycle
[Figure: timeline from 0 to 16 years.]
• Discovery (2 to 10 years)
• Preclinical Testing (lab and animal testing)
• Phase I (20-30 healthy volunteers, used to check for safety and dosage)
• Phase II (100-300 patient volunteers, used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers, used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
7-15 years! $600-700 million!

Phases of DD
• Target identification
  − What protein can we attack to stop the disease from progressing?
• Lead discovery & optimization
  − What sort of molecule will bind to this protein? (Molecular docking)
• Toxicology
  − Side effects

Issues and Grid solutions for DD
• Screening of a large set of compounds
  − The old way: exhaustive screening
  − The new way: parallel screening on the Grid!
• Docking
  − The old way: execution of legacy software
  − The new way: large-scale docking, transforming existing software into parameter sweep applications for execution on distributed systems

Split Service: General Purpose Schema
[Figure: Split Service. Labels: Splitter request (XML format), Split Service, Splitter Component, ACL, ID, Data & Query, Up/Download, Result, Available IDs, Request, Fragments ID, ClientA, ClientB, ClientAB, Computational Engine.]

Enhanced Split Service
Within the ProGenGrid project we have been developing an enhanced Split Service customized for bioinformatics applications. The customizations are related to:
• Computational engines
  − Autodock, Dock (Sphgen, grid)
  − GAMESS
• Broker functionalities
• Workflow support
• High-level functionalities for end users

Conclusions and Future Work
ProGenGrid is a software platform allowing the composition of existing bioinformatics resources, wrapped as Web Services, to create complex workflows. In particular, it offers:
• tools for service composition, workflow execution, and monitoring;
• a data integration approach to simplify access to heterogeneous biological databases.
In the future…
Full implementation of the architecture, evaluating it against other approaches.

SPACI Project
Southern Partnership for Advanced Computational Infrastructures
A grid infrastructure based on three geographically spread High Performance Computing Centers located in Southern Italy:
• ISUFI/CACT: Center for Advanced Computing Technologies, University of Lecce. Director: Prof. Giovanni Aloisio
• CPS/CNR: Center for Research on Parallel Computing and Supercomputing (now Section of Naples of ICAR/CNR). Director: Prof. Almerico Murli
• MIUR/HPCC: Center of Excellence for High Performance Computing, University of Calabria. Director: Prof. Lucio Grandinetti

For any information…
About ProGenGrid Project
Project P.I.: Maria Mirto (maria.mirto@unile.it)
Giovanni Aloisio (giovanni.aloisio@unile.it)
Massimo Cafaro (massimo.cafaro@unile.it)
Sandro Fiore (sandro.fiore@unile.it)
WebSite: http://datadog.unile.it/progen