ProGenGrid: a Grid-enabled platform for Bioinformatics
G. Aloisio, M. Cafaro, S. Fiore, M. Mirto
CACT/ISUFI and SPACI Consortium, University of Lecce, Italy
HealthGrid 2005 7th-9th April, Oxford
OUTLINE
• Bioinformatics: some issues
• Why Bioinformatics Grid?
• The Proteomics and Genomics Grid (ProGenGrid) project:
a Grid framework for Bioinformatics
• Data management services
• Conclusions and future work
Bioinformatics Issues
• Large amounts of data & many applications;
• High heterogeneity:
− Different types, forms, algorithms, implementations,
communities, service providers;
• High complexity and inter-relations;
• Exploitation of large computing power for supporting
“in silico” experiments;
Why Bioinformatics Grid?
• A system for the deployment, distribution and management
of the needed software components;
• Harmonized, standard integration of various software
layers and services;
• Powerful, flexible policy definition, control and
negotiation mechanism for a collaborative grid
environment;
• The Life Science Grid Research Group, established
under the Global Grid Forum, underlines that a Grid
framework (offering services and standards) satisfies
bioinformatics requirements.
ProGenGrid Project
The aim of the ProGenGrid project is the creation of a
distributed and ubiquitous grid environment for supporting “in
silico” experiments in bioinformatics.
Using such an environment, which can be considered a
virtual laboratory, e-scientists will access:
• analysis tools (e.g. EMBOSS, Blast),
• biological databases (e.g. GenBank, Protein Data Bank),
• visualization tools (e.g. Rasmol).
These tools will be available as Web/Grid Services according to
a Service Oriented Architecture and accessible through a Web
Portal.
Service Oriented Architecture
Web/Grid Service
− Web Service: XML, SOAP, WSDL, UDDI
− Grid Service: OGSA & WSRF
(Open Grid Services Architecture
& Web Services Resource Framework)
[Figure: the Service Consumer searches for a service in the UDDI
Registry, is redirected to the service description (WSDL), and then
exchanges XML-based SOAP messages with the Service.]
Allows building enhanced services independently of
platform, programming language, tools, and network
infrastructure.
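The publish, search and invoke roles named on this slide can be illustrated with a minimal in-process sketch. It is not the ProGenGrid implementation and uses no real SOAP, WSDL or UDDI library: `Registry`, `ServiceDescription` and `reverse_complement_service` are hypothetical names, and plain Python dictionaries stand in for XML messages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceDescription:
    """Stand-in for a WSDL document: a name, keywords and a callable endpoint."""
    name: str
    keywords: List[str]
    endpoint: Callable[[dict], dict]


class Registry:
    """Stand-in for a UDDI registry: providers publish, consumers search."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        self._services[description.name] = description

    def search(self, keyword: str) -> List[ServiceDescription]:
        return [d for d in self._services.values() if keyword in d.keywords]


def reverse_complement_service(message: dict) -> dict:
    """Hypothetical provider: returns the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    sequence = message["sequence"]
    return {"result": "".join(pairs[base] for base in reversed(sequence))}


# Provider side: publish the description in the registry.
registry = Registry()
registry.publish(
    ServiceDescription("revcomp", ["sequence", "dna"], reverse_complement_service)
)

# Consumer side: search the registry, read the description, then invoke.
found = registry.search("dna")
reply = found[0].endpoint({"sequence": "ATGC"})
print(reply["result"])
```

The consumer only depends on the registry and on the message format, which is the property the slide attributes to a service oriented design.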
Services-layered Architecture
[Figure: layered architecture, from top to bottom; one part of the
stack is labelled "Main Focus"]
Level 4: Application, WorkFlow
Level 3: Semantic Grid Services (XML, RDF, RDF Schema, Ontology);
Data Grid Services (GridFTP, SRM, DAI, …)
Level 2: Generic Grid Services (GRAM, GSI, MDS, …)
Level 1: Genome database, Protein database, Disease database,
Clinical database
Level 1: legacy data sources
• Several data sources
• Heterogeneity of data sources
• Poor level of integration
• Legacy catalogue
Framework for supporting bioinformatics research.
Level 2: Generic Grid Services
• Job submission:
− GRAM
• Security:
− GSI
• Information Service:
− MDS
− iGrid
• Efficient Data Transport:
− GridFTP
Level 3: Semantic Grid Services
Additional information bridging the syntactic and semantic
gaps among the individual data sources and the user are
provided within the ontologies.
• Several formats are connected with the ontologies:
− XML
− RDF
− …
• This level provides services supporting data integration
Ontology
An ontology defines a common vocabulary for the information in a
specific domain. It includes definitions of basic concepts in the
domain and relations among them, which should be interpretable
both by machines and humans.
• Use of ontologies at two levels:
− Workflow validation during the composition of tasks,
without knowing application details, and conversion of input
data, if needed.
− Data access:
  − Semantic integration of different data sources;
  − Analysis of stored data coming from different
experiments.
Ontology of software for ProGenGrid WF
Classification of the ProGenGrid software components into data
banks, bioinformatics algorithms, graphics tools, drug design tools,
and input data types. This first ontology, written in DAML+OIL, has
been stored in a relational database.
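[Figure: entity-relationship diagram of the ontology database.
Entities: INPUT TYPE, CLASS, WORKFLOW, ACTIVITY INSTANCE.
Relationships: accept (N:M), belongs (1:N), composed by (1:N),
father/child (M:N).
Attributes shown: role, type, display; id_class, name, description,
type, filename; conditiontype, conditionvalue; id_workflow, name,
description, filename; id_activity, description.]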
Advantages for using ontology
• To keep track of:
− the input data that a given component can receive;
− the relations between input and output data of the
components, for determining rules and
establishing the correct flow of data.
• To associate a description with the logical name of the
activities.
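The kind of check this bookkeeping enables can be sketched as follows. This is a minimal illustration only: the `ONTOLOGY` dictionary, its data type names and the `validate_chain` helper are assumptions made for the example, not the DAML+OIL ontology or the relational schema used by ProGenGrid.

```python
# Hypothetical ontology fragment: which data types each component
# accepts as input and which type it produces as output.
ONTOLOGY = {
    "blast": {"accepts": {"fasta"}, "produces": "blast_report"},
    "parser": {"accepts": {"blast_report"}, "produces": "fasta"},
    "rasmol": {"accepts": {"pdb"}, "produces": "image"},
}


def validate_chain(chain):
    """Return a list of error strings; an empty list means the flow is consistent."""
    errors = []
    # Compare each component's output type with what its successor accepts.
    for producer, consumer in zip(chain, chain[1:]):
        output_type = ONTOLOGY[producer]["produces"]
        if output_type not in ONTOLOGY[consumer]["accepts"]:
            errors.append(f"{producer} -> {consumer}: {output_type} not accepted")
    return errors


print(validate_chain(["blast", "parser", "blast"]))
print(validate_chain(["blast", "rasmol"]))
```

A workflow editor could run such a check each time two tasks are connected, so that the user never needs to know the details of the underlying applications.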
Level 3: Data Grid Services
One of the main goals of grid technology is to provide
efficient access to data.
Our scenario involves:
• Many distributed and heterogeneous data sources
• Huge amounts of data
• Intensive computations
Bioinformatics needs efficient data grid services for data
integration.
Data Grid Services
Main Focus
Data Integration
• Accessing data from a legacy system may be difficult for
several reasons. The system may:
− have been developed for a different hardware or software
platform
− use a different data model
− use a different DBMS
− use different data definitions
− use a different data format
• All of these make integrating and sharing data difficult
Data Integration
[Figure: data integration architecture. Labels, from top to bottom:
Client/User; RDF; Mediator (Mediator Engine, Data Source Ontology,
Information Integrator, Web Services, Mapper); RDF Scheme and
Ontology; a row of WRAPPER components; data sources: Flat File,
Relational DB, XML.]
Standard Database Access Interface 2.0
(Std Data Source Access Interface)
Features:
• Standard access to data sources
• Plug-in architecture based on dynamic libraries
• Wrapper extensions for bioinformatics data sources
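The idea of one access interface with per-source wrappers can be sketched as below. This is an illustration under stated assumptions, not the Standard Database Access Interface itself: the `Wrapper` class, the `records()` method and the `WRAPPERS` table are hypothetical, and a Python dictionary stands in for the plug-in mechanism that the slide says is based on dynamic libraries.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Wrapper(ABC):
    """Hypothetical common interface: every source yields records as dicts."""

    @abstractmethod
    def records(self) -> Iterator[Dict[str, str]]:
        ...


class FlatFileWrapper(Wrapper):
    """Wraps a tab-separated flat file with a header row."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text), delimiter="\t")


class RelationalWrapper(Wrapper):
    """Wraps a relational table that has 'id' and 'name' columns."""

    def __init__(self, connection: sqlite3.Connection, table: str) -> None:
        self._connection = connection
        self._table = table

    def records(self) -> Iterator[Dict[str, str]]:
        cursor = self._connection.execute(f"SELECT id, name FROM {self._table}")
        for row_id, name in cursor:
            yield {"id": str(row_id), "name": name}


# Stand-in for the plug-in table: source kind -> wrapper class.
WRAPPERS = {"flatfile": FlatFileWrapper, "relational": RelationalWrapper}

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE proteins (id INTEGER, name TEXT)")
connection.execute("INSERT INTO proteins VALUES (2, 'myoglobin')")

sources = [
    WRAPPERS["flatfile"]("id\tname\n1\themoglobin\n"),
    WRAPPERS["relational"](connection, "proteins"),
]
# The caller sees one record format regardless of the underlying source.
merged = [dict(record) for source in sources for record in source.records()]
print(merged)
```

Supporting a new kind of source then means adding one wrapper class and one entry in the table, without touching the code that consumes the records.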
Level 4: WorkFlow Management System (WFMS)
The WorkFlow Management Coalition (WFMC) defines workflow
as:
“The automation of a business process, in whole or part, during
which documents, information or tasks are passed from one
participant to another for action, according to a set of procedural
rules.”
WFMC
• founded in 1993, 24 countries, 170 members
• terminology, standard interfaces, promotion
Workflow phases
Stage 1 - Component discovery: It discovers available
bioinformatics tools, data banks, graphics tools, modeled through
the ontology;
Stage 2 - Workflow editing: Discovered components are made
available to a semantic editor that allows the design (i.e. the
activities are modeled using UML) of an experiment (Abstract
Workflow);
Stage 3 - Execution Plan: The abstract workflow is translated into
an “execution plan” (Concrete Workflow) containing the order of the
activities and the logical names of the resources (needed for their
discovery in a Grid environment);
Stage 4 - Application execution: The ProGenGrid scheduler
schedules the concrete workflow in a computational grid;
Stage 5 - Application monitoring: Whenever workflow activities are
started/finished, the system visualizes the advancement of the workflow
execution using a graphical utility.
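Stage 3 can be illustrated with a small sketch that orders the activities of an abstract workflow and attaches a logical resource name to each. It is only one possible reading of the stage: the activity names, the `logical_resources` mapping and the `build_execution_plan` helper are hypothetical, and a standard topological sort (Python 3.9+ `graphlib`) stands in for whatever translation ProGenGrid performs.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: activity -> activities it depends on.
abstract_workflow = {
    "fetch_sequence": set(),
    "blast_search": {"fetch_sequence"},
    "fetch_structure": {"fetch_sequence"},
    "visualize": {"blast_search", "fetch_structure"},
}

# Hypothetical logical resource names, to be resolved later by discovery.
logical_resources = {
    "fetch_sequence": "genbank",
    "blast_search": "blast",
    "fetch_structure": "pdb",
    "visualize": "rasmol",
}


def build_execution_plan(workflow, resources):
    """Return (activity, logical resource) pairs in a dependency-respecting order."""
    order = TopologicalSorter(workflow).static_order()
    return [(activity, resources[activity]) for activity in order]


plan = build_execution_plan(abstract_workflow, logical_resources)
for activity, resource in plan:
    print(activity, "->", resource)
```

The two middle activities have no mutual dependency, so a scheduler would be free to run them in parallel; the plan only fixes that both come after `fetch_sequence` and before `visualize`.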
Workflow architecture
[Figure: workflow architecture. Labels: ProGenGrid Editor; component
discovery and available components; validation and result (MOR);
the Abstract Workflow is translated into an Execution Plan (Concrete
Workflow), which is sent to the Enactment Service; the Enactment
Service generates the WorkList of activities, transforms them into
Web Service invocations and executes them on Grid resources; the
Resource Discovery & Selector queries the Resource Information
Service and selects the resources.]
Workflow GUI
• Toolbar for inserting special status, fork, join
and condition tasks.
• UML graph related to the
current workflow
• Classification of the activities
available on a
computational grid
Graphical WorkFlow
Monitoring
Activity and workflow status, with the related
application error messages, represented as
activities
UML
[Class diagram of the workflow editor. Notes translated from Italian;
identifiers kept as in the original. One workflowUML aggregates 0..n
workflowView elements.]

workflowView
Note: abstract class that represents both the leaves and the
composite element. It provides the interface and the default
behaviour of all the classes.
Operations: disegna(Graphics), contiene(Point2D),
aggiungi(workflowView), elimina(workflowView), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setPrecedente(ID : String), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String),
getPuntoCollegamento(Line2D)

workflowUML (shown with java.util.Observer)
Note: concrete class used to contain and manage the graphical
elements of the workflow.
Attributes: elementiGrafici[0..*] : workflowView, prossimoID
Operations: disegna(Graphics), contiene(Point2D),
aggiungi(workflowView), elimina(workflowView), generaID(),
getElementiGrafici(),
collega(origine : workflowView, destinazione : workflowView),
collegaCond(origine : conditionUML, destinazione : workflowView,
espressione : String),
getAttivitaUML(IDModello : String), getOggettoUML(IDUML : String),
addStartUML(baricentro : Point2D), addEndUML(baricentro : Point2D),
addForkUML(baricentro : Point2D), addJoinUML(baricentro : Point2D),
addConditionUML(baricentro : Point2D),
disconnetti(origine : workflowView, destinazione : workflowView),
getFrecciaUML(IDUMLOrigine : String, IDUMLArrivo : String),
update(Observable o, Object arg)

attivitaUML
Attributes: larghezza, altezza, larghezzaArco, altezzaArco,
rettangolo : RoundRectangle2D, baricentro : Point2D, IDUML,
IDModelloAttivita, descrizione, IDOggettoUMLPrec, IDOggettoUMLSucc,
tipoInputUtente, valoreInputUtente
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getIDModello(),
getValoreInput(), setValoreInput(), getTipoInputUtente(),
setTipoInputUtente(), getIDOggettoUMLSucc(), getIDOggettoUMLPrec(),
getWidth(), getHeigth(), accettaCollegamenti(),
riceveCollegamenti(), setText(String), getText(),
setPrecedente(ID : String), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String),
getPuntoCollegamento(Line2D)

startUML
Attributes: diametro, baricentro : Point2D, cerchio : Ellipse2D,
IDUML, IDOggettoUMLSucc, commento
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), getPuntoCollegamento(Line2D)

endUML
Attributes: diametroInt, diametroEst, baricentro : Point2D,
cerchioInt : Ellipse2D, cerchioEst : Ellipse2D, IDUML,
IDOggettoUMLPrec, commento
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setPrecedente(ID : String),
cancellaPrecedente(ID : String), getIDOggettoUMLPrec(),
getPuntoCollegamento(Line2D)

forkUML
Attributes: larghezza, altezza, baricentro : Point2D,
rettangolo : Rectangle2D, IDUML, IDOggettoUMLPrec : String,
IDOggettiUMLSucc[0..*] : String, commento
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setPrecedente(ID : String), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String),
getOggettiUMLSucc(), getIDOggettoUMLPrec(),
getPuntoCollegamento(Line2D)

joinUML
Attributes: larghezza, altezza, baricentro : Point2D,
rettangolo : Rectangle2D, IDUML, IDOggettiUMLPrec[0..*] : String,
IDOggettoUMLSucc : String, commento
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setPrecedente(ID : String), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String),
getOggettoUMLSucc(), getIDOggettiUMLPrec(),
getPuntoCollegamento(Line2D)

conditionUML
Attributes: larghezza, altezza, baricentro : Point2D,
rombo : Polygon, IDUML, IDOggettoUMLPrec : String,
IDOggettiUMLSucc[0..*] : String, commento
Operations: disegna(Graphics), contiene(Point2D), getBaricentro(),
setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(),
accettaCollegamenti(), riceveCollegamenti(), setText(String),
getText(), setPrecedente(ID : String), setSuccessivo(ID : String),
cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String),
getIDOggettoUMLPrec(), getIDOggettiUMLSucc(),
getPuntoCollegamento(Line2D)

frecciaUML
Attributes: traiettoria : Line2D, effettiva : Line2D,
WF : workflowUML, IDUML, IDOggettoUMLOrigine, IDOggettoUMLArrivo,
commento
Operations: disegna(Graphics), contiene(Point2D), getIDUML(),
setText(String), getText(), getIDOggettoUMLOrigine(),
getIDOggettoUMLArrivo()
Drug Discovery: Development Life Cycle
• Discovery (2 to 10 Years)
• Preclinical Testing (Lab and Animal Testing)
• Phase I (20-30 Healthy Volunteers used to
check for safety and dosage)
• Phase II (100-300 Patient Volunteers used to
check for efficacy and side effects)
• Phase III (1000-5000 Patient Volunteers
used to monitor reactions to
long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
Overall: 7-15 Years, $600-700 Million!
[Figure: timeline of the stages above on an axis of 0 to 16 years.]
Phases of DD
• Target Identification
− What protein can we attack to stop
the disease from progressing?
• Lead discovery & optimization
− What sort of molecule will bind to
this protein? (Molecular Docking)
• Toxicology
− Side effects
Issues and Grid solutions for DD
• Screening of a large set of compounds
− The old way: exhaustive screening
− The new way: parallel screening on Grid!
• Docking
− The old way: execution of legacy software
− The new way: large-scale docking,
transforming existing software into
parameter sweep applications for execution on
distributed systems
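A parameter sweep of this kind can be sketched as: split the compound set into fragments, run the same job on each fragment in parallel, and merge the partial results. The sketch below is an assumption-laden illustration, not the ProGenGrid scheduler: `split`, `dock_fragment` and `sweep` are hypothetical names, a local thread pool stands in for grid nodes, and the "score" is a placeholder with no chemical meaning.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, fragment_size):
    """Cut the input list into consecutive fragments of at most fragment_size items."""
    return [items[i:i + fragment_size] for i in range(0, len(items), fragment_size)]


def dock_fragment(fragment):
    """Placeholder for one docking job: the 'score' is only the name length."""
    return {ligand: len(ligand) for ligand in fragment}


def sweep(ligands, fragment_size, workers=4):
    """Run the same job on every fragment in parallel and merge the results."""
    scores = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(dock_fragment, split(ligands, fragment_size)):
            scores.update(partial)
    return scores


ligands = ["lig_a", "lig_bb", "lig_ccc", "lig_dddd", "lig_eeeee"]
print(split(ligands, 2))
print(sweep(ligands, 2))
```

Because each fragment is processed independently, the legacy docking program itself does not need to change; only the splitting and merging steps are added around it.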
Split Service: General Purpose Schema
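[Figure: general-purpose schema of the Split Service. Labels:
Splitter request (XML format); Split Service; Splitter Component;
ACL; ID, Data & Query; Up/Download; Result ID; Available IDs
Request; Fragments ID; Computational Engine; ClientA and ClientB,
with a vertical label reading "BEHIND THE WALL"; numbered steps
1, 2, 3.]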
Enhanced Split Service
Within the ProGenGrid project we have been
developing an enhanced Split Service customized for
bioinformatics applications.
Customizations are related to:
• Computational Engines
− Autodock
− Dock (Sphgen, grid)
− GAMESS
• Broker functionalities
• Workflow support
• High level functionalities for end users
Conclusions and Future Work
ProGenGrid is a software platform allowing the composition
of existing bioinformatics resources, wrapped as Web
Services, to create complex workflows. In particular, it offers:
• tools for service composition, workflow execution and
monitoring;
• a data integration approach to simplify access to
heterogeneous biological databases.
In the future…
Full implementation of the architecture and its evaluation against
other approaches.
SPACI Project
Southern Partnership for Advanced Computational Infrastructures
A grid infrastructure based on three geographically distributed
High Performance Computing Centers located in Southern Italy
ISUFI/CACT
Center for Advanced Computing
Technologies
University of Lecce
Director: Prof. Giovanni Aloisio
CPS/CNR
Center for Research on Parallel Computing
and Supercomputing
(now Section of Naples of ICAR/CNR)
Director: Prof. Almerico Murli
MIUR/HPCC
Center of Excellence for
High Performance Computing
University of Calabria
Director: Prof. Lucio Grandinetti
For any information…
About ProGenGrid Project
Project P. I.: Maria Mirto (maria.mirto@unile.it)
Giovanni Aloisio (giovanni.aloisio@unile.it)
Massimo Cafaro (massimo.cafaro@unile.it)
Sandro Fiore (sandro.fiore@unile.it)
WebSite: http://datadog.unile.it/progen