myGrid Update and Status Carole Goble 1

myGrid Update and Status
Carole Goble
http://www.mygrid.org.uk
1
• Open Source Upper Middleware for
Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers,
Bioinformaticians and Service
Providers
Newcastle
Sheffield
Manchester
Nottingham
Hinxton
Southampton
2
Philosophy
• Openness
– open source
– open world of services
– open extensible technology
– open to wider eScience context
– open to user feedback
– open to third party metadata
• Collection of components for assembly
• Pick and mix
3
Data-intensive
bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
4
Use Scenarios
Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics,
School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK
Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St
Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK
5
Bioinformatics: Classical
• Perform microarray experiment to identify genes differentially regulated in patients. Import microarray data to Affymetrix Data Mining Tool, run analyses and select candidate genes
• Select suitable gene and find SNPs that lie within
• Study annotations for candidate genes at many different websites
• Find restriction sites and design primers by eye for genotyping experiments
6
Workflow approach
Williams-Beuren Syndrome
• Manually: takes two days (+) including analysis
• Workflow: now takes 30 mins to produce results and half a day for analysis
• Manually: analysis is done as the experiment is performed
• Workflow: analysis is done at the end of the experiment
• Therefore good result co-ordination is needed for backtracking
7
[Figure: layered myGrid service stack. Layer labels: Applications, Work bench, Service Stack, Core services, External services. Audiences: Bioinformaticians, Tool Providers, Service Providers.]
• Applications and work bench: Taverna, Talisman, Web Portal
• Gateway
• Core services: Registries; Service and Workflow Discovery; Ontologies; Ontology Mgt; Views; Metadata Mgt; FreeFluo Workflow Enactment Engine; Personalisation; Provenance; Event Notification
• myGrid Information Repository
• OGSA-DQP Distributed Query Processor
• Web Service (Grid Service) communication fabric
• External services: SoapLab (legacy apps), GowLab (legacy apps), Native Web Services, AMBIT Text Extraction Service
12
myGrid in a nutshell
• Pre-Prototype: experimental; Web-based; requirements gathering
• Prototype 1: architectural workout; all services represented; NetBeans workbench; API-based integration; Info Repository oriented; XML-based process provenance; workflow enactment engine
• Prototype 2: second generation services; reworked information model; open information management; Life Science Identifiers; RDF based provenance; Taverna workbench; Web-based portal
• Demo at ISMB 2003
• Full paper and demo at ISMB 2004
• GSK deployment
• Real biology
13
Two+ Paths
Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
In between
• Event notification
• Gateway
14
Domain Services
• Native WSDL Web services
– DDBJ, NCBI BLAST, PathPort, BioMOBY
• Wrapped legacy services
– SoapLab: one button wrapping; leveraged the EMBOSS Suite; ~159 services
– For each application: CreateJob, Run, WaitFor, GetResults, Destroy
– GowLab: web pages as web services
• EBI Support agreed to support Soaplab services as core business
• Lots of them and lots of redundant services
• The joys of firewalls and licensing
http://industry.ebi.ac.uk/soaplab/
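The per-application lifecycle on this slide (CreateJob, Run, WaitFor, GetResults, Destroy) can be sketched as below. This is only an illustration, not Soaplab's actual interface: the `AnalysisJob` class, the `job` helper and the toy `reverse_sequence` "application" are all hypothetical names, and the job runs in a local thread rather than behind a Web service.

```python
import threading
from contextlib import contextmanager


class AnalysisJob:
    """Hypothetical stand-in for one job on a wrapped legacy application."""

    def __init__(self, analysis, inputs):
        # CreateJob: bind an analysis to its inputs without starting it.
        self._analysis = analysis
        self._inputs = inputs
        self._results = None
        self._thread = None
        self.destroyed = False

    def run(self):
        # Run: start the analysis in the background and return immediately.
        def work():
            self._results = self._analysis(**self._inputs)

        self._thread = threading.Thread(target=work)
        self._thread.start()

    def wait_for(self, timeout=None):
        # WaitFor: block until the analysis finishes or the timeout expires.
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def get_results(self):
        # GetResults: only valid once the job has completed.
        if self._thread is None or self._thread.is_alive():
            raise RuntimeError("job has not completed")
        return self._results

    def destroy(self):
        # Destroy: release whatever the job still holds.
        self._results = None
        self.destroyed = True


@contextmanager
def job(analysis, **inputs):
    """Create a job and guarantee that Destroy is called, even on failure."""
    j = AnalysisJob(analysis, inputs)
    try:
        yield j
    finally:
        j.destroy()


def reverse_sequence(sequence):
    # Toy "legacy application": reverse a sequence string.
    return {"outseq": sequence[::-1]}


def run_to_completion(analysis, **inputs):
    # The five lifecycle steps in order.
    with job(analysis, **inputs) as j:
        j.run()
        j.wait_for()
        return j.get_results()
```

Wrapping the job in a context manager is one way to make sure Destroy is not forgotten when a step fails part-way through.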
15
Workflow environment
• Freefluo workflow enactment engine
http://freefluo.sourceforge.net
• Taverna development and execution environment
http://taverna.sourceforge.net
– Joint work with HGMP
• Simple conceptual unified flow language (XScufl).
• Rapid development and release cycle on SourceForge (LGPL)
• Picked up by others
– Wells Fargo, GODIVA …
• “tethered” programme: own open source development community
[Figure: Taverna Workbench, Scufl language parser, and the Freefluo Enactor Core with Processors for Web Service, Soaplab, Local App and Enactor.]
16
• Control flow, iteration and
data flow
• Data sets and nested flows
• Configurable failure handling
• Incorporated Life Science Id
resolution
• Provenance and status
reporting
• Type and data management
• Plug-ins
• User notification
• Data entry wizard
• Libraries of SHIM services
• Libraries of workflows
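Two of the features listed above, data flow between processors and configurable failure handling, can be sketched in a few lines. This is a toy enactor under stated assumptions, not Freefluo or Scufl: the `enact` function, its argument shapes and the retry policy are hypothetical, and it needs Python 3.9+ for `graphlib`. Iteration and nested flows are not covered.

```python
from graphlib import TopologicalSorter


def enact(processors, links, inputs, retries=0):
    """Run a small data-flow workflow.

    processors: name -> function taking keyword arguments, returning a value
    links:      (source, sink, port) triples; the output of `source` feeds
                the keyword argument `port` of `sink`
    inputs:     name -> value for workflow inputs (sources with no processor)
    retries:    how many times a failing processor is retried
    """
    # Each sink depends on the processors that feed it.
    graph = {name: set() for name in processors}
    for source, sink, _port in links:
        if source in processors:
            graph[sink].add(source)

    values = dict(inputs)
    status = {}
    # static_order() yields every processor after its dependencies.
    for name in TopologicalSorter(graph).static_order():
        kwargs = {port: values[src] for src, sink, port in links if sink == name}
        for attempt in range(retries + 1):
            try:
                values[name] = processors[name](**kwargs)
                status[name] = f"completed after {attempt + 1} attempt(s)"
                break
            except Exception:
                # Give up only when the configured retries are exhausted.
                if attempt == retries:
                    raise
    return values, status
```

The returned `status` dictionary stands in for the status reporting bullet: it records how many attempts each processor needed.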
17
18
OGSA-DQP
• Used in Graves’ Disease
• Uses OGSA-DAI data
access services to
access individual data
resources.
• A single query to access
and join data from more
than one OGSA-DAI
wrapped data resource.
• Supports orchestration of
computational as well as
data access services.
• Interactive interface for
integrating resources and
executing requests.
• Implicit, pipelined and partitioned parallelism.
http://www.ogsa-dai.org.uk/dqp
19
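The idea of one query that joins data from more than one wrapped resource, with results produced as a pipeline, can be illustrated with generators. This is a sketch only and does not use the OGSA-DQP or OGSA-DAI interfaces: `scan` and `hash_join` are hypothetical names, and each "resource" is just an in-memory list of rows.

```python
def scan(resource):
    """Stream rows from one (hypothetical) wrapped data resource."""
    for row in resource:
        yield row


def hash_join(left_rows, right_rows, key):
    """Join two row streams on `key`.

    The left stream is read fully to build a hash table; the right stream
    is then probed row by row, so joined rows are yielded as they are found.
    """
    table = {}
    for row in left_rows:
        table.setdefault(row[key], []).append(row)
    for row in right_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}
```

Because `hash_join` is a generator, a downstream consumer can start work on the first joined row before the right-hand resource has been read to the end.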
Service Discovery (1)
• Scavenge!
• Seek the WSDL
• Incorporating a service into Taverna is straightforward
20
Service discovery (2)
• Registry
• Third party registries
• Third party services
• Third party metadata
• Views over federated registries
• UDDI extended with RDF
• Aim to have a public view on the web.
21
Service and Workflow
registration
Workflow registration allows peer
review and publication of e-Science
methods.
[Figure: a workflow registry entry (Scufl, URI) and the descriptions attached to it, held as OWL/RDF in an RDF store.]
• Executive summary descriptions: inputs, outputs, tasks, component resources
• Operational descriptions: cost, QoS, access rights…
• Provenance descriptions: authors, creation date, institution…
• Invokable interface descriptions, e.g. XML data types: WSDL
• Syntactic descriptions, e.g. MIME types: RDF
• Conceptual descriptions: OWL
• Workflows and Services registered
• Description scheme
• RDFS & DAML+OIL / OWL ontologies of services & biology
• Based on DAML-S
• Reasoning over OWL descriptions
• Query over RDF
• Aim to have semantic discovery over public view on the web.
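What "reasoning over descriptions" buys for discovery can be shown with a toy subsumption check: a service that accepts a general concept also matches data of a more specific one. This sketch does not use OWL, DAML-S or any reasoner; the ontology terms, service names and functions below are hypothetical.

```python
# Toy ontology: child concept -> parent concept (hypothetical terms).
SUBCLASS_OF = {
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "biological_sequence": "data",
}


def ancestors(concept):
    """Yield the concept itself, then each of its ancestors in turn."""
    while concept is not None:
        yield concept
        concept = SUBCLASS_OF.get(concept)


def is_a(concept, target):
    """True if `concept` is `target` or a descendant of it."""
    return target in ancestors(concept)


def discover(registry, input_concept):
    """Names of services whose declared input subsumes `input_concept`."""
    return sorted(
        name
        for name, description in registry.items()
        if is_a(input_concept, description["input"])
    )
```

A plain keyword match on "protein_sequence" would miss a service registered against "biological_sequence"; the subclass walk is what finds it.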
22
Semantic Discovery
Pedro data capture tool
View annotations on workflow
Drag a workflow entry into the explorer pane and the workflow loads.
Drag a service/workflow to the scavenger window for inclusion into the workflow.
23
Provenance (1)
• Organisation level provenance: Person, Organisation, Project, Experiment design
• Process level provenance: Workflow design, Workflow run, and
– Service, e.g. BLAST @ NCBI
– Process, e.g. web service invocation of BLAST @ NCBI
– Event, e.g. completion of a web service invocation at 12.04pm
• Data/knowledge level provenance: Data items linked by
– knowledge statements, e.g. similar protein sequence to
– data derivation, e.g. output data derived from input data
• Relations shown in the diagram: runBy, partOf, run for, instanceOf, componentProcess, componentEvent
• User can add templates to each workflow process to determine links between data items.
25
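Data derivation links are what make backtracking from a result to its inputs possible. A minimal sketch, assuming provenance is held as (subject, predicate, object) triples: the `derived_from` predicate name, the item names and the `derivation_chain` function are illustrative, not the myGrid schema.

```python
def derivation_chain(triples, item, predicate="derived_from"):
    """Return every item that `item` was directly or indirectly derived from."""
    # Index the derivation links: item -> items it was derived from.
    sources = {}
    for subject, pred, obj in triples:
        if pred == predicate:
            sources.setdefault(subject, []).append(obj)

    seen = []
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in sources.get(current, []):
            if parent not in seen:  # guards against cycles and repeats
                seen.append(parent)
                stack.append(parent)
    return seen
```

Triples with other predicates (who ran the workflow, which service was invoked) are ignored here, but the same walk works for any single relation.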
RDF Rules
[Figure: RDF provenance graph around a BLAST report. Node types: project, experiment definition, workflow definition, workflow invocation, service description, service invocation, person, group, organisation. Properties: rdf:type, ..part_of, ..author, ..works_for, ..run_during, ..run_for, ..invocation_of, ..described_by, ..created_by, ..masked_sequence_of, ..filtered_version_of, ..similar_sequences_to, ..nucleotide_sequence, ..BLAST_Report. Data items are identified by LSIDs such as urn:lsid:taverna:datathing:13 and urn:lsid:taverna:datathing:15. The query sequence is gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence; the BLAST report lists similar sequences by accession, score and description.]
A: Relationship BLAST report has with other items in the repository
B: Other classes of information related to BLAST report
26
Using Haystack
27
Information Model v2
Bioinformatics middleware – domain neutral
• Resources and Identifiers
• People, teams and organizations
• Representing the e-science process
• Experimental methods for e-science
• Scientific data and the life-science identifier
– Types
– Identifier Types
– Values and Documents
• Provenance information
• Annotation and Argumentation
[Figure: UML class diagram of the information model. Classes: Study, StudyParticipation, StudyRole, PeopleAndTeams.Person, Agent, LabBookView, ProgrammeResource, Programme, Investigation, Operations.Operation, ExperimentDesign, ExperimentInstance, Resources.Resource. Associations: has participants, participates in, acts in, contains, selected studies, labBooks, uses, method, has instances. Attributes shown include name, description, startTime, endTime, status, rule and roleName, plus the operation getId:URIString.]
In the middle of deployment
30
LSIDs
I3C / IBM / EBI proposal for a Life Science Identifier
Pioneered by myGrid
http://www.i3c.org/wgr/ta/resources/lsid/docs/
• LSID provides a uniform naming scheme.
• LSID Resolver guarantees to resolve to the same data object.
• LSID Authority dishes them out.
• Also returns metadata of object.
• Used throughout myGrid as an object naming device.
• myGrid Repository acts as an LSID Authority
• LSID allows universal access to results for collaboration, as well as for review.
• RDF+LSID explains the context of results, and provides guidance for further investigations.
31
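An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision, for example urn:lsid:taverna:datathing:13. A minimal parser, for illustration only: the `LSID` tuple and `parse_lsid` function are hypothetical names, and no resolution against an LSID Authority is attempted.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its parts, or raise ValueError."""
    parts = text.split(":")
    # Expect "urn", "lsid", then three required fields and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty field: {text!r}")
    return LSID(*parts[2:])
```

Keeping the parts separate makes it easy to see which authority would be asked to resolve the identifier.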
Event notification
• Used by a commercial company in India.
• Push and pull
• Publisher-subscriber
• Asynchronous
• Durable topics
• Dynamic hierarchical namespace for topics
• http://cvs.mygrid.org.uk/notificationstable/downloads
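Push and pull delivery over a hierarchical topic namespace can be sketched with a small in-memory broker. This is not the myGrid notification service: the `Broker` class, its method names and the "/"-separated topic syntax are assumptions, and delivery here is synchronous for simplicity.

```python
from collections import defaultdict


class Broker:
    """Hypothetical publisher-subscriber broker with hierarchical topics."""

    def __init__(self):
        self._callbacks = defaultdict(list)  # topic -> push subscribers
        self._mailboxes = {}                 # (topic, subscriber) -> stored messages

    def subscribe_push(self, topic, callback):
        # Push: the broker calls the subscriber when a message arrives.
        self._callbacks[topic].append(callback)

    def subscribe_pull(self, topic, subscriber):
        # Pull: messages are kept until the subscriber asks for them,
        # which is one simple reading of a "durable" subscription.
        self._mailboxes.setdefault((topic, subscriber), [])

    def publish(self, topic, message):
        # A message on "a/b/c" is delivered to subscribers of "a", "a/b" and "a/b/c".
        parts = topic.split("/")
        for depth in range(1, len(parts) + 1):
            ancestor = "/".join(parts[:depth])
            for callback in self._callbacks.get(ancestor, []):
                callback(topic, message)
            for (sub_topic, _subscriber), box in self._mailboxes.items():
                if sub_topic == ancestor:
                    box.append((topic, message))

    def pull(self, topic, subscriber):
        # Return and clear everything stored for this subscription.
        box = self._mailboxes[(topic, subscriber)]
        messages = list(box)
        box.clear()
        return messages
```

Topics are created on first use, so the namespace grows dynamically as publishers and subscribers name new branches.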
32
Text Services Architecture
[Figure: text services architecture, with three components and the data passed between them.]
• User Client: XScufl workflow definition + parameters; clustered PubMed Ids + titles
• Workflow Server: workflow enactment of an Initial Workflow (Swissprot/Blast record, Extract PubMed Id, Get Related Abstracts) and a Cluster Abstracts Workflow
• Medline Server (Sheffield): Get Medline Abstract, from PubMed Ids to term-annotated Medline abstracts
• Medline: pre-processed offline to extract biomedical terms + indexed
33
To Dos
• Improve results management
• Deployment of mIR
• Portal for finding workflows, launching & monitoring workflows, launching Taverna, browsing results
• Deploying publicly accessible semantic registry
• Reinstate service discovery during enactment
• Large scale data throughput workflow engine
• Event notification on services
• Using provenance graphs for impact analysis
• Hiding LSIDs
• Lexicons for concept names
• Hardening semantic discovery
• Ambient Text
• Er..Security
• Etc…
• “myGrid in a box”
35
End game(s)
• myGrid-in-a-box
• Technical follow-ons
– Best practice (6) and OMII (Freefluo, Taverna, Event notification) bids
• Research follow-ons
– Semantic Grids, Data Grids, Workflow, Provenance services
– PhD students
• Science follow-ons
– Life Sciences: ISPIDER, e-Fungi
– Clinical: PsyGrid, CLEF-II
– PhD students
• Networking
– LinK-up with BIRN/SEEK/GEON (SDSC) & SCEC/GriPhyN (ISI, USC)
36
Project Follow ons
ISPIDER
OGSA-DAIT
SIMDAT
DQP
FreeFluo
e-Fungi
CLEF
LinK-up
Provenance
Army of
PhD students
Semantic
Discovery
Provenance
PASOA
OntoGrid
DynamO
37
Proposals
ISPIDER
OGSA-DAIT
SIMDAT
DQP
e-Fungi
FreeFluo
CLEF-II
PsyGrid
Rapid
Prototyping
CLEF
LinK-up Experimental
NSF MOBY-S
Data mgt
myGrid+IB
SACK
Recycle
Provenance
Semantic
Discovery
Provenance
PASOA
OntoGrid
Text mining
myClinicalGrid
DynamO
38
Wrap Up
• Achieved a great deal – software, publications, demos, take-up
• Managed the transition from generic middleware development to practical, day-to-day useful services
• Real users (plural) fundamental to that
– Moved a lot of manpower to individual support of domain scientists
– Created and now support a user base
– Bridging will need to continue to be supported
• End to end support for an entire scenario
• Show stoppers for practical adoption are not sexy technical showstoppers
– Can I incorporate my favourite service?
– Can I manage the results?
• Grass roots
• By tapping into (de facto) standards and communities we can leverage others' results and tools – LSID, Haystack, Pedro.
• http://www.mygrid.org.uk
39
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
40
myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis,
Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole
Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna,
Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit
Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry
Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone,
Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil
Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of
Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital,
Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith
Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker
41