myGrid Update and Status
Carole Goble
http://www.mygrid.org.uk

• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers, Bioinformaticians and Service Providers
Newcastle, Sheffield, Manchester, Nottingham, Hinxton, Southampton

Philosophy
• Openness
  – open source
  – open world of services
  – open extensible technology
  – open to wider eScience context
  – open to user feedback
  – open to third party metadata
• Collection of components for assembly
• Pick and mix

Data-intensive bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

Use Scenarios

Graves' Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and gene expression analysis
• Services from Japan, Hong Kong, various sites in UK

Williams-Beuren Syndrome
• Microdeletion of 1.55 Mbases on Chromosome 7
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and gene expression analysis
• Services from USA, Japan, various sites in UK

Bioinformatics: Classical
• Perform microarray experiment to identify genes differentially regulated in patients.
• Import microarray data to Affymetrix Data Mining Tool, run analyses and select candidate genes
• Select suitable gene and find SNPs that lie within
• Study annotations for candidate genes at many different websites
• Find restriction sites and design primers by eye for genotyping experiments

Workflow approach
Williams-Beuren Syndrome
• Manually: takes two days (+) including analysis
• Now takes 30 mins to produce results and half a day for analysis
• Manually: do analysis as you perform the experiment
• Workflow: do analysis at end of experiment
• Therefore need good result co-ordination for backtracking

[Figure: myGrid service stack]
• Applications (Bioinformaticians, Tool Providers): Taverna workbench, Talisman, Web Portal, Gateway
• Core services: Registries; Service and Workflow Discovery; Ontologies; Ontology Mgt; Views; Metadata Mgt; FreeFluo Workflow Enactment Engine; Personalisation; Provenance; Event Notification; myGrid Information Repository; OGSA-DQP Distributed Query Processor
• Web Service (Grid Service) communication fabric
• External services (Service Providers): SoapLab (legacy apps); GowLab (legacy apps); Native Web Services; AMBIT Text Extraction Service

myGrid in a nutshell
• Pre-Prototype: experimental; Web-based; requirements gathering
• Prototype 1: architectural workout; all services represented; NetBeans workbench; API-based integration; Info Repository oriented; XML-based process provenance; workflow enactment engine
• Prototype 2: second generation services; reworked information model; open information management; Life Science Identifiers; RDF based provenance; Taverna workbench; Web-based portal
• Demo at ISMB 2003
• Full paper and demo at ISMB 2004
• GSK deployment
• Real biology

Two+ Paths
Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining
• Information model & management
In between
• Event notification
• Gateway
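
The enactment and provenance components named above can be pictured with a small sketch. What follows is a minimal illustration in Python, not the Freefluo or Taverna API: the `Processor` and `Enactor` classes, the `backtrack` helper and the stand-in steps are all hypothetical names introduced here. It runs each step once its named inputs exist and records which inputs every output was derived from, which is the kind of result co-ordination that backtracking relies on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Processor:
    """One workflow step: a name, the names of its inputs, and a function."""
    name: str
    inputs: Tuple[str, ...]
    run: Callable[..., str]


@dataclass
class Enactor:
    """Hypothetical toy enactor; this is NOT the Freefluo API."""
    processors: List[Processor]
    provenance: List[Dict[str, object]] = field(default_factory=list)

    def enact(self, workflow_inputs: Dict[str, str]) -> Dict[str, str]:
        data = dict(workflow_inputs)  # named data items produced so far
        pending = list(self.processors)
        while pending:
            # A processor is ready once every input it names is available.
            ready = [p for p in pending if all(i in data for i in p.inputs)]
            if not ready:
                names = ", ".join(p.name for p in pending)
                raise ValueError("unsatisfiable inputs for: " + names)
            for p in ready:
                data[p.name] = p.run(*[data[i] for i in p.inputs])
                # Record which inputs this output was derived from.
                self.provenance.append(
                    {"processor": p.name, "derived_from": list(p.inputs)})
                pending.remove(p)
        return data

    def backtrack(self, item: str) -> List[str]:
        """Return every upstream item that `item` was derived from."""
        parents = {r["processor"]: r["derived_from"] for r in self.provenance}
        seen: List[str] = []
        stack = list(parents.get(item, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(parents.get(current, []))
        return seen


# Stand-in steps; a real workflow would invoke remote services instead.
workflow = Enactor([
    Processor("report", ("masked",), lambda s: f"hits for {s}"),
    Processor("masked", ("sequence",), lambda s: s.lower()),
])
results = workflow.enact({"sequence": "AAGCTT"})
print(results["report"])
print(workflow.backtrack("report"))
```

The steps are declared out of order on purpose: the enactor orders them by data availability, so the declaration order does not matter, and `backtrack("report")` walks the recorded derivations back to the original input.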
Domain Services
• Native WSDL Web services
  – DDBJ, NCBI BLAST, PathPort, BioMOBY
• Wrapped legacy services
  – SoapLab
    – One button wrapping
    – Leveraged the EMBOSS Suite
    – ~159 services
    – For each application: CreateJob, Run, WaitFor, GetResults, Destroy
  – GowLab
    – Web pages as web services
• Lots of them and lots of redundant services
• EBI support: agreed to support Soaplab services as core business
• The joys of firewalls and licensing
http://industry.ebi.ac.uk/soaplab/

Workflow environment
• Freefluo workflow enactment engine
  http://freefluo.sourceforge.net
• Taverna development and execution environment
  http://taverna.sourceforge.net
  – Taverna Workbench
  – Scufl language parser
  – Joint work with HGMP
• Simple conceptual unified flow language (XScufl)
• Rapid development and release cycle on SourceForge (LGPL)
• Picked up by others – Wells Fargo, GODIVA …
• “tethered” programme: own open source development community
[Figure: Freefluo Enactor Core driving Processors for Web Service, Soaplab, Local App and Enactor]

• Control flow, iteration and data flow
• Data sets and nested flows
• Configurable failure handling
• Incorporated Life Science Id resolution
• Provenance and status reporting
• Type and data management
• Plug-ins
• User notification
• Data entry wizard
• Libraries of SHIM services
• Libraries of workflows

OGSA-DQP
• Used in Graves' Disease
• Uses OGSA-DAI data access services to access individual data resources.
• A single query to access and join data from more than one OGSA-DAI wrapped data resource.
• Supports orchestration of computational as well as data access services.
• Interactive interface for integrating resources and executing requests.
• Implicit, pipelined and partitioned parallelism.
http://www.ogsa-dai.org.uk/dqp

Service Discovery (1)
• Scavenge!
• Seek the WSDL
• Incorporating a service into Taverna is straightforward

Service discovery (2)
• Registry
• Third party registries
• Third party services
• Third party metadata
• Views over federated registries
• UDDI extended with RDF
• Aim to have a public view on the web.

Service and Workflow registration
Workflow registration allows peer review and publication of e-Science methods.
[Figure: Workflow registry entry (Scufl, URI), made up of:]
  – Workflow Executive Summary Descriptions: Inputs, Outputs, Tasks, Component resources
  – Operational Descriptions: Cost, QoS, Access rights…
  – Provenance Descriptions: Authors, creation date, institution…
  – Invokable Interface descriptions, e.g. XML data types (stored: WSDL)
  – Syntactic descriptions, e.g. MIME types (RDF)
  – Conceptual descriptions, OWL encoded (RDF Store, OWL/RDF)
• Workflows and Services registered
• Description scheme
• RDFS & DAML+OIL / OWL ontologies of services & biology
• Based on DAML-S
• Reasoning over OWL descriptions
• Query over RDF
• Aim to have semantic discovery over public view on the web.

Semantic Discovery
• Pedro data capture tool
• View annotations on workflow
• Drag a workflow entry into the explorer pane and the workflow loads.
• Drag a service/workflow to the scavenger window for inclusion into the workflow

Provenance (1)
[Figure: levels of provenance]
• Organisation level provenance: Project, Experiment design, Workflow design, Workflow run, Person, Organisation (relations: partOf, instanceOf, run for)
• Process level provenance: Service (e.g. BLAST @ NCBI), Process (e.g. web service invocation of BLAST @ NCBI), Event (e.g. completion of a web service invocation at 12.04pm) (relations: runBy, componentProcess, componentEvent)
• Data/knowledge level provenance: Data items; data derivation (e.g. output data derived from input data); knowledge statements (e.g. similar protein sequence to)
• User can add templates to each workflow process to determine links between data items.

[Figure: RDF provenance graph around a BLAST report, with RDF Rules]
• Node classes: project, experiment definition, workflow definition, workflow invocation, service description, service invocation, nucleotide_sequence, BLAST_Report, person, group, organisation
• Relations: rdf:type, ..masked_sequence_of, ..similar_sequences_to, ..filtered_version_of, ..part_of, ..invocation_of, ..described_by, ..created_by, ..author, ..works_for, ..run_during, ..run_for
• Data items: urn:lsid:taverna:datathing:13, urn:lsid:taverna:datathing:15
• Input sequence: >gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
• BLAST report: list of similar sequences, headed by AC005089.3 831 Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
A: Relationship BLAST report has with other items in the repository
B: Other classes of information related to BLAST report

Using Haystack

Information Model v2
Bioinformatics middleware – domain neutral
• Resources and Identifiers
• People, teams and organizations
• Representing the e-science process
• Experimental methods for e-science
• Scientific data and the life-science identifier
  – Types
  – Identifier Types
  – Values and Documents
• Provenance information
• Annotation and Argumentation
[Figure: UML class diagram of the information model. Classes shown include Resources.Resource, Agent, Subject, Object, Study, StudyParticipation, StudyRole, PeopleAndTeams.Person, scmInvestigator, LabBookView, ProgrammeResource, Programme, Investigation, Operations.Operation, ExperimentDesign and ExperimentInstance. Associations are labelled has participants, participates in, acts in, selected studies, labBooks, contains, uses, method and has instances. Attributes visible include name, description, startTime, endTime, status, rule, roleName and getId:URIString.]
In the middle of deployment

LSIDs
I3C / IBM / EBI proposal for a Life Science Identifier
http://www.i3c.org/wgr/ta/resources/lsid/docs/
Pioneered by myGrid
• LSID provides a uniform naming scheme.
• LSID Resolver guarantees to resolve to same data object.
• LSID Authority dishes them out.
• Also returns metadata of object.
• Used throughout myGrid as an object naming device.
• myGrid Repository acts as an LSID Authority
• LSID allows universal access to results for collaboration, as well as for review.
• RDF+LSID explains the context of results, and provides guidance for further investigations.

Event notification
• Used by commercial company in India.
• Push and pull
• Publisher-subscriber
• Asynchronous
• Durable topics
• Dynamic hierarchical namespace for topics
• http://cvs.mygrid.org.uk/notificationstable/downloads

Text Services Architecture
[Figure: Text Services Architecture. Components: User Client; Workflow Server with Workflow Enactment (Initial Workflow, Extract PubMed Id, Get Related Abstracts, Cluster Abstracts, Get Medline Abstract); Medline Server (Sheffield). Data passed between them: XScufl workflow definition + parameters; Swissprot/Blast record; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles.]
Medline: pre-processed offline to extract biomedical terms + indexed

To Dos
• Improve results management
• Deployment of mIR
• Portal for finding workflows, launching & monitoring workflows, launching Taverna, browsing results
• Deploying publicly accessible semantic registry
• Reinstate service discovery during enactment
• Large scale data throughput workflow engine
• Event notification on services
• Using provenance graphs for impact analysis
• Hiding LSIDs
• Lexicons for concept names
• Hardening semantic discovery
• Ambient Text
• Er..Security
• Etc…
• “myGrid in a box”

End game(s)
• myGrid-in-a-box
• Technical follow-ons
  – Best practice (6) and OMII (Freefluo, Taverna, Event notification) bids
• Research follow-ons
  – Semantic Grids, Data Grids, Workflow, Provenance services
  – PhD students
• Science follow-ons
  – Life Sciences: ISPIDER, e-Fungi
  – Clinical: PsyGrid, CLEF-II
  – PhD students
• Networking
  – LinK-up with BIRN/SEEK/GEON (SDSC) & SCEC/GriPhyN (ISI, USC)

Project Follow ons
ISPIDER, OGSA-DAIT, SIMDAT, DQP, FreeFluo, e-Fungi, CLEF, LinK-up, Provenance, Army of PhD students, Semantic Discovery, Provenance, PASOA, OntoGrid, DynamO

Proposals
ISPIDER, OGSA-DAIT,
SIMDAT, DQP, e-Fungi, FreeFluo, CLEF-II, PsyGrid, Rapid Prototyping, CLEF, LinK-up, Experimental, NSF, MOBY-S, Data mgt, myGrid+IB, SACK, Recycle, Provenance, Semantic Discovery, Provenance, PASOA, OntoGrid, Text mining, myClinicalGrid, DynamO

Wrap Up
• Achieved a great deal
  – software, publications, demos, take-up
• Managed the transition from generic middleware development to practical, day-to-day useful services
• Real users (plural) fundamental to that
  – Moved a lot of manpower to individual support of domain scientists
  – Created and now support a user base
  – Bridging will need to continue to be supported
• End to end support for an entire scenario
• Show stoppers for practical adoption are not sexy technical showstoppers
  – Can I incorporate my favourite service?
  – Can I manage the results?
• Grass roots
• By tapping into (de facto) standards and communities we can leverage others' results and tools
  – LSID, Haystack, Pedro.
• http://www.mygrid.org.uk

Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net

myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker