Data-driven research with e-Laboratories
Stuart Owen
University of Manchester
stuart.owen@manchester.ac.uk
http://www.mygrid.org.uk

• Scientific workflow management system for accessing open, public data services, assembling data processing and analysis pipelines and recording provenance. LGPL
  – 361 organisations, 48 countries
  – 70,000+ binary downloads, ~4000 source
• Social collaboration environments for sharing, curating and cataloguing personal, group and community contributed scientific assets. BSD
  – 5000+ registered users, 56 countries
  – 1600+ workflows, 1700+ services
• Handy tools for data management tasks in bioinformatics. BSD

Web-based Software & Sharing Services
“Mobilising the long tail of scientists for all our benefit”
Common Ruby on Rails platform
Common and exchanged codebases
• Scientific workflows, scripts and pipelines
  – Now also neuroscience, music and numerical analysis
  – Developed with Oxford and Southampton
• Systems Biology models, data and protocols
  – Adopted by 4 EU-wide consortia and 4 UK sites
  – Developed with HITS and Stellenbosch
• Crowd sourced curated Web services
  – Adopted by EdUnify and ELDA education projects
  – Developed with EBI and EMBRACE network
• Find experts, advice, scripts, variable sets
  – Towards interface for UK Data Archives
  – Developed with NIBHI

SysMO-DB Project
A data access, model handling and data integration platform for Systems Biology:
• To support and manage the diversity of
  – Data, Models and experimental protocols (SOPs) from a consortium
• Web based
• Standards compliant DB

Systems Biology of Microorganisms
http://www.sysmo.net
• Pan European collaboration
• 13 individual projects, >100 institutes
  – Different research outcomes
  – A cross-section of microorganisms, incl.
    bacteria, archaea and yeast
• Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way
• Present these processes in the form of computerized mathematical models
• Pool research capacities and know-how
• Already running since April 2007
• Runs for 3-5 years
• This year, 2 new projects joined and 6 left

Data Driven
• Multiple omics
  – genomics, transcriptomics
  – proteomics, metabolomics
  – fluxomics, reactomics
• Images
• Molecular biology
• Reaction Kinetics
• Models
  – Metabolic, gene network, kinetic
• Relationships between data sets/experiments
  – Procedures, experiments, data, results and models
• Analysis of data

A Tree View of Assets
Diagram labels: Investigation, Studies, Assay, SOP, Construction, Validation
ISA infrastructure provides a directory structure for experiments
http://isatab.sourceforge.net/

Just Enough Sharing
Access Permissions
...we don’t talk about security

Reward and Provenance
Attribution. Trust. Credit
Reusing myExperiment

Just Enough sharing
Diagram labels: SysMOLab Wiki, COSMIC, Fetch on Request, Alfresco, MOSES Wiki, ANOTHER, Direct Upload, A DATA STORE, SOP

RightField: Annotation by Stealth
http://rightfield.org.uk

SEEK, the e-Laboratory
A dynamic resource for analysis as well as browsing
• Automatic comparison of data from inside files
• Understanding where and how data and models are linked
• Running simulations with new experimental data
• Running analyses and workflows over the data and models

Open Integration: JWS Simulator
• Web based easy to use interface: “runs in your browser”, integrated in SEEK
• Standard simulation functionality
• Models can be accessed via browser, SEEK and web services.
• Data linked to models via file upload (e.g. Excel), or via database connection.
Data Fuse

Taverna Workbench
Screenshot labels: Available services, Workflow diagram, Workflow Explorer
http://www.taverna.org.uk

The Taverna Open Suite of Tools
Diagram labels: Workflow Repository; GUI Workbench; Web Portals; Client User Interfaces; Virtual Machine; Third Party Tools; Service Catalogue; Workflow Engine; Provenance Store; Workflow Server; Activity and Service Plug-in Manager; Open Provenance Model; Secure Service Access; Programming and APIs

Taverna and the ‘Cloud’
Analysing Next Generation Sequencing Data

Analysing African Cattle with Taverna 2.2
10,000 years separation
African Livestock adaptations:
• Hardier
• Better disease resistance
Potential outcomes:
• Food security
• Understanding resistance
• Understanding environmental conditions
• Drought
• Parasites
• Understanding diversity
http://www.bbc.co.uk/news/10403254

The Analysis Pipeline (in Perl)
Input: SNP data from sequencer
MAP: Map between Genome Builds (Liftover)
FILTER: Filter for SNPs in Exons
ANALYSIS: SNP consequences; Identifying damaging SNPs (PolyPhen)
Harry Noyes – University of Liverpool

Workflow and phases
• Input SNP file
• Populate DB with start SNPs and resource version numbers
• Lift-over: maps between UMD3 and BTA4 cow assemblies
• Exon positions from Ensembl
• Find SNPs in Exon regions
• PolyPhen to mark “damaging” SNPs

Accessing Taverna on the Cloud
Architecture overview
Loading inputs
Experiment Metadata Input Provenance Jobs Status
Input data summary

Summary of Workflow Output
• 11 million SNPs for N’Dama
• Non-synonymous coding SNPs
• PolyPhen predictions: probably damaging
The result can be downloaded as a MySQL database or as a TSV / CSV file

Why use the Cloud?
• This is a highly repetitive task
  – And “embarrassingly parallel”
• But it also needs to be done on demand
• And within the financial reach of researchers
  – Who do not always have access to their own compute
• We have very fast network access
  – So we don’t need to do this in-house

Timings

SEEK as a data analysis and meta analysis service
• SBML model construction and population

Calibration workflow
Data requirements
• Parameterised SBML model
• Experimental data
  – Metabolite concentrations from key results database
Calibration by COPASI web service
Peter Li

Workflow Management System
Search and Analysis across data sets, models and stuff
• Analysis pool
• Analysis As A Cloud Service
• Analysis using Cloud Computing Services
• Run analysis tools and knowledge bases

• Automated Model Generation: MCISB Centre (Li)
• Annotation pipeline: SUMO SysMO project (Maleki-Dizaji)
• Next Gen Seq annotation pipelines using Amazon Cloud Services (Noyes, Li)

Li et al, BMC Bioinformatics 2010, 11:582, doi:10.1186/1471-2105-11-582, highly accessed
Hucka and Le Novère, BMC Biology 2010, 8:140, doi:10.1186/1741-7007-8-140

SysMO-DB Dev Team
Sergejs Aleksejevs, Wolfgang Müller, Carole Goble, Olga Krebs, Quyen Nguyen, Stuart Owen, Jacky Snoep, Franco du Preez, Katy Wolstencroft, Finn Bacall
Heidelberg Institute for Theoretical Studies, Germany
University of Manchester, UK
University of Stellenbosch, South Africa

Further Information
• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• SEEK – http://www.sysmo-db.org
• RightField – http://www.rightfield.org.uk
• MethodBox – http://www.methodbox.org.uk
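
Illustration: filtering SNPs in exons, chunk by chunk
The "Filter for SNPs in Exons" phase of the cattle pipeline, and the reason the task is called “embarrassingly parallel”, can be sketched roughly as below. This is not the original Perl pipeline or the Taverna workflow. It is a minimal Python sketch in which the exon coordinates, the function names and the chunk size are all invented for illustration. It assumes exon intervals per chromosome are sorted and non-overlapping. Threads are used only to keep the example short; in a cloud setting each chunk would more plausibly go to a separate worker machine.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

# Hypothetical exon table: chromosome -> sorted, non-overlapping
# (start, end) intervals, inclusive. Real positions would come from Ensembl.
EXONS = {
    "1": [(100, 200), (500, 650)],
    "2": [(50, 75)],
}


def in_exon(chrom, pos, exons=EXONS):
    """Return True if position pos on chromosome chrom lies inside an exon."""
    intervals = exons.get(chrom, [])
    # Index of the last interval whose start is <= pos (or -1 if none).
    idx = bisect_right(intervals, (pos, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= pos <= intervals[idx][1]


def filter_chunk(snps):
    """Keep only the SNPs of one chunk that fall in an exon."""
    return [(chrom, pos) for chrom, pos in snps if in_exon(chrom, pos)]


def chunked(items, size):
    """Split a list into consecutive chunks of at most size items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_snps(snps, chunk_size=2, workers=4):
    """Filter SNPs chunk by chunk; no chunk depends on any other chunk."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(filter_chunk, chunked(snps, chunk_size))
        # pool.map yields results in input order, so SNP order is preserved.
        return [snp for chunk in results for snp in chunk]


# Made-up input: (chromosome, position) pairs.
snps = [("1", 150), ("1", 300), ("1", 650), ("2", 10), ("2", 60), ("3", 5)]
exonic = filter_snps(snps)
```

Because each SNP is tested independently, the chunk size and the number of workers change only how the work is split, not the result.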