Data-Driven research with e-Laboratories

Stuart Owen
University of Manchester
stuart.owen@manchester.ac.uk
http://www.mygrid.org.uk
Scientific workflow management system for
accessing open, public data services,
assembling data processing and analysis
pipelines and recording provenance. LGPL
361 organisations, 48 countries
70,000+ binary downloads, ~4,000 source
Social collaboration environments for sharing,
curating and cataloguing personal, group and
community contributed scientific assets. BSD
5000+ registered users, 56 countries
1600+ workflows, 1700+ services
Handy tools for data management tasks in
bioinformatics. BSD
Web-based Software & Sharing Services
“Mobilising the long tail of scientists for all our benefit”
Common Ruby on Rails platform
Common and exchanged codebases
Scientific workflows, scripts and pipelines
Now also neuroscience, music and numerical analysis
Developed with Oxford and Southampton
Systems Biology models, data and protocols
Adopted by 4 EU-wide consortia and 4 UK sites
Developed with HITS and Stellenbosch
Crowd-sourced, curated Web services
Adopted by EdUnify and ELDA education projects
Developed with EBI and EMBRACE network
Find experts, advice, scripts, variable sets
Towards interface for UK Data Archives
Developed with NIBHI
SysMO-DB Project
A data access, model handling and data
integration platform for Systems Biology:
• To support and manage the diversity of
– Data, Models and experimental protocols
(SOPs) from a consortium
• Web based
• Standards compliant
Systems Biology of Microorganisms
http://www.sysmo.net
• Pan European collaboration
• 13 individual projects, >100 institutes
– Different research outcomes
– A cross-section of microorganisms, incl.
bacteria, archaea and yeast
• Record and describe the dynamic
molecular processes occurring in
microorganisms in a comprehensive way
• Present these processes in the form of
computerized mathematical models
• Pool research capacities and know-how
• Running since April 2007
• Runs for 3-5 years
• This year, 2 new projects joined and 6 left
Data Driven
• Multiple omics
– genomics, transcriptomics
– proteomics, metabolomics
– fluxomics, reactomics
• Images
• Molecular biology
• Reaction Kinetics
• Models
– Metabolic, gene network, kinetic
• Relationships between data sets/experiments
– Procedures, experiments, data, results and models
• Analysis of data
A Tree View of Assets
Investigation
Studies
SOP
Assay
SOP
ISA infrastructure provides a
directory structure for
experiments
http://isatab.sourceforge.net/
SOP
Construction
Validation
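The Investigation / Study / Assay tree above can be sketched in code. This is a minimal illustration only, not the SEEK or ISA-TAB data model: the class names, field names and example titles are all assumptions chosen to mirror the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    sops: List[str] = field(default_factory=list)  # SOPs followed by this assay


@dataclass
class Study:
    title: str
    sops: List[str] = field(default_factory=list)  # SOPs attached at study level
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def tree_lines(inv: Investigation) -> List[str]:
    """Render the hierarchy as indented lines, one asset per line."""
    lines = [f"Investigation: {inv.title}"]
    for study in inv.studies:
        lines.append(f"  Study: {study.title}")
        lines.extend(f"    SOP: {sop}" for sop in study.sops)
        for assay in study.assays:
            lines.append(f"    Assay: {assay.title}")
            lines.extend(f"      SOP: {sop}" for sop in assay.sops)
    return lines


# Hypothetical example content, for illustration only.
example = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Model construction",
            sops=["Construction SOP"],
            assays=[Assay(title="Model validation", sops=["Validation SOP"])],
        )
    ],
)

print("\n".join(tree_lines(example)))
```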
Just Enough Sharing
Access
Permissions
...we don’t talk about security
Reward and Provenance
Attribution.
Trust.
Credit
Reusing myExperiment
Just Enough sharing
SysMOLab
Wiki
COSMIC
Fetch on
Request
Alfresco
MOSES
Wiki
ANOTHER
Direct
Upload
A DATA
STORE
SOP
RightField: Annotation by Stealth
http://rightfield.org.uk
SEEK, the e-Laboratory
A dynamic resource for analysis as well as browsing
• Automatic comparison of data from inside files
• Understanding where and how data and models are
linked
• Running simulations with new experimental data
• Running analyses and workflows over the data and
models
Open Integration: JWS Simulator
Web based easy to use interface:
“runs in your browser”, integrated in SEEK
Standard
simulation
functionality
Models can be accessed via browser, SEEK and web services.
Data linked to models via file upload (e.g. Excel), or via
database connection.
Data Fuse
Taverna Workbench
Available
services
Workflow
diagram
Workflow
Explorer
http://www.taverna.org.uk
The Taverna Open Suite of Tools
Workflow Repository
GUI Workbench
Web Portals
Client User Interfaces
Virtual
Machine
Third Party Tools
Service Catalogue
Workflow Engine
Provenance
Store
Workflow
Server
Activity and Service
Plug-in Manager
Open
Provenance
Model
Secure Service Access
Programming and
APIs
Taverna and the ‘Cloud’
Analysing Next Generation Sequencing Data
Analysing African Cattle with Taverna 2.2
10,000 years separation
African Livestock adaptations:
• Hardier
• Better disease resistance
Potential outcomes:
• Food security
• Understanding resistance
• Understanding environmental conditions
– Drought
– Parasites
• Understanding diversity
http://www.bbc.co.uk/news/10403254
The Analysis Pipeline (in Perl)
Input SNP data from sequencer
MAP
FILTER
Map between
Genome Builds (Liftover)
Filter for SNPs in Exons
ANALYSIS
SNP consequences
Harry Noyes –
University of Liverpool
Identifying damaging SNPs
(PolyPhen)
Workflow and phases
Input SNP file
Populate DB with start SNPs and
resource version numbers
Lift-over: maps between UMD3
and BTA4 cow assemblies
Exon positions from Ensembl
Find SNPs in Exon regions
PolyPhen to mark “damaging” SNPs
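The "find SNPs in exon regions" phase can be sketched as an interval lookup. The original pipeline is written in Perl and its implementation is not shown here, so this Python sketch is an assumption throughout: the function names, the `(chromosome, position)` SNP representation and the 1-based inclusive exon coordinates are all hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based and inclusive
Snp = Tuple[str, int]   # (chromosome, position)


def build_index(exons_by_chrom: Dict[str, List[Exon]]) -> Dict[str, List[Exon]]:
    """Sort exons per chromosome and merge overlapping ones."""
    index: Dict[str, List[Exon]] = {}
    for chrom, exons in exons_by_chrom.items():
        merged: List[Exon] = []
        for start, end in sorted(exons):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous exon: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def in_exon(index: Dict[str, List[Exon]], chrom: str, pos: int) -> bool:
    """Binary-search for the last exon starting at or before pos."""
    exons = index.get(chrom, [])
    i = bisect_right(exons, (pos, float("inf"))) - 1
    return i >= 0 and exons[i][0] <= pos <= exons[i][1]


def filter_exonic(index: Dict[str, List[Exon]], snps: List[Snp]) -> List[Snp]:
    """Keep only the SNPs that fall inside an exon."""
    return [(chrom, pos) for chrom, pos in snps if in_exon(index, chrom, pos)]


# Made-up coordinates, for illustration only.
index = build_index({"1": [(100, 200), (150, 250), (400, 500)]})
snps = [("1", 99), ("1", 100), ("1", 250), ("1", 300), ("1", 450), ("2", 120)]
print(filter_exonic(index, snps))
```

Merging overlapping exons first means each lookup only has to check one candidate interval, which keeps the filter fast for millions of SNPs.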
Accessing Taverna on the Cloud
Architecture overview
Loading inputs
Experiment
Metadata
Input Provenance
Jobs Status
Input data
summary
Summary of Workflow Output
11 million SNPs for N’Dama
Non-synonymous coding
SNPs
PolyPhen predictions:
probably damaging
The result can be downloaded
as a MySQL database or TSV /
CSV download
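A TSV or CSV export of this kind can be sketched with Python's standard `csv` module. The column names (`chrom`, `pos`, `prediction`) and the example row are assumptions for illustration; the actual output schema of the workflow is not given here.

```python
import csv
import io
from typing import Dict, List


def to_delimited(rows: List[Dict[str, object]], fieldnames: List[str],
                 delimiter: str = "\t") -> str:
    """Serialise result rows as delimited text (tab for TSV, comma for CSV)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical result row, for illustration only.
rows = [{"chrom": "1", "pos": 100, "prediction": "probably damaging"}]
columns = ["chrom", "pos", "prediction"]

print(to_delimited(rows, columns))                 # TSV
print(to_delimited(rows, columns, delimiter=","))  # CSV
```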
Why use the Cloud?
• This is a highly repetitive task
– And “embarrassingly parallel”
• But it also needs to be done on demand
• And within the financial reach of researchers
– Who do not always have access to their own compute
• We have very fast network access
– So we don’t need to do this in-house
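"Embarrassingly parallel" here means the SNP list can be cut into independent chunks that never need to communicate. The sketch below shows that pattern on one machine, using a thread pool as a stand-in for cloud workers. `annotate_chunk` is a hypothetical placeholder for the real per-chunk work (lift-over, exon filter, PolyPhen), and the chunk size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Snp = Tuple[str, int]  # (chromosome, position)


def chunked(items: Sequence[Snp], size: int) -> List[List[Snp]]:
    """Split the input into independent chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def annotate_chunk(chunk: List[Snp]) -> List[str]:
    """Hypothetical stand-in for the real per-chunk analysis."""
    return [f"{chrom}:{pos}" for chrom, pos in chunk]


def run_parallel(items: Sequence[Snp], size: int, workers: int = 4) -> List[str]:
    """Process every chunk independently and concatenate results in input order."""
    chunks = chunked(items, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        results = list(pool.map(annotate_chunk, chunks))
    return [line for chunk_result in results for line in chunk_result]


print(run_parallel([("1", 10), ("2", 20), ("3", 30)], size=2))
```

Because no chunk depends on another, the same split could be handed to separate cloud instances instead of threads, which is what makes on-demand scaling practical for this task.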
Timings
SEEK as a data analysis and
meta analysis service
• SBML model construction and population


Calibration workflow
Data requirements



Parameterised SBML model
Experimental data
Metabolite concentrations from key results database
Calibration by COPASI
web service
Peter Li
Workflow
Management System
Search and Analysis
across data sets, models and stuff
• Analysis pool
• Analysis As A
Cloud Service
• Analysis using
Cloud Computing
Services
• Run analysis tools
and knowledge
bases
Automated
Model Generation
MCISB Centre (Li)
Annotation pipeline
SUMO SysMO project (Maleki-Dizaji)
Next Gen Seq annotation pipelines
using Amazon Cloud Services (Noyes, Li )
Li et al, BMC Bioinformatics 2010, 11:582, doi:10.1186/1471-2105-11-582, highly accessed
Hucka and Le Novère, BMC Biology 2010, 8:140, doi:10.1186/1741-7007-8-140
SysMO-DB Dev Team
Sergejs Aleksejevs
Wolfgang Müller
Heidelberg Institute for Theoretical Studies, Germany
Carole Goble
Olga Krebs
Quyen Nguyen
University of Manchester, UK
Stuart Owen
Jacky Snoep
Franco du Preez
University of Stellenbosch,
South Africa
University of Manchester, UK
Katy Wolstencroft
Finn Bacall
Further Information
• myGrid
– http://www.mygrid.org.uk
• Taverna
– http://www.taverna.org.uk
• myExperiment
– http://www.myexperiment.org
• BioCatalogue
– http://www.biocatalogue.org
• SEEK
– http://www.sysmo-db.org
• RightField
– http://www.rightfield.org.uk
• MethodBox
– http://www.methodbox.org.uk