Providing an environment where every data-driven researcher will thrive Professor Carole Goble

carole.goble@manchester.ac.uk
University of Manchester, UK
• Pipelines
– Scientific workflows over (web) services
– Data pipelines, model population and
validation, simulation sweeps
– Distributed, federated datasets and analyses
combined with local datasets and analysis
– Opening up resources.
• e-Laboratories
– Crowd-sourcing, group curating and
sharing/reusing scientific assets.
– Web 2.0 and Semantic Web.
– Social networking, community content,
collaborative filtering
– Sharing and exchanging “Research Objects”
– Opening up capabilities and capacity.
• Pan European collaboration.
• Systems Biology of Microorganisms
13 projects, 91 institutes
– Different research outcomes
– A cross-section of microorganisms,
incl. bacteria, archaea and yeast.
• Record and describe the dynamic
molecular processes occurring in
microorganisms by computerized
mathematical models.
– Modellers meet experimentalists
• Pool research capacities, data,
models and know-how.
• Retrospectively.
http://www.sysmo.net
BaCell-SysMO
COSMIC
SUMO
KOSMOBAC
SysMO-LAB
PSYSMO
Valla
MOSES
TRANSLUCENT
STREAM
SulfoSYS
+ two more
Data-driven
• Multiple ‘omics
– genomics, transcriptomics
– proteomics, metabolomics
• Images
• Reaction Kinetics
• Models
• Data sets + experiments + models
– SBML, Agent-based, Mechanics based
• Analysis of data
Systems biology workflows in MCISB
• High throughput
experimental methods
• Public data sets (e.g. EBI)
• Web Services
• ~ 1400 NAR January Issue
Little Data
• Little databases
• Lab books
• Spreadsheets
• Private and Shared.
• Proliferation
• Derived data
• Long tail.
Big Data
Group Science
Data services
Access
“Little” Data
“Local” Science
Publish
My
Datasets
My
Analytics
Massive decentralisation – wikis,
sticks, spreadsheets
Massive centralisation – commons,
clouds, curated core facilities
Tremendous fragility
Digital Dust in Data Tombs
Picking Pain Points. Keeping it Real.
• Project Directors
– Data remains with us
under our control.
– We control who sees
what.
– Just enough exchange.
• SysMO PALs
– Spreadsheets.
– Yellow Pages.
– Standard Operating
Procedures.
An education
Modellers vs
Experimentalists
Computational thinking
Systems thinking
Gray’s Laws (modified)
• Working Now, Working to working
– Gateways and ramps
– Jam today, jam tomorrow
– Just enough, just in time
– Work with what you got already
• 20 questions
– Is there any group generating kinetic data?
– Is this data available?
– Who is working with which organism?
– What methods are being used to determine enzyme activity?
– Under which experimental conditions are my partners measuring glucose concentration?
Help people
search for and
find stuff
Data
Services
Processes
Models
Software
Experts
SysMO SEEK
Interlinking ASSETS CATALOGUE
Assets Catalogue. Archive. Social Network. Sharing Space. Gateway.
• Yellow Pages
– People. Expertise. Projects. Institutions. Facilities. Studies.
• Data
– Experimental data sets and analysed results.
– Gateway to data stores – SABIO-RK, ‘omics
• Models
– Store. Stimulate. Publish. Curate.
– Gateway to COPASI, JWS Online, BioModels.
• Processes
– Laboratory protocols – Standard Operating Procedures
– Bioinformatics analyses – computational workflows – Taverna
– Model population and validation – workflows – Taverna
– Gateway to myExperiment, MolMeth, OpenWetWare…
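The catalogue described above is meant to answer questions such as "Who is working with which organism?" and "Is this data available?". A minimal Python sketch of that kind of lookup follows. It is an illustration only, not the SysMO SEEK implementation: the entries, field names, and the `find` helper are all hypothetical.

```python
# Hypothetical catalogue entries; names and fields are invented for illustration.
CATALOGUE = [
    {"type": "person", "name": "Researcher A", "project": "Project X",
     "organisms": ["yeast"], "expertise": ["enzyme kinetics"]},
    {"type": "person", "name": "Researcher B", "project": "Project Y",
     "organisms": ["Bacillus subtilis"], "expertise": ["modelling"]},
    {"type": "data", "name": "Glucose time course", "project": "Project X",
     "organisms": ["yeast"], "available": True},
]


def find(catalogue, **criteria):
    """Return entries whose fields equal, or contain, every requested value."""
    results = []
    for entry in catalogue:
        for key, wanted in criteria.items():
            have = entry.get(key)
            if isinstance(have, (list, tuple, set)):
                matched = wanted in have      # multi-valued field: membership
            else:
                matched = have == wanted      # single-valued field: equality
            if not matched:
                break
        else:
            results.append(entry)             # every criterion matched
    return results


# "Who is working with which organism?"
yeast_people = [e["name"] for e in find(CATALOGUE, type="person", organisms="yeast")]
# "Is this data available?"
open_data = [e["name"] for e in find(CATALOGUE, type="data", available=True)]
print(yeast_people, open_data)
```

Keeping people, data, and other assets in one interlinked list is what lets a single query style cover both the Yellow Pages and the data questions.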
Linking data to process
Standard Operating Procedures
Models
Software
Provenance
The Lab Book
Retrospective method reconstruction
The myth of reproducible science
• Scientists willing
to share methods
and protocols.
• SOPs an early win.
• Defined standard
metadata model
based on Nature
Protocols.
• Seeded.
Linking data with stuff
• Research Objects for packaging and
exchanging Assets
– Workflows linked to models linked to
data linked to SOPs
– Encapsulate community standards
– Mixed resources: External and central.
– Trust
– “Preservation Packet”
– Bechhofer et al. 2010, forthcoming in The Future of
the Web for Collaborative Science 2010.
• SBRML
– Systems Biology Results Markup
Language
– To tie to the SBML
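The Research Object idea above (workflows linked to models linked to data linked to SOPs, mixing external and central resources) can be sketched as a small manifest structure. This is a hedged illustration, not the Research Object model from the cited paper: the `Asset` and `ResearchObject` classes, the relation names, and the example.org URL are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    identifier: str   # hypothetical ID scheme
    kind: str         # "workflow", "model", "data" or "sop"
    location: str     # "central", or the URL of an external resource


@dataclass
class ResearchObject:
    title: str
    assets: dict = field(default_factory=dict)
    links: list = field(default_factory=list)   # (source_id, relation, target_id)

    def add(self, asset):
        self.assets[asset.identifier] = asset

    def link(self, source_id, relation, target_id):
        # Only allow links between assets that are part of this package.
        for asset_id in (source_id, target_id):
            if asset_id not in self.assets:
                raise KeyError(f"unknown asset: {asset_id}")
        self.links.append((source_id, relation, target_id))

    def external_assets(self):
        """Assets held outside the central store, which the package only references."""
        return [a for a in self.assets.values() if a.location != "central"]


ro = ResearchObject("Glucose uptake study")
ro.add(Asset("wf1", "workflow", "http://example.org/workflows/1"))
ro.add(Asset("m1", "model", "central"))
ro.add(Asset("d1", "data", "central"))
ro.add(Asset("sop1", "sop", "central"))
ro.link("wf1", "populates", "m1")
ro.link("m1", "fitted_to", "d1")
ro.link("d1", "produced_by", "sop1")
print([a.identifier for a in ro.external_assets()])
```

Listing the external assets separately matters for the "Preservation Packet" use: those are the parts that can disappear without the package owner noticing.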
At the coal-face
The Spreadsheet.
The Content
Management Systems.
Legacy assets are assets.
Metadata ramps.
The Content Management System
• Lightweight and flexible. Low take-on, hidden operations
costs. Knowledgeable Civilians. Looks nice.
• Anarchy amenable.
SysMOLab
Spreadsheets
• Template distribution
• Template mapping
Everyone wants metadata.
No one wants to collect it.
Standards mayhem
Metadata millstones
Most data is thrown away.
Metadata for my sake
Metadata compliance by stealth
Preparation for publishing
Minimum
Information
Models
63%
47%
CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Env MIAME / Environmental transcriptomic experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About a RNAi Experiment
MIASE Minimum Information About a Simulation Experiment
MIENS Minimum Information about an ENvironmental Sequence
MIFlowCyt Minimum Information for a Flow Cytometry Experiment
MIGen Minimum Information about a Genotyping Experiment
MIGS Minimum Information about a Genome Sequence
MIMIx Minimum Information about a Molecular Interaction Experiment
MIMPP Minimal Information for Mouse Phenotyping Procedures
MINI Minimum Information about a Neuroscience Investigation
MINIMESS Minimal Metagenome Sequence Analysis Standard
MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
MIPFE Minimal Information for Protein Functional Evaluation
MIQAS Minimal Information for QTLs and Association Studies
MIqPCR Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM Minimal Information Required In the Annotation of biochemical Models
MISFISHIE Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA Standards for Reporting Enzymology Data
TBC Tox Biology Checklist
BioPAX : Biological Pathways Exchange http://www.biopax.org/
FuGE Functional Genomics Experiment
MIBBI: Minimum Information
for Biological and Biomedical Investigations
MGED: Microarray Experimental Conditions
http://www.mibbi.org/index.php/MIBBI_portal
Just Enough Results Model
• Harvest standards e.g.
MIAME (MIBBI.org)
• Analyse consortium
schemas and
spreadsheets
• JERMs for each data
type – microarray,
metabolomics,
proteomics ....
• Map project data
sources to JERMs.
• Distribute JERM
spreadsheet templates
“I only want to collect and
share just enough results”
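The "map project data sources to JERMs" step can be sketched in Python. This is a minimal illustration under stated assumptions, not the SysMO-DB implementation: the JERM field list, the project column headers, and the `map_to_jerm` helper are hypothetical.

```python
# Hypothetical "just enough" field set for one data type (not the real JERM).
JERM_MICROARRAY = ["organism", "strain", "condition", "sop_id", "data_file"]

# Hypothetical mapping from one project's spreadsheet headers to JERM fields.
PROJECT_MAPPING = {
    "Org.": "organism",
    "Strain name": "strain",
    "Growth cond": "condition",
    "Protocol": "sop_id",
    "File": "data_file",
}


def map_to_jerm(row, mapping, jerm_fields):
    """Rename a project row's columns to JERM fields and report what is missing."""
    record = {}
    for header, value in row.items():
        target = mapping.get(header)          # None for unmapped columns
        if target in jerm_fields and value not in (None, ""):
            record[target] = value
    missing = [f for f in jerm_fields if f not in record]
    return record, missing


row = {"Org.": "Bacillus subtilis", "Strain name": "168",
       "Growth cond": "", "File": "run42.csv", "Notes": "repeat"}
record, missing = map_to_jerm(row, PROJECT_MAPPING, JERM_MICROARRAY)
print(record)
print(missing)
```

Columns outside the mapping (here "Notes") are simply ignored, so a lab can keep its own spreadsheet layout and still yield a JERM-shaped record.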
JERM Spreadsheets Templates
Controlled vocabulary plug in
• RDF for ripping, mashing and comparing spreadsheets.
• A little semantics goes a long way
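As a rough sketch of what "ripping, mashing and comparing" spreadsheets with RDF could look like, the Python below turns each cell into a subject-predicate-object triple and then compares two sheets with set operations. It uses only the standard library and emits simplified N-Triples lines. The example.org namespace, the column names, and both helper functions are assumptions for illustration, not part of the SysMO tooling.

```python
import csv
import io

BASE = "http://example.org/sysmo/"  # hypothetical namespace, for illustration only


def rows_to_triples(csv_text, key_column):
    """Turn each non-empty spreadsheet cell into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "sample/" + row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.add((subject, BASE + "property/" + column, value))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples-style lines with plain string literals.

    Simplified: only backslashes and double quotes are escaped.
    """
    lines = []
    for s, p, o in sorted(triples):
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


sheet_a = "sample,organism,glucose_mM\ns1,yeast,10\ns2,yeast,20\n"
sheet_b = "sample,organism,glucose_mM\ns1,yeast,10\ns2,yeast,25\n"

a = rows_to_triples(sheet_a, "sample")
b = rows_to_triples(sheet_b, "sample")
shared = a & b       # "mashing": statements both sheets agree on
only_in_b = b - a    # "comparing": statements present only in the second sheet
print(to_ntriples(only_in_b))
```

Once every cell is a triple, merging and diffing sheets reduces to set union, intersection, and difference, which is one sense in which a little semantics goes a long way.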
Reward curation
Local curation at the point of
capture – ISA-TAB for ‘omics.
Centralised curation – SBML,
CellML, SBO
Automated curation.
Which data is worth curating?
• Blue-Collar
Science.
• Curator Credit
• Curator Career
• Funding.
• Personal and
institutional
visibility
• Scholarly citation
metrics
• Federate
workloads
• Unpopular with
the big data
providers.
www.biocurators.org
Commons-based Quality Control.
Progressive Curation:
“lazy evaluation” metadata
Just enough, Just in time
Jam today and Jam tomorrow
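One way to read "lazy evaluation" metadata is that a field is demanded only at the stage that actually needs it. The Python sketch below illustrates that reading. The stage names and the fields required at each stage are hypothetical and chosen purely for the example.

```python
# Hypothetical stages; each stage adds fields on top of the earlier ones.
STAGE_ORDER = ["capture", "share", "publish"]
STAGE_FIELDS = {
    "capture": ["owner", "data_file"],
    "share": ["organism", "sop_id"],
    "publish": ["licence", "citation"],
}


def required_fields(stage):
    """All fields needed up to and including the given stage."""
    fields = []
    for name in STAGE_ORDER[: STAGE_ORDER.index(stage) + 1]:
        fields.extend(STAGE_FIELDS[name])
    return fields


def outstanding(record, stage):
    """Fields still to be filled in before the record can enter this stage."""
    return [f for f in required_fields(stage) if not record.get(f)]


record = {"owner": "lab-a", "data_file": "x.csv", "organism": "yeast"}
for stage in STAGE_ORDER:
    print(stage, outstanding(record, stage))
```

The record above is already complete enough to be captured, needs one more field to be shared, and needs three to be published, so the curation cost is paid only when the matching benefit is in sight.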
[Figure: pain versus gain chart, with regions labelled “Very BAD”, “Just right”, and “Good, but Unlikely”]
Sensitive sharing.
Collaborate to compete
Good reasons not to.
Just enough just in time sharing.
Data kept at host.
Registered centrally through harvesting.
Pre-Publication sharing vs Publication
Competitive advantage.
Academic vanity.
Adoption.
Reputation.
Rewards
Scrutiny.
Being scooped.
Misinterpretation.
Reputation.
Legal issues.
Risks
Nature 461, 145 (10 September 2009) | doi:10.1038/461145a
Just Enough Sharing
Access Permissions
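A minimal Python sketch of owner-controlled access in the spirit of "we control who sees what". The four sharing levels, the dictionary fields, and the `can_view` function are assumptions for illustration. They are not the permission model of myExperiment or SysMO SEEK.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Hypothetical sharing levels, from most to least restrictive."""
    PRIVATE = 0      # owner only
    PROJECT = 1      # members of the owning project
    CONSORTIUM = 2   # any registered consortium member
    PUBLIC = 3       # anyone


def can_view(asset, viewer):
    """Decide whether a viewer may see an asset under its owner's policy."""
    if viewer is not None and viewer["name"] == asset["owner"]:
        return True
    scope = asset["scope"]
    if scope == Scope.PUBLIC:
        return True
    if viewer is None:               # anonymous visitors see public assets only
        return False
    if scope == Scope.CONSORTIUM:
        return True
    if scope == Scope.PROJECT:
        return viewer["project"] == asset["project"]
    return False                     # PRIVATE, and the viewer is not the owner


asset = {"owner": "alice", "project": "Project X", "scope": Scope.PROJECT}
colleague = {"name": "bob", "project": "Project X"}
outsider = {"name": "carol", "project": "Project Y"}
print(can_view(asset, colleague), can_view(asset, outsider), can_view(asset, None))
```

The default is closed: an asset is visible beyond its owner only when the owner has explicitly widened its scope.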
Reusing myExperiment
Reward sharing
and reusing not
reinventing.
Technically. Culturally.
Institutionally.
Credit and Risk
Mitigation.
Reward and Provenance
Attribution.
Trust.
Credit
Reusing myExperiment
Some pretty key things
• Data citation
• Stable and shared ids and names
– A nightmare.
– Sharednames.org
– Biosharing.org
• Versioning and Provenance
– Models, software, data sets
– Ensembl web service doesn’t report version number.
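When a service does not report its version, the client can at least record what it knows at retrieval time. The Python sketch below shows one hedged way to do that: store the source, the reported version (or an explicit "unreported" marker), a timestamp, and a checksum of the payload. The record layout and the `provenance_record` function are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(source, payload, version=None, retrieved_at=None):
    """Describe one retrieved result well enough to detect later changes."""
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source,
        # Make a missing version explicit rather than silently dropping it.
        "version": version if version is not None else "unreported",
        "retrieved_at": when.isoformat(),
        # A checksum lets two retrievals be compared even without a version.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


rec = provenance_record(
    "example-service",                      # hypothetical service name
    b"ATGC",
    retrieved_at=datetime(2010, 1, 1, tzinfo=timezone.utc),
)
print(rec["version"], rec["retrieved_at"])
```

If the same query later returns a payload with a different checksum, the data has changed underneath the analysis, whether or not the service says so.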
Data commons, Data havens
For data after the project has ended.
For the common good or me.
Beth’s Provenance Objects
Tidy and untidy data.
Bio2RDF
Access and availability of data
and data analysis resources
Web services underpin the ESFRI ELIXIR programme.
Interfaces that are understandable and stable.
Designed for people too.
No access, no tools, no point (Keith Haines)
Deposition to community databanks that minimise
pain.
What is it?
Is it working?
Data analysis, model population and
data pipelining ramps.
Crossing the adoption chasm
There is a world of complexity for data
preparation, processing and analysis
Science Informatics Sweatshops.
E-Laboratories. Workflows. Portals.
Pre-cooked processes and process
templates. Pre-cooked interfaces.
Training.
Lymphoma Prediction Workflow
caArray
MicroArray from
tumor tissue
Microarray
preProcessing
Use gene-expression patterns associated with two lymphoma types
to predict the type of an unknown sample.
Lymphoma
prediction
GenePattern
Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI)
Jared Nedzel (MIT)
Wei Tan Univ. Chicago
myExperiment Communities
• Supermarket
shoppers
• Tool builders
• Trainers and
Trainees
Drop and Compute
Ian Cottam
Local folder
synchronised
and shared via
cloud
Condor job
submitted by
drag and drop
Results appear in
Dropbox
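The drop-and-compute flow above can be sketched as a small folder scanner: look in the synchronised folder for new Condor submit files and hand each one to `condor_submit`. This is an illustration, not Ian Cottam's implementation. The `.sub` and `.submitted` file-naming convention and both functions are assumptions, and the sketch defaults to a dry run so it works without Condor installed.

```python
import subprocess
import tempfile
from pathlib import Path


def pending_jobs(drop_dir):
    """Return submit files that have no '.submitted' marker next to them."""
    drop_dir = Path(drop_dir)
    return sorted(
        p for p in drop_dir.glob("*.sub")
        if not p.with_suffix(".submitted").exists()
    )


def submit_all(drop_dir, dry_run=True):
    """Build (and, unless dry_run, execute) one condor_submit command per new file."""
    commands = []
    for submit_file in pending_jobs(drop_dir):
        command = ["condor_submit", submit_file.name]
        if not dry_run:
            subprocess.run(command, cwd=submit_file.parent, check=True)
        # Mark the file as handled so the next scan skips it (also in dry runs,
        # which keeps this demonstration idempotent).
        submit_file.with_suffix(".submitted").touch()
        commands.append(command)
    return commands


with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "sweep.sub").write_text("executable = run.sh\nqueue\n")
    first = submit_all(folder)    # picks up the dropped file
    second = submit_all(folder)   # nothing new on the second scan
    print(first, second)
```

Because the job's working directory is the shared folder itself, any output files the job writes there would be synchronised back to the user, which matches the "results appear in Dropbox" step.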
Bashing against
local IT
NO – you can’t access that
datastore / run your
analysis.
Joined up thinking.
Data +
Publications
Data trapped in
documents
Supplemental
information
Text mining
Text mining workflows
Text mining to find
method and controls
Reflect.
Elsevier Challenge Winner 2009
[Oscar-3]
Manual and Auto-mark up
Do not
underestimate the
power of
Interactive
Visualisation and
Browsing
Pre-cooked complex queries.
Navigation.
With my data.
At the click of a button.
• Distributed Annotation Service
• Upload and overlay my data
SysMO summary
• Providing an environment where every data-driven
researcher will thrive
• Reality is messy.
– Extreme Technology Determinism vs Voluntarist Sociocultural
shaping
• Extreme and continuous partnership with users.
– Act Local Think Global
• Agile development environment facilitated stream of
features to tackle pain points.
– Leverage other e-Laboratories, Maintaining scientists’ buy-in.
• Socio-Political Axis dominates the Technical Axis.
– Collaboration evolutions, Confidence in exchange.
Six Action Plan Areas
Coordination
Data
Capacity
Capacity building of
our skills base
• Influence training and capacity
building programmes.
• Promote training for young and
mid-career researchers and
research technologists.
• Enable mixed skilled research
teams to include research and
information technologists.
• Value and reward highly skilled
research and information
technologists within HE
institutions with a career
structure.
Data Silo
culture
Funding silos
Discipline silos
Academic
Credit and Risk
Mitigation
for sharing,
curating, and
reusing not
reinventing
Data and Software
is free like puppies
are free
EML Research
gGmbH,
Germany
Wolfgang
Müller
Sergejs Aleksejevs
Carole Goble
Isabel Rojas Olga Krebs
Katy Wolstencroft
University of Manchester, UK
Stuart Owen
Jacky Snoep
University of Stellenbosch, South Africa
University of Manchester, UK
Finn Bacal
Links
• myGrid Project
– http://www.mygrid.org.uk
• SysMO-DB
– http://www.sysmo-db.org
• myExperiment
– http://www.myexperiment.org
• Taverna
– http://www.taverna.org.uk
• JWS Online
– http://jjj.biochem.sun.ac.za/
• SABIO-RK
– http://sabio.villa-bosch.de/