Motivation and Strategies for data-intensive Biology Douglas B. Kell Chief Executive

Motivation and Strategies for
data-intensive Biology
Douglas B. Kell
Chief Executive
Biotechnology and Biological Sciences Research Council
http://dbkgroup.org/ @dbkell
http://blogs.bbsrc.ac.uk www.mcisb.org
Synopsis of talk
• Intro and background/ philosophy
of data-driven science
• How big is the ‘big data’ problem?
• Genomics and Systems Biology
• Data sharing
• Conclusions and challenges
Pre- and post-genomics
[Diagram. Labels: FUNCTION/PHENOTYPE; GENE; arrows marked PRE and POST.]
Holism/reductionism
[Diagram. Labels: WHOLE (ORGANISM); PARTS (MOLECULES); REDUCTIONISM; SYNTHESIS/HOLISM.]
Models and Reality
[Diagram. Labels: THE BIOLOGY; PRODUCE/REFINE THE MODEL; RUN THE MODEL; THE IN SILICO MODEL.]
Modelling
[Diagram. Labels: PARAMETERS; PRODUCE/REFINE; e.g. O.D.E. MODEL; INFERENCE/SYSTEM IDENTIFICATION; VARIABLES.]
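The Modelling slide relates PARAMETERS and VARIABLES through an O.D.E. model, with inference/system identification as the route back from variables to parameters. A minimal sketch of that round trip follows, assuming a one-variable first-order decay model dS/dt = -k*S. The model, the function names and the numbers are illustrative assumptions, not taken from the talk, and the closed-form solution is used in place of a numerical integrator.

```python
import math

def simulate(k, s0, times):
    """Forward problem: parameters (k, s0) -> variable S(t).

    Uses the closed-form solution S(t) = s0 * exp(-k * t) of dS/dt = -k * S.
    """
    return [s0 * math.exp(-k * t) for t in times]

def estimate_k(times, values):
    """Inverse problem: recover k from observed values of S.

    ln S is linear in t with slope -k, so an ordinary least-squares
    slope of ln S against t gives the estimate.
    """
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    num = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    den = sum((t - t_mean) ** 2 for t in times)
    return -num / den

times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = simulate(k=0.5, s0=10.0, times=times)  # parameters -> variables
k_hat = estimate_k(times, observed)               # variables -> parameters
print(f"estimated k = {k_hat:.3f}")
```

With noise-free synthetic data the estimate matches the value used to generate it; real system identification has to cope with noise, many coupled variables and parameters that cannot be identified uniquely.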
The cycle of knowledge
[Diagram. Labels: KNOWLEDGE/RULES/IDEAS; HYPOTHESIS/ANALYSIS/DEDUCTION; SYNTHESIS/INDUCTION; OBSERVATIONS/DATA; dB.]
The evolution of Systems Biology
Westerhoff & Palsson NBT 22, 1249-52 (2004)
Molecular
But despite everything science is in some ways
becoming LESS effective in an applied context
Numerical
Declining numbers of drug launches
Leeson & Springthorpe, NRDD 6, 881-890 (2007)
Attrition
Kola & Landis, NRDD 3, 711-5 (2004)
Biology IS a big data science
• 5.5 PB - EMBL-EBI expanded capacity (from
2.5PB) – recently increased with BBSRC funding
• 10 PB - purportedly to be generated by Large
Hadron Collider at CERN per year
• 50 PB – estimate of “the entire [written] works of
humankind over recorded history, in all
languages" if digitally compressed
• 1 TB/hr - YouTube was adding 15 h of video per minute in 2009
(http://googleblog.blogspot.com/2009/05/2008-foundersletter.html), i.e. ca. 0.01 EB/yr
Sequencing Technologies
Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal
Nature 458, 719-724(9 April 2009)
Sequencing, bioimaging, digital
organisms
• Solexa machine 100 GB → 1 TB/week, 50
weeks/y, 20 machines = 1 PB/y already
• Bioimaging data, if made available online, would
easily rival/exceed Youtube
• The same for model outputs from digital
organisms/ VPH, needed for inferencing
Whoever (person, grouping, country) learns to
store, access and analyse such data will gain a huge
scientific and commercial advantage
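The volume estimates quoted above can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, 1 PB = 1,000 TB and 1 EB = 1,000 PB, and a 365-day year for the sustained 1 TB/hr rate. The function names are illustrative only.

```python
# Assumption: decimal (SI) units, 1 PB = 1,000 TB and 1 EB = 1,000 PB.
TB_PER_PB = 1_000
PB_PER_EB = 1_000

def solexa_pb_per_year(tb_per_week=1.0, weeks_per_year=50, machines=20):
    """Yearly output in PB of a fleet of sequencers, using the slide's figures."""
    return tb_per_week * weeks_per_year * machines / TB_PER_PB

def hourly_rate_to_eb_per_year(tb_per_hour=1.0, hours_per_year=24 * 365):
    """Convert a sustained rate in TB/hr into EB per year."""
    return tb_per_hour * hours_per_year / TB_PER_PB / PB_PER_EB

solexa = solexa_pb_per_year()
video = hourly_rate_to_eb_per_year()
print(f"20 Solexa machines: {solexa:.2f} PB/y")
print(f"1 TB/hr sustained:  {video:.5f} EB/y")
```

The first figure reproduces the 1 PB/y on the slide. The second comes to just under 0.01 EB/y, consistent with the rounded value quoted for YouTube.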
EMBL Repositories
C. Southan & G. Cameron
“The Fourth Paradigm”
Biology IS a big data science
• EBI expect a ten-fold increase in capability
from 2010 – 2020, to manage data from the next
generation sequencing machines alone
• EMBL-EBI website databases serve 300,000
independent users every month
– approximately 10,000 scientists use the data
generated at CERN.
• RCUK recognition – ‘The Exabyte Age’
• BGI – 128 HiSeq 2000 @ 200Gb/8d run each
2Tb/d, 40 PB storage, 1 PFLOPS
• Federated data
Bioinformatics
Data
BBSRC Strategic Plan 2010-2015
http://www.bbsrc.ac.uk/strategy
Research Council investment
Recent significant Research Council
commitments in sequencing and genomics
• £13.5M - BBSRC Genome Analysis Centre (TGAC)
• £9.1M - 4 MRC High-throughput Sequencing Hubs
• £2.3M - MRC award to the Wellcome Trust Sanger
Institute for public resource of mouse strain sequences
• £0.8M - MRC award to the Babraham Institute for high-throughput epigenomics
• £2M/pa - NERC - Biomolecular Analysis Facility (NBAF)
and Environmental Bioinformatics Centre (NEBC)
Genomics
• Next-next generation sequencers: 2 – 3
orders of magnitude more, within less than
5y, probably < 2y
• PacBio instruments starting to ship
Value for Money of data storage
[Chart, in $M. Labels: Data collection in 2008; Annual Cost of PDB (<1%).]
Issues not just data storage
• Bandwidth now limiting – bringing computing to
the data, not data to computing
• Data-intensive science – the next phase, and
new architectures needed. Qualitative change.
• Web 2.0/3.0 and the Semantic Web
• Curation – and the need for a new breed of
curators
• Training, access and utilisation – major need:
(i) upskill the user community,
(ii) change the style of software to favour non-bioinformaticians
Data-intensive science
A qualitative change in science
Current Model
[Diagram. Labels: Sequencing Centre; three Sequencing Instruments (output: ACGTTTCCC….); Storage; High-Performance Cluster; Scientist / User; Submission; Download; Public Repository; Multi-petabyte High-Performance storage.]
[Diagram. Labels: Sequencing Node; three Sequencing Instruments (output: ACGTTTCCC….); VM test environment; Staging Storage; LIMS; Primary Analysis; QA; Assemblies; Storage; Multi-petabyte High-Performance storage; Metadata; Analysis Output; User; High-Performance Cluster; Analysis Submission; Virtual Machine Pool; Cloud infrastructure.]
The Information Age
[Chart: total number of scientific papers added to Medline per year, 1949 to 2005; y-axis 0 to 800,000.]
- need for text mining and NLP (NB UKPMC)
H1N1 Outbreak
Rohr, Nature 2008
Smith, Nature 2009
European
Bioinformatics
Institute, UK
Rohr paper
Need for an Open Access human
metabolic network model
A ‘grand challenge’….
Systems biology and modelling are all
about representation
The main representation for
systems biology models is SBML
http://sbml.org/
www.sbml.org
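To make "representation" concrete, here is a minimal sketch of what an SBML file looks like and how it can be read with nothing more than Python's standard XML parser. The toy model (one compartment, two species, one reaction with cofactors omitted) and all identifiers in it are illustrative assumptions. It follows the general shape of SBML Level 2 Version 4 but has not been validated against the SBML schema, and a dedicated SBML library would normally be used instead of raw XML parsing.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

# Hypothetical toy model; identifiers are made up for illustration.
TOY_SBML = f"""
<sbml xmlns="{SBML_NS}" level="2" version="4">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glc_c" name="glucose" compartment="cytosol"/>
      <species id="g6p_c" name="glucose 6-phosphate" compartment="cytosol"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R_toy">
        <listOfReactants>
          <speciesReference species="glc_c"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p_c"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
""".strip()

ns = {"sbml": SBML_NS}
root = ET.fromstring(TOY_SBML)

# Collect species ids, then reactants and products for each reaction.
species = [s.get("id") for s in root.findall(".//sbml:species", ns)]
reactions = {}
for rxn in root.findall(".//sbml:reaction", ns):
    reactants = [r.get("species") for r in
                 rxn.findall("sbml:listOfReactants/sbml:speciesReference", ns)]
    products = [p.get("species") for p in
                rxn.findall("sbml:listOfProducts/sbml:speciesReference", ns)]
    reactions[rxn.get("id")] = (reactants, products)

print(species)
print(reactions)
```

Because the structure is explicit and machine-readable, the same file can be consumed by simulators, annotation tools and model repositories without bespoke parsers for each.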
The human metabolic network (1)
• 8 cellular compartments
• 2,712 compartment-specific metabolites
• ~ 1,500 different chemical entities
• 1,496 genes
• 2,233 metabolic reactions (1,795 unique)
• 1,078 transport reactions (32.6%)
PNAS 104, 1777-1782 (2007)
The human metabolic network (2)
• Not yet compartmentalised
• 2,823 reactions (incl. 300 ‘orphans’), of which 2,215
have disease associations, plus 1,189 transport
reactions and 457 exchange reactions
• 2,322 genes (1069 common with UCSD model)
Molecular Systems Biology 3, 135 (2007)
Task: bring together these models
and canonicalise them
• In particular we need proper semantic
annotation to refer to specific chemical
entities sensibly, either via persistent
databases (e.g. ChEBI) or better via dB-independent means such as SMILES or
InChI strings
• Principled yeast metabolic network model
(Herrgård et al., Nature Biotechnology 26, 1155–1160 (2008))
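The value of database-independent identifiers can be illustrated with a small sketch: two models that use different local names for their metabolites are reconciled by keying every metabolite on its InChI string. The two toy models and their local identifiers are hypothetical, and compartments are ignored here, so this is an assumption-laden illustration of the idea, not the procedure used for any published reconstruction.

```python
from collections import defaultdict

# Hypothetical toy models: local metabolite id -> standard InChI string.
model_a = {
    "h2o_c": "InChI=1S/H2O/h1H2",
    "co2_c": "InChI=1S/CO2/c2-1-3",
}
model_b = {
    "water": "InChI=1S/H2O/h1H2",
    "carbon_dioxide": "InChI=1S/CO2/c2-1-3",
    "etoh": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

def canonicalise(models):
    """Group local metabolite ids from several models by shared InChI string."""
    merged = defaultdict(dict)
    for model_name, metabolites in models.items():
        for local_id, inchi in metabolites.items():
            merged[inchi][model_name] = local_id
    return dict(merged)

merged = canonicalise({"model_a": model_a, "model_b": model_b})

# Entities present in both models, regardless of what each model calls them.
shared = sorted(inchi for inchi, ids in merged.items() if len(ids) == 2)
for inchi in shared:
    print(inchi, merged[inchi])
```

Matching on the local names alone would find nothing in common between the two models; matching on the structure-derived string finds water and carbon dioxide in both. A compartmentalised model would need the compartment added to the key.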
Herrgård et al., Nature Biotechnology
26, 1155-60 (2008)
Some key features of yeast
consensus reconstruction
• Precise and semantically
aware (via InChI,
SMILES, dB links and
SBO)
• Available online, also as
accurate SBML
• Live
• Directly linked to B-net
database
http://www.comp-sys-bio.org/yeastnet/
© Organisation for Economic Co-operation & Development, 2007
Data sharing: Background
Publicly-funded research data are
a public good, produced in the
public interest
Publicly-funded research data
should be openly available to the
maximum extent possible
Following consultations with the UK
bioscience community, BBSRC launched
its Data Sharing Policy in April 2007
BBSRC Data Sharing Policy
Few
Restrictions
Regulatory
Requirements
(ethics)
Appropriate to
discipline
Timely
Data Sharing
Key Principles
Use of Existing
Resources
Appropriate
Data Quality
Metadata
Use of
Standards
www.bbsrc.ac.uk/datasharing
BBSRC Data Sharing Policy:
Implementation I
Data Sharing Statements
(Peer reviewed)
Final Reports
(Peer Reviewed)
Portfolio / initiative
evaluations
Institute
evaluations
BBSRC Data Sharing Policy Monitoring Group
Revisions / amendments to policy to reflect best practice / developments
BBSRC Data Sharing Policy:
Implementation II
Data Infrastructure
ELIXIR [aims] to construct and operate a sustainable infrastructure
for biological information in Europe to support life science research
and its translation to medicine and the environment, the bioindustries and society.
Sustainable Resources
The Bioinformatics & Biological Resources Fund aims to
support the establishment, maintenance and enhancement of
community resources required by bioscientists.
Data Sharing Tool Development
The Tools & Resources Development Fund supports small/short
pump-priming projects or community-building activities aimed at the
development of novel technologies or methods to tackle a biological
challenge.
Funds are available through project grants (responsive mode) to
develop tools and resources for data sharing.
Bioinformatics and Biological
Resources Fund
A good model for a programme responsive
to community needs is provided by the UK
BBSRC's recent Bioinformatics and
Biological Resources Fund which provides
dedicated funding for development and
sustainability of public resources and
informatics tools.
Nature Vol 461(7261):171-3 10 Sept 2009
Bioinformatics and Biological
Resources Fund
“Governments must ensure that at least one of
their national funding agencies has money
specifically set aside for the long-term support of
bioresource infrastructures. A good model to
emulate would be the UK's BBSRC, which allows
databases and other resources to apply for ring-fenced funding, saving them from having to
compete with hypothesis-driven grants, which are
the agency's mainstay.”
Nature Vol 462, 19 November 2009
(Editorial)
Challenges
Data → transfer/bandwidth →
storage/infrastructure → annotation → curation →
mining → visualisation → knowledge
• Data floods, lack of bandwidth
• Skills and capacity:
• Pipeline:
– computational bioscience
– maths skills
• Cultural barriers and disciplinary boundaries
• Data standards, interoperability and Ontologies
• Integrating the literature!
Need for tools to integrate
PLoS Comp Biol 4, e1000204 (2008) – top 4 most tagged at citeulike.org
Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing
Slide courtesy of Carole Goble
Iron behaving badly
BMC Med Genomics 2, 2 (2009) – 79pp,
with 2,469 references
http://www.biomedcentral.com/1755-8794/2/2/
Motivation and Strategies for
data-intensive Biology
Douglas B. Kell
Chief Executive
Biotechnology and Biological Sciences Research Council
http://dbkgroup.org/dbkPubs
http://dbkgroup.org/ @dbkell
http://blogs.bbsrc.ac.uk www.mcisb.org