Motivation and Strategies for data-intensive Biology
Douglas B. Kell
Chief Executive, Biotechnology and Biological Sciences Research Council
http://dbkgroup.org/ @dbkell http://blogs.bbsrc.ac.uk www.mcisb.org

Synopsis of talk
• Intro and background/philosophy of data-driven science
• How big is the ‘big data’ problem?
• Genomics and Systems Biology
• Data sharing
• Conclusions and challenges

Pre- and post-genomics
[Diagram labels: FUNCTION/PHENOTYPE; PRE; POST; GENE]

Holism/reductionism
[Diagram labels: WHOLE (ORGANISM); REDUCTIONISM; SYNTHESIS/HOLISM; PARTS (MOLECULES)]

Models and Reality
[Diagram labels: THE BIOLOGY; PRODUCE/REFINE THE MODEL; RUN THE MODEL; THE IN SILICO MODEL]

Modelling
[Diagram labels: PARAMETERS; PRODUCE/REFINE; e.g. O.D.E. MODEL; INFERENCE/SYSTEM IDENTIFICATION; VARIABLES]

The cycle of knowledge
[Diagram labels: KNOWLEDGE/RULES/IDEAS; HYPOTHESIS/ANALYSIS/DEDUCTION; SYNTHESIS/INDUCTION; OBSERVATIONS/DATA; dB]

The evolution of Systems Biology
Westerhoff & Palsson, NBT 22, 1249-52 (2004)
[Figure label: Molecular]

But despite everything science is in some ways becoming LESS effective in an applied context
[Figure label: Numerical]

Declining numbers of drug launches
Leeson & Springthorpe, NRDD 6, 881-890 (2007)

Attrition
Kola & Landis, NRDD 3, 711-5 (2004)

Biology IS a big data science
• 5.5 PB - EMBL-EBI expanded capacity (from 2.5 PB), recently increased with BBSRC funding
• 10 PB - purportedly to be generated per year by the Large Hadron Collider at CERN
• 50 PB - estimate of “the entire [written] works of humankind over recorded history, in all languages” if digitally compressed
• 1 TB/hr - YouTube added 15 h of video per minute in 2009 (http://googleblog.blogspot.com/2009/05/2008-foundersletter.html), ca. 0.01 EB/yr

Sequencing Technologies
Michael R. Stratton, Peter J. Campbell & P.
Andrew Futreal, Nature 458, 719-724 (9 April 2009)

Sequencing, bioimaging, digital organisms
• Solexa machine: 100 GB to 1 TB/week, 50 weeks/y, 20 machines = 1 PB/y already
• Bioimaging data, if made available online, would easily rival/exceed YouTube
• The same for model outputs from digital organisms/VPH, needed for inferencing
Whoever (person, grouping, country) learns to store, access and analyse such data will gain a huge scientific and commercial advantage

EMBL Repositories
C. Southan & G. Cameron, “The Fourth Paradigm”

Biology IS a big data science
• EBI expect a ten-fold increase in capability from 2010 to 2020, to manage data from the next generation sequencing machines alone
• EMBL-EBI website databases serve 300,000 independent users every month; by comparison, approximately 10,000 scientists use the data generated at CERN
• RCUK recognition - ‘The Exabyte Age’
• BGI - 128 HiSeq 2000 @ 200Gb/8d run each, 2Tb/d, 40 PB storage, 1 PFLOPS
• Federated data

BBSRC Strategic Plan 2010-2015
[Figure labels: Bioinformatics; Data]
http://www.bbsrc.ac.uk/strategy

Research Council investment
Recent significant Research Council commitments in sequencing and genomics
• £13.5M - BBSRC Genome Analysis Centre (TGAC)
• £9.1M - 4 MRC High-throughput Sequencing Hubs
• £2.3M - MRC award to the Wellcome Trust Sanger Institute for public resource of mouse strain sequences
• £0.8M - MRC award to the Babraham Institute for high-throughput epigenomics
• £2M/pa - NERC - Biomolecular Analysis Facility (NBAF) and Environmental Bioinformatics Centre (NEBC)

Genomics
• Next-next generation sequencers: 2-3 orders of magnitude more, within less than 5y, probably < 2y
• PacBio instruments starting to ship

Value for Money of data storage
[Chart: data collection in 2008 ($M); annual cost of PDB <1%]

Issues not just data storage
• Bandwidth now limiting - bringing computing to the data, not data to computing
• Data-intensive science - the next phase, and new architectures needed. Qualitative change.
• Web 2.0/3.0 and the Semantic Web
• Curation - and the need for a new breed of curators
• Training, access and utilisation - major need: (i) upskill the user community, (ii) change the style of software to favour non-bioinformaticians

Data-intensive science
A qualitative change in science
[Architecture diagrams, first titled "Current Model". Labels: Sequencing Instrument (ACGTTTCCC….); Storage; High-Performance Cluster; Scientist / User; Sequencing Centre; Download; Submission; Multi Peta-byte High-Performance storage; Public Repository; Sequencing Node; VM test environment; Staging Storage; LIMS; Primary Analysis; QA; Assemblies; Metadata; Analysis Output; User; Analysis Submission; Virtual Machine Pool; Cloud infrastructure]

The Information Age
[Chart: total number of scientific papers added to Medline per year, 1949-2005; y-axis 0 to 800,000]
Total number of scientific papers added to Medline per year - need for text mining and NLP (NB UKPMC)

H1N1 Outbreak
Rohr, Nature 2008; Smith, Nature 2009
European Bioinformatics Institute, UK
Rohr paper

Need for an Open Access human metabolic network model
A ‘grand challenge’….
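
[Note: a rough check of the data volumes quoted on the "Biology IS a big data science" and "Sequencing, bioimaging, digital organisms" slides. This is only a back-of-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), takes the upper Solexa figure of 1 TB/week, and uses function names that are illustrative rather than taken from the talk.]

```python
# Back-of-envelope check of data volumes quoted on the slides.
# Assumption: decimal (SI) units, i.e. 1 TB = 10**12 bytes.
TB = 10**12
PB = 10**15
EB = 10**18


def fleet_bytes_per_year(tb_per_week, weeks_per_year, machines):
    """Bytes per year from a fleet of sequencers running at a steady rate."""
    return tb_per_week * TB * weeks_per_year * machines


def stream_bytes_per_year(tb_per_hour, hours_per_year=24 * 365):
    """Bytes per year from a continuous stream quoted in TB per hour."""
    return tb_per_hour * TB * hours_per_year


# Slide figures: 1 TB/week per machine, 50 weeks/y, 20 machines.
solexa = fleet_bytes_per_year(tb_per_week=1, weeks_per_year=50, machines=20)

# Slide figure: roughly 1 TB/hr of video uploads.
video = stream_bytes_per_year(tb_per_hour=1)

print(f"Sequencing fleet: {solexa / PB:.2f} PB/y")
print(f"Video uploads:    {video / EB:.4f} EB/y")
```

Under these assumptions the fleet comes to 1 PB/y, matching the slide, and the video stream comes to just under 0.01 EB/y, in line with the "ca. 0.01 EB/yr" quoted.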
Systems biology and modelling are all about representation
The main representation for systems biology models is SBML
http://sbml.org/

The human metabolic network (1)
• 8 cellular compartments
• 2,712 compartment-specific metabolites
• ~ 1,500 different chemical entities
• 1,496 genes
• 2,233 metabolic reactions (1,795 unique)
• 1,078 transport reactions (32.6%)
PNAS 104, 1777-1782 (2007)

The human metabolic network (2)
• Not yet compartmentalised
• 2,823 reactions (incl. 300 ‘orphans’), of which 2,215 have disease associations, plus 1,189 transport reactions and 457 exchange reactions
• 2,322 genes (1,069 common with UCSD model)
Molecular Systems Biology 3, 135 (2007)

Task: bring together these models and canonicalise them
• In particular we need proper semantic annotation to refer to specific chemical entities sensibly, either via persistent databases (e.g. ChEBI) or better via dB-independent means such as SMILES or InChI strings
• Principled yeast metabolic network model (Herrgård et al.)
Nature Biotechnology 26, 1155-1160 (2008)

Some key features of yeast consensus reconstruction
• Precise and semantically aware (via InChI, SMILES, dB links and SBO)
• Available online, also as accurate SBML
• Live
• Directly linked to B-net database
http://www.comp-sys-bio.org/yeastnet/

© Organisation for Economic Co-operation & Development, 2007

Data sharing: Background
• Publicly-funded research data are a public good, produced in the public interest
• Publicly-funded research data should be openly available to the maximum extent possible
• Following consultations with the UK bioscience community, BBSRC launched its Data Sharing Policy in April 2007

BBSRC Data Sharing Policy
Data Sharing Key Principles:
• Few Restrictions
• Regulatory Requirements (ethics)
• Appropriate to discipline
• Timely
• Use of Existing Resources
• Appropriate Data Quality
• Metadata
• Use of Standards
www.bbsrc.ac.uk/datasharing

BBSRC Data Sharing Policy: Implementation I
[Flow diagram labels:]
• Data Sharing Statements (Peer Reviewed)
• Final Reports (Peer Reviewed)
• Portfolio / initiative evaluations
• Institute evaluations
• BBSRC Data Sharing Policy Monitoring Group
• Revisions / amendments to policy to reflect best practice / developments

BBSRC Data Sharing Policy: Implementation II
Data Infrastructure
ELIXIR [aims] to construct and operate a sustainable infrastructure for biological information in Europe to support life science research and its translation to medicine and the environment, the bioindustries and society.
Sustainable Resources
The Bioinformatics & Biological Resources Fund aims to support the establishment, maintenance and enhancement of community resources required by bioscientists.
Data Sharing Tool Development
The Tools & Resources Development Fund supports small/short pump-priming projects or community-building activities aimed at the development of novel technologies or methods to tackle a biological challenge.
Funds are available through project grants (responsive mode) to develop tools and resources for data sharing.

Bioinformatics and Biological Resources Fund
A good model for a programme responsive to community needs is provided by the UK BBSRC's recent Bioinformatics and Biological Resources Fund which provides dedicated funding for development and sustainability of public resources and informatics tools.
Nature Vol 461(7261):171-3, 10 Sept 2009

Bioinformatics and Biological Resources Fund
“Governments must ensure that at least one of their national funding agencies has money specifically set aside for the long-term support of bioresource infrastructures. A good model to emulate would be the UK's BBSRC, which allows databases and other resources to apply for ringfenced funding, saving them from having to compete with hypothesis-driven grants, which are the agency's mainstay.”
Nature Vol 462, 19 November 2009 (Editorial)

Challenges
[Pipeline diagram: data transfer/bandwidth; storage/infrastructure; annotation; curation; mining; visualisation; knowledge]
• Data floods, lack of bandwidth
• Skills and capacity:
• Pipeline:
  – computational bioscience
  – maths skills
• Cultural barriers and disciplinary boundaries
• Data standards, interoperability and Ontologies
• Integrating the literature!

Need for tools to integrate
PLoS Comp Biol 4, e1000204 (2008) - top 4 most tagged at citeulike.org

Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing
Slide courtesy of Carole Goble

Iron behaving badly
BMC Med Genomics 2, 2 (2009) - 79pp, with 2,469 references
http://www.biomedcentral.com/1755-8794/2/2/

Motivation and Strategies for data-intensive Biology
Douglas B. Kell
Chief Executive, Biotechnology and Biological Sciences Research Council
http://dbkgroup.org/dbkPubs http://dbkgroup.org/ @dbkell http://blogs.bbsrc.ac.uk www.mcisb.org
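
[Note: the slide "Task: bring together these models and canonicalise them" argues for referring to chemical entities by database-independent identifiers such as InChI strings. A minimal sketch of that idea follows. It is not the method used for the published reconstructions. The model names, the local metabolite IDs and the function merge_by_inchi are hypothetical, and exact string equality on InChI is only a first-pass matching rule, since charge and protonation state can produce different strings for what a curator would regard as the same metabolite.]

```python
from collections import defaultdict


def merge_by_inchi(*models):
    """Group model-specific metabolite IDs by shared InChI string.

    Each model is a (name, {local_id: inchi}) pair.
    Returns {inchi: {model_name: [local_ids]}}. Local IDs are kept in lists
    because one chemical entity may appear under several compartment-specific
    IDs within a single model.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for name, metabolites in models:
        for local_id, inchi in metabolites.items():
            merged[inchi][name].append(local_id)
    # Convert nested defaultdicts to plain dicts for a predictable result.
    return {inchi: dict(by_model) for inchi, by_model in merged.items()}


# Hypothetical toy models: the local IDs are invented for illustration.
model_a = ("model_a", {
    "h2o_c": "InChI=1S/H2O/h1H2",
    "h2o_m": "InChI=1S/H2O/h1H2",
    "co2_c": "InChI=1S/CO2/c2-1-3",
})
model_b = ("model_b", {
    "WATER": "InChI=1S/H2O/h1H2",
    "ETOH": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
})

merged = merge_by_inchi(model_a, model_b)

# Entities present in more than one model are candidates for a shared,
# canonical entry; the rest need curation or exist in one model only.
shared = {inchi: ids for inchi, ids in merged.items() if len(ids) > 1}
print(shared)
```

Keying on a structure-derived string rather than on each model's own names is what lets the two toy models agree on water despite using different local IDs. In practice the identifiers would come from the models' annotations rather than being typed in by hand.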