Negotiating the maze – Data Complexity in the Life Sciences
Sarah Butcher
s.butcher@imperial.ac.uk
Bioinformatics Support Service
www.imperial.ac.uk/bioinfsupport

Summary
• Characteristics of Life Sciences data as 'Big Data'
• The Life Sciences data maze
• Observations on good practice

What We Do – Bioinformatics Facility
We support all stages in the data lifecycle: experimental design, data and metadata capture, primary and later-stage analyses, data management, visualisation, sharing and publication.
• Large-scale genomics and Next Generation Sequencing analyses
• Tools for multi-platform data and metadata management
• Bespoke clinical and biological databases, tissue banking
• Software and script development, data visualisation, mobile apps
• Full grant-based collaboration across disciplines
• Brokering, skills sharing, advocacy
• New ways of high-throughput working, e.g. cloud, workflows
• Teaching, workshops and one-to-one tutorials
• A variety of skill sets covering wet-lab biology, statistics and computer science

Bio-data Characteristics – The Basics
• Lack of structure, rapid growth but not (very) huge volume, high heterogeneity
• Multiple file formats, widely differing sizes and acquisition rates
• Considerable manual data collection
• Multiple format changes over the data lifetime, including production of (evolving) exchange formats
• Huge range of analysis methods, algorithms and software in use, with widely ranging computational profiles
• Association with multiple metadata standards and ontologies, some of which are still evolving
• Increasing reference or links to patient data, with associated security requirements

So What Are These Data Anyway?
• Raw data files (sometimes)
• Analysed data files (generally)
• Results (multiple formats, often quantitative)
• Mathematical models (sometimes)
• New hypotheses (hard to encapsulate without context)
• Standard operating procedures (occasionally)
• Software, tools and interfaces (sometimes)
• 'Dark data' – miscellaneous additional or interim datasets not directly tied to a publication, which might be reusable if shared and of suitable quality

Diversity?
[Figure: many data domains – genomes, transcriptome, proteome, metabolomics and other -omes, bio-imaging, large-scale field studies, clinical data, sample-related data, variant analyses, protein interactions – all feeding into improved understanding of complex biological systems.]
• Challenges in the primary analyses (smaller) AND in meaningful integration (huge)

Size – The Rise of Genome Sequencing
• Current data doubling time ~7 months

So – Are We Really Producing 'Big Data'?
Stephens et al (2015) Big Data: Astronomical or Genomical? PLoS Biol doi:10.1371/journal.pbio.1002195
• 3.6 petabases of sequence already in public archives (SRA), and most sequence generated has not been deposited yet
• Worldwide sequencing capacity > 35 petabases per year

Simple experiment – sequencing 170 human genomes to fourfold coverage on the Illumina platform yields:
• Raw gzipped data: 2,039 GB
• BAM files (keep): 3.6 TB
• Variants (VCF format, for a public repository): 48 GB

One recent methQTL study ran across imputed SNPs (~8 million) and DNA methylation at ~half a million CpG sites: calculating all SNPs against methylation at all CpGs took 215 CPU years (47 days on 1,700 cores) and 40 TB of storage.

It Isn't Just Sequence Data Though…
• High-throughput metabolomics – targeted profiling on serum or urine samples
• For NMR, 0.5 to 2 MB per sample assay
• For mass spectrometry (MS), volume is more variable, but in the region of 7 GB per sample assay
• Targeted MS assays yield less data per sample, in the hundreds of MB
• One assay can be run every 15 minutes per machine: ~200 GB per day per instrument, and multiple machines…
• High-resolution light microscopy can yield hundreds of GB per day from a single instrument
• Newer techniques such as light-sheet microscopy can run at 1.6 GB per second for several minutes at a time
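To put these figures in proportion, the short Python sketch below simply re-derives the back-of-envelope numbers quoted above. The assay sizes, run times and core counts are the ones on the slides ("hundreds of MB" is read as ~0.3 GB), the ten-minute light-sheet acquisition is an illustrative assumption, and everything else is plain arithmetic.

```python
# Back-of-envelope checks on the figures quoted above; the input values come
# from the slides, only the arithmetic is added here.

# Mass spectrometry: one assay every 15 minutes per instrument.
assays_per_day = 24 * 60 // 15                               # 96 assays/day
for label, gb_per_assay in [("untargeted (~7 GB/assay)", 7.0),
                            ("targeted (~0.3 GB/assay, assumed)", 0.3)]:
    print(f"MS {label}: ~{assays_per_day * gb_per_assay:.0f} GB/day")
# The quoted ~200 GB per day per instrument sits within this range.

# methQTL study: 47 days of wall-clock time on 1,700 cores.
cpu_years = 47 * 1700 / 365.25
print(f"methQTL compute: ~{cpu_years:.0f} CPU years")        # ~219 vs the quoted 215

# Light-sheet microscopy at 1.6 GB/s; a 10-minute acquisition is assumed here.
print(f"Light-sheet, 10-minute run: ~{1.6 * 600 / 1024:.2f} TB")
```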
Adding Complexity – Formats, Standards, Repositories
• One raw data type BUT many file formats, which may be human-readable, require specific software, or be proprietary or open source
• Over 1,552 different public databases, most limited by data domain, origin or both (NAR online Molecular Biology Database Collection)
• 30+ minimum reporting guidelines and standards for bio/biomedical data, but few cross experimental types = fragmentation and confusion for non-domain specialists

It's Complicated… Different Data Types, Different Repositories
A small selection of core public repositories:
• Primary databases – DNA or protein sequence
• Secondary databases – derived information, e.g. protein domains
• Protein structure or other data, e.g. crystal coordinates

Public Repositories
• NAR online Molecular Biology Database Collection http://www.oxfordjournals.org/nar/database/c/ – currently 1,552 databases
• Limited by data domain or origin, or both
• One project may require data submission to more than one
• Some cross-referencing across databases
• Each has its own format and metadata requirements
• Some are manually curated; many are not
• Data submission may be a requirement for journal publication
• Fragmentation makes finding the data harder

The Publication Complication
• May cross-reference datasets across databases (good)
• Data submission may be a requirement for journal publication (good)
• Quality assurance can be variable
• Large datasets can take weeks to prepare and validate for submission, and can generate hundreds of thousands of lines of XML and terabytes of data
• Automation is complicated by regular changes to uploaders
• Where to put the other associated data that may not be linked to a publication?

Published But Private Too
• Some data can never be fully and openly shared, e.g. identifiable human data, which stays on private networks (including the NHS)
• Identification can occur via aggregation of otherwise unremarkable parameters
• Different rules in different countries (e.g. age, DOB)
• Whole genomes are classed as identifiable
• Publication has to adhere to ethics agreements (not standardised)
• An additional complication arises when research requires access to clinical data for phenotypic interpretation
• Public databases are now available for human genomic data where access is granted only case by case by an ethics panel, e.g. https://www.ebi.ac.uk/ega/home

Multiple Funders With Multiple Requirements and… and…
• Different funders require different data management plans

Data Stages – RNA-Seq Experiment
[Figure: total data volume at successive stages of an RNA-Seq experiment – 260 MB, 32 GB, 240 GB, 24 TB]

Data-Centric Science – It's All About the Data
"Hypotheses are not only tested through directed data collection and analysis but also generated by combining and mining the pool of data already available." Goble and De Roure (2009), from The Fourth Paradigm: Data-Intensive Scientific Discovery (edited by Hey, Tansley and Tolle).
But in order to do this, data have to be discoverable and re-usable.

The Data Lifecycle for Systems Biology
Systems Biology requires:
• Multiple different data sources and types
• Very good metadata to ensure data interoperability
• Re-use of published data from public repositories
Figure from ISBE D 2.2

FAIR and RARE
• Data should be FAIR – Findable, Accessible, Interoperable, Reusable/Reproducible
• Research practices should be RARE – Robust, Accountable, Intelligible, Reproducible
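In practice, FAIR starts with attaching a machine-readable description to every deposited dataset. The snippet below is a deliberately minimal, hypothetical illustration of such a record – the field names and values are invented for this sketch rather than taken from any mandated schema – showing how identifiers, access routes, open formats, reporting standards and licences map onto the four FAIR principles.

```python
# A toy, machine-readable dataset description illustrating the FAIR principles.
# Field names and values are hypothetical, not a mandated standard.
import json

dataset_record = {
    # Findable: a persistent identifier and rich descriptive metadata
    "identifier": "doi:10.0000/example.rnaseq.001",         # placeholder DOI
    "title": "RNA-Seq of example tissue, condition A vs B",
    "keywords": ["RNA-Seq", "transcriptome", "Illumina"],
    # Accessible: where and how the data can be retrieved
    "access_url": "https://repository.example.org/datasets/001",
    "access_conditions": "open",                             # or "managed", as at EGA
    # Interoperable: open formats and community reporting standards
    "file_formats": ["FASTQ (gzip)", "BAM", "VCF"],
    "reporting_standard": "MINSEQE",                         # minimum-information guideline
    "organism_ontology_term": "NCBITaxon:9606",
    # Reusable: licence, provenance and links to the analysis used
    "licence": "CC-BY-4.0",
    "protocol_sop": "https://repository.example.org/sops/rnaseq-v2",
    "analysis_pipeline": "https://workflows.example.org/rnaseq-quant/v1.3",
}

print(json.dumps(dataset_record, indent=2))
```

Even this small amount of structure is enough for a repository or search index to answer the basic reuse questions: what is it, can I get it, can I read it, and may I use it.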
History of Data Sharing
Principles in large bio-projects:
• Bermuda agreement, 1996
• Fort Lauderdale agreement, 2003

Bio-Data Standards
• 30+ minimum reporting guidelines for diverse areas of biological and biomedical data
• Few cross experimental types – confusion and fragmentation
• Differing levels of use and maturity
• 'Minimum' can still be complex – hence the 'just enough' movement
• Multiple standard formats for reporting, e.g. MAGE-ML
• Not always easy to find the associated tools to help use them
http://mibbi.sourceforge.net/portal.shtml

ISA-TAB
• The ISA-TAB framework uses an investigation, study, assay hierarchy
• Acts as a framework for associating complex data from a large investigation
• Uptake is increasing, but it is still not very widely used
• Lack of helpful tools to support researchers in using it…
http://isatab.sourceforge.net/format.html

Xperimentr – Tomlinson et al (2013) BMC Bioinformatics, doi:10.1186/1471-2105-14-8

Organising Local and Pre-publication Data
• Can be project- or data-type-centric
• Can support data sharing within large projects and collaborations
• Databases can have more functionality than spreadsheets…
• Need to support data that will not be published directly
• Write a data management plan; update it at least once a year; share it with everyone working on the project…
• Be consistent about HOW and WHERE you store metadata
• Find out which standards are available for your experimental types BEFORE you generate data

Example – Bridging the Gaps in One Domain – Bio-imaging
• Confocal image analysis and feature detection
• Sample tracking for image analysis specialists
• Bespoke automated analysis systems for biologists
• Maintaining an OMERO (OME) database for photonics researchers
• MRI scan management solution for research groups

Example – Encouraging Electronic Data Capture – Mobile Applications for Data Input
LabBook http://labbook.cc
• Customisable, geo-tagged data capture in the field
• Automated remote database storage
• Secure backup, sharing, search and version control via the website
• Handwritten notes and annotation
• Supports photos, videos, file attachments, voice memos and barcode scanning

Practical Improvements for Increasingly Large-Scale Data
What can we learn from collaborators?
• High energy physics
• Astronomy
• Photonics
• Chemistry
• Mathematics
• Computer science
Example: GenomeThreader in the MapReduce framework

Encourage Reuse
Reproducible analysis workflows (sketched below):
• Standardise analyses across multiple datasets
• Can ensure use of the same parameters and software versions
• Can be designed to produce accompanying metadata
• Can be published as a pipeline to accompany the data (for other people to use – interoperability)
Standard operating procedures:
• Ensure experiments are carried out on standard, comparable samples
• Essential for quality assurance and for a better understanding of how the data were produced, and hence for data reuse
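As a concrete illustration of a workflow that produces its own accompanying metadata, the sketch below records the exact commands, parameters and software environment of each step as it runs. It is a minimal, hypothetical example: a real pipeline would normally delegate this to a workflow manager such as Snakemake, Nextflow or Galaxy, and the stand-in "alignment" command here just calls Python so the sketch runs anywhere.

```python
# Minimal sketch of a reproducible analysis step that records its own metadata.
# The stand-in command is not a real aligner; it simply stands for the exact
# command line a pipeline would apply identically to every dataset.
import datetime
import json
import subprocess
import sys

PROVENANCE = "analysis_provenance.jsonl"

def record(entry):
    """Append one provenance record, as a JSON line, to the metadata file."""
    with open(PROVENANCE, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

def run_step(name, command, outputs):
    """Run one pipeline step and record exactly what was executed."""
    started = datetime.datetime.now(datetime.timezone.utc).isoformat()
    subprocess.run(command, check=True)           # fail loudly, never silently
    record({
        "step": name,
        "command": command,                        # exact command and parameters used
        "outputs": outputs,
        "started_utc": started,
        "finished_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

if __name__ == "__main__":
    # Record the software environment before any analysis runs.
    record({"python_version": sys.version})
    # The same command and parameters are then applied identically to every dataset.
    for sample in ["sampleA", "sampleB"]:
        run_step(
            name=f"align_{sample}",
            # Stand-in for a real aligner command line:
            command=[sys.executable, "-c", f"print('aligning {sample}')"],
            outputs=[f"{sample}.bam"],
        )
```

Publishing a provenance file like this alongside the pipeline and the data gives other groups both the recipe and a record of how it was actually applied.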
Grass Roots Challenges
• Integrative approaches repeatedly show that complete metadata are vital for optimal data reuse, BUT metadata capture is still a complex, time-consuming task
• Data fragmentation across multiple sites is still a major barrier to uptake (can't find it… can't use it…)
• Practical aspects – the cost of storage and curation, and the sheer volume of datasets
• Difficulty of obtaining consistent funding for fundamentals – maintaining core infrastructure, software and databases
• Staff – a shortage of truly interdisciplinary infrastructure and knowledge providers, and their career progression

A Few Infrastructure Projects/Initiatives to Watch
• Infrastructure for Systems Biology Europe (ISBE) http://project.isbe.eu/
• ELIXIR http://www.elixir-europe.org
• BioMedBridges http://www.biomedbridges.eu/
• Research Data Alliance https://rd-alliance.org/
• Software Sustainability Institute http://www.software.ac.uk/
• BioSharing http://biosharing.org/
• FAIRDOM http://fair-dom.org/