Data Complexity in the Life Sciences

Negotiating the Maze: Data Complexity in the Life Sciences
Sarah Butcher
Bioinformatics Support Service
• Characteristics of Life Sciences data as ‘Big Data’
• The Life Sciences Data Maze
• Observations on good practice
What We Do – Bioinformatics Facility
We support all stages in the data lifecycle - experimental design,
data and metadata capture, primary and later stage analyses,
data management, visualisation, sharing and publication
Large-scale genomics & Next Generation Sequencing Analyses
Tools for multiplatform data and metadata management
Bespoke clinical and biological databases, tissue-banking
Software and script development, data visualisation, mobile apps
Full grant-based collaboration across disciplines
Brokering, skills sharing, advocacy
New ways of high throughput working – e.g. cloud, workflows
Teaching, Workshops and One-to-One tutorials
A variety of skill sets covering wet-lab biology, statistics and computer science
Bio-data Characteristics – The Basics
• Lack of structure, rapid growth but not (very) huge volume, high heterogeneity
• Multiple file formats, widely differing sizes and acquisition rates
• Considerable manual data collection
• Multiple format changes over the data lifetime, including production of (evolving) exchange formats
• Huge range of analysis methods, algorithms and software in use, with wide-ranging computational profiles
• Association with multiple metadata standards and ontologies, some of which are still evolving
• Increasing reference or links to patient data, with associated security requirements
So What Are These Data Anyway?
• Raw data files (sometimes)
• Analysed data files (generally)
• Results (multiple formats, often quantitative)
• Mathematical models (sometimes)
• New hypotheses (hard to encapsulate without context)
• Standard operating procedures (occasionally)
• Software, tools and interfaces (sometimes)
• ‘Dark data’: miscellaneous additional or interim datasets that aren’t directly tied to a publication; might be reusable if shared and of suitable quality
Other -omes
Improved understanding of complex biological systems
Challenges in primary analyses (smaller) AND in meaningful integration (huge)
Large-scale field studies
Clinical data,
Sample-related data
Size - The Rise Of Genome Sequencing
Current data doubling time ~ 7 months
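To get a feel for what a doubling time of about 7 months implies, the growth factor after t months is 2^(t/7). The sketch below is only illustrative: it assumes the doubling time stays constant, and the function name is made up for this example.

```python
def growth_factor(months: float, doubling_time_months: float = 7.0) -> float:
    """Factor by which data volume grows over `months`.

    Assumes a constant doubling time (an idealisation, not a forecast).
    """
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    for years in (1, 2, 5):
        factor = growth_factor(12 * years)
        print(f"after {years} year(s): x{factor:,.1f}")
```

Under this assumption the volume grows a little over threefold per year.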
So, Are We Really Producing ‘Big Data’?
• Stephens et al (2015) Big Data: Astronomical or Genomical? PLoS Biol doi:10.1371/journal.pbio.1002195.
• 3.6 petabases of sequence already in public archives (SRA); most hasn’t been deposited yet
• Worldwide sequencing capacity > 35 petabases per year
• Simple experiment: sequencing 170 human genomes to fourfold coverage (Illumina platform) yields:
  • Raw gzipped data: 2039 Gb
  • Bam files (keep): 3.6 Tb
  • Variants (Vcf format, for public repository): 48 Gb
• One recent MethQTLs study across imputed SNPs, 8 million methylation sites, half a million CpGs: calculating all SNPs vs DNA methylation at all CpGs took 215 CPU years (47 days on 1,700 cores) and 40 TB of storage
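The figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: "Gb" and "Tb" are read as gigabytes and terabytes with 1 Tb = 1000 Gb, a year is taken as 365.25 days, and the function names are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length


def cpu_years(wall_days: float, cores: int) -> float:
    """Core-time, in CPU years, of a job occupying `cores` for `wall_days`."""
    return wall_days * cores / DAYS_PER_YEAR


def per_genome(total: float, genomes: int = 170) -> float:
    """Average volume per genome, in the same unit as `total`."""
    return total / genomes


if __name__ == "__main__":
    # 47 days on 1,700 cores, as quoted for the MethQTL study
    print(f"CPU years: {cpu_years(47, 1700):.0f}")
    # 170-genome experiment; sizes in Gb as quoted (3.6 Tb taken as 3600 Gb)
    for label, total_gb in (("raw gzipped", 2039), ("BAM", 3600), ("VCF", 48)):
        print(f"{label}: {per_genome(total_gb):.2f} Gb per genome")
```

47 days on 1,700 cores works out at about 219 CPU years, close to the quoted 215, so the quoted numbers are presumably rounded.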
It Isn’t Just Sequence Data Though…
High-throughput metabolomics: targeted profiling on serum or urine samples
• For NMR, 0.5 to 2 MB per sample assay
• For mass spectrometry (MS), volume is more variable but in the region of 7 GB per sample assay
• Targeted mass spectrometry assays yield less data per sample, in the hundreds of MB
• One assay can be run every 15 minutes per machine, ~200 GB per day, per instrument. Multiple machines…
High-resolution light microscopy
• Can yield hundreds of GB per day from a single instrument
• Newer techniques such as light sheet microscopy can run at 1.6 GB per second for several minutes
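The instrument rates quoted above translate into daily volumes as follows. This is a back-of-envelope sketch: it assumes an instrument running continuously for 24 hours, and the 5-minute light sheet run is an assumed duration chosen only for illustration (the slide gives no exact figure). Function names are hypothetical.

```python
def assays_per_day(minutes_per_assay: float) -> float:
    """Number of assays one machine can run in 24 hours of continuous use."""
    return 24 * 60 / minutes_per_assay


def implied_gb_per_assay(gb_per_day: float, minutes_per_assay: float) -> float:
    """Average data per assay implied by a daily volume and an assay interval."""
    return gb_per_day / assays_per_day(minutes_per_assay)


def burst_volume_gb(rate_gb_per_s: float, minutes: float) -> float:
    """Data produced by an instrument streaming at a fixed rate for `minutes`."""
    return rate_gb_per_s * minutes * 60


if __name__ == "__main__":
    print(f"assays per day at one per 15 min: {assays_per_day(15):.0f}")
    print(f"implied GB per assay at 200 GB/day: {implied_gb_per_assay(200, 15):.1f}")
    # Assumed 5-minute acquisition at the quoted 1.6 GB per second
    print(f"light sheet, 5 min run: {burst_volume_gb(1.6, 5):.0f} GB")
```

A single multi-minute light sheet run can therefore match or exceed a full day of output from a conventional instrument.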
Adding Complexity – Formats, Standards, Repositories
• One raw data type BUT many file formats: may be human readable or require specific software, proprietary or open source
• Over 1552 different public databases, most limited by data domain, origin or both (NAR online Molecular Biology Database Collection)
• 30+ minimum reporting guidelines and standards for bio/biomedical data, but few cross experimental types
= fragmentation, confusion for non-domain specialists
It’s Complicated …. Different Data Types, Different
Small selection of core public repositories
Primary database – DNA or protein sequence
Secondary - (derived information e.g. protein domains)
Protein structure or other (e.g. crystal coordinates)
Public Repositories
• NAR online Molecular Biology Database Collection: currently 1552 databases
• Limited by data domain or origin or both
• One project may require data submission to >1 database
• Some cross-referencing across databases
• Each has its own format and metadata requirements
• Some are manually curated, many are not
• Data submission may be a requirement for journal publication
• Fragmentation makes finding the data harder
The Publication Complication
• May cross-reference datasets across databases (good)
• Data submission may be a requirement for journal publication (good)
• Quality assurance can be variable
• Large datasets can take weeks to prepare/validate for submission and generate hundreds of thousands of lines of XML and TB of data
• Automation complicated by regular changes to uploaders
• Where to put the other associated data that may not be linked to a publication?
Published But Private Too
• Some data can never be fully and openly shared, e.g. identifiable human data
  - stays on private networks (including NHS)
• Identification can occur via aggregation of otherwise unremarkable data
• Different rules in different countries (e.g. age, DOB)
• Whole genomes classed as identifiable
• Publication has to adhere to ethics agreements (not standardised)
• Additional complication when research requires access to clinical data for phenotypic interpretation
• Public databases now available for human genomic data, where access is only granted on a case-by-case basis behind an ethics panel
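The point about identification through aggregation can be illustrated with a toy check. The sketch below uses the k-anonymity idea, which is a technique chosen here for illustration and not one named in the slides: it reports the size of the smallest group of records sharing the same values for a set of fields. The records and field names are entirely made up.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group(records: List[Dict], fields: Sequence[str]) -> int:
    """Size of the smallest group of records sharing the same values for `fields`.

    A result of 1 means at least one record is unique on those fields.
    """
    keys = [tuple(record[f] for f in fields) for record in records]
    return min(Counter(keys).values())


# Made-up example records; no real data.
RECORDS = [
    {"age": 34, "sex": "F", "region": "North"},
    {"age": 34, "sex": "M", "region": "South"},
    {"age": 51, "sex": "F", "region": "South"},
    {"age": 51, "sex": "M", "region": "North"},
]

if __name__ == "__main__":
    for fields in (["age"], ["sex"], ["region"], ["age", "sex", "region"]):
        print(fields, "-> smallest group:", smallest_group(RECORDS, fields))
```

In this toy set every single field is shared by two people, yet the combination of all three singles out each individual.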
Multiple Funders With Multiple Requirements
Different funders require different data management plans
Data Stages - RNA-Seq Experiment
Total volume by data stage (figure values): 260 MB, 32 GB, 240 GB, 24 TB
Data-Centric Science – It’s All About the Data
“Hypotheses are not only tested through directed data collection and analysis but also generated by combining and mining the pool of data already available”
Goble and De Roure (2009), from The Fourth Paradigm: Data-Intensive Scientific Discovery (edited by Hey, Tansley and Tolle).
But in order to do this, data have to be discoverable and reusable
The Data Lifecycle for Systems Biology
Systems Biology requires:
• Multiple different data sources and types
• Very good metadata to ensure data interoperability
• Re-use of published data from public repositories
Figure from ISBE D 2.2
Data should be FAIR (Findable, Accessible, Interoperable, Reusable)
Research practices should be RARE (Robust, Accountable, Intelligible, Reproducible)
History of Data Sharing Principles in large bio-projects:
• Bermuda agreement 1996
• Fort Lauderdale agreement 2003
Bio-Data Standards
• 30+ minimum reporting guidelines for diverse areas of biological and biomedical data
• Few cross experimental types: confusion, fragmentation
• Differing levels of use and maturity
• ‘Minimum’ can still be complex (‘just enough’ movement)
• Multiple standard formats for reporting, e.g. MAGE-ML
• Not always easy to find associated tools to help use them
• ISA-TAB framework: investigation, study, assay
• Acts as a framework for associating complex data from a large
• Uptake increasing but still not very widely used
• Lack of helpful tools to help researchers use …
Xperimentr – Tomlinson et al (2013) BMC Bioinformatics 10.1186/1471-2105-14-8
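The investigation, study, assay hierarchy mentioned above can be pictured as nested records. The sketch below is only a conceptual illustration in plain Python: it is not the ISA-TAB file format and not the API of any ISA tool, and all class fields, identifiers and file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """One measurement type applied to samples, with its data files."""
    measurement_type: str
    technology_type: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A unit of research within an investigation, grouping related assays."""
    identifier: str
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """Top-level project context, grouping one or more studies."""
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Flatten every data file reachable from this investigation."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


if __name__ == "__main__":
    inv = Investigation("INV1", "Example investigation", [
        Study("ST1", "Example study", [
            Assay("transcription profiling", "RNA-Seq", ["run1.fastq.gz"]),
            Assay("metabolite profiling", "NMR", ["sample1.nmr"]),
        ]),
    ])
    print(inv.all_data_files())
```

Keeping heterogeneous assays under one study and investigation is what lets data of different types be associated with the same experimental context.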
Organising Local and Pre-publication Data
Can be project or data-type centric
Can support data sharing within large projects and
Databases can have more functionality than spreadsheets….
Need to support data that will not be published directly
Write a data management plan
Update it at least once a year
Share it with everyone working on the project……
Be consistent about HOW and WHERE you store metadata
Find out which standards are available for your experimental types BEFORE you generate data
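One way to stay consistent about metadata is to agree on a checklist of required fields up front and check every record against it. The sketch below is a minimal illustration; the field names in the checklist are hypothetical and would in practice come from whichever reporting standard applies to the experiment.

```python
from typing import Dict, List, Sequence

# Hypothetical checklist; replace with the fields your chosen standard requires.
REQUIRED_FIELDS = ("sample_id", "organism", "collection_date", "protocol")


def missing_fields(record: Dict, required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required fields that are absent, None or blank in `record`."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    record = {"sample_id": "S1", "organism": "", "protocol": None}
    print(missing_fields(record))
```

Running such a check at capture time is cheaper than reconstructing missing metadata when the data are submitted or reused.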
Example - Bridging the Gaps in One Domain: Bio-imaging
Confocal image analysis feature detection
Sample tracking for image analysis
Bespoke automated analysis systems for biologists
Maintaining OMERO OME database for Photonics researchers
MRI scan management solution for research groups
Example - Encouraging Electronic Data Capture: Mobile Applications for Data Input
Customisable geo-tagged data capture in the field
Automated remote database storage
Secure backup, sharing, search, version control via website
Handwritten notes, annotation
Supports photos, videos, file attachments, voice memos, barcode scanning
Practical Improvements For Increasingly Large Scale Data
What can we learn from
High Energy Physics
Computer Science
in the MapReduce
Encourage Reuse
Reproducible analysis workflows:
• Standardise analyses across multiple datasets
• Can ensure use of same parameters, software versions
• Can be designed to produce accompanying metadata
• Can be published as a pipeline to accompany data (for other people to use: interoperability)
Standard Operating Procedures:
• Ensure experiments are carried out on standard, comparable samples
• Essential for quality assurance and for a better understanding of how data were produced, and hence for data reuse
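A workflow step that records its own parameters, software version and input checksum is one simple way to "produce accompanying metadata". The sketch below is a minimal illustration using only the Python standard library; the function names and the toy analysis are hypothetical, and real workflow systems record considerably more.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def run_step(name, func, data: bytes, **params):
    """Run one analysis step and return (result, provenance record)."""
    result = func(data, **params)
    provenance = {
        "step": name,
        "parameters": params,                                # exact parameters used
        "input_sha256": hashlib.sha256(data).hexdigest(),    # identifies the input
        "python_version": platform.python_version(),         # software version
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    return result, provenance


def count_bases(data: bytes, base: str = "G") -> int:
    """Toy analysis: count occurrences of one base in an ASCII sequence."""
    return data.decode("ascii").upper().count(base.upper())


if __name__ == "__main__":
    result, prov = run_step("count_bases", count_bases, b"GATTACAGG", base="G")
    print(result)
    print(json.dumps(prov, indent=2))
```

Publishing the provenance record alongside the result lets someone else rerun the same step with the same parameters and confirm they started from the same input.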
Grass Roots Challenges
• Integrative approaches repeatedly show that complete metadata are vital for optimal data reuse BUT
• Metadata capture is still a complex, time-consuming task
• Data fragmentation across multiple sites is still a major barrier to uptake (can’t find it… can’t use it…)
• Practical aspects: cost of storage & curation, sheer volume of datasets
• Difficulty of obtaining consistent funding for fundamentals: maintaining core infrastructure, software, databases
• Staff: shortage of truly inter-disciplinary infrastructure & knowledge providers, career progression
A Few Infrastructure Projects/Initiatives to Watch
Infrastructure Systems Biology Europe
Research Data Alliance
Software Sustainability Institute