A nation-wide infrastructure for Biobank-based Integrative Omics Studies: the BIOS consortium
Matthijs Moed / Leids Universitair Medisch Centrum
m.h.moed@lumc.nl
Introduction
•  Biobanks collect biological material for particular cohorts
•  Study molecular causes of human disease
•  Which genetic factors influence a phenotype (disease)?
Integrative omics
•  Genetics does not tell the entire story
•  Integrate layers of data for mechanistic insight:
    •  Genomics
    •  Epigenomics
    •  Transcriptomics
BBMRI: Biobank-based Integrative Omics Studies (BIOS)
•  Six biobanks
4,000 ✔
4,000 (most of them generated)
4,000 ✔ (in-depth phenotyping)
Data storage
•  20 TB of raw data
•  Grid storage:
    •  Disk
    •  Double tape backup
•  Lightpaths for fast transfer (1 Gbit/s)
•  Data management protocol
•  Security:
    •  Grid certificates
    •  VO membership approval
    •  Code of conduct: data does not leave the grid
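As a rough back-of-the-envelope check on the storage figures above, the sketch below estimates how long 20 TB of raw data would take to cross a single 1 Gbit/s lightpath. It assumes decimal terabytes and, by default, an ideal fully utilized link. The function name and the `efficiency` parameter are illustrative and do not come from the slides.

```python
def transfer_time_hours(size_bytes: float, link_bits_per_s: float,
                        efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link of the given bit rate.

    efficiency is the assumed fraction of the nominal rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 3600


RAW_DATA_BYTES = 20e12  # 20 TB, assuming decimal terabytes
LIGHTPATH_BPS = 1e9     # 1 Gbit/s, as quoted for the lightpaths

ideal_hours = transfer_time_hours(RAW_DATA_BYTES, LIGHTPATH_BPS)
# Assumed 80% effective throughput, purely for illustration.
derated_hours = transfer_time_hours(RAW_DATA_BYTES, LIGHTPATH_BPS, efficiency=0.8)

print(f"ideal link: {ideal_hours:.1f} h, at 80% efficiency: {derated_hours:.1f} h")
```

Under these assumptions, a full copy of the raw data takes on the order of two days per lightpath.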
Metadata organization
•  Storage of metadata is crucial:
    •  Trackability, reproducibility
    •  Connect data types for each sample
    •  Integration of data normally distributed in several files
    •  Requirement for grid computation
•  Metadata design co-evolves with analyses
•  Traditional SQL-like databases too difficult: enter CouchDB
    •  Hosted at HPC cloud
•  Clear responsibilities:
    •  Sample tracking
    •  Centralized storage
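To illustrate how a document store can connect the data types for each sample, here is a minimal sketch of one JSON document per sample. CouchDB stores JSON documents keyed by an `_id` field, but every other field name, the sample identifier, the biobank label and the file paths below are hypothetical. The actual BIOS schema is not described in the slides.

```python
import json


def make_sample_doc(sample_id, biobank, files):
    """Build one JSON-serializable document linking all data types of a sample.

    files maps a data type (e.g. "transcriptomics") to a list of file paths.
    """
    return {
        "_id": f"sample:{sample_id}",  # CouchDB document identifier
        "type": "sample",
        "biobank": biobank,
        "data": {dtype: {"files": sorted(paths)} for dtype, paths in files.items()},
    }


def files_for(doc, dtype):
    """Return the files recorded for one data type, or an empty list."""
    return doc["data"].get(dtype, {}).get("files", [])


# Hypothetical sample with three omics layers spread over several files.
doc = make_sample_doc(
    "S0001",
    "biobank_A",
    {
        "genomics": ["/grid/example/S0001.vcf.gz"],
        "epigenomics": ["/grid/example/S0001_Red.idat", "/grid/example/S0001_Grn.idat"],
        "transcriptomics": ["/grid/example/S0001_R1.fastq.gz", "/grid/example/S0001_R2.fastq.gz"],
    },
)

print(json.dumps(doc, indent=2))
print(files_for(doc, "transcriptomics"))
```

Because the document is schema-free, a new data type or QC field can be added to `data` without a migration, which is one plausible reading of why a design that co-evolves with the analyses favours a document store.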
Data analysis – computational requirements
•  Embarrassingly parallel computation on grid: RNASeq alignment
    •  2,300 samples
    •  150,000 CPU hours in 4 days
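Taking the figures above at face value (2,300 samples, 150,000 CPU hours, 4 days of wall-clock time), the arithmetic below shows what they imply for average concurrency and per-sample cost. It assumes the load was spread evenly over the four days, which the slides do not state.

```python
SAMPLES = 2300
CPU_HOURS = 150_000
WALL_DAYS = 4

wall_hours = WALL_DAYS * 24                   # 96 hours of wall-clock time
avg_concurrent_cores = CPU_HOURS / wall_hours
cpu_hours_per_sample = CPU_HOURS / SAMPLES

print(f"average cores busy: {avg_concurrent_cores:.1f}")
print(f"CPU hours per sample: {cpu_hours_per_sample:.1f}")
```

On average, roughly 1,560 cores were busy at once, at about 65 CPU hours per sample. Since each sample is aligned independently, the work splits into one job per sample with no communication between jobs.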
Data analysis – computational requirements
•  High-memory computation on HPC cloud VM (methylation data QC)
    •  256 GB of memory
    •  48 core hours
Data analysis – computational requirements
•  IO-heavy computation:
    •  Methylation quantitative trait loci analyses
Data analysis – human requirements
•  Grid infrastructure has steep learning curve
•  Cloud VM/local LSG clusters more familiar to less experienced users
➢  Complementarity of grid and HPC cloud infrastructure is essential
Role of e-infrastructure
•  Scope and scale of studies will increase
•  Secure storage of data essential:
    •  Authorization
    •  Data integrity
    •  Availability
•  In the future: move to centralized infrastructure
    •  Requires simplified grid middleware
    •  Requires adaptation of existing software
    •  Metadata organization
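One common way to check data integrity in practice is to record a checksum when a file is stored and recompute it after every transfer. The sketch below uses SHA-256 from the Python standard library. It is a generic illustration: the slides do not say which integrity mechanism BIOS uses, and the helper names and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_hex


# Demo on a throwaway file: record a digest, then simulate corruption.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "sample.dat"
    sample_file.write_bytes(b"ACGT" * 1000)
    recorded = sha256_of(sample_file)       # would be stored alongside the metadata
    ok_before = is_intact(sample_file, recorded)

    sample_file.write_bytes(b"ACGA" * 1000)  # content changed after recording
    ok_after = is_intact(sample_file, recorded)

print(ok_before, ok_after)
```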
Acknowledgements
•  Michiel van Galen, Maarten van Iterson, Matthijs Moed, Jan Bot, Leon
Mei, Jeroen van Rooij, Marijn Verkerk, Freerk van Dijk, Mark-Jan
Bonder, Peter-Bram 't Hoen, Rene Luijk, Dasha Zhernakova, Patrick
Deelen, Wibowo Arindrarto, Martijn Vermaat, Joyce van Meurs, Szymon
Kiełbasa, Erik van Zwet, Jenny van Dongen, Gonneke Willemsen, Joris
Deelen, Ruud van der Breggen, Mila Jhamai, Renee de Menezes, André
Uitterlinden, Morris Swertz, Dorret Boomsma, Eline Slagboom, Cisca
Wijmenga, Cornelia van Duijn, Jan Veldink, Marleen van Greevenbroek
•  Management team: Rick Jansen, Lude Franke, Aaron Isaacs, Bas
Heijmans (chair)