A nation-wide infrastructure for Biobank-based Integrative Omics Studies: the BIOS consortium

Matthijs Moed / Leids Universitair Medisch Centrum
m.h.moed@lumc.nl

Introduction
• Biobanks collect biological material for particular cohorts
• Study molecular causes of human disease
• Which genetic factors influence a phenotype (disease)?

Integrative omics
• Genetics does not tell the entire story
• Integrate layers of data for mechanistic insight:
  • Genomics
  • Epigenomics
  • Transcriptomics

BBMRI: Biobank-based Integrative Omics Studies (BIOS)
• Six biobanks
[Figure: overview of data layers. Recoverable labels: "4,000 ✔", "4,000 (most of them generated)", "4,000 ✔ (in-depth phenotyping)"]

Data storage
• 20 TB of raw data
• Grid storage:
  • Disk
  • Double tape backup
• Lightpaths for fast transfer (1 Gbit/s)
• Data management protocol
• Security:
  • Grid certificates
  • VO membership approval
  • Code of conduct: data does not leave the grid

Metadata organization
• Storage of metadata is crucial:
  • Trackability, reproducibility
  • Connect data types for each sample
  • Integration of data that is normally spread across several files
  • Requirement for grid computation
• Metadata design co-evolves with analyses
• Traditional SQL-like databases too difficult: enter CouchDB
• Hosted on the HPC cloud
• Clear responsibilities:
  • Sample tracking
  • Centralized storage

Data analysis – computational requirements
• Embarrassingly parallel computation on the grid: RNASeq alignment
  • 2,300 samples
  • 150,000 CPU hours in 4 days
• High-memory computation on an HPC cloud VM (methylation data QC)
  • 256 GB of memory
  • 48 core hours
• IO-heavy computation:
  • Methylation quantitative trait loci analyses

Data analysis – human requirements
• Grid infrastructure has a steep learning curve
• Cloud VM/local LSG clusters are more familiar to less experienced users
→ Complementarity of grid and HPC cloud infrastructure is essential

Role of e-infrastructure
• Scope and scale of studies will increase
• Secure storage of data essential:
  • Authorization
  • Data integrity
  • Availability
• In the future: move to centralized infrastructure
  • Requires simplified grid middleware
  • Requires adaptation of existing software
  • Metadata organization

Acknowledgements
• Michiel van Galen, Maarten van Iterson, Matthijs Moed, Jan Bot, Leon Mei, Jeroen van Rooij, Marijn Verkerk, Freerk van Dijk, Mark-Jan Bonder, Peter-Bram 't Hoen, Rene Luijk, Dasha Zhernakova, Patrick Deelen, Wibowo Arindrarto, Martijn Vermaat, Joyce van Meurs, Szymon Kiełbasa, Erik van Zwet, Jenny van Dongen, Gonneke Willemsen, Joris Deelen, Ruud van der Breggen, Mila Jhamai, Renee de Menezes, André Uitterlinden, Morris Swertz, Dorret Boomsma, Eline Slagboom, Cisca Wijmenga, Cornelia van Duijn, Jan Veldink, Marleen van Greevenbroek
• Management team: Rick Jansen, Lude Franke, Aaron Isaacs, Bas Heijmans (chair)
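
Backup: illustrative metadata sketch
The metadata slide calls for connecting the data types of each sample and integrating data spread across several files. CouchDB stores JSON documents, so one possible shape is a single document per sample. The sketch below is an assumption for illustration only: the field names, sample IDs, biobank labels, file paths, and the helper functions `complete_samples` and `missing_layers` are hypothetical and do not describe the actual BIOS schema.

```python
import json

# Hypothetical per-sample metadata documents (CouchDB-style JSON).
# All field names, IDs, and paths are illustrative placeholders.
SAMPLE_DOCS = [
    {
        "_id": "sample-0001",
        "biobank": "biobank-A",
        "files": {
            "genomics": ["sample-0001/genotypes.vcf.gz"],
            "epigenomics": ["sample-0001/methylation.idat"],
            "transcriptomics": [
                "sample-0001/rnaseq_R1.fastq.gz",
                "sample-0001/rnaseq_R2.fastq.gz",
            ],
        },
    },
    {
        "_id": "sample-0002",
        "biobank": "biobank-B",
        "files": {
            "genomics": ["sample-0002/genotypes.vcf.gz"],
            "epigenomics": ["sample-0002/methylation.idat"],
            # No transcriptomics files registered yet for this sample.
        },
    },
]

# The three data layers named on the "Integrative omics" slide.
REQUIRED_LAYERS = ("genomics", "epigenomics", "transcriptomics")


def missing_layers(doc, layers=REQUIRED_LAYERS):
    """Return the layers for which the document lists no files."""
    files = doc.get("files", {})
    return [layer for layer in layers if not files.get(layer)]


def complete_samples(docs, layers=REQUIRED_LAYERS):
    """Return the IDs of samples that have at least one file per layer."""
    return [doc["_id"] for doc in docs if not missing_layers(doc, layers)]


if __name__ == "__main__":
    # Show one document as it might be stored, then the completeness check.
    print(json.dumps(SAMPLE_DOCS[0], indent=2))
    print("Complete samples:", complete_samples(SAMPLE_DOCS))
    for doc in SAMPLE_DOCS:
        print(doc["_id"], "missing:", missing_layers(doc))
```

Keeping every file reference for a sample in one document lets a grid job resolve all of its inputs with a single lookup, and adding a new data layer only means adding a key, which fits a metadata design that co-evolves with the analyses.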