Data-Intensive Research Theme - Genomics
Mario Caccamo, TGAC
mario.caccamo@bbsrc.ac.uk

• How many data sources do you handle?
• How much data do you have already?
• How much data do you expect?
• How urgent are your analyses?
• How complex is the data?
• How complex are the analyses?
• How many researchers work with this data or these methods?
• How many do you anticipate?
• What are your measures of success/satisfaction?
• How many other places/groups in your discipline have nearly the same requirements?

How many data sources…

HiSeq 2000
• 75-120 base reads
• ~200 Gb / expt
• ~8 days

AB SOLiD 4 System
• 50 base reads
• ~100 Gb / expt (300Gb 4hq)
• ~10 days

How much data now…

Figure source: Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal, Nature 458, 719-724 (9 April 2009)

The “Next Generation Gap”
• Large dataset
• Error models / quality scores
• New applications
• Skills shortage
• Hardware

We are closing the gap…
• Novel algorithms
• New standards
• Repositories
• Intelligent storage

Towards data-intensive biology…

“Sequencing Centres”

Figure source: James Hardfield - CRUK

Data sharing model

Diagram labels: Sequencing Instrument (×3, each emitting ACGTTTCCC….); Storage; High-Performance Cluster; Storage; Scientist / User; Sequencing Centre; Download; Submission; Multi-petabyte high-performance storage; Public Repository

How much data…

Figure source: C. Southan & G.
Cameron, “The Fourth Paradigm”

How much data to expect…

www.uk10k.org

How complex...
• Variomics
• Transcriptomics
• Genomics
• Phenomics
• Proteomics
• Metagenomics
• Metatranscriptomics

De Novo Pipelines – large genomes

Pipeline stages:
• Generation of Sequence
• Assemble contigs (N50 ~1Kb)
• Extend contigs (N50 ~10Kb)
• Build Scaffolds (N50 ~100Kb-1Mb)
• Annotation (~35K genes)

Other labels on the slide:
• NGS Sequencing
• Cortex/ABySS
• Curtain/Velvet
• Integration of read pair information: paired end reads 2kb, 6kb, 10kb
• Integration of community resources: BAC-ends, genetic markers, long reads (454), sequenced BACs
• Transcriptomics, Expression, Diversity
• Integration, Annotation
• SNP (G->A)

Research community…
• Molecular Biologists
• Bioinformaticians
• Statisticians
• Sociologists, Ethicists, Policy-makers, Physicians….

Success?

Diagram labels: DNA; Cell; Proteins (eg hemoglobin, myoglobin, myosin)

…A C T T T G G G C C C C…
…T G A A A C C C G G G G…

Integromics

Omics classes:
• Genomics
• Transcriptomics
• Proteomics
• Metabolomics
• Phenomics

Diagram labels: Sequence Information; Pathways (Reactome); Structure; Data / DBMS; API (×4); Expression; Pathways; Bioinformatics; Science

Current Model

Diagram labels: Sequencing Instrument (×3, each emitting ACGTTTCCC….);
Storage; High-Performance Cluster; Storage; Scientist / User; Sequencing Centre; Download; Submission; Multi-petabyte high-performance storage; Public Repository

Sequencing Node

Diagram labels: Sequencing Instrument (×3, each emitting ACGTTTCCC….); Staging Storage; LIMS; Primary Analysis; QA; Assemblies; Multi-petabyte high-performance storage; Cloud infrastructure; VM test environment; Metadata; Analysis Output; User; High-Performance Cluster; Analysis; Submission; Virtual Machine Pool

Summary
• Sequencing Technologies
  – Single-molecule
  – Cheaper ($1000 genome)
  – Faster (on-line sequencing)
• Hardware
  – Connectivity
  – Storage
  – HPC clusters
  – Cloud Computing (GrayWulf)
• Software
  – Bioinformatics
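
A note on the N50 figures quoted for the de novo pipeline stages (~1Kb for contigs, ~10Kb for extended contigs, ~100Kb-1Mb for scaffolds): the slides do not define the metric. The sketch below assumes the commonly used definition, where N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative only and do not come from the talk.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: walk the sequences from longest to shortest and
    return the length at which the running total first reaches at least
    half of the summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one sequence length")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running total to avoid fractional halves.
        if 2 * running >= total:
            return length

    # Unreachable for non-empty input with positive lengths.
    raise ValueError("lengths must be positive")


if __name__ == "__main__":
    # Hypothetical contig lengths (in bases), for illustration only.
    example = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(example))
```

Because the metric depends only on the sorted lengths, the same function applies unchanged to contigs, extended contigs and scaffolds, which makes it a convenient way to compare successive pipeline stages.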