“Big Data” Bioinformatics
Mario Caccamo (TGAC)

Understanding Variation
high yield / low yield
…A C T T T G G G C C C C…
…A C T T T G A G C C C C…

Understanding Variation
resistant / susceptible
…A C T T T G G G C C C C…
…A C T T T G A G C C C C…

Genetic Variation

‘Omics
Variomics, Transcriptomics, Genomics, Phenomics, Proteomics, Metagenomics, Metatranscriptomics, Integromics

Omics classes
Genomics, Transcriptomics, Proteomics, Metabolomics, Phenomics
[Diagram: Sequence Information, Pathways (Reactome), Structure, Expression and Pathways each exposed through an API; Data / DBMS; API; Bioinformatics; Portal/Publication]

Assembly
Genome
Genome is fragmented.
Sequence fragments - traces. ACCTG.. CCTTG..
Assemble fragments - contigs.
Link contigs into scaffolds. NNNNN

Repeats
[Figure: repeat copies A1, A2, A3; 5x; >90% DNA]

De Bruijn Graphs - Contigs
[Figure: De Bruijn graph built from overlapping k-mers of example reads; edges labelled with coverage (1x, 2x, 3x)]
$P(d > 0) = 1 - e^{-N(L-K)/G}$, with $L > K$

E. coli Reference

Sequencing Technologies
Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal, Nature 458, 719-724 (9 April 2009)

Heavy Weights
HiSeq 2000: 75-120 base reads, ~200 Gb / expt, ~8 days
AB SOLiD 4 System: 50 base reads, ~100 Gb / expt (300Gb 4hq), ~10 days

“Sequencing Centres”
James Hadfield - CRUK

We want more….
• “Genome Zoo” – 10K vertebrates
• 1000 plants
• 1001 Arabidopsis
• Cancer genomes
• Single cell?
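The De Bruijn graph idea from the assembly slides can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names `build_de_bruijn` and `walk_contig`, the toy reads and the choice of k = 4 are all assumptions made for the example. Each k-mer in a read becomes an edge between its (k-1)-base prefix and suffix, and the edge count plays the role of the 1x / 2x / 3x coverage labels in the figure.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Count edges (prefix -> suffix) for every k-mer seen in the reads."""
    edge_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # An edge joins the first k-1 bases to the last k-1 bases.
            edge_counts[(kmer[:-1], kmer[1:])] += 1
    return edge_counts


def walk_contig(edge_counts, start):
    """Extend a contig from `start` while the path is unambiguous."""
    successors = defaultdict(list)
    in_degree = Counter()
    for src, dst in edge_counts:
        successors[src].append(dst)
        in_degree[dst] += 1

    contig, node, seen = start, start, {start}
    # Stop at a branch (several successors), a join (several predecessors),
    # a dead end, or a node already visited (a cycle, e.g. a repeat).
    while len(successors[node]) == 1:
        nxt = successors[node][0]
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping reads (hypothetical data, error-free).
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
edges = build_de_bruijn(reads, k=4)
print(edges[("GCG", "CGT")])        # 3 (the k-mer GCGT occurs in all three reads)
print(walk_contig(edges, "ATG"))    # ATGGCGTGCA
```

In this toy case every node has a single way in and out, so the walk recovers one contig. Sequencing errors and repeats create branches and cycles, which is where the walk stops and where real assemblers need additional graph-simplification steps that this sketch does not attempt.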
Single-Molecule Sequencing
Single molecule: primer immobilized
DNA sequence detection as molecules pass through a nm-sized pore
Single molecule sequencing by enzyme tethered in 20 nm hole

Bioinformatics
[Figure: Bioinformatics, Data]

Light Weights
SOLiD PI System, GS Junior, iPod dock, Ion Torrent, iSCAN

Embracing Complexity
“Solving the puzzle of complex diseases, from obesity to cancer, will require holistic understanding of the interplay between factors such as genetics, diet, infectious agents, environment, behavior, and social structures.”
Elias Zerhouni, The NIH Roadmap, Science 2003, 302:63-64.
“Our brains are wired for narrative, not statistical uncertainty” – Francis Bacon (I think)

The Genome Analysis Centre
A new facility to provide critical mass and excellence in genomics, specialised in animal, microbial and plant research:
- high throughput sequencing
- new technology platforms
- bioinformatics
- impact through innovation and enterprise
TGAC will complement work being carried out at The Wellcome Trust Sanger Institute and MRC / NERC Centres.
Norwich
Norwich Research Park: University Hospital, UEA, JIC, TSL, TGAC, IFR
www.tgac.bbsrc.ac.uk

Next Generation Platforms
Pyrosequencing: > 400 bases / read, ~500 Mb / expt
Sequencing by synthesis: >75 base paired end reads, >30 Gb / expt
Sequencing by ligation: 50 base paired end reads, >30 Gb / expt

Data Analysis Pipeline
[Diagram: TGAC; Sequencing Instruments (x3); Genome Technology Division; Sequencing Informatics; Primary Analysis; Web Interface; LIMS; Genome Analysis; Computational Genomics; Internal Users; External Users; Annotated High quality data]

Public Repositories
[Diagram: Repositories, “Submitter decides”: Sequencing Centre, Trace Archive, EMBL Archive, EBI Genome Archive. Annotated genomes, “Community decides”: Community Input, Sequencing Centre, Genome Browsers]

DAS
[Diagram: genome browser with local storage and reference sequence; DAS client; XML; DAS servers (x3), each with remote storage]
[Diagram: Browsers; DAS server; Visualisation; API; Data Integration; Programmatic Access; Data mining; Data Sharing; DAS tracks]

Current Model
[Diagram: Sequencing Instruments (x3) producing ACGTTTCCC….; Storage; High-Performance Cluster; Scientist / User; Sequencing Centre; Download; Submission; Multi Peta-byte High-Performance storage; Public Repository Storage; EMBL Repositories]

C. Southan & G. Cameron, “The Fourth Paradigm”
[Diagram: Sequencing Centre; Sequencing Instruments (x3) producing ACGTTTCCC….; LIMS; Staging Storage; Primary Analysis; QA; Assemblies; Multi Peta-byte High-Performance storage; VM test environment]
[Diagram, continued: Metadata; Analysis Output; User; High-Performance Cluster; Analysis Submission; Virtual Machine Pool; Cloud infrastructure]

Summary
• Sequencing Technologies
  – Single-molecule
  – Cheaper ($1000 genome)
  – Faster (real-time sequencing)
• Hardware
  – Connectivity
  – Storage
  – HPC clusters
  – Cloud Computing (GrayWulf)
• Software
  – Bioinformatics Staff
  – New paradigms
  – Data Integration & Dissemination
  – Data visualisation

Thanks!