The Data Tsunami in Biomedical Research
Guillaume Bourque
McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University
June 5th, 2013

Next-generation sequencing (NGS)
Stein, Genome Biol. 2010

Falling cost of sequencing
DeWitt, Nat. Biotechnol. 2012

Sequencing human genomes
• 2001: The Human Genome, ~ 3 billion $
• 2011: 1000 Genomes Project, ~ 10 000 $
• 2013 (?): Your Genome, 100 - 1000 $

Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Sequencing Revolution
• Sanger sequencing: 100s of reactions… 10000s of base pairs…
• Next-Generation sequencing: Millions of reactions! Billions of base pairs!
http://www.brusselsgenetics.be
Metzker, Nat. Rev. Genet. 2010

High-throughput Sequencing
• 2009: 36 bp X 20M X 8 lanes, 6 Gbases
• 2013: 2 X 150 bp X 250M X 8 lanes, 600 Gbases
200 Human Genomes in 1 run!!!

NGS Technology Comparison
Instrument | Method | Read length | Error type | Error rate (%) | Reads per run | Time per run | Cost per 1 million bases (US$)
PacBio | Single-molecule real-time | 3 kb average | indel | 13 (single-pass) | 35000–75000 | 30 minutes to 2 hours | $2
Ion Torrent | Ion semiconductor | 200 bp | indel | ~1 | up to 4M | 2 hours | $1
454 | Pyrosequencing | 700 bp | indel | ~0.1 | 1M | 24 hours | $10
Illumina | Synthesis | 50 to 250 bp | substitution | ~0.1 | up to 3.2G | 1 to 10 days | $0.05 to $0.15
SOLiD | Ligation | 50+35 or 50+50 bp | A-T bias | ~0.1 | 1.2 to 1.4G | 1 to 2 weeks | $0.13

Advantages
• PacBio: Longest read length. Fast.
• Ion Torrent: Less expensive equipment. Fast.
• 454: Long read size. Fast.
• Illumina: High sequence yield, cost, accuracy.
• SOLiD: Low cost per base.

Disadvantages
• PacBio: Low yield at high accuracy. Equipment can be very expensive.
• Ion Torrent: Homopolymer errors.
• 454: Runs are expensive. Homopolymer errors.
• Illumina: Equipment can be very expensive.
• SOLiD: Slower than other methods, read length, longevity of the platform.

Genome Canada
• > $915M investment and > $900M in co-funding
• 100s of large-scale genomics projects
• 5 Innovation centers

Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Applications (I)
• De novo sequencing
– From the human genome…
– To all model organisms…
– To all relevant organisms (e.g. extreme genomes)…
– To “all” organisms?

Human Genome
• 3 billion DNA base pairs (bp)
• Two human genomes are ~99.9% identical
• There are ~3M bp differences between you and me
• Some of these differences explain variation in:
– Disease susceptibility
– Differences in drug metabolism
– …
www.dnacenter.com

Applications (II)
• Genome re-sequencing
– Genetic disorders
– Cancer genome sequencing
– Map genomic structural variations across individuals
– Genealogy and migration
– Agricultural crops
– …
The Cancer Genome Atlas
1000 Genomes Project

Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered.
However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”

Exome sequencing

Cancer genome sequencing
Can obtain a full catalogue of mutations
Michael Stromberg, bioinformatics.ca

Mutations in paediatric glioblastoma
Jabado, Pfister and Majewski

Mutations in paediatric glioblastoma
Sequenced the exomes of 48 paediatric GBM samples, found:
• Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
• Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours

Applications (III)
• Quantitative biology of complex systems
– New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
– From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
– Important systems: Stem cells, Cancer, Infectious diseases…

Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

High-throughput Sequencing
• 2009: 36 bp X 20M X 8 lanes, 6 Gbases
• 2013: 2 X 150 bp X 250M X 8 lanes, 600 Gbases
200 Human Genomes in 1 run!!!

Big Data 2013
• Image files: 70 TBytes
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TBytes

From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, 2013 2:15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): in the coming year we should expect 2x more reads in half the time (and reads 2x longer). Does that cause a problem?
Alex

(Figure labels: 240 TBytes, 12 TBytes)

25 TB of raw data / month
300 TB of raw data / year

Large NGS project
Cancer project with whole genome data:
• 500 matched-normal: 500 X 3 lanes = 500 X 250 GB, 125 TB raw
vs
• 500 tumors: 500 X 3 lanes = 500 X 250 GB, 125 TB raw

DNA bases sequenced at the Innovation Center
72 trillion DNA bases! Or 800 genomes at 30X
12 HiSeqs

adventure.nationalgeographic.com

Biomedical research is built on data integration
Your data
100X your data

Challenges
• NGS instruments generate TBs of data
• NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals
• Data sharing and integration are critical in biomedical research
• Sequencing data is sensitive, private and identifiable

Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Nanuq software
Has tracked data and meta-data for more than:
• 2.6 million sample aliquots,
• 20,500 reagents,
• 17,000 plates,
• 140,000 tubes,
• Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users

Standardized analysis pipelines
• Methylation: Analysis report
• RNA-Seq: Analysis report
• ChIP-Seq: Analysis report
• …

Data center at the Innovation Center
• > 1200 cores
• > 2 PB disk
• > 5 PB tape

Need more!
• UdeS Mammouth – 39168 cores
• McGill Guillimin – 16000 cores

Data processing issues
• We have many different projects all needing space and processing.
• We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
• This brings uniformity problems:
– Different setups (hardware and software)
– Different configurations
– Etc.

Our strategy
• We wrote analysis pipelines to be easily configurable across clusters.
• Same code, one ini file to customize (we already have templates for 3 cluster sites)
• We install Linux modules readable by all on all these clusters so we know exactly what is available everywhere
• We also deploy common genomes across sites.

Usage on Compute Canada

Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)
$1.5M (2012-2017)

PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)

Conclusions
• NGS offers a variety of technologies and numerous exciting applications
• Many areas of NGS data analysis are still under active development (e.g. RNA-Seq)
• A major challenge is to ensure sufficient compute and storage capacity so that more advanced analyses are not limited
• Need to work together to avoid duplication of effort in installing tools, but also to develop efficient ways to use HPC in biomedical research

Acknowledgements
• IT team: Terrance Mcquilkin, Marc-André Labonté, Genevieve Dancausse, Andras Frankel, Alexandru Guja
• EDCC team: David Morais (UdeS), Carol Gauthier (UdeS), Bryan Caron (McGill), Alain Veilleux (UdeS), ME Rousseau (McGill)
• Analysis team: Louis Letourneau, Mathieu Bourgey, Maxime Caron, Gary Lévesque, Robert Eveleigh, Francois Lefebvre, Johanna Sandoval, Pascale Marquis
• Development team: Nathalie Émond, David Bujold, Francois Cantin, Catherine Côté, Burak Demirtas, Daniel Guertin, Louis Dumond Joseph, Francois Korbuly, Marc Michaud, Thuong Ngo

guil.bourque@mcgill.ca
Questions?
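
Backup: a sketch of the "same code, one ini file" idea from the "Our strategy" slide. This is an illustration only, not the Innovation Center's actual pipeline code. The section and key names (`cluster`, `modules`, `genomes`, `scheduler`, `cores_per_node`, `genome_root`), the paths, the module versions and the `align_sample` command are all hypothetical, and Python's standard `configparser` simply stands in for whatever ini reader the real pipelines use.

```python
import configparser

# Hypothetical per-site ini template. Every key, path and version below is
# made up for illustration and is not taken from the real pipelines.
SITE_INI = """
[cluster]
name = example_site
scheduler = qsub
cores_per_node = 12

[modules]
aligner = example/bwa/0.0
samtools = example/samtools/0.0

[genomes]
genome_root = /example/shared/genomes
"""


def load_site_config(ini_text):
    """Parse one site's ini text into a ConfigParser object."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return config


def build_align_job(config, sample, genome="hg19"):
    """Build the shell lines for one alignment job from site settings only.

    The logic is the same on every cluster; anything that differs between
    sites (modules, core counts, genome location) comes from the ini file.
    `align_sample` is a placeholder command, not a real tool.
    """
    reference = "{}/{}/{}.fa".format(
        config["genomes"]["genome_root"], genome, genome
    )
    return [
        "module load {}".format(config["modules"]["aligner"]),
        "module load {}".format(config["modules"]["samtools"]),
        "align_sample --threads {} --ref {} {}".format(
            config["cluster"]["cores_per_node"], reference, sample
        ),
    ]


def submit_command(config, script_path):
    """Return the site-specific submission command for a job script."""
    return "{} {}".format(config["cluster"]["scheduler"], script_path)


site = load_site_config(SITE_INI)
job_lines = build_align_job(site, "sample_001")
print("\n".join(job_lines))
print(submit_command(site, "align_sample_001.sh"))
```

Under these assumptions, moving to another cluster means swapping in a different ini template; `build_align_job` and `submit_command` stay untouched, and the shared modules and genome paths are what make the generated job lines valid at each site.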