Next Generation Sequencing: An Introduction
Setia Pramana
TEIN4 WORKSHOP ON NEXT GENERATION SEQUENCING AND BIG DATA ANALYSIS
26-28 August 2015, UNIVERSITAS INDONESIA

Educational Background
• Universitas Brawijaya Malang, FMIPA, Statistics Department, 1995-1999.
• Hasselt Universiteit, Belgium, MSc in Applied Statistics, 2005-2006.
• Hasselt Universiteit, Belgium, MSc in Biostatistics, 2006-2007.
• Hasselt Universiteit, Belgium, PhD in Statistical Bioinformatics, 2007-2011.
• Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Sweden, Postdoctoral, 2011-2014.

Now?
• Head of Center for Methodology and Computational Statistics Studies, Sekolah Tinggi Ilmu Statistik, Jakarta.
• Adjunct Faculty at Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Stockholm.
• Adjunct Faculty at Faculty of Medicine, University of Indonesia, Jakarta.

What is Bioinformatics?

Bioinformatics
• Bioinformatics is a science straddling the domains of biomedicine, informatics, mathematics and statistics.
• It applies computational techniques to biological data.

What is Bioinformatics?
Bioinformatics is a multifaceted discipline combining many scientific fields including computational biology, statistics, mathematics, molecular biology and genetics (Fenstermacher, 2005, p. 440).
…from Bayat (2002), p 1018.

Medical Implications
• Pharmacogenomics
  – Not all drugs work on all patients; some good drugs cause death in some patients
  – So by doing a gene analysis before the treatment, the offending drugs can be avoided
  – Also, drugs which cause death to most can be used on a minority to whose genes that drug is well suited
  – Customized treatment
• Gene Therapy
  – Replace or supply the defective or missing gene
  – E.g.: Insulin and Factor VIII for Haemophilia
• BioWeapons (??)
Sequencing: from DNA to Genomes

DNA Sequencing
• DNA sequencing is a laboratory technique used to determine the exact sequence of bases (A, C, G, and T) in a DNA molecule.
• The DNA base sequence carries the information a cell needs to assemble protein and RNA molecules.
• This information is important for investigating the functions of genes.
• The technology was made faster and less expensive as a part of the Human Genome Project.
(Genome.gov)

The Birth of Sequencing
• Walter Gilbert and Allan Maxam (Harvard Univ.) reported in 1977 a technique to sequence DNA by chemical cleavage
• The same year, Frederick Sanger’s group (Cambridge Univ.) reported a technique based on DNA polymerase and dideoxynucleotides

DNA Sequencing

The Human Genome Project
• First draft genome of human in 2001, final 2004
• Estimated costs $3 billion, time 13 years
• Used Sanger Sequencing
• Today:
  – Illumina: 1 week, $9,500
  – Exome: 6 weeks*, $1,000
  – Towards the $1,000 genome

The Human Genome Project
• The draft sequence of the HGP was imperfect because of the incomplete coverage of many regions – a huge number of gaps
• The IHGSC published a ‘finished’ version of the human genome sequence in 2004 and the HGP was then deemed to be ‘complete’

The Human Genome Project
• This ‘finished’ version of the genome achieved almost complete coverage of all the regions and also significantly reduced the number of gaps to 341 from the initial hundreds of thousands
• Initiated a new era in the study of genetic variation and the functional characterization of the human genome

Next (second) Generation Sequencing
• New technologies allowing the massive production of tens of millions of short sequencing fragments.
Thus, it is also called: “Massively parallel sequencing”
• These techniques can be used to
  – deal with problems similar to those addressed by microarrays,
  – but also with many others.
• They raised the promise of personalized medicine

NGS
• The advent of high-throughput sequencing technologies has initiated the ‘personal genome sequencing’ era for both normal and cancer genomes
• Large-scale international projects such as the 1000 Genomes Project and the International Cancer Genome Consortium

NGS
• NGS technologies have been on the market only since 2004
• Have now largely replaced Sanger sequencing technologies (owing to their ultra-high throughput: hundreds of gigabases)
• Ability to simultaneously sequence millions of DNA fragments: massively parallel sequencing technologies

NGS
• Reduced sequencing costs significantly, making large-scale or whole genome sequencing (WGS) studies much more affordable

NGS Technologies/Platforms

Differences between platforms
• Run times vary from hours to days
• Production ranges from Mb to Gb
• Read length from <100 bp to >1500 bp
• Error rate per base from 0.1% to 15%
• Cost per base varies

NGS, General Procedure

NGS Process in different Platforms

Roche
Main applications:
• Microbial genomics and metagenomics
• Targeted resequencing

Roche 454 Sequencing

Roche Sequencing Process
• DNA Library Preparation
• emPCR (emulsion PCR)
• Sequencing

Roche Sequencing Process
• Fast run time: 23 h
• 500-700 bp long reads (75-150 bp for SOLiD and HiSeq)
• 700 Mbases from one run (180-300 Gb per flowcell using SOLiD or HiSeq)
• High run costs

ABI SOLiD
Features
• High accuracy due to two-base encoding
• True paired-end chemistry - ligation from
either end
• Mate-pair libraries
Main applications
• Whole genome, exome and targeted resequencing
• Transcriptome analyses
• Methylome and ChIP-Seq

Illumina (Solexa)

Illumina
Main applications:
• Whole genome, exome and targeted resequencing (HiSeq)
• Transcriptome analyses
• Methylome and ChIP-Seq
• Rapid targeted resequencing (MiSeq)

Ion Torrent
Main applications:
• Microbial and metagenomic sequencing
• Targeted resequencing
• Clinical sequencing

Ion Torrent Application
• Microbial sequencing
  – Accurate, fast de novo sequencing and resequencing of bacteria and viruses
• Mitochondrial sequencing
  – Deep sequencing for research, clinical, and forensic applications
• Amplicon sequencing
  – Detection of germline and somatic mutations
  – Bacterial and viral typing
• Resequencing by target enrichment
• Validation of whole genome and whole exome mutations
• Whole-transcriptome human RNA-Seq
• ChIP-Seq

Third Generation Sequencing

Pacific Biosciences
Single-Molecule, Real-Time DNA sequencing

Pacific Biosciences
• Yield: 50-200 Mb
• Read length: 2700 bp
• Error rate: 13%
• Cost: ~$500-2000/Gb ($100/run)

Pacific Biosciences
• Viral/microbial genome assembly
• Full-length RNA isoform identification
• Eukaryotic genome assembly
• Chemical modifications
• Clinical sequencing of amplicons or array-enriched DNA

Oxford Nanopore Technologies
• Just launched (2014)
• 150 bases/sec/pore
• Protein nanopores on silicon chip
• DNA measured as it is pulled through
• 125 Gb/day
• Reads of 20-100,000 bases
• 4% error rate
• Cost: $10/Gb

NGS Platforms

NGS Application
[Diagram of NGS applications: RNA-seq, Whole Genome Seq, Gene Regulation, Epigenetic, Exome Seq, Resequencing, Metagenomics]

NGS Application
• Whole genome re-sequencing
• Ancient genomes
• Metagenomics
• Cancer genomics
• Exome sequencing (targeted)
• RNA sequencing
• Chromatin immunoprecipitation (ChIP)-Seq: protein interaction with DNA
• Genomic epidemiology
• Epigenomics
• Human genetic variation: SNP, CNV (diseases)
• Anything with DNA

Sequencing Factory: Beijing Genomics Institute
• Purchased 128 HiSeq 2000 sequencers from Illumina in January 2010
• Each of them can produce 25 billion base pairs of sequence a day

NGS Application: Whole Genome Seq

NGS Application: Exome Genome Seq

NGS Application: RNA Sequencing

Bioinformatics Challenges of NGS

Sequencing has gotten Cheaper and Faster
Cost of one human genome
• HGP: $3 billion (13 yrs)
• 2004: $30,000,000
• 2008: $100,000
• 2010: $30,000
• 2011: $10,000
• 2012-13: $7,000
• 2014: $4,000 (~1 week)
• ???: $1,000

The Race for the $1,000 Genome

(Sequencing) Cost is Getting Cheaper
• Reduced sequencing costs significantly, making large-scale or WGS studies much more affordable

NGS Challenges

Huge Data Storage and HPC Demand

Generalized NGS Analysis

NGS Challenges
• The highest cost is (almost) not the sequencing but storage and analysis.
• A standard human (30-40x) whole genome sequencing run would create 100 Gb of data
• Extreme data size causes problems
  – Just transferring and storing the data
  – Standard comparisons fail (cost grows as N*N)
  – Standard tools cannot be used
• Think in fast and parallel programs

Bioinformatics Challenges of NGS
• Need for a large amount of CPU power
  - Informatics groups must manage compute clusters
  - Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  - Another level of software complexity and challenges to interoperability

Bioinformatics Challenges of NGS
• VERY large text files (~10 million lines long)
  - Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
  - Impossible memory usage and execution time
  - Impossible to browse for problems
• Need sequence quality filtering

Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  – 20 million reads (50 bp) ~1 Gb
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
  – Whole genome sequencing: a 30X coverage genome pair (tumor/normal) ~500 GB
  – 50 genome pairs ~25 TB

Data Management
• Primary data usually discarded soon after the run
• Secondary and tertiary data maintained on fast access disk during analysis, then moved to slower access disk afterward

Bioinformatics Challenges of NGS
• In NGS we have to process really big amounts of data, which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructures: not everyone can study everything. Small facilities must carefully choose projects that match their computing capabilities

Computational Infrastructure for NGS
We can start with:
- Computing cluster:
  Multiple nodes (servers), each with multiple cores
  High performance storage (TB, PB level)
  Fast networks (10 Gb Ethernet, InfiniBand)
- Enough space and conditions for the equipment ("server room")
- Skilled people (sys admins, developers)

Big Computing Infrastructure
• Distributed memory cluster
  – Starting at 20 computing nodes
  – 60 to 240 cores
  – At least 48 GB RAM per node
• Fast networks
  – 10 Gbit
  – InfiniBand
• Batch queue system (SGE, Condor, PBS, Slurm)
• Optional MPI and GPU environments depending on project requirements
• Starting at 200,000 € (hardware only)

Middle size infrastructure
• "Small" distributed file system (around 50 TB).
• "Small" cluster (around 10 nodes, 80 to 120 cores).
• At least gigabit Ethernet network.
• Price range: 50,000-100,000 € (hardware only)

Small Infrastructure
• Recommended at least 2 machines
  – 8 or 12 cores per machine
  – 48 GB RAM minimum per machine
  – BIG local disk: at least 4 TB per machine
  – As many local disks as we can afford
• Price range: starting at 8,000-10,000 € (2 machines)

Alternatives
• Cloud Computing
• Grid Computing

Swedish National Infrastructure for Large Scale DNA sequencing (SNISS)

UPPNEX
• UPPmax NEXt generation sequence cluster & storage
• Located at UPPMAX, the Uppsala Multidisciplinary Center for Advanced Computational Science
• Dedicated computer cluster (500 nodes)
• UPPNEX is serving over 240 projects and hosting over 800 TB of data

Interpretation Bottleneck

Big Collaboration
• Collaborative expertise (human intelligence and intuition) is required for meaning and interpretation (Bergeron 2002)
• Including on-demand communication & sharing of protocols, electronic resources, data, and findings among the stakeholders
• Collaboration with other Big Data sources: National Registers, BPJS, Hospitals, etc.

Next Generation Projects
• 1000 Genomes Project (to provide a comprehensive resource on human genetic variation)
• TCGA (The Cancer Genome Atlas)
• MalariaGEN: sequencing thousands of malaria isolates
• 1001 Genomes Project: Arabidopsis WGS
• UK10K: sequencing 10,000 healthy and disease-affected individuals
• Southeast Asia Mycobacterium tuberculosis complex (MTBC) DB: sequencing MTBC isolates
• Many more…
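To put projects of this scale against the infrastructure tiers described earlier, the following is a minimal sketch of the storage arithmetic. It uses only figures quoted in the slides (about 500 GB per 30X tumor/normal genome pair, at least 2 x 4 TB of local disk for the small setup, around 50 TB for the middle-size setup). The function names are hypothetical, decimal units are assumed, and space for intermediate files or backups is ignored, so real requirements would be higher.

```python
# Figures quoted in the slides; decimal units assumed (1 TB = 1000 GB).
GB_PER_GENOME_PAIR = 500  # 30X tumor/normal whole-genome pair, "~500 GB"

# Storage per tier in TB. The "big" tier is omitted because the slides
# give no storage figure for it.
TIER_STORAGE_TB = {
    "small": 2 * 4,  # 2 machines x at least 4 TB local disk each
    "middle": 50,    # "small" distributed file system, around 50 TB
}


def storage_needed_tb(n_pairs, gb_per_pair=GB_PER_GENOME_PAIR):
    """Raw storage in TB for n_pairs tumor/normal genome pairs."""
    return n_pairs * gb_per_pair / 1000


def smallest_tier_that_fits(n_pairs):
    """Name of the smallest tier whose disk holds the data, or None."""
    needed = storage_needed_tb(n_pairs)
    for name, capacity in sorted(TIER_STORAGE_TB.items(), key=lambda kv: kv[1]):
        if needed <= capacity:
            return name
    return None


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, "pairs:", storage_needed_tb(n), "TB ->", smallest_tier_that_fits(n))
```

With the slide's figures, 50 genome pairs need 25 TB, which exceeds the 8 TB minimum of the small setup but fits the middle-size one; 200 pairs (100 TB) fit neither.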
Collaboration Challenges
• Potential conflict between traditional silo researchers and those embracing Big Collaboration
• Compatible technologies and Cloud infrastructures
• IT management of groups with different tools, requirements and expectations
• Ownership of data
• Government regulations and policies
• Accessible data repositories and lack of transparency in findings
• Resources to support bioinformatics
• Patient privacy
(Frost & Sullivan)

Five Domains of Genomic Research
Green. 2011. Nature 470, 204-213

Summary
• Unraveling bioinformatics (big) data would enable the right decisions at the right time for the right patients.
• The problem is not producing data, but rather how to interpret them
• Bioinformatician is one of the sexiest jobs

Summary
• Challenges:
  – Still expensive
  – Lack of infrastructure (in developing countries)
  – Lack of skilled personnel in bioinformatics
  – Need for (large scale) collaborations
  – Integrating different technologies and systems
  – Making it all clinically relevant

References
• The Evolution of High-Throughput Sequencing Technologies: From Sanger to Single-Molecule Sequencing. Chee-Seng Ku, Yudi Pawitan, Mengchu Wu, Dimitrios H. Roukos, and David N. Cooper (2013)
• http://ueb.ir.vhebron.net/NGS
• http://cbs.dtu.dk/courses/27626/programme.php
• http://nestor.uppnex.se/twiki/bin/view/Courses/CM1209/Schedule

Thank you…
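As a follow-up to the "Need sequence quality filtering" and "VERY large text files" points under Bioinformatics Challenges of NGS, here is a minimal sketch of filtering reads one record at a time, so memory use stays flat regardless of file size. It is an illustration under stated assumptions, not a tool from the workshop: it assumes four-line FASTQ records, Phred+33 quality encoding, and an arbitrary mean-quality threshold of 20. All function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) one record at a time.

    Assumes exactly four lines per record; stops at end of file
    or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20):
    """Lazily yield only the reads whose mean quality passes the threshold."""
    for header, seq, qual in read_fastq(handle):
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


# Tiny in-memory example: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n")
kept = [header for header, _, _ in filter_reads(demo)]
print(kept)  # ['@read1']
```

Because both functions are generators, the same code can be pointed at an open file of any size; only one record is held in memory at a time.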