Introduction to Next Generation Sequencing

Overview
• Day 1: AM - Basic biology recap and Intro to NGS
• Day 1: PM - Intro to Data Analysis
  – Format(s), Quality checking, Trimming
• Day 2: AM - General procedures and strategies in NGS
• Day 2: PM - Exome sequence analysis practical (Galaxy)
• Day 3: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq)
• Day 3: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy)
Note: practical write-ups = assessment assignment

Overview
• Day 4: AM - NGS in the wild (case studies)
  – Clinical genomics
  – Human microbiome
• Day 4: PM - Candidate filtering and prioritization
  – Mostly SNP based
  – Little bit of functional and pathway enrichment analysis
• Day 5: AM - Knowledge-driven methods for finding ‘causative’ genes & wrap-up
• Day 5: PM - Free or wrap-up practical

Next Generation Sequencing
Day 1: Introduction
Full genome sequencing

Day 1 - Overview
• Central Dogma Review
• History of DNA Sequencing
• First Generation (Sanger) Sequencing
• Next Generation Sequencing Introduction
• NGS Opportunities and Challenges
• NGS Applications
• NGS Study Design and Technology Choice

History
• 1866: Gregor Mendel published the results of his investigations of the inheritance of "factors" in pea plants.
• 1869: DNA was first isolated by the Swiss physician Friedrich Miescher.
• 1950s: Maurice Wilkins (1916-2004), Rosalind Franklin (1920-1958), Francis Crick (1916-2004) and James Watson (1928- ) discovered the chemical structure of DNA.
• This started a new branch of science: molecular biology.
The Central Dogma of Molecular Biology
[Figure: the central dogma, including reverse transcription]

Structure of the DNA molecule
• DNA is shaped like a double helix
• It is like a spiral staircase
• Another way to think of it is a twisted ladder

Connecting the DNA molecule
• Rails of the DNA ladder are alternating sugars & phosphates
• Rungs are composed of pairs of bases
  – A bonds with T
  – G bonds with C

Connecting the DNA molecule
• The two strands of DNA are different
• One is called the sense strand and it is the plan to make a protein
• The other strand is the antisense strand

Connecting the DNA molecule
• The two strands of DNA are said to be antiparallel
• One strand (sense) is oriented in a 5’ to 3’ direction
• The other strand (antisense) is oriented in the opposite 3’ to 5’ direction

Replication of DNA
DNA sequencing exploits the physicochemical properties of DNA and the enzymes involved in its replication (more later…)

Introns and Exons
• Introns – non-coding sequences in the DNA that are NOT used to make a protein
• Exons – coding sequences in the DNA that are expressed or used to make mRNA and ultimately are used to make a protein

[Figure slides: Introns and Exons; Transcription; Translation]

Sanger Method
• Fred Sanger, 1958
• Was originally a protein chemist
• Made his first mark in sequencing proteins
• Made his second mark in sequencing RNA
• 1980: dideoxy sequencing

Sanger Method: Dideoxy Chain Termination
• 300-500 bases

Capillary Method - Fluorescent Dyes
• 800-1000 bases

Automated Sequencing
  – Leroy Hood developed fluorescent color labels for the 4 terminator nucleotide bases (late 80s).
  – This allowed all 4 bases to be sequenced in a single reaction and sorted in a single gel lane.
  – Hood also pioneered direct data collection by computer.
  – Improvements in this technology have now enabled sequencing of billion-base genomes in < 1 year.
• Automated sequencing machines use 4 colors, so they can read all 4 bases at once.
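As a small illustration of the base-pairing and antiparallel-strand slides above, the sketch below derives the antisense strand from a sense strand. The function names and the example sequence are made up for illustration; it assumes an uppercase or lowercase DNA string containing only A, C, G and T.

```python
# Watson-Crick pairing as stated on the slides: A bonds with T, G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq):
    """Return the opposite strand written 5' to 3'.

    The strands are antiparallel, so the complement must be reversed
    to read the other strand in its own 5' to 3' direction.
    """
    return complement(seq)[::-1]


sense = "ATGGCC"  # hypothetical example, written 5' to 3'
antisense = reverse_complement(sense)
print(sense, antisense)
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.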
Genome Sequencing
[Figure: Genome → Short fragments of DNA → Short DNA sequences → Sequenced genome]
Example from the figure: the short sequences ACGTGGTAA, CGTATACAC, TAGGCCATA, GTAATGGCG, CACCCTTAG, TGGCGTATA, CATA… correspond to the sequenced genome ACGTGGTAATGGCGTATACACCCTTAGGCCATA.

2001: The HGP consortium publishes its working draft in Nature (15 February), and Celera publishes its draft in Science (16 February).

Sequencing the Human Genome
[Figure: Log10(price) against Year, 2000-2010]
• 2001: Human Genome Project, 2.7G$, 11 years
• 2001: Celera, 100M$, 3 years
• 2007: 454, 1M$, 3 months
• 2008: ABI SOLiD, 60K$, 2 weeks
• 2009: Illumina, Helicos, 40-50K$
• 2010: 5K$, a few days?
• 2012: 100$, <24 hrs?

Sequence Database Size
[Figure: exponential data increase by year. NAR. 2007 September; 35(18): 6227–6237.]

History of DNA Sequencing
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
[Figure: timeline against Efficiency (bp/person/year); axis values 1, 15, 150, 1,500, 15,000, 25,000, 50,000, 200,000, 50,000,000, 100,000,000,000]
• 1870 Miescher: Discovers DNA
• 1940 Avery: Proposes DNA as ‘Genetic Material’
• 1953 Watson & Crick: Double Helix Structure of DNA
• 1965 Holley: Sequences Yeast tRNA-Ala
• 1970 Wu: Sequences Cohesive End DNA
• 1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
• 1980 Messing: M13 Cloning
• 1986 Hood et al.: Partial Automation
• 1990
  – Cycle Sequencing
  – Improved Sequencing Enzymes
  – Improved Fluorescent Detection Schemes
• 2002, 2008
  – Next Generation Sequencing
  – Improved enzymes and chemistry
  – Improved image processing

Sanger vs NGS
• ‘Sanger sequencing’ has been the only DNA sequencing method for 30 years but…
• …hunger for even greater sequencing throughput, at lower cost
• NGS has the ability to process millions of sequence reads in parallel rather than 96 at a time (at a small fraction of the cost)

Next
Generation Sequencing: Why Now?
• Motivation: HGP and its derivatives, personalized medicine
• Short-read applications: (re-)sequencing, other methods (e.g. gene expression)
• Advancements in technology

“Paradigm Shift”
• Standard ABI “Sanger” sequencing
  – 96 samples/day
  – Read length ~650 bp = 450,000 bases
• 454 was the game changer!
  – ~400,000 different templates (reads)/day
  – Read length ~250 bp
  – Total = 100,000,000 bases of sequence data!!!

Solexa ups the Game
• Solexa (Illumina GA)
  – 60,000,000 different sequence templates (yes that is an insane 60 million reads)
  – 36 bp read length (much longer now)
  – 4 billion bases of DNA per run (3 days)

Next Generation Sequencing
• 454 Life Sciences/Roche
  – Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
  – Published 1 million bases of Neanderthal DNA in 2006
  – May 2007: published complete genome of James Watson (3.2 billion bases, ~20x coverage)
• Solexa/Illumina
  – 10 GB per machine/week
  – May 2008: published complete genomes for 3 HapMap subjects (14x coverage)
• ABI SOLiD
  – 20 GB per machine/week

Nanotechnology
• Each system works differently, but they are all based on similar principles:
1. Shear target DNA into small pieces
2. Bind individual DNA molecules to a solid surface
3. Amplify each molecule into a cluster
4. Copy one base at a time and detect different signals for A, C, T, & G bases
5.
requires very precise high-resolution imaging of tiny features
• (Solexa has 800 images @ 4 megapixels each)

Sequencing by Synthesis (SBS)

Problem: Huge Amount of Image Data
• Raw image data huge: 1-2 TB for the Solexa, more for ABI SOLiD, less for 454
• The images are immediately processed into intensity data (spots w/ location and brightness)
• Intensity data is then processed into basecalls (A, C, T, or G plus a quality score for each)
• Basecall data is on the order of 5-10 GB per run (or a week of runs for 454)

From John McPherson, OICR

Next-gen sequencers
[Figure: bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)]
• AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb in 50-100 bp reads, >100M reads, 4-8 days
• 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours
• ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours

Adapted from John McPherson, OICR

2009/10
[Figure: bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)]
• AB SOLiDv3: 120 Gb, 100 bp reads
• Illumina HiSeq: 100 Gb, 150 bp reads
• 454 GS FLX Titanium: 0.4-0.6 Gb, 100-400 bp reads
• ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads

Stein Genome Biology 2010 11:207

Storage is becoming a real problem
Kahn, 2011, Science

Lower Cost = More Innovation
• As sequencing becomes cheaper, more investigators can use it for routine assays
• Leads to variations and absolutely novel applications

Lower Cost = More samples
• More patients in GWAS studies
• More replicates (or the use of some replicates and statistical approaches) in all other assays

Bioinformatics is the Bottleneck
• Sequencing is a commodity – can easily be outsourced
• Bioinformatics is the essential point of the science
  – Data analysis and discovery of meaning in results
• As the data throughput increases, the cost and time spent on analysis increase more than linearly

More Investigators = Less Informatics skill
• Sequencing is a
readout for many different types of laboratory experiments
• Clinical and basic science investigators from all areas of biology can make use of this technology
• Many are completely naïve about bioinformatics
• Informatics tools for NGS are very challenging

Challenging Bioinformatics Environment
• Very rapid change in technology platform
  – New file formats, new data types
  – Different “standards” from different vendors
• Very rapid evolution of new methods
• Very rapid ‘release’ of methods as ‘software’ via unsupported open source distribution
• Large data sizes (both experimental and reference)

The key
Automation, automation, automation…

454 Sequencing Overview
• Prepare library of single-stranded DNA, 200-500 bp long, and ligate adapters
• Perform emulsion PCR, amplifying a single DNA template molecule in each microreactor (bead)
• Sequence all clonally amplified sample fragments in parallel using pyrosequencing technology
• Analyze sequence results
  – CLEAN data
  – Align overlapping sequence of individual reads to define contigs (Shotgun)
  – Order and orient contigs, create scaffolds (Paired End)
  – Identify variants (Amplicon)
  – Determine gene expression patterns (Transcriptome)

Emulsion Based Clonal Amplification
[Figure labels: adapters A and B; adapter-carrying library DNA; + PCR reagents + emulsion oil; micro-reactors]
• Mix DNA library & capture beads (limited dilution)
• Create “water-in-oil” emulsion
• Perform emulsion PCR
• “Break micro-reactors” and isolate DNA-containing beads
• Generation of millions of clonally amplified sequencing templates on each bead
From: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )

Depositing DNA Beads into the PicoTiter™Plate
• Load beads into PicoTiter™Plate
• Load Enzyme Beads
[Figure label: 44 μm]
Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )

Reagent flow and image capture
[Figure labels: PicoTiterPlate Wells; Reagent Flow; Sequencing By Synthesis]
Photons Generated are Captured by Camera
Sequencing Image Created
Adapted from: Roche 454 James Grabeau 2007 (www.lsbi.mafes.msstate.edu/Roche%20454%20James%20Grabau%202007.ppt )

FLX Sequencing Reaction
www.roche-applied-science.com

Different Library Preparation Methods for Different Project Aims
• Shotgun Library Preparation for de novo sequencing or resequencing of genomic DNA or long PCR product. Align overlapping reads to define contigs.
• Paired End Library Preparation provides regions of sequence a known distance apart, allowing for ordering of contigs and analysis of genetic rearrangement.
• Amplicon Library Preparation for detection of rare variants.

Shotgun Library Preparation
• Create random DNA fragments, 300-800 bp, by nebulization with compressed N2
• Ligate universal adapters “A” and “B”
• Select for “A” – “B” fragments
• Remove second strand
• Attach to library beads via “B” adapter at calculated concentration to yield a single template molecule per library bead
• Proceed to emPCR
Images from: https://www.roche-applied-science.com/

Amplicon Library Preparation
• Target amplicon of 200-500 bp
  – 200 bp for uni-directional reads
  – 500 bp requires bi-directional reads
• Amplify using fusion primers that include template-specific primer and primers A and B
• Purify and quantify
• Proceed to emPCR

From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

Solexa/Illumina Sequencing: Fluorescently labeled Nucleotides (Solexa)
Complementary strand elongation: DNA Polymerase
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

Sequencing by Synthesis (SBS)
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4

Illumina (Solexa) Applications
Resequencing
• Characterise different related species or strains
Transcriptome analysis
• No chip/array required!
• Random priming of RNA
DNA methylation analysis
• Sequencing bisulfite-converted DNA
• Methylation-sensitive restriction digest enriched fragments
Examine chromatin modifications
• Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)
Computational Biology Research Group

454 vs Solexa
454
• Homopolymers (AAAAA..)
• Read length: 400 bp
• Number of reads: 400,000
• Per-base cost greater
• De novo assembly, metagenomics
Solexa
• Read length: 40 bp
• Number of reads: millions
• Per-base cost cheaper
• Ideal for applications requiring short reads: ncRNA

General ways of using the sequences:
• You assemble them and look at what you have.
• You map them (align against a known genome) and then look at what you have.
• Or a mixture of both!
• Sometimes you select the DNA you are sequencing, or you try to sequence everything.
• It depends on the biological question, the sequencing machine you have, and how much time and money you have.

Bioinformatics Tools
• Alignment of reads to reference genome
• Assembly of de novo sequence
• Quality Control & Base Calling
• Polymorphism detection
• Differential expression and splicing detection
• Genome browsing and annotation

Alignment of reads
• Reads generated from sequencing are mapped to a reference genome
• Conventional tools like BLAST or BLAT do not work well with short sequence reads
• Existing alignment algorithms have been modified to handle short reads
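The idea of mapping short reads against a known reference can be sketched with a toy seed-and-verify mapper. This is only an illustration under simplifying assumptions: the function names, the seed length `k`, and the mismatch limit are invented here, and the approach (a plain k-mer hash index, substitutions only, no indels) is not how production aligners work; tools such as BWA and Bowtie use compressed Burrows-Wheeler indexes instead. The reference string is the example genome from the Genome Sequencing slide.

```python
from collections import defaultdict

# Example genome taken from the "Genome Sequencing" slide.
REFERENCE = "ACGTGGTAATGGCGTATACACCCTTAGGCCATA"
K = 4  # seed length; an arbitrary choice for this toy example


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return 0-based reference positions where the read fits.

    The first k bases of the read act as an exact-match seed; each seed hit
    is then verified by counting mismatches over the full read length.
    A read with an error inside its seed will therefore be missed.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


INDEX = build_index(REFERENCE, K)
print(map_read("GTAATGGCG", REFERENCE, INDEX, K))  # exact read from the slide
print(map_read("GTAATGACG", REFERENCE, INDEX, K))  # same read with one substitution
```

Seeding on only the first `k` bases keeps the sketch short; real short-read mappers try several seeds per read, handle both strands, and report mapping qualities.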
Alignment Tools
• ELAND
• MAQ
• Mosaik
• SHRiMP
• SOAP
• BWA
• Bowtie
• NOVOALIGN (commercial)

Assembly
• De novo sequencing involves assembling overlapping reads to form a contiguous sequence of DNA
• Done in cases where there is no genomic information available

NGS Applications
• DNA mixtures from diverse ecosystems = metagenomics
• Identification of all mutations in an organism
• Deciphering a cell’s transcripts at sequence level without prior knowledge of the genome sequence
• ChIP-seq: protein-DNA interactions
• Epigenomics
• Detecting noncoding RNA (miRNA-Seq is BIG now)
• Human genetic variation: SNP, CNV (diseases)
• Ancient DNA
• Pooled sequencing

Take home message
Before you choose the analysis tools, choose your NGS technology wisely
AND
decide whether NGS is absolutely necessary

Where to get help/tips/clues