Genome Biology for Programmers Lecture Series: Illumina Sequencing
Chris Daum, JGI Illumina Group Lead
April 1, 2011

Outline
• Workflow Overview
• Process Science
  – Sample Prep & qPCR quantification
  – Cluster Generation
  – Sequencing
• Sequencer instruments: GA & HiSeq
• Illumina Developments
• Illumina quality & continuous improvement

Illumina Workflow
Sample Preparation → Sample Quantification → Clustering → Sequencing → Analysis

Sample Preparation
Library Preparation – Main Goals:
• Prepares sample nucleic acids for sequencing
• Many library types and creation procedures exist
• However, all preparation results in the same general template structure:
  – Double-stranded DNA flanked by two different adapters
  – Variables include:
    • Sequencing application & starting material (e.g. gDNA, mRNA, Mate Pair, Active Chromatin, ChIP-Seq)
    • Insert size
    • Adapter type
    • Index for multiplexing

Example Sample Prep Workflow: TruSeq Paired-end Library
[Figure: workflow panels labeled DNA and RNA]

Library Quantification - qPCR
• Real-time qPCR allows accurate quantification of DNA templates:
  – qPCR is based on the detection of a fluorescent reporter molecule whose signal increases as PCR product accumulates with each cycle of amplification
  – By using primers specific to the Illumina universal adapters in a qPCR reaction containing library template, only cluster-forming templates will be amplified and quantified

Library Quantification - qPCR
• Cq – Cycle of Quantification: the cycle at which an amplicon's fluorescence crosses the threshold
• Cycle Threshold Plot [figure]
• Standard curve: built from controls (Cq against log initial concentration) and used to determine the concentration of the library
• Phases of qPCR: Geometric phase – amplicons doubling every cycle; greatest precision & accuracy for quantitation

Take home: qPCR mimics what is happening on the surface of the flowcell during cluster generation and allows for determining optimal loading concentrations.
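
For programmers, the standard-curve step can be sketched in a few lines. This is a minimal illustration, not the JGI pipeline: it assumes a straight-line fit of Cq against log10 of the control concentrations, and the control values, the pM units, and the function names are all hypothetical. The efficiency formula 10^(-1/slope) - 1 is the usual textbook one; a slope near -3.32 corresponds to amplicons doubling every cycle.

```python
import math


def fit_standard_curve(concentrations, cq_values):
    """Least-squares fit of Cq = slope * log10(concentration) + intercept."""
    xs = [math.log10(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_cq(cq, slope, intercept):
    """Invert the standard curve to estimate a library's concentration."""
    return 10 ** ((cq - intercept) / slope)


def amplification_efficiency(slope):
    """Fraction of templates copied per cycle; 1.0 means perfect doubling."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical control dilution series (pM) and measured Cq values.
CONTROL_CONCENTRATIONS_PM = [10.0, 1.0, 0.1, 0.01]
CONTROL_CQ = [10.0, 13.32, 16.64, 19.96]

slope, intercept = fit_standard_curve(CONTROL_CONCENTRATIONS_PM, CONTROL_CQ)
library_pm = concentration_from_cq(14.98, slope, intercept)  # hypothetical library Cq
print(f"slope={slope:.2f} intercept={intercept:.2f} library={library_pm:.3f} pM")
```

A library with a lower Cq than a control crossed the threshold earlier, so it started with more cluster-forming template; the fit turns that ordering into a concentration that can be used to pick a loading dilution.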
Cluster Generation
• Process occurs on cBot instrument:
  – Aspirates DNA samples into flow cell
  – Automates the formation of amplified clonal clusters from the DNA single molecules
  – 1000x amplification generates clusters
  – Hybridizes sequencing primer(s)

Illumina cBot
• Cluster Generation 2.0
  – Automated system significantly reduces workload for generation of flowcells
  – Compact design saves lab space
  – Reagent cartridge reduces prep time

Flowcell Cluster Generation Prep
• Prepare reagents and denature & dilute library:
• The goal is to have the perfect cluster density to maximize yield (bp); this is achieved via optimized loading concentrations as determined by qPCR
• Considerations:
  – Too low density: Fewer clusters, less sequence generated
  – Too high density: Overlapping clusters, removed by analysis filters, poor quality

Cluster Generation Chemistry
• Cluster generation chemistry:
  – Hybridization
  – Amplification
  – Linearization
  – Blocking
  – Primer hybridization

Cluster Generation Chemistry
• Hybridize sample fragments & extend [figure]

Cluster Generation Chemistry
• Bridge Amplification [figure]

Cluster Generation Chemistry
• Linearization, Blocking & Sequencing Primer Hybridization [figure]

Sequencing
• Main Goals:
  – Translate the chemical information of the nucleotides into fluorescence information which can be captured optically
  – The optical information is then transformed into text, which can be searched, aligned, or otherwise mined for biologically relevant data

Sequencing Workflow
HiSeq Run Type      Approx. Run Days
1x50 Flowcells      2
2x100 Flowcells     9
2x150 Flowcells     13

Sequencing by Synthesis
• Clustered flowcell is loaded on Illumina sequencer [figure]

Sequencing Chemistry: First Cycle Base Incorporation
• To initiate the first sequencing cycle, add all 4 fluorescently labeled reversible terminators and DNA polymerase enzyme to the flowcell.
• The complementary nucleotide will be added to the first position of each cluster.
• A laser is then used to excite the attached fluorophore.
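
The "fluorescence into text" idea can be illustrated with a toy base caller. This is a deliberately naive sketch and not Illumina's base-calling algorithm: it assumes one intensity value per base channel per cycle and simply picks the brightest channel, whereas production software also corrects for effects such as channel crosstalk and phasing. The channel order, the intensity numbers, and the function names are hypothetical.

```python
# Assumed channel order for each per-cycle intensity tuple.
CHANNELS = "ACGT"


def call_base(intensities):
    """Return the base whose channel shows the highest intensity this cycle."""
    brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[brightest]


def call_read(cycles):
    """Call one base per sequencing cycle and join them into a read string."""
    return "".join(call_base(cycle) for cycle in cycles)


# Hypothetical intensities for one cluster over five cycles (A, C, G, T channels).
EXAMPLE_CYCLES = [
    (980.0, 45.0, 30.0, 60.0),
    (50.0, 40.0, 910.0, 75.0),
    (35.0, 870.0, 55.0, 40.0),
    (60.0, 30.0, 45.0, 940.0),
    (890.0, 65.0, 50.0, 35.0),
]

print(call_read(EXAMPLE_CYCLES))
```

Each cycle adds one reversible terminator per cluster, so the read length equals the number of imaging cycles, which is why longer run types take more days.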
Sequencing Chemistry: First Cycle Imaging
[figure]

Sequencing Chemistry: Cycle 2 and so on…
[figure]

Sequencing Read 2
• Resynthesis of second strand for Read 2 occurs on sequencer without removing flowcell:
Paired-End Sequencing: When performing a paired-end run, after the initial cycles (Read 1), an additional cluster generation is performed on the analyzer, and the template is sequenced in the opposite direction, as depicted in the figures below.

Index for Multiplex Sequencing
• Sample multiplexing involves 3 reads:
  – A: Sample Read 1 is sequenced
  – B: Read 1 product removed and Index Read is sequenced
  – C: Template strand used to generate complementary strand, and sample Read 2 is sequenced
• Analysis software identifies the index sequence from each cluster so that the sample reads 1 & 2 can be assigned to a single sample

Illumina HiSeq2000 Sequencer
Nifty Lights

HiSeq2000 Reagents
1 HiSeq = 2 GAs

HiSeq2000 Fluidics
Fluidics were the Achilles heel of the GA, and now 2X in the HiSeq

FY11 Service Metrics: Pareto
[Figure: Pareto chart of FY11 Service Requests; x-axis: Service Request Categories; y-axes: Incidents and Cumulative Percent]

HiSeq: Temperature control
• 3 mechanisms:
  – Heat extraction via liquid coolant
  – Flow cell temperature control via Peltier
  – Maintain reagent temperature via cooled compartment
Flow cell sits on Peltier blocks, and is water cooled (heat extraction from underneath)
Reagent Chiller:
• All reagents cooled at 4 °C
• Condensation pump runs every 4 min for 30 sec

[Figure-only slides: HiSeq Flowcell Loading; HiSeq Imaging; HiSeq Optics; HiSeq Lasers; HiSeq Software Interface; HiSeq – Real Time Metrics]

HiSeq vs GA Cost & Throughput Comparison

                           GAIIx                             HiSeq
Run Type                    1x36    2x36    2x76   2x150      1x50    2x50   2x100   2x150
Seq Prep Reagents $        2,292   4,012   4,012   4,012     2,442   3,747   3,747   3,737
Seq Reagents $               864   1,728   3,456   6,912     1,436   2,872   5,175   6,611
Seq Prep & Seq Total $     3,156   5,740   7,468  10,924     3,878   6,619   8,922  10,348
Avg. Bases (Gb)              8.0    19.1    35.9    70.4      20.8    41.6    83.3   124.9
Avg. Reads (Millions)      222.2   265.0   236.3   234.6     416.0   416.0   416.4   416.3
Cost per lane $              451     820   1,067   1,561       554     946   1,275   1,478
Cost per 1 Gb $              395     301     208     155       186     159     107      83
Cost per Million reads $      14      22      32      47         9      16      21      25

Notes:
• Throughput metrics are averages from runs performed in FY11 for each of the run types to date
• Italicized HiSeq Bases & Reads throughput metrics are estimates based on the 2x100 run type, since we have limited data on other run types
• Only vendor reagent costs are shown here; library creation and overhead costs are not included, but are roughly equal and are mostly independent of run type
• Cost per million reads goes up with the longer run types, but the read length increases as well, and this makes each read more valuable for some assembly applications
• HiSeq 2x150 run type is not yet supported, & the current HiSeq chemistry has worse quality beyond 80-100 bases compared to GA
• The HiSeq platform is still new and we are experiencing a higher number of hardware failures than GA; Illumina does replace reagents for failed runs and we rerun failed flowcells immediately whenever possible.
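
The two derived cost rows (cost per 1 Gb and cost per million reads) follow from the reagent totals and the average throughput figures. The sketch below recomputes them from the slide's own numbers; the class and field names are illustrative assumptions, and small differences from the slide can arise from rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunType:
    platform: str
    name: str
    total_cost_usd: float      # Seq Prep & Seq Total $
    avg_bases_gb: float        # Avg. Bases (Gb)
    avg_reads_millions: float  # Avg. Reads (Millions)

    def cost_per_gb(self) -> float:
        return self.total_cost_usd / self.avg_bases_gb

    def cost_per_million_reads(self) -> float:
        return self.total_cost_usd / self.avg_reads_millions


# Values copied from the cost and throughput comparison.
RUNS = [
    RunType("GAIIx", "1x36", 3156, 8.0, 222.2),
    RunType("GAIIx", "2x36", 5740, 19.1, 265.0),
    RunType("GAIIx", "2x76", 7468, 35.9, 236.3),
    RunType("GAIIx", "2x150", 10924, 70.4, 234.6),
    RunType("HiSeq", "1x50", 3878, 20.8, 416.0),
    RunType("HiSeq", "2x50", 6619, 41.6, 416.0),
    RunType("HiSeq", "2x100", 8922, 83.3, 416.4),
    RunType("HiSeq", "2x150", 10348, 124.9, 416.3),
]

for run in RUNS:
    print(f"{run.platform:5s} {run.name:6s} "
          f"${run.cost_per_gb():7.1f}/Gb  "
          f"${run.cost_per_million_reads():5.1f}/M reads")
```

Laid out this way, the trade-off in the notes is easy to see: longer run types lower the cost per Gb while raising the cost per million reads.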
HiSeq Development
Coming in early Summer:

Introducing MiSeq

MiSeq: all-in-one

MiSeq: Fast, low throughput

Providing Quality Sequence
• Incident Reporting & Resolution (JIRA)
• Troubleshooting Procedures
• Throughput Goals & Metrics
  [Figure: FY11 Cumulative Flowcells; y-axis: Flowcells; series: FC Cumulative, FC Goal; data labels: 489 and 163]
  [Figure: FY11 Cumulative Bases (Gb); y-axis: Total Bases (Gb); series: Bases Cumulative, Base Goal; data labels: 39696.000 and 7993.547]
• Continuous Improvement - Lean Six Sigma
• Failure Tracking & SPC Charts; RQC
• Instrument Status & real-time run monitoring
  [Figure: Illumina Process Metrics by week: Cluster & Run failure rates; y-axis: % of attempts that failed; series: Cluster Failures, Run failures; weeks 3-May through 2-Aug]
• Instrument Utilization & Efficiency
  [Figure: FY10 Q3 Illumina Utilization by instrument, Illumina01 through Illumina12; x-axis: 0% to 100%; Illumina01 and Illumina05 are marked with *]
  Problematic instruments with multiple run failures; 06 is being replaced & 07 had significant service work

LLNL – Six Sigma Training
• Tools and methodologies to:
  – Improve work quality
  – Improve process efficiencies & eliminate waste
  – Improve employee and customer satisfaction
• Lean Six Sigma is about:
  – Eliminating waste and improving process flow
  – Focusing on reducing variation and improving process yield by following a problem-solving approach using statistical tools

What is Six Sigma?
• A Six Sigma process is literally one that’s statistically 99.99966% successful.
• This is not always cost effective to achieve, so as a methodology it’s about gaining control of a process and implementing improvements.

What is Six Sigma?
• Six Sigma is a data-driven problem-solving approach where process inputs (Xs) are identified and optimized to impact the output (Y)

  Y = f(X)

• The output is a function of the inputs and process
  – Y: Output
  – f: function
  – X: variables that must be controlled to consistently predict Y
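
The 99.99966% figure quoted for a Six Sigma process can be checked numerically. The sketch below converts a success fraction into defects per million and into a sigma level. It assumes the conventional 1.5-sigma long-term shift used in Six Sigma tables, which the slides do not state, and the function names are illustrative.

```python
from statistics import NormalDist


def defects_per_million(success_fraction: float) -> float:
    """Expected failures per one million attempts."""
    return (1.0 - success_fraction) * 1_000_000


def sigma_level(success_fraction: float, shift: float = 1.5) -> float:
    """Sigma level for a given yield, assuming the conventional 1.5-sigma shift."""
    return NormalDist().inv_cdf(success_fraction) + shift


SIX_SIGMA_YIELD = 0.9999966  # 99.99966% successful

print(f"{defects_per_million(SIX_SIGMA_YIELD):.1f} defects per million")
print(f"sigma level: {sigma_level(SIX_SIGMA_YIELD):.2f}")
```

The same two functions can be applied to a measured run or cluster success rate to see how far a process sits from that target, which is the practical use the slides describe: gaining control of a process rather than literally reaching six sigma.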