Next-Generation Sequencing: Data, Methods, Analysis and Implications
Lecture I
January 31, 2011
Peter J. Park
Center for Biomedical Informatics, Harvard Medical School
Children’s Hospital Boston/Brigham and Women’s Hospital

Breakthroughs in Technology
• An essential tool in the molecular biology toolkit is the ability to read the base sequence of DNA molecules
• Rapid DNA sequencing in the 1970s
  – Sanger
  – Gilbert and Maxam
• Microarrays in late 1990s and 2000s
  – cDNA arrays
  – oligonucleotide arrays

Microarrays
[Figure: oligonucleotide arrays; cDNA arrays]
Images courtesy of Bioteach

Microarrays
• Extremely successful
• Popular applications: gene expression profiling, DNA copy number (comparative genomic hybridization), SNPs, microRNAs, ChIP-chip (tiling arrays), splicing (exon arrays)
Disadvantages:
• One must know the sequences to design the array
• Even if one knows the sequences, one cannot fit all of them in a small number of arrays
• High noise level due to cross-hybridization, non-linearity, etc.

What is Next-Generation Sequencing?
• One can sequence hundreds of millions of short sequences (35bp-100bp) in a single run
• Illumina/Solexa GA II / HiSeq 2000
• Life Technologies/Applied Biosystems SOLiD
• Roche/454 FLX, Titanium
• Helicos
• Pacific Biosciences
• Complete Genomics

Illumina Genome Analyzer
• 1 “flow cell” = 8 “lanes”
• 1 lane = ~10-30 million “reads”
• ~5-20 million “mapped reads”
• 36bp, 50bp, 75bp, 100bp
• Single-end (SE) or paired-end (PE)
• 1 lane: $800-$2000
• Multiplexing

Illumina: Sequencing-by-synthesis

Multiplexing
• We may not need to generate so many reads per sample
• Multiplexing: pool samples into a single lane of a flow cell
• Add a short “index” to tag libraries
• Current Illumina multiplexing kit
  – six-base oligos
  – currently 12 unique tags, giving 96 samples/run (12 tags x 8 lanes)
• Easy in theory but has not been easy in practice

Leading Platforms
• With 3730s, ~60Mb per year
• Specifications as of summer 2008

               454          Solexa/Illumina     SOLiD (ABI)
Bp per run     400 Mb       2-3 Gb              3-6 Gb
Read length    250-400 bp   35-50 (70-100) bp   35-50 bp
Run time       10 hr        2.5 days            5 days
Download       20 min       27 hr (44 min)      ~1 day
Analysis       2-5 hr       2 days              2-3 days
Files          20-50 Gb     1T                  1T

Latest Platforms
Illumina HiSeq:
• ~1 billion clusters
• 30x coverage of two human genomes in a single run
• ~$10K per sample?
• 1 x 35bp: ~1.5 days, ~30Gb
• 2 x 50bp: ~4 days, 75-100Gb
• 2 x 100bp: ~8 days, 150-200Gb
SOLiD 5500xl:
• With microbeads or nanobeads
• 20-45 Gb/day
• 12 lanes
• Similar run times as HiSeq
• Up to 180-300Gb per run

Rapid Decrease in Cost
• The Human Genome Project: 13 years and $3 billion
• Sequencing of the Watson genome by 454 in 2007: $2 million
• Illumina: eight days at a cost of about $10,000
• ~10^4-fold reduction in 5 yrs
• Claims: a genome in 15 minutes for $1000?
Source: The Economist

ABI SOLiD (Sequencing by Oligo Ligation/Detection)
• Clonal bead library via emulsion PCR
• The actual base detection is no longer done by the polymerase-driven incorporation of labeled dideoxy terminators.
• SOLiD uses a mixture of labeled oligonucleotides and queries the input strand with ligase.
• Each base is interrogated twice
  – built-in error checking capability that distinguishes between measurement errors and true polymorphisms
  – detection of more complicated variations

SOLiD Technology
Ligation-based chemistry with dibase-labelled probes
• Oligos:
  – Positions 1-2 (from 3’ side): one of 16 dinucleotides
  – Positions 3-5: degenerate (Ns)
  – Positions 6 to the 5’ end: degenerate, holding one of four fluorescent dyes
• 5-7 ligation reactions are followed by a reset cycle
• Next, a new initial primer is used that is N-1 in length

Working in “Colorspace”

Helicos
• True Single Molecule Sequencing™
• No amplification
• Very easy sample prep
• Sept 2009:
  – Nature Biotechnology: ‘Single-molecule sequencing of an individual human genome’
  – 24-70bp reads, 28x coverage
• Measuring a small amount of DNA (3-6ng) is difficult
• Alignment is tricky

Pacific Biosciences
• Single Molecule Real Time (SMRT) technology
• Each chip with waveguides
  – a 100-nm hole to watch DNA polymerase perform sequencing by synthesis; phospholinked nucleotides labeled with colored fluorophores are introduced
• Long reads, short run times, high quality
• 1000-1200bp reads (5% are 3-5 kb), fast and low cost per run
Eid et al, Science, 2009

Ion Torrent Personal Genome Machine
• When a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released
• A high-density array of wells (using semiconductor technology) with each well holding a different DNA template.
Beneath the well is an ion-sensitive layer and a sensor
• Sequentially floods the chip with one nucleotide after another
• If a match, a hydrogen ion is released and the change in the pH of the solution is detected
• 10Mb of “high-quality” sequence
• Runs in ~2 hours

Access to Platforms
• As with most new technologies, getting good data from a sequencer initially is not trivial
• This is especially the case if you have only one or two machines
• The situation has improved dramatically in the past 2-3 years as the technology has become more stable
• The cost ($500K-$800K) is still prohibitive for most universities
• NIH funded many machines through their “large instrumentation” program
• The big genome centers have a substantial advantage in technology development
• Future landscape?

Stock Prices

Data Analysis

Problems with NGS data
• Reads are short
  – difficult to assemble/map repetitive regions
• Not all sequences are equally likely to be sequenced
  – GC content
  – fragment length
• Amplification bias
• Sequencing errors
  – especially toward the end
• Variable quality/turn-around

Are Short Reads Useful?
• But a big problem with repetitive regions!
Francesco Ferrari

Error Rate
• Error rate is high in the first 1-2 bases
• It increases exponentially toward the end
Wang et al, Nature 456: 470, 2008
Kircher et al, Genome Biology 2009

Quality Score
• Each base position in a sequence comes with a “quality score”.
• This measures the probability that a base is called incorrectly, computed by a phred-like algorithm similar to that originally developed for Sanger sequencing experiments.
• The quality score of a given base, Q, is defined by Q = -10 * log10(e), where e is the estimated probability of the base call being wrong.
• A quality score of 20 represents an error rate of 1 in 100, with a corresponding call accuracy of 99%.
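The quality-score definition can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the ASCII offset used to decode a quality string is an assumption (33 is the Sanger convention; some Illumina pipeline versions used 64), so check which encoding your files use.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a quality score Q to an error probability e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(e: float) -> float:
    """Convert an error probability e to a quality score Q = -10 * log10(e)."""
    return -10 * math.log10(e)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string into integer scores.

    The offset is an assumption about the file encoding:
    33 for Sanger-style files, 64 for some older Illumina pipelines.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        e = phred_to_error_prob(q)
        print(f"Q={q}: error probability {e:.4f}, accuracy {100 * (1 - e):.2f}%")
```

Going from Q=20 to Q=30 lowers the error probability tenfold, from 1 in 100 to 1 in 1000, which is why a modest-looking drop in quality toward the end of a read matters.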

Quality Scores
• 100-bp reads
• 40 is the highest, 0 is lowest

Inter-laboratory variation
[Figure: quality score; panels: Illumina (internal), 100 bp reads; a large genome sequencing center]
• Data from the 1000 Genomes Project
• Different samples but same population
• Consistent across many samples
Francesco Ferrari

Data Generation Pipeline
Image Processing -> Base-calling -> Genome Alignment

Data format
• qseq.txt file

Data Management
• Raw data are large; to be kept for ~6 months?
• Processed data (e.g., BAM files) are manageable for most people: ~1GB for 20 million reads (50bp)
• Alignment is not a big issue for most investigators
• More of an issue for a facility: HiSeq recommends 32 CPU cores, 4 GB RAM each
• Whole-genome sequencing:
  – A 30X coverage genome pair (tumor/normal): ~500GB
  – 50 genome pairs: ~25TB
• “Why can’t I get a 1TB drive at Costco for $100?”
• We generally want high-performance, replicated storage
• At HMS, ~$700/TB/year; non-redundant storage: $200/TB

How To Transfer Data
• It is difficult to download data via http or ftp
• A commercial software/protocol is becoming popular (Aspera “next-generation file transport”)
• This can give 400-800Mbps

Genome Alignment
• One can specify how many mismatches are to be tolerated
• This can also be quantified by accounting for quality scores
• A typical criterion might be 1-2 mismatches for 36bp reads
• From the raw sequences, ~50-80% of the reads are typically aligned to the genome
  – sequencing errors
  – multiple matches in the genome
  – deviations from the reference genome (SNPs, insertions, etc.)
  – problems with the aligner
• This % of mapped reads is a good measure of data quality
• Often need to normalize using an “alignability map”

Genome Alignment
• Dynamic programming can be used to find the local alignments between a text T and a pattern P in O(|T||P|) time
• The genome is too big for this approach
• How to find an exact match?
  – Sort all 36mers in the reference genome
  – Search the sorted list in O(log N) steps
• The genome must be ‘indexed’
• A BWT (Burrows-Wheeler Transform) index for the human genome occupies only about 1 GB
• Exact matches are too stringent, so heuristic approaches are needed

Popular Aligners
• Generate a sorted list of genomic oligomers or a hash table
  – ELAND
  – MAQ
• Burrows-Wheeler Transform
  – Bowtie
  – BWA

Program   Algorithm       Long reads   Gapped   Paired-ends   Use of quality info
Bowtie    FM-index        No           No       Yes           Yes
BWA       FM-index        Yes          Yes      Yes           No
MAQ       hashing reads   No           Yes      Yes           Yes
Mosaik    hashing ref.    Yes          Yes      Yes           No

Name: Description
• BLAT: BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.
• Bowtie: Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
• BWA: Uses a Burrows-Wheeler transform to create an index of the genome. It is a bit slower than Bowtie but allows indels in alignment.
• ELAND: Implemented by Illumina. Includes ungapped alignment with a finite read length.
• GMAP and GSNAP: Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
• MAQ: Ungapped alignment that takes into account quality scores for each base.
• MOSAIK: Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
• RazerS: No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
• SHRiMP: Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
• SLIDER: An application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
• SOAP: Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12-letter hash table. SOAP2 is now much faster than the first version.
• SOCS: For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
• SSAHA: Fast for a small number of variants.
• Taipan: De novo assembler for Illumina reads.
based on http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

Applications of NGS
• If you build it, they will come!
• Whole-genome sequencing
• de novo genome assembly (much harder with shorter reads)
• Variant detection (mutations, SNPs, indels, copy number)
• Targeted resequencing (e.g., exons)
• ChIP-seq
  – Protein-DNA binding, histone modifications, nucleosomes
• Expression profiling:
  – RNA-seq
  – splicing variants
  – Digital expression profiling
• Small RNA sequencing
• and many more

Number of Publications in PubMed
Kahvejian, Quackenbush, Thompson, Nature Biotech 26:1125, 2008