Previous Lecture: Next-Generation DNA Sequencing Technology This Lecture NGS Alignment Slides are from Stratos Efstathiadis, Cole Trapneli, Steven Salzberg, Ben Langmead, Thomas Keane, Dennis Wall, Jianhua Ruan, J Fass Learning Objectives • • • • • • • • Challenge of NGS read alignment Suffix Arrays FM index BWA algorithm SAM/BAM format CIGAR string for variants Software to work directly with SAM/BAM files Sequence viewers Short Read Alignment • Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists – Approximate answer to: where in genome did the read originate? • What is “good”? For now, we concentrate on: – Fewer mismatches is better …TGATCATA… GATCAA – Failing to align a low-quality base is better than failing to align a …TGATATTA… GATcaT high-quality base – Match Uniqueness of the alignment better than …TGATCATA… GAGAAT better than …TGATcaTA… GTACAT Multiple mapping • A single tag may occur more than once in the reference genome. • The user may choose to ignore tags that appear more than n times. • As n gets large, you get more data, but also more noise in the data. Inexact matching ? • An observed tag may not exactly match any position in the reference genome. • Sometimes, the tag almost matches one or more positions. • Such mismatches may represent a SNP (single-nucleotide polymorphism, see wikipedia) or a bad read-out. • The user can specify the maximum number of mismatches, or a phredstyle quality score threshold. • As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags. % of Paired K-mers with Uniquely Assignable Location Read Length is Not As Important For Resequencing 100% 90% 80% 70% 60% E.COLI 50% HUMAN 40% 30% 20% 10% 0% 8 Jay Shendure 10 12 14 16 18 20 Length of K-mer Reads (bp) New alignment algorithms must address the requirements and characteristics of NGS reads – Hundreds of Millions of reads per run (30x genome coverage) – Short Reads (as short as 36bp) – Different types of reads (single-end, paired-end, mate-pair, etc.) – Base-calling quality factors (should the aligner use them?) – Sequencing errors ( ~ 1%) – Repetitive regions – Sequencing sample vs. reference genome – Must adjust to evolving sequencing technologies and data formats Index: an auxiliary data structure Two classes of indexing algorithms: (1) Hash tables (the “old” way) Hash of Reads (MAQ, ELAND, ZOOM, …) Smaller but variable memory requirements (depends on the amount of reads). Hash of Reference (SOAP, MOSAIK, … ) Predictable memory requirements. (2) Suffix arrays (the “new” way) BWA, Bowtie, SOAP2, … Indexing • Genomes and reads are too large for direct approaches like dynamic programming • Indexing is required Suffix tree Suffix array Seed hash tables Many variants, incl. spaced seeds • Choice of index is key to performance Suffix Array Find "ctat” in the reference Is the process reversible ? NGS Read Alignment Burrows Wheeler Transformation (BWT) • Invented by David Wheeler in 1983 (Bell Labs). Published in 1994. “A Block Sorting Lossless Data Compression Algorithm” Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment Corporation, Burrows M, Wheeler DJ. 1994 • Originally developed for compressing large files (bzip2, etc.) • Lossless, Fully Reversible • Alignment Tools based on BWT: bowtie, BWA, SOAP2, etc. • Approach: – Align reads on the transformed reference genome, using an efficient index (FM index) – Solve the simple problem first (align one character) and then build on that solution to solve a slightly harder problem (two characters) etc. • Results in great speed and efficiency gains (a few GigaByte of RAM for the entire H. Genome). Other approaches require tens of GigaBytes of memory and are much slower. FM-index Burrows-Wheeler Transform is the basis of the bzip2 file compression tool. B-W uses an FM-index (Ferragina & Manzini) which allows efficient finding of substring matches within compressed text – algorithm is sub-linear with respect to time and storage space required for a certain set of input data (reference genome) Reduced memory footprint, faster execution. Burrows-Wheeler • Store entire reference genome. • Align tag base by base from the end. • When tag is traversed, all active locations are reported. • If no match is found, then back up and try a substitution. Jianhua Ruan The University of Texas at San Antonio Burrows-Wheeler Transform (BWT) BWT acaacg$ $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac gc$aaac Burrows-Wheeler Matrix (BWM) Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac Burrows-Wheeler Matrix 3 1 4 2 5 6 $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac See the suffix array? Key observation a1c1a2a3c2g1$1 “last first (LF) mapping” The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column. 1$acaacg1 2aacg$ac1 1acaacg$1 3acg$aca2 1caacg$a1 2cg$acaa3 1g$acaac2 Exact match (another example) BWT(agcagcagact) = tgcc$ggaaaac gca gca Search for pattern: gca gca gca $agcagcagact $agcagcagact $agcagcagact $agcagcagact act$agcagcag act$agcagcag act$agcagcag act$agcagcag agact$agcagc agact$agcagc agact$agcagc agact$agcagc agcagact$agc agcagact$agc agcagact$agc agcagact$agc agcagcagact$ agcagcagact$ agcagcagact$ agcagcagact$ cagact$agcag cagact$agcag cagact$agcag cagact$agcag cagcagact$ag cagcagact$ag cagcagact$ag cagcagact$ag ct$agcagcaga ct$agcagcaga ct$agcagcaga ct$agcagcaga gact$agcagca gact$agcagca gact$agcagca gact$agcagca gcagact$agca gcagact$agca gcagact$agca gcagact$agca gcagcagact$a gcagcagact$a gcagcagact$a gcagcagact$a t$agcagcagac t$agcagcagac t$agcagcagac t$agcagcagac Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/ Exact Matching with FM Index • To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) – Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc FM Index is Small • Entire FM Index on DNA reference consists of: – BWT (same size as T) – Checkpoints (~15% size of T) – SA sample (~50% size of T) Assuming 2-bit-per-base encoding and no compression, as in Bowtie Assuming a 16-byte checkpoint every 448 characters, as in Bowtie Assuming Bowtie defaults for suffixarray sampling rate, etc • Total: ~1.65x the size of T ~1.65x >45x >15x >15x FASTQ Format: • The de-facto file format for sharing sequence read data • Sequence and a per-base quality score SAM (Sequence Alignment/Map) format: • A unified format for storing read alignments to a reference genome. • Generally large files (a byte per bp) • Very compact in size but computationally efficient to access. BAM (Binary Alignment/Map) format: • A Binary equivalent to SAM. • Developed for fast processing and indexing http://bioinformatics.oxfordjournals.org/cgi/reprint/btp352v1 Short-read mapping software Software Technique Developer License Eland Hashing reads Illumnia ? SOAP Hashing refs BGI Academic Maq Hashing reads Sanger (Li, Heng) GNUPL Bowtie BWT Salzberg/UMD BWA BWT Sanger (Li, Heng) GNUPL SOAP2 BWT & hashing BGI GNUPL Academic http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html FASTQ Files Sequence Id (Illumina) 36 bps read @HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79## 36 Quality scores • • • • • The de-facto file format for sharing DNA sequence read data 4 Lines per read Sequence line and a per-base Phred quality score line per read FASTQ Files are Text files There is No file Header An Introduction to Phred Quality Score 10 Q Phred is the Error Probability: The probability that a base call is wrong. 10 Q Phred 10 log 10 ( ) Q: Phred Quality Score Q Probability the base call in wrong (confidence) 40 0.0001 0.01% (99.99%) 30 0.001 0.1% (99.9%) 20 0.01 1% (99%) 10 0.1 10% (90%) • Phred Quality Score encoding in FASTQ/SAM files: ASCII Character = Q + 33 • FASTQ Files: Q represents Base Call Quality: Probability the base call is wrong. • SAM Files: Q represents Mapping Quality: Probability the mapping position of the read is incorrect. $perl –e ‘print chr(33);’ http://en.wikipedia.org/wiki/FASTQ_format The SAM file SAM data fields Mapping Quality (MAPQ) in BWA Mapping Quality is a function of Edit Distance and the Uniqueness of the alignment. BWA Mapping Quality 0 A read aligns equally well to multiple positions (hits). BWA picks randomly one of the positions and assigns MAPQ=0 1 – 23 25 Only 1 Best hit (with no suboptimal hits) with more than 2 mismatches. Or Only 1 Best hit, with 1 suboptimal hit. 37 Only 1 Best hit (no suboptimal hits), with up to 2 mismatches (edit distance could be more than 2) SAM/BAM format @HD @SQ @RG @RG Header section VN:1.0 SN:chr20 AS:HG18 LN:62435964 ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891 ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891 position of alignment Alignment section Query Name V00-HWI-EAS132:3:38:959:2035#0 V00-HWI-EAS132:4:99:122:772#0 V00-HWI-EAS132:4:44:473:970#0 V00-HWI-EAS132:4:29:113:1934#0 Ref sequence 147 177 25 99 chr1 chr1 chr1 chr1 query sequence (same strand as ref) 28 32 40 44 255 255 255 255 36M 36M 36M 36M = = * = 79 9127 0 107 0 0 0 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC query quality aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1 NM:i:1 NM:i:2 NM:i:1 Post-processing: Tools and programming APIs for parsing and manipulating alignments: Samtools: http://samtools.sourceforge.net/ Convert SAM to BAM and vice versa Sort and Index BAM files Merge multiple BAM files Show alignments in text viewer Remove Duplicates from PCR amplification step Picard Tools: (Java-based) http://picard.sourceforge.net/index.shtml BAM is Indexed & Binary Compressed SAM • BAM is indexed by genome location. • Software toolkit allows other software to extract sequence data from BAM at specific genome – do not need to store entire data file in memory during operation of program! SAMTools/Picard http://samtools.sourceforge.net/ • SAMTools is a simple toolkit to transform SAM to BAM, merge, sort, index • Can also calculate statistics (mean quality, depth of coverage, etc.) • filter duplicate reads • create multiple alignments of all reads over a genomic interval • call variants SAMTools commands • Download and install SAMTools http://sourceforge.net/projects/samtools/files/samtools/ Make the BAM file: samtools view –bt ref.fa –o data.bam data.sam Sort the BAM file: samtools sort data.bam data.sorted.bam Index the BAM file: samtools index data.sorted.bam data.st_index.bam http://www.htslib.org/doc/sam.html http://samtools.sourceforge.net/samtools-c.shtml Manual Reference Pages - samtools (1) DESCRIPTION • Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly. SYNOPSIS samtools samtools samtools samtools samtools samtools samtools samtools samtools samtools bcftools bcftools bcftools view -bt ref_list.txt -o aln.bam aln.sam.gz sort aln.bam aln.sorted index aln.sorted.bam idxstats aln.sorted.bam view aln.sorted.bam chr2:20,100,000-20,200,000 merge out.bam in1.bam in2.bam in3.bam faidx ref.fasta pileup -vcf ref.fasta aln.sorted.bam mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam tview aln.sorted.bam ref.fasta index in.bcf view in.bcf chr2:100-200 > out.vcf view -vc in.bcf > out.vcf 2> out.afs Heng Li from the Sanger Institute wrote the C version of samtools. Picard (Java) Toolkit • • • • • • • • • • • • • • • • • • • • • • • • • • • • AddCommentsToBam AddOrReplaceReadGroups BamToBfq BamIndexStats BedToIntervalList BuildBamIndex CalculateHsMetrics CleanSam CollectAlignmentSummaryMetrics CollectBaseDistributionByCycle CollectGcBiasMetrics CollectInsertSizeMetrics CollectMultipleMetrics CollectTargetedPcrMetrics CollectRnaSeqMetrics CollectWgsMetrics CompareSAMs CreateSequenceDictionary DownsampleSam ExtractIlluminaBarcodes EstimateLibraryComplexity FastqToSam FifoBuffer FilterSamReads FilterVcf FixMateInformation GatherBamFiles GenotypeConcordance • • • • • • • • • • • • • • • • • • • • • • • • • • • • • IlluminaBasecallsToFastq IlluminaBasecallsToSam CheckIlluminaDirectory IntervalListTools MakeSitesOnlyVcf MarkDuplicates MarkDuplicatesWithMateCigar MeanQualityByCycle MergeBamAlignment MergeSamFiles MergeVcfs NormalizeFasta ExtractSequences QualityScoreDistribution ReorderSam ReplaceSamHeader RevertSam RevertOriginalBaseQualitiesAndAddMateCigar SamFormatConverter SamToFastq SortSam SortVcf UpdateVcfSequenceDictionary VcfFormatConverter MarkIlluminaAdapters SplitVcfs ValidateSamFile ViewSam VcfToIntervalList Visualization • BAM file format also simplifies visualization of NGS data. • Must make a .BAI index for each BAM file using samtools index command. • BAM and .BAI files must be located on your own computer • Index allows viewer to quickly find and load only read data for a specific genomic interval. • Integrative Genomics Viewer (IGV) from Broad Institute is our current favorite. • Use the Java Webstart to download and run IGV (use the 1.2 GB version): http://www.broadinstitute.org/igv/ Summary • • • • • • • • Challenge of NGS read allignment Suffix Array FM index BWA algorithm SAM/BAM format CIGAR string for variants Software to work directly with SAM/BAM files Sequence viewers Next Lecture: ChIP-seq