Read Processing and Mapping: From Raw to Analysis-ready Reads
Ben Passarelli
Quake Lab NGS Workshop
May 30, 2014

From Raw to Analysis-ready Reads
Raw reads → Read assessment and prep → Mapping → Duplicate marking → Local realignment → Base quality recalibration → Analysis-ready reads

Session Topics
• Brief overview of high-throughput sequencing platforms
• Understand read data formats and quality scores
• Identify and fix some common read data problems
• Find a genomic reference for mapping
• Map reads to a reference genome
• Understand alignment output
• Sort, merge, index alignment for further analysis
• Mark/eliminate duplicate reads
• Locally realign at indels
• Recalibrate base quality scores
• How to get started

Sequencing Platforms at a Glance

Illumina Sequencing Platforms
Features             MiSeq      NextSeq 500   HiSeq 2500
# Flowcells          1          1             2
# Sample Mixes       1          1             16
# Clusters / Run     25M        400M          3200M
Max Read Length      2x300      2x150         2x100
Gb / Run             15         120           640
Run Time             55 hours   30 hours      12 days
Reagent Cost / Gb    $79        $32           $36

Single Cell Analysis Toolset
• Built on R Statistics Package
• Differential gene expression analysis and visualization
• PCA
• Unsupervised clustering
• ANOVA (statistical hypothesis testing)

Sample to Raw Reads
• Sample Preparation (C1 Single Cell)
  o Capture
  o Imaging / Lysis
  o Amplification of DNA / cDNA
• QC and Quantification (AATI Fragment Analyzer)
  o Evaluate and quantitate harvested C1 DNA products
• Library Construction
• Sequencing (NextSeq 500)
  o 300M or 800M reads in ~24 hours
• Raw Reads

Solid Phase Amplification
Library DNA binds to oligos immobilized on the glass flowcell surface

Sequencing Steps
• Clusters are linearized
• Sequencing primer annealed
• All labeled dNTPs added at each cycle
• Intensity of the different tags determines the base call
• Error profile: substitutions

Instrument Output
Vendor                Instruments             Native output
Illumina              MiSeq, NextSeq, HiSeq   Base call file (.bcl)
LifeTech              PGM, Proton             Standard flowgram file (.sff)
Pacific Biosciences   RS                      Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
Oxford Nanopore       MinION                  Squiggle (.h5)
All of these feed into: Sequence Data (FASTQ Format)

FASTQ Format (Illumina Example)
  @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
  CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
  +
  BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
  @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
  AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
  +
  @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
  @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
  CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
  +
  CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
  @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
  GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
  +
  CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Each read record has four lines:
• Header, beginning with "@". In the first record: flow cell ID (D17DBACXX), lane (2), tile (1101), tile coordinates (12432:5554), barcode (AGTCAA)
• Read bases
• Separator "+" (with optional repeated header)
• Read quality scores

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.

Base Call Quality: Phred Quality Scores
Phred* quality score Q with base-calling error probability P:
  Q = -10 log10(P)
* Name of first program to assign accurate base quality scores. From the Human Genome Project.

Q score   Probability of base error   Base confidence   Sanger-encoded (Q score + 33) ASCII character
10        0.1                         90%               “+”
20        0.01                        99%               “5”
30        0.001                       99.9%             “?”
40        0.0001                      99.99%            “I”

Quality score encodings (printable ASCII characters 33 to 126):
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Encoding            Offset     Q range   ASCII range
S - Sanger          Phred+33   0 to 40   33 to 73
I - Illumina 1.3+   Phred+64   0 to 40   64 to 104
L - Illumina 1.8+   Phred+33   0 to 41   33 to 74

Initial Read Assessment and Processing
Common problems that can affect analysis:
• Low confidence base calls
  o typically toward ends of reads
  o criteria vary by application
• Presence of adapter sequence in reads
  o poor fragment size selection
  o protocol execution or artifacts
• Over-abundant sequence duplicates
• Library contamination

Quick Read Assessment: FastQC
Free download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
Samples reads (200K default): fast, low resource use

Read Assessment Example (Cont’d)
• Trim for base quality or adapters (run or library issue)
• Trim leading bases (library artifact)

Read Assessment Example (Cont’d)
TruSeq Adapter, Index 9
  5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Comprehensive Read Assessment: Prinseq
http://prinseq.sourceforge.net/

Selected Tools to Process Reads
Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/ (partial list)
• FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
• FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise)
• FASTQ Clipper: Removing sequencing adapters
• FASTQ Quality Filter: Filters sequences based on quality
• FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
• FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
* Defaults to old Illumina FASTQ (ASCII offset 64). Use the -Q33 option.
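
The ASCII offset matters because the same quality character means different scores under Phred+33 and Phred+64. The following is a minimal sketch in Python of decoding quality strings and applying Q = -10 log10(P); the function names are illustrative assumptions and are not part of the Fastx toolkit or any other tool listed here.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the base-calling error probability."""
    return 10 ** (-q / 10)


def reencode_quality(quality_string, from_offset=64, to_offset=33):
    """Re-encode a quality string, e.g. old Illumina (Phred+64) to Phred+33."""
    return "".join(chr(ord(ch) - from_offset + to_offset) for ch in quality_string)


# The four characters from the Phred table, read as Phred+33.
table_scores = decode_quality("+5?I")
table_probs = [error_probability(q) for q in table_scores]

# First ten quality characters of the first read in the FASTQ example.
first_read_scores = decode_quality("BCCFFFDFHH")

# 'h' (ASCII 104) is the top score in Phred+64; re-encode it to Phred+33.
converted = reencode_quality("h")

print(table_scores)
print(table_probs)
print(first_read_scores)
print(converted)
```

Decoding a Phred+64 file with offset 33 would inflate every score by 31, which is why tools that assume the old offset need to be told otherwise.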
SeqPrep https://github.com/jstjohn/SeqPrep
• Adapter trimming
• Merge overlapping paired-end reads

Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html (for Python programmers)
• Especially useful for implementing custom/complex sequence analysis/manipulation

Galaxy http://galaxy.psu.edu
• Great for beginners: upload data, point and click
• Just about everything you’ll see in today’s presentations

SolexaQA2 http://solexaqa.sourceforge.net
• Dynamic trimming
• Length sorting (resembles read grouping of Prinseq)

Many Analysis Pipelines Start with Read Mapping
Example pipelines shown: Genotyping/Haplotyping, Gene Expression, Tumor/Normal Comparison
Sources:
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
https://www.broadinstitute.org/gatk/guide/best-practices
http://www.appistry.com/sites/all/themes/appistry/files/pdfs/CGAS_download.pdf

Read Mapping
http://www.broadinstitute.org/igv/

Sequence References and Annotations
• http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
  http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
  o Comprehensive reference information
• http://hgdownload.cse.ucsc.edu/downloads.html
  o Comprehensive reference, annotation, and translation information
• ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
  o Reference and SNP information for GATK
  o Human only
• http://cufflinks.cbcb.umd.edu/igenomes.html
  o Pre-indexed references and gene annotations for Tuxedo suite
  o Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast
• http://www.repeatmasker.org

Fasta Sequence Format
• One or more sequences per file
• “>” denotes beginning of sequence or contig
• Subsequent lines up to the next “>” define sequence
• Lowercase base denotes repeat masked base
• Contig ID may have comments delimited by “|”

  >chr1
  …
  TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
  agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
  attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
  aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
  taaccatatttgggatatactggattttaaaaatatatcactaatttcat
  …
  >chr2
  …
  >chr3
  …

Read Mapping
Aligner comparison, one entry per aligner:

Novoalign (3.0)
  License: Commercial
  Mismatches allowed: up to 8
  Alignments reported per read: random/all/none
  Gapped alignment: up to 7bp
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ end
  Comments: At one time, best performance and alignment quality

SOAP3 (0.01 beta)
  License: GPL v3
  Mismatches allowed: up to 3
  Alignments reported per read: random/all/none
  Gapped alignment: 1-3bp gap
  Paired-end reads: yes
  Best alignment: minimal number of mismatches
  Trim bases: 3’ end
  Comments: Can use nVIDIA CUDA (GPU)

BWA (0.7.8)
  License: GPL v3
  Mismatches allowed: user specified; max is function of read length and error rate
  Alignments reported per read: user selected
  Gapped alignment: yes
  Paired-end reads: yes
  Best alignment: minimal number of mismatches
  Trim bases: 3’ and 5’ end
  Comments: Element of Broad’s “best practices” genotyping workflow

Bowtie2 (2.2.2)
  License: Artistic
  Mismatches allowed: user specified
  Alignments reported per read: user selected
  Gapped alignment: yes
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ and 5’ end
  Comments: Smith-Waterman quality alignments, currently fastest

Tophat2 (2.0.11)
  License: Artistic
  Mismatches allowed: uses Bowtie2
  Alignments reported per read: uses Bowtie2
  Gapped alignment: yes; splice junctions, introns
  Paired-end reads: yes
  Best alignment: uses Bowtie2
  Trim bases: uses Bowtie2
  Comments: Currently most popular RNA-seq aligner

STAR (2.3.0e)
  License: GPL v3
  Mismatches allowed: user specified
  Alignments reported per read: user selected
  Gapped alignment: yes; splice junctions, introns
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ and 5’ end
  Comments: Very fast; uses memory to achieve performance

Read Mapping: BWA
BWA Features
• Uses Burrows Wheeler Transform
  o fast
  o modest memory footprint (<4GB)
• Accurate
• Tolerates base mismatches
  o increased sensitivity
  o reduces allele bias
• Gapped alignment for both single- and paired-end reads
• Automatically adjusts parameters based on read lengths and error rates
• Native BAM/SAM output (the de facto standard)
• Large installed base, well-supported
• Open-source (no charge)

Read Mapping: Bowtie2
Bowtie2
• Uses dynamic programming (edit distance scoring)
  o Eliminates need for realignment around indels
  o Can be tuned for different sequencing technologies
• Multi-seed search - adjustable sensitivity
• Input read length limited only by available memory
• Fasta or Fastq input
• Caveats
  o Longer input reads require much more memory
  o Trade-off parallelism with memory requirement
Figure: Dynamic programming illustration
http://bowtie-bio.sourceforge.net/bowtie2
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2, Nature Methods. 2012, 9:357-359

SAM (BAM) Format
Sequence Alignment/Map format
• Universal standard
• Human-readable (SAM) and compact (BAM) forms
• Superset of FASTQ
Structure
• Header: version, sort order, reference sequences, read groups, program/processing history
• Alignment records

SAM/BAM Format: Header
Use samtools to view the BAM header:
  [benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
  @HD VN:1.0 GO:none SO:coordinate
  @SQ SN:chrM LN:16571
  @SQ SN:chr1 LN:249250621
  @SQ SN:chr2 LN:243199373
  @SQ SN:chr3 LN:198022430
  …
  @SQ SN:chr19 LN:59128983
  @SQ SN:chr20 LN:63025520
  @SQ SN:chr21 LN:48129895
  @SQ SN:chr22 LN:51304566
  @SQ SN:chrX LN:155270560
  @SQ SN:chrY LN:59373566
  …
  @RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
  @RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
  @RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
  @RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
  @RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
  …
  @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
  @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.li
  @PG ID:bwa PN:bwa VN:0.6.2-r126
Header line types:
• @HD: header line, including sort order (SO)
• @SQ: reference sequence names with lengths
• @RG: read groups with platform, library and sample information
• @PG: program (analysis) history

SAM/BAM Format: Alignment Records
  [benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
  HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191
  HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023
Columns 1 to 11 are the mandatory SAM fields: 1 read name, 2 flag, 3 reference name, 4 position, 5 mapping quality, 6 CIGAR, 7 mate reference name, 8 mate position, 9 template length, 10 read bases, 11 base qualities. Optional tags (here RG) follow.
* Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
http://samtools.sourceforge.net/SAM1.pdf

Compression is Big Win for HTS Data
Chart: compression ratio (improvement) for 33.8M 100bp Illumina reads

Preparing for Next Steps
• Subsequent steps require sorted and indexed BAMs
  o Sort orders: karyotypic, lexicographical
  o Indexing improves analysis performance
• Picard tools: fast, portable, free
  http://picard.sourceforge.net/command-line-overview.shtml
  o Sort: SortSam.jar
  o Merge: MergeSamFiles.jar
  o Index: BuildBamIndex.jar
• Order: sort, merge (optional), index

Duplicate Marking
  $ java -Xmx4g -jar <path to picard>/MarkDuplicates.jar \
      INPUT=aligned.sorted.bam \
      OUTPUT=aligned.sorted.dedup.bam \
      VALIDATION_STRINGENCY=LENIENT \
      METRICS_FILE=aligned.dedup.metrics.txt \
      REMOVE_DUPLICATES=false \
      ASSUME_SORTED=true
http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

SAM/BAM Format: Alignment Records
  [benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
  HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191
The second field (147 here) is the bitwise FLAG; duplicate status is recorded there.
http://picard.sourceforge.net/explain-flags.html
http://samtools.sourceforge.net/SAM1.pdf

Local Realignment
• BWT-based alignment is fast for matching reads to reference
• Individual base alignments often sub-optimal at indels

Approach
• Fast read mapping with BWT-based aligner
• Realign reads at indel sites using gold standard (but much slower) Smith-Waterman algorithm [1]

Benefits
• Refines location of indels
• Reduces erroneous SNP calls
• Very high alignment accuracy in significantly less time, with fewer resources

[1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238

Local Realignment
Figure panel: Raw BWA alignment
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8.
PMID: 21478889
Figure panel: Post re-alignment at indels

Base Quality Recalibration
STEP 1: Find covariates at non-dbSNP sites using:
• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencer properties)

  java -Xmx4g -jar GenomeAnalysisTK.jar \
      -T BaseRecalibrator \
      -I alignment.bam \
      -R hg19/ucsc.hg19.fasta \
      -knownSites hg19/dbsnp_135.hg19.vcf \
      -o alignment.recal_data.grp

STEP 2: Generate BAM with recalibrated base scores:

  java -Xmx4g -jar GenomeAnalysisTK.jar \
      -T PrintReads \
      -R hg19/ucsc.hg19.fasta \
      -I alignment.bam \
      -BQSR alignment.recal_data.grp \
      -o alignment.recalibrated.bam

Is there an easier way to get started?!
http://galaxyproject.org/
Click on “Use Galaxy”

Getting Started
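
For working from the command line rather than Galaxy, the post-mapping commands shown earlier can be assembled in run order. The following is a minimal sketch in Python that only builds and prints the command lines; it does not run them. The function name, the "/path/to/picard" directory, and the derived output file names are assumptions for illustration. It also assumes the duplicate-marked BAM is passed straight to recalibration, because the slides give no command for the local realignment step.

```python
import shlex


def build_post_mapping_commands(sorted_bam, reference, known_sites,
                                picard_dir, gatk_jar="GenomeAnalysisTK.jar"):
    """Return the post-mapping commands as argument lists, in run order."""
    # Derive output names from the input name (an assumed naming scheme).
    stem = sorted_bam[:-len(".bam")] if sorted_bam.endswith(".bam") else sorted_bam
    dedup_bam = stem + ".dedup.bam"
    recal_table = stem + ".recal_data.grp"
    recal_bam = stem + ".recalibrated.bam"
    java = ["java", "-Xmx4g", "-jar"]

    # Picard MarkDuplicates, with the options used on the Duplicate Marking slide.
    mark_duplicates = java + [
        picard_dir + "/MarkDuplicates.jar",
        "INPUT=" + sorted_bam,
        "OUTPUT=" + dedup_bam,
        "VALIDATION_STRINGENCY=LENIENT",
        "METRICS_FILE=" + stem + ".dedup.metrics.txt",
        "REMOVE_DUPLICATES=false",
        "ASSUME_SORTED=true",
    ]
    # GATK STEP 1: compute the recalibration table.
    base_recalibrator = java + [
        gatk_jar, "-T", "BaseRecalibrator",
        "-I", dedup_bam, "-R", reference,
        "-knownSites", known_sites, "-o", recal_table,
    ]
    # GATK STEP 2: write a BAM with recalibrated base scores.
    print_reads = java + [
        gatk_jar, "-T", "PrintReads",
        "-R", reference, "-I", dedup_bam,
        "-BQSR", recal_table, "-o", recal_bam,
    ]
    return [mark_duplicates, base_recalibrator, print_reads]


commands = build_post_mapping_commands(
    "aligned.sorted.bam", "hg19/ucsc.hg19.fasta",
    "hg19/dbsnp_135.hg19.vcf", picard_dir="/path/to/picard")

for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping each command as an argument list means it could later be handed to `subprocess.run(command, check=True)` without shell quoting issues. Printing first makes it easy to check the commands against the slides before running anything.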