Resequencing Genome
Timothee Cezard
EBI NGS workshop, 16/10/2012

NGS Course – Data Flow Overview
[Diagram of the course modules and their presenters]
Topics: DNA sequencing; resequencing & assembly; gene regulation; ChIP-seq analysis; sequence archives; ENA/SRA submission and retrieval; data compression; genome variation & disease; RNA sequencing; gene annotation; gene expression; RNA-Seq Ensembl gene build; RNA-Seq transcriptome analysis.
Presenters: Karim Gharbi, Timothee Cezard, Elizabeth Murchison, Remco Loos, Myrto Kostadima, Guy Cochrane, Jon Teague, Adam Butler, Simon Forbes, Laura Clarke, Rajesh Radhakrishnan, Rasko Leinonen, Arnaud Oisel, Marc Rossello, Vadim Zalunin, Ensembl/John Collins.
Slides and tutorials are available at:
https://www.wiki.ed.ac.uk/display/GenePoolExternal/NGS+workshop+16.10.2012+at+EBI

Overview
• DNA (Re)sequencing
    • Sequencing technologies
    • Sequencing output
    • Quality control
• Mapping
    • Mapping programs
    • SAM/BAM format
    • Mapping improvements
• Variant calling
    • Types of variants
    • SNPs/indels
    • VCF format

Resequencing genomes
[Workflow diagram: DNA extraction → library prep → sequencing data]
Example reads: GATGGGAAGA GCGGTTCAGC AGGAATGCCG AGACCGATAT CGTATGCCGT

Sequence data
• Precise
• Fairly unbiased
• Easy to QC

Coverage depth data
• Can be biased
• Hard to know what's true

Sequencer specific errors
• Homopolymer runs create false indels
• Specific sequence patterns can create phasing issues

Sequencing output (Fastq format)
Example fastq record:
@ILLUMINA06_0016:6:1:5388:12733#0
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%%%%%
Each record has four lines: the read identifier (starting with @), the sequence, a separator line (+), and one ASCII-encoded quality character per base.

Quality control
Questions you should ask (yourself or your sequencing provider):
• Sequencing QC:
    • How much sequencing?
    • What's the sequencing quality?
• Library QC:
    • What's the base profile across the reads?
    • Is there an unexpected GC bias?
    • Are there any library preparation contaminants?
• Post mapping QC:
    • What is the fragment length distribution? (for paired end)
    • Is there an unexpected duplicate rate?
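As a rough illustration of the "What's the sequencing quality?" question, the sketch below reads fastq records and turns the quality string into Phred scores. It assumes Phred+33 encoding, which is consistent with the '%' characters in the example record (they would fall below the valid range of a Phred+64 encoding). The names FastqRecord and read_fastq are hypothetical, and the toy record is a shortened, made-up read rather than the full record from the slide.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO

PHRED_OFFSET = 33  # assumption: Sanger-style Phred+33 quality encoding


@dataclass
class FastqRecord:
    name: str
    sequence: str
    quality: str

    def phred_scores(self) -> List[int]:
        """Decode each quality character into its Phred score."""
        return [ord(char) - PHRED_OFFSET for char in self.quality]

    def gc_fraction(self) -> float:
        """Fraction of G and C bases in the read (0.0 for an empty read)."""
        if not self.sequence:
            return 0.0
        gc = sum(1 for base in self.sequence.upper() if base in "GC")
        return gc / len(self.sequence)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed fastq record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical toy record: 10 bases with high qualities that drop at the 3' end.
toy_fastq = "@toy_read_1\nGATGGGAAGA\n+\nCCCCDDB;%%\n"
records = list(read_fastq(io.StringIO(toy_fastq)))
record = records[0]
scores = record.phred_scores()
mean_quality = sum(scores) / len(scores)
print(record.name, scores, round(mean_quality, 1), record.gc_fraction())
```

Under this encoding 'C' decodes to Q34 and '%' to Q4, so a run of '%' at the end of a read marks low-confidence bases. Tools such as FastQC report the same kind of per-position summary across all reads.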
Example with FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Mapping reads to a reference genome
Problems:
• How to find the best match of a short sequence on a large genome (high sensitivity)
• How to not find a match when there is none
• How to do this for 100,000,000,000 reads in a reasonable amount of time
Solution:
• Hashing based algorithms: BLAST, Eland, MAQ, SHRiMP, GSNAP, Stampy
    • More sensitive in the presence of SNPs/indels
• Suffix trie + Burrows-Wheeler Transform algorithms: Bowtie, SOAP, BWA
    • Faster

Different software for different applications
• Transcriptome to genome: GSNAP, TopHat
• Mapping to distant reference: Stampy, SHRiMP
• Very fast mapping: Bowtie, BWA
Other mappers shown: Genomatix, BWA-SW, SplitSeek, CLC bio, mrFAST, mrsFAST, SMALT, SSAHA2, Partek
Common data flow: Fastq → Mapper → SAM/BAM

SAM/BAM format
SAM: Sequence Alignment/Map format v1.4, The SAM Format Specification Working Group (Sept 2011)
http://samtools.sourceforge.net/SAM1.pdf
• Standardized format for alignment
• BAM: binary equivalent of SAM
• BAM can be indexed for fast record retrieval
• Manipulate SAM/BAM files using samtools and others
2 parts:
• Header: contains metadata about the sample
• Alignment: 11 mandatory columns plus optional tags

SAM/BAM format
Example alignment record:
1     2   3    4   5   6   7  8  9    10        11        12
R001  83  ref  37  30  9M  =  7  -39  CAGCGCAT  CAGCGCAT  TAG

COLUMNS:
Col  Field  Type    Description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next fragment
8    PNEXT  Int     Position of the mate/next fragment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  fragment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33

Bitwise flag
83 = 1010011 in binary format
Bit    Integer  Description
0x1    1        template having multiple segments in sequencing
0x2    2        each segment properly aligned according to the aligner
0x4    4        segment unmapped
0x8    8        next segment in the template unmapped
0x10   16       SEQ being reverse complemented
0x20   32       SEQ of the next segment in the template being reversed
0x40   64       the first segment in the template
0x80   128      the last segment in the template
0x100  256      secondary alignment
0x200  512      not passing quality controls
0x400  1024     PCR or optical duplicate
83 = 64 + 16 + 2 + 1: first segment in the template, reverse complemented, properly aligned, template with multiple segments.
http://picard.sourceforge.net/explain-flags.html

CIGAR alignment
Op  Description
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch

Ref:   AGGTCCATGGACCTG
       || ||||X|||||||
Query: AG-TCCACGGACCTG
CIGAR: 2M1D12M or 2=1D4=1X7=

Ref:   CTTATGTGATC
       |||||||||||
Query: CTTATGTGATCCCTG
CIGAR: 11M4S

Mapping enhancement
Each read is mapped independently, so knowledge can be borrowed from neighbouring reads to improve the mapping.
Picard Marking Duplicates: a read pair is a duplicate when two or more read pairs have the same coordinates.
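The flag and CIGAR conventions of the SAM format described above can be checked with a few lines of code. This is a minimal sketch, not how samtools or Picard implement it: the function names decode_flag, parse_cigar and cigar_lengths are hypothetical, and the flag descriptions are copied from the bitwise flag table.

```python
import re
from typing import Dict, List, Tuple

# Bit values and descriptions as listed in the bitwise flag table.
SAM_FLAGS: Dict[int, str] = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reversed",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing quality controls",
    0x400: "PCR or optical duplicate",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def decode_flag(flag: int) -> List[str]:
    """Return the description of every bit set in a SAM flag."""
    return [text for bit, text in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    operations = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return operations


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (bases covered on the reference, bases used from the read)."""
    operations = parse_cigar(cigar)
    ref_len = sum(n for n, op in operations if op in CONSUMES_REFERENCE)
    query_len = sum(n for n, op in operations if op in CONSUMES_QUERY)
    return ref_len, query_len


for description in decode_flag(83):
    print(description)
print(cigar_lengths("2M1D12M"))
print(cigar_lengths("2=1D4=1X7="))
```

Both notations of the first CIGAR example cover 15 reference bases with 14 read bases, which matches the one-base deletion in the alignment. Duplicate marking only sets the 0x400 bit; the read stays in the file.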
Samtools BAQ: hidden Markov model that downweights mismatches when they are close to an indel.
GATK indel realignment: takes all reads around a potential indel and performs a more sensitive alignment.
GATK base recalibration: looks at several pieces of contextual information, such as position in the read or dinucleotide composition, to identify covariates of sequencing errors.

Indel realignment
AACAATATCTATGGA/TTTCG/TTTTG

The whole pipeline
Raw data → Alignment → Realignment → Mark duplicates → Base recalibration → ? → Final bam file(s)
Final bam file(s) →
• SNPs/indels calling
• CNV calling
• Structural variant calling
• Pool analysis

SNPs and indels calling
                          Samtools mpileup + bcftools   GATK UnifiedGenotyper
Algorithm                 Bayesian based                Bayesian based
Multiple samples calling  yes                           yes
Input                     bam file(s)                   bam file(s)
Output                    vcf file                      vcf file
Runtime                   Rather fast                   Slow but multithreaded
Multi-allelic             Up to 2 alleles               3 by default

VCF format
Variant format designed for the 1000 Genomes Project
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

VCF format
Header: defines the optional fields
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Variants:
• 8 mandatory columns describing the variant
• 1 column defining the genotype format
• 1 column per sample describing the genotype for that variant in that sample

HEADER
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
DATA
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline tumor
chr4 27668 . T C 8.65 . DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr4 27669 . G T 4.77 . DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr4 27712 . T C 44 . DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8
chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr4 36523 . A T 10.4 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4

VCF format
SNPs
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO                     FORMAT          germline
chr4    27668  .   T    C    8.65  .       DP=2;AF1=1;AC1=4;…       GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr4    27669  .   G    T    4.77  .       DP=2;AF1=1;AC1=4;…       GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr4    27712  .   T    C    44    .       DP=2;AF1=1;AC1=4;…       GT:PL:DP:SP:GQ  1/1:40,3,0:1:0:8
chr4    27774  .   G    A    5.47  .       DP=2;AF1=0.5011;AC1=2;…  GT:PL:DP:SP:GQ  0/1:34,0,23:2:0:28
chr4    36523  .   A    T    10.4  .       DP=1;AF1=1;AC1=4;…       GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3

Column   Meaning
#CHROM   Chromosome name
POS      Position
ID       SNP identifier
REF      Reference base
ALT      Alternate base(s)
QUAL     SNP quality
FILTER   Filtering reasons
INFO     SNP information
FORMAT   Genotype format
sample   Genotype information

Variant Filtering
• Depth of coverage: a confident het call needs 10X-20X
• SNP quality depends on the caller: 30-50
• Genotype quality: 20
• Strand bias
• Biological interpretation
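To tie the VCF columns to the filtering criteria, here is a minimal sketch that parses two of the data lines shown earlier and reports which thresholds each call misses for one sample. The function names parse_record and failed_filters are hypothetical, and the cut-offs simply take the lower end of the ranges quoted in the filtering list; they are starting points rather than recommended values. Strand bias and biological interpretation are not covered.

```python
from typing import Dict, List

# Thresholds taken from the variant filtering list (lower end of each range).
MIN_QUAL = 30.0   # SNP quality, caller dependent (30-50)
MIN_DEPTH = 10    # per-sample depth for a confident het call (10X-20X)
MIN_GQ = 20       # genotype quality


def parse_record(line: str, sample_names: List[str]) -> Dict:
    """Split one VCF data line into fixed columns, INFO and per-sample fields."""
    # Assumption: whitespace splitting is enough for this example;
    # real VCF files are tab-delimited and QUAL may be "." (not handled here).
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value  # flag-type entries such as INDEL get ""
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref,
        "alt": alt.split(","), "qual": float(qual), "filter": filt,
        "info": info_dict, "samples": samples,
    }


def failed_filters(record: Dict, sample: str) -> List[str]:
    """List the thresholds that a call fails for the given sample."""
    reasons = []
    genotype = record["samples"][sample]
    if record["qual"] < MIN_QUAL:
        reasons.append("low QUAL")
    if int(genotype["DP"]) < MIN_DEPTH:
        reasons.append("low depth")
    if int(genotype["GQ"]) < MIN_GQ:
        reasons.append("low GQ")
    return reasons


sample_names = ["germline", "tumor"]  # from the #CHROM header line
lines = [
    "chr4 27712 . T C 44 . DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 "
    "GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8",
    "chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;"
    "PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3",
]
parsed = [parse_record(line, sample_names) for line in lines]
for record in parsed:
    print(record["chrom"], record["pos"], failed_filters(record, "germline"))
```

With these cut-offs neither example call would pass: the depth is only 1 or 2 reads per sample, which illustrates why depth of coverage comes first in the filtering list.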