CBI NGS Workshop
Lesson 4: The Genome Analysis Toolkit (GATK)
Liu Huan (刘欢)
Center for Bioinformatics, Peking University
2011-05-30

Outline
- Basic Concepts
- Overview of Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Basic Concepts
Single-nucleotide polymorphism (SNP)
- a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a biological species, or between paired chromosomes in an individual
- e.g. two DNA fragments from different individuals, AAGCCTA vs. AAGCTTA (two alleles)
- almost all common SNPs have only two alleles
- within a population, each SNP can be assigned a minor allele frequency

Basic Concepts
Indel: an insertion or a deletion
- e.g. reference: ATGG
  deletion:  AT_G  (one base deleted)
  insertion: ATCGG (one base inserted)

Basic Concepts
Copy number variation (CNV)
- a form of structural variation: alterations of the DNA of a genome that result in the cell having an abnormal number of copies of one or more sections of the DNA
- CNVs correspond to relatively large regions of the genome that have been deleted (fewer than the normal number of copies) or duplicated (more than the normal number) on certain chromosomes
- e.g. normal chromosome structure: A-B-C-D
  CNV 1: A-B-C-C-D (a duplication of "C")
  CNV 2: A-B-D (a deletion of "C")
- this kind of variation accounts for roughly 12% of human genomic DNA
- each variant may range from about one kilobase (1,000 nucleotide bases) to several megabases in size

Outline
- Basic Concepts
- Overview of Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Framework for variation discovery and genotyping from next-generation DNA sequencing
Phase 1:
- raw read data with platform-dependent biases were transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin, and aligned consistently with respect to one another; mapping algorithms placed reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM reference file format
- molecular duplicates were eliminated
- initial alignments were refined by local realignment, and then an empirically accurate per-base error model was determined

Framework for variation discovery and genotyping from next-generation DNA sequencing
Phase 2:
- the analysis-ready SAM/BAM files were analyzed to discover all sites with statistical evidence for an alternate allele present among the samples, including SNPs, short indels and copy number variations (CNVs)

Framework for variation discovery and genotyping from next-generation DNA sequencing
Phase 3:
- technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium (LD), and family and population structure were integrated with the raw variant calls from Phase 2 to separate true polymorphic sites from machine artifacts; at these sites, high-quality genotypes were determined for all samples
Outline
- Basic Concepts
- Overview of Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

GATK architecture: MapReduce
MapReduce:
- parallel computation in two steps: subdivide a large problem into many discrete, independent pieces, which are fed to the map function; a reduce function then joins the map results back into a final product
- subdividing the work also enables load balancing
Examples:
- SNP discovery: dominated by the map function
- ChIP-seq peak calling: dominated by the reduce function

GATK architecture
Traversals
- provide the division and preparation of the data
Walkers
- analysis modules that provide the map and reduce methods which consume the data
GATK provides a nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools.

Traversal Types in GATK
"By each sequencer read" (read-based) and "by every read covering each single base position in a genome" (locus-based)
- the standard methods of accessing data for most analyses
- e.g. counting reads, building base quality histograms, reporting average coverage of sequencer reads over the genome, calling SNPs

Read-based Traversals
- a read-based traversal presents the analysis walker with each read individually, passing each read once and only once to the walker's map function
- along with the sequencer read, the walker is presented with the reference bases that the read overlaps
- useful for analyzing read quality scores, alignment scores, and merging reads from multiple BAM files

Locus-based Traversals
- a locus-based traversal presents the analysis walker with all the associated genomic data: all the reads that span the genomic location, all reference-ordered data, and the reference base at the specific locus in the genome
- each of these single-base loci is passed to the walker's map function
- e.g. depth of coverage calculation, variant analysis

Depth of Coverage Walker in GATK
Depth of coverage:
- important in CNV discovery, SNP calling, and other downstream analyses
The Depth of Coverage walker in GATK:
- at each site the walker receives a list of the reads covering the reference base and emits the size of the pileup
- the end user can optionally exclude reads of low mapping quality, or apply other read-filtering criteria
- can also be provided with a list of regions in which to calculate coverage, summing the average coverage over each region
- can also be used to quantify sequencing results over complex or highly variable regions, e.g. the major histocompatibility complex (MHC)

Outline
- Basic Concepts
- Overview of Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Data Processing Pipeline of GATK
- initial mapping
- refinement of the initial reads
- multi-sample indel and SNP calling
- filtering of the raw SNP calls
- finally, variant quality score recalibration
(a command-line sketch of these steps follows below)
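As a rough illustration of the pipeline steps just listed, the sketch below strings together the 2011-era tools discussed in this lesson (BWA, Picard, and the GATK walkers). All file names are hypothetical, and exact flag spellings varied across GATK 1.x versions, so treat this as an outline rather than a verbatim recipe:

  # 1. initial mapping (Illumina reads, BWA), then sort and index
  bwa aln ref.fa reads.fq > reads.sai
  bwa samse ref.fa reads.sai reads.fq > sample1.sam
  samtools view -bS sample1.sam > sample1.bam
  samtools sort sample1.bam sample1.sorted
  samtools index sample1.sorted.bam

  # 2. refinement: remove duplicates, realign around indels, recalibrate qualities
  java -jar MarkDuplicates.jar INPUT=sample1.sorted.bam \
      OUTPUT=sample1.dedup.bam METRICS_FILE=dup.metrics
  java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa \
      -I sample1.dedup.bam -o targets.intervals
  java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa \
      -I sample1.dedup.bam -targetIntervals targets.intervals -o sample1.realigned.bam
  # CountCovariates also needs a file of known variant sites (e.g. dbSNP);
  # the binding syntax for it changed between GATK versions, so it is omitted here
  java -jar GenomeAnalysisTK.jar -T CountCovariates -R ref.fa \
      -I sample1.realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate \
      -cov CycleCovariate -cov DinucCovariate -recalFile recal.csv
  java -jar GenomeAnalysisTK.jar -T TableRecalibration -R ref.fa \
      -I sample1.realigned.bam -recalFile recal.csv -o sample1.recal.bam

  # 3. multi-sample SNP/indel calling (one -I per sample in the cohort)
  java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref.fa \
      -I sample1.recal.bam -I sample2.recal.bam -o raw.vcf

  # 4./5. filtering and variant quality score recalibration: see the later slides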
Reference Genome for GATK
- hg19 is not supported; b37 is used instead, to keep up to date with dbSNP and the 1000 Genomes Project data files
Resource download: the GSA FTP server
- location: ftp.broadinstitute.org
- username: gsapubftp-anonymous
- password: <blank>

Raw Data Processing
- raw FASTQ files are mapped with an NGS read aligner
- for Illumina data, BWA is recommended: accurate, fast, well-supported, open-source, and emits BAM files natively

Raw BAM to realigned, recalibrated BAM
Purpose of realignment
- locally realign reads such that the number of mismatching bases is minimized across all the reads
- in general, a large fraction of the regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual's genome with respect to the reference genome; such alignment artifacts result in many bases mismatching the reference near the misalignment, and these are easily mistaken for SNPs
Two steps of realignment
- Step 1: determine (small) suspicious intervals that are likely in need of realignment
- Step 2: run the realigner over those intervals

Raw BAM to realigned, recalibrated BAM
Two types of realignment
- Realignment only at known sites
  - very efficient; can operate with little coverage
  - can only realign reads at known indels
- Fully local realignment
  - uses mismatching bases to determine whether a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment
  - much slower (involves a Smith-Waterman step)
  - can discover new indel sites in the reads

Raw BAM to realigned, recalibrated BAM
Purpose of base quality recalibration
- after recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome
- the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context
- the result: more accurate quality scores
Base quality recalibration works by analyzing the covariation among several features of a base, e.g.
- the reported quality score
- the position within the read
- the preceding and current nucleotide observed by the sequencing machine
- the probability of mismatching the reference genome
These covariates are used to recalibrate the quality scores of all reads in a BAM file.
Recommendation: lane-level recalibration, sample-level realignment.

Initial variant discovery and genotyping
Input BAMs for variant discovery and genotyping
- you should already have a single realigned, recalibrated, de-duplicated BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort
Multi-sample SNP and indel calling
- apply the UnifiedGenotyper to identify variant sites among the cohort samples; this produces a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort
- note: by default the UnifiedGenotyper calls SNPs only; to enable indel calling instead, use the -glm DINDEL argument

Initial variant discovery and genotyping
Selecting an appropriate quality score threshold
- a common question is what confidence score threshold to use for variant detection
- recommendations:
  - deep data (> 10x coverage per sample): a minimum confidence score threshold of Q30 with an emission threshold of Q10; calls between Q10 and Q30 are emitted but flagged as LowQual
  - shallow data (< 10x coverage per sample): a minimum confidence score of Q4 and an emission threshold of Q3, since variants necessarily have lower quality with shallower coverage
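For deep data, the thresholds above would be expressed on the UnifiedGenotyper command line roughly as follows. This is a sketch: sampleX.bam, human_b37.fasta and raw.snps.vcf are placeholder names, and -stand_call_conf / -stand_emit_conf are the calling/emission confidence flags from the GATK documentation of this era.

  # deep-coverage (>10x) thresholds: call at Q30, emit down to Q10;
  # calls between Q10 and Q30 are emitted but flagged as LowQual
  java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
      -R human_b37.fasta \
      -I sample1.bam -I sample2.bam \
      -stand_call_conf 30.0 \
      -stand_emit_conf 10.0 \
      -o raw.snps.vcf
  # to call indels instead of SNPs (per the slide above), add: -glm DINDEL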
Initial variant discovery and genotyping: protocol
VCF (Variant Call Format)
- a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations
- usually stored in compressed form, and can be indexed for fast retrieval of variants from a range of positions on the reference genome
- VCFtools: a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and more
- http://vcftools.sourceforge.net

Integrating analyses: getting the best call set possible
Problems with the raw VCF file
- the raw VCF will contain many sites that are not really genetic variants but machine artifacts that make the site statistically non-reference
- the false-positive machine artifacts should be separated from the true-positive genetic variants!
Tools:
- VariantFiltrationWalker: apply hard filters
- variant quality score recalibration: build an adaptive error model using known variant sites, then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact
Recommendation: regardless of whether you will ultimately apply hard filtering or adaptive error modeling to select your final calls, first apply some common SNP filters to avoid obvious misalignment and indel artifacts.

Integrating analyses: getting the best call set possible
Analysis-ready VCF protocol:

Integrating analyses: getting the best call set possible
Basic indel filtering
- purpose: remove alignment artifacts from the data
- method: flag variants with high strand bias, and variants in poorly mapped regions (the HARD_TO_VALIDATE set) with more than 10% of the reads having mapping quality 0
- arguments for VariantFiltrationWalker:

Integrating analyses: getting the best call set possible
Basic SNP filtering
- purpose: remove alignment artifacts from the data
- method: flag SNPs within clusters (3 SNPs within 10 bp of each other) and SNPs in poorly mapped regions (the HARD_TO_VALIDATE set) with more than 10% of the reads having mapping quality 0
- arguments for VariantFiltrationWalker (an example command is sketched at the end of this section):

Integrating analyses: getting the best call set possible
Filtering around indels
- purpose: it is possible that, despite even local realignment, misalignments around true and artifactual indels will result in some false SNP calls; these errors are quite common if you did not do local realignment, did not provide a set of known indels during local realignment, or around very large indels that cannot be modeled properly by local realignment
- method: perform indel calling, then filter your SNP calls around the raw indel calls from your data set
- arguments for VariantFiltrationWalker:

Integrating analyses: getting the best call set possible
Making analysis-ready SNP calls with hard filtering
- GATK-recommended hard filtering: arguments for VariantFiltrationWalker:

Integrating analyses: getting the best call set possible
Making analysis-ready calls with variant quality score recalibration
- newly developed; an alternative approach to hard filtering
- method: assign a well-calibrated probability to each variant call in a call set; one can then create highly accurate call sets by filtering on this single estimate of the accuracy of each call
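As an illustration of the basic SNP filtering described above, a VariantFiltration command of this era looked roughly like the following. This is a sketch: the MQ0-based HARD_TO_VALIDATE expression follows the wording of these slides, the file names are placeholders, and the syntax for binding the input VCF changed across GATK 1.x versions (older releases used -B:variant,VCF instead of --variant).

  # flag SNP clusters (3 SNPs within 10 bp) and the HARD_TO_VALIDATE set
  # (more than 10% of the covering reads have mapping quality 0)
  java -jar GenomeAnalysisTK.jar -T VariantFiltration \
      -R human_b37.fasta \
      --variant raw.snps.vcf \
      --clusterSize 3 \
      --clusterWindowSize 10 \
      --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
      --filterName "HARD_TO_VALIDATE" \
      -o filtered.snps.vcf

Note that flagged sites are not removed: they remain in the VCF with the FILTER column set, so downstream tools can choose to include or exclude them.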
Expected SNP call quality
Using the GATK walker VariantEval
- gives sensitivity, specificity, and Ti/Tv ratios for known and novel calls
- expected Ti/Tv ratios can be used to evaluate the quality of SNP calls genome-wide (roughly 2.0-2.1), in targeted whole-exome data (Agilent capture; closer to 3 in exonic regions), or in other regions of interest

References
- DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43:491-498, 2011.
- McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20:1297-1303, 2010.
- GATK wiki: http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page
- Best Practice Variant Detection with the GATK v2: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2
- 1000 Genomes: A Deep Catalog of Human Genetic Variation (VCF v4.0 specification): http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40
- VCF poster: The Variant Call Format and VCFtools, by Petr Danecek et al.: http://vcftools.sourceforge.net/VCF-poster.pdf
- VCFtools: http://vcftools.sourceforge.net
- Wikipedia: http://en.wikipedia.org/

Thanks for your attention!