Genetic Variant Caller (GATK)

CBI NGS Workshop
Lesson 4
The Genome Analysis Toolkit (GATK)
Liu Huan (刘欢)
Center for Bioinformatics,
Peking University
2011-05-30
Outline
- Basic Concepts
- Overview for Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Basic Concepts

Single-nucleotide polymorphism (SNP)
- a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a biological species or between paired chromosomes in an individual
- e.g. two DNA fragments from different individuals, AAGCCTA vs. AAGCTTA (two alleles, C and T)
- almost all common SNPs have only two alleles
- within a population, SNPs can be assigned a minor allele frequency
Basic Concepts

Indel
- an insertion or a deletion
- e.g.
  reference:            ATGG
  indel 1 (deletion):   AT_G
  indel 2 (insertion):  ATCGG
Basic Concepts
copy number variation (CNV)
- a form of structural variation
- alterations of the DNA of a genome that result in the cell having an abnormal number of copies of one or more sections of the DNA
- CNVs correspond to relatively large regions of the genome that have been deleted (fewer than the normal number of copies) or duplicated (more than the normal number) on certain chromosomes
- e.g. normal chromosome structure: A-B-C-D
  CNV 1: A-B-C-C-D (a duplication of "C")
  CNV 2: A-B-D (a deletion of "C")
- this variation accounts for roughly 12% of human genomic DNA
- each variant may range from about one kilobase (1,000 nucleotide bases) to several megabases in size

Outline
- Basic Concepts
- Overview for Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Framework for variation discovery and genotyping from next-generation DNA sequencing

Phase 1:
- raw read data with platform-dependent biases were transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin, and aligned consistently with respect to one another; mapping algorithms placed reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM file format
- molecular duplicates were eliminated
- initial alignments were refined by local realignment, and then an empirically accurate per-base error model was determined
Framework for variation discovery and genotyping from next-generation DNA sequencing

Phase 2:
- the analysis-ready SAM/BAM files were analyzed to discover all sites with statistical evidence for an alternate allele present among the samples, including SNPs, short indels and copy number variations (CNVs)
Framework for variation discovery and genotyping from next-generation DNA sequencing

Phase 3:
- technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium (LD), and family and population structure were integrated with the raw variant calls from phase 2 to separate true polymorphic sites from machine artifacts, and at these sites high-quality genotypes were determined for all samples
Outline
- Basic Concepts
- Overview for Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

GATK architecture

MapReduce:
- a framework for parallel computation
- two steps: subdivide a large problem into many discrete, independent pieces, which are fed to the map function; the reduce function then joins the map results back into a final product
- subdividing the data enables parallelism and load balancing

Example (see the shell sketch below):
- SNP discovery: the work is dominated by the map function
- ChIP-seq peak calling: the work is dominated by the reduce function
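To make the pattern concrete, here is a minimal shell-pipeline analogy (not GATK code; input.bam is a placeholder file name): the first stage plays the role of map, emitting one key per read, and the last stage plays the role of reduce, aggregating over each key.

    # map: stream each read once and emit its chromosome (column 3 of SAM)
    # reduce: group the keys and count reads per chromosome
    samtools view input.bam | cut -f3 | sort | uniq -c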
GATK architecture

traversals
- provide the division and preparation of data

walkers
- the analysis modules
- provide the map and reduce methods that consume the data

GATK provides a nearly comprehensive set of traversal types that satisfy the data-access needs of the majority of analysis tools
Traversal Types in GATK

"By each sequencer read" (read-based) and "by every read covering each single base position in a genome" (locus-based)
- the standard methods for accessing data for several analyses
- e.g. counting reads, building base-quality histograms, reporting average coverage of sequencer reads over the genome, calling SNPs

Read-based Traversals

Read-based traversal
- presents the analysis walker with each read individually, passing each read once and only once to the walker's map function
- along with the sequencer read, the walker is presented with the reference bases that the read overlaps
- useful for analyzing read quality scores and alignment scores, and for merging reads from multiple BAM files

Locus-based Traversals

Locus-based traversal
- presents the analysis walker with all the associated genomic data: all the reads that span the genomic location, all reference-ordered data, and the reference base at the specific locus in the genome
- each of these single-base loci is passed to the walker's map function
- e.g. depth-of-coverage calculation, variant analysis
Depth of Coverage Walker in GATK

Depth of coverage:
- important in CNV discovery, SNP calling, and other downstream analyses

Depth of Coverage walker in GATK (see the sketch below):
- at each site the walker receives a list of the reads covering the reference base and emits the size of the pileup
- the end user can optionally exclude reads of low mapping quality or apply other read-filtering criteria
- can also be provided with a list of regions over which to calculate coverage, summing the average coverage over each region
- can also be used to quantify sequencing results over complex or highly variable regions, e.g. the major histocompatibility complex (MHC)

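A minimal command-line sketch of the walker (GATK 1.x-era syntax; the file names are placeholders and argument names can vary between releases):

    # per-locus coverage, optionally restricted to a list of regions (-L)
    java -jar GenomeAnalysisTK.jar \
        -T DepthOfCoverage \
        -R human_b37.fasta \
        -I sample.bam \
        -L targets.interval_list \
        -o sample.coverage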
Outline
- Basic Concepts
- Overview for Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Data Processing Pipeline of GATK
- initial mapping
- refinement of the initial reads
- multi-sample indel and SNP calling
- filtering of the raw SNP calls
- finally, variant quality score recalibration

Reference Genome of GATK
- hg19 is not supported; b37 is used
- to keep up to date with dbSNP and the 1000 Genomes Project data files

Resources download, from the GSA FTP server:
- location: ftp.broadinstitute.org
- username: gsapubftp-anonymous
- password: <blank>
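A minimal download sketch using lftp with the anonymous account above (the layout of the resource directories on the server is not shown in the slides, so list them first before fetching anything):

    # open an anonymous session (empty password) and browse the bundles
    lftp -u gsapubftp-anonymous, ftp.broadinstitute.org
    # then, inside the lftp session:
    #   ls              see which resource directories are available
    #   mirror <dir>    download a resource directory once located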
Raw Data Processing

raw FASTQ file → NGS read aligner

For Illumina data: BWA is recommended
- accurate, fast, well-supported, open source, and emits SAM output natively
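A minimal paired-end alignment sketch using the 2011-era bwa aln/sampe workflow (file names are placeholders; samtools converts the SAM stream to a coordinate-sorted BAM for GATK, using the samtools 0.1.x sort syntax of the era):

    # index the reference once
    bwa index human_b37.fasta
    # align each end of the read pair
    bwa aln human_b37.fasta reads_1.fastq > reads_1.sai
    bwa aln human_b37.fasta reads_2.fastq > reads_2.sai
    # pair the alignments, convert to BAM, and coordinate-sort
    bwa sampe human_b37.fasta reads_1.sai reads_2.sai reads_1.fastq reads_2.fastq \
        | samtools view -bS - \
        | samtools sort - sample.sorted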
Raw BAM to realigned, recalibrated BAM

Purpose of realignment
- locally realign reads such that the number of mismatching bases is minimized across all the reads
- in general, a large percentage of the regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual's genome with respect to the reference genome; such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs

Two steps of realignment (see the sketch below)
- Step 1: determining (small) suspicious intervals that are likely in need of realignment
- Step 2: running the realigner over those intervals
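A two-command sketch matching the two steps above (GATK 1.x-era tool names; file names are placeholders, and the way known indel sites are bound to the tools differs between releases):

    # Step 1: find suspicious intervals likely to need realignment
    java -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R human_b37.fasta \
        -I sample.bam \
        -o sample.intervals
    # Step 2: realign reads over those intervals
    java -jar GenomeAnalysisTK.jar \
        -T IndelRealigner \
        -R human_b37.fasta \
        -I sample.bam \
        -targetIntervals sample.intervals \
        -o sample.realigned.bam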
Raw BAM to realigned, recalibrated BAM

Two types of realignment
- realignment only at known sites
  - very efficient
  - can operate with little coverage
  - but can only realign reads at known indels
- fully local realignment: uses mismatching bases to determine whether a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment
  - much slower (involves a Smith-Waterman step)
  - can discover new indel sites in the reads
Raw BAM to realigned, recalibrated BAM

Purpose of base quality recalibration
- after recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome
- the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context
- the result is more accurate quality scores

Base quality recalibration (see the sketch below):
- analyzes the covariation among several features of a base, e.g.
  - reported quality score
  - position within the read
  - preceding and current nucleotide observed by the sequencing machine
  - probability of mismatching the reference genome
- these covariates are then used to recalibrate the quality scores of all reads in a BAM file

Recommendation: lane-level recalibration, sample-level realignment
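A sketch of the two-pass recalibration of the era, CountCovariates followed by TableRecalibration (argument names changed across GATK 1.x releases, so treat this as illustrative; the dbSNP file name is a placeholder):

    # Pass 1: tabulate covariation of quality with read group, reported
    # quality, machine cycle and dinucleotide context, skipping known sites
    java -jar GenomeAnalysisTK.jar \
        -T CountCovariates \
        -R human_b37.fasta \
        -I sample.realigned.bam \
        -knownSites dbsnp_132.b37.vcf \
        -cov ReadGroupCovariate \
        -cov QualityScoreCovariate \
        -cov CycleCovariate \
        -cov DinucCovariate \
        -recalFile sample.recal.csv
    # Pass 2: rewrite the quality scores in the BAM from the table
    java -jar GenomeAnalysisTK.jar \
        -T TableRecalibration \
        -R human_b37.fasta \
        -I sample.realigned.bam \
        -recalFile sample.recal.csv \
        -o sample.recal.bam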
Initial variant discovery and genotyping

Input BAMs for variant discovery and genotyping
- you should already have a single realigned, recalibrated, de-duplicated BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort

Multi-sample SNP and indel calling (see the sketch below)
- apply the UnifiedGenotyper to identify sites among the cohort samples; this will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort
- note: by default the UnifiedGenotyper calls SNPs only; to enable its indel-calling capabilities instead, use the -glm DINDEL argument
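A sketch of a multi-sample call across the cohort (file names are placeholders):

    # multi-sample SNP calling (the default model)
    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R human_b37.fasta \
        -I sample1.bam -I sample2.bam -I sampleN.bam \
        -o cohort.snps.raw.vcf
    # the same command with -glm DINDEL produces raw indel calls instead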
Initial variant discovery and genotyping

Selecting an appropriate quality score threshold
- a common question is what confidence score threshold to use for variant detection
- recommendations (see the sketch below):
  - deep (> 10x coverage per sample) data: a minimum confidence score threshold of Q30 with an emission threshold of Q10; calls between Q10 and Q30 are emitted but flagged as LowQual
  - shallow (< 10x coverage per sample) data: a minimum confidence score of Q4 with an emission threshold of Q3, since variants necessarily have lower quality with shallower coverage
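These thresholds map onto the UnifiedGenotyper arguments -stand_call_conf and -stand_emit_conf; a sketch for the deep-coverage recommendation (shallow data would use 4.0 and 3.0 instead):

    # call at Q30, emit down to Q10; Q10-Q30 calls are flagged LowQual
    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R human_b37.fasta \
        -I sample1.bam \
        -stand_call_conf 30.0 \
        -stand_emit_conf 10.0 \
        -o cohort.snps.raw.vcf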
Initial variant discovery and genotyping

Protocol (figure)
VCF (variant call format)
- a standardised format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations
- usually stored in compressed form, and can be indexed for fast retrieval of variants from a range of positions on the reference genome
- VCFtools: a software suite that implements various utilities for processing VCF files, including validation, merging and comparison
- http://vcftools.sourceforge.net
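A minimal VCFv4.0 example, adapted from the illustrative record in the VCF specification: a header line, the column line, and one SNP with genotype data for a single sample (columns are tab-separated in real files):

    ##fileformat=VCFv4.0
    #CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO          FORMAT  NA00001
    20      14370  rs6054257  G    A    29    PASS    DP=14;AF=0.5  GT:GQ   0|0:48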
Initial variant discovery and genotyping

VCF (variant call format): example records (figure)
Integrating analyses: getting the best call set possible

Problems with the raw VCF file
- the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference
- we should separate out the false-positive machine artifacts from the true-positive genetic variants!

Tools:
- VariantFiltration walker: apply hard filters
- variant quality score recalibration: build an adaptive error model using known variant sites, then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact

Recommendation:
- regardless of whether you'll ultimately apply hard filtering or adaptive error modeling to select your final calls, first apply some common SNP filters to avoid obvious misalignment and indel artifacts
Integrating analyses: getting the best call set possible

Analysis-ready VCF protocol (figure)
Integrating analyses: getting the best call set possible

Basic indel filtering:
- purpose: remove alignment artifacts from the data
- method: flag variants with high strand bias, and variants in poorly mapped regions (the HARD_TO_VALIDATE set) where more than 10% of the reads have mapping quality 0
- arguments for the VariantFiltration walker: see the sketch below
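A sketch of the corresponding VariantFiltration command; the expressions follow the best-practice wiki of the era (the MQ0-fraction and strand-bias thresholds shown are illustrative and should be checked against the documentation, and the VCF input binding syntax varied across releases):

    # flag poorly mapped sites (>10% of reads with mapping quality 0)
    # and strongly strand-biased calls
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R human_b37.fasta \
        --variant cohort.indels.raw.vcf \
        --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
        --filterName "HARD_TO_VALIDATE" \
        --filterExpression "SB >= -1.0" \
        --filterName "StrandBiasFilter" \
        -o cohort.indels.filtered.vcf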
Integrating analyses: getting the best call set possible

Basic SNP filtering:
- purpose: remove alignment artifacts from the data
- method: flag SNPs within clusters (3 SNPs within 10 bp of each other) and those in poorly mapped regions (the HARD_TO_VALIDATE set) where more than 10% of the reads have mapping quality 0
- arguments for the VariantFiltration walker: see the sketch below
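A sketch for the SNP case; --clusterSize and --clusterWindowSize implement the 3-SNPs-within-10-bp rule (the HARD_TO_VALIDATE expression is the same illustrative one as above):

    # flag clustered SNPs and the HARD_TO_VALIDATE set
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R human_b37.fasta \
        --variant cohort.snps.raw.vcf \
        --clusterSize 3 \
        --clusterWindowSize 10 \
        --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
        --filterName "HARD_TO_VALIDATE" \
        -o cohort.snps.filtered.vcf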
Integrating analyses: getting the best call set possible

Filtering around indels
- purpose: despite even local realignment, misalignments around true and artifactual indels can result in some false SNP calls; these errors are quite common if you didn't do local realignment, didn't provide a set of known indels during local realignment, or around very large indels that can't be modeled properly by local realignment
- method: perform indel calling, then filter your SNP calls around the raw indel calls from your data set
- arguments for the VariantFiltration walker: see the sketch below
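A sketch using VariantFiltration's mask arguments to flag SNPs near the raw indel calls (the 10 bp extension is an assumption added here to show the mechanism, not a value from the slides):

    # mark SNPs that fall in, or within 10 bp of, a raw indel call
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R human_b37.fasta \
        --variant cohort.snps.filtered.vcf \
        --mask cohort.indels.raw.vcf \
        --maskName "InDel" \
        --maskExtension 10 \
        -o cohort.snps.indelmasked.vcf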
Integrating analyses: getting the best call set possible

Making analysis-ready SNP calls with hard filtering
- the GATK-recommended hard filtering
- arguments for the VariantFiltration walker: see the sketch below

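A sketch of the hard filters recommended on the wiki of the era for deep data (the QD, HRun and SB cutoffs are recalled from that recommendation and should be verified against the documentation before use):

    # GATK-standard hard filter for SNPs
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R human_b37.fasta \
        --variant cohort.snps.indelmasked.vcf \
        --filterExpression "QD < 5.0 || HRun > 5 || SB > -0.10" \
        --filterName "GATKStandard" \
        -o cohort.snps.analysisready.vcf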
Integrating analyses: getting the best call set possible

Making analysis-ready calls with variant quality score recalibration
- newly developed
- an alternative approach to hard filtering
- method: assign a well-calibrated probability to each variant call in a call set; one can then create highly accurate call sets by filtering on this single estimate of the accuracy of each call (see the sketch below)
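A two-step sketch with VariantRecalibrator and ApplyRecalibration (the required training-resource bindings, e.g. HapMap and dbSNP truth sets, are omitted here because their syntax changed repeatedly across releases; the annotations listed are illustrative):

    # build the adaptive error model over annotations of each call
    # (training resources such as HapMap/dbSNP must also be bound)
    java -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R human_b37.fasta \
        --input cohort.snps.raw.vcf \
        -an QD -an SB -an HaplotypeScore -an HRun \
        -recalFile cohort.snps.recal \
        -tranchesFile cohort.snps.tranches
    # apply the model, filtering to a chosen sensitivity tranche
    java -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R human_b37.fasta \
        --input cohort.snps.raw.vcf \
        -recalFile cohort.snps.recal \
        -tranchesFile cohort.snps.tranches \
        -o cohort.snps.recalibrated.vcf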
Expected SNP call quality

Using the GATK walker VariantEval (see the sketch below)
- gives sensitivity, specificity, and Ti/Tv ratios for known and novel calls
- expected Ti/Tv ratios are used to evaluate the quality of SNP calls over the whole genome, the targeted whole exome (Agilent), or other regions of interest
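A sketch of a VariantEval run against dbSNP, which splits the report into known and novel calls (argument names per GATK 1.x-era docs; file names are placeholders):

    # summarize call quality, including Ti/Tv for known vs. novel sites
    java -jar GenomeAnalysisTK.jar \
        -T VariantEval \
        -R human_b37.fasta \
        --eval cohort.snps.recalibrated.vcf \
        -D dbsnp_132.b37.vcf \
        -o cohort.eval.report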
References

- DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43:491-498, 2011.
- McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20:1297-1303, 2010.
- GATK wiki: http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page
- Best Practice Variant Detection with the GATK v2: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2
- 1000 Genomes, VCF (Variant Call Format) version 4.0: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40
- VCF poster: The Variant Call Format and VCFtools, by Petr Danecek et al.: http://vcftools.sourceforge.net/VCF-poster.pdf
- VCFtools: http://vcftools.sourceforge.net
- Wikipedia: http://en.wikipedia.org/
Thanks for your attention!
