1000Genomes-2008 - BC Bioinformatics

Read mapping and variant calling in
human short-read DNA sequences
Gabor T. Marth
Boston College Biology Department
1000 Genomes Meeting
Cold Spring Harbor Laboratory
May 5-6, 2008
1000 Genomes – related work
• Software
Read alignment / assembly (Michael Stromberg)
SNP / short-INDEL calling (Gabor Marth)
GigaBayes
Structural Variation calling (Chip Stewart)
Read simulator (Weichun Huang)
Benchmarking suite (Weichun Huang)
• Read mapping based studies
Read accuracy / quality value analysis
Read simulations
• Variant calling based study
SNP discovery: sample size / read coverage (Aaron Quinlan)
MOSAIK – sequence aligner/assembler
Michael Strömberg
(see poster at Genome Meeting)
• maps reads to reference: short-hash based scan + Smith-Waterman alignment
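The two-stage idea (hash scan to find candidate locations, then Smith-Waterman to produce a gapped local alignment) can be sketched in Python as follows. This is a minimal illustration, not MOSAIK's implementation: the function names, the k-mer size, the padding, and the scoring parameters are all assumptions made for the example.

```python
from collections import defaultdict

def build_hash_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_offsets(read, index, k):
    """Each k-mer hit votes for the reference offset it implies for the read start."""
    votes = defaultdict(int)
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], ()):
            votes[pos - i] += 1
    return sorted(votes, key=votes.get, reverse=True)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (hypothetical scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def map_read(read, reference, index, k, pad=3):
    """Score every candidate offset and return (offset, score), or None if no hit."""
    best = None
    for off in candidate_offsets(read, index, k):
        start = max(0, off - pad)
        window = reference[start:off + len(read) + pad]
        score = smith_waterman_score(read, window)
        if best is None or score > best[1]:
            best = (off, score)
    return best
```

Padding the reference window by a few bases on each side leaves room for the gaps that insertion-deletion errors and short INDEL alleles introduce.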
MOSAIK – Features
• produces gapped alignments
essential for tolerating reads with
insertion-deletion read errors and
short insertion-deletion alleles
• adapted to all available NGS platforms
• can create “mixed” alignments of
reads from different platforms (except
SOLiD color-space reads)
MOSAIK – Resolving multiple map locations
MOSAIK – Performance
Illumina 35 bp (X Chromosome)
program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRIMP     39
Uses a lot of RAM for mammalian alignments –
precomputed file based versions are available for
RAM-limited users
Run dissection (timeline figure from Michael)
MOSAIK – Accuracy
Erroricity – read accuracy / quality values
Motivation: why does read accuracy matter?
Why does quality value accuracy matter?
we are using quality values to
distinguish between sequencing
error and true allelic difference
Q-values should correspond
well with actual sequencing
error rate
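For reference, the standard PHRED relationship between a quality value Q and the error probability p is Q = -10 log10(p). A small Python sketch of the conversion (the function names are chosen for this example):

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a PHRED-scaled quality value."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED-scaled quality value implied by an error probability."""
    return -10 * math.log10(p)
```

A base with Q = 20 is therefore expected to be wrong once in 100 calls, and a base with Q = 30 once in 1,000 calls.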
Erroricity – study design & pipeline
• Sampled 3 lanes each from 3 runs
• Aligned reads with MosaikPE (up to 4 mismatches),
keeping only consistently mapping pairs
• Looked at read-specific, position-specific error rates
• Compared SUB, IN, and DEL error rates
• Looked at overall quality value vs. measured error rate,
and position specific quality value vs. measured error rate
• Compared the first and the second end-reads of read pairs
• Compared RAW vs. CALIBRATED Q-values
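The comparison of assigned quality value against measured error rate can be sketched as a simple tally over aligned bases. This is a hedged illustration, not the Erroricity pipeline itself: the input format (pairs of assigned Q-value and a mismatch flag) and the function name are assumptions.

```python
import math
from collections import defaultdict

def measured_quality(observations):
    """Tally errors per assigned Q-value.

    observations: iterable of (assigned_q, is_error) pairs, one per aligned base.
    Returns {assigned_q: (n_bases, n_errors, measured_q)}, where measured_q is
    None when no error was observed at that assigned Q-value.
    """
    counts = defaultdict(lambda: [0, 0])
    for q, is_error in observations:
        counts[q][0] += 1
        counts[q][1] += int(is_error)
    table = {}
    for q, (n, errors) in sorted(counts.items()):
        measured = -10 * math.log10(errors / n) if errors else None
        table[q] = (n, errors, measured)
    return table
```

Keying the tally on (read position, assigned Q) instead of assigned Q alone would give the position-specific comparison, and running it separately on first and second end-reads would give the paired-end comparison.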
Derek Barnett
Read simulations (Weichun Huang, Aaron)
• Input
• Conceptual schema of read simulation
• Representational biases (GC-driven and others…) [Chip]
• Error and Q-value generation: 2D tables of read position,
assigned Q-value → true Q-value, frequency
• Speed / RAM / space
• Data output
• Software benchmarking system
Weichun Huang, see poster at Genome Meeting
GigaBayes: SNP and short-INDEL calling
• The new GigaBayes program: pop-gen and diploid
priors, trio priors
• Speed
• Input / output behavior
• Bayesian math focused on the individual genotype
• How to deal with multiple reads from a single individual
• Diploid individual
• Multiple diploid individuals
• Trio members
• Prior frequency of an allele
• Taking into account Q-value for allele
• What is needed to call an allele? (# reads, Q, # people)
Variant calling (SNPs and short-INDELs)
[Figure: sampling scheme from population to individuals (G1, G2, G3), to fragments, to reads, illustrated with the two alleles aacgtCaggct and aacgtTaggct; the stages are annotated with priors, sampling, likelihood, and quality values]
Bayesian variant detection math

$$
\Pr(G_1^k, G_2^k, \ldots, G_n^k \mid B) =
\frac{\prod_{i=1}^{n} \left[ \sum_{T_i^k} \Pr(B_i \mid T_i^k)\, \Pr(T_i^k \mid G_i^k) \right] \Pr(G_1^k, G_2^k, \ldots, G_n^k)}
{\sum_{G^l} \prod_{i=1}^{n} \left[ \sum_{T_i^l} \Pr(B_i \mid T_i^l)\, \Pr(T_i^l \mid G_i^l) \right] \Pr(G_1^l, G_2^l, \ldots, G_n^l)}
$$

Priors Pr(G1, G2, ..., Gn): (1) based on all the individuals from which reads are
aligned; (2) built from theta, P(AF = i), and the specific diploid genotype layout
given AF = i
Pr(T | G): which of the two chromosomes does a read represent?
Calculated from a multinomial (or binomial) distribution
Pr(B | T) = P(base call is B | read template is T):
comes from quality values
SNP calling and genotyping
P(SNP) = total probability of all non-monomorphic
genotype combinations
P(Gi) = marginal probability
consequence: data from other individuals influence the
genotype call of a given individual: include illustration
using testProb program in GigaBayes package.
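The calculation can be sketched for a single biallelic site by brute-force enumeration of genotype combinations. This is an illustration under stated assumptions, not GigaBayes code: genotypes are encoded as the number of alternate alleles (0, 1, 2), a miscalled base is assumed equally likely to be any of the three other bases, and the joint prior uses P(allele count = i) = theta / i with all chromosome assignments equally likely given the count. The function names and the theta default are hypothetical.

```python
from itertools import product
from math import comb

def read_likelihood(base, qual, genotype, ref, alt):
    """Pr(B | G) for one read, summed over which chromosome (template) it came from."""
    err = 10 ** (-qual / 10)
    def p_base_given_template(t):
        return 1 - err if base == t else err / 3
    p_alt = genotype / 2  # chance the read samples an alt chromosome
    return (1 - p_alt) * p_base_given_template(ref) + p_alt * p_base_given_template(alt)

def individual_likelihood(reads, genotype, ref, alt):
    """Product over all reads (base, quality) from one individual."""
    like = 1.0
    for base, qual in reads:
        like *= read_likelihood(base, qual, genotype, ref, alt)
    return like

def genotype_prior(genotypes, theta):
    """Assumed joint prior over one genotype combination."""
    n = len(genotypes)
    count = sum(genotypes)
    if count == 0:
        return 1 - theta * sum(1 / i for i in range(1, 2 * n + 1))
    hets = sum(1 for g in genotypes if g == 1)
    # 2**hets chromosome assignments yield this diploid layout
    return (theta / count) * 2 ** hets / comb(2 * n, count)

def call_site(samples, ref, alt, theta=0.001):
    """Return P(SNP) and per-individual marginal genotype probabilities."""
    n = len(samples)
    joint = {}
    for gts in product((0, 1, 2), repeat=n):
        p = genotype_prior(gts, theta)
        for reads, g in zip(samples, gts):
            p *= individual_likelihood(reads, g, ref, alt)
        joint[gts] = p
    total = sum(joint.values())
    # monomorphic combinations: everyone homozygous for the same allele
    monomorphic = joint[(0,) * n] + joint[(2,) * n]
    p_snp = 1 - monomorphic / total
    marginals = [
        [sum(p for gts, p in joint.items() if gts[i] == g) / total for g in (0, 1, 2)]
        for i in range(n)
    ]
    return p_snp, marginals
```

Because the prior is defined over the whole genotype combination rather than per individual, evidence for the alternate allele in one sample changes the marginal genotype probabilities of the others. Full enumeration grows as 3^n, so this form is only practical for a handful of individuals.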
Variant calling tests in simulated data
Aaron Quinlan
(see poster at the Genome Meeting)
Variant calling – Estimated vs. population AF
[Figure: observed vs. expected minor allele frequency (both axes 0 to 0.5), four panels]
100 Inds @ 16X: corr = 0.89
200 Inds @ 8X: corr = 0.92
400 Inds @ 4X: corr = 0.93
800 Inds @ 4X: corr = 0.91
Variant calling – AF (cont’d)
[Figure: observed vs. expected minor allele frequency, 800 Inds @ 4X, corr = 0.91, with an inset zoomed to allele frequencies between 0.005 and 0.05]
Variant calling – SNP discovery sensitivity
Variant calling – Genotype completeness
[Figure: fraction of confident genotypes (0 to 1) vs. SNP position (0 to 50,000)]
Fraction of confident genotypes (figure legend):
100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
Variant calling – Genotype completeness
[Figure: histograms of number of SNPs vs. fraction of samples w/ confident genotype (0 to 1), four panels: 100 Inds @ 16X, 200 Inds @ 8X, 400 Inds @ 4X, 800 Inds @ 2X]
Summary / Conclusions
Thanks
Michael
Derek
Aaron
Chip
Weichun
MOSAIK (Michael Stromberg)
• MOSAIK is a reference-sequence guided aligner / assembler
replace this with an animated figure illustrating mapping
against reference and padding, by moving / stretching
bases in the reads and in the reference sequence
MOSAIK – Features and characteristics
• aligns reads to genome (higher RAM usage but also many
desirable consequences)
• offers several algorithmic levels that trade off between
speed and accuracy
• able to report “every” decent alternative map location for
sequence reads, and distinguishes between uniquely and
non-uniquely mapped reads
• designed to work with all currently available technologies
(Illumina, 454, AB, Helicos) and to include mixed read sets
into a single anchored “assembly”
• PE-aware
• recently scaled up to mammalian alignments
Structural variation discovery
• copy number variations (deletions &
amplifications) can be detected from
variations in the depth of read coverage
• structural rearrangements (inversions and
translocations) require paired-end read data
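Depth-based copy number detection can be sketched as a comparison of each window's read depth with the typical depth. This is a minimal illustration, not the method used by any tool named here: the function name, the use of the median as the baseline, and the ratio thresholds are assumptions.

```python
from statistics import median

def flag_copy_number_windows(window_depths, low=0.5, high=1.5):
    """Label each window by its read depth relative to the median depth."""
    typical = median(window_depths)
    if typical == 0:
        raise ValueError("median depth is zero; cannot normalize")
    calls = []
    for depth in window_depths:
        ratio = depth / typical
        if ratio <= low:
            calls.append("deletion")
        elif ratio >= high:
            calls.append("amplification")
        else:
            calls.append("normal")
    return calls
```

Inversions and translocations leave depth largely unchanged, which is why they need paired-end information instead.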
Software evaluation suite
GigaBayes
Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) vs. read length (10 bp to 1,000 bp)]
• Illumina/Solexa, AB/SOLiD short-read sequencers: 1-4 Gb in 25-50 bp reads
• 454 pyrosequencer: 20-100 Mb in 100-250 bp reads
• ABI capillary sequencer
Current and future application areas
Genome re-sequencing: somatic mutation detection, organismal SNP discovery,
mutational profiling, structural variation discovery
[Figure labels: reference genome, SNP, DEL]
De novo genome sequencing
Short-read sequencing will be (at least) an
alternative to micro-arrays for:
• DNA-protein interaction analysis (ChIP-Seq)
• novel transcript discovery
• quantification of gene expression
• epigenetic analysis (methylation profiling)
Fundamental informatics challenges
1. Interpreting machine readouts – base calling, base error estimation
2. Dealing with nonuniqueness in the genome: resequenceability
3. Alignment of billions of reads
Informatics challenges (cont’d)
4. SNP and short INDEL, and
structural variation discovery
5. Data visualization
6. Data storage & management
Challenge 1. Base accuracy and base calling
• machine read-outs are
quite different
• read length, read accuracy, and sequencing error profiles are variable (and change
rapidly as machine hardware, chemistry, optics, and noise filtering improve)
• what is the instrument-specific error profile?
• are the base quality values satisfactory?
(1) are base quality values accurate?
(2) are most called bases high-quality?
454 pyrosequencer error profile
• multiple bases in a homo-polymeric run are incorporated in a single
incorporation test → the number of bases must be determined from a single
scalar signal → the majority of errors are INDELs
• error rates are nucleotide-dependent
454 base quality values
• the native 454 base caller assigns base quality values that are too low
PYROBAYES: determine base number
New 454 base caller:
[Figure: data, likelihoods, and priors combined into the posterior base number probability]
PYROBAYES: base calls and quality values
• call the most likely number of nucleotides
• produce three base quality values:
QS (substitution)
QI (insertion)
QD (deletion)
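The step of combining likelihoods and priors into a posterior base number probability can be sketched as follows. This is not PyroBayes's actual model: the Gaussian signal distribution, its standard deviation parameters, and the prior over homopolymer lengths are hypothetical choices made only to show the Bayesian calculation.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def base_number_posterior(signal, priors, sd_base=0.15, sd_slope=0.05):
    """Posterior over the number of incorporated bases given one flow signal.

    priors[n] is the assumed prior probability of an n-base homopolymer.
    The signal for n bases is assumed normal with mean n and a standard
    deviation that grows linearly with n (hypothetical parameters).
    """
    weights = [
        priors[n] * normal_pdf(signal, n, sd_base + sd_slope * n)
        for n in range(len(priors))
    ]
    total = sum(weights)
    return [w / total for w in weights]

def call_base_number(signal, priors):
    """Return the most likely base number and its posterior probability."""
    post = base_number_posterior(signal, priors)
    n = max(range(len(post)), key=post.__getitem__)
    return n, post[n]
```

The posterior probability of the called base number can then be turned into a quality value with the usual -10 log10(1 - posterior) scaling.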
PYROBAYES: Performance
• better correlation between assigned
and measured quality values
• higher fraction of high-quality bases
Illumina/Solexa base accuracy
• error rate grows as a
function of base position
within the read
• a large fraction of the reads
contains 1 or 2 errors
Illumina/Solexa base accuracy (cont’d)
• Actual base accuracy for a fixed base quality value is a function of base position
within the read (i.e. there is need for quality value calibration)
• Most errors are substitutions
→ PHRED quality values work
Challenge 2. Resequenceability
• Reads from repeats cannot be
uniquely mapped back to their
true region of origin
• RepeatMasker does not capture
all micro-repeats, i.e. repeats at the
scale of the read length
• Near-perfect micro-repeats can
also be a problem because we want to
align reads even with a few
sequencing errors and / or SNPs
[Figure: percentage of reads (0% to 100%) vs. number of mismatches (0 to 8)]
Repeats at the fragment level
“base masking”
“fragment masking”
Fragment level repeat annotation
[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0 to 2)]
• bases in repetitive fragments may be resequenced with reads representing
other, unique fragments → fragment-level repeat annotations spare a higher
fraction of the genome than base-level repeat masking
Find perfect and near-perfect micro-repeats
• Hash based methods (fast but only work out to a couple of mismatches)
• Exact methods (very slow but find every repeat copy)
• Heuristic methods (fast but miss a fraction of the repeats)
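The simplest hash-based case, perfect micro-repeats with zero mismatches, can be sketched by counting every read-length fragment. This is an illustrative example only: the function name is hypothetical, only the forward strand is considered, and near-perfect copies (one or more mismatches) are not detected.

```python
from collections import Counter

def repeated_fragment_fraction(genome, read_length):
    """Fraction of read-length fragments that occur more than once as exact copies."""
    n = len(genome) - read_length + 1
    counts = Counter(genome[i:i + read_length] for i in range(n))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n
```

Holding every fragment of a mammalian genome in one in-memory table is expensive, and extending the lookup to mismatches multiplies the number of keys to probe, which is consistent with hash methods working only out to a couple of mismatches.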
Challenge 3. Read alignment and assembly
• resequencing requires reference
sequence-guided read alignment
• to align billions of reads the aligner has to be fast and efficient
• INDEL errors require gapped alignment
• individually aligned reads must be “assembled” together
• has to work for every read type (short, medium-length, and long reads)
• must tolerate sequencing errors and SNPs
• must work with both base-level and fragment-level repeat annotations
• transcribed sequences require additional features e.g. splice-site aware
alignment capability
• most frequently used tools: BLAT (only pair-wise), SSAHA (pair-wise), MAQ
(pair-wise and assembly), ELAND (pair-wise), MOSAIK (pair-wise and
assembly, gapped)
MOSAIK: co-assembling different read types
[Figure: co-assembly of ABI/cap., 454/FLX, 454/GS20, and Illumina reads]
Challenge 4. Polymorphism discovery
• shallow and deep
read coverage
• most candidates will never be “checked” → only very low error rates are
acceptable
• we updated PolyBayes to deal with
new read types
• made the new software (PBSHORT)
much more efficient
Challenge 5. Data visualization
1. aid software development: integration of trace data viewing, fast navigation,
zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read
types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Challenge 6. Massive data volumes
• two connected working groups to define standard data formats
Short-read format working group
ssrformat@ubc.ca
(Asim Siddiqui, UBC)
Assembly format working group
Boston College
http://assembly.bc.edu
Next-generation sequencing software
Machine manufacturers’ sites plus third-party developers’ sites, e.g.:
http://sourceforge.net/projects/maq/
http://bioinformatics.bc.edu/marthlab/Mosaik
Applications in various discovery projects
1. SNP discovery in shallow,
single-read 454 coverage
(Drosophila melanogaster)
2. Mutational profiling in deep 454 data
(Pichia stipitis)
(image from Nature Biotech.)
3. SNP and INDEL discovery in deep
Illumina / Solexa short-read coverage
(Caenorhabditis elegans)
SNP calling in single-read 454 coverage
DNA courtesy of Chuck Langley, UC Davis
• collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• goal was to assess polymorphism rates between 10 different African and American
melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) were collected
• key informatics question: can we detect SNPs with high accuracy in low-coverage,
survey-style 454 reads aligned to finished reference genome sequence?
• reads were base-called with PyroBayes and aligned to the 180Mb reference melanogaster
genome sequence with Mosaik → 0.16 x nominal read coverage → most reads are singletons
• SNP detection with PolyBayes
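Why 0.16 x coverage implies mostly singleton reads can be checked with a simple Poisson model of read depth. The model is an assumption for illustration (uniform, independent read placement), and the function names are chosen for this example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def singleton_fraction(coverage):
    """Among covered positions, the fraction covered by exactly one read."""
    covered = 1 - poisson_pmf(0, coverage)
    return poisson_pmf(1, coverage) / covered
```

Under this model, singleton_fraction(0.16) is about 0.92: roughly 92% of the covered positions are seen in a single read, so SNP calls must be made mostly from one read against the reference.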
SNP calling success rates
iso-1 reference
46-2 454 read
46-2 ABI reads
(2 fwd + 2 rev)
• 92.9 % validation rate (1,342 / 1,443)
single-read coverage: 92.9% (1,275 / 1,372 )
double-read coverage: 94.3% (67 / 71)
• 2.0% missed SNP rate (25 / 1247)
single-read coverage: 2.12% (25 / 1176)
double-read coverage: 0% (0 / 59)
Genome variation in melanogaster isolates
• 658,280 SNPs discovered among all 10 lines.
• Nucleotide diversity θ ≈ 5x10^-3 (1 SNP / 200 bp) between each line and reference (in
line with expectations).
• 20.2% (133,264 sites) polymorphic among two or more lines. The 1 SNP / 900 bp
nominal density is sufficient for high-resolution marker mapping
Mutational profiling in deep 454 data
Pichia stipitis reference sequence
Image from JGI web site
• collaboration with Doug Smith at Agencourt
• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)
• one specific mutagenized strain had especially high conversion efficiency
• goal was to determine where the mutations were that caused this phenotype
• we analyzed 10 runs (~3 million reads) of 454 reads (~20x coverage of the 15 Mb genome)
• processed the sequences with our 454 pipeline
• found 39 mutations (in as many reads as those in which we found 650K SNPs in melanogaster)
• informatics analysis in < 24 hours (including manual checking of all candidates)
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Bristol, N2 strain
(3 ½ machine runs)
Pasadena, CB4858
(1 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of
large model-organism genomes
• 5 runs (~120 million) Illumina reads from the Wash. U. Genome Center, as part of a
collaborative project led by Elaine Mardis, at Washington University
• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain
Polymorphism discovery in C. elegans
• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)
• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)
• SNP calling error rate very low:
Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)
• INDEL candidates validate and convert
at similar rates to SNPs:
Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)
Informatics of transcriptome sequencing
• novel transcript discovery
[Figure: inferred exons (Inferred Exon 1, Inferred Exon 2) for two cases: new genes & exons, and novel transcripts in known genes]
• measuring gene expression levels by sequence tag counting requires
SAGE informatics-like approaches
[Figure: counts per transcript based on SAGE data of C. elegans adult worm (Jones et al. 2001, GEO 24438); frequency (0 to 12,000) vs. count of SAGE tag, binned from 1 to >500]
Protein-DNA interactions: ChIP-Seq
Protein-bound DNA fragments are isolated
with chromatin immunoprecipitation
(ChIP) and then sequenced (Seq) on a high-throughput sequencing platform.
Sequences are mapped to the genome
sequence with a read alignment program.
Regions over-represented in the sequences
are identified.
Johnson et al. Science, 2007
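The last step, identifying regions over-represented in the mapped sequences, can be sketched as a fixed-window read count compared with the genome-wide average. This is an illustration only, not the method of the cited study: the function name, the window size, and the fold-enrichment threshold are assumptions.

```python
from collections import Counter

def enriched_windows(read_positions, genome_length, window=200, min_fold=5.0):
    """Return start coordinates of windows with at least min_fold times
    the average number of mapped reads per window."""
    counts = Counter(pos // window for pos in read_positions)
    n_windows = -(-genome_length // window)  # ceiling division
    expected = len(read_positions) / n_windows
    return sorted(w * window for w, c in counts.items() if c >= min_fold * expected)
```

A real analysis would also compare against a control sample and account for mappability, since repeats at the scale of the read length distort read counts.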
Protein-DNA interactions: ChIP-Seq
ChIP-Seq scales well for simultaneous analysis of binding sites in the
entire genome.
Mikkelsen et al. Nature 2007.
1000 Genomes – related work
1. Read mapping
Aligner / assembler – MOSAIK
Read accuracy / quality value analysis – ERRORICITY
Read simulations – ART
Software evaluation suite – BTA
2. Variant calling
SNP discovery program – GIGABAYES
SNP calling: # individuals vs. individual coverage
Structural Variation calling – SPANNER