Previous Lecture: Next-Generation
DNA Sequencing Technology
This Lecture
NGS Alignment
Slides are from Stratos Efstathiadis, Cole Trapnell, Steven Salzberg, Ben Langmead,
Thomas Keane, Dennis Wall, Jianhua Ruan, J Fass
Learning Objectives
• Challenge of NGS read alignment
• Suffix Arrays
• FM index
• BWA algorithm
• SAM/BAM format
• CIGAR string for variants
• Software to work directly with SAM/BAM files
• Sequence viewers
Short Read Alignment
• Given a reference and a set of reads, report at least
one “good” local alignment for each read if one
exists
– Approximate answer to: where in genome did the read originate?
• What is “good”? For now, we concentrate on:
– Fewer mismatches is better
    …TGATCATA…                  …TGATCATA…
      GATCAA      better than     GAGAAT
– Failing to align a low-quality base is better than failing to align a high-quality base
    …TGATATTA…                  …TGATcaTA…
      GATcaT      better than     GTACAT
– Match uniqueness of the alignment
Multiple mapping
• A single tag may occur more than once in the
reference genome.
• The user may choose to ignore tags that
appear more than n times.
• As n gets large, you get more data, but also
more noise in the data.
Inexact matching
• An observed tag may not exactly match any position in the reference
genome.
• Sometimes, the tag almost matches one or more positions.
• Such mismatches may represent a SNP (single-nucleotide polymorphism)
or a sequencing error (a bad read-out).
• The user can specify the maximum number of mismatches, or a Phred-style quality score threshold.
• As the number of allowed mismatches goes up, the number of mapped
tags increases, but so does the number of incorrectly mapped tags.
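The trade-off described above can be seen on a toy example. The following is a minimal sketch, assuming substitutions only and a brute-force scan of every position; the function name `find_hits` and the toy reference are made up for illustration and are not how a real aligner works.

```python
def find_hits(reference, read, max_mismatches):
    """Return (position, mismatches) for every placement within the limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Count positions where the read disagrees with the reference.
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "TGATCATATGATCAAA"  # made-up toy reference
read = "GATCAA"

exact = find_hits(reference, read, 0)    # one placement
relaxed = find_hits(reference, read, 1)  # a second placement appears
print(exact, relaxed)
```

Allowing one mismatch turns a uniquely placed read into one with two candidate locations, which is the increase in both mapped and ambiguously mapped tags that the slide describes.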
Read Length is Not As Important For Resequencing
[Figure: % of paired k-mers with uniquely assignable location (y-axis, 0% to 100%) versus length of k-mer reads in bp (x-axis, 8 to 20), plotted for E. coli and human. Source: Jay Shendure]
New alignment algorithms must address the
requirements and characteristics of NGS reads
– Hundreds of Millions of reads per run (30x genome coverage)
– Short Reads (as short as 36bp)
– Different types of reads (single-end, paired-end, mate-pair,
etc.)
– Base-calling quality factors (should the aligner use them?)
– Sequencing errors ( ~ 1%)
– Repetitive regions
– Sequencing sample vs. reference genome
– Must adjust to evolving sequencing technologies and data
formats
Index: an auxiliary data structure
Two classes of indexing algorithms:
(1) Hash tables (the “old” way)
Hash of Reads (MAQ, ELAND, ZOOM, …)
Smaller but variable memory requirements (they depend on
the number of reads).
Hash of Reference (SOAP, MOSAIK, … )
Predictable memory requirements.
(2) Suffix arrays (the “new” way)
BWA, Bowtie, SOAP2, …
Indexing
• Genomes and reads are too large for direct
approaches like dynamic programming
• Indexing is required
– Suffix tree
– Suffix array
– Seed hash tables
– Many variants, incl. spaced seeds
• Choice of index is key to performance
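To make the "seed hash table" idea concrete, here is a minimal sketch of a hash of the reference: every k-mer is stored with its positions, a read's first k-mer is looked up, and each candidate position is then verified. It assumes contiguous (not spaced) seeds and exact verification; the function names and the toy reference are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_verify(reference, index, read, k):
    """Look up the read's first k-mer, then check the full read at each candidate."""
    seed = read[:k]
    return [pos for pos in index.get(seed, [])
            if reference[pos:pos + len(read)] == read]


reference = "acaacgacaacg"  # made-up toy reference
index = build_kmer_index(reference, 3)
hits = seed_and_verify(reference, index, "caacg", 3)
print(hits)
```

The index is built once per reference, so its memory use is predictable, which matches the "hash of reference" row in the classification above.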
Suffix Array
Find "ctat” in the reference
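A suffix array lookup can be sketched as two binary searches over the sorted suffixes. This is a minimal sketch: the reference from the slide's figure is not available here, so the text `gctatcctatg$` is a made-up stand-in, and the array is built by plain sorting, which is only practical for toy inputs.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "gctatcctatg$"  # made-up reference, terminated by '$'
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ctat"))
```

Because all suffixes starting with the pattern are adjacent in the array, the search costs a logarithmic number of string comparisons regardless of how many times the pattern occurs.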
Is the process reversible ?
NGS Read Alignment
Burrows Wheeler Transformation (BWT)
• Invented by David Wheeler in 1983 (Bell Labs). Published in 1994.
“A Block Sorting Lossless Data Compression Algorithm”
Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment
Corporation, Burrows M, Wheeler DJ. 1994
• Originally developed for compressing large files (bzip2, etc.)
• Lossless, Fully Reversible
• Alignment Tools based on BWT: bowtie, BWA, SOAP2, etc.
• Approach:
– Align reads on the transformed reference genome, using an efficient index (FM index)
– Solve the simple problem first (align one character) and then build on that solution to solve a
slightly harder problem (two characters) etc.
• Results in great speed and efficiency gains (a few gigabytes of RAM for the entire human
genome). Other approaches require tens of gigabytes of memory and are much
slower.
FM-index
The Burrows-Wheeler Transform is the basis of the bzip2
file compression tool. BWT-based aligners use an FM-index (Ferragina
& Manzini), which allows efficient finding of substring
matches within compressed text. The search is
sub-linear with respect to the time and storage space
required for a given set of input data (reference
genome).
Reduced memory footprint, faster execution.
Burrows-Wheeler
• Store entire reference
genome.
• Align tag base by base from
the end.
• When tag is traversed, all
active locations are
reported.
• If no match is found, then
back up and try a
substitution.
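The backtracking idea above can be sketched on a toy reference. This is a minimal sketch, not the actual BWA or Bowtie implementation: it assumes substitutions only, tries every base within the mismatch budget rather than backing up only after an exact match fails, and counts occurrences naively. The function name `inexact_search` and the toy reference are made up for illustration.

```python
def inexact_search(text, read, max_mismatches):
    """Backward search over the BWT of text, allowing substitutions."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first = {}  # first row of the matrix that starts with each character
    for row, i in enumerate(sa):
        first.setdefault(text[i], row)
    alphabet = [c for c in first if c != "$"]
    hits = set()

    def extend(i, top, bot, mismatches):
        if i < 0:  # whole read consumed: report all active locations
            hits.update((sa[row], mismatches) for row in range(top, bot))
            return
        for ch in alphabet:
            cost = mismatches + (ch != read[i])  # a substitution costs 1
            if cost > max_mismatches:
                continue
            new_top = first[ch] + bwt[:top].count(ch)
            new_bot = first[ch] + bwt[:bot].count(ch)
            if new_top < new_bot:  # some rows still match: keep going
                extend(i - 1, new_top, new_bot, cost)

    # Align the read base by base from its last character.
    extend(len(read) - 1, 0, len(bwt), 0)
    return sorted(hits)


print(inexact_search("TGATCATATGATCAAA$", "GATCAA", 1))
```

Each result is a pair of 0-based reference position and number of substitutions used.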
Jianhua Ruan
The University of Texas at San Antonio
Burrows-Wheeler Transform (BWT)
T = acaacg$
Sorted rotations of T:
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
BWT(T) = gc$aaac (the last column)
Burrows-Wheeler Matrix (BWM), each row prefixed by the starting position in T of its suffix:
7  $acaacg
3  aacg$ac
1  acaacg$
4  acg$aca
2  caacg$a
5  cg$acaa
6  g$acaac
See the suffix array?
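The link between the matrix and the suffix array can be checked directly. The sketch below computes the BWT of the slide's example `acaacg$` two ways: from the sorted rotations, and from the suffix array by taking the character just before each suffix. Function names are made up for illustration, positions are 0-based, and plain sorting is used, which is fine for a toy string but not for a genome.

```python
def bwt_from_rotations(text):
    """Sort all rotations of text and read off the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations), rotations


def bwt_from_suffix_array(text):
    """Same transform via the suffix array: take the character before each suffix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps around to the final '$' when i == 0.
    return "".join(text[i - 1] for i in sa), sa


bwt, rotations = bwt_from_rotations("acaacg$")
bwt_sa, sa = bwt_from_suffix_array("acaacg$")
print(bwt, bwt_sa, sa)
```

Both routes give the same string because the unique, smallest `$` makes sorting rotations equivalent to sorting suffixes.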
Key observation
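T = a1 c1 a2 a3 c2 g1 $1 (each character numbered by its occurrence in T)
“last first (LF) mapping”
The i-th occurrence of character X in the
last column corresponds to
the same text character as the i-th
occurrence of X in the first column.
Occurrence number of the first character, row, occurrence number of the last character:
1  $acaacg  1
2  aacg$ac  1
1  acaacg$  1
3  acg$aca  2
1  caacg$a  1
2  cg$acaa  3
1  g$acaac  2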
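The LF mapping is what makes the transform reversible. Here is a minimal sketch that rebuilds T from BWT(T) alone by walking the mapping backwards from the row that starts with `$`. The function name is made up for illustration, and occurrences are numbered top to bottom in the last column, which gives the same mapping as the numbering shown above.

```python
def inverse_bwt(bwt):
    """Rebuild the original text from its BWT using the LF mapping."""
    counts = {}
    ranks = []  # ranks[row]: how many times bwt[row] appeared above this row
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    first = {}  # first[ch]: first row whose first-column character is ch
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    row = 0  # row 0 is the rotation that starts with '$'
    text = "$"
    while bwt[row] != "$":
        ch = bwt[row]
        text = ch + text             # the last-column character precedes the row
        row = first[ch] + ranks[row]  # LF: jump to the row that starts with it
    return text


print(inverse_bwt("gc$aaac"))
```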
Exact match (another example)
BWT(agcagcagact) = tgcc$ggaaaac
Search for pattern: gca
[Figure: the Burrows-Wheeler matrix below, shown four times, once per stage of the search for gca]
$agcagcagact
act$agcagcag
agact$agcagc
agcagact$agc
agcagcagact$
cagact$agcag
cagcagact$ag
ct$agcagcaga
gact$agcagca
gcagact$agca
gcagcagact$a
t$agcagcagac
Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/
Exact Matching with FM Index
• To match Q in T using BWT(T), repeatedly apply
rule:
top = LF(top, qc); bot = LF(bot, qc)
– Where qc is the next character in Q (right-to-left) and
LF(i, qc) maps row i to the row whose first character
corresponds to i’s last character as if it were qc
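The rule above can be sketched on the `gca` example. This is a minimal sketch under simplifying assumptions: it keeps the full suffix array instead of a sample, counts occurrences by scanning the BWT instead of using checkpoints, and uses a half-open row range `[top, bot)`. The function names are made up for illustration.

```python
def build_fm_index(text):
    """Suffix array, BWT, and the first row starting with each character."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first = {}
    for row, i in enumerate(sa):
        first.setdefault(text[i], row)
    return sa, bwt, first


def occ(bwt, ch, row):
    """Occurrences of ch in bwt[:row] (naive; a real index uses checkpoints)."""
    return bwt[:row].count(ch)


def backward_search(text, pattern):
    """0-based positions of pattern in text, matching right to left."""
    sa, bwt, first = build_fm_index(text)
    top, bot = 0, len(bwt)  # start with every row active
    for qc in reversed(pattern):
        if qc not in first:
            return []
        top = first[qc] + occ(bwt, qc, top)  # top = LF(top, qc)
        bot = first[qc] + occ(bwt, qc, bot)  # bot = LF(bot, qc)
        if top >= bot:
            return []
    return sorted(sa[row] for row in range(top, bot))


print(backward_search("agcagcagact$", "gca"))
```

On this example the active range shrinks from the four rows starting with `a`, to the two rows starting with `ca`, to the two rows starting with `gca`.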
FM Index is Small
• Entire FM Index on DNA reference consists of:
– BWT (same size as T), assuming 2-bit-per-base encoding and no compression, as in Bowtie
– Checkpoints (~15% size of T), assuming a 16-byte checkpoint every 448 characters, as in Bowtie
– SA sample (~50% size of T), assuming Bowtie defaults for suffix-array sampling rate, etc.
• Total: ~1.65x the size of T
[Figure: index size comparison; bars labeled ~1.65x, >45x, >15x, >15x]
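The percentages above can be reproduced with a little arithmetic. The 2 bits per base and the 16-byte checkpoint every 448 characters come from the slide; the suffix-array figures (one 4-byte entry kept for every 32 rows) are an assumption chosen to be consistent with the stated ~50%, not something the slide spells out.

```python
BITS_PER_BASE = 2          # from the slide: 2-bit-per-base encoding
CHECKPOINT_BYTES = 16      # from the slide
CHECKPOINT_INTERVAL = 448  # from the slide: one checkpoint per 448 characters
SA_ENTRY_BYTES = 4         # assumption: 32-bit suffix-array entries
SA_SAMPLE_INTERVAL = 32    # assumption: keep every 32nd entry


def fm_index_ratio():
    """Size of each FM index component relative to the 2-bit encoded text T."""
    text_bytes_per_base = BITS_PER_BASE / 8
    checkpoints = (CHECKPOINT_BYTES / CHECKPOINT_INTERVAL) / text_bytes_per_base
    sa_sample = (SA_ENTRY_BYTES / SA_SAMPLE_INTERVAL) / text_bytes_per_base
    return {
        "bwt": 1.0,
        "checkpoints": checkpoints,
        "sa_sample": sa_sample,
        "total": 1.0 + checkpoints + sa_sample,
    }


for name, ratio in fm_index_ratio().items():
    print(f"{name}: {ratio:.3f} x size of T")
```

Under these assumptions the checkpoints come to about 14% of T and the total to about 1.64x, in line with the rounded figures on the slide.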
FASTQ Format:
• The de-facto file format for sharing sequence read data
• Sequence and a per-base quality score
SAM (Sequence Alignment/Map) format:
• A unified format for storing read alignments to a reference
genome.
• Generally large files as plain text (about a byte per bp)
• Designed to be compact yet computationally efficient to access.
BAM (Binary Alignment/Map) format:
• A Binary equivalent to SAM.
• Developed for fast processing and indexing
http://bioinformatics.oxfordjournals.org/cgi/reprint/btp352v1
Short-read mapping software
Software   Technique       Developer           License
Eland      Hashing reads   Illumina            ?
SOAP       Hashing refs    BGI                 Academic
Maq        Hashing reads   Sanger (Li, Heng)   GNU GPL
Bowtie     BWT             Salzberg/UMD
BWA        BWT             Sanger (Li, Heng)   GNU GPL
SOAP2      BWT & hashing   BGI                 GNU GPL, Academic
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
FASTQ Files
@HWUSI-EAS610_0001:3:1:4:1405#0/1
GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
+
B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
Line 1 is the sequence ID (Illumina), line 2 the 36 bp read, and line 4 its 36 quality scores.
• The de facto file format for sharing DNA sequence read data
• 4 lines per read
• Sequence line and a per-base Phred quality score line per read
• FASTQ files are text files
• There is no file header
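Because every read occupies exactly four lines and there is no header, a FASTQ file can be read with a very small loop. The sketch below parses the record shown above from an in-memory string. It assumes well-formed, unwrapped records and the Q + 33 quality encoding; the function name `read_fastq` is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


example = io.StringIO(
    "@HWUSI-EAS610_0001:3:1:4:1405#0/1\n"
    "GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG\n"
    "+\n"
    "B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##\n"
)
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = [ord(c) - 33 for c in qual]  # assumes ASCII = Q + 33
print(name, len(seq), scores[:5])
```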
An Introduction to Phred Quality Score
$Q_{\text{Phred}} = -10 \log_{10}(\varepsilon)$, equivalently $\varepsilon = 10^{-Q_{\text{Phred}}/10}$
ε is the error probability: the
probability that a base call is wrong.
Q: Phred quality score
Q     ε        Probability the base call is wrong (confidence)
40    0.0001   0.01% (99.99%)
30    0.001    0.1% (99.9%)
20    0.01     1% (99%)
10    0.1      10% (90%)
• Phred Quality Score encoding in FASTQ/SAM files: ASCII Character = Q + 33
• FASTQ Files: Q represents Base Call Quality: Probability the base call is wrong.
• SAM Files: Q represents Mapping Quality: Probability the mapping position of the read is incorrect.
$ perl -e 'print chr(33);'
http://en.wikipedia.org/wiki/FASTQ_format
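The conversions between quality score, error probability and ASCII character can be sketched in a few lines. The function names are made up for illustration; the offset of 33 is the one stated above.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability: -10 * log10(p)."""
    return -10 * math.log10(p)


def encode_phred33(q):
    """ASCII character used in FASTQ/SAM for an integer score Q (Q + 33)."""
    return chr(q + 33)


def decode_phred33(ch):
    """Integer score Q for a quality character."""
    return ord(ch) - 33


for q in (10, 20, 30, 40):
    print(q, phred_to_error(q), encode_phred33(q))
```

Score 0 maps to the character with code 33, which is what the Perl one-liner above prints.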
The SAM file
SAM data fields
Mapping Quality (MAPQ) in BWA
Mapping Quality is a function of Edit Distance and the Uniqueness of the alignment.
BWA Mapping Quality
MAPQ     Meaning
0        A read aligns equally well to multiple positions (hits). BWA randomly picks one of the positions and assigns MAPQ=0
1 – 23
25       Only 1 best hit (with no suboptimal hits) with more than 2 mismatches, or only 1 best hit with 1 suboptimal hit
37       Only 1 best hit (no suboptimal hits), with up to 2 mismatches (edit distance could be more than 2)
SAM/BAM format
Header section
@HD VN:1.0
@SQ SN:chr20 AS:HG18 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
Alignment section
Columns: query name, flag, ref sequence, position of alignment, mapping quality, CIGAR, mate ref sequence, mate position, insert size, query sequence (same strand as ref), query quality, tags
V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1
V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1
V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2
V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1
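An alignment line is tab-separated text, so its fields, flag bits and CIGAR operations can be pulled apart with a short parser. This is a minimal sketch, not a replacement for samtools or Picard: the function names are made up, only the eight lowest flag bits are decoded, and the example line reuses the last record above with its quality string replaced by `*` to keep the string literal simple. The second CIGAR string is a made-up example of an alignment with an insertion and a deletion.

```python
import re

FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string such as 36M into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def parse_sam_line(line):
    """Parse the 11 mandatory SAM fields plus any optional tags."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "mapq": int(f[4]), "cigar": parse_cigar(f[5]), "rnext": f[6],
        "pnext": int(f[7]), "tlen": int(f[8]), "seq": f[9], "qual": f[10],
        "tags": f[11:],
    }


line = ("V00-HWI-EAS132:4:29:113:1934#0\t99\tchr1\t44\t255\t36M\t=\t107\t0\t"
        "GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC\t*\tNM:i:1")
record = parse_sam_line(line)
print(record["pos"], record["cigar"], decode_flag(record["flag"]))
print(parse_cigar("3M1I3M1D5M"))  # made-up CIGAR with an insertion and a deletion
```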
Post-processing: Tools and programming APIs for
parsing and manipulating alignments:
Samtools: http://samtools.sourceforge.net/
Convert SAM to BAM and vice versa
Sort and Index BAM files
Merge multiple BAM files
Show alignments in text viewer
Remove Duplicates from PCR amplification step
Picard Tools: (Java-based)
http://picard.sourceforge.net/index.shtml
BAM is Indexed & Binary Compressed SAM
• BAM is indexed by genome location.
• The software toolkit allows other software to
extract sequence data from a BAM file at a specific
genomic location, so the entire data file does not need to be held
in memory while the program runs.
SAMTools/Picard
http://samtools.sourceforge.net/
• SAMTools is a simple toolkit to transform SAM
to BAM, merge, sort, index
• Can also calculate statistics (mean quality,
depth of coverage, etc.)
• filter duplicate reads
• create multiple alignments of all reads over a
genomic interval
• call variants
SAMTools commands
• Download and install SAMTools
http://sourceforge.net/projects/samtools/files/samtools/
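Make the BAM file (ref.fa.fai is the index written by samtools faidx ref.fa):
samtools view -bt ref.fa.fai -o data.bam data.sam
Sort the BAM file (samtools 0.1.x syntax, writes data.sorted.bam; newer releases use samtools sort -o data.sorted.bam data.bam):
samtools sort data.bam data.sorted
Index the BAM file (writes data.sorted.bam.bai):
samtools index data.sorted.bam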
http://www.htslib.org/doc/sam.html
http://samtools.sourceforge.net/samtools-c.shtml
Manual Reference Pages - samtools (1)
DESCRIPTION
• Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence Alignment/Map)
format, does sorting, merging and indexing, and allows to retrieve
reads in any regions swiftly.
SYNOPSIS
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
samtools sort aln.bam aln.sorted
samtools index aln.sorted.bam
samtools idxstats aln.sorted.bam
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
samtools merge out.bam in1.bam in2.bam in3.bam
samtools faidx ref.fasta
samtools pileup -vcf ref.fasta aln.sorted.bam
samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
samtools tview aln.sorted.bam ref.fasta
bcftools index in.bcf
bcftools view in.bcf chr2:100-200 > out.vcf
bcftools view -vc in.bcf > out.vcf 2> out.afs
Heng Li from the Sanger Institute wrote the C version of samtools.
Picard (Java) Toolkit
• AddCommentsToBam
• AddOrReplaceReadGroups
• BamToBfq
• BamIndexStats
• BedToIntervalList
• BuildBamIndex
• CalculateHsMetrics
• CleanSam
• CollectAlignmentSummaryMetrics
• CollectBaseDistributionByCycle
• CollectGcBiasMetrics
• CollectInsertSizeMetrics
• CollectMultipleMetrics
• CollectTargetedPcrMetrics
• CollectRnaSeqMetrics
• CollectWgsMetrics
• CompareSAMs
• CreateSequenceDictionary
• DownsampleSam
• ExtractIlluminaBarcodes
• EstimateLibraryComplexity
• FastqToSam
• FifoBuffer
• FilterSamReads
• FilterVcf
• FixMateInformation
• GatherBamFiles
• GenotypeConcordance
• IlluminaBasecallsToFastq
• IlluminaBasecallsToSam
• CheckIlluminaDirectory
• IntervalListTools
• MakeSitesOnlyVcf
• MarkDuplicates
• MarkDuplicatesWithMateCigar
• MeanQualityByCycle
• MergeBamAlignment
• MergeSamFiles
• MergeVcfs
• NormalizeFasta
• ExtractSequences
• QualityScoreDistribution
• ReorderSam
• ReplaceSamHeader
• RevertSam
• RevertOriginalBaseQualitiesAndAddMateCigar
• SamFormatConverter
• SamToFastq
• SortSam
• SortVcf
• UpdateVcfSequenceDictionary
• VcfFormatConverter
• MarkIlluminaAdapters
• SplitVcfs
• ValidateSamFile
• ViewSam
• VcfToIntervalList
Visualization
• BAM file format also simplifies visualization of NGS
data.
• Must make a .BAI index for each BAM file using
samtools index command.
• BAM and .BAI files must be located on your own
computer
• Index allows viewer to quickly find and load only read
data for a specific genomic interval.
• Integrative Genomics Viewer (IGV) from Broad Institute
is our current favorite.
• Use the Java Webstart to download and run IGV (use
the 1.2 GB version):
http://www.broadinstitute.org/igv/
Summary
• Challenge of NGS read alignment
• Suffix Array
• FM index
• BWA algorithm
• SAM/BAM format
• CIGAR string for variants
• Software to work directly with SAM/BAM files
• Sequence viewers
Next Lecture: ChIP-seq