T - STAT 115

High Throughput Sequencing
Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520
First Generation
• Sanger sequencing: sequencing and detection are 2
separate steps; throughput: 384 * 1kb / 3 hours
Second Generation
• Massively parallel sequencing by synthesis
• Many different technologies: Illumina, 454,
SOLiD, Helicos, etc
• Illumina: HiSeq, MiSeq, NextSeq
• 1-16 samples
• 25M-4B reads
• 30-300bp
• 1-8 days
• 15GB-1TB output
• Moving targets
Illumina Library Prep
Illumina Cluster Generation
• Amplify sequenced
fragments in place
on the flow cell
• Can sequence from
both the pink and
purple adapters
(Paired-end seq)
• Can multiplex
many samples /
lane
Illumina Sequencing
Third Generation
• Single molecule sequencing: no amplification
• Fewer but much longer reads
• Good for genome sequencing, but not for read
count applications
http://www.youtube.com/watch?v=v8p4ph2MAvI
High Throughput Sequencing
• Big (data), fast (speed), cheap (cost),
flexible (applications)
• Bioinformatic analyses become the bottleneck
High Throughput Sequencing
Data Analysis
FASTQ File
• Format
– Sequence ID, sequence
– Quality ID, quality score
• Quality score using ASCII (higher -> better)
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
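The four-line record layout and the ASCII quality encoding can be sketched as follows. This is a minimal illustration, not a full FASTQ parser: the function names are hypothetical, and the offset of 64 is an assumption based on the older Illumina-style read IDs in the example (current Illumina output uses an offset of 33).

```python
def read_fastq(lines):
    """Yield (seq_id, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # the '+' quality ID line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual   # drop the leading '@'

def phred_scores(quality, offset=64):
    """Convert ASCII quality characters to Phred scores (offset is assumed)."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

record = [
    "@HWI-EAS305:1:1:1:991#0/1",
    "GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT",
    "+HWI-EAS305:1:1:1:991#0/1",
    "MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB",
]
seq_id, seq, qual = next(read_fastq(record))
scores = phred_scores(qual)
print(seq_id, scores[0], scores[-1])
```

Under the assumed offset, the first base ('M') scores 13 and the trailing 'B' characters score 2, which matches the usual drop in quality toward the end of a read.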
FASTQC: Sequencing Quality
Read Mapping
• Mapping hundreds of millions of reads back
to the reference genome is CPU and RAM
intensive and slow
• Read quality decreases with length, so reads carry
small errors (single nucleotide mismatches or indels)
• Most mappers allow ~2 mismatches within the
first 30bp (the 28 matched bases give 4 ^ 28
possible sequences, which could still uniquely
identify most 30bp sequences in a 3GB
genome); mapping is slower when allowing indels
• Mapping output: SAM (BAM) or BED
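The 4 ^ 28 argument can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it treats the genome as random sequence and ignores repeats and the choice of mismatch positions, so it only shows the order of magnitude. The variable names are illustrative.

```python
genome_size = 3_000_000_000      # ~3 billion positions in the genome
matched_bases = 30 - 2           # 30bp seed with 2 mismatches allowed

distinct_sequences = 4 ** matched_bases
print(distinct_sequences)                  # 72057594037927936
print(distinct_sequences / genome_size)    # possible sequences per genome position
```

Because the number of possible 28-base sequences exceeds the number of genome positions by roughly seven orders of magnitude, a chance match elsewhere in a random genome is unlikely; real genomes contain repeats, which is why some reads still map to multiple locations.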
Spaced seed alignment
• Tags and tag-sized pieces of
reference are cut into small
“seeds.”
• Pairs of spaced seeds are
stored in an index.
• Look up spaced seeds for
each tag.
• For each “hit,” confirm the
remaining positions.
• Report results to the user.
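The seed-and-confirm steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular mapper: each read-sized window is cut into 4 seeds, every pair of seeds is stored in a dictionary index, and a hit is confirmed by counting mismatches over the whole read. With at most 2 mismatches, at least 2 of the 4 seeds must match exactly, so some seed pair always finds the true location. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cut_seeds(seq, n_seeds=4):
    """Cut a tag (or tag-sized reference piece) into n_seeds equal seeds."""
    size = len(seq) // n_seeds
    return [seq[i * size:(i + 1) * size] for i in range(n_seeds)]

def build_index(ref, read_len, n_seeds=4):
    """Store every pair of seeds from every read-sized reference window."""
    index = defaultdict(list)
    for pos in range(len(ref) - read_len + 1):
        seeds = cut_seeds(ref[pos:pos + read_len], n_seeds)
        for a, b in combinations(range(n_seeds), 2):
            index[(a, b, seeds[a], seeds[b])].append(pos)
    return index

def map_read(read, ref, index, max_mismatches=2, n_seeds=4):
    """Look up seed pairs, then confirm the remaining positions of each hit."""
    seeds = cut_seeds(read, n_seeds)
    hits = set()
    for a, b in combinations(range(n_seeds), 2):
        for pos in index.get((a, b, seeds[a], seeds[b]), []):
            window = ref[pos:pos + len(read)]
            mismatches = sum(x != y for x, y in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((pos, mismatches))
    return sorted(hits)

ref = "ACGTACGTTTGACCAGTAGGCATCGATCGGATTACA"
read = "TAGACCAGTCGGCATC"   # ref[8:24] with 2 substitutions
index = build_index(ref, len(read))
print(map_read(read, ref, index))
```

The index grows with the number of seed pairs per window (6 for 4 seeds), which illustrates why this family of mappers is memory-hungry.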
Burrows-Wheeler
• Store entire reference
genome.
• Align tag base by base from
the end.
• When tag is traversed, all
active locations are
reported.
• If no match is found, then
back up and try a
substitution.
Trapnell & Salzberg, Nat Biotech 2009
Burrows-Wheeler Transform
• Reversible permutation used originally in compression
[Figure: text T, its Burrows-Wheeler Matrix, and the last column BWT(T); encoding for compression: gc$ac with flags 1111001]
• Once BWT(T) is built, all else shown here is discarded
– Matrix will be shown for illustration only
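A minimal sketch of the transform, built the naive way by sorting all rotations. The example text `acaacg$` is an assumption chosen for illustration (its transform collapses to the gc$ac / 1111001 pair shown on the slide), and reading that pair as a run-length encoding is an interpretation rather than something the slide states. Real indexes build the transform from a suffix array instead of materializing the matrix.

```python
def bwt(text):
    """Return BWT(text); text is assumed to end with a unique '$' sentinel."""
    # Burrows-Wheeler Matrix: all rotations of the text, sorted.
    matrix = sorted(text[i:] + text[:i] for i in range(len(text)))
    # BWT(T) is the last column of the matrix.
    return "".join(row[-1] for row in matrix)

def run_length_flags(s):
    """Collapse runs of equal characters; a 1 flag marks the start of each run."""
    chars, flags, prev = [], [], None
    for ch in s:
        if ch != prev:
            chars.append(ch)
            flags.append("1")
        else:
            flags.append("0")
        prev = ch
    return "".join(chars), "".join(flags)

last_column = bwt("acaacg$")
print(last_column)                    # gc$aaac
print(run_length_flags(last_column))  # ('gc$ac', '1111001')
```

The transform tends to group identical characters into runs, which is what makes the last column easier to compress than the original text.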
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment
Corporation, Palo Alto, CA 1994, Technical Report 124; 1994
Slides from Ben Langmead
Burrows-Wheeler Transform
• Property that makes BWT(T) reversible is “LF Mapping”
– ith occurrence of a character in Last column is same
text occurrence as the ith occurrence in First column
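[Figure: Burrows-Wheeler Matrix for T with BWT(T) as its last column; the character with Rank: 2 in the last column is the same text occurrence as the character with Rank: 2 in the first column]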
Slides from Ben Langmead
Burrows-Wheeler Transform
• To recreate T from BWT(T), repeatedly apply rule:
T = BWT[ LF(i) ] + T; i = LF(i)
– Where LF(i) maps row i to row whose first character
corresponds to i’s last per LF Mapping
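The reconstruction rule can be sketched as below. This is an illustrative version under assumptions: the input `gc$aaac` is the transform of an assumed example text `acaacg$`, the function names are hypothetical, and the loop prepends the last-column character of the current row before stepping to LF(i), which is one possible ordering of the rule above.

```python
def lf_mapping(last):
    """LF(i) for every row i, computed from the last column alone."""
    # Rank of each character occurrence within the last column.
    seen, ranks = {}, []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row where each character's block starts in the (sorted) first column.
    first_row, total = {}, 0
    for ch in sorted(seen):
        first_row[ch] = total
        total += seen[ch]
    # The ith occurrence in Last is the ith occurrence in First.
    return [first_row[ch] + rank for ch, rank in zip(last, ranks)]

def inverse_bwt(last):
    """Rebuild T from BWT(T) by walking LF from the row that starts with '$'."""
    lf = lf_mapping(last)
    i, text = 0, "$"          # row 0 begins with the sentinel
    while last[i] != "$":
        text = last[i] + text  # prepend the character preceding the current one
        i = lf[i]
    return text

print(lf_mapping("gc$aaac"))   # [6, 4, 0, 1, 2, 3, 5]
print(inverse_bwt("gc$aaac"))  # acaacg$
```

Only the last column and two small count tables are needed, which is why the matrix itself can be discarded once BWT(T) is built.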
Final T
Slides from Ben Langmead
Exact Matching with FM Index
• To match Q in T using BWT(T), repeatedly apply rule:
top = LF(top, qc); bot = LF(bot, qc)
– Where qc is the next character in Q (right-to-left) and
LF(i, qc) maps row i to the row whose first character
corresponds to i’s last character as if it were qc
Slides from Ben Langmead
Exact Matching with FM Index
• In progressive rounds, top & bot delimit the range of
rows beginning with progressively longer suffixes of Q
(from right to left)
• If range becomes empty the query suffix (and therefore
the query) does not occur in the text
• If no match, instead of giving up, try to “backtrack” to a
previous position and try a different base (mismatch,
much slower)
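Exact matching with top and bot can be sketched as follows. This is a minimal illustration, assuming a half-open row range [top, bot) and an uncompressed occurrence table; the input `gc$aaac` is the transform of an assumed example text `acaacg$`, and the names are hypothetical. Mismatch backtracking is not implemented here.

```python
def build_fm(last):
    """Return (first_row, occ): block starts in First and prefix counts in Last."""
    alphabet = sorted(set(last))
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] for c in alphabet}
    for ch in last:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    first_row, total = {}, 0
    for c in alphabet:
        first_row[c] = total
        total += occ[c][-1]
    return first_row, occ

def exact_match_range(query, last):
    """Return the [top, bot) row range whose rows start with query, or None."""
    first_row, occ = build_fm(last)
    top, bot = 0, len(last)
    for qc in reversed(query):          # consume the query right-to-left
        if qc not in first_row:
            return None
        top = first_row[qc] + occ[qc][top]
        bot = first_row[qc] + occ[qc][bot]
        if top >= bot:                  # empty range: suffix does not occur
            return None
    return top, bot

print(exact_match_range("aac", "gc$aaac"))  # (1, 2)
print(exact_match_range("ac", "gc$aaac"))   # (2, 4)
print(exact_match_range("gg", "gc$aaac"))   # None
```

The number of occurrences is bot - top; turning rows into text positions needs a (sampled) suffix array, which this sketch leaves out.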
Slides from Ben Langmead
Seq Files
• Raw FASTQ
– Sequence ID, sequence
– Quality ID, quality score
• Mapped SAM
– FLAG: 0 mapped forward strand, 4 unmapped,
16 mapped reverse strand
– XA (mapper-specific)
– MD: mismatch info
– NM: number of mismatches (edit distance)
• Mapped BED
– Chr, start, end, strand
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
HWUSIEAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSIEAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1
HWUSIEAS366_0112:6:1:1315:19529#0/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0
chr1 123450 123500 +
chr5 28374615 28374615 -
http://samtools.sourceforge.net/SAM1.pdf
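Reading one SAM alignment line can be sketched as follows. This is a minimal illustration, not a replacement for a SAM library: the function name is hypothetical, the example line is the first SAM record from the slide with its line-wrapped fields rejoined, and splitting on any whitespace is an assumption that works here (real SAM files are tab-delimited).

```python
def parse_sam_line(line):
    """Parse the mandatory SAM fields plus optional TAG:TYPE:VALUE entries."""
    fields = line.split()
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:
        name, value_type, value = item.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return {
        "name": fields[0],
        "unmapped": bool(flag & 4),    # FLAG bit 0x4: read is unmapped
        "reverse": bool(flag & 16),    # FLAG bit 0x10: mapped to reverse strand
        "chrom": fields[2],
        "pos": int(fields[3]),         # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
    }

EXAMPLE = (
    r"HWUSIEAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 "
    r"TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG "
    r"Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2"
)
rec = parse_sam_line(EXAMPLE)
print(rec["chrom"], rec["pos"], rec["reverse"], rec["tags"])
```

For this record the flag 16 marks a reverse-strand alignment, and NM:i:2 agrees with the two mismatched bases listed in MD:Z:3C30T3.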
Mapping Statistics Terms
• Mappable locations: reads that can be matched
to at least one location in the genome
• Uniquely mapped reads: reads that can be matched
to a single location in the genome
– Affected by repeat sequences in the genome; read-length dependent
• Uniquely mapped locations: number of
unique locations hit by uniquely mapped
reads
– Redundancy: potential PCR amplification bias
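These terms can be made concrete with a small counting sketch. The input layout (a dict from read ID to its list of matching locations), the function name, and the redundancy formula (fraction of uniquely mapped reads that duplicate an already-seen location) are all assumptions made for illustration; the slides do not define a formula.

```python
def mapping_stats(hits):
    """hits: dict of read_id -> list of (chrom, start, strand) locations."""
    mappable = [r for r, locs in hits.items() if len(locs) >= 1]
    unique_reads = [r for r, locs in hits.items() if len(locs) == 1]
    # Distinct genome locations hit by the uniquely mapped reads.
    unique_locations = {hits[r][0] for r in unique_reads}
    redundancy = (
        1 - len(unique_locations) / len(unique_reads) if unique_reads else 0.0
    )
    return {
        "mappable_reads": len(mappable),
        "uniquely_mapped_reads": len(unique_reads),
        "uniquely_mapped_locations": len(unique_locations),
        "redundancy": redundancy,
    }

toy_hits = {
    "read1": [("chr1", 100, "+")],
    "read2": [("chr1", 100, "+")],                    # duplicate of read1's location
    "read3": [("chr2", 50, "-")],
    "read4": [("chr1", 5, "+"), ("chr3", 9, "+")],    # multi-mapped (repeat)
    "read5": [],                                      # unmapped
}
stats = mapping_stats(toy_hits)
print(stats)
```

In the toy data, 4 reads are mappable, 3 are uniquely mapped, and they hit only 2 distinct locations, which is the kind of gap that points to PCR amplification bias.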
Summary
• Sequencing technologies
– 1st, 2nd, 3rd generation
• Sequence quality assessment
– FASTQC
• Read mapping
– Spaced seed
– BWA: Burrows-Wheeler transform, LF
mapping