GNUMAP-SNP Nathan Clement The University of Texas Austin, TX, USA

Outline

Motivation
 NGS Issues and Requirements
 Pair-HMM
Memory Optimizations
 Results
 Conclusion

Motivation
Mutation Detection:

SNP discovery
 HapMap and resequencing
Species Identification
 Bisulfite Sequencing

 Epigenetic influences
 RNA editing
Error Rates*
Instrument               Run Time   Mb/run      Bases/read   Primary Error Type   Error Rate (%)
3730xl (Capillary)       2 h        0.06        650          Substitution         0.1-1
454 FLX+                 18-20 h    900         700          Indel                1
Illumina HiSeq2000       10 days    ≤ 600,000   100+100      Substitution         ≥0.1
Ion Torrent – 318 chip   2 h        >1000       >100         Indel                ~1
PacBio RS                0.5-2 h    5-10        860-1100     CG Deletions         16
* Data current as of May 2011:
Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011
Pair-HMM
Pair-wise Alignment:
[Figure: two sequences aligned column by column; identical bases are joined by "|", and "--" marks a gap in either sequence]
Equivalent Hidden Markov State Sequence:
[Figure: state path from Begin to End through match states M (emissions pAA, pTT, pGG, pCA, pCC) and gap states GX and GY (emission q), linked by transitions TMM, TMG, and TGM]
Pair-HMM (Mathematics)

Match

Gap (in both directions)
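The equations on this slide did not survive extraction. As a hedged reference point only, the textbook forward recurrences for a three-state pair-HMM are given here, written with the symbols from the state diagram (p for match emissions, q for gap emissions, TMM, TMG, and TGM for transitions). The gap-extension transition T_GG, the forward variables f, and the sequence names x and y are assumptions that do not appear in the slides, and GNUMAP-SNP may parameterize the model differently.

```latex
\begin{align*}
f^{M}(i,j)   &= p_{x_i y_j}\Bigl[T_{MM}\, f^{M}(i-1,j-1)
                + T_{GM}\bigl(f^{G_X}(i-1,j-1) + f^{G_Y}(i-1,j-1)\bigr)\Bigr] \\
f^{G_X}(i,j) &= q\,\Bigl[T_{MG}\, f^{M}(i-1,j) + T_{GG}\, f^{G_X}(i-1,j)\Bigr] \\
f^{G_Y}(i,j) &= q\,\Bigl[T_{MG}\, f^{M}(i,j-1) + T_{GG}\, f^{G_Y}(i,j-1)\Bigr]
\end{align*}
```

The first line is the match case; the last two are the gap case in each direction.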
Pair-HMM (M)
     a     t     a     c     g     a     c     t
a  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
Pair-HMM (X)
     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Pair-HMM (Y)
     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Pair-HMM
     A     C     G     T
a  1.00  0.00  0.00  0.00
g  0.00  0.00  0.68  0.31
t  0.32  0.00  0.00  0.68
a  0.99  0.00  0.00  0.00
g  0.00  0.00  1.00  0.00
a  1.00  0.00  0.00  0.00
c  0.00  1.00  0.00  0.00
c  0.00  1.00  0.00  0.00
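One way to read the four tables above is that the final A/C/G/T table collects, for each column of the M, X, and Y matrices, the posterior mass contributed by each row's base. The sketch below is a guess at that bookkeeping, not GNUMAP-SNP's code: the function name, the row/column roles, and the choice to add the Y matrix but not the X matrix are all assumptions inferred from the example numbers. With the slide's values it reproduces the final table apart from one 0.31 versus 0.32 rounding difference.

```python
BASES = "ACGT"


def sparse(entries, size=8):
    """Build a size x size grid of zeros with the given (row, col, value) cells."""
    grid = [[0.0] * size for _ in range(size)]
    for row, col, value in entries:
        grid[row][col] = value
    return grid


# Row labels of the slide's matrices (assumed here to be the bases being added).
READ = "agtagacc"

# Non-zero cells of the slide's M and Y matrices.
M_POST = sparse([(0, 0, 1.00), (1, 1, 0.68), (2, 1, 0.32), (2, 2, 0.68),
                 (3, 2, 0.32), (3, 3, 0.68), (4, 4, 1.00), (5, 5, 1.00),
                 (6, 6, 1.00), (7, 7, 1.00)])
Y_POST = sparse([(3, 3, 0.31)])


def accumulate(read, match, gap):
    """Sum posterior mass per matrix column, grouped by the row's base."""
    n_cols = len(match[0])
    table = [[0.0] * len(BASES) for _ in range(n_cols)]
    for i, base in enumerate(read.upper()):
        k = BASES.index(base)
        for j in range(n_cols):
            table[j][k] += match[i][j] + gap[i][j]
    return table


if __name__ == "__main__":
    for row in accumulate(READ, M_POST, Y_POST):
        print("  ".join(f"{value:.2f}" for value in row))
```

Each output row holds the A, C, G, T mass for one column position, so the fourth row combines the 0.68 from M with the 0.31 from Y.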
Expected Results
CHR    POS       TOT     A      C       G       T       SNP?       PVAL
chrX   1755234   17.00   0.00   0.00    17      0.00    N
chrX   1755235   18.00   0.00   18.00   0.00    0.00    N
chrX   1755236   19.00   9.99   0.00    9.00    0.01    Y:g->a/g   2.54e-08
chrX   1755237   19.50   0.00   0.00    0.00    19.50   N
chrX   1755238   19.50   0.00   0.00    19.50   0.00    N
chrX   1755239   46.00   0.01   19.49   0.00    0.00    N
Why Inline SNP Calling?

Post-Processing
 Requires more disk space, but less memory

Inline
 Requires more memory
 Less disk space
 Can include specific probabilities for each
read
Previous Optimizations

Two methods for speeding up mapping:
1. Entire genome on one machine
2. Split memory among different machines
○ Must normalize across all genome portions
○ MPI reduction
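Why the split-memory approach needs a reduction can be illustrated with a minimal sketch in plain Python, with no MPI library involved. It assumes each machine holds one genome portion and scores a read against it locally. The function name, the score lists, and the use of a simple sum as the reduction are illustrative assumptions, not GNUMAP-SNP's actual code.

```python
def normalize_across_portions(local_scores):
    """Normalize one read's scores over every genome portion.

    local_scores[k] holds the read's alignment scores on portion k.
    """
    # Each machine can sum its own scores without communication.
    local_sums = [sum(scores) for scores in local_scores]
    # The reduction combines the per-machine sums into one total that every
    # machine needs (the role an MPI all-reduce would play).
    global_sum = sum(local_sums)
    if global_sum == 0:
        return [[0.0 for _ in scores] for scores in local_scores]
    # Only now can each machine turn its scores into normalized weights.
    return [[score / global_sum for score in scores] for scores in local_scores]


if __name__ == "__main__":
    # Three portions: one hit, two hits, and no hits for this read.
    print(normalize_across_portions([[3.0], [1.0, 1.0], []]))
```

No machine can finish a read on its own, because the denominator depends on hits found in every other portion.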
Previous Optimizations
[Figure: Sequence Processing Rate for Memory Allocation. x-axis: # Processes (1 to 256); y-axis: # sequences / second (10 to 2000); series: Expected, Single Machine, Spread Memory]
Memory Requirements

Human Genome (3 Gbp)
 HashMap ≈ 12GB
 4 bits/character = 1.5GB
 5 floating point values per base (plus N) =
sizeof(float)*5 * 3GB=60GB
 Also stores total for easy computation =
sizeof(float) * 3GB = 12GB

Total of ≈ 90GB per run
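The arithmetic behind these figures can be checked with a short sketch. It assumes 4-byte floats and decimal gigabytes, and it takes the 12GB HashMap figure as given; the constant and function names are illustrative. The itemized parts sum to 85.5GB, which is consistent with the slide's rounded total of about 90GB.

```python
GENOME_BASES = 3_000_000_000  # human genome, about 3 billion bases
SIZEOF_FLOAT = 4              # bytes; assumes 32-bit floats
GB = 1_000_000_000            # decimal gigabytes


def memory_breakdown_gb(hash_map_gb=12.0):
    """Return the per-component memory estimate in GB, plus the total."""
    parts = {
        "hash_map": hash_map_gb,                                # figure taken from the slide
        "packed_genome": GENOME_BASES * 4 / 8 / GB,             # 4 bits per character
        "per_base_floats": SIZEOF_FLOAT * 5 * GENOME_BASES / GB,  # A, C, G, T, N
        "per_base_total": SIZEOF_FLOAT * GENOME_BASES / GB,     # running total per base
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, size in memory_breakdown_gb().items():
        print(f"{name:16s} {size:6.1f} GB")
```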
Three Memory Optimizations
Normal (no optimization)
 Integer discretization
 Centroid discretization

Integer Discretization
Only need one floating point value (for
total) and 1 byte/nucleotide.
 “Parts per 255”
 Biggest hit: Going into and out of
“integer space”

Integer Discretization



Step 1: Convert from Integer Space
Step 2: Add from ri to Genome
Step 3: Convert back to Integer Space
Added from ri (total 1.0): 0.00, 0.68, 0.31, 0.01, 0.00
Genome, before -> after (floating-point equivalent in parentheses):
Total: 12.0 -> 13.0
A: 231 -> 228 (10.9 -> 11.6)
C: 7 -> 13 (0.33 -> 0.64)
G: 12 -> 11 (0.56 -> 0.57)
T, N: 3, 2 (0.15)
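The three steps can be sketched as follows. This is a minimal illustration of the "parts per 255" idea, assuming one float total plus five one-byte values (A, C, G, T, N) per genome position; the function names and the rounding rule are assumptions, and the usage numbers are hypothetical rather than taken from the slide.

```python
SCALE = 255  # each nucleotide is stored as "parts per 255" of the total


def to_float_space(total, parts):
    """Step 1: recover approximate per-nucleotide weights from byte values."""
    return [total * part / SCALE for part in parts]


def to_integer_space(values):
    """Step 3: store the total as a float and each share as a byte value."""
    total = sum(values)
    if total == 0:
        return 0.0, [0] * len(values)
    return total, [round(SCALE * value / total) for value in values]


def add_read(total, parts, contribution):
    """Steps 1-3: leave integer space, add one read's weights, convert back."""
    values = to_float_space(total, parts)
    updated = [value + extra for value, extra in zip(values, contribution)]
    return to_integer_space(updated)


if __name__ == "__main__":
    # Hypothetical position: total weight 10.0, all of it on A.
    new_total, new_parts = add_read(10.0, [255, 0, 0, 0, 0], [0.0, 0.0, 1.0, 0.0, 0.0])
    print(new_total, new_parts)
```

Because each share is rounded independently, the stored byte values need not sum to exactly 255 after an update, which is one source of the rounding error this scheme trades for memory.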
Centroid Discretization

Many states not used:
 [255, 255, 255, 255, 255]
 [0, 0, 0, 0, 0]

Many states not biologically relevant
 SNP transition (common) vs transversion
(not likely)

MSA uses this compression to perform
fast one-to-many alignment
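The idea of storing only useful states can be sketched as a nearest-centroid lookup. The codebook below is entirely hypothetical (the slides do not list the centroids used), as are the function names and the squared-distance rule; the sketch only shows how a position could be stored as one small index plus a total.

```python
# Hypothetical codebook: each centroid is a distribution over (A, C, G, T, N).
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # all A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # all C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # all G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # all T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G mix (a transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T mix (a transition)
    (0.0, 0.0, 0.0, 0.0, 1.0),  # all N
]


def encode(proportions):
    """Return the index of the centroid closest to the given proportions."""
    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(proportions, centroid))
    return min(range(len(CENTROIDS)), key=lambda k: squared_distance(CENTROIDS[k]))


def decode(index, total):
    """Expand a centroid index and a total back into per-nucleotide weights."""
    return [total * share for share in CENTROIDS[index]]


def add_read(index, total, contribution):
    """Decode, add one read's weights, and re-encode to the nearest centroid."""
    weights = [w + extra for w, extra in zip(decode(index, total), contribution)]
    new_total = sum(weights)
    if new_total == 0:
        return index, 0.0
    return encode([w / new_total for w in weights]), new_total


if __name__ == "__main__":
    print(add_read(0, 10.0, [0.0, 0.0, 8.0, 0.0, 0.0]))  # moves to the A/G centroid
    print(add_read(0, 10.0, [0.0, 0.0, 0.5, 0.0, 0.0]))  # stays at the all-A centroid
```

In the second call the small G contribution is absorbed by the all-A centroid, so evidence that arrives in small increments can be lost under this kind of scheme.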
Centroid Discretization (cont)

Benefits
 Doesn’t waste impossible or infrequently
used space
 Much smaller memory footprint

Drawbacks:
 Slight overhead in converting from centroid
to floating point spaces
 Rounding error (how significant?)
Speed Comparison
Optimization Stats (chrX)
Optimization   Memory   Mem %   Wallclock   TP     FP
Normal         4.76GB   100%    04:25:55    1309   127
CharDisc       2.58GB   54.2%   04:36:58    677    0
CentDisc       2.01GB   42.2%   04:27:29    166    9058
Conclusion

For high error rates, HMM approach is
ideal, but requires more memory
 Distributing the genome across processors
doesn’t scale linearly

Discretization methods provide good
memory reductions (down to 42% of the
original footprint)
 Centroid discretization performs poorly
 Integer discretization can be used when
available memory is low
Questions