PowerPoint Presentation - University of Connecticut

Max-Planck-Institute
for Molecular Genetics
Bioinformatics Pipeline for Fosmid-Based Molecular Haplotype Sequencing
Jorge Duitama1,2, Thomas Huebsch1, Gayle McEwen1,
Sabrina Schulz1, Eun-Kyung Suk1, Margret R. Hoehe1
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC region with boundary positions 29.74 | MHC class I | 31.59 | MHC class III | 32.34 | MHC class II | 33.21]
MHC: Variation amongst Haplotypes
[Figure: Variation of MHC haplotypes against the PGF reference sequence: 7 further MHC haplotype sequences shown across MHC class III and MHC class II; figure labels: HLA-DRB, CNV, RCCX, CNV]
Variation amongst 8 MHC Haplotypes:
• 37,451 Substitutions
• 7,093 Short Indels
Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60, 1-18
Experimental Approach
• 100 Individuals, 100 Libraries
• 3x96-well = 288 fosmid pools
• One pool: 5000 fosmids (40 kb haploid molecules)
• SNP Mapping for Prioritization of MHC Informative Pools
• SOLiD NGS Platform: Targeted Enrichment, or Complete Shotgunning of the complete Fosmid Pool (40 kb fosmids)
• Data Analysis Pipeline: Identification of 40 kb fosmid sequences
• Phasing molecular fosmid sequences (Haplotype A, Haplotype B)
• Contiguous MHC haplotype sequence
Data Analysis Pipeline
• Read Alignment against Genome
• Pairing
• Fosmid Detection Program
• Fosmid Specific Matching Algorithm
• Fosmid Sequences Based Phasing
• Consensus Calling
• SNP Analysis
• Visualization & MHC Database
Diagram group labels: SOLiD Standard Pipeline; In House Project Specific Analysis Pipeline
Mapping real data
Pool of 15,000 Fosmids, 22 Mill. Reads, 50 bp
[Figure: bar chart comparing Bioscope classic, Bioscope local repeat 40.3 and Bioscope local repeat 45.3; y-axis 0 to 70 %; series: mapped reads %, unique mapped reads %, multiple hits %]
SNP calls: Haploid fosmids vs. genomic DNA

gDNA
# cov   ref   consen   F3        coord
335     C     Y        177/17    62511614
3345    T     C        3191/56   62512095
875     G     A        862/25    62513689
1795    G     K        722/23    62513754
707     C     S        528/13    62515375
2643    C     Y        1391/20   62517737
643     C     Y        417/23    62518998
1074    A     R        554/21    62522445
606     C     S        226/21    62524689
639     A     M        167/15    62532474
158     G     R        89/14     62533464
1032    A     R        443/26    62534973
7       A     G        7/4       62537153
775     T     G        742/26    62540402
10      G     C        10/5      62540465
698     G     C        684/29    62541769
40      C     T        40/4      62542550
94      C     G        93/9      62542574
286     C     T        283/16    62543011
194     C     A        190/22    62543067

Fosmid
# cov   ref   consen   F3        coord
595     C     T        572/91    62511614
3418    T     C        3278/98   62512095
2089    G     A        2048/98   62513689
2238    G     T        2194/98   62513754
1134    C     G        1107/73   62515375
3104    C     T        2922/98   62517737
1033    C     T        1014/83   62518998
1799    A     G        1753/98   62522445
1053    C     G        1049/83   62524689
54      G     A        39/22     62527964
32      A     C        27/23     62529870
1374    A     C        1355/95   62532474
973     G     A        946/97    62533464
2850    A     G        2745/98   62534973
49      A     G        48/33     62537153
1888    T     G        1845/95   62540402
37      G     C        36/20     62540465
923     G     C        901/97    62541769
8411    A     W        2006/78   62542258
253     C     T        253/47    62542550
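The consensus column on this slide mixes plain bases with two-base IUPAC ambiguity codes (Y, K, S, R, M, W). The following is a minimal sketch of how such calls could be decoded and compared between a diploid gDNA call and a haploid fosmid call. The function names are hypothetical and not part of the pipeline described in the slides; only the IUPAC code table is standard.

```python
# Standard IUPAC nucleotide ambiguity codes, restricted to the two-base codes.
IUPAC_TWO_BASE = {
    "R": {"A", "G"},
    "Y": {"C", "T"},
    "S": {"C", "G"},
    "W": {"A", "T"},
    "K": {"G", "T"},
    "M": {"A", "C"},
}


def genotype(consensus):
    """Return the set of bases implied by a consensus call (hypothetical helper)."""
    return IUPAC_TWO_BASE.get(consensus, {consensus})


def is_heterozygous(consensus):
    """A call is heterozygous when it stands for two different bases."""
    return len(genotype(consensus)) == 2


def consistent(gdna_call, fosmid_call):
    """True if every base in the fosmid call is also present in the gDNA call."""
    return genotype(fosmid_call) <= genotype(gdna_call)


# Position 62511614 on this slide: one panel reports Y, the other T.
print(is_heterozygous("Y"), consistent("Y", "T"))
```

A heterozygous gDNA call such as Y (C/T) is compatible with a haploid fosmid call of either C or T, which is the pattern the slide contrasts.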
SNP Calling Accuracy in the MHC
– Affymetrix genotype information for 1583 SNP positions as reference standard:
• Homozygous identical with reference: 957
• Heterozygous: 562
• Homozygous different from reference: 64
– Compared to variants called from the SOLiD-sequenced genomic DNA sample (15x average read coverage)
– Percentage of error in genotype calling: 3.66%
– False positive rate: 0.1%
– False negative rate: 9.25%
Fosmid Detection
Fosmid Detection Algorithm
1. Assign each read to a single 1kb long bin. Select bins with more than
5 reads
2. Perform allele calls for each heterozygous SNP. Mark bins with
heterozygous calls
3. Cluster adjacent bins as belonging to the same fosmid if:
i. The gap distance between them is less than 10kb and
ii. There are no bins with heterozygous SNPs between them
4. Keep fosmids with lengths between 3kb and 60kb
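The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: reads are reduced to start positions on one chromosome, step 2 (allele calling) is assumed to have been done elsewhere and is passed in as a set of bin indices `het_bins`, bins with heterozygous calls are treated as breakpoints and left out of the clusters, and the gap is measured as the empty space between two selected bins. All names are hypothetical.

```python
from collections import Counter

BIN_SIZE = 1_000                   # step 1: 1 kb bins
MIN_READS = 5                      # step 1: keep bins with more than 5 reads
MAX_GAP = 10_000                   # step 3: gap must be below 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # step 4: accepted fosmid lengths


def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of candidate fosmids.

    read_starts: start positions of mapped reads on one chromosome.
    het_bins: indices of bins marked as containing heterozygous calls
              (assumed to come from a separate allele-calling step).
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    selected = sorted(
        b for b, n in counts.items() if n > MIN_READS and b not in het_bins
    )

    clusters = []
    for b in selected:
        if clusters:
            prev = clusters[-1][-1]
            gap = (b - prev - 1) * BIN_SIZE
            het_between = any(prev < h < b for h in het_bins)
            if gap < MAX_GAP and not het_between:
                clusters[-1].append(b)   # same fosmid
                continue
        clusters.append([b])             # start a new fosmid

    fosmids = []
    for cluster in clusters:
        start = cluster[0] * BIN_SIZE
        end = (cluster[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids


# Toy input: six reads in each of 40 consecutive bins, plus a 2 kb island.
reads = [b * BIN_SIZE + i * 10 for b in range(40) for i in range(6)]
reads += [b * BIN_SIZE + i * 10 for b in (100, 101) for i in range(6)]
print(detect_fosmids(reads))
print(detect_fosmids(reads, het_bins={20}))
```

In the toy input the 2 kb island is discarded by the length filter, and marking bin 20 as heterozygous splits the 40 kb cluster in two.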
UCSC Genome browser http://genome.ucsc.edu/
Kent et al. 2002 Genome Res. 12(6):996-1006.
Fosmid Detection
Size distribution of read-contigs
[Figure: histogram of read-contig sizes; x-axis: contig length kb (1 to about 100); y-axis: number of contigs (0 to 3500); the 20 – 50 kb range is marked as fosmid sized contigs]
Haplotyping
Locus   Event       Alleles   Hap 1   Hap 2
1       SNV         C,T       T       C
2       Deletion    C,-       C       -
3       SNV         A,G       G       A
4       Insertion   -,GC      -       GC
The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.
Single Individual Haplotyping
• Input: Matrix M of m fragments covering n loci
Locus   1   2   3   4   5   ...   n
f1      -   0   1   1   0         0
f2      1   1   0   -   1         1
f3      0   0   0   1   1         -
...
fm      -   -   1   -   1         1
ReFHap Problem Formulation
For two alleles a1, a2
For two rows i1, i2 of M
Example:
f1      -   0   1   1   0
f2      1   1   1   -   1
Score   0   1  -1   0   1
s(M,1,2) = 1
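The formulas for the two scores did not survive on this slide, but the worked example suggests a rule: a locus scores +1 when both fragments are called and differ, -1 when both are called and agree, and 0 when either is missing, and the row score is the sum over loci. The sketch below encodes that inferred rule; the function names are hypothetical, and the rule itself is an assumption read off the example rather than a quoted definition.

```python
def allele_score(a1, a2):
    """Inferred per-locus score: 0 if either allele is missing ('-'),
    +1 if the two called alleles differ, -1 if they are equal."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1


def fragment_score(row1, row2):
    """Inferred score for two rows of M: sum of per-locus allele scores."""
    return sum(allele_score(a, b) for a, b in zip(row1, row2))


# Rows f1 and f2 of the example on this slide.
f1 = "-0110"
f2 = "111-1"
print([allele_score(a, b) for a, b in zip(f1, f2)])  # [0, 1, -1, 0, 1]
print(fragment_score(f1, f2))                        # 1
```

Under this rule a positive score means the two fragments mostly disagree, which is evidence that they come from different haplotypes.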
ReFHap Problem Formulation
For a cut I of rows of M
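The formula for the score of a cut is not preserved on this slide. A natural reading, consistent with the Max-Cut reduction used later in the deck, is that s(M, I) sums the pairwise fragment scores over all pairs of rows separated by the cut. The sketch below assumes that reading, together with a per-locus rule of +1 for differing alleles, -1 for equal alleles and 0 for a missing allele. The fragment matrix is invented for illustration.

```python
def allele_score(a1, a2):
    """Assumed per-locus score: 0 if missing, +1 if different, -1 if equal."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1


def fragment_score(row1, row2):
    """Assumed row score: sum of per-locus scores."""
    return sum(allele_score(a, b) for a, b in zip(row1, row2))


def cut_score(M, I):
    """Assumed cut score: sum of fragment scores over pairs split by the cut I."""
    inside = set(I)
    outside = [i for i in range(len(M)) if i not in inside]
    return sum(fragment_score(M[i], M[j]) for i in inside for j in outside)


# Hypothetical fragments: rows 0 and 2 agree with 00110, rows 1 and 3 with 11001.
M = ["0011-", "-1001", "00-10", "1100-"]
print(cut_score(M, {0, 2}))  # 13: the cut separates the two haplotypes
print(cut_score(M, {0, 1}))  # 1: the cut mixes the two haplotypes
```

The cut that groups fragments by haplotype of origin gets the highest score, which is why maximizing the cut is a reasonable proxy for phasing.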
ReFHap Algorithm
• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut
[Figure: example with four fragments f1 to f4 over loci 1 to 5, the weighted fragment graph built from them, and the resulting haplotypes h1 = 00110 and h2 = 11001]
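The last step, building haplotypes according to the cut, could look like the sketch below. It assumes a simple voting scheme that is not spelled out in the slides: at each locus, fragments on one side of the cut vote with their allele, fragments on the other side vote with the complementary allele, and the second haplotype is the complement of the first. The fragment matrix is invented, chosen so that the result matches the h1 and h2 shown on this slide.

```python
def build_haplotypes(M, I):
    """Derive two complementary haplotypes from fragment matrix M and cut I.

    Voting scheme (an assumption): rows in I vote for their own allele,
    rows outside I vote for the complement. Ties and uncovered loci give '-'.
    """
    inside = set(I)
    h1 = []
    for j in range(len(M[0])):
        votes = 0
        for i, row in enumerate(M):
            allele = row[j]
            if allele == "-":
                continue
            says_one = (allele == "1") if i in inside else (allele == "0")
            votes += 1 if says_one else -1
        h1.append("1" if votes > 0 else "0" if votes < 0 else "-")
    complement = {"0": "1", "1": "0", "-": "-"}
    h1 = "".join(h1)
    h2 = "".join(complement[a] for a in h1)
    return h1, h2


# Hypothetical fragments and a cut that puts rows 0 and 2 on one side.
M = ["0011-", "-1001", "00-10", "1100-"]
print(build_haplotypes(M, {0, 2}))  # ('00110', '11001')
```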
ReFHap Algorithm
1. Build G=(V,E,w) from M
2. Sort E from largest to smallest weight
3. Init I with a random subset of V
4. For each e in the first k edges
   a) I’ ← GreedyInit(G,e)
   b) I’ ← GreedyImprovement(G,I’)
   c) If s(M, I) < s(M, I’) then I ← I’
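A compact sketch of this loop is given below. Several parts are assumptions rather than the authors' implementation: the edge weight is a pairwise score of +1 per differing locus and -1 per agreeing locus, `greedy_init` places the two endpoints of the seed edge on opposite sides and then adds each remaining vertex to the side that adds more cut weight, and `greedy_improvement` uses single-vertex flips as a simplified stand-in for the local search step. The fragment matrix is invented.

```python
import random
from itertools import combinations


def fragment_score(r1, r2):
    """Assumed weight: +1 per differing locus, -1 per equal locus, gaps ignored."""
    return sum((1 if a != b else -1) for a, b in zip(r1, r2) if "-" not in (a, b))


def build_graph(M):
    """Step 1: edge (i, j) with i < j for every pair of fragments sharing a locus."""
    return {
        (i, j): fragment_score(M[i], M[j])
        for i, j in combinations(range(len(M)), 2)
        if any("-" not in (a, b) for a, b in zip(M[i], M[j]))
    }


def cut_score(w, I):
    """Total weight of edges with exactly one endpoint in I."""
    return sum(wt for (i, j), wt in w.items() if (i in I) != (j in I))


def greedy_init(w, n, edge):
    """Assumed GreedyInit: seed edge endpoints go to opposite sides."""
    u, v = edge
    I, placed = {u}, {u, v}
    for x in range(n):
        if x in placed:
            continue
        weight = lambda y: w.get((min(x, y), max(x, y)), 0)
        gain_inside = sum(weight(y) for y in placed if y not in I)
        gain_outside = sum(weight(y) for y in placed if y in I)
        if gain_inside >= gain_outside:
            I.add(x)
        placed.add(x)
    return I


def greedy_improvement(w, n, I):
    """Assumed GreedyImprovement: flip single vertices while the cut improves."""
    I, improved = set(I), True
    while improved:
        improved = False
        for x in range(n):
            J = I ^ {x}
            if cut_score(w, J) > cut_score(w, I):
                I, improved = J, True
    return I


def refhap_cut(M, k=3, seed=0):
    n, w = len(M), build_graph(M)
    edges = sorted(w, key=w.get, reverse=True)            # step 2
    rng = random.Random(seed)
    I = {x for x in range(n) if rng.random() < 0.5}       # step 3
    for e in edges[:k]:                                   # step 4
        candidate = greedy_improvement(w, n, greedy_init(w, n, e))
        if cut_score(w, I) < cut_score(w, candidate):
            I = candidate
    return I


M = ["0011-", "-1001", "00-10", "1100-"]  # hypothetical fragments
best = refhap_cut(M)
print(sorted(best), cut_score(build_graph(M), best))
```

Starting from the heaviest edges is a sensible design choice under this scoring, since a heavy edge joins two fragments that very likely come from different haplotypes.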
ReFHap Algorithm
• Classical greedy algorithm
[Figure: two drawings of a four-node graph (nodes 1 to 4) illustrating the greedy step]
ReFHap Algorithm
• Edge flipping
[Figure: two drawings of a four-node graph (nodes 1 to 4) illustrating an edge flip]
Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS Libraries

                          Mixed Diploid                        Fosmid-Based                    Fosmid-Based / Mixed Diploid
Library                   Mate Pair & Paired End Genomic DNA   Paired End, 16 Barcoded Pools
Uniquely Mapped           47 Gb                                15 Gb                           1/3rd
Number of Blocks          407                                  40                              1/10th
Av. Block Length          438 bp                               85 kb                           194 x
Max. Block Length         3.7 kb                               691 kb                          186 x
Total Length all Blocks   178 kb                               3.4 Mb                          19 x
% of Phased SNPs          12 %                                 66 %                            5 x
Phasing MHC: Preliminary Results
• Number of blocks: 8
• N50 block length: 793 kb
• Maximum block length: 1.6 Mb
• Total extent of all blocks: 3.8 Mb
• Fraction of MHC phased into haplotype blocks: 95%
• Number of heterozygous SNPs: 8030
• Fraction of SNPs phased: 86%
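For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: sort the blocks from longest to shortest and report the length of the block at which the running total first reaches half of the summed length. The input values are toy numbers, not the block lengths behind the slide.

```python
def n50(lengths):
    """Length L such that blocks of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no block lengths given")


# Toy block lengths (arbitrary units), for illustration only.
print(n50([10, 8, 6, 4, 2]))  # 8
```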
Acknowledgements
Margret Hoehe, Anita Suk, Thomas Hübsch, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Roger Horton, Gayle McEwen
The Life Tech Team: Kevin McKernan, Clarence Lee, Jessica Spangler, Tristen Weaver, Tamara Gilbert, Alexander Sartori, Dustin Holloway, Heather Peckham, Stephen McLaughlin, Tim Harkins
Thank You!
Comparison of Mapping Algorithms
COX Haplotype simulated reads
[Figure: bar chart comparing Bioscope classic, Bioscope local iub, Bioscope classic iub, Bioscope local repeat schema, Bioscope local and Bfast; y-axis 0 to 120 %; series: mapped reads %, unique mapped reads %, multiple hits %]