Supplementary Note 1

advertisement
Supplementary Information
Accurate detection of de novo and transmitted INDELs within exomecapture data using micro-assembly
Giuseppe Narzisi, Jason A. O’Rawe, Ivan Iossifov, Han Fang, Yoon-ha Lee, Zihua
Wang, Yiyang Wu, Gholson J. Lyon, Michael Wigler, and Michael C. Schatz
Supplementary Note 1
Benchmarking on simulated data
Sequencing data was simulated for ~200,000 human exons. For testing purposes, we
artificially introduced an unrealistic high mutation rate of one INDEL per exon. INDEL sizes
were randomly selected in the range [1,100] according to a lognormal distribution with
mean μ=1 and standard deviation σ=10. 100bp perfect reads were artificially generated
using realistic coverage distribution based on probe patterns (90X average coverage). The
reads were then aligned with both BWA-aln and BWA-mem [6] to the reference human
genome hg19 using default parameters. Note that here we opted for a simple simulation
model that, clearly, does not represent all of the complexities of a real exome project, e.g.,
potential cloning bias, sequencing errors, homopolymer runs, mutation hotspots, etc., but it
is chosen only for the purpose of testing and comparing the best-case relative power of
different pipelines to detect INDELs of different sizes under controlled conditions.
The figure below shows the size distribution of INDELs detected by seven different
pipelines (Scalpel, SOAPindel [1], HaplotypeCaller, SAMtools [2], UnifiedGenotyper [3],
FreeBayes [4], Platypus [5]) when applied to the simulated exome data. In accordance with
previously published results, we report reduced power for standard mapping pipelines to
detect large INDELs (specifically insertions): SAMtools, UnifiedGenotyper (GATK), and
FreeBayes have limited power to discover deletions larger than 20bp and this trend is
more evident for large insertions. The table shows the same trend: SAMtools,
UnifiedGenotyper (GATK), and FreeBayes have lower sensitivity compared to the other
tools. Surprisingly Platypus, although it is an assembly-based method, shows also limited
power to detect longer INDELs and a significantly higher false discovery rate compared to
Scalpel, SOAPindel, and HaplotypeCaller. Conversely, methods that use sequence assembly
techniques are able to detect mutations over the full spectrum of the “true” distribution.
Based on this analysis on simulated exome data, Scalpel, SOAPindel, and HaplotypeCaller
have the best sensitivity, although Scalpel has the highest specificity. However, various
simplifying assumptions used in the simulation may explain why simulated data appear
somewhat easier to analyze, and the performance of these tools changes dramatically when
applied to real exome capture data, as shown in the main text of the paper.
INDELs size distribution of different pipelines on simulated exome data aligned with BWA-aln. The true
distribution of INDEL sizes is colored in grey. Standard mapping approaches (“SAMtools”, “UnifiedGenotyper”,
“FreeBayes”) show reduced power to detect longer mutations compared to assembly based pipelines (“Scalpel”,
“SOAPindel”, and “HaplotypeCaller”).
Sensitivity and False Discovery Rate (FDR) for simulated data aligned with BWA-aln. The first two columns consider
all INDELs, whereas the last two just consider larger INDELs (>5bp in size).
Tool
Scalpel
SOAPindel
HaplotypeCaller
UnifiedGenotyper
SAMtools
FreeBayes
Platypus
Sensitivity (%)
(all)
92.9
93.3
93.7
86.8
88.1
85.8
88.5
FDR (%) Sensitivity (%)
(all)
(size > 5bp)
0.026
85.2
0.070
89.7
0.110
87.8
0.220
48.1
0.640
59.5
0.212
44.3
0.777
61.2
FDR (%)
(size > 5bp)
0.06
0.09
0.49
0.36
0.38
0.07
1.71
Alignment Analysis
More recently a new version of BWA has been released (BWA-mem [6]) which is designed
to be more sensitive to detecting longer INDELs. However, due to its new seeding algorithm,
BWA-mem is likely to achieve higher sensitivity only for longer reads (>>100bp) and might
not significantly improve the results for our simulated 100bp read dataset. We investigated
if there was any improvement by aligning the simulated reads with the most recent version
of BWA-mem. The Figure and the Table below show that for these data there is only a
marginal improvement across all the tools in terms of sensitivity. The same trend of
reduced power for standard mapping methods is clearly confirmed by the results.
Unfortunately SOAPindel is not compatible with the BAM file generated by BWA-mem and
it could not be included in the analysis (personal communications with BGI). The minor
improvement in sensitivity is also confirmed by the alignment statistics for BWA-aln and
BWA-mem where there is only a marginal increase in the number of mapped reads around
INDELs. We used QualiMap [7] to compute the alignment statistics.
INDELs size distribution of different pipelines on simulated exome data aligned with BWA-mem. The true
distribution of INDEL sizes is colored in grey. Standard mapping approaches (“SAMtools”, “UnifiedGenotyper”,
“FreeBayes”) show reduced power to detect longer mutations compared to assembly based pipelines (“Scalpel”,
“SOAPindel”, and “HaplotypeCaller”).
Sensitivity and False Discovery Rate (FDR) for simulated data aligned with BWA-mem. The first two columns
consider all INDELs, whereas the last two just consider larger INDELs (>5bp in size).
Tool
Scalpel
SOAPindel
HaplotypeCaller
UnifiedGenotyper
SAMtools
FreeBayes
Platypus
Sensitivity (%)
(all)
94.3
n.a.
94.4
86.3
88.2
86.9
88.1
FDR (%) Sensitivity (%)
(all)
(size > 5bp)
0.027
91.2
n.a.
n.a.
0.068
91.8
0.026
45.9
0.149
58.9
0.013
49.8
0.196
57.5
FDR (%)
(size > 5bp)
0.06
n.a.
0.17
0.06
0.51
0.04
1.27
Alignment statistics. Comparison between BWA-aln and BWA-mem.
BWA-aln
Number of reads
50,549,624
Mapped reads
50,208,487 / 99.33%
Paired reads
50,208,487 / 99.33%
Mapped reads, singletons
57,769 / 0.11%
Read min/max/mean length 45 / 100 / 100
Clipped reads
862,727 / 1.71%
Duplication rate
39.23%
Total reads with indels
6,458,280
Insertions
3,192,230
Deletions
3,266,050
Homopolymer indels
29.81%
BWA-mem
50,659,783*
50,608,913 / 99.9%
50,608,913 / 99.9%
15,362 / 0.03%
30 / 100 / 99.86
1,230,432 / 2.43%
39.36%
6,706,818
3,295,515
3,411,303
30.11%
* Note that the total number of reads reported by flagstat is slightly higher for BWA-mem. This is due to the
fact that BWA-mem marked a small portion of the reads as secondary alignment. These reads were counted
twice by QualiMap.
Supplementary Note 2
Comparison to RepeatSeq and lobSTR on simulated data
RepeatSeq [8] and lobSTR [9] are two recent computational tools specifically designed to
detect INDELs in short repeat structures. In particular, RepeatSeq is designed to determine
genotypes for microsatellite repeats in high-throughput sequencing data, while lobSTR is a
tool for profiling Short Tandem Repeats (STRs) from high throughput sequencing data. As
such, neither RepeatSeq nor lobSTR are general-purpose INDEL caller for exome capture
data, but they work only on selected regions of the human genome where these repeat
structures are identified. For example lobSTR requires the input of a custom-made index
file describing the STR loci. Out of 1638523 regions from the lobSTR index file, only 5436
are within the exonic region defined by our BED file used to generate the simulated data.
Also, as expected, only a small fraction of these regions contain a mutation exactly within a
STR sequence.
The tables below show the results of the comparative analysis between Scalpel, RepeatSeq
and lobSTR. RepeatSeq and lobSTR show reduced sensitivity compared to Scalpel. This is
particularly evident for longer INDELs (>5pb), where RepeatSeq and lobSTR demonstrate
inferior power. This result is explained by the fact that both RepeatSeq and lobSTR are
tools based on standard mapping approaches and, in agreement with the results reported
in Supplementary Note 1, they have reduced power to detect large mutations.
Tool
RepeatSeq
Scalpel
Tool
lobSTR
Scalpel
# of INDELS
True Positives
476
656
472
656
# of INDELS True Positives
431
662
370
662
# of INDELS
(>5bp)
36
164
# of INDELS
(>5bp)
56
166
True Positives
(>5bp)
36
164
True Positives
(>5bp)
22
166
Supplementary Note 3
Comparison between Scalpel and GATK (v2.4-3, v2.8, v3.0) on K8101-49685
The GATK pipeline is under active development. New versions of the software were
released subsequent to our initial MiSeq experimental validations (see the Results and
Online Methods sections). In this section, we compare two recent versions of the GATK
(v2.8 and v3.0) with the published version that was used for our comparisons in the main
text of the paper (v2.4-3). Specifically, in the figures below, we compare the INDEL size
distribution of HaplotypeCaller from all 3 versions of the GATK package (v2.4-3, v2.8 and
v3.0) using different filtering criteria. In the case of single exome studies, the GATK support
group does not recommend using Variant Quality Score Recalibration (VQSR) because it is
unlikely that the modeling step will perform well on a limited number of variants. Instead
they advise using a hard filtering approach ("QD < 2.0 || FS > 200.0”). The three plots below
show the comparison between the “raw” INDELs and the “hard-filtered” INDELs for the
three versions of HaplotypeCaller. It is evident from the distribution of INDELs of v2.4-3
that hard filtering is a very aggressive strategy that drastically reduces the power of
HaplotypeCaller to detect INDELs greater than 20bp long. This type of filtering seems to
reduce HaplotypeCaller to the same power of the GATK’s UnifiedGenotyper, so we did not
use this filtering step in our experimental comparison on sample K8101-49685. GATK’s
HaplotypeCaller v3.0 instead does not show any significant difference in size distribution
when the hard filter is applied, which is consistent with the GATK v3.0 release notes.
This comparison also shows that the new versions of HaplotypeCaller have significantly
removed the bias towards deletions, reported previously for GATK v2.4-3. Since most of the
false-positive mutations reported by HaplotypeCaller v2.4-3 were large deletions, we
devised additional experiments to evaluate the performance of GATK v3.0. Specifically, we
randomly selected 200 INDELs specific to HaplotypeCaller v3.0 and 200 specific to Scalpel
for targeted MiSeq validation (see the Online Methods section for the validation protocol).
Note this selection was blinded to any previous validation analysis, making it a fair, random
selection from both algorithms. By chance, 215 out of the 400 INDELs were covered by
more than 1000 reads in previous data sets, which includes data generated by the MiSeq
validation experiment performed on the initial 1000 validation targets (see the Results and
Online Methods sections) as well as data generated and reported in O’Rawe et al. 2013 [10].
We performed targeted MiSeq re-sequencing on the remaining 185 INDELs where data
were not already available. 58 out of the remaining 185 were INDELs specific to Scalpel and
127 were INDELs specific to HaplotypeCaller v3.0. 170 out of the 185 INDELs were
successfully sequenced and covered by more than 1000 reads, with an average coverage of
165142. The results of these re-sequencing experiments are reported in the two tables
below. Under both the position-based and exact-match comparison strategies, Scalpel
outperforms HaplotypeCaller v3.0. In particular, Scalpel shows a significantly higher
validation rate for longer INDELs (size >5bp) as was previously observed with the older
versions. We applaud the GATK developers for their improved algorithm, and expect them,
along with us, to make further refinements over time as our understanding of the mutation
mechanisms and error models improve.
HaplotypeCaller comparison using different filtering criteria. Comparison between raw and hard-filtered INDELs for
HaplotypeCaller in GATK v2.4-3, v2.8 and and v3.0.
Validation rate for Scalpel and HaplotypeCaller (v3.0) using position-based comparison.
Tool
Scalpel
HapCall(v3.0)
Valid Invalid
146
86
PPV
50 75%
103 45%
Valid
(>5bp)
26
5
Invalid
(>5bp)
4
13
PPV
(>5bp)
86%
27%
Validation rate for Scalpel and HaplotypeCaller (v3.0) using exact-match comparison.
Tool
Scalpel
HapCall(v3.0)
Valid Invalid
82
65
PPV
114 42%
124 34%
Valid
(>5bp)
23
5
Invalid
(>5bp)
7
13
PPV
(>5bp)
76%
27%
Supplementary Note 4
Characterization of validation rate
Among the different quality metrics computed by Scalpel, the chi-squared k-mer statistic
(χ2) and the coverage ratio are the most informative scores. In this section we characterize
the False-Discovery Rate (FDR) of Scalpel in real experiments as a function of both the chisquare statistics and coverage. In particular in the figures below we analyzed a total of 614
INDELs detected by Scalpel and validated by re-sequencing on the single-exome study (ID:
K8101-49685). Specifically, given a fixed chi-square threshold T, we computed the FDR for
all the mutations with chi-square score <T (x-axis). The corresponding FDR value is
reported on the y-axes. Higher chi-square thresholds produce more errors but the FDR
value stabilizes around 25%, in agreement with the experimental results reported in the
main text of the manuscript. However, different trends are reported for mutations within
microsatellites: even at low chi-square score the FDR for mutations within microsatellites
is higher than the non-microsatellite mutations. This analysis also demonstrates the
additional challenges faced in INDEL detection within microsatellites. Finally, different
curves are plotted for subsets of mutations with predefined maximum coverage. As
expected, mutations with higher coverage more frequently pass validation (lower FDR).
Characterization of False Discovery Rate as a function of the chi-squared score and coverage. The analysis is
reported for mutations outside of microsatellites (left plot) and within microsatellites (right plot). Different curves are
plotted for subsets of mutations with predefined maximum coverage.
We also analyzed the direct relationship between the chi-square score and coverage for the
614 mutations. The scattered plot below shows the same trend where mutations that did
not pass validation (red dots) are generally characterized by lower coverage. Finally it is
important to note that, since these results are based on a very small subset of only 614
mutations, there is a clear residual noise in both the previous FDR curves and the scattered
plot blow. However the sample is large enough to start identifying the major trends and
relationships between the validation rate and the characteristics of the data (such as chisquare score and coverage).
Supplementary Figure 1. Size distribution of INDELs detected by all pipelines (“Intersection”). The sizes of INDEL
belonging to the intersection among the three pipelines follow a lognormal distribution.
Supplementary Figure 2. Example of false-bubble induced by a near perfect repeat sequence. Near-perfect repeats
are a major source of false positive variation calls when using a de Bruijn graph assembly strategy. This figure shows an
example of near-perfect repeat that can be misinterpreted as a large deletion. The key point is that the beginning of this
sequence is a nearly perfect 69bp repeat. There is just 1bp difference between the two copies that are 15bp apart. The
sequence is segmented as 19-C-49-A-14-19-T-49-G-21 where 19 and 49 are 19bp and 49bp perfect repeats, separated by
a 15bp unique sequence (A, C, T, G are the regular bases). Since the longest exact repeat is 49bp long, one would expect
that using k-mer=55 should be large enough to correctly assemble this sequence. However, our data also contained reads
with sequence 19-C-49-G, which suggests the presence of an 84bp deletion of the A-14-19-T-49 segment but in actuality is
just a single base change. Since the de Bruijn graph is constructed using perfect matches of length k-1=54 (no mismatches
allowed), the only way to connect all the 55-mers from these two sequences is to construct a false bubble jumping form
the first copy of the near-perfect repeat to the second copy. When aligned to the reference, the sequence associated to the
false branch will show a (false-positive) large deletion.
Supplementary Figure 3. IGV snapshot of HaplotypeCaller false-positive deletion. A 35 bp deletion was erroneously
called by HaplotypeCaller at position 3:69928431. Reads aligned to this region show a cluster of highly variable softclipped sequences that are likely to be mapping artifacts.
Supplementary Figure 4. Comparison between Scalpel and GATK UnifiedGenotyper on 593 SSC families. Scalpel
demonstrates increased power to detect insertions.
Supplementary Figure 5. Abundance of INDELs for different annotation categories.
Supplementary Table 1. Validation rate for INDELs specific to each pipeline. PPV is the positive predictive value
computed as #TP/(#TP+#FP), where #TP is the number of true-positive calls and #FP is the number of false-positive calls.
Tool
Scalpel (v0.1.1)
SOAPindel (v2.0.1)
HaplotypeCaller (v2.4.3)
Valid
(all)
145
101
45
Invalid
(all)
43
99
155
PPV (%)
(all)
77.1
50.5
22.5
Valid
(≥30bp)
13
8
7
Invalid
(≥30bp)
1
129
62
PPV (%)
(≥30bp)
92.8
5.8
11.3
Supplementary Table 2. Number of insertions and deletions by annotation category. DI-ratio is the ratio between
the number of deletion and insertion for each category.
Category
Insertions Deletions DI-ratio Total
Intron
6295
11490
1.82 17785
Intergenic
310
645
2.08
955
Frame-shift
1207
2758
2.28
3965
Non-frame-shift
622
2049
3.2
2671
UTR
790
1377
1.74
2167
Splice site
56
196
3.5
252
All
9280
18515
1.99 27795
Supplementary Table 3. Summary of de novo INDELs in 593 SSC Families in different contexts. “Aut” stands for
autistic child and “Sib” for his/her sibling; “M” stands for males and “F” for females.
INDEL effect
Frame shift
Intron
Intergenic
No frame shift
Splice-site
UTR
Total
Aut Sib Aut M Aut F Sib M Sib F Total
35 16
25
10
12
4
51
13 16
11
2
6
10
29
2
0
2
0
0
0
2
4
5
4
0
1
4
9
2
0
2
0
0
0
2
2
2
2
0
0
2
4
58 39
46
12
19
20
97
Supplementary Table 4. Likely Gene-Disrupting (LGD) frame-shift INDELs in children affected with autism. The
“Family ID” column indicates the ID of the relevant family. The “Study” column shows the study in which the family was
previously analyzed: CSHL, YALE or University of Washington (WASH). Under “Gender,” M stands for males and F for
females. The “Location” column reports the location of the variant in chr:position format. The “Variant” column shows
detail for reconstructing the haplotype around the de novo variant relative to the reference genome as
follows:“ ins(seq)”indicates an insertion of the provided sequence “seq”; and “del(N)” denotes a deletion of N nucleotides.
The “Χ2 score” column reports the chi-square score for the mutation. The “Gene” column reports the affected gene. The
“Amino Acid Position” column shows the position of the first incorrectly encoded amino acid within the encoded
protein/the length of the protein. When a mutation affects multiple isoforms of a transcript, the earliest proportionate
coordinate is given. “FMRP target” indicates whether the corresponding gene's RNA was found to physically associate
with FMRP [23].
Family
ID
13548
Study
Gender
Location
Variant
Χ2 score
Gene
CSHL
F
11:11314680
del(8)
10.8
12858
CSHL
F
9:37015071
del(1)
2.21
12952
CSHL
M
7:104748101
del(1)
13646
CSHL
M
9:35060456
13548
CSHL
F
12673
CSHL
M
12939
CSHL
13018
GALNTL4
Amino Acid
Position
522/608
FMRP
Target
no
PAX5
111/392
no
0.1
MLL5
1066/1859
yes
del(5)
6.75
VCP
515/807
no
11:11314690
del(1)
9.97
GALNTL4
521/608
no
22:40661587
del(4)
1.58
TNRC6B
451/1834
yes
M
17:42399124
del(2)
0.18
SLC25A39
112/360
no
CSHL
M
7:100201680
del(1)
0.33
PCOLCE
101/450
no
13176
CSHL
F
14:68272015
del(1)
0.02
ZFYVE26
397/2540
no
12950
CSHL
M
7:138968840
del(4)
6.1
UBN2
1063/1348
no
13096
CSHL
M
7:150164232
del(1)
7.12
GIMAP8
149/666
no
12653
CSHL
M
6:170593076
ins(A)
0.0
DLL1
431/724
no
13616
CSHL
M
4:47571001
ins(G)
0.02
ATP10D
1001/1427
no
13092
CSHL
M
19:49004781
ins(AGGTCAG)
9.68
LMTK3
307/1490
yes
13162
CSHL
M
6:72889392
ins(A)
0.04
RIMS1
196/1693
no
13439
CSHL
M
10:53458250
ins(A)
3.2
CSTF2T
354/617
no
13590
CSHL
M
15:80137554
ins(A)
0.07
MTHFS
147/147
no
13398
CSHL
M
1:151377904
ins(CGTCATCA)
3.92
POGZ
1194/1402
no
13552
CSHL
M
21:38877834
del(1)
1.85
DYRK1A
496/764
no
13168
CSHL
F
11:119214625
del(1)
0.51
MFRP
342/580
no
12323
CSHL
M
9:96439930
del(1)
0.24
PHF2
1088/1097
no
13471
CSHL
M
1:152286920
del(2)
0.82
FLG
147/4062
no
12705
CSHL
M
10:428609
ins(C)
0.11
DIP2C
657/1557
yes
12235
YALE
M
13:99100553
del(2)
0.92
FARP1
1040/1046
no
12099
YALE
M
21:38845117
del(2)
0.06
DYRK1A
48/764
no
12507
YALE
M
16:89350772
del(4)
4.17
ANKRD11
725/2664
yes
13618
YALE
F
8:37702146
del(2)
1.0
BRF2
374/420
no
12383
YALE
M
3:52454425
del(1)
1.53
PHF7
129/382
no
11808
YALE
F
6:31525440
ins(TG)
4.17
NFKBIL1
124/367
no
13739
YALE
F
X:153135039
del(2)
2.42
L1CAM
401/1258
no
13000
YALE
M
17:8424205
del(1)
0.06
MYH10
755/2008
yes
11712
YALE
M
1:53416507
del(2)
3.66
SCP2
70/524
no
11282
YALE
M
1:155317483
ins(CTTG)
4.04
ASH1L
2589/2965
yes
13618
YALE
F
15:93524061
del(4)
3.85
CHD2
965/1829
no
13447
WASH
F
6:157527665
del(4)
1.69
ARID1B
1784/2237
yes
Supplementary Note 5
Examples of de novo INDELs corrected by Scalpel
Supplementary Figure 6 shows an example of de novo INDEL in the GNPTG gene that was
initially detected as a 2bp deletion using the popular BWA+GATK UnifiedGenotyper
pipeline. The same mutation was instead detected by Scalpel as a longer 33bp deletion and
then confirmed by specific re-sequencing. IGV inspection shows a clear cluster of softclipped reads where BWA was unable to align the read end-to-end (Supplementary Fig. 7).
Interestingly, the sequence composition of the soft-clipped sequences is consistently the
same among all the soft-clipped reads. This is the typical signature of a candidate mutation
that has been erroneously mapped. Targeted re-sequencing of the regions distinctly shows
the deletion of 33 base pairs (Supplementary Fig. 8).
Supplementary Figure 6. IGV inspection of a candidate 2bp deletion in the GNPTG gene. The snapshot shows a
suspicious set of reads starting at the same location.
Supplementary Figure 7. IGV inspection of the same 2bp deletion showing soft-clipped sequence. IGV shows that
the 2bp deletion from Supplementary Figure 6 is characterized by a suspicious cluster of soft-clipped reads.
Supplementary Figure 8. IGV inspection of a 33 bp deletion validated using longer reads. Scalpel and specific resequencing confirm a large 33 bp deletion at the location (instead of a 2bp deletion).
In another region of chromosome 13, the GATK-based sequencing pipeline detects two
different closely located deletions (1bp and 5bp) and, thus, they are flagged as LGD
mutations (Supplementary Fig. 9). Scalpel instead detected only one deletion of 6 bp
together with two adjacent SNPs detected in only one of the two haplotypes of the autistic
child (family ID: 13578, location: 13:25280526), which was later confirmed by targeted resequencing (Supplementary Fig. 10).
Supplementary Figure 9. IGV inspection of two closely located deletions detected by the GATK-UnifiedGenotyper
pipeline.
Supplementary Figure 10. Correct signature for the deletions of Supplementary Figure 9. The correct signature as
reported by Scalpel is a 6 bp deletion together with two adjacent SNPs detected in only one of the two haplotypes of the
autistic child (family ID: 13578, location: 13:25280526).
160
Father
Mother
Self
Sibling
140
Coverage
120
100
80
60
40
20
0
0
1000
2000
3000
4000
5000
6000
Genome location
Supplementary Figure 11. Read coverage across one long (6 kb) exon for one family of the SSC. The highly variable
distribution is clearly visible in the picture and it is consistent among the family members.
Supplementary Figure 12. Size distribution of the human exon targets. ~95% of the human exon-targets are shorter
than 400bp
Supplementary Note 6
Repeat composition analysis
Repeats are the most problematic sequences to assemble and the specificity of any INDEL
detection method is very much correlated to its ability to detect and analyze repetitive
sequences correctly. Although the exome sequence composition is generally assumed to be
relatively simple compared to the rest of the human genome, 30% of exons have a perfect
10bp or larger repeat (Supplementary Fig. 13). But more interestingly the amount of
near-perfect repeats increases substantially if more mismatches are permitted.
Supplementary Figure 13 shows the percent of locally repetitive exons as a function of
different k-mer values and maximum number of mismatches. Each exon is analyzed to
check for the presence of a repeat or near-perfect repeat in the same region. The y-axis
reports the percentage of those blocks that have been found to contain a repeat of size k.
Note that, given the generally low error rate of the Illumina technology, allowing 3mismatches for a 10-mer (30% read accuracy) would not be realistic for a sequencing
study. However, this repeat analysis must be examined in the context of performing
sequence assembly using de Bruijn graph method. Since the input reads are decomposed
into overlapping k-mers, the resulting assembly graph for a region with a near-perfect
repeat can contain “false-bubbles” as the one depicted in Supplementary Figure 2.
Supplementary Figure 13. Repeat content of the Human Exome. Repeat content distribution on the human exome
targets as a function of the k-mer size. Each target is analyzed to check for the presence of a repeat or near-perfect repeat
in the same region. The y-axis reports the percentage of those exons that have been found to contain a repeat of size k.
Supplementary Figure 14. Histogram showing the coverage distribution for the INDELs validated with MiSeq.
References
1. Li, S. et al. SOAPindel: Efficient identification of indels from short paired reads. Genome
Res. 23, 195-200 (2012).
2. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,
Durbin, R. and 1000 Genome Project Data Processing Subgroup. The Sequence
alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9 (2009).
3. DePristo, M. et al. A framework for variation discovery and genotyping using next-generation
DNA sequencing data. Nature Genetics 43, 491-498 (2011).
4. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing.
arXiv:1207.3907v2 [q-bio.GN]. (Version: v9.9.2-43-ga97dbf8)
5. Rimmer, A., Mathieson, I., Lunter, G., McVean, G, (2012) Platypus: An Integrated Variant
Caller (www.well.ox.ac.uk/platypus).
6. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv:1303.3997v2 [q-bio.GN] (2013). (Version: v0.7.6a)
7. García-Alcalde, F., Okonechnikov, K., Carbonell, J., Cruz, L.M., Götz, S., Tarazona, S.,
Dopazo, J., Meyer, T.M., & Ana Conesa, A. Qualimap: evaluating next-generation
sequencing alignment data. Bioinformatics 20, 2678-2679 (2012). (Version: v0.7.1)
8. Highnam, G., Franck, C., Martin A., Stephens, C., Puthige, A., & Mittelman, D. Accurate
human microsatellite genotypes from high-throughput resequencing data using informed
error profiles. Nucl. Acids Res. (7 January 2013) 41 (1): e32. (Version: v0.8.2)
9. Gymrek, M., Golan, D., Rosset, S., & Erlich, Y. lobSTR: A short tandem repeat profiler for
personal genomes. Genome Research. 2012 April 22. (Version: v2.0.4)
10. O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications
for exome and genome sequencing. Genome Medicine 5:28 (2013).
Download