file - Genome Biology

Supplemental Materials
Category | Parameter | Default Value | Description
dataset: | reference | (None) | Absolute path to the reference genome file (.FASTA file).
dataset: | technology | illumina | Sequencing technology (illumina/ion_torrent/454).
dataset: | read_length | 100 | Exact or mean read length (depends on dataset.technology).
dataset: | read_count | (Computed) | By default, computed from dataset.coverage, subsampling.ratio and dataset.reference.
dataset: | coverage | 1 | Target coverage for the subsampled reference, used to compute dataset.read_count if it was not set manually.
dataset: | paired | Yes | Reads will be paired-end if set to Yes, otherwise single-end.
dataset: | insert_size | 500 | Distance between the outer ends of the two reads in a read pair.
dataset: | mutation_rate | 0.001 | Overall rate of mutations introduced into the reference (0-1).
dataset: | mutation_indel_frac | 0.3 | Fraction (0-1) of mutations that are insertions or deletions. All other mutations are SNPs.
dataset: | mutation_indel_avg_len | 1 | Average length of insertions and deletions.
dataset: | error_rate_mult | 1 | Scaling factor for the sequencing error rate. The base error rate depends on the sequencing technology.
sampling: | enabled | Yes | If set to No, no subsampling will be performed on the reference for simulation.
sampling: | ratio | (Computed) | Ranges from 50% (small genomes) to 1% (large genomes).
sampling: | region_size | (Computed) | Set to 10 times read_length.
evaluation: | mappers | bwa,bowtie2,ngm | Identifiers of the mappers to be evaluated for the data set. Please consult the Teaser manual for a list of predefined mappers and their identifiers.
evaluation: | parameters | (None) | List where additional input parameters can be specified.
evaluation: | pos_threshold | 50 | Maximum distance between the mapped position of a read and its simulated source position for the read to be considered correctly mapped.
Table S1: Parameter list for Teaser.
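The "(Computed)" defaults in Table S1 can be illustrated with a minimal sketch. It assumes that read_count is the number of individual reads needed to reach the target coverage on the subsampled reference, and that the subsampling ratio is given as a fraction between 0 and 1. The function names and the exact rounding are hypothetical and are not taken from the Teaser source code.

```python
import math


def default_region_size(read_length: int) -> int:
    """Region size as described in Table S1: 10 times read_length."""
    return 10 * read_length


def estimate_read_count(reference_length_bp: int,
                        coverage: float = 1.0,
                        ratio: float = 1.0,
                        read_length: int = 100) -> int:
    """Estimate the number of reads for a target coverage (assumed formula).

    Assumption: reads = (reference length * ratio) * coverage / read length,
    rounded up. Teaser's actual computation may differ, for example in how
    paired-end reads are counted.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    sampled_bases = reference_length_bp * ratio
    return math.ceil(sampled_bases * coverage / read_length)


if __name__ == "__main__":
    # Hypothetical example: a 146 Mb reference subsampled to 25%.
    print(estimate_read_count(146_000_000, coverage=1.0, ratio=0.25))
    print(default_region_size(100))
```

Setting dataset.read_count manually would bypass such an estimate, as noted in the description of dataset.coverage.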
Organism | Reference Assembly | Sampled Percent (%) | Subsampled Genome Size (Mb) | Read Count (Million) | Teaser Runtime (minutes)
Human | GRCh37 | Full genome | 3200 | 21 | 617.5
Mus musculus | GRCm38 | Full genome | 4800 | 32 | 1127.4
D. melanogaster | BDGP6 | Full genome | 146 | 1 | 18.84
Human | GRCh37 | 25 | 800 | 5.25 | 179.5
Mus musculus | GRCm38 | 25 | 1200 | 8 | 329.65
D. melanogaster | BDGP6 | 25 | 36.5 | 0.25 | 5.73
Human | GRCh37 | 10 | 320 | 2.1 | 66.98
Mus musculus | GRCm38 | 10 | 480 | 4 | 144.86
D. melanogaster | BDGP6 | 10 | 14.6 | 0.1 | 2.35
Human | GRCh37 | 5 | 160 | 1.05 | 32.55
Mus musculus | GRCm38 | 5 | 240 | 1.6 | 68.38
D. melanogaster | BDGP6 | 5 | 7.3 | 0.05 | 1.21
Human | GRCh37 | 1 | 80 | 0.21 | 10.28
Mus musculus | GRCm38 | 1 | 48 | 0.32 | 19.4
D. melanogaster | BDGP6 | 1 | 1.46 | 0.01 | 0.4
Table S2: Data sets used to verify the subsampling process.
Mapper | Parameter | D1 Correctly mapped (%) | D1 Runtime (sec) | D2 Correctly mapped (%) | D2 Runtime (sec) | D3 Correctly mapped (%) | D3 Runtime (sec)
bowtie2 |  | 91.62 | 10.11 | 84.65 | 12.88 | 53.57 | 10.67
bowtie2_00 | --very-fast | 91.40 | 8.22 | 76.14 | 8.40 | 42.26 | 6.02
bowtie2_01 | --fast | 91.52 | 9.06 | 76.66 | 10.16 | 42.68 | 6.75
bowtie2_02 | --sensitive | 91.62 | 11.00 | 84.65 | 12.91 | 53.57 | 11.23
bowtie2_03 | --very-sensitive | 91.67 | 15.41 | 88.43 | 18.30 | 57.17 | 16.79
bowtie2_04 | --very-fast-local | 91.47 | 19.52 | 77.38 | 25.10 | 52.52 | 23.13
bowtie2_05 | --fast-local | 91.53 | 20.30 | 82.47 | 27.58 | 66.79 | 30.26
bowtie2_06 | --sensitive-local | 91.61 | 23.64 | 90.18 | 32.99 | 86.27 | 42.27
bowtie2_07 | --very-sensitive-local | 91.65 | 25.64 | 90.63 | 36.06 | 88.86 | 46.47
bowtie2_08 | -D 5 -R 1 -L 15 | 91.22 | 16.18 | 85.97 | 17.50 | 53.05 | 15.51
bowtie2_09 | -D 5 -R 1 -L 20 | 91.50 | 9.70 | 85.67 | 10.47 | 54.86 | 9.13
bowtie2_10 | -D 5 -R 1 -L 30 | 91.51 | 9.16 | 72.82 | 8.79 | 38.20 | 6.12
bowtie2_11 | -D 5 -R 5 -L 15 | 91.25 | 16.49 | 86.05 | 17.87 | 53.11 | 15.54
bowtie2_12 | -D 5 -R 5 -L 20 | 91.52 | 9.80 | 85.71 | 10.59 | 54.89 | 9.14
bowtie2_13 | -D 5 -R 5 -L 30 | 91.52 | 9.27 | 72.83 | 8.85 | 38.20 | 6.05
bowtie2_14 | -D 5 -R 20 -L 15 | 91.26 | 17.57 | 86.06 | 18.59 | 53.13 | 16.34
bowtie2_15 | -D 5 -R 20 -L 20 | 91.53 | 10.25 | 85.73 | 10.95 | 54.90 | 9.71
bowtie2_16 | -D 5 -R 20 -L 30 | 91.52 | 9.41 | 72.83 | 9.45 | 38.20 | 6.18
bowtie2_17 | -D 15 -R 1 -L 15 | 91.56 | 23.71 | 88.36 | 26.19 | 56.61 | 25.88
bowtie2_18 | -D 15 -R 1 -L 20 | 91.61 | 11.41 | 86.35 | 13.59 | 55.46 | 11.86
bowtie2_19 | -D 15 -R 1 -L 30 | 91.60 | 10.12 | 73.36 | 10.34 | 38.61 | 6.99
bowtie2_20 | -D 15 -R 5 -L 15 | 91.58 | 24.91 | 88.43 | 27.04 | 56.66 | 26.48
bowtie2_21 | -D 15 -R 5 -L 20 | 91.62 | 11.83 | 86.39 | 14.16 | 55.49 | 12.01
bowtie2_22 | -D 15 -R 5 -L 30 | 91.61 | 10.41 | 73.37 | 10.84 | 38.62 | 7.34
bowtie2_23 | -D 15 -R 20 -L 15 | 91.59 | 26.11 | 88.44 | 28.42 | 56.68 | 27.43
bowtie2_24 | -D 15 -R 20 -L 20 | 91.62 | 13.23 | 86.41 | 14.85 | 55.50 | 12.49
bowtie2_25 | -D 15 -R 20 -L 30 | 91.61 | 10.66 | 73.38 | 10.97 | 38.62 | 7.52
bowtie2_26 | -D 30 -R 1 -L 15 | 91.60 | 30.17 | 88.76 | 33.98 | 57.15 | 34.53
bowtie2_27 | -D 30 -R 1 -L 20 | 91.64 | 13.16 | 86.66 | 16.86 | 55.74 | 14.72
bowtie2_28 | -D 30 -R 1 -L 30 | 91.63 | 11.52 | 73.59 | 12.58 | 38.78 | 8.58
bowtie2_29 | -D 30 -R 5 -L 15 | 91.62 | 31.46 | 88.82 | 34.98 | 57.20 | 35.44
bowtie2_30 | -D 30 -R 5 -L 20 | 91.65 | 13.96 | 86.70 | 17.73 | 55.77 | 15.12
bowtie2_31 | -D 30 -R 5 -L 30 | 91.63 | 11.69 | 73.60 | 12.77 | 38.78 | 9.29
bowtie2_32 | -D 30 -R 20 -L 15 | 91.62 | 32.70 | 88.82 | 36.26 | 57.21 | 36.78
bowtie2_33 | -D 30 -R 20 -L 20 | 91.65 | 14.38 | 86.70 | 17.77 | 55.78 | 15.50
bowtie2_34 | -D 30 -R 20 -L 30 | 91.63 | 11.76 | 73.61 | 12.77 | 38.78 | 9.46
bwamem |  | 91.65 | 12.22 | 91.56 | 29.09 | 89.31 | 35.21
ngm |  | 91.45 | 8.10 | 91.66 | 10.22 | 92.34 | 28.21
Table S3: Parameter optimization results for D. melanogaster.
Mapper | Parameter | Simulated Correctly mapped (%) | Simulated Runtime (sec) | Real Correctly mapped (%) | Real Runtime (sec)
bowtie2 |  | 76.48 | 9.90 | 77.62 | 4.40
bowtie2_00 | --very-fast | 75.86 | 8.54 | 74.84 | 4.02
bowtie2_01 | --fast | 75.87 | 8.80 | 76.08 | 4.15
bowtie2_02 | --sensitive | 76.48 | 10.51 | 77.62 | 4.18
bowtie2_03 | --very-sensitive | 76.64 | 18.67 | 78.58 | 8.29
bowtie2_04 | --very-fast-local | 95.90 | 16.40 | 98.67 | 10.08
bowtie2_05 | --fast-local | 97.24 | 21.80 | 98.98 | 10.92
bowtie2_06 | --sensitive-local | 99.64 | 27.49 | 99.23 | 12.55
bowtie2_07 | --very-sensitive-local | 99.88 | 38.92 | 99.40 | 14.61
bowtie2_08 | -D 5 -R 1 -L 15 | 75.98 | 14.58 | 73.34 | 8.01
bowtie2_09 | -D 5 -R 1 -L 20 | 76.55 | 10.12 | 75.07 | 5.05
bowtie2_10 | -D 5 -R 1 -L 30 | 75.75 | 7.90 | 74.87 | 4.93
bowtie2_11 | -D 5 -R 5 -L 15 | 75.99 | 23.38 | 73.62 | 10.81
bowtie2_12 | -D 5 -R 5 -L 20 | 76.63 | 14.17 | 75.40 | 7.10
bowtie2_13 | -D 5 -R 5 -L 30 | 76.55 | 8.18 | 75.12 | 5.03
bowtie2_14 | -D 5 -R 20 -L 15 | 75.99 | 54.01 | 74.04 | 16.32
bowtie2_15 | -D 5 -R 20 -L 20 | 76.63 | 29.93 | 75.71 | 8.60
bowtie2_16 | -D 5 -R 20 -L 30 | 76.56 | 13.90 | 75.20 | 5.09
bowtie2_17 | -D 15 -R 1 -L 15 | 76.62 | 19.38 | 76.68 | 11.36
bowtie2_18 | -D 15 -R 1 -L 20 | 76.55 | 11.71 | 77.65 | 6.06
bowtie2_19 | -D 15 -R 1 -L 30 | 75.75 | 8.67 | 77.13 | 5.47
bowtie2_20 | -D 15 -R 5 -L 15 | 76.63 | 27.83 | 77.07 | 14.15
bowtie2_21 | -D 15 -R 5 -L 20 | 76.63 | 16.17 | 78.04 | 7.02
bowtie2_22 | -D 15 -R 5 -L 30 | 76.55 | 10.77 | 77.41 | 5.00
bowtie2_23 | -D 15 -R 20 -L 15 | 76.62 | 61.65 | 77.44 | 20.32
bowtie2_24 | -D 15 -R 20 -L 20 | 76.63 | 33.75 | 78.31 | 9.18
bowtie2_25 | -D 15 -R 20 -L 30 | 76.56 | 15.74 | 77.49 | 5.43
bowtie2_26 | -D 30 -R 1 -L 15 | 76.64 | 24.75 | 78.51 | 15.18
bowtie2_27 | -D 30 -R 1 -L 20 | 76.55 | 13.50 | 79.08 | 4.28
bowtie2_28 | -D 30 -R 1 -L 30 | 75.75 | 10.22 | 78.33 | 5.34
bowtie2_29 | -D 30 -R 5 -L 15 | 76.63 | 34.35 | 78.91 | 17.23
bowtie2_30 | -D 30 -R 5 -L 20 | 76.64 | 18.70 | 79.46 | 7.06
bowtie2_31 | -D 30 -R 5 -L 30 | 76.55 | 10.63 | 78.60 | 4.82
bowtie2_32 | -D 30 -R 20 -L 15 | 76.63 | 69.17 | 79.23 | 24.55
bowtie2_33 | -D 30 -R 20 -L 20 | 76.64 | 37.60 | 79.67 | 9.27
bowtie2_34 | -D 30 -R 20 -L 30 | 76.56 | 17.84 | 78.67 | 6.06
bwamem |  | 99.88 | 19.52 | 99.32 | 5.34
ngm |  | 99.82 | 8.09 | 97.98 | 3.37
Table S4: Parameter optimization results for Cottus rhenanus.
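The Bowtie 2 configurations listed in Tables S3 and S4 follow a regular pattern: the default run, eight presets (bowtie2_00 to bowtie2_07) and a full grid over -D, -R and -L (bowtie2_08 to bowtie2_34). The following sketch regenerates these identifiers and option strings. The function name is hypothetical, and the code only mirrors the labels in the tables; it is not how Teaser itself enumerates parameter sets.

```python
from itertools import product

# Preset options in the order used for bowtie2_00 .. bowtie2_07.
PRESETS = [
    "--very-fast", "--fast", "--sensitive", "--very-sensitive",
    "--very-fast-local", "--fast-local", "--sensitive-local",
    "--very-sensitive-local",
]

# Grid values as they appear in the tables (the -L value changes fastest).
D_VALUES = (5, 15, 30)
R_VALUES = (1, 5, 20)
L_VALUES = (15, 20, 30)


def bowtie2_configurations():
    """Return (identifier, options) pairs matching the table row labels."""
    grid = [f"-D {d} -R {r} -L {l}"
            for d, r, l in product(D_VALUES, R_VALUES, L_VALUES)]
    configs = [("bowtie2", "")]  # default parameters
    for index, options in enumerate(PRESETS + grid):
        configs.append((f"bowtie2_{index:02d}", options))
    return configs


if __name__ == "__main__":
    for identifier, options in bowtie2_configurations():
        print(identifier, options)
```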
Note S1: Description of Teaser Report
Teaser generates an HTML-based report that summarizes the results in the form of
tables and interactive figures. This includes comparing the mapping results for
different measurements (as described above) and different mapping quality
thresholds.
The main part of a report comprises seven sections:
(1) Mapping statistics. The performance of each mapper and parameter
setting is shown in a bar chart in terms of percentages of correctly, wrongly
and not mapped reads. The chart can be adjusted using a mapping quality
threshold. Thus, the user can identify the most suitable mapping quality
threshold for a specific data set characteristic (e.g. reference genome or read
length). Figure S1 shows such a plot for data set D2.
(2) Summary of mapping quality scores. This plot assists in identifying the
most suitable mapping quality threshold. The X-axis displays the mapping
quality. The Y-axis displays the wrongly mapped reads divided by the
correctly mapped reads that would be excluded by using the corresponding
mapping quality value as a threshold. Figure S2 shows such a plot for D2.
(3) Precision and recall rates for each mapper and parameter setting.
(4) Correctly mapped reads per second. This plot is useful when two
mappers achieve near identical performances in terms of accuracy.
(5) Runtime in minutes per mapper and parameter setting.
(6) Peak memory consumption of the individual mapper and parameter
settings.
(7) Scatter plot displaying the number of reads mapped per second and
the correctly mapped reads (%). This plot highlights the performance of
each mapper in its setting in terms of runtime and accuracy.
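The quantities shown in sections (1), (3) and (4) can be derived from three read counts and a runtime. The sketch below is an illustration only. It assumes that a read counts as correct when it maps to the same sequence within pos_threshold bases of its simulated position, that precision is correct / (correct + wrong), and that recall is correct / all reads. These definitions, like all class and function names, are assumptions and may differ from the ones Teaser uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (sequence name, position); a hypothetical representation of a location.
Position = Tuple[str, int]


def classify(mapped: Optional[Position], simulated: Position,
             pos_threshold: int = 50) -> str:
    """Label a read as 'correct', 'wrong' or 'not_mapped'."""
    if mapped is None:
        return "not_mapped"
    same_sequence = mapped[0] == simulated[0]
    if same_sequence and abs(mapped[1] - simulated[1]) <= pos_threshold:
        return "correct"
    return "wrong"


@dataclass
class MapperStats:
    correct: int
    wrong: int
    not_mapped: int
    runtime_sec: float

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.not_mapped

    def percentages(self) -> dict:
        """Percentages of correctly, wrongly and not mapped reads."""
        if self.total == 0:
            return {"correct": 0.0, "wrong": 0.0, "not_mapped": 0.0}
        return {
            "correct": 100.0 * self.correct / self.total,
            "wrong": 100.0 * self.wrong / self.total,
            "not_mapped": 100.0 * self.not_mapped / self.total,
        }

    def precision(self) -> float:
        mapped = self.correct + self.wrong
        return self.correct / mapped if mapped else 0.0

    def recall(self) -> float:
        return self.correct / self.total if self.total else 0.0

    def correct_per_second(self) -> float:
        return self.correct / self.runtime_sec


if __name__ == "__main__":
    stats = MapperStats(correct=900, wrong=50, not_mapped=50, runtime_sec=10.0)
    print(stats.percentages(), stats.precision(), stats.recall(),
          stats.correct_per_second())
```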
1.1 Individual run report
For each mapper and parameter setting mentioned above, an individual report is
generated. This report provides further insight into the results. This includes
displaying the evaluation result for the particular setting in tabular form.
Furthermore, it provides a plot summarizing the mapping quality with respect to
the percentages of correctly and wrongly mapped reads. Figure S3 shows such plots
for the different mappers based on data set D2. At the end of the report one can
inspect the error log and the command line output of the mapper, as well as the
exact command that was used for execution.
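The mapping quality plots can be approximated from a list of mapped reads with their mapping quality and a correct/wrong label. The sketch below assumes that a threshold t excludes every read with mapping quality below t. It then reports, per threshold, the ratio of excluded wrongly mapped reads to excluded correctly mapped reads, together with the reads that remain. The function name and the input format are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def mapq_threshold_summary(reads: Iterable[Tuple[int, bool]],
                           thresholds: Iterable[int]) -> List[Dict]:
    """Summarize the effect of mapping quality thresholds.

    reads: (mapping quality, is_correct) for every mapped read.
    Assumption: a read is excluded if its mapping quality is below the
    threshold.
    """
    reads = list(reads)
    rows = []
    for t in thresholds:
        excluded_correct = sum(1 for q, ok in reads if q < t and ok)
        excluded_wrong = sum(1 for q, ok in reads if q < t and not ok)
        kept_correct = sum(1 for q, ok in reads if q >= t and ok)
        kept_wrong = sum(1 for q, ok in reads if q >= t and not ok)
        # The ratio is undefined when no correct read is excluded.
        ratio = excluded_wrong / excluded_correct if excluded_correct else None
        rows.append({
            "threshold": t,
            "excluded_wrong": excluded_wrong,
            "excluded_correct": excluded_correct,
            "ratio": ratio,
            "kept_correct": kept_correct,
            "kept_wrong": kept_wrong,
        })
    return rows


if __name__ == "__main__":
    example = [(0, False), (0, True), (10, False), (30, True), (60, True)]
    for row in mapq_threshold_summary(example, [0, 20, 40]):
        print(row)
```

A high ratio at a given threshold indicates that the threshold removes mostly wrongly mapped reads while sacrificing few correct ones.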
Figure S1: Summarization report. (a) Evaluation of alignment statistics in terms of
fractions of correctly, wrongly and not mapped reads for the data set D2. Plot (b) illustrates
runtimes for this data set.
Figure S2: Overall evaluation of mapping quality thresholds for all mappers for D2.
Figure S3: Detailed evaluation of mapping quality thresholds. Results are shown for the
D2 data set for (a) Bowtie 2, (b) BWA-MEM and (c) NextGenMap. This highlights the
differences of mapping quality computation for different mappers for the same data set.