Benchmark_Proposal_URGI

Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados
TE Annotation Benchmark Proposal (Template)
1. Name
Benchmark for quality assessment of de novo repeat identification and genome annotation
2. Authors
Florian Maumus and Hadi Quesneville
URGI-INRA, FRANCE
Contact: florian.maumus@versailles.inra.fr
3. Description
This dataset aims to help assess the sensitivity and specificity of de novo repeat detection and
annotation tools using the A. thaliana genome. It uses the coverage of known repeats as a proxy for
sensitivity and the coverage of a simulated genome as a proxy for specificity.
4. Specification
                                                                     Description                              Comments
Type (real, modified real*, simulated, other types?)                 Real and simulated                       --
Primary Uses (to measure sensitivity? specificity? other metrics?)   Measure sensitivity and specificity      --
Additional Uses                                                      --                                       --
Taxa                                                                 Arabidopsis thaliana                     --
Source                                                               F. Maumus                                --
Documentation                                                        In progress                              --
Version                                                              v0.1                                     --
Other                                                                --                                       --
* e.g., modified real = real + modeled evolution
5. Details
Blue files are available for download on the benchmark page
(Benchmark_proposal_URGI_v1.tar.gz)
As described previously [1], a library of reference repeats for A. thaliana was generated by
combining sequences from Repbase [2] (species A. thaliana) with those from Buisine et al. [3] and by
filtering out redundant sequences (note 1) (TAIR10_REFrepeats.fa).
This library was used to annotate the TAIR10 assembly (TAIR10.fa) with RepeatMasker (note 2) [4]
(TAIR10_REFrepeats_RM.gff3) and with TEannot (note 3) [5] (TAIR10_REFrepeats_TEannot.gff3) (Table 1).
These two whole-genome annotations can be used to assess the quality of a de novo repeat library by
measuring its capacity to mask the reference annotations using either RepeatMasker or the TEannot
pipeline.
Sensitivity(%)=(REF coverage also covered by the de novo annotation/REF coverage) x 100 (using the
respective annotation tool)
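The sensitivity measure can be computed from two GFF3 annotations as a base-pair overlap. The
following is a minimal sketch, not the authors' code: it assumes that sensitivity is the share of
reference-annotated bases that are also covered by the de novo annotation, and the helper names
(parse_gff3, merge, covered, overlap, sensitivity) and the toy GFF3 lines are hypothetical.

```python
from collections import defaultdict


def parse_gff3(lines):
    """Collect (start, end) intervals per sequence from GFF3 lines (1-based, inclusive)."""
    by_seq = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        by_seq[cols[0]].append((int(cols[3]), int(cols[4])))
    return by_seq


def merge(intervals):
    """Merge overlapping or adjacent inclusive intervals so no base is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals):
    """Number of bases covered by a set of intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))


def overlap(a, b):
    """Number of bases shared by two interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            total += hi - lo + 1
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def sensitivity(ref, query):
    """Percentage of reference-annotated bases also covered by the query annotation."""
    ref_bp = sum(covered(iv) for iv in ref.values())
    shared = sum(overlap(iv, query.get(seq, [])) for seq, iv in ref.items())
    return 100.0 * shared / ref_bp


# Toy records (hypothetical, not taken from the benchmark files)
ref_lines = [
    "Chr1\tREF\tmatch\t1\t100\t.\t+\t.\tID=a",
    "Chr1\tREF\tmatch\t201\t300\t.\t-\t.\tID=b",
]
query_lines = ["Chr1\tdenovo\tmatch\t51\t250\t.\t+\t.\tID=c"]
toy_sensitivity = sensitivity(parse_gff3(ref_lines), parse_gff3(query_lines))
print(toy_sensitivity)
```

In the toy data the reference covers 200 bp, of which 100 bp are also covered by the query. With
real files, the lines of two GFF3 annotations produced by the same tool would be passed to
parse_gff3 instead.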
As an example, a library of consensus sequences representative of repeated elements in the A. thaliana
genome (TAIR10.fa) was constructed with TEdenovo (note 4) [6] (TAIR10_TEdenovov2.2.fa). Genome
annotation was performed using RepeatMasker (note 2) (TAIR10_TEdenovov2.2_RM.gff3) and TEannot (note 3)
(TAIR10_TEdenovov2.2_TEannot.gff3). These annotations show 89.5% and 91.2% sensitivity with
RepeatMasker and TEannot, respectively (Table 1).
Table 1
                Genome coverage (bp)   Covered by TEdenovo + RM (bp)   Covered by TEdenovo + TEannot (bp)   Sensitivity (%)
REF + RM        17962650               16083538                        NA                                   89.54
REF + TEannot   20191809               NA                              18409469                             91.17
To assess the specificity of our annotations, we also annotate the reversed (not reverse-complemented)
A. thaliana genome (TAIR10rev.fa) as a negative control. Because a reversed simple sequence repeat (SSR)
is still an SSR, possibly with a different motif, we filter SSR positions detected with TRF (note 5)
(TAIR10_TRF.gff3) from the potential false positives (FPs) to obtain SSR-corrected FPs.
SSR-corrected FPs=coverage in the reversed genome outside TRF annotations
SSR-corrected FDR(%)=(SSR-corrected FP coverage/Genome coverage) x 100
We calculate SSR-corrected FDRs of 0.44% with TAIR10_TEdenovov2.2_RM.gff3 and 3.92% with
TAIR10_TEdenovov2.2_TEannot.gff3 (Table 2).
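A reversed (not complemented) negative-control genome of this kind can be produced with a few lines
of Python. This is a minimal sketch under stated assumptions, not the procedure used to build
TAIR10rev.fa: the function names are hypothetical, and flip_interval is only needed if annotations
on the reversed sequence have to be compared with annotations given in forward coordinates, which
the text does not specify.

```python
def reverse_not_complement(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]


def reverse_fasta(lines):
    """Yield FASTA lines with each record's sequence reversed (headers unchanged)."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header
                yield reverse_not_complement("".join(chunks))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header
        yield reverse_not_complement("".join(chunks))


def flip_interval(start, end, seq_len):
    """Map a 1-based inclusive interval on the reversed sequence to forward coordinates."""
    return seq_len - end + 1, seq_len - start + 1


# An SSR stays an SSR after reversal, here with a different motif: (ACG)n becomes (GCA)n.
reversed_ssr = reverse_not_complement("ACGACGACG")
print(reversed_ssr)

# Toy FASTA record (hypothetical), sequence split over two lines
reversed_record = list(reverse_fasta([">toy", "ACGT", "AA"]))
print(reversed_record)
```

The SSR example illustrates why TRF-annotated positions are excluded from the false positives:
reversal destroys most sequence features but leaves tandem repeats detectable.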
Because bona fide repeats can also contain degenerate palindromic features, some regions that are
annotated in the sense genome can also be legitimately detected in the reversed genome. To correct
for this effect, we remove the portions that are also annotated in the sense genome from the positions
annotated in the reversed genome to obtain the rev-corrected FPs.
Rev-corrected FPs=coverage in the reversed genome outside sense annotations
Rev-corrected FDR(%)=(rev-corrected FP coverage/Genome coverage) x 100
We calculate rev-corrected FDRs of 0.09% with TAIR10_TEdenovov2.2_RM.gff3 and 0.55% with
TAIR10_TEdenovov2.2_TEannot.gff3 (Table 2).
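Both corrections amount to subtracting one set of intervals from another before counting bases. The
sketch below shows the idea on toy data. It assumes that all intervals are 1-based, inclusive,
on a single sequence and already expressed in the same coordinate system; the function names and
the toy numbers are hypothetical and not part of the benchmark.

```python
def merge(intervals):
    """Merge overlapping or adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a, b):
    """Return the parts of interval set a that are not covered by interval set b."""
    result = []
    b = merge(b)
    for start, end in merge(a):
        cur = start  # first base of the current interval not yet handled
        for b_start, b_end in b:
            if b_end < cur:
                continue  # mask interval lies entirely before the remaining part
            if b_start > end:
                break  # mask interval lies entirely after this interval
            if b_start > cur:
                result.append((cur, b_start - 1))
            cur = max(cur, b_end + 1)
            if cur > end:
                break
        if cur <= end:
            result.append((cur, end))
    return result


def coverage(intervals):
    """Number of bases covered by a set of intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))


def fdr_percent(fp_bp, genome_coverage_bp):
    """FDR(%) = (FP coverage / Genome coverage) x 100."""
    return 100.0 * fp_bp / genome_coverage_bp


# Toy data (hypothetical)
reverse_hits = [(1, 100), (301, 400)]   # annotated in the reversed genome
trf_hits = [(81, 120)]                  # SSR positions reported by TRF
sense_hits = [(1, 50), (301, 400)]      # annotated in the sense genome
genome_coverage = 1000                  # toy value for the "Genome coverage" denominator

ssr_corrected_fp = coverage(subtract(reverse_hits, trf_hits))
rev_corrected_fp = coverage(subtract(reverse_hits, sense_hits))
print(ssr_corrected_fp, fdr_percent(ssr_corrected_fp, genome_coverage))
print(rev_corrected_fp, fdr_percent(rev_corrected_fp, genome_coverage))
```

Merging before subtracting keeps nested or overlapping annotations from being counted twice, which
matters when a tool reports several matches over the same locus.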
Table 2
                        TEdenovo + RM   TEdenovo + TEannot
Genome coverage (bp)    23132653        27205974
Reverse coverage (bp)   109342          1552569
Reverse in TRF (bp)     7944            484866
Reverse in sense (bp)   87663           1403519
SSR-corrected FP (bp)   101398          1067703
Rev-corrected FP (bp)   21679           149050
SSR-corrected FDR (%)   0.44            3.92
Rev-corrected FDR (%)   0.09            0.55
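The derived values in Table 2 follow from the four measured quantities by subtraction and division,
which the short check below reproduces. The numbers are copied from Table 2; the dictionary keys and
the derive function are hypothetical names, and the assignment of values to quantities is our reading
of the table.

```python
# Measured quantities from Table 2 (bp)
table2 = {
    "TEdenovo + RM": {
        "genome": 23132653, "reverse": 109342,
        "reverse_in_trf": 7944, "reverse_in_sense": 87663,
    },
    "TEdenovo + TEannot": {
        "genome": 27205974, "reverse": 1552569,
        "reverse_in_trf": 484866, "reverse_in_sense": 1403519,
    },
}


def derive(row):
    """Compute corrected FP coverage (bp) and FDRs (%) from the measured quantities."""
    ssr_fp = row["reverse"] - row["reverse_in_trf"]    # reverse coverage outside TRF
    rev_fp = row["reverse"] - row["reverse_in_sense"]  # reverse coverage outside sense
    return {
        "ssr_corrected_fp": ssr_fp,
        "rev_corrected_fp": rev_fp,
        "ssr_corrected_fdr": round(100.0 * ssr_fp / row["genome"], 2),
        "rev_corrected_fdr": round(100.0 * rev_fp / row["genome"], 2),
    }


derived = {name: derive(row) for name, row in table2.items()}
for name, values in derived.items():
    print(name, values)
```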
1: Redundancy removal: 98% coverage and 95% identity thresholds (longest sequences kept).
2: RepeatMasker: default parameters, except for the “slow” option and the wu-blast search engine.
3: TEannot from the REPET package version 2.2: wu-blast search engine, Blaster sensitivity set to “2”,
and target FDR set to 1% (see code).
4: TEdenovo from the REPET package version 2.2, run with the RepeatScout and similarity (Blaster)
branches. Only consensus sequences with at least one full-length copy after annotation with TEannot
were selected.
5: TRF run with parameters 2 10 10 80 10 24 2000.
6. References
1. Maumus F, Quesneville H: Deep investigation of Arabidopsis thaliana junk DNA reveals a
continuum between repetitive elements and genomic dark matter. PLoS One 2014, 9(4):e94101.
2. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a
database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110(1-4):462-467.
3. Buisine N, Quesneville H, Colot V: Improved detection and annotation of transposable elements
in sequenced genomes using multiple reference sequence sets. Genomics 2008, 91(5):467-475.
4. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. http://www.repeatmasker.org 1996-2010.
5. Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D:
Combined evidence annotation of transposable elements in genome sequences. PLoS Comput
Biol 2005, 1(2):166-175.
6. Flutre T, Duprat E, Feuillet C, Quesneville H: Considering transposable element diversification in
de novo annotation approaches. PLoS One 2011, 6(1):e16526.