Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados

TE Annotation Benchmark Proposal (Template)

1. Name

Benchmark for quality assessment of de novo repeat identification and genome annotation

2. Authors

Florian Maumus and Hadi Quesneville
URGI-INRA, FRANCE
Contact: florian.maumus@versailles.inra.fr

3. Description

This dataset aims to help assess the sensitivity and specificity of de novo repeat detection and annotation tools using the A. thaliana genome. It uses the coverage of known repeats as a proxy for sensitivity and the coverage of a simulated genome as a proxy for specificity.

4. Specification

                                                  Description                            Comments
Type (real, modified real*, simulated,            Real and simulated                     --
  other types?)
Primary Uses (to measure sensitivity?             Measure sensitivity and specificity    --
  specificity? other metrics?)
Additional Uses                                   --                                     --
Taxa                                              Arabidopsis thaliana                   --
Source                                            F. Maumus                              --
Documentation                                     In progress                            --
Version                                           v0.1                                   --
Other                                             --                                     --

* e.g., modified real = real + modeled evolution

5. Details

Blue files are available for download on the benchmark page (Benchmark_proposal_URGI_v1.tar.gz).

As described previously [1], a library of reference repeats for A. thaliana was generated by combining sequences from Repbase [2] (species A. thaliana) and from Buisine et al. [3], then filtering redundant sequences (note 1) (TAIR10_REFrepeats.fa). This library was used to annotate the TAIR10 assembly (TAIR10.fa) with RepeatMasker (note 2) [4] (TAIR10_REFrepeats_RM.gff3) and with TEannot (note 3) [5] (TAIR10_REFrepeats_TEannot.gff3) (Table 1). These two whole-genome annotations can be used to assess the quality of a de novo repeat library by measuring its capacity to mask the reference annotations with either RepeatMasker or the TEannot pipeline.

Sensitivity (%) = (REF coverage also covered by the de novo library / REF coverage) x 100, using the respective annotation tool

As an example, a library of consensus sequences representative of repeated elements in the A. thaliana genome (TAIR10.fa) was constructed with TEdenovo (note 4) [6] (TAIR10_TEdenovov2.2.fa). Genome annotation was performed using RepeatMasker (note 2) (TAIR10_TEdenovov2.2_RM.gff3) and TEannot (note 3) (TAIR10_TEdenovov2.2_TEannot.gff3). These annotations show 89.5% and 91.2% sensitivity with RepeatMasker and TEannot, respectively (Table 1).

Table 1

                Genome     Covered by      Covered by           Sensitivity
                coverage   TEdenovo + RM   TEdenovo + TEannot   (%)
REF + RM        17962650   16083538        NA                   89.54
REF + TEannot   20191809   NA              18409469             91.17

To assess the specificity of our annotations, we also annotate the reversed (not complemented) A. thaliana genome (TAIR10rev.fa) as a negative control. Because reversing a simple sequence repeat (SSR) yields another SSR, we filter SSR positions detected with TRF (note 5) (TAIR10_TRF.gff3) from the potential false positives (FPs) to obtain SSR-corrected FPs.

SSR-corrected FPs = coverage in the reversed genome outside TRF annotations
SSR-corrected FDR (%) = (SSR-corrected FP coverage / Genome coverage) x 100

We calculate SSR-corrected FDRs of 0.44% with TAIR10_TEdenovov2.2_RM.gff3 and 3.92% with TAIR10_TEdenovov2.2_TEannot.gff3 (Table 2).

Because bona fide repeats can also contain degenerate palindromic features, some regions that are annotated in the sense genome can also be legitimately detected in the reversed genome. To correct for this effect, we remove the portions that are also annotated in the sense genome from the positions annotated in the reversed genome to obtain the rev-corrected FPs.
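
The filtering steps described above (removing TRF positions, or sense-annotated positions, from the reversed-genome annotations) amount to interval subtraction followed by a coverage count. The sketch below is a minimal illustration in Python, not the code used for the benchmark. It assumes 1-based inclusive coordinates as in GFF3, that all intervals lie on a single sequence, and that both interval sets are already expressed in the same coordinate system. The function names (merge, coverage, subtract, mirror) are hypothetical, and the mirror helper is only an assumption about how reversed-genome positions could be mapped back to sense coordinates, since the proposal does not state how this was done.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive, as in GFF3


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals and fuse those that overlap or are directly adjacent."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage(intervals: Iterable[Interval]) -> int:
    """Number of positions covered by at least one interval."""
    return sum(end - start + 1 for start, end in merge(intervals))


def subtract(keep: Iterable[Interval], remove: Iterable[Interval]) -> List[Interval]:
    """Return the parts of `keep` that are not covered by `remove`."""
    removed = merge(remove)
    result: List[Interval] = []
    for start, end in merge(keep):
        cursor = start  # first position not yet accounted for
        for r_start, r_end in removed:
            if r_end < cursor:
                continue  # entirely to the left of the remaining part
            if r_start > end:
                break  # entirely to the right; later ones are too
            if r_start > cursor:
                result.append((cursor, r_start - 1))
            cursor = max(cursor, r_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))
    return result


def mirror(interval: Interval, seq_length: int) -> Interval:
    """Map an interval on a reversed sequence back to sense coordinates (assumed mapping)."""
    start, end = interval
    return (seq_length - end + 1, seq_length - start + 1)


# Toy example with made-up coordinates (not benchmark data).
reverse_hits = [(1, 100), (200, 300)]   # annotations on the reversed genome
trf_hits = [(50, 120), (250, 260)]      # SSR positions to filter out
ssr_corrected_fp = coverage(subtract(reverse_hits, trf_hits))
```

The same subtract call, given sense-genome annotations instead of TRF positions as the second argument, would yield the rev-corrected FP coverage under the same assumptions.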

Rev-corrected FPs = coverage in the reversed genome outside sense annotations
Rev-corrected FDR (%) = (rev-corrected FP coverage / Genome coverage) x 100

We calculate rev-corrected FDRs of 0.09% with TAIR10_TEdenovov2.2_RM.gff3 and 0.55% with TAIR10_TEdenovov2.2_TEannot.gff3 (Table 2).

Table 2

                     Genome     Reverse    Reverse   Reverse    SSR-corrected   Rev-corrected   SSR-corrected   Rev-corrected
                     coverage   coverage   in TRF    in sense   FP              FP              FDR (%)         FDR (%)
TEdenovo + RM        23132653   109342     7944      87663      101398          21679           0.44            0.09
TEdenovo + TEannot   27205974   1552569    484866    1403519    1067703         149050          3.92            0.55

Notes

1: Redundancy removal: 98% coverage; 95% identity (longest sequences kept).
2: RepeatMasker: default parameters except for the “slow” option and the wu-blast search engine.
3: TEannot from the REPET package version 2.2: search engine wu-blast, Blaster sensitivity set to “2”, and target FDR set to 1% (see code).
4: TEdenovo from the REPET package version 2.2, with the RepeatScout and similarity (Blaster) branches. Consensus sequences with at least one full-length copy were selected after annotation with TEannot.
5: TRF run with parameters 2 10 10 80 10 24 2000.

6. References

[1] Maumus F, Quesneville H: Deep investigation of Arabidopsis thaliana junk DNA reveals a continuum between repetitive elements and genomic dark matter. PLoS One 2014, 9(4):e94101.
[2] Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110(1-4):462-467.
[3] Buisine N, Quesneville H, Colot V: Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 2008, 91(5):467-475.
[4] Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. http://www.repeatmasker.org 1996-2010.
[5] Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 2005, 1(2):166-175.
[6] Flutre T, Duprat E, Feuillet C, Quesneville H: Considering transposable element diversification in de novo annotation approaches. PLoS One 2011, 6(1):e16526.
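
As a cross-check, the percentages reported in Tables 1 and 2 can be recomputed from the coverage values and the formulas given in the Details section. The sketch below is only an arithmetic illustration in Python, not part of the benchmark code. The coverage values are copied from the two tables; the dictionary layout and the function names (percent, sensitivity, fdrs) are hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Table 1: (REF coverage, REF coverage also covered by the TEdenovo library)
table1 = {
    "RepeatMasker": (17962650, 16083538),
    "TEannot": (20191809, 18409469),
}

# Table 2: (genome coverage, reverse coverage, reverse in TRF, reverse in sense)
table2 = {
    "RepeatMasker": (23132653, 109342, 7944, 87663),
    "TEannot": (27205974, 1552569, 484866, 1403519),
}


def sensitivity(tool: str) -> float:
    """Sensitivity (%) = covered REF coverage / REF coverage x 100."""
    ref_coverage, covered = table1[tool]
    return percent(covered, ref_coverage)


def fdrs(tool: str) -> tuple:
    """Return (SSR-corrected FDR, rev-corrected FDR), both in percent."""
    genome_cov, reverse_cov, reverse_in_trf, reverse_in_sense = table2[tool]
    ssr_corrected_fp = reverse_cov - reverse_in_trf    # reverse coverage outside TRF
    rev_corrected_fp = reverse_cov - reverse_in_sense  # reverse coverage outside sense
    return percent(ssr_corrected_fp, genome_cov), percent(rev_corrected_fp, genome_cov)


for tool in table1:
    ssr_fdr, rev_fdr = fdrs(tool)
    print(f"{tool}: sensitivity {sensitivity(tool):.2f}%, "
          f"SSR-corrected FDR {ssr_fdr:.2f}%, rev-corrected FDR {rev_fdr:.2f}%")
```

With the table values as given, the recomputed figures agree with the reported ones to two decimal places.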