Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados

TE Annotation Benchmark Proposal

1. Name

TEBenchPress

2. Authors

Arian Smit and Robert Hubley
Institute for Systems Biology
Contact: rhubley@systemsbiology.org

3. Description

TEBenchPress is an evolving set of TE annotation benchmarks that we use to evaluate the performance of the RepeatMasker package. For this proposal we provide a human intergenic sequence simulated with GARLIC [1], containing modeled TEs and simple repeat sequences, as a demonstration benchmark for TE annotation software. In addition to the sequence, we provide tools that compare the known TE insertions with a user-provided set of putative TE ranges to calculate false positive, false negative, true positive, true negative, specificity, sensitivity, accuracy, and false discovery rate metrics. We also currently provide reversed and shuffled natural sequences as an additional false positive benchmark.

4. Specification

Type (real, modified real*, simulated, other types?)
    Description: Modified real and simulated.
    Comments:    Modeled intergenic sequences with modeled TE insertions from Repbase. Reversed and shuffled real genomic sequences.

Primary Uses (to measure sensitivity? specificity? other metrics?)
    Description: To measure both sensitivity and specificity of TE annotation ranges. No evaluation of repeat family membership or classification is performed.
    Comments:    --

Additional Uses
    Description: --
    Comments:    --

Taxa
    Description: Currently Homo sapiens
    Comments:    --

Source
    Description: R. Hubley
    Comments:    --

Documentation
    Description: Included
    Comments:    --

Version
    Description: 1.0
    Comments:    --

Other
    Description: --
    Comments:    --

* e.g., modified real = real + modeled evolution

5. Details

This package contains a human-like artificial sequence dataset for use as a TE annotation benchmark.
Included are a makefile that was used to generate the benchmark dataset, an evaluation of the artificial sequence against real human sequence using a variety of sequence complexity measures, and utilities to evaluate a set of annotations against the known locations of TEs in the artificial sequence. The artificial sequence, containing inserted simple repeats and TEs, is created using the GARLIC algorithm [1]. Using this sequence it is possible to evaluate both false positives and false negatives for many types of repeat annotation programs. The comparison program uses a simple BED format to relate annotation ranges to the known insertion sites. A script is provided to convert RepeatMasker output into the BED format.

Example run with RepeatMasker and evaluation of TE results:

    % RepeatMasker -engine cross_match -s artSeq.fasta
    % ./outToBed.pl -noSimple artSeq.fasta.out > artSeq.fasta.noSimple.out.bed
    % ./compareResults.pl artSeq.IROnly.inserts.bed artSeq.fasta.noSimple.out.bed artSeq.fasta

The GARLIC modeling approach requires a priori knowledge of the types, abundance, and age of repeats in a genome. Using this model to create a benchmark sequence for the same programs that were used to define the model is circular. The primary limitation is that the program providing the initial repeat list sets the bar for difficulty. False positives are still fairly evaluated; they arise primarily from the realistic simple repeat sequences included in the benchmark. Because the TE insertions are mutated starting from a consensus library, the actual insertions are independent of any given detection method and provide a good false negative benchmark.

6. References

1. Caballero, Juan, et al. "Realistic artificial DNA sequences as negative controls for computational genomics." Nucleic Acids Research (2014): gku356.
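The range comparison described in section 5 can be illustrated with a short sketch. It is not the logic of compareResults.pl, whose matching rules are not specified in this proposal. It assumes that the metrics are counted per base, that ranges follow the BED convention (0-based start, exclusive end), and that all ranges lie on a single sequence of known length. The names `merge_ranges`, `overlap_bases`, and `base_level_metrics` are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int]  # BED-style: 0-based start, exclusive end


def merge_ranges(ranges: Iterable[Range]) -> List[Range]:
    """Sort ranges and merge any that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(ranges: List[Range]) -> int:
    """Total number of bases covered by a merged range list."""
    return sum(end - start for start, end in ranges)


def overlap_bases(a: List[Range], b: List[Range]) -> int:
    """Bases shared by two merged, sorted range lists (two-pointer sweep)."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def base_level_metrics(
    known: Iterable[Range], predicted: Iterable[Range], seq_len: int
) -> Dict[str, float]:
    """Per-base confusion counts and derived rates (assumed definition)."""
    known_m = merge_ranges(known)
    pred_m = merge_ranges(predicted)
    tp = overlap_bases(known_m, pred_m)      # annotated and truly TE
    fp = covered_bases(pred_m) - tp          # annotated but not TE
    fn = covered_bases(known_m) - tp         # TE but not annotated
    tn = seq_len - tp - fp - fn              # neither
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, seq_len),
        "fdr": ratio(fp, fp + tp),
    }


if __name__ == "__main__":
    # Toy input: two known insertions, one predicted range, 100 bp sequence.
    print(base_level_metrics([(10, 20), (30, 40)], [(15, 35)], 100))
```

Counting per base rather than per insertion gives partial credit for partially detected elements. A per-insertion definition (for example, calling an insertion detected once some fraction of it is covered) would produce different numbers, so the counting unit should be stated alongside any reported results.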