TE-Annotation-Benchmark-Proposal

advertisement
Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados
TE Annotation Benchmark Proposal
1. Name
TEBenchPress
2. Authors
Arian Smit and Robert Hubley
Institute for Systems Biology
Contact: rhubley@systemsbiology.org
3. Description
TEBenchPress is an evolving set of TE annotation benchmarks we use to evaluate the
performance of the RepeatMasker package.
For this proposal we provide a GARLIC[1]
simulated human intergenic sequence containing modeled TEs and simple repeat sequences as
a demonstration benchmark for TE annotation software. In addition to the sequence we provide
tools for comparing the known TE insertions with a user provided set of putative TE ranges to
calculate false positive, false negative, true positive, true negative, specificity, sensitivity,
accuracy and false discovery rate metrics. At this time we also provide reversed, and shuffled
natural sequences as an additional false positive benchmark.
Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados
4. Specification
Description
Comments
Type (real, modified real*,
simulated, other types?)
Modified real, and simulated.
Modeled intergenic sequences
with modeled TE insertions
from Repbase. Reversed, and
shuffled sequences real
genomic sequences.
Primary Uses (to measure
sensitivity? specificity? other
metrics?)
To measure both sensitivity
and specificity of TE
annotation ranges. No
evaluation of repeat family
membership or classification
is performed.
--
Additional Uses
--
--
Taxa
Currently Homo Sapiens
--
Source
R. Hubley
--
Documentation
Included
--
Version
1.0
--
Other
--
--
* e.g., modified real = real + modeled evolution
5. Details
This package contains a human-like artificial sequence dataset for use as a TE annotation
benchmark. Included is a “makefile” which was used to generate the benchmark dataset, an
evaluation of the artificial sequence vs real human sequence using a variety of sequence
complexity measures, and utilities to evaluate a set of annotations against the known locations
of TEs in the artificial sequence.
The artificial sequence containing inserted simple repeats and TEs is created using the GARLIC
algorithm[1].
Using this sequence it is possible to evaluate both false positives, and false
negatives for many types of repeat annotation programs. A simple BED format is used by the
Transposable Element Annotation Meeting (TEAM)
April 18-24, 2014
McGill Bellairs Research Institute
Folkestone, St. James, Barbados
comparison program to relate annotation ranges with the known insertion sites. A script is
provided to convert RepeatMasker output into the BED format.
Example run with RepeatMasker and evaluation of TE results:
% RepeatMasker -engine cross_match -s artSeq.fasta
% ./outToBed.pl -noSimple artSeq.fasta.out > artSeq.fasta.noSimple.out.bed
% ./compareResults.pl artSeq.IROnly.inserts.bed artSeq.fasta.noSimple.out.bed artSeq.fasta
The GARLIC modeling approach requires knowledge of the types, abundance, and age of
repeats in a genome a priori. Using this model to create a benchmark sequence for programs
used to define the model is circular. The primary limitation of this is that the program providing
the initial repeat list sets the level of the bar for difficulty. False positives are still fairly evaluated
, arising primarily due to the realistic simple repeat sequences included in the benchmark. As
the TE insertions are mutated starting from a consensus library, the actual insertions are
independent from a given detection method and provide a good false negative benchmark.
6. References
1. Caballero, Juan, et al. "Realistic artificial DNA sequences as
negative controls for computational genomics." Nucleic acids
research (2014): gku356.
Download