Secondary databases consist of sequences of gene

advertisement
Secondary databases consist of sequences of gene products organized under functional
categories. Such databases will tend to cluster under well defined strings those sequences that
represent the same functional product in diverse organisms. One example is Kegg Orthology (KO).
We analyzed the performance of KO as a source for annotation using all entries in KO for seven
prominent organisms: C. familiaris (Cfa), M. musculus (Mmu), R. norvegicus (Rno), A. thaliana
(Ath), C. elegans (Cel), D. melanogaster (Dme) and H. sapiens (Hsa), totalizing 25,060 proteins
clustered under 17,056 distinct KO entries. These proteins were used to annotate ESTs from four
model organisms: Ath, Cel, Dme and Hsa. The annotation was classified as correct, changed and
speculated using as reference the relationship between an organisms’s ESTs and its own proteins.
We found a small number of changed annotation, but the speculation was high. This problem was
solved when we used proteins in KEGG GENES database that was not in KO as source for the
annotation, rising the correct annotate to over 90%. This problem was minimized when
proteins in KEGG GENES database that were not in KO were used to assign ESTs
to their cognate proteins, increasing the correct annotation to over 90%. Results
indicate that KEGG GENES database still contain candidates do be incorporated in
KO database.
Secondary databases consist of sequences of gene products organized under functional
categories. Such databases will tend to cluster under well defined strings those sequences that
represent the same functional product in diverse organisms. One example is Kegg Orthology
(KO). We analyzed the performance of KO as a source for annotation using all entries in KO for
seven prominent organisms: C. familiaris (Cfa), M. musculus (Mmu), R. norvegicus (Rno), A.
thaliana (Ath), C. elegans (Cel), D. melanogaster (Dme) and H. sapiens (Hsa), totalizing 25,060
proteins clustered under 17,056 distinct KO entries. These proteins were used to annotate EST
from four model organisms: Ath, Cel, Dme and Hsa. The annotation was classified as correct,
changed and speculated using as control the alignment of an EST to its cognate protein. We
found a small number of changed annotations (under 10%), but the speculation was high (2037%). This problem was minimized when proteins in KEGG GENES database that were not in
KO were used to assign ESTs to their cognate proteins, putatively increasing the correct
annotation to over 90% and reducing speculation to under 5%. Results indicate that KEGG
GENES database still contain candidates do be incorporated in KO database.
K-EST is a web-based application that shows the annotation
of large sets of ESTs from four model organisms (A. thaliana,
C. elegans, D. melanogaster and H. sapiens with the KOG database.
K-EST may be used as a tool to predict the EST sampling or, roughly,
gene expression in novel transcriptome projects by comparison to the
four model organism expression / sampling database calculated with
the KOG dataset. K-EST uses statistical methods to analyze differential
expression between organisms and internal cDNA libraries. The
expression / sampling may be normalized by constitutively expressed
transcripts: actin and GAPDH. Other important feature of K-EST is to
show the conservation between genes of the four organisms used.
Availability: http://www.biodados.icb.ufmg.br/K-EST/
Download