Complete genomes comparison based on the taxonomic
distribution of protein sequence homologs
Tatiana Tatusova, Alexander Souvorov, Roman Tatusov
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bldg. 38A 8600 Rockville Pike, Bethesda, MD 20894
The field of microbial genomics has grown at an astonishing rate since the first microbial genome
sequence, that of Haemophilus influenzae, was completed in 1995. Genome sequences of 51 microbial
species are currently available in public databases. Together, these completed genomes represent a
collection of more than 100,000 predicted coding sequences.
Examining the differences between protein sequences of various organisms gives insight into
the origin of genes and the relationship between species. A new tool for the comparison of
microbial genomes, called TaxPlot, provides a genome-wide approach to the study of gene and
protein functions. TaxPlot produces a 2D plot in which each predicted protein of a query
organism is represented as a point whose Cartesian coordinates (X,Y) are its best BLAST
scores against the predicted proteins of two other organisms. The analysis of protein
similarities between organisms gives insight into their evolutionary relationships.
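The coordinate scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TaxPlot implementation: the function name `taxplot_points`, the `(query_protein, subject_organism, score)` input layout, and the choice to plot a missing hit as a score of 0 are all hypothetical.

```python
from collections import defaultdict


def taxplot_points(hits, organism_x, organism_y):
    """Return {query_protein: (x, y)} from an iterable of BLAST hits.

    Each hit is a (query_protein, subject_organism, score) tuple.
    x and y are the best scores against organism_x and organism_y.
    Assumption: a protein with no hit in one organism gets 0 on that axis;
    a protein with no hit in either organism is left out of the plot.
    """
    best = defaultdict(lambda: {organism_x: 0.0, organism_y: 0.0})
    for query, organism, score in hits:
        if organism in (organism_x, organism_y):
            # Keep only the highest score seen for this protein and organism.
            best[query][organism] = max(best[query][organism], score)
    return {q: (s[organism_x], s[organism_y]) for q, s in best.items()}


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example_hits = [
        ("p1", "A", 120.0),
        ("p1", "A", 300.0),
        ("p1", "B", 80.0),
        ("p2", "B", 55.0),
        ("p2", "C", 999.0),  # ignored: organism C is not on either axis
    ]
    for protein, xy in sorted(taxplot_points(example_hits, "A", "B").items()):
        print(protein, xy)
```

In such a plot, points near the diagonal are about equally similar to both comparison organisms, while points far from it have a much stronger match in one of the two.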
Another approach combines protein similarity searching with taxonomic classification of the
detected homologs. A whole-genome graphical overview shows how the highest-scoring BLAST hit
for each protein is distributed among three taxonomic groups: Eukaryota, Eubacteria and Archaea.
This approach also takes advantage of the COG (Clusters of Orthologous Groups) system,
which includes conserved protein families represented in at least three phylogenetically distant
organisms with completely sequenced genomes. The proteins that belong to a COG are displayed
in the whole-genome graphical overview and are linked to the COG database. The individual
protein alignment display integrates heterogeneous NCBI resources and offers a variety of
display options, including the distribution of hits by taxonomic grouping, sorting by
taxonomic proximity, the best hit to each organism, the protein domains in the query
sequence, similar sequences that have known 3-D structures, and more.
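The tally behind a taxonomic distribution of best hits can be sketched as follows. This is a minimal illustration rather than the tool's actual code. It assumes that each hit is already labeled with one of the three groups, that hits to the query organism itself have been removed beforehand, and that ties go to the hit seen first. The function name `best_hit_distribution` and the input layout are hypothetical.

```python
from collections import Counter

GROUPS = ("Eukaryota", "Eubacteria", "Archaea")


def best_hit_distribution(hits):
    """Count query proteins by the taxonomic group of their best BLAST hit.

    Each hit is a (query_protein, taxonomic_group, score) tuple.
    Assumption: on equal scores the hit seen first is kept.
    """
    best = {}  # query_protein -> (group, score) of the best hit so far
    for query, group, score in hits:
        if group not in GROUPS:
            raise ValueError(f"unknown taxonomic group: {group!r}")
        if query not in best or score > best[query][1]:
            best[query] = (group, score)
    counts = Counter(group for group, _ in best.values())
    # Report every group, including those with no best hits.
    return {group: counts.get(group, 0) for group in GROUPS}


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example_hits = [
        ("p1", "Archaea", 50.0),
        ("p1", "Eubacteria", 200.0),
        ("p2", "Eukaryota", 90.0),
        ("p3", "Eubacteria", 10.0),
    ]
    print(best_hit_distribution(example_hits))
```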