Complete genomes comparison based on the taxonomic distribution of protein sequence homologs

Tatiana Tatusova, Alexander Souvorov, Roman Tatusov

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894

The field of microbial genomics has grown at an astonishing rate since the first microbial genome sequence, that of Haemophilus influenzae, was completed in 1995. Genome sequences of 51 microbial species are currently available in public databases. Together, the completed microbial genome sequences represent a collection of more than 100,000 predicted coding sequences. Examining the differences between protein sequences of various organisms gives insight into the origin of genes and the relationships between species.

A new tool for the comparison of microbial genomes, called TaxPlot, provides a genome-wide approach to the study of gene and protein functions. TaxPlot produces a 2D plot in which each predicted protein of a query organism is represented as a point whose Cartesian coordinates (X,Y) are its best BLAST scores against the predicted proteins of two other organisms. The analysis of protein similarities between organisms gives insight into their evolutionary relationships.

Another approach combines protein similarity searching with taxonomic classification of the detected homologs. A whole-genome graphical overview shows the taxonomic distribution of the highest-scoring BLAST hits across three taxonomic groups: Eukaryota, Eubacteria, and Archaea. This approach also takes advantage of the COG (Clusters of Orthologous Groups) system, which includes conserved protein families represented in at least three phylogenetically distant organisms with completely sequenced genomes. The proteins that belong to a COG are displayed in the whole-genome graphical overview and are linked to the COG database.
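The TaxPlot coordinates described above can be illustrated with a minimal Python sketch. It is not the NCBI implementation: the function name `taxplot_points`, the `(query, organism, score)` tuple layout, and the choice to plot a missing hit as 0 are all assumptions made for illustration.

```python
from collections import defaultdict


def taxplot_points(hits, organism_x, organism_y):
    """Return {query_protein: (x, y)} from BLAST hits.

    `hits` is an iterable of (query_protein, subject_organism, score)
    tuples; this layout is hypothetical, not an NCBI format.
    x and y are the best scores against organism_x and organism_y.
    Assumption: a query with no hit in one organism gets 0.0 on that axis.
    """
    best = defaultdict(lambda: {organism_x: 0.0, organism_y: 0.0})
    for query, organism, score in hits:
        if organism in (organism_x, organism_y):
            # Keep only the highest score seen for this query and organism.
            best[query][organism] = max(best[query][organism], score)
    return {q: (s[organism_x], s[organism_y]) for q, s in best.items()}


# Made-up scores for two query proteins compared against organisms "A" and "B".
example_hits = [
    ("p1", "A", 50.0),
    ("p1", "A", 80.0),
    ("p1", "B", 30.0),
    ("p2", "B", 120.0),
    ("p2", "C", 999.0),  # ignored: "C" is neither of the two plotted organisms
]
points = taxplot_points(example_hits, "A", "B")
```

Points near the diagonal would correspond to proteins that are about equally similar to both reference organisms; points far from it are closer to one organism than the other.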
The display of individual protein alignments integrates heterogeneous NCBI resources and offers a variety of display options, including the distribution of hits by taxonomic grouping, sorting by taxonomic proximity, the best hit to each organism, the protein domains in the query sequence, and similar sequences that have known 3-D structures, among others.
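Two of the display options listed above, the best hit to each organism and the distribution of hits by taxonomic grouping, can be sketched in Python as follows. The function names, the `(organism, group, score)` tuple layout, and the scores are hypothetical and only illustrate the idea; they do not reflect how the NCBI display is implemented.

```python
from collections import Counter


def best_hit_per_organism(hits):
    """Map each organism to the (group, score) of its highest-scoring hit.

    `hits` is a list of (organism, taxonomic_group, score) tuples for one
    query protein; this layout is an assumption made for illustration.
    """
    best = {}
    for organism, group, score in hits:
        if organism not in best or score > best[organism][1]:
            best[organism] = (group, score)
    return best


def hits_by_group(hits):
    """Count the hits that fall into each taxonomic group."""
    return Counter(group for _, group, _ in hits)


# Made-up hits for a single query protein.
query_hits = [
    ("Escherichia coli", "Eubacteria", 210.0),
    ("Escherichia coli", "Eubacteria", 95.0),
    ("Methanococcus jannaschii", "Archaea", 60.0),
    ("Saccharomyces cerevisiae", "Eukaryota", 140.0),
]
per_organism = best_hit_per_organism(query_hits)
group_counts = hits_by_group(query_hits)
```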