Genome signatures of microbial organisms identified by amino acid n-gram analysis

B. Suman Bharathi (1), Deborah Weisser (2) and Judith Klein Seetharaman (2)

(1) IBI-2, Forschungszentrum Juelich, Juelich stadt, 52425. 49 2461 612510, Fax.
(2) Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213. Tel. 412-268-8249, Fax. 412-268-2338

bsuman_1979@yahoo.com, dweisser@cs.cmu.edu, judithks@cs.cmu.edu

Keywords: statistical analysis, genome signatures

Abstract

Importance of genome signatures: Designing drugs against pathogens has been of great significance in human medicine since medical research began. A detailed study of the molecular and chemical properties of drugs directed against target sites of an antigen that the pathogen presents to, or expresses in, the host can help us develop effective vaccines with minimal side effects. Great emphasis is placed on solving the problem of sequence-structure-form mapping of proteins.

Using the blmt toolkit, we search for protein sequence 4-grams that act as genome signatures for an organism. 4-grams, or tetrapeptides, of the 20 amino acids in all possible combinations are explored and sorted by their frequency of occurrence in the respective organism (both expected and observed frequencies). The difference between the observed and expected frequencies is calculated for each 4-gram, and the mean and standard deviation of the difference values are computed. The absolute difference from the mean, expressed as a number of standard deviations, is calculated for all 160,000 4-grams. All 160,000 4-grams are sorted by difference. The genome size of the organism is compared with the standard deviation values obtained for the organism, and the relationship between the two quantities is studied.
We observe different distributions of standard deviation values between similar organisms (same genus, different species): Mycobacterium tuberculosis shows very high variation in its standard deviation values compared to another species in the same genus, M. leprae.

Introduction

This project uses the blmt toolkit to search for 4-gram signatures: small sets of 4-grams which uniquely characterize an organism. The toolkit is used to obtain frequencies for each of the 160,000 possible 4-grams in an organism and to find those 4-grams which occur much more or much less often than expected. Statistical computational analysis was conducted on 44 organisms, including bacteria, archaea, mycoplasma and human.

Goals: The goal is to derive “signature” 4-grams for different organisms, where a “signature” is a small set of 4-grams whose unexpected presence or absence uniquely determines the organism. Toward this goal, the difference between observed and expected frequencies is computed for each of the 160,000 possible 4-grams in an organism, and then the mean and standard deviation of the differences are calculated. The magnitude of the difference value for each 4-gram in an organism is then expressed as a number of standard deviations from the mean. The 4-grams which are extremely over- or under-represented are indicated by high standard deviation values (that is, a large number of standard deviations from the mean). Those 4-grams which are over- or under-represented in one organism only, or which have a high observed frequency in one organism only, are candidates for “signature” 4-grams, which can uniquely determine an organism. We have gathered statistical data for 44 organisms based upon observed versus expected frequencies for all the 4-grams within an organism.

Systems and Methods

The data were gathered using the blmt toolkit. The blmt program that performs the statistical analysis is built with gcc on UNIX. XMGR was used to make the graphs.
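The computation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the blmt implementation: the function name `four_gram_z_scores` is hypothetical, frequencies are taken as counts normalized by the total number of 4-grams, and the expected frequency of a 4-gram is assumed to be the product of the unigram frequencies of its four amino acids, which the text does not state explicitly.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def four_gram_z_scores(proteins, n=4):
    """Return {n-gram: (difference, z)} for all 20**n possible n-grams.

    difference = observed frequency - expected frequency
    z          = |difference - mean| / standard deviation
    """
    valid = set(AMINO_ACIDS)
    unigrams = Counter()
    ngrams = Counter()
    for seq in proteins:
        unigrams.update(c for c in seq if c in valid)
        # Slide a window of width n over each protein sequence.
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if all(c in valid for c in gram):
                ngrams[gram] += 1

    total_unigrams = sum(unigrams.values())
    total_ngrams = sum(ngrams.values())
    if total_unigrams == 0 or total_ngrams == 0:
        raise ValueError("input contains no complete n-grams")

    freq = {a: unigrams[a] / total_unigrams for a in AMINO_ACIDS}

    diffs = {}
    for letters in product(AMINO_ACIDS, repeat=n):
        gram = "".join(letters)
        observed = ngrams[gram] / total_ngrams
        # Assumption: residues are independent, so the expected frequency
        # is the product of the unigram frequencies.
        expected = prod(freq[a] for a in letters)
        diffs[gram] = observed - expected

    mean = sum(diffs.values()) / len(diffs)
    std = sqrt(sum((d - mean) ** 2 for d in diffs.values()) / len(diffs))
    if std == 0:
        return {g: (d, 0.0) for g, d in diffs.items()}
    return {g: (d, abs(d - mean) / std) for g, d in diffs.items()}
```

Called on the list of protein sequences of one organism, the function returns a signed difference (positive for over-represented 4-grams) and a magnitude in standard deviations for each of the 160,000 4-grams. Sequence windows containing characters outside the 20 standard amino acids are skipped, which is a design choice of this sketch.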
Results and Discussion

4-gram statistics of different organisms

Figure 1. Difference between observed and expected frequencies for all 160,000 4-grams for several organisms. Positive values indicate over-represented 4-grams, while negative values indicate under-represented 4-grams.

Figure 2. Blowup of the initial region of the graph in Figure 1 (difference between observed and expected frequencies of 4-grams for several organisms). Ureaplasma shows high difference values (approximately 0.00021). High difference values indicate over-representation of 4-grams in the organism.

Figure 3. Distribution of the magnitude of the difference in terms of number of standard deviations from the mean. The magnitude tends to be higher for over-represented 4-grams (those with a positive difference).

Figure 4. Blowup of the initial region of Figure 3. Distribution of the magnitude of the difference in terms of number of standard deviations from the mean. M. tuberculosis is much higher than M. leprae.

The highest number of standard deviations from the mean is seen in Mycobacterium tuberculosis: a value of 179 for the 4-gram GAGG, 175 for GGAG, and 102 for GNGG. The amino acids G, A, and L occur together in 4-grams much more often than their unigram frequencies would suggest. These 4-grams are far more strongly over-represented in Mycobacterium tuberculosis than in the related species Mycobacterium leprae, which has significantly lower standard deviation values for these 4-grams, even though the two belong to the same genus.

Genome size versus standard deviation distribution

Legend: Human (22,889,476), Mesorhizobium (4,080,256), P. Aeruginosa (3,730,192), E Coli0157h7 (3,229,098), E Coli0157h7ED1933 (3,228,100)

Figure 5. Highest 400 standard deviation values for several organisms. The y-axis is the number of standard deviations; the x-axis is the 4-gram. The size of each organism, in number of amino acids, is given after its name.
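A curve such as those in Figure 5 amounts to taking the largest values from one organism's table of standard deviation values, and the search for signature candidates amounts to comparing those tables across organisms. The sketch below shows one possible way to do both; the function names, the threshold parameter and the input layout (a mapping from 4-gram to number of standard deviations) are assumptions made for illustration and are not taken from the blmt toolkit.

```python
import heapq


def top_k(z_scores, k=400):
    """Return the k (4-gram, z) pairs with the largest z, highest first."""
    return heapq.nlargest(k, z_scores.items(), key=lambda item: item[1])


def exclusive_outliers(z_by_organism, threshold):
    """For each organism, list the 4-grams whose z exceeds `threshold`
    in that organism and in no other organism.

    `threshold` is a hypothetical cutoff; the text does not specify one.
    """
    result = {}
    for name, scores in z_by_organism.items():
        others = [s for other, s in z_by_organism.items() if other != name]
        result[name] = sorted(
            gram
            for gram, z in scores.items()
            if z > threshold
            and all(o.get(gram, 0.0) <= threshold for o in others)
        )
    return result


# Toy input with made-up numbers, for illustration only (not measured data).
toy = {
    "org_a": {"GAGG": 50.0, "GGAG": 40.0, "AAAA": 1.0},
    "org_b": {"GAGG": 2.0, "GGAG": 45.0, "AAAA": 0.5},
}
ranked = top_k(toy["org_a"], k=2)
candidates = exclusive_outliers(toy, threshold=10.0)
```

In the toy input, GAGG exceeds the cutoff in org_a only and is therefore the single candidate returned for that organism, while GGAG is discarded because it is an outlier in both.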
Legend: Human (22,889,476), E ColiK12 (2,726,558), M.Tuberculosis (2,666,338), B.Subtilis (2,442,200), B.Halodurans (2,384,352), Synechocystis (2,072,748)

Figure 6. Highest 400 standard deviation values for several organisms, where genome size is reflected in line thickness. The y-axis is the number of standard deviations; the x-axis is the 4-gram. The size of each organism, in number of amino acids, is given after its name.

Legend: Human (22,889,476), E ColiK12 (2,726,558), M.Tuberculosis (2,666,338), B.Subtilis (2,442,200), B.Halodurans (2,384,352), Synechocystis (2,072,748)

Figure 7. Lowest standard deviation values for several organisms. The y-axis is the number of standard deviations; the x-axis is the 4-gram. The size of each organism, in number of amino acids, is given after its name.

We examine the relationship between genome size and the distribution of standard deviation values and find no observable relationship. In Figure 5, Human and Mesorhizobium have nearly the same distribution, although Mesorhizobium is much smaller. In Figure 6, M. tuberculosis has the highest maximum standard deviation value for a 4-gram (178), although it is much smaller than Human, whose maximum standard deviation value is much lower (around 100). In turn, there are other organisms whose maximum standard deviation values are lower than those of Human. In Figure 7, which plots the lowest standard deviation values, Human has the lowest values of all the organisms shown.

Conclusions

Toward our goal of deriving “signature” 4-grams for different organisms, the difference between observed and expected frequencies is computed for each of the 160,000 possible 4-grams in an organism, and then the mean and standard deviation of the differences are calculated. The magnitude of the difference value for each 4-gram in an organism is then expressed as a number of standard deviations from the mean. The 4-grams which are extremely over- or under-represented are indicated by high standard deviation values.
Those 4-grams which are over- or under-represented in one organism only, or which have a high observed frequency in one organism only, are candidates for “signature” 4-grams, which can uniquely determine an organism. We examine the relationship between genome size and the distribution of standard deviation values and observe no relationship between the two. We also observe an example where two species in the same genus, M. tuberculosis and M. leprae, have noticeably different distributions of standard deviation values.

Acknowledgements: This research was supported by National Science Foundation Large Information Technology Research grant NSF 0225656.

Relevance of Approaches and Results for Complementary Domain