
Genome signatures of microbial organisms identified by amino acid n-gram analysis
B. Suman Bharathi1, Deborah Weisser2 and Judith Klein Seetharaman2
1 IBI-2, Forschungszentrum Juelich, Juelich stadt, 52425
49 2461 612510, Fax.
2 Language Technologies Institute, School of Computer Science, Carnegie Mellon University,
Pittsburgh, 15213
Tel. 412-268-8249, Fax. 412-268-2338
bsuman_1979@yahoo.com , dweisser@cs.cmu.edu, judithks@cs.cmu.edu
Keywords
statistical analysis, genome signatures
Abstract
Importance of genome signatures:
Designing drugs against pathogens has been of great significance in human medicine since medical
research began. A detailed study of the molecular chemistry of drugs directed against the target sites
of an antigen that the pathogen presents or expresses to the host can help us develop effective
vaccines with minimal side effects. Strong emphasis is placed on solving the problem of
sequence-structure-form mapping of proteins.
Using the blmt toolkit, we search for protein sequence 4-grams which act as genome signatures for an
organism. All possible 4-grams, or tetrapeptides, of the 20 amino acids are enumerated and sorted by
their frequency of occurrence in the respective organism (both expected and observed frequencies).
The difference between the observed and expected frequencies is calculated for each 4-gram, and the
mean and standard deviation of the difference values are computed. The absolute difference from the
mean, expressed as a number of standard deviations, is calculated for all 160,000 4-grams
(20^4 = 160,000), and the 4-grams are sorted by this value. The genome size of the organism is
compared with the standard deviation values obtained for the organism, and the relationship between
the two quantities is studied. We observe different distributions of standard deviation values
between similar organisms (same genus, different species):
Mycobacterium tuberculosis shows much greater variation in its standard deviation values than
another species in the same genus, M. leprae.
Introduction
This project uses the blmt toolkit to search for 4-gram signatures: small sets of 4-grams which
uniquely characterize an organism. The toolkit is used to obtain the frequency of each of the 160,000
possible 4-grams in an organism and to find those 4-grams which occur much more or much less often
than expected. The analysis was conducted on 44 organisms, including bacteria, archaea, mycoplasma
and human.
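The counting step described above can be sketched as follows. This is a minimal illustration, not the blmt toolkit's own code: the function names are hypothetical, and it assumes that 4-grams are counted within each protein sequence (not across sequence boundaries) and that windows containing non-standard residue codes are skipped.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def all_ngrams(n=4):
    """Return every possible n-gram over the 20 amino acids (20**n strings)."""
    return ["".join(letters) for letters in product(AMINO_ACIDS, repeat=n)]


def observed_frequencies(proteins, n=4):
    """Relative frequency of each n-gram observed in a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Assumption: skip windows with ambiguous residues such as X or B.
            if all(c in AMINO_ACIDS for c in gram):
                counts[gram] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Calling `len(all_ngrams())` gives 160,000, the number of possible 4-grams quoted in the text.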
Goals:
The goal is to derive “signature” 4-grams for different organisms, where a “signature” is a small set
of 4-grams whose unexpected presence or absence uniquely determines the organism. Toward this goal,
the difference values between observed and expected frequencies for each of the 160,000 possible
4-grams in an organism are computed, and then the standard deviation and mean of the difference are
calculated. The magnitude of the difference value for each 4-gram in an organism is then computed in
terms of number of standard deviations from the mean.
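The computation in this paragraph can be sketched as below. Two points are assumptions rather than statements from the toolkit: the expected frequency of a 4-gram is taken to be the product of the unigram frequencies of its four amino acids (in line with the later remark about what "unigram frequencies would suggest"), and the population standard deviation is used. The function name `deviation_scores` is hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def deviation_scores(proteins, n=4):
    """Return {n-gram: signed number of standard deviations from the mean difference}."""
    # Unigram probabilities, estimated from the same sequences (assumption).
    unigrams = Counter(c for seq in proteins for c in seq if c in AMINO_ACIDS)
    total_unigrams = sum(unigrams.values())
    if total_unigrams == 0:
        raise ValueError("no standard amino acids in input")
    p = {a: unigrams[a] / total_unigrams for a in AMINO_ACIDS}

    # Observed n-gram counts, skipping windows with non-standard residues.
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if all(c in p for c in gram):
                counts[gram] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no n-grams in input")

    # Difference = observed frequency - expected frequency, for every possible n-gram.
    diffs = {}
    for letters in product(AMINO_ACIDS, repeat=n):
        expected = 1.0
        for a in letters:
            expected *= p[a]
        gram = "".join(letters)
        diffs[gram] = counts[gram] / total - expected

    # Mean and (population) standard deviation of the difference values.
    mu = sum(diffs.values()) / len(diffs)
    sigma = sqrt(sum((d - mu) ** 2 for d in diffs.values()) / len(diffs))
    if sigma == 0:
        raise ValueError("all differences are identical")
    return {gram: (d - mu) / sigma for gram, d in diffs.items()}
```

The sign is kept so that over- and under-represented 4-grams can be told apart; taking `abs()` of each value gives the magnitude described in the text, and sorting by it ranks the 4-grams.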
The 4-grams which are extremely over- or under-represented are those whose difference lies a large
number of standard deviations from the mean. Those 4-grams which are over- or under-represented in
one organism only, or those 4-grams which have high observed frequency in one organism only, are
candidates for “signature” 4-grams, which can uniquely determine an organism.
We have gathered statistical data for 44 organisms based upon observed versus expected frequencies
for all the 4-grams within an organism.
Systems and Methods
The data was gathered using the blmt toolkit. The blmt program that performs the statistical
analysis was compiled with gcc on UNIX. XMGR was used to make the graphs.
Results and Discussion
4-gram statistics of different organisms
Figure 1. Difference between observed and expected frequencies for all 160,000 4-grams for several
organisms. The positive values indicate over-represented 4-grams, while the negative values
indicate under-represented 4-grams.
Figure 2. Blowup of the initial region of the graph in Figure 1 (difference between observed and
expected frequencies of 4-grams for several organisms). Ureaplasma shows high difference values
(approximately 0.00021). High difference values indicate over-representation of 4-grams in the
organism.
Figure 3. Distribution of magnitude of difference in terms of number of standard deviations from the
mean. The magnitude tends to be higher for over-represented 4-grams (those with a positive
difference.)
Figure 4. Blowup of initial region of Figure 3. Distribution of magnitude of difference in terms of
number of standard deviations from the mean.
The values for M. tuberculosis are much higher than those for M. leprae.
The highest number of standard deviations from the mean is seen in Mycobacterium tuberculosis: 179
for the 4-gram GAGG, 175 for GGAG, and 102 for GNGG. The amino acids G, A, and L occur together in
4-grams much more often than their unigram frequencies would suggest.
These 4-grams are far more strongly over-represented in Mycobacterium tuberculosis than in
Mycobacterium leprae, a species of the same genus, which has significantly lower standard deviation
values for them.
Genome size versus standard deviation distribution
Human (22,889,476)
Mesorhizobium (4,080,256)
P. Aeruginosa (3,730,192)
E Coli0157h7 (3,229,098)
E Coli0157h7ED1933 (3,228,100)
Figure 5. Highest 400 standard deviation values for several organisms. Y-axis is number of standard
deviations, x-axis is 4-gram. Size of organism in terms of number of amino acids is after name.
Human (22,889,476)
E ColiK12 (2,726,558)
M.Tuberculosis (2,666,338)
B.Subtilis (2,442,200)
B.Halodurans (2,384,352)
Synechocystis (2,072,748)
Figure 6. Highest 400 standard deviation values for several organisms, where genome size is
reflected in line thickness. Y-axis is number of standard deviations, x-axis is 4-gram. Size of
organism in terms of number of amino acids is after name.
Human (22,889,476)
E ColiK12 (2,726,558)
M.Tuberculosis (2,666,338)
B.Subtilis (2,442,200)
B.Halodurans (2,384,352)
Synechocystis (2,072,748)
Figure 7. Lowest standard deviation values for several organisms. Y-axis is number of standard
deviations, x-axis is 4-gram. Size of organism in terms of number of amino acids is after name.
We examine the relationship between genome size and the distribution of standard deviation values
and find no observable relationship. In Figure 5, Human and Mesorhizobium have nearly the same
distribution, although Mesorhizobium is much smaller. In Figure 6, M. tuberculosis has the highest
maximum standard deviation value for a 4-gram (178), although it is much smaller than Human, whose
maximum standard deviation value is much lower (around 100). In turn, there are other organisms
whose maximum standard deviation values are lower than those of Human. In Figure 7, we observe that
Human has the lowest of the minimum standard deviation values.
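The comparison above is made by inspecting the figures. If one wanted a numeric check of the same question, a Pearson correlation between genome size and an organism's maximum standard deviation value is one option; this is a swapped-in technique offered as a sketch, not the analysis performed here, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two root sums of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Here `xs` would hold genome sizes in amino acids and `ys` the corresponding maximum values, one pair per organism; a coefficient near zero would be consistent with the absence of a linear relationship, although it would not rule out other kinds of dependence.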
Conclusions
Toward our goal of deriving “signature” 4-grams for different organisms, the difference values
between observed and expected frequencies for each of the 160,000 possible 4-grams in an organism
are computed, and then the standard deviation and mean of the difference are calculated. The
magnitude of the difference value for each 4-gram in an organism is then computed in terms of number
of standard deviations from the mean.
The 4-grams which are extremely over- or under-represented are those whose difference lies a large
number of standard deviations from the mean. Those 4-grams which are over- or under-represented in
one organism only, or those 4-grams which have high observed frequency in one organism only, are
candidates for “signature” 4-grams, which can uniquely determine an organism.
We examine the relationship between genome size and the distribution of standard deviation values
and observe no relationship between the two. We also observe an example where two species in the
same genus, M. tuberculosis and M. leprae, have noticeably different distributions of standard
deviation values.
Acknowledgements:
This research was supported by National Science Foundation Large Information Technology
Research grant NSF 0225656.
Relevance of Approaches and Results for Complementary Domain