GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum, Juelich, Germany

Genome Signatures
• Peptide sequences that occur with unusually high frequency in one particular organism or pathogen, unlike in others
• Potential applications:
  – Drug development: synthesize drugs that target the genome signature in a pathogen
  – Sensor development: use the genome signature to identify an organism quickly using an antibody
[Figure: occurrences of the peptide MPSE, Neisseria meningitidis versus Homo sapiens]

Approach
• Linguistic approach
• N-gram analysis using toolkit
• What the BLMT toolkit provides
• N-gram statistical analysis
• Definition of signature sequences
• Use of toolkit on Neisseria meningitidis
[Figure: Neisseria meningitidis versus other species, n = 4; vertical axis from 0 to 0.09]
n-gram = sequence of length n

Use of BLMT
• N-gram statistical analysis gives detailed statistics: the frequency of each n-gram, together with the corresponding mean and standard deviation.
• We have taken 45 organisms into consideration: bacteria, archaea, mycoplasmas and human.
• Search for n-grams whose frequencies lie many standard deviations away from the mean.
• This distance indicates the difference between the expected and observed frequency of the n-gram.
• It shows how unusual the n-gram is in one organism compared with the others.
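
The counting step behind the n-gram analysis can be sketched as follows. This is a minimal illustration, not the BLMT toolkit itself: the function names `ngram_counts` and `ngram_frequencies` and the toy sequences are hypothetical, and each organism is assumed to be given as a single amino acid string (for real proteomes, n-grams would normally be counted per protein so that they do not span protein boundaries).

```python
from collections import Counter


def ngram_counts(sequence, n=4):
    """Count every overlapping n-gram (window of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequence, n=4):
    """Relative frequency of each n-gram: count / total number of n-grams."""
    counts = ngram_counts(sequence, n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Toy sequences invented for illustration only (NOT real proteome data).
toy_proteomes = {
    "organism_a": "MPSEMPSEAAAL",
    "organism_b": "AAALGKSTLLEK",
}

# Frequency of one candidate 4-gram in each organism (0.0 if it never occurs).
frequency_of_mpse = {
    name: ngram_frequencies(seq).get("MPSE", 0.0)
    for name, seq in toy_proteomes.items()
}
```

Comparing `frequency_of_mpse` across organisms is the kind of per-n-gram comparison the "Neisseria meningitidis versus other species" plot appears to show.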
Difference Between Expected and Observed Frequencies
[Figure: difference plotted per n-gram. Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Positive values indicate over-represented n-grams; negative values indicate under-represented n-grams.

Initial Points of the Difference Between Expected and Observed Frequency Graph
[Figure: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Ureaplasma shows high difference values (approx. 0.00021), indicating n-grams that are over-represented compared to their expected probability of occurrence in the organism.

Standard Deviation Away from the Mean
• Mycoplasma genitalium (black)
• M. tuberculosis (red)
• M. leprae (green)
• Mesorhizobium (blue)
• Lactococcus (yellow)
Shows the distribution of n-gram standard deviations, with both high and low values, indicating over-represented and under-represented n-grams.

Highest Standard Deviations Away from the Mean
[Figure: Mycoplasma genitalium (black), M. tuberculosis (red), M. leprae (green), Mesorhizobium (blue), Lactococcus (yellow)]
Shows the initial (highest) values of standard deviation away from the mean. The values for n-grams of M. tuberculosis are much higher than those of M. leprae.

Comparison of genome size with varying standard deviations
• Examine the relationship between genome size and the distribution of n-gram standard deviations for each organism.
• Human genome taken as reference.
• Compare genome size and standard deviations within the same genus but across different species.
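
The slides do not state how the expected frequency and the standard deviation are derived, so the following is only one plausible sketch, under a loudly stated assumption: residues are treated as independent, the expected probability of an n-gram is the product of its single-residue frequencies, and the spread of its count is taken from a binomial model. The function name `observed_vs_expected` is hypothetical.

```python
from collections import Counter
from math import prod, sqrt


def observed_vs_expected(sequence, n=4):
    """Return {ngram: (difference, z)} for every n-gram observed in sequence.

    ASSUMPTION (not taken from the slides): independent residues, so
      expected probability p = product of single-residue frequencies,
      difference = observed frequency - p,
      z = (count - N*p) / sqrt(N*p*(1-p)), with N = number of n-gram windows.
    """
    total = len(sequence) - n + 1
    if total <= 0:
        return {}
    residue_freq = {aa: c / len(sequence) for aa, c in Counter(sequence).items()}
    counts = Counter(sequence[i:i + n] for i in range(total))
    result = {}
    for gram, count in counts.items():
        p = prod(residue_freq[aa] for aa in gram)
        sd = sqrt(total * p * (1 - p))
        z = (count - total * p) / sd if sd > 0 else 0.0
        result[gram] = (count / total - p, z)
    return result


def ranked_by_difference(result):
    """N-grams sorted from most over-represented to least."""
    return sorted(result.items(), key=lambda item: item[1][0], reverse=True)
```

Under this model a positive difference marks an over-represented n-gram, matching the sign convention on the slide. Only n-grams that actually occur are scored, so n-grams that are absent (the most under-represented ones) would need to be enumerated separately.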
Size Distribution of Genomes
No. Organism Size
1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPA01 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_Thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strainTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strainJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteria_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_Thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumoniaeJ138 735350
34. Bacteria_Chlamydophila_pneumoniaeCWL029 725492
35. Bacteria_Chlamydophila_pneumoniaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. Mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106

Genome size graph and varying std deviation values
• Human (black, 22889476)
• Mesorhizobium (red, 4080256)
• P. aeruginosa (green, 3730192)
• E_coliO157H7 (blue, 3229098)
• E_coliO157H7EDL933 (yellow, 3228100)
The organisms are listed in descending order of genome size. The distribution of n-gram standard deviations is compared with genome size.

Tail end of Genome size and n-gram distribution of standard deviations
[Figure: same five organisms and colours as on the previous slide]
The human genome, though the largest, has n-gram values that lie fewer standard deviations away from the mean than those of the smaller genomes.

Initial points: Genome size and n-gram distribution of standard deviations
[Figure: same five organisms and colours as on the previous slide]
Human n-gram std deviation values are almost equal to those of Mesorhizobium, although Mesorhizobium has a much smaller genome.

Genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• E_coliK12 (red, 2726558)
• M. tuberculosis (green, 2666338)
• B. subtilis (blue, 2442200)
• B. halodurans (yellow, 2384352)
• Synechocystis (brown, 2072748)
M. tuberculosis has very high n-gram standard deviation values. They exceed the human values, despite its smaller genome size.

Initial points of Genome size and n-gram distribution of standard deviations
[Figure: same six organisms and colours as on the previous slide]
The thickness of the lines indicates the genome size. The thinnest line represents E_coliK12. Mycobacterium tuberculosis shows the highest values.
Final points of Genome size and n-gram distribution of standard deviations
[Figure: Human (black, 22889476), E_coliK12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]
M. tuberculosis and all other organisms here have n-grams with higher difference values than human.

Same genus / different species
• 4-grams in M. tuberculosis lie many more standard deviations from the mean than 4-grams in M. leprae

Mycobacterium
M. tuberculosis: GAGG 179, GGAG 175, GNGG 102, GGNG 79, AAAA 68, AGGA 65, GTGG 58, GGTG 55, GDGG 46, GGDG 42, LAAA 37, GSGG 32, GGSG 31, NGGA 30, ALAA 29, NGGN 29, AGGN 26, GVGG 25, GGVG 25, AALA 24, VAAA 23
M. leprae: AAAA 47, LAAA 39, AALA 32, AVAA 31, AAAV 29, ALAA 28, VAAA 27, VAAL 26, AELA 26, AAVA 25, LAAL 25, ELAA 24, LAGL 22, AAAL 22, TAAA 22, LAEL 21

Other Organisms
Human: EEEE 107, PPPP 95, AAAA 89, SSSS 86, GGGG 63, LLLL 59, QQQQ 55, HTGE 47, GPPG 46, GEKP 40, TGEK 39, EKPY 34, ECGK 32, PPGP 32, PGPP 31, KKKK 30, AMAA 26, RSRS 25, CGKA 25, EEED 24, GKAF 23, EDEE 22, IHTG 22, PPAP 22, DEEE 21
Thermotoga maritima: AMKK 31, EAMK 28, LKEK 28, LEEI 26, EILK 25, GKTT 24, LEEL 24, EILE 24, EKLK 23, EELK 23, LEKL 23, EALK 22, KALE 22, EEIE 22, LKKL 22, LLEK 21
Synechocystis spec.: QAIA 64, LAIA 63, TAIA 61, GDRL 59, AIAA 49, EAIA 47, GDRQ 46, AIAV 44, AAIA 42, AIAK 39, AIAL 39, GAIA 36, VAIA 36, AIAD 30, AIAS 29, EPEP 27, AIAG 27, PEPE 27, AIAE 26, AIAI 26, KAIA 23, AIAR 23, LGDR 22, MAIA 22
Neisseria meningitidis: SDGI 55, MPSE 50, AAAA 49, GRLK 34, AAAL 32, LAAA 26, AVAA 24, ALAA 24, AAAV 23, FQTA 23, AAEA 23, EAAA 22, QTAL 21, AVAM 21
Haemophilus influenzae: LTAL 75, KSAV 45, TALL 40, AMKK 37, TALS 32, SAVK 31, KAMK 30, ESAV 30, STAL 28, SAVE 27, KKAM 27, TALF 26, LSGG 22, QSAV 21, KLTA 21, GKST 21

Conclusions
• n-grams that are at least 30 standard deviations away from the mean are significant candidates for genome signatures.
• Difference graphs: estimate the likelihood of an n-gram being observed in an organism.
• Genome size graphs: there is no specific relationship between the size of a genome and its standard deviation values.
• Same genus, different species (genome sizes given): there is a noticeable difference between the Mycobacterium species (M. leprae and M. tuberculosis).

Current and future work
• Find signature n-grams in E. coli.
• Explore the relationship between genome size and the distribution of n-gram standard deviations for different species of the same genus.
• Find more specific targets to differentiate species in terms of signature peptides for all the 44 organisms taken for study.
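
The 30-standard-deviation criterion from the conclusions, and the future-work goal of finding peptides that differentiate species, can be sketched as below. The function names `signature_candidates` and `species_specific` are hypothetical. The example values are a subset of the 4-gram lists for the two Mycobacterium species, on the assumption that the numbers on that slide are standard deviations from the mean.

```python
def signature_candidates(scores, threshold=30.0):
    """N-grams at least `threshold` standard deviations above the mean."""
    return {gram for gram, score in scores.items() if score >= threshold}


def species_specific(candidates_by_organism, target):
    """Candidates of `target` that are not candidates in any other organism."""
    others = set().union(
        *(grams for name, grams in candidates_by_organism.items() if name != target)
    )
    return candidates_by_organism[target] - others


# Subset of the slide values (ASSUMED to be standard deviations from the mean).
scores = {
    "M. tuberculosis": {"GAGG": 179, "GGAG": 175, "GNGG": 102,
                        "AAAA": 68, "LAAA": 37, "ALAA": 29},
    "M. leprae": {"AAAA": 47, "LAAA": 39, "AALA": 32, "AVAA": 31, "AAAV": 29},
}

candidates = {name: signature_candidates(s) for name, s in scores.items()}
tb_only = species_specific(candidates, "M. tuberculosis")
leprae_only = species_specific(candidates, "M. leprae")
```

With this subset, the glycine-rich 4-grams GAGG, GGAG and GNGG remain specific to M. tuberculosis, while AAAA and LAAA pass the threshold in both species and so do not distinguish them.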