GENOME SIGNATURES OF MICROBIAL ORGANISMS B. Suman Bharathi

GENOME SIGNATURES OF MICROBIAL ORGANISMS
IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum, Juelich, Germany
Genome Signatures
• Peptide sequences that occur with unusually high frequency in one
particular organism or pathogen but not in others
• Potential applications:
– Drug development: synthesize drugs that target the genome
signature in a pathogen
– Sensor development: use the genome signature to identify an organism
quickly with an antibody
[Figure: the 4-gram MPSE in Neisseria meningitidis and Homo sapiens]
Approach
• Linguistic approach
• N-gram analysis using the BLMT toolkit
• What the BLMT toolkit provides
• N-gram statistical analysis
• Definition of signature sequences
• Use of the toolkit on Neisseria meningitidis
[Figure: Neisseria meningitidis versus other species, n=4; axis scale 0 to 0.09]
n-gram = sequence of length n
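As an illustration of the definition above, here is a minimal Python sketch that counts overlapping n-grams in protein sequences and converts the counts to relative frequencies. It is not the BLMT implementation; the function names `count_ngrams` and `ngram_frequencies` are hypothetical, and overlapping windows are an assumption.

```python
from collections import Counter


def count_ngrams(sequence, n=4):
    """Count overlapping n-grams (windows of length n) in one sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequences, n=4):
    """Relative frequency of each n-gram over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(count_ngrams(seq, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}
```

For example, `count_ngrams("MPSEMPSE")` sees the windows MPSE, PSEM, SEMP, EMPS and MPSE, so MPSE is counted twice.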
Use of BLMT
• N-gram statistical analysis gives detailed statistics on the
frequency of n-grams, with their means and standard deviations.
• 45 organisms were considered: bacteria, archaea, mycoplasmas and human.
• Search for n-grams whose frequencies lie many standard deviations
away from the mean.
• This indicates the difference between the expected and observed
frequencies of the n-grams.
• It shows how unusual an n-gram is in one organism compared
with the others.
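The slides do not state which distribution the mean and standard deviation are taken over. The sketch below assumes one possible reading: each n-gram frequency in a target organism is compared with the mean and standard deviation of that n-gram's frequency in the other organisms. The function name `deviations_from_others` and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def deviations_from_others(freqs, target):
    """Standard deviations from the mean for each n-gram in `target`.

    freqs maps organism name -> {n-gram: relative frequency}.
    Assumption: mean and standard deviation are computed over the
    organisms other than `target` (at least one other is required).
    """
    others = [f for org, f in freqs.items() if org != target]
    scores = {}
    for gram, observed in freqs[target].items():
        values = [f.get(gram, 0.0) for f in others]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0:  # skip n-grams with no spread among the others
            scores[gram] = (observed - mu) / sigma
    return scores
```

Large positive scores mark n-grams that are unusually frequent in the target organism; whether this matches the toolkit's own statistic would need to be checked against its documentation.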
Difference between expected and observed frequencies
[Figure: difference between expected and observed frequency, plotted per n-gram. Legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Positive values indicate over-represented n-grams; negative values
indicate under-represented n-grams.
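A minimal sketch of such a difference calculation follows. The slides do not say how the expected frequency is obtained; this example assumes the simplest background model, in which the expected frequency of an n-gram is the product of the frequencies of its individual amino acids. The function name `observed_minus_expected` is hypothetical.

```python
from collections import Counter
from math import prod


def observed_minus_expected(sequences, n=4):
    """Observed minus expected relative frequency for each observed n-gram.

    Assumption: expected frequency = product of single-residue frequencies.
    N-grams that never occur (which would all be negative) are omitted.
    """
    residues = Counter()
    grams = Counter()
    for seq in sequences:
        residues.update(seq)
        grams.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    total_residues = sum(residues.values())
    total_grams = sum(grams.values())
    diffs = {}
    for gram, count in grams.items():
        observed = count / total_grams
        expected = prod(residues[aa] / total_residues for aa in gram)
        diffs[gram] = observed - expected  # > 0 means over-represented
    return diffs
```

Sorting the returned values in descending order would give a curve of the kind plotted on these slides, with over-represented n-grams at the start and under-represented ones at the tail.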
Initial points of the difference between expected and
observed frequency graph
[Figure legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Ureaplasma shows high difference values (approx. 0.00021), indicating
n-grams that are over-represented relative to their expected
probability of occurrence in the organism.
Standard deviations away from the mean
[Figure legend: Mycoplasma genitalium (black), M. tuberculosis (red), M. leprae (green), Mesorhizobium (blue), Lactococcus (yellow)]
Shows the distribution of n-gram standard deviations, with
both high and low values of difference, indicating the
over-represented and under-represented n-grams.
Highest standard deviations away from the mean
[Figure legend: Mycoplasma genitalium (black), M. tuberculosis (red), M. leprae (green), Mesorhizobium (blue), Lactococcus (yellow)]
Shows the initial (highest) values of standard deviation away from the mean.
The n-gram values of M. tuberculosis are much higher than those of M. leprae.
Comparison of genome size with
varying standard deviations
• Examine the relationship between genome
size and distribution of n-gram standard
deviations for each organism
• Human genome taken as reference.
• Compare genome size and standard
deviations within same genus but across
different species.
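One way to examine such a relationship numerically is a rank correlation between genome size and a per-organism summary of the n-gram deviations, for instance the highest value. This is only a sketch of one possible analysis, not the method used for the graphs; the choice of Spearman correlation and the helper names `ranks`, `pearson` and `spearman` are assumptions.

```python
def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero spread."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(sizes, deviations):
    """Spearman rank correlation between genome sizes and deviation values."""
    return pearson(ranks(sizes), ranks(deviations))
```

A value near zero would be consistent with no monotonic relationship between genome size and the deviation summary; values near +1 or -1 would indicate a strong one.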
Size Distribution of Genomes
No.  Organism                                        Size
1    Human                                           22889476
2    Bacteria_Mesorhizobium_loti                     4080256
3    Bacteria_Pseudomonas_aeruginosaPA01             3730192
4    Bacteria_Escherichia_coliO157H7                 3229098
5    Bacteria_Escherichia_coliO157H7EDL933           3228100
6    Bacteria_Escherichia_coliK12                    2726558
7    Bacteria_Mycobacterium_tuberculosisH37Rv        2666338
8    Bacteria_Bacillus_subtilis                      2442200
9    Bacteria_Bacillus_halodurans_C125               2384352
10   Bacteria_SynechocystisPCC6803                   2072748
11   Bacteria_Vibrio_cholerae_chr1                   1725852
12   Bacteria_Deinococcus_radioduransR1_chr1         1559376
13   Bacteria_Xylella_fastidiosa                     1490262
14   Archaea_Archaeoglobus_fulgidus                  1343990
15   Bacteria_Pasteurella_multocida                  1340102
16   Bacteria_Lactococcus_lactis_subsp_lactis        1335222
17   Archaea_Aeropyrum_pernix                        1280062
18   B_Neisseria_meningitidis_serogroupBstrainMC58   1178096
19   Archaea_Halobacterium_spNRC1                    1178038
20   B_Neisseria_meningitidis_serogroupAstrainZ2491  1176104
21   Bacteria_thermotoga_maritima                    1167344
22   Bacteria_Pyrococcus_horikoshiiOT3               1141216
23   Bacteria_Mycobacterium_leprae_strinTN           1080756
24   A_Methanobacterium_thermoautotrophicum_deltaH   1054752
25   Bacteria_Haemophilus_influenzaeRd               1045572
26   Bacteria_Campylobacter_jejuni                   1020944
27   Bacteria_Helicobacter_pylori_strianJ99          990942
28   Bacteria_Helicobacter_pylori26695               986258
29   Archaea_Methanococcus_jannaschii                970558
30   Bacteriae_Aquifex_aeolicus                      968068
31   Archaea_Thermoplasma_acidophilum                909164
32   Archaea_thermoplasma_volcanium                  903228
33   Bacteria_Chlamydophila_pneumonieaeJ138          735350
34   Bacteria_Chlamydophila_pneumonieaCWL029         725492
35   Bacteria_Chlamydophila_pneumonieaeAR39          729896
36   Bacteria_Treponema_pallidum                     703414
37   Bacteria_Chlamydia_muridarum                    646712
38   Bacteria_Chlamydia_trachomatis                  626142
39   Bacteria_Rickettsia_prowazekii_strain_MadridE   559828
40   Bacteria_Mycoplasma_pneumoniae                  480870
41   Bacteria_Ureaplasma_urealyticum                 457608
42   Bacteria_Buchnera_sp_APS                        371470
43   mycoplasma genitalium                           352826
44   Bacteria_Borrelia_burgdorferi                   300106
Genome size graph and varying standard deviation values
[Figure legend: Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E. coli O157:H7 (blue, 3229098), E. coli O157:H7 EDL933 (yellow, 3228100)]
The organisms are listed in descending order of genome size.
The distribution of n-gram standard deviations is compared
with genome size.
Tail end of genome size and n-gram distribution of
standard deviations
[Figure legend: Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E. coli O157:H7 (blue, 3229098), E. coli O157:H7 EDL933 (yellow, 3228100)]
The human genome, though the largest, has low n-gram
standard deviation values away from the mean
compared to smaller genomes.
Initial points: genome size and n-gram distribution
of standard deviations
[Figure legend: Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E. coli O157:H7 (blue, 3229098), E. coli O157:H7 EDL933 (yellow, 3228100)]
Human n-gram standard deviation values are almost equal to those of
Mesorhizobium, though Mesorhizobium has a much smaller genome.
Genome size and n-gram distribution of standard
deviations
[Figure legend: Human (black, 22889476), E. coli K12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]
M. tuberculosis has very high n-gram standard deviation values.
They exceed those of human, despite its smaller genome size.
Initial points of genome size and n-gram distribution
of standard deviations
[Figure legend: Human (black, 22889476), E. coli K12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]
The thickness of the lines indicates the genome size.
The thinnest line represents E_coliK12.
Mycobacterium tuberculosis shows the highest values.
Final points of genome size and n-gram distribution
of standard deviations
[Figure legend: Human (black, 22889476), E. coli K12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)]
M. tuberculosis and all other organisms here
have n-grams with higher difference values than human.
Same genus / different species
• 4-grams in M. tuberculosis lie many more
standard deviations from the mean than
4-grams in M. leprae
Mycobacterium

M. tuberculosis
4-gram  Std. dev. from mean
GAGG    179
GGAG    175
GNGG    102
GGNG    79
AAAA    68
AGGA    65
GTGG    58
GGTG    55
GDGG    46
GGDG    42
LAAA    37
GSGG    32
GGSG    31
NGGA    30
ALAA    29
NGGN    29
AGGN    26
GVGG    25
GGVG    25
AALA    24
VAAA    23

M. leprae
4-gram  Std. dev. from mean
AAAA    47
LAAA    39
AALA    32
AVAA    31
AAAV    29
ALAA    28
VAAA    27
VAAL    26
AELA    26
AAVA    25
LAAL    25
ELAA    24
LAGL    22
AAAL    22
TAAA    22
LAEL    21
Other Organisms

Human
4-gram  Std. dev. from mean
EEEE    107
PPPP    95
AAAA    89
SSSS    86
GGGG    63
LLLL    59
QQQQ    55
HTGE    47
GPPG    46
GEKP    40
TGEK    39
EKPY    34
ECGK    32
PPGP    32
PGPP    31
KKKK    30
AMAA    26
RSRS    25
CGKA    25
EEED    24
GKAF    23
EDEE    22
IHTG    22
PPAP    22
DEEE    21

Thermotoga maritima
4-gram  Std. dev. from mean
AMKK    31
EAMK    28
LKEK    28
LEEI    26
EILK    25
GKTT    24
LEEL    24
EILE    24
EKLK    23
EELK    23
LEKL    23
EALK    22
KALE    22
EEIE    22
LKKL    22
LLEK    21

Synechocystis spec.
4-gram  Std. dev. from mean
QAIA    64
LAIA    63
TAIA    61
GDRL    59
AIAA    49
EAIA    47
GDRQ    46
AIAV    44
AAIA    42
AIAK    39
AIAL    39
GAIA    36
VAIA    36
AIAD    30
AIAS    29
EPEP    27
AIAG    27
PEPE    27
AIAE    26
AIAI    26
KAIA    23
AIAR    23
LGDR    22
MAIA    22
Neisseria meningitidis
4-gram  Std. dev. from mean
SDGI    55
MPSE    50
AAAA    49
GRLK    34
AAAL    32
LAAA    26
AVAA    24
ALAA    24
AAAV    23
FQTA    23
AAEA    23
EAAA    22
QTAL    21
AVAM    21

Haemophilus influenzae
4-gram  Std. dev. from mean
LTAL    75
KSAV    45
TALL    40
AMKK    37
TALS    32
SAVK    31
KAMK    30
ESAV    30
STAL    28
SAVE    27
KKAM    27
TALF    26
LSGG    22
QSAV    21
KLTA    21
GKST    21
Conclusions
• N-grams that are at least 30 standard deviations
away from the mean are significant candidates for
genome signatures.
• Difference graphs: estimate how likely an n-gram is
to be observed in an organism.
• Genome size graphs: there is no specific
relationship between the size of a genome and its
standard deviation values.
• Same genus, different species, with genome
size specified: there is a noticeable difference
between the Mycobacterium species
(M. leprae and M. tuberculosis).
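The 30 standard deviation criterion in the first conclusion can be sketched as a simple filter. The function name `signature_candidates` is hypothetical, and the example values are read off the M. tuberculosis slide on the assumption that the n-grams and numbers there pair up in list order.

```python
def signature_candidates(deviations, threshold=30.0):
    """N-grams at least `threshold` standard deviations from the mean.

    deviations maps n-gram -> standard deviations from the mean.
    Returns (n-gram, value) pairs, largest absolute deviation first.
    """
    hits = [(gram, d) for gram, d in deviations.items() if abs(d) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


# Subset of the M. tuberculosis values (pairing assumed from slide order).
m_tuberculosis = {"GAGG": 179, "GGAG": 175, "NGGA": 30, "ALAA": 29, "VAAA": 23}
candidates = signature_candidates(m_tuberculosis)
# candidates == [("GAGG", 179), ("GGAG", 175), ("NGGA", 30)]
```

Using the absolute value keeps strongly under-represented n-grams as candidates too, since the conclusion speaks of distance from the mean in either direction; whether the study treated negative deviations this way is not stated.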
Current and future work
• Find n-gram signatures in E. coli.
• Explore the relationship between genome size and the
distribution of n-gram standard deviations across
different species of the same genus.
• Find more specific targets to differentiate species
in terms of signature peptides for all 44
organisms in the study.