Genomes of bacterial pathogens and their diversity
Philippe Glaser - pglaser@pasteur.fr

1. Introduction: general concepts on pathogenic bacteria and their genomes
2. How to sequence a bacterial genome
3. Two examples: the genus Listeria and Streptococcus agalactiae

Examples of bacterial species and diseases
Tuberculosis: Mycobacterium tuberculosis
Leprosy: Mycobacterium leprae
Cholera: Vibrio cholerae
Whooping cough: Bordetella pertussis
Sore throat: Streptococcus pyogenes and viruses
Meningitis: Neisseria meningitidis and other bacteria
Gonorrhea: Neisseria gonorrhoeae
Plague: Yersinia pestis
Dysentery: Shigella flexneri
Gastric cancer, ulcer, gastritis: Helicobacter pylori
Multiple diseases: Escherichia coli, Staphylococcus aureus

Published genome sequences of bacterial pathogens
Shigella, Escherichia coli, Salmonella, Helicobacter, Pseudomonas, Yersinia, Stenotrophomonas, Burkholderia, Flavobacterium, Acinetobacter, Vibrio, Campylobacter, Staphylococcus, Enterococcus, Streptococcus, Listeria, Nocardia, Corynebacterium, Mycoplasma
2 4+2 3 3 1+2 3 0 0 0 0 4 1 4 1 9 4+1 0 1+3 6 1 12 1 3 2
Chlamydiae, Neisseria, Branhamella, Bordetella, Pasteurella, Actinobacillus, Haemophilus, Bartonella, Legionella, Leptospira, Borrelia, Treponema, Mycobacterium, Rickettsia, Anaplasma, Coxiella, Ehrlichia, Clostridium
4+1 2 0 3 1 0 2 3 3 2 1 2 5 3 0 1 0 2+1 2 4 2 1
Total: > 80 published genomes

Biodiversity of the microbial world
4 000 000 000 000 000 000 000 000 000 000 bacteria on Earth
3.5 billion years of evolution
5000 culturable species - 500 000 (?) species

Bacterial diversity in a Yellowstone hot spring
Principle of the experiment:
Sample
PCR amplification of 16S rRNA
Cloning: 300 clones
First analysis by restriction
DNA sequencing of 122 clones: 54 bacterial groups
84 sequences: 14 phyla (Hugenholtz et al., J.
Bacteriol. 1998, 180: 366-376)
38 sequences: 12 new phyla

Diversity of the non-culturable bacterial world

How to define a bacterial species
• For eukaryotes, the species definition is based on sexual reproduction. This is not possible for bacteria.
1. Phenotypic definition
2. Molecular definition:
   70% of “similarity” by genomic DNA hybridization
   More than 97% identity between the 16S rRNA genes
=> A convenient definition, but not a fully satisfactory one

Interactions between humans (the host) and bacteria
• The human body constitutes multiple ecosystems for bacterial communities:
  - The digestive tract
  - The throat
  - The skin
  - Other places are normally sterile (urine, milk, blood)
• Symbiotic bacteria
• Commensal bacteria
• Pathogenic bacteria
• Opportunistic pathogens and obligate pathogens

Bacteria and their environments
Reservoir: animals, water, soil, food, …
Vectors
Human host
The ecology of a pathogenic bacterium: understanding its adaptation to these environments (growth conditions)

Some questions in the study of human bacterial pathogens
• What are the virulence factors and the host-pathogen interaction factors?
• What is the physiology (the metabolism) of the bacteria in interaction with the host?
• What evolution of the bacteria led to their adaptation to the host, and what is their relation to the related non-pathogenic species?
• The identification of molecular tools for diagnosis and typing
• The identification, on a rational basis, of antigens for acellular vaccines
• The identification of drug targets
How to use genomics (and post-genomics) to answer these questions

Evolution & Biodiversity
Genome variability: point mutation, genome rearrangement, gene duplication, horizontal gene transfer
DNA repair
Barriers to DNA transfer
Selection
Biodiversity => virulence and pathogenicity

Size of bacterial genomes
Nanoarchaeum equitans: <500 kb
Mycoplasma genitalium: 0.580 Mb, 481 genes
Minimal genome: 300-400 genes
Escherichia coli: 4.6-5.6 Mb, 4289-5648 genes
Mesorhizobium loti: 7.036 Mb, 6752 genes
Streptomyces coelicolor: 8.667 Mb, 7825 genes
Human: 3,000 Mb, 30000 genes

Adaptation: transcription regulators vs genome size
(http://www.regx.de/m_project_bioinformatics.php)

Gene transfers in bacteria
Bacteriophages: transduction
Plasmids, transposons: conjugation
Competence: transformation

Mobile elements and gene gain
• IS elements => no associated function; gene integration by IS-mediated homologous recombination; gene inactivation
• Transposons => carry functional genes
• Integrons => a platform to incorporate new functions; multiple antibiotic resistance
• Phages => may carry virulence genes (cholera toxin)
• Pathogenicity (functional) islands
• Plasmids => may also carry transposons or integrons
• + gene duplication
Identification of such elements in genome sequences

Gene loss
• By homologous recombination
• By insertion of IS elements
• By mutation: gene => pseudogene
Evolutionary impact:
Reductive evolution (M. leprae, Y. pestis, B. pertussis)
Role in virulence: lysine decarboxylase in Shigella (cadA+ derivatives are less virulent)

Antigenic variation
• By recombination: a gene cassette is inserted in front of an active promoter or removed from this position.
(Brucella, Mycoplasma gallisepticum)
• By mutation: variation in the length of a microsatellite sequence (homopolymer tract) leads to a frameshift or to its reversion (Helicobacter pylori, Neisseria meningitidis)

Protein families and gene duplications
• May arise by gene duplication or horizontal gene acquisition
• Metabolic functions, surface proteins (antigens)
• Correspond to a specificity of a species
• Frequently discovered after whole genome sequencing

Analysis of the genome of a bacterial pathogen
• Annotation of the genome
• Analysis of regulatory genes
• Analysis of inactivated genes (pseudogenes)
• Identification of protein families and mechanisms of phase variation
• Identification of mobile elements
• Identification of atypical regions (recently acquired)
Information obtained from comparative genomics

DNA sequencing
Automated DNA sequencing machines produce sequences 800 bases long with an accuracy of 99%.
=> How to sequence a 4 Mb bacterial genome with an accuracy higher than 99.99%?
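The question above can be explored with a back-of-the-envelope calculation. The sketch below is not taken from the slides: it assumes that reads start uniformly at random along the chromosome, that read errors are independent, and that the consensus at each position is a simple majority vote. The function names are hypothetical and only serve the illustration.

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base (sequencing redundancy)."""
    return n_reads * read_length / genome_size


def uncovered_fraction(c: float) -> float:
    """Expected fraction of bases covered by no read at average coverage c.

    Assumption: read start positions are uniform and independent
    (Poisson approximation).
    """
    return math.exp(-c)


def majority_error(depth: int, per_read_error: float) -> float:
    """Probability that more than half of `depth` reads are wrong at a position.

    Assumption: errors are independent between reads. This binomial tail
    bounds the error of a majority-vote consensus under that assumption.
    """
    k_min = depth // 2 + 1
    return sum(
        math.comb(depth, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


if __name__ == "__main__":
    genome_size = 4_000_000   # 4 Mb genome, as on the slide
    read_length = 800         # bases per read, as on the slide
    read_error = 0.01         # 99% per-read accuracy, as on the slide
    for depth in (1, 3, 5, 8):
        n_reads = math.ceil(depth * genome_size / read_length)
        print(
            f"depth {depth}: {n_reads} reads, "
            f"uncovered {uncovered_fraction(depth):.2e}, "
            f"consensus error <= {majority_error(depth, read_error):.2e}"
        )
```

Under these simplified assumptions, redundancy is what lifts the accuracy: each position must be read several times. The figure of 10000 sequences per Mb given later in the slides would correspond to roughly 8-fold coverage if every read reached the nominal 800 bases.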
Two strategies: directed or random

Directed strategy
Chromosome
Ordering clones of a large-insert library (cosmids, lambda or BAC)
Sequencing clone by clone of the minimum tiling path
Complete sequence

Random strategy
Chromosome
Random sequencing of a large number of clones
Sequence assembly
Complete sequence

‘Whole genome shotgun’
Chromosome
Large-insert library (pSYX34 and BAC): end-sequencing (large-insert fragments)
Small-insert library (pcDNA2.1): end-sequencing (small-insert fragments)
Assembly of sequences in contigs
Closure
Annotation
Complete genome sequence

Organization of a project
Choice of the strategy
Library construction
DNA preparation of plasmid clones
High throughput sequencing of both ends of inserts
Assembly
Finishing: gap closure and resequencing of low quality regions
Annotation

Libraries
Libraries of insufficient quality => no sequence
Important features: coverage of the chromosome, absence of co-ligation, absence of clones without an insert, size of the inserts.
Different types of libraries:
* size of the inserts
* copy number of the vector
High-copy number vector: 1 to 3 kb inserts
Low-copy number vector: 8 to 12 kb inserts
Bacterial artificial chromosome: 50 to 100 kb inserts

Construction of a library of 1 - 3 kb inserts
Vector: pcDNA, a high copy number vector with two repeated BstXI sites
5’CCAG TGTG ATGG…CCAG CACA CTGG3’
3’GGTC ACAC TACC…GGTC GTGT GACC5’
Purification of the digested vector (two 5’ protruding ends: TGTG / ACAC)
Inserts: chromosomal DNA
Nebulization
End repair by T4 polymerase
Ligation of BstXI adaptors (protruding ends: CACA / GTGT)
5’pCTTTCCAGCACA3’
3’GAAAGGTCp 5’
Size selection of the inserts
Ligation, transformation
Recombinant plasmid

Bacterial artificial chromosome (BAC)
Vector based on the naturally occurring F-factor plasmid found in E. coli
Cloning of DNA fragments of 100 to 300 kb (average, 150 kb) in E.
coli
» strict copy number control
» stably maintained at 1-2 copies per cell
» lacZ-based color selection of BAC clones with inserts

BAC library construction
Preparation of chromosomal DNA in agarose plugs
Partial digestion with HindIII or BamHI
Ligation: vector + DNA purified from agarose plugs
Electroporation into E. coli DH10B
Verification of insert size on PFGE gels after NotI digestion
[Figure: PFGE gels with 50, 100, 150 and 200 kb size markers; inserts of 70 - 150 kb; linearized BAC vector (7 kb)]

Automation
High throughput sequencing

DNA sequencing 15 years ago!

Automated DNA sequencing

Automated sequencing

Sequence assembly
Phred, Phrap, Consed
http://www.phrap.org

Statistics and progress of the project

Finishing
• Re-sequencing of regions containing low ‘quality’ sequences
• Sequencing of ‘missing’ regions: sequence gaps and cloning gaps
[Figure: contigs A to F separated by sequence gaps and cloning gaps]

Timing of a bacterial genome project
Library construction and verification (one month)
Plasmid preparation: 5000 minipreps per Mb (7 days)
Sequencing: 10000 sequences per Mb (20 days, ABI 3700)
PCR: highly variable (250 reactions per Mb)
Consumable costs: 10 000 Euro per Mb

Listeria monocytogenes, a foodborne pathogen
Transmission: dairy products, meat, vegetables, fish
Disease: meningitis, encephalitis, septicemia, abortions, neonatal infections, gastroenteritis
Population at risk: elderly, newborns, immunocompromised, pregnant women
Mortality rate: 30%
Concern for public health
Problem for the food industry

Ecology of L.
monocytogenes
• Ability to survive and to grow in extreme conditions: low temperature, low water activity, broad pH range…
• Ubiquitous in the environment, but at very low counts
• Variable count depending on the microenvironment and the season at a single location
• Interaction with the plant world (silage) and the animal world (waste)

Interaction of Listeria with its hosts
• Carriage is frequent but transient
• Low concentration of Listeria in feces
• Intracellular parasite
• Ability to cross three barriers: the intestinal, blood-brain and placental barriers
• Provokes a broad range of diseases: gastroenteritis, septicemia, meningitis, encephalitis, abortions
• Population at risk: immunocompromised, elderly, pregnant women and newborns
What are the relations between the two facets of this bacterium?

Phylogenetic tree of the genus Listeria
L. ivanovii, L. grayi, L. seeligeri, L. innocua, L. welshimeri, L. monocytogenes (pathogenic species), B. subtilis
Vaneechoutte et al. Int J Syst Bact. (1998) 48, 127-139

Genome comparison
               L. monocytogenes EGDe        L. monocytogenes 4b   L. innocua   L. ivanovii
Genome size    2944 kb                      2943 kb               3011 kb      2929 kb
rRNA operons   6                            6                     6            6
CDS            2848                         2795                  2968         2782
Phages         1                            0                     5            0
IS             1 (3 copies), 1 transposon   0                     0            5
Plasmid        --                           --                    81.9 kb      --

L. monocytogenes / B. subtilis synteny
[Figure: synteny dot plot; one axis is labelled Listeria monocytogenes (0 to 3 000 000), the other runs from 0 to 4 500 000]

Synteny between Listeria genomes
[Figure: L. innocua and L. ivanovii each plotted against L. monocytogenes EGDe]
Absence of rearrangement between genomes
Rare translocations: probably deletion + insertion

L. monocytogenes chromosome map
L. monocytogenes: 270 ‘specific’ genes
L. innocua: 149 ‘specific’ genes
http://genolist.pasteur.fr/listilist

G+C content
G+C content of the 270 CDSs specific for L.
monocytogenes
[Figure: histogram of the number of CDSs (%, 0 to 14) by G+C% (25 to 52), for all CDSs (Total) and for the specific CDSs (Specific)]

Competence operons in L. monocytogenes
[Figure: maps of the comG (A to G), comE (A to C), comC and comF (A, C) operons. Each gene is labelled with a pair of values such as 37-34, where 37 is the GC% and 34 the % identity with the B. subtilis ortholog]
2695.1 and 2014.1: two comEC paralogs (DNA binding protein)

41 surface proteins with an LPXTG motif
[Figure: bar chart of protein length in amino acids (0 to 2500); InlA-like proteins are indicated; * = absent from the L. innocua genome]

L. monocytogenes / L. innocua comparison
Known virulence factors missing in L. innocua
Surface proteins missing in L. innocua
Metabolic pathways missing in L. innocua:
Sugar PTS
Hexose phosphate permease
Bile acid hydrolase
Arginine deiminase
Glutamate decarboxylase

L. monocytogenes - L. ivanovii
[Figure: chromosome map (2944 / 0) showing the virulence gene cluster, inlA, inlB, hpt, bsh and inlC]
L. monocytogenes: 345 ‘specific’ genes
L. ivanovii: 350 ‘specific’ genes

The virulence gene cluster
[Figure: organization of the virulence gene cluster (prfA, plcA, hly, mpl, actA or i-actA, plcB and flanking ORFs such as orfX, orfZ, orfS, orfT, orfB, orfA) between prs and ldh in L. monocytogenes, L. ivanovii, L. seeligeri, L. innocua, L. welshimeri and L. grayi, shown next to the phylogenetic tree of the genus and compared with the B. subtilis region (prs, gcaD, spoVC, ctc, mfd, yabK); a 5.5 kb segment is marked]
Complex history with several insertion and deletion events

[Figure: infection cycle]
Entry: InlA, InlB
Lysis of the vacuole: LLO, PlcA
Intracellular movement: ActA
Cell-to-cell spread: ActA
Lysis of the double membrane: LLO, PlcB

The inlA - inlB region
L. monocytogenes EGDe lmo0415 lmo0432 inlA inlB lmo0435 (LPXTG) lmo0439 17 genes L. monocytogenes 4b lmo0432 wapA-like inlA Lin439* lmo0439 inlB L. innocua Lmo0435 (LPXTG) wapA-like lmo0439 lin439 lmo0432 L.
ivanovii lmo0415 amidase inl-like inlA inlB-like inl-like inlB lmo0439

Other virulence genes
bsh, bile salt hydrolase
[Figure: the bsh locus next to groEL (labels: 2066, LPXTG) in L. monocytogenes, L. ivanovii and L. innocua]
PrfA box is not conserved
hpt, hexose phosphate transport
[Figure: the hpt locus between 837 and 839: L. monocytogenes (295 nt, hpt, 107 nt), L. ivanovii (457 nt, hpt, 153 nt), L. innocua (31 nt)]
The PrfA box is conserved: pseudogene

Listeria ivanovii - closer to a real pathogen?
Some specific functions related to virulence

A second pathogenicity island
L. monocytogenes and L. innocua: tRNA, lmo1240, 1241, 1242
L. ivanovii: tRNA, i-inlB2, sphingomyelinase-c, i-inlL, i-inlK, i-inlB, i-inlJ, i-inlI, i-inlH, i-inlG, i-inlF, i-inlE, lmo1242, lmo1240

Lmo2699: soluble internalin
[Figure: the Lmo2699 - 2700 region in L. monocytogenes and L. innocua, and in L. ivanovii (additional LPXTG gene)]
Capsule biosynthesis?
And 96 inactivated genes (pseudogenes)

Conclusions
Contrary to the rest of the genome, virulence genes have a complex history.
Possible cycles of virulence gene gain and loss. These cycles may play a role in the evolution of the genus and in the emergence of species.
Functions required for intracellular multiplication are conserved between the two pathogenic species.
Interactions with the host and physiopathology are probably different and involve different factors.
The specialization of L. ivanovii is linked to the presence of specific genes and to the loss of a large number of functions.

What is the diversity within the species L. monocytogenes?

Listeria monocytogenes
Serovars: 1/2a, 1/2b, 1/2c, 3a, 3b, 3c, 4a, 4ab, 4b, 4c, 4d, 4e, 7
Epidemiological data
• The great majority of human listeriosis cases are caused by 1/2a, 1/2b and 4b strains
• Serovar 4b strains are responsible for almost all major epidemics of human listeriosis as well as for most of the sporadic cases

AscI profiles of L.
monocytogenes strains
WHO multi-center study
AscI genomic fingerprints of 62 representative Listeria monocytogenes strains
Genomic Division I: 1/2a, 3a, 1/2c, 3c
Genomic Division II: 1/2b, 3b, 4b, 4d, 4e
[Figure: PFGE gel; size markers 582, 485, 388, 291, 242, 194, 145, 97, 48 and 23 kb]
Brosch et al., 1994, AEM 60:2584-92

High density membranes for Listeria
Hybridisation with chromosomal DNA of:
• clinical (epidemic) isolates
• food isolates
• environmental isolates
Correlation of genomic and epidemiological data should allow:
The identification of genes consistently absent or present in, e.g., epidemic and clinical isolates
The development of:
New tools for genomic typing
New accurate methods for diagnostics
[Figure: membrane layout with spots for gene A, gene B, gene C and a control, from L. monocytogenes EGDe 1/2a, L. innocua 6a and L. monocytogenes 4b]

Hybridization patterns of L. monocytogenes
[Figure: membranes hybridized with genomic DNA of L.m. sv. 1/2a, L.m. sv. 1/2c, L.m. sv. 4b and L.m. sv. 1/2b]

Hybridisation with different Listeria strains
L. monocytogenes: 94 strains
Serovar: 1/2a, 1/2c, 1/2b, 3a, 3b, 3c, 4a, 4b, 4c, 4d, 4e, 7
Origin: environment, food, animals, production environment, human (sporadic and epidemic cases)
Listeria ivanovii: 5 strains
Listeria innocua: 7 strains
Listeria welshimeri: 2 strains
Listeria seeligeri: 2 strains
In total, 110 strains belonging to all species of the genus Listeria

Grouping 460 genes for 112 strains of Listeria
[Figure: clustering of the hybridization profiles. Strain groups: serovars 4b, 4e, 4d; serovars 1/2b, 3b, 7; serovars 1/2a, 3a; serovars 1/2c, 3c; serovars 4a, 4c; Listeria sp. Within L. monocytogenes the groups are labelled I (I.1, I.2), II (II.1, II.2) and III. Genes shown: ORF0799, ORF2372, ORF2110, ORF2819, ORF3840, ORF2568, ORF1761, ORF0029, Lmo0171, Lmo0172, Lmo0525, Lmo0734, Lmo0735, Lmo0736, Lmo0737, Lmo0738, Lmo0739, Lmo1060, Lmo1061, Lmo1062, Lmo1063, Lmo1968, Lmo1969, Lmo1971, Lmo1973, Lmo1974]

Conclusion
The L. monocytogenes species shows a broad genomic diversity.
Genomes are stable and horizontal genetic exchanges are rare.
The species and subspecies are well defined by a set of genes, and there seems to be no continuum between groups.
The notion of species is probably not only an arbitrary one.
DNA arrays are a powerful genome-level typing tool for epidemiological studies and research.

Streptococcus agalactiae (group B)
Part of the normal flora colonizing the gastrointestinal tract of a large part of the population; may also colonize the urogenital tract.
Disease:
Rare infections of immunocompromised adults
Leading cause of invasive infections in neonates:
septicemia (early onset disease)
pneumonia (early onset disease)
meningitis (late onset disease)
=> Surveillance of pregnant women to avoid mother-infant transmission
=> Development of a vaccine?

Biodiversity within the species S. agalactiae
Two ecovars: human and bovine mastitis
[Table (Finch & Martin, 1984): characteristics compared between the two ecovars: pigment, lactose, salicin, beta-galactosidase, bacitracin sensitivity, protein antigens (R, Icp; X)]
Other animal origins: diseases in various mammals and fishes
Human origin: carriage or invasive strains
MLEE and MLST pointed to the existence of a hypervirulent lineage.
Q. What is the genomic basis of this diversity?

Phylogenetic relationship among streptococci
S. agalactiae, S. pyogenes, S. equi, S. anginosus, S. pneumoniae, S. mitis, S. uberis, S. sanguis, S. suis, S. salivarius, S. bovis, S. mutans, S. pleomorphus
(from Kawamura et al. J. Syst. Bacteriol. 1995)

Genome comparison
                     S. agalactiae NEM316                      S. pyogenes             S. pneumoniae
Size of the genome   2 206 kb                                  1852 kb                 2160 kb
Ribosomal operons    8                                         6                       4
CDSs                 2182                                      1752                    2236
Mobile elements      8 IS, 12 phage-like integrases,           17 IS, 4 bacteriophages 105 IS
                     2 integrated plasmids
                     (1 with 3 copies, 42 kb)

Synteny between S. agalactiae and S. pneumoniae (1141 pairs of orthologous genes)
[Figure: synteny dot plot, S. agalactiae (0 to 2 000 000) against S. pneumoniae (0 to 2 000 000)]

Synteny between S. agalactiae and S. pyogenes (1170 pairs of orthologous genes)
[Figure: synteny dot plot, S. agalactiae (0 to 2 000 000) against S. pyogenes (0 to 2 000 000)]
36 recombination breakpoints

14 mobile islands (532 genes)
[Figure: chromosome map with G+C/G-C and G+C% tracks and the genes related to mobile elements within islands I to XIV. Islands are flanked by tRNA genes (tRNA-A, tRNA-L, tRNA-R, tRNA-K, tRNA-T), range from 11 to 86 kb, and carry plasmid, phage and transposon genes (int, rep, mob, tra, tnp, parA, ssb, pol, rel, hel)]

NEM316 - SAG2603 genome comparison
• No chromosomal rearrangement between the two strains
• No integrated plasmid in SAG2603, but three prophages
• 1799 orthologs between these two genomes (633 are 100% identical)
=> 241 NEM316 genes are missing in SAG2603 (37 in the backbone)
=> 258 SAG2603 genes are missing in NEM316 (42 in the backbone)
Although highly variable, 10 mobile islands are conserved.

NEM316 / SAG2603 - conserved backbone
[Figure: chromosome map of the strain-specific backbone loci. NEM316: gbs0046-47, gbs0086-87, gbs0162-163, gbs0493, gbs1240-1242, gbs1400-1401 (ABC transporter), gbs1740-1749 (ABC transporter), gbs1823 (His triad protein). SAG2603: sag0046, sag0086-88, sag1330-1331, sag1697-1703, sag1780. Other labels: protein R5, cpsJDNMH]

Comparative analysis of island XII
[Figure: alignment of island XII in NEM316 and SAG2603 A/B, showing Lmb, scpB, lactose utilization, and mercuric and cadmium resistance; colour scale of % identity in bands: <60%, 60-70%, 70-80%, 80-90%, 90-95%, 95-98%, 98-100%]

MLST results for S. agalactiae
adhP: alcohol dehydrogenase
pheS: phenylalanyl tRNA synthetase
atr: amino acid transporter
glnA: glutamine synthetase
sdhA: serine dehydratase
glcK: glucokinase
tkt: transketolase
[Figure: MLST tree with the positions of Sag2603, NEM316 and the "hypervirulent" lineage]
(Jones et al., 2003, J. Clin. Microbiol.)
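In MLST, each strain receives one allele number per housekeeping locus, and the combination of the seven numbers defines its sequence type (ST). The following is a minimal sketch of that lookup, using the seven loci listed above. The allele numbers, the ST labels and the function names are invented for the illustration and are not data from the study.

```python
from typing import Dict, Optional, Tuple

# The seven housekeeping loci listed on the slide, in a fixed order.
LOCI = ("adhP", "pheS", "atr", "glnA", "sdhA", "glcK", "tkt")

Profile = Tuple[int, ...]

# Hypothetical lookup table: these allele numbers and ST labels are
# made up for the example and do not come from the MLST database.
KNOWN_PROFILES: Dict[Profile, str] = {
    (1, 1, 2, 1, 1, 2, 2): "ST-A",
    (2, 1, 1, 2, 1, 1, 1): "ST-B",
}


def profile_from_alleles(alleles: Dict[str, int]) -> Profile:
    """Order the allele numbers of one strain according to LOCI."""
    return tuple(alleles[locus] for locus in LOCI)


def sequence_type(alleles: Dict[str, int]) -> Optional[str]:
    """Return the ST label of a strain, or None for an unknown profile."""
    return KNOWN_PROFILES.get(profile_from_alleles(alleles))


def shared_alleles(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    strain = {"adhP": 2, "pheS": 1, "atr": 1, "glnA": 2,
              "sdhA": 1, "glcK": 1, "tkt": 1}
    print(sequence_type(strain))
    print(shared_alleles((1, 1, 2, 1, 1, 2, 2), (2, 1, 1, 2, 1, 1, 1)))
```

Counting shared alleles between profiles is one simple way to relate sequence types to each other; the slides do not state which method was used to build the MLST tree.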
DNA array hybridization for genome characterization
68 strains analyzed by MLST and hybridization
• 10 invasive ST-17 strains (MLST study)
• BM110, hypervirulent clone defined by MLEE
• 18 invasive strains (Hôpital Necker)
• 13 carriage strains (Hôpital Necker)
• 14 strains from bovine mastitis
• 12 strains of animal origin (horse, dog, cat, rabbit, guinea pig, fish)

Genome diversity is essentially located within genomic islands
[Figure: bar chart (0 to 300) per strain, split into islands and backbone. Strains shown: animal isolates (rabbit, guinea pig, dog, cat, fish), bovine isolates (bov_*), carriage isolates (port_*) and invasive isolates (inv_*)]

Hierarchical clustering of 69 strains and comparison with MLST data
[Figure: clustering tree annotated with sequence types: st19, st1, st10,6,9, st23, st17, st103, st23]

Two loci heterogeneously distributed among isolates
[Figure: gene maps of locus I and locus II; labels: rofA, hemagglutinin, glycosyl transferase, secY, secA, rogB, fibronectin binding protein, LPXTG, srtB, srtC, LPXTG]
rofA and rogB are mutated in sag2603
I  ------------------++++++++++++++++++++++++++++++++--++++++++++
II ------------------++++++++++++++++++++++++++++--------+-+-++++

Conclusion
• Strains of different origins do not cluster by origin, except the invasive ST17 strains.
• ST17 strains constitute a highly homogeneous group
• Diversity resides mostly within islands
• Antigenic diversity is highlighted by genome analysis and is found both within and outside islands
• DNA arrays are a powerful method for molecular epidemiology
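The hierarchical clustering of hybridization profiles mentioned above can be sketched as follows. This is an illustration only: the slides do not say which distance or linkage was used, so the sketch assumes a Hamming distance between presence/absence profiles (written with + and -, as on the two-loci slide) and average linkage. The strain names and profiles are made up.

```python
from itertools import combinations
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of genes whose presence/absence differs between two profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genes")
    return sum(x != y for x, y in zip(a, b))


def average_linkage(profiles: Dict[str, str], n_clusters: int) -> List[List[str]]:
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the mean Hamming distance over
    all pairs of strains taken from the two clusters (average linkage).
    """
    clusters: List[List[str]] = [[name] for name in profiles]

    def cluster_distance(c1: List[str], c2: List[str]) -> float:
        total = sum(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical profiles over 8 genes: '+' = gene detected, '-' = absent.
    example = {
        "strain_1": "++++--++",
        "strain_2": "++++--+-",
        "strain_3": "--++++--",
        "strain_4": "--+++---",
    }
    print(average_linkage(example, n_clusters=2))
```

With real array data, each profile would hold one symbol per gene on the membrane, and the resulting groups could then be compared with the MLST sequence types of the same strains.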