Supporting Information

Appendix S1: More information on the robust regression method used in this study

Figure S1: SIGI-HMM predictions in 54 Escherichia coli strains
The left panel shows SIGI-HMM predictions (cyan regions) in the genomes of 54 E. coli strains; the genomes are colored according to genomic %AT, from light blue (most AT rich) to dark blue (most GC rich). The horizontal axis designates the chromosomal position in base pairs (bp), while the vertical axis to the right indicates the total size of predicted GIs (bp). The right panel shows more detailed annotations of E. coli O26:H11 strain 11368, the E. coli strain with the largest genome, and E. coli K-12 substrain MG1655, a non-pathogenic model organism with one of the smallest E. coli genomes. Gene names ending with an underscore indicate multiple variants of the gene. ECO26_P_ designates multiple variants of E. coli O26:H11 prophages. More details regarding these annotations can be found in Tables S5 and S6 for E. coli K-12 and E. coli O26:H11, respectively. Differences in how HT regions are drawn in the left and right panels are due to differences in scale.

Table S1: Results from phylum-level genome size versus AT content robust regression analysis
The table includes regression estimates (column 2) for the bacteria in each phylogenetic group and Proteobacteria subphylum (column 1), as well as the standard error (column 3), t statistic (column 4), p value (column 5) and number n of strains included in the analysis (column 6).

Group/subphyla               Estimate  Std. Error        t       p    n
Actinobacteria                 -17.54        1.24    -14.2  0.0001  214
Firmicutes                      -0.75        0.75       -1    0.32  423
Tenericutes                      0.25        0.57     0.44    0.66   54
Bacteroides-Chlorobi            -5.55        2.16    -2.57    0.01   93
Alpha-proteobacteria            -7.86        0.39   -20.02  0.0001  215
Beta-proteobacteria             -8.26        0.76   -10.81  0.0001  120
Delta-proteobacteria            -4.36        1.69    -2.58    0.01   43
Epsilon-proteobacteria           0.43        0.19      2.2    0.03   74
Gamma-proteobacteria           -10.88        0.48   -22.77  0.0001  429
Chlamydiae-Verrucomicrobia       5.75        0.01   673.46  0.0001   49
Cyanobacteria                   -3.83        2.29    -1.67     0.1   44
Spirochaetes                    -2.56        1.66    -1.54    0.15   47

Table S2: Results from species-level genome size versus AT content robust regression analysis
The table shows the results from regression analyses between genome size and genomic %AT in the strains of the species designated in column 1. Column 2 shows the regression estimate, column 3 the standard error, column 4 the t statistic, column 5 the p value and column 6 the number n of strains included in each regression model.

Species                              Estimate  Std. Error        t       p    n
Chlamydia trachomatis                   20.56        1.34    15.34  0.0001   21
Neisseria meningitidis                   17.1        5.29     3.34   0.006   14
Helicobacter pylori                     11.45        3.76     3.04   0.004   38
Escherichia coli                       149.64       22.21     6.74  0.0001   54
Salmonella enterica                      -2.7       44.59    -0.06    0.95   25
Yersinia pestis                        -103.5       45.82    -2.26    0.04   12
Acinetobacter baumannii                  2.17       31.89     0.07    0.95   10
Francisella tularensis                   2.88        0.31     9.34  0.0001   10
Buchnera aphidicola                      0.02        0.01     2.23    0.05   11
Bacillus cereus                          0.48         8.6     0.06    0.95   18
Clostridium botulinum                  -28.02        8.31    -3.37   0.006   13
Listeria monocytogenes                  47.27        7.56     6.25  0.0001   22
Staphylococcus aureus                   -11.8       12.58    -0.94    0.35   31
Streptococcus pneumoniae                 33.3        12.9     2.58    0.02   19
Streptococcus suis                      33.16        8.66     3.83   0.002   14
Streptococcus pyogenes                   10.5       12.99     0.81    0.43   16
Mycobacterium tuberculosis                0.6        0.52     1.16    0.27   14
Corynebacterium diphtheriae            -19.05       11.65    -1.64    0.13   13
Corynebacterium pseudotuberculosis      -50.1        8.79     -5.7  0.0001   15
Prochlorococcus marinus                 -3.59        0.32   -11.22  0.0001   12

Table S3: Robust regression analysis of relative entropy versus genomic %AT at the species level
The table shows the results from regression analyses between genome-based relative entropy and genomic %AT in the strains of the species designated in column 1. Column 2 shows the regression estimate, column 3 the standard error, column 4 the t statistic, column 5 the p value and column 6 the number n of strains included in each regression model.

Species                              Estimate  Std. Error        t       p    n
Chlamydia trachomatis                    0.05        0.08     0.67    0.51   21
Neisseria meningitidis                   -0.5        0.06    -7.72  0.0001   14
Helicobacter pylori                     -0.04        0.12    -0.34    0.74   38
Escherichia coli                        -0.67        0.06   -12.17  0.0001   54
Salmonella enterica                     -1.04        0.04   -23.91  0.0001   26
Yersinia pestis                         -0.03        0.14    -0.22    0.83   12
Acinetobacter baumannii                  0.01         0.1     0.05    0.96   10
Francisella tularensis                    0.2        0.07     2.94    0.02   10
Buchnera aphidicola                      0.24    1.00E-03   158.37  0.0001   11
Bacillus cereus                         -0.07        0.07    -1.12    0.28   18
Clostridium botulinum                    0.72        0.06    12.62  0.0001   13
Listeria monocytogenes                  -0.55        0.04   -13.31  0.0001   22
Staphylococcus aureus                    0.05        0.05      0.9    0.38   31
Streptococcus pneumoniae                -0.24        0.07    -3.19   0.005   19
Streptococcus suis                      -0.12        0.09    -1.43    0.18   14
Streptococcus pyogenes                   0.01        0.12      0.1    0.92   16
Mycobacterium tuberculosis              -0.06    1.60E-03   -36.58  0.0001   14
Corynebacterium diphtheriae             -0.04        0.09    -0.46    0.65   13
Corynebacterium pseudotuberculosis       0.09        0.06     1.46    0.17   15
Prochlorococcus marinus                 -0.03        0.01    -2.43    0.04   12

Table S4: Robust regression analysis of relative entropy versus genome size at the species level
The table shows the results from regression analyses between genome-based relative entropy and genome size in the strains of the species designated in column 1. Column 2 shows the regression estimate, column 3 the standard error, column 4 the t statistic, column 5 the p value and column 6 the number n of strains included in each regression model.

Species                              Estimate  Std. Error        t       p    n
Chlamydia trachomatis                   -7.78        0.79    -9.79  0.0001   21
Neisseria meningitidis                 -32.69        8.52    -3.84   0.002   14
Helicobacter pylori                    -20.69        4.84    -4.27  0.0001   38
Escherichia coli                      -280.34       21.97   -12.76  0.0001   54
Salmonella enterica                     26.41       23.84     1.11    0.28   26
Yersinia pestis                        107.03      126.28     0.85    0.42   12
Acinetobacter baumannii               -133.97      108.25    -1.24    0.25   10
Francisella tularensis                  -1.63        6.43    -0.25    0.81   10
Buchnera aphidicola                      0.05    3.00E-02     1.75    0.11   11
Bacillus cereus                       -245.54       32.23    -7.62  0.0001   18
Clostridium botulinum                 -127.68        3.45   -37.05  0.0001   13
Listeria monocytogenes                -124.06       13.44    -9.23  0.0001   22
Staphylococcus aureus                 -203.96       14.61   -13.96  0.0001   31
Streptococcus pneumoniae              -133.67       22.91    -5.83  0.0001   19
Streptococcus suis                     -82.33       43.49    -1.89    0.08   14
Streptococcus pyogenes                  -87.4       11.58    -7.55  0.0001   16
Mycobacterium tuberculosis               0.27    2.30E-01     1.17    0.26   14
Corynebacterium diphtheriae            -25.97       48.16    -0.54     0.6   13
Corynebacterium pseudotuberculosis    -335.35       13.55   -24.75  0.0001   15
Prochlorococcus marinus                -25.02        5.28    -4.74  0.0008   12

Table S5: Annotations of SIGI-HMM predicted regions in Escherichia coli K-12 substrain MG1655 (Excel format)
The NCBI name and chromosome position of each SIGI-HMM prediction are given in the first column, with a more detailed explanation of the predicted DNA in columns 2 and 3. All annotations are taken from the IslandViewer website.

Table S6: Annotations of SIGI-HMM predicted regions in Escherichia coli O26:H11 strain 11368 (Excel format)
The NCBI name and chromosome position of each SIGI-HMM prediction are given in the first column, with a more detailed explanation of the predicted DNA in columns 2 and 3. All annotations are taken from the IslandViewer website.