Supporting Online Materials

Bayesian reconstruction of the whole-genome phylogenetic tree of S. pneumoniae. We tested the reliability of the phylogenetic reconstruction shown in Fig. 1 using a Bayesian Markov Chain Monte Carlo method [69]. The resulting tree is shown in Fig. S1. Its structure is similar to that of the tree in Fig. 1. The clades conserved between the two trees are indicated as lineages I-VI in Fig. 1. The major differences are the position of Group VI, which in this reconstruction appears to be closely related to clade I, and the position of the serotype 7F strain CDC1087, which does not belong to Group I but instead forms a monophyletic branch with the serotype 5 strain 70585. Using the option to define test statistics in the program BEAST [69], we tested the probability that each of the six lineages I-VI is monophyletic. We obtained 100% probability for groups II-VI. For group I, the probability is 0 if we also include strain CDC1087, and 100% if we include only the serotype 3 strains, suggesting that while the composition of lineages II-VI is robust, the inclusion of CDC1087 in lineage I depends on the tree reconstruction method.

Whole genome phylogenetic analysis generalizes MLST-like typing methods. To quantify the degree of correlation between phylogeny and dispensable genome, we measured the association between the lineages of Fig. 1 and the presence/absence of dispensable genes present in more than one strain (see also Table 1). For each of these genes, we quantified the strength of the association between the allelic variant and the lineages of Fig. 1 by computing the association index V (see Methods). A value of V=1 means perfect association, while V=0 indicates random association between allelic variants and lineages. The average association value was 0.5 (median of the P-values equal to 0.07), indicating, in most cases, a non-random association between presence/absence of dispensable genes and lineages.
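As an illustration of how an association index of this kind behaves, the sketch below computes Cramér's V between a categorical label per strain (presence/absence of a gene, or an allele identifier) and the lineage of each strain. The Methods are not reproduced in this section, so treating V as Cramér's V is an assumption; it is chosen only because it matches the stated range (1 for perfect association, 0 for no association). The function name and the toy labels are hypothetical, and no P-value is computed.

```python
from collections import Counter
from math import sqrt


def cramers_v(variants, lineages):
    """Cramér's V between two categorical labelings of the same strains.

    Assumption: this stands in for the association index V of the text,
    whose exact definition is given in the Methods, not in this section.
    """
    if len(variants) != len(lineages):
        raise ValueError("both labelings must cover the same strains")
    n = len(variants)
    row_totals = Counter(variants)   # strains per variant
    col_totals = Counter(lineages)   # strains per lineage
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return 0.0  # no association can be measured if one labeling is constant
    observed = Counter(zip(variants, lineages))
    chi2 = 0.0
    for variant, r_tot in row_totals.items():
        for lineage, c_tot in col_totals.items():
            expected = r_tot * c_tot / n
            chi2 += (observed[(variant, lineage)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical toy data: six strains in three lineages.
toy_lineages = ["I", "I", "II", "II", "III", "III"]
toy_alleles = ["a1", "a1", "a2", "a2", "a3", "a3"]    # one allele per lineage
toy_presence = [1, 0, 1, 0, 1, 0]                     # unrelated to lineage

print(cramers_v(toy_alleles, toy_lineages))   # perfect association, V close to 1
print(cramers_v(toy_presence, toy_lineages))  # no association, V equal to 0
```

In practice the index would be computed once per gene and then averaged over genes, as in the values reported in the text.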
Repeating the analysis for the association of the dispensable genome with the clonal complexes (CCs), we found an average value of V=0.82 (median of the P-values equal to 0.007). Given the extent of homologous recombination, we investigated the extent to which the allelic variants of core genes correlated with lineages I-VI inferred by molecular phylogenetics. We identified 680 core genes that were present in single copy in each genome, and classified the distinct alleles of each of these genes. The average association is 0.747 (median of the P-values equal to 3.9 × 10^-6), indicating a strong association between alleles and lineages. Not surprisingly, since each CC is always contained in a single lineage, the association with CCs was even stronger (V equal to 0.94, median of the P-values equal to 5.3 × 10^-11). We conclude that, knowing the CC of a strain, we can predict the allelic variant of most core genes with a high level of confidence. Taken together, these results indicate that there is a strong and statistically significant association between the allelic form of core genes and the classification into lineages I-VI and CCs. A significant association also exists between presence/absence of dispensable genes and both lineages and CCs. This association is weaker than in the case of allelic variants of core genes, probably due to the more dynamic nature of the dispensable genome, which is rich in mobile elements [3].

Clustering based on presence/absence of dispensable genes. The pattern of presence/absence can be used to obtain a hierarchical clustering that identifies groups of related strains. In Fig.S8 we show a hierarchical clustering based on presence/absence of dispensable regions; red boxes mark the clusters exceeding the 98% confidence level using a multiscale bootstrap resampling method [70, 71]. The clusters obtained with this method coincide with monophyletic branches of the phylogenetic tree shown in Fig. 1, confirming the correlation between genomic content and genetic distance.
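The idea of clustering strains by their presence/absence patterns can be sketched as follows. This is a minimal illustration, not the procedure used for the figure: the distance measure and linkage are not stated in this section, so Jaccard distance and average linkage are assumptions, and the multiscale bootstrap confidence values of [70, 71] are not computed. Strain and gene names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B| for two sets of dispensable genes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(profiles):
    """Agglomerative clustering of strains by gene content.

    profiles: dict mapping strain name -> set of dispensable genes present.
    Returns the merges in order, each as (cluster_1, cluster_2, distance),
    where a cluster is a tuple of strain names.
    """
    names = sorted(profiles)
    # Pairwise strain distances, keyed by the unordered pair of names.
    dist = {frozenset(pair): jaccard_distance(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Average of all strain-to-strain distances between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the closest pair of clusters (first pair wins on ties).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy profiles: two pairs of strains with similar gene content.
toy_profiles = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2"},
    "strain_C": {"g7", "g8"},
    "strain_D": {"g7", "g8", "g9"},
}
for left, right, d in average_linkage(toy_profiles):
    print(left, right, round(d, 3))
```

Strains that share most of their dispensable genes merge first at small distances, while groups with no genes in common merge last at distance 1.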
Compared to Fig. 1, there are variations in the mutual position of some groups. For instance, Group V containing all serotype 1 strains is split by this analysis into three distinct clusters: one corresponding to the CC306 strains, a second formed by the CC90 strains 670_6B and SP18, whose dispensable genome is more closely related to the Taiwan19F_14 and G54 strains, and a third corresponding to the CC217/CC2296 strains, which are more closely related to the serotype 3 strains. Interestingly, this is confirmed by an analysis of the repeats present in these genomes, because NCTC7465, P1031 and P1041 share some repeat families that are only found in the serotype 3 isolates in this dataset (see below). Group I, containing all serotype 3 strains and the serotype 7F strain CDC1087, is split into a group containing only the serotype 3 strains, while strain CDC1087 is not closely related to any other strain, suggesting that, as also indicated by the Bayesian posterior probability, the relative order of the A45 and CDC1087 branches in Fig. 1 is probably an artifact of the phylogeny reconstruction method.

Insertion Sequences are more variable than other phylogenetic markers. In total, intact representatives of 19 different insertion sequence (IS) types could be identified in the dataset. These include ISSpn7, found to have caused an insertion disrupting the hyaluronate lyase gene of S. pneumoniae strain SP18, and ISSpn8. The frequency distribution of IS elements is variable. For instance, we found that IS200 is present as a single copy (either intact or degenerate) in each genome, and there were no observed instances of transposition. In contrast, other elements, such as ISSpn5, were absent from some isolates but had increased in copy number in others. It seems likely that this dissemination within genomes occurs relatively quickly compared to other changes in the genome, e.g. alterations in the sequence of the MLST loci.
For example, the serotype 1 genomes Sp033038 and Sp032672 contain 16 and 14 copies of IS1515, respectively, whilst strain Sp061370 (of the same ST) has only one copy, suggesting that loss or gain of IS1515 has occurred since these strains diverged from each other. However, there is little phylogenetic signal in the distribution of repeats, which is likely to be a consequence of the high turnover of such elements: the majority of CDS annotated as transposons appear to be non-functional, suggesting they are remnants of past IS expansions. Hence the IS complement of each pneumococcal chromosome appears to be in a state of flux, with new families imported and expanding to replace those that are eroded by inactivating mutations.

Fig. S1. Bayesian phylogenetic tree obtained using the SNPs of the whole genome alignment of the 44 genomes of S. pneumoniae.

Fig. S2. Hierarchical clustering of the strains based on the presence/absence of dispensable regions. Red numbers indicate the percentage p-value of the branch computed using the Approximate Unbiased (AU) method (see Methods), while green numbers indicate the bootstrap support. Blue and red bars indicate MLST Clonal Complexes and Sequence Types, respectively.

Table S1. Sequenced strains used in this study. For each strain, we report name, genome length, number of predicted coding sequences, locus of isolation, disease or carriage status, sequencing status, source of the sequence, accession number, sequencing technology and average sequence coverage. For pneumococcal strains, the ST and CC determined by the eBURST algorithm [21] are also reported. *) Finished genomes are complete genomes that were inspected for proper assembly and manually inspected for ambiguous consensus base calls. **) These Whole Genome Shotgun projects have been deposited at DDBJ/EMBL/GenBank under the accession xxxx00000000. The version described in this paper is the first version, xxxx01000000.

Table S2.
For each strain (rows), we report the number of genes not conserved in the other sequenced genomes (columns).

Table S3. Distribution of the genes from Table 3 which are not shared by all S. pneumoniae strains. Green squares indicate presence, gray squares indicate absence of the gene.