Supplementary Methods

Simulation of equal-moving E. coli pangenome

We generated virtual pangenomes in which all genes change location at an equal rate, under the null hypothesis that random movement alone produces a set of order-unchanged genes comparable to the one observed. The simulated pangenomes match the actual E. coli pangenome in number of genomes (19), genome size (2,175 operons), total gene pool (10,059 operons), and gene moving frequency (835±455 operons). The operon, rather than the gene, was used as the moving unit in this simulation. Briefly, each simulation begins with an initial virtual genome of 2,175 operons; the second genome was generated by moving 835±455 operons randomly selected from the 2,175 operons of the first genome. The selected operons were randomly deleted, inserted (selected from the total operon pool), or moved to alternative genome sites. The remaining 17 virtual genomes were generated in a similar way. During this process, each genome was given a “death rate” according to the number of operons moved (Rn) in comparison to all previous genomes, calculated as (Rn-835)/835. The death rate simulates natural negative selection against strains with large variation in genome structure. The number of location-unchanged operons in the simulated pangenomes is much lower than the number of GOF operons (876 operons). Even when we reduced the moving rate by 15%, 30%, 45%, and 60% and ran 500 simulations for each moving rate, the number of location-unchanged operons in each simulated pangenome never reached that of GOF (Mann-Whitney U test, P<0.01; Figure S4). Thus, the null hypothesis is rejected in favor of the alternative hypothesis that genes in a genome do not move at equal frequency.

Connectivity and essentiality of GOF genes in E. coli

Given that GOF matters for the organization of the genome, the question arises whether GOF genes have more functional partners than average. To address this question, a “gene network” of a representative E. coli strain, K12-DH10B, was built from its gene links/actions in the STRING database (http://string-db.org/) (1). With a link strength score cut-off above 800, the genes with the most connections are significantly enriched in GOF genes (see the p-value calculation below), which means that GOF genes have more strong connections with other genes and act as hubs of the gene network (Table S4). Next, we examined the protein interactions of known interaction type with score≥800. All links connected to GOF and non-GOF genes are depicted with Cytoscape (2) (Figure 5C). In this diagram, which represents the core of the gene network, GOF genes form the overwhelming majority and constitute the functional hub of the genome.

To assess whether the GOF genes are essential for cell viability or perform housekeeping functions, they were compared with relevant and widely recognized data sets (3-5). Most housekeeping genes are essential, that is, their inactivation is lethal for bacteria. These comprehensive gene sets are involved in the maintenance of basal cellular functions and form the “housekeeping core” of all bacterial cells. For the housekeeping genes in each data set, the percentages of GOF and non-GOF core genes were counted, and p-values were calculated to evaluate the significance of enrichment (Table S2). In sum, essential and housekeeping genes are highly enriched in genes belonging to the GOF.

GOF gene enrichment was tested for COG categories, gene sets with top connections, and the essential gene set. The p-value associated with a property P for a subset of n GOF genes is the probability of obtaining n or more genes with property P when drawing 1486 GOF genes from the 2542 p-core genes at random without replacement.
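The tail probability described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `hypergeom_tail` and `bonferroni`, the reading of `M` as the number of p-core genes carrying the property, and the counts in the usage lines are all assumptions made for the example.

```python
from math import comb


def hypergeom_tail(n, M, N=2542, K=1486):
    """Return P(X >= n) when K genes are drawn without replacement from N.

    N: population size (p-core genes in the text)
    K: number of genes drawn (GOF genes in the text)
    M: genes in the population that carry the property (assumed meaning)
    n: drawn genes observed to carry the property
    """
    lo = max(n, 0)
    hi = min(M, K)  # terms beyond this bound are zero
    # Sum the numerators as exact integers and divide once at the end,
    # which avoids accumulating floating-point error over many terms.
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(lo, hi + 1))
    return numerator / comb(N, K)


def bonferroni(p, n_tests):
    """Bonferroni-adjust a p-value for n_tests comparisons, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts for illustration only; these are not values from the paper.
p_raw = hypergeom_tail(n=250, M=300)
print(p_raw, bonferroni(p_raw, n_tests=20))
```

Exact integer arithmetic is used here because the binomial coefficients involved are far too large for floating point. A library routine such as the survival function of a hypergeometric distribution would serve the same purpose.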
Statistical validity was assessed by the hypergeometric distribution (HD) test with respect to the initial set of correlated genes and includes a Bonferroni correction due to the repetitive nature of the analysis (5). The p-value is given by

$$p = \sum_{k=n}^{1486} \frac{\binom{M}{k}\binom{2542-M}{1486-k}}{\binom{2542}{1486}},$$

where M is the number of p-core genes with property P and $\binom{a}{b}$ denotes a binomial coefficient.

Generation of simulated drafts and their assembly

In bacterial genome drafts obtained by de novo assembly of paired-end reads from the Illumina sequencing platform, the lengths of scaffolds/contigs usually follow a power-law distribution. From the data of fifty E. coli strains that we sequenced ourselves, we obtained a geometric mean scaffold/contig length of 4,588 bp. To mimic empirical data when generating virtual drafts from the testing genomes, we first removed large repeats (>500 bp), then broke the genome into contigs of random lengths that follow the same power-law distribution (geometric mean 4,588 bp). Between each pair of neighboring contigs, a 1-kb interval was deleted to create a gap. Each testing genome was randomly broken ten times to generate ten testing drafts.

In assembling these simulated drafts, contigs were oriented according to the order of the GOF genes on them, and contigs that contain no GOF gene were missed. Such missed contigs account for only up to about 3% of the total contig length in all tested species except E. coli, which has large plasticity regions. For this species, we further incorporated non-GOF p-core genes as secondary indexes, i.e., missed contigs were placed where the non-GOF p-core genes have the highest occurrence frequency. This secondary index greatly reduced missed contigs to 0.29±0.28%, but produced a total error rate of 2.03±2.69% (Table S7). Next, the accuracy of GOF-directed contig ordering was compared with that of reference-genome-directed ordering in the 34 testing E. coli genomes; the total length of contigs that were falsely located or missed was only 94.6±122.3 kb/genome with the GOF-based method, two-thirds less than with the reference-based method (Figure S7).

Assembly of ten empirical E. coli drafts

The efficiency of the GOF-based strategy was tested with ten empirical E. coli drafts, which were sequenced on the Illumina HiSeq 2000 system at an average depth of 200× and de novo assembled with SOAPdenovo into 30-80 scaffolds per draft. Using the GOF and then the non-GOF core gene indexes, we were able to orient on average 97.18% of the scaffolds by total length. Most neighboring relationships between scaffolds were supported by paired-end reads, rRNA sequences and/or missed contigs, which allowed us to join the scaffolds ab initio into 2-13 (on average 7.4) super-scaffolds per genome. PCR reactions verified all the GOF-predicted scaffold coordination, even across gaps without supporting reads/sequences (Table S8). The gaps were closed with matching reads and their paired reads or by sequencing PCR products. Thus, the GOF-directed strategy greatly simplified genome assembly and made it possible to finish an E. coli genome with merely 500 bp paired-end reads and a dozen PCR reactions.

References

1. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ. (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, D808–15.
2. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13(11), 2498–504.
3. Gil R, Silva FJ, Peretó J, Moya A. (2004). Determination of the core of a minimal bacterial gene set. Microbiol. Mol. Biol. Rev. 68, 518–537.
4. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, et al. (2006). Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2, 2006.0008.
5. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. (2004). GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715.
