Supplementary Methods

Simulation of equal-moving E. coli pangenome
We generated virtual pangenomes in which all genes change location at an equal rate,
under the null hypothesis that equal-rate movement alone would randomly produce a
comparable set of genes whose order is unchanged. The simulated pangenomes match the
actual E. coli pangenome in number of genomes (19 genomes), genome size (2,175 operons),
total gene pool (10,059 operons), and gene moving frequency (835±455 operons). The
operon, rather than the gene, was used as the moving unit in this simulation. Briefly, each
simulation begins with an initial virtual genome of 2,175 operons, and the second
genome is generated by moving 835±455 operons randomly selected from
the 2,175 operons of the first genome. The selected operons were randomly deleted,
inserted (selected from the total operon pool), or moved to alternative genome sites. The
remaining 17 virtual genomes were generated in the same way. During this process, each
genome was assigned a “death rate” of (Rn-835)/835, where Rn is the number of operons
moved in comparison to all previous genomes. The death rate simulates natural
negative selection against strains with large variation in genome structure. In the
simulated pangenomes, the number of location-unchanged operons is much lower than the
number of GOF operons (876 operons). Even when we reduced the moving rate by 15%, 30%,
45%, and 60%, and ran 500 simulations for each moving rate, the number of
location-unchanged operons in the simulated pangenomes never reached that of GOF
(Mann-Whitney U test, P<0.01; Figure S4). Thus, the null hypothesis is rejected in favor
of the alternative hypothesis that genes in a genome do not move at equal frequency.
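The simulation loop described above can be sketched as follows. This is a minimal illustration rather than the original implementation, and several details are assumptions that the text does not specify: Rn is drawn from a normal distribution with mean 835 and standard deviation 455, each genome is derived from the preceding one, the death rate is treated as the probability of discarding a candidate genome, "inserted" means the selected operon is replaced by a pool operon at a random site, and a "location-unchanged" operon is a founder operon that was never selected in any step. The names `simulate_pangenome` and `rate_scale` are hypothetical.

```python
import random

GENOME_SIZE = 2175                 # operons per genome (from the text)
POOL_SIZE = 10059                  # total operon pool (from the text)
N_GENOMES = 19                     # genomes per pangenome (from the text)
MEAN_MOVED, SD_MOVED = 835, 455    # operons moved per genome (from the text)


def simulate_pangenome(rng, rate_scale=1.0):
    """Return how many founder operons were never selected for a change."""
    genome = list(range(GENOME_SIZE))      # operon IDs in founder order
    unchanged = set(genome)
    n_made = 1
    while n_made < N_GENOMES:
        # Assumption: Rn is normally distributed, then clipped to a valid count.
        r_n = round(rng.gauss(MEAN_MOVED, SD_MOVED) * rate_scale)
        r_n = max(0, min(r_n, len(genome)))
        # Assumption: the death rate is the chance of discarding this genome.
        death_rate = (r_n - MEAN_MOVED) / MEAN_MOVED
        if rng.random() < death_rate:
            continue
        selected = rng.sample(genome, r_n)
        chosen = set(selected)
        unchanged -= chosen
        genome = [op for op in genome if op not in chosen]
        present = set(genome)
        for op in selected:
            action = rng.choice(("delete", "insert", "move"))
            if action == "delete":
                continue
            if action == "insert":
                # Assumption: swap in a pool operon that is not in the genome.
                op = rng.randrange(POOL_SIZE)
                while op in present or op in chosen:
                    op = rng.randrange(POOL_SIZE)
            # "insert" and "move" both place an operon at a random site.
            genome.insert(rng.randrange(len(genome) + 1), op)
            present.add(op)
        n_made += 1
    return len(unchanged)


# One run at the full moving rate and one at a rate reduced by 60%.
print(simulate_pangenome(random.Random(0)))
print(simulate_pangenome(random.Random(0), rate_scale=0.4))
```

Repeating the call 500 times per moving rate and comparing the resulting counts with the 876 GOF operons would mirror the comparison reported in Figure S4, within the limits of the assumptions listed above.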
Connectivity and essentiality of GOF genes in E. coli
Given that GOF matters for the organization of the genome, the question arises whether
GOF genes have more functional partners than average. To address this
question, a “gene network” of a representative E. coli strain, K12-DH10B, was built
from its gene links/actions in the STRING database (http://string-db.org/) (1). When
links between genes were restricted to strength scores above 800, the genes with the most
connections were significantly enriched in GOF genes (see below for the p-value
calculation), which means that GOF genes have more strong connections with other genes
and act as hubs of the gene network (Table S4). Next, we examined the protein interactions
of known interaction type with score≥800. All links connected to GOF and non-GOF genes
were drawn with Cytoscape (2) (Figure 5C). In this diagram, which represents the core of
the gene network, GOF genes form the overwhelming majority and constitute the functional
hub of the genome.
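As an illustration of the connectivity count, the sketch below tallies, for each gene, the number of links at or above the score cut-off and reports the share of GOF genes among the most connected genes. The edge list, gene names, and function names are hypothetical toy values, not data from STRING or from this study.

```python
from collections import Counter


def strong_degree(edges, cutoff=800):
    """Count links per gene, keeping only links with score >= cutoff."""
    degree = Counter()
    for gene_a, gene_b, score in edges:
        if score >= cutoff:
            degree[gene_a] += 1
            degree[gene_b] += 1
    return degree


def gof_share_of_top(degree, gof_genes, top=10):
    """Fraction of the `top` most connected genes that are GOF genes."""
    hubs = [gene for gene, _ in degree.most_common(top)]
    if not hubs:
        return 0.0
    return sum(gene in gof_genes for gene in hubs) / len(hubs)


# Toy input: (gene, gene, score) triples and a made-up GOF gene set.
toy_edges = [("a", "b", 900), ("a", "c", 850), ("a", "d", 700),
             ("b", "c", 950), ("d", "e", 400)]
toy_degree = strong_degree(toy_edges)
print(toy_degree)
print(gof_share_of_top(toy_degree, gof_genes={"a", "b"}, top=3))
```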
To assess whether the GOF genes are essential for cell viability or perform housekeeping
functions, they were compared with relevant and widely recognized data sets (3-5). Most
housekeeping genes are in fact essential; that is, their inactivation is lethal for bacteria.
These comprehensive sets comprise genes involved in the maintenance of basal cellular
functions, the “housekeeping core” of all bacterial cells. For the housekeeping genes
in each data set, the percentages of GOF and non-GOF core genes were counted, and p-values
were calculated to evaluate the significance of enrichment (Table S2). In sum, essential and
housekeeping genes are highly enriched in genes belonging to the GOF.
GOF gene enrichment was tested for COG categories, gene sets with the most connections,
and the essential gene set. The p-value associated with a property P for a subset of
n GOF genes is the probability of obtaining n or more genes with property
P when drawing 1486 GOF genes from the 2542 p-core genes at random
without replacement. Statistical significance is assessed with the hypergeometric
distribution (HD) test with respect to the initial set of correlated genes, and includes a
Bonferroni correction for the repeated nature of the analysis (5). The p-value is given by

$$p = \sum_{k=n}^{1486} \frac{\binom{M}{k}\binom{2542-M}{1486-k}}{\binom{2542}{1486}},$$

where M is the number of p-core genes with property P and $\binom{\cdot}{\cdot}$ indicates
a binomial coefficient.
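The upper-tail sum can be evaluated exactly with integer binomial coefficients. The sketch below is one way to do so using only the Python standard library. The function names and the example counts in the usage lines are hypothetical, and the Bonferroni step assumes the simple form of multiplying by the number of tests and capping at 1.

```python
from math import comb


def enrichment_p(n, M, N=2542, K=1486):
    """P(X >= n) when K genes are drawn without replacement from N genes,
    M of which carry the property (hypergeometric upper tail)."""
    # comb(M, k) is 0 for k > M, so the sum can run to K as in the formula.
    tail = sum(comb(M, k) * comb(N - M, K - k) for k in range(n, K + 1))
    return tail / comb(N, K)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical example: 300 p-core genes carry a property, 250 of them are GOF.
p_raw = enrichment_p(n=250, M=300)
print(p_raw, bonferroni(p_raw, n_tests=20))
```

Because Python integers are arbitrary precision, the large coefficients do not overflow; only the final ratio is converted to a floating-point number.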
Generation of simulated drafts and their assembly
In bacterial genome drafts assembled de novo from paired-end reads obtained
on the Illumina sequencing platform, the lengths of scaffolds/contigs usually follow a
power-law distribution. From the data of fifty E. coli strains that we sequenced ourselves,
we obtained a geometric mean scaffold/contig length of 4,588 bp. To mimic empirical data when
generating virtual drafts from testing genomes, we first removed large repeats (>500 bp),
then broke the genome into contigs with random lengths that follow the same power-law
distribution (geometric mean 4,588 bp). Between each pair of neighboring
contigs, a 1-kb interval was deleted to create a gap. Each testing genome was randomly
broken ten times to generate ten testing drafts.
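The fragmentation step can be sketched as below. The text gives the geometric mean (4,588 bp) and the gap size (1 kb) but not the exponent or minimum of the length distribution, so the sketch assumes a Pareto distribution with a hypothetical shape parameter `alpha = 1.5` and sets its minimum so that the geometric mean equals 4,588 bp (for a Pareto distribution the geometric mean is the minimum times exp(1/alpha)). Repeat removal is assumed to have been done beforehand, and the function names are illustrative.

```python
import math
import random

GEO_MEAN_BP = 4588   # geometric mean contig length (from the text)
GAP_BP = 1000        # interval deleted between neighboring contigs (from the text)


def fragment(genome_len, rng, alpha=1.5):
    """Return (start, end) coordinates of contigs separated by 1-kb gaps."""
    # Assumption: Pareto lengths; x_min chosen to give the stated geometric mean.
    x_min = GEO_MEAN_BP / math.exp(1.0 / alpha)
    contigs = []
    pos = 0
    while pos < genome_len:
        length = int(x_min * rng.paretovariate(alpha))
        end = min(pos + length, genome_len)
        contigs.append((pos, end))
        pos = end + GAP_BP          # skip the deleted interval
    return contigs


def make_drafts(genome_len, n_drafts=10, seed=0):
    """Break the same genome independently n_drafts times."""
    rng = random.Random(seed)
    return [fragment(genome_len, rng) for _ in range(n_drafts)]


drafts = make_drafts(4_600_000)
print([len(d) for d in drafts])
```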
In the assembly of these simulated drafts, contigs were oriented according to the order of
the GOF genes on them, and those containing no GOF gene were missed. Such missed
contigs account for only up to about 3% of total contig length in all tested species except
E. coli, which has large plasticity regions. For this species, we further incorporated
non-GOF p-core genes as secondary indexes, i.e., missed contigs were placed where their
non-GOF p-core genes have the highest occurrence frequency. This secondary index
greatly reduced missed contigs to 0.29±0.28%, but generated a total error rate of
2.03±2.69% (Table S7). Next, the accuracy of GOF-directed contig ordering was
compared with that of reference-genome-directed ordering in the 34 testing E. coli genomes.
The total length of contigs that were falsely located or missed was only 94.6±122.3
kb/genome with the GOF-based method, two-thirds less than with the reference-based method
(Figure S7).
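One simple way to picture GOF-directed ordering is shown below. It is a hedged sketch, not the procedure used in this study: it assumes each contig is represented by the ordered list of genes found on it, places a contig at the smallest GOF rank it contains, and calls it reverse-oriented when its first GOF gene ranks after its last one. The secondary non-GOF index is not modeled, and all names and gene labels are hypothetical.

```python
def order_contigs(contigs, gof_order):
    """Order and orient contigs by the GOF genes they carry.

    contigs:   dict mapping contig name -> list of gene names along the contig
    gof_order: list of GOF gene names in their conserved order
    Returns (ordered list of (name, strand), list of missed contig names).
    """
    rank = {gene: i for i, gene in enumerate(gof_order)}
    placed, missed = [], []
    for name, genes in contigs.items():
        ranks = [rank[g] for g in genes if g in rank]
        if not ranks:
            missed.append(name)       # no GOF gene: cannot be placed
            continue
        strand = "-" if ranks[0] > ranks[-1] else "+"
        placed.append((min(ranks), name, strand))
    placed.sort()
    return [(name, strand) for _, name, strand in placed], missed


toy_contigs = {
    "c1": ["g4", "x", "g3"],   # GOF genes appear in descending order
    "c2": ["g1", "g2"],
    "c3": ["y"],               # carries no GOF gene
    "c4": ["g5"],
}
print(order_contigs(toy_contigs, ["g1", "g2", "g3", "g4", "g5"]))
```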
Assembly of ten empirical E. coli drafts
The efficiency of the GOF-based strategy was tested with ten empirical E. coli drafts, which
were sequenced on the Illumina HiSeq 2000 system at an average depth of 200× and
assembled de novo with SOAPdenovo into 30-80 scaffolds per draft. Using GOF and then
non-GOF core gene indexes, we were able to orient scaffolds covering on average 97.18% of
total length. Most neighboring relationships between scaffolds were supported by paired-end
reads, rRNA sequences and/or missed contigs, which allowed us to join the scaffolds ab initio
into 2-13 (7.4 on average) super-scaffolds per genome. PCR reactions verified all
GOF-predicted scaffold arrangements, even across gaps without supporting reads/sequences
(Table S8). The gaps were closed with matching reads and their paired reads, or by
sequencing of PCR products. Thus, the GOF-directed strategy greatly simplified genome
assembly and made it possible to finish an E. coli genome with merely 500-bp paired-end
reads and a dozen PCR reactions.
References
1. Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., Lin, J.,
Minguez, P., Bork, P., von Mering, C. & Jensen, L. J. (2013). STRING v9.1: protein-protein
interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815.
2. Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N.,
Schwikowski, B. & Ideker, T. (2003). Cytoscape: a software environment for integrated models
of biomolecular interaction networks. Genome Res. 13, 2498–2504.
3. Gil, R., Silva, F. J., Peretó, J. & Moya, A. (2004). Determination of the core of a minimal bacterial
gene set. Microbiol. Mol. Biol. Rev. 68, 518–537.
4. Baba, T., Ara, T., Hasegawa, M., Takai, Y., Okumura, Y., Baba, M. et al. (2006). Construction of
Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol.
2, 2006.0008.
5. Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M. & Sherlock, G. (2004).
GO:TermFinder—open source software for accessing Gene Ontology information and finding
significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20,
3710–3715.