Text S1. Methodological details
Genome sequencing
A combination of Illumina and 454 shotgun sequencing was performed on the single-cell
re-MDA products for Candidatus Poribacteria sp. WGA-4E and 4G. For Illumina
sequencing, 0.3 kbp shotgun libraries were constructed for each SAG. Briefly, 3 µg of MDA
product was sheared in 100 µl using the Covaris E210 with the settings 10% duty
cycle, intensity 5, and 200 cycles per burst for 3 min per sample, and the fragmented DNA
was purified using QIAquick columns (Qiagen) according to the manufacturer's
instructions. The sheared DNA was end-repaired and A-tailed according to the Illumina
standard PE protocol and purified using the MinElute PCR Purification Kit (Qiagen) with
a final elution in 12 µl of Buffer EB. After quantification using a Bioanalyzer DNA 1000
chip (Agilent), the fragments were ligated to the Illumina adaptors according to the
Illumina standard PE protocol, followed by purification of the ligation product
using AMPure SPRI beads. The Illumina libraries were quantified using a Bioanalyzer
DNA High Sensitivity chip (Agilent), and 300 ng of DNA (in 6 µl) then underwent
normalization using the Duplex-Specific Nuclease (DSN) Kit (Axxora) (Bogdanova et al
2009). For normalization, the dsDNA was denatured for 3 min at 98°C, followed by a
hybridization step at 68°C for 5 h and DSN treatment at 68°C for 20 min. The normalized
libraries were amplified by PCR for 12 cycles, gel-purified, quality-checked on a
Bioanalyzer DNA High Sensitivity chip (Agilent), and then sequenced using an Illumina
GAIIx sequencer (run mode 2x76 bp). For 454 pyrosequencing, a 4 kbp paired-end
library was constructed and sequenced for each SAG. All general aspects of and
detailed protocols for library construction and sequencing can be found at the JGI
website (http://www.jgi.doe.gov/). Sequencing yielded the following raw data sets: SAG
4E: 6.8 Gbp of Illumina sequence and 74.4 Mbp of 454 sequence (276,672 reads); SAG 4G:
5.8 Gbp of Illumina sequence and 97.1 Mbp of 454 sequence (335,757 reads).

For SAG 4C, sequencing was conducted at LGC Genomics GmbH, Berlin, Germany,
also using a hybrid approach of Illumina and 454 pyrosequencing. A 3 kbp paired-end
library and a standard shotgun library were constructed and sequenced using 454 FLX
Titanium technology. For Illumina sequencing, a standard shotgun library (1x100 bp) was
constructed and sequenced using the Illumina HiSeq 2000 platform. This resulted in
2.3 Gbp of Illumina sequence and 153.6 Mbp of 454 sequence (481,505 reads).

The draft genomes of SAGs 3G and 4CII were generated at the JGI using Illumina
technology. An Illumina standard shotgun library was constructed and sequenced using the
Illumina HiSeq 2000 platform. Sequencing yielded raw data sets of 1.4 Gbp of Illumina
sequence for SAG 3G and 0.8 Gbp of Illumina sequence for SAG 4CII. General aspects of
library construction and sequencing performed at the JGI can be found at
http://www.jgi.doe.gov.

Genome assembly
All raw Illumina sequence data were passed through DUK, a filtering program developed
at JGI that removes known Illumina sequencing and library preparation artifacts
(http://duk.sourceforge.net/), using the parameters -k 22 -s 1 -c 1. Specifically,
all reads containing sequencing adapters, low-complexity reads, and reads containing
short tandem repeats were removed. Artifact-filtered sequence data were then screened
and trimmed according to the k-mers present in the dataset using kmernorm
(http://sourceforge.net/projects/kmernorm/). High-depth k-mers, presumably derived
from MDA amplification bias, cause problems in the assembly, especially if the k-mer
depth varies by orders of magnitude between different regions of the genome. For SAGs
3G and 4CII, reads with high k-mer coverage (>30X average k-mer depth, k=31) were
normalized to an average depth of 30X, and reads with an average k-mer depth of less
than 2X were removed. For SAGs 4C, 4E, and 4G, we removed reads representing
high-abundance k-mers (>32X k-mer coverage, k=31) and trimmed reads that contained
unique k-mers. After filtering, 1.7M reads for 3G, 0.2M for 4CII, 5.1M for 4C, 3M
for 4E, and 1.3M for 4G remained.
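
The depth-based read filtering described for SAGs 3G and 4CII can be illustrated with a
minimal Python sketch. This is not the kmernorm implementation: the function names, the
single-pass "keep a read while its k-mers are still below the target depth" rule, and the
toy parameters are assumptions chosen only to show the idea of dropping low-depth reads
and capping high-depth ones.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_depth(seq, counts, k):
    """Average count of the read's k-mers in a k-mer count table."""
    ks = kmers(seq, k)
    return sum(counts[x] for x in ks) / len(ks) if ks else 0.0


def normalize(reads, k=31, target=30, min_depth=2):
    """Toy depth normalization (assumed logic, not kmernorm itself)."""
    # Pass 1: count every k-mer in the full dataset.
    total = Counter(x for r in reads for x in kmers(r, k))
    kept, seen = [], Counter()
    for r in reads:
        # Drop reads whose k-mers are rare in the whole dataset.
        if mean_depth(r, total, k) < min_depth:
            continue
        # Keep a read only while its k-mers are still below the target
        # depth among the reads retained so far.
        if mean_depth(r, seen, k) < target:
            kept.append(r)
            seen.update(kmers(r, k))
    return kept
```

With a small k and target, for example `normalize(reads, k=5, target=3)`, a heavily
duplicated read is retained only a few times, while a read made of k-mers seen once is
discarded.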

For SAGs 4E, 4G, and 4C, assemblies were performed in the following steps: (1) Filtered
Illumina reads were assembled using Velvet version 1.1.02 (Zerbino and Birney 2008).
The VelvetOptimiser script (version 2.1.7) was used with default optimization functions
(n50 for k-mer choice, total number of base pairs in large contigs for cov_cutoff
optimization). (2) The Velvet contigs were used to simulate reads from long-insert
libraries, which were used together with the filtered reads as input for Allpaths-LG
(Gnerre et al 2011) assembly. (3) Allpaths contigs larger than 1 kb were
shredded into 1-kb pieces with 200 bp overlaps. (4) Lastly, the Allpaths shreds and raw
454 pyrosequence reads were assembled using the 454 Newbler assembler version 2.5
(Roche/454 Life Sciences, Branford, CT, USA).
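
Step (3) above, shredding contigs into 1-kb pieces with 200 bp overlaps, can be sketched
as follows. The function name and the handling of the final piece (which may be shorter
than 1 kb here) are assumptions; the source does not say how the tool that was used
treated contig ends.

```python
def shred(contig, size=1000, overlap=200):
    """Cut a contig into pieces of `size` bases that overlap by `overlap` bases."""
    step = size - overlap  # each new piece starts 800 bases after the previous one
    pieces = []
    for start in range(0, len(contig), step):
        pieces.append(contig[start:start + size])
        # Stop once a piece has reached the end of the contig.
        if start + size >= len(contig):
            break
    return pieces
```

Under these assumptions, a 2,600 bp contig yields three 1,000 bp pieces starting at
positions 0, 800, and 1,600, so the last 200 bases of each piece are repeated at the
start of the next.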

The following steps were performed for assembly of 3G and 4CII: (1) Normalized
Illumina reads were assembled using Velvet version 1.1.04 (Zerbino and Birney 2008).
(2) 1-3 kbp simulated paired-end reads were created from Velvet contigs using wgsim
(https://github.com/lh3/wgsim). (3) Normalized Illumina reads were assembled with
simulated read pairs using Allpaths-LG (version r39750) (Gnerre et al 2011).
Parameters for the assembly steps were: (1) Velvet (velveth: 71 -shortPaired and
velvetg: -very_clean yes -exportFiltered yes -min_contig_lgth 500 -scaffolding no
-cov_cutoff 10). (2) wgsim (-e 0 -1 100 -2 100 -r 0 -R 0 -X 0). (3) Allpaths-LG
(PrepareAllpathsInputs: PHRED_64=1 PLOIDY=1 FRAG_COVERAGE=125 JUMP_COVERAGE=25
LONG_JUMP_COV=50, RunAllpathsLG: THREADS=8 RUN=std_shredpairs TARGETS=standard
VAPI_WARN_ONLY=True OVERWRITE=True).

These approaches resulted in the following draft assemblies: SAG 3G: total assembly
size of 5,627,474 bp (304 contigs); SAG 4CII: total assembly size of 596,887 bp (64
contigs); SAG 4C: total assembly size of 1,713,200 bp (302 contigs); SAG 4E: total
assembly size of 3,679,266 bp (540 contigs); and SAG 4G: total assembly size of
1,443,813 bp (296 contigs).

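The Velvet and wgsim parameters reported above for SAGs 3G and 4CII can be written out
as command lines. The sketch below only builds the argument lists and does not run
anything. The directory and file names are hypothetical, FASTA input is assumed because
it is velveth's default format, and any option not reported in the text (for example the
wgsim insert-size settings needed for 1-3 kbp pairs) is deliberately left out.

```python
import shlex


def velvet_commands(outdir, reads_fasta):
    """Build argv lists for velveth and velvetg using the reported options."""
    velveth = ["velveth", outdir, "71", "-shortPaired", reads_fasta]
    velvetg = [
        "velvetg", outdir,
        "-very_clean", "yes",
        "-exportFiltered", "yes",
        "-min_contig_lgth", "500",
        "-scaffolding", "no",
        "-cov_cutoff", "10",
    ]
    return velveth, velvetg


def wgsim_command(contigs_fasta, out1, out2):
    """Build the wgsim argv list: no errors, no mutations, 100 bp read pairs."""
    return [
        "wgsim", "-e", "0", "-1", "100", "-2", "100",
        "-r", "0", "-R", "0", "-X", "0",
        contigs_fasta, out1, out2,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    vh, vg = velvet_commands("velvet_out", "normalized_reads.fa")
    ws = wgsim_command("velvet_out/contigs.fa", "sim_1.fq", "sim_2.fq")
    for cmd in (vh, vg, ws):
        print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to pass them to `subprocess.run` later
without shell quoting problems.
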
Genome annotation and SAG whole genome sequencing quality control
The five poribacterial SAG sequence assemblies were complemented by an additional
poribacterial SAG, Candidatus Poribacteria WGA A3 (hereafter 3A), which was previously
sequenced and analyzed by Siegl et al (2011). All following steps were conducted with
the five newly sequenced SAGs and the assembly of SAG 3A, which can be accessed
under GenBank accession number ADFK00000000.

Genes were identified using Prodigal (Hyatt et al 2010). The predicted CDSs were
translated and used to search the National Center for Biotechnology Information (NCBI)
nonredundant database (nr), UniProt, TIGRFam, Pfam, KEGG, COG, and InterPro
databases. The tRNAscan-SE tool (Hacker and Kaper 2000) was used to find tRNA
genes, whereas ribosomal RNA genes were found by searches against models of the
ribosomal RNA genes built from SILVA (Pruesse et al 2007). Other non-coding RNAs,
such as the RNA components of the protein secretion complex and RNase P, were
identified by searching the genomes for the corresponding Rfam profiles using INFERNAL
(Makarova et al 1999). Additional gene prediction analysis and manual functional
annotation were performed within the Integrated Microbial Genomes (IMG) platform
(Markowitz et al 2008), particularly IMG/mer, developed by the Joint Genome Institute,
Walnut Creek, CA, USA (http://img.jgi.doe.gov).

All genome sequences were quality checked automatically by mapping against known
contaminants, as well as manually using several tools in the IMG/mer system, such as
tetranucleotide frequency analysis, phylogenetic distribution of genes, and GC content
distribution. We generally followed a conservative approach and removed all contigs that
appeared as contamination in any one of the screenings. A detailed description of the
contamination screening process in IMG/mer can be found at
http://img.jgi.doe.gov/mer/doc/SingleCellDataDecontamination.pdf. An additional quality
screen independent of the IMG system was conducted by phylogenetic assignment of all
genes using blastx against the NCBI nonredundant database and MEGAN (Huson et al
2007). This additional approach enabled us to detect contaminating sequences from
sources not included in the IMG system at the time, such as mitochondrial DNA from the
sponge host.
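
Two of the manual screens named above, GC content distribution and tetranucleotide
frequency analysis, rest on simple per-contig statistics. The Python sketch below shows
how such statistics could be computed. It is not the IMG/mer implementation, and the
function names and the 0.10 GC deviation threshold are assumptions made for
illustration.

```python
from collections import Counter
from itertools import product
from statistics import median


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def tetra_freqs(seq):
    """Relative frequency of each of the 256 tetranucleotides."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    keys = ["".join(p) for p in product("ACGT", repeat=4)]
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[k] for k in keys)
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def flag_gc_outliers(contigs, max_dev=0.10):
    """Names of contigs whose GC content deviates from the median by > max_dev."""
    gc = {name: gc_content(seq) for name, seq in contigs.items()}
    mid = median(gc.values())
    return sorted(name for name, v in gc.items() if abs(v - mid) > max_dev)
```

Contigs flagged this way would only be candidates for removal; in the procedure
described above, such screens were combined with phylogenetic evidence and manual
inspection.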
Contamination originated largely from previously identified contaminants of the WGA
reaction kit (Blainey and Quake 2011, Woyke et al 2011) and mitochondrial DNA of the
sponge host (Table S1). However, in SAGs 3A and 4G we detected contamination from
additional sources, and the amount of non-poribacterial DNA in these two datasets was
larger than in the other SAGs. Thus, we excluded all reads that lacked genes with
significant homology (≥60% identity) to any of the other cleaned poribacterial SAG genes.
Since a larger proportion of the previously published dataset 3A was contaminated
(Siegl et al 2011), we updated the original genome sequence and deposited the updated
version at DDBJ/EMBL/GenBank under the accession number ADFK00000000. The
version described in this paper is version ADFK02000000. After contamination removal,
the final assembly sizes were 0.41 Mbp, 5.44 Mbp, 1.63 Mbp, 0.54 Mbp, 3.65 Mbp,
and 0.19 Mbp for SAGs 3A, 3G, 4C, 4CII, 4E, and 4G, respectively.
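
To make the effect of decontamination concrete, the final sizes above can be compared
with the draft assembly sizes reported in the assembly section. The short Python
calculation below is a sketch using only those reported numbers. The final sizes are
rounded to 0.01 Mbp, so the fractions are approximate, and SAG 3A is omitted because
its draft size is not given in this text.

```python
# Draft assembly sizes (bp) and final sizes (Mbp) as reported in this text.
draft_bp = {"3G": 5_627_474, "4CII": 596_887, "4C": 1_713_200,
            "4E": 3_679_266, "4G": 1_443_813}
final_mbp = {"3G": 5.44, "4CII": 0.54, "4C": 1.63, "4E": 3.65, "4G": 0.19}


def removed_fraction(draft, final):
    """Fraction of each draft assembly that is absent from the final assembly."""
    return {sag: 1 - final[sag] * 1e6 / draft[sag] for sag in draft}


removed = removed_fraction(draft_bp, final_mbp)
for sag, frac in sorted(removed.items(), key=lambda kv: kv[1]):
    print(f"SAG {sag}: {frac:.1%} of the draft assembly removed")
```

The calculation gives roughly 87% removed for SAG 4G and under 1% for SAG 4E, which is
consistent with the statement that 4G contained far more non-poribacterial DNA than
most of the other SAGs.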

Gene predictions for all cleaned SAGs were evaluated and, if necessary, corrected
using the GenePRIMP software (Pati et al 2010). Updated versions were
resubmitted to IMG/mer, replacing the previous submissions, for functional analysis.
Unless stated otherwise, all functional analyses were conducted with tools in the
IMG/mer software system.

Blainey PC, Quake SR (2011). Digital MDA for enumeration of total nucleic acid
contamination. Nucleic Acids Res 39: e19.

Bogdanova E, Shagina I, Mudrik E, Ivanov I, Amon P, Vagner L et al (2009). DSN
Depletion is a simple method to remove selected transcripts from cDNA populations. Mol
Biotechnol 41: 247-253.

Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ et al (2011).
High-quality draft assemblies of mammalian genomes from massively parallel sequence
data. Proc Natl Acad Sci U S A 108: 1513-1518.

Hacker J, Kaper JB (2000). Pathogenicity islands and the evolution of microbes. Annu
Rev Microbiol 54: 641-679.

Huson D, Auch A, Qi J, Schuster S (2007). MEGAN analysis of metagenomic data.
Genome Res 17: 377-386.

Hyatt D, Chen G-L, LoCascio P, Land M, Larimer F, Hauser L (2010). Prodigal:
Prokaryotic gene recognition and translation initiation site identification. BMC
Bioinformatics 11: 119.

Makarova KS, Aravind L, Galperin MY, Grishin NV, Tatusov RL, Wolf YI et al (1999).
Comparative genomics of the Archaea (Euryarchaeota): Evolution of conserved protein
families, the stable core, and the variable shell. Genome Res 9: 608-628.

Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D et al (2008).
IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res
36: D534-538.

Pati A, Ivanova NN, Mikhailova N, Ovchinnikova G, Hooper SD, Lykidis A et al (2010).
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat
Methods 7: 455-457.

Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J et al (2007). SILVA: A
comprehensive online resource for quality checked and aligned ribosomal RNA
sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.

Siegl A, Kamke J, Hochmuth T, Piel J, Richter M, Liang C et al (2011). Single-cell
genomics reveals the lifestyle of Poribacteria, a candidate phylum symbiotically
associated with marine sponges. ISME J 5: 61-70.

Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, Clingenpeel S et al (2011).
Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS
One 6: e26161.

Zerbino DR, Birney E (2008). Velvet: algorithms for de novo short read assembly using
de Bruijn graphs. Genome Res 18: 821-829.