Text S1. Methodological details

Genome sequencing

A combination of Illumina and 454 shotgun sequencing was performed on the single-cell re-MDA products for Candidatus Poribacteria sp. WGA-4E and 4G. For Illumina sequencing, 0.3 kbp shotgun libraries were constructed for each single amplified genome (SAG). Briefly, 3 µg of MDA product was sheared in 100 µl using the Covaris E210 with settings of 10% duty cycle, intensity 5, and 200 cycles per burst for 3 min per sample, and the fragmented DNA was purified using QIAquick columns (Qiagen) according to the manufacturer's instructions. The sheared DNA was end-repaired and A-tailed according to the Illumina standard PE protocol and purified using the MinElute PCR Purification Kit (Qiagen) with a final elution in 12 µl of Buffer EB. After quantification using a Bioanalyzer DNA 1000 chip (Agilent), the fragments were ligated to the Illumina adaptors according to the Illumina standard PE protocol, followed by purification of the ligation product using AMPure SPRI beads. The Illumina libraries were quantified using a Bioanalyzer DNA High Sensitivity chip (Agilent), and 300 ng of DNA (in 6 µl) then underwent normalization using the Duplex-Specific Nuclease (DSN) Kit (Axxora) (Bogdanova et al 2009). For normalization, the dsDNA was denatured for 3 min at 98°C, followed by a hybridization step at 68°C for 5 h and DSN treatment at 68°C for 20 min. The normalized libraries were amplified by PCR for 12 cycles, gel-purified, quality-checked on a Bioanalyzer DNA High Sensitivity chip (Agilent), and then sequenced using an Illumina GAIIx sequencer (run mode 2x76 bp). For 454 pyrosequencing, a 4 kbp paired-end library was constructed and sequenced for each SAG. All general aspects of and detailed protocols for library construction and sequencing can be found at the JGI website (http://www.jgi.doe.gov/).
Sequencing yielded the following raw data sets: SAG 4E, 6.8 Gbp of Illumina sequence and 74.4 Mbp of 454 sequence (276,672 reads); SAG 4G, 5.8 Gbp of Illumina sequence and 97.1 Mbp of 454 sequence (335,757 reads).

For SAG 4C, sequencing was conducted at LGC Genomics GmbH, Berlin, Germany, also using a hybrid approach of Illumina and 454 pyrosequencing. A 3 kbp paired-end library and a standard shotgun library were constructed and sequenced using 454 FLX Titanium technology. For Illumina sequencing, a standard shotgun library (1x100 bp) was constructed and sequenced using the Illumina HiSeq 2000 platform. This resulted in 2.3 Gbp of Illumina sequence and 153.6 Mbp of 454 sequence (481,505 reads).

The draft genomes of SAGs 3G and 4CII were generated at the JGI using Illumina technology. An Illumina standard shotgun library was constructed and sequenced using the Illumina HiSeq 2000 platform. Sequencing yielded raw data sets of 1.4 Gbp of Illumina sequence for SAG 3G and 0.8 Gbp of Illumina sequence for SAG 4CII. General aspects of library construction and sequencing performed at the JGI can be found at http://www.jgi.doe.gov.

Genome assembly

All raw Illumina sequence data were passed through DUK, a filtering program developed at JGI that removes known Illumina sequencing and library preparation artifacts (http://duk.sourceforge.net/), using the parameters -k 22 -s 1 -c 1. Specifically, all reads containing sequencing adapters, low-complexity reads, and reads containing short tandem repeats were removed. Artifact-filtered sequence data were then screened and trimmed according to the k-mers present in the dataset using kmernorm (http://sourceforge.net/projects/kmernorm/). High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies by orders of magnitude between different regions of the genome.
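To make the notion of k-mer depth concrete, the following minimal Python sketch counts k-mers across a set of reads and reports the mean k-mer depth of each read. It is an illustration only and not the kmernorm implementation: the function names, the toy reads, and the small value of k are assumptions chosen for readability, and only the forward strand is considered.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only; a simplifying assumption)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def mean_kmer_depth(read, counts, k):
    """Average count of the k-mers in one read; 0.0 if the read is shorter than k."""
    depths = [counts[km] for km in kmers(read, k)]
    return sum(depths) / len(depths) if depths else 0.0

# Toy data (hypothetical): one sequence is over-represented,
# as might happen after uneven amplification.
reads = ["ACGTACGTAC"] * 5 + ["TTGCAAGGCT"]
k = 4
counts = count_kmers(reads, k)
depths = [mean_kmer_depth(r, counts, k) for r in reads]
```

In this toy set the over-represented read has a mean k-mer depth of about 9.3, whereas the read that occurs once has a depth of 1.0, which is the kind of spread that depth-based filtering is meant to reduce.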
For SAGs 3G and 4CII, reads with high k-mer coverage (>30X average k-mer depth, k=31) were normalized to an average depth of 30X, and reads with an average k-mer depth of less than 2X were removed. For SAGs 4C, 4E, and 4G, we removed reads representing high-abundance k-mers (>32X k-mer coverage, k=31) and trimmed reads that contained unique k-mers. After filtering, 1.7M reads for 3G, 0.2M reads for 4CII, 5.1M for 4C, 3M for 4E, and 1.3M for 4G remained.

For SAGs 4E, 4G, and 4C, assemblies were performed in the following steps: (1) Filtered Illumina reads were assembled using Velvet version 1.1.02 (Zerbino and Birney 2008). The VelvetOptimiser script (version 2.1.7) was used with default optimization functions (n50 for k-mer choice, total number of base pairs in large contigs for cov_cutoff optimization). (2) The Velvet contigs were used to simulate reads from long-insert libraries, which were used together with the filtered reads as input for Allpaths-LG (Gnerre et al 2011) assembly. (3) Allpaths contigs larger than 1 kb were shredded into 1 kb pieces with 200 bp overlaps. (4) The Allpaths shreds and raw 454 pyrosequence reads were assembled using the 454 Newbler assembler version 2.5 (Roche/454 Life Sciences, Branford, CT, USA).

The following steps were performed for assembly of 3G and 4CII: (1) Normalized Illumina reads were assembled using Velvet version 1.1.04 (Zerbino and Birney 2008). (2) 1-3 kbp simulated paired-end reads were created from Velvet contigs using wgsim (https://github.com/lh3/wgsim). (3) Normalized Illumina reads were assembled together with the simulated read pairs using Allpaths-LG (version r39750) (Gnerre et al 2011). Parameters for assembly steps were: (1) Velvet (velveth: 71 -shortPaired; velvetg: -very_clean yes -exportFiltered yes -min_contig_lgth 500 -scaffolding no -cov_cutoff 10); (2) wgsim (-e 0 -1 100 -2 100 -r 0 -R 0 -X 0); (3) Allpaths-LG (PrepareAllpathsInputs: PHRED_64=1 PLOIDY=1 FRAG_COVERAGE=125 JUMP_COVERAGE=25 LONG_JUMP_COV=50; RunAllpathsLG: THREADS=8 RUN=std_shredpairs TARGETS=standard VAPI_WARN_ONLY=True OVERWRITE=True).

These approaches resulted in the following draft assemblies: SAG 3G, total assembly size of 5,627,474 bp (304 contigs); SAG 4CII, total assembly size of 596,887 bp (64 contigs); SAG 4C, total assembly size of 1,713,200 bp (302 contigs); SAG 4E, total assembly size of 3,679,266 bp (540 contigs); and SAG 4G, total assembly size of 1,443,813 bp (296 contigs).

Genome annotation and SAG whole genome sequencing quality control

The five poribacterial SAG sequence assemblies were complemented by an additional poribacterial SAG, Candidatus Poribacteria WGA A3 (hereafter 3A), which was previously sequenced and analyzed by Siegl et al (2011). All following steps were conducted with the five newly sequenced SAGs and the assembly of SAG 3A, which can be accessed under GenBank accession number ADFK00000000.

Genes were identified using Prodigal (Hyatt et al 2010). The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database (nr), UniProt, TIGRFam, Pfam, KEGG, COG, and InterPro databases. The tRNAScan-SE tool (Hacker and Kaper 2000) was used to find tRNA genes, whereas ribosomal RNA genes were found by searches against models of the ribosomal RNA genes built from SILVA (Pruesse et al 2007). Other non-coding RNAs, such as the RNA components of the protein secretion complex and RNase P, were identified by searching genomes for the corresponding Rfam profiles using INFERNAL (Makarova et al 1999).
Additional gene prediction analysis and manual functional annotation were performed within the Integrated Microbial Genomes (IMG) platform (Markowitz et al 2008), particularly IMG/mer, developed by the Joint Genome Institute, Walnut Creek, CA, USA (http://img.jgi.doe.gov).

All genome sequences were quality checked automatically by mapping against known contaminants, as well as manually using several tools in the IMG/mer system, such as tetranucleotide frequency analysis, phylogenetic distribution of genes, and GC content distribution. We generally followed a conservative approach and removed all contigs that appeared as contamination in any one of the screenings. A detailed description of the contamination screening process in IMG/mer can be found at http://img.jgi.doe.gov/mer/doc/SingleCellDataDecontamination.pdf. An additional quality screen independent of the IMG system was conducted by phylogenetic assignment of all genes using blastx against the NCBI nonredundant database and MEGAN (Huson et al 2007). This additional approach enabled us to detect contaminating sequences from sources not included in the IMG system at the time, such as mitochondrial DNA from the sponge host.

Contamination originated largely from previously identified contaminants of the WGA reaction kit (Blainey and Quake 2011, Woyke et al 2011) and mitochondrial DNA of the sponge host (Table S1). However, in SAGs 3A and 4G we detected contamination from additional sources, and the amount of non-poribacterial DNA in these two datasets was larger than in the other SAGs. Thus, we excluded all reads that lacked genes with significant homologies (≥60% identity) to any of the other cleaned poribacterial SAG genes.
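The homology-based exclusion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used: it assumes search results in a tab-separated table whose first three columns are query gene ID, subject gene ID, and percent identity (the column order of default BLAST tabular output), and a hypothetical mapping from each gene to the sequence it lies on. All identifiers and data below are invented for illustration.

```python
import csv
import io

MIN_IDENTITY = 60.0  # percent identity threshold stated in the text

def sequences_with_support(hits_tsv, gene_to_seq, min_identity=MIN_IDENTITY):
    """Return IDs of sequences carrying at least one gene with a hit >= min_identity.

    hits_tsv:    file-like object with tab-separated rows; the first three columns
                 are query gene ID, subject gene ID, percent identity (assumed layout).
    gene_to_seq: dict mapping each query gene ID to the sequence it lies on
                 (hypothetical helper structure).
    """
    supported = set()
    for row in csv.reader(hits_tsv, delimiter="\t"):
        if len(row) < 3:
            continue  # skip malformed or empty rows
        gene, pident = row[0], float(row[2])
        if pident >= min_identity and gene in gene_to_seq:
            supported.add(gene_to_seq[gene])
    return supported

# Hypothetical example data: hits against genes of the other cleaned SAGs.
hits = io.StringIO(
    "g1\tref_a\t82.5\n"
    "g2\tref_b\t41.0\n"
    "g3\tref_c\t60.0\n"
)
gene_to_seq = {"g1": "seq1", "g2": "seq2", "g3": "seq3", "g4": "seq4"}

kept = sequences_with_support(hits, gene_to_seq)
excluded = set(gene_to_seq.values()) - kept
```

Here seq1 and seq3 are kept because each carries a gene with a hit at or above 60% identity, while seq2 (best hit below the threshold) and seq4 (no hit at all) are excluded.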
Since a larger proportion of the previously published dataset 3A was contaminated (Siegl et al 2011), we updated the original genome sequence and deposited the updated version at DDBJ/EMBL/GenBank under the accession number ADFK00000000. The version described in this paper is version ADFK02000000. After contamination removal, the final assembly sizes were 0.41 Mbp, 5.44 Mbp, 1.63 Mbp, 0.54 Mbp, 3.65 Mbp, and 0.19 Mbp for SAGs 3A, 3G, 4C, 4CII, 4E, and 4G, respectively.

Gene prediction of all cleaned SAG annotations was evaluated and corrected (if necessary) using the GenePRIMP software (Pati et al 2010). Updated versions were resubmitted to IMG/mer, replacing the previous submissions for functional analysis. Unless stated otherwise, all functional analyses were conducted with tools in the IMG/mer software system.

Blainey PC, Quake SR (2011). Digital MDA for enumeration of total nucleic acid contamination. Nucleic Acids Res 39: e19.

Bogdanova E, Shagina I, Mudrik E, Ivanov I, Amon P, Vagner L et al (2009). DSN Depletion is a simple method to remove selected transcripts from cDNA populations. Mol Biotechnol 41: 247-253.

Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ et al (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 108: 1513-1518.

Hacker J, Kaper JB (2000). Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol 54: 641-679.

Huson D, Auch A, Qi J, Schuster S (2007). MEGAN analysis of metagenomic data. Genome Res 17: 377-386.

Hyatt D, Chen G-L, LoCascio P, Land M, Larimer F, Hauser L (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119.

Makarova KS, Aravind L, Galperin MY, Grishin NV, Tatusov RL, Wolf YI et al (1999). Comparative genomics of the Archaea (Euryarchaeota): Evolution of conserved protein families, the stable core, and the variable shell. Genome Res 9: 608-628.

Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D et al (2008). IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 36: D534-538.

Pati A, Ivanova NN, Mikhailova N, Ovchinnikova G, Hooper SD, Lykidis A et al (2010). GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat Methods 7: 455-457.

Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J et al (2007). SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.

Siegl A, Kamke J, Hochmuth T, Piel J, Richter M, Liang C et al (2011). Single-cell genomics reveals the lifestyle of Poribacteria, a candidate phylum symbiotically associated with marine sponges. ISME J 5: 61-70.

Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, Clingenpeel S et al (2011). Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS One 6: e26161.

Zerbino DR, Birney E (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821-829.