bit25182-sm-0001-SuppData-S1

Genome sequencing procedure and analysis

Genomic DNA was extracted using the GeneJET Genomic DNA Purification Kit (Fermentas, Cat. No. #K0721). The samples were prepared for sequencing with the TruSeq DNA Sample Preparation Kit (Illumina, Cat. No. FC-121-2001) and sequenced multiplexed on a single lane of an Illumina HiSeq 2000 (2x101 bp), producing 38-58 million read pairs per sample. Because of the very high read depth (approximately 2000x coverage), the data were subjected to stringent filtering. Raw reads were processed with the FastqMcf tool to remove adapters and to clip poor-quality ends (quality score < 30). Any reads shorter than 90 bp after processing were discarded.

The processed reads were mapped to a reference genome (Escherichia coli str. K-12 substr. W3110, GenBank accession number AP009048.1) using MosaikAligner (v. 2.1.32) with a hash size of 12 bp. Indels were detected and realigned using the Genome Analysis Toolkit (GATK, v. 2.2.9) modules RealignerTargetCreator and IndelRealigner. PCR duplicates were eliminated using the Picard program MarkDuplicates. Mapping statistics for the libraries are shown in Table SI.

Single nucleotide variants and small indels, relative to the reference genome, were detected using GATK HaplotypeCaller. The list of variants was filtered to retain all variants detected as homozygous for an alternative allele in one or more of the four samples. All variants were visually inspected using the Savant Genome Browser (Fiume et al., 2010) in order to eliminate obvious false positives.

To discover possible larger structural variation (large deletions, insertions, duplications and inversions), several approaches were used. First, a mapping depth approach was applied as implemented in CNVnator 2.2.5 (Abyzov et al., 2011), which detects duplications and deletions by finding regions with deviant mapping depth.
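The idea behind a mapping depth approach can be illustrated with a minimal sketch. This is not CNVnator's algorithm (which relies on mean-shift segmentation with GC correction) and not the analysis code used here; the function name, the window size and the 0.5x/1.5x thresholds are hypothetical choices made only for illustration. The sketch averages per-base depth in fixed windows and flags windows whose depth deviates strongly from the median window depth.

```python
from statistics import median


def flag_deviant_windows(depths, window=100, low=0.5, high=1.5):
    """Flag fixed-size windows whose mean depth deviates from the median.

    depths : per-base mapping depth along the reference (list of numbers).
    window, low, high : illustrative assumptions, not values from the text.
    Returns a list of (start, end, ratio, call) tuples, end exclusive.
    """
    # Mean depth of each non-overlapping window.
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))

    # The median window depth serves as the expected single-copy depth.
    baseline = median(m for _, _, m in means)
    calls = []
    if baseline == 0:
        return calls

    for start, end, mean in means:
        ratio = mean / baseline
        if ratio <= low:
            calls.append((start, end, ratio, "deletion"))
        elif ratio >= high:
            calls.append((start, end, ratio, "duplication"))
    return calls


# Toy depth profile at roughly 2000x with one missing and one doubled window.
depths = [2000] * 300 + [0] * 100 + [2000] * 200 + [4000] * 100 + [2000] * 300
for start, end, ratio, call in flag_deviant_windows(depths):
    print(start, end, ratio, call)
```

At very high coverage, as in this data set, random depth fluctuation is small relative to the mean, so even simple thresholds separate the toy deletion and duplication cleanly; real data would additionally need correction for GC content and for repeats that attract ambiguous mappings.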
Secondly, SVseq2 1 (Zhang et al., 2012) was used, which detects the breakpoints of insertions and deletions using information from split reads. Finally, a de novo assembly approach was used, which should be able to detect any structural variation, including new genetic material in plasmids. For this, ABySS (Simpson et al., 2009) was used to create an individual de novo assembly for each sample (using a k-mer size of 54 bp), after which the approach and helper scripts of the SOAPsv (Li et al., 2011) pipeline were used to detect variants. Additionally, all contigs without BLAST hits to the reference genome were further investigated as possible inserts or plasmids. The complete list of genetic variation shared among AF1000, AF1000ara, PPA652 and PPA652ara relative to W3110 is given in Table SII.

References

Abyzov A, Urban AE, Snyder M, Gerstein M. 2011. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21:974–984.

Fiume M, Williams V, Brook A, Brudno M. 2010. Savant: genome browser for high-throughput sequencing data. Bioinformatics 26:1938–1944.

Li Y, Zheng H, Luo R, Wu H, Zhu H, Li R, Cao H, Wu B, Huang S, Shao H, Ma H, Zhang F, Feng S, Zhang W, Du H, Tian G, Li J, Zhang X, Li S, Bolund L, Kristiansen K, de Smith AJ, Blakemore AIF, Coin LJM, Yang H, Wang J, Wang J. 2011. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nat Biotechnol. 29:723–730.

Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res. 19:1117–1123.

Zhang J, Wang J, Wu Y. 2012. An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinformatics 13(Suppl 6):S6.
