Microbial Genome Assembly
Pamela Ferretti
Laboratory of Computational Metagenomics
Centre for Integrative Biology
University of Trento, Italy

Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY

DNA packaging

Next Generation Sequencing
[Figure: a genome sequence (ACGTAGGCTAGCGTTAGCGA ........ CTGCAT) shown alongside short reads: TCTTATTGTGACC, TAGGCTAGCTTAG, GCAATGCAGTAAC, TCCAGCTAGGTTC]

Genome Assembly
Overlapping sequence alignment
1. Genome sequencing
2. Preliminary analysis
3. Assembly
4. Advanced bioinformatic analysis

On the feasibility of sequence assembly
Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)

Genome assembly strategies
• Greedy approach → SSAKE
• De Bruijn graph (DBG) → Velvet, SOAPdenovo
• Overlap Layout Consensus (OLC) → MIRA
• Mixed approaches → MaSuRCA

Genome assembly strategies: De Bruijn graph approach (DBG)
Nodes = overlapping sequences of uniform length taken from the reads ((k-1)-mers)
Edges = k-mers (unique subsequences of length k within reads)
Assemblers: Velvet, SOAPdenovo2
The genome is reconstructed as an Eulerian path

Genome assembly strategies: Overlap Layout Consensus (OLC)
Nodes = reads
Edges = overlaps between reads
Assembler: MIRA
1. Overlap
2. Layout
3. Consensus
The genome is reconstructed as a Hamiltonian path

Genome assembly strategies: DBG vs OLC
ADVANTAGES
DBG
• Very sensitive to repeats
• Each k-mer is stored only once
• Eulerian cycle
• Never explicitly computes pairwise overlaps
OLC
• Modular algorithmic design
• Flexibility and robustness
DISADVANTAGES
DBG
• Sensitive to sequencing errors (new k-mers)
• Large computational memory space requirements
OLC
• Hamiltonian cycle
• Overlap stage is time consuming
• Genome-size limitations

Genome Assemblers
Metrics compared:
• Average coverage
• Number of contigs
• Number of contigs > 1 kb
• N50 contig size
• Fraction of reads assembled
• Total consensus (in nt)
• Number of scaffolds
• N50 scaffold size
Ion Torrent PGM → MIRA 3.9
Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large amounts of small reads.

Mycobacteria Assembly: Case Study
• Responsible for many animal and human diseases
• M. tuberculosis and M. leprae (TM)
• M. fortuitum (NTM) outbreak (nail salon, 2002)
• M. chelonae (NTM) outbreak (face lifts, 2004)
Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
Twenty mycobacterial strains from 20 different Mycobacteria species → MaSuRCA
Novel clinical tests for mycobacteria detection

Raw data quality assessment and pre-processing
The Fastq-mcf tool removes:
• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short
Reduction up to 73%

Assembly parameters setting
K-mers: strings of a particular length k, which are shorter than entire reads
Best empirical k-mer length: 91 bases
High coverage

MaSuRCA results of Mycobacteria
• Genome size too high
• Abnormal GC content

GC content based quality analysis
Examples of environmental contaminations: Staphylococcus epidermidis

Thanks
http://gcat.davidson.edu/phast/#methods
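
Backup: De Bruijn graph sketch
The DBG slide describes edges as k-mers, nodes as the overlapping sequences they connect, and the genome as an Eulerian path. The toy Python sketch below illustrates that idea only; it is not how Velvet or SOAPdenovo2 are implemented. The function names, the example reads and k=4 are all assumptions made for illustration, and the sketch assumes error-free reads for which an Eulerian path exists.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph: each distinct k-mer is one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    kmers = set()  # each k-mer is stored only once
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes an Eulerian path or cycle exists."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = min(graph)
    for node in sorted(graph):
        if len(graph[node]) - in_degree[node] == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical, error-free toy reads that overlap by four bases.
toy_reads = ["ATGGCGT", "GCGTGCA"]
assembled = assemble(toy_reads, k=4)
```

With real data, sequencing errors create spurious k-mers and repeats create branching nodes, which is why production assemblers add error correction and graph simplification on top of this basic structure.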
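
Backup: contig statistics sketch
The Genome Assemblers slide compares assemblies by number of contigs, contigs above 1 kb, total consensus length and N50. A minimal Python sketch of those calculations follows, using the common definition of N50 (the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total). The function names and the example lengths are hypothetical and are not results from the case study.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_summary(lengths, min_length=1000):
    """Collect a few of the metrics listed on the slide."""
    return {
        "contigs": len(lengths),
        "contigs_over_min_length": sum(1 for n in lengths if n > min_length),
        "total_consensus_nt": sum(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths in nucleotides.
example_lengths = [5000, 3000, 1500, 800, 200]
summary = assembly_summary(example_lengths)
```

The same function applies to scaffold lengths to obtain the N50 scaffold size.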
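
Backup: read trimming sketch
The pre-processing slide lists the removal of poor quality read ends and of reads that are too short. The Python sketch below shows the general idea of end trimming by Phred quality followed by a length filter. It is not the Fastq-mcf algorithm, it does not handle adapters, Ns or duplicates, and the thresholds (min_qual=20, min_len=30) are assumptions chosen only for illustration.

```python
def trim_read(seq, quals, min_qual=20, min_len=30):
    """Trim low-quality bases from both ends of a read.

    seq   -- the read sequence
    quals -- per-base Phred scores (integers), same length as seq
    Returns (trimmed_seq, trimmed_quals), or None if the read is
    shorter than min_len after trimming.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    start = 0
    while start < end and quals[start] < min_qual:
        start += 1
    if end - start < min_len:
        return None  # read is too short to keep
    return seq[start:end], quals[start:end]


# Hypothetical read with poor quality at both ends.
example = trim_read("ACGTACGT", [2, 30, 30, 30, 30, 30, 5, 3], min_qual=20, min_len=4)
```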
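
Backup: GC content check sketch
The last slides use abnormal GC content to spot contamination in the assemblies. A minimal Python sketch of that kind of check follows: it computes the GC fraction of each contig and flags contigs below a cutoff. The cutoff of 0.55 and the contig names are assumptions for illustration, based only on the general observation that mycobacterial genomes are GC-rich while staphylococcal genomes are GC-poor; a real analysis would choose the cutoff from the observed GC distribution.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return 0.0  # only ambiguous bases, or an empty sequence
    return gc / (gc + at)


def flag_low_gc_contigs(contigs, min_gc=0.55):
    """Return the names of contigs whose GC fraction is below min_gc.

    contigs -- dict mapping contig name to sequence
    min_gc  -- hypothetical cutoff, not taken from the case study
    """
    return [name for name, seq in contigs.items() if gc_content(seq) < min_gc]


# Hypothetical toy contigs.
toy_contigs = {"contig_1": "GCGCGCGCAT", "contig_2": "ATATATGCAT"}
suspects = flag_low_gc_contigs(toy_contigs)
```

Flagged contigs are candidates for closer inspection, for example by aligning them against reference genomes of likely contaminants.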