Microbial Genome Assembly
Pamela Ferretti
Laboratory of Computational Metagenomics
Centre for Integrative Biology
University of Trento
Italy
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
DNA packaging
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
Next Generation Sequencing
ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C
TCTTATTGTGACC
TAGGCTAGCTTAG
GCAATGCAGTAAC
TCCAGCTAGGTTC
Genome Assembly
OVERLAPPING SEQUENCE ALIGNMENT
1. GENOME SEQUENCING
2. PRELIMINARY ANALYSIS
3. ASSEMBLY
4. ADVANCED BIOINFORMATIC ANALYSIS
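The idea behind overlapping sequence alignment can be sketched in a few lines of Python. This is only an illustration, not the method of any specific assembler: the names `overlap_length` and `merge`, the `min_len` parameter and the two example reads are assumptions made for this sketch, and only exact (error-free) overlaps are considered.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    `min_len` is expected to be 1 or larger.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3) -> str:
    """Join two reads, writing the overlapping bases only once."""
    n = overlap_length(a, b, min_len)
    return a + b[n:]


# Hypothetical reads that share the six bases "GCTAGC".
read_1 = "ACGTAGGCTAGC"
read_2 = "GCTAGCGTTAGCGA"

print(overlap_length(read_1, read_2))  # 6
print(merge(read_1, read_2))           # ACGTAGGCTAGCGTTAGCGA
```

Real reads contain sequencing errors, so practical tools score approximate overlaps rather than exact string matches.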
On the feasibility of sequence assembly
Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right!
(…well, Weber and Myers were a bit more right from the practical viewpoint…)
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
Genome assembly strategies
• Greedy approach → SSAKE
• De Bruijn graph (DBG) → Velvet, SOAPdenovo
• Overlap Layout Consensus (OLC) → MIRA
• Mixed approaches → MaSuRCA
Genome assembly strategies
• DE BRUIJN GRAPH APPROACH (DBG)
Nodes = overlapping subsequences of reads, all of uniform length
Edges = k-mers (unique subsequences within reads)
→ Velvet, SOAPdenovo2
EULERIAN PATH
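The de Bruijn graph approach can be illustrated with a small Python sketch. It follows the common convention in which nodes are (k-1)-mers and every distinct k-mer is one edge from its prefix to its suffix, and it walks the graph with Hierholzer's algorithm. The function names, the toy reads and the choice k = 4 are assumptions for illustration; this is not the implementation used by Velvet or SOAPdenovo2, and it assumes error-free reads for which an Eulerian path exists.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the list of nodes it points to.

    Each distinct k-mer is stored once and contributes one edge.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to keep the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ACGTAG", "GTAGGC", "AGGCTA"]  # hypothetical error-free reads
graph = de_bruijn_graph(reads, k=4)
print(spell(eulerian_path(graph)))  # ACGTAGGCTA
```

Note that no pairwise read comparison takes place: reads are only cut into k-mers, which is why a single sequencing error introduces new k-mers and therefore spurious edges.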
Genome assembly strategies
• OVERLAP LAYOUT CONSENSUS (OLC)
Nodes = reads
Edges = overlaps between reads
→ MIRA
1. OVERLAP
2. LAYOUT
3. CONSENSUS
HAMILTONIAN PATH
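The three OLC stages can be sketched as follows. This is a toy version under strong assumptions: reads are error-free, overlaps are exact, and the layout is found by trying every read ordering, which is only workable for a handful of reads and hints at why the Hamiltonian path formulation is costly. All function names and reads are hypothetical and do not reflect how MIRA works internally.

```python
from itertools import permutations


def overlap_length(a, b, min_len=3):
    """Longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """1. OVERLAP: edge (a, b) -> overlap length for every ordered pair of reads."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_length(a, b, min_len)
                if n > 0:
                    edges[(a, b)] = n
    return edges


def layout(reads, edges):
    """2. LAYOUT: ordering that visits every read once with the largest total overlap.

    Returns None when no such ordering exists.
    """
    best_order, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(pair in edges for pair in pairs):
            score = sum(edges[pair] for pair in pairs)
            if score > best_score:
                best_order, best_score = order, score
    return best_order


def consensus(order, edges):
    """3. CONSENSUS: with error-free reads this reduces to merging the overlaps."""
    contig = order[0]
    for a, b in zip(order, order[1:]):
        contig += b[edges[(a, b)]:]
    return contig


reads = ["GGCTAGCGTT", "ACGTAGGCTA", "AGCGTTAGCGA"]  # hypothetical reads
edges = overlap_graph(reads)
order = layout(reads, edges)
print(order)
print(consensus(order, edges))  # ACGTAGGCTAGCGTTAGCGA
```

The nested loop in `overlap_graph` compares every read with every other read, which is the time-consuming overlap stage of this approach.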
Genome assembly strategies
ADVANTAGES
DBG
• Very sensitive to repeats
• K-mers stored just once
• Eulerian cycle
• Never explicitly computes pairwise overlaps
OLC
• Modular algorithmic design
• Flexibility and robustness
DISADVANTAGES
DBG
• Sensitive to sequencing errors (new k-mers)
• Large memory space requirements
OLC
• Hamiltonian cycle
• Overlap stage is time consuming
• Genome-size limitations
Genome Assemblers
Average Coverage
Number of Contigs
Number of Contigs > 1Kb
N50 contig size
Fraction of reads assembled
Total consensus (in nt)
Number of scaffolds
N50 scaffolds size
Ion Torrent PGM → MIRA 3.9
Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time
and becomes unstable with large numbers of short reads
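Several of the metrics listed above are simple to compute from contig lengths. The sketch below uses one common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, reaches half of the assembly) and approximates average coverage as sequenced bases divided by total consensus length. The function names and all numbers are made-up illustrations, not results from this comparison.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half of the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(contig_lengths, total_read_bases):
    """Summary metrics; coverage is approximated by bases per consensus base."""
    total = sum(contig_lengths)
    return {
        "number_of_contigs": len(contig_lengths),
        "number_of_contigs_gt_1kb": sum(1 for n in contig_lengths if n > 1000),
        "n50_contig_size": n50(contig_lengths),
        "total_consensus_nt": total,
        "average_coverage": total_read_bases / total if total else 0.0,
    }


# Hypothetical contig lengths (in nt) and sequenced bases.
contigs = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000, 900]
for name, value in assembly_stats(contigs, total_read_bases=29_090_000).items():
    print(name, value)
```

The same `n50` function applies to scaffold lengths to obtain the N50 scaffold size.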
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
Mycobacteria Assembly: Case Study
• Responsible for many animal and human diseases
M. tuberculosis and M. leprae (TM)
M. fortuitum (NTM) outbreak (nail salon, 2002)
M. chelonae (NTM) outbreak (face lifts, 2004)
• Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
• Twenty mycobacterial strains
• From 20 different Mycobacteria species
→ MaSuRCA
Novel mycobacteria detection clinical tests
Raw data quality assessment and pre-processing
Fastq-mcf tool
• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short
Reduction up to 73%
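The kind of filtering listed above can be sketched in plain Python. This is not how fastq-mcf is implemented and it does not use its options: the function names, the quality threshold `min_q`, the length threshold `min_len` and the example reads are all assumptions, and adapter clipping is left out to keep the sketch short.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim bases whose quality is below `min_q` from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def preprocess(reads, min_q=20, min_len=5):
    """Yield trimmed reads, dropping short reads, reads with Ns and duplicates."""
    seen = set()
    for seq, quals in reads:
        seq, quals = trim_low_quality_ends(seq, quals, min_q)
        if len(seq) < min_len or "N" in seq or seq in seen:
            continue
        seen.add(seq)
        yield seq, quals


# Hypothetical reads as (sequence, per-base quality scores).
raw_reads = [
    ("TTACGTAGGCTT", [5, 8, 30, 30, 30, 30, 30, 30, 30, 30, 9, 4]),  # poor ends
    ("ACGTAGGC", [30] * 8),     # duplicate of the first read after trimming
    ("ACGNAGGC", [30] * 8),     # contains an N
    ("ACGT", [30] * 4),         # too short
    ("GCTAGCGTTA", [30] * 10),  # kept as is
]

clean_reads = list(preprocess(raw_reads))
print(len(raw_reads), "reads in,", len(clean_reads), "reads kept")
```

Comparing the number of reads (or bases) before and after such a step gives the kind of reduction figure quoted on the slide.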
Assembly parameters setting
K-mers: strings of a particular length k, which are shorter than entire reads
Best empirical k-mer length: 91 bases
High coverage
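To make the k-mer definition concrete, the sketch below extracts and counts k-mers from reads. A read of length L contains L - k + 1 k-mers, so the choice of k directly controls how many k-mers each read contributes. The function names are hypothetical, and the 100-base read is an assumed length used only to show the effect of k = 91; the actual read length is not stated here.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in order of their position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """How often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("ACGTAGGC", 4))  # ['ACGT', 'CGTA', 'GTAG', 'TAGG', 'AGGC']

# A k-mer shared by two overlapping reads is counted twice.
counts = count_kmers(["ACGTAG", "GTAGGC"], 4)
print(counts["GTAG"])  # 2

# With k = 91, an assumed 100-base read yields only 100 - 91 + 1 = 10 k-mers.
assumed_read = "ACGT" * 25
print(len(kmers(assumed_read, 91)))  # 10
```

Large values of k leave few k-mers per read, which is one reason high coverage matters when a long k-mer is used.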
MaSuRCA results of Mycobacteria
Genome size too high
Abnormal GC content
GC-content-based quality analysis
Examples of environmental contamination
Staphylococcus epidermidis
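A GC-content check of this kind can be sketched as follows: compute the GC fraction of every contig and flag those that deviate from the value expected for the target organism. The function names, the expected GC fraction, the tolerance and the two toy contigs are illustrative assumptions, not values or code from this analysis.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def flag_contigs(contigs, expected_gc, tolerance=0.10):
    """Names of contigs whose GC fraction is further than `tolerance` from `expected_gc`."""
    return [
        name
        for name, seq in contigs.items()
        if abs(gc_content(seq) - expected_gc) > tolerance
    ]


# Hypothetical contigs: one GC-rich, one AT-rich.
contigs = {
    "contig_1": "GCGCGGCATA",  # GC fraction 0.7
    "contig_2": "ATATTAGCAT",  # GC fraction 0.2
}

# Assumed expected GC fraction for a GC-rich target genome.
print(flag_contigs(contigs, expected_gc=0.65))  # ['contig_2']
```

Contigs flagged this way are candidates for contamination and would still need to be confirmed, for example by comparing them against reference sequences.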
Thanks
http://gcat.davidson.edu/phast/#methods