Lee Katz - Compgenomics2014

advertisement
Computational assembly for
prokaryotic sequencing projects
Lee Katz, Ph.D.
Bioinformatician, Enteric Diseases
Laboratory Branch
January 15, 2014
Disclaimers
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease
Control and Prevention and should not be construed to represent any agency determination or policy.
The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily
represent the official position of CDC
Partners in Public Health
Graduated Oct 2010
CDC 2010 - present
Lee Katz, Present
• Currently in the National Enteric Reference
Laboratory
• Vibrio, Campylobacter, Escherichia, Shigella,
Yersinia, Salmonella
• Focusing on Listeria and Vibrio
One of my projects is #2 on
CDC’s list of accomplishments
for 2013!
#2
http://www.cdc.gov/features/endofyear/
Outline
• Sequencing
– 1st gen
– 2nd gen
– 3rd gen
• Reads
– Quality control (Q/C)
• Read metrics
– Read-cleaning
• Assembly
– Algorithms
– Assembly metrics
Prokaryotic Sequencing Projects
Stages
• Sequencing
• Assembly
• Feature prediction
• Functional annotation
• …analysis…
• Display (Genome Browser)
Sequencing
Assembly
Examples
• Haemophilus influenzae
• Neisseria meningitidis
• Bordetella bronchisceptica
• Vibrio cholerae
• Listeria monocytogenes
prediction
annotation
Display
Fleischman et al. (1995) “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd” Science 269:5223
Kislyuk et al. (2010) “A computational genomics pipeline for prokaryotic sequencing projects” Bioinformatics 26:15
Out with the old; in with the new:
Two new technologies to the
compgenomics class!
•
•
•
•
454
Illumina single end reads
Illumina paired end reads
PacBio
Sanger Sequencing (1st gen)
Sequencing: first generation
Sheer DNA
Margulies et al. (2005)
Genome sequencing in open
microfabricated high density
picoliter reactors. Nature
437:7057
Cloning into
bacterial
vectors
Amplification
Sanger
sequencing
Sanger sequencing output
• Usually .ab1/.scf file format
454 Sequencing (2nd Gen)
A
454 Pyrosequencing
+ PCR Reagents
+ Emulsion Oil
B
Mix DNA library
& capture beads
(limited dilution)
“Break micro-reactors”
Isolate DNA containing beads
Create
“Water-in-oil”
emulsion
Perform emulsion PCR
454 Pyrosequencing
Load enzyme beads
Load beads into
PicoTiter™Plate
PicoTiter™Plate
Diameter = 44 μm
44 μm
Depth = 55 μm
Well size = 75 pl
Well density = 480 wells mm-2
1.6 million wells per slide
454 Pyrosequencing
Sequencing by
synthesis
Photons generated
are captured by CCD
camera
Reagent flow
Margulies et al., 2005
454 sequencing output
Flowgram (.sff file format)
Flow Order
4-mer
3-mer
2-mer
1-mer
T
A
C
G
KEY (TCAG)
Measures the presence or
absence of each
nucleotide at any given
position
Illumina sequencing (2nd Gen)
The following animations are
courtesy of Illumina, Inc.
Region complementary to P5 grafting primer
Index 2
P5 primer
DNA insert
P7 primer
Index 1
P5 grafting primer
P7 grafting primer
Flow cell surface
The following animations are
courtesy of Illumina, Inc.
SBS Sequencing Primer Hybridization
The following animations are
courtesy of Illumina, Inc.
Sequence (Cycle 1)
Sequence (Cycle 1)
Index 1 Seq Primer Hybridization
Index 1 read – 8 cycles
Unblock
P5 grafting primer
7 dark cycles
P5 grafting primer
Index 2 index read
8 cycles
7 dark cycles
P5 grafting primer
Index 2 index read
8 cycles
7 dark cycles
P5 grafting primer
Linearization
Original strand
New strand
Illumina sequencing video
• http://www.youtube.com/watch?v=womKfik
WlxM
PacBio sequencing* (3rd Gen)
*Pacific Biosciences
http://www.youtube.com/watch?v=NHCJ8PtYCFc
SMRT Bell
Zero-mode waveguide (ZMW),
a very fancy and very small well
Thanks to PacBio for donating some
slide materials in this section
Eid et al Science,
January 2009/10.1126/science.1162986
http://www.youtube.com/watch?v=NHCJ8PtYCFc
Eid et al Science,
January 2009/10.1126/science.1162986
Eid et al Science,
January 2009/10.1126/science.1162986
PacBio video
http://www.youtube.com/watch?v=NHCJ8PtYCFc
Q/C + cleaning + metrics
READS
Q/C
• You need to know if your data are good!
• Example software
– FastQC
– Computational Genomics Pipeline (CG-Pipeline)
Quality Control
FastQC output
Quality Control bioinformatics
FastQC output
The CG-Pipeline way
run_assembly_readMetrics.pl
File
tmp.fastq
avgReadLength
80.00
totalBases
177777760
minReadLength
80
maxReadLength
80
avgQuality
35.39
Read cleaning
Read cleaning with CG-Pipeline
(not validated; please use with caution)
F. Read
R. Read
Read
%ACGT
Phred
http://sourceforge.net/projects/cg-pipeline/
Graphs made with FastqQC (AMOS)
1. Trimming low-qual ends
run_assembly_trimLowQualEnds.pl
F. Read
R. Read
Read
1A. %ACGT
1B. Phred
http://sourceforge.net/projects/cg-pipeline/
Graphs made with FastqQC (AMOS)
2a. Removing duplicate reads
2b. Sometimes: downsampling
run_assembly_removeDuplicateReads.pl
Trimmed reads
http://sourceforge.net/projects/cg-pipeline/
3. Trimming and filtering
run_assembly_trimClean.pl
Min length
Min avg. quality
Min length
Min avg. quality
3A. trimming
3B. filtering
http://sourceforge.net/projects/cg-pipeline/
More
• Software
–
–
–
–
Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
EA-utils https://code.google.com/p/ea-utils/
AMOS amos: SourceForge.net
… and more is out there!
• Evaluation
– Fabbro et al 2013, “An extensive evaluation of read
trimming effects on Illumina NGS data analysis”
Algorithms + metrics
ASSEMBLY
Whole genome sequencing: WGS
Large pieces and de novo assembly
“Business dog” http://www.buzzfeed.com/tiad/business-dog
52
Whole genome sequencing: WGS
Small pieces and reference assembly
“Business cat” http://www.quickmeme.com/Business-Cat/
53
Assembly
• Overlaps between reads
• Generate contigs
(contiguous sequences)
• Generate scaffolds
NNN
N
Derive consensus sequence
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomicskislyuk.pdf
Recap of assembly
reads
Paired end reads
contigs
Scaffold
NNNNNNNNNN
CG-Pipeline way for Illumina
run_assembly reads.fastq.gz –o
assembly.fasta
• No module yet in CGP for PacBio
unfortunately…
• Be on the look out for several papers that
compare Illumina assemblers.
PacBio Assembly
• The following slides are courtesy of PacBio
Finishing Genomes Using Only PacBio® Reads
Hierarchical Genome Assembly Process (HGAP)
• Utilizes all PacBio data from single, long-insert library
– Longest reads for continuity
– All reads for high consensus accuracy
• Now available through SMRT® Portal in SMRT Analysis v2.0.1
Chin et al (2013), “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing
data” Nature Methods. doi 10.1038/nmeth.2474
Hierarchical Genome Assembly Process (HGAP)
1. Start with long ‘seed’ reads
3. Build consensus
2. Align other reads
4. Construct accurate (>99%)
pre-assembled reads
HGAP Example - Meiothermus ruber
10 kb SMRTbell™ library
3 SMRT® Cells
(C2-C2 Chemistry, PacBio® RS)
250 Mb
Long seed reads (>5kb)
pre-assembly
Pre-assembled long reads
Celera Assembler
5 contigs
Polish, Quiver
1 contig
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
>5 kb
HGAP Example - Meiothermus ruber
10 kb SMRTbell™ library
3 SMRT® Cells
(C2-C2 Chemistry, PacBio® RS)
Long seed reads (>5 kb)
pre-assembly
Pre-assembled long reads
Celera Assembler
5 contigs
Polish, Quiver
1 contig
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
HGAP Example - Meiothermus ruber
10kb SMRTbell™ library
3 SMRT® Cells
(C2-C2 Chemistry, PacBio® RS)
Long seed reads (>5 kb)
Pre-assembly
Pre-assembled long reads
Celera Assembler
5 contigs
Polish, Quiver
1 contig
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
HGAP Example - Meiothermus ruber
10kb SMRTbell™ library
3 SMRT® Cells
(C2-C2 Chemistry, PacBio® RS)
Long seed reads (>5 kb)
Pre-assembly
Pre-assembled long reads
Celera® Assembler
Minimus2
5 contigs
Quiver
1 contig
1 contig
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
• Single-contig assembly
• 99.99965% concordance
with reference
• 99.3% genes predicted
Polish with Quiver for High Accuracy
Organism
Assembly size
(bases)
Differences
with Sanger
reference
Meiothermus ruber
3,098,781
11
M. ruber Sanger reference
PacBio® reads
Targeted Sanger validation
SNPs
Concordance
Nominal validated as Remaining
with Sanger
QV
correct
differences
reference
PacBio calls
QV
99.99965%
60
54.5
8
1(3)
Estimated Coverage Targets for Finishing Smaller Genomes
Assembly Approach /
Software Tool
Recommended
PacBio® Coverage
Additional Data Sets
Genome Size
Constraints
None
< 10 MB (SMRT Portal)
< 130 MB (Command
Hierarchical
SMRT® Analysis
implementation of HGAP
(uses
Celera® Assembler
75-100X PacBio CLR
7.0)
Celera® Assembler via
PacBiotoCA (recent
compilation) see Koren et al (2013)
Line)
75-100X PacBio CLR
None
20-50X PacBio CLR
50X short reads
ALLPATHS-LG
50X PacBio 3 kb CLR
- 50X Illumina® PE
- 50X Illumina® jumping
libraries
MIRA (with PacBiotoCA)
20-50X PacBio CLR
50X short reads
10X PacBio CLR
High-confidence contigs
Similar to above
http://arxiv.org/abs/1304.3752
Hybrid
Celera® Assembler 7.0 with
PacBiotoCA (SMRT® Analysis)
20 MB
Scaffolding
AHA (SMRT Analysis)
<200 MB;
<20,000 contigs
Selected publications with PacBio
• Application
– Katz et al 2013 Mbio “Evolutionary Dynamics of Vibrio
cholerae O1 following a Single-Source Introduction to Haiti”
– Chin et al 2011 NEJM “The Origin of the Haitian Cholera
Outbreak Strain”
– Rasko et al 2011 NEJM “Origins of the E. coli Strain Causing an
Outbreak of Hemolytic–Uremic Syndrome in Germany”
• Assemblers
– Chin et al 2013 Nature “Nonhybrid, finished microbial genome
assemblies from long-read SMRT sequencing data”
– Koren et al 2012 Nature Biotechnology “Hybrid error correction
and de novo assembly of single-molecule sequencing reads”
– Bashir et al 2012 Nature Biotechnology “A hybrid approach for
the automated finishing of bacterial genomes”
The epigenome
• PacBio can also detect epigenetic
modifications, especially the methylome
• Roles for DNA methylation and
methyltransferases are uncharacterized in
most bacteria
• Not your primary task, but it could be a very
interesting and novel result of the 2014
compgenomics class
Davis et al 2013, Current opinion in microbiology “Entering the era of bacterial epigenomics with single molecule real time
DNA sequencing”
ASSEMBLY DIFFICULTIES
One problem: randomly low coverage
(Lander-Waterman)
• Assuming random
distribution of reads and
ignoring repeat resolution
issues,
• G= genome length
• L = length of a single read
• N= number of reads
sequenced
• T= minimum overlap to
align the reads together
• Then overall coverage is
C = LN/G
• Coverage for any given
base obeys the Poisson
distribution:
• The number of
gaps(bases with 0
coverage) is:
A good Lander-Waterman reference: http://www.math.ucsd.edu/~gptesler/186/shotgun_08-handout.pdf
Lander and Waterman (1988). Genomic Mapping by Fingerprinting random clones: a mathematical analysis. Genomics 2 pp. 231-239
http://www.cmb.usc.edu/papers/msw_papers/msw-081.pdf
Major Problem: repeat elements
?
?
Major Problem: repeat elements
Major Problem: repeat elements
D (1)
A (1)
B (2)
C (2)
E (2)
G (1)
F (1)
Unipath graph of the 1.8-Mb genome of C. jejuni
Possible paths:
ABCDBCEFCEG
ABCEFCDBCEG
Butler J et al. (1998) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820
Comparisons of assemblers
• Second-generation
– Zhang et al 2011 “A practical comparison of de novo
genome assembly software tools for next-generation
sequencing technolgies”
– Genome Assembly Gold-standard Evaluations (GAGE) http://gage.cbcb.umd.edu/
– Lin et al 2011, “Comparative studies of de novo
assembly tools for next-generation sequencing
technologies”
– Don’t ignore many newer assemblers including SPAdes
• Third-generation
– Too new!
Zhang et al 2011
A quick note on reference assembly
• It is possible to map reads to a reference
genome using short-read mappers, and then
deriving a consensus sequence
• Some favorites
– Smalt
– BWA
– Bowtie
• Must know how to use samtools for this route
Reference assembly notes
• To my knowledge no paper exists that
compares reference assemblers (could be
wrong)
• Assembly could be biased
– Miss genomic islands
– Ends of contigs might be tapered
• Good practice for reference assembly:
perform de novo assembly on unused reads
just in case you missed something
Assembly Metrics
How do you tell if your assembly is good?
Metric
description
Assembly Length
The size of the
concatenated assembly
Number of Contigs
The count of contigs
N50
The size of the contig at
where half the genome is
located in size >N50 and
half is located in size <N50
reference
Longest Contig
Average contig length
Kmer21
Frequency of kmers with
k=21
GC-content
Percentage of the genome
that is either G or C
Assembly score
Log(N50/numberOfContigs)
http://www.homolog.us/blogs/blog/2012/06/26/
what-is-wrong-with-n50-how-can-we-make-itbetter-part-ii/
CG-Pipeline/Lee Katz
Assembly evaluation
QUAST
• http://bioinf.spbau.ru/quast
The CG-Pipeline way
$ run_assembly_metrics.pl assembly.fasta|column –t
File
assembly.fasta
genomeLength
2992976
N50
2992976
numContigs
1
assemblyScore
14.9117787680924
AMOS
• amosvalidate
• http://amos.sourceforge.net
– (build this from source to avoid bugs)
Post-assembly manual methods
• View the pileups and see if you agree with base
calls
–
–
–
–
Hawkeye (AMOS)
IGV viewer
Tablet viewer
Samtools tview (command line interface)
• Compare to other genomes; sort contigs
– MAUVE and MAUVE contig mover
– Mummer (mummerplot)
– Abacas
Post-assembly manual methods
• Close gaps: Scaffold
with Illumina
–
–
–
–
–
–
• Scaffold with PacBio
– AHA (assembler)
GRASS
SSPACE
SOPRA
Babmus2 (AMOS)
IMAGE
Many others
NNN
N
Acknowledgements
• Every single compgenomics class
• For help with slides: Eleni Paxinos, Andrew
Huang, Amber Schmitke, Maryann Turnsek
• For letting me off work, my supervisor Cheryl
Tarr
• Many, many others who I work with on a daily
basis
Questions?
Lee Katz
lkatz@cdc.gov
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be
construed to represent any agency determination or policy.
The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC
86
Download