DNA Assembly

advertisement
Institute of Biomedical Sciences
University of São Paulo
DNA Assembly and Mapping
Arthur Gruber
Coccilab – ICB/USP
Next-generation sequencing platforms
•
•
Mid 2000’s: next-generation sequencers (NGS) were developed
•
2004 – 454 (Roche, formerly 454 Life Sciences)
•
2006 – Illumina (formerly Solexa)
•
2008 – SOLiD (Life Technologies, formerly Applied
Biosystems)
•
2011 –Ion Torrent /Proton (Life Technologies)
•
2011 – PacBio RS (Pacific Biosciences)
Massively parallel sequencing - tipo shotgun (random fragments)
Generate millions of sequences in one single run at a low cost per
base
•
Data generation x cost
Cost per MB of sequence
Source: Sboner et al. (2011) - Genome Biol. 12 (8): 125
Moore Law
Evolution of sequencing costs
An estimate of the evolution of sequencing costs over the last 10 years. Costs are given for
sequencing a megabase using a logarithmic scale. This curve is adapted from [15]. Time of
introduction of new technologies is indicated.
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
DNA Assembly
Coccilab – ICB/USP
NGS – Lower cost and greater data generation
Source: Sboner et al. (2011) - Genome Biol. 12 (8): 125
Next-generation sequencing platforms
Source: Glen (2011). Mol Ecol Resources 11: 759–769
Next-generation sequencing platforms
Source: Glen (2011). Mol Ecol Resources 11: 759–769
Next-generation sequencing platforms
Source: Glen (2011). Mol Ecol Resources 11: 759–769
Different types of sequencing methods
A flow chart of the different types of
sequencing methods
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
Coccilab – ICB/USP
454 Workflow
Source:
Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402
Illumina Workflow
Source: Mardis. (2008). Annu. Rev.
Genomics Hum. Genet. 9: 387–402
SOLiD Workflow
Source:
Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402
NGS platforms – applications
Source: Homer et al. (2009). Brief Bioinformatics II (2): 181-197.
DNA Assembly
Coccilab – ICB/USP
NGS platforms – applications
Tool
Website
Category
Platform
Source: Homer et al. (2009). Brief Bioinformatics II (2): 181-197.
DNA Assembly
Coccilab – ICB/USP
Sequence assembly
• Current sequencing platform can only generate sequence reads of dozens of bp (so
called short reads) or some hundreds of reads (Sanger, 454, Ion Torrent, PacBio)
1
• Computational
tools
are
necessary
to
assemble
the
sequence reads into a larger
sequence segment/genome
• Sequence
assemblers use two
different approaches to assemble
reads:
• Overlap layout consensus
• de Bruijn graphs
2
Schatz et al. (2010) - Assembly of large genomes
using second-generation sequencing
K-mer graph
A pair-wise overlap represented by a K-mer graph.
(a) Two reads have an error-free
overlap of 4 bases.
(b) One K-mer graph, with K=4, represents both
reads. The pair-wise alignment is a by-product of
the graph construction.
(c) The simple path through the graph implies a
contig whose consensus sequence is easily
reconstructed from the path.
Source: Miller et al. (2010). Genomics 95: 315-327
DNA Assembly
Coccilab – ICB/USP
Complexity in K-mer graphs
Complexity in K-mer graphs can be diagnosed with
read multiplicity information. In these graphs, edges
represented in more reads are drawn with thicker
arrows.
(a) An errant base call toward the end of a read
causes a “spur” or short dead-end branch. The
same pattern could be induced by coincidence of
zero coverage after polymorphism near a repeat.
(b) An errant base call near a read middle causes a
“bubble” or alternate path. Polymorphisms
between donor chromosomes would be expected to
induce a bubble with parity of read multiplicity on
the divergent paths.
(c) Repeat sequences lead to the “frayed rope”
pattern of convergent and divergent paths.
Source: Miller et al. (2010). Genomics 95: 315-327
DNA Assembly
Coccilab – ICB/USP
de Bruijn Graphs
Advantages:
Can deal with large amounts of data, consolidates redundant reads (high coverage) in
a very efficient way
• Sequencing errors are promptly identified from the topology of the graph and k-mer
coverage
•
de BRUIJN Graph
Erro
Edge formation in the graph
Evaluating assemblies
Size of Largest Contig
• Number of contigs > n length
• N50
Given a set of sequences of varying lengths, the N50
length is defined as the length N for which half of all
bases in the sequences are in a sequence of length L
< N. In other words, N50 is the contig length such that
using equal or longer contigs produces half the bases
of the genome. Therefore, the number of bases from
of all sequences shorter than the N50 will equal the
number of bases from all sequences longer than the
N50.
•
Evaluating assemblies
•
N50
Contig or scaffold N50 is a weighted median statistic
such that 50% of the entire assembly is contained in
contigs or scaffolds equal to or larger than this value
Some definitions
Contig
A sequence contig is a contiguous, overlapping
sequence read resulting from the reassembly of the
small DNA fragments generated by sequencing
strategies
• Scaffold
Using paired-end sequencing technology, the distance
between both sequence ends of a fragment is known.
This gives additional information about the orientation
of contigs constructed from these reads and allows for
their assembly into scaffolds.
•
Libraries for NGS platforms
Paired-end technology
A) Schematic drawing of the paired-end
technology. Adaptors and genome
fragments are represented respectively
by the black and grey lines.
B) B) Strategy for sequencing large DNA
fragments: short reads are assembled
into contigs. A high coverage is
required. In the next steps, paired-ends
derived from larger fragments are used
to assemble contigs into scaffolds.
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
DNA Assembly
Coccilab – ICB/USP
Contigs and scaffolds
An example of a real file 454 data
........................................................................................................................................................................................................ Analysis
........................................................................................................................................................................................................ Analysis
2. Results of Data processing
2.1. Raw data
b. Mate Paired Library (MN7_MP-3)
a. General Library (MN7_RL)
Read count
Total bases
Average read length
Read count
Total bases
Average read length
325,436
146,870,257
451.304
347,988
150,377,821
432.136
Read length distribution
Read length distribution
........................................................................................................................................................................................................ Analysis
An example of real file 454 data
3. Results of Analysis
3.1. Results of assembly
3.1.1. Read status
Number of reads
Number of bases
Assembled
Partial
Singleton
Repeat
Outlier Too short
895,203
283,678,189
878,324
7,067
4,374
4,283
1,155
0
- Number of reads : the read used in the assembly computation.
- Number of bases : the read’s bases used in the assembly computation.
- Assembled : the read is fully incorporated into the assembly.
- Partial : only part of the read was included in the assembly.
- Singleton : the read did not overlap with any other reads in the input.
- Repeat : the read deemed to be from repeat regions.
- Outlier : the read was identified by the GS De Novo Assembler as problematic.
- Too short : the read was too short to be used in the computation.
3.1.2. Paired read status
Both mapped
One unmapped
Multiply mapped
217,008
1,586
3,936
Both unmapped Distance Avg
- Both mapped : both halves of the pair were aligned.
170
2662.5
Distance Dev
736.9
An example of real file 454 data
........................................................................................................................................................................................................ Analysis
3.1.3. Scaffolds
Number of scaffolds
Number of bases
Avg. size
N50 size
11
5,308,521
482,592
1,552,834
- Number of scaffolds : the number of scaffolds identified.
- Number of bases : the total number of bases in the scaffolds.
- Avg. size : the average scaffold size.
- N50 size : the N50 scaffold size.
- Largest size : the size of the largest scaffold.
Largest size
1,903,376
3.1.4. Scaffold contigs
Number of contigs
Number of bases
Avg. size
N50 size
50
5,281,008
105,620
211,376
- Number of contigs : the number of contigs identified in scaffold.
- Number of bases : the total number of bases in the scaffold contigs.
- Avg.size : the average scaffold contig size.
- N50 size : the N50 scaffold contig size.
- Largest size : the size of the largest scaffold contig.
Largest size
450,948
3.1.5. Large contigs (Length >= 500bp)
Num of contigs
Num of bases
Avg.size
N50 size
Largest size
Q40Plus bases
%Q40
58
5,288,826
91,186
211,376
450,948
5,288,020
99.98%
- Num of contigs : the number of large contigs identified.
- Num of bases : the total number of bases in the large contigs.
- Avg. size : the average contig size.
- N50 size : An N50 means that half of all bases reside in contigs of this size or longer.
- Largest size : the size of the largest contig.
- Q40Plus bases : the number of bases called that have a quality score of 40 or above.
- %Q40 : the percentage of bases called that have a quality score of 40 or above.
3.1.6. All contigs (Length >= 100bp)
Number of contigs
Number of bases
106
5,299,016
- Number of contigs : the number of all contigs identified.
- Number of bases : the total number of bases in the all contigs.
NGS platforms – performances and features
Source: Homer et al. (2009). Brief Bioinformatics II (2): 181-197.
DNA Assembly
Coccilab – ICB/USP
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
DNA Assembly
Coccilab – ICB/USP
Comparison of De Novo Genome Assemblers
Accuracy and integrity for
36-mer datasets assembly.
The quality of consequential
contigs is shown with:
(A) the accuracy of
assembled contigs
(B) the genome coverage of
the assembled contigs.
No data is shown when
the RAM is insufficient
or the assembly tool is
not suitable for the
dataset.
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
DNA Assembly
Coccilab – ICB/USP
Comparison of De Novo Genome Assemblers
Accuracy and integrity for
75-mer datasets assembly.
The quality of consequential
contigs is shown with:
(A) the accuracy of
assembled contigs
(B) the genome coverage of
the assembled contigs.
No data is shown when
the RAM is insufficient
or the assembly tool is
not suitable for the
dataset.
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
DNA Assembly
Coccilab – ICB/USP
Comparison of De Novo Genome Assemblers
Statistics for assembled contigs of 36-mer short reads. Indicatrix that illustrates the feature of size
distribution are adopted for analysis. ‘‘#’’ denotes the RAM of machine is not enough, and ‘‘N/A’’ means
the data is not available. The N50 size and N80 size represent the maximum read length for which all
contigs greater than or equal to the threshold covered 50% or 80% of the reference genome.
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
DNA Assembly
Coccilab – ICB/USP
Comparison of De Novo Genome Assemblers
Statistics for assembled contigs of 75-mer short reads. Indicatrix that illustrates the feature of size
distribution are adopted for analysis. ‘‘#’’ denotes the RAM of machine is not enough, and ‘‘N/A’’ means
the data is not available. The N50 size and N80 size represent the maximum read length for which all
contigs greater than or equal to the threshold covered 50% or 80% of the reference genome.
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
DNA Assembly
Coccilab – ICB/USP
Genomes assembled de novo exclusively from
Illumina short sequence reads
Organisms:
•
•
•
•
•
•
•
•
•
•
Turkey (Meleagris gallopavo)
Giant panda (Ailuropoda melanoleuca)
Bacillus subtilis 168
Bacillus subtilis natto
Pseudomonas syringae pv. tabaci 11528
Pseudomonas syringae pv. syringae Psy642
Pseudomonas syringae pv. tomato T1
Pseudomonas syringae pv. Aesculi
Apple scab (Ventura inaequalis)
Pine (Pinus species) chloroplast
Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.
DNA Assembly
Coccilab – ICB/USP
Assembly results using real illumina single-end and
paired-end reads from SRA
Source: Bao et al. (2011). Journal of Human Genetics 56: 406–414.
DNA Assembly
Coccilab – ICB/USP
Biases in real short-read sequence data
(A) Illustrates the depth of coverage by
aligned reads over the 6 Mb circular
chromosome. Coverage is shallower
around the 3 Mb region than it is near
the origin of replication (position 0)
(B) Illustrates the expected frequency
distribution of alignment depth,
assuming random sampling of the
genome
(A) Illustrates the observed frequency
distribution of alignment depth, which is
broader than the expected distribution,
indicating greater variance due to
biased sampling.
Source: Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.
DNA Assembly
Coccilab – ICB/USP
Limitations of next-generation genome
sequence assembly
Limitations:
• NGS technologies typically generate shorter sequences with higher
error rates from relatively short insert libraries
• Assembly of longer repeats and duplications will suffer from this short
read length
• Assembly methods for short reads are based on de Bruijn graph and
Eulerian path approaches, which have difficulty in assembling
complex regions of the genome.
• DNA contamination or insertion polymorphism?
Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.
DNA Assembly
Coccilab – ICB/USP
Limitations of next-generation genome
sequence assembly
Limitations:
• Repeat content
• WGS-based de novo sequence assembly algorithm will collapse
identical repeats, resulting in reduced or lost genomic complexity.
• Missing and fragmented genes
Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.
DNA Assembly
Coccilab – ICB/USP
Data generation
and analysis steps
of a typical RNAseq experiment.
Source: Martin & Wang. (2011). Nature
Reviews Genetics 12, 671-682.
Coccilab – ICB/USP
Reference-based transcriptome assembly strategy
Source: Martin & Wang. (2011). Nature
Reviews Genetics 12, 671-682.
DNA Assembly
Coccilab – ICB/USP
Overview of the de novo transcriptome assembly strategy
Source: Martin & Wang. (2011). Nature
Reviews Genetics 12, 671-682.
Coccilab – ICB/USP
Alternative approaches for combined transcriptome
assembly
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
DNA Assembly
Coccilab – ICB/USP
Software for transcriptome assembly
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
DNA Assembly
Coccilab – ICB/USP
Splice-aware short-read aligners
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
DNA Assembly
Coccilab – ICB/USP
Mapping reads onto a reference sequence
Programs:
• Bowtie is an ultrafast, memory-efficient short read aligner. It aligns
short DNA sequences (reads) to the human genome at a rate of over
25 million 35-bp reads per hour.
• Available at http://bowtie-bio.sourceforge.net/index.shtml
• SHRiMP is a software package for aligning genomic reads against a
target genome. Available at http://compbio.cs.toronto.edu/shrimp/
• BarraCUDA - an ultra fast short read sequence alignment software
using GPUs.
• Available at http://www.manycore.group.cam.ac.uk/projects/lam.shtml
• Burrows-Wheeler Aligner (BWA) is an efficient program that aligns
relatively short nucleotide sequences against a long reference
sequence such as the human genome.
• Available at http://bio-bwa.sourceforge.net/
DNA Assembly
Coccilab – ICB/USP
Mapping reads onto a reference sequence
Programs:
• BLAT is a bioinformatics software a tool which performs rapid
mRNA/DNA and cross-species protein alignments
• Available at http://www.kentinformatics.com/products.html
• BFAST facilitates the fast and accurate mapping of short reads to
reference sequences. Some advantages of BFAST include:
• Speed: enables billions of short reads to be mapped quickly.
• Accuracy: A priori probabilities for mapping reads with defined set
of variants.
• An easy way to measurably tune accuracy at the expense of
speed.
• Available at
http://sourceforge.net/apps/mediawiki/bfast/index.php?title=Main
_Page
Coccilab – ICB/USP
Visualizing reads mapped onto a reference
sequence
Programs:
• TABLET - lightweight, high-performance graphical viewer for next
generation sequence assemblies and alignments.
• Available at http://bioinf.scri.ac.uk/tablet/index.shtml
• IGV - Integrative Genomics Viewer - a high-performance visualization
tool for interactive exploration of large, integrated genomic datasets.
• Available at http://www.broadinstitute.org/igv/
Coccilab – ICB/USP
TABLET - graphical viewer
Coccilab – ICB/USP
Integrative Genomics Viewer (IGV)
Coccilab – ICB/USP
Data formats - SOLiD
Color Space:
• Also known as 2-base (Di-Base) encoding, is based on ligation
sequencing rather than sequencing by synthesis.
• Each base in this sequencing method is read twice. This changes the
color of two adjacent color space calls, therefore in order to miscall a
SNP, two adjacent colors must be miscalled.
• Requires specific software to manipulate the data. Most assemblers
are not designed to deal with color space.
Coccilab – ICB/USP
Data formats - SOLiD
SOLiD 4 – data is provided as *csfasta and *.qual
csfasta:
>1_7_80_F3
T223003300123201021020110010200020002200000000300000000001000020000110002200
>1_7_157_F3
T120030200320003020020010020100300003100031000300001000000000010000000000000
>1_7_202_F3
T230020100031001030000230000000200003100000000000003000000000010000000000000
qual:
>1_7_80_F3
40 42 16 4 42 4 7 32 4 42 4 27 36 4 42 4 16 42 4 42 4 27 35 4 4 4 27 35 4 4 7 27
4 4 4 4 27 4 4 4 4 22 4 4 4 4 4 4 4 4 4 4 4 4 4 4 16 4 4 4 4 16 4 4 4 4 16 11 4
4 4 22 7 4 4
>1_7_157_F3
40 42 4 4 42 42 40 4 4 40 42 32 4 4 42 4 7 4 4 7 4 4 36 4 4 40 4 16 4 4 36 4 4 4
4 42 4 4 4 4 42 4 4 4 4 36 4 4 4 4 36 4 4 4 4 4 7 4 4 4 4 4 4 4 4 4 7 4 4 4 4 4
4 4 4
>1_7_202_F3
42 42 4 4 42 42 42 4 4 42 40 35 4 4 42 4 27 4 4 40 4 36 42 4 4 36 4 42 4 4 27 4
4 4 4 16 4 4 4 4 7 7 4 4 4 16 4 4 4 4 4 4 4 4 4 4 7 4 4 4 4 16 4 4 4 4 4 4 4 4 4
11 4 4 4
Coccilab – ICB/USP
Data formats - SOLiD
Color space can be converted into DE (Double encoding)
Life Technologies provides a set of scripts (SOLiD™ de novo accessory
tools 2.0) for conversion and data usage with Velvet assembler.
• The program prepares reads in the format accepted by Velvet
assembly engine.
• The program removes first base and first color, double encodes reads
(i.e., 0 for A,1 for C,2 for G,3 for T).
• After running the assembler, the DE contigs must be converted into
base space.
Coccilab – ICB/USP
Data formats - SOLiD
WSQ:
• Extensible Sequence (XSQ) format introduced with the 5500 series
SOLiD Sequencer.
• Developed to store each call and quality value in a single byte, which
results in file sizes that are up to 75% smaller.
• Binary format – can be converted into csfasta/qual and *.fastq formats
using the SOLiDTM System XSQ Tools (available at Life Technologies)
Coccilab – ICB/USP
Data formats – 454 platform
FASTA and QUAL:
• Files can be provided in FASTA (*.fna) and QUAL (*.qual)
formats.
SFF:
• Standard Flowgram Format - equivalent of the scf/ab1/trace file for
Sanger sequencing, contains information on the signal strength for
each flow.
• Binary format – can be converted into FASTA/QUAL using a python
script (sff_extract) or using sff2fastq script.
Coccilab – ICB/USP
Data formats – Illumina
FASTAQ
• Originally developed at the Wellcome Trust Sanger Institute to bundle
a FASTA sequence and its quality data.
• FASTQ format is a text-based format for storing both a biological
sequence (usually nucleotide sequence) and its corresponding quality
scores.
• Adopted by the Illumina Genome Analyzer.
• FASTQ has become an almost universal format. It is accepted by
many assemblers (e.g. Edena, Euler, Velvet, ABySS, etc. ) and
sequence mapping programs (e.g. Bowtie, BFAST, SHRIMP,
MOSAIK, etc.)
• FASTQ can be converted into FASTA using the FASTX-Toolkit.
Coccilab – ICB/USP
Data formats – Illumina
FASTAQ
• Both the sequence letter and quality score are encoded with a single
ASCII character for brevity.
• Line 1 begins with a '@' character and is followed by a sequence
identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the
same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and
must contain the same number of symbols as letters in the
sequence.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Coccilab – ICB/USP
Data formats – Illumina
FASTAQ Encoding
• Sanger and Illumina use slightly different base quality calculations.
• Sanger
Qsanger = -10 log10p
• Illumina (prior to version 1.3)
Qillumina = -10 log10 [ p /(1-p)]
• Solexa/Illumina 1.0 format can encode a quality score from -5 to 62
using ASCII 59 to 126 (Solexa+64).
• Sanger format uses Phred quality from 0 to 93 using ASCII characters
33 to 126 (Phred+33):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|
|
|
|
|
|
33
59
64
73
104
126
Coccilab – ICB/USP
Download