Assembly algorithms

advertisement
Module 5. Genome assembly: Assembly
algorithms
Background
Three types of assembly algorithms include Greedy, Overlap/Layout/Consensus (OLC) and De Bruijn
Graphs See Miller et al. (2010) for detail. Greedy algorithms work by, for any given contig, using the
next highest-scoring overlap to join next. Scoring algorithms may vary. Greedy algorithms do not
“reconsider” whether a read would help another contig more, after adding to an earlier contig. These
are fast algorithms but can miss the optimal contig construction. We won’t be using a Greedy assembler
here. OLC assemblers perform all-vs-all pairwise read comparison using shared kmers as alignment
seeds. Overlap graphs are constructed and multiple sequence alignments used to refine alignment
(Miller et al. 2010). The all-vs-all step is particularly computationally intensive as there are [n(n-1)]/2
pairwise comparisons among reads, and scores for each alignment must be stored.
Reflection Q: How many pairwise comparisons are there among 100 million reads?
The Newbler assembler is a widely used OLC assembler used predominantly for Roche 454 data
(Margulies et al. 2005). First unitigs are generated from reads using the OLC method. Unitigs are
contigs that do not overlap with reads in other unitigs. The unitigs seed generation of larger contigs that
are joined based on pairwise overlap among unitigs. Unitigs may be split in the process if their
beginning and ends align to different contigs. On the 454 platform multiple calls of a nucleotide are
made at the same time and produce a light intensity that is proportional to the number of bases in a
row. As the number of identical bases repeated in a stretch (i.e. homopolymer like AAAAAAAAAA)
increases, it becomes increasingly difficult for 454 technologies to resolve the exact number of repeats.
Consensus of the raw signal strength from multiple aligning nucleotides is used to help resolve
homopolymer length (Miller et al. 2010).
De Bruijn graphing techniques are described in the figure below (Compeau et al. 2011).
1
F IGURE 1. F ROM C OMPEAU ET AL . 2011
The figure below shows how higher kmer size can result in fragmented assemblies if coverage isn’t high
enough. It also shows that low kmer size may result in ambiguities.
2
F IGURE 2. F ROM HTTP :// GCAT . DAVIDSON . EDU / PHAST / DEBRUIJN . HTML . A) E XAMPLE READS FOR THE CONSENSUS SEQUENCE SHOWN .
B) L ARGE KMERS (5 BP ) FROM THESE READS CAN CAUSE FRAGMENTATION ( AN UNRESOLVED GRAPH ) IF NOT ALL KMERS ARE PRESENT .
T HE READ IN RED IS THROWN OUT , AND THE KMER ATTAG IS NOT CONNECTED . C) S MALL KMERS (4 BP ) MAY COVER THE WHOLE
GENOME BUT RESULT IN PATH AMBIGUITIES . R EADS IN GREEN CAN BE USED TO RESOLVE THE GRAPH , CAUSING THE RESOLUTION
PLOTTED IN RED IN (C).
Manual Exercise
Given the following set of 3bp reads {ATG, CAT, TGC, GCA}, construct a genome using a De Bruijn graph
with kmer size 2.
Errors cause many kmers to be affected, and cause “bulges” in de Bruijn graphs.
3
4
F IGURE 3. M ILLER ET AL . 2010.
F IGURE 4. M ILLER ET AL . 2010
5
SOAPdenovo
SOAPdenovo (Li et al. 2010; Luo et al 2012) is comprised of 4 distinct commands that typically run at the
same time.
Pregraph: construct kmer-graph
Contig: eliminate errors and output contigs
Map: map reads to contigs
Scaff: construct scaffolds
All: do all of the above in turn
Error correction in SOAPdenovo itself includes calculating kmer frequencies and filtering kmers below a
certain frequency, correcting bubbles, and frayed robe patterns. It creates DeBruijn graphs and to
create scaffolds, maps all paired reads to contig consensus sequences, including reads not used in the
graph. Below we will try assembly with all SOAPdenovo modules, on both the raw data and Quakecorrected data.
Goals




De novo genome assembly in Linux OS
Effects of key variables on assembly quality
Measuring assembly quality
Checking bioinformatic results against a standard
V&C core competencies addressed
1) Ability to apply the process of science: Observational strategies, Hypothesis testing, Experimental
design, Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical
methods to diverse data, Mathematical modeling, Managing and analyzing large data sets
3) Use modeling and simulation to understand complex biological systems: Computational
modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets,
Incorporating stochasticity into biological models
6
GCAT-SEEK sequencing requirements
any
Computer/program requirements for data analysis
Linux OS, SOAPdenovo 2
Optional Cluster: Qsub
If starting from Window OS: Putty
If starting from Mac or Linux OS: SSH
Protocols
Genome assembly using SOAPdenovo2 in the Linux environment
We will perform genome assembly on the HHMI cluster. We will start off trying to repeat some work
that was published in the Genome Assembly Gold-standard Evaluations (GAGE) project (Salzberg et al.
2012). Our work will focus on bacterial genomes for purposes of brevity, but this work, in this
computing environment, can be applied to even mammalian sized genomes (>1Gb). You now have 4
bacterial genome datasets uploaded in two forms, raw and quality filtered/error corrected. The raw
data includes a paired end fragment libraries of 101bp reads, an “insert length” of 180bp, in “innie”
orientation, with 1,294,104 reads providing 45x genome coverage (two files: frag_1.fastq, frag_2.fastq).
The raw data also includes two shorter read (37bp) mate-paired jumping libraries with an “insert length”
of 3500bp, in “outie” orientation, with 3,494,070 reads providing another 45x genome coverage (two
files: shortjump_1.fastq, shortjump_2.fastq).
A. Log onto your Linux home directory on the cluster.
B. Make a directory entitled, soap, check that it is there, and move into it.
$mkdir soap
$ls
$cd soap
C. Make the config file using nano. This will tell Soap which files to use, where you can enter
important characteristics of the data. Below you can skip the comment lines starting with “#”.
When finished, hit control-X, and you will be prompted to save the file.
$nano
___________________________
#maximal read length
7
max_rd_len=101
#below starts a new library
[LIB]
#average insert size
avg_ins=180
#if sequence needs to be reversed put 1, otherwise 0
reverse_seq=0
#in which part(s) the reads are used. Flag of 1 means only contigs.
asm_flags=1
#use only first 101 bps of each read
rd_len_cutoff=101
#in which order the reads are used while scaffolding. Small fragments usually first, but
worth playing with.
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size). Not sure
what this means.
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert
size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=frag_1.cor.fastq
q2=frag_2.cor.fastq
#now you will enter the information for the mate pair library
[LIB]
avg_ins=3500
reverse_seq=1
asm_flags=2
rank=2
8
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert
size)
map_len=35
q1=shortjump_1.cor.fastq
q2=shortjump_2.cor.fastq
Control-X to exit and save the file as “config.txt”
D. Move your Quake corrected fastq files into this directory. First navigate into the Quake folder
and then move the files.
$mv *.cor.fastq ../soap/
E. Copy your Qsub script to your current directory and edit it for running SOAP. Please use the low
memory (63mer) version of SOAPdenovo, a kmer (-K) of 31, 16 processors (-p), the config file
you just made (-s), and give all output files the prefix asm (-o). Use the following text at the end
of your example Qsub control file. NOTE: If you simply type the following into the command
prompt, you will not be running Qsub. Make sure to save.
$SOAPdenovo-63mer all -K 31 -p 16 -s config.txt -o asm
Then run the Qsub script using:
$qsub –p 100 NameofYourQsubScript
F. Run GapCloser using Qsub with the following instructions at the end of the Qsub script. You will
use the config file you made above (-b), fill gaps in the sequence file asm.scafSeq (-a), make the
output file asm2.scafSeq, use 16 processers (-t 16), and make sure there is an ovelap of 31 nt
before gap filling (-p).
$GapCloser -b config.txt -a asm.scafSeq -o asm2.scafSeq -t 16 -p 31
G. Examine directory contents using $ls
H. $Less the *.scafSeq file. It will have the scaffold result files.
I. $Less the *.scafStatistics file. This will contain detailed information on scaffold
assembly.
Size_includeN
Total size of assembly in scaffolds, including Ns
Size_withoutN
Total size of assembly in scaffolds, not including Ns
Scaffold_Num
Number of scaffolds
Mean_Size
Mean size of scaffolds
9
Median_Size
Median size of scaffolds
Longest_Seq
Longest scaffold
Shortest_Seq
Shortest scaffold
Singleton_Num
Number of singletons
Average_length_of_break(N)_in_scaffold Average length of unknown nucleotides (N) in scaffolds
Also contained will be counts of scaffolds above certain sizes, percent of each nucleotide and N (Gap)
values, and “N statistics.” An N50 is the size of the smallest scaffold such that 50% of the genome is
contained in scaffolds of size N50 or larger (Salzberg et al. 2012). A line here showing “N50 35836 28”
indicates that there are 28 scaffolds of at least 35836 nucleotides, and that they contain at least 50% of
the overall assembly length. Statistics for contigs (pre-scaffold assemblies) are also shown.
Assessment
Fill in the table below and compare to values reported from the GAGE paper.
Q. Given the total number of scaffolds and the number greater than 100bp, how many were less than
100bp?
Note that the GAGE paper throws out “chaff” contigs that are less than 200bp in length and we did not
do that. GAGE also used a SOAPdenovo module called GapCloser, which will use read data to substitute
Ns with bp data between reads used to generate scaffolds. This would not change the length and
number of the scaffolds or contigs, however. We are also using a later version of SOAPdenovo which
contructs fewer chimeric or misjoined scaffolds.
Contigs
Setting
GAGE Reference
Optimal Conditions
Without Quality Filtering
Without Jumping Library
Small
Large
Quality
Filtering
Y
Y
N
Y
Y
Y
Kmer Size
31
31
31
31
18
51
Jumping
Library
Y
Y
Y
N
Y
Y
Setting
GAGE Reference
Optimal Conditions
Without Quality Filtering
Without Jumping Library
Small
Large
Quality
Filtering
Y
Y
N
Y
Y
Y
Jumping
Kmer Size Library
31
Y
31
Y
31
Y
31
N
18
Y
51
Y
N50
62.7
#
107
Total #bp
assembled
2,872,915
Scaffolds
10
N50
284
#
99
Total #bp
assembled
2,872,915
A. Now re-run the analysis using the raw reads instead of Quake Filtered reads, without using the
jumping library, and with different kmer parameter choices to examine effects on some key
assembly statistics.
Time line of module
Two hours
Discussion topics for class
Discuss effects of changing each parameter on quality of assembly.
Relevant lecture topics include genome structure and sequencing, correction of errors in sequence
reads, genome assembly approaches, and cited literature.
References
Literature Cited
Compeau PEC, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature
Biotechnology. 29:987-991.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J,
Chen Z et al . 2005. Genome sequencing in open fabricated high density picoliter reactors. Nature 437:
376–380.
Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data.
Genomics 95: 315-327.
Salzberg SL, Phillippy AM, Zimin A, et al. 2012. GAGE: a critical evaluation of genome assemblies and
assembly algorithms. Genome Res 22: 557-567. [important online supplements!]
11
Download