Module 5. Genome assembly: Assembly algorithms

Background

Three types of assembly algorithms are Greedy, Overlap/Layout/Consensus (OLC), and De Bruijn graph algorithms. See Miller et al. (2010) for detail.

Greedy algorithms extend any given contig by always joining the highest-scoring remaining overlap next. Scoring schemes vary among assemblers. Once a read has been added to a contig, a Greedy algorithm does not “reconsider” whether that read would have helped another contig more. These algorithms are fast but can miss the optimal contig construction. We won’t be using a Greedy assembler here.

OLC assemblers perform all-vs-all pairwise read comparison using shared kmers as alignment seeds. Overlap graphs are constructed, and multiple sequence alignments are used to refine the alignments (Miller et al. 2010). The all-vs-all step is particularly computationally intensive because there are n(n-1)/2 pairwise comparisons among n reads, and a score for each alignment must be stored.

Reflection Q: How many pairwise comparisons are there among 100 million reads?

The Newbler assembler is a widely used OLC assembler, applied predominantly to Roche 454 data (Margulies et al. 2005). First, unitigs are generated from reads using the OLC method. Unitigs are contigs whose reads do not overlap reads in other unitigs. The unitigs seed the generation of larger contigs, which are joined based on pairwise overlap among unitigs. Unitigs may be split in the process if their beginnings and ends align to different contigs.

On the 454 platform, all bases in a run of identical nucleotides are called at the same time and produce a light intensity that is proportional to the number of bases in the run. As the number of identical bases repeated in a stretch (i.e. a homopolymer like AAAAAAAAAA) increases, it becomes increasingly difficult for 454 technologies to resolve the exact number of repeats. A consensus of the raw signal strength from multiple aligning reads is used to help resolve homopolymer length (Miller et al. 2010).
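The n(n-1)/2 count given above for the all-vs-all step of an OLC assembler can be evaluated directly. The following is a minimal sketch in Python; the function name pairwise_comparisons is illustrative only and is not part of any assembler. It prints the count for a few small read sets so the growth is visible, and the 100 million read case from the reflection question can be tried the same way.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs in an all-vs-all comparison: n(n-1)/2."""
    if n_reads < 0:
        raise ValueError("number of reads cannot be negative")
    # n * (n - 1) is always even, so integer division is exact.
    return n_reads * (n_reads - 1) // 2


# The count grows with the square of the number of reads.
for n in (4, 1_000, 1_000_000):
    print(f"{n:>10,} reads -> {pairwise_comparisons(n):,} comparisons")
```

Because Python integers do not overflow, the same function can be called with very large read counts without any special handling.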
De Bruijn graphing techniques are described in the figure below (Compeau et al. 2011).

Figure 1. From Compeau et al. 2011.

The figure below shows how a larger kmer size can result in fragmented assemblies if coverage isn’t high enough. It also shows that a small kmer size may result in ambiguities.

Figure 2. From http://gcat.davidson.edu/phast/debruijn.html. A) Example reads for the consensus sequence shown. B) Large kmers (5 bp) from these reads can cause fragmentation (an unresolved graph) if not all kmers are present. The read in red is thrown out, and the kmer ATTAG is not connected. C) Small kmers (4 bp) may cover the whole genome but result in path ambiguities. Reads in green can be used to resolve the graph, giving the resolution plotted in red in (C).

Manual Exercise

Given the following set of 3bp reads {ATG, CAT, TGC, GCA}, construct a genome using a De Bruijn graph with kmer size 2.

A single sequencing error affects every kmer that overlaps it, and such errors cause “bulges” in de Bruijn graphs.

Figure 3. Miller et al. 2010.

Figure 4. Miller et al. 2010.

SOAPdenovo

SOAPdenovo (Li et al. 2010; Luo et al. 2012) comprises four distinct commands that are typically run together, one after another, plus a fifth command that runs them all:

Pregraph: construct kmer-graph
Contig: eliminate errors and output contigs
Map: map reads to contigs
Scaff: construct scaffolds
All: do all of the above in turn

Error correction in SOAPdenovo itself includes calculating kmer frequencies and filtering out kmers below a certain frequency, as well as correcting bubbles and frayed rope patterns. SOAPdenovo creates de Bruijn graphs and, to create scaffolds, maps all paired reads to contig consensus sequences, including reads not used in the graph. Below we will try assembly with all SOAPdenovo modules, on both the raw data and Quake-corrected data.
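The de Bruijn construction from the manual exercise above can also be sketched in a few lines of Python. This is a teaching sketch under stated assumptions, not how SOAPdenovo is implemented: it reads “kmer size 2” as meaning that the graph nodes are 2-mers and each 3bp read contributes one edge, it assumes error-free reads from a circular genome, and the function names are made up for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are k-mers; every (k+1)-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k):
            window = read[i:i + k + 1]
            graph[window[:-1]].append(window[1:])
    return graph


def eulerian_cycle(graph, start):
    """Hierholzer's algorithm; assumes each node has equal in- and out-degree."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def spell_circular_genome(cycle):
    """Take the first base of each node, skipping the repeated final node."""
    return "".join(node[0] for node in cycle[:-1])


reads = ["ATG", "CAT", "TGC", "GCA"]  # reads from the manual exercise
graph = de_bruijn_graph(reads, k=2)
cycle = eulerian_cycle(graph, start="AT")
genome = spell_circular_genome(cycle)
print("node path:", " -> ".join(cycle))
print("circular genome:", genome)
```

Walking every edge exactly once (an Eulerian cycle) uses every read once, which is the idea behind the de Bruijn approach described by Compeau et al. (2011). Real assemblers must additionally cope with sequencing errors, repeats, and uneven coverage.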
Goals

De novo genome assembly in Linux OS
Effects of key variables on assembly quality
Measuring assembly quality
Checking bioinformatic results against a standard

V&C core competencies addressed

1) Ability to apply the process of science: Observational strategies, Hypothesis testing, Experimental design, Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical methods to diverse data, Mathematical modeling, Managing and analyzing large data sets
3) Ability to use modeling and simulation to understand complex biological systems: Computational modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets, Incorporating stochasticity into biological models

GCAT-SEEK sequencing requirements

Any

Computer/program requirements for data analysis

Linux OS, SOAPdenovo 2
Optional Cluster: Qsub
If starting from Windows OS: Putty
If starting from Mac or Linux OS: SSH

Protocols

Genome assembly using SOAPdenovo2 in the Linux environment

We will perform genome assembly on the HHMI cluster. We will start by trying to repeat some work that was published in the Genome Assembly Gold-standard Evaluations (GAGE) project (Salzberg et al. 2012). Our work will focus on bacterial genomes for brevity, but the same approach, in this computing environment, can be applied even to mammalian-sized genomes (>1Gb). You now have 4 bacterial genome datasets uploaded in two forms: raw, and quality filtered/error corrected.

The raw data includes a paired-end fragment library of 101bp reads, with an “insert length” of 180bp, in “innie” orientation, with 1,294,104 reads providing 45x genome coverage (two files: frag_1.fastq, frag_2.fastq).
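As a rough check on the 45x figure quoted above for the fragment library, average coverage can be estimated as (number of reads x read length) / genome size. The sketch below is illustrative only. It assumes a genome size of 2,872,915 bp, borrowed from the “Total #bp assembled” value listed for the GAGE reference in the assessment table of this module; that number stands in for the true genome size and is an assumption, as is the function name.

```python
def estimated_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage = total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_length_bp / genome_size_bp


# Assumption: the "Total #bp assembled" value from the assessment table
# is used here as a stand-in for the genome size.
ASSUMED_GENOME_SIZE_BP = 2_872_915

# Read count and read length are the values given for the fragment library.
frag_coverage = estimated_coverage(1_294_104, 101, ASSUMED_GENOME_SIZE_BP)
print(f"fragment library: about {frag_coverage:.1f}x coverage")
```

The result should land close to the stated 45x. The same function can be applied to any other library once its read count and read length are known.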
The raw data also includes a shorter-read (37bp) mate-paired jumping library with an “insert length” of 3500bp, in “outie” orientation, with 3,494,070 reads providing another 45x genome coverage (two files: shortjump_1.fastq, shortjump_2.fastq).

A. Log onto your Linux home directory on the cluster.

B. Make a directory entitled soap, check that it is there, and move into it.
$mkdir soap
$ls
$cd soap

C. Make the config file using nano. This file tells SOAPdenovo which read files to use and is where you enter important characteristics of the data. You can skip the comment lines starting with “#” below. When finished, hit control-X, and you will be prompted to save the file.
$nano
___________________________
#maximal read length
max_rd_len=101
#below starts a new library
[LIB]
#average insert size
avg_ins=180
#if sequence needs to be reversed put 1, otherwise 0
reverse_seq=0
#in which part(s) the reads are used. Flag of 1 means only contigs.
asm_flags=1
#use only first 101 bps of each read
rd_len_cutoff=101
#in which order the reads are used while scaffolding. Small fragments usually first, but worth playing with.
rank=1
#cutoff of pair number for a reliable connection, i.e. the minimum number of read pairs that must link two contigs (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq files, read 1 file should always be followed by read 2 file
q1=frag_1.cor.fastq
q2=frag_2.cor.fastq
#now you will enter the information for the mate pair library
[LIB]
avg_ins=3500
reverse_seq=1
asm_flags=2
rank=2
#cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=shortjump_1.cor.fastq
q2=shortjump_2.cor.fastq

Control-X to exit and save the file as “config.txt”.

D. Move your Quake-corrected fastq files into this directory.
First navigate into the Quake folder, move the files, and then return to the soap directory.
$mv *.cor.fastq ../soap/
$cd ../soap

E. Copy your Qsub script to your current directory and edit it for running SOAP. Please use the low memory (63mer) version of SOAPdenovo, a kmer (-K) of 31, 16 processors (-p), the config file you just made (-s), and give all output files the prefix asm (-o). Use the following text at the end of your example Qsub control file. NOTE: If you simply type the following into the command prompt, you will not be running Qsub. Make sure to save.
$SOAPdenovo-63mer all -K 31 -p 16 -s config.txt -o asm

Then run the Qsub script using:
$qsub -p 100 NameofYourQsubScript

F. Run GapCloser using Qsub with the following instructions at the end of the Qsub script. You will use the config file you made above (-b), fill gaps in the sequence file asm.scafSeq (-a), name the output file asm2.scafSeq (-o), use 16 processors (-t 16), and require an overlap of 31 nt before gap filling (-p).
$GapCloser -b config.txt -a asm.scafSeq -o asm2.scafSeq -t 16 -p 31

G. Examine directory contents using
$ls

H. Use less to view the *.scafSeq file. It contains the scaffold sequences.
$less asm.scafSeq

I. Use less to view the *.scafStatistics file. It contains detailed information on the scaffold assembly.
$less asm.scafStatistics

Size_includeN                             Total size of assembly in scaffolds, including Ns
Size_withoutN                             Total size of assembly in scaffolds, not including Ns
Scaffold_Num                              Number of scaffolds
Mean_Size                                 Mean size of scaffolds
Median_Size                               Median size of scaffolds
Longest_Seq                               Longest scaffold
Shortest_Seq                              Shortest scaffold
Singleton_Num                             Number of singletons
Average_length_of_break(N)_in_scaffold    Average length of unknown nucleotides (N) in scaffolds

The file also contains counts of scaffolds above certain sizes, the percentage of each nucleotide and of N (gap) values, and “N statistics.” An N50 is the size of the smallest scaffold such that 50% of the genome is contained in scaffolds of size N50 or larger (Salzberg et al. 2012).
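The N50 definition above can be turned into a short calculation. The following Python sketch is illustrative: the function name and the scaffold lengths are hypothetical and are not taken from the module's data. By default it uses the total assembly length as the 50% target; a known genome size can be passed instead through the optional total_bp argument.

```python
def n50(lengths, total_bp=None):
    """Return (N50 length, number of sequences at least that long).

    The target is half of total_bp, which defaults to the summed lengths.
    """
    ordered = sorted(lengths, reverse=True)
    target = (sum(ordered) if total_bp is None else total_bp) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length  # add scaffolds from longest to shortest
        if running >= target:
            return length, count
    raise ValueError("sequences do not reach 50% of the target size")


# Hypothetical scaffold lengths in bp, for illustration only.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
n50_length, n50_count = n50(scaffold_lengths)
print(f"N50 {n50_length} {n50_count}")
```

The two returned values correspond to the two numbers that follow “N50” in the statistics file: the scaffold size and the number of scaffolds of at least that size.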
A line in the *.scafStatistics file showing “N50 35836 28” indicates that there are 28 scaffolds of at least 35836 nucleotides, and that they contain at least 50% of the overall assembly length. Statistics for contigs (pre-scaffold assemblies) are also shown.

Assessment

Fill in the tables below and compare to values reported in the GAGE paper.

Q. Given the total number of scaffolds and the number greater than 100bp, how many were less than 100bp?

Note that the GAGE paper throws out “chaff” contigs that are less than 200bp in length, and we did not do that. GAGE also used a SOAPdenovo module called GapCloser, which uses read data to replace Ns between the reads used to generate scaffolds with actual bases. This would not change the length and number of the scaffolds or contigs, however. We are also using a later version of SOAPdenovo, which constructs fewer chimeric or misjoined scaffolds.

Contigs

Setting               GAGE        Optimal      Without Quality   Without Jumping   Small   Large
                      Reference   Conditions   Filtering         Library           Kmer    Kmer
Quality Filtering     Y           Y            N                 Y                 Y       Y
Kmer Size             31          31           31                31                18      51
Jumping Library       Y           Y            Y                 N                 Y       Y
N50 (kb)              62.7
#                     107
Total #bp assembled   2,872,915

Scaffolds

Setting               GAGE        Optimal      Without Quality   Without Jumping   Small   Large
                      Reference   Conditions   Filtering         Library           Kmer    Kmer
Quality Filtering     Y           Y            N                 Y                 Y       Y
Kmer Size             31          31           31                31                18      51
Jumping Library       Y           Y            Y                 N                 Y       Y
N50 (kb)              284
#                     99
Total #bp assembled   2,872,915

A. Now re-run the analysis using the raw reads instead of the Quake-filtered reads, without using the jumping library, and with different kmer parameter choices to examine the effects on some key assembly statistics.

Timeline of module

Two hours

Discussion topics for class

Discuss the effects of changing each parameter on the quality of the assembly. Relevant lecture topics include genome structure and sequencing, correction of errors in sequence reads, genome assembly approaches, and the cited literature.

References

Literature Cited

Compeau PEC, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29: 987-991.

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.

Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95: 315-327.

Salzberg SL, Phillippy AM, Zimin A, et al. 2012. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 22: 557-567. [important online supplements!]