Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

• Two problematic libraries, M19107 & M21639, were resequenced due to low coverage
• M19107 still had low coverage after resequencing
• M21639 coverage increased to ~80X

De novo assembly: exponential decrease
[Figure: Effect of coverage depth on contig number. x-axis: Coverage (0 to 90); y-axis: Total Contig Count (0 to 1400); series: De novo, Reference, 2 sample de novo]

M21639 still bad
[Figure: Effect of coverage depth on contig number (same axes and series)]

Mixed culture?
• Is this a mixed culture?
• What does a mixed culture assembly look like?
• newProject, addRun, addRun, runProject …
• M19501 & M21127
• M21127 & M21621

Mixed culture
[Figure: Effect of coverage depth on contig number (same axes and series)]

Possible explanations for poor assembly
• About 20% larger genome
  ◦ Recent plasmid or other large genome insertion
• Also lost hemolytic ability and H2S production

De novo assembly: 36 is the best
[Figure: Effect of coverage depth on contig number (same axes and series)]

De novo assembly
• Limited by our data, 36 contigs is the lower limit
  ◦ Number of repeat elements
  ◦ rRNA
• 20X coverage is sufficient

General assembly stats

Genome         Newbler   Newbler    Mira3    AMOScmp    minimus2      minimus2
               de novo   reference  de novo  reference  newbler/mira  newbler/AMOScmp
M19107.sff     217       1260       208      471        123           109
M19501.sff     75        988        181      521        22            27
M21127.sff     59        1013       89       538        38            28
M21621.sff     50        986        67       515        28            23
M21639.sff     175       1272       175      573        54            69
M21709.sff     52        313        83       187        37            32
M19107_1.sff   1336      1361
M19107_2.sff   450       1006
M21639_1.sff   266       1165
M21639_2.sff   147       1282

Assembler Evaluation Strategy
• Single assembler evaluation
• Minimus2 assembler evaluation
• Feedback by gene prediction group

Single Assembler Evaluation (“Hard Measurement”)

Single assembler evaluation
• “Soft” measurement: satisfy the gene prediction group's requirements.
• The RNA prediction group requires a file from which the read depth can be traced back.
• For this, we use the .tsv file in the Newbler output.

Newbler Output
• 454AlignmentInfo.tsv (-infoall/-info/-noinfo): base consensus, quality, depth and flow signal at each position in each contig. A very useful file.
• e.g.:

Position  Consensus  Quality  Unique  Align Depth         Signal  Signal
                     Score    Depth   (incl. duplicates)          StdDev
>contig00008
1         G          64       26      32                  0.98    0.05
2         A          64       27      33                  0.94    0.13
3         T          64       27      33                  1.97    0.14
4         T          64       27      33                  1.97    0.14
5         G          64       27      33                  0.97    0.06
...etc...

Single Assembler Evaluation
• For the “hard” measurement, we focused mainly on “Total Contigs”, “N50 Contigs Bases”, “Total Big Contigs”, “Big Contigs Percent Bases”, “Big Contig Reads” and “Singleton Reads”.
• For the “soft” measurement, we focused on tracing back the depth for each base pair.

Final Rank of Single Assembler
• We combined the “hard” and “soft” measurements manually. The resulting ranking:
  1. Newbler De Novo
  2. Mira3
  3. AMOS
• Eliminated: Newbler reference & Velvet

Minimus2 Evaluation
• Since our top choice is Newbler, we want to include Newbler’s results in the merged contigs. Thus, we analysed the statistics of:
  1. Newbler merged with AMOS
  2. Newbler merged with Mira
• Visualization tool: Hawkeye

Gene prediction results feedback

Single assembler evaluation (sample data: contigs)

Single assembler evaluation (sample data: big contigs)

Minimus2 Evaluation (sample data: contigs)

Minimus2 Evaluation (sample data: reads)

Minimus2 Evaluation
• Using the same strategy as above, our rankings are:
  1. Newbler merged with Mira
  2. Newbler merged with AMOS

Final Recommendation
1. Merged Contigs:
   (1) Newbler merged with Mira
   (2) Newbler merged with AMOS
2. Single Assembler:
   (1) Newbler
   (2) Mira
   (3) AMOS

Feedback Strategy
• The Gene Prediction Group may use predicted genes and RNAs to evaluate our assembly results.

Minimus2 Overview
• Minimus2 is a modified minimus pipeline
• It is designed to merge one or two sequence sets, hereafter referred to as S1 and S2
• It uses a nucmer-based overlap detector instead of the Smith-Waterman hash-overlap detector that minimus1 uses (much faster)

Minimus2 Usage

    minimus2 prefix \
        -D REFCOUNT=n \   # Number of sequences in the first set
        -D OVERLAP=n \    # Minimum overlap (Default 40bp)
        -D CONSERR=f \    # Maximum consensus error (0..1) (Def 0.06)
        -D MINID=n \      # Minimum overlap %id for align. (Def 94)
        -D MAXTRIM=n      # Maximum sequence trimming length (Def 20bp)

  ◦ prefix refers to an .afg filename

Minimus2 Usage, Cont’d.
• REFCOUNT should be set to the number of sequences in the first set
• Unless the user specifies it, REFCOUNT is set to 0 and an “all vs all” alignment is run by default
• S1 & S2 should be merged and converted to AMOS format using the toAmos command:

    toAmos -s S1-S2.seq -o S1-S2.afg
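The preparation step described above (concatenating S1 and S2, and counting the S1 sequences to get REFCOUNT) can be sketched roughly as follows. This is only an illustration: the function names and the input file names are hypothetical, and it assumes both sets are plain FASTA files. It does not call toAmos or minimus2; it only prepares the merged file and the count that those commands expect.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def merge_sets(s1_path, s2_path, merged_path):
    """Write S1 followed by S2 into merged_path.

    Returns the number of sequences in S1, which is the value the
    slides say REFCOUNT should be set to.
    """
    with open(merged_path, "w") as out:
        for path in (s1_path, s2_path):
            with open(path) as handle:
                for line in handle:
                    # Make sure a file without a trailing newline does not
                    # run into the first header of the next file.
                    out.write(line if line.endswith("\n") else line + "\n")
    return count_fasta_records(s1_path)
```

For example, `merge_sets("S1.fasta", "S2.fasta", "S1-S2.seq")` (file names assumed) would return the REFCOUNT value and leave an S1-S2.seq file ready for the toAmos step.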
• After you merge the data, run minimus2 on it:

    minimus2 S1-S2 -D REFCOUNT=##

• Input: S1-S2.afg
• Output:
  ◦ S1-S2.fasta (contigs)
  ◦ S1-S2.singletons.seq (singletons)

Nucmer Algorithm
• A modification of the MUMmer package matching algorithm
• Operates by building and then searching a suffix tree data structure
• This is a significant upgrade from the minimus1 approach, as searching with suffix trees is O(n) while minimus1’s method is O(n^2)
• Linear time versus quadratic time
• MUMmer link: http://mummer.sourceforge.net/

Nucmer Algorithm
• The Nucmer strategy uses approximately 17 bytes of memory for each base pair in the reference sequence
• The query supplied by the user is streamed past the reference suffix tree, so the memory requirements do not depend on the size of the query sequence
• In plain English: a bigger query does not mean an order-of-magnitude longer run time
• The algorithm is open source and can be found and analyzed on SourceForge

MIRA
• Based on the overlap-layout-consensus (OLC) approach
• Strategies:
  ◦ Preprocessing: high confidence region (HCR)
  ◦ Use a quick heuristic algorithm to align the HCRs of reads
  ◦ Overlaps are reviewed with the Smith-Waterman alignment algorithm
  ◦ Contigs can optionally be analysed and corrected by an incorporated version of an automatic editor
  ◦ Repeats are resolved by searching for typical mis-assembly patterns
  ◦ Optional pre-assembly read extension step: the assembler can try to extend the HCRs of reads by analysing the overlap pairs from the previous alignments.

MIRA outputs
• d_results: this directory contains all the output files of the assembly in different formats.
• d_info: this directory contains information files of the final assembly.
• d_log: this directory contains log files and temporary assembly files.
• d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon)

d_results
• out.padded.fasta: this file contains, as FASTA sequence, the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present.
• out.unpadded.fasta: this file contains, as FASTA sequence, the consensus of the contigs that were assembled in the process, but positions in the consensus containing gaps were removed.
• qual files
• Outputs in other formats
  ◦ caf, ace, gap4d

d_info
• info_assembly.txt: some statistics, as well as whether or not problematic areas remain in the result.
• info_callparameters.txt: the parameters as given on the mira command line when the assembly was started.
• info_contigstats.txt: contig statistics in tabular format.
• info_contigreadlist.txt: which reads have been assembled into which contigs.
• info_readstooshort: a list of the names of those reads that were sorted out of the assembly only because they were too short, before any processing started.
• error_reads_invalid: a list of sequences that were found to be invalid for various reasons (given in the output of the assembler).

Newbler
• Another OLC assembler
  ◦ Starts with `indexing`: scans the .sff file, trims the reads, and performs some checks for possible 3’ and 5’ primers.
  ◦ Finds overlaps between reads: splits the phase between long reads and short reads. Alignments proceed using seed and extend.
  ◦ Simplifies the overlap graph and generates consensus contigs. Uses the quality information for base calling.

Newbler: Metrics file
• 454NewblerMetrics.txt
  ◦ runData: total number of reads and bases in the file, and the number of reads and bases after trimming.
  ◦ runMetrics: number of searches, seeds and overlaps during the alignment phase of assembly.
  ◦ readAlignmentResults: number of reads and bases aligned to other reads; inferred read error (number of errors, mainly indels, between the contigs and the reads).
  ◦ consensusDistribution: this section deals with base calling of the consensus contigs.
  ◦ consensusResults: a summary of the read alignments and assembly statistics.

Newbler: Contigs
• 454AllContigs.fna (>100 bp default)
• 454LargeContigs.fna (>500 bp default)

    >contig00001 length=381 numreads=158

  ◦ The FASTA header gives the unique contig number, its length in bp and the number of reads in the alignment used to build this contig.

    >contig00002 length=144 numreads=560
    GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCG
    CACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACT
    CCGGTCCTCTCGTACTAGGAGCAG

• Matching quality scores:

    >contig00002 length=144 numreads=560
    64 64 64 64 64 ... 64 20 64 ... 64 64 64 64 64

  ◦ One quality value per base (all 64 in this example except a single 20); lower-case bases correspond to quality values below 40.
• 454Contigs.ace
  ◦ All contigs in .ace format. Useful for visualization with tools such as EagleView.

Newbler: The status files
• 454TrimStatus.txt: describes which (trimmed) part of each read was considered for alignment.
  ◦ Read Id, trim points used, used trimmed length, raw length
• 454ReadStatus.txt: describes where reads ended up after assembly was complete.
  ◦ Id, status (assembled, partially assembled, singleton, outlier, too short)
• 454AlignmentInfo.tsv: gives a consensus alignment overview for each position in each contig.
  ◦ Position, consensus, quality score, depth, signal, std. deviation

Automating assembly
• Motivation
  ◦ Troublesome installation, large number of dependencies
  ◦ Command line parameters are difficult to remember
• Automation
  ◦ install.sh: a script to install the assemblers and their dependencies
  ◦ assembler.sh: a script to run the assemblers with default arguments

Future work
• Continue the dialogue with the Gene Prediction group to determine the assembly of choice.
• Metrics:
  ◦ tRNA count
  ◦ rRNA count
  ◦ Protein coding regions

Questions?
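
Backup: computing contig statistics

The “hard” measurements used in the evaluation (total contigs, big contigs, N50) can be approximated directly from Newbler-style FASTA headers of the form `>contig00001 length=381 numreads=158`. The sketch below is a minimal illustration, not the group's actual script: the function names are hypothetical, the 500 bp cutoff is taken from the 454LargeContigs.fna default, and Newbler's own metrics file may define these numbers slightly differently.

```python
import re

# Header layout as shown on the "Newbler: Contigs" slide.
HEADER = re.compile(r"^>(\S+)\s+length=(\d+)\s+numreads=(\d+)")


def contig_lengths(fasta_lines):
    """Collect contig lengths from Newbler-style FASTA header lines."""
    lengths = []
    for line in fasta_lines:
        match = HEADER.match(line)
        if match:
            lengths.append(int(match.group(2)))
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


def summarize(lengths, big_cutoff=500):
    """Summary roughly analogous to the 'hard' measurements (cutoff is an assumption)."""
    big = [n for n in lengths if n > big_cutoff]
    total = sum(lengths)
    return {
        "total_contigs": len(lengths),
        "big_contigs": len(big),
        "big_percent_bases": 100.0 * sum(big) / total if total else 0.0,
        "n50": n50(lengths),
    }
```

Usage would be along the lines of `summarize(contig_lengths(open("454AllContigs.fna")))`, giving one row of numbers per assembly that can be compared across assemblers.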