AssemblyIII - Compgenomics 2011

advertisement
Robert Arthur
Kevin Lee
Xing Liu
Pushkar Pande
Gena Tang
Racchit Thapliyal
Tianjun Ye
Two problematic libraries



M19107 & M21639 were resequenced due to
low coverage
M19107 still had low coverage after re-seq
M21639 coverage increased to ~80X
De novo assembly – exponential
dec.
Effect of coverage depth on contig number
1400
1200
Total Contig Count
1000
800
De novo
Reference
600
2 sample de novo
400
200
0
0
10
20
30
40
50
Coverage
60
70
80
90
M21639 still bad
Effect of coverage depth on contig number
1400
1200
Total Contig Count
1000
800
De novo
Reference
600
2 sample de novo
400
200
0
0
10
20
30
40
50
Coverage
60
70
80
90
Mixed culture?





Is this a mixed culture?
What does a mixed culture assembly look
like?
newProject, addRun, addRun, runProject …
M19501 & M21127
M21127 & M21621
Mixed culture
Effect of coverage depth on contig number
1400
1200
Total Contig Count
1000
800
De novo
Reference
600
2 sample de novo
400
200
0
0
10
20
30
40
50
Coverage
60
70
80
90
Possible explanations for poor
assembly

About 20% larger genome
◦ Recent plasmid or other large genome insertion

Also lost hemolytic ability and H2S production
De novo assembly – 36 is the best
Effect of coverage depth on contig number
1400
1200
Total Contig Count
1000
800
De novo
600
Reference
2 sample de novo
400
200
0
0
10
20
30
40
50
Coverage
60
70
80
90
De novo assembly

Limited by our data, 36 contigs is lower limit
◦ Number of repeat elements
◦ rRNA

20X coverage is sufficient
General assembly stats
Genome
Newbler
de novo
Newbler Mira3 AMOScmp
reference de novo reference
minimus2 minimus2 newbler/mira newbler/AMOScmp
M19107.sff
217
1260
208
471
123
109
M19501.sff
75
988
181
521
22
27
M21127.sff
59
1013
89
538
38
28
M21621.sff
50
986
67
515
28
23
M21639.sff
175
1272
175
573
54
69
M21709.sff
52
313
83
187
37
32
M19107_1.sff
1336
1361
M19107_2.sff
450
1006
M21639_1.sff
266
1165
M21639_2.sff
147
1282
Assembler Evaluation Strategy



Single assembler evaluation
Minimus2 assembler evaluation
Feedback by gene prediction group
Single Assembler Evaluation
(“Hard Measurement”)
Single assembler evaluation

“Soft” measurement: satisfy gene prediction group's
requirement.

RNA prediction group requires a file which can
trace back the depth of the reads. For this, we use
the .tsv file in the Newbler output.
Newbler Output
•454AlignmentInfo.tsv (-infoall/-info/-noinfo) base consensus, quality,
depth and flow-signal, at each position in each contig. A very useful file.
•eg:
Position
Consensus
>contig00008
1
G
2
A
3
T
4
T
5
G
...etc...
Quality Unique
Align
Score
Depth
Depth
(incl.
duplicates)
64
64
64
64
64
26
27
27
27
27
32
33
33
33
33
Signal Signal
StdDev
0.98
0.94
1.97
1.97
0.97
0.05
0.13
0.14
0.14
0.06
Single Assembler Evaluation


For “hard” measurement: we focused mainly on
“Total Contigs” “N50 Contigs Bases”, “Total Big
Contigs”, “Big Contigs Percent Bases”, “Big Contig
Reads”, “Singleton Reads”.
For “soft” measurement: we focused on the trace
back of depth for each base pair.
Final Rank of Single Assembler





Combine the “hard” and “soft” measurement
manually. We get as a result:
1. Newbler De Novo
2. Mira3
3. Amos
Eliminated : Newbler reference & velvet
Minimus2 Evaluation


Since our top choice is Newbler, we want to
include Newbler’s results in the merged contigs.
Thus, we analysed the statistics of:
1. Newbler merged with Amos
2. Newbler merged with Mira
Visulization tool: hawkeye
Minimus2 Evaluation
Minimus2 Evaluation
Gene prediction results feedback
Single assembler evaluation (Sample Data:
contigs)
Single assembler evaluation
(Sample Data: Big contigs)
Final Rankings of Single
Assembler





Combined the “hard” and “soft” measurement
manually and we got:
1. Newbler De Novo
2. Mira3
3. Amos
Eliminated : Newbler reference & velvet
Minimus2 Evaluation (sample
data: contigs)
Minimus2 Evaluation (sample
data: reads)
Minimus2 Evaluation

Using the same strategy as above, our rankings
are:

1. Newbler merged with Mira;
2. Newbler merged with Amos;

Final Recommendation







1. Merged Contigs:
(1) Newbler merged with Mira
(2) Newbler merged with Amos
2. Single Assembler:
(1) Newbler
(2) Mira
(3) Amos
Feedback Strategy

Gene Prediction Group may use predicted genes
and RNAs to evaluate our assembly results.
Minimus2 Overview

Minimus2 is a modified minimus pipeline

It is designed to merge one or two sequence sets
hereafter referred to as S1 or S2

Uses Nucmer based Overlap Detector instead of the
Smith-Waterman hash overlap Minimus1 uses (much
faster)
Minimus2 Usage

minimus2 prefix \

-D REFCOUNT=n \ # Number of sequences is the first
set
-D OVERLAP=n \ # Minimum overlap (Default 40bp)
-D CONSERR=f \ # Maximum consensus error (0..1)
(Def 0.06)
-D MINID=n \ # Minimum overlap %id for align. (Def 94)
-D MAXTRIM=n # Maximum sequence trimming length
(Def 20bp)




◦ Prefix refers to an .AFG filename
Minimus2 Usage, Cont’d.



REFCOUNT should be set to number of
sequences in the first set
“all vs all” alignment is ran by default and sets
REFCOUNT to 0 unless user-specified
S1 & S2 should be merged and converted to
AMOS format using toAmos command
 toAmos –s S1-S2.seq –o S1-S2.afg
Minimus2 Usage, Cont’d.



After you merge the data, you actually run
minimus on it
 Minimus2 S1-S2 –D REFCOUNT=##
Input
 S1-S2.afg
Output
 S1-S2.fasta (contig)
 S1-S2.singletons.seq (single)
Nucmer Algorithm



A modification of the MUMmer package
matching algorithm
Operates via building and then searching a
suffix tree data structure
This is a significant upgrade from the minimus1
approach as searching using suffix trees is O(n)
and minimus1’s method is O(n2)
 Linear time versus polynomial time

MUMmer link : http://mummer.sourceforge.net/
Nucmer Algorithm



The Nucmer strategy uses approximately 17 bytes
of memory for each basepair in the reference
sequence
The query supplied by the user is streamed past the
reference suffix tree so that the memory
requirements do not depend on the size of the query
sequence
 In English: Bigger query does not mean order of
magnitude longer operating time
Unique algorithm that can be found and analyzed as
it is open source on Sourceforge
MIRA


Based on OLC approach
Strategies:
◦
◦
◦
◦
Preprocessing: high confidence region (HCR)
Use a quick heuristic algorithm to alignment the HCR of reads
Overlaps are reviewed with Smith-Waterman alignment algorithm
Contigs can be be optionally analysed and corrected by an incorporated
version of an automatic editor
◦ Repeats are resolved by searching for typical mis-assembly patterns
◦ Optional pre-assembly read extension step: the assembler can try to
extend HCRs of reads by analysing the overlap pairs from the previous
alignments.
MIRA outputs




d_results: this directory contains all the output files of
the assembly in different formats.
d_info: this directory contains information files of the
final assembly.
d_log: this directory contains log files and temporary
assembly files.
d_chkpt: this directory contains checkpoint files
needed to resume assemblies that crashed or were
stopped (not implemented yet, but soon)
d_results




out.padded.fasta: this file contains as FASTA sequence the
consensus of the contigs that were assembled in the process.
Positions in the consensus containing gaps (also called
'pads', denoted by an asterisk) are still present.
out.unpadded.fasta: this file contains as FASTA sequence
the consensus of the contigs that were assembled in the
process, put positions in the consensus containing gaps were
removed.
qual files
Outputs with other formats
◦ caf, ace, gap4d
d_info






info_assembly.txt: some statistics as well as whether or not problematic
areas remain in the result.
info_callparameters.txt: This file contains the parameters as given on the
mira command line when the assembly was started.
info_contigstats.txt: This file contains in tabular format statistics
info_contigreadlist.txt: This file contains information which reads have
been assembled into which contigs.
info_readstooshort: A list containing the names of those reads that have
been sorted out of the assembly only due to the fact that they were too
short, before any processing started.
error_reads_invalid: A list of sequences that have been found to be invalid
due to various reasons (given in the output of the assembler).
Newbler

Another OLC assembler
◦ Starts with `indexing`
 Scans the .sff file, trims the reads,
 Performs some checks for possible 3’ and 5’ primers.
◦ Finds overlaps between reads
 Splits the phase between long reads and short reads.
 Alignments proceed using seed and extend.
◦ Simplifies overlap graph and generates consensus
contigs.
 Uses the quality information for base calling.
Newbler: Metrics file

454NewblerMetrics.txt
◦ runData
 Total number of reads and bases in the file, also the number
of reads and bases after trimming.
◦ runMetrics
 Number of searches, seeds and overlaps during the
alignment phase of assembly
◦ readAlignmentResults
 Number of reads and bases aligned to other reads,
 Inferred read error – No. of errors, mainly indels, between
the contigs and the reads.
◦ consensusDistribution
 This section deals with base calling of the consensus
contigs.
◦ consensusResults
 A summary of the read alignments and assembly statistics
Newbler: Contigs


454AllContigs.fna (>100 bp default)
454LargeContigs.fna (>500 bp default)
>contig00001
length=381
numreads=158
◦ The fasta header:
 Gives the the unique contig number, its length in bp and the number of
reads in the alignment used to build this contig
>contig00002 length=144
numreads=560
GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCG
CACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACT
CCGGTCCTCTCGTACTAGGAGCAG
>contig00002 length=144
64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64
numreads=560
64 64 64 64 64
64 64 64 64 64
64 64 64 64 64
64 64 64 64 64
64 64 64 64 64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
20
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
64
◦ lower case bases correspond to quality values below 40.

454Contigs.ace
◦ All contigs in .ace format. Useful for visualization using for e.g.
Eagleview
64
64
64
64
Newbler: The status files

454TrimStatus.txt: This file describes what
(trimmed) part of the read was considered for
alignment.
◦ read Id, trim points used, used trimmed length, raw
length

454ReadStatus.txt: This file describes where
reads ended up after assembly was complete.
◦ Id, Status (assembled, partially assembled, singleton,
outlier, too short).

454AlignmentInfo.tsv: This file gives a consensus
alignment overview for each position in each
contig.
◦ Position, consensus, quality score, depth, signal, std.
deviation
Automating assembly

Motivation
◦ Troublesome installation, number of dependencies
◦ Difficult to remember command line parameters

Automation
◦ install.sh : A script to install assemblers and their
dependencies
◦ assembler.sh: Script to run assemblers with default
arguments.
Future work


Continue to dialog with G.P. to determine
assembly of choice.
Metrics:
◦ tRNA count
◦ rRNA count
◦ Protein coding regions
Questions?
Download