FASTG: Representing the true information content of a genome assembly
Iain MacCallum, David B. Jaffe
Broad Institute of MIT and Harvard, Cambridge, MA
In collaboration with: Michael Schatz, Daniel Rokhsar, and the Assemblathon Format Group
The Problem: FASTA is flat
• The FASTA format is the standard way to represent an assembly:
TAC…
>scaffold
TACTAGGCNNNNNNNATTAGGCCG
TGCNNNNNNNNNNNGCGCCGTTAC
CATTCNNNNNNACTGCCGTTGACT
…ACT
• Supported by many tools
• Been the warhorse for ~20 years
The Solution: FASTG
• Enhanced FASTA to capture known complexity of the genome
AT…
>assembly;
ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTG
GCTTATAC[1:alt|C,G]TACCCGATACGTTTACGGTA
TACGAAAAA[5:tandem:size=(5,4..11)|A]TCT
…CT
• Linear representation of the genome
• Superset of FASTA
• Easy to parse and human readable
• Faithfully represents genome assemblies without error or loss of information
• Provides a simple co-ordinate system that allows easy annotation
• Hybrid approach:
- preserves underlying linearity of the genome, but:
- captures nonlinear complexity
FASTG captures important gap information
Assembly gap with complex contents
FASTA
>scaffold1
CATATNNNNNNNNNNGATGT
Any information about the gap content is lost
T
FASTG encodes all ambiguities
• Strictly linear nature forces assemblers to introduce errors:
• FASTG natively encodes ambiguities that are lost in FASTA
A
ACATT
TACTG
CATAT
GAC
FASTA forces assemblers to make mistakes
A
• FASTG retains this information
FASTG
>scaffold1;
CATAT NNNNNNNNNN
[10:gap:
size=(10,5..20),
start=(a,e),end=(d,g)|
>a:b,c,d;CA
>b:c,d;T
>c:f,g;GAC
>d;AA
>e:f,g;TT
>f:g;A
>g;AGT
] GATGT
• Easily converted to FASTA
• But FASTA has a number of limitations:
ACATT
• Gaps in FASTA often hide additional information, e.g. frame shifts
ACATT
TACTG
A[1:alt|A,T]
GATGT
TACTG
T
Assembler forced to pick A or T
A
Uncertain base or SNP
5Cs
ACATT
7Cs
ACATT
TACTG
6Cs
Assembler forced to choose the repeat length
ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG
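As a rough illustration of what the tandem construct above encodes, the following Python sketch lists the repeat sequences it allows. It assumes, based only on the examples in this poster, that size=(n,lo..hi) gives a best estimate followed by an inclusive range, counted in copies of the unit that follows "|". The function name tandem_alternatives is hypothetical and not part of any FASTG tool.

```python
import re

# Assumed property syntax, inferred from the poster's examples: size=(best,min..max)
SIZE = re.compile(r"size=\((\d+),(\d+)\.\.(\d+)\)")


def tandem_alternatives(construct: str) -> list[str]:
    """Return every repeat sequence permitted by a tandem construct."""
    match = SIZE.search(construct)
    if match is None or "|" not in construct:
        raise ValueError("expected something like [6:tandem:size=(6,5..7)|C]")
    _best, low, high = (int(group) for group in match.groups())
    # The repeat unit is the text between "|" and the closing "]".
    unit = construct.rstrip("]").rpartition("|")[2]
    return [unit * copies for copies in range(low, high + 1)]


print(tandem_alternatives("[6:tandem:size=(6,5..7)|C]"))
# ['CCCCC', 'CCCCCC', 'CCCCCCC']
```

These are the 5Cs, 6Cs and 7Cs alternatives; a FASTA file has to commit to one of them.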
TACTG
Putative gap sequences
derived from FASTG, e.g.
1) CATGACAAGT
2) CATTTAA
3) TTAAAAGT
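One way to derive putative gap sequences is to walk the small graph stored inside the scaffold1 gap construct. The Python sketch below does this under two assumptions that come from reading the poster's example, not from a specification: each node line has the form ">name:successors;sequence", and the graph has no cycles. The names parse_nodes and gap_sequences are hypothetical.

```python
def parse_nodes(lines):
    """Parse lines such as '>a:b,c,d;CA' into {name: (sequence, successors)}."""
    graph = {}
    for line in lines:
        header, _, seq = line.lstrip(">").partition(";")
        name, _, successors = header.partition(":")
        graph[name] = (seq, successors.split(",") if successors else [])
    return graph


def gap_sequences(graph, starts, ends):
    """Yield the sequence spelled by each path from a start node to an end node.

    Assumes the graph is acyclic; a cycle would make this recursion endless.
    """
    def walk(node, prefix):
        seq, successors = graph[node]
        spelled = prefix + seq
        if node in ends:
            yield spelled
        for nxt in successors:
            yield from walk(nxt, spelled)

    for start in starts:
        yield from walk(start, "")


# Node lines copied from the scaffold1 gap construct.
NODES = [">a:b,c,d;CA", ">b:c,d;T", ">c:f,g;GAC", ">d;AA",
         ">e:f,g;TT", ">f:g;A", ">g;AGT"]

graph = parse_nodes(NODES)
sequences = list(gap_sequences(graph, starts=("a", "e"), ends=("d", "g")))
print(sequences[0])  # CATGACAAGT
```

Under this reading, the path a, b, c, f, g spells CATGACAAGT, the first putative sequence listed above.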
7Cs
Uncertain tandem repeat
FASTG captures possible
content of the gap
CGAGG
ACATT
AAGCC
ACATT
TACTG
TACTG
ACATT CGAGG[5:digraph:path=(a)|>a;CGAGG>b;AAGCC] TACTG
AAGCC
Assembler forced to choose a haplotype
• FASTG looks like FASTA
Haplotype separation
• These simple events are difficult to represent in the FASTA format
• FASTG can be easily converted to FASTA
• Optional properties (probability, copy-number, etc.) can be associated with these events, e.g.
• The assembler is forced to choose, resulting in a loss of information and errors.
FASTG
FASTA cannot represent graph assemblies
• Not all assemblies can be reduced to a linear form, due to:
- Polymorphism that cannot be linearized
- Inversions that cannot be disambiguated
- Long repeats that cannot be bridged with jumping data
• Assemblies must be broken into linear sections, losing information
FASTG captures graphs using a hierarchic approach
• FASTG is FASTA-like, preserves linearity and keeps local complexity local
• FASTG is easy to use
Genome
Assembly graph
Jumping libraries too short to disambiguate the repeat
ContigC
Long imperfect repeat
C
• FASTG can easily be converted to FASTA by removing the FASTG extensions “[…]”
• Conversion can be done with a simple shell or perl script
• The resulting FASTA can be processed by existing tools
FASTG
B
FASTA conversion
• FASTG and derived FASTA files share the same base co-ordinate system
Long Repeat
• Any existing tool that works on FASTA can use converted FASTG
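The conversion amounts to deleting every bracketed extension. The poster mentions shell or perl; the sketch below uses Python instead, purely as an illustration. Reducing a header such as ">contig1;" or ">ContigA:ContigC;" to the bare name is an assumption of this sketch, and the function names are hypothetical.

```python
def strip_fastg_extensions(text: str) -> str:
    """Drop every bracketed FASTG construct and keep the text outside the brackets."""
    kept = []
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def fastg_to_fasta(text: str) -> str:
    """Convert FASTG text to FASTA by removing extensions and simplifying headers."""
    lines = []
    for line in strip_fastg_extensions(text).splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: a FASTG header looks like ">name;" or ">name:neighbours;".
            lines.append(">" + line[1:].rstrip(";").split(":")[0])
        else:
            # Remove layout whitespace inside sequence lines.
            lines.append("".join(line.split()))
    return "\n".join(lines) + "\n"


example = """>contig1;
TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC
GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT
TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC
"""
print(fastg_to_fasta(example), end="")
```

Brackets are removed before the text is split into lines, so a construct that spans several lines (such as a gap with an embedded graph) disappears as a whole.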
A[1:alt:allele|A,T]
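The constructs shown on this poster share a visible pattern: a leading number, an event type, optional properties, and optional content after "|". The Python sketch below splits a construct along those lines. The field names and the function parse_construct are hypothetical, and reading the leading number as the count of preceding bases the construct describes is an inference from the poster's examples.

```python
from typing import NamedTuple, Optional


class Construct(NamedTuple):
    covered: int            # assumed: number of preceding bases the construct describes
    kind: str               # e.g. "gap", "alt", "tandem", "digraph"
    properties: str         # raw text after the type, "" if absent
    payload: Optional[str]  # raw text after "|", None if absent


def parse_construct(text: str) -> Construct:
    """Split one bracketed construct into its visible parts, without interpreting them."""
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("a construct must be enclosed in [ and ]")
    head, bar, payload = text[1:-1].partition("|")
    parts = head.split(":", 2)
    if len(parts) < 2:
        raise ValueError("expected at least 'length:type'")
    properties = parts[2] if len(parts) == 3 else ""
    return Construct(int(parts[0]), parts[1], properties, payload if bar else None)


print(parse_construct("[1:alt:allele|A,T]"))
# Construct(covered=1, kind='alt', properties='allele', payload='A,T')
```

Properties and payload are left as raw strings because their inner syntax differs between event types.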
• Quality scores, annotation, and IUPAC codes provide only a partial solution
A
FASTG is FASTA compatible
ContigB
• No additional markup language required
• FASTA + Markup will produce the original FASTG
Single base
difference
Long Repeat
A
D
B
C
FASTA
D
C
Long Repeat
A
B
Graph
>contig1
TACCGCNNNNAGCCTGCC
GTTATACCTCCCTGGATA
CGTTTAGGATATATCC
+
• FASTG extensions plus start location = Markup
D
Long Repeat
>contig1;
TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC
GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT
TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC
FASTA
• Can convert markup to existing annotation formats
Uncertain tandem
repeat
Global graph structure encoded here
FASTG
>ContigA:ContigC;
TCGA…[7:tandem:size=(7,6..9)|T]…CATG
>ContigB:ContigC;
ATAGCG…ATCCAT
>ContigC:ContigA,ContigB;
CGTA…[1:alt|C,G]…AATC
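The global graph can be read straight from the header lines. The Python sketch below builds an adjacency map from them, assuming each top-level header has the form ">name:neighbour,neighbour;" as in the example above. The function name contig_graph is hypothetical, and node lines nested inside bracketed constructs are not handled.

```python
def contig_graph(fastg_text: str) -> dict[str, list[str]]:
    """Map each contig name to the contigs named after ':' in its header."""
    graph = {}
    for line in fastg_text.splitlines():
        line = line.strip()
        # Top-level headers start with ">" and end with ";".
        if line.startswith(">") and line.endswith(";"):
            name, _, neighbours = line[1:-1].partition(":")
            graph[name] = neighbours.split(",") if neighbours else []
    return graph


headers = """>ContigA:ContigC;
>ContigB:ContigC;
>ContigC:ContigA,ContigB;
"""
print(contig_graph(headers))
# {'ContigA': ['ContigC'], 'ContigB': ['ContigC'], 'ContigC': ['ContigA', 'ContigB']}
```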
ContigA
Markup
>contig1;
6 [4:gap:size=(4,3..5)]
26 [1:alt:allele|C,G]
52 [6:tandem:size=(3,2..5)|AT]
- but only for a subset of FASTG features
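The split into FASTA plus markup, and the reverse merge, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it works on a single sequence string without headers or line breaks, constructs are not nested, the leading number of a construct counts the preceding bases it describes, and a construct's start location is the 0-based offset of the first of those bases. All function names are hypothetical.

```python
import re

# One bracketed construct; this sketch assumes constructs are never nested.
CONSTRUCT = re.compile(r"\[[^\[\]]*\]")


def covered_bases(construct: str) -> int:
    """Leading integer, read here as the number of preceding bases described."""
    return int(construct[1:].split(":", 1)[0])


def split_fastg(sequence: str):
    """Split a FASTG sequence string into (plain sequence, [(start, construct), ...])."""
    plain, markup = [], []
    n_bases = 0  # plain bases seen so far
    cursor = 0   # position in the FASTG string
    for match in CONSTRUCT.finditer(sequence):
        chunk = sequence[cursor:match.start()]
        plain.append(chunk)
        n_bases += len(chunk)
        construct = match.group()
        markup.append((n_bases - covered_bases(construct), construct))
        cursor = match.end()
    plain.append(sequence[cursor:])
    return "".join(plain), markup


def merge_fastg(plain: str, markup) -> str:
    """Re-insert each construct directly after the bases it describes."""
    out = []
    cursor = 0
    for start, construct in sorted(markup):
        end = start + covered_bases(construct)
        out.append(plain[cursor:end])
        out.append(construct)
        cursor = end
    out.append(plain[cursor:])
    return "".join(out)


fastg = "TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC"
plain, markup = split_fastg(fastg)
print(plain)   # TACCGCNNNNAGCCTGCC
print(markup)  # [(6, '[4:gap:size=(4,3..5)]')]
print(merge_fastg(plain, markup) == fastg)  # True
```

Because the markup stores offsets into the plain sequence, annotations made against the converted FASTA stay valid for the FASTG record.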
Coming soon: FASTG support in ALLPATHS-LG
• Our genome assembler ALLPATHS-LG will soon produce FASTG assemblies
• For the latest ALLPATHS-LG news visit our blog:
http://www.broadinstitute.org/software/allpaths-lg/blog/