FASTG: Representing the true information content of a genome assembly

Iain MacCallum, David B. Jaffe
Broad Institute of MIT and Harvard, Cambridge, MA
In collaboration with: Michael Schatz, Daniel Rokhsar, and Assemblathon Format Group

The Problem: FASTA is flat

• The FASTA format is the standard way to represent an assembly:

    >scaffold
    TACTAGGCNNNNNNNATTAGGCCG
    TGCNNNNNNNNNNNGCGCCGTTAC
    CATTCNNNNNNACTGCCGTTGACT

• Linear representation of the genome
• Easy to parse and human readable
• Provides a simple co-ordinate system that allows easy annotation
• Supported by many tools
• Has been the warhorse for ~20 years

But FASTA has a number of limitations:

FASTA forces assemblers to make mistakes

• Strictly linear nature forces assemblers to introduce errors:
  - Uncertain base or SNP (A or T between ACATT and TACTG): assembler forced to pick A or T
  - Uncertain tandem repeat (5, 6, or 7 Cs between ACATT and TACTG): assembler forced to choose the repeat length
  - Haplotype separation (CGAGG or AAGCC between ACATT and TACTG): assembler forced to choose a haplotype
• These simple events are difficult to represent in the FASTA format
• The assembler is forced to choose, resulting in a loss of information and errors.
• Quality scores, annotation, and IUPAC codes provide only a partial solution

FASTA cannot represent graph assemblies

• Not all assemblies can be reduced to a linear form, due to:
  - Polymorphism that cannot be linearized
  - Inversions that cannot be disambiguated
  - Long repeats that cannot be bridged with jumping data
• Assemblies must be broken into linear sections, losing information

Figure (panels: Genome, Assembly graph, FASTA): regions A, B, C, and D around a long imperfect repeat whose copies show a single base difference; jumping libraries too short to disambiguate the repeat.

The Solution: FASTG

• Enhanced FASTA to capture known complexity of the genome

    >assembly;
    ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTG
    GCTTATAC[1:alt|C,G]TACCCGATACGTTTACGGTA
    TACGAAAAA[5:tandem:size=(5,4..11)|A]TCT

• Superset of FASTA
• Faithfully represents genome assemblies without error or loss of information
• Hybrid approach:
  - preserves underlying linearity of the genome, but:
  - captures nonlinear complexity
• Easily converted to FASTA

FASTG encodes all ambiguities

• FASTG natively encodes ambiguities that are lost in FASTA:
  - Uncertain base or SNP: ACATT A[1:alt|A,T] TACTG
  - Uncertain tandem repeat: ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG
  - Haplotype separation: ACATT CGAGG[5:digraph:path=(a)|>a;CGAGG>b;AAGCC] TACTG
• FASTG retains this information
• Optional properties (probability, copy-number, etc.) can be associated with these events, e.g. A[1:alt:allele|A,T]

FASTG captures important gap information

• Gaps in FASTA often hide additional information, e.g. frame shifts

Figure: Assembly gap with complex contents (a graph of short sequences between CATAT and GATGT).

FASTA: any information about the gap content is lost

    >scaffold1
    CATATNNNNNNNNNNGATGT

FASTG: captures possible content of the gap

    >scaffold1;
    CATAT
    NNNNNNNNNN
    [10:gap:size=(10,5..20),start=(a,e),end=(d,g)|
    >a:b,c,d;CA
    >b:c,d;T
    >c:f,g;GAC
    >d;AA
    >e:f,g;TT
    >f:g;A
    >g;AGT
    ]
    GATGT

Putative gap sequences derived from FASTG, e.g.
  1) CATGACAAGT
  2) CATTTAA
  3) TTAAAAGT

FASTG captures graphs using a hierarchic approach

• FASTG is FASTA-like, preserves linearity and keeps local complexity local

Figure: assembly graph of ContigA, ContigB, and ContigC around a long repeat. The global graph structure is encoded in the record headers; the uncertain tandem repeat and the single base difference are encoded locally inside the records.

    >ContigA:ContigC;
    TCGA…[7:tandem:size=(7,6..9)|T]…CATG
    >ContigB:ContigC;
    ATAGCG…ATCCAT
    >ContigC:ContigA,ContigB;
    CGTA…[1:alt|C,G]…AATC

FASTG is FASTA compatible

• FASTG looks like FASTA
• FASTG is easy to use
• FASTG can easily be converted to FASTA by removing the FASTG extensions “[…]”
• Conversion can be done with a simple shell or perl script
• The resulting FASTA can be processed by existing tools
• FASTG and derived FASTA files share the same base co-ordinate system
• Any existing tool that works on FASTA can use converted FASTG
• No additional markup language required
• FASTG extensions plus start location = Markup
• FASTA + Markup will produce the original FASTG
• Can convert markup to existing annotation formats
  - but only for a subset of FASTG features

FASTA conversion: FASTG = FASTA + Markup

FASTG

    >contig1;
    TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC
    GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT
    TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC

FASTA

    >contig1
    TACCGCNNNNAGCCTGCC
    GTTATACCTCCCTGGATA
    CGTTTAGGATATATCC

Markup

    >contig1;
    6 [4:gap:size=(4,3..5)]
    26 [1:alt:allele|C,G]
    52 [6:tandem:size=(3,2..5)|AT]

Coming soon: FASTG support in ALLPATHS-LG

• Our genome assembler ALLPATHS-LG will soon produce FASTG assemblies
• For the latest ALLPATHS-LG news visit our blog: http://www.broadinstitute.org/software/allpaths-lg/blog/
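
The conversion described under "FASTG is FASTA compatible" (remove the "[…]" extensions to get FASTA, keep them with a start location as markup, and recombine the two to recover the FASTG) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' script. The function names `split_fastg` and `merge_markup` are hypothetical. Extensions are assumed to be "[…]" spans with no nested brackets, as in the examples on this poster. The markup offset used here (the number of bases preceding the extension) is an assumed convention and may not match the numbering in the Markup example.

```python
import re

# Assumption: an extension is a "[...]" span without nested brackets,
# which holds for every example shown on this poster.
EXTENSION = re.compile(r"\[[^\[\]]*\]")


def split_fastg(record):
    """Split one FASTG record into (name, plain sequence, markup).

    markup is a list of (offset, extension) pairs, where offset is the
    number of bases that precede the extension in the plain sequence
    (an illustrative convention, not taken from the poster).
    """
    header, _, body = record.strip().partition("\n")
    # ">ContigC:ContigA,ContigB;" -> "ContigC"
    name = header.lstrip(">").rstrip(";").split(":")[0]
    body = "".join(body.split())  # join lines, drop layout whitespace

    pieces, markup = [], []
    pos = offset = 0
    for match in EXTENSION.finditer(body):
        chunk = body[pos:match.start()]
        pieces.append(chunk)
        offset += len(chunk)
        markup.append((offset, match.group(0)))
        pos = match.end()
    pieces.append(body[pos:])
    return name, "".join(pieces), markup


def merge_markup(sequence, markup):
    """Re-insert extensions into a plain sequence (inverse of split_fastg)."""
    out, prev = [], 0
    for offset, extension in markup:
        out.append(sequence[prev:offset])
        out.append(extension)
        prev = offset
    out.append(sequence[prev:])
    return "".join(out)


# The gap example from the poster.
record = """>scaffold1;
CATATNNNNNNNNNN[10:gap:size=(10,5..20),start=(a,e),end=(d,g)|
>a:b,c,d;CA>b:c,d;T>c:f,g;GAC>d;AA>e:f,g;TT>f:g;A>g;AGT]GATGT
"""
name, sequence, markup = split_fastg(record)
print(">" + name)  # >scaffold1
print(sequence)    # CATATNNNNNNNNNNGATGT
```

Calling `merge_markup(sequence, markup)` on the result returns the record body with its extension restored, which mirrors the poster's statement that FASTA plus markup reproduces the original FASTG. Line breaks are not preserved in this sketch.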