Java implementation of the kFM-index

1 Introduction

This document provides information on the Java implementation of the kFM-index encoding the de Bruijn subgraph information. Note that in the implementation, nodes of the de Bruijn graph are assumed to be k-strings, with edges corresponding to (k+1)-strings. This differs from the article, in which nodes were (k−1)-strings and edges were k-strings. The reason for the discrepancy is that the objective of the article was to provide an index of the k-substrings of a set of strings, and the information-theoretic computations were performed with this in mind. In the implementation, however, it is more practical to think of the index as a map that, for each node, provides the list of in-edges to that node. The focus is thus on the node rather than on the edge, which makes it more natural to let k correspond to the length of the node strings.

The objective of the implementation is to represent de Bruijn subgraphs encoding the (k+1)-mer composition of DNA reads in a compact manner: we use (k+1)-mers since it is more convenient implementation-wise to let nodes represent k-mers. The core of the implementation is independent of DNA and may be used for other alphabets, although only the DNA alphabet is implemented at present.

The top-level package is wordtable, with main subpackages wordtable.list, wordtable.sequence etc., referred to as list etc. for brevity. The executable class is WordTable, which may execute the command line options, or invoke the application WordTableApp which opens the GUI WordTableView.

1.1 Usage and command line options

The command for starting the application is (possibly depending on operating system)

    java [java-options] -jar WordTable.jar [options]

where the Java option -Xmx<mem> may be used to specify how much memory the application is allowed to allocate, e.g. -Xmx4g for up to 4 GB.
Java will not be allowed to allocate more than the specified amount of memory, and without an explicit maximum it may easily run out of memory even if the computer has more memory available.

The default behaviour is to send text output to the console. However, if no output as text or to file is requested, the GUI is opened by default. Either may be overruled by using the option --gui to launch the GUI, or --no-gui (or --batch) to prevent the GUI. The command line options for processing sequence data will be executed in either case, sending the text output to the console or to the GUI log view, and saving file output to the specified file names.

The GUI will, after processing sequence data into a kFM-index, open a view of the kFM-index as a table. The GUI also provides a resource view which shows how much memory is available, allocated, and in use. Remember that memory may be registered as in use even if the application is no longer using it: it is only released after garbage collection, which can be called explicitly at any time.

For an overview of command line options, run WordTable with the option --help (or -h).

2 Implementation

2.1 General overview

The DNA alphabet consists of the tokens {A, C, G, T}. In general, we represent the alphabet (DNA or other) by a set of tokens enumerated 0, ..., m − 1. The size of the token set, m, needs to be fixed, so dynamic token sets are not possible in this implementation. The TokenSet interface specifies methods for converting char to token (int) and back, and is used when converting between sequence data as text and sequences stored in memory as tokens.

I will use the term string in this document also to refer to a sequence of tokens, in cases where the tokens represent strings/sequences or substrings of these. Hence, a string of tokens need not be stored as a String object, but may be a bit-packed encoding of the string.
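As an illustration of what a bit-packed encoding of a token string can look like, the following minimal sketch packs a short DNA string into a single long using 2 bits per token. The class PackedDnaSketch, its methods, and the choice of putting the first token in the lowest bits are assumptions made for this example; they are not taken from the wordtable packages.

```java
// Hypothetical illustration; none of these names come from the wordtable packages.
final class PackedDnaSketch {
    private static final String LETTERS = "ACGT"; // token = index in this string (assumed order)
    private static final int BITS = 2;            // ceil(log2(4)) bits per token

    /** Converts a character to its token 0..3, or -1 if it is not a DNA letter. */
    static int charToToken(char ch) {
        return LETTERS.indexOf(Character.toUpperCase(ch));
    }

    /** Packs up to 32 tokens into one long, first token in the lowest bits. */
    static long pack(CharSequence text) {
        if (text.length() > 64 / BITS) {
            throw new IllegalArgumentException("at most 32 tokens fit in a long");
        }
        long packed = 0L;
        for (int i = 0; i < text.length(); i++) {
            int token = charToToken(text.charAt(i));
            if (token < 0) {
                throw new IllegalArgumentException("not a DNA letter: " + text.charAt(i));
            }
            packed |= ((long) token) << (BITS * i);
        }
        return packed;
    }

    /** Recovers a text of the given length from a packed value. */
    static String unpack(long packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int token = (int) ((packed >>> (BITS * i)) & 3L);
            sb.append(LETTERS.charAt(token));
        }
        return sb.toString();
    }
}
```

With this layout, pack("ACGT") gives 0 + 1·4 + 2·16 + 3·64 = 228, and unpack(228, 4) returns "ACGT". The length has to be kept separately, since trailing A tokens are indistinguishable from unused bits.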
Methods for representing and storing strings/sequences as strings of tokens exist (sequence package), as well as for reading sequences from a Fasta or Fastq file.

The kFM-index is not constructed directly from the original strings. Instead, a list of edges is generated. This is then sorted (by node), and the kFM-index is generated from the sorted list of edges. Because of this, there is, in addition to the implementation of the kFM-index and supporting classes, an implementation for storing lists of edges in a compact manner. However, even this compact storage of a list of edges is far less compact than the kFM-index. The main aim of the implementation is to demonstrate the final kFM-index, not optimal generation of the kFM-index from the original strings in terms of speed or intermediate memory consumption.

2.2 Computation of the kFM-index from a set of sequences

Starting with a set of strings/sequences, in memory or in a file, the kFM-index corresponding to all (k+1)-substrings (i.e. the de Bruijn subgraph of degree k in which nodes correspond to k-strings) is generated in a few steps.

Read sequence data: The sequence data may be stored in memory or read from file, e.g. from a Fasta file using the FastaReader.

Convert sequence data to in-edge list: An InEdgeList object is created, and the sequences are added to the list. This adds all k-substrings of the sequences. The last k-substring is marked as a final node, i.e. one that may be without out-going edges.

Sort and complete the in-edge list with final edges: After all in-edges have been added to the in-edge list, a completion step is run (using the complete method). This sorts the in-edges by node, removes duplicates, and adds finalising edges and nodes so that the special node $^k can be reached from all nodes in the graph.

Generate in-edge main data for the kFM-index: The InEdgeData may be generated from the completed in-edge list.
The user is responsible for ensuring that complete() is called on the InEdgeList object before using it to construct the InEdgeData object. In practice, this step will normally be done not on an InEdgeData object, but as part of constructing the final KFMindex, which contains both the main data of the kFM-index and the auxiliary data used to enable fast look-up in the table.

Add the kFM-index look-up table for fast walking of the graph: The KFMindex object, in addition to the main in-edge data it inherits as an extension of the InEdgeData class, also contains precomputed values that allow the previous node map, ρ(a, i), to be computed efficiently. This is done by precomputing ρ(a, i) for a subset of the positions i (i.e. at regular intervals), and then computing ρ(a, i) in between these precomputed values on demand. The precomputation is performed by a call to computeKFMarray after all the in-edges have been added to the InEdgeData. Since KFMindex is an extension of InEdgeData, the construction of the object's InEdgeData data and its completion as a KFMindex object are normally done in one step by calling the constructor as new KFMindex(inEdgeList) with the completed InEdgeList object as argument.

Generate kFM-indices on partitioned sequence data and merge these: A KFMaggregator object handles the splitting of sequence data into blocks, generates a kFM-index for each block, and subsequently merges these into one complete kFM-index. This is useful if the data is too big to process in one go.
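The idea of storing a function at regular intervals and filling in the values in between on demand can be sketched as follows. This is not the actual ρ(a, i), whose definition is given in the manuscript: as a stand-in, the sketch uses a plain count of set flags before position i, and the class name, the step parameter, and the choice to count upwards from the checkpoint at or below i are all assumptions made for the example.

```java
// Hypothetical illustration of a sampled prefix count; not the wordtable implementation.
final class SampledCountSketch {
    private final boolean[] flags; // flags[j]: position j carries the in-edge of interest
    private final int step;        // distance between stored checkpoints
    private final long[] stored;   // stored[r] = number of set flags before position r*step

    SampledCountSketch(boolean[] flags, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be positive");
        }
        this.flags = flags.clone();
        this.step = step;
        this.stored = new long[flags.length / step + 1];
        long count = 0;
        for (int j = 0; j < flags.length; j++) {
            if (j % step == 0) {
                stored[j / step] = count; // checkpoint before looking at position j
            }
            if (flags[j]) {
                count++;
            }
        }
        if (flags.length % step == 0) {
            stored[flags.length / step] = count; // checkpoint at the very end
        }
    }

    /** Number of set flags at positions 0..i-1, for 0 <= i <= flags.length. */
    long count(int i) {
        int r = i / step;          // nearest checkpoint at or below i
        long count = stored[r];
        for (int j = r * step; j < i; j++) {
            if (flags[j]) {
                count++;           // fill in the gap on demand
            }
        }
        return count;
    }
}
```

The step size trades memory for time: a query scans at most step − 1 flags, while the stored array holds roughly n/step values for n positions.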
The combined code to generate a kFM-index from a Fasta file can be as simple as follows:

    import java.io.File;
    import java.io.FileNotFoundException;

    KFMindex kFMindexFromFastaFile(File file, int order) {
        // Set up list for in-edges of given order, with initial size based on file size
        InEdgeList list = InEdgeList.getNew(DNA.get(), order, (int) file.length());
        try {
            FastaReader fasta = new FastaReader(file); // Open Fasta file
            for (TokenIterator seq : fasta) {
                list.add(seq); // Add sequences one by one
            }
            fasta.close();
        } catch (FileNotFoundException ex) {
            System.err.println("ERROR: " + ex);
            return null;
        }
        // Complete list (sort and add finalising edges)
        list.complete();
        // Produce kFM-index: adds list to main data, then computes indexing data
        KFMindex index = new KFMindex(list);
        return index;
    }

Note that an initial size is provided when constructing the InEdgeList object. An initial size that is big enough to contain all the in-edges generated when parsing the sequence data avoids having to grow the array, which can take up substantial memory when it takes place. Guessing the number of in-edges from the file size should mostly be quite adequate, unless the Fasta file contains a substantial portion of non-sequence text.

The same can be done using a KFMaggregator object, which does the KFMindex creation for us.
Processing a Fastq file, excluding nucleotides that are unknown or have a quality score below a given cutoff (e.g. 20), can be done as simply as follows:

    import java.io.File;
    import java.io.FileNotFoundException;

    KFMindex kFMindexFromFastqFile(File file, int order, int quality, int blocksize) {
        // Set up kFM-index aggregator, splitting input into subsets of size blocksize
        KFMaggregator aggr = new KFMaggregator(DNA.get(), order, blocksize);
        try {
            FastqReader fastq = new FastqReader(file); // Open Fastq file
            fastq.setQualityCutoff(quality);           // Set quality cutoff
            aggr.addAll(fastq);                        // Add all sequences
            fastq.close();
        } catch (FileNotFoundException ex) {
            System.err.println("ERROR: " + ex);
            return null;
        }
        KFMindex index = aggr.getKFMindex();
        return index;
    }

2.3 Class diagrams

The class diagrams indicate the main classes: abstract classes and interfaces are green and yellow, while regular classes are blue. The type of class is also indicated in the upper right corner, and the package in the upper left corner. Arrows with solid lines indicate inheritance (filled arrowhead) or interface implementation (empty arrowhead). Arrows with dashed and dotted lines indicate that the pointing class references the class pointed to: dashed lines indicate that objects of the class pointed to are produced by objects of the pointing class, while dotted lines merely indicate that the pointing class references or operates on the class pointed to.

Key methods may be listed, but the lists are not complete: the main focus is on public methods that are critical for understanding and using the classes. Important constructors are listed first, followed by important methods, and finally important static methods. For the sake of brevity, methods are only listed where their interface is first defined, not in implementing or extending classes. Also for the sake of brevity, "..." is used to shorten the list of arguments while keeping the essentials, or to shorten the list of similar methods in a class: i.e.
the main role of presenting methods in the class diagrams is to indicate which functions the class performs and what data it contains, not to give an overview of the API.

2.4 Implementation details

Java must be given access to sufficient memory, e.g. by the -Xmx option on the command line. There is a resource monitor available in this implementation which indicates how much memory is made available, how much has been allocated, and how much is actually in use. It may be necessary to run the garbage collector after the kFM-index has been generated to see how much memory is actually required for storing it, since memory used in the intermediate steps may not have been freed yet.

[Class diagram: BitpackedList, CumulativeArray, BitpackedCumulativeArray, InEdgeData, KFMindex, KFMindexInterval, KFMaggregator, KFMmerger and UnusedFinals, with their key methods.]

Figure 1: The main class for storing the kFM-index is KFMindex. The in-edge storage is covered by InEdgeData, which relies on BitpackedList for compact storage, while the stored values of the kFM-index are handled by a cumulative array. The KFMindexInterval references an interval of the kFM-index, e.g. nodes with a given prefix. Utility classes KFMmerger and UnusedFinals implement algorithms for merging kFM-indices and removing superfluous nodes, and are included in the diagram (in grey) for that reason only.

2.4.1 Array and list sizes and indexing

Java arrays are indexed using int values. A list containing more than 2^31 ≈ 2 · 10^9 items can therefore not be represented as a Java array. The lists used to represent kFM-indices are not pure arrays, but are made from arrays of arrays, and so overcome this limitation. This also allows them to grow more easily. Normally, a growable array is grown by replacing it with a larger array and copying the values, a process that temporarily requires considerable memory overhead. When implemented as an array of arrays, the main array is itself much smaller since it only stores the references to the contained data arrays, and only these references are copied rather than the values themselves.

The kFM-index is indexed using 64 bit long addresses, not the 32 bit int used for regular Java arrays, with the vertex data stored using arrays of arrays as described above. Mapping the long addresses to array positions requires splitting each long address into int array indices. Some choices of settings for this may fail, e.g.
if the arrays containing the data are set to be so small that the number of such arrays exceeds what the array of arrays can store, but I do not think this represents a practical issue.

While the kFM-index related classes support large lists indexed by long position values, the InEdgeList used to store a list of in-edges for generating kFM-indices is implemented as an array and only supports indexing by int, and hence is limited by the maximal array size. Since each stored block in the IntBlockList, which underlies InEdgeList, may require more than one value per block (each value of type int, since IntBlockList stores the data in an int array), the number of blocks it can hold will be ≈ 2 · 10^9 / δ, where δ is the number of values per block.

Note that the sequence classes allow for sequences that are longer than 2 · 10^9 and have positions indexed by long values.

3 The kFM-index classes

The main classes for storing and manipulating the kFM-index are contained in the list package. Figure 1 illustrates the class structures related to the KFMindex class, which is used to store the main information of the kFM-index and provides methods for accessing this information. In order to provide compact storage, the BitpackedList class is used, which packs multiple data items into a long and then stores these as an array of long values. Objects of KFMindexInterval represent intervals of indexes in the kFM-index, which are used when retrieving the interval of strings in the kFM-index list with a specific prefix.

The correspondence between algorithms and functions in the manuscript and class methods is as follows:

(KFMindex) prevPos(pos,token): This corresponds to the previous node position function, ρ(a, i), where pos = i is the index position in the list of nodes, and token = a is the token (where tokens are enumerated 0, 1, ...).
The implementation is slightly different from that in the manuscript in that it looks up the next stored value of ρ(a, i) and works its way down to i: in the manuscript, it starts at the previous stored value and works its way up to i.

(KFMindexInterval) backtrack(token): This corresponds to one step in the computation of [α(x), β(x)⟩: i.e. the computation of [α(ax), β(ax)⟩ from [α(x), β(x)⟩ where token = a. In practice, calls to backtrack are delegated to the method backtrackInterval for the first k tokens, which thus corresponds to the Interval function described in the manuscript, and to backtrackPosition for subsequent tokens.

(KFMindex) nodeTokens(pos): This corresponds to the Vertex function defined in the manuscript, which recovers the string (token sequence) that a particular node represents. It relies on calls to kfmSolve(value), which corresponds to the function ρinv(i).

The specification of which positions of ρ(a, i) are stored, i.e. the list i_0 < ... < i_m, is implemented in (KFMindex) kfmIndexPos(index), where index = r. This is implemented as i_r = ⌊rm/n⌋, but the only assumptions are that it should be increasing, with i_0 = 0 and i_m = n. The method (KFMindex) kfmNearbyIndex(pos) is used to find a nearby stored position, and is implemented as ⌊rm/n − 1/2⌋; the only assumptions made, however, are that for i_r it should return r, and that it should be non-decreasing.

3.1 BitpackedList

A BitpackedList object implements a growable list of items, indexed by 64 bit long position, where each item requires a fixed number of bits. These items are first packed into long values, with a fixed number of items per long value. The long values are then put into arrays of fixed length, which in turn are stored in a growable array.
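The layering described for BitpackedList (items packed into long values, long values in fixed-length arrays, fixed-length arrays in a growable outer array) can be sketched as follows. This is a simplified stand-in written for this document, not the wordtable class: the name ChunkedBitListSketch, the chunk length of 64, and the restriction to 1..32 bits per item are assumptions.

```java
import java.util.Arrays;

// Hypothetical illustration of an array-of-arrays bit-packed list.
final class ChunkedBitListSketch {
    private static final int CHUNK_LONGS = 64; // longs per inner array; small, for illustration
    private final int bits;                    // bits per item
    private final int perLong;                 // items packed into each long
    private final long mask;                   // lowest 'bits' bits set
    private long[][] chunks = new long[4][];   // outer array holds references only
    private long size = 0;

    ChunkedBitListSketch(int bits) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be in 1..32");
        }
        this.bits = bits;
        this.perLong = 64 / bits;
        this.mask = (1L << bits) - 1;
    }

    long size() {
        return size;
    }

    /** Appends a value; only its lowest 'bits' bits are kept. */
    void put(long value) {
        long word = size / perLong;
        int chunk = (int) (word / CHUNK_LONGS);
        if (chunk == chunks.length) {
            // Growing copies the references to the inner arrays, not the data itself.
            chunks = Arrays.copyOf(chunks, 2 * chunks.length);
        }
        if (chunks[chunk] == null) {
            chunks[chunk] = new long[CHUNK_LONGS];
        }
        size++;
        set(size - 1, value);
    }

    long get(long pos) {
        checkPos(pos);
        long word = pos / perLong;
        int shift = (int) (pos % perLong) * bits;
        return (chunks[(int) (word / CHUNK_LONGS)][(int) (word % CHUNK_LONGS)] >>> shift) & mask;
    }

    void set(long pos, long value) {
        checkPos(pos);
        long word = pos / perLong;
        int shift = (int) (pos % perLong) * bits;
        long[] chunk = chunks[(int) (word / CHUNK_LONGS)];
        int idx = (int) (word % CHUNK_LONGS);
        chunk[idx] = (chunk[idx] & ~(mask << shift)) | ((value & mask) << shift);
    }

    private void checkPos(long pos) {
        if (pos < 0 || pos >= size) {
            throw new IndexOutOfBoundsException("position " + pos);
        }
    }
}
```

Positions are long values throughout; they are split into an outer index, an inner index, and a bit offset, each of which fits in an int.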
In this manner, the list can hold more items than can be indexed by an int, and can also grow without having to duplicate the data: when the referencing array is grown, only the references to the arrays holding the data have to be copied, not the data itself.

3.2 InEdgeData

The KFMindex is the main class of the kFM-index implementation. However, in order to help structure the code, the implementation has been divided in two, with InEdgeData implementing the part containing the main data and KFMindex completing the kFM-index implementation with the auxiliary data and methods for kFM-index look-up.

The storage of the main data is implemented in InEdgeData as an extension of BitpackedList: the in-edge bits (one bit per token of the token set), and a group end flag (one bit) indicating the end of each node group, where nodes with the same k − 1 prefix string are grouped together. Thus, the BitpackedList part of the implementation handles the general data storage and retrieval mechanisms, while the InEdgeData extension provides methods specific to storing in-edges.

3.3 KFMindex

The KFMindex adds to InEdgeData the kFM-index, which provides a map from each node (identified by its position in the table) to the nodes from which in-coming edges originate. This index, ρ(a, i), named prevPos in the implementation, is not stored in full as it can be computed from the main data (available from the InEdgeData implementation). However, for efficient computation, a subset is stored as an array of non-decreasing values using an implementation of CumulativeArray.

In essence, a kFM-index is made by first filling the KFMindex object with in-edge data, i.e. the part covered by the InEdgeData part of the implementation. After the in-edge data has been entered, a subset of the values of the kFM-index function ρ(a, i), implemented as the prevPos function, is stored in order to be able to compute the function efficiently. In the implementation, ρ(a, i) is stored for values i_0, ..., i_m, where m is the size of the stored index. The function kfmIndexPos(r) returns i_r, while kfmNearbyIndex is used to return the r such that i ≤ i_r is nearby. When calling prevPos to compute ρ(a, i), the function kfmNearbyIndex is used to find a nearby i_r with i ≤ i_r, the value of ρ(a, i_r) is looked up, which technically is stored as κ(r + am) = ρ(a, i_r), and ρ(a, i) is then computed from this. In the present implementation, i_r = kfmIndexPos(r) = ⌊rm/n⌋ is used, which makes kfmNearbyIndex(i) = ⌈in/m + 1/2⌉.

Generation of the κ array from the stored in-edge data is done by calling computeKFMarray, providing the desired array size, m, or allowing a default choice to be made based on the size of the index, n.

The inverse of ρ(a, i) is defined by ρinv(j) = (a, i), where ρ(a, i) = j while ρ(a, i + 1) = j + 1, and is implemented as kfmSolve. The pair (a, i) is represented by am + i, and thus kfmSolve(j) returns am + i.

3.4 CumulativeArray and BitpackedCumulativeArray

The stored values, ρ(a, i_r) for 0 = i_0 < ... < i_m = n, may be coded into a non-decreasing array κ(r + am) = ρ(a, i_r). If i_r − i_{r−1} ≤ q for all r, the effect is that κ is non-decreasing with increments at most q. There are several ways in which this can be stored efficiently. The abstract class CumulativeArray only declares the required interface, while BitpackedCumulativeArray provides an implementation: other implementations were made prior to this, and are still available although not in use.

Basically, BitpackedCumulativeArray separates κ(j) = u_j + q·v_j where u_j ∈ {0, ..., q − 1}, and where q is chosen to be a power of 2. This results in u_j which can be stored in a bit-packed list, while v_j is a list where increments are either 0 or 1. By storing ∆v_j = v_j − v_{j−1} ∈ {0, 1} as a list of bits, and only storing v_j for j a whole multiple of 64 (corresponding to sets of 64 bits of ∆v_j stored in a single 64 bit long), memory consumption can be kept low while arbitrary values of κ can still be retrieved quite fast.
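The split κ(j) = u_j + q·v_j can be sketched as follows. The sketch is an assumption-laden simplification rather than the BitpackedCumulativeArray class itself: the name CumulativeSplitSketch is made up, the low parts u_j are kept in a plain int array instead of a bit-packed list, and the input is assumed to be non-negative.

```java
// Hypothetical illustration of storing a slowly increasing array as low bits plus delta bits.
final class CumulativeSplitSketch {
    private final int shift;        // q = 2^shift
    private final int[] low;        // u_j = kappa(j) mod q (bit-packed in a real implementation)
    private final long[] deltaBits; // bit (j mod 64) of word j/64 set iff v_j - v_(j-1) = 1
    private final long[] base;      // base[w] = v_(64w), stored in full

    CumulativeSplitSketch(long[] kappa, int shift) {
        if (shift < 1 || shift > 30) {
            throw new IllegalArgumentException("shift must be in 1..30");
        }
        this.shift = shift;
        int words = (kappa.length + 63) / 64;
        this.low = new int[kappa.length];
        this.deltaBits = new long[words];
        this.base = new long[words];
        long prevV = 0;
        for (int j = 0; j < kappa.length; j++) {
            long v = kappa[j] >>> shift;
            low[j] = (int) (kappa[j] & ((1L << shift) - 1));
            if (j % 64 == 0) {
                base[j / 64] = v; // no delta bit needed at word boundaries
            } else {
                long dv = v - prevV;
                if (dv < 0 || dv > 1) {
                    throw new IllegalArgumentException("step outside [0, q] at position " + j);
                }
                if (dv == 1) {
                    deltaBits[j / 64] |= 1L << (j % 64);
                }
            }
            prevV = v;
        }
    }

    /** Returns kappa(j), rebuilt from the stored base value, delta bits and low bits. */
    long get(int j) {
        int w = j / 64;
        int b = j % 64;
        long mask = (-1L >>> (63 - b)) & ~1L; // bits 1..b of the word
        long v = base[w] + Long.bitCount(deltaBits[w] & mask);
        return (v << shift) | low[j];
    }
}
```

Because κ increases by at most q per step, ⌊κ(j)/q⌋ increases by at most 1, which is what allows one bit per position. A look-up costs one population count (Long.bitCount) on a single long.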
3.5 KFMindexInterval

The KFMindexInterval objects specify an interval of the kFM-index list. The nodes with a given p-string prefix correspond to such an interval, and methods for computing the interval from the string are implemented here. So are methods for backtracking through the de Bruijn subgraph starting at a given node: e.g. a node determined from a given k-string.

3.6 KFMmerger

The KFMmerger class takes two kFM-indices, i.e. two KFMindex objects, and merges them, returning a merged kFM-index. If the total sequence data is large, producing the kFM-index directly by making a list of all in-edges in an InEdgeList object may be too memory demanding, since this requires storing all (k+1)-mers. Instead, the sequence data may be split into more manageable parts, a kFM-index made for each, and pairwise mergers performed until all have been merged into one big kFM-index.

When KFMmerger performs a merge of two kFM-indices, it has the option to delete the data from the source kFM-indices and thus release the data as it is used. Since the target kFM-index is grown on demand as it is being created, allowing the memory held by the source kFM-indices to be released at the same time avoids the memory overhead of holding the two source kFM-indices and the target kFM-index in full at once. The KFMaggregator does this automatically, and does so safely since the intermediate kFM-indices holding subsets of the data are not visible to the user.

Memory overhead required for the merger is 3p + 2q bit. This consists of arrays storing the order of elements from the two indices (p + q bit), which nodes in the two indices correspond (p bit), and node groups (same k − 1 prefix) found in both indices (p + q bit to mark corresponding nodes that are not the last node in the group).
Calls to the KFMmerger class should normally be done through the static merge function in KFMindex, and the use of KFMmerger should thus remain invisible to the user.

[Class diagram: IntBlockList, InEdgeList, Sorter and the Sortable interface, with their key methods.]

Figure 2: The main class for storing a list of edges is InEdgeList, which is used to store and process in-edges read from sequence data, stored as k-string node values with in-edge token data. The IntBlockList provides the general data structure for storing these in a compact manner, as well as Sortable methods that allow Sorter to sort the list.

3.7 UnusedFinals

When kFM-indices are merged, final-completing nodes required to make nodes accessible from the root node (or final node) may become superfluous and could be trimmed away. The process of identifying final-completing nodes that are not required uses the internal class UnusedFinals of KFMindex to scan through the tree of final-completing nodes, recursively removing unneeded edges and marking nodes left unused as superfluous. These can then be removed from the kFM-index.

The method pruneFinalisingNodes in KFMindex uses an UnusedFinals object to find all final-completing nodes that can be removed, removes them, and updates the precomputed and stored part of the kFM-index. The use of the UnusedFinals class should normally remain invisible to the user.

3.8 KFMaggregator

The KFMaggregator class is a utility class for building a kFM-index from sequence data.
It encapsulates the passage of sequence data, through InEdgeList objects, into KFMindex objects, splitting the sequence data into suitable blocks and in turn merging the subset kFM-indices into a final, complete kFM-index. Note that it will not split data within a sequence, only between sequences, so if it is run on a very long sequence, like a large genome, the entire sequence will be processed as one block.

When two kFM-indices are merged by the aggregator, the deleteSources flag is set, which makes the KFMmerger remove the data from the kFM-indices that are being merged to free up memory. This way, only a minimum of data is kept in duplicate, and the memory required is little more than that required by the two source kFM-indices: if the source kFM-indices were not removed, memory would be required to store both the source kFM-indices and the merged kFM-index.

If the data are split into q blocks, q kFM-indices are created. If we enumerate the blocks 0, ..., q − 1, blocks 2r and 2r + 1 are merged in the first level of merger, leaving a new list of kFM-indices enumerated 0, ..., ⌊q/2⌋ − 1. This is then repeated until they have been merged into one kFM-index. This process may be thought of as a binary tree with the initial q kFM-indices as leaves. In the implementation, kFM-indices are merged immediately once both are available: i.e. whenever two child nodes in the binary tree have been populated, they are merged. This may help reduce memory use in cases where intermediate kFM-indices contain overlapping sets of nodes, since the merged kFM-index will contain fewer nodes than the two source kFM-indices combined.

In the present implementation, everything runs on one thread. A potential improvement could be to have the mergers run on separate threads.

4 Storing a list of edges compactly

As a first step in generating the kFM-index for a set of strings, a list of corresponding edges must be produced.
Edges here are meant to be in-edges to a node, and so the edge consists of a k-string corresponding to a node and a token (1-string) indicating the edge (i.e. the preceding token in the original string). However, in addition to the tokens used to make up the string, the sequences are padded at the end with $, and so nodes may correspond to a p-string padded with $^(k−p) (i.e. the $ token k − p times).

In order to reduce overhead, this edge information is stored in a compact manner in an int array. Each int is 32 bit long, and a whole number of int values is used per edge. This is managed by the IntBlockList, which implements a growable array for packing data items (i.e. edges) into an int array without the memory overhead of creating an object for each edge.

4.1 IntBlockList

An IntBlockList object contains a growable list of items where each item is represented by a block of int values (of fixed size specified at creation).

4.2 InEdgeList

The InEdgeList is an extension of IntBlockList for which the stored items are edges represented by a node and a token indicating the in-edge. Each edge is stored with three pieces of information: the k-string of the node (using the 0 token for $); the position of $ in or relative to the k-string (0 for the first position, k − 1 for the last position, k for immediately after the k-string, −1 otherwise or as a default); and data specifying the in-edge (or no in-edge).

The data is packed, using the IntBlockList implementation, into r int values, allowing 32r bit to be stored in each pack. If the node is v ∈ Σ^p ∘ $^(k−p), i.e. a p-word followed by k − p $ tokens, and the edge is av for some a ∈ Σ, we store v where each token of Σ is represented by a number 0, ..., σ − 1 and $ is stored as 0; the number p indicating the position of the first $ token is stored; and the edge is specified by storing a, again as a number 0, ..., σ. For DNA, σ = 4 and so it requires 2 bit per letter, plus additional bits to store p.
The bit-packing looks like this

\[
\overbrace{\underbrace{v_1 \ldots v_p}_{p \times s\ \text{bit}}\ \underbrace{\$ \ldots \$}_{(k-p) \times s\ \text{bit}}\ 0 \ldots 0\ \underbrace{p}_{t\ \text{bit}}\ \underbrace{a}_{s'\ \text{bit}}}^{32r\ \text{bit}} \tag{1}
\]

where s = ⌈log2 σ⌉ bit are required per token, s′ = ⌈log2(σ + 1)⌉ bit are used to store the in-edge token a, with 2^s′ − 1 used to represent a vertex without an in-coming edge (the start of the sequence), t = ⌈log2(k + 2)⌉ bit are used to store p, and the remaining 32r − ks − t − s′ bit are padded with 0. The value of p is stored as a regular binary number for p = 0, ..., k − 1, but the value p = k is only used temporarily to mark final nodes, while 2^t − 1 is used for regular nodes that contain no $ tokens. The special marking of the final nodes is later used to help generate the final-completing nodes, after which only the values 0, ..., k − 1 and 2^t − 1 are used.

4.3 Sorter and Sortable

The Sorter and Sortable classes are both in the util package. The Sorter implements the heap sort algorithm for items available through the Sortable interface. This interface does not require access to the actual values stored in the Sortable object; instead, it provides methods for comparing values and reordering them. This is used to sort edges stored in an InEdgeList object, which in turn are blocks of int values in an array administered by IntBlockList.

The main use of Sorter is to perform a heap sort on the entire list. It can also perform an updated sort when additional values have been added to the list after a previous sorting.

5 Tokens and sequences

The original sequence data is thought of as a sequence of tokens, or a string of characters: the terms are sometimes used interchangeably, but with the distinction that a character has data type char while a token is a number of type int. The tokens are specified through a TokenSet object, which provides the correspondence between tokens and characters for conversion between text strings (e.g. input from a String or a Fasta file) and token sequences.
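The correspondence between characters and tokens can be sketched with a small interface and a fixed-alphabet implementation. The names TokenSetSketch and FixedAlphabetSketch are hypothetical, as is the convention of returning -1 for characters outside the alphabet and the use of int[] for token sequences; only the method names size, charToToken, tokenToChar and charsToTokens echo those mentioned for TokenSet in this document.

```java
// Hypothetical illustration of a fixed token set; not the wordtable TokenSet interface.
interface TokenSetSketch {
    int size();                   // number of tokens, fixed at construction
    int charToToken(char ch);     // -1 for characters outside the alphabet (assumed convention)
    char tokenToChar(int token);

    /** Converts text to tokens, rejecting characters outside the alphabet. */
    default int[] charsToTokens(CharSequence text) {
        int[] tokens = new int[text.length()];
        for (int i = 0; i < tokens.length; i++) {
            int token = charToToken(text.charAt(i));
            if (token < 0) {
                throw new IllegalArgumentException("unknown character: " + text.charAt(i));
            }
            tokens[i] = token;
        }
        return tokens;
    }

    /** Converts tokens back to text. */
    default String tokensToChars(int[] tokens) {
        StringBuilder sb = new StringBuilder(tokens.length);
        for (int token : tokens) {
            sb.append(tokenToChar(token));
        }
        return sb.toString();
    }
}

final class FixedAlphabetSketch implements TokenSetSketch {
    private final char[] letters; // token t corresponds to letters[t]

    FixedAlphabetSketch(String letters) {
        this.letters = letters.toCharArray();
    }

    @Override
    public int size() {
        return letters.length;
    }

    @Override
    public int charToToken(char ch) {
        for (int t = 0; t < letters.length; t++) {
            if (letters[t] == ch) {
                return t;
            }
        }
        return -1;
    }

    @Override
    public char tokenToChar(int token) {
        return letters[token];
    }
}
```

For example, with new FixedAlphabetSketch("ACGT"), the text "GATTACA" converts to the tokens 2, 0, 3, 3, 0, 1, 0 and back again.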
The core interfaces for dealing with tokens and sequences are all found in the sequence package. This package also contains abstract implementations of these interfaces providing the main methods. Concrete implementations for DNA sequences are found in the sequence.dna package.

[Class diagram: interfaces Sequence, SequenceConstructor, TokenSequence, TokenIterator and TokenSet; abstract classes AbstractSequence, AbstractTokenSequence, AbstractTokenSet, AbstractAlphabet and PackedSequence; concrete classes DNAsequence and DNAtokens, with their key methods.]

Figure 3: Classes for specifying token sets and handling sequences. Some of the methods are required to return sequences, not just as Sequence objects, but as belonging to a particular concrete sequence class: e.g. the DNAtokens token set can be used to construct DNA sequences, and will return these as DNAsequence objects, being an implementation of SequenceConstructor<DNAsequence>.

5.1 TokenSet

Implementations of the TokenSet interface are used to specify the alphabet from which sequences are made.
A token set provides the size of the alphabet and the correspondence between tokens and characters. There is an implementation, DNAtokens, representing the four DNA nucleotides, but the implementation of the kFM-index is independent of the alphabet throughout: it only requires a static alphabet of fixed, known size, i.e. no dynamic alphabets that allow letters to be added. The SequenceConstructor interface is intended to be used with concrete implementations of TokenSet so that the token set can be used to convert sequences of tokens into Sequence objects. The interface takes the concrete extension of Sequence as a generic parameter, indicating what type of sequence is returned. This is needed in part because Sequence objects need to refer to a specific token set, while the TokenSequence interface merely returns a sequence of tokens (int values) without any reference to the token set. The TokenSet and SequenceConstructor interfaces have been defined as separate interfaces, but it might be natural to integrate them into one: the only present implementation of either is DNAtokens, and it might be argued that any implementation should implement both simultaneously. However, implementations of SequenceConstructor need to specify a concrete Sequence implementation, while token sets do not require this.

5.2 Sequence

The Sequence interface represents token sequences of known length over a given token set. It is used for sequences stored in memory: it provides random access to the sequence tokens. The interface takes a Sequence extending class as a generic parameter: i.e. it is defined as Sequence<S extends Sequence>. The S class specified should be the concrete implementation, which should be final, and is used to specify the type of sequence returned by methods of the class. Since the methods of any concrete implementation of Sequence should return sequences of the same class, the definition should be final S<S> to ensure this.
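The self-referencing generic parameter can be illustrated with a small sketch. The names SeqSketch and DnaSeqSketch are hypothetical, the sketch uses the stricter recursive bound S extends SeqSketch<S> rather than the raw bound quoted above, and it stores tokens in a plain int[] (assuming A=0, C=1, G=2, T=3) instead of a bit-packed array.

```java
import java.util.Arrays;

/** Self-typed sequence interface: S is the concrete implementing class. */
interface SeqSketch<S extends SeqSketch<S>> {
    long length();

    int tokenAt(long pos);

    /** Returns the concrete type S, not just a SeqSketch. */
    S subsequence(long start, long len);
}

/** Hypothetical concrete sequence: final, and passes itself as S. */
final class DnaSeqSketch implements SeqSketch<DnaSeqSketch> {
    // Plain int[] storage for brevity; tokens assumed A=0, C=1, G=2, T=3.
    private final int[] tokens;

    DnaSeqSketch(int[] tokens) {
        this.tokens = tokens.clone();
    }

    @Override
    public long length() {
        return tokens.length;
    }

    @Override
    public int tokenAt(long pos) {
        return tokens[Math.toIntExact(pos)];
    }

    @Override
    public DnaSeqSketch subsequence(long start, long len) {
        int from = Math.toIntExact(start);
        int to = Math.toIntExact(start + len);
        return new DnaSeqSketch(Arrays.copyOfRange(tokens, from, to));
    }

    /** DNA-specific method that is not part of the general interface. */
    DnaSeqSketch complement() {
        int[] result = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = 3 - tokens[i]; // swaps A with T and C with G
        }
        return new DnaSeqSketch(result);
    }

    public static void main(String[] args) {
        DnaSeqSketch seq = new DnaSeqSketch(new int[] {2, 0, 3, 3, 0, 1, 0});
        // No cast needed: subsequence is known to return a DnaSeqSketch.
        DnaSeqSketch part = seq.subsequence(1, 3).complement();
        System.out.println(part.length()); // 3
    }
}
```

The benefit of the pattern is visible in main: a DNA-specific method such as complement() can be chained directly onto the result of a method declared in the general interface.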
There is an abstract implementation, AbstractSequence, which implements most of the methods of the interface, but no sequence storage. The extension PackedSequence provides the main support for storing a sequence of tokens bit-packed into a long array. A concrete implementation for DNA sequences is DNAsequence, found in the sequence.dna package.

5.3 Token readers and iterators

The interfaces TokenIterator and TokenSequence are used for sequential reading of the tokens of a sequence, as opposed to Sequence which allows for random access. The main difference between the two is that TokenIterator merely provides a sequence of tokens without any other information (like an iterator, but since it does not return objects it cannot extend the Iterator interface), while TokenSequence is used to iterate over a sequence of known length in which each token has a known position. The TokenIterator interface is used by implementations of SequenceReader, which return TokenIterator objects when parsing sequence files. The TokenSequence interface is used, by way of its abstract implementation AbstractTokenSequence, to access sequences or subsequences.

6 Sequence readers and filters

There are two file formats supported: Fasta and Fastq. The file readers are FastaReader and FastqReader, both of which are extensions of SequenceReader. A sequence reader is iterable over TokenIterator, and can be handed directly to a KFMaggregator through the addAll method. A FastaReader reads a Fasta file and returns all the sequences. A FastqReader reads a Fastq file, both the sequences and the quality scores. It has the option to exclude bases with a quality score below a given cutoff, in which case the sequences will be split up into smaller sequences. By default, the files are assumed to contain DNA sequences, but the token set can be replaced with another token set should one wish.

7 Settings and options handling

Instead of passing around multiple options, e.g.
between the user interfaces and the KFMaggregator object used to generate the kFM-index from input data, container classes for storing options are used. The main options class for this use is KFMoptions (in the wordtable.kfm.util package). This contains the required parameters for specifying the source sequence data, de Bruijn graph order, etc., in addition to methods for determining file format, etc. The options class is not only a container class, but also specifies the command line options used to set these options and documents their use. This is done through the JCommander package. The command line interface, WordTable, also utilises an extension, WordTableOptions, of KFMoptions, which adds a number of output and reporting options. Processing of these options is also specified within the options classes, mostly by calls to methods in the KFMtasks class. The GUI will use these same methods, but access them through menu item tasks (e.g. ApplicationTask objects).

8 Computational speed

8.1 Utilising the kFM-index

The single most critical factor influencing the computational speed is the prevPos function which implements ρ. This is called repeatedly, both when using the kFM-index and when generating it. The vast majority of computational time may be expected to be spent on computing ρ. The most obvious factor influencing the time required to compute prevPos is the distance between the precomputed, stored values: i.e. the size of the array of prestored kFM-index values. The time it takes to compute ρ for any particular index is proportional to the distance to the nearest stored value, and so on average is proportional to the distance between the stored values. However, the memory consumption of the stored values is roughly proportional to the size of the array of prestored values, which is inversely proportional to the distance between the stored values.
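The trade-off between lookup time and memory can be illustrated with a generic sampled rank structure. This is a sketch of the general checkpointing technique, not of the actual prevPos implementation: the class SampledRankSketch, its methods, and the use of an unpacked byte[] of tokens are all assumptions, and the sketch always scans forward from the preceding stored value rather than from the nearest one.

```java
/**
 * Generic sampled-rank sketch: counts are stored every `step` positions
 * and the remainder is counted by scanning. Names are hypothetical.
 */
final class SampledRankSketch {
    private final byte[] tokens;
    private final int step;
    private final int alphabetSize;
    // checkpoints[j][c] = number of occurrences of c in tokens[0 .. j*step)
    private final long[][] checkpoints;

    SampledRankSketch(byte[] tokens, int alphabetSize, int step) {
        this.tokens = tokens.clone();
        this.alphabetSize = alphabetSize;
        this.step = step;
        this.checkpoints = new long[tokens.length / step + 1][];
        long[] running = new long[alphabetSize];
        for (int i = 0; i <= tokens.length; i++) {
            if (i % step == 0) {
                checkpoints[i / step] = running.clone();
            }
            if (i < tokens.length) {
                running[tokens[i]]++;
            }
        }
    }

    /** Number of occurrences of token in tokens[0 .. pos). */
    long rank(int token, int pos) {
        int j = pos / step;
        long count = checkpoints[j][token];
        // Scan at most step - 1 positions past the stored value.
        for (int i = j * step; i < pos; i++) {
            if (tokens[i] == token) {
                count++;
            }
        }
        return count;
    }

    /** Number of stored counts; grows as step shrinks. */
    int storedValues() {
        return checkpoints.length * alphabetSize;
    }

    public static void main(String[] args) {
        byte[] tokens = {2, 0, 3, 3, 0, 1, 0}; // GATTACA with A=0, C=1, G=2, T=3
        SampledRankSketch dense = new SampledRankSketch(tokens, 4, 2);
        SampledRankSketch sparse = new SampledRankSketch(tokens, 4, 4);
        System.out.println(dense.rank(0, 7) + " " + sparse.rank(0, 7));
        System.out.println(dense.storedValues() + " " + sparse.storedValues());
    }
}
```

Halving the step roughly halves the number of positions scanned per query and roughly doubles the number of stored counts, which is the inverse relationship described above.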
In the Java implementation, the main impact on the computational speed of prevPos is the implementation of the array lookup: i.e. (BitpackedList) get(pos). Since the implementation of BitpackedList uses a double array, i.e. long[][], the get(pos) function needs to compute the index for both the main array and the contained data array, as well as the location of the bits within the long value. Implementation details that facilitate compiler optimisation are of great importance here: there was a factor 2 improvement merely from declaring the number of items stored within each data array constant and thus allowing the compiler to inline it. The use of double arrays, although required to implement arrays larger than what could be indexed by an int and suitable for implementing growable arrays with little overhead, has a considerable computational cost compared to a simple array: I found a factor 2–3 difference. This choice was in part due to a limitation of Java, and a different choice of language and implementation might have bypassed this, hence increased speed accordingly. Exploiting known block sizes, e.g. the knowledge that each base requires 2 bit of data, could have helped increase speed, but at the cost of reducing the generality and readability of the code.

8.2 Generating the kFM-index

Generation of a kFM-index from raw sequence data is done in two steps. First, the sequence data is partitioned into manageable parts for which all in-edge data can be stored in memory; this involves holding the entire list of k + 1-mers of the edges in memory at one time, and so the partition size will normally be limited by available memory. For each partition, the kFM-index is produced, which requires little memory in comparison. The second step consists of recursively performing pairwise merges of these kFM-indices. The implementation performs the merges successively as the kFM-indices are being generated from the data partitions.
The way this is done is that whenever two new kFM-indices are generated from raw data, they are merged into a level 1 merged kFM-index. Whenever two level r kFM-indices are generated, they are merged into a level r + 1 kFM-index. Since the merged kFM-index is at most as big as the two source kFM-indices combined, and smaller if they have vertices in common, this may help free up memory. The generation of in-edge lists and production of kFM-indices from each of these is memory demanding, but computationally fast. The merging of kFM-indices is in comparison computationally more demanding: each merge takes time proportional to the size of the smaller of the two indices times k, since it has to look up the position of each node (although some speed is gained by looking up an entire interval of suffixes at a time). In addition, the number of times the kFM-indices have to be pairwise merged before they have all been merged into one is the base-2 logarithm of the number of partitions. Doubling the block size reduces the number of times the intermediate kFM-indices have to be merged, but at the cost of doubling the required memory for holding the in-edge list.
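The level-wise merging scheme behaves like incrementing a binary counter, which the following minimal sketch illustrates. It is not the actual implementation: the class LevelMergerSketch is hypothetical, sorted sets of strings stand in for kFM-indices, set union stands in for the merge, and the handling of leftover indices in finish() is an assumption since the text does not describe it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sketch of successive pairwise merging by level; names are hypothetical. */
final class LevelMergerSketch {
    // pending.get(r) holds the single unmerged level-r index, or null.
    private final List<TreeSet<String>> pending = new ArrayList<>();
    private int merges = 0;

    /** Adds a freshly generated (level 0) index, merging equal levels. */
    void add(TreeSet<String> index) {
        TreeSet<String> current = index;
        int level = 0;
        // Two indices at the same level are merged into the next level.
        while (level < pending.size() && pending.get(level) != null) {
            current = merge(pending.get(level), current);
            pending.set(level, null);
            level++;
        }
        if (level == pending.size()) {
            pending.add(current);
        } else {
            pending.set(level, current);
        }
    }

    /** Merges whatever is left over into one index (assumed behaviour). */
    TreeSet<String> finish() {
        TreeSet<String> result = null;
        for (TreeSet<String> index : pending) {
            if (index != null) {
                result = (result == null) ? index : merge(result, index);
            }
        }
        pending.clear();
        return (result == null) ? new TreeSet<>() : result;
    }

    int mergeCount() {
        return merges;
    }

    // Set union stands in for the kFM-index merge; shared entries collapse.
    private TreeSet<String> merge(TreeSet<String> a, TreeSet<String> b) {
        merges++;
        TreeSet<String> union = new TreeSet<>(a);
        union.addAll(b);
        return union;
    }

    public static void main(String[] args) {
        LevelMergerSketch merger = new LevelMergerSketch();
        String[][] partitions = {{"ACG", "CGT"}, {"CGT", "GTA"}, {"TAC"}, {"ACG", "ACC"}};
        for (String[] partition : partitions) {
            merger.add(new TreeSet<>(List.of(partition)));
        }
        System.out.println(merger.finish());     // [ACC, ACG, CGT, GTA, TAC]
        System.out.println(merger.mergeCount()); // 3
    }
}
```

With four partitions, each entry takes part in two merges (levels 1 and 2), matching the base-2 logarithm of the number of partitions, while at most one pending index per level is held at any time.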