Java implementation of the kFM-index 1 Introduction

advertisement
Java implementation of the kFM-index
1
Introduction
This document provided information on the Java implementation of the kFM-index encoding the
de Bruijn subgraph information.
Note that in the implementation, nodes of the de Bruijn graph are assumed to be k-strings
with edges corresponding to k + 1-strings. This is different from the article in which nodes were
k − 1-strings and edges were k-strings. The reason for this disagreement is that in the article, the
objective was to provide an index of k-substrings of a set of strings, and the information theoretic
computations were hence performed with this in mind. In the implementation, however, it is more
practical to think of the index as a map that for each node provides the list of in-edges to this
node, and so the focus is more on the node than on the edge in the implementation making it more
natural to let k correspond to the length of the node strings.
The objective of the implementation is to represent de Bruijn subgraphs encoding the k +1-mer
composition of DNA reads in a compact manner: we use k + 1-mers since it is more convenient
implementationwise to let nodes represent k-mers. However, the main implementation is independent of DNA and may be used for other alphabets, although the present implementation only has
the DNA alphabet implemented.
The top level package is wordtable, with main subpackages wordtable.list, wordtable.sequence
etc. referred to as list etc. for brevity. The executable class is WordTable, which may execute the
command line options, or invoke the application WordTableApp which opens the GUI WordTableView.
1.1
Usage and command line options
The command for starting the application is (possibly depending on operating system)
java [java−options] −jar WordTable.jar [options]
where the option -Xmxmem may be used to specify how much memory the application is allowed
to allocate, e.g. -Xmx4g for up to 4 GB. Java will not be allowed to allocate more than the specified
amount of memory, and without explicitly setting the max memory may easily run out of memory
even if the computer has more memory available.
The default behaviour to send text output to the consol. However, if no output as text or to
file is requested, it will be default open the GUI. Either may be overruled by using the option --gui
to launch the GUI, or --no-gui (or --batch) to prevent the GUI. The command line options for
processing sequence data will be executed in either case, sending the text output to the console or
to the GUI log view, and save file output to the specified file names.
The GUI will, after processing sequence data into a kFM-index, open a view to the kFM-index
as a table. The GUI also provides a resource view which shows how much memory is available,
allocated, and in use. Remember that memory may be registered as in use even if the application is
not using it any more: it will only be released after garbage collection, which can be call explicitly
at any time.
For an overview of command line options, run WordTable with the option --help (or -h).
2
2.1
Implementation
General overview
The DNA alphabet consists of the tokens {A, C, G, T }. In general, we represent the alphabet (DNA
or other alphabet) by a set of tokens enumerated 0, . . . , m − 1. The size of the token set, m, needs
to be fixed, so dynamic token sets are not possible in this implementation. The TokenSet interface
specifies methods for converting char to token (int) and back and is used when converting between
sequence data as text and sequences as stored in memory using tokens.
I will use the term string also to refer to a sequence of tokens in this document in cases where
these represents strings/sequences or substrings of these. Hence, a string of tokens need not be
stored as a String object, but may be a bit-packed encoding of the string.
1
Methods for representing and storing strings/sequences as strings of tokens exist (sequence
package) as well as for reading sequences from a Fasta or Fastq file.
The kFM-index is not constructed directly from the original strings. Instead, a list of edges
is generated. This is then sorted (by node), and the kFM-index is generated from the sorted
list of edges. Because of this, there is, in addition to the implementation of the kFM-index and
supporting classes, and implementation for storing lists of edges in a compact manner. However,
compact storage of a list of edges is far less compact than the kFM-index.
The main aim of the implementation is to demonstrate the final kFM-index, not optimal generation of the kFM-index from the original strings in terms of speed or intermediary memory
consumption.
2.2
Computation of the kFM-index from a set of sequences
Starting with a set of strings/sequences, in memory or in a file, and generating the kFM-index
corresponding to all k + 1-substrings (i.e. the de Bruijn subgraph of degree k in which nodes
correspond to k-strings), is done in a few steps.
Read sequence data: The sequence data may be stored in memory or read from file, e.g. from
a Fasta file using the FastaReader.
Convert sequence data to in-edge list: An InEdgeList object is created, and the sequences
added to the list. This adds all k-substrings of the sequences. The last k-substring is marked
as a final node, i.e. one that may be without out-going edges.
Sort and complete the in-edge list with final edges: After all in-edges have been added to
the in-edge list, a completion step is run (using the complete method). This sorts the in-edges
by node, removes duplicates, and adds finalising edges and nodes so that the special node $k
can be reached from all nodes in the graph.
Generate in-edge main data for the kFM-index: The InEdgeData may be generated from
the completed in-edge list. The user is responsible for ensuring that complete() is called on
the InEdgeList object before using it to construct the InEdgeData object. In practice, this
step will normally be done, not on an InEdgeData object, but as part of constructing the final
KFMindex which contains both the main data of the kFM-index as well as the auxiliary data
used to enable fast look-up in the table.
Add the kFM-index look-up table for fast walking of the graph: The KFMindex object, in
addition to the main in-edge data stored as part of being an extension of the InEdgeData
class, also contains precomputed values used to allow the previous node map, ρ(a, i), to be
computed efficiently. This is done by precomputing ρ(a, i) for a subset of the positions i (i.e.
at regular intervals), and then computing ρ(a, i) in-between these precomputed values on
demand. This is done by a call to computeKFMarray after all the in-edges have been added to
the InEdgeData. Since KFMindex is an extension of InEdgeData, the construction of the object’s InEdgeData data and completion of the object as an KFMindex object is normally done
in one step when calling the constructor as new KFMindex(inEdgeList) with the completed
InEdgeList object as argument.
Generate kFM-indices on partitioned sequence data and merge these: A KFMaggregator
object handles the splitting of sequence data into blocks, generates kFM-indices for each
block, and subsequently merges these into one complete kFM-index. This is useful if the
data is too big to process in one go.
The combined code to generate a kFM-index from a Fasta file can be as simple as follows:
KFMindex kFMindexFromFastaFile(File file,int order) {
// Set up list for in−edges of given order, and set initial size based on file size
InEdgeList list=InEdgeList.getNew(DNA.get(),order,(int)file.size());
try {
FastaReader fasta = new FastaReader(file); // Open Fasta file
for (TokenIterator seq : fasta) {
list.add(seq); // Add sequences one by one
}
fasta.close();
} catch (FileNotFoundException ex) {
2
System.err.println("ERROR: "+ex);
return null;
}
// Complete list (sort and add finalising edges)
list.complete();
// Produce kFM−index: adds list to main data, then computes indexing data
KFMindex index = new KFMindex(list);
return index;
}
Note that an initial size is provided when constructing the InEdgeList object. Having an initial
size which is big enough to contain all the in-edges generated when parsing the sequence data avoids
having to grow the array, which can take up substantial memory when it takes place. Guessing the
number of in-edges from the file size should mostly by quite adequate unless the Fasta file contains
a substantial portion of non-sequence text.
The same can be done using a KFMaggregator object, which does the KFMindex creation for
us. Code for processing a Fastq file, excluding nucleotides that are unknown or have quality score
less than 20, can be done as simply as follows:
KFMindex kFMindexFromFastqFile(File file,int order,int quality,int blocksize) {
// Set up kFM−index aggregator, splitting input into subsets of size blocksize
KFMaggregator aggr=new KFMaggregator(DNA.get(),order,blocksize);
try {
FastqReader fastq=new FastqReader(file); // Open Fastq file
reader.setQualityCutoff(quality); // Set quality cutoff
aggr.addAll(fastq); // Add all sequences
fastq.close();
} catch (FileNotFoundException ex) {
System.err.println("ERROR: "+ex);
return null;
}
KFMindex index=aggr.getKFMindex();
return index;
}
2.3
Class diagrams
The class diagrams indicate the main classes: abstract classes and interfaces are green and yellow,
with regular classes are blue. Type of class is also indicated in the upper right corner, while the
package is indicated in the upper left corner. Arrows with whole lines indicate inheritance (filled
arrowhead) or interface implementation (empty arrowhead). Arrows with dashed and dotted lines
indicate that the pointing class references the class pointed to: dashed lines are used to indicate
that objects of the class pointed to are produced by objects of the pointing class, while dotted
lines merely indicate that the pointing class references or operates on the class pointed to.
Key methods may be listed, but this will not be complete: the main focus is on public methods
that are critical for understanding and using the classes. Important constructors will be listed
first, then comes important methods, and at the end comes important static methods. For the
sake of brevity, methods are only listed where their interface is first defined, not in implementing or
extending classes. Also for the sake of brevity, ". . . " is used to shorten down the list of arguments
keeping the essentials, or to shorten down the list of similar methods in a class: i.e. the main role
of presenting methods in the class diagrams is to indicate which functions the class performs and
what data it contains, not to give an overview of the API.
2.4
Implementation details
Java must be given access to sufficient memory, e.g. by the -Xmx option on the command line.
There is a resource monitor available in this implementation which indicates how much memory
is made available, how much has been allocated, and how much is actually in use. It may be
necessary to run the garbage collector to free up memory after the kFM-index has been generated
to see how much memory is actually required for storing it since memory used in the intermediate
steps may not have been freed yet.
3
list
BitpackedList
BitpackedList(int bits, int initsize)
long size()
long get(long pos)
void put(long value)
void set(long pos, long value)
list
abstract
CumulativeArray
kfm.KFMindex
UnusedFinals
static CumulativeArray get(long size, long maxstep)
kfm
InEdgeData
void removeUnneededEdges()
IntList getUnusedNodes()
InEdgeData(InEdgeList list)
kfm
TokenSet tokenset()
int k()
boolean hasInEdge(long pos, int token) boolean isLastInGroup(long pos)
void remove(IntList list)
void set(long pos, long value)
void cumulate()
long get(long pos)
list
BitpackedCumulativeArray
KFMmerger
static KFMindex merge(KFMindex a, KFMindex b, . . . )
kfm
KFMindex
kfm
KFMaggregator
KFMaggregator(TokenSet tokenset, int k) . . .
void add(Sequence seq) . . .
KFMindex getKFMindex()
KFMindex(InEdgeList list)
static KFMindex merge(KFMindex a, KFMindex b, . . . )
long prevPos(long pos, int token)
TokenSequence nodeTokens(long pos)
KFMindexInterval newIndexInterval()
int pruneFinalCompletions()
void computeKFMarray(int indexsize) . . .
long kfmIndexPos(int index)
int kfmNearbyIndex(int long)
long kfmSolve(long posvalue)
kfm
KFMindexInterval
long startPos()
long endPos()
boolean isEmpty()
boolean inNonempty()
boolean backtrack(int token)
void reset()
Figure 1: The main class for storing the kFM-index is KFMindex. The in-edge storage is covered
by InEdgeData, which relied on BitpackedList for compact storage, while the stored values of the
kFM-index are handled by a cumulative array. The KFMindexInterval references an interval of the
kFM-index, e.g. nodes with a given prefix. Utility classes KFMmerger and UnusedFinals implement
algorithms for merging kFM-indices and removing superfluous nodes, and are included in the
diagram (in grey) for that reason only.
2.4.1
Array and list sizes and indexing
Java arrays are indexed using int values. A list containing more than 231 ≈ 2 · 109 items can
therefore not be represented as a Java array. The lists used to represent kFM-indices are not
pure arrays, but made from arrays of arrays, and so overcomes this limitation. This also allows
them to grow more easily. Normally, a growable array is grown by replacing it with a larger array
and copying the values, a process that temporarily requires considerable memory overhead. When
implemented as an array of arrays, the main array is itself much smaller since it only stores the
references to the contained data arrays, and only these references are copied rather than the values
themselves.
The kFM-index is indexed using 64 bit long addresses, not 32 bit int which would have been
used for regular Java arrays, with the vertex data stored using arrays of arrays as described above.
Mapping the long addresses to array positions requires the splitting each long address into int
array indices. There may be choices of settings for this may fail, e.g. if the arrays containing the
data are set to be so small that the number of such arrays exceeds what the array of arrays can
store, but I don’t think this should represent a practical issue.
While the kFM-index related classes support large lists indexed by long position values, the
InEdgeList used to store a list of in-edges for generating kFM-indices is implemented as an array
and only supports indexing by int, hence is limited by the maximal array size. Since each stored
block in the IntBlockList, which underlies InEdgeList, may require more than one value per block
(each value of type int since IntBlockList stores the data in an int array), the number of blocks it
can hold will be ≈ 2 · 109 /δ where δ is the number of values per block.
Note that the sequence classes allow for sequences that are longer than 2·109 and have positions
indexed by long values.
3
The kFM-index classes
The main classes for storing and manipulating the kFM-index are contained in the list package.
Figure 3 illustrates the class structures related to the KFMindex class which is used to store the
main information of the kFM-index and provides methods for accessing this information. In order
4
to provide compact storage, the BitpackedList class is used which packs multiple data items into a
long, and then stores this as an array of long values.
Objects of InEdgeListInterval represent intervals of indexes in the kFM-index, which are used
when retrieving the interval of strings in the kFM-index list with a specific prefix.
The correspondence between algorithms and functions in the manuscript and class methods is
as follows:
(KFMindex) prevPos(pos,token): This corresponds to the previous node position function, ρ(a, i),
where pos = i is the index position in the list of nodes, and token = a is the token (where
tokens are enumerated 0, 1, . . .). The implementation is slightly different from that in the
manuscript in that it looks up the next stored value of ρ(a, i) and works it way down to i:
in the manuscript, it starts at the previous stored values and works itself up to i.
(KFMindexInterval) backtrack(token): This corresponds to one step in the computation of [α(x), β(x)i:
i.e. the computation of [α(ax), β(ax)i from [α(x), β(x)i where token = a. In practice, calls
to backtrack are delegated to methods backtrackInterval for the first k tokens, which thus
corresponds to the Interval function described in the manuscript, and to backtrackPosition
for subsequent tokens.
(KFMindex) nodeTokens(pos): This corresponds to the Vertex function defined in the manuscript
which recovers the string (token sequence) that a particular node represents. It relies on calls
to kfmSolve(value) which corresponds to the function ρinv (i).
Specification of stored which positions of ρ(a, i) are stored, i.e. the list i0 < . . . < im is
implemented in (KFMindex) kfmIndexPos(index), where index = r. This is implemented as ir =
brm/nc, but the only assumption is that it should be increasing, and have i0 = 0 and im = n.
The method (KFMindex) kfmNearbyIndex(pos) is used to find a nearby stored position, and is
implemented as brm/n − 1/2c; the only assumptions made, however, are that for ir it should
return r, and that it should be non-decreasing.
3.1
BitpackedList
A BitpackedList object implements a growable list of items, indexed by 64 bit long position, where
each item requires a fixed number of bits. These items are first packed into a long value with a
fixed number of items per long value. The long values are then put into an array of fixed length,
which in turn are stored in a growable array. In this manner, the list can hold more items than
can be indexed by an int, and can also grow without having to duplicate the data since only the
references to arrays holding the arrays have to be copied when the referencing array is being grown,
not the data itself.
3.2
InEdgeData
The KFMindex is the main class of the kFM-index implementation. However, in order to help
structure the code, the implementation has been divided in two, with InEdgeData implementing
the part containing the main data and KFMindex completing the kFM-index implementation with
the auxiliary data and methods for kFM-index look-up.
The storage of the main data is implemented in InEdgeData as an extension of BitpackedList:
the in-edge bits (one bit per token of the token set), and a group end flag (one bit) indicating
the end of each node group where nodes with the same k − 1 prefix string are grouped together.
Thus, the BitpackedList part of the implementation handles the general data storage and retrieval
mechanisms, while the InEdgeData extension provides method specific for storing in-edges.
3.3
KFMindex
The KFMindex adds to InEdgeData the kFM-index which provides a map from each node (identified
by its position in the table) to the nodes from which in-coming edges originate. This index, ρ(a, i),
named prevPos in the implementation, is not stored in full as it can be computed from the main
data (available from the InEdgeData implementation). However, for efficient computation, a subset
is stored as an array of non-decreasing values using an implementation of CumulativeArray.
In essence, a kFM-index is made by first filling the KFMindex object with in-edge data, i.e.
the part covered by the InEdgeData part of the implementation. After the in-edge data has been
entered, in order to be able to efficiently compute the kFM-index function ρ(a, i), implemented as
5
the prevPos function, a subset of the values are stored. In the implementation, ρ(a, i) is stored for
values i0 , . . . , im where m is the size of the stored index.
The function kfmIndexPos(r) returns ir , while kfmNearbyIndex is used to return the r such that
i ≤ ir is nearby. When calling prevPos to compute ρ(a, i), the function kfmNearbyIndex is used a
nearby i ≤ ir , the value of ρ(a, ir ) is looked up, which technically is stored as κ(r + am) = ρ(a, ir ),
and ρ(a, i) is then computed from this. In the present implementation, ir = kfmIndexPos(r) =
brm/nc is used, while makes kfmNearbyIndex(i) = din/m + 1/2e.
Generation of the κ array from the stored in-edge data is done by calling computeKFMarray,
providing the desired array size, m, or allowing a default choice to be made based on the size of
the index, n.
The inverse of ρ(a, i) is defined by ρinv (j) = (a, i) where ρ(a, i) = j while ρ(a, i + 1) = j + 1 and
implemented as kfmSolve. The pair (a, i) is represented by am + i, and thus kfmSolve(j) returns
am + i.
3.4
CumulativeArray and BitpackedCumulativeArray
The stored values, ρ(a, ir ) for 0 = i0 < . . . < im = n, may be coded into a non-decreasing array
κ(r + am) = ρ(a, ir ). If ir − ir−1 ≤ q for all r, the effect is that κ is non-decreasing with increments
at most q. There are several ways in which this can be stored efficiently. The abstract class
CumulativeArray only declares the required interface, while BitpackedCumulativeArray provides an
implementation: other implementations were made prior to this, which are still available although
not in use.
Basically, BitpackedCumulativeArray separates κ(j) = uj + qvj where uj ∈ {0, . . . , q − 1}, where
q is chosen to be a power of 2. This results in uj which can be stored in a bit-packed list, while vj
is a list where increments are either 0 or 1. By storing ∆vj = vj − vj−1 ∈ {0, 1} as a list of bits,
and only storing vj for j whole multipla of 64 (corresponding to sets of 64 bits of ∆vj stored in a
single 64 bit long, memory consumption can be kept low while retrieving arbitrary values of κ can
be done quite fast.
3.5
KFMindexInterval
The KFMindexInterval objects specify an interval of the kFM-index list. The nodes with a given
p-string prefix correspond to such an interval, and methods for computing the interval from the
string are implemented here. The same is methods for backtracking the de Bruijn subgraph starting
at a given node: e.g. a node determined from a given k-string.
3.6
KFMmerger
The KFMmerger class takes two kFM-indices, i.e. two KFMindex objects, and performs a merge of
them returning a merged kFM-index.
If the total sequence data is large, producing the kFM-index directly by making a list of all
in-edges in a InEdgeList object may be too memory demanding since this requires storing all k + 1mers. Instead, the sequence data may be split into more managable parts, kFM-indices made for
each, and then pairwise merger is performed until all have been merged into one big kFM-index.
When KFMmerger performs a merge of two kFM-indices, it has the option to delete the data
from the source kFM-indices and thus release the data as it has been used. Since the target
kFM-index is being grown on demand as it is being created, allowing the memory held by the
source kFM-indices to be release as the target kFM-index is being created avoid the memory
overhead of having the combined memory of the two source and the target kFM-indices at once.
The KFMaggregator automatically does this, and does so safely since the intermediate kFM-indices
holding subsets of the data are not visible to the user.
Memory overhead required for the merger is 3p + 2q bit. This consists of arrays storing the
order of elements from the two indices (p+q bit), which nodes in the two indices correspond (p bit),
and node groups (same k − 1 prefix) found in both indices (p + q bit to mark corresponding nodes
that are not the last node in the group).
Calls to the KFMmerger class should normally be done through the static merge function in
KFMindex, and the use of KFMmerger should thus remain invisible to the user.
6
list
abstract
util
IntBlockList
interface
«Sortable»
IntBlockList(int blocksize, int initsize)
int compare(int p, int q)
void swap(int p, int q)
void copy(int p, int q)
int size()
TokenIterator tokens(int position)
void sort()
kfm.list
InEdgeList
InEdgeList getNew(TokenSet tokenset, int k, int initsize)
util
TokenSet tokenset()
int k()
void add(Sequence seq) . . .
int edgeToken(int position)
int indexLength(int position)
boolean isFinal(int position)
boolean nodeEquals(int p, int q)
void complete()
Sorter
static void sort(Sortable list, int start, int end)
Figure 2: The main class for storing a list of edges is InEdgeList which is used to store and process
in-edges read from sequence data, stored as k-string node values with in-edge token data. The
IntBlockList provides the general data structure for storing these in a compact manner, as well as
Sortable methods that allows Sorter to sort the list.
3.7
UnusedFinals
When kFM-indices are merged, final-completing nodes required to make nodes accessible from the
root node (or final node) may become superfluous and could be trimmed away. The process of
identifying final-completing nodes that are not required uses the internal class UnusedFinals of
KFMindex to scan through the tree of final-completing nodes recursively removing unneeded edges
and marking nodes left unused as superfluous. These can then be removed from the kFM-index.
The method pruneFinalisingNodes in KFMindex uses a UnusedFinals object to find all finalcompleting nodes that can be removed, removes them, and updates the precomputed and stored
part of the kFM-index.
The use of the UnusedFinals class should normally remain invisible to the user.
3.8
KFMaggregator
The KFMaggregator class is a utility class for building a kFM-index from sequence data. It encapsulates the passage of sequence data, through InEdgeList objects, into KFMindex objects, splitting
the sequence data into suitable blocks and in turn merging the subset kFM-indices into a final,
complete kFM-index. Note that it will not split data within a sequence, only between sequences,
so if it is run on a very long sequence, like a large genome, the entire sequence will be processed.
When two kFM-indices are merged by the aggregator, the deleteSources flag is set which makes
the KFMmerger remove the data from the kFM-indices that are being merged to free up memory.
This way, only a minimum of data is kept in duplicate, and the memory required is little more than
that required by the two source kFM-indices: if the source kFM-indices are not removed, memory
would be required to store both the source kFM-indices and the merged kFM-index.
If the data are split into q blocks, q kFM-indices are created. If we enumerate the blocks
0, . . . , q − 1, blocks 2r and 2r + 1 are merged in the first level of merger, leaving a new list of
kFM-indices enumerated 0, . . . , bq/2c − 1. This is then repeated until they have been merged into
one kFM-index. This process may be thought of as a binary tree with the initial q kFM-indices
as leaves. In the implementation, kFM-indices are merged immediately once both are available:
i.e. whenever two child nodes in the binary tree have been populated, they are merged. This may
help reduce memory in the cases where intermediate kFM-indices contain overlapping sets of nodes
since the merged kFM-index will contain fewer nodes than the two source kFM-indices combined.
In the present implementation, all is run on one thread. A potential improvement could be to
have the mergers run on separate threads.
4
Storing a list of edges compactly
As a first step in generating the kFM-index for a set of strings, a list of corresponding edges must
be produced. Edges here are ment to be in-edges to a node, and so the edge consists of a k-string
corresponding to a node and a token (1-string) indicating the edge (i.e. preceeding token in the
7
original string). However, in addition to the tokens used to make up the string, the sequences are
padded at the end with $, and so nodes may correspond to a p-string padded with $k−p (i.e. the
$ token k − p times).
In order to reduce overhead, this edge information is stored in a compact manner in an int
array. Each int is 32 bit long, and a whole number of int values are used. This is administrated
by the IntBlockList which implements a growable array for packing data items (i.e. edges) into an
int array without the need for memory overhead used to create objects out of each edge.
4.1
IntBlockList
An IntBlockList object contains a growable list of items where each item is represented by a block
of int values (of fixed size specified at creation).
4.2
InEdgeList
The InEdgeList is an extension of IntBlockList for which the stored items are edges represented by
a node and a token indicating the in-edge. Each edge is stored with three pieces of information:
the k-string of the node (using the 0 token for $), the position of $ in or relative to the k-string (0
for the first position, k − 1 for the last position, k for immediately after the k-string, −1 otherwise
or as a default), data specifying the in-edge (or no in-edge).
The data is packed, using the IntBlockList implementation, into r int values, allowing 32r bit
to be stored in each pack. If the node is v ∈ Σp ◦ $k−p , i.e. a p-word followed by k − p $ tokens,
and the edge is av for some a ∈ Σ, we store v where each token of Σ is represented by a number
0, . . . , σ − 1 and $ is stored as 0, the number p indicating the position of the first $ token is stored,
and the edge is specified by storing a again as a number 0, . . . , σ. For DNA, σ = 4 and so it
requires 2 bit per letter, plus additional bits to store p. The bit-packing looks like this
32l bit
z
v1
|
}|
. . . vp $ . . . $ 0 . . . . . . . 0
{z
}|
{z
}
|
p×s bit
(k−p)×s bit
{
p
{z
t bit
a
} |{z}
(1)
s0 bit
where s = dlog2 σe bit are required per token, s0 = dlog2 (σ + 1)e bit is used to store the in-edge
0
token a but with 2s − 1 used to represent a vertex without an in-coming edge (the start of the
sequence), t = dlog2 (k + 2)e are used to store p, and the remaining 32r − ks − t − s0 bit are padded
with 0. The value of p is stored as a regular binary number for p = 0, . . . , k − 1, but the value
p = k is only used temporarily to mark final nodes while 2t − 1 is used for regular nodes that
contain no $ tokens. The special marking of the final nodes are then later used to help generate
the final-completing nodes, after which only values 0, . . . , k − 1, 2t − 1 are used.
4.3
Sorter and Sortable
The Sorter and Sortable classes are both in the util package. The Sorter implements the heap sort
algorithm for items available through the Sortable interface. This interface does not requires access
to the actual values stored in the Sortable object, but instead the interface provides methods for
comparing values and reordering them. This is used to sort edges stored in a InEdgeList object,
which in turn are blocks of int values in an array administered by IntBlockList.
The main use of Sorter is to perform a heap sort on the entire list. It can also perform an
updated sort when additional values have been added to the list after a previous sorting.
5
Tokens and sequences
The original sequence data is thought of as a sequence of tokens, or a string of characters: the terms
are sometimes used interchangedly, but with the distinction that a character has data type char
while a token is a number of type int. The tokens are specified through a TokenSet object which
provides the correspondence between tokens and characters for conversion between text strings
(e.g. input from a String or a Fasta file) and token sequences.
The core interfaces for dealing with tokens and sequences are all found in the sequence package. This package also contains abstract implementations of these interfaces providing the main
methods. Concrete implementations for DNA sequences are found in the sequence.dna package.
8
sequence
interface
«Sequence»
sequence
interface
«SequenceConstructor»
<S extends Sequence>
static S get(TokenSequence it)
<S extends Sequence>
S sequence(TokenSequence tokens)
TokenSet tokenset()
long length()
int tokenAt(long pos)
S subsequence(long start, long len)
TokenSequence sequenceTokens()
sequence
interface
sequence
boolean hasNext()
int next()
sequence
abstract
AbstractSequence
int size()
int charToToken(char ch)
char tokenToChar(int token)
TokenSequence charsToTokens(CharSequence charseq)
interface
«TokenSequence»
«TokenSet»
sequence
interface
«TokenIterator»
long length()
long position()
<S extends Sequence>
sequence
sequence
abstract
abstract
AbstractTokenSequence
AbstractTokenSet
sequence
abstract
PackedSequence
<S extends Sequence>
PackedSequence(TokenSequence tokens)
boolean equals(Object o)
int compareTo(PackedSequence seq)
sequence.dna
DNAsequence
DNAsequence(TokenSequence tokens)
DNAsequence(CharSequence charseq)
DNAsequence complement() . . .
sequence
abstract
AbstractAlphabet
AbstractAlphabet(char[] letters, . . . )
sequence.dna
DNAtokens
static DNAtokens get()
int complement(int token)
Figure 3: Classes for speficying token sets and handling sequences. Some of the methods are
required to return sequences, not just as Sequence objects, but as belonging to a particular concrete sequence class: e.g. the DNAtokens token set can be used to construct DNA sequences,
and will return these as DNAsequence objects, being an implementation of SequenceConstructor<DNAsequence>.
5.1
TokenSet
Implementations of the TokenSet interface are used to specify the alphabet from which sequences are
made. It provides the size of the alphabet and the correspondence between tokens and characters.
There is an implementation, DNAtokens, representing the four DNA nucleotides, but throughout
the implementation of the kFM-index is independent of the alphabet: it only requires a static
alphabet of fixed, known size, i.e. no dynamic alphabets which allows letters to be added.
The SequenceConstructor interface is intended used with concrete implementation of TokenSet
so that the token set can be used to convert sequences of tokens into Sequence objects. The
interface takes the concrete extension of Sequence as a generic parameter, indicating what type of
sequence is returned. The reason this is needed is in part because Sequence objects need to refer
to a specific token set, while the TokenSequence interface merely returns a sequence of tokens (int
values) without any reference to the token set.
The TokenSet and SequenceConstructor interfaces have been defined as separate interfaces, but
it might be natural to integrate them into one: the only present implementation of either is DNAtokens, and it might be argued that any implementation should implement both simultaneously.
However, implementations of SequenceConstructor need to specify a concrete Sequence implementation, while token sets do not require this.
5.2
Sequence
The Sequence interface represents token sequences of known length over a given token set. It
is used for sequences stored in memory: it provides random access of the sequence tokens. The
interface takes a Sequence extending class as a generic parameter: i.e. it is defined as Sequence<S
extends Sequence>. The S class specified should be the concrete implementation, which should
be final, and is used to specify the type of sequence returned by methods of the class. Since the
methods of any concrete implementation of Sequence should be sequences of the same class, the
definition should be final S<S> to ensure this.
There is an abstract implementation, AbstractSequence, which implements most of the methods
of the interface, but no sequence storage. The extension PackedSequence provides the main support
9
for storing a sequence of tokens bit-packed into a long array.
A concrete implementation for DNA sequences is DNAsequence, found in the sequence.dna
package.
5.3
Token readers and interators
The interfaces TokenIterator and TokenSequence are used for sequential reading of the tokens of a
sequence, as opposed to Sequence which allows for random access. The main difference between
the two is that TokenIterator merely provides a sequence of tokens without any other information
(like an iterator but without returning objects so it cannot extend the Iterator interface), while
TokenSequence is used to iterate over a sequence of known length and where each token has a
known position.
The TokenIterator class is used by implementations of SequenceReader which returns TokenIterator objects when parsing sequence files. The TokenSequence is used, by ways of it’s abstract
implentation AbstractTokenSequence, to access sequences or subsequences.
6
Sequence readers and filters
There are two file formats supported: Fasta and Fastq. The file readers are FastaReader and
FastqReader, both of which are extensions of SequenceReader. A sequence reader is iterable over
TokenIterator, and can be handed directly to a KFMaggregator through the addAll method.
A FastaReader reads a Fasta file, and returns all the sequences. A FastqReader reads a Fastq
file, but the sequence and the quality scores. It has the option to exclude bases with quality score
below a given cutoff, in which case the sequences will be split up into smaller sequences. By default,
the files are assumed to contain DNA sequences, but the token set can be replaced with another
token set should one wish.
7
Settings and options handling
Instead of passing around multiple options, e.g. between the user interfaces and the KFMaggregator
object used to generate the kFM-index from input data, container classes for storing options are
used. The main options class for this use is KFMoptions (in the wordtable.kfm.util package). This
contains the required parameters for specifying the source sequence data, de Bruijn graph order,
etc., in addition to methods for determining file format, etc.
The options class is not only a container class, but also specifies the command line options used
to set these options and documents their use. This is done through the JCommander package.
The command line interface, WordTable, also utilises an extension, WordTableOptions, of KFMoptions, which adds a number of output and reporting options. Processing of these options is
also specified within the options classes, mostly by calls to methods in the KFMtasks class. The
GUI will use these same methods, but access them through menu item tasks (e.g. ApplicationTask
objects).
8
8.1
Computational speed
Utilising the kFM-index
The single most critical factor influencing the computational speed is the prevPos function which
implements ρ. This is called repeatedly, both when using the kFM-index and when generating it.
The vast majority of computational time may be expected to be spent on computing ρ.
The most obvious factor influencing the time required to compute prevPos is the distance
between the precomputed, stored values: i.e. the size of the array of prestored kFM-index values.
The time it takes to compute ρ for any particular index is proportional to the distance to the
nearest stored value, and so on average is proportional to the distance between the stored values.
However, the memory consumption of the stored values is roughly proportional to the size of the
array of prestored values, which is inversely proportional to the distance between the stored values.
In the Java implementation, the main impact on the computational speed of prevPos is the
implementation of the array lookup: i.e. (BitpackedList) get(pos). Since the implementation of
BitpackedList uses a double array, i.e. long[][], the get(pos) function needs to compute with index
for both the main array and the contained data array, as well as the location of the bits are
10
within the long value. Implementation details that facilitates compiler optimisation are of great
importance here: there was a factor 2 improvement merely from declaring the number of items
stored within each data array constant and thus allowing the compiler to inline it.
The use of double arrays, although required to implement arrays larger than what could be
indexed by an int and suitable for implementing growable arrays with little overhead, has as
considerable computational cost compared to a simple array: I found a factor 2–3 difference. This
choice was in part due to a limitation of Java, and a different choice of language and implementation
might have bypassed this, hence increased speed accordingly.
Exploiting known block sizes, e.g. the knowledge that each base requires 2 bit of data, could
have helped increase speed, but at the cost of reducing the generality and readability of the code.
8.2
Generating the kFM-index
Generation of a kFM-index from raw sequence data is done in two steps. First, the sequence data is
partitioned into managable parts for which all in-edge data can be stored in memory; this involves
holding the entire list of k + 1-mers of the edges in memory at one time, and so the partition size
will normally be limited by available memory. For each partition, the kFM-index is produced,
which requires little memory in comparison. The second step is consists of recursively performing
pairwise merges of these kFM-indices.
The implementation performs the merges successively as the kFM-indices are being generated
from the data partitions. The way this is done is that whenever two new kFM-indices are generated
from raw data, they are merged into a level 1 merged kFM-index. Whenever two level r kFMindices are generated, they are merged into a level r + 1 kFM-index. Since the merged kFM-index
is at most as big as the two source kFM-indices, less if they have vertices in common, this may
help free up memory.
The generation of in-edge lists and production of kFM-indices from each of these is memory demanding, but computationally fast. The merging of kFM-indices is in comparison computationally
more demanding: each merge takes time proportional to the size of the smallest of the two indices
times k since it has to look up the position of each node (although some speed is gained by looking
up entire interval of suffixes at a time), and finally the number of times the kFM-indices have to
be pairwise merged before they have all been merged into one is the 2-logarithm of the number
of partitions. Doubling the block size reduces the number of times the intermediate kFM-indices
have to be merged, but at the cost of doubling the required memory for holding the in-edge list.
11
Download