Asynchronous Generic Key/Value Database

by Kyle R. Rose
B.A., Computer Science and Mathematics, Cornell University

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science at the Massachusetts Institute of Technology, September 2000.

© 2000 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Department of Electrical Engineering and Computer Science, September 4, 2000

Certified by: Frans Kaashoek, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Certified by: David Mazières, Ph.D. Student, Department of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Professor A. C. Smith, Chairman, Committee on Graduate Students

ABSTRACT

B-Trees are ideal structures for building databases with fixed-size keys, and have been successfully extended in a variety of ways to accommodate specific key distributions; however, in the general case, in which the key distribution either is unknown beforehand or is intentionally pathological, even the most time-honored B-Tree variants, such as prefix-compressed trees, provide sub-optimal performance: consider, e.g., Sleepycat's poor performance on key distributions with many large keys. Insufficient generality in dealing with different key distributions makes most B-Tree variants unsuitable for general applications such as file systems.
Furthermore, implementations of B-Trees are often limited to either preemptive or cooperative multithreaded operation with synchronous I/O primitives: the overhead caused by lock contention and multiple stacks makes this an insufficient solution for highly-parallel tasks. This thesis helps fill the void in these areas by introducing and analyzing the performance of a C++ implementation of the String B-Tree of Ferragina and Grossi that meets certain efficiency and flexibility constraints, such as operating within typical B+ Tree time bounds and providing good performance on long, arbitrarily-distributed keys, while requiring only asynchronous I/O primitives.

Thesis Supervisor: Frans Kaashoek
Title: Associate Professor of Electrical Engineering and Computer Science

Thesis Supervisor: David Mazières
Title: Ph.D. Student, Department of Electrical Engineering and Computer Science

1 Introduction

B-Tree variants are among the simplest, most elegant structures used to create databases, and are widely used in practice. Although the standard B-Tree is limited to fixed-size keys, later research produced implementations supporting variable-length keys and providing optimal pagination, satisfying the following requirements, which are necessary for truly generic databases:

* Many B-Tree variants support variable-length keys in a non-trivial manner. By this I mean that adding a few large keys to a key distribution increases the size of the tree only locally.

* B-Trees attempt to minimize the number of disk accesses. Local database performance is typically bounded by the high latency and low throughput of disk access. B-Trees, combined with an effective cache, minimize the number of disk accesses required for basic operations.

* B-Trees support random-access inserts, finds, and removes in O(log_B m) disk accesses and ordered, sequential seeks in O(1) disk accesses, where m is the number of keys in the database.
Hash tables, while typically much faster for the first three operations, do not support ordered, sequential access.

Despite meeting these requirements, traditional B-Tree design relies on a single assumption that is not true for all databases: that keys are typically small compared with the blocksize, with rare exceptions. Typical key/value databases store some of each key in the index nodes, providing overflow nodes for the remainder; unfortunately, these overflow nodes not only introduce extra latency into database accesses, but also provide the user little a priori information about the time required to complete an operation. One of the contributions of this project is a B+ Tree variant that performs well on key distributions with many large keys.

Additionally, a source of inefficiency in most current implementations is disk I/O. Making the assumption that the underlying operating system is good at scheduling disk accesses, we should like to provide the OS with as much choice as possible in determining the order of such accesses: typically, B-Trees (and databases in general) accomplish this through the use of cooperative or preemptive threads along with the synchronous I/O primitives provided by the operating system's C library. Unfortunately, this leads to problems with lock contention and to the overhead of multiple stacks. Therefore, another primary contribution of this project is the use of asynchronous I/O primitives, which avoid these problems while still maximizing the concurrency of disk accesses. Specifically, we use the asynchronous I/O library developed for the Self-certifying File System (SFS, [6]).

Finally, we wish to evaluate the real-world performance of a new algorithm, the String B-Tree [5] of Ferragina and Grossi.

2 Design

The design of the algorithm implemented in this project, the String B-Tree, is motivated in large part by the limitations of B+ Trees and B+ Tree variants.
In a standard B+ Tree, keys are stored in the nodes of the tree, requiring a constant key size to maintain the invariants required for the standard pagination operations.[2] This led to the development of B+ Tree variants that supported variable-length keys.[2, 3] However, these solutions only increased applicability; they did not attempt to deal with the resulting inefficiency caused by the many real-life databases with key distributions in which many keys share large substrings.¹ The entire key would still have to be stored in the index, leading to a very low branching factor for those index nodes with very large keys.

Prefix compression[1] addresses a specific instance of this problem: many types of databases will contain many keys sharing a common prefix. The solution is to "compress" the keys in a particular index node by storing the shared prefix only once: only the portion of each key after the prefix is individually saved. This unfortunately does not work well for all databases. Consider a database with uniformly-chosen keys. The expected number of keys with any given x-byte prefix in a database of 2^24 uniformly-chosen keys is 2^(24-8x): when x is just 3, we expect only one key to have a given prefix. In this instance, prefix compression is essentially useless, so the number of keys in any index node will be small, with a correspondingly low branching factor.

2.1 String B-Tree

Ferragina and Grossi come to the rescue with their String B-Tree.[5] Instead of storing prefix-compressed keys at each index node, each key is stored in full (perhaps along with its associated value) in a consecutive sequence of data blocks, and each downward-traversal decision is made by a combination of Patricia trie[7] search and the consultation of a single key.

2.1.1 Notation and conventions

* String B-Tree nodes, also called index nodes, are represented by Greek letters. Index nodes can be divided naturally into two sets: internal index nodes and leaf index nodes.
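To make the space effect concrete, the following is a small hypothetical sketch (the struct and its names are mine, not the thesis's) of per-node prefix compression: the node stores the longest common prefix of its keys once, plus each key's remaining suffix. Keys sharing "/usr/" compress well, while uniformly-chosen keys share essentially nothing, so the node degenerates to storing the full keys.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of a prefix-compressed index node: the shared prefix is
// stored once and each key contributes only its distinct suffix.
struct PrefixCompressedNode {
    std::string prefix;                 // longest common prefix, stored once
    std::vector<std::string> suffixes;  // per-key remainders, in sorted order

    static PrefixCompressedNode compress(std::vector<std::string> keys) {
        std::sort(keys.begin(), keys.end());
        std::string p = keys.empty() ? std::string() : keys.front();
        for (const std::string &k : keys) {
            std::size_t i = 0;
            while (i < p.size() && i < k.size() && p[i] == k[i]) ++i;
            p.resize(i);  // shrink to the common prefix seen so far
        }
        PrefixCompressedNode n;
        n.prefix = p;
        for (const std::string &k : keys)
            n.suffixes.push_back(k.substr(p.size()));
        return n;
    }
};
```

On a distribution like the UNIX file system example in the footnote, `prefix` absorbs most of the shared bytes; on uniformly-chosen keys it is almost always empty, which is exactly the failure mode described above.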
The number of child index nodes of an internal index node π is denoted by |π|, and the child index node associated with a particular key D (either Li or Ri for some i ∈ {1, 2, ..., |π|}) is C(D).

* For ease of exposition and analysis, we will assume that all keys in the database end with a unique termination character "$"; we later remove this requirement in the implementation.

* Keys are represented by capital letters. The ith character (starting from 0) of a key T is denoted by T[i].

¹This is evident in the UNIX file system: typically, half the files on a given machine will have the prefix "/usr/".

* Elements specific to a particular index node may be associated with the index node's name: e.g., a key T that is associated with an index node π may be written π.T. This does not necessarily mean that T is stored in π; T may be a temporary variable used in some context associated with π.

* The set of keys in the database is denoted by Δ.

* The blocksize (in bytes) of our String B-Tree's underlying block database is B.

* The location of some index node π on the disk is denoted &π.

* Patricia tries in general are labelled PT, perhaps subscripted (as in PT_L) or associated with some index node (as in π.PT).

2.1.2 Overview of String B-Tree design

A String B-Tree is like a B+ Tree in that the values are referenced only at the leaves of the tree, while the internal nodes contain only branching information. However, it differs from traditional B+ Tree variants in several ways:

* In a standard B+ Tree, an internal index node with k keys points to k + 1 children: the first child points to those keys less than or equal to the first key; the second child points to those keys greater than the first key but less than or equal to the second key; and so forth.
In a String B-Tree, an internal index node π with |π| children "contains" (in a loose sense, as we shall see) 2|π| keys, represented by the ordered set π.δ = {L1, R1, L2, R2, ..., L|π|, R|π|} in which L1 < R1 < L2 < R2 < ... < L|π| < R|π|. Two useful invariants are maintained on the keys in this node:

1. The ith child index node contains both Li and Ri, and the leaves of the subtree rooted at this child contain exactly those keys in Δ greater than or equal to Li but less than or equal to Ri.

2. All keys T in Δ such that L1 ≤ T ≤ R|π| must satisfy Li ≤ T ≤ Ri for some i. By induction over the structure of the String B-Tree, every key in the database must be "covered" by some index node at each level of the tree.

These invariants must be maintained by all operations, as they are an integral part of the String B-Tree's search algorithm.

* Unlike typical B+ Tree variants, neither keys nor subsequences thereof are stored in the index nodes themselves: rather, a Patricia trie is used to minimally distinguish between the keys "contained" in that node, while each actual key is stored in full elsewhere in the database. Briefly, as it will be described in full further on, a Patricia trie is a compacted trie with only the first character of each branching string stored in the trie. Figure 2 shows a compacted trie and its corresponding Patricia trie.

Figure 1: An example of a String B-Tree with two levels. Each box represents an index node containing a Patricia trie, each leaf of which points to a key in some consecutive sequence of blocks in the database. Internal B-Tree index nodes (such as the root in the above example) contain Patricia tries with an even number of leaves, ordered lexicographically by the keys to which they point. The ith consecutive pair of leaves then points to keys Li and Ri, which are associated with a child B-Tree index node C(Li) = C(Ri).

Right now we need to know only
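The ordering and covering invariants can be sketched in a few lines of code. The following is a hypothetical in-memory illustration (struct, field, and method names are mine), not the thesis's on-disk node format:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of an internal index node's key organization:
// delta = {L1, R1, ..., Lk, Rk} stored flat, child i covering [L_i, R_i].
struct InternalNode {
    std::vector<std::string> delta;  // delta[2i] = L_{i+1}, delta[2i+1] = R_{i+1}
    std::vector<int> children;       // stand-in child ids, one per (L, R) pair

    // Ordering part of the invariants: L1 < R1 < L2 < R2 < ... (strict, since
    // every key ends with the unique terminator "$").
    bool ordered() const {
        if (delta.size() % 2 != 0 || children.size() * 2 != delta.size())
            return false;
        for (std::size_t i = 1; i < delta.size(); ++i)
            if (!(delta[i - 1] < delta[i])) return false;
        return true;
    }

    // Invariant 2: a key T either falls in some child's range [L_i, R_i], or
    // lies strictly between two ranges, in which case it is not in Delta.
    int covering_child(const std::string &T) const {
        for (std::size_t i = 0; i + 1 < delta.size(); i += 2)
            if (delta[i] <= T && T <= delta[i + 1]) return children[i / 2];
        return -1;  // falls in a gap: T cannot be in the database
    }
};
```

A key landing in a gap between two ranges is exactly the "we know P is not in the database" case of the search procedure described below.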
that the search key P, π.PT, and the choice of a single additional key (one of the Li or Ri from π.δ) are sufficient to determine the lexicographic position of P among π.δ.

As a result of our assumption that all keys end with a unique termination character, we can assume for the remainder of this section that all complete keys represented in a Patricia trie terminate at leaf nodes. Thus, each of the keys "contained" in an internal index node is represented by a leaf in its Patricia trie. This assumption will be relaxed in the implementation section.

* As with the internal index nodes, a leaf index node also contains a Patricia trie. The number of trie leaf nodes in each of these tries is equal to the number of key/value pairs represented by the leaf index node, under the same assumption that keys end with a unique termination character. To simplify the notation, the set of strings represented by a leaf index node π is also denoted by π.δ, although it does not have any of the restrictions imposed when π is an internal index node (e.g., that |π.δ| be even).

A visualization of the String B-Tree structure is given in figure 1.

2.1.3 SBT-find(String P)

Traversing a String B-Tree is substantially different from traversing a typical B+ Tree variant, in which there is always an obvious child to which to traverse from any internal index node with k + 1 children: for any search key P, there is always either one consecutive pair of keys (Ki, Ki+1) in the node for which Ki < P ≤ Ki+1, or P ≤ K1, or P > Kk, and all pairs plus the two endpoints are each associated with a unique child index node. The structure of the String B-Tree index nodes motivates a different search algorithm.

Internal index nodes. At each internal index node π, we determine (using π.PT and either Li or Ri for some i) the lexicographic position of P among the keys "contained" in that node.
As will be shown later in our discussion of Patricia tries, our search on π.PT must result in one of two cases:

1. If Li ≤ P ≤ Ri for some i, then we continue the search for P at child index node C(Li) = C(Ri). Note that, in this case, P may actually be in Δ, though we don't know this a priori.

2. If Ri < P < Li+1 for some i, then we traverse to C(Ri) or C(Li+1) arbitrarily; if P < L1, then we traverse to C(L1); if P > R|π|, we traverse to C(R|π|). From this point forward, we know that P is not in Δ.

Leaf index nodes. Once we reach a leaf index node π, we determine (again using π.PT and a single additional key) the lexicographic position of P among the keys "contained" in that node. If P ∈ Δ, then P will match one of the keys in this node; otherwise, P's proper insertion point is either immediately before π's first key, between two consecutive keys, or immediately after π's last key.

The proof that this procedure results in finding the lexicographic position of P among Δ can be done simply by induction.

2.1.4 SBT-insert(String P, KeyPtr K, ValuePtr V)

We first perform a downward traversal of the String B-Tree to find P's insertion point in a leaf index node π. If P already exists in the database, we simply replace the pointer to the old value with a pointer to the new value, and deallocate the space used by the old value. If P does not exist in the database, then we insert it into π.PT and associate the new trie leaf node with the key/value pointer pair (K, V). At this point we may need to perform some cleanup: if P is the least or greatest key in π, the parent index node will need to have one of its keys replaced in order to satisfy our String B-Tree invariants. Since this may change the least or greatest key in π's parent, this operation may cascade to the root. Additionally, π may exceed its key limitation and need to be split, an operation that is discussed later.
Since a split introduces two new strings into π's parent, this operation may also cascade to the root. The insertion of these two new strings at each level appears to require an unbounded number of block loads; however, as we will show later, this is not the case.

2.1.5 SBT-remove(String P)

As with the insert operation, removing a key P from the database involves a traversal to find P's insertion point in a leaf index node π. If P does not exist in the database, then there is nothing to do. If P does exist, then we remove P from π.PT and deallocate the space associated with P's value. As in the insert case, we may need to perform cleanup if P was the least or greatest key in π, although here it is more extensive, since the next-to-least or next-to-greatest key may not be in memory: although this appears to require an unbounded number of block loads, the discussion on Patricia tries later demonstrates a procedure that enables us to do this with no additional block loads. Finally, if π and one of its adjacent sibling nodes both have too few keys, we combine them using the join operation discussed later. Analogously to the insert case, this operation may remove two keys from π's parent, causing it to cascade to the root. If the root index node drops to only two keys (i.e., one child), we remove it and make the only child the new root.

2.2 Patricia tries

A detailed description of Patricia tries can be found in [5]. As stated earlier, a Patricia trie is a compacted trie with only the first character of each branching string stored in the trie. Figure 2 shows a compacted trie and its corresponding Patricia trie.
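Before the formal treatment, it may help to see how little a Patricia trie node actually stores. The following is a hypothetical in-memory sketch (names are mine; the thesis's nodes live in disk blocks): an internal node keeps only its length and one branching character per child, while full keys are reachable only from leaves.

```cpp
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical sketch of a Patricia trie node: internal nodes hold len(x) and
// one branching character per child; full keys live in data blocks outside
// the trie, reachable from leaves via a stand-in block pointer.
struct PTNode {
    std::size_t len = 0;  // len(x) = |S(x)|
    std::map<char, std::unique_ptr<PTNode>> branches;  // branch char -> child
    long key_block = -1;  // leaf only: stand-in pointer to the key's blocks

    bool is_leaf() const { return branches.empty(); }
};
```

Because a node stores O(1) bytes regardless of how long S(x) is, the trie's size depends only on the number of keys, which is what later gives the lower bound on branching factor.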
We first describe a procedure for finding the insertion position of a key P in index node π using potentially all characters of some key in π.PT, which leads to a worst-case bound on disk accesses of O((|P|/B) · log_B m), where the logarithm base comes from a minimum branching factor, proportional to B, guaranteed by the properties of Patricia tries; later, we improve this bound substantially to O(|P|/B + log_B m) by noting that we can upper bound the total number of blocks loaded during an entire String B-Tree search.

2.2.1 Notation and conventions

* As with index nodes, Patricia trie nodes can also naturally be divided into two sets, one of the internal trie nodes and the other of the leaf trie nodes. The assumption that there is a unique termination character implies that a key represented by a Patricia trie must terminate in a leaf node.

* The string represented by a particular node in the Patricia trie is found by concatenating the substrings along the path from the root to that node in the corresponding compacted trie. Only some nodes have strings that are also full keys: under the assumption that each key ends in a unique termination character, these are exactly the leaves of the trie.

* Denote by S(x) the string associated with trie node x. The "length" of a Patricia trie node x is the length of S(x), and is denoted by len(x). Note that S(x) ends with the termination character $ only if x is a leaf.

* The successor leaf node of a Patricia trie node x is denoted by succ(x); the corresponding predecessor leaf node is pred(x).

Figure 2: On the left is a compacted trie.
Note that any string represented by a node in the trie can be constructed by concatenating all of the substrings on that node's path from the root: e.g., the string at the leftmost leaf aaba is constructed by following the branches a, ab, and a in order. On the right is a Patricia trie, in which this property no longer holds: notice that only the first character of each branch is stored in the trie. A node is labelled with its "length," which is the length of the string represented by that node.

2.2.2 PT-search(PT, String P)

We wish to find a path from the root index node of the String B-Tree to a leaf index node in which either (a) we find a pointer to P's associated value or (b) we can add P and maintain a lexicographic ordering of the keys. At any internal index node π in our String B-Tree traversal, we perform a downward traversal ("blind search") of the Patricia trie in that index node using the characters of P: at an internal trie node with length j, we choose the branch associated with character P[j]. Eventually, we will either "get stuck" (reach some node of length j at which no branch matches P[j]) or reach some leaf node in the Patricia trie. (If we get stuck in the traversal, we choose some arbitrary leaf of the subtrie below the stuck node.) Call the leaf we reach ℓ. ℓ is associated with some key T = S(ℓ) located in one or more consecutive data blocks. Since the Patricia trie contains the minimal set of characters required to distinguish between the keys of that index node, if P is not one of the strings in that index node it is possible that our traversal led to the wrong position in the trie, i.e., one in which a move to C(T) will lead to an incorrect leaf index node. See figure 3 for an example.

Smart search. As our goal is to find the insertion point of P, we need to perform additional work if P is not in π.δ; this we call "smart search," a procedure for correcting the possible mistake made during blind search.
After completing the blind search for P, we load T and compare it character-by-character with P to determine the first position i at which they differ. We then move back upwards through the trie to the highest node h along the path to ℓ for which len(h) is greater than or equal to i. Ferragina and Grossi call this the "hit node." From h and T[i], we can determine the correct lexicographic position for P among π.δ:

1. If len(h) = i, then none of the branching characters b0 < b1 < ... < bk from h match P[i]. If bj < P[i] < bj+1 for some consecutive branching characters bj and bj+1, then the lexicographic position of P in π.δ is directly after the string D = S(x), where x is the rightmost leaf of the subtrie rooted at the child along branch bj. (Note that D = Lx or D = Rx for some x ∈ {1, ..., |π|}.) If P[i] < b0, then the lexicographic position of P is immediately before the string D = S(y), where y is the leftmost leaf of the subtrie rooted at the child along branch b0. Similarly, P[i] > bk implies that P's lexicographic position is immediately following the string D = S(z), where z is the rightmost leaf of the subtrie rooted at the child along branch bk.

2. If len(h) > i, then all strings associated with leaves in the subtrie rooted at h have character T[i] in position i. Denote by x the leftmost leaf of this subtrie, and by y the rightmost leaf. Then, if P[i] < T[i], the lexicographic position of P is immediately before the string D = S(x); otherwise, if P[i] > T[i], it is immediately after the string D = S(y). Note that all strings associated with leaves of subtries rooted at h's siblings must be either strictly less than or strictly greater than P lexicographically, since they differ from P in some index j < i.

Using this "smart search," we can determine the correct child index node τ = C(D) of π to which to branch, without requiring the index nodes themselves to contain the keys. An example of the smart search procedure is given in figures 3 and 4.

Reducing disk access.
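The blind-search descent and the first step of smart search can be sketched compactly. The following is a hypothetical in-memory illustration (node layout and names are mine, not the thesis's on-disk format):

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch of a Patricia trie node for illustrating blind search.
struct Node {
    std::size_t len = 0;              // len(x) = |S(x)|
    std::map<char, Node *> branches;  // one branching character per child
    const char *key = nullptr;        // leaf only: the full key S(x)
};

// Blind search: at a node of length j, follow the branch labelled P[j]; if
// no branch matches ("stuck"), descend to an arbitrary leaf below.
inline const Node *blind_search(const Node *x, const std::string &P) {
    while (!x->branches.empty()) {
        auto it = (x->len < P.size()) ? x->branches.find(P[x->len])
                                      : x->branches.end();
        if (it == x->branches.end()) it = x->branches.begin();  // stuck: any branch
        x = it->second;
    }
    return x;
}

// Smart-search step 1: load T = S(leaf) and find the first mismatch index i.
inline std::size_t first_mismatch(const std::string &P, const std::string &T) {
    std::size_t i = 0;
    while (i < P.size() && i < T.size() && P[i] == T[i]) ++i;
    return i;
}
```

With keys abaaa$, ababb$, and baa$ (as in figures 3 and 4), a blind search for abba$ lands on the leaf for abaaa$ even though that is the wrong neighborhood; first_mismatch then reports i = 2, and ascending to the hit node plus the single comparison against T[2] yields the true insertion point.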
Ferragina and Grossi note that we don't actually need to load all of string T (which can be very large) when performing the smart search if we know a priori that the first l characters (indices 0, ..., l − 1) of P and T match: in this case, we need only load those blocks of T containing indices l, ..., i. It turns out that we can keep track of l iteratively through the String B-Tree search. Say we determine π.i, π.T, and π.D in the blind and smart search procedures in π's Patricia trie, and traverse to child index node τ = π.C(π.D). It must be the case that some string Z ∈ τ.δ matches the first π.i characters of P, since P matches the first π.i characters of π.T, π.T matches (at least) the first π.i characters of π.D, and π.D must be in τ.δ by our primary invariant.

Figure 3: In the blind search for P = abba, we will follow the path indicated by the dashed line. Note that we make a mistake on the second arc: in the corresponding compacted trie, that branch has a as its second character, but the character in the same position in the search string is b. We find the mistake by loading string T = abaaa and comparing it character-by-character with P, until we find they do not match at index i = 2.

Figure 4: We then proceed back to the "hit node," the highest node with length ≥ i, which in this case is the node on our path immediately below the mistake. We then follow rule 2 since len(h) = 3 and i = 2, and proceed to find the insertion point between ababb and baa since b = P[2] > T[2] = a.

Consider a string X in τ.δ that differs from P in some index j < π.i.
Then, since P[j] = Z[j], it must be that X[j] ≠ Z[j], so there must be some node in τ.PT with length j at which one branch heads toward X and another toward Z. Clearly, a blind search for P starting at this node would take the branch leading toward Z. By induction over the nodes with length < π.i, our blind search must lead to a string τ.T which matches P in at least the first π.i characters. Thus, for the l in a Patricia trie search we can actually use the i from the smart search performed in the parent's Patricia trie.

Analysis. Given the technical description of Patricia tries, we can finally demonstrate bounds on block accesses for the String B-Tree search. The first thing to notice is that we can derive a lower bound on the branching factor of each index node. A Patricia trie of 2|π| keys contains at most 4|π| + 1 nodes (since every node except the root must have at least two children), so we can represent the trie in O(|π|) space. Because this space bound is independent of the sizes of the individual keys, a fixed index node size gives us a lower bound, proportional to B, on the number of keys per node. Thus, the height of a String B-Tree with m strings is O(log_B m).

Given the lower bound on the branching factor, we can derive good bounds on the number of disk accesses required by a search. Throughout a String B-Tree search through index nodes π0, π1, ..., πh, we need to load at most ⌈|P|/B⌉ + O(log_B m) blocks of key data. The log_B m term comes from the fact that we may be loading the same section from more than one distinct key during our traversal down through the index nodes, but this is eaten up by the index node loading itself. Thus, the total number of blocks to be loaded in any one String B-Tree search operation is O(|P|/B + log_B m).

2.2.3 PT-insert(PT, String P, KeyPtr K, ValuePtr V)

We first perform a traversal using the above procedure to find i, h, and D. We then ascend from the leaf associated with D to the hit node h. From here, one of two cases applies: 1.
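The accounting behind the improved bound can be written out as follows; this is my reconstruction from the surrounding discussion (i_k and l_k are the per-level mismatch and skip positions defined above), not a formula taken verbatim from the thesis:

```latex
% Blocks loaded at level k: the index node itself plus the key-data blocks
% covering indices l_k .. i_k of the comparison key.
\sum_{k=0}^{h}\left(1+\left\lceil\frac{i_k-l_k}{B}\right\rceil\right)
  \;\le\; 2(h+1)+\frac{1}{B}\sum_{k=0}^{h}\left(i_k-l_k\right)
  \;\le\; 2(h+1)+\frac{|P|}{B}
  \;=\; O\!\left(\frac{|P|}{B}+\log_B m\right)
```

since setting l_k = i_{k-1} makes the middle sum telescope to at most |P|, and the height h is O(log_B m) by the branching-factor bound.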
If len(h) = i, then we insert a branch from h along character P[i] to a new leaf node with key/value pointer pair (K, V). We know there is not already a branch along P[i] here, since P[i] ≠ D[i] and there is no string in PT that shares a longer prefix with P than D does.

2. If len(h) > i, then we insert a new node h' between h and its parent, with one branch along character D[i] leading to h and another along character P[i] leading to a new leaf node with key/value pointer pair (K, V).

2.2.4 PT-remove(PT, PTNode n)

Given a node n in trie PT, PT-remove removes n and perhaps removes n's parent: if n's parent now has only one child, we remove that node as well.

2.2.5 PT-split(PT)

Given an input trie PT, PT-split must return two tries PT_L and PT_R of roughly equal size, together covering all strings in PT and such that all strings in PT_L are less than all strings in PT_R.

We first choose some appropriate leaf node n at which to split PT. We then divide PT into two pieces: PT_L contains copies of all nodes along the path to n, plus every node to the left of the path; PT_R also contains copies of all nodes along the path except n, plus every node to the right of the path. We then proceed to "clean up" PT_L and PT_R by removing unimportant nodes, i.e., those nodes other than the root having only one child. A trie split is diagrammed in figure 5.

Figure 5: An illustration of a Patricia trie split. All nodes on and to the left of the dashed path are included in X_L, and all nodes on and to the right of the dashed path (except N) are included in X_R. The nodes marked with crosses are unnecessary and will be removed during clean-up.
2.2.6 PT-concat(PTL, PTR)

PT-concat is essentially the opposite of the PT-split operation: given two tries PT_L and PT_R, the position i at which the greatest string S_L in PT_L and the least string S_R in PT_R differ, and the characters S_L[i] and S_R[i], we join the right spine of PT_L to the left spine of PT_R through position i, removing redundant nodes if necessary.

2.2.7 Revisiting SBT-remove

We now return to a problem mentioned previously: removing the least or greatest key from π.PT requires us to insert a new key into the Patricia trie τ.PT of π's parent index node in order to maintain our String B-Tree invariants. While it seems at first that we need to load the next-to-least or next-to-greatest key in order to insert it into τ.PT, this is not actually the case. In the following, π is a String B-Tree index node and τ is the parent index node of π.

First we define some helper functions. Briefly, PT-diff produces a description of the first difference between two strings A and B represented by nodes a and b of π.PT, respectively, and PT-patch-insert uses this difference information to patch B into τ.PT, which contains A but not B.²

²PT-patch-insert actually has a more specific restriction: τ.PT cannot contain any string that shares a longer prefix with B than A does.

* PT-diff(π.PT, PTNode a, PTNode b), where a and b are trie leaf nodes in π.PT. We know just from looking at the trie exactly how π.S(b) first differs from π.S(a): the path from the root to a diverges from the path from the root to b at some node whose length we call i. Then we know that for all k < i, π.S(a)[k] = π.S(b)[k] but π.S(a)[i] ≠ π.S(b)[i], where π.S(a)[i] is the branching character leading toward a and π.S(b)[i] is the branching character leading toward b. We return (i, π.S(a)[i], π.S(b)[i]).

* PT-patch-insert(τ.PT, PTNode x, i, char xi, char ni, KeyPtr K, ValuePtr V), where x is a trie leaf node in τ.PT and xi = τ.S(x)[i].
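The key observation behind PT-diff, that the first difference between two keys is readable from trie structure alone, can be sketched as follows. This is a hypothetical illustration (the path representation and names are mine): each leaf is described by its root path as (node length, branch character) pairs, and the keys themselves are never read.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of PT-diff on path descriptions: the first index at
// which S(a) and S(b) differ is the length of the deepest node the two paths
// share, and the differing characters are the branch labels taken there.
struct Step { std::size_t len; char branch; };  // took `branch` at node of length `len`
struct Diff { std::size_t i; char ai, bi; };    // (i, S(a)[i], S(b)[i])

// Assumes a and b are distinct leaves, so the paths are guaranteed to diverge
// at some shared node before either path ends.
inline Diff pt_diff(const std::vector<Step> &path_a,
                    const std::vector<Step> &path_b) {
    std::size_t k = 0;
    while (path_a[k].len == path_b[k].len &&
           path_a[k].branch == path_b[k].branch)
        ++k;  // advance past the shared portion of the two paths
    return Diff{path_a[k].len, path_a[k].branch, path_b[k].branch};
}
```

For the keys abaaa$ and ababb$ from figure 3, whose paths share the root (length 0, branch a) and then diverge at the internal node of length 3, this yields (3, a, b): exactly the tuple PT-diff is specified to return, computed with no block loads.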
Say N is the string we are attempting to insert into τ.PT (i.e., one for which N[i] = ni). We find the highest node h along the path from the root of τ.PT to x for which len(h) ≥ i; then, one of two cases applies:

1. If len(h) = i, then we add a branch from h along character ni to a new leaf node with key/value pointer pair (K, V).³

2. If len(h) > i, then we add a new node h' with len(h') = i between h and its parent. A branch from h' labelled by character xi leads to h, and the other branch from h' labelled by character ni leads to a new leaf node with key/value pointer pair (K, V).

The following is then an approximation of the SBT-remove procedure, neglecting the orthogonal matters of joining two undersized index nodes and of removing the root if it becomes empty:

SBT-remove(P)
1. Traverse to the leaf index node π with key P, or return if P ∉ Δ
2. Set d to be the leaf trie node in π.PT for which π.S(d) = P
3. If d is neither the leftmost node ℓ nor the rightmost node r in π.PT, or π is the root of the String B-Tree, PT-remove(π.PT, d) and return
4. If d = ℓ, set n := succ(d)
5. Otherwise, d = r, so set n := pred(d)
6. (i, ℓi, ni) := PT-diff(π.PT, ℓ, n)
7. (j, rj, nj) := PT-diff(π.PT, r, n)
8. Set τ to be the parent of π
9. If i > j
   1. Set x to be the node in τ.PT corresponding to π.S(ℓ)
   2. PT-patch-insert(τ.PT, x, i, ℓi, ni, n.K, &π)
10. Otherwise
   1. Set x to be the node in τ.PT corresponding to π.S(r)
   2. PT-patch-insert(τ.PT, x, j, rj, nj, n.K, &π)
11. PT-remove(π.PT, d)
12. Set π := τ
13. Set d := x
14. Go to 3

³If, among all strings in τ.PT, τ.S(x) shares the longest prefix with N, then there is not already a branch along character ni here: no string Y in τ.PT sharing the first i characters with N also satisfies Y[i] = ni.

Using this procedure, we can remove any key from the database without needing to load additional keys beyond those used for the downward traversal.

Analysis.
We can now complete the analysis of SBT-remove very simply: since the only disk accesses required are those on the downward traversal (even in the case that we remove the leftmost or rightmost node from some index node's Patricia trie), we do not exceed O(|P|/B + log_B m) for the entire operation.

2.2.8 Revisiting SBT-insert

As with SBT-remove, SBT-insert can require us to perform Patricia trie inserts on keys we do not have in memory; and, as before, we find that we do not actually need the keys in memory in order to do this. We can directly use the infrastructure built for SBT-remove to complete SBT-insert, ignoring the orthogonal matter of replacing ancestor keys if the key we insert is the leftmost or rightmost in some Patricia trie:

SBT-insert(P, KeyPtr K, ValuePtr V)
1. Traverse to P's insertion point in leaf index node π
2. If P ∈ A, then replace its associated value with V and return
3. PT-insert(π.PT, P, K, V)
4. If π does not exceed its key count limitation, return
5. If π is the root of the String B-Tree, create a new empty root
6. Set τ to be the parent of π
7. Set ℓ to be the leftmost node of π.PT
8. Set r to be the rightmost node of π.PT
9. Create a new index node φ as π's new successor, redirecting pointers as necessary
10. (PTL, PTR) := PT-split(π.PT)
11. Set n to be the node in π.PT corresponding to the rightmost node in PTL
12. (i, ℓi, ni) := PT-diff(π.PT, ℓ, n)
13. (j, rj, nj) := PT-diff(π.PT, r, n)
14. If i > j
    1. Set x to be the node in τ.PT corresponding to π.S(ℓ)
    2. PT-patch-insert(τ.PT, x, i, ℓi, ni, n.K, &π)
15. Otherwise
    1. Set x to be the node in τ.PT corresponding to π.S(r)
    2. PT-patch-insert(τ.PT, x, j, rj, nj, n.K, &π)
16. Set m to be the node in π.PT corresponding to the leftmost node in PTR
17. (i, ni, mi) := PT-diff(π.PT, n, m)
18. (j, rj, mj) := PT-diff(π.PT, r, m)
19. If i > j
    1. Set x to be the node in τ.PT corresponding to π.S(n)
    2. PT-patch-insert(τ.PT, x, i, ni, mi, m.K, &φ)
20. Otherwise
    1. Set x to be the node in τ.PT corresponding to π.S(r)
    2. PT-patch-insert(τ.PT, x, j, rj, mj, m.K, &φ)
21. Set π.PT := PTL
22. Set φ.PT := PTR
23. Set π := τ
24. Go to 4

Using this procedure, we can successfully perform a String B-Tree insert with cascading splits without loading any additional blocks of key data from the disk.

Analysis. Just as with SBT-remove, we do not need to perform additional disk accesses in SBT-insert above the O(|P|/B + log_B m) required for the traversal.

3 Implementation

An implementation of a modified version of the String B-Tree algorithm has been developed in C++ using the asynchronous I/O libraries (hereafter designated "AIO") developed for SFS.

3.1 Asynchronous I/O

The AIO library provides facilities for performing disk I/O using callbacks: essentially, when a programmer wishes to perform an I/O operation, instead of blocking on completion, he passes to the AIO operation a callback with the logical "next step" in the algorithm. So, for instance, to read a block from disk using blocking I/O and then operate on it with some function g, we would do something like

    FILE *f;
    fread(buf, len, 1, f);
    g(buf);

whereas with AIO we would provide a callback to perform the action:

    ptr<aiofh> f;
    f->read(pos, buf, wrap(g));

The interesting part about this is that we can dispatch many read calls in parallel, whereupon the callbacks will be called in whatever order the underlying I/O operations complete. This allows the operating system to reorder the accesses optimally for the hardware:

    ptr<aiofh> f;
    f->read(pos1, buf1, wrap(g1));
    f->read(pos2, buf2, wrap(g2));
    f->read(pos3, buf3, wrap(g3));
    f->read(pos4, buf4, wrap(g4));

aiobuf. The AIO library provides aiobuf, a class of buffer that is allocated in an asynchronous manner and which avoids memory fragmentation; however, for the purposes of this discussion, it can be thought of as a character array.
3.2 Database components

The implementation of the database has been split up into three primary logical components: a generic block database; a Patricia trie implementation; and a String B-Tree implementation based on the previous two components. Various support data structures (container maps, sets, and lists) support these three components.

3.2.1 Block database

The block database is a database operating on whole blocks and consecutive groups of whole blocks, and that enforces atomicity on user-defined operations (called "requests"). It provides facilities for allocating and deallocating groups of blocks, for loading and saving multi-block structures (called "elements"), as well as for retrieving unstructured, read-only subsequences of one or more whole blocks. It also provides an element cache for recently-accessed elements. It is implemented as class BDB, and provides a minimal interface:

    typedef uint32_t blockd;  // block descriptor
    typedef uint32_t blockct; // block count

    class BDB {
      // Buf<T> is a container buffer for an aggregate or scalar
      // parameterized type T
      typedef Buf<unsigned char> Cbuf;

      // attempts to dispatch any new requests that have been created
      void dispatch_requests ();
    };

As is obvious from the spartan interface above, nearly all interaction with the block database is done through the friend class BDB::Request, which is discussed below. Internally, BDB stores pointers to cached elements and requests in private hash tables; to minimize the impact of memory management, nearly all data managed directly by the block database are reference counted.

Elements. An "element" is a structure that can be written in binary form to one or more blocks in the database: examples include index nodes and data nodes in a String B-Tree. Elements must provide minimal facilities for constructing themselves from binary data and converting themselves to binary data, to simplify the storage model.
Elements are implemented as classes derived from the base class BDB::Element. The following interface must be implemented by each derived element X:⁴

⁴from_bin_aiobuf and allocate do not, of course, have to be class functions.

    class X : public BDB::Element {
      X(BDB *bdb, blockd start, blockct count, ...)
        : BDB::Element(bdb, start, count);

      // converts an instance of X into binary data
      // and writes it to an aiobuf
      void to_bin_aiobuf (ptr<aiobuf>) const;

      // passed as a callback to Request::load_element,
      // constructs a new instance of X from an aiobuf
      static ptr<BDB::Element> from_bin_aiobuf (ptr<aiobuf>, BDB *db,
                                                blockd, blockct);

      // duplicates an instance of X
      ptr<BDB::Element> copy () const;

      // passed as a callback to Request::allocate_element,
      // creates a new instance of X
      static ptr<BDB::Element> allocate ();
    };

From the time of its allocation by the block database, an element is associated with some consecutive sequence of blocks on the disk, and cannot be moved to another location or resized. Therefore, the base class BDB::Element can provide two useful methods: block_start(), which returns the first disk block allocated for the element; and block_count(), which returns the number of blocks allocated for the element.

Blocks are allocated for elements using the binary buddy block allocation algorithm: the total store is first divided into regions of some equal size 2^N bytes; then, each of these regions is recursively subdivided into two equally-sized subregions (called "buddies"), down to regions of some minimum size 2^n bytes. Allocation of a sequence of bytes must be performed on a region boundary, and removes all its subregions from the free block list; this implies that an allocation request for a sequence of x bytes is effectively rounded up to one for 2^⌈log₂ x⌉ bytes, i.e., up to the next power of two.[4]

Conversion to and from binary form (using to_bin_aiobuf and from_bin_aiobuf) can be expensive, depending on the internal representation of the element.
In the case of a String B-Tree index node, for example, the Patricia trie is represented internally as an arbitrary tree with memory pointers leading from node to node; externally, it is represented as a binary string structured recursively from the root in prefix order. As a result of both the dynamic allocation used in the Patricia trie structure and the general complexity of the structure, this operation is very CPU intensive.

Requests. A "request" to the block database is a logical sequence of operations that must together be performed atomically (i.e., without interruption by the asynchronous I/O facilities). As a result, if a request would block on the retrieval of data from the disk, the request is completely reset and all changes undone; it is later reinitialized once the requested data has been loaded. The rationale behind this design was that, since databases are largely I/O-bound, a transaction system requiring fewer resets would be far more complex without much added performance.

In addition to loading a complete element and constructing it from binary form, a request may ask for sets of single blocks ("partial-elements"), which are loaded from the disk and stored in read-only binary form. However, a partial-element can be in memory only when the associated full element is not; therefore, a request needs to be able to glean its needed information from either form, since it does not know a priori which form it will have access to.

Requests are implemented as classes derived from the base class BDB::Request. The following interface must be implemented by each derived request X:

    class X : public BDB::Request {
      X(BDB *bdb, ...) : BDB::Request(bdb);

      // called by BDB::dispatch_requests to start (or restart)
      // the execution of the request
      void init ();

      // reports an exception resulting from the failed execution
      // of some asynchronous operation
      void exception (const BDB::Exception&);

      // called after the request has been completed but before
      // this instance of X has been destroyed
      void post ();
    };

Creating a new request is as simple as calling new X(bdb, ...). Since a request may be restarted at any time, no new requests may be allocated anywhere in the execution tree starting from init(): due to AIO-induced restarts, such a request may inadvertently be re-created several times. Along these lines, some restrictions on the data that requests may preserve across restarts, unenforceable at either compile-time or run-time, include:

• no pointers to any elements or partial-element data
• no block descriptors

These may not be preserved across request restarts because elements may move around in memory, partial-element data may be evicted from memory, and blocks on disk may be allocated or deallocated as the result of another request's completion; systems built on top of the block database must determine a set of invariants that will be true at the start of any request (e.g., the location of the superblock in the String B-Tree implementation) so the request can actually assume something useful about the state of the database.

Requests have access to the following interface from BDB::Request:

    typedef callback<void> generic_cb;

    class BDB {
      // callback<R, P1, P2, ...> is a template class that performs a
      // type of limited function currying on functions with return
      // type R and parameter types P1, P2, ...
      typedef callback<void, ptr<const Element> >::ref viewdata_cb;
      typedef callback<ptr<Element>, ptr<aiobuf>, BDB*,
                       blockd, blockct>::ref from_bin_aiobuf_cb;
    };

    class BDB::Request {
      // allocates space in the database and constructs a new
      // instance of an element using allocate_new_cb
      ptr<Element> allocate_element (size_t, BDB::allocate_new_cb);

      // frees space in the database associated with an element
      // that does not necessarily have to be in memory
      void free_element (blockd, blockct);

      // executes the viewer callback if the requested element is
      // already in memory; otherwise, initiates an asynchronous load
      void load_element (blockd, blockct,
                         BDB::from_bin_aiobuf_cb, BDB::viewdata_cb);

      // executes the callback if the requested partial-element
      // consisting of blocks (load_start, load_start + count - 1)
      // associated with full element (elt_start) is already in
      // memory; otherwise, initiates an asynchronous load
      void load_partial_element (blockd elt_start, blockd load_start,
                                 blockct load_count, generic_cb);

      // returns a non-const pointer to a given element; if the
      // element is not in memory, an exception is thrown
      ptr<Element> modify_element (blockd);
      ptr<Element> modify_element (ptr<const Element>);

      // returns true iff the full element is in memory or is
      // being loaded by the AIO subsystem
      bool use_full_element (blockd);

      // returns a pointer to a constant character buffer
      // associated with a particular partial-element; if the
      // partial-element is not in memory, an exception is thrown
      ptr<const BDB::Cbuf> partial_data (blockd);
    };

Element cache. Once a request asks for a particular element, that element is locked in memory (in the element cache) until the completion of the request. Only when all requests locking a particular element have completed can the element be saved and evicted from the cache.
An element is not immediately flushed from the cache once the last request for which it is locked completes; rather, some arbitrary elements are flushed once the cache reaches a threshold size and a new element is loaded from the disk. This delayed eviction is motivated both by the complexity of the binary-internal conversion and by the presumption that keeping an element's data in user space is significantly more efficient than moving it in and out of the buffer cache as needed.

There is also a partial-element cache that operates in much the same way, with one primary difference: if the full element encompassing some cached partial-element is loaded, the partial-element is evicted to preserve consistency.

3.2.2 Patricia tries

Patricia tries are implemented basically as stated in [5], with the following exceptions:

• No termination character. The requirement for a unique termination character is trivial on paper, but is inefficient to implement. The result is that nearly all Patricia trie functions have to change in at least a minor way to support the lack of a termination character.

• The structure of the Patricia trie is changed to support key/value pointer pairs at each node, and the dichotomy between leaf trie nodes and internal trie nodes is replaced by one between those with key pointers and those without. Call those nodes with key pointers "key nodes" and those without "non-key nodes."

• For any string A and any non-empty string B, A < AB lexicographically.

• For a key node x, succ(x) points to the key node of the lexicographically succeeding key; similarly, pred(x) points to the key node of the lexicographically preceding key.

• Wherever there is a reference to the "leftmost leaf," we now refer to the "bottom" key node, i.e., the one for which there is no predecessor.⁵

⁵The astute reader will note that this must be some node along the left spine of the trie.

• The blind search portion of PT.search(P) will not actually reach a node x for which P[len(x)] does not match any of the branching characters of x if there is some leaf y such that S(y) is a prefix of P; in this case, we choose T = y.

• In the smart search portion of PT.search(P), we have two new cases:

  • if P is a prefix of some string in π.PT, then the hit node h is the first node x along the path from the root to T such that len(x) ≥ |P|. The correct lexicographic position of P is immediately before S(x) for x the highest key node along the left spine of the subtrie rooted at h.

  • if there is some leaf y in π.PT such that S(y) is a prefix of P, then the hit node h = y. (Note that this case is a true exception: this is the only case in which the hit node h does not satisfy len(h) ≥ |P|.)

• PT.insert(PT, P, K, V) has two new cases:

  • if P is the prefix of some string in PT and len(h) > |P| for the hit node h, we need to insert the new node between h and h's parent

  • if h is a leaf and S(h) is a prefix of P, then we insert the new node along branch P[len(h)].

• PT.remove(PT, n) must be changed in the following ways:

  • If n is an interior node, we first check to see if it is a key node; if so, we remove the key/value pointer pair and then remove n if it is redundant (i.e., has only one child).

  • If n is a leaf and we remove it, we do not remove n's parent if it is a key node.

• PT.split() first chooses some key node n, and splits PT into two pieces: PTL contains copies of all nodes on the path root→n plus copies of those nodes to the left of the path; PTR contains copies of all nodes on the path root→n plus copies of those nodes to the right of the path and a copy of the subtrie rooted at n. Additionally, since key nodes can be interior nodes, we need to remove the key/value pointer pairs from the copy of the path root→n in PTR.
Finally, we remove the redundant nodes (non-key nodes with only one child) along the right spine of PTL and along the left spine of PTR.

• PT.concatenate(PTL, PTR) is essentially the same operation as PT-concat, except that only non-key nodes can be considered redundant during clean-up.

• No PT.diff or PT.patch-insert. Due to time considerations and the difficulty of adapting these operations to the lack of a termination character, these operations were not implemented in time for the analysis.

Asynchronous operation. When the programmer wishes to perform a Patricia trie search, insert, or remove operation, he must provide a callback that performs the string comparison between the search string P and the blind search string T; this callback then continues the trie search procedure with the location of the difference between the strings (i), the strings' characters at that point (P[i] and T[i]), and whether P is a prefix of T or lexicographically before T, T is a prefix of P or lexicographically before P, or P is equal to T. This setup is designed to allow the comparison function to make AIO calls that may cause the request to restart. Since this information is sufficient to perform the search, insert, and remove operations, the complexity of the string comparison is left up to the owner. For example, in a simpler block database implementation without the ability to load partial-elements, all of T can be loaded to perform the comparison.

Parameterized character type. The Patricia trie is implemented as a template class parameterized over the character type (typically char or unsigned char), the string location descriptor type (an instance of which is passed to the comparison function to describe the location of the requested string), and a data type an instance of which is associated with each complete key. The use of templates here allows one to use more complex character types (say wchar for Unicode) without modifying any of the Patricia trie implementation.
Node structure. The structure of a Patricia trie node is what one would expect:

    struct PTrie<C,K,V>::Node {
      // Pointers to other nodes, where applicable (0 otherwise)
      Node *parent;
      Node *pred;
      Node *succ;

      // Character along branch from parent to this node
      C branch_char;

      // Branches to child trie nodes
      typedef Map<C, Node*, less<C> > branch_map;
      branch_map branches;

      // Pointer to key/value for key nodes
      ptr<K> key;
      ptr<V> value;
    };

The precise container used for the branch map has only one restriction: that iteration through it be done from least character value to greatest character value; therefore, hash-based maps are not applicable, while all balanced, ordered tree-based maps are.

3.2.3 String B-Tree

The String B-Tree is an implementation of the simpler of the two algorithms in [5], which solves only the prefix search problem, not the substring search problem. It is implemented atop the block database, with each operation type (allocate, insert, remove, find, and iterate) a request and each node type (superblock, index node, and data node) an element. The only functionality discussed in the design section not yet implemented at the time of this writing is sequential access.

Superblock. The superblock is the first block in the database. Currently, it contains only the location of the root index node.

Index nodes. An index node is simply the binary representation of a Patricia trie, along with pointers to preceding and succeeding index nodes on the same level of the tree. As stated above, internally a Patricia trie is represented as a complex data structure making use of splay tree-based maps; thus, when an index node is loaded, its binary representation is converted to this internal representation, and back again when it is saved. The string comparison callback passed to the Patricia trie operations uses the partial-element loading ability of the block database, loading one block at a time to perform string comparisons on traversals.

Data nodes.
A data node is a series of blocks containing a windowed key (a 32-bit unsigned integer i in network byte order followed by i key bytes) followed by the associated value. The data nodes are placed in no particular order in the database: there is no locality of data nodes based on key value.

Operations. The interface of the String B-Tree is slightly different from that of other databases:

    class SBTree : protected BDB {
      struct valueloc_t {
        blockd start;
        blockct count;
      };

      typedef callback<void, ptr<const Cbuf> >::ref found_cb;
      typedef callback<void, valueloc_t, ptr<Cbuf> >::ref postalloc_cb;

      // Initializes the database and executes a user-defined "main"
      // function once the database is ready
      void start (generic_cb main);

      // Called before insert to allocate space for some new data
      void alloc (const_key_t key, size_t freespace,
                  postalloc_cb post_request);

      // Inserts a previously allocated key/value into the database
      void insert (const_key_t key, valueloc_t insertloc,
                   value_t value, generic_cb post_request);

      // Removes a key and its associated value from the database
      void remove (const_key_t key, generic_cb post_request);

      // Searches for a key and executes the appropriate callback
      void find (const_key_t key, found_cb found,
                 generic_cb notfound, generic_cb post_request);

      // Signals the String B-Tree that it should shut down once all
      // requests have completed
      void shutdown (generic_cb post_terminate);
    };

The primary difference between this setup and those of other databases is that the data node allocation and insertion operations are logically separate: this is necessary since the user cannot know a priori how much space is required for the value and key, which are both stored together. Therefore, upon allocation, the user specifies a minimum amount of free space he desires in the data node, but the alloc function may provide much more: this will be evident from the size of the Cbuf passed to the post_request callback, all of which may be filled as desired.
Once the user has finished modifying the value buffer, he passes it to insert, which performs the actual insertion into the tree. It is important that an alloc actually be followed by an insert, or the allocated blocks will be lost.

3.2.4 Support data structures

Several data structures support the operation of the above components. These are described here briefly.⁶

⁶The C++ standard template library containers were not used primarily because they cause significant global namespace pollution on some primitive C++ compilers (including any version of gcc before 2.9) that interferes with the compilation of the AIO library.

Splay tree. Based on the algorithm in Sleator and Tarjan's original paper,[8] the class SplayTree is a container with parameters for the key, value, and an asymmetric and transitive key comparison relation class. SplayTree provides the expected basic functionality: insertion (both exclusive values for a key and multiple values associated with the same key), removal, search, and sequential iteration. Two containers are derived from the SplayTree: Map, which enforces a unique value for each key; and Set, which has no value for a particular key. The key and value can each be any scalar or aggregate type, but the comparison relation class is worth noting. It is a template class on some type T with only one method, the application method operator()(const T &a, const T &b), returning true iff a < b.

Hash table. A typical hash table with adaptive expansion, HashTable is a template class on key and value parameters that, on n buckets, hashes keys modulo n according to the unsigned result of a call to hash_primary(key). A future improvement will be to parameterize the hash function as was done for the comparison relation in SplayTree.
4 Evaluation

To demonstrate the performance of this database against a well-known baseline, in some tests I have chosen to compare it to the freely-available Sleepycat implementation of the standard Berkeley database in B-Tree mode. The version of Sleepycat used for the tests was 3.1.14. There were two machine configurations used:

• Small. Intel Celeron 233 with a 6GB 5400 RPM ATAPI hard drive and 64 MB of 60ns SIMMs running Linux 2.2.14

• Large. AMD K6-2 450 with a 10GB 5400 RPM ATAPI hard drive and 256 MB of PC-100 DIMMs running Linux 2.2.16

Wallclock time in the below results was timed from the beginning of the test (generally, opening the database) until all of the test data was written to disk using the sync operations provided by the database and operating system. In particular, the measured time does not include that spent scanning the Sleepycat database to produce statistics. In each test comparing the String B-Tree to Sleepycat, both databases use the same specified blocksize. Finally, to ensure consistent results from back-to-back tests, flushb (a Linux ext2 filesystem command to flush the read buffers from the VFS buffer cache) was called immediately before a test began.

4.1 Benefit from asynchronous I/O

One of our primary goals for this project was to determine the benefit derived from asynchronous I/O. In this test on the String B-Tree only, a database with a blocksize of 8192 bytes and already populated with 30,000 16000-byte keys was hit with 100,000 random accesses (in the proportion of 5% inserts, 5% removes, and 90% searches). Figure 6 graphs the number of concurrent requests versus the wallclock time required for the random accesses to complete on the large machine. When there is only one request, all I/O operations are essentially serialized: only one element fetch at a time can be initiated by a request.
When multiple requests run concurrently,

• the AIO subsystem has greater choice in scheduling the I/O operations;

• requests desiring data already in the buffer cache do not have to wait for a costly disk access to complete, reducing latency; and

• requests not blocked on I/O and which consume non-trivial CPU time can run while disk I/O is occurring, essentially eliminating disk idle time.

This is clearly evident from figure 6, in which the total running time drops by 23% from the single-request scenario to the one in which there is an optimal number of concurrent requests (8-9). Concurrency beyond this point appears to add only overhead and not additional performance, but of course, this most likely depends on the application.

Figure 6: This graph shows the running time of 100,000 random accesses on a database of 16000-byte keys. The benefit of concurrent requests due to the AIO library is substantial.

4.2 Large keys

Another of our goals for this project was to demonstrate that this algorithm would perform comparatively well on key distributions composed of keys large relative to the blocksize.

4.2.1 Varying keysize

For each of several keysizes between 500 and 10,000 bytes, we populate a database of 1024-byte blocks⁷ with 30,000 keys and then perform 10,000 searches for random keys in the database. The String B-Tree had eight concurrent requests at any one time. Tests were performed on the small machine, to minimize the impact of the buffer cache.

Figure 7 graphs the keysize versus the running time of the insert operations, while figure 8 does the same for the search operations. In both tests, the String B-Tree maintains a sizeable advantage over Sleepycat over the entire range of tested keysizes, despite the absence of PT-diff and PT-patch-insert.
This can likely be attributed to Sleepycat's frequent access of key overflow blocks in the search test (which is a series of searches for keys we know are in the database) and to fewer index nodes in the String B-Tree, leading both to more effective use of the buffer cache and to fewer levels in the tree. While the String B-Tree also needs to load a complete key when the search key is in the database, it needs to do this only once: Sleepycat may need to do this several times on a downward traversal, a problem that is exacerbated by the low branching factor on trees with large keys.

⁷1024-byte blocks were used due to limitations on the test machine: smaller blocks mean smaller keys are large relative to the block size, allowing us to fit more keys into a database of the same size.

Figure 7: The String B-Tree is faster than Sleepycat at creating large databases, at any keysize.

Figure 8: Searching for large keys known to be in the database again favors the String B-Tree.

4.2.2 Varying population

For database sizes ranging from 500 to 50,000 4000-byte keys, we wished to determine the relative performance of the String B-Tree and Sleepycat. There were two stages to this test: first, the total running time of the inserts was clocked; and second, the running time for 100,000 random accesses was clocked. The databases each were configured to use 1024-byte blocks, and the tests were run on the small machine to minimize the impact of the buffer cache.

Figure 9 graphs the database size versus the wallclock time for the insertions to complete. Sleepycat has performance similar to the String B-Tree at first, but slows down considerably at around 10,000 keys and continues to fare poorly thereafter.
This is likely due to the overhead of writing multiple key overflow blocks, and to the operating system's inability to cache the large number of index nodes resulting from a low branching factor: the String B-Tree has fewer index nodes (which presumably are all cached) and must therefore load at most one block per level.

Figure 10 graphs the database size versus the wallclock time for the searches to complete. As for inserts, the performance of Sleepycat is better than or similar to the String B-Tree until about 10,000 keys, when it becomes markedly worse. This is probably due to accessing key overflow blocks on finds and removes, and to the large number of index nodes, as above.

Figure 9: For the insertion of large numbers of relatively large keys, the String B-Tree has much better asymptotic performance than Sleepycat.

Figure 10: Despite the hiccup (presumably) caused by the buffer cache, the String B-Tree search on relatively long keys is significantly faster than in Sleepycat.

4.3 Discussion

The preceding tests, though sufficient to demonstrate that the String B-Tree satisfies the desired properties, are not the whole story: although the String B-Tree succeeds admirably in certain cases, it may not yet be truly "generic." Some implementation issues and a single primary deficiency of the String B-Tree design related to this application add up to the need for some improvements.

4.3.1 Small databases

Although the String B-Tree seems to perform well on databases with large numbers of large keys, it fares worse on smaller databases.
A test on the large machine involving 100,000 random accesses on an already-existing database of 5,000 keys was performed to gauge the performance of the String B-Tree (at eight concurrent requests) versus Sleepycat on small databases. The blocksize for each database was set at 4096 bytes, and the test varied the keysize. Figure 11 shows the running time versus the keysize.

Figure 11: This graph shows the running time of 100,000 random accesses on a small database (5000 keys). Tests run on a small database favor Sleepycat at any keysize.

The results here indicate that when the database has few keys, Sleepycat outperforms the String B-Tree at any particular keysize. Profiling suggests that much time is wasted in the support data structures: in the splay trees used to store Patricia trie node branches, in the conversion between the internal Patricia trie representation and the binary representation, and in keeping reference counts in many data structures that change frequently. The String B-Tree's AIO advantage may not manifest itself on such small databases, which makes CPU usage a much greater portion of the overall running time. This suggests that reducing CPU usage would make a database which is already very I/O efficient even faster.

4.3.2 Block loads

A test similar to the one on varying keysize in section 4.2 was performed, except that some searches were performed for keys not in the database: the impact of this on the String B-Tree search is that at most one block needs to be loaded for each level of the tree, regardless of how long the keys are. We performed 10,000 random accesses on an existing database of 30,000 keys of a fixed size. As in the other tests, we used eight concurrent requests.
To provide a "worst-case" scenario for this test, the blocksize was set at 512 bytes, forcing a low branching factor and, consequently, adding many levels to the tree relative to a larger blocksize. The results are graphed in figure 12. Although its wallclock performance is not indicative of key-independence, the constancy in the number of block accesses versus varying keysize suggests this database provides optimal performance for this particular input parameter. As suggested above, other factors, including but not limited to high CPU usage, may account for the difference. In this case, the increasing size of the database and the resulting greater latency in accessing a random block may account for the increase.

4.3.3 Small keys

The one main design deficiency related to this analysis of the String B-Tree is its failure to deal adequately with keys that are small relative to the blocksize. The problem is that on distributions of very small keys, two entire blocks must be loaded at every level of the tree: one for the index node itself, one for the key used in smart search.

Figure 12: Despite the non-constant time to perform random accesses on databases of varying keysize, the number of block loads was essentially constant, in accordance with the theoretical foundations of the String B-Tree. This suggests other factors impact the running time.

Assume the branching factor of a String B-Tree is B. For a tree in which only one block needs to be loaded at each level for distributions with small keys (such as Sleepycat, which stores small keys entirely within the index nodes), the equivalent branching factor b is given by

    log_b m = 2 log_B m
    => log_2 m / log_2 b = 2 log_2 m / log_2 B
    => log_2 b = (1/2) log_2 B
    => b = sqrt(B),

which is significantly smaller, as we would expect in a prefix-compressed tree.
Thus, the large branching factor of the String B-Tree doesn't help in databases with small keys, since other databases can have twice as many levels of index nodes and still require fewer disk accesses. The real win with the String B-Tree is for databases with large keys, for which designs like Sleepycat's require key overflow blocks.

A possible solution to this problem is to design a hybrid tree: one in which individual index nodes can each take on a form best suited to their data. Small keys would be governed by a prefix-compressed index node, while large keys that would otherwise end up in overflow blocks could be governed by Patricia trie-based index nodes.

Figure 13: The performance of file system emulation was very similar on both databases, with Sleepycat having a slight edge on this dataset.

Figure 14: Due to the contiguous placement of file blocks in the String B-Tree, much more space was wasted than in Sleepycat, in which files are automatically broken up into smaller pieces.

Another solution, which is easier to implement, would be to traverse the entire tree assuming the key exists. When the key is actually in the tree, no mistakes would be made, so we would naturally end up at the correct node; however, if the key isn't in the tree, we could find this out by loading the associated key blocks one at a time in succession, backtracking to the point in the traversal at which we made a mistake, and continuing from there in a second search for the insertion point using the standard algorithm. Denote by |V| the length of the data node, which includes both the key and its associated value.
Using this strategy we would load the following numbers of blocks in each case:

                   not in tree                  in tree
    1st search     log_B m                      log_B m
    comparison     |P|/B + log_B m              |P|/B + log_B m
    2nd search     log_B m                      -
    value fetch    -                            |V|/B - (|P|/B + log_B m)

In the case that the key is not in the tree, we wind up loading a possible |P|/B + 3 log_B m blocks, worse than the |P|/B + 2 log_B m of the standard algorithm; however, in the case that the key is in the tree, we load only log_B m + |V|/B, the minimum number required to traverse the height of the tree and load the entire data node. Thus, when most of our searches are for keys in the database, it makes sense to use this strategy. Furthermore, when we are concerned only with the case in which a key exists, i.e., when we don't need to know the insertion position of a key not in the database, we perform no disk accesses beyond those of the standard algorithm in either case.

4.3.4 File system emulation

Some of the files in the /usr tree on a Debian 2.2 box were added to a database, and then random accesses (in the same percentages as in the uniform key tests) were performed. To account for the limited storage on the test machine, the size of a single file was capped at 300,000 bytes. A blocksize of 4096 bytes, a typical cluster size for a Linux machine, was used. As before, the String B-Tree had eight concurrent requests at any one time. Figure 13 graphs random access time versus the number of files (chosen randomly from /usr) in the database. Although Sleepycat jumps out to an early lead, the String B-Tree performance seems to level off after about 15,000 files. This is likely due to the small-key inefficiency discussed above, and to the inability to fragment large values discussed below. Figure 14 graphs database size versus the number of files in the database.
Sleepycat automatically breaks up large values into smaller pieces, and can scatter them throughout the database; this String B-Tree implementation has no such mechanism, and therefore wastes more space as a result of the binary buddy block allocation algorithm: a key/value combination of 2^16 + 1 bytes would actually take up 2^17 bytes, nearly double the space actually needed. (This extra data also needs to be loaded at the end of a search, further burdening an already busy disk.)

5 Future work

In addition to the aforementioned performance improvement suggestions, there are several possibilities for future implementation work:

• Reduce complexity of underlying data structures. The inability of the String B-Tree to perform acceptably relative to Sleepycat on small databases and small keysizes is not necessarily due to the String B-Tree algorithm itself, but may be caused by relatively high CPU usage. Simplifying the support data structures and the complexity of the Patricia trie implementation has the potential to remove this roadblock. Reducing the time spent saving a copy of each modified element would provide additional performance gain, since each element must be copied in its entirety in case some request is backed out. A log-structured approach to modifying elements may provide the desired time relief, with the added benefit of crash recovery.

• Implement PT-diff and PT-patch-insert. Insertion and removal times are abysmal when small blocksizes are combined with large keysizes, because the String B-Tree insert and remove operations may require one key load in the case of a parent key replacement, or two full key loads per level in the case of a split. Adapting these procedures to the lack of a termination character and implementing them properly should greatly improve the performance of large-key inserts and removes.

• Improve the element cache.
Abominable performance caused by the rampant conversion between the binary and internal representations of an element motivated the rapid development of an internal element cache. It is possible that the parameters of this cache can be tuned to provide better performance; even more likely is that a better element cache altogether would dramatically reduce the number of binary/internal conversions. The average element cache hit rate was only about 80% on all tests performed.

• Use a more intelligent transaction system. Right now, requests are limited to "all-or-nothing" atomic operation: either complete the request without blocking, or restore everything and start over. Fine-grained locking, and the correspondingly fewer restarts, may lower CPU usage, despite the added complexity of deadlock detection.

• Allow for a variable maximum branching factor. Currently, the branching factor of an index node is fixed at the pessimal number of strings per Patricia trie fitting in a single block in binary form; this situation occurs only when the Patricia trie is a complete binary tree with an extra node above the root. This restriction is caused by the dichotomy between the insert and remove code: under a system with a variable split threshold, a String B-Tree remove operation can cause some nodes to increase in size. Currently, there is no way to split a node in the remove operation. Joining the insert and remove code would trivially allow for variability in the split threshold.

• Provide an automatic value-fragmenting capability. This will improve space efficiency in databases with large values.

Acknowledgements

I wish to acknowledge the assistance and thoughtfulness of my advisers Frans Kaashoek and David Mazieres, both in giving me the opportunity to participate in a large software project and in providing guidance along the way. I also wish to thank David Karger for the timely 6.854 project that resulted in the study of the String B-Tree algorithm.

References

[1] R. Bayer and K.
Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1), March 1977. Also published as IBM Yorktown T.R. RJ1796, June 1976.

[2] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2), June 1979.

[3] George Diehr and Bruce Faaland. Optimal pagination of B-trees with variable-length items. Communications of the ACM, 27(3):241-247, March 1984.

[4] Paul R. Wilson et al. Dynamic storage allocation: A survey and critical review. Proceedings of the IWMM, September 1995.

[5] Paolo Ferragina and Roberto Grossi. The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236-280, March 1999.

[6] David Mazieres, Michael Kaminsky, M. Frans Kaashoek, and Emmett Witchel. Separating key management from file system security. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), Kiawah Island, South Carolina, December 1999.

[7] D.R. Morrison. PATRICIA: Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15, October 1968.

[8] Daniel Sleator and Robert Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3):652-686, July 1985.